This article originally appeared in the September 1999 issue of GammOnLine.
Thank you to Kit Woolsey for his kind permission to reproduce it here.

Rankings and Ratings

By Chuck Bower
"Who's the best backgammon player in the world? How does my game compare with the best players, other players on my internet server, players in my local club?" These are questions backgammon players often ask. Answering these inquiries for backgammon is even more difficult than for other intellectual games like chess and contract bridge. The reason is not surprising, since the answer is the same for many backgammon questions: "the dice!" Keeping this in mind, there are several techniques to rank players. This article is an attempt to summarize those methods.

Before I begin, let me emphasize some oft-overlooked points. Any measurement (for example, the temperature of the outside air, the distance between two cities, a survey of voters for a presidential election) has associated with it some hidden factors: assumptions and mathematical uncertainties. If the assumptions on which a measurement is made are invalid, so is the result. Also, measurements have associated statistical uncertainties (if there are random processes which affect the result, and there almost always are) and systematic uncertainties (for example, if a thermometer's calibration is off and it always reads 2 degrees high). Whenever you read about a measurement result and conclusion, you should ask "what were the assumptions, and what are the uncertainties?" A careful researcher will have included these for you. Unfortunately, all too often these details are lost along the way. I will try to list (and where possible, quantify) assumptions and uncertainties throughout this article.

A clarification of jargon is in order: a ranking is a sequential ordering which indicates relative strength, but does not necessarily quantify the results. It says the 35th-ranked player has not performed as well as the 34 players ahead, but better than all those ranked lower; it does not say how much better or worse the player is. A rating system actually indicates how much better player A (with rating a) is than player B (with lower rating b).

I. Kaufman Ratings System

This method is a modification of the successful ratings system used in chess. To my knowledge, the version in common use on today's online servers is Larry Kaufman's adaptation of the chess ratings system. I refer the interested reader to an article by Kaufman in Inside Backgammon, vol. 1, #5 (Sept-Oct. 1991) and to a web page by Kevin Bastian (http://www.bkgm.com/articles/McCool/ratings.html) for more detail. I now list advantages and disadvantages of this system, and assign a letter grade (A through C, from best to worst) to the qualities of the various systems.

  1. Objectivity (A). The rules are the same for everyone, and they are clearly spelled out in quantitative terms.

  2. Universality (A). The method is available to everyone with access to a computer and an internet connection. Of course the online servers (e.g. FIBS, GamesGrid, and Netgammon) play a big part in this. Keep in mind that the system is closed: the ratings only reflect the relative strengths of the participants. For example, if two "universes" exist with no common cross-links, then a player rated 2100 in universe #1 cannot be compared to a player of similar rating in universe #2.

  3. Reliability (B). The system has some weaknesses, particularly in its match length parameterization (which is where it differs most strongly from chess). However, the Kaufman rating system appears to cross-correlate rather well with other ranking methods. As far as I know, (and with the exception of some minor tweaking) no one has come up with a more reliable system which has only match results as its input.

  4. Integrity (B). There are ways to create unnatural results. The four most common are:

    1. Throwing matches to co-conspirators for the purpose of artificial ratings increase;
    2. Collaboration (for example, asking a top robot what play/cube decision to make and then choosing that path);
    3. Angling. Most common is to seek out new players whose ratings are artificially high because they start at the standard initial rating and haven't yet had time to equilibrate down to their true rating;
    4. Sandbagging, intentionally making inferior plays/cube decisions in order to obtain a lower rating than one's true ability.

  5. Sensitivity to Randomness (B). The Kaufman system requires a rather long time for a player to reach equilibrium (true rating) from a standard starting point. Something between 500 and 1000 "experience points" (sum of match lengths of all matches played) seems to be required. After reaching equilibrium, swings in ratings (due to dice) of 50-100 points in either direction seem to be typical.

  6. Other qualities (A). One characteristic the Kaufman system possesses that none of its competitors has is the ability to predict the outcome of (by assigning odds to) a match of specified length between two players with known (reliable) ratings. Thus, by my definitions above, the Kaufman system is not only a ranking system but a rating system as well. (A small sketch of the prediction and update arithmetic follows this list.)
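
For readers who like to see the arithmetic, here is a minimal sketch of the rating formula as it is commonly described for FIBS-style servers. The constants (the 2000 divisor, the factor of 4) and the omission of the experience-based ramp applied to new players are my reading of the published descriptions, not an official server implementation.

    import math

    def win_probability(rating_a, rating_b, match_length):
        # Chance that player A beats player B in a match of the given length,
        # per the commonly cited FIBS-style formula (assumed, not official).
        diff = rating_a - rating_b
        return 1.0 / (1.0 + 10.0 ** (-diff * math.sqrt(match_length) / 2000.0))

    def rating_change(winner_rating, loser_rating, match_length):
        # Points the winner gains and the loser loses for one match.  The
        # multiplier applied to players with low experience is omitted here.
        p = win_probability(winner_rating, loser_rating, match_length)
        return 4.0 * math.sqrt(match_length) * (1.0 - p)

    # Example: a 1750 player beats a 1650 player in a 7-point match.
    # win_probability(1750, 1650, 7) is about 0.58, so the winner gains
    # roughly 4 * sqrt(7) * 0.42, or about 4.5 points, and the loser drops
    # the same amount.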

The four biggest Kaufman Ratings databases are the three above-mentioned internet servers and Kent Goulding's International Backgammon Rating List, which has unfortunately been mothballed since 1996. As most readers of this magazine have access to at least one of the online databases, I'll not cover them here. These systems have as their predecessor KG's list, which was compiled from medium-to-large (non-weekly) tournament results conducted around the world over several years. The last compilation occurred in July 1996. Only players active in the previous 3 years were included, and to be listed in the top 100 a player had to have 1000 or more lifetime experience points. The top ten players from that final listing were:

Rank       Name          Experience    Rating     Match win percentage

 1.  Edward O'Laughlin      7741        1856              62
 2.  Billy Horan            3501        1831              63
 3.  Harry Zilli            4307        1790              61
 4.  Matthias Pauen         1727        1786              63
 5.  Mika Inkinen           1475        1783              71
 6.  David Wells            1276        1774              66
 7.  Mike Svobodny          4553        1772              55
 8.  Ray Glaeser            4151        1770              57
 9.  Hugh Sconyers          1804        1768              57
10.  Evert Van Eijck        1815        1760              63

Most of these names are easily recognized by those who play in or read up on the results of big tournaments. We will come back to this list later when the Giant 32 system is discussed. Those familiar with the online server ratings will notice that the top KG rating is 200-300 points lower than the top online ratings. This is likely due to the tighter spread of skill levels in face-to-face tournaments.

Note that local clubs can (and some do) keep similar ratings systems, so size isn't an impediment to setting up a Kaufman system. Without the aid of automated scoring, however, the amount of work required is significant.

II. Earnings System

This type of ranking/scoring system awards points for high finishes in tournaments. These are quite common in local clubs. Some grades:

  1. Objectivity (A). Awards are predetermined based on finish and number of entrants. In events with divisions based on skill, higher skill events usually award higher points.

  2. Universality (B/C). Scoring is directly related to attendance. Although rewarding perseverance is fine, attendance often becomes the overriding factor: those with more free time (and/or travel money) have more opportunities to score. Again, there is nothing wrong with this as a reward system, but it is less likely than the other systems to rank players according to actual skill.

  3. Reliability (A). Accomplishments lead to rewards. Simple as that!

  4. Integrity (A). Pretty hard to cheat here, short of being a dice mechanic.

  5. Sensitivity to randomness (B). A run of luck can go a long way, especially in competitions where there are not a lot of events.

One Earnings system worthy of note is Bill Davis's American Backgammon Tour (ABT). Now in its 4th year and growing, the competition is entered by attending regional weekend events around the US. Bill also keeps a lifetime list. You can view the 1999 standings as well as the lifetime rankings at the Chicago Point WWW page (http://www.chicagopoint.com/abt.html).

III. Surveys

This is similar to an election. Ballots are made available (sometimes to a restricted set of people) and players are ranked individually by each voter. Points are assigned based on those rankings, and a final ranking is tallied.

  1. Objectivity (C). Not surprising since this type of contest is almost the definition of 'subjectivity'.

  2. Universality (B/C). This depends a lot on the distribution of ballots. People tend to vote on players they know, and that can lead to regional bias. A really strong player who isn't widely known will slip through this system completely.

  3. Reliability (B). Again, quite subject to the distribution of ballots. Also "old names" are likely to get overly high rankings. Just because someone used to be a top player doesn't mean s/he still is.

  4. Integrity (A). As long as they don't allow campaigning... :)

  5. Sensitivity to randomness (A). This may be the survey's strongest characteristic. You don't make it to the top without a lot of success.

A well known survey is the Giant 32 of Backgammon conducted by Yamin Yamin with behind-the-scenes help from Carol Joy Cole, John Stryker, Jake Jacobs, and Howard Ring. Ballots contain 32 slots, and each voter ranks his or her choices in order, no exceptions! This survey has been taken every other year, but apparently it is being discontinued as well. It has not gone unnoticed in the "Old World" that recent rankings have been highly biased towards US players. Again, this appears directly related to the distribution of ballots (or, perhaps more fairly stated, to the return of ballots). The 1997 Giant 32 Top 10 (from Flint Area BackgammoNews #217, Mar/Apr 1998):

Rank       Player         Points   Number of ballots-  KG Rating list
                                    1st place votes    ranking (1996)
                                  (out of 57 possible)

 1. Wilcox Snellings       1548          56-12              12
 2. Mike Senkiewicz        1423          54-10              15
 3. Nack Ballard           1211          50-7               41
 4. Mike Svobodny          1170          53-0                7
 5. Paul Magriel           1115          52-2             >100
 6. Neil Kazaross          1082          49-2               16
 7. Billy Horan            1077          49-5                2
 8. Kit Woolsey            1048          51-1               27
 9. Jerry Grandell          864          40-7               72
10. Bill Robertie           840          47-0               97

Grandell was the only player among the top 14 from outside the US. In defense of the promoters, I point out that besides tournament success, voters were asked to consider money-play prowess and intangibles such as the fear instilled in opponents.

IV. Play-by-Play Judging

This method is the newest of the systems I cover; it has only been available for a little over a year. The commercially available backgammon software Snowie, by Oasya, is not only a world-class player in its own right but will also analyze an entire match and quantitatively grade every checker play and cube decision (including missed cube decisions).

One big advantage of this system is that it quantifies luck, and removes it from the results. A player is judged solely on his/her performance with the actual dice rolls. If you play a poor roll well, that is much better than playing a joker with mediocrity.
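
For concreteness, here is one common way to quantify the luck of a single roll: the best-play equity reachable with the roll actually thrown, minus the roll-weighted average over all 21 distinct rolls. I am not claiming this is exactly Snowie's internal definition; the equity numbers are assumed to come from a bot's evaluation.

    def roll_luck(equity_after_best_play, actual_roll):
        # equity_after_best_play maps each distinct roll (d1, d2), d1 >= d2,
        # to the cubeless equity of the best play for that roll (assumed to
        # be supplied by a bot).  Doubles occur 1/36 of the time, non-doubles
        # 2/36, so the weighted average divides by 36.
        total = 0.0
        for d1 in range(1, 7):
            for d2 in range(1, d1 + 1):
                weight = 1.0 if d1 == d2 else 2.0
                total += weight * equity_after_best_play[(d1, d2)]
        expected = total / 36.0
        return equity_after_best_play[actual_roll] - expected

Summing this quantity over every roll of a match, and normalizing per roll, gives a luck rate in the same cubeless-equity units as the error rates quoted later.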

  1. Objectivity (A). Snowie doesn't know who you are, nor does it care. All it sees are the moves you make.

  2. Universality (A). If you are willing to shell out the going price for Snowie Professional Edition, you will have the tools to have your matches analyzed.

  3. Reliability (A/B). Snowie is prosecutor, judge, and jury. Within the limitation of Snowie's skill level, you get a meaningful assessment. Since Snowie is one of the best players in the world, most players will get a good evaluation. Exceptions include playing a known opponent where a technically inferior play was chosen because it was anticipated to be best against the given opponent. Snowie can't see this and only looks for the best technical play. The same weakness applies for cubes tailored to a particular opponent. Also, occasionally Snowie, even with a rollout, will find the wrong best play because of a systematic weakness in its playing style. All of these shortcomings are sufficiently infrequent and small enough in magnitude that it is unlikely they introduce much systematic uncertainty into the results.

  4. Integrity (A). Snowie doesn't take bribes.

  5. Sensitivity to Randomness (A). Here this system has no rival. It eliminates the luck associated with the rolls. There is still some random influence. (Maybe all the decisions were 'easy'. The game played itself.) Again, these are minor compared to the other systems.

Harald Johanni, editor of Backgammon Magazin, has recently published a rating list based upon Snowie evaluation of match play. This list has some inherent weaknesses. The main problem is its (non-)universality, which can be seen from the distribution of matches used in making up the list. Part of this is likely due to regionality—Johanni is European (as is most of his readership?) and he receives results primarily from European tournaments. Also, Johanni only accepts hand-recorded match transcripts (as opposed to online computer recordings). I'm not sure of the reason, but it is possibly due to an integrity question. (For example, I record only the matches I do well in and send those to Harald.) It is also possible that promoters and followers of European events are more diligent about recording big matches, so these understandably would get preferentially included in the system. In any case, this system is likely to become the standard for the future, and improvements are likely.

A Case Study: Jellyfish v3.0 Level-7 vs. Chuck Bower

I have limited data (but hopefully that is better than none at all) which allows a cross-comparison of some of these systems. Recently I played a series of 54 matches to 7 points against Jellyfish version 3.0 at its strongest playing level. There are some tantalizing hints about ratings/rankings systems and backgammon competition in general that can be speculated upon based on this sample. However, more study is required to beef up the statistics before strong conclusions can be drawn.

Before giving detailed results of that study, I give some figures on how these two players rate/rank in some of the above systems:

I. FIBS Ratings (Kaufman System)

Player                       Ranking      Rating     Experience

Chuck Bower (c_ray)            656       1743.90         741
jellyfish (from Dec 1997)        3       2037.68      29,548

For a 7-point match, the FIBS ratings formula predicts JF should have a winning probability of 71%.
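
Plugging the ratings in the table above into the win_probability sketch from the Kaufman section reproduces this figure (a back-of-the-envelope check, not an official server calculation):

    # Rating difference is 2037.68 - 1743.90 = 293.78; match length 7.
    p = win_probability(2037.68, 1743.90, 7)
    print(round(p, 3))   # prints 0.71, i.e. about 71% for Jellyfish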

II. Earnings System

(There is no comparable data on me and The Fish for this kind of system.)

III. Surveys

Neither of us made the 1997 Giant 64 (the 1997 Giant 32 balloting results continued through the top 64 players). Jellyfish played 200-game money sessions with each of Mike Senkiewicz and Nack Ballard in a well-documented event conducted by Malcolm Davis a couple of summers ago. JF broke exactly even over the 400 games. Senkiewicz ranked 2nd and Ballard 3rd in the 1997 Giant 32. My lifetime record against the 1997 Giant 64 is 10 wins, 11 losses.

IV. Play-by-Play Judging

The only data I have on Jellyfish play analyzed by Snowie is for the 54 matches of this case study. For myself against other opponents, I have analyzed only 8 matches from FIBS tournament play. For this small sample, my overall error rate is 0.00605 and my checker play error rate is 0.00753. The checker play error rate is the average error per non-forced move in cubeless equity units. The overall error rate is similar but includes doubling cube errors as well as checker play errors. The overall error rate would put me in 26th place in Johanni's ranking system for issue 1999/II of Backgammon Magazin. (I don't for a minute believe I am even close to 26th best in the world. This is just for comparison's sake and will be discussed later.)
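
The arithmetic behind these figures is straightforward. Here is an illustrative sketch with made-up numbers (the exact way Snowie counts decisions and normalizes is my assumption, not documented here):

    def error_rate(equity_losses, num_unforced_decisions):
        # Average cubeless-equity loss per unforced decision.  equity_losses
        # lists the equity given up on each mistake; decisions played
        # perfectly add nothing to the sum but still count in the total.
        return sum(equity_losses) / num_unforced_decisions

    # Hypothetical example: mistakes totalling 1.20 equity over 200 unforced
    # checker plays give a checker-play error rate of about 0.006, in the
    # same ballpark as the figures quoted above.
    print(error_rate([0.30, 0.25, 0.20, 0.15, 0.10, 0.10, 0.05, 0.05], 200))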

V. Head-to-Head Money Play

Over the past few years I have played Jellyfish extensively at money play. A summary of those sessions:

  Opponent        Total games played       Net ppg (points per game)

vs. JF1 level-7           1620              -0.13
vs. JF2 level-7            820              -0.40
vs. JF3 level-7            107              -0.26

VI. Results of 54 Matches

My record in the 54 7-point matches was 26 wins and 28 losses. According to Snowie analysis, my overall error rate was 0.00833 and my checker play error rate was 0.00941. Jellyfish played extremely well according to Snowie: its overall error rate was 0.00205, with a checker play error rate of 0.00215. Also, Snowie said I was the luckier player, at a rate of 0.00150. (Again, the units here are cubeless equity per dice roll.) We can compare these error rates to Johanni's rating list. Jellyfish would be rated number 1 in both overall error rate and checker play error rate, and its competition isn't close: the best entry (in the 1999/II issue) had a 0.00371 overall error rate and a 0.00449 checker play error rate. My performance would rank me 69th overall, and 70th (out of the 108 listed) in checker play.


There appear to be several inconsistencies in this study:

  1. The FIBS ratings would predict that I win fewer than 16 of the 54 matches, yet I won 26. This is roughly a three-standard-deviation result, meaning I should do this well only about 1 time in 1000. (A quick check of this arithmetic follows the list.)

  2. For the eight matches I played on FIBS (against sundry human opponents), Snowie said I performed much better than I did against JF, ranking me 26th "in the world" against FIBSTERS but only 69th when playing Jellyfish.

  3. At money play, my performance against Jellyfish was pretty dismal. This would have led one to conclude that my match-play performance would also be poor.

  4. My performance against the 'Giant 64' was decent. Jellyfish's was equally good (and probably a bit better since it was playing higher ranked competition). This admittedly meager data would predict a close competition, as actually occurred.

  5. Snowie's analysis of JF's and my performances appears to indicate I was outplayed by a significant amount. The luck rate only makes up for part of the discrepancy between Snowie's evaluation of play and the actual 26-28 win-loss outcome.
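
As promised in point 1, here is the back-of-the-envelope binomial check behind the "three standard deviation" claim (my own rough arithmetic, not part of any official rating calculation):

    import math

    n, p = 54, 0.29        # 54 matches, FIBS-predicted win chance of about 29%
    wins = 26
    mean = n * p                         # about 15.7 expected wins
    sd = math.sqrt(n * p * (1 - p))      # about 3.3 matches
    z = (wins - mean) / sd               # about 3.1 standard deviations
    print(round(mean, 1), round(sd, 1), round(z, 1))

A result 3.1 standard deviations above expectation has a one-tailed probability of roughly 0.001, which is where the "about 1 time in 1000" figure comes from.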

What is going on here? I have some ideas, but I'm not really sure. Here is some brainstorming:

  1. My FIBS rating is probably undervalued. It is way below both my performance vs. the 'Giant 64' and my match play record vs. JF. I suspect my limited experience (741 points) could have something to do with this.

  2. Relatively speaking, I'm stronger against JF at matchplay than at money play. This would explain my poor money play performance vs. JF and my stronger matchplay results.

  3. I take tournament matches (FIBS and 'Giant 64') more seriously than my JF matches.

  4. Snowie may rate Jellyfish anomalously high because it is a fellow robot! This is not a matter of prejudice; it could be that neural net training leads to similar views of "right and wrong" which are not fundamental. That is, Snowie may not be as good at recognizing Jellyfish's weaknesses because it has those same weaknesses itself.

  5. My performance (as determined by Snowie) is higher against weaker opponents (my FIBS matches) than against world class competition (Jellyfish). This explains the discrepancy in my 'world rankings' for the different opponents.

The last point is one of the more interesting speculations. Does a player perform better (in Snowie's view) against weaker opposition? One way to think of this is: Do strong opponents create tougher decisions?

Finally, there is one more tidbit which I have yet to reveal. Of the 54 matches, the luckier player won 52 and lost 2! Of my 26 wins, I was luckier than JF in all of the matches. For JF's 28 wins, it was luckier in 26 and I was luckier in 2. So Jellyfish's considerable skill advantage only netted two match wins. Is this evidence that backgammon is mostly luck? Something to think about!
