This article originally appeared in the September 2001 issue of GammOnLine.
Thank you to Kit Woolsey for his kind permission to reproduce it here.

Quantifying Backgammon Skill

By Chuck Bower

I. Recent History: Ten Years of Neural Net Backgammon

It's interesting to follow the evolution of respect (or lack thereof) that computer players have received over the past 10 years. In 1990, Expert Backgammon for the PC (EXBG) was the only commercially available game with any reasonable skill at all, and it was widely branded "intermediate". Then in 1991 Tesauro started a revolution with TD-Gammon, the first neural net player. Its performance was met with skepticism, but it was immediately seen that TD-G was a vast improvement, and likely an indication of future strengths of computer players (since tagged 'robots' or 'bots', for short).

Initially the opportunity to play against TD-Gammon was the privilege of a chosen few. Unfortunately TD-Gammon didn't become available to the public until later (1995), as a carrot from IBM to entice PC users to install the OS/2 Warp operating system. This public release was a play-only version, and although it would have been a very exciting opponent in 1991, by the time TD-G was publicly released, the first commercially available robot -- Jellyfish (JF) -- had already hit the scene in a big way (in 1994). JF's name was chosen because its brain is said to be of similar strength to that of the sea creature. (And I didn't realize that jellyfish could even play backgammon. :) With JF's wide distribution, the critics were quite numerous, if mixed in opinion. For example, Bill Robertie, who had been one of TD-G's biggest fans, took a mildly contrarian view of Jellyfish1.0 in the periodical Inside Backgammon vol. 5, #1 (Jan-Feb 1995): "Although we've been pointing out JellyFish errors ... let me reiterate ... that it does far more things right than it does wrong, and there's no doubt in my mind that it's the strongest commercially-available program right now. (Its closest competition was EXBG. --CB) But it's not [yet] as strong as TD-Gammon, and it's not a world-class opponent."

However, regardless of the negative criticisms levelled at the robots, many experienced players saw the value of using them for rollouts, even those performed by the weak bot EXBG. Meanwhile, several homegrown neural net backgammon players were appearing on the internet servers. Some of these were quite formidable opponents, LONER being among the best and probably at least on par with JF and TD-G. Among the general populace (particularly, online players), the robots' strengths were receiving wide acclaim. Among top-level, seasoned tournament players, however, there was still considerable skepticism. See, for example, Nack Ballard's 1997 comments in an annotation of a Backgammon-by-the-Bay club game between Ron Karr and Richard McIntosh.

Jellyfish's strength and reputation continued to grow with each release. Then in 1998, a second commercially available neural net player -- Snowie (name shortened from 'SnoWhite' to avoid potential legal battles with Disney) -- was marketed. Currently in its 3rd release, Snowie has become the accepted authority on the game. Despite its documented shortcomings, there has been a tendency for its evaluations and especially its rollouts to be taken as gospel. If Snowie says you are wrong, then you are wrong. Period. That view may seem extreme, but it is common among Snowie users. To the more objective comes the question "but how do we know ...?"

II. Skill Measurement Methods

A. Head-to-head Money Play

One logical reason for skepticism of a player's strength (robot or human) is the difficulty in objectively measuring that strength. Unlike tennis or chess, backgammon is permeated with randomness and it's very difficult to filter out the luck to find the skill that remains. Prior to the 1960's (when tournament play was popularized by Obolensky), the only measure of skill was head-to-head money play. Even here, many long sessions are likely to be required if statistically significant results are desired. The standard deviation of a money session is approximately 3*sqrt(N) where N is the number of games played.
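To see why "many long sessions" are needed, the 3*sqrt(N) rule can be turned into a rough estimate of session length. The following is only a sketch: the function name, the two-standard-deviation criterion, and the 0.1 point-per-game edge are illustrative assumptions, not figures from any actual contest.

```python
import math

SD_PER_GAME = 3.0  # from the rule of thumb: session s.d. is about 3*sqrt(N)


def games_needed(edge_ppg, n_sigmas=2.0, sd_per_game=SD_PER_GAME):
    """Games required before an average edge of `edge_ppg` points per game
    grows to `n_sigmas` standard deviations of the session result.

    Expected win after N games:  edge_ppg * N
    Session s.d. after N games:  sd_per_game * sqrt(N)
    Setting win = n_sigmas * s.d. gives N = (n_sigmas * sd_per_game / edge_ppg)**2.
    """
    return math.ceil((n_sigmas * sd_per_game / edge_ppg) ** 2)


# Hypothetical edge of 0.1 points per game (an assumed value for illustration).
print(games_needed(0.1))
```

Under these assumptions a player with a tenth-of-a-point edge per game needs thousands of games before the edge stands out from the dice. Halving the edge quadruples the requirement.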

B. Tournament Performance

With the popularization of tournament backgammon in the 60's and especially during the 70's, a new method of measurement appeared. Initially there was little quantitative data recorded, and a player's reputation was based as much upon opinion as fact. Paul Magriel was widely regarded as the best player of the 70's, based partly upon his excellent trophy collection, but likely also on his worldwide media exposure.

In the 80's, Kent Goulding (KG) and associates began logging tournament performance, not just wins and losses but also the strength of the opponent. Mimicking chess ratings, they instituted a (US) national tournament performance rating and ranking system. Unfortunately this doesn't help with robot skill measurement because non-humans have never been allowed to enter tournaments against humans.

C. Online Performance

With the advent of internet online play in 1989, yet another yardstick was available -- online performance ratings. This method is identical to the KG system used for tournaments and mentioned above. However, finally robots could be rated on an equal footing with human players. There was a minor systematic error which cropped up online that hadn't been a problem for KG -- dropping. This larcenous activity, fortunately the practice of a minority, consists of not finishing matches (and thus not having them affect one's ratings) which are virtually guaranteed to end in losses for the dropper.

Not only is the dropper's rating thus higher than his/her true skill level, but the droppee's rating is equivalently lower than his/her/its skill level. To add to the discrepancy, it has been speculated (by David Montgomery) that robots are more likely to be dropped upon than humans, and thus are likely to suffer larger ratings deficits than their human competitors. Though unproven (to my knowledge), this theory is reasonable, analogous to the fact that many people (e.g. shoplifters and tax cheats) are known to be more likely to commit crimes against nameless, faceless organizations than against individual humans. Also, a bot is probably less likely to complain to the online administrator about being dropped upon, or at least that is a potential rationale.

D. Third Party Judgement

If a competitor's play (for example, a match) is recorded, then an intelligent, objective third party could review that record and determine skill level. A big advantage here is that most of the luck is irrelevant, since the performance was only measured based upon how the given roll was played, regardless of whether that roll was actually beneficial to the player or not.

Better than simply "intelligent (and) objective" analysis is quantitative analysis. In the July-August 1995 issue of the Hoosier Backgammon Club newsletter, I made a comparison of Jellyfish evaluations, Jellyfish rollouts, TD-Gammon evaluations, TD-Gammon rollouts, and EXBG rollouts. (Expert BackGammon did not give quantitative evaluations. It merely reported its decision: play or cube.) This study was based upon 10 positions which had been written up in Inside Backgammon (May-June 1994), and that's where I got the TD-G results. My conclusion (which admittedly was not at a statistically significant level) was that, assuming TD-Gammon rollouts as the benchmark, Jellyfish1.0 level-6 (2-ply) evaluation was better than TD-Gammon evaluation for those 10 positions.

The most recent versions of Snowie have incorporated the best ever skill measurement tool -- full match analysis. Snowie will record a match that it has played, or import a match played on a server between any two opponents, a Jellyfish match, or even a hand recorded match (when transcribed into the proper form). It can then analyze the entire match, one play at a time, with either evaluations, rollouts, or some combination of these. A quantitative error figure is reported for each play, and those errors are tallied at the end to give a rating. Snowie will also keep a cumulative record of performance in any number of matches.


A. Head-to-Head Money Play: JF vs. World-Class Humans

In 1997 Malcolm Davis initiated a contest by inviting two of the world's best human players, Nack Ballard and Mike Senkiewicz, to Texas to play against Jellyfish3.0. The human players put up their own money and Harvey Huie backed Jellyfish. Ballard and Senkiewicz were not teamed up, so actually there were two independent tests. Dice were human rolled to remove any concern that JF's generated dice were less than random. Each contest consisted of 300 independent money games. Coincidentally, Jellyfish finished dead even, beating Senkiewicz by 58 points and losing an identical amount to Ballard. JF's creator, statistician Fredrik Dahl, was quick to point out that a 58 point win in a 300 game sample is insufficient to conclude superiority. Ballard's win and Senkiewicz's loss were only significant at around one standard deviation each -- not particularly meaningful. Taken together, clearly neither the human race nor the droids could even hint at having an edge.
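The "around one standard deviation" figure follows from the 3*sqrt(N) rule of thumb given in Section II.A. A short sketch of the arithmetic (the function name is illustrative only):

```python
import math

SD_PER_GAME = 3.0  # rule of thumb from Section II.A: session s.d. is about 3*sqrt(N)


def sigmas(net_points, n_games, sd_per_game=SD_PER_GAME):
    """Express a net money-session result in units of its standard deviation."""
    return net_points / (sd_per_game * math.sqrt(n_games))


# Each 300-game contest ended with a 58-point margin.
print(f"{sigmas(58, 300):.2f} standard deviations")

# Both contests combined: JF won 58 and lost 58, net zero over 600 games.
print(f"{sigmas(58 - 58, 600):.2f} standard deviations")
```

A margin of roughly one standard deviation is the kind of result that turns up routinely between evenly matched players, which is why neither contest says much on its own.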

B. Head-to-Head Matchplay: JF vs. SW

Shortly after the release of Snowie, Larry Strommen approached both Olivier Egger (Snowie's creator) and Fredrik Dahl to propose a common interface so that the two best robots could go at each other without human intervention. Hundreds of matches could be contested in a reasonable time period (about a week). Neither showed interest. I've since heard of matches with human intervention, but these have been understandably time consuming and the results are not statistically significant. (For example, if player A has a match win expectation of 55% against player B, it would take on the order of 400 matches to establish this at the 95% confidence level.)
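The "on the order of 400 matches" estimate can be reproduced with a normal approximation to the binomial distribution. This is a sketch under stated assumptions rather than the calculation actually used for the figure above: it assumes a two-sided 95% criterion (1.96 standard errors), a null hypothesis of evenly matched players, and it asks only that the expected margin equal the critical value.

```python
import math


def matches_needed(p_true, p_null=0.5, z=1.96):
    """Number of matches at which the expected win fraction `p_true` sits
    `z` standard errors away from `p_null` (normal approximation)."""
    sd_one_match = math.sqrt(p_null * (1.0 - p_null))  # 0.5 for even players
    return math.ceil((z * sd_one_match / (p_true - p_null)) ** 2)


print(matches_needed(0.55))  # a few hundred matches, on the order of 400
```

Because the requirement scales with the inverse square of the edge, a 52% favorite would need several thousand matches under the same assumptions.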

C. Online Performance: JF and SW

Both Snowie and Jellyfish have played thousands of matches on FIBS. One might think that their performance there could give a relative measure of their strengths. This is true to a point, but unfortunately the uncertainty (this time both statistical and systematic) is large enough that one cannot conclude which is the better player. The problems are twofold. First, there is a temporal (systematic) problem: Jellyfish had stopped playing on FIBS before Snowie started. It is well documented that server ratings vary with time, primarily due to the preferential retirement of weaker players. The second (statistical) uncertainty has to do with the fluctuations in ratings, due primarily to the dice. Although I'm not aware of the true numbers, I believe it has been shown that swings of +-100 points for even a highly experienced online player are not uncommon. Relative to all players on FIBS, both JF and SW were consistently in the top 5. All that we can say from the FIBS experience is that the two bots are close to equal.

D. Third Party Judgement: Comparing the Skills of JF and SW

Up to now, this article has been merely a review of past history. Recently I played a 19-point match against Snowie3.2, 3-ply (huge, 100%) and then let Snowie grind away for over 8 days doing a 2-ply rollout of EVERY checker play as well as all cube decisions, but only for Snowie's side. (I didn't have 16 days to commit my PC to get the rollout analysis for both sides!) The 'Huge' search space was used for both checker play and cube decision rollouts. 144 trials truncated at 10 were performed on several candidates for each of Snowie's plays. 360 UNtruncated trials measured the cube actions as well as failure-to-double errors. The entire match can be found here.

The end result was that Snowie rollouts judged that Snowie player committed errors at the average rate of 1.753 millipoints per move (mppm). (For comparison, if a player made only two errors in 100 moves, with one of magnitude 0.100 cube-adjusted equity and one of magnitude 0.075 cube-adjusted equity, the net result would be 1000*(0.100 + 0.075)/100 = 1.75 mppm.)
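The mppm figure is simply the total equity given up, divided by the number of moves and scaled by 1000. A minimal sketch reproducing the two-error example above (the function and argument names are illustrative, not part of Snowie):

```python
def error_rate_mppm(equity_errors, n_moves):
    """Average error in millipoints per move.

    equity_errors: cube-adjusted equity lost on each erroneous play
    n_moves:       total number of moves judged, including error-free ones
    """
    return 1000.0 * sum(equity_errors) / n_moves


# Two errors (0.100 and 0.075) in 100 moves, as in the example above.
rate = error_rate_mppm([0.100, 0.075], 100)
print(f"{rate:.2f} mppm")  # 1.75 mppm
```

Note that error-free moves still count in the denominator, so a long match with a few large blunders can show the same rate as a short match with many small ones.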

I then stepped through the entire match with Jellyfish as a kibitzer and asked JF3.0, level-7 (3-ply), time factor = 1000 to indicate how it would have made each play and handled the cube. The same Snowie rollouts measured JF's error rate for this match at 2.533 mppm.

In an attempt to be fair, I then took three 7-point matches I had played against JF3.0, level-7 last year whose total number of moves (702) was close to the Snowie 19-point match (681 moves for both sides). I had Snowie roll these matches out play-by-play at the same 2-ply settings used in the Snowie 19-point match analysis. Snowie rollouts said that Jellyfish's error rate was 1.164 mppm. I also had Snowie do a simple evaluation (huge, 100%) of these three matches and tallied up the errors that Snowie would effectively have made in the same situations. Here the Snowie rollouts reported that Snowie's (player) error rate was 0.672 mppm.

E. Statistical Analysis

As with any measurement, there are both statistical and systematic uncertainties associated with these quantities. To get an estimate, I looked at the standard deviations in JF3.0 level-7's error rate in a subset of 7-point matches played against me over the past year. I also computed a standard deviation of my own error rate. Note that these distributions aren't Gaussian. In particular, they aren't even symmetric, since there is a hard lower bound (zero) but no upper bound. I eliminated 11 outliers (matches with large error rates) from a sample of 81 matches to determine the standard deviations. I then multiplied by the square root of the mean number of plays in those remaining 70 matches (N=224.5) to come up with the standard deviation of the error rate per move. JF's standard deviation was 9.38 mppm and mine was 23.40 mppm.
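The procedure just described can be sketched in a few lines. Several details here are assumptions made for illustration: outliers are taken to be simply the matches with the largest error rates, the sample (n-1) standard deviation is used, and the numbers in the usage example are invented, not taken from the 81-match sample.

```python
import math
import statistics


def per_move_sd(match_rates, match_plays, n_outliers):
    """Estimate the per-move standard deviation of the error rate.

    match_rates: error rate (mppm) of each match
    match_plays: number of plays in each match, in the same order
    n_outliers:  how many of the highest-rate matches to discard
    """
    # Sort matches by error rate and drop the largest ones (assumed outlier rule).
    paired = sorted(zip(match_rates, match_plays))
    kept = paired[:len(paired) - n_outliers]
    rates = [rate for rate, _ in kept]
    plays = [n for _, n in kept]
    # Scale the match-to-match s.d. up by sqrt(mean plays per match).
    return statistics.stdev(rates) * math.sqrt(statistics.mean(plays))


# Invented data: four matches, one obvious outlier.
print(per_move_sd([1.0, 2.0, 3.0, 50.0], [200, 220, 240, 210], n_outliers=1))
```

Scaling by the square root of the mean match length treats every match as if it had the same number of plays, which is an approximation when match lengths vary widely.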

In summary, the statistical error on a SW rating of a series of plays (for example, a match) is just s.d.m./sqrt(N) where s.d.m. is the per move standard deviation on the error rate (9.38 for SW and 23.40 for a 'typical' human) and N is the number of plays for BOTH sides.

We can now assign statistical uncertainties to the earlier error rates for Snowie (player) and Jellyfish for the 19-point match: SW == 1.753 +- 0.705 (95% confidence) and JF == 2.533 +- 0.705. For the three 7-pointers the numbers are SW == 0.672 +- 0.694 and JF == 1.164 +- 0.694. For all 81 7-point matches (17,473 plays) between JF and me: JF == 1.502 +- 0.139 and CRB == 5.042 +- 0.347. (All uncertainties are at the 95% confidence level.)
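These uncertainties can be reproduced from the s.d.m./sqrt(N) formula. In the sketch below, the 1.96 multiplier for 95% confidence is an assumption: it is not stated explicitly in the text, but it recovers the quoted figures to three decimal places. The function name is illustrative.

```python
import math

Z_95 = 1.96  # assumed multiplier for a two-sided 95% confidence interval


def rate_uncertainty(sd_per_move, n_plays, z=Z_95):
    """95% statistical uncertainty (mppm) on an error rate measured over n_plays."""
    return z * sd_per_move / math.sqrt(n_plays)


ROBOT_SDM = 9.38   # per-move s.d. used for the robots (mppm)
HUMAN_SDM = 23.40  # per-move s.d. for a 'typical' human (mppm)

print(f"19-point match (681 plays):   +-{rate_uncertainty(ROBOT_SDM, 681):.3f}")
print(f"three 7-pointers (702 plays): +-{rate_uncertainty(ROBOT_SDM, 702):.3f}")
print(f"81 matches, JF (17473 plays): +-{rate_uncertainty(ROBOT_SDM, 17473):.3f}")
print(f"81 matches, CRB:              +-{rate_uncertainty(HUMAN_SDM, 17473):.3f}")
```

The uncertainty shrinks only as the square root of the number of plays, so cutting it in half requires four times as many analyzed moves.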

Systematic uncertainties are always difficult to quantify. Qualitatively, two sources of systematic error are the 'robot bias' (particularly the Snowie bias) and the method of analysis. Snowie bias comes about because Snowie rollout is not a perfect judge of skill. The rollout results contain systematic uncertainties based upon Snowie's less than perfect play. In addition, since the robots tend to play similarly, Snowie will likely give another robot (JF) a higher rating than it deserves. I don't really have much of a feel for the magnitude of this effect, but would crudely guess it's around 0.5 mppm for a typical 7-point match.

The systematic error due to the analysis method could be quantified but I haven't had the time to devote to that effort. Basically there are three ways for Snowie to analyze a match: evaluation only, rollout only, and a combination. Most people either do evaluation only or have Snowie evaluate and then roll out plays where the evaluation error is larger than some threshold. For example, in my 81 matches with Snowie, all plays and cube decisions where evaluation indicated an equity error of 0.030 or higher were rolled out. For positions that are rolled out, the rollout result takes precedence over evaluation.

For the above described combination method (roll out positions with error greater than some specified threshold), only two candidates are rolled out for each error position: the actual play made in the match and the play Snowie evaluation considers best. If some third candidate would do better in a rollout, this will not be discovered because that candidate will not be rolled out. When all positions of a match are rolled out, however, then typically several candidates are rolled out for each position. In this method there is less likelihood that the best play (that a rollout could find) falls through the cracks. A complete match rollout is therefore a stricter judge of quality of play than a combination method evaluation. My interpretation of this is that the systematic error of a match which is completely rolled out is less than the systematic error for a match which is only partially rolled out.

F. Third Party Judgement: Human Ratings

Harald Johanni has been analyzing human vs. human matches for a few years now and has built up a database of matches which have been analyzed by Snowie. He does a 3-ply (huge, 100%) analysis and then rolls out (2-ply) all errors with magnitude larger than 0.1 cube-adjusted equity. He publishes a ranking/rating list based upon those analyzed matches. Dirk Schiemann is the top rated player with an error rate of 2.927 mppm based upon having had 17 of his matches analyzed. He is the only player in Johanni's list with an error rate under 3 mppm. 13 players have an error rate under 4 mppm and only 44 have an error rate under 5 mppm.

Although Johanni's system is far from comprehensive in rating tournament players, consider this: his top 30 players include 17 with five or more analyzed matches. They are Schiemann, Grandell, Paul Weaver, Heitmuller, Levermann, Muysers, Mads Andersen, Sax, Ballard, Johanni, Magriel, Winslow, Goulding, Karsten Nielsen, Robertie, Granstedt, and Meyburg. Their average rating is 3.93 +-0.15 (conservative 95% confidence) for 327 matches.

I can think of three reasons why a human's rating can be expected to be worse than that of an equally skillful robot. Two have been mentioned previously: human vulnerabilities (for example: fatigue, concentration, confidence) and Snowie bias. The third is a psychological factor. A robot only makes the play it considers to be technically best (that is, the play it would make against an equal opponent). Sometimes a human player will intentionally make a technically inferior play, but one s/he expects the opponent to misreact to, the overall effect being a higher equity gain compared to making the best technical play. This can be especially valuable in doubling actions. For example, a double may be technically too early, but still correct if there is a reasonable chance the opponent might pass.

IV. Conclusions

Although the skill measurement methods detailed here give the analyst many more options than were available even as recently as 3 years ago, there is still considerable systematic uncertainty in their results. We may have to wait until the game of backgammon has been solved to have a really accurate rating/ranking system. In the meantime, although Jellyfish and Snowie are among the best players in the world, one should keep in mind that they are not perfect, and even the most powerful rollout result isn't exact. Always keep one skeptical eye open.
