Backgammon Ratings

Ratings

Ratings and rankings

From:   Chuck Bower
Address:   bower@bigbang.astro.indiana.edu
Date:   23 December 1997
Subject:   rankings and ratings...
Forum:   rec.games.backgammon
Google:   67pd8h$ph8$1@dismay.ucs.indiana.edu

(NOTE: don't confuse this Subject with "rantings and ravings..." That was a post I made about a week ago.)

From time to time there are questions pertaining to how to decide who is better than whom. There was a rash of them recently relating to the BOTS and their performances on FIBS. I'm writing this as an overview/summary of the topic. Some ideas (including speculations) are my own and others have already been expressed (in this newsgroup). I'll try to differentiate, but pardon me if I plagiarize. And especially, I ask forgiveness for NOT giving credit to the authors of any ideas that I rehash.

One other disclaimer: I will be discussing some "methods" and their "keepers". DO NOT FOR ONE SECOND conclude that these people contend that their methods are anything more than just another piece of data which can be input into the unanswerable question "who is better than whom?". They provide a valuable service (for no remuneration), and if someone were to attempt to lecture, scold, or otherwise chastise any of them, then such a person has completely misunderstood both their efforts and, in addition, this article!

The BEST way (that I can think of) to determine which of TWO players is better is to have them play a LONG session against each other. Up until recently (the last 10 years or so) this was about the ONLY way to answer the question "who is better...". Normally this is done for money (to "reward" the better player as well as to attempt to ensure that each is playing at his/her best)! Unfortunately the number of games/matches required to determine the answer with statistical confidence is so large that it just takes too much time to reach a reliable answer. As the skill difference between the players gets small, the number of trials required becomes HUGE.
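To make the sample-size point concrete, here is a minimal Python sketch (not part of the original post). It assumes games are independent, that the total result over many games is roughly normal, and that a single money game has a standard deviation (sigma) of about 3 points; that sigma is an illustrative assumption, not a measured figure, so substitute your own estimate. The function names `z_score` and `games_needed` are hypothetical.

```python
import math


def z_score(margin: float, games: int, sigma: float = 3.0) -> float:
    """How many standard errors a total margin lies from zero (an even contest)."""
    # Standard error of a sum of `games` independent results, each with SD `sigma`.
    return margin / (sigma * math.sqrt(games))


def games_needed(edge: float, sigma: float = 3.0, z: float = 1.96) -> int:
    """Games before an average edge of `edge` points per game sits z standard errors from zero."""
    # Solve edge * n = z * sigma * sqrt(n) for n.
    return math.ceil((z * sigma / edge) ** 2)


if __name__ == "__main__":
    print(f"58 points in 300 games: z = {z_score(58, 300):.2f}")
    for edge in (0.2, 0.1, 0.05):
        print(f"edge of {edge} points/game: about {games_needed(edge)} games")
```

Under these assumptions the 58-point margin from the JF challenge mentioned in the next paragraph is only about 1.1 standard errors, short of the roughly 2 usually asked for. An edge of 0.2, 0.1, or 0.05 points per game needs about 865, 3458, or 13830 games respectively: halving the edge roughly quadruples the games required, which is why small skill differences take so long to resolve.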
As an example, after last summer's JF challenge, Fredrik pointed out (with statistics) that a difference of 58 points in 300 games isn't nearly enough to draw a conclusion because of the large fluctuations (from the dice). (Try DejaNews for more specifics.) It could be that people who play against each other A LOT (one or more sessions per week over years, for example) could actually collect enough data to reach a statistically significant answer. However, you can always surmise: "Maybe one of the players improved more relative to the other over the time the data were taken. Who is the better player NOW?"

Currently we have dedicated 24-hour players (like commercial Jellyfish), and the question can be answered knowing that at least one of the players isn't improving! I haven't seen much on the newsgroup (that is, substantiated with numbers) comparing human vs. JF. I keep such tallies for my own play (and have posted results in the past) but haven't been getting in enough play-vs-JF time lately to collect sufficient statistics on the current version. I'm SURE that JFv2.0 level-7 was a better player than I. Anyone want to take the other side of that argument? Gee, thanks.

What about global measures? I know of three which are currently available, but none of them is perfect, either. They are surveys, performance points, and ratings-based methods. All have their pluses and minuses.

One example of a survey is Yamin Yamin's "Giant 32 of Backgammon". This is a biannual survey of roughly 100 persons (responses, that is). The results are published in the Flint Area Backgammon News. Actually, this survey was just completed within the last couple of weeks, and I expect to see the results in either the next Flint newsletter or the issue after that. The problem with surveys is that they are inherently subjective. For example, some Europeans have complained that Yamin's survey is biased toward North American players (and I agree with them).
I, for one, am NOT complaining about this survey. BG (as it is played today) is regional in nature. IMHO that is an irrefutable fact. Most of Yamin's survey respondents are North Americans, and even if they aren't sociologically biased (let's hope that's the case!), their experience is in this hemisphere, and players from other parts of the world play within their own travel zones. Not very many (in fact, NO) events give a truly geographically unbiased sampling. Actually, the online Internet tournaments are probably the least biased from that standpoint.

Performance rankings are another measure. Bill Davis has been coordinating such a point system--The American Backgammon Tour. How good is this at determining the best players in "America"? Well, it is probably a decent (though still statistically insufficient) way of deciding who is "best" among those who play in a LOT of ABT events! Problem is, for whatever reason, a lot of strong Western Hemisphere players don't participate frequently on this tour. It's a fun way of recognizing players who are doing well, but it's just another piece in the puzzle.

The third method is one which has really caught on recently thanks to Internet backgammon: ratings systems. Copied from chess ratings systems, this is an objective method of ranking players who share a common playground. Kent Goulding (and colleagues) had been keeping a ratings system for large tournament results over the past several years. Unfortunately, prompted by lost-data problems, I believe his effort has been inactive since the summer of 1996. Still, to my mind, KG deserves much of the credit for the current popularity of the online ratings systems.

One obvious weakness of any ratings system is that it really only applies "locally". Only FIBS players get FIBS ratings/rankings. Only GAMESGRID players get GAMESGRID ratings/rankings. Etc. At best you only get a reliable ranking among the participants and conditions of that rating system.
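The post does not spell out a rating formula, so as a hedged illustration of the chess-style (Elo-like) approach, here is a Python sketch of the update rule as it is commonly described for FIBS (see Patti Beadles' "FIBS rating formula" in the list below). The constants (2000 in the win-probability curve, a stake of 4 times the square root of the match length) are written from memory and should be treated as assumptions to check against that post; the experience multiplier that speeds up new accounts is left out, and all function names are hypothetical.

```python
import math


def upset_probability(rating_gap: float, match_length: int) -> float:
    """Chance that the lower-rated player wins (assumed FIBS-style formula)."""
    # Longer matches favour the stronger player more, via the sqrt(N) factor.
    return 1.0 / (10 ** (abs(rating_gap) * math.sqrt(match_length) / 2000) + 1)


def rating_change(winner: float, loser: float, match_length: int) -> float:
    """Points moved from loser to winner (experience multiplier omitted)."""
    p_underdog = upset_probability(winner - loser, match_length)
    # The winner's pre-match chance of winning.
    p_winner_expected = p_underdog if winner < loser else 1.0 - p_underdog
    # Unlikely wins pay more; the total stake scales with sqrt(match_length).
    return 4 * math.sqrt(match_length) * (1.0 - p_winner_expected)


def self_play(main: float = 1500.0, puppet: float = 1500.0,
              matches: int = 100, match_length: int = 1) -> tuple[float, float]:
    """Hypothetical two-ID abuse: the main account beats its own puppet repeatedly."""
    for _ in range(matches):
        delta = rating_change(main, puppet, match_length)
        main += delta
        puppet -= delta
    return main, puppet


if __name__ == "__main__":
    print(rating_change(1500, 1500, 1))  # equal ratings, 1-point match
    print(rating_change(1500, 1500, 9))  # equal ratings, 9-point match
    main, puppet = self_play()
    print(f"after 100 rigged wins: main {main:.0f}, puppet {puppet:.0f}")
```

Between equal ratings, a 1-point win moves 2 points and a 9-point win moves 6, because the stake grows with the square root of the match length. Ratings only move through games among the system's own members, which is the "local" weakness just described, and the `self_play` loop shows why the two-ID trick mentioned below works: every win against a puppet raises the main account, and nothing in the update itself can tell that both IDs belong to one person.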
Maybe the highest ranked player is just "a small fish in a small pond", so to speak. And it's really worse than that, because sometimes the players within a rating system don't intermingle much. For example, some FIBS players only play within a small cluster of "friends", so even though s/he has a FIBS rating, it's not as universal as it appears. The two ideas I've just mentioned have been discussed previously (multiple times) in this newsgroup. In addition, they are covered in greater detail, with some nice examples (or counterexamples...), in the Jacobs-Trice book "Can a Fish Taste Twice as Good?". This is recommended reading for anyone wanting to delve more deeply into the subject.

Online ratings systems can be tricked as well. (This is no secret.) By carrying the "don't intermingle" idea to an extreme, a person can play against him/herself (using two or more different IDs) and artificially inflate his/her rating. This is almost always easily detected. A very high rating with very low experience is certainly suspicious (though it's apparently theoretically possible to achieve this honestly). There are other low-integrity tactics which have been pointed out in this newsgroup as well, like preferential dropping and "fishing" (searching out weak players whose ratings are higher than deserved, for one reason or another). I believe these problems are inherent. There will always be "clever" cheaters who find a way to work around attempts to prevent such tactics.

One other thing worth mentioning (and covered previously in the newsgroup) is the observation that the common ratings systems may have the weakness of overrating players who only compete in 1-point matches. It seems like a difficult thing to prove, but there does appear to be circumstantial evidence. Maybe these special players should have their own (segregated) ratings system.

Now I am going to attempt to break new ground and start speculating. (Oh, you thought that's what I'd already been doing!)
In particular I'm going to focus on the robots' ratings on FIBS--another hot topic recently. Do the robots' high ratings really give them the title "best players on FIBS"? Maybe, but not necessarily. I am going to list (in no particular order) some reasons why their high ratings could be brought under suspicion. In case you are new to the newsgroup, note that I have no hidden malice towards them. I have a very high respect for these players.

1) Selection effects. (Basically I'm repeating the above problems with intermingling.) Do the bots take on all comers? Should they? Do the best humans take on all comers? Should they? Are weak humans more likely to challenge a highly ranked bot than a highly ranked human? My guess is "yes". Computers are incapable of sneering when they turn you down. Even if you argue that human experts don't do this (and my experience is that they are among the best-mannered experts of any kind in the world!), that doesn't keep the inexperienced player from suspecting such a thing could happen.

2) Exhaustion. Computers don't get tired. Humans do.

3) Emotion. Computers don't feel emotion. They don't notice bad dice. I suspect even the best human experts, as hard as they try, still feel the pain of unlucky dice. I'm sure it doesn't have the same magnitude of adverse effects as it does for the typical player, but it has to have a small impact in any case. How about elation? Is that an advantage for a human player to have? What about embarrassment? Do strong human players make errors based on ego? (I can't lose to THIS person!) Again, they know it's a detriment to good play, and thus work on eliminating such concentration killers, but it still must affect things sometimes.

4) Distractions. Does a computer's spouse interrupt and call it to dinner? Does it have to break its concentration when the modem rings? Is it watching a ballgame at the same time it's playing? Does it play better or worse after having a couple of alcoholic drinks?
5) "Giant killing". (I am of the belief that this is potentially a HUGE advantage for the bots.) Let me start with a (true) story. I had heard through r.g.bg and also from conversations with other players that there was a "new kid on the block"--SnowWhite. (This was a while back.) I was on FIBS and decided to watch this maiden take on one of the seven dwarves. I watched for all of about two dice rolls. Why? I was annoyed (make that disgusted). SnowWhite's opponent was in some kind of SUPER BACKGAME. Three or four points in SnowWhite's board, only one or two checkers in his/her home board--you get the picture. Gee. This looked like the typical backgammon game that I play...

So, why do I believe that such tactics are a big advantage to the bots? Simple. We're not (necessarily) talking about a highly rated FIBS player trying to outsmart a bot by playing a backgame. We're talking about Joe-typical-player. Even if it's true that an expert backgammon player can "make money" using backgame tactics, it is usually done by getting the bot to indiscriminately elevate the cube in a few games. The bot wins most of the games (many of which are gammons) with the cube at a low level. The human expert wins a few games WITH THE CUBE AT SOME ASTRONOMICAL VALUES. First, this isn't likely to work in match play, because the finite match length caps how much a high cube can be worth. Secondly, backgames are quite tricky. Your Joe-typical-player is going to be giving away equity by playing sub-optimally, so even if some experts can outplay a bot by seeking backgames, my guess is that most FIBS players are going to screw up badly enough to end up becoming cannon fodder for the bots.

Now, suppose this same Joe-typical-player is in a match with a human expert. Do you think he is going to steer into a backgame? I can tell you that there is at least one Chuck-typical-player who won't!

I realize that there are likely to be some biases that work against the bots.
For example, I wouldn't be surprised if the bots have a higher percentage of their matches dropped. "Hey, bots have no feelings, so why should I feel guilty pulling the plug when I'm losing a match to one of them?" And even if the biases favor the bots, that certainly doesn't mean they aren't better anyway. My main point is to read the ratings systems with a skeptical eye, whether comparing bot vs. bot, bot vs. human, or human vs. human.

Chuck
bower@bigbang.astro.indiana.edu
c_ray on FIBS

Ratings

Constructing a ratings system (Matti Rinta-Nikkola, Dec 1998)

Converting to points-per-game (David Montgomery, Aug 1998) [Recommended reading]

Cube error rates (Joe Russell+, July 2009) [Long message]

Different length matches (Jim Williams+, Oct 1998)

Different length matches (Tom Keith, May 1998) [Recommended reading]

ELO system (seeker, Nov 1995)

Effect of droppers on ratings (Gary Wong+, Feb 1998)

Empirical analysis (Gary Wong, Oct 1998)

Error rates (David Levy, July 2009)

Experience required for accurate rating (Jon Brown+, Nov 2002)

FIBS rating distribution (Gary Wong, Nov 2000)

FIBS rating formula (Patti Beadles, Dec 2003)

FIBS vs. GamesGrid ratings (Raccoon+, Mar 2006) [GammOnLine forum]

Fastest way to improve your rating (Backgammon Man+, May 2004)

Field size and ratings spread (Daniel Murphy+, June 2000) [Long message]

Improving the rating system (Matti Rinta-Nikkola, Nov 2000) [Long message]

KG rating list (Daniel Murphy, Feb 2006) [GammOnLine forum]

KG rating list (Tapio Palmroth, Oct 2002)

MSN Zone ratings flaw (Hank Youngerman, May 2004)

No limit to ratings (David desJardins+, Dec 1998)

On different sites (Bob Newell+, Apr 2004)

Opponent's strength (William Hill+, Apr 1998)

Possible adjustments (Christopher Yep+, Oct 1998)

Rating versus error rate (Douglas Zare, July 2006) [GammOnLine forum]

Ratings and rankings (Chuck Bower, Dec 1997) [Long message]

Ratings and rankings (Jim Wallace, Nov 1997)

Ratings on Gamesgrid (Gregg Cattanach, Dec 2001)

Ratings variation (Kevin Bastian+, Feb 1999)

Ratings variation (FLMaster39+, Aug 1997)

Ratings variation (Ed Rybak+, Sept 1994)

Strange behavior with large rating difference (Ron Karr, May 1996)

Table of ratings changes (Patti Beadles, Aug 1994)

Table of win rates (William C. Bitting, Aug 1995)

Unbounded rating theorem (David desJardins+, Dec 1998)

What are rating points? (Lou Poppler, Apr 1995)

Why high ratings for one-point matches? (David Montgomery, Sept 1995)

[GammOnLine forum]  From GammOnLine
[Long message]  Long message
[Recommended reading]  Recommended reading
[Recent addition]  Recent addition

Return to: Backgammon Galore : Forum Archive Main Page