Backgammon Ratings

Improving the rating system

From:   Matti Rinta-Nikkola
Address:   rintanikkola@my-deja.com
Date:   6 November 2000
Subject:   Backgammon rating system. PART II
Forum:   rec.games.backgammon
Google:   8u5u6g$rl0$1@nnrp1.deja.com

Hi,

Two years ago there was some discussion here about the FIBS rating system: how it works with different match lengths, and so on. This article is very long and may also be a bit complicated, sorry for that. Still, I think it explains quite nicely the anomalous rating data collected from FIBS. The article answers the question: how much advantage does the better player get from the cube? It also explains how the rating system should be modified so that it works better for different match lengths.

Best regards,
Matti Rinta-Nikkola


1. The ELO system
-----------------

The basic assumption of the ELO rating system is that the players' ratings follow a Gaussian distribution. This assumption leads to the match winning probability formula

$$P(D) = \frac{1}{10^{-D\sqrt{S}/W} + 1}, \tag{1}$$

where

   D       is the ELO difference
   S=S(N)  is the opportunity for skill in an N point match
   W       is the class width

In the ELO system the winner of the match gains

$$(1-P)\,M\sqrt{S} \tag{2}$$

points and the loser loses the same amount. In a rating system the class width W and the mean of the rating distribution <ELO> can be set arbitrarily. The backgammon servers that I know (Netgammon, FIBS and Gamesgrid) have chosen class width W=2000, <ELO>=1500 and constant M=5 (eq 2). Note the relationship between the constant M and the class width W: if you lower (raise) the value of W, you should also lower (raise) the value of M.

The Skill function S(N) has been solved assuming that the game winner always gets 1 point (ref 1). That assumption leads to the function

$$S(N) = N. \tag{3}$$

If we take gammons and the doubling cube into account, it can be shown that the Skill function has the form

$$S(N) = 1 + C\,\frac{N-1}{2}; \quad N=1,3,5,7,\ldots \tag{4}$$

where the constant C should be solved using the match statistics of the server (ref 2). We know that the value of C lies in the range .8 < C < 2 (ref 3,4).


2. Determining Skill function constant
--------------------------------------

2.1 Continuous zero volatility game
-----------------------------------

The constant C in eq (4) can be divided in two parts, C=cp+ch, where cp represents checker playing skill and ch cube handling skill. It has been shown that cp=.84 (ref 3,4), while the share of cube handling skill is still more or less an open question.

Opportunity for skill can equally be understood as possibility for error. One way to estimate ch is to determine the maximum "reasonable" cube handling error that players could make and evaluate its effect on the game result. In order to estimate ch we make the following assumptions:

   1) maximum cube handling error = cube is never used
   2) backgammon game = continuous and zero volatility game
   3) equity change is directly proportional to the skill

Two players with equal checker play skills but maximum difference in cube handling skills play a money game. We assume that the perfect doubling point is p=.75, where p is the cubeless game winning probability, and that doubles are always dropped (from the zero volatility assumption it follows that perfect doubles are equally correct to take or drop). The money game equity can be written as a function of the cubeless game winning probability

$$\mathrm{Eq}(p) = 2.67\,p - 1, \tag{5}$$

see figure 1. At the beginning of the game Eq(.5)=.33, i.e. the maximum advantage that a player can obtain by using the cube. In order to win a game, the remaining equity change DEq = 1 - .33 = .67 has to be won by checker play. Using assumption 3) we can write

$$C = \frac{\mathrm{cp}}{.67} = 1.3; \quad \text{assuming } N \gg 3.$$

Notice a small inaccuracy in figure 1 and equation (5): the average win for the player who does not use the cube is bigger than one. The player who uses the cube does not get an additional cubing advantage from gammons, because this advantage is compensated by the gammon wins of the player who plays without the cube.

Using a different assumption for the gammonless doubling point, .75 < p < .8, the Skill function constant varies over 1.1 < C < 1.3.
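The arithmetic behind these values can be sketched in a few lines of Python. This is only an illustration under the assumptions of this section: the function names are made up for the example, cp=.84 is the value quoted from refs 3,4, and extending eq (5) to other doubling points as a straight line from -1 at p=0 to +1 at the doubling point is an assumption, not something stated in the text.

```python
def money_equity(p, doubling_point=0.75):
    """Equity of the cube user as a linear function of cubeless win probability p.

    Assumed to run from -1 at p=0 to +1 at the doubling point (double/drop);
    for doubling_point=.75 this is eq (5), Eq(p) = 2.67p - 1.
    """
    return 2.0 * p / doubling_point - 1.0


def skill_constant(doubling_point=0.75, cp=0.84):
    """C = cp / DEq, where DEq is the equity still to be won by checker play."""
    deq = 1.0 - money_equity(0.5, doubling_point)
    return cp / deq


for dp in (0.75, 0.78, 0.80):
    print(f"doubling point {dp:.2f}: Eq(.5) = {money_equity(0.5, dp):.2f}, "
          f"C = {skill_constant(dp):.2f}")
```

Under these assumptions the sketch gives C of roughly 1.26, 1.17 and 1.12 for doubling points .75, .78 and .80, which the article rounds to 1.3, 1.2 and 1.1.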
A doubling point of .8 corresponds to a live cube and .75 to a dead cube. The average doubling point (omitting gammons) is p=.78 (ref 5,6), which gives C=1.2.

[Figure 1. Money game equity as a function of cubeless game winning probability. Horizontal axis: cubeless winning probability from 0 to 1, with .5 and the doubling point .75 marked. Vertical axis: equity from -1 to 1, with .33 marked at p=.5.]

2.2 Average number of games
---------------------------

Note that in the Skill equation (4) the opportunity for skill is expressed with "1 point match skill" as the unit. The complication in determining the constant C arises partly from the fact that an N point match (N>1) contains elements that are missing from a one point match, i.e. the cube and the gammon factor. The problem can be simplified if we consider only the odd point matches longer than one point.

We assume that every game involves the same amount of skill. That is, we assume that a game of a three point match offers the same opportunity for skill as a game of an 11 point match (for example). This is a reasonable approximation: Jellyfish money game statistics show that the cube is turned only 1.2 times/game (ref 5). I think that this cubing number is also valid for pre-Crawford games in match play. Although the match score affects cube decisions, I think that a single cube decision in a three point match carries on average the same opportunity for error, or equivalently for skill, as a single cube decision in an 11 point match (for example).

If the average number of games/match is known we can simply write down the Skill function; notice the analogy with the rolls method (ref 3). Luckily that data can be retrieved from the big_brother match archive (ref 7), see Table 1. The Skill function for odd point matches (N>1) can be written as

$$S(N) = 1 + C'\,\frac{N-3}{2}; \quad N=3,5,7,9,\ldots \tag{6}$$

Here skill is expressed with "3 point match skill" as the unit.

Table 1. Big_brother match archive: average number of games/match and Skill functions.
 Match    # of matches    average # of     average # of games
 length   in archive      games            (2.35 as unit)

   1         350          1.00 (1.00)          -
   3         634          2.35 (2.35)      1.00 (1.00)
   5        1184          3.83 (3.70)      1.63 (1.60)
   7         492          5.02 (5.05)      2.14 (2.20)
   9          42          7.24 (6.40)      3.08 (2.80)
  11          31          7.81 (7.75)      3.32 (3.40)

                          eq. 4            eq. 6
                          C=1.35           C'=.60

The values in parentheses are the Skill function values: eq (4) with C=1.35 in the middle column and eq (6) with C'=.60 in the right column.

Note that without additional assumptions we cannot say from the table data how much more opportunity for skill there is in a 3 point match than in a 1 point match. In the middle column of Table 1 we have simply assumed that a game in a 1 point match is equivalent to a game in an N point match, N>1.

2.3 Data of match results
-------------------------

Determining the constant C from players' ELO ratings is a somewhat tricky business. It is tricky because we are using ELO data obtained with a faulty rating system to correct that same rating system. If a rating system has a wrong Skill function constant, it does not predict the players' winning chances consistently across all match lengths. In fact every match length will have its own rating system with a different class width value W'(N). If a player plays mainly 1 point matches his rating will follow the 1 point rating system, if he plays mainly 3 point matches his rating will follow the 3 point rating system, and so on. Note that a player's rating depends not only on the match length he usually plays but also on the ratings of his opponents and the match lengths they mainly play. Of course the ratings still describe the players' backgammon skills despite the erroneous Skill function.

In order to understand where the error of the FIBS rating system comes from and how big it is, let us take a closer look at the heart of the rating system, i.e. the match winning probability function eq (1). It can be shown that the exponent in the formula satisfies

$$\frac{D\sqrt{S}}{W} = \text{constant}; \quad N = \text{constant}. \tag{7}$$

If we change the value of S or W, the rating differences of the players will change so that the expression above stays constant.
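Equation (7) can be checked numerically. The Python sketch below is only an illustration: the function names and the example of a 100 point difference in a 5 point match are hypothetical, W=2000 is the server value quoted in section 1, and C=1.2 is just one candidate value for eq (4).

```python
import math


def win_probability(d, s, w=2000.0):
    """Eq (1): match winning probability for ELO difference d, skill s, class width w."""
    return 1.0 / (10.0 ** (-d * math.sqrt(s) / w) + 1.0)


def equivalent_difference(d, s, w, s_new, w_new):
    """Rating difference under (s_new, w_new) that keeps the eq (7) exponent unchanged."""
    return d * (math.sqrt(s) / w) * (w_new / math.sqrt(s_new))


# Hypothetical example: a 100 point difference in a 5 point match,
# rated once with S=N and once with eq (4), C=1.2.
n, d, w = 5, 100.0, 2000.0
s_fibs = float(n)
s_c12 = 1.0 + 1.2 * (n - 1) / 2.0

d_c12 = equivalent_difference(d, s_fibs, w, s_c12, w)
print(f"difference under S=N:   {d:.1f}, P = {win_probability(d, s_fibs, w):.4f}")
print(f"difference under C=1.2: {d_c12:.1f}, P = {win_probability(d_c12, s_c12, w):.4f}")
```

The two printed probabilities coincide: changing the Skill function stretches or shrinks the rating differences, while the winning probability between the same two players stays where it was.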
In other words, the rating system does not change the players' match winning probabilities!

Assume that we have a bg-server equipped with two independent rating systems that have the same class width value W but different Skill functions S and S'. Both Skill functions have the form of equation (4) but with a different constant C. Let us assume that the rating system with Skill function S is perfect, i.e. its measured class width is constant and equal to W, the value used in the rating system. Using equation (7) we define the function

$$\mathrm{Err}(N) = \frac{W D'}{W' D} = \sqrt{\frac{S}{S'}}; \quad N=1,3,5,7,\ldots \tag{8}$$

which will be used later to examine the quality of an erroneous rating system and to determine the constant C. The class width W' in the above equation can be measured using the match data of the server and eq (1) (ref 8). Note that W' is a function of match length. Here D and D' should be understood as class widths of the players' ELO point distributions rather than as the ELO difference of two individual players.

Assume first that there is no ratings mixing, i.e. every match length has its own ELO rating system and the ratings of different match lengths are not mixed. In that case W/W'=1 and the Skill function constant could be retrieved from

$$\mathrm{Err}(N) = \frac{D'(N)}{D}; \quad N=1,3,5,7,\ldots \tag{9}$$

where D and D'(N) are measured class widths of match rating distributions: D is the class width of the 1 point match rating system and D'(N) that of the N point match rating system. Unfortunately there is no practical use for equation (9), because the class widths D'(N) cannot be measured easily.

In the realistic case the ratings of different match lengths are well mixed and, despite the faulty rating system, the players' ELO distribution has a well defined class width Dm' (ref 9). Now equation (8) can be written

$$\mathrm{Err}(N) = \frac{W}{W'(N)}\,\frac{D_m'}{D(N)}; \quad N=1,3,5,7,\ldots \tag{10}$$

Note that here Dm', which results from the rating system, is constant, while the correct class width D=D(N) is a function of match length. This equation also has its difficulties.
We know W, Dm' is easily obtained from the server, and W'(N) can be measured. The problem is that we do not know how to get D(N).

o Example 1: FIBS ratings formula.

In the case of the FIBS rating system S'(N)=N, i.e. C=2 in eq (4). Equation (8) can be written as

$$\mathrm{Err}(N) = \sqrt{\frac{1 + C(N-1)/2}{N}}; \quad N=1,3,5,7,\ldots \tag{11}$$

The function is tabulated for various N and C values in Table 2. We see from Table 2 that the FIBS rating system works reasonably well for matches longer than one point, because the function Err(N>1) is nearly constant.

Assume that the FIBS rating system is used to rate players who mainly play N>1 matches, and that one point matches are played only occasionally, so that the players' ELO rating distribution is not affected and follows the N>1 match rating system perfectly. Note that due to the error in the Skill function the FIBS rating system follows the N>1 match rating distribution more aggressively than the one point match rating distribution, see equation (2). So even if players play a notable number of one point matches, the rating distribution will probably follow the N>1 rating system. In that case the expected class width for the one point match rating system would be about

$$\frac{W'(1)}{W} = \langle \mathrm{Err}(N>1) \rangle, \tag{12}$$

where <Err(N>1)> is the average of the Err function values for N>1. Because the rating distribution follows the N>1 rating system, D/Dm'=Err and equation (10) can be written

$$\frac{W}{W'(N)} = \mathrm{Err}^2(N), \quad N>1. \tag{13}$$

After simple algebra the Skill function constant is in our hands:

$$C = \frac{(W/W')\,N - 1}{\tfrac{1}{2}(N-1)}, \quad N>1. \tag{14}$$

The only unknown in equation (14) is W'(N), which can be measured by following the method explained in reference 8.

Table 2. Function Err(N) with expected class width values W'(1) (eq 12) and W'(N>1) (eq 13).
 C \ N     1     3     5    11    21     W'(1)  W'(3)  W'(5)

  .8      1.0   .77   .72   .67   .65    1440   3373   3858
 1.0      1.0   .82   .77   .73   .72    1540   2974   3373
 1.2      1.0   .85   .82   .80   .79    1640   2768   2974
 1.4      1.0   .89   .87   .85   .85    1740   2524   2642
 1.6      1.0   .93   .92   .90   .90    1840   2312   2363

o Experiment 1 by Gary Wong (ref 8)

The computer player Abbot on FIBS has been set to record its one point matches. The recorded data has been used to test the FIBS rating system. The best fit of the match winning rating formula to the collected data has been obtained using the class width value W'(1)=1634. Assuming that Abbot's opponents mainly play N>1 matches, the Skill function constant would be C=1.2, see Table 2. Note that the ELO rating of Abbot is around 1500. It is unlikely that Abbot is incorrectly rated, so the error in the ELO difference D (eq 1) comes entirely from the error in the ELO ratings of Abbot's opponents.

o Experiment 2 by Jim Williams (ref 10)

This experiment, like many others, has been done on FIBS. Results of 1-5 point matches have been collected and the data has been used to test empirically the validity of the match winning probability function eq (1). Here the Skill function has been chosen so that formula eq (1) gives the best fit with the observed data, i.e. S(N)=Neff, where Neff is the "effective match length", see ref 10. For our analysis the class width W' is a more suitable fitting variable than the Skill function. The class width W' can be calculated from W'(N)=2000*sqrt(N/Neff). The results of the experiment are summarized in Table 3.

If ratings are well mixed it is enough to measure W'(N) for one match length in order to determine the constant C. Here W'(N) has been measured for three match lengths and every measurement leads to the same value of C within the accuracy of the measurement. This fact is an empirical proof that the Skill function has the form presented in equation (4) and that one rating system can be designed for all match lengths.

Table 3. Observations. The row "ave" gives the weighted average of each column.
The average is weighted by the number of observed match results.

  N    # of matches   Neff     W'      C      Dm'/D
       (x10**3)                               (C=1.1)

  1       20.0        1.6     1581    1.1      .79
  3       12.0        1.6     2739    1.19    1.15
  5        8.6        2.1     3086    1.13    1.23

 ave                          2242    1.13     .99

o Note 1.

The weighted average of Dm'/D over all match lengths is about one. If we use C<1.1 the average is smaller than one, and if C>1.1 the average is bigger than one.

o Note 2.

The ratio Dm'/D can be used to estimate players' true ratings. Assume that we have two players with ELO=1900 playing on the system described in experiment 2. One of these players plays only one point matches. His true rating can be estimated as (1900 - 1500)*.79 + 1500 = 1816. The other player plays only 5 point matches and his true rating would be (1900 - 1500)*1.23 + 1500 = 1992 (C=1.1). So if you want to be the top rated player it is not sufficient to be the best player; you also need to know how the rating system works.

o Note 3.

I have estimated my FIBS ELO rating differences against JellyFish to be as follows: N=1 -> 210, N=3 -> 176 and N=5 -> 165 (ref 11). These ratings can also be corrected using the measured ratio Dm'/D! The one point match rating is correct by definition, as there is no ratings mixing here. Using C=1.1 the true rating difference for the 3 point match would be 176*1.15 = 202 (209) and for the 5 point match 165*1.23 = 203 (208). The values in parentheses are obtained by using the match equity table JF-mrn (ref 11) and equation (1), with the Skill function eq (4) and constant C=1.1.


3. Summary
----------

Three completely different approaches have been used to determine the Skill function constant C (eq 4). All approaches lead to a constant C in the interval 1.1-1.4, see Table 4, while the FIBS formula uses C=2. Experimental data has been used in two different ways to fix the constant C: method 1) average number of games/match and method 2) data of match results and ELO ratings. The first method gives C=1.35 and the latter 1.1<C<1.2.
The difference between these two values can be explained by the fact that method 1) does not take into account the obvious differences between one point and N>1 point matches (cube and gammon factor), while in method 2) these differences are covered. The bad side of method 2) is that it relies on data obtained with a faulty rating system. This makes the analysis more complicated, and moreover it may even be that the constant C cannot be solved accurately from the available data. I think that after the first correction there will still be a need for a small fine tuning to reach the "correct" C value. Anyway, I think that whatever value is picked from the range 1.1-1.4, the resulting rating system would be superior to the one currently in use, compare Tables 2 and 4.

Table 4. Expected accuracy of the corrected rating system, i.e. function Err(N). C=1.2 has been chosen as the correct Skill function constant. On the right are shown the C values obtained by the different methods:

   1) Continuous zero volatility game (ch 2.1)
   2) Average number of games (ch 2.2)
   3) Data of match results (ch 2.3)
   4) Game statistic JF-mrn (ref 11)

 C \ N     1     3     5    11    21       C \ Method   1   2   3   4

 1.0     1.00  1.05  1.06  1.08  1.09        1.0
 1.1     1.00  1.02  1.03  1.04  1.04        1.1        x       x   x
 1.2     1.00  1.00  1.00  1.00  1.00        1.2        x       x
 1.3     1.00   .98   .97   .97   .96        1.3        x
 1.4     1.00   .96   .96   .94   .93        1.4            x

If the FIBS rating system is corrected by choosing a new Skill function with C=1.2, it might be a good idea to also change the class width value W of the rating system so that the change in the players' ELO distribution is minimized. The class width W could be chosen, for example, so that the current "11 point rating system" remains intact, i.e. W=2500. Experiment 2 suggests a somewhat lower value W=<W'>=2250. The constant M I would leave as it is. Note that the class width of the ELO distribution should remain equal to the one in the old rating system if we change W correctly. The new rating system could be implemented alongside the old one, so that we would have a direct comparison between the rating systems.
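The Err(N) values of Table 4 can be recomputed from equations (4) and (8). The Python sketch below is only an illustration: the function names are made up, the row value C is taken to be the constant used by the rating system, and C=1.2 is taken as the correct constant, as in the table caption. The extra row C=2.0 is an addition that corresponds to the current S(N)=N formula.

```python
import math


def skill(n, c):
    """Eq (4): opportunity for skill in an odd n point match."""
    return 1.0 + c * (n - 1) / 2.0


def err(n, c_system, c_true=1.2):
    """Eq (8): Err(N) = sqrt(S/S'), with S from the assumed-correct constant
    and S' from the constant used by the rating system."""
    return math.sqrt(skill(n, c_true) / skill(n, c_system))


match_lengths = (1, 3, 5, 11, 21)
print("C \\ N " + "".join(f"{n:>6}" for n in match_lengths))
for c in (1.0, 1.1, 1.2, 1.3, 1.4, 2.0):
    row = "".join(f"{err(n, c):6.2f}" for n in match_lengths)
    print(f"{c:5.1f} {row}")
```

The printed rows should land within about .01 of the tabulated values; small differences come from rounding in the original table.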
References
----------

 1) ELO ranking
    http://www.netgammon.com/us/facts/elo2.htm
 2) "Derivation of backgammon Skill function" by M.Rinta-Nikkola
    http://www.deja.com/[ST_rn=ap]/getdoc.xp?AN=419254506&fmt=text
    http://www.deja.com/[ST_rn=ap]/getdoc.xp?AN=419293370&fmt=text
 3) FIBS--Rating Formula: Different length matches by Tom Keith
    http://www.bkgm.com/rgb/rgb.cgi?view+523
 4) "Constructing a ratings system" by M.Rinta-Nikkola
    http://www.bkgm.com/rgb/rgb.cgi?view+621
 5) Cubeful distribution by Roland Sutter
    http://www.deja.com/getdoc.xp?AN=491955947&fmt=text
 6) "Doubling in money game: drop, take or beaver" by M.Rinta-Nikkola
    http://www.deja.com/[ST_rn=qs]/getdoc.xp?AN=464753854&fmt=text
 7) Big_Brother match archive
    http://www.bkgm.com/rgb/rgb.cgi?menu+matcharchives
 8) FIBS--Rating Formula: Emperical analysis by Gary Wong
    http://www.bkgm.com/rgb/rgb.cgi?view+601
 9) Rating distributions of bg-servers by Daniel Murphy
    http://www.deja.com/getdoc.xp?AN=480105756&fmt=text
10) FIBS--Rating Formula: Different length matches by Jim Williams
    http://www.bkgm.com/rgb/rgb.cgi?view+603
11) JF-mrn game statistics by M.Rinta-Nikkola
    http://www.deja.com/getdoc.xp?AN=503022535&fmt=text
