Improving the rating system

From:   Matti Rinta-Nikkola
Address:   rintanikkola@my-deja.com
Date:   6 November 2000
Subject:   Backgammon rating system. PART II
Forum:   rec.games.backgammon
Google:   8u5u6g$rl0$1@nnrp1.deja.com

Hi, two years ago there was some discussion here about the FIBS rating
system: how it works with different match lengths, etc.  This article
is very long and may also be a bit complicated, sorry for that.
Anyway, I think that it explains quite nicely the anomalous rating data
collected from FIBS.  The article answers the question: how much
advantage does the better player get from the cube?  It also explains
how the rating system should be modified so that it works better for
different match lengths.

Best regards,

Matti Rinta-Nikkola


1. The ELO system
-----------------

The basic assumption in the ELO rating system is that the rating
distribution of players follows a Gaussian distribution.  This
assumption leads to the match winning probability formula:

                   1
  P(D) = --------------------- , (1)
         10**(-D*sqrt(S)/W) + 1

where D is the ELO difference
      S=S(N) is the opportunity for skill in an N point match
      W is the class width

In the ELO system the winner of the match gains

  (1-P)M*sqrt(S)  (2)

points and the loser loses the same amount.

In a rating system the class width W and the mean of the rating
distribution <ELO> can be set arbitrarily.  The backgammon servers that
I know (Netgammon, FIBS and Gamesgrid) have chosen class width W=2000,
<ELO>=1500 and constant M=5 (eq 2).  Note the relationship between
constant M and class width W: if you want to set a lower (higher) value
for W you should also lower (raise) the value of M.  The Skill function
S(N) has been solved assuming that the game winner always gets 1
point (ref 1).  That assumption leads to the function

  S(N) = N. (3)
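
As a rough illustration of equations (1)-(3), the following Python
sketch computes the match winning probability and the rating change
using the values quoted above (W=2000, M=5, S(N)=N).  The function and
parameter names are hypothetical, and the constants are the ones quoted
in this article, not values taken from any server's source code.

```python
import math

def win_probability(d, skill, width=2000.0):
    """Eq (1): chance that a player rated d points above the opponent wins."""
    return 1.0 / (10.0 ** (-d * math.sqrt(skill) / width) + 1.0)

def rating_gain(p_winner, skill, m=5.0):
    """Eq (2): points gained by the winner (the loser loses the same)."""
    return (1.0 - p_winner) * m * math.sqrt(skill)

# Example: a 5 point match with S(N) = N (eq 3),
# where the favourite is rated 100 points above the underdog.
n = 5
p_fav = win_probability(100.0, skill=n)
print(f"favourite wins with probability {p_fav:.3f}")
print(f"favourite gains {rating_gain(p_fav, n):.2f} points by winning")
print(f"underdog gains {rating_gain(1.0 - p_fav, n):.2f} points by winning")
```

The underdog's possible gain is larger than the favourite's, which is
what keeps the ratings in balance when eq (1) predicts results well.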

If we take gammons and the doubling cube into account, it can be shown
that the Skill function has the form

               N - 1
  S(N) = 1 + C ----- ; N=1,3,5,7.... , (4)
                 2

where the constant C should be determined from the match statistics of
the server (ref 2).  We know that the value of C lies in the range
.8 < C < 2 (ref 3,4).
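
A minimal sketch of equation (4), assuming W=2000 as above, shows how
the choice of C changes the predicted winning chance for a fixed rating
difference.  The names are hypothetical; C=2 is included because it
reproduces S(N)=N from equation (3).

```python
import math

def skill(n, c):
    """Eq (4): opportunity for skill in an n point match (n odd)."""
    return 1.0 + c * (n - 1) / 2.0

def win_probability(d, s, width=2000.0):
    """Eq (1) with Skill value s and class width `width`."""
    return 1.0 / (10.0 ** (-d * math.sqrt(s) / width) + 1.0)

# Compare C=2 (S(N)=N) with C=1.2 for a 200 point rating difference.
for n in (1, 3, 5, 11):
    s_n, s_c = skill(n, 2.0), skill(n, 1.2)
    print(n, s_n, round(s_c, 2),
          round(win_probability(200.0, s_n), 3),
          round(win_probability(200.0, s_c), 3))
```

With the smaller constant the favourite's predicted advantage grows more
slowly with match length.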


2. Determining Skill function constant
--------------------------------------

2.1 Continuous zero volatility game
-----------------------------------

The constant C in eq(4) can be divided into two parts, C=cp+ch, where cp
represents checker play skill and ch cube handling skill.  It has been
shown that cp=.84 (ref 3,4), while the cube handling part is still more
or less an open question.  Opportunity for skill can equally be
understood as possibility for error.  One way to estimate ch is to
determine the maximum "reasonable" cube handling error that a player
could make and evaluate its effect on the game result.  In order to
estimate ch we make the following assumptions:

    1) maximum cube handling error = cube is never used
    2) backgammon game = continuous and zero volatility game
    3) equity change is directly proportional to the skill

Two players with equal checker play skills but maximum difference in
cube handling skills play a money game.  We assume that the perfect
doubling point is p=.75, where p is the cubeless game winning
probability, and that doubles are always dropped (from the zero
volatility assumption it follows that perfect doubles are equally
correct to take or drop).  The money game equity of the player who uses
the cube can be written as a function of the cubeless game winning
probability
  Eq(p) = 2.67p - 1,   (5)

see figure 1.  At the beginning of the game Eq(.5)=.33, i.e. the maximum
advantage that a player can obtain by using the cube.  In order to win
a game, an equity change DEq=.67 has to be gained by checker play.
Using assumption 3) we can write

       cp
  C = ---- = 1.3 ; assuming N >> 3.
      .67

Notice a small error in figure 1 and equation (5): the average win for
the player who does not use the cube is bigger than one.  The player
who uses the cube does not get an additional cubing advantage from
gammons, because this advantage is compensated by the gammon wins of
the player who plays without the cube.  Using different assumptions for
the gammonless doubling point, .75 < p < .8, the Skill function constant
varies over 1.1 < C < 1.3.  A doubling point of .8 corresponds to a live
cube and .75 to a dead cube.  The average doubling point (omitting
gammons) is p=.78 (ref 5,6), which gives C=1.2.

            .75
   ---------------  1
             /
            /
          |/
         -/-     - .33
         /|
   -----/---------  0
       /
      /
     /
    /
   /      |
   --------------- -1
  0      .5      1

  Figure 1. Money game equity as a function of cubeless game winning
            probability.
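
The estimate of C in this section can be sketched in a few lines of
Python.  This assumes, as in equation (5), that equity is linear in p
and reaches 1 at the doubling point; the function name is hypothetical
and cp=.84 is the value quoted from refs 3 and 4.

```python
def cube_constant(doubling_point, cp=0.84):
    """Estimate C = cp / DEq for a given gammonless doubling point."""
    slope = 2.0 / doubling_point        # Eq(p) = slope*p - 1, Eq(doubling_point) = 1
    cube_edge = slope * 0.5 - 1.0       # Eq(.5): advantage from using the cube
    equity_to_win = 1.0 - cube_edge     # DEq left to be gained by checker play
    return cp / equity_to_win

for p in (0.75, 0.78, 0.80):
    print(f"doubling point {p:.2f} -> C = {cube_constant(p):.2f}")
```

Rounded to one decimal these reproduce the values 1.3, 1.2 and 1.1
quoted in the text.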


2.2 Average number of games
---------------------------

Note that in Skill equation (4) the opportunity for skill is expressed
with "1 point match skill" as the unit.  The complication in
determining constant C arises partly from the fact that an N point
match (N>1) has elements which are missing from a one point match, i.e.
the cube and the gammon factor.

The problem can be simplified if we consider only the odd point
matches longer than one point.  We assume that every game involves the
same amount of skill.  That is, we assume that a game in a three point
match offers the same opportunity for skill as a game in an 11 point
match (for example).  This is a reasonable approximation: from
Jellyfish money game statistics it can be seen that the cube is turned
only 1.2 times/game (ref 5).  I think that this cubing number is also
valid for pre-Crawford games in match play.  Although the match score
affects cube decisions, I think that a single cube decision in a three
point match offers on average the same opportunity for error, or
equivalently for skill, as a single cube decision in an 11 point match
(for example).  If the average number of games/match is known we can
simply write down the Skill function; notice the analogy with the rolls
method (ref 3).  Luckily that data can be retrieved from the
big_brother match archive (ref 7), see Table 1.  The Skill function for
odd point matches (N>1) can be written as

               N - 3
  S(N)= 1 + C' ----- ; N=3,5,7,9... (6)
                 2

Here skill is expressed with "3 point match skill" as the unit.

  Table 1. Big_brother match archive: average number of games/match and
           Skill functions.  Values in parentheses are Skill function
           values: eq. 4 with C=1.35 and eq. 6 with C'=.60.

  Match    # of matches in   average # of games   games in units of 2.35
  length        archive      observed  (eq. 4)    observed  (eq. 6)

    1           350            1.00    (1.00)        -
    3           634            2.35    (2.35)       1.00    (1.00)
    5          1184            3.83    (3.70)       1.63    (1.60)
    7           492            5.02    (5.05)       2.14    (2.20)
    9            42            7.24    (6.40)       3.08    (2.80)
   11            31            7.81    (7.75)       3.32    (3.40)

Note that without additional assumptions we cannot say from the table
data how much more opportunity for skill there is in a 3 point match
than in a 1 point match.  In the middle column of Table 1 we have
simply assumed that a game in a 1 point match is equivalent to a game
in an N point match (N>1).
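
The comparison in Table 1 can be reproduced with a short Python sketch.
The observed averages are copied from the table; the function names are
hypothetical, and C=1.35 and C'=.60 are the constants quoted there.

```python
# Average number of games per match, copied from Table 1.
observed = {1: 1.00, 3: 2.35, 5: 3.83, 7: 5.02, 9: 7.24, 11: 7.81}

def skill_eq4(n, c=1.35):
    """Eq (4): skill with the 1 point match as unit."""
    return 1.0 + c * (n - 1) / 2.0

def skill_eq6(n, c_prime=0.60):
    """Eq (6): skill with the 3 point match as unit (n >= 3)."""
    return 1.0 + c_prime * (n - 3) / 2.0

unit = observed[3]  # 2.35 games in an average 3 point match
for n, games in observed.items():
    line = f"N={n:2d}  games={games:.2f}  eq4={skill_eq4(n):.2f}"
    if n >= 3:
        line += f"  games/2.35={games / unit:.2f}  eq6={skill_eq6(n):.2f}"
    print(line)
```

The 9 point row deviates most from both fits; note that it is based on
only 42 matches.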

2.3 Data of match results
-------------------------

Determining the constant C from players' ELO ratings is a somewhat
tricky business.  It's tricky because we are using ELO data obtained
with a faulty rating system to correct that same rating system.  If the
rating system has a wrong Skill function constant, then we do not have
a rating system which consistently predicts the winning chances of
players in all match lengths.  In fact every match length will have its
own rating system with a different class width value W'(N).  If a
player plays mainly 1 point matches his rating will follow the 1 point
rating system, if he plays mainly 3 point matches his rating will
follow the 3 point rating system, and so on.  Note that a player's
rating depends not only on the match length he usually plays but also
on the ratings of his opponents and the match lengths they mainly play.
Of course the ratings also describe players' backgammon playing skills,
despite the erroneous Skill function.

In order to understand where the error of the FIBS rating system comes
from and how big it is, let's take a closer look at the heart of the
rating system, i.e. the match winning probability function eq(1).  It
can be shown that the exponent in the formula satisfies

  D*sqrt(S)
  --------- = constant;  N=constant  (7)
     W

If we change the values of S or W, the rating differences of the
players will change so that the above expression stays constant.  In
other words, the rating system does not change players' match winning
probabilities!  Assume that we have a bg-server equipped with two
independent rating systems which have the same class width value W but
different Skill functions S and S'.  Both Skill functions have the form
of equation (4) but they have different constants C.  Let's assume that
the rating system with Skill function S is perfect, i.e. the measured
class width value is constant and equal to W, the value used in the
rating system.  Using equation (7) we define the function

           W D'        S
  Err(N) = --- = sqrt(---); N=1,3,5,7,...  (8)
           W'D         S'

which will be used later to examine the quality of an erroneous rating
system and to determine constant C.  The class width W' in the above
equation can be measured using the match data of the server and eq(1)
(ref 8).  Note that W' is a function of match length.  Here D and D'
should be understood as class widths of the players' ELO point
distribution rather than the ELO difference of two individual players.

Assume that there is no ratings mixing, i.e. every match length has
its own ELO rating system and the ratings of different match lengths
aren't mixed.  In that case W/W'=1 and the Skill function constant
could be retrieved from

           D'(N)
  Err(N) = -----;   N=1,3,5,7,...   (9)
             D

where D and D'(N) are measured class widths of match rating
distributions: D is the class width of the 1 point and D'(N) of the N
point match rating system.  Unfortunately there is no practical use for
equation (9) because the class widths D'(N) cannot be measured easily.
In the realistic case the ratings of different match lengths are well
mixed and, despite the faulty rating system, the players' ELO
distribution has a well defined class width Dm' (ref 9).  Now equation
(8) can be written

             W    Dm'
  Err(N) = ----- ----; N=1,3,5,7,... (10)
           W'(N) D(N)

Note that here Dm', which results from the rating system, is constant
while the correct class width D=D(N) is a function of match length.
This equation also has its difficulties.  We know W, Dm' is easily
obtained from the server, and W'(N) can be measured.  The problem is
that we do not know how to get D(N).

o Example 1:  FIBS ratings formula.
  In the case of the FIBS rating system S'(N)=N, i.e. C=2 in eq(4).
  Equation (8) can be written as

                  1+C(N-1)/2
    Err(N) = sqrt(----------); N=1,3,5,7,... (11)
                      N

  The function is tabulated for various N and C values in Table 2 below.
  We see from Table 2 that the FIBS rating system works reasonably well
  for matches longer than one point, because the function Err(N>1) is
  nearly constant.
  Assume that the FIBS rating system is used to rate players who mainly
  play N>1 matches, and that one point matches are played only
  occasionally, so that the players' ELO rating distribution is not
  affected and perfectly follows the N>1 match rating system.  Note that
  due to the error in the Skill function the FIBS rating system follows
  the N>1 match rating distribution more aggressively than the one point
  match rating distribution, see equation (2).  Even when players play a
  notable number of one point matches, the rating distribution will
  probably follow the N>1 rating system.  In that case the expected
  class width for the one point match rating system would be about

    W'(1)
    ----- = <Err(N>1)>,     (12)
      W

  where <Err(N>1)> is the average of the Err function values for N>1.
  Because the rating distribution follows the N>1 rating system,
  D/Dm'=Err and equation (10) can be written

     W        2
    ---- = Err (N),   N>1  (13)
    W'(N)

  After simple algebra the Skill function constant is in our hands

        (W/W')N - 1
    C = ------------ , N>1  (14)
         1/2 (N - 1)

  The only unknown in equation (14) is W'(N), which can be measured by
  following the method explained in reference 8.

    Table 2. Function Err(N) with expected class width values W'(1)
    (eq 12) and W'(N>1) (eq 13).

     C \ N   1    3     5     11    21    W'(1)   W'(3)   W'(5)

     .8     1.0  .77   .72   .67   .65    1440    3373    3858
    1.0     1.0  .82   .77   .73   .72    1540    2974    3373
    1.2     1.0  .85   .82   .80   .79    1640    2768    2974
    1.4     1.0  .89   .87   .85   .85    1740    2524    2642
    1.6     1.0  .93   .92   .90   .90    1840    2312    2363
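
  Equations (11), (13) and (14) can be sketched in Python as follows.
  The function names are hypothetical and W=2000 is assumed as above.
  Table 2 appears to be computed from Err values rounded to two
  decimals, so unrounded results from this sketch can differ slightly
  from the tabulated class widths.

```python
import math

W = 2000.0  # class width used by the rating formula

def err_fibs(n, c):
    """Eq (11): Err(N) when the server uses S'(N)=N but the true constant is c."""
    return math.sqrt((1.0 + c * (n - 1) / 2.0) / n)

def expected_width(n, c):
    """Eq (13) rearranged: expected measured class width W'(N), N>1."""
    return W / err_fibs(n, c) ** 2

def c_from_width(n, w_measured):
    """Eq (14): Skill function constant from a measured W'(N), N>1."""
    return ((W / w_measured) * n - 1.0) / (0.5 * (n - 1))

for c in (0.8, 1.0, 1.2, 1.4, 1.6):
    print(c, [round(err_fibs(n, c), 2) for n in (1, 3, 5, 11, 21)])

# Round trip: the width expected for C=1.2 leads back to C=1.2.
print(round(c_from_width(3, expected_width(3, 1.2)), 2))
```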


o Experiment 1 by Gary Wong (ref 8)
  The computer player Abbot on FIBS has been set to record its one point
  matches.  The recorded data has been used to test the FIBS rating
  system.  The best fit between the collected data and the match winning
  probability formula was obtained using class width value W'(1)=1634.
  Assuming that Abbot's opponents mainly play N>1 matches, the Skill
  function constant would be C=1.2, see Table 2.  Note that the ELO
  rating of Abbot is around 1500.  It is unlikely that Abbot is
  incorrectly rated, so the error in the ELO difference D (eq 1) comes
  entirely from the error in Abbot's opponents' ELO ratings.

o Experiment 2 by Jim Williams (ref 10)
  This experiment, like many others, was done on FIBS.  Match results
  of 1-5 point matches were collected and the data was used to test
  empirically the validity of the match winning probability function
  eq(1).  There the Skill function was chosen so that eq(1) gave the
  best fit with the observed data, i.e. S(N)=Neff, where Neff is the
  "effective match length", see ref 10.  For our analysis the class
  width W' is a more suitable fitting variable than the Skill function.
  The class width W' can be calculated from W'(N)=2000*sqrt(N/Neff).
  The results of the experiment are summarized in Table 3.  If ratings
  are well mixed it is enough to measure W'(N) for one match length in
  order to determine constant C.  Here W'(N) has been measured for
  three match lengths and every measurement leads to the same value of
  C within the accuracy of the measurement.  This fact is an empirical
  proof that the Skill function has the form presented in equation (4)
  and that one rating system can be designed for all match lengths.

    Table 3. Observations. Row "ave" gives the weighted average of each
    column, weighted by the number of observed match results.

    N  # of matches  Neff   W'     C     Dm'/D
         (x10**3)                       (C=1.1)
    1      20.0      1.6   1581   1.1     .79
    3      12.0      1.6   2739   1.19   1.15
    5       8.6      2.1   3086   1.13   1.23
   ave                     2242   1.13    .99
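
  The W' column of Table 3 and its weighted average can be recomputed
  with the sketch below.  The Neff values and match counts are copied
  from the table; the names are hypothetical.  Equation (14) applies
  only to N>1, so the constant for N=1 (read from Table 2 in the text)
  is left out, and small rounding differences from the tabulated C
  values are possible.

```python
import math

W = 2000.0
# (N, number of matches in thousands, Neff), copied from Table 3.
rows = [(1, 20.0, 1.6), (3, 12.0, 1.6), (5, 8.6, 2.1)]

def width_from_neff(n, neff):
    """W'(N) = 2000*sqrt(N/Neff)."""
    return W * math.sqrt(n / neff)

def c_from_width(n, w_measured):
    """Eq (14), valid for N>1."""
    return ((W / w_measured) * n - 1.0) / (0.5 * (n - 1))

widths = {n: width_from_neff(n, neff) for n, _, neff in rows}
total = sum(m for _, m, _ in rows)
w_ave = sum(m * widths[n] for n, m, _ in rows) / total

for n, _, _ in rows:
    c_text = f"{c_from_width(n, widths[n]):.2f}" if n > 1 else "n/a"
    print(f"N={n}  W'={widths[n]:.0f}  C={c_text}")
print(f"weighted average W' = {w_ave:.0f}")
```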

o Note 1
  The weighted average of Dm'/D over all match lengths is about one.  If
  we use C<1.1 the average is smaller than one and if C>1.1 the average
  is bigger than one.

o Note 2.
  The ratio Dm'/D can be used to estimate players' true ratings.  Assume
  that we have two players with ELO=1900 playing on the system described
  in experiment 2.  One of these players plays only one point matches.
  His true rating can be estimated as (1900 - 1500)*.79 + 1500 = 1816.
  The other player plays only 5 point matches and his true rating would
  be (1900 - 1500)*1.23 + 1500 = 1992 (C=1.1).  So if you want to be the
  top rated player it's not sufficient to be the best player; you also
  need to know how the rating system works.
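
  A minimal sketch of this correction, assuming equation (10) solved for
  Dm'/D and the W' values of Table 3.  The function names are
  hypothetical; the mean rating 1500 and W=2000 are the values quoted in
  section 1.

```python
import math

W = 2000.0
MEAN = 1500.0

def err_fibs(n, c):
    """Eq (11): Err(N) for the FIBS formula S'(N)=N and true constant c."""
    return math.sqrt((1.0 + c * (n - 1) / 2.0) / n)

def dm_over_d(n, w_measured, c=1.1):
    """Eq (10) solved for the ratio Dm'/D = Err(N) * W'(N) / W."""
    return err_fibs(n, c) * w_measured / W

def true_rating(rating, ratio):
    """Rescale the distance from the mean rating by the ratio Dm'/D."""
    return (rating - MEAN) * ratio + MEAN

for n, w_measured in ((1, 1581.0), (3, 2739.0), (5, 3086.0)):
    ratio = round(dm_over_d(n, w_measured), 2)
    print(n, ratio, round(true_rating(1900.0, ratio)))
```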

o Note 3.
  I have estimated my FIBS ELO rating differences against JellyFish to
  be as follows:  N=1 -> 210, N=3 -> 176 and N=5 -> 165 (ref 11).
  These ratings can also be corrected using the measured ratio Dm'/D!
  The rating for one point matches is correct by definition (no ratings
  mixing here).  Using C=1.1 the true rating difference for 3 point
  matches would be 176*1.15 = 202 (209) and for 5 point matches
  165*1.23 = 203 (208).  The values in parentheses are obtained by using
  the match equity table JF-mrn (ref 11) and equation (1), where the
  Skill function eq(4) with constant C=1.1 is used.

3. Summary
----------

Three completely different approaches have been used to determine the
Skill function constant C (eq 4).  All approaches lead to a constant C
in the interval 1.1-1.4, see Table 4, while the FIBS formula uses C=2.
Experimental data has been used in two different ways to fix constant C:
method 1) average number of games/match and method 2) data of match
results and ELO ratings.  The first method gives C=1.35 and the latter
1.1<C<1.2.  The difference between these two values can be explained by
the fact that method 1) does not take into account the obvious
differences between one point and N>1 point matches (cube and gammon
factor), while method 2) covers these differences.  The drawback of
method 2) is that it relies on data obtained with a faulty rating
system.  This makes the analysis more complicated, and moreover it may
even be that constant C cannot be determined accurately from the
available data.  I think that after the first correction there will
still be a need for some fine tuning to reach the "correct" C value.
Anyway, I think that whatever value is picked from the range 1.1-1.4,
the resulting rating system would be superior to the one currently in
use, compare Tables 2 and 4.

    Table 4. Expected accuracy of the corrected rating system, i.e.
    function Err(N). C=1.2 has been chosen as the correct Skill function
    constant.
    On the right are shown the C values obtained by the different
    methods: 1) Continuous zero volatility game (ch 2.1)
             2) Average number of games (ch 2.2)
             3) Data of match results (ch 2.3)
             4) Game statistic JF-mrn (ref 11)

     C \ N   1     3     5     11     21     C \ Method
                                                 1 2 3 4
    1.0     1.00  1.05  1.06  1.08   1.09    1.0
    1.1     1.00  1.02  1.03  1.04   1.04    1.1 x   x x
    1.2     1.00  1.00  1.00  1.00   1.00    1.2 x   x
    1.3     1.00   .98   .97   .97    .96    1.3 x
    1.4     1.00   .96   .95   .94    .93    1.4   x
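
The entries of Table 4 follow from equation (8) with both Skill
functions of the form of equation (4).  A short Python sketch, assuming
C=1.2 is the true constant as in the table (function names are
hypothetical):

```python
import math

def skill(n, c):
    """Eq (4): opportunity for skill in an n point match."""
    return 1.0 + c * (n - 1) / 2.0

def err(n, c_true, c_used):
    """Eq (8): sqrt(S/S') with S from c_true and S' from c_used."""
    return math.sqrt(skill(n, c_true) / skill(n, c_used))

# c_used = 2.0 corresponds to the current FIBS formula S'(N)=N.
for c_used in (1.0, 1.1, 1.2, 1.3, 1.4, 2.0):
    values = [round(err(n, 1.2, c_used), 2) for n in (1, 3, 5, 11, 21)]
    print(c_used, values)
```

The closer the values stay to 1 across match lengths, the more
consistently the rating system treats different match lengths.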

If the FIBS rating system is corrected by choosing a new Skill function
with C=1.2, it might be a good idea to also change the class width value
W of the rating system so that the change in the players' ELO
distribution is minimized.  The class width W could be chosen, for
example, so that the current "11 point rating system" remains intact,
i.e. W=2500.  Experiment 2 suggests a somewhat lower value,
W=<W'>=2250.  Constant M I would leave as it is.  Note that the class
width of the ELO distribution should remain equal to the one in the old
rating system if we change W correctly.  The new rating system could be
implemented side by side with the old one so that we would have a direct
comparison between the rating systems.

References
----------
1) ELO ranking
   http://www.netgammon.com/us/facts/elo2.htm
2) "Derivation of backgammon Skill function" by M.Rinta-Nikkola
   http://www.deja.com/[ST_rn=ap]/getdoc.xp?AN=419254506&fmt=text
   http://www.deja.com/[ST_rn=ap]/getdoc.xp?AN=419293370&fmt=text
3) FIBS--Rating Formula  Different length matches by Tom Keith
   http://www.bkgm.com/rgb/rgb.cgi?view+523
4) "Constructing a ratings system" by M.Rinta-Nikkola
   http://www.bkgm.com/rgb/rgb.cgi?view+621
5) Cubeful distribution by Roland Sutter
   http://www.deja.com/getdoc.xp?AN=491955947&fmt=text
6) "Doubling in money game: drop, take or beaver" by M.Rinta-Nikkola
   http://www.deja.com/[ST_rn=qs]/getdoc.xp?AN=464753854&fmt=text
7) Big_Brother match archive
   http://www.bkgm.com/rgb/rgb.cgi?menu+matcharchives
8) FIBS--Rating Formula:  Emperical analysis by Gary Wong
   http://www.bkgm.com/rgb/rgb.cgi?view+601
9) Rating distributions of bg-servers by Daniel Murphy
   http://www.deja.com/getdoc.xp?AN=480105756&fmt=text
10) FIBS--Rating Formula:  Different length matches by Jim Williams
   http://www.bkgm.com/rgb/rgb.cgi?view+603
11) JF-mrn game statistics by M.Rinta-Nikkola
   http://www.deja.com/getdoc.xp?AN=503022535&fmt=text
 