Variance reduction

From:   Oliver Riordan
Address:   initial.surname@dpmms.cam.ac.uk
Date:   21 July 2003
Subject:   Bot comparison: money play or matches?
Forum:   rec.games.backgammon
Google:   3F1B2FD0.6080107@dpmms.cam.ac.uk

Warning: this and the next couple of posts are written from the point
of view of a mathematician. The overall aim is to point out a
technical problem with the way variance reduction is done to decide
which of two bots is stronger and by how much.  This problem is
unlikely to be important in practice, but is a problem if you want a
rigorous statistical result. My explanation of the proposed solution
(and perhaps of the problem) may well be comprehensible by, and of
interest to, only `boffins'.


Recently, most of the posts about comparing bots have concentrated on
money play, running long sessions to estimate one bot's expectation
against the other.  But if you want a definite result, there is a
problem with this, due to very large cubes.  As is fairly well known,
the expectation may not even be defined, and it is even more likely
that the variance is infinite.

This may sound like a purely theoretical problem, but it could well be
a practical one.  For example (and I don't know the details), an
earlier version of Jellyfish supposedly misunderstood some
backgame positions so badly that it would double from behind.  In any
case, it's easy to imagine that there are very rare positions where
one of the bots doubles from behind, in a situation which is more or
less stable for several moves. If the other bot redoubles in this kind
of position, very rare large cubes can result. (I've known a human to
double when closed out, with his other 14 men on the 1 and 2 points, I
suppose because he was `ahead in the race'; if he'd had the courage
of his convictions, a very large cube could have resulted
while I was bringing my spares round the board.)

For good bots, such a position is obviously unusual, but how unusual,
and how do we know? If you have in total 5000 games between the bots,
say, and this hasn't occurred, then you can say it probably doesn't
happen, on average, more often than once in 1500 games:
if it occurred once in 1500 games in the very long run, there would
be a 96.4% chance of seeing it in 5000 games (.964 = 1-exp(-5000/1500)).
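
As a quick check of that 96.4% figure, here is a small Python sketch
using the same exponential approximation (the numbers are just the
ones from the paragraph above):

    import math

    # If a rare position occurs on average once every 1500 games, the number
    # of occurrences in 5000 games is roughly Poisson with mean 5000/1500,
    # so the chance of seeing it at least once is:
    p_at_least_once = 1 - math.exp(-5000 / 1500)
    print(round(p_at_least_once, 3))   # 0.964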

When it does happen, how high does the cube get? It could easily be
high enough to dominate other equity differences between the bots, and
certainly to throw off any variance estimate by a significant amount.

Ok - you might say, we'll just limit the cube, to 16 say, if both bots
have settings to play with limited cubes. Or you could say, we'll let
them play, but estimate `expected result in games without very large
cubes', if you're happy with that instead. But this isn't very
satisfactory: a bot that mostly wins a bit, but very rarely lets you
take 1024 points off it, is not a good money player in the long run!

But, when it comes to variance reduction, there is still a problem:
you can't really tell what the variance is, because the rare large
cubes may bump it up, and you don't know how often they happen. E.g.,
a 16 point result every 1500 games adds more than .1 to the variance,
which (I guess from a recent post of Zare) is comparable with the
apparent variance after reduction.
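
A one-line check of that figure (assuming one 16 point result per 1500
games and nothing else unusual):

    # Variance contribution of one 16 point result every 1500 games:
    print(16 ** 2 / 1500)   # about 0.17, i.e. more than .1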

Match play has a big advantage: the results are bounded: calling a
win/loss +1/-1, we know the variance is at most 1. This makes it much
easier (possible) to formulate rigorous statistical tests to show with
a certain confidence that one bot is better than the other.

---

Ok, so for rigorous comparisons between bots it seems a good idea to
look at match play.  But variance reduction is still needed: I'm going
to work in units of %mwc (match winning chance), and to subtract 50,
so all results will be changes in %mwc from the initial position.  In
these terms, the result of a match from the first player's point of
view is +/- 50.  Thus, the unreduced variance of a single result is
close to 2500 and standard deviation (sd) to 50%mwc.  (I'll keep
talking about variance estimates, but quoting sd=sqrt(variance) for
the numbers, as these are more informative.)

The difference between two good bots could well be around 1% in a
short match; to have a chance of detecting such a difference we need
to get the sd down to .6%mwc say (allowing 95% confidence as a
satisfactory result), which takes (50/.6)^2 > 6900 matches.  To be
sure of detecting it we need quite a few more. (The confidence
interval with width .6% either side of the centre might happen to end
up with its centre .6% away from the true value, so could still
include 0 if the true value is 1.2%.)
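
The sample-size arithmetic above as a tiny Python sketch (nothing more
than the (50/.6)^2 calculation):

    import math

    # To get the standard deviation of the mean down from about 50 %mwc per
    # match to about 0.6 %mwc we need n with 50/sqrt(n) <= 0.6:
    n_needed = math.ceil((50 / 0.6) ** 2)
    print(n_needed)   # 6945, i.e. more than 6900 matches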

So let's use luck analysis to reduce the variance. This bit is
standard, but to briefly recap why this is fair, and because we'll
modify it later:

Let X be the result of a match, so X is a random variable which is +/-
50.  We want to estimate the average or expectation E(X), which is the
first player's edge in the long run. We can do this indirectly as
follows: let Y be any random variable known to have expectation
0. Then E(X-Y)=E(X)-E(Y)=E(X), so estimate E(X-Y) instead. This helps
if X-Y has small variance. A good choice of Y is any reasonable
unbiased luck estimate, formed by, before each roll, deciding an
estimated equity for each possible roll, and taking the luck of the
actual roll as the equity of the actual roll minus the average.
Because we take real minus average over all rolls, Y is guaranteed to
have expectation 0, no matter which bot calculates it.
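
To illustrate why subtracting a mean-zero Y helps, here is a toy
Python sketch (an invented model, not real backgammon luck analysis):
X is an observed result made of a true edge plus `luck' plus noise,
and Y is an unbiased estimate of the luck.

    import random

    def toy_match():
        luck = random.gauss(0.0, 40.0)    # mean-zero 'luck', in %mwc
        true_edge = -1.0                  # the (unknown) quantity we want, E(X)
        noise = random.gauss(0.0, 10.0)   # whatever the luck estimate misses
        x = true_edge + luck + noise      # observed result
        y = luck                          # luck estimate with E(Y) = 0
        return x, y

    random.seed(0)
    samples = [toy_match() for _ in range(5000)]
    for name, vals in (("raw X ", [x for x, y in samples]),
                       ("X - Y ", [x - y for x, y in samples])):
        m = sum(vals) / len(vals)
        sd = (sum((v - m) ** 2 for v in vals) / (len(vals) - 1)) ** 0.5
        print(name, round(m, 2), round(sd, 2))

    # Both means estimate the same true edge, but X - Y has a much smaller
    # standard deviation, so its mean converges much faster.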

(Aside: the `luck adjusted results' reported by gnu when you analyse
your matches against it, say, are obtained in this way. If you get 32%,
it doesn't really mean you had a 32% chance of winning. The result is an
_unbiased estimator_ of your chance of winning: if you play many
matches (and play in the same way!), the average `luck adjusted
result' converges to your chance of winning.  Of course the actual
result is also an unbiased estimator of your chance of winning, but
with larger variance; the average of the luck adjusted result will
converge faster.)

Here are some numbers calculated this way using gnubg (0.12, but it
doesn't really matter) for the variance reduction of the Jellyfish
light 3.5 vs Snowie 3.2 matches posted on Tony Lezard's web page. I've
used both 0-ply and 1-ply variance reduction.

Variance reduced results from Jellyfish's point of view.

      #matches  #moves   0-ply: sd  sd/sqrt(n-1)  interval        1-ply: sd  sd/sqrt(n-1)  interval
 1pt    1000    56593        10.77      0.34      [-1.54,-0.42]       9.09       0.29      [-1.35,-0.40]
 3pt     100     9175        11.95      1.20      [-3.18,0.77]       11.39       1.14      [-3.37,0.40]
 5pt     100    15770        15.08      1.52      [-5.00,-0.02]      14.20       1.43      [-5.11,-0.41]
 7pt     200    41468        12.45      0.88      [-3.65,-0.74]      11.68       0.83      [-3.58,-0.85]
25pt     100    68159        12.95      1.30      [-8.14,-3.86]      12.32       1.24      [-8.49,-4.42]

All numbers in %mwc-50, so, for example, the end of the first line
suggests that Jellyfish has at most 50+(-0.40) = 49.6% chance against
Snowie 3 in a 1pt match. The intervals have 5% tails either side; this
was chosen because in this case it's sensible to look just at the
upper end, since it was already believed that Snowie is stronger.
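
For what it's worth, the intervals are consistent with a plain normal
approximation with 5% tails either side (z about 1.645); here is a
sketch reproducing the 1pt 0-ply interval, taking the sample mean of
-0.98 %mwc from the centre of the quoted interval:

    import math

    def interval(mean, sd, n, z=1.645):   # 5% tails either side
        half = z * sd / math.sqrt(n - 1)
        return (mean - half, mean + half)

    lo, hi = interval(mean=-0.98, sd=10.77, n=1000)
    print(round(lo, 2), round(hi, 2))   # close to [-1.54,-0.42]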

(Note that 1-ply luck analysis is better than 0-ply, but not by very
much: it's probably still worth doing, as it's much quicker than
running the actual games, and does reduce the variance a bit.)

Note also that longer matches seem as good as, or maybe slightly
better than, shorter ones for discriminating - the same number of
moves seems to be needed to get a (non-rigorously) significant
result.  Whether this is actually true
depends on where in the intervals the real results are. I find it
slightly surprising, as the games in long matches have different
importances - rising near the end of a close match - and I'd guess
games with equal importance give better results.

The confidence interval results are non-rigorous for the following
reason: the variances are estimates from the data, and could be
wrong. Also, for these numbers of matches, the distribution of the
mean does _not_ converge very well to a normal distribution.
(It is true that the centres of the intervals are unbiased estimators
of the true figures; we just don't know their variance.)

Just looking at the table, one can see that the standard deviations
are not very reliable: e.g., it's unlikely that the real number for
5pt matches is higher than for both 3pt and 7pt. Looking at the X-Y
numbers, with 0-ply reduction there's an outlier of -74 (all others in
range [-30,35]), which on its own contributes 2 to the standard
deviation. There isn't enough data to tell whether such outliers have
frequency 1 in 100 or 1 in 400 or 1 in 200, say, so the standard
deviation is uncertain.

While these and similar results are not rigorously justifiable, I do
believe they are correct, at least after modest widening of the
confidence intervals.

---

So, now we come to the rigorous part: X is the result of a match (of
a certain length) between the bots, so X is +/-50%mwc with certain
probabilities, and the expectation E(X) is what
we wish to calculate. We also have a way of estimating the luck Y in a
match which is unbiased, and having done variance reduction, we have n
independent samples from the distribution Z=X-Y.  How does one make a
valid confidence interval for the expectation E(Z)=E(X)?

Just using a central limit approximation with the estimated variance
of Z is not good enough: (a) the estimated variance could be out by a
fair bit, and (b) for smallish numbers of matches, the convergence to
a normal distribution is not good enough - at least we don't know it
is without some assumptions on the distribution.

Actually, things are much worse: although X is bounded, X-Y need not
be, and might have infinite variance! (Actually, it can't be infinite,
because of results about matches terminating, but it could be very
large.)  It could be that the analysis program misunderstands some
position, and thinks one bot throws a joker when it throws an average
roll, so the luck estimate is way off the real luck. Or it could be
that one bot does throw a joker but doesn't realize it, and plays a
neutral move.  Either of these could happen many times in a row. So
we'd like to modify Y so that it's bounded: here's how we do it.

Let's fix a bound on Y in advance, say +/-70%mwc. (This bound should
be at least a bit more than 50%, because the real luck in a match
between good players is near 50%.) Before each roll, decide on luck
figures for each possible roll as before. If none of the possible
figures added to Y (or rather, the running total luck estimate which
at the end of the match will give us Y) takes Y out of range, add the
figure for the actual roll.  Otherwise do nothing. (Or better, add a
multiple of the actual figure, chosen not depending on the actual
roll, so that for any possible roll, the result is in range.)

This way, Z=X-Y is guaranteed to be bounded, by +/-120%mwc, say.
Since at each stage the increment we add to Y has mean zero (it is the
old mean zero increment multiplied by a factor not depending on the
actual roll), the modified Y still has mean zero, so we still have
E(Z)=E(X).
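
Here is a sketch of the bounded luck accumulator just described, in
the `add a multiple' variant (illustrative only; the per-roll luck
figures would come from the analysis program, and are assumed to
average to zero over the possible rolls):

    def add_luck(total, luck_by_roll, actual_roll, bound=70.0):
        # luck_by_roll maps each possible roll to its luck figure (equity of
        # that roll minus the average over all rolls), so the figures have
        # mean zero.
        max_abs = max(abs(l) for l in luck_by_roll.values())
        if abs(total) + max_abs <= bound or max_abs == 0:
            scale = 1.0   # every possible roll keeps the total in range
        else:
            # Largest factor, not depending on the actual roll, for which
            # every possible roll keeps the total inside [-bound, bound].
            scale = max(0.0, (bound - abs(total)) / max_abs)
        return total + scale * luck_by_roll[actual_roll]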

There's still a problem estimating the variance: large Z values may
seem unlikely, but the following definitely can't be ruled
out. Suppose your total sample is n matches. Maybe, every n/3 matches,
one bot, `the fish', gets confused and throws away an almost won
position. In this case, even if the luck estimate is good, the fish
gets a result of -50% when it was lucky enough to deserve a result of
+50%, so Z is around -100%.  Thus, even though Z may seem to have a nice
concentrated distribution, we can't rule out that really it takes
values near the maximum/minimum once every n/3 times.

(Here n/3 comes in because exp(-3) = 4.98%, which is a non-negligible
probability, so even having observed no +/-100s in n matches, this
isn't sufficient evidence that they don't occur once every n/3 matches.)

So how do we get a confidence interval? I decided to construct a
statistical test valid for any distribution on a given interval [-M,M]
(here M=120%mwc). (Douglas Zare/anyone else: do you know if anyone
else has done this for bg/ if there's a standard method?)  Knowing
nothing about the distribution, we use the `order statistics'.

Say Z_1,...,Z_n are the n samples from Z=X-Y sorted in increasing order.
Let's choose increasing probabilities p_1,...,p_n. Suppose that
the following hypothesis H holds:

  H: at least a fraction p_i of the (unknown) distribution of Z is <= Z_i.

This gives a step function lower bound on the distribution function of
Z, so the mean of Z is at most
  B = p_1*Z_1+sum(Z_i*(p_i-p_{i-1}),i=2..n)+M*(1-p_n)
- the highest mean distribution satisfying the constraint has weight
p_1 at Z_1, p_i-p_{i-1} at Z_i, and 1-p_n at M.
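
As a sketch (assuming the sorted samples and the p_i are already
supplied), the bound B is just:

    def upper_bound(z_sorted, p, M=120.0):
        # z_sorted: Z_1 <= ... <= Z_n;  p: increasing probabilities p_1..p_n.
        b = p[0] * z_sorted[0]
        for i in range(1, len(z_sorted)):
            b += z_sorted[i] * (p[i] - p[i - 1])
        return b + M * (1 - p[-1])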

How do we choose the p_i? Well, the probability that H holds
essentially doesn't depend on the unknown distribution: it's always at
least the probability that U_i>=p_i where U_1,...,U_n are sorted
independent samples from a uniform distribution on the interval
[0,1]. (It's exactly equal to this probability if Z has a continuous
distribution.) So we can choose any p_i for which the latter
probability is 95%, say. What's a good choice?  Depends on the
distribution of Z: if we knew this, we could optimize the expectation,
say, of the bound B. So what we do in practice is take some of the
data (e.g., 100 1pt matches) and use that to guess (by linear
interpolation, say) a distribution function for Z. Then we construct a
test that works well for this distribution, and apply it to the
remaining data. The optimization process is simple in theory, but not
so easy in practice. Alex Selby and I spent some time programming this
and getting it to converge reasonably well. (The last part was mostly
Alex.)
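
The coverage probability can be checked by simulation; here is a
minimal sketch (the p_i below are arbitrary placeholders, not the
optimized values used for the real test):

    import random

    def coverage(p, trials=50_000, seed=1):
        # Estimate P(U_i >= p_i for all i) for sorted uniform samples
        # U_1 <= ... <= U_n; for continuous Z this is the probability
        # that the hypothesis H holds.
        rng = random.Random(seed)
        n = len(p)
        hits = 0
        for _ in range(trials):
            u = sorted(rng.random() for _ in range(n))
            if all(u[i] >= p[i] for i in range(n)):
                hits += 1
        return hits / trials

    n = 20
    p = [max(0.0, (i + 1) / n - 0.3) for i in range(n)]   # placeholder choice
    print(coverage(p))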

The description above gives a method for calculating one end (the
upper one) of the confidence interval; obvious modifications give the
lower end.

Results - all match lengths together.

As the primary rigorous test, I decided to merge all the data, as it
seemed likely to be borderline whether there is enough data to produce
an interval excluding zero. A fussy person might notice that the
procedure requires independent samples from a single distribution, not
a mixture. This was taken into account (choosing a total of 1172
matches, and for each match independently choosing the length to be
1,3,5,7,25 pts with probabilities 10,1,1,2,1 over 15, and hoping that
in the data set on Tony Lezard's page there were enough matches of
each length).
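
A sketch of the length-randomization step (just the sampling, with the
weights given above; the seed is arbitrary):

    import random

    rng = random.Random(2003)
    lengths = rng.choices([1, 3, 5, 7, 25], weights=[10, 1, 1, 2, 1], k=1172)
    # The counts will usually fit inside the available 1000/100/100/200/100
    # matches of each length in the data set.
    print({m: lengths.count(m) for m in (1, 3, 5, 7, 25)})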

Final rigorous result: for a match length chosen at random as above, a
90% confidence interval for Jellyfish's edge is [-3.67%mwc,0.55%mwc].
This used a bound on Y of 70%mwc, obtained by optimizing the
expectation of the upper end of the interval, using a distribution for
Z guessed from the data not used in the actual test.

In other words (assuming the non-rigorous results are correct, which I
believe) there isn't quite enough data to show rigorously by our
method that Snowie 3.2 is better than Jellyfish 3.5.

Results - individual match lengths

Here are the results for the individual match lengths. Note that the
self-contained procedure of using 10%, say, of the data to estimate
the distribution to optimize for only makes much sense for the 1pt
matches. For the longer matches, the distribution for each
match length has been used to optimize for the next one.

  length #matches #moves   bound on Y used   90% confidence interval
   1pt     1000   56593      60                [-3.47,1.41]
   3pt      100    9175      56                [-7.83,7.63]
   5pt      100   15770      53                [-16.7,5.40]
   7pt      200   41468      40                [-11.0,9.22]
(7pt       200   41468      55                [-9.47,4.58])
  25pt      100   68159      60                [-15.5,2.07]

For 7pt, 40 is a stupid value suggested by optimizing using the data
for 5pt matches.  55 is a guess at a better value.

(Here are the 1-ply non-rigorous results again:
1pt   1000   56593        [-1.35,-0.40]
3pt    100    9175        [-3.37,0.40]
5pt    100   15770        [-5.11,-0.41]
7pt    200   41468        [-3.58,-0.85]
25pt   100   68159        [-8.49,-4.42]
)

Note that for these rigorous tests, short matches seem to be better:
basically, as discussed above, there's an unavoidable 300/n
contribution to each end of the confidence interval, from the
possibility of maximum/minimum results that we happen not to see. This
compares with the sd/sqrt(n) contribution which occurs in both the
rigorous and simple (ordinary CLT) methods. For few long matches, the
first term dominates. Also (a much more minor point), for many short
matches we have a better guess at the distribution to start with. (The
variability and smallness of the bounds on Y for lengths other than 1
suggest to me that the guessed distribution I used was not very
reliable, and that some kind of a priori guess might have worked
better.)
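
A rough numerical comparison of the two contributions (taking the
~300/n figure above as given, and the 1-ply sd estimates from the
earlier table purely for illustration):

    # "Unseen extreme results" term vs an sd/sqrt(n) term, for the two
    # extreme cases in the tables above.
    for label, n, sd in (("1pt, n=1000", 1000, 9.09), ("25pt, n=100", 100, 12.32)):
        print(label, round(300 / n, 2), round(1.645 * sd / n ** 0.5, 2))
    # For few long matches the first term dominates.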

Finally, a theoretical point:

in the large n limit, the method here does not work very well: using a
central limit theorem (CLT) with error terms, for very large n one can
first get an upper bound on the variance with high confidence
(applying the CLT to the squares of the samples, using the fourth
power of M as a bound on their variance), which will be close to the
real variance. Then one can apply the CLT to the samples
themselves. Thus, for large n, the confidence interval will
asymptotically have the width (Phi^{-1}(.95)-Phi^{-1}(.05))*sd/sqrt(n)
that one expects.  Our method will give an asymptotically larger
width, by some constant factor.

Question: given an unknown distribution on [-M,M], construct a
confidence interval for the mean based on n samples in a way that
(ideally) is optimal if one has a correct guess for the distribution,
or at least works well for moderate n and is asymptotically optimal.

Oliver Riordan

P.S. If you wish to E-mail me directly about this, my address is
initial.surname@dpmms.cam.ac.uk
 