Backgammon Programming

Programming

Neural net questions

From:   Brian Sheppard
Address:   bsheppard@my-dejanews.com
Date:   1 March 1999
Subject:   Re: Neurality
Forum:   rec.games.backgammon
Google:   7beftd$ngn$1@nnrp1.dejanews.com

Sam Pottle wrote: > I've been reading some of the papers on computer backgammon, and I have > some questions for the neuralistas: > > -- I found the following description of Berliner's blockading feature in > his IJCAI '77 paper: > > > For each combination of zero to seven blockading points at a > > distance of 1 to 12 spaces in front of a man, we computed the number > > of rolls that could legally be played by the blockade runner. > > Is there a fuller description somewhere in the literature? I haven't been able to find one in the literature. I got a description from Gerry Tesauro. You precompute a table of blockade strengths. A "blockade" is considered to be the pattern of blocked points in a set of 12 consecutive points. I use a table of 4096 bytes, where the strength of a blockade is encoded in a single byte. The strength of a blockade equals the number of rolls that cannot be played fully. For example, if you just hold one point 10 pips away then 6-4 and 5-5 cannot be played fully, so the blockade strength is 3. If you added the point 6 pips away, then 2-2, 3-3, 6-6, 4-2, and 5-1 could not be played fully, so the blockade strengh is 3 + 7 = 10. Quick suggestion: normalize these numbers by dividing by 36.0. Keeping all your neural network inputs in a roughly comparable range improves learning rates. Once you have a table of blockade strengths, you can use it to compute two important measures. One is the Actual blocking, which is the strongest blockade that actually has an enemy man behind it. Another is the Potential blockade, which is the strongest blockade of all. Maybe you can come up with other uses of this table. > -- The trend in NN design for backgammon seems strongly toward nets with > lots of inputs, and the Neurogammon paper has some nice comparative > studies indicating that having various input features is lots better > than not having them. But it seems to me that there is a tradeoff > available here. If I have a network with 400 inputs and 150 neurons, > that's 60,000 weights. I could use a really primitive board encoding > to design a network with 60 inputs and 1,000 neurons that would > evaluate just as fast. Does this really lose? Well, the specific tradeoff you mentioned would lose because you cannot provide a good state description for backgammon using only 60 inputs, but you have a good question. First off, the key insight of the Neurogammon project is that the accuracy of the neural network increases when the inputs match crucial domain concepts. For example, in backgammon it is logically sufficient simply to provide an array of 26 integers as inputs. Unfortunately this does not provide good discrimination because the quality of a position is a highly non-linear function of the input. Gerry Tesauro's (and his coauthor's (Terry Sejnowski??)) insight is that when you use boolean inputs for small numbers of men(e.g. one boolean to represent having a White blot, one for 2 White men, etc.) and only use an integer to represent "excess men" then the neural network learns much faster and performs at a higher level. Per point per color you have Blot, Point, Builder, and Excess inputs (some representations even have an input for having exactly 4 men on a point, but I haven't found this useful except for the 6, 12, 13, and 19 points) so you have a total of 25 * 8 = 200 inputs. (You needn't encode the number of men off the board, since that is a linear function of the others.) Now to your question: a neural network with 200 inputs and 80 neurons has 16000 weights. If you used only 26 inputs you could have 615 neurons and still have the same number of weights. How can this lose? The key observation is that the 200 inputs are almost always zero, and when an input is zero it contributes nothing to the calculation expense because you do not have to propagate its weights. It follows that the 200 input network is very much faster than the 26 input network. It turns out that the correct equivalent number of neurons for the 26-input network is 80. Notice that in the group of 8 inputs that encode the contents of a single point we have at most one non-zero input. And if a point is empty then we have no non-zero inputs. It follows that the number of non-zero inputs in the 200-input encoding is exactly the same as the number of non-zero inputs in the 26-input encoding. So if you wished to be equivalent in speed, the 26-input network must have 80 neurons. (Actually a little less, because there are also benefits to having inputs whose value is 1.0, since that avoids a scalar multiplication in the hidden-node computation.) Of course, actually buidling such a 26-input network would be a big step backwards because the 200-input network has a much richer input encoding. Since boolean inputs are usually zero, it costs relatively little to add an input, and relatively more to add a neuron. This suggests that adding additional "feature recognizers" is pretty cheap. Having said the foregoing, I must report that for quite some time I haven't been able to increase the strength of my network by adding new inputs, whereas I have been able to increase its strength by increasing the number of neurons. > -- Speaking of network architecture, it seems like everybody's using a > single hidden layer these days, although Tesauro found that two > layers worked a bit better for Neurogammon. How come? There are a lot of practical problems to multi-layer networks. The backprop training algorithms that works so well in theory do not work in practice because the gradient for layers beyond the first is nearly zero. Having a gradient that is so small means that there is almost no learning going on in early levels. It follows that you have to "hack" your learning algorithm, and that will suck up a lot of your time. The common justification for multilayer networks is that they "respresent richer features." This isn't true in theory, of course, since a single-layer network is a universal function approximator, but in practice there may be functions that are easy to approximate using sigmoids of signmoids that are hard to approximate using sigmoids alone. But in backgammon you aren't likely to find such functions. The reason is that the ability of a feature to discriminate between winning and losing positions is a logistic function (i.e. A/(B + exp(C * x)). It follows that a single-layer function contains all the necessary richness. From a computational perspective, instead of using deeper networks why not double the number of neurons in a single layer? It will train faster and probably play better. And if you need extra input features, why not code them by hand? > -- I recall some mention in this forum of some theoretical studies > suggesting that lambda should be roughly 1/branching-factor. Does > anyone have references for these? This is in one of Sutton's papers. Check out his papers online at UMass's web site. By the way, I do not recommend using non-zero lambdas. Using lambda == 0 results in a much faster, simpler program and the training quality is very nearly as good. After all, what difference is there between lambda == 0.0278 and lambda == 0? Hardly any signal filters through when lambda == 0.0278. Why not just make lambda zero, and take the benefits of simpler and faster code? Warm Regards, Brian Sheppard

Did you find the information in this article useful?

Do you have any comments you'd like to add?