Sam Pottle wrote:
> I've been reading some of the papers on computer backgammon, and I have
> some questions for the neuralistas:
>
>  I found the following description of Berliner's blockading feature in
> his IJCAI '77 paper:
>
> > For each combination of zero to seven blockading points at a
> > distance of 1 to 12 spaces in front of a man, we computed the number
> > of rolls that could legally be played by the blockade runner.
>
> Is there a fuller description somewhere in the literature?
I haven't been able to find one in the literature. I got a description from
Gerry Tesauro.
You precompute a table of blockade strengths. A "blockade" is considered to
be the pattern of blocked points in a set of 12 consecutive points. I use a
table of 4096 bytes, where the strength of a blockade is encoded in a
single byte.
The strength of a blockade equals the number of rolls that cannot be played
fully. For example, if you just hold one point 10 pips away then 64 and
55 cannot be played fully, so the blockade strength is 3. If you added the
point 6 pips away, then 22, 33, 66, 42, and 51 could not be played
fully, so the blockade strengh is 3 + 7 = 10.
Quick suggestion: normalize these numbers by dividing by 36.0. Keeping all
your neural network inputs in a roughly comparable range improves learning
rates.
Once you have a table of blockade strengths, you can use it to compute two
important measures. One is the Actual blocking, which is the strongest
blockade that actually has an enemy man behind it. Another is the Potential
blockade, which is the strongest blockade of all.
Maybe you can come up with other uses of this table.
>  The trend in NN design for backgammon seems strongly toward nets with
> lots of inputs, and the Neurogammon paper has some nice comparative
> studies indicating that having various input features is lots better
> than not having them. But it seems to me that there is a tradeoff
> available here. If I have a network with 400 inputs and 150 neurons,
> that's 60,000 weights. I could use a really primitive board encoding
> to design a network with 60 inputs and 1,000 neurons that would
> evaluate just as fast. Does this really lose?
Well, the specific tradeoff you mentioned would lose because you cannot
provide a good state description for backgammon using only 60 inputs, but
you have a good question.
First off, the key insight of the Neurogammon project is that the accuracy
of the neural network increases when the inputs match crucial domain
concepts. For example, in backgammon it is logically sufficient simply to
provide an array of 26 integers as inputs. Unfortunately this does not
provide good discrimination because the quality of a position is a highly
nonlinear function of the input.
Gerry Tesauro's (and his coauthor's (Terry Sejnowski??)) insight is that
when you use boolean inputs for small numbers of men(e.g. one boolean to
represent having a White blot, one for 2 White men, etc.) and only use an
integer to represent "excess men" then the neural network learns much
faster and performs at a higher level.
Per point per color you have Blot, Point, Builder, and Excess inputs (some
representations even have an input for having exactly 4 men on a point, but
I haven't found this useful except for the 6, 12, 13, and 19 points) so you
have a total of 25 * 8 = 200 inputs. (You needn't encode the number of men
off the board, since that is a linear function of the others.)
Now to your question: a neural network with 200 inputs and 80 neurons has
16000 weights. If you used only 26 inputs you could have 615 neurons and
still have the same number of weights. How can this lose?
The key observation is that the 200 inputs are almost always zero, and when
an input is zero it contributes nothing to the calculation expense because
you do not have to propagate its weights. It follows that the 200 input
network is very much faster than the 26 input network.
It turns out that the correct equivalent number of neurons for the 26input
network is 80. Notice that in the group of 8 inputs that encode the
contents of a single point we have at most one nonzero input. And if a
point is empty then we have no nonzero inputs. It follows that the number
of nonzero inputs in the 200input encoding is exactly the same as the
number of nonzero inputs in the 26input encoding. So if you wished to be
equivalent in speed, the 26input network must have 80 neurons. (Actually a
little less, because there are also benefits to having inputs whose value
is 1.0, since that avoids a scalar multiplication in the hiddennode
computation.)
Of course, actually buidling such a 26input network would be a big step
backwards because the 200input network has a much richer input encoding.
Since boolean inputs are usually zero, it costs relatively little to add an
input, and relatively more to add a neuron. This suggests that adding
additional "feature recognizers" is pretty cheap.
Having said the foregoing, I must report that for quite some time I haven't
been able to increase the strength of my network by adding new inputs,
whereas I have been able to increase its strength by increasing the number
of neurons.
>  Speaking of network architecture, it seems like everybody's using a
> single hidden layer these days, although Tesauro found that two
> layers worked a bit better for Neurogammon. How come?
There are a lot of practical problems to multilayer networks. The backprop
training algorithms that works so well in theory do not work in practice
because the gradient for layers beyond the first is nearly zero. Having a
gradient that is so small means that there is almost no learning going on
in early levels. It follows that you have to "hack" your learning
algorithm, and that will suck up a lot of your time.
The common justification for multilayer networks is that they "respresent
richer features." This isn't true in theory, of course, since a
singlelayer network is a universal function approximator, but in practice
there may be functions that are easy to approximate using sigmoids of
signmoids that are hard to approximate using sigmoids alone.
But in backgammon you aren't likely to find such functions. The reason is
that the ability of a feature to discriminate between winning and losing
positions is a logistic function (i.e. A/(B + exp(C * x)). It follows that
a singlelayer function contains all the necessary richness.
From a computational perspective, instead of using deeper networks why not
double the number of neurons in a single layer? It will train faster and
probably play better. And if you need extra input features, why not code
them by hand?
>  I recall some mention in this forum of some theoretical studies
> suggesting that lambda should be roughly 1/branchingfactor. Does
> anyone have references for these?
This is in one of Sutton's papers. Check out his papers online at UMass's
web site.
By the way, I do not recommend using nonzero lambdas. Using lambda == 0
results in a much faster, simpler program and the training quality is very
nearly as good. After all, what difference is there between lambda ==
0.0278 and lambda == 0? Hardly any signal filters through when lambda ==
0.0278. Why not just make lambda zero, and take the benefits of simpler and
faster code?
Warm Regards,
Brian Sheppard
