Rollouts

How reliable are rollouts?

From: David Montgomery
Address: monty@cs.umd.edu
Date: 2 August 1999
Subject: Re: How Reliable are Rollouts? [Was: Difficult Crawford checker play]
Forum: rec.games.backgammon
Google: 7o4esc$b2m$1@krackle.cs.umd.edu

```
> What's the basis for having any confidence in a rollout?

The basis is: if we do a rollout long enough, and we play in the rollout
just the way we would play over the board, then eventually the rollout
results will converge to be arbitrarily close to the expected values.
(I'm going to ignore positions that diverge.)  It's just statistics.

A computer rollout is actually a completely reliable estimate of a
position's equity -- assuming that both sides are played by that same
computer program.

You can also look at the rollout as trying to simulate "perfect" play,
rather than "actual" play.  This doesn't change much.  The rollouts
are only a rough simulation of either one.

Doing a rollout long enough isn't hard anymore, because of the bots.

However, we can never have complete assurance that a rollout is played
the way that we would play over the board.  Most players' play selections
are somewhat random, and in any event each player is unique.

There is no general theoretical assurance that rollout results will
closely approximate results for any two human players.  In fact, there
are positions where the rollout results are known to be wildly different.

There is actually very little empirical evidence to support the idea
that computer rollouts are generally close to the expected results for
human players.  This is because it takes too long to gather any decent
data with human play.  The best evidence is probably from some Jellyfish
interactive rollouts, which use variance reduction to squeeze more
information out of manual rollouts.  But I don't know of any work where
someone has tried to show that the bot rollouts actually reflect the
results humans would get.

Despite this lack of solid evidence, there are very good reasons for
trusting computer rollouts most of the time.  Most importantly, we know
that the computer programs play very well.  Because they play well, we
expect their results to closely reflect the results between two strong
human players, most of the time.

A second major reason is due to the nature of backgammon itself.  Most
backgammon positions quickly engender a large number of variations.
After the first few rolls, there are a wide variety of different kinds
of problems.  A computer program's lack of complete understanding of a
certain  kind of problem that arises occasionally in a rollout won't
necessarily destroy the validity of the rollout, because that problem is
probably only a small fraction of the decisions that must be made.  And
although the program makes some mistakes on these problems, these are
likely to be offset to some degree by mistakes made when playing for the
opposing side.

> If you don't know which is the better strategy/tactic/move in a
> particular position, and you question Snowie's evaluation at 3-ply, you
> do a rollout.  Snowie proceeds to play out the position numerous times.
> However, why should only the specific position that you're investigating
> pose a difficult question?  On the very next roll, the very same, or a
> similar, strategical or tactical issue may be presented again.

This is true, and with these kinds of positions you should think more
carefully about whether you trust the rollout results.  If the same
thematic idea gets tested over and over again, then if the bot doesn't
understand it, the rollout is likely to be worthless.

> Or maybe an entirely different, but still difficult decision will have

This is less problematic, because once we have a variety of decisions,
difficult or otherwise, it is less likely that the bot will botch them
all.

The fact that there are difficult decisions, even (especially?) for the
bot, means that errors will be made in the rollout.  If the errors are
small, it isn't of great concern unless many of them accumulate.  Small
errors in the play will make only a small difference in the rollout
result.  If the errors are large, that can be a problem.

Many positions are of a nature that big errors are rare, simply because
most plays are very close in equity.  For example, bearing in and off
against contact from the bar.

Other positions are of a nature that although big errors are not so rare,
they occur for both sides.  In this case, they offset each other somewhat,
and the overall effect on the rollout is not so severe.

The real problem is when big errors are not rare, and they occur
predominantly for one side.  And in this case the rollouts won't be
reliable.  The most diagnosable situation like this is when one side
often makes a big error on its first turn.

> If you're not sure what the "best" move is to start out with, and you
> don't know whether Snowie is making the best decisions in subsequent
> positions, what's the basis for your confidence in the rollout?

The hope is that the bot is making almost all "good" moves, where "good"
may not necessarily be "best."

> In fact, in positions that involve anything more than racing, how do
> we *ever* have confidence that a rollout yields the "correct" play?

One important idea is that the bots are less likely to completely obscure
the big errors.  Let's take your example.  Say it is a terrible mistake
to run off the anchor, and yet the bot likes it.  Now, if you roll out
running and not-running, in the not-running variation the bot is likely
to run on its next turn, obscuring the difference between the two thematic
approaches.  However, if running is a big enough error, then there will
still tend to be some difference due to the first play.

And in general, the bigger the error the bigger the difference that will
show up, other things being equal.  For most positions you can have a lot
of faith in a rollout that produces a large difference.  Rollouts that
generate a small difference are much less reliable, but also less
important.

> I don't think it's any answer that we can have confidence in the rollout
> because Snowie has proven over time that it's a good BG player.  The
> same argument can be used to justify Snowie's decisions at 3-ply.  Yet
> when we question a 3-ply decision by Snowie, we do a rollout on Snowie.
> It seems rather circular.

The difference is, if you do a long rollout, then you see what the equity
*is* (to within some statistical uncertainty) *assuming that the bot plays
the position*.  With an evaluation, you just have the opinion of a very
strong player.  With a rollout you have the results of thousands of actual
games.

The rollout is the answer to the question: "What is the equity in this
position if the bot plays both sides?"  The question you then have to ask
yourself is whether the answer to this question is close enough to the
answer to your real question, which is probably something like: "What is my
equity in this position against the people I tend to play against?"

For strong players playing other strong players, these questions will
usually have similar answers, so the expert can often rely on the rollout.
However, experts generally look at rollout results with a somewhat critical
eye, and if the results don't seem right, then they will consider reasons
why the rollout might be wrong.  (They will also consider reasons why their
own understanding of the position might be wrong.)

> Frankly, I have the very same question about rollouts that are done by
> humans.  If an expert is not sure of the correct strategy in a
> particular position, how can he do an effective rollout if subsequent
> positions keep presenting similar strategic decisions?

All the same problems occur with human rollouts.  Humans have the advantage
that they can learn.  They have several disadvantages, too.  The biggest
is that they are too slow.

--
David Montgomery                   Beltway Backgammon Club
davidmontgomery@netzero.net        Washington DC area BG Tournaments
monty on FIBS and GG               www.cs.umd.edu/~monty/bbc.htm
```

### Rollouts

Cautionary tale  (Kit Woolsey, Sept 1995)
Combining rollouts  (Gregg Cattanach+, Dec 2003)
Confidence intervals  (Bob Koca, Nov 2010)
Confidence intervals  (Timothy Chow, May 2010)
Confidence intervals  (Gerry Tesauro, Feb 1994)
Cubeless vs centered-cube rollouts  (Ron Karr, Dec 1997)
Duplicate dice  (David Montgomery, June 1998)
How reliable are rollouts?  (David Montgomery, Aug 1999)
Level-5 versus level-6 rollouts  (Michael J. Zehr, June 1998)
Level-5 versus level-6 rollouts  (Chuck Bower, Aug 1997)
Positions with inaccurate rollouts  (Douglas Zare, Oct 2002)
Reporting results of rollouts  (David Montgomery, June 1995)
Rollout settings  (Lokicol+, Apr 2010)
Settlement limit  (Michael J. Zehr, Apr 1998)
Settlement limit  (Kit Woolsey, Dec 1997)
Settlement limit in races  (Alexander Nitschke, Dec 1997)
Some guidelines  (Kit Woolsey, Apr 1996)
Standard error and JSD  (rambiz+, Feb 2011)
Standard error and JSD  (Stick+, Oct 2007)
Systematic error  (Chuck Bower, Oct 1996)
Tips for doing rollouts  (Douglas Zare, June 2002)
Truncated rollouts  (Gregg Cattanach, Oct 2002)
Truncated rollouts: pros and cons  (Jason Lee+, Jan 2006)
What is a rollout?  (Gregg Cattanach, Dec 1999)