Saturday, October 13, 2007

How I Spent My Summer

I decided to sign up for a undergraduate summer research project this last summer with Dr. H., my machine learning professor, who I felt like I had a pretty good rapport with. I'd been hoping to do research on something related to Natural Language Processing (NLP), which will be my graduate area of focus, but after some negotiation, Dr. H. and I settled on computer Go, specifically Monte Carlo techniques using Upper Confidence Bounds applied to Trees (UCT).

UCT is a refinement of the Monte Carlo techniques that have been around for years, and currently the best game in town when it comes to computer Go programs. The basic theory is pretty simple, but the details get ugly. In essence, the Monte Carlo theory goes as follows:

A move is chosen from a uniform random distribution of all available legal moves. From this board position, a large number of random playouts is done, playing both players until the game ends and a winner or loser can be evaluated. In theory, if the original move was a good move, then even if you play randomly from that point on, your odds of winning are better than if it had been a bad move. To evaluate those odds, you can randomly sample the game space below that position.

UCT builds on this basic theory by creating a tree which stores the current expected value of a given position in the game space, along with a measurement of confidence in that value. Then it explores those tree nodes which have the highest expected value at any given moment, ignoring nodes which don't seem to pay out well, or at least not visiting them often. It's a pruning technique which permits you to ignore large pieces of the game tree, which is valuable for games like Go, where the full game tree is astronomically large.

This is a general purpose AI technique for decision making that seems to work really well for a number of classes of problems. The papers that I was reading in relation to it reference the Multi-Armed Bandit problem, which is a classic explore/exploit tradeoff problem where you're trying to maximize your payout from a bank of slot machines with unknown pay-out distributions.

The implication of this for Go is subtle. It's obvious when you're dealing with slot machines that if you randomly sample them, the best machines will become clear over time, as your sample size gets larger. However, it's not as apparent for complex strategy games like Go that just randomly sampling moves will have the desired effect. (Or that having an opponent that plays randomly is a good evaluator of game position.) However, it does point up the importance of good position choice in Go. One good move really CAN improve your position for the rest of the game.
Post a Comment