Machines That Learn to Play Games
edited by Johannes Fürnkranz and Miroslav Kubat
Advances in Computation: Theory and Practice, Volume 8
Introduction   

Should Machines Learn How to Play Games?
Miroslav Kubat


1.1 Motivation

Excellence in playing games like chess has always been perceived as a reliable indicator of keen intelligence. While a few prodigiously talented adepts have achieved the highest ranks in their teens, most players will never make it beyond the expert level. The feats of masters are shrouded in mystery. Indeed, what is it that makes Kasparov capable of beating other grandmasters and triumphing in simultaneous play? Does he command some deeper understanding of the game that lets him see patterns hidden from other mortals? Does his brain just work faster, or does it also possess some vital connections that other people lack?

Scientists and technologists ask themselves how to emulate similar skills with software. Computers have never been meant to be mere number crunchers, and the idea that machines should be endowed with more human capabilities has always been around, with early attempts dating back to the mid-fifties. Now, at the core of any scientific undertaking is an experiment. In artificial intelligence, adequate testbeds are needed to demonstrate the performance of competing techniques and to validate hypotheses. Intuitively, a game seems quite appropriate to this end, at least for a certain class of tasks. Can my program beat yours? Then I declare it more "intelligent," whatever this obscure term means.

Sound practical reasons motivate systematic studies of algorithms for game playing. Ours is an age of advanced technology that calls for smart machines to deal with domains where the poor judgement of fallible humans can incur great losses. Fields of interest include job-shop scheduling in industrial plants, channel allocation in cellular telephony, command and control in battlefield theaters, and many others. Consider the task of dispatching elevators in a skyscraper where, at any given moment, dozens of cabins move up, down, or remain idle. Suppose a person on the twentieth floor pushes the call button, intending to lunch at the fancy restaurant at the top of the building. Which elevator should be dispatched in response to this call? Should it be the closest cabin among those that happen to be moving in the required direction? Should perhaps one of the idle cabins be preferred? Or is the one that makes the fewest stops more appropriate? In the long run, the strategy adopted by the decision maker should minimize waiting and transportation times, and avoid situations where one miserable cabin stops at every floor while the luckier ones move along without interruption.

The problem can be reformulated as a "game" where the dispatcher plays against a generator of passengers that compete for a set of elevators. At each moment, the state of the game is defined by the locations of the individual cabins, by their intended destinations and planned stops, and by the list of active calls. The decision maker's actions are confronted with new calls and perhaps with occasional removal of a cabin due to maintenance. An optimum strategy will correctly respond to the higher frequency of passengers at some locations and to the prevalence of certain destinations (e.g., the ground floor). It will address conflicting requirements while taking into consideration the nature of the involved probabilistic processes and the risks associated with each decision. All these characteristics, unknown at the beginning, may vary in time, precluding the existence of a permanently best strategy.

What this example purports to show is that certain decision-making problems can be cast into the game-playing scenario where two players take turns. A pushed call button must be answered by an action on the part of the dispatcher. The example also indicates that we are interested in discrete games, as opposed to some other tasks, such as driving a car, where a player has to act continuously, without being prodded by an opponent.

The generic testbeds provided by chess, backgammon, Go, and other games are particularly attractive to the scientific community. They have well-defined and widely known rules, a feature that facilitates performance comparisons of disparate techniques and theoretical frameworks. To be sure, more often than not, the real world will be much more complex than a game. But nothing is wrong with a methodology that starts with relatively simple problems and later tailors the developed algorithms to more difficult settings.

Apart from the application-driven interest of the practically minded, the mere fascination with the phenomenon of games is enough to nurture deeper interests of dreamers and visionaries alike. What can be more enticing, captivating, and seductive, than the challenge of writing programs that outsmart the best humans in their own fields of expertise? Curiosity conspires with technological demand. Studies of computer programs for game playing now represent a well-established discipline with its own conferences, workshops, journals, monographs, and, above all, outstanding proponents. The field has matured.

Given the richness of existing literature, why bother putting together yet another book? In fact, we have a very specific motivation. The mind-set that has dominated the history of computer game playing relies on straightforward exploitation of the available computing power. The fact that a machine can explore millions of variations sooner than the sluggish human can wink an eye has inspired hopes that the mystery of intelligence can be cracked, or at least side-stepped, by sheer force. Decades of the steadily growing strength of computer programs have attested to the soundness of this approach.

In the end, raw power did indeed prevail. In its celebrated encounter with Kasparov, IBM's DEEP BLUE explored billions of variations, a true triumph of silicon technology, and this staggering performance helped it fulfill one of the oldest promises of artificial intelligence, namely that a machine would one day defeat the chess world champion. However, the event was in some sense disquieting. Not many of those who prophesied it forty years ago expected that the feat would be accomplished with a program based on brute force. The belief that advanced knowledge representation and sophisticated logic would also have to be involved has always been simmering below the surface. No wonder, then, that the discovery that mere number crunching could become so powerful came as a surprise to many.

But look at the event also from the opposite perspective. Being no more than a human (even if a genius), Kasparov could hardly analyze more than a few dozen variations per move. Yet his playing strength was on a par with that of the machine! This amazing observation indicates that deeper understanding can cut the amount of necessary calculation by orders of magnitude. Experience acquired over many years of study and practical play made the champion see clues, in the chess-board patterns, that told him which variations deserved to be examined and which were mere blind alleys.

No extraordinary stretch of imagination is needed if we want to see the next logical step: what if we manage to endow the computer with pattern-recognition skills and with the ability to generalize? What if we marry Kasparov's positional insight with the machine's calculating power? The strength of the resulting programs will dwarf even the fabulous accomplishment of Deep Blue.

The potential technological benefits of such super-machines encourage us to probe further. The fundamental limitations of traditional approaches will show us the merits of machine learning.


1.2 Limitations of search

The essence of conventional implementations of game playing in computers is a search through the space of variations. An example will clarify the point better than formal definitions. Figure 1.1 illustrates a very simple game, tic-tac-toe, where two players take turns on a three-by-three board. One of them plays crosses and the other plays circles, placing them on carefully selected squares with the goal of creating a line of three marks of the same type, as shown in the right part of the picture, where the player with crosses has won. If nobody has obtained such a line by the time the board is full, the game is considered a draw.

In the terminology of artificial intelligence, each combination of crosses and circles represents a state; placing a cross or a circle on the board is called an action. At a given state, a player can usually choose from two or more different actions, seeking to increase his or her winning chances while impeding the ambitions of the opponent. For instance, if in the left part of Figure 1.1 the player with circles is to move, then the obvious choice is to place a circle in the bottom left corner. The opponent then cannot prevent a line of circles from being formed on the next move, either in the left column or in the bottom row. For the case where the other player is to move, the winning action is shown in the picture ("place a cross at the bottom left corner").

The most straightforward way to select the best move is to explore all possible consequences of any action that can be taken at the given state. Denote the two players by O (playing circles) and X (playing crosses) and suppose the game has reached some yet undecided state where O is to move. If one of the possible actions completes a line of circles, then the choice is trivial. If no action wins immediately, then O also has to look at X's responses to prevent an immediate loss. For example, by failing to place a circle at the bottom left corner in Figure 1.1, O would allow the opponent to create a line of crosses.


Figure 1.1: A player of tic-tac-toe wins by placing three marks in a line.


If none of O's actions leads to an immediate win or loss, then O has to look further into the future, considering all actions that are possible after each of X's responses, then looking again at X's responses to these actions, and so on. The analysis continues until the game ends. Player O thus carries out an exhaustive search in the space of variations, seeking a path---a sequence of O's actions and X's responses---that leads to a win regardless of X's behavior (provided such a path exists). Conversely, if some intermediate state is won for X, then O will try to refrain from all actions leading to this state.
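
To make the idea concrete, here is a minimal sketch of such an exhaustive search for three-by-three tic-tac-toe (the board encoding and helper names are our own illustration, not a prescription). It returns +1 if the side to move can force a win, -1 if it must lose against best play, and 0 if the game is a draw.

    # Exhaustive search of the tic-tac-toe game tree (illustrative sketch).
    # A board is a tuple of 9 characters: 'X', 'O', or ' ' (empty), indexed 0..8.
    LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

    def winner(board):
        """Return 'X' or 'O' if a completed line exists, else None."""
        for a, b, c in LINES:
            if board[a] != ' ' and board[a] == board[b] == board[c]:
                return board[a]
        return None

    def solve(board, player):
        """Value of the position for `player`, who is to move: +1 win, 0 draw, -1 loss."""
        opponent = 'O' if player == 'X' else 'X'
        if winner(board) == opponent:        # the previous move has already won
            return -1
        if ' ' not in board:                 # full board without a winner: draw
            return 0
        best = -1
        for i, square in enumerate(board):
            if square == ' ':
                child = board[:i] + (player,) + board[i + 1:]
                best = max(best, -solve(child, opponent))   # good for the opponent is bad for us
        return best

    print(solve((' ',) * 9, 'X'))   # prints 0: with perfect play, tic-tac-toe is a draw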

The catch is that there are just too many variations. At the beginning, the board is empty and O can choose from 9 different actions. Then, X selects one out of 8 different responses. Next, O has 7 different moves (because two squares are already occupied), each of them permitting 6 different responses. This continues until the board is full, giving rise to a total of 9! = 362,880 variations, a number that does not look outrageous for a computer carrying out millions of operations per second. But this is still a ridiculously simple example. Difficulties begin when the number of actions possible at each state is high. To decide on the best first move in tic-tac-toe on a board with 100 by 100 squares, the player would have to go through 10,000! variations, a combinatorial explosion that defies imagination.
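
The growth of these numbers is easy to check; the following few lines (a mere illustration) compute the 9! figure exactly and estimate the order of magnitude of 10,000! from the logarithm of the gamma function.

    import math

    print(math.factorial(9))            # 362880 complete move sequences on the 3x3 board

    # 10,000! is far too large to print in full; its base-10 exponent can be
    # obtained from log10(10000!) = lgamma(10001) / ln(10).
    exponent = math.lgamma(10001) / math.log(10)
    print(f"10,000! is roughly 10^{exponent:.0f}")   # about 10^35659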

Fairness forces us to mention that this explosion can to some extent be tamed by taking into consideration the symmetries of the tic-tac-toe board. For instance, only three qualitatively different actions are possible when the three-by-three board is empty: corner, center, and edge. In more complicated games, though, the task remains intractable. Common folklore has it that there are more possible games of chess than there are atoms in the universe, whose number is sometimes estimated as 10^80. In reality, the number of chess variations is higher still, by many orders of magnitude. And in Go, the number of variations makes chess look simple. Exhaustive search is out of the question.


Figure 1.2: The minimax search, looking ahead one move.


Only a compromise can make the search-based approach feasible. One possibility is to examine variations of a limited length, and then to look at the "values" of the resulting states. For example, a winning state will be assigned a high positive value, whereas the value of a lost state will be negative, and the values between the two extremes will grade the winning chances. In Figure 1.2, the search depth is minimal. At the initial state, two actions are available to O, each followed by two possibilities for X. The final states are associated with values that O wants to maximize. If O chooses the action leading to state 1, the worst possible outcome after X's response is 0.1. O will prefer the second action (leading to state 2) because it guarantees a value of at least 1.1. This line of thought characterizes the minimax approach, whose name reflects the circumstance that X attempts to minimize a value that O seeks to maximize. The strategy guarantees that O will reach at least a certain minimum value under the best possible play on the part of X.
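
In code, the minimax back-up can be sketched as follows; the functions actions, result, and evaluate are placeholders for game-specific routines (the naming is ours), and the value is always taken from the maximizing player's point of view.

    def minimax(state, depth, maximizing, actions, result, evaluate):
        """Depth-limited minimax value of `state` for the maximizing player.

        actions(state)    -> list of legal actions (empty when the game is over)
        result(state, a)  -> successor state after action `a`
        evaluate(state)   -> heuristic value; higher is better for the maximizer
        """
        moves = actions(state)
        if depth == 0 or not moves:
            return evaluate(state)
        if maximizing:
            return max(minimax(result(state, a), depth - 1, False,
                               actions, result, evaluate) for a in moves)
        else:
            return min(minimax(result(state, a), depth - 1, True,
                               actions, result, evaluate) for a in moves)

Applied to the figure with a depth of two plies, the maximizer takes, for each of O's two actions, the minimum over X's replies (0.1 and 1.1, respectively) and then chooses the larger of the two, that is, the action leading to state 2.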

A few comments will expose the difficulties inherent in the minimax strategy. First and foremost, the notion of a state value is somewhat nebulous. In practical implementations, these values are obtained with the help of heuristics that estimate how close the given state is to a winning position. This may be pretty complicated. In chess, a significant advantage in material usually promises an easy win, but the player also has to consider coordination of pieces, existence of passed pawns, exposed kings, and many other features. The value of each chess-board position is then determined by an "evaluation" function that comprises all these factors. An appropriately defined evaluation function can be more vital than the search depth.[1] In fact, a perfect evaluation function that correctly evaluates each state would render the search unnecessary.
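
For concreteness, an evaluation function of this kind is often expressed as a weighted sum of hand-picked features; the features and weights below are invented purely for illustration, while real programs use far richer terms and painstakingly tuned coefficients.

    # A toy linear evaluation function (features and weights are illustrative only).
    WEIGHTS = {
        "material":     1.0,   # material balance in pawn units
        "passed_pawns": 0.5,   # own passed pawns minus the opponent's
        "king_safety":  0.3,   # positive when the own king is the safer one
        "mobility":     0.1,   # difference in the number of legal moves
    }

    def evaluate(features):
        """Combine feature values (a dict keyed like WEIGHTS) into one score."""
        return sum(WEIGHTS[name] * features.get(name, 0.0) for name in WEIGHTS)

    print(evaluate({"material": 3, "passed_pawns": 1, "king_safety": -1}))   # 3.2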

Second, the assumption that X will always choose the best defense may be unreasonably conservative and O can forgo a fast win simply by being too pessimistic. Here, we are not alluding to the clichéd fact that even a grandmaster (if playing X's part) can miss the strongest continuation. What we have in mind is that the opponent can be totally indifferent to O's aspirations. In the elevator domain, the dispatcher does not know where the next passenger will appear. The environment behaves randomly, lacking any ambition to complicate the decision maker's life. Moreover, in some card games, such as poker, an action can lead to alternative states, each with a different probability of occurrence, each with a different state value. This means that the worst case will ensue only with some (maybe very small) likelihood. Advanced game playing programs have to consider the distribution of an action's consequences and they should perhaps also take chances in the same way as successful human players do.
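
At such chance nodes, the worst-case back-up of minimax can be replaced by an expectation over the possible outcomes. A tiny numerical sketch (the probabilities and values are made up) shows how different the two summaries can be:

    # Expected versus worst-case value of one action with probabilistic outcomes.
    outcomes = [(0.7,  2.0),   # 70% chance of reaching a state worth  2.0
                (0.2,  0.5),   # 20% chance of reaching a state worth  0.5
                (0.1, -4.0)]   # 10% chance of reaching a state worth -4.0

    expected = sum(p * value for p, value in outcomes)
    worst    = min(value for _, value in outcomes)
    print(expected, worst)     # expected value is about 1.1, worst case is -4.0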

Third, the problem of the search depth can be tricky. Common terminology refers to single actions as plies, whereas the pair [O's action, X's response] is called a move. In Figure 1.2, only a single move (two plies) is considered. But one move is rarely enough because deeper analysis can contradict early results: it may happen that any action taken by O at state 6 hurts this player's game more than any action taken at state 4. The program designer must consider the trade-off between cost and reliability. Long variations provide stronger evidence, but they are expensive to calculate. If the state values are reliably estimated, deep exploration may be unnecessary.

As a final point, let us mention that scientists have developed ingenious techniques to speed up the search by mechanisms that help the machine avoid dubious paths and instead follow those variations that promise fast success. Further details would stray from the subject of this chapter. What should be clear, at this stage, is that the strength of a search-based game playing program hinges on two fundamental factors: the depth of analysis and the quality of state evaluation.


1.3 Merits of learning

Unless the evaluation function predicts the state values reliably, the search through the maze of variations has to go deep, incurring expensive computation. While the cost-versus-quality dilemma is hard to escape for good, any alleviation, even if partial, is welcome. This is why machine learning comes to the fore.

The simplest way to reduce the costs is to make use of instruction, much as human players get advice from their coaches. First of all, experience shows that, at the game's beginning, some moves deserve to be preferred, while others are better avoided. For instance, chess coaches discourage their pupils from starting a game with 1.a4 or 1.h4. In three-by-three tic-tac-toe, a good initial action is to place the mark in the center of the board. Recommendations of this kind can go much deeper and, as a matter of fact, commercial programs use entire books of opening variations. Predefined moves at the game's beginning greatly increase the playing strength and save precious computation time.

The reduced material towards the game's end makes it possible to store a complete set of endgame positions, each associated with a best move. This, too, curtails search costs. Rather than examining millions of variations, the program finds the best move in a look-up table. To be sure, endgame databases are large and the search for a concrete position may be expensive, especially if the database is stored on a CD. Still, these costs are much lower than those entailed by systematic examination of all variations.
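
In its simplest form, such a database is nothing but a dictionary mapping a position to its precomputed best move; the entries below are fictitious and serve only to show the mechanism.

    # A toy endgame "database": best moves keyed by a textual position description.
    # Real tablebases are built offline (e.g., by retrograde analysis) and cover
    # every legal position with the given material; these two entries are made up.
    ENDGAME_TABLE = {
        "wKa8 wQh1 bKa1, white to move": "Qh1-b7",
        "wKc3 wRh8 bKa1, white to move": "Rh8-a8",
    }

    def best_move(position, fallback_search):
        """Return the stored move if the position is known, else fall back to search."""
        return ENDGAME_TABLE.get(position) or fallback_search(position)

    print(best_move("wKa8 wQh1 bKa1, white to move", lambda p: "search..."))   # found in the table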

In the middlegame, the situation is not so simple. As already mentioned, one way to avoid prohibitively deep search is to rely on a good evaluation function. But where does the function come from? Experts offer hypotheses as to which aspects should be taken into consideration and how the impact of each of them should be weighted, but no one really knows exactly how to capture all pertinent features in a mathematical formula.

Another possibility is to capitalize on patterns that point to preferable variations. But, again, where to procure the patterns? Can you describe your mother's face eloquently enough for anyone to recognize her? People use patterns implicitly, and their "evaluation functions" are so intangible and elusive that they are next to impossible to encode. For one thing, grandmasters are notoriously inarticulate about why they choose a particular plan over other plans possible in a given position. The concepts and rules suggested in textbooks are fuzzy and, more often than not, tend to contradict each other. Years of study are needed if the player is to apply them to his or her advantage.

For all these reasons, coaches like to illustrate vital concepts using selected games of masters. Those who seek to improve their skill play through these games, contemplating, with the help of commentaries, all the specific aspects of each position as well as the merits of alternative moves. As the learner generalizes from the specific circumstances of a chess-board position, his or her strength grows. What cannot be formalized by precise rules is thus conveyed by examples. The player's learning abilities compensate for the lack of precise recipes. You may not be able to describe your mother's face, but if you show me a few pictures, I will easily learn to recognize her. In other words, learning from examples can take care of knowledge that is hard to formalize.

There is a deeper philosophical reason for this approach. A chess player who repeatedly falls for the same kind of knight fork will hardly be admired; intelligent behavior is unthinkable without the ability to learn.


Figure 1.3: Which slot machine gives the highest returns can be figured out in the course of clever experimentation.


This book is about machine-learning algorithms for game playing. To prepare the reader for what is to come, let us show one learning method. Figure 1.3 depicts a game in which a player faces several slot machines, each returning an amount generated by a different probability distribution. To learn which machine yields the highest returns, the gambler may pull each lever many times, keeping a detailed tally with the intention of finding the machine with the highest average return. But each trial costs a coin, and the gambler will often bet on "bad" levers. As correct estimates of the averages are possible only in the long run, considerable losses may be incurred. A more pragmatic experimenter will therefore first pull each lever only a few times, just to develop some initial idea. Then, while sticking to the lever that currently appears to be best, the player will occasionally try the other machines as well, just in case this opinion has to be corrected. By combining exploration with exploitation, the player will converge to the optimal solution much faster than by systematic experimentation.
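
The pragmatic behavior just described is essentially the epsilon-greedy rule. A small simulation (the machines' true mean payoffs are invented and, of course, hidden from the player) illustrates it:

    import random

    # Epsilon-greedy play against slot machines with unknown mean payoffs (a sketch).
    TRUE_MEANS = [0.2, 0.5, 0.9]      # machine 2 is in fact the most generous one
    EPSILON = 0.1                      # fraction of the plays spent exploring

    counts = [0] * len(TRUE_MEANS)
    totals = [0.0] * len(TRUE_MEANS)

    def average(i):
        return totals[i] / counts[i] if counts[i] else 0.0

    for trial in range(1000):
        if random.random() < EPSILON:
            arm = random.randrange(len(TRUE_MEANS))          # explore a random machine
        else:
            arm = max(range(len(TRUE_MEANS)), key=average)   # exploit the best one so far
        payoff = random.gauss(TRUE_MEANS[arm], 1.0)          # noisy return from the machine
        counts[arm] += 1
        totals[arm] += payoff

    print(max(range(len(TRUE_MEANS)), key=average))          # usually 2, the best machine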

This principle, called reinforcement learning, is frequently employed in game playing programs, albeit in more sophisticated guises. In tic-tac-toe, each "lever" may represent a different state-action pair, and at each state the player has to choose from the levers associated with that state. Actions that lead to a win will be rewarded, while those that lead to a loss will be punished. However, too many levers are required if each state-action pair is to be considered separately. Sophisticated algorithms have been developed that generalize from specific states to more general ones so as to reduce the problem size and render the solution tractable.
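
In the same spirit, a tabular learner for tic-tac-toe can keep one value per state-action pair and nudge it toward the outcome of each finished game. The sketch below (a simplified Monte-Carlo style update with an arbitrary learning rate, not any specific algorithm from the later chapters) shows the bookkeeping involved:

    from collections import defaultdict

    # One learned value per (state, action) pair, all starting at zero; the states
    # could be the board tuples from the earlier sketch, the actions square indices.
    values = defaultdict(float)
    ALPHA = 0.1        # learning rate; the value 0.1 is an arbitrary illustrative choice

    def update_from_game(episode, outcome):
        """Nudge every (state, action) visited in one game toward its outcome.

        episode -- list of (state, action) pairs chosen by the learner
        outcome -- +1 for a win, -1 for a loss, 0 for a draw
        """
        for state, action in episode:
            values[(state, action)] += ALPHA * (outcome - values[(state, action)])

    def greedy_action(state, legal_actions):
        """Pick the action with the highest learned value in this state."""
        return max(legal_actions, key=lambda a: values[(state, a)])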

Many other machine learning principles have been developed and employed, each with a different role in the game-playing context. Some focus on the induction of evaluation functions, others on methods to represent patterns and rules in memory, and yet others on how to develop an opening book or an endgame database. The best way to describe them is in the context of concrete applications.


1.4 Bon Voyage!

The secret behind human game-playing skills is the ability to generalize from instruction and from past experience, and to learn patterns that expedite decision making. With the continuing progress in machine learning, the number of scientists adopting this view and pursuing serious research along these lines has grown to a point that gives a raison d'être to an entire book devoted solely to this theme.

The papers collected in this volume describe how to instill learning skills in game-playing machines. The reader is asked to keep in mind that this is not just about games---the possibility that the discussed techniques will be used in control systems and in decision support always looms in the background. However, to keep the text focused, we decided to concentrate exclusively on games and ignore their potential in more "practical" domains.

Following roughly the issues outlined in the preceding sections, the papers presented in the sequel expound topics related to human learning, to opening-book learning, to the search for useful features in evaluation functions, to the representation and induction of patterns, and to opponent modeling. To cover a sufficiently broad class of problems, various games are addressed, including chess, Go, Othello, and poker. All papers have been written by prominent scientists (including a Nobel laureate) who have amassed extensive experience in writing game-playing programs and have spent years doing systematic research in this vibrant field.

Welcome aboard!


[1] In many real-world applications, the search also has to consider the costs of an action. The player wants to win, but not at any price.