Despite decades of work, the strongest computer Go programs could only play at the level of human amateurs. Standard AI methods, which build a search tree over possible moves and positions, cannot cope with the sheer number of possible Go moves, nor can they evaluate the strength of each possible board position.
Two players, one using white stones and the other black, take turns placing their stones on a board. The goal is to surround and capture the opponent's stones or to strategically create territory. Once all possible moves have been played, both the stones on the board and the empty points each player has surrounded are tallied, and the higher total wins.
As simple as the rules may seem, Go is profoundly complex. There are an astonishing 10 to the power of 170 possible board configurations - more than the number of atoms in the known universe. This makes the game of Go a googol times more complex than chess.
We created AlphaGo, a computer program that combines an advanced tree search with deep neural networks. These neural networks take a description of the Go board as input and process it through a number of network layers containing millions of neuron-like connections.
One neural network, the “policy network”, selects the next move to play. The other neural network, the “value network”, predicts the winner of the game. We introduced AlphaGo to numerous amateur games to help it develop an understanding of reasonable human play. Then we had it play against different versions of itself thousands of times, each time learning from its mistakes.
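To make the two-network idea concrete, here is a minimal sketch of a policy network and a value network. It is illustrative only: the board encoding, layer sizes, and framework choice are assumptions, not AlphaGo's actual architecture.

```python
# Minimal, illustrative policy and value networks for a 19x19 Go board.
# The 3-plane board encoding and layer sizes are assumptions for this sketch.
import torch
import torch.nn as nn

BOARD_SIZE = 19      # standard Go board
INPUT_PLANES = 3     # assumed encoding: own stones, opponent stones, empty points

class PolicyNetwork(nn.Module):
    """Maps a board position to a probability distribution over the 361 points."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(INPUT_PLANES, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(64, 1, kernel_size=1)

    def forward(self, board):                            # board: (N, 3, 19, 19)
        logits = self.head(self.body(board))             # (N, 1, 19, 19)
        return torch.softmax(logits.flatten(1), dim=1)   # (N, 361) move probabilities

class ValueNetwork(nn.Module):
    """Maps a board position to a single estimate of who will win."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(INPUT_PLANES, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * BOARD_SIZE * BOARD_SIZE, 256), nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, board):
        return torch.tanh(self.net(board))               # in [-1, 1]: predicted winner
```

A real system uses far deeper networks and much richer input features, but the division of labour is the same: one network proposes moves, the other judges positions.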
Over time, AlphaGo became increasingly strong at learning and decision-making. This process is known as reinforcement learning. AlphaGo went on to defeat Go world champions in different global arenas and arguably became the greatest Go player of all time.
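In reinforcement-learning terms, “learning from its mistakes” means nudging the policy towards moves that led to wins and away from moves that led to losses. The sketch below is a hypothetical, simplified self-play update (a textbook REINFORCE step on the final game outcome); the environment helpers and the update rule are assumptions, not AlphaGo's training procedure.

```python
# Hypothetical self-play policy-gradient step: play one game against ourselves,
# then reinforce the winning side's moves and discourage the losing side's.
# game_over, play_one_move and winner are assumed environment helpers.
import torch

def self_play_update(policy_net, optimizer, position):
    log_probs = []
    while not game_over(position):
        probs = policy_net(position)                  # (1, 361) move probabilities
        move = torch.multinomial(probs, 1).item()     # sample a move to explore
        log_probs.append(torch.log(probs[0, move]))
        position = play_one_move(position, move)
    z = winner(position)                              # +1 if the first player won, -1 otherwise
    # The two sides alternate moves, so the outcome's sign alternates with them.
    returns = [z if i % 2 == 0 else -z for i in range(len(log_probs))]
    loss = -sum(r * lp for r, lp in zip(returns, log_probs))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```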
AlphaGo then competed against legendary Go player Mr Lee Sedol, the winner of 18 world titles, who is widely considered the greatest player of the past decade. AlphaGo's 4-1 victory in Seoul, South Korea, in March 2016 was watched by over 200 million people worldwide. This landmark achievement came a decade ahead of its time.
The match earned AlphaGo a 9 dan professional ranking, the highest certification. This was the first time a computer Go player had ever received the accolade. During the games, AlphaGo played a handful of inventive winning moves, several of which - including move 37 in game two - were so surprising that they upended hundreds of years of wisdom. Players of all levels have examined these moves extensively ever since.
In January 2017, we revealed an improved, online version of AlphaGo called Master. This online player achieved 60 straight wins in time-control games against top international players.
Four months later, AlphaGo took part in the Future of Go Summit in China, the birthplace of Go. The five-day festival created an opportunity to explore the mysteries of Go in a spirit of mutual collaboration with the country’s top players. Designed to help unearth even more strategic moves, the summit included various game formats such as pair Go, team Go, and a match with the world’s number one player Ke Jie.
Later in 2017, we introduced AlphaGo Zero, a version that learns simply by playing games against itself, starting from completely random play and without any human data. This powerful technique is no longer constrained by the limits of human knowledge. Instead, in just a few days, AlphaGo Zero accumulated thousands of years of human knowledge, learning to play Go from the strongest player in the world: itself.
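The published AlphaGo Zero approach trains a single network on its own games: search visit counts become the policy target and the final game outcome becomes the value target. Below is a hedged sketch of an objective of that general shape; the tensor shapes, names, and regularisation constant are assumptions for illustration.

```python
# Combined objective of the AlphaGo Zero kind: value error plus policy
# cross-entropy against search-derived targets, plus L2 regularisation.
import torch

def zero_style_loss(z, v, search_policy, policy_logits, params, c=1e-4):
    """z, v: (N,) game outcomes and value predictions.
    search_policy, policy_logits: (N, num_moves) search targets and network outputs.
    params: iterable of network weight tensors; c: assumed regularisation constant."""
    value_loss = (z - v) ** 2
    policy_loss = -(search_policy * torch.log_softmax(policy_logits, dim=-1)).sum(dim=-1)
    l2 = c * sum((w ** 2).sum() for w in params)
    return (value_loss + policy_loss).mean() + l2
```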
AlphaGo Zero quickly surpassed the performance of all previous versions and also discovered new knowledge, developing unconventional strategies and creative new moves, including those which beat the world Go champions Lee Sedol and Ke Jie. These creative moments give us confidence that AI can be used as a positive multiplier for human ingenuity.
We then introduced AlphaZero, a single system that taught itself from scratch how to master the games of chess, shogi, and Go. AlphaZero replaces hand-crafted heuristics with a deep neural network and algorithms that are given nothing beyond the basic rules of the game. By teaching itself, AlphaZero developed its own unique and creative style of play in all three games.
In its chess games, for example, players saw that it had developed a highly dynamic and “unconventional” style of play that differed from any previous chess-playing engine. Many of its “game-changing” ideas have since been taken up at the highest levels of play.
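One way to picture what replacing hand-crafted heuristics looks like inside the search: when the tree search decides which move to explore next, it scores each candidate using the network's averaged value estimates and its policy prior rather than hand-written evaluation rules. The sketch below uses a PUCT-style selection rule of the kind associated with this family of programs; the data structures and exploration constant are illustrative assumptions.

```python
# Illustrative PUCT-style node selection: balance the search's current value
# estimate (Q) against an exploration bonus (U) guided by the policy prior.
import math
from dataclasses import dataclass, field

C_PUCT = 1.5  # exploration constant; the value here is an assumption

@dataclass
class Node:
    prior: float                      # P(s, a): the policy network's prior for this move
    visit_count: int = 0              # N(s, a)
    value_sum: float = 0.0            # sum of value estimates backed up through this node
    children: dict = field(default_factory=dict)

    def q(self) -> float:             # mean value Q(s, a)
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_child(node: Node):
    """Return the (move, child) pair maximising Q + U."""
    total_visits = sum(child.visit_count for child in node.children.values())

    def score(child: Node) -> float:
        u = C_PUCT * child.prior * math.sqrt(total_visits) / (1 + child.visit_count)
        return child.q() + u

    return max(node.children.items(), key=lambda item: score(item[1]))
```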
We later introduced MuZero, a system that learns to master games without being told their rules. It does this by learning a model of its environment and combining it with AlphaZero's powerful lookahead tree search. This allows it to plan winning strategies in unknown domains, a significant leap forward in the capabilities of reinforcement learning algorithms and an important step towards our mission of building general-purpose learning systems.
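Concretely, the learned model can be thought of as a small set of learned functions: one encodes an observation into a latent state, one steps that latent state forward given an action, and one predicts a policy and value from it. The sketch below is a minimal assumed architecture for illustration, not MuZero's published one.

```python
# Minimal sketch of a learned model in the MuZero spirit: representation,
# dynamics, and prediction functions. Sizes and layers are assumptions.
import torch
import torch.nn as nn

LATENT = 64  # size of the latent state (illustrative)

class Representation(nn.Module):
    """h: observation -> latent state."""
    def __init__(self, obs_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, LATENT), nn.ReLU())
    def forward(self, obs):
        return self.net(obs)

class Dynamics(nn.Module):
    """g: (latent state, action) -> (next latent state, predicted reward)."""
    def __init__(self, num_actions):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(LATENT + num_actions, LATENT), nn.ReLU())
        self.reward = nn.Linear(LATENT, 1)
    def forward(self, state, action_one_hot):
        next_state = self.net(torch.cat([state, action_one_hot], dim=-1))
        return next_state, self.reward(next_state)

class Prediction(nn.Module):
    """f: latent state -> (policy logits, value)."""
    def __init__(self, num_actions):
        super().__init__()
        self.policy = nn.Linear(LATENT, num_actions)
        self.value = nn.Linear(LATENT, 1)
    def forward(self, state):
        return self.policy(state), torch.tanh(self.value(state))
```

Planning then runs the same kind of lookahead search as AlphaZero, but steps forward using the learned dynamics function instead of the real rules of the game.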
While it is still early days, the ideas behind MuZero's powerful learning and planning algorithms may pave the way towards tackling new problems in messy real-world environments where the “rules of the game” are unknown.