Play Gomoku
Dive into Gomoku, or “Five in a Row,” where strategy meets fun! Players alternate placing stones on a board, racing to line up five first. Will it be horizontally, vertically, or diagonally? Your move!
The Computer’s Heuristic Scoring System
1. Scoring the Lines
The computer evaluates each possible move by considering several factors:
- Consecutive Stones: More consecutive stones in a row, column, or diagonal are scored higher.
- Empty Sides: Lines with more empty spaces on either side are prioritized.
- Blocking Moves: Moves that block the opponent from completing five in a row are given a boost.
Scoring Formula
The total score S for a cell is calculated as:
$$ S = C \times 3 + E \times 2 + B $$
Where:
- C = number of consecutive stones
- E = number of empty sides
- B = blocking value (5 points if it blocks the player)
The computer picks the cell with the highest score for its move!
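To make this concrete, here is a minimal Python sketch of the scoring rule. The board representation (a 2D list with 0 for empty, 1 for the computer, 2 for the human), the direction scan, and the "block a completed five" check are illustrative assumptions; only the weighting S = C × 3 + E × 2 + B comes from the formula above.

```python
# Illustrative sketch of the heuristic above. Board encoding (0 = empty,
# 1 = computer, 2 = human) and the helper logic are assumptions; the
# weights 3, 2, and 5 come from the scoring formula.

DIRECTIONS = [(0, 1), (1, 0), (1, 1), (1, -1)]  # row, column, two diagonals

def line_stats(board, r, c, player):
    """If `player` played the empty cell (r, c), return the longest resulting
    run of consecutive stones and the number of open (empty) line ends."""
    n = len(board)
    longest, open_ends = 0, 0
    for dr, dc in DIRECTIONS:
        count = 1                                  # the newly placed stone
        for sign in (1, -1):                       # walk both ways along the line
            rr, cc = r + sign * dr, c + sign * dc
            while 0 <= rr < n and 0 <= cc < n and board[rr][cc] == player:
                count += 1
                rr, cc = rr + sign * dr, cc + sign * dc
            if 0 <= rr < n and 0 <= cc < n and board[rr][cc] == 0:
                open_ends += 1                     # this end of the line is open
        longest = max(longest, count)
    return longest, open_ends

def heuristic_score(board, r, c, computer=1, human=2):
    """S = C * 3 + E * 2 + B for the empty cell (r, c)."""
    C, E = line_stats(board, r, c, computer)
    # B: bonus if playing here would block the human from completing five.
    blocked_len, _ = line_stats(board, r, c, human)
    B = 5 if blocked_len >= 5 else 0
    return C * 3 + E * 2 + B

def pick_move(board, computer=1, human=2):
    """Choose the empty cell with the highest heuristic score."""
    n = len(board)
    scored = [(heuristic_score(board, r, c, computer, human), (r, c))
              for r in range(n) for c in range(n) if board[r][c] == 0]
    return max(scored)[1]
```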
Reinforcement Learning-Based Gomoku AI
We can also use reinforcement learning (RL) to train the Gomoku AI, specifically by leveraging Monte Carlo Tree Search (MCTS) and a neural network, similar to AlphaGo Zero. The system employs self-play to continuously refine its strategy, learning from thousands of simulated games.
1. Monte Carlo Tree Search (MCTS)
MCTS is a search algorithm that evaluates the potential moves by simulating possible future game scenarios. The main steps include:
- Selection: The tree is traversed from the root to a leaf node, selecting actions that maximize the Upper Confidence Bound (UCB) score (a short code sketch of this rule follows the list), defined as:
$$ U(s, a) = Q(s, a) + c_{puct} \times P(s, a) \times \frac{\sqrt{N(s)}}{1 + N(s, a)} $$
Where:
- Q(s, a) represents the value of action a in state s.
- P(s, a) is the prior probability of taking action a in state s, provided by the neural network.
- N(s) is the total number of visits to state s.
- N(s, a) is the number of times action a has been taken from state s.
- c_{puct} controls the trade-off between exploration and exploitation.
- Expansion: If the selected node is not fully expanded, new child nodes are created for each possible action.
- Simulation: From the expanded node, the algorithm simulates random games until a terminal state is reached.
- Backpropagation: The result of the simulation is backpropagated through the nodes, updating their statistics.
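As a concrete illustration of the selection step, here is a small Python sketch of the UCB/PUCT rule above. The Node structure and its field names are assumptions made for this example; they are not the MCTS class used in the actual code.

```python
import math
from dataclasses import dataclass, field

# Sketch of the selection rule only. The Node layout is an assumption for
# illustration, not the project's MCTS class.

@dataclass
class Node:
    prior: float                                   # P(s, a) from the network
    visit_count: int = 0                           # N(s, a)
    value_sum: float = 0.0                         # used to derive Q(s, a)
    children: dict = field(default_factory=dict)   # action -> Node

    def q_value(self) -> float:
        return self.value_sum / self.visit_count if self.visit_count else 0.0

def select_child(node: Node, c_puct: float = 1.0):
    """Pick the action maximizing U(s, a) = Q(s, a) + c_puct * P(s, a) * sqrt(N(s)) / (1 + N(s, a))."""
    n_s = sum(child.visit_count for child in node.children.values())  # N(s)
    best_action, best_score = None, -math.inf
    for action, child in node.children.items():
        u = child.q_value() + c_puct * child.prior * math.sqrt(n_s) / (1 + child.visit_count)
        if u > best_score:
            best_action, best_score = action, u
    return best_action, node.children[best_action]
```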
2. Neural Network Architecture
The neural network approximates both the policy (which move to take) and the value (the probability of winning). The architecture is as follows:
- Input Layer: The board state, represented as an n x n grid, is fed into the network.
- Convolutional Layers: These layers identify spatial patterns like rows, columns, and diagonals, which are crucial in Gomoku.
- Fully Connected Layers: The processed features are passed through dense layers to generate the policy vector pi and value v.
The network outputs:
- Policy vector (pi): A probability distribution over all possible moves.
- Value (v): A scalar representing the expected outcome of the game from the current state.
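The architecture described above could look roughly like the PyTorch sketch below. It is a simplified stand-in rather than the GomukuNNet class from the code; the board size, channel counts, and layer depth are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyValueNet(nn.Module):
    """Simplified policy/value network for an n x n Gomoku board.

    Illustrative only: layer sizes and depth are assumptions, not the
    project's GomukuNNet architecture.
    """

    def __init__(self, board_size: int = 15, channels: int = 64):
        super().__init__()
        # Convolutional trunk: detects local row/column/diagonal patterns.
        self.conv1 = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        flat = channels * board_size * board_size
        # Policy head: one logit per board cell.
        self.policy_fc = nn.Linear(flat, board_size * board_size)
        # Value head: a single scalar squashed into [-1, 1].
        self.value_fc1 = nn.Linear(flat, 64)
        self.value_fc2 = nn.Linear(64, 1)

    def forward(self, boards: torch.Tensor):
        # boards: (batch, 1, n, n) tensor of current positions.
        x = F.relu(self.conv1(boards))
        x = F.relu(self.conv2(x))
        x = x.flatten(start_dim=1)
        log_pi = F.log_softmax(self.policy_fc(x), dim=1)            # policy: log-probs over moves
        v = torch.tanh(self.value_fc2(F.relu(self.value_fc1(x))))   # value in [-1, 1]
        return log_pi, v
```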
3. Self-Play and Training
Training the AI involves generating data through self-play and using it to improve the neural network. The steps include:
- Self-Play: The AI plays games against itself, collecting data in the form of board states, move probabilities, and game outcomes.
- Training: The collected data is used to train the neural network, updating its weights to better predict the best moves and the expected outcome of the game.
- Evaluation: The updated model is evaluated by competing against previous versions. The new model is accepted if it shows improved performance.
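A single self-play episode might be sketched as below. The game and MCTS interfaces used here (get_initial_board, get_action_probs, apply_move, winner) are hypothetical names chosen for illustration, not the actual methods of the GomokuGame or MCTS classes.

```python
# Sketch of self-play data collection. All game/mcts method names here are
# hypothetical placeholders, not the actual GomokuGame/MCTS API.

def self_play_episode(game, mcts):
    """Play one game against itself and return (board, pi, z) training triples."""
    examples = []                                  # (board, move probabilities, player to move)
    board, player = game.get_initial_board(), 1

    while game.winner(board) is None:
        pi = mcts.get_action_probs(board, player)  # move distribution from tree search
        examples.append((board, pi, player))
        move = max(range(len(pi)), key=lambda a: pi[a])  # greedy here; sampling adds exploration
        board, player = game.apply_move(board, move, player)

    z = game.winner(board)                         # +1, -1, or 0 from player 1's perspective
    # Re-label each example with the outcome from the mover's point of view.
    return [(b, p, z if mover == 1 else -z) for b, p, mover in examples]
```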
4. Mathematical Formulation of Training
The training process aims to minimize the loss function, which combines the policy loss and value loss:
$$ L(\theta) = \sum_{i=1}^{n} \left( -\pi_i \cdot \log(p_i) + \lambda \cdot (v_i - z_i)^2 \right) $$
Where:
- theta are the parameters of the neural network.
- pi_i is the target policy (the action distribution from MCTS).
- p_i is the predicted policy from the network.
- v_i is the predicted value.
- z_i is the actual game outcome (1 for a win, -1 for a loss, 0 for a draw).
- lambda is a weighting coefficient that balances the value loss against the policy loss.
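In code, the combined loss could be computed roughly as in the PyTorch sketch below, assuming the policy head returns log-probabilities (as in the architecture sketch earlier) and that lambda is passed in as a plain argument.

```python
import torch

def policy_value_loss(log_p, v, target_pi, z, lam=1.0):
    """Combined policy + value loss over a batch of self-play examples.

    log_p     : (batch, moves) log-probabilities from the policy head
    target_pi : (batch, moves) MCTS visit-count distributions (the targets)
    v         : (batch,) predicted values
    z         : (batch,) game outcomes (+1 win, -1 loss, 0 draw)
    lam       : weighting between the two terms (an assumption for this sketch)
    """
    policy_loss = -(target_pi * log_p).sum(dim=1).mean()    # cross-entropy against the MCTS policy
    value_loss = ((v - z) ** 2).mean()                      # mean squared error against the outcome
    return policy_loss + lam * value_loss
```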
The policy network outputs a probability distribution pi over the possible actions. Mathematically, the probability of selecting action a given state s is denoted as:
$$ \pi(a|s) = \frac{e^{f_{\theta}(s,a)}}{\sum_{b} e^{f_{\theta}(s,b)}} $$
Where:
- f_{\theta}(s, a) represents the score for action a in state s, as predicted by the neural network with parameters theta.
- The denominator normalizes these scores across all possible actions b, ensuring that the probabilities sum to 1.
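For completeness, here is a tiny NumPy sketch of that softmax normalization; the scores array stands in for the raw outputs f_theta(s, a) of the policy head.

```python
import numpy as np

def policy_from_scores(scores: np.ndarray) -> np.ndarray:
    """Turn raw move scores f_theta(s, a) into a probability distribution pi(a|s)."""
    shifted = scores - scores.max()        # shift for numerical stability (does not change the result)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()   # normalize so the probabilities sum to 1

# Example: three candidate moves with raw scores 2.0, 1.0, and 0.1.
print(policy_from_scores(np.array([2.0, 1.0, 0.1])))  # ~[0.66, 0.24, 0.10]
```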
5. Coach Class
The Coach class orchestrates the learning process, coordinating self-play, training, and evaluation to ensure continuous improvement of the model.
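The overall loop the Coach drives could be sketched as follows. The method names (execute_episode, train, pit, save_checkpoint) and the acceptance threshold are illustrative assumptions, not the actual Coach or NNetWrapper API.

```python
# Rough sketch of the coordination loop described above; all method names
# and the 55% acceptance threshold are assumptions for illustration.

def learn(game, nnet, num_iterations=100, games_per_iteration=50, accept_threshold=0.55):
    for iteration in range(num_iterations):
        # 1. Self-play: generate training examples with the current network.
        examples = []
        for _ in range(games_per_iteration):
            examples.extend(execute_episode(game, nnet))

        # 2. Training: update a copy of the network on the collected data.
        candidate = nnet.copy()
        candidate.train(examples)

        # 3. Evaluation: pit the candidate against the previous best network.
        win_rate = pit(candidate, nnet, game, num_games=40)
        if win_rate >= accept_threshold:
            nnet = candidate                              # accept the improved model
            nnet.save_checkpoint(f"iteration_{iteration}.pt")
    return nnet
```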
Code Overview
- GomokuGame Class: Manages the board, including move generation and game state evaluation.
- NNetWrapper Class: Interfaces with the neural network, providing methods for training, prediction, and checkpoint management.
- GomukuNNet Class: Defines the neural network architecture used for policy and value prediction.
- MCTS Class: Implements Monte Carlo Tree Search for decision-making.
- Coach Class: Oversees the self-play and training process, ensuring the model continues to improve over time.