Play Gomoku

Dive into Gomoku, or “Five in a Row,” where strategy meets fun! Players alternate placing stones on a board, racing to line up five first. Will it be horizontally, vertically, or diagonally? Your move!

Here’s the interactive game:

The Computer’s Heuristic Scoring System

1. Scoring the Lines

The computer evaluates each possible move by considering several factors:

  • Consecutive Stones: Lines (rows, columns, or diagonals) containing more consecutive stones score higher.
  • Empty Sides: Lines with open ends, i.e. empty spaces on either side, are prioritized.
  • Blocking Moves: Moves that block the opponent from completing five in a row receive a bonus.

Scoring Formula

The total score S for a cell is calculated as:

$$ S = C \times 3 + E \times 2 + B $$

Where:

  • C = number of consecutive stones
  • E = number of empty sides
  • B = blocking value (5 points if the move blocks the opponent's five in a row, 0 otherwise)

The computer picks the cell with the highest score for its move!
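Here is a minimal Python sketch of how such a heuristic could be computed for a single empty cell. The board encoding (0 = empty, 1 = computer, 2 = human), the per-direction scoring maximized over the four line directions, and the "four opponent stones means a block" check are illustrative assumptions, not the game's actual code:

```python
# Hypothetical heuristic sketch: scores one empty cell for the computer (stone = 1).
# Board encoding is an assumption: 0 = empty, 1 = computer, 2 = human player.

DIRECTIONS = [(1, 0), (0, 1), (1, 1), (1, -1)]  # row, column, both diagonals

def count_line(board, r, c, dr, dc, stone):
    """Count consecutive `stone`s through (r, c) along (dr, dc), plus open ends."""
    n = len(board)
    consecutive, empty_sides = 0, 0
    for sign in (1, -1):                      # walk both ways from the candidate cell
        rr, cc = r + sign * dr, c + sign * dc
        while 0 <= rr < n and 0 <= cc < n and board[rr][cc] == stone:
            consecutive += 1
            rr, cc = rr + sign * dr, cc + sign * dc
        if 0 <= rr < n and 0 <= cc < n and board[rr][cc] == 0:
            empty_sides += 1                  # the line stays open on this side
    return consecutive, empty_sides

def score_cell(board, r, c):
    """S = C*3 + E*2 + B, evaluated per direction and maximized over the four directions."""
    best = 0
    for dr, dc in DIRECTIONS:
        c_own, e_own = count_line(board, r, c, dr, dc, stone=1)
        c_opp, _     = count_line(board, r, c, dr, dc, stone=2)
        blocking = 5 if c_opp >= 4 else 0     # playing here stops a five-in-a-row
        best = max(best, c_own * 3 + e_own * 2 + blocking)
    return best
```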


Reinforcement Learning-Based Gomoku AI

We can also use reinforcement learning (RL) to train the Gomoku AI, specifically by leveraging Monte Carlo Tree Search (MCTS) and a neural network, similar to AlphaGo Zero. The system employs self-play to continuously refine its strategy, learning from thousands of simulated games.

1. Monte Carlo Tree Search (MCTS)

MCTS is a search algorithm that evaluates the potential moves by simulating possible future game scenarios. The main steps include:

  • Selection: The tree is traversed from the root to a leaf node, selecting actions that maximize the Upper Confidence Bound (UCB) score (see the code sketch after this list), defined as:

$$ U(s, a) = Q(s, a) + c_{puct} \times P(s, a) \times \frac{\sqrt{N(s)}}{1 + N(s, a)} $$

Where:

  • Q(s, a) represents the value of action a in state s.

  • P(s, a) is the prior probability of taking action a in state s, provided by the neural network.

  • N(s) is the total number of visits to state s.

  • N(s, a) is the number of times action a has been taken from state s.

  • c_{puct} controls the trade-off between exploration and exploitation.

  • Expansion: If the selected node is not fully expanded, new child nodes are created for each possible action.

  • Simulation: From the expanded node, the algorithm simulates random games until a terminal state is reached. (In the AlphaGo Zero-style setup described here, this random rollout is replaced by the neural network's value estimate of the leaf position.)

  • Backpropagation: The result of the simulation is backpropagated through the nodes, updating their statistics.
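To make the selection step concrete, here is a small Python sketch of PUCT-style action selection using the formula above. The Node structure, the default c_puct value, and the way N(s) is approximated are assumptions for illustration, not the actual MCTS class used by the game:

```python
import math

# Hypothetical node statistics, keyed by action: a sketch, not the project's MCTS class.
class Node:
    def __init__(self, prior):
        self.P = prior        # P(s, a): prior probability from the neural network
        self.N = 0            # N(s, a): visit count for this action
        self.W = 0.0          # total value accumulated through this action
        self.children = {}    # action -> Node

    def Q(self):
        return self.W / self.N if self.N > 0 else 0.0

def select_action(node, c_puct=1.0):
    """Pick the action maximizing U(s, a) = Q(s, a) + c_puct * P(s, a) * sqrt(N(s)) / (1 + N(s, a))."""
    total_visits = sum(child.N for child in node.children.values())  # approximates N(s)
    best_action, best_score = None, -float("inf")
    for action, child in node.children.items():
        u = child.Q() + c_puct * child.P * math.sqrt(total_visits) / (1 + child.N)
        if u > best_score:
            best_action, best_score = action, u
    return best_action
```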

2. Neural Network Architecture

The neural network approximates both the policy (which move to take) and the value (an estimate of the expected game outcome from the current position). The architecture, sketched in code below, is as follows:

  • Input Layer: The board state, represented as an n x n grid, is fed into the network.
  • Convolutional Layers: These layers identify spatial patterns like rows, columns, and diagonals, which are crucial in Gomoku.
  • Fully Connected Layers: The processed features are passed through dense layers to generate the policy vector pi and value v.

The network outputs:

  • Policy vector (pi): A probability distribution over all possible moves.
  • Value (v): A scalar representing the expected outcome of the game from the current state.
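Below is a minimal PyTorch sketch of a network with this shape. The layer sizes, the single-channel board input, and the use of PyTorch itself are illustrative assumptions rather than the project's actual network definition:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyValueNet(nn.Module):
    """Sketch of a policy/value network for an n x n Gomoku board (sizes are illustrative)."""
    def __init__(self, n=15, channels=64):
        super().__init__()
        self.n = n
        self.conv1 = nn.Conv2d(1, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.fc = nn.Linear(channels * n * n, 256)
        self.policy_head = nn.Linear(256, n * n)   # one logit per board cell
        self.value_head = nn.Linear(256, 1)        # scalar game-outcome estimate

    def forward(self, board):                      # board: (batch, 1, n, n)
        x = F.relu(self.conv1(board))              # convolutions pick up line patterns
        x = F.relu(self.conv2(x))
        x = F.relu(self.fc(x.flatten(1)))          # dense layer over the flattened features
        pi = F.log_softmax(self.policy_head(x), dim=1)  # log-probabilities over moves
        v = torch.tanh(self.value_head(x))              # value squashed into [-1, 1]
        return pi, v
```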

3. Self-Play and Training

Training the AI involves generating data through self-play and using it to improve the neural network. The steps, illustrated in the sketch after this list, include:

  • Self-Play: The AI plays games against itself, collecting data in the form of board states, move probabilities, and game outcomes.

  • Training: The collected data is used to train the neural network, updating its weights to better predict the best moves and the expected outcome of the game.

  • Evaluation: The updated model is evaluated by competing against previous versions. The new model is accepted if it shows improved performance.
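A condensed Python sketch of that cycle is shown below. The callables passed in (execute_episode, train, evaluate) are hypothetical placeholders standing in for the real self-play, training, and evaluation routines, not an actual API:

```python
# Hypothetical outline of the self-play / training / evaluation cycle.
# The callables passed in are placeholders for the real routines.

def learn(model, execute_episode, train, evaluate,
          num_iterations=100, games_per_iteration=50, win_threshold=0.55):
    """execute_episode(model) -> list of (board, pi, z) training examples
    train(model, examples)    -> a new candidate model
    evaluate(new, old)        -> win rate of `new` against `old` over some matches"""
    for _ in range(num_iterations):
        examples = []
        for _ in range(games_per_iteration):
            examples += execute_episode(model)          # 1. self-play: collect data
        candidate = train(model, examples)              # 2. training: fit the network
        if evaluate(candidate, model) >= win_threshold: # 3. evaluation: pit new vs. old
            model = candidate                           # accept only a clearly better model
    return model
```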

4. Mathematical Formulation of Training

The training process aims to minimize a loss function that combines the policy loss and the value loss (a short code sketch follows the symbol definitions):

$$ L(\theta) = \sum_{i=1}^{n} \left[ -\pi_i \cdot \log(p_i) + \lambda \cdot (v_i - z_i)^2 \right] $$

Where:

  • theta denotes the parameters of the neural network.
  • pi_i is the target policy for training example i (the move distribution produced by MCTS).
  • p_i is the policy predicted by the network.
  • v_i is the value predicted by the network.
  • z_i is the actual game outcome (1 for a win, -1 for a loss, 0 for a draw).
  • lambda is a weighting parameter that balances the value loss against the policy loss.
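Assuming the network returns log-probabilities and a scalar value per example, the combined loss could be computed roughly as follows (a PyTorch sketch; the tensor shapes and the default lambda are illustrative):

```python
import torch

def combined_loss(log_p, v, target_pi, z, lam=1.0):
    """Policy cross-entropy plus lambda-weighted value MSE, averaged over the batch.

    log_p:     (batch, moves) predicted log-probabilities
    v:         (batch,)       predicted values
    target_pi: (batch, moves) MCTS visit-count distributions (the targets pi_i)
    z:         (batch,)       game outcomes in {-1, 0, +1}
    """
    policy_loss = -(target_pi * log_p).sum(dim=1).mean()  # -sum_i pi_i . log(p_i)
    value_loss = ((v - z) ** 2).mean()                    # (v_i - z_i)^2
    return policy_loss + lam * value_loss
```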

The policy network outputs a probability distribution pi over the possible actions. Mathematically, the probability of selecting action a in state s is given by a softmax over the network's raw scores:

$$ \pi(a|s) = \frac{e^{f_{\theta}(s,a)}}{\sum_{b} e^{f_{\theta}(s,b)}} $$

Where:

  • f_theta(s, a) represents the score for action a in state s, as predicted by the neural network with parameters theta.
  • The denominator normalizes these scores across all possible actions b, ensuring that the probabilities sum to 1.
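A tiny NumPy sketch of this softmax over raw action scores (the example scores are made up):

```python
import numpy as np

def policy_from_scores(scores):
    """pi(a|s) = exp(f_theta(s, a)) / sum_b exp(f_theta(s, b))."""
    scores = np.asarray(scores, dtype=float)
    exp = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return exp / exp.sum()

# Example: three candidate moves with raw scores from the network.
print(policy_from_scores([2.0, 1.0, 0.1]))   # probabilities summing to 1
```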

5. Coach Class

The Coach class orchestrates the learning process, coordinating self-play, training, and evaluation to ensure continuous improvement of the model.


Code Overview

  • GomokuGame Class: Manages the board, including move generation and game state evaluation.
  • NNetWrapper Class: Interfaces with the neural network, providing methods for training, prediction, and checkpoint management.
  • GomukuNNet Class: Defines the neural network architecture used for policy and value prediction.
  • MCTS Class: Implements Monte Carlo Tree Search for decision-making.
  • Coach Class: Oversees the self-play and training process, ensuring the model continues to improve over time.
