splendor.agents.our_agents.ppo package

Subpackages

Submodules

splendor.agents.our_agents.ppo.arguments_parsing module

Everything related to command-line argument parsing for PPO training.

class splendor.agents.our_agents.ppo.arguments_parsing.Arguments[source]

Bases: TypedDict

TypedDict representing the command-line arguments.

architecture: Required[str]
device_name: Required[Literal['cuda', 'cpu', 'mps']]
learning_rate: Required[float]
opponent: Required[str]
saved_weights: Required[Path | None]
seed: Required[int]
test_opponent: Required[str]
weight_decay: Required[float]
working_dir: Required[Path]
class splendor.agents.our_agents.ppo.arguments_parsing.NeuralNetArch(name: str, ppo_factory: PPOBaseFactory, is_recurrent: bool, default_saved_weights: Path, agent_relative_import_path: str, hidden_state_dim: Tuple[int, ...] | None = None)[source]

Bases: object

Dataclass storing all essential information about a specific neural network architecture.

agent_relative_import_path: str
default_saved_weights: Path
hidden_state_dim: Tuple[int, ...] | None = None
is_recurrent: bool
name: str
ppo_factory: PPOBaseFactory
splendor.agents.our_agents.ppo.arguments_parsing.parse_args() Arguments[source]

Parse command-line arguments.

Returns:

A dictionary storing all the required arguments.
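
A minimal, illustrative driver showing how the parsed Arguments TypedDict might be consumed; parse_args() and the printed keys come from this module, everything else is hypothetical:

    from splendor.agents.our_agents.ppo.arguments_parsing import parse_args

    def run() -> None:
        args = parse_args()
        # Arguments is a TypedDict, so the fields are accessed as dictionary keys.
        print(f"architecture:  {args['architecture']}")
        print(f"device:        {args['device_name']}")
        print(f"learning rate: {args['learning_rate']}, seed: {args['seed']}")

    if __name__ == "__main__":
        run()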

splendor.agents.our_agents.ppo.common module

Collection of useful calculation functions.

splendor.agents.our_agents.ppo.common.calculate_advantages(returns: Tensor, values: Tensor, normalize: bool = True) Tensor[source]

Calculate the advantages.

Parameters:
  • returns – the returns (cumulative summation of rewards).

  • values – the value estimates for each state.

  • normalize – whether the advantages should be normalized, i.e. have zero mean and unit variance.

Returns:

the calculated advantages.
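
A common way to compute these advantages is the gap between the observed returns and the value estimates, optionally standardized; the sketch below illustrates that formula and is not necessarily the module's exact implementation:

    import torch

    def calculate_advantages_sketch(
        returns: torch.Tensor, values: torch.Tensor, normalize: bool = True
    ) -> torch.Tensor:
        """Advantages as the gap between observed returns and value estimates."""
        advantages = returns - values
        if normalize:
            # Standardize to zero mean and unit variance; the small epsilon
            # guards against division by zero when the advantages are constant.
            advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        return advantages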

splendor.agents.our_agents.ppo.common.calculate_loss(policy_loss: Tensor, value_loss: Tensor, entropy_bonus: Tensor) Tensor[source]

Final loss of the PPO clipped objective, as seen here: https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/ppo2/model.py#L91

Parameters:
  • policy_loss – the calculated policy loss.

  • value_loss – the calculated value loss.

  • entropy_bonus – the calculated entropy bonus.

Returns:

the PPO objective, i.e. a linear combination of those losses & entropy bonus.
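
In the linked baselines code the combination is the policy loss plus a weighted value loss minus a weighted entropy bonus; the sketch below mirrors that shape, but the coefficient names and values are placeholders rather than this module's constants:

    import torch

    # Placeholder coefficients; the real ones would be defined by the package
    # (e.g. in splendor.agents.our_agents.ppo.constants).
    VALUE_COEFFICIENT = 0.5
    ENTROPY_COEFFICIENT = 0.01

    def calculate_loss_sketch(
        policy_loss: torch.Tensor,
        value_loss: torch.Tensor,
        entropy_bonus: torch.Tensor,
    ) -> torch.Tensor:
        """Linear combination in the spirit of the baselines PPO2 objective."""
        return policy_loss + VALUE_COEFFICIENT * value_loss - ENTROPY_COEFFICIENT * entropy_bonus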

splendor.agents.our_agents.ppo.common.calculate_policy_loss(action_prob: Tensor, actions: Tensor, log_prob_actions: Tensor, advantages: Tensor, ppo_clip) Tuple[Tensor, Tensor, Tensor][source]

Calculate the clipped policy loss.

Parameters:
  • action_prob – the action probabilities produced by the current policy.

  • actions – the actions taken during the recorded episode.

  • log_prob_actions – the log-probabilities of those actions under the policy that generated them.

  • advantages – the advantage estimates.

  • ppo_clip – the clipping coefficient (“epsilon”) of the clipped objective.

Returns:

the policy loss, the Kullback–Leibler divergence estimate & the entropy gain.
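
The sketch below shows the textbook clipped-surrogate formulation using these parameter names; the internals (ratio computation, the particular KL estimator, the assumed shape of log_prob_actions) are assumptions, not necessarily this module's exact code:

    from typing import Tuple

    import torch
    import torch.distributions as distributions

    def clipped_policy_loss_sketch(
        action_prob: torch.Tensor,       # new action probabilities, shape (batch, actions)
        actions: torch.Tensor,           # actions taken during the rollout, shape (batch,)
        log_prob_actions: torch.Tensor,  # old log-probabilities of those actions, shape (batch,)
        advantages: torch.Tensor,        # advantage estimates, shape (batch,)
        ppo_clip: float,
    ) -> Tuple[torch.Tensor, torch.Tensor, torch.Tensor]:
        dist = distributions.Categorical(probs=action_prob)
        new_log_prob = dist.log_prob(actions)

        # Probability ratio between the new and the old policy.
        ratio = (new_log_prob - log_prob_actions).exp()
        unclipped = ratio * advantages
        clipped = torch.clamp(ratio, 1.0 - ppo_clip, 1.0 + ppo_clip) * advantages
        policy_loss = -torch.min(unclipped, clipped).mean()

        # A simple KL-divergence estimate and the entropy of the new policy.
        kl_estimate = (log_prob_actions - new_log_prob).mean()
        entropy = dist.entropy().mean()
        return policy_loss, kl_estimate, entropy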

splendor.agents.our_agents.ppo.common.calculate_returns(rewards: Tensor, discount_factor: float, normalize: bool = True) Tensor[source]

Calculate the episode returns (cumulative summation of the rewards).

Parameters:
  • rewards – the rewards obtained throughout each episode.

  • discount_factor – by how much rewards decay over time.

  • normalize – whether the returns should be normalized (zero mean and unit variance).

Returns:

the calculated returns.
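
A sketch of the usual back-to-front computation of discounted returns, under the same normalization convention as above (illustrative, not necessarily the exact implementation):

    import torch

    def calculate_returns_sketch(
        rewards: torch.Tensor, discount_factor: float, normalize: bool = True
    ) -> torch.Tensor:
        """Discounted cumulative sum of the rewards, computed back to front."""
        returns = torch.zeros_like(rewards)
        running = torch.tensor(0.0)
        for t in reversed(range(len(rewards))):
            running = rewards[t] + discount_factor * running
            returns[t] = running
        if normalize:
            returns = (returns - returns.mean()) / (returns.std() + 1e-8)
        return returns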

splendor.agents.our_agents.ppo.constants module

Constants relevant for MLP-based PPO and Gradient-Descent based learning.

splendor.agents.our_agents.ppo.input_norm module

Implementation of an input normalization layer.

class splendor.agents.our_agents.ppo.input_norm.InputNormalization(num_features: int, epsilon: float = 1e-08)[source]

Bases: Module

Input normalization layer that uses running averages to calibrate the mean & variance.

forward(x: Tensor) Tensor[source]

Normalize the input using running mean & variance estimators. The output should have zero mean and unit variance.

Parameters:

x – the un-normalized input.

Returns:

a normalized x.
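
A sketch of what such a layer typically looks like, assuming batched input of shape (batch, features) and standard running-average updates; the buffer names and the momentum value are assumptions, not the actual implementation:

    import torch
    from torch import nn

    class InputNormalizationSketch(nn.Module):
        """Running-statistics input normalizer (illustrative only)."""

        def __init__(self, num_features: int, epsilon: float = 1e-08, momentum: float = 0.01):
            super().__init__()
            self.epsilon = epsilon
            self.momentum = momentum
            self.register_buffer("running_mean", torch.zeros(num_features))
            self.register_buffer("running_var", torch.ones(num_features))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            if self.training:
                # Nudge the running estimates toward the current batch statistics.
                batch_mean = x.mean(dim=0).detach()
                batch_var = x.var(dim=0, unbiased=False).detach()
                self.running_mean.lerp_(batch_mean, self.momentum)
                self.running_var.lerp_(batch_var, self.momentum)
            return (x - self.running_mean) / torch.sqrt(self.running_var + self.epsilon)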

splendor.agents.our_agents.ppo.network module

Implementation of a neural network for PPO using an MLP architecture.

class splendor.agents.our_agents.ppo.network.PPO(input_dim: int, output_dim: int, hidden_layers_dims: List[int] | None = None, dropout: float = 0.2)[source]

Bases: PPOBase

Neural network for PPO, using an MLP architecture.

forward(x: Float[Tensor, 'batch sequence features'] | Float[Tensor, 'batch features'] | Float[Tensor, 'features'], action_mask: Float[Tensor, 'batch actions'] | Float[Tensor, 'actions'], *args, **kwargs) Tuple[Float[Tensor, 'batch actions'], Float[Tensor, 'batch 1'], None][source]

Pass input through the network to gain predictions.

Parameters:
  • x – the input to the network. Expected shape: one of (features,), (batch_size, features) or (batch_size, sequence_length, features).

  • action_mask – a binary masking tensor in which 1 marks a valid action and 0 an invalid one. Expected shape: (actions,) or (batch_size, actions), where actions equals len(ALL_ACTIONS) from Engine.Splendor.gym.envs.actions.

  • hidden_state – hidden state of the recurrent unit (not used by this MLP architecture). Expected shape: (batch_size, num_layers, hidden_state_dim) or (num_layers, hidden_state_dim).

Returns:

the action probabilities and the value estimate.
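
A minimal usage sketch: instantiating the MLP policy and running a single unbatched forward pass. The feature and action dimensions below are placeholders; the real values come from the Splendor gym environment:

    import torch

    from splendor.agents.our_agents.ppo.network import PPO

    FEATURES, ACTIONS = 240, 81  # placeholder dimensions

    policy = PPO(input_dim=FEATURES, output_dim=ACTIONS)
    policy.eval()

    state = torch.randn(FEATURES)
    action_mask = torch.ones(ACTIONS)  # every action legal in this toy example

    with torch.no_grad():
        action_probs, value, _ = policy(state, action_mask)
    chosen_action = int(action_probs.argmax())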

splendor.agents.our_agents.ppo.ppo module

Entry-point for PPO training.

splendor.agents.our_agents.ppo.ppo.extract_game_stats(final_game_state: SplendorState, agent_id: int) List[float][source]

Extract game statistics from the final (terminal) game state.

Parameters:
  • final_game_state – the final, terminal state of the game.

  • agent_id – the ID (turn) of the PPO agent in training.

Returns:

list of the statistics values.

splendor.agents.our_agents.ppo.ppo.main()[source]

Entry-point for the ppo console script.

splendor.agents.our_agents.ppo.ppo.save_model(model: Module, path: Path)[source]

Save given model weights into a file at given path.

Parameters:
  • model – the model whose weights should be stored.

  • path – Where to store the weights.
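
A hedged sketch of such a helper: saving only the state dict with torch.save, which matches the documented behaviour but may differ in detail from the actual implementation:

    from pathlib import Path

    import torch
    from torch.nn import Module

    def save_model_sketch(model: Module, path: Path) -> None:
        # Persist only the weights; loading later requires an instance of the
        # same architecture followed by load_state_dict().
        path.parent.mkdir(parents=True, exist_ok=True)
        torch.save(model.state_dict(), path)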

splendor.agents.our_agents.ppo.ppo.train(working_dir: Path = PosixPath('/home/eyal/Desktop/projects/Splendor-AI'), learning_rate: float = 1e-06, weight_decay: float = 0.0001, seed: int = 1234, device_name: str = 'cpu', transfer_learning: bool = False, saved_weights: Path | None = None, opponent: str = 'random', test_opponent: str = 'minimax', architecture: str = 'mlp') Module[source]

Train a PPO agent.

Parameters:
  • working_dir – Where to store the statistics and weights.

  • learning_rate – The learning rate of the gradient descent based learning.

  • weight_decay – L2 regularization coefficient.

  • seed – Which seed to use during training.

  • device_name – Name of the device used for mathematical computations.

  • transfer_learning – Whether the PPO agent should be initialized from pre-trained weights.

  • saved_weights – Path to the weights of a pre-trained PPO agent, loaded and used as initialization for this training session. This argument is ignored if transfer_learning is False and required if transfer_learning is True.

  • opponent – Name of the opponent agent that the PPO trains against.

  • test_opponent – Name of the test opponent agent that the PPO is evaluated against.

  • architecture – Name of the PPO network architecture to use.

Returns:

The trained model (PPO agent).
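
An illustrative invocation of the training entry point; the working directory, output path, and hyper-parameter values are placeholders:

    from pathlib import Path

    from splendor.agents.our_agents.ppo.ppo import save_model, train

    model = train(
        working_dir=Path("./ppo_runs"),
        learning_rate=1e-6,
        seed=1234,
        device_name="cpu",
        opponent="random",
        test_opponent="minimax",
        architecture="mlp",
    )
    save_model(model, Path("./ppo_runs/ppo_model.pth"))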

splendor.agents.our_agents.ppo.ppo_agent module

Implementation of a PPO agent with an MLP neural network.

class splendor.agents.our_agents.ppo.ppo_agent.PPOAgent(_id: int, load_net: bool = True)[source]

Bases: PPOAgentBase

PPO agent with an MLP neural network.

SelectAction(actions: List[CollectAction | ReserveAction | BuyAction], game_state: SplendorState, game_rule: SplendorGameRule) CollectAction | ReserveAction | BuyAction[source]

Select an action to play from the given actions.

load() PPOBase[source]

Load the network weights and return the network.

splendor.agents.our_agents.ppo.ppo_agent.myAgent

alias of PPOAgent

splendor.agents.our_agents.ppo.ppo_agent_base module

Definition for a base class for all PPO-based agents.

class splendor.agents.our_agents.ppo.ppo_agent_base.PPOAgentBase(_id: int, load_net: bool = True)[source]

Bases: Agent

Base class for all PPO-based agents.

abstract SelectAction(actions: List[CollectAction | ReserveAction | BuyAction], game_state: SplendorState, game_rule: SplendorGameRule) CollectAction | ReserveAction | BuyAction[source]

Select an action to play from the given actions.

abstract load() PPOBase[source]

Load the network weights and return the network.

load_policy(policy: Module)[source]

Use a given policy as the agent’s network policy.

splendor.agents.our_agents.ppo.ppo_base module

Base class for all neural networks meant to be used by a PPO agent.

class splendor.agents.our_agents.ppo.ppo_base.PPOBase(input_dim: int, output_dim: int)[source]

Bases: Module, ABC

Base class for all neural networks meant to be used by a PPO agent.

classmethod create_hidden_layers(input_dim: int, hidden_layers_dims: List[int], dropout: float) Module[source]

Create hidden layers based on the given dimensions.

abstract forward(x: Float[Tensor, 'batch sequence features'] | Float[Tensor, 'batch features'] | Float[Tensor, 'features'], action_mask: Float[Tensor, 'batch actions'] | Float[Tensor, 'actions'], *args, **kwargs) Tuple[Tensor, Tensor, Any][source]

Pass input through the network to gain predictions.

Parameters:
  • x – the input to the network. Expected shape: one of (features,), (batch_size, features) or (batch_size, sequence_length, features).

  • action_mask – a binary masking tensor in which 1 marks a valid action and 0 an invalid one. Expected shape: (actions,) or (batch_size, actions), where actions equals len(ALL_ACTIONS) from splendor.Splendor.gym.envs.actions.

Returns:

the action probabilities and the value estimate.

init_hidden_state(device: device) Any[source]

Return the initial hidden state to be used.

class splendor.agents.our_agents.ppo.ppo_base.PPOBaseFactory(*args, **kwargs)[source]

Bases: Protocol

Factory protocol for PPO models.

splendor.agents.our_agents.ppo.rollout module

Implementation of a rollout buffer: a tracker of values essential for learning, collected during an episode.

class splendor.agents.our_agents.ppo.rollout.RolloutBuffer(size: int, input_dim: int, action_dim: int, is_recurrent: bool = False, hidden_states_shape: Tuple[int, ...] | None = None, device: device | None = None)[source]

Bases: object

The rollout buffer.

action_dim: int
action_mask_history: Tensor
actions: Tensor
calculate_gae(discount_factor: float) Tuple[Tensor, Tensor][source]

Compute the Generalized Advantage Estimation (GAE).

Parameters:

discount_factor – by how much a reward decays over time.

Returns:

the calculated advantages & returns.
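
A sketch of the standard GAE recursion over the buffer's rewards, value estimates, and done flags; the lambda smoothing parameter and the handling of terminal states are assumptions, not necessarily what calculate_gae does internally:

    from typing import Tuple

    import torch

    def calculate_gae_sketch(
        rewards: torch.Tensor,
        values: torch.Tensor,
        dones: torch.Tensor,
        discount_factor: float,
        gae_lambda: float = 0.95,  # assumed smoothing parameter
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        advantages = torch.zeros_like(rewards)
        gae = torch.tensor(0.0)
        for t in reversed(range(len(rewards))):
            next_value = values[t + 1] if t + 1 < len(values) else torch.tensor(0.0)
            non_terminal = 1.0 - dones[t].float()
            delta = rewards[t] + discount_factor * next_value * non_terminal - values[t]
            gae = delta + discount_factor * gae_lambda * non_terminal * gae
            advantages[t] = gae
        returns = advantages + values
        return advantages, returns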

cell_states: Tensor | None
clear()[source]

Clear the rollout buffer.

device: device | None = None
dones: Tensor
full: bool = False
hidden_states: Tensor | None
hidden_states_shape: Tuple[int, ...] | None = None
index: int = 0
input_dim: int
is_recurrent: bool = False
log_prob_actions: Tensor
remember(state: Tensor, action: Tensor, action_mask: Tensor, log_prob_action: Tensor, value: float, reward: float, done: bool, hidden_state: Tensor | None = None, cell_state: Tensor | None = None)[source]

Store essential values in the rollout buffer.

Parameters:
  • state – feature vector of a state.

  • action – the action taken in that state.

  • action_mask – the action mask in that state.

  • log_prob_action – the log of the probabilities for each action in that state.

  • value – the value estimation of that state.

  • reward – the reward given after taking the action.

  • done – whether this is a terminal state.

  • hidden_state – the hidden state used; only relevant for recurrent PPO.

  • cell_state – the cell state used; only relevant for recurrent PPO, specifically for an LSTM.

rewards: Tensor
size: int
states: Tensor
unpack(discount_factor: float) Tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor][source]

Unpack all the stored values from the rollout buffer.

values: Tensor
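
A hedged sketch of the intended workflow: record each transition with remember() during an episode, then unpack the trajectory for the update and clear the buffer. The tensor contents and sizes below are dummies:

    import torch

    from splendor.agents.our_agents.ppo.rollout import RolloutBuffer

    # Placeholder sizes; the real ones come from the environment and the network.
    buffer = RolloutBuffer(size=1024, input_dim=240, action_dim=81)

    # During an episode, every transition is recorded:
    buffer.remember(
        state=torch.randn(240),
        action=torch.tensor(3),
        action_mask=torch.ones(81),
        log_prob_action=torch.log_softmax(torch.randn(81), dim=0),
        value=0.1,
        reward=0.0,
        done=False,
    )

    # After the episode, the stored trajectory is retrieved for the update step:
    unpacked = buffer.unpack(discount_factor=0.99)
    buffer.clear()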

splendor.agents.our_agents.ppo.training module

Implementation of the actual training of the PPO.

class splendor.agents.our_agents.ppo.training.LearningParams(optimizer: Optimizer, discount_factor: float, ppo_steps: int, ppo_clip: float, loss_fn: _Loss, seed: int, device: device, is_recurrent: bool, hidden_states_shape: Tuple[int, ...] | None = None)[source]

Bases: object

Placeholder for various learning parameters.

  • discount_factor: by how much the reward decays over environment steps (turns in the game).

  • optimizer: which optimizer should be used (Adam, SGD, etc.).

  • ppo_steps: how many gradient descent steps should be performed.

  • ppo_clip: which “epsilon” to use for the policy-loss clipping.

  • loss_fn: which loss function should be used as the loss of the value regression (L1, L2, Huber, etc.).

  • device: on which device to execute the calculations.

  • is_recurrent: whether the given policy incorporates a recurrent unit in its architecture. This parameter tells whether the hidden states should be ignored or not.

device: device
discount_factor: float
hidden_states_shape: Tuple[int, ...] | None = None
is_recurrent: bool
loss_fn: _Loss
optimizer: Optimizer
ppo_clip: float
ppo_steps: int
seed: int
splendor.agents.our_agents.ppo.training.evaluate(env: Env, policy: Module, is_recurrent: bool, seed: int, device: device) float[source]

Evaluate the performance of the PPO agent (in training) against the test opponent.

Parameters:
  • env – The test environment, configured to simulate a game against the test opponent.

  • policy – The network of the PPO agent.

  • is_recurrent – Whether the network of the PPO agent incorporates a recurrent unit. This signals to the function whether the hidden states should be used or ignored.

  • seed – the seed used by the environment.

  • device – On which device the mathematical computations should be performed.

Returns:

The reward the PPO agent collected during a single episode.

splendor.agents.our_agents.ppo.training.train_single_episode(env: Env, policy: Module, learning_params: LearningParams) Tuple[float, float, float][source]

Execute the training procedure for a single episode (game), i.e. record a complete episode trajectory (trace of a full game) and then perform multiple gradient descent steps on the policy network.

Parameters:
  • env – The environment that would be used to simulate an episode.

  • policy – The network of the PPO agent.

  • learning_params – Various learning parameters required to define the learning procedure, such as the learning rate.

Returns:

the average policy & value losses and the episode reward.

splendor.agents.our_agents.ppo.training.update_policy(policy: Module, rollout_buffer: RolloutBuffer, learning_params: LearningParams) Tuple[float, float][source]

Update the policy using several gradient descent steps (via the given optimizer) on the PPO-Clip loss function.

Parameters:
  • policy – the neural network to optimize.

  • rollout_buffer – a record for a complete trajectory of an episode (trace of a full game).

  • learning_params – all arguments required to define the learning procedure.

Returns:

The average policy loss and the average value loss.
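
A conceptual sketch of such an update loop built from this package's helpers (calculate_policy_loss and calculate_loss); it assumes the rollout tensors have already been extracted and omits the recurrent-state handling, so it compresses rather than reproduces the actual implementation:

    from typing import Tuple

    import torch

    from splendor.agents.our_agents.ppo.common import calculate_loss, calculate_policy_loss

    def update_policy_sketch(
        policy, optimizer, loss_fn,
        states, action_masks, actions, old_log_prob_actions, advantages, returns,
        ppo_steps: int, ppo_clip: float,
    ) -> Tuple[float, float]:
        total_policy_loss = 0.0
        total_value_loss = 0.0
        for _ in range(ppo_steps):
            # Re-evaluate the current policy on the recorded states.
            action_prob, values, _ = policy(states, action_masks)
            values = values.squeeze(-1)

            policy_loss, _, entropy = calculate_policy_loss(
                action_prob, actions, old_log_prob_actions, advantages, ppo_clip
            )
            value_loss = loss_fn(values, returns)
            loss = calculate_loss(policy_loss, value_loss, entropy)

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            total_policy_loss += float(policy_loss.detach())
            total_value_loss += float(value_loss.detach())
        return total_policy_loss / ppo_steps, total_value_loss / ppo_steps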

splendor.agents.our_agents.ppo.utils module

Collection of utility functions.

splendor.agents.our_agents.ppo.utils.load_saved_model(path: Path, ppo_factory: PPOBaseFactory, *args, **kwargs) PPOBase[source]

Load saved weights of a PPO model from a given path; if no path was given, the installed weights of the PPO agent will be loaded.

splendor.agents.our_agents.ppo.utils.load_saved_ppo(path: Path | None = None) PPO[source]

Load saved weights of a PPO model from a given path; if no path is given, the installed weights of the PPO agent will be loaded.
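
A short usage sketch; the explicit path below is a placeholder:

    from pathlib import Path

    from splendor.agents.our_agents.ppo.utils import load_saved_ppo

    default_policy = load_saved_ppo()  # uses the installed (packaged) weights
    custom_policy = load_saved_ppo(Path("./ppo_runs/ppo_model.pth"))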

Module contents