splendor.agents.our_agents.ppo package

Subpackages

Submodules

splendor.agents.our_agents.ppo.arguments_parsing module

All things related to command-line arguments parsing for PPO training.

class splendor.agents.our_agents.ppo.arguments_parsing.Arguments[source]

Bases: TypedDict

TypedDict representing the command-line arguments.

architecture: Required[str]
device_name: Required[Literal['cuda', 'cpu', 'mps']]
learning_rate: Required[float]
opponent: Required[str]
saved_weights: Required[Path | None]
seed: Required[int]
test_opponent: Required[str]
weight_decay: Required[float]
working_dir: Required[Path]
class splendor.agents.our_agents.ppo.arguments_parsing.NeuralNetArch(name: str, ppo_factory: PPOBaseFactory, is_recurrent: bool, default_saved_weights: Path, agent_relative_import_path: str, hidden_state_dim: tuple[int, ...] | None = None)[source]

Bases: object

Dataclass storing all the essential information about a specific neural network architecture.

agent_relative_import_path: str
default_saved_weights: Path
hidden_state_dim: tuple[int, ...] | None = None
is_recurrent: bool
name: str
ppo_factory: PPOBaseFactory
splendor.agents.our_agents.ppo.arguments_parsing.parse_args() Arguments[source]

Parse command-line arguments.

Returns:

dictionary storing all the required arguments.
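
For illustration, a minimal usage sketch; only the returned keys are taken from the Arguments TypedDict above, while the actual command-line flags are defined by the parser itself:

from splendor.agents.our_agents.ppo.arguments_parsing import parse_args

# Parse sys.argv into the Arguments TypedDict documented above.
args = parse_args()
print(args["architecture"], args["device_name"], args["learning_rate"])
print(args["working_dir"], args["saved_weights"])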

splendor.agents.our_agents.ppo.common module

Collection of useful calculation functions.

splendor.agents.our_agents.ppo.common.calculate_advantages(returns: Tensor, values: Tensor, normalize: bool = True) Tensor[source]

Calculate the advantages.

Parameters:
  • returns – the returns (cumulative summation of rewards).

  • values – the value estimates for each state.

  • normalize – whether the advantages should be normalized, i.e. have zero mean and unit variance.

Returns:

the calculated advantages.
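
For illustration, a standalone sketch of the textbook advantage computation that this signature suggests (not the module's own code):

import torch

def advantages_sketch(returns: torch.Tensor, values: torch.Tensor, normalize: bool = True) -> torch.Tensor:
    # Advantage = empirical return minus the critic's value estimate.
    advantages = returns - values
    if normalize:
        # Standardize to zero mean and unit variance; the epsilon guards against
        # division by zero when the advantages are nearly constant.
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
    return advantages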

splendor.agents.our_agents.ppo.common.calculate_loss(policy_loss: Tensor, value_loss: Tensor, entropy_bonus: Tensor) Tensor[source]

Final loss of the clipped-objective PPO, as computed here: https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/ppo2/model.py#L91

Parameters:
  • policy_loss – the calculated policy loss.

  • value_loss – the calculated value loss.

  • entropy_bonus – the calculated entropy bonus.

Returns:

the PPO objective, i.e. a linear combination of both losses and the entropy bonus.
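
For illustration, a standalone sketch of such a linear combination; the coefficient values below are placeholders, not the module's constants:

import torch

VALUE_COEF = 0.5      # placeholder value-loss coefficient
ENTROPY_COEF = 0.01   # placeholder entropy coefficient

def ppo_loss_sketch(policy_loss: torch.Tensor, value_loss: torch.Tensor, entropy_bonus: torch.Tensor) -> torch.Tensor:
    # Minimize the policy and value losses while rewarding higher entropy,
    # mirroring the referenced baselines ppo2 loss.
    return policy_loss + VALUE_COEF * value_loss - ENTROPY_COEF * entropy_bonus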

splendor.agents.our_agents.ppo.common.calculate_policy_loss(action_prob: Tensor, actions: Tensor, log_prob_actions: Tensor, advantages: Tensor, ppo_clip: float) tuple[Tensor, Tensor, Tensor][source]

Calculate the clipped policy loss.

Parameters:
  • action_prob – the action probabilities.

  • actions – the actions taken.

  • log_prob_actions – the log-probabilities of the actions.

  • advantages – the advantages.

  • ppo_clip – the clipping epsilon of the PPO clipped objective.

Returns:

the policy loss, the Kullback-Leibler divergence estimate & the entropy gain.
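
For illustration, a standalone sketch of the standard PPO-Clip surrogate that this signature suggests, assuming log_prob_actions holds the rollout-time log-probabilities of the taken actions (the module's exact reductions may differ):

import torch
from torch.distributions import Categorical

def clipped_policy_loss_sketch(action_prob, actions, log_prob_actions, advantages, ppo_clip: float):
    dist = Categorical(probs=action_prob)
    new_log_prob = dist.log_prob(actions)
    # Probability ratio between the current policy and the rollout-time policy.
    ratio = (new_log_prob - log_prob_actions).exp()
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - ppo_clip, 1.0 + ppo_clip) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()
    # Simple KL estimate and entropy bonus, matching the documented return triple.
    kl_estimate = (log_prob_actions - new_log_prob).mean()
    entropy = dist.entropy().mean()
    return policy_loss, kl_estimate, entropy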

splendor.agents.our_agents.ppo.common.calculate_returns(rewards: Tensor, discount_factor: float, normalize: bool = True) Tensor[source]

Calculate the episode returns (discounted cumulative sum of the rewards).

Parameters:
  • rewards – the rewards obtained throughout each episode.

  • discount_factor – by how much rewards decay over time.

  • normalize – whether the returns should be normalized (zero mean and unit variance).

Returns:

the calculated returns.
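
For illustration, a standalone sketch of the discounted-return computation (not the module's own code):

import torch

def returns_sketch(rewards: torch.Tensor, discount_factor: float, normalize: bool = True) -> torch.Tensor:
    # Walk the episode backwards, accumulating the discounted sum of rewards.
    returns = torch.zeros_like(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + discount_factor * running
        returns[t] = running
    if normalize:
        returns = (returns - returns.mean()) / (returns.std() + 1e-8)
    return returns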

splendor.agents.our_agents.ppo.constants module

Constants relevant for MLP-based PPO and Gradient-Descent based learning.

splendor.agents.our_agents.ppo.input_norm module

Implementation of an input normalization layer.

class splendor.agents.our_agents.ppo.input_norm.InputNormalization(num_features: int, epsilon: float = 1e-08)[source]

Bases: Module

Input normalization layer that uses a running average to calibrate the mean and variance estimators.

forward(x: Tensor) Tensor[source]

Normalize the input using running mean and variance estimators. The output should have zero mean and unit variance.

Parameters:

x – the un-normalized input.

Returns:

a normalized x.
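
For illustration, one common way to build such a layer with exponential running statistics; the module's exact update rule, momentum and buffer names may differ:

import torch
from torch import nn

class RunningNormSketch(nn.Module):
    """Illustrative input-normalization layer based on running statistics."""

    def __init__(self, num_features: int, epsilon: float = 1e-08, momentum: float = 0.01):
        super().__init__()
        self.epsilon = epsilon
        self.momentum = momentum
        # Buffers: saved with the model's state dict but never trained.
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            with torch.no_grad():
                batch = x.reshape(-1, x.shape[-1])
                # Exponential moving averages of the batch statistics.
                self.running_mean = (1 - self.momentum) * self.running_mean + self.momentum * batch.mean(dim=0)
                self.running_var = (1 - self.momentum) * self.running_var + self.momentum * batch.var(dim=0, unbiased=False)
        return (x - self.running_mean) / torch.sqrt(self.running_var + self.epsilon)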

splendor.agents.our_agents.ppo.network module

Implementation of a neural network for PPO using an MLP architecture.

class splendor.agents.our_agents.ppo.network.PPO(input_dim: int, output_dim: int, hidden_layers_dims: list[int] | None = None, dropout: float = 0.2)[source]

Bases: PPOBase

Neural network, in an MLP architecture, for PPO.

forward(x: Float[Tensor, 'batch sequence features'] | Float[Tensor, 'batch features'] | Float[Tensor, 'features'], action_mask: Float[Tensor, 'batch actions'] | Float[Tensor, 'actions'], *args, **kwargs) tuple[Float[Tensor, 'batch actions'], Float[Tensor, 'batch 1'], None][source]

Pass input through the network to gain predictions.

Parameters:
  • x – the input to the network. Expected shape: one of (features,), (batch_size, features) or (batch_size, sequence_length, features).

  • action_mask – a binary masking tensor, where 1 marks a valid action and 0 marks an invalid action. Expected shape: (actions,) or (batch_size, actions), where actions equals len(ALL_ACTIONS) from Engine.Splendor.gym.envs.actions.

  • hidden_state – hidden state of the recurrent unit (unused by this MLP architecture). Expected shape: (batch_size, num_layers, hidden_state_dim) or (num_layers, hidden_state_dim).

Returns:

the action probabilities, the value estimate, and None in place of a hidden state (the MLP has no recurrent unit).
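
For illustration, a usage sketch of the documented forward signature; the feature and action dimensions below are placeholders, the real ones come from the Splendor gym environment:

import torch
from splendor.agents.our_agents.ppo.network import PPO

FEATURES, ACTIONS = 240, 81        # placeholder dimensions
net = PPO(input_dim=FEATURES, output_dim=ACTIONS)

state = torch.randn(FEATURES)      # a single, un-batched feature vector
action_mask = torch.ones(ACTIONS)  # 1 = legal action, 0 = illegal action

probs, value, hidden = net(state, action_mask)  # hidden is None for the MLP variant
action = torch.argmax(probs, dim=-1)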

splendor.agents.our_agents.ppo.ppo module

Entry-point for PPO training.

splendor.agents.our_agents.ppo.ppo.extract_game_stats(final_game_state: SplendorState, agent_id: int) list[float][source]

Extract game statistics from the final (terminal) game state.

Parameters:
  • final_game_state – the final, terminal state of the game.

  • agent_id – the ID (turn) of the PPO agent in training.

Returns:

list of the statistics values.

splendor.agents.our_agents.ppo.ppo.main() None[source]

Entry-point for the ppo console script.

splendor.agents.our_agents.ppo.ppo.save_model(model: Module, path: Path) None[source]

Save the given model's weights to a file at the given path.

Parameters:
  • model – the model whose weights should be stored.

  • path – Where to store the weights.

splendor.agents.our_agents.ppo.ppo.train(working_dir: Path = PosixPath('/home/eyal/Desktop/projects/Splendor-AI'), learning_rate: float = 1e-06, weight_decay: float = 0.0001, seed: int = 1234, device_name: str = 'cpu', transfer_learning: bool = False, saved_weights: Path | None = None, opponent: str = 'random', test_opponent: str = 'minimax', architecture: str = 'mlp') Module[source]

Train a PPO agent.

Parameters:
  • working_dir – Where to store the statistics and weights.

  • learning_rate – The learning rate of the gradient descent based learning.

  • weight_decay – L2 regularization coefficient.

  • seed – Which seed to use during training.

  • device_name – Name of the device used for mathematical computations.

  • transfer_learning – Whether the PPO agent should be initialized from pre-trained weights.

  • saved_weights – Path to the weights of a pre-trained PPO agent that will be loaded and used to initialize this training session. This argument is ignored if transfer_learning is False and required if it is True.

  • opponent – Name of the opponent agent that the PPO agent trains against.

  • test_opponent – Name of the opponent agent against which the PPO agent is evaluated.

  • architecture – Name of the PPO network architecture to use.

Returns:

The trained model (PPO agent).
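
For illustration, a call to train() using the documented keyword arguments; the working directory below is a placeholder:

from pathlib import Path
from splendor.agents.our_agents.ppo.ppo import train

model = train(
    working_dir=Path("./ppo_runs"),  # placeholder output directory
    learning_rate=1e-06,
    weight_decay=0.0001,
    seed=1234,
    device_name="cpu",
    opponent="random",
    test_opponent="minimax",
    architecture="mlp",
)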

splendor.agents.our_agents.ppo.ppo_agent module

Implementation of a PPO agent with an MLP neural network.

class splendor.agents.our_agents.ppo.ppo_agent.PPOAgent(_id: int, load_net: bool = True)[source]

Bases: PPOAgentBase

PPO agent with an MLP neural network.

SelectAction(actions: list[CollectAction | ReserveAction | BuyAction], game_state: SplendorState, game_rule: SplendorGameRule) CollectAction | ReserveAction | BuyAction[source]

Select an action to play from the given actions.

load() PPOBase[source]

Load the network with its saved weights and return it.

splendor.agents.our_agents.ppo.ppo_agent.myAgent

alias of PPOAgent
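
For illustration, a usage sketch of the agent; actions, game_state and game_rule are supplied by the Splendor game engine:

from splendor.agents.our_agents.ppo.ppo_agent import PPOAgent

agent = PPOAgent(0)  # the agent's id is its turn index; weights are loaded by default
# Inside the game loop, once the engine provides the legal actions and the state:
# chosen_action = agent.SelectAction(actions, game_state, game_rule)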

splendor.agents.our_agents.ppo.ppo_agent_base module

Definition for a base class for all PPO-based agents.

class splendor.agents.our_agents.ppo.ppo_agent_base.PPOAgentBase(_id: int, load_net: bool = True)[source]

Bases: Agent

Base class for all PPO-based agents.

abstract SelectAction(actions: list[CollectAction | ReserveAction | BuyAction], game_state: SplendorState, game_rule: SplendorGameRule) CollectAction | ReserveAction | BuyAction[source]

Select an action to play from the given actions.

abstract load() PPOBase[source]

Load the network with its saved weights and return it.

load_policy(policy: Module) None[source]

Use a given policy as the agent’s network policy.

splendor.agents.our_agents.ppo.ppo_base module

Base class for all neural networks that may be used by a PPO agent.

class splendor.agents.our_agents.ppo.ppo_base.PPOBase(input_dim: int, output_dim: int)[source]

Bases: Module, ABC

Base class for all neural networks that may be used by a PPO agent.

static create_hidden_layers(input_dim: int, hidden_layers_dims: list[int], dropout: float) Module[source]

Create hidden layers based on given dimensions.
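
For illustration, one plausible construction of such a stack (Linear, activation and Dropout per hidden layer); the module's actual activation choice may differ:

from torch import nn

def hidden_layers_sketch(input_dim: int, hidden_layers_dims: list[int], dropout: float) -> nn.Module:
    layers: list[nn.Module] = []
    prev = input_dim
    for dim in hidden_layers_dims:
        # Linear projection followed by a non-linearity and dropout regularization.
        layers += [nn.Linear(prev, dim), nn.ReLU(), nn.Dropout(dropout)]
        prev = dim
    return nn.Sequential(*layers)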

abstract forward(x: Float[Tensor, 'batch sequence features'] | Float[Tensor, 'batch features'] | Float[Tensor, 'features'], action_mask: Float[Tensor, 'batch actions'] | Float[Tensor, 'actions'], *args, **kwargs) tuple[Tensor, Tensor, Any][source]

Pass input through the network to gain predictions.

Parameters:
  • x – the input to the network. Expected shape: one of (features,), (batch_size, features) or (batch_size, sequence_length, features).

  • action_mask – a binary masking tensor, where 1 marks a valid action and 0 marks an invalid action. Expected shape: (actions,) or (batch_size, actions), where actions equals len(ALL_ACTIONS) from splendor.Splendor.gym.envs.actions.

Returns:

the action probabilities and the value estimate.

init_hidden_state(device: device) Any[source]

Return the initial hidden state to be used.

class splendor.agents.our_agents.ppo.ppo_base.PPOBaseFactory(*args, **kwargs)[source]

Bases: Protocol

Factory protocol for PPO models.

splendor.agents.our_agents.ppo.ppo_base.unused(x: Any) None[source]

Mark the given argument as unused, like casting to void in C.

splendor.agents.our_agents.ppo.rollout module

Implementation of a rollout buffer: a tracker of the values needed for learning during an episode.

class splendor.agents.our_agents.ppo.rollout.RolloutBuffer(size: int, input_dim: int, action_dim: int, is_recurrent: bool = False, hidden_states_shape: tuple[int, ...] | None = None, device: device | None = None)[source]

Bases: object

The rollout buffer.

action_dim: int
action_mask_history: Tensor
actions: Tensor
calculate_gae(discount_factor: float) tuple[Tensor, Tensor][source]

Compute the Generalized Advantage Estimation (GAE).

Parameters:

discount_factor – by how much a reward decays over time.

Returns:

the calculated advantages & returns.
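
For illustration, a standalone sketch of the standard GAE recursion; the gae_lambda parameter is an assumption here, since the documented method only exposes discount_factor:

import torch

def gae_sketch(rewards, values, dones, discount_factor: float, gae_lambda: float = 0.95):
    # delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); A_t = delta_t + gamma * lambda * A_{t+1},
    # resetting at episode boundaries and bootstrapping with zero after the last step.
    advantages = torch.zeros_like(rewards)
    next_advantage = 0.0
    next_value = 0.0
    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t].float()
        delta = rewards[t] + discount_factor * next_value * mask - values[t]
        next_advantage = delta + discount_factor * gae_lambda * next_advantage * mask
        advantages[t] = next_advantage
        next_value = values[t]
    returns = advantages + values
    return advantages, returns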

cell_states: Tensor | None
clear() None[source]

Clear the rollout buffer.

device: device | None = None
dones: Tensor
full: bool = False
hidden_states: Tensor | None
hidden_states_shape: tuple[int, ...] | None = None
index: int = 0
input_dim: int
is_recurrent: bool = False
log_prob_actions: Tensor
remember(state: Tensor, action: Tensor, action_mask: Tensor, log_prob_action: Tensor, value: float, reward: float, done: bool, hidden_state: Tensor | None = None, cell_state: Tensor | None = None) None[source]

Store essential values in the rollout buffer.

Parameters:
  • state – feature vector of a state.

  • action – the action taken in that state.

  • action_mask – the actions mask in that state.

  • log_prob_action – the log of the probabilities for each action in that state.

  • value – the value estimation of that state.

  • reward – the reward received after taking the action.

  • done – whether this is a terminal state.

  • hidden_state – the hidden state used, only relevant for recurrent PPO.

  • cell_state – the cell state used; only relevant for recurrent PPO, specifically with an LSTM.
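
For illustration, a usage sketch of the buffer during an episode; the dimensions and tensor shapes below are placeholders:

import torch
from splendor.agents.our_agents.ppo.rollout import RolloutBuffer

buffer = RolloutBuffer(size=128, input_dim=240, action_dim=81)  # placeholder dimensions

# After each environment step:
buffer.remember(
    state=torch.randn(240),
    action=torch.tensor(3),
    action_mask=torch.ones(81),
    log_prob_action=torch.randn(81),
    value=0.0,
    reward=1.0,
    done=False,
)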

rewards: Tensor
size: int
states: Tensor
unpack(discount_factor: float) tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor][source]

Unpack all the stored values from the rollout buffer.

values: Tensor

splendor.agents.our_agents.ppo.training module

Implementation of the actual training procedure of the PPO agent.

class splendor.agents.our_agents.ppo.training.LearningParams(optimizer: Optimizer, discount_factor: float, ppo_steps: int, ppo_clip: float, loss_fn: _Loss, seed: int, device: device, is_recurrent: bool, hidden_states_shape: tuple[int, ...] | None = None)[source]

Bases: object

Placeholder for various learning parameters.

discount_factor: By how much the reward decays over environment steps (turns in the game).
optimizer: Which optimizer should be used (Adam, SGD, etc.).
ppo_steps: How many gradient descent steps should be performed.
ppo_clip: Which "epsilon" to use for the policy loss clipping.
loss_fn: Which loss function should be used for the value regression (L1, L2, Huber, etc.).
device: On which device to execute the calculations.
is_recurrent: Whether the given policy incorporates a recurrent unit in its architecture. This parameter tells whether the hidden states should be ignored or not.

device: device
discount_factor: float
hidden_states_shape: tuple[int, ...] | None = None
is_recurrent: bool
loss_fn: _Loss
optimizer: Optimizer
ppo_clip: float
ppo_steps: int
seed: int
splendor.agents.our_agents.ppo.training.evaluate(env: Env, policy: Module, is_recurrent: bool, seed: int, device: device) float[source]

Evaluate the performance of the PPO agent (in training) against the test opponent.

Parameters:
  • env – The test environment, configured to simulate a game against the test opponent.

  • policy – The network of the PPO agent.

  • is_recurrent – Whether the network of the PPO agent incorporates a recurrent unit. This tells the function whether the hidden states should be used or ignored.

  • seed – the seed used by the environment.

  • device – On which device the mathematical computations should be performed.

Returns:

The reward the PPO agent collected during a single episode.
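
For illustration, a rough evaluation-loop sketch for the non-recurrent case, assuming a Gymnasium-style environment API; the "action_mask" info key is hypothetical:

import torch

def evaluate_sketch(env, policy, seed: int, device) -> float:
    policy.eval()
    obs, info = env.reset(seed=seed)
    total_reward, done = 0.0, False
    with torch.no_grad():
        while not done:
            state = torch.as_tensor(obs, dtype=torch.float32, device=device)
            mask = torch.as_tensor(info["action_mask"], dtype=torch.float32, device=device)  # hypothetical key
            probs, value, _ = policy(state, mask)
            action = int(torch.argmax(probs, dim=-1))
            obs, reward, terminated, truncated, info = env.step(action)
            done = terminated or truncated
            total_reward += float(reward)
    return total_reward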

splendor.agents.our_agents.ppo.training.train_single_episode(env: Env, policy: Module, learning_params: LearningParams) tuple[float, float, float][source]

Execute the training procedure for a single episode (game), i.e. record a complete episode trajectory (trace of a full game) and then perform multiple gradient descent steps on the policy network.

Parameters:
  • env – The environment that would be used to simulate an episode.

  • policy – The network of the PPO agent.

  • learning_params – Various learning parameters required to define the learning procedure, such as the learning rate.

Returns:

the average policy & value losses and the episode reward.

splendor.agents.our_agents.ppo.training.update_policy(policy: Module, rollout_buffer: RolloutBuffer, learning_params: LearningParams) tuple[float, float][source]

Update the policy using several gradient descent steps (via the given optimizer) on the PPO-Clip loss function.

Parameters:
  • policy – the neural network to optimize.

  • rollout_buffer – a record for a complete trajectory of an episode (trace of a full game).

  • learning_params – all arguments required to define the learning procedure.

Returns:

The average policy loss and the average value loss.
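
For illustration, a generic PPO-Clip update loop over an already-unpacked rollout; the argument names mirror the documented LearningParams fields, and the value coefficient is a placeholder:

import torch
from torch.distributions import Categorical

def ppo_update_sketch(policy, optimizer, states, action_masks, actions, old_log_probs,
                      advantages, returns, ppo_steps: int, ppo_clip: float, loss_fn):
    policy_losses, value_losses = [], []
    for _ in range(ppo_steps):
        probs, values, _ = policy(states, action_masks)
        dist = Categorical(probs=probs)
        ratio = (dist.log_prob(actions) - old_log_probs).exp()
        clipped = torch.clamp(ratio, 1.0 - ppo_clip, 1.0 + ppo_clip)
        policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()
        value_loss = loss_fn(values.squeeze(-1), returns)
        loss = policy_loss + 0.5 * value_loss  # 0.5 is an illustrative value coefficient

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        policy_losses.append(policy_loss.item())
        value_losses.append(value_loss.item())
    return sum(policy_losses) / ppo_steps, sum(value_losses) / ppo_steps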

splendor.agents.our_agents.ppo.utils module

Collection of utility functions.

splendor.agents.our_agents.ppo.utils.load_saved_model(path: Path, ppo_factory: PPOBaseFactory, *args, **kwargs) PPOBase[source]

Load saved weights of a PPO model from the given path; if no path is given, the installed weights of the PPO agent are loaded.

splendor.agents.our_agents.ppo.utils.load_saved_ppo(path: Path | None = None) PPO[source]

Load saved weights of a PPO model from the given path; if no path is given, the installed weights of the PPO agent are loaded.
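
For illustration, the two documented ways to call it; the checkpoint path below is a placeholder:

from pathlib import Path
from splendor.agents.our_agents.ppo.utils import load_saved_ppo

policy = load_saved_ppo()                               # installed (bundled) weights
policy = load_saved_ppo(Path("weights/ppo_model.pth"))  # placeholder checkpoint path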

Module contents