splendor.agents.our_agents.ppo package
Subpackages
- splendor.agents.our_agents.ppo.ppo_rnn package
- splendor.agents.our_agents.ppo.self_attn package
Submodules
splendor.agents.our_agents.ppo.arguments_parsing module
All things related to command-line argument parsing for PPO training.
- class splendor.agents.our_agents.ppo.arguments_parsing.Arguments[source]
Bases: TypedDict
TypedDict representing the command-line arguments.
- architecture: Required[str]
- device_name: Required[Literal['cuda', 'cpu', 'mps']]
- learning_rate: Required[float]
- opponent: Required[str]
- saved_weights: Required[Path | None]
- seed: Required[int]
- test_opponent: Required[str]
- weight_decay: Required[float]
- working_dir: Required[Path]
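For reference, a minimal sketch of populating this TypedDict; the values are illustrative only and mirror the defaults of `train` documented below.

```python
from pathlib import Path

from splendor.agents.our_agents.ppo.arguments_parsing import Arguments

# Illustrative values only; they mirror the defaults of `train` documented below.
args: Arguments = {
    "architecture": "mlp",
    "device_name": "cpu",
    "learning_rate": 1e-6,
    "opponent": "random",
    "saved_weights": None,
    "seed": 1234,
    "test_opponent": "minimax",
    "weight_decay": 1e-4,
    "working_dir": Path("."),
}
```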
- class splendor.agents.our_agents.ppo.arguments_parsing.NeuralNetArch(name: str, ppo_factory: PPOBaseFactory, is_recurrent: bool, default_saved_weights: Path, agent_relative_import_path: str, hidden_state_dim: tuple[int, ...] | None = None)[source]
Bases: object
Dataclass for storing all essential information regarding a specific neural network architecture.
- agent_relative_import_path: str
- default_saved_weights: Path
- is_recurrent: bool
- name: str
- ppo_factory: PPOBaseFactory
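A hedged sketch of registering an architecture with this dataclass; the weights filename, the import path, and the use of `PPO` as the `PPOBaseFactory` are assumptions made for illustration.

```python
from pathlib import Path

from splendor.agents.our_agents.ppo.arguments_parsing import NeuralNetArch
from splendor.agents.our_agents.ppo.network import PPO

# Assumption: the PPO class (the MLP network documented below) can serve as the
# PPOBaseFactory; the weights filename and import path are hypothetical.
mlp_arch = NeuralNetArch(
    name="mlp",
    ppo_factory=PPO,
    is_recurrent=False,
    default_saved_weights=Path("ppo_model.pth"),
    agent_relative_import_path="ppo.ppo_agent",
)
```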
splendor.agents.our_agents.ppo.common module
Collection of useful calculation functions.
- splendor.agents.our_agents.ppo.common.calculate_advantages(returns: Tensor, values: Tensor, normalize: bool = True) Tensor [source]
Calculate the advantages.
- Parameters:
returns – the returns (cumulative summation of rewards).
values – the value estimates for each state.
normalize – whether the advantages should be normalized, i.e. have zero mean and unit variance.
- Returns:
the calculated advantages.
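A minimal usage sketch with toy tensors (real returns and value estimates come from a recorded episode):

```python
import torch

from splendor.agents.our_agents.ppo.common import calculate_advantages

# Toy returns and value estimates for a 3-step episode.
returns = torch.tensor([0.9801, 0.99, 1.0])
values = torch.tensor([0.8, 0.7, 0.6])
advantages = calculate_advantages(returns, values, normalize=True)
```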
- splendor.agents.our_agents.ppo.common.calculate_loss(policy_loss: Tensor, value_loss: Tensor, entropy_bonus: Tensor) Tensor [source]
Final loss of the clipped-objective PPO, as seen here: https://github.com/openai/baselines/blob/ea25b9e8b234e6ee1bca43083f8f3cf974143998/baselines/ppo2/model.py#L91
- Parameters:
policy_loss – the calculated policy loss.
value_loss – the calculated value loss.
entropy_bonus – the calculated entropy bonus.
- Returns:
the PPO objective, i.e. a linear combination of those losses & entropy bonus.
- splendor.agents.our_agents.ppo.common.calculate_policy_loss(action_prob: Tensor, actions: Tensor, log_prob_actions: Tensor, advantages: Tensor, ppo_clip: float) tuple[Tensor, Tensor, Tensor] [source]
Calculate the clipped policy loss.
- Parameters:
action_prob – the action probabilities.
actions – the actions taken.
log_prob_actions – the log-probabilities of the actions.
advantages – the advantages.
ppo_clip – the PPO clipped objective clipping epsilon.
- Returns:
the policy loss, the Kullback-Leibler divergence estimate & the entropy gain.
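A toy call sketch; the tensor shapes (a batch of steps with a full action distribution per step) are assumptions, and `log_prob_actions` here stands for the log-probabilities recorded when the actions were taken.

```python
import torch

from splendor.agents.our_agents.ppo.common import calculate_policy_loss

# Toy batch of 4 steps over 3 possible actions; shapes are assumptions.
action_prob = torch.softmax(torch.randn(4, 3), dim=-1)
actions = torch.tensor([0, 2, 1, 0])
# Log-probabilities of the taken actions, as recorded at rollout time.
log_prob_actions = torch.log(action_prob[torch.arange(4), actions])
advantages = torch.randn(4)

policy_loss, kl_estimate, entropy = calculate_policy_loss(
    action_prob, actions, log_prob_actions, advantages, ppo_clip=0.2
)
```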
- splendor.agents.our_agents.ppo.common.calculate_returns(rewards: Tensor, discount_factor: float, normalize: bool = True) Tensor [source]
Calculate episode returns (the discounted cumulative sum of the rewards).
- Parameters:
rewards – the rewards obtained throughout each episode.
discount_factor – by how much rewards decay over time.
normalize – whether the returns should be normalized (zero mean and unit variance).
- Returns:
the calculated returns.
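A short sketch chaining the two helpers above for a toy episode (the rewards are made up):

```python
import torch

from splendor.agents.our_agents.ppo.common import (
    calculate_advantages,
    calculate_returns,
)

rewards = torch.tensor([0.0, 0.0, 1.0])  # toy episode: reward only at the end
returns = calculate_returns(rewards, discount_factor=0.99, normalize=False)
# With gamma = 0.99 the standard discounted return would be [0.9801, 0.99, 1.0].
values = torch.tensor([0.8, 0.7, 0.6])   # toy value estimates
advantages = calculate_advantages(returns, values)
```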
splendor.agents.our_agents.ppo.constants module
Constants relevant to MLP-based PPO and gradient-descent based learning.
splendor.agents.our_agents.ppo.input_norm module
Implementation of an input normalization layer.
splendor.agents.our_agents.ppo.network module
Implementation of neural network for PPO using MLP architecture.
- class splendor.agents.our_agents.ppo.network.PPO(input_dim: int, output_dim: int, hidden_layers_dims: list[int] | None = None, dropout: float = 0.2)[source]
Bases: PPOBase
Neural network with an MLP architecture for PPO.
- forward(x: Float[Tensor, 'batch sequence features'] | Float[Tensor, 'batch features'] | Float[Tensor, 'features'], action_mask: Float[Tensor, 'batch actions'] | Float[Tensor, 'actions'], *args, **kwargs) tuple[Float[Tensor, 'batch actions'], Float[Tensor, 'batch 1'], None] [source]
Pass input through the network to gain predictions.
- Parameters:
x – the input to the network. Expected shape: one of (features,), (batch_size, features) or (batch_size, sequence_length, features).
action_mask – a binary masking tensor where 1 marks a valid action and 0 an invalid one. Expected shape: (actions,) or (batch_size, actions), where actions equals len(ALL_ACTIONS) from Engine.Splendor.gym.envs.actions.
hidden_state – hidden state of the recurrent unit. Expected shape: (batch_size, num_layers, hidden_state_dim) or (num_layers, hidden_state_dim).
- Returns:
the action probabilities and the value estimate.
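A usage sketch of the MLP network; the dimensions are illustrative (the real feature and action counts come from the Splendor gym environment):

```python
import torch

from splendor.agents.our_agents.ppo.network import PPO

net = PPO(input_dim=16, output_dim=10)      # illustrative dimensions
x = torch.randn(4, 16)                      # (batch_size, features)
action_mask = torch.ones(4, 10)             # all actions valid in this toy example
action_probs, value, hidden = net(x, action_mask)
# For this MLP variant the third returned element is None (no recurrent hidden state).
```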
splendor.agents.our_agents.ppo.ppo module
Entry point for PPO training.
- splendor.agents.our_agents.ppo.ppo.extract_game_stats(final_game_state: SplendorState, agent_id: int) list[float] [source]
Extract game statistics from the final (terminal) game state.
- Parameters:
final_game_state – the final, terminal state of the game.
agent_id – the ID (turn) of the PPO agent in training.
- Returns:
list of the statistics values.
- splendor.agents.our_agents.ppo.ppo.save_model(model: Module, path: Path) None [source]
Save the given model's weights to a file at the given path.
- Parameters:
model – the model whose weights should be stored.
path – Where to store the weights.
- splendor.agents.our_agents.ppo.ppo.train(working_dir: Path = PosixPath('/home/eyal/Desktop/projects/Splendor-AI'), learning_rate: float = 1e-06, weight_decay: float = 0.0001, seed: int = 1234, device_name: str = 'cpu', transfer_learning: bool = False, saved_weights: Path | None = None, opponent: str = 'random', test_opponent: str = 'minimax', architecture: str = 'mlp') Module [source]
Train a PPO agent.
- Parameters:
working_dir – Where to store the statistics and weights.
learning_rate – The learning rate of the gradient descent based learning.
weight_decay – L2 regularization coefficient.
seed – Which seed to use during training.
device_name – Name of the device used for mathematical computations.
transfer_learning – Whether or not the PPO agent should be initialized from pre-trained weights.
saved_weights – Path to the weights of a pre-trained PPO agent to load and use as the initialization for this training session. This argument is ignored if transfer_learning is False and required if transfer_learning is True.
opponent – Name of the opponent agent the PPO trains against.
test_opponent – Name of the test opponent the PPO is evaluated against.
architecture – Name of the PPO network architecture to use.
- Returns:
The trained model (PPO agent).
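A minimal call sketch relying mostly on the documented defaults; the output directory is hypothetical.

```python
from pathlib import Path

from splendor.agents.our_agents.ppo.ppo import train

model = train(
    working_dir=Path("runs/ppo"),  # hypothetical output directory
    learning_rate=1e-6,
    seed=1234,
    opponent="random",
    test_opponent="minimax",
    architecture="mlp",
)
```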
splendor.agents.our_agents.ppo.ppo_agent module
Implementation of a PPO agent with MLP neural network.
- class splendor.agents.our_agents.ppo.ppo_agent.PPOAgent(_id: int, load_net: bool = True)[source]
Bases: PPOAgentBase
PPO agent with MLP neural network.
- SelectAction(actions: list[CollectAction | ReserveAction | BuyAction], game_state: SplendorState, game_rule: SplendorGameRule) CollectAction | ReserveAction | BuyAction [source]
Select an action to play from the given actions.
splendor.agents.our_agents.ppo.ppo_agent_base module
Definition of the base class for all PPO-based agents.
- class splendor.agents.our_agents.ppo.ppo_agent_base.PPOAgentBase(_id: int, load_net: bool = True)[source]
Bases: Agent
Base class for all PPO-based agents.
- abstract SelectAction(actions: list[CollectAction | ReserveAction | BuyAction], game_state: SplendorState, game_rule: SplendorGameRule) CollectAction | ReserveAction | BuyAction [source]
Select an action to play from the given actions.
splendor.agents.our_agents.ppo.ppo_base module
Base class for all neural networks that should be used by a PPO agent.
- class splendor.agents.our_agents.ppo.ppo_base.PPOBase(input_dim: int, output_dim: int)[source]
Bases: Module, ABC
Base class for all neural networks that should be used by a PPO agent.
Create hidden layers based on given dimensions.
- abstract forward(x: Float[Tensor, 'batch sequence features'] | Float[Tensor, 'batch features'] | Float[Tensor, 'features'], action_mask: Float[Tensor, 'batch actions'] | Float[Tensor, 'actions'], *args, **kwargs) tuple[Tensor, Tensor, Any] [source]
Pass input through the network to gain predictions.
- Parameters:
x – the input to the network. Expected shape: one of (features,), (batch_size, features) or (batch_size, sequence_length, features).
action_mask – a binary masking tensor where 1 marks a valid action and 0 an invalid one. Expected shape: (actions,) or (batch_size, actions), where actions equals len(ALL_ACTIONS) from splendor.Splendor.gym.envs.actions.
- Returns:
the action probabilities and the value estimate.
Return the initial hidden state to be used.
splendor.agents.our_agents.ppo.rollout module
Implementation of a rollout buffer, which tracks the values essential for learning during an episode.
- class splendor.agents.our_agents.ppo.rollout.RolloutBuffer(size: int, input_dim: int, action_dim: int, is_recurrent: bool = False, hidden_states_shape: tuple[int, ...] | None = None, device: device | None = None)[source]
Bases: object
The rollout buffer.
- action_dim: int
- action_mask_history: Tensor
- actions: Tensor
- calculate_gae(discount_factor: float) tuple[Tensor, Tensor] [source]
Compute the Generalized Advantage Estimation (GAE).
- Parameters:
discount_factor – by how much a reward decays over time.
- Returns:
the calculated advantages & returns (a reference sketch of the GAE recursion appears after this class).
- cell_states: Tensor | None
- device: device | None = None
- dones: Tensor
- full: bool = False
- index: int = 0
- input_dim: int
- is_recurrent: bool = False
- log_prob_actions: Tensor
- remember(state: Tensor, action: Tensor, action_mask: Tensor, log_prob_action: Tensor, value: float, reward: float, done: bool, hidden_state: Tensor | None = None, cell_state: Tensor | None = None) None [source]
Store essential values in the rollout buffer (see the usage sketch after this class).
- Parameters:
state – feature vector of a state.
action – the action taken in that state.
action_mask – the actions mask in that state.
log_prob_action – the log of the probabilities for each action in that state.
value – the value estimation of that state.
reward – the reward given after taking the action.
done – whether this is a terminal state.
hidden_state – the hidden state used; only relevant for recurrent PPO.
cell_state – the cell state used; only relevant for recurrent PPO, specifically with LSTM.
- rewards: Tensor
- size: int
- states: Tensor
- unpack(discount_factor: float) tuple[Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor, Tensor] [source]
Unpack all the stored values from the rollout buffer.
- values: Tensor
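As referenced above, a single illustrative `remember` call with dummy tensors; the shapes and values are assumptions (in practice they come from the environment and the policy).

```python
import torch

from splendor.agents.our_agents.ppo.rollout import RolloutBuffer

buffer = RolloutBuffer(size=128, input_dim=16, action_dim=10)

# One illustrative step; shapes and values are assumptions.
buffer.remember(
    state=torch.randn(16),
    action=torch.tensor(3),
    action_mask=torch.ones(10),
    log_prob_action=torch.log(torch.tensor(0.1)),
    value=0.25,
    reward=0.0,
    done=False,
)
```

And a reference sketch of the standard GAE recursion that `calculate_gae` presumably follows; the lambda coefficient is not exposed in its signature, so it is treated here as an assumed constant, and this is not the exact implementation.

```python
import torch

def gae_reference(rewards, values, dones, gamma, lam=0.95):
    """Reference sketch of the standard GAE recursion (illustrative only).

    rewards, values, dones are 1-D tensors over the episode steps.
    """
    advantages = torch.zeros_like(rewards)
    next_advantage = 0.0
    next_value = 0.0
    for t in reversed(range(len(rewards))):
        mask = 1.0 - dones[t].float()
        # TD error: delta_t = r_t + gamma * V(s_{t+1}) * (1 - done_t) - V(s_t)
        delta = rewards[t] + gamma * next_value * mask - values[t]
        # A_t = delta_t + gamma * lambda * (1 - done_t) * A_{t+1}
        next_advantage = delta + gamma * lam * mask * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    returns = advantages + values
    return advantages, returns
```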
splendor.agents.our_agents.ppo.training module
Implementation of the actual training of the PPO.
- class splendor.agents.our_agents.ppo.training.LearningParams(optimizer: Optimizer, discount_factor: float, ppo_steps: int, ppo_clip: float, loss_fn: _Loss, seed: int, device: device, is_recurrent: bool, hidden_states_shape: tuple[int, ...] | None = None)[source]
Bases: object
Placeholder for various learning parameters.
optimizer: which optimizer should be used (Adam, SGD, etc.).
discount_factor: by how much the reward decays over environment steps (turns in the game).
ppo_steps: how many gradient descent steps should be performed.
ppo_clip: which “epsilon” to use for the policy loss clipping.
loss_fn: which loss function should be used for the value regression (L1, L2, Huber, etc.).
device: on which device to execute the calculations.
is_recurrent: whether the given policy incorporates a recurrent unit in its architecture; this tells whether the hidden states should be ignored or not.
- device: device
- discount_factor: float
- is_recurrent: bool
- loss_fn: _Loss
- optimizer: Optimizer
- ppo_clip: float
- ppo_steps: int
- seed: int
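A hedged construction sketch; the optimizer, loss function, and hyperparameter values are illustrative choices, not the project's actual settings.

```python
import torch
from torch import nn, optim

from splendor.agents.our_agents.ppo.network import PPO
from splendor.agents.our_agents.ppo.training import LearningParams

policy = PPO(input_dim=16, output_dim=10)  # toy dimensions

learning_params = LearningParams(
    optimizer=optim.Adam(policy.parameters(), lr=1e-6, weight_decay=1e-4),
    discount_factor=0.99,
    ppo_steps=5,
    ppo_clip=0.2,
    loss_fn=nn.SmoothL1Loss(),
    seed=1234,
    device=torch.device("cpu"),
    is_recurrent=False,
)
```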
- splendor.agents.our_agents.ppo.training.evaluate(env: Env, policy: Module, is_recurrent: bool, seed: int, device: device) float [source]
Evaluate the performance of the PPO agent (in training) against the test opponent.
- Parameters:
env – The test environment, configured to simulate a game against the test opponent.
policy – The network of the PPO agent.
is_recurrent – Whether the network of the PPO agent incorporates a recurrent unit. This tells the function whether the hidden states should be used or ignored.
seed – the seed used by the environment.
device – On which device the mathematical computations should be performed.
- Returns:
The reward the PPO agent collected during a single episode.
- splendor.agents.our_agents.ppo.training.train_single_episode(env: Env, policy: Module, learning_params: LearningParams) tuple[float, float, float] [source]
Execute the training procedure for a single episode (game), i.e. record a complete episode trajectory (trace of a full game) and then perform multiple gradient descent steps on the policy network.
- Parameters:
env – The environment that would be used to simulate an episode.
policy – The network of the PPO agent.
learning_params – Various learning parameters required to define the learning procedure, such as the learning rate.
- Returns:
the average policy & value losses and the episode reward.
- splendor.agents.our_agents.ppo.training.update_policy(policy: Module, rollout_buffer: RolloutBuffer, learning_params: LearningParams) tuple[float, float] [source]
Update the policy using several gradient descent steps (via the given optimizer) on the PPO-Clip loss function.
- Parameters:
policy – the neural network to optimize.
rollout_buffer – a record for a complete trajectory of an episode (trace of a full game).
learning_params – all arguments required to define the learning procedure.
- Returns:
The average policy loss and the average value loss.
splendor.agents.our_agents.ppo.utils module
Collection of utility functions.
- splendor.agents.our_agents.ppo.utils.load_saved_model(path: Path, ppo_factory: PPOBaseFactory, *args, **kwargs) PPOBase [source]
Load saved weights of a PPO model from the given path; if no path is given, the installed weights of the PPO agent are loaded.
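A usage sketch; the weights path is hypothetical, `PPO` is assumed to satisfy `PPOBaseFactory`, and the trailing dimensions are assumed to be forwarded to the factory.

```python
from pathlib import Path

from splendor.agents.our_agents.ppo.network import PPO
from splendor.agents.our_agents.ppo.utils import load_saved_model

# Hypothetical weights path; the input/output dimensions are illustrative and
# assumed to be forwarded to the PPO factory.
model = load_saved_model(Path("ppo_model.pth"), PPO, 16, 10)
```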