The top of the learner hierarchy is more conceptual than functional. The classes at this level categorize algorithms by their requirements, so that it can be determined automatically when an algorithm is not applicable to a given problem.
class pybrain.rl.learners.learner.Learner
Top-level class for all reinforcement learning algorithms. Any learning algorithm changes a policy (in some way) in order to increase the expected reward/fitness.
class pybrain.rl.learners.learner.EpisodicLearner
Bases: pybrain.rl.learners.learner.Learner
Assumes the task is episodic, not life-long, and therefore does a learning step only after the end of each episode.
class pybrain.rl.learners.learner.DataSetLearner
Bases: pybrain.rl.learners.learner.EpisodicLearner
A class for learners that learn from a dataset, which has no target output but only a reinforcement signal for each sample. It requires a ReinforcementDataSet object (which provides state-action-reward tuples).
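For illustration, a minimal sketch of assembling such a dataset by hand; in practice a LearningAgent maintains this history automatically, and the ReinforcementDataSet import path is assumed to match recent PyBrain versions:

    from pybrain.datasets import ReinforcementDataSet

    # State and action are both one-dimensional here.
    ds = ReinforcementDataSet(1, 1)

    # Each sample is a (state, action, reward) triple.
    ds.addSample([0], [1], -1.0)
    ds.addSample([1], [0], 1.0)

    # Episodes are stored as separate sequences.
    ds.newSequence()
    ds.addSample([0], [0], 0.0)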
class pybrain.rl.learners.learner.ExploringLearner
Bases: pybrain.rl.learners.learner.Learner
A Learner that determines how to change the adaptive parameters of a module and, in addition, carries an explorer that perturbs the chosen actions during training (e.g. epsilon-greedy exploration for discrete action spaces).
class pybrain.rl.learners.directsearch.directsearch.DirectSearchLearner
Bases: pybrain.rl.learners.learner.Learner
The class of learners that (in contrast to value-based learners) search directly in policy space.
class pybrain.rl.learners.valuebased.valuebased.ValueBasedLearner
Bases: pybrain.rl.learners.learner.ExploringLearner, pybrain.rl.learners.learner.DataSetLearner, pybrain.rl.learners.learner.EpisodicLearner
An RL algorithm based on estimating a value-function.
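For example, a value-based learner is wired together from an action-value module, a learner, and an agent. A minimal sketch (the 81-state, 4-action table size is arbitrary):

    from pybrain.rl.learners.valuebased import ActionValueTable
    from pybrain.rl.learners import Q
    from pybrain.rl.agents import LearningAgent

    # Q-values for 81 discrete states and 4 actions, initialised to zero.
    controller = ActionValueTable(81, 4)
    controller.initialize(0.0)

    # The agent records state-action-reward history; the learner consumes it.
    agent = LearningAgent(controller, Q())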
class pybrain.rl.learners.valuebased.q.Q
Bases: pybrain.rl.learners.valuebased.valuebased.ValueBasedLearner
learn()
Learn on the current dataset, either for many timesteps and even episodes (batchMode = True) or for a single timestep (batchMode = False). Batch mode is possible because Q-learning is an off-policy method.
In batchMode, the algorithm goes through all the samples in the history and performs an update on each of them. If batchMode is False, only the last data sample is considered. It is the user's responsibility to keep the dataset consistent with the agent's history.
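A complete training loop in the style of the standard PyBrain maze tutorial; the maze layout, table size, and the Q(alpha, gamma) constructor arguments are illustrative assumptions:

    from numpy import array
    from pybrain.rl.environments.mazes import Maze, MDPMazeTask
    from pybrain.rl.learners.valuebased import ActionValueTable
    from pybrain.rl.learners import Q
    from pybrain.rl.agents import LearningAgent
    from pybrain.rl.experiments import Experiment

    # A 5x5 grid world: 1 = wall, 0 = free; the goal is at (3, 3).
    structure = array([[1, 1, 1, 1, 1],
                       [1, 0, 0, 0, 1],
                       [1, 0, 1, 0, 1],
                       [1, 0, 0, 0, 1],
                       [1, 1, 1, 1, 1]])
    environment = Maze(structure, (3, 3))
    task = MDPMazeTask(environment)

    controller = ActionValueTable(25, 4)   # 25 grid cells, 4 move directions
    controller.initialize(0.0)
    agent = LearningAgent(controller, Q(0.5, 0.99))

    experiment = Experiment(task, agent)
    for _ in range(1000):
        experiment.doInteractions(100)  # collect 100 samples into the history
        agent.learn()                   # batch replay of the history (off-policy)
        agent.reset()                   # clear the history for the next round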
class pybrain.rl.learners.valuebased.qlambda.QLambda
Bases: pybrain.rl.learners.valuebased.valuebased.ValueBasedLearner
Q-lambda is a variation of Q-learning that uses an eligibility trace, so that a reward also updates the values of the state-action pairs that preceded it within an episode.
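Q-lambda can replace Q in the training loop above. A sketch, assuming a QLambda(alpha, gamma, qlambda) constructor with the trace-decay factor as the last argument:

    from pybrain.rl.learners import QLambda

    # alpha = learning rate, gamma = discount, qlambda = trace decay (assumed).
    learner = QLambda(0.5, 0.99, 0.9)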
class pybrain.rl.learners.valuebased.sarsa.SARSA
Bases: pybrain.rl.learners.valuebased.valuebased.ValueBasedLearner
State-Action-Reward-State-Action (SARSA) algorithm.
learn()
In batchMode, the algorithm goes through all the samples in the history and performs an update on each of them. If batchMode is False, only the last data sample is considered. It is the user's responsibility to keep the dataset consistent with the agent's history. Note that, unlike Q-learning, SARSA is on-policy: each update uses the action that was actually taken in the successor state.
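As with Q, a sketch of swapping the learner into the training loop above (the SARSA(alpha, gamma) constructor signature is an assumption):

    from pybrain.rl.learners import SARSA

    # alpha = learning rate, gamma = discount factor (assumed signature).
    learner = SARSA(0.5, 0.99)
    # SARSA is on-policy: learn() should see history generated by the current
    # policy, so learn and reset after every interaction batch.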
class pybrain.rl.learners.valuebased.nfq.NFQ
Bases: pybrain.rl.learners.valuebased.valuebased.ValueBasedLearner
Neuro-fitted Q-learning: a batch variant of Q-learning in which the Q-function is represented by a neural network that is re-trained on the whole stored transition history (cf. Riedmiller's Neural Fitted Q Iteration, 2005).
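Since the Q-function is a network rather than a table, NFQ is paired with ActionValueNetwork instead of ActionValueTable. A minimal sketch (the state dimension of 3 and the 2 actions are arbitrary):

    from pybrain.rl.learners.valuebased import ActionValueNetwork
    from pybrain.rl.learners import NFQ
    from pybrain.rl.agents import LearningAgent

    # A network estimating Q(s, a) for a 3-dimensional continuous state
    # and 2 discrete actions.
    controller = ActionValueNetwork(3, 2)
    agent = LearningAgent(controller, NFQ())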
class pybrain.rl.learners.directsearch.policygradient.PolicyGradientLearner
Bases: pybrain.rl.learners.directsearch.directsearch.DirectSearchLearner, pybrain.rl.learners.learner.DataSetLearner, pybrain.rl.learners.learner.ExploringLearner
PolicyGradientLearner is the superclass of all continuous direct-search algorithms that use the log-likelihood of the executed action to update the weights. Subclasses include ENAC, GPOMDP, and REINFORCE.
class pybrain.rl.learners.directsearch.reinforce.Reinforce
Bases: pybrain.rl.learners.directsearch.policygradient.PolicyGradientLearner
Reinforce is the gradient estimator technique of Williams (see “Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning”, 1992). It uses optimal baselines and calculates the gradient from the log-likelihoods of the taken actions.
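To make the estimator concrete, here is a self-contained numpy sketch of the REINFORCE gradient for a one-parameter Gaussian policy on a toy one-step task; this is the textbook estimator with a simple mean baseline, not PyBrain's implementation (which uses the optimal baseline):

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 0.0, 1.0               # Gaussian policy: a ~ N(mu, sigma^2)

    def episode_return(a):
        return -(a - 2.0) ** 2         # toy task: best actions are near 2.0

    for _ in range(2000):
        actions = mu + sigma * rng.standard_normal(32)   # 32 one-step episodes
        returns = episode_return(actions)
        baseline = returns.mean()                        # variance reduction
        # grad of log N(a; mu, sigma) w.r.t. mu is (a - mu) / sigma^2
        grad_mu = np.mean((actions - mu) / sigma ** 2 * (returns - baseline))
        mu += 0.05 * grad_mu                             # gradient ascent

    print(mu)  # close to the optimum 2.0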
class pybrain.rl.learners.directsearch.enac.ENAC
Bases: pybrain.rl.learners.directsearch.policygradient.PolicyGradientLearner
Episodic Natural Actor-Critic. See J. Peters, “Natural Actor-Critic” (2005). Estimates the natural gradient by regressing the log-likelihoods of the taken actions onto the rewards.
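A usage sketch in PyBrain's episodic style, using the cart-pole balancing task as an example environment; the episode counts and the 200-step limit are illustrative, and import paths are assumed to match recent PyBrain versions:

    from pybrain.tools.shortcuts import buildNetwork
    from pybrain.rl.environments.cartpole import CartPoleEnvironment, BalanceTask
    from pybrain.rl.agents import LearningAgent
    from pybrain.rl.learners import ENAC
    from pybrain.rl.experiments import EpisodicExperiment

    environment = CartPoleEnvironment()
    task = BalanceTask(environment, 200)           # episodes of up to 200 steps
    # A linear policy: task observations in, one continuous action out.
    net = buildNetwork(task.outdim, task.indim, bias=False)
    agent = LearningAgent(net, ENAC())
    experiment = EpisodicExperiment(task, agent)

    for _ in range(100):
        experiment.doEpisodes(10)   # roll out 10 episodes with the current policy
        agent.learn()               # one natural-gradient update
        agent.reset()               # drop the now off-policy episodes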
Note
Black-box optimization algorithms can also be seen as direct-search RL algorithms, but are not included here.