Datasets provide convenient access to training, test, and validation data. Instead of having to wrangle raw arrays yourself, PyBrain gives you a more sophisticated data structure that makes working with your data easier.
For each of the different tasks that arise in machine learning there is a specialized dataset type, possibly with a few sub-types. The different types share some common functionality, which we'll discuss first.
A dataset can be seen as a collection of named 2d-arrays, called fields in this context. For instance, if DS implements DataSet:
inp = DS['input']
returns the input field. The last dimension of this field corresponds to the input dimension, such that
inp[0,:]
would yield the first input vector. In most cases there is also a field named ‘target’, which follows the same rules. However, didn’t we say we would spare you the array mangling? Well, in most cases you will want to iterate over a dataset like so:
for inp, targ in DS:
    ...
Note that whether you get one, two, or more sample rows back depends on the number of linked fields in the DataSet: these are fields that contain the same number of samples and are assumed to be used together, like the ‘input’ and ‘target’ fields above. You can always check the DS.link property to see which fields are linked.
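For illustration, here is a minimal sketch (using the SupervisedDataSet described further below; the numbers are arbitrary) of inspecting the linked fields and iterating over them:
from pybrain.datasets import SupervisedDataSet

# a tiny dataset with 2-dimensional inputs and 1-dimensional targets
DS = SupervisedDataSet(2, 1)
DS.appendLinked([0.0, 1.0], [1.0])
DS.appendLinked([1.0, 0.0], [0.0])

print DS.link            # the linked fields, e.g. ['input', 'target']
for inp, targ in DS:     # iteration yields one row per linked field
    print inp, targ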
Similarly, DataSets can be created by adding samples one-by-one – the cleaner but slower method – or by assembling them from arrays.
for inp, targ in samples:
    DS.appendLinked(inp, targ)
# or alternatively, with ia and ta being arrays:
assert(ia.shape[0] == ta.shape[0])
DS.setField('input', ia)
DS.setField('target', ta)
In the latter case DS cannot check the linked array dimensions for you, otherwise it would not be possible to build a dataset from scratch.
You may add your own linked or unlinked data to the dataset. However, note that many training algorithms iterate over the linked fields and may fail if their number has changed:
DS.addField('myfield')
DS.setField('myfield', myarray)
DS.linkFields('input','target','myfield') # must provide complete list here
A useful utility method for quick generation of randomly picked training and testing data is also provided:
>>> len(DS)
100
>>> TrainDS, TestDS = DS.splitWithProportion(0.8)
>>> len(TrainDS), len(TestDS)
(80, 20)
As its name says, the SupervisedDataSet, the simplest form of dataset, is meant to be used for supervised learning tasks. It comprises the fields ‘input’ and ‘target’, the pattern sizes of which must be set upon creation:
>>> from pybrain.datasets import SupervisedDataSet
>>> DS = SupervisedDataSet( 3, 2 )
>>> DS.appendLinked( [1,2,3], [4,5] )
>>> len(DS)
1
>>> DS['input']
array([[ 1., 2., 3.]])
The SequentialDataSet introduces the concept of sequences. With it, we move further away from array mangling towards something more practical for sequence learning tasks. Essentially, its patterns are subdivided into sequences of variable length that can be accessed via the methods
getNumSequences()
getSequence(index)
getSequenceLength(index)
Creating a SequentialDataSet is no different from creating its parent, since it still contains only ‘input’ and ‘target’ fields. SequentialDataSet inherits from SupervisedDataSet, which can be seen as the special case in which all sequences have length 1.
To fill the dataset with content, it is advisable to call newSequence() at the start of each sequence to be stored, and then add patterns with appendLinked() as above. This way, the class handles indexing and bookkeeping transparently. One could theoretically construct a SequentialDataSet directly from arrays, but messing with the index field is not recommended.
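For illustration, a minimal sketch (with made-up numbers) of filling a small sequential dataset with two sequences and querying it afterwards:
from pybrain.datasets import SequentialDataSet

DS = SequentialDataSet(2, 1)        # 2-dimensional input, 1-dimensional target
DS.newSequence()                    # start the first sequence
DS.appendLinked([1.0, 0.0], [0.5])
DS.appendLinked([0.0, 1.0], [0.7])
DS.newSequence()                    # start the second sequence
DS.appendLinked([0.5, 0.5], [0.1])

print DS.getNumSequences()          # 2
print DS.getSequenceLength(0)       # 2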
A typical way of iterating over a sequence dataset DS would be something like:
for i in range(DS.getNumSequences()):
    for input, target in DS.getSequenceIterator(i):
        # do stuff
The purpose of the ClassificationDataSet is to facilitate dealing with classification problems, whereas the datasets above are more geared towards regression. Its ‘target’ field is defined as integer, and it contains an extra field called ‘class’, which is basically an automated backup of the targets, for reasons that will become apparent shortly. For the most part, you don’t have to bother with it. Initialization requires something like:
DS = ClassificationDataSet(inputdim, nb_classes=2, class_labels=['Fish','Chips'])
The labels are optional and mainly used for documentation. The target dimension is supposed to be 1. The targets are class labels starting from zero. If for some reason you don’t know beforehand how many classes you have, or you have fiddled around with the setField() method, it is possible to regenerate the class information using assignClasses() or calculateStatistics():
>>> DS = ClassificationDataSet(2, class_labels=['Urd', 'Verdandi', 'Skuld'])
>>> DS.appendLinked([ 0.1, 0.5 ] , [0])
>>> DS.appendLinked([ 1.2, 1.2 ] , [1])
>>> DS.appendLinked([ 1.4, 1.6 ] , [1])
>>> DS.appendLinked([ 1.6, 1.8 ] , [1])
>>> DS.appendLinked([ 0.10, 0.80 ] , [2])
>>> DS.appendLinked([ 0.20, 0.90 ] , [2])
>>> DS.calculateStatistics()
{0: 1, 1: 3, 2: 2}
>>> print DS.classHist
{0: 1, 1: 3, 2: 2}
>>> print DS.nClasses
3
>>> print DS.getClass(1)
Verdandi
>>> print DS.getField('target').transpose()
[[0 1 1 1 2 2]]
When doing classification, many algorithms work better if classes are encoded into one output unit per class, that takes on a certain value if the class is present. As an advanced feature, ClassificationDataSet does this conversion automatically:
>>> DS._convertToOneOfMany(bounds=[0, 1])
>>> print DS.getField('target')
[[1 0 0]
[0 1 0]
[0 1 0]
[0 1 0]
[0 0 1]
[0 0 1]]
>>> print DS.getField('class').transpose()
[[0 1 1 1 2 2]]
>>> DS._convertToClassNb()
>>> print DS.getField('target').transpose()
[[0 1 1 1 2 2]]
In case you want to do sequence classification, there is also a SequenceClassificationDataSet, which combines the features of this class and the SequentialDataSet.
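A minimal sketch of how such a dataset might be assembled, assuming SequenceClassificationDataSet can be imported from pybrain.datasets like the other classes and takes the input and target dimensions plus nb_classes:
from pybrain.datasets import SequenceClassificationDataSet

DS = SequenceClassificationDataSet(2, 1, nb_classes=2)
DS.newSequence()                    # first sequence, labelled class 0
DS.appendLinked([0.1, 0.2], [0])
DS.appendLinked([0.2, 0.3], [0])
DS.newSequence()                    # second sequence, labelled class 1
DS.appendLinked([0.8, 0.9], [1])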
The ImportanceDataSet is another extension of SequentialDataSet that allows assigning different weights to patterns. Essentially, it works like its parent, except that it comprises another linked field named ‘importance’, which should contain a value between 0.0 and 1.0 for each pattern. A SequentialDataSet is the special case with all weights equal to 1.0.
We have packed this functionality into a different class because it is rarely used and drains some computational resources. So far, there is no corresponding non-sequential dataset class.
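A minimal sketch of assigning weights, assuming the class is ImportanceDataSet (importable from pybrain.datasets.importance) and that the weights can simply be passed as a third argument to appendLinked(), since ‘importance’ is a linked field:
from pybrain.datasets.importance import ImportanceDataSet

DS = ImportanceDataSet(2, 1)                 # 2-dim input, 1-dim target
DS.newSequence()
# the per-pattern weight goes into the linked 'importance' field
DS.appendLinked([0.3, 0.7], [1.0], [1.0])    # full weight
DS.appendLinked([0.4, 0.6], [0.0], [0.2])    # low weight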