.. _datasets:

Using Datasets
==============

Datasets are useful for allowing comfortable access to training, test and
validation data. Instead of having to mangle with arrays, PyBrain gives you a
more sophisticated data structure that allows easier work with your data.

For the different tasks that arise in machine learning, there is a special
dataset type, possibly with a few sub-types. The different types share some
common functionality, which we'll discuss first.

A dataset can be seen as a collection of named 2d-arrays, called `fields` in
this context. For instance, if DS implements :class:`DataSet`::

    inp = DS['input']

returns the input field. The last dimension of this field corresponds to the
input dimension, such that ::

    inp[0,:]

would yield the first input vector. In most cases there is also a field named
'target', which follows the same rules. However, didn't we promise to spare
you the array mangling? Well, in most cases you will want to iterate over a
dataset like so::

    for inp, targ in DS:
        ...

Note that whether you get one, two, or more sample rows as a return depends on
the number of `linked fields` in the DataSet: these are fields that contain
the same number of samples and are assumed to be used together, like the above
'input' and 'target' fields. You can always check the DS.link property to see
which fields are linked.

Similarly, DataSets can be created by adding samples one-by-one -- the cleaner
but slower method -- or by assembling them from arrays::

    for inp, targ in samples:
        DS.appendLinked(inp, targ)

    # or alternatively, with ia and ta being arrays:
    assert(ia.shape[0] == ta.shape[0])
    DS.setField('input', ia)
    DS.setField('target', ta)

In the latter case DS cannot check the linked array dimensions for you --
otherwise it would not be possible to build a dataset from scratch.

You may add your own linked or unlinked data to the dataset. However, note
that many training algorithms iterate over the linked fields and may fail if
their number has changed::

    DS.addField('myfield')
    DS.setField('myfield', myarray)
    DS.linkFields('input', 'target', 'myfield')  # must provide the complete list here

A useful utility method for quick generation of randomly picked training and
testing data is also provided::

    >>> len(DS)
    100
    >>> TrainDS, TestDS = DS.splitWithProportion(0.8)
    >>> len(TrainDS), len(TestDS)
    (80, 20)


:ref:`superviseddataset`
------------------------

As the name says, this simplest form of a dataset is meant to be used with
supervised learning tasks. It consists of the fields 'input' and 'target', the
pattern size of which must be set upon creation::

    >>> from pybrain.datasets import SupervisedDataSet
    >>> DS = SupervisedDataSet( 3, 2 )
    >>> DS.appendLinked( [1,2,3], [4,5] )
    >>> len(DS)
    1
    >>> DS['input']
    array([[ 1.,  2.,  3.]])


:ref:`sequentialdataset`
------------------------

This dataset introduces the concept of ``sequences``. With this we are moving
further away from the array mangling towards something more practical for
sequence learning tasks. Essentially, its patterns are subdivided into
sequences of variable length that can be accessed via the methods ::

    getNumSequences()
    getSequence(index)
    getSequenceLength(index)

Creating a :class:`SequentialDataSet` is no different from creating its
parent, since it still contains only 'input' and 'target' fields.
:class:`SequentialDataSet` inherits from :class:`SupervisedDataSet`, which can
be seen as a special case with a sequence length of 1 for all sequences.

To fill the dataset with content, it is advisable to call :meth:`newSequence`
at the start of each sequence to be stored, and then add patterns by using
:meth:`appendLinked` as above. This way, the class handles indexing and such
transparently. One can theoretically construct a :class:`SequentialDataSet`
directly from arrays, but messing with the index field is not recommended.
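
For example, filling a dataset with two short sequences might look as follows
(the dimensions and values are made up for illustration)::

    from pybrain.datasets import SequentialDataSet

    DS = SequentialDataSet(2, 1)        # 2-dim input, 1-dim target
    DS.newSequence()                    # first sequence: three patterns
    DS.appendLinked([0.0, 0.1], [0.5])
    DS.appendLinked([0.2, 0.3], [0.6])
    DS.appendLinked([0.4, 0.5], [0.7])
    DS.newSequence()                    # second sequence: two patterns
    DS.appendLinked([1.0, 1.1], [1.5])
    DS.appendLinked([1.2, 1.3], [1.6])

Afterwards, getNumSequences() should report two sequences, of lengths 3 and 2.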

A typical way of iterating over a sequence dataset ``DS`` would be something
like::

    for i in range(DS.getNumSequences()):
        for input, target in DS.getSequenceIterator(i):
            # do stuff


:ref:`classificationdataset`
----------------------------

The purpose of this dataset is to facilitate dealing with classification
problems, whereas the above are more geared towards regression. Its 'target'
field is defined as integer, and it contains an extra field called 'class',
which is basically an automated backup of the targets, for reasons that will
become apparent shortly. For the most part, you don't have to bother with it.
Initialization requires something like::

    DS = ClassificationDataSet(inputdim, nb_classes=2, class_labels=['Fish', 'Chips'])

The labels are optional and mainly used for documentation. The target
dimension is supposed to be 1, and the targets are class labels starting from
zero. If for some reason you don't know beforehand how many classes you have,
or you fiddled around with the :meth:`setField` method, it is possible to
regenerate the class information using :meth:`assignClasses` or
:meth:`calculateStatistics`::

    >>> DS = ClassificationDataSet(2, class_labels=['Urd', 'Verdandi', 'Skuld'])
    >>> DS.appendLinked([ 0.1, 0.5 ], [0])
    >>> DS.appendLinked([ 1.2, 1.2 ], [1])
    >>> DS.appendLinked([ 1.4, 1.6 ], [1])
    >>> DS.appendLinked([ 1.6, 1.8 ], [1])
    >>> DS.appendLinked([ 0.10, 0.80 ], [2])
    >>> DS.appendLinked([ 0.20, 0.90 ], [2])
    >>> DS.calculateStatistics()
    {0: 1, 1: 3, 2: 2}
    >>> print DS.classHist
    {0: 1, 1: 3, 2: 2}
    >>> print DS.nClasses
    3
    >>> print DS.getClass(1)
    Verdandi
    >>> print DS.getField('target').transpose()
    [[0 1 1 1 2 2]]

When doing classification, many algorithms work better if classes are encoded
into one output unit per class, which takes on a certain value if the class is
present. As an advanced feature, :class:`ClassificationDataSet` does this
conversion automatically::

    >>> DS._convertToOneOfMany(bounds=[0, 1])
    >>> print DS.getField('target')
    [[1 0 0]
     [0 1 0]
     [0 1 0]
     [0 1 0]
     [0 0 1]
     [0 0 1]]
    >>> print DS.getField('class').transpose()
    [[0 1 1 1 2 2]]
    >>> DS._convertToClassNb()
    >>> print DS.getField('target').transpose()
    [[0 1 1 1 2 2]]

In case you want to do sequence classification, there is also a
:class:`SequenceClassificationDataSet`, which combines the features of this
class and the :class:`SequentialDataSet`.


:ref:`importancedataset`
------------------------

This is another extension of :class:`SequentialDataSet` that allows assigning
different weights to patterns. Essentially, it works like its parent, except
that it comprises another linked field named 'importance', which should
contain a value between 0.0 and 1.0 for each pattern. A
:class:`SequentialDataSet` is a special case with all weights equal to 1.0.
We have packed this functionality into a different class because it is rarely
used and drains some computational resources. So far, there is no
corresponding non-sequential dataset class.
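
To illustrate, a rough sketch of filling such a dataset might look as follows.
This assumes that :class:`ImportanceDataSet` is constructed like its parent
and that :meth:`addSample` accepts an optional importance value, one weight
per target dimension; check the class itself for the exact signature. ::

    from pybrain.datasets.importance import ImportanceDataSet

    DS = ImportanceDataSet(2, 1)   # 2-dim input, 1-dim target
    DS.newSequence()
    # give the first pattern full weight, the second a reduced one
    # ('importance' argument assumed here; see pybrain/datasets/importance.py)
    DS.addSample([0.0, 0.1], [0.5], importance=[1.0])
    DS.addSample([0.2, 0.3], [0.6], importance=[0.2])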