[Repost] Monophonic Instrument Classification

Overview

In this first example, we will demonstrate how to use a deep network to perform framewise classification of monophonic instrument spectra. This will touch on some tricks for training with and transforming large datasets, address the basics of Theano, and show you how to use deep networks for classification.


 

Source Code

This example requires and references source files from the provided deep learning tutorial repository.

Note: We strongly recommend that you simply clone the provided deep learning tutorial repository, as directed on the tutorial page, and complete all necessary system configuration before attempting anything here.


 

Implementing a Deep Network Classifier

As we will see momentarily, building a deep network in Python can be quite straightforward with the help of Theano. The process consists of four stages, outlined below:


 

1. Define the Network’s Architecture

Despite the excitement circling the notion of “deep learning,” a deep network is simply a non-linear function that maps inputs to outputs. And, like any other function (a parabola, for example), it has some general form and is specified by its coefficients. The advantage inherent to deep learning is that an arbitrarily complex equation can be chosen or designed, and the coefficients that best model the relationship between inputs and outputs can be found through numerical optimization.

Restating for convenience, Theano is a mathematical expression library, offering symbolic differentiation, the ability to compile symbolic functions into C-code, and tight integration with NVIDIA GPUs for parallelization. Of those three attributes, the first is of particular interest here. In the spirit of simplicity, Theano offers two basic kinds of symbolic datatypes: variables (vectors, matrices, tensors, etc.) and shared parameters.

In a Theano expression, variables serve as placeholders for independent data (i.e. input variables) that will be pushed through the network. Therefore, it is necessary to indicate the expected number of dimensions of this data by creating the right type, e.g. vector versus 3d-tensor. By and large, you will only ever need to explicitly declare variables for the inputs of a network, and will therefore have a very clear idea of what your data looks like.

Conversely, shared parameters (known simply in Theano as a “shared” type) are used to represent the parameters of a model; it may help to think of shared types as coefficients. Shared parameters offer a few key benefits that should be addressed specifically:

  • Shared parameters are persistent; you will not need to provide shared types as inputs to a function, but you may want to keep a handle to them.
  • You can differentiate with respect to shared types (addressed below, in stage 3).
  • Shared types can be updated implicitly during a function call (addressed below, in stage 4).

For the time being, it is sufficient to consider only the first point. As opposed to variables, shared types take and store real numerical values. This is a crucial distinction to make, and we call special attention to it now: a shared type owns persistent numerical values, which are allowed to change during program execution. In this sense, shared types must be initialized with some numerical value.
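
As a minimal illustration of this distinction (a sketch, not drawn from the tutorial code):

    import numpy as np
    import theano
    import theano.tensor as T

    # A variable: a symbolic placeholder that fixes only the number of dimensions.
    x = T.matrix('x')  # 2d; holds no value until a compiled function is called

    # A shared parameter: owns a persistent numerical value from creation onward.
    w = theano.shared(np.zeros([4, 3], dtype=np.float32), name='w')
    print(w.get_value())  # the stored 4x3 array of zeros, which may change later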

Having discussed the prerequisite Theano concepts, we can review a sample network definition to see these principles in practice:

Building the instrument classifier network.
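
The original post displayed this listing as an image, so it is not reproduced verbatim here. The following is a minimal sketch consistent with the walkthrough below, reusing the names given in the prose (x_input, mu, sigma, weights0, bias0, z_out0, z_output); the layer sizes are assumptions, and the cited line numbers (e.g. Line 157) refer to the tutorial's source file, not to this sketch.

    import numpy as np
    import theano
    import theano.tensor as T

    def hwr(z):
        """Point-wise half-wave rectification (defined elsewhere in the provided code)."""
        return T.maximum(0.0, z)

    # Hypothetical dimensions: input CQT length, hidden width, and 10 output classes.
    n_in, n_hid, n_out = 252, 512, 10

    # The network input is a matrix: batches of Constant-Q vectors.
    x_input = T.matrix('x_input')

    # Standardization parameters, initialized to a no-op (zero means, unit deviations).
    mu = theano.shared(np.zeros(n_in, dtype=np.float32), name='mu')
    sigma = theano.shared(np.ones(n_in, dtype=np.float32), name='sigma')
    x_std = (x_input - mu) / sigma

    # Layer 0: affine (fully-connected) transform, then half-wave rectification.
    weights0 = theano.shared(
        np.random.normal(scale=0.01, size=(n_in, n_hid)).astype(np.float32),
        name='weights0')
    bias0 = theano.shared(np.zeros(n_hid, dtype=np.float32), name='bias0')
    z_out0 = hwr(T.dot(x_std, weights0) + bias0)

    # Layer 1: identical in form to layer 0.
    weights1 = theano.shared(
        np.random.normal(scale=0.01, size=(n_hid, n_hid)).astype(np.float32),
        name='weights1')
    bias1 = theano.shared(np.zeros(n_hid, dtype=np.float32), name='bias1')
    z_out1 = hwr(T.dot(z_out0, weights1) + bias1)

    # Layer 2: affine transform followed by a softmax, so each output row sums to 1.
    weights2 = theano.shared(
        np.random.normal(scale=0.01, size=(n_hid, n_out)).astype(np.float32),
        name='weights2')
    bias2 = theano.shared(np.zeros(n_out, dtype=np.float32), name='bias2')
    z_output = T.nnet.softmax(T.dot(z_out1, weights2) + bias2)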

First, we see that the input to the network, a matrix, is declared in Line 157. This is consistent with the instrument data, which consists of Constant-Q vectors. The input data type is a matrix because we will provide batches of vectors, i.e. matrices, to the network.

Lines 165:168 create two shared parameters, named mu and sigma, used to standardize the input data. For now, we initialize the means to a vector of zeros and the standard deviations to a vector of ones, effectively bypassing this operation. Later we will see how these parameters are updated with statistics computed over the training set, and the significance of naming these parameters with useful labels.

Lines 171:174 show the definition of the first layer of this network, an affine (fully-connected) transform. Here, a weight matrix and a vector of biases are initialized with normally distributed values and zeros, respectively. Note that these shared types are also given unique names, where the layer index is clearly indicated in the label. The output of the first layer, z_out0, is computed as the dot-product of the input x_input with the weight matrix weights0, followed by an additive bias bias0 and point-wise half-wave rectification (written elsewhere in the provided code).

Finally, the other two layers are almost identical to the first. The sole difference is the application of the softmax operation in Line 186, which constrains the output of the network to sum to 1 for a single datapoint. Further note that the output of this function, effectively a posterior over multiple observations, will also be a matrix.


 

2. Define a Loss Function

Now that we have defined our classification network, we need to establish some means of measuring how well the current model fits our data. In the language of neural networks, this is known as a loss function, or simply the loss. For more information, we refer the curious reader to this tutorial on Energy-based Learning.

For a network that aims to classify an observation as one of a predefined set of classes, we would like to make the loss of the “correct” class smaller than the loss of all other classes. Since we have defined a network that produces a posterior, we can define the loss as the negative log-likelihood of this output:

Defining a loss function.
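
Again, the original listing was an image; a sketch consistent with the description below, continuing the names from the network sketch above, might read:

    # Integer vector input: the correct class index for each observation in the batch.
    y_target = T.ivector('y_target')

    # A range vector spanning [0, batch_size), pairing each row with its true class.
    batch_idx = T.arange(y_target.shape[0])

    # Negative log-likelihood of the correct classes, averaged over the batch.
    loss = -T.mean(T.log(z_output[batch_idx, y_target]))

    # Scalar accuracy: how often the argmax of the posterior matches y_target.
    accuracy = T.mean(T.eq(T.argmax(z_output, axis=1), y_target))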

To implement this in Theano, we first need to declare an integer vector input for the correct classes of the data processed by the network (Line 191). Similarly, we also need to create a range vector (Line 192) that spans [0, batch_size). These two vectors are used to index the posterior and select the values of the correct answers (Line 193). Note that this produces a symbolic scalar loss over a batch of data, where the losses of multiple observations are averaged together.

For completeness, we can also define a symbolic scalar accuracy value (Line 194). This is defined here by comparing the predicted classes (the argmax of the posterior, z_output) with the given correct classes (y_target). A prediction is correct when the corresponding entries of these two vectors are equal, and the accuracy is computed as the mean value over these observations.


 

3. Compute Parameter Update Rules

Importantly, when the fitness of a model can be calculated as a scalar output of a differentiable function, gradient-based optimization can be used to find, or learn, better values for the parameters. The intuition behind this method is actually a simple one: move the parameter values in tiny steps toward more “correct” answers:

Computing parameter update rules.
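
A sketch of these update rules, continuing from the sketches above (excluding mu and sigma from the parameter list is an assumption, since they are set from training-set statistics rather than learned):

    from collections import OrderedDict

    # Scalar input controlling the step size (learning rate) at each iteration.
    eta = T.scalar('eta')

    # The shared parameters to learn by gradient descent.
    params = [weights0, bias0, weights1, bias1, weights2, bias2]

    updates = OrderedDict()
    for param in params:
        # Gradient of the scalar loss with respect to this shared parameter.
        gparam = T.grad(loss, param)
        # Gradient-descent rule: step against the gradient, scaled by eta.
        updates[param] = param - eta * gparam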

Like before, the first step is to create a scalar input, eta, to control the step size, or learning rate, at every update iteration (Line 199). Then, for each parameter in the network we’d like to update, we compute the gradient of the loss with respect to it (Line 206). From this, we can define an update rule for the parameter: its difference with its gradient, scaled by eta (Line 208).

This final aspect of Theano, update functions, is very useful, if somewhat opaque, so we draw attention to it here. In Python / Theano, it is possible to define an update function for shared types, stored in an OrderedDict and keyed by a specific instance. As we will see in the next stage, these update functions can be invoked implicitly for each shared parameter, making the learning process significantly easier and effectively transparent.


 

4. Compile Theano Functions

As mentioned at the start of this section, Theano enables mathematical expressions to be compiled to C-code. Not only does this accelerate computation, but it also enables shared parameters to be updated behind the scenes.

Compiling Theano functions.
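
Continuing the sketch, the two functions compared below might be compiled as follows:

    # Training function: returns the batch loss and, via `updates`, implicitly
    # nudges every shared parameter at each call.
    objective_fx = theano.function(
        inputs=[x_input, y_target, eta],
        outputs=loss,
        updates=updates,
        allow_input_downcast=True)  # permit float64 inputs to downcast to float32

    # Inference function: no update rules, just the posterior for each input row.
    prediction_fx = theano.function(
        inputs=[x_input],
        outputs=z_output,
        allow_input_downcast=True)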

In practice, this proceeds as follows: all symbolic variables necessary to compute some output, which can be nearly anything, are provided as a list of inputs; the desired expression is declared as the output; and any update rules, as a dictionary keyed by shared parameter, are provided to the updates argument. As an implementation detail, we allow downcasting from float64 to float32. The result is a function that can be called like any standard Python method.

Before proceeding, compare the two functions, objective_fx and prediction_fx. The first takes three arguments: x_input, the data to transform; y_target, the posterior indices corresponding to the correct classes; and eta, the learning rate. Update rules are defined for this function, so the numerical values of the parameters listed in Line 202 will change at every iteration where eta > 0. Conversely, the second function takes only a single argument, x_input, and has no defined update rules. This function is used solely for inference, i.e. computing the posterior for each row in the input matrix.


 

Dataset

To facilitate the exploration of deep learning with musically relevant problems, we provide a precomputed dataset of constant-Q coefficients for 10 different instrument classes, sampled from the University of Iowa Musical Instrument Samples.

Be aware that this is, strictly speaking, a toy dataset and its use in “real” research is discouraged. However, it will more than serve our purposes to demonstrate how to implement a deep network classifier.

For more information regarding what exactly this data represents, please consult the README contained in the archive.


 

Shuffling the Data

It is yet another advantage of deep learning algorithms that they are able to scale to arbitrarily large datasets during training. This is achieved by computing gradients over small data batches and updating the parameters of the network accordingly; this is referred to as mini-batch optimization.

Though there is some theory developing around guiding heuristics for populating a mini-batch at a given iteration [see Bengio2009], randomly sampling the training set has been shown to work reasonably well in practice. To this end, we write a generator that returns batches of data and the corresponding labels forever.

Shuffling the data with a generator.
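
A sketch of such a generator (the function name is hypothetical), which reshuffles the order only when an epoch is exhausted:

    import numpy as np

    def minibatch(x, y, batch_size):
        """Yield (x_m, y_m) batches forever, sampling the data in random order."""
        order = np.random.permutation(len(x))
        idx = 0
        while True:
            if idx + batch_size > len(order):
                # A complete epoch has been consumed: reshuffle and start over.
                order = np.random.permutation(len(x))
                idx = 0
            batch = order[idx:idx + batch_size]
            idx += batch_size
            yield x[batch], y[batch]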

Here, this function consumes the full matrix of feature observations and corresponding vector of labels, and returns (at each call to next()) a randomly sampled subset of batch_size points. Notice that this method will produce infinite permutations of the dataset with minimal computational overhead, as the order of the data is only shuffled after a complete epoch (Lines 57:59).


 

Stepping through Data

For reasons similar to those motivating the data shuffler, there may be instances where a collection of data is too large to transform all at once. We can again use a generator to step through this data for us, making it very easy to iterate over non-overlapping slices of the data.

Stepping through data with a generator.
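
A generator along these lines (again a sketch, with a hypothetical name) might be:

    def stepper(x, batch_size):
        """Yield consecutive, non-overlapping slices of x, then stop."""
        for idx in range(0, len(x), batch_size):
            yield x[idx:idx + batch_size]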

Shortly (in Inference), we will see how to use an instance of this generator and list comprehension to compactly transform large collections of data without clogging up system resources.


 

Training

Now that the network training and inference routines have been compiled to Theano functions, we can use the data shuffling strategy discussed above to actually train the model. Training is an iterative process that often runs for a large number of update steps. As we will see, our experiments show 40-50k iterations to be sufficient; more than this runs the risk of overfitting.

Training a network.
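
A sketch of the training loop described below, with hypothetical data names (x_train, y_train); the learning rate and iteration count follow the figures reported later in this post, while the batch size and print frequency are assumptions:

    shuffler = minibatch(x_train, y_train, batch_size=100)
    max_iter, learning_rate = 50000, 0.025
    train_loss = np.zeros(max_iter)  # pre-allocated array of per-batch losses

    try:
        n_iter = 0
        while n_iter < max_iter:
            x_m, y_m = next(shuffler)
            train_loss[n_iter] = objective_fx(x_m, y_m, learning_rate)
            if (n_iter % 500) == 0:
                print("iter %6d: loss = %0.4f" % (n_iter, train_loss[n_iter]))
            n_iter += 1
    except KeyboardInterrupt:
        # Manual early stopping: ctrl+C exits the loop gracefully.
        print("Stopped early at iteration %d." % n_iter)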

The important parts of this code block are found in Lines 256:257. Training simply consists of sampling the shuffler by calling next(), which yields a batch of data, x_m, and corresponding labels, y_m, and passing them to the objective function along with a learning rate. This call returns the average classification error over the batch, which is written to a pre-allocated array; after potentially printing its progress to the console, the loop repeats.

As will be discussed in the final section, this while-statement is wrapped in a try-except block that gracefully exits when the user triggers a keyboard interrupt, ctrl+C. This is provided as a manual form of early stopping, but in practice you’ll probably want to develop a heuristic to do this automatically.


 

Inference

After training the network to completion, we can investigate how this model generalizes to the holdout test set. Using the data stepper discussed above makes this really easy:

Classifying the holdout data.
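
A sketch of this step (x_test and y_test are hypothetical names for the holdout data; the batch size is an assumption):

    # Apply prediction_fx to aligned, non-overlapping slices of the test set.
    correct = np.concatenate(
        [prediction_fx(x_m).argmax(axis=1) == y_m
         for x_m, y_m in zip(stepper(x_test, batch_size=100),
                             stepper(y_test, batch_size=100))])

    # Mean accuracy over all observations; one minus this is the total error.
    test_error = 1.0 - correct.mean()
    print("Test error: %0.4f" % test_error)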

Here, we take advantage of list comprehension to apply the prediction function prediction_fx to each batch of data in the test set. At each iteration, the argmax of the posterior over x_m is computed and compared with the correct classes in y_m, returning a list of booleans indicating whether each observation was predicted correctly or not. From this, we can take the mean to obtain the overall accuracy, and subtract it from 1 to get the total error.

As a point of reference, running this code with a learning rate of 0.025 for 50k iterations (run-time of 6 minutes) resulted in a training error below 5%, and a test error of 6.23%. Using sklearn’s SVM classifier (SVC) with a linear kernel obtains a training error of 15.23% and a test error of 19.07%.


 

Next Steps

At this point, we’ve covered enough of the provided source code that you should be able to comfortably follow the program execution of training a deep network. There are several exercises you may want to explore with this dataset:

  • How might you implement automatic early stopping?
  • What are the effects of batch size and learning rate? Or a variable learning rate?
  • How does performance change as a function of layer width (size of each transform) or depth (number of layers)?
  • Do different non-linearities affect performance? What about initial parameters?

Once you feel you’ve mastered this example, proceed to the next one: Learning Chroma Features from DFT Spectra.

posted @ 2015-06-11 23:57 张旭龙