Learning Chroma Features from DFT Spectra
Overview
Now we will demonstrate how to leverage annotated data to learn a desired representation (chroma). The discussion here builds upon the Instrument Classification walk-through and assumes the reader has mastered those concepts, so the explanation of repeated parts (data shuffling, etc.) is skipped here.
Source Code
Note: Again, we strongly recommend that you clone the provided deep learning tutorial repository, as directed on the tutorial page, and complete all necessary system configuration before attempting anything here.
From DFT Spectra to Chroma
Introduced by Fujishima in 1999, pitch-class profiles (PCP), now commonly known as chroma features, estimate the activation of the different pitch classes in 12-tone equal temperament, the standard tuning system of Western music.
A common way of computing this representation consists of three steps: first, the discrete Fourier transform (DFT) is applied to short frames of an audio signal, yielding magnitude spectra; then, this time-frequency representation is compressed into a log-frequency representation, referred to here as pitch spectra; finally, the pitch representation is folded into an estimate of the energy in each pitch class.
Here, the center frequency of each DFT channel is mapped to its nearest pitch, and the energy contribution of each channel, taken from the magnitude spectrum array mag_spec, is summed into a pitch map starting at C1 (index 24).
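As a rough sketch of this folding step (the variable names and bin-to-pitch rounding here are illustrative, not the tutorial's exact source):

```python
import numpy as np

def fold_to_pitch(mag_spec, samplerate=11025.0, num_pitches=90):
    """Sum DFT magnitudes into their nearest pitch bin, starting at C1 (MIDI 24)."""
    num_bins = len(mag_spec)
    # Center frequency of each DFT channel, skipping the DC bin.
    freqs = np.arange(1, num_bins) * samplerate / (2.0 * (num_bins - 1))
    # Nearest MIDI pitch number for each channel; MIDI 69 is A4 = 440 Hz.
    midi = np.round(69 + 12 * np.log2(freqs / 440.0)).astype(int)
    pitch_map = np.zeros(num_pitches)
    for bin_idx, pitch in enumerate(midi, start=1):
        idx = pitch - 24  # C1 sits at MIDI note 24
        if 0 <= idx < num_pitches:
            pitch_map[idx] += mag_spec[bin_idx]
    return pitch_map
```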
Then, chroma can be produced by computing the dot product between the pitch_map and a matrix that places a "1" at every octave of a given pitch class, which we will call an octave equivalence matrix.
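Continuing the sketch above, an illustrative construction of this matrix and the resulting chroma might look like:

```python
import numpy as np

def octave_equivalence_matrix(num_pitches=90):
    """Binary matrix with a 1 at every octave of each pitch class."""
    matrix = np.zeros([12, num_pitches])
    for pitch_class in range(12):
        matrix[pitch_class, pitch_class::12] = 1.0
    return matrix

# Pool a 90-bin pitch map (e.g. the output of fold_to_pitch above)
# into 12 pitch-class bins.
pitch_map = np.random.rand(90)  # placeholder input for illustration
chroma = np.dot(octave_equivalence_matrix(), pitch_map)
```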
It is quite unlikely, however, that every octave should contribute equally to the final pitch class estimate. This raises a rather obvious question: is there a better matrix to use for pooling chroma features? Here, we’ll use the basic tools of deep learning to find out.
1. Defining a Network Architecture
Mirroring the DFT-to-chroma process described above, we will now design a single-layer network to perform the same basic role as the octave equivalence matrix, turning pitch spectra into chroma.
Since we will be using the same algorithm above to compress the linear-frequency DFT coefficients into a log-frequency pitch representation, the input to our network will be a vector of 90 coefficients, indicated by line 169. The remainder of this snippet is quite similar to the previous classification example, with the exception of line 181.
Here, we use a hyperbolic tangent as our non-linearity, followed by a softmax operation. These are desirable for two reasons. First, using a half-wave rectified output in a single-layer network runs the risk of pulling all outputs to zero, a poor local minimum (we will discuss why this occurs in more detail shortly). Second, the softmax operation forces the output of the transform to have unit L1 norm, a common normalization strategy in chroma computation and a useful constraint in the learning problem.
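A minimal sketch of such a layer in Theano (the variable names and initialization below are assumptions, not the tutorial's exact code) could look like the following:

```python
import numpy as np
import theano
import theano.tensor as T

# Mini-batch of 90-dimensional pitch spectra.
x_input = T.matrix('x_input')

# Affine parameters of the single layer: 90 inputs -> 12 pitch classes.
weights = theano.shared(
    np.random.normal(0, 0.01, size=(90, 12)).astype(theano.config.floatX),
    name='weights')
bias = theano.shared(np.zeros(12, dtype=theano.config.floatX), name='bias')

# Hyperbolic tangent non-linearity followed by a softmax,
# so each output row is non-negative and sums to one.
z_output = T.nnet.softmax(T.tanh(T.dot(x_input, weights) + bias))
```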
2. Defining a Loss Function
As opposed to classification, here we define the desired output of the network beforehand. There is some freedom in how the network is trained to produce a target output for a given input; here, we minimize the mean squared error between the two.
Providing the target representation, y_target, as an input to the network, line 254 computes the Euclidean distance between the target and the actual output. This distance is then averaged over a mini-batch of inputs to yield a scalar loss.
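Continuing the Theano sketch above, this loss might be expressed as follows (again, the names are illustrative):

```python
# Chord-derived chroma templates, one row per example in the mini-batch.
y_target = T.matrix('y_target')

# Squared Euclidean distance per example (z_output comes from the sketch above),
# averaged over the mini-batch to yield a scalar loss.
loss = T.mean(T.sum(T.sqr(z_output - y_target), axis=1))
```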
The remaining two steps in building the model for training (defining update rules and compiling Theano functions) are nearly identical to the previous example, so the details are spared here.
Dataset
For training, a pre-computed dataset of DFT magnitude spectra from polyphonic audio and corresponding chord labels can be found here:
In the spirit of being maximally usable, care has been taken to sample the training and test sets from disjoint file splits. DFT coefficients are computed from audio with a sample rate of 11025 Hz and a frame size of 8192 samples, after applying a non-normalized Hanning window of equal length.
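To make those settings concrete, a single frame's magnitude spectrum under the stated parameters might be computed as in this sketch (consult the README for the exact framing used to build the dataset):

```python
import numpy as np

SAMPLERATE = 11025   # Hz
FRAMESIZE = 8192     # samples

def dft_magnitude(frame):
    """Magnitude spectrum of one frame, using a non-normalized Hanning window."""
    assert len(frame) == FRAMESIZE
    window = np.hanning(FRAMESIZE)
    return np.abs(np.fft.rfft(window * frame))
```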
For more information regarding what exactly this data represents, please consult the README contained in the archive.
Creating Chroma Templates
Ideally, a chroma transform would be trained using information about the specific pitches that were sounding in a short-time observation. This data can prove quite hard to accurately collect in large quantities, however, and we must be somewhat resourceful. Lucky for us, chord annotations are slightly easier to collect, and we can use them as a coarse approximation of the notes that are likely present in an observation.
Using chords as weak pitch labels, we can map each chord name to a binary vector of expected pitch classes, normalized to sum to 1. These boolean targets are the templates we provide to the network during training as y_target. In the provided source, chords are defined for all roots in 13 qualities (major, minor, etc.); chord labels outside this set are ignored during training.
For clarity, the v061 vocabulary described below yields a set of 61 such templates: 12 roots in each of five qualities, plus no-chord.
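A minimal sketch of how such templates might be built (only a few qualities are spelled out here; the provided source defines the full set):

```python
import numpy as np

# Pitch-class intervals relative to the root for a few common qualities.
QUALITIES = {
    'maj': (0, 4, 7),
    'min': (0, 3, 7),
    'maj7': (0, 4, 7, 11),
    'min7': (0, 3, 7, 10),
    '7': (0, 4, 7, 10),
}

def chord_template(root, quality):
    """Binary pitch-class vector for a chord, normalized to sum to 1."""
    template = np.zeros(12)
    for interval in QUALITIES[quality]:
        template[(root + interval) % 12] = 1.0
    return template / template.sum()

# Example: C:min places equal weight on pitch classes {0, 3, 7}.
print(chord_template(0, 'min'))
```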
Making Sense of Chord Labels
It is important to note that the space of chord labels is potentially massive, depending on the level of detail exercised by the annotator. Some chord labels may specify missing notes, added non-chord tones, and the particular voicing of the shape.
To make the process of dealing with these labels less agonizing, we have provided a human-readable JSON file that maps all known chord labels to a unique integer, referred to as a "chord index". Three vocabularies are provided to this end:
- v025: All chords are mapped to major or minor, per common MIREX evaluation.
- v061: Chords are filtered by quality, constrained to the set [maj, min, maj7, min7, 7]; all remaining chords are assigned to "-1", an ignored class.
- v157: Same set as v061, extended to include [maj6, min6, aug, dim, sus2, sus4, hdim7, dim7]; all remaining chords are assigned to "-1", an ignored class.
Each chord index can be interpreted as a chord by the following logic: the index modulo 12 gives the pitch class (0 -> C, 4 -> E, etc.), and integer division by 12 gives the quality (0 -> maj, 3 -> min7). Note that indexing starts at 0, the derived quality is limited by the size of the vocabulary, and the highest index is always equal to "no-chord" (N).
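For illustration, that decoding logic can be written out as follows (the pitch-class spellings and helper function are assumptions for this sketch, not the provided JSON's exact conventions):

```python
PITCH_CLASSES = ['C', 'C#', 'D', 'Eb', 'E', 'F', 'F#', 'G', 'Ab', 'A', 'Bb', 'B']

def decode_chord_index(index, num_classes, qualities):
    """Interpret a chord index given a vocabulary's ordered list of qualities."""
    if index == num_classes - 1:
        return 'N'        # the highest index is always no-chord
    if index < 0:
        return None       # -1 marks chords ignored during training
    root = PITCH_CLASSES[index % 12]
    quality = qualities[index // 12]
    return '%s:%s' % (root, quality)

# In the v061 vocabulary, for example:
V061_QUALITIES = ['maj', 'min', 'maj7', 'min7', '7']
print(decode_chord_index(28, 61, V061_QUALITIES))  # -> 'E:maj7'
```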
To help prevent any confusion, consider the following mappings from chord label to chord index in the three vocabularies:
| Chord Label | v025 | v061 | v157 |
|---|---|---|---|
| C:min | 12 | 12 | 12 |
| E:maj7(*5) | 4 | 28 | -1 |
| E:sus2/2 | 4 | -1 | 124 |
| N | 24 | 60 | 156 |
Training
By now, you should be entirely unsurprised to discover that training uses the exact same code as the instrument classification example. Run the training script, chroma_learning.py, with the provided data to start learning the parameters for your chroma transform!
Comparing the Transforms
After running the training script to completion (or, if you're impatient, using the provided parameters, sample_params.pk), you can see a side-by-side comparison of the baseline chroma transform and the data-driven version by passing the proper arguments to chroma_transform.py.
Here we can see the two chroma representations of the 30-second clip, SMC_281, provided with the dataset (original source). The excerpt is both polyphonic and harmonically rich, and the learned chroma representation (bottom) looks much cleaner than that obtained with the octave equivalence matrix (top).
Why is this the case? Let’s take a closer look at the weights learned during training…
From the plot of the learned weight matrix, we can see that not all octaves are of equal importance! There is a heavy emphasis placed on the mid-range of the pitch scale, and only four octaves really contribute to the chroma estimate. In particular, high frequencies are mostly ignored, making the transform more robust to wide-band noise and complex sound mixtures. While this kind of weighting isn’t exactly new in the design of chroma transforms, we do find an unexpected result: coefficients adjacent to the pitch class of interest are negatively weighted. In hindsight, this is intuitively appealing; the contribution of a pitch should count for less if the local frequency range has a higher energy level.
Now compare the representations obtained for ivi.wav, the other provided sample file. Here, a synthesized piano plays a C major, followed by four ascending quarter notes to a G major, ending again on C major.

In contrast to the previous result, there are two interesting observations to draw from this comparison:
- The learned transform (bottom) is much better at estimating true polyphony. When multiple notes are sounding simultaneously, the chroma representation rather closely approximates the notes in the mixture.
- The baseline transform (top) is much better at estimating notes in isolation (frames 20-40). This actually makes a good deal of sense; the transform is only trained to produce chords, and the templates place a heavy prior on pitch simultaneity with a minimum polyphony of three. Similarly, the data itself is sampled from polyphonic audio, and monophonic parts are rare, if even present.
Next Steps
Now that we’ve stepped through a few examples, you should be well on your way to mastering deep learning in Python. If you’re feeling up for a challenge, here are some ideas to extend the code and data provided:
- How might you change the templates to more faithfully match the labels (e.g. Bb:maj(*3,*5))?
- Here, the softmax operator was used to constrain the output space. As deeper networks are prone to over-fitting the data, how would you implement other forms of regularization, e.g. weight decay, to help prevent such undesirable behavior?
- While chroma is one potentially useful representation, there are certainly other harmonic spaces that we could try to reproduce, such as the Tonnetz. How would you translate these chord annotations into coordinates in Tonnetz space?







