Automated Music Generator using Recurrent Neural Network


EECS 349 Final Project
Ben Krege, Victor Lalo, and Robert Belson

Abstract


Our task is to create a machine composer that can learn a composer's or genre's style and produce original work in that style. Bach was only able to write around 500 chorales in his lifetime; wouldn't it be great if we could do this in an hour? This task is artistically significant because we can experiment with different genres or mixes of input data to create new styles and compositions that can inspire our own work, or possibly stand as quality compositions in their own right. Furthermore, by interacting with a machine composer, writers can be provided with material and inspiration for their work.

Using Daniel Johnson's Biaxial Long Short-Term Memory Recurrent Neural Network, we generated new compositions in a variety of styles, from classical (e.g., music in the style of Fryderyk Chopin) to classic rock (e.g., music in the style of The Beatles). This neural network learner's strengths included high-quality output – the music was aesthetically pleasing and aurally resembled the genre of the training set – and a comprehensive feature set, including pattern recognition, meter, and pitch class. Given the complexity of the learner required to perform automated music composition, coupled with an in-depth analysis of Daniel Johnson's algorithm itself, we concluded that a neural network is the only learner capable of producing meaningful outputs in this discipline. Furthermore, in experimenting with hyperparameter adjustments – such as dropout and temperature – we determined that the adjustments that led to the most aurally recognizable changes came from varying the number of nodes in the hidden layers along the time and note axes.

How'd we do it?

Project Design


Our Architecture: Daniel Johnson's Biaxial Long Short-Term Memory Recurrent Neural Network

After ~12 hours of training, producing a new piece takes just a matter of minutes.

Endless, Recyclable Music Composition

Using GPU-enabled AWS p2.xlarge instances to drive performance, we queued up a variety of compositions over the course of a few weeks.

Our Data (MIDI Files)

We have found numerous online MIDI libraries with raw data from a multitude of styles and composers. So far, we have used a dataset of a variety of works by Bach (chorales, fugues, variations, etc.) of size ~1.2MB (the Complete Bach Midi Index from bachcentral.com). The five features include input position, pitch class, previous vicinity, previous context, and beat. The examples are essentially the repertoire itself, on which we run our model with held-out test data. Of the ~1.2MB dataset we chose, all of it was used for development/training, as a smaller partition may have provided an insufficient amount of data for automated music composition.

Breadth of Project

Unlike Daniel Johnson, we sought to apply his neural network across a variety of styles, comparing its capacity to learn each distinctive style. Beginning with the classical genre, drawing from the compositions of Bach and Chopin, we then explored other genres such as classic rock, pop, and electronic and house music.

How does the Biaxial Recurrent Neural Network Work?


Drawing from some concepts of convolutional neural networks (applying the same invariant network at each time step), coupled with long short-term memory layers, our recurrent neural network learns to predict which notes will be played (among other parameters, such as duration) at each time step (16th notes) of a piece.
In a basic recurrent neural network, the output of each node in a hidden layer is not just a weighted sum of the current inputs: it is also fed back to itself, and to the other nodes in that "column," as an additional input at the next time step. Thus, each hidden node's value is derived from both the current inputs and the previous time step's outputs of that layer, as seen in the figure below:
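
As a minimal illustration of this recurrence (a sketch of ours, not Johnson's code), the NumPy snippet below computes a hidden state from the current input and the previous step's hidden state, feeding each output back in at the next time step. The weight names and dimensions are arbitrary.

import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    """One step of a vanilla RNN: the new hidden state depends on the
    current input and on the previous time step's hidden state."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

# Toy dimensions: 10 input features, 20 hidden units.
rng = np.random.default_rng(0)
W_xh = rng.standard_normal((10, 20)) * 0.1
W_hh = rng.standard_normal((20, 20)) * 0.1
b_h = np.zeros(20)

h = np.zeros(20)                              # initial hidden state
for x_t in rng.standard_normal((16, 10)):     # 16 time steps of input
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)     # output fed back at the next step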

For the network to have short-term memory, an output needs to become an input at the next time step. Long Short-Term Memory (LSTM) nodes also include a saved value that is used in each node's calculations. This value can be modified (e.g., added to, subtracted from, etc.) at each time step.
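
A minimal sketch of a single LSTM step is shown below, with the saved cell value c carried forward and modified by the gates at each time step. Again, the weights and dimensions are illustrative rather than taken from the project's code.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. c_prev is the saved cell value; the gates decide how much
    of it to keep, how much new information to add, and what to expose."""
    z = x_t @ W + h_prev @ U + b                   # all four gate pre-activations
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget, output gates
    g = np.tanh(g)                                 # candidate update
    c = f * c_prev + i * g                         # modify the saved value
    h = o * np.tanh(c)                             # output fed back next step
    return h, c

# Toy dimensions: 10 inputs, 20 hidden units (so 80 gate pre-activations).
rng = np.random.default_rng(0)
W = rng.standard_normal((10, 80)) * 0.1
U = rng.standard_normal((20, 80)) * 0.1
b = np.zeros(80)

h, c = np.zeros(20), np.zeros(20)
for x_t in rng.standard_normal((16, 10)):
    h, c = lstm_step(x_t, h, c, W, U, b)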

For this neural network, adapted specifically to generate pieces of music, the unique properties include:

  1. Note and Time Invariance: For the generated music to be transposable and composed to an indefinite length, the network was constructed to be identical for each note and time step (time- and note-invariant)
  2. Vertical and Horizontal Motion: The network supports polyphonic music, with multiple notes at a given time step
  3. Repetition as an Integral Compositional Element: Western music is repetitive, so the network must allow repetition of the same note. A unique aspect of this network is that it has an encoding for a note being held, in contrast with some previous machine composer designs that did not distinguish between a held note and a repeated note.
However, the network also needs to integrate pattern recognition over both time and pitch, an integral component of constructing aurally convincing music. Coined the "Biaxial RNN" by Daniel Johnson, the network therefore has a time axis and a note axis, forming recurrent connections along both. The first two layers default to 300 nodes each and feed back through the time axis. The third and fourth layers, 100 and 50 nodes respectively, feed back along the note axis to give the network an awareness of harmony.
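
To make the biaxial structure concrete, below is a shape-level sketch in plain NumPy (not Johnson's Theano implementation): the same recurrent layer is reused for every note along the time axis, then for every time step along the note axis. The layer sizes follow the defaults quoted above; the note range, number of time steps, and per-note feature count are illustrative placeholders.

import numpy as np

N_NOTES, N_STEPS, N_FEATURES = 78, 32, 80        # illustrative placeholders
TIME_SIZES, NOTE_SIZES = [300, 300], [100, 50]   # layer sizes quoted above

def make_layer(in_dim, size, rng):
    """Weights for one recurrent layer; sharing them is what gives invariance."""
    return rng.standard_normal((in_dim + size, size)) * 0.1

def run_layer(W, seq):
    """Run the layer along a sequence, feeding its hidden state back each step."""
    size = W.shape[1]
    h, out = np.zeros(size), []
    for x in seq:
        h = np.tanh(np.concatenate([x, h]) @ W)
        out.append(h)
    return np.stack(out)

rng = np.random.default_rng(0)
x = rng.standard_normal((N_NOTES, N_STEPS, N_FEATURES))

# Time axis: identical network applied to every note, recurrent over time steps.
h = x
for size in TIME_SIZES:
    W = make_layer(h.shape[-1], size, rng)
    h = np.stack([run_layer(W, h[n]) for n in range(N_NOTES)])

# Note axis: identical network applied at every time step, recurrent from low
# notes to high, which is what gives the network an awareness of harmony.
h = h.transpose(1, 0, 2)                          # -> (steps, notes, features)
for size in NOTE_SIZES:
    W = make_layer(h.shape[-1], size, rng)
    h = np.stack([run_layer(W, h[t]) for t in range(N_STEPS)])

print(h.shape)   # (32, 78, 50); per-note play probabilities are read off this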

Implementation Details (Cont.)

Project Report


From Google’s Magenta and WaveNet, to Stanford University's GRUV, to Cambridge University’s BachBot, the plethora of deep learning software for music generation available today shares one thing in common: neural networks. Music is a series of relationships constructed across a piece, from note to note, harmony to harmony, and even at larger macrostructural levels. Understanding these relationships – and the patterns and repetition they produce – is essential to music composition. Since neural networks excel at pattern recognition, they are the most natural choice for a learner. In conducting this project, we also looked at other techniques we could use to generate music. One of the leading alternatives we considered was a decision-tree-style probabilistic model: by segmenting a dataset of thousands of harmonic progressions stored in a CSV, a probabilistic model could generate a new series of harmonies, and we would then use a Python MIDI package to render those harmonies as MIDI. While this could have been a successful project, the generated harmonies would have had no way to morph into a more elaborate piece of music, because of their high level of determinism and lack of context awareness. Thus, we chose long short-term memory neural networks as our backend architecture.
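
To make that rejected alternative concrete, here is a minimal sketch of such a probabilistic harmony generator. It assumes a hypothetical progressions.csv of comma-separated chord sequences, uses the third-party mido library in place of the unspecified Python MIDI package, and simplifies the chord-to-pitch mapping.

import csv
import random
from collections import defaultdict

import mido

# Hypothetical input: each row of progressions.csv is one progression, e.g. "C,Am,F,G".
CHORD_NOTES = {"C": [60, 64, 67], "Am": [57, 60, 64], "F": [53, 57, 60], "G": [55, 59, 62]}

# 1. Count first-order transitions between chords.
transitions = defaultdict(lambda: defaultdict(int))
with open("progressions.csv") as f:
    for row in csv.reader(f):
        for a, b in zip(row, row[1:]):
            transitions[a][b] += 1

# 2. Sample a new progression from the transition counts.
chord, progression = "C", ["C"]
for _ in range(7):
    nxt = transitions.get(chord)
    if not nxt:
        break
    choices, weights = zip(*nxt.items())
    chord = random.choices(choices, weights=weights)[0]
    progression.append(chord)

# 3. Write the sampled chords to a MIDI file (one whole note per chord).
mid = mido.MidiFile()
track = mido.MidiTrack()
mid.tracks.append(track)
for chord in progression:
    notes = CHORD_NOTES.get(chord, [60])
    for n in notes:
        track.append(mido.Message("note_on", note=n, velocity=80, time=0))
    for i, n in enumerate(notes):
        track.append(mido.Message("note_off", note=n, velocity=0,
                                  time=mid.ticks_per_beat * 4 if i == 0 else 0))
mid.save("markov_progression.mid")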

For each output generated (see below), the training set was composed of a series of MIDI files in that style or genre. For example, in our Bach dataset, we had approximately 1.2MB of a variety of Bach selections, from fugues to chorales. The specific features of our data, outlined in the “How does the Biaxial Recurrent Neural Network Work?” section above, are as follows (a small sketch of how they might be assembled into an input vector appears after the list):

1. Pitch class: The current pitch of a particular note, ranging over the MIDI note values 0-127.
2. Previous vicinity: A set of 50 values covering the pitches surrounding the note, where each value is 1 if that pitch was present at the last timestep and 0 otherwise.
3. Previous context: Similar to the previous vicinity feature, but condensed to one octave; the number of times each pitch class was played in any octave is reported across 12 values.
4. Beat: A four-value binary representation of the timestep's position within the 4/4 measure.
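
As a rough illustration of how these features could be assembled into a single input vector for one note at one timestep, consider the simplified sketch below. The helper name, dimensions, and encoding details are ours and only approximate the encoding in Johnson's code.

import numpy as np

def note_features(midi_pitch, prev_played, timestep, low_pitch=24):
    """midi_pitch: MIDI number of this note (0-127).
    prev_played: set of MIDI pitches that sounded at the previous timestep.
    timestep: index of the current 16th-note timestep."""
    # Pitch class: one-hot over the 12 pitch classes of this note.
    pitch_class = np.zeros(12)
    pitch_class[midi_pitch % 12] = 1.0

    # Previous vicinity: 50 values for the 25 pitches around this note
    # (played/articulated pairs, simplified here to the same flag).
    vicinity = np.zeros(50)
    for offset in range(-12, 13):
        if midi_pitch + offset in prev_played:
            vicinity[2 * (offset + 12)] = 1.0       # played
            vicinity[2 * (offset + 12) + 1] = 1.0   # articulated (simplified)

    # Previous context: counts of each pitch class sounding last step, any octave.
    context = np.zeros(12)
    for p in prev_played:
        context[p % 12] += 1.0

    # Beat: binary encoding of the position within a 4/4 measure (16 timesteps).
    beat = np.array([(timestep >> b) & 1 for b in range(4)], dtype=float)

    return np.concatenate([[midi_pitch - low_pitch], pitch_class, vicinity, context, beat])

vec = note_features(60, prev_played={55, 60, 64}, timestep=5)
print(vec.shape)   # (79,) in this simplified encoding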

The Datasets


We used four datasets for our project. Each set contained between 1 and 2 MB of data.

Bach 1.0: Our First Bach Dataset: In this dataset, we used approximately 1.2MB of a variety of Bach selections, from fugues to chorales. Characteristics of this dataset include the standard harmonic progressions of Western tonal music (i.e., what is often referred to as functional harmony), multiple independent voices, and modulations to closely related keys (e.g., C Major to G Major). The MIDI was represented in multi-track form, and the RNN tried to reconcile chord structures and melodies across tracks, a feat significantly more challenging than if all tracks (voices) had been condensed into one. The output was a fugue-esque composition, with two independent melodic lines at a particularly slow tempo in C Major. Especially after hearing the audio output, this led us to conclude that a relative weakness of Daniel Johnson’s algorithm was multi-track MIDI. In later iterations (i.e., Bach 2.0), we used single-track MIDI.


The error of the final training epoch (click here to see the complete training perplexity/error): epoch 4900, error=1442.07702637

Daniel Johnson used 300 nodes in each of the first two (rhythm) layers and 100 and 50 in the note layers. This choice seemed peculiar to us, because rhythm is generally considered the simpler part of music. While we were more focused on establishing that the network could pick up on individual styles, we experimented with lowering the number of nodes in the rhythm layers to see whether that many were really needed, training a network with 256 nodes in the first and second layers while keeping the note layers unchanged. We found that the network worked just as well, if not better, with this configuration (see graph below). While we were intrigued by this result, for the sake of time we did not investigate further. In the future, we would like to experiment much more with the dimensions of the network. For example, a master layer in front of the rhythm layers, with a rather small number of nodes that feeds forward along both the note and time axes, could give the network a more comprehensive compositional “plan.” Judging from the graphs, the networks leveled off before the 6000th epoch. Boosting the number of layers and nodes in the note layers could allow the network to continue learning more complex patterns in the later stages instead of stagnating. In general, our intuition is that more layers could help the network compose more interesting music and produce more recurring patterns (an incredibly important aspect of music). This could also eliminate the minor note-placement errors the network still makes. Since we are using datasets that are fairly comprehensive of each composer’s career, it would be worth the time to optimize the hyperparameters in order to produce an even more convincing result.

In our first run (May 18th), we used 256 nodes in the first two layers; in our second run (May 19th), we used 300. The following graph, illustrating error over time (epoch number), shows both training error curves:


Bach 2.0: More Comprehensive Bach Dataset: As expected, this dataset performed significantly better than our first Bach dataset. This was due to changing the MIDI format (see the Input Formatting section) from 1 to 0, as well as nearly doubling the number of training examples. We ran 153 Bach compositions, with MIDI note counts ranging from 312 to 2,930 notes per file, averaging around 1,044 notes per file. The resulting composition features a clear melodic line with a supporting accompaniment figure in Eb Major. This significantly more harmonically nuanced composition supported our earlier hypothesis that the RNN architecture prefers single-track MIDI.


The perplexity and error of the final training epoch (click here to see the complete training perplexity/error): epoch 9900, minute 24939121.4239, error=2092.07885742, perplexity=0.000772314505144

Code added: Daniel Johnson’s code proved fairly abstruse for us to manipulate, mainly because it was written in Theano. None of us is very familiar with that library, and given the time constraints we deemed it unwise to get too deep in the weeds. However, we did write some code to extract more data during the training process. Originally, only the epoch and error were output every 100 epochs; we added more user feedback, including training time and perplexity. After graphing, however, we decided that error was the best metric for visualization anyway. Krege wrote a parser that read the terminal output, collected the relevant data, and wrote it to a CSV (comma-separated values) file so that it could be read by Microsoft Excel for visualization.
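
A sketch of the kind of parser described here is shown below. It assumes log lines shaped like the training output quoted in this report (epoch, minute, error, perplexity); the regular expression and file names are illustrative.

import csv
import re

# Assumed log-line shape, matching the training output quoted in this report:
#   epoch 9900, minute 24939121.4239, error=2092.07885742, perplexity=0.000772314505144
LINE = re.compile(
    r"epoch (\d+), minute ([\d.]+), error=([\d.]+), perplexity=([\d.eE+-]+)"
)

def log_to_csv(log_path="training_output.txt", csv_path="training_curve.csv"):
    """Read captured terminal output and write epoch/minute/error/perplexity
    rows to a CSV file for graphing in Excel."""
    with open(log_path) as log, open(csv_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["epoch", "minute", "error", "perplexity"])
        for line in log:
            m = LINE.search(line)
            if m:
                writer.writerow(m.groups())

if __name__ == "__main__":
    log_to_csv()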

The error of our output (for the second Bach dataset), graphed over time (epoch number), can be seen below:


Chopin Dataset: We ran 95 Chopin compositions, with MIDI note counts ranging from 675 to 4,827 notes per file, averaging around 1,662 notes per file. Inspired by Polish folk tunes, Chopin's compositions are characterized by melodic ornamentation, increasingly complex modulations, and more liberal tempos. The Chopin output resembled many of his nocturnes, with a slightly syncopated rhythm in Eb Major. While the composition was by no means a “perfect replica” of a Chopin composition, many of Chopin’s idiosyncratic compositional elements (e.g., chromatic passing tones in the melodic line, and tonal areas less likely to be in C Major) were certainly present.


The perplexity and error of the final training epoch (click here to see the complete training perplexity/error): epoch 9800, minute 24940234.9389, error=3977.01831055, perplexity=0.000845743685353

Beatles Dataset: The Beatles have a genre-defying style, and their musical energy range exceeds that of any other pop group. Their compositions include slow ballads, upbeat rock, lullabies, acoustic folk, and psychedelic drones. It is very difficult to pinpoint a specific mood, structure, or even melodic pattern, since the quartet changed their sound so drastically from one album to the next. We gathered every song from seven of their studio albums: Rubber Soul, Revolver, Sgt. Pepper’s, Magical Mystery Tour, The White Album, Abbey Road, and Let It Be, along with some single releases. In total, we ran 96 Beatles compositions, with MIDI note counts ranging from 1,762 to 10,988 notes per file, averaging around 3,762 notes per file. The number of notes played in Beatles music was much higher on average than in the classical compositions. Our guess is that this has to do with the instrumentation of Beatles songs, which have many more voices complementing each other at once. We believe this allowed the neural network to pick up on rhythmic patterns very well.

The software’s output really impressed us. Although we mentioned above that it is difficult to define the Beatles’ musical style, we could make out much of Paul McCartney and John Lennon’s songwriting in the neural net’s creation. From the beginning of the piece, we noticed that the algorithm recognized a rhythmic pattern used frequently in McCartney’s piano playing: it captures the chords repeating on the upbeat while the bass plays on the downbeat. This alternating style, found in many Beatles tracks, captures the listener with differing frequency ranges responding to each other over time. This section’s melody also complements Lennon’s descending scale tones, used in many of his vocal lines (e.g., Blue Jay Way, Come Together). At around one minute in, we start to hear an alternating root bass against elegant and complex major chords, with an uplifting melody, very reminiscent of some of McCartney’s sillier or lighter songs (e.g., Maxwell’s Silver Hammer, Hello Goodbye, Good Day Sunshine). The chords contain many dominant-seventh and major-seventh notes, a key component of multiple Beatles progressions. At around the 1:35 mark, we hear a beautifully crafted progression in A minor, which many people would recognize as a standard pop progression used by many top artists today. It amazes us that the neural network is able to implicitly learn, to some extent, what makes chords fit well together. The higher-pitched melody gracefully dances above the chords and almost seamlessly blends back into the chords themselves. This is our favorite section of the track for its simple, yet almost human, feel. It reminds us of McCartney’s piano-led songs (e.g., Hey Jude, Golden Slumbers), delicately fusing vocals and piano into very emotionally charged compositions. The two output recordings below (from left to right) feature a snippet of our output that showcases a Beatles-sounding chord progression, followed by the full output.

The perplexity and error of the final training epoch (click here to see the complete training perplexity/error): epoch 9800, minute 24941982.3987, error=2119.68139648, perplexity=0.000781532744972

	
Input Formatting: While gathering MIDI files for analysis, we found that most files divide a piece into different tracks to represent separate voices. This makes sense for playing back a composition with multiple instruments, but for our analysis we wanted all parts of the piece (melody, chords, bass) on a single piano track, allowing the algorithm to focus on a single track per piece. While researching how MIDI files are written, we learned that there are multiple MIDI formats. Most of our examples were MIDI Format 1 files, which contain multiple tracks for output on multiple instruments. We therefore used a simple conversion program called Sweet Midi Converter to turn these files into MIDI Format 0, which unifies all tracks into one.

Another problem we ran into was rhythm sections. Percussion in MIDI relies on hit samples placed in a sampler. A sampler still uses note conventions, assigning a sample to each key on a keyboard, but when read by a note analyzer it throws off the rest of the song’s harmonic content. In other words, if the song is in C Major and the sampler has a kick sample on B2 and a snare sample on D#3, feeding this track to our algorithm would greatly alter the conclusions drawn about the harmony. Our solution was to remove all percussion tracks from the files. Bach and Chopin pieces needed very little change, since they are meant to be played as solo or duet piano works. Our Beatles dataset, on the other hand, had percussion in almost every file. MIDI Format 1 worked in our favor here, as we were able to delete the individual percussion tracks and then save the file without them. We then converted the now percussion-less files into Format 0 for a single MIDI track.
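
As an illustration of this preprocessing, the sketch below uses the third-party mido library (we actually used Sweet Midi Converter and manual track deletion) to drop percussion tracks, which live on MIDI channel 10 (index 9 in mido), and merge the remaining voices into a single Format 0 track. The file names are placeholders.

import mido

def strip_percussion_and_flatten(in_path, out_path):
    """Remove percussion tracks (MIDI channel 10, index 9 in mido) from a
    Format 1 file and save the remaining voices as one Format 0 track."""
    src = mido.MidiFile(in_path)

    kept_tracks = []
    for track in src.tracks:
        # Treat a track as percussion if any of its channel messages use channel 9.
        if any(getattr(msg, "channel", None) == 9 for msg in track):
            continue
        kept_tracks.append(track)

    dst = mido.MidiFile(type=0, ticks_per_beat=src.ticks_per_beat)
    dst.tracks.append(mido.merge_tracks(kept_tracks))
    dst.save(out_path)

strip_percussion_and_flatten("beatles_song_format1.mid", "beatles_song_format0.mid")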

	
Output Modulation/Manipulation: In order to turn the neural net’s MIDI output into useful audio, we imported the files into Logic Pro X, processing and editing within the Digital Audio Workstation for a more pleasant listening experience. For comparison, we have included the original neural network’s output alongside our edited versions. In our opinion, the edited versions have more “feel” and the listener can relate more to this experience.

	
MIDI Effects: In terms of the MIDI notes, we wanted to give our machine’s performance a human feel. We achieved this with several effects that manipulate the notes’ timing, velocity, and relative pitch. The first addition was subtle swing. Swing moves notes very slightly off the grid, giving the performance a less-than-perfect timing; human error is part of many individual styles, and sounding mechanically perfect detracts from a piece’s movement. The Beatles output received the biggest timing change, while the Bach output received minimal swing, accounting for the style each composer intended. Next, we modulated the notes’ velocities. Again, having every note hit at the same force constantly detracts from a song’s dynamic energy. The velocity modulator allows for a random velocity (a 0-127 value) within a specified range, with a bias towards playing harder or softer more often. Lastly, we used a scale transposer to keep the song harmonious and avoid dissonant notes. We set a scale, say C Major, and if Logic detects a note that is not within the scale, it transposes it to the closest note that is. This is surely the most controversial edit we made to our neural net’s output, but we see it as adding an extra constraint: by filtering incorrect notes, the piece sounds more cohesive and is easier for a listener to follow. The classical compositions tended to lean towards major scales, while The Beatles had simpler, though more often minor, tonalities.
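
Our edits were made inside Logic Pro X, but the same three ideas can be sketched in code. The example below, using the third-party mido library, applies slight timing jitter, randomized velocities, and C Major scale snapping to a MIDI file; the thresholds, scale, and file names are illustrative and are not what Logic does internally.

import random
import mido

C_MAJOR = {0, 2, 4, 5, 7, 9, 11}   # pitch classes allowed by the scale transposer

def humanize(in_path, out_path, swing_ticks=10, velocity_jitter=12):
    """Apply slight swing (timing offsets), randomized velocities, and
    snap out-of-scale notes down to the nearest C Major pitch."""
    src = mido.MidiFile(in_path)
    for track in src.tracks:
        for msg in track:
            if msg.type not in ("note_on", "note_off"):
                continue
            # Swing: nudge note timings slightly off the grid.
            msg.time = max(0, msg.time + random.randint(-swing_ticks, swing_ticks))
            # Velocity: vary how hard each note is struck.
            if msg.type == "note_on" and msg.velocity > 0:
                msg.velocity = min(127, max(1, msg.velocity + random.randint(-velocity_jitter, velocity_jitter)))
            # Scale transposer: move out-of-scale notes to a nearby scale tone.
            while msg.note % 12 not in C_MAJOR:
                msg.note -= 1
    src.save(out_path)

humanize("beatles_output.mid", "beatles_output_humanized.mid")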
	
Audio Output: As for the actual audio output, we ran our MIDI notes through a sampler loaded with recordings of individual Boesendorfer grand piano keys, each captured at multiple velocities. This is about as close as we can get to replicating an acoustic piano. Aside from the instrument, we used an equalizer to shape the tone and a compressor for dynamic stability. Lastly, we placed a convolution reverb on our audio chain, giving each output a unique space; the reverb works from frequency responses captured in specific rooms and halls. Each piece’s sound is slightly different from the others to account for the style of the composer. We believe this instrumentation helps express what the network has learned and created.
	
Future Work: We are very pleased with the outputs of this project, and there are many areas for further exploration. First, we could experiment with a variety of other musical genres (e.g., pop, blues, jazz), creating new datasets on which to run our RNN. Additionally, a significantly more time-consuming project could be to modify the algorithm to support a wider range of rhythmic devices (e.g., 32nd notes) and harmonic devices (e.g., different weightings of the “vicinity” feature to support more interesting melodic lines for, say, 12-tone compositions).

	

Group Responsibilities

Ben: Backend infrastructure work with AWS
Victor: Dataset manipulation and FX synthesis
Robbie: Front-end webpage design, harmonic analyses of RNN outputs

Bibliography

Johnson, Daniel. "Composing Music With Recurrent Neural Networks." Hexahedria. August 02, 2015. Accessed May 30, 2017. http://www.hexahedria.com/2015/08/03/composing-music-with-recurrent-neural-networks/.

Contact Us


Robbie, Ben, and Victor are undergraduate students at Northwestern University studying a variety of disciplines – from Electrical Engineering and Computer Science to Classical Guitar Performance and Music.
