Metis Final Project: Music Composition with LSTMs

For my final project at Metis Data Science, I designed a recurrent neural network utilizing Long Short-Term Memory nodes (LSTMs) to learn patterns in the Six Cello Suites by J.S. Bach, and subsequently generate its own musical fragments. I learned a ton about deep learning and feature engineering, and was inspired to continue exploring the intersection between data and art - watch this space for further developments!

Bach and neural networks...why?

So why would you want to leverage machine learning for the purpose of generating bad music? I have to admit, this does sound like an ill-advised and totally impractical idea. After all, some of the talented folks in my cohort focused on seemingly more substantial topics, such as predicting seizures in hospital patients, or identifying patterns of insider trading.

The short answer is that I really wanted to make the capstone a personal project by making full use of my musical background and interest, having as much fun as possible and developing new questions about music and data in the process. Also, I can't resist injecting my personal projects with a bit of silliness, as seen in similar endeavors like TED-RNN.

I set out with modest goals:

  1. Create a model which generates "interesting" musical fragments derived from material in the Bach Suites (this is different from realistically imitating Bach)
  2. Develop a better understanding of the end-to-end process involved in using a deep learning model

In this context, building JohaNN was highly fruitful. Enough with the leadup, on with the implementation!

Sourcing the data and general model strategy

First, I had to find a good data representation of the Cello Suites, preferably in a format parseable by the wonderful music21 library, since I wasn't about to transcribe from raw audio recordings, nor did I have time to do automated scanning of PDF scores.

Text-based notation formats like ABC have been used to great effect in projects like this stunning Irish folk tune bot, but data availability is patchy when it comes to classical works and they seem less well-suited for representing polyphonic structures. Also, the general idea of feeding music into a NN model as plaintext - with disregard for its inherent musical features - didn't sound as interesting to me, despite the apparent effectiveness of domain-agnostic char-RNN structures in a number of projects. Ultimately, I went with a solid MIDI rendition of the Suites found here.

Much has been written about the features of LSTMs, so I'll skip over the in-depth explanation here - in short, each memory cell combines input, forget, and output gates, which makes LSTMs well suited to learning long patterns in sequential data. For the purpose of this model, I chose a simple two-layer LSTM network, applying dropout after each layer.

Parsing the Bach corpus

The imported Bach MIDI files were treated as a stream of notes (and rests) using the rich set of feature extractors in music21. Roughly, this is how each note in the music is represented:

  1. Each note or rest is represented as a tuple: (midi_pitch_number, beat_strength, duration_in_quarters), where midi_pitch_number = 0 for rests, and the beat_strength is calculated based on metrical accent
  2. The corpus is represented as a list of these tuple-ized notes, and the set version (essentially a dictionary of the note tokens in the Bach suites) forms an n-dimensional space, where n is the number of distinct notes and rests found in the works
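
To make the token representation concrete, here is a minimal sketch using a made-up five-note fragment rather than real data from the Suites. The names toy_corpus, vocab and one_hot are hypothetical and exist only for this illustration; sorting the vocabulary is a choice made here to keep the indices reproducible, not something the project code does.

```python
# Toy fragment: (midi_pitch_number, beat_strength, duration_in_quarters).
# Pitch 0 marks a rest. All values are invented for illustration.
toy_corpus = [
    (43, 4.0, 0.25),
    (50, 1.0, 0.25),
    (59, 2.0, 0.25),
    (0, 1.0, 0.5),
    (43, 4.0, 0.25),  # same token as the first note
]

# The set of distinct tokens is the vocabulary; its size is n.
vocab = sorted(set(toy_corpus))
notes_indices = {tok: i for i, tok in enumerate(vocab)}
indices_notes = {i: tok for tok, i in notes_indices.items()}


def one_hot(token):
    """Return an n-dimensional 0/1 vector with a single 1 at the token's index."""
    vec = [0] * len(vocab)
    vec[notes_indices[token]] = 1
    return vec


encoded = [one_hot(tok) for tok in toy_corpus]
print(len(vocab), encoded[0])
```

Two notes only share a dimension when pitch, beat strength and duration all match, which is why the first and last notes of the fragment encode to the same vector.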

Some code to accompany this step:

from music21 import converter, clef, stream, pitch, note, meter, midi  
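import numpy as np

# Target tonic pitch class for transposition (0 = C). The original value is not
# shown in this excerpt, so 0 is an assumption.
KEY_SIG_OFFSET = 0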


def parse_notes(midi_stream):  
    melody_corpus = []

    last_pitch = 1
    chord_buffer = []
    prev_offset = 0.0
    for m in midi_stream.measures(1, None):
        time_sig = m.timeSignature
        for nr in m.flat.notesAndRests:
            offset_loc = nr.offset
            # pitch = nr.pitch.pitchClass + 1  if isinstance(nr, note.Note) else 0
            pitch = nr.pitch.midi  if isinstance(nr, note.Note) else 0
            beat_strength = round(nr.beatStrength * 4.0, 0)
            duration = float(nr.quarterLength)

            note_repr = (pitch, beat_strength, duration)
            # note_repr = (pitch, duration)
            # Handle chords
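            if nr.offset == prev_offset:
                # Same onset as the previous event: buffer it as a chord tone
                if note_repr[0] > 0:
                    chord_buffer.append(note_repr)
            else:
                if chord_buffer: # Choose tone from chord buffer closest to current note
                    chord_melody_tone = sorted(chord_buffer, key=lambda x: abs(x[0] - pitch))[0]
                    melody_corpus.append(chord_melody_tone)
                # The appends in this branch are reconstructed from context
                melody_corpus.append(note_repr)
                chord_buffer = []
            prev_offset = nr.offset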

    return melody_corpus

def build_corpus(midi_files):  
    melody_corpus = []
    for file in midi_files:
        midi_stream = converter.parse(file)
        midi_stream = midi_stream[0]
        if '1008' in file or '1011' in file:
            midi_stream.keySignature = midi_stream.keySignature.relative
        key_sig = midi_stream.keySignature
        print('Input file: {} ({})'.format(file, str(key_sig)))
        midi_stream.transpose(KEY_SIG_OFFSET - key_sig.tonic.pitchClass, inPlace=True)
        # add this file's notes to the combined corpus
        melody_corpus.extend(parse_notes(midi_stream))
    # map indices for constructing matrix representations
    melody_set = set(melody_corpus)
    notes_indices = {note: i for i, note in enumerate(melody_set)}
    indices_notes = {i: note for i, note in enumerate(melody_set)}

    return melody_corpus, melody_set, notes_indices, indices_notes

Training the model

The model was built using the popular Keras framework on a Theano backend. I trained several incarnations of the model, using different sequence lengths, i.e. the number of notes in a given melodic fragment used to make note predictions. Amazon EC2 g2.2xlarge GPU-equipped instances came in handy here.

import numpy as np  
from keras.models import Sequential, load_model  
from keras.layers.core import Dense, Activation, Dropout  
from keras.layers.recurrent import LSTM  
from keras.callbacks import History, ModelCheckpoint  
from keras.optimizers import RMSprop

from corpus import build_corpus

def train_model(midi_files, save_path, model_path=None, step_size=3, phrase_len=20, layer_size=128, batch_size=128, nb_epoch=1):

    melody_corpus, melody_set, notes_indices, indices_notes = build_corpus(midi_files)

    corpus_size = len(melody_set)

    # cut the corpus into semi-redundant sequences of phrase_len notes
    # step_size = 3
    # phrase_len = 20
    phrases = []
    next_notes = []
    for i in range(0, len(melody_corpus) - phrase_len, step_size):
        phrases.append(melody_corpus[i: i + phrase_len])
        next_notes.append(melody_corpus[i + phrase_len])
    print('nb sequences:', len(phrases))

    # transform data into binary matrices
    X = np.zeros((len(phrases), phrase_len, corpus_size), dtype=bool)
    y = np.zeros((len(phrases), corpus_size), dtype=bool)
    for i, phrase in enumerate(phrases):
        for j, note in enumerate(phrase):
            X[i, j, notes_indices[note]] = 1
        y[i, notes_indices[next_notes[i]]] = 1
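    if model_path is None:
        # Two stacked LSTM layers, each followed by dropout, then a softmax over note tokens
        model = Sequential()
        model.add(LSTM(layer_size, return_sequences=True, input_shape=(phrase_len, corpus_size)))
        model.add(Dropout(0.2))  # dropout rate is an assumption; the original value is not shown
        model.add(LSTM(layer_size, return_sequences=False))
        model.add(Dropout(0.2))
        model.add(Dense(corpus_size))
        model.add(Activation('softmax'))

        model.compile(loss='categorical_crossentropy', optimizer=RMSprop())
    else:
        # Resume training from a previously saved model
        model = load_model(model_path)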

    checkpoint = ModelCheckpoint(filepath=save_path,
        verbose=1, save_best_only=False)
    history = History()
    model.fit(X, y, batch_size=batch_size, nb_epoch=nb_epoch, callbacks=[checkpoint, history])

    return model, melody_corpus, melody_set, notes_indices, indices_notes
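
As a quick illustration of how phrase_len and step_size shape the training set, here is a minimal sketch of the same sliding-window step run on a toy list of integers instead of note tuples. The helper name make_windows is hypothetical; the loop itself mirrors the one in train_model.

```python
def make_windows(corpus, phrase_len, step_size):
    """Cut a sequence into overlapping phrases, each paired with the token that follows it."""
    phrases, next_notes = [], []
    for i in range(0, len(corpus) - phrase_len, step_size):
        phrases.append(corpus[i:i + phrase_len])   # model input
        next_notes.append(corpus[i + phrase_len])  # prediction target
    return phrases, next_notes


toy = list(range(10))  # stand-in for a corpus of ten note tokens
phrases, next_notes = make_windows(toy, phrase_len=4, step_size=3)
print(phrases)     # [[0, 1, 2, 3], [3, 4, 5, 6]]
print(next_notes)  # [4, 7]
```

A smaller step_size yields more, heavily overlapping phrases from the same corpus, while a longer phrase_len gives the model more context per prediction at the cost of fewer examples.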

Generating fresh melodies

After training the models overnight, I hacked together a simple Flask app where you can generate new quasi-Baroque jingles in the browser!

See code for music generation below (borrows heavily from an LSTM text generation example shipped with Keras). The temperature parameter for sampling from the probability vector produced by the final softmax output is very important here - too low a value, and the predictions quickly converge to a single pitch repeated over and over again, whereas too high a value results in basically random outputs with no semblance of melodic contour or rhythmic cohesion. Generally, temperature values between 1.0 and 2.0 seemed to work best, which is to say that slightly smoothing out the respective class probabilities predicted by the model produced interesting, yet structured melodic sequences.
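
To see what the temperature does to a probability vector, here is a minimal sketch in plain Python. The function name apply_temperature and the three-element distribution are made up for illustration; the rescaling is the same log-divide-exponentiate-renormalise step used by the sampling helper in the generation code, with a max subtraction added here for numerical stability.

```python
import math


def apply_temperature(probs, temperature):
    """Rescale probabilities as p ** (1 / temperature), then renormalise.

    Assumes every probability is strictly positive.
    """
    logits = [math.log(p) / temperature for p in probs]
    peak = max(logits)  # subtracting the max avoids overflow in exp
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


probs = [0.7, 0.2, 0.1]  # invented softmax output over three note tokens
sharp = apply_temperature(probs, 0.5)  # favours the top token even more
same = apply_temperature(probs, 1.0)   # leaves the distribution unchanged
flat = apply_temperature(probs, 2.0)   # moves towards uniform

for name, dist in [("T=0.5", sharp), ("T=1.0", same), ("T=2.0", flat)]:
    print(name, [round(p, 3) for p in dist])
```

Values below 1.0 concentrate the probability mass on the most likely token, which matches the repeated-pitch behaviour described above, while values above 1.0 spread it out and give less likely notes a better chance of being sampled.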

import numpy as np  
from music21 import midi, stream, pitch, note, clef, instrument

def __sample(preds, temperature=1.0):  
    # helper function to sample an index from a probability array
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

def __predict(model, x, indices_notes, temperature):  
    preds = model.predict(x, verbose=0)[0]
    next_index = __sample(preds, temperature)
    next_val = indices_notes[next_index]

    return next_val

def generate_sequence(model, seq_len, melody_corpus, melody_set, phrase_len, notes_indices, indices_notes, temperature):  
    gen_melody_indices = np.zeros((1, phrase_len, len(melody_set)))
    start_pos = np.random.randint(0, len(melody_corpus) - phrase_len)
    seed_phrase = melody_corpus[start_pos : start_pos + phrase_len]
    gen_melody = seed_phrase

    for _ in range(seq_len):
        seed_phrase = gen_melody[-phrase_len:]
        gen_melody_indices[:] = 0  # clear the previous one-hot encoding
        for i, note in enumerate(seed_phrase):
            gen_melody_indices[0, i, notes_indices[note]] = 1
        x = gen_melody_indices
        next_note = __predict(model, x, indices_notes, temperature)
        gen_melody.append(next_note)  # extend the melody so the next window includes it

#     gen_melody = [indices_notes[i] for i in gen_melody_indices]
    return gen_melody

def play_melody(gen_melody):  
    v = stream.Voice()
    last_note_duration = 0
    for n in gen_melody:
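        if n[0] == 0:
            new_note = note.Rest()
        else:
            new_pitch = pitch.Pitch()
            # new_pitch.midi = 59.0 + n[0] - 24
            new_pitch.midi = n[0]
            new_note = note.Note(new_pitch)
        new_note.offset = v.highestOffset + last_note_duration
        new_note.duration.quarterLength = n[2]
        last_note_duration = new_note.duration.quarterLength
        v.insert(new_note)  # assumed: place the note in the voice at its offset
    s = stream.Stream()
    part = stream.Part()
    part.clef = clef.BassClef()
    # Assumed assembly steps: cello instrument, then voice -> part -> score
    part.append(instrument.Violoncello())
    part.insert(v)
    s.insert(part)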

    return s

Lessons learned

As expected, this simple 2-layer LSTM model doesn't come close to a real composition model. A more sophisticated network topology, trained on a larger corpus of compositions, would likely perform better; for the purposes of this project, I imposed the restriction of supplying the model with only the six Cello Suites (36 movements in total). This limitation is compounded by the fact that the harmonies in these works are largely implied rather than fully realized, meaning that most of the chord tones for each harmony are not played, and the listener's ear is left to fill in the gaps based on their experience with Baroque harmonic progressions.

So in a sense, this is a special sort of cold start problem, where not all of the required information (harmony) is available to the model to begin with! It would be interesting to redo this exercise on a model pretrained with Baroque harmonic progressions and counterpoint, to see if the output becomes more plausible. Others in the space have created novel network structures like "biaxial" networks or Clockwork RNNs, which present inspiration for future projects.

Naoya Kanai

Data scientist, cellist, ex-consultant, tech person.

San Francisco