How to Train Your ML Models: Music Generation (Try it out!)

Yuehan
15 min read · Mar 30, 2024


LSTM Generated Sample Track shown in MIDI

In the lifecycle of machine learning (ML), training stands as the crucial phase where models learn from data to discern patterns and make predictions or generate new outputs, akin to an apprentice mastering a craft through practice. Particularly in music generation, training enables models to absorb the essence of musicality from vast datasets, empowering them to compose new pieces that echo the complexity and emotion of human-created music.

Quick Read (<5 min): 4 Steps of ML Model Training

Hooray! You’ve just built a machine learning (ML) model, like a budding musician. Now, you want this instrument to create beautiful music, but first, you must teach it how. Just as a musician needs to practice with various pieces, understand their mistakes, and gradually improve, our ML model follows a similar path to learn from data and create new music. This process, known as training, unfolds in a few critical steps, ensuring our digital composer not only understands the basics of music but also learns to innovate within it.

Step 1: Feeding the Music (Data)

Just as a musician learns by studying various pieces, our model learns by examining a collection of music. We need to feed it a diverse array of music data, from classical to jazz, helping it understand the wide spectrum of musical patterns and structures. This step usually includes cleaning the data, handling missing values, normalizing or standardizing data, and possibly feature engineering.
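
As a concrete (if simplified) illustration of this step, the sketch below loads one audio file and turns it into a normalized mel spectrogram. It assumes the librosa library is available and uses a hypothetical file path; the actual notebooks handle batching and dataset-specific details.

# A minimal preprocessing sketch (assumes librosa is installed; "song.wav" is a placeholder path)
import librosa
import numpy as np

y, sr = librosa.load("song.wav", sr=22050)                    # raw waveform and sample rate
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)  # frequency-vs-time representation
mel_db = librosa.power_to_db(mel, ref=np.max)                 # convert power to decibels
mel_norm = (mel_db - mel_db.min()) / (mel_db.max() - mel_db.min())  # scale to [0, 1]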

Step 2: Tuning for Improvement (Defining the Loss Function)

We need to give our musicians a way to recognize mistakes — this is where the “loss function” comes into play. Think of it as a music teacher who listens attentively, pointing out the difference between the musician’s performance and the original masterpiece. The goal is to minimize these differences, making our musician’s compositions as close to the original as possible. Depending on the specific task (e.g., regression, classification), the choice of loss function can significantly affect performance.

Step 3: Finding the Rhythm (Optimizers)

With feedback in hand, the model needs an “optimizer” to fine-tune its internal parameters, much like adjusting the strings of a guitar to get the perfect pitch. The optimizer takes the feedback from our music teacher (the loss function) and decides how to change the model’s approach, ensuring better performance in the next practice session. The choice of optimizer (e.g., SGD, Adam) and its configuration (learning rate, momentum) can greatly impact the efficiency and effectiveness of training.

Step 4: Practice Makes Perfect (Epochs)

Finally, our musician practices repeatedly, going through the same pieces (data) multiple times. Each complete practice session is called an “epoch.” With each epoch, the model refines its ability to generate music, gradually improving its compositions to create new and beautiful pieces.

In summary, training an ML model for music generation involves a cyclical process of learning from data, receiving feedback through the loss function, making adjustments with the help of an optimizer, and iterating this process across many epochs. Through this meticulous training regimen, the model learns to understand and replicate the complex patterns of music, eventually gaining the ability to compose new melodies that resonate with the richness of human-created music.

Deeper Dive: A Closer Look at ML Training Through 2 Examples

Let’s unpack loss functions, optimizers, and the overall training cycle in more depth to see how they contribute to teaching a model to create music. We’ll explore the ML training process by comparing and contrasting two models, break down the code and math behind them, and show you the results. Feel free to make a copy of these codebooks and run your own experiments side by side as you read through.

1st Model: Training a Simple VAE

Variational Autoencoders (VAEs) stand at the frontier of generating new content, including music. They are a type of generative model that not only learns to compress data into a compact representation (encoding) but also to reconstruct data from this representation (decoding), with the added twist of introducing variability. I encourage you to check out this resource to learn more about VAE’s structure, which will help you better understand our discussion below.

# Create the VAE
vae_input = Input(shape=(n_mels, time_frames))    # mel spectrogram fed to the encoder
vae_target = Input(shape=(n_mels, time_frames))   # the same spectrogram, used as the reconstruction target
encoder_output = encoder(vae_input)               # [z_mean, z_log_var, z]
decoder_output = decoder(encoder_output[2])       # decode from the sampled latent vector z
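
The encoder and decoder themselves are defined earlier in the notebook. For readers following along without it, the sketch below shows one plausible wiring that is consistent with the indexing above (the encoder returns [z_mean, z_log_var, z], and the decoder maps a latent vector back to a spectrogram); the layer sizes and latent_dim are illustrative assumptions, not the notebook’s exact values.

# Illustrative encoder/decoder (layer widths and latent_dim are assumptions)
from tensorflow.keras.layers import Input, Flatten, Dense, Lambda, Reshape
from tensorflow.keras.models import Model
from tensorflow.keras import backend as K

latent_dim = 16

def sampling(args):
    z_mean, z_log_var = args
    eps = K.random_normal(shape=K.shape(z_mean))
    return z_mean + K.exp(0.5 * z_log_var) * eps              # reparameterization trick

enc_in = Input(shape=(n_mels, time_frames))
h = Dense(256, activation='relu')(Flatten()(enc_in))
z_mean = Dense(latent_dim)(h)
z_log_var = Dense(latent_dim)(h)
z = Lambda(sampling)([z_mean, z_log_var])
encoder = Model(enc_in, [z_mean, z_log_var, z])               # index 2 is the sampled latent vector

dec_in = Input(shape=(latent_dim,))
h_dec = Dense(256, activation='relu')(dec_in)
dec_out = Dense(n_mels * time_frames, activation='sigmoid')(h_dec)
decoder = Model(dec_in, Reshape((n_mels, time_frames))(dec_out))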

The Loss Function

The VAE’s ability to generate music hinges on its loss function. Here, we use a composite of two components: the reconstruction loss and the Kullback-Leibler (KL) divergence loss.

# Define the VAE loss function (uses the Keras backend, imported as K)
def vae_loss_function(args):
    y_true, y_pred, z_mean, z_log_var = args
    reconstruction_loss = K.mean(K.square(y_true - y_pred))  # how far the reconstruction is from the input
    kl_loss = -0.5 * K.mean(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var))  # how far the latent space is from a standard normal
    return reconstruction_loss + kl_loss

Reconstruction Loss

At its core, the Reconstruction Loss in a VAE measures how well the model is able to reproduce the input data after it has been encoded into a lower-dimensional latent space and then decoded back into its original space. This process is akin to translating a sentence into another language and then back again, assessing how much of the original meaning is preserved.

For our VAE, the input data is a mel spectrogram, a visual representation of the spectrum of frequencies of a sound signal as it varies with time. The Reconstruction Loss calculates the difference between the original mel spectrogram (y_true) and the mel spectrogram produced by the model (y_pred). This difference is quantified using the Mean Squared Error (MSE), which averages the squared differences between the actual and predicted values across all elements of the spectrogram. Mathematically, it's defined as:

reconstruction_loss = (1/n) · Σ (y_true,i − y_pred,i)²

where n is the total number of elements in the mel spectrogram. This loss function effectively captures the model’s accuracy in reproducing the original music piece, emphasizing the fidelity of reconstruction.
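
As a quick sanity check, here is the same calculation on a tiny 2×2 “spectrogram” (the numbers are made up for illustration):

# Toy reconstruction-loss check: mean of the element-wise squared differences
import numpy as np

y_true = np.array([[0.2, 0.8], [0.5, 0.1]])   # "original" patch
y_pred = np.array([[0.3, 0.6], [0.5, 0.3]])   # "reconstruction"
print(np.mean(np.square(y_true - y_pred)))    # 0.0225, same as K.mean(K.square(...)) above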

KL Divergence Loss

The KL Divergence Loss, on the other hand, is specific to the ‘variational’ aspect of VAEs. It quantifies how much the encoded representations (latent variables) of the input data diverge from a predefined distribution, typically a standard normal distribution. The purpose of this component is to regularize the latent space, ensuring that the encoded representations do not stray too far from a standard distribution. This regularization makes it easier to sample from the latent space when generating new data, as it ensures a degree of smoothness and continuity in the latent space.

The mathematical expression for the KL divergence loss, given the mean (μ) and variance (σ²) of the encoded representations, is:

kl_loss = −0.5 · mean(1 + log(σ²) − μ² − σ²)

where the mean is taken over the latent dimensions, matching the K.mean in the code. In practice, the encoder outputs log(σ²) directly (z_log_var), which keeps the calculation numerically stable by avoiding taking the logarithm of a value that could be zero.

This formula essentially measures the difference between the encoder’s distribution over latent variables, given the input data, and a target distribution (in this case, a standard normal distribution). By minimizing this divergence, the VAE encourages the encoded latent variables to adhere to a predictable structure, facilitating the generation of coherent and novel outputs.
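
To see the regularizing effect, you can evaluate the KL term for a small made-up latent vector; the values below are purely illustrative:

# Toy KL-divergence check: the penalty grows as μ drifts from 0 or σ² drifts from 1
import numpy as np

z_mean = np.array([0.0, 0.0])
z_log_var = np.array([0.0, 0.0])   # σ² = 1, i.e. exactly a standard normal
print(-0.5 * np.mean(1 + z_log_var - np.square(z_mean) - np.exp(z_log_var)))  # 0.0

z_mean = np.array([1.5, -2.0])
z_log_var = np.array([1.0, -1.0])  # far from a standard normal
print(-0.5 * np.mean(1 + z_log_var - np.square(z_mean) - np.exp(z_log_var)))  # noticeably larger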

Integrating Reconstruction and KL Divergence Losses

# Add the loss function to the model using a Lambda layer
loss = Lambda(vae_loss_function, output_shape=(1,), name='loss')([vae_target, decoder_output, encoder_output[0], encoder_output[1]])
vae = Model(inputs=[vae_input, vae_target], outputs=[decoder_output, loss])

In practice, training a VAE involves minimizing a weighted sum of the Reconstruction Loss and the KL Divergence Loss. Balancing these two losses is critical: emphasizing the Reconstruction Loss can lead to overfitting, where the model learns to replicate the training data too closely without capturing its underlying structure. Overemphasizing the KL Divergence Loss, conversely, may result in a model that generates new data points that are too generic, lacking the distinctive features of the input data.
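
A common way to expose this trade-off explicitly (not what the notebook above does, but a frequently used variant in the style of β-VAE) is to put a weight on the KL term; beta here is an assumed hyperparameter you would tune, and the Keras backend is assumed to be imported as K, as in vae_loss_function above.

# Variant of the loss with an explicit weight on the KL term (beta is an assumed tuning knob)
def weighted_vae_loss(args, beta=0.5):
    y_true, y_pred, z_mean, z_log_var = args
    reconstruction_loss = K.mean(K.square(y_true - y_pred))
    kl_loss = -0.5 * K.mean(1 + z_log_var - K.square(z_mean) - K.exp(z_log_var))
    return reconstruction_loss + beta * kl_loss   # beta < 1 favors reconstruction, beta > 1 favors regularization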

Optimization with Adam

# Compile the VAE
vae.compile(optimizer=Adam(learning_rate=0.001), loss=['mse', None])

In machine learning, optimization refers to the process of adjusting a model’s parameters to minimize the error it makes, quantified by a loss function. For VAEs, this involves finding the optimal set of weights that reduce the composite loss, which includes both reconstruction loss and KL divergence loss.

The Adam optimizer is a sophisticated algorithm for first-order gradient-based optimization of stochastic objective functions. It’s an extension of stochastic gradient descent (SGD) that has several key advantages:

  • Adaptive Learning Rates: Adam automatically adjusts the learning rate for each parameter based on the estimates of the first (mean) and second (uncentered variance) moments of the gradients. This adaptability makes Adam particularly effective in situations with sparse gradients or noisy data, common in complex models like VAEs.
  • Bias Correction: Adam also implements bias corrections to the first and second moment estimates to account for their initialization at the origin. This helps stabilize the early stages of the optimization, leading to a more consistent and reliable convergence.

Mathematically, the Adam optimizer updates each parameter θ based on the gradients g with respect to the loss function L, using the following update rule:

θ_t+1 = θ_t − η · m_hat_t / (√(v_hat_t) + ϵ)

where η is the learning rate, m_hat and v_hat are bias-corrected estimates of the first and second moments of the gradients, respectively, and ϵ is a small scalar added to improve numerical stability.
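
For intuition, here is a minimal NumPy sketch of a single Adam step for one parameter, using the usual default β values; it mirrors the update rule above rather than Keras’s exact internals.

# One Adam update step (illustrative; defaults beta1=0.9, beta2=0.999, eps=1e-7 as in Keras)
import numpy as np

def adam_step(theta, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-7):
    m = beta1 * m + (1 - beta1) * g        # running mean of gradients (first moment)
    v = beta2 * v + (1 - beta2) * g ** 2   # running mean of squared gradients (second moment)
    m_hat = m / (1 - beta1 ** t)           # bias correction for initialization at zero
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v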

The Role of Epochs

# Train the VAE
vae.fit([audio_data, audio_data], [audio_data, np.zeros_like(audio_data)], epochs=epochs, batch_size=batch_size)
First 5 Epochs (output of the above cell)

An epoch in machine learning training is a single pass through the entire training dataset. It represents one iteration of the model’s learning process, where it has the opportunity to adjust its weights based on the gradient of the loss function computed from the entire dataset.

In the context of VAE training, epochs serve several crucial purposes:

  • Iterative Improvement: With each epoch, the VAE gets an opportunity to refine its weights, gradually improving its ability to reconstruct the input data while ensuring the latent space remains well-structured (as enforced by the KL divergence loss). This iterative process is key to the model’s learning and generalization capabilities.
  • Balance Between Reconstruction and Regularization: Across epochs, the VAE learns to find a delicate balance between accurately recreating the input music and ensuring the variability necessary for generating new, novel music pieces. This balance is crucial for the model’s creative capabilities.
  • Monitoring and Early Stopping: By evaluating the model’s performance at the end of each epoch (e.g., using a validation set), we can monitor its learning progress. If the model’s performance on the validation set starts to degrade, indicating overfitting, training can be stopped early to prevent this issue.
Last 5 Epochs

As you can see, after 100 Epochs, our loss decreased from 0.0059 to 0.0031.
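
If you want to watch this curve yourself, fit returns a History object whose loss you can capture and plot; a minimal sketch, assuming matplotlib is available:

# Capture and plot the per-epoch loss (sketch; assumes matplotlib is installed)
import matplotlib.pyplot as plt

history = vae.fit([audio_data, audio_data],
                  [audio_data, np.zeros_like(audio_data)],
                  epochs=epochs, batch_size=batch_size)

plt.plot(history.history['loss'])   # total loss per epoch
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.show()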

Check our Output

After our model is trained, it can now generate new data (music). Below is a sample output from our simple VAE, after we applied the training steps above.

Sample Output from Simple VAE

Visually, we can see that the original spectrogram displays a richer set of features with more pronounced and varied frequency activations, indicative of the complexity of the original music piece. The reconstructed spectrogram has less definition. The frequency activations are more diffused, and the sharpness of specific patterns in the original is somewhat lost.

Despite these differences, the VAE has done a commendable job in approximating the overall distribution of frequencies and time frames. The general pattern of high-energy areas (the yellow segments) correlates well with the original, suggesting that the model is learning the fundamental components of the music. However, the fine details that would contribute to the distinctiveness of the music might still be missing.

2nd Model: Training an LSTM

This LSTM (Long Short-Term Memory) model extends the TensorFlow music generation tutorial. With a similar model structure, we use the newest MAESTRO dataset (v3.0) to include a richer set of data features during training. Moreover, we added 3 note features to train and predict (6 in total), which are then used to generate notes with more complexity. There are also more visualizations in the codebook to help readers understand ML concepts.

Model Construction

Our LSTM-based model is designed to predict 6 features of a musical note from a given sequence.

num_features = 6  # The number of features per timestep
input_shape = (seq_length, num_features)

inputs = tf.keras.Input(input_shape)
x = tf.keras.layers.LSTM(128)(inputs)

outputs = {
    'pitch': tf.keras.layers.Dense(128, name='pitch')(x),
    'step': tf.keras.layers.Dense(1, name='step')(x),
    'duration': tf.keras.layers.Dense(1, name='duration')(x),
    'interval': tf.keras.layers.Dense(1, name='interval')(x),
    'velocity': tf.keras.layers.Dense(1, name='velocity')(x),
    'polyphony': tf.keras.layers.Dense(13, activation='softmax', name='polyphony')(x),
}

model = tf.keras.Model(inputs, outputs)

Here, input_shape determines the shape of the input sequences that our LSTM will process. The LSTM layer is configured with 128 units, a figure that defines the dimensionality of the output space and, consequently, the complexity of the model's learning capabilities (Check out this resource to learn more about LSTM’s structure). After processing the sequence through the LSTM layer, we have a series of Dense layers for each feature we're predicting.

The outputs dictionary contains a separate Dense layer for each of the note attributes we want to predict. Each Dense layer has its own activation function and a number of units tailored to the specific prediction it is making. Notably, the polyphony attribute uses a softmax activation function, since we're treating it as a classification problem over 13 classes, matching the 13 units of the polyphony head and covering the possible counts of simultaneously played notes.

The Loss Functions

Our model has 6 features to predict with each note: pitch, step, duration, interval, velocity, and polyphony. Therefore, you will see the application of more than one loss function at work. In compiling the model, we use a custom dictionary of loss functions tailored to each output:

loss = {
    'pitch': tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    'step': mse_with_positive_pressure,
    'duration': mse_with_positive_pressure,
    'interval': mse_with_positive_pressure,
    'velocity': mse_with_positive_pressure,
    # the polyphony head applies softmax, so it outputs probabilities rather than raw logits
    'polyphony': tf.keras.losses.SparseCategoricalCrossentropy(from_logits=False),
}

For pitch and polyphony, we're using sparse categorical cross entropy, a loss function used for classification tasks where the classes are mutually exclusive, meaning each data point belongs strictly to one class out of many possible classes.

For a given prediction, the Sparse Categorical Crossentropy computes the loss using the following equation:

loss = −log(p_true)

where p_true is the probability the model assigns to the true class label (after applying softmax to the logits). The log function penalizes confident incorrect predictions with a large loss, while predictions that place a probability close to 1 on the correct class yield a loss close to zero; hence the loss is lower when the predicted probability distribution is close to the true label.
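
A tiny example makes this concrete; the logits and label below are made up for illustration:

# Toy sparse categorical cross entropy check on a 4-class problem
import tensorflow as tf

scce = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
confident = [[0.1, 0.2, 3.0, -1.0]]   # logits favoring class 2
wrong = [[3.0, 0.2, 0.1, -1.0]]       # logits favoring class 0
print(float(scce([2], confident)))    # small loss: high probability on the true class
print(float(scce([2], wrong)))        # large loss: the true class gets low probability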

For the other features, we're using our custom mse_with_positive_pressure to handle regression tasks.

def mse_with_positive_pressure(y_true: tf.Tensor, y_pred: tf.Tensor):
    mse = (y_true - y_pred) ** 2                        # element-wise squared error
    positive_pressure = 10 * tf.maximum(-y_pred, 0.0)   # penalty only when the prediction is negative
    return tf.reduce_mean(mse + positive_pressure)

This function computes the mean squared error (MSE), the same quantity we introduced in the VAE model above. However, it adds a twist: a “positive pressure” term, defined by:

positive_pressure_i = 10 · max(−y_pred,i, 0)

This term applies a penalty only when the predicted value is negative: the max function evaluates to −y_pred,i when the prediction is negative, and to zero otherwise, so positive predictions are not penalized. The rationale is to discourage the model from undervaluing certain features, such as velocity or duration, which should never be negative in the context of music generation.

The final loss for each sample is the sum of the standard MSE and the positive pressure penalty. To obtain the overall loss, we take the mean across all elements:

loss = mean( (y_true,i − y_pred,i)² + 10 · max(−y_pred,i, 0) )

This custom loss function guides the model toward predictions that are both accurate (minimizing MSE) and physically plausible (avoiding negative values for certain features).
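
A quick sanity check shows how the penalty kicks in; the numbers are illustrative:

# The same target, once with a positive prediction and once with a negative one
import tensorflow as tf

y_true = tf.constant([0.5, 0.5])
print(float(mse_with_positive_pressure(y_true, tf.constant([0.4, 0.6]))))   # ≈ 0.01: plain MSE only
print(float(mse_with_positive_pressure(y_true, tf.constant([-0.4, 0.6]))))  # ≈ 2.41: MSE plus the positive-pressure penalty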

The Optimizer

learning_rate = 0.005
optimizer = tf.keras.optimizers.Adam(learning_rate=learning_rate)

Our optimizer is still Adam, introduced above. We choose a learning rate of 0.005 to try to balance between sufficiently fast convergence and the risk of overshooting the minimum loss. Generally, we start with a moderate learning rate and adjust based on the rate of improvement in the model’s performance during training.
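
If a fixed rate turns out to be too aggressive, one option (not used in this notebook) is to decay it over time; the schedule parameters below are illustrative assumptions.

# Optional: an exponentially decaying learning rate instead of a fixed one
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.005,
    decay_steps=1000,   # decay every 1000 training steps (assumed value)
    decay_rate=0.9)     # multiply the rate by 0.9 at each decay (assumed value)
optimizer = tf.keras.optimizers.Adam(learning_rate=lr_schedule)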

Loss Weights

Since we have 6 features to predict and used loss functions on all of them, there will also be 6 losses for each feature.

Initial Loss for all Features

As you can see, some features (duration, step) already have very small losses, but some features have huge losses (interval, velocity). To prevent these huge losses from hijacking the training, we need to assign weights to each feature’s loss to ensure that each feature contributes appropriately to the learning process.

model.compile(
    loss=loss,
    loss_weights={
        'pitch': 1.0,
        'step': 1.0,
        'duration': 1.0,
        'interval': 0.5,  # larger loss, but not as large as velocity; moderate weight
        'velocity': 0.1,  # very large loss; significantly reduce weight
        'polyphony': 1.0,
    },
    optimizer=optimizer,
)

After we assigned the weights, the total loss went from 4537 to 555.

Callbacks

In training, the ModelCheckpoint and EarlyStopping callbacks are employed. They are safety nets to save the model at its best state and to prevent overfitting by stopping training if the loss doesn’t improve for a certain number of epochs.

callbacks = [
    tf.keras.callbacks.ModelCheckpoint(
        filepath='./training_checkpoints/ckpt_{epoch}',
        save_weights_only=True),
    tf.keras.callbacks.EarlyStopping(
        monitor='loss',
        patience=5,
        verbose=1,
        restore_best_weights=True),
]

The ModelCheckpoint callback systematically saves the model at specific intervals (or when it surpasses previous performance benchmarks), allowing us to recover and deploy the best-performing iteration without the risk of losing progress due to a potential interruption or overfitting as training progresses.

The EarlyStopping callback monitors a chosen performance metric (typically validation loss). If it sees no improvement for a predefined number of epochs ('patience' parameter), it will halt the training process.

Fit the Model, Finally

%%time
epochs = 50

history = model.fit(
    train_ds,
    epochs=epochs,
    callbacks=callbacks,
)

We used 50 epochs to train the model. You can see how each feature’s loss decreases over the epochs.
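
To visualize those per-feature curves yourself, you can plot the per-output losses recorded in the History object; Keras names them "<output>_loss" for multi-output models. A sketch, assuming matplotlib is available:

# Plot each feature's loss curve from the training history
import matplotlib.pyplot as plt

for name in ['pitch', 'step', 'duration', 'interval', 'velocity', 'polyphony']:
    plt.plot(history.history[f'{name}_loss'], label=name)
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.show()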

Generating Notes

The generation process begins by normalizing the features of an initial seed sequence drawn from the learned distribution of musical data. At each iteration, the model predicts the characteristics of the next note, adjusts the pitch based on the predicted interval from the previous note, and applies variations to create polyphonic textures, all while keeping notes within a specified scale to maintain a coherent musical structure.

I once again encourage you to engage with the code and experiment with different parameters like polyphony_mode, and observe the creative output.
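
For orientation before you open the notebook, here is a heavily simplified sketch of that autoregressive loop. The feature handling (normalization, interval-based pitch adjustment, scale constraints, polyphony_mode) is reduced to its bare bones, so treat this as an outline rather than the notebook's actual generation code.

# Simplified autoregressive generation loop (an outline, not the notebook's exact logic)
import numpy as np

def generate_notes(model, seed_sequence, num_notes=100):
    generated = []
    window = np.array(seed_sequence, dtype=np.float32)   # shape: (seq_length, num_features)
    for _ in range(num_notes):
        preds = model.predict(window[np.newaxis, ...], verbose=0)
        pitch = int(np.argmax(preds['pitch'][0]))        # most likely pitch
        step = max(float(preds['step'][0, 0]), 0.0)      # clamp time features to be non-negative
        duration = max(float(preds['duration'][0, 0]), 0.0)
        interval = float(preds['interval'][0, 0])
        velocity = float(preds['velocity'][0, 0])
        polyphony = int(np.argmax(preds['polyphony'][0]))
        note = np.array([pitch, step, duration, interval, velocity, polyphony], dtype=np.float32)
        generated.append(note)
        window = np.vstack([window[1:], note])           # slide the input window forward
    return generated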

Original Track shown in MIDI
LSTM Generated Sample Track shown in MIDI

The graph above is the original data, and the graph below is a sample output of our LSTM model after training. Comparing the two, we can see that the generated track captures the general structure of music in terms of note distribution across time and pitch.

The density of the notes and their spread across the pitch range suggests that the model has learned to produce a polyphonic texture. However, the original piece shows a more diverse and structured use of polyphony, with a clear temporal distribution of notes that the generated sample doesn’t quite match. The LSTM’s output is more uniform in its polyphony and less nuanced in timing, indicating areas for further refinement.

In summary, the training process, which included the preparation of sequences, the choice of loss functions and optimizers, and the optimization strategy, resulted in a model capable of generating new music with a reasonable degree of complexity.

Next Steps: Evaluation and Refinement

We’ve taken a journey through the intricate process of training machine learning models to compose music, starting with preparing the data, defining custom loss functions for nuanced learning, and progressing through training cycles with intelligent callbacks. Our LSTM and VAE models have learned from the rich patterns of music datasets, and now they’re ready to create and perform their own melodies. It’s a blend of art and science, where each iteration brings us closer to the symphony of data-driven composition.

The next critical steps involve evaluation and refinement. Evaluation entails testing the model’s performance using unseen data, and assessing its ability to generate music that is both novel and harmonically pleasing. This step often uncovers new insights into the model’s strengths and potential areas for improvement. Refinement might include tweaking model parameters, incorporating additional training data, or even redesigning the model’s architecture based on evaluation outcomes.

Ultimately, deploying the model into a real-world application or further research requires careful consideration of user feedback and continuous iterations to enhance the model’s creative capabilities. This cyclical process of training, evaluation, refinement, and deployment underscores the dynamic and iterative nature of machine learning projects, driving innovations and improvements in the field of music generation and beyond.

References:
TensorFlow. (n.d.). Generate music with an RNN. Retrieved from https://www.tensorflow.org/tutorials/audio/music_generation
