Introduction to Variational Autoencoders (VAEs) in AI Music Generation
Variational Autoencoders (VAEs) play a central role in AI music generation. VAEs are a class of generative models in machine learning that excel at creating new data similar to their training set. In the context of music, they learn musical patterns and generate new music that mirrors those styles with unique variations. In this first post, we'll break down the VAE's structure and how it enables music creation (with a codebook you can try out!), then look at one of the newest applications of VAEs: AI remixing.
Quick Read (<5min): Understanding the Basics of VAEs
To understand variational autoencoders, let's first look at their predecessors, autoencoders. An autoencoder is a type of artificial neural network used to learn efficient representations of data. It has two main components: an encoder that compresses the input into a lower-dimensional latent space, and a decoder that reconstructs the input from this compressed representation. The aim is to capture the most important features of the data in the latent space.
If you’re confounded by jargon like “efficient representation of data” or “latent space” here’s an analogy: Imagine an autoencoder as an artist who sketches a detailed scene (your data) by capturing only its essential elements. This sketch is an efficient representation of the data, and the canvas that holds this sketch would be the latent space. The decoder, like a second artist, then uses this sketch to recreate a version of the original scene, interpreting the outlines and shapes to reconstruct the data.
Variational Autoencoders (VAEs) build upon the foundation of standard autoencoders. VAEs also have encoders and decoders, but introduce a key difference: they encode data as probabilistic distributions in the latent space rather than as fixed points. A latent vector is then sampled from these distributions and passed to the decoder, which produces outputs that are similar to the input data yet uniquely distinct; this is what paves the way for data generation.
In the artist analogy, if a standard autoencoder is an artist precisely sketching a scene, a VAE is like an artist who sketches multiple interpretations of the same scene. When generating new data, the decoder acts like a second artist who selects elements from these varied sketches to create an entirely new one.
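For readers who want the math behind this (it isn't spelled out in the analogy, but it is the standard VAE formulation): the encoder outputs the parameters of a Gaussian for each input, a latent vector is sampled from that Gaussian via the reparameterization trick, and training balances reconstruction quality against a KL term that keeps those Gaussians close to a standard normal prior.

q(z \mid x) = \mathcal{N}\big(z;\, \mu(x),\, \sigma^2(x) I\big), \qquad z = \mu(x) + \sigma(x) \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

\mathcal{L} = \mathbb{E}_{q(z \mid x)}\big[\lVert x - \hat{x}(z) \rVert^2\big] + D_{\mathrm{KL}}\big(q(z \mid x)\,\Vert\,\mathcal{N}(0, I)\big)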
Deeper Dive: A Breakdown of a VAE's Structure Through a Basic Piano Music Generator (~10min)
In a music VAE, the encoder network uses several layers to reduce the data down to a latent space. Convolutional Neural Network (CNN) layers are commonly used for spectrogram inputs, exploiting their ability to capture spatial hierarchies in data. For MIDI or piano-roll data, Recurrent Neural Network (RNN) or Transformer layers might be used instead to capture the temporal dynamics of music. The decoder network then mirrors this architecture in reverse, transforming the condensed latent representations back into coherent musical outputs.
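As a rough illustration of the spectrogram case (this is a generic sketch with placeholder layer sizes, not code from any particular project; n_mels, time_frames, and latent_dim are assumed to be defined as in the project code below), a convolutional encoder might look like this in Keras:

#Hypothetical CNN-based encoder for mel-spectrogram input (layer sizes are placeholders)
from tensorflow.keras.layers import Input, Conv2D, Flatten, Dense

spec_input = Input(shape=(n_mels, time_frames, 1))   #spectrogram treated as a 1-channel image
x = Conv2D(16, kernel_size=3, strides=2, padding="same", activation="relu")(spec_input)
x = Conv2D(32, kernel_size=3, strides=2, padding="same", activation="relu")(x)
x = Flatten()(x)
z_mean = Dense(latent_dim)(x)      #mean of the latent distribution
z_log_var = Dense(latent_dim)(x)   #log-variance of the latent distribution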
To see VAEs in action, we'll explore them through my piano music generator project, which uses simple Dense layers. Here, I converted the music from an audio waveform into a mel-spectrogram, an image-like representation of the audio signal, which the VAE receives as input and produces as output.
The link below goes to the complete project workbook in case you want to learn about other aspects in detail (e.g., data processing, training, performance). Right now, we'll focus on the structure of the VAE.
# Encoder
#(assuming the standard tensorflow.keras imports for this project)
from tensorflow.keras.layers import Input, Flatten, Dense

#1st layer: Input, takes in the mel-spectrogram
encoder_input = Input(shape=(n_mels, time_frames))
#2nd layer: Flatten, turns the 2D spectrogram into a 1D tensor
encoder_flatten = Flatten()(encoder_input)
#3rd layer: Dense, outputs the mean of the latent distribution
z_mean = Dense(latent_dim)(encoder_flatten)
#4th layer: Dense, outputs the log-variance of the latent distribution
z_log_var = Dense(latent_dim)(encoder_flatten)
The above is our encoder! It only has four layers. Simple, isn't it? It starts with the Input layer, which takes in our mel-spectrogram data (the frequency breakdown of a piano piece over time). The Flatten layer then transforms the 2D mel-spectrogram into a 1D tensor, preparing it for dense-layer processing. Finally, two Dense layers come into play: one produces the mean and the other the log-variance of the latent distribution (working with the log of the variance is a common, numerically stable convention, which is why the variable is named z_log_var).
The Dense layers are crucial: they produce the parameters of the latent space's probability distribution, allowing our VAE to capture the essence of the music. The intermediate layers between the input and the latent-space parameter layers in the diagram above could represent Dense layers. Each neuron in a Dense layer is fully connected to all neurons in the previous and subsequent layers, so each neuron can learn complex patterns and relationships from the input data: its output is a function of all the outputs of the neurons in the layer before it.
The latent space is where we've reduced our high-dimensional music data into a lower-dimensional, abstract representation. The two graphs above show the same latent space structured in two different ways: the left graph plots the encoded representations relative to one another, while the right graph plots them around the origin (0,0), making it easier to see how the data points cluster as the VAE interprets various musical features.
Each point in this space represents the encoded mean and variance of an audio sample from our dataset. Put simply, the encoder turned each mel-spectrogram you saw above into a single point in 2D space. The points in this space are not randomly scattered; they are arranged such that similar data points (in terms of musical features like pitch or rhythm captured in the spectrograms) are located closer together, while dissimilar ones are further apart. This proximity allows the decoder to generate music that is coherent and maintains the stylistic elements of the input data.
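If you'd like to reproduce this kind of plot yourself, here is a minimal sketch. It assumes the encoder layers above are wrapped into a Keras Model and that mel_spectrograms is a preprocessed batch of inputs; both of those names are placeholders, not code from the original workbook.

#Hypothetical latent-space plot (names and details are placeholders)
import matplotlib.pyplot as plt
from tensorflow.keras.models import Model

encoder = Model(encoder_input, [z_mean, z_log_var])   #wrap the encoder layers into a model
means, log_vars = encoder.predict(mel_spectrograms)   #encode a batch of mel-spectrograms
plt.scatter(means[:, 0], means[:, 1])                 #each dot is one audio sample's encoded mean
plt.xlabel("latent dimension 1")
plt.ylabel("latent dimension 2")
plt.show()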
You might wonder: where is the variance plotted? It is in fact plotted around each dot, but it is too small to see at this scale. Below is a 2D Gaussian density plot of the variance around the rightmost data point in the graph above, once we zoom in.
Relating this back to the artist analogy: the Gaussian distribution is like the set of varied sketches that the artist creates for the same scene. This sampling process is pivotal to the generative ability of VAEs; because points are probabilistically sampled from the region defined by the encoder's learned mean and variance, the model inherently incorporates variability, enabling it to create new, diverse content from the latent space.
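One detail worth making explicit: the encoder code above only produces z_mean and z_log_var, and the sampling step itself is not shown. Below is a minimal sketch of how that step, commonly called the reparameterization trick, is usually written in Keras; it is my own illustration rather than code from the original workbook.

#Sampling layer: draws z from the Gaussian defined by z_mean and z_log_var
#(a standard reparameterization-trick sketch, not taken from the original notebook)
import tensorflow as tf
from tensorflow.keras.layers import Lambda

def sample_z(args):
    z_mean, z_log_var = args
    epsilon = tf.random.normal(shape=tf.shape(z_mean))   #random noise drawn from N(0, 1)
    return z_mean + tf.exp(0.5 * z_log_var) * epsilon    #shift and scale the noise by the learned parameters

z = Lambda(sample_z)([z_mean, z_log_var])

Using the log of the variance means the scaling factor tf.exp(0.5 * z_log_var) is always positive, which keeps training numerically stable.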
#Decoder
#(Reshape in addition to the layers imported above)
from tensorflow.keras.layers import Reshape

#1st layer: Input, takes a vector from the latent space
decoder_input = Input(shape=(latent_dim,))
#2nd layer: Dense, expands the latent vector back to the full flattened spectrogram size
decoder_dense = Dense(n_mels * time_frames, activation="sigmoid")(decoder_input)
#3rd layer: Reshape, restores the Dense layer's 1D output to the mel-spectrogram's 2D shape
decoder_reshaped = Reshape((n_mels, time_frames))(decoder_dense)
The decoder begins with an input from the latent space (latent_dim). The Dense layer, with a sigmoid activation, maps this latent representation back into the full (flattened) feature space of the music (n_mels * time_frames), reconstructing the detailed features that the encoder compressed. The final Reshape layer then 'unflattens' the Dense layer's output into the original spectrogram dimensions (n_mels, time_frames), so the decoder essentially reverses the encoding process.
When the decoder reconstructs an output, it samples points from this latent space. Since the points are centered around encoded means of actual music data, the decoder’s outputs are new musical pieces that reflect the encoded characteristics. These outputs can then be visualized as new spectrograms and converted back into audio, yielding novel compositions that resonate with the original music’s properties.
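As a concrete sketch of that generation step, the snippet below wraps the decoder layers into a Keras Model, samples a random latent point, and inverts the resulting mel-spectrogram back to audio. It assumes librosa was used for the mel conversion and a 22,050 Hz sample rate; both are my assumptions, since the post does not name its audio tooling here.

#Hypothetical generation step (assumptions: librosa for audio inversion, sr=22050)
import numpy as np
import librosa
from tensorflow.keras.models import Model

decoder = Model(decoder_input, decoder_reshaped)   #wrap the decoder layers into a model
z_sample = np.random.normal(size=(1, latent_dim))  #draw a random point from the latent prior
generated_spec = decoder.predict(z_sample)[0]      #decode it into a mel-spectrogram
#Depending on how the spectrograms were normalized for training, generated_spec may
#need to be rescaled back to power values before inversion.
audio = librosa.feature.inverse.mel_to_audio(generated_spec, sr=22050)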
Real-world Application: How VAEs Are Applied in Music Remixing
Earlier this year (2023), Han et al. proposed the music remixing framework InstructME. Using a VAE and latent diffusion, the framework enables tasks such as adding or removing specific instruments, remixing music into different genres, and multi-round editing, all while preserving the original melody and harmony.
The VAE in the InstructME framework has the same overall function and structure as the VAE in the piano music generator: it transforms the audio waveform of a music segment into a 2D latent embedding, using an encoder E and a decoder D (labeled in yellow blocks above).
The difference lies in complexity. The VAE in InstructME has four layers in its encoder and its decoder (the number of Down/Up Blocks). The table below shows what happens at each layer of the VAE.
As the audio passes through the encoder's layers, its dimensionality (the length of the audio sample sequence) is reduced by the downsampling rate at each layer, compressing the audio into a more manageable 2D latent space. Simultaneously, the number of channels increases at each encoder layer. For 1D data like audio, each channel represents a different feature detected across the time series. Each neuron in a channel looks at a specific part of the audio signal and activates (or fires) when it detects its particular feature. The initial layers, with fewer channels, might capture basic audio features such as frequency bands or temporal patterns; deeper layers, with more channels, might abstract these into more complex features representing aspects of the audio such as pitch, tone, or, in the remixing case, a specific instrument like guitar or piano. The decoder reverses this process by upsampling and decreasing the number of channels. Note that both the data dimension and the number of channels match between the input and output layers.
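To make the down/up-block idea concrete, here is a heavily simplified sketch of a single 1D-convolutional down block and up block in Keras. It is my own illustration based on the paper's description, not InstructME's actual implementation, and sequence_length plus the filter counts are placeholder assumptions.

#Hypothetical 1D down/up blocks (illustrative only, not InstructME's code)
from tensorflow.keras.layers import Input, Conv1D, Conv1DTranspose

audio_input = Input(shape=(sequence_length, 1))   #raw audio: (time steps, 1 channel)

#Down block: strides=2 halves the time dimension while the channel count grows to 32
down = Conv1D(filters=32, kernel_size=4, strides=2, padding="same", activation="relu")(audio_input)

#Up block: the transposed convolution restores the time dimension and reduces the channels back to 1
up = Conv1DTranspose(filters=1, kernel_size=4, strides=2, padding="same", activation="relu")(down)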
I created a basic version of the paper’s VAE based on the available information. There, I included print statements that allow you to see the encoding and decoding process in real-time. I also included the simple piano music generator VAE there. You can make a copy of the codebook and play with the models by yourself!
Apart from the base VAE, InstructME introduces many enhancements to tackle the complexity of remixing tasks. These include the integration of a discriminator for improved audio quality via adversarial training and the usage of stacked convolutional blocks, which are not implemented in my base codebook. Moreover, the VAE interacts with additional components like text encoders and diffusion models, which consider both textual instructions and audio data to generate audio that aligns with the given editing commands. These augmentations ensure that InstructME can handle more nuanced and complex music editing and generation tasks, producing high-quality audio that follows user instructions. You can check out their impressive sample output here.
Conclusion
As we close this chapter of our exploration into Variational Autoencoders (VAEs) and their application in music, we’ve seen the simplicity of a basic VAE and the complexity of advanced models like InstructME. Both open up a world of possibilities in AI-powered music creation. Stay tuned for our next installment, where we’ll delve into the intricacies of data processing and model training, continuing our journey through the VAE landscape to unlock the full symphony of possibilities in AI-generated music.
References:
Riebesell, J. (n.d.). Variational Auto Encoder Architecture. Retrieved from https://tikz.net/vae/
Han, B., Dai, J., Song, X., Hao, W., He, X., Guo, D., Chen, J., Wang, Y., & Qian, Y. (2023). InstructME: An Instruction Guided Music Edit And Remix Framework with Latent Diffusion Models. Retrieved from https://arxiv.org/abs/2308.14360