Deep Learning: Difference between revisions

From EMC23 - Satellite Of Love
Deep Learning is a subset of [[Machine Learning]] (which would also include Reinforcement learning)


= Definitions =
===== Definition =====
Deep learning is a class of machine learning algorithms that uses multiple layers to progressively extract higher-level features from the raw input.[12](pp199–200)
Deep Learning is a subset of A.I.
Although applicable across many domains and disciplines, we will be concentrating solely on audio. Even here we will be ignoring such topics as audio classification and concentrating on generation and composition and, to a lesser extent, de-mixing and audio restoration.


Learning can be supervised, semi-supervised or unsupervised.
Peter Kirn http://cdm.link/2019/04/now-ai-takes-on-writing-death-metal-country-music-hits-more/
 
In traditional problem-solving with software, a person analyzes a problem and engineers a solution in code to solve that problem.
In machine learning the problem solver abstracts away part of their solution as a flexible component called a model, and uses a special program called a model training algorithm to adjust that model to real-world data. The result is a trained model which can be used to predict outcomes that are not part of the data set used to train it.
 
===== Applications =====
* Speech Recognition
* Voice-based emotion classification
* Noise recognition
* Musical Genre, Instrument and Mood Classification
* Music Tagging
* Music Generation
 
===== Learning types =====
There are three main learning processes:
 
====== Supervised Learning ======
In supervised learning, every training sample from the dataset has a corresponding label or output value associated with it. As a result, the algorithm learns to predict labels or output values.
 
====== Unsupervised Learning ======
In unsupervised learning, there are no labels for the training data. A machine learning algorithm tries to learn the underlying patterns or distributions that govern the data.
 
Unsupervised learning involves using data that doesn't have a label. One common task is called '''clustering'''. Clustering helps to determine if there are any naturally occurring groupings in the data.
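As a rough illustration of clustering, here is a minimal k-means sketch on one-dimensional data, using only the Python standard library. The tempo values, the starting centroids and the function name <code>kmeans_1d</code> are made up for this example; real projects would more likely use a library implementation.

```python
def kmeans_1d(points, centroids, iterations=10):
    """Cluster 1-D points with plain k-means, starting from the given centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


# Hypothetical unlabeled data: track tempos in BPM.
tempos = [68, 72, 75, 120, 124, 128, 170, 174]
centroids, clusters = kmeans_1d(tempos, centroids=[60, 100, 180])
print(clusters)
```

No labels are supplied: the three groups (slow, medium, fast) emerge from the data alone.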
 
====== Reinforcement Learning ======
In reinforcement learning, the algorithm figures out which actions to take in a situation to maximize a reward (in the form of a number) on the way to reaching a specific goal.
 
An agent is a piece of software you are training that makes decisions in an environment to reach a goal.
 
* An algorithm is a set of instructions that tells a computer what to do. ML is special because it enables computers to learn without being explicitly programmed to do so.
* The training algorithm defines your model’s learning objective, which is to maximize total cumulative reward. Different algorithms have different strategies for going about this.
** Soft actor critic (SAC) embraces exploration and is data-efficient, but can lack stability.
** Proximal policy optimization (PPO) is stable but data-hungry.
* An action space is the set of all valid actions, or choices, available to an agent as it interacts with an environment.
** A discrete action space represents all of an agent's possible actions for each state as a finite set, for example a finite set of steering angle and throttle value combinations.
** A continuous action space allows the agent to select an action from a range of values that you define for each state.
 
* Hyperparameters are variables that control the performance of your agent during training. There is a variety of different categories with which to experiment. Change the values to increase or decrease the influence of different parts of your model.
** For example, the learning rate is a hyperparameter that controls how strongly new experiences update the model at each step. A higher learning rate results in faster training but may reduce the model’s quality.
 
* The reward function's purpose is to encourage the agent to reach its goal. Figuring out how to reward which actions is one of your most important jobs.
 
The more an agent learns about its environment, the more confident it becomes about the actions it chooses.
 
If an agent doesn't explore enough, it often sticks to information it has already learned even if this knowledge doesn't help the agent achieve its goal.
 
The agent can use information from previous experiences to help it make future decisions that enable it to reach its goal.
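The trade-off between exploring and sticking to what is already known can be sketched with an epsilon-greedy agent on a toy "multi-armed bandit". This is a much simpler setting than SAC or PPO and is only meant as an illustration; the reward values, the noise level and the function name <code>run_bandit</code> are assumptions made for this example.

```python
import random


def run_bandit(true_rewards, epsilon, steps, seed=0):
    """Epsilon-greedy agent choosing between actions with noisy rewards."""
    rng = random.Random(seed)
    estimates = [0.0] * len(true_rewards)  # the agent's current value estimates
    counts = [0] * len(true_rewards)       # how often each action was taken
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            action = rng.randrange(len(true_rewards))
        else:
            # Exploit: take the action currently believed to be best.
            action = max(range(len(estimates)), key=lambda a: estimates[a])
        reward = true_rewards[action] + rng.gauss(0, 0.1)
        counts[action] += 1
        # Incremental mean: fold the new experience into the estimate.
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward
    return estimates, counts, total_reward


true_rewards = [0.2, 0.5, 0.9]  # hypothetical average reward per action
estimates, counts, total = run_bandit(true_rewards, epsilon=0.1, steps=2000)
greedy_estimates, greedy_counts, greedy_total = run_bandit(true_rewards, epsilon=0.0, steps=2000)
print("exploring agent:", counts)
print("greedy agent:   ", greedy_counts)
```

With <code>epsilon=0.1</code> the agent occasionally tries other actions and ends up favouring the best one. With <code>epsilon=0.0</code> it keeps repeating the first action that looked good.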
 
==== Machine Learning steps for music generation ====
ML is a generic problem solver. A model can solve many problems, including ones not discovered until the model is in action.
The model is created from data: through an iterative process the model is 'fitted' to the data. Inferences are then made with the final model.
 
* Define the Problem
* Build the Dataset
* Train the Model
* Evaluate the Model
* Use the Model
 
A task is supervised if you are using '''labeled''' data. We use the term labeled to refer to data that already contains the solutions, called labels.
In supervised learning, there are two main identifiers you will see in machine learning:
 
    A categorical label has a discrete set of possible values. Furthermore, when you work with categorical labels, you often carry out classification tasks, which are part of the supervised learning family.
 
    A continuous (regression) label does not have a discrete set of possible values, which often means you are working with numerical data.
 
# Data Collection
Does the data you've collected match the machine learning task and problem you have defined?
# Data Inspection
 
* Outliers
* Missing or incomplete values
* Data that needs to be transformed or preprocessed so it's in the correct format to be used by your model
 
# Summary Statistics
Check that your data is in line with the underlying assumptions.
With many statistical tools, you can calculate things like the mean, interquartile range (IQR), and standard deviation. These tools can give you insight into the scope, scale, and shape of the dataset.
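A minimal sketch of such summary statistics, using Python's built-in <code>statistics</code> module. The clip durations are invented for the example, and the 1.5 × IQR rule used to flag outliers is a common rule of thumb rather than something prescribed here.

```python
import statistics

# Hypothetical dataset: clip durations in seconds, with one suspicious value.
durations = [2.9, 3.1, 3.0, 3.2, 2.8, 3.0, 3.1, 12.5]

mean = statistics.mean(durations)
stdev = statistics.stdev(durations)
q1, median, q3 = statistics.quantiles(durations, n=4)  # quartiles
iqr = q3 - q1

# Rule of thumb: flag points more than 1.5 * IQR outside the quartiles.
low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [d for d in durations if d < low or d > high]

print(f"mean={mean:.2f} median={median:.2f} stdev={stdev:.2f} iqr={iqr:.2f}")
print("outliers:", outliers)
```

The mean is pulled well above the median by the single long clip, which is the kind of mismatch worth inspecting before training.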
# Data Visualisation
You can use data visualization to see outliers and trends in your data.
 
Splitting your dataset gives you two sets of data:
    Training dataset: The data on which the model will be trained. Most of your data will be here. Many developers estimate about 80%.
    Test dataset: The data withheld from the model during training, which is used to test how well your model will generalize to new data.
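A split along these lines can be sketched as follows. The file names, the fixed seed and the helper name <code>train_test_split</code> are assumptions for illustration.

```python
import random


def train_test_split(samples, train_fraction=0.8, seed=42):
    """Shuffle the samples and cut them into a training and a test part."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # shuffle so the split is not biased by order
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


clips = [f"clip_{i:03d}.wav" for i in range(50)]  # hypothetical file names
train, test = train_test_split(clips)
print(len(train), "training clips,", len(test), "test clips")
```

The test clips are never shown to the model during training, so they can be used to check how well it generalizes.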
 
The model training algorithm iteratively updates a model's parameters to minimize some loss function.
 
===== Model parameters =====
: Model parameters are settings or configurations the training algorithm can update to change how the model behaves, such as weights and biases. Weights, which are values that change as the model learns, are more specific to neural networks.
 
===== Loss function =====
A loss function is used to codify the model’s distance from its goal: the lower the loss, the closer the model's predictions are to the desired outputs.
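To make the link between parameters, loss function and training algorithm concrete, here is a small sketch that fits a straight line with gradient descent on a mean squared error loss. The data, the learning rate and the step count are arbitrary choices for the example.

```python
def mse(w, b, xs, ys):
    """Loss function: mean squared error of the line y = w * x + b."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, learning_rate=0.05, steps=2000):
    """Iteratively update the parameters w and b to reduce the loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradients of the MSE with respect to each parameter.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        # Step against the gradient, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]  # generated from y = 2x + 1
w, b = train(xs, ys)
print(f"w={w:.3f} b={b:.3f} loss={mse(w, b, xs, ys):.6f}")
```

Here <code>w</code> and <code>b</code> are the model parameters, while the learning rate and step count are hyperparameters.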
 
===== Model Types =====
* Linear models
* Tree-based models
* Deep learning models
** FFNN: The most straightforward way of structuring a neural network, the Feed Forward Neural Network (FFNN) structures neurons in a series of layers, with each neuron in a layer containing weights to all neurons in the previous layer.
** CNN: Convolutional Neural Networks (CNN) represent nested filters over grid-organized data. They are by far the most commonly used type of model when processing images.
** RNN/LSTM: Recurrent Neural Networks (RNN) and the related Long Short-Term Memory (LSTM) model types are structured to effectively represent for loops in traditional computing, collecting state while iterating over some object. They can be used for processing sequences of data.
** Transformer: A more modern replacement for RNN/LSTMs, the transformer architecture enables training over larger datasets involving sequences of data.
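The FFNN description can be illustrated with a forward pass through two fully connected layers. The weights below are hand-picked, untrained values chosen only to show the data flow; in practice a framework such as Tensorflow would create and train them.

```python
import math


def relu(z):
    return max(0.0, z)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(inputs, weights, biases, activation):
    """One fully connected layer: every neuron has a weight for every input."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + bias)
        for neuron_weights, bias in zip(weights, biases)
    ]


# Hypothetical parameters: 3 inputs -> 2 hidden neurons -> 1 output neuron.
hidden_w = [[0.5, -0.2, 0.1], [-0.3, 0.8, 0.4]]
hidden_b = [0.0, 0.1]
output_w = [[1.0, -1.0]]
output_b = [0.0]

x = [1.0, 2.0, 3.0]
hidden = dense(x, hidden_w, hidden_b, relu)
output = dense(hidden, output_w, output_b, sigmoid)
print(hidden, output)
```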
 
===== Model Evaluation =====
Log loss seeks to calculate how uncertain your model is about the predictions it is generating.
 
Model Accuracy is the fraction of predictions a model gets right.
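The difference between the two measures can be seen in a small sketch. Both sets of predicted probabilities below are invented; they give the same accuracy, but log loss is higher for the less certain model.

```python
import math


def log_loss(y_true, y_prob, eps=1e-15):
    """Average binary cross-entropy between labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of predictions on the correct side of the threshold."""
    correct = sum(1 for y, p in zip(y_true, y_prob) if (p >= threshold) == bool(y))
    return correct / len(y_true)


labels = [1, 0, 1, 1]
confident = [0.9, 0.1, 0.8, 0.7]  # hypothetical model that is fairly sure
hesitant = [0.6, 0.4, 0.6, 0.6]   # hypothetical model that is barely sure

print(accuracy(labels, confident), log_loss(labels, confident))
print(accuracy(labels, hesitant), log_loss(labels, hesitant))
```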
 
1. R Square/Adjusted R Square
2. Mean Square Error(MSE)/Root Mean Square Error(RMSE)
3. Mean Absolute Error(MAE)
Things to think about
 
There are many different tools that can be used to evaluate a linear regression model. Here are a few examples:
 
    Mean absolute error (MAE): This is measured by taking the average of the absolute difference between the actual values and the predictions. Ideally, this difference is minimal.
 
    Root mean square error (RMSE): This is similar to MAE, but takes a slightly modified approach so values with large error receive a higher penalty. RMSE takes the square root of the average squared difference between the prediction and the actual value.


    Coefficient of determination or R-squared (R^2): This measures how well observed outcomes are actually predicted by the model, based on the proportion of the total variation of outcomes that the model explains.
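The three regression metrics can be written out directly. The actual and predicted values below are invented for the example.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean square error: large errors are penalised more than in MAE."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def r_squared(actual, predicted):
    """Proportion of the variation in the actual values explained by the model."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 7.5, 11.0]  # the last prediction is off by 2

print(mae(actual, predicted), rmse(actual, predicted), r_squared(actual, predicted))
```

RMSE comes out larger than MAE here because the single large error is squared before averaging.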


= Python =
[[Librosa]] is used to analyse and manipulate audio


[[Tensorflow]] is used to train models


Accuracy, False positive rate, Precision, Negative predictive value, Specificity


===== Model Inference =====
* Generating predictions.
* Finding patterns in your data.
* Using a trained model.
* Testing your model on data it has not seen before.
==== Generative AI Models Used for Music Composition ====
Three families of generative models are used for music composition: generative adversarial networks (GANs), general autoregressive models, and transformer-based models.
===== Autoregressive models =====
Autoregressive convolutional neural networks (AR-CNNs) are used to study systems that evolve over time and assume that the likelihood of some data depends only on what has happened in the past. It’s a useful way of looking at many systems, from weather prediction to stock prediction.
When a note is either added or removed from your input track during inference, we call it an edit event.
To train the AR-CNN model to predict when notes need to be added or removed from your input track (edit event), the model iteratively updates the input track to sound more like the training dataset. During training, the model is also challenged to detect differences between an original piano roll and a newly modified piano roll.
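The autoregressive idea, that the next value depends only on what came before, can be sketched with a far simpler model than an AR-CNN: a first-order table of which pitch follows which. This is a toy stand-in for illustration only; the training melody and function names are invented.

```python
import random
from collections import defaultdict


def fit_transitions(sequence):
    """Record which pitches follow each pitch in the training sequence."""
    transitions = defaultdict(list)
    for current, following in zip(sequence, sequence[1:]):
        transitions[current].append(following)
    return transitions


def generate(transitions, start, length, seed=0):
    """Sample a melody one note at a time, each note conditioned on the previous one."""
    rng = random.Random(seed)
    melody = [start]
    while len(melody) < length:
        options = transitions.get(melody[-1])
        if not options:  # no continuation was observed for this pitch
            break
        melody.append(rng.choice(options))
    return melody


# Hypothetical training melody as MIDI pitch numbers.
training = [60, 62, 64, 62, 60, 64, 65, 67, 65, 64, 62, 60]
model = fit_transitions(training)
melody = generate(model, start=60, length=16)
print(melody)
```

Every step in the generated melody is a transition that occurs in the training data, which is why the output resembles it.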
===== Generative adversarial networks (GANs) =====
Generative adversarial networks (GANs) are a machine learning model format that involves pitting two networks against each other to generate new content. The training algorithm swaps back and forth between training
* a generator network (responsible for producing new data) and
* a discriminator network (responsible for measuring how closely the generator network’s data represents the training dataset).
The generator and the discriminator are trained in alternating cycles. The generator learns to produce more and more realistic data while the discriminator iteratively gets better at learning to differentiate real data from the newly created data.
* Generator: A neural network that learns to create new data resembling the source data on which it was trained.
* Discriminator: A neural network trained to differentiate between real and synthetic data.
* Generator loss: Measures how far the output data deviates from the real data present in the training dataset.
* Discriminator loss: Evaluates how well the discriminator differentiates between real and fake data.
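One common way to express the two losses is with binary cross-entropy on the discriminator's outputs. The sketch below assumes that formulation (other GAN variants use different losses), and the score values are invented rather than produced by real networks.

```python
import math


def bce(prediction, target, eps=1e-12):
    """Binary cross-entropy for one probability and its 0/1 target."""
    p = min(max(prediction, eps), 1 - eps)
    return -(target * math.log(p) + (1 - target) * math.log(1 - p))


def discriminator_loss(real_scores, fake_scores):
    """Low when real data is scored near 1 and generated data near 0."""
    real = sum(bce(s, 1.0) for s in real_scores) / len(real_scores)
    fake = sum(bce(s, 0.0) for s in fake_scores) / len(fake_scores)
    return real + fake


def generator_loss(fake_scores):
    """Low when the discriminator scores generated data as real."""
    return sum(bce(s, 1.0) for s in fake_scores) / len(fake_scores)


# Hypothetical discriminator outputs (probability that the input is real).
real_scores = [0.9, 0.8]
early_fakes = [0.1, 0.2]  # early in training: fakes are easily spotted
later_fakes = [0.6, 0.7]  # later: the generator fools the discriminator more often

print(generator_loss(early_fakes), generator_loss(later_fakes))
print(discriminator_loss(real_scores, early_fakes), discriminator_loss(real_scores, later_fakes))
```

As the generator improves, its own loss falls while the discriminator's loss rises, which is the tension the alternating training cycles exploit.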
===== Transformer-based models =====
Transformer-based models are most often used to study data with some sequential structure (such as the sequence of words in a sentence). Transformer-based methods are now a common modern tool for modeling natural language.
===== Architectural Patterns =====
Architectural patterns which could be applied to the architectures used for music generation:
* convolutional,
* conditioning,
* adversarial.


= Deep learning  architectures used for music generation =


From this basic building block, we will describe in the following sections the main types of deep learning architectures used for music generation (as well as for other purposes):
• recurrent (RNN).


==== Tools ====
There are several essential tools in the [[Python]] kit for Deep Learning
===== Librosa =====
Librosa is used to analyse and manipulate audio.
===== Tensorflow =====
[[Tensorflow]] is used to train models.
===== Keras =====
Keras is a high-level library for Tensorflow.
===== SkLearn =====
SkLearn (scikit-learn) is a statistical and machine learning library with examples and tutorials.
 
==== Generating Sound with Neural Networks ====
===== Defining the sound generation task =====
===== Classification of sound generation systems =====
====== WaveNet ======
====== Jukebox ======
===== Types of generated sounds =====
===== Sound representations =====
===== Generation from raw audio =====
===== Challenges of raw audio generation =====
 
long-range dependencies
* Pitch
* Melody
* Structure
* Timbre
* Rhythm
* Harmony
 
===== Generation from spectrograms =====
More compact than numeric representation of raw audio
 
Short-time Fourier transform (STFT) -> model -> spectrogram -> inverse short-time Fourier transform (ISTFT) -> audio
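The first step of this pipeline can be sketched with a deliberately naive STFT written with the standard library only. The frame size, hop size and test tone are arbitrary choices for the example; in practice a library routine such as <code>librosa.stft</code> would normally be used instead.

```python
import cmath
import math


def stft(signal, frame_size, hop):
    """Naive STFT: Hann-windowed frames, each transformed with a direct DFT."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_size) for n in range(frame_size)]
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = [signal[start + n] * window[n] for n in range(frame_size)]
        # Keep only the non-negative frequency bins (0 .. frame_size // 2).
        bins = [
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n in range(frame_size))
            for k in range(frame_size // 2 + 1)
        ]
        frames.append(bins)
    return frames


sample_rate = 8000
frame_size, hop = 64, 32
# A tenth of a second of a 1 kHz sine wave.
signal = [math.sin(2 * math.pi * 1000 * n / sample_rate) for n in range(800)]

spectrum = stft(signal, frame_size, hop)
# The spectrogram keeps the magnitudes and discards the phase.
magnitude = [[abs(c) for c in frame] for frame in spectrum]

peak_bin = max(range(len(magnitude[0])), key=lambda k: magnitude[0][k])
peak_hz = peak_bin * sample_rate / frame_size
print(len(signal), "samples ->", len(magnitude), "frames; peak at", peak_hz, "Hz")
```

The 800 samples become 24 frames, which shows why the time axis of a spectrogram is more compact. Taking <code>abs()</code> throws the phase away, which is why phase has to be reconstructed before audio can be recovered.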
 
===== Advantages of generation from spectrograms =====
* Temporal axis of spectrogram is more compact than that of waveform
* Capture longer time dependencies
* Computationally lighter than raw audio
 
===== Challenges of generation from spectrograms =====
* Audio Fidelity
* Phase reconstruction
 
We cannot generate sound with Mel Frequency Cepstral Coefficients or MFCCs
 
===== Deep Learning architectures for sound generation =====
* GAN
* Autoencoder
The encoder compresses the original data down to a bottleneck, or latent space; the decoder decompresses the data back to the original domain (dimensions). The network is trained with backpropagation to minimise the reconstruction error E(x, x1), the difference between the original data x and the reconstructed data x1.
A root mean square error function is used as the reconstruction loss, with regularization so that the model does not overfit.
 
<pre>
# Keras imports for building the autoencoder
from tensorflow.keras import Model
from tensorflow.keras.layers import Input, Conv2D, ReLU, BatchNormalization, \
    Flatten, Dense
from tensorflow.keras import backend as K
</pre>
 
[[File:AE.png|thumb]]
 
* Variational Autoencoder (VAE)
* VQ-VAE
 
===== Inputs for generation =====
* Conditioning
Uses conditions to steer the output, e.g. create a vocal like x in the style of y.
 
* Autonomous
Free, unconstrained generation.
 
* Continuation
Start a sequence and have the model continue it.


==== Tutorials ====
===== The Sound of AI =====
* [[Deep Learning (for Audio) with Python]] - 19 videos


* [[Audio Signal Processing for Machine Learning]] - 23 videos
* [[PyTorch for Audio + Music Processing]] - 10 videos
* [[Generating Sound with Neural Networks]] - 14 videos


[https://www.youtube.com/c/ValerioVelardoTheSoundofAI Valerio Velardo - The Sound of AI]


==== Links ====
Audio Handling Basics: Process Audio Files In Command-Line or Python

https://benhayes.net/projects/nws/#audio-examples<br />

Do Androids Dream of Electric Beats?<br />
https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53<br />
Intro to Audio Analysis: Recognizing Sounds Using Machine Learning<br />
https://pythonrepo.com/repo/nerdyrodent-VQGAN-CLIP-python-deep-learning<br />


<evlplayer id="player1" w="480" h="360" service="youtube" defaultid="MwtVkPKx3RA" />


==== Terminology ====
'''Deep learning''' is a class of machine learning algorithms that uses multiple layers to progressively extract higher-level features from the raw input.[12](pp199–200)


'''Learning''' can be '''supervised''', '''semi-supervised''' or '''unsupervised'''


'''Impute''' is a common term referring to different statistical tools which can be used to calculate missing values from your dataset.

'''Outliers''' are data points that are significantly different from others in the same sample.

'''FFNN''': The most straightforward way of structuring a neural network, the Feed Forward Neural Network (FFNN) structures neurons in a series of layers, with each neuron in a layer containing weights to all neurons in the previous layer.<br />
'''CNN''': Convolutional Neural Networks (CNN) represent nested filters over grid-organized data. They are by far the most commonly used type of model when processing images.<br />
'''RNN/LSTM''': Recurrent Neural Networks (RNN) and the related Long Short-Term Memory (LSTM) model types are structured to effectively represent for loops in traditional computing, collecting state while iterating over some object. They can be used for processing sequences of data.<br />
'''Transformer''': A more modern replacement for RNN/LSTMs, the transformer architecture enables training over larger datasets involving sequences of data.<br />
   
   


'''Bag of words''': A technique used to extract features from text. It counts how many times a word appears in a document (corpus), and then transforms that information into a dataset.

'''A categorical label''' has a discrete set of possible values, such as "is a cat" and "is not a cat."

'''Clustering''': An unsupervised learning task that helps to determine if there are any naturally occurring groupings in the data.

'''CNN''': Convolutional Neural Networks (CNN) represent nested filters over grid-organized data. They are by far the most commonly used type of model when processing images.

'''A continuous (regression) label''' does not have a discrete set of possible values, which means possibly an unlimited number of possibilities.

'''Continuous''': Floating-point values with an infinite range of possible values. The opposite of categorical or discrete values, which take on a limited number of possible values.

'''Data vectorization''': A process that converts non-numeric data into a numerical format so that it can be used by a machine learning model.

'''Discrete''': A term taken from statistics referring to an outcome taking on only a finite number of values (such as days of the week).

'''FFNN''': The most straightforward way of structuring a neural network, the Feed Forward Neural Network (FFNN) structures neurons in a series of layers, with each neuron in a layer containing weights to all neurons in the previous layer.

'''Hyperparameters''' are settings on the model which are not changed during training but can affect how quickly or how reliably the model trains, such as the number of clusters the model should identify.

'''Log loss''' is used to calculate how uncertain your model is about the predictions it is generating.

'''Hyperplane''': A mathematical term for a flat subspace with one dimension fewer than the space that contains it; it generalizes a line in two dimensions and a plane in three dimensions.

'''Impute''' is a common term referring to different statistical tools which can be used to calculate missing values from your dataset.

'''Label''' refers to data that already contains the solution.

'''Loss function''': Used to codify the model’s distance from its goal.

'''Machine learning''', or ML, is a modern software development technique that enables computers to solve problems by using examples of real-world data.

'''Model accuracy''' is the fraction of predictions a model gets right.

'''Model inference''' is when the trained model is used to generate predictions.

'''Model''': An extremely generic program, made specific by the data used to train it.

'''Model parameters''' are settings or configurations the training algorithm can update to change how the model behaves.

'''Model training algorithms''' work through an iterative process where the current model iteration is analyzed to determine what changes can be made to get closer to the goal. Those changes are made and the iteration continues until the model is evaluated to meet the goals.

'''Neural networks''': A collection of very simple models connected together. These simple models are called neurons. The connections between these models are trainable model parameters called weights.

'''Outliers''' are data points that are significantly different from others in the same sample.

'''Plane''': A mathematical term for a flat surface (like a piece of paper) on which two points can be joined by a straight line.

'''Regression''': A common task in supervised machine learning, in which the model predicts a continuous value.

In '''reinforcement learning''', the algorithm figures out which actions to take in a situation to maximize a reward (in the form of a number) on the way to reaching a specific goal.

'''RNN/LSTM''': Recurrent Neural Networks (RNN) and the related Long Short-Term Memory (LSTM) model types are structured to effectively represent for loops in traditional computing, collecting state while iterating over some object. They can be used for processing sequences of data.

'''Silhouette coefficient''': A score from -1 to 1 describing the clusters found during modeling. A score near zero indicates overlapping clusters, and scores less than zero indicate data points assigned to incorrect clusters.

'''Stop words''': A list of words removed by natural language processing tools when building your dataset. There is no single universal list of stop words used by all natural language processing tools.

In '''supervised learning''', every training sample from the dataset has a corresponding label or output value associated with it. As a result, the algorithm learns to predict labels or output values.

'''Test dataset''': The data withheld from the model during training, which is used to test how well your model will generalize to new data.

'''Training dataset''': The data on which the model will be trained. Most of your data will be here.

'''Transformer''': A more modern replacement for RNN/LSTMs, the transformer architecture enables training over larger datasets involving sequences of data.

With '''unlabeled data''', you don't need to provide the model with any kind of label or solution while the model is being trained.

In '''unsupervised learning''', there are no labels for the training data. A machine learning algorithm tries to learn the underlying patterns or distributions that govern the data.

<evlplayer id="player1" w="480" h="360" service="youtube" defaultid="CNNmBtNcccE" />

Machine learning is synthesizing death metal. It might make your death metal radio DJ nervous – but it could also mean music software works with timbre and time in new ways. That news – plus some comical abuse of neural networks for writing genre-specific lyrics in genres like country – next.
Peter Kirn http://cdm.link/2019/04/now-ai-takes-on-writing-death-metal-country-music-hits-more/

Latest revision as of 22:17, 10 March 2022

Definition[edit]

Deep Learning is a subset of A.I Although applicable accoss many domains and disciplines we will be concentrating solely on audio. Even here we will be ignoring such topics as audio classification and and concentrating on generation and composition, and to a lesser extend de-mixing and audio restoration.

Peter Kirn http://cdm.link/2019/04/now-ai-takes-on-writing-death-metal-country-music-hits-more/

In traditional problem-solving with software, a person analyzes a problem and engineers a solution in code to solve that problem.
In machine learning the problem solver abstracts away part of their solution as a flexible component called a model, and uses a special program called a model training algorithm to adjust that model to real-world data. The result is a trained model which can be used to predict outcomes that are not part of the data set used to train it.
Applications[edit]
  • Speech Recognition
  • Voice Based emotion classification
  • Noise recognition
  • Musical Genre Instrument Mood Classificatiob
  • Music Tagging
  • Music Generation
Learning types[edit]

There are three main learming process

Supervised Learning[edit]

In supervised learning, every training sample from the dataset has a corresponding label or output value associated with it. As a result, the algorithm learns to predict labels or output values.

Unsupervised Learning[edit]

In unsupervised learning, there are no labels for the training data. A machine learning algorithm tries to learn the underlying patterns or distributions that govern the data.

Unsupervised learning involves using data that doesn't have a label. One common task is called clustering. Clustering helps to determine if there are any naturally occurring groupings in the data.

Reinforcement Learning[edit]

In reinforcement learning, the algorithm figures out which actions to take in a situation to maximize a reward (in the form of a number) on the way to reaching a specific goal.

An agent is a piece of software you are training that makes decisions in an environment to reach a goal.

  • An algorithm is a set of instructions that tells a computer what to do. ML is special because it enables computers to learn without being explicitly programmed to do so.
  • The training algorithm defines your model’s learning objective, which is to maximize total cumulative reward. Different algorithms have different strategies for going about this.
      ◦ Soft actor critic (SAC) embraces exploration and is data-efficient, but can lack stability.
      ◦ Proximal policy optimization (PPO) is stable but data-hungry.
  • An action space is the set of all valid actions, or choices, available to an agent as it interacts with an environment.
      ◦ Discrete action space represents all of an agent's possible actions for each state in a finite set of steering angle and throttle value combinations.
      ◦ Continuous action space allows the agent to select an action from a range of values that you define for each state.
  • Hyperparameters are variables that control the performance of your agent during training. There is a variety of different categories with which to experiment. Change the values to increase or decrease the influence of different parts of your model.
      ◦ For example, the learning rate is a hyperparameter that controls how many new experiences are counted in learning at each step. A higher learning rate results in faster training but may reduce the model’s quality.
  • The reward function's purpose is to encourage the agent to reach its goal. Figuring out how to reward which actions is one of your most important jobs.

The more an agent learns about its environment, the more confident it becomes about the actions it chooses.

If an agent doesn't explore enough, it often sticks to information it has already learned, even if this knowledge doesn't help the agent achieve its goal.

The agent can use information from previous experiences to help it make future decisions that enable it to reach its goal.
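The trade-off between exploring and reusing what has already been learned can be sketched with an epsilon-greedy agent in a deliberately simplified, single-state setting (a multi-armed bandit). This is not SAC or PPO; the reward values, `epsilon` and the function name `run_bandit` are assumptions chosen for illustration.

```python
import random


def run_bandit(rewards, epsilon=0.1, steps=1000, seed=0):
    """Epsilon-greedy agent choosing among actions with fixed (hidden) rewards."""
    rng = random.Random(seed)
    estimates = [0.0] * len(rewards)  # the agent's current value estimate per action
    counts = [0] * len(rewards)       # how often each action has been tried
    total_reward = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: try a random action.
            action = rng.randrange(len(rewards))
        else:
            # Exploit: pick the action with the best estimate so far.
            action = max(range(len(rewards)), key=lambda a: estimates[a])
        reward = rewards[action]
        counts[action] += 1
        # Incremental mean: fold the new experience into the estimate.
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward
    return estimates, counts, total_reward


rewards = [0.2, 0.5, 0.9]  # hypothetical reward for each of three actions

estimates, counts, total = run_bandit(rewards, epsilon=0.1)
best_action = max(range(len(rewards)), key=lambda a: estimates[a])

# With no exploration the agent never leaves the first action it tried.
greedy_estimates, greedy_counts, greedy_total = run_bandit(rewards, epsilon=0.0)
print(best_action, counts, greedy_counts)
```

With `epsilon=0.0` the agent locks onto the first action it happens to try, which is the "sticks to information it has already learned" failure described above.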

Machine Learning steps for music generation

ML is a generic problem solver. A model can solve many problems, including ones not discovered until the model is in action. The model is created from data: through an iterative process, the model is 'fitted' to the data. Inferences are then made with the final model.

  • Define the Problem
  • Build the Dataset
  • Train the Model
  • Evaluate the Model
  • Use the Model

A task is supervised if you are using labeled data. We use the term labeled to refer to data that already contains the solutions, called labels. In supervised learning, there are two main kinds of label:

  • A categorical label has a discrete set of possible values. When you work with categorical labels, you often carry out classification tasks, which are part of the supervised learning family.
  • A continuous (regression) label does not have a discrete set of possible values, which often means you are working with numerical data.

  1. Data Collection

Does the data you've collected match the machine learning task and problem you have defined?

  2. Data Inspection

Check the data for:

  • Outliers
  • Missing or incomplete values
  • Data that needs to be transformed or preprocessed so it's in the correct format to be used by your model

  3. Summary Statistics

Check that your data is in line with the underlying assumptions. With many statistical tools, you can calculate things like the mean, interquartile range (IQR), and standard deviation. These tools can give you insight into the scope, scale, and shape of the dataset.

  4. Data Visualisation

You can use data visualization to see outliers and trends in your data.
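The summary-statistics and outlier checks described above might look like the following sketch, which uses only Python's standard `statistics` module. The clip durations are invented, and the 1.5 × IQR rule for flagging outliers is a common convention assumed here, not something this page prescribes.

```python
import statistics

# Hypothetical clip durations in seconds; the last value is suspicious.
durations = [1, 2, 3, 4, 5, 6, 7, 8, 100]

mean = statistics.mean(durations)
stdev = statistics.stdev(durations)

# quantiles(..., n=4) returns the three quartile cut points.
q1, median, q3 = statistics.quantiles(durations, n=4)
iqr = q3 - q1

# Assumed convention: anything beyond 1.5 * IQR from the quartiles is an outlier.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [d for d in durations if d < lower or d > upper]

print(f"mean={mean:.2f} stdev={stdev:.2f} median={median} IQR={iqr}")
print("outliers:", outliers)
```

The mean is pulled far above the median by the single extreme value, which is the kind of mismatch these checks are meant to reveal before training.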

Splitting your dataset gives you two sets of data:

   Training dataset: The data on which the model will be trained. Most of your data will be here. Many developers estimate about 80%.
   Test dataset: The data withheld from the model during training, which is used to test how well your model will generalize to new data.
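A minimal sketch of such a split, assuming the 80% figure mentioned above and a list of hypothetical file names. The helper name `train_test_split` is chosen for readability here and is written from scratch rather than imported from a library.

```python
import random


def train_test_split(samples, train_fraction=0.8, seed=42):
    """Shuffle a copy of the samples and cut it into training and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


clips = [f"clip_{i:03d}.wav" for i in range(50)]  # hypothetical file names
train, test = train_test_split(clips)
print(len(train), len(test))
```

Shuffling before cutting matters when the source data is ordered (for example by artist or recording date), since otherwise the test set would not resemble the training set.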

The model training algorithm iteratively updates a model's parameters to minimize some loss function.

Model parameters
Model parameters are settings or configurations the training algorithm can update to change how the model behaves, such as weights and biases. Weights, which are values that change as the model learns, are more specific to neural networks.
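Loss function

A loss function is used to codify the model’s distance from its goal: it reduces the difference between the model's predictions and the desired outputs to a single number that the training algorithm tries to minimize.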
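To make the relationship between parameters, loss function and training algorithm concrete, here is a small sketch that fits a straight line with gradient descent. The model (`w * x + b`), the mean squared error loss, the learning rate and the toy data are all assumptions picked for illustration.

```python
def mse(w, b, xs, ys):
    """Loss function: mean squared error of the line w*x + b on the data."""
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, learning_rate=0.05, steps=2000):
    """Training algorithm: repeatedly nudge the parameters to lower the loss."""
    w, b = 0.0, 0.0  # model parameters (a weight and a bias)
    n = len(xs)
    for _ in range(steps):
        # Gradients of the MSE loss with respect to each parameter.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # generated from y = 2x + 1

initial_loss = mse(0.0, 0.0, xs, ys)
w, b = train(xs, ys)
final_loss = mse(w, b, xs, ys)
print(f"w={w:.3f} b={b:.3f} loss {initial_loss:.3f} -> {final_loss:.6f}")
```

Neural networks follow the same loop, only with far more parameters and with the gradients computed by backpropagation.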

Model Types
  • Linear models
  • Tree-based models
  • Deep learning models
      ◦ FFNN: The most straightforward way of structuring a neural network, the Feed Forward Neural Network (FFNN) structures neurons in a series of layers, with each neuron in a layer containing weights to all neurons in the previous layer.
      ◦ CNN: Convolutional Neural Networks (CNN) represent nested filters over grid-organized data. They are by far the most commonly used type of model when processing images.
      ◦ RNN/LSTM: Recurrent Neural Networks (RNN) and the related Long Short-Term Memory (LSTM) model types are structured to effectively represent for loops in traditional computing, collecting state while iterating over some object. They can be used for processing sequences of data.
      ◦ Transformer: A more modern replacement for RNN/LSTMs, the transformer architecture enables training over larger datasets involving sequences of data.
Model Evaluation

Log loss seeks to calculate how uncertain your model is about the predictions it is generating.

Model Accuracy is the fraction of predictions a model gets right.
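The difference between the two measures can be seen in a small sketch for binary labels. The probability lists are invented, and the 0.5 decision threshold and the clipping constant `eps` are assumptions of this example.

```python
import math


def log_loss(y_true, probs, eps=1e-12):
    """Average negative log-likelihood of the true labels under the predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


def accuracy(y_true, probs, threshold=0.5):
    """Fraction of predictions that match the label after thresholding."""
    predictions = [1 if p >= threshold else 0 for p in probs]
    return sum(p == y for p, y in zip(predictions, y_true)) / len(y_true)


y_true = [1, 0, 1, 0]
confident = [0.9, 0.1, 0.8, 0.2]    # hypothetical model that is sure of itself
hesitant = [0.6, 0.4, 0.55, 0.45]   # hypothetical model that is barely decided

print(accuracy(y_true, confident), log_loss(y_true, confident))
print(accuracy(y_true, hesitant), log_loss(y_true, hesitant))
```

Both hypothetical models get every prediction right, so accuracy cannot tell them apart, while log loss is clearly higher for the hesitant one.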

Common regression metrics include R-squared and adjusted R-squared, mean square error (MSE) and root mean square error (RMSE), and mean absolute error (MAE).

There are many different tools that can be used to evaluate a linear regression model. Here are a few examples:

  • Mean absolute error (MAE): This is measured by taking the average of the absolute difference between the actual values and the predictions. Ideally, this difference is minimal.
  • Root mean square error (RMSE): This is similar to MAE, but takes a slightly modified approach so that values with a large error receive a higher penalty. RMSE takes the square root of the average squared difference between the prediction and the actual value.
  • Coefficient of determination or R-squared (R^2): This measures how well observed outcomes are predicted by the model, as the proportion of the total variation in outcomes that the model explains.
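The three regression metrics can be written out directly from their definitions. The sketch below uses plain Python and invented values; the function names are chosen for this example.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean square error: squaring penalizes large errors more heavily."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Proportion of the variation in y_true explained by the predictions."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


y_true = [3.0, 5.0, 7.0]
y_pred = [2.0, 5.0, 9.0]  # hypothetical model output

print(mae(y_true, y_pred), rmse(y_true, y_pred), r_squared(y_true, y_pred))
```

Here the single error of 2 contributes more to RMSE than to MAE, which is the "higher penalty" for large errors mentioned above.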


Common tools for evaluating classification models include:

  • Accuracy
  • Confusion matrix
  • False positive rate
  • False negative rate
  • Precision
  • Recall
  • F1 score
  • Log loss
  • ROC curve
  • Negative predictive value
  • Specificity
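Most of the classification measures named above are simple ratios of the four confusion-matrix counts. The following sketch computes them for binary labels using the standard definitions; the label lists and the helper names are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def safe_div(numerator, denominator):
    """Avoid ZeroDivisionError when a class is absent."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "accuracy": safe_div(tp + tn, tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": safe_div(tn, tn + fp),
        "false_positive_rate": safe_div(fp, fp + tn),
        "false_negative_rate": safe_div(fn, fn + tp),
        "negative_predictive_value": safe_div(tn, tn + fn),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]  # hypothetical classifier output

metrics = classification_metrics(y_true, y_pred)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```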

Model Inference
   * Generating predictions.
   * Finding patterns in your data.
   * Using a trained model.
   * Testing your model on data it has not seen before.

Generative AI Models Used for Music Composition

The generative models covered here are generative adversarial networks (GANs), general autoregressive models, and transformer-based models.

Autoregressive models

Autoregressive convolutional neural networks (AR-CNNs) are used to study systems that evolve over time and assume that the likelihood of some data depends only on what has happened in the past. It’s a useful way of looking at many systems, from weather prediction to stock prediction.

When a note is either added or removed from your input track during inference, we call it an edit event.

To train the AR-CNN model to predict when notes need to be added or removed from your input track (an edit event), the model iteratively updates the input track so that it sounds more like the training dataset. During training, the model is also challenged to detect differences between an original piano roll and a newly modified piano roll.
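The idea of an edit event can be illustrated by comparing two piano rolls cell by cell. This sketch does not implement the AR-CNN itself; it only assumes a piano roll stored as a grid of 0s and 1s with one row per pitch and one column per time step, and the function name `edit_events` is hypothetical.

```python
def edit_events(before, after):
    """List (kind, pitch, step) for every cell that differs between two piano rolls."""
    events = []
    for pitch, (row_before, row_after) in enumerate(zip(before, after)):
        for step, (old, new) in enumerate(zip(row_before, row_after)):
            if old == 0 and new == 1:
                events.append(("add", pitch, step))
            elif old == 1 and new == 0:
                events.append(("remove", pitch, step))
    return events


# Rows are pitches, columns are time steps; 1 means the note is sounding.
input_roll = [
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 1],
]
edited_roll = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
]

events = edit_events(input_roll, edited_roll)
print(events)
```

In this toy case one note is removed and another is added at the same time step, so the comparison yields two edit events.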

Generative adversarial networks (GANs)

Generative adversarial networks (GANs) are a machine learning model format that involves pitting two networks against each other to generate new content. The training algorithm swaps back and forth between training:

  • a generator network (responsible for producing new data) and
  • a discriminator network (responsible for measuring how closely the generator network’s data represents the training dataset).

The generator and the discriminator are trained in alternating cycles. The generator learns to produce more and more realistic data while the discriminator iteratively gets better at learning to differentiate real data from the newly created data.

   Generator: A neural network that learns to create new data resembling the source data on which it was trained.
   Discriminator: A neural network trained to differentiate between real and synthetic data.
   Generator loss: Measures how far the output data deviates from the real data present in the training dataset.
   Discriminator loss: Evaluates how well the discriminator differentiates between real and fake data.
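One common way to express the two losses is binary cross-entropy on the discriminator's outputs. The sketch below assumes the discriminator returns a probability that a sample is real, and uses the widely used non-saturating form of the generator loss; a particular tool may define its losses differently, and the numbers are invented.

```python
import math


def discriminator_loss(d_real, d_fake):
    """Low when real samples score near 1 and generated samples score near 0."""
    real_term = -sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = -sum(math.log(1 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def generator_loss(d_fake):
    """Low when the discriminator mistakes generated samples for real ones."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


# Hypothetical discriminator outputs (probability that a sample is real).
sharp_real, sharp_fake = [0.9, 0.8], [0.1, 0.2]   # discriminator is winning
fooled_real, fooled_fake = [0.5, 0.5], [0.5, 0.5]  # discriminator cannot tell

print(discriminator_loss(sharp_real, sharp_fake), generator_loss(sharp_fake))
print(discriminator_loss(fooled_real, fooled_fake), generator_loss(fooled_fake))
```

The two losses pull in opposite directions: outputs that lower the discriminator loss raise the generator loss, which is why the networks are trained in alternating cycles.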
Transformer-based models

Transformer-based models are most often used to study data with some sequential structure (such as the sequence of words in a sentence). Transformer-based methods are now a common modern tool for modeling natural language.

Architectural Patterns

The main types of deep learning architectures used for music generation (as well as for other purposes) are:

  • feedforward,
  • autoencoder,
  • restricted Boltzmann machine (RBM),
  • recurrent (RNN).

Architectural patterns which could be applied to them are:

  • convolutional,
  • conditioning,
  • adversarial.

Tools

There are several essential tools in the Python kit for Deep Learning.

Librosa

Librosa is used to analyse and manipulate audio.

Tensorflow

TensorFlow is used to train models.

Keras

Keras is a high-level library for TensorFlow.

SkLearn

SkLearn (scikit-learn) is a statistical and machine learning library with examples and tutorials.

Generating Sound with Neural Networks

Defining the sound generation task
Classification of sound generation systems
WaveNet
Jukebox
Types of generated sounds
Sound representations
Generation from raw audio
Challenges of raw audio generation

long-range dependencies

  • Pitch
  • Melody
  • Structure
  • Timbre
  • Rhythm
  • Harmony
Generation from spectrograms

A spectrogram is more compact than the numeric representation of raw audio.

Short-time Fourier transform -> model -> spectrogram -> inverse short-time Fourier transform -> audio
Advantages of generation from spectrograms
  • Temporal axis of spectrogram is more compact than that of waveform
  • Capture longer time dependencies
  • Computationally lighter than raw audio
Challenges of generation from spectrograms
  • Audio Fidelity
  • Phase reconstruction
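To show what the short-time Fourier transform step produces, and why phase reconstruction is listed as a challenge, here is a dependency-free sketch with a naive DFT. The frame size, hop length, Hann window and test tone are assumptions for this example; in practice a library routine (for instance librosa's `stft`) would be used instead of this slow loop.

```python
import cmath
import math


def hann(size):
    """Periodic Hann window, which tapers each frame towards zero at its edges."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / size) for i in range(size)]


def stft(signal, frame_size=32, hop=16):
    """Return one complex spectrum (non-negative frequency bins) per overlapping frame."""
    window = hann(frame_size)
    frames = []
    for start in range(0, len(signal) - frame_size + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_size)]
        spectrum = [
            sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                for n in range(frame_size))
            for k in range(frame_size // 2 + 1)
        ]
        frames.append(spectrum)
    return frames


# Test tone whose frequency falls exactly on bin 4 of a 32-sample frame.
signal = [math.sin(2 * math.pi * 4 * n / 32) for n in range(128)]

complex_spec = stft(signal)
# A magnitude spectrogram keeps abs() of each bin and throws the phase away.
magnitude = [[abs(c) for c in frame] for frame in complex_spec]
peak_bins = [max(range(len(frame)), key=lambda k: frame[k]) for frame in magnitude]
print(len(signal), "samples ->", len(magnitude), "frames of", len(magnitude[0]), "bins")
print(peak_bins)
```

The time axis shrinks from 128 samples to 7 frames, which is the compactness advantage listed above. Because only magnitudes are kept, the phase needed by the inverse transform has to be estimated separately when turning a generated spectrogram back into audio.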

We cannot generate sound with Mel Frequency Cepstral Coefficients or MFCCs

Deep Learning architectures for sound generation
  • GAN
  • Autoencoder

The encoder compresses the original data down to a bottleneck, or latent space, and the decoder decompresses it back to the original domain (dimensions). The network is trained with backpropagation to minimise the reconstruction error E(x,x1), the difference between the original data x and the reconstructed data x1. A root mean square error function is used for the reconstruction loss, together with regularization so that the model does not overfit.
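Written out, the objective described above might take the following form. This is a sketch under assumptions: `f` and `g` stand for the encoder and decoder, `n` is the number of values in one input, and the regularization term `R` with weight `\lambda` is a generic placeholder, since the text does not say which regularizer is meant.

```latex
x_1 = g\bigl(f(x)\bigr), \qquad
E(x, x_1) = \sqrt{\frac{1}{n}\sum_{j=1}^{n}\bigl(x_j - x_{1,j}\bigr)^2}, \qquad
\mathcal{L} = E(x, x_1) + \lambda\, R(\theta)
```

Here `\theta` collects the weights of both networks, and training adjusts it to make `\mathcal{L}` as small as possible over the training set.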

from tensorflow.keras import Model
from tensorflow.keras.layers import Input, Conv2D, ReLU, BatchNormalization, \
   Flatten, Dense
from tensorflow.keras import backend as K
AE.png
  • Variational Autoencoder (VAE)
  • VQ-VAE
Inputs for generation
  • Conditioning: uses conditions to create the output, e.g. create a vocal like x in the style of y.
  • Autonomous: free, unconstrained generation.
  • Continuation: start a sequence and have the model continue it.

Tutorials

The Sound of AI

Valerio Velardo - The Sound of AI

Links

Audio Handling Basics: Process Audio Files In Command-Line or Python

https://benhayes.net/projects/nws/#audio-examples

Do Androids Dream of Electric Beats?
https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53
Intro to Audio Analysis: Recognizing Sounds Using Machine Learning
https://magenta.tensorflow.org/music-vae
https://musicalmetacreation.org/mume2018/proceedings/Sturm.pdf
https://ccrma.stanford.edu/~blackrse/algorithm.html
https://mirg.city.ac.uk/codeapps/the-magnatagatune-dataset
https://docs.microsoft.com/en-us/cognitive-toolkit/
https://scikit-learn.org/stable/
https://pythonrepo.com/repo/nerdyrodent-VQGAN-CLIP-python-deep-learning

Terminology

Deep learning is a class of machine learning algorithms that uses multiple layers to progressively extract higher-level features from the raw input.

Learning can be supervised, semi-supervised or unsupervised.

Bag of words: A technique used to extract features from the text. It counts how many times a word appears in a document (corpus), and then transforms that information into a dataset.
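A bag-of-words count can be sketched with the standard library alone. The two example documents and the tiny stop-word set are invented for illustration (stop words are defined further down this list); real tools ship much longer lists and smarter tokenisation.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop-word list.
STOP_WORDS = {"the", "a", "of", "and"}


def bag_of_words(documents):
    """Return the vocabulary and one row of word counts per document."""
    tokenised = [
        [word for word in doc.lower().split() if word not in STOP_WORDS]
        for doc in documents
    ]
    vocabulary = sorted({word for doc in tokenised for word in doc})
    counts = [Counter(doc) for doc in tokenised]
    # A Counter returns 0 for words a document does not contain.
    matrix = [[count[word] for word in vocabulary] for count in counts]
    return vocabulary, matrix


docs = ["the sound of metal", "metal and more metal"]
vocabulary, matrix = bag_of_words(docs)
print(vocabulary)
print(matrix)
```

Each row of `matrix` is a numeric description of one document, which is also a simple case of data vectorization.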

A categorical label has a discrete set of possible values, such as "is a cat" and "is not a cat."

Clustering: An unsupervised learning task that helps to determine if there are any naturally occurring groupings in the data.

CNN: Convolutional Neural Networks (CNN) represent nested filters over grid-organized data. They are by far the most commonly used type of model when processing images.

A continuous (regression) label does not have a discrete set of possible values, so it can take a potentially unlimited number of values.

Data vectorization: A process that converts non-numeric data into a numerical format so that it can be used by a machine learning model.

Discrete: A term taken from statistics referring to an outcome taking on only a finite number of values (such as days of the week).

FFNN: The most straightforward way of structuring a neural network, the Feed Forward Neural Network (FFNN) structures neurons in a series of layers, with each neuron in a layer containing weights to all neurons in the previous layer.

Hyperparameters are settings on the model which are not changed during training but can affect how quickly or how reliably the model trains, such as the number of clusters the model should identify.

Log loss is used to calculate how uncertain your model is about the predictions it is generating.

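Hyperplane: A mathematical term for a flat surface with one dimension fewer than the space that contains it (a line within a plane, a plane within three-dimensional space), generalised to any number of dimensions.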

Impute is a common term referring to different statistical tools which can be used to calculate missing values from your dataset.

Label refers to data that already contains the solution.

A loss function is used to codify the model’s distance from its goal.

Machine learning, or ML, is a modern software development technique that enables computers to solve problems by using examples of real-world data.

Model accuracy is the fraction of predictions a model gets right.

Continuous: Floating-point values with an infinite range of possible values. The opposite of categorical or discrete values, which take on a limited number of possible values.

Model inference is when the trained model is used to generate predictions.

A model is an extremely generic program, made specific by the data used to train it.

Model parameters are settings or configurations the training algorithm can update to change how the model behaves.

Model training algorithms work through an iterative process where the current model iteration is analyzed to determine what changes can be made to get closer to the goal. Those changes are made and the iteration continues until the model is evaluated to meet the goals.

Neural networks: a collection of very simple models connected together. These simple models are called neurons. The connections between these models are trainable model parameters called weights.

Outliers are data points that are significantly different from others in the same sample.

Plane: A mathematical term for a flat surface (like a piece of paper) on which two points can be joined by a straight line.

Regression: A common task in supervised machine learning, in which the model predicts a continuous value.

In reinforcement learning, the algorithm figures out which actions to take in a situation to maximize a reward (in the form of a number) on the way to reaching a specific goal.

RNN/LSTM: Recurrent Neural Networks (RNN) and the related Long Short-Term Memory (LSTM) model types are structured to effectively represent for loops in traditional computing, collecting state while iterating over some object. They can be used for processing sequences of data.

Silhouette coefficient: A score from -1 to 1 describing the clusters found during modeling. A score near zero indicates overlapping clusters, and scores less than zero indicate data points assigned to incorrect clusters. A score near 1 indicates well-separated clusters.

Stop words: A list of words removed by natural language processing tools when building your dataset. There is no single universal list of stop words used by all natural language processing tools.

In supervised learning, every training sample from the dataset has a corresponding label or output value associated with it. As a result, the algorithm learns to predict labels or output values.

Test dataset: The data withheld from the model during training, which is used to test how well your model will generalize to new data.

Training dataset: The data on which the model will be trained. Most of your data will be here.

Transformer: A more modern replacement for RNN/LSTMs, the transformer architecture enables training over larger datasets involving sequences of data.

In unlabeled data, you don't need to provide the model with any kind of label or solution while the model is being trained.

In unsupervised learning, there are no labels for the training data. A machine learning algorithm tries to learn the underlying patterns or distributions that govern the data.

 Machine learning is synthesizing death metal. It might make your death metal radio DJ nervous – but it could also mean music software works with timbre and time in new ways. That news – plus some comical abuse of neural networks for writing genre-specific lyrics in genres like country – next.