Victoria Ebert and Dr. Patrick Donnelly, Department of Electrical Engineering and Computer Science, Oregon State University, Corvallis, OR 97331
Audio transcription is a music information retrieval task that converts an audio recording of a musical performance into symbolic notation. In this project, two separate autoencoders are built to investigate deep representations of music. Autoencoders learn to reconstruct their inputs, and the quality of those reconstructions can help identify the model structures that best represent the problem. The first autoencoder learns waveforms from audio files and the second learns musical scores represented as symbolic MIDI files. The audio and MIDI sequences are used to train the two respective autoencoders, built with the TensorFlow Keras library. Both autoencoders are trained on the MAESTRO dataset, which aligns audio and MIDI files to within 10 milliseconds and was collected from solo piano performances on a Yamaha Disklavier piano at the International Piano E-Competition. Prior to training the neural networks, the dataset is preprocessed: the audio is downsampled to 16,000 samples per second, and the MIDI is converted into a time-series representation with a step size of 64 milliseconds (1,024 samples). This project is part of a larger research effort to explore music transcription using deep learning. The two autoencoders will be combined using transfer learning, merging the trained encoder layers of the audio model with the trained decoder layers of the score model. Starting from this combination of trained autoencoders, a neural network will be trained to predict MIDI from an audio file. By joining the halves of pretrained autoencoders, we expect faster convergence when training deep neural networks to convert audio signals into symbolic notation.
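
As a rough illustration of the preprocessing step, the sketch below downsamples audio to 16,000 samples per second and frames both the waveform and the MIDI at the 64 millisecond (1,024 sample) step size described above. The file paths, the librosa and pretty_midi libraries, and the binary piano-roll encoding are assumptions for illustration, not the project's exact pipeline.

    # Hypothetical preprocessing sketch; library choices and the
    # piano-roll encoding are assumptions, not the authors' pipeline.
    import librosa
    import numpy as np
    import pretty_midi

    SR = 16000                            # target sample rate
    STEP_MS = 64                          # time-series step size
    STEP_SAMPLES = SR * STEP_MS // 1000   # 1024 samples per step

    def load_audio_frames(wav_path):
        # Downsample to 16 kHz and split into 64 ms (1024-sample) frames.
        audio, _ = librosa.load(wav_path, sr=SR, mono=True)
        n_frames = len(audio) // STEP_SAMPLES
        return audio[:n_frames * STEP_SAMPLES].reshape(n_frames, STEP_SAMPLES)

    def load_midi_frames(midi_path):
        # Sample the score every 64 ms into a binary (n_frames, 128) array.
        midi = pretty_midi.PrettyMIDI(midi_path)
        fs = 1000.0 / STEP_MS                  # 15.625 frames per second
        roll = midi.get_piano_roll(fs=fs)      # (128 pitches, n_frames)
        return (roll.T > 0).astype(np.float32)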
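
A minimal Keras sketch of the planned transfer-learning combination follows. The dense layer sizes, latent width, and layer names are invented placeholders; only the overall pattern, a trained audio encoder feeding a trained score decoder that is then fine-tuned end to end, reflects the approach described in the abstract.

    # Minimal sketch of the encoder/decoder merge; layer sizes and names
    # are placeholders, not the project's actual architectures.
    from tensorflow.keras import layers, models

    FRAME = 1024    # one 64 ms audio frame
    PITCHES = 128   # MIDI pitch dimension

    # Audio autoencoder: reconstructs 1024-sample waveform frames.
    audio_in = layers.Input(shape=(FRAME,))
    code = layers.Dense(128, activation="relu", name="audio_encoder")(audio_in)
    audio_ae = models.Model(
        audio_in, layers.Dense(FRAME, name="audio_decoder")(code))

    # Score autoencoder: reconstructs 128-pitch MIDI frames.
    midi_in = layers.Input(shape=(PITCHES,))
    z = layers.Dense(128, activation="relu", name="midi_encoder")(midi_in)
    midi_ae = models.Model(
        midi_in, layers.Dense(PITCHES, activation="sigmoid",
                              name="midi_decoder")(z))

    # ...train audio_ae on audio frames and midi_ae on MIDI frames here...

    # Merge: the trained audio encoder feeds the trained MIDI decoder, and
    # the combined model is fine-tuned on aligned audio/MIDI pairs.
    latent = audio_ae.get_layer("audio_encoder")(audio_in)
    transcriber = models.Model(
        audio_in, midi_ae.get_layer("midi_decoder")(latent))
    transcriber.compile(optimizer="adam", loss="binary_crossentropy")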
Presenter: Victoria Ebert
Institution: Oregon State University
Type: Oral
Subject: Computer Science
Status: Approved