Machine Learning and Forced Alignment Applied to Audiobook Editing

Micah Tietz, Rebecca Bates, Computer Science, Minnesota State University, Mankato, 218 Wissink Hall, Mankato, MN 56001

Machine learning (ML) techniques can save time and effort in many fields. One promising direction is the development of powerful, versatile tools that support human creativity, for example by detecting errors, disfluencies, and noise in voice performances such as audiobooks. Creating an audiobook is a lengthy process that involves not only reading the text but also careful editing and review to correct any errors made during recording. These revisions often take between three and six hours per hour of finished audio, depending on the number of mistakes, the complexity of the production, and the skill of the editor. Many automated techniques already help with this process, such as audio-signal-based removal and normalization of mouth clicks and electronic noise. Speech errors like disfluencies and mispronunciations are more difficult to detect and repair because they are defined relative to the intended words rather than by the audio that was produced. ML has the potential to fill this gap and speed up the editing process. This project combines several techniques to that end. The first is forced alignment, in which speech recognition is used together with a text transcript to segment the audio stream and label it with individual words. Once forced alignment has been performed, pre- and post-edited versions of a spoken audio recording are used as training data for a neural-network-based system that detects where changes have been made and what types of changes they were. The trained system is then used to predict likely points for editing. Future work involves automatic correction of the audio file at the identified points. The tools used for this project are the Kaldi open-source speech recognition library, TensorFlow, and Brainome.
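
The step of turning pre- and post-edited recordings into training data can be illustrated with a small sketch. It is not the authors' implementation. It assumes that forced alignment has already produced a list of words with start and end times for each recording, and it uses Python's standard difflib module to mark which words of the raw recording no longer appear in the edited one. The names AlignedWord and label_edits, and the sample words and timings, are hypothetical and chosen only for illustration.

```python
from difflib import SequenceMatcher
from typing import List, NamedTuple


class AlignedWord(NamedTuple):
    """One word from a forced alignment (hypothetical format)."""
    word: str
    start: float  # seconds
    end: float    # seconds


def label_edits(raw: List[AlignedWord], edited: List[AlignedWord]) -> List[int]:
    """Return one label per raw word: 1 if it was changed or removed, else 0."""
    raw_words = [w.word.lower() for w in raw]
    edited_words = [w.word.lower() for w in edited]
    matcher = SequenceMatcher(a=raw_words, b=edited_words, autojunk=False)

    # Start by assuming every raw word was edited, then clear the label
    # for each word that falls inside a block shared with the edited version.
    labels = [1] * len(raw)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            labels[i] = 0
    return labels


# Illustrative data only: the raw take contains a filler word that was cut.
raw = [
    AlignedWord("the", 0.0, 0.2),
    AlignedWord("quick", 0.2, 0.5),
    AlignedWord("um", 0.5, 0.8),
    AlignedWord("brown", 0.8, 1.1),
    AlignedWord("fox", 1.1, 1.4),
]
edited = [
    AlignedWord("the", 0.0, 0.2),
    AlignedWord("quick", 0.2, 0.5),
    AlignedWord("brown", 0.5, 0.8),
    AlignedWord("fox", 0.8, 1.1),
]

labels = label_edits(raw, edited)
# Time spans in the raw audio that the editor changed.
edit_spans = [(w.start, w.end) for w, flag in zip(raw, labels) if flag]
```

In this sketch the labels mark only the filler word, so edit_spans holds the single span from 0.5 to 0.8 seconds. Per-word labels of this kind are one possible target for a classifier. Distinguishing the type of each change, as the abstract describes, would require additional information that this sketch does not model.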

Additional Abstract Information

Presenter: Micah Tietz

Institution: Minnesota State University, Mankato

Type: Oral

Subject: Computer Science

Status: Approved

Time and Location

Session: Oral 7
Date/Time: Tue 3:30pm-4:30pm
Session Number: 714