Automatic Music Transcription in The Wild using Unaligned Supervision

We have developed a novel system that enables automatic transcription of real musical performances. The system detects with unmatched accuracy which notes were played, at what time, and by which instrument; in other words, it parses the musical content of a recording with high precision in both timing and instrumentation. Given an audio recording, the system outputs a corresponding MIDI performance, including instrument information.

The system's concept is described in Figure 1 and Figure 2 below:

Figure 1: Technical system concept and flow. We take an EM approach: we first train an initial system on synthetic data. We then continue training on real, unannotated data by aligning the current system's predictions on real performances with MIDI performances of the same pieces (the MIDI performances come from other, unrelated performers and are unaligned with the real performances). We train on the resulting generated labels, with pitch-shift augmentation/consistency. Once the network has improved, we repeat the labelling process and retrain.
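The labelling step above can be illustrated with a toy sketch. The code below is hypothetical and greatly simplified (the names `dtw_path` and `align_labels` are illustrative, and the real system aligns richer note/onset representations, not bare pitch lists): it aligns the current network's predicted pitch sequence to an unaligned MIDI performance via dynamic time warping, then transfers the MIDI onset times onto the predicted frames as pseudo-labels for the next training round.

```python
# Hypothetical sketch of the EM-style labelling loop, assuming toy
# per-frame pitch predictions; not the actual system's implementation.
import numpy as np

def dtw_path(cost):
    """Return the minimal-cost monotonic alignment path through a cost matrix."""
    n, m = cost.shape
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1])
    # Backtrack from the end to recover the alignment.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def align_labels(predicted_pitches, midi_pitches, midi_onsets):
    """Map unaligned MIDI onsets onto the predicted time axis,
    producing pseudo-labels for the next training round."""
    cost = np.abs(np.subtract.outer(predicted_pitches, midi_pitches)).astype(float)
    pseudo = {}
    for pred_idx, midi_idx in dtw_path(cost):
        pseudo.setdefault(pred_idx, midi_onsets[midi_idx])
    return pseudo

# Toy example: noisy predictions on a real recording vs. a MIDI
# performance of the same piece by a different, unrelated performer.
predicted = np.array([60, 62, 64, 65, 67])       # predicted pitch per frame
midi      = np.array([60, 62, 64, 65, 67])       # unaligned reference MIDI
onsets    = np.array([0.0, 0.5, 1.0, 1.5, 2.0])  # MIDI note onset times (s)
labels = align_labels(predicted, midi, onsets)
print(labels)  # each predicted frame now carries an onset pseudo-label
```

After retraining on such pseudo-labels (with pitch-shift augmentation), the improved network produces better predictions, the alignment improves in turn, and the labelling-retraining cycle repeats.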

Figure 2: System concept through musicians' eyes. We leverage existing recordings together with existing MIDI performances (from other, unrelated performers, unaligned with the real performances) to train a transcriber that can transcribe new, unseen music.

The system generalizes well to real-world scenarios thanks to easy, on-demand data collection; hence, high performance is not limited to a specific dataset (see Table 1 below). As we show in our paper, the simplicity of data gathering allows the system to compete with, and even surpass, supervised methods (see Table 1), and enables greater variety in instruments and recording environments.

Table 1. Comparison of note-level accuracy to existing methods on various benchmarks. As can be seen, we outperform all weakly- or self-supervised methods by large margins. We compete with supervised methods, and even surpass them on the MAPS dataset. Our advantage over supervised methods lies in cross-dataset evaluation: Gardner et al. reach 96.0 and 90.0 note-level F1 on MAESTRO and GuitarSet, but only when including them in the train set. When excluding them (zero-shot, ZS), accuracy drops significantly to 28.0 and 32.0. We do not include MAESTRO or GuitarSet in our training set and therefore outperform them by large margins on the cross-dataset task. Using our approach, we automatically generate new labels for the MusicNet dataset, which we show to be far more accurate than the original labels – see MusicNet (trained with the original labels, row 9) vs. MusicNetEM (trained with our labels, row 10). We also demonstrate the simplicity of data collection by training on self-collected data, achieving similar accuracy (row 11).

Applications

• Musicology and musical education – the system can assist in learning to play an existing musical piece, or in learning music theory, by providing the notes and chord progressions. It can also provide feedback and corrections to a student while playing.
• Transcribing improvisations or musical ideas for professional or semi-professional musicians. A musician can hold an improvisation session, and the system will transcribe its content. The musician can later use the transcription themselves or share it with others.
• Synthesizing hyper-realistic musical performances and automatic composition: the transcription system can be used to generate massive amounts of data to train generative models, for example for music synthesis or automatic composition.

Ready for Commercialization

US Provisional patent application

Ben Maman and Amit Bermano, “Unaligned Supervision for Automatic Music Transcription in The Wild”.


