Moonbeam: A MIDI Foundation Model Using both Absolute and Relative Music Attributes

Authors: Zixun (Nicolas) Guo and Simon Dixon


Moonbeam is a MIDI foundation model pretrained on 81.6K hours of symbolic MIDI data. It has proven effective across various downstream tasks, including music representation learning, conditional music generation, and music infilling.

Paper | Code Release | Model Release
Correspondence to: zixun.guo@qmul.ac.uk
Figure: Moonbeam architecture. The model autoregressively predicts the next MIDI event.

Pretrained Model Summary

Moonbeam is a transformer-based foundation model for symbolic music, pretrained on a large and diverse collection of MIDI data totaling 81.6K hours and 18 billion tokens. Unlike most existing transformer-based models for symbolic music, Moonbeam incorporates music-domain inductive biases by capturing both absolute and relative musical attributes, through a novel domain-knowledge-inspired tokenization method and a novel attention mechanism.
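To make the distinction between absolute and relative attributes concrete, here is a minimal sketch: each note carries absolute attributes (onset, duration, pitch, velocity), while relative attributes (inter-onset interval, pitch interval) are derived between consecutive notes. This is only an illustration of the two attribute families; Moonbeam's actual tokenizer and the way relative information enters its attention mechanism differ in detail and are described in the paper.

from dataclasses import dataclass
from typing import List, Tuple

# Illustrative note-level token holding absolute attributes only.
# Moonbeam's tokenizer may use different fields and quantization.
@dataclass
class NoteToken:
    onset: int      # onset time (e.g. in ticks)
    duration: int   # note duration
    pitch: int      # MIDI pitch (0-127)
    velocity: int   # MIDI velocity (0-127)

def relative_attributes(notes: List[NoteToken]) -> List[Tuple[int, int]]:
    """Derive relative attributes between consecutive notes:
    inter-onset interval (time shift) and pitch interval."""
    rel = [(0, 0)]  # the first note has no predecessor
    for prev, cur in zip(notes, notes[1:]):
        rel.append((cur.onset - prev.onset, cur.pitch - prev.pitch))
    return rel

if __name__ == "__main__":
    melody = [
        NoteToken(onset=0,   duration=240, pitch=60, velocity=80),
        NoteToken(onset=240, duration=240, pitch=64, velocity=78),
        NoteToken(onset=480, duration=480, pitch=67, velocity=82),
    ]
    print(relative_attributes(melody))  # [(0, 0), (240, 4), (240, 3)]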

Downstream Task: Unconditional Prompt Continuation

Leveraging Moonbeam's generative capability, we finetuned it on the following datasets: ATEPP (classical piano), GAPS (classical guitar), and Pijama (jazz piano). The resulting model generates MIDI continuations of an input prompt without any additional conditioning. The MIDI samples below are cherry-picked and synthesized using Heavyocity.
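Prompt continuation is a standard autoregressive sampling loop. The sketch below uses a hypothetical model and tokenizer interface (a single vocabulary of scalar token ids) rather than the released code's actual API, which operates on MIDI events.

import torch

def continue_prompt(model, prompt_ids, max_new_tokens=512,
                    temperature=1.0, top_p=0.95):
    """Hypothetical continuation loop with nucleus (top-p) sampling.
    `model` is assumed to map a (1, T) token tensor to (1, T, vocab) logits;
    Moonbeam's actual interface (e.g. compound tokens per event) may differ."""
    ids = prompt_ids.clone()
    for _ in range(max_new_tokens):
        logits = model(ids)[:, -1, :] / temperature
        probs = torch.softmax(logits, dim=-1)
        # keep the smallest set of tokens whose cumulative probability >= top_p
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        cumulative = torch.cumsum(sorted_probs, dim=-1)
        keep = (cumulative - sorted_probs) < top_p
        sorted_probs = sorted_probs * keep.float()
        sorted_probs = sorted_probs / sorted_probs.sum(dim=-1, keepdim=True)
        next_sorted = torch.multinomial(sorted_probs, num_samples=1)
        next_id = sorted_idx.gather(-1, next_sorted)
        ids = torch.cat([ids, next_id], dim=-1)
    return ids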

[Audio examples: prompt and generation pairs (synthesized from MIDI) for the ATEPP, GAPS, and Pijama fine-tuning datasets.]

Downstream Task: Conditional Music Generation and Music Infilling

Conditional music generation is a task in which a model learns to compose music based on specified input conditions. One sub-task within this broader definition is music infilling, where the model learns to generate the missing parts or "fills", given an incomplete musical composition.
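One common way to cast infilling as left-to-right generation is to move the missing span to the end of the sequence, delimited by special tokens, so the model conditions on the surrounding context before generating the fill. The sketch below is a generic illustration of this idea; the special tokens and ordering are hypothetical and not necessarily how the paper's infilling finetuning is set up.

# Generic "fill-in-the-middle" style rearrangement for infilling.
PREFIX, SUFFIX, FILL, END = "<prefix>", "<suffix>", "<fill>", "<end>"

def make_infilling_example(tokens, span_start, span_end):
    """Turn a complete token sequence into (input, target), where the
    masked span [span_start, span_end) is generated after the context."""
    prefix = tokens[:span_start]
    middle = tokens[span_start:span_end]
    suffix = tokens[span_end:]
    model_input = [PREFIX, *prefix, SUFFIX, *suffix, FILL]
    target = [*middle, END]
    return model_input, target

if __name__ == "__main__":
    seq = ["n1", "n2", "n3", "n4", "n5", "n6"]
    x, y = make_infilling_example(seq, 2, 4)
    print(x)  # ['<prefix>', 'n1', 'n2', '<suffix>', 'n5', 'n6', '<fill>']
    print(y)  # ['n3', 'n4', '<end>']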

Moonbeam can be finetuned as a conditional music generator with music infilling capabilities. We finetuned our model on the CoMMU dataset using the finetuning architecture proposed in the paper. The finetuned model can generate music conditioned on chord and metadata controls.
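As a rough illustration of what "chord and metadata controls" means, one simple conditioning scheme is to prepend the controls as tokens before the music tokens. The token format and field names below are hypothetical, and this is not the finetuning architecture proposed in the paper; it only shows the kind of conditioning signal present in CoMMU-style data.

from typing import Dict, List

def build_conditioned_prompt(metadata: Dict[str, str], chords: List[str]) -> List[str]:
    """Hypothetical prompt layout: metadata tokens, then chord tokens,
    then a start-of-music marker."""
    meta_tokens = [f"<{k}={v}>" for k, v in metadata.items()]
    chord_tokens = [f"<chord:{c}>" for c in chords]
    return meta_tokens + chord_tokens + ["<start>"]

if __name__ == "__main__":
    prompt = build_conditioned_prompt(
        {"bpm": "120", "key": "Cmaj"},   # hypothetical metadata fields
        ["C", "G", "Am", "F"],
    )
    print(prompt)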

Below we show 50 samples generated by our model and by a transformer baseline with the REMI-like tokenizer proposed in the CoMMU paper. For reference, we also provide music composed by professional composers under the same conditions. The samples are randomly selected, without cherry-picking.

Downstream Task: Symbolic Music Understanding

Moonbeam can also be finetuned for music representation learning. We demonstrate this by finetuning Moonbeam on three music understanding tasks (player, composer, and emotion classification) across four datasets. In most cases, our model outperforms other large-scale pretrained music models in terms of accuracy and macro F1 score.

Model         | Pijama30        | Pianist8        | Emopia          | GPM30
              | Acc    F1 (macro) | Acc    F1 (macro) | Acc    F1 (macro) | Acc    F1 (macro)
CLaMP 2       | 0.44   0.219    | 0.892  0.891    | 0.659  0.652    | 0.644  0.549
M3            | 0.262  0.077    | 0.784  0.786    | 0.715  0.688    | 0.385  0.16
MusicBERT     | 0.55   0.452    | 0.811  0.806    | 0.682  0.674    | 0.63   0.575
Moonbeam (S)  | 0.649  0.596    | 0.811  0.804    | 0.636  0.623    | 0.541  0.470
Moonbeam (M)  | 0.679  0.638    | 0.946  0.947    | 0.693  0.682    | 0.648  0.635

Table: Downstream music classification results (accuracy and macro F1).
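For these understanding tasks, a common finetuning recipe is to pool the backbone's hidden states into a sequence-level embedding and train a small classification head on top. The sketch below assumes a generic backbone that returns hidden states of shape (batch, time, dim); it is not the exact head or pooling used in the paper.

import torch
import torch.nn as nn

class SequenceClassifier(nn.Module):
    """Hypothetical classification head over a pretrained backbone.
    `backbone(ids)` is assumed to return hidden states of shape (B, T, D);
    the actual Moonbeam finetuning setup is described in the paper."""
    def __init__(self, backbone, hidden_dim, num_classes, freeze_backbone=False):
        super().__init__()
        self.backbone = backbone
        if freeze_backbone:
            for p in self.backbone.parameters():
                p.requires_grad = False
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, ids, attention_mask=None):
        hidden = self.backbone(ids)                          # (B, T, D)
        if attention_mask is not None:
            mask = attention_mask.unsqueeze(-1).float()
            pooled = (hidden * mask).sum(1) / mask.sum(1).clamp(min=1e-6)
        else:
            pooled = hidden.mean(dim=1)                      # mean-pool over time
        return self.head(pooled)                             # (B, num_classes) logits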

Releasing Moonbeam

The paper, code, and pretrained weights are released under the Apache License, Version 2.0.

Fun Fact: Where Does the Name "Moonbeam" Come From?

The first author was traveling in Chiang Mai, Thailand, while searching for a name for his model. He was listening to "Polka Dots and Moonbeams", a jazz standard performed by Chet Atkins and Lenny Breau. Inspired, he then decided to name the model "Moonbeam".

Bibtex

@article{guo2025moonbeam,
  title={Moonbeam: A MIDI Foundation Model Using both
    Absolute and Relative Music Attributes},
  author={Guo, Zixun and Dixon, Simon},
  journal={arXiv preprint arXiv:2505.08620},
  year={2025}
}
Website design modified from the Anticipation Music Transformer blog post.