
Stable Audio 3.0 is a model family trained on fully licensed data, designed to be the foundation for what the audio community builds next.
Stable Audio 3.0 is a text-to-audio diffusion model from Stability AI. You describe what you want — a genre, mood, instrumentation, tempo, or sound effect — and the model generates audio that matches your prompt.
Stable Audio 3 is a family of fast latent diffusion models (small, medium, and large) for variable-length audio generation and editing.
Our models can create several minutes of audio. Because generation length is variable, a short clip does not incur the compute cost of a full-length generation. We also support inpainting, which enables targeted edits to existing audio and the continuation of short recordings.
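To make the inpainting idea concrete, here is a minimal sketch of how an edit region could be expressed as a per-frame mask. It does not use the Stable Audio API: the function name `build_inpaint_mask` and the frame rate of 25 frames per second are hypothetical, chosen only for illustration. A targeted edit marks a window inside the clip for regeneration, while a continuation keeps the original recording and marks everything after it.

```python
from typing import List


def build_inpaint_mask(
    total_seconds: float,
    edit_start: float,
    edit_end: float,
    frames_per_second: float = 25.0,  # hypothetical frame rate, not a published value
) -> List[bool]:
    """Return one flag per frame: True = regenerate, False = keep the original."""
    if not 0.0 <= edit_start < edit_end <= total_seconds:
        raise ValueError("edit region must lie inside the clip")
    n_frames = round(total_seconds * frames_per_second)
    first = round(edit_start * frames_per_second)
    last = round(edit_end * frames_per_second)
    return [first <= i < last for i in range(n_frames)]


# Targeted edit: regenerate seconds 2 to 3 of a 4-second clip.
edit_mask = build_inpaint_mask(4.0, 2.0, 3.0)

# Continuation: keep a 4-second recording and generate 6 more seconds after it.
continuation_mask = build_inpaint_mask(10.0, 4.0, 10.0)
```

In a real pipeline the masked frames would be filled in by the diffusion model while the unmasked frames stay fixed to the source audio; the mask only describes where the model is allowed to write.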
Our latent diffusion models use a new semantic-acoustic autoencoder. This projects audio into a compact latent space, allowing efficient diffusion-based generation while maintaining audio quality and encouraging semantic structure.
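As a rough illustration of why a compact latent space and variable-length generation reduce cost, the sketch below counts how many latent frames the diffusion model would have to process for clips of different lengths. The sample rate (44,100 Hz) and the downsampling factor (2,048 samples per latent frame) are assumptions picked for the example, not figures published for Stable Audio 3, and `latent_frames` is a hypothetical helper.

```python
def latent_frames(seconds: float, sample_rate: int = 44_100, downsample: int = 2_048) -> int:
    """Number of latent frames needed to cover a clip.

    sample_rate and downsample are illustrative assumptions, not model specs.
    """
    samples = round(seconds * sample_rate)
    # Ceiling division: a partially filled last frame still needs a latent slot.
    return -(-samples // downsample)


short_clip = latent_frames(10)    # a 10-second sound effect
full_track = latent_frames(180)   # a 3-minute piece of music

# Fraction of the full-length sequence that the short clip actually needs.
relative_length = short_clip / full_track
```

Under these assumptions the short clip occupies only a small fraction of the sequence length of the full track, which is where the savings from variable-length generation would come from.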
We perform adversarial post-training to speed up inference and enhance generation quality. This reduces the number of inference steps while improving fidelity and prompt adherence.
Stable Audio 3 models are trained on licensed and Creative Commons data. They generate music and sounds in under 2 seconds on an H200 GPU and in just a few seconds on a MacBook Pro M4. We release the weights for small and medium models, which work on consumer-grade hardware, along with their training and inference pipeline.
Generations can run for several minutes, long enough for a complete song structure with intro, verse, chorus, and outro.
The model is not limited to music; it also handles ambient audio, UI sounds, and cinematic effects.
Researchers, developers, and builders can download and run the model locally or integrate it into their own tools.