Stable Audio 3.0 AI Model

Stable Audio 3.0 is a text-to-audio diffusion model from Stability AI. You describe what you want (a genre, mood, instrumentation, tempo, or sound effect) and the model generates audio that matches your prompt.

Stable Audio 3 Overview

Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing.

Our models can create several minutes of audio. Because generation is variable-length, a short clip does not incur the compute cost of a full-length generation. We also support inpainting, which enables targeted edits to existing audio and the continuation of short recordings.
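
As a rough illustration of how inpainting and continuation can be framed, the sketch below builds a per-frame mask over a latent sequence: frames marked 1 are regenerated and frames marked 0 are kept from the source recording. The function names, the `frames_per_second` latent rate, and the mask convention are assumptions made for this example, not the released Stable Audio 3 API.

```python
import math


def inpaint_mask(total_seconds, edit_start, edit_end, frames_per_second):
    """Return one flag per latent frame: 1 = regenerate, 0 = keep.

    All names and the latent frame rate are hypothetical.
    """
    if not 0 <= edit_start <= edit_end <= total_seconds:
        raise ValueError("need 0 <= edit_start <= edit_end <= total_seconds")
    n_frames = round(total_seconds * frames_per_second)
    # Round outward so the edited region is fully covered by the mask.
    first = math.floor(edit_start * frames_per_second)
    last = min(n_frames, math.ceil(edit_end * frames_per_second))
    return [1 if first <= i < last else 0 for i in range(n_frames)]


def continuation_mask(recorded_seconds, total_seconds, frames_per_second):
    """Continuation is inpainting where everything after the recording is regenerated."""
    return inpaint_mask(total_seconds, recorded_seconds, total_seconds, frames_per_second)


def blend(original, generated, mask):
    """Keep original frames where mask is 0, take generated frames where mask is 1."""
    return [g if m else o for o, g, m in zip(original, generated, mask)]


if __name__ == "__main__":
    mask = inpaint_mask(total_seconds=10.0, edit_start=2.0, edit_end=4.0, frames_per_second=10)
    print(sum(mask), "of", len(mask), "frames would be regenerated")
```

Treating continuation as a special case of the same mask keeps the two editing modes on one code path, which is one plausible way to read the description above.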

Our latent diffusion models use a new semantic-acoustic autoencoder. The autoencoder projects audio into a compact latent space, which makes diffusion-based generation efficient while preserving audio quality and encouraging semantic structure in the latents.
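
To make "compact latent space" concrete, here is a small back-of-the-envelope sketch of how clip duration maps to latent sequence length. The sample rate (44.1 kHz), downsampling factor (2048), channel count, and latent width (64) are placeholder values chosen for illustration; the actual Stable Audio 3 autoencoder settings are not stated here.

```python
def latent_frames(seconds, sample_rate=44_100, downsample=2_048):
    """Number of latent frames for a clip, rounding up to cover the last partial frame.

    sample_rate and downsample are hypothetical defaults.
    """
    samples = round(seconds * sample_rate)
    return -(-samples // downsample)  # integer ceiling division


def compression_ratio(channels=2, downsample=2_048, latent_dim=64):
    """Waveform values consumed per latent frame divided by latent values per frame."""
    return (channels * downsample) / latent_dim


if __name__ == "__main__":
    for seconds in (1, 30, 180):
        print(seconds, "s ->", latent_frames(seconds), "latent frames")
    print("compression ratio:", compression_ratio())
```

Under these assumed numbers the diffusion model works on sequences of a few thousand frames for several minutes of audio rather than millions of waveform samples, and a shorter clip yields a proportionally shorter sequence.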

We apply adversarial post-training to speed up inference and improve generation quality: it reduces the number of inference steps required while improving fidelity and prompt adherence.
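
The link between step count and inference speed can be sketched with a generic iterative sampler: each step calls the network once, so the cost grows linearly with the number of steps. The Euler loop and the `toy_velocity` stand-in below are assumptions for illustration only; they are not the Stable Audio 3 sampler, and they do not implement adversarial post-training itself.

```python
def sample(denoiser, x0, steps):
    """Euler integration from t=0 to t=1; returns the final value and the call count."""
    x, calls = x0, 0
    h = 1.0 / steps
    for k in range(steps):
        t = k * h
        x = x + h * denoiser(x, t)  # one network evaluation per step
        calls += 1
    return x, calls


def toy_velocity(target):
    """Hypothetical stand-in for a trained network: points straight at a fixed target."""
    def velocity(x, t):
        return (target - x) / (1.0 - t)
    return velocity


if __name__ == "__main__":
    for steps in (8, 50):
        x, calls = sample(toy_velocity(1.0), x0=0.0, steps=steps)
        print(f"steps={steps} network calls={calls} result={x:.6f}")
```

In this toy setting the path is a straight line, so any step count reaches the target. With a real network, cutting steps usually costs quality, and that is the trade-off the post-training stage described above is meant to improve.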

Stable Audio 3 models are trained on licensed and Creative Commons data. They generate music and sounds in under 2 seconds on an H200 GPU and in just a few seconds on a MacBook Pro M4. We release the weights for the small and medium models, which run on consumer-grade hardware, along with their training and inference pipelines.