
Scenema Audio is an audio diffusion model extracted from LTX 2.3. It generates speech with emotional acting, pacing, breath control, and sound effects from a text prompt. It is not speech synthesis. It is performative audio generation.
Every existing text-to-speech system converts words into sound, but none of them perform. Speech that merely pronounces words correctly is functionally useless for filmmaking, audiobooks, or any context where the emotional delivery carries as much meaning as the words themselves. Scenema Audio generates speech with intention, pacing, breath control, and emotional arcs that shift within a single generation, all from a text prompt that describes not just what to say but how to say it.
The pipeline has four stages: text encoding (Gemma 3 12B, bf16), audio diffusion (8-step denoising), post-processing (vocal isolation, validation, silence trimming), and optional voice identity transfer via SeedVC.
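The data flow through the four stages can be sketched as a small orchestrator. This is only an illustration under stated assumptions: `run_pipeline`, `PipelineTrace`, and the stage callables are hypothetical names, not the project's actual API, and the stages are passed in as plain functions so the sketch runs without any model weights.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PipelineTrace:
    """Records which stages ran, in order (hypothetical helper for illustration)."""
    stages: List[str] = field(default_factory=list)


def run_pipeline(
    prompt: str,
    encode: Callable[[str], List[float]],
    denoise_step: Callable[[List[float], List[float], int], List[float]],
    post_process: Callable[[List[float]], List[float]],
    voice_convert: Optional[Callable[[List[float]], List[float]]] = None,
    num_steps: int = 8,
    trace: Optional[PipelineTrace] = None,
) -> List[float]:
    """Run the four stages in order; every callable is a stand-in for a real model."""
    trace = trace if trace is not None else PipelineTrace()

    # Stage 1: text encoding (Gemma 3 12B in the real system).
    cond = encode(prompt)
    trace.stages.append("encode")

    # Stage 2: audio diffusion, a fixed number of denoising steps (8 by default).
    latent = [0.0] * len(cond)
    for step in range(num_steps):
        latent = denoise_step(latent, cond, step)
        trace.stages.append(f"denoise_{step}")

    # Stage 3: post-processing (vocal isolation, validation, silence trimming).
    audio = post_process(latent)
    trace.stages.append("post_process")

    # Stage 4: optional voice identity transfer (SeedVC in the real system).
    if voice_convert is not None:
        audio = voice_convert(audio)
        trace.stages.append("voice_convert")

    return audio
```

Passing stub functions for each stage is enough to exercise the control flow. Leaving `voice_convert` as `None` skips the fourth stage, which mirrors its optional status in the description above.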
A 20-second clip takes about 5 to 8 seconds end-to-end on an RTX 4090. Minimum hardware is 16 GB VRAM with CPU streaming for the text encoder. The standard all-on-GPU configuration requires 24 GB.
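To put the timing figure in perspective, the quoted range can be converted into a real-time factor (generation time divided by audio duration). The helper name `real_time_factor` is an illustrative assumption; only the 20-second clip length and the 5 to 8 second range come from the text above.

```python
def real_time_factor(generation_seconds: float, clip_seconds: float) -> float:
    """Generation time divided by audio duration; below 1.0 means faster than real time."""
    return generation_seconds / clip_seconds


clip_seconds = 20.0
for generation_seconds in (5.0, 8.0):
    rtf = real_time_factor(generation_seconds, clip_seconds)
    speedup = 1.0 / rtf
    print(f"{generation_seconds:.0f} s for {clip_seconds:.0f} s of audio: "
          f"RTF {rtf:.2f}, {speedup:.1f}x faster than real time")
```

On the quoted figures this works out to an RTF between 0.25 and 0.40, or roughly 2.5x to 4x faster than real time on an RTX 4090.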
Minimum: 16 GB VRAM (RTX 4060 Ti 16GB, RTX A4000)
INT8 audio transformer + NF4 Gemma quantization. Models are automatically swapped between GPU memory and CPU RAM as the pipeline moves from one stage to the next (encode, diffuse, decode, voice convert). Requires 32 GB system RAM. This is the default configuration when started with docker compose up.
Recommended: 24 GB VRAM (RTX 4090, RTX A5000)
Same INT8 + NF4 config with all models resident on GPU simultaneously. No offloading overhead, fastest generation.
Full Precision: 48 GB VRAM (A6000 Ada, A40, L40S)
bf16 audio transformer + bf16 Gemma, all models resident on GPU. Best quality. Set environment variables