Scenema Audio AI Model

Scenema Audio Overview

Every existing text-to-speech system converts words into sound, but none of them perform. Speech that merely pronounces words correctly is functionally useless for filmmaking, audiobooks, or any context where the emotional delivery carries as much meaning as the words themselves. Scenema Audio generates speech with intention, pacing, breath control, and emotional arcs that shift within a single generation, all from a text prompt that describes not just what to say but how to say it.

Scenema Audio Features

Generate: Build prompts from individual fields (voice description, speech text, scene, action tags) with preset examples
Voice Design: Quick 15-second voice previews for iterating on voice descriptions
Voice Cloning: Upload reference audio and generate with voice identity transfer
Advanced: Write raw XML directly for full control
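
To make the relationship between the Generate fields and the raw XML of Advanced mode more concrete, here is a minimal sketch of assembling the four fields into one XML string. The element names (prompt, voice, scene, action, speech) and the build_prompt helper are hypothetical placeholders chosen for illustration, not the model's documented schema; check the repository for the actual format.

```python
import xml.etree.ElementTree as ET


def build_prompt(voice, speech, scene="", actions=()):
    """Assemble prompt fields into a single XML string.

    NOTE: every tag name below is a hypothetical placeholder,
    not the documented Scenema Audio schema.
    """
    root = ET.Element("prompt")
    ET.SubElement(root, "voice").text = voice        # voice description
    if scene:
        ET.SubElement(root, "scene").text = scene    # optional scene context
    for tag in actions:
        ET.SubElement(root, "action").text = tag     # one element per action tag
    ET.SubElement(root, "speech").text = speech      # the words to be spoken
    return ET.tostring(root, encoding="unicode")


xml_prompt = build_prompt(
    voice="Warm, low female voice, unhurried",
    speech="I didn't think you'd come back.",
    scene="Quiet kitchen at night",
    actions=["sighs"],
)
print(xml_prompt)
```

Building the string through an XML library rather than string concatenation means characters such as & and < in the speech text are escaped automatically.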

How it Works

The pipeline has four stages: text encoding (Gemma 3 12B, bf16), audio diffusion (8-step denoising), post-processing (vocal isolation, validation, silence trimming), and optional voice identity transfer via SeedVC.
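
The stage ordering described above can be sketched as a simple sequential runner in which the final stage is skipped unless reference audio is supplied. This is only an illustration of the control flow: the Stage class, run_pipeline function, and the pass-through stage bodies are assumptions made for the sketch and do not reflect the project's actual code.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class Stage:
    name: str
    run: Callable[[Any], Any]
    needs_reference: bool = False  # True only for voice identity transfer


def run_pipeline(
    stages: List[Stage], payload: Any, reference_audio: Optional[bytes] = None
) -> Tuple[Any, List[str]]:
    """Run stages in order; skip reference-dependent ones when no reference is given."""
    executed = []
    for stage in stages:
        if stage.needs_reference and reference_audio is None:
            continue
        payload = stage.run(payload)
        executed.append(stage.name)
    return payload, executed


# Placeholder stage bodies: each one just passes its input through.
STAGES = [
    Stage("text_encoding", lambda x: x),
    Stage("audio_diffusion", lambda x: x),
    Stage("post_processing", lambda x: x),
    Stage("voice_identity_transfer", lambda x: x, needs_reference=True),
]

_, ran = run_pipeline(STAGES, "prompt text")
print(ran)
```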

A 20-second clip takes about 5-8 seconds end-to-end on an RTX 4090. The minimum hardware is 16 GB of VRAM, with the text encoder streamed from CPU. The standard all-on-GPU configuration requires 24 GB.
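
The timing figure above can be restated as a real-time factor, meaning seconds of audio produced per second of wall-clock time. The short calculation below uses only the numbers quoted in this section; the realtime_factor helper is a name introduced here for illustration.

```python
def realtime_factor(clip_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio generated per second of wall-clock time."""
    return clip_seconds / wall_seconds


clip = 20.0            # clip length quoted above, in seconds
fast, slow = 5.0, 8.0  # quoted end-to-end time range on an RTX 4090

print(realtime_factor(clip, slow))  # 2.5
print(realtime_factor(clip, fast))  # 4.0
```

In other words, the quoted range corresponds to generation at roughly 2.5x to 4x real time on that GPU.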

Hardware Requirements

Minimum: 16 GB VRAM (RTX 4060 Ti 16GB, RTX A4000)
Uses an INT8 audio transformer with NF4 Gemma quantization. Models are automatically moved between GPU and CPU RAM as the pipeline passes from one stage to the next (encode, diffuse, decode, voice convert). Requires 32 GB system RAM. This is the default configuration when started via docker compose up.

Recommended: 24 GB VRAM (RTX 4090, RTX A5000)
Same INT8 + NF4 config with all models resident on GPU simultaneously. No offloading overhead, fastest generation.

Full Precision: 48 GB VRAM (A6000 Ada, A40, L40S)
bf16 audio transformer + bf16 Gemma, all models resident on GPU. Best quality. Set environment variables
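
As a rough aid for reading the three tiers above, the sketch below maps available memory to a tier name and estimates the weight footprint of a nominal 12-billion-parameter text encoder at 16-bit and 4-bit precision. The pick_tier and weight_gb helpers are hypothetical names, the thresholds are taken directly from the tiers listed here, and the estimate counts weights only (no activations, quantization metadata, or the other models in the pipeline).

```python
def pick_tier(vram_gb: float, ram_gb: float) -> str:
    """Map available memory to one of the tiers listed in this section."""
    if vram_gb >= 48:
        return "full-precision"   # bf16 everywhere, all models on GPU
    if vram_gb >= 24:
        return "recommended"      # INT8 + NF4, all models on GPU
    if vram_gb >= 16 and ram_gb >= 32:
        return "minimum"          # INT8 + NF4 with GPU/CPU offloading
    return "unsupported"


def weight_gb(params: float, bits_per_param: int) -> float:
    """Weights-only size in decimal gigabytes."""
    return params * bits_per_param / 8 / 1e9


print(pick_tier(16, 32))    # minimum
print(weight_gb(12e9, 16))  # 24.0 (bf16)
print(weight_gb(12e9, 4))   # 6.0  (4-bit, e.g. NF4)
```

The weights-only numbers suggest why the bf16 text encoder is reserved for the 48 GB tier: at 16 bits it would occupy about 24 GB before any other model is loaded.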

Hugging Face: https://huggingface.co/ScenemaAI/scenema-audio
GitHub: https://github.com/ScenemaAI/scenema-audio
Website: https://scenema.ai/audio