Scenema Audio

Released: May 15, 2026
Category: Text to Speech

Scenema Audio is an audio diffusion model extracted from LTX 2.3. It generates speech with emotional acting, pacing, breath control, and sound effects from a text prompt. It is not speech synthesis. It is performative audio generation.

Downloads: 99

Analysis Summary

Scenema Audio Overview

Every existing text-to-speech system converts words into sound, but none of them perform. Speech that merely pronounces words correctly is functionally useless for filmmaking, audiobooks, or any other context where emotional delivery carries as much meaning as the words themselves. Scenema Audio generates speech with intention, pacing, breath control, and emotional arcs that shift within a single generation, all from a text prompt that describes not just what to say but how to say it.

Scenema Audio Features

  • Generate: Build prompts from individual fields (voice description, speech text, scene, action tags) with preset examples
  • Voice Design: Quick 15-second voice previews for iterating on voice descriptions
  • Voice Cloning: Upload reference audio and generate with voice identity transfer
  • Advanced: Write raw XML directly for full control
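
The Generate and Advanced modes both end in an XML prompt, so the field-to-XML step can be sketched in a few lines. This is only an illustration: the element names (`prompt`, `voice`, `scene`, `action`, `speech`) and the `build_prompt` helper are hypothetical and are not taken from the Scenema Audio schema, which is documented in the project's repository.

```python
import xml.etree.ElementTree as ET


def build_prompt(voice, speech, scene="", actions=()):
    """Assemble an XML prompt string from individual fields.

    All tag names here are hypothetical placeholders, not the real schema.
    """
    root = ET.Element("prompt")
    ET.SubElement(root, "voice").text = voice
    if scene:
        ET.SubElement(root, "scene").text = scene
    for tag in actions:
        ET.SubElement(root, "action").text = tag
    ET.SubElement(root, "speech").text = speech
    # ElementTree escapes characters such as & and < in the text for us.
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_prompt(
        voice="Older man, gravelly, tired but warm",
        speech="We made it. I can't believe we made it.",
        scene="Quiet cabin at night",
        actions=["sighs", "laughs softly"],
    ))
```

Building the prompt with an XML library rather than string concatenation keeps the output well-formed even when the speech text contains characters that XML reserves.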

How it Works

The pipeline has four stages: text encoding (Gemma 3 12B, bf16), audio diffusion (8-step denoising), post-processing (vocal isolation, validation, silence trimming), and optional voice identity transfer via SeedVC.
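
For a sense of scale, the weight footprint of the text encoder can be estimated from its parameter count and numeric format. The sketch below assumes roughly 12 billion parameters (read off the model name) and counts weights only; activations, the audio transformer, and quantization metadata are ignored, so real usage will be higher.

```python
# Approximate storage cost per parameter for each numeric format.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "nf4": 0.5}


def weight_gib(num_params, dtype):
    """Return the approximate size of the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


# Assumption: about 12e9 parameters for Gemma 3 12B.
GEMMA_PARAMS = 12e9

if __name__ == "__main__":
    for dtype in ("bf16", "nf4"):
        print(f"{dtype}: {weight_gib(GEMMA_PARAMS, dtype):.1f} GiB")
```

Under these assumptions the encoder weights come to roughly 22 GiB in bf16 and about 5.6 GiB in NF4, which gives a rough feel for why quantization or CPU streaming matters on smaller cards.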

A 20-second clip takes about 5-8 seconds end-to-end on an RTX 4090. The minimum hardware is 16 GB of VRAM, with CPU streaming for the text encoder. The standard all-on-GPU configuration requires 24 GB.
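
The timing figures above can be restated as a real-time factor (generation time divided by audio duration). The helper below is a small illustrative calculation using only the quoted numbers; the function name is made up for this example.

```python
def real_time_factor(generation_s, audio_s):
    """Generation time divided by audio duration; below 1.0 is faster than real time."""
    if audio_s <= 0:
        raise ValueError("audio duration must be positive")
    return generation_s / audio_s


if __name__ == "__main__":
    clip_s = 20  # clip length quoted for the RTX 4090 measurement
    for gen_s in (5, 8):
        rtf = real_time_factor(gen_s, clip_s)
        print(f"{gen_s}s for a {clip_s}s clip: RTF {rtf:.2f}, {1 / rtf:.1f}x real time")
```

For the quoted range this works out to a real-time factor between 0.25 and 0.40, or 2.5 to 4 times faster than playback.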

Hardware Requirements

Minimum: 16 GB VRAM (RTX 4060 Ti 16GB, RTX A4000)
INT8 audio transformer + NF4 Gemma quantization. Models are automatically moved between GPU and CPU RAM as the pipeline advances through its stages (encode, diffuse, decode, voice convert). Requires 32 GB system RAM. This is the default configuration when started with docker compose up.

Recommended: 24 GB VRAM (RTX 4090, RTX A5000)
Same INT8 + NF4 config with all models resident on GPU simultaneously. No offloading overhead, fastest generation.

Full Precision: 48 GB VRAM (A6000 Ada, A40, L40S)
bf16 audio transformer + bf16 Gemma, all models resident on GPU. Best quality. Set environment variables
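
The three tiers above can be summarized as a lookup from available memory to configuration. The sketch below encodes only the thresholds stated in this section; the `Tier` type and `pick_tier` function are hypothetical names for illustration and are not part of the project, and the 32 GB system RAM check is applied only to the offloading tier because that is the only tier for which it is stated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    min_vram_gb: int
    audio_dtype: str   # format of the audio transformer
    text_dtype: str    # format of the Gemma text encoder
    offload: bool      # whether models move between GPU and CPU RAM


# Ordered from most to least demanding, using the figures listed above.
TIERS = (
    Tier("full_precision", 48, "bf16", "bf16", False),
    Tier("recommended", 24, "int8", "nf4", False),
    Tier("minimum", 16, "int8", "nf4", True),
)


def pick_tier(vram_gb, system_ram_gb):
    """Return the most capable tier the machine satisfies, or None."""
    for tier in TIERS:
        if vram_gb >= tier.min_vram_gb:
            # The offloading tier is listed as needing 32 GB of system RAM.
            if tier.offload and system_ram_gb < 32:
                return None
            return tier
    return None


if __name__ == "__main__":
    for vram, ram in ((16, 32), (24, 64), (48, 128), (12, 64)):
        tier = pick_tier(vram, ram)
        print(vram, ram, tier.name if tier else "unsupported")
```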

Related materials

  • Hugging Face: https://huggingface.co/ScenemaAI/scenema-audio
  • GitHub: https://github.com/ScenemaAI/scenema-audio
  • Website: https://scenema.ai/audio
