Cosmos 3 AI Model

Cosmos3 Overview

Cosmos3 is a collection of Omnimodal world models capable of generating dynamic, high-quality video, image, audio, and action commands from combinations of text, image, video, and action trajectory inputs. It serves as a foundational building block for a broad range of Physical AI applications and research spanning world understanding, world generation, simulation, and embodied policy learning.

Cosmos3 Model Versions

Cosmos3-Nano:

Given multimodal inputs including text, images, video, audio, and action trajectories, generate coherent text, images, video, audio, and action outputs for multimodal understanding, world simulation, future prediction, action reasoning, and Physical AI applications.

Cosmos3-Super:

Cosmos3-Nano-Policy-DROID:

Given language instructions and visual observations from the DROID robot platform, generate robot action trajectories for manipulation and control tasks.
Cosmos3-Super-Image2Video:

Given one input image and text instructions, generate temporally coherent video sequences that are consistent with the provided visual content.

Cosmos3-Super-Text2Image:

Given text input, generate high-fidelity images that are consistent with the provided description.

How to run Cosmos3

NVIDIA recommends vLLM-Omni for an OpenAI-compatible API endpoint.

vllm serve nvidia/Cosmos3-Nano \
  --omni \
  --host 0.0.0.0 \
  --port 8000

Simpler deployment method

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "nvidia/Cosmos3-Nano",
    dtype=torch.bfloat16,
    device_map="cuda",
)

image = pipe("A warehouse robot inspecting stacked crates").images[0]

Cosmos 3

Analysis Summary