
Cosmos 3 is NVIDIA’s next-generation family of open omnimodal world foundation models for Physical AI.
Context
-
tokens
Input
-
per 1M tokens
Output
-
per 1M tokens
Downloads
2,830
Cosmos3 is a collection of Omnimodal world models capable of generating dynamic, high-quality video, image, audio, and action commands from combinations of text, image, video, and action trajectory inputs. It serves as a foundational building block for a broad range of Physical AI applications and research spanning world understanding, world generation, simulation, and embodied policy learning.
Given multimodal inputs including text, images, video, audio, and action trajectories, generate coherent text, images, video, audio, and action outputs for multimodal understanding, world simulation, future prediction, action reasoning, and Physical AI applications.
Given multimodal inputs including text, images, video, audio, and action trajectories, generate coherent text, images, video, audio, and action outputs for multimodal understanding, world simulation, future prediction, action reasoning, and Physical AI applications.
Given language instructions and visual observations from the DROID robot platform, generate robot action trajectories for manipulation and control tasks.
Cosmos3-Super-Image2Video:
Given one input image and text instructions, generate temporally coherent video sequences that are consistent with the provided visual content.
Given text input, generate high-fidelity images that are consistent with the provided description.
How to run Cosmos3
vllm serve nvidia/Cosmos3-Nano \
--omni \
--host 0.0.0.0 \
--port 8000
import torch
from diffusers import DiffusionPipeline
pipe = DiffusionPipeline.from_pretrained(
"nvidia/Cosmos3-Nano",
dtype=torch.bfloat16,
device_map="cuda",
)
image = pipe("A warehouse robot inspecting stacked crates").images[0]