
Lance is a lightweight native unified multimodal model that supports image and video understanding, generation, and editing within a single framework.
The 3B model is trained from scratch with a staged multi-task recipe, on a training budget of at most 128 GPUs.
Text-to-Video
Nine text-conditioned cases focused on character motion, fantasy animals, two-person interaction, and cinematic dreamlike scenes.
Video Editing
Nine prompt-driven single-step and compositional editing cases spanning background transformation, object addition and removal, subject replacement, appearance restyling, stylization, and action edits.
Multi-turn Consistency Editing
Source video followed by four linked edits on the same subject: replacement, accessory addition, background rewrite, and motion update.
Video Understanding
Selected video question answering and captioning cases that evaluate temporal reasoning, motion recognition, and concise-to-detailed description.
Text-to-Image
Representative text-to-image outputs spanning photorealistic, stylized, compositional, and typography-heavy prompts.
Image Editing
Instruction-guided image editing cases showing local replacement, style transfer, object-aware modifications, and layout-preserving transformations.
Image Understanding
Six selected visual question answering cases spanning charts, trade data, OCR, documents, landmarks, and natural phenomena.