Bernini-R is the open-source renderer release for Bernini, a video generation and editing framework from ByteDance.
This article explains what Bernini-R is, what it can do, how it is released, and who may want to try it.
What Is Bernini-R?
Bernini-R is the renderer component of Bernini.
Bernini is a unified framework for video generation and video editing. It uses an MLLM-based semantic planner and a DiT-based renderer.
In simple terms, the planner handles high-level meaning. The renderer turns that plan into visual output.
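The planner and renderer split can be pictured as a two-stage pipeline. The sketch below is purely illustrative: `SemanticPlan`, `plan`, and `render` are hypothetical names with toy bodies, not Bernini's actual API or model code.

```python
from dataclasses import dataclass, field


@dataclass
class SemanticPlan:
    """Hypothetical container for the planner's high-level output."""
    task: str
    instructions: str
    conditioning: list[str] = field(default_factory=list)


def plan(prompt: str, inputs: list[str]) -> SemanticPlan:
    """Toy stand-in for the MLLM planner: decides what should be produced."""
    # Assumption for illustration: any supplied media means an edit task.
    task = "edit" if inputs else "generate"
    return SemanticPlan(task=task, instructions=prompt.strip(), conditioning=list(inputs))


def render(p: SemanticPlan, num_frames: int = 4) -> list[str]:
    """Toy stand-in for the DiT renderer: turns the plan into 'frames'."""
    return [f"frame {i}: {p.task} -> {p.instructions}" for i in range(num_frames)]


if __name__ == "__main__":
    semantic_plan = plan("make the sky stormy", inputs=["clip.mp4"])
    for frame in render(semantic_plan, num_frames=2):
        print(frame)
```

The point of the structure is that the renderer only sees the plan, so the reasoning step and the pixel synthesis step can be studied or swapped independently.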
Bernini-R is not a chatbot or a general-purpose image tool. It is a model release for developers and researchers who want to run image and video generation or editing workflows locally.
Overview
| Item | Details |
|---|---|
| Model name | Bernini-R |
| Developer | Bernini Team, ByteDance |
| Main category | Image-text-to-video / video generation and editing |
| Release type | Open-source inference code and model weights |
| License | Apache-2.0 |
| Model files | Safetensors |
| Paper | Bernini: Latent Semantic Planning for Video Diffusion |
| Hosted API | Not deployed by a Hugging Face Inference Provider at the time of writing |
| Recommended setup | CUDA GPU, with Hopper GPUs recommended for FlashAttention-3 |
| Diffusers version | Bernini-R-Diffusers is available as the recommended packaged format |
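
Because the weights ship as Safetensors files, their contents can be inspected without loading any tensors into memory. The sketch below relies only on the published Safetensors layout (an 8-byte little-endian header length followed by a JSON header). The function names are hypothetical helpers, and no Bernini-specific file name is assumed.

```python
import json
import struct


def read_safetensors_header(path: str) -> dict:
    """Return the JSON header of a .safetensors file without loading tensors."""
    with open(path, "rb") as f:
        # First 8 bytes: little-endian unsigned 64-bit header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len).decode("utf-8"))


def summarize(header: dict) -> list[tuple[str, str, list[int]]]:
    """List (name, dtype, shape) per tensor, skipping the optional metadata entry."""
    return [
        (name, info["dtype"], info["shape"])
        for name, info in sorted(header.items())
        if name != "__metadata__"
    ]
```

Calling `summarize(read_safetensors_header(path))` on a downloaded weight file gives a quick view of tensor names, dtypes, and shapes before committing GPU memory to a full load.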
Features
Unified Video Generation and Editing
Bernini is designed for both generation and editing tasks.
This matters because users do not need to treat text-to-video, video editing, and reference-guided editing as fully separate workflows.
MLLM-Based Semantic Planning
The framework uses a multimodal large language model as a semantic planner.
This planner reasons over text, images, videos, and target placeholders before the renderer creates the final output.
DiT-Based Rendering
Bernini-R is the rendering side of the system.
It uses a DiT-based renderer to synthesize pixels from semantic guidance and visual features.
Support for Multiple Task Types
The official examples include text-to-image, image editing, text-to-video, video-to-video editing, reference-guided video editing, and reference-to-video generation.
This makes Bernini-R useful for testing different media workflows in one codebase.
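One way to think about these task types is by the inputs each one needs. The sketch below is a hypothetical validation helper: the task keys and input names are assumptions inferred from the task names alone, not taken from the official scripts.

```python
# Assumed input requirements, inferred from the task names only.
TASK_INPUTS = {
    "text-to-image": {"prompt"},
    "image-editing": {"prompt", "image"},
    "text-to-video": {"prompt"},
    "video-to-video-editing": {"prompt", "video"},
    "reference-guided-video-editing": {"prompt", "video", "reference_images"},
    "reference-to-video": {"prompt", "reference_images"},
}


def missing_inputs(task: str, provided: dict) -> set[str]:
    """Return the names of required inputs that are absent or empty."""
    if task not in TASK_INPUTS:
        raise ValueError(f"unknown task: {task!r}")
    return {name for name in TASK_INPUTS[task] if not provided.get(name)}
```

For example, `missing_inputs("reference-guided-video-editing", {"prompt": "x", "video": "a.mp4"})` reports that `reference_images` is still needed, which is the kind of check a shared front end could run before dispatching a job.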
Reference-Guided Editing
Bernini can use reference images to guide edits.
For example, a user can provide a reference object, garment, material, weather condition, or visual style.
Content Insertion
The project page presents content insertion as one supported capability.
This can be useful when a creator wants to insert image or video content into an existing video scene.
Gradio Demo Support
Bernini includes a Gradio demo script.
This gives developers a simple interface for testing the pipeline without building a custom app first.
Use Cases
AI Video Editing for Creative Teams
Video editors can use Bernini-R to test prompt-driven video edits.
A practical use case is changing an object, adding a scene element, or adjusting the subject while preserving the source video structure.
Reference-Based Product or Fashion Edits
Design teams can use reference-guided editing to test how a garment, object, or material might look in a video.
This suits early visual prototyping; outputs should still be reviewed before any production use.
Research on Video Diffusion Models
AI researchers can study Bernini’s split between semantic planning and pixel rendering.
The paper frames this as a way to combine MLLM reasoning with diffusion-based visual synthesis.
Local Experimentation With Video Generation
Developers with suitable GPUs can run the inference code locally.
This suits teams that want more control than a hosted demo offers, but it requires a capable CUDA GPU.
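Before attempting a local run, it can help to confirm what GPU is available. The sketch below assumes PyTorch is installed and uses the fact that Hopper-generation GPUs report CUDA compute capability 9.0, which is the generation the overview recommends for FlashAttention-3. `is_hopper` and `describe_gpu` are hypothetical helper names.

```python
def is_hopper(major: int, minor: int) -> bool:
    """Hopper-generation GPUs (for example H100) report compute capability 9.0."""
    return (major, minor) == (9, 0)


def describe_gpu() -> str:
    """Report whether a CUDA GPU is visible and whether it is Hopper-class."""
    try:
        import torch  # assumed to be installed alongside the inference code
    except ImportError:
        return "PyTorch is not installed; cannot query the GPU."
    if not torch.cuda.is_available():
        return "No CUDA GPU detected."
    major, minor = torch.cuda.get_device_capability(0)
    name = torch.cuda.get_device_name(0)
    kind = "Hopper-class" if is_hopper(major, minor) else "not Hopper-class"
    return f"{name} (compute capability {major}.{minor}, {kind})"
```

Running `print(describe_gpu())` gives a one-line summary. A non-Hopper result does not by itself mean the model cannot run, only that the recommended FlashAttention-3 path may not apply.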
Building Internal Creative Tools
Engineering teams can use the Gradio demo or inference scripts as a starting point.
A possible internal tool could let artists test text-to-video, video editing, and reference-to-video tasks from one interface.
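As a rough picture of such an internal tool, the sketch below wraps a placeholder function in a small Gradio interface. It is not the demo script bundled with Bernini: `run_task`, `build_demo`, and the task list are hypothetical, and the placeholder would need to be replaced with real calls into the inference code.

```python
TASKS = ["text-to-video", "video-to-video editing", "reference-to-video"]


def run_task(task: str, prompt: str) -> str:
    """Hypothetical placeholder: a real tool would call the inference code here."""
    if not prompt.strip():
        return "Please enter a prompt."
    return f"[placeholder] would run {task} for: {prompt.strip()}"


def build_demo():
    """Build a small Gradio UI around run_task (requires the gradio package)."""
    import gradio as gr  # imported lazily so run_task works without Gradio

    return gr.Interface(
        fn=run_task,
        inputs=[
            gr.Dropdown(choices=TASKS, value=TASKS[0], label="Task"),
            gr.Textbox(label="Prompt"),
        ],
        outputs=gr.Textbox(label="Result"),
        title="Placeholder video tool",
    )
```

With Gradio installed, `build_demo().launch()` starts a local web page with a task dropdown and a prompt box. Keeping the model call behind a single function makes it easier to swap in the actual pipeline later.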