---
language:
- en
pipeline_tag: image-to-image
tags:
- image-editing
- text-guided-editing
- diffusion
- sana
- qwen-vl
- multimodal
base_model:
- Efficient-Large-Model/SANA1.5_1.6B_1024px
- Qwen/Qwen3-VL-2B-Instruct
library_name: diffusers
---
# VIBE: Visual Instruction Based Editor
**VIBE** is a powerful open-source framework for text-guided image editing. It leverages the efficiency of the [Sana1.5-1.6B](https://github.com/NVlabs/Sana) diffusion model and the visual understanding capabilities of [Qwen3-VL-2B-Instruct](https://github.com/QwenLM/Qwen3-VL) to provide **exceptionally fast** and high-quality, instruction-based image manipulation.
A faster, **CFG-distilled** version of this model is available at [VIBE-Image-Edit-DistilledCFG](https://huggingface.co/iitolstykh/VIBE-Image-Edit-DistilledCFG).
## Model Details
- **Name:** VIBE
- **Task:** Text-Guided Image Editing
- **Architecture:**
  - **Diffusion Backbone:** Sana1.5 (1.6B parameters) with Linear Attention.
  - **Condition Encoder:** Qwen3-VL (2B parameters) for multimodal understanding.
- **Framework:** Built on `diffusers` and `transformers`.
- **Model precision:** `torch.bfloat16` (BF16)
- **Model resolution:** Designed to edit images up to 2048px, with multi-scale height and width.
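
To stay within the resolution limit listed above, an input image can be downscaled before editing. The following is a minimal sketch, not part of the `vibe` library: `fit_resolution` is a hypothetical helper, and rounding each side down to a multiple of 32 is an assumption (a common constraint for latent diffusion models), not something this card specifies.

```python
def fit_resolution(width: int, height: int, max_side: int = 2048, multiple: int = 32) -> tuple[int, int]:
    """Return a (width, height) whose longer side fits within max_side.

    The aspect ratio is preserved (up to rounding), and each side is rounded
    down to a multiple of `multiple`. Both defaults are assumptions.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    longest = max(width, height)
    if longest > max_side:
        # Integer arithmetic avoids floating-point rounding surprises.
        width = width * max_side // longest
        height = height * max_side // longest

    def snap(value: int) -> int:
        # Round down to the nearest multiple, but never below one multiple.
        return max(multiple, value // multiple * multiple)

    return snap(width), snap(height)
```

The returned size can then be passed to an image library's resize call (for example `image.resize(fit_resolution(*image.size))` with Pillow) before the image is handed to the editor.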
## Features
- **Text-Guided Editing:** Edit images using natural language instructions (e.g., "Add a cat on the sofa").
- **Compact & Efficient:** Combines a 1.6B parameter diffusion model with a 2B parameter encoder for a lightweight footprint.
- **High-Speed Inference:** Utilizes Sana1.5's linear attention mechanism for rapid generation.
- **Multimodal Understanding:** Qwen3-VL ensures strong alignment between visual content and text instructions.
- **Text-to-Image:** Also supports text-to-image generation.
## Inference Requirements
- `vibe` library
```bash
pip install git+https://github.com/ai-forever/VIBE
```
- Requirements for the `vibe` library:
```bash
pip install transformers==4.57.1 torchvision==0.21.0 torch==2.6.0 diffusers==0.33.1 loguru==0.7.3
```
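
Because the versions above are pinned exactly, it can help to check the active environment before loading the model. The sketch below uses only the standard library's `importlib.metadata`; `PINNED` and `find_mismatches` are illustrative names, not part of `vibe`. Treating a local build tag such as `2.6.0+cu124` as matching `2.6.0` is an assumption made here for convenience.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the pip command above.
PINNED = {
    "transformers": "4.57.1",
    "torchvision": "0.21.0",
    "torch": "2.6.0",
    "diffusers": "0.33.1",
    "loguru": "0.7.3",
}


def find_mismatches(pins, get_version=version):
    """Return {package: (expected, installed_or_None)} for every unsatisfied pin."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None
        # Ignore a local build suffix such as "+cu124" when comparing.
        if installed is None or installed.split("+")[0] != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

An empty result means every pinned package is installed at the expected version.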
## Quick Start
```python
from io import BytesIO

import requests
from huggingface_hub import snapshot_download
from PIL import Image
from vibe.editor import ImageEditor

# Download the model weights from the Hugging Face Hub
model_path = snapshot_download(
    repo_id="iitolstykh/VIBE-Image-Edit",
    repo_type="model",
)

# Load the model
editor = ImageEditor(
    checkpoint_path=model_path,
    image_guidance_scale=1.2,
    guidance_scale=4.5,
    num_inference_steps=20,
    device="cuda:0",
)

# Download a test image
resp = requests.get(
    "https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/3f58a82a-b4b4-40c3-a318-43f9350fcd02/original=true,quality=90/115610275.jpeg",
    timeout=60,
)
resp.raise_for_status()  # fail early on a bad HTTP response
image = Image.open(BytesIO(resp.content))

# Generate the edited image and keep the first result
edited_image = editor.generate_edited_image(
    instruction="let this case swim in the river",
    conditioning_image=image,
    num_images_per_prompt=1,
)[0]
edited_image.save("edited_image.jpg", quality=100)
```
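
Several instructions can be chained by feeding each result back in as the next conditioning image. This is a minimal sketch rather than a documented `vibe` feature: `apply_instructions` is a hypothetical helper that relies only on the `generate_edited_image` call shown in the quick start, and it assumes the returned image is accepted as `conditioning_image` on the next call.

```python
def apply_instructions(editor, image, instructions):
    """Apply instructions one after another and return every intermediate result.

    `editor` is any object exposing generate_edited_image() with the keyword
    arguments used in the quick start (assumed, not verified against vibe).
    """
    results = []
    current = image
    for instruction in instructions:
        # Same call as in the quick start: one image per prompt, take the first.
        current = editor.generate_edited_image(
            instruction=instruction,
            conditioning_image=current,
            num_images_per_prompt=1,
        )[0]
        results.append(current)
    return results
```

With the `editor` and `image` objects from the quick start, a call such as `apply_instructions(editor, image, ["make it winter", "add falling snow"])` would return one image per step, which makes it easy to inspect where a multi-step edit goes wrong.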
## T2I Examples
(Seed: 234) Prompt: View through the clouds at Earth from a plane

(Seed: 2) Prompt: Medieval castle at sunset surrounded by dense forest and mist

(Seed: 666) Prompt: Portrait of an old wise man with a long white beard surrounded by books and candles

(Seed: 9513) Prompt: Night urban street with wet asphalt reflections and neon signs

(Seed: 142) Prompt: Futuristic sports car racing in the desert

(Seed: 1325) Prompt: Pirate boat in ocean

(Seed: 4241) Prompt: Davy Jones portrait

(Seed: 142) Prompt: Epic cosmic scene with a huge space station and distant stars

(Seed: 42) Prompt: Cherry blossom park in spring with petals falling to the ground

## Comparison with SANA1.5_1.6B_1024px
**Prompt:** Generate an interior of a rustic cabin workshop during winter evening. The viewpoint is from the doorway, showing a workbench with tools, wood shavings on the floor, and a cast-iron stove glowing softly. Place shelves with jars of nails, coils of rope, and folded blankets. Through a small window, show snow falling and pine trees in the twilight. Add warm lamplight creating soft gradients and a gentle vignette. Include a person in a thick sweater sanding a wooden object at the bench, but keep the person small in frame
VIBE (Seed: 4411)
SANA1.5_1.6B_1024px (Seed: 1521)
---
**Prompt:** Generate an ancient jungle temple ruin partially covered in moss and vines, with a waterfall cascading nearby into a shallow pool. Show broken stone steps, carved patterns that are abstract, and damp surfaces with realistic moss detail. Add mist, shafts of sunlight through leaves, and small floating insects. Include a human explorer in the mid-ground, small in frame, wearing a backpack. Lush, cinematic realism.
VIBE (Seed: 1995)
SANA1.5_1.6B_1024px (Seed: 9842)
---
**Prompt:** Create a science-fiction interior of a space greenhouse module with hydroponic racks, glowing grow lights, and condensation on transparent walls. Plants include leafy greens and flowering specimens. Tools and tablets have UI elements. Add soft floating dust or microgravity droplets. Clean, detailed, plausible sci-fi aesthetic.
VIBE (Seed: 2203)
SANA1.5_1.6B_1024px (Seed: 143)
---
**Prompt:** Beautiful tropical beach with guinea pig swimming in the water and human drinking wine
VIBE (Seed: 132142)
SANA1.5_1.6B_1024px (Seed: 132142)
---
**Prompt:** Create a cinematic, rainy night scene in a narrow backstreet of an old downtown area. The camera is at street level, slightly tilted upward, emphasizing wet cobblestones reflecting neon-like colored lights without readable text. Show a small ramen stall with steam rising from pots, hanging paper lanterns that are blank or patterned (no letters), and a couple of stools under a simple awning. Add puddles, scattered trash like crumpled paper, and subtle mist. Include a passerby in the mid-ground seen from behind wearing a hooded jacket and carrying an umbrella, face not visible. Use a moody color palette of deep blues and warm oranges, with soft bokeh highlights and realistic rain streaks
VIBE (Seed: 1003)
SANA1.5_1.6B_1024px (Seed: 3114)
---
**Prompt:** Depict a volcanic lava field at twilight with cooled black rock, glowing cracks of magma in the distance, and heat shimmer. The sky is darkening with faint stars emerging. Add thin smoke plumes and red-orange reflections on nearby rocks. Cinematic realism, dramatic contrast
VIBE (Seed: 1520)
SANA1.5_1.6B_1024px (Seed: 1267)
---
**Prompt:** Portrait from back of a young woman dressed in Victorian attire standing in an ancient library filled with mirrors and stained glass windows, softly illuminated by sunlight streaming through
VIBE (Seed: 4152)
SANA1.5_1.6B_1024px (Seed: 6742)
## License
This project is built upon SANA. Please refer to the original SANA license for usage terms:
[SANA License](https://huggingface.co/Efficient-Large-Model/SANA1.5_4.8B_1024px_diffusers/blob/main/LICENSE.txt)
## Citation
If you use this model in your research or applications, please acknowledge the original projects:
- [SANA 1.5: Efficient Scaling of Training-Time and Inference-Time Compute in Linear Diffusion Transformer](https://github.com/NVlabs/Sana)
- [Qwen3-VL](https://github.com/QwenLM/Qwen3-VL)
```bibtex
@misc{vibe2026,
  Author = {Grigorii Alekseenko and Aleksandr Gordeev and Irina Tolstykh and Bulat Suleimanov and Vladimir Dokholyan and Georgii Fedorov and Sergey Yakubson and Aleksandra Tsybina and Mikhail Chernyshov and Maksim Kuprashevich},
  Title = {VIBE: Visual Instruction Based Editor},
  Year = {2026},
  Eprint = {arXiv:2601.02242},
}
```