Qwen3-VL-2B-MegaStyle
Full supervised fine-tuning (SFT) of Qwen3-VL-2B-Instruct on the MegaStyle dataset, a large-scale vision-language style description dataset containing ~1.36M (image, style prompt, description) triples across roughly 1,000 distinct artistic styles.
This model specializes in generating concise, style-aware image descriptions conditioned on a target artistic style and color palette.
Model Description
- Base model: Qwen/Qwen3-VL-2B-Instruct
- Training data: MegaStyle (1,361,636 train / 27,788 val)
- Training method: Full-parameter SFT of the LLM layers only (ViT and aligner frozen)
- Training framework: ms-swift
- Training hardware: 1 node × 8 GPUs
Training Data: MegaStyle
MegaStyle is a curated vision-language dataset for style-conditioned image description. Each sample consists of:
- Image: A source image
- Style prompt: A textual specification combining an art style, color palette, and composition directive (e.g., "In the style of Art Deco, golden hues with deep blues in high-contrast distribution, dramatic lighting, digital illustration")
- Description: A concise description of the image content
Key Statistics
| Statistic | Value |
|---|---|
| Training samples | 1,361,636 |
| Validation samples | 27,788 |
| Unique artistic styles | ~1,000 |
| Unique source images | ~1,389,424 |
Style Diversity
The dataset covers a broad spectrum of artistic styles including but not limited to:
- Fine art movements: Impressionism, Art Nouveau, Art Deco, Baroque, Ukiyo-e, Surrealism, Cubism, etc.
- Modern illustration: Vintage travel poster, 3D animation, pixel art, low-poly, flat design, etc.
- Photographic styles: 19th-century photorealism, polaroid, cinematic, fashion photography, etc.
- Cultural styles: Afrofuturist, Chinese ink wash, Japanese woodblock, Persian miniature, etc.
Each style is paired with specific color palette directives and lighting descriptions, enabling fine-grained style-aware generation.
Data Format
{
  "messages": [
    {"role": "user", "content": "<image>\nDescribe this image in the following style:\nIn the style of Art Deco, golden hues with deep blues in high-contrast distribution, dramatic lighting, digital illustration"},
    {"role": "assistant", "content": "Skyscraper with geometric golden facade against night sky"}
  ],
  "images": ["/path/to/image.jpg"]
}
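A record in this format can be assembled programmatically. The sketch below is illustrative only: `build_style_prompt` and `build_sample` are hypothetical helper names, and the comma-joined layout of the style prompt is inferred from the example above, not taken from the dataset's actual construction pipeline.

```python
import json


def build_style_prompt(style, palette, directives):
    """Join a style name, a palette phrase, and extra directives into one prompt.

    Assumption: the comma-joined layout mirrors the example prompt above.
    """
    parts = [f"In the style of {style}", palette, *directives]
    return ", ".join(parts)


def build_sample(image_path, style_prompt, description):
    """Return one training record in the messages/images layout shown above."""
    user_text = f"<image>\nDescribe this image in the following style:\n{style_prompt}"
    return {
        "messages": [
            {"role": "user", "content": user_text},
            {"role": "assistant", "content": description},
        ],
        "images": [image_path],
    }


prompt = build_style_prompt(
    "Art Deco",
    "golden hues with deep blues in high-contrast distribution",
    ["dramatic lighting", "digital illustration"],
)
sample = build_sample(
    "/path/to/image.jpg",
    prompt,
    "Skyscraper with geometric golden facade against night sky",
)
# Serialize as a single line, as one would for a JSONL training file.
print(json.dumps(sample, ensure_ascii=False))
```

Writing one such JSON object per line produces a JSONL file; whether the original dataset is stored that way is not stated here.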
Usage
Transformers
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_name = "Kassadin88/Qwen3-VL-2B-MegaStyle"

# Qwen3-VL checkpoints load with the Qwen3-VL model class, not the Qwen2.5-VL one.
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)

messages = [
    {
        "role": "user",
        "content": [
            # process_vision_info expects {"type": "image", "image": <path or URL>}.
            {"type": "image", "image": "https://example.com/image.jpg"},
            {"type": "text", "text": "Describe this image in the following style:\nIn the style of Impressionism, soft pastel colors with dappled light, oil painting"},
        ],
    }
]

# Render the chat template as text, then load the image(s) referenced in the messages.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs, padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Drop the prompt tokens so only the newly generated tokens are decoded.
generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(output_text[0])
vLLM
vllm serve Kassadin88/Qwen3-VL-2B-MegaStyle --port 8000 --tensor-parallel-size 1 --max-model-len 8192 --limit-mm-per-prompt '{"image": 1}'
SGLang
python -m sglang.launch_server --model-path Kassadin88/Qwen3-VL-2B-MegaStyle --port 8000 --tp-size 1 --mem-fraction-static 0.8 --context-length 8192
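Both vLLM and SGLang expose an OpenAI-compatible HTTP API, so a server started with either command above can be queried the same way. The following is a minimal standard-library sketch, assuming the server is reachable at `http://localhost:8000` and serves the `/v1/chat/completions` route; `build_payload` and `describe` are hypothetical helper names, and the image URL is a placeholder.

```python
import json
import urllib.request

MODEL = "Kassadin88/Qwen3-VL-2B-MegaStyle"


def build_payload(image_url, style_prompt, max_tokens=256):
    """Build an OpenAI-style chat completion request with one image and one text part."""
    return {
        "model": MODEL,
        "max_tokens": max_tokens,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {
                        "type": "text",
                        "text": f"Describe this image in the following style:\n{style_prompt}",
                    },
                ],
            }
        ],
    }


def describe(image_url, style_prompt, base_url="http://localhost:8000"):
    """POST the request to a running server and return the generated description."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(image_url, style_prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Building the payload needs no server; inspect it before sending.
payload = build_payload(
    "https://example.com/image.jpg",
    "In the style of Impressionism, soft pastel colors with dappled light, oil painting",
)
print(json.dumps(payload, indent=2))
```

Calling `describe(...)` requires one of the servers above to be running and the image URL to be reachable from the server.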
Training Details
| Item | Value |
|---|---|
| Training epochs | 1 |
| Global steps | 10,638 |
| Final train loss | 0.824 |
| Final eval loss | 0.817 |
| DeepSpeed | ZeRO-2 |
| Precision | BF16 |
| Learning rate | 1e-5 |
| Batch size | 2 per device × 8 GPUs × 8 accum = 128 effective |
| Max sequence length | 4,096 |
| Frozen modules | ViT, aligner |
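
The reported step count is consistent with the batch configuration. A quick check, assuming one optimizer step per effective batch and that the final partial batch still counts as a step (variable names are illustrative):

```python
import math

train_samples = 1_361_636
per_device_batch = 2
num_gpus = 8
grad_accum_steps = 8

# Effective batch size = per-device batch × number of GPUs × accumulation steps.
effective_batch = per_device_batch * num_gpus * grad_accum_steps

# Assumption: the last, partially filled batch is kept, hence the ceiling.
steps_per_epoch = math.ceil(train_samples / effective_batch)

print(effective_batch)   # 128
print(steps_per_epoch)   # 10638
```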
Intended Use
This model is designed for:
- Style-conditioned image description: Generate descriptions that match a specified artistic style and color palette
- Image-to-text pipelines: Use as a style-aware captioning component
- Creative applications: Art style analysis, style transfer guidance, design mood boards
Limitations
- Descriptions are concise (typically 1-2 sentences); not suitable for long-form art analysis
- Style vocabulary is limited to the ~1,000 styles present in the training data
- The model was trained with frozen ViT; visual grounding may not improve over the base model
- Performance on out-of-distribution styles (not in training set) is not guaranteed
Citation
If you find this work helpful, please consider citing it.
@misc{qwen3-vl,
  title  = {{Qwen3-VL}},
  author = {{Qwen Team}},
  year   = {2026},
  url    = {https://qwen.ai/blog?id=qwen3-vl}
}
@misc{qwen3-vl-2b-megastyle,
  title  = {Qwen3-VL-2B-MegaStyle: Style-Conditioned Image Description via Full SFT},
  author = {Kassadin88},
  year   = {2026},
  url    = {https://huggingface.co/Kassadin88/Qwen3-VL-2B-MegaStyle}
}