Qwen3-VL-2B-MegaStyle

Full supervised fine-tuning (SFT) of Qwen3-VL-2B-Instruct on the MegaStyle dataset, a large-scale vision-language style description dataset containing ~1.36M (image, style prompt, description) triples across ~1,000 distinct artistic styles.

This model specializes in generating concise, style-aware image descriptions conditioned on a target artistic style and color palette.

Model Description

Base model: Qwen/Qwen3-VL-2B-Instruct
Training data: MegaStyle (1,361,636 train / 27,788 val)
Training method: Full-parameter SFT of the LLM layers only (ViT and aligner frozen)
Training framework: ms-swift
Training hardware: 1 node × 8 GPUs

Training Data: MegaStyle

MegaStyle is a curated vision-language dataset for style-conditioned image description. Each sample consists of:

Image: A source image
Style prompt: A textual specification combining an art style, color palette, and composition directive (e.g., "In the style of Art Deco, golden hues with deep blues in high-contrast distribution, dramatic lighting, digital illustration")
Description: A concise description of the image content
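The style prompt in the example above reads as a comma-separated composition of a few parts. A minimal sketch of that composition, assuming a four-part breakdown (style, palette, lighting, medium) inferred from the example rather than taken from a documented MegaStyle schema; `compose_style_prompt` is a hypothetical helper:

```python
def compose_style_prompt(style: str, palette: str, lighting: str, medium: str) -> str:
    """Join the parts into a single style prompt string.

    The four-field split is an assumption read off the example prompt,
    not a documented MegaStyle schema.
    """
    parts = [f"In the style of {style}", palette, lighting, medium]
    # Skip empty parts so optional fields do not leave stray commas.
    return ", ".join(p.strip() for p in parts if p and p.strip())


prompt = compose_style_prompt(
    style="Art Deco",
    palette="golden hues with deep blues in high-contrast distribution",
    lighting="dramatic lighting",
    medium="digital illustration",
)
print(prompt)
```

Passing an empty string for a part (for example, no lighting directive) simply omits it from the prompt.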

Key Statistics

Statistic	Value
Training samples	1,361,636
Validation samples	27,788
Unique artistic styles	~1,000
Unique source images	~1.39M

Style Diversity

The dataset covers a broad spectrum of artistic styles including but not limited to:

Fine art movements: Impressionism, Art Nouveau, Art Deco, Baroque, Ukiyo-e, Surrealism, Cubism, etc.
Modern illustration: Vintage travel poster, 3D animation, pixel art, low-poly, flat design, etc.
Photographic styles: 19th-century photorealism, polaroid, cinematic, fashion photography, etc.
Cultural styles: Afrofuturist, Chinese ink wash, Japanese woodblock, Persian miniature, etc.

Each style is paired with specific color palette directives and lighting descriptions, enabling fine-grained style-aware generation.

Data Format

{
  "messages": [
    {"role": "user", "content": "<image>\nDescribe this image in the following style:\nIn the style of Art Deco, golden hues with deep blues in high-contrast distribution, dramatic lighting, digital illustration"},
    {"role": "assistant", "content": "Skyscraper with geometric golden facade against night sky"}
  ],
  "images": ["/path/to/image.jpg"]
}
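Before training on records in this format, a quick structural check can catch malformed samples. The sketch below is illustrative only: `validate_record` is a hypothetical helper, and the rule that each `<image>` placeholder maps to one entry in `images` is an assumption based on the example record, not a statement about how ms-swift validates data.

```python
import json


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one training record (empty if none)."""
    problems = []
    messages = record.get("messages", [])
    images = record.get("images", [])

    roles = [m.get("role") for m in messages]
    if roles != ["user", "assistant"]:
        problems.append(f"expected roles ['user', 'assistant'], got {roles}")

    # Assumption: each "<image>" placeholder corresponds to one entry in "images".
    placeholders = sum(m.get("content", "").count("<image>") for m in messages)
    if placeholders != len(images):
        problems.append(f"{placeholders} <image> placeholder(s) but {len(images)} image path(s)")

    if messages and not messages[-1].get("content", "").strip():
        problems.append("empty assistant description")
    return problems


line = json.dumps({
    "messages": [
        {"role": "user", "content": "<image>\nDescribe this image in the following style:\nIn the style of Art Deco, dramatic lighting"},
        {"role": "assistant", "content": "Skyscraper with geometric golden facade against night sky"},
    ],
    "images": ["/path/to/image.jpg"],
})
print(validate_record(json.loads(line)))  # prints [] for a well-formed record
```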

Usage

Transformers

from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model_name = "Kassadin88/Qwen3-VL-2B-MegaStyle"

# Qwen3-VL checkpoints load with Qwen3VLForConditionalGeneration,
# which requires a recent transformers release.
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/image.jpg"},
            {"type": "text", "text": "Describe this image in the following style:\nIn the style of Impressionism, soft pastel colors with dappled light, oil painting"},
        ],
    }
]

# The processor applies the chat template, loads the image, and tokenizes in one call.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Drop the prompt tokens so that only the newly generated description is decoded.
generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(output_text[0])

vLLM

vllm serve Kassadin88/Qwen3-VL-2B-MegaStyle --port 8000 --tensor-parallel-size 1 --max-model-len 8192 --limit-mm-per-prompt '{"image": 1}'

SGLang

python -m sglang.launch_server --model-path Kassadin88/Qwen3-VL-2B-MegaStyle --port 8000 --tp-size 1 --mem-fraction-static 0.8 --context-length 8192
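Both servers above expose an OpenAI-compatible chat completions endpoint, so a single client can query either one. The following is a minimal sketch using only the Python standard library; `build_chat_payload` and `describe` are hypothetical helper names, and the base URL assumes the server runs locally on port 8000 as in the commands above.

```python
import json
import urllib.request


def build_chat_payload(image_url: str, style_prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-style chat completion request body for one image."""
    return {
        "model": "Kassadin88/Qwen3-VL-2B-MegaStyle",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": f"Describe this image in the following style:\n{style_prompt}"},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


def describe(image_url: str, style_prompt: str, base_url: str = "http://localhost:8000") -> str:
    """POST the request to a running server and return the generated description."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_chat_payload(image_url, style_prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_chat_payload(
    "https://example.com/image.jpg",
    "In the style of Impressionism, soft pastel colors with dappled light, oil painting",
)
print(json.dumps(payload, indent=2))
```

With a server running, `describe("https://example.com/image.jpg", "In the style of Impressionism, ...")` should return the description text. The example only prints the payload so that it runs without a live server.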

Training Details

Item	Value
Training epochs	1
Global steps	10,638
Final train loss	0.824
Final eval loss	0.817
DeepSpeed	ZeRO-2
Precision	BF16
Learning rate	1e-5
Batch size	2 per device × 8 GPUs × 8 accum = 128 effective
Max sequence length	4,096
Frozen modules	ViT, aligner
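The effective batch size and the global step count in the table are consistent with each other, which can be checked with a few lines of arithmetic. This sketch assumes the final partial batch still counts as one optimizer step (ceiling division); the variable names are illustrative.

```python
import math

train_samples = 1_361_636
per_device_batch = 2
num_gpus = 8
grad_accum_steps = 8

effective_batch = per_device_batch * num_gpus * grad_accum_steps
# Assumption: the last, partially filled batch still counts as a step.
steps_per_epoch = math.ceil(train_samples / effective_batch)

print(effective_batch)   # 128
print(steps_per_epoch)   # 10638
```

One epoch over the training split therefore corresponds to the 10,638 global steps reported above.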

Intended Use

This model is designed for:

Style-conditioned image description: Generate descriptions that match a specified artistic style and color palette
Image-to-text pipelines: Use as a style-aware captioning component
Creative applications: Art style analysis, style transfer guidance, design mood boards

Limitations

Descriptions are concise (typically 1-2 sentences); not suitable for long-form art analysis
Style vocabulary is limited to the ~1,000 styles present in the training data
The model was trained with frozen ViT; visual grounding may not improve over the base model
Performance on out-of-distribution styles (not in training set) is not guaranteed

Citation

If you find this work helpful, please consider citing it.

@misc{qwen3-vl,
    title  = {{Qwen3-VL}},
    author = {{Qwen Team}},
    year   = {2026},
    url    = {https://qwen.ai/blog?id=qwen3-vl}
}

@misc{qwen3-vl-2b-megastyle,
    title={Qwen3-VL-2B-MegaStyle: Style-Conditioned Image Description via Full SFT},
    author={Kassadin88},
    year={2026},
    url={https://huggingface.co/Kassadin88/Qwen3-VL-2B-MegaStyle}
}
