Qwen3-VL-2B-MegaStyle

Qwen3-VL-2B-MegaStyle is a full supervised fine-tune of Qwen3-VL-2B-Instruct on the MegaStyle dataset, a large-scale vision-language style description dataset containing ~1.36M (image, style prompt, description) triples across roughly 1,000 distinct artistic styles.

This model specializes in generating concise, style-aware image descriptions conditioned on a target artistic style and color palette.

Model Description

  • Base model: Qwen/Qwen3-VL-2B-Instruct
  • Training data: MegaStyle (1,361,636 train / 27,788 val)
  • Training method: Full-parameter SFT of the LLM layers (ViT and aligner frozen)
  • Training framework: ms-swift
  • Training hardware: 1 node × 8 GPUs

Training Data: MegaStyle

MegaStyle is a curated vision-language dataset for style-conditioned image description. Each sample consists of:

  • Image: A source image
  • Style prompt: A textual specification combining an art style, color palette, and composition directive (e.g., "In the style of Art Deco, golden hues with deep blues in high-contrast distribution, dramatic lighting, digital illustration")
  • Description: A concise description of the image content
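
The example style prompt above follows a comma-separated pattern: a style clause, a color palette, then further directives. The following is a minimal sketch of composing such a prompt, assuming that pattern holds across the dataset. The helper name `build_style_prompt` and its parameters are illustrative and not part of MegaStyle.

```python
def build_style_prompt(style: str, palette: str, *directives: str) -> str:
    """Join an art style, a color palette, and optional directives into one prompt."""
    parts = [f"In the style of {style}", palette, *directives]
    # Drop empty pieces so a missing directive does not leave a dangling comma.
    return ", ".join(p.strip() for p in parts if p and p.strip())


prompt = build_style_prompt(
    "Art Deco",
    "golden hues with deep blues in high-contrast distribution",
    "dramatic lighting",
    "digital illustration",
)
print(prompt)
```

Called with these arguments, the helper reproduces the example prompt quoted in the list above.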

Key Statistics

Statistic | Value
--- | ---
Training samples | 1,361,636
Validation samples | 27,788
Unique artistic styles | ~1,000
Unique source images | ~1,389,424

Style Diversity

The dataset covers a broad spectrum of artistic styles including but not limited to:

  • Fine art movements: Impressionism, Art Nouveau, Art Deco, Baroque, Ukiyo-e, Surrealism, Cubism, etc.
  • Modern illustration: Vintage travel poster, 3D animation, pixel art, low-poly, flat design, etc.
  • Photographic styles: 19th-century photorealism, polaroid, cinematic, fashion photography, etc.
  • Cultural styles: Afrofuturist, Chinese ink wash, Japanese woodblock, Persian miniature, etc.

Each style is paired with specific color palette directives and lighting descriptions, enabling fine-grained style-aware generation.

Data Format

{
  "messages": [
    {"role": "user", "content": "<image>\nDescribe this image in the following style:\nIn the style of Art Deco, golden hues with deep blues in high-contrast distribution, dramatic lighting, digital illustration"},
    {"role": "assistant", "content": "Skyscraper with geometric golden facade against night sky"}
  ],
  "images": ["/path/to/image.jpg"]
}
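
When preparing additional data in this format, a quick structural check can catch malformed records early. The sketch below is one possible validator, assuming each record holds exactly one user turn and one assistant turn and that every `<image>` placeholder maps to one entry in `images`. The function name `validate_sample` is hypothetical and not part of the dataset tooling.

```python
def validate_sample(sample: dict) -> list[str]:
    """Return a list of problems found in one record; an empty list means it looks valid."""
    problems = []
    messages = sample.get("messages", [])
    images = sample.get("images", [])
    roles = [m.get("role") for m in messages]
    if roles != ["user", "assistant"]:
        problems.append(f"expected roles ['user', 'assistant'], got {roles}")
        return problems
    user_text = messages[0].get("content", "")
    # Each <image> placeholder should correspond to one entry in "images".
    if user_text.count("<image>") != len(images):
        problems.append("number of <image> placeholders does not match 'images'")
    if not messages[1].get("content", "").strip():
        problems.append("assistant description is empty")
    return problems


record = {
    "messages": [
        {"role": "user", "content": "<image>\nDescribe this image in the following style:\nIn the style of Art Deco, golden hues with deep blues in high-contrast distribution, dramatic lighting, digital illustration"},
        {"role": "assistant", "content": "Skyscraper with geometric golden facade against night sky"},
    ],
    "images": ["/path/to/image.jpg"],
}
print(validate_sample(record))
```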

Usage

Transformers

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor

model_name = "Kassadin88/Qwen3-VL-2B-MegaStyle"

# Qwen3-VL checkpoints load with the Qwen3-VL class (requires a recent transformers release).
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_name)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/image.jpg"},
            {"type": "text", "text": "Describe this image in the following style:\nIn the style of Impressionism, soft pastel colors with dappled light, oil painting"},
        ],
    }
]

# The processor applies the chat template, loads the image, and tokenizes in one step.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens so only the newly generated description is decoded.
generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(output_text[0])

vLLM

vllm serve Kassadin88/Qwen3-VL-2B-MegaStyle --port 8000 --tensor-parallel-size 1 --max-model-len 8192 --limit-mm-per-prompt '{"image": 1}'

SGLang

python -m sglang.launch_server --model-path Kassadin88/Qwen3-VL-2B-MegaStyle --port 8000 --tp-size 1 --mem-fraction-static 0.8 --context-length 8192
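
Both vLLM and SGLang expose an OpenAI-compatible HTTP API, so a server started with either command above can be queried the same way. The sketch below uses only the Python standard library and assumes the server listens on `localhost:8000` and serves the model under its repository name. The helper names `build_request` and `describe` are illustrative.

```python
import json
import urllib.request


def build_request(image_url: str, style_prompt: str,
                  model: str = "Kassadin88/Qwen3-VL-2B-MegaStyle") -> dict:
    """Build a chat-completions payload with one image and one style instruction."""
    return {
        "model": model,
        "max_tokens": 256,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text",
                 "text": f"Describe this image in the following style:\n{style_prompt}"},
            ],
        }],
    }


def describe(payload: dict, base_url: str = "http://localhost:8000") -> str:
    """POST the payload to the (assumed) local server and return the generated text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


payload = build_request(
    "https://example.com/image.jpg",
    "In the style of Impressionism, soft pastel colors with dappled light, oil painting",
)
# Uncomment once one of the servers above is running:
# print(describe(payload))
```

The user text mirrors the training prompt format ("Describe this image in the following style:" followed by the style prompt), which is likely to give the most in-distribution behavior.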

Training Details

Item | Value
--- | ---
Training epochs | 1
Global steps | 10,638
Final train loss | 0.824
Final eval loss | 0.817
DeepSpeed | ZeRO-2
Precision | BF16
Learning rate | 1e-5
Batch size | 2 per device × 8 GPUs × 8 accumulation steps = 128 effective
Max sequence length | 4,096
Frozen modules | ViT, aligner
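
The batch size and step count reported above are consistent with a single pass over the training split. A short check of that arithmetic, assuming the final partial batch counts as a step (this depends on dataloader settings not stated here):

```python
import math

per_device_batch = 2
num_gpus = 8
grad_accum_steps = 8
train_samples = 1_361_636

# Samples consumed per optimizer step across all devices.
effective_batch = per_device_batch * num_gpus * grad_accum_steps
# One epoch, rounding up to include the last partial batch.
steps_per_epoch = math.ceil(train_samples / effective_batch)

print(effective_batch)   # 128
print(steps_per_epoch)   # 10638
```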

Intended Use

This model is designed for:

  • Style-conditioned image description: Generate descriptions that match a specified artistic style and color palette
  • Image-to-text pipelines: Use as a style-aware captioning component
  • Creative applications: Art style analysis, style transfer guidance, design mood boards

Limitations

  • Descriptions are concise (typically 1-2 sentences); not suitable for long-form art analysis
  • Style vocabulary is limited to the ~1,000 styles present in the training data
  • The model was trained with frozen ViT; visual grounding may not improve over the base model
  • Performance on out-of-distribution styles (not in training set) is not guaranteed

Citation

If you find this work helpful, please consider citing it:

@misc{qwen3-vl,
    title  = {{Qwen3-VL}},
    author = {{Qwen Team}},
    year   = {2026},
    url    = {https://qwen.ai/blog?id=qwen3-vl}
}
@misc{qwen3-vl-2b-megastyle,
    title={Qwen3-VL-2B-MegaStyle: Style-Conditioned Image Description via Full SFT},
    author={Kassadin88},
    year={2026},
    url={https://huggingface.co/Kassadin88/Qwen3-VL-2B-MegaStyle}
}