A3-preview
Project Artemis — Stage-1 alignment proof-of-concept. This is a preview, not a production VLM. It demonstrates that the Schneewolf Labs A-series text decoder can be successfully extended to vision-language with a small learned projector. A3-preview is the training milestone between A2 (text-only flagship) and A3 (the real multimodal release).
What this is
A LLaVA-style graft assembling three pieces:
| Component | Source | Role |
|---|---|---|
| Vision tower | Qwen/Qwen3-VL-2B-Instruct (ViT, ~600M params) | Image → visual feature tokens |
| Projector | Fresh 2-layer MLP, ~45M params | Visual hidden → text hidden bridge |
| Language model | schneewolflabs/A2 (~12B params) | Unchanged decoder |
Only the projector was trained. The vision tower and decoder are frozen exactly as published, so A2's text capabilities (reasoning, tool calls, identity, Qwen3 chat template support) are preserved by construction.
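To put "only the projector was trained" in numbers, the sketch below counts the parameters of a generic 2-layer MLP and computes the trainable share from the approximate sizes in the component table. The function names and the example widths are illustrative assumptions; this card does not state the projector's actual layer dimensions.

```python
def mlp_projector_params(d_vision: int, d_hidden: int, d_text: int, bias: bool = True) -> int:
    """Parameter count of Linear(d_vision, d_hidden) -> activation -> Linear(d_hidden, d_text)."""
    first = d_vision * d_hidden + (d_hidden if bias else 0)
    second = d_hidden * d_text + (d_text if bias else 0)
    return first + second


def trainable_fraction(trainable: float, frozen: float) -> float:
    """Share of all parameters that receive gradient updates."""
    return trainable / (trainable + frozen)


# Approximate sizes taken from the component table above.
VISION_TOWER = 600e6
PROJECTOR = 45e6
DECODER = 12e9

share = trainable_fraction(PROJECTOR, VISION_TOWER + DECODER)
print(f"trainable share: {share:.2%}")  # about 0.36% of all parameters

# Hypothetical widths, used only to exercise the formula.
print(mlp_projector_params(d_vision=1024, d_hidden=4096, d_text=4096))
```

Under these figures, well under one percent of the combined model is updated during Stage-1.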
Training details
| Setting | Value |
|---|---|
| Corpus | BLIP3o/BLIP3o-Pretrain-Long-Caption (25,000 streamed samples) |
| Optimizer | AdamW (fp32 moments), lr 1e-3 cosine to 0 |
| Effective batch | 8 (bs=2 × grad_accum=4) |
| Steps | 3,094 (1 epoch) |
| Precision | bfloat16 |
| Wall clock | ~3.4 hours on a single NVIDIA GB10 (DGX Spark) |
| Train loss | 5.44 → 0.88 |
| Eval loss | 0.77 on held-out BLIP3o (better than train — not memorizing) |
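As a rough illustration of the schedule and batch settings in the table, the sketch below implements a plain cosine decay from 1e-3 to 0 over 3,094 steps and works out how many samples those steps cover at an effective batch of 8. Whether the actual run used warmup is not stated here, so none is modeled; `cosine_lr` is a hypothetical helper, not the training script.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float = 1e-3, min_lr: float = 0.0) -> float:
    """Cosine decay from base_lr at step 0 to min_lr at total_steps (no warmup)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


TOTAL_STEPS = 3094
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 4

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM   # 8
samples_seen = TOTAL_STEPS * effective_batch      # 24,752, close to the 25,000 streamed

for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, cosine_lr(step, TOTAL_STEPS))
```

The learning rate is halved at the midpoint (step 1,547) and reaches zero on the final step.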
What works
Tested on a small held-out battery (BLIP3o + entirely out-of-distribution Japanese photos). The projector is image-grounded — captions describe what's actually in each image, including specific named objects on OOD inputs (brand text on bottles, identification of a "Gundam statue" at a specific "Lalaport" mall, etc.). This is what we hoped for from Stage-1 alignment and it sets up a real Stage-1 run.
What this is not
- Not a production VLM. 25k samples is a fraction of what serious projector alignment needs (LLaVA-1.5 used 558k; LLaVA-NeXT used 1.3M).
- Captions stay close to "describe the image" patterns. Visual reasoning, OCR, VQA, multi-image, and detailed counting were not trained for and won't work reliably.
- No instruction tuning on multimodal data yet. That's Stage-2.
- No safety / refusal tuning on visual inputs.
What's next
- A3 — full Stage-1 (~1M samples on BLIP3o-Long-Caption) currently training on a single NVIDIA GB10. A3 is the projector-aligned successor to A3-preview.
- Artemis — Stage-2 (multimodal instruction FFT with text rehearsal so A2's reasoning / tool calling / identity survive). The named flagship multimodal release after A3.
Install
pip install 'artemis-vlm @ git+https://github.com/Schneewolf-Labs/Artemis.git@v0.1.0'
The artemis-vlm package contains
the model definition, processor, and data collator. On import, it registers
artemis_vlm with HuggingFace AutoConfig and AutoModelForCausalLM so
from_pretrained() resolves without trust_remote_code.
Usage
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
import artemis_vlm # registers ArtemisVLM with AutoConfig / AutoModel
model = AutoModelForCausalLM.from_pretrained(
    "schneewolflabs/A3-preview", dtype=torch.bfloat16,
).to("cuda").eval()
tok = AutoTokenizer.from_pretrained("schneewolflabs/A3-preview")
processor = artemis_vlm.ArtemisVLMProcessor(
    tokenizer=tok, vision_config=model.visual.config,
    min_pixels=32 * 32, max_pixels=512 * 512,
)
# Qwen3 chat-template style multimodal message
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image in detail."},
]}]
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
from PIL import Image
image = Image.open("your_image.jpg")
batch = processor(text=text, images=[image], return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**batch, max_new_tokens=200, do_sample=False)
# Decode only the newly generated tokens (skip the prompt portion)
print(tok.decode(out[0][batch["input_ids"].shape[1]:], skip_special_tokens=True))
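The `min_pixels` and `max_pixels` arguments bound the image area passed to the vision tower. This card does not document how `ArtemisVLMProcessor` applies them, so the sketch below only assumes a Qwen-VL-style resize: keep the aspect ratio roughly fixed, snap each side to a multiple of a patch-grid factor, and scale the image into the pixel budget. `resize_within_budget` and `factor=32` are hypothetical, not taken from the package.

```python
import math


def resize_within_budget(height: int, width: int, factor: int = 32,
                         min_pixels: int = 32 * 32, max_pixels: int = 512 * 512):
    """Return (h, w), multiples of `factor`, with h * w inside the pixel budget."""
    # Snap each side to the nearest multiple of `factor` (at least one patch).
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same ratio, rounding down.
        beta = math.sqrt((height * width) / max_pixels)
        h = max(factor, math.floor(height / beta / factor) * factor)
        w = max(factor, math.floor(width / beta / factor) * factor)
    elif h * w < min_pixels:
        # Too small: grow both sides by the same ratio, rounding up.
        beta = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * beta / factor) * factor
        w = math.ceil(width * beta / factor) * factor
    return h, w


print(resize_within_budget(768, 1024))  # a 768x1024 photo shrinks to fit 512*512 pixels
print(resize_within_budget(320, 480))   # already inside the budget, unchanged
```

Under this assumption a larger `max_pixels` means more visual tokens per image and a longer prompt, which is the main cost to weigh when raising it.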
Architecture notes
A3-preview uses the Path B (composition, not modification) approach to extending a text LLM into a VLM: the decoder is untouched, the vision encoder is taken intact from a pretrained VLM, and only the projector between them is new. This keeps the underlying text model's reasoning, tool, and identity capabilities exactly as in A2: the decoder weights are frozen, so for text-only inputs the computation path is byte-identical and the multimodal addition cannot regress text capability.
Image tokens are inserted using A2's repurposed reserved-token layout
(<|image_pad|> is token id 22 — see the A1 release notes for the
full token-id allocation across <think>, <tool_call>, vision, etc.).
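The usual LLaVA-style way to insert image tokens is to swap the embedding at every `<|image_pad|>` position for one projected visual feature, in order. The sketch below shows that merge on plain Python lists; `merge_visual_features` is a hypothetical helper written for illustration, and the actual Artemis implementation may differ. Only the token id 22 comes from this card.

```python
IMAGE_PAD_ID = 22  # <|image_pad|> id as stated in this card


def merge_visual_features(input_ids, text_embeds, visual_embeds, image_pad_id=IMAGE_PAD_ID):
    """Replace the embedding at each image_pad position with the next visual feature.

    input_ids:     list[int], length T
    text_embeds:   list of T embedding vectors from the decoder's embedding table
    visual_embeds: list of projector outputs, one per image_pad slot
    """
    n_slots = sum(1 for tok in input_ids if tok == image_pad_id)
    if n_slots != len(visual_embeds):
        raise ValueError(f"{n_slots} image_pad slots but {len(visual_embeds)} visual features")
    features = iter(visual_embeds)
    # Text positions keep their original embedding; pad positions take a visual one.
    return [next(features) if tok == image_pad_id else emb
            for tok, emb in zip(input_ids, text_embeds)]


ids = [101, 22, 22, 57]
text = [[0.1], [0.0], [0.0], [0.4]]
vision = [[9.0], [8.0]]
print(merge_visual_features(ids, text, vision))
```

Because only the pad positions change, a prompt with no image placeholders passes through this step untouched, which is consistent with the composition approach described above.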
License
Apache 2.0. Same as A1, A2, and the underlying Qwen3-VL vision tower.
Acknowledgements
- BLIP3o team for the Long-Caption pretraining corpus
- Qwen team for the Qwen3-VL vision encoder
- LLaVA project for the architectural template
— Schneewolf Labs · Project Artemis