A3-preview

Project Artemis — Stage-1 alignment proof-of-concept. This is a preview, not a production VLM. It demonstrates that the Schneewolf Labs A-series text decoder can be successfully extended to vision-language with a small learned projector. A3-preview is the training milestone between A2 (text-only flagship) and A3 (the real multimodal release).

What this is

A LLaVA-style graft assembling three pieces:

Component	Source	Role
Vision tower	`Qwen/Qwen3-VL-2B-Instruct` (ViT, ~600M params)	Image → visual feature tokens
Projector	Fresh 2-layer MLP, ~45M params	Visual hidden → text hidden bridge
Language model	`schneewolflabs/A2` (~12B params)	Unchanged decoder

Only the projector was trained. The vision tower and decoder are frozen exactly as published, so A2's text capabilities (reasoning, tool calls, identity, Qwen3 chat template support) are preserved by construction.
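
To put "only the projector was trained" in numbers, the sketch below works through the parameter budget using the approximate sizes from the component table. The `mlp_param_count` helper and the layer widths passed to it are hypothetical illustrations of how a 2-layer MLP is sized; they are not the published projector dimensions and are not meant to reproduce the ~45M figure.

```python
def mlp_param_count(d_in: int, d_hidden: int, d_out: int) -> int:
    """Weights plus biases of a 2-layer MLP: d_in -> d_hidden -> d_out."""
    first_layer = d_in * d_hidden + d_hidden
    second_layer = d_hidden * d_out + d_out
    return first_layer + second_layer


# Approximate sizes quoted in the component table.
VISION_PARAMS = 600e6     # frozen vision tower
PROJECTOR_PARAMS = 45e6   # the only trained component
DECODER_PARAMS = 12e9     # frozen language model

total_params = VISION_PARAMS + PROJECTOR_PARAMS + DECODER_PARAMS
trainable_fraction = PROJECTOR_PARAMS / total_params

print(f"total      ~ {total_params / 1e9:.1f}B parameters")
print(f"trainable  ~ {trainable_fraction:.2%} of the graft")

# Hypothetical widths, for illustration only (not the real projector shape).
example_params = mlp_param_count(d_in=1024, d_hidden=4096, d_out=4096)
print(f"example MLP: {example_params / 1e6:.1f}M parameters")
```

With the table's figures the trainable share comes out well under 1% of the assembled model.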

Training details

Setting	Value
Corpus	`BLIP3o/BLIP3o-Pretrain-Long-Caption` (25,000 streamed samples)
Optimizer	AdamW (fp32 moments), lr 1e-3 cosine to 0
Effective batch	8 (bs=2 × grad_accum=4)
Steps	3,094 (1 epoch)
Precision	bfloat16
Wall clock	~3.4 hours on a single NVIDIA GB10 (DGX Spark)
Train loss	5.44 → 0.88
Eval loss	0.77 on held-out BLIP3o (below the final train loss, so no sign of memorization)
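
The schedule and batch settings in the table can be sketched as follows. This is a minimal illustration that assumes a plain cosine decay with no warmup (the card does not say whether warmup was used); `cosine_lr` is a hypothetical helper, not the training code.

```python
import math

BASE_LR = 1e-3       # from the table
TOTAL_STEPS = 3094   # from the table
MICRO_BATCH = 2      # bs
GRAD_ACCUM = 4       # grad_accum


def cosine_lr(step: int, total_steps: int = TOTAL_STEPS, base_lr: float = BASE_LR) -> float:
    """Cosine decay from base_lr at step 0 down to 0 at total_steps (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


effective_batch = MICRO_BATCH * GRAD_ACCUM
samples_seen = effective_batch * TOTAL_STEPS

print("effective batch:", effective_batch)
print("samples seen:", samples_seen)
for step in (0, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{cosine_lr(step):.2e}")
```

At 8 samples per optimizer step, 3,094 steps cover 24,752 samples, slightly fewer than the 25,000 streamed. The card does not state how the difference arises; a held-out evaluation split would be one explanation.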

What works

Tested on a small held-out battery (BLIP3o plus entirely out-of-distribution Japanese photos). The projector is image-grounded: captions describe what is actually in each image, including specific named objects on OOD inputs (brand text on bottles, identification of a "Gundam statue" at a specific "Lalaport" mall, etc.). This is the behavior we hoped for from Stage-1 alignment, and it motivates the full-scale Stage-1 run (A3).

What this is not

Not a production VLM. 25k samples is a fraction of what serious projector alignment needs (LLaVA-1.5 used 558k for projector pretraining; LLaVA-NeXT reports about 1.3M training samples in total across its stages).
Captions stay close to "describe the image" patterns. Visual reasoning, OCR, VQA, multi-image, and detailed counting were not trained for and won't work reliably.
No instruction tuning on multimodal data yet. That's Stage-2.
No safety / refusal tuning on visual inputs.

What's next

A3: full Stage-1 (~1M samples on BLIP3o-Long-Caption), currently training on a single NVIDIA GB10. A3 is the projector-aligned successor to A3-preview.
Artemis: Stage-2 (multimodal instruction full fine-tuning, FFT, with text rehearsal so that A2's reasoning, tool calling, and identity survive). This is the named flagship multimodal release after A3.

Install

pip install 'artemis-vlm @ git+https://github.com/Schneewolf-Labs/Artemis.git@v0.1.0'

The `artemis-vlm` package contains the model definition, processor, and data collator. On import, it registers `artemis_vlm` with Hugging Face `AutoConfig` and `AutoModelForCausalLM` so that `from_pretrained()` resolves without `trust_remote_code`.
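
The import-time registration described above follows a common registry pattern. In `transformers` the real entry points are `AutoConfig.register` and `AutoModelForCausalLM.register`; the sketch below is a dependency-free toy that only mimics the idea. Every name in it (`register`, `from_config`, `ToyArtemisVLM`) is hypothetical and unrelated to the actual `artemis_vlm` source.

```python
from typing import Callable, Dict

# Toy stand-in for an Auto* registry. This is NOT the transformers implementation.
_MODEL_REGISTRY: Dict[str, Callable[[dict], object]] = {}


def register(model_type: str, factory: Callable[[dict], object]) -> None:
    """Map a config's model_type string to the class that builds the model."""
    if model_type in _MODEL_REGISTRY:
        raise ValueError(f"{model_type!r} is already registered")
    _MODEL_REGISTRY[model_type] = factory


def from_config(config: dict) -> object:
    """Resolve the model class from config['model_type'], as an Auto* loader would."""
    model_type = config["model_type"]
    if model_type not in _MODEL_REGISTRY:
        raise KeyError(f"unknown model_type {model_type!r}; was the package imported?")
    return _MODEL_REGISTRY[model_type](config)


class ToyArtemisVLM:
    """Hypothetical placeholder for the real model class."""

    def __init__(self, config: dict) -> None:
        self.config = config


# In a real package this call would run at import time (e.g. in __init__.py),
# which is why importing the package is enough to make the loader work.
register("artemis_vlm", ToyArtemisVLM)

model = from_config({"model_type": "artemis_vlm"})
print(type(model).__name__)
```

Because the class is registered locally, the loader never has to fetch and execute code from the model repository, which is the job `trust_remote_code` would otherwise do.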

Usage

import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModelForCausalLM
import artemis_vlm  # registers ArtemisVLM with AutoConfig / AutoModel

# Load the grafted model in bfloat16 and put it in inference mode
model = AutoModelForCausalLM.from_pretrained(
    "schneewolflabs/A3-preview", dtype=torch.bfloat16,
).to("cuda").eval()

tok = AutoTokenizer.from_pretrained("schneewolflabs/A3-preview")
processor = artemis_vlm.ArtemisVLMProcessor(
    tokenizer=tok, vision_config=model.visual.config,
    min_pixels=32 * 32, max_pixels=512 * 512,
)

# Qwen3 chat-template style multimodal message
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image in detail."},
]}]
text = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Convert to RGB so grayscale or RGBA files do not surprise the vision tower
image = Image.open("your_image.jpg").convert("RGB")
batch = processor(text=text, images=[image], return_tensors="pt").to("cuda")
with torch.no_grad():
    out = model.generate(**batch, max_new_tokens=200, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt
prompt_len = batch["input_ids"].shape[1]
print(tok.decode(out[0][prompt_len:], skip_special_tokens=True))
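
The `min_pixels` / `max_pixels` arguments above bound how large an image the vision tower sees, and therefore how many visual tokens end up in the prompt. The sketch below shows a Qwen-VL-style resize rule under stated assumptions: that the processor snaps both sides to a multiple of 32 (16-pixel patches with a 2×2 merge) and emits one token per 32×32 block. Both `fit_to_pixel_budget` and `visual_token_count` are hypothetical helpers, not the `ArtemisVLMProcessor` API.

```python
import math


def fit_to_pixel_budget(height: int, width: int, factor: int = 32,
                        min_pixels: int = 32 * 32, max_pixels: int = 512 * 512):
    """Return (h, w), both multiples of `factor`, with area kept inside the budget."""
    h = max(factor, round(height / factor) * factor)
    w = max(factor, round(width / factor) * factor)
    if h * w > max_pixels:
        # Shrink, rounding down so the area never exceeds max_pixels.
        scale = math.sqrt((height * width) / max_pixels)
        h = max(factor, math.floor(height / scale / factor) * factor)
        w = max(factor, math.floor(width / scale / factor) * factor)
    elif h * w < min_pixels:
        # Enlarge, rounding up so the area reaches at least min_pixels.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / factor) * factor
        w = math.ceil(width * scale / factor) * factor
    return h, w


def visual_token_count(height: int, width: int, factor: int = 32) -> int:
    """One visual token per factor x factor block (assumed 16px patch, 2x2 merge)."""
    return (height // factor) * (width // factor)


h, w = fit_to_pixel_budget(1080, 1920)
print(h, w, visual_token_count(h, w))  # 384 672 252
```

Under these assumptions, `max_pixels=512 * 512` caps an image at 256 visual tokens, which keeps the prompt short for caption-style use.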

Architecture notes

A3-preview uses the Path B (composition, not modification) approach to extending a text LLM into a VLM: the decoder is untouched, the vision encoder is taken intact from a pretrained VLM, and only the projector between them is new. This keeps the underlying text model's reasoning, tool-calling, and identity capabilities exactly as in A2. For text-only inputs, the multimodal addition cannot regress text capability, because the decoder weights, and therefore the text computation path, are byte-identical to A2.
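
One way to check a "frozen by construction" claim like the one above is to hash each decoder tensor in both checkpoints and compare. The sketch below does that over plain `bytes` so it runs without any ML dependencies. The tensor names and the `language_model.` prefix are hypothetical, and in practice the raw bytes would come from the two checkpoints' safetensors files.

```python
import hashlib
from typing import Dict, List


def tensor_digests(state: Dict[str, bytes]) -> Dict[str, str]:
    """SHA-256 digest of each tensor's raw bytes, keyed by tensor name."""
    return {name: hashlib.sha256(raw).hexdigest() for name, raw in state.items()}


def changed_tensors(reference: Dict[str, bytes], candidate: Dict[str, bytes],
                    prefix: str = "") -> List[str]:
    """Reference tensors that are missing from, or differ in, the candidate.

    `prefix` is the namespace the reference weights live under inside the
    candidate checkpoint (hypothetical: "language_model.").
    """
    ref = tensor_digests(reference)
    cand = tensor_digests(candidate)
    return sorted(name for name, digest in ref.items()
                  if cand.get(prefix + name) != digest)


# Toy stand-ins for the two checkpoints (names and bytes are made up).
a2_weights = {"layers.0.weight": b"\x01\x02", "layers.1.weight": b"\x03\x04"}
a3_preview_weights = {
    "language_model.layers.0.weight": b"\x01\x02",
    "language_model.layers.1.weight": b"\x03\x04",
    "projector.0.weight": b"\x09\x09",  # new component, ignored by the check
}

print(changed_tensors(a2_weights, a3_preview_weights, prefix="language_model."))  # []
```

An empty list means every decoder tensor in the reference is present and unchanged in the candidate; extra tensors such as the projector do not affect the result.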

Image tokens are inserted using A2's repurposed reserved-token layout (`<|image_pad|>` is token id 22; see the A1 release notes for the full token-id allocation across `<think>`, `<tool_call>`, vision, etc.).
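
The insertion step can be sketched as two small operations: expand the image placeholder into one slot per visual token, then overwrite the embeddings at those slots with the projected visual features. This is a minimal pure-Python illustration assuming a single image and LLaVA/Qwen-style placeholder expansion. Only the id 22 comes from the card; the function names and the other token ids are hypothetical.

```python
from typing import List, Sequence

IMAGE_PAD_ID = 22  # <|image_pad|>, as stated in the card


def expand_image_placeholder(input_ids: Sequence[int], num_visual_tokens: int,
                             pad_id: int = IMAGE_PAD_ID) -> List[int]:
    """Repeat each placeholder so there is one slot per visual token."""
    expanded: List[int] = []
    for token in input_ids:
        expanded.extend([pad_id] * num_visual_tokens if token == pad_id else [token])
    return expanded


def splice_visual_features(input_ids: Sequence[int],
                           text_embeds: Sequence[List[float]],
                           visual_embeds: Sequence[List[float]],
                           pad_id: int = IMAGE_PAD_ID) -> List[List[float]]:
    """Replace the embedding at every pad slot with the next visual feature."""
    slots = sum(1 for token in input_ids if token == pad_id)
    if slots != len(visual_embeds):
        raise ValueError(f"{slots} image slots but {len(visual_embeds)} visual features")
    features = iter(visual_embeds)
    return [next(features) if token == pad_id else embed
            for token, embed in zip(input_ids, text_embeds)]


prompt_ids = [101, IMAGE_PAD_ID, 102]          # 101 / 102 are made-up text ids
expanded_ids = expand_image_placeholder(prompt_ids, num_visual_tokens=3)
text_embeds = [[float(token)] for token in expanded_ids]   # toy 1-d embeddings
visual_embeds = [[0.1], [0.2], [0.3]]                      # toy projector outputs

merged = splice_visual_features(expanded_ids, text_embeds, visual_embeds)
print(expanded_ids)  # [101, 22, 22, 22, 102]
print(merged)        # [[101.0], [0.1], [0.2], [0.3], [102.0]]
```

The decoder then consumes the merged sequence like any other embedded prompt, which is why it needs no architectural change.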

License

Apache 2.0. Same as A1, A2, and the underlying Qwen3-VL vision tower.

Acknowledgements

BLIP3o team for the Long-Caption pretraining corpus
Qwen team for the Qwen3-VL vision encoder
LLaVA project for the architectural template

— Schneewolf Labs · Project Artemis

Model size: 13B params (Safetensors), tensor type BF16
