This is a version of JoyCaption Beta One converted to MLX and quantized to mxfp4. Please refer to the original model card for more details.

It is intended for use with this fork of mlx-vlm.

Since this particular variant of Llava uses Siglip2 and Llama3 (the RoPE implementation in mlx_vlm is not compatible, and Siglip2 is not implemented for Llava), I had to create a custom model type and change the config.json so the model is detected as llava_joycaption instead of llava.
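For reference, the detection switch comes down to the model_type field in config.json; this fragment is illustrative only, with every other key omitted:

{
  "model_type": "llava_joycaption"
}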

I have also altered the chat_template.json to fit the parser used by mlx_vlm.

Please note that installing torchvision alongside mlx_vlm will result in an error when interpolation is executed (the error states that lanczos is not implemented in torch.nn.functional.interpolate).
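If you hit this, removing torchvision from the environment should avoid the failing code path (a workaround suggestion, not something the fork enforces):

pip uninstall torchvision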

Sample code for testing is shown below.

from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model_path = "n-Arno/joycaption-mlx-mxfp4"
model, processor = load(model_path)
config = load_config(model_path)

# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
# Local files also work as PIL.Image.Image objects (requires: from PIL import Image)
# image = [Image.open("...")]
prompt = "Write a long descriptive caption for this image in a formal tone."

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)

# Generate output
output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output)
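If the fork keeps upstream mlx-vlm's command-line entry point (an assumption; check the fork's README), the same test can be run without writing any Python:

python -m mlx_vlm.generate --model n-Arno/joycaption-mlx-mxfp4 --max-tokens 256 --prompt "Write a long descriptive caption for this image in a formal tone." --image http://images.cocodataset.org/val2017/000000039769.jpg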

The required fork can be installed by adding this line to requirements.txt:

mlx-vlm @ git+https://github.com/nArn0/mlx-vlm@main
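Equivalently, it can be installed directly with pip (quoted so the shell doesn't mangle the specifier):

pip install "mlx-vlm @ git+https://github.com/nArn0/mlx-vlm@main"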

(I'll try to propose a PR upstream, but since the code is a bit rough, I don't know whether it will be accepted.)
