This is a version of Joycaption Beta One converted and quantized to mxfp4; please refer to the original model card for more details.
It is intended for use with this fork of mlx-vlm.
Since this particular flavor of Llava uses Siglip2 and Llama3 (the implementation of RoPE in mlx_vlm is not compatible, and Siglip2 is not implemented for Llava), I had to create a custom model type and change the `config.json` for detection (from `llava` to `llava_joycaption`).
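The detection change boils down to rewriting the `model_type` field in `config.json`. A minimal sketch, assuming the standard Hugging Face config layout (the surrounding keys here are placeholders, not the actual file contents):

```python
import json

# Hypothetical stand-in for the original config.json contents.
config = {"model_type": "llava", "vision_config": {}, "text_config": {}}

# Point the loader at the custom model class added in the fork.
config["model_type"] = "llava_joycaption"

# Serialize back the way you would write it to disk.
print(json.dumps(config, indent=2))
```

Model loaders typically dispatch on `model_type`, which is why the custom value is enough for the fork to pick the right implementation.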
I have also altered the `chat_template.json` to fit the parser used by mlx_vlm.
Please note that installing torchvision alongside mlx_vlm will cause an error when interpolation is executed (lanczos is not implemented in `torch.nn.functional.interpolate`).
Sample code for testing:
```python
import mlx.core as mx
from mlx_vlm import load, generate
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import load_config

# Load the model
model_path = "n-Arno/joycaption-mlx-mxfp4"
model, processor = load(model_path)
config = load_config(model_path)

# Prepare input
image = ["http://images.cocodataset.org/val2017/000000039769.jpg"]
# image = [Image.open("...")] can also be used with PIL.Image.Image objects
prompt = "Write a long descriptive caption for this image in a formal tone."

# Apply chat template
formatted_prompt = apply_chat_template(
    processor, config, prompt, num_images=len(image)
)

# Generate output
output = generate(model, processor, formatted_prompt, image, verbose=False)
print(output)
```
The specific fork needed can be installed using this line in `requirements.txt`:

```
mlx-vlm @ git+https://github.com/nArn0/mlx-vlm@main
```
(I'll try proposing a PR, but since the code is a bit rough, I don't know whether it will be accepted.)