Model Card for CowCorpus/llama3-llava-next-8b-cowcorpus

This model is a fine-tuned version of llava-hf/llama3-llava-next-8b-hf trained on the CowCorpus dataset.

This model is designed for the task of Human Intervention Prediction in collaborative web navigation. Unlike standard autonomous agents, this model predicts when a human user needs to take control from an AI agent. It utilizes multimodal inputs (screenshots, DOM trees, and action history) to distinguish between safe autonomous execution and moments requiring human error correction, preference alignment, or assistance.

It serves as a strong open-weight baseline for intervention modeling: it achieves a Perfect Timing Score (PTS) of 0.204, compared with 0.017 for the base LLaVA model, which scores 0.000 precision and recall on intervention steps.

Model Details

Model Description

Input Data

The model is trained on a rich, multimodal state representation:

  1. Visual Screenshot: The pixel-level view of the current webpage.
  2. UI Structure (AX Tree): The accessibility tree (textual representation of DOM).
  3. Past Trajectory: The history of actions taken by the agent/human so far.
  4. Proposed Next Action: The action that the autonomous agent intends to take. The model evaluates if this intent is erroneous.
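
The four inputs above can be sketched as a single state object whose textual parts are joined into one prompt. This is only an illustration: the actual prompt template lives in the GitHub repository, and the names `InterventionState`, `build_prompt`, and `max_tree_chars`, as well as the section labels and the final question, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InterventionState:
    """Hypothetical container for the four inputs listed above."""
    screenshot_path: str                                    # 1. visual screenshot (image file)
    ax_tree: str                                            # 2. accessibility tree as text
    past_actions: List[str] = field(default_factory=list)   # 3. trajectory so far
    proposed_action: str = ""                               # 4. agent's intended next action


def build_prompt(state: InterventionState, max_tree_chars: int = 4000) -> str:
    """Assemble the textual part of the input (assumed layout, not the official template).

    The screenshot is not embedded here; it would be passed to the
    model's image processor separately.
    """
    # Number the past actions, or mark the history as empty.
    history = "\n".join(
        f"{i + 1}. {action}" for i, action in enumerate(state.past_actions)
    ) or "(none)"
    # Truncate long accessibility trees to keep the prompt bounded.
    tree = state.ax_tree[:max_tree_chars]
    return (
        f"Accessibility tree:\n{tree}\n\n"
        f"Past actions:\n{history}\n\n"
        f"Proposed next action:\n{state.proposed_action}\n\n"
        "Should the human intervene before this action is executed?"
    )
```

A call such as `build_prompt(InterventionState("shot.png", "button 'Buy'", ["click search"], "click 'Buy'"))` returns one string containing all three textual inputs, with the screenshot handled out of band.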

How to Get Started

For inference code, prompt templates, and setup instructions, please refer to our GitHub Repository.

Training Details

Training Data

The model was trained on CowCorpus, which contains 400 collaborative trajectories:

  • Dataset Size: ~4,200 total steps (2,748 Agent steps, 1,476 Human steps).
  • Task Diversity: 200 Standardized Tasks (Mind2Web) and 200 Free-form User Tasks.
  • Annotations: Steps are labeled with specific intervention reasons: Error Correction, Preference Alignment, or Assistive Action.

Training Configuration

  • Hyperparameters:
    • Learning Rate: Linear decay from 1e-5 to ~2e-9
    • Epochs: 6
    • Global Steps: 120
    • Batch Size: 1
    • Precision: bfloat16
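
As a rough illustration of the learning-rate entry above, the sketch below interpolates linearly between the two reported endpoints over the 120 global steps. The card does not state the exact scheduler, whether warmup was used, or at which step the final value was logged, so the per-step interpolation and the function name `linear_decay_lr` are assumptions.

```python
def linear_decay_lr(
    step: int,
    total_steps: int = 120,    # global steps reported in the card
    lr_start: float = 1e-5,    # initial learning rate reported in the card
    lr_end: float = 2e-9,      # approximate final learning rate reported in the card
) -> float:
    """Linearly interpolate the learning rate from lr_start to lr_end.

    Assumed schedule for illustration only; the real training run may
    have used a different scheduler or a warmup phase.
    """
    if total_steps < 2:
        raise ValueError("total_steps must be at least 2")
    # Clamp the step into the valid range [0, total_steps - 1].
    step = min(max(step, 0), total_steps - 1)
    fraction = step / (total_steps - 1)
    return lr_start + fraction * (lr_end - lr_start)
```

Under these assumptions, `linear_decay_lr(0)` returns the starting rate and the value decreases by the same amount at every step until the last one.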

Evaluation

The model was evaluated on the CowCorpus test set. We report Step Accuracy, Intervention metrics (Precision, Recall, F1), and the Perfect Timing Score (PTS), which measures the temporal accuracy of intervention predictions.
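
The step-level metrics can be sketched as follows, assuming each step carries a binary label (1 = intervention, 0 = no intervention) and that precision, recall, and F1 are defined as 0.0 when their denominator is zero. The function name `intervention_metrics` is hypothetical, and PTS is omitted because its formula is not given in this card.

```python
from typing import Dict, Sequence


def intervention_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute step accuracy and precision/recall/F1 on intervention steps.

    Assumes binary labels: 1 = intervention step, 0 = no intervention.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one step is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly predicted interventions
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed interventions
    correct = sum(1 for t, p in pairs if t == p)

    # Zero-denominator cases are mapped to 0.0 (an assumed convention).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

    return {
        "step_accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

For example, with five true intervention steps of which one is predicted, plus one false alarm, precision is 0.5 and recall is 0.2, which gives an F1 of about 0.286.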

Precision, Recall, and F1 are computed on intervention steps; PTS is the timing score.

Model                  Step Accuracy   Precision   Recall   F1      PTS
LLaVA 8B (CowCorpus)   0.861           0.500       0.200    0.286   0.204
Claude 4 Sonnet        0.705           0.190       0.343    0.245   0.302
Gemini 2.5 Pro         0.697           0.211       0.429    0.283   0.253
GPT-4o                 0.753           0.186       0.229    0.205   0.147
LLaVA 8B (Base)        0.857           0.000       0.000    0.000   0.017

Note: All models are evaluated in a zero-shot setting without reasoning.

Citation

If you use this model or dataset, please cite our work: Paper incoming
