# Model Card for CowCorpus/llama3-llava-next-8b-cowcorpus
This model is a fine-tuned version of `llava-hf/llama3-llava-next-8b-hf` trained on the CowCorpus dataset.

This model is designed for the task of Human Intervention Prediction in collaborative web navigation. Unlike standard autonomous agents, this model predicts when a human user needs to take control from an AI agent. It uses multimodal inputs (screenshots, DOM trees, and action history) to distinguish between safe autonomous execution and moments requiring human error correction, preference alignment, or assistance.

It serves as a strong open-weight baseline for intervention modeling, achieving a Perfect Timing Score (PTS) of 0.204, a substantial improvement over the base LLaVA model, which fails to predict interventions (PTS 0.017).
## Model Details

### Model Description
- Developed by: CowCorpus Team (Huq et al.)
- Model type: LLaVA-NeXT 8B (LLaMA-3 based)
- Base model: llava-hf/llama3-llava-next-8b-hf
- Language: English
- License: Llama 3 Community License
- Paper: Modeling Distinct Human Interventions in Web Navigation
- Repository: [oaishi/CowCorpus](https://github.com/oaishi/CowCorpus)
### Input Data
The model is trained on a rich, multimodal state representation (a hypothetical serialization sketch follows this list):
- Visual Screenshot: The pixel-level view of the current webpage.
- UI Structure (AX Tree): The accessibility tree, a textual representation of the DOM.
- Past Trajectory: The history of actions taken by the agent/human so far.
- Proposed Next Action: The action that the autonomous agent intends to take next. The model evaluates whether this intended action is erroneous or otherwise warrants human intervention.
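To make the representation concrete, here is a minimal sketch of how the textual part of the state might be serialized into a prompt. The field names, layout, and answer labels are illustrative assumptions, not the paper's exact template (see the GitHub repository for that); the screenshot is passed separately as the image input.

```python
# Hypothetical prompt serialization for the textual state. Field names,
# ordering, and answer labels are assumptions for illustration only.
def build_prompt(ax_tree: str, history: list[str], proposed_action: str) -> str:
    past = "\n".join(f"{i + 1}. {a}" for i, a in enumerate(history)) or "(none)"
    return (
        f"UI Structure (AX Tree):\n{ax_tree}\n\n"
        f"Past Trajectory:\n{past}\n\n"
        f"Proposed Next Action: {proposed_action}\n"
        "Should a human take control before this action executes? "
        "Answer NO, ERROR_CORRECTION, PREFERENCE_ALIGNMENT, or ASSISTIVE_ACTION."
    )
```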
## How to Get Started

For inference code, prompt templates, and setup instructions, please refer to our [GitHub repository](https://github.com/oaishi/CowCorpus).
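In the meantime, the model loads with the standard `transformers` LLaVA-NeXT API. The sketch below uses a placeholder for the serialized textual state; the exact intervention-prediction prompt is defined in the repository.

```python
import torch
from PIL import Image
from transformers import LlavaNextForConditionalGeneration, LlavaNextProcessor

model_id = "CowCorpus/llama3-llava-next-8b-cowcorpus"
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Screenshot of the current webpage plus the serialized textual state
# (AX tree, past trajectory, proposed next action).
image = Image.open("screenshot.png")
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "<serialized AX tree, history, and proposed action>"},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(output[0], skip_special_tokens=True))
```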
## Training Details

### Training Data
The model was trained on CowCorpus, a dataset of 400 collaborative human–AI web navigation trajectories:
- Dataset Size: ~4,200 total steps (2,748 Agent steps, 1,476 Human steps).
- Task Diversity: 200 Standardized Tasks (Mind2Web) and 200 Free-form User Tasks.
- Annotations: Steps are labeled with specific intervention reasons: Error Correction, Preference Alignment, or Assistive Action.
### Training Configuration

- Hyperparameters:
  - Learning rate: linear decay from 1e-5 to ~2e-9
  - Epochs: 6
  - Global steps: 120
  - Batch size: 1
  - Precision: bfloat16
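For illustration, a `transformers` `TrainingArguments` sketch mirroring these hyperparameters is shown below. The optimizer, warmup, and gradient-accumulation settings are assumptions; the actual training script is in the GitHub repository.

```python
from transformers import TrainingArguments

# Sketch mirroring the hyperparameters listed above. The card reports a
# linear decay from 1e-5 to ~2e-9 over 120 global steps; other settings
# here are assumptions.
training_args = TrainingArguments(
    output_dir="llama3-llava-next-8b-cowcorpus",
    learning_rate=1e-5,
    lr_scheduler_type="linear",
    num_train_epochs=6,
    per_device_train_batch_size=1,
    bf16=True,
    logging_steps=10,
)
```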
## Evaluation
The model was evaluated on the CowCorpus test set. We report Step Accuracy, Intervention metrics (Precision, Recall, F1), and the Perfect Timing Score (PTS), which measures the temporal accuracy of intervention predictions.
| Model | Step Accuracy | Intervention Precision | Intervention Recall | Intervention F1 | PTS |
|---|---|---|---|---|---|
| LLaVA 8B (CowCorpus) | 0.861 | 0.500 | 0.200 | 0.286 | 0.204 |
| Claude 4 Sonnet | 0.705 | 0.190 | 0.343 | 0.245 | 0.302 |
| Gemini 2.5 Pro | 0.697 | 0.211 | 0.429 | 0.283 | 0.253 |
| GPT-4o | 0.753 | 0.186 | 0.229 | 0.205 | 0.147 |
| LLaVA 8B (Base) | 0.857 | 0.000 | 0.000 | 0.000 | 0.017 |
Note: All models are evaluated in a zero-shot setting without reasoning.
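For reference, a minimal sketch of the step-level metrics computed from per-step binary labels (1 = human intervention needed). PTS is defined in the paper and is not reproduced here.

```python
# Step accuracy plus precision/recall/F1 on intervention steps, from
# per-step binary labels (1 = human intervention needed). PTS is a
# paper-specific timing metric and is not reproduced here.
def intervention_metrics(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    step_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return {"step_accuracy": step_accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```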
## Citation

If you use this model or dataset, please cite our work. (Paper forthcoming.)