Commit · 579652a
1 Parent(s): 8b10144
update on README.md
README.md
CHANGED
|
@@ -11,284 +11,577 @@ tags:
|
|
| 11 |
|
| 12 |
# 🔧 API Integration Debugging Environment
|
| 13 |
|
| 14 |
-
>
|
| 15 |
|
| 16 |
-
[](https://python.org)
|
| 18 |
-
[![
|
|
|
|
| 19 |
|
| 20 |
-
|
| 21 |
|
| 22 |
-
|
| 23 |
|
| 24 |
-
|
| 25 |
-
- **Dependency awareness**: understanding which service failures affect which downstream services
|
| 26 |
-
- **Strategic reasoning**: fixing upstream issues first to unmask downstream problems
|
| 27 |
|
| 28 |
-
|
| 29 |
|
| 30 |
-
|
| 31 |
|
| 32 |
```
|
| 33 |
-
|
| 34 |
-
|
| 35 |
-
|
| 36 |
-
|
| 37 |
-
|
| 38 |
-
|
| 39 |
-
|
| 40 |
-
|
| 41 |
-
|
| 42 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 43 |
```
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 44 |
|
| 45 |
-
|
| 46 |
-
|
| 47 |
-
Each task models a real multi-service system with dependency chains:
|
| 48 |
-
|
| 49 |
-
```mermaid
|
| 50 |
-
graph LR
|
| 51 |
-
A[order_service] --> B[inventory_service]
|
| 52 |
-
B --> C[shipping_service]
|
| 53 |
-
A --> D[api_gateway]
|
| 54 |
-
B --> E[auth_service]
|
| 55 |
-
style A fill:#ff6b6b
|
| 56 |
-
style B fill:#ffd93d
|
| 57 |
-
style C fill:#6bcb77
|
| 58 |
-
style D fill:#6bcb77
|
| 59 |
-
style E fill:#6bcb77
|
| 60 |
```
|
| 61 |
|
| 62 |
-
|
|
|
|
|
|
|
| 63 |
|
| 64 |
-
##
|
| 65 |
|
| 66 |
-
|
| 67 |
|
| 68 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 69 |
|
| 70 |
-
|
| 71 |
-
2. **Dynamic logs**: After fixing an issue, re-inspecting logs shows *new entries* reflecting the fix
|
| 72 |
-
3. **Cascading effects**: Fixing an upstream issue can change downstream service behavior
|
| 73 |
-
4. **Error trace**: Shows the full error propagation chain, shrinking as issues are fixed
|
| 74 |
|
| 75 |
-
|
| 76 |
|
| 77 |
-
|
|
| 78 |
-
|---
|
| 79 |
-
|
|
| 80 |
-
|
|
| 81 |
-
|
|
| 82 |
-
|
| 83 |
-
|
| 84 |
-
|
| 85 |
-
|
| 86 |
-
|
| 87 |
-
|
| 88 |
-
|
| 89 |
-
|
| 90 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 91 |
|
| 92 |
## Tasks
|
| 93 |
|
| 94 |
-
### Easy: Payment API Integration (2 issues, 15 steps)
|
| 95 |
|
| 96 |
-
|
| 97 |
|
| 98 |
-
- **Issue pool**: 4 possible issues, 2 selected per episode
|
| 99 |
- **Services**: `payment_client`, `payment_gateway`
|
| 100 |
-
- **Issue
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 101 |
|
| 102 |
-
### Medium: Webhook Event Chain (3 issues, 25 steps)
|
| 103 |
|
| 104 |
-
|
| 105 |
|
| 106 |
-
- **Issue pool**: 5 possible issues, 3 selected per episode
|
| 107 |
- **Services**: `webhook_sender`, `webhook_receiver`, `notification_service`
|
| 108 |
-
- **Issue
|
| 109 |
-
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 110 |
|
| 111 |
-
### Hard: E-Commerce Order Pipeline (5 issues, 40 steps)
|
| 112 |
|
| 113 |
-
|
| 114 |
|
| 115 |
-
- **Issue pool**: 7 possible issues, 5 selected per episode
|
| 116 |
- **Services**: `order_service`, `inventory_service`, `shipping_service`, `api_gateway`, `auth_service`
|
| 117 |
-
- **Issue
|
| 118 |
-
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 119 |
|
| 120 |
-
## Grading Rubric
|
| 121 |
|
| 122 |
-
The grader uses a **
|
| 123 |
|
| 124 |
-
| Dimension | Weight |
|
| 125 |
-
|---
|
| 126 |
-
| **Fix Score** | 40% | `issues_fixed / total_issues` |
|
| 127 |
-
| **
|
| 128 |
-
| **
|
| 129 |
-
| **
|
| 130 |
|
| 131 |
```
|
| 132 |
-
Final Score = fix × 0.40 +
|
| 133 |
-
Clamped to (0.001, 0.999)
|
| 134 |
```
|
| 135 |
|
| 136 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 137 |
|
| 138 |
-
|
| 139 |
-
|
|
|
|
|
|
|
| 140 |
| Easy | ~0.75 | 7 | 2/2 |
|
| 141 |
| Medium | ~0.55 | 10 | 3/3 |
|
| 142 |
| Hard | ~0.45 | 15 | 5/5 |
|
| 143 |
|
| 144 |
-
*
|
| 145 |
|
| 146 |
## Action & Observation Spaces
|
| 147 |
|
| 148 |
-
### Action
|
| 149 |
|
| 150 |
```json
|
| 151 |
{
|
| 152 |
"action_type": "inspect_logs | inspect_config | inspect_endpoint | submit_fix",
|
| 153 |
-
"target": "service_name",
|
| 154 |
"fix_payload": {
|
| 155 |
"config_key": "corrected_value"
|
| 156 |
}
|
| 157 |
}
|
| 158 |
```
|
| 159 |
|
| 160 |
-
|
| 161 |
|
| 162 |
```json
|
| 163 |
{
|
| 164 |
"task_id": "easy",
|
| 165 |
-
"task_description": "...",
|
| 166 |
-
"logs": ["[ERROR] ..."],
|
| 167 |
-
"config_snapshot": {"headers": {"Content-Type": "text/plain"}},
|
| 168 |
-
"api_response": {"status": "error", "status_code": 401},
|
| 169 |
"service_status": {"payment_client": "error", "payment_gateway": "healthy"},
|
| 170 |
-
"dependency_graph": {"payment_client": ["payment_gateway"]},
|
| 171 |
-
"error_trace": [
|
|
|
|
|
|
|
|
|
|
|
|
|
| 172 |
"remaining_steps": 14,
|
| 173 |
"issues_found": 1,
|
| 174 |
"issues_fixed": 0,
|
| 175 |
"issues_total": 2,
|
| 176 |
-
"
|
| 177 |
-
"available_targets": ["payment_client", "payment_gateway"]
|
|
|
|
|
|
|
| 178 |
}
|
| 179 |
```
|
| 180 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 181 |
## Example Transcript
|
| 182 |
|
| 183 |
```
|
| 184 |
>>> reset(task_id="easy")
|
| 185 |
-
task_description: "
|
| 186 |
service_status: {payment_client: "error", payment_gateway: "healthy"}
|
| 187 |
-
error_trace:
|
| 188 |
-
|
| 189 |
-
|
| 190 |
-
|
| 191 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 192 |
]
|
| 193 |
-
|
| 194 |
-
>>> step(inspect_logs, target=payment_client)
|
| 195 |
-
logs: ["[ERROR] POST /process -> 401 Unauthorized", ...]
|
| 196 |
issues_found: 2, reward: +0.15
|
| 197 |
|
| 198 |
-
>>> step(inspect_config, target=payment_client)
|
| 199 |
-
|
| 200 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 201 |
|
| 202 |
-
>>> step(submit_fix, target=payment_client
|
| 203 |
-
|
| 204 |
-
|
| 205 |
-
|
|
|
|
| 206 |
|
| 207 |
-
>>> step(inspect_logs, target=payment_client)
|
| 208 |
-
logs: [...original
|
| 209 |
-
|
|
|
|
|
|
|
| 210 |
|
| 211 |
-
>>> step(submit_fix, target=payment_client
|
| 212 |
-
|
|
|
|
| 213 |
service_status: {payment_client: "healthy", payment_gateway: "healthy"}
|
| 214 |
error_trace: ["All issues resolved. No error cascades active."]
|
| 215 |
-
reward: +0.50 (
|
|
|
|
| 216 |
|
| 217 |
>>> grade()
|
| 218 |
-
score: 0.82
|
|
|
|
|
|
|
|
|
|
|
|
|
| 219 |
```
|
| 220 |
|
|
|
|
|
|
|
| 221 |
## Setup & Usage
|
| 222 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 223 |
### Install Dependencies
|
| 224 |
|
| 225 |
```bash
|
| 226 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
| 227 |
uv sync
|
|
|
|
|
|
|
|
|
|
| 228 |
```
|
| 229 |
|
| 230 |
-
### Run Locally
|
| 231 |
|
| 232 |
```bash
|
| 233 |
-
|
|
|
|
| 234 |
```
|
| 235 |
|
| 236 |
-
|
|
|
|
|
|
|
| 237 |
|
| 238 |
```bash
|
| 239 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 240 |
```
|
| 241 |
|
| 242 |
-
### Docker
|
| 243 |
|
| 244 |
```bash
|
| 245 |
-
|
|
|
|
| 246 |
docker run -p 8000:8000 api_debug_env
|
| 247 |
```
|
| 248 |
|
| 249 |
-
|
|
|
|
|
|
|
| 250 |
|
| 251 |
| Endpoint | Method | Description |
|
| 252 |
-
|---
|
| 253 |
-
| `/` | GET | Environment info
|
| 254 |
-
| `/reset` | POST | Reset environment |
|
| 255 |
-
| `/step` | POST | Execute
|
| 256 |
-
| `/state` | GET | Get current state |
|
| 257 |
-
| `/
|
| 258 |
-
| `/
|
| 259 |
-
| `/
|
| 260 |
-
| `/
|
| 261 |
-
|
| 262 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 263 |
|
| 264 |
```bash
|
| 265 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 266 |
python inference.py
|
| 267 |
```
|
| 268 |
|
| 269 |
-
|
| 270 |
|
| 271 |
-
|
| 272 |
|
| 273 |
-
|
| 274 |
-
|
| 275 |
-
|
| 276 |
-
|
| 277 |
-
|
| 278 |
-
6. **Cascading Dynamics**: Upstream fixes change downstream state, requiring multi-step reasoning
|
| 279 |
|
| 280 |
## Project Structure
|
| 281 |
|
| 282 |
```
|
| 283 |
-
|
| 284 |
-
βββ
|
| 285 |
-
βββ
|
| 286 |
-
βββ
|
| 287 |
-
βββ
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 288 |
βββ server/
|
| 289 |
-
β βββ
|
| 290 |
-
β βββ
|
| 291 |
-
β
|
|
|
|
|
|
|
|
|
|
|
|
|
| 292 |
βββ tests/
|
| 293 |
-
βββ test_environment.py
|
| 294 |
```
|
| 11 |
|
| 12 |
# 🔧 API Integration Debugging Environment
|
| 13 |
|
| 14 |
+
> A real-world OpenEnv environment where an AI agent diagnoses and fixes broken API integrations across multi-service systems with **cascading failures**, **dynamic state**, and **multi-dimensional rubric grading**.
|
| 15 |
|
| 16 |
+
[](https://github.com/meta-pytorch/OpenEnv)
|
| 17 |
[](https://python.org)
|
| 18 |
+
|
| 19 |
+
[](https://huggingface.co/spaces/yadnyeshkolte/api-debug-env)
|
| 20 |
|
| 21 |
+
---
|
| 22 |
+
|
| 23 |
+
## Table of Contents
|
| 24 |
+
|
| 25 |
+
- [Motivation – Why API Debugging?](#motivation--why-api-debugging)
|
| 26 |
+
- [Environment Overview](#environment-overview)
|
| 27 |
+
- [Key Design Features](#key-design-features)
|
| 28 |
+
- [Tasks (Easy / Medium / Hard)](#tasks)
|
| 29 |
+
- [Multi-Dimensional Grading Rubric](#multi-dimensional-grading-rubric)
|
| 30 |
+
- [Reward Shaping](#reward-shaping)
|
| 31 |
+
- [Action & Observation Spaces](#action--observation-spaces)
|
| 32 |
+
- [Example Transcript](#example-transcript)
|
| 33 |
+
- [Setup & Usage](#setup--usage)
|
| 34 |
+
- [API Endpoints](#api-endpoints)
|
| 35 |
+
- [Running Inference](#running-inference)
|
| 36 |
+
- [Running Tests](#running-tests)
|
| 37 |
+
- [Project Structure](#project-structure)
|
| 38 |
+
- [Design Philosophy](#design-philosophy)
|
| 39 |
+
|
| 40 |
+
---
|
| 41 |
|
| 42 |
+
## Motivation – Why API Debugging?
|
| 43 |
|
| 44 |
+
API integration failures are among the **most common and expensive issues** in production software engineering. When microservices communicate (Service A calls Service B, which calls Service C), a single misconfiguration can cascade through the entire system, producing confusing error chains that take hours to diagnose.
|
|
|
|
|
|
|
| 45 |
|
| 46 |
+
Real-world API debugging requires:
|
| 47 |
|
| 48 |
+
- **Structured diagnosis**: reading error logs and configs across multiple services
- **Dependency awareness**: understanding which upstream failure is causing downstream errors
- **Strategic reasoning**: fixing root causes first to unmask hidden downstream bugs
- **Precision**: submitting exact configuration corrections, not approximate guesses
|
| 52 |
+
|
| 53 |
+
This environment simulates **real-world cascading API failures** with dynamic state that changes as the agent acts, rather than a static lookup puzzle.
|
| 54 |
+
|
| 55 |
+
---
|
| 56 |
|
| 57 |
+
## Environment Overview
|
| 58 |
+
|
| 59 |
+
```
┌──────────────────────────────────────────────────────────────────────┐
│                         Agent Debugging Loop                         │
│                                                                      │
│  1. reset(task_id)         → Initial observation with broken state   │
│  2. step(inspect_logs)     → Error logs with diagnostic clues        │
│  3. step(inspect_config)   → Current (broken) service configuration  │
│  4. step(inspect_endpoint) → Simulated API response (401, 504..)     │
│  5. step(submit_fix)       → Strict fix validation + cascade update  │
│  6. grade()                → Multi-dimensional rubric score [0,1]    │
│                                                                      │
│  State updates dynamically: service health changes, new logs         │
│  appear, error cascades resolve as the agent fixes issues.           │
└──────────────────────────────────────────────────────────────────────┘
```
|
| 74 |
+
|
| 75 |
+
The agent interacts through the standard OpenEnv API:
|
| 76 |
+
- **`reset()`**: returns initial observation with broken service state
- **`step(action)`**: executes one debugging action, returns observation + reward
- **`state()`**: returns current environment state (`episode_id`, `step_count`)
- **`grade()`**: returns final score using multi-dimensional rubric
|
| 80 |
+
|
| 81 |
+
---
|
| 82 |
+
|
| 83 |
+
## Key Design Features
|
| 84 |
+
|
| 85 |
+
### 1. Cascading Failures with Service Dependency Graphs
|
| 86 |
+
|
| 87 |
+
Each task models a real multi-service ecosystem. Services depend on each other, and a bug in an upstream service **cascades** to all downstream services:
|
| 88 |
+
|
| 89 |
```
Hard Task Dependency Graph:

order_service ──┬──→ inventory_service ──┬──→ shipping_service
                │                        └──→ auth_service
                └──→ api_gateway

[ERROR]              [DEGRADED]               [HEALTHY]
```
|
| 98 |
|
| 99 |
+
- Fixing `order_service`'s wrong URL unmasks `inventory_service`'s timeout issue
|
| 100 |
+
- Fixing `inventory_service`'s expired token allows `shipping_service` to respond
|
| 101 |
+
- **Some issues are intentionally masked by upstream failures**, so the agent must apply fixes in the right order
|
| 102 |
|
| 103 |
+
### 2. Dynamic State
|
| 104 |
|
| 105 |
+
Unlike static environments, the state **changes as the agent acts**:
|
| 106 |
|
| 107 |
+
| What changes | How |
|---|---|
| **Service health** | Fixing issues updates service status: `error` → `degraded` → `healthy` |
| **Logs** | After a fix, re-inspecting logs shows **new entries** (e.g., "Authorization header set. Retrying...") |
| **Error traces** | The cascade chain shrinks as upstream issues are resolved |
| **Endpoint responses** | `inspect_endpoint` returns different HTTP errors based on current fix state |
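
The service-health row can be pictured with a small sketch. The function name `service_health` and the rule itself (nothing fixed yet gives `error`, partly fixed gives `degraded`, nothing left gives `healthy`) are assumptions chosen only to mirror the `error` → `degraded` → `healthy` progression described here; the environment's real logic may differ.

```python
def service_health(total_issues: int, fixed_issues: int) -> str:
    """Hypothetical rule mapping fix progress on one service to a status."""
    if total_issues < 0 or not 0 <= fixed_issues <= total_issues:
        raise ValueError("fixed_issues must be between 0 and total_issues")
    remaining = total_issues - fixed_issues
    if remaining == 0:
        return "healthy"   # nothing left to fix (or the service never had issues)
    if fixed_issues == 0:
        return "error"     # no progress yet
    return "degraded"      # some issues fixed, some still open
```

With two issues on a service, zero, one, and two fixes would map to `error`, `degraded`, and `healthy` respectively under this assumed rule.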
|
| 113 |
|
| 114 |
+
### 3. Seed-Based Scenario Randomization
|
|
|
|
|
|
|
|
|
|
| 115 |
|
| 116 |
+
Each difficulty level has an **expanded issue pool** (more issues than are selected per episode):
|
| 117 |
|
| 118 |
+
| Difficulty | Pool Size | Selected Per Episode |
|---|---|---|
| Easy | 4 issues | 2 |
| Medium | 5 issues | 3 |
| Hard | 7 issues | 5 |
|
| 123 |
+
|
| 124 |
+
Passing a `seed` to `reset()` produces a **deterministic but varied** scenario: different seeds select different subsets from the pool and randomize log order. This prevents agents from memorizing fixed patterns.
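
A minimal sketch of how seeded selection could work, using only the standard library. The issue names in `EASY_POOL` and the helpers `select_issues` and `shuffled_logs` are placeholders invented for illustration, not the identifiers used in `scenarios.py`.

```python
import random
from typing import List, Optional

# Placeholder issue names for the Easy pool (4 possible, 2 selected).
EASY_POOL = ["missing_auth", "wrong_content_type", "low_timeout", "deprecated_url"]


def select_issues(pool: List[str], k: int, seed: Optional[int] = None) -> List[str]:
    """Pick k distinct issues; the same seed always yields the same subset."""
    rng = random.Random(seed)  # private generator, global random state untouched
    return rng.sample(pool, k)


def shuffled_logs(logs: List[str], seed: Optional[int] = None) -> List[str]:
    """Return a reordered copy of the logs, reproducible for a given seed."""
    rng = random.Random(seed)
    reordered = list(logs)
    rng.shuffle(reordered)
    return reordered
```

Using a dedicated `random.Random(seed)` instance keeps each episode reproducible without interfering with any other code that relies on the module-level generator.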
|
| 125 |
+
|
| 126 |
+
### 4. Strict Fix Validation with Partial Credit
|
| 127 |
+
|
| 128 |
+
The grader validates both **keys and values** of submitted fixes:
|
| 129 |
+
|
| 130 |
+
- **Exact match** → Full credit (+0.25 reward)
- **Right key, close value** (e.g., timeout=7 when expected=10) → Partial credit (+0.03)
- **Right key, wrong value** (e.g., timeout=100 when expected=10) → Rejected
- **Wrong key entirely** → Penalized (-0.1)
- **Bearer token pattern matching**: `Bearer <any_valid_token>` is accepted
- **Numeric tolerance**: strict 10% tolerance
- **Boolean coercion**: `"true"`, `"1"`, `"yes"` all match `True`
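
The three value-matching rules at the end of this list (Bearer pattern, 10% numeric tolerance, boolean coercion) can be sketched as follows. This is an illustration, not the grader's code: the function name `values_match`, the exact regular expression, and the falsy strings `"false"`, `"0"`, `"no"` are assumptions. The threshold that separates a "close" value (partial credit) from a rejected one is not stated here, so partial credit is left out.

```python
import re

BEARER_RE = re.compile(r"^Bearer\s+\S+$")   # assumed shape of a valid token string
TRUTHY = {"true", "1", "yes"}               # listed in the rules above
FALSY = {"false", "0", "no"}                # assumption: mirror image of TRUTHY


def _as_bool(value):
    """Coerce bools and known strings to a bool; return None if unrecognized."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        text = value.strip().lower()
        if text in TRUTHY:
            return True
        if text in FALSY:
            return False
    return None


def values_match(expected, submitted, tolerance=0.10):
    """Return True when submitted counts as a full match for expected."""
    if isinstance(expected, bool):          # check bool first: bool is an int subclass
        return _as_bool(submitted) is expected
    if isinstance(expected, (int, float)):
        if isinstance(submitted, bool) or not isinstance(submitted, (int, float)):
            return False
        if expected == 0:
            return submitted == 0
        return abs(submitted - expected) <= tolerance * abs(expected)
    if isinstance(expected, str) and BEARER_RE.match(expected):
        return isinstance(submitted, str) and bool(BEARER_RE.match(submitted))
    return expected == submitted
```

Under these assumptions, `timeout=7` and `timeout=100` both fail the full-match test against an expected value of 10; deciding which of them still earns partial credit would need the grader's own closeness rule.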
|
| 137 |
+
|
| 138 |
+
---
|
| 139 |
|
| 140 |
## Tasks
|
| 141 |
|
| 142 |
+
### Easy: Payment API Integration (2 issues, 15 max steps)
|
| 143 |
|
| 144 |
+
**Scenario**: A payment processing client is failing to connect to the payment gateway. The agent must diagnose authentication and protocol errors.
|
| 145 |
|
|
|
|
| 146 |
- **Services**: `payment_client`, `payment_gateway`
- **Issue pool** (4 possible, 2 selected):
  - Missing `Authorization` header (HTTP 401)
  - Wrong `Content-Type` header: `text/plain` instead of `application/json` (HTTP 415)
  - Timeout too low for payment processing (HTTP 504)
  - Base URL pointing to deprecated v1 endpoint (HTTP 301)
- **Dependencies**: None (straightforward diagnosis)
|
| 153 |
|
| 154 |
+
### Medium: Webhook Event Chain (3 issues, 25 max steps)
|
| 155 |
|
| 156 |
+
**Scenario**: A webhook notification system is dropping events across a 3-service chain. Events flow from sender → receiver → notification service, but multiple configuration issues are causing failures.
|
| 157 |
|
|
|
|
| 158 |
- **Services**: `webhook_sender`, `webhook_receiver`, `notification_service`
- **Issue pool** (5 possible, 3 selected):
  - Rate limit mismatch (sender at 100/s, receiver accepts 10/s) → 429 errors
  - Insufficient retry config (only 1 retry, no backoff, 429 not in retry list)
  - Empty webhook signature header → receiver drops all events as unsigned
  - Wrong target URL (`/webhook` vs `/hooks/incoming`) → 404 errors
  - Payload compression enabled but receiver doesn't support gzip → 415 errors
- **Dependencies**: The retry issue is **masked** by the rate limit; the rate limit must be fixed first to expose the retry problem
|
| 166 |
|
| 167 |
+
### Hard: E-Commerce Order Pipeline (5 issues, 40 max steps)
|
| 168 |
|
| 169 |
+
**Scenario**: A complex e-commerce order processing pipeline is failing with cascading errors across 5 services. Multiple dependency chains make this genuinely challenging for frontier models.
|
| 170 |
|
|
|
|
| 171 |
- **Services**: `order_service`, `inventory_service`, `shipping_service`, `api_gateway`, `auth_service`
- **Issue pool** (7 possible, 5 selected):
  - Deprecated URL (`/v1/check`, should be `/v2/reserve`) → 301 redirect
  - Timeout too short (2s vs 4s processing time), masked by wrong URL
  - Synchronous mode causing race conditions between concurrent orders
  - Expired auth token on inventory→shipping calls → 401
  - No auto token refresh configured, masked by expired token
  - No circuit breaker, so failed requests hammer inventory service
  - Missing idempotency key, so retries create duplicate orders
- **Dependencies**: `timeout` depends on `wrong_url` fix; `token_refresh` depends on `expired_token` fix; `idempotency` depends on `async` fix
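
One way to turn the dependency list above into a valid fix order is a topological sort. The sketch below uses the standard-library `graphlib` module; the issue identifiers are taken from the Dependencies bullet, while the mapping `HARD_DEPENDENCIES` and the helper `fix_order` are illustrative names, not part of the environment.

```python
from graphlib import TopologicalSorter

# Each issue maps to the set of issues that must be fixed before it.
HARD_DEPENDENCIES = {
    "timeout": {"wrong_url"},
    "token_refresh": {"expired_token"},
    "idempotency": {"async"},
}


def fix_order(dependencies):
    """Return the issues in an order where prerequisites always come first."""
    return list(TopologicalSorter(dependencies).static_order())
```

Any order returned by `fix_order(HARD_DEPENDENCIES)` places `wrong_url` before `timeout`, `expired_token` before `token_refresh`, and `async` before `idempotency`; issues without dependencies can be slotted in anywhere.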
|
| 181 |
+
|
| 182 |
+
---
|
| 183 |
|
| 184 |
+
## Multi-Dimensional Grading Rubric
|
| 185 |
|
| 186 |
+
The grader uses a **4-dimension weighted rubric**, not a simple `issues_fixed / total` ratio:
|
| 187 |
|
| 188 |
+
| Dimension | Weight | What It Measures |
|---|---|---|
| **Fix Score** | 40% | `issues_fixed / total_issues`: how many bugs were actually resolved |
| **Strategy Score** | 25% | Did the agent follow a logical approach? Inspect before fix, avoid repeats, follow dependency order, use all action types |
| **Diagnosis Score** | 20% | Did the agent inspect the service (logs/config) **before** submitting a fix for it? |
| **Efficiency Score** | 15% | `remaining_steps / max_steps`: faster solutions score higher |
|
| 194 |
|
| 195 |
```
Final Score = fix × 0.40 + strategy × 0.25 + diagnosis × 0.20 + efficiency × 0.15
Clamped to (0.001, 0.999), never exactly 0.0 or 1.0
```
|
| 199 |
|
| 200 |
+
**Strategy scoring details:**
|
| 201 |
+
- Did the agent inspect logs/config before submitting any fix? (+1)
|
| 202 |
+
- Ratio of unique inspections to total inspections (no wasteful repeats) (+1)
|
| 203 |
+
- Did fixes follow the optimal dependency order? (+1)
|
| 204 |
+
- Did the agent use a variety of action types? (+1)
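
The weighted sum and clamp from the rubric can be written out directly. In this sketch the weights and clamp bounds come from the formula above, while the function names `final_score` and `efficiency_score` are illustrative; how the grader derives the strategy and diagnosis inputs is not modeled.

```python
WEIGHTS = {"fix": 0.40, "strategy": 0.25, "diagnosis": 0.20, "efficiency": 0.15}


def efficiency_score(remaining_steps: int, max_steps: int) -> float:
    """remaining_steps / max_steps, as stated in the rubric table."""
    return remaining_steps / max_steps


def final_score(fix: float, strategy: float, diagnosis: float, efficiency: float) -> float:
    """Weighted sum of the four dimension scores, clamped to [0.001, 0.999]."""
    raw = (
        fix * WEIGHTS["fix"]
        + strategy * WEIGHTS["strategy"]
        + diagnosis * WEIGHTS["diagnosis"]
        + efficiency * WEIGHTS["efficiency"]
    )
    return min(0.999, max(0.001, raw))
```

Because of the clamp, a perfect run scores 0.999 rather than 1.0 and an empty run scores 0.001 rather than 0.0.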
|
| 205 |
|
| 206 |
+
### Baseline Scores (Rule-Based Heuristic Agent)
|
| 207 |
+
|
| 208 |
+
| Task | Score | Steps Used | Issues Fixed |
|---|---|---|---|
| Easy | ~0.75 | 7 | 2/2 |
| Medium | ~0.55 | 10 | 3/3 |
| Hard | ~0.45 | 15 | 5/5 |
|
| 213 |
|
| 214 |
+
*The baseline uses a deterministic heuristic (inspect all logs → inspect all configs → submit known fixes). An LLM-based agent following good debugging strategy can score higher.*
|
| 215 |
+
|
| 216 |
+
---
|
| 217 |
+
|
| 218 |
+
## Reward Shaping
|
| 219 |
+
|
| 220 |
+
Every action produces a meaningful reward signal, not just sparse end-of-episode feedback:
|
| 221 |
+
|
| 222 |
+
| Action | Reward | Condition |
|---|---|---|
| `inspect_logs` (first time, finds error patterns) | **+0.15** | New issue-related log patterns found |
| `inspect_logs` (first time, no issues here) | +0.05 | Valid inspection, no errors in this service |
| `inspect_logs` (repeat, no new info) | 0.00 | Already inspected, nothing changed |
| `inspect_logs` (repeat, after a fix) | +0.05 | Dynamic logs appeared after a recent fix |
| `inspect_config` (service has issues) | +0.05 | Relevant config retrieved |
| `inspect_config` (service is clean) | +0.01 | Config retrieved but no issues here |
| `inspect_config` (repeat) | 0.00 | Already inspected |
| `inspect_endpoint` | +0.02 to +0.05 | Simulated endpoint test |
| `submit_fix` (correct fix) | **+0.25** | Issue resolved, service health updated |
| `submit_fix` (correct + inspected first) | **+0.30** | Fix + strategy bonus for diagnosis |
| `submit_fix` (partial: close but not exact) | +0.03 | Right key, approximately right value |
| `submit_fix` (wrong fix) | **-0.10** | Incorrect fix payload |
| `submit_fix` (empty payload) | -0.10 | Empty fix_payload submitted |
| All issues fixed | **+0.20** | Episode completion bonus |
| Invalid target / invalid action | -0.05 | Bad input |
| Every step | **-0.01** | Step cost, which encourages efficiency |
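
The `submit_fix` rows of the table compose additively, which a short sketch makes explicit. The helper `submit_fix_reward` is hypothetical; it covers only correct and wrong fixes, and it leaves out the per-step cost and partial credit, since the table does not say how those combine with the other terms.

```python
def submit_fix_reward(correct: bool,
                      inspected_first: bool = False,
                      completes_episode: bool = False) -> float:
    """Combine the submit_fix rewards from the table (step cost not included)."""
    if not correct:
        return -0.10          # wrong or empty fix payload
    reward = 0.25             # base reward for a correct fix
    if inspected_first:
        reward += 0.05        # strategy bonus for diagnosing before fixing
    if completes_episode:
        reward += 0.20        # bonus once every issue is fixed
    return round(reward, 2)   # avoid floating-point noise such as 0.30000000000000004
```

A correct fix after inspection yields 0.30, and the same fix that also finishes the episode yields 0.50.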
|
| 240 |
+
|
| 241 |
+
---
|
| 242 |
|
| 243 |
## Action & Observation Spaces
|
| 244 |
|
| 245 |
+
### Action Schema (Pydantic model: `ApiDebugAction`)
|
| 246 |
|
| 247 |
```json
{
  "action_type": "inspect_logs | inspect_config | inspect_endpoint | submit_fix",
  "target": "<service_name>",
  "fix_payload": {
    "config_key": "corrected_value"
  }
}
```
|
| 256 |
|
| 257 |
+
- `action_type` (required): One of the 4 debugging actions
- `target` (required): The service to act on (from `available_targets` in the observation)
- `fix_payload` (optional): The configuration correction; needed only for `submit_fix`
|
| 260 |
+
|
| 261 |
+
**Fix payload formats:**
|
| 262 |
+
```json
// Simple key-value fix
{"timeout": 10}

// Nested key fix (dot notation)
{"headers.Authorization": "Bearer my_api_key"}

// Complex nested object fix
{"retry": {"max_retries": 3, "backoff_factor": 2, "retry_on_status": [429, 500]}}
```
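
To show what the dot-notation form implies, here is a sketch of applying a fix payload to a nested config dictionary. The helper `apply_fix` is hypothetical and is not the environment's validator; it assumes every intermediate key in a dotted path either is missing or already holds a dictionary.

```python
import copy


def apply_fix(config: dict, fix_payload: dict) -> dict:
    """Return a copy of config with each (possibly dotted) key set to its value."""
    updated = copy.deepcopy(config)              # leave the caller's config untouched
    for dotted_key, value in fix_payload.items():
        *parents, leaf = dotted_key.split(".")   # "headers.Authorization" -> ["headers"], "Authorization"
        node = updated
        for part in parents:
            node = node.setdefault(part, {})     # descend, creating levels as needed
        node[leaf] = value
    return updated
```

For example, applying `{"headers.Authorization": "Bearer my_api_key"}` to a config whose `headers` already contain `Content-Type` adds the new header and keeps the existing one.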
|
| 272 |
+
|
| 273 |
+
### Observation Schema (Pydantic model: `ApiDebugObservation`)
|
| 274 |
|
| 275 |
```json
{
  "task_id": "easy",
  "task_description": "A payment processing API integration is failing...",
  "logs": ["[ERROR] 2026-03-25T10:15:23Z POST /process -> 401 Unauthorized", "..."],
  "config_snapshot": {"headers": {"Content-Type": "text/plain"}, "timeout": 30},
  "api_response": {"status": "error", "status_code": 401, "error": "Missing Authorization"},
  "service_status": {"payment_client": "error", "payment_gateway": "healthy"},
  "dependency_graph": {"payment_client": ["payment_gateway"], "payment_gateway": []},
  "error_trace": [
    "[CRITICAL] payment_client: Missing Authorization header",
    "  └─> payment_gateway: All requests rejected with 401"
  ],
  "hints": ["Check headers.Authorization"],
  "remaining_steps": 14,
  "issues_found": 1,
  "issues_fixed": 0,
  "issues_total": 2,
  "action_result": "Inspected logs for 'payment_client'. Found relevant error patterns!",
  "available_targets": ["payment_client", "payment_gateway"],
  "done": false,
  "reward": 0.15
}
```
|
| 299 |
|
| 300 |
+
**Key observation fields for agent reasoning:**
|
| 301 |
+
- `service_status`: shows which services are healthy/degraded/error (updates dynamically)
- `dependency_graph`: shows service relationships (agent should fix upstream first)
- `error_trace`: shows active error cascades (shrinks as issues are fixed)
- `hints`: progressive hints that get more specific as steps are used
|
| 305 |
+
|
| 306 |
+
---
|
| 307 |
+
|
| 308 |
## Example Transcript
|
| 309 |
|
| 310 |
```
>>> reset(task_id="easy")
task_description: "A payment processing API integration is failing..."
service_status: {payment_client: "error", payment_gateway: "healthy"}
error_trace:
  [CRITICAL] payment_client: Missing Authorization header
    └─> payment_gateway: All requests rejected with 401
  [ERROR] payment_client: Wrong Content-Type (text/plain instead of application/json)
    └─> payment_gateway: Request body parsing fails
issues_total: 2, remaining_steps: 15

>>> step(action_type="inspect_logs", target="payment_client")
logs: [
  "[INFO] Payment client initialized...",
  "[ERROR] POST /process -> 401 Unauthorized",
  "[ERROR] Response: {'error': 'Missing or invalid Authorization header'}",
  "[WARN] Request headers: Content-Type=text/plain",
  "[ERROR] POST /process -> 415 Unsupported Media Type",
]
issues_found: 2, reward: +0.15

>>> step(action_type="inspect_config", target="payment_client")
config_snapshot: {
  "base_url": "https://api.paymentgateway.com/v2",
  "headers": {"Content-Type": "text/plain", "Accept": "application/json"},
  "timeout": 30
}
reward: +0.05  // Service has issues, first inspection

>>> step(action_type="submit_fix", target="payment_client",
         fix_payload={"headers.Authorization": "Bearer sk_live_my_key"})
action_result: "Fix accepted! Fixed 1 issue(s). Total: 1/2"
service_status: {payment_client: "degraded", payment_gateway: "healthy"}
reward: +0.30  // Fix (+0.25) + strategy bonus (+0.05) for inspecting first

>>> step(action_type="inspect_logs", target="payment_client")
logs: [...original logs...,
  "[INFO] Authorization header set. Retrying request..."  // NEW dynamic log!
]
reward: +0.05  // Re-inspection has new dynamic logs

>>> step(action_type="submit_fix", target="payment_client",
         fix_payload={"headers.Content-Type": "application/json"})
action_result: "Fix accepted! All issues fixed! Episode complete."
service_status: {payment_client: "healthy", payment_gateway: "healthy"}
error_trace: ["All issues resolved. No error cascades active."]
reward: +0.50  // Fix (+0.25) + strategy (+0.05) + completion bonus (+0.20)
done: true

>>> grade()
score: 0.82
fix_score: 1.00 (2/2 fixed)
diagnosis_score: 1.00 (inspected before every fix)
efficiency_score: 0.67 (5/15 steps used)
strategy_score: 0.80 (inspected first, used multiple action types)
```
|
| 366 |
|
| 367 |
+
---
|
| 368 |
+
|
| 369 |
## Setup & Usage
|
| 370 |
|
| 371 |
+
### Prerequisites
|
| 372 |
+
|
| 373 |
+
- Python 3.10+
|
| 374 |
+
- [uv](https://docs.astral.sh/uv/) (recommended) or pip
|
| 375 |
+
- Docker (for containerized deployment)
|
| 376 |
+
|
| 377 |
### Install Dependencies
|
| 378 |
|
| 379 |
```bash
# Clone the repository
git clone https://github.com/yadnyeshkolte/openenv-task.git
cd openenv-task

# Install dependencies with uv
uv sync

# Or with pip
pip install -e .
```
|
| 390 |
|
| 391 |
+
### Run the Server Locally
|
| 392 |
|
| 393 |
```bash
# From the project root (openenv-task/)
uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
```
|
| 397 |
|
| 398 |
+
The server will be available at `http://localhost:8000`. Visit `http://localhost:8000/docs` for interactive API documentation.
|
| 399 |
+
|
| 400 |
+
### Quick Test
|
| 401 |
|
| 402 |
```bash
# Reset environment
curl -X POST http://localhost:8000/reset

# Inspect logs
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "inspect_logs", "target": "payment_client"}'

# Submit a fix
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "submit_fix", "target": "payment_client", "fix_payload": {"headers.Authorization": "Bearer my_key"}}'
```
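
The same `/step` calls can be issued from Python with only the standard library. This is a sketch that mirrors the curl commands above: the helper names `build_step_request` and `send` are illustrative, and it assumes the server replies with a JSON body, which is not confirmed here.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # local server address used in the curl examples


def build_step_request(action_type, target, fix_payload=None, base_url=BASE_URL):
    """Build (but do not send) a POST /step request with a JSON action body."""
    body = {"action_type": action_type, "target": target}
    if fix_payload is not None:
        body["fix_payload"] = fix_payload
    return urllib.request.Request(
        f"{base_url}/step",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send a prepared request and decode the response, assuming it is JSON."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `send(build_step_request("inspect_logs", "payment_client"))` would perform the same call as the second curl command.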
|
| 416 |
|
| 417 |
+
### Docker Build & Run
|
| 418 |
|
| 419 |
```bash
# From the project root (openenv-task/)
docker build -t api_debug_env -f Dockerfile .
docker run -p 8000:8000 api_debug_env
```
|
| 424 |
|
| 425 |
+
---
|
| 426 |
+
|
| 427 |
+
## API Endpoints
|
| 428 |
|
| 429 |
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Environment info, version, and feature list |
| `/reset` | POST | Reset environment (accepts `task_id` and `seed` params) |
| `/step` | POST | Execute a debugging action |
| `/state` | GET | Get current state (episode_id, step_count) |
| `/schema` | GET | Get action/observation Pydantic schemas |
| `/tasks` | GET | List all 3 tasks with action schema and service dependencies |
| `/grader` | POST | Get multi-dimensional grader score for current episode |
| `/baseline` | POST | Run the rule-based baseline agent on all 3 tasks |
| `/health` | GET | Health check endpoint |
| `/docs` | GET | Interactive Swagger UI documentation |
|
| 441 |
+
|
| 442 |
+
---
|
| 443 |
+
|
| 444 |
+
## Running Inference
|
| 445 |
+
|
| 446 |
+
The `inference.py` script at the project root uses the OpenAI API client to run an LLM agent against all 3 tasks:
|
| 447 |
|
| 448 |
```bash
# Set your API credentials
export HF_TOKEN=your_huggingface_token
# Optional: override model and API base
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export API_BASE_URL=https://router.huggingface.co/v1

# Run inference from the project root
python inference.py
```
|
| 458 |
|
| 459 |
+
**Output format** (stdout):
|
| 460 |
+
```
[START] task=easy env=api_debug_env model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=inspect_logs(target=payment_client) reward=0.15 done=false error=null
[STEP] step=2 action=submit_fix(target=payment_client, fix={...}) reward=0.30 done=false error=null
...
[END] success=true steps=5 score=0.820 rewards=0.15,0.30,...
```
|
| 467 |
+
|
| 468 |
+
The inference script:
|
| 469 |
+
- Uses `openai.OpenAI` client for all LLM calls
|
| 470 |
+
- Reads `HF_TOKEN` (or `API_KEY`) from environment variables
|
| 471 |
+
- Includes retry logic with exponential backoff
|
| 472 |
+
- Emits `[START]`, `[STEP]`, `[END]` lines to stdout
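
The retry behaviour mentioned in the list above can be pictured with a generic helper like the one below. This is a sketch of exponential backoff in general, not the code in `inference.py`; the function name, default delays, and the 10% jitter are assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(fn: Callable[[], T], max_attempts: int = 5,
                      base_delay: float = 1.0, max_delay: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call fn, retrying on any exception with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the last error to the caller.
            # Delay doubles each attempt (base, 2*base, 4*base, ...), capped.
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Small random jitter so parallel clients do not retry in lockstep.
            sleep(delay + random.uniform(0, delay * 0.1))
    raise RuntimeError("max_attempts must be at least 1")
```

An LLM call would be wrapped as a zero-argument callable, for example `call_with_backoff(lambda: make_llm_request(prompt))`, where `make_llm_request` stands in for whatever function issues the request.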
|
| 473 |
+
|
| 474 |
+
---
|
| 475 |
+
|
| 476 |
+
## Running Tests
|
| 477 |
+
|
| 478 |
+
```bash
|
| 479 |
+
# From the project root (openenv-task/)
|
| 480 |
+
python -m pytest tests/ -v --tb=short
|
| 481 |
+
```
|
| 482 |
+
|
| 483 |
+
**70 tests** across 12 test classes covering:
|
| 484 |
+
- Scenario loading, seed randomization, and issue pool selection
|
| 485 |
+
- Environment reset and initialization
|
| 486 |
+
- All 4 action types: `inspect_logs`, `inspect_config`, `inspect_endpoint`, `submit_fix`
|
| 487 |
+
- Dynamic state: service health updates, dynamic log injection, error trace changes
|
| 488 |
+
- Multi-dimensional grading rubric (fix, diagnosis, efficiency, strategy)
|
| 489 |
+
- Strict fix validation with partial credit
|
| 490 |
+
- Value matching (strings, numbers, booleans, lists, Bearer tokens)
|
| 491 |
+
- Full episode integration tests (easy, medium, hard)
|
| 492 |
+
- Cascading failure mechanics and dependency chains
|
| 493 |
+
- Episode termination conditions
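
To make the rubric item above concrete, here is one way four dimension scores could be folded into a single value between 0.0 and 1.0. The dimension names come from the list above; the weights, the clamping, and the function name are purely hypothetical and do not describe the real grader.

```python
# Hypothetical weights; the actual grader's weighting is not documented here.
WEIGHTS = {"fix": 0.4, "diagnosis": 0.2, "efficiency": 0.2, "strategy": 0.2}


def combine_scores(dimensions: dict) -> float:
    """Weighted sum of per-dimension scores, each clamped to [0.0, 1.0]."""
    total = 0.0
    for name, weight in WEIGHTS.items():
        # Missing dimensions count as 0.0; out-of-range values are clamped.
        value = min(1.0, max(0.0, dimensions.get(name, 0.0)))
        total += weight * value
    return round(total, 3)
```

A test in the same spirit as the suite described above would assert that the combined score never leaves the 0.0 to 1.0 range, whatever the individual dimensions report.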
|
| 494 |
|
| 495 |
+
### Validate OpenEnv Compliance
|
| 496 |
|
| 497 |
+
```bash
|
| 498 |
+
openenv validate
|
| 499 |
+
```
|
| 500 |
+
|
| 501 |
+
---
|
|
|
|
| 502 |
|
| 503 |
## Project Structure
|
| 504 |
|
| 505 |
```
|
| 506 |
+
openenv-task/                          # Project root
├── __init__.py                        # Package init (exports ApiDebugEnv, Action, Observation)
├── client.py                          # OpenEnv client (WebSocket connection to server)
├── models.py                          # Pydantic Action & Observation type definitions
├── scenarios.py                       # Task scenarios with dependency graphs & issue pools
├── inference.py                       # MANDATORY inference script (LLM agent, OpenAI client)
├── openenv.yaml                       # OpenEnv metadata (spec v1)
├── pyproject.toml                     # Python project config & dependencies
├── Dockerfile                         # Docker build for HF Spaces deployment
├── LICENSE                            # BSD license
├── README.md                          # This file
├── PROGRESS.md                        # Development session log
├── AGENTS.md                          # Instructions for AI coding agents
├── server/
│   ├── __init__.py                    # Server package init
│   ├── api_debug_env_environment.py   # Core environment (reset/step/grade logic)
│   ├── app.py                         # FastAPI endpoints (/reset, /step, /tasks, etc.)
│   ├── Dockerfile                     # Alternate Dockerfile (same as root)
│   └── requirements.txt               # Server-specific requirements
├── scripts/
│   └── baseline_inference.py          # Alternate baseline script
└── tests/
    └── test_environment.py            # 70 unit & integration tests
|
| 529 |
```
|
| 530 |
+
|
| 531 |
+
### Key Files
|
| 532 |
+
|
| 533 |
+
| File | Purpose |
|---|---|
| `server/api_debug_env_environment.py` | **Core logic**: `reset()`, `step()`, `grade()`, dynamic state, cascading failures |
| `scenarios.py` | **Task definitions**: issue pools, dependency graphs, dynamic logs, service configs |
| `models.py` | **Type definitions**: `ApiDebugAction` and `ApiDebugObservation` Pydantic models |
| `inference.py` | **Mandatory**: LLM-based agent using OpenAI client with `[START]/[STEP]/[END]` output |
| `openenv.yaml` | **Mandatory**: OpenEnv spec v1 metadata with task definitions |
| `server/app.py` | **FastAPI server**: all HTTP endpoints including `/baseline` and `/grader` |
|
| 541 |
+
|
| 542 |
+
---
|
| 543 |
+
|
| 544 |
+
## Design Philosophy
|
| 545 |
+
|
| 546 |
+
This environment is designed to be useful for **RL/agent training and evaluation**, not just a one-off benchmark:
|
| 547 |
+
|
| 548 |
+
1. **Dense Reward Signal**: every action type produces a positive or negative reward, giving gradient-based training methods (GRPO, DPO, PPO) a per-step signal instead of a single sparse binary score at the end of the episode.

2. **Progressive Difficulty**: Easy (2 services, 2 issues) → Medium (3 services, 3 issues with 1 dependency) → Hard (5 services, 5 issues with multiple dependency chains). Difficulty comes from complexity, not ambiguity.

3. **Partial Credit**: close-but-wrong fixes get constructive feedback instead of a flat rejection. This provides a learning signal for agents that are on the right track.

4. **Strategy Incentives**: the multi-dimensional rubric rewards **how** the agent solves the task (inspect before fixing, follow dependencies, avoid wasted steps), not just **what** it solves. This encourages emergent debugging strategies.

5. **Stochastic Scenarios**: seed-based randomization from expanded issue pools prevents policy overfitting to memorized scenarios while maintaining reproducibility.

6. **Cascading Dynamics**: upstream fixes change downstream state, requiring **multi-step causal reasoning**. The agent cannot pattern-match each issue independently; it must understand the system architecture.

7. **Real-World Relevance**: API integration debugging is a genuine, high-value task that software engineers spend significant time on. The scenarios model actual failure patterns (expired tokens, rate limiting, missing headers, deprecated endpoints, race conditions).
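
Two of the ideas above, seed-based scenario selection and upstream-first reasoning over a dependency graph, can be sketched in a few lines of standard-library Python. The issue names echo the failure patterns listed above, but the pool, the service names, the graph, and both function names are hypothetical stand-ins rather than the contents of `scenarios.py`.

```python
import random
from graphlib import TopologicalSorter  # Standard library, Python 3.9+

# Hypothetical issue pool, for illustration only.
ISSUE_POOL = [
    "expired_token",
    "rate_limit",
    "missing_header",
    "deprecated_endpoint",
    "race_condition",
]

# Hypothetical graph: each service maps to the upstream services it depends on.
DEPENDS_ON = {
    "auth_service": set(),
    "order_service": {"auth_service"},
    "inventory_service": {"order_service"},
    "shipping_service": {"inventory_service"},
}


def select_issues(seed: int, count: int) -> list:
    """Pick `count` distinct issues; the same seed always gives the same pick."""
    rng = random.Random(seed)  # Private RNG, so global random state is untouched.
    return rng.sample(ISSUE_POOL, count)


def upstream_first(depends_on: dict) -> list:
    """Order services so that every dependency comes before its dependents."""
    return list(TopologicalSorter(depends_on).static_order())
```

Fixing services in `upstream_first` order mirrors the strategy the rubric rewards: resolve the root of a dependency chain before touching the services that inherit its failures.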
|
| 561 |
+
|
| 562 |
+
---
|
| 563 |
+
|
| 564 |
+
## OpenEnv Spec Compliance
|
| 565 |
+
|
| 566 |
+
| Requirement | Status |
|---|---|
| OpenEnv spec v1 (`openenv.yaml`) | ✅ |
| Typed Pydantic models (Action, Observation) | ✅ |
| `reset()` / `step()` / `state()` API | ✅ |
| 3+ tasks with difficulty range | ✅ (easy, medium, hard) |
| Programmatic graders (0.0–1.0) | ✅ (multi-dimensional rubric) |
| Meaningful reward function | ✅ (dense, not sparse) |
| Baseline inference script | ✅ (`inference.py` at root) |
| OpenAI client for LLM calls | ✅ |
| `[START]/[STEP]/[END]` stdout format | ✅ |
| Dockerfile builds and runs | ✅ |
| HF Space deploys and responds | ✅ |
| `openenv validate` passes | ✅ |
|
| 580 |
+
|
| 581 |
+
---
|
| 582 |
+
|
| 583 |
+
## Hackathon Submission
|
| 584 |
+
|
| 585 |
+
- **HF Space**: [yadnyeshkolte/api-debug-env](https://huggingface.co/spaces/yadnyeshkolte/api-debug-env)
|
| 586 |
+
- **GitHub**: [yadnyeshkolte/openenv-task](https://github.com/yadnyeshkolte/openenv-task)
|
| 587 |
+
- **Hackathon**: Meta PyTorch OpenEnv Hackathon × Scaler School of Technology
|