yadnyeshkolte committed
Commit 579652a · 1 Parent(s): 8b10144

update on README.md

Files changed (1):
  1. README.md +450 -157
README.md CHANGED
@@ -11,284 +11,577 @@ tags:
11
 
12
  # 🔧 API Integration Debugging Environment
13
 
14
- > **A real-world environment for training and evaluating AI agents on multi-service API debugging with cascading failures, dynamic state, and multi-dimensional grading.**
15
 
16
- [![OpenEnv](https://img.shields.io/badge/OpenEnv-compatible-blue)](https://github.com/meta-pytorch/OpenEnv)
17
  [![Python](https://img.shields.io/badge/Python-3.10%2B-green)](https://python.org)
18
- [![License](https://img.shields.io/badge/License-BSD-yellow)](LICENSE)
 
19
 
20
- ## Why API Debugging?
 
21
 
22
- API integration failures are one of the most common and time-consuming issues in production software. When Service A calls Service B which calls Service C, a single misconfiguration can cascade through the entire system. Debugging requires:
23
 
24
- - **Structured diagnosis**: inspecting logs, configs, and endpoints across services
25
- - **Dependency awareness**: understanding which service failures affect which downstream services
26
- - **Strategic reasoning**: fixing upstream issues first to unmask downstream problems
27
 
28
- This environment simulates *real-world cascading API failures* — not toy string-matching puzzles.
29
 
30
- ## How It Works
31
 
32
  ```
33
- ┌───────────────────────────────────────────────────────────────────────────┐
34
- │                           Agent Debugging Loop                            │
35
- │                                                                           │
36
- │ 1. reset()                → Initial observation with broken service state │
37
- │ 2. step(inspect_logs)     → Error logs from target service                │
38
- │ 3. step(inspect_config)   → Current (broken) configuration                │
39
- │ 4. step(inspect_endpoint) → Live error response simulation                │
40
- │ 5. step(submit_fix)       → Fix validation + cascade resolution           │
41
- │ 6. grade()                → Multi-dimensional rubric score                │
42
- └───────────────────────────────────────────────────────────────────────────┘
43
  ```
44
 
45
- ### Service Dependency Graphs
46
-
47
- Each task models a real multi-service system with dependency chains:
48
-
49
- ```mermaid
50
- graph LR
51
- A[order_service] --> B[inventory_service]
52
- B --> C[shipping_service]
53
- A --> D[api_gateway]
54
- B --> E[auth_service]
55
- style A fill:#ff6b6b
56
- style B fill:#ffd93d
57
- style C fill:#6bcb77
58
- style D fill:#6bcb77
59
- style E fill:#6bcb77
60
  ```
61
 
62
- **Red** = error, **Yellow** = degraded, **Green** = healthy. Fixing upstream issues changes downstream health.
 
 
63
 
64
- ## Environment Design
65
 
66
- ### Dynamic State
67
 
68
- Unlike static environments, our state changes as the agent acts:
69
 
70
- 1. **Service health tracking**: Each service has a status (`healthy`, `degraded`, `error`) that updates when issues are fixed
71
- 2. **Dynamic logs**: After fixing an issue, re-inspecting logs shows *new entries* reflecting the fix
72
- 3. **Cascading effects**: Fixing an upstream issue can change downstream service behavior
73
- 4. **Error trace**: Shows the full error propagation chain, shrinking as issues are fixed
74
 
75
- ### Reward Shaping
76
 
77
- | Action | Reward | Condition |
78
- |--------|--------|-----------|
79
- | `inspect_logs` (new service, finds issues) | +0.15 | New relevant error patterns found |
80
- | `inspect_logs` (new service, no issues) | +0.05 | Valid inspection but no issues here |
81
- | `inspect_logs` (repeat, unchanged) | 0.00 | No new information |
82
- | `inspect_logs` (repeat, dynamic logs) | +0.05 | New logs appeared after a fix |
83
- | `inspect_config` (service has issues) | +0.05 | Relevant configuration retrieved |
84
- | `inspect_endpoint` | +0.02 to +0.05 | Endpoint testing |
85
- | `submit_fix` (correct) | +0.25 | Issue resolved |
86
- | `submit_fix` (correct + inspected first) | +0.30 | Diagnosis + fix strategy bonus |
87
- | `submit_fix` (partial — close value) | +0.03 | Right key, close but not exact value |
88
- | `submit_fix` (wrong) | -0.10 | Incorrect fix |
89
- | All actions complete | +0.20 | Completion bonus |
90
- | Every step | -0.01 | Step cost (encourages efficiency) |
91
 
92
  ## Tasks
93
 
94
- ### Easy: Payment API Integration (2 issues, 15 steps)
95
 
96
- Payment client failing to connect to payment gateway. Issues involve authentication and protocol errors.
97
 
98
- - **Issue pool**: 4 possible issues, 2 selected per episode
99
  - **Services**: `payment_client`, `payment_gateway`
100
- - **Issue types**: Auth header missing, wrong Content-Type, timeout, deprecated endpoint
101
 
102
- ### Medium: Webhook Event Chain (3 issues, 25 steps)
103
 
104
- Webhook notification system dropping events across a 3-service chain.
105
 
106
- - **Issue pool**: 5 possible issues, 3 selected per episode
107
  - **Services**: `webhook_sender`, `webhook_receiver`, `notification_service`
108
- - **Issue types**: Rate limiting, retry misconfiguration, webhook signature, endpoint URL, compression
109
- - **Dependencies**: Retry issue is masked by rate limit — must fix rate limit first
110
 
111
- ### Hard: E-Commerce Order Pipeline (5 issues, 40 steps)
112
 
113
- Complex order processing pipeline with cascading failures across 5 services.
114
 
115
- - **Issue pool**: 7 possible issues, 5 selected per episode
116
  - **Services**: `order_service`, `inventory_service`, `shipping_service`, `api_gateway`, `auth_service`
117
- - **Issue types**: Deprecated URLs, timeouts, race conditions, expired tokens, missing token refresh, circuit breakers, idempotency
118
- - **Dependencies**: Timeout masked by wrong URL; token refresh masked by expired token
119
 
120
- ## Grading Rubric
121
 
122
- The grader uses a **multi-dimensional rubric**, not a simple fix ratio:
123
 
124
- | Dimension | Weight | Description |
125
- |-----------|--------|-------------|
126
- | **Fix Score** | 40% | `issues_fixed / total_issues` |
127
- | **Diagnosis Score** | 20% | Did the agent inspect the service before fixing it? |
128
- | **Efficiency Score** | 15% | `remaining_steps / max_steps` — faster is better |
129
- | **Strategy Score** | 25% | Logical debugging approach: inspect before fix, avoid repeats, follow dependency order, use all action types |
130
 
131
  ```
132
- Final Score = fix × 0.40 + diagnosis × 0.20 + efficiency × 0.15 + strategy × 0.25
133
- Clamped to (0.001, 0.999)
134
  ```
135
 
136
- ### Baseline Scores
137
 
138
- | Task | Score | Steps | Issues Fixed |
139
- |------|-------|-------|--------------|
140
  | Easy | ~0.75 | 7 | 2/2 |
141
  | Medium | ~0.55 | 10 | 3/3 |
142
  | Hard | ~0.45 | 15 | 5/5 |
143
 
144
- *Baseline uses a rule-based heuristic agent (inspect all → fix all).*
145
 
146
  ## Action & Observation Spaces
147
 
148
- ### Action Space
149
 
150
  ```json
151
  {
152
  "action_type": "inspect_logs | inspect_config | inspect_endpoint | submit_fix",
153
- "target": "service_name",
154
  "fix_payload": {
155
  "config_key": "corrected_value"
156
  }
157
  }
158
  ```
159
 
160
- ### Observation Space
161
 
162
  ```json
163
  {
164
  "task_id": "easy",
165
- "task_description": "...",
166
- "logs": ["[ERROR] ..."],
167
- "config_snapshot": {"headers": {"Content-Type": "text/plain"}},
168
- "api_response": {"status": "error", "status_code": 401},
169
  "service_status": {"payment_client": "error", "payment_gateway": "healthy"},
170
- "dependency_graph": {"payment_client": ["payment_gateway"]},
171
- "error_trace": ["[CRITICAL] payment_client: Missing Authorization header"],
172
  "remaining_steps": 14,
173
  "issues_found": 1,
174
  "issues_fixed": 0,
175
  "issues_total": 2,
176
- "hints": ["Check headers.Authorization"],
177
- "available_targets": ["payment_client", "payment_gateway"]
178
  }
179
  ```
180
 
181
  ## Example Transcript
182
 
183
  ```
184
  >>> reset(task_id="easy")
185
- task_description: "Payment processing API integration is failing..."
186
  service_status: {payment_client: "error", payment_gateway: "healthy"}
187
- error_trace: [
188
- "[CRITICAL] payment_client: Missing Authorization header",
189
- " └─> payment_gateway: All requests rejected with 401",
190
- "[ERROR] payment_client: Wrong Content-Type (text/plain instead of application/json)",
191
- " └─> payment_gateway: Request body parsing fails"
192
  ]
193
-
194
- >>> step(inspect_logs, target=payment_client)
195
- logs: ["[ERROR] POST /process -> 401 Unauthorized", ...]
196
  issues_found: 2, reward: +0.15
197
 
198
- >>> step(inspect_config, target=payment_client)
199
- config: {headers: {Content-Type: "text/plain", Accept: "..."}, ...}
200
- reward: +0.05
201
 
202
- >>> step(submit_fix, target=payment_client, fix_payload={headers.Authorization: "Bearer sk_key"})
203
- action_result: "Fix accepted! Fixed 1 issue(s)."
204
- service_status: {payment_client: "degraded"} # still has content-type issue
205
- reward: +0.30
 
206
 
207
- >>> step(inspect_logs, target=payment_client) # re-inspect shows new logs!
208
- logs: [...original..., "[INFO] Authorization header set. Retrying request..."]
209
- reward: +0.05 # reward for checking updated state
 
 
210
 
211
- >>> step(submit_fix, target=payment_client, fix_payload={headers.Content-Type: "application/json"})
212
- action_result: "Fix accepted! All issues fixed! Episode complete. πŸŽ‰"
 
213
  service_status: {payment_client: "healthy", payment_gateway: "healthy"}
214
  error_trace: ["All issues resolved. No error cascades active."]
215
- reward: +0.50 (fix + completion bonus)
 
216
 
217
  >>> grade()
218
- score: 0.82 (fix=1.0, diagnosis=1.0, efficiency=0.67, strategy=0.8)
219
  ```
220
 
 
 
221
  ## Setup & Usage
222
 
223
  ### Install Dependencies
224
 
225
  ```bash
226
- cd api_debug_env # or project root
227
  uv sync
228
  ```
229
 
230
- ### Run Locally
231
 
232
  ```bash
233
- uvicorn server.app:app --reload --port 8000
 
234
  ```
235
 
236
- ### Run Tests
 
 
237
 
238
  ```bash
239
- python -m pytest tests/ -v --tb=short
240
  ```
241
 
242
- ### Docker
243
 
244
  ```bash
245
- docker build -t api_debug_env -f server/Dockerfile .
 
246
  docker run -p 8000:8000 api_debug_env
247
  ```
248
 
249
- ### API Endpoints
 
 
250
 
251
  | Endpoint | Method | Description |
252
- |----------|--------|-------------|
253
- | `/` | GET | Environment info + status |
254
- | `/reset` | POST | Reset environment |
255
- | `/step` | POST | Execute an action |
256
- | `/state` | GET | Get current state |
257
- | `/tasks` | GET | List all tasks with schemas |
258
- | `/grader` | POST | Get grading score |
259
- | `/baseline` | POST | Run baseline agent |
260
- | `/health` | GET | Health check |
261
-
262
- ### Run Inference
263
 
264
  ```bash
265
- export HF_TOKEN=your_token_here
266
  python inference.py
267
  ```
268
 
269
- ## Design Philosophy
270
 
271
- This environment is designed to be useful for **RL/agent training**, not just evaluation:
272
 
273
- 1. **Dense Rewards**: Every action type can yield positive or negative reward, enabling gradient-based training
274
- 2. **Progressive Difficulty**: Easy→Medium→Hard with increasing service count and dependency complexity
275
- 3. **Partial Credit**: Close-but-wrong fixes get feedback instead of binary rejection
276
- 4. **Strategy Incentives**: The multi-dimensional rubric rewards *how* the agent solves, not just *what* it solves
277
- 5. **Stochastic**: Seed-based randomization prevents policy overfitting to memorized scenarios
278
- 6. **Cascading Dynamics**: Upstream fixes change downstream state, requiring multi-step reasoning
279
 
280
  ## Project Structure
281
 
282
  ```
283
- ├── models.py          # Pydantic Action & Observation definitions
284
- ├── scenarios.py       # Task scenarios with dependency graphs
285
- ├── inference.py       # MANDATORY baseline inference script
286
- ├── openenv.yaml       # OpenEnv metadata
287
- ├── pyproject.toml     # Dependencies
288
  ├── server/
289
- │   ├── api_debug_env_environment.py   # Core environment logic
290
- │   ├── app.py                         # FastAPI endpoints
291
- │   └── Dockerfile                     # HF Spaces deployment
292
  └── tests/
293
-     └── test_environment.py            # 48+ unit & integration tests
294
  ```
 
11
 
12
  # 🔧 API Integration Debugging Environment
13
 
14
+ > A real-world OpenEnv environment where an AI agent diagnoses and fixes broken API integrations across multi-service systems with **cascading failures**, **dynamic state**, and **multi-dimensional rubric grading**.
15
 
16
+ [![OpenEnv](https://img.shields.io/badge/OpenEnv-v0.2.2-blue)](https://github.com/meta-pytorch/OpenEnv)
17
  [![Python](https://img.shields.io/badge/Python-3.10%2B-green)](https://python.org)
18
+ [![Tests](https://img.shields.io/badge/Tests-70%20passed-brightgreen)]()
19
+ [![HF Space](https://img.shields.io/badge/HF%20Space-Live-orange)](https://huggingface.co/spaces/yadnyeshkolte/api-debug-env)
20
 
21
+ ---
22
+
23
+ ## Table of Contents
24
+
25
+ - [Motivation — Why API Debugging?](#motivation--why-api-debugging)
26
+ - [Environment Overview](#environment-overview)
27
+ - [Key Design Features](#key-design-features)
28
+ - [Tasks (Easy / Medium / Hard)](#tasks)
29
+ - [Multi-Dimensional Grading Rubric](#multi-dimensional-grading-rubric)
30
+ - [Reward Shaping](#reward-shaping)
31
+ - [Action & Observation Spaces](#action--observation-spaces)
32
+ - [Example Transcript](#example-transcript)
33
+ - [Setup & Usage](#setup--usage)
34
+ - [API Endpoints](#api-endpoints)
35
+ - [Running Inference](#running-inference)
36
+ - [Running Tests](#running-tests)
37
+ - [Project Structure](#project-structure)
38
+ - [Design Philosophy](#design-philosophy)
39
+
40
+ ---
41
 
42
+ ## Motivation — Why API Debugging?
43
 
44
+ API integration failures are one of the **most common and expensive issues** in production software engineering. When microservices communicate — Service A calls Service B which calls Service C — a single misconfiguration can cascade through the entire system, producing confusing error chains that take hours to diagnose.
 
 
45
 
46
+ Real-world API debugging requires:
47
 
48
+ - **Structured diagnosis** — reading error logs and configs across multiple services
49
+ - **Dependency awareness** — understanding which upstream failure is causing downstream errors
50
+ - **Strategic reasoning** — fixing root causes first to unmask hidden downstream bugs
51
+ - **Precision** — submitting exact configuration corrections, not approximate guesses
52
+
53
+ This environment simulates **real-world cascading API failures** with dynamic state that changes as the agent acts — not a static lookup puzzle.
54
+
55
+ ---
56
 
57
+ ## Environment Overview
58
+
59
+ ```
60
+ ┌────────────────────────────────────────────────────────────────────┐
61
+ │                        Agent Debugging Loop                        │
62
+ │                                                                    │
63
+ │ 1. reset(task_id)         → Initial observation with broken state  │
64
+ │ 2. step(inspect_logs)     → Error logs with diagnostic clues       │
65
+ │ 3. step(inspect_config)   → Current (broken) service configuration │
66
+ │ 4. step(inspect_endpoint) → Simulated API response (401, 504..)    │
67
+ │ 5. step(submit_fix)       → Strict fix validation + cascade update │
68
+ │ 6. grade()                → Multi-dimensional rubric score [0,1]   │
69
+ │                                                                    │
70
+ │ State updates dynamically: service health changes, new logs        │
71
+ │ appear, error cascades resolve as the agent fixes issues.          │
72
+ └────────────────────────────────────────────────────────────────────┘
73
  ```
74
+
75
+ The agent interacts through the standard OpenEnv API:
76
+ - **`reset()`** → returns initial observation with broken service state
77
+ - **`step(action)`** → executes one debugging action, returns observation + reward
78
+ - **`state()`** → returns current environment state (episode_id, step_count)
79
+ - **`grade()`** → returns final score using multi-dimensional rubric
80
+
81
+ ---
82
+
83
+ ## Key Design Features
84
+
85
+ ### 1. Cascading Failures with Service Dependency Graphs
86
+
87
+ Each task models a real multi-service ecosystem. Services depend on each other, and a bug in an upstream service **cascades** to all downstream services:
88
+
89
  ```
90
+ Hard Task Dependency Graph:
91
+
92
+ order_service ──┬──→ inventory_service ──┬──→ shipping_service
93
+                 │                        └──→ auth_service
94
+                 └──→ api_gateway
95
+
96
+    [ERROR]             [DEGRADED]               [HEALTHY]
 
97
  ```
98
 
99
+ - Fixing `order_service`'s wrong URL unmasks `inventory_service`'s timeout issue
100
+ - Fixing `inventory_service`'s expired token allows `shipping_service` to respond
101
+ - **Some issues are intentionally masked by upstream failures** — the agent must fix in the right order
102
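Because of this masking, a sensible agent orders its fixes upstream-first using the `dependency_graph` observation field. A small topological-sort sketch (the `fix_order` name is illustrative; the graph literal mirrors the hard task above):

```python
from collections import deque

def fix_order(graph: dict[str, list[str]]) -> list[str]:
    """Order services so each one comes before everything downstream of it.

    `graph` maps a service to the services it calls, matching the
    `dependency_graph` observation field.
    """
    indegree = {service: 0 for service in graph}
    for downstream in graph.values():
        for svc in downstream:
            indegree[svc] = indegree.get(svc, 0) + 1
    queue = deque(s for s, d in indegree.items() if d == 0)  # roots first
    order = []
    while queue:
        svc = queue.popleft()
        order.append(svc)
        for child in graph.get(svc, []):
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return order

order = fix_order({
    "order_service": ["inventory_service", "api_gateway"],
    "inventory_service": ["shipping_service", "auth_service"],
    "shipping_service": [], "api_gateway": [], "auth_service": [],
})
print(order)  # → ['order_service', 'inventory_service', 'api_gateway', 'shipping_service', 'auth_service']
```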
 
103
+ ### 2. Dynamic State
104
 
105
+ Unlike static environments, the state **changes as the agent acts**:
106
 
107
+ | What changes | How |
108
+ |---|---|
109
+ | **Service health** | Fixing issues updates service status: `error` → `degraded` → `healthy` |
110
+ | **Logs** | After a fix, re-inspecting logs shows **new entries** (e.g., "Authorization header set. Retrying...") |
111
+ | **Error traces** | The cascade chain shrinks as upstream issues are resolved |
112
+ | **Endpoint responses** | `inspect_endpoint` returns different HTTP errors based on current fix state |
113
 
114
+ ### 3. Seed-Based Scenario Randomization
115
 
116
+ Each difficulty level has an **expanded issue pool** (more issues than are selected per episode):
117
 
118
+ | Difficulty | Pool Size | Selected Per Episode |
119
+ |---|---|---|
120
+ | Easy | 4 issues | 2 |
121
+ | Medium | 5 issues | 3 |
122
+ | Hard | 7 issues | 5 |
123
+
124
+ Passing a `seed` to `reset()` produces a **deterministic but varied** scenario — different seeds select different subsets from the pool and randomize log order. This prevents agents from memorizing fixed patterns.
125
+
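The idea can be illustrated with Python's seeded `random.Random` (a sketch of the concept, not the environment's actual sampler; `select_issues` is an illustrative name, and the issue labels mirror the medium task's pool):

```python
import random

MEDIUM_POOL = ["rate_limit", "retry", "signature", "url", "compression"]

def select_issues(pool: list[str], k: int, seed: int) -> list[str]:
    """Pick k issues from the pool, deterministically for a given seed."""
    return random.Random(seed).sample(pool, k)

# Same seed -> same scenario; different seeds may draw different subsets.
episode_a = select_issues(MEDIUM_POOL, 3, seed=42)
episode_b = select_issues(MEDIUM_POOL, 3, seed=42)
episode_c = select_issues(MEDIUM_POOL, 3, seed=7)
```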
126
+ ### 4. Strict Fix Validation with Partial Credit
127
+
128
+ The grader validates both **keys and values** of submitted fixes:
129
+
130
+ - **Exact match** → Full credit (+0.25 reward)
131
+ - **Right key, close value** (e.g., timeout=7 when expected=10) → Partial credit (+0.03)
132
+ - **Right key, wrong value** (e.g., timeout=100 when expected=10) → Rejected
133
+ - **Wrong key entirely** → Penalized (-0.1)
134
+ - **Bearer token pattern matching** — `Bearer <any_valid_token>` is accepted
135
+ - **Numeric tolerance** — submitted numbers within a strict 10% of the expected value still count
136
+ - **Boolean coercion** — `"true"`, `"1"`, `"yes"` all match `True`
137
+
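The rules above can be sketched as a small classifier. This is an illustration, not the environment's code: `classify_fix` is a made-up name, and the 50% partial-credit band is a placeholder chosen only to reproduce the examples above (7 vs 10 is "close", 100 vs 10 is not) — the environment's real thresholds may differ:

```python
import re

def classify_fix(expected, submitted) -> str:
    """Return "full", "partial", or "wrong" for a submitted config value."""
    # Boolean coercion: "true", "1", "yes" all count as True
    if isinstance(expected, bool):
        coerced = str(submitted).strip().lower() in {"true", "1", "yes"}
        return "full" if coerced == expected else "wrong"
    # Bearer token pattern matching: any well-formed token is accepted
    if isinstance(expected, str) and expected.startswith("Bearer "):
        ok = isinstance(submitted, str) and re.fullmatch(r"Bearer \S+", submitted)
        return "full" if ok else "wrong"
    # Numbers: exact value is full credit, nearby values earn partial credit
    if isinstance(expected, (int, float)) and isinstance(submitted, (int, float)):
        if submitted == expected:
            return "full"
        close = abs(submitted - expected) <= 0.5 * abs(expected)  # placeholder band
        return "partial" if close else "wrong"
    return "full" if submitted == expected else "wrong"
```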
138
+ ---
139
 
140
  ## Tasks
141
 
142
+ ### Easy: Payment API Integration (2 issues, 15 max steps)
143
 
144
+ **Scenario**: A payment processing client is failing to connect to the payment gateway. The agent must diagnose authentication and protocol errors.
145
 
 
146
  - **Services**: `payment_client`, `payment_gateway`
147
+ - **Issue pool** (4 possible, 2 selected):
148
+ - Missing `Authorization` header (HTTP 401)
149
+ - Wrong `Content-Type` header — `text/plain` instead of `application/json` (HTTP 415)
150
+ - Timeout too low for payment processing (HTTP 504)
151
+ - Base URL pointing to deprecated v1 endpoint (HTTP 301)
152
+ - **Dependencies**: None — straightforward diagnosis
153
 
154
+ ### Medium: Webhook Event Chain (3 issues, 25 max steps)
155
 
156
+ **Scenario**: A webhook notification system is dropping events across a 3-service chain. Events flow from sender → receiver → notification service, but multiple configuration issues are causing failures.
157
 
 
158
  - **Services**: `webhook_sender`, `webhook_receiver`, `notification_service`
159
+ - **Issue pool** (5 possible, 3 selected):
160
+ - Rate limit mismatch (sender at 100/s, receiver accepts 10/s) → 429 errors
161
+ - Insufficient retry config (only 1 retry, no backoff, 429 not in retry list)
162
+ - Empty webhook signature header → receiver drops all events as unsigned
163
+ - Wrong target URL (`/webhook` vs `/hooks/incoming`) → 404 errors
164
+ - Payload compression enabled but receiver doesn't support gzip → 415 errors
165
+ - **Dependencies**: Retry issue is **masked** by rate limit — must fix rate limit first to see the retry problem
166
 
167
+ ### Hard: E-Commerce Order Pipeline (5 issues, 40 max steps)
168
 
169
+ **Scenario**: A complex e-commerce order processing pipeline is failing with cascading errors across 5 services. Multiple dependency chains make this genuinely challenging for frontier models.
170
 
 
171
  - **Services**: `order_service`, `inventory_service`, `shipping_service`, `api_gateway`, `auth_service`
172
+ - **Issue pool** (7 possible, 5 selected):
173
+ - Deprecated URL (`/v1/check` → should be `/v2/reserve`) → 301 redirect
174
+ - Timeout too short (2s vs 4s processing time) — masked by wrong URL
175
+ - Synchronous mode causing race conditions between concurrent orders
176
+ - Expired auth token on inventory→shipping calls → 401
177
+ - No auto token refresh configured — masked by expired token
178
+ - No circuit breaker → failed requests hammer inventory service
179
+ - Missing idempotency key → retries create duplicate orders
180
+ - **Dependencies**: `timeout` depends on `wrong_url` fix; `token_refresh` depends on `expired_token` fix; `idempotency` depends on `async` fix
181
+
182
+ ---
183
 
184
+ ## Multi-Dimensional Grading Rubric
185
 
186
+ The grader uses a **4-dimension weighted rubric**, not a simple `issues_fixed / total` ratio:
187
 
188
+ | Dimension | Weight | What It Measures |
189
+ |---|---|---|
190
+ | **Fix Score** | 40% | `issues_fixed / total_issues` — how many bugs were actually resolved |
191
+ | **Strategy Score** | 25% | Did the agent follow a logical approach? Inspect before fix, avoid repeats, follow dependency order, use all action types |
192
+ | **Diagnosis Score** | 20% | Did the agent inspect the service (logs/config) **before** submitting a fix for it? |
193
+ | **Efficiency Score** | 15% | `remaining_steps / max_steps` — faster solutions score higher |
194
 
195
  ```
196
+ Final Score = fix × 0.40 + strategy × 0.25 + diagnosis × 0.20 + efficiency × 0.15
197
+ Clamped to (0.001, 0.999) — never exactly 0.0 or 1.0
198
  ```
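The formula above, as straight Python (the `final_score` name is illustrative; the weights and clamp come directly from the rubric):

```python
def final_score(fix: float, strategy: float, diagnosis: float,
                efficiency: float) -> float:
    """Weighted rubric score, clamped away from exact 0.0 and 1.0."""
    raw = 0.40 * fix + 0.25 * strategy + 0.20 * diagnosis + 0.15 * efficiency
    return min(max(raw, 0.001), 0.999)

# Even a perfect episode never reaches exactly 1.0:
perfect = final_score(1.0, 1.0, 1.0, 1.0)  # clamps to 0.999
```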
199
 
200
+ **Strategy scoring details:**
201
+ - Did the agent inspect logs/config before submitting any fix? (+1)
202
+ - Ratio of unique inspections to total inspections (no wasteful repeats) (+1)
203
+ - Did fixes follow the optimal dependency order? (+1)
204
+ - Did the agent use a variety of action types? (+1)
205
 
206
+ ### Baseline Scores (Rule-Based Heuristic Agent)
207
+
208
+ | Task | Score | Steps Used | Issues Fixed |
209
+ |---|---|---|---|
210
  | Easy | ~0.75 | 7 | 2/2 |
211
  | Medium | ~0.55 | 10 | 3/3 |
212
  | Hard | ~0.45 | 15 | 5/5 |
213
 
214
+ *The baseline uses a deterministic heuristic (inspect all logs → inspect all configs → submit known fixes). An LLM-based agent following good debugging strategy can score higher.*
215
+
216
+ ---
217
+
218
+ ## Reward Shaping
219
+
220
+ Every action produces a meaningful reward signal — not just sparse end-of-episode feedback:
221
+
222
+ | Action | Reward | Condition |
223
+ |---|---|---|
224
+ | `inspect_logs` (first time, finds error patterns) | **+0.15** | New issue-related log patterns found |
225
+ | `inspect_logs` (first time, no issues here) | +0.05 | Valid inspection, no errors in this service |
226
+ | `inspect_logs` (repeat, no new info) | 0.00 | Already inspected, nothing changed |
227
+ | `inspect_logs` (repeat, after a fix) | +0.05 | Dynamic logs appeared after a recent fix |
228
+ | `inspect_config` (service has issues) | +0.05 | Relevant config retrieved |
229
+ | `inspect_config` (service is clean) | +0.01 | Config retrieved but no issues here |
230
+ | `inspect_config` (repeat) | 0.00 | Already inspected |
231
+ | `inspect_endpoint` | +0.02 to +0.05 | Simulated endpoint test |
232
+ | `submit_fix` (correct fix) | **+0.25** | Issue resolved, service health updated |
233
+ | `submit_fix` (correct + inspected first) | **+0.30** | Fix + strategy bonus for diagnosis |
234
+ | `submit_fix` (partial — close but not exact) | +0.03 | Right key, approximately right value |
235
+ | `submit_fix` (wrong fix) | **-0.10** | Incorrect fix payload |
236
+ | `submit_fix` (empty payload) | -0.10 | Empty fix_payload submitted |
237
+ | All issues fixed | **+0.20** | Episode completion bonus |
238
+ | Invalid target / invalid action | -0.05 | Bad input |
239
+ | Every step | **-0.01** | Step cost — encourages efficiency |
240
+
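The `submit_fix` rows of the table can be read as a tiny reward function (a sketch with an illustrative name, not the environment's source):

```python
def submit_fix_reward(correct: bool, inspected_first: bool = False,
                      close: bool = False) -> float:
    """Reward for one submit_fix action, per the reward-shaping table."""
    if correct:
        # +0.25 for the fix, +0.05 strategy bonus if the service was inspected first
        return 0.30 if inspected_first else 0.25
    # Partial credit for a close value, otherwise a penalty
    return 0.03 if close else -0.10
```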
241
+ ---
242
 
243
  ## Action & Observation Spaces
244
 
245
+ ### Action Schema (Pydantic model: `ApiDebugAction`)
246
 
247
  ```json
248
  {
249
  "action_type": "inspect_logs | inspect_config | inspect_endpoint | submit_fix",
250
+ "target": "<service_name>",
251
  "fix_payload": {
252
  "config_key": "corrected_value"
253
  }
254
  }
255
  ```
256
 
257
+ - `action_type` (required): One of the 4 debugging actions
258
+ - `target` (required): The service to act on (from `available_targets` in the observation)
259
+ - `fix_payload` (optional): Required only for `submit_fix` — the configuration correction
260
+
261
+ **Fix payload formats:**
262
+ ```json
263
+ // Simple key-value fix
264
+ {"timeout": 10}
265
+
266
+ // Nested key fix (dot notation)
267
+ {"headers.Authorization": "Bearer my_api_key"}
268
+
269
+ // Complex nested object fix
270
+ {"retry": {"max_retries": 3, "backoff_factor": 2, "retry_on_status": [429, 500]}}
271
+ ```
272
+
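Applying a dot-notation payload to a config snapshot can be sketched as follows (illustrative only; `apply_fix` is this example's name, not the environment's implementation):

```python
def apply_fix(config: dict, fix_payload: dict) -> dict:
    """Apply a fix_payload to a config snapshot, expanding dot-notation
    keys such as "headers.Authorization" into nested updates."""
    for key, value in fix_payload.items():
        node = config
        *parents, leaf = key.split(".")
        for part in parents:
            node = node.setdefault(part, {})  # create nested dicts as needed
        node[leaf] = value
    return config

cfg = {"headers": {"Content-Type": "text/plain"}, "timeout": 30}
apply_fix(cfg, {"headers.Authorization": "Bearer my_api_key", "timeout": 10})
```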
273
+ ### Observation Schema (Pydantic model: `ApiDebugObservation`)
274
 
275
  ```json
276
  {
277
  "task_id": "easy",
278
+ "task_description": "A payment processing API integration is failing...",
279
+ "logs": ["[ERROR] 2026-03-25T10:15:23Z POST /process -> 401 Unauthorized", "..."],
280
+ "config_snapshot": {"headers": {"Content-Type": "text/plain"}, "timeout": 30},
281
+ "api_response": {"status": "error", "status_code": 401, "error": "Missing Authorization"},
282
  "service_status": {"payment_client": "error", "payment_gateway": "healthy"},
283
+ "dependency_graph": {"payment_client": ["payment_gateway"], "payment_gateway": []},
284
+ "error_trace": [
285
+ "[CRITICAL] payment_client: Missing Authorization header",
286
+ " └─> payment_gateway: All requests rejected with 401"
287
+ ],
288
+ "hints": ["Check headers.Authorization"],
289
  "remaining_steps": 14,
290
  "issues_found": 1,
291
  "issues_fixed": 0,
292
  "issues_total": 2,
293
+ "action_result": "Inspected logs for 'payment_client'. Found relevant error patterns!",
294
+ "available_targets": ["payment_client", "payment_gateway"],
295
+ "done": false,
296
+ "reward": 0.15
297
  }
298
  ```
299
 
300
+ **Key observation fields for agent reasoning:**
301
+ - `service_status` — shows which services are healthy/degraded/error (updates dynamically)
302
+ - `dependency_graph` — shows service relationships (agent should fix upstream first)
303
+ - `error_trace` — shows active error cascades (shrinks as issues are fixed)
304
+ - `hints` — progressive hints that get more specific as steps are used
305
+
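A minimal policy can be driven off these fields. This is a naive illustrative sketch (`next_action` is a made-up name; a real agent would also read `logs`, `hints`, and `error_trace`):

```python
def next_action(obs: dict) -> dict:
    """Pick the next debugging action from an observation:
    inspect the first non-healthy service, else probe an endpoint."""
    for service, status in obs.get("service_status", {}).items():
        if status != "healthy":
            return {"action_type": "inspect_logs", "target": service}
    return {"action_type": "inspect_endpoint",
            "target": obs["available_targets"][0]}

obs = {
    "service_status": {"payment_client": "error", "payment_gateway": "healthy"},
    "available_targets": ["payment_client", "payment_gateway"],
}
print(next_action(obs))  # → {'action_type': 'inspect_logs', 'target': 'payment_client'}
```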
306
+ ---
307
+
308
  ## Example Transcript
309
 
310
  ```
311
  >>> reset(task_id="easy")
312
+ task_description: "A payment processing API integration is failing..."
313
  service_status: {payment_client: "error", payment_gateway: "healthy"}
314
+ error_trace:
315
+ [CRITICAL] payment_client: Missing Authorization header
316
+   └─> payment_gateway: All requests rejected with 401
317
+ [ERROR] payment_client: Wrong Content-Type (text/plain instead of application/json)
318
+   └─> payment_gateway: Request body parsing fails
319
+ issues_total: 2, remaining_steps: 15
320
+
321
+ >>> step(action_type="inspect_logs", target="payment_client")
322
+ logs: [
323
+ "[INFO] Payment client initialized...",
324
+ "[ERROR] POST /process -> 401 Unauthorized",
325
+ "[ERROR] Response: {'error': 'Missing or invalid Authorization header'}",
326
+ "[WARN] Request headers: Content-Type=text/plain",
327
+ "[ERROR] POST /process -> 415 Unsupported Media Type",
328
  ]
329
  issues_found: 2, reward: +0.15
330
 
331
+ >>> step(action_type="inspect_config", target="payment_client")
332
+ config_snapshot: {
333
+ "base_url": "https://api.paymentgateway.com/v2",
334
+ "headers": {"Content-Type": "text/plain", "Accept": "application/json"},
335
+ "timeout": 30
336
+ }
337
+ reward: +0.05 // Service has issues, first inspection
338
 
339
+ >>> step(action_type="submit_fix", target="payment_client",
340
+ fix_payload={"headers.Authorization": "Bearer sk_live_my_key"})
341
+ action_result: "Fix accepted! Fixed 1 issue(s). Total: 1/2"
342
+ service_status: {payment_client: "degraded", payment_gateway: "healthy"}
343
+ reward: +0.30 // Fix (+0.25) + strategy bonus (+0.05) for inspecting first
344
 
345
+ >>> step(action_type="inspect_logs", target="payment_client")
346
+ logs: [...original logs...,
347
+ "[INFO] Authorization header set. Retrying request..." // NEW dynamic log!
348
+ ]
349
+ reward: +0.05 // Re-inspection has new dynamic logs
350
 
351
+ >>> step(action_type="submit_fix", target="payment_client",
352
+ fix_payload={"headers.Content-Type": "application/json"})
353
+ action_result: "Fix accepted! All issues fixed! Episode complete."
354
  service_status: {payment_client: "healthy", payment_gateway: "healthy"}
355
  error_trace: ["All issues resolved. No error cascades active."]
356
+ reward: +0.50 // Fix (+0.25) + strategy (+0.05) + completion bonus (+0.20)
357
+ done: true
358
 
359
  >>> grade()
360
+ score: 0.90
361
+ fix_score: 1.00 (2/2 fixed)
362
+ diagnosis_score: 1.00 (inspected before every fix)
363
+ efficiency_score: 0.67 (5/15 steps used)
364
+ strategy_score: 0.80 (inspected first, used multiple action types)
365
  ```

---

## Setup & Usage

### Prerequisites

- Python 3.10+
- [uv](https://docs.astral.sh/uv/) (recommended) or pip
- Docker (for containerized deployment)

### Install Dependencies

```bash
# Clone the repository
git clone https://github.com/yadnyeshkolte/openenv-task.git
cd openenv-task

# Install dependencies with uv
uv sync

# Or with pip
pip install -e .
```

### Run the Server Locally

```bash
# From the project root (openenv-task/)
uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
```

The server will be available at `http://localhost:8000`. Visit `http://localhost:8000/docs` for interactive API documentation.

### Quick Test

```bash
# Reset the environment
curl -X POST http://localhost:8000/reset

# Inspect logs
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "inspect_logs", "target": "payment_client"}'

# Submit a fix
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "submit_fix", "target": "payment_client", "fix_payload": {"headers.Authorization": "Bearer my_key"}}'
```

### Docker Build & Run

```bash
# From the project root (openenv-task/)
docker build -t api_debug_env -f Dockerfile .
docker run -p 8000:8000 api_debug_env
```

---

## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Environment info, version, and feature list |
| `/reset` | POST | Reset the environment (accepts `task_id` and `seed` params) |
| `/step` | POST | Execute a debugging action |
| `/state` | GET | Get the current state (`episode_id`, `step_count`) |
| `/schema` | GET | Get the action/observation Pydantic schemas |
| `/tasks` | GET | List all 3 tasks with action schema and service dependencies |
| `/grader` | POST | Get the multi-dimensional grader score for the current episode |
| `/baseline` | POST | Run the rule-based baseline agent on all 3 tasks |
| `/health` | GET | Health check endpoint |
| `/docs` | GET | Interactive Swagger UI documentation |
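The same endpoints can be driven from Python with nothing but the standard library. A minimal sketch mirroring the curl examples above; it assumes the server is running locally, and the helper names are our own:

```python
import json
import urllib.request

BASE = "http://localhost:8000"

def build_step_body(action_type, target, fix_payload=None):
    """Assemble the JSON body that POST /step expects."""
    body = {"action_type": action_type, "target": target}
    if fix_payload is not None:
        body["fix_payload"] = fix_payload
    return body

def post_json(path, body):
    """POST a JSON body to the environment server and decode the reply."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

# With the server running:
# obs = post_json("/reset", {})
# logs = post_json("/step", build_step_body("inspect_logs", "payment_client"))
```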

---

## Running Inference

The `inference.py` script at the project root uses the OpenAI API client to run an LLM agent against all 3 tasks:

```bash
# Set your API credentials
export HF_TOKEN=your_huggingface_token
# Optional: override the model and API base
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export API_BASE_URL=https://router.huggingface.co/v1

# Run inference from the project root
python inference.py
```

**Output format** (stdout):
```
[START] task=easy env=api_debug_env model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=inspect_logs(target=payment_client) reward=0.15 done=false error=null
[STEP] step=2 action=submit_fix(target=payment_client, fix={...}) reward=0.30 done=false error=null
...
[END] success=true steps=5 score=0.820 rewards=0.15,0.30,...
```
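This line-oriented format is easy to post-process, for example to aggregate scores across runs. A small sketch assuming the `key=value` layout shown above:

```python
import re

# Matches the leading fields of an [END] line; trailing fields are ignored
END_RE = re.compile(r"\[END\] success=(\w+) steps=(\d+) score=([\d.]+)")

def parse_end_line(line):
    """Pull the success flag, step count, and score out of an [END] line."""
    m = END_RE.search(line)
    if m is None:
        return None  # not an [END] line
    return {
        "success": m.group(1) == "true",
        "steps": int(m.group(2)),
        "score": float(m.group(3)),
    }
```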

The inference script:
- Uses the `openai.OpenAI` client for all LLM calls
- Reads `HF_TOKEN` (or `API_KEY`) from environment variables
- Includes retry logic with exponential backoff
- Emits `[START]`, `[STEP]`, and `[END]` lines to stdout
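The retry behavior can be sketched as a small wrapper around each LLM call; this is a hypothetical helper for illustration, not the script's literal code:

```python
import time

def call_with_backoff(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on exception, retry with exponentially growing delays.

    The sleep function is injectable so tests can record delays
    instead of actually waiting.
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Wrapping each chat-completion call this way smooths over transient API errors without aborting the episode.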

---

## Running Tests

```bash
# From the project root (openenv-task/)
python -m pytest tests/ -v --tb=short
```

**70 tests** across 12 test classes cover:
- Scenario loading, seed randomization, and issue-pool selection
- Environment reset and initialization
- All 4 action types: `inspect_logs`, `inspect_config`, `inspect_endpoint`, `submit_fix`
- Dynamic state: service health updates, dynamic log injection, error-trace changes
- The multi-dimensional grading rubric (fix, diagnosis, efficiency, strategy)
- Strict fix validation with partial credit
- Value matching (strings, numbers, booleans, lists, Bearer tokens)
- Full-episode integration tests (easy, medium, hard)
- Cascading-failure mechanics and dependency chains
- Episode termination conditions

### Validate OpenEnv Compliance

```bash
openenv validate
```

---

## Project Structure

```
openenv-task/                          # Project root
β”œβ”€β”€ __init__.py                        # Package init (exports ApiDebugEnv, Action, Observation)
β”œβ”€β”€ client.py                          # OpenEnv client (WebSocket connection to the server)
β”œβ”€β”€ models.py                          # Pydantic Action & Observation type definitions
β”œβ”€β”€ scenarios.py                       # Task scenarios with dependency graphs & issue pools
β”œβ”€β”€ inference.py                       # Mandatory inference script (LLM agent, OpenAI client)
β”œβ”€β”€ openenv.yaml                       # OpenEnv metadata (spec v1)
β”œβ”€β”€ pyproject.toml                     # Python project config & dependencies
β”œβ”€β”€ Dockerfile                         # Docker build for HF Spaces deployment
β”œβ”€β”€ LICENSE                            # BSD license
β”œβ”€β”€ README.md                          # This file
β”œβ”€β”€ PROGRESS.md                        # Development session log
β”œβ”€β”€ AGENTS.md                          # Instructions for AI coding agents
β”œβ”€β”€ server/
β”‚   β”œβ”€β”€ __init__.py                    # Server package init
β”‚   β”œβ”€β”€ api_debug_env_environment.py   # Core environment (reset/step/grade logic)
β”‚   β”œβ”€β”€ app.py                         # FastAPI endpoints (/reset, /step, /tasks, etc.)
β”‚   β”œβ”€β”€ Dockerfile                     # Alternate Dockerfile (same as root)
β”‚   └── requirements.txt               # Server-specific requirements
β”œβ”€β”€ scripts/
β”‚   └── baseline_inference.py          # Alternate baseline script
└── tests/
    └── test_environment.py            # 70 unit & integration tests
```

### Key Files

| File | Purpose |
|---|---|
| `server/api_debug_env_environment.py` | **Core logic** β€” `reset()`, `step()`, `grade()`, dynamic state, cascading failures |
| `scenarios.py` | **Task definitions** β€” issue pools, dependency graphs, dynamic logs, service configs |
| `models.py` | **Type definitions** β€” `ApiDebugAction` and `ApiDebugObservation` Pydantic models |
| `inference.py` | **Mandatory** β€” LLM-based agent using the OpenAI client with `[START]/[STEP]/[END]` output |
| `openenv.yaml` | **Mandatory** β€” OpenEnv spec v1 metadata with task definitions |
| `server/app.py` | **FastAPI server** β€” all HTTP endpoints, including `/baseline` and `/grader` |

---

## Design Philosophy

This environment is designed to be useful for **RL/agent training and evaluation**, not just a one-off benchmark:

1. **Dense Reward Signal** β€” every action type produces a positive or negative reward, enabling gradient-based training (GRPO, DPO, PPO) rather than a single sparse binary score at the end.

2. **Progressive Difficulty** β€” Easy (2 services, 2 issues) β†’ Medium (3 services, 3 issues with 1 dependency) β†’ Hard (5 services, 5 issues with multiple dependency chains). Difficulty comes from complexity, not ambiguity.

3. **Partial Credit** β€” close-but-wrong fixes get constructive feedback instead of flat rejection, providing learning signal for agents that are on the right track.

4. **Strategy Incentives** β€” the multi-dimensional rubric rewards **how** the agent solves the task (inspect before fixing, follow dependencies, avoid waste), not just **what** it solves. This encourages emergent debugging strategies.

5. **Stochastic Scenarios** β€” seed-based randomization from expanded issue pools prevents policies from overfitting to memorized scenarios while maintaining reproducibility.

6. **Cascading Dynamics** β€” upstream fixes change downstream state, requiring **multi-step causal reasoning**. The agent can't pattern-match each issue independently; it must understand the system architecture.

7. **Real-World Relevance** β€” API integration debugging is a genuine, high-value task that software engineers spend significant time on. The scenarios model actual failure patterns (expired tokens, rate limiting, missing headers, deprecated endpoints, race conditions).
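The dependency-aware strategy rewarded in points 4 and 6 amounts to fixing services in topological order of the dependency graph. A minimal sketch with a hypothetical graph; any service name beyond `payment_client`/`payment_gateway` is illustrative:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Map each service to the upstream services it depends on (hypothetical graph)
deps = {
    "payment_client": {"payment_gateway"},
    "payment_gateway": {"auth_service"},
    "auth_service": set(),
}

# static_order() yields upstream services before their dependents, so fixes
# applied in this order are never masked by an unresolved upstream failure.
fix_order = list(TopologicalSorter(deps).static_order())
# fix_order == ["auth_service", "payment_gateway", "payment_client"]
```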

---

## OpenEnv Spec Compliance

| Requirement | Status |
|---|---|
| OpenEnv spec v1 (`openenv.yaml`) | βœ… |
| Typed Pydantic models (Action, Observation) | βœ… |
| `reset()` / `step()` / `state()` API | βœ… |
| 3+ tasks with difficulty range | βœ… (easy, medium, hard) |
| Programmatic graders (0.0–1.0) | βœ… (multi-dimensional rubric) |
| Meaningful reward function | βœ… (dense, not sparse) |
| Baseline inference script | βœ… (`inference.py` at root) |
| OpenAI client for LLM calls | βœ… |
| `[START]/[STEP]/[END]` stdout format | βœ… |
| Dockerfile builds and runs | βœ… |
| HF Space deploys and responds | βœ… |
| `openenv validate` passes | βœ… |

---

## Hackathon Submission

- **HF Space**: [yadnyeshkolte/api-debug-env](https://huggingface.co/spaces/yadnyeshkolte/api-debug-env)
- **GitHub**: [yadnyeshkolte/openenv-task](https://github.com/yadnyeshkolte/openenv-task)
- **Hackathon**: Meta PyTorch OpenEnv Hackathon Γ— Scaler School of Technology