yadnyeshkolte committed
Commit 579652a · 1 Parent(s): 8b10144

update on README.md

Files changed (1):
  1. README.md +450 -157
README.md CHANGED
@@ -11,284 +11,577 @@ tags:
11
 
12
  # 🔧 API Integration Debugging Environment
13
 
14
- > **A real-world environment for training and evaluating AI agents on multi-service API debugging with cascading failures, dynamic state, and multi-dimensional grading.**
15
 
16
- [![OpenEnv](https://img.shields.io/badge/OpenEnv-compatible-blue)](https://github.com/meta-pytorch/OpenEnv)
17
  [![Python](https://img.shields.io/badge/Python-3.10%2B-green)](https://python.org)
18
- [![License](https://img.shields.io/badge/License-BSD-yellow)](LICENSE)
 
19
 
20
- ## Why API Debugging?
 
21
 
22
- API integration failures are one of the most common and time-consuming issues in production software. When Service A calls Service B which calls Service C, a single misconfiguration can cascade through the entire system. Debugging requires:
23
 
24
- - **Structured diagnosis**: inspecting logs, configs, and endpoints across services
25
- - **Dependency awareness**: understanding which service failures affect which downstream services
26
- - **Strategic reasoning**: fixing upstream issues first to unmask downstream problems
27
 
28
- This environment simulates *real-world cascading API failures* — not toy string-matching puzzles.
29
 
30
- ## How It Works
31
 
32
  ```
33
- ┌───────────────────────────────────────────────────────────────────────────┐
34
- │                           Agent Debugging Loop                            │
35
- │                                                                           │
36
- │ 1. reset()                → Initial observation with broken service state │
37
- │ 2. step(inspect_logs)     → Error logs from target service                │
38
- │ 3. step(inspect_config)   → Current (broken) configuration                │
39
- │ 4. step(inspect_endpoint) → Live error response simulation                │
40
- │ 5. step(submit_fix)       → Fix validation + cascade resolution           │
41
- │ 6. grade()                → Multi-dimensional rubric score                │
42
- └───────────────────────────────────────────────────────────────────────────┘
43
  ```
44
 
45
- ### Service Dependency Graphs
46
-
47
- Each task models a real multi-service system with dependency chains:
48
-
49
- ```mermaid
50
- graph LR
51
- A[order_service] --> B[inventory_service]
52
- B --> C[shipping_service]
53
- A --> D[api_gateway]
54
- B --> E[auth_service]
55
- style A fill:#ff6b6b
56
- style B fill:#ffd93d
57
- style C fill:#6bcb77
58
- style D fill:#6bcb77
59
- style E fill:#6bcb77
60
  ```
61
 
62
- **Red** = error, **Yellow** = degraded, **Green** = healthy. Fixing upstream issues changes downstream health.
 
 
63
 
64
- ## Environment Design
65
 
66
- ### Dynamic State
67
 
68
- Unlike static environments, our state changes as the agent acts:
69
 
70
- 1. **Service health tracking**: Each service has a status (`healthy`, `degraded`, `error`) that updates when issues are fixed
71
- 2. **Dynamic logs**: After fixing an issue, re-inspecting logs shows *new entries* reflecting the fix
72
- 3. **Cascading effects**: Fixing an upstream issue can change downstream service behavior
73
- 4. **Error trace**: Shows the full error propagation chain, shrinking as issues are fixed
74
 
75
- ### Reward Shaping
76
 
77
- | Action | Reward | Condition |
78
- |--------|--------|-----------|
79
- | `inspect_logs` (new service, finds issues) | +0.15 | New relevant error patterns found |
80
- | `inspect_logs` (new service, no issues) | +0.05 | Valid inspection but no issues here |
81
- | `inspect_logs` (repeat, unchanged) | 0.00 | No new information |
82
- | `inspect_logs` (repeat, dynamic logs) | +0.05 | New logs appeared after a fix |
83
- | `inspect_config` (service has issues) | +0.05 | Relevant configuration retrieved |
84
- | `inspect_endpoint` | +0.02 to +0.05 | Endpoint testing |
85
- | `submit_fix` (correct) | +0.25 | Issue resolved |
86
- | `submit_fix` (correct + inspected first) | +0.30 | Diagnosis + fix strategy bonus |
87
- | `submit_fix` (partial — close value) | +0.03 | Right key, close but not exact value |
88
- | `submit_fix` (wrong) | -0.10 | Incorrect fix |
89
- | All actions complete | +0.20 | Completion bonus |
90
- | Every step | -0.01 | Step cost (encourages efficiency) |
91
 
92
  ## Tasks
93
 
94
- ### Easy: Payment API Integration (2 issues, 15 steps)
95
 
96
- Payment client failing to connect to payment gateway. Issues involve authentication and protocol errors.
97
 
98
- - **Issue pool**: 4 possible issues, 2 selected per episode
99
  - **Services**: `payment_client`, `payment_gateway`
100
- - **Issue types**: Auth header missing, wrong Content-Type, timeout, deprecated endpoint
101
 
102
- ### Medium: Webhook Event Chain (3 issues, 25 steps)
103
 
104
- Webhook notification system dropping events across a 3-service chain.
105
 
106
- - **Issue pool**: 5 possible issues, 3 selected per episode
107
  - **Services**: `webhook_sender`, `webhook_receiver`, `notification_service`
108
- - **Issue types**: Rate limiting, retry misconfiguration, webhook signature, endpoint URL, compression
109
- - **Dependencies**: Retry issue is masked by rate limit — must fix rate limit first
110
 
111
- ### Hard: E-Commerce Order Pipeline (5 issues, 40 steps)
112
 
113
- Complex order processing pipeline with cascading failures across 5 services.
114
 
115
- - **Issue pool**: 7 possible issues, 5 selected per episode
116
  - **Services**: `order_service`, `inventory_service`, `shipping_service`, `api_gateway`, `auth_service`
117
- - **Issue types**: Deprecated URLs, timeouts, race conditions, expired tokens, missing token refresh, circuit breakers, idempotency
118
- - **Dependencies**: Timeout masked by wrong URL; token refresh masked by expired token
119
 
120
- ## Grading Rubric
121
 
122
- The grader uses a **multi-dimensional rubric**, not a simple fix ratio:
123
 
124
- | Dimension | Weight | Description |
125
- |-----------|--------|-------------|
126
- | **Fix Score** | 40% | `issues_fixed / total_issues` |
127
- | **Diagnosis Score** | 20% | Did the agent inspect the service before fixing it? |
128
- | **Efficiency Score** | 15% | `remaining_steps / max_steps` — faster is better |
129
- | **Strategy Score** | 25% | Logical debugging approach: inspect before fix, avoid repeats, follow dependency order, use all action types |
130
 
131
  ```
132
- Final Score = fix × 0.40 + diagnosis × 0.20 + efficiency × 0.15 + strategy × 0.25
133
- Clamped to (0.001, 0.999)
134
  ```
135
 
136
- ### Baseline Scores
137
 
138
- | Task | Score | Steps | Issues Fixed |
139
- |------|-------|-------|--------------|
140
  | Easy | ~0.75 | 7 | 2/2 |
141
  | Medium | ~0.55 | 10 | 3/3 |
142
  | Hard | ~0.45 | 15 | 5/5 |
143
 
144
- *Baseline uses a rule-based heuristic agent (inspect all → fix all).*
145
 
146
  ## Action & Observation Spaces
147
 
148
- ### Action Space
149
 
150
  ```json
151
  {
152
  "action_type": "inspect_logs | inspect_config | inspect_endpoint | submit_fix",
153
- "target": "service_name",
154
  "fix_payload": {
155
  "config_key": "corrected_value"
156
  }
157
  }
158
  ```
159
 
160
- ### Observation Space
161
 
162
  ```json
163
  {
164
  "task_id": "easy",
165
- "task_description": "...",
166
- "logs": ["[ERROR] ..."],
167
- "config_snapshot": {"headers": {"Content-Type": "text/plain"}},
168
- "api_response": {"status": "error", "status_code": 401},
169
  "service_status": {"payment_client": "error", "payment_gateway": "healthy"},
170
- "dependency_graph": {"payment_client": ["payment_gateway"]},
171
- "error_trace": ["[CRITICAL] payment_client: Missing Authorization header"],
172
  "remaining_steps": 14,
173
  "issues_found": 1,
174
  "issues_fixed": 0,
175
  "issues_total": 2,
176
- "hints": ["Check headers.Authorization"],
177
- "available_targets": ["payment_client", "payment_gateway"]
178
  }
179
  ```
180
 
181
  ## Example Transcript
182
 
183
  ```
184
  >>> reset(task_id="easy")
185
- task_description: "Payment processing API integration is failing..."
186
  service_status: {payment_client: "error", payment_gateway: "healthy"}
187
- error_trace: [
188
- "[CRITICAL] payment_client: Missing Authorization header",
189
- " └─> payment_gateway: All requests rejected with 401",
190
- "[ERROR] payment_client: Wrong Content-Type (text/plain instead of application/json)",
191
- " └─> payment_gateway: Request body parsing fails"
192
  ]
193
-
194
- >>> step(inspect_logs, target=payment_client)
195
- logs: ["[ERROR] POST /process -> 401 Unauthorized", ...]
196
  issues_found: 2, reward: +0.15
197
 
198
- >>> step(inspect_config, target=payment_client)
199
- config: {headers: {Content-Type: "text/plain", Accept: "..."}, ...}
200
- reward: +0.05
201
 
202
- >>> step(submit_fix, target=payment_client, fix_payload={headers.Authorization: "Bearer sk_key"})
203
- action_result: "Fix accepted! Fixed 1 issue(s)."
204
- service_status: {payment_client: "degraded"} # still has content-type issue
205
- reward: +0.30
 
206
 
207
- >>> step(inspect_logs, target=payment_client) # re-inspect shows new logs!
208
- logs: [...original..., "[INFO] Authorization header set. Retrying request..."]
209
- reward: +0.05 # reward for checking updated state
 
 
210
 
211
- >>> step(submit_fix, target=payment_client, fix_payload={headers.Content-Type: "application/json"})
212
- action_result: "Fix accepted! All issues fixed! Episode complete. πŸŽ‰"
 
213
  service_status: {payment_client: "healthy", payment_gateway: "healthy"}
214
  error_trace: ["All issues resolved. No error cascades active."]
215
- reward: +0.50 (fix + completion bonus)
 
216
 
217
  >>> grade()
218
- score: 0.82 (fix=1.0, diagnosis=1.0, efficiency=0.67, strategy=0.8)
219
  ```
220
 
 
 
221
  ## Setup & Usage
222
 
223
  ### Install Dependencies
224
 
225
  ```bash
226
- cd api_debug_env # or project root
227
  uv sync
228
  ```
229
 
230
- ### Run Locally
231
 
232
  ```bash
233
- uvicorn server.app:app --reload --port 8000
 
234
  ```
235
 
236
- ### Run Tests
 
 
237
 
238
  ```bash
239
- python -m pytest tests/ -v --tb=short
240
  ```
241
 
242
- ### Docker
243
 
244
  ```bash
245
- docker build -t api_debug_env -f server/Dockerfile .
 
246
  docker run -p 8000:8000 api_debug_env
247
  ```
248
 
249
- ### API Endpoints
 
 
250
 
251
  | Endpoint | Method | Description |
252
- |----------|--------|-------------|
253
- | `/` | GET | Environment info + status |
254
- | `/reset` | POST | Reset environment |
255
- | `/step` | POST | Execute an action |
256
- | `/state` | GET | Get current state |
257
- | `/tasks` | GET | List all tasks with schemas |
258
- | `/grader` | POST | Get grading score |
259
- | `/baseline` | POST | Run baseline agent |
260
- | `/health` | GET | Health check |
261
-
262
- ### Run Inference
263
 
264
  ```bash
265
- export HF_TOKEN=your_token_here
266
  python inference.py
267
  ```
268
 
269
- ## Design Philosophy
270
 
271
- This environment is designed to be useful for **RL/agent training**, not just evaluation:
272
 
273
- 1. **Dense Rewards**: Every action type can yield positive or negative reward, enabling gradient-based training
274
- 2. **Progressive Difficulty**: Easy→Medium→Hard with increasing service count and dependency complexity
275
- 3. **Partial Credit**: Close-but-wrong fixes get feedback instead of binary rejection
276
- 4. **Strategy Incentives**: The multi-dimensional rubric rewards *how* the agent solves, not just *what* it solves
277
- 5. **Stochastic**: Seed-based randomization prevents policy overfitting to memorized scenarios
278
- 6. **Cascading Dynamics**: Upstream fixes change downstream state, requiring multi-step reasoning
279
 
280
  ## Project Structure
281
 
282
  ```
283
- ├── models.py          # Pydantic Action & Observation definitions
284
- ├── scenarios.py       # Task scenarios with dependency graphs
285
- ├── inference.py       # MANDATORY baseline inference script
286
- ├── openenv.yaml       # OpenEnv metadata
287
- ├── pyproject.toml     # Dependencies
288
  ├── server/
289
- │   ├── api_debug_env_environment.py   # Core environment logic
290
- │   ├── app.py                         # FastAPI endpoints
291
- │   └── Dockerfile                     # HF Spaces deployment
292
  └── tests/
293
-     └── test_environment.py            # 48+ unit & integration tests
294
  ```
 
11
 
12
  # 🔧 API Integration Debugging Environment
13
 
14
+ > A real-world OpenEnv environment where an AI agent diagnoses and fixes broken API integrations across multi-service systems with **cascading failures**, **dynamic state**, and **multi-dimensional rubric grading**.
15
 
16
+ [![OpenEnv](https://img.shields.io/badge/OpenEnv-v0.2.2-blue)](https://github.com/meta-pytorch/OpenEnv)
17
  [![Python](https://img.shields.io/badge/Python-3.10%2B-green)](https://python.org)
18
+ [![Tests](https://img.shields.io/badge/Tests-70%20passed-brightgreen)]()
19
+ [![HF Space](https://img.shields.io/badge/HF%20Space-Live-orange)](https://huggingface.co/spaces/yadnyeshkolte/api-debug-env)
20
 
21
+ ---
22
+
23
+ ## Table of Contents
24
+
25
+ - [Motivation — Why API Debugging?](#motivation--why-api-debugging)
26
+ - [Environment Overview](#environment-overview)
27
+ - [Key Design Features](#key-design-features)
28
+ - [Tasks (Easy / Medium / Hard)](#tasks)
29
+ - [Multi-Dimensional Grading Rubric](#multi-dimensional-grading-rubric)
30
+ - [Reward Shaping](#reward-shaping)
31
+ - [Action & Observation Spaces](#action--observation-spaces)
32
+ - [Example Transcript](#example-transcript)
33
+ - [Setup & Usage](#setup--usage)
34
+ - [API Endpoints](#api-endpoints)
35
+ - [Running Inference](#running-inference)
36
+ - [Running Tests](#running-tests)
37
+ - [Project Structure](#project-structure)
38
+ - [Design Philosophy](#design-philosophy)
39
+
40
+ ---
41
 
42
+ ## Motivation — Why API Debugging?
43
 
44
+ API integration failures are one of the **most common and expensive issues** in production software engineering. When microservices communicate — Service A calls Service B which calls Service C — a single misconfiguration can cascade through the entire system, producing confusing error chains that take hours to diagnose.
 
 
45
 
46
+ Real-world API debugging requires:
47
 
48
+ - **Structured diagnosis** — reading error logs and configs across multiple services
49
+ - **Dependency awareness** — understanding which upstream failure is causing downstream errors
50
+ - **Strategic reasoning** — fixing root causes first to unmask hidden downstream bugs
51
+ - **Precision** — submitting exact configuration corrections, not approximate guesses
52
+
53
+ This environment simulates **real-world cascading API failures** with dynamic state that changes as the agent acts — not a static lookup puzzle.
54
+
55
+ ---
56
 
57
+ ## Environment Overview
58
+
59
+ ```
60
+ ┌────────────────────────────────────────────────────────────────────┐
61
+ │                        Agent Debugging Loop                        │
62
+ │                                                                    │
63
+ │ 1. reset(task_id)         → Initial observation with broken state  │
64
+ │ 2. step(inspect_logs)     → Error logs with diagnostic clues       │
65
+ │ 3. step(inspect_config)   → Current (broken) service configuration │
66
+ │ 4. step(inspect_endpoint) → Simulated API response (401, 504..)    │
67
+ │ 5. step(submit_fix)       → Strict fix validation + cascade update │
68
+ │ 6. grade()                → Multi-dimensional rubric score [0,1]   │
69
+ │                                                                    │
70
+ │ State updates dynamically: service health changes, new logs        │
71
+ │ appear, error cascades resolve as the agent fixes issues.          │
72
+ └────────────────────────────────────────────────────────────────────┘
73
  ```
74
+
75
+ The agent interacts through the standard OpenEnv API:
76
+ - **`reset()`** → returns initial observation with broken service state
77
+ - **`step(action)`** → executes one debugging action, returns observation + reward
78
+ - **`state()`** → returns current environment state (episode_id, step_count)
79
+ - **`grade()`** → returns final score using multi-dimensional rubric
80
+
81
+ ---
82
+
83
+ ## Key Design Features
84
+
85
+ ### 1. Cascading Failures with Service Dependency Graphs
86
+
87
+ Each task models a real multi-service ecosystem. Services depend on each other, and a bug in an upstream service **cascades** to all downstream services:
88
+
89
  ```
90
+ Hard Task Dependency Graph:
91
+
92
+ order_service ──┬──→ inventory_service ──┬──→ shipping_service
93
+                 │                        └──→ auth_service
94
+                 └──→ api_gateway
95
+
96
+    [ERROR]             [DEGRADED]               [HEALTHY]
 
97
  ```
98
 
99
+ - Fixing `order_service`'s wrong URL unmasks `inventory_service`'s timeout issue
100
+ - Fixing `inventory_service`'s expired token allows `shipping_service` to respond
101
+ - **Some issues are intentionally masked by upstream failures** — the agent must fix in the right order
102
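Because of this masking, a sensible agent orders its fixes upstream-first using the `dependency_graph` observation field. A small topological-sort sketch (the `fix_order` name is illustrative; the graph literal mirrors the hard task above):

```python
from collections import deque

def fix_order(graph: dict[str, list[str]]) -> list[str]:
    """Order services so each one comes before everything downstream of it.

    `graph` maps a service to the services it calls, matching the
    `dependency_graph` observation field.
    """
    indegree = {service: 0 for service in graph}
    for downstream in graph.values():
        for svc in downstream:
            indegree[svc] = indegree.get(svc, 0) + 1
    queue = deque(s for s, d in indegree.items() if d == 0)  # roots first
    order = []
    while queue:
        svc = queue.popleft()
        order.append(svc)
        for child in graph.get(svc, []):
            indegree[child] -= 1
            if indegree[child] == 0:
                queue.append(child)
    return order

order = fix_order({
    "order_service": ["inventory_service", "api_gateway"],
    "inventory_service": ["shipping_service", "auth_service"],
    "shipping_service": [], "api_gateway": [], "auth_service": [],
})
print(order)  # → ['order_service', 'inventory_service', 'api_gateway', 'shipping_service', 'auth_service']
```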
 
103
+ ### 2. Dynamic State
104
 
105
+ Unlike static environments, the state **changes as the agent acts**:
106
 
107
+ | What changes | How |
108
+ |---|---|
109
+ | **Service health** | Fixing issues updates service status: `error` → `degraded` → `healthy` |
110
+ | **Logs** | After a fix, re-inspecting logs shows **new entries** (e.g., "Authorization header set. Retrying...") |
111
+ | **Error traces** | The cascade chain shrinks as upstream issues are resolved |
112
+ | **Endpoint responses** | `inspect_endpoint` returns different HTTP errors based on current fix state |
113
 
114
+ ### 3. Seed-Based Scenario Randomization
115
 
116
+ Each difficulty level has an **expanded issue pool** (more issues than are selected per episode):
117
 
118
+ | Difficulty | Pool Size | Selected Per Episode |
119
+ |---|---|---|
120
+ | Easy | 4 issues | 2 |
121
+ | Medium | 5 issues | 3 |
122
+ | Hard | 7 issues | 5 |
123
+
124
+ Passing a `seed` to `reset()` produces a **deterministic but varied** scenario — different seeds select different subsets from the pool and randomize log order. This prevents agents from memorizing fixed patterns.
125
+
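The idea can be illustrated with Python's seeded `random.Random` (a sketch of the concept, not the environment's actual sampler; `select_issues` is an illustrative name, and the issue labels mirror the medium task's pool):

```python
import random

MEDIUM_POOL = ["rate_limit", "retry", "signature", "url", "compression"]

def select_issues(pool: list[str], k: int, seed: int) -> list[str]:
    """Pick k issues from the pool, deterministically for a given seed."""
    return random.Random(seed).sample(pool, k)

# Same seed -> same scenario; different seeds may draw different subsets.
episode_a = select_issues(MEDIUM_POOL, 3, seed=42)
episode_b = select_issues(MEDIUM_POOL, 3, seed=42)
episode_c = select_issues(MEDIUM_POOL, 3, seed=7)
```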
126
+ ### 4. Strict Fix Validation with Partial Credit
127
+
128
+ The grader validates both **keys and values** of submitted fixes:
129
+
130
+ - **Exact match** → Full credit (+0.25 reward)
131
+ - **Right key, close value** (e.g., timeout=7 when expected=10) → Partial credit (+0.03)
132
+ - **Right key, wrong value** (e.g., timeout=100 when expected=10) → Rejected
133
+ - **Wrong key entirely** → Penalized (-0.1)
134
+ - **Bearer token pattern matching** — `Bearer <any_valid_token>` is accepted
135
+ - **Numeric tolerance** — submitted numbers within a strict 10% of the expected value still count
136
+ - **Boolean coercion** — `"true"`, `"1"`, `"yes"` all match `True`
137
+
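The rules above can be sketched as a small classifier. This is an illustration, not the environment's code: `classify_fix` is a made-up name, and the 50% partial-credit band is a placeholder chosen only to reproduce the examples above (7 vs 10 is "close", 100 vs 10 is not) — the environment's real thresholds may differ:

```python
import re

def classify_fix(expected, submitted) -> str:
    """Return "full", "partial", or "wrong" for a submitted config value."""
    # Boolean coercion: "true", "1", "yes" all count as True
    if isinstance(expected, bool):
        coerced = str(submitted).strip().lower() in {"true", "1", "yes"}
        return "full" if coerced == expected else "wrong"
    # Bearer token pattern matching: any well-formed token is accepted
    if isinstance(expected, str) and expected.startswith("Bearer "):
        ok = isinstance(submitted, str) and re.fullmatch(r"Bearer \S+", submitted)
        return "full" if ok else "wrong"
    # Numbers: exact value is full credit, nearby values earn partial credit
    if isinstance(expected, (int, float)) and isinstance(submitted, (int, float)):
        if submitted == expected:
            return "full"
        close = abs(submitted - expected) <= 0.5 * abs(expected)  # placeholder band
        return "partial" if close else "wrong"
    return "full" if submitted == expected else "wrong"
```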
138
+ ---
139
 
140
  ## Tasks
141
 
142
+ ### Easy: Payment API Integration (2 issues, 15 max steps)
143
 
144
+ **Scenario**: A payment processing client is failing to connect to the payment gateway. The agent must diagnose authentication and protocol errors.
145
 
 
146
  - **Services**: `payment_client`, `payment_gateway`
147
+ - **Issue pool** (4 possible, 2 selected):
148
+ - Missing `Authorization` header (HTTP 401)
149
+ - Wrong `Content-Type` header — `text/plain` instead of `application/json` (HTTP 415)
150
+ - Timeout too low for payment processing (HTTP 504)
151
+ - Base URL pointing to deprecated v1 endpoint (HTTP 301)
152
+ - **Dependencies**: None — straightforward diagnosis
153
 
154
+ ### Medium: Webhook Event Chain (3 issues, 25 max steps)
155
 
156
+ **Scenario**: A webhook notification system is dropping events across a 3-service chain. Events flow from sender → receiver → notification service, but multiple configuration issues are causing failures.
157
 
 
158
  - **Services**: `webhook_sender`, `webhook_receiver`, `notification_service`
159
+ - **Issue pool** (5 possible, 3 selected):
160
+ - Rate limit mismatch (sender at 100/s, receiver accepts 10/s) → 429 errors
161
+ - Insufficient retry config (only 1 retry, no backoff, 429 not in retry list)
162
+ - Empty webhook signature header → receiver drops all events as unsigned
163
+ - Wrong target URL (`/webhook` vs `/hooks/incoming`) → 404 errors
164
+ - Payload compression enabled but receiver doesn't support gzip → 415 errors
165
+ - **Dependencies**: Retry issue is **masked** by rate limit — must fix rate limit first to see the retry problem
166
 
167
+ ### Hard: E-Commerce Order Pipeline (5 issues, 40 max steps)
168
 
169
+ **Scenario**: A complex e-commerce order processing pipeline is failing with cascading errors across 5 services. Multiple dependency chains make this genuinely challenging for frontier models.
170
 
 
171
  - **Services**: `order_service`, `inventory_service`, `shipping_service`, `api_gateway`, `auth_service`
172
+ - **Issue pool** (7 possible, 5 selected):
173
+ - Deprecated URL (`/v1/check` → should be `/v2/reserve`) → 301 redirect
174
+ - Timeout too short (2s vs 4s processing time) — masked by wrong URL
175
+ - Synchronous mode causing race conditions between concurrent orders
176
+ - Expired auth token on inventory→shipping calls → 401
177
+ - No auto token refresh configured — masked by expired token
178
+ - No circuit breaker → failed requests hammer inventory service
179
+ - Missing idempotency key → retries create duplicate orders
180
+ - **Dependencies**: `timeout` depends on `wrong_url` fix; `token_refresh` depends on `expired_token` fix; `idempotency` depends on `async` fix
181
+
182
+ ---
183
 
184
+ ## Multi-Dimensional Grading Rubric
185
 
186
+ The grader uses a **4-dimension weighted rubric**, not a simple `issues_fixed / total` ratio:
187
 
188
+ | Dimension | Weight | What It Measures |
189
+ |---|---|---|
190
+ | **Fix Score** | 40% | `issues_fixed / total_issues` — how many bugs were actually resolved |
191
+ | **Strategy Score** | 25% | Did the agent follow a logical approach? Inspect before fix, avoid repeats, follow dependency order, use all action types |
192
+ | **Diagnosis Score** | 20% | Did the agent inspect the service (logs/config) **before** submitting a fix for it? |
193
+ | **Efficiency Score** | 15% | `remaining_steps / max_steps` — faster solutions score higher |
194
 
195
  ```
196
+ Final Score = fix × 0.40 + strategy × 0.25 + diagnosis × 0.20 + efficiency × 0.15
197
+ Clamped to (0.001, 0.999) — never exactly 0.0 or 1.0
198
  ```
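The formula above, as straight Python (the `final_score` name is illustrative; the weights and clamp come directly from the rubric):

```python
def final_score(fix: float, strategy: float, diagnosis: float,
                efficiency: float) -> float:
    """Weighted rubric score, clamped away from exact 0.0 and 1.0."""
    raw = 0.40 * fix + 0.25 * strategy + 0.20 * diagnosis + 0.15 * efficiency
    return min(max(raw, 0.001), 0.999)

# Even a perfect episode never reaches exactly 1.0:
perfect = final_score(1.0, 1.0, 1.0, 1.0)  # clamps to 0.999
```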
199
 
200
+ **Strategy scoring details:**
201
+ - Did the agent inspect logs/config before submitting any fix? (+1)
202
+ - Ratio of unique inspections to total inspections (no wasteful repeats) (+1)
203
+ - Did fixes follow the optimal dependency order? (+1)
204
+ - Did the agent use a variety of action types? (+1)
205
 
206
+ ### Baseline Scores (Rule-Based Heuristic Agent)
207
+
208
+ | Task | Score | Steps Used | Issues Fixed |
209
+ |---|---|---|---|
210
  | Easy | ~0.75 | 7 | 2/2 |
211
  | Medium | ~0.55 | 10 | 3/3 |
212
  | Hard | ~0.45 | 15 | 5/5 |
213
 
214
+ *The baseline uses a deterministic heuristic (inspect all logs → inspect all configs → submit known fixes). An LLM-based agent following good debugging strategy can score higher.*
215
+
216
+ ---
217
+
218
+ ## Reward Shaping
219
+
220
+ Every action produces a meaningful reward signal — not just sparse end-of-episode feedback:
221
+
222
+ | Action | Reward | Condition |
223
+ |---|---|---|
224
+ | `inspect_logs` (first time, finds error patterns) | **+0.15** | New issue-related log patterns found |
225
+ | `inspect_logs` (first time, no issues here) | +0.05 | Valid inspection, no errors in this service |
226
+ | `inspect_logs` (repeat, no new info) | 0.00 | Already inspected, nothing changed |
227
+ | `inspect_logs` (repeat, after a fix) | +0.05 | Dynamic logs appeared after a recent fix |
228
+ | `inspect_config` (service has issues) | +0.05 | Relevant config retrieved |
229
+ | `inspect_config` (service is clean) | +0.01 | Config retrieved but no issues here |
230
+ | `inspect_config` (repeat) | 0.00 | Already inspected |
231
+ | `inspect_endpoint` | +0.02 to +0.05 | Simulated endpoint test |
232
+ | `submit_fix` (correct fix) | **+0.25** | Issue resolved, service health updated |
233
+ | `submit_fix` (correct + inspected first) | **+0.30** | Fix + strategy bonus for diagnosis |
234
+ | `submit_fix` (partial — close but not exact) | +0.03 | Right key, approximately right value |
235
+ | `submit_fix` (wrong fix) | **-0.10** | Incorrect fix payload |
236
+ | `submit_fix` (empty payload) | -0.10 | Empty fix_payload submitted |
237
+ | All issues fixed | **+0.20** | Episode completion bonus |
238
+ | Invalid target / invalid action | -0.05 | Bad input |
239
+ | Every step | **-0.01** | Step cost — encourages efficiency |
240
+
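The `submit_fix` rows of the table can be read as a tiny reward function (a sketch with an illustrative name, not the environment's source):

```python
def submit_fix_reward(correct: bool, inspected_first: bool = False,
                      close: bool = False) -> float:
    """Reward for one submit_fix action, per the reward-shaping table."""
    if correct:
        # +0.25 for the fix, +0.05 strategy bonus if the service was inspected first
        return 0.30 if inspected_first else 0.25
    # Partial credit for a close value, otherwise a penalty
    return 0.03 if close else -0.10
```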
241
+ ---
242
 
243
  ## Action & Observation Spaces
244
 
245
+ ### Action Schema (Pydantic model: `ApiDebugAction`)
246
 
247
  ```json
248
  {
249
  "action_type": "inspect_logs | inspect_config | inspect_endpoint | submit_fix",
250
+ "target": "<service_name>",
251
  "fix_payload": {
252
  "config_key": "corrected_value"
253
  }
254
  }
255
  ```
256
 
257
+ - `action_type` (required): One of the 4 debugging actions
258
+ - `target` (required): The service to act on (from `available_targets` in the observation)
259
+ - `fix_payload` (optional): Required only for `submit_fix` — the configuration correction
260
+
261
+ **Fix payload formats:**
262
+ ```json
263
+ // Simple key-value fix
264
+ {"timeout": 10}
265
+
266
+ // Nested key fix (dot notation)
267
+ {"headers.Authorization": "Bearer my_api_key"}
268
+
269
+ // Complex nested object fix
270
+ {"retry": {"max_retries": 3, "backoff_factor": 2, "retry_on_status": [429, 500]}}
271
+ ```
272
+
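Applying a dot-notation payload to a config snapshot can be sketched as follows (illustrative only; `apply_fix` is this example's name, not the environment's implementation):

```python
def apply_fix(config: dict, fix_payload: dict) -> dict:
    """Apply a fix_payload to a config snapshot, expanding dot-notation
    keys such as "headers.Authorization" into nested updates."""
    for key, value in fix_payload.items():
        node = config
        *parents, leaf = key.split(".")
        for part in parents:
            node = node.setdefault(part, {})  # create nested dicts as needed
        node[leaf] = value
    return config

cfg = {"headers": {"Content-Type": "text/plain"}, "timeout": 30}
apply_fix(cfg, {"headers.Authorization": "Bearer my_api_key", "timeout": 10})
```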
273
+ ### Observation Schema (Pydantic model: `ApiDebugObservation`)
274
 
275
  ```json
276
  {
277
  "task_id": "easy",
278
+ "task_description": "A payment processing API integration is failing...",
279
+ "logs": ["[ERROR] 2026-03-25T10:15:23Z POST /process -> 401 Unauthorized", "..."],
280
+ "config_snapshot": {"headers": {"Content-Type": "text/plain"}, "timeout": 30},
281
+ "api_response": {"status": "error", "status_code": 401, "error": "Missing Authorization"},
282
  "service_status": {"payment_client": "error", "payment_gateway": "healthy"},
283
+ "dependency_graph": {"payment_client": ["payment_gateway"], "payment_gateway": []},
284
+ "error_trace": [
285
+ "[CRITICAL] payment_client: Missing Authorization header",
286
+ " └─> payment_gateway: All requests rejected with 401"
287
+ ],
288
+ "hints": ["Check headers.Authorization"],
289
  "remaining_steps": 14,
290
  "issues_found": 1,
291
  "issues_fixed": 0,
292
  "issues_total": 2,
293
+ "action_result": "Inspected logs for 'payment_client'. Found relevant error patterns!",
294
+ "available_targets": ["payment_client", "payment_gateway"],
295
+ "done": false,
296
+ "reward": 0.15
297
  }
298
  ```
299
 
300
+ **Key observation fields for agent reasoning:**
301
+ - `service_status` — shows which services are healthy/degraded/error (updates dynamically)
302
+ - `dependency_graph` — shows service relationships (agent should fix upstream first)
303
+ - `error_trace` — shows active error cascades (shrinks as issues are fixed)
304
+ - `hints` — progressive hints that get more specific as steps are used
305
+
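A minimal policy can be driven off these fields. This is a naive illustrative sketch (`next_action` is a made-up name; a real agent would also read `logs`, `hints`, and `error_trace`):

```python
def next_action(obs: dict) -> dict:
    """Pick the next debugging action from an observation:
    inspect the first non-healthy service, else probe an endpoint."""
    for service, status in obs.get("service_status", {}).items():
        if status != "healthy":
            return {"action_type": "inspect_logs", "target": service}
    return {"action_type": "inspect_endpoint",
            "target": obs["available_targets"][0]}

obs = {
    "service_status": {"payment_client": "error", "payment_gateway": "healthy"},
    "available_targets": ["payment_client", "payment_gateway"],
}
print(next_action(obs))  # → {'action_type': 'inspect_logs', 'target': 'payment_client'}
```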
306
+ ---
307
+
308
  ## Example Transcript
309
 
310
  ```
311
  >>> reset(task_id="easy")
312
+ task_description: "A payment processing API integration is failing..."
313
  service_status: {payment_client: "error", payment_gateway: "healthy"}
314
+ error_trace:
315
+ [CRITICAL] payment_client: Missing Authorization header
316
+   └─> payment_gateway: All requests rejected with 401
317
+ [ERROR] payment_client: Wrong Content-Type (text/plain instead of application/json)
318
+   └─> payment_gateway: Request body parsing fails
319
+ issues_total: 2, remaining_steps: 15
320
+
321
+ >>> step(action_type="inspect_logs", target="payment_client")
322
+ logs: [
323
+ "[INFO] Payment client initialized...",
324
+ "[ERROR] POST /process -> 401 Unauthorized",
325
+ "[ERROR] Response: {'error': 'Missing or invalid Authorization header'}",
326
+ "[WARN] Request headers: Content-Type=text/plain",
327
+ "[ERROR] POST /process -> 415 Unsupported Media Type",
328
  ]
329
  issues_found: 2, reward: +0.15
330
 
331
+ >>> step(action_type="inspect_config", target="payment_client")
332
+ config_snapshot: {
333
+ "base_url": "https://api.paymentgateway.com/v2",
334
+ "headers": {"Content-Type": "text/plain", "Accept": "application/json"},
335
+ "timeout": 30
336
+ }
337
+ reward: +0.05 // Service has issues, first inspection
338
 
339
+ >>> step(action_type="submit_fix", target="payment_client",
340
+ fix_payload={"headers.Authorization": "Bearer sk_live_my_key"})
341
+ action_result: "Fix accepted! Fixed 1 issue(s). Total: 1/2"
342
+ service_status: {payment_client: "degraded", payment_gateway: "healthy"}
343
+ reward: +0.30 // Fix (+0.25) + strategy bonus (+0.05) for inspecting first
344
 
345
+ >>> step(action_type="inspect_logs", target="payment_client")
346
+ logs: [...original logs...,
347
+ "[INFO] Authorization header set. Retrying request..." // NEW dynamic log!
348
+ ]
349
+ reward: +0.05 // Re-inspection has new dynamic logs
350
 
351
+ >>> step(action_type="submit_fix", target="payment_client",
352
+ fix_payload={"headers.Content-Type": "application/json"})
353
+ action_result: "Fix accepted! All issues fixed! Episode complete."
354
  service_status: {payment_client: "healthy", payment_gateway: "healthy"}
355
  error_trace: ["All issues resolved. No error cascades active."]
356
+ reward: +0.50 // Fix (+0.25) + strategy (+0.05) + completion bonus (+0.20)
357
+ done: true
358
 
359
  >>> grade()
360
+ score: 0.90
361
+ fix_score: 1.00 (2/2 fixed)
362
+ diagnosis_score: 1.00 (inspected before every fix)
363
+ efficiency_score: 0.67 (5/15 steps used)
364
+ strategy_score: 0.80 (inspected first, used multiple action types)
365
  ```

---

## Setup & Usage

### Prerequisites

- Python 3.10+
- [uv](https://docs.astral.sh/uv/) (recommended) or pip
- Docker (for containerized deployment)

### Install Dependencies

```bash
# Clone the repository
git clone https://github.com/yadnyeshkolte/openenv-task.git
cd openenv-task

# Install dependencies with uv
uv sync

# Or with pip
pip install -e .
```

### Run the Server Locally

```bash
# From the project root (openenv-task/)
uvicorn server.app:app --reload --host 0.0.0.0 --port 8000
```

The server will be available at `http://localhost:8000`. Visit `http://localhost:8000/docs` for interactive API documentation.

### Quick Test

```bash
# Reset the environment
curl -X POST http://localhost:8000/reset

# Inspect logs
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "inspect_logs", "target": "payment_client"}'

# Submit a fix
curl -X POST http://localhost:8000/step \
  -H "Content-Type: application/json" \
  -d '{"action_type": "submit_fix", "target": "payment_client", "fix_payload": {"headers.Authorization": "Bearer my_key"}}'
```

### Docker Build & Run

```bash
# From the project root (openenv-task/)
docker build -t api_debug_env -f Dockerfile .
docker run -p 8000:8000 api_debug_env
```

---

## API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Environment info, version, and feature list |
| `/reset` | POST | Reset the environment (accepts `task_id` and `seed` params) |
| `/step` | POST | Execute a debugging action |
| `/state` | GET | Get the current state (`episode_id`, `step_count`) |
| `/schema` | GET | Get the action/observation Pydantic schemas |
| `/tasks` | GET | List all 3 tasks with action schema and service dependencies |
| `/grader` | POST | Get the multi-dimensional grader score for the current episode |
| `/baseline` | POST | Run the rule-based baseline agent on all 3 tasks |
| `/health` | GET | Health check endpoint |
| `/docs` | GET | Interactive Swagger UI documentation |
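The same endpoints can be driven from Python with nothing but the standard library. A minimal sketch mirroring the curl examples above; it assumes the server is running locally, and the helper names are our own:

```python
import json
import urllib.request

BASE = "http://localhost:8000"

def build_step_body(action_type, target, fix_payload=None):
    """Assemble the JSON body that POST /step expects."""
    body = {"action_type": action_type, "target": target}
    if fix_payload is not None:
        body["fix_payload"] = fix_payload
    return body

def post_json(path, body):
    """POST a JSON body to the environment server and decode the reply."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)

# With the server running:
# obs = post_json("/reset", {})
# logs = post_json("/step", build_step_body("inspect_logs", "payment_client"))
```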

---

## Running Inference

The `inference.py` script at the project root uses the OpenAI API client to run an LLM agent against all 3 tasks:

```bash
# Set your API credentials
export HF_TOKEN=your_huggingface_token
# Optional: override the model and API base
export MODEL_NAME=Qwen/Qwen2.5-72B-Instruct
export API_BASE_URL=https://router.huggingface.co/v1

# Run inference from the project root
python inference.py
```

**Output format** (stdout):
```
[START] task=easy env=api_debug_env model=Qwen/Qwen2.5-72B-Instruct
[STEP] step=1 action=inspect_logs(target=payment_client) reward=0.15 done=false error=null
[STEP] step=2 action=submit_fix(target=payment_client, fix={...}) reward=0.30 done=false error=null
...
[END] success=true steps=5 score=0.820 rewards=0.15,0.30,...
```
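This line-oriented format is easy to post-process, for example to aggregate scores across runs. A small sketch assuming the `key=value` layout shown above:

```python
import re

# Matches the leading fields of an [END] line; trailing fields are ignored
END_RE = re.compile(r"\[END\] success=(\w+) steps=(\d+) score=([\d.]+)")

def parse_end_line(line):
    """Pull the success flag, step count, and score out of an [END] line."""
    m = END_RE.search(line)
    if m is None:
        return None  # not an [END] line
    return {
        "success": m.group(1) == "true",
        "steps": int(m.group(2)),
        "score": float(m.group(3)),
    }
```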

The inference script:
- Uses the `openai.OpenAI` client for all LLM calls
- Reads `HF_TOKEN` (or `API_KEY`) from environment variables
- Includes retry logic with exponential backoff
- Emits `[START]`, `[STEP]`, and `[END]` lines to stdout
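The retry behavior can be sketched as a small wrapper around each LLM call; this is a hypothetical helper for illustration, not the script's literal code:

```python
import time

def call_with_backoff(fn, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on exception, retry with exponentially growing delays.

    The sleep function is injectable so tests can record delays
    instead of actually waiting.
    """
    for attempt in range(retries + 1):
        try:
            return fn()
        except Exception:
            if attempt == retries:
                raise  # out of retries: surface the last error
            sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Wrapping each chat-completion call this way smooths over transient API errors without aborting the episode.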

---

## Running Tests

```bash
# From the project root (openenv-task/)
python -m pytest tests/ -v --tb=short
```

**70 tests** across 12 test classes cover:
- Scenario loading, seed randomization, and issue-pool selection
- Environment reset and initialization
- All 4 action types: `inspect_logs`, `inspect_config`, `inspect_endpoint`, `submit_fix`
- Dynamic state: service health updates, dynamic log injection, error-trace changes
- The multi-dimensional grading rubric (fix, diagnosis, efficiency, strategy)
- Strict fix validation with partial credit
- Value matching (strings, numbers, booleans, lists, Bearer tokens)
- Full-episode integration tests (easy, medium, hard)
- Cascading-failure mechanics and dependency chains
- Episode termination conditions

### Validate OpenEnv Compliance

```bash
openenv validate
```

---

## Project Structure

```
openenv-task/                          # Project root
β”œβ”€β”€ __init__.py                        # Package init (exports ApiDebugEnv, Action, Observation)
β”œβ”€β”€ client.py                          # OpenEnv client (WebSocket connection to the server)
β”œβ”€β”€ models.py                          # Pydantic Action & Observation type definitions
β”œβ”€β”€ scenarios.py                       # Task scenarios with dependency graphs & issue pools
β”œβ”€β”€ inference.py                       # Mandatory inference script (LLM agent, OpenAI client)
β”œβ”€β”€ openenv.yaml                       # OpenEnv metadata (spec v1)
β”œβ”€β”€ pyproject.toml                     # Python project config & dependencies
β”œβ”€β”€ Dockerfile                         # Docker build for HF Spaces deployment
β”œβ”€β”€ LICENSE                            # BSD license
β”œβ”€β”€ README.md                          # This file
β”œβ”€β”€ PROGRESS.md                        # Development session log
β”œβ”€β”€ AGENTS.md                          # Instructions for AI coding agents
β”œβ”€β”€ server/
β”‚   β”œβ”€β”€ __init__.py                    # Server package init
β”‚   β”œβ”€β”€ api_debug_env_environment.py   # Core environment (reset/step/grade logic)
β”‚   β”œβ”€β”€ app.py                         # FastAPI endpoints (/reset, /step, /tasks, etc.)
β”‚   β”œβ”€β”€ Dockerfile                     # Alternate Dockerfile (same as root)
β”‚   └── requirements.txt               # Server-specific requirements
β”œβ”€β”€ scripts/
β”‚   └── baseline_inference.py          # Alternate baseline script
└── tests/
    └── test_environment.py            # 70 unit & integration tests
```

### Key Files

| File | Purpose |
|---|---|
| `server/api_debug_env_environment.py` | **Core logic** β€” `reset()`, `step()`, `grade()`, dynamic state, cascading failures |
| `scenarios.py` | **Task definitions** β€” issue pools, dependency graphs, dynamic logs, service configs |
| `models.py` | **Type definitions** β€” `ApiDebugAction` and `ApiDebugObservation` Pydantic models |
| `inference.py` | **Mandatory** β€” LLM-based agent using the OpenAI client with `[START]/[STEP]/[END]` output |
| `openenv.yaml` | **Mandatory** β€” OpenEnv spec v1 metadata with task definitions |
| `server/app.py` | **FastAPI server** β€” all HTTP endpoints, including `/baseline` and `/grader` |

---

## Design Philosophy

This environment is designed to be useful for **RL/agent training and evaluation**, not just a one-off benchmark:

1. **Dense Reward Signal** β€” every action type produces a positive or negative reward, enabling gradient-based training (GRPO, DPO, PPO) rather than a single sparse binary score at the end.

2. **Progressive Difficulty** β€” Easy (2 services, 2 issues) β†’ Medium (3 services, 3 issues with 1 dependency) β†’ Hard (5 services, 5 issues with multiple dependency chains). Difficulty comes from complexity, not ambiguity.

3. **Partial Credit** β€” close-but-wrong fixes get constructive feedback instead of flat rejection, providing learning signal for agents that are on the right track.

4. **Strategy Incentives** β€” the multi-dimensional rubric rewards **how** the agent solves the task (inspect before fixing, follow dependencies, avoid waste), not just **what** it solves. This encourages emergent debugging strategies.

5. **Stochastic Scenarios** β€” seed-based randomization from expanded issue pools prevents policies from overfitting to memorized scenarios while maintaining reproducibility.

6. **Cascading Dynamics** β€” upstream fixes change downstream state, requiring **multi-step causal reasoning**. The agent can't pattern-match each issue independently; it must understand the system architecture.

7. **Real-World Relevance** β€” API integration debugging is a genuine, high-value task that software engineers spend significant time on. The scenarios model actual failure patterns (expired tokens, rate limiting, missing headers, deprecated endpoints, race conditions).
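The dependency-aware strategy rewarded in points 4 and 6 amounts to fixing services in topological order of the dependency graph. A minimal sketch with a hypothetical graph; any service name beyond `payment_client`/`payment_gateway` is illustrative:

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Map each service to the upstream services it depends on (hypothetical graph)
deps = {
    "payment_client": {"payment_gateway"},
    "payment_gateway": {"auth_service"},
    "auth_service": set(),
}

# static_order() yields upstream services before their dependents, so fixes
# applied in this order are never masked by an unresolved upstream failure.
fix_order = list(TopologicalSorter(deps).static_order())
# fix_order == ["auth_service", "payment_gateway", "payment_client"]
```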

---

## OpenEnv Spec Compliance

| Requirement | Status |
|---|---|
| OpenEnv spec v1 (`openenv.yaml`) | βœ… |
| Typed Pydantic models (Action, Observation) | βœ… |
| `reset()` / `step()` / `state()` API | βœ… |
| 3+ tasks with difficulty range | βœ… (easy, medium, hard) |
| Programmatic graders (0.0–1.0) | βœ… (multi-dimensional rubric) |
| Meaningful reward function | βœ… (dense, not sparse) |
| Baseline inference script | βœ… (`inference.py` at root) |
| OpenAI client for LLM calls | βœ… |
| `[START]/[STEP]/[END]` stdout format | βœ… |
| Dockerfile builds and runs | βœ… |
| HF Space deploys and responds | βœ… |
| `openenv validate` passes | βœ… |

---

## Hackathon Submission

- **HF Space**: [yadnyeshkolte/api-debug-env](https://huggingface.co/spaces/yadnyeshkolte/api-debug-env)
- **GitHub**: [yadnyeshkolte/openenv-task](https://github.com/yadnyeshkolte/openenv-task)
- **Hackathon**: Meta PyTorch OpenEnv Hackathon Γ— Scaler School of Technology