Ishangtxl's picture
Sync from GitHub 6cadce4
c4cb463 verified
metadata
title: Legacy COBOL Migration Workbench
emoji: 🧾
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
  - openenv
  - cobol
  - reinforcement-learning

Legacy COBOL Migration Workbench

Legacy COBOL Migration Workbench is an OpenEnv environment for training an LLM to act like a legacy modernization engineer. The agent receives a migration ticket, inspects COBOL programs and copybooks through tools, writes Python, runs visible tests, studies structured diffs, repairs its draft, and submits code scored on hidden and fresh tests.

The target capability is not one-shot translation. Real modernization depends on fixed-width records, copybooks, implied decimals, OCCURS tables, level-88 condition names, branch precedence, and exact output formatting. This environment turns that workflow into a partially observable, tool-mediated training loop.

Submission Materials

Environment

Each episode starts with a partial ticket. The agent can discover details through MCP tools:

  • read_cobol_file
  • read_copybook
  • parse_copybook_layout
  • inspect_business_rules
  • write_python_solution
  • run_visible_tests
  • inspect_diff
  • submit_final

The submitted Python code must define:

def migrate(input_record: str) -> str:
    ...

Episodes are capped at 24 tool steps. Visible tests are available for debugging, but hidden and fresh cases are withheld until final scoring.

Reward

final_reward =
  0.55 * hidden_correctness
+ 0.15 * fresh_correctness
+ 0.10 * interface_contract
+ 0.08 * type_and_layout_fidelity
+ 0.07 * anti_hardcoding
+ 0.05 * safety

The reward is correctness-heavy, but it also penalizes unsafe code, broken interfaces, layout drift, and visible-case hardcoding.

Task Families

Task Difficulty COBOL concepts Main failure modes
fixed_width_customer easy PIC X, padding/truncation, status mapping trimmed spaces, lost ZIP leading zeros, bad output width
decimal_copybook_payroll medium copybook layout, implied decimals, level-88 bonus flag float drift, wrong rounding, wrong fixed-width net pay
claims_eligibility_branching medium EVALUATE TRUE, branch precedence wrong first-match branch, boundary mistakes
account_status_level88 medium level-88 status conditions, signed amount treating condition names as variables, wrong precedence
date_normalization medium legacy YYMMDD windowing, validation wrong century window, over-rejecting legacy dates
invoice_occurs_totals hard multi-file INVTOTAL.cbl/TAXRATE.cbl, OCCURS, copybook tax-code metadata wrong stride, ignoring tax-code lookup, overfitting visible invoice IDs

inspect_business_rules exposes agent-facing hints only. Exact reference behavior stays internal to hidden and fresh tests.

Results

The current training run uses Hugging Face TRL LoRA SFT on Qwen/Qwen3-14B with 15 oracle and repair examples. Evaluation uses the same OpenEnv rollout harness before and after training.

Policy Role Mean public score Accepted tasks
deterministic identity deterministic baseline 0.1500 0 / 6
deterministic blank width deterministic baseline 0.1767 0 / 6
base Qwen/Qwen3-14B model before SFT 0.5320 2 / 6
trained Qwen/Qwen3-14B LoRA SFT model after SFT 0.7971 4 / 6
oracle-model sanity check with reference solutions 1.0000 6 / 6

Qwen3-14B training reward evidence dashboard

This dashboard summarizes the before/after reward evidence for the Qwen3-14B LoRA SFT run.

Qwen3-14B LoRA SFT loss

The SFT run reduced loss from 1.1350 to 0.1924 and improved mean token accuracy from 0.7938 to 0.9483.

Qwen3-14B reward before and after SFT

The trained checkpoint improved mean public reward from 0.5320 to 0.7971.

Qwen3-14B accepted tasks before and after SFT

Accepted tasks improved from 2 / 6 to 4 / 6.

Training Artifacts

Kept in the submission root:

  • outputs/training/sft_run_metadata.json: completed-run metadata for the Qwen3-14B LoRA SFT run
  • outputs/training/sft_loss.csv: real training loss and token-accuracy history
  • outputs/training/qwen3_14b_training_evidence_summary.json: compact extracted training/eval summary
  • outputs/training/hf_job_qwen3_14b_logs.txt: raw HF Jobs log for provenance
  • outputs/evals/base_qwen3_14b_all_tasks.json: before-training rollout
  • outputs/evals/score_summary.json: judge-facing summary with deterministic, base-model, trained-model, and oracle rows
  • plots/qwen3_14b_training_reward_evidence_dashboard.png
  • plots/qwen3_14b_loss_curve.png
  • plots/qwen3_14b_reward_before_after.png
  • plots/qwen3_14b_accepted_before_after.png

The old dry-run sft_loss.svg scaffold is intentionally not included as judge-facing training evidence.

Historical Artifacts

Older Azure gpt-5.4-mini rollouts are archived under outputs/evals/historical/. They were produced before the invoice task was hardened into a multi-file tax-code task, so they are preserved for context but excluded from the current score summary.

Run Locally

python -m venv .venv
.venv/bin/pip install -e ".[dev]"
PYTHONDONTWRITEBYTECODE=1 PYTHONPATH=. .venv/bin/pytest tests -q -p no:cacheprovider

Run the server:

PYTHONPATH=. .venv/bin/python -m uvicorn legacy_cobol_env.server.app:app --host 127.0.0.1 --port 8000
curl -sS http://127.0.0.1:8000/health
curl -sS http://127.0.0.1:8000/schema

Validate the OpenEnv package:

.venv/bin/openenv validate --verbose

Regenerate the score summary:

.venv/bin/python -m legacy_cobol_env.eval.run_evidence_report

Run the root inference script:

API_BASE_URL="https://..." MODEL_NAME="..." HF_TOKEN="..." .venv/bin/python inference.py --max-repairs 1

The inference script emits [START], one [STEP] per task, and [END].

Safety Note

Candidate code is checked for common unsafe imports and builtins, executed in a temporary subprocess with cleared environment variables, and bounded by a timeout. This is layered mitigation for a hackathon environment, not a complete secure sandbox.