title: Legacy COBOL Migration Workbench
emoji: 🧾
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
app_port: 8000
base_path: /web
tags:
- openenv
- cobol
- reinforcement-learning
Legacy COBOL Migration Workbench
Legacy COBOL Migration Workbench is an OpenEnv environment for training an LLM to act like a legacy modernization engineer. The agent receives a migration ticket, inspects COBOL programs and copybooks through tools, writes Python, runs visible tests, studies structured diffs, repairs its draft, and submits code scored on hidden and fresh tests.
The target capability is not one-shot translation. Real modernization depends on fixed-width records, copybooks, implied decimals, OCCURS tables, level-88 condition names, branch precedence, and exact output formatting. This environment turns that workflow into a partially observable, tool-mediated training loop.
Submission Materials
- Mini-blog: blog.md
- Hugging Face Space: https://huggingface.co/spaces/Ishangtxl/mainframe-modernization-openenv
- Training run notebook: https://colab.research.google.com/drive/1XGcw8Xkcyx1byYqSu5jpF9UWPIQWWBCi?usp=sharing
- Training evidence: outputs/training/sft_run_metadata.json
- Training log: outputs/training/hf_job_qwen3_14b_logs.txt
- Score summary: outputs/evals/score_summary.json
Environment
Each episode starts with a partial ticket. The agent can discover details through MCP tools:
- read_cobol_file
- read_copybook
- parse_copybook_layout
- inspect_business_rules
- write_python_solution
- run_visible_tests
- inspect_diff
- submit_final
The submitted Python code must define:
def migrate(input_record: str) -> str:
    ...
Episodes are capped at 24 tool steps. Visible tests are available for debugging, but hidden and fresh cases are withheld until final scoring.
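To make the `migrate` contract concrete, here is a minimal sketch of a conforming solution. The record layout, the status mapping, and the output width are hypothetical and chosen only for illustration; they do not correspond to any task in the environment.

```python
# Hypothetical layout, for illustration only (not one of the environment's tasks):
#   input  cols 0-9   customer id   (PIC X(10))
#   input  cols 10-14 ZIP code      (PIC X(5))
#   input  col  15    status code   (PIC X)
#   output id (10) + ZIP (5) + status label (8) = 23 characters
INPUT_WIDTH = 16
STATUS_LABELS = {"A": "ACTIVE", "C": "CLOSED"}  # assumed mapping


def migrate(input_record: str) -> str:
    # Pad short records and truncate long ones so the slices below are always valid.
    record = input_record.ljust(INPUT_WIDTH)[:INPUT_WIDTH]
    customer_id = record[0:10]
    zip_code = record[10:15]  # kept as text so leading zeros survive
    status = STATUS_LABELS.get(record[15], "UNKNOWN")
    # Every output field is padded to its fixed width.
    return customer_id + zip_code + status.ljust(8)
```

The function takes one fixed-width record as a string and returns one fixed-width record as a string, which is the only interface the environment requires. Treating the ZIP code as text rather than an integer avoids one of the failure modes listed under Task Families.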
Reward
final_reward =
0.55 * hidden_correctness
+ 0.15 * fresh_correctness
+ 0.10 * interface_contract
+ 0.08 * type_and_layout_fidelity
+ 0.07 * anti_hardcoding
+ 0.05 * safety
The reward is correctness-heavy, but it also penalizes unsafe code, broken interfaces, layout drift, and visible-case hardcoding.
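The weighted sum above can be sketched in a few lines. The weights are taken from the formula; the function name, the dictionary layout, and the assumption that each component is a score in [0, 1] are illustrative and may not match the environment's internal code.

```python
# Weights copied from the reward formula; they sum to 1.0.
WEIGHTS = {
    "hidden_correctness": 0.55,
    "fresh_correctness": 0.15,
    "interface_contract": 0.10,
    "type_and_layout_fidelity": 0.08,
    "anti_hardcoding": 0.07,
    "safety": 0.05,
}


def final_reward(components: dict[str, float]) -> float:
    """Combine per-component scores (assumed to lie in [0, 1]) into one reward."""
    missing = WEIGHTS.keys() - components.keys()
    if missing:
        raise ValueError(f"missing reward components: {sorted(missing)}")
    return sum(weight * components[name] for name, weight in WEIGHTS.items())
```

Under this reading, a submission that scores 1.0 on every component except hidden_correctness would receive 0.45, which shows how heavily the hidden tests dominate the total.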
Task Families
| Task | Difficulty | COBOL concepts | Main failure modes |
|---|---|---|---|
| fixed_width_customer | easy | PIC X, padding/truncation, status mapping | trimmed spaces, lost ZIP leading zeros, bad output width |
| decimal_copybook_payroll | medium | copybook layout, implied decimals, level-88 bonus flag | float drift, wrong rounding, wrong fixed-width net pay |
| claims_eligibility_branching | medium | EVALUATE TRUE, branch precedence | wrong first-match branch, boundary mistakes |
| account_status_level88 | medium | level-88 status conditions, signed amount | treating condition names as variables, wrong precedence |
| date_normalization | medium | legacy YYMMDD windowing, validation | wrong century window, over-rejecting legacy dates |
| invoice_occurs_totals | hard | multi-file INVTOTAL.cbl/TAXRATE.cbl, OCCURS, copybook tax-code metadata | wrong stride, ignoring tax-code lookup, overfitting visible invoice IDs |
inspect_business_rules exposes agent-facing hints only. Exact reference behavior stays internal to hidden and fresh tests.
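Several failure modes in the table, notably float drift and wrong rounding, come from handling implied-decimal fields with binary floats. The sketch below shows one way to read and write such fields with `decimal.Decimal`. The helper names are hypothetical, the fields are assumed to be unsigned, and the half-up rounding mode is an assumption; whether a given task rounds or truncates is determined by its COBOL source.

```python
from decimal import Decimal, ROUND_HALF_UP


def parse_implied_decimal(field: str, scale: int) -> Decimal:
    """Read an unsigned field such as PIC 9(5)V99 (scale=2): digits only, no stored point."""
    if not (field.isascii() and field.isdigit()):
        raise ValueError(f"non-numeric field: {field!r}")
    # Shift the decimal point left by `scale` digits without going through float.
    return Decimal(int(field)).scaleb(-scale)


def format_implied_decimal(value: Decimal, width: int, scale: int) -> str:
    """Write a non-negative value back as zero-padded digits with an implied point."""
    if value < 0:
        raise ValueError("this sketch handles unsigned fields only")
    quantum = Decimal(1).scaleb(-scale)  # e.g. Decimal("0.01") for scale=2
    rounded = value.quantize(quantum, rounding=ROUND_HALF_UP)  # assumed rounding mode
    digits = str(int(rounded.scaleb(scale)))
    if len(digits) > width:
        raise ValueError(f"value {value} does not fit in {width} digits")
    return digits.zfill(width)
```

For example, `parse_implied_decimal("0123450", 2)` yields `Decimal("1234.50")`, and formatting `Decimal("1234.505")` into a nine-digit field with two implied decimals yields `"000123451"`.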
Results
The current training run uses Hugging Face TRL LoRA SFT on Qwen/Qwen3-14B with 15 oracle and repair examples. Evaluation uses the same OpenEnv rollout harness before and after training.
| Policy | Role | Mean public score | Accepted tasks |
|---|---|---|---|
| deterministic identity | deterministic baseline | 0.1500 | 0 / 6 |
| deterministic blank width | deterministic baseline | 0.1767 | 0 / 6 |
| base Qwen/Qwen3-14B | model before SFT | 0.5320 | 2 / 6 |
| trained Qwen/Qwen3-14B LoRA SFT | model after SFT | 0.7971 | 4 / 6 |
| oracle-model | sanity check with reference solutions | 1.0000 | 6 / 6 |
The dashboard in plots/qwen3_14b_training_reward_evidence_dashboard.png summarizes the before/after reward evidence for the Qwen3-14B LoRA SFT run:
The SFT run reduced loss from 1.1350 to 0.1924 and improved mean token accuracy from 0.7938 to 0.9483.
The trained checkpoint improved mean public reward from 0.5320 to 0.7971.
Accepted tasks improved from 2 / 6 to 4 / 6.
Training Artifacts
Kept in the submission root:
- outputs/training/sft_run_metadata.json: completed-run metadata for the Qwen3-14B LoRA SFT run
- outputs/training/sft_loss.csv: real training loss and token-accuracy history
- outputs/training/qwen3_14b_training_evidence_summary.json: compact extracted training/eval summary
- outputs/training/hf_job_qwen3_14b_logs.txt: raw HF Jobs log for provenance
- outputs/evals/base_qwen3_14b_all_tasks.json: before-training rollout
- outputs/evals/score_summary.json: judge-facing summary with deterministic, base-model, trained-model, and oracle rows
- plots/qwen3_14b_training_reward_evidence_dashboard.png
- plots/qwen3_14b_loss_curve.png
- plots/qwen3_14b_reward_before_after.png
- plots/qwen3_14b_accepted_before_after.png
The old dry-run sft_loss.svg scaffold is intentionally not included as judge-facing training evidence.
Historical Artifacts
Older Azure gpt-5.4-mini rollouts are archived under outputs/evals/historical/. They were produced before the invoice task was hardened into a multi-file tax-code task, so they are preserved for context but excluded from the current score summary.
Run Locally
python -m venv .venv
.venv/bin/pip install -e ".[dev]"
PYTHONDONTWRITEBYTECODE=1 PYTHONPATH=. .venv/bin/pytest tests -q -p no:cacheprovider
Run the server:
PYTHONPATH=. .venv/bin/python -m uvicorn legacy_cobol_env.server.app:app --host 127.0.0.1 --port 8000
curl -sS http://127.0.0.1:8000/health
curl -sS http://127.0.0.1:8000/schema
Validate the OpenEnv package:
.venv/bin/openenv validate --verbose
Regenerate the score summary:
.venv/bin/python -m legacy_cobol_env.eval.run_evidence_report
Run the root inference script:
API_BASE_URL="https://..." MODEL_NAME="..." HF_TOKEN="..." .venv/bin/python inference.py --max-repairs 1
The inference script emits [START], one [STEP] per task, and [END].
Safety Note
Candidate code is checked for common unsafe imports and builtins, executed in a temporary subprocess with cleared environment variables, and bounded by a timeout. This is layered mitigation for a hackathon environment, not a complete secure sandbox.
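The layers described above can be sketched as follows. This is not the environment's actual implementation: the denylists, function names, and default timeout are assumptions made for illustration, and like the real checks it is a mitigation rather than a sandbox.

```python
import ast
import subprocess
import sys
import tempfile
from pathlib import Path

BLOCKED_MODULES = {"os", "subprocess", "socket", "shutil"}  # assumed denylist
BLOCKED_CALLS = {"eval", "exec", "open", "__import__"}      # assumed denylist


def find_violations(source: str) -> list[str]:
    """Statically flag blocked imports and direct calls to blocked builtins."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            names = []
        violations += [f"import {n}" for n in names if n.split(".")[0] in BLOCKED_MODULES]
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in BLOCKED_CALLS):
            violations.append(f"call {node.func.id}")
    return violations


def run_candidate(source: str, input_record: str, timeout_s: float = 5.0) -> str:
    """Run migrate() from `source` on one record in a throwaway subprocess."""
    problems = find_violations(source)
    if problems:
        raise ValueError(f"rejected candidate: {problems}")
    # Driver appended after the static check: feeds stdin to migrate(), prints the result.
    driver = "\nimport sys\nsys.stdout.write(migrate(sys.stdin.read()))\n"
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(source + driver, encoding="utf-8")
        result = subprocess.run(
            [sys.executable, "-I", str(script)],  # -I: isolated mode
            input=input_record,
            capture_output=True,
            text=True,
            env={},             # cleared environment variables
            cwd=workdir,
            timeout=timeout_s,  # raises subprocess.TimeoutExpired on overrun
            check=True,         # raises CalledProcessError on a non-zero exit
        )
    return result.stdout
```

A static denylist like this is easy to bypass (for example through attribute access or `importlib`), and an empty environment may prevent the interpreter from starting on some platforms, which is why this approach should be read as defense in depth rather than isolation.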



