Incident Command Center

OpenEnv · Multi-Agent · Long-Horizon · Professional-Task Simulation

🚨 The story in 2 minutes

When a real tech company has an outage, three people's phones buzz at once — a Triage engineer, an Investigator, and an Ops Manager. They have to cooperate under a ticking SLA clock: every action costs budget, and every wrong call costs real money (enterprise outages hurt ~3× more than free-tier ones).

We built a simulator of that war room — and we fine-tuned an LLM to run it as well as the human expert.

What is the environment?

Three specialist agents with different permissions resolve a live queue of 13 realistic tech incidents across 3 difficulty tiers.

| Role | Can do | Cannot do |
| --- | --- | --- |
| 🔍 Triage | Pull logs · check metrics · consult KB | Close a ticket |
| 🧪 Investigator | Apply a fix · roll back a deploy | Escalate or file a post-mortem |
| 👷 Ops Manager | Escalate · file post-mortem · close the ticket | Apply a code fix |

What did the agent learn?

Not "pick the right label." It learned a whole workflow — dig up clues, hand off to the right specialist, apply the correct fix, respect the SLA, file the post-mortem, close the ticket. The rubric makes every piece of that workflow visible as a named reward component, so you can see why the agent earned (or lost) points at every step.

Why it matters for the 3 hackathon themes

🤝 Theme #1 — Multi-Agent

Three distinct roles with non-overlapping permissions. Wrong-actor calls → -0.08. Correct handoff → +0.15. Cooperation is trained, not hard-coded.

⏱️ Theme #2 — Long-Horizon

Each episode runs 3–5 sequential incidents over 20–60 steps with a single ticking SLA clock. Big rewards (+0.80 × tier) only fire after clues → fix → post-mortem. Sparse and delayed by design.

🏢 Theme #3 — Professional World-Model

Real logs, metrics, KB articles, red-herring signals, customer tiers, SLA timers, revenue impact. Close an enterprise ticket wrong and it hurts ~3× what a free-tier one does (tier multipliers 1.8 vs 0.6).

↓ Keep scrolling for the headline numbers, training plots, ablation, and the full rubric. Or jump straight to the README or the blog post.

Resources & documentation

- 💻 **GitHub repository**: full source, tests, Dockerfile, CI-ready
- 🤗 **Hugging Face Space page**: repo view, build logs, discussions
- 🟢 **Live environment**: you are here; the OpenEnv endpoints are live
- 🎓 **Reproduce training (Colab T4)**: one-click notebook, ~1 h wall clock
- 📖 **README (Part 1 + Part 2)**: story overview plus the full technical deep-dive
- 📝 **Mini blog post**: the short writeup, a Markdown file on the HF Space and GitHub
- **Submission checklist**: every judging rule → where to find the evidence

Headline results

| Metric | Value | Detail |
| --- | --- | --- |
| SFT reward lift on hard tasks | +10.17 | vs the Qwen2.5-1.5B-Instruct base |
| Heuristic-policy match | Exact | SFT clones the demonstrator across every tier |
| Scale ablation (hard Δ) | 0.5B → 1.5B | +0.00 → +10.17: capacity matters |
| Training data | 680 rows | 24 heuristic rollouts · 3 epochs |

Environment at a glance

| Property | Value |
| --- | --- |
| Incidents in library | 13 (3 easy · 5 medium · 5 hard) |
| Specialist roles | 3 (triage · investigator · ops manager) |
| Reward components | 14+ (rubric-based, transparent) |
| Seeded reproducibility | Yes (default seed 20260425) |

1.5B SFT vs base (reference run)

| Task tier | Base reward | SFT reward | Δ |
| --- | --- | --- | --- |
| easy | -2.92 | -4.72 | -1.80 |
| medium | -4.00 | -0.87 | +3.13 |
| hard | -4.28 | +5.89 | +10.17 |

Numbers loaded live from summary_metrics.json committed alongside this Space.
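
To sanity-check the table, here is a minimal sketch of recomputing the Δ column from that file. The JSON key layout shown is an assumption, not the confirmed schema; adapt it to the file actually committed with the Space.

```python
# Hypothetical reader for summary_metrics.json: recompute delta = SFT - base
# per tier. The key layout ("easy" -> {"base": ..., "sft": ...}) is assumed.
import json

with open("summary_metrics.json") as f:
    metrics = json.load(f)

for tier in ("easy", "medium", "hard"):
    base, sft = metrics[tier]["base"], metrics[tier]["sft"]
    print(f"{tier:>6}: base {base:+.2f}  sft {sft:+.2f}  delta {sft - base:+.2f}")
```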

Training evidence

Committed artifacts from the reference training run (Qwen2.5-1.5B-Instruct, 8 episodes/task, 3 epochs) plus the Qwen2.5-0.5B-Instruct ablation. Click any plot to open it full-size.

- **Reward curve by policy (1.5B).** Mean episodic reward per task tier across Random / Heuristic / Base-LLM / SFT-LLM. SFT matches the heuristic demonstrator across every tier and outperforms the untuned base by +10.17 on hard incidents.
- **SFT training loss and token accuracy (1.5B).** Supervised loss collapses from ~2.84 → ~0.02 and next-token accuracy climbs from ~0.49 → ~0.99 over three epochs on 680 rollout rows.
- **Reward component decomposition (1.5B).** SFT reproduces the heuristic's positive components (clue_bonus, mitigation_correct, closure_correct, speed_bonus) while the base model stalls on step_cost and SLA penalties.
- **Reward curve by policy (0.5B ablation).** Same pipeline, smaller backbone. SFT improves by only +0.43 / +0.14 / +0.00 over base; the 0.5B model is too small to absorb the multi-step, role-gated policy. Scale is the story.

Raw files: summary_metrics.json · training_log.json · summary_metrics_qwen0p5b.json

Ablation: model scale matters for imitation learning

Same pipeline, same data schema — only the base-model size differs. The 0.5B model cannot absorb the expert policy; 1.5B matches it exactly.

| Model | Easy Δ | Medium Δ | Hard Δ | Heuristic match? |
| --- | --- | --- | --- | --- |
| Qwen2.5-0.5B-Instruct | +0.43 | +0.14 | +0.00 | No (stuck on step-cost) |
| Qwen2.5-1.5B-Instruct | -1.80 | +3.13 | +10.17 | Yes (exact match) |

Endpoints

Standard OpenEnv contract plus operational endpoints.
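
Below is a minimal sketch of one reset/step round-trip against those endpoints. The paths, payload fields, and base URL are assumptions in the spirit of an OpenEnv-style reset/step contract, not the confirmed schema; the README documents the real one.

```python
# Hedged sketch of one reset/step round-trip over HTTP. Endpoint paths
# ("/reset", "/step") and payload fields are assumed, not taken from the
# project's source; role and action names mirror the metadata block below.
import requests

BASE_URL = "http://localhost:8000"  # hypothetical local container URL

# Start a hard-tier episode with the documented default seed.
obs = requests.post(f"{BASE_URL}/reset",
                    json={"task": "hard", "seed": 20260425}).json()

# Triage opens the investigation by pulling logs.
step = requests.post(f"{BASE_URL}/step",
                     json={"actor": "triage_agent",
                           "action": "inspect_logs",
                           "args": {}}).json()
print(step["reward"], step["reward_components"])  # named rubric components
```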

Action space

`inspect_logs` · `inspect_metrics` · `consult_kb` · `negotiate_handoff` · `apply_fix` · `close_incident` · `escalate` · `rollback` · `submit_postmortem`

Each action is gated by the acting role; wrong-actor calls are penalised.
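
A compact way to picture the gate, as a sketch rather than the project's actual source: map each role to its allowed actions and charge the flat wrong-actor penalty quoted in Theme #1. Which roles may call `negotiate_handoff` is my assumption here.

```python
# Illustrative role gate (not the environment's real implementation).
# Role and action names come from the metadata below; letting every role
# call negotiate_handoff is an assumption.
ALLOWED_ACTIONS = {
    "triage_agent": {"inspect_logs", "inspect_metrics", "consult_kb",
                     "negotiate_handoff"},
    "investigator_agent": {"apply_fix", "rollback", "negotiate_handoff"},
    "ops_manager_agent": {"escalate", "submit_postmortem", "close_incident",
                          "negotiate_handoff"},
}

WRONG_ACTOR_PENALTY = -0.08  # flat penalty quoted in Theme #1

def gate_action(actor: str, action: str) -> float:
    """Return the reward adjustment for attempting `action` as `actor`."""
    return 0.0 if action in ALLOWED_ACTIONS[actor] else WRONG_ACTOR_PENALTY
```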

Reward model

Composable rubric with anti-gaming safeguards. Every step returns a reward_components dictionary so training curves are interpretable. Closure rewards and SLA penalties are scaled by customer-tier multipliers:

free ×0.6 · standard ×1.0 · premium ×1.4 · enterprise ×1.8

| Component | Signal |
| --- | --- |
| `step_cost` | -0.04 per investigation step |
| `clue_reward` | +0.12 per new fact |
| `handoff_correct` | +0.15 |
| `mitigation_correct` | +0.35 |
| `closure_correct_base` | +0.8 × tier multiplier |
| `closure_wrong` | -1.1 × tier multiplier |

Full rubric (invalid-action, repeated-lookup, rollback-effective, post-mortem-logged, etc.) is documented in the README.
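
To make the tier scaling concrete, here is a small sketch of how closure rewards could be assembled into named components. The helper is hypothetical, but the constants are the ones in the table above.

```python
# Hypothetical helper showing the tier-scaled closure math from the table.
# The real environment returns an equivalent reward_components dict.
TIER_MULTIPLIER = {"free": 0.6, "standard": 1.0,
                   "premium": 1.4, "enterprise": 1.8}

def closure_components(correct: bool, customer_tier: str) -> dict[str, float]:
    mult = TIER_MULTIPLIER[customer_tier]
    if correct:
        return {"closure_correct_base": 0.8 * mult}
    return {"closure_wrong": -1.1 * mult}

# Wrong enterprise closure: -1.1 * 1.8 = -1.98; wrong free closure:
# -1.1 * 0.6 = -0.66. That 3x ratio is the "enterprise hurts ~3x more"
# claim from the intro.
print(closure_components(False, "enterprise"))  # {'closure_wrong': -1.98}
print(closure_components(False, "free"))        # {'closure_wrong': -0.66}
```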

Metadata

```json
{
  "name": "incident_command_center_env",
  "version": "3.0.0",
  "tasks": [
    "easy",
    "medium",
    "hard"
  ],
  "incidents_per_task": {
    "easy": 3,
    "medium": 5,
    "hard": 5
  },
  "actions": [
    "inspect_logs",
    "inspect_metrics",
    "consult_kb",
    "negotiate_handoff",
    "apply_fix",
    "close_incident",
    "escalate",
    "rollback",
    "submit_postmortem"
  ],
  "roles": [
    "triage_agent",
    "investigator_agent",
    "ops_manager_agent"
  ],
  "reward_model": {
    "step_cost_investigation": -0.04,
    "clue_reward": 0.12,
    "handoff_correct": 0.15,
    "mitigation_correct": 0.35,
    "closure_correct_base": 0.8,
    "closure_wrong": -1.1,
    "tier_multiplier": {
      "free": 0.6,
      "standard": 1.0,
      "premium": 1.4,
      "enterprise": 1.8
    }
  },
  "budgets": {
    "easy": 28,
    "medium": 54,
    "hard": 84
  },
  "sla_minutes": {
    "easy": 120,
    "medium": 210,
    "hard": 330
  }
}
```