Aria Evals Dashboard

Job	Schedule	Description	Status
shadow-review	`30 10 * * *`	Daily shadow review of agent outputs	Active
dashboard-export	`0 12 * * *`	Export data + deploy dashboard to Cloudflare Pages	Active
correlation-daily	`0 11 * * 1-5`	Weekday correlation engine run	Active
correlation-weekly	`30 11 * * 6`	Full weekly correlation analysis	Active
self-improvement	`0 11 * * *`	Auto-create issues from eval failures	Active

Self-Learning Pipeline — Technical Architecture

This system implements a closed-loop self-improvement cycle for autonomous AI agent operations. Every agent output is captured and evaluated against human-calibrated rubrics by a cross-model LLM judge, and failures are automatically triaged into behavioral fix classes. Fixes are implemented in agent configuration, and the next evaluation cycle verifies the improvement. The loop runs daily without human intervention: humans set the judgment criteria, and the system enforces and improves against them autonomously.

Complete Pipeline Flow

📡 Phase 1: Output Collection
Every agent message across all Slack channels is captured with full context, metadata, and trigger information.

⚖️ Phase 2: Shadow Review
A cross-model LLM judge (GLM 5.1) evaluates each output against domain-specific, Tony-calibrated rubrics.

🔍 Phase 3: Failure Triage
Failures are grouped by root cause class, prioritized by instance count, and tracked in the self-improvement loop file.

🔧 Phase 4: Auto-Fix
The agent reads failure classes at session startup and implements behavioral fixes (AGENTS.md rules, SOUL.md guardrails, code changes).

✅ Phase 5: Verification
The next shadow review cycle verifies that fixes resolved the failure class. Verified fixes move to "Resolved"; new failures enter triage.
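The verification step amounts to a set comparison between consecutive cycles. The following is a minimal sketch, assuming each cycle is summarized as a mapping from failure class name to instance count. The function name, the result keys, and the class names "Permission Loop" and "Self-Talk" are hypothetical and not taken from the pipeline.

```python
def verify_cycle(open_classes: set[str], current_failures: dict[str, int]) -> dict[str, list[str]]:
    """Compare tracked failure classes against the latest shadow review.

    open_classes: classes that had a fix applied and await verification.
    current_failures: class name -> instance count in the newest cycle.
    """
    seen = {name for name, count in current_failures.items() if count > 0}
    return {
        # A fixed class with no new instances counts as verified.
        "resolved": sorted(open_classes - seen),
        # A fixed class that still produces failures stays open.
        "still_failing": sorted(open_classes & seen),
        # Classes that were never tracked before enter triage.
        "new": sorted(seen - open_classes),
    }


result = verify_cycle(
    {"Missing Product Judgment Gate", "Permission Loop"},
    {"Permission Loop": 2, "Self-Talk": 4},
)
```

A single clean cycle may not be enough evidence that a class is truly resolved; requiring several consecutive clean cycles would be a stricter variant of the same comparison.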

📊 MLflow — Observability Backbone

Every judge call, correlation analysis, and run metric is traced in MLflow. Dashboard data is exported from MLflow traces via search_traces() API. MLflow provides the historical memory that makes cycle-over-cycle comparison possible.

🔄 The cycle runs autonomously every 24 hours. Humans define judgment criteria through rubrics — the system enforces and improves against them.

Phase 1 — Output Collection

What It Does

Every message Warren posts to Slack — across all channels (client channels, #sales, #agent-warren, DMs) — is intercepted by an OpenClaw output-collector hook. The hook captures:

Full message text: Complete, untruncated

Channel context: Channel name, ID, type

Trigger message: What prompted the output

Timestamp + metadata: Thread, files, reactions
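To make the captured fields concrete, here is a rough sketch of one queue entry serialized as a single JSON object per line, matching the JSONL queue format listed under Technical Implementation. The class name, field names, and sample values are assumptions for illustration; the real hook's schema is not shown in this document.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class QueueEntry:
    # Field names are illustrative, not the hook's actual schema.
    text: str           # full, untruncated message text
    channel_name: str
    channel_id: str
    channel_type: str   # e.g. "public", "private", "dm"
    trigger: str        # the message that prompted the output
    timestamp: str      # ISO 8601
    metadata: dict = field(default_factory=dict)  # thread, files, reactions


def to_jsonl_line(entry: QueueEntry) -> str:
    """Serialize one entry as a single JSON line (newlines in text are escaped)."""
    return json.dumps(asdict(entry), ensure_ascii=False)


def append_entry(path: str, entry: QueueEntry) -> None:
    """Append one entry to the queue file, one JSON object per line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(to_jsonl_line(entry) + "\n")


entry = QueueEntry(
    text="Weekly recap: three deals moved to proposal stage.",
    channel_name="#sales",
    channel_id="C0000000",
    channel_type="public",
    trigger="Can you post the weekly recap?",
    timestamp="2025-01-06T10:30:00Z",
    metadata={"thread_ts": None, "files": [], "reactions": []},
)
line = to_jsonl_line(entry)
```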

Technical Implementation

Collection script: shadow-collect.py

Output format: JSONL (one entry per line)

Queue file: shadow-review-queue.jsonl

Noise filtering: Sub-agent chatter, NO_REPLY, stubs

Classification: Content-first (keyword → channel fallback)
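Content-first classification can be sketched as a keyword scan over the message text, with the channel consulted only when no keyword matches. The keyword lists, the channel map, and the default domain below are hypothetical placeholders; only the domain names come from the rubric list in this document.

```python
# Keyword lists and the channel map are hypothetical placeholders.
DOMAIN_KEYWORDS = {
    "sales-bd": ("pipeline", "prospect", "deal"),
    "product-scope": ("requirement", "scope", "feature"),
    "discovery": ("interview", "hypothesis", "outcome"),
}
CHANNEL_FALLBACK = {"#sales": "sales-bd"}
DEFAULT_DOMAIN = "behavioral"  # assumed default when nothing else matches


def classify(text: str, channel_name: str) -> str:
    """Pick a rubric domain from message content, then fall back to the channel."""
    lowered = text.lower()
    for domain, keywords in DOMAIN_KEYWORDS.items():
        if any(word in lowered for word in keywords):
            return domain
    return CHANNEL_FALLBACK.get(channel_name, DEFAULT_DOMAIN)
```

Plain substring matching is the simplest option but also matches inside longer words ("deal" in "ideal"), so a real classifier would probably match on word boundaries.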

Noise filter: Messages under 50 chars, sub-agent coordination, NO_REPLY responses, and system stubs are excluded automatically. Only substantive agent outputs enter the evaluation queue.
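The filter rules above could look roughly like this. The 50-character threshold and the NO_REPLY marker come from the text; the prefix markers used to detect sub-agent coordination and system stubs are assumptions, since the document does not say how those messages are recognized.

```python
MIN_CHARS = 50  # threshold stated in the text

# Hypothetical markers; the real detection rules are not described.
NOISE_PREFIXES = ("[subagent]", "[system]")


def is_noise(text: str) -> bool:
    """Return True when a message should be kept out of the evaluation queue."""
    stripped = text.strip()
    if stripped == "NO_REPLY":
        return True
    if stripped.lower().startswith(NOISE_PREFIXES):
        return True
    return len(stripped) < MIN_CHARS
```

For example, `is_noise("Short note.")` is true because of the length rule, while a full weekly recap passes through to the queue.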

Phase 2 — Shadow Review (LLM Judge)

Cross-Model Evaluation

Each collected output is evaluated by GLM 5.1 (open weights, served via Together AI), a different model from the agent's own LLM. This cross-model approach guards against the echo-chamber effect, in which a model tends to approve its own outputs.

Why GLM 5.1?

Open weights — full transparency, no black box
Strong reasoning without self-bias toward Claude/GPT outputs
Cost-effective at scale (106+ evaluations daily)
Consistent judgment characteristics across runs

Tony-Calibrated Rubrics

Rubrics aren't generic quality checks — they encode specific human judgment patterns extracted from Tony Lambatos's actual production corrections. Each rubric references real verdict IDs.

behavioral: Permission loops, self-talk, unused resources

process: Gate timing, coverage, displacement activity

sales-bd: Attribution, data accuracy, action items

effort-value: Effort proportionality, scope management

product-scope: Requirements fidelity, capability coverage

discovery: Cross-validation triangle, measurable outcomes

86-entry calibration corpus from Tony's real verdicts ensures rubric alignment with human judgment.
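A single judge call can be pictured as: build a prompt from the rubric and the agent output, send it to the judge model, and parse a structured verdict. The sketch below assumes a JSON reply format and takes the model client as an injected function (`call_model`), so it runs offline; the prompt wording, the reply schema, and the "ERROR" verdict are all illustrative rather than the pipeline's actual behavior.

```python
import json
from typing import Callable


def build_prompt(rubric: str, output_text: str) -> str:
    # Prompt wording is illustrative, not the pipeline's actual prompt.
    return (
        "You are reviewing an AI agent's message against a rubric.\n"
        f"Rubric:\n{rubric}\n\n"
        f"Agent output:\n{output_text}\n\n"
        'Reply with JSON: {"verdict": "PASS" or "FAIL", "reasoning": "..."}'
    )


def judge(output_text: str, rubric: str, call_model: Callable[[str], str]) -> dict:
    """Run one judge call; `call_model` wraps whatever API serves the judge model."""
    raw = call_model(build_prompt(rubric, output_text))
    try:
        result = json.loads(raw)
    except json.JSONDecodeError:
        result = None
    verdict = str(result.get("verdict", "")).upper() if isinstance(result, dict) else ""
    if verdict not in {"PASS", "FAIL"}:
        # Unparseable replies are surfaced as errors rather than guessed at.
        return {"verdict": "ERROR", "reasoning": raw}
    return {"verdict": verdict, "reasoning": result.get("reasoning", "")}


# Stand-in for the real model client, so the sketch runs without network access.
def fake_model(prompt: str) -> str:
    return '{"verdict": "fail", "reasoning": "Asked permission for a routine step."}'


verdict = judge("Should I go ahead and send it?", "No permission loops.", fake_model)
```

Injecting the client keeps the parsing logic testable and makes it easy to swap judge models, which matters when the whole point is that the judge differs from the agent's own model.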

Phase 3 — Self-Improvement Loop

1. Failure Classification

After each shadow review, failures are automatically grouped by root cause class — not individual instances. Example: 13 separate "jumped to execution without planning" failures → 1 failure class "Missing Product Judgment Gate." This prevents whack-a-mole and ensures fixes address patterns, not symptoms.
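Grouping by root cause can be sketched as a simple bucket operation, assuming each failed verdict already carries a root cause label. The `root_cause` field and the "Permission Loop" class name are hypothetical; how the pipeline assigns the label is not described here.

```python
from collections import defaultdict


def group_by_class(failures: list[dict]) -> dict[str, list[dict]]:
    """Group individual FAIL verdicts under their root cause class."""
    classes: dict[str, list[dict]] = defaultdict(list)
    for failure in failures:
        classes[failure["root_cause"]].append(failure)
    return dict(classes)


# `root_cause` is a hypothetical field used for illustration.
failures = [
    {"id": 1, "root_cause": "Missing Product Judgment Gate"},
    {"id": 2, "root_cause": "Missing Product Judgment Gate"},
    {"id": 3, "root_cause": "Permission Loop"},
]
grouped = group_by_class(failures)
```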

2. Prioritized Fix Queue

Failure classes are prioritized by instance count — the class with the most failures gets fixed first. This maximizes pass rate improvement per fix. The queue lives in memory/self-improvement-loop.md — a structured file the agent reads at every session startup.
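Ordering by instance count is a one-line sort. The sketch below also renders the queue as a numbered list of the kind an agent could read at startup; the file layout is an assumption, since the actual structure of memory/self-improvement-loop.md is not shown.

```python
def prioritize(class_counts: dict[str, int]) -> list[tuple[str, int]]:
    """Order failure classes by instance count, largest first; ties break by name."""
    return sorted(class_counts.items(), key=lambda item: (-item[1], item[0]))


def render_queue(class_counts: dict[str, int]) -> str:
    """Render the queue as a numbered Markdown list (layout is an assumption)."""
    lines = ["# Fix Queue", ""]
    for rank, (name, count) in enumerate(prioritize(class_counts), start=1):
        lines.append(f"{rank}. {name} (instances: {count})")
    return "\n".join(lines) + "\n"


queue = prioritize(
    {"Permission Loop": 4, "Missing Product Judgment Gate": 13, "Self-Talk": 4}
)
```

Breaking ties by name keeps the queue order stable between runs, so the agent does not see a reshuffled file when counts are equal.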

3. Behavioral Mutation

Fixes are implemented as standing rules in the agent's configuration files (AGENTS.md, SOUL.md, BOOTSTRAP.md). These aren't code changes — they're behavioral directives that modify the agent's decision-making. Example: "Pre-Execution Gate" rule that blocks execution until decomposition is complete.

Key insight: The agent modifies its own behavioral configuration based on measured failures. This is not prompt engineering by humans — it's an autonomous feedback loop where the system identifies its own failure patterns and writes rules to prevent recurrence. Humans define what good looks like (rubrics); the system figures out how to get there.

Phase 4 — Correlation Engine

Cross-Source Intelligence

The correlation engine connects output quality with upstream intelligence quality. It cross-correlates:

Google Drive intake accuracy vs. output accuracy — did bad intake data produce bad outputs?
Stakeholder mention patterns — who is being discussed, how often, in what context?
Meeting-to-output chains — tracing a single decision from meeting transcript → memory → output
Topic drift detection — identifying when agent focus drifts from assigned priorities
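As an illustration of the first item, one simple way to test whether intake quality tracks output quality is a Pearson correlation over per-day scores. The document does not say which statistic the engine uses, so Pearson is a stand-in chosen here, and the series names and values are made up.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


# Made-up daily scores in [0, 1], one value per day for both series.
intake_accuracy = [0.9, 0.8, 0.6, 0.95, 0.7]
output_pass_rate = [0.85, 0.75, 0.5, 0.9, 0.7]
r = pearson(intake_accuracy, output_pass_rate)
```

A coefficient near 1 would suggest that bad intake days coincide with bad output days, but with a 7-day window the sample is small and the result should be read as a prompt for investigation rather than proof of cause.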

Scheduling

Daily (Mon–Fri): 0 11 * * 1-5

Weekly deep (Sat): 30 11 * * 6

Lookback: 7 days (daily) / 30 days (weekly)

Outputs: Correlations, alerts, insights

Lightweight daily runs catch immediate correlations. Full weekly analysis on Saturday processes the complete week's data with multi-hop chain tracing.
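The schedule and lookback settings can be combined into a small helper that picks the analysis window from the run date. This is a sketch under the assumption that the window ends on the run date itself; the function name is hypothetical.

```python
from datetime import date, timedelta


def lookback_window(run_date: date) -> tuple[date, date]:
    """Return (start, end) of the analysis window for a run on `run_date`.

    Saturday runs use the 30-day weekly lookback; Monday to Friday runs use 7 days.
    Cron's day-of-week 6 is Saturday, which is weekday() == 5 in Python.
    """
    weekday = run_date.weekday()  # Monday == 0 ... Sunday == 6
    if weekday == 5:
        days = 30
    elif weekday < 5:
        days = 7
    else:
        raise ValueError("no correlation run is scheduled on Sundays")
    return run_date - timedelta(days=days), run_date


start, end = lookback_window(date(2025, 1, 4))  # a Saturday
```

The differing day-of-week numbering between cron and Python is an easy source of off-by-one bugs, which is why the mapping is spelled out in the docstring.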

Phase 5 — Quality Gate (Pre-Delivery)

Real-Time Evaluation

The quality gate runs before delivery for high-stakes outputs (BD dailies, weekly recaps, client deliverables). It uses the same rubrics as the shadow review but operates synchronously: the output is evaluated before it reaches the channel.

Wired SOPs (5): BD daily, BD weekly recap, BD daily alert, product judgment, sprint kickoff. If the quality gate fails, the output is rejected and the agent must revise before sending.

Exit Codes

EXIT 0: PASS (deliver)

EXIT 1: FAIL (revise)

EXIT 2: ERROR (skip gate)
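The mapping from verdict to exit code might be implemented along these lines. The `evaluate` callable is a hypothetical stand-in for the rubric-based judge call; only the three exit codes and their meanings come from the list above.

```python
from typing import Callable

EXIT_PASS, EXIT_FAIL, EXIT_ERROR = 0, 1, 2


def gate_exit_code(output_text: str, evaluate: Callable[[str], str]) -> int:
    """Map a judge verdict to the gate's exit code.

    `evaluate` is a hypothetical callable returning "PASS" or "FAIL".
    Any exception or unexpected verdict maps to EXIT_ERROR (skip gate).
    """
    try:
        verdict = evaluate(output_text)
    except Exception:
        return EXIT_ERROR
    if verdict == "PASS":
        return EXIT_PASS
    if verdict == "FAIL":
        return EXIT_FAIL
    return EXIT_ERROR


# A command-line wrapper would pass this value to sys.exit().
code = gate_exit_code("Draft BD daily ...", lambda text: "PASS")
```

Treating errors as "skip gate" is a fail-open choice: a judge outage does not block delivery, but it also lets unreviewed output through, so exit code 2 is worth alerting on.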

MLflow — Observability & Experiment Tracking Backbone

MLflow is the central nervous system that connects every phase of the pipeline. It doesn't just log results — it provides the structured trace data that enables the dashboard, the self-improvement loop, and cross-cycle comparison. Without MLflow, each shadow review would be an isolated run with no memory of previous cycles.

Where MLflow Sits in Each Phase

⚖️ Shadow Review: Every judge call = @mlflow.trace span. Full run wrapped in parent span with pass/fail/domain metadata.

🔗 Correlation Engine: LLM analysis calls traced. Run metadata: intake files, stakeholders, correlations found.

📊 Dashboard Export: export-dashboard-data.py queries MLflow API → dashboard-data.json → Cloudflare Pages.

📈 Trend Analysis: Daily pass rates, domain breakdown, failure class counts, all derived from MLflow trace history.

🧪 Calibration: Agreement measurement between judge and Tony corpus, logged as experiment comparisons.

MLflow Trace Structure

shadow-review-run          ← parent span
├─ entries: 106
├─ pass_rate: 72%
├─ domain_breakdown: {process: ...}
│
├── shadow-review-judge    ← per-entry span
│   ├─ domain: "behavioral"
│   ├─ verdict: "PASS"
│   ├─ reasoning: "..."
│   └─ latency_ms: 3200
│
├── shadow-review-judge    ← next entry
│   ├─ domain: "process"
│   ├─ verdict: "FAIL"
│   └─ recommendation: "..."
│
└── ... (106 spans per run)
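The run-level fields on the parent span can be derived from the per-entry judge spans. The sketch below works on plain dictionaries that mirror the structure above rather than on real MLflow trace objects, so the field access is an assumption about how the exported data is shaped.

```python
from collections import Counter


def summarize_run(judge_spans: list[dict]) -> dict:
    """Derive run-level metrics from per-entry judge spans."""
    total = len(judge_spans)
    passed = sum(1 for span in judge_spans if span["verdict"] == "PASS")
    by_domain: dict[str, Counter] = {}
    for span in judge_spans:
        by_domain.setdefault(span["domain"], Counter())[span["verdict"]] += 1
    return {
        "entries": total,
        # Percentage rounded to one decimal place; 0.0 for an empty run.
        "pass_rate": round(100 * passed / total, 1) if total else 0.0,
        "domain_breakdown": {d: dict(c) for d, c in by_domain.items()},
    }


spans = [
    {"domain": "behavioral", "verdict": "PASS"},
    {"domain": "process", "verdict": "FAIL"},
    {"domain": "process", "verdict": "PASS"},
    {"domain": "sales-bd", "verdict": "PASS"},
]
summary = summarize_run(spans)
```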

Experiment

warren-evals

Experiment ID: 1 • All eval traces in one experiment

Tracking Server

localhost:5000

SQLite backend • MLflow 3.12 • systemd managed

Data Flow

Traces → JSON → Dashboard

search_traces() → dashboard-data.json → CF Pages

Technology Stack

Component	Technology	Purpose	Location
Agent Runtime	OpenClaw + Claude	Agent execution, memory, tool use	DGX Spark (local)
Judge Model	GLM 5.1 (Together AI)	Cross-model rubric evaluation	API (together.xyz)
Observability	MLflow 3.12	Trace logging per judge call, run-level metrics, experiment history, dashboard data source	localhost:5000 (SQLite)
Rubrics	YAML	Human judgment patterns → machine-evaluable criteria	`ops/evals/rubrics/`
Calibration Corpus	JSONL (86 entries)	Tony's real verdicts for rubric alignment	`ops/evals/datasets/`
Dashboard	Cloudflare Pages	Static HTML, auto-deployed from MLflow data export	vtkl-dashboard.pages.dev
Scheduling	Linux cron	Daily/weekly automation — no external dependencies	DGX Spark crontab
Intelligence Intake	Google Drive API + GWS SA	Meeting transcripts, voice notes, stakeholder docs	Service account: warren@vtkl.ai

Design Principles

🎯

Human Judgment, Machine Enforcement

Humans define quality through rubrics calibrated from real corrections. The system enforces those standards at scale, 24/7. Humans never need to review individual outputs — they set the bar, the machine holds it.

🔄

Pattern Fix, Not Instance Fix

Individual failures are never patched. The system groups failures by root cause, then implements a single behavioral rule that prevents the entire class. This means every fix improves multiple future outputs simultaneously.

🔍

Cross-Model Honesty

The agent (Claude) and the judge (GLM 5.1) are different models from different providers. This prevents self-approval bias and ensures evaluations reflect actual quality, not model-specific patterns.

Adversarial Testing & Calibration

Adversarial Test Set

A set of deliberately crafted failure cases that should always be caught — permission loops disguised as thoroughness, self-talk disguised as updates, cherry-picked reviews that look complete. The adversarial set validates that rubric changes don't introduce blind spots.

Detection rate: 100%

Entries: Calibrated against Tony verdicts
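Because every adversarial case is expected to fail, the detection rate is simply the share of cases the judge marks FAIL. A minimal sketch, with the judge passed in as a hypothetical callable that returns "PASS" or "FAIL":

```python
from typing import Callable


def detection_rate(cases: list[str], judge: Callable[[str], str]) -> float:
    """Fraction of adversarial cases the judge marks FAIL (every case should fail)."""
    if not cases:
        raise ValueError("adversarial set is empty")
    caught = sum(1 for case in cases if judge(case) == "FAIL")
    return caught / len(cases)


def missed_cases(cases: list[str], judge: Callable[[str], str]) -> list[str]:
    """Cases that slipped through, useful when a rubric change lowers the rate."""
    return [case for case in cases if judge(case) != "FAIL"]


# Toy judge that only catches messages asking for permission.
def toy_judge(text: str) -> str:
    return "FAIL" if "should i" in text.lower() else "PASS"


cases = ["Should I proceed with the export?", "Still thinking about the best approach."]
rate = detection_rate(cases, toy_judge)
```

Running this check after each rubric edit and requiring the rate to stay at 1.0 is one way to use the adversarial set as a regression test.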

Agreement Measurement

measure-agreement.py compares LLM judge verdicts against Tony's actual verdicts on the calibration corpus. This measures rubric fidelity — how well the machine-readable rubric captures what Tony would actually say.

Corpus size: 86 Tony verdicts

Purpose: Rubric drift detection
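The comparison can be sketched as a raw agreement rate over entries that both the judge and the human have scored. This is an illustration of the idea, not the contents of measure-agreement.py; the function name, the verdict IDs, and the result keys are hypothetical.

```python
def agreement(judge_verdicts: dict[str, str], human_verdicts: dict[str, str]) -> dict:
    """Compare judge and human verdicts on the entry IDs they share."""
    shared = sorted(judge_verdicts.keys() & human_verdicts.keys())
    if not shared:
        raise ValueError("no overlapping entries to compare")
    disagreements = [
        eid for eid in shared if judge_verdicts[eid] != human_verdicts[eid]
    ]
    return {
        "compared": len(shared),
        "agreement_rate": 1 - len(disagreements) / len(shared),
        "disagreements": disagreements,
    }


report = agreement(
    {"v1": "PASS", "v2": "FAIL", "v3": "FAIL", "v4": "PASS"},
    {"v1": "PASS", "v2": "FAIL", "v3": "PASS", "v4": "PASS"},
)
```

Tracking this rate over time is what makes drift visible: a drop after a rubric edit points at the edit, while a drop with no rubric change points at the judge model or the inputs. Raw agreement does not correct for chance, so a chance-adjusted statistic may be a useful complement on a corpus with imbalanced verdicts.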

The Target State: Autonomous Quality Convergence

The system is designed so that quality improves from cycle to cycle without human intervention in the loop. Each cycle: measure → identify pattern → fix pattern → verify fix → next cycle. As failure classes are resolved, the pass rate climbs. New failure classes that emerge from expanded coverage or stricter rubrics are triaged and fixed by the same loop.

The human role shifts from "reviewing agent outputs" to "defining what excellent looks like" — rubric design, calibration corpus curation, and adversarial test creation. The machine handles the enforcement, measurement, and behavioral adaptation at scale.
