Self-Learning Pipeline: Technical Architecture
This system implements a closed-loop self-improvement cycle for autonomous AI agent operations. Every agent output is captured and evaluated against human-calibrated rubrics by a cross-model LLM judge, and failures are automatically triaged into behavioral fix classes. Fixes are implemented in agent configuration, and the next evaluation cycle verifies the improvement. The loop runs daily without human intervention: humans set the judgment criteria, and the system enforces and improves against them autonomously.
Complete Pipeline Flow
Phase 1: Output Collection
Every agent message across all Slack channels is captured with full context, metadata, and trigger information.
↓
Phase 2: Shadow Review
A cross-model LLM judge (GLM 5.1) evaluates each output against domain-specific, Tony-calibrated rubrics.
↓
Phase 3: Failure Triage
Failures are grouped by root cause class, prioritized by instance count, and tracked in the self-improvement loop file.
↓
Phase 4: Auto-Fix
The agent reads failure classes at session startup and implements behavioral fixes (AGENTS.md rules, SOUL.md guardrails, code changes).
↓
Phase 5: Verification
The next shadow review cycle verifies that fixes resolved the failure class: verified fixes move to "Resolved", and new failures enter triage.
MLflow: Observability Backbone
Every judge call, correlation analysis, and run metric is traced in MLflow. Dashboard data is exported from MLflow traces via the search_traces() API. MLflow provides the historical memory that makes cycle-over-cycle comparison possible.
The cycle runs autonomously every 24 hours. Humans define judgment criteria through rubrics; the system enforces and improves against them.
Phase 1: Output Collection
What It Does
Every message Warren posts to Slack, across all channels (client channels, #sales, #agent-warren, DMs), is intercepted by an OpenClaw output-collector hook. The hook captures:
- Full message text: complete, untruncated
- Channel context: channel name, ID, type
- Trigger message: what prompted the output
- Timestamp + metadata: thread, files, reactions
Technical Implementation
- Collection script: shadow-collect.py
- Output format: JSONL (one entry per line)
- Queue file: shadow-review-queue.jsonl
- Noise filtering: sub-agent chatter, NO_REPLY, stubs
- Classification: content-first (keyword → channel fallback)
Noise filter: Messages under 50 chars, sub-agent coordination, NO_REPLY responses, and system stubs are excluded automatically. Only substantive agent outputs enter the evaluation queue.
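The filtering and content-first classification described above could be sketched roughly as follows. This is an illustration only, not the contents of shadow-collect.py: the entry fields (`text`, `channel`, `is_subagent`, `is_stub`), the keyword and channel maps, and the default domain are all assumptions. Only the 50-character threshold, the NO_REPLY marker, and the keyword-then-channel order come from the description.

```python
import json

MIN_CHARS = 50  # threshold stated in the noise-filter description

# Hypothetical lookup tables; the real keyword and channel maps are not documented here.
KEYWORD_DOMAINS = {"pipeline": "sales-bd", "sprint": "process"}
CHANNEL_DOMAINS = {"#sales": "sales-bd"}
DEFAULT_DOMAIN = "behavioral"  # assumed default when nothing matches


def is_substantive(entry: dict) -> bool:
    """Return True if a collected message should enter the review queue."""
    text = entry.get("text", "").strip()
    if len(text) < MIN_CHARS:
        return False
    if text.startswith("NO_REPLY"):
        return False
    if entry.get("is_subagent") or entry.get("is_stub"):
        return False
    return True


def classify_domain(entry: dict) -> str:
    """Content-first: try keywords in the text, then fall back to the channel."""
    text = entry.get("text", "").lower()
    for keyword, domain in KEYWORD_DOMAINS.items():
        if keyword in text:
            return domain
    return CHANNEL_DOMAINS.get(entry.get("channel"), DEFAULT_DOMAIN)


def to_jsonl(entries: list) -> str:
    """Serialize surviving entries as JSONL, one JSON object per line."""
    kept = [dict(e, domain=classify_domain(e)) for e in entries if is_substantive(e)]
    return "\n".join(json.dumps(e) for e in kept)
```

Putting the keyword check ahead of the channel lookup means a sprint discussion posted in #sales is still routed to the process rubric, which is presumably the point of a content-first rule.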
Phase 2: Shadow Review (LLM Judge)
Cross-Model Evaluation
Each collected output is evaluated by GLM 5.1 (open weights, served via Together AI), a different model from the agent's own LLM. This cross-model approach guards against the echo-chamber effect, in which a model tends to approve its own outputs.
Why GLM 5.1?
- Open weights: full transparency, no black box
- Strong reasoning without self-bias toward Claude/GPT outputs
- Cost-effective at scale (106+ evaluations daily)
- Consistent judgment characteristics across runs
Tony-Calibrated Rubrics
Rubrics aren't generic quality checks. They encode specific human judgment patterns extracted from Tony Lambatos's actual production corrections, and each rubric references real verdict IDs.
- behavioral: permission loops, self-talk, unused resources
- process: gate timing, coverage, displacement activity
- sales-bd: attribution, data accuracy, action items
- effort-value: effort proportionality, scope management
- product-scope: requirements fidelity, capability coverage
- discovery: cross-validation triangle, measurable outcomes
An 86-entry calibration corpus drawn from Tony's real verdicts keeps the rubrics aligned with human judgment.
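A rubric-driven judge call might look something like the sketch below. The rubric shape (`domain` plus a list of `criteria`), the JSON reply format, and the function names are assumptions made for illustration. `call_model` is a placeholder for whatever client actually sends the prompt to GLM 5.1, so no specific provider API is implied.

```python
import json
from typing import Callable


def build_judge_prompt(rubric: dict, output_text: str) -> str:
    """Render a rubric and an agent output into a single judge prompt."""
    criteria = "\n".join(f"- {c}" for c in rubric["criteria"])
    return (
        f"Domain: {rubric['domain']}\n"
        f"Criteria:\n{criteria}\n\n"
        f"Output to evaluate:\n{output_text}\n\n"
        'Reply with JSON: {"verdict": "PASS" or "FAIL", "reasoning": "..."}'
    )


def parse_verdict(raw: str) -> dict:
    """Parse the judge reply; anything malformed is reported as ERROR."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return {"verdict": "ERROR", "reasoning": "unparseable judge reply"}
    if not isinstance(data, dict) or data.get("verdict") not in ("PASS", "FAIL"):
        return {"verdict": "ERROR", "reasoning": "missing or invalid verdict"}
    return {"verdict": data["verdict"], "reasoning": data.get("reasoning", "")}


def judge(output_text: str, rubric: dict, call_model: Callable[[str], str]) -> dict:
    """Evaluate one output against one rubric using an injected model client."""
    result = parse_verdict(call_model(build_judge_prompt(rubric, output_text)))
    result["domain"] = rubric["domain"]
    return result
```

Injecting `call_model` keeps the judge model swappable and lets the parsing logic be tested with a stub instead of a live API call.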
Phase 3: Self-Improvement Loop
1. Failure Classification
After each shadow review, failures are automatically grouped by root cause class, not by individual instance. Example: 13 separate "jumped to execution without planning" failures → 1 failure class, "Missing Product Judgment Gate." This prevents whack-a-mole and ensures fixes address patterns, not symptoms.
2. Prioritized Fix Queue
Failure classes are prioritized by instance count: the class with the most failures gets fixed first. This maximizes pass rate improvement per fix. The queue lives in memory/self-improvement-loop.md, a structured file the agent reads at every session startup.
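The grouping and ordering steps can be illustrated with a few lines of Python. This sketch assumes each failure record already carries a `failure_class` label (how the pipeline assigns that label is not shown), and the function name is hypothetical.

```python
from collections import Counter


def prioritize_failure_classes(failures: list) -> list:
    """Group failures by class and return (class, count) pairs, largest first."""
    counts = Counter(f["failure_class"] for f in failures)
    # Sort by count descending, then by name so ties come out in a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
```

With 13 failures labeled "Missing Product Judgment Gate" and a handful in any other class, the first element of the result is `("Missing Product Judgment Gate", 13)`, which is the class the agent would fix first.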
3. Behavioral Mutation
Fixes are implemented as standing rules in the agent's configuration files (AGENTS.md, SOUL.md, BOOTSTRAP.md). These aren't code changes; they're behavioral directives that modify the agent's decision-making. Example: a "Pre-Execution Gate" rule that blocks execution until decomposition is complete.
Key insight: The agent modifies its own behavioral configuration based on measured failures. This is not prompt engineering by humans; it's an autonomous feedback loop in which the system identifies its own failure patterns and writes rules to prevent recurrence. Humans define what good looks like (rubrics); the system figures out how to get there.
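One way to write a standing rule into a configuration file without duplicating it on later cycles is sketched below. The HTML-comment marker convention, the rule ID, and the function name are assumptions made for illustration; the source does not describe how rules are actually formatted inside AGENTS.md.

```python
from pathlib import Path


def add_standing_rule(config_path: Path, rule_id: str, rule_text: str) -> bool:
    """Append a rule unless one with the same ID exists. Returns True if written."""
    marker = f"<!-- rule:{rule_id} -->"  # hypothetical marker used to detect duplicates
    existing = config_path.read_text(encoding="utf-8") if config_path.exists() else ""
    if marker in existing:
        return False
    # Make sure the new rule starts on its own line.
    separator = "" if existing.endswith("\n") or not existing else "\n"
    config_path.write_text(
        f"{existing}{separator}{marker}\n{rule_text}\n", encoding="utf-8"
    )
    return True
```

For example, `add_standing_rule(Path("AGENTS.md"), "pre-execution-gate", "Block execution until decomposition is complete.")` writes the rule once and returns `False` on any repeat call, so re-running the loop does not stack up copies of the same directive.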
Phase 4: Correlation Engine
Cross-Source Intelligence
The correlation engine connects output quality with upstream intelligence quality. It cross-correlates:
- Google Drive intake accuracy vs. output accuracy: did bad intake data produce bad outputs?
- Stakeholder mention patterns: who is being discussed, how often, and in what context?
- Meeting-to-output chains: tracing a single decision from meeting transcript → memory → output
- Topic drift detection: identifying when agent focus drifts from assigned priorities
Scheduling
- Daily (Mon-Fri): 0 11 * * 1-5
- Weekly deep (Sat): 30 11 * * 6
- Lookback: 7 days (daily) / 30 days (weekly)
- Outputs: correlations, alerts, insights
Lightweight daily runs catch immediate correlations. Full weekly analysis on Saturday processes the complete week's data with multi-hop chain tracing.
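The schedule and lookback settings above imply a simple rule for choosing the analysis window on any given day. The following sketch shows that rule under stated assumptions: the function name is hypothetical, and treating the window as "today minus N days through today" is a guess at how the lookback is applied.

```python
from datetime import date, timedelta
from typing import Optional, Tuple


def run_window(today: date) -> Optional[Tuple[str, date, date]]:
    """Return (mode, start, end) for the run scheduled on `today`, or None."""
    weekday = today.weekday()  # Monday is 0, Sunday is 6
    if weekday <= 4:
        mode, days = "daily", 7    # cron: 0 11 * * 1-5
    elif weekday == 5:
        mode, days = "weekly", 30  # cron: 30 11 * * 6
    else:
        return None                # no run is scheduled on Sunday
    return mode, today - timedelta(days=days), today
```

Calling `run_window(date(2024, 1, 6))`, a Saturday, selects the weekly mode with a 30-day window, while any weekday selects the 7-day daily window.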
Phase 5: Quality Gate (Pre-Delivery)
Real-Time Evaluation
The quality gate runs before delivery for high-stakes outputs (BD dailies, weekly recaps, client deliverables). It uses the same rubrics as the shadow review but operates synchronously: the output is evaluated before it reaches the channel.
Wired SOPs (5): BD daily, BD weekly recap, BD daily alert, product judgment, sprint kickoff. If the quality gate fails, the output is rejected and the agent must revise before sending.
Exit Codes
- EXIT 0: PASS → deliver
- EXIT 1: FAIL → revise
- EXIT 2: ERROR → skip gate
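The exit-code contract can be expressed as a small wrapper around the evaluation call. In this sketch, `evaluate` is a placeholder for the real rubric evaluation and is assumed to return the string "PASS" or "FAIL". Mapping exceptions and unexpected verdicts to exit code 2 is an interpretation of "ERROR → skip gate", not a documented behavior.

```python
from typing import Callable

EXIT_PASS, EXIT_FAIL, EXIT_ERROR = 0, 1, 2


def gate_exit_code(output_text: str, evaluate: Callable[[str], str]) -> int:
    """Map a pre-delivery evaluation onto the gate's exit codes."""
    try:
        verdict = evaluate(output_text)
    except Exception:
        return EXIT_ERROR  # judge unavailable: the caller skips the gate
    if verdict == "PASS":
        return EXIT_PASS   # deliver
    if verdict == "FAIL":
        return EXIT_FAIL   # revise before sending
    return EXIT_ERROR      # unexpected verdict: treat as a gate error
```

A command-line wrapper would typically end with `sys.exit(gate_exit_code(draft, evaluate))`. Using a separate code for errors makes the gate fail open: an outage in the judge does not block delivery, but it also cannot be mistaken for a PASS.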
MLflow: Observability & Experiment Tracking Backbone
MLflow is the central nervous system that connects every phase of the pipeline. It doesn't just log results; it provides the structured trace data that enables the dashboard, the self-improvement loop, and cross-cycle comparison. Without MLflow, each shadow review would be an isolated run with no memory of previous cycles.
Where MLflow Sits in Each Phase
- Shadow Review: every judge call is an @mlflow.trace span. The full run is wrapped in a parent span with pass/fail/domain metadata.
- Correlation Engine: LLM analysis calls are traced. Run metadata: intake files, stakeholders, correlations found.
- Dashboard Export: export-dashboard-data.py queries the MLflow API → dashboard-data.json → Cloudflare Pages.
- Trend Analysis: daily pass rates, domain breakdown, and failure class counts are all derived from MLflow trace history.
- Calibration: agreement measurement between the judge and the Tony corpus is logged as experiment comparisons.
MLflow Trace Structure
shadow-review-run (parent span)
├─ entries: 106
├─ pass_rate: 72%
├─ domain_breakdown: {process: ...}
│
├── shadow-review-judge (per-entry span)
│   ├─ domain: "behavioral"
│   ├─ verdict: "PASS"
│   ├─ reasoning: "..."
│   └─ latency_ms: 3200
│
├── shadow-review-judge (next entry)
│   ├─ domain: "process"
│   ├─ verdict: "FAIL"
│   └─ recommendation: "..."
│
└── ... (106 spans per run)
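The run-level attributes on the parent span (entries, pass_rate, domain_breakdown) are aggregates of the per-entry verdicts. A dependency-free sketch of that aggregation follows. The record shape and the function name are assumptions, and the MLflow calls that would attach these values to the span are deliberately left out.

```python
def summarize_run(verdicts: list) -> dict:
    """Aggregate per-entry verdicts into run-level attributes."""
    total = len(verdicts)
    passed = sum(1 for v in verdicts if v["verdict"] == "PASS")
    by_domain: dict = {}
    for v in verdicts:
        stats = by_domain.setdefault(v["domain"], {"pass": 0, "fail": 0})
        stats["pass" if v["verdict"] == "PASS" else "fail"] += 1
    return {
        "entries": total,
        # Whole-number percentage; an empty run is reported as 0.
        "pass_rate": round(100 * passed / total) if total else 0,
        "domain_breakdown": by_domain,
    }
```

Computing the summary from the same records that become child spans keeps the parent-span numbers consistent with what a later trace query would recount.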
- Experiment: warren-evals (Experiment ID: 1; all eval traces in one experiment)
- Tracking Server: localhost:5000 (SQLite backend, MLflow 3.12, systemd managed)
- Data Flow: Traces → JSON → Dashboard (search_traces() → dashboard-data.json → CF Pages)
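The last hop of this data flow, turning trace records into a JSON file for the static dashboard, might look like the sketch below. It assumes the traces have already been flattened into plain dictionaries with a `date` and a `verdict`. That row shape and the `daily_pass_rate` payload key are hypothetical, and neither is MLflow's own schema nor the actual format of dashboard-data.json.

```python
import json
from collections import defaultdict
from pathlib import Path


def build_dashboard_payload(rows: list) -> dict:
    """Compute a per-day pass rate from flattened trace rows."""
    per_day = defaultdict(lambda: {"pass": 0, "total": 0})
    for row in rows:
        day = per_day[row["date"]]
        day["total"] += 1
        if row["verdict"] == "PASS":
            day["pass"] += 1
    return {
        "daily_pass_rate": [
            {"date": d, "pass_rate": round(100 * s["pass"] / s["total"], 1)}
            for d, s in sorted(per_day.items())  # ISO dates sort chronologically
        ]
    }


def export_dashboard(rows: list, out_path: Path) -> None:
    """Write the payload to disk for the static site to pick up."""
    payload = build_dashboard_payload(rows)
    out_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
```

Keeping the aggregation separate from the file write makes the trend calculation testable without a tracking server or a filesystem.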
Technology Stack
| Component | Technology | Purpose | Location |
| --- | --- | --- | --- |
| Agent Runtime | OpenClaw + Claude | Agent execution, memory, tool use | DGX Spark (local) |
| Judge Model | GLM 5.1 (Together AI) | Cross-model rubric evaluation | API (together.xyz) |
| Observability | MLflow 3.12 | Trace logging per judge call, run-level metrics, experiment history, dashboard data source | localhost:5000 (SQLite) |
| Rubrics | YAML | Human judgment patterns → machine-evaluable criteria | ops/evals/rubrics/ |
| Calibration Corpus | JSONL (86 entries) | Tony's real verdicts for rubric alignment | ops/evals/datasets/ |
| Dashboard | Cloudflare Pages | Static HTML, auto-deployed from MLflow data export | vtkl-dashboard.pages.dev |
| Scheduling | Linux cron | Daily/weekly automation, no external dependencies | DGX Spark crontab |
| Intelligence Intake | Google Drive API + GWS SA | Meeting transcripts, voice notes, stakeholder docs | Service account: warren@vtkl.ai |
Design Principles
Human Judgment, Machine Enforcement
Humans define quality through rubrics calibrated from real corrections. The system enforces those standards at scale, 24/7. Humans never need to review individual outputs: they set the bar, and the machine holds it.
Pattern Fix, Not Instance Fix
Individual failures are never patched. The system groups failures by root cause, then implements a single behavioral rule that prevents the entire class. This means every fix improves multiple future outputs simultaneously.
Cross-Model Honesty
The agent (Claude) and the judge (GLM 5.1) are different models from different providers. This guards against self-approval bias, so evaluations are more likely to reflect actual quality than model-specific patterns.
Adversarial Testing & Calibration
Adversarial Test Set
A set of deliberately crafted failure cases that should always be caught: permission loops disguised as thoroughness, self-talk disguised as updates, and cherry-picked reviews that look complete. The adversarial set validates that rubric changes don't introduce blind spots.
- Detection rate: 100%
- Entries: calibrated against Tony verdicts
Agreement Measurement
measure-agreement.py compares LLM judge verdicts against Tony's actual verdicts on the calibration corpus. This measures rubric fidelity: how well the machine-readable rubric captures what Tony would actually say.
- Corpus size: 86 Tony verdicts
- Purpose: rubric drift detection
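A judge-versus-human agreement measurement can be sketched as below. This is not the contents of measure-agreement.py. The function name is hypothetical, and Cohen's kappa is included here as one common chance-corrected statistic, since the source does not say which measure the script reports.

```python
def agreement_stats(judge: list, human: list) -> dict:
    """Return raw agreement and Cohen's kappa for two aligned verdict lists."""
    if len(judge) != len(human) or not judge:
        raise ValueError("verdict lists must be non-empty and the same length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    # Agreement expected by chance, from each rater's own label frequencies.
    labels = set(judge) | set(human)
    expected = sum((judge.count(l) / n) * (human.count(l) / n) for l in labels)
    # If both raters used one identical label throughout, kappa is taken as 1.0.
    kappa = 1.0 if expected == 1 else (observed - expected) / (1 - expected)
    return {"agreement": observed, "kappa": kappa}
```

Raw agreement alone can look high when most verdicts are PASS, so tracking a chance-corrected value across rubric revisions gives a clearer signal of rubric drift.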
The Target State: Autonomous Quality Convergence
The system is designed so that quality improves monotonically, cycle over cycle, without human intervention in the loop. Each cycle runs: measure → identify pattern → fix pattern → verify fix → next cycle. As failure classes are resolved, the pass rate climbs. New failure classes that emerge from expanded coverage or stricter rubrics are triaged and fixed by the same loop.
The human role shifts from "reviewing agent outputs" to "defining what excellent looks like": rubric design, calibration corpus curation, and adversarial test creation. The machine handles the enforcement, measurement, and behavioral adaptation at scale.
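The verify step of each cycle amounts to comparing failure-class counts before and after a fix. A minimal sketch follows, assuming a fix counts as resolved only when its class has zero instances in the next cycle. That threshold, the dictionary shapes, and the function name are illustrative assumptions rather than documented behavior.

```python
def verify_fixes(fixed_classes: set, previous_counts: dict, current_counts: dict) -> dict:
    """Sort failure classes into resolved, unresolved, and newly observed."""
    resolved = sorted(c for c in fixed_classes if current_counts.get(c, 0) == 0)
    unresolved = sorted(c for c in fixed_classes if current_counts.get(c, 0) > 0)
    new = sorted(
        c for c, n in current_counts.items() if n > 0 and c not in previous_counts
    )
    return {"resolved": resolved, "unresolved": unresolved, "new": new}
```

Resolved classes would move to the "Resolved" section of the loop file, unresolved ones stay in the fix queue, and new ones enter triage in the next cycle.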