
We Train Our AI Skills Like Neural Networks. Here's How.

Daniela S. · 9 min read · June 2026

The Problem with Hand-Crafted Skills

Brain Kit ships 63 skills. Each one is a markdown file that teaches Claude Code how to do something specific — write copy, debug code, audit security, plan a launch.

Until last week, every skill was hand-crafted. We'd write the instructions, test them against real tasks, notice failures, edit the skill, test again. It worked, but the approach has three problems:

1. It doesn't scale. 63 skills × manual iteration = permanent bottleneck.

2. It's not reproducible. "I tweaked the copywriting skill and it felt better" isn't engineering. It's vibes.

3. It plateaus. After 3-4 rounds of manual editing, you stop finding improvements — not because the skill is optimal, but because you've exhausted your intuition.

We needed a systematic way to improve skills. Not guessing. Not A/B testing copy. Something closer to how you train a model.

Microsoft Published the Answer

In June 2026, Microsoft Research released SkillOpt — a framework that treats skill documents as trainable artifacts. The core insight is deceptively simple:

A skill document is to an LLM what weights are to a neural network.

You don't change the model. You change the instructions. And you can optimize those instructions with the same training concepts that make deep learning work: epochs, learning rates, gradient-style updates, validation gates.

The results across their benchmarks: +23.5 accuracy points on GPT-5.5. Best or tied-best on all 52 evaluated configurations. And the optimized skills transfer across models without retraining.

That last part is what caught our attention. We run a tri-brain architecture — Claude architects, Codex executes, Gemini validates. If optimized skills transfer across models, we only need to train once.

How SkillOpt Works (The 30-Second Version)

The training loop has six stages that repeat across steps and epochs:

1. Rollout — The target LLM runs a batch of tasks using the current skill.

2. Reflect — An optimizer LLM analyzes the failures. Not individually — in minibatches of 8, looking for systemic patterns.

3. Aggregate — Multiple reflection patches get merged via hierarchical LLM calls. 16 patches → 4 → 1.

4. Select — Edits get ranked by importance. Only the top N survive (N = learning rate, typically 4).

5. Update — The surviving edits get applied to the skill document. Four operations: append, insert_after, replace, delete.

6. Gate — Run validation tasks with the new skill. If scores improve, accept. If not, reject and keep the old version.
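To make the Select and Update stages concrete, here's a minimal Python sketch. It is not SkillOpt's code: the `Edit` class, its `importance` and `anchor` fields, and the choice to match anchors against whole lines are all assumptions made for illustration. Only the four operation names and the "top N survive" rule come from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Edit:
    # Hypothetical edit record; field names are assumptions for this sketch.
    op: str                       # "append", "insert_after", "replace", or "delete"
    importance: float             # ranking score assigned by the optimizer
    text: str = ""                # new content for append / insert_after / replace
    anchor: Optional[str] = None  # existing line the edit targets (unused by append)


def select(edits: list[Edit], budget: int) -> list[Edit]:
    """Keep only the `budget` most important edits (budget acts as the learning rate)."""
    return sorted(edits, key=lambda e: e.importance, reverse=True)[:budget]


def apply_edit(lines: list[str], edit: Edit) -> list[str]:
    """Return a new list of skill lines with one edit applied."""
    if edit.op == "append":
        return lines + [edit.text]
    if edit.op not in {"insert_after", "replace", "delete"}:
        raise ValueError(f"unknown op: {edit.op}")
    if edit.anchor not in lines:
        return list(lines)  # anchor not found: skip the edit rather than guess
    i = lines.index(edit.anchor)
    if edit.op == "insert_after":
        return lines[: i + 1] + [edit.text] + lines[i + 1 :]
    if edit.op == "replace":
        return lines[:i] + [edit.text] + lines[i + 1 :]
    return lines[:i] + lines[i + 1 :]  # delete


def update(skill: str, edits: list[Edit], budget: int = 4) -> str:
    """Select the top edits, then apply them in order of importance."""
    lines = skill.splitlines()
    for edit in select(edits, budget):
        lines = apply_edit(lines, edit)
    return "\n".join(lines)
```

With a budget of 2 and three proposed edits, the lowest-ranked edit is dropped before anything touches the skill. That is the whole point of the cap: it limits how far the document can move in one step.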

At epoch boundaries, two extra mechanisms fire: a slow update that compares epoch-start performance vs epoch-end (preventing forgetting), and a meta skill that distills "what optimization strategies worked this epoch" as memory for the next one.
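The slow update can be sketched in a few lines. The description above only says it compares epoch-start and epoch-end performance. What happens after the comparison is an assumption here: this version falls back to the epoch-start skill when the epoch as a whole made things worse. The `evaluate` callable is a hypothetical stand-in for running the eval tasks.

```python
from typing import Callable


def slow_update(
    epoch_start_skill: str,
    epoch_end_skill: str,
    evaluate: Callable[[str], float],  # hypothetical: returns a mean eval score
) -> str:
    """Keep the epoch-end skill only if it did not regress over the epoch."""
    start_score = evaluate(epoch_start_skill)
    end_score = evaluate(epoch_end_skill)
    # Assumed policy: revert on regression so earlier gains are not forgotten.
    return epoch_end_skill if end_score >= start_score else epoch_start_skill
```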

The output? A compact markdown file — 300 to 2,000 tokens — that runs against unchanged models with zero inference overhead.

The Deep Learning Analogy

This isn't metaphor. It's a direct structural mapping:

| Deep Learning | SkillOpt |
|---|---|
| Model weights | Skill document (markdown) |
| Forward pass | Rollout (run tasks with skill) |
| Loss function | Task evaluator (score outputs) |
| Backpropagation | Reflect (analyze failures) |
| Gradients | Edit patches |
| Gradient clipping | Select (cap edits to budget) |
| Learning rate | Max edits per step |
| SGD step | Apply patches to skill |
| Validation check | Gate (accept only if validation score improves) |
| Momentum | Slow update (epoch boundary) |
| Meta-learning | Meta skill (cross-epoch memory) |

The learning rate even has schedulers: cosine decay, linear decay, constant, or autonomous (let the optimizer decide). The same schedule shapes used to stabilize neural network training carry over to text optimization.
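Here's a rough sketch of what an edit-budget scheduler could look like, assuming the learning rate is an integer "max edits per step" that decays from 4 toward a floor of 1. The function name, the floor, and the round-half-up behavior are assumptions, not SkillOpt's API. The autonomous mode is left out because it depends on the optimizer LLM.

```python
import math


def edit_budget(
    step: int,
    total_steps: int,
    max_edits: int = 4,
    min_edits: int = 1,
    schedule: str = "cosine",
) -> int:
    """Return how many edits may be applied at `step` (0-indexed)."""
    if total_steps <= 1:
        progress = 0.0
    else:
        progress = min(step, total_steps - 1) / (total_steps - 1)

    if schedule == "constant":
        factor = 1.0
    elif schedule == "linear":
        factor = 1.0 - progress
    elif schedule == "cosine":
        # Standard cosine annealing shape: 1.0 at the start, 0.0 at the end.
        factor = 0.5 * (1.0 + math.cos(math.pi * progress))
    else:
        raise ValueError(f"unknown schedule: {schedule}")

    raw = min_edits + (max_edits - min_edits) * factor
    return max(min_edits, int(raw + 0.5))  # round half up, never below the floor


if __name__ == "__main__":
    for name in ("constant", "linear", "cosine"):
        print(name, [edit_budget(s, 5, schedule=name) for s in range(5)])
```

Early steps get the full budget for broad rewrites. Later steps get one or two edits, so the skill settles instead of churning.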

What We're Building

We're adapting SkillOpt for Brain Kit. The pieces map cleanly:

Our 63 SKILL.md files are the weights. Each one already has eval tasks (rubrics with hard pass/fail criteria and soft partial-credit scoring).
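As an illustration of how a rubric with hard and soft criteria might turn an output into a single score, here's a small sketch. The `Rubric` structure and the rule that any hard failure zeroes the score are assumptions for this example, not a description of Brain Kit's actual eval format.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Rubric:
    # Hard checks: every one must pass (assumed to gate the whole score).
    hard: list[Callable[[str], bool]] = field(default_factory=list)
    # Soft checks: (scorer returning 0.0-1.0, weight) pairs for partial credit.
    soft: list[tuple[Callable[[str], float], float]] = field(default_factory=list)


def score(output: str, rubric: Rubric) -> float:
    """Return a score in [0, 1]; any failed hard check yields 0.0."""
    if not all(check(output) for check in rubric.hard):
        return 0.0
    total_weight = sum(weight for _, weight in rubric.soft)
    if total_weight == 0:
        return 1.0  # only hard checks were defined, and all passed
    weighted = sum(
        weight * min(1.0, max(0.0, scorer(output))) for scorer, weight in rubric.soft
    )
    return weighted / total_weight


# Hypothetical copywriting rubric: a length limit plus two partial-credit checks.
copy_rubric = Rubric(
    hard=[lambda out: len(out.split()) <= 50],
    soft=[
        (lambda out: 1.0 if "free trial" in out.lower() else 0.0, 1.0),
        (lambda out: 1.0 if out.rstrip().endswith("!") else 0.5, 1.0),
    ],
)
```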

Our tri-brain architecture maps to SkillOpt's roles: Claude as the optimizer (reflects on failures, proposes edits), Codex as the target (runs tasks with the skill), Gemini as the scorer (evaluates output against rubric).

Our n8n workflows handle orchestration — triggering rollouts, collecting scores, managing checkpoints.

The MVP is simple: pick one skill, run it against 10 eval tasks, score the outputs, reflect on failures, propose ≤4 edits, validate, accept or reject. One loop takes about 30 minutes. We started with the copywriting skill.
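A single pass of that loop might look roughly like the sketch below. The three callables stand in for the target, scorer, and optimizer roles. Their signatures, the separate validation task list, and the `pass_mark` threshold are assumptions made so the example is self-contained. No real model API is called.

```python
from statistics import mean
from typing import Callable, Sequence

Task = str
Failure = tuple[Task, str, float]    # (task, output, score)
EditFn = Callable[[str], str]        # hypothetical: an edit maps skill text to skill text


def run_one_loop(
    skill: str,
    train_tasks: Sequence[Task],
    val_tasks: Sequence[Task],
    run_task: Callable[[str, Task], str],                    # target role
    score_output: Callable[[Task, str], float],              # scorer role
    propose_edits: Callable[[str, list[Failure]], list[EditFn]],  # optimizer role
    max_edits: int = 4,
    pass_mark: float = 1.0,
) -> tuple[str, bool]:
    """Run rollout, reflect, update, and gate once. Returns (skill, accepted)."""
    # Rollout and scoring on the training tasks.
    results: list[Failure] = []
    for task in train_tasks:
        output = run_task(skill, task)
        results.append((task, output, score_output(task, output)))

    failures = [r for r in results if r[2] < pass_mark]
    if not failures:
        return skill, False  # nothing to learn from this batch

    # Reflect and update: apply at most `max_edits` proposed edits, in order.
    candidate = skill
    for edit in propose_edits(skill, failures)[:max_edits]:
        candidate = edit(candidate)

    # Gate: accept only if the validation score strictly improves.
    def val_score(s: str) -> float:
        return mean(score_output(t, run_task(s, t)) for t in val_tasks)

    if val_score(candidate) > val_score(skill):
        return candidate, True
    return skill, False
```

Passing the roles in as plain callables keeps the loop testable with stubs before any model is wired in, and lets the same loop run no matter which model plays which role.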

The full pipeline will run all 63 skills through 3 epochs of optimization, with cosine learning rate decay, slow updates, and meta skill memory. Version-controlled checkpoints at every step. A leaderboard tracking skill scores over time.

Why This Matters

Every Brain Kit user gets the same 63 skills. If we can systematically improve each one's eval score by even 10-15%, that compounds across every task, every user, every day.

But the real advantage is defensibility. Hand-crafted skills can be copied. Training infrastructure can't. Once we have the optimization pipeline running, every iteration widens the gap between our skills and anyone who tries to replicate them by hand.

Microsoft's results show optimized skills transferring across models. If that holds for us, the skills we train with Codex as the target should also work better on Claude and Gemini, without retraining. One investment, three models improved.

And because the output is just markdown — no model fine-tuning, no GPU costs, no inference overhead — the marginal cost of optimization is near zero. We're training text, not parameters.

The era of hand-crafted prompts is ending. The same way hand-tuned features lost to learned representations in computer vision, hand-written skills will lose to optimized ones. We'd rather be on the right side of that curve.

Brain Kit already gives Claude Code a second brain. Now that brain is learning to think better on its own.
