Building a Skills System for AI Agents

The first version of my AI assistant had a system prompt. Every time I wanted it to do something new — format a commit message a certain way, run a specific test before pushing, draft an article in a particular voice — I added instructions to the prompt.

By version three, the prompt was 4,000 words. By version five, it was 8,000. The model was spending a significant chunk of its context window reading instructions it would not need for 90% of tasks. And when it did need a specific procedure, that procedure was buried in a wall of text alongside dozens of unrelated instructions.

System prompts do not scale. They only ever grow: every new capability adds weight. Eventually the prompt is so large that the model's actual reasoning degrades. It fails to locate the relevant instruction, conflates similar procedures, and starts ignoring anything more than a few paragraphs back.

I started working on this problem before Anthropic published its SKILL.md standard. My first attempt was YAML files that acted as expertise profiles: each one was a complete persona rewrite for a specific task type. A "code reviewer" profile didn't just add review instructions to the existing prompt. It replaced the system prompt entirely, putting the model into the right mindset with the right toolset, vocabulary, and decision framework for that domain. A "documentarian" profile did the same thing, but with a completely different orientation.

The insight was that different tasks require fundamentally different modes of reasoning, not just additional instructions layered on top of a generic prompt. A code reviewer and a technical writer are not the same agent with different instructions — they are different experts. The expertise profiles captured that distinction.

You would select one profile as your primary agent, and that agent could then invoke other profiles as sub-agents for specific subtasks. This delegation pattern had a secondary benefit that turned out to be critical: it kept context clean. The primary agent held the high-level objective. When it needed domain-specific work, it spawned a sub-agent with a focused expertise profile — that sub-agent only knew about its task, only loaded the relevant tools, and returned a clean result. No cross-contamination of context between the strategic layer and the execution layer.
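A minimal sketch of that delegation pattern, under stated assumptions: the Agent class, the delegate helper, the profile dictionary layout, and stub_model (a stand-in for a real model call) are all hypothetical names for illustration, not the original implementation. The point it shows is that the sub-agent starts with an empty context and the primary agent only ever receives the result.

```python
from dataclasses import dataclass, field


@dataclass
class Agent:
    profile: str                                  # system prompt for this expertise
    tools: list
    context: list = field(default_factory=list)   # messages this agent has seen

    def run(self, task, model):
        self.context.append(task)
        return model(self.profile, self.context)


def delegate(profiles, profile_name, subtask, model):
    """Spawn a fresh sub-agent; the caller only ever sees its result."""
    spec = profiles[profile_name]
    sub = Agent(profile=spec["prompt"], tools=spec["tools"])
    return sub.run(subtask, model)


def stub_model(profile, context):
    # Stand-in for a real LLM call (assumption): reports what it was given.
    return f"[{profile}] handled {len(context)} message(s)"


profiles = {
    "code-reviewer": {"prompt": "You are a code reviewer.", "tools": ["read_diff"]},
}

primary = Agent(profile="You are the planner.", tools=[])
primary.context.append("Objective: ship release 1.2")

# The sub-agent never sees the primary agent's context, only its subtask.
result = delegate(profiles, "code-reviewer", "Review the staged diff", stub_model)
primary.context.append(result)
print(result)
```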

The solution was not a shorter prompt. It was a different architecture entirely.

Skills as Files, Not Prompt Text

Those early YAML expertise profiles evolved into what I now call skills. The idea is straightforward: take every procedure, every workflow, every reusable instruction set out of the system prompt and put it in its own file.

A skill is a markdown document. It describes a procedure — how to do one thing, step by step, in enough detail that an agent can follow it without additional guidance. Each skill file has YAML frontmatter that gives the agent the metadata it needs to decide when to use it.

---
name: commit-message-convention
description: "Write conventional commits with scope and breaking-change footer"
tags: ["git", "commits"]
trigger: "before git commit"
---

# Commit Message Convention

1. Read the staged diff to understand what changed.
2. Determine the type: feat, fix, docs, refactor, test, chore.
3. Identify the scope from the affected module name.
4. Write the summary line: type(scope): description (max 72 chars).
5. If the change breaks compatibility, add a BREAKING CHANGE footer.
...

When the agent needs to write a commit message, it does not search its system prompt for the relevant section. It loads the skill file. The procedure is fresh, isolated, and the only thing in context at that moment.
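To make the file format concrete, here is a rough sketch of splitting a skill file into metadata and body. It assumes only the flat frontmatter shape shown above (plain scalars, quoted strings, and bracketed lists) and reads those values with the standard json module. The parse_skill name is hypothetical, and a real system would more likely hand the frontmatter to a proper YAML parser.

```python
import json


def parse_skill(text):
    """Split a skill file into (metadata, body).

    Assumption: frontmatter is flat `key: value` lines whose values are
    bare words, double-quoted strings, or [..] lists, as in the example.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("skill file must start with '---'")
    try:
        end = next(i for i, line in enumerate(lines[1:], start=1)
                   if line.strip() == "---")
    except StopIteration:
        raise ValueError("frontmatter is not closed with '---'") from None

    meta = {}
    for line in lines[1:end]:
        if not line.strip():
            continue
        key, _, raw = line.partition(":")
        raw = raw.strip()
        # Quoted strings and bracketed lists in this subset are valid JSON.
        meta[key.strip()] = json.loads(raw) if raw[:1] in ('"', "[") else raw

    body = "\n".join(lines[end + 1:]).strip()
    return meta, body


SKILL = '''---
name: commit-message-convention
description: "Write conventional commits with scope and breaking-change footer"
tags: ["git", "commits"]
trigger: "before git commit"
---

# Commit Message Convention

1. Read the staged diff to understand what changed.
'''

meta, body = parse_skill(SKILL)
print(meta["name"], meta["tags"])
```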

The system prompt stays small. The knowledge grows.

How Discovery Works

An agent cannot use a skill it does not know about. Discovery — how the agent finds skills — is the core architectural decision.

I use filesystem scanning. Skills live in a known directory. At startup, the agent reads the YAML frontmatter of every skill file — just the metadata, not the full content — and builds a registry. The registry is a lightweight index: skill name, description, tags, trigger conditions.

When a task comes in, the agent checks the registry. Does any skill match the current context? If the task is "commit these changes," the registry matches on the trigger: "before git commit" field. The agent loads the full skill file and follows it.
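The scan-then-match flow could look roughly like this. It is a sketch under several assumptions: skills are *.md files in a single directory, the frontmatter uses the simple subset from the example file, and the incoming task has already been reduced to a trigger string that is compared by exact match. The function names are hypothetical, and the demo writes a throwaway skill into a temporary directory.

```python
import json
import tempfile
from pathlib import Path


def read_frontmatter(path):
    """Read only the frontmatter lines of a skill file, never the body."""
    meta = {}
    with path.open(encoding="utf-8") as fh:
        if fh.readline().strip() != "---":
            return None                      # not a skill file
        for line in fh:
            if line.strip() == "---":
                return meta                  # stop before the body
            if not line.strip():
                continue
            key, _, raw = line.partition(":")
            raw = raw.strip()
            meta[key.strip()] = json.loads(raw) if raw[:1] in ('"', "[") else raw
    return None                              # frontmatter never closed


def build_registry(skills_dir):
    """Index every *.md file in skills_dir by skill name."""
    registry = {}
    for path in sorted(Path(skills_dir).glob("*.md")):
        meta = read_frontmatter(path)
        if meta and "name" in meta:
            registry[meta["name"]] = {**meta, "path": path}
    return registry


def match_trigger(registry, context):
    """Names of skills whose trigger equals the current context string."""
    return [name for name, meta in registry.items()
            if meta.get("trigger") == context]


def load_skill(registry, name):
    """Read the full file only once a skill has been selected."""
    return registry[name]["path"].read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "commit.md").write_text(
        '---\nname: commit-message-convention\n'
        'trigger: "before git commit"\n---\n\n# Commit Message Convention\n',
        encoding="utf-8",
    )
    registry = build_registry(tmp)
    matches = match_trigger(registry, "before git commit")
    skill_text = load_skill(registry, matches[0])

print(matches)
```

Exact string matching is the simplest possible rule. In practice, mapping a free-form request such as "commit these changes" onto a trigger is probably left to the model reading the registry descriptions.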

This is important: the full skill content is loaded only when the skill is actually needed. A registry of 50 skills might cost 2,000 tokens of context. Loading one skill to execute it costs another 500-1,000 tokens. Compare that to keeping all 50 procedures in the system prompt permanently: at 500-1,000 tokens each, that is 25,000 to 50,000 tokens of context consumed whether you need them or not.
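The arithmetic behind that comparison, using only the rough figures quoted above (they are estimates, not measurements):

```python
SKILLS = 50
REGISTRY_TOKENS = 2_000                        # index of all 50 skills
SKILL_TOKENS_LOW, SKILL_TOKENS_HIGH = 500, 1_000

# On demand: the registry plus the one skill being executed.
on_demand_low = REGISTRY_TOKENS + SKILL_TOKENS_LOW
on_demand_high = REGISTRY_TOKENS + SKILL_TOKENS_HIGH

# Always loaded: every procedure lives in the system prompt.
always_low = SKILLS * SKILL_TOKENS_LOW
always_high = SKILLS * SKILL_TOKENS_HIGH

print(f"on demand:     {on_demand_low:,} to {on_demand_high:,} tokens")
print(f"always loaded: {always_low:,} to {always_high:,} tokens")
print(f"registry cost per skill: {REGISTRY_TOKENS // SKILLS} tokens")
```

Even at the low end, the always-loaded prompt is ten times the on-demand cost, and the gap widens with every skill added.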

The Lifecycle: Proposals, Not Free-For-Alls

One of the early mistakes was letting skills multiply without oversight. Every interesting procedure became a skill file. Within a month there were 40 skills, half of them overlapping, several contradicting each other. The agent would find two matching skills for a task and have no way to decide between them.

The fix was a proposal lifecycle. New skills do not go straight into the registry. They go through a workshop:

Propose. Anyone — the agent itself, a user, an automated suggestion — can create a skill proposal. The proposal is a markdown file with the skill content and metadata.

Review. Proposals sit in a pending state. A human reviews them. Does this skill duplicate an existing one? Is the procedure correct? Is it safe to give the agent this capability?

Apply or reject. Approved proposals become live skills — discoverable by the agent at runtime. Rejected proposals are archived with a reason.

There is also a quarantine state for skills that turned out to be dangerous or wrong after deployment. A quarantined skill is removed from the registry but not deleted — the history is preserved so the same mistake is not made twice.
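The lifecycle above is a small state machine. A sketch of one way to encode it follows; the state names come from the text, while the Proposal class, the ALLOWED transition table, and the rule that rejection and quarantine need a reason are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    PENDING = "pending"
    LIVE = "live"
    REJECTED = "rejected"
    QUARANTINED = "quarantined"


# Which moves are legal from each state (assumed from the description).
ALLOWED = {
    State.PENDING: {State.LIVE, State.REJECTED},
    State.LIVE: {State.QUARANTINED},
    State.REJECTED: set(),
    State.QUARANTINED: set(),
}


@dataclass
class Proposal:
    name: str
    state: State = State.PENDING
    history: list = field(default_factory=list)   # kept even after quarantine

    def move(self, new_state, reason=""):
        if new_state not in ALLOWED[self.state]:
            raise ValueError(
                f"{self.state.value} -> {new_state.value} is not allowed")
        if new_state in (State.REJECTED, State.QUARANTINED) and not reason:
            raise ValueError("rejecting or quarantining requires a reason")
        self.history.append((self.state, new_state, reason))
        self.state = new_state


def discoverable(proposals):
    """Only live skills are visible to the agent at runtime."""
    return [p.name for p in proposals if p.state is State.LIVE]


commit = Proposal("commit-message-convention")
deploy = Proposal("force-push-deploy")
commit.move(State.LIVE)
deploy.move(State.LIVE)
deploy.move(State.QUARANTINED, reason="rewrote shared history")

print(discoverable([commit, deploy]))
```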

Skills are capabilities. You would not hand someone a new tool without checking whether it works. The proposal lifecycle is quality control for agent abilities.

Composability: Skills That Build on Each Other

The real power emerges when skills compose. A code review skill can reference a testing skill. An article drafting skill can reference a voice and style skill. Each skill is self-contained, but the agent can chain them.

This works because skills are just markdown. One skill can mention another by name. When the agent encounters a reference to a skill it has not loaded, it checks the registry. If the skill exists, it loads it. If not, it proceeds without it: the referenced skill is an enhancement, not a dependency.

The result is a capability graph. Each node is a skill. Edges are references between skills. The agent walks the graph as needed, loading only the nodes relevant to the current task.
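Walking that graph can be sketched as a depth-first traversal. The text does not specify how one skill names another, so the [[skill-name]] reference syntax here is purely an assumption, as are the resolve function and the sample skill bodies. Missing skills are skipped rather than treated as errors, and already-loaded skills are not revisited, which also guards against reference cycles.

```python
import re

# Hypothetical reference syntax: [[skill-name]] inside a skill body.
REF = re.compile(r"\[\[([a-z0-9-]+)\]\]")


def resolve(name, skills, loaded=None):
    """Return skill names in load order, starting from `name`."""
    if loaded is None:
        loaded = []
    if name in loaded or name not in skills:
        return loaded            # already loaded, or optional and missing
    loaded.append(name)
    for ref in REF.findall(skills[name]):
        resolve(ref, skills, loaded)
    return loaded


skills = {
    "code-review": "Review the diff, then run [[testing]] and [[security-audit]].",
    "testing": "Run the suite. Report using [[code-review]] conventions.",
    "article-draft": "Draft the piece in the house voice.",
}

order = resolve("code-review", skills)
print(order)
```

Here security-audit is referenced but absent, so the walk continues without it, and article-draft is never loaded because nothing on the path refers to it.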

This is what composability means for AI agents. Not one giant prompt that tries to cover everything. A library of focused, reviewable, individually testable procedures that the agent assembles at runtime.

Why an Open Standard Matters

The skill format — YAML frontmatter plus markdown body — is deliberately simple. No proprietary schema. No platform lock-in. Any agent that can read files and parse YAML can use it.

This matters because the ecosystem is fragmented. Every AI agent framework has its own way of defining capabilities. OpenAI has custom GPTs. Anthropic has tool definitions. LangChain has chains and agents. None of them talk to each other.

A skill written as a markdown file with standard frontmatter is portable. The same commit-message skill works in a coding agent, a CI pipeline, or a chat bot — anything that can read a file and follow instructions. Write the procedure once. Use it everywhere.

The value of a standard is not in its complexity. It is in its adoption. A simple format that any agent can read will outcompete a sophisticated format that only works in one framework.

What I Would Do Differently

The current system works, but the evolution was messy. If I were starting over:

Version the format. The YAML frontmatter schema changed three times before it stabilized. A version field (schema_version: 1) from day one would have saved migration headaches.
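A sketch of what a version field makes possible: each schema change becomes one small migration function, applied in order. The specific change shown (tags stored as a comma-separated string in version 1) is invented for illustration and is not one of the actual schema changes. CURRENT_VERSION, MIGRATIONS, and upgrade are likewise hypothetical names.

```python
CURRENT_VERSION = 2


def _v1_to_v2(meta):
    # Invented example change: v1 stored tags as "git, commits".
    tags = meta.get("tags", "")
    if isinstance(tags, str):
        meta["tags"] = [t.strip() for t in tags.split(",") if t.strip()]
    return meta


# One entry per version step: MIGRATIONS[n] upgrades n to n + 1.
MIGRATIONS = {1: _v1_to_v2}


def upgrade(meta):
    """Return a copy of the frontmatter upgraded to CURRENT_VERSION."""
    meta = dict(meta)
    # Files written before the field existed are treated as version 1.
    version = int(meta.get("schema_version", 1))
    if version > CURRENT_VERSION:
        raise ValueError(f"unsupported schema_version: {version}")
    while version < CURRENT_VERSION:
        meta = MIGRATIONS[version](meta)
        version += 1
    meta["schema_version"] = version
    return meta


old = {"name": "commit-message-convention", "tags": "git, commits"}
new = upgrade(old)
print(new)
```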

Tag ruthlessly. Alongside triggers, tags are how the registry matches skills to tasks. Early on I was inconsistent: some skills had ["git"], others had ["version-control"]. A controlled vocabulary from the start makes the registry far more effective.
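One way to enforce a controlled vocabulary is to normalize tags when a skill is proposed. In this sketch the vocabulary, the alias table, and the normalize_tags name are assumptions; only the git and version-control pair comes from the text.

```python
# Canonical tags the registry accepts (illustrative list).
VOCABULARY = {"git", "commits", "testing", "writing"}

# Known synonyms mapped onto their canonical tag.
ALIASES = {"version-control": "git", "vcs": "git"}


def normalize_tags(tags):
    """Lowercase, resolve aliases, drop duplicates, reject unknown tags."""
    normalized = []
    for tag in tags:
        cleaned = tag.strip().lower()
        canonical = ALIASES.get(cleaned, cleaned)
        if canonical not in VOCABULARY:
            raise ValueError(f"unknown tag: {tag!r}")
        if canonical not in normalized:
            normalized.append(canonical)
    return normalized


tags = normalize_tags(["Version-Control", "git", "commits"])
print(tags)
```

Rejecting unknown tags outright is deliberately strict: it forces a new tag to be added to the vocabulary on purpose instead of appearing by accident.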

Write the description for the agent, not the human. The description field in the frontmatter is what the agent reads to decide if a skill is relevant. "Commit Message Convention" is a good title for a human. "Write conventional commits with scope and breaking-change footer" is a good description for an agent. The agent needs to know what the skill does, not what it is called.

The Principle

Capabilities should be modular, discoverable, and reviewable. Modular because a single skill should fit in context alongside the task. Discoverable because an agent cannot use what it cannot find. Reviewable because every capability is also a risk — the agent gains the ability to do something, and that ability needs oversight.

Skills are not prompts. They are programs — written in natural language, executed by a language model, and governed like any other code that runs in production. Treat them that way, and your agent grows without rotting.