People ask us all the time: what is Scrubby actually doing in there? It’s a fair question. “Codebase intelligence” sounds abstract, and the value is real but largely invisible. Your AI agent stops generating code that violates conventions, and your token bills drop. But what’s the path from “I just installed this thing” to those outcomes?

This post walks through it, from a user’s perspective, in the order it actually happens.

Step 1: Connect a Repository

The first thing you do is point Scrubby at a repository. There are two ways to do this, and they’re not mutually exclusive.

The GitHub App. You install Scrubby’s GitHub App on your organization, pick the repos you want covered, and that’s it. From that moment forward, every pull request gets reviewed against your codebase’s actual patterns, and Scrubby’s understanding of your code stays current as commits land.

The MCP server. You configure your AI editor (such as Claude Code or anything else that speaks Model Context Protocol) to connect to Scrubby. Now your AI agent can query Scrubby directly, in real time, while you work.

Most teams use both. The GitHub App reviews PRs as they are opened, and the MCP server prevents the bad PRs from ever being written in the first place.

Step 2: The First Index

When Scrubby first sees your repository, it does something that no linter or static analysis tool does. It tries to understand your codebase as a structure of meaning rather than just a tree of files.

Here’s what happens:

  1. The repo gets scanned. Scrubby walks your file tree, hashes every file, parses imports and exports, and builds a graph of how files reference each other.
  2. Domains get discovered. Scrubby’s DomainClassifier sends your directory structure, file metadata, and a sample of file contents to Claude. The model identifies the architectural domains in your codebase (for example, “Authentication” or “Background Jobs”) and assigns each file to one. This is actual semantic classification of what each part of your codebase does, rather than pattern matching on directory names.
  3. Connections get built. Scrubby’s ConnectionBuilder analyzes cross-domain imports and produces a weighted graph of how domains relate to one another. If files in billing import from user-management 23 times across the codebase, those two domains get a strong connection.
  4. Global domains get activated. Scrubby maintains a set of global domains, which are shared knowledge bundles for things like React, Rails, Security, and Testing. Based on your file extensions and dependency files, the relevant ones get activated for your repo.
  5. Git history gets ingested. The last 100 commits get pulled in, along with which files they touched. No author data or other PII is stored, only what changed and when.
  6. Domain activity gets tracked. Scrubby computes per-domain metrics like change velocity and which domains tend to be modified in the same commit.
  7. A snapshot gets saved. An IndexSnapshot records the state of the index, the head SHA, how many files were processed, and whether reclassification was triggered.
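
To make step 3 concrete, here is a minimal sketch of counting cross-domain imports and turning them into weights. It is not Scrubby’s ConnectionBuilder. The function names, the input shapes, the directed edges, and the normalization against the busiest edge are all assumptions made for illustration.

```python
from collections import Counter


def build_connections(file_domains, imports):
    """Count imports that cross a domain boundary.

    file_domains: dict of file path -> domain name (hypothetical shape).
    imports: iterable of (importing_file, imported_file) pairs.
    """
    counts = Counter()
    for src, dst in imports:
        a, b = file_domains.get(src), file_domains.get(dst)
        if a is None or b is None or a == b:
            continue  # skip unclassified files and same-domain imports
        counts[(a, b)] += 1
    return counts


def normalize(counts):
    """Scale raw counts to 0..1 relative to the busiest edge (an assumption)."""
    if not counts:
        return {}
    top = max(counts.values())
    return {edge: n / top for edge, n in counts.items()}


# Hypothetical repo: billing imports from user-management 23 times.
file_domains = {
    "app/billing/invoice.py": "billing",
    "app/billing/tax.py": "billing",
    "app/users/account.py": "user-management",
}
imports = [("app/billing/invoice.py", "app/users/account.py")] * 23
imports.append(("app/billing/invoice.py", "app/billing/tax.py"))

counts = build_connections(file_domains, imports)
weights = normalize(counts)
```

The import inside billing is ignored because it never leaves the domain, so the only edge in this example runs from billing to user-management with a raw count of 23.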

This first index typically takes a few minutes for a repo of moderate size. After that, it’s incremental, so Scrubby only re-processes what changed.

Step 3: Conventions Get Extracted

Once domains exist, Scrubby’s ConventionExtractor looks at the actual code in each one and identifies the patterns your team uses. The most important pattern Scrubby captures is the API facade your team uses to keep one domain from reaching directly into another.

These aren’t rules someone wrote down. They’re patterns Scrubby observed across the actual code your team has been writing for months or years. They get stored as DomainPattern records, with exemplar files attached, so when an AI agent asks “how does this team structure a new service?”, Scrubby has a real answer with real examples.

Step 4: An AI Agent Asks a Question

Now your AI editor is configured with the Scrubby MCP server. You open a file and ask Claude Code to add a new feature. What happens?

Before generating code, the agent issues a tool call. It might be scrubby_review on the file you’re editing, or scrubby_get_network to check the blast radius of the change you’re about to make.

Scrubby returns structured context about what the file does, what domain it belongs to, what conventions apply, and what other files historically change with it. The agent reads this and incorporates it into the code it generates. The result is code that fits, because the agent now has the context it was previously missing.

Step 5: A Pull Request Opens

You push your branch and open a PR. GitHub fires a pull_request webhook. Scrubby’s WebhooksController verifies the HMAC signature and enqueues an AnalyzePrJob.
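
For readers who haven’t implemented this check before, here is a sketch of the signature verification in Python. GitHub signs each delivery by computing an HMAC-SHA256 of the raw request body with the webhook secret and sending it in the X-Hub-Signature-256 header, prefixed with "sha256=". The function name is hypothetical, and this is not the code inside Scrubby’s WebhooksController.

```python
import hashlib
import hmac
from typing import Optional


def verify_github_signature(secret: bytes, body: bytes,
                            signature_header: Optional[str]) -> bool:
    """Return True if the header matches HMAC-SHA256 of the raw body."""
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, so a mismatch does not leak
    # how many leading characters were correct.
    return hmac.compare_digest(expected.encode(), signature_header.encode())
```

The check has to run against the raw bytes of the request body. Parsing the JSON and serializing it again can change whitespace or key order, which changes the digest.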

The job runs in the background:

  1. It creates a GitHub Check Run in the “in progress” state, so you can see something’s happening.
  2. It fetches the changed files via the GitHub API and filters out anything binary or oversized.
  3. It loads your repository’s index, including the domains, connections, conventions, and history.
  4. It runs Neural SME analysis on every changed file in parallel threads. Each file gets analyzed by its primary domain expert, plus every connected domain whose weight is above the activation threshold.
  5. The findings get aggregated, deduplicated, and ranked by severity.
  6. The Check Run gets completed with annotations, and a single PR comment is posted (or updated, if Scrubby has already commented on this PR).
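
Steps 4 and 5 can be sketched roughly as follows. This is an illustration only: the Finding fields, the severity levels, the duplicate key, and the function names are assumptions, and the post does not say what the activation threshold is, so it is left as a parameter.

```python
from dataclasses import dataclass

SEVERITY_ORDER = {"error": 0, "warning": 1, "info": 2}  # hypothetical levels


@dataclass(frozen=True)
class Finding:
    path: str
    line: int
    message: str
    severity: str
    domain: str


def experts_for(primary, connections, threshold):
    """Primary domain plus connected domains above the activation threshold.

    connections: dict of domain -> {connected domain: weight}.
    """
    connected = sorted(
        d for d, w in connections.get(primary, {}).items() if w > threshold
    )
    return [primary] + connected


def aggregate(findings):
    """Drop duplicate findings, then order by severity, path, and line."""
    unique = {}
    for f in findings:
        key = (f.path, f.line, f.message)
        kept = unique.get(key)
        # If two domains report the same issue, keep the more severe copy.
        if kept is None or SEVERITY_ORDER[f.severity] < SEVERITY_ORDER[kept.severity]:
            unique[key] = f
    return sorted(
        unique.values(),
        key=lambda f: (SEVERITY_ORDER[f.severity], f.path, f.line),
    )
```

Keying duplicates on path, line, and message means the same issue reported by two domain experts shows up once in the PR comment.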

The comment doesn’t read like a generic AI reviewer. It reads like a senior engineer on your team, because it’s grounded in your team’s actual patterns: “This new endpoint is missing the corresponding spec file. Endpoints in this domain consistently get tests at spec/requests/<endpoint>_spec.rb based on the last 47 commits.”

Step 6: The Network Learns

This is the part that’s easy to miss. Every analysis Scrubby does is a learning event.

When a connected domain produces useful findings on a file, the connection between that domain and the file’s primary domain gets reinforced, with the weight going up by 0.05. When a connected domain runs and finds nothing, the connection gets weakened, with the weight going down by 0.02. This is Hebbian learning, the principle that neurons that fire together wire together. Over time, the network converges on the connections that actually matter for your codebase, and irrelevant cross-domain noise fades.
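
The update rule is small enough to write out. Here is a sketch using the two step sizes quoted above. The function name and the clamp to the range 0 to 1 are assumptions, since the post doesn’t say how weights are bounded.

```python
REINFORCE = 0.05  # from the post: useful findings strengthen the connection
WEAKEN = 0.02     # from the post: a run with no findings weakens it


def update_weight(weight, produced_findings, lo=0.0, hi=1.0):
    """Apply one learning event. The clamp to [lo, hi] is an assumption."""
    delta = REINFORCE if produced_findings else -WEAKEN
    return min(hi, max(lo, weight + delta))


# One useful run followed by two empty runs.
w = 0.5
for useful in (True, False, False):
    w = update_weight(w, useful)
```

Ignoring the clamp, the arithmetic implies a break-even point. A connection gains 0.05 on a useful run and loses 0.02 on an empty one, so it grows on average only if it is useful in more than 2 out of every 7 runs (0.02 / 0.07, about 29%).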

Global domains learn the same way, but at a different scale. When the React global domain produces useful findings across many repos, its connections strengthen for everyone. When a pattern stops being relevant, temporal decay (0.995 per activation) lets the network adapt. This is what we mean when we say Scrubby gets smarter over time. The phrase describes the literal behavior of the connection graph.
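
To get a feel for how gentle that decay is, here is a small sketch. It assumes the 0.995 is a multiplicative factor applied to the weight on each activation, which the post implies but doesn’t spell out. The function names are hypothetical.

```python
import math

DECAY = 0.995  # per-activation decay factor from the post


def decayed(weight, activations, decay=DECAY):
    """Weight after a number of activations with no reinforcement."""
    return weight * decay ** activations


def activations_to_halve(decay=DECAY):
    """How many activations it takes for a weight to fall to half."""
    return math.log(0.5) / math.log(decay)
```

Under that assumption, a connection that is never reinforced needs roughly 138 activations to lose half its weight, so a single quiet week won’t erase something the network has learned.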

Step 7: Subsequent Indexes Are Cheap

When new commits land, Scrubby doesn’t re-index from scratch. The IncrementalScanner computes a changeset against the last indexed SHA and only processes what changed. Files whose content hash is unchanged get skipped entirely. Domain reclassification only re-runs if your codebase has shifted significantly, such as when twenty or more new files are added or when dependency files like package.json change.
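
A sketch of that logic is below. It is not the IncrementalScanner itself. The choice of SHA-256, the function names, the shape of the changeset, and the list of dependency files beyond package.json are all assumptions.

```python
import hashlib

# The post names only package.json; the other entries are hypothetical.
DEPENDENCY_FILES = {"package.json", "Gemfile", "requirements.txt"}
NEW_FILE_THRESHOLD = 20  # "twenty or more new files"


def content_hash(data: bytes) -> str:
    """Hash file contents (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def changeset(previous, current):
    """Compare path -> hash maps from the last index and the new head."""
    added = sorted(p for p in current if p not in previous)
    removed = sorted(p for p in previous if p not in current)
    modified = sorted(
        p for p in current if p in previous and current[p] != previous[p]
    )
    return {"added": added, "removed": removed, "modified": modified}


def needs_reclassification(changes):
    """True if enough new files arrived or a dependency file was touched."""
    touched = changes["added"] + changes["modified"] + changes["removed"]
    dependency_changed = any(
        path.rsplit("/", 1)[-1] in DEPENDENCY_FILES for path in touched
    )
    return len(changes["added"]) >= NEW_FILE_THRESHOLD or dependency_changed
```

Files that appear in both maps with the same hash never enter the changeset, which is the "skipped entirely" case described above.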

This is what makes Scrubby practical to run on real codebases at real velocity. You don’t pay a re-indexing tax every time someone merges a PR.

What Comes Out the Other Side

After a few days of normal development, your team has a Scrubby that understands your repository roughly the way a senior engineer does. It knows the domains, the connections between them, the conventions in each, and the patterns of how your code actually changes. That knowledge is queryable through the MCP server, applied automatically on every PR, and keeps refining itself as your code evolves.

This is the “codebase intelligence” we keep talking about. It is a living, learning representation of your codebase that the rest of your AI tooling can finally use to do its job well.

A Closing Thought

The reason we built Scrubby this way (with domains, connections, weights, history, and conventions all working together) is that we believe the next decade of software engineering depends on AI agents being grounded in the codebases they work on. Generic intelligence is no longer enough. The agents that produce real value are the ones that are demonstrably aware of the codebase they’re touching.

If you’ve been wondering what’s actually inside Scrubby, that’s it. Connect a repo, let it index, and watch your AI agents get noticeably better at the work they do for you.
