Scrubbit in Plain English

These are the nine concepts you'll run into as you use Scrubbit. It's a quick read, and there's a quiz at the end to help you drill them into your memory.

1. Entity

An entity is any piece of sensitive information Scrubbit finds in your document — a person's name, an email, a dollar amount, a date. Scrubbit color-codes each entity by category so you can see at a glance what it found.

Example: "Dr. Jane Smith met with Meridian Group on March 15, 2024 to finalize the $42,500 retainer. She can be reached at jane@meridian.com or her office at 12 Harbor Rd, Boston, MA."

2. Token

When Scrubbit anonymizes your document, it replaces each entity with a token — a placeholder like PERSON_1 or ORG_2. Tokens are safe to send to any AI. The original values never leave your device.

Original: "Dr. Jane Smith met with Meridian Group on March 15, 2024 to finalize the $42,500 retainer."

Anonymized: "PERSON_1 met with ORG_1 on DATE_1 to finalize the AMOUNT_1 retainer."
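
For readers who like to see the idea in code, here is a minimal Python sketch of token replacement. It is not Scrubbit's actual implementation: the function name anonymize, the (text, category) input format, and the token_map dictionary are all assumptions made for illustration.

```python
def anonymize(text, entities):
    """Swap each entity for a CATEGORY_n token (illustrative sketch only).

    `entities` is a list of (surface_text, category) pairs, a hypothetical
    input format. Returns the anonymized text plus a token -> original map,
    which in this sketch stands in for the data that stays on your device.
    """
    counters = {}        # how many tokens each category has so far
    value_to_token = {}  # original value -> token
    token_map = {}       # token -> original value (kept locally)
    for value, category in entities:
        if value in value_to_token:
            continue  # the same value always reuses its token
        counters[category] = counters.get(category, 0) + 1
        token = f"{category}_{counters[category]}"
        value_to_token[value] = token
        token_map[token] = value
    # Replace longer values first so a short value never clobbers
    # part of a longer one.
    for value in sorted(value_to_token, key=len, reverse=True):
        text = text.replace(value, value_to_token[value])
    return text, token_map


original = ("Dr. Jane Smith met with Meridian Group on March 15, 2024 "
            "to finalize the $42,500 retainer.")
entities = [
    ("Dr. Jane Smith", "PERSON"),
    ("Meridian Group", "ORG"),
    ("March 15, 2024", "DATE"),
    ("$42,500", "AMOUNT"),
]
anonymized, token_map = anonymize(original, entities)
print(anonymized)
```

Only the anonymized string would be shared with an AI in this sketch; token_map is what lets the real values be put back afterwards.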

3. Canonical Value

Behind every token is a canonical value — the single authoritative form of the real entity. It's the reference point Scrubbit uses to tie all the different surface forms of a name back to one identity. Canonical values live only on your device and are never sent anywhere.

4. Variant

The same person might appear as "Jane", "Dr. Smith", and "J. Smith" in a single document. These are variants — different surface forms of the same entity. Scrubbit maps them all to one token, so an AI knows they're the same person, not three different people.
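
The relationship between a token, its canonical value, and its variants can be sketched in Python as a small lookup table. The ENTITIES layout and the replace_variants function below are hypothetical, chosen for illustration; Scrubbit's real storage format is not described here.

```python
import re

# Hypothetical local table: one token, one canonical value, many variants.
ENTITIES = {
    "PERSON_1": {
        "canonical": "Dr. Jane Smith",
        "variants": ["Jane", "Dr. Smith", "J. Smith"],
    },
}


def replace_variants(text, entities):
    """Map every known surface form of an entity to its single token."""
    lookup = {}
    for token, record in entities.items():
        for form in [record["canonical"], *record["variants"]]:
            lookup[form] = token
    # Longest forms first, so "Dr. Jane Smith" is tried before "Jane".
    forms = sorted(lookup, key=len, reverse=True)
    alternatives = "|".join(re.escape(form) for form in forms)
    # The lookarounds stop "Jane" from matching inside a word like "Janet".
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")
    return pattern.sub(lambda match: lookup[match.group(0)], text)


text = "Dr. Smith called. Jane said J. Smith would sign."
result = replace_variants(text, ENTITIES)
print(result)
```

All three surface forms come out as PERSON_1, which is what lets an AI treat them as one person.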

5. Anonymization

Anonymization is the full process of replacing every sensitive entity in a document with tokens. Unlike redaction — which blacks out content and strips all meaning — Scrubbit's tokens preserve context. A token like PERSON_physician_1 still tells an AI this is a medical professional, so it can reason usefully about the document without ever seeing the actual name.

6. Extraction

Extraction is how Scrubbit finds entities in the first place. It runs three passes: regex patterns for structured data (emails, phone numbers, IDs), a lookup against your glossary, and a local AI model for anything context-dependent. The results are combined into a single list of candidate entities for you to review.
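
The three passes can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the email pattern is deliberately basic, the function names are invented, and model_pass is an empty placeholder because the local AI model is not something a short example can reproduce.

```python
import re

# A deliberately simple email pattern, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def regex_pass(text):
    """Pass 1: structured data that a pattern can recognise."""
    return [(match.group(0), "EMAIL") for match in EMAIL_RE.finditer(text)]


def glossary_pass(text, glossary):
    """Pass 2: terms already saved in the glossary."""
    return [(term, category) for term, category in glossary.items()
            if term in text]


def model_pass(text):
    """Pass 3: placeholder for the local AI model (returns nothing here)."""
    return []


def extract(text, glossary):
    """Run the three passes in order and merge them without duplicates."""
    candidates = []
    seen = set()
    passes = (regex_pass(text), glossary_pass(text, glossary), model_pass(text))
    for found in passes:
        for value, category in found:
            if value not in seen:
                seen.add(value)
                candidates.append((value, category))
    return candidates


text = "She can be reached at jane@meridian.com or through Meridian Group."
glossary = {"Meridian Group": "ORG"}  # hypothetical glossary contents
candidates = extract(text, glossary)
print(candidates)
```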

7. Provenance

Every entity comes with a provenance: a record of why it was flagged, which tells you how much to trust each detection. There are three types: Exact (you told Scrubbit to find it), Glossary (matched a known entity in your glossary), and Discovered (found automatically during extraction).
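
One way to picture provenance is as a label attached to each detection. The Python sketch below uses hypothetical class and field names, and it assumes that Discovered detections are the ones most worth a second look; that review rule is an illustration, not documented Scrubbit behavior.

```python
from dataclasses import dataclass
from enum import Enum


class Provenance(Enum):
    EXACT = "Exact"            # you told Scrubbit to find it
    GLOSSARY = "Glossary"      # matched a known entity
    DISCOVERED = "Discovered"  # found automatically


@dataclass(frozen=True)
class Detection:
    text: str
    category: str
    provenance: Provenance


def needs_review(detections):
    """Assumption: automatically discovered items deserve the closest look."""
    return [d for d in detections if d.provenance is Provenance.DISCOVERED]


detections = [
    Detection("Dr. Jane Smith", "PERSON", Provenance.EXACT),
    Detection("Meridian Group", "ORG", Provenance.GLOSSARY),
    Detection("12 Harbor Rd, Boston, MA", "ADDRESS", Provenance.DISCOVERED),
]
to_review = needs_review(detections)
print([d.text for d in to_review])
```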

8. Glossary

Every entity you confirm gets saved to your project's glossary — a growing library of known sensitive terms. Future documents are automatically checked against it, so Scrubbit gets faster and more accurate the more you use it.
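
In code terms, a glossary behaves like a dictionary that grows each time you confirm something. The Python sketch below is a simplification with invented function names (confirm, check_document), and it matches terms by plain substring search, which is an assumption rather than a description of how Scrubbit matches.

```python
def confirm(glossary, value, category):
    """Save a confirmed entity so future documents are checked against it."""
    glossary[value] = category


def check_document(text, glossary):
    """Return every glossary term that appears in a new document."""
    return [(term, category) for term, category in glossary.items()
            if term in text]


glossary = {}
confirm(glossary, "Meridian Group", "ORG")  # confirmed in the first document

# A later document is checked against the glossary automatically.
hits = check_document("Invoice for Meridian Group, due in 30 days.", glossary)
print(hits)
```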

9. Project

A project is the top-level container in Scrubbit. Each project has its own glossary, its own files, and its own settings — keeping different cases, clients, or topics completely separate. Think of it as a dedicated workspace for everything related to one matter.
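
The separation between projects can be sketched in Python as one object per project, each owning its own glossary, files, and settings. The Project class and its field names are hypothetical and serve only to illustrate that nothing is shared between projects.

```python
from dataclasses import dataclass, field


@dataclass
class Project:
    name: str
    # default_factory gives every project its own fresh containers,
    # so no glossary, file list, or settings are shared by accident.
    glossary: dict = field(default_factory=dict)
    files: list = field(default_factory=list)
    settings: dict = field(default_factory=dict)


case_a = Project("Case A")
case_b = Project("Case B")

case_a.glossary["Meridian Group"] = "ORG"
case_a.files.append("retainer.docx")  # hypothetical file name

# Case B knows nothing about what was confirmed in Case A.
print("Meridian Group" in case_b.glossary)
```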