How we measure detection accuracy

Name: Faultlines
Author: Faultlines

Faultlines emits two layers per repo — L1 engineering features (code-grounded, deterministic) and L2 product features (customer-facing, Sonnet-synthesised) — plus user flows attached to both. Every number on the landing comes from a fresh cold-scan compared against a hand-curated golden corpus.

Scan scope today

We publish full scans on 25 production OSS repositories in the landing carousel, plus dozens more private and dogfood scans not surfaced here. Of those 25:

16 trained corpus repos — hand-curated L1 + L2 + flow truth files, used to tune every Stage / extractor in the pipeline. Numbers below are computed against these truths.
7 unseen generalization repos — truth curated AFTER the engine was frozen, sourced from each product's public marketing surface. No engine change targeted these repos. Used to confirm we learned the problem class (route extractors, MVC, schema, workspace boundaries) rather than memorised the corpus.
2 internal dogfood scans — our own faultlines-app + a private soc0 repo we use as a stack-detection regression target. Not included in the precision/recall averages.
Plus the deterministic pipeline runs on every repo. Stage 0 → Stage 6 (intake, audit, extractors, reconciliation, flow detection, metrics) needs zero training data — only the Sonnet L2 rollup is tuned against the trained corpus. So coverage, health, hotspots, impact, and behavioral coverage work identically on a repo we've never seen before.

What the numbers mean

L2 — Product feature

A customer-facing capability: 'Multi-factor authentication', 'Stripe Checkout', 'AI Email Assistant'. These are what an end-user or PM would name from the product's marketing site + docs. Synthesised by a Sonnet analyst pass over the deterministic Layer 1 output.

L1 — Engineering feature

A code-grounded module / sub-system / package boundary: 'auth-middleware', 'webhook-processor', 'qr-code-generator'. These are what an engineer would name from the repo's structure. Emitted deterministically by Stage 1 extractors (route, MVC, schema, package, FastAPI, etc.) + a small LLM fallback for residual files.

Flow

A complete user-facing action threaded through code: 'Create Project', 'Reset Password', 'Cancel Subscription'. Each flow has an entry-point file + line, contributing files, and is attached to one or more product features via the Stage 8 shape-aware rollup.

Precision (P)

Of what the engine emitted, what fraction matches a real truth entry? High P means the output isn't cluttered with phantom features.

Recall (R)

Of the real truth entries, what fraction did the engine actually find? High R means the engine isn't missing things.
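
In set notation, the two definitions above can be written as follows. The symbols are introduced here for illustration and are not taken from the scoring script: E is the set of entries the engine emitted, T is the set of truth entries.

```latex
P = \frac{\left|\{\, e \in E : e \text{ matches some } t \in T \,\}\right|}{|E|},
\qquad
R = \frac{\left|\{\, t \in T : t \text{ is matched by some } e \in E \,\}\right|}{|T|}
```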

Shape

Stage 0.6 classifies every repo into one of 10 shapes (turborepo-monorepo, single-saas-routed, oss-library, backend-monolith, go-server, go-library, rust-workspace, framework-repo, cli-tool, universal-residual). Stage 8 picks a flow-rollup strategy from the registry by shape — Turborepo gets workspace-prefix matching, OSS-library gets Sonnet semantic attribution, etc. One algorithm doesn't fit all shapes.
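
A minimal sketch of a shape-keyed strategy registry, assuming a plain dictionary lookup. The shape names come from the list above; the strategy labels, the `pick_strategy` helper, and the choice to route unlisted shapes to a fallback are hypothetical and only illustrate the dispatch idea.

```python
# The 10 shapes named by the Stage 0.6 classifier.
SHAPES = (
    "turborepo-monorepo", "single-saas-routed", "oss-library",
    "backend-monolith", "go-server", "go-library", "rust-workspace",
    "framework-repo", "cli-tool", "universal-residual",
)

# Hypothetical strategy labels; only the two pairings mentioned in the
# paragraph above are filled in.
STRATEGY_BY_SHAPE = {
    "turborepo-monorepo": "workspace-prefix-matching",
    "oss-library": "sonnet-semantic-attribution",
}

# Assumption: shapes without a dedicated entry use a generic fallback.
FALLBACK_STRATEGY = "generic-fallback"


def pick_strategy(shape: str) -> str:
    """Return the flow-rollup strategy label for a classified shape."""
    if shape not in SHAPES:
        raise ValueError(f"unknown shape: {shape}")
    return STRATEGY_BY_SHAPE.get(shape, FALLBACK_STRATEGY)


print(pick_strategy("turborepo-monorepo"))
print(pick_strategy("cli-tool"))
```

Keeping the mapping as data rather than branching logic means a new shape only needs a new registry entry.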

Hotspot

A feature whose bug-fix ratio exceeds 30% with at least 3 commits — code that absorbs disproportionate bug-fix activity and warrants engineering attention. Shown as the per-repo Top Hotspots strip on the landing.

Generalization test

Cold-scan + score on repos the engine was never tuned against. Truth files curated from the product's public marketing surface only. If the unseen average sits within ±3pp of the trained corpus average, we treat the engine as generalizing rather than overfitting to the corpus.

Trained corpus — 16 OSS SaaS / library repos

The corpus the engine was tuned against. Every Stage / extractor / strategy was validated here before shipping.

L2 product · Precision

90.6%

L2 product · Recall

87.6%

L1 dev · Recall

92.5%

L1 dev · Precision

33.0%

Independent generalization — 7 unseen repos

Repos the engine was never tuned against. Truth files curated from each product's public marketing surface after the engine was frozen. The gap to corpus tells you if the engine learned the problem, not the corpus.

L2 product · Precision (unseen)

93.0%

Δ vs corpus: +2.4pp

L2 product · Recall (unseen)

88.6%

Δ vs corpus: +0.9pp
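
As a worked check, the headline averages and deltas can be recomputed from the per-repo L2 columns published on this page. The sketch assumes a plain (unweighted) mean across repos; the function names and the 3pp tolerance parameter are illustrative, with the tolerance taken from the generalization test described earlier.

```python
# Per-repo L2 precision / recall, copied from the per-repo tables on this page.
corpus_p = [97.1, 78.4, 82.4, 96.7, 96.0, 90.3, 80.0, 93.8,
            100.0, 97.3, 87.5, 100.0, 94.1, 86.8, 87.5, 81.8]
corpus_r = [80.0, 72.5, 93.8, 77.4, 92.0, 100.0, 100.0, 95.2,
            91.2, 83.6, 95.7, 88.2, 73.3, 83.9, 89.1, 86.1]
unseen_p = [91.7, 100.0, 92.5, 92.1, 94.9, 87.8, 91.7]
unseen_r = [96.3, 81.2, 95.8, 91.3, 94.6, 81.6, 79.1]


def mean(values):
    """Unweighted average across repos (assumed aggregation)."""
    return sum(values) / len(values)


def within_tolerance(unseen_avg, corpus_avg, tolerance_pp=3.0):
    """True when the unseen average is within +/- tolerance_pp of the corpus."""
    return abs(unseen_avg - corpus_avg) <= tolerance_pp


delta_p = mean(unseen_p) - mean(corpus_p)
delta_r = mean(unseen_r) - mean(corpus_r)

print(f"precision: corpus {mean(corpus_p):.1f}, unseen {mean(unseen_p):.1f}, delta {delta_p:+.1f}pp")
print(f"recall:    corpus {mean(corpus_r):.1f}, unseen {mean(unseen_r):.1f}, delta {delta_r:+.1f}pp")
print(within_tolerance(mean(unseen_p), mean(corpus_p)),
      within_tolerance(mean(unseen_r), mean(corpus_r)))
```

Differencing the unrounded means gives +2.4pp and +0.9pp. Differencing the rounded recall figures (88.6 and 87.6) would give +1.0pp, so the published recall delta appears to be computed before rounding.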

Per-repo · trained corpus

Repo	Shape (Stage 0.6)	L2 P	L2 R	L1 P	L1 R	L2 D / T
ollama	`go-server`	97.1%	80.0%	13.8%	82.0%	33 / 55
supabase	`turborepo-monorepo`	78.4%	72.5%	11.7%	100.0%	33 / 40
maybe	`backend-monolith`	82.4%	93.8%	37.1%	58.8%	30 / 32
better-auth	`turborepo-monorepo`	96.7%	77.4%	27.8%	100.0%	27 / 53
axios	`oss-library`	96.0%	92.0%	42.3%	56.1%	30 / 25
fastapi	`oss-library`	90.3%	100.0%	26.3%	100.0%	28 / 15
dub	`turborepo-monorepo`	80.0%	100.0%	13.4%	100.0%	40 / 15
meilisearch	`rust-workspace`	93.8%	95.2%	68.6%	100.0%	31 / 21
papermark	`single-saas-routed`	100.0%	91.2%	26.0%	100.0%	31 / 34
documenso	`turborepo-monorepo`	97.3%	83.6%	52.0%	100.0%	37 / 55
inbox-zero	`turborepo-monorepo`	87.5%	95.7%	33.5%	98.6%	29 / 23
infisical	`turborepo-monorepo`	100.0%	88.2%	9.2%	85.7%	30 / 93
formbricks	`turborepo-monorepo`	94.1%	73.3%	59.5%	99.3%	35 / 60
plane	`turborepo-monorepo`	86.8%	83.9%	8.8%	100.0%	44 / 56
trigger.dev	`turborepo-monorepo`	87.5%	89.1%	63.5%	99.4%	33 / 46
cal-com	`turborepo-monorepo`	81.8%	86.1%	35.0%	100.0%	31 / 36

Per-repo · unseen (generalization eval)

Repo	Shape (Stage 0.6)	L2 P	L2 R	L2 D / T
commerce	`single-saas-routed`	91.7%	96.3%	24 / 27
caddy	`go-server`	100.0%	81.2%	25 / 48
strapi	`turborepo-monorepo`	92.5%	95.8%	40 / 48
twenty	`turborepo-monorepo`	92.1%	91.3%	38 / 46
chatwoot	`backend-monolith`	94.9%	94.6%	39 / 56
lobe-chat	`turborepo-monorepo`	87.8%	81.6%	41 / 38
directus	`turborepo-monorepo`	91.7%	79.1%	36 / 43

How the golden corpus is built

External triangulation. Marketing site, public docs sidebar, pricing page, llms.txt — what the maintainer tells paying customers the product does.
Code triangulation. File-system routing (Next.js App Router, FastAPI routers, Rails routes), schema files (Prisma / Drizzle / Django models), package manifests, MVC controllers, workspace member layouts.
Engineering-grain expansion. When the codebase ships per-subsystem folders (one per auth method, one per PAM resource, one per crypto operation), each becomes its own truth entry. We grade the engine for finding the granular reality of the code, not a marketing abstraction. This is a hard rule — we never retro-fit truth to match engine output (that would be cheating; the measurement becomes self-referential).
Aliases. For every truth entry we record the engine's plausible naming variants (Title Case, kebab-case, abbreviations). Matching is alias-aware and stemming-aware.
No README parsing. Hard rule per CLAUDE.md. README is the maintainer's pitch: too biased, too aspirational, too dependent on the author's writing skill. External web surfaces (the live product's marketing site as customers see it) and code structure are the only inputs.

How each scan is run

Cold-scan only. The per-repo assignments cache is scrubbed before every measurement run, so a prior result can't bias the next one. If you skip the scrub, evaluation becomes a re-measurement of the last scan's bias — useless.
9-stage deterministic-first pipeline (pipeline_v2). Stage 0 git + filesystem intake → Stage 0.5 LLM stack auditor → Stage 0.6 deterministic shape classifier → Stage 1 parallel extractors (route, MVC, schema, package, FastAPI, Rails-suite, Go-router, Rust-workspace, JS-library, Python-library, …) → Stage 2 reconcile → Stage 3 Haiku flow detection (with dynamic wall-time budget, scales with feature count) → Stage 4 residual LLM clusterer → Stage 5 post-process → Stage 6 metrics + behavioral coverage → Stage 6.5 deterministic Layer 2 → Stage 7 metrics → Stage 8 Sonnet analyst + shape-aware flow rollup.
Shape-aware Stage 8. Each repo is classified into one of 10 shapes; a strategy registry picks the right flow-attribution algorithm. Turborepo monorepos use workspace-prefix matching; OSS libraries use Sonnet's member_flows semantic attribution; backend monoliths use MVC controller grouping; the universal-residual fallback uses entry-point-in-paths + 50%-overlap as a safety net.
LLM stack. Anthropic Claude: Haiku 4.5 for high-throughput stages (flows, residual), Sonnet 4.6 for the Layer 2 analyst. Auto-streaming on every messages.create call so long-output stages never silently fail.
Deterministic scoring. Output JSON saved to ~/.faultline/feature-map-<slug>-<run-id>.json and scored by eval/detection_eval.py. Stemming + alias matching; consolidated matches (e.g. transactions ↔ transaction) count toward recall but not as phantoms toward precision.
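
The alias-aware, stemming-aware matching described above can be sketched roughly as follows. This is not the code in eval/detection_eval.py: the `normalize` and `score` helpers, the crude plural-stripping stem, and the exact precision denominator are assumptions chosen to illustrate one reading of "consolidated matches count toward recall but not as phantoms".

```python
import re


def normalize(name: str) -> str:
    """Lowercase, unify separators, and strip a trailing plural 's' per token."""
    tokens = re.split(r"[\s_\-]+", name.strip().lower())
    stemmed = [t[:-1] if len(t) > 3 and t.endswith("s") else t for t in tokens]
    return " ".join(t for t in stemmed if t)


def score(emitted, truth):
    """Return (precision, recall).

    emitted: list of feature names produced by a scan.
    truth:   dict mapping each canonical truth name to its list of aliases.
    """
    lookup = {}
    for canonical, aliases in truth.items():
        for variant in [canonical, *aliases]:
            lookup[normalize(variant)] = canonical

    matched_truth = set()
    phantoms = 0
    for name in emitted:
        canonical = lookup.get(normalize(name))
        if canonical is None:
            phantoms += 1  # emitted, but no truth entry or alias matches
        else:
            matched_truth.add(canonical)  # duplicates consolidate onto one entry

    # Assumed convention: every non-phantom emission counts as a hit for precision.
    precision = (len(emitted) - phantoms) / len(emitted) if emitted else 0.0
    recall = len(matched_truth) / len(truth) if truth else 0.0
    return precision, recall


emitted = ["transactions", "Transaction", "auth-middleware", "Ghost Feature"]
truth = {"transaction": [], "Auth Middleware": ["auth-middleware"], "Billing": []}
print(score(emitted, truth))
```

In the usage example, "transactions" and "Transaction" consolidate onto one truth entry, "Ghost Feature" is the only phantom, and "Billing" is the one missed truth entry.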

Behavioral test coverage (no test runs needed)

Most OSS repos don't ship lcov reports, so we built a deterministic 7-signal coverage estimator that works on any clone with zero setup.

7-signal composite (per-feature, per-flow). co_change (0.35) + bug_fix_test (0.25) + freshness (0.15) + density (0.10) + co_author (0.10) + ci_workflow (0.05) + commit_msg (0.05). The weights as listed sum to 1.05, so the clamping matters: each signal is clamped to [0, 1], and the composite is clamped to [0, 1]. Per the [behavioral-coverage-analyzer] skill in the repo.
Confidence band. high (5+ signals active > 0.1), medium (3-4 active), low. A high-confidence 50% coverage means the same thing algorithmically as a high-confidence 90% — different repos, different real coverage, both measured the same way.
Test-type classification. Every test path is classified as unit/integration/e2e by regex on the file path (cypress/, playwright/, e2e/, integration-tests/, *.cy.*, *.e2e.*). CI workflow files are scanned for known runners (jest, vitest, pytest, go test, cargo test, rspec, etc.); the ci_workflow signal fires if any runner is wired.
Stage 6 wires it in. Stage 6 metrics tries to import the private faultlines-test-coverage provider; falls back to the OSS density-only path when absent. Output: coverage_pct per feature + per flow on every scan.
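
The composite and the confidence band can be sketched in a few lines. The weights and the 0.1 activity threshold are the ones listed above; the function names and the dictionary-based input are assumptions for illustration, not the provider's actual interface.

```python
# Signal weights as listed above. As written they add up to 1.05,
# so the final clamp matters when every signal saturates.
WEIGHTS = {
    "co_change": 0.35,
    "bug_fix_test": 0.25,
    "freshness": 0.15,
    "density": 0.10,
    "co_author": 0.10,
    "ci_workflow": 0.05,
    "commit_msg": 0.05,
}


def clamp01(x: float) -> float:
    """Clamp a value into [0, 1]."""
    return max(0.0, min(1.0, x))


def composite(signals: dict) -> float:
    """Weighted sum of clamped signals, itself clamped to [0, 1]."""
    total = sum(w * clamp01(signals.get(name, 0.0)) for name, w in WEIGHTS.items())
    return clamp01(total)


def confidence(signals: dict) -> str:
    """high: 5+ signals above 0.1, medium: 3-4, low otherwise."""
    active = sum(1 for name in WEIGHTS if signals.get(name, 0.0) > 0.1)
    if active >= 5:
        return "high"
    if active >= 3:
        return "medium"
    return "low"


example = {"co_change": 0.8, "density": 0.5}
print(composite(example), confidence(example))
```

Missing signals are treated as 0.0 here, which is one possible convention; a feature with no test history then scores 0 with low confidence.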

Every metric we emit — sell-yourself catalog

One scan produces all of these. Each card answers four questions: what it measures, what gradations mean and why, how to raise it, and what you actually do with the number.

health_score

0-100 · per feature & flow

Composite of bug-fix ratio, churn, and recency-weighted regression density. Recent bug fixes count 2× because last-quarter pain predicts next-quarter pain.

Gradations

≥70 healthy

quietly works — deprioritize

50-70 caution

watch list, schedule cleanup next sprint

<50 firefighting

actively bleeding velocity

Why those thresholds: Empirically calibrated against 17 trained repos, ~2500 dev features. Below 50 sits in the top decile of bug-fix density across every feature we've ever observed.

How to raise

Add a regression test for the last bug-fix commit. Split files >500 LOC touching this feature. Freeze new work for one sprint and pay down open fixes.

Use case

EM sprint planning. Sort features by health ascending → top 5 become next quarter's refactor budget. Defensible to the CTO with numbers, not vibes.

bug_fix_ratio

0-1 · rolling 100 commits

Fraction of recent commits on this feature classified as fixes via commit-message regex (fix:, bug, hotfix, regression). Pure git, no LLM.

Gradations

<0.15 stable

normal maintenance noise

0.15-0.30 elevated

something is structurally off

>0.30 broken

a large share of commits here are firefighting

Why those thresholds: Top 10% of features across the corpus exceed 0.30. Industry-typical SaaS codebases run 8-15%; anything sustained above that is structural, not transient.

How to lower

Root-cause the last 3 fix commits — usually one shared dependency. Add an integration test at the boundary. Stop accepting fixes without a paired test.

Use case

Hiring case. "Auth has 38% bug-fix ratio for 6 months" is a defensible argument for a senior hire on the platform team.
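
A minimal sketch of the commit-message classification behind bug_fix_ratio, assuming commit subjects are already scoped to one feature and ordered newest first. The regex is built from the keywords named in the card (fix, bug, hotfix, regression); the exact production pattern, the helper names, and the grading function are assumptions.

```python
import re

# Word-boundary match so that "fixtures" or "debugger" do not count as fixes.
FIX_PATTERN = re.compile(
    r"\b(?:fix(?:es|ed)?|bug(?:fix)?(?:es|s)?|hotfix(?:es)?|regressions?)\b",
    re.IGNORECASE,
)


def bug_fix_ratio(messages, window=100):
    """Fraction of the most recent `window` commit messages that look like fixes."""
    recent = messages[:window]  # assumes newest-first ordering
    if not recent:
        return 0.0
    fixes = sum(1 for message in recent if FIX_PATTERN.search(message))
    return fixes / len(recent)


def grade(ratio):
    """Map a ratio onto the gradations from the card above."""
    if ratio > 0.30:
        return "broken"
    if ratio >= 0.15:
        return "elevated"
    return "stable"


messages = [
    "fix: null deref in login",
    "feat: add SSO",
    "docs: update fixtures guide",
    "hotfix for regression in webhook",
]
ratio = bug_fix_ratio(messages)
print(ratio, grade(ratio))
```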

churn

commits / 100 · per feature

Commits per 100 commits of total repo history, scoped to this feature. Measures how much engineering attention this feature absorbs.

Gradations

<2 cold

stable or dead code

2-8 normal

actively maintained

>8 hotspot

disproportionate cost center

Why those thresholds: Scale-invariant percentile cut across every corpus repo. Top 10% of features absorb 40%+ of commits, the classic Pareto hotspot.

How to raise

High churn + low health = split. High churn + high health = freeze the API and protect with contract tests. Don't let new work pile on the same files.

Use case

Refactor budget defense. Top-5 churn features × engineer hourly rate = real dollar cost of debt. Show that number to finance.

impact_score

0-100 · blast radius

How many other features depend on this one via imports, shared files, and co-change. Touching high-impact code ripples downstream.

Gradations

≥70 critical

change here ripples across the product

40-70 connective

touches multiple flows

<40 leaf

isolated, safe to refactor

Why those thresholds: Graph-centrality percentiles across the feature dependency graph. Top quartile = "Stripe webhook handler" shape, touched by everything.

How to lower

Extract shared logic to a versioned internal package. Replace cross-feature imports with events. Add contract tests at the boundary.

Use case

PR review prioritization. Impact ≥70 + low coverage = mandatory senior review. Wire it to CODEOWNERS auto-assignment.

coverage_pct

0-100 · behavioral + lcov merge

Per-feature/flow test reach from 7 git-native behavioral signals (no test runs needed), merged with lcov line coverage if you upload it. Answers "is this feature tested?" — not "did some line execute?"

Gradations

≥70 covered

regressions surface in CI

40-70 partial

happy path only — edge cases ship to prod

<40 uncovered

change here is a coin flip

Why those thresholds: Calibrated against lcov ground truth on repos where both signals exist. Features below 40 correlate with 3× higher post-deploy incident rate.

How to raise

Add a test file co-located with the feature's primary path. Cover the last bug-fix commit. Pair every new endpoint with an integration test in the same PR.

Use case

Refactor gating. Don't refactor features with coverage <40 — write tests first. Turn the number into a merge-block rule per tier.

Under the hood: 7 signals combined: co_change (0.35) + bug_fix_test (0.25) + freshness (0.15) + density (0.10) + co_author (0.10) + ci_workflow (0.05) + commit_msg (0.05). Zero LLM calls. Lcov merges on top when provided.

coverage_confidence

low / medium / high

Reliability label on every coverage number, based on how many of the 7 behavioral signals fired plus whether lcov was provided. Prevents over-trusting noisy reads.

Gradations

high

5+ signals fired or lcov merged — treat as ground truth

medium

3-4 signals — direction reliable, exact % approximate

low

≤2 signals — use as hint only

Why those thresholds: Avoids the trap of confidently reporting "45% coverage" on a repo with shallow test history. Same algorithm, but the score earns its precision.

How to raise

Upload lcov.info from CI. Adopt conventional commits (test:, fix:). Co-locate tests next to code.

Use case

Defensible reporting up the chain. CTO asks "how do you know?" — answer is "5 of 7 independent signals agree", not "trust me".

test_classification

unit / integration / e2e

Every test file auto-classified by path conventions per stack (Jest, Vitest, Playwright, pytest, Go, RSpec, etc.). Surfaces whether you have a real pyramid or an inverted one.

Gradations

healthy pyramid

≈ 70% unit / 20% integration / 10% e2e

top-heavy

slow CI, flaky merges, expensive maintenance

no integration tier

blind spot between unit and e2e

Why those thresholds: Classic Mike Cohn pyramid, validated against CI duration on our corpus. Inverted pyramids cost 2-4× more CI time and ship more regressions.

How to raise

Push assertions down — convert e2e checks to integration where possible. Add unit coverage on pure functions before adding e2e flows.

Use case

CI cost reduction. Rebalance an inverted pyramid toward unit tests → 40-60% CI time cut, faster merges, fewer flaky reverts.
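
Path-based test classification can be sketched with two regexes built from the path markers this page lists (cypress/, playwright/, e2e/, integration-tests/, *.cy.*, *.e2e.*). Which marker maps to which tier, the default of "unit", and the helper names are assumptions; the real per-stack conventions are likely broader.

```python
import re
from collections import Counter

# Assumed mapping of the listed path markers onto tiers.
E2E_PATTERN = re.compile(r"(?:^|/)(?:cypress|playwright|e2e)/|\.(?:cy|e2e)\.")
INTEGRATION_PATTERN = re.compile(r"(?:^|/)integration-tests/")


def classify_test(path: str) -> str:
    """Classify a test file path as 'e2e', 'integration', or 'unit' (default)."""
    normalized = path.replace("\\", "/")
    if E2E_PATTERN.search(normalized):
        return "e2e"
    if INTEGRATION_PATTERN.search(normalized):
        return "integration"
    return "unit"


def pyramid(paths):
    """Share of test files per tier, for comparison with a 70/20/10 pyramid."""
    counts = Counter(classify_test(p) for p in paths)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {tier: counts[tier] / total for tier in ("unit", "integration", "e2e")}


paths = [
    "src/utils/date.test.ts",
    "src/auth/session.test.ts",
    "packages/api/integration-tests/billing.test.ts",
    "apps/web/cypress/login.cy.ts",
]
print(pyramid(paths))
```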

ownership · bus_factor

distinct authors · 90d

Per-feature count of distinct authors in the last 90 days. Bus factor 1 = one person knows this code. They leave, you bleed.

Gradations

≥3 resilient

safe to lose any single author

2 fragile

pair-program the next change

1 critical

single point of failure

Why those thresholds: Industry norm; sub-3 ownership correlates with 2-4 week onboarding delays per incident on the affected feature.

How to raise

Assign bus-factor-1 features as secondary ownership in the next sprint. Mandatory PR review from a second author for 30 days.

Use case

Hiring + retention case. "8 critical features owned by one person" is the conversation that gets headcount approved.
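
Counting distinct authors over a 90-day window is simple to sketch. The input shape (a list of author and timestamp pairs already scoped to one feature) and the helper names are assumptions; treating zero recent authors as "critical" is also an assumption, since the card only defines grades for 1, 2, and 3 or more.

```python
from datetime import datetime, timedelta, timezone


def bus_factor(commits, now, window_days=90):
    """Number of distinct authors with a commit inside the window."""
    cutoff = now - timedelta(days=window_days)
    return len({author for author, committed_at in commits if committed_at >= cutoff})


def grade(distinct_authors):
    """Map the author count onto the gradations from the card above."""
    if distinct_authors >= 3:
        return "resilient"
    if distinct_authors == 2:
        return "fragile"
    return "critical"  # 1 author; 0 is treated the same way here


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
commits = [
    ("ana", now - timedelta(days=10)),
    ("ben", now - timedelta(days=80)),
    ("ana", now - timedelta(days=5)),
    ("cy", now - timedelta(days=120)),  # outside the 90-day window
]
print(bus_factor(commits, now), grade(bus_factor(commits, now)))
```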

feature_uuid · flow_uuid

stable lineage IDs

Stable IDs that survive file renames, moves, and refactors. Track the same feature across 6 months of commits even when paths change.

Gradations

enabler, not a score

every other metric trends over time without resetting on every refactor

Why those thresholds: Path-based identity breaks on every reorg. UUID-based identity stays anchored, so quarterly health deltas survive folder shuffling.

How to raise

Nothing to raise — the ID is generated automatically from path signatures + entry-point hashes.

Use case

Trend reporting. Quarterly health-score deltas per feature, immune to refactor churn. The chart your CTO wants on the wall.

LOC · files_count · primary_path

size & canonical path

Raw size, file count, and the canonical path that "owns" this feature (the one with the most weighted attribution).

Gradations

sanity dimension

scales every other metric — a 12k-LOC "auth" is not the same problem as a 200-LOC one

Why those thresholds: Without size context, a 90% coverage on 100 LOC reads the same as 90% on 12,000. It isn't.

How to raise

Sort features by LOC descending — top 5 are split candidates. Cross-reference with churn and impact.

Use case

Decomposition planning. Defensible answer to "what should we split first?"

Runtime integrations — Sentry + PostHog

Two optional integrations enrich every feature with how it actually behaves in production. Zero SDK changes. Read-only API tokens. We match events to features through the same path_index the engine already computes.

Sentry

errors_24h · errors_14d · regression flag

Per-feature aggregate of Sentry errors over 24h and 14d, plus a regression flag when 7d rate ≥ 2× the prior 7d. We see counts and event IDs only — never bodies, never PII.

Gradations

0 errors

no regression — green

errors present, no regression

known issue, monitor

regression flag ON

fire alarm — ship a fix this sprint

Why those thresholds: 2× is a commonly used anomaly threshold in SRE practice; below 2× is treated as normal weekly noise, above it as a real shift.

How to wire it

Wire regression flag to PR comment + Slack digest. Auto-tag the on-call when a regression fires.

Use case

On-call reduction. Cross-reference regression flag with the PR that landed in the prior 7 days → root cause in minutes, not hours.
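
The regression rule (last 7 days at least 2× the prior 7 days) can be sketched as below. The function names and the handling of edge cases are assumptions: here a week with zero errors never flags, and any errors after an error-free prior week do flag.

```python
def regression_flag(errors_last_7d: int, errors_prior_7d: int, factor: float = 2.0) -> bool:
    """True when the last 7 days saw at least `factor` times the prior 7 days."""
    if errors_last_7d == 0:
        return False  # assumption: nothing to flag without errors
    return errors_last_7d >= factor * errors_prior_7d


def status(errors_14d: int, errors_last_7d: int, errors_prior_7d: int) -> str:
    """Map counts onto the three gradations from the card above."""
    if regression_flag(errors_last_7d, errors_prior_7d):
        return "regression"
    if errors_14d == 0:
        return "green"
    return "monitor"


print(status(errors_14d=15, errors_last_7d=10, errors_prior_7d=5))
print(status(errors_14d=12, errors_last_7d=6, errors_prior_7d=6))
print(status(errors_14d=0, errors_last_7d=0, errors_prior_7d=0))
```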

PostHog

traffic_pct · events_14d

Per-feature traffic share and 14-day event count. URL→feature matching via the same route extractors that build the feature map. Aggregate only — no per-user data, no PII.

Gradations

≥10% traffic

customer-facing critical path

1-10%

meaningful but not load-bearing

<1%

long tail — candidate for deprecation

Why those thresholds: Top decile by traffic typically carries 70%+ of revenue-affecting flows. Concentration follows the same Pareto curve as churn and bug-fix ratio.

How to wire it

Cross-reference traffic with coverage and health. High traffic + low coverage = test debt with a revenue tag attached.

Use case

Deprecation case. <1% traffic + churn >8 = dead weight. Cut it, recover the maintenance cost.

The killer view

Traffic × Errors × Coverage

Every feature plotted on traffic (PostHog) × errors (Sentry) × coverage (engine). High traffic + high errors + low coverage = priority #1 refactor.

Why it's the killer view: no other tool joins these three signals at the feature level. CodeScene has history. Sentry has errors. PostHog has traffic. Faultlines is the only place they meet on the same feature_uuid — and survive code moves.

Use case: quarterly refactor planning. Open the quadrant, point at the top-right red cluster, that's your quarter. Defensible to the board with three independent signals agreeing.

Hotspots

Hotspot = feature whose bug_fix_ratio exceeds 30% with at least 3 total commits (the floor filters out newly-added features with only 1-2 commits). The landing surfaces the top 5 per repo, sorted by ratio descending. Health score (0-100, composite of bug-fix ratio, churn, and recency-weighted regression density) is shown alongside.
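
The hotspot rule translates directly into a filter and a sort. The dictionary keys used for each feature (`name`, `bug_fix_ratio`, `total_commits`) and the function name are hypothetical; the thresholds are the ones stated above.

```python
def top_hotspots(features, limit=5, min_ratio=0.30, min_commits=3):
    """Features over the bug-fix threshold with enough history, worst first."""
    hot = [
        f for f in features
        if f["total_commits"] >= min_commits and f["bug_fix_ratio"] > min_ratio
    ]
    return sorted(hot, key=lambda f: f["bug_fix_ratio"], reverse=True)[:limit]


features = [
    {"name": "auth", "bug_fix_ratio": 0.38, "total_commits": 40},
    {"name": "billing", "bug_fix_ratio": 0.52, "total_commits": 12},
    {"name": "new-widget", "bug_fix_ratio": 1.00, "total_commits": 2},  # below the commit floor
    {"name": "search", "bug_fix_ratio": 0.30, "total_commits": 25},    # not strictly above 30%
]
print([f["name"] for f in top_hotspots(features)])
```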

Honest caveats

L1 precision on some repos sits below 30%, and as low as 8.8% (plane). This is not an engine bug; it's the engineering-grain rule in action. The engine emits real per-subsystem features (e.g. scim, audit, secret-rotation for infisical); the truth file enumerates only the maintainer's top-level pitch. We publish the gap honestly rather than retro-fit the truth to match (that would be cheating).
Flow precision varies widely. Backend-heavy repos can over-emit flows from the same entry-point file; the Stage 3 dedup pass (Sprint S7-B) catches most but not all. Sprint S11 added a dynamic wall-time budget so big repos (chatwoot 330 features, directus 242) don't time out anymore — chatwoot went from 0 / 330 features with flows to 132 / 330 after the fix.
Sonnet outputs are not strictly deterministic, even at default temperature settings. A single re-run can shift L2 P/R by ±5-15pp. Trend over many scans is what matters.
Truth lists are versioned in our public repo. When we expand a list to reflect engineering-grain reality, the change and the rationale are commit-history visible.
Independent generalization eval (7 unseen repos) sits within ±3pp of the trained corpus average on both L2 P and R: the engine learned the problem class (route extractors, MVC, schema, workspace boundaries) rather than memorising the corpus. We re-run this whenever we ship a Stage that could overfit.