NeverRanked · Measurement methodology

How we measure what AI engines cite.

The seven AI tools, the question-set discipline, the source classifier, the pattern-readiness rule, and a working example from data we actually captured.

What this page is, in plain English

This is the “show your work” page. Everything below explains exactly how we run the measurement, where the numbers come from, and what they mean. It is written for the person on your team (or your compliance reviewer, or your agency’s data lead) who needs to verify what we’re doing before signing on.

The plain-English version, in five sentences:

If that is enough for you, you can stop reading here and go back to the homepage. Everything below is the proof.

The two-layer measurement model

AI answer engines fall into two structurally different groups, and both groups can fail a brand independently. We measure both, every day, for every category we run.

Layer 1: Citation-grade engines (5)

Engines that search the live web at query time and surface the sources they cite. Five of them dominate the surface area today:

For each query, on each engine, the API returns the AI-generated answer and the source URLs cited in it. We capture all three: the answer text, the cited URLs, and the timestamp of the capture. A brand is "cited" on a query if its domain appears in the URL list for that query's answer.
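
A minimal sketch of the citation check described above, in Python. The capture record shape, the field names, and the choice to count subdomains (such as `www.`) as the brand's domain are assumptions made for illustration; the text only states that a brand is cited when its domain appears in the URL list.

```python
from urllib.parse import urlparse


def is_cited(brand_domain: str, cited_urls: list[str]) -> bool:
    """True if any cited URL's host is the brand domain or a subdomain of it."""
    brand = brand_domain.lower().removeprefix("www.")
    for url in cited_urls:
        host = (urlparse(url).hostname or "").lower()
        # The leading dot stops "advisor.com" from matching "example-advisor.com".
        if host == brand or host.endswith("." + brand):
            return True
    return False


# Hypothetical capture record: answer text, cited URLs, capture timestamp.
capture = {
    "answer": "(answer text)",
    "cited_urls": [
        "https://www.example-advisor.com/about",
        "https://www.reddit.com/r/Hawaii/",
    ],
    "captured_at": "2026-05-21T08:00:00Z",
}

print(is_cited("example-advisor.com", capture["cited_urls"]))
```

Matching on the parsed hostname rather than on a substring of the URL avoids counting a page that merely mentions the domain in its path or query string.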

Layer 2: Model-knowledge engines (2)

Engines that answer from training data without searching the live web. These represent the baseline of what AI says about a brand when it cannot look anything up, which is what users of Claude.ai or any Claude-powered support tool see by default.

For each query, on each engine, the response is captured and scanned for mentions of the brand name. A brand is "mentioned in model knowledge" on a query if its name appears in the response.
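
The mention scan can be sketched the same way. Case-insensitive matching and the whole-word boundary are assumptions; the text says only that the brand name must appear in the response.

```python
import re


def is_mentioned(brand_name: str, response_text: str) -> bool:
    """True if the brand name appears as a whole phrase, ignoring case."""
    # Lookarounds keep "Acme" from matching inside "Acmeville".
    pattern = r"(?<!\w)" + re.escape(brand_name) + r"(?!\w)"
    return re.search(pattern, response_text, flags=re.IGNORECASE) is not None


print(is_mentioned("Acme Wealth", "Firms include ACME Wealth and others."))
```

A plain name match will miss abbreviations and misspellings, so a real scan may need a list of name variants per brand.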

Why both layers matter. A brand invisible in citation is invisible when AI fact-checks itself before answering. A brand invisible in model knowledge is invisible at the baseline, before any search happens. These are different failure modes. Measuring only one half of the picture misses what the other half could tell you.

Query-set discipline

Each engagement opens with a query-set design conversation. The goal: a frozen list of 15-30 queries that represent how the customer's buyers actually search in their category. Three rules govern the list:

Every measured category gets its own runner with its own 18-question set, locked by hash. The runners are in the public repo. We have published runners and pattern-ready measurements for Hawaii consumer banking, Hawaii wealth management, Honolulu dental, Hawaii law firms, Hawaii CPAs, and Austin TX CPAs (our first cross-geo measurement). Each runner’s hash is printed at the start of every run; any change to the question set produces a new hash and the runs are no longer comparable across the change.
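
As an illustration of what "locked by hash" can look like, here is a minimal Python sketch. The hash algorithm and serialization are not stated on this page, so SHA-256 over a compact JSON encoding of the ordered list is an assumption, as are the function and field names.

```python
import hashlib
import json


def query_set_hash(queries: list[str]) -> str:
    """Hash the ordered query list; any edit or reorder changes the digest."""
    canonical = json.dumps(queries, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def comparable(run_a: dict, run_b: dict) -> bool:
    """Two runs are comparable only if they used the same locked query set."""
    return run_a["query_set_hash"] == run_b["query_set_hash"]


queries = ["best wealth manager in Hawaii", "best financial advisor in Honolulu"]
print("query-set hash:", query_set_hash(queries))
```

Printing the digest at the start of each run, as the runners are described as doing, makes an accidental edit to the question set visible immediately.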

An example 18-question set

The actual locked set we used for Hawaii wealth management (one of the six published measurements). Eight head queries plus ten long-tail queries. This is what a scoping call produces: a list this specific, locked by hash, never edited after the first measurement run.

Head queries (8) · broad buyer intent

  1. best wealth manager in Hawaii
  2. best financial advisor in Honolulu
  3. top financial advisor Hawaii high net worth
  4. fee-only financial advisor Honolulu
  5. most trusted financial advisor in Hawaii
  6. best Hawaii financial advisor for retirement planning
  7. Hawaii wealth manager for business owner
  8. fiduciary financial advisor Honolulu

Long-tail queries (10) · specific buyer scenarios + comparison

  1. Hawaii financial advisor for inherited wealth
  2. Hawaii financial advisor that handles Roth conversions
  3. best Hawaii financial advisor for tax planning
  4. Hawaii wealth manager for real estate investor
  5. Hawaii financial advisor for early retirement
  6. Hawaii financial advisor first time investing
  7. Hawaii financial advisor that handles business sale proceeds
  8. Hawaii wealth manager for physician or attorney
  9. Hamada Financial vs Cadinha vs CKW
  10. Honolulu financial advisor Bishop Street

The shape repeats across categories: 8 head queries capturing the "name a few" demand surface plus 10 long-tail queries capturing specific buyer scenarios, including one named-comparison curiosity query that captures what AI says when someone is already evaluating known firms head-to-head. Per-category wording is tailored at the scoping call. Each set is locked by hash and never edited.

Noise control

AI engines produce different answers to the same query on different runs. Sometimes meaningfully different, sometimes not. To separate signal from noise, every query runs three times per engine per day. Across a three-week kickoff, that produces roughly 18 × 3 × 5 × 21 = 5,670 citation captures per engagement on the citation-grade engines alone, plus the model-knowledge captures.
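
The capture count above can be reproduced directly. The model-knowledge figure assumes those two engines run on the same three-reps-per-day schedule, which follows from "every query runs three times per engine per day" but is not given as a number in the text.

```python
QUERIES = 18                  # locked question set
REPS_PER_DAY = 3              # repetitions per query, per engine, per day
CITATION_ENGINES = 5          # Layer 1
MODEL_KNOWLEDGE_ENGINES = 2   # Layer 2
KICKOFF_DAYS = 21             # three-week kickoff


def captures(engines: int) -> int:
    """Total captures for a group of engines over the kickoff window."""
    return QUERIES * REPS_PER_DAY * engines * KICKOFF_DAYS


citation_captures = captures(CITATION_ENGINES)
model_knowledge_captures = captures(MODEL_KNOWLEDGE_ENGINES)
print(citation_captures)         # 5670
print(model_knowledge_captures)  # 2268 (assumes the same schedule)
```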

A citation that appears in only one of three reps is a single-shot. A citation that appears in two of three reps is a moderate signal. A citation that appears in all three reps is the strongest single-day signal we report. Aggregate frequencies over the three-week kickoff are what the punch list is built against.
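
The three-rep rule maps to a small lookup. This is a sketch only; the label strings are hypothetical names for the tiers described above.

```python
def rep_signal(hits: int, reps: int = 3) -> str:
    """Label one query/engine/day by how many of the reps cited the brand."""
    if not 0 <= hits <= reps:
        raise ValueError("hits must be between 0 and reps")
    if hits == 0:
        return "not_cited"
    if hits == reps:
        return "strongest"    # cited in every rep
    if hits == 1:
        return "single_shot"  # cited in only one rep
    return "moderate"         # two of three with the default reps


# One boolean per rep: was the brand cited in that rep's answer?
day_reps = [True, False, True]
print(rep_signal(sum(day_reps)))  # moderate
```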

Source-type classifier

For each cited URL captured, the source is classified into one of nine buckets. The classifier source code is public at github.com/LanceRoylo/neverranked-outreach. The buckets:

Bucket            What's in it
youtube           youtube.com, youtu.be
reddit            reddit.com (any subreddit)
wikipedia         wikipedia.org, wikimedia.org, wikidata.org
forum             Stack Exchange family, Quora, Hacker News, Discourse-hosted forums
social            LinkedIn, X, Facebook, Instagram, TikTok, Threads, Medium, Substack
review_directory  Yelp, G2, Capterra, Trustpilot, Clutch, Healthgrades, RealSelf, TripAdvisor, BBB, Gartner, Glassdoor, ProductHunt, Birdeye, plus category-specific directories
owned             The customer's own domain (passed in at engagement start)
competitor        Named competitors (passed in at engagement start)
independent_web   Everything else: publications, blogs, vendor pages. Honestly lumped because hostnames alone cannot reliably distinguish "major publication" from "random blog" without false confidence.

Why independent_web is honestly lumped. The temptation in this category is to over-claim, to say "the AI cited eight major publications" when really it cited eight independent web pages of unknown editorial weight. Hostname-only classification cannot tell you which of those eight is a real publication and which is a content marketing site. So we put them all in one bucket and label it honestly. Anyone who wants to do the slow, expensive editorial-weight work on a per-host basis is welcome to do it on their own. The raw URLs are in the customer's data store.
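
A hostname-only classifier along these lines can be sketched in a few lines of Python. This is not the published classifier: the host lists are a partial, illustrative subset of the bucket table, and the precedence order (owned, then competitor, then the fixed buckets, then `independent_web`) is an assumption.

```python
from urllib.parse import urlparse

# Partial, illustrative host lists; the real buckets cover more hosts.
HOST_BUCKETS = {
    "youtube": ("youtube.com", "youtu.be"),
    "reddit": ("reddit.com",),
    "wikipedia": ("wikipedia.org", "wikimedia.org", "wikidata.org"),
    "forum": ("stackexchange.com", "stackoverflow.com", "quora.com",
              "news.ycombinator.com"),
    "social": ("linkedin.com", "x.com", "facebook.com", "instagram.com",
               "tiktok.com", "medium.com", "substack.com"),
    "review_directory": ("yelp.com", "g2.com", "capterra.com", "trustpilot.com"),
}


def _matches(host: str, domain: str) -> bool:
    """Exact host or subdomain; the dot keeps netflix.com out of x.com."""
    return host == domain or host.endswith("." + domain)


def classify(url: str, owned=(), competitors=()) -> str:
    """Return one of the nine bucket names for a cited URL."""
    host = (urlparse(url).hostname or "").lower()
    if any(_matches(host, d) for d in owned):
        return "owned"
    if any(_matches(host, d) for d in competitors):
        return "competitor"
    for bucket, domains in HOST_BUCKETS.items():
        if any(_matches(host, d) for d in domains):
            return bucket
    return "independent_web"  # everything else, deliberately lumped


print(classify("https://old.reddit.com/r/Hawaii/"))
print(classify("https://medspa-a.example/", competitors=["medspa-a.example"]))
```

Note that the `owned` and `competitor` buckets depend entirely on the context passed in: with no cohort supplied, a competitor's domain falls through to `independent_web`.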

Cohort and competitive analysis

Each engagement names a competitive cohort at scoping (typically 3-7 competitors). For every query on every engine, we record whether each cohort member was cited, and how frequently across the three-week window. That produces the competitive gap table at the heart of every research memo (see the example engagement page for what this looks like in practice).
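
One way to build the per-member frequencies behind such a gap table is sketched below. The capture fields (`query`, `engine`, `cited_urls`) and the cohort mapping of display name to domain are hypothetical shapes chosen for the example.

```python
from collections import defaultdict
from urllib.parse import urlparse


def cohort_citation_rates(captures, cohort):
    """Share of captures citing each cohort member, per (query, engine, member)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for cap in captures:
        hosts = {(urlparse(u).hostname or "").lower() for u in cap["cited_urls"]}
        for name, domain in cohort.items():
            key = (cap["query"], cap["engine"], name)
            totals[key] += 1
            if any(h == domain or h.endswith("." + domain) for h in hosts):
                hits[key] += 1
    return {key: hits[key] / totals[key] for key in totals}


captures = [
    {"query": "q1", "engine": "e1", "cited_urls": ["https://a.example/x"]},
    {"query": "q1", "engine": "e1", "cited_urls": ["https://b.example/"]},
    {"query": "q1", "engine": "e1", "cited_urls": ["https://www.a.example/"]},
]
rates = cohort_citation_rates(captures, {"A": "a.example", "B": "b.example"})
print(rates[("q1", "e1", "A")])  # cited in 2 of 3 captures
```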

Two structural observations are worth naming up front:

Within-citation depth (added 2026-05-21)

Beyond "your brand was cited on this query," we now capture two finer-grained signals for every recurring brand in the cohort:

Tool: dryrun/forensic/within-citation.mjs. Public source code, runnable on any cohort that has raw measurement data.

Drift detection (added 2026-05-21)

Day-over-day citation share movement, flagged when a host moves more than a configurable threshold (default 5 percentage points) between measurement windows. Same query-set hash required in both windows; the tool refuses to compare across different query sets because the universe of possible citations is different.

The current monthly delta memo surfaces drift findings as observations. Two drift tools ship in the public repo: drift.mjs for per-category point-in-time host-level deltas, and drift-summary.mjs for a one-line-per-category scan across all measured categories that the monthly memo workflow uses as a first-pass pre-write. Automated daily alerting (an email to the customer when a host gains or loses more than a threshold in a single day) is the obvious next layer and ships with the dashboard build.

Tool: dryrun/forensic/drift.mjs. Hash-locked, engine-restrictable, JSON-output for piping into downstream alerting.
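
The drift rule described above (same hash required, flag moves larger than the threshold) can be sketched in Python as follows. This is not the drift.mjs implementation; the window shape and the definition of citation share as a host's percentage of all citations in the window are assumptions.

```python
def citation_shares(host_counts: dict[str, int]) -> dict[str, float]:
    """Each host's share of all citations in a window, in percent."""
    total = sum(host_counts.values())
    if total == 0:
        return {}
    return {host: 100.0 * n / total for host, n in host_counts.items()}


def drift(before: dict, after: dict, threshold_pp: float = 5.0) -> dict[str, float]:
    """Hosts whose share moved by more than threshold_pp percentage points."""
    if before["query_set_hash"] != after["query_set_hash"]:
        # Different query sets mean a different universe of possible citations.
        raise ValueError("query-set hashes differ; windows are not comparable")
    a = citation_shares(before["host_counts"])
    b = citation_shares(after["host_counts"])
    flagged = {}
    for host in sorted(set(a) | set(b)):
        delta = b.get(host, 0.0) - a.get(host, 0.0)
        if abs(delta) > threshold_pp:
            flagged[host] = round(delta, 1)
    return flagged


w1 = {"query_set_hash": "abc", "host_counts": {"a.example": 50, "b.example": 50}}
w2 = {"query_set_hash": "abc", "host_counts": {"a.example": 60, "b.example": 40}}
print(drift(w1, w2))  # {'a.example': 10.0, 'b.example': -10.0}
```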

The pattern-readiness rule

We do not claim a pattern in a category from a single run. The rule is in MOAT.md rule 5:

Every new category needs at least 3 USABLE runs before we claim a pattern. A usable run is one that produced at least one ok row. A run that returned all ok:false is a failed attempt, not a measurement, and does not count toward the pattern-readiness bar.

The catalog tool (dryrun/forensic/catalog.mjs) enforces this in the read-out, flagging failed runs and surfacing per-category usable-run counts. A "ready" label appears only when a category has three or more runs with at least one successful API capture each. Anything below that is reported as "data point, not pattern" with the warning surfaced inside the deliverable itself.
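
The readiness rule is simple enough to state as code. The sketch below assumes each run is a record with a `rows` list whose entries carry an `ok` flag, matching the "ok row" and "ok:false" wording above; the function names and return shape are hypothetical and not taken from catalog.mjs.

```python
def is_usable(run: dict) -> bool:
    """A usable run produced at least one ok row."""
    return any(row.get("ok") for row in run["rows"])


def readiness(runs: list[dict], min_usable: int = 3) -> dict:
    """Count usable and failed runs and apply the pattern-readiness bar."""
    usable = sum(1 for run in runs if is_usable(run))
    label = "ready" if usable >= min_usable else "data point, not pattern"
    return {"usable_runs": usable, "failed_runs": len(runs) - usable, "label": label}


runs = [
    {"rows": [{"ok": True}, {"ok": False}]},
    {"rows": [{"ok": False}, {"ok": False}]},  # failed attempt, does not count
    {"rows": [{"ok": True}]},
]
print(readiness(runs))
```

With the three runs shown, only two are usable, so the category stays at "data point, not pattern" until one more usable run lands.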

The aggregate layer

Beyond the per-customer measurement, every engagement's raw data feeds a cross-category aggregate. The aggregate answers questions no single engagement can answer alone:

The fourth question is what we call the cross-geo discipline. The Austin TX CPA measurement (our first non-Hawaii category) tested whether the Hawaii CPA training-data engine findings generalize. The result split the original pattern: one training-data engine’s collapse generalized cross-geo, the other did not. A vendor measuring one geography per category would have published the Hawaii result and called it the category-level pattern. The cross-geo measurement was the only way to discover that the framing needed to be more precise. See the cross-category teardown and pattern 01 in the patterns library for the full detail.

Privacy and handling stance for the aggregate: aggregate-level patterns observed across the dataset stay with us, never tied to a named customer, never reverse-engineerable to a specific engagement. The aggregator (dryrun/forensic/aggregate.mjs) has a host-surfacing gate set to min-runs >= 2: a host appears in aggregate output only when it appears in two or more distinct runs. Privacy is enforced in the code, not in the policy.
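
A minimal sketch of a host-surfacing gate of this kind, assuming the aggregate input is a mapping from run id to the hosts cited in that run (a hypothetical shape, not necessarily the one aggregate.mjs uses):

```python
from collections import defaultdict


def surfaced_hosts(runs: dict[str, list[str]], min_runs: int = 2) -> list[str]:
    """Hosts that appear in at least min_runs distinct runs."""
    seen_in = defaultdict(set)
    for run_id, hosts in runs.items():
        for host in hosts:
            seen_in[host].add(run_id)  # a set, so repeats within a run count once
    return sorted(h for h, run_ids in seen_in.items() if len(run_ids) >= min_runs)


runs = {
    "run-1": ["a.example", "b.example", "a.example"],
    "run-2": ["a.example", "c.example"],
}
print(surfaced_hosts(runs))  # ['a.example']
```

Counting distinct runs rather than raw citations is what keeps a host that was cited many times in a single engagement out of the aggregate output.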

A working example from real data

The first pattern-ready category in the aggregate was Honolulu med spas, measured in May 2026 against a 2-engine seed (OpenAI + Perplexity). The 7-engine methodology shipped shortly after; six full measurements have published since (5 Hawaii categories plus the first cross-geo, Austin TX CPAs). The med-spa example below is preserved because it documents the source-type classifier’s post-bug-fix accuracy verification (see 2026-05-23 note further down). For the current state of measured categories with 7 engines, see the published teardowns and the cross-category teardown. The methodology demonstrated here is the same one that produced those teardowns; the data shape repeats with more engines and tighter cohort coverage.

The historical med-spa findings:

Source type                                        % of citations
Competitor (cohort med-spa businesses themselves)  57%
Independent web                                    41%
Review directory                                   2%
YouTube                                            <1%
Reddit, Wikipedia, forum, social                   0%

The non-obvious read. For this category AI cites the cohort businesses' own websites and third-party content at comparable rates (57% competitor / 41% independent web, roughly 1.4:1). That is not the result a buyer expects. Conventional SEO instinct says "AI cites third-party content about you, so optimize the content not the site." For Honolulu med spas, the data says the opposite is half-true: own-site optimization matters about as much as off-site presence, because AI is reading both at comparable rates. The actionable shape of the list of fixes changes accordingly: both surfaces matter, neither dominates.

Per-engine behavior within the same category:

Engine      Competitor  Independent web  Review dir  YouTube
OpenAI      62%         37%              1%          0%
Perplexity  51%         46%              3%          1%

OpenAI weights own-site citations more heavily than Perplexity (62% vs 51%). That difference is exactly the kind of per-engine asymmetry a single-engine measurement misses.

Recurring hosts (cited across two or more measurement runs). The businesses are anonymized here because they are not NeverRanked customers and did not consent to appear in our public materials. A paying customer's deliverable names every host in their cohort in full, because the named competitive map is the product. On a public page, the pattern is what matters, not the names.

Host                           Citations  Source type
Honolulu med spa A             88         competitor
Honolulu med spa B             88         competitor
Honolulu med spa C             58         competitor
Honolulu med spa D             54         competitor
Honolulu med spa E             53         competitor
category-specific directory A  11         review_directory
category-specific directory B  3          review_directory

The pattern-ready threshold (3+ usable runs) is met by this category; the aggregator confirms. Cohort coverage improves with each engagement. More runs surface more cohort members, so the competitor share trends toward its true ceiling rather than away from it.

For category contrast, a kill-test cohort against neverranked.com (category aeo_tools, single run, Perplexity) showed a meaningfully different source mix: 85% independent web, 14% YouTube, 1% forum. The same engine, a different category, produces a fundamentally different source-type distribution. That difference is the moat in microcosm. Only an outside observer measuring across categories can name it.

Note on accuracy, updated 2026-05-23. Earlier versions of this page reported "98% independent web, 0% competitor" for this category. That was wrong by a wide margin: the aggregator was passing empty context to the source-type classifier, so the cohort's own domains were silently bucketed as independent web. The bug was fixed and the cohort registered; the corrected numbers appear above. The methodology is unchanged. This is the kind of finding the fail-closed grader catches in prose but the underlying data pipeline has to catch separately. We caught it before any customer-facing claim shipped from these numbers.

What this engagement deliberately does NOT measure

The honest scope on what "forensic" means here and what it does not. The structural axes (cross-engine, cross-competitor, source-type classification, cohort discipline, pre-registration) are evidence-grade. The depth-of-content axes below are gaps. We name them so a prospect evaluating us does not assume we cover something we don't.

Reproducibility

The classifier source code is public. Gemma (one of the seven engines) is open-weight; any auditor can re-run the same prompts against Gemma and verify our numbers independently. The raw measurement data captured for each engagement lives in the customer's data store and is exportable any time. The aggregator and catalog tools are public. If anything in a deliverable does not survive your auditor's review, we want to know.

How public artifacts cite this methodology

Some of what we measure becomes a public artifact: a per-business "AI-visibility receipt" linked from outreach to a named subject. Every public receipt links back to this page for substantiation, and every quantitative claim it carries is anchored in a hash-locked measurement run on the methodology above.

The substantiation chain in one sentence: every quantitative claim in a NeverRanked artifact is derived from a hash-locked, pre-registered measurement, captured by open-source code, against named engines on named dates with named queries, and graded by a fail-closed factual checker before it ships.

Phrasing discipline. Receipts speak observationally only. "On N of M observed queries between [dates], engine X cited business Y" is the canonical form. We do not say an engine "recommends," "prefers," "endorses," or "ranks" a business. Engines cite, full stop. We do not assert that being cited causes a business outcome, or that not being cited causes its absence. The grader rejects artifacts that drift into normative or causal language.
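
To make the phrasing rule concrete, here is a rough Python sketch of a lint that formats the canonical observation and flags the four verbs named above. The actual grader is not shown on this page, so the function names and the stem-matching regex are assumptions, and this sketch does not attempt to detect causal language.

```python
import re

# Stems of the verbs the text says receipts must not use about engines.
BANNED = re.compile(r"\b(?:recommend|prefer|endors|rank)\w*", re.IGNORECASE)


def observation(n: int, m: int, start: str, end: str, engine: str, business: str) -> str:
    """Render the canonical observational sentence."""
    return (f"On {n} of {m} observed queries between {start} and {end}, "
            f"{engine} cited {business}.")


def flagged_terms(text: str) -> list[str]:
    """Return banned verb forms found in the text; an empty list passes."""
    return BANNED.findall(text)


ok = observation(4, 18, "2026-05-01", "2026-05-21", "Engine X", "Business Y")
print(flagged_terms(ok))
print(flagged_terms("Engine X recommends Business Y."))
```

The word boundary in the pattern keeps the brand name "NeverRanked" from tripping the check on the stem "rank".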

Anonymization commitment. Non-customer businesses are anonymized on every publicly-indexed page. The named subject of a receipt is named (they have notice through the outreach we sent them); other businesses in the same cohort appear as "Competitor A", "Competitor B", and so on. A named subject who is also a paying customer can consent to additional naming in their own deliverable; otherwise the public version stays anonymized.

Measurement window staleness. AI engine behavior changes weekly. Every public artifact carries the measurement window dates so a reader can see exactly when the underlying capture happened. If a finding is stale, we will re-run on request, at no charge, against the same hash-locked methodology so the new run is directly comparable.

Takedown and opt-out. If your business is named anywhere on our property and you want it removed, the process is documented at /takedowns/. The bar is one email; the SLA is 24 hours; no reason or justification required. Opt-out from future measurement is permanent.

If you want to scope an engagement

Email Lance@hi.neverranked.com with the category you want to measure and three to five competitors you want on the cohort. The first conversation locks the query set together. Then the measurement starts.
