NeverRanked · Measurement methodology

How we measure what AI engines cite.

Every number we publish traces to a hash-locked run against named engines, on named dates, with named queries, graded by a fail-closed factual checker before it ships. This page is the full spec.

What this page is, in plain English

This is the “show your work” page. Everything below explains exactly how we run the measurement, where the numbers come from, and what they mean. It is written for the person on your team (or your compliance reviewer, or your agency’s data lead) who needs to verify what we’re doing before signing on.

The plain-English version, in five sentences:

We figure out what your customers are actually asking AI tools in your category, then lock those questions so every run compares apples to apples.
We ask 7 AI tools the same questions every day for three weeks to build the baseline: five that search the live web and cite sources (ChatGPT, Google AI, Perplexity, Microsoft Copilot, Gemini) and two that answer from training data alone (Claude, Gemma).
We record who AI mentions for each question, who it doesn’t, and where in the answer each name shows up.
We hand you a clear list of what to fix, prioritized by what your competitors are observably doing differently.
Then we keep measuring continuously, because AI answers shift constantly and a one-time audit ages out within a quarter.

It also names, in plain terms, what this engagement does NOT measure, so you are never assuming we cover something we don’t.

If that is enough for you, you can stop reading here and go back to the homepage. Everything below is the proof.

The two-layer measurement model

AI answer engines fall into two structurally different groups, and both groups can fail a brand independently. We measure both, every day, for every category we run.

Layer 1: Citation-grade engines (5)

Engines that search the live web at query time and surface the sources they cite. Five of them dominate the surface area today:

Perplexity, accessed via its API.
ChatGPT search, accessed via OpenAI's search-grade API.
Gemini grounded, accessed via the Gemini API with web grounding enabled.
Microsoft Copilot, accessed via Bing's organic results through a search-data provider.
Google AI Overviews, accessed via a search-data provider's AI Overview feed.

For each query, on each engine, the API returns the AI-generated answer and the source URLs cited in it. We capture three things per call: the answer text, the cited URLs, and the timestamp of the capture. A brand is "cited" on a query if its domain appears in the URL list for that query's answer.
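The citation rule above can be sketched in a few lines. This is a minimal illustration, not the production matcher: the helper names (`hostnameOf`, `isCited`), the shape of the `capture` object, and the choice to count subdomains and ignore a leading `www.` are all assumptions made for the example.

```javascript
// Hypothetical helper: reduce a URL or bare domain to a lowercase hostname without "www.".
function hostnameOf(urlOrDomain) {
  const withScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(urlOrDomain)
    ? urlOrDomain
    : `https://${urlOrDomain}`;
  return new URL(withScheme).hostname.toLowerCase().replace(/^www\./, "");
}

// Assumption: a brand counts as cited when any cited URL sits on its domain or a subdomain.
function isCited(brandDomain, citedUrls) {
  const brand = hostnameOf(brandDomain);
  return citedUrls.some((url) => {
    let host;
    try {
      host = hostnameOf(url);
    } catch {
      return false; // unparseable URLs never count as a citation
    }
    return host === brand || host.endsWith(`.${brand}`);
  });
}

// Illustrative capture record: answer text, cited URLs, capture timestamp.
const capture = {
  answerText: "(answer text as returned by the engine)",
  citedUrls: ["https://www.example-dental.com/services", "https://blog.example.org/post"],
  capturedAt: "2026-05-24T08:00:00Z",
};

console.log(isCited("example-dental.com", capture.citedUrls)); // true
console.log(isCited("dental.com", capture.citedUrls)); // false: suffix match is on whole labels
```

Matching on whole hostname labels rather than raw substrings is what keeps `dental.com` from claiming a citation that belongs to `example-dental.com`.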

Layer 2: Model-knowledge engines (2)

Engines that answer from training data without searching the live web. These represent the baseline of what AI says about a brand when it cannot look anything up, which is what users of Claude.ai or any Claude-powered support tool see by default.

Claude, accessed via the Anthropic API. No native web tool is invoked, so the response reflects model knowledge plus RLHF.
Gemma, accessed via a model-hosting provider. Open-weight, which means the Gemma model itself is independently inspectable.

For each query, on each engine, the response is captured and scanned for mentions of the brand name. A brand is "mentioned in model knowledge" on a query if its name appears in the response.
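A name scan of this kind might look like the sketch below. The function names are hypothetical, and two choices are assumptions rather than documented behavior: matching is case-insensitive, and a name only counts when it is not embedded inside a longer word.

```javascript
// Escape regex metacharacters so the brand name is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Returns the character offset of every mention of brandName in responseText.
// Assumption: a mention must not be preceded or followed by a letter or digit.
function findMentions(responseText, brandName) {
  const pattern = new RegExp(
    `(?<![\\p{L}\\p{N}])${escapeRegExp(brandName)}(?![\\p{L}\\p{N}])`,
    "giu"
  );
  return [...responseText.matchAll(pattern)].map((match) => match.index);
}

// "Mentioned in model knowledge" then reduces to: at least one offset found.
function isMentioned(responseText, brandName) {
  return findMentions(responseText, brandName).length > 0;
}

const response = "Ask about Example Dental or example dental.";
console.log(findMentions(response, "Example Dental")); // [ 10, 28 ]
console.log(isMentioned("Example Dentalcare is nearby.", "Example Dental")); // false
```

Returning offsets rather than a bare boolean keeps the same scan usable for position-in-answer analysis later.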

Why both layers matter. A brand invisible in citation is invisible when AI fact-checks itself before answering. A brand invisible in model knowledge is invisible at the baseline, before any search happens. These are different failure modes. Measuring only one half of the picture misses what the other half could tell you.

Query-set discipline

Each engagement opens with a query-set design conversation. The goal: a frozen 18-question set that represents how the customer's buyers actually search in their category. Three rules govern the list:

Hash-locked. The query list is hashed on first run. Every subsequent run prints the same hash. If the hash changes, the runs are no longer comparable and the discipline has been broken. Hash printing is in dryrun/run-dental-honolulu.mjs and similar runners.
Intent-shaped. Queries cover head-intent ("best X in Y"), neighborhood scoping ("X in [specific area]"), service-combined ("X that also does Y"), trust-signal ("X for first-time customers"), value-conscious ("affordable X that takes [insurance]"). The shape mirrors how a real buyer searches, not how an SEO keyword tool ranks.
No naming the customer. Queries do not contain the customer's brand name. Branded queries are a different measurement (does the AI know who you are when asked directly), which we also report but separate from demand queries (does the AI surface you when a buyer searches for the category).

Every measured category gets its own runner with its own 18-question set, locked by hash. We have published pattern-ready measurements for categories including Hawaii consumer banking, Hawaii wealth management, Honolulu dental, Hawaii law firms, Hawaii CPAs, and Austin TX CPAs (our first cross-geo measurement). The full ledger of published measurements, each with its dated run and locked question set, is at /claims/. Each runner’s hash is printed at the start of every run. Any change to the question set produces a new hash, and runs on either side of the change are no longer comparable.

An example 18-question set

The actual locked set we used for Hawaii wealth management (one of the categories we have published). Eight head queries plus ten long-tail queries. This is what a scoping call produces: a list this specific, locked by hash, never edited after the first measurement run.

Head queries (8) · broad buyer intent

best wealth manager in Hawaii
best financial advisor in Honolulu
top financial advisor Hawaii high net worth
fee-only financial advisor Honolulu
most trusted financial advisor in Hawaii
best Hawaii financial advisor for retirement planning
Hawaii wealth manager for business owner
fiduciary financial advisor Honolulu

Long-tail queries (10) · specific buyer scenarios + comparison

Hawaii financial advisor for inherited wealth
Hawaii financial advisor that handles Roth conversions
best Hawaii financial advisor for tax planning
Hawaii wealth manager for real estate investor
Hawaii financial advisor for early retirement
Hawaii financial advisor first time investing
Hawaii financial advisor that handles business sale proceeds
Hawaii wealth manager for physician or attorney
Firm A vs Firm B vs Firm C
Honolulu financial advisor Bishop Street

The shape repeats across categories: 8 head queries capturing the "name a few" demand surface plus 10 long-tail queries capturing specific buyer scenarios (and one named-comparison curiosity query that captures what AI says when someone is already evaluating known firms head-to-head). Per-category wording is tailored at the scoping call. Each set is locked by hash and never edited.

Hash-locking is not a claim, it is a printed artifact. Here is the actual line this runner emitted, identical across its 2026-05-24 and 2026-05-25 runs. The same hash on both runs is what makes the two measurement windows directly comparable. One note for anyone reproducing it: the named-comparison query (number 17) is shown anonymized in the list above, so this hash is taken over the locked real set, not the displayed text.

$ node dryrun/run-wealth-mgmt-hawaii.mjs
queries: 18 × 3 reps × 7 engines = 378 calls
query_set_hash: ae13579a6420bea4bb6a6157d0ec4152182490bf29159cbc802a24ade3903f52

# re-run the next day, byte-identical hash, so the windows compare cleanly
query_set_hash: ae13579a6420bea4bb6a6157d0ec4152182490bf29159cbc802a24ade3903f52
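For readers who want to see the mechanism, here is a minimal sketch of how a query-set hash of this kind could be computed in Node. Two things are assumptions: the 64-character hex value above is consistent with SHA-256 but the page does not name the algorithm, and the canonical form used here (trimmed queries joined by newlines, in list order) is invented for the example. The function name is hypothetical, and this sketch will not reproduce the published hash.

```javascript
import { createHash } from "node:crypto";

// Assumption: hash the ordered list, one trimmed query per line, with SHA-256.
function querySetHash(queries) {
  const canonical = queries.map((query) => query.trim()).join("\n");
  return createHash("sha256").update(canonical, "utf8").digest("hex");
}

// A two-query stand-in for a locked 18-question set.
const lockedSet = [
  "best wealth manager in Hawaii",
  "best financial advisor in Honolulu",
];

const firstRun = querySetHash(lockedSet);
const nextDay = querySetHash([...lockedSet]);
const edited = querySetHash([...lockedSet, "fee-only financial advisor Honolulu"]);

console.log(`query_set_hash: ${firstRun}`);
console.log(firstRun === nextDay); // true: same set, same hash, windows compare
console.log(firstRun === edited); // false: any edit produces a new hash
```

Because the digest covers every character in order, reordering, rewording, or appending a query all surface as a changed hash at the top of the run.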

Noise control

AI engines produce different answers to the same query on different runs, so we treat repetition as the signal test. In a measurement run, where a query is sampled three times per engine, a citation in one of three reps is a single-shot, two of three is a moderate signal, and all three is the strongest signal we report. The standing daily measurement applies the same logic across days. Aggregate frequencies over the three-week kickoff are what the punch list is built against.
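The three-rep reading above is a small lookup. A minimal sketch, with a hypothetical function name; the "not cited" label for zero of three is an assumption, since the text only names the one, two, and three-hit cases.

```javascript
// Labels indexed by number of reps (out of three) in which the brand was cited.
const SIGNAL_LABELS = ["not cited", "single-shot", "moderate signal", "strongest signal"];

// repResults: three booleans, one per repetition of the same query on one engine.
function signalLabel(repResults) {
  if (repResults.length !== 3) {
    throw new RangeError("expected exactly three reps");
  }
  const hits = repResults.filter(Boolean).length;
  return SIGNAL_LABELS[hits];
}

console.log(signalLabel([true, false, false])); // single-shot
console.log(signalLabel([true, true, false])); // moderate signal
console.log(signalLabel([true, true, true])); // strongest signal
```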

The mechanism: the standing measurement runs every query once per engine per day, seven days a week, and that daily cadence continues past the three-week kickoff. Across a three-week kickoff, that is roughly 18 × 1 × 5 × 21 = 1,890 citation captures per engagement on the citation-grade engines, plus the model-knowledge captures.
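The call counts quoted on this page follow from one multiplication. The sketch below reproduces them; the function name is hypothetical, and the model-knowledge figure is derived here from the same once-per-engine-per-day cadence rather than quoted from the text.

```javascript
// captures = queries x reps per run x engines x days
function captureCount({ queries, reps, engines, days }) {
  return queries * reps * engines * days;
}

// One measurement run: 18 queries, 3 reps, all 7 engines.
const measurementRun = captureCount({ queries: 18, reps: 3, engines: 7, days: 1 });

// Three-week kickoff on the 5 citation-grade engines, one rep per day.
const kickoffCitation = captureCount({ queries: 18, reps: 1, engines: 5, days: 21 });

// Derived under the same cadence for the 2 model-knowledge engines.
const kickoffModelKnowledge = captureCount({ queries: 18, reps: 1, engines: 2, days: 21 });

console.log(measurementRun); // 378
console.log(kickoffCitation); // 1890
console.log(kickoffModelKnowledge); // 756
```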

Source-type classifier

For each cited URL captured, the source is classified into one of nine buckets. The classifier logic is documented here as part of the method. The buckets:

Bucket	What's in it
`youtube`	youtube.com, youtu.be
`reddit`	reddit.com (any subreddit)
`wikipedia`	wikipedia.org, wikimedia.org, wikidata.org
`forum`	Stack Exchange family, Quora, Hacker News, Discourse-hosted forums
`social`	LinkedIn, X, Facebook, Instagram, TikTok, Threads, Medium, Substack
`review_directory`	Yelp, G2, Capterra, Trustpilot, Clutch, Healthgrades, RealSelf, TripAdvisor, BBB, Gartner, Glassdoor, ProductHunt, Birdeye, plus category-specific directories
`owned`	The customer's own domain (passed in at engagement start)
`competitor`	Named competitors (passed in at engagement start)
`independent_web`	Everything else: publications, blogs, vendor pages. Honestly lumped because hostnames alone cannot reliably distinguish "major publication" from "random blog" without false confidence.

Why independent_web is honestly lumped. The temptation in this category is to over-claim, to say "the AI cited eight major publications" when really it cited eight independent web pages of unknown editorial weight. Hostname-only classification cannot tell you which of those eight is a real publication and which is a content marketing site. So we put them all in one bucket and label it honestly. Anyone who wants to do the slow, expensive editorial-weight work on a per-host basis is welcome to do it on their own. The raw URLs are in the customer's data store.
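A hostname-only classifier of the kind described can be sketched as follows. This is an illustration under stated assumptions, not the production classifier: the function names are hypothetical, the domain lists are a partial sample of the table (Discourse-hosted forums and category-specific directories cannot be listed by fixed hostname and are omitted), and the precedence order (owned, then competitor, then fixed buckets) is a guess.

```javascript
// Partial, illustrative domain lists for the fixed buckets.
const FIXED_BUCKETS = {
  youtube: ["youtube.com", "youtu.be"],
  reddit: ["reddit.com"],
  wikipedia: ["wikipedia.org", "wikimedia.org", "wikidata.org"],
  forum: ["stackexchange.com", "stackoverflow.com", "quora.com", "news.ycombinator.com"],
  social: ["linkedin.com", "x.com", "facebook.com", "instagram.com", "tiktok.com", "medium.com", "substack.com"],
  review_directory: ["yelp.com", "g2.com", "capterra.com", "trustpilot.com", "healthgrades.com", "tripadvisor.com", "bbb.org"],
};

function matchesDomain(host, domain) {
  return host === domain || host.endsWith(`.${domain}`);
}

// context carries the per-engagement domains passed in at engagement start.
function classifySource(url, { ownedDomains = [], competitorDomains = [] } = {}) {
  const host = new URL(url).hostname.toLowerCase().replace(/^www\./, "");
  if (ownedDomains.some((domain) => matchesDomain(host, domain))) return "owned";
  if (competitorDomains.some((domain) => matchesDomain(host, domain))) return "competitor";
  for (const [bucket, domains] of Object.entries(FIXED_BUCKETS)) {
    if (domains.some((domain) => matchesDomain(host, domain))) return bucket;
  }
  return "independent_web"; // everything else is lumped on purpose
}

const cohort = { competitorDomains: ["medspa-a.example"] };
console.log(classifySource("https://www.youtube.com/watch?v=1")); // youtube
console.log(classifySource("https://medspa-a.example/pricing", cohort)); // competitor
console.log(classifySource("https://medspa-a.example/pricing")); // independent_web
```

The last two calls show why the context argument matters: with an empty context, a cohort member's own site falls through to `independent_web`, which is the failure mode described in the 2026-05-23 accuracy note on this page.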

Cohort and competitive analysis

Each engagement names a competitive cohort at scoping (typically 3-7 competitors). For every query on every engine, we record whether each cohort member was cited, and how frequently across the three-week window. That produces the competitive gap table at the heart of every research memo (see the example engagement page for what this looks like in practice).
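The per-member tally behind a gap table can be sketched like this. The capture shape (`query`, `engine`, `citedHosts`) and the function name are assumptions for illustration, and the `.example` hostnames are placeholders, not measured data.

```javascript
// For each cohort member, count the captures in which it was cited and
// express that as a share of all captures in the window.
function cohortCitationShare(captures, cohortDomains) {
  const hits = new Map(cohortDomains.map((domain) => [domain, 0]));
  for (const capture of captures) {
    const hosts = new Set(capture.citedHosts); // count each host once per capture
    for (const domain of cohortDomains) {
      if (hosts.has(domain)) hits.set(domain, hits.get(domain) + 1);
    }
  }
  return cohortDomains
    .map((domain) => ({
      domain,
      cited: hits.get(domain),
      share: captures.length ? hits.get(domain) / captures.length : 0,
    }))
    .sort((a, b) => b.share - a.share);
}

const sampleCaptures = [
  { query: "q1", engine: "perplexity", citedHosts: ["firm-a.example", "news.example"] },
  { query: "q1", engine: "chatgpt", citedHosts: ["firm-a.example"] },
  { query: "q2", engine: "perplexity", citedHosts: ["firm-b.example", "firm-a.example"] },
  { query: "q2", engine: "chatgpt", citedHosts: [] },
];

const gapTable = cohortCitationShare(sampleCaptures, [
  "firm-a.example",
  "firm-b.example",
  "firm-c.example",
]);
console.log(gapTable);
// firm-a.example: cited 3, share 0.75; firm-b.example: cited 1, share 0.25; firm-c.example: cited 0, share 0
```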

Two structural observations are worth naming up front:

Market-leader effects are real but slow. The two competitors who appear on 40%+ of queries in a category are almost never displaceable in one engagement. The honest framing is to close the gap to the mid-tier first.
Long-tail wins compound. The competitors who appear on 3-9 specific queries each are where focused punch-list work produces fast, measurable movement. The first noticeable citation share gains in any engagement are usually here.

Added 2026-05-21

Within-citation depth

Beyond "your brand was cited on this query," we now capture two finer-grained signals for every recurring brand in the cohort:

Position in the answer. Where in the AI's response does the brand name appear? Bucketed by text-quartile: Q1 (lead) means the AI is opening with you. Q4 (tail) means you're a footnote. The signal is real: in the Honolulu med-spa cohort (the same cohort shown in the worked example below), one med spa appears in Q1 55% of the time (the AI leads with them on most queries) while another appears in Q3 59% of the time (almost never lead, almost never tail). Same total mention count, very different competitive signal.
Sentiment context. Strong-positive ("top", "best", "trusted", "premier") and strong-negative ("avoid", "outdated", "limited") vocabulary detected within ±80 characters of each mention. Honest scope: this is heuristic pattern detection, not nuanced NLP. Strong positive and strong negative are real signals. The default "neutral" bucket may include mild positive/negative the heuristic misses. The honest framing is in the deliverable.

Tool: dryrun/forensic/within-citation.mjs. Runnable on any cohort that has raw measurement data.
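Both signals can be illustrated with a short sketch. It is not the code in the tool named above: the function names are hypothetical, the window is measured from the start and end of the mention, and treating a window that contains both positive and negative vocabulary as neutral is an assumption. The word lists are the examples quoted in the text.

```javascript
const STRONG_POSITIVE = ["top", "best", "trusted", "premier"];
const STRONG_NEGATIVE = ["avoid", "outdated", "limited"];

// Q1 = first quarter of the answer text (lead), Q4 = last quarter (tail).
function positionQuartile(answerText, mentionIndex) {
  const fraction = mentionIndex / answerText.length;
  return `Q${Math.min(4, Math.floor(fraction * 4) + 1)}`;
}

// Look for strong vocabulary within windowChars of the mention on either side.
function sentimentContext(answerText, mentionIndex, mentionLength, windowChars = 80) {
  const start = Math.max(0, mentionIndex - windowChars);
  const end = Math.min(answerText.length, mentionIndex + mentionLength + windowChars);
  const windowText = answerText.slice(start, end).toLowerCase();
  const hasWord = (word) => new RegExp(`\\b${word}\\b`).test(windowText);
  const positive = STRONG_POSITIVE.some(hasWord);
  const negative = STRONG_NEGATIVE.some(hasWord);
  if (positive && !negative) return "strong_positive";
  if (negative && !positive) return "strong_negative";
  return "neutral"; // includes mixed windows and anything the word lists miss
}

const brand = "Example Spa";
const leadAnswer = "Example Spa is a trusted choice for first-time clients in Honolulu.";
const tailAnswer = "There are several options in Honolulu, and some reviews say to avoid Example Spa.";

for (const text of [leadAnswer, tailAnswer]) {
  const at = text.indexOf(brand);
  console.log(positionQuartile(text, at), sentimentContext(text, at, brand.length));
}
// Q1 strong_positive
// Q4 strong_negative
```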

Added 2026-05-21

Drift detection

Drift detection tracks day-over-day citation share movement and flags a host when it moves more than a configurable threshold (default 5 percentage points) between measurement windows. Both windows must carry the same query-set hash. The tool refuses to compare across different query sets because the universe of possible citations is different.

Capture is daily. The surfaced artifact is the monthly delta memo, which reports drift findings as observation. Two drift tools run as part of the method: drift.mjs for per-category point-in-time host-level deltas, and drift-summary.mjs for a one-line-per-category scan across all measured categories that the monthly memo workflow uses as a first-pass pre-write. Automated daily alerting (an email to the customer when a host gains or loses more than a threshold in a single day) is the obvious next layer and ships with the dashboard build.

Tool: dryrun/forensic/drift.mjs. Hash-locked, engine-restrictable, JSON-output for piping into downstream alerting.
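The comparison rule can be sketched as below. This is an illustration, not the contents of the tool named above: the window shape (`querySetHash`, `shareByHost` in percentage points), the function name, and the treatment of a host missing from one window as 0 are assumptions.

```javascript
// Flags hosts whose citation share moved by more than thresholdPoints
// (percentage points) between two windows with the same query-set hash.
function detectDrift(previous, current, thresholdPoints = 5) {
  if (previous.querySetHash !== current.querySetHash) {
    throw new Error("query_set_hash differs: windows are not comparable");
  }
  const hosts = new Set([
    ...Object.keys(previous.shareByHost),
    ...Object.keys(current.shareByHost),
  ]);
  const flagged = [];
  for (const host of hosts) {
    const before = previous.shareByHost[host] ?? 0; // absent host treated as 0
    const after = current.shareByHost[host] ?? 0;
    const delta = after - before;
    if (Math.abs(delta) > thresholdPoints) flagged.push({ host, before, after, delta });
  }
  return flagged.sort((a, b) => Math.abs(b.delta) - Math.abs(a.delta));
}

const windowA = { querySetHash: "abc", shareByHost: { "a.example": 20, "b.example": 10 } };
const windowB = { querySetHash: "abc", shareByHost: { "a.example": 27, "b.example": 12, "c.example": 3 } };

console.log(JSON.stringify(detectDrift(windowA, windowB)));
// [{"host":"a.example","before":20,"after":27,"delta":7}]
```

Throwing on a hash mismatch, rather than returning an empty result, keeps a non-comparable pair of windows from being read as "no drift".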

The pattern-readiness rule

We do not claim a pattern in a category from a single run. The internal pattern-readiness rule is:

Every new category needs at least 3 USABLE runs before we claim a pattern. A usable run is one that produced at least one ok row. A run that returned all ok:false is a failed attempt, not a measurement, and does not count toward the pattern-readiness bar.

The catalog tool (dryrun/forensic/catalog.mjs) enforces this in the read-out, flagging failed runs and surfacing per-category usable-run counts. A "ready" label appears only when a category has three or more runs with at least one successful API capture each. Anything below that is reported as "data point, not pattern" with the warning surfaced inside the deliverable itself.
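The rule reduces to a count and a threshold. A minimal sketch, assuming each run is a list of rows carrying an `ok` flag as described above; the function names and the returned object shape are hypothetical, not the catalog tool's output format.

```javascript
// A run is usable when at least one row came back ok.
function isUsableRun(run) {
  return run.rows.some((row) => row.ok === true);
}

// "ready" needs at least minUsable usable runs; failed runs do not count.
function patternReadiness(runs, minUsable = 3) {
  const usableRuns = runs.filter(isUsableRun).length;
  return {
    usableRuns,
    failedRuns: runs.length - usableRuns,
    label: usableRuns >= minUsable ? "ready" : "data point, not pattern",
  };
}

const runs = [
  { rows: [{ ok: true }, { ok: false }] },
  { rows: [{ ok: false }, { ok: false }] }, // all ok:false, a failed attempt
  { rows: [{ ok: true }] },
];

console.log(patternReadiness(runs).label); // data point, not pattern
console.log(patternReadiness([...runs, { rows: [{ ok: true }] }]).label); // ready
```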

The aggregate layer

The payoff in one example: measuring the same category in two geographies (Hawaii CPAs and Austin CPAs) split a pattern a single-geo vendor would have published as settled. More on that below. First, what the aggregate is.

Beyond the per-customer measurement, every engagement's raw data feeds a cross-category aggregate. The aggregate answers questions no single engagement can answer alone:

What source types do AI engines cite for category X across the universe of customers we have measured in X?
How does engine behavior differ between category X and category Y?
What recurring hosts (cited across two or more engagements) are the structural authorities in category X right now?
When the same category is measured in two different geographies, do the same patterns hold? Or does the category-level pattern actually split by engine and by geography in ways that single-geo measurement cannot surface?

The fourth question is what we call the cross-geo discipline. The Austin TX CPA measurement (our first non-Hawaii category) tested whether the Hawaii CPA training-data engine findings generalize. The result split the original pattern: one training-data engine’s collapse generalized cross-geo, the other did not. A vendor measuring one geography per category would have published the Hawaii result and called it the category-level pattern. The cross-geo measurement was the only way to discover that the framing needed to be more precise. See the cross-category teardown for the full detail.

Privacy and handling stance for the aggregate: aggregate-level patterns observed across the dataset stay with us, never tied to a named customer, never reverse-engineerable to a specific engagement. The aggregator (dryrun/forensic/aggregate.mjs) has a host-surfacing gate set to min-runs >= 2: a host appears in aggregate output only when it appears in two or more distinct runs. Privacy is enforced in the code, not in the policy.
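The min-runs gate can be pictured with a short sketch. The function name and the run shape (`citedHosts`) are assumptions for illustration; the only detail taken from the text is that a host must appear in two or more distinct runs to surface.

```javascript
// Surface a host only when it appears in at least minRuns distinct runs.
function recurringHosts(runs, minRuns = 2) {
  const runsPerHost = new Map();
  for (const run of runs) {
    // Deduplicate within a run so repeated citations in one run count once.
    for (const host of new Set(run.citedHosts)) {
      runsPerHost.set(host, (runsPerHost.get(host) ?? 0) + 1);
    }
  }
  return [...runsPerHost]
    .filter(([, runCount]) => runCount >= minRuns)
    .map(([host, runCount]) => ({ host, runCount }));
}

const aggregateRuns = [
  { citedHosts: ["a.example", "a.example", "solo.example", "solo.example"] },
  { citedHosts: ["a.example", "b.example"] },
  { citedHosts: ["b.example"] },
];

console.log(JSON.stringify(recurringHosts(aggregateRuns)));
// [{"host":"a.example","runCount":2},{"host":"b.example","runCount":2}]
```

Note that `solo.example` is cited twice but only inside one run, so it stays out of the output.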

A working example from real data

The first pattern-ready category in the aggregate was Honolulu med spas, measured in May 2026 against a 2-engine seed (OpenAI + Perplexity). The 7-engine methodology shipped shortly after, and full measurements have since been published for the categories we have measured (Hawaii categories plus the first cross-geo, Austin TX CPAs). The full ledger is at /claims/. The med-spa example below is preserved because it documents the source-type classifier’s post-bug-fix accuracy verification (see the 2026-05-23 note further down). For the current state of measured categories with 7 engines, see the published teardowns and the cross-category teardown. The methodology demonstrated here is the same one that produced those teardowns. The data shape repeats with more engines and tighter cohort coverage.

The current 7-engine read on the same category. From the published med-spa teardown, read across the cross-category table (recomputed 2026-06-11). The five web-searching engines pool to just over half their citations on the cohort’s own sites. Microsoft Copilot (Bing), shown separately below, is the outlier inside that group at essentially zero, so the pooled figure is carried by the other four. The two model-knowledge engines split hard.

Current 7-engine read · Honolulu med spas · recomputed 2026-06-11

Surface	Own-site (cohort) share
Five web-searching engines (pooled)	51%
Claude (model knowledge)	2%
Gemma (model knowledge)	32%
Microsoft Copilot (Bing) alone	0%

The 2-engine seed below (57% own-site) lands six points above the current pooled web-searching number (51%). That closeness is why the seed is preserved as the worked example: the data shape held as engines were added, and it carries the source-type classifier’s post-bug-fix accuracy verification (see the 2026-05-23 note below).

Source-type mix · Honolulu med spas · May 2026 (2-engine seed)

Source type	% of citations
Competitor (cohort med-spa businesses themselves)	57%
Independent web	41%
Review directory	2%
YouTube	<1%
Reddit, Wikipedia, forum, social	0%

The non-obvious read. For this category AI cites the businesses' own websites and third-party content at comparable rates (57% competitor / 41% independent web, about 1.4:1). That is not the result a buyer expects. Conventional SEO instinct says "AI cites third-party content about you, so optimize the content not the site." For Honolulu med spas, the data says that instinct is only half right: own-site optimization matters at least as much as off-site presence, because AI is reading both at comparable rates. The actionable shape of the list of fixes changes accordingly: both surfaces matter, neither dominates.

Per-engine behavior within the same category:

Per-engine source mix · 2-engine seed

Engine	Competitor	Independent web	Review dir	YouTube
OpenAI	62%	37%	1%	0%
Perplexity	51%	46%	3%	1%

OpenAI weights own-site citations more heavily than Perplexity (62% vs 51%). That difference is exactly the kind of per-engine asymmetry a single-engine measurement misses.

Recurring hosts (cited across two or more measurement runs). The businesses are anonymized here because they are not NeverRanked customers and did not consent to appear in our public materials. A paying customer's deliverable names every host in their cohort in full, because the named competitive map is the product. On a public page, the pattern is what matters, not the names.

Recurring hosts · cited across 2+ runs

Host	Citations	Source type
Honolulu med spa A	88	competitor
Honolulu med spa B	88	competitor
Honolulu med spa C	58	competitor
Honolulu med spa D	54	competitor
Honolulu med spa E	53	competitor
category-specific directory A	11	review_directory
category-specific directory B	3	review_directory

The pattern-ready threshold (3+ usable runs) is met by this category, and the aggregator confirms. Cohort coverage improves with each engagement. More runs surface more cohort members, so the competitor share trends toward its true ceiling rather than away from it.

For category contrast, a kill-test cohort against neverranked.com (category aeo_tools, single run, Perplexity) showed a meaningfully different source mix: 85% independent web, 14% YouTube, 1% forum. The same engine, a different category, produces a fundamentally different source-type distribution. That difference is the moat in microcosm. Only an outside observer measuring across categories can name it.

Note on accuracy, updated 2026-05-23. Earlier versions of this page reported "98% independent web, 0% competitor" for this category. That was wrong by a wide margin: the aggregator was passing empty context to the source-type classifier, so the cohort's own domains were silently bucketed as independent web. The bug was fixed and the cohort registered, with corrected numbers above. The methodology is unchanged. The fail-closed grader catches this kind of error in prose, but an error inside the underlying data pipeline has to be caught separately. We caught it before any customer-facing claim shipped from these numbers.


What this engagement deliberately does NOT measure

The honest scope on what "forensic" means here and what it does not. The structural axes (cross-engine, cross-competitor, source-type classification, cohort discipline, pre-registration) are evidence-grade. The depth-of-content axes below are gaps. We name them so a prospect evaluating us does not assume we cover something we don't.

Voice AI surfaces. Siri, Alexa, Google Assistant. Out of scope until they expose stable APIs.
AI ad placements. Out of scope until those formats are reliably measurable.
Causation. We measure what AI cites. We do not pre-register tests against customer accounts to prove causation of changes. Monthly delta memos report what moved. Whether the punch list caused the movement is inference, not proof. Any vendor claiming proven causation in this category should be asked for the pre-registration file.
Automated daily drift alerts. Daily measurement ships now, and drift detection surfaces in the monthly delta memo as observation (see "Drift detection" section above). Automated daily alerts when a host's citation share moves more than a threshold are the obvious next layer and are not yet in the deliverable.
Cross-language coverage. English only. AI engines do answer in other languages but our query sets and classifier are English-shaped today.
Side-by-side comparison with a dashboard-style tool's own report. Standalone empirical teardowns are published at /teardowns/ demonstrating the 7-AI-tool methodology across the categories we have measured (including Hawaii consumer banking, wealth management, dental, law firms, CPAs, plus the first cross-geo measurement on Austin TX CPAs). The cross-category teardown reads them against each other. The head-to-head comparison version, where the same subject is measured by both NeverRanked and a dashboard-style tool, has not yet been pulled and added as a side-by-side. When a dashboard-tool report on the same subject is available, the teardown extends to include it.
Anything we don't have a confirmed engine API key for. Confirmed-alive coverage is reported per engine in every deliverable's methodology section.

Reproducibility

The method is documented here in full, the question sets are locked by hash, and every published claim ties back to a dated run on the /claims/ ledger. Gemma (one of the seven engines) is open-weight, so the Gemma model itself is independently inspectable. The raw measurement data captured for each engagement lives in the customer's data store and is exportable any time. If anything in a deliverable does not survive your auditor's review, we want to know.

To be precise about what an outsider can check without engaging us: the question-set hashes and dated runs on /claims/ are public, the Gemma engine is open-weight and independently runnable, and the teardowns are frozen dated snapshots anyone can scrutinize. What an outsider cannot reproduce is a private cohort run, because that data lives in the customer's store. We name that boundary so reproducibility is not oversold.

How public artifacts cite this methodology

Some of what we measure becomes a public artifact: a per-business "AI-visibility receipt" linked from outreach to a named subject. Every public receipt links back to this page for substantiation, and every quantitative claim it carries is anchored in a hash-locked measurement run on the methodology above.

The substantiation chain in one sentence: every quantitative claim in a NeverRanked artifact is derived from a hash-locked, pre-registered measurement against named engines on named dates with named queries, and graded by a fail-closed factual checker before it ships.

Phrasing discipline. Receipts speak observationally only. "On N of M observed queries between [dates], engine X cited business Y" is the canonical form. We do not say an engine "recommends," "prefers," "endorses," or "ranks" a business. Engines cite, full stop. We do not assert that being cited causes a business outcome, or that not being cited causes its absence. The grader rejects artifacts that drift into normative or causal language.
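A phrasing check of this kind can be approximated with two patterns. This is a hypothetical lint sketch, not the grader itself: the function names are invented, the normative stems come from the verbs listed above, and the causal patterns ("cause", "leads to", "results in") are an assumed starting list.

```javascript
// Stems for the verbs receipts must not use about an engine.
const NORMATIVE = /\b(recommend|prefer|endors|rank)\w*/i;
// Assumed causal markers; a real list would need to be broader.
const CAUSAL = /\b(caus\w*|leads? to|results? in)\b/i;

// Returns the kinds of violation found in one sentence.
function lintReceiptSentence(sentence) {
  const violations = [];
  if (NORMATIVE.test(sentence)) violations.push("normative");
  if (CAUSAL.test(sentence)) violations.push("causal");
  return violations;
}

// Fail-closed: a sentence passes only when nothing was flagged.
function passesPhrasingLint(sentence) {
  return lintReceiptSentence(sentence).length === 0;
}

const canonical =
  "On 7 of 18 observed queries between 2026-05-24 and 2026-05-25, engine X cited business Y.";

console.log(passesPhrasingLint(canonical)); // true
console.log(lintReceiptSentence("Engine X recommends business Y.")); // [ 'normative' ]
console.log(lintReceiptSentence("Being cited caused more bookings.")); // [ 'causal' ]
```

A keyword lint like this over-flags by design: it is cheaper to rewrite a flagged sentence than to let a normative or causal claim through.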

Anonymization commitment. Non-customer businesses are anonymized on every publicly-indexed page. The named subject of a receipt is named (they have notice through the outreach we sent them). Other businesses in the same cohort appear as "Competitor A", "Competitor B", and so on. A named subject who is also a paying customer can consent to additional naming in their own deliverable. Otherwise the public version stays anonymized.

Measurement window staleness. AI engine behavior changes weekly. Every public artifact carries the measurement window dates so a reader can see exactly when the underlying capture happened. If a finding is stale, we will re-run on request, at no charge, against the same hash-locked methodology so the new run is directly comparable.

Takedown and opt-out. If your business is named anywhere on our property and you want it removed, the process is documented at /takedowns/. The bar is one email and the SLA is 24 hours; no reason or justification is required. Opt-out from future measurement is permanent.

If you want to scope an engagement

Email Lance@hi.neverranked.com with the category you want to measure and three to five competitors you want on the cohort. The first conversation locks the query set together. Then the measurement starts.

