How we measure what AI engines cite.
The seven AI tools, the question-set discipline, the source classifier, the pattern-readiness rule, and a working example from data we actually captured.
What this page is, in plain English
This is the “show your work” page. Everything below explains exactly how we run the measurement, where the numbers come from, and what they mean. It is written for the person on your team (or your compliance reviewer, or your agency’s data lead) who needs to verify what we’re doing before signing on.
The plain-English version, in five sentences:
- We figure out what your customers are actually asking AI tools in your category, then lock those questions so every run compares apples to apples.
- We ask all 7 AI tools (ChatGPT, Google AI, Perplexity, Microsoft Copilot, Gemini, Claude, Gemma) the same questions every day for three weeks to build the baseline.
- We record who AI mentions for each question, who it doesn’t, and where in the answer each name shows up.
- We hand you a clear list of what to fix, prioritized by what your competitors are observably doing differently.
- Then we keep measuring every week, because AI answers shift constantly and a one-time audit ages out in 60 to 90 days.
If that is enough for you, you can stop reading here and go back to the homepage. Everything below is the proof.
The two-layer measurement model
AI answer engines fall into two structurally different groups, and both groups can fail a brand independently. We measure both, every day, for every category we run.
Layer 1: Citation-grade engines (5)
Engines that search the live web at query time and surface the sources they cite. Five of them dominate the surface area today:
- Perplexity, accessed via the sonar API.
- ChatGPT search, accessed via the OpenAI gpt-4o-search-preview model.
- Gemini grounded, accessed via the gemini-2.5-flash model with the google_search tool enabled.
- Microsoft Copilot, accessed via Bing organic search through the DataForSEO API.
- Google AI Overviews, accessed via the DataForSEO ai_overview endpoint.
For each query, on each engine, the API returns the AI-generated answer and the source URLs cited in it. We capture three things: the answer text, the cited URLs, and the timestamp of the capture. A brand is "cited" on a query if its domain appears in the URL list for that query's answer.
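A minimal sketch of the "cited" check described above, written in Python for brevity (the published runners are .mjs). The function names are hypothetical, and treating subdomains as part of the brand's domain is an assumption, not a documented rule.

```python
from urllib.parse import urlparse


def host_of(url: str) -> str:
    """Return the lowercase hostname of a URL without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def is_cited(brand_domain: str, cited_urls: list[str]) -> bool:
    """True if any cited URL sits on the brand's domain.

    Assumption: subdomains (blog.example.com) count as the brand's domain.
    """
    brand = brand_domain.lower().removeprefix("www.")
    return any(
        host == brand or host.endswith("." + brand)
        for host in map(host_of, cited_urls)
    )
```

Matching on the parsed hostname rather than a substring of the URL avoids counting a lookalike domain, or a directory page that merely contains the brand name in its path, as a citation.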
Layer 2: Model-knowledge engines (2)
Engines that answer from training data without searching the live web. These represent the baseline of what AI says about a brand when it cannot look anything up, which is what users of Claude.ai or any Claude-powered support tool see by default.
- Claude, accessed via the Anthropic API. No native web tool is invoked, so the response reflects model knowledge plus RLHF.
- Gemma, accessed via Together AI. Open-weight, which means a compliance team or auditor can re-run the exact same prompts independently and verify our numbers.
For each query, on each engine, the response is captured and scanned for mentions of the brand name. A brand is "mentioned in model knowledge" on a query if its name appears in the response.
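The mention scan can be sketched the same way. This is an illustration under stated assumptions: the function name is hypothetical, and the case-insensitive, whole-word matching shown here is one reasonable reading of "its name appears in the response," not a documented rule.

```python
import re


def is_mentioned(brand_name: str, response_text: str) -> bool:
    """Case-insensitive match of the brand name, not embedded in a longer word."""
    pattern = r"(?<!\w)" + re.escape(brand_name) + r"(?!\w)"
    return re.search(pattern, response_text, flags=re.IGNORECASE) is not None
```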
Why both layers matter. A brand invisible in citation is invisible when AI fact-checks itself before answering. A brand invisible in model knowledge is invisible at the baseline, before any search happens. These are different failure modes. Measuring only one half of the picture misses what the other half could tell you.
Query-set discipline
Each engagement opens with a query-set design conversation. The goal: a frozen list of 15-30 queries that represent how the customer's buyers actually search in their category. Three rules govern the list:
- Hash-locked. The query list is hashed on first run. Every subsequent run prints the same hash. If the hash changes, the runs are no longer comparable and the discipline has been broken. Hash printing is in dryrun/run-dental-honolulu.mjs and similar runners.
- Intent-shaped. Queries cover head-intent ("best X in Y"), neighborhood scoping ("X in [specific area]"), service-combined ("X that also does Y"), trust-signal ("X for first-time customers"), value-conscious ("affordable X that takes [insurance]"). The shape mirrors how a real buyer searches, not how an SEO keyword tool ranks.
- No naming the customer. Queries do not contain the customer's brand name. Branded queries are a different measurement (does the AI know who you are when asked directly), which we also report but separate from demand queries (does the AI surface you when a buyer searches for the category).
Every measured category gets its own runner with its own 18-question set, locked by hash. The runners are in the public repo. We have published runners and pattern-ready measurements for Hawaii consumer banking, Hawaii wealth management, Honolulu dental, Hawaii law firms, Hawaii CPAs, and Austin TX CPAs (our first cross-geo measurement). Each runner’s hash is printed at the start of every run; any change to the question set produces a new hash and the runs are no longer comparable across the change.
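To make the hash lock concrete, here is a minimal Python sketch. The hash algorithm (SHA-256), the JSON serialization, and both function names are assumptions for illustration; the published runners may hash the list differently.

```python
import hashlib
import json


def query_set_hash(queries: list[str]) -> str:
    """SHA-256 over the ordered, exact query strings.

    Assumption: order and wording both count, so any edit changes the hash.
    """
    payload = json.dumps(queries, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def assert_comparable(locked_hash: str, queries: list[str]) -> None:
    """Refuse to proceed if the query set no longer matches the locked hash."""
    current = query_set_hash(queries)
    if current != locked_hash:
        raise ValueError(
            f"query set changed: {current[:12]} != {locked_hash[:12]}"
        )
```

Hashing a canonical serialization, rather than comparing lists by eye, means a one-character edit to any query is detected on the next run.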
An example 18-question set
The actual locked set we used for Hawaii wealth management (one of the six published measurements). Eight head queries plus ten long-tail queries. This is what a scoping call produces: a list this specific, locked by hash, never edited after the first measurement run.
Head queries (8) · broad buyer intent
- best wealth manager in Hawaii
- best financial advisor in Honolulu
- top financial advisor Hawaii high net worth
- fee-only financial advisor Honolulu
- most trusted financial advisor in Hawaii
- best Hawaii financial advisor for retirement planning
- Hawaii wealth manager for business owner
- fiduciary financial advisor Honolulu
Long-tail queries (10) · specific buyer scenarios + comparison
- Hawaii financial advisor for inherited wealth
- Hawaii financial advisor that handles Roth conversions
- best Hawaii financial advisor for tax planning
- Hawaii wealth manager for real estate investor
- Hawaii financial advisor for early retirement
- Hawaii financial advisor first time investing
- Hawaii financial advisor that handles business sale proceeds
- Hawaii wealth manager for physician or attorney
- Hamada Financial vs Cadinha vs CKW
- Honolulu financial advisor Bishop Street
The shape repeats across categories: 8 head queries capturing the "name a few" demand surface plus 10 long-tail queries capturing specific buyer scenarios (including one named-comparison curiosity query that captures what AI says when someone is already evaluating known firms head-to-head). Per-category wording is tailored at the scoping call. Each set is locked by hash and never edited.
Noise control
AI engines produce different answers to the same query on different runs. Sometimes meaningfully different, sometimes not. To separate signal from noise, every query runs three times per engine per day. Across a three-week kickoff, that produces roughly 18 × 3 × 5 × 21 = 5,670 citation captures per engagement on the citation-grade engines alone, plus the model-knowledge captures.
A citation that appears in only one of three reps is a single-shot. A citation that appears in two of three reps is a moderate signal. A citation that appears in all three reps is the strongest single-day signal we report. Aggregate frequencies over the three-week kickoff are what the punch list is built against.
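The capture arithmetic and the three-rep labels can be written out as a short Python sketch. The function names and the "absent" label for zero reps are assumptions; the other labels follow the wording above.

```python
def kickoff_captures(queries: int = 18, reps: int = 3,
                     engines: int = 5, days: int = 21) -> int:
    """Citation captures over a kickoff on the citation-grade engines."""
    return queries * reps * engines * days


def rep_signal(reps_with_citation: int, total_reps: int = 3) -> str:
    """Label one day's signal from how many reps carried the citation."""
    if not 0 <= reps_with_citation <= total_reps:
        raise ValueError("reps_with_citation out of range")
    if reps_with_citation == 0:
        return "absent"  # hypothetical label; the text does not name this case
    if reps_with_citation == total_reps:
        return "strong"
    if reps_with_citation == 1:
        return "single-shot"
    return "moderate"
```

With the defaults, `kickoff_captures()` returns 5670, matching the figure above.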
Source-type classifier
For each cited URL captured, the source is classified into one of nine buckets. The classifier source code is public at github.com/LanceRoylo/neverranked-outreach. The buckets:
| Bucket | What's in it |
|---|---|
| youtube | youtube.com, youtu.be |
| reddit | reddit.com (any subreddit) |
| wikipedia | wikipedia.org, wikimedia.org, wikidata.org |
| forum | Stack Exchange family, Quora, Hacker News, Discourse-hosted forums |
| social | LinkedIn, X, Facebook, Instagram, TikTok, Threads, Medium, Substack |
| review_directory | Yelp, G2, Capterra, Trustpilot, Clutch, Healthgrades, RealSelf, TripAdvisor, BBB, Gartner, Glassdoor, ProductHunt, Birdeye, plus category-specific directories |
| owned | The customer's own domain (passed in at engagement start) |
| competitor | Named competitors (passed in at engagement start) |
| independent_web | Everything else: publications, blogs, vendor pages. Honestly lumped because hostnames alone cannot reliably distinguish "major publication" from "random blog" without false confidence. |
Why independent_web is honestly lumped. The temptation in this category is to over-claim, to say "the AI cited eight major publications" when really it cited eight independent web pages of unknown editorial weight. Hostname-only classification cannot tell you which of those eight is a real publication and which is a content marketing site. So we put them all in one bucket and label it honestly. Anyone who wants to do the slow, expensive editorial-weight work on a per-host basis is welcome to do it on their own. The raw URLs are in the customer's data store.
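A hostname-only classifier along these lines can be sketched in a few lines of Python. This is not the published classifier: the host lists are a partial illustration drawn from the bucket table, the function names are hypothetical, and checking owned and competitor domains before the generic buckets is an assumption.

```python
from urllib.parse import urlparse

# Partial, illustrative host lists; the public classifier covers more hosts.
BUCKET_HOSTS = {
    "youtube": {"youtube.com", "youtu.be"},
    "reddit": {"reddit.com"},
    "wikipedia": {"wikipedia.org", "wikimedia.org", "wikidata.org"},
    "forum": {"stackexchange.com", "quora.com", "news.ycombinator.com"},
    "social": {"linkedin.com", "facebook.com", "instagram.com", "tiktok.com"},
    "review_directory": {"yelp.com", "trustpilot.com", "healthgrades.com"},
}


def _matches(host: str, domain: str) -> bool:
    """True for the domain itself or any of its subdomains."""
    return host == domain or host.endswith("." + domain)


def classify(url: str, owned: set[str], competitors: set[str]) -> str:
    """Bucket a cited URL by hostname alone."""
    host = (urlparse(url).hostname or "").lower()
    if any(_matches(host, d) for d in owned):
        return "owned"
    if any(_matches(host, d) for d in competitors):
        return "competitor"
    for bucket, domains in BUCKET_HOSTS.items():
        if any(_matches(host, d) for d in domains):
            return bucket
    return "independent_web"  # everything else, deliberately lumped
```

Note what happens when the `owned` and `competitors` sets are empty: a cohort member's own site falls through to `independent_web`. The engagement context has to be passed in for those two buckets to mean anything.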
Cohort and competitive analysis
Each engagement names a competitive cohort at scoping (typically 3-7 competitors). For every query on every engine, we record whether each cohort member was cited, and how frequently across the three-week window. That produces the competitive gap table at the heart of every research memo (see the example engagement page for what this looks like in practice).
Two structural observations are worth naming up front:
- Market-leader effects are real but slow. The two competitors who appear on 40%+ of queries in a category are almost never displaceable in one engagement. The honest framing is to close the gap to the mid-tier first.
- Long-tail wins compound. The competitors who appear on 3-9 specific queries each are where focused punch-list work produces fast, measurable movement. The first noticeable citation share gains in any engagement are usually here.
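The per-member query coverage behind the gap table can be sketched as follows. The capture record shape (`query`, `cited_domains`) and the function name are hypothetical; the real data store may look different.

```python
def citation_share(captures: list[dict], cohort: list[str]) -> dict[str, float]:
    """Share of distinct queries on which each cohort domain was cited at least once.

    Each capture is assumed to look like:
    {"query": str, "cited_domains": set[str]}
    """
    queries = {c["query"] for c in captures}
    hits: dict[str, set[str]] = {domain: set() for domain in cohort}
    for capture in captures:
        for domain in cohort:
            if domain in capture["cited_domains"]:
                hits[domain].add(capture["query"])
    return {
        domain: (len(qs) / len(queries) if queries else 0.0)
        for domain, qs in hits.items()
    }
```

A member at 0.4 or higher on this measure corresponds to the "appears on 40%+ of queries" tier described above.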
Within-citation depth (added 2026-05-21)
Beyond "your brand was cited on this query," we now capture two finer-grained signals for every recurring brand in the cohort:
- Position in the answer. Where in the AI's response does the brand name appear? Bucketed by text-quartile: Q1 (lead) means the AI is opening with you; Q4 (tail) means you're a footnote. The signal is real: in the Honolulu med-spa cohort, one med spa appears in Q1 55% of the time (the AI leads with them on most queries) while another appears in Q3 59% of the time (almost never lead, almost never tail). Same total mention count, very different competitive signal.
- Sentiment context. Strong-positive ("top", "best", "trusted", "premier") and strong-negative ("avoid", "outdated", "limited") vocabulary detected within ±80 characters of each mention. Honest scope: this is heuristic pattern detection, not nuanced NLP. Strong positive and strong negative are real signals. The default "neutral" bucket may include mild positive/negative the heuristic misses. The honest framing is in the deliverable.
Tool: dryrun/forensic/within-citation.mjs. Public source code, runnable on any cohort that has raw measurement data.
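A minimal Python sketch of the two signals, for readers who want to see the heuristic's shape without opening the tool. The function names are hypothetical, the word lists are only the examples quoted above, and two choices are assumptions: position is taken from the mention's start offset, and a mention with both positive and negative vocabulary nearby falls into "neutral".

```python
import re

POSITIVE = ("top", "best", "trusted", "premier")
NEGATIVE = ("avoid", "outdated", "limited")


def position_quartile(text: str, start: int) -> str:
    """Bucket a mention's start offset into Q1 (lead) through Q4 (tail)."""
    if not text:
        raise ValueError("empty answer text")
    quartile = min(3, start * 4 // len(text))
    return f"Q{quartile + 1}"


def sentiment_context(text: str, start: int, end: int, window: int = 80) -> str:
    """Scan up to `window` characters either side of the mention, excluding the mention."""
    before = text[max(0, start - window):start]
    after = text[end:end + window]
    context = (before + " " + after).lower()
    has_pos = any(re.search(rf"\b{word}\b", context) for word in POSITIVE)
    has_neg = any(re.search(rf"\b{word}\b", context) for word in NEGATIVE)
    if has_pos and not has_neg:
        return "strong-positive"
    if has_neg and not has_pos:
        return "strong-negative"
    return "neutral"  # includes mixed contexts (assumption)
```

Excluding the mention itself from the scanned window keeps a brand name such as "Best Smiles" from scoring itself as positive.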
Drift detection (added 2026-05-21)
Day-over-day citation share movement, flagged when a host moves more than a configurable threshold (default 5 percentage points) between measurement windows. Same query-set hash required in both windows; the tool refuses to compare across different query sets because the universe of possible citations is different.
The current monthly delta memo surfaces drift findings as observation. Two drift tools ship in the public repo: drift.mjs for per-category point-in-time host-level deltas, and drift-summary.mjs for a one-line-per-category scan across all measured categories that the monthly memo workflow uses as a first-pass pre-write. Automated daily alerting (an email to the customer when a host gains or loses more than a threshold in a single day) is the obvious next layer and ships with the dashboard build.
Tool: dryrun/forensic/drift.mjs. Hash-locked, engine-restrictable, JSON-output for piping into downstream alerting.
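The comparison rule can be sketched in Python as follows. The window record shape (`query_set_hash`, `share_pct`) and the function name are hypothetical; the 5-point default and the refusal to compare across different hashes come from the description above.

```python
def drift(prev: dict, curr: dict, threshold_pp: float = 5.0) -> list[tuple[str, float]]:
    """Hosts whose citation share moved more than threshold_pp percentage points.

    Each window is assumed to look like:
    {"query_set_hash": str, "share_pct": {host: percent}}
    """
    if prev["query_set_hash"] != curr["query_set_hash"]:
        raise ValueError("query-set hashes differ; windows are not comparable")
    hosts = set(prev["share_pct"]) | set(curr["share_pct"])
    deltas = {
        host: curr["share_pct"].get(host, 0.0) - prev["share_pct"].get(host, 0.0)
        for host in hosts
    }
    flagged = [(h, round(d, 2)) for h, d in deltas.items() if abs(d) > threshold_pp]
    # Largest movers first, ties broken by host name for stable output.
    return sorted(flagged, key=lambda item: (-abs(item[1]), item[0]))
```

A host missing from one window is treated as 0% there, so a newly appearing host is flagged the same way as a large gain.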
The pattern-readiness rule
We do not claim a pattern in a category from a single run. The rule is in MOAT.md rule 5:
Every new category needs at least 3 USABLE runs before we claim a pattern. A usable run is one that produced at least one ok row. A run that returned all ok:false is a failed attempt, not a measurement, and does not count toward the pattern-readiness bar.
The catalog tool (dryrun/forensic/catalog.mjs) enforces this in the read-out, flagging failed runs and surfacing per-category usable-run counts. A "ready" label appears only when a category has three or more runs with at least one successful API capture each. Anything below that is reported as "data point, not pattern" with the warning surfaced inside the deliverable itself.
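The rule reduces to a small check. This Python sketch assumes each run is a list of rows carrying an `ok` flag, as the rule's wording suggests; the function names are hypothetical.

```python
def is_usable(run: list[dict]) -> bool:
    """A run is usable if at least one row came back ok."""
    return any(row.get("ok") is True for row in run)


def readiness(runs: list[list[dict]], min_usable: int = 3) -> str:
    """Label a category by its count of usable runs; failed runs do not count."""
    usable = sum(1 for run in runs if is_usable(run))
    return "ready" if usable >= min_usable else "data point, not pattern"
```

For example, three attempts of which one returned all `ok: false` yields "data point, not pattern", because only two runs count toward the bar.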
The aggregate layer
Beyond the per-customer measurement, every engagement's raw data feeds a cross-category aggregate. The aggregate answers questions no single engagement can answer alone:
- What source types do AI engines cite for category X across the universe of customers we have measured in X?
- How does engine behavior differ between category X and category Y?
- What recurring hosts (cited across two or more engagements) are the structural authorities in category X right now?
- When the same category is measured in two different geographies, do the same patterns hold? Or does the category-level pattern actually split by engine and by geography in ways that single-geo measurement cannot surface?
The fourth question is what we call the cross-geo discipline. The Austin TX CPA measurement (our first non-Hawaii category) tested whether the Hawaii CPA training-data engine findings generalize. The result split the original pattern: one training-data engine’s collapse generalized cross-geo, the other did not. A vendor measuring one geography per category would have published the Hawaii result and called it the category-level pattern. The cross-geo measurement was the only way to discover that the framing needed to be more precise. See the cross-category teardown and pattern 01 in the patterns library for the full detail.
Privacy and handling stance for the aggregate: aggregate-level patterns observed across the dataset stay with us, never tied to a named customer, never reverse-engineerable to a specific engagement. The aggregator (dryrun/forensic/aggregate.mjs) has a host-surfacing gate set to min-runs >= 2: a host appears in aggregate output only when it appears in two or more distinct runs. Privacy is enforced in the code, not in the policy.
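The host-surfacing gate can be illustrated with a short Python sketch. The input shape (run id mapped to the hosts cited in that run) and the function name are assumptions; only the "two or more distinct runs" threshold comes from the text.

```python
from collections import defaultdict


def surfaced_hosts(runs: dict[str, list[str]], min_runs: int = 2) -> dict[str, int]:
    """Map host -> number of distinct runs citing it, keeping hosts at or above the gate."""
    seen: dict[str, set[str]] = defaultdict(set)
    for run_id, hosts in runs.items():
        for host in hosts:
            seen[host].add(run_id)  # repeat citations within one run count once
    return {h: len(ids) for h, ids in seen.items() if len(ids) >= min_runs}
```

Counting distinct runs rather than raw citations is the point of the gate: a host cited many times inside a single run still stays out of the aggregate output.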
A working example from real data
The first pattern-ready category in the aggregate was Honolulu med spas, measured in May 2026 against a 2-engine seed (OpenAI + Perplexity). The 7-engine methodology shipped shortly after; six full measurements have published since (5 Hawaii categories plus the first cross-geo, Austin TX CPAs). The med-spa example below is preserved because it documents the source-type classifier’s post-bug-fix accuracy verification (see 2026-05-23 note further down). For the current state of measured categories with 7 engines, see the published teardowns and the cross-category teardown. The methodology demonstrated here is the same one that produced those teardowns; the data shape repeats with more engines and tighter cohort coverage.
The historical med-spa findings:
| Source type | % of citations |
|---|---|
| Competitor (cohort med-spa businesses themselves) | 57% |
| Independent web | 41% |
| Review directory | 2% |
| YouTube | <1% |
| Reddit, Wikipedia, forum, social | 0% |
The non-obvious read. For this category AI cites the businesses' own websites somewhat more often than third-party content, but not overwhelmingly so (57% competitor vs 41% independent web, roughly 1.4:1). That is not the result a buyer expects. Conventional SEO instinct says "AI cites third-party content about you, so optimize the content not the site." For Honolulu med spas, the data says the opposite is half-true: own-site optimization matters at least as much as off-site presence, because AI is reading both at comparable rates. The shape of the fix list changes accordingly: both surfaces matter, neither dominates.
Per-engine behavior within the same category:
| Engine | Competitor | Independent web | Review dir | YouTube |
|---|---|---|---|---|
| OpenAI | 62% | 37% | 1% | 0% |
| Perplexity | 51% | 46% | 3% | 1% |
OpenAI weights own-site citations more heavily than Perplexity (62% vs 51%). That difference is exactly the kind of per-engine asymmetry a single-engine measurement misses.
Recurring hosts (cited across two or more measurement runs). The businesses are anonymized here because they are not NeverRanked customers and did not consent to appear in our public materials. A paying customer's deliverable names every host in their cohort in full, because the named competitive map is the product. On a public page, the pattern is what matters, not the names.
| Host | Citations | Source type |
|---|---|---|
| Honolulu med spa A | 88 | competitor |
| Honolulu med spa B | 88 | competitor |
| Honolulu med spa C | 58 | competitor |
| Honolulu med spa D | 54 | competitor |
| Honolulu med spa E | 53 | competitor |
| category-specific directory A | 11 | review_directory |
| category-specific directory B | 3 | review_directory |
The pattern-ready threshold (3+ usable runs) is met by this category; the aggregator confirms. Cohort coverage improves with each engagement. More runs surface more cohort members, so the competitor share trends toward its true ceiling rather than away from it.
For category contrast, a kill-test cohort against neverranked.com (category aeo_tools, single run, Perplexity) showed a meaningfully different source mix: 85% independent web, 14% YouTube, 1% forum. The same engine, a different category, produces a fundamentally different source-type distribution. That difference is the moat in microcosm. Only an outside observer measuring across categories can name it.
Note on accuracy, updated 2026-05-23. Earlier versions of this page reported "98% independent web, 0% competitor" for this category. That was wrong by a wide margin: the aggregator was passing empty context to the source-type classifier, so the cohort's own domains were silently bucketed as independent web. The bug was fixed and the cohort registered; the corrected numbers appear above. The methodology is unchanged. This is the kind of error the fail-closed grader catches in prose but the underlying data pipeline has to catch separately. We caught it before any customer-facing claim shipped from these numbers.
What this engagement deliberately does NOT measure
The honest scope on what "forensic" means here and what it does not. The structural axes (cross-engine, cross-competitor, source-type classification, cohort discipline, pre-registration) are evidence-grade. The depth-of-content axes below are gaps. We name them so a prospect evaluating us does not assume we cover something we don't.
- Voice AI surfaces. Siri, Alexa, Google Assistant. Out of scope until they expose stable APIs.
- AI ad placements. Out of scope until those formats are reliably measurable.
- Causation. We measure what AI cites. We do not pre-register tests against customer accounts to prove causation of changes. Monthly delta memos report what moved; whether the punch list caused the movement is inference, not proof. Any vendor claiming proven causation in this category should be asked for the pre-registration file.
- Automated daily drift alerts. Daily measurement ships now; drift detection surfaces in the monthly delta memo as observation (see "Drift detection" section above). Automated daily alerts when a host's citation share moves more than a threshold are the obvious next layer and are not yet in the deliverable.
- Cross-language coverage. English only. AI engines do answer in other languages but our query sets and classifier are English-shaped today.
- Side-by-side comparison with a dashboard-style tool's own report. Standalone empirical teardowns are published at /teardowns/ demonstrating the 7-AI-tool methodology on six measurements (Hawaii consumer banking, wealth management, dental, law firms, CPAs, plus the first cross-geo measurement on Austin TX CPAs). The cross-category teardown reads them against each other. The head-to-head comparison version, where the same subject is measured by both NeverRanked and a dashboard-style tool, has not yet been pulled and added as a side-by-side. When a dashboard-tool report on the same subject is available, the teardown extends to include it.
- Anything we don't have a confirmed engine API key for. Confirmed-alive coverage is reported per engine in every deliverable's methodology section.
Reproducibility
The classifier source code is public. Gemma (one of the seven engines) is open-weight; any auditor can re-run the same prompts against Gemma and verify our numbers independently. The raw measurement data captured for each engagement lives in the customer's data store and is exportable any time. The aggregator and catalog tools are public. If anything in a deliverable does not survive your auditor's review, we want to know.
How public artifacts cite this methodology
Some of what we measure becomes a public artifact: a per-business "AI-visibility receipt" linked from outreach to a named subject. Every public receipt links back to this page for substantiation, and every quantitative claim it carries is anchored in a hash-locked measurement run on the methodology above.
The substantiation chain in one sentence: every quantitative claim in a NeverRanked artifact is derived from a hash-locked, pre-registered measurement, captured by open-source code, against named engines on named dates with named queries, and graded by a fail-closed factual checker before it ships.
Phrasing discipline. Receipts speak observationally only. "On N of M observed queries between [dates], engine X cited business Y" is the canonical form. We do not say an engine "recommends," "prefers," "endorses," or "ranks" a business. Engines cite, full stop. We do not assert that being cited causes a business outcome, or that not being cited causes its absence. The grader rejects artifacts that drift into normative or causal language.
Anonymization commitment. Non-customer businesses are anonymized on every publicly-indexed page. The named subject of a receipt is named (they have notice through the outreach we sent them); other businesses in the same cohort appear as "Competitor A", "Competitor B", and so on. A named subject who is also a paying customer can consent to additional naming in their own deliverable; otherwise the public version stays anonymized.
Measurement window staleness. AI engine behavior changes weekly. Every public artifact carries the measurement window dates so a reader can see exactly when the underlying capture happened. If a finding is stale, we will re-run on request, at no charge, against the same hash-locked methodology so the new run is directly comparable.
Takedown and opt-out. If your business is named anywhere on our property and you want it removed, the process is documented at /takedowns/. The bar is one email; the SLA is 24 hours; no reason or justification required. Opt-out from future measurement is permanent.
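The phrasing discipline described above lends itself to a mechanical check. The Python sketch below is an illustration only, not the grader itself: both function names are hypothetical, and the pattern covers just the four verbs named in that paragraph, not the broader causal-language rule.

```python
import re

# Stems of the four verbs the phrasing rule names; any inflection is rejected.
BANNED = re.compile(r"\b(recommend|prefer|endors|rank)\w*", re.IGNORECASE)


def observational_claim(n: int, m: int, start: str, end: str,
                        engine: str, business: str) -> str:
    """Build a claim in the canonical observational form."""
    return (f"On {n} of {m} observed queries between {start} and {end}, "
            f"{engine} cited {business}.")


def passes_phrasing(text: str) -> bool:
    """Fail closed: any banned stem anywhere in the text rejects it."""
    return BANNED.search(text) is None
```

A stem match is deliberately over-broad (it also rejects "preference" or "ranking"), which fits a checker that is meant to fail closed rather than let normative language through.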
If you want to scope an engagement
Email Lance@hi.neverranked.com with the category you want to measure and three to five competitors you want on the cohort. The first conversation locks the query set together. Then the measurement starts.