What if my topic is genuinely saturated and there is nothing new to say?

Three options. Reframe the topic at a narrower scope where saturation drops (the saturation set for 'small-business banking fee structures in Q4' is much thinner than for 'small-business banking'). Add primary substance the saturation set lacks (interview five customers and publish quoted observations). Or do not publish on that topic at all; effort moves better to topics where the saturation set is incomplete.

Information Gain for AI SEO

Chapter 4 told you what to make. This chapter covers the engine that gets it cited. AI systems filter retrieval on one property: does the page add substance the corpus does not already hold. Searchbloom calls this property information gain. We make it real through two layers (Information Gain Density and Information Gain Score), a 12-technique catalog, a statistical-formatting standard, a methodology documentation pattern, and a 5-question self-audit you run before you publish. The asset types from Chapter 4 supply the vehicle. Information gain architecture is the engine.

Why This Technique Matters

Every other Evidence technique reduces to information gain. Original research beats restated industry analysis because research adds gain. Sharp opinion beats hedged opinion because the sharp opinion adds gain. Calculators beat comparison blog posts because the calculator gives a one-of-a-kind number no other URL contains. Every winning asset is a vehicle for delivering gain into a corpus that filters for it.

The filter is real. AirOps's March 2026 analysis looked at 12,000+ AI Overview citations. It found that LLMs pick sources that add net-new info. They drop sources that restate what the model already knows. The cited sources are not the highest-ranking organic pages. They are the pages that add substance the rest of the answer could not give. That is the depth modern SEO now demands: AI search is an evolution of SEO, and information gain is the substance bar that evolution raised.

"LLMs reward novelty, not noise."

AirOps, March 2026 analysis of 12,000+ AI citations

The academic record points the same way. The paper that coined the term Generative Engine Optimization built a 10,000-query benchmark, tested optimization methods on it, and validated the results on Perplexity; the methods that lifted visibility most were adding cited sources, direct quotations, and statistics, for a gain of up to 40% over unoptimized content on the engines of the time (Aggarwal et al., GEO, KDD 2024). That study measured 2023-era systems, and the 2026 citation data above shows the pattern has only sharpened as models got better at filtering for substance. Those three levers are all forms of one thing: putting verifiable, attributable substance in front of the model. That is information gain expressed as formatting, and it is why the techniques later in this chapter lean on data, primary sources, and quotable specifics.

This pillar matters most for any brand in a packed category. A topic like "how to improve customer retention" has thousands of pages already indexed. Each one restates the same handful of techniques. Publishing the 1,001st version adds zero gain and earns zero citations. It does not matter how clean the writing is, how it is built, or how strong the domain is. The same operator can publish one page with a benchmark dataset, three counter-intuitive findings from their portfolio, and a documented method. That page is citable. The original 1,001st version was invisible. The difference is not 2x or 5x. It is the gap between zero and one.

Information gain is also what makes the Evidence pillar compound. The first piece of high-gain content earns citations in its category. Each later piece in the same topic neighborhood compounds. The topical entity gets linked to the brand. That entity link is what AI systems retrieve against. The compounding curve depends fully on whether the content adds gain. A program publishing one high-gain piece per quarter beats a program publishing twelve low-gain pieces. The low-gain program never starts the compounding loop.

"The model has already read a thousand pages on your topic. It reaches for the one that tells it something the other thousand could not. That is the whole game now."

Cody C. Jensen, CEO & Founder, Searchbloom

The Mechanism Underneath

The retrieval-and-synthesis pipeline inside generative AI runs on a similarity check. When a user submits a query, the system turns it into a high-dimensional vector. It then pulls candidate documents whose vectors sit nearby. Those candidates get ranked and filtered. The filter rewards distance, not just closeness. Candidates that look the same as ones already picked get pushed down. Candidates that add substance the others lack rise to the top.

Recent retrieval research makes that geometry concrete at production scale. MUVERA shows how multi-vector retrieval can prune candidate sets by 2 to 5 times before expensive re-ranking, which is why pages that add distinct substance keep surviving deeper in the candidate funnel while lookalike pages get dropped earlier (Dhulipala et al., arXiv 2024).

This is the geometric framing of information gain. Two pages with similar topical coverage and the same substance sit close in vector space. The model only needs one. Two pages with similar coverage but different substance (one adds a benchmark, the other adds a contrarian finding) sit farther apart. The model has reason to include both. The farther your page's vector sits from the others' vectors for the same query, the more likely the model is to pull and cite your page when it builds a response.
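The chapter does not name the re-ranking algorithm behind this behavior. One well-known method with the same shape is maximal marginal relevance (MMR), which scores a candidate by its closeness to the query minus its similarity to candidates already picked. The sketch below illustrates that idea only. The three-dimensional vectors, the page names, and the `lam` weight are made up for the example, and nothing here describes how any production AI system is built.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def mmr_select(query, candidates, k=2, lam=0.3):
    """Pick k candidates, trading query relevance against redundancy.

    candidates: dict of name -> vector.
    lam is an assumed weight: 1.0 ranks on relevance alone,
    lower values penalize lookalikes of already-picked pages more.
    """
    picked = []
    remaining = dict(candidates)
    while remaining and len(picked) < k:
        def score(name):
            relevance = cosine(query, remaining[name])
            # Similarity to the closest page already selected (0 if none yet).
            redundancy = max(
                (cosine(remaining[name], candidates[p]) for p in picked),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        picked.append(best)
        del remaining[best]
    return picked

# Toy vectors (illustrative only): two near-identical pages and one distinct page.
query = [1.0, 1.0, 0.0]
pages = {
    "restated_a": [1.0, 0.9, 0.0],
    "restated_b": [1.0, 0.92, 0.0],
    "benchmark": [0.8, 0.6, 0.6],
}
selection = mmr_select(query, pages, k=2, lam=0.3)
```

With these toy numbers the first slot goes to the page closest to the query, and the second slot goes to the distinct page rather than the near-duplicate. Setting `lam=1.0` removes the redundancy penalty, and both lookalike pages get picked.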

Krish Srinivasan introduced this geometric framing at Search Engine Zine in March 2026 as vector shift. Vector shift is the change in your page's vector position relative to the SERP cluster around a query. Searchbloom's add is the earned-versus-gamed split. Vector shift is a receipt, not a strategy. Substance work moves the vector. Trying to move the vector without the substance creates fake novelty that AI systems catch and discount.

The engine has two outcomes operators underweight. First, ranking in organic and earning AI citations are not the same job. A page can rank third for a query, add no information gain, and never show up in the AI answer built from that same query. Second, brute-force ranking work (link building, on-page tuning at scale, content velocity) does not move AI Search the way operators expect. None of those levers change the substance the model pulls.

A Worked Similarity Example

To make the geometric framing real, look at three pages competing on the query "how to improve customer retention in SaaS."

Page A (high overlap). A 1,800-word blog post that opens by defining customer retention. It covers the standard techniques (onboarding, NPS surveys, customer success, churn analysis). It cites the standard sources (Bain on retention, Gainsight reports). It closes with a list of common mistakes. Cosine similarity to the average of the top 10 SERP results: 0.91. IGS: 0.09 (D grade). The content is correct, well-written, and indexable. It is also invisible to AI Search. The model already has the substance.

Page B (moderate overlap). The same 1,800 words. Rebuilt to lead with a counter-claim ("most SaaS retention programs improve the wrong metric and miss the real churn driver"). Backed by one proprietary benchmark from the brand's customer base. Cosine similarity to the top 10 average: 0.76. IGS: 0.24 (C grade). The shift in vector position is real. The page still sits inside the consensus cluster. Citation share would come and go.

Page C (low overlap, high gain). The same query. Treated with information gain architecture from the brief. Five distinct insights. A proprietary churn benchmark across 240 SaaS partners. Three failure modes from those engagements with named operators. A counter-claim that NPS-driven retention programs underperform in mid-market SaaS. A methodology page that documents the benchmark. A four-question audit teams can run on their own retention programs. Cosine similarity to the top 10 average: 0.49. IGS: 0.51 (A grade). The page sits well outside the consensus cluster in vector space. AI systems pull it and cite it. It adds substance no other candidate in the set can give.

The three pages show a fact of the vector space operators underweight. Two pages with the same word counts, the same build quality, and the same SEO signals can produce wildly different AI Search results based on substance alone. The substance is what the model measures. Everything else sits on top of that measure.

A scatter diagram. A cluster of grey dots marks the SERP consensus cluster of the top-10 indexed results. An arrow leads away from the cluster. Page A sits next to the cluster with cosine similarity 0.91 and IGS 0.09, grade D. Page B sits at a moderate distance with cosine 0.76 and IGS 0.24, grade C. Page C sits farthest out with cosine 0.49 and IGS 0.51, grade A. Greater distance from the cluster means higher information gain and more citations. — Figure 1. Information gain as distance in vector space. Two pages with the same word count and build quality can sit far apart on the score: substance is what moves the page away from the consensus cluster.

The Two Searchbloom Information Gain Metrics

Information gain has been talked about across the AI Search industry since late 2024, and the idea traces back further to Google's information gain patent. Searchbloom's add is to make it measurable in two layers. The author-side count is Information Gain Density. The publish-time geometric check is Information Gain Score.

Information Gain Density (IGD)

IGD is the count of distinct, original, sourced insights on a page, measured against the saturation set for its target query. It is a manual editorial metric you apply before you publish. The unit of measure is the discrete insight. That is a single claim, statistic, framework piece, named position, or sourced observation that is not already in the corpus around the topic.

The 5-to-7 Rule is the working range. Across competitive informational queries Searchbloom has audited, the median IGD of pages that earn AI citations is five to seven. Pages with fewer than five rarely earn citation share. Pages with eight or more do not earn more in proportion. Reader and model attention plateau. The extra insights stop registering as distinct. Five to seven is the production target for any new asset, not a hard cutoff.

Counting IGD is a deliberate editorial step. Read the draft. For each paragraph, ask three questions. Is the claim concrete and tied to a source (you, your data, a named expert, a sourced finding)? Is the claim missing from the top 10 SERP results for the target query? Could a reader quote this claim word-for-word in another piece? If all three are yes, that paragraph adds one unit of IGD. The step usually shows that most paragraphs add zero. That is the point.
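The count can be sketched as a small tally. This is a minimal illustration, assuming an editor records a yes or no for each of the three questions per paragraph. The class and function names are hypothetical, and the judgment behind each answer stays manual.

```python
from dataclasses import dataclass

@dataclass
class ParagraphReview:
    """An editor's answers to the three IGD questions for one paragraph."""
    sourced: bool            # concrete and tied to a source?
    absent_from_top10: bool  # missing from the top 10 SERP results?
    quotable: bool           # could be quoted word-for-word elsewhere?

def count_igd(reviews):
    """One unit of IGD per paragraph where all three answers are yes."""
    return sum(
        1 for r in reviews
        if r.sourced and r.absent_from_top10 and r.quotable
    )

def meets_production_target(igd, floor=5):
    """The 5-to-7 Rule treats five as the working floor, not a hard cutoff."""
    return igd >= floor

draft = [
    ParagraphReview(True, True, True),
    ParagraphReview(True, False, True),   # already in the top 10: adds zero
    ParagraphReview(False, True, True),   # unsourced: adds zero
    ParagraphReview(True, True, True),
]
igd = count_igd(draft)
```

Here four paragraphs yield an IGD of two, which is the usual shape of the result: most paragraphs add zero.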

IGD is the production metric. Briefs should set a target IGD ("seven distinct insights, each citable"). Editing should reject drafts that hit the word count but miss the IGD floor. Publishing should treat IGD as a release gate alongside spelling, structure, and schema.

Information Gain Score (IGS)

IGS is the publish-time geometric check. The formula is simple:

IGS = 1 - max(cosine similarity to top-10 SERP competitors)

Embed your final page with an off-the-shelf embedding model. Searchbloom uses the same OpenAI text-embedding-3 family that powers most production retrieval stacks. Embed each of the top 10 organic competitors for the target query; that ranking set is the anchor set the score measures against. Compute the cosine similarity between your embedding and each one. Take the maximum (your nearest competitor). Subtract from one. The result is your IGS. It scales from 0.00 (the same as your nearest competitor) to 1.00 (no measurable similarity, which is itself a sign of off-topic content).
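A minimal sketch of the scoring step, assuming the embeddings have already been fetched and are plain lists of floats. It follows the formula as written (maximum over competitors). The function names are illustrative, and the standard library stands in for whatever numeric stack you actually run.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def information_gain_score(page_vec, competitor_vecs):
    """IGS = 1 - similarity to the nearest competitor."""
    nearest = max(cosine_similarity(page_vec, c) for c in competitor_vecs)
    return 1.0 - nearest

# Tiny made-up vectors; real embeddings have hundreds or thousands of dimensions.
page = [3.0, 4.0]
competitors = [[4.0, 3.0], [0.0, 1.0]]
igs = information_gain_score(page, competitors)
```

Taking the maximum rather than the mean means one near-duplicate competitor is enough to pull the score down, even if the other nine pages are far away.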

Searchbloom maps the score onto a 13-grade letter scale from F to A+ for reporting. The lower bound on each grade is set by hand against citation share in audited categories:

A+ (0.55+). Category-defining gain. Citation share usually tops 30% for the target query.
A (0.50 to 0.54). Strong gain. Citation share lands in the 18 to 30% range.
A- (0.45 to 0.49). Solid gain. Top-cited segment in packed categories.
B+ (0.40 to 0.44). Above the production target. Earning steady citations.
B (0.35 to 0.39). The floor we suggest before publishing in packed categories.
B- (0.30 to 0.34). Borderline. Some categories accept this. Packed ones do not.
C+ (0.25 to 0.29). Low gain. Citation share is rare even with strong distribution.
C (0.20 to 0.24). Mostly restated content. Poor performance is the rule.
C- (0.15 to 0.19). Heavily derivative.
D+ (0.10 to 0.14). Heavy overlap with competitor pages.
D (0.05 to 0.09). Near-duplicate substance.
D- (0.02 to 0.04). Basically a paraphrase.
F (under 0.02). Same as the saturation set.
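The scale above can be written as a lookup. This is a sketch that copies the lower bounds from the list. Rounding to two decimals before grading is an assumption made here, since the bands are stated at two-decimal precision.

```python
# (lower bound, grade) pairs copied from the scale above, highest first.
GRADE_FLOORS = [
    (0.55, "A+"), (0.50, "A"), (0.45, "A-"),
    (0.40, "B+"), (0.35, "B"), (0.30, "B-"),
    (0.25, "C+"), (0.20, "C"), (0.15, "C-"),
    (0.10, "D+"), (0.05, "D"), (0.02, "D-"),
]

def igs_grade(igs):
    """Map an IGS value to its letter grade."""
    score = round(igs, 2)  # assumption: grade on the two-decimal score
    for floor, grade in GRADE_FLOORS:
        if score >= floor:
            return grade
    return "F"  # under 0.02

# The three pages from the worked similarity example.
grades = {
    "Page A": igs_grade(1 - 0.91),
    "Page B": igs_grade(1 - 0.76),
    "Page C": igs_grade(1 - 0.49),
}
```

Run on the worked example, the lookup returns D, C, and A for Pages A, B, and C, matching the grades quoted there.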

A horizontal scale from 0.00 to 0.60. The 13 letter grades F through A-plus run above the bar. The bar is divided into four Embedding Strength zones: Weak below 0.20, Borderline 0.20 to 0.34, Strong 0.35 to 0.49, and Dominant 0.50 and above. A production floor marker sits at 0.35, the B grade, the level most pages in packed categories must clear before publishing. — Figure 3. The IGS grade scale. The 13 letter grades and the four Embedding Strength zones describe the same 0.00 to 0.60 score: one for editorial reports, one for executive reports.

IGS is the check metric. After IGD-targeted writing, the IGS check confirms the substance turned into real distance from the SERP cluster. Drafts that hit IGD but score below B on IGS usually fail in one of two ways. The insights are real but phrased so close to consensus that the model cannot see the gap. Or the insights are real but buried under restated context that waters down the page's vector. Both are fixable with structural edits before you publish.

The deeper take on both metrics, with step-by-step worked math, lives in Searchbloom's information gain primer and its supporting spokes: how the count works, the scoring math, the geometric view, and the Screaming Frog workflow.

The 12 Information Gain Techniques

IGD measures the result. The 12 techniques produce it. Each one below is a repeatable way to add a unit of distinct, sourced insight to a draft. A typical high-IGD asset uses four to six of these techniques. None of them need fancy research. Most of them need operator discipline at the brief and edit stages.

A grid of the twelve techniques that add information gain to a page. The five that form the reliable A-grade mix are highlighted: proprietary data points, specific counter-claims, operational catalogs, methodology disclosure, and reproducible decision rules. The other seven are named-expert quotes, documented failure modes, named frameworks, industry variants, worked numeric examples, time-bounded specificity, and cross-disciplinary recombination. — Figure 4. The 12 information gain techniques. The five in teal form the reliable A-grade mix; most strong assets combine four to six of the twelve.

1. Proprietary data points

The single highest-impact technique. Internal benchmarks, partner aggregate data, platform-derived statistics, and survey data the brand owns are insights the saturation set cannot hold. Even small datasets work. A benchmark drawn from 40 partner engagements is more citable than a hedged claim with no data. The catch is methodology disclosure, covered in the pattern below.

2. Named-expert pull quotes

Quotes from one named operator with proven expertise on a specific question the saturation set treats vaguely. The format matters. Short. Declarative. Sourced inline ("Cody C. Jensen, CEO of Searchbloom, observes that..."). Brand-voice quotes without an attached person earn lower citation rates. AI systems weight named-entity authority on the source of the claim.

3. Specific counter-claims

A clear, defensible statement that differs from the consensus in the category by at least one specific point. "Most B2B blog posts published in 2026 have zero AI citation potential because they restate widely available consensus" is a counter-claim. "Content marketing is important" is not. Counter-claims earn citations because AI synthesis surfaces opposing views when both are credible.

4. Documented failure modes

What did not work and why. The saturation set leans toward success cases. Failure documentation is structurally underrepresented. A piece that lays out three specific failure patterns, with the operating context that produced them, adds gain that no survey of best practices can match. This is the highest-IGD-per-word technique we have measured.

5. Named frameworks and methodologies

Building a named structural lens that organizes ideas in a new way. The MERIT Framework itself is the worked example. Five pillars, fifteen techniques, a documented path to apply it. Frameworks earn citations because they give writers, analysts, and operators a shared vocabulary that travels by reference. Chapter 4 covers framework development as one of the five asset types. The IGD add is the lens itself.

6. Operational catalogs

A listed set of techniques, patterns, examples, or notes inside a topic where the saturation set talks about the topic only in the abstract. The list itself is the gain. This chapter's 12-technique catalog is an example. The saturation set talks about "information gain" in the abstract. A named catalog of techniques is a citable lift on top of the talk.

7. Industry-specific variants

A general rule applied to a specific industry with the variant details named. "Information gain matters in SaaS" is not specific enough to cite. "In CRM software, where Wikipedia citations correlate with AI visibility at rho=0.577 per Ben Wills's March 2026 research, information gain compounds through Wikipedia inclusion" is specific enough to cite. Industry-specific framing creates IGD on otherwise general claims.

8. Worked numeric examples

A specific sample size, timeline, scoring threshold, or outcome with the underlying math shown. Chapter 4's worked asset-selection decisions are an example. The numbers add IGD. The decision reasoning around them lifts the value by making the numbers reproducible.

9. Time-bounded specificity

Claims dated to a specific period, with the conditions that produced the claim on record. "AirOps's March 2026 analysis confirmed AI systems cite opinion at rates close to research" is time-bounded. "AI cites opinion as readily as research" is not. Time-bounded claims age out cleanly. Readers know when to check them. Undated claims age out without warning.

10. Methodology disclosure

For any claim that comes from a study, audit, or sustained observation, a written note on how the work got done. Sample selection, screening rules, instrument, dates, limits. Method disclosure raises the citation rate of the claim by a large factor. The analyst tier reads the method before findings. Their co-citations build AI visibility.

11. Cross-disciplinary recombination

Pulling a concept, model, or pattern from a nearby field and applying it to the target topic with the link named. The framing of vector shift as "a receipt, not a strategy" pulls an audit idea from accounting and applies it to embedding distance. The mix is the gain. Nothing in the saturation set talks about retrieval embeddings in audit-receipt terms.

12. Reproducible decision rules

A rule the reader can use directly without further reading. Chapter 4's decision framework is one. The 5-to-7 Rule for IGD is one. Rules earn citations. They answer the question users bring to AI Search ("how should I decide X") cleanly. Open-ended talk of considerations does not.

A page that blends five of these techniques lands in the A or A+ band on IGS in most packed categories. The reliable mix is proprietary data (Technique 1), a written methodology (Technique 10), three counter-claims (Technique 3), an operational catalog (Technique 6), and a reproducible rule (Technique 12). Most underperforming content uses none of these. It swaps in restated context, hedged best practices, and consensus restatement.

The IGD Technique Stack

The 12 techniques work as a catalog. The stack is how they combine. Not every technique pair compounds equally. Some technique stacks produce IGS scores reliably in the A or A+ band. Others produce additive but unremarkable IGS lift. The stack patterns below come from Searchbloom's measured asset audits across mid-market engagements.

Stack	Techniques	Best for	Typical IGS
Research	1 + 10 + 9 + 12	Research-grade assets	A- to A+
Opinion	2 + 3 + 4 + 11	High-impact opinion	B+ to A
Framework	5 + 6 + 12 + 8	Framework assets	A to A+
Calculator	1 + 10 + 12 + 8	Calculator-style assets	A- to A
Template	6 + 7 + 12 + 9	Template-grade assets	B to B+

The Research Stack: Techniques 1 + 10 + 9 + 12. Proprietary data, methodology disclosure, time-bounded specificity, reproducible decision rules. The reliable combo for research-grade assets. IGS lands in the A- to A+ band when all four are present. Removing methodology disclosure (Technique 10) drops the IGS by one full grade in most measured cases.
The Opinion Stack: Techniques 2 + 3 + 4 + 11. Named-expert pull quotes, specific counter-claims, documented failure modes, cross-disciplinary recombination. The reliable combo for high-impact opinion. IGS lands in the B+ to A band. The failure modes (Technique 4) carry the heaviest IGD-per-word load in this stack.
The Framework Stack: Techniques 5 + 6 + 12 + 8. Named framework, operational catalog, reproducible decision rules, worked numeric examples. The reliable combo for framework assets like MERIT itself. IGS lands in the A to A+ band. The catalog dimension (Technique 6) is the IGD multiplier; without it the framework reads as a single concept rather than a structured asset.
The Calculator Stack: Techniques 1 + 10 + 12 + 8. Proprietary data, methodology disclosure, reproducible decision rules, worked numeric examples. The reliable combo for calculator-style assets. IGS lands in the A- to A band when the calculator surfaces unique result URLs as specified in Chapter 4.
The Template Stack: Techniques 6 + 7 + 12 + 9. Operational catalog, industry-specific variants, reproducible decision rules, time-bounded specificity. The reliable combo for template-grade assets. IGS lands in the B to B+ band. Templates rarely reach A grade unless paired with proprietary benchmark data (Technique 1).

The stacks are starting points, not formulas. Categories with deep saturation around one technique type (say, proprietary data benchmarks) shift the weighting toward neighboring techniques. The IG Brief stage (covered later in this chapter) names the stack the asset will execute. Drafting against a named stack rather than picking techniques ad-hoc raises hit rates on IGS targets by 35 to 60% across measured programs. The discipline cost is one line in the brief. The payoff is in citation share.
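Naming the stack in the brief can be as light as a lookup. The sketch below encodes the five stacks and twelve technique names from this chapter and reports which techniques a draft still lacks. The data structure and the `missing_techniques` helper are illustrative, not a Searchbloom tool.

```python
TECHNIQUES = {
    1: "Proprietary data points",
    2: "Named-expert pull quotes",
    3: "Specific counter-claims",
    4: "Documented failure modes",
    5: "Named frameworks and methodologies",
    6: "Operational catalogs",
    7: "Industry-specific variants",
    8: "Worked numeric examples",
    9: "Time-bounded specificity",
    10: "Methodology disclosure",
    11: "Cross-disciplinary recombination",
    12: "Reproducible decision rules",
}

# Stack -> technique numbers and typical IGS band, from the table above.
STACKS = {
    "Research": {"techniques": (1, 10, 9, 12), "typical_igs": "A- to A+"},
    "Opinion": {"techniques": (2, 3, 4, 11), "typical_igs": "B+ to A"},
    "Framework": {"techniques": (5, 6, 12, 8), "typical_igs": "A to A+"},
    "Calculator": {"techniques": (1, 10, 12, 8), "typical_igs": "A- to A"},
    "Template": {"techniques": (6, 7, 12, 9), "typical_igs": "B to B+"},
}

def missing_techniques(stack_name, techniques_in_draft):
    """Names of the stack's techniques that the draft does not yet use."""
    required = STACKS[stack_name]["techniques"]
    return [TECHNIQUES[t] for t in required if t not in techniques_in_draft]

# A research draft that has data, dates, and a rule, but no methodology yet.
gaps = missing_techniques("Research", {1, 9, 12})
```

An empty result means the draft covers the named stack. A non-empty one is the early warning the brief stage is meant to give.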

The stacks also explain why low-IG assets fail. Most underperforming content uses zero of the 12 techniques. Restated context, hedged best practices, and consensus restatement carry no stack. They sit close to the saturation cluster on every dimension. Naming the stack at brief stage exposes the absence early. The author and editor can course-correct before draft time.

Statistical Claim Formatting for Extractability

Information gain is needed but not enough on its own. Gain that AI systems cannot extract is invisible no matter how good the substance is. The most common failure at this stage is statistical claims locked inside infographic images, shown without methodology, or buried in prose where the numbers cannot be lifted as a discrete unit.

The complete-format pattern is the production standard for statistical claims. Every claim that leans on a number should hold six parts in close range:

The claim itself. A clear sentence stating what the number shows.
The number, in parsable text. The actual digits, in HTML the model can read. Not locked in an image.
The sample. What the number was measured on (5,247 respondents, 12,000 page citations, 40 partner engagements).
The methodology pointer. A short phrase that says how the data was collected, with a link or section reference to the full methodology.
The source. Named attribution. If third-party, the author or organization. If internal, the brand and the documented program.
The date. When the data was collected or analyzed.

Inline example. "AI Overviews cite at least one source from the top 20 organic results in 97% of cases, based on seoClarity's February 2025 analysis of search analytics across thousands of queries." All six parts are in one sentence. AI systems can lift the claim cleanly into a response and source it right.
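The six-part check can be sketched as a record with one field per part. The field names are hypothetical labels for the parts listed above, and the first example splits the inline sentence into those parts.

```python
from dataclasses import dataclass, fields

@dataclass
class StatClaim:
    """One statistical claim, split into the six complete-format parts."""
    claim: str
    number: str
    sample: str
    methodology_pointer: str
    source: str
    date: str

def missing_parts(c):
    """Names of any parts left empty."""
    return [f.name for f in fields(c) if not getattr(c, f.name).strip()]

complete = StatClaim(
    claim="AI Overviews cite at least one source from the top 20 organic results",
    number="97%",
    sample="thousands of queries",
    methodology_pointer="analysis of search analytics",
    source="seoClarity",
    date="February 2025",
)

# A bare number with a source name but nothing else behind it.
incomplete = StatClaim(
    claim="Most AI Overviews cite a top-ranking page",
    number="97%",
    sample="",
    methodology_pointer="",
    source="seoClarity",
    date="",
)
```

The check only confirms that each part is present. Whether the sample and methodology are credible is still an editorial call.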

The dual-format rule covers visuals. Every chart, infographic, or visual that carries a statistic must be paired with a parsable HTML version of the same statistic next to the image. The visual is for the human reader. The HTML is for the model. Pages with statistics only in image form earn near-zero AI citation rates no matter how good the visual is. The retrieval stack does not run OCR on images during synthesis.

HTML tables are the working format for compared numbers. A table of asset types with time-to-citation, typical IGS band, and citation profile pulls as a discrete unit. The same content as a string of paragraphs does not. When in doubt, build the table.
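A minimal sketch of building the parsable version, assuming the numbers already live in rows. The helper name is illustrative, and the sample rows reuse two lines from the stack table earlier in this chapter.

```python
from html import escape

def html_table(headers, rows):
    """Render headers and rows as a plain HTML table string."""
    head = "".join(f"<th>{escape(h)}</th>" for h in headers)
    body = "".join(
        "<tr>"
        + "".join(f"<td>{escape(str(cell))}</td>" for cell in row)
        + "</tr>"
        for row in rows
    )
    return f"<table><thead><tr>{head}</tr></thead><tbody>{body}</tbody></table>"

markup = html_table(
    ["Stack", "Typical IGS"],
    [("Research", "A- to A+"), ("Template", "B to B+")],
)
```

The output is text the page can carry next to any chart of the same figures, which is what the dual-format rule asks for.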

Methodology Documentation as a Citable Asset

Most operators treat methodology as a footnote on the main asset. The high-IGS move is to build methodology as its own canonical page with its own URL. Indexable on its own. Linked from the main asset.

The dedicated methodology page solves four problems at once. Analysts and journalists can audit the method without reading the main asset, and their co-citations drive AI visibility. The page becomes its own citable source apart from the claim. The analyst tier's bar for citable research is "methodology I can defend in my own writing," and the dedicated page meets it. The page also makes a second indexed asset where the alternative makes zero.

The recommended structure for the methodology page:

Title at the top ("Methodology: Searchbloom 2026 Benchmark Study on Mid-Market AI Search Citation Patterns").
Date of data collection.
Sample selection (who took part, how recruited, screening criteria).
Instrument or audit protocol (survey questions, data collection steps, scoring rubric).
Statistical methods (descriptive stats, confidence intervals, significance testing where used).
Limitations section. Every credible methodology page has one. The analyst tier reads this first.
Citation format. The attribution sentence the brand wants third parties to use.

The methodology page links to and from the main asset, and the schema follows the same chain. The methodology page carries its own Article schema with isPartOf pointing at the main asset's Dataset or Report schema. Both pages are indexed. Both pages are citable. Both serve the analyst tier.
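One way to express that schema chain is as JSON-LD generated from a short script. The sketch below is an assumption-heavy illustration: the URLs and titles are placeholders, the main asset is typed as Dataset (Report would follow the same pattern), and only the properties needed to show the link are included.

```python
import json

# Hypothetical URLs for illustration only.
MAIN_ASSET_URL = "https://example.com/benchmark-study"
METHODOLOGY_URL = "https://example.com/benchmark-study/methodology"

main_asset = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "@id": MAIN_ASSET_URL,
    "url": MAIN_ASSET_URL,
    "name": "Benchmark study (placeholder title)",
}

methodology = {
    "@context": "https://schema.org",
    "@type": "Article",
    "@id": METHODOLOGY_URL,
    "url": METHODOLOGY_URL,
    "headline": "Methodology (placeholder title)",
    # isPartOf points back at the main asset by its @id.
    "isPartOf": {"@type": "Dataset", "@id": MAIN_ASSET_URL},
}

def json_ld_script(data):
    """Wrap a schema dict in the script tag used to embed JSON-LD in a page."""
    return (
        '<script type="application/ld+json">'
        + json.dumps(data, indent=2)
        + "</script>"
    )

methodology_tag = json_ld_script(methodology)
```

Each page would carry its own tag. Keeping the `@id` values identical on both sides is what lets a parser connect the two records.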

The Methodology Page Template

The structure below is the working template Searchbloom uses for methodology pages on benchmark studies, audits, and partner-engagement aggregates. Adapt the headings to your instrument. The order and coverage areas should not change.

Title and abstract. Title names the study and the period. Abstract is two to three sentences. It says what was measured, on whom, and the headline finding the method supports.
Period of data collection. Exact start and end dates. Any data freezes, snapshot dates, or version limits.
Sample selection. Population, sampling frame, recruitment method, screening criteria, response rates, exclusions. Anyone redoing the study should be able to recruit a like sample from this section alone.
Instrument or audit protocol. The full survey, audit rubric, or measurement protocol. If the instrument is too long for the page, link to a downloadable copy and sum up the question categories.
Variables and definitions. Every variable the study reports, with its working definition. Fuzzy terms (engagement, retention, citation) get defined here, not assumed.
Statistical methods. Descriptive stats, confidence intervals, significance testing where used, statistical software. Plain language is fine. The goal is reproducibility, not academic style.
Limitations. The constraints the method cannot solve (sample bias, period effects, response biases, what the data does not measure). Every credible methodology page has one. The analyst tier reads this before findings.
Citation format. The attribution sentence the brand wants third parties to use. Pre-writing the citation raises adoption rates by an order of magnitude over leaving it to readers.

Pages that follow this template land in the A- to A+ IGS band on their own. The methodology page itself is unique substance the saturation set does not hold. The double-asset effect (main asset plus methodology page) is the most reliable IG architecture pattern for research-grade Evidence work.

The Embedding Audit Workflow

The Embedding Audit is the production workflow for measuring IGS at scale. Searchbloom built the method around Screaming Frog v22's embeddings module. The same workflow runs with any tool that lets you batch-embed URLs and compute cosine similarity matrices. The cross-site application to AI citation work is what sets the workflow apart from the embedding audits that already sit inside a mature technical SEO practice.

The retrieval unit can be optimized as well as measured. XTR demonstrates that token retrieval can be trained so the most salient document tokens are selected first, reinforcing why Information Gain Architecture should prioritize claims that remain informative when compressed to high-salience spans (Lee et al., NeurIPS 2023).

The workflow:

Pick the target query for the asset under audit.
Pull the current top 10 organic results for that query.
Crawl all 11 URLs (your page plus the 10 competitors).
Embed each page's main content, skipping the boilerplate header, footer, and sidebar.
Compute the cosine similarity matrix.
Pull your page's row and take the maximum, leaving out your page's similarity with itself.
Subtract from one and map the result onto the IGS grade scale.
Look at the highest-similarity competitor to see which substance your page is duplicating.
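The matrix steps of the workflow can be sketched as below, assuming crawling and embedding are already done and each URL maps to a list of floats. The URLs and vectors are toy placeholders, and the function names are illustrative rather than part of any tool's API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def similarity_matrix(vectors):
    """Full pairwise matrix; vectors is a dict of url -> embedding."""
    urls = list(vectors)
    matrix = [[cosine(vectors[u], vectors[v]) for v in urls] for u in urls]
    return urls, matrix

def audit_page(own_url, vectors):
    """Pull the page's row, find its nearest competitor, and score it."""
    urls, matrix = similarity_matrix(vectors)
    row = matrix[urls.index(own_url)]
    # Leave out the diagonal entry: a page's similarity with itself is 1.0.
    rivals = [(sim, url) for sim, url in zip(row, urls) if url != own_url]
    nearest_sim, nearest_url = max(rivals)
    return {"igs": 1.0 - nearest_sim, "nearest_competitor": nearest_url}

# Toy three-dimensional vectors standing in for real embeddings.
embeddings = {
    "https://example.com/our-page": [3.0, 4.0, 0.0],
    "https://example.com/competitor-1": [4.0, 3.0, 0.0],
    "https://example.com/competitor-2": [0.0, 0.0, 1.0],
}
report = audit_page("https://example.com/our-page", embeddings)
```

Returning the nearest competitor alongside the score matters for the last step: that URL is the page to read when deciding which substance to cut or replace.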

The Embedding Strength scale is the exec-report wrapper around IGS. Pages map to four bands:

Weak (IGS below 0.20). The page is basically a paraphrase of one or more competitors. The gain is theoretical, not measurable.
Borderline (0.20 to 0.34). Some gain is detectable but not enough to drive citation share.
Strong (0.35 to 0.49). The page is set apart enough to earn citations at category-typical rates.
Dominant (0.50+). The page is the citation default for its query cluster.
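The four bands reduce to a short function. As with the letter grades, rounding to two decimals first is an assumption made for this sketch, because the band edges are stated at that precision.

```python
def embedding_strength(igs):
    """Map an IGS value to its Embedding Strength band."""
    score = round(igs, 2)  # assumption: band on the two-decimal score
    if score >= 0.50:
        return "Dominant"
    if score >= 0.35:
        return "Strong"
    if score >= 0.20:
        return "Borderline"
    return "Weak"

bands = [embedding_strength(s) for s in (0.09, 0.24, 0.35, 0.51)]
```

The Strong band starts at 0.35, the same value as the B-grade production floor on the letter scale.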

The full Embedding Audit methodology, with the Screaming Frog config and the cross-site scoring approach, lives in the published embedding audit walkthrough. Operators without Screaming Frog can run the same workflow with a Python notebook. OpenAI embeddings API plus scikit-learn cosine similarity. About one hour of setup.

The 5-Question Self-Audit Before Publishing

The self-audit is the author-side gate. Run it on every asset before it publishes, no matter the type. The audit takes only a few minutes. It catches about 80% of the failures that would otherwise drop IGS below the production target.

Question 1: What is the single most specific claim on this page? Identify it. If the answer is a hedged statement, a best-practice summary, or a topical overview, the page is not yet citable. Citable assets have one load-bearing specific claim. The exercise of forcing the question shows whether the page is info-dense or context-dense.

Question 2: How many distinct, sourced insights does this page hold? Count them. Apply the IGD criteria (concrete, sourced, not in the saturation set, citable word-for-word). If the count is under five, the page is below the production floor. Do not publish it in a packed category.

Question 3: For every statistical claim on the page, are all six parts of the complete-format pattern there? Sample, methodology pointer, source, date, parsable HTML, and the claim itself. Missing parts are the most common reason statistics get cited as decoration but not as load-bearing data.

Question 4: Could the substance of this page survive being summed up into a 100-word answer? AI Search responses are short. If the page's substance only works at full length, the model will not cite it. The compression test is the closest proxy for retrieval behavior.

Question 5: What is the methodology page for this asset, and where does it live? If the asset is research, a benchmark, an audit, or any work that draws findings from a process, the methodology page should exist with its own URL before you publish. If it does not, build it now.

Drafts that pass all five questions sit in the B+ to A+ IGS band in most packed categories. Drafts that fail any of the five reliably underperform on citation, no matter the other quality marks.
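Questions 2 and 3 are mechanical enough to sketch as a pre-publish check. The example below is one possible shape, not a Searchbloom tool. The field names, the dictionary layout, and the function names are hypothetical. The six parts follow the list in Question 3, and the floor of five follows Question 2.

```python
# The six parts named in Question 3, as hypothetical field names.
REQUIRED_PARTS = (
    "claim", "sample", "methodology_pointer",
    "source", "date", "parsable_html",
)


def missing_parts(stat_claim):
    """Return the parts that are absent, empty, or False in a claim record."""
    return [part for part in REQUIRED_PARTS if not stat_claim.get(part)]


def pre_publish_check(insight_count, stat_claims, floor=5):
    """Collect Question 2 and Question 3 failures as readable messages."""
    problems = []
    if insight_count < floor:
        problems.append(
            f"only {insight_count} sourced insights; floor is {floor}")
    for index, claim in enumerate(stat_claims, start=1):
        gaps = missing_parts(claim)
        if gaps:
            problems.append(f"statistic {index} is missing: {', '.join(gaps)}")
    return problems


draft_claims = [{
    "claim": "Example claim text",
    "sample": "n=120",
    "methodology_pointer": "/methodology/example",
    "source": "Example survey",
    "date": "",              # not yet filled in
    "parsable_html": False,  # still locked inside an image
}]
for problem in pre_publish_check(4, draft_claims):
    print(problem)
```

An empty result means the draft clears the two countable questions. Questions 1, 4, and 5 still need an editor's judgment.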

Worked Example: The MERIT Framework as Information Gain

MERIT as a High-IGD Framework Asset

The MERIT Framework was drafted with information gain architecture in mind from the start. Counting the IGD contributors at the canonical whitepaper's launch:

Technique 5 (named framework). A five-pillar lens that organizes AI SEO in a way no other published framework had. The pillar names (Mentions, Evidence, Relevance, Inclusion, Transformation) are vocabulary the corpus did not hold.

Technique 6 (operational catalog). Fifteen named strategic techniques, three per pillar, each with a path to apply.

Technique 1 (proprietary data). Observations drawn from sustained Searchbloom partner engagements across 40+ programs.

Technique 3 (counter-claims). Many specific positions counter to industry consensus. One is the claim that AI cites opinion-based content at rates close to empirical research when source authority is established.

Technique 11 (cross-disciplinary recombination). The Evidence-Mentions feedback loop frames asset work through a citation-flow lens borrowed from academic citation analysis.

Technique 12 (reproducible decision rules). The decision rules in each pillar can be used by other operators without further reading.

Outcome. Measured IGS at first launch: 0.51 against the top 10 organic results for "AI SEO framework" (A grade). Once the asset had time to compound, citation share for the target query climbed substantially across ChatGPT and Perplexity answers, as measured via Profound. The framework name pulled steadily for nearby queries about AEO, GEO, and AI SEO method.

The IG Decay Curve and Saturation Set Drift

An IGS is a snapshot, not a permanent property. Today's A-grade asset is tomorrow's B-grade asset when competitors publish similar substance. The IG Decay Curve is the rate at which an asset's IGS drops over time as the saturation set absorbs the same insights. Tracking the curve drives the refresh cadence and tells programs which assets need active maintenance and which earn citations for years on autopilot.

Figure 5. The IG Decay Curve. A line chart plots information gain score on the vertical axis against time since publish on the horizontal axis, with no fixed durations. Three curves all start at an A grade near 0.52. The fast-moving category curve drops steeply and crosses the production floor at 0.35 early. The medium-moving curve crosses the floor later. The slow-moving curve stays above the floor across the whole range. The same A-grade page decays at very different rates by category, which is what sets the refresh cadence: fast-moving categories need refreshes, not just new assets.

The decay rate varies by category and technique mix.

Fast-moving categories (AI tooling, AI Search itself, MarTech). Short IGS half-life. The saturation set absorbs new insights quickly. An A-grade asset slides to B fast without refresh. Mid-market brands in these categories need to weight Evidence effort toward refreshes, not just new asset development.
Medium-moving categories (B2B SaaS broadly, professional services, e-commerce). Moderate IGS half-life. The saturation set moves but absorbs more slowly. A moderate refresh cadence keeps assets in their starting IGS band.
Slow-moving categories (regulated industries, deep-niche enterprise). Long IGS half-life. The saturation set evolves slowly. An infrequent refresh cadence is enough for most assets.

The Saturation Set Drift Tracker is the operational practice that catches this kind of corpus drift before it shows up in citation share. Each quarter, re-pull the top 10 SERP results for each high-priority asset's target query. Embed the new top 10. Compute the new IGS using the same Embedding Audit workflow. Compare it to the original IGS. Three patterns emerge.

Stable saturation set. The new top 10 overlaps heavily with the original top 10. IGS scores hold within 0.05 of the original. No refresh needed.
Rotating saturation set. The new top 10 has 3 to 5 new entrants. IGS drops by 0.05 to 0.15. Light refresh (update statistics, add one new sub-section addressing the new entrants) restores the IGS within one grade of the original.
Displaced saturation set. The new top 10 has 6+ new entrants, often led by a competitor asset with materially better substance. IGS drops by 0.15+. The asset has been displaced. Deep refresh or retirement (per the Asset Retirement Decision Framework in Chapter 4) is the right move.
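The three patterns can be sketched as a simple classifier over two signals: the number of new entrants in the top 10 and the IGS drop. The thresholds come from the descriptions above. Two choices here are assumptions, not part of the published tracker: when the two signals disagree the sketch takes the more severe pattern, and zero to two new entrants is read as "overlaps heavily". The function name is hypothetical.

```python
def classify_drift(original_urls, new_urls, original_igs, new_igs):
    """Classify saturation set drift as stable, rotating, or displaced."""
    entrants = len(set(new_urls) - set(original_urls))
    # Round to sidestep floating-point noise at the band edges.
    drop = round(original_igs - new_igs, 4)
    if entrants >= 6 or drop >= 0.15:
        return "displaced"
    if entrants >= 3 or drop > 0.05:
        return "rotating"
    return "stable"


original = [f"https://example.com/{i}" for i in range(10)]
# Four of the original ten results replaced by new entrants.
requery = original[:6] + [f"https://example.org/new{i}" for i in range(4)]
print(classify_drift(original, requery, 0.51, 0.41))  # rotating
```

Running the classifier per asset each quarter gives a short list of pages that need a light refresh and a shorter list that need a deep refresh or retirement decision.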

Most programs do not run the drift tracker. They notice IG decay only when citation share starts dropping, which lags the actual decay by a meaningful window. The drift tracker catches the decay at the embedding layer first. That window is enough to refresh the asset before the citation curve flattens. Programs that run the drift tracker quarterly hold their citation share over long windows. Programs that skip it see citation share decay reliably after the original publish date.

The Information Gain Brief

Most asset programs underperform their IGD potential because the brief stage treats information gain as something to add later, in editing. The reliable pattern is to brief for IG up front. Brief-stage IGD targets raise the production-stage outcome by a factor of 2 to 3 in our measured programs, because the work each technique needs (interviews, data pulls, methodology design, named-expert outreach) has a lead time the edit stage cannot recover.

The Information Gain Brief is a one-page document the author and editor agree on before drafting starts. It covers six fields.

Target query and saturation set. The query the asset is built around, plus a printed snapshot of the current top 10 organic results. The saturation set is the substance the page must add gain against. Naming it at the brief stage stops the most common failure mode: drafting first, then learning later that the substance already exists.
IGD target with technique assignments. The target insight count (five to seven for packed categories, three to four for narrower-scope assets). Each insight slot is tied to one of the 12 techniques. Drafting against a technique-assigned slot is far more reliable than drafting against an abstract "be original" note.
Statistical claim inventory. Each numeric claim the asset will lean on, listed with the planned source, sample, method, and date. Logged at brief stage so missing parts surface before writing rather than at launch.
Methodology page status. Whether the asset needs its own methodology page (research, audit, benchmark, steady observation). If so, the planned URL and scope.
Counter-claims and named positions. The specific positions the asset will defend that differ from category consensus. Drafted as one-sentence claims the author commits to defending in the body.
Topical cluster placement. Where this asset sits in the brand's existing topic neighborhood. What nearby assets it links to or from. Isolated high-IG assets earn citations. Clustered high-IG assets compound.
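The first two fields of the brief lend themselves to a structured record that can be checked before drafting starts. The sketch below is one possible layout, not a Searchbloom template. The class names, field names, and the "packed" and "narrow" scope labels are hypothetical. The minimum counts follow the IGD targets above, and only the low end is enforced.

```python
from dataclasses import dataclass, field

# Low end of the target insight count per scope (five to seven, three to four).
TARGET_MINIMUM = {"packed": 5, "narrow": 3}


@dataclass
class InsightSlot:
    summary: str
    technique: int  # number of the technique in the 12-technique catalog


@dataclass
class InformationGainBrief:
    target_query: str
    saturation_set: list  # URLs of the current top 10 organic results
    scope: str            # "packed" or "narrow" (hypothetical labels)
    slots: list = field(default_factory=list)

    def problems(self):
        """Return brief-stage issues with the IGD target and assignments."""
        issues = []
        minimum = TARGET_MINIMUM[self.scope]
        if len(self.slots) < minimum:
            issues.append(
                f"{len(self.slots)} insight slots; target starts at {minimum}")
        for slot in self.slots:
            if not 1 <= slot.technique <= 12:
                issues.append(
                    f"'{slot.summary}' has no valid technique assignment")
        return issues


brief = InformationGainBrief(
    target_query="example query",
    saturation_set=["https://example.com/1"],
    scope="packed",
    slots=[InsightSlot("customer interview finding", 1)],
)
print(brief.problems())
```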

The IG Brief is not a creative limit. It is the production discipline that turns asset work into citation share. Programs that brief consistently produce predictable IGS bands in the A- and B+ range. Programs without the discipline get wide variance and below-target citation results, no matter how talented the author is.

Platform-Specific Considerations

Different AI systems weight information gain in different ways. The production-target IGS band (B and above) earns citations across all platforms. The differences show up at the margins.

ChatGPT. Weights structural clarity. High-IGD content shown as discrete list items, numbered steps, or tabular comparisons over-indexes here. Long paragraphs with the same gain do worse because the model cannot pull discrete units as cleanly. The AirOps March 2026 finding that lists and tables appear in nearly 80% of ChatGPT citations, versus 29% of Google's top results, applies here.
Claude. Weights methodology disclosure and source authority. High-IGD content with a written method and named-expert attribution over-indexes. Reactive opinion without a method does worse than on other platforms.
Perplexity. Weights recency and community talk along with primary source authority. High-IGD content with recent date stamps and community distribution (Reddit, Hacker News, LinkedIn) over-indexes. Static evergreen content with high IGD does worse here than on Claude.
Google AI Overviews. 97% of AIO answers cite at least one source from the top 20 organic results (seoClarity February 2025). High-IGD content that ranks in the top 20 is the strong combo. High-IGD content that does not rank organically rarely shows up in AIO no matter how good the substance is. AIO inherits the organic candidate set as its retrieval input.
Gemini. Tracks close to Google AI Overviews. Shared retrieval stack. The same organic-ranking dependency applies.
Microsoft Copilot. Weights LinkedIn content and Bing-indexed sources. High-IGD opinion shared on LinkedIn over-indexes here. The same opinion published only on an owned domain does worse.

The cross-platform pattern is steady. Information gain is the core condition across all systems. The differences are in which parts of presentation each system rewards on top of the gain.

Industry Variants

Ben Wills's March 2026 research covered 145 industries, 1,595 personas, and 105,000+ LLM prompts. It showed that gain compounds through different signal paths in different industries. The IGD work is steady across categories. The top IGS lever varies.

Wikidata-dominant categories (accounting software, CRM software, baby care brands, budget hotel chains). IGS compounds through entity-level claims coded into Wikidata and Wikipedia. Information gain as factual, sourced, named-entity claims raises citation share most directly.
Wikipedia-citation-dominant categories (CRM software at rho=0.577). IGS compounds through citations to your work in Wikipedia articles. Information gain as research, frameworks, and named methodologies that meet Wikipedia inclusion thresholds raises share most directly.
Harmonic-centrality-dominant categories (affiliate marketing networks, auto insurance, brokerage and wealth management apps). IGS compounds through the link graph centrality of your work. Information gain as embeddable data, tools, and unique-result URLs raises share most directly.
SE-outbound-link-dominant categories (agricultural equipment, beauty and cosmetics retail, beer brands). IGS compounds through the breadth of third-party citation. Information gain as templates, calculators, and reference assets that get listed widely raises share most directly.
Backlink-count-dominant categories (car rental brands). IGS compounds through earned backlinks. Information gain as utility-driven assets that earn organic links raises share most directly.

Cross-check your category against the Wills correlations at the brief stage. The IGD techniques you pick should lean toward those that produce the top signal type for the category.

Common Mistakes That Defeat Information Gain

1. Restating consensus in fresh language. Operators paraphrase the corpus and assume new phrasing equals new substance. It does not. The model maps both phrasings to nearby vectors. Counter-test. Can you state the central claim of your page in one sentence that would not show up in any of the top 10 SERP results? If not, the page is a restatement.

2. Adding context where the corpus already has it. Long intros that explain "what is AI Search" before the load-bearing claim water down the page's vector with material the saturation set already holds. The fix is structural. Lead with the specific claim. Push context to the end or cut it. Let the intro earn its keep through substance.

3. Statistics without methodology. A number without a documented source, sample, or method earns lower citation rates than a qualitative claim with a clear source. The lopsided failure mode is operators paying for research and then publishing the findings without the methodology page. That leaves the citation lift on the table.

4. Image-locked claims. Statistics, framework diagrams, and key claims locked inside infographics without parsable HTML are invisible to retrieval. The dual-format rule applies without exception. Every visual that carries substance needs a parsable HTML companion next to it.

5. Hedged opinion. "Five things to consider" listicles and "it depends" pieces give zero counter-claims and zero sourced specifics. The opinion technique only adds gain when the position is sharp enough to disagree with. Hedging is the most common author-side compromise. It is the most reliable IGD-killer at the edit stage.

6. Gated content. Gates make the substance AI-invisible. AI systems cannot fill out forms. Gating research, calculators, or frameworks erases their IGS no matter how high the gain is. The right pattern is ungated base content with optional gated enhanced layers, covered in Chapter 4.

7. Internal contradiction with the canonical whitepaper or company position. Drafts that fight the brand's documented framework, method, or stated position cut entity-level coherence in retrieval. The model surfaces the consensus from your domain. Contradictory drafts water down that consensus. Counter-test. Does this page agree with your canonical whitepaper and the nearby chapters on the same topics?

8. Topical fragmentation. Brands publishing high-IGD content across too many unrelated topics water down the entity-topic link the model learns. The compounding effect of Evidence content depends on depth in a topic neighborhood. Counter-test. Does this draft sit inside an existing topical cluster the brand is building? Or is it isolated?

Questions & Answers

What exactly is information gain in AI Search? Information gain is the property of being net-new to the working corpus AI systems pull from and blend. A page has high information gain when its claims, data points, frameworks, or views are not already in the saturation set the model has indexed. AirOps's March 2026 analysis confirmed AI systems filter retrieval candidates on this property.

How is information gain different from originality or uniqueness? Originality is a copyright idea (whether your words are yours). Information gain is a content idea (whether your substance is new). Two pages can both be original and still differ widely in gain. A paraphrase of consensus is original but adds no gain. A short opinion with a counter-claim is high gain even in plain language.

What is the difference between IGD and IGS? IGD is the manual editorial count of distinct, sourced insights, done before you publish. IGS is the publish-time geometric measure: one minus the maximum cosine similarity between your page and the top 10 SERP competitors, mapped onto a 13-grade letter scale. IGD is what you write toward. IGS is what your final page measures.

Do I need an embeddings tool to apply this chapter? No. The 12-technique catalog and the 5-question self-audit are author-side gates that work before any embedding model touches the page. The Embedding Audit is a check step that adds rigor but is optional. Teams without the tooling can produce high-gain content using IGD as the production metric.

Is the 5-to-7 Rule arbitrary or grounded? Grounded in observation, not in a lab cutoff. Across competitive informational queries Searchbloom has audited, the median number of distinct, sourced insights in the AI Overview citation set sits between five and seven. Pages with fewer than five rarely earn share. Pages with eight or more do not earn more in proportion.

What if my topic is truly saturated and there is nothing new to say? Three options. Reframe the topic at a narrower scope where saturation drops. Add primary substance the saturation set lacks (interview customers, run a small benchmark, publish a documented audit). Or do not publish on that topic. Effort moves better to topics where the saturation set is thin.

How do statistical claims affect information gain? Statistical claims raise IG when they are extractable and unique. The complete-format pattern (claim, number, sample, method, source, date) makes each claim citable as a discrete unit. The same statistic locked inside an image earns near-zero citations no matter how good the research is.

Where does methodology documentation fit? Second-most-valuable IG artifact after the claim itself. AI systems weight claims with a documented method higher than claims without. The analyst tier reads the method before the findings. Treat the methodology page as its own citable asset with its own URL.