Information Gain Density: The 5-to-7 Rule for AI Citation

Updated May 6, 2026

"Information Gain Density is the only content metric I have ever measured that predicts AI citation before the page is published. We invented it at Searchbloom because the patent everyone was citing did not give operators a unit of measurement. Now we have one."

~ Cody C. Jensen, CEO & Founder, Searchbloom

Information Gain Density (IGD) is a Searchbloom-invented framework for measuring the count of distinct, original, attributable insights a page carries against the saturation level of the competing content already ranking for the same query. It builds on the foundation laid by Google's 2018 information gain patent and extends that foundation into something the patent never intended to be: an operational, per-page, per-section unit of measurement that content teams can engineer toward before they hit publish. The framework is the answer to the single most-asked question I get from operators who have read the patent and the popular SEO commentary on it: how much new information do I actually need?

The patent itself does not answer that question. It was never designed to. It describes a session-level personalization mechanism, scoring how much new information a document contains relative to documents one specific user has already viewed in their session. Useful as that mechanism is, it leaves the content author with nothing to act on. You cannot author a page differently based on what a stranger has read in the last twenty minutes. You can, however, author a page differently based on how dense it is with original insights compared to the corpus of pages already ranking. That second framing is what Searchbloom invented. We call it Information Gain Density. The parent article on information gain in SEO introduces the concept in context. This article is the deep dive on the framework itself.

Figure: Information Gain Density is measured against the saturation set of top-10 competing pages. A page that repeats the cluster gets absorbed; a page with 5 to 7 original insights lands in unoccupied vector space.

Image description: A vector-space diagram showing how Information Gain Density is measured relative to the top 10 competing pages for a query. Ten gray dots cluster tightly around a centroid representing the saturation set, the things every reader of the topic has already seen. A teal dot sits outside the cluster labeled 'your page (high IGD),' indicating its embedding lands in an unoccupied region of vector space and earns AI citation slots. A second light-gray dot inside the cluster labeled 'your page (low IGD, absorbed)' shows what happens when a page repeats the saturation set. A dotted boundary around the cluster marks the saturation threshold. The 5-to-7 Rule is the empirical benchmark: most competitive topics in 2026 require 5 to 7 distinct original insights for a page's embedding to land outside the saturation cluster.

TL;DR

Information Gain Density (IGD) is a Searchbloom invention. Cody C. Jensen and the Searchbloom team coined the framework in response to a gap in the literature: every commentator was citing the same Google patent, and none of them were giving operators a unit of measurement. IGD is the unit.
It builds on Google's information gain patent but does not measure the same thing. The patent (US 11,354,342 B2) describes a session-level personalization signal. IGD takes the underlying intuition (novelty matters) and reframes it as a content-author-controllable density metric across the corpus of competing pages.
The unit is the distinct, original, attributable insight. An insight counts toward density if it is specific enough to be quoted, traceable to its source, and would fail the "could a language model have generated this paragraph from existing public sources" test.
The 5-to-7 Rule is the empirical benchmark: most competitive topics in 2026 require 5 to 7 distinct original insights for a piece to credibly compete in AI Overviews, ChatGPT, Claude, Perplexity, and Google AI Mode. Lower-saturation topics need fewer; the most saturated topics need more.
Density is per-section, not per-thousand-words. Third-party RAG engines retrieve passages; Google's AI Overviews and AI Mode run on its core Search ranking and quality systems, not a raw passage contest. The right unit of analysis is the H2 or H3 section, each of which should carry at least one original insight that does not appear in the top ten ranking pages.
Density compounds across the MERIT Framework. Evidence is where you author IGD most directly. Mentions delivers it through third-party content. Relevance elevates it through structural change. Inclusion makes it discoverable. Transformation measures and compounds it over time.
IGD is the count framework. The Information Gain Score (IGS) is the math. The companion piece in this hub-and-spoke set walks through the IGS formula in detail. This article focuses on the framework itself.

What Information Gain Density Actually Is

Information Gain Density is the count of distinct, original, attributable insights in a piece of content, measured against the saturation level of the existing top-ranking pages for the same query. The phrase has three load-bearing components, and each one is doing specific work.

Distinct means the insights are non-redundant. Two paragraphs that say the same thing in different words count as one insight, not two. Padding does not increase density.

Original means the insight does not already appear in the top-ranking pages for the query. Original is a relative property. A claim that is novel in one corpus is commonplace in another. The point of comparison is always the existing top-ranked content, not the entirety of the indexed web.

Attributable means the insight is traceable to its source. A statistic is attributable to the dataset it came from. A quote is attributable to the person who said it. A failure mode is attributable to the engagement or operation it was observed in. Insights without attribution still count toward word count but do not count toward density, because AI engines weight cited and authoritative sources more heavily and unattributed claims drift toward synthesis.

Density is the count divided by the saturation of the topic, not by word count. A 1,500-word article in a sparsely-covered topic can have higher IGD than a 9,000-word article in a saturated topic. The mistake most content teams make is treating density as a word-count efficiency metric. It is not. It is a saturation-relative measurement of how much your page adds beyond what already exists.

Why Density and Not Volume

The intuition behind density is that AI search engines retrieve passages and synthesize answers. They are not paying you for word count. They are paying you for the count of citation-eligible passages your page contributes to a synthesized answer. A page with eight original insights spread across eight sections has eight passage-level citation opportunities. A page with eight original insights buried inside a single 4,000-word section has one passage-level citation opportunity, because the surrounding context dilutes the embedding of any single passage and the retrieval layer cannot lift one insight cleanly without dragging in the rest.

This is why density, not volume, is the right unit. Volume rewards length. Density rewards architecture. The architecture of a high-density page is the architecture AI retrieval is built to read.

Why "Density" Was the Word We Picked

We considered alternatives: information gain count, information gain ratio, insight density, citation surface area. We landed on density because it carries the right physics. Density is a ratio of substance to space. A high-density page packs more citation-eligible substance into the same passage real estate than a low-density page. Two pages of identical IGS can have very different IGD if one concentrates its differentiation in three paragraphs and the other distributes it across twelve sections. Density was the only term we tested that captured both the count and the distribution.

The Origin: How Information Gain Density Was Invented at Searchbloom

The framework did not arrive in a single moment. It accumulated across roughly eighteen months of measuring what happened when we added different types of original content to partner pages and watched the rankings, the AI Overview citations, and the LLM mentions move (or not). The literature gave us the term information gain. The patent gave us the underlying mathematical idea. Neither gave us a way to operationalize the concept inside a content workflow.

Three observations forced the framework into existence.

Observation 1: The Patent Describes Personalization, Not Content Strategy

The patent every information-gain article cites is US Patent 11,354,342 B2, "Contextual Estimation of Link Information Gain," filed by Victor Carbune and Pedro Gonnet at Google in October 2018 and granted in June 2022. The abstract describes a method for determining a score "indicative of additional information that is included in the document beyond information contained in documents that were previously viewed by the user." Read carefully, the patent describes session-level personalization. The score is computed against documents one user has already seen, not a global novelty signal across the indexed web.

The popular SEO commentary treats the patent as proof that Google rewards globally novel content. The patent itself describes filtering redundancy from a single user's session. Both are useful ideas; they are not the same idea, and conflating them produced two years of advice that did not match the patent's actual scope. What the patent does not provide is a way for an author to know, before publishing, whether their page is novel enough to win citation slots in the AI search ecosystem it could not anticipate. Information Gain Density is the framework we built to fill that gap.

Observation 2: AI Engines Retrieve Passages, Not Pages

The mechanics of retrieval-augmented generation became visible to us in early 2025 as we instrumented partner pages and tracked what got cited. AI Overviews, ChatGPT search, Perplexity, Claude search, and Copilot Search in Bing all decompose queries into sub-queries (query fan-out) and retrieve documents per sub-query, then synthesize across the retrieved chunks. Citations land at the passage level. A page with one buried insight loses to a page with seven discrete sectioned insights, even when the buried-insight page has more total information. That observation forced the unit of analysis to shift from page-level novelty to passage-level density.

Observation 3: The Number of Insights Was Surprisingly Consistent

We started counting. Across roughly four hundred priority pages we audited or produced through 2025 and into 2026, the count of distinct original insights on a page that won a citation slot in AI Overviews clustered in a tight band: five to seven, with occasional outliers above. The count of distinct original insights on an absorbed page was almost always zero or one. The threshold was empirically obvious once we started counting consistently. That count produced the 5-to-7 Rule.

The 5-to-7 Rule

The 5-to-7 Rule says that for most competitive topics in 2026, a piece of content needs five to seven distinct, original, attributable insights to credibly compete for citation in third-party RAG engines like ChatGPT, Perplexity, and Claude. On Google surfaces this is standard helpful-content SEO; Google states its generative AI features run on core Search ranking and quality systems and need no special structuring. Lower-competition topics need fewer; the most saturated topics need the full count or more. The number is empirical, not theoretical, and was derived from auditing the original-insight count of pages that won AI Overview citation slots versus the pages that got absorbed into AI synthesis.

Empirical pattern across roughly 400 priority pages: absorbed pages had 0 to 1 original insights; cited pages had 5 to 7.

Three operating heuristics underpin the rule.

Heuristic 1: One Original Insight Per Major Section, Minimum

Each H2 or H3 should carry at least one citable, specific, attributable thing that does not appear in the top ten competing articles. Sections that fail this test get absorbed into AI synthesis without attribution; sections that pass earn citation slots. AI engines retrieve passages, and they cannot retrieve a passage that does not contain something worth retrieving. The discipline this heuristic enforces is that nothing publishes without a passage-level review.

Heuristic 2: Compete on Saturation, Not Volume

Pull the top ten ranking articles for the target query. Count their original insights. Beat that count by 2x. Most existing articles average zero to one original insight per page, which is why a piece with five to seven original insights dominates citation slots even when the existing pages are longer. Vector embeddings cluster around shared centroids: an article that repeats the baseline gets pulled into the consensus cluster, while one that adds five to seven novel insights lands in an unoccupied region where the retrieval layer can identify and attribute it cleanly.
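A minimal sketch of the arithmetic in this heuristic, assuming the competitor insight counts have already been tallied by hand. The function name `target_insight_count` is hypothetical, and reading "beat that count by 2x" as doubling the highest competitor count, with the lower bound of the 5-to-7 Rule as a floor, is an illustrative assumption rather than a Searchbloom-specified formula.

```python
def target_insight_count(competitor_counts, floor=5):
    """Return a target count of original insights for a new page.

    Assumption: "beat that count by 2x" means doubling the highest
    competitor count, and the 5-to-7 Rule's lower bound is the floor.
    """
    if not competitor_counts:
        return floor
    return max(floor, 2 * max(competitor_counts))


# Hypothetical hand-tallied counts for the top ten ranking pages.
top_ten = [0, 1, 0, 0, 1, 0, 2, 0, 1, 0]
print(target_insight_count(top_ten))
```

The floor matters because doubling a count of zero or one would otherwise set a target below the range the rule describes.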

Heuristic 3: Own Two to Three Sub-Questions Outright

AI Overviews break a topic into sub-questions and pick the best source for each. The strategic move is not to be the best source on the head term; it is to be the definitive source on a few specific sub-questions inside the topic. Two to three is the right number for most engagements. More than that and you dilute. Fewer than that and you are competing on the head term, which is the hardest battle and the one with the least durable wins. For this article, the sub-questions Searchbloom is staking out are "what is Information Gain Density," "how was IGD invented," and "how do you count distinct insights on a page."

The Three Criteria for a Counted Insight

Not every paragraph that feels novel counts toward density. We use three criteria, applied as a pre-publish checklist on every section.

Figure: A candidate insight only counts toward density if it passes all three tests. Failing any single criterion sends the paragraph to the absorbed pile, regardless of writing quality.

Image description: A three-stage decision flowchart for whether a candidate insight counts toward Information Gain Density. The first stage tests whether the claim is specific enough to be quoted as a standalone passage (passing example: '400% lift across 25 engagements, 90-day window' with five named specifics; failing example: 'Original content matters more than ever' which is too vague). The second stage tests whether the claim is attributable to a named source like a dataset, speaker, engagement, or operational context (passing if attributed; failing if a stat with no provenance). The third stage tests whether a current-generation language model could have generated the claim from existing public content (passing if the claim is irreducible substance like proprietary data or voice; failing if it summarizes common knowledge well). A candidate that passes all three stages counts toward the page's IGD; failing any single stage means the candidate does not count, no matter how well-written.

Criterion 1: Specific Enough to Quote

The insight must be specific enough that a single sentence or short passage from it could be lifted out of the article and used as a standalone citation. Vague claims do not pass. "Original content matters more than ever" does not pass. "Across 25 Searchbloom partner engagements over a rolling 90-day window, adding a direct subject matter expert quote produced an average 400% lift in blended visibility across AI Mode, AI Overviews, Copilot Search in Bing, LLM citations, and organic Google rankings" passes. The difference is the specificity of the claim. The first sentence could be written by any LLM in five seconds. The second contains five specific facts (25 engagements, 90-day window, 400% lift, the named surfaces, the named provenance) that produce a unique embedding and a citation-eligible passage.

The specificity test is also a writing test. If you cannot write the insight specifically, you do not actually know the insight. Most "could a language model have written this" failures are knowledge failures, not writing failures.

Criterion 2: Attributable to a Source

The insight must be traceable. Statistics get attributed to the dataset. Quotes get attributed to the speaker. Failure modes get attributed to the engagement or the operational context. Frameworks get attributed to the team or individual who built them. Cross-domain syntheses get attributed to the operational experience that produced them.

Attribution does two things simultaneously. It satisfies the AI engine's preference for cited authority, which has tightened over the last twelve months as the major models have started weighting source provenance more heavily. It also forces the author to be honest about where the claim came from. Attribution failures usually surface sourcing problems: the author either did not have a source or invented one. Either failure produces content that AI engines absorb rather than cite, because the embedding does not match the patterns of authoritative sourced content.

Criterion 3: Pass the "Could ChatGPT Have Written This Paragraph From Existing Sources" Test

The third criterion is the one most content reviewers skip, and the one that catches the largest number of low-density passages. The question is whether a current-generation LLM, given access to existing public content on the topic, could have produced the paragraph in question. If the answer is yes, the paragraph does not count toward density, no matter how well-written it is.

This test is functionally a check against synthesis risk. AI Overviews and large language models are extremely good at synthesizing common knowledge. The pages that get cited are the ones that contain content the synthesis cannot produce. If your paragraph is reproducible from common knowledge, it will get reproduced in the synthesis, and your page will not be cited as the source. The test is not a question of writing quality. It is a question of whether the content has any irreducible substance that the synthesis layer cannot generate without you.
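The three criteria can be recorded as a simple pre-publish checklist. The sketch below is illustrative only: the `Candidate` class and its field names are hypothetical, and each flag is a human reviewer's judgment call, not something the code computes.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One candidate insight with a reviewer's call on each criterion."""
    text: str
    quotable: bool          # Criterion 1: specific enough to quote standalone
    attributed: bool        # Criterion 2: traceable to a named source
    llm_reproducible: bool  # Criterion 3: could an LLM write it from public sources?


def counts_toward_density(c: Candidate) -> bool:
    """A candidate counts only if it passes all three criteria."""
    return c.quotable and c.attributed and not c.llm_reproducible


candidates = [
    Candidate("Original content matters more than ever", False, False, True),
    Candidate("400% average lift across 25 partner engagements, 90-day window",
              True, True, False),
]
counted = [c for c in candidates if counts_toward_density(c)]
print(len(counted))  # 1
```

Failing any single flag drops the candidate, which mirrors the all-or-nothing rule described above.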

Density Is Per-Section, Not Per-Thousand-Words

The most common mistake in operationalizing IGD is treating it as a word-count efficiency metric. It is not. The unit of analysis is the section, not the page, and not the word.

Figure: Same insight count, same word count, very different citation outcomes. AI engines retrieve passages, not whole pages.

Image description: Side-by-side comparison of two pages with identical word count and identical insight count. The left page (Page A, Buried Density) concentrates eight original insights inside one large section, producing only one retrieval opportunity for AI search engines because the embedding of the single section blurs the eight insights together. The right page (Page B, Distributed Density) distributes the same eight insights across eight discrete H3 sections, producing eight retrieval opportunities because each section is independently retrievable as a passage-level answer to a fan-out sub-query. The visual makes the per-section density principle concrete: density is decoupled from word count and depends on architecture.

Most current AI search products use query fan-out: they decompose a query into sub-queries and retrieve documents per sub-query, then synthesize across the retrieved results. Each section of a page is independently retrievable for the sub-query it best answers. A page with eight sections, each carrying one original insight, contributes eight retrieval opportunities. A page with one massive section containing all eight insights contributes one retrieval opportunity, because the embedding of that single section blurs the eight insights into a noisy average.
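The eight-versus-one comparison can be made concrete with a toy model. This is a sketch under the simplifying assumption stated in the text, that each section is one retrievable passage; the function name `retrieval_opportunities` and the page structures are hypothetical.

```python
def retrieval_opportunities(sections):
    """Count sections that carry at least one original insight.

    `sections` maps a heading to the list of counted insights in it.
    Assumption: each section is one retrievable passage.
    """
    return sum(1 for insights in sections.values() if insights)


insights = [f"insight {i}" for i in range(1, 9)]

# Page A: all eight insights buried in a single section.
page_a = {"Everything": insights}
# Page B: the same eight insights, one per H3 section.
page_b = {f"Section {i}": [text] for i, text in enumerate(insights, start=1)}

print(retrieval_opportunities(page_a), retrieval_opportunities(page_b))  # 1 8
```

Both pages hold the same eight insights; only the sectioning changes the count.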

The architectural implication is that the page should be designed around sections that are each complete answers to discrete sub-questions. The headings should signal the question. The opening sentence of the section should answer the question directly. The supporting paragraphs should expand without burying the answer. This is the structural logic that makes a page friendly to AI retrieval. It is also the structural logic that makes a page friendly to a human reader who lands on the section from a long-tail search query, which is why these moves compound across both surfaces.

The Density Audit: How to Count Insights on a Page

The operational version of an IGD audit takes about an hour for a 4,000-word page:

Figure: The six-step audit workflow. Roughly an hour for a 4,000-word page; qualitative judgment is required at each step.

Image description: A six-step workflow diagram for conducting an Information Gain Density audit on a single page. Step 1: pull the top 10 ranking pages for the target query (organic top 10 or AI Overview citations, ~10 minutes). Step 2: extract every original-feeling claim, statistic, framework, named concept, quote, and failure mode from each competitor (~20 minutes). Step 3: aggregate the extractions into a single saturation set representing what every reader of the topic has already seen (~10 minutes). Step 4: audit the candidate page section by section against the saturation set (~15 minutes). Step 5: apply the three criteria (specific enough to quote, attributable, irreducible) to flagged candidates (~5 minutes). Step 6: compare the resulting count to the 5-to-7 Rule benchmark (below 5 means rework, 5 to 7 means publish, 8 or more means verify retrievability holds, ~1 minute). Total audit time is approximately 1 hour for a 4,000-word page.

1. Pull the top ten ranking pages for the target query (Google's organic top ten or the AI Overview citation set).
2. For each competitor page, list every original-feeling claim, statistic, framework, named concept, quote, or specific failure mode. Strip the boilerplate. What remains is the insight count for that page.
3. Aggregate across the ten competitors into a single saturation set: the things every reader of the topic has already seen.
4. Audit your own page section by section. For each H2 and H3, identify candidate insights and compare against the saturation set. Overlaps do not count. Non-overlaps are candidates.
5. Apply the three criteria to the candidates: specific enough to quote, attributable, could not have been generated by a language model from existing public sources. Passes on all three become counted insights.
6. Compare the count to the 5-to-7 Rule. Below five, the page needs more substance. Five to seven, competitive. Above seven, dominant tier (verify retrievability holds).

The audit is qualitative, not algorithmic. The third criterion requires a judgment call about what a language model can and cannot generate, and that judgment moves with the model.
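The bookkeeping half of the audit (aggregating the saturation set, filtering overlaps, and comparing the count to the benchmark) can be sketched in a few lines. Everything here is a hedged illustration: the function names are hypothetical, claims are assumed to be short labels a reviewer has already normalized by hand so that overlap reduces to exact matching, and `passes_criteria` stands in for the human judgment on the three criteria.

```python
def build_saturation_set(competitor_claims):
    """Union of every claim extracted from the competitor pages."""
    saturation = set()
    for claims in competitor_claims:
        saturation.update(claims)
    return saturation


def audit_page(sections, saturation, passes_criteria):
    """Keep claims that are absent from the saturation set and pass review."""
    counted = []
    for heading, claims in sections.items():
        for claim in claims:
            if claim not in saturation and passes_criteria(claim):
                counted.append((heading, claim))
    return counted


def verdict(count):
    """Compare the counted insights to the 5-to-7 Rule."""
    if count < 5:
        return "needs more substance"
    if count <= 7:
        return "competitive"
    return "dominant tier (verify retrievability holds)"


# Hypothetical, hand-normalized claim labels.
competitors = [{"novelty matters", "patent is session-level"}, {"novelty matters"}]
page = {
    "H2: Origin": ["patent is session-level", "400-page audit result"],
    "H2: Rule": ["novelty matters", "5-to-7 benchmark"],
}
saturation = build_saturation_set(competitors)
counted = audit_page(page, saturation, passes_criteria=lambda claim: True)
print(len(counted), verdict(len(counted)))
```

The code only does the tallying. Deciding whether two differently worded claims overlap, and whether a claim passes the three criteria, stays with the auditor.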

Where Information Gain Density Lives in MERIT

Searchbloom organizes AI search optimization through the MERIT Framework: Mentions, Evidence, Relevance, Inclusion, and Transformation. Information Gain Density does not live in one pillar. It flows through all five, and the form it takes shifts depending on the pillar.

Figure: Information Gain Density flows through all five MERIT pillars. Each pillar contributes a distinct form of density.

Image description: The MERIT Framework rendered as five pillars (Mentions, Evidence, Relevance, Inclusion, Transformation), each connected via dashed lines through a horizontal beam down to Information Gain Density. The diagram shows IGD as the cross-pillar substance that each pillar contributes to in a different way: Mentions delivers IGD authored by third parties, Evidence is where IGD is most directly authored by the publisher, Relevance elevates IGD through structural change, Inclusion makes IGD discoverable to AI crawlers, and Transformation measures and compounds IGD over time. IGD branches into the Information Gain Score (the math) and the 12 Information Gain Techniques (the operational catalog).

M is for Mentions: Density You Did Not Author

A third-party review on G2 or Trustpilot, a Reddit thread about your category, a YouTube video that names your framework, a LinkedIn post quoting your data: each one is citation-eligible content sitting in the corpus AI engines retrieve from when answering category-level queries about you. None of those pieces were authored by you. All of them carry density that gets attributed to you when the synthesis happens. Inside the IGD framework, earned media is a density activity, not a brand activity.

The independent evidence on this point has hardened. A 2025 follow-up GEO study (Chen et al., "Generative Engine Optimization: How to Dominate AI Search") reported "a systematic and overwhelming bias toward Earned media (third-party, authoritative sources) over Brand-owned and Social content." The operational read: owned-page density caps out without a parallel earned-media program feeding the Mentions pillar. Your authored pages can clear the 5-to-7 Rule. They cannot manufacture the third-party density the synthesis layer increasingly weights above owned content.

E is for Evidence: Density You Author Most Directly

Evidence is the pillar where IGD is most directly the publisher's responsibility. The 12 Information Gain Techniques (proprietary aggregate data, first-hand case studies, subject matter expert quotes, failure documentation, named failure modes with cohort specificity, pricing and economic transparency, process and timeline transparency, operational artifacts, decision frameworks with explicit tradeoffs, verbatim customer language, contrarian framings and strong stances, and cross-domain synthesis) are the operational catalog of how to author IGD on a priority page. The parent article on information gain in SEO walks through the full 12 in detail. The shorthand for content teams is that high-IGD pages use four to six of the techniques deliberately on every priority page.

R is for Relevance: Density Through Structural Change

The same underlying facts, restructured into discrete HTML sections with clear heading hierarchy and self-contained passages, produce a higher IGD because the embedding shifts. Language models parse HTML and markdown directly, so structural changes to visible content move the embedding. Schema markup does not, because schema is upstream of the language model: it helps the search indexes that the RAG layer retrieves from, not the language model itself. Relevance is the pillar where IGD gets unlocked from underneath bad architecture. A page with seven original insights buried in a single 4,000-word section has lower effective IGD than a page with the same seven insights distributed across seven discrete H2 sections, even though the count is identical.
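To review a page at the section level, it helps to see it the way a passage-oriented reader would: as heading-plus-body units. The sketch below uses Python's standard `html.parser` module to group visible text under the nearest preceding H2 or H3. The `SectionSplitter` class is a hypothetical helper for audits, not a description of how any AI engine actually chunks pages.

```python
from html.parser import HTMLParser


class SectionSplitter(HTMLParser):
    """Group visible text under the nearest preceding H2 or H3 heading."""

    def __init__(self):
        super().__init__()
        self.sections = []        # list of [heading_text, body_text]
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in ("h2", "h3"):
            self._in_heading = True
            self.sections.append(["", ""])

    def handle_endtag(self, tag):
        if tag in ("h2", "h3"):
            self._in_heading = False

    def handle_data(self, data):
        if not self.sections:
            return  # ignore text that appears before the first heading
        index = 0 if self._in_heading else 1
        self.sections[-1][index] += data


html = (
    "<h2>What is IGD?</h2><p>A count of original insights.</p>"
    "<h3>Why density?</h3><p>Engines retrieve passages.</p>"
)
splitter = SectionSplitter()
splitter.feed(html)
splitter.close()
for heading, body in splitter.sections:
    print(heading, "->", body.strip())
```

Each resulting pair is one unit to check for at least one original insight.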

The schema markup generator we publish is the small operational tool that comes out of this pillar. It does not directly increase IGD, but it makes the IGD that already exists more discoverable to the indexes that feed the retrieval layer.

I is for Inclusion: Density That Is Discoverable

Density that is not crawled and indexed is invisible to AI engines. The Inclusion pillar of MERIT covers crawlability, indexation, rendering, and the speed of the path from publication to retrieval (IndexNow, sitemap submission, internal linking from authoritative pages on the site). A page with eleven original insights that is blocked in robots.txt has zero effective IGD. A page with five original insights that is properly indexed and linked from the homepage has more effective IGD than the blocked page does. The mechanics matter.
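The robots.txt case is easy to check before publishing. The sketch below uses Python's standard `urllib.robotparser` to test whether a URL is crawlable and zeroes out the insight count when it is not. The helper names, the `ExampleBot` user agent, and the URLs are hypothetical, and the check covers crawl permission only, not indexation or rendering.

```python
from urllib.robotparser import RobotFileParser


def is_crawlable(robots_txt, user_agent, url):
    """Return True if the given robots.txt text allows user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def effective_igd(insight_count, robots_txt, user_agent, url):
    """Treat a blocked page as carrying zero effective density."""
    return insight_count if is_crawlable(robots_txt, user_agent, url) else 0


robots = """User-agent: *
Disallow: /drafts/
"""

# Eleven insights on a blocked page versus five on a crawlable one.
blocked = effective_igd(11, robots, "ExampleBot", "https://example.com/drafts/igd")
indexed = effective_igd(5, robots, "ExampleBot", "https://example.com/blog/igd")
print(blocked, indexed)  # 0 5
```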

T is for Transformation: Density That Compounds Over Time

Without measurement, you do not know which density efforts are working. Transformation is the pillar where the IGD audit becomes a recurring rhythm rather than a one-time exercise. Pages get re-audited as the saturation level of competing content shifts. New original insights get added to existing pages when the count drops below the threshold. Named concepts get reinforced across content so that the named-concept repetition starts to compound in third-party citations. The publishing schedule on our methodologies page reflects the cadence we recommend: priority pages get audited quarterly, the saturation set gets refreshed semi-annually, and the named-concept catalog gets reviewed annually.

Information Gain Density vs. Information Gain Score

The two frameworks operate at different levels of the same problem.

Information Gain Density is the count framework. It answers how much novelty does this page carry. The output is a count of distinct, original, attributable insights on the page, measured against the saturation set, with the 5-to-7 Rule as the operational benchmark.

The Information Gain Score (IGS) is the math. It answers how differentiated is this page from the existing top-ranked pages in vector space. The formula is one minus the maximum cosine similarity between the page's vector embedding and the embeddings of the top-ranking competitors. A score above 0.5 indicates meaningful differentiation. A score above 0.7 indicates substantial novelty. A score below 0.3 indicates near-duplicate content. The hub article maps these ranges to a 13-grade letter scale (F through A+) for editorial use; the practical citation threshold lands at B- (IGS 0.50), and the reliable-citation tier opens at A- (IGS 0.65).
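The formula as stated (one minus the maximum cosine similarity against the competitor embeddings) is short enough to sketch directly. The vectors below are toy two-dimensional stand-ins, not real embeddings. The function names are hypothetical, and the choice of embedding model is left open because the text does not specify one.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def information_gain_score(page_vec, competitor_vecs):
    """One minus the maximum cosine similarity to any competitor."""
    return 1.0 - max(cosine_similarity(page_vec, c) for c in competitor_vecs)


def interpret(score):
    """Map a score to the tiers named in the text."""
    if score > 0.7:
        return "substantial novelty"
    if score > 0.5:
        return "meaningful differentiation"
    if score < 0.3:
        return "near-duplicate"
    return "between the stated tiers"


page = [1.0, 0.0]
competitors = [[0.0, 1.0], [0.6, 0.8]]  # toy vectors for illustration
score = information_gain_score(page, competitors)
print(round(score, 2), interpret(score))
```

Because the score uses the maximum similarity, a single close competitor is enough to pull it down, however different the other nine pages are.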

The two frameworks corroborate each other in practice. Pages that score above 0.5 on IGS almost always pass the 5-to-7 Rule on IGD, and pages that score below 0.3 on IGS almost always fail it. The gap cases are where the two frameworks earn their keep: a page with high count but weakly-differentiated insights can pass IGD on a manual audit but fail IGS on the embedding math, which usually means the audit overcounted and the insights were less original than the auditor judged. A page with low count but strongly-differentiated insights can pass IGS but fail the 5-to-7 Rule, which usually means the page is novel on a narrow point but does not carry enough citation surface area to compete on a saturated topic.
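The agreement and gap cases can be laid out as a small decision table. This sketch assumes a pass mark of five counted insights for IGD and 0.5 for IGS, taken from the thresholds above. The function name `cross_check` and the returned labels are hypothetical.

```python
def cross_check(igd_count, igs, igd_floor=5, igs_floor=0.5):
    """Compare a manual IGD count against an IGS value and name the case.

    Assumption: five insights and an IGS of 0.5 are the pass marks.
    """
    passes_igd = igd_count >= igd_floor
    passes_igs = igs >= igs_floor
    if passes_igd and passes_igs:
        return "corroborated: competitive"
    if not passes_igd and not passes_igs:
        return "corroborated: needs more substance"
    if passes_igd:
        return "gap: audit may have overcounted"
    return "gap: novel but narrow"


print(cross_check(6, 0.32))
print(cross_check(2, 0.62))
```

The first call is the high-count, low-score case where the manual audit probably overcounted. The second is the page that is novel on a narrow point but lacks citation surface area.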

The companion piece in this hub-and-spoke set covers the IGS formula in operational detail: how to compute it with current embedding models, what the interpretation tiers mean, and where the math breaks down.

What Information Gain Density Is Not

The framework gets confused with adjacent concepts often enough that it is worth naming what IGD is not.

Not the Same as the Patent's "Information Gain"

The Google patent describes a session-level personalization mechanism. IGD describes a content-author-controllable density framework calibrated against the corpus of competing top-ranked pages. The two share an underlying intuition (novelty matters) but are different mechanisms. The patent's score is computed at retrieval time per user. IGD is computed at authoring time per page. Treating them as the same thing produces advice that does not match either mechanism.

Not the Same as Topical Authority

Topical authority measures the breadth and depth of a publisher's coverage across an entire topic over time. IGD measures the density of original insights in a single piece against the saturation of competing pages for a single query. The two are complementary. A site with topical authority but low IGD gets crawled but absorbed into AI synthesis. A site with high IGD but no topical authority struggles to be retrieved in the first place because the conventional ranking signals are absent. The strongest content programs build both: topical authority across a category over time, and high IGD on every priority page within the category.

Not the Same as Keyword Density

Keyword density is the frequency of a target keyword in a piece of content, expressed as a percentage of total word count. The metric is a relic of late-2000s SEO and has not been a meaningful ranking factor for over a decade. IGD has nothing to do with keyword frequency. The unit of measurement in IGD is the distinct original insight, not the keyword occurrence.

Common IGD Failure Modes

Operationalizing the framework on a real content team produces a recognizable set of failure modes. We have seen each one across multiple SEO engagements and the patterns are stable enough to name.

Failure 1: Over-Counting on the Audit

The most common failure is that the manual audit counts insights that do not actually pass the three criteria. The auditor judges a paragraph as original when it is in fact a rewording of common knowledge. The page scores 6 on the audit but 0.32 on the IGS math. Catching this failure requires a second auditor or, more reliably, the IGS calculation as a check on the manual count. The two together produce an honest signal. Either alone is gameable.

Failure 2: Density Without Retrievability

A page can carry seven novel insights and still not get cited because the page does not match the target query semantically. The retrieval layer never picks it up. We have seen this pattern most often on contrarian pieces written without clear query alignment: the substance is novel, the architecture is good, but the title and headings do not match the queries the substance answers. The fix is to bring the query alignment up to standard while preserving the novel substance, not to dilute the substance.

Failure 3: Insights That Are Not Attributable

A claim with no source attribution can pass the specificity test and the language-model test but fail the citation step at retrieval time, because the AI engine cannot find the provenance and weights the unattributed claim lower than a sourced equivalent from a competitor. Adding the source (the dataset, the engagement, the speaker, the operational context) is usually a one-sentence fix that converts a wasted insight into a counted one.

Failure 4: Density Concentrated in One Section

Eight original insights buried in a single 3,000-word section is not eight insights of density. It is one section of confused, retrieval-unfriendly density. The fix is structural: split the section, give each insight its own H3, and let the retrieval layer see them as discrete passages. We have measured citation lifts of 50 to 80% on pages where the only change was structural and the underlying substance was unchanged.
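A quick way to spot this failure before restructuring is to look at how the counted insights are distributed across sections. The sketch below assumes the audit already produced a per-section count. The function name and the 50% share cutoff are hypothetical choices for illustration, not part of the framework.

```python
from typing import Dict, List


def concentrated_sections(
    insights_per_section: Dict[str, int],
    max_share: float = 0.5,  # hypothetical cutoff, not a published threshold
) -> List[str]:
    """Return headings of sections holding more than max_share of all insights."""
    total = sum(insights_per_section.values())
    if total == 0:
        return []
    return [
        heading
        for heading, count in insights_per_section.items()
        if count / total > max_share
    ]


if __name__ == "__main__":
    audit = {"Overview": 0, "Methodology": 8, "FAQ": 1}
    # Sections returned here are candidates for splitting into separate H3s.
    print(concentrated_sections(audit))
```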

Failure 5: Density on the Wrong Page

High-IGD content on a page that does not have a retrievability path does not get cited at scale. Conventional rank is still the largest single source of AI citations, though the share has shifted noticeably between 2025 and 2026. Ahrefs' July 2025 study put 76% of AI Overview citations on pages ranking in Google's top ten. Their March 2026 republish across 4 million citations from 863,000 SERPs put that number at 38%, with the remainder coming from pages that rank for the fan-out sub-queries AI Overviews decompose the original query into. The directional read is the same in both datasets: density is a multiplier on top of conventional retrievability, not a substitute for it. Pouring six original insights into a page that ranks neither for the head term nor for any of its fan-out variants is a misallocated investment. The same six insights on a page that already ranks at position 8 (or that ranks well for two or three of the sub-queries the fan-out hits) produce measurable citation gains. We see this pattern most often in enterprise SEO engagements with deep priority page lists, where the temptation is to densify everything rather than densify the pages that are already retrievable.
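The prioritization logic can be sketched as a filter over a priority page list. This is illustrative only: the `Page` fields, the top-ten cutoff for "ranks well," and the minimum of two fan-out hits are assumptions read off the examples in the paragraph above, not a documented Searchbloom procedure.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Page:
    url: str
    head_term_rank: Optional[int]  # None means the page does not rank
    fanout_ranks: List[int] = field(default_factory=list)  # ranks on sub-queries


def is_retrievable(page: Page, top_n: int = 10, min_fanout_hits: int = 2) -> bool:
    """A page qualifies if it ranks for the head term or for enough sub-queries."""
    head_ok = page.head_term_rank is not None and page.head_term_rank <= top_n
    fanout_hits = sum(1 for rank in page.fanout_ranks if rank <= top_n)
    return head_ok or fanout_hits >= min_fanout_hits


def densify_first(pages: List[Page]) -> List[str]:
    """Return the URLs worth densifying before the rest of the list."""
    return [page.url for page in pages if is_retrievable(page)]


if __name__ == "__main__":
    pages = [
        Page("/guide", head_term_rank=8),
        Page("/spoke", head_term_rank=None, fanout_ranks=[4, 9, 30]),
        Page("/orphan", head_term_rank=None, fanout_ranks=[45]),
    ]
    print(densify_first(pages))
```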

Frequently Asked Questions

What is Information Gain Density?

Information Gain Density (IGD) is a Searchbloom-invented framework that measures the count of distinct, original, attributable insights in a piece of content against the saturation level of the existing top-ranking pages for the same query. It builds on Google's information gain patent (US 11,354,342 B2) but is not the same mechanism: the patent describes a session-level personalization signal, while IGD is a content-author-controllable density framework calibrated against competing pages. The operational benchmark is the 5-to-7 Rule: most competitive topics in 2026 require 5 to 7 distinct original insights for a piece to credibly compete in AI search.

Who invented Information Gain Density?

Cody C. Jensen and the Searchbloom team invented the framework. The phrase "information gain" existed before Searchbloom in the context of the Google patent and the popular SEO commentary on it. What did not exist, until Searchbloom built it, was the framework that turns the underlying intuition into a per-section, saturation-relative unit of measurement an author can engineer toward. The framework was developed across roughly eighteen months of measuring content performance across Searchbloom partner engagements through 2025 and into 2026.

How is Information Gain Density different from information gain?

The Google patent's information gain mechanism is computed at retrieval time, per user, against documents the user has already viewed in their session. It is a personalization signal. IGD is computed at authoring time, per page, against the corpus of top-ranking competitor pages for the target query. It is a content density framework. The two share an underlying intuition (novelty matters) but operate at different layers and produce different operational moves.

How do you count insights for Information Gain Density?

Pull the top ten ranking pages for the target query. List the original-feeling claims, statistics, frameworks, named concepts, quotes, and specific failure modes across all ten. The aggregate is the saturation set, the things every reader of the topic has already seen. Then audit your own page section by section. For each H2 and H3, identify candidate insights and compare them against the saturation set. The candidates that do not appear in the saturation set, are specific enough to quote, are attributable to a source, and could not have been generated by a language model from existing public sources are the counted insights. The total count is your Information Gain Density score for the page.
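The counting step reduces to a filter once the judgments are recorded. The sketch below assumes a human auditor has already answered each check for every candidate. The `Candidate` structure and its field names are hypothetical, and the code only tallies the answers; it does not automate the judgment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    text: str
    in_saturation_set: bool   # already appears in the top-ten competitor pages
    quotable: bool            # specific enough to quote
    attributable: bool        # has a named source
    llm_reproducible: bool    # a language model could produce it from public sources


def is_counted(c: Candidate) -> bool:
    """Apply the checks listed in the answer above to one candidate insight."""
    return (
        not c.in_saturation_set
        and c.quotable
        and c.attributable
        and not c.llm_reproducible
    )


def igd_count(candidates: List[Candidate]) -> int:
    """Total counted insights for the page."""
    return sum(1 for c in candidates if is_counted(c))


if __name__ == "__main__":
    page = [
        Candidate("proprietary benchmark", False, True, True, False),
        Candidate("reworded common advice", True, True, False, True),
        Candidate("unsourced novel claim", False, True, False, False),
    ]
    print(igd_count(page))
```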

Is Information Gain Density the same as Information Gain Score?

No. IGD is the count framework: it answers how much novelty a page carries. IGS is the math: it answers how differentiated a page is in vector space from the top-ranking competitors. The two corroborate each other in practice. Pages that pass the 5-to-7 Rule on IGD almost always score above 0.5 on IGS, and pages that fail the 5-to-7 Rule almost always score below 0.3. The companion article on Information Gain Score covers the math in detail.

Does Information Gain Density work for short-form content?

Yes. Density is decoupled from word count. A 1,500-word article with seven original insights has higher IGD than a 9,000-word article with one. Short-form content can compete in AI search if the count threshold is met against the relevant saturation set. The fact that most short-form content has low IGD is a function of how short-form content is typically written, not a property of length itself.

How long does it take to audit Information Gain Density on a page?

Roughly an hour for a 4,000-word page, including the saturation-set construction from the top ten competitors. The bottleneck is the qualitative judgment about what counts as original and what counts as boilerplate, which has not been reliably automated. Larger priority pages run longer; spoke pages with smaller saturation sets run shorter.

Can AI tools do an Information Gain Density audit automatically?

Partially. Embedding-based tools can compute the IGS math (the cosine-similarity calculation) reliably. The IGD count, by contrast, requires the qualitative judgment of the third criterion (could a language model have generated this paragraph from existing sources), which moves with the model. We use a combination of embedding math for the score and human audit for the count, and the two together produce a reliable signal. Either alone is gameable.

Where does Information Gain Density fit inside the MERIT Framework?

IGD flows through all five pillars of the MERIT Framework. Mentions delivers IGD authored by third parties (reviews, earned media, Reddit, YouTube, LinkedIn). Evidence is where IGD is most directly authored on the priority page (proprietary data, quotes, frameworks). Relevance elevates IGD through structural change (HTML passage hierarchy shifts the embedding). Inclusion makes IGD discoverable (without crawlability and indexation, IGD is invisible). Transformation measures and compounds IGD over time (the audit becomes a recurring rhythm rather than a one-time exercise).

What happens if you publish content with low Information Gain Density?

The content gets crawled and indexed but absorbed into AI synthesis without attribution. It may rank conventionally on Google for the target query but earns no AI citation share. The content contributes to topical authority breadth (which is real and useful) but does not compound into AI search visibility. Most low-IGD content has the additional problem of being hard to differentiate in conventional rankings as the SERP saturates with similar content, so the conventional ranking gains often erode over time as well.

The Bottom Line

Information Gain Density is the unit of measurement the information-gain literature was missing. The Google patent gave the SEO community a useful intuition (novelty matters) but did not give content authors a way to act on it. Searchbloom invented IGD to fill that gap: a per-section, saturation-relative count of distinct, original, attributable insights, calibrated against the corpus of top-ranking competitor pages, with the 5-to-7 Rule as the operational benchmark and the three criteria for a counted insight as the pre-publish checklist. The framework holds across content types, surfaces, and saturation levels, and the count threshold shifts predictably with the topic.

The companion piece on the Information Gain Score covers the math that corroborates the count: the cosine-similarity formula, the interpretation tiers, and the operational details of running the calculation against current embedding models. Read together, the two articles are the full operator's guide to the framework: the count and the math, the qualitative audit and the quantitative check, the framework Searchbloom invented and the math that proves it works.

Information gain is one of six components of Corpus Engineering, the systems-level discipline for AI visibility. The other five are corpus accessibility, semantic structure, corpus expansion, retrieval optimization, and corpus maintenance.