The Intra-Site Embedding Audit: Measuring Your Corpus Against Itself

"Your cross-site audit tells you how you compare to peers. Your intra-site audit tells you whether your own corpus stays topically aligned."

~ Cody C. Jensen, CEO & Founder, Searchbloom

The published Embedding Audit workflow measures your page against the top-10 SERP competitors. Cross-site, pre-publish, per page. The Intra-Site Embedding Audit is the sibling workflow. It measures your pages against each other. Same cosine math, different question, different output.

The intra-site question is not "how do I compare to competitors?" It is "does my own corpus stay topically aligned?" Pages that are too close to each other cannibalize retrieval. Pages that are too far from a topical center fall outside the cluster. Pages that drift over time, as new pages publish around them, lose their place in the internal graph.

The intra-site question matters more in 2026 than it did in 2024. AI search retrieval treats your site as a corpus, not a list of pages. Internal RAG systems, semantic search modules, related-content widgets, and AI-driven internal search all read your site through the same embedding lens. A site that stays topically aligned performs well across all of these. A site with internal drift performs poorly across all of them, with no single failure that explains why.

TL;DR

The Intra-Site Embedding Audit measures your own pages against each other using cosine similarity. Same math as the cross-site Embedding Audit. Different reference set. Different operational output.
It catches three things the cross-site audit misses: internal duplication, topical gaps, and retrieval-pipeline drift inside your site.
It is not a retrieval-eligibility test: page-to-page cosine measures internal duplication and gaps, not whether a page is retrievable for a query. Pair it with synthetic fan-out queries run through retrieval and with AI-bot crawl frequency from server logs.
Cadence: semiannual for a mature site. Quarterly for a site under active content build. Monthly for sites that run internal RAG.
Workflow: crawl, embed, pairwise cosine, cluster, gap detection, response.
Framework slot: Component 5 of Corpus Engineering, Retrieval Optimization. The intra-site audit is the diagnostic that tells Component 5 where to act.

Why an Intra-Site Audit Matters

A page that ranks well against top-10 competitors can still be the wrong page when an internal search, related-content widget, or AI assistant looks at your site. Cross-site scoring measures your competitive position. Intra-site scoring measures whether your own corpus stays topically aligned. The two answer different questions.

Internal duplication is the most common pattern. Two pages on related topics drift toward each other over a year of editing. Their pairwise cosine climbs from 0.55 at launch to 0.85 a year later. The retrieval system can no longer tell them apart. Both get pulled for the same query. Neither earns a clean citation. Cross-site audits do not flag this because each page still looks fine against its top-10 SERP peers.

Topical gaps are the second pattern. A site that grew through a series of priority pieces has clusters of strong pages and stretches of empty topical space between them. A query that lands in the empty space finds nothing in your corpus. The retrieval system either skips your site or pulls a tangentially related page. An intra-site audit makes the gaps visible.

The third pattern is retrieval-pipeline drift. A site running an internal RAG system or a semantic search module embeds its corpus once and serves the embeddings to users for months. As new pages publish, the embedding index drifts. Internal search results degrade quietly. Users get less relevant results and nobody traces it to the embedding layer.

What an Intra-Site Audit Is

[Figure: The Pairwise Cosine Matrix. An 8 by 8 grid represents 8 priority pages. The teal diagonal marks self-similarity. Two red off-diagonal cells mark duplicate pairs with cosine over 0.85. A light-blue band around the diagonal marks the cluster region of cosine 0.65 to 0.85. White cells mark gaps with cosine under 0.50. A legend on the right explains each color and lists three outputs: duplication report, gap report, drift report. Footnote: Pages 1 and 4, and pages 3 and 6, score too close. Pages 7 and 8 have nothing within 0.65 of them. Caption: Three signals from one matrix: duplication, gaps, and drift.]

The Intra-Site Embedding Audit is a workflow that embeds every priority page on your site. It then computes pairwise cosine similarity across the set. The output is three reports.

First, a duplication report. Pairs of pages with cosine over 0.85 that should consolidate or differentiate. Second, a gap report. Clusters of pages with empty topical space between them where the retrieval system has nothing to pull. Third, a drift report. Pages that have moved in cosine position over time, indicating the page or its neighbors have changed.

The workflow runs on the same embedding model the site already uses for retrieval. If you run an internal RAG system on text-embedding-3-large, the audit runs on text-embedding-3-large. If you do not run internal retrieval but plan to, pick the embedding model first and run the audit on that. The audit produces a snapshot of how the retrieval system sees your site.

The Workflow

[Figure: The Intra-Site Embedding Audit Workflow. A horizontal flow of six numbered steps connected by arrows from left to right: 1. Define set, 2. Crawl, 3. Embed, 4. Pairwise cosine, 5. Reports, 6. Respond. Footnote: Match the embedding model to production retrieval. Caption: Six steps. Run semiannually on a mature site. Monthly when internal RAG is live.]

Step 1: Define the priority page set

Pick the pages that matter for retrieval. Typically 100 to 500 pages for a mid-size site. Service pages, priority blog content, top product pages, key resource center pieces. Skip thin archives and pages with no retrieval value.

Step 2: Crawl and extract content

Run a Screaming Frog v22 crawl across the priority set. Extract the body text of each page using the content-area selector for your theme. Save each as a structured record (URL, title, H1, body text, last modified).
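One way to hold the structured record is a small dataclass written out as JSON Lines. This is a minimal sketch, not a Screaming Frog export format: the class name `PageRecord`, the helper `to_jsonl`, and the field names are assumptions chosen to mirror the five fields listed above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable


@dataclass
class PageRecord:
    """One crawled priority page (field names are illustrative)."""
    url: str
    title: str
    h1: str
    body_text: str       # main content only, template chrome removed
    last_modified: str   # ISO 8601 date string, e.g. "2026-01-15"

    def word_count(self) -> int:
        # Whitespace word count, useful later for a minimum-length floor.
        return len(self.body_text.split())


def to_jsonl(records: Iterable[PageRecord]) -> str:
    """Serialize records as JSON Lines, one page per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


example = PageRecord(
    url="https://example.com/services/seo",
    title="SEO Services",
    h1="SEO Services",
    body_text="We audit sites.",
    last_modified="2026-01-15",
)
```

Keeping `last_modified` on the record makes it easier to trace a later drift finding back to a specific content edit.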

Step 3: Embed every page

Embed the body text of each page under the production embedding model. Store both the vectors and the model-version tag. For a 200-page priority set under text-embedding-3-large, the workload is about 200 API calls and a few megabytes of vector storage.
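A rough sketch of the embed-and-tag step follows. The embedding call is passed in as a function so the sketch does not assume any particular client library. `embed_pages`, `storage_bytes`, and `toy_embed` are hypothetical names, and `toy_embed` is a stand-in for the production model call, not a real embedding.

```python
from typing import Callable, Dict, List, Sequence

# Any function that maps page text to a vector; in production this would
# wrap the embedding API call for the model your retrieval system uses.
EmbedFn = Callable[[str], Sequence[float]]


def embed_pages(pages: Dict[str, str], embed_fn: EmbedFn,
                model_version: str) -> Dict[str, dict]:
    """Embed each page once and store the vector with its model-version tag."""
    index: Dict[str, dict] = {}
    for url, body_text in pages.items():
        vector: List[float] = [float(x) for x in embed_fn(body_text)]  # one call per page
        index[url] = {"model_version": model_version, "vector": vector}
    return index


def storage_bytes(n_pages: int, dims: int, bytes_per_float: int = 4) -> int:
    """Raw vector storage, assuming float32 values."""
    return n_pages * dims * bytes_per_float


def toy_embed(text: str) -> List[float]:
    # Stand-in embedder for illustration only: NOT semantically meaningful.
    return [float(len(text)), float(len(text.split()))]


index = embed_pages({"/services/seo": "We audit sites."}, toy_embed, "toy-v0")
```

text-embedding-3-large returns 3,072-dimensional vectors by default, so `storage_bytes(200, 3072)` comes to 2,457,600 bytes as float32, which is consistent with the "few megabytes" estimate above. Storing the model-version tag next to each vector lets a later audit refuse to compare snapshots produced by different models.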

Step 4: Compute the pairwise cosine matrix

For every pair of pages in the set, compute cosine similarity. A 200-page priority set produces 19,900 pairs. Store the matrix. The matrix is the data structure that all three reports read from.
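The pairwise step can be sketched in plain Python. Cosine similarity is the dot product of two vectors divided by the product of their lengths, and the number of unique pairs for n pages is n(n-1)/2. The function names and the dict-of-pairs storage layout here are assumptions for illustration; a production run would more likely use a numeric library.

```python
import math
from itertools import combinations
from typing import Dict, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector has no direction; treat as unrelated
    return dot / (norm_a * norm_b)


def pairwise_cosine(vectors: Dict[str, Sequence[float]]) -> Dict[Tuple[str, str], float]:
    """Score every unordered pair of pages once, keyed by (url_a, url_b)."""
    urls = sorted(vectors)
    return {(u, v): cosine(vectors[u], vectors[v]) for u, v in combinations(urls, 2)}


def pair_count(n_pages: int) -> int:
    """Number of unique unordered pairs among n pages."""
    return n_pages * (n_pages - 1) // 2
```

For a 200-page set, `pair_count(200)` returns 19,900, which matches the figure above. Only the upper triangle is stored because cosine similarity is symmetric and the diagonal is always 1.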

Step 5: Generate the three reports

Duplication report. Filter pairs with cosine over 0.85. Sort by cosine descending. For each pair, mark the suggested response: consolidate, differentiate, or accept.
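The filter-and-sort logic for the duplication report is small enough to sketch directly. The input is assumed to be a dict of pair scores keyed by (url_a, url_b); the function name and the row layout are illustrative. The `response` field is left empty on purpose, since choosing consolidate, differentiate, or accept is an editorial decision.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def duplication_report(pairs: Dict[Pair, float], threshold: float = 0.85) -> List[dict]:
    """Return pairs scoring above the threshold, highest cosine first."""
    flagged = [(a, b, score) for (a, b), score in pairs.items() if score > threshold]
    flagged.sort(key=lambda row: row[2], reverse=True)
    return [
        # response is filled in by a reviewer: consolidate, differentiate, or accept
        {"page_a": a, "page_b": b, "cosine": round(score, 3), "response": None}
        for a, b, score in flagged
    ]


sample_pairs = {
    ("/services/seo", "/services/local-seo"): 0.91,
    ("/services/seo", "/blog/seo-basics"): 0.88,
    ("/services/ppc", "/services/seo"): 0.58,
}
report = duplication_report(sample_pairs)
```

With the sample scores above, the report keeps the two pairs over 0.85 and drops the third.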

Gap report. Cluster the pages by cosine proximity using a simple similarity threshold (typically cosine 0.65). Visualize the clusters. Identify topical regions with sparse coverage relative to the priority program. Mark them as candidate gaps.
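One simple reading of "cluster by threshold" is to link any two pages whose cosine is at or above 0.65 and treat each connected group as a cluster. That single-linkage interpretation is an assumption, not something the workflow prescribes, and the function names below are illustrative. Pages that end up alone in a cluster are candidates for the gap review.

```python
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]


def cluster_by_threshold(urls: Sequence[str], pairs: Dict[Pair, float],
                         threshold: float = 0.65) -> List[List[str]]:
    """Group pages into connected components of the 'cosine >= threshold' graph."""
    parent = {u: u for u in urls}

    def find(u: str) -> str:
        # Union-find root lookup with path halving.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for (a, b), score in pairs.items():
        if score >= threshold:
            parent[find(a)] = find(b)  # merge the two groups

    groups: Dict[str, List[str]] = {}
    for u in urls:
        groups.setdefault(find(u), []).append(u)
    return sorted(groups.values(), key=len, reverse=True)


def isolated_pages(clusters: List[List[str]]) -> List[str]:
    """Pages with no neighbor at or above the threshold."""
    return [cluster[0] for cluster in clusters if len(cluster) == 1]
```

Single-linkage grouping can chain loosely related pages into one large cluster, so the visual review of the clusters still matters. The code only shows where coverage is thin; whether a thin region is a real gap depends on the priority program.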

Drift report. If a prior audit exists, compare each page's current cosine position (for example, its similarity to its cluster center) with the value from the prior audit. Flag pages that moved more than 0.05. The drift report is empty on the first run and becomes load-bearing after the third.
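The drift comparison can be sketched as a diff between two snapshots. The sketch assumes each audit stores one score per page, such as its cosine to its cluster center; that choice of score, the function name, and the row layout are assumptions. Both snapshots should come from the same embedding model, otherwise the deltas are not comparable.

```python
from typing import Dict, List


def drift_report(previous: Dict[str, float], current: Dict[str, float],
                 threshold: float = 0.05) -> List[dict]:
    """Flag pages whose score moved by more than the threshold between audits."""
    rows = []
    # Only pages present in both audits can be compared.
    for url in sorted(set(previous) & set(current)):
        delta = current[url] - previous[url]
        if abs(delta) > threshold:
            rows.append({
                "url": url,
                "previous": previous[url],
                "current": current[url],
                "delta": round(delta, 3),
            })
    rows.sort(key=lambda row: abs(row["delta"]), reverse=True)
    return rows


prior_audit = {"/services/seo": 0.62, "/services/ppc": 0.70, "/blog/old-post": 0.50}
this_audit = {"/services/seo": 0.78, "/services/ppc": 0.71, "/blog/new-post": 0.90}
flagged = drift_report(prior_audit, this_audit)
```

In the sample data only `/services/seo` is flagged. `/services/ppc` moved by about 0.01, and the two blog posts appear in only one snapshot, so they are skipped rather than compared.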

Step 6: Map findings to the response set

Each finding gets a response.

Duplication response: consolidate the lower-IGS page into the higher-IGS page with a 301 redirect, or differentiate the two pages by repositioning one toward a different sub-topic.

Gap response: either commission a new piece to fill the gap or accept the gap if the topical region is not strategic.

Drift response: if a page drifted away from its cluster, check whether content edits caused the drift and decide if the drift is intentional. If unintentional, restore the page's adjacency by adding back the relevant references or sections.

Selection Criteria for the Priority Set

The priority set is not the full site. Embedding every page produces a matrix dominated by tag pages, paginated archives, and thin content, and the audit reads as noise.

Three rules for picking the priority set.

First, include every page that should be a retrieval destination. Service pages, key blog content, comparison pages, methodology pages, top product pages. Anything you would point a customer to.

Second, exclude paginated archives, tag pages, search result pages, and any template-driven page that aggregates other pages. These pages embed as the union of their list items and pollute the matrix.

Third, include pages with at least 600 words. Shorter pages embed inconsistently. They create false-positive duplication pairs. The 600-word floor is a working anchor, not a universal rule. Sites with a strong short-form publication style can lower it. Sites with a long-form policy can default to a higher floor.
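The second and third rules are mechanical enough to sketch as a filter. The first rule, whether a page should be a retrieval destination, stays an editorial call. The URL patterns and query keys below are hypothetical examples of common archive, tag, and search URLs; adjust them to your own site structure.

```python
import re
from urllib.parse import parse_qs, urlparse

# Hypothetical path segments for aggregate pages; adapt to your URL scheme.
EXCLUDED_PATH = re.compile(r"/(tag|tags|category|page|search)(/|$)")
# Hypothetical query keys that signal search results or pagination.
EXCLUDED_QUERY_KEYS = {"s", "q", "page"}


def is_priority_candidate(url: str, body_text: str, min_words: int = 600) -> bool:
    """Apply the exclusion rule and the word-count floor to one page."""
    parsed = urlparse(url)
    if EXCLUDED_PATH.search(parsed.path):
        return False  # tag page, archive, pagination, or search path
    if EXCLUDED_QUERY_KEYS & set(parse_qs(parsed.query)):
        return False  # search or pagination query string
    return len(body_text.split()) >= min_words
```

For example, `is_priority_candidate("https://example.com/blog/page/2/", text)` returns False for any text, while a 600-word service page passes. The `min_words` parameter exists because the 600-word floor is a working anchor that some sites will raise or lower.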

Common Findings

Service-page duplication. Two service pages on adjacent offerings drift to cosine 0.85+. The retrieval system pulls one or the other inconsistently. The fix is either consolidation (one page covers both offerings) or differentiation (each page emphasizes a distinct angle).

Blog-to-service-page cannibalization. A blog post becomes the de facto landing page for a service query because it scores higher than the service page itself. The fix is to strengthen the substance of the service page and direct the blog post toward an adjacent query.

Cluster orphans. A page sits at cosine 0.45 or lower from every other page on the site. It is topically isolated. The audit surfaces it. The response is usually to either build out the surrounding cluster or repurpose the orphan into something that fits.

Drift after a content sprint. A page that was 0.62 from its cluster center is now 0.78 after a sprint added five neighboring pages. The page got pulled toward the new neighbors. Whether the drift is desired depends on the intent of the sprint. The audit makes the drift visible.

Template-driven false positives. Two pages score 0.95 because they share a heavy template footer. The fix is to strip template chrome before embedding. Most embedding workflows skip this step and produce noisy duplication reports.
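A rough way to strip template chrome is to drop text inside layout elements before embedding. The sketch below uses Python's standard `html.parser`. The set of tags treated as chrome is an assumption: many themes wrap sidebars and footers in plain `div` elements, in which case a content-area selector from the crawl is the better tool.

```python
from html.parser import HTMLParser
from typing import List

# Elements assumed to hold template chrome or non-content text.
CHROME_TAGS = {"nav", "header", "footer", "aside", "form", "script", "style", "noscript"}


class BodyTextExtractor(HTMLParser):
    """Collect text that is not nested inside any chrome element."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0          # >0 while inside a chrome element
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in CHROME_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in CHROME_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def strip_chrome(html: str) -> str:
    """Return page text with nav, header, footer, and similar blocks removed."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

One caveat: an unclosed chrome tag in malformed HTML causes everything after it to be skipped, so spot-check the extracted word counts against the rendered pages before embedding.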

Five Failure Modes

Running the audit on the full site. A 5,000-page site produces a 5,000 by 5,000 cosine matrix, about 12.5 million unique pairs. The reports fill with noise and the audit becomes unusable. Stick to the priority set.

Skipping template stripping. Pages with heavy shared chrome (sidebars, footers, mega-menus) score artificially high in cosine pairs. The duplication report fills with template noise. Strip the chrome before embedding.

Using a different embedding model than production retrieval. If your internal RAG runs on text-embedding-3-large and you audit on Voyage 4, the audit reads a corpus that retrieval is not actually seeing. Match the model to production.

Running once and never re-running. A first audit produces a baseline. The second audit, six months later, produces the drift report that catches the real problems. Single-shot audits miss the time dimension.

No response framework. The audit produces three reports. Without a defined response for each finding type, the reports turn into a backlog that grows quarter over quarter. Pair the audit with a response cadence.

Where The Intra-Site Audit Sits in the Framework

Component 5 of Corpus Engineering: Retrieval Optimization. This is the component that tunes the corpus for retrieval, distinct from authoring (Components 1 to 4) and maintenance (Component 6).

An intra-site audit is the diagnostic. The response set is the work. Together they make up the Retrieval Optimization practice.

The intra-site audit pairs with the cross-site Embedding Audit. Cross-site measures competitive position. Intra-site measures whether your own corpus stays topically aligned. Run cross-site at publish, per page. Run intra-site quarterly or semiannually, across the priority set.

Inside MERIT, the intra-site audit sits in Inclusion. Inclusion is the pillar that ensures the corpus is technically accessible and structurally readable. Topical alignment is a tech-access requirement. Any retrieval system reading your site as a corpus depends on it.

Frequently Asked Questions

What is the Intra-Site Embedding Audit?

A workflow that embeds every priority page on a site. It then computes pairwise cosine similarity across the set. The output is three reports: duplication, topical gaps, and drift over time. It tells you whether your corpus stays on-topic, separate from how you stack up against peers.

How is it different from the cross-site Embedding Audit?

The cross-site audit measures your page against top-10 SERP peers. Intra-site measures your pages against each other. Same cosine math, different reference set, different operational output. Use both.

How often should I run the Intra-Site Embedding Audit?

Semiannual is the working default for a mature site. Quarterly for a site under active content build. Monthly for sites that run internal RAG or semantic search.

How many pages should be in the priority set?

100 to 500 for most mid-size sites. The set covers pages that should be retrieval destinations. Skip paginated archives, tag pages, and pages under 600 words.

What does the duplication report do?

Lists page pairs with cosine over 0.85. Each pair gets a response: consolidate (301 redirect the lower-IGS page), differentiate (reposition one page), or accept (rare).

What is the gap report for?

It clusters pages by cosine proximity and surfaces topical regions where the priority set has sparse coverage. Empty regions inside the priority program are content gaps.

What does the drift report show?

How each page moved in cosine space since the last audit. Pages that drifted more than 0.05 get flagged.

Does this apply to sites that do not run internal RAG?

Yes. Even without an internal retrieval system, your site appears as a corpus to external AI search engines. Internal topical alignment affects citation patterns and related-content widgets.

What embedding model should I use for the audit?

Match the model to your production retrieval. If your internal RAG runs on text-embedding-3-large, audit with text-embedding-3-large.

Where does it fit in the MERIT Framework?

Inside Inclusion. MERIT names Inclusion as the pillar that ensures the corpus is technically accessible and structurally readable.

The Bottom Line

The cross-site Embedding Audit tells you how your page compares to top-10 SERP peers at publish. The Intra-Site Embedding Audit tells you whether your own corpus stays topically aligned across pages and over time.

Three reports. Duplication, gap, and drift. Each report maps to a response. Consolidate, differentiate, fill, or restore.

The audit runs on a semiannual default for mature sites and a quarterly cycle for sites under active build. Programs running internal RAG run it monthly. The cost is small. The signal is the diagnostic that drives Component 5 of Corpus Engineering: Retrieval Optimization.

Run both audits. Cross-site at publish, per page. Intra-site semiannually or quarterly, across the priority set. The two together cover what AI retrieval actually reads when it looks at your site.