I’m building an AI search optimization product and wanted to apply the same principles internally: fix content architecture before launch instead of correcting problems after users — or AI systems — struggle to understand it.
To do this, I created a Python CLI tool that analyzes semantic structure using vector embeddings. It parses markdown files, generates embeddings (all-mpnet-base-v2 or OpenAI), computes cosine similarity, runs k-means clustering, detects redundancy and semantic gaps, and produces visualizations like heatmaps, dendrograms, and UMAP projections. The stack includes Python 3.12, sentence-transformers, scikit-learn, UMAP, and Plotly, with embedding caching for speed.
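For illustration, here is a minimal sketch of that core pipeline. The docs/ directory, the cache layout, and treating each markdown file as a single document are assumptions for the example, not the tool's actual code.

```python
# Sketch: embed markdown pages and compute pairwise cosine similarity,
# with a simple on-disk embedding cache keyed by content hash.
import hashlib
import pickle
from pathlib import Path

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

CACHE_DIR = Path(".embed_cache")  # assumed cache location
CACHE_DIR.mkdir(exist_ok=True)
model = SentenceTransformer("all-mpnet-base-v2")

def embed(texts):
    """Embed texts, reusing cached vectors when the content hasn't changed."""
    vectors = []
    for text in texts:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        cache_file = CACHE_DIR / f"{key}.pkl"
        if cache_file.exists():
            vectors.append(pickle.loads(cache_file.read_bytes()))
        else:
            vec = model.encode(text)
            cache_file.write_bytes(pickle.dumps(vec))
            vectors.append(vec)
    return vectors

# Treat each markdown file as one document; splitting by heading would go here.
pages = {p.name: p.read_text(encoding="utf-8") for p in Path("docs").glob("*.md")}
names = list(pages)
embeddings = embed([pages[n] for n in names])
similarity = cosine_similarity(embeddings)  # N x N matrix, one row per page
```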
Analysis Overview:
The site contains 25 pages (~12.9k words) across features, concepts, use cases, and resources. No stub pages were found.
Topic coherence (measured via average similarity between sections) ranged from 0.73 to 0.93, with most pages between 0.78 and 0.88. Lower coherence wasn’t necessarily bad — the Proof Engine page scored lower because it intentionally covers many subtopics.
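Concretely, per-page coherence can be computed as the mean pairwise cosine similarity among a page's section embeddings; this sketch assumes the sections have already been split out and embedded.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def page_coherence(section_embeddings):
    """Mean pairwise cosine similarity across one page's section embeddings."""
    if len(section_embeddings) < 2:
        return 1.0  # single-section pages are trivially coherent by this convention
    sims = cosine_similarity(section_embeddings)
    # Average the upper triangle only, excluding the diagonal of self-similarities.
    upper = sims[np.triu_indices(len(section_embeddings), k=1)]
    return float(upper.mean())
```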
The redundancy check found only one pair above 0.85 similarity, both of them intentional cross-link sections. Earlier, I removed two index pages with 85%+ similarity to their parent pages, flattening navigation from three layers to two.
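The redundancy check itself reduces to scanning the page-by-page similarity matrix for pairs above the cutoff. A sketch, reusing the similarity matrix and page names from the pipeline above, with 0.85 as the threshold:

```python
def redundant_pairs(similarity, names, threshold=0.85):
    """Return page pairs whose cosine similarity meets or exceeds the threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity[i, j] >= threshold:
                pairs.append((names[i], names[j], float(similarity[i, j])))
    # Most redundant pairs first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```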
No semantic gaps were detected; all pages were well connected. Hub analysis confirmed that Home, Learn, and the AEO Playbook act as central nodes, matching the intended architecture of concepts → applications → tools.
Heatmap clustering revealed:
* Concept pages: 0.65–0.80 similarity
* Feature pages: 0.45–0.65 similarity
* Use cases: 0.70–0.79 similarity
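A sketch of how the heatmap and the hub ranking can come out of the same matrix; the cluster count, the ordering, and the color scale are assumptions for illustration, not the tool's actual settings.

```python
import numpy as np
import plotly.express as px
from sklearn.cluster import KMeans

def similarity_heatmap(similarity, embeddings, names, n_clusters=4):
    """Run k-means on the page embeddings and plot the similarity matrix grouped by cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    order = np.argsort(labels)  # group same-cluster pages together so blocks show up
    ordered_names = [names[i] for i in order]
    fig = px.imshow(
        similarity[np.ix_(order, order)],
        x=ordered_names,
        y=ordered_names,
        color_continuous_scale="Viridis",
    )
    return fig, labels

def hub_scores(similarity, names):
    """Rank pages by mean similarity to every other page (higher = more central)."""
    sims = similarity.astype(float).copy()
    np.fill_diagonal(sims, np.nan)  # ignore self-similarity
    return sorted(zip(names, np.nanmean(sims, axis=1)), key=lambda t: t[1], reverse=True)
```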
Embeddings were chosen over keyword analysis because they capture meaning rather than wording, detecting paraphrased overlap and relationships relevant to AI retrieval systems.
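As a toy illustration of that difference, the two sentences below are paraphrases with essentially no shared keywords; the sentences are invented for the example and are not taken from the site.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-mpnet-base-v2")

a = "Structure your pages so answer engines can cite them."
b = "Organize content in a way that lets AI assistants quote it as a source."

# Keyword overlap (Jaccard over lowercase tokens) is near zero for this pair...
tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
jaccard = len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# ...while the embeddings still place the two sentences close together.
emb = model.encode([a, b])
semantic = cosine_similarity([emb[0]], [emb[1]])[0, 0]

print(f"keyword overlap: {jaccard:.2f}  embedding similarity: {semantic:.2f}")
```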
Limitations include model sensitivity, arbitrary cluster counts, and coherence scores that don’t fully account for intentional structure. Planned improvements include entity coverage analysis, competitor comparisons, and query-simulation testing.
The entire process took under a minute but prevented structural issues that could cause discoverability problems later. Running semantic analysis pre-launch helped validate architecture, reduce duplication, and ensure content works for both humans and AI retrieval systems.