I’m building an AI search optimization product and wanted to apply the same principles internally: fix content architecture before launch instead of correcting problems after users — or AI systems — struggle to understand it.
To do this, I created a Python CLI tool that analyzes semantic structure using vector embeddings. It parses markdown files, generates embeddings (all-mpnet-base-v2 or OpenAI), computes cosine similarity, runs k-means clustering, detects redundancy and semantic gaps, and produces visualizations like heatmaps, dendrograms, and UMAP projections. The stack includes Python 3.12, sentence-transformers, scikit-learn, UMAP, and Plotly, with embedding caching for speed.
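For illustration, here is a minimal sketch of that core pipeline. The docs/ directory, the cache layout, and treating each markdown file as a single document are assumptions for the example, not the tool's actual code.

```python
# Sketch: embed markdown pages and compute pairwise cosine similarity,
# with a simple on-disk embedding cache keyed by content hash.
import hashlib
import pickle
from pathlib import Path

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

CACHE_DIR = Path(".embed_cache")  # assumed cache location
CACHE_DIR.mkdir(exist_ok=True)
model = SentenceTransformer("all-mpnet-base-v2")

def embed(texts):
    """Embed texts, reusing cached vectors when the content hasn't changed."""
    vectors = []
    for text in texts:
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        cache_file = CACHE_DIR / f"{key}.pkl"
        if cache_file.exists():
            vectors.append(pickle.loads(cache_file.read_bytes()))
        else:
            vec = model.encode(text)
            cache_file.write_bytes(pickle.dumps(vec))
            vectors.append(vec)
    return vectors

# Treat each markdown file as one document; splitting by heading would go here.
pages = {p.name: p.read_text(encoding="utf-8") for p in Path("docs").glob("*.md")}
names = list(pages)
embeddings = embed([pages[n] for n in names])
similarity = cosine_similarity(embeddings)  # N x N matrix, one row per page
```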
Analysis Overview:
The site contains 25 pages (~12.9k words) across features, concepts, use cases, and resources. No stub pages were found.
Topic coherence (measured via average similarity between sections) ranged from 0.73 to 0.93, with most pages between 0.78 and 0.88. Lower coherence wasn’t necessarily bad — the Proof Engine page scored lower because it intentionally covers many subtopics.
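Concretely, per-page coherence can be computed as the mean pairwise cosine similarity among a page's section embeddings; this sketch assumes the sections have already been split out and embedded.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def page_coherence(section_embeddings):
    """Mean pairwise cosine similarity across one page's section embeddings."""
    if len(section_embeddings) < 2:
        return 1.0  # single-section pages are trivially coherent by this convention
    sims = cosine_similarity(section_embeddings)
    # Average the upper triangle only, excluding the diagonal of self-similarities.
    upper = sims[np.triu_indices(len(section_embeddings), k=1)]
    return float(upper.mean())
```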
The redundancy check found only one pair above 0.85 similarity, both of them intentional cross-link sections. Earlier, I removed two index pages with 85%+ similarity to their parent pages, flattening navigation from three layers to two.
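The redundancy check itself reduces to scanning the page-by-page similarity matrix for pairs above the cutoff. A sketch, reusing the similarity matrix and page names from the pipeline above, with 0.85 as the threshold:

```python
def redundant_pairs(similarity, names, threshold=0.85):
    """Return page pairs whose cosine similarity meets or exceeds the threshold."""
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if similarity[i, j] >= threshold:
                pairs.append((names[i], names[j], float(similarity[i, j])))
    # Most redundant pairs first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)
```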
No semantic gaps were detected; all pages were well connected. Hub analysis confirmed that Home, Learn, and the AEO Playbook act as central nodes, matching the intended architecture of concepts → applications → tools.
Heatmap clustering revealed:
* Concept pages: 0.65–0.80 similarity
* Feature pages: 0.45–0.65 similarity
* Use cases: 0.70–0.79 similarity
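A sketch of how the heatmap and the hub ranking can come out of the same matrix; the cluster count, the ordering, and the color scale are assumptions for illustration, not the tool's actual settings.

```python
import numpy as np
import plotly.express as px
from sklearn.cluster import KMeans

def similarity_heatmap(similarity, embeddings, names, n_clusters=4):
    """Run k-means on the page embeddings and plot the similarity matrix grouped by cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(embeddings)
    order = np.argsort(labels)  # group same-cluster pages together so blocks show up
    ordered_names = [names[i] for i in order]
    fig = px.imshow(
        similarity[np.ix_(order, order)],
        x=ordered_names,
        y=ordered_names,
        color_continuous_scale="Viridis",
    )
    return fig, labels

def hub_scores(similarity, names):
    """Rank pages by mean similarity to every other page (higher = more central)."""
    sims = similarity.astype(float).copy()
    np.fill_diagonal(sims, np.nan)  # ignore self-similarity
    return sorted(zip(names, np.nanmean(sims, axis=1)), key=lambda t: t[1], reverse=True)
```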
Embeddings were chosen over keyword analysis because they capture meaning rather than wording, detecting paraphrased overlap and relationships relevant to AI retrieval systems.
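As a toy illustration of that difference, the two sentences below are paraphrases with essentially no shared keywords; the sentences are invented for the example and are not taken from the site.

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-mpnet-base-v2")

a = "Structure your pages so answer engines can cite them."
b = "Organize content in a way that lets AI assistants quote it as a source."

# Keyword overlap (Jaccard over lowercase tokens) is near zero for this pair...
tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
jaccard = len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

# ...while the embeddings still place the two sentences close together.
emb = model.encode([a, b])
semantic = cosine_similarity([emb[0]], [emb[1]])[0, 0]

print(f"keyword overlap: {jaccard:.2f}  embedding similarity: {semantic:.2f}")
```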
Limitations include model sensitivity, arbitrary cluster counts, and coherence scores that don’t fully account for intentional structure. Planned improvements include entity coverage analysis, competitor comparisons, and query-simulation testing.
The entire process took under a minute but prevented structural issues that could cause discoverability problems later. Running semantic analysis pre-launch helped validate architecture, reduce duplication, and ensure content works for both humans and AI retrieval systems.