What I think is the most interesting artifact is the dataset: 37,734 chunks of NRC regulatory documents (NUREG-0800, 10 CFR Parts 20/50/51/52/72/73/100, and Regulatory Guides) embedded with OpenAI text-embedding-3-small. It covers the full regulatory corpus an applicant would need for a COL submission. I'm not aware of anything like this being publicly available before.
The embeddings are ready to load directly into ChromaDB, Pinecone, or any other vector store. If you're doing nuclear AI, regulatory NLP, or just want a large real-world RAG dataset to experiment with, it should be useful.
Here's the full codebase if you're interested: https://github.com/Davenporten/nrc-licensing-rag