i built a fun little python package called ardage (ARxiv DAtaset GEnerator) that quickly generates markdown datasets of research papers from natural language queries. it's handy for building post-training datasets for llms, rag knowledge bases, and more.
you can install it with 'pip install ardage', then run it in interactive mode in the cli, drive it from the cli with flags, or import the library into your own code and build with it!
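to give a rough idea of what "a markdown dataset from arxiv" can look like, here's a minimal sketch. it is not ardage's code or api: the function names build_query_url and feed_to_markdown are made up for illustration, it uses plain keyword search against the public arxiv api instead of natural language queries, and it only keeps titles and abstracts rather than full papers.

```python
import urllib.parse
import xml.etree.ElementTree as ET

# the arxiv api returns an Atom feed, so every tag lives in this namespace
ATOM = "{http://www.w3.org/2005/Atom}"


def build_query_url(terms, max_results=5):
    """Build a search URL for the public arXiv API (hypothetical helper, not part of ardage)."""
    params = {
        "search_query": f"all:{terms}",  # "all:" searches every field
        "start": 0,
        "max_results": max_results,
    }
    return "http://export.arxiv.org/api/query?" + urllib.parse.urlencode(params)


def feed_to_markdown(atom_xml):
    """Turn an Atom feed string into a list of markdown documents, one per entry."""
    root = ET.fromstring(atom_xml)
    docs = []
    for entry in root.findall(f"{ATOM}entry"):
        # titles and abstracts come back with hard line breaks; collapse the whitespace
        title = " ".join(entry.findtext(f"{ATOM}title", default="").split())
        summary = " ".join(entry.findtext(f"{ATOM}summary", default="").split())
        link = entry.findtext(f"{ATOM}id", default="")
        docs.append(f"# {title}\n\n{summary}\n\nsource: {link}\n")
    return docs
```

to try it against the live api, fetch the feed with something like urllib.request.urlopen(build_query_url("retrieval augmented generation")).read() and pass the result to feed_to_markdown, then write each string to its own .md file.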
demo: https://x.com/hariharprasadd/status/1991346557459841196?s=20