
New budget financial API, based on EDGAR data

7•jgfriedman1999•6mo ago
Hey everyone,

I'm the developer of an open-source (MIT License) Python package that converts SEC submissions into useful data. I've recently put a bunch of it in the cloud for a nominal convenience fee.

Cloud:

1. SEC Websocket - notifies you of new submissions as they come out. (Free)

2. SEC Archive - download SEC submissions without rate limits. ($1/100,000 downloads)

3. MySQL RDS ($1/million rows returned)

- XBRL

- Fundamentals

- Institutional Holdings

- Insider Transactions

- Proxy Voting Records

Posting here in case someone finds it useful.

Links:

Datamule (Package) GitHub: https://github.com/john-friedman/datamule-python

Documentation: https://john-friedman.github.io/datamule-python/datamule-python/sheet/sheet/

Get an API Key: https://datamule.xyz/dashboard2.html

Comments

jgfriedman1999•6mo ago
How it works:

Websocket:

1. Two AWS EC2 t4g.nano instances poll the SEC's RSS and EFTS endpoints (RSS is faster, EFTS is complete).

2. When new submissions are detected, they are sent to the websocket server (a t4g.micro instance, written in Go for greater concurrency).

3. The websocket sends a signal to consumers.
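Since the same filing can appear on both the RSS and EFTS endpoints, the polling loop boils down to diffing each feed snapshot against the set of accession numbers already seen. A minimal sketch of that dedup step (the actual HTTP polling and feed parsing are omitted, and the `accession_no` field name is illustrative, not the service's real schema):

```python
def new_submissions(seen: set, feed_entries: list) -> list:
    """Return entries whose accession number has not been seen yet,
    and record them as seen. The same filing can show up on both the
    RSS and EFTS endpoints, but should fire only once."""
    fresh = []
    for entry in feed_entries:
        acc = entry["accession_no"]
        if acc not in seen:
            seen.add(acc)
            fresh.append(entry)
    return fresh

# One poll cycle: RSS snapshot first (faster), then EFTS (complete).
seen = set()
rss = [{"accession_no": "0001-26-000001"}, {"accession_no": "0001-26-000002"}]
efts = [{"accession_no": "0001-26-000002"}, {"accession_no": "0001-26-000003"}]
first = new_submissions(seen, rss)    # both filings are new
second = new_submissions(seen, efts)  # only the third filing is new
```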

Archive:

1. One t4g.micro instance receives notifications from the websocket, then fetches the submission SGML from the SEC.

2. If a submission is over the size threshold, it is compressed with zstandard.

3. Submissions are uploaded to a Cloudflare R2 bucket (zero egress fees, just Class A/B operations).

4. The R2 bucket is proxied behind my domain, with caching.
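The size-threshold compression in step 2 can be sketched as below. Note the stand-ins: this uses stdlib `zlib` in place of zstandard (which is what the service actually uses), and the 1 MB threshold is made up, since the real cutoff isn't stated:

```python
import zlib

THRESHOLD = 1_000_000  # bytes; illustrative, the real cutoff isn't stated

def prepare_for_archive(sgml: bytes) -> tuple:
    """Compress a submission only when it exceeds the size threshold.
    Returns (payload, compressed_flag) so the uploader can set the
    right metadata on the stored object."""
    if len(sgml) > THRESHOLD:
        return zlib.compress(sgml, 6), True
    return sgml, False

small = b"<SEC-DOCUMENT>tiny filing</SEC-DOCUMENT>"
big = b"<SEC-DOCUMENT>" + b"0" * 2_000_000 + b"</SEC-DOCUMENT>"

small_payload, small_flag = prepare_for_archive(small)
big_payload, big_flag = prepare_for_archive(big)
```

Small filings are stored as-is, so reads stay cheap; only oversized submissions pay the decompression cost on the way out.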

RDS:

1. ECS Fargate tasks run daily at 9 AM UTC.

2. They download data from the archive, parse it, and upload it into a db.t4g.medium MySQL RDS instance.

3. They also handle reconciliation for the archive in case any filings were missed.
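The reconciliation in step 3 is essentially a set difference between what the SEC's daily index says exists and what actually made it into the archive. A hedged sketch (index fetching omitted; plain accession-number strings stand in for real records):

```python
def find_missing(index_accessions, archived_accessions):
    """Accession numbers present in the SEC's daily index but absent
    from the archive; these get re-fetched by the daily job."""
    return sorted(set(index_accessions) - set(archived_accessions))

index = ["0001-26-000001", "0001-26-000002", "0001-26-000003"]
archive = ["0001-26-000001", "0001-26-000003"]
missing = find_missing(index, archive)  # whatever the live path dropped
```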

conditionnumber•6mo ago
Cool, EDGAR is an amazing public service. I think they use Akamai as their CDN so the downloads are remarkably fast.

A few years ago I wrote an SGML parser for the full SEC PDS specification (super tedious). But I have trouble leveraging my own efforts for independent research because I don't have a reliable securities master to link against. I can't take a historical CUSIP from 13F filings and associate it to a historical ticker/return. Or my returns are wrong because of data errors so I can't fit a factor model to run an event study using Form 4 data.

I think what's missing is a serious open source effort to integrate/cleanse the various cheapo data vendors into something reasonably approximating the quality you get out of a CRSP/Compustat.

jgfriedman1999•6mo ago
Yep! Pretty sure it is still Akamai. Through testing I've noticed they cap downloads at ~6 Mbps from, e.g., home internet, but not from GitHub or AWS.

SGML parsing is fun! I've open-sourced an SGML parser here: https://github.com/john-friedman/secsgml

Securities master to link against - interesting. Here's a pipeline off the top of my head:

1. Get CUSIP, nameOfIssuer, and titleOfClass from the Institutional Holdings database.

2. Use the company metadata crosswalk to link CUSIP + titleOfClass to nameOfIssuer to get the CIK: https://github.com/john-friedman/datamule-data/blob/master/d... (recompiled daily using GH Actions).

3. Get e.g. us-gaap:EarningsPerShareBasic from the XBRL database, linking on CIK. Types of stock might be a member - so e.g. Class A, Class B? Not sure there.
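At its core, that pipeline is two lookups chained together. A toy sketch with fabricated rows (real crosswalk columns and values may differ; the EPS figure is made up):

```python
# Step 1: holdings rows keyed by CUSIP (toy data).
holdings = [
    {"cusip": "037833100", "nameOfIssuer": "APPLE INC", "titleOfClass": "COM"},
]

# Step 2: crosswalk from (cusip, titleOfClass) to CIK (toy data).
crosswalk = {("037833100", "COM"): "0000320193"}

# Step 3: XBRL facts keyed by CIK (toy data, fabricated EPS value).
xbrl = {"0000320193": {"us-gaap:EarningsPerShareBasic": 1.23}}

def eps_for_holding(row):
    """Walk holding -> CIK -> XBRL fact; None when any link is missing."""
    cik = crosswalk.get((row["cusip"], row["titleOfClass"]))
    if cik is None:
        return None
    return xbrl.get(cik, {}).get("us-gaap:EarningsPerShareBasic")

eps = eps_for_holding(holdings[0])
```

The share-class ambiguity mentioned above shows up here as the `titleOfClass` part of the join key: two classes of the same issuer map to the same CIK but may carry different per-share facts.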

For form 4, not sure what you mean by event study. Would love to know!

conditionnumber•6mo ago
Event study: A way to measure how returns respond to events. Popularized by Fama in "The Adjustment of Stock Prices to New Information" but ubiquitous in securities litigation, academic financial economics, and equity L/S research. The canonical recipe is MacKinlay's "Event Studies in Economics and Finance". Industry people tend to just use residual returns from Axioma / Barra / in house risk model.

So let's say your hypothesis is "stock go up on insider buy". Event studies help you test that hypothesis and quantify how much up / when.
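A minimal market-model event study in the spirit of MacKinlay's recipe: fit alpha/beta on an estimation window, then cumulate abnormal returns over the event window. Everything below (returns, window sizes, the +2% event-day pop) is synthetic for illustration:

```python
def ols(x, y):
    """Intercept and slope of y = a + b*x by least squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    return my - b * mx, b

def car(stock, market, est_end, ev_start, ev_end):
    """Cumulative abnormal return over [ev_start, ev_end], with the
    market model fit on returns [0, est_end)."""
    a, b = ols(market[:est_end], stock[:est_end])
    return sum(stock[t] - (a + b * market[t]) for t in range(ev_start, ev_end + 1))

# Synthetic daily returns: the stock tracks the market with beta 0.5,
# plus a +2% pop on the "event day" (index 10: the insider buy).
market = [0.01, -0.02, 0.015, 0.0, 0.005, -0.01,
          0.02, 0.0, -0.005, 0.01, 0.0, 0.005]
stock = [0.5 * m for m in market]
stock[10] += 0.02

abnormal = car(stock, market, est_end=10, ev_start=10, ev_end=11)  # ~0.02
```

Industry practice swaps the market-model residual for a risk-model residual (Axioma/Barra), but the cumulation step is the same.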

Cool metadata table, I'm curious about the ticker source (Form4, 10K, some SEC metadata publications?).

My comment about CUSIP linking was trying to illustrate a more general issue: it's difficult to use SEC data extractions to answer empirical questions if you don't have a good securities master to link against (reference data + market data).

Broadly speaking a securities master will have 2 kinds of data: reference data (identifiers and dates when they're valid) and market data (price / volume / corporate actions... all the stuff you need to accurately compute total returns). CRSP/Compustat (~$40k/year?) is the gold standard for daily frequency US equities. With a decent securities master you can do many interesting things. Realistic backtests for the kinds of "use an LLM to code a strategy" projects you see all over the place these days. Or (my interest) a "papers with code" style repository that helps people learn the field.

What you worry about with bad data is getting a high t-stat on a plausible-sounding result that later fails to replicate when you use clean data (or, worse, try to trade it). Let's say your securities master drops companies two weeks before they're delisted... just holding the market is going to have serious alpha. Ditto if your fundamental data reflects restatements.

On the reference data front, the Compustat security table has (from_date, thru_date, cusip, ticker, cik, name, gics sector/industry, gvkey, iid) etc. all lined up and ready to go. I don't think it's possible to generate this kind of time series from cheap data vendors. I think it could be possible to do it using some of the techniques you described, and maybe others. E.g., get a (company-name, cik, ticker) time series from Form 4 or 10-K filings. Then get a (security-name, cusip) time series from the 13F security lists the SEC publishes quarterly (PDFs). Then merge on date/fuzzy name. Then validate. To get GICS you'd need to do something like extract industry/sector names from a broad index ETF's quarterly holdings reports, whose format will change a lot over the years. Lots of tedious but valuable work. Also a lot of surface area to leverage LLMs. I dunno, at this point it may be feasible to use LLMs to extract all this info (annually) from 10-Ks.

On the market data front, the vendors I've seen have random errors. They tend to be worst for dividends/corporate-actions. But I've seen BRK.A trade $300 trillion on a random Wednesday. Haven't noticed correlation across vendors, so I think this one might be easy to solve. Cheap fundamental data tends to have similar defects to cheap market data.

Sorry for the long rant, I've thought about this problem for a while but never seriously worked on it. One reason I haven't undertaken the effort: validation is difficult so it's hard to tell if you're actually making progress. You can do things like make sure S&P500 member returns aggregate to SPY returns to see if you're waaay off. But detailed validation is difficult without a source of ground truth.
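The coarse sanity check described above (member returns should aggregate to the index return) is just a weighted sum plus a tolerance. A sketch with fabricated weights and returns:

```python
def implied_index_return(weights, returns):
    """Weighted sum of member returns; should approximate the observed
    index (e.g. SPY) return if the securities master isn't badly broken."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * returns[ticker] for ticker, w in weights.items())

# Fabricated one-day snapshot: three members and an observed index return.
weights = {"AAA": 0.5, "BBB": 0.3, "CCC": 0.2}
returns = {"AAA": 0.01, "BBB": -0.02, "CCC": 0.005}
implied = implied_index_return(weights, returns)

spy_return = 0.0  # fabricated "observed" index return for the same day
suspicious = abs(implied - spy_return) > 0.005  # flag large divergences
```

As noted, this only catches being way off (dropped members, badly wrong corporate actions); it says nothing about errors that wash out in aggregate.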

jgfriedman1999•6mo ago
Love the long rant.

re: metadata table - it's constructed from the SEC's submissions.zip, which they update daily. My script downloads the zip, decompresses just the bytes where the information (ticker, SIC code, etc.) is stored, then converts it into a CSV.
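One way to read only the needed members of a zip, rather than extracting everything, is stdlib `zipfile`, which seeks to each member via the central directory (whether the author's script works at this level or lower is an assumption). A self-contained sketch with a fabricated mini submissions.zip; member names and fields are stand-ins:

```python
import io
import json
import zipfile

# Build a stand-in submissions.zip in memory: one JSON member per company.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as z:
    z.writestr("CIK0000320193.json", json.dumps({"ticker": "AAPL", "sic": "3571"}))
    z.writestr("CIK0001018724.json", json.dumps({"ticker": "AMZN", "sic": "5961"}))

def read_member(zip_bytes, name):
    """Decompress only the named member; other members stay untouched."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as z:
        return json.loads(z.read(name))

record = read_member(buf.getvalue(), "CIK0000320193.json")
```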

And yep! Agree with most of this. Currently I'd say my data is at the stage where it's useful for startups / PhD research and some hedge fund / quant stuff (at least that's who is using it so far!).

I've seen the trillion dollar trades, and they're hilarious! You see it every so often in Form 3,4,5 disclosures.

re: LLMs, this is something I'm planning to move into in a month or two. I'm mostly planning to use older NLP methods, which are cheaper and faster, while using LLMs for specific stuff like structured output. E.g., WRDS BoardEx data can be constructed from 8-K Item 5.02 filings.

I think the biggest difficulty wrt data is just that the raw data ingest is annoying AF. My approach has been to make each step easy -> use it to build the next step.