Launch HN: Reducto Studio (YC W24) – Build accurate document pipelines, fast

85•adit_a•7mo ago

Hi HN! We’re Adit and Raunak, co-founders of Reducto (YC W24, https://reducto.ai). Reducto turns unstructured documents (e.g., PDFs, scans, spreadsheets) into structured data. This data can then be used for retrieval, passed into LLMs, or used elsewhere downstream.

We started Reducto when we realized that so many of today’s AI applications require good quality data. Everyone knows that good inputs lead to better outputs, but 80% of the world’s data is still trapped inside of things like messy PDFs and spreadsheets. Raunak and I launched a really early MVP of parsing and extracting from unstructured documents, and were lucky to have a lot of interest from technical teams when they realized that the accuracy was something they hadn’t seen before.

We started by just releasing an API for engineers to build with, but over time we realized that an accurate API was only part of the puzzle. Our customers wanted to be able to easily set up multi-step pipelines, evaluate and iterate on performance within their use case, and work with non-engineering teammates that were also involved in the real-world document processing flow.

That’s why we’re launching Reducto Studio, a web platform that sits on top of our APIs for users to build and iterate on end-to-end document pipelines.

With Studio, you can:

- Drop an entire file set and get per-field and per-document accuracy scores against your eval data.

- Auto-generate and continuously optimize extraction schemas to hit production-grade quality fast.

- Save every run, iterate on parse/extract configs, and compare results side-by-side.
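To make "per-field and per-document accuracy" concrete, here is a minimal sketch of exact-match scoring against labeled eval data. It is only an illustration, not how Studio computes its scores; the `score` function, the field names, and the sample documents are all hypothetical.

```python
from collections import defaultdict


def score(predictions, ground_truth):
    """Return (per_field, per_document) exact-match accuracy.

    Both arguments map a document id to a {field: value} dict.
    Hypothetical helper for illustration only.
    """
    field_hits = defaultdict(int)
    field_total = defaultdict(int)
    per_document = {}
    for doc_id, expected in ground_truth.items():
        predicted = predictions.get(doc_id, {})
        hits = 0
        for field, value in expected.items():
            ok = predicted.get(field) == value
            field_hits[field] += ok
            field_total[field] += 1
            hits += ok
        # A document with no labeled fields counts as fully correct.
        per_document[doc_id] = hits / len(expected) if expected else 1.0
    per_field = {f: field_hits[f] / field_total[f] for f in field_total}
    return per_field, per_document


# Made-up eval set: two invoices, two fields each.
ground_truth = {
    "inv-1": {"total": "120.00", "vendor": "Acme"},
    "inv-2": {"total": "75.50", "vendor": "Globex"},
}
predictions = {
    "inv-1": {"total": "120.00", "vendor": "Acme"},
    "inv-2": {"total": "75.00", "vendor": "Globex"},
}

per_field, per_document = score(predictions, ground_truth)
print(per_field)     # accuracy of each field across all documents
print(per_document)  # accuracy of each document across its fields
```

Exact match is the strictest possible comparison; a real evaluation would likely normalize values (whitespace, number formats, dates) before comparing.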

You can see some examples here (https://studio.reducto.ai) or you can watch this walkthrough: https://www.loom.com/share/b243551741c642c6a594c00353fcecb3.

If you’d like to upload your own document you can log in and do so as well - we don’t make you book a demo or put a payment down to try it.

Thanks for reading and checking it out! This is only the first step for Studio, so we’d love feedback on anything: UX rough edges (we know they’re there!), features that would make evaluations better for you, hard documents you’ve had trouble with, or anything else about wrangling with unstructured data.

Comments

omaerkhan•7mo ago

FYI - https://links.reducto.ai/studio doesn't seem to be working... ERR_TOO_MANY_REDIRECTS

adit_a•7mo ago

Fixed! Sorry about that

TimMeade•7mo ago

Still not working here

adit_a•7mo ago

The direct loom link isn't working for you? Are you seeing the same redirects error?

weego•7mo ago

I'm not a product fit, but I would like to take a moment to praise the detailed beauty of the design work on the site.

From the typography and layout to the line-work down to how the gradients in the, in fashion, large logotype at the bottom of the footer are tied in by using texture.

Was it in house, or an agency? I'd love to see some more of whoever's work it was

adit_a•7mo ago

Thank you! We worked with Airfoil for the website :)

esafak•7mo ago

https://www.airfoil.studio/ presumably

raunakchowdhuri•7mo ago

yep!

iyn•7mo ago

Agreed — came here to say exactly that. I like that this is not yet another tailwind template (nothing wrong with them, I use them all the time) but something with its own identity. I especially love the illustrations/icons. Well done!

skadamat•7mo ago

Congrats on the launch! How do you guys compare with Datalab with regards to accuracy?

https://www.datalab.to/

gbertb•7mo ago

I want to know this, too. Lots of these companies are doing the same thing, but leave out benchmarks that include marker

adit_a•7mo ago

Thanks! We have a lot of respect for the work VikP and his team did on Surya but we haven't benchmarked his newer pipeline so I don't want to make a 1:1 claim.

If you want to do a side by side with your use case we'd be happy to set you up with free trial access.

anduril22•7mo ago

I want to test side by side with them and my own pipeline - can you set me up with trial access please? Thanks

jackienotchan•7mo ago

I saw your recent $24M series A and was kind of surprised to only see you launching now, congrats!

YC seems to fund quite a few document extraction companies, even within the same batch:

- Pulse (YC W24): https://www.ycombinator.com/companies/pulse-3

- OmniAI (YC W24): https://www.ycombinator.com/companies/omniai

- Extend (YC W23): https://www.ycombinator.com/companies/extend

How do you differentiate from these? And how do you see the space evolving as LLMs commoditize PDF extraction?

echelon•7mo ago

How do you raise Series A before launch / PMF?

I assume y'all launched before this to select partners? Or perhaps this is a new product on top of the core product?

Congrats! Keep at it!

adit_a•7mo ago

Thank you!

To clarify, our API was already fully launched and in prod with customers when we raised our series A. This launch is specifically for the platform we're building around the API :)

adit_a•7mo ago

Thanks! To clarify, we launched our document processing APIs a while ago. This launch is specifically for a new platform we're building around our API based on all of the things our customers previously had to build internally to support their use of Reducto (eval tools, monitoring etc).

Generally speaking, my view on the space is that this was crowded well before LLMs. We've met a lot of the folks that worked on things like drivers for printers to print PDFs in the 1990s, IDP players from the last few decades, and more recent cloud offerings.

The context today is clearly very different than it was in the IDP era though (human process with semi-structured content -> LLMs are going to reason over most human data), and so is the solution space (VLMs are an incredible new tool to help address the problem).

Given that I don't think it's surprising that companies inside and outside of YC have pivoted into offering document processing APIs over the past year. Generally speaking we don't see differentiation in the sense of just feature set since that'll converge over time, and instead primarily focus on accuracy, reliability, and scalability, all 3 of which have a very substantive impact from last mile improvements. I think the best testament I have to that is that the customers we've onboarded are very technical, and as a result are very thorough when choosing the right solution for them. That includes a company wide roll out at one of the 4 biggest tech companies, one of the 3 biggest trading firms, and a big set of AI product teams like Harvey, Rogo, ScaleAI etc.

At the end of the day I don't see VLM improvements as antagonistic to what we're doing. We already use them a lot for things like an agentic OCR (correcting mistakes from our traditional CV pipeline). On some level our customers aren't just choosing us for PDF->markdown, they're onboarding with us because they want to spend more of their time on the things that are downstream from having accurate data, and I expect that there'll be room for us to make that even more true as models improve.

kbyatnal•7mo ago

Founder of Extend (https://www.extend.ai/) here, it's a great question and thanks for the tag. There definitely are a lot of document processing companies, but it's a large market and more competition is always better for users.

In this case, the Reducto team seems to have cloned us down to the small details [1][2], which is a bit disappointing to see. But imitation is the best form of flattery I suppose! We thought deeply about how to build an ergonomic configuration experience for recursive type definitions (which is deceptively complex), and concluded that a recursive spreadsheet-like experience would be the best form factor (which we shipped over a year ago).

> "How do you see the space evolving as LLMs commoditize PDF extraction?"

Having worked with a ton of startups & F500s, we've seen that there's still a large gap for businesses in going from raw OCR outputs —> document pipelines deployed in prod for mission-critical use cases. LLMs and VLMs aren't magic, and anyone who goes in expecting 100% automation is in for a surprise.

The prompt engineering / schema definition is only the start. You still need to build and label datasets, orchestrate pipelines (classify -> split -> extract), detect uncertainty and correct with human-in-the-loop, fine-tune, and a lot more. You can certainly get close to full automation over time, but it takes time and effort — and that's where we come in. Our goal is to give AI teams all of that tooling on day 1, so they hit accuracy quickly and focus on the complex downstream post-processing of that data.

[1] https://dub.sh/ojv9b7p

[2] https://dub.sh/X7GFlDd

wilson090•7mo ago

I've used instabase before which has had the same UX for years. What about benchmarks between the two on extraction performance?

adit_a•7mo ago

Hey, we've never used or even attempted to use your platform. Respectfully I think you know that, and that you also know that your team has tried to get access to ours using personal gmail accounts dating back to 2024.

A schema builder with nested array fields has been part of our playground (and nearly every structured extraction solution) for a very long time and is just not something that we even view as a defining part of the platform.

kbyatnal•7mo ago

Thanks for the reply. Not sure what you're referring to, but I don't believe we've ever copied or taken inspo from you guys on anything — but please do let me know if you feel otherwise.

It's not a big deal at the end of the day, and excited to see what we can both deliver for customers. congrats on the launch!

Kiro•7mo ago

Two YC companies openly fighting and accusing each other. Not a good look and I'm surprised that you haven't been reprimanded yet.

serjester•7mo ago

I'm completely impartial here - seems like there's only so many ways you can design a schema builder?

throwaway7783•7mo ago

I agree. I don't know either company, schema builder is a very common feature in many data platforms. Nested or otherwise. Neither is claiming this is a big deal though.

koakuma-chan•7mo ago

Those are all ninja turtles and pdftotext is splinter.

bze12•7mo ago

Nice! I was already considering using reducto api. Will give this a try

adit_a•7mo ago

Let us know if you have any feedback!

serjester•7mo ago

Congrats on the launch guys, mobile website seems to be broken though.

adit_a•7mo ago

Thank you! What's the error you're seeing on mobile?

serjester•7mo ago

It crashes with "a problem repeatedly occurred". I think there's some sort of infinite loop - fails on both safari and chrome on my iPhone.

raunakchowdhuri•7mo ago

We're fixing it! This for some reason happens on only _some_ phones in our office so was hard to repro. I think it has to do with Safari rendering. Will tone down our WebGPU usage

willwjack•7mo ago

This would have saved me so much pain back when I was working on RAG workflows. Great to see.

adit_a•7mo ago

Would love to help if you end up having any use cases in the future!

Fraaaank•7mo ago

Why do you only get a data processing agreement when on the enterprise plan? It's a legal requirement for any European company.

adit_a•7mo ago

We have a default DPA we're willing to sign on all tiers -- the note in the pricing page is meant to refer to custom/redlined DPAs that become complex to manage over time

We'll edit that to make it more clear

b0a04gl•7mo ago

if reducto leans in fully as the layer that remembers every correction, every edge case, every shift in layout or wording across document versions it starts becoming more than a pipeline. it becomes institutional memory for unstructured data. none of the other players really do that. they extract, maybe evaluate once, then forget.

but the real pain is always in the second and third batch. when formats change subtly. if reducto becomes the system that adapts without you babysitting it, that's where it may win. continuity's the moat imo among the competitors

raunakchowdhuri•7mo ago

this is exactly where we're going with this! glad you see the vision :)

adit_a•7mo ago

Yeah, we're extremely excited about the potential of building a flywheel for each individual customer's pipeline.

c_moscardi•7mo ago

We chatted a few months back -- congrats on launch! Looks like a great UX.

adit_a•7mo ago

Ah yeah I remember! Great to hear from you and thanks :)

nicodjimenez•7mo ago

For accurate and easy PDF to Markdown / LaTeX / JSON check out:

https://github.com/mathpix/mpxpy

Disclaimer: I'm the founder. Reducto does cool stuff on post processing (and other input formats), but some people have told me Mathpix is better at just getting data out of PDFs accurately.

bravesoul2•7mo ago

Got a nasty doc I'll test on you ha ha! Tried to OCR/AI it and it drove me nuts.

techguy06•7mo ago

Your landing page looks absolutely beautiful, did you use Framer or any other landing page builder or is it code?

adit_a•7mo ago

Code!

rd•7mo ago

Cool product, but also crazy to see someone with my name in the wild. Every Raunak I've ever met has the Ronak spelling.

raunakchowdhuri•7mo ago

dang no way! we were both in boston too

jhuguet•7mo ago

Founder of anyformat.ai here, building from Madrid, Spain, with a specific focus on Europe and its unique market and regulation dynamics.

Just want to say how energizing it is to see this space maturing through thoughtful products like Extend and Reducto. Congrats to both for your Series A. I’d also mention GetOmni, as they’re doing great work leading the open-source front with their ZeroX project. We’ve learned a lot by observing your execution, and frankly, anyone serious about document intelligence tracks this ecosystem closely. It’s been encouraging to see ideas we were exploring early last year reflected in your recent successes. No shame there; good ideas often converge over time.

When we started fundraising (prior to GPT-4o), few investors believed LLMs would meaningfully disrupt this space. Finding the right supporters meant enduring a lot of rejection and delayed us quite a bit. Raising is always hard, and especially in Spain, where even a modest €500K pre-seed round typically requires proven MRR on the order of €10K.

We’re earlier-stage, but strongly aligned in product philosophy. Especially in the belief that the challenge isn’t just parsing PDFs. It’s building a feedback loop so fast and intuitive that deploying new workflows feels like development, not consulting. That’s what enables no-code teams to actually own automation.

From our experience in Europe, the market feels slower. Legacy tools like Textract still hold surprising inertia, and even €0.04/page can trigger pushback, signaling deeper friction tied to organizational change. Curious if US-based teams see the same, or whether pricing and adoption are more elastic. We’ve also heard “we’ll build this internally in 3 weeks” more times than we can count—usually underestimating what it takes to scale AI-based workflows reliably.

One experiment we’re excited about is using AI agents to ease the “blank page” problem in workflow design. You type: “Given a document, split it into subdocuments (contract, ID, bank account proof), extract key fields, and export everything into Excel.” The agent drafts the initial pipeline automatically. It helps DocOps teams skip the fiddly config and get straight to value. Again, no magic—just about removing friction and surfacing intent.

Some broader observations that align with what others here have said:

- Parsing/extraction isn’t a long-term moat. Foundation models keep improving and are beginning to yield bounding boxes. Not perfect yet, but close.

- Moats come from orchestration-first strategies and self-adaptive systems: rapid iteration, versioning, observability, and agent-assisted configuration using visual tools like ReactFlow or Langflow. Basically, making life easier for the pipeline owner.

- Prompt-tuning (via DSPy, human feedback, QA) holds promise for adaptability but is still hard to expose through intuitive UX, especially for semi-technical DocOps users without ML knowledge.

- Extraction confidence remains a challenge. No method fully prevents hallucinations. We shared our mitigation approach here: http://bit.ly/3T5nB3h. OCR errors are a major contributor: we’ve seen extractions marked high-confidence despite poor OCR input. The extraction logic was right, but we failed to penalize for OCR confidence (we’re fixing that).

- Excel files are still a nightmare. We’re experimenting with methods like this one (https://arxiv.org/html/2407.09025v1), but large, messy files (90+ tabs, 100K+ rows) still break most approaches.
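The OCR-confidence point in the list above can be illustrated with a small sketch. This is one assumed way to penalize an extraction score by the quality of its OCR input, not the approach described in the linked write-up; the `combined_confidence` function and the numbers are hypothetical.

```python
def combined_confidence(extraction_conf, ocr_word_confs):
    """Scale an extraction confidence by the weakest OCR word it relies on.

    extraction_conf: confidence in [0, 1] from the extraction step.
    ocr_word_confs:  OCR confidences in [0, 1] for the source words
                     backing the extracted value.
    Hypothetical helper: multiplying by the minimum is an assumption
    made for illustration; a mean or product would also be plausible.
    """
    if not ocr_word_confs:
        # No OCR evidence for the value, so do not trust it at all.
        return 0.0
    return extraction_conf * min(ocr_word_confs)


# The extraction step is confident, but one source word was read poorly,
# so the combined score falls well below the extraction score alone.
print(combined_confidence(0.95, [0.99, 0.40, 0.97]))
```

Using the minimum rather than the mean means a single badly recognized digit in an amount is enough to flag the field for review.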

I’d love to connect with other founders in this space. Competition is energizing, and the market is big enough for multiple winners. You guys, along with llamaparse, are spearheading the movement from what I see. Also, incumbents are moving fast, like the Snowflake + Landing AI partnership, but fragmentation is probably inevitable. Feels like the space will stratify fast: some will vanish, some will thrive quietly, and a few might become the core infrastructure layer.

We’re small, building hard, and proud to be part of this wave. Kudos again to @kbyatnal and @adit_a for raising the bar, would be great to chat anytime or even offer some workspace if you ever visit Spain!

adit_a•7mo ago

Appreciate the thoughtful note and want to wish you guys the best as well!

murshudoff•7mo ago

Thanks for sharing so much detail. I am the CTO & Co-Founder of https://turbotable.ai (landing is outdated, will be updated soon), a similar product in the space, but mainly focused on more general automation and data analysis for non-technical teams. OCR is one of the tools in our arsenal and our bet is that LLMs will get better at it. I can see 2 limitations with this approach:

- No reliable grounding or bounding boxes (for now)

- Context length (we have a solution for this, similar to Zerox by Omni)

Even if in the long run foundation models do not solve OCR completely and reliably, we still have the option to develop custom solutions or to integrate with mature players.

I’d love to connect with other founders as well.

throwaway7783•7mo ago

Congratulations! Do you compete with unstructured.io? And how do you think about ever-improving models from Google etc? (The doc extract API, AWS has something as well)

zeld4•7mo ago

Looks promising.

Do you store the uploaded doc from free/test account?

adit_a•7mo ago

Yes in the sense that we have features that will create persisted share links, and by default you can revisit results in your free account until you decide to delete them.

If helpful, we also offer free trial accounts with zero data retention if that's important for your use case

think4coffee•7mo ago

Are you guys Harry Potter fans? Curious how you came up with the name Reducto

adit_a•7mo ago

Hahaha, a while ago (even before choosing this idea space) we said we would build "magical tools for developers" and Reducto was the name we landed on out of a long list of magic adjacent things

bitdribble•7mo ago

Lovely to see Reducto's studio, and get pointers to many other players in the field!

I am the founder of http://DocRouter.AI, https://github.com/analytiq-hub/doc-router. Available online as http://app.docrouter.ai (no paywall, working on Stripe integration).

Pre-seed stage, looking for collaborators and funding.

Ours is open source. Think of us as an ERP for documents, LLM prompts, and extraction schemas. We run on top of litellm, as a portability layer, so we support all major LLM models.

Extraction schemas can be configured through a drag-and-drop UI, or inline by editing JSON.

A tagging mechanism is used to determine which prompts run on which documents - so we don't run all prompts against all documents, which would be a quadratic problem.
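A rough sketch of how tag-based routing avoids the all-prompts-times-all-documents blowup is below. It is an illustration under assumptions, not DocRouter's actual code or API; the function names, dict shapes, and sample tags are hypothetical.

```python
from collections import defaultdict


def build_tag_index(prompts):
    """Map each tag to the names of the prompts that carry it."""
    index = defaultdict(list)
    for prompt in prompts:
        for tag in prompt["tags"]:
            index[tag].append(prompt["name"])
    return index


def prompts_for_document(document, index):
    """Return the prompts whose tags overlap the document's tags."""
    selected = []
    for tag in document["tags"]:
        for name in index.get(tag, []):
            if name not in selected:  # keep order, skip duplicates
                selected.append(name)
    return selected


# Hypothetical prompts and document.
prompts = [
    {"name": "grade_quiz", "tags": ["quiz"]},
    {"name": "extract_invoice", "tags": ["invoice"]},
    {"name": "check_po_match", "tags": ["invoice", "purchase-order"]},
]
index = build_tag_index(prompts)

invoice_doc = {"id": "doc-1", "tags": ["invoice"]}
print(prompts_for_document(invoice_doc, index))
# ['extract_invoice', 'check_po_match']
```

Each document only triggers the prompts that share one of its tags, so the number of LLM calls grows with the number of matches rather than with documents times prompts.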

APIs are available for all functions (upload docs, configure prompts & schemas, download results).

We are designed for human-in-the-loop workflows, where precise processing of financial, insurance, or medical data is essential.

We see two main use cases, right now:

1 - Accelerating AI adoption in other engineering organizations, who don't have time to build the AI pipelines in-house. In this use case, we can quickly develop a specialized UI for you (Lovable, Bolt + adapting the generated UI with Cursor for your use case). In this play, we are a data layer accelerator for your AI solution.

2 - Solving point problems in document processing in insurance, medical, biotech, revenue cycle management, supply chain... In this use case, the business pain point we solve is manual processing of documents in an ERP that may not have the latest AI features. DocRouter.AI sits inline, in front of the ERP, picking selected faxes, emails, docs - processing them with LLMs, and inserting structured data into your ERP, saving on human labor.

The 2nd use case is something we see again and again in the industry. Legacy ERP vendors are slow to adopt AI processing, and businesses sitting on top of an ERP find it prohibitive to switch ERPs. These businesses are nickel and dimed over any small new ERP feature (...want to support PDFs not just TIFFs? that's thousands of dollars!... want to call APIs into the ERP? that's charged per API call!...)

They desperately need solutions to solve business workflows with AI, to free up FTEs to do more interesting work.

Here is a 30m recorded talk from a Mindstone meetup: https://community.mindstone.com/annotate/article_AuDOhLA5awW... where I showed how DocRouter.AI can be used to grade middle school quizzes with AI, with a teacher-in-the-loop. This was a "1st use case" application, with a custom UI, specialized to the application.

For the grade-school-quizzes-with-AI application, we generated the quiz rubric synthetically with AI, as we did the student quizzes. The rubric is embedded in the LLM prompt. The quiz PDF is tagged with the same tag as the corresponding rubric prompt (so it's graded with the corresponding rubric).

This idea of matching a quiz against a quiz rubric comes up again and again in many other examples. The same mechanism can be used to:

- Match invoices with purchase orders

- Or, to verify invoices against allowed amounts in a contract.

- Or, to check if standard operating procedures for transportation security comply with government or insurance rules.

- Or, to check if medical documents comply with a set of insurance rules. This is a use case I developed over a year and a half in the Durable Medical Equipment space, as consulting work (and it inspired the design of the DocRouter as a more general solution).

The idea of a system that just keeps track of prompts, extraction schemas and documents, while very simple, can solve many problems in different verticals.

In fact, I believe that, when multiple products can solve the same problem, it is the simplest product that has the best chance to succeed.

So, a lot of thinking goes into keeping the design simple, the APIs complete - removing unnecessary artifacts. If new features are needed, they can be added as an external block, so the central function of the DocRouter does not need to become cluttered.

Here are tech slides from my Boston PyData presentation, where I showed how DocRouter.AI was implemented, using React, NextJS, FastAPI, and with a MongoDB back end: https://docs.google.com/presentation/d/14nAjSmZA1WGViqSk5IZu...

(I did not know how to program React before this... but in the brave new world of Cursor and Windsurf editors, I can venture into bold new directions!)

Ping me if you are interested to collaborate, or just if you are interested in the space!

Our thesis is that the space is large enough, and there's a market for multiple players. We do specialize on business workflows with human-in-the-loop, and we offer consulting services for project integration / turnkey delivery.

Andrei Radulescu-Banu, andrei@analytiqhub.com
