
Show HN: An unstructured data workspace for data transformations with LLM

https://www.usefolio.ai/

3•nibab•1h ago

hi HN!

A couple of months ago I had to analyze a few thousand audio recordings to help identify issues with customer support. I was able to get some raw, high-level initial results with Python scripts invoking LLM APIs, but they were too general and unhelpful. Writing basic prompts is easy, but tuning them and making them specific enough to ensure no faint signal is missed is hard. You need to iterate through the data with an initial prompt, segment the data into different buckets, chain another prompt for each bucket, and so on. Then you need to constantly review the raw data to tweak the prompts just the right way to get the desired results.
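A rough sketch of that bucket-and-chain loop, for readers who have not tried it. This is illustrative only and not folio's code: `call_llm` is a hypothetical stand-in that uses keyword rules so the example runs offline, and the bucket names and follow-up prompts are made up.

```python
from collections import defaultdict

def call_llm(prompt: str, text: str) -> str:
    """Hypothetical stand-in for a real LLM API call (keyword rules, runs offline)."""
    lowered = text.lower()
    if prompt.startswith("classify"):
        if "refund" in lowered:
            return "billing"
        if "crash" in lowered:
            return "bug"
        return "other"
    # Follow-up prompts just echo the prompt and a short excerpt here.
    return f"[{prompt}] {text[:40]}"

def bucket_and_chain(transcripts, followups):
    # Pass 1: one generic prompt assigns every transcript to a bucket.
    buckets = defaultdict(list)
    for t in transcripts:
        buckets[call_llm("classify the issue", t)].append(t)
    # Pass 2: each bucket gets its own, more specific prompt.
    results = {}
    for name, items in buckets.items():
        prompt = followups.get(name, "summarize")
        results[name] = [call_llm(prompt, t) for t in items]
    return results

transcripts = [
    "I want a refund for last month",
    "The app crashes when I upload a file",
    "How do I change my email?",
]
followups = {"billing": "extract the amount disputed", "bug": "extract repro steps"}
results = bucket_and_chain(transcripts, followups)
```

The painful part in practice is the loop around this: reading the raw rows in each bucket, adjusting the prompts, and re-running.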

There are no good user-facing tools for scaling to thousands of rows of unstructured data analysis with LLMs. Claude Cowork / agents with access to filesystems are scratching the surface, but having a text-only UI is challenging, especially when you want to go back and adjust your research pipeline, narrow down deterministically to a specific subset of your data with SQL-like filters, or do any cost management. Scaling past 100 files is not well supported. Deep research is difficult to steer and verify.

I needed a mini-data warehouse that could help me get insights out of my data, optimize costs with bulk LLM operations (via cost estimation and model choice), and let me browse and verify the data in a user-friendly way, without requiring me to set up something like Databricks. So, I built folio.
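To make the cost-estimation idea concrete, here is a back-of-the-envelope sketch. Everything in it is an assumption for illustration: the model names and prices are invented, the 4-characters-per-token rule is only a rough heuristic for English text, and only input tokens are counted. It is not how folio computes its estimates.

```python
# Hypothetical per-million-token input prices in USD; real prices vary by provider.
PRICE_PER_MTOK = {"small-model": 0.25, "large-model": 3.00}

def estimate_tokens(text: str) -> int:
    # Rough rule of thumb: about 4 characters per token for English text.
    return max(1, len(text) // 4)

def estimate_cost(docs, prompt, model):
    # Each row sends the prompt plus the document, so the prompt is counted once per document.
    total_tokens = sum(estimate_tokens(prompt) + estimate_tokens(d) for d in docs)
    return total_tokens / 1_000_000 * PRICE_PER_MTOK[model]

docs = ["x" * 4000] * 1000   # 1,000 documents of roughly 1,000 tokens each
prompt = "p" * 400           # roughly a 100-token prompt
for model in PRICE_PER_MTOK:
    print(model, round(estimate_cost(docs, prompt, model), 4))
```

Even a crude estimate like this is enough to decide whether a first pass should run on a cheaper model before a narrower second pass on a more expensive one.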

Folio is a free, local macOS app for analyzing your unstructured data with LLMs. It's a UI wrapper around a minimal data warehouse that lets users (and agents) do LLM-based transformations on big unstructured datasets. All you need to get started is an AI API key and an account with modal.com.

Users bring their files into Folio, where they get loaded into a table in which each row contains a markdown representation of the file contents. Users can then run LLM operations in bulk on those files and use SQL filters to create views and narrow down the scope of the transformations. Agents are first-class citizens and can plug into folio to do most of the work for you. To take load off the desktop for OCR/audio transcription, as well as the thousands of HTTP requests to AI APIs, we integrate with modal.com as the execution engine. A local orchestrator fans out jobs to modal and then fans them in once complete. Data is never stored anywhere, and only moves in transit through the AI API provider and the user's own modal infrastructure.
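The table, view, and fan-out/fan-in flow can be sketched in a few lines. This is a minimal illustration under stated assumptions, not folio's schema or orchestrator: the `files` table and its columns are hypothetical, SQLite stands in for the warehouse, a local thread pool stands in for modal, and `transform` is a placeholder for an LLM call.

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

def transform(markdown: str) -> str:
    # Placeholder for a remote LLM call; returns the word count as a string.
    return str(len(markdown.split()))

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files (id INTEGER PRIMARY KEY, name TEXT, kind TEXT, markdown TEXT, result TEXT)"
)
conn.executemany(
    "INSERT INTO files (name, kind, markdown) VALUES (?, ?, ?)",
    [
        ("call1.mp3", "audio", "customer asks about a late delivery"),
        ("q3.pdf", "pdf", "revenue grew in the third quarter"),
        ("call2.mp3", "audio", "customer reports a login error"),
    ],
)
# A view narrows the scope deterministically before any LLM spend.
conn.execute("CREATE VIEW audio_files AS SELECT id, markdown FROM files WHERE kind = 'audio'")
rows = conn.execute("SELECT id, markdown FROM audio_files ORDER BY id").fetchall()

# Fan out: one job per row. Fan in: collect results in row order and write them back.
with ThreadPoolExecutor(max_workers=4) as pool:
    outputs = list(pool.map(transform, [md for _, md in rows]))
conn.executemany(
    "UPDATE files SET result = ? WHERE id = ?",
    [(out, rid) for (rid, _), out in zip(rows, outputs)],
)
conn.commit()
```

Only the rows selected by the view are sent out for processing; the PDF row is left untouched with an empty `result`.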

folio workspaces are multi-modal (you can load different data types in the same workspace and move it through the same analysis pipeline) and they can support thousands of files.

People use folio today to:

- review customer support tickets/emails: bucket issues into different categories, narrow in on categories of interest, and then action that data by generating a response.
- extract detailed data from financial documents: load all data that can be found on a particular company, extract structured data like revenue numbers and projections.
- do literature reviews: there are lots of agents that help you load data from research paper repositories. Once that data is loaded into folio, users can do a steerable deep research over those files.
- perform criteria-based search: generate yes/no criteria like "document contains data on XYZ", "document mentions ABC", "document cites XYZ".

Companies like v7labs, hebbia, Legora, and Harvey have similar "Tabular Document Review" features, but they are not scalable or compatible with outside agents like Claude Code. Additionally, they require expensive enterprise contracts.

I see folio moving beyond data analysis into the perfect companion for agentic tasks that require a human-facing UI/UX, cost management and actioning on data in bulk.

Website: https://www.usefolio.ai
GitHub: https://github.com/usefolio/folio
X: https://x.com/usefolio_ai

Looking forward to hearing what people think!
