In every case, we found it nearly impossible to assemble an ideal dataset for training models. In real-world systems, the information you actually need is scattered across 30–300+ tables, spread over warehouses, Parquet files, CSVs, and legacy databases that nobody fully understands anymore.
We realized the real job isn’t ETL (too wide) or feature engineering (too narrow); it’s constructing the ideal representation of the problem so downstream models can actually learn something meaningful.
So we built Norma, an optimization-first data platform. It does the things every ML team wishes their stack would do:
1. Unity Catalog integration that works out of the box - connect a warehouse, instantly browse tables with lineage, schemas, and metadata.
2. A unified SQL/Python pipeline engine - both languages run against the same in-memory data (via DuckDB), so no more glue code or brittle data hops.
3. An AI assistant for transformations - ask for a feature, a join, an explanation, or a visualization, and it generates the pipeline steps.
4. Multi-bandit 5-fold cross-validation - fast, automatic evaluation of transformed datasets with XGBoost.
5. Visual lineage + shared datasets - every step is inspectable, reproducible, and shareable across teams.
That’s what we have today.
We’re still building:
- Automatic leakage detection (timestamp violations, post-outcome signals, unsafe joins)
- Relevant table discovery (find the tables that actually matter for predicting your target)
- Relevant row selection (especially for PFN-style models with row limits)
- Automated feature representation (scaling, encoding, aggregation, embeddings)
- AutoGluon + TabPFN integration (train strong models on normalized, optimized datasets)
- Differential privacy guardrails for LLM usage inside your data workflows
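For a taste of what the leakage detector above has to catch, the simplest case is a timestamp violation: a feature observed at or after the outcome it is supposed to predict. A minimal check (field names are hypothetical; a real detector would also cover post-outcome signals and unsafe joins):

```python
from datetime import datetime

def timestamp_violations(rows, feature_ts="feature_ts", label_ts="label_ts"):
    """Return indices of rows where the feature timestamp is not strictly
    before the label timestamp -- a classic leakage signal."""
    return [i for i, r in enumerate(rows) if r[feature_ts] >= r[label_ts]]

rows = [
    {"feature_ts": datetime(2024, 1, 1), "label_ts": datetime(2024, 1, 5)},  # ok
    {"feature_ts": datetime(2024, 2, 9), "label_ts": datetime(2024, 2, 3)},  # leak
]
```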
We’re trying to build the equivalent of a representation compiler: raw warehouse → optimal feature space → any model or BI tool.
If you’ve ever lost days hunting through a schema, debugging leakage, redoing feature pipelines, or trying to understand why a model plateaus even though your data is “fine,” I’d genuinely love your feedback. We’re still working closely with teams to refine our features and capabilities, and we’d love to share a private beta with your team. Please join the waitlist!
Happy to answer anything here.