
Why Most Valuable AI Systems Are Still Tabular Models

5•madman2890•2h ago
The Hard Part of Predictive AI Isn’t the Model

I’ve spent most of my career building predictive systems on tabular data.

The highest-value AI systems I’ve seen in production aren’t LLMs. They’re predictive models that operate on structured operational data: customers, orders, shipments, transactions, support events, etc.

These systems quietly generate millions in value by replacing expensive third-party services, improving operational decisions, and turning predictions into products.

Examples include churn prediction, fraud detection, ETA prediction, inventory demand forecasting, and operational anomaly detection.

In practice, the model itself is rarely the bottleneck.

The real bottleneck is integrating signals from relational data.

Why Tabular Data Is Hard

Most operational systems store data across many relational tables, not a single ML-ready dataset.

For example, consider a simple commerce schema:

customers -> orders -> order_notifications

If you want to train a model predicting something like:

Will this customer churn in the next 30 days?

the model does not train directly on these tables.

Instead you must first construct a training table like:

customer_id, num_orders_last_30_days, avg_order_value, days_since_last_order, num_notifications_last_7_days, notification_rate_per_order, ..., target_churn

Building this dataset requires joins, aggregations, time windows, handling one-to-many relationships, and preventing data leakage.

For example:

num_orders_last_30_days = COUNT(orders WHERE order_timestamp >= now() - 30d)

num_notifications_last_7_days = COUNT(order_notifications WHERE timestamp >= now() - 7d)
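A minimal pandas sketch of the first aggregation, using the post's hypothetical schema (table and column names are illustrative). One detail worth making explicit: computing features as of a fixed cutoff timestamp, rather than `now()`, is what prevents leakage when the churn label is defined over the window after the cutoff.

```python
import pandas as pd

# Hypothetical orders table from the commerce schema above.
orders = pd.DataFrame({
    "customer_id": [1, 1, 2],
    "order_timestamp": pd.to_datetime(
        ["2026-02-20", "2026-01-01", "2026-02-25"]
    ),
})

# Features are computed as of a fixed cutoff, not now(): only rows
# strictly before the cutoff may contribute, because the churn target
# is defined over the 30 days *after* the cutoff.
cutoff = pd.Timestamp("2026-03-01")
window = orders[
    (orders.order_timestamp >= cutoff - pd.Timedelta(days=30))
    & (orders.order_timestamp < cutoff)
]
num_orders_last_30_days = (
    window.groupby("customer_id").size().rename("num_orders_last_30_days")
)
```

The `num_notifications_last_7_days` feature is the same pattern with a 7-day window over the notifications table.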

This sounds simple, but at scale it quickly becomes hundreds of features, dozens of tables, and complex temporal joins.

In most organizations this data preparation step dominates the project.

Not the model.

Where Tabular Foundation Models Fit

Recently there has been a lot of excitement around tabular foundation models, such as TabPFN, TabTransformer variants, and other pretrained tabular architectures.

These models are interesting because they can often produce strong predictions with very little tuning.

You can often train them with something as simple as:

model.fit(X_train, y_train)

and they work surprisingly well.

However, these models typically expect a single flat table.

Something like:

customer_id, f1, f2, f3, f4, ..., target

They generally do not operate directly on relational schemas.

So the fundamental bottleneck remains:

How do you turn relational data into a useful feature table?

GraphReduce: Treating Relational Data as a Graph

This is where approaches like GraphReduce come in.

Relational schemas naturally form a graph structure.

Using the previous example:

customers -> orders -> order_notifications

Each edge represents a relationship where signals can propagate.

For example, orders can propagate to customers, and notifications can propagate to orders and then to customers.

GraphReduce treats the schema as a propagation graph.

Each table contributes signals that are aggregated upward.

Example propagation:

From order_notifications to orders:

notifications_per_order, max_notification_delay, notification_count

Then from orders to customers:

total_orders, avg_order_value, orders_last_30_days, notification_rate_per_order

The result is a feature table at the target level:

customer_id, orders_last_30_days, avg_order_value, notification_rate, days_since_last_order, ...

This table can then be fed directly into a predictive model.
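The two-hop propagation can be sketched in plain pandas (the schema and column names are the post's hypothetical example, not GraphReduce's actual API): aggregate each child table up one edge at a time until everything sits at the prediction grain.

```python
import pandas as pd

# Hypothetical tables from the commerce schema.
orders = pd.DataFrame({
    "order_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "order_value": [20.0, 40.0, 15.0],
})
notifications = pd.DataFrame({
    "order_id": [10, 10, 12],
})

# Hop 1: order_notifications -> orders.
notif_per_order = (
    notifications.groupby("order_id")
    .size()
    .rename("notification_count")
    .reset_index()
)
orders = orders.merge(
    notif_per_order, on="order_id", how="left"
).fillna({"notification_count": 0})

# Hop 2: orders -> customers. Child signals are aggregated upward
# to the target grain (customer_id).
customer_features = orders.groupby("customer_id").agg(
    total_orders=("order_id", "count"),
    avg_order_value=("order_value", "mean"),
    notification_rate_per_order=("notification_count", "mean"),
)
```

GraphReduce's contribution, as described, is automating exactly this traversal over many tables and time windows instead of hand-writing each hop.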

Why This Matters for Tabular Foundation Models

Tabular foundation models are strongest when operating on a well-constructed flat dataset.

GraphReduce helps produce that dataset automatically by traversing relational graphs, aggregating signals, and generating structured features.

The pipeline looks like this:

Relational DB -> GraphReduce -> Unified feature table -> Tabular foundation model (e.g. TabPFN) -> Prediction

In practice this can dramatically increase the throughput of building predictive systems, because the hardest step, data integration, becomes much easier.

Why This Is Still an Open Problem

Most AI discussion today focuses on models.

But for structured data systems, the real challenges are relational structure, temporal aggregation, signal propagation, and feature construction.

Until those problems are solved, the modeling layer will always be limited.

Tabular foundation models may significantly reduce the modeling effort.

But relational data preparation remains the gating step.

The interesting opportunity is combining both.

Example Implementation

Here is a simple end-to-end example combining relational aggregation with a tabular foundation model:

https://wesmadrigal.github.io/GraphReduce/end_to_end_examples/predictive_ai_tabpfn/

There is some similar research coming out of the University of Hong Kong: https://arxiv.org/pdf/2602.13697

Thoughts?

Comments

amazonbezos•1h ago
totally agree
madman2890•1h ago
with all of it? wow :)