Ask HN: Is our data warehouse setup normal or over-complicated?

4•ealready_value•1h ago

I've been pulled onto a new feature that replaces some of our existing customer-facing reports with reports from the data warehouse. This isn't the first report from the data team that we've integrated into the product, but since it involves existing reports that I'm the local expert on, I'm getting drawn into the process. The current reports don't have any performance issues, but the decision to change has been made anyway.

From what I've been able to gather, the data goes from the production MySQL database to a secondary MySQL database using DMS. Glue jobs then ship the data out to a data lake in S3. After that, several transformation jobs convert the data into what I've been told is a "canonical" form, smoothing out all the differences between verticals. I think they said the data then goes into a second data lake, where additional transformations are performed. Finally, the data reaches its final resting place in Redshift, where QuickSight is used to create reports. I'm fairly certain I missed a couple of steps, because I couldn't figure out the purpose of each step as they were describing the process.

Getting reports out of that process seems painful. Showing a report for an internal customer (sales or customer support for instance) means they need a QuickSight account and access to the specific report. Getting access to that for myself was not straightforward, which makes me think it is hand-managed by a dev.

Showing a report in the product feels worse. First, the data team are about the only people who can create these reports: the product devs don't know this "canonical" form, and getting the development environment running consistently for product devs has been like pulling teeth. Once someone has written the report, they have to promote it by copying it exactly, including an identical report id, to another region. Finally, the report id is given to the product team to put into the product. Adding the report id to the product is the easiest part, but the data journey doesn't stop there. The product has to pass that report id and user information to a lambda, maintained by the data team, that generates a URL for the product to embed in an iframe. And after all of that, the report doesn't come close to matching the look of the site.
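A rough sketch of what that lambda step might look like, for anyone trying to picture it. The data team's actual code is not shown in this thread, so this assumes the lambda wraps the QuickSight `generate_embed_url_for_registered_user` API call available in boto3. The function name `build_embed_url` and all of its parameters are hypothetical.

```python
from typing import Any


def build_embed_url(
    client: Any,
    account_id: str,
    user_arn: str,
    dashboard_id: str,
    minutes: int = 15,
) -> str:
    """Return a short-lived URL the product can place in an iframe.

    `client` is assumed to behave like boto3.client("quicksight").
    `dashboard_id` plays the role of the "report id" from the post.
    """
    response = client.generate_embed_url_for_registered_user(
        AwsAccountId=account_id,
        UserArn=user_arn,
        # Open the embedded session directly on the requested dashboard.
        ExperienceConfiguration={
            "Dashboard": {"InitialDashboardId": dashboard_id}
        },
        SessionLifetimeInMinutes=minutes,
    )
    return response["EmbedUrl"]
```

In a real handler the client would come from `boto3.client("quicksight")`, and the user ARN would be looked up from the user information the product sends. Passing the client in as an argument keeps the function easy to test without AWS credentials.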

Is this data warehouse setup normal? Is this a common way to handle in-product reports after a company invests in a data warehouse? There are a lot of what seem like redundant steps, as well as a lot of custom code for what I would expect to be built into these products.

Comments

icedchai•43m ago

Without understanding the differences between the "source" and "canonical" forms, it is tough to say. Also, how much data are we actually talking about? The pipeline you describe may be entirely reasonable, or it may be an over-engineered, convoluted contraption that could be replaced with a single DB replica and a few views to simplify queries.
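To make the "replica and a few views" alternative concrete, here is a minimal sketch. It uses an in-memory SQLite database as a stand-in for a MySQL read replica, and the table `report_rows`, the view `customer_report`, and all column names are made up for illustration; nothing here reflects the poster's real schema.

```python
import sqlite3

# In-memory SQLite stands in for a read replica of the production database.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Hypothetical de-normalized report table.
    CREATE TABLE report_rows (
        customer_id  INTEGER,
        vertical     TEXT,
        metric_name  TEXT,
        metric_value REAL
    );
    INSERT INTO report_rows VALUES
        (1, 'retail',      'orders',   12),
        (1, 'retail',      'orders',    8),
        (2, 'hospitality', 'bookings',  5);

    -- The view hides the aggregation so report queries stay simple.
    CREATE VIEW customer_report AS
    SELECT customer_id, vertical, metric_name, SUM(metric_value) AS total
    FROM report_rows
    GROUP BY customer_id, vertical, metric_name;
    """
)


def report_for(customer_id: int) -> list:
    """Fetch one customer's report rows through the view."""
    cur = conn.execute(
        "SELECT metric_name, total FROM customer_report "
        "WHERE customer_id = ? ORDER BY metric_name",
        (customer_id,),
    )
    return cur.fetchall()
```

Each vertical keeps its own metric names in this sketch, so nothing is forced into a shared shape. Whether that is acceptable depends on how much cross-vertical reporting is actually needed.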

My experience with QuickSight has been pretty negative. The overall UI/UX is pretty meh. If you're embedding it in your product you may be better off generating your own reports, in app.

ealready_value•26m ago

The source form is the production database, which is what the current reports pull from. The canonical form is the form that, in theory, all of the verticals get rolled into, but many of the nuances that our customers are used to having end up getting replaced with something similar, but not quite the same. Right now that's my biggest concern: that customers are not going to get the data they need because of this canonical form.

We're talking about a few hundred megabytes of data for all of the customers that these reports pull from, and that covers the past 15 years. We also have about 25k customers, which shrinks how much any single customer can pull even further. One last point is that we already de-normalize the report data into its own table specifically for these reports, so that's not something the data warehouse is doing for us.
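A quick back-of-the-envelope calculation shows the scale. The 300 MB figure is an assumption standing in for "a few hundred megabytes"; the customer count and time span are the ones given above.

```python
TOTAL_MB = 300        # assumption: stand-in for "a few hundred megabytes"
CUSTOMERS = 25_000    # from the thread
YEARS = 15            # from the thread

total_bytes = TOTAL_MB * 1024 * 1024

# Average share of the report data per customer, in KiB.
per_customer_kib = total_bytes / CUSTOMERS / 1024

# Average amount a customer accumulates per year, in KiB.
per_customer_year_kib = per_customer_kib / YEARS

print(f"{per_customer_kib:.1f} KiB per customer over {YEARS} years")
print(f"{per_customer_year_kib:.2f} KiB per customer per year")
```

These are averages only; a few large customers could hold far more than this, but it gives a sense of how small the typical report query is.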

I agree with you about QuickSight; it matches my experience exactly. My preference is to continue using the reports we generate in the app, but I'm trying to wrap my head around cases where this ends up being the better direction.
