

Ask HN: Data engineers, what sucks when working on exploratory data-related tasks?

14•robz75•7mo ago
Hey guys,

Founder here. I’m working on building my next project and I don’t want to waste time solving fake problems.

Right now, what's extremely painful and annoying to do in your job? (You can be brutally honest.)

More specifically, I'm interested in how you handle exploratory data-related tasks within your team.

Very curious to hear your current workflows, issues, and frustrations :)

Comments

squircle•7mo ago
Conversations and interviews > Jupyter notebook
robz75•7mo ago
Why? What's currently annoying about the notebooks you have to deal with, compared to just going directly to users?
squircle•7mo ago
Ah, well, rereading your original post I realize now this isn't necessarily painful for me. Perhaps, though, the annoying aspect is seeing others use proprietary Excel spreadsheets without a data lake. Conway's Law?

Does VS here mean Visual Studio? I wouldn't call myself a data engineer; I just play one at work sometimes. Many hats, y'know?

robz75•7mo ago
"the annoying aspect is seeing others use proprietary excel spreadsheets without a data lake" => what's painful about that?

VS = compared to, versus

squircle•7mo ago
Hah, okay. I read VS differently from vs. The pain, in part: hidden functions, rarely any inline documentation, spreadsheets that are difficult to reuse or repurpose, Windows-centric tooling, etc.
clejack•7mo ago
The main issues for problems like this fall into three categories:

- Things that prevent you from starting the job: org silos, security, and permissions.

- Things that prevent you from doing the job: primarily data cleaning.

- Things that make the job more difficult: poor tooling. You'll struggle to break the stranglehold that SQL and Python/pandas have in this area, and I'll add plotting libraries to this as well; many of them suck in a seemingly unavoidable way.

On the second and third points, LLMs will most likely own these soon enough, though maybe there's room to build something small and local that's more efficient if the scope of the agent is reduced?

The first point is generally organizational, and it's very difficult to solve without integrating your system into the customer's environment, which is the strategy pursued by companies like Snowflake and Databricks.

robz75•7mo ago
What are the pain points you are facing with data cleaning? How do you handle it right now?
dapperdrake•7mo ago
Data cleaning depends on the problem domain.

Compare cleaning the output from a spectrometer (or spectrograph) with eliminating outliers from an almost linear process. In one case the "outliers" are the signal and removing them will wreck your data; in the other, removing them is the only correct thing to do.

         *         
**** ****
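
For the almost-linear case, a minimal sketch of that outlier test (numpy only; the synthetic data and the 5×MAD threshold are illustrative assumptions, not a prescription):

    import numpy as np

    # Synthetic, almost linear data with one spurious spike.
    rng = np.random.default_rng(0)
    x = np.arange(20, dtype=float)
    y = 2.0 * x + 1.0 + rng.normal(0.0, 0.1, size=20)
    y[7] += 15.0  # the lone "*" above the line

    # Fit a line, then flag points whose residual sits far from the rest.
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (slope * x + intercept)
    mad = np.median(np.abs(residuals - np.median(residuals)))
    keep = np.abs(residuals - np.median(residuals)) < 5.0 * mad
    x_clean, y_clean = x[keep], y[keep]

The median absolute deviation is used instead of the standard deviation so the outlier cannot inflate the very threshold meant to catch it.
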
daemonologist•7mo ago
As clejack said, "Org silos, security, and permissions" - this is usually the largest single time sink on any project that needs production data.

Related to this is obtaining data in bulk: teams are (understandably) usually not willing to hand out direct read access to their databases and would prefer you use their API, but they've usually built APIs intended for accessing single records at a relatively slow rate. It often takes some convincing (DoSing their API) to get a more appropriate bulk solution.
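
For scale, a back-of-the-envelope sketch of why record-at-a-time access hurts (the record count and rate limit are hypothetical):

    # Hypothetical: an API that returns one record per call,
    # rate-limited to 10 requests per second.
    records = 50_000_000
    requests_per_second = 10
    days = records / requests_per_second / 86_400
    print(f"{days:.0f} days for one full pull")  # ~58 days
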

ahahs•7mo ago
My experience is pretty much this. Having DB access would make my life so much easier.
dapperdrake•7mo ago
Have been working on this for a while with real stakes.

You have two issues that computers cannot help with (by their nature). And this incidental complexity dominates all the rest.

1. What people want to do with data

2. Bureaucracies are willfully oblivious to this problem domain

What people actually want to do with data: Answer questions that are interesting to them. It is all about the problem domain and its geometry.

Problem: you can only falsify hypotheses by asking reality questions. Everything else will bankrupt you. You can only work with the data that you have. Collecting data will always be hard. Computers are only involved because they happen to be good at crunching numbers.

Bureaucracies only care about process and never about outcomes. And LLMs can now produce random plausible PowerPoint material to satisfy this demand. Only plausibility ever mattered, because it is empirically sufficient as an excuse for CYA.

---------

Naval Ravikant (abridged): "Tell truth, don't waste word."

ferguess_k•7mo ago
Mostly human problems, especially if you work with analytics teams. I need a PO for data. We usually don't have a dedicated PO for data products, so we have to do all the requirements gathering ourselves.

For exploratory data-related tasks, these are mostly about checking data formats or malformed data, so it is not a huge issue. But since you are building a product, I'll share my experience: what I need is a quick way to explore schema changes within a column of a database table (not the schema of the table itself). Imagine you have a table `user` with a column, say `context`, holding a bunch of JSON payloads; I need a quick way to summarize all the "variations" of the schema of that field.
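
A minimal sketch of that kind of summary, assuming the payloads have already been fetched as strings (the `user`/`context` names are just the example above, and flat key:type signatures are a deliberate simplification; nested objects show up as plain `dict`):

    import json
    from collections import Counter

    def shape(value):
        # Reduce a decoded JSON value to a flat "key: type" signature.
        if isinstance(value, dict):
            inner = ", ".join(f"{k}: {type(v).__name__}"
                              for k, v in sorted(value.items()))
            return "{" + inner + "}"
        return type(value).__name__

    # Assume the raw column values were fetched with something like:
    #   SELECT context FROM user;
    payloads = [
        '{"plan": "free", "beta": true}',
        '{"plan": "pro", "beta": false, "seats": 5}',
        '{"plan": "free", "beta": true}',
    ]

    variations = Counter(shape(json.loads(p)) for p in payloads)
    for signature, count in variations.most_common():
        print(count, signature)  # e.g. 2 {beta: bool, plan: str}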

saulpw•7mo ago
VisiData makes a lot of things easier.
PaulShin•7mo ago
Great question. As a founder also working on my next project, the fear of "solving fake problems" is something I think about every day. Thanks for asking it.

For me, the single most frustrating part of any data-related task isn't the data itself. It's the "work about the work" – the soul-crushing feeling that I'm doing the same thing two or three times in different windows.

The biggest irony is that this is often caused by the very "smart work" tools that are supposed to make us more productive.

My typical workflow looks like this:

A request for data comes in on Slack. I pull the data, analyze it, and share a conclusion in the Slack thread. Then, I have to go to Jira to create a ticket that summarizes what I just said on Slack. Finally, I have to open Notion to write a brief document explaining the findings for the record. The context is constantly being copied, pasted, and fragmented. It's exhausting and feels like a waste of human potential.

This isn't a pitch, but this exact frustration is the only thing I'm focused on solving right now. My entire thesis is that the endless context switching between our communication layer (chat) and our execution layer (tasks, docs) is the biggest source of "fake work" in modern companies.

I'm building a tool where that entire "copy/paste the context" cycle is eliminated. A place where the conversation is the task, is the doc, is the context—all in one single flow.

I'm just a founder who is sincerely obsessed with this problem, and it's validating to see I'm not the only one who feels this pain.

axegon_•7mo ago
Not a data engineer, but my work revolves around processing a ton of data (let's call it partial data engineering). Much of the data I get is inputted by humans from different sources, platforms, and countries. My biggest pain in a nutshell: the human factor. Believe it or not, people have managed to misspell "Austria" in over 11,000 ways (accents, spaces, different encodings, alphabets, languages, null characters, and so on). Multiply that by 250-something countries, then by around 90-100 other fields that suffer from similar issues, then by 2-point-something billion rows, and you get the picture.
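
A first pass at that kind of mess might be a normalizer like this minimal sketch (standard library only; the variant spellings are made-up examples, and real data would still need a fuzzy-matching pass afterwards):

    import unicodedata

    def normalize_country(raw):
        # Collapse the cheap sources of variation: Unicode forms,
        # accents, null characters, stray whitespace, and case.
        s = unicodedata.normalize("NFKD", raw)
        s = "".join(ch for ch in s if not unicodedata.combining(ch))
        s = s.replace("\x00", "")
        return " ".join(s.split()).casefold()

    # Made-up variants; all three collapse to "austria".
    for raw in ["Austria", " AUSTRIA\x00", "Äustria"]:
        print(normalize_country(raw))
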
didgetmaster•7mo ago
I am building a new data management system that can also handle relational data well. It is really good at finding anomalies in data to help clean it up after the fact, but I wanted to find a way to prevent the errors from polluting the data set in the first place.

My solution was to enable the user to create a 'dictionary' of valid values for each column in the table. In your case you would create a list of all valid country names for that particular column. When inserting a new row, it checks to make sure the country name matches one of the values. I thought this might slow things down significantly, but testing shows I can insert millions of rows with just a minor performance hit.

The next step is to 'auto correct' error values to the closest matching one instead of just rejecting it.
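
That auto-correct step could start as small as this sketch (difflib is in the standard library; the dictionary and the 0.8 cutoff are illustrative assumptions, not the actual implementation described above):

    from difflib import get_close_matches

    VALID_COUNTRIES = ["austria", "australia", "argentina"]  # illustrative

    def validate_or_correct(value, valid=VALID_COUNTRIES, cutoff=0.8):
        # Accept exact matches; otherwise fall back to the closest
        # dictionary value, or reject if nothing is close enough.
        key = value.strip().casefold()
        if key in valid:
            return key
        match = get_close_matches(key, valid, n=1, cutoff=cutoff)
        if match:
            return match[0]
        raise ValueError(f"no close match for {value!r}")

    print(validate_or_correct("Austira"))  # -> austria

Whether to auto-apply the correction or merely surface it as a suggestion is its own product question; an overeager auto-correct can introduce errors of its own.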

axegon_•7mo ago
This isn't wildly different from what I've done, but it's the sheer volume of crap scattered across all the different fields. The countries are the least of my problems; there are other fields where I'm faced with tens of millions of different combinations, which makes the countries relatively trivial in comparison.
didgetmaster•7mo ago
My solution will catch a lot of trivial errors like simple misspellings. You could have a relational table with dozens of columns, each with its own dictionary, but that won't catch wrong combinations between columns.

For example, there is a Paris, Texas; but I doubt there is a London, Texas. Dictionaries of state names and city names would not catch someone's error of matching the wrong city with a state when entering an address.

Is this the kind of error you encounter often?

axegon_•7mo ago
Oh don't even get me started on cities, states, towns and addresses :D https://www.google.com/maps/place/95450+Us,+France/@49.10531...
revskill•7mo ago
Are you hiring?
faxmeyourcode•7mo ago
I'm going to reiterate something that has been said already: everything around auditing and peddling permissions in my mid-sized company's Snowflake account is a gigantic pain in the neck.