frontpage.

Deep dive on Nvidia circular funding

https://philippeoger.com/pages/deep-dive-into-nvidias-virtuous-cycle
118•jeanloolz•1h ago•54 comments

Jepsen: NATS 2.12.1

https://jepsen.io/analyses/nats-2.12.1
88•aphyr•1h ago•13 comments

Strong earthquake hits northern Japan, tsunami warning issued

https://www3.nhk.or.jp/nhkworld/en/news/20251209_02/
176•lattis•5h ago•88 comments

AMD GPU Debugger

https://thegeeko.me/blog/amd-gpu-debugging/
149•ibobev•4h ago•19 comments

Let's put Tailscale on a jailbroken Kindle

https://tailscale.com/blog/tailscale-jailbroken-kindle
123•Quizzical4230•3h ago•32 comments

Hunting for North Korean Fiber Optic Cables

https://nkinternet.com/2025/12/08/hunting-for-north-korean-fiber-optic-cables/
137•Bezod•3h ago•12 comments

Launch HN: Nia (YC S25) – Give better context to coding agents

https://www.trynia.ai/
57•jellyotsiro•3h ago•46 comments

IBM to acquire Confluent

https://www.confluent.io/blog/ibm-to-acquire-confluent/
251•abd12•6h ago•204 comments

Has the cost of building software just dropped 90%?

https://martinalderson.com/posts/has-the-cost-of-software-just-dropped-90-percent/
23•martinald•1h ago•33 comments

A series of tricks and techniques I learned doing tiny GLSL demos

https://blog.pkh.me/p/48-a-series-of-tricks-and-techniques-i-learned-doing-tiny-glsl-demos.html
61•ibobev•3h ago•3 comments

We collected 10k hours of neuro-language data in our basement

https://condu.it/thought/10k-hours
38•nee1r•2h ago•33 comments

Microsoft Download Center Archive

https://legacyupdate.net/download-center/
44•luu•3d ago•3 comments

Legion Health (YC S21) is hiring a founding engineer (SF, in-person)

1•the_danny_g•3h ago

Show HN: DuckDB for Kafka Stream Processing

https://sql-flow.com/docs/tutorials/intro/
30•dm03514•2h ago•10 comments

Flow: Actor-based language for C++, used by FoundationDB

https://github.com/apple/foundationdb/tree/main/flow
142•SchwKatze•7h ago•37 comments

Quanta to publish popular math and physics books by Terence Tao and David Tong

https://www.simonsfoundation.org/2025/12/08/quanta-books-to-publish-popular-math-and-physics-titl...
73•digital55•2h ago•15 comments

Paramount launches hostile bid for Warner Bros

https://www.cnbc.com/2025/12/08/paramount-skydance-hostile-bid-wbd-netflix.html
117•gniting•6h ago•97 comments

GitHub Actions has a package manager, and it might be the worst

https://nesbitt.io/2025/12/06/github-actions-package-manager.html
318•robin_reala•12h ago•203 comments

Nova Programming Language

https://nova-lang.net
60•surprisetalk•5h ago•32 comments

Google confirms Android attacks; no fix for most Samsung users

https://www.forbes.com/sites/zakdoffman/2025/12/08/google-confirms-android-attacks-no-fix-for-mos...
81•mohi-kalantari•3h ago•68 comments

Uber is turning data about trips and takeout into insights for marketers

https://www.businessinsider.com/uber-ads-launches-intelligence-insights-trips-takeout-data-market...
204•sethops1•5h ago•194 comments

Colors of Growth

https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5804462
46•mhb•7h ago•16 comments

I successfully recreated the 1996 Space Jam website with Claude

https://theahura.substack.com/p/i-successfully-recreated-the-1996
78•theahura•4h ago•70 comments

The "confident idiot" problem: Why AI needs hard rules, not vibe checks

https://steerlabs.substack.com/p/confident-idiot-problem
268•steerlabs•3d ago•310 comments

Microsoft has a problem: lack of demand for its AI products

https://www.windowscentral.com/artificial-intelligence/microsoft-has-a-problem-nobody-wants-to-bu...
316•mohi-kalantari•3h ago•267 comments

Jujutsu worktrees are convenient (2024)

https://shaddy.dev/notes/jj-worktrees/
113•nvader•4d ago•84 comments

Emacs is my new window manager (2015)

https://www.howardism.org/Technical/Emacs/new-window-manager.html
205•gpi•3d ago•78 comments

Show HN: Persistent memory for Claude Code sessions

https://github.com/TonyStef/Grov
7•tonyystef•6d ago•4 comments

Cancer Is Surging, Bringing a Debate About Whether to Look for It

https://www.nytimes.com/2025/12/08/health/cancer-young-people-deaths.html
10•brandonb•47m ago•1 comment

Twelve Days of Shell

https://12days.cmdchallenge.com
215•zoidb•10h ago•72 comments

We collected 10k hours of neuro-language data in our basement

https://condu.it/thought/10k-hours
37•nee1r•2h ago

Comments

ArjunPanicksser•2h ago
Makes sense that CL ends up being the best for recruiting first-time participants. Curious what other things you tried for recruitment and how useful they were?
n7ck•2h ago
The second most useful by far is Indeed, where we post an internship opportunity for participants interested in doing 10 sessions over 10 weeks. Other things that work pretty well are asking professors to send out emails to students at local universities, putting up ~300-500 fliers (mostly around universities and public transit), and posting on Nextdoor. We also texted a lot of group chats, posted on LinkedIn, and gave out fliers and the signup link to pretty much everyone we talked to in cafes and similar. We take on some participants as ambassadors as well, and pay them to refer their friends.

We tried google/facebook/instagram ads, and we tried paying for some video placements. Basically none of the explicit advertisement worked at all and it wasn't worth the money. Though for what it's worth, none of us are experts in advertising, so we might have been going about it wrong -- we didn't put loads of effort into iterating once we realized it wasn't working.

mishajw•2h ago
Interesting dataset! I'm curious what kind of results you would get with just EEG, compared to multiple modalities? Why do multiple modalities end up being important?
n7ck•2h ago
EEG has very good temporal resolution but quite bad spatial resolution, and other modalities have different tradeoffs.
g413n•2h ago
What's the basis for converting between hours of neural data and number of tokens? Is that counting the paired text tokens?
rio-popper•2h ago
Edit: oops, sorry, misread - the neural data is tokenised by our embedding model. The number of tokens per second of neural data varies and depends on the information content.
n7ck•2h ago
Hey I'm Nick, and I originally came to Conduit as a data participant! After my session, I started asking questions about the setup to the people working there, and apparently I asked good questions, so they hired me.

Since I joined, we've gone from <1k hours to >10k hours, and I've been really excited by how much our whole setup has changed. I've been implementing lots of improvements to the whole data pipeline and the operations side. Now that we train lots of models on the data, the model results also inform how we collect data (e.g. we care a lot less about noise now that we have more data).

We're definitely still improving the whole system, but at this point, we've learned a lot that I wish someone had told us when we started, so we thought we'd share it in case any of you are doing human data collection. We're all also very curious to get any feedback from the community!

Gormisdomai•2h ago
The example sentences generated “only from neural data” at the top of this article seem surprisingly accurate to me, like, not exact matches but much better than what I would expect even from 10k hours:

“the room seemed colder” -> “there was a breeze even a gentle gust”

ninapanickssery•2h ago
Yeah, agreed
ag8•2h ago
This is a cool setup, but naively it feels like it would require hundreds of thousands of hours of data to train a decent generalizable model that would be useful for consumers. Are there plans to scale this up, or is there reason to believe that tens of thousands of hours are enough?
n7ck•1h ago
Yeah, I think the way we trained the embedding model focused a lot on making it as efficient as possible, since it's such a data-limited regime. So based on (early) scaling results, I think it'll be closer to 50-70k hours, which we should be able to get in the next few months now that we've already scaled up a lot.

That said, the way to 10-20x data collection would be to open a couple of other data collection centers outside SF, in high-population cities. Right now, there's a big advantage in having the data collection totally in-house, because it's much easier to debug and improve while we're so small. But now that we've mostly worked out the process, it should be very straightforward to replicate the entire ops/data pipeline in 3-4 parallel data collection centers.
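[Editor's note: the "50-70k hours" estimate above comes from extrapolating scaling results. A minimal sketch of that kind of extrapolation, fitting a power law to error vs. dataset size and inverting it for a target error; all numbers here are made up for illustration, not Conduit's actual figures.]

```python
import math

def fit_power_law(hours, errors):
    """Fit error = a * hours^(-b) by least squares in log-log space."""
    n = len(hours)
    xs = [math.log(h) for h in hours]
    ys = [math.log(e) for e in errors]
    mx = sum(xs) / n
    my = sum(ys) / n
    # Slope of the log-log fit is -b
    b = -sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = math.exp(my + b * mx)
    return a, b

def hours_for_target(a, b, target_error):
    """Invert error = a * hours^(-b) to get the required hours."""
    return (a / target_error) ** (1 / b)

# Hypothetical checkpoints: decode error observed at 1k, 3k, 10k hours
a, b = fit_power_law([1_000, 3_000, 10_000], [0.52, 0.41, 0.33])
print(f"hours to reach 0.25 error: {hours_for_target(a, b, 0.25):,.0f}")
```

With these invented checkpoints the extrapolation lands in the tens of thousands of hours, which is the same shape of argument as the comment's 50-70k estimate.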

richardfeynman•1h ago
This is an interesting dataset to collect, and I wonder whether there will be applications for it beyond what you're currently thinking.

A couple of questions: What's the relationship between the number of hours of neurodata you collect and the quality of your predictions? Does it help to get less data from more people, or more data from fewer people?

n7ck•1h ago
1. The predictions get better with more data - and we don't seem to be anywhere near diminishing returns.
2. The thing we care about is generalization between people. For this, less data from more people is much better.
richardfeynman•1h ago
I noticed you tracked sessions per person, implying a subset of people have many hours of data collected on them. Are predictions for this subset better than the median?

For a given amount of data, is it better to have more people with less data per person or fewer people with more data per person?

clemvonstengel•40m ago
Yes, the predictions are much better for people with more hours of data in the training set. Usually, we just totally separate the train and val set, so no individual with any sessions in the train set is ever used for evals. When we instead evaluate on someone with 10+ hours in the train set, predictions get ~20-25% better.

For a given amount of data, whether you want more or less data per person really depends on what you're trying to do. The thing we want is for it to be good at zero-shot, that is, for it to decode well on people who have zero hours in the train set. So for that, we want less data per person. If instead we wanted to make it do as well as possible on one individual, then we'd want way more data from that one person. (So, e.g., when we make it into a product at first, we'll probably finetune on each user for a while)
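[Editor's note: the train/val separation described above is a group-level split, where every session from a participant lands entirely on one side so that val participants are truly zero-shot. A minimal sketch; the field names are illustrative, not Conduit's actual schema.]

```python
import random

def split_by_participant(sessions, val_fraction=0.1, seed=0):
    """Split sessions so no participant appears in both train and val.

    sessions: list of dicts, each with a 'participant_id' key.
    """
    ids = sorted({s["participant_id"] for s in sessions})
    rng = random.Random(seed)
    rng.shuffle(ids)
    n_val = max(1, int(len(ids) * val_fraction))
    val_ids = set(ids[:n_val])
    train = [s for s in sessions if s["participant_id"] not in val_ids]
    val = [s for s in sessions if s["participant_id"] in val_ids]
    return train, val

# 100 fake sessions spread across 20 participants
sessions = [{"participant_id": f"p{i % 20}", "hours": 1.5} for i in range(100)]
train, val = split_by_participant(sessions)
# No participant appears on both sides of the split
assert not ({s["participant_id"] for s in train} & {s["participant_id"] for s in val})
```

The alternative evaluation mentioned in the comment (individuals with 10+ hours in train) would simply move those participants' sessions out of `val_ids` before splitting.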

richardfeynman•33m ago
Makes a ton of sense, thanks.

I wonder if there will be medical applications for this tech, for example identifying people with brain or neurological disorders based on how different their "neural imaging" looks from normal.

wiwillia•1h ago
Really interested in how accuracy improves with the scale of the data set. Non-invasive thought-to-action would be a whole new interaction paradigm.
devanshp•1h ago
Cool post! I'm somewhat curious whether the data quality scoring has actually translated into better data; do you have numbers on how much more of your data is useful for training vs in May?
rio-popper•1h ago
The real-time neural-quality checking was the most important thing here. Before we rewrote the backend, between 58-64% of participant hours were actually usable data. Now, it's between 90-95%.

If you mean the text quality scoring system, then when we added that, it improved the amount of text we got per hour of neural data by between 30-35%. (That includes the fact that we filter which participants we have return based on their text quality scores)
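[Editor's note: a back-of-the-envelope sketch of what those usable-fraction numbers mean for effective dataset size, using the midpoints of the ranges quoted above.]

```python
def effective_hours(collected_hours, usable_fraction):
    """Hours of collected data that actually end up usable for training."""
    return collected_hours * usable_fraction

before = effective_hours(1_000, 0.61)   # ~58-64% usable pre-rewrite
after = effective_hours(1_000, 0.925)   # ~90-95% usable post-rewrite
print(f"per 1k collected hours: {before:.0f} -> {after:.0f} usable")
```

That is roughly a 1.5x gain in usable data per collected hour, before counting the separate 30-35% gain in text per neural hour from the text-quality scoring.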

rajlego•1h ago
Did you consider trying to collect data in a much poorer country that still has high quality English? e.g. the Philippines
rio-popper•1h ago
Yeah we did consider this. For now, there's an advantage to having the data collection in the same building as the whole eng team, but once we hire a couple more engs, I expect we'll just replicate the collection setup in other countries as well
estitesc•1h ago
Loved watching this unfold in our basement. : )
dang•1h ago
[under-the-rug stub]

[see https://news.ycombinator.com/item?id=45988611 for explanation]

ClaireBookworm•2h ago
Yoo this is sick!! sometimes it might actually just be a data game, so huge props to them for actually collecting all that high-quality data
ninapanickssery•2h ago
This is very cool, thanks for writing about your setup in such detail! It’s impressive that you can predict stuff from this noninvasive data. Are there similar existing datasets, or is this the first of its kind?
cpeterson42•1h ago
Wild world we live in
titzer•57m ago
I lol'd at the hardware "patch" that kept the software from crashing--removing all but the alpha-numeric keys (!?). Holy cow, you had time to collect thousands of hours of neurotraces but couldn't sanitize the inputs to remove a stray [? That sounds...funky.
NoraCodes•55m ago
Presumably it's more like an errant Ctrl-C.
clemvonstengel•40m ago
Yup exactly this. Also Ctrl-W, alt tab, etc.
in-silico•30m ago
It's interesting that the model generalizes to unseen participants. I was under the impression that everyone's brain patterns were different enough that the model would need to be retrained for new users.

Though, I suppose if the model had LLM-like context where it kept track of brain data and speech/typing from earlier in the conversation then it could perform in-context learning to adapt to the user.

clemvonstengel•19m ago
Basically correct intuition: the model does much better when we give it, e.g., 30 secs of neural data in the leadup instead of e.g. 5 secs. My sense is also that it's learning in context, so people's neural patterns are quite different but there's a higher-level generator that lets the model learn in context (or probably multiple higher-level patterns, each of which the model can learn from in context).

We only got any generalization to new users after we had >500 individuals in the dataset, fwiw. There are some interesting MRI studies finding a similar thing: once you have enough individuals in the dataset, you start seeing generalization.
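[Editor's note: the 30s-vs-5s comparison above amounts to slicing different amounts of leading neural data before the segment being decoded. A trivial sketch of that windowing; the sample rate and the idea of working on a flat sample array are assumptions for illustration.]

```python
def context_window(signal, decode_start, context_secs, sample_rate=256):
    """Return the `context_secs` of samples immediately preceding `decode_start`."""
    n = int(context_secs * sample_rate)
    start = max(0, decode_start - n)
    return signal[start:decode_start]

signal = list(range(60 * 256))            # one minute of fake samples at 256 Hz
short = context_window(signal, 40 * 256, 5)
long = context_window(signal, 40 * 256, 30)
assert len(short) == 5 * 256 and len(long) == 30 * 256
```

The comment's claim is that the model decodes noticeably better when fed the longer window, consistent with in-context adaptation to the individual.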

asgraham•12m ago
Really cool dataset! Love seeing people actually doing the hard work of generating data rather than just trying to analyze what exists (I say this as someone who’s gone out of his way to avoid data collection).

Have you played at all with thought-to-voice? Intuitively I’d think EEG readout would be more reliable for spoken rather than typed words, especially if you’re not controlling for keyboard fluency.