frontpage.

Google and Microsoft Paying Creators $500K+ to Promote AI Tools

https://www.cnbc.com/2026/02/06/google-microsoft-pay-creators-500000-and-more-to-promote-ai.html
1•belter•52s ago•0 comments

New filtration technology could be game-changer in removal of PFAS

https://www.theguardian.com/environment/2026/jan/23/pfas-forever-chemicals-filtration
1•PaulHoule•1m ago•0 comments

Show HN: I saw this cool navigation reveal, so I made a simple HTML+CSS version

https://github.com/Momciloo/fun-with-clip-path
1•momciloo•2m ago•0 comments

Kinda Surprised by Seadance2's Moderation

https://seedanceai.me/
1•ri-vai•2m ago•1 comment

I Write Games in C (yes, C)

https://jonathanwhiting.com/writing/blog/games_in_c/
1•valyala•2m ago•0 comments

Django scales. Stop blaming the framework (part 1 of 3)

https://medium.com/@tk512/django-scales-stop-blaming-the-framework-part-1-of-3-a2b5b0ff811f
1•sgt•3m ago•0 comments

Malwarebytes Is Now in ChatGPT

https://www.malwarebytes.com/blog/product/2026/02/scam-checking-just-got-easier-malwarebytes-is-n...
1•m-hodges•3m ago•0 comments

Thoughts on the job market in the age of LLMs

https://www.interconnects.ai/p/thoughts-on-the-hiring-market-in
1•gmays•3m ago•0 comments

Show HN: Stacky – certain block game clone

https://www.susmel.com/stacky/
2•Keyframe•6m ago•0 comments

AIII: A public benchmark for AI narrative and political independence

https://github.com/GRMPZQUIDOS/AIII
1•GRMPZ23•6m ago•0 comments

SectorC: A C Compiler in 512 bytes

https://xorvoid.com/sectorc.html
1•valyala•8m ago•0 comments

The API Is a Dead End; Machines Need a Labor Economy

1•bot_uid_life•9m ago•0 comments

Digital Iris [video]

https://www.youtube.com/watch?v=Kg_2MAgS_pE
1•Jyaif•10m ago•0 comments

New wave of GLP-1 drugs is coming–and they're stronger than Wegovy and Zepbound

https://www.scientificamerican.com/article/new-glp-1-weight-loss-drugs-are-coming-and-theyre-stro...
4•randycupertino•11m ago•0 comments

Convert tempo (BPM) to millisecond durations for musical note subdivisions

https://brylie.music/apps/bpm-calculator/
1•brylie•13m ago•0 comments

Show HN: Tasty A.F.

https://tastyaf.recipes/about
1•adammfrank•14m ago•0 comments

The Contagious Taste of Cancer

https://www.historytoday.com/archive/history-matters/contagious-taste-cancer
1•Thevet•16m ago•0 comments

U.S. Jobs Disappear at Fastest January Pace Since Great Recession

https://www.forbes.com/sites/mikestunson/2026/02/05/us-jobs-disappear-at-fastest-january-pace-sin...
1•alephnerd•16m ago•1 comment

Bithumb mistakenly hands out $195M in Bitcoin to users in 'Random Box' giveaway

https://koreajoongangdaily.joins.com/news/2026-02-07/business/finance/Crypto-exchange-Bithumb-mis...
1•giuliomagnifico•16m ago•0 comments

Beyond Agentic Coding

https://haskellforall.com/2026/02/beyond-agentic-coding
3•todsacerdoti•17m ago•0 comments

OpenClaw ClawHub Broken Windows Theory – If basic sorting isn't working what is?

https://www.loom.com/embed/e26a750c0c754312b032e2290630853d
1•kaicianflone•19m ago•0 comments

OpenBSD Copyright Policy

https://www.openbsd.org/policy.html
1•Panino•20m ago•0 comments

OpenClaw Creator: Why 80% of Apps Will Disappear

https://www.youtube.com/watch?v=4uzGDAoNOZc
2•schwentkerr•24m ago•0 comments

What Happens When Technical Debt Vanishes?

https://ieeexplore.ieee.org/document/11316905
2•blenderob•25m ago•0 comments

AI Is Finally Eating Software's Total Market: Here's What's Next

https://vinvashishta.substack.com/p/ai-is-finally-eating-softwares-total
3•gmays•26m ago•0 comments

Computer Science from the Bottom Up

https://www.bottomupcs.com/
2•gurjeet•26m ago•0 comments

Show HN: A toy compiler I built in high school (runs in browser)

https://vire-lang.web.app
1•xeouz•28m ago•1 comment

You don't need Mac mini to run OpenClaw

https://runclaw.sh
1•rutagandasalim•28m ago•0 comments

Learning to Reason in 13 Parameters

https://arxiv.org/abs/2602.04118
2•nicholascarolan•31m ago•0 comments

Convergent Discovery of Critical Phenomena Mathematics Across Disciplines

https://arxiv.org/abs/2601.22389
1•energyscholar•31m ago•1 comment

Analyzing Modern Nvidia GPU Cores

https://arxiv.org/abs/2503.20481
178•mfiguiere•9mo ago

Comments

winwang•9mo ago
I hope this can help dispel the misconception that GPUs are only good at linear algebra and FP arithmetic, which I've been hearing a whole lot!

Edit: learned a bunch, but the "uniform" registers and 64-bit (memory) performance are some easy standouts.

remcob•9mo ago
It's well known that GPUs are good at cryptography, starting with hash functions (e.g. crypto mining) but also zero-knowledge proofs and multi-party computation.
saagarjha•9mo ago
They're not particularly good at cryptography, but they are good at highly parallel tasks like trying a bunch of hashes.
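
A minimal sketch of that pattern in CUDA, with a toy FNV-1a hash standing in for a real one (every name here is illustrative, not a real miner or cracker): each thread hashes one candidate independently, and the only cross-thread communication is recording a match.

    // Each thread tries one nonce; *hit should be initialized to 0xffffffff.
    __device__ unsigned fnv1a(unsigned x) {
        unsigned h = 2166136261u;          // FNV offset basis
        for (int i = 0; i < 4; ++i) {      // hash the 4 bytes of x
            h ^= (x >> (8 * i)) & 0xffu;
            h *= 16777619u;                // FNV prime
        }
        return h;
    }

    __global__ void search(unsigned start, unsigned target, unsigned* hit) {
        unsigned nonce = start + blockIdx.x * blockDim.x + threadIdx.x;
        if ((fnv1a(nonce) >> 16) == target)  // "difficulty": top 16 bits
            atomicMin(hit, nonce);           // keep the smallest matching nonce
    }
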
qwertox•9mo ago
Wasn't it well known that CUDA cores are programmable cores?
winwang•9mo ago
Haha, if you're the type to toss out the phrase "well known", then yes!
YetAnotherNick•9mo ago
In a sense, GPUs are only great at matrix-matrix multiplication. For anything else you would only get 7% of the FLOPs/s compared to it (989 vs 67 TFLOP/s for H100) [1].

[1]: https://www.nvidia.com/en-in/data-center/h100/
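
A quick check of the quoted ratio:

    67 / 989 ≈ 0.068, i.e. roughly 7%

(989 TFLOP/s being the H100's dense FP16 tensor-core peak and 67 TFLOP/s its non-tensor FP32 peak, as quoted from the spec page above.)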

winwang•9mo ago
lol, I hadn't thought about it like that, true. Though of course, I meant compared to CPUs :P

I try to use tensor cores for non-obvious things every now and then. The most promising so far seems to be linear arithmetic in Datalog, but that's just matrix-vector/gemv.

harperlee•9mo ago
Could you expand the Datalog example? I'm quite interested
winwang•9mo ago
This was just a brief moment of thought over a year ago, but I can try to summarize. I was thinking about how to unify variables in certain simple Datalog settings. If we think of a clause as a vector of variables, then simple unifications can look like just a gather operation. A gather can be thought of as a matrix-vector multiplication, but that's not really useful (performance-wise). But if those variables are also in a linear equation, then it becomes possibly useful, e.g. for something like `P(x, y) :- E(x, 3x+4y)`.
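
A minimal sketch of that idea (all names hypothetical): a plain gather out[i] = vars[idx[i]] is the product S * vars, where row i of S is one-hot at column idx[i]. Attaching the linear coefficients from P(x, y) :- E(x, 3x+4y) just widens each row into a sparse gemv row with two nonzeros:

    // Unification with linear arithmetic as a 2-nonzeros-per-row gemv:
    // out[i] = 3*vars[xcol[i]] + 4*vars[ycol[i]].
    // With coefficients (1, 0) this degenerates back to a plain gather.
    __global__ void linear_unify(const float* vars,
                                 const int* xcol, const int* ycol,
                                 float* out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            out[i] = 3.0f * vars[xcol[i]] + 4.0f * vars[ycol[i]];
    }
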
cma•9mo ago
That link says "* With sparsity". For extremely sparse matrices you can get more than 989 TFLOPS on a CPU, if we're counting elided operations as TFLOPS.
YetAnotherNick•9mo ago
I am counting FP16/BF16 without sparsity, which is what the majority of AI uses.
cma•9mo ago
That checks out then. They didn't see much need for FP16 outside of that use case, so they no longer run it at double the FP32 rate outside of tensor cores (unless I'm mixing that up with AMD).

Other forms of sparsity are heavily used at training time now, like block compression in DeepSeek.

randomgermanguy•9mo ago
Funnily, they're far from optimal for GEMM ops (especially in terms of power consumption).

For GEMM you need to visit each row/vec n times, so there's a bunch of data reuse going on, which isn't ideal for GPUs since you can't keep all of that so close to your processing units. And while the tensor cores kinda implement this, I think they don't quite scale up to a full-sized systolic array, which is what you would want for larger matrix multiplications.

Also, a simpler view: with GPUs, most of the silicon is spent on things that are NOT tensor cores, so just from that you know it's not optimal, I guess.

Just referring to that FLOP/s number doesn't really mean much nowadays with tensor cores and sparsity.

In my eyes the big win of GPUs is that not only are they pretty good at GEMMs, they're also really good at a lot of other easily parallelizable tasks, PLUS they're comparatively easy to program ^^
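
To make the data-reuse point concrete, here is the standard shared-memory tiling sketch (square matrices, n divisible by TILE, no bounds checks, for brevity): each tile of A and B is loaded from global memory once and then read TILE times from on-chip memory, cutting global traffic by a factor of TILE.

    #define TILE 16

    __global__ void gemm_tiled(const float* A, const float* B, float* C, int n) {
        __shared__ float As[TILE][TILE], Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;
        for (int t = 0; t < n / TILE; ++t) {
            // one cooperative load per tile...
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();
            // ...then TILE reads of each loaded element from shared memory
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * n + col] = acc;
    }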

dist-epoch•9mo ago
From your comment I've learned that you never did GPU graphics programming :)

"Uniform registers" have existed for about 20 years now.

rnrn•9mo ago
I think you are confusing uniform registers with the uniform keyword in RSL / GLSL / HLSL?

Maybe some vendors have had an equivalent to uniform registers for 20 years, but per the article's references they are new to Nvidia GPUs as of Turing (2018).

dist-epoch•9mo ago
They are the same thing. The uniform keyword in shading languages is implemented using the uniform registers.

I don't know what Nvidia did in 2018; maybe they opened up access to the uniform registers to CUDA code.

I made Grok research this topic:

> In conclusion, research strongly suggests that the "uniform" keyword in GLSL is implemented in hardware using NVIDIA's "uniform registers," as evidenced by NVIDIA's own documentation on the Turing architecture and historical practices of mapping uniforms to constant registers. While explicit links can be limited due to proprietary details, the combination of technical presentations, community discussions, and historical context supports this connection. The uniform register file, with its capacity and usage in shader instructions, aligns with GLSL's uniform functionality, ensuring efficient data access during shader execution.

https://grok.com/share/c2hhcmQtMg%3D%3D_358362f3-21e2-4fe0-a...

Firadeoclus•9mo ago
While you can use uniform registers to implement the uniform keyword from shading languages, the two are not the same. Uniform registers are not constants, and they are only uniform/shared across one warp. Nvidia architectures before Turing did not have uniform registers.
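
Uniform registers aren't directly addressable from CUDA C++; on Turing and later the compiler can place values it proves identical across a warp into them. A rough illustration of which values are candidates, using an ordinary saxpy kernel:

    __global__ void saxpy(const float* x, float* y, float a, int n) {
        // 'a', 'n', and the base pointers are identical for every thread
        // in a warp: candidates for the uniform register file / datapath.
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // per-thread value:
        if (i < n)                                      // needs a regular
            y[i] = a * x[i] + y[i];                     // vector register
    }
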
pjc50•9mo ago
I didn't get that at all - to me this looks like a very smart investigation into instruction latency and the precise mechanics of out-of-order execution (no reference made to speculative or branch prediction, though?) without looking at what the instructions do in detail.

GPUs can certainly do bulk integer arithmetic but most use cases prefer FP. Maybe for DSP fixed-point is ideal.

gmays•9mo ago
The special sauce:

> "GPUs leverage hardware-compiler techniques where the compiler guides hardware during execution."

kookamamie•9mo ago
> NVIDIA RTX A6000

Unfortunately that's already two generations behind the latest GPUs. These came after the A6000: 6000 Ada, Pro 6000.

flowerthoughts•9mo ago
It's a major step forward compared to 2006.

A6000 was released in 2020: https://www.techpowerup.com/gpu-specs/rtx-a6000.c3686

KeplerBoy•9mo ago
Nvidia's Quadro naming scheme really is bad these days, isn't it?

I bet there are plenty of papers out there claiming to have used an RTX 6000 instead of an RTX 6000 Ada generation.

kookamamie•9mo ago
The naming scheme is horrible, to be quite frank.

To understand this, consider these names in the order of release time: Quadro RTX 6000, RTX A6000, RTX 6000 Ada, RTX Pro 6000, RTX Pro 6000 Max-Q.

pjmlp•9mo ago
Still better than what most folks have access to.

I bet I can do more CUDA with my lame GeForce MX 150 from 2017 than most people can manage with ROCm, and that is how Nvidia keeps being ahead.

kookamamie•9mo ago
Yeah, kind of. I have a 6000 Ada and a 5090 here.
pjmlp•9mo ago
On a laptop?

Because that is part of my point, that is a laptop GPU.

kookamamie•9mo ago
Oh no, these are high-end desktops. You're right, laptops have completely different profiles for these things.
gitroom•9mo ago
Haha honestly I always thought GPUs were mostly number crunchers, but there's way more under the hood than I realized. Wondering now if anyone really gets the full potential of these cores, or if we're all just scratching the surface most days?
Dlemo•9mo ago
There are very good performance tools for GPUs.

I don't think GPU utilization is a real bottleneck in most cases.

dist-epoch•9mo ago
Yet DeepSeek managed to get huge improvements by optimizing GPU code.
nabla9•9mo ago
>Overall, we can conclude that GPUs are hardware-compiler codesign where the compiler guides the hardware in handling dependencies and introduces hints that can improve performance and energy.

New architectures rely on the compiler to handle register data dependencies and to control the register file cache allocation policy.

dist-epoch•9mo ago
This is an age-old idea: RISC compilers were supposed to do this too, the mythical "sufficiently smart compiler".

https://wiki.c2.com/?SufficientlySmartCompiler

MindSpunk•9mo ago
It's not so much about having a "sufficiently smart compiler" in the case of GPUs doing compiler-assisted scheduling. It's about not having to implement that logic in hardware at all. The more smarts they push into the core hardware, the more silicon each core needs, the fewer cores you can fit, and the more power you spend on figuring out what to run rather than crunching numbers.

Doing the work in the compiler may produce less optimal scheduling than what is theoretically possible, but with the number of "cores" in a GPU you would spend a lot of power doing it in hardware for each one.

nabla9•9mo ago
RISC works well with compilers (ARM, RISC-V); they don't require mythical compilers, just standard good ones.

You are probably thinking of VLIW, like Intel's Itanium and Transmeta. Those architectures required a really smart compiler for scheduling, and it was a bust.

Nvidia GPUs need a smart compiler, and it works because the task is limited to optimizing numerical pipelines that are 99% matrix multiplications and dot products. The data movement is more predictable: compilers know how the data will be used and know how to schedule.

peterfirefly•9mo ago
MIPS originally had no interlocked pipeline stages; it was the compiler's job to work around that. MIPS -- and many other RISCs -- had a branch delay slot; it was the compiler's job to try to do something useful with that. RISCs stayed in-order for a long time -- it was the compiler's job to schedule the instructions in a way that compensated as well as possible for that.

GPUs rely on fairly smart compilers -- but they also hide latency (memory access) by switching hw threads (a bit like the barrel processors of yore).
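
A programmer's-eye sketch of that latency hiding: a memory-bound grid-stride kernel launched with many more warps than the SMs can issue from at once. When one warp stalls on a global load, the scheduler issues from another resident warp at zero cost, so the load latency stays hidden as long as enough warps are in flight.

    __global__ void scale(const float* in, float* out, float s, size_t n) {
        size_t stride = (size_t)gridDim.x * blockDim.x;
        for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
             i < n; i += stride)
            out[i] = s * in[i];  // each iteration stalls on the load of in[i]
    }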

pjc50•9mo ago
> New architectures

[citation needed] - which architectures?

nabla9•9mo ago
Citation is the paper we are discussing. It also mentions the architectures.

There is so much stuff you miss when you don't follow the links ;)