
Atlas: Manage your database schema as code

https://github.com/ariga/atlas
1•quectophoton•2m ago•0 comments

Geist Pixel

https://vercel.com/blog/introducing-geist-pixel
1•helloplanets•5m ago•0 comments

Show HN: MCP to get latest dependency package and tool versions

https://github.com/MShekow/package-version-check-mcp
1•mshekow•13m ago•0 comments

The better you get at something, the harder it becomes to do

https://seekingtrust.substack.com/p/improving-at-writing-made-me-almost
2•FinnLobsien•14m ago•0 comments

Show HN: WP Float – Archive WordPress blogs to free static hosting

https://wpfloat.netlify.app/
1•zizoulegrande•16m ago•0 comments

Show HN: I Hacked My Family's Meal Planning with an App

https://mealjar.app
1•melvinzammit•16m ago•0 comments

Sony BMG copy protection rootkit scandal

https://en.wikipedia.org/wiki/Sony_BMG_copy_protection_rootkit_scandal
1•basilikum•19m ago•0 comments

The Future of Systems

https://novlabs.ai/mission/
2•tekbog•19m ago•1 comments

NASA now allowing astronauts to bring their smartphones on space missions

https://twitter.com/NASAAdmin/status/2019259382962307393
2•gbugniot•24m ago•0 comments

Claude Code Is the Inflection Point

https://newsletter.semianalysis.com/p/claude-code-is-the-inflection-point
3•throwaw12•25m ago•1 comments

Show HN: MicroClaw – Agentic AI Assistant for Telegram, Built in Rust

https://github.com/microclaw/microclaw
1•everettjf•26m ago•2 comments

Show HN: Omni-BLAS – 4x faster matrix multiplication via Monte Carlo sampling

https://github.com/AleatorAI/OMNI-BLAS
1•LowSpecEng•26m ago•1 comments

The AI-Ready Software Developer: Conclusion – Same Game, Different Dice

https://codemanship.wordpress.com/2026/01/05/the-ai-ready-software-developer-conclusion-same-game...
1•lifeisstillgood•28m ago•0 comments

AI Agent Automates Google Stock Analysis from Financial Reports

https://pardusai.org/view/54c6646b9e273bbe103b76256a91a7f30da624062a8a6eeb16febfe403efd078
1•JasonHEIN•32m ago•0 comments

Voxtral Realtime 4B Pure C Implementation

https://github.com/antirez/voxtral.c
2•andreabat•34m ago•1 comments

I Was Trapped in Chinese Mafia Crypto Slavery [video]

https://www.youtube.com/watch?v=zOcNaWmmn0A
2•mgh2•40m ago•0 comments

U.S. CBP Reported Employee Arrests (FY2020 – FYTD)

https://www.cbp.gov/newsroom/stats/reported-employee-arrests
1•ludicrousdispla•42m ago•0 comments

Show HN: I built a free UCP checker – see if AI agents can find your store

https://ucphub.ai/ucp-store-check/
2•vladeta•47m ago•1 comments

Show HN: SVGV – A Real-Time Vector Video Format for Budget Hardware

https://github.com/thealidev/VectorVision-SVGV
1•thealidev•49m ago•0 comments

Study of 150 developers shows AI generated code no harder to maintain long term

https://www.youtube.com/watch?v=b9EbCb5A408
1•lifeisstillgood•49m ago•0 comments

Spotify now requires premium accounts for developer mode API access

https://www.neowin.net/news/spotify-now-requires-premium-accounts-for-developer-mode-api-access/
1•bundie•52m ago•0 comments

When Albert Einstein Moved to Princeton

https://twitter.com/Math_files/status/2020017485815456224
1•keepamovin•53m ago•0 comments

Agents.md as a Dark Signal

https://joshmock.com/post/2026-agents-md-as-a-dark-signal/
2•birdculture•55m ago•0 comments

System time, clocks, and their syncing in macOS

https://eclecticlight.co/2025/05/21/system-time-clocks-and-their-syncing-in-macos/
1•fanf2•56m ago•0 comments

McCLIM and 7GUIs – Part 1: The Counter

https://turtleware.eu/posts/McCLIM-and-7GUIs---Part-1-The-Counter.html
2•ramenbytes•59m ago•0 comments

So whats the next word, then? Almost-no-math intro to transformer models

https://matthias-kainer.de/blog/posts/so-whats-the-next-word-then-/
1•oesimania•1h ago•0 comments

Ed Zitron: The Hater's Guide to Microsoft

https://bsky.app/profile/edzitron.com/post/3me7ibeym2c2n
2•vintagedave•1h ago•1 comments

UK infants ill after drinking contaminated baby formula of Nestle and Danone

https://www.bbc.com/news/articles/c931rxnwn3lo
1•__natty__•1h ago•0 comments

Show HN: Android-based audio player for seniors – Homer Audio Player

https://homeraudioplayer.app
3•cinusek•1h ago•2 comments

Starter Template for Ory Kratos

https://github.com/Samuelk0nrad/docker-ory
1•samuel_0xK•1h ago•0 comments

Myths Programmers Believe about CPU Caches (2018)

https://software.rajivprab.com/2018/04/29/myths-programmers-believe-about-cpu-caches/
156•whack•3mo ago

Comments

yohbho•3mo ago
2018, discussed on HN in 2023: https://news.ycombinator.com/item?id=36333034

  Myths Programmers Believe about CPU Caches (2018) (rajivprab.com)
  176 points by whack on June 14, 2023 | 138 comments
breppp•3mo ago
Very interesting. I am hardly an expert, but this gives the impression that if it weren't for that meddling software, we would all live in a synchronous world.

This ignores store buffers, and consequently memory fencing, which is the basis for the nightmarish std::memory_order, the worst API documentation you will ever meet.

jeffbee•3mo ago
If the committee had had any good taste at all, they would have thrown DEC AXP overboard before C++11, which would have cut the majority of the words out of the specification for std::memory_order. It was only the obsolete and nearly impossible-to-program Alpha CPU that required the ultimate level of complexity.
gpderetta•3mo ago
If Alpha never existed, memory_order_consume would still exist. But we have done this discussion before.
QuaternionsBhop•3mo ago
Since the CPU does cache coherency transparently, perhaps there should be some way for an application to promise that it is well-behaved in exchange for access to a lower-level, non-transparent instruction set that manages cache coherency manually from the application level. Or perhaps applications can never be trusted with that level of control over the hardware. The MESI model reminded me of Rust's ownership and borrowing. The same pattern appears in OpenGL vs. Vulkan drivers: implicit sync vs. explicit sync. Yet another example is the cache-management work involved in squeezing maximum throughput out of CUDA on an enterprise GPU.
cpgxiii•3mo ago
There are some knobs that newer processors give for cache control, mostly to partition or reserve cache space to improve security or reduce cache contention between processes.

Actual manual cache management is far too much of an implementation detail for a general-purpose CPU to expose; doing so would deeply tie code to a specific set of processor behaviors. Cache sizes and even hierarchies change often between processor generations, and some internal cache behavior has changed within a generation as a result of microcode updates and/or hardware steppings. Exposed cache control would be like MIPS exposing delay slots, but much worse: stale delay-slot assumptions really only turn into performance issues, whereas stale cache-control assumptions would easily turn into correctness issues.

Really the only way to make this work is for the final compilation/"specialization" step to occur on the specific device in question, like with a processor using binary translation (e.g. Transmeta, Nvidia Denver) or specialization (e.g. Mill) or a system that effectively enforces runtime compilation (e.g. runtime shader/program compilation in OpenGL and OpenCL).

groundzeros2015•3mo ago
Did this article share more than one myth?

The reason programmers don't believe in cache coherency is that they have experienced a closely related phenomenon: memory reordering. This requires you to use a memory fence when accessing a value shared between multiple cores, as if the caches needed to be synchronized.

Lwerewolf•3mo ago
I'm pretty sure that most cases of x86 reordering issues are a matter of the compiler reordering things, which isn't (AFAIK) solved with just "volatile". Caveat: I haven't dealt with this (multicore sync without OS primitives in general) for over a year.
nly•3mo ago
x86 has a Total Store Order (TSO) memory model, which effectively means (in a mental model where only 1 shared memory operation happens at once and completes before the next) stores are queued but loads can be executed immediately even if stores are queued in the store buffer.

On a single core a load can be served from the store buffer (queue), but other cores can't see those stores yet, which is where all the inconsistencies come from.

fer•3mo ago
The volatile keyword in Java exists precisely to make the Java compiler insert memory barriers. That only guarantees consistent reads and writes; it doesn't make compound operations thread-safe (i.e. write-after-read), which is what atomics are for.

Also, it's not only compilers that reorder things; most processors nowadays do out-of-order execution. Even if the order produced by the compiler is perfect in theory, different latencies for different instruction operands may cause later instructions to execute earlier so the CPU doesn't stall.

zozbot234•3mo ago
Note that this is only true of the Java/C# volatile keyword. The volatile keyword in C/C++ is solely about direct access in hardware to memory-mapped locations, for such purposes as controlling external devices; it is entirely unrelated to the C11 memory model for concurrency, does not provide the same guarantees, and should never be used for that purpose.
igtztorrero•3mo ago
In Go there are sync.Mutex, sync/atomic, and channels to create this fence and prevent data races. I prefer sync.Mutex.

Does anyone understand how Go handles the CPU cache?

groundzeros2015•3mo ago
Yes. Locks will use a memory fence. More advanced programs will need fence without locking.
daemontus•3mo ago
I may be completely out of line here, but isn't the story on ARM very, very different? I vaguely recall that the whole point of having things like weak atomics is that on x86 they don't do anything extra, but on ARM they are essential for memory ordering? But then again, I may just be conflating memory ordering and coherency.
jeffbee•3mo ago
Well, since this is a thread about how programmers use the wrong words to model how they think a CPU cache works, I think it bears mentioning that you've used "atomics" here to mean something irrelevant. It is not true that x86 atomics do nothing. Atomic instructions or, on x86, their prefix, make a naturally non-atomic operation such as a read-modify-write atomic. The ARM ISA actually lacked such a facility until ARMv8.1.

The instructions to which you refer are not atomics, but rather instructions that influence the ordering of loads and stores. x86 has total store ordering by design. On ARM, the program has to use LDAR/STLR to establish ordering.

phire•3mo ago
Everything it says about cache coherency is exactly the same on ARM.

Memory ordering has nothing to do with cache coherency; it's all about what happens inside the CPU pipeline. On ARM, reads and writes can be reordered within the pipeline, before they ever hit the caches (which are still fully coherent).

ARM still has strict memory ordering for code within a single core (some older processors do not), but the writes from one core might become visible to other cores in the wrong order.

gpderetta•3mo ago
you are getting downvoted, but you are of course correct.
gpderetta•3mo ago
Normally when talking about relaxed memory models, full cache coherency is still assumed. For example the C++11 memory model cannot be implemented on a non-cache-coherent system, at least not without massive performance penalties.
ashvardanian•3mo ago
Here's my favorite practically applicable cache-related fact: even on x86, on recent server CPUs, cache-coherency protocols may operate at a different granularity than the cache line size. A typical case with new Intel server CPUs is operating at the granularity of 2 consecutive cache lines. Some thread-pool implementations, like Crossbeam in Rust and my ForkUnion in Rust and C++, explicitly document that and align objects to 128 bytes [1]:

  /**
   *  @brief Defines variable alignment to avoid false sharing.
   *  @see https://en.cppreference.com/w/cpp/thread/hardware_destructive_interference_size
   *  @see https://docs.rs/crossbeam-utils/latest/crossbeam_utils/struct.CachePadded.html
   *
   *  The C++ STL way to do it is to use `std::hardware_destructive_interference_size` if available:
   *
   *  @code{.cpp}
   *  #if defined(__cpp_lib_hardware_interference_size)
   *  static constexpr std::size_t default_alignment_k = std::hardware_destructive_interference_size;
   *  #else
   *  static constexpr std::size_t default_alignment_k = alignof(std::max_align_t);
   *  #endif
   *  @endcode
   *
   *  That however results into all kinds of ABI warnings with GCC, and suboptimal alignment choice,
   *  unless you hard-code `--param hardware_destructive_interference_size=64` or disable the warning
   *  with `-Wno-interference-size`.
   */
  static constexpr std::size_t default_alignment_k = 128;
As mentioned in the docstring above, using STL's `std::hardware_destructive_interference_size` won't help you. On ARM, this issue becomes even more pronounced, so concurrency-heavy code should ideally be compiled multiple times for different coherence protocols and leverage "dynamic dispatch", similar to how I & others handle SIMD instructions in libraries that need to run on a very diverse set of platforms.

[1] https://github.com/ashvardanian/ForkUnion/blob/46666f6347ece...

Sesse__•3mo ago
This makes attempts of cargo-culting __attribute__((aligned(64))) without benchmarking even more hilarious. :-)
rnrn•3mo ago
It’s not a cargo cult if the actions directly cause cargo to arrive based on well understood mechanics.

Regardless of whether it would be better in some situations to align to 128 bytes, 64 bytes really is the cache line size on all common x86 cpus and it is a good idea to avoid threads modifying the same cacheline.

Sesse__•3mo ago
It indeed isn't, but I've seen my share of systems where nobody checked if cargo arrived. (The code was checked in without any benchmarks done, and after many years, it was found that the macros used were effectively no-ops :-) )
rnrn•3mo ago
> even on x86 on recent server CPUs, cache-coherency protocols may be operating at a different granularity than the cache line size. A typical case with new Intel server CPUs is operating at the granularity of 2 consecutive cache lines

I don’t think it is accurate that Intel CPUs use 2 cache lines / 128 bytes as the coherency protocol granule.

Yes, there can be additional destructive interference effects at that granularity, but that’s due to prefetching (of two cachelines with coherency managed independently) rather than having coherency operating on one 128 byte granule

AFAIK 64 bytes is still the correct granule for avoiding false sharing, with two cores modifying two consecutive cachelines having way less destructive interference than two cores modifying one cacheline.

j_seigh•3mo ago
A coherent cache is transparent to the memory model. So if someone trying to explain the memory model and ordering mentions the cache as affecting the memory model, it's pretty much a sign they don't fully understand what they're talking about.
wpollock•3mo ago
This post, along with the tutorial links it and the comments contain, provides good insight into caches, coherence, and related topics. I would like to add a link that I feel is also very good, maybe better:

<https://marabos.nl/atomics/hardware.html>

While the book this chapter is from is about Rust, this chapter is pretty much language-agnostic.

mannyv•3mo ago
The author talks about Java and cache coherency but ignores the fact that the JVM implementation sits above the processor's caches and, per the spec, can maintain different caches per thread.

This means you need to synchronize every shared access, whether it's a read or a write. In hardware you can cheat because a write usually behaves as a write-through. In a JVM that's not the case.

It's been a long time since I had to think about this, but it bit us pretty hard when we found that.