Garbage collection of object storage at scale

https://www.warpstream.com/blog/taking-out-the-trash-garbage-collection-of-object-storage-at-massive-scale
35•ko_pivot•3d ago

Comments

juancn•3h ago
Another possible mechanism for doing GC at scale in a file/object store (a variation on Asynchronous Reconciliation in the article) is a probabilistic mark and sweep using bloom filters.

The mark phase can be done in parallel, building many bloom filters for the files/objects found.

The bloom filters are then merged (essentially OR'ed together), and a parallel sweep phase can use the merged filter to answer the question: is this file/object live?

The bloom filter then answers either "No" with 100% certainty or "Maybe" with some probability p that depends on the parameters used for the bitset and the hash function family.
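
A minimal sketch of the scheme described above, assuming a toy Bloom filter built on a Python integer bitset. The names `BloomFilter`, `mark`, and `sweep`, along with the bit and hash counts, are hypothetical choices for illustration and do not come from the article. The property that matters for GC is that false positives only cause garbage to be retained; a live object is never reported as dead.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter backed by a Python int used as a bitset (illustrative only)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key: str):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key: str) -> bool:
        # False means "definitely not added"; True means "maybe added".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))

    def merge(self, other: "BloomFilter") -> "BloomFilter":
        # Merging is a bitwise OR; both filters must share the same parameters.
        if (self.num_bits, self.num_hashes) != (other.num_bits, other.num_hashes):
            raise ValueError("cannot merge filters with different parameters")
        merged = BloomFilter(self.num_bits, self.num_hashes)
        merged.bits = self.bits | other.bits
        return merged


def mark(live_keys, num_bits: int = 1 << 16, num_hashes: int = 4) -> BloomFilter:
    """Mark phase for one shard: record every live key in a fresh filter."""
    bloom = BloomFilter(num_bits, num_hashes)
    for key in live_keys:
        bloom.add(key)
    return bloom


def sweep(all_keys, live_filter: BloomFilter):
    """Sweep phase: return keys that are definitely not live."""
    return [key for key in all_keys if key not in live_filter]


# Two workers mark disjoint shards of the live set; the results are merged.
shard_a = mark(["seg/1", "seg/2"])
shard_b = mark(["seg/3"])
live = shard_a.merge(shard_b)
deletable = sweep(["seg/1", "seg/2", "seg/3", "seg/4", "seg/5"], live)
```

Because a "maybe" answer can be a false positive, `deletable` may omit some real garbage, which a later pass with different hash parameters could pick up.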

cogman10•2h ago
What does the bloom filter solve?

The expensive portion of the mark and sweep for the object store is the mark phase, not the storage of what's been marked. 100s, 1000s, or even millions of live objects would hardly take any space to keep in a remembered set.

On the other hand, querying the S3 bucket to list those 1M objects would be expensive no matter how you store the results.

But this does tickle my brain. Perhaps something akin to the generational hypothesis can be applied? Maybe it's the case that very old, very young, or very untouched objects are more likely to be garbage than not. If there's some way to divide the objects up and only look at objects that are in "probably need to be collected" regions, then you could do minor fast sweeps semi-frequently and schedule more expensive "really delete untracked stuff" sweeps infrequently.
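
A rough sketch of that idea, assuming object ages are already known from some cheap metadata source. The `ObjectInfo` type, the function name, and the cutoffs are all hypothetical; whether very young or very old objects really are more likely to be garbage is the commenter's open question, not an established result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectInfo:
    key: str
    age_seconds: float  # time since the object was written (assumed known)


def minor_sweep_candidates(objects, young_cutoff: float, old_cutoff: float):
    """Select only objects in the "probably garbage" age regions.

    A frequent minor sweep would check just these candidates for liveness;
    an infrequent full sweep would check every object.
    """
    return [
        obj
        for obj in objects
        if obj.age_seconds <= young_cutoff or obj.age_seconds >= old_cutoff
    ]


objects = [
    ObjectInfo("a", 30.0),       # very young
    ObjectInfo("b", 86_400.0),   # middle-aged, skipped by the minor sweep
    ObjectInfo("c", 2_592_000.0),  # very old
]
candidates = minor_sweep_candidates(objects, young_cutoff=60.0, old_cutoff=604_800.0)
```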

donavanm•1h ago
If you like big beautiful storage and probabilistic structures check out https://www.usenix.org/conference/osdi14/technical-sessions/.... The Coho Data folks ended up in AWS S3 a few years later.

deathanatos•2h ago
> Why Not Just Use a Bucket Policy?

I've heard these words so many times, it's refreshing to see someone dig into why bucket policies aren't a cure-all.

As for "Why not use synchronous deletion?" — regarding the pitfall there, what about a WAL? I.e., you WAL the deletions you want to perform into an object in the object store, perform the deletions, and then delete the WAL. If you crash and find a WAL file, you repeat the delete commands contained in the WAL.

(I've used this to handle this problem where some of the deletions are mixed: i.e., some in an object store, some in a SQL DB, etc. The object store is essentially being used as strongly consistent storage.)

(Perhaps this is essentially the same as your "delayed queue"? All I've got is an object store though, not a queue, and it's a pretty useful hammer.)
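
The WAL approach described in this comment could be sketched roughly as follows. `InMemoryStore` is a hypothetical stand-in for a strongly consistent object store, and the `wal/` prefix, function names, and `crash_after` parameter are assumptions made only for the illustration. The scheme relies on deletes being idempotent, so replaying a WAL after a crash is harmless.

```python
import json


class InMemoryStore:
    """Hypothetical stand-in for a strongly consistent object store."""

    def __init__(self) -> None:
        self.objects = {}

    def put(self, key: str, data: bytes) -> None:
        self.objects[key] = data

    def get(self, key: str) -> bytes:
        return self.objects[key]

    def delete(self, key: str) -> None:
        # Deleting a missing key is a no-op, which makes replay safe.
        self.objects.pop(key, None)

    def list(self, prefix: str):
        return sorted(k for k in self.objects if k.startswith(prefix))


WAL_PREFIX = "wal/"


def delete_with_wal(store, wal_id: str, keys, crash_after=None) -> None:
    """Log the intended deletions, perform them, then remove the log."""
    wal_key = f"{WAL_PREFIX}{wal_id}.json"
    store.put(wal_key, json.dumps(list(keys)).encode("utf-8"))
    for i, key in enumerate(keys):
        if crash_after is not None and i >= crash_after:
            raise RuntimeError("simulated crash")  # test hook only
        store.delete(key)
    store.delete(wal_key)


def recover(store) -> None:
    """On startup, replay any WAL left behind by a crashed run."""
    for wal_key in store.list(WAL_PREFIX):
        for key in json.loads(store.get(wal_key)):
            store.delete(key)
        store.delete(wal_key)


store = InMemoryStore()
for name in ("data/a", "data/b", "data/c"):
    store.put(name, b"payload")

try:
    # Crash after deleting only the first of two keys.
    delete_with_wal(store, "0001", ["data/a", "data/b"], crash_after=1)
except RuntimeError:
    pass

leftover_wals = store.list(WAL_PREFIX)
recover(store)
remaining = store.list("")
```

After the simulated crash, one WAL object is left behind; `recover` finishes the deletions and removes it, leaving only the object that was never scheduled for deletion.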

telotortium•1h ago
> HN Disclaimer: WarpStream sells a drop-in replacement for Apache Kafka built directly on-top of object storage.

First time I’ve seen one of these. That’s actually a better way to advertise your product than putting it at the end.

hencq•59m ago
Yes, though I think they meant to say disclosure instead of disclaimer.

Type-constrained code generation with language models

https://arxiv.org/abs/2504.09246
65•tough•2h ago•28 comments

Your fingers wrinkle the same way every time you're in the water too long

https://www.binghamton.edu/news/story/5547/do-your-fingers-wrinkle-the-same-way-every-time-youre-in-the-water-too-long-new-research-says-yes
22•gnabgib•1h ago•2 comments

Flattening Rust's Learning Curve

https://corrode.dev/blog/flattening-rusts-learning-curve/
47•birdculture•2h ago•22 comments

Branch Privilege Injection: Exploiting branch predictor race conditions

https://comsec.ethz.ch/research/microarch/branch-privilege-injection/
319•alberto-m•8h ago•127 comments

Starcloud

https://www.ycombinator.com/companies/starcloud
126•wiley1454•4h ago•245 comments

Map of Palaeohispanic Coins and Inscriptions

http://hesperia.ucm.es/consulta_hesperia/mapas.php
8•brendanashworth•35m ago•0 comments

Build real-time knowledge graph for documents with LLM

https://cocoindex.io/blogs/knowledge-graph-for-docs/
64•badmonster•4h ago•11 comments

Failed Soviet Venus lander Kosmos 482 crashes to Earth after 53 years in orbit

https://www.space.com/space-exploration/launches-spacecraft/failed-soviet-venus-lander-kosmos-482-crashes-to-earth-after-53-years-in-orbit
98•taubek•3d ago•62 comments

Google is building its own DeX: First look at Android's Desktop Mode

https://www.androidauthority.com/android-desktop-mode-leak-3550321/
194•logic_node•10h ago•160 comments

Show HN: HelixDB – Open-source vector-graph database for AI applications (Rust)

https://github.com/HelixDB/helix-db/
114•GeorgeCurtis•7h ago•50 comments

PDF to Text, a challenging problem

https://www.marginalia.nu/log/a_119_pdf/
222•ingve•9h ago•124 comments

Multiple security issues in GNU Screen

https://www.openwall.com/lists/oss-security/2025/05/12/1
331•st_goliath•13h ago•202 comments

Launch HN: Miyagi (YC W25) turns YouTube videos into online, interactive courses

157•bestwillcui•11h ago•87 comments

A tool to verify estimates, II: a flexible proof assistant

https://terrytao.wordpress.com/2025/05/09/a-tool-to-verify-estimates-ii-a-flexible-proof-assistant/
12•jjgreen•3d ago•0 comments

When graphic design saves lives

https://news.harvard.edu/gazette/story/2025/05/when-graphic-design-saves-lives/
8•gnabgib•3d ago•0 comments

Garbage collection of object storage at scale

https://www.warpstream.com/blog/taking-out-the-trash-garbage-collection-of-object-storage-at-massive-scale
35•ko_pivot•3d ago•6 comments

It Awaits Your Experiments

https://www.rifters.com/crawl/?p=11511
125•pavel_lishin•9h ago•34 comments

How (memory) safe is Zig? (2021)

https://www.scattered-thoughts.net/writing/how-safe-is-zig/
18•vortex_ape•2h ago•16 comments

Cardiac: A CARDboard Illustrative Aid to Computation [pdf]

https://www.cs.drexel.edu/~bls96/museum/CARDIAC_manual.pdf
15•throwaway71271•2h ago•5 comments

Coffee for people who don't like coffee

https://ostwilkens.se/blog/coffee
20•ostwilkens•3d ago•56 comments

Less meat is nearly always better than sustainable meat

https://ourworldindata.org/less-meat-or-sustainable-meat
8•sohkamyung•52m ago•1 comment

Y Combinator says Google is a monopolist, no comment about its OpenAI ties

https://techcrunch.com/2025/05/13/y-combinator-says-google-is-a-monopolist-that-has-stunted-the-startup-ecosystem/
108•mastazi•2h ago•27 comments

OpenTelemetry protocol with Apache Arrow

https://opentelemetry.io/blog/2025/otel-arrow-phase-2/
55•tanelpoder•6h ago•13 comments

The world could run on older hardware if software optimization was a priority

https://twitter.com/ID_AA_Carmack/status/1922100771392520710
562•turrini•14h ago•541 comments

I learned Snobol and then wrote a toy Forth

https://ratfactor.com/snobol/
115•ingve•2d ago•30 comments

Turritopsis dohrnii: Immortal jellyfish

https://www.nhm.ac.uk/discover/immortal-jellyfish-secret-to-cheating-death.html
30•vinnyglennon•4d ago•6 comments

Using obscure graph theory to solve programming languages problems

https://reasonablypolymorphic.com/blog/solving-lcsa/
27•matt_d•4h ago•3 comments

Membrane: Media Framework for Elixir

https://membrane.stream/
112•lawik•3d ago•35 comments

Insurers launch cover for losses caused by AI chatbot errors

https://www.ft.com/content/1d35759f-f2a9-46c4-904b-4a78ccc027df
107•jmacd•2d ago•41 comments

In a high-stress work environment, prioritize relationships

https://wqtz.bearblog.dev/high-stress-job-relationships/
285•wqtz•11h ago•178 comments