Fast and cheap bulk storage: using LVM to cache HDDs on SSDs

https://quantum5.ca/2025/05/11/fast-cheap-bulk-storage-using-lvm-to-cache-hdds-on-ssds/
78•todsacerdoti•4h ago

Comments

gopalv•3h ago
As always, the YMMV of caching is access patterns, but the most consistently cacheable pattern for me has been ext4 journals.

They are tiny and often hit with a huge number of IOPS.

Ext4 supports external journals, and moving them to a single SSD for a large number of otherwise slow SMR disks has worked great in the past.

However, that SSD becomes a single point of failure: losing it can cause data loss on several disks at once (unlike losing a read cache).
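
A minimal sketch of the external-journal setup described above. The device names (/dev/sdX1 for an SSD partition, /dev/sdY1 for an HDD partition) and the 4096-byte block size are assumptions for illustration. As far as I know a journal device is normally dedicated to one filesystem, so a single SSD serving several disks would be split into one partition per disk. The script only prints the commands unless RUN=1 is set, because they are destructive.

```shell
#!/bin/sh
# Hypothetical device names: replace before real use.
JOURNAL_DEV=${JOURNAL_DEV:-/dev/sdX1}   # small SSD partition
DATA_DEV=${DATA_DEV:-/dev/sdY1}         # slow HDD partition
BLOCK_SIZE=4096                         # journal and filesystem block size must match

# Print each command; execute only when RUN=1 (destructive, needs root).
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

setup_external_journal() {
    # 1. Format the SSD partition as a dedicated journal device.
    run mke2fs -b "$BLOCK_SIZE" -O journal_dev "$JOURNAL_DEV"
    # 2. Create the ext4 filesystem on the HDD, pointing at that journal.
    run mkfs.ext4 -b "$BLOCK_SIZE" -J device="$JOURNAL_DEV" "$DATA_DEV"
}

setup_external_journal
```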

Where I was working that didn't matter, as I was mostly working with HDFS, which likes a JBOD layout of several disks instead of RAID (no battery-backed write caches), tolerates a single node failing completely, and generates a ton more metadata operations because it writes a single large file as many fixed-size files named blk_<something>, spread over a lot of directories containing thousands of files.

SSDs were expensive then, but they have had a decade of getting cheaper since.

trinsic2•2h ago
This reminds me of the hybrid drives. When the NVM failed it was a nightmare to deal with. IMHO it's a bad idea from a stability perspective to be caching off-drive to non-volatile memory.
wtallis•33m ago
Your last sentence does not follow from the preceding one. Hybrid drives were doomed by having truly tiny caches, making them not particularly fast (you need a lot of flash chips in parallel to get high throughput), prone to cache thrashing, and easy to wear out the NAND flash. These days, even if you try, it's hard to build a caching system that bad. There just aren't SSDs small and slow enough to have such a crippling effect. Even using a single consumer SSD as a cache for a full shelf of hard drives wouldn't be as woefully unbalanced as the SSHDs that tried to get by with only 8GB of NAND.
GauntletWizard•1h ago
The same goes for ZFS; there's provision for a "zil" device - ZFS Intent Log, basically the journal. ZFS is a little nicer in that this journal is explicitly disposable - if you lose your ZIL device, you lose any writes since its horizon, but you don't lose the whole array.

The next step up is building a "metadata" device, which stores the filesystem metadata but not data. This is dangerous in the way the ext4 journal is; lose the metadata, and you lose everything.

Both are massive speedups. When doing big writes, a bunch of spinning rust can't achieve full throughput without an SSD ZIL. My 8+2 array can write nearly two gigabits, but it's abysmal (roughly the speed of a single drive) without a ZIL.

Likewise, a metadata device can make the whole filesystem feel as snappy as SSD, but it's unnecessary if you have enough cache space; ZFS prefers it, so if your metadata fits into your cache SSD, most of it will stay loaded.

Szpadel•1h ago
I just want to mention that the ZIL only speeds up sync writes: the syscall returns once the data is written to the ZIL, while the write may still be in progress on the slower storage.

ZIL is also basically write-only storage, so an SSD without very significant over-provisioning will die quickly (you only read from the ZIL after an unclean shutdown).

if you don't really care about the latest version of a file (the risk of losing recent changes is acceptable) you might set sync=disabled for that dataset and you can have great performance without a ZIL
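
A rough sketch of the two options discussed in this subthread, assuming a hypothetical pool named tank, a dataset tank/scratch, and two SSD partitions. It prints the commands by default and only executes them with RUN=1.

```shell
#!/bin/sh
# Hypothetical names: adjust pool, dataset and devices before real use.
POOL=${POOL:-tank}
DATASET=${DATASET:-tank/scratch}
SLOG_A=${SLOG_A:-/dev/sdX1}
SLOG_B=${SLOG_B:-/dev/sdY1}

# Print each command; execute only when RUN=1 (needs root).
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Option 1: add a mirrored log vdev so sync writes are acknowledged from SSD.
add_slog() {
    run zpool add "$POOL" log mirror "$SLOG_A" "$SLOG_B"
}

# Option 2: treat sync writes as async on one dataset.
# Recent writes can be lost on a crash or power failure.
disable_sync() {
    run zfs set sync=disabled "$DATASET"
}

add_slog
disable_sync
```

Setting sync per dataset keeps the risk confined to data where losing the last few seconds of writes is acceptable.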

JonChesterfield•40m ago
There's a configuration option that amounts to putting a directory (or maybe a volume) entirely into the metadata drive.

It's been a long time since I set that up, but the home storage has spinning rust plus a raid 1 of crucial ssd (sata! But ones with a capacitor to hopefully handle writes after power loss), where the directory I care about performance for lives on the ssd subarray. Still presents as one blob of storage. Metadata on the ssd too, probably no ZIL but could be wrong about that. Made ls a lot more reasonable.

Thinking about it, that system must be trundling towards expected death; it might be a decade old now.

bjt12345•2h ago
Oh I miss Optane drives.
ggm•2h ago
The cited logic for not using ZFS reduces to two things: FUD and not being in baseline Linux.

The pro case for BTRFS is being able to do JBOD with a bit of additional comfort around mirror state over drives.

Szpadel•2h ago
something that people forget with raid1 is that it only protects from catastrophic disk failure.

this means your drive needs to be dead for raid to do its protection, and this is usually the case.

the problem is when a drive starts corrupting the data it reads or writes. in that case raid has no way to know that and can even corrupt data on the healthy drive. (data is read corrupted and then written to both drives)

the issue is that there are 2 copies of the data and raid has no way of telling which one is correct, so it basically flips a coin and selects one of them, even if the filesystem knows that the content makes no sense.

that's basically the biggest advantage of filesystems like zfs or btrfs that manage raid themselves: they have checksums, so they know which copy is valid and are able to recover and say that one drive appears healthy but is corrupting data, so you probably want to replace it
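
A toy illustration of that point, not how ZFS or btrfs are actually implemented: two files stand in for the mirror halves, and a SHA-256 recorded at write time stands in for the filesystem checksum. All names are made up for the example.

```shell
#!/bin/sh
# Toy model: two "mirror halves" as files, plus a checksum stored at write time.
dir=$(mktemp -d)

printf 'important block\n' > "$dir/copy_a"
cp "$dir/copy_a" "$dir/copy_b"
# The checksum is kept separately from the data, as a checksumming filesystem would.
want=$(sha256sum < "$dir/copy_a" | cut -d' ' -f1)

# Simulate silent corruption on the first device.
printf 'important blocK\n' > "$dir/copy_a"

# Print the name of the first copy whose content still matches the stored checksum.
pick_good_copy() {
    for c in "$dir/copy_a" "$dir/copy_b"; do
        got=$(sha256sum < "$c" | cut -d' ' -f1)
        if [ "$got" = "$want" ]; then
            basename "$c"
            return 0
        fi
    done
    return 1
}

good=$(pick_good_copy)
echo "valid copy: $good"   # prints "valid copy: copy_b"
rm -rf "$dir"
```

Without the stored checksum there is nothing to compare against, which is the coin flip described above.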

iforgotpassword•1h ago
Had that experience once, ca. 2011. I hosted a Minecraft server on a box with raid1.

The "cool" part was that I ran a cronjob that rendered the map to a png file once an hour, and at some point a friend asked why there were holes in the map. Back then, Minecraft stored every 16x16 chunk of the map in an individual gzipped file. When the raid1 decided to read the chunk from the bad drive, it couldn't unzip it. If that happened to the renderer, there was a hole on the map. If that happened to the game server, it would regenerate the chunk, and overwrite the old one on both drives, even the healthy one. Luckily, as far as I remember, that only happened on random terrain, otherwise someone would have ended up with half their house missing.

iam-TJ•42m ago
When using LVM one can use the dm-integrity target to detect data corruption.
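
A short sketch of what that can look like with LVM's own RAID, assuming a hypothetical volume group vg0 and logical volume data. The --raidintegrity option is described in lvmraid(7); the size is arbitrary. Commands are printed unless RUN=1 is set.

```shell
#!/bin/sh
# Hypothetical names: volume group "vg0", logical volume "data".
VG=${VG:-vg0}
LV=${LV:-data}

# Print each command; execute only when RUN=1 (needs root).
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# New LV: raid1 with a dm-integrity layer under each mirror image.
create_with_integrity() {
    run lvcreate --type raid1 --mirrors 1 --raidintegrity y --size 10G --name "$LV" "$VG"
}

# Existing raid1 LV: add the integrity layer in place.
add_integrity() {
    run lvconvert --raidintegrity y "$VG/$LV"
}

create_with_integrity
add_integrity
```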
riedel•2h ago
Does someone know what the technology behind the tiering on QNAP NAS systems is? I use an SSD RAID 1 in front of a RAID 10, which seems to work great.

IMHO flexible tiering rather than caching would be very nice for many systems, as it is rather difficult to teach users to separate rather stale data from changing data. It often does not have to be perfect.

rsync•2h ago
A reminder that zfs recently (past ~5 years) implemented dedicated metadata cache devices ... which allows you to cache either filesystem metadata or even small files to a blazing fast SSD mirror:

https://www.rsync.net/resources/notes/2021-q3-rsync.net_tech...

This is a quick and easy way to add thousands of iops to even something very slow like a raidz3 zpool.

As always:

"Let's repeat, and emphasize: unlike an SLOG or L2ARC which are merely inconvenient to lose, if you lose your metadata vdev (your "special" vdev) you will lose your entire zpool just as surely as if you lost one of the other full vdevs ..."

sitkack•1h ago
I would hope ZFS has a way to mirror metadata from the pool onto an SSD, so it is actually a cache but doesn't increase the probability of data loss.
wongarsu•1h ago
If you set up a normal L2ARC (read cache device), it will cache both data and metadata. However, you can configure it to cache only one of the two. Set it to metadata only, size it appropriately, and you have basically a read-only metadata mirror.

If you also want to have fast writes you can get a second SSD and set up a mirrored metadata device (storing metadata on mirrored SSDs, and regular data on whatever the rest of your pool uses).
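
The two setups above could be sketched roughly as follows, assuming a hypothetical pool named tank and placeholder SSD partitions. The first is safe to lose; the second is not, since a special vdev holds the only copy of the metadata. Commands are printed unless RUN=1 is set.

```shell
#!/bin/sh
# Hypothetical names: pool "tank" and three SSD partitions.
POOL=${POOL:-tank}
CACHE_DEV=${CACHE_DEV:-/dev/sdX1}
SPECIAL_A=${SPECIAL_A:-/dev/sdY1}
SPECIAL_B=${SPECIAL_B:-/dev/sdZ1}

# Print each command; execute only when RUN=1 (needs root).
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# Read-only metadata cache: an L2ARC device restricted to metadata.
# Losing this device does not lose pool data.
metadata_only_l2arc() {
    run zpool add "$POOL" cache "$CACHE_DEV"
    run zfs set secondarycache=metadata "$POOL"
}

# Mirrored special vdev: metadata lives here, so it must be redundant.
# Losing the whole special vdev loses the pool.
mirrored_special_vdev() {
    run zpool add "$POOL" special mirror "$SPECIAL_A" "$SPECIAL_B"
}

metadata_only_l2arc
mirrored_special_vdev
```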

Padriac•1h ago
RAID is great but without monitoring and alerting you can still have a problem. Better still is the automatic creation of incident records and escalation.
iam-TJ•48m ago
When using LVM there is no need to use separate mdadm (MD) based RAID - just use LVM's own RAID support.

I have a workstation with four storage devices; two 512GB SSDs, one 1GB SSD, and one 3TB HDD. I use LUKS/dm_crypt for Full Disk Encryption (FDE) of the OS and most data volumes but two of the SSDs and the volumes they hold are unencrypted. These are for caching or public and ephemeral data that can easily be replaced: source-code of public projects, build products, experimental and temporary OS/VM images, and the like.

  dmsetup ls | wc -l 
reports 100 device-mapper Logical Volumes (LV). However only 30 are volumes exposing file-systems or OS images according to:

  ls -1 /dev/mapper/${VG}-* | grep -E "${VG}-[^_]+$" | wc -l
The other 70 are LVM raid1 mirrors, writecache, crypt or other target-type volumes.

This arrangement allows me to choose caching, raid, and any other device-mapper target combinations on a per-LV basis. I divide the file-system hierarchy into multiple mounted LVs and each is tailored to its usage, so I can choose both device-mapper options and file-system type. For example, /var/lib/machines/ is an LV with BTRFS to work with systemd-nspawn/machined, so I have a base OS sub-volume and then various per-application snapshots based on it, whereas /home/ is a RAID 1 mirror over multiple devices and /etc/ is also a RAID 1 mirror.
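
As an illustration of that per-LV choice (not the exact layout described here), the sketch below creates one raid1 LV and one separate linear LV with an SSD writecache attached. The volume group, LV names, sizes and device paths are all hypothetical; the lvconvert form follows lvmcache(7). Commands are printed unless RUN=1 is set.

```shell
#!/bin/sh
# Hypothetical names: VG "vg0", two HDD PVs and one SSD PV.
VG=${VG:-vg0}
HDD_A=${HDD_A:-/dev/sdA1}
HDD_B=${HDD_B:-/dev/sdB1}
SSD=${SSD:-/dev/sdX1}

# Print each command; execute only when RUN=1 (needs root).
run() {
    if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$*"; fi
}

# LV 1: a raid1 mirror across the two HDDs.
make_mirrored_home() {
    run lvcreate --type raid1 --mirrors 1 --size 20G --name home "$VG" "$HDD_A" "$HDD_B"
}

# LV 2: a plain LV on one HDD, with a writecache volume on the SSD in front of it.
make_cached_build() {
    run lvcreate --size 50G --name build "$VG" "$HDD_A"
    run lvcreate --size 8G --name build_wcache "$VG" "$SSD"
    run lvconvert --type writecache --cachevol build_wcache "$VG/build"
}

make_mirrored_home
make_cached_build
```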

The RAID 1 mirrors can be easily backed up to remote hosts using iSCSI block devices. Simply add the iSCSI volume to the mirror as an additional member, allow it to sync 100%, and then remove it from the mirror (one just needs to be aware of, and minimise, open files when doing so - syncing at start-up or shutdown when users are logged out is a useful strategy, as is doing it from the start-up or shutdown initrd).

Doing it this way rather than as file backups means in the event of disaster I can recover immediately on another PC simply by creating an LV RAID 1 with the iSCSI volume, adding local member volumes, letting the local volumes sync, then removing the iSCSI volume.

I initially allocate a minimum of space to each volume. If a volume gets close to capacity - or runs out - I simply do a live resize using e.g:

  lvextend --resizefs --size +32G ${VG}/${LV}
or, if I want to direct it to use a specific Physical Volume (PV) for the new space:

  lvextend --resizefs --size +32G ${VG}/${LV} ${PV}
One has to be aware that --resizefs uses 'fsadm' and only supports a limited set of file-systems (ext*, ReiserFS and XFS), so if using BTRFS or others their own resize operations are required, e.g:

  btrfs filesystem resize max /srv/NAS/${VG}/${LV}
