taylodl•38m ago
So when a user wants to remove a CAS-addressed document, before really deleting it you need to detect whether it's the last reference. That is not easy to do; in fact, it is much harder to do correctly than simply eating the cost of storing duplicate files.
And this paragraph is the purported solution:
And usually when CAS is considered as a solution, it's to solve the need to deduplicate files to save on storage. But even there, the better solution is to give files their own internal uuids as storage keys, store their hashes alongside, generate external uuids for each file upload, and then use refcounts to handle the final delete.
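For concreteness, here is a minimal sketch of that scheme in Python with an in-memory SQLite database holding the metadata. The table, column, and function names are mine, not the article's, and the actual blob-store reads and writes are left as comments:

```python
import hashlib
import sqlite3
import uuid

db = sqlite3.connect(":memory:")
db.executescript("""
    -- One row per unique blob: internal storage key, content hash, refcount.
    CREATE TABLE blobs (
        storage_key  TEXT PRIMARY KEY,   -- internal uuid used as the storage key
        content_hash TEXT UNIQUE,        -- hash kept alongside, used only for dedup lookups
        refcount     INTEGER NOT NULL
    );
    -- One row per upload: the external uuid handed back to the caller.
    CREATE TABLE uploads (
        external_id  TEXT PRIMARY KEY,
        storage_key  TEXT REFERENCES blobs(storage_key)
    );
""")

def upload(data: bytes) -> str:
    """Dedup by hash, but always mint a fresh external uuid for the caller."""
    h = hashlib.sha256(data).hexdigest()
    row = db.execute("SELECT storage_key FROM blobs WHERE content_hash = ?", (h,)).fetchone()
    if row:
        key = row[0]
        db.execute("UPDATE blobs SET refcount = refcount + 1 WHERE storage_key = ?", (key,))
    else:
        key = str(uuid.uuid4())
        # blob_store.put(key, data)  # hypothetical write to the actual blob store
        db.execute("INSERT INTO blobs VALUES (?, ?, 1)", (key, h))
    external_id = str(uuid.uuid4())
    db.execute("INSERT INTO uploads VALUES (?, ?)", (external_id, key))
    db.commit()
    return external_id

def delete(external_id: str) -> None:
    """Drop one reference; physically delete only when the refcount hits zero."""
    (key,) = db.execute("SELECT storage_key FROM uploads WHERE external_id = ?",
                        (external_id,)).fetchone()
    db.execute("DELETE FROM uploads WHERE external_id = ?", (external_id,))
    db.execute("UPDATE blobs SET refcount = refcount - 1 WHERE storage_key = ?", (key,))
    (n,) = db.execute("SELECT refcount FROM blobs WHERE storage_key = ?", (key,)).fetchone()
    if n == 0:
        # blob_store.delete(key)  # hypothetical removal from the actual blob store
        db.execute("DELETE FROM blobs WHERE storage_key = ?", (key,))
    db.commit()
```

This is fine when everything sits in one database transaction; the critique below is about what happens when the refcount, the uploads table, and the blob store are not one atomic system.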
The problem is that this solution reframes the problem but doesn't solve it. It still requires:
- Accurate reference counting
- Careful handling of deletes
- Synchronization across systems
All of which is exactly the original problem (a concrete interleaving is sketched below).
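To make the "careful handling of deletes" item concrete, here is a toy, single-process replay of the interleaving that bites a naive decrement-then-delete once two uncoordinated actors are involved. The key and id strings are made up; only the ordering matters:

```python
# Toy state: one deduplicated blob with a single live reference.
blob_store = {"blob-key-1": b"contents"}   # internal key -> bytes
refcount   = {"blob-key-1": 1}

# Deleter: drops the last reference and observes zero.
refcount["blob-key-1"] -= 1
saw_zero = refcount["blob-key-1"] == 0     # True

# Uploader: lands in between, finds the hash already stored, reuses the blob.
refcount["blob-key-1"] += 1
new_external_id = "ext-uuid-2"             # handed back to a new caller

# Deleter: acts on its stale observation and removes the bytes.
if saw_zero:
    del blob_store["blob-key-1"]

# The uploader's reference now dangles: the refcount says 1, the bytes are gone.
assert refcount["blob-key-1"] == 1 and "blob-key-1" not in blob_store
```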
At the end of the day, you can't do distributed deletes safely with refcounts unless you centralize the operation, and centralizing kills scalability. There are workarounds, such as marking a file as unreferenced and letting a garbage collector delete unreferenced files later, but the author doesn't discuss them.
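For completeness, here is roughly what that mark-and-sweep workaround could look like on top of the first sketch (it reuses the same db connection and tables; the join-based liveness check and function names are mine). Deletes become a cheap detach, and an out-of-band sweep does the physical removal:

```python
def delete(external_id: str) -> None:
    """Fast path: just detach the reference; never touch the blob store here."""
    db.execute("DELETE FROM uploads WHERE external_id = ?", (external_id,))
    db.commit()

def gc_sweep() -> None:
    """Slow path, run out of band: physically remove blobs with no live uploads.
    A real collector would also skip blobs younger than some grace period so it
    can't race an upload that is dedup'ing against them at that moment."""
    orphans = db.execute("""
        SELECT b.storage_key
        FROM blobs b
        LEFT JOIN uploads u ON u.storage_key = b.storage_key
        WHERE u.external_id IS NULL
    """).fetchall()
    for (key,) in orphans:
        # blob_store.delete(key)  # hypothetical removal from the actual blob store
        db.execute("DELETE FROM blobs WHERE storage_key = ?", (key,))
    db.commit()
```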