On de-duplicating uploaded file-content.
This evening I've been mostly playing with removing duplicate content. I've had this idea for the past few days about object-storage, and obviously in that context if you can handle duplicate content cleanly that's a big win.
The naive implementation of object-storage involves splitting uploaded files into chunks, storing them separately, and writing database-entries such that you can reassemble the appropriate chunks when the object is retrieved.
If you store chunks on-disk, by the hash of their contents, then things are nice and simple.
The end result is that you might upload the file /etc/passwd, split that into four-byte chunks, and then hash each chunk using SHA256.
This leaves you with some database-entries, and a bunch of files on-disk:
/tmp/hashed/ef267892ee080862c96a8d2d05de62f48e20f0875f27379e7d58c73ea4455bf1 /tmp/hashed/a378977155fb42bb006496321cbe31f74cbda803c3f6ca590f30e76d1afad921 .. /tmp/hashed/3805b0245bc8375be7125ae228eef711552ac082ffb9bf8756e2964a2393a9de
In my toy-code I wrote out the data in 4-byte chunks, which is grossly ineffeciant. But the value of using such small pieces is that there is liable to be a lot of collisions, and that means we save-space. It is a trade-off.
So the main thing I was experimenting with was the size of the chunks. If you make them too small you lose I/O due to the overhead of writing out so many small files, but you gain because collisions are common.
The rough testing I did involved using chunks of 16, 32, 128, 255, 512, 1024, 2048, and 4096 bytes. As sizes went up the overhead shrank, but also so did the collisions.
Unless you could handle the case of users uploading a lot of files like /bin/ls which are going to collide 100% of the time with prior uploads using larger chunks just didn't win as much as I thought they would.
I wrote a toy server using Sinatra & Ruby, which handles the splitting/hashing/and stored block-IDs in SQLite. It's not so novel given that it took only an hour or so to write.
The downside of my approach is also immediately apparent. All the data must live on a single machine - so that reassmbly works in the simple fashion. That's possible, even with lots of content if you use GlusterFS, or similar, but it's probably not a great approach in general. If you have large capacity storage avilable locally then this might would well enough for storing backups, etc, but .. yeah.