DragonFly users List (threaded) for 2011-07
Re: Easy way to find/identify files which share some content/blocks
On 2011-05-02, Justin Sherrill <email@example.com> wrote:
> You could dump out the B-tree information. I don't know how clear a
> picture would come from that, and it may require some massaging of
> the data anyway, since non-duplicated files may still share some
> degree of matching data, especially when dealing with larger
> image files.
That's a bit beyond my current C programming skills, I guess, and a
little too much effort for this little cleanup project. Anyway, thanks
for the idea.
> If you are sure that the corruption lies at the end of the files, you
> could loop over the files, read the first x bytes of each, then MD5
> that data. Matching MD5 = matching file.
It mostly is at the end. This suggestion (partitioning files into
chunks and hashing them) is what I had done so far (on Linux), first
with a few lines of shell (adapted from an old existing script), then,
due to its inherent inefficiencies, in Python.
A handful of lines, and output "inode, chunkId, hash" to file or SQL,
then go from there.
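For illustration, a minimal sketch of that chunk-hashing approach in Python, assuming a fixed chunk size and MD5 as in Justin's suggestion (the chunk size and output format here are hypothetical choices, not what the original script used):

```python
import hashlib
import os
import sys

CHUNK_SIZE = 64 * 1024  # assumed chunk size; tune to the data


def chunk_hashes(path):
    """Yield (inode, chunk_index, md5hex) for each fixed-size chunk of a file."""
    inode = os.stat(path).st_ino
    with open(path, "rb") as f:
        index = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            yield inode, index, hashlib.md5(chunk).hexdigest()
            index += 1


if __name__ == "__main__":
    # Emit "inode, chunkId, hash" lines; pipe to a file or load into SQL,
    # then join on (chunkId, hash) to find files sharing leading chunks.
    for path in sys.argv[1:]:
        for inode, idx, digest in chunk_hashes(path):
            print("%d\t%d\t%s" % (inode, idx, digest))
```

A truncated copy then shows up as a file whose chunk-hash list is a strict prefix of another file's list.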
I had hoped hammer, as a deduplicating filesystem, had tools that could
easily give me that information without "hacks" like above.
> On Sun, May 1, 2011 at 2:39 PM, Thomas Keusch wrote:
>> now that Dragonfly's HAMMER has got deduplication I ask myself if there
>> is a simple way to identify "pairs" or groups of files which share a lot
>> of data, i.e. are mostly identical.
>> I have a rather large repository of downloaded pictures, which contain
>> a lot of dupes in multiple locations. I have no problems finding those
>> given some time and a shell prompt.
>> I'm interested in identifying broken files. Broken in the sense that
>> A is an incomplete version of B (some bytes missing), or B a damaged
>> version of A (some additional bytes at the end).
>> Is there a way to get to something like this:
>> "File A shares 1234 (98.3%) data blocks with file B"
>> "File A shares xxxx (xx.x%) data blocks with file C"
>> Getting a step closer helps too.
>> Thanks for any insights.