DragonFly users List (threaded) for 2011-07
Re: Easy way to find/identify files which share some content/blocks
On 2011-05-02, Justin Sherrill <email@example.com> wrote:
> You could dump out the B-tree information. I don't know how clear a
> picture would come from that, and it may require some massaging of
> the data anyway, since non-duplicated files may still share some
> degree of matching data, especially when dealing with larger
> image files.
That's a bit beyond my current C programming skills, I guess, and a
little too much effort for this little cleanup project. Anyway, thanks
for the idea.
> If you are sure that the corruption lies at the end of the files, you
> could loop over the files, read the first x bytes of each, then MD5
> that data. Matching MD5 = matching file.
It mostly is at the end. This suggestion (partitioning files into
chunks and hashing them) is what I had done so far (on Linux), first
with a few lines of shell (adapted from an old existing script), then,
due to its inherent inefficiencies, in Python.
A handful of lines, and output "inode, chunkId, hash" to file or SQL,
then go from there.
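For illustration, a minimal sketch of that chunk-hashing approach in Python, assuming a fixed chunk size and MD5 as in Justin's suggestion (the chunk size and output format here are hypothetical choices, not what the original script used):

```python
import hashlib
import os
import sys

CHUNK_SIZE = 64 * 1024  # assumed chunk size; tune to the data


def chunk_hashes(path):
    """Yield (inode, chunk_index, md5hex) for each fixed-size chunk of a file."""
    inode = os.stat(path).st_ino
    with open(path, "rb") as f:
        index = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            yield inode, index, hashlib.md5(chunk).hexdigest()
            index += 1


if __name__ == "__main__":
    # Emit "inode, chunkId, hash" lines; pipe to a file or load into SQL,
    # then join on (chunkId, hash) to find files sharing leading chunks.
    for path in sys.argv[1:]:
        for inode, idx, digest in chunk_hashes(path):
            print("%d\t%d\t%s" % (inode, idx, digest))
```

A truncated copy then shows up as a file whose chunk-hash list is a strict prefix of another file's list.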
I had hoped hammer, as a deduplicating filesystem, had tools that could
easily give me that information without "hacks" like above.
> On Sun, May 1, 2011 at 2:39 PM, Thomas Keusch wrote:
>> now that Dragonfly's HAMMER has got deduplication I ask myself if there
>> is a simple way to identify "pairs" or groups of files which share a lot
>> of data, i.e. are mostly identical.
>> I have a rather large repository of downloaded pictures, which contain
>> a lot of dupes in multiple locations. I have no problems finding those
>> given some time and a shell prompt.
>> I'm interested in identifying broken files. Broken in the sense that
>> A is an incomplete version of B (some bytes missing), or B a damaged
>> version of A (some additional bytes at the end).
>> Is there a way to get to something like this:
>> "File A shares 1234 (98.3%) data blocks with file B"
>> "File A shares xxxx (xx.x%) data blocks with file C"
>> Getting a step closer helps too.
>> Thanks for any insights.