DragonFly kernel List (threaded) for 2008-11
Re: HEADS UP - HAMMER work
Matthew Dillon wrote:
> 64-bit directory hash encoding (for smaller filenames out of bounds
> indices just store a 0).
> aaaaa name & 0x1F
> bbbbb name & 0x1F
> ccccc name & 0x1F
> mmmmmm crc32(name + 3, len - 5) and some xor magic -> 6 bits
> yyyyy name[len-2] & 0x1F
> zzzzz name[len-1] & 0x1F
> h crc32(name, len) (entire filename)
> 0aaaaabbbbbccccc mmmmmmyyyyyzzzzz hhhhhhhhhhhhhhhh hhhhhhhhhhhhhhh0
> There are several issues but most of the trade-offs seem to be ok. The
> one issue I am looking at closely is perhaps to try to do a better job
> sorting files that have both a prefix and a postfix. In the above scheme
> any large prefix or postfix covers the a, b, c, y, and z
> bits and devolves us back to an effectively random hash code.
You already mentioned it. That's exactly the problem
that I'm seeing ... I'm not sure whether a, b, c,
y and z buy you anything in practice.
If a single directory contains a huge number of files,
it is likely they are all of the same type, e.g. it could
be a collection of images or whatever. That means they
all have the same extension (e.g. .jpg), so y and z
Furthermore, it isn't completely unlikely that they even
begin with the same prefix. For example, all of my
digital camera pics are named "img%05d.jpg". Admittedly
those aren't millions (but more than 10k anyway), and
I'm not stupid enough to collect them in a single
Another example: The cache directory of my Opera browser.
It contains several thousands of files all beginning
It might be a good idea to make a small survey, i.e. find
people who actually _do_ have directories with a huge
number of files in them (and I mean more than just a few
thousands), and ask them what the filenames typically look
An obvious improvement would be to store name[d-2] and
name[d-1] in y and z, respectively, where d is the
location of the last dot in the filename, if any, or the
location of the terminating zero if there is no dot.
In other words: Ignore the extension when identifying
y and z. Finding the last dot shouldn't be more
computationally expensive than strlen(name), so this
shouldn't be a problem.
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing b. M.
Handelsregister: Registergericht Muenchen, HRA 74606, Geschäftsfuehrung:
secnetix Verwaltungsgesellsch. mbH, Handelsregister: Registergericht Mün-
chen, HRB 125758, Geschäftsführer: Maik Bachmann, Olaf Erb, Ralf Gebhart
FreeBSD-Dienstleistungen, -Produkte und mehr: http://www.secnetix.de/bsd