DragonFly kernel List (threaded) for 2008-11
DragonFly BSD


To: Dennis Melentyev <dennis.melentyev@xxxxxxxxx>
From: Michael Neumann <mneumann@xxxxxxxx>
Date: Sat, 15 Nov 2008 09:56:30 +0100

Dennis Melentyev wrote:

2008/11/14 Oliver Fromme <check+kabwj800rsns8lml@fromme.com>:
Matthew Dillon wrote:
 >    64-bit directory hash encoding (for smaller filenames out of bounds
 >    indices just store a 0).
 >    aaaaa       name[0] & 0x1F
 >    bbbbb       name[1] & 0x1F
 >    ccccc       name[2] & 0x1F
 >    mmmmmm      crc32(name + 3, len - 5) and some xor magic -> 6 bits
 >    yyyyy       name[len-2] & 0x1F
 >    zzzzz       name[len-1] & 0x1F
 >    h[31]       crc32(name, len) (entire filename)
 >    0aaaaabbbbbccccc mmmmmmyyyyyzzzzz hhhhhhhhhhhhhhhh hhhhhhhhhhhhhhh0
 > [...]
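
For readers following the bit layout quoted above, here is a minimal C sketch of how the fields could be packed. It is not the HAMMER implementation: the function names (dirhash_sketch, five_bits, crc32_bytes) are hypothetical, the CRC is a plain bitwise CRC-32, and the "xor magic" that reduces the middle CRC to 6 bits is replaced by a simple xor-fold chosen for illustration only.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain bitwise CRC-32 (reflected polynomial 0xEDB88320). */
static uint32_t crc32_bytes(const char *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; i++) {
        crc ^= (unsigned char)buf[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

/* Low five bits of name[idx], or 0 when idx is out of bounds. */
static uint64_t five_bits(const char *name, size_t len, ptrdiff_t idx)
{
    if (idx < 0 || (size_t)idx >= len)
        return 0;
    return (uint64_t)((unsigned char)name[idx] & 0x1F);
}

/* Packs 0aaaaabbbbbccccc mmmmmmyyyyyzzzzz h(31 bits) 0 into 64 bits. */
uint64_t dirhash_sketch(const char *name, size_t len)
{
    uint64_t a = five_bits(name, len, 0);
    uint64_t b = five_bits(name, len, 1);
    uint64_t c = five_bits(name, len, 2);
    uint64_t y = five_bits(name, len, (ptrdiff_t)len - 2);
    uint64_t z = five_bits(name, len, (ptrdiff_t)len - 1);

    /* Middle of the name, crc32(name + 3, len - 5), folded to 6 bits.
     * ASSUMPTION: this xor-fold stands in for the unspecified "xor magic". */
    uint64_t m = 0;
    if (len > 5) {
        uint32_t mid = crc32_bytes(name + 3, len - 5);
        m = (mid ^ (mid >> 6) ^ (mid >> 12) ^ (mid >> 18) ^
             (mid >> 24) ^ (mid >> 30)) & 0x3F;
    }

    /* 31 bits of the CRC over the entire filename; bit 0 stays zero. */
    uint64_t h = crc32_bytes(name, len) & 0x7FFFFFFFu;

    return (a << 58) | (b << 53) | (c << 48) |
           (m << 42) | (y << 37) | (z << 32) | (h << 1);
}
```

With names such as "img00001.jpg" and "img00002.jpg", the a, b, c, y and z fields come out identical under this sketch, so only the m and h bits can distinguish them.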

You already mentioned it.  That's exactly the problem
that I'm seeing ...  I'm not sure whether a[], b[], c[],
y[] and z[] buy you anything in practice.

If a single directory contains a huge number of files,
it is likely they are all of the same type, e.g. it could
be a collection of images or whatever.  That means they
all have the same extension (e.g. .jpg), so y[] and z[]
are useless.

Furthermore, it isn't completely unlikely that they even
begin with the same prefix.  For example, all of my
digital camera pics are named "img%05d.jpg".  Admittedly
those aren't millions (but more than 10k anyway), and
I'm not stupid enough to collect them in a single
directory.  ;-)

Another example:  The cache directory of my Opera browser.
It contains several thousands of files all beginning
with "opr*".

It might be a good idea to run a small survey, i.e. find
people who actually _do_ have directories with a huge
number of files in them (and I mean more than just a few
thousand), and ask them what the filenames typically look
like.
An obvious improvement would be to store name[d-2] and
name[d-1] in y[] and z[], respectively, where d is the
location of the last dot in the filename, if any, or the
location of the terminating zero if there is no dot.
In other words:  Ignore the extension when identifying
y[] and z[].  Finding the last dot shouldn't be more
computationally expensive than strlen(name), so this
shouldn't be a problem.
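
The suggestion above can be sketched in C roughly as follows. The helper names (stem_end, yz_before_extension) are hypothetical, and treating a name that starts with its only dot (d = 0) as having no usable bytes is an assumption of this sketch, not something the proposal specifies.

```c
#include <stddef.h>
#include <stdint.h>

/* Index of the last '.' in name[0..len), or len if there is no dot. */
static size_t stem_end(const char *name, size_t len)
{
    for (size_t i = len; i > 0; i--) {
        if (name[i - 1] == '.')
            return i - 1;
    }
    return len;
}

/* Returns the ten bits yyyyyzzzzz taken from name[d-2] and name[d-1],
 * where d is the position found by stem_end(). Out-of-bounds bytes
 * contribute 0, as in the original layout. */
unsigned yz_before_extension(const char *name, size_t len)
{
    size_t d = stem_end(name, len);
    unsigned y = (d >= 2) ? ((unsigned char)name[d - 2] & 0x1F) : 0;
    unsigned z = (d >= 1) ? ((unsigned char)name[d - 1] & 0x1F) : 0;
    return (y << 5) | z;
}
```

For "img00001.jpg" and "img00002.jpg" the z field now differs (the bytes '1' and '2' instead of the shared 'g'), which is the effect the proposal is after. The scan is a single backwards pass, so its cost is comparable to strlen(name).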

I agree with Oliver, but I have another proposal. I doubt that there are usually more than one or two affected directories per host, and the file names in them usually follow a very similar pattern.

A sysctl or some other tunable with some kind of mask would be great for
fine-tuning (and simply irrelevant for 90% of users).

sysctl.hammer.dirhash.hashmask.prefix=1 (Starting at first filename
byte, 3 bytes fixed length)
sysctl.hammer.dirhash.hashmask.suffix=-1 (Starting last byte, 2 bytes length)

Hm, but the hash is stored on-disk within each b-tree node, so changing the hash-function becomes pretty dangerous!
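
A toy C sketch of why this is dangerous: keys are computed once and stored with the entries, so a lookup that recomputes the key with a different hash setting searches in the wrong place. Nothing here is HAMMER code. The two hash functions, the sorted array standing in for the B-tree, and all names are hypothetical, and key collisions are ignored for brevity.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

typedef uint64_t (*hash_fn)(const char *name);

/* Toy "prefix" setting: the key is the first byte of the name. */
static uint64_t hash_prefix(const char *name)
{
    return (unsigned char)name[0];
}

/* Toy "suffix" setting: the key is the last byte of the name. */
static uint64_t hash_suffix(const char *name)
{
    size_t len = strlen(name);
    return len ? (unsigned char)name[len - 1] : 0;
}

struct toy_entry {
    uint64_t key;       /* computed at creation time and stored */
    const char *name;
};

static int cmp_key(const void *a, const void *b)
{
    uint64_t ka = ((const struct toy_entry *)a)->key;
    uint64_t kb = ((const struct toy_entry *)b)->key;
    return (ka > kb) - (ka < kb);
}

/* Fill idx[] with keys from fn and sort by key (stand-in for the B-tree). */
static void build_index(struct toy_entry *idx, const char **names,
                        size_t n, hash_fn fn)
{
    for (size_t i = 0; i < n; i++) {
        idx[i].key = fn(names[i]);
        idx[i].name = names[i];
    }
    qsort(idx, n, sizeof(*idx), cmp_key);
}

/* Look up name by recomputing its key with fn; NULL if not found. */
static const struct toy_entry *lookup(const struct toy_entry *idx, size_t n,
                                      const char *name, hash_fn fn)
{
    struct toy_entry probe = { fn(name), name };
    const struct toy_entry *hit =
        bsearch(&probe, idx, n, sizeof(*idx), cmp_key);
    return (hit != NULL && strcmp(hit->name, name) == 0) ? hit : NULL;
}
```

If the index is built with hash_prefix and later queried with hash_suffix, lookup() returns NULL for a file that is actually present. Any tunable of this kind would therefore have to be fixed when the filesystem is created, or be recorded on disk alongside the keys.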


