ext2 performance with large directories

Post by jtnew » Sat, 02 Jun 2001 23:03:41



Anyone know how ext2 performance degrades
with large directories?

I'm developing a software program
which stores 100,000 or more files in
one single directory.

Is directory entry lookup based on some
hashing type of scheme? Or is it a linear
lookup?

 
 
 

Post by el.. » Sun, 03 Jun 2001 01:50:43




>Anyone know how ext2 performance degrades
>with large directories?

>I'm developing a software program
>which stores 100,000 or more files in
>one single directory.

>Is directory entry lookup based on some
>hashing type of scheme? Or is it a linear
>lookup?

It's a linear lookup, i.e. it gets very slow.

--
http://www.spinics.net/linux/

 
 
 

Post by jurriaan kalkm » Sun, 03 Jun 2001 03:41:16






>>Anyone know how ext2 performance degrades
>>with large directories?

>>I'm developing a software program
>>which stores 100,000 or more files in
>>one single directory.

>>Is directory entry lookup based on some
>>hashing type of scheme? Or is it a linear
>>lookup?

> It's a linear lookup, i.e. it gets very slow.

There are patches floating around the linux-kernel mailing list to deal
with this problem, but the general consensus seems to be that such programs
are shitty and should be avoided. ReiserFS may do better, BTW.

Good luck,
Jurriaan
--
In that case, I shall prepare my Turnip Surprise.
And the surprise is?
There's nothing else in it except turnip.
        Baldrick on Haute Cuisine      
GNU/Linux 2.4.5-ac4 SMP/ReiserFS 2x1402 bogomips load av: 0.01 0.01 0.00

 
 
 

Post by Patrick Draper/Austin/Sector 7 USA, Inc » Sun, 03 Jun 2001 04:02:06


A typical solution is to break up the directory into a large tree. The
directories are named according to the files contained inside them.

example:

All files with names starting with 'a' go into /a. Same with other
letters. If that doesn't break it up enough, start going with two
letters, or three letters, in a tree arrangement.

/a
/a/aa
/a/ab
/a/ac
/a/ad
/b
/b/ba
/b/bb

and so on. The reason for the tree is to make sure that no directory
has too many files. Your program would then look at the filename to get
the right path to the file it's looking for. A good way to do it is to
write a function that, given a filename, returns the path where you
would expect to find the file.

To see another example of this, take a look at your terminfo database,
which on my Debian system is in /usr/share/terminfo. I have 2139 files
in that hierarchy, which was enough to warrant splitting it into a tree.
You should definitely do this if you have 100,000 files.
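
A minimal sketch of the kind of function described above, in Python. The
name shard_path, the depth parameter, and the "_" padding for very short
names are all assumptions made for illustration; nothing here comes from
a real library.

```python
import os


def shard_path(root, filename, depth=2):
    """Return the path where `filename` is expected to live under `root`.

    Hypothetical helper: level 1 uses the first letter, level 2 the
    first two letters, and so on, matching the /a/ab layout above.
    """
    parts = []
    for level in range(1, depth + 1):
        prefix = filename[:level]
        # Assumption: pad names shorter than `level` with "_" so they
        # still map to a predictable directory.
        parts.append(prefix.ljust(level, "_"))
    return os.path.join(root, *parts, filename)


def ensure_parent(path):
    """Create the directory that will hold `path` if it is missing."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
```

On Linux, shard_path("/data", "abc.txt") gives /data/a/ab/abc.txt. Splitting
on name prefixes can be lopsided if many names start the same way; taking
the prefix from a hash of the name instead is a common variation.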

 
 
 

Post by Anonymou » Sun, 03 Jun 2001 12:08:57



> It's a linear lookup, i.e. it gets very slow.

One thing I noticed is that even for as many
as 60,000 directory entries, the performance
isn't all that bad.

I wonder why?


 
 
 

Post by Linus Torval » Tue, 05 Jun 2001 14:28:58





>> It's a linear lookup, i.e. it gets very slow.

>One thing I noticed is that even for as many
>as 60,000 directory entries, the performance
>isn't all that bad.

>I wonder why?

Depending on your access patterns, the directory cache will kick in, and
do most of the real work.

And the dcache uses a pretty efficient hashing mechanism, regardless of
what the underlying filesystem is doing.

But you should realize that the dcache is nothing but a cache, and while
very good for most normal loads you can still get into * performance
behaviour by having the "wrong" access patterns.

                Linus
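
The effect of access patterns can be illustrated with a toy model in
Python. This is not how the kernel's dcache is implemented; it only
assumes a linearly scanned directory with a small least-recently-used
cache in front of it. ToyDirectory, scan_cost, and the sizes used are
all made up for the sketch.

```python
from collections import OrderedDict


class ToyDirectory:
    """Linear-scan directory with a bounded LRU name cache (toy model)."""

    def __init__(self, names, cache_size):
        self.entries = list(names)   # scanned front to back on a miss
        self.cache = OrderedDict()   # name -> index, oldest first
        self.cache_size = cache_size
        self.compared = 0            # entries examined by linear scans

    def lookup(self, name):
        if name in self.cache:
            # Cache hit: no directory scan needed.
            self.cache.move_to_end(name)
            return self.cache[name]
        for index, entry in enumerate(self.entries):
            self.compared += 1
            if entry == name:
                self.cache[name] = index
                if len(self.cache) > self.cache_size:
                    self.cache.popitem(last=False)  # evict the oldest
                return index
        return None


def scan_cost(access_pattern, n_files=1000, cache_size=100):
    """Count entries scanned while looking up files in the given order."""
    directory = ToyDirectory((f"f{i}" for i in range(n_files)), cache_size)
    for i in access_pattern:
        directory.lookup(f"f{i}")
    return directory.compared


hot = [i % 50 for i in range(10000)]      # keeps hitting the same 50 files
sweep = [i % 1000 for i in range(10000)]  # cycles through all 1000 files
print(scan_cost(hot), scan_cost(sweep))
```

With the same number of lookups, the "hot" pattern only pays for the
first touch of each file, while the "sweep" pattern evicts every entry
before it is needed again and pays for a linear scan every time.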

 
 
 

Post by el.. » Wed, 06 Jun 2001 10:58:14




>Depending on your access patterns, the directory cache will kick in, and
>do most of the real work.

>And the dcache uses a pretty efficient hashing mechanism, regardless of
>what the underlying filesystem is doing.

>But you should realize that the dcache is nothing but a cache, and while
>very good for most normal loads you can still get into * performance
>behaviour by having the "wrong" access patterns.

IIRC INN was great at having the "wrong" patterns. But storing news as
one file per article is probably one of the worst things you can do.
 
 
 

Post by cLIeNUX us » Wed, 06 Jun 2001 13:03:23






>>> It's a linear lookup, i.e. it gets very slow.

>>One thing I noticed is that even for as many
>>as 60,000 directory entries, the performance
>>isn't all that bad.

>>I wonder why?

>Depending on your access patterns, the directory cache will kick in, and
>do most of the real work.

>And the dcache uses a pretty efficient hashing mechanism, regardless of
>what the underlying filesystem is doing.

>But you should realize that the dcache is nothing but a cache, and while
>very good for most normal loads you can still get into * performance
>behaviour by having the "wrong" access patterns.

>            Linus

How was Japan?

Rick Hohensee
301-595-4063