performance degradation in large directories

Post by Thaddeus P. Flory » Wed, 06 May 1992 16:31:19




>[   This is a reposting.  An earlier post to comp.unix.questions
>    and comp.unix.ultrix yielded no results after a week.   ]

>The unix folklore is that "large directories" (that is, directories
>with more than a few hundred files) pay a horrible performance penalty.
>[...]

If you examine Figure 4.6 on page 69 of the Bach book ["The Design of the
UNIX Operating System", Maurice Bach, Prentice-Hall, ISBN 0-13-201799-7], it
should be evident why large files exact a performance penalty:

        after 10 blocks, one indirect block is needed to access further data;
        this is the single indirect block

        then a still-larger file requires double-indirection.

        and the largest files require triple-indirection.

For those files with triple indirection, you may need to read 4 "related"
disk blocks before you have the pointer to the DATA block you're seeking
(which requires yet another disk read to fetch).
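
To make that concrete, here is a minimal sketch (mine, not part of the
original post) of how many indirect-block reads are needed to reach a given
file offset, assuming the textbook parameters of 1K blocks, 4-byte block
pointers, and 10 direct slots in the inode:

/*
 * Sketch: how many levels of indirection the classic Unix inode scheme
 * needs to reach a given byte offset.  Assumes 1K blocks, 4-byte block
 * pointers (256 per indirect block), and 10 direct block slots, as in
 * Bach's figure; real filesystems vary.
 */
#include <stdio.h>

#define BLKSZ   1024L
#define NDIR    10L                 /* direct block pointers in the inode */
#define NIND    (BLKSZ / 4)         /* pointers per indirect block = 256  */

static int levels(long offset)
{
    long blk = offset / BLKSZ;      /* logical block number within file */

    if (blk < NDIR)
        return 0;                   /* direct block                     */
    blk -= NDIR;
    if (blk < NIND)
        return 1;                   /* single indirect                  */
    blk -= NIND;
    if (blk < NIND * NIND)
        return 2;                   /* double indirect                  */
    return 3;                       /* triple indirect                  */
}

int main(void)
{
    long offsets[] = { 0L, 10L * BLKSZ, 300L * BLKSZ, 70000L * BLKSZ };
    for (int i = 0; i < 4; i++)
        printf("offset %8ld KB -> %d indirect block read(s)\n",
               offsets[i] / 1024, levels(offsets[i]));
    return 0;
}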


 
 
 

performance degradation in large directories

Post by David Potter » Thu, 07 May 1992 00:13:14


Wayne describes a file system where reading the directory seems to be
independent of the directory size; that is, opening a file for reading is
not a function of the directory size.  He then states that he ran tests
indicating that creating new files, i.e., creating a new entry in the
directory, is O(N squared), where N is the size of the directory.

I would not expect this kind of N squared behavior to be due to the
indirect blocks.  That would appear as a step function when the
directory got big enough to require indirect blocks.  I would still
expect to find an N squared algorithm in file create that does not occur
when opening a file for read.  I have not looked at the source code
for directory search or directory insert, so I still consider this
an open question.
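
For anyone who wants to reproduce that kind of test, here is a rough sketch
(not the original poster's program) that creates a few thousand empty files
in one directory and prints cumulative elapsed time per batch; if per-create
cost grows with directory size, later batches take visibly longer.  The
scratch directory name "bigdir" is invented for the example and must exist
beforehand:

/*
 * Rough timing sketch: create TOTAL empty files in one directory and
 * report cumulative wall-clock time every BATCH files, to see whether
 * per-create cost grows with directory size.  Assumes an empty scratch
 * directory "./bigdir" already exists.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <time.h>

#define TOTAL  5000
#define BATCH  1000

int main(void)
{
    char name[64];
    time_t start = time(NULL);

    for (int i = 1; i <= TOTAL; i++) {
        snprintf(name, sizeof name, "bigdir/f%05d", i);
        int fd = open(name, O_CREAT | O_WRONLY | O_EXCL, 0644);
        if (fd < 0) {
            perror(name);
            return 1;
        }
        close(fd);
        if (i % BATCH == 0)
            printf("%5d files: %ld seconds elapsed\n",
                   i, (long)(time(NULL) - start));
    }
    return 0;
}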

--
*                tuktusiuriagatigitqingnapin'ngitkiptin'nga, David Potter

 
 
 

performance degradation in large directories

Post by Gerrit Huizen » Thu, 07 May 1992 05:47:36



>>The unix folklore is that "large directories" (that is, directories
>>with more than a few hundred files) pay a horrible performance penalty.
>>[...]

>If you examine Figure 4.6 on page 69 of the Bach book ["The Design of the
>UNIX Operating System", Maurice Bach, Prentice-Hall, ISBN 0-13-201799-7], it
>should be evident why large files exact a performance penalty:
>    after 10 blocks, one indirect block is needed to access further data;
>    this is the single indirect block
>    then a still-larger file requires double-indirection.
>    and the largest files require triple-indirection.
>For those files with triple indirection, you may need to read 4 "related"
>disk blocks before you have the pointer to the DATA block you're seeking
>(which requires yet another disk read to fetch).

The BSD-derived Fast File System code used to contain a comment saying that
triple indirect blocks had never been tested - you need (I forget the exact
numbers and am too lazy to go add it up again) a file of over a
gigabyte before you use triple indirect blocks.  Because *lots* of
directory entries will fit in 10 blocks and thousands will fit in a single
indirect block, it is unlikely that this is the real problem.

When doing a directory lookup, the vnode/inode for the directory is
locked while searching or updating the vnode/inode; hence multiple
activities in the same directory are often serialized (only updates
actually have to be serialized, but accesses usually are as well).
Also, on most/all systems which use the BSD Fast File System, there is
a buffer cache and very few of these activities actually hit the disk
(I seem to remember an inode flush on updates which would mean a disk
write for every update).

The activity that you are doing with the file(s) in the directory will
determine the general performance.  A simple scandir() of the directory,
without sorting, should degrade only linearly.  A stat() of every file
in the directory is likely to cause a lot of disk activity, especially
if you are filling the buffer cache and making worst case (or even just
"bad" case) use of the buffer cache.

As with any other performance-related issue, you have to evaluate your own
usage pattern and consider tuning your machine for that style of use.

gerrit

 
 
 

performance degradation in large directories

Post by John F Haugh » Fri, 08 May 1992 08:38:34



>For those files with triple indirection, you may need to read 4 "related"
>disk blocks before you have the pointer to the DATA block you're seeking
>(which requires yet another disk read to fetch).

This isn't much of an issue, since the indirect block will be buffered while
the lookup is taking place.  Also, for a 10-block directory you will have
640 entries (assuming 16-byte 7th Edition style entries and a 1K-block
filesystem).  Getting through a single indirect block would put 17,024 total
entries in that directory.  I won't even do the double indirect arithmetic,
since it should be obvious right about now that you will never have a
directory big enough to need double indirect blocks.
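
For reference, the arithmetic behind those figures works out as in the short
program below (64 entries per 1K block, 256 block pointers per indirect
block), including the double indirect case skipped above:

/*
 * Worked arithmetic: 7th Edition style 16-byte directory entries, a 1K
 * filesystem block, and 4-byte block pointers, so 64 entries fit per
 * directory block and 256 pointers fit per indirect block.
 */
#include <stdio.h>

int main(void)
{
    long blksz = 1024, entsz = 16, ptrsz = 4;
    long per_blk = blksz / entsz;       /* 64 entries per directory block       */
    long per_ind = blksz / ptrsz;       /* 256 block pointers per indirect block */

    long direct = 10 * per_blk;                         /* 640       */
    long single = direct + per_ind * per_blk;           /* 17,024    */
    long dbl    = single + per_ind * per_ind * per_blk; /* 4,211,328 */

    printf("direct blocks only:      %8ld entries\n", direct);
    printf("+ single indirect block: %8ld entries\n", single);
    printf("+ double indirect block: %8ld entries\n", dbl);
    return 0;
}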

I looked at this a while back and discovered that moving directory names
to the beginning of a directory gives a fair performance improvement.  I
wrote a tool that stat()'s all the files in a directory and moves directories
to the front, followed by regular files in access time sorted order.
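
A sketch in the spirit of that tool (not the original) follows: it stat()s
every entry and prints the order the entries would ideally appear in,
subdirectories first and then regular files by access time.  Actually
rewriting the directory in that order is left out, and the sort direction
for access time is an assumption, since the post does not say:

/*
 * Sketch only: stat() every entry in a directory and print the order the
 * entries would ideally appear in - subdirectories first, then regular
 * files by access time (most recently used first is an assumption).
 * Rewriting the directory in that order is not attempted here.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <dirent.h>
#include <sys/stat.h>

struct entry {
    char name[256];
    int is_dir;
    time_t atime;
};

static int cmp(const void *a, const void *b)
{
    const struct entry *x = a, *y = b;

    if (x->is_dir != y->is_dir)
        return y->is_dir - x->is_dir;            /* directories first */
    if (x->atime != y->atime)
        return x->atime < y->atime ? 1 : -1;     /* newer atime first */
    return strcmp(x->name, y->name);
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s directory\n", argv[0]);
        return 1;
    }

    DIR *d = opendir(argv[1]);
    if (d == NULL) {
        perror(argv[1]);
        return 1;
    }

    struct entry *list = NULL;
    size_t n = 0, cap = 0;
    struct dirent *de;
    struct stat st;
    char path[4096];

    while ((de = readdir(d)) != NULL) {
        snprintf(path, sizeof path, "%s/%s", argv[1], de->d_name);
        if (stat(path, &st) < 0)
            continue;                            /* skip unreadable entries */
        if (n == cap) {
            cap = cap ? cap * 2 : 64;
            struct entry *tmp = realloc(list, cap * sizeof *list);
            if (tmp == NULL) {
                perror("realloc");
                return 1;
            }
            list = tmp;
        }
        snprintf(list[n].name, sizeof list[n].name, "%s", de->d_name);
        list[n].is_dir = S_ISDIR(st.st_mode);
        list[n].atime = st.st_atime;
        n++;
    }
    closedir(d);

    qsort(list, n, sizeof *list, cmp);
    for (size_t i = 0; i < n; i++)
        printf("%s%s\n", list[i].name, list[i].is_dir ? "/" : "");

    free(list);
    return 0;
}
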
--
John F. Haugh II      | VNET: LCCB386 at AUSVMQ | "Physical Fitness is the
SneakerNet: 042/2F068 | MaBell: (512) 823-1078  |  basis for all other forms

 
 
 

performance degradation in large directories

Post by Richard M. Mathews

In most Unix file systems I have seen, the directory search is purely a
linear search.  CPU time for opening an existing file goes linearly with
the position of the file name in the directory.  CPU time for failing an
open of a non-existent file or for creating a new file name goes linearly
with the size of the directory file (as measured by number of files plus
number of holes -- which is proportional to the total directory file size
on old-fashioned (V7 or SysV) file systems).  I/O for each of these also
tends to go linearly with the same parameters, but successive use of
the same directory will be more I/O efficient as the pages will be in
the buffer cache.  Some systems have directory caches, and you can do
better than the linear search when accessing the same file repeatedly.
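
For readers who have never seen the old directory format, the sketch below
shows the flavor of that linear search: 7th Edition directories are just
arrays of 16-byte (inode number, 14-character name) records, a successful
lookup stops at the entry's position, and a failed lookup (the first step of
a create) scans the whole array:

/*
 * Minimal model of the old-style linear directory search: a 7th Edition
 * directory is an array of (d_ino, d_name[14]) records, and a lookup
 * walks it front to back.  A hit stops at the entry's position; a miss
 * scans every slot.
 */
#include <stdio.h>
#include <string.h>

#define DIRSIZ 14

struct v7_direct {
    unsigned short d_ino;       /* 0 means a free (deleted) slot */
    char d_name[DIRSIZ];
};

/* Return the inode number for `name`, or 0 if it is not present. */
static unsigned short dir_lookup(const struct v7_direct *dp, int nent,
                                 const char *name, int *scanned)
{
    for (int i = 0; i < nent; i++) {
        (*scanned)++;
        if (dp[i].d_ino != 0 &&
            strncmp(dp[i].d_name, name, DIRSIZ) == 0)
            return dp[i].d_ino;
    }
    return 0;
}

int main(void)
{
    struct v7_direct dir[] = {
        { 2, "." }, { 2, ".." }, { 10, "early" }, { 0, "" }, { 11, "late" },
    };
    int nent = sizeof dir / sizeof dir[0];
    int scanned;

    scanned = 0;
    dir_lookup(dir, nent, "early", &scanned);
    printf("hit on \"early\" after scanning %d entries\n", scanned);

    scanned = 0;
    dir_lookup(dir, nent, "missing", &scanned);
    printf("miss on \"missing\" after scanning %d entries\n", scanned);
    return 0;
}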

Any benchmarker should consider whether the intent is to model a load
which repeatedly accesses a small number of files in a small number of
directories or a large number of files in a large number of directories
or whatever.  Size of the buffer cache (and directory cache, if any) can
be important tuning parameters, and modeling a realistic amount of other
system activity which affects these caches may also be important.

Richard M. Mathews                      Lietuva laisva = Free Lithuania

UUCP:       ...!uunet!lcc!richard       Eesti vabaks   = Free Estonia
