performance degradation in large directories

Post by Richard M. Mathe » Wed, 13 May 1992 10:36:56

>The unix folklore is that "large directories" (that is, directories
>with more than a few hundred files) pay a horrible performance penalty.

In most Unix file systems I have seen, the directory search is purely a
linear search.  CPU time for opening an existing file goes linearly with
the position of the file name in the directory.  CPU time for failing an
open of a non-existent file or for creating a new file name goes linearly
with the size of the directory file (as measured by number of files plus
number of holes -- which is proportional to the total directory file size
on old-fashioned (V7 or SysV) file systems).  I/O for each of these also
tends to go linearly with the same parameters, but successive use of
the same directory will be more I/O efficient as the pages will be in
the buffer cache.  Some systems have directory caches, and you can do
better than the linear search when accessing the same file repeatedly.
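The cost structure described above can be sketched with a toy model (purely illustrative; the slot layout and costs are assumptions modeled on the V7/SysV scheme, not taken from any real kernel):

```python
# Toy model of a V7/SysV-style directory: a flat list of slots, each
# either a (name, inode-number) entry or None for a hole left behind
# by an earlier unlink.

def lookup(directory, name):
    """Linear scan.  Cost is the position of the entry on a hit, or
    the full slot count on a miss, as described above."""
    for cost, slot in enumerate(directory, start=1):
        if slot is not None and slot[0] == name:
            return slot[1], cost            # inode number, slots scanned
    return None, len(directory)             # miss: scanned everything

def create(directory, name, inum):
    """Creating a name must first scan the whole directory to prove
    the name is absent, then reuse the first hole (or append)."""
    _, cost = lookup(directory, name)
    for i, slot in enumerate(directory):
        if slot is None:
            directory[i] = (name, inum)
            return cost
    directory.append((name, inum))
    return cost

d = [("f%04d" % i, i) for i in range(1000)]
_, hit_cost  = lookup(d, "f0009")     # entry near the front: cheap
_, miss_cost = lookup(d, "nosuch")    # miss: scans all 1000 slots
```

Note how a successful open pays for the entry's position, while a failed open or a create pays for the whole directory, holes included.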

Any benchmarker should consider whether the intent is to model a load
which repeatedly accesses a small number of files in a small number of
directories or a large number of files in a large number of directories
or whatever.  Size of the buffer cache (and directory cache, if any) can
be important tuning parameters, and modeling a realistic amount of other
system activity which affects these caches may also be important.

Richard M. Mathews                      Lietuva laisva = Free Lithuania

UUCP:       ...!uunet!lcc!richard       Eesti vabaks   = Free Estonia


1. performance degradation in large directories

If you examine Figure 4.6 on page 69 of the Bach book ["The Design of the
UNIX Operating System", Maurice Bach, Prentice-Hall, ISBN 0-13-201799-7], it
should be evident why large files (and a directory is just a file to the
file system) exact a performance penalty:

        after 10 blocks, one indirect block is needed to access further data;
        this is the single indirect block

        then a still-larger file requires double-indirection.

        and the largest files require triple-indirection.

For those files with triple indirection, you may need to read 4 "related"
disk blocks (the inode's block plus the triple, double, and single indirect
blocks) before you have the pointer to the DATA block you're seeking
(which requires yet another disk read to fetch).
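The arithmetic behind that lookup chain can be sketched as follows (assuming the classic layout of Bach's Figure 4.6 with 1 KB blocks and 4-byte block pointers, i.e. 256 pointers per indirect block; these sizes are assumptions, and the inode read is ignored here):

```python
# Rough read-count arithmetic for a V7-style inode: 10 direct block
# pointers, then single, double, and triple indirect blocks.
BLOCK  = 1024                 # assumed block size in bytes
NPTR   = BLOCK // 4           # 256 block pointers per indirect block
DIRECT = 10                   # direct pointers held in the inode itself

def blocks_to_read(offset):
    """Disk reads (indirect blocks plus the data block itself) needed
    to reach the byte at `offset`, ignoring the buffer cache."""
    blkno = offset // BLOCK
    if blkno < DIRECT:
        return 1                          # data block only
    blkno -= DIRECT
    if blkno < NPTR:
        return 2                          # single indirect + data
    blkno -= NPTR
    if blkno < NPTR * NPTR:
        return 3                          # double + single + data
    return 4                              # triple + double + single + data
```

So the first 10 KB of a file costs one read per block, while a byte past the double-indirect region costs four reads: each extra level of indirection adds one more disk access before the data is in hand.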
