Mac "find" performs far better than Unix O(n^2) "find"

Mac "find" performs far better than Unix O(n^2) "find"

> The only way a Mac could run faster would be if they had a better algorithm to
> perform the search with. Since (as we all know) a linear search of elements is
> O(n), and this is probably the worst search anyone would implement (unless they
> were crazy _and_ stupid).

Well, I hate to break it to you, but the Unix find is O(n^2) on the number
of files.

Roughly speaking (without having seen the source code in years), here is
how it works:

For every directory:

cd into that directory (efficiency hack, explained later)
Get a list of files in the current directory.
stat() each file (you have to do this for every file even if you are only
printing out file names, since you need to know which of those files are
directories)
Apply command line criteria to the file
If it is a directory, apply this algorithm recursively
repeat on all the files in that directory
cd out of that directory

Now, the above appears to be an O(n) algorithm, doesn't it?  However, if
we analyze what stat(), a fairly expensive system call, actually does:

Switch into kernel context (swap page tables, etc.; not cheap)
Resolve the path to get the inode and the stat data.  How expensive is
this?  Well...

Since Unix directories are unstructured, you have to do a linear search
through the directory to find a matching filename.  This is O(n) *for each
and every component of the path name resolution*.  (This is the reason
that you cd into the directory; you only want to have to resolve one
component).  Since this step is done on each and every file, the whole
algorithm is O(n) * O(n) == O(n^2).  Even in the typical case, it is still
O(n^2), since every file has to be looked at (stat()ing the 1st, 2nd, ...,
nth file in a directory costs 1 + 2 + 3 + ... + n = n(n+1)/2 =
(n^2 + n)/2 == O(n^2) name comparisons in total).

To finish stat() off:
copy the data to user space
switch back to the user's context (swap page tables, etc.; not cheap)

To summarize:  the Unix find, which does O(n) expensive system calls
(stat(), which is itself O(n)), is actually an O(n^2) algorithm, in both
the worst case and the typical case.

On the other hand, the heart of the Mac find command is the system call
PBGetCatInfo(), which is an iterator.  You provide it with the dirId and
an index (iterating on the index), and it returns the same family of info
that stat does for every file in a given directory.  If I remember
correctly (does someone have the old Inside Mac Volume 4, or the new IM:
Files handy?), all the offspring for a given directory are stored
contiguously, making most of the PBGetCatInfo() calls O(1) (constant
time).  Finding a directory involves looking through a B*-tree, which is
O(log n) time.  In the very worst case (no files, only directories), the
Mac find takes O(n) * O(log n) * O(1) == O(n log n).  In the typical case
(an order of magnitude or more files than directories), the performance is
much closer to O(n).

To summarize:  the Mac find is O(n log n) in the worst case, and much
closer to O(n) in the typical case.

Quote:>To be more efficient, a find utility would have to
> have a pre-existing data structure that enabled faster searches (Binary Search
> Tree, Binomial Heap etc...), but since you are usually looking for substrings
> (like all the .h files or something) these implementations may have mixed
> results (fast on some types of searches, slow on others)

Of course, Unix doesn't impose *any* kind of structure on its
directories.  I can't think of any case where this is better than having a
preexisting data structure, with the single exception that having no
structure makes life pretty simple for those few people in the world who
have to implement the filesystem (a lot of "design decisions" of Unix can
be attributed to ease of implementation).  Since the other two types of
operations performed on directories are creating a file and deleting a
file, let's analyze them:

file creation, no structure (Unix):  O(n) on the size of the directory,
since the whole directory has to be searched to see if there is a
duplicate file name being created.

file creation, tree structure (eg: AVL trees):  O(log n) on the lookup,
and O(log n) on the insertion, totalling O(log n)

file deletion, no structure (Unix):  O(n) (you may have to look at the
whole directory before you find the correct file).

file deletion, tree structure (eg: AVL trees):  O(log n).

Unix:  the system to use when you have lots and lots of spare CPU cycles
you wish to go to waste.
--

office: (520) 621-1815

Mac "find" performs far better than Unix O(n^2) "find"

Quote:>cd into that directory (efficiency hack, explained later)
>Get a list of files in the current directory.
>stat() each file (you have to do this for every file even if you are only
>printing out file names, since you need to know which of those files are
>directories)
>Apply command line criteria to the file
>If it is a directory, apply this algorithm recursively
>repeat on all the files in that directory
>cd out of that directory

Efficiency hack #1 - don't stat all files in the directory if it has only
two links.  A directory's link count is 2 plus its number of
subdirectories, so a count of 2 means there is nothing below to descend
into, and no need to stat entries just to find the directories.

Quote:>Now, the above appears to be an O(n) algorithm, doesn't it?  However, if
>we analyze what stat(), a fairly expensive system call, actually does:
>Switch into kernel context (swap page tables, etc.; not cheap)

Not true: by default Unix maps the kernel pages into the top of each
process's address space, unreadable from user mode.  So a system call
doesn't swap page tables; there's not much context swapping going on.

Quote:>Since Unix directories are unstructured, you have to do a linear search
>through the directory to find a matching filename.

Efficiency hack #2: name lookups in a directory will often search from the
last name found, precisely because linear scans are so common (ls and
other programs also go through the directory linearly).

The slow speed in Unix can possibly be explained by the separation
of directory data and inodes.  Each directory access requires separate
disk accesses to fetch the inodes of the files in that directory.

Quote:>Unix:  the system to use when you have lots and lots of spare CPU cycles
>you wish to go to waste.

Except that disks are typically so much slower than your typical CPU
that it doesn't matter much for typical directories.

But hashed directories are something that needs looking into.
(And since NFS did away with reading directories, and the BSD ffs
changed the format once already, old Unix programs that use
open/read on directories should be gone by now.)

Casper
--
Expressed in this posting are my opinions.  They are in no way related
to opinions held by my employer, Sun Microsystems.

Mac "find" performs far better than Unix O(n^2) "find"

Quote:>Roughly speaking (without having seen the source code in years), here is
>how it works:

>For every directory:

>cd into that directory (efficiency hack, explained later)
>Get a list of files in the current directory.
>stat() each file (you have to do this for every file even if you are only
>printing out file names, since you need to know which of those files are
>directories)

An interesting aside: the folks who developed AFS recognized how
expensive an operation stat() is, and how bogus it was to need to
perform that operation every time you want to test a file for
directory-ness.  So, they came up with a solution.  Give ordinary
files odd-numbered inodes, and directories even-numbered inodes.  They
also implemented a "ws" command (walk subtree), that's sorta like
"find" except that it'll take advantage of this hack, given the
opportunity.

Is it worth considering putting this hack into e2fs and gnu find at
some point?

Quote:>Since Unix directories are unstructured,

(for some filesystems; remember, some Unixes can use HFS for example)

Quote:>you have to do a linear search
>through the directory to find a matching filename.  This is O(n) *for each
>and every component of the path name resolution*.

Ah, but this is a different n.  First, your n was the number of files
in the subtree.  *This* n is the number of files in the *directory*,
no?  So it's really O(n*m), where n is the number of files in the
subtree, and m is something akin to an average number of files per
directory.  Admittedly, in the worst case (one big directory), m=n.
Furthermore, in *some* cases, m is in general proportional to n (when
the number of directories is fairly constant over number of files,
like for example in a news spool), and in that case we'll have O(n^2)
too.  But are those the right assumptions to be making?

In other cases the maximum directory size is what'll be held
"relatively constant", and new files will end up in new subdirectories
(that's the way my own document folders work -- first a ~/doc
directory, then if it's getting full a ~/doc/school and ~/doc/work
division, then if school is getting too full ~/doc/school/cs1501 and
~/doc/school/engcmp0800 for example...).  In *those* cases, it'll be
more like O(n).

(Anyhow, this discussion is helping me see the need for object
oriented filesystems -- have multiple types of containers, all
subclasses of some generic "directory" class, and having different
access methods.  Match the particular implementation to the uses a
particular directory is going to see; news spool and /tmp have very
different usage patterns, and using the exact same structures and
algorithms for both is potentially highly bogus.  I'm sure there are
pathological cases where HFS performance breaks down -- being tied to
one data structure for all your needs is just bogus.  Right now, we
Unix folks solve this problem by using totally different filesystems
within one unified namespace, but that's not 100% satisfactory.  Might
be something to play with, when my TODO list shrinks a zillion orders
of magnitude.)

Quote:>On the other hand, the heart of the Mac find command is the system call
>PBGetCatInfo(), which is an iterator.

Hm.  Worth playing with the idea of a "statv()" system call, that does
a simultaneous stat of all the filenames in a vector?  Naaah, in the
general case the performance gain wouldn't be enough to justify it,
I'd think.  But improving the data structures used in the underlying
filesystem, on the other hand...

Quote:>Of course, Unix doesn't impose *any* kind of structure on its
>directories.

Again, you're being too general.  There are many, many different Unix
filesystems, and they're each free to do what they want in this
regard.  If you'd said "Berkeley FFS and the standard SysV
filesystem", my objection wouldn't be valid.  But, there are folks
running Unix systems that use HFS, FAT and various extended FAT
filesystems (VFAT, UMSDOS), experimental object-oriented filesystems,
and some hyper-performance-tuned custom filesystems like the ones AFS
uses on its server machines.

But I admit that's just a pedantic nit on my part -- 99.999% of the
Unix filesystems in actual use in the world probably *do* suck big fat
hairy razor-studded rocks.  Through a garden hose.

Discussion on this topic is a very good thing -- convince the Unix
community at large that the filesystem is an inefficient piece of
crap, and a new one will eventually arise that squashes the particular
objections that've been raised.  The standard Linux filesystem (e2fs)
already has an attempt at some big performance improvements over FFS.

Of course, we free Unix users (Linux, FreeBSD, NetBSD, HURD) will see
the benefits much more quickly than the commercial Unix world.  This
is only just.  Anyone else with an OS that can dynamically add
filesystems (AmigaOS?  Plan9?) should also be able to reap the
benefits as well.

Come to think of it, I know the Mac can dynamically add filesystems
too -- aren't ISO9660 (cdrom) and FAT (dos) filesystems readable, via
extensions or inits or whatever the hell they're called?

(Maybe this would be a good thesis topic for that triple CS/IS/LS
grad degree I've been contemplating...)
--

Mac "find" performs far better than Unix O(n^2) "find"

[...]
Quote:>directory-ness.  So, they came up with a solution.  Give ordinary
>files odd-numbered inodes, and directories even-numbered inodes.  They

[...]

Someone pointed out to me that I got it backwards -- ordinary files
have even-numbered inodes, and directories get odd numbers.  Mea
culpa.  Shoulda done a quick "ls -i" before posting rather than just
trusting to my memory.

--

Mac "find" performs far better than Unix O(n^2) "find"

Quote:> stat() each file (you have to do this for every file even if you are only
> printing out file names, since you need to know which of those files are
> directories)
> Since Unix directories are unstructured, you have to do a linear search
> through the directory to find a matching filename.  This is O(n) *for each
> and every component of the path name resolution*.

No, it's O(m), where m is the number of files per directory.  As you say,
you cd-ed into the directory to avoid running namei, which is O(m*o), where
o is the depth of the deepest directory tree in the system.

So on each directory you have an O(m^2) search.

So find is O(nm^2), and m is not necessarily proportional to n. In practice,
m is proportional to the size of your newsfeed, or a constant.

Quote:> Unix:  the system to use when you have lots and lots of spare CPU cycles
> you wish to go to waste.

Meanwhile, in practice, UFS is as fast as any file system I've used.

The only system I've used that tried a different algorithm for directories
is AmigaDOS, which builds trees of hash tables. It's a total loss for
generating a list of files, since it has to seek all over the place to
traverse the tree. It probably speeds up exec() some, but "ls" is really
slow.
--
Peter da Silva    (NIC: PJD2)                             `-_-'
Network Management Technology Incorporated                 'U`
1601 Industrial Blvd.     Sugar Land, TX  77478  USA
+1 713 274 5180                                "Har du kramat din varg idag?"

Mac "find" performs far better than Unix O(n^2) "find"

[technical unix filesystems o(n^2) discussion snipped]

Quote:> Of course, we free Unix users (Linux, FreeBSD, NetBSD, HURD) will see
> the benefits much more quickly than the commercial Unix world.  This
> is only just.  Anyone else with an OS that can dynamically add
> filesystems (AmigaOS?  Plan9?) should also be able to reap the
> benefits as well.

> Come to think of it, I know the Mac can dynamically add filesystems
> too -- aren't ISO9660 (cdrom) and FAT (dos) filesystems readable, via
> extensions or inits or whatever the hell they're called?

Yes, it can be done, but not easily (to the programmer).  Audio CDs,
ISO9660, ProDOS, FAT and extended FAT can all be read (& written
where applicable) on a current Mac system transparently (to the user).

Copland will make this much easier (for the programmer), inherently
supporting multiple filesystems at the core, favoring no one
filesystem over another.  This should allow a wider array of filesystems
to become accessible to Mac users, with no trippy extensions needed.

Quote:> (Maybe this would be a good thesis topic for that triple CS/IS/LS
> grad degree I've been contemplating...)

Thar ya go!

--

Hi,

Does anyone know why

struct servent *serv;
serv=getservbyname("exec","tcp");

gives a warning of incompatible pointer type?

I also can't get rexec to function. It compiles ok....

Thanks

Kirk