> The only way a Mac could run faster would be if they had a better algorithm to
> proform the search with. Since (as we all know) a linear search of elements is
> O(n), and this is probly the worst search anyone would implement (unless they
> were crazy _and_ stupid).
Roughly speaking (without having seen the source code in years), here is
how it works:
For every directory:
cd into that directory (efficiency hack, explained later)
Get a list of files in the current directory.
stat() each file (you have to do this for every file even if you are only
printing out file names, since you need to know which of those files are
Apply command line criteria to the file
If it is a directory, apply this algorithm recursively
repeat on all the files in that directory
cd out of that directory
Now, the above appears to be an O(n) algorithm, doesn't it? However, if
we analyze what stat(), a fairly expensive system call, actually does:
Switch into kernel context (swap page tables, etc.; not cheap)
Resolve the path to get the inode and the stat data. How expensive is
Since Unix directories are unstructured, you have to do a linear search
through the directory to find a matching filename. This is O(n) *for each
and every component of the path name resolution*. (This is the reason
that you cd into the directory; you only want to have to resolve one
component). Since this step is done on each and every file, the whole
algorithm is O(n) * O(n) == O(n^2). Even in the typical case, it is still
O(n^2), since every file has to be looked at (you get 1 + 2 + 3 + ... + n
= n(n+1)/2 = (n^2 + n)/2 == O(n^2) for the cost of the stat of each file
in the directory).
To finish stat() off:
copy the data to user space
switch back to the user's context (swap page tables, etc.; not cheap)
To summarize: the Unix find, which does O(n) expensive system calls
(stat(), which is itself O(n)), is actually an O(n^2) algorithm, in both
the worst case and the typical case.
On the other hand, the heart of the Mac find command is the system call
PBGetCatInfo(), which is an iterator. You provide it with the dirId and
an index (iterating on the index), and it returns the same family of info
that stat does for every file in a given directory. If I remember
correctly (does someone have the old Inside Mac Volume 4, or the new IM:
Files handy?), all the offspring for a given directory are stored
contiguously, making most of the PBGetCatInfo() calls O(1) (constant
time). Finding a directory involves looking through a B*-tree, which is
O(log n) time. In the very worst case (no files, only directories), the
Mac find takes O(n) * O(log n) * O(1) == O(n log n). In the typical case
(an order of magnitude or more files than directories), the performance is
much closer to O(n).
To summarize: the Mac find, is O(n log n) in the worst case, and much
closer to O(n) in the typical case.
Of course, Unix doesn't impose *any* kind of structure on itsQuote:>To be more efficent, a find utility would have to
> have a pre-existant data structure that enabled faster searches (Binary Search
> Tree, Binomial Heap etc...), but since you are ususlly looking for substrings
> (like all the .h files or something) these implementations may have mixed
> results (fast on some types of searches, slow on others)
directories. I can't think of any case where this better than having a
preexisting data structure, with the single exception that having no
structure makes life pretty simple for those few people in the world who
have to implement the filesystem (a lot of "design decisions" of Unix can
be attributed to ease of implementation). Since the other two types of
operations performed on directories is creating a file and deleting a
file, let's analyse them:
file creation, no structure (Unix): O(n) on the size of the directory,
since the whole directory has to be searched to see if there is a
duplicate file name being created.
file creation, tree structure (eg: AVL trees): O(log n) on the lookup,
and O(log n) on the insertion, totalling O(log n)
file deletion, no structure (Unix): O(n) (you may have to look at the
whole directory before you find the correct file).
file deletion, tree structure (eg: AVL trees): O(log n).
Unix: the system to use when you have lots and lots of spare CPU cycles
you wish to go to waste.
office: (520) 621-1815