On Tue, 24 Jun 2003 22:58:49 GMT, Werner Cohn wrote:
> I am interested in using the USA Counties data set (new one expected later
> this year) and would like to perform a cluster analysis using substantially
> the whole data set. But this data set is very large. There are about 3000
> counties and perhaps something in the neighborhood of 2000 variables, making
> about 6 million observations.
a) That sounds like the task that someone in a science-fiction
story casually tosses off to his AI. Such a task mainly shows
that the author knows nothing about USING statistics and hasn't
thought about what he should be asking; or else, the author is
showing how VERY clever those future AIs actually are.
b) The default for cluster programs is to use the scaling that
each variable starts with. In practice, for that random
assortment of numbers, it is highly likely that 5 or 10
variables have huge standard deviations compared to all
the others. Therefore, you can drop everything except those
10 and achieve 98 or 99 percent agreement, I would
guess, with any other version of clustering that would be
based on the full 2000.
c) If you standardize the variables, then it is very likely
that 80% or more of them will mainly reflect the size of the
county, either in people or in area; that's the nature of
public statistics and the way that we gather them.
So you can take the raw area and the raw population count,
and form clusters out of those two, up to whatever number of
clusters you want -- and that will achieve very high agreement
with whatever someone else invents, starting from the same data.
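As a sketch of that two-variable idea, here is a bare-bones k-means
on (log population, log area). The county-like numbers are made up
(lognormal draws, since both quantities are heavily right-skewed in
real county data); the point is only the mechanics:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical county table: population and land area only
pop = rng.lognormal(mean=10, sigma=1.5, size=300)
area = rng.lognormal(mean=6, sigma=1.0, size=300)
X = np.column_stack([np.log(pop), np.log(area)])

def kmeans(X, k, iters=50, seed=0):
    # Minimal Lloyd's algorithm: assign to nearest center, recompute
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

labels, centers = kmeans(X, k=4)
print("cluster sizes:", np.bincount(labels, minlength=4))
```

Any reasonable method applied to these same two size variables will
land on much the same size-ordered groupings, which is the point (c)
is making.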
d) I think that the time required for full (agglomerative)
clustering increases with the square or cube of the N cases,
and roughly linearly with the p variables -- expensive either
way. Is there any excuse to use more than 30 variables?
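A back-of-envelope check on the cost, assuming the dominant step is
the N-by-N distance matrix: about N(N-1)/2 pairs at roughly p
operations each. With N = 3000 counties:

```python
# Rough operation count for the pairwise distance matrix
def dist_matrix_ops(n, p):
    return n * (n - 1) // 2 * p

full = dist_matrix_ops(3000, 2000)   # all 2000 variables
small = dist_matrix_ops(3000, 30)    # a curated subset of 30
print(full, small, full // small)    # ~9 billion ops vs. ~135 million
```

Cutting 2000 variables to 30 buys roughly a 66-fold saving on the
distance computation alone, before the agglomeration step.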
e) Find the FAQs by Warren Sarle. Part of what he has
on clustering is in my own stat-FAQ.
[ snip ]
"Taxes are the price we pay for civilization." Justice Holmes.