large data base

Post by Werner Cohn » Thu, 26 Jun 2003 07:58:49



I am interested in using the USA Counties data set (a new edition is expected
later this year) and would like to perform a cluster analysis on substantially
the whole data set.  But this data set is very large: there are about 3,000
counties and perhaps something in the neighborhood of 2,000 variables, making
about 6 million data values in all.

I don't have access to a mainframe.  What sort of hardware would handle this
conveniently?  I generally use a Mac.  I've heard of the new Mac G5 machine.
Would that do, do you think?

Also, would the SPSS version that is now available for Mac be appropriate
for this task ?


large data base

Post by Rich Ulrich » Fri, 27 Jun 2003 06:31:44


On Tue, 24 Jun 2003 22:58:49 GMT, Werner Cohn


> I am interested in using the USA Counties data set (a new edition is expected
> later this year) and would like to perform a cluster analysis on substantially
> the whole data set.  But this data set is very large: there are about 3,000
> counties and perhaps something in the neighborhood of 2,000 variables, making
> about 6 million data values in all.

Several comments.

a) That sounds like the kind of task that someone in a science fiction
story casually tosses off to his AI.  Such a task mainly shows that
the author knows nothing about USING statistics and hasn't thought
about what he should be asking; or else, the author is showing
how VERY clever those future AIs actually are.

b) The default for cluster programs is to use the scaling that
each variable starts with.  In practice, for that random
assortment of numbers, it is rather likely that 5 or 10
variables have huge standard deviations compared to all
the others.  Therefore, you could drop everything except those
10 and achieve 98 or 99 percent agreement, I would
guess, with any other version of clustering that would be
based on the full 2000.
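A quick sketch of that point (mine, with made-up numbers, not the poster's
data): on raw scales, one large-SD column accounts for essentially all of the
variance that a Euclidean distance sees.

```python
# Illustrative only: nine variables with SD ~ 1 plus one with SD ~ 1000.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = np.hstack([
    rng.normal(0, 1, size=(n, 9)),      # nine "ordinary" variables
    rng.normal(0, 1000, size=(n, 1)),   # one huge-SD variable
])

# Share of total variance (hence of expected squared Euclidean
# distance) contributed by the single large-SD column:
share = X[:, 9].var() / X.var(axis=0).sum()
print(share)   # > 0.99 -- the other nine variables barely register
```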

c) If you standardize the variables, then: it is very
likely that 80% or more of the variables will reflect the
size of the county, either in people or in area; that's the
nature of public statistics and the way that we gather them.
So, you can take your raw area and raw count, and
form clusters out of them, up to whatever number you
want -- and that will achieve very
high agreement with whatever someone else invents,
starting with the same data.
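To make that two-variable shortcut concrete, here is a minimal sketch (using
synthetic skewed data as a stand-in for the real USA Counties file, and
scipy's hierarchical clustering, which the post does not specify):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
pop = rng.lognormal(10, 1.5, size=300)   # skewed, like county populations
area = rng.lognormal(6, 1.0, size=300)
X = np.column_stack([pop, area])

# Standardize so neither variable dominates, then Ward-link and
# cut the tree at (up to) 5 clusters.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
labels = fcluster(linkage(Z, method="ward"), t=5, criterion="maxclust")
print(len(set(labels)))   # at most 5
```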

d) I think the time required for full agglomerative clustering grows
with the square (for some linkage methods, the cube) of the N cases,
and linearly with the p variables when building the distance matrix.
Is there any excuse to use more than 30 variables?
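A back-of-envelope calculation under that assumption (my arithmetic, for the
sizes in the original question):

```python
# Agglomerative clustering of 3000 cases on 2000 variables:
n, p = 3000, 2000
pairs = n * (n - 1) // 2            # condensed distance-matrix entries
mem_mb = pairs * 8 / 1e6            # stored as 8-byte doubles
ops = pairs * p                     # distance-matrix work ~ O(N^2 * p)
print(pairs, mem_mb, ops)           # 4498500 35.988 8997000000
```

So the distance matrix itself is only about 36 MB; the p-dependent part is the
roughly nine billion subtract-square-add steps to fill it, which is why
trimming the variable list matters more than trimming the counties.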

e) Find the FAQs by Warren Sarle.  Part of what he has
on clustering is in my own stat-FAQ.

[ snip ]

--

http://www.pitt.edu/~wpilib/index.html
"Taxes are the price we pay for civilization."  Justice Holmes.