## CLUSTER - Cophenetic problem

### CLUSTER - Cophenetic problem

Dear all

Colleague Rivero, on 28th August, pointed out a very important problem
for those who use Cluster Analysis in SPSS: the absence of measures for
validating a given classification.

Everyone is aware of the subjectivity inherent in Cluster Analysis,
especially (most of all, I think) when we have no a priori idea of the
structure of the data. So, usually, we try a combination of, say,
3 distance measures between observations with 3 classification
methods, which gives 3x3=9 different classifications.

Which one best reflects the original structure of the data?
There are several methods to find out, among them: (1) the cophenetic
correlation; (2) Monte Carlo procedures; and (3) significance
tests.

None of these is implemented in SPSS. Without any of these, or others, in
my opinion, any classification is almost useless.

So, we have to write syntax for it. Maybe the most feasible of the
methods would be the Cophenetic correlation, which involves the
calculation of a correlation between two matrices (A and B), which I think
is more or less like:

$$\frac{\langle A, B \rangle}{\lVert A \rVert \, \lVert B \rVert}$$

(the inner product of A and B, divided by the product of their "norms")

I think this is hard to do in SPSS, so we would use the "RESHAPE"
procedure, as Dr. Nichols pointed out. The problem is that we do not have
the second matrix, i.e., the matrix of "fusion distances".
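As an illustration outside SPSS, the correlation between the two matrices can be sketched in a few lines of Python. This is only a sketch under assumptions: the function names are made up for the example, the two small matrices are hypothetical, and the entries are centered on their means before the inner product is taken (the usual Pearson form of the cophenetic correlation), which goes slightly beyond the "more or less" formula above.

```python
from math import sqrt

def lower_triangle(m):
    """Return the entries below the diagonal of a square matrix as a flat list."""
    return [m[i][j] for i in range(len(m)) for j in range(i)]

def matrix_correlation(a, b):
    """Pearson correlation between the off-diagonal entries of two symmetric matrices."""
    x, y = lower_triangle(a), lower_triangle(b)
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [v - mean_x for v in x]          # centered entries of A
    dy = [v - mean_y for v in y]          # centered entries of B
    inner = sum(p * q for p, q in zip(dx, dy))
    norms = sqrt(sum(p * p for p in dx)) * sqrt(sum(q * q for q in dy))
    return inner / norms

# Hypothetical data: original distances between 3 objects, and the fusion
# distances a single-linkage clustering of those objects would produce.
original = [[0, 1, 4],
            [1, 0, 3],
            [4, 3, 0]]
fusion = [[0, 1, 3],
          [1, 0, 3],
          [3, 3, 0]]

r = matrix_correlation(original, fusion)
print(round(r, 3))
```

Only the entries below the diagonal are used, since both matrices are symmetric with zeros on the diagonal.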

By hand, we could create this matrix from, for instance, the dendrogram.
But once again, SPSS transforms ("rescales") the original distances
(without letting us turn this off), making it impossible to
compare the two.
In any case, this would only be feasible with small data sets.
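If the merge history is available with unrescaled heights, the fusion matrix does not have to be read off the dendrogram by hand. A minimal Python sketch follows, assuming the history is a list of `(a, b, height)` records in merge order, where `a` and `b` are 0-based indices of any one member of each cluster being joined. That record layout and the function name `fusion_matrix` are hypothetical, chosen for the example only.

```python
def fusion_matrix(n, merges):
    """Build the n x n matrix of fusion distances from a merge history.

    merges: list of (a, b, height) tuples in merge order (hypothetical layout);
    a and b are 0-based indices of one member of each cluster being joined.
    """
    members = {i: [i] for i in range(n)}   # cluster label -> objects in it
    label = list(range(n))                 # object -> its current cluster label
    fusion = [[0.0] * n for _ in range(n)]
    for a, b, height in merges:
        la, lb = label[a], label[b]
        if la == lb:
            raise ValueError("objects are already in the same cluster")
        # Every pair split across the two clusters first meets at this height.
        for i in members[la]:
            for j in members[lb]:
                fusion[i][j] = fusion[j][i] = height
        # Absorb cluster lb into cluster la and relabel its objects.
        members[la].extend(members.pop(lb))
        for j in members[la]:
            label[j] = la
    return fusion

# Hypothetical history: objects 0 and 1 join at height 1.0,
# then object 2 joins that cluster at height 3.0.
merges = [(0, 1, 1.0), (1, 2, 3.0)]
fusion = fusion_matrix(3, merges)
```

The work grows with the number of object pairs, so this remains practical well beyond what could be done by hand.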

I had this problem, i.e., TO OBTAIN THE FUSION MATRIX, a few months ago
(if anyone remembers), but I did not end up with a solution.

If anyone could please give any idea, or criticise the ideas on this
e-mail, I (and many other people I guess) would be very grateful.

Sorry for the long e-mail.

Thanks and Regards to all,

Pedro Sousa --------------------------------------------------------------
Universidade do Algarve, U.C.T.R.A. Campus de Gambelas, 8000 FARO PORTUGAL
Telephone: +351-89-800100 Ext. 7394 / +351-931-9834172

--------------------------------------------------------------------------

Dear Friends:

I am conducting a cluster analysis and am following
the traditional method of first running a hierarchical
clustering on a sample to generate a set of
feasible solutions. Then, I will feed each initial
solution into a k-means clustering (proc fastclus) to
generate clusters. The final solution will then be
determined by some internal and external validation.

As you know, proc cluster (hierarchical) generates a
set of solutions all the way from one to, say, 40
clusters. I then look at the cubic clustering
criterion and select several feasible solutions.
Suppose I want to try 4, 5, and 6 clusters, all
using k-means. What I would also like to do is extract the
corresponding cluster centroids to be used as the
initial seeds for the k-means. Can someone help me
with this problem? I checked the SAS manual and it
does not seem to have an example.
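The underlying computation is simple once each observation has a cluster label from the hierarchical solution cut at the chosen number of clusters: the seed for a cluster is the mean of each variable over its members. The Python sketch below shows only that step. It is not SAS code; the data, labels, and the function name `cluster_centroids` are hypothetical, and how to obtain the labels and pass the seeds on in SAS should be checked against the documentation.

```python
def cluster_centroids(rows, labels):
    """Return {cluster label: centroid}, where a centroid is the per-variable mean."""
    sums = {}     # cluster label -> running sum of each variable
    counts = {}   # cluster label -> number of observations
    for row, lab in zip(rows, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(row)
            counts[lab] = 0
        for k, value in enumerate(row):
            sums[lab][k] += value
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sorted(sums)}

# Hypothetical data: 4 observations on 2 variables, with labels taken from
# a hierarchical solution cut at 2 clusters.
rows = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0], [12.0, 14.0]]
labels = [1, 1, 2, 2]

seeds = cluster_centroids(rows, labels)
print(seeds)
```

Repeating this for each candidate cut (4, 5, and 6 clusters) gives one set of seeds per k-means run.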

thanks.

Hongjie

