SQL dilemma

SQL dilemma

Post by Sterling Pri » Sat, 30 Mar 2002 07:53:08



Suppose I've got two DB2 tables, which I will simulate here using SAS
datasets for the moment (the actual tables contain millions of records):

data test1;
input ID;
cards;
1234
5678
9876
5432
;
run;

data test2;
input ID;
cards;
1234
9876
7777
;
run;

I need to find the count of unique IDs across both tables; that is to say, I
need to count every distinct value of ID only once, without double (or
triple, quadruple, etc) counting any id that appears in both tables.

So, I'd expect the resulting count from the simulated tables above to be 5,
or, if I were to get the actual IDs instead of the count:

1234
5678
9876
5432
7777

So far the best thing I've hit on (below) seems to produce the list of IDs I
expect, but I think it is very inefficient as it involves two Cartesian
joins.  Also, I haven't quite worked out how to get the count of the IDs
rather than the IDs themselves from this query.

proc sql;

create table test3 as

select a.id from test1 a,
                       test2 b
where a.id ne b.id

union

select b.id from test1 a,
                       test2 b
where b.id ne a.id;

quit;

The query will really be done in a pass-through SQL to DB2, so whatever I do
has to be valid DB2 SQL and I need to handle it on the DB2 side rather than
pull both multi-million row tables into SAS to do it in a data step.

Can anyone suggest a better approach?

Thanks,

Sterling Price

**********************************************************************
Notice:  The area code for the Wal-Mart General Office has
changed from 501 to 479.  Please make sure that you are
dialing 479 when making calls to any General Office location.

**********************************************************************
This email and any files transmitted with it are confidential
and intended solely for the individual or entity to
whom they are addressed.  If you have received this email
in error destroy it immediately.
**********************************************************************

 
 
 

SQL dilemma

Post by Jeff Voell » Sat, 30 Mar 2002 08:28:01


Doesn't UNION automatically de-dupe by default and only keep duplicates if
you use UNION ALL?  I think you could get away with:

proc sql;
create view test3 as
select id from test1
union
select id from test2

create table count as
select count(*) as count
from test3;
quit;

-----Original Message-----

Sterling Price
Sent: Thursday, March 28, 2002 2:53 PM

Subject: SQL dilemma

Suppose I've got two DB2 tables, which I will simulate here using SAS
datasets for the moment (the actual tables contain millions of records):

data test1;
input ID;
cards;
1234
5678
9876
5432
;
run;

data test2;
input ID;
cards;
1234
9876
7777
;
run;

I need to find the count of unique IDs across both tables; that is to say, I
need to count every distinct value of ID only once, without double (or
triple, quadruple, etc) counting any id that appears in both tables.

So, I'd expect the resulting count from the simulated tables above to be 5,
or, if I were to get the actual IDs instead of the count:

1234
5678
9876
5432
7777

So far the best thing I've hit on (below) seems to produce the list of IDs I
expect, but I think it is very inefficient as it involves two Cartesian
joins.  Also, I haven't quite worked out how to get the count of the IDs
rather than the IDs themselves from this query.

proc sql;

create table test3 as

select a.id from test1 a,
                       test2 b
where a.id ne b.id

union

select b.id from test1 a,
                       test2 b
where b.id ne a.id;

quit;

The query will really be done in a pass-through SQL to DB2, so whatever I do
has to be valid DB2 SQL and I need to handle it on the DB2 side rather than
pull both multi-million row tables into SAS to do it in a data step.

Can anyone suggest a better approach?

Thanks,

Sterling Price

**********************************************************************
Notice:  The area code for the Wal-Mart General Office has
changed from 501 to 479.  Please make sure that you are
dialing 479 when making calls to any General Office location.

**********************************************************************
This email and any files transmitted with it are confidential
and intended solely for the individual or entity to
whom they are addressed.  If you have received this email
in error destroy it immediately.
**********************************************************************


 
 
 

SQL dilemma

Post by Karl K » Sat, 30 Mar 2002 10:54:01


Piece of cake.  Jeff is right, UNION de-dupe's.  The following produces the
correct answer (i.e., 5).

proc sql;
 select count(distinct id) from
  (select id from test1 union select id from test2);

It's standard SQL, so should work with DB2.


> Doesn't UNION automatically de-dupe by default and only keep duplicates if
> you use UNION ALL?  I think you could get away with:

> proc sql;
> create view test3 as
> select id from test1
> union
> select id from test2

> create table count as
> select count(*) as count
> from test3;
> quit;

> -----Original Message-----

> Sterling Price
> Sent: Thursday, March 28, 2002 2:53 PM

> Subject: SQL dilemma

> Suppose I've got two DB2 tables, which I will simulate here using SAS
> datasets for the moment (the actual tables contain millions of records):

> data test1;
> input ID;
> cards;
> 1234
> 5678
> 9876
> 5432
> ;
> run;

> data test2;
> input ID;
> cards;
> 1234
> 9876
> 7777
> ;
> run;

> I need to find the count of unique IDs across both tables; that is to say,
I
> need to count every distinct value of ID only once, without double (or
> triple, quadruple, etc) counting any id that appears in both tables.

> So, I'd expect the resulting count from the simulated tables above to be
5,
> or, if I were to get the actual IDs instead of the count:

> 1234
> 5678
> 9876
> 5432
> 7777

> So far the best thing I've hit on (below) seems to produce the list of IDs
I
> expect, but I think it is very inefficient as it involves two Cartesian
> joins.  Also, I haven't quite worked out how to get the count of the IDs
> rather than the IDs themselves from this query.

> proc sql;

> create table test3 as

> select a.id from test1 a,
>                        test2 b
> where a.id ne b.id

> union

> select b.id from test1 a,
>                        test2 b
> where b.id ne a.id;

> quit;

> The query will really be done in a pass-through SQL to DB2, so whatever I
do
> has to be valid DB2 SQL and I need to handle it on the DB2 side rather
than
> pull both multi-million row tables into SAS to do it in a data step.

> Can anyone suggest a better approach?

> Thanks,

> Sterling Price

> **********************************************************************
> Notice:  The area code for the Wal-Mart General Office has
> changed from 501 to 479.  Please make sure that you are
> dialing 479 when making calls to any General Office location.

> **********************************************************************
> This email and any files transmitted with it are confidential
> and intended solely for the individual or entity to
> whom they are addressed.  If you have received this email
> in error destroy it immediately.
> **********************************************************************

 
 
 

SQL dilemma

Post by Jim Groeneve » Sat, 30 Mar 2002 19:51:00


Hi Sterling,

How about this solution? It does not apply PROC SQL:

PROC SORT DATA=Test1 NODUPKEY; BY ID;RUN;
PROC SORT DATA=Test2 NODUPKEY; BY ID;RUN;

DATA Both;
  MERGE Test1 Test2; BY ID;
RUN;

PROC PRINT DATA=Both; RUN;

Regards - Jim.
--
Y. (Jim) Groeneveld, MSc      IMRO TRAMARKO       tel. +31 412 407 070
senior statistician,          P.O. Box 1          fax. +31 412 407 080
senior data manager           5350 AA BERGHEM     IMRO TRAMARKO: a CRO

My computer beeps when it complains; I complain when it beeps

Notice of confidentiality: this e-mail may contain confidential information
intended for the addressed recipient only.
If you have received this e-mail in error please delete this e-mail and
please notify the sender so that proper delivery
can be arranged.

> -----Original Message-----

> Sent: Thursday, March 28, 2002 11:53 PM

> Subject:      SQL dilemma

> Suppose I've got two DB2 tables, which I will simulate here using SAS
> datasets for the moment (the actual tables contain millions of records):

> data test1;
> input ID;
> cards;
> 1234
> 5678
> 9876
> 5432
> ;
> run;

> data test2;
> input ID;
> cards;
> 1234
> 9876
> 7777
> ;
> run;

> I need to find the count of unique IDs across both tables; that is to say,
> I
> need to count every distinct value of ID only once, without double (or
> triple, quadruple, etc) counting any id that appears in both tables.

> So, I'd expect the resulting count from the simulated tables above to be
> 5,
> or, if I were to get the actual IDs instead of the count:

> 1234
> 5678
> 9876
> 5432
> 7777

> So far the best thing I've hit on (below) seems to produce the list of IDs
> I
> expect, but I think it is very inefficient as it involves two Cartesian
> joins.  Also, I haven't quite worked out how to get the count of the IDs
> rather than the IDs themselves from this query.

> proc sql;

> create table test3 as

> select a.id from test1 a,
>                        test2 b
> where a.id ne b.id

> union

> select b.id from test1 a,
>                        test2 b
> where b.id ne a.id;

> quit;

> The query will really be done in a pass-through SQL to DB2, so whatever I
> do
> has to be valid DB2 SQL and I need to handle it on the DB2 side rather
> than
> pull both multi-million row tables into SAS to do it in a data step.

> Can anyone suggest a better approach?

> Thanks,

> Sterling Price

> **********************************************************************
> Notice:  The area code for the Wal-Mart General Office has
> changed from 501 to 479.  Please make sure that you are
> dialing 479 when making calls to any General Office location.

> **********************************************************************
> This email and any files transmitted with it are confidential
> and intended solely for the individual or entity to
> whom they are addressed.  If you have received this email
> in error destroy it immediately.
> **********************************************************************