SQL Question - Doing equijoins and getting this log message

SQL Question - Doing equijoins and getting this log message

Post by Gerstle, Jo » Sat, 02 Nov 2002 01:53:37



Hallo,
I'm looking for some more information and an understanding of this message
in the log after running an sql program, matching each row of a table with
itself using a WHERE statement that prevents a row from matching with itself
or others before it in the order of the original rows (i.e. line1 > line2).
It's a modified program from the fuzzy matching programs put up on Charles
Patridge's website.

   proc sql;
        create table matches as
      select t1.lineID as lineID1,t2.lineID as lineID2,
                                t1.idno as idno1, t2.idno as idno2,
              from scenarios as t1, scenarios as t2
      where t1.lineID < t2.LineID              ;
    quit;

I've searched the SAS-L archives and SAS Tech Support, and found some
interesting tidbits, but not any info on this specific log message.

Here is the message:

NOTE: The execution of this query involves performing one or more Cartesian
product joins that can not be optimized.

Is this a worry?  Can this be corrected, if so?

Thanks for any input and/or opinions.

John Gerstle
        Biostatistician
CDC Information Technical Services (CITS)
Contractor Support to NCHSTP
Division of HIV/AIDS Prevention
HIV/AIDS Incidence and Case Study Branch (HICSB)
Phone:   404-639-3980          Fax:     404-639-2980


 
 
 

SQL Question - Doing equijoins and getting this log message

Post by Sigurd Hermans » Sat, 02 Nov 2002 04:39:12


Summary: Avoid solutions that require Cartesian products for matching more
than a few thousand records. Contact Paul Dorfman if you need a reasonable
solution to a large-scale matching problem.

A thread that touched indirectly on this topic wove its way through the list
not too long back. The message means, in basic terms, that the SQL compiler
cannot determine an index or sort order of the keys (LineID) that it can use
to limit the rows being compared to resolve the where condition. Sometimes
that means that none of the typical optimization strategies will work, and
sometimes it means that the SAS SQL compiler cannot implement whatever
strategies might work.

The inequality operator (<) in the WHERE clause is complicating the SQL plan
for the query. I believe that this query represents an extreme case of a
theta-join. I consider it an extreme case because the query should select
about half of the Cartesian product of scenarios reflexed on itself! That
means that if scenarios has 5,000 rows, you would expect the query to yield
12.5 million rows. The numbers increase geometrically with the number of
rows. SAS SQL may as well in this case form the Cartesian product and throw
out the ones that do not meet the WHERE condition.

Inexact matching of natural key values may require comparisons of every
possible pairing of two sets of records. If the decision to select a match
depends on the yield of a binary function with one argument being taken from
each record, how else could a program determine which pairs of records to
select?  Binary comparison methods for deduplicating large sets of records
tend to bog down quickly as the number of records increases.

The solution? No perfect solution exists for this class of problem, but
reasonable approximations involving prescreening on indexed unary function
values cost little in terms of sensitivity and allow control of specificity
of matching. If anyone has to find a way to conduct fuzzy matching or
deduplication of a large number of records, send an e-mail to Paul Dorfman.
He has developed a very fast and accurate implementation of unary function
value indexing, and he understands the in's and out's of fuzzy linkage.

Sig

-----Original Message-----

Sent: Thursday, October 31, 2002 11:54 AM

Subject: SQL Question - Doing equijoins and getting this log message

Hallo,
I'm looking for some more information and an understanding of this message
in the log after running an sql program, matching each row of a table with
itself using a WHERE statement that prevents a row from matching with itself
or others before it in the order of the original rows (i.e. line1 > line2).
It's a modified program from the fuzzy matching programs put up on Charles
Patridge's website.

   proc sql;
        create table matches as
      select t1.lineID as lineID1,t2.lineID as lineID2,
                                t1.idno as idno1, t2.idno as idno2,
              from scenarios as t1, scenarios as t2
      where t1.lineID < t2.LineID              ;
    quit;

I've searched the SAS-L archives and SAS Tech Support, and found some
interesting tidbits, but not any info on this specific log message.

Here is the message:

NOTE: The execution of this query involves performing one or more Cartesian
product joins that can not be optimized.

Is this a worry?  Can this be corrected, if so?

Thanks for any input and/or opinions.

John Gerstle
        Biostatistician
CDC Information Technical Services (CITS)
Contractor Support to NCHSTP
Division of HIV/AIDS Prevention
HIV/AIDS Incidence and Case Study Branch (HICSB)
Phone:   404-639-3980          Fax:     404-639-2980




 
 
 

SQL Question - Doing equijoins and getting this log message

Post by Ian Whitlo » Sat, 02 Nov 2002 05:36:13


John,

Sometimes there is no way to execute an SQL join without comparing every
record in the first set with every record in the second.  There is nothing
wrong with this, but it may take a long time to execute.  The message in
translation says,  "None of my short cut tricks are going to work on this
problem the way you phrased it, I am using brute force to please you."  If
you are satisfied with the performance, who cares?  If you are not, then the
message tells you to modify the code so that it could apply a short cut, or
get another problem.  Sometimes problems can only be solved with brute force
and you need to get a bigger and faster computer, or give up.  (Say that for
the best formulation of the problem, the estimated CPU time is 1 century to
finish the task, then give up or at least wait a few years.)

An equi-join says that the WHERE condition can be structured as a request
that two entities be equal with possibly other conditions joined to the
equal condition by an AND.  For example,

     where x = y            equi-join
     where x = y and z > 5  equi-join
     where x = y
       and ( any kind of complexity you wish )  equi-join
     where x = y or z > 5   not an equi-join

In the last case the restriction to Z > 5 might be so strong that you could
use a UNION in which the first part used X = Y equi-join and the second part
uses brute force but the remaining amount of data is so small that it
doesn't matter.  Or possibly you might have extra information, say when Z >
5 then X = Y + 9.  Again you could improve the performance with a UNION of
two equi-joins.

The significance of an equi-join is that at worst you can sort and merge.
This means there is a way to know when to stop comparing records, hence it
is usually more efficient that brute force where all pairs must be compared.
This does not necessarily mean faster execution, but it often does and the
computer doesn't see the necessity for resorting to brute force.  For
example,  you have

     where x < y

So you create a new variable Z always 0 and then use the condition

     where x < y and z = 0

Now you have an equi-join, but the new WHERE is not one bit better than the
previous one, because the equi-part adds no new information to the problem.

Howard Schreier is the master of rephrasing problems to introduce
equi-joins.  My all time favorite occurred around 1993 when someone on SAS-L
posed the question of finding records with near values of X, say

     where abs ( a.x - b.x ) <= 1

(I think the value was 2.3, but let's work with 1 anyway.)  Howard saw that
the above implies one of the following is always true

       round ( a.x , 1 ) = round ( b.x , 1 )
       round ( a.x , 1 ) = round ( b.x , 1 ) + 1
       round ( a.x , 1 ) = round ( b.x , 1 ) - 1

So he introduced a third data set, C, with 3 records having variable R in
(0, 1 , -1).  The condition then became

       where abs ( a.x - b.x ) < 1
         and round ( a.x , 1 ) = round ( b.x , 1 ) + c.r

This speeded up the performance significantly.


-----Original Message-----

Sent: Thursday, October 31, 2002 11:54 AM

Subject: SQL Question - Doing equijoins and getting this log message

Hallo,
I'm looking for some more information and an understanding of this message
in the log after running an sql program, matching each row of a table with
itself using a WHERE statement that prevents a row from matching with itself
or others before it in the order of the original rows (i.e. line1 > line2).
It's a modified program from the fuzzy matching programs put up on Charles
Patridge's website.

   proc sql;
        create table matches as
      select t1.lineID as lineID1,t2.lineID as lineID2,
                                t1.idno as idno1, t2.idno as idno2,
              from scenarios as t1, scenarios as t2
      where t1.lineID < t2.LineID              ;
    quit;

I've searched the SAS-L archives and SAS Tech Support, and found some
interesting tidbits, but not any info on this specific log message.

Here is the message:

NOTE: The execution of this query involves performing one or more Cartesian
product joins that can not be optimized.

Is this a worry?  Can this be corrected, if so?

Thanks for any input and/or opinions.

John Gerstle
        Biostatistician
CDC Information Technical Services (CITS)
Contractor Support to NCHSTP
Division of HIV/AIDS Prevention
HIV/AIDS Incidence and Case Study Branch (HICSB)
Phone:   404-639-3980          Fax:     404-639-2980



 
 
 

SQL Question - Doing equijoins and getting this log message

Post by Sigurd Hermans » Sat, 02 Nov 2002 06:10:20


As an aside, Howard's innovative methods inspired a very efficient fuzzy
matching method for dates and other sequences that I have included in a
number of fuzzy linkage programs. Anyone interested in some of the finer
points of conditional merge/joins should take a look at Ian's, Howard's, and
Paul Kent's discussions of techniques for converting match conditions from
inequalities to linear forms.
Sig
-----Original Message-----
From: Ian Whitlock
Sent: Thursday, October 31, 2002 3:36 PM

Subject: Re: SQL Question - Doing equijoins and getting this log message

John,

Sometimes there is no way to execute an SQL join without comparing every
record in the first set with every record in the second.  There is nothing
wrong with this, but it may take a long time to execute.  The message in
translation says,  "None of my short cut tricks are going to work on this
problem the way you phrased it, I am using brute force to please you."  If
you are satisfied with the performance, who cares?  If you are not, then the
message tells you to modify the code so that it could apply a short cut, or
get another problem.  Sometimes problems can only be solved with brute force
and you need to get a bigger and faster computer, or give up.  (Say that for
the best formulation of the problem, the estimated CPU time is 1 century to
finish the task, then give up or at least wait a few years.)

An equi-join says that the WHERE condition can be structured as a request
that two entities be equal with possibly other conditions joined to the
equal condition by an AND.  For example,

     where x = y            equi-join
     where x = y and z > 5  equi-join
     where x = y
       and ( any kind of complexity you wish )  equi-join
     where x = y or z > 5   not an equi-join

In the last case the restriction to Z > 5 might be so strong that you could
use a UNION in which the first part used X = Y equi-join and the second part
uses brute force but the remaining amount of data is so small that it
doesn't matter.  Or possibly you might have extra information, say when Z >
5 then X = Y + 9.  Again you could improve the performance with a UNION of
two equi-joins.

The significance of an equi-join is that at worst you can sort and merge.
This means there is a way to know when to stop comparing records, hence it
is usually more efficient that brute force where all pairs must be compared.
This does not necessarily mean faster execution, but it often does and the
computer doesn't see the necessity for resorting to brute force.  For
example,  you have

     where x < y

So you create a new variable Z always 0 and then use the condition

     where x < y and z = 0

Now you have an equi-join, but the new WHERE is not one bit better than the
previous one, because the equi-part adds no new information to the problem.

Howard Schreier is the master of rephrasing problems to introduce
equi-joins.  My all time favorite occurred around 1993 when someone on SAS-L
posed the question of finding records with near values of X, say

     where abs ( a.x - b.x ) <= 1

(I think the value was 2.3, but let's work with 1 anyway.)  Howard saw that
the above implies one of the following is always true

       round ( a.x , 1 ) = round ( b.x , 1 )
       round ( a.x , 1 ) = round ( b.x , 1 ) + 1
       round ( a.x , 1 ) = round ( b.x , 1 ) - 1

So he introduced a third data set, C, with 3 records having variable R in
(0, 1 , -1).  The condition then became

       where abs ( a.x - b.x ) < 1
         and round ( a.x , 1 ) = round ( b.x , 1 ) + c.r

This speeded up the performance significantly.


-----Original Message-----

Sent: Thursday, October 31, 2002 11:54 AM

Subject: SQL Question - Doing equijoins and getting this log message

Hallo,
I'm looking for some more information and an understanding of this message
in the log after running an sql program, matching each row of a table with
itself using a WHERE statement that prevents a row from matching with itself
or others before it in the order of the original rows (i.e. line1 > line2).
It's a modified program from the fuzzy matching programs put up on Charles
Patridge's website.

   proc sql;
        create table matches as
      select t1.lineID as lineID1,t2.lineID as lineID2,
                                t1.idno as idno1, t2.idno as idno2,
              from scenarios as t1, scenarios as t2
      where t1.lineID < t2.LineID              ;
    quit;

I've searched the SAS-L archives and SAS Tech Support, and found some
interesting tidbits, but not any info on this specific log message.

Here is the message:

NOTE: The execution of this query involves performing one or more Cartesian
product joins that can not be optimized.

Is this a worry?  Can this be corrected, if so?

Thanks for any input and/or opinions.

John Gerstle
        Biostatistician
CDC Information Technical Services (CITS)
Contractor Support to NCHSTP
Division of HIV/AIDS Prevention
HIV/AIDS Incidence and Case Study Branch (HICSB)
Phone:   404-639-3980          Fax:     404-639-2980



 
 
 

1. Log-Log CD plot question (novice question)

My statistics side is very poorly developed .... :-) any help would be
greatly appreciated.

I need to make a log-log complementary distribution plot with on the x-axis
log10(x) and on the y-axis log10(P[X>x]). It is easy to tranform x into
log10(x), but how do I get the values of log10(P[X>x]) to plot against?

Thanks, for any help.

--
Werner

2. Syncing IIIe with two different PCs running Outlook 2K

3. Question on SQL Server Trans Log

4. WinRAR password problem

5. Save LOG Clear LOG Save LOG ...

6. HP ScanJet 5s: User Comments?

7. Please Help - HP Scanjet 4c ands NT

8. How can I use EVE instead of EDIT for posting

9. Getting User name logged in?

10. Getting User name logged on the machine.

11. Getting User name logged in?

12. messengers getting logged off

13. Help! I keep getting logged off!