Ya:
Thanks for all the investigations.
The solution I worked up involves an outer join. But my problem space (N)
is small enough to not be worried about the size of the OJ (N**2).
data foo;
length like $20 deleteThis 8;
input like ;
cards;
jabba
jabba%
jabba%the%hut
xyz%abc
xyz%abc%def
132%xyz
132%xyz%npq
132%dbc%asd%ijk
132
123%abc
12%3%
12345
12
234
23456
;
run;
proc sql ;
* likematrix is not used, but it demonstrates what the outer join looks
like
* as well as the 'like-ness' measures;
create table likematrix as
select x.like as x
, y.like as y
, trim(x.like) like trim(y.like) as x_like_y
, x.like eq y.like as x_equal_y
from foo as x
, foo as y
;
reset undo_policy=none;
* delete all likes that are explicitly like a different like;
delete from foo where
like in (select x.like
from foo as x
, foo as y
where trim(x.like) like trim(y.like)
and (x.like ne y.like)
) ;
quit;
--
Richard A. DeVenezia, http://www.devenezia.com
> Richard,
> I just came out the following code. Under some strict assumptions,
> it seems to work:
> data foo;
> length like $20;
> input like ;
> cards;
> jabba
> jabba%
> jabba%the%hut
> xyz%abc
> xyz%abc%def
> 132%xyz
> 132%xyz%npq
> 132%dbc%asd%ijk
> 132
> 123%abc
> 12%3%
> 12345
> 12
> 234
> 23456
> ;
> run;
> proc sql;
> create table root as
> select distinct scan(tranwrd(like,'%','% '),1,' ') as root
> from foo
> having substr(reverse(trim(calculated root)),1,1)='%'
> ;
> create table root1 as
> select distinct a.root
> from root a, root b
> group by a.root
> having max(compress(a.root,'%') contains trim(left(compress(b.root,'%')))
> and a.root ne b.root)=0
> ;
> create table foo as
> select a.*,
> case when substr(a.like,1,length(b.root)-1)=compress(b.root,'%') then
b.root
> else ' ' end as grp
> from foo a, root1 b
> having calculated grp ne ' '
> ;
> create table foo as
> select *
> from foo
> group by grp
> having like contains trim(grp) and
> length(like)=min(length(case when like contains trim(grp) then like
> else repeat('*',100) end))
> ;
> proc print;
> run;
> -----------------
> like grp
> 12%3% 12%
> 132%xyz 132%
> jabba% jabba%
> xyz%abc xyz%
> My assumptions are:
> . There is always a common 'root' for similar strings,
> this 'root' has to be ended with a '%'. Therefore,
> jabba% is a root, but jabba is not.
> . There is no strings with same 'root' and equal length,
> such as xyz%abc and xyz%bcd.
> . Similar string without a 'root' will be dropped.
> For instance, 234 and 23456 are dropped because there is
> not a 234%.
> I think regular express function might be better
> for searching a common root, but I'm not good at those
> function, so I can only build my code based on old fashion
> string manipulation functions.
> I know this is not even close to perfection, especially
> if you have other roles in your mind. I hope this can provide
> you some hints.
> Kind regards,
> Ya Huang
> -----Original Message-----
> Sent: Friday, June 20, 2003 3:16 PM
> Subject: Re: delete rows based on a column being like-compared to other
> rows
> I provided Ya Huang this additional clarification.
> of data
> jabba
> jabba%
> jabba%the%hut
> xyz%abc
> 132%xyz
> 123%abc
> 12%3%
> I would want to reduce to
> jabba%
> xyz%abc
> 132%xyz
> 12%3%
> --
> Richard A. DeVenezia, http://www.devenezia.com