Egrep question

Post by David Henderso » Sun, 15 Jun 1997 04:00:00



I am trying to use egrep to find doubled words.  An example
sentence is "The the theory is incorrect".

I tried:        egrep -i '\<([a-z]+) +\1\>'

but it didn't catch it.  If the sentence is "... the the ...",
it will find it.

My egrep supports backreferences.  (I don't know what version
it is, since egrep doesn't report its version).  I am using
RedHat Linux 4.1 with the usual set of GNU utilities.

Am I doing something wrong, or is there a problem with the GNU
egrep?

Thanks!
David

*=============================================================*

*=============================================================*

Genius may have its limitations, but stupidity is not thus
handicapped.
                                -- Elbert Hubbard

 
 
 

Egrep question

Post by Icarus Spar » Tue, 17 Jun 1997 04:00:00




>I am trying to use egrep to find doubled words.  An example
>sentence is "The the theory is incorrect".

>I tried:    egrep -i '\<([a-z]+) +\1\>'

>but it didn't catch it.  If the sentence is "... the the ...",
>it will find it.

>My egrep supports backreferences.  (I don't know what version
>it is, since egrep doesn't report its version).  I am using
>RedHat Linux 4.1 with the usual set of GNU utilities.

>Am I doing something wrong, or is there a problem with the GNU
>egrep?

The problem is that '\1' matches exactly the text that the group has
already matched, and 'The' is not the same text as 'the'. The '-i' flag
is not applied to the backreference here, although you could make a
case for saying that it should be.

To solve this particular problem, I would be inclined to use 'deroff'
to split the file into words, use 'tr' to convert them to lower case,
use 'uniq -d' to find the offending patterns. Then, knowing what I
was looking for, I would go back and search for them. This approach
is similar to the original 'spell' script for Unix, and catches this
this error (where the two words 'this this' are split by a newline).
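
A minimal sketch of that pipeline, under some assumptions: 'tr -cs' stands
in for 'deroff' as a crude word splitter (it does not strip troff markup),
and the function name doubled_words is made up for illustration.

```shell
# doubled_words: hypothetical helper, reads text on stdin.
# Assumption: plain ASCII text; tr replaces deroff as the word splitter.
doubled_words() {
    # Turn every run of non-letters into a newline: one word per line.
    tr -cs 'A-Za-z' '\n' |
    # Fold to lower case so "The" and "the" compare equal.
    tr 'A-Z' 'a-z' |
    # Print one copy of each word that occurs on adjacent lines.
    uniq -d
}
```

Something like "doubled_words < thesis.txt" (file name hypothetical) would
then list the candidate words to go back and search for. Because the text
is split before comparing, a pair broken across a newline is caught too.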

However I would be even more inclined to use 'perl'.

#!/usr/bin/perl -0p
s/(\b(\w+)\W+\2\b)/{repeat $1}/gi;

and then look for the character sequence '{repeat' in the output.

Thanks for asking this question, I have just used this tool on my
thesis, and found some unexpected repeated words!

Icarus

 
 
 

Egrep question

Post by Eli the Bearde » Tue, 17 Jun 1997 04:00:00



> I am trying to use egrep to find doubled words.  An example
> sentence is "The the theory is incorrect".

> I tried:   egrep -i '\<([a-z]+) +\1\>'

> but it didn't catch it.  If the sentence is "... the the ...",
> it will find it.

Yup. Backreferences are pretty much just hacks to the regexp engine,
so things like -i do not always work with them. They don't with GNU
grep 2.0, as you have noticed. Try perl:

perl -e 'while(<>){m:\b([a-z]+) +\1\b:i && print;}'

> My egrep supports backreferences.  (I don't know what version
> it is, since egrep doesn't report its version).  I am using
> RedHat Linux 4.1 with the usual set of GNU utilities.

GNU grep 2.0, almost certainly. Try "egrep -V" and look at the first
line of the error message.

Elijah
------
or use tr to remove the case problem
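
One way the tr idea might look, as a sketch only: lower-case the text
first, so the backreference never has to cope with mixed case. The function
name find_doubled_lines is made up, and backreferences plus '\<' and '\>'
in 'grep -E' are GNU extensions, so this assumes GNU grep.

```shell
# find_doubled_lines: hypothetical helper, reads text on stdin.
# Assumption: GNU grep (backreferences and \< \> with -E are extensions).
find_doubled_lines() {
    # Fold everything to lower case, then look for a word repeated
    # after one or more spaces on the same line.
    tr 'A-Z' 'a-z' | grep -E '\<([a-z]+) +\1\>'
}
```

The matching lines come out lower-cased, and pairs split across a newline
are still missed, so this only addresses the case problem.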

 
 
 
