Extract from one file lines NOT IN another ...

Post by R Chi » Fri, 04 Oct 2002 23:48:42



2 files I have...
a200.dat --- Has 200 lines (assume all unique)
b250.dat --- this is  a200.dat plus 50 additional lines

I need to extract the 50 lines from b250.dat that are NOT IN a200.dat:

I found out this way:

cat file a200.dat >> b250.dat
sort b250.dat | uniq -u > c50.dat

Please suggest alternatives, faster ways

Thanks
Robert
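
A minimal sketch of the same sort | uniq -u idea that leaves b250.dat unmodified. The file contents and the scratch directory are made up for illustration. Listing a200.dat twice is an added safeguard, not part of the original recipe: it guarantees every a200.dat line occurs at least twice, so uniq -u drops it even if it is missing from b250.dat.

```shell
# Scratch directory with small made-up stand-ins for the real files.
dir=$(mktemp -d)
printf '%s\n' apple banana cherry > "$dir/a200.dat"
printf '%s\n' banana date apple cherry elder > "$dir/b250.dat"

# a200.dat is fed twice so each of its lines is a duplicate after sorting;
# uniq -u then keeps only lines that occur exactly once, i.e. lines that
# appear solely (and once) in b250.dat.
cat "$dir/a200.dat" "$dir/a200.dat" "$dir/b250.dat" | sort | uniq -u > "$dir/c50.dat"

cat "$dir/c50.dat"
```

Note that this still sorts the data, so the original order of b250.dat is lost.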

 
 
 

Post by dav.. » Fri, 04 Oct 2002 23:51:29



> I need to extract the 50 lines from b250.dat that are NOT IN a200.dat:

This should work:

grep --invert-match -f a200.dat b250.dat

Note: I haven't tested this (how could I?).

See man grep for more info.

Davide

 
 
 

Post by Barry Margoli » Fri, 04 Oct 2002 23:56:09



>2 files I have...
>a200.dat --- Has 200 lines (assume all unique)
>b250.dat --- this is  a200.dat plus 50 additional lines

>I need to extract the 50 lines from b250.dat that are NOT IN a200.dat:

>I found out this way:

>cat file a200.dat >> b250.dat

What's "file"?

>sort b250.dat | uniq -u > c50.dat

>Please suggest alternatives, faster ways

fgrep -v -x -f a200.dat b250.dat > c50.dat

Some versions of fgrep have a limit on the size of the -f file, and 200
lines may exceed it.
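
To see why -x (and fixed-string matching) matter here, a small sketch with made-up data. It uses grep -F, the modern spelling of fgrep; the sample lines are chosen so that one contains a regex metacharacter and another is a substring of a longer line.

```shell
dir=$(mktemp -d)
# Made-up sample data: "a.c" contains a regex metacharacter, and "ab"
# is a substring of "abc".
printf '%s\n' 'a.c' 'ab' > "$dir/a200.dat"
printf '%s\n' 'a.c' 'ab' 'abc' 'axc' > "$dir/b250.dat"

# -F: patterns are fixed strings, -x: a pattern must match the whole line,
# -v: print non-matching lines, -f: read the patterns from a file.
grep -F -x -v -f "$dir/a200.dat" "$dir/b250.dat" > "$dir/c50.dat"

# Without -F and -x, "axc" matches the regex a.c and "abc" contains "ab",
# so every line matches and nothing is printed (grep then exits with 1).
grep -v -f "$dir/a200.dat" "$dir/b250.dat" > "$dir/loose.dat" || true

cat "$dir/c50.dat"
```

Unlike the sort-based approaches, this keeps the lines in their original b250.dat order.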

sort a200.dat > a200.sorted
sort b250.dat > b250.sorted
comm -13 a200.sorted b250.sorted > c50.dat

In shells that support process substitution (e.g. bash, ksh93, zsh) you can abbreviate this version to:

comm -13 <(sort a200.dat) <(sort b250.dat) > c50.dat
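
For anyone who wants to try the temp-file variant on toy data first, a small sketch. The file contents and the scratch directory are made up, and pinning LC_ALL=C is an extra precaution (not in the commands above) so that sort and comm agree on the collation order.

```shell
dir=$(mktemp -d)
printf '%s\n' pear apple fig > "$dir/a200.dat"
printf '%s\n' fig kiwi pear apple lime > "$dir/b250.dat"

# comm expects both inputs sorted in the same collation order, so pin the
# locale for sort and comm alike.
LC_ALL=C sort "$dir/a200.dat" > "$dir/a200.sorted"
LC_ALL=C sort "$dir/b250.dat" > "$dir/b250.sorted"

# -1 drops lines found only in the first file, -3 drops lines common to
# both, which leaves the lines found only in the second file.
LC_ALL=C comm -13 "$dir/a200.sorted" "$dir/b250.sorted" > "$dir/c50.dat"

cat "$dir/c50.dat"
```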

--

Genuity, Woburn, MA
*** DON'T SEND TECHNICAL QUESTIONS DIRECTLY TO ME, post them to newsgroups.
Please DON'T copy followups to me -- I'll assume it wasn't posted to the group.

 
 
 

Post by mats.blomstr.. » Sat, 05 Oct 2002 00:14:16



> Please suggest alternatives, faster ways

You could try 'comm' if you have it on your system.
//Mats

bash$ man comm
COMM(1)                        FSF                        COMM(1)

NAME
       comm - compare two sorted files line by line

SYNOPSIS
       comm [OPTION]... LEFT_FILE RIGHT_FILE

DESCRIPTION
       Compare  sorted  files  LEFT_FILE  and  RIGHT_FILE line by
       line.

       -1     suppress lines unique to left file

       -2     suppress lines unique to right file

       -3     suppress lines that appear in both files

       --help display this help and exit

       --version
              output version information and exit

AUTHOR
       Written by Richard Stallman and David MacKenzie.

REPORTING BUGS

COPYRIGHT
       Copyright © 2001 Free Software Foundation, Inc.
       This is free software; see the source for  copying  condi-
       tions.  There is NO warranty; not even for MERCHANTABILITY
       or FITNESS FOR A PARTICULAR PURPOSE.

SEE ALSO
       The full documentation for comm is maintained as a Texinfo
       manual.   If  the  info  and  comm  programs  are properly
       installed at your site, the command

              info comm

       should give you access to the complete manual.

comm (textutils) 2.0.14     March 2001                    COMM(1)
bash$

 
 
 

Post by R Chi » Sat, 05 Oct 2002 01:34:13


Thanks, "comm" seems to be the way to go for massive data files !
no *grep...

>sort a200.dat > a200.sorted
>sort b250.dat > b250.sorted
>comm -13 a200.sorted b250.sorted > c50.dat

>In some shells you can abbreviate this version to:

>comm -13 <(sort a200.dat) <(sort b250.dat) > c50.dat

 
 
 

Post by dav.. » Sat, 05 Oct 2002 01:42:33



> Thanks, "comm" seems to be the way to go for massive data files !
> no *grep...

Thinking about it, you could also use diff.

Davide

 
 
 

Post by Robert Kat » Sat, 05 Oct 2002 02:01:29



> 2 files I have...
> a200.dat --- Has 200 lines (assume all unique)
> b250.dat --- this is  a200.dat plus 50 additional lines

> I need to extract the 50 lines from b250.dat that are NOT IN a200.dat:

...
...

awk ' !n {a[$0]++}
       n && !a[$0]++' a200.dat n=1 b250.dat

This will print the unique lines in b250.dat that are not in a200.dat.
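
The same idea is often written with the NR == FNR idiom instead of an n=1 flag. A sketch on made-up data follows; the array name seen and the sample files are only illustrative, and the NR == FNR test assumes the first file is not empty.

```shell
dir=$(mktemp -d)
printf '%s\n' pear apple fig > "$dir/a200.dat"
printf '%s\n' fig kiwi pear kiwi lime > "$dir/b250.dat"

# While reading the first file (NR == FNR), remember each line as an array key.
# For the second file, print a line only the first time it is seen and only
# if it was not remembered from the first file.
awk 'NR == FNR { seen[$0] = 1; next }
     !seen[$0]++' "$dir/a200.dat" "$dir/b250.dat" > "$dir/c50.dat"

cat "$dir/c50.dat"
```

No sorting is needed, and the output keeps the order in which the lines first appear in b250.dat; the duplicated kiwi line is printed once.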

--

Regards,

---Robert

 
 
 

Post by Barry Margoli » Sat, 05 Oct 2002 02:24:41




>> Thanks, "comm" seems to be the way to go for massive data files !
>> no *grep...

>Thinking about it, you could also use diff.

You could, but why would you want to?  Then you have to use something else
to get rid of all the extra output that diff produces to show which lines
were added, deleted, or changed (AFAIK, there's no option to diff to make
it print just the added lines).  Comm's output is already in the desired
form.

I've seen plenty of scripts that do this with diff.  To me, it always
signals that the author didn't know about comm.
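
For comparison, this is roughly what the diff route ends up looking like. It is a sketch on made-up data, assuming diff's default (non-unified) output format, where lines present only in the second file are prefixed with "> ".

```shell
dir=$(mktemp -d)
printf '%s\n' pear apple fig > "$dir/a200.dat"
printf '%s\n' fig kiwi pear apple lime > "$dir/b250.dat"
sort "$dir/a200.dat" > "$dir/a200.sorted"
sort "$dir/b250.dat" > "$dir/b250.sorted"

# diff prints change commands such as "2a3,4" plus the affected lines.
# sed keeps only the lines marked "> " (added in the second file) and
# strips that two-character prefix.
diff "$dir/a200.sorted" "$dir/b250.sorted" | sed -n 's/^> //p' > "$dir/c50.dat"

cat "$dir/c50.dat"
```

The extra sed stage is exactly the "something else" needed to clean up diff's output.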


 
 
 

Post by Kenny McCorma » Sat, 05 Oct 2002 02:27:41




>> Thanks, "comm" seems to be the way to go for massive data files !
>> no *grep...

>Thinking about it, you could also use diff.

>Davide

comm, diff, join, etc, all assume the input files are sorted.  I can't
remember if that was part of the original problem description, and/or if
the postings recommending these solutions always require a sort step
(and/or temp files to store the sorted data).

This topic comes up every couple of months or so in comp.lang.awk, where,
IMHO, nice AWK solutions can always be found.

 
 
 

Post by Robert Kat » Sat, 05 Oct 2002 02:53:09


...
...

> comm, diff, join, etc, all assume the input files are sorted.  I can't
> remember if that was part of the original problem description, and/or if
> the postings recommending these solutions always require a sort step
> (and/or temp files to store the sorted data).

> This topic comes up every couple of months or so in comp.lang.awk, where,
> IMHO, nice AWK solutions can always be found.

Here's an awk version of comm, without the -123 options:

awk '{ FNR == NR ? a[$0]++ : b[$0]++ }
     END {
             print "uniq lines in fileA but not in fileB"
             for (x in a)
                 if (!(x in b)) print x
             print ""
             print "uniq lines in fileB but not in fileA"
             for (x in b)
                 if (!(x in a)) print x
             print ""
             print "uniq lines common to both fileA and fileB"
             for (x in a)
                 if (x in b) print x
         }' fileA fileB

--

Regards,

---Robert

 
 
 

Post by John Gordon » Sat, 05 Oct 2002 03:32:29



> comm, diff, join, etc, all assume the input files are sorted.  I can't

diff doesn't.

John Gordon
---
"She even named one city after Robert, her ex-boyfriend, just to annoy
me.  I have it in a saved game on my laptop.  Every now and then I boot
it up just to let Robertville starve itself off the map."  -- Tom Chick

 
 
 

Post by Kenny McCorma » Sat, 05 Oct 2002 04:47:17





>> comm, diff, join, etc, all assume the input files are sorted.  I can't

>diff doesn't.

We're splitting hairs.  In order to use diff to sensibly tell you what's in
one file but not in another, the input files need to be sorted, or else you
get false positives when the same line is in both files, but in different
places/orders.

Obviously, you can use diff on unsorted files and get meaningful results,
but not in the sense of this thread.

 
 
 

Post by Barry Margoli » Sat, 05 Oct 2002 04:24:54





>> comm, diff, join, etc, all assume the input files are sorted.  I can't

>diff doesn't.

But it assumes that the two files have the common lines in the same order.


 
 
 

Post by John W. Krah » Sat, 05 Oct 2002 07:05:36



> 2 files I have...
> a200.dat --- Has 200 lines (assume all unique)
> b250.dat --- this is  a200.dat plus 50 additional lines

> I need to extract the 50 lines from b250.dat that are NOT IN a200.dat:

> I found out this way:

> cat file a200.dat >> b250.dat
> sort b250.dat | uniq -u > c50.dat

> Please suggest alternatives, faster ways

perl -ne'$x{$_}++}{print grep$x{$_}==1,keys%x' a200.dat b250.dat

John
--
use Perl;
program
fulfillment

 
 
 

Post by Sony Anto » Sat, 05 Oct 2002 07:07:03




> > comm, diff, join, etc, all assume the input files are sorted.  I can't

> diff doesn't.

Yes it does

//file a
onne
randee

//file b
randee
onne

diff a b
1d0
< onne
2a2
> onne

--sony
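
The same two files can be used to check what sorting changes. A small sketch; the scratch directory and the .sorted and .diff file names are made up for the example.

```shell
dir=$(mktemp -d)
printf '%s\n' onne randee > "$dir/a"
printf '%s\n' randee onne > "$dir/b"
sort "$dir/a" > "$dir/a.sorted"
sort "$dir/b" > "$dir/b.sorted"

# Unsorted: diff reports "onne" as deleted and added again, and exits
# with status 1 because it found differences.
diff "$dir/a" "$dir/b" > "$dir/unsorted.diff" || true

# Sorted: both files hold the same lines in the same order, so diff
# prints nothing and exits with status 0.
diff "$dir/a.sorted" "$dir/b.sorted" > "$dir/sorted.diff"

cat "$dir/unsorted.diff"
```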
 
 
 
