Joining common fields in files-Which Unix util should I use?

Joining common fields in files-Which Unix util should I use?

Post by Dan LeGa » Sun, 19 Nov 1995 04:00:00



Okay, here's my situation:

I've got two files, file1 and file2.  file1 contains data like this:

SSN       Name
-------------------
123456789,Doe, John
234567890,Doe, Jane
345678901,
456789012,One, Any
567890123,One, Some

file2 contains data like this:

 AMOUNT SSN
  10.00 234567890
1000.00 123456789
 100.00 456789012

I want my output to look like this:

 AMOUNT SSN       NAME
---------------------------
  10.00 234567890 Doe, Jane
1000.00 123456789 Doe, John
 100.00 456789012 One, Any

I need to pull the names from file1 and put the corresponding names (with
the correct SSN's) in file2.  Problem is, not every entry that is found
in file1 is also in file2.  So a sort, join wouldn't work, unless join
knows how to exclude fields that don't match up.  I haven't had much
success with join - I can never get it to give me any output.

Here are the issues I'm dealing with:

1.) file1 contains records that are not in file2.  Those would have to be
ignored. (as you can see above, line 3 in file1 has an SSN, but no name,
so it will NOT have a corresponding SSN in file2.)

2.) Neither file is sorted, so simply stripping the unnecesary entries
from both files won't match up properly, especially since file one has
more entries.

My logic at this point is: "Compare the SSNs in file2 to file1.  When
they match, pull the name field from file1 and append it to the end of
the field in file2 for the output (to file3)."

My question is, what unix utility/ies would work best for this?  I'm
going to put it into a nightly crontab which will automate the tasks.  
Any suggestions or hints?  Thanks!

Dan

 
 
 

Joining common fields in files-Which Unix util should I use?

Post by Dan Strombe » Mon, 20 Nov 1995 04:00:00


This seems to give good results, at least when using /bin/sh and GNU
join.  Some native join's should do it, too - but not all.

sed 's/^ *//' < file2 | \
   awk ' { printf("%s %s\n",$2,$1) }' | \
   sort -n > file2.sorted
sort -n file1 | \
   sed -e 's/ *//g' -e 's/,/ /g' | \
   join -t' ' -o2.1,2.2,1.2,1.3 - file2.sorted | \
   awk ' { printf("%10s %10s %s, %s\n",$2,$1,$3,$4) }'

..or in python:

#!/dcs/bin/python

import string

def main():
   file1 = open('file1','r')
   ids={}
   while 1:
      line = file1.readline()
      if not line: break
      fields=string.splitfields(line,',')
      if len(fields)>1:
         ids[fields[0]] = string.strip(string.joinfields(fields[1:],','))
   file1.close()
   file2 = open('file2','r')
   while 1:
      line = file2.readline()
      if not line: break
      fields = string.split(line)
      if ids.has_key(fields[1]):
         print fields[0],fields[1],ids[fields[1]]
   file2.close()

main()

Obviously, the python is longer - but I suspect it may be more clear,
also.  You needn't repeat "string." all the time, but I prefer to do
so, again for clarity.  "string.strip" removes leading and trailing
whitespace - aside from that, even if you don't know python, you
probably know exactly what all this is doing.

There really isn't any reason GNU join couldn't have an option to take
one file on stdin, and another on an arbitrary file descriptor.


<Okay, here's my situation:
<
<I've got two files, file1 and file2.  file1 contains data like this:
<
<SSN       Name
<-------------------
<123456789,Doe, John
<234567890,Doe, Jane
<345678901,
<456789012,One, Any
<567890123,One, Some
<
<file2 contains data like this:
<
< AMOUNT     SSN
<  10.00     234567890
<1000.00     123456789
< 100.00     456789012
<
<I want my output to look like this:
<
< AMOUNT     SSN       NAME
<---------------------------
<  10.00     234567890 Doe, Jane
<1000.00     123456789 Doe, John
< 100.00     456789012 One, Any
<
<I need to pull the names from file1 and put the corresponding names (with
<the correct SSN's) in file2.  Problem is, not every entry that is found
<in file1 is also in file2.  So a sort, join wouldn't work, unless join
<knows how to exclude fields that don't match up.  I haven't had much
<success with join - I can never get it to give me any output.
<
<Here are the issues I'm dealing with:
<
<1.) file1 contains records that are not in file2.  Those would have to be
<ignored. (as you can see above, line 3 in file1 has an SSN, but no name,
<so it will NOT have a corresponding SSN in file2.)
<
<2.) Neither file is sorted, so simply stripping the unnecesary entries
<from both files won't match up properly, especially since file one has
<more entries.
<
<My logic at this point is: "Compare the SSNs in file2 to file1.  When
<they match, pull the name field from file1 and append it to the end of
<the field in file2 for the output (to file3)."
<
<My question is, what unix utility/ies would work best for this?  I'm
<going to put it into a nightly crontab which will automate the tasks.  
<Any suggestions or hints?  Thanks!
<
<Dan

 
 
 

Joining common fields in files-Which Unix util should I use?

Post by Tad McClell » Tue, 21 Nov 1995 04:00:00


: I've got two files, file1 and file2.  file1 contains data like this:

: SSN       Name
: -------------------
: 123456789,Doe, John
: 234567890,Doe, Jane
: 345678901,
: 456789012,One, Any
: 567890123,One, Some

: file2 contains data like this:

:  AMOUNT       SSN
:   10.00       234567890
: 1000.00       123456789
:  100.00       456789012

: I want my output to look like this:

:  AMOUNT       SSN       NAME
: ---------------------------
:   10.00       234567890 Doe, Jane
: 1000.00       123456789 Doe, John
:  100.00       456789012 One, Any

: My logic at this point is: "Compare the SSNs in file2 to file1.  When
: they match, pull the name field from file1 and append it to the end of
: the field in file2 for the output (to file3)."

What do you want the program to do when the SSNs do _not_ match? Your
specification is incomplete, so I've made an assumption to solve the
problem.

I think this Perl script does what you want:

-------------------------------------
#!/usr/bin/perl -w

# create hash with SSN as index...

open(F1, "file1") || die "could not open file1\n";
while (<F1>) {
   chop;
   ($ssn, $name) = split(/,\s*/, $_, 2);
   $person{$ssn} = $name;

Quote:}

close(F1);

# append name if possible

open(F2, "file2") || die "could not open file2\n";
while (<F2>) {
   chop;
   $ssn = "";
   $ssn = $1 if /([0-9]+)$/;
   if ( $person{$ssn} )
      { $_ .= " " . $person{$ssn}; }
   else
      { $_ .= " no name on file"; }
   print "$_\n";

Quote:}

close(F2);
-------------------------------------

--
  Tad McClellan,      Logistics Specialist (IETMs and SGML guy)

O.J.'s login: //\/esc (slash-slash-backslash-slash-escape)

 
 
 

1. Join 2 Files With No Common Field



Try 'paste FICH-A FICH-B'

--
*******************************************************************************
*   Didier Duquennoy                            Geomath International         *
*                                               232 Avenue Napoleon Bonaparte *

*                                               92502 Rueil-Malmaison Cedex   *
*                                               France                        *
*                                                                             *
*                                               Tel : [33].1.47.08.82.47      *
*                                               Fax : [33].1.47.08.82.48      *
*******************************************************************************

2. Monitoring other users on host

3. Join - 2 Files No Common Field.

4. Printing with LPR(ng)

5. joining two files with a common column

6. need help with exabyte 8505 tape drive

7. Need to join fields in two files

8. Hiding Login from other Users.

9. merging 2 files with common field(s)

10. Join two files with join keys in reverse sort order on AIX

11. Compare file modification date using tcsh built-ins

12. Using awk or cut to print every field after the 11th field

13. Need util to merge UNIX group files