SED: parsing a 'flexible' text file ... in over my head!

SED: parsing a 'flexible' text file ... in over my head!

Post by OpenMacNew » Sun, 02 Nov 2003 13:19:38



hi all,

i have a massive text-data file that i need to extract key data from,
and format the output.

i initially tried a combination of MS Word & Excel ... massive failure
there!

because of the input file's not-quite-strict format, my best guess is
that a "regular expression" script using a text-parser like SED is the
way to go.

i'm new to SED, and have read through the tutorials i could find.  one
line at a time, i'm OK, but processing this file -- well, i'm in over
my head.

can anyone out there help with a script that'll do the job, assuming
that SED is even the right tool?

here's the situation:

###############
# INPUT FILE

the data is a simple text file with the following repeating block of
information:
note:
    (a) "separator_text" is the same every time
    (b) the NUMBER of data elements under the 2nd variable (varName_2)
in each block varies
         -- e.g., 3 elements under the 1st instance of varName_2 & 2
elements under the 2nd instance

separator_text

(random_text & CRs)

    varName_1
        var_data_1

(random_text & CRs)

    varName_2
        var_data_2a
        var_data_2b
        var_data_2c

(random_text & CRs)

    varName_3
        var_data_3

(random_text & CRs)

separator_text

(random_text & CRs)

    varName_1
        var_data_4

(random_text & CRs)

    varName_2
        var_data_5a
        var_data_5b

(random_text & CRs)

    varName_3
        var_data_6

(random_text & CRs)

separator_text

###############
# OUTPUT FILE

the goal is to extract all blocks of text data from the INPUT FILE and
format the output in a tab-delimited table as follows.

note:  i need to create a complete record for EACH
multiple_variable_value ...

e.g., the output for the two blocks above would look like:

varName1    (tab)    varName2       (tab)   varName3
var_data_1  (tab)    var_data_2a    (tab)   var_data_3
var_data_1  (tab)    var_data_2b    (tab)   var_data_3
var_data_1  (tab)    var_data_2c    (tab)   var_data_3
var_data_4  (tab)    var_data_5a    (tab)   var_data_6
var_data_4  (tab)    var_data_5b    (tab)   var_data_6

This looked really simple at the start ... but so far I've only managed
to generate random garbage.

I'd very much appreciate anyone who could suggest a complete script
that would do the job!

Thanks,

Richard

 
 
 


SED: parsing a 'flexible' text file ... in over my head!

Post by Icarus Sparr » Sun, 02 Nov 2003 16:44:55



> hi all,

> i have a massive text-data file that i need to extract key data from,
> and format the output.

> i initially tried a combination of MS Word & Excel ... massive failure
> there!

> because of the input file's not-quite-strict format, my best guess is
> that a "regular expression" script using a text-parser like SED is the
> way to go.

> i'm new to SED, and have read through the tutorial i could find.  one
> line at a time, i'm OK, but processing this file -- well, i'm in over
> my head.

> can anyone out there help with a script that'll do the job, assuming
> that SED is even the right tool?

> here's the situation:

> ###############
> # INPUT FILE

> the data is a simple text file with the following repeating block of
> information:
> note:
>     (a) "separator_text" is the same every time
>     (b) the NUMBER of data elements under the 2nd variable (varName_2)
> in each block varies
>          -- e.g., 3 elements under the 1st instance of varName_2 & 2
> elements under the 2nd instance

> separator_text

> (random_text & CRs)

>     varName_1
>         var_data_1

I find the problem underspecified. Is the indentation significant? If so, can
we assume that the part marked as 'random_text & CRs' does not have any
spaces at the start of the lines?

Are the names "varName_1", "varName_2" and "varName_3" fixed, or do we
have to deduce them from the input file? Will there always be 3?

How can we tell when we have finished reading data values for the second
variable, and are into the 'random_text & CRs' before the third variable
name?

I think the correct tool to use here is 'awk', rather than 'sed'.

 
 
 

SED: parsing a 'flexible' text file ... in over my head!

Post by OpenMacNew » Mon, 03 Nov 2003 01:03:12




Icarus,

Thanks for taking the time to reply, and asking the right questions to
help think this through!

> I find the problem underspecified.

Fair enough.

> Is the indentation significant?

No, the indentation is not significant ... I just added it here for
reader clarity.

> If so, can we assume that the part marked as 'random_text & CRs' does not have any
> spaces at the start of the lines?

In the case of the "random text & CRs", they're always present, in
varying quantity and content (e.g., comment sections ...), and some
lines *do* have spaces at the start -- not all, though.

> Are the names "varName_1", "varName_2" and "varName_3" fixed, or do we
> have to deduce them from the input file? Will there always be 3?

yes, "varName_1", "varName_2" and "varName_3" are fixed ... exactly the
same in both text and capitalization from instance to instance.  of
course, they're different from each other ...

> How can we tell when we have finished reading data values for the second
> variable, and are into the 'random_text & CRs' before the third variable
> name?

there are *always* at least 2 CR's after the last varData_N element
beneath varName_2 ... e.g.:

varName_2
   varData_2a
   varData_2b
...
   varData_2z
CR
CR
(random text, possibly starting with spaces)
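That termination rule can be sketched directly in awk: collect lines after the varName_2 header until the first empty line. A minimal standalone sketch (the data names here are made up for illustration):

```shell
# Collect the values listed under varName_2, stopping at the first
# blank line -- the "at least 2 CRs after the last element" rule above.
printf 'varName_2\n   varData_2a\n   varData_2b\n\n\nrandom text\n' |
awk '
/^varName_2$/   { grab = 1; next }            # header starts collection
grab && NF == 0 { grab = 0 }                  # first blank line ends it
grab            { sub(/^[ \t]+/, ""); print } # strip indent, emit value
'
```

Here `grab` plays the same role a state variable would in a larger script.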

> I think the correct tool to use here is 'awk', rather than 'sed'.

awk?

(looking in awk man page)

eeeek!

my first inclination is to ask "why", but since I know next to nothing
of awk, i'll have to take your word for it.

thanks!

richard

 
 
 

SED: parsing a 'flexible' text file ... in over my head!

Post by Icarus Sparr » Mon, 03 Nov 2003 02:29:32





> Icarus,

> Thanks for taking the time to reply, and asking the right questions to
> help think this through!

>> I find the problem underspecified.
....
> my first inclination is to ask "why", but since I know next to nothing
> of awk, i'll have to take your word for it.

The main thing that makes me think 'awk' rather than 'sed' is that you have
to store a potentially very large number of values for varName_2, which
makes me want a tool with arrays. Yes, one can do it with sed using
the hold space, but it gets messy quickly.

Anyway, put the following into a file, e.g. awkprog, and then run

awk -f awkprog datafile
==================cut here=========
#!/bin/awk -f
BEGIN {
    state="look for separator";
    OFS="\t";
    print "varName_1","varName_2","varName_3"

}

#{ print state,$0} #uncomment for debugging
/separator_text/ && state=="look for separator" { state="look for var1"}
/varName_1/ && state=="look for var1" { getline v1 ; state="look for var2"}
/varName_3/ && state=="look for var3" {
    getline v3 ;
    for(i=0;i<number_of_2;i++) {
        print v1,store2[i],v3;
            }
    state="look for separator"
}

/varName_2/ && state=="look for var2" {
    number_of_2=0;
    getline t;
    while (t != "") {
        store2[number_of_2++]=t;
        getline t;
    }
    state="look for var3";
}

==================cut here=========
This produces the desired results for me on your sample data, once I have
removed the extra indentation that you say you added.

In normal operation awk runs over the lines it is given, running rules
against each line, in a similar manner to sed. In this case the rules are
fairly simple: just see whether the input line matches a particular pattern
and whether we are expecting it. If we do see it, then do the "right thing".

I take a couple of shortcuts, for instance setting OFS to a tab so that I
don't need to print tabs myself.
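For example (a standalone one-liner, not part of the script above):

```shell
# OFS is placed between print's comma-separated arguments,
# so setting it once gives tab-delimited output everywhere.
echo 'a b c' | awk 'BEGIN { OFS = "\t" } { print $1, $2, $3 }'
```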

This should work with any reasonably recent version of awk. You may well
have many versions of awk on your system, under names like awk, nawk, gawk,
mawk. If you need it, the source for awk is available from the author's home
page http://cm.bell-labs.com/cm/cs/who/bwk/

 
 
 

SED: parsing a 'flexible' text file ... in over my head!

Post by William Par » Tue, 04 Nov 2003 15:04:41



> separator_text

> (random_text & CRs)

>    varName_1
>        var_data_1

> (random_text & CRs)

>    varName_2
>        var_data_2a
>        var_data_2b
>        var_data_2c

> (random_text & CRs)

>    varName_3
>        var_data_3

> (random_text & CRs)

> separator_text

> (random_text & CRs)

>    varName_1
>        var_data_4

> (random_text & CRs)

>    varName_2
>        var_data_5a
>        var_data_5b

> (random_text & CRs)

>    varName_3
>        var_data_6

> (random_text & CRs)

> separator_text
> varName1    (tab)    varName2       (tab)   varName3
> var_data_1  (tab)    var_data_2a    (tab)   var_data_3
> var_data_1  (tab)    var_data_2b    (tab)   var_data_3
> var_data_1  (tab)    var_data_2c    (tab)   var_data_3
> var_data_4  (tab)    var_data_5a    (tab)   var_data_6
> var_data_4  (tab)    var_data_5b    (tab)   var_data_6

I shall assume that 'varName1' and 'varName3' have only one data value each,
and that the length of the table is determined by the 'varName2' data.  Also,
they appear in the order 'varName1', 'varName2', and then 'varName3'.

1.  First change the format from
        aaa
            bbb
            ccc
    to
        aaa bbb ccc
    Let's see...

        awk -v RS='' '/varName/ {
            for (i=1; i<=NF; i++) printf "%s ", $i
            print ""
        }' infile

2.  Read the lines, and assign
        varName_1=...
        varName_2=...
        varName_3=...
    And, test for 'varName3' which is the last variable in a block.
    So...

        awk -v RS='' '/varName/ {
            for (i=1; i<=NF; i++) printf "%s ", $i
            print ""
        }' infile | while read var data; do
            case "$var" in
                varName_1) varName_1=$data ;;
                varName_2) varName_2=( $data ) ;;
                varName_3) varName_3=$data
                    for i in ${varName_2[*]}; do
                        echo $varName_1 $i $varName_3
                    done ;;
            esac
        done
--

Linux solution for data management and processing.

 
 
 

SED: parsing a 'flexible' text file ... in over my head!

Post by OpenMacNew » Wed, 05 Nov 2003 02:42:11


hi,

Both of your hints worked nicely for understanding/processing my spec'd
example.  Thanks!

However, as I started exploring this data further, and another file, I
recognized that I'd STILL underspecified the problem ...

So, I "dove in", picked up a copy of "sed & awk" (O'Reilly), and
started to work.

I worked on a FlatFile of company/contact information, again with
multiple contact names, phone numbers, sic codes, etc. per record ...

What can I say, "awk" is simply cool! :-)  Who knew?

It probably "ain't pretty" (and not quite done), but the code below
actually WORKS (!!!) for me in extracting the following data format
example ...  I'll admit that until I got the hang of it a bit, I had
some "interesting" output.  It's also clear that awk, when "written
well", can be incredibly compact.

I'm posting my efforts/progress here just in case it may help someone down
the line.

Thanks again for your initial help -- you've created a monster!

Richard

=====================================================================
EXAMPLE DATA (for this example):

Beginning of Company Record

Generic Corp.       (XXXX)

999 StreetName Dr.
SomeCity, AnyState
99999
United States

Tel:
(555) 555-5555
USA(800)555-5555
Fax: FX - USA(555)555-5555

Business
IndustryName: Brief Description of What the Company Does.

Variant Name
OtherCompanyName Corp. (Brief Descrip)  - Merger

SIC Codes
9999 - sic code category name

NAICS Codes
999999 - naics code category name

Annual Sales
$9,999.99 M Sales, Form 10-K

Employees
99,999,           Form 10-K

Sales/Employees
$999,999.99

Year Founded
1900

Fiscal Year
Dec 31, 2003

Features
More Descriptive text

Stock Exch
NASDAQ

Ticker
XXXX

URL
http://www.generic_corp.tld

Toll Free Telephone Number
USA(800)555-5555

Email Address
nothing@generic_corp.tld

Officers
first a. last - Chief Executive Officer and President
first b. last - Chief Financial Officer
first c. last - Vice President, Sales
first d. last - Chief Technical Officer
first last - Vice President, Human Resources
first last - Senior Vice President, Marketing and Sales
first e. last - Vice President, Operations

Beginning of Company Record

=====================================================================
EXAMPLE OUTPUT: (yes, it can be easily formatted now in table form ...)

Name:           Generic Corp.       (XXXX)
Address1:       999 StreetName Dr.
City:           SomeCity
State:          AnyState
Zip:            99999
Country:        United States
Phone1:         (555) 555-5555
Phone2:         USA(800)555-5555
Business:       IndustryName: Brief Description of What the Company Does.
Sales:          $9,999.99 M
Employee:       99,999
Officer1:       first  a. last   - Chief Executive Officer and President
Officer2:       first  b. last   - Chief Financial Officer
Officer3:       first  c. last   - Vice President, Sales
Officer4:       first  d. last   - Chief Technical Officer
Officer5:       first  last      - Vice President, Human Resources
Officer6:       first  last      - Senior Vice President, Marketing and Sales
Officer7:       first  e. last   - Vice President, Operations

=====================================================
GAWK FILE (for this example):

#!/bin/gawk -f

BEGIN {
   state = "search_separator";

   OFS = "\t";
   ORS = "\n";
   FS = " ";
   RS = "\n";

   blankline = "/^$/";
   separator = "Beginning of Company Record";

   checkarray["blankline"]=blankline ;
   checkarray["separator"]=separator ;
   print "AA","BB","CC","DD","EE","FF""\n";

}

# find the beginning of the Company Data Record
state == "search_separator" {
   if ($0 ~ separator) {
      state = "search_companyname";
      print "";
      next;
   }

}

# Company Name
state == "search_companyname" {
   if ( $0 in checkarray) {
      next;
   } else if ($0 ~ /^[[:alnum:]\. ]+/) {
      var_companyname = $0;
      state = "search_address";
      next;
   }

}

# Address
state == "search_address" {
   i = 1;
   checkarray["search_address"] = var_companyname
   if ( $0 in checkarray) {
      next;
   } else if ( $0 ~ /^[[:alnum:]\. ]+/) {
      address[i] = $0;
      i++;
      while (address[i-1] != "") {
         getline address[i];
         i++;
      }
      i_max = i-2;

# counting backwards from @imax ...
# allow for 1 or 2 line addresses
      if (i_max-4 != 0) {
         var_address1 = address[i_max-4];
         var_address2 = address[i_max-3];
      } else {
         var_address1 = address[i_max-3];
         var_address2 = "";
      }
      var_citystate = address[i_max-2];
      var_zip = address[i_max-1];
      var_country = address[i_max];

# split CityState into City & State
      FS = ", ";
      $0 = var_citystate;
#     print var_citystate;
      var_city = $1;
      var_state = $2;
      FS = " ";

# print Address
   print "Name:    ", var_companyname;
   print "Address1:", var_address1;
   if (var_address2 != "") print "Address2:", var_address2;
   print "City:    ", var_city;
   print "State:   ", var_state;
   print "Zip:     ", var_zip;
   print "Country: ", var_country;
   state = "search_phone";
   }

# move on to next line
   next;

}

# Telephone(s)
/^Tel:/ && state == "search_phone" {
   FS_temp = FS; RS_temp = RS; OFS_temp = OFS; ORS_temp = ORS;

   FS = "\n";
   RS = "\nFax:";
   getline;

   OFS = "\t";
   ORS = "\n";
   var_phone_count = NF;
   for ( i=1; i<=NF; i++ ) {
      var_phone[i] = $i;
      print "Phone"i":   ", $i;
   }

   FS = FS_temp; RS = RS_temp; OFS = OFS_temp; ORS = ORS_temp;

   state = "search_business";
   next;

}

# Business
/^Business/ && state == "search_business" {
   getline var_business;
   print "Business:", var_business;
   state = "search_sales";
   next;
   }

# Sales
/^Annual Sales/ && state == "search_sales" {
   getline;
   var_sales = $1;
   print "Sales:   ", var_sales " M";
   state = "search_employees";
   next;

}

# Employees
/^Employees/ && state == "search_employees" {
   FS_temp = FS; RS_temp = RS; OFS_temp = OFS; ORS_temp = ORS;

   FS = ", ";
   getline;
   var_employees = $1;
   print "Employee:", var_employees;

   FS = FS_temp; RS = RS_temp; OFS = OFS_temp; ORS = ORS_temp;

   state = "search_officers";
   next;

}

# Officers
/^Officers/ && state == "search_officers" {
   FS_temp = FS; RS_temp = RS; OFS_temp = OFS; ORS_temp = ORS;

   FS = " - ";
   RS = "\n";
   getline;
   i=0;
   while ($0 != "") {
      i++;
      var_officer_name[i] = $1;
      var_officer_title[i] = $2;
      getline;
   }
   var_officer_count = i;

# split LastName, FirstName, MI
   for ( i=1; i<=var_officer_count; i++ ) {
      FS = " ";
      RS = "\n";
      $0 = var_officer_name[i];
      var_officer_firstname[i] = $1;
      var_officer_lastname[i] = $NF;
      var_officer_mi[i] = "";
      if (NF > 2) {
         for (j=2; j<= (NF-1); j++) {
            var_officer_mi[i] = var_officer_mi[i]" "$j;
         }
      }
   print "Officer"i":   ", var_officer_firstname[i]" "var_officer_mi[i]" "var_officer_lastname[i], " - "var_officer_title[i];
   FS = FS_temp; RS = RS_temp; OFS = OFS_temp; ORS = ORS_temp;
   }

   state = "search_last";
   next;

}

# last
state == "search_last" {
   state = "search_separator";
}

 
 
 

SED: parsing a 'flexible' text file ... in over my head!

Post by Icarus Sparr » Wed, 05 Nov 2003 13:36:08



> hi,

> Both of your hints worked nicely for understading/processing my spec'd
> example.  Thanks!

> However, as I started exploring this data further, and another file, I
> recognized that I'd STILL underspecified the problem ...

I suspected that was the case. As a general comment if you can tell us
exactly what your problem is, there is a much better chance we can help.
Otherwise we have to try and coax the information out of you.

> I worked on a FlatFile of company/contact information, again with
> multiple contact names, phone numbers, sic codes, etc. per record ...

> What can I say, "awk" is simply cool! :-)  Who knew?

If you like awk, you will probably love perl. The down side is that perl is
not installed everywhere yet. Seeing how complicated your data is, I am
pleased that we managed to talk you out of using sed.

> It probably "ain't pretty" (and not quite done), but the code below
> actually WORKS (!!!) for me in extracting the following data format
> example ...  I'll admit that until I got the hang of it a bit, I had
> some "interesting" output.  It's also clear that awk, when "written
> well" can be incredibly compact.

I deliberately left in the following line
        #{ print state,$0} #uncomment for debugging

If you wish to improve your program, I suggest that you look at the "split",
"sub" and "gsub" functions in awk; they will enable you to re-write some
of your code in a more idiomatic manner, e.g. where you are splitting the
officer names.
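For instance, pulling the name and title apart with split might look like this (a sketch only; the "name - title" layout is assumed from your sample data):

```shell
# split() cuts the officer line at " - " into name and title,
# then a second split() cuts the name into words.
echo 'first a. last - Chief Executive Officer' |
awk '{
    split($0, part, / - /)           # part[1] = name, part[2] = title
    w = split(part[1], word, / /)    # word[1] = first, word[w] = last
    print word[1], word[w], part[2]
}'
```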

Just in case you don't know it, the style of program you have written is a
"Finite State Machine" or FSM.

 
 
 

SED: parsing a 'flexible' text file ... in over my head!

Post by <u.. » Thu, 06 Nov 2003 04:19:14




> > recognized that I'd STILL underspecified the problem ...

> I suspected that was the case. As a general comment if you can tell us
> exactly what your problem is, there is a much better chance we can help.
> Otherwise we have to try and coax the information out of you.

fair enuf ... but it DOES presume that i understand what the heck
i'm trying to do in the first place!  ;-)

in this case it was the COAXING that helped as much as the info!

> > What can I say, "awk" is simply cool! :-)  Who knew?

> If you like awk, you will probably love perl. The down side is that perl is
> not installed everywhere yet. Seeing how complicated your data is, I am
> pleased that we managed to talk you out of using sed.

I've perl 5.8.1 installed for website purposes ... mainly allowing apps
that need it to use it ...

QUESTION:  in your opinion, would this task-o-mine be better served by
Perl or awk?

> If you wish to improve your program, I suggest that you look at the "split",
> "sub" and "gsub" statements in awk, they will enable you to re-write some
> of your code in a more idiomatic manner, e.g. where you are splitting the
> officer names.

oh!  fell asleep before I got to that chapter ... thanks!  the
applications are clear.

> Just in case you don't know it, the style of program you have written is a
> "Finite State Machine" or FSM.

nope. didn't know that.  i simply hope that it's a good thing!

yeah, yeah  ..... i'll "look it up"! :-)

 
 
 

SED: parsing a 'flexible' text file ... in over my head!

Post by Icarus Sparr » Thu, 06 Nov 2003 05:14:45



> QUESTION:  in your opinion, would this task-o-mine be better served by
> Perl or awk?

Both will do the job perfectly well. I would tend to choose PERL unless you
are running on a machine with limited CPU/memory.

There are a number of reasons, the biggest being the debugger. Being able to
single step through the program and print out the current value of
variables can be a big help.

Both AWK and PERL allow you to write complicated programs in a concise
manner. A rule of thumb says that any good programmer can understand 20,000
lines of code. It doesn't matter much what the language is, from assembler,
through C, C++, BASIC, and APL, to AWK and PERL. The advantage is that you
can do a lot more in 20,000 lines of PERL script than you can do in 20,000
lines of assembler.

You can do more with 20,000 lines of PERL than with 20,000 lines of AWK.

However, you posted to comp.unix.shell, so you got a solution based on sh and
common unix tools, rather than a PERL solution. Your investment in the 'sed
& awk' book is not wasted.  Both of these are fine tools, and having their
power at your fingertips is always useful.

 
 
 

SED: parsing a 'flexible' text file ... in over my head!

Post by OpenMacNew » Thu, 06 Nov 2003 05:36:09




> > QUESTION:  in your opinion, would this task-o-mine be better served by
> > Perl or awk?

> Both will do the job perfectly well. I would tend to choose PERL unless you
> are running on a machine with limited CPU/memory.

> There are a number of reasons, the biggest being the debugger. Being able to
> single step through the program and print out the current value of
> variables can be a big help.

excellent point!

> Both AWK and PERL allow you to write complicated programs in a concise
> manner. A rule of thumb says that any good programmer can understand 20,000
> lines of code. It doesn't matter much what the language is, from assembler,
> through C, C++, BASIC, and APL, to AWK and PERL. The advantage is that you
> can do a lot more in 20,000 lines of PERL script than you can do in 20,000
> lines of assembler.

(should've done it in Fortran <-- showing my age!)

> You can do more with 20,000 lines of PERL than with 20,000 lines of AWK.

> However, you posted to comp.unix.shell, so you got a solution based on sh and
> common unix tools, rather than a PERL solution. Your investment in the 'sed
> & awk' book is not wasted.  Both of these are fine tools, and having their
> power at your fingertips is always useful.

well, IF I'D KNOWN !! ya see, if there were a
comp.sys.somewhat_confused, I probably could've 'cut to the chase' more
quickly!  :-0

and, in the spirit of keep-your-eye-on-the-ball ... awk is working
nicely to solve my problem.

thanks again for all your help!

cheers,

richard

 
 
 
