parsing a 'flexible' text file with SED: i'm in over my head ....

Post by OpenMacNew » Sun, 02 Nov 2003 12:30:44



hi all,

i have a massive text-data file that i need to extract key data from,
and format the output.

i initially tried a combination of MS Word & Excel ... massive failure
there!

because of the input file's not-quite-strict format, my best guess is
that a "regular expression" script using a text-parser like SED is the
way to go.

i'm new to SED, and have read through the tutorials i could find, as
well as some of the posts here.  one line at a time, i'm pretty much
OK, but processing this file -- well, i'm in over my head.

can anyone out there help with a script that'll do the job, assuming
that SED is even the right tool?

here's the situation:

###############
# INPUT FILE

the data is a simple text file with the following repeating block of
information.
note:
    (a) "separator_text" is the same every time
    (b) the NUMBER of data elements under the 2nd variable (varName_2)
        varies from block to block
        -- e.g., 3 elements under the 1st instance of varName_2 & 2
        elements under the 2nd instance

separator_text

    (random_text & CRs)

    varName_1
        var_data_1

(random_text & CRs)

    varName_2
        var_data_2a
        var_data_2b
        var_data_2c

(random_text & CRs)

    varName_3
        var_data_3

(random_text & CRs)

separator_text

(random_text & CRs)

    varName_1
        var_data_4

(random_text & CRs)

    varName_2
        var_data_5a
        var_data_5b

(random_text & CRs)

    varName_3
        var_data_6

(random_text & CRs)

separator_text

###############
# OUTPUT FILE

the goal is to extract all blocks of text data from the INPUT FILE and
format the output as a tab-delimited table, as follows.

note:  i need to create a complete record for EACH
multiple_variable_value ...

e.g., the output for the two blocks above would look like:

varName1    (tab)    varName2       (tab)   varName3
var_data_1  (tab)    var_data_2a    (tab)   var_data_3
var_data_1  (tab)    var_data_2b    (tab)   var_data_3
var_data_1  (tab)    var_data_2c    (tab)   var_data_3
var_data_4  (tab)    var_data_5a    (tab)   var_data_6
var_data_4  (tab)    var_data_5b    (tab)   var_data_6

This looked really simple at the start ...  but I've simply managed to
generate random garbage.

I'd very much appreciate it if anyone could suggest a complete script
that would do the job!

Thanks,

Richard

 
 
 

Post by Lincoln DeCours » Mon, 03 Nov 2003 18:20:06


I would suggest that awk may be a better fit than sed here.

The trick will be to properly handle the random_text & CRs
mentioned below.  How can you differentiate between varName_1,
which you presumably can't predict, and random_text & CRs?
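If it turns out that the variable names are fixed, known strings, the
problem gets much easier.  Here is a minimal awk sketch under that
assumption; it is not a tested solution for the real file.  It assumes
(none of this is confirmed by the post) that the three names appear
literally as varName_1, varName_2 and varName_3 on lines of their own,
that the data lines follow the name line directly, and that a blank line
ends each data list.  The function name blocks_to_table is made up for
illustration.

```shell
#!/bin/sh
# Hypothetical helper: reads the block-structured file on stdin and
# writes a tab-delimited table on stdout.
# Assumptions: the names are known literal strings, data lines directly
# follow their name line, and a blank line ends each data list.
blocks_to_table() {
    awk -v sep="separator_text" \
        -v n1="varName_1" -v n2="varName_2" -v n3="varName_3" '
    # Print one row per collected varName_2 value, then reset the block.
    function emit_block(   i) {
        if (v1 != "" && v3 != "")
            for (i = 1; i <= cnt; i++)
                print v1 "\t" v2[i] "\t" v3
        v1 = ""; v3 = ""; cnt = 0; field = ""
    }
    BEGIN { print n1 "\t" n2 "\t" n3 }      # header row
    {
        line = $0
        sub(/\r$/, "", line)                # tolerate DOS line endings
        gsub(/^[ \t]+|[ \t]+$/, "", line)   # trim surrounding whitespace
    }
    line == sep { emit_block(); next }      # separator closes a block
    line == n1 || line == n2 || line == n3 { field = line; next }
    line == "" { field = ""; next }         # blank line ends a data list
    field == n1 { v1 = line; next }
    field == n2 { v2[++cnt] = line; next }
    field == n3 { v3 = line; next }
    # anything else is random text outside a data list and is ignored
    END { emit_block() }                    # in case the last separator is missing
    '
}
```

Usage would be something like "blocks_to_table < input.txt > table.tsv",
where input.txt stands in for the real file.  Rows are held back until
the block ends because var_data_3 comes after the varName_2 values, so a
row cannot be completed any earlier.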

As you describe it, your "separator_text" mostly just separates
random_text from more random_text (see the quoted material).

You might need to understand your source file better, so that you can
predict exactly where the data is and how it is grouped.  Guesswork, or
unexpected variations in the input's source and format, can cause data
elements to be missed, and that is costly.
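One way to reduce the risk of silently missing data is to audit the file
before extracting anything.  The sketch below is only an illustration,
assuming that the names are the literal strings varName_1, varName_2 and
varName_3 and that "separator_text" sits on a line of its own; the
function name audit_blocks is made up.  It checks only that each name
appears in each block, not that data follows it.

```shell
#!/bin/sh
# Hypothetical helper: lists every block that lacks one of the expected
# variable names.  Blocks are numbered from 1, starting at the first
# separator line; text before the first separator is not checked.
audit_blocks() {
    awk -v sep="separator_text" \
        -v n1="varName_1" -v n2="varName_2" -v n3="varName_3" '
    # Report each name that never appeared in the block just finished.
    function check() {
        if (blk > 0 && lines > 0) {
            if (!(n1 in seen)) print "block " blk ": missing " n1
            if (!(n2 in seen)) print "block " blk ": missing " n2
            if (!(n3 in seen)) print "block " blk ": missing " n3
        }
        split("", seen)                     # portable way to clear an array
        lines = 0
    }
    BEGIN { blk = 0 }
    {
        line = $0
        sub(/\r$/, "", line)                # tolerate DOS line endings
        gsub(/^[ \t]+|[ \t]+$/, "", line)   # trim surrounding whitespace
    }
    line == sep { check(); blk++; next }
    line != "" { lines++ }                  # count non-blank lines in the block
    line == n1 || line == n2 || line == n3 { seen[line] = 1 }
    END { check() }
    '
}
```

If "audit_blocks < input.txt" prints nothing, every non-empty block
contains all three names; any line it does print points at a block worth
inspecting by hand.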

Posting an actual sample of the source file might be more fruitful still.

Lincoln

Quote:>     (a) "separator_text" is the same everytime
>     (b) the NUMBER of data elements under the 2nd variable (varName_2)
> in each block varies
>          -- e.g., 3 elements under the 1st instance varName_2 & 2
> elements under the 2nd instance

> separator_text

>     (random_text & CRs)

>     varName_1
>         var_data_1

> (random_text & CRs)

>     varName_2
>         var_data_2a
>         var_data_2b
>         var_data_2c

> (random_text & CRs)

>     varName_3
>         var_data_3

> (random_text & CRs)

> separator_text

> ###############
> # OUTPUT FILE
> the goal is to extract all blocks of text data from the INPUT FILE and
> format the output in a tab-delineated table as follows.

> varName1    (tab)    varName2       (tab)   varName3
> var_data_1  (tab)    var_data_2a    (tab)   var_data_3
> var_data_1  (tab)    var_data_2b    (tab)   var_data_3
> var_data_1  (tab)    var_data_2c    (tab)   var_data_3
> var_data_4  (tab)    var_data_5a    (tab)   var_data_6
> var_data_4  (tab)    var_data_5b    (tab)   var_data_6
> Thanks,

> Richard


 
 
 
