faster way to extract fields from large file?

faster way to extract fields from large file?

Post by Michael » Tue, 28 Mar 2000 04:00:00



Hi,

I am extracting 4 fields from a file with more than 5000 records for
processing using the standard sed, awk/cut constructs - nothing special.
The script works, but it is very slow - approx one line per second. One
of the fields is a date field which I convert to milliseconds, by
storing each time field into its own variable prior to time conversion.

Is there another method I can use to speed up the processing? Maybe I
shouldn't be using a script for such a large job. I have written this in
C also, which is a hell of a lot faster as expected. Thank you for any
suggestions.
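The cost described above is easy to see in a compressed sketch of that pattern (field layout invented for illustration): every extracted field runs its own echo | cut pipeline, and those process forks, repeated for each of the 5000 records, dominate the runtime.

```shell
# Sketch of the per-record pattern: each $(...) spawns a fresh
# pipeline, so several processes are forked for every single line.
line="f1 f2 f3 09:15:00.250"
time=$(echo "$line" | cut -f4 -d " ")   # one pipeline per record...
hour=$(echo "$time" | cut -f1 -d ":")   # ...and another per field
echo "$hour"                            # prints 09
```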

Michael


faster way to extract fields from large file?

Post by Charles Demas » Tue, 28 Mar 2000 04:00:00



>Hi,

>I am extracting 4 fields from a file with more than 5000 records for
>processing using the standard sed, awk/cut constructs - nothing special.
>The script works, but it is very slow - approx one line per second. One
>of the fields is a date field which I convert to milliseconds, by
>storing each time field into its own variable prior to time conversion.

>Is there another method I can use to speed up the processing? Maybe I
>shouldn't be using a script for such a large job. I have written this in
>C also, which is a hell of a lot faster as expected. Thank you for any
>suggestions.

Not to be flip, but sometimes people posting here don't really state
what the real problem or constraints are.

If you want help, post what you have done, with some sample input and
expected output.  You might be doing something inefficiently, and
without looking at the code etc, it's hard to tell or advise you.

Chuck Demas
Needham, Mass.

--
  Eat Healthy    |   _ _   | Nothing would be done at all,
  Die Anyway     |    v    | That no one could find fault with it.



faster way to extract fields from large file?

Post by Michael » Tue, 28 Mar 2000 04:00:00


Hi Charles,

I understand what you mean, I should have posted something. Here is the
ksh script (see below function convert2millisecs). Another problem I am
having is placing the code which does the conversion into a function, in
order to reuse the function to process start and end times for jobs I am
running. The time format is HH:MM:SS.sss where sss is milliseconds.

Cannot get this function to work properly. It is currently called by the
following code

let totalStrtSeconds=$(echo `convert2millisecs $starttime`)

I need to get the return value of the function and assign it to the
variable
totalStrtSeconds, rather than recurse. Anyway, thanks for any help you
can offer, I really appreciate it.

Michael

-------------function code---------------

function convert2millisecs {
        # Extract time field and convert to milliseconds
        hour=$(echo "$1" | cut -f1 -d ":")
        let hour2Millisecs=strtHour*60*60*1000
        minute=$(echo "$1" | cut -f2 -d ":")
        let minute2Millisecs=strtMinute*60*1000
        temp=$(echo "$1" | cut -f3 -d ":")      
        second=$(echo "$temp" | cut -f1 -d ".")
        let sec2Millisecs=strtSecond*1000
        millisecs=$(echo "$temp" | cut -f2 -d ".")

        echo $((hour2Millisecs+minute2Millisecs+sec2Millisecs+millisecs))
}

---------end of function--------------
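For what it's worth, the assignment Michael is after needs no let and no recursion: a function that echoes its result can be captured with plain command substitution. A minimal sketch where to_ms is a simplified, hypothetical stand-in for convert2millisecs (awk does the split and the arithmetic in one process):

```shell
# Hypothetical stand-in for convert2millisecs: split HH:MM:SS.sss on
# ":" or "." and reduce to milliseconds in a single awk process.
to_ms() {
    echo "$1" | awk -F'[:.]' '{ print ((($1 * 60) + $2) * 60 + $3) * 1000 + $4 }'
}

# Capture the function's stdout with plain $( ); no extra echo or
# backticks needed, and nothing recurses.
totalStrtSeconds=$(to_ms "01:02:03.004")
echo "$totalStrtSeconds"    # prints 3723004
```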

----------start of kshscript----------

#!/bin/ksh

FILENAME=${1##*/}

let linecount=1
let totallinecount=`wc -l ${FILENAME} | awk '{print $1}'`

while (( $linecount <= $totallinecount ))
do      
        # Extract finish time
        if (( $linecount % 2 == 0 )); then
            finishtime=`sed -n "${linecount}p" ${FILENAME} | awk '{print $4}'`

            finHour=$(echo "$finishtime" | cut -f1 -d ":")
            let finHour2Millisecs=finHour*60*60*1000
            finMinute=$(echo "$finishtime" | cut -f2 -d ":")
            let finMinute2Millisecs=finMinute*60*1000
            temp=$(echo "$finishtime" | cut -f3 -d ":")
            finSecond=$(echo "$temp" | cut -f1 -d ".")
            let finSec2Millisecs=finSecond*1000
            finMillisecs=$(echo "$temp" | cut -f2 -d ".")    # milliseconds
            let totalFinMilliSecs=finHour2Millisecs+finMinute2Millisecs+finSec2Millisecs+finMillisecs

            let count=1
            let duration=totalFinMilliSecs-totalStrtMilliSecs

            # time stats
            print "name\t\t$duration\t\t\t\t$starttime\t\t$finishtime"
        fi

        # Extract start time
        nextline=`sed -n "${linecount}p" ${FILENAME}`  # | awk '{print $4}'
        starttime=$(echo "$nextline" | cut -f4 -d " ")
        name=$(echo "$nextline" | cut -f9 -d " ")

        # Extract time field and convert to milliseconds
        strtHour=$(echo "$starttime" | cut -f1 -d ":")
        let strtHour2Millisecs=strtHour*60*60*1000
        strtMinute=$(echo "$starttime" | cut -f2 -d ":")
        let strtMinute2Millisecs=strtMinute*60*1000
        temp=$(echo "$starttime" | cut -f3 -d ":")      
        strtSecond=$(echo "$temp" | cut -f1 -d ".")
        let strtSec2Millisecs=strtSecond*1000
        strtMillisecs=$(echo "$temp" | cut -f2 -d ".")
        let totalStrtSeconds=strtHour2Millisecs+strtMinute2Millisecs+strtSec2Millisecs+strtMillisecs
        let linecount=linecount+1
done

rm ${FILENAME}.sorted   # clean up


faster way to extract fields from large file?

Post by Heiner Stevens » Tue, 28 Mar 2000 04:00:00



> I understand what you mean, I should have posted something. Here is the
> ksh script (see below function convert2millisecs). Another problem I am
> having is placing the code which does the conversion into a function, in
> order to reuse the function to process start and end times for jobs I am
> running. The time format is HH:MM:SS.sss where sss is milliseconds.

[...ksh script removed...]

This looked like a problem suitable for AWK, and therefore I
tried to rewrite your script using this language:

# stats - print process times

for file
do
    awk '
        {
            runtime = $4        # HH:MM:SS.sss format
            nfields = split (runtime, t, "[:.]")
            millisecs = ((t [1] * 60 + t [2]) * 60 + t [3]) * 1000 + t [4]

            if ( (NR % 2) == 1 ) {
                # start time
                starttime = millisecs
                name      = $9
            } else {
                finishtime = millisecs
                duration   = finishtime - starttime
                print name "\t\t" duration "\t\t\t\t" starttime \
                        "\t\t" finishtime
            }
        }
    ' "$file"
done

I don't have sample data and cannot test it, but maybe you can
use it as a starting point for your own script.

Some comments:

 o  the line "split (...)" splits the input into fields delimited
    by ":" or ".", and writes the results to the array t[1..4]:
        t[1] = hour, t[2] = minute, ...

 o  The line "millisecs = "...
    converts the time to milliseconds (at least it *should* ;-) )
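Those two comments can be checked in isolation with a one-liner (sample timestamp invented); note the terms are ordered hours-first, since split() puts the hour in t[1]:

```shell
# split() breaks HH:MM:SS.sss on ":" or "." into t[1]..t[4],
# then the arithmetic walks hours -> minutes -> seconds -> ms.
echo "01:02:03.004" | awk '{
    split ($1, t, "[:.]")
    print ((t[1] * 60 + t[2]) * 60 + t[3]) * 1000 + t[4]
}'
# prints 3723004
```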

Heiner
--
 ___ _
 \__ \  _/ -_) V / -_) ' \    UNIX Shell Script Programmers: visit
 |___/\__\___|\_/\___|_||_|   http://www.oase-shareware.org/shell/


faster way to extract fields from large file?

Post by Ken Pizzini » Wed, 29 Mar 2000 04:00:00



>let totalStrtSeconds=$(echo `convert2millisecs $starttime`)

If you're worried about performance, lose the UUOE/UUOBT (useless
use of echo / useless use of backticks):
  let totalStrtSeconds=$(convert2millisecs $starttime)

>I need to get the return value of the function and assign it to the
>variable totalStrtSeconds, rather than recurse. Anyway, thanks for
>any help you can offer, I really appreciate it.

I failed to find anything in the code you posted which did any
recursion, so I'm unclear on what the concern there was...

>function convert2millisecs {
>        # Extract time field and convert to milliseconds
>        hour=$(echo "$1" | cut -f1 -d ":")
>        let hour2Millisecs=strtHour*60*60*1000
>        minute=$(echo "$1" | cut -f2 -d ":")
>        let minute2Millisecs=strtMinute*60*1000
>        temp=$(echo "$1" | cut -f3 -d ":")
>        second=$(echo "$temp" | cut -f1 -d ".")
>        let sec2Millisecs=strtSecond*1000
>        millisecs=$(echo "$temp" | cut -f2 -d ".")

>        echo $((hour2Millisecs+minute2Millisecs+sec2Millisecs+millisecs))
>}

Avoid all those external calls to "cut":
  function convert2millisecs {
    typeset IFS=":."
    typeset -A hmsl $1
    echo $(( ( ( ${hmsl[0]}*60 + ${hmsl[1]} )*60
                               + ${hmsl[2]} )*1000 + ${hmsl[3]} ))
  }

>let totallinecount=`wc -l ${FILENAME} | awk '{print $1}'`

A useless optimization way out here outside of any loop, but you
don't need the "awk" call:
  let totallinecount=`wc -l < ${FILENAME}`
(Actually, as it will turn out later, you don't even need
"totallinecount" in the first place...)

>        if (( $linecount % 2 == 0 )); then
>            finishtime=`sed -n "${linecount}p" ${FILENAME} | awk '{print $4}'`

Better:
             finishtime=`sed "${linecount}q;d" ${FILENAME} | awk '{print $4}'`
or better still:
             finishtime=`awk 'NR=='"${linecount}"'{print $4;exit}' ${FILENAME}`
Or even better still yet: don't re-scan the file for the next
line each time around the loop --- redirect the file into the
loop and "read" each line in for processing.
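That last suggestion looks roughly like this (demo file and line layout invented): the redirection opens the file once and read consumes one line per loop iteration, instead of sed re-scanning the whole file for each line number.

```shell
# One pass over the file: read advances through the redirected input,
# so a 5000-line file costs 5000 reads, not 5000 full re-scans.
tmp=/tmp/readloop.$$
printf '%s\n' "job1 10:00:00.000" "job2 10:00:01.500" > "$tmp"

while read -r name time; do
    echo "$name $time"
done < "$tmp"

rm -f "$tmp"
```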

>            finHour=$(echo "$finishtime" | cut -f1 -d ":")
...
>            let totalFinMilliSecs=finHour2Millisecs+finMinute2Millisecs+finSec2Millisecs+finMillisecs

Instead of that mess we can use the convert2millisecs function:
             let totalFinMilliSecs=$(convert2millisecs "$finishtime")

>        # Extract start time
>        nextline=`sed -n "${linecount}p" ${FILENAME}`  # | awk '{print $4}'

Again, you don't want to be re-scanning the file for a specific
line number each time through the loop.

>        starttime=$(echo "$nextline" | cut -f4 -d " ")
>        name=$(echo "$nextline" | cut -f9 -d " ")

Depending on the exact format of the line you can probably avoid
the calls to "cut" by simply specifying 10 variable names to the
"read" command, or using the "set" command.
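The set variant might look like the following sketch (the line layout is assumed from the cut -f4 and -f9 calls above):

```shell
# Word-split the line once with set, then use positional parameters;
# fields 4 and 9 correspond to the cut -f4 / -f9 calls.
line="f1 f2 f3 09:15:00.250 f5 f6 f7 f8 jobname"
set -- $line      # unquoted on purpose: we want word splitting
echo "start=$4 name=$9"    # prints start=09:15:00.250 name=jobname
```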

>        # Extract time field and convert to milliseconds
>        strtHour=$(echo "$starttime" | cut -f1 -d ":")
...
>        let totalStrtSeconds=strtHour2Millisecs+strtMinute2Millisecs+strtSec2Millisecs+strtMillisecs

Again, making use of convert2millisecs:
             let totalStrtMilliSecs=$(convert2millisecs "$starttime")

Putting it all together:
  #!/bin/ksh

  function convert2millisecs {
    typeset IFS=":."
    typeset -A hmsl $1
    echo $(( ((${hmsl[0]}*60+${hmsl[1]})*60+${hmsl[2]})*1000+${hmsl[3]} ))
  }

  FILENAME=${1##*/}
  exec < "$FILENAME"
  let linetype=1
  while read a b c time e f g h curname j; do
    if (( linetype == 0 )); then
      # Extract finish time
      let totalFinMilliSecs=$(convert2millisecs "$time")
      let duration=totalFinMilliSecs-totalStrtMilliSecs
      # time stats
      print "$name\t\t$duration\t\t\t\t$starttime\t\t$time"
    else
      # Remember start time
      starttime=$time
      name=$curname
      # Extract time field and convert to milliseconds
      let totalStrtMilliSecs=$(convert2millisecs "$starttime")
    fi
    let linetype=1-linetype
  done

  # Is this next line really wanted?
  rm "${FILENAME}.sorted"   # clean up

That should give you a substantial speed-up with even a
moderately large file.
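A quick end-to-end check of that paired-line approach (sample data and the to_ms helper are invented stand-ins, not the original script):

```shell
# Odd lines carry a name and a start time; even lines the finish
# time.  Durations come out in milliseconds.
to_ms() {
    echo "$1" | awk -F'[:.]' '{ print ((($1 * 60) + $2) * 60 + $3) * 1000 + $4 }'
}

tmp=/tmp/pairs.$$
printf '%s\n' "job1 10:00:00.000" "job1 10:00:01.250" > "$tmp"

linetype=1
while read -r name time; do
    if [ "$linetype" -eq 1 ]; then
        start=$(to_ms "$time")           # remember start time
    else
        echo "$name $(( $(to_ms "$time") - start ))"
    fi
    linetype=$((1 - linetype))           # toggle odd/even
done < "$tmp"
rm -f "$tmp"
# prints: job1 1250
```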

                --Ken Pizzini


faster way to extract fields from large file?

Post by Michael » Thu, 30 Mar 2000 04:00:00


Hi Ken,

Your solution is quite amazing. You looked at my terrible code (I've
just started learning shell scripting, which is very different from
programming in a high-level language) and knew exactly what I was
trying to do. Apart from the line

typeset -A hmsl $1

which should have been

set -A hmsl $1

in ksh, it worked great. Thank you also for your very helpful
explanations and pointers.

Regards,

Michael


faster way to extract fields from large file?

Post by Michael » Thu, 30 Mar 2000 04:00:00


Hello Heiner,

Thank you for replying. Although I did not use the awk language, your
example is helpful in understanding how some things work in awk.

Regards,

Michael




