## zsh: large arrays very slow

### zsh: large arrays very slow

Hi all,

I've written a simple zsh script to calculate the standard deviation of a
set of numbers. For convenience, I read the numbers from stdin (1 per line) -
this enables me to use my script in a pipe. (Very handy in my situation.)

I've accomplished this, by reading the data values into an array.
Unfortunately, I'm finding that once I start using more than a few
thousand data values my script runs terribly slowly. On a 2.6GHz CPU, it
takes ~11 seconds to compute the std dev for 10,000 values. It takes
minutes to compute for 100,000 values. The largest data set I can
reasonably expect to use has 500,000 data values.

So my question is: is there anything I can do to optimize my script?
An obvious solution is to store the data in a file instead of an array;
however, I'd really like to know if I'm doing something inherently wrong
in the way I'm using arrays in zsh.

Any info/pointers muchly appreciated.

SCoTT. :)
PS I'm using zsh 4.2.0 on RH9 Linux.

#!/usr/local/bin/zsh

float sum=0.0

# Read the input into an array & count the number of elements.
let n=0
while read datum ; do
    (( n++ ))
    data[$n]=$datum
done

echo n is $n
(( $n == 0 )) && { echo "No data!" 1>&2 ; exit(2) }

# Sum the elements.
for datum in $data ; do
    (( sum += $datum ))
done

# Calculate the mean value.
let mean=$sum/$n

# Calculate the sum of the square of the residuals.
float sumSq=0.0
for datum in $data ; do
    (( sumSq += ($datum - $mean) ** 2 ))
done

# Calculate standard deviation.
let sd="($sumSq/$n)**0.5"
# if (( $sd < 0.0 )) ; then
#       let sd=-$sd
# fi

printf "sum: %g\n" $sum
printf "mean: %g\n" $mean
# printf "sumSq: %g\n" $sumSq
printf "stdDev: %g\n" $sd

### zsh: large arrays very slow

I use bash, so I can't tell you about the specifics of arrays in zsh, but I can tell you this: you can't rely on scripts if you want performance; just rewrite it in C.

But in this case, I think that won't be a solution either. From what you tell me, the problem is in the calculation, not in the for loop that goes through the array.

Try debugging the script with the 'time' command and logging its output. That way you can check how much processor time each calculation uses and estimate how long it should take; if it then takes longer than you expected, you can be sure the problem is in the script, and either optimize it or rewrite it in C, which in this case would be rather easy.
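That profiling advice can be sketched in a self-contained way by bracketing each stage with date(1); the `time` builtin gives the same per-command breakdown interactively. The temp file and the awk summing stage here are only stand-ins for the real script's read loop and calculations:

```shell
# Bracket each stage with date(1) to see where the time goes.
# The generator and the awk stage are placeholders for the real work.
t0=$(date +%s)
seq 1 10000 > /tmp/nums.$$                            # stage 1: make test input
t1=$(date +%s)
sum=$(awk '{ s += $0 } END { print s }' /tmp/nums.$$) # stage 2: sum it
t2=$(date +%s)
echo "sum: $sum"
echo "generate: $((t1 - t0))s  sum: $((t2 - t1))s"
rm -f /tmp/nums.$$
```

Timing the same stages at 100k and 500k lines would show immediately whether the cost grows linearly or worse.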

Hope this helps.

ALMAFUERTE

On Wed, 18 Aug 2004 05:51:45 GMT:

> Hi all,

> I've written a simple zsh script to calculate the standard deviation of a
> set of numbers. For convenience, I read the numbers from stdin (1 per line) -
> this enables me to use my script in a pipe. (Very handy in my situation.)

> I've accomplished this, by reading the data values into an array.
> Unfortunately, I'm finding that once I start using more than a few
> thousand data values my script runs terribly slowly. On a 2.6GHz CPU, it
> takes ~11 seconds to compute the std dev for 10,000 values. It takes
> minutes to compute for 100,000 values. The largest data set I can
> reasonably expect to use has 500,000 data values.

> So my question is: is there anything I can do to optimize my script?
> An obvious solution is to store the data in a file instead of an array,
> however, I'd really like to know if I'm doing something inherently wrong
> in the way I'm using arrays in zsh.

> Any info/pointers muchly appreciated.

> SCoTT. :)
> PS I'm using zsh 4.2.0 on RH9 Linux.

> #!/usr/local/bin/zsh

> float sum=0.0

> # Read the input into an array & count the number of elements.
> let n=0
> while read datum ; do
>    (( n++ ))
>    data[$n]=$datum
> done

> echo n is $n
> (( $n == 0 )) && { echo "No data!" 1>&2 ; exit(2) }

> # Sum the elements.
> for datum in $data ; do
>    (( sum += $datum ))
> done

> # Calculate the mean value.
> let mean=$sum/$n

> # Calculate the sum of the square of the residuals.
> float sumSq=0.0
> for datum in $data ; do
>    (( sumSq += ($datum - $mean) ** 2 ))
> done

> # Calculate standard deviation.
> let sd="($sumSq/$n)**0.5"
> # if (( $sd < 0.0 )) ; then
> #  let sd=-$sd
> # fi

> printf "sum: %g\n" $sum
> printf "mean: %g\n" $mean
> # printf "sumSq: %g\n" $sumSq
> printf "stdDev: %g\n" $sd

### zsh: large arrays very slow

<snip>

> # Read the input into an array & count the number of elements.
> let n=0
> while read datum ; do
>    (( n++ ))
>    data[$n]=$datum

Move the "(( sum += $datum ))" line here, then you don't need the first
loop below.

> done

> echo n is $n
> (( $n == 0 )) && { echo "No data!" 1>&2 ; exit(2) }

> # Sum the elements.
> for datum in $data ; do
>    (( sum += $datum ))
> done

Have you considered just building a string expression from your input
and passing it to "bc" or "dc" to do the calculations?
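A minimal sketch of that idea: join all input lines into one "+"-separated expression, then evaluate it in a single step instead of looping per line (shell arithmetic stands in here for bc, which would evaluate the same string):

```shell
# Join the input lines with "+" into one big expression.
expr=$(seq 1 5 | paste -s -d+ -)   # expr is now "1+2+3+4+5"
echo "$expr"
# bc would evaluate it as: echo "$expr" | bc
sum=$(( expr ))                    # shell arithmetic evaluates it too
echo "sum: $sum"                   # prints: sum: 15
```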

Ed.

### zsh: large arrays very slow

> Hi all,

> I've written a simple zsh script to calculate the standard deviation of a
> set of numbers. For convenience, I read the numbers from stdin (1 per line) -
> this enables me to use my script in a pipe. (Very handy in my situation.)

> I've accomplished this, by reading the data values into an array.
> Unfortunately, I'm finding that once I start using more than a few
> thousand data values my script runs terribly slowly. On a 2.6GHz CPU, it
> takes ~11 seconds to compute the std dev for 10,000 values. It takes
> minutes to compute for 100,000 values. The largest data set I can
> reasonably expect to use has 500,000 data values.

> So my question is: is there anything I can do to optimize my script?
> An obvious solution is to store the data in a file instead of an array,
> however, I'd really like to know if I'm doing something inherently wrong
> in the way I'm using arrays in zsh.

> Any info/pointers muchly appreciated.

Use Awk.  It's designed for this kind of thing.
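For example, a one-pass awk sketch of the population standard deviation, accumulating the sum and sum of squares and applying stddev = sqrt(E[x^2] - E[x]^2) at end of input (the sample values are made up):

```shell
# One pass over stdin: keep running sum and sum of squares,
# then stddev = sqrt(mean of squares - square of mean).
printf '2\n4\n4\n4\n5\n5\n7\n9\n' |
  awk '{ s += $0; s2 += $0 * $0 }
       END { print "stdDev:", sqrt(s2/NR - (s/NR) * (s/NR)) }'
# prints: stdDev: 2
```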

--

### zsh: large arrays very slow

Hi, :)

> I use bash, so I can't tell you about the specifics of arrays in zsh,
> but I can tell you this: you can't rely on scripts if you want
> performance; just rewrite it in C.

I'm happy to pay a constant-factor penalty for scripting it. What bothers
me is that the _rate_ at which my script processes input _drops_ when I
increase the size of the input.

For example,

If I read in 100k lines, my script processes them at ~10k/sec. If I read
in 500k lines, my script processes them at ~1k/sec!

I was expecting 500k of input to take ~5 times longer than 100k of input.
I have rewritten my script to use a file (instead of an array) & all of
the above (about the rate dropping) still holds true. :(

So I guess, now, I'm interested in confirming that the dropping rate is
something inherent in the shell and not a bug.
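A rate that drops with input size is consistent with each indexed assignment `data[$n]=$datum` costing time proportional to the current array size, which makes the whole read loop quadratic. A minimal sketch of the append form (`+=`), the usual way to grow an array; whether it avoids the per-element rebuild depends on the shell version, so this is a suggestion to benchmark, not a guaranteed fix:

```shell
# Grow the array by appending instead of assigning to a computed index.
data=()
n=0
while read -r datum; do
  n=$((n + 1))
  data+=("$datum")        # instead of: data[$n]=$datum
done <<EOF
1.5
2.5
3.0
EOF
echo "read $n values: ${data[*]}"   # prints: read 3 values: 1.5 2.5 3.0
```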

My rewritten script is attached.

SCoTT. :)

#!/usr/local/bin/zsh

float sum=0.0

# Read the input into an array & count the number of elements.
let n=0
d=$(date +%s)
foreach datum (${(f)"$(<$1)"})
    (( n++ ))
    (( $n % 10000 == 0 )) && echo -n .
    # Sum the elements.
    (( sum += $datum ))
end
echo "read took: $(( $(date +%s) - $d )) seconds"

echo n is $n
(( $n == 0 )) && { echo "No data!" 1>&2 ; exit(2) }

# Calculate the mean value.
let mean=$sum/$n

# Calculate the sum of the square of the residuals.
float sumSq=0.0
d=$(date +%s)
let counter=0
foreach datum (${(f)"$(<$1)"})
    (( sumSq += ($datum - $mean) ** 2 ))
    (( counter++ % 10000 == 0 )) && echo -n .
end
echo "residuals took: $(( $(date +%s) - $d )) seconds"

# Calculate standard deviation.
let sd="($sumSq/$n)**0.5"

printf "sum: %g\n" $sum
printf "mean: %g\n" $mean
# printf "sumSq: %g\n" $sumSq
printf "stdDev: %g\n" $sd

### zsh: large arrays very slow

>  Hi all,

>  I've written a simple zsh script to calculate the standard deviation of a
>  set of numbers. For convenience, I read the numbers from stdin (1 per line) -
>  this enables me to use my script in a pipe. (Very handy in my situation.)

>  I've accomplished this, by reading the data values into an array.
>  Unfortunately, I'm finding that once I start using more than a few
>  thousand data values my script runs terribly slowly. On a 2.6GHz CPU, it
>  takes ~11 seconds to compute the std dev for 10,000 values. It takes
>  minutes to compute for 100,000 values. The largest data set I can
>  reasonably expect to use has 500,000 data values.

>  So my question is: is there anything I can do to optimize my script?
>  An obvious solution is to store the data in a file instead of an array,
>  however, I'd really like to know if I'm doing something inherantly wrong
>  in the way I'm using arrays in zsh.

Well, the main bottlenecks of your script were growing the array element
by element and using for loops.  It runs much faster when these
constructs are avoided.  Considering that zsh is not intended for
calculations, a factor-of-8 slowdown relative to gawk is not too bad.

Have fun!

Pavol

$ time zsh std.zsh < x.dat
n is 500000
sum: 250069
mean: 0.500138
stdDev: 0.288616
10.01s user 0.27s system 97% cpu 10.565 total

$ time gawk '{s2+=$0^2; s+=$0}
END{print "stdDev:", sqrt(s2/NR - (s/NR)^2)}' < x.dat
stdDev: 0.288616
1.27s user 0.01s system 99% cpu 1.289 total

$ matlab
>> tic; f=fopen('x.dat'); x=fscanf(f,'%f');fclose(f); std(x,1), toc
ans = 0.2886
Elapsed time is 0.918466 seconds.
>> tic; std(x,1); toc
Elapsed time is 0.034074 seconds.

############ std.zsh ############
#!/usr/local/bin/zsh

float sum=0.0

# Read the input into an array & count the number of elements.
# let n=0
# while read datum ; do
#       (( n++ ))
#       data[$n]=$datum
# done
#-------------------------------
# $(<&3) is the same as $( cat <&3 ).  You need to save standard
# input in a file descriptor, otherwise it is not
# available inside $( )
3<&0 data=( $( <&3 ) )
n=${#data}

echo n is $n
(( $n == 0 )) && { echo "No data!" 1>&2 ; exit(2) }

# Sum the elements.
# for datum in $data ; do
#       (( sum += $datum ))
# done
#-------------------------------
# build one big expression instead by joining all elements with "+"
sumExpr=${(j:+:)data}
(( sum = sumExpr ))

# Calculate the mean value.
let mean=$sum/$n

# Calculate the sum of the square of the residuals.
float sumSq=0.0
# for datum in $data ; do
#       (( sumSq += ($datum - $mean) ** 2 ))
# done
#-------------------------------
# again, replace this with a single _BIG_ expression
sumSqExpr=( "("${^data}" - $mean)**2.0" )
sumSqExpr=${(j: + :)sumSqExpr}
(( sumSq = sumSqExpr ))

# Calculate standard deviation.
let sd="($sumSq/$n)**0.5"
# if (( $sd < 0.0 )) ; then
#       let sd=-$sd
# fi

printf "sum: %g\n" $sum
printf "mean: %g\n" $mean
# printf "sumSq: %g\n" $sumSq
printf "stdDev: %g\n" $sd

### zsh: large arrays very slow

2004-08-24, 03:56(+00), Pavol Juhas:
[...]

> # $(<&3) is the same as $( cat <&3 ).  You need to save standard
> # input in a file descriptor, otherwise it is not
> # available inside $( )
> 3<&0 data=( $( <&3 ) )

data=($(cat))

or:

data=($(<&0))
same as
data=($(5>&-))
or
data=($($NULLCMD))

($NULLCMD is used when there's a redirection and no command nor
assignment. Beware that $(<) is not a special operator as in
bash or ksh).

> n=${#data}

> echo n is $n
> (( $n == 0 )) && { echo "No data!" 1>&2 ; exit(2) }

exit 2

--
Stephane

### zsh: large arrays very slow

>  2004-08-24, 03:56(+00), Pavol Juhas:
>  [...]
> > # $(<&3) is the same as $( cat <&3 ).  You need to save standard
> > # input in a file descriptor, otherwise it is not
> > # available inside $( )
> > 3<&0 data=( $( <&3 ) )

>  data=($(cat))

>  or:

>  data=($(<&0))
>  same as
>  data=($(5>&-))
>  or
>  data=($($NULLCMD))

Hi Stephane, in my zsh 4.2.0:

ls | m=( $(cat) )
cat: -: Input/output error

ls | 3<&0 m=( $(cat <&3) )
OK

ls | zsh -c 'm=( $(cat) )'
OK

Do you know why the first pipe fails?
Thanks,

Pavol

### zsh: large arrays very slow

2004-08-25, 22:23(+00), Pavol Juhas:
[...]
> Hi Stephane, in my zsh 4.2.0:

>   ls | m=( $(cat) )
>       cat: -: Input/output error

>   ls | 3<&0 m=( $(cat <&3) )
>       OK

>   ls | zsh -c 'm=( $(cat) )'
>       OK

> Do you know why the first pipe fails?

[...]

Looks like a bug.

--
Stephane

In writing a new compctl command (for dread/dwrite/dremove, accessing the
NeXT defaults database) I have realized that my understanding of arrays
in zsh is flawed.  Running zsh 2.5.03 under NEXTSTEP 3.2... anyway,
here is the problem:

10% array=( foo "bar baz" )
11% echo $array
foo bar baz
12% echo $array[2]
bar baz
13% alias input_function='echo foo \"bar baz\"'
14% input_function
foo "bar baz"
15% array=( $(input_function) )
16% echo $array[2]
"bar
17% array=( `input_function` )
18% echo $array[2]
"bar
19% array=( "$(input_function)" )
20% echo $array
foo "bar baz"

Assume I have some shell function, script or program called input_function
that produces the string [foo "bar baz"] as output (without the brackets).
In the above example, input_function is just an echo, but in my real case
it's something complicated.  Now, what I want is to assign to $array the
two strings [foo] and [bar baz], so that $array[2]="bar baz".  As you
can see, just running input_function, either in a $() or in backticks,
treats the quotes as literals and gives me a three-element array, while
protecting the whole thing with quotes gives me a one-element array.  How
can I get the two-element array I want?

I've even considered writing a horrible kluge of a solution that runs
$(input_function | awk '{print ...}') N times, with an awk string that
pulls out the Nth argument.  But this is not really practical, since in
my real problem N > 200, instead of just 2.  There has to be an easier way.
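One possible approach (a sketch, not from the thread, and safe only when input_function's output is trusted, since eval re-executes it as shell code): eval forces a second round of parsing, so the quotes embedded in the command substitution's output are honored as word grouping.

```shell
# input_function here is a stand-in for the real command.
input_function() { echo 'foo "bar baz"'; }

# eval re-parses the substituted text, so "bar baz" stays one word.
eval "array=( $(input_function) )"

echo "count: ${#array[@]}"    # prints: count: 2
echo "second: ${array[1]}"    # bash is 0-indexed; zsh would use $array[2]
```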

Nothing is so useless as ...

-- Macaulay
