Log reporting: for "big" logs

Log reporting: for "big" logs

Post by so.. » Thu, 14 Nov 1996 04:00:00



I need to set up some new log reporting for a server which
handles about 4 million hits a month. Logs are currently kept
in sets of 2 days of activity.

Doing DNS lookups would take forever on a routine basis without some
sort of effective caching.

Any suggestions for "big" logs? I'm getting ready to look at mkstats
and something called analog, I think.

Sonny

 
 
 

Log reporting: for "big" logs

Post by Matt Krus » Thu, 14 Nov 1996 04:00:00


: any suggestions for "big" logs..I'm getting ready to look at mkstats
: and something called analog i think..

As the author of MKStats, I'll tell you that you'll probably run into
problems trying to run my program on very large log files.  It wasn't
designed from the beginning to handle huge log files, because it uses
perl and stores a lot of information in memory.

The advantage is that you get more information about your usage than
any other program can offer you.  The disadvantage is that it's a
memory hog :)

I'm working on a way to avoid this problem by using a custom database
method of storing the information.  If I could guarantee that everyone
was running an SQL server, this would be easy :)
But unfortunately, all I can assume is that the user is running perl, and
I want to keep the program completely cross-platform and independent of
other programs or databases.  So that makes it a little harder :)
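
For illustration only -- this is not MKStats code, just a minimal sketch
of the idea using the SDBM files that ship with stock perl 5, so it stays
cross-platform and needs no external database:

#!/usr/bin/perl
# Sketch: keep per-URL hit counts in an on-disk DBM file instead of in
# memory.  Fcntl and SDBM_File come with every perl 5 distribution.
use Fcntl;
use SDBM_File;

tie(%hits, 'SDBM_File', 'urlcounts', O_RDWR|O_CREAT, 0644)
    or die "can't tie urlcounts: $!";

while (<>) {
    # Common Log Format request field looks like: "GET /some/page HTTP/1.0"
    next unless m{"\S+\s+(\S+)};
    $hits{$1}++;                  # the counter lives in the DBM file, not RAM
}

untie %hits;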

Analog is very fast, and will probably handle your amount of info easily,
but won't offer the same kind of information as mkstats.  Other than
that, your best bet is to hook into a database and use a program that can
use that info.

--
Matt Kruse

---------------------------+--------------------------------------------
 MKStats: WWW log analysis |        "First try to Understand.
      www.mkstats.com      |         Then try to be Understood."

 
 
 

Log reporting: for "big" logs

Post by Snowha » Thu, 14 Nov 1996 04:00:00



>: any suggestions for "big" logs..I'm getting ready to look at mkstats
>: and something called analog i think..

>As the author of MKStats, I'll tell you that you'll probably run into
>problems trying to run my program on very large log files.  It wasn't
>designed from the beginning to handle huge log files, because it uses
>perl and stores a lot of information in memory.

Also look at FTPWebLog 1.0.2
<URL:http://www.netimages.com/~snowhare/utilities/>. It can do incremental
reports and should have no problem handling your volume (I've used it
routinely on systems getting up to 1 million hits per day). It can also do
DNS lookups and performs DNS caching during a run to improve performance.
It also has support for graphic reports. Note: the DNS lookup support is
only in the 1.0.2 version, not 1.0.1. And it is free :).
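
The per-run caching idea is simple enough to show. Roughly (an untested
sketch, not FTPWebLog's actual code), you keep a hash of addresses already
seen, so each distinct IP gets resolved at most once per run:

#!/usr/bin/perl
# Sketch: resolve the leading IP address of each log line to a hostname,
# caching results in memory so each distinct address is looked up once.
use Socket;

my %dnscache;

while (<>) {
    s/^(\d+\.\d+\.\d+\.\d+)/lookup($1)/e;   # replace leading IP with hostname
    print;
}

sub lookup {
    my $ip = shift;
    return $dnscache{$ip} if exists $dnscache{$ip};
    my $host = gethostbyaddr(inet_aton($ip), AF_INET);
    return $dnscache{$ip} = defined($host) ? $host : $ip;
}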

Benjamin Franz


 
 
 

Log reporting: for "big" logs

Post by Jay Thorn » Thu, 14 Nov 1996 04:00:00




> : any suggestions for "big" logs..I'm getting ready to look at mkstats
> : and something called analog i think..

> As the author of MKStats, I'll tell you that you'll probably run into
> problems trying to run my program on very large log files.  It wasn't
> designed from the beginning to handle huge log files, because it uses
> perl and stores a lot of information in memory.

> The advantage is that you get more information about your usage than
> anything other program can offer you.  The disadvantage is that it's a
> memory hog :)

> I'm working on a way to avoid this problem and use a custom database
> method of storing the information.  If I could guarantee that everyone
> was running SQL server, this would be easy :)
> But unfortunately, all I can assume is that the user is running perl, and
> I want to keep the program completely cross-platform and independent of
> other programs or databases.  So it makes it a little harder :)

> Analog is very fast, and will probably handle your amount of info easily,
> but won't offer the same kind of information as mkstats.  Other than
> that, your best bet is to hook into a database and use a program that can
> use that info.

Most of the BIND-derived nameds cache all requests.  I've seen the
cache on my named get _huge_ when I'm doing a logresolve. One time, on a
year's worth of logs, the process size got to 2 megs after 3 hours.
After such a run, subsequent runs on other logs are usually much faster,
since the cache already has most of the common hosts.

I always run logresolve first, then analog, mkstats or wwwstats on the
result file, depending on which my customer prefers.  I find that though
analog is screamingly fast, the detail of mkstats or wwwstats is
really important to clients who want to know what the usage patterns
are.

--
Jay Thorne                http://net.result.com/
President, The Net Result Systems * Services  Telephone:(604) 220 2504
WWW & Internet Systems Consultant.

 
 
 

Log reporting: for "big" logs

Post by Andrew Gide » Thu, 14 Nov 1996 04:00:00




>: any suggestions for "big" logs..I'm getting ready to look at mkstats
>: and something called analog i think..

[...]

>Analog is very fast, and will probably handle your amount of info easily,
>but won't offer the same kind of information as mkstats.  Other than
>that, your best bet is to hook into a database and use a program that can
>use that info.

[...]

I've used Analog on our log files.  We get about the same activity
that you describe, and run reports monthly.  Analog is quite swift
in its processing.

Analog, BTW, does cache DNS records although I no longer recall the
details involved.

        - Andrew

---
 -----------------------------------------------------------
| Andrew Gideon              |   TAG Online inc.            |
| Consultant                 |   539 Valley Road            |
|                            |   Upper Montclair, N.J.      |
| Tel: (201) 783-5583        |                     07043    |
| Fax: (201) 783-5334        |                              |

 -----------------------------------------------------------

 
 
 

Log reporting: for "big" logs

Post by Greg Vern » Thu, 14 Nov 1996 04:00:00



>I need to set up some new log reporting for a server which
>handles about 4 million hits a month. Logs are currently kept
>in sets of 2 days of activity.

>and doing dns lookups would take forever to do on a routine
>basis without some sort of effective cacheing.

I'm having the same sort of problem.  I've started running the DNS
lookups on the logs separately in the middle of the night.  What I've been
looking at as a solution is to keep a file of hosts/addresses and keep
an address around for about a week or so and then time it out.  
I've tested it a bit and it seems to save quite a bit of time.  I now
just have to figure out how I want to time the hostnames out.
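
Roughly what I have in mind (an untested sketch; the cache file name and
format are made up): a flat file of "address hostname timestamp" lines,
where anything older than a week gets forgotten and looked up again:

#!/usr/bin/perl
# Sketch: resolve log lines on stdin using a persistent DNS cache file.
# Cache entries older than one week are dropped and re-resolved.
use Socket;

my $cachefile = "/var/tmp/dnscache";       # hypothetical location
my $maxage    = 7 * 24 * 3600;             # one week, in seconds
my (%host, %stamp);

if (open(CACHE, $cachefile)) {
    while (<CACHE>) {
        my ($ip, $name, $when) = split;
        next if time() - $when > $maxage;  # timed out: look it up again
        ($host{$ip}, $stamp{$ip}) = ($name, $when);
    }
    close CACHE;
}

while (<>) {
    s/^(\d+\.\d+\.\d+\.\d+)/resolve($1)/e; # leading IP -> hostname
    print;
}

open(CACHE, ">$cachefile") or die "can't rewrite $cachefile: $!";
print CACHE "$_ $host{$_} $stamp{$_}\n" for keys %host;
close CACHE;

sub resolve {
    my $ip = shift;
    unless (exists $host{$ip}) {
        my $name = gethostbyaddr(inet_aton($ip), AF_INET);
        $host{$ip}  = defined($name) ? $name : $ip;
        $stamp{$ip} = time();
    }
    return $host{$ip};
}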

>any suggestions for "big" logs..I'm getting ready to look at mkstats
>and something called analog i think..

Analog seems to be the best program for running through huge numbers of
hits.  I've abandoned everything else.

Cheers!
Greg

--

Web Services Analyst, UNIX Delivery Sys & Services     |  te kea, ka pai.
Boeing Information and Support Services                |
Disclaimer: My opinions are not necessarily the same as Boeing's.

 
 
 

Log reporting: for "big" logs

Post by Jan Wedeki » Fri, 15 Nov 1996 04:00:00


|> [...]
|> >
|> >Analog is very fast, and will probably handle your amount of info easily,
|> >but won't offer the same kind of information as mkstats.  Other than
|> >that, your best bet is to hook into a database and use a program that can
|> >use that info.
|> >
|> [...]
|>
|> I've used Analog on our log files.  We get about the same activity
|> that you describe, and run reports monthly.  Analog is quite swift
|> in its processing.
|>
|> Analog, BTW, does cache DNS records although I no longer recall the
|> details involved.
|>
Yes, indeed it does, but it's a very slow hash algorithm used inside
analog.
So we at EUnet Germany use the 'logresolve' utility that comes with
Apache to resolve the logs daily, and then generate daily, weekly
and/or monthly reports with analog from these saved log files.

Jan
--

Dipl.-Inform. Jan Wedekind              
Emil-Figge-Str. 80, D-44227 Dortmund    The opinions in this mail / article
Tel/Fax +49-231-972 -00/-1188           are my own and not those of EUnet GmbH.

 
 
 

Log reporting: for "big" logs

Post by Stephen Turne » Fri, 15 Nov 1996 04:00:00



> Analog, BTW, does cache DNS records although I no longer recall the
> details involved.

Analog saves DNS records in an external file, with a timestamp. It doesn't
look them up again within H hours, where H can be specified at compilation
time or runtime, but defaults to 168 (1 week).

By the way, I'm somewhat surprised that Jan Wedekind described analog's hash
algorithm as "very slow", but I'm always open to suggestions for improvement
if anyone has better ideas.

--

  Stochastic Networks Group, Statistical Laboratory,
  16 Mill Lane, Cambridge, CB2 1SB, England    Tel.: +44 1223 337955
  "Collection of rent is subject to Compulsive Competitive Tendering" Cam. City

 
 
 

Log reporting: for "big" logs

Post by Luuk de Bo » Sat, 16 Nov 1996 04:00:00




>>I need to set up some new log reporting for a server which
>>handles about 4 million hits a month. Logs are currently kept
>>in sets of 2 days of activity.

>>and doing dns lookups would take forever to do on a routine
>>basis without some sort of effective cacheing.
>I'm having the same sort of problem.  I've started running the DNS
>lookups on the logs separately in the middle of the night.  What I've been
>looking at as a solution is to keep a file of hosts/addresses and keep
>an address around for about a week or so and then time it out.  
>I've tested it a bit and it seems to save quite a bit of time.  I now
>just have to figure out how I want to time the hostnames out.

This sounds interesting. It would be nice if you could mail it to me.
Maybe it's handy to use a database and run another script to delete
old entries so you always have a current list. That cleanup script could
be set to run every 7 days.

>>any suggestions for "big" logs..I'm getting ready to look at mkstats
>>and something called analog i think..
>Analog seems to be the best program for running through huge numbers of
>hits.  I've abandoned everything else.

Did anybody look at http-analyse 1.9e? From what I have read in a
performance test, http-analyse outperforms analog, and my personal
opinion is that http-analyse's statistics are more beautiful than
analog's.
But what I miss in both sets of statistics is the average bytes/sec per
day or per hour. It's handy when you want to see what sort of data
traffic the site is generating and how much capacity you have left.
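
For what it's worth, the per-hour traffic figure is easy to pull out of
the raw log yourself. A rough sketch (assuming plain Common Log Format,
with the byte count as the last field of each line):

#!/usr/bin/perl
# Sketch: total bytes and average bytes/sec for each hour in a CLF log.
# Hours are keyed as "dd/Mon/yyyy:hh"; "-" byte counts are skipped.
my %bytes;

while (<>) {
    my ($hour) = m{\[(\d+/\w+/\d+:\d+)};   # e.g. "14/Nov/1996:23"
    my ($sent) = /(\S+)$/;                 # last field = bytes sent
    next unless defined $hour and defined $sent;
    $bytes{$hour} += $sent unless $sent eq '-';
}

# Lexical sort is fine within a single day's or month's log.
for my $hour (sort keys %bytes) {
    printf "%s  %12d bytes  %8.1f bytes/sec\n",
           $hour, $bytes{$hour}, $bytes{$hour} / 3600;
}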

Greetz...

Luuk

 
 
 

Log reporting: for "big" logs

Post by Andy Rabaglia » Sat, 16 Nov 1996 04:00:00


:I need to set up some new log reporting for a server which
:handles about 4 million hits a month. Logs are currently kept
:in sets of 2 days of activity.
:
:and doing dns lookups would take forever to do on a routine
:basis without some sort of effective cacheing.
:
:any suggestions for "big" logs..I'm getting ready to look at mkstats
:and something called analog i think..

We have about 250 Virtual hosts.

I use apache's custom log format to include '%v' - the virtual host
of the server - and the referer in a single large log file.

I do not have apache resolve the logs - I post-process them with
logresolve.

The key turned out to be to do the following :-

1. grep out each Virtual host to its own file
2. run logresolve on this (good locality compared to the single composite
   log). Great speedups by taking this step.
3. run analog on the result.

Cheers,       Andy!

#! /bin/sh
#
# process http logs
#
# daily script run as the www server
#

#         http://www.veryComputer.com/
# For:
#         Rocky Mountain Internet http://www.veryComputer.com/
#
# May safely be run many times, but not around the witching hour
# of root's rotation ..
#

GZIP=/usr/local/bin/gzip
BINDIR=/www/apache-ssl
LOGDIR=/var/log/apache-ssl-logs

#
# root turns the log over to this file - now static
#

LASTLOG=$LOGDIR/access_old

# Backups, gzipped
#

ARCHIVE=/var/log/old-apache-logs

#
# site root directories are assumed to be of the form
#
# $SITEROOT/www.wizzy.com/index.html
#

SITEROOT=/www

# Roll time back 12 hours to get the month yesterday ..
monthday=`TZ=GMT+19 date +%b-%d`
month=`TZ=GMT+19 date +%b`

onevirtualsite() {
    SITE=$1
    SITEDIR=$SITEROOT/$SITE/statistics
    [ ! -d $SITEDIR ] && return                 # outta here
    [ ! -f $SITEDIR/analog.conf ] && return     # outta here

    LOG=$SITEDIR/$month.html

    grep " $SITE " $LASTLOG | \
        $BINDIR/logresolve | \
        $GZIP --stdout > $SITEDIR/access_log.$monthday.gz

    $GZIP --decompress --stdout $SITEDIR/access_log.${month}-??.gz | \
        $BINDIR/analog +g$SITEDIR/analog.conf \
            +C"REFEXCLUDE http://$SITE/*" - > $LOG.new

    mv -f $LOG.new $LOG

}

# Backup, in case we mess up

$GZIP --stdout $LASTLOG > $ARCHIVE/access_log.${monthday}.gz

# virtual IPs

cd $SITEROOT

for d in www.* ; do
  onevirtualsite $d
done

exit 0

 
 
 

1. "Logging" or "Log structured" file systems with news spool.

(I've included comp.arch.storage and comp.databases in hopes of getting
info from people who know about large file systems with lots of small
files).

We're putting together a new news machine since our old one, which was
a large general server is going to be retired in a few months.  Instead
of burdening our newer servers with news, we want to set up a
workstation (Sparc 5 w/Solaris 2.4) dedicated to news.

I figure that anything I can do to speed up disk performance in the
news spool disk is worth a fair amount of effort.  News is pretty close
to a worst case scenario for disk performance due to the small files
(3.5KB average, 2KB typical) and the fact that news requires tons of
random seeking.  I'd like to plan this system so that it can easily
handle the news load for several years as well as support about 100
readers.  Right now my news spool contains about three-hundred-
thousand files and I expire news after 5 days.

I plan to move to the INN transport software with this new news server.

I've been researching things and I believe that using a log structured
file system (provided by Veritas Volume Manager) can probably buy me
some extra performance, particularly during news reception (lots of
small writes).  Is this correct?  I believe Sun's Online: Disksuite 3.0
provides a logging UFS.  Might this increase performance if I decide
not to use Veritas for some reason?

Also, we'll be using multiple 1GB disks.  I suspect that using multiple
disks with large stripe (interlace) sizes will also be a win since
individual articles end up on one disk or another but not spread over
each except for the relatively rare large article.  I'm hoping this
will reduce the individual seeking that has to be done on each disk.
Even if it doesn't buy me any performance (though I hope it will) I'm
hoping this will at least reduce wear on the disks by distributing the
workload.

I'd appreciate any comments or advice on this subject.

--Bill Davidson

2. A Path question...

3. logging - "secure" logs don't tell me who is logging in?

4. Microsoft Admits Deception/Firm paid for institute's ads backing its antitrust position

5. UFS logging VS Solstice DiskSuite's Trans metadevice "UFS logging"

6. Building a.out executables on FreeBSD 3.x

7. Logging stopped on RH 5.2 (no output to "messages" log)

8. RSH Proxy for UNIX - Where?

9. userdel : "user" is in use, but "user" isn't logged

10. "ufs logging" versus "vxfs"

11. "never logged in." and "hostname login:"

12. An addition to the "Failed CGI" and "Error Log for Apache" threads

13. GETSERVBYNAME()????????????????????"""""""""""""