> I'll take you up on that offer, I've got a P166MMX, 32Mb RAM and 50Mb swap,
> so it's a less powerful machine. I think it can do the job in under 2 hours,
> with no niceing of the script.
500Mhz, 256Mb RAM Windows NT box. :-)
Anyway, the script is at the end of this message.
I finally got sick of dealing with NT's idiocy. Here I had this 120Mb
log file, and I wanted to look at the last 20 lines of the file because
I suspected there was some garbage at the end; but hell, it's not like I
can do "cat logfile | tail -20", because there's no tail command, and
there's no piping support, for that matter. Sure I can install MKS
Toolkit, but that's the way it always is with NT...install product X to
get support for Y. (And usually people are paying $$$ to add core UNIX
functionality to NT in the form of MKS Toolkit, PCFS, and who knows what
else.) So here I go again loading it into Wordpad, which shoots CPU to
100% and locks up the box...3-finger salute. Such BS.
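
For what it's worth, the last-20-lines job can be sketched in Perl without any extra toolkit. This is only a minimal sketch, assuming Perl is available on the box; the name last_lines and the choice of a sliding window are illustrative, not anything NT or MKS provides. It reads the file once and never holds more than the requested number of lines in memory, so a 120 MB file should not be a problem.

```perl
#!/usr/local/bin/perl
# Sketch: print the last 20 lines of a file, like "tail -20".
# last_lines is a hypothetical helper name used for illustration.
use strict;

sub last_lines
{
    my ($path, $count) = @_;
    my @window;

    open(my $fh, '<', $path) or die "can't open $path: $!";
    while (my $line = <$fh>)
    {
        # keep only the most recent $count lines
        push(@window, $line);
        shift(@window) if @window > $count;
    }
    close($fh);
    return @window;
}

# usage: perl tail.pl logfile
print last_lines($ARGV[0], 20) if @ARGV;
```

Seeking backwards from the end of the file would be faster on a log this size, but the single forward pass is simpler and good enough for a one-off look.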
That same 120 Mb log file (after it had been parsed) was bulk-copied
into SQL Server on a decent NT Server (Pentium II, I believe, with Gigs
of disk and plenty of RAM). It amounted to about 750,000 rows of data in
a single table. Even after the table had been indexed, doing any sort of
simple query against that thing would choke the NT box. (Bring on the
"you have SQL Server tuned wrong" flames...heh).
Anyway, I'm installing Solaris x86 on my box at work. I'm just not going
to deal with that NT mess anymore. And my machine at home will soon
follow, so I'll probably try to leech Microsoft for a refund on this
preinstalled fecal-matter-that-calls-itself-an-OS, Windows 95. The
Solaris install was flawless and apart from having to find & install
drivers for my weird video card, I'm up and running. So all that crapola
about NT being easier to install is without basis, based on my
experience.
Cheers,
Mark
PS: Note, on a 120 MB log file in the NCSA httpd log format, this script
took roughly 6 hours to execute...it would seem that an HP 48 calculator
could do it in less time, so a Linux box will definitely smoke it. Note
that the reverse DNS
lookups this script does are cached in an associative array, so you can
probably assume one reverse DNS lookup per 100 entries or so, on
average, based on the approximate clustering of the hits. If there's any
discrepancy on this, I can run a test sample and give you a precise
estimate of DNS lookup frequency and the time per DNS lookup on my
machine versus the same DNS lookup on a UNIX box on the same segment.
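
The one-lookup-per-100-entries figure is a guess, and it can be checked without touching DNS at all: the cache only misses the first time an address appears, so counting distinct addresses gives the number of lookups. A rough sketch follows, assuming the client address is the first whitespace-separated field of each line; cache_stats and the sample lines are made up for illustration and are not from the real log.

```perl
#!/usr/local/bin/perl
# Sketch: estimate how many reverse DNS lookups a per-address cache
# would need. No DNS queries are made; only first sightings are counted.
use strict;

# Returns (number of entries, number of cache misses) for a list of
# log lines whose first field is the client address.
sub cache_stats
{
    my %seen;
    my ($entries, $lookups) = (0, 0);

    foreach my $line (@_)
    {
        my ($ip) = split(' ', $line);
        next unless defined $ip;          # skip blank lines
        $entries++;
        $lookups++ unless $seen{$ip}++;   # first sighting = one lookup
    }
    return ($entries, $lookups);
}

# hypothetical sample data
my @sample = (
    '10.0.0.1 - - [11/Dec/1997:10:00:00 -0500] "GET /a HTTP/1.0" 200 10',
    '10.0.0.1 - - [11/Dec/1997:10:00:01 -0500] "GET /b HTTP/1.0" 200 10',
    '10.0.0.2 - - [11/Dec/1997:10:00:02 -0500] "GET /a HTTP/1.0" 200 10',
    '10.0.0.1 - - [11/Dec/1997:10:00:03 -0500] "GET /c HTTP/1.0" 200 10',
);

my ($entries, $lookups) = cache_stats(@sample);
printf("%d entries, %d lookups\n", $entries, $lookups);   # 4 entries, 2 lookups
```

Fed the lines of the real log instead of the sample, the ratio of entries to lookups would replace the 100-to-1 estimate with an actual number.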
-----------------------------------------------------------------------
100% Pure Java Developer | http://www.veryComputer.com/~frenzy/
-----------------------------------------------------------------------
-----------------------------------------------------------------------
#!/usr/local/bin/perl
# Mark Lindner - 12/11/97
# Reduces an NCSA httpd log to tab-separated host, user, date and URL.

use Socket;    # for AF_INET

$input = "megalog";
$output = "cleanlog";

open(FIN, "<$input") || die "can't open $input: $!";
open(FOUT, ">$output") || die "can't open $output: $!";

while(<FIN>)
{
    chomp;

    # skip any line that is not a GET request in the expected format
    next unless /^(.+) .+ (.*) \[(\d+)\/(.+)\/(\d+):(\d+):(\d+).*\] \"GET (\S+)/;

    $ip = $1;
    $name = $2;
    $day = $3;
    $month = $4;
    $year = $5;
    $hr = $6;
    $min = $7;
    $url = $8;

    # chop off query string
    if(($pos = index($url, "?")) > -1)
    {
        $url = substr($url, 0, $pos);
    }

    $dns = &reverse_dns($ip);
    print(FOUT "$dns\t$name\t$month $day $year $hr:$min\t$url\n");
}

close(FIN);
close(FOUT);

sub reverse_dns
{
    my $dns;

    # if domain name for this ip address not yet known, look it up
    $dns = $dnsname{$_[0]};
    if(!$dns)
    {
        my @octets = split(/\./, $_[0]);
        my $address = pack('C4', @octets);
        $dns = gethostbyaddr($address, AF_INET);

        # fall back to the address itself when the lookup fails
        if(!defined($dns) || $dns eq "") { $dns = $_[0]; }

        $dns =~ tr/A-Z/a-z/; # convert dns name to lowercase
        $dnsname{$_[0]} = $dns;
    }
    return($dns);
}