detecting automated accesses in logfiles

Post by Thilo Salm » Fri, 06 Mar 1998 04:00:00



I'm in need of some kind of rather sophisticated logfile analyzer, since
I'm trying to find automated accesses (e.g. by some machine running a
daily cron job) within a rather big logfile (roughly 1,000,000 hits/month).
Does anybody know of such a tool? I'd hate to spend my time reinventing
the wheel...

Ciao
  Thilo
--

 
 
 

detecting automated accesses in logfiles

Post by Mark Nottingham » Sat, 07 Mar 1998 04:00:00




> I'm in need of some kind of rather sophisticated logfile analyzer, since
> I'm trying to find automated accesses (e.g. by some machine running a
> daily cron job) within a rather big logfile (roughly 1,000,000 hits/month).
> Does anybody know of such a tool? I'd hate to spend my time reinventing
> the wheel...

Well, the two ways that come to mind to detect them are:

1) by UserAgent
2) by sessions consisting of all '-' referers (but see below)

How good does it have to be? Here's how I'd handle it (assuming you're
using Combined Logfile Format):

Run a script to sort out all of the different User Agents that you see.
Identify what you judge to be automated agents and write a regex for them.
Then, plug it into a logfile parser to weed them out. A good place to
start might be /scooter|spider|crawler/i.
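
For example, a quick Python sketch of that survey step could look like the
following (Perl would do just as well); it assumes the User-Agent is the
last of the three quoted fields on each Combined Log Format line, and uses
the pattern above as a starting point:

import re
import sys
from collections import Counter

# Combined Log Format: the request, Referer and User-Agent are the
# three quoted fields on each line; the User-Agent is the last one.
QUOTED = re.compile(r'"([^"]*)"')
BOTS = re.compile(r'scooter|spider|crawler', re.IGNORECASE)

agents = Counter()
for line in open(sys.argv[1]):
    fields = QUOTED.findall(line)
    if len(fields) >= 3:
        agents[fields[-1]] += 1

# One line per User-Agent, busiest first, flagged if it looks automated.
for agent, hits in agents.most_common():
    flag = 'BOT?' if BOTS.search(agent) else '    '
    print('%8d  %s  %s' % (hits, flag, agent))

Once you've eyeballed that list, extend the pattern and use it to throw
matching lines away before feeding the log to your normal analyzer.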

If you want to be more fancy about it, you can track user sessions that
solely consist of '-' referers and more than one or two hits; these are
usually automated agents in my experience, as long as you weed out earlier
versions of Lynx and some other older browsers.
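
Again only as a sketch, and with "session" crudely simplified to "all hits
from one host" (real session tracking would be smarter about that), the
second check could look like:

from collections import defaultdict

def dash_referer_hosts(logfile, min_hits=3):
    # Hosts whose every request carried a '-' Referer, with at least
    # min_hits hits.  The host is the first field on the line; the
    # Referer is the second quoted field in Combined Log Format.
    hits = defaultdict(int)
    clean = set()          # hosts seen with a real Referer at least once
    for line in open(logfile):
        parts = line.split('"')
        if len(parts) < 6:
            continue
        host = line.split(None, 1)[0]
        referer = parts[3]
        hits[host] += 1
        if referer != '-':
            clean.add(host)
    return [h for h in hits if h not in clean and hits[h] >= min_hits]

min_hits is there to drop one-off visitors; the old-browser caveat from the
paragraph above still applies.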

In any case, you might want to try using Follow 1.x as the basic engine;
it's reasonably easy to rip the guts out of it and get it to do this sort
of thing, as long as you have a decent understanding of Perl. Be warned-
with logfiles that big, it'll chew up some memory (I should really fix
that...)

http://www.pobox.com/~mnot/follow/

hope this helps,

--
Mark Nottingham
Melbourne, Australia


 
 
 

detecting automated accesses in logfiles

Post by Thilo Salm » Sat, 07 Mar 1998 04:00:00


: > I'm in need of some kind of rather sophisticated logfile analyzer, since
: > I'm trying to find automated accesses (e.g. by some machine running a
: > daily cron job) within a rather big logfile (roughly 1,000,000 hits/month).
: > Does anybody know of such a tool? I'd hate to spend my time reinventing
: > the wheel...
:
: Well, the two ways that come to mind to detect them are:
:
: 1) by UserAgent
: 2) by sessions consisting of all '-' referers (but see below)
:
: How good does it have to be?

Sorry, I didn't state the background and why it's not enough to check for
Useragents and referers. Here it is:

I'm providing access to a database through a website, for which I'm lucky
to have sponsors. Since the data is highly interesting for commercial users,
I want to detect anybody who's querying the db in order to resell the data.
Anybody doing so would probably query the db once a day using an automated
mechanism. In order to hide themselves within the logfile, these people
would most likely spoof the Referer and User-Agent entries. Since I'm
selling the data myself, I'm hoping to be able to lock out anybody who's
'stealing' the data. Because all the information is public, I can't blame
anybody for copying it, but I still don't need to provide them with a
comfortable way of accessing it.

Thus, I'm hoping to find a tool which does some kind of sophisticated
analysis. I suspect I have to deal with Fourier stuff here, but I'm not
a hundred percent sure. Any help on this is highly appreciated!
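
To give an idea of the direction, here's a rough Python sketch, assuming I
can already pull out, per host, the hit times as hours since the start of
the log: bin the hits per hour and look at the size of the Fourier
coefficient at the 24-hour period. A robot hitting at the same time every
day scores close to 1, scattered access close to 0; of course a human who
checks in every afternoon scores high too, so it could only be one signal
among several.

import cmath

def daily_peak(hit_hours, span_hours):
    # hit_hours:  hit times for one host, in hours since the log start
    # span_hours: length of the observation window, in hours
    #             (ideally a multiple of 24)
    n = int(span_hours)
    bins = [0] * n
    for t in hit_hours:
        bins[int(t) % n] += 1          # hourly hit counts
    k = int(round(n / 24.0))           # number of 24h cycles in the window
    coeff = sum(bins[i] * cmath.exp(-2j * cmath.pi * k * i / n)
                for i in range(n))
    return abs(coeff) / max(1, sum(bins))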

: In any case, you might want to try using Follow 1.x as the basic engine;
: it's reasonably easy to rip the guts out of it and get it to do this sort
: of thing, as long as you have a decent understanding of Perl. Be warned-
: with logfiles that big, it'll chew up some memory (I should really fix
: that...)
:
: http://www.pobox.com/~mnot/follow/

Looks pretty nice! In particular I like the session tracking
feature, even though I can't see how it would track down the accesses I'm
trying to find.

Thanks

Thilo
--

 
 
 

detecting automated accesses in logfiles

Post by Mark Nottingham » Sun, 08 Mar 1998 04:00:00




> I'm providing access to a database through a website, for which I'm lucky
> to have sponsors. Since the data is highly interesting for commercial users,
> I want to detect anybody who's querying the db in order to resell the data.
> Anybody doing so would probably query the db once a day using an automated
> mechanism. In order to hide themselves within the logfile, these people
> would most likely spoof the Referer and User-Agent entries. Since I'm
> selling the data myself, I'm hoping to be able to lock out anybody who's
> 'stealing' the data. Because all the information is public, I can't blame
> anybody for copying it, but I still don't need to provide them with a
> comfortable way of accessing it.

Ahh... ok. Very interesting... What you need, then, is something to
analyse logs for frequent, regular or semi-regular access by the same
host. Probably best to assign a score for each host over time, and
report those over a threshold. How do these bounds look:

* same host can be defined as any host from a class C, etc... higher
  weight for same host
* user agent is thrown away, although hits with a '-' UA are weighted more...
* request always for same URL(s)? Should it include the query string, etc?
* time affects weight, biased towards access once a minute, hour, day,
  two days or week. Longer periods would be undesirable.
* number of hits, of course, affects weight

* hosts which hit /robots.txt are given lower weight?
* hosts which wander all over the site are given lower weight?

Of course, these parameters can be actively circumvented by a newer
generation of robots... *sigh* Still, might be interesting.
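
To make the shape of it concrete, a rough Python sketch of such a scorer
could look like the following; every weight in it is an arbitrary
placeholder, and it expects that you've already grouped the log into, per
host (or per class C net), a chronological list of (unix_time, url,
user_agent) tuples:

def score_host(events, hit_robots_txt=False):
    # events: chronological list of (unix_time, url, user_agent) tuples
    # for one host (or one class C network).  Higher score = more likely
    # to be a scheduled, automated visitor.  All weights are made up.
    score = 0.0
    times = [t for t, _, _ in events]
    urls = set(u for _, u, _ in events)

    # More hits, more weight (capped so a busy human doesn't dominate).
    score += 0.2 * min(len(events), 50)

    # Hits with a '-' User-Agent are weighted more.
    score += 2.0 * sum(1 for _, _, a in events if a == '-')

    # Always the same URL (including the query string) is suspicious.
    if len(urls) == 1:
        score += 10.0

    # Gaps close to a minute, hour, day, two days or a week score extra.
    favoured = (60, 3600, 86400, 2 * 86400, 7 * 86400)
    for a, b in zip(times, times[1:]):
        gap = b - a
        if any(abs(gap - f) < 0.05 * f for f in favoured):
            score += 1.0

    # Fetching /robots.txt, or wandering all over the site, argues for an
    # ordinary robot or an ordinary person: lower the weight.
    if hit_robots_txt:
        score -= 5.0
    if len(urls) > 20:
        score -= 5.0

    return score

Run that over a month of logs, report everything above some threshold, and
eyeball the survivors.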

> : http://www.pobox.com/~mnot/follow/

> Look's pretty nice! In particular I like the session tracking
> feature, even though I can't see how it would track down the access I'm
> trying to find.

Thanks. FYI, v2 is now out, linked from the above page.

--
Mark Nottingham
Melbourne, Australia

 
 
 

detecting automated accesses in logfiles

Post by Thilo Salm » Sun, 08 Mar 1998 04:00:00


: Ahh... ok. Very interesting... What you need, then, is something to
: analyse logs for frequent, regular or semi-regular access by the same
: host. Probably best to assign a score for each host over time, and
: report those over a threshold. How do these bounds look:
:
: * same host can be defined as any host from a class C, etc... higher
:   weight for same host

Sounds good to me.

: * user agent is thrown away, although hits with a '-' UA  are weighted more...

Perhaps. IMHO that's not so important. Somebody might forget to spoof
this one, but then again he might not.

: * request always for same URL(s)? Should it include the query string, etc?

Definitely. I would include the query string, since automated accesses will
most likely use identical query strings each time they access my db.
Unfortunately, they might use CGI's POST method instead.

: * time affects weight, biased towards access once a minute, hour, day,
:   two days or week. Longer periods would be undesirable.

Yeah, but how would you do that? Look at the following situation:

 Date        Human user     Machine
 02/01         14:25         14:00
 02/02         14:22         14:01
 02/03         13:55         14:01
 02/04         14:37         no access
 02/05         14:02         14:00

A human user accesses the same pages every day in order to keep up with
the latest information. A machine doing so might be down one day, but in
general it would access using a very strict pattern. I'm afraid there's a
good amount of math required in order to distinguish between the two.
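
For instance, one simple measure would be how far the time of day drifts
from one access to the next. A rough Python sketch (times given as minutes
past midnight, one entry per day the host showed up; it ignores hosts that
hit around midnight, which would need more care):

import math

def time_of_day_spread(minutes):
    # Standard deviation of a host's daily access time, in minutes.
    mean = float(sum(minutes)) / len(minutes)
    return math.sqrt(sum((m - mean) ** 2 for m in minutes) / len(minutes))

human   = [14*60+25, 14*60+22, 13*60+55, 14*60+37, 14*60+2]
machine = [14*60, 14*60+1, 14*60+1, 14*60]      # no access on 02/04

print(time_of_day_spread(human))    # about 15.5 minutes
print(time_of_day_spread(machine))  # 0.5 minutes

So in this example a threshold of a couple of minutes separates the two
columns cleanly; a cron job that fires at a slightly random offset would of
course blur that again.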

: * number of hits, of course, affects weight

Not so sure about this one, since there might be users who query my db
intensively.

: * hosts which hit /robots.txt are given lower weight?
: * hosts which wander all over the site are given lower weight?

Yep. These two might narrow things down a little.

Ciao
  Thilo
--

 
 
 

detecting automated accesses in logfiles

Post by John Logan » Mon, 09 Mar 1998 04:00:00


Thilo,

My company's product SurfReport could detect frequent visitors quite
easily. In the configuration page, just ask SurfReport to analyze the
visitors who have generated at least "x" hits. Frequent visitors will most
likely have high hit counts.

SurfReport analyzes log files on the fly, and is the fastest at over 50
MB/minute.

You can download a free 30 day evaluation of SurfReport at
http://netrics.com

--
John Logan
NETRICS.COM

**To respond, please remove .nospam from email address.**



> I'm in need of some kind of rather sophisticated logfile analyzer, since
> I'm trying to find automated accesses (e.g. by some machine running a
> daily cron job) within a rather big logfile (roughly 1,000,000 hits/month).
> Does anybody know of such a tool? I'd hate to spend my time reinventing
> the wheel...

> Ciao
>   Thilo
> --


 
 
 

detecting automated accesses in logfiles

Post by Eric Anderson » Sun, 22 Mar 1998 04:00:00



> My company's product SurfReport could detect frequent visitors quite
> easily. In the configuration page, just ask SurfReport to analyze the
> visitors who have generated at least "x" hits. Frequent visitors will most
> likely have high hit counts.
> SurfReport analyzes log files on the fly, and is the fastest at over 50
> MB/minute.
> You can download a free 30 day evaluation of SurfReport at
> http://netrics.com

How does your product deal with proxies?  =)

--

-----------------------------------------------------------------------
 Eric Anderson  Online Network-Entertainment  *Iron Bodybuilding
  ICQ 3849549      <http://www.veryComputer.com/>   <http://www.*iron.com>

-----------------------------------------------------------------------
     "..and then my doctor said my nose wouldn't bleed so much    
      if I just kept my finger outta there!" -- Ralph Wiggum

 
 
 
