: Ahh... ok. Very interesting... What you need, then, is something to
: analyse logs for
: frequent, regular or semi-regular access by the same host. Probably best
: to assign a score
: for each host over time, and report those over a threshold. How do these
: bounds look:
: * same host can be defined as any host from the same class C, etc...
: higher weight for the exact same host
Sounds good to me.
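A rough first cut at the bookkeeping, in Python (untested sketch; the
weights are made up and would need tuning):

    from collections import defaultdict

    def class_c(ip):
        # treat any host from the same class C as "the same host"
        return ip.rsplit('.', 1)[0]

    scores = defaultdict(float)   # score per class-C network
    seen = defaultdict(int)       # hits per exact IP

    def record_hit(ip):
        net = class_c(ip)
        seen[ip] += 1
        scores[net] += 1.0        # base weight: same class C
        if seen[ip] > 1:
            scores[net] += 0.5    # extra weight: exact same host again

    def report(threshold):
        return sorted(net for net, s in scores.items() if s > threshold)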
: * user agent is thrown away, although hits with a '-' UA are weighted more...
Perhaps. IMHO that's not so important. Somebody might forget to spoof
this one, but then again he might not.
: * request always for same URL(s)? Should it include the query string, etc?
Definitely. I would include the query string, since automated accesses will
most likely use identical query strings each time they access my db.
Unfortunately, they might use CGI's POST method, which would keep the query
out of the log entirely.
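If the scoring keys on the whole request, something like this would do
(Python sketch; assumes common-log-style request lines, and as said a
POST body never shows up in the log at all):

    from urllib.parse import urlsplit

    def request_key(request_line):
        # 'GET /cgi-bin/search?q=foo HTTP/1.0' -> key on method, path
        # and query string, so identical automated queries collapse
        # onto one counter
        method, url, _ = request_line.split(None, 2)
        parts = urlsplit(url)
        return (method, parts.path, parts.query)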
: * time affects weight, biased towards access once a minute, hour, day,
: 2-day, week. longer periods would be undesirable.
Yeah, but how would you do that? Look at the following situation:
Day     Human user   Machine
02/01   14:25        14:00
02/02   14:22        14:01
02/03   13:55        14:01
02/04   14:37        no access
02/05   14:02        14:00
A human user accesses the same pages every day in order to keep up with the
latest information. A machine doing so might be down one day, but in general
would access using a very strict pattern. I'm afraid there's a good amount of
math required to distinguish between the two.
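Maybe less math than you fear. A cheap first cut might be the spread
(standard deviation) of the time-of-day over the days a host shows up;
the missed day simply drops out of the sample. With your numbers:

    from statistics import pstdev

    def minutes(hhmm):
        h, m = hhmm.split(':')
        return int(h) * 60 + int(m)

    human   = [minutes(t) for t in ('14:25', '14:22', '13:55', '14:37', '14:02')]
    machine = [minutes(t) for t in ('14:00', '14:01', '14:01', '14:00')]

    print(pstdev(human))    # ~15.5 minutes: loose, human-looking
    print(pstdev(machine))  # 0.5 minutes: rigid, machine-looking

A spread of under a minute or two over several days could earn a large
weight; where exactly to put the cutoff is guesswork until it's run on
real logs.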
: * number of hits, of course, affects weight
Not so sure about this one, since there might be users who intensively query
the db by hand.
: * hosts which hit /robots.txt are given lower weight?
: * hosts which wander all over the site are given lower weight?
Yep. These two might narrow things down a little.
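Folding those two in, plus the hit count and the '-' UA from above, could
be as simple as adjusting each host's score afterwards (Python sketch,
weights again pure guesswork):

    def adjust(score, hits, distinct_urls, got_robots_txt, blank_ua_hits):
        score += 0.1 * hits             # many hits raise suspicion a little
        score += 0.5 * blank_ua_hits    # '-' user agents look scripted
        if got_robots_txt:
            score *= 0.5                # polite robots announce themselves
        if distinct_urls > 20:          # hypothetical cutoff: wandering all
            score *= 0.5                # over the site looks like a human
        return score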