Using robots.txt to selectively allow robots but not users?

Post by Mark McWiggi » Fri, 16 Oct 1998 04:00:00



I have a client who wants their site to be searched and indexed by
robots, but who also wants users visiting the site to fill out a
survey form before being allowed access.

Is there a standard way of doing this with Apache? I can think of some
approaches, but surely there is a smoother way of doing this?

Thanks in advance for any suggestions.

 
 
 

Post by Dan Wil » Sat, 17 Oct 1998 04:00:00




> I have a client that wants to allow their site to be searched &
> indexed by robots, but who want users visiting the site to fill out a
> survey form before being allowed access.

> Is there a standard way of doing this with Apache? I can think of some
> approaches, but certainly there is a smooth way of doing this?

> Thanks in advance for any suggestions.

There is no absolute way to know the difference between a real user and a
robot. Users of either type can easily fake being the other.

I honestly can't think of any way to accomplish what you want that would
be 100% effective, except perhaps to give search engines a "back door" URL
they can use instead of having to go through the form. But for that to
work you'd have to tell all of the search engines where the back door is,
and get them not to publish it!


** Remove the REMOVE in my address to reply **

 
 
 

Post by Dan Wil » Sat, 17 Oct 1998 04:00:00


I just had another thought: Put a hidden link on your home page (the one
with the form) like this:

  <a href="real-home-page.html"></a>

That way the search engines will figure out how to get in, and average
users won't unless they look at the source. It's not 100%, but it's not
too bad either.
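
To illustrate why a link with no anchor text is still discoverable, here
is a minimal Python sketch of how a crawler might collect href values
from a page. It only uses the standard library's html.parser; the
LinkCollector class and extract_links helper are illustrative names and
are not taken from any real spider.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, whether or not it has visible text."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    """Return all link targets found in the given HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    return parser.links
```

A browser renders nothing for the empty anchor, but a program that reads
the markup sees the target just the same, which is also why the trick
does nothing against a user who views the source.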


** Remove the REMOVE in my address to reply **

 
 
 

Post by Alan Coopersmi » Sun, 18 Oct 1998 04:00:00




>> I have a client that wants to allow their site to be searched &
>> indexed by robots, but who want users visiting the site to fill out a
>> survey form before being allowed access.

>I honestly can't think of any way to accomplish what you want that would
>be 100% effective, except perhaps to give search engines a "back door" URL
>they can use instead of having to go through the form. But for that to
>work you'd have to tell all of the search engines where the back door is,
>and get them not to publish it!

And anyone who found the site through a search engine would go in
through the back door, skipping the survey.

--
________________________________________________________________________

Univ. of California at Berkeley         http://soar.Berkeley.EDU/~alanc/

 
 
 

Post by Darren Co » Sun, 18 Oct 1998 04:00:00


>> I have a client that wants to allow their site to be searched &
>> indexed by robots, but who want users visiting the site to fill out a
>> survey form before being allowed access.

The BrowserMatch directive should let you set an environment variable
that says whether the client is a spider or not. You then make
index.html an alias for a CGI script that sends back a different
redirect (a Location header) depending on that variable, so spiders
end up at "realindex.html" and everyone else ends up at "survey.html".
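
A rough sketch of that setup, written in Python for illustration. The
variable name is_spider, the spider names in the pattern, and the script
path are all assumptions made for this example; only the BrowserMatch
and ScriptAlias directives themselves and the two page names come from
the description above.

```python
import os
import sys

# Assumed Apache configuration (names and paths are hypothetical):
#
#   BrowserMatchNoCase "Scooter|Slurp|ArchitextSpider" is_spider
#   ScriptAlias /index.html /usr/local/apache/cgi-bin/front-door.py
#
# BrowserMatch sets is_spider when the User-Agent matches the pattern,
# and Apache passes that variable on to the CGI script's environment.


def choose_target(environ):
    """Pick the page to send the client to, based on the is_spider variable."""
    return "/realindex.html" if environ.get("is_spider") else "/survey.html"


def cgi_response(environ):
    """Build the CGI header block that redirects the client."""
    target = choose_target(environ)
    return "Status: 302 Found\nLocation: " + target + "\n\n"


if __name__ == "__main__":
    sys.stdout.write(cgi_response(os.environ))
```

Keeping the decision in choose_target makes it easy to check the logic
without a web server. As noted elsewhere in this thread, the User-Agent
can be faked, so this is a convenience and not an access control.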

It would be nice if we could define directives that say when someone
asks for index.html, show indexn4.html if Netscape 4, indexn3.html if
Netscape 3, etc. Then we could make top pages that suit the browser
features. I couldn't see anything in Apache that can do this without
resorting to cgi/modules however.
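
For what the CGI route might look like, here is a minimal Python sketch
that maps a User-Agent string to one of the page names mentioned above.
The function name page_for_agent and the fallback page
indexdefault.html are made up for this example. It assumes Netscape
identifies itself as "Mozilla/<version>" and that other browsers
borrowing the Mozilla prefix include the word "compatible", which is
true of Internet Explorer but is not guaranteed for every client.

```python
import re


def page_for_agent(user_agent):
    """Return a version-specific page for Netscape 3 or 4, else a default page."""
    match = re.match(r"Mozilla/(\d+)\.", user_agent or "")
    # Browsers that only imitate Netscape usually say "compatible".
    if match and "compatible" not in user_agent:
        major = int(match.group(1))
        if major in (3, 4):
            return "indexn%d.html" % major
    return "indexdefault.html"
```

A CGI script aliased to index.html could read HTTP_USER_AGENT from its
environment, call this function, and either redirect to the result or
print that file directly.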

>There is no absolute way to know the difference between a real user and a
>robot. Users of either type can easily fake being the other.

Spiders are normally honest, and if not they're probably not a real
search engine, so who cares where they go. Almost all users have
browsers where the user agent can't be changed, and the rest are very
unlikely to pretend to be a spider.

Darren

 
 
 

Post by Dan Wil » Tue, 20 Oct 1998 04:00:00




> >> I have a client that wants to allow their site to be searched &
> >> indexed by robots, but who want users visiting the site to fill out a
> >> survey form before being allowed access.

> The BrowserMatch directive should let you set an environment variable
> to say if they're a spider or not. You then make index.html an alias
> for a cgi script that will pass back a different Redirect header. So
> spiders end up at "realindex.html" and others end up at "survey.html".

> It would be nice if we could define directives that say when someone
> asks for index.html, show indexn4.html if Netscape 4, indexn3.html if
> Netscape 3, etc. Then we could make top pages that suit the browser
> features. I couldn't see anything in Apache that can do this without
> resorting to cgi/modules however.

> >There is no absolute way to know the difference between a real user and a
> >robot. Users of either type can easily fake being the other.

> Spiders are normally honest, and if not they're probably not a real
> search engine, so who cares where they go. Almost all users have
> browsers where the user agent can't be changed, and the rest are very
> unlikely to pretend to be a spider.

The newer Spambots (programs that go looking for email addresses to send
spam to) use a fake agent ID to make themselves look like IE or Netscape.
Granted, this also points out the hole in my own idea of having a hidden
link to the "real" homepage.


** Remove the REMOVE in my address to reply **

 
 
 

Post by Darren Co » Wed, 21 Oct 1998 04:00:00


>> Spiders are normally honest, and if not they're probably not a real
>> search engine, so who cares where they go. Almost all users have
>> browsers where the user agent can't be changed, and the rest are very
>> unlikely to pretend to be a spider.
>The newer Spambots (programs that go looking for email addresses to send
>spam to) use a fake agent ID to make themselves look like IE or Netscape.
>Granted, this also points out the hole in my own idea of having a hidden
>link to the "real" homepage.

In this case that's great! It means the spambots will have to fill out
the survey before they are allowed into the site. How about under the
gender question you have:
  Male
  Female
  SpamBot
with SpamBot as the default option (I doubt spiders change the default
option, do they?). Then anyone who chooses it would be redirected to
option do they?). Then anyone who chooses it would be redirected to
another site (one with a few million false email addresses would be
perfect :-).
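
Purely as an illustration of the idea, a survey handler along these
lines might look like the following Python sketch. The form field name
gender, its values, the decoy URL, and the function name after_survey
are all invented for this example. It treats a missing field the same as
the preselected SpamBot choice.

```python
from urllib.parse import parse_qs


def after_survey(form_body):
    """Decide where to send a client after the survey form is submitted."""
    fields = parse_qs(form_body)
    # "spambot" is the preselected default, so an untouched form submits it.
    gender = fields.get("gender", ["spambot"])[0]
    if gender == "spambot":
        return "http://decoy.example/addresses.html"
    return "/realindex.html"
```

Real visitors who never touch the question would also be sent away, so
in practice the default would need to be chosen with that in mind.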

But having said that, it's a good idea to have a "skip this survey"
link, otherwise you'll lose a lot of visitors on that first page.

Darren

 
 
 
