Automagic .txt versions of HTML pages (was Re: [Q] How to force a html page to a text version)

Post by Gerald Oskoboiny » Wed, 19 Nov 1997 04:00:00





>|One thing you can do is:
>|1) view the page with Netscape or MSIE.
>|2) press CTRL A to select all text.
>|3) paste in your favorite text editor...

>And how do you do that in an automated fashion? You can't. The text
>version quickly becomes out of sync with the page itself.

If you're running Apache as your Web server, you can get automatic
text versions of HTML pages quite easily with a hack I thought up
a while ago.

Just put this:

    ErrorDocument 404 /cgi-bin/404error

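in your httpd's conf files, so that Apache runs the script for any URL
it can't find and passes the originally requested path in the
REDIRECT_URL environment variable. Then include something like this in
the "404error" CGI script: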

#!/usr/local/bin/perl
#
# 404error: a cool 404 error handler
#
# Gerald Oskoboiny, 30 Jan 1997

$htdocs    = "/www/htdocs";
$logfile   = "/usr/log/404_error_log";
$html2txt  = "/usr/local/bin/lynx -cfg=/usr/local/lib/lynx.cfg -validate -dump";

$extension = $ENV{REDIRECT_URL}; $extension =~ s/.*\.//g;
$basename  = $ENV{REDIRECT_URL}; $basename  =~ s/\.[^\.]*$//g;
$basename  =~ s|^/||g;

#####
# Check if they were looking for a ".txt" file; if so, generate one for them.
if ( ( $extension eq "txt" ) && ( -f "$htdocs/${basename}.html" ) ) {
    print "Content-Type: text/plain\n\n";
    open( HTML2TXT, "$html2txt http://www.hwg.org/${basename}.html |" ) ||
      die "couldn't run $html2txt with http://www.hwg.org/${basename}.html! $!";
    while (<HTML2TXT>) {
        print;
    }
    close( HTML2TXT ) || die "couldn't close $html2txt! $!";
    exit;
}

#####

# do other stuff here...

Et voilà! Instant .txt versions of all your HTML pages.

For example:

    http://www.hwg.org/resources/html/validation.html  (HTML)
    http://www.hwg.org/resources/html/validation.txt   (plain text)

    http://www.hwg.org/index.html
    http://www.hwg.org/index.txt
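
To make the mapping from a requested .txt URL to its .html source
concrete, here is a small standalone sketch of the same substitutions
the handler applies to REDIRECT_URL. The helper name split_url is made
up for illustration and is not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# split_url is a hypothetical helper, used here only for illustration.
# It applies the same substitutions as the handler to a request path.
sub split_url {
    my ($url) = @_;
    my $extension = $url; $extension =~ s/.*\.//;      # keep text after the last dot
    my $basename  = $url; $basename  =~ s/\.[^.]*$//;  # drop the final ".ext"
    $basename =~ s|^/||;                               # drop the leading slash
    return ( $basename, $extension );
}

my ( $basename, $extension ) = split_url("/resources/html/validation.txt");
print "$basename $extension\n";    # prints "resources/html/validation txt"
```

The handler then only has to check that "$htdocs/$basename.html" exists
before handing the corresponding URL to lynx.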

This isn't especially efficient, since every .txt request starts a CGI
process and a lynx that fetches the page again over HTTP, but it gets
decent results with extremely little effort.

Better would be to make it an Apache module triggered by a .txt Handler
that caches the automatically-generated plain text versions somewhere
after they're generated.
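
Short of writing a module, the caching half of that idea could be
roughed out inside the CGI itself. What follows is only a sketch under
assumptions: the helper name cached_text_for, the cache file location,
and the $render callback are all hypothetical, and a real version would
also need to cope with two requests writing the same cache file at once.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Path qw(make_path);

# Hypothetical helper: return the plain text for $html_file, regenerating
# the cached copy only when it is missing or older than the HTML source.
# $render is a code reference that produces the text (for instance by
# running lynx -dump); passing it in keeps the cache logic separate.
sub cached_text_for {
    my ( $html_file, $cache_file, $render ) = @_;

    my $html_mtime = ( stat $html_file )[9];
    defined $html_mtime or die "can't stat $html_file: $!";
    my $cache_mtime = ( stat $cache_file )[9];    # undef if no cache yet

    if ( !defined $cache_mtime || $cache_mtime < $html_mtime ) {
        my $text = $render->($html_file);
        make_path( dirname($cache_file) );        # no-op if it already exists
        open( my $out, '>', $cache_file ) or die "can't write $cache_file: $!";
        print {$out} $text;
        close($out) or die "can't close $cache_file: $!";
        return $text;
    }

    open( my $in, '<', $cache_file ) or die "can't read $cache_file: $!";
    local $/;                                     # slurp the whole cached file
    my $text = <$in>;
    close($in);
    return $text;
}
```

In the handler, $render would wrap the existing lynx pipe, and the
returned text would be printed after the Content-Type header.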

Gerald
--
Gerald Oskoboiny