>|One thing you can do is:
>|1) view the page with Netscape or MSIE.
>|2) press CTRL A to select all text.
>|3) paste in your favorite text editor...
>And how do you do that in an automated fashion? You can't. The text
>version quickly becomes out of sync with the page itself.

If you're running Apache as your Web server, you can get automatic
text versions of HTML pages quite easily with a hack I thought up
a while ago.
Just put this:
ErrorDocument 404 /cgi-bin/404error
in your httpd's conf files, then include something like this in the
"404error" CGI script:
#!/usr/local/bin/perl
#
# 404error: a cool 404 error handler
#
# Gerald Oskoboiny, 30 Jan 1997
$htdocs = "/www/htdocs";
$logfile = "/usr/log/404_error_log";
$html2txt = "/usr/local/bin/lynx -cfg=/usr/local/lib/lynx.cfg -validate -dump";
$extension = $ENV{REDIRECT_URL}; $extension =~ s/.*\.//g;
$basename = $ENV{REDIRECT_URL}; $basename =~ s/\.[^\.]*$//g;
$basename =~ s|^/||g;
#####
# Check if they were looking for a ".txt" file; if so, generate one for them.
if ( ( $extension eq "txt" ) && ( -f "$htdocs/${basename}.html" ) ) {
    print "Content-Type: text/plain\n\n";
    open( HTML2TXT, "$html2txt http://www.hwg.org/${basename}.html |" ) ||
        die "couldn't run $html2txt with http://www.hwg.org/${basename}.html! $!";
    while (<HTML2TXT>) {
        print;
    }
    close( HTML2TXT ) || die "couldn't close $html2txt! $!";
    exit;
}
#####
# do other stuff here...
et voila! Instant .txt versions of all your HTML pages.
For example:
http://www.hwg.org/resources/html/validation.html (HTML)
http://www.hwg.org/resources/html/validation.txt (plain text)
http://www.hwg.org/index.html
http://www.hwg.org/index.txt
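
To see what the two substitutions at the top of the script do with a
request like the ones above, here is a small stand-alone sketch. The
split_url wrapper is just a name made up for illustration; the script
itself does the same thing inline on $ENV{REDIRECT_URL}.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a request path the same way the 404 handler does:
#   extension = everything after the last dot
#   basename  = the path minus its leading slash and final extension
# (split_url is a hypothetical helper, not part of the original script.)
sub split_url {
    my ($url) = @_;
    ( my $extension = $url ) =~ s/.*\.//;
    ( my $basename  = $url ) =~ s/\.[^.]*$//;
    $basename =~ s|^/||;
    return ( $extension, $basename );
}

my ( $ext, $base ) = split_url("/resources/html/validation.txt");
print "extension=$ext basename=$base\n";
# prints: extension=txt basename=resources/html/validation
```

The handler then checks that $htdocs/resources/html/validation.html
exists before running lynx on it. A path with no dot at all comes back
unchanged as the "extension", so it never matches "txt".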
This approach isn't especially efficient (it runs lynx on every request
for a .txt file), but it gets decent results with very little effort.
A better approach would be an Apache module, triggered by a handler for
.txt requests, that caches the automatically generated plain-text
versions somewhere so they don't have to be regenerated each time.
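
This is not an Apache module, but the caching half of that idea can be
roughed out in plain Perl. A minimal sketch, assuming the cache lives
in ordinary files and that a cached copy is stale whenever the HTML
source has a newer modification time. The names cached_text, $cache_file
and $generate are made up for illustration.

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Path qw(make_path);

# Return the plain-text version of $html_file, regenerating the cached
# copy only when it is missing or older than the HTML source.
# $generate is a code ref that takes the HTML path and returns the text
# (hypothetical interface; in the CGI script it would wrap the lynx pipe).
sub cached_text {
    my ( $html_file, $cache_file, $generate ) = @_;

    my $html_mtime = ( stat $html_file )[9];
    defined $html_mtime or die "can't stat $html_file: $!";
    my $cache_mtime = ( stat $cache_file )[9];    # undef if no cache yet

    if ( !defined $cache_mtime || $cache_mtime < $html_mtime ) {
        my $text = $generate->($html_file);
        make_path( dirname($cache_file) );

        # Write to a temp file, then rename, so a concurrent reader
        # never sees a half-written cache file.
        my $tmp = "$cache_file.$$";
        open( my $out, '>', $tmp ) or die "can't write $tmp: $!";
        print {$out} $text;
        close($out) or die "can't close $tmp: $!";
        rename( $tmp, $cache_file ) or die "can't rename $tmp: $!";
        return $text;
    }

    # Cache is fresh: read it back in one go.
    open( my $in, '<', $cache_file ) or die "can't read $cache_file: $!";
    local $/;    # slurp mode
    my $text = <$in>;
    close($in);
    return $text;
}
```

In the 404 handler, $generate would run the lynx command and return
its output, and the script would print the result after the
Content-Type header. Comparing modification times keeps the text
version from drifting out of sync with the page.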
Gerald
--
Gerald Oskoboiny