how to text edit from html format page


Post by steverour » Thu, 31 Jul 2003 05:30:27



Does anyone know how to extract text from an HTML page using tools
such as sed, awk, or perl?
For example, on finance.yahoo.com I look up a financial statement,
and I want to extract only the numbers I need and put them into my
database.

Is there a way to do this with text-editing tools and regular
expressions? A worked example would be great, but even pointers to
the general approach would help.
Thank you.

 
 
 


Post by Adam Pric » Thu, 31 Jul 2003 13:36:15





The easiest way is to use a text-mode browser such as links or lynx
to render the HTML as plain text, and work from there.
HTH
Adam

 
 
 


Post by Chris F.A. Johnso » Thu, 31 Jul 2003 14:33:33




   There are many ways of dealing with HTML pages; without knowing
   more about what you want, it's hard to say how you should proceed.

   You can retrieve the page with

wget "$URL"

   or

lynx -source "$URL" > FILE.html

   or, to get a text rendering of the page:

lynx -dump -nolist "$URL" > FILE.txt

   Then you can select the lines you want with grep and pipe them to
   awk, sed, or cut to extract the information you want.
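
   As a minimal sketch of that last step: the sample text below is
   invented for illustration and only assumes what a text dump of a
   financial statement might look like; a real page will be laid out
   differently. The function name extract_row is hypothetical. It
   selects the first line containing a label with grep, strips the
   thousands separators with sed, and keeps only the numeric fields
   with awk.

```shell
#!/bin/sh
# Invented sample standing in for a text rendering of a page
# (for instance, something saved with "lynx -dump -nolist").
sample='   Income Statement
   Total Revenue        1,234,567     1,100,000
   Cost of Revenue        654,321       600,000
   Net Income             200,000       150,000'

# extract_row LABEL
# Reads text on stdin and prints the numeric fields of the first line
# that contains LABEL, separated by single spaces.
extract_row() {
    grep -F -- "$1" |            # keep lines containing the literal label
        head -n 1 |              # only the first match
        sed 's/,//g' |           # drop thousands separators
        awk '{
            sep = ""; line = ""
            for (i = 1; i <= NF; i++)
                if ($i ~ /^-?[0-9]+(\.[0-9]+)?$/) {   # numeric fields only
                    line = line sep $i
                    sep = " "
                }
            print line
        }'
}

printf '%s\n' "$sample" | extract_row "Total Revenue"   # prints: 1234567 1100000
printf '%s\n' "$sample" | extract_row "Net Income"      # prints: 200000 150000
```

   Because the function reads standard input, the same filter could
   sit at the end of a pipeline that starts with a saved dump, e.g.
   extract_row "Net Income" < FILE.txt. The label and the number
   format would need adjusting to whatever the real page contains.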

--
    Chris F.A. Johnson                        http://cfaj.freeshell.org
    ===================================================================
    My code (if any) in this post is copyright 2003, Chris F.A. Johnson
    and may be copied under the terms of the GNU General Public License

 
 
 


Post by steverour » Sat, 02 Aug 2003 11:47:24





Thank you, Chris.
That helps me a lot.
 
 
 
