Extracting Links from HTML for Links MYSQL data base...

Post by Jay » Wed, 20 Apr 2005 23:39:51



Let me first say, I don't know anything about sed.

I only need a really small bit of help.

I am running sed for DOS.

I was collecting web links and installing them into my database when
Google stopped working with sed.

This is my working database file from before sed or Google stopped working.

http://guitarchat.net/dmoz.sql.txt

This is my collection of links, produced with the Links Suite 4 software I
purchased.

http://guitarchat.net/links-suite.html

This is my page of links

http://12string.net.

As you can see, I got 1300 links before Google stopped working.

This is the sed code I was working with for several days.

http://guitarchat.net/sedscr_downloads_google2

I want to learn; however, I really want to finish the last two categories
badly!

If anyone could provide the sed commands to extract the links from that HTML
file, I would be really grateful.

*It was working correctly on the google.html file for several days*

Any ideas?

This is for a community page and not for spam harvesting.

I don't mind making a small PayPal donation if that is acceptable.

I hope this post is OK. I know it's a HELP! post... Please forgive my
ignorance of programming languages on my path of learning sed.

 
 
 

Post by Chris F.A. Johnson » Fri, 22 Apr 2005 04:32:59



> Let me first say, I don't know anything about sed.

> I only need a really small bit of help.

> I am running sed for DOS.

    You may get some help here, but this is a Unix group.

> I was collecting web links and installing them into my database when
> Google stopped working with sed.

    What stopped working? Please post your code; tell us what you are
    trying to do, and what exactly didn't work.

    Provide examples.

--
    Chris F.A. Johnson                  http://cfaj.freeshell.org/shell
    ===================================================================
    My code (if any) in this post is copyright 2005, Chris F.A. Johnson
    and may be copied under the terms of the GNU General Public License

 
 
 

Post by Jay » Thu, 21 Apr 2005 05:57:05


What I was trying to do was harvest guitar links from Google for my
database.
The HTML changed, and it changes with every search, so the sed commands must
be rediscovered for every search.

This is the code for Google's HTML; it works about 1/3 of the time.

http://guitarchat.net/sedscr_downloads_google2

This google search worked with my sed commands:
http://guitarchat.net/google.html

However, the search for "Heavy Metal Guitar"

http://guitarchat.net/google.html_Heavy_Metal-Guitar.html

returns a zero-size file. (File name changed for posting.)

This is the SQL file for my database that I was getting from sed.

http://guitarchat.net/google.sql

However, I purchased software that rips the links into a standard HTML
file every time, so I won't need to decipher the HTML and rework the sed
commands.

http://guitarchat.net/links-suite.html

If I could get the sed commands to build my database from
links-suite.html, I would be set!

Thanks, and sorry for disturbing this group. The comp.lang.awk group
suggested this as the correct group for sed discussions.

Thanks for the kind help and directions.

*I am reading the sed on-line manual*

 
 
 

Post by Chris F.A. Johnson » Fri, 22 Apr 2005 06:27:21



> Thanks, and sorry for disturbing this group. The comp.lang.awk group
> suggested this as the correct group for sed discussions.

   This is the correct group. But please post the code, not links to
   the code.


 
 
 

Post by Michael Tosch » Fri, 22 Apr 2005 07:26:48




Please don't top-post!

Your sed filter is highly dependent on the Google page code:
whenever they change it, your sed filter will fail.

This time they have added " delimiters for values, which must be
reflected in your sed filter as follows:
class=g  becomes  class="g"
href=\([^>]*\)  becomes href="\([^>]*\)"
font size=-1  becomes  font size="-1"
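
As a sketch only (this is not part of the original filter), a pattern can be written to accept both the old unquoted and the new quoted form, so the same command survives this particular change. The helper name extract_href and the sample URLs are made up for illustration, and the pattern assumes href is the first attribute of the <a> tag:

```shell
# Sketch: accept href values with or without double quotes.
# extract_href is a hypothetical helper name, not part of the original filter.
extract_href() {
  # The interval \{0,1\} after a double quote makes that quote optional
  # (POSIX basic regular expressions); [^">]* then stops at the closing
  # quote or at the end of the tag, whichever comes first.
  sed -n 's/.*<a href="\{0,1\}\([^">]*\)"\{0,1\}>.*/\1/p'
}

# One line in the old style, one in the new style.
printf '%s\n' '<a href=http://a.example/>Old style</a>' \
              '<a href="http://b.example/">New style</a>' | extract_href
```

Because the leading .* is greedy, a line holding several links yields only the last one; that limitation is inherent in this one-line approach.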

Maybe your additional program copes better with such changes.
Its output can be postprocessed with:

s/<b>//g
s/<\/b>//g
s/\&quot;/\"/g
s/\&amp;/\&/g
s/\&nbsp;/ /g
:b
s/<\/p>$//
tc
N
s/\n//
bb
:c
s/.*<a href="\([^>]*\)">\([^<]*\)<\/a>   \(.*\)$/INSERT INTO ...

ending like your previous sed filter.
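
One thing to watch in a join loop like the one above: t branches if any s command has succeeded since the last input line was read, so a hit on s/<b>//g earlier in the same cycle can trigger tc before the record really ends in </p>. The sketch below sidesteps that by testing for </p> with an address instead of t. It is only an illustration under assumptions: the helper name links_to_fields, the url|title|description output, and the sample record layout are all invented here, and the final INSERT substitution is deliberately left out.

```shell
# Sketch: join multi-line records, clean them up, and split out the fields.
# links_to_fields and the "|"-separated output are assumptions for illustration.
links_to_fields() {
  sed -n '
# Append following lines until the record ends with </p> (or input ends).
:b
/<\/p>$/!{
$!{
N
s/\n/ /
bb
}
}
s/<\/p>$//
# Strip bold tags.
s/<b>//g
s/<\/b>//g
# Decode entities; &amp; goes last so that "&amp;quot;" is not decoded twice.
s/&quot;/"/g
s/&nbsp;/ /g
s/&amp;/\&/g
# Emit url|title|description for records that contain a link.
s/.*<a href="\([^"]*\)">\([^<]*\)<\/a> *\(.*\)$/\1|\2|\3/p
'
}

printf '%s\n' '<p><a href="http://example.com/">Example &amp; <b>Co</b></a>   A guitar' \
              'page</p>' | links_to_fields
```

The newline is replaced by a space rather than deleted so that words on either side of a line break do not run together.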

 
 
 

Post by Jay » Thu, 21 Apr 2005 09:55:04


Thanks very much.

I was working on that script for several days.

It's perfect.

One minor thing: it won't allow ' apostrophes.

This is not a problem, but do you know an easy fix?
Or is the SQL database rejecting it?

Either way!!!!

Thanks so much!

 
 
 

Post by Michael Tosch » Sat, 23 Apr 2005 02:51:49



> Thanks very much.

> I was working on that script for several days.

> It's perfect.

> One minor thing: it won't allow ' apostrophes.

> This is not a problem, but do you know an easy fix?
> Or is the SQL database rejecting it?

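The ' is used as a value delimiter in the SQL statement.

The sed filter can delete all ' by inserting a line

s/'//g

(Write the ' without a backslash in the pattern: GNU sed treats \' there
as an end-of-buffer anchor, not as an apostrophe.)

You can also try to replace ' by \' and see if your program takes it
(it is a well-known escape):

s/'/\\'/g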
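
Another option, offered as a sketch rather than as what the original filter did: standard SQL escapes an apostrophe inside a string literal by doubling it, and MySQL accepts that form as well as the backslash form. The function names below are made up for illustration; the scripts are wrapped in shell double quotes only so that the apostrophe can be typed directly.

```shell
# Sketch: two ways to make apostrophes safe inside SQL string literals.
# Both function names are hypothetical.

# Standard SQL style: ' becomes ''.
escape_sql_doubling() {
  sed "s/'/''/g"
}

# Backslash style: ' becomes \'.
# Inside shell double quotes, \\\\ reaches sed as \\, which sed writes
# out as a single literal backslash.
escape_sql_backslash() {
  sed "s/'/\\\\'/g"
}

printf '%s\n' "Jay's 12-String Page" | escape_sql_doubling
printf '%s\n' "Jay's 12-String Page" | escape_sql_backslash
```

Whichever form is used, it has to run before the substitution that wraps the values in ' delimiters; otherwise the delimiters themselves get escaped.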
