how to get all linked pages ?

Post by Bert Douglas » Wed, 08 Sep 1999 04:00:00



Hi All,

I just got my first linux mandrake box 3 days ago.
Linux email is not yet working.  So don't blast me, please.

I want to make some kind of fairly simple script that will get all the linked pages of a given URL.

I strongly suspect there is something already available to do this.  I just don't know the right terminology, so it is difficult to
find.

Thanks,
Bert Douglas


how to get all linked pages ?

Post by jkl.. » Wed, 08 Sep 1999 04:00:00


Have you looked at wget? It sounds like the solution to your problem.
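As a sketch of what that suggestion might look like when scripted, here is a small Python helper that builds a wget command line for fetching a page plus the pages it links to. The flag choices (depth 1, staying below the starting URL, rewriting links for local viewing) are assumptions about what the original poster wants, not something stated in the thread:

```python
import subprocess  # only needed if you actually run the command

def wget_command(url, depth=1):
    """Build a wget command line that fetches a page and the pages it links to."""
    return ["wget",
            "--recursive",             # follow links from the starting page
            "--level", str(depth),     # how many links deep to go
            "--no-parent",             # don't wander above the starting URL
            "--convert-links",         # rewrite links so the copy works locally
            url]

# To perform the download (requires wget to be installed and a network link):
# subprocess.run(wget_command("http://example.com/"))
```

The same flags can of course be typed at a shell prompt directly; the function only spells out one reasonable combination.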


how to get all linked pages ?

Post by bernward halfkan » Thu, 09 Sep 1999 04:00:00



> Hi All,
> I just got my first linux mandrake box 3 days ago.
> Linux email is not yet working.  So don't blast me, please.
> I want to make some kind of fairly simple script that will get all the linked pages of a given URL.
> I strongly suspect there is something already available to do this.  I just don't know the right terminology, so it is difficult to
> find.

I use WWWOFFLE. Below is an extract from the WWWOFFLE welcome page:
regards, bernward

The WWWOFFLE programs simplify World Wide Web browsing from computers that
use intermittent (dial-up) connections to the internet.

Description

The wwwoffled program is a simple proxy server with special features for use
with dial-up internet links. This means that it is possible to browse web pages
and read them without having to remain connected.

While Online

    Caching of pages that are viewed for review later.
    Conditional fetching to only get pages that have changed.

While Offline

    The ability to follow links and mark other pages for download.
    Browser or command line interface to select pages for downloading.
    Optional info on bottom of pages showing cached date and allowing refresh.
    Works with pages containing forms.
    Works with pages that require basic username/password authentication.
    Can be configured to use dial-on-demand for pages that are not cached.

Automated Download  

    Downloading of specified pages non-interactively.
    Can automatically fetch inlined images in pages fetched this way.
    Can automatically fetch contents of all frames on pages fetched this way.
    Automatically follows links for pages that have been moved.
    Can monitor pages at regular intervals to fetch those that have changed.
    Makes backup copies of cached pages so server errors don't overwrite them.

Provides

    Caching of web pages (http), ftp sites and finger command.
    An introductory page with information and links to the built-in pages.
    Multiple indexes of pages stored in cache for easy selection.
    Interactive or command line control of online/offline status.
    User selectable purging of pages from cache based on URL matching.
    Interactive or command line option to fetch pages and links recursively.
    Interactive web page to allow editing of the configuration file.
    Built-in simple Web server for local pages.
    Automatic proxy configuration for Netscape.

General

    Can be used with one or more external proxies based on hostname.
    Automates proxy authentication for external proxies that require it.
    Configurable to still allow use on intranets while offline.
    Can be configured to block or not cache URLs based on file type or host.
    Can censor outgoing HTTP headers to maintain user privacy.
    All options controlled using a simple configuration file.
    Optional password control for management functions.
    User customisable error message and control pages.

Further WWWOFFLE Links

The WWWOFFLE FAQ is now provided with the program and there is also an online version at
http://www.gedanken.demon.co.uk/wwwoffle/version-2.3/FAQ.html

The WWWOFFLE homepage on the internet is available at http://www.gedanken.demon.co.uk/wwwoffle/index.html and
contains the latest information about the program in general.

The latest information about using this version of WWWOFFLE is on the WWWOFFLE Version 2.3 Users Page at
http://www.gedanken.demon.co.uk/wwwoffle/version-2.3/user.html and contains more information about using this
version of the program.



how to get all linked pages ?

Post by Mark Care » Fri, 10 Sep 1999 04:00:00



> Hi All,

> I just got my first linux mandrake box 3 days ago.
> Linux email is not yet working.  So don't blast me, please.

> I want to make some kind of fairly simple script that will get all the linked pages of a given URL.

> I strongly suspect there is something already available to do this.  I just don't know the right terminology, so it is difficult to
> find.

> Thanks,
> Bert Douglas

When getting into Linux, I wrote a small program in Visual Basic to do this.

how to get all linked pages ?

Post by Duncan Simps » Sat, 11 Sep 1999 04:00:00



<stuff snipped>

Quote:>> I want to make some kind of fairly simple script that will get all the linked pages of a given URL.

<stuff snipped>

Quote:>When getting into Linux I wrote a small program in Visual Basic to do this.

wget will do this, with the required few million options, support for
the robots exclusion stuff, only getting pages that have changed since
you last downloaded them, and a vast amount else besides. There are
several Perl scripts that do WWW mirroring too (the better ones also
avoid re-fetching things you already have that stayed the same).

--
Duncan (-:
"software industry, the: unique industry where selling substandard goods is
legal and you can charge extra for fixing the problems."
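For reference, the kind of "fairly simple script" the original poster describes can be sketched in Python using only the standard library. This is a hypothetical example, not anything posted in the thread: it parses a page's HTML, collects every <a href> target, and resolves relative links against the page's own URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collects the href target of every <a> tag seen in the document."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    """Return all URLs linked from an HTML page, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links

page = '<a href="a.html">A</a> <a href="http://other.example/b">B</a>'
print(extract_links(page, "http://example.com/index.html"))
# → ['http://example.com/a.html', 'http://other.example/b']
```

Each returned URL could then be fetched with urllib.request.urlopen() (or handed to wget) to download the linked pages themselves; a real mirroring tool would also honour robots exclusion and skip pages already downloaded, as noted above.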

