wintern: toward a shared desktop web cache facility


Post by Dan Connolly » Fri, 21 May 1999 04:00:00



I tried out Mandrake recently... great stuff! The Linux
desktop is really coming along. I haven't found time
to try GNOME yet...

Mandrake (i.e. KDE) has pretty darn good web integration:
a file browser for /home/connolly acts almost the same as
a browser for http://www.w3.org/People/Connolly/, with
drag and drop and so on. It even does the right thing,
putting network I/O in a background process so the
foreground never goes busy (well... never say never; there
are a few exceptions or bugs or something).

But each application seems to have its own web cache:
kpackage puts its stuff in ~/.kpackage/cache or whatever,
and so on.

I'd like to see the distribution builders (or the desktop
builders) work toward an integrated cache.

The basic idea starts with a little "gimme a URL and I'll
give you back a filename that you can use to access that data"
utility. Call it wintern. So I could do stuff like

        tar zxvf `wintern ftp://sunsite.unc.edu/honkin.tgz`

I know I can already do

        lynx -dump ftp://sunsite.unc.edu/honkin.tgz | tar zxvf -

but if I stop in the middle, I have to start the fetch all over.

wintern would be smart enough to cache stuff...

There should be a "public" cache in /var/webcache plus
a user-private cache in ~/.webcache (see
section "14.9 Cache-Control" of the HTTP 1.1 spec
http://www.w3.org/Protocols/History.html#Rev06
for what I mean by public/private).
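
A lookup would presumably try the private cache first and fall
back to the public one; a rough sketch in C (everything here is
hypothetical; the key is some hash of the URL, computed
elsewhere):

        #include <stdio.h>
        #include <stdlib.h>
        #include <unistd.h>

        /* Try ~/.webcache first, then /var/webcache.
           Returns 0 on a hit and fills in "path". */
        int cache_lookup(const char *key, char *path, size_t len)
        {
            const char *home = getenv("HOME");

            snprintf(path, len, "%s/.webcache/%s", home ? home : ".", key);
            if (access(path, R_OK) == 0)
                return 0;                       /* private hit */
            snprintf(path, len, "/var/webcache/%s", key);
            if (access(path, R_OK) == 0)
                return 0;                       /* public hit */
            return -1;                          /* miss: go fetch */
        }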

If we can get the synchronization stuff right to be able
to use the Netscape cache, so much the better.
But I don't know how to do validation of the browser cache.
Hmmm... probably a better way to integrate with browsers
is using a local proxy server; wintern would be a client
of that proxy with extra knowledge about the cache:
wintern would use an extension header to say "instead
of returning content over the TCP connection, just
store it in a file and give me a temporary redirect
to a file: URL where you stored the content."
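
On the wire that might look something like this (the header
name here is made up; nothing like it is standard):

        GET http://www.w3.org/ HTTP/1.1
        Host: www.w3.org
        X-Cache-To-File: true

        HTTP/1.1 302 Moved Temporarily
        Location: file:/var/webcache/AQ/RY/0047.html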

Does anybody know if the mozilla folks are planning to
split the network/caching component out so that mozilla
can be built as a thin client with outboard caching
facilities? Some folks are kicking around the idea
of using libwww as the next-generation Mozilla client
library; that would probably do the trick.

Anyway...

wintern has to be smart about resources whose state changes.
It's gotta have a --max-age parameter so I can go

        x=`wintern --max-age 86400 http://www.w3.org/`

to say that if there's a copy of the page in the cache
that's no more than 1 day old, that's good enough for me.
Maybe 1 day should be the default (or 1 week? or maybe
size-of-cached-data * 1 day/MB should be the default formula,
so that a 1 MB file gets checked for staleness after 1 day,
and a 7 MB file after 1 week).
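
That default formula is easy enough to write down; a sketch
(the 1 day/MB scale is just the guess above):

        /* Hypothetical default freshness window: 1 day per MB
           of cached data, so big files get re-checked less
           often. Returns seconds. */
        long default_max_age(long size_bytes)
        {
            long days = size_bytes / (1024L * 1024L);

            if (days < 1)
                days = 1;               /* never less than one day */
            return days * 86400L;
        }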

So if you're really picky about getting the current stuff,
you'd have to say

        x=`wintern --max-age 0 http://www.w3.org/`

wintern would also have a "cache priority" or "cache lifetime"
or "precious" flag that says "don't ever flush this from
the cache", so that you could grab a bunch of stuff and
count on it staying around.

Hmm... in fact, since wintern can spontaneously reclaim cache
space, it needs a "lease" policy; something like: by default,
things live in the cache for at least 1 hour after wintern
hands you the filename; if you want it for a whole day, you
have to say

        wintern --lease=86400 ...

and if you want it indefinitely, you have to renew the lease
periodically, just like DHCP.
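
A long-running job could renew from the shell; something like
(the --renew flag is as speculative as the rest of this):

        while sleep 1800; do wintern --renew --lease=3600 $url; done &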

Another mode for wintern would be like find or xargs, where
it invokes a subprocess:

        wintern -exec "md5sum {}" http://www.w3.org/

where the subprocess gets a lock, rather than a lease, for
its lifetime. Further, if the subprocess succeeds, the priority
of the cache item is set very low, but if the subprocess fails,
the priority is set normal/high, so that if you try again,
you're likely to get a cache hit.
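
A sketch of that flow in C, using an advisory lock on the cached
file (by the time this runs, -exec would have substituted the
local filename for {}; the priority bookkeeping is left out):

        #include <stdlib.h>
        #include <unistd.h>
        #include <fcntl.h>
        #include <sys/file.h>

        /* Pin the cache entry with a lock (not a lease) while
           the subprocess runs, then report its exit status so
           the caller can set the entry's priority. */
        int exec_with_lock(const char *cachefile, const char *cmd)
        {
            int fd, status;

            fd = open(cachefile, O_RDONLY);
            if (fd < 0)
                return -1;
            flock(fd, LOCK_SH);
            status = system(cmd);
            flock(fd, LOCK_UN);
            close(fd);
            /* status == 0: mark the entry very low priority;
               otherwise leave it normal/high so a retry hits. */
            return status;
        }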

Of course, the shell command line isn't the only place
where we want to be able to play the "trade a URL for
a filename" trick. There must also be a C-callable
wintern work-alike (and perl/tcl/python-callable ones)
so that I can do
        fp = wfopen("http://www.w3.org/");
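
A first cut at wfopen could just shell out to wintern and open
whatever filename comes back; a sketch (no handling here for
the lease expiring mid-read, which a real one would need):

        #include <stdio.h>
        #include <string.h>

        /* Run wintern, read the one-line filename it prints,
           and open that file like any other. */
        FILE *wfopen(const char *url)
        {
            char cmd[1100], path[1024];
            FILE *p;

            snprintf(cmd, sizeof cmd, "wintern '%s'", url);
            p = popen(cmd, "r");
            if (p == NULL)
                return NULL;
            if (fgets(path, sizeof path, p) == NULL) {
                pclose(p);
                return NULL;
            }
            pclose(p);
            path[strcspn(path, "\n")] = '\0';   /* strip newline */
            return fopen(path, "r");
        }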

Hmm... it's likely that wintern could play tricks with
named pipes to allow "progressive display" sorts of things
instead of blocking until all the content is local. But
in that case, you might as well just open an HTTP connection
to the local proxy. Hmm...

Anyway...

Some folks have talked about ftp and http file systems, but I think
putting this stuff in the kernel makes for unwieldy deployment.
I'd rather see it in user space so it can be used on
non-Linux platforms.

The userfs designs that I've seen also involve a user-visible
convention for mapping ftp URLs into the local file system, so
that
        ftp://ftp.x.org/X11/foo.tgz
becomes
        /ftp/org/x/ftp/X11/foo.tgz

but a critical feature of wintern is that the local filename
is totally invisible and totally independent of the URL;
the local filenames are most likely something like
        /var/webcache/AQ/RY/0047.html
where AQRY0047 is some hash of the URL, split two directory
levels deep just because huge directories are slow.
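
Generating such a name is cheap; a sketch with a toy string
hash (a real wintern would want something stronger, like MD5,
so collisions are not a worry):

        #include <stdio.h>

        /* Map a URL to an opaque cache path, two directory
           levels deep so no single directory gets huge.
           djb2 is just a placeholder hash here. */
        void cache_path(const char *url, char *path, size_t len)
        {
            unsigned long h = 5381;
            const char *s;

            for (s = url; *s; s++)
                h = h * 33 + (unsigned char)*s;
            snprintf(path, len, "/var/webcache/%02lX/%02lX/%04lX",
                     (h >> 24) & 0xFF, (h >> 16) & 0xFF, h & 0xFFFF);
        }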

--
Dan Connolly, W3C
http://www.w3.org/People/Connolly/


wintern: toward a shared desktop web cache facility

Post by Victor Wagner » Tue, 25 May 1999 04:00:00


: The basic idea starts with a little "gimme a URL and I'll
: give you back a filename that you can use to access that data"
: utility. Call it wintern. So I could do stuff like

:       tar zxvf `wintern ftp://sunsite.unc.edu/honkin.tgz`

: I know I can already do

:       lynx -dump ftp://sunsite.unc.edu/honkin.tgz | tar zxvf -

: but if I stop in the middle, I have to start the fetch all over.

Try

        wget -O - ftp://sunsite.unc.edu/honkin.tgz | tar zxvf -

It does a bit better: automatic retries and such. It also
handles timestamp checks (checking the timestamp of a remote
file is quite a cheap operation: ask the remote www server
for HEAD and it will return just the headers, or ask the
remote ftp server with MDTM). Or use the If-Modified-Since
feature of HTTP/1.1.

Wget does all of this, and supports proxies, but it doesn't
manage any cache by itself. So, if you really want to write
the wintern utility, I recommend you start with the Perl LWP
module. It is quite flexible and powerful, and the desired
functionality could be written in a couple of hundred lines.
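
The conditional fetch is just one extra request header; if the
cached copy is still current, the server answers with no body
at all:

        GET /People/Connolly/ HTTP/1.1
        Host: www.w3.org
        If-Modified-Since: Thu, 20 May 1999 12:00:00 GMT

        HTTP/1.1 304 Not Modified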

Hope to see your code soon
                               Victor.
--
--------------------------------------------------------

I don't answer questions by private E-Mail from this address.
