> This message posted also to comp.unix.programmer
It's best to set a followup, that way discussion doesn't get needlessly
duplicated. I've set a followup on this message, but I don't read that
group, sorry :)
> Somewhat OT, sorry. And also somewhat vague: I'm not quite sure of the
> right question. So far, RFC2616 has been too abstract for this
> question but maybe the answer's implied in there somewhere.
> On my Linux server, I need to do stuff that a browser normally does,
> to /be/ a browser in a limited way and for a time.

Okay, so you are writing an HTTP client/HTML UA.

> From my code in, say, cgi-bin on my server I need to:
> Open a connection to some named web resource, send a GET HTTP header,
> receive the web page or error report from that web resource,
> process the HTML page I've received, and then close the HTTP
> connection.
> Is there an analog of the socket library that'll let me do all that,
> but on an HTTP connection? If so, what is the analog? If the analog is
> a bunch of C code I have to write, what docs do I have to look at? Or
> does the socket library have a protocol choice == HTTP (would be
> great)?
I'm not sure I'm following you; the socket library is at a lower level than
HTTP. You open a TCP connection to port 80 on the server with the sockets,
and read/write at will. If I remember correctly, when using BSD sockets,
you pick the AF_INET address family and SOCK_STREAM for TCP. Any decent
UNIX programming book should include a bit about sockets.
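To make that concrete, here's a minimal sketch of the sockets approach
(www.example.com is just a placeholder, the request is plain HTTP/1.0 to
keep the read loop simple, and error handling is kept to the bare minimum):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <netdb.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    const char *host = "www.example.com";   /* placeholder host */

    /* Look up the server's address. */
    struct hostent *he = gethostbyname(host);
    if (he == NULL) {
        herror("gethostbyname");
        return 1;
    }

    /* AF_INET + SOCK_STREAM gets you a TCP socket. */
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return 1;
    }

    struct sockaddr_in sa;
    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_port = htons(80);                /* HTTP */
    memcpy(&sa.sin_addr, he->h_addr_list[0], he->h_length);

    if (connect(fd, (struct sockaddr *) &sa, sizeof sa) < 0) {
        perror("connect");
        return 1;
    }

    /* HTTP/1.0, so the server closes the connection when it's done,
       which keeps the read loop below trivial. */
    char request[512];
    snprintf(request, sizeof request,
             "GET / HTTP/1.0\r\nHost: %s\r\n\r\n", host);
    write(fd, request, strlen(request));

    /* Dump the status line, headers and body to stdout. */
    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        fwrite(buf, 1, (size_t) n, stdout);

    close(fd);
    return 0;
}

From there, parsing the status line and separating headers from body is
your problem, which is exactly the drudgery the libraries mentioned below
take care of for you.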
I'm sure you are looking for code examples; the only one which springs to
mind is libwww, which was bundled with Amaya, and which I haven't looked
at. Presumably, being worked on at W3C, it's a decent implementation.
Bear in mind that most people prefer to work with higher-level languages
when dealing with higher-level protocols like HTTP. Whilst your code might
be ultra-efficient when written in C, the time spent implementing it might
be better used elsewhere.
Also, if you are using Linux, consider farming off the HTTP client bit to
wget or curl; at least for the prototype. If you are worried about the
overhead of starting a new process, I believe most of curl's features are
implemented as part of 'libcurl', which you can link into your executable.
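As a rough sketch of what the libcurl route looks like, using its easy
interface (the URL is a placeholder; build with -lcurl):

#include <stdio.h>
#include <curl/curl.h>

/* libcurl hands each chunk of the response body to this callback and
   expects the number of bytes handled back. */
static size_t write_cb(char *data, size_t size, size_t nmemb, void *userp)
{
    (void) userp;
    return fwrite(data, 1, size * nmemb, stdout);
}

int main(void)
{
    curl_global_init(CURL_GLOBAL_DEFAULT);

    CURL *curl = curl_easy_init();
    if (curl == NULL)
        return 1;

    curl_easy_setopt(curl, CURLOPT_URL, "http://www.example.com/");
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, write_cb);

    CURLcode res = curl_easy_perform(curl);
    if (res != CURLE_OK)
        fprintf(stderr, "curl: %s\n", curl_easy_strerror(res));

    curl_easy_cleanup(curl);
    curl_global_cleanup();
    return res == CURLE_OK ? 0 : 1;
}

Note how much of the previous example disappears: name resolution,
connecting, request formatting and redirects are all libcurl's problem.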
I know Perl has adequate HTML parsing routines, but in case your language of
choice does not, consider running the resulting HTML document through tidy
to convert it to XHTML, and just using an XML parser on the end product
(or, of course, the standard grep/sed/awk combinations if you aren't too
fussy about it being a conformant HTML UA).
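For example, shelling out to tidy from C might look like this (page.html
is a stand-in for whatever you fetched, and the flags are from memory, so
check your tidy's man page):

#include <stdio.h>

int main(void)
{
    /* -asxhtml: output XHTML; -numeric: numeric entities; -q: quiet.
       "page.html" is a placeholder for the document you fetched. */
    FILE *p = popen("tidy -asxhtml -numeric -q page.html", "r");
    if (p == NULL) {
        perror("popen");
        return 1;
    }

    /* In real code you'd feed this to an XML parser; here it just
       goes to stdout. */
    int c;
    while ((c = fgetc(p)) != EOF)
        putchar(c);

    return pclose(p) == 0 ? 0 : 1;
}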
--
Jim Dabell