lex+yacc parser for HTTP

Post by KJK::Hyperio » Fri, 27 Jun 2003 17:33:54



Hi all. I need to write an HTTP server for a school assignment, and
decided to structure it in the most trivial way possible, i.e. forking
on client connections and binding the child process's stdin and stdout
to the connection. The part that is giving me the most problems is the
request parser, though. I decided to use lex+yacc because I hate
hand-writing parsers, but HTTP requests seem designed in the most
lex-unfriendly way possible. Here's what I came up with (doubts and
open issues follow):

///// lexer
%{
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include "monkeyd.h"
#include "y.tab.h"

%}

%x BEFORE_URI
%x MAY_VERSION
%x BEFORE_VERSION
%x BEFORE_HEADER_NAME
%x BEFORE_HEADER_VALUE

WS    [\x20\t]+
CRLF  (\r\n|\n|\r)
TOKEN [^\x80-\xFF\x7F\x00-\x1F\x20\x22\r\n\t]+
DIGIT [0-9]

%%

{CRLF} { ; }

{TOKEN}/{WS} {
 BEGIN BEFORE_URI;

 yylval.token = malloc(sizeof(struct token));
 yylval.token->well_known = 0;
 yylval.token->text = strdup(yytext);

 return METHOD;

}

<BEFORE_URI>{TOKEN}/{WS} {
 BEGIN MAY_VERSION;

 yylval.literal = strdup(yytext);

 return URI;

}

<MAY_VERSION>{CRLF}  BEGIN BEFORE_HEADER_NAME;
<MAY_VERSION>"HTTP/" BEGIN BEFORE_VERSION;

<BEFORE_VERSION>{DIGIT}+\.{DIGIT}+/{CRLF} {
 char * endp;

 BEGIN BEFORE_HEADER_NAME;

 yylval.version = malloc(sizeof(struct version));

 yylval.version->major = strtol(yytext, &endp, 10);

 assert(endp[0] == '.');

 /* skip past the '.' -- strtol on endp itself would just parse 0 */
 yylval.version->minor = strtol(endp + 1, NULL, 10);

 return VERSION;
}

<BEFORE_HEADER_NAME>{TOKEN}/:{WS} {
 BEGIN BEFORE_HEADER_VALUE;

 yylval.token = malloc(sizeof(struct token));
 yylval.token->well_known = 0;
 yylval.token->text = strdup(yytext);

 return HEADER_NAME;

}

<BEFORE_HEADER_VALUE>.*/{CRLF} {
 BEGIN BEFORE_HEADER_NAME;

 /* the parser declares HEADER_VALUE as <literal>, so set yylval */
 yylval.literal = strdup(yytext);

 return HEADER_VALUE;
}

%%

int yywrap(void) { return 1; }

///// parser
%{
#include <stddef.h>
#include <stdio.h>   /* for puts() and fprintf() in the actions below */
#include <stdlib.h>
#include "monkeyd.h"

extern void yyerror(char *);

%}

%union
{
 struct token * token;
 char * literal;
 struct version * version;

};

%token <token>   METHOD
%token <literal> URI
%token <version> VERSION
%token <token>   HEADER_NAME
%token <literal> HEADER_VALUE

%%

request_meta:
   request                    { puts("request"); }
 ;

request:
   simple_request             { puts("simple_request"); }
 | full_request               { puts("full_request"); }
 ;

simple_request:
   METHOD URI                 { puts("METHOD URI"); }
 ;

full_request:
   METHOD URI VERSION headers { puts("METHOD URI VERSION headers"); }
 ;

headers:
   headers header             { puts("headers header"); }
 |                            { puts("<NULL>"); }
 ;

header:
   HEADER_NAME HEADER_VALUE   { puts("HEADER_NAME HEADER_VALUE"); }
 ;

%%

void yyerror(char *s) { fprintf(stderr, "%s\n", s); }

extern int yy_flex_debug;

int main()
{
 yy_flex_debug = 1;
 yyparse();
 return 0;

}

/////
The parser came out really simple, but this necessarily complicates the
lexer, as in HTTP lots of different tokens are matched by the same
pattern (BTW, yes, I'm aware that the pattern for the URI is wrong. I
will correct it later). The current lexer has a lot of problems:
 - it accepts anything. The tracing shows a default rule that matches
   anything I didn't specify. I want to override this default rule to
   return a 400 Bad Request to the client, but haven't found how to do
   so
 - the trailing context isn't removed from the input, and I need a way
   to remove it, or I'll have to double the number of states
 - lex complains about some of the trailing contexts I specified:

"monkeyd.l", line 76: warning, dangerous trailing context
"monkeyd.l", line 66: warning, dangerous trailing context
"monkeyd.l", line 76: warning, dangerous trailing context

   line 66 and 76 are:

66: <BEFORE_HEADER_NAME>{TOKEN}/:{WS} {
76: <BEFORE_HEADER_VALUE>.*/{CRLF} {

   well, I see nothing "dangerous" about them
 - I need a way to stop scanning once the end of the request header is
   reached. I think returning EOF as the token will do, but I'm not so
   sure
Well, what do you think? I have already tried writing yylex() by hand,
but, like I said, I hate to hand-write text scanners
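[In case it helps later readers: two of the open issues above have fairly standard flex answers. A sketch only, assuming flex; BAD_REQUEST is a made-up token name, not part of the grammar above. An explicit `.` catch-all rule (placed last, so every real rule wins on a match) overrides flex's default echo-and-accept rule, and matching the whitespace outright instead of using trailing context avoids the rescan, at the cost of trimming it from yytext by hand:]

///// sketch
/* catch-all: anything no other rule matched becomes an error token
   that the parser can map to a 400 Bad Request */
. { return BAD_REQUEST; }

/* consume the whitespace instead of leaving it as trailing context,
   then trim it off yytext */
{TOKEN}{WS} {
 while (yyleng > 0 &&
        (yytext[yyleng - 1] == ' ' || yytext[yyleng - 1] == '\t'))
  yytext[--yyleng] = '\0';
 /* ... same action as before ... */
}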

 
 
 

Post by Marc Rochkin » Fri, 27 Jun 2003 22:42:28



> Hi all. I need to write a HTTP server for a school assignment, and

[snip]

Using yacc is OK, if you want to, but I wouldn't use lex for this job (or
for any others, but that's just me...). Write the lexer by hand. It will be
easier than trying to get lex to do what you want.

And... don't try to do too much in the lexer. Move as much of the smarts
into the parser as you can.

--Marc

 
 
 

Post by bd » Sat, 28 Jun 2003 10:23:09



> Hi all. I need to write a HTTP server for a school assignment, and
> decided to structure it in the most trivial way possible, i.e. forking
> on client connections and binding the child process's stdin and stdout
> to the connection. The part that is giving me the most problems is the
> request parser, though. I decided to use lex+yacc because I hate to
> hand-write parsers, but HTTP requests seem designed in the most lex-
> unfriendly way possible. Here's what I came up with (doubts and open
> issues follow):
[snip]
> Well, what do you think? I have already tried writing yylex() by hand,
> but, like I said, I hate to hand-write text scanners

HTTP was designed to not need yacc or lex - you can probably parse it with
strtok(). For a minimal server, all you have to do is split the request
line by spaces (e.g., into 'GET', '/', and 'HTTP/1.0').

--
Freenet distribution not available
Interfere?  Of course we should interfere!  Always do what you're
best at, that's what I say.
                -- Doctor Who

 
 
 

Post by Haran Shivana » Sat, 28 Jun 2003 11:47:13


module "Marc Rochkind" core dumped with the following output:


>> Hi all. I need to write a HTTP server for a school assignment, and

> [snip]

> Using yacc is OK, if you want to, but I wouldn't use lex for this job (or
> for any others, but that's just me...). Write the lexer by hand. It will
> be easier than trying to get lex to do what you want.

I always felt the same way as well, but I figured I was in the minority.

It's easy enough to roll your own tokenizer that it's not worth the massive
code/data generated by lex.

--
Flying is the art of throwing yourself at
the ground and missing.
        - The Hitchhiker's Guide to the Galaxy

 
 
 

Post by KJK::Hyperio » Tue, 01 Jul 2003 07:34:59


> Using yacc is OK, if you want to, but I wouldn't use lex for this
> job (or for any others, but that's just me...). Write the lexer by
> hand. It will be easier than trying to get lex to do what you want.

Thanks. Meanwhile, I already figured that out by myself, and I'm
writing one as we speak. But I don't know how the lexer is supposed to
communicate with the parser. What should yylex() return to signal EOF,
lexical error, etc.?

> And... don't try to do too much in the lexer. Move as much of the
> smarts into the parser as you can.

well, if I detect that the request is so badly malformed that I can't
even tokenize it, I have to do something
 
 
 

Post by Marc Rochkin » Tue, 01 Jul 2003 13:21:27



>> Using yacc is OK, if you want to, but I wouldn't use lex for this
>> job (or for any others, but that's just me...). Write the lexer by
>> hand. It will be easier than trying to get lex to do what you want.

> thanks. Meanwhile, I already figured that out by myself, and I'm writing
> one as I speak. But I don't know how the lexer is supposed to communicate
> with the parser. What should yylex() return to signal EOF, lexical error,
> etc.?

>> And... don't try to do too much in the lexer. Move as much of the
>> smarts into the parser as you can.

> well, if I detect that the request is so badly malformed that I can't
> even tokenize it, I have to do something

I haven't looked at the yacc documentation in years, but my recollection is
that the interface to the lexer was very clearly set out.

--Marc

 
 
 

Post by Clint Olse » Wed, 02 Jul 2003 01:57:13



> I always felt the same way as well, but I figured I was in the minority.

> It's easy enough to roll your own tokenizer that it's not worth the
> massive code/data generated by lex.

I agree about lex, but what might be better is to use a more efficient and
lightweight tool like re2c:

http://packages.debian.org/unstable/devel/re2c.html

-Clint

 
 
 

Post by Chuck Dillo » Wed, 02 Jul 2003 03:40:23




>> Hi all. I need to write a HTTP server for a school assignment, and

> [snip]

> Using yacc is OK, if you want to, but I wouldn't use lex for this job
> (or for any others, but that's just me...). Write the lexer by hand. It
> will be easier than trying to get lex to do what you want.

> And... don't try to do too much in the lexer. Move as much of the smarts
> into the parser as you can.

> --Marc

Lest future archive miners read this thread and deduce that lex is in
all cases a bad choice, I have to disagree.  I agree that lex code is
ugly but you don't have to look at it.  I agree that it may be bloated
in comparison to custom code but that's a cost of generalization.

My experience is that lex is an easy-to-use, powerful tool with which you
can quickly build sophisticated and efficient tokenizers that are easy to
maintain as lex code.  Writing, testing and maintaining
(many more lines of) custom code for a non-trivial application is a
much bigger task than applying what you can assume to be a valid lexer
engine against your test case data.  Not to mention that the custom
code is likely to be much slower than a lex-based solution, since the
typical developer will tend to apply sscanf and/or the regexp functions.

It seems to me that the biggest problem for beginners with lex is
determining where the lexer logic should end and the parser logic
begin.  If you try to make the lexer logic too complicated you will run
into limitations and become frustrated.  That's not a lex problem;
that's misapplication of a tool.

IMHO, for anyone doing non-trivial parsers it's well worth their time
to learn to use lex (or flex) or a similar tool.

-- ced

--
Chuck Dillon
Senior Software Engineer
NimbleGen Systems Inc.

 
 
 

Post by Haran Shivana » Thu, 03 Jul 2003 21:22:38


module "Chuck Dillon" core dumped with the following output:


>>> Hi all. I need to write a HTTP server for a school assignment, and

>> [snip]

>> Using yacc is OK, if you want to, but I wouldn't use lex for this job
>> (or for any others, but that's just me...). Write the lexer by hand. It
>> will be easier than trying to get lex to do what you want.

>> And... don't try to do too much in the lexer. Move as much of the smarts
>> into the parser as you can.

>> --Marc

> Lest future archive miners read this thread and deduce that lex is in
> all cases a bad choice, I have to disagree.  I agree that lex code is
> ugly but you don't have to look at it.  I agree that it may be bloated
> in comparison to custom code but that's a cost of generalization.

> My experience is that lex is an easy to use and powerful tokenizer that
> can be used to quickly build sophisticated and efficient tokenizers that
> are easy to maintain as lex code.

True, but in some cases you don't need a powerful tokenizer. Tokenizers are
usually quite simple to build using a simple recursion/state-machine based
model, without having to generate mountains of tables. Lex is powerful, but
its power comes at a price. For complex lexical scanning it is ideal; for
simpler tasks you can probably build your own.

> Writing, testing and maintaining
> (many more lines of) custom code for a non-trivial application is a
> much bigger task than applying what you can assume to be a valid lexer
> engine against your test case data.  Not to mention that the custom
> code is likely to be much slower than a lex based solution

I could very well be wrong, but I always thought (based on similar
conversations elsewhere) that program-generated scanners (and parsers)
are more powerful but usually *less* efficient than hand-written ones.

> since the
> typical developer will tend to apply sscanf and/or the regexp functions.

Scanners can be easily implemented by reading in one character at a time
and maintaining state information. Although I suppose less experienced
developers would immediately go for sscanf, finding its flexibility
irresistible.


> IMHO, for anyone doing non-trivial parsers it's well worth their time
> to learn to use lex (or flex) or a similar tool.

I agree, it's important to learn both. But I'd say it's also just as
important (if not more so) to know how to efficiently roll your own
scanning functions. The techniques you learn in doing so may be applicable
elsewhere.

--
I've found that you don't need to wear a neck-tie if you can hit
- Ted Williams

 
 
 

Post by Haran Shivana » Thu, 03 Jul 2003 21:22:40


module "KJK::Hyperion" core dumped with the following output:
>> Using yacc is OK, if you want to, but I wouldn't use lex for this
>> job (or for any others, but that's just me...). Write the lexer by
>> hand. It will be easier than trying to get lex to do what you want.

> thanks. Meanwhile, I already figured that out by myself, and I'm
> writing one as I speak. But I don't know how the lexer is supposed to
> communicate with the parser. What should yylex() return to signal EOF,
> lexical error, etc.?

I'm not sure off the top of my head, but if you have access to a college
library, you should probably try to get a copy of the Dragon Book,
"Compilers: Principles, Techniques, and Tools".
It explains this stuff well.

> well, if I detect that the request is so badly malformed that I can't
> even tokenize it, I have to do something

Return an error token which immediately stops the scanning process.
This would be the simplest (though not necessarily most sophisticated)
solution.
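[Concretely, assuming a lex-generated scanner, and with BAD_OCTET as a hypothetical token name: a lowest-priority catch-all rule can return a token code that appears in no grammar rule, so yyparse() reports a syntax error and returns nonzero:]

///// sketch
/* placed last, so every real rule wins on a match */
. { return BAD_OCTET; }   /* no production uses BAD_OCTET, so the
                             parser stops with a syntax error */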

--
Human beings, who are almost unique in having the ability to
learn from the experience of others, are also remarkable for
their apparent disinclination to do so.
        - Douglas Adams, Last Chance to See