Finding first 'header' of email/news

Post by Thomas Michanek - Michag » Mon, 18 Jul 1994 22:53:16



        I want a way to find the _first_ header of an email/News
        article, but not using C or perl. Here's the scenario:

I have a text file with concatenated email and/or News articles. Each
article has a number of headers that probably vary from one article to
another, both in which headers are present and in what order they
appear. No assumptions can be made about which email or News readers
wrote to the file. It could look something like this:

    [...]
  Path: ...
  Newsgroups: ...
  Subject: ...
    [more headers]
    [News article]
  From ...
  Received: ...
  Subject: ...
    [more headers]
    [email article]

Suppose _one_ of these headers is found in _every_ article, e.g. 'Subject:'.
This is the _only_ header that is supposed to exist in every article.
I want to find the "start" of each article, i.e. the first header that
separates the article from the previous one. There's probably an empty
line separating the articles also, but I don't think that's guaranteed.
Note that the 'Subject:' header itself may be the first header!

I want a solution that finds the line numbers of the first header of
each article containing a 'Subject:' header. Alternatively, it may find
simply the header lines themselves without the line number. As input,
you could have either the line numbers of the existing 'Subject:' headers,
or the complete file in which case you'll have to find the 'Subject:'
headers first.

The solution should use awk, sed, grep or other "standard" UNIX commands
and should be written in a csh-compatible syntax (no flames please!).
I'm _not_ looking for solutions written in C, perl or other languages.

Thanks in advance!
--
   _______   __  __ _     _                       ,------- Michagon -------.
  / _____|\ |  \/  (_)___| |__ ___ __ _ ___ _ __  |   (Thomas Michanek)    |
 / <|___  \|| .  . | / __) '. `-_ / _` / _ \ '. | |  Trumslagaregatan 118  |
|\_____|> / |_|`'|_|_\___)_||_(_,_\__, \___/_||_| |S-58346 Linkoping SWEDEN|
 \|______/  |_____________________|___/_________| |+46 13 273727(voice/fax)|

 
 
 

Post by Eric Fisch » Fri, 22 Jul 1994 11:28:57



>    I want a way to find the _first_ header of an email/News
>    article, but not using C or perl. Here's the scenario:
>I want a solution that finds the line numbers of the first header of
>each article containing a 'Subject:' header.
>The solution should use awk, sed, grep or other "standard" UNIX commands
>and should be written in a csh-compatible syntax (no flames please!).
>I'm _not_ looking for solutions written in C, perl or other languages.

well, you're awfully picky in your requirements, but if ex(1) isn't
too arcane, you can do something like

echo '1i

.
g/^Subject: /?^$?+1#
q!' | ex $filename | awk '{print $1 - 1}'

which will insert a blank line at the start of the mail file, then
search for all the subject lines, at each instance going back to the
previous blank line (which is why you need one at the start), then
printing the next line and its number.  Ex then exits without saving
changes (q!), then awk gets each of the line numbers and subtracts
one from it.
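
As a rough illustration of the same blank-line logic without the editor, here is a minimal awk sketch. It is not the ex script above, only an approximation of it: it assumes every article is preceded by an empty line (or starts the file), and the function name `starts_after_blank` is made up for this example.

```sh
# starts_after_blank FILE...: for each "Subject: " line, print the number of
# the line that follows the most recent empty line (hypothetical helper).
starts_after_blank() {
    awk '
    BEGIN { start = 1 }             # plays the role of the blank line inserted at the top
    /^$/ { start = NR + 1; next }   # empty line: the next line may begin an article
    /^Subject: / { print start }    # report where the current header block began
    ' "$@"
}
```

Like the ex version, this reports a wrong line if an article is not preceded by an empty line, and it fires on any body line that happens to begin with "Subject: ". The function wrapper is Bourne-shell syntax; from csh the awk program could be saved in a file and run with `awk -f`.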

Much better would have been just to save in mbox format, but I guess
it's too late for that.

Insert backslashes before the newlines if you're using csh, because
it'll yell at you otherwise.

eric
who just learned how to do this today...


 
 
 

Post by Thomas Michanek - Michag » Fri, 22 Jul 1994 21:04:24



 > >      I want a way to find the _first_ header of an email/News
 > >      article, but not using C or perl. Here's the scenario:
 >
 > >I want a solution that finds the line numbers of the first header of
 > >each article containing a 'Subject:' header.
 >
 > well, you're awfully picky in your requirements, but if ex(1) isn't
 > too arcane, you can do something like
 >
 > echo '1i
 >
 > .
 > g/^Subject: /?^$?+1#
 > q!' | ex $filename | awk '{print $1 - 1}'

Thanks! I didn't think of using ex. It does work, but there are a few
problems: Both 'ex' and 'ed' create a temporary copy of the file to process.
If the file is very large (several MB) it takes a lot of time just to create
the temporary file before the actual processing starts. Also, my 'ex' always
creates the temp. file in /var/tmp, which sometimes has too little space,
and not in /tmp, as the man page says. However, if I use 'ed' instead, I
can set the $TMPDIR variable, which 'ex' doesn't understand... :-(

The '#' command prints both the line number and the line itself, which is
OK, but I can't find the command in the man pages! 'ed' doesn't understand
it, but it has '=' instead. However, 'ed' can only handle line numbers
<32768 and then starts counting from 1 again! I haven't been able to test
whether 'ex' behaves in the same way.

Eric, if you know of any workarounds to the above, please let me know.
It is by far the best solution I've seen so far.

 > Much better would have been just to save in mbox format, but I guess
 > it's too late for that.

Yes, and I want to run it on any combined email/News file, regardless of
which email or News reader created the file. Picky, huh?

 
 
 

Post by Roman Czyborr » Sun, 24 Jul 1994 17:53:22


> I have a text file with concatenated email and/or News articles.
> 'Subject:' is the _only_ header that is supposed to exist in every
> article.  I want to find the "start" of each article. The solution
> should use awk, sed, grep or other "standard" UNIX commands

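Let AWK memorize the text with {T[NR]=$0}, and when it finds a
/^Subject:/ line, back up with for (l=NR;T[l]~/^[--z]+:|^[ \t]|^From /;--l).
The loop stops on the line just before the header block, so the first
header is line l+1. Delete the used T[] entries off and on to recycle
your memory.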
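
A minimal sketch of that idea, for illustration only. The function name `first_headers`, the helper `is_header`, and the exact header pattern are assumptions of this example (the pattern is a simplified stand-in for the regex quoted above). User-defined functions need nawk or a POSIX awk; some older awks lack them.

```sh
# first_headers FILE...: print "LINE: TEXT" for the first header of every
# article that contains a Subject: header (hypothetical helper name).
first_headers() {
    awk '
    # Assumed definition of a header-block line: "Name: value", a folded
    # continuation line (leading space or tab), or an mbox-style "From " line.
    function is_header(s) {
        return s ~ /^[A-Za-z0-9-]+:/ || s ~ /^[ \t]/ || s ~ /^From /
    }
    BEGIN { low = 1 }                  # lowest line number still buffered
    { T[NR] = $0 }                     # remember every line by number
    /^Subject:/ {
        # Walk back while the buffered lines still look like headers.
        for (l = NR; l >= low && is_header(T[l]); --l)
            ;
        first = l + 1                  # l is the line just before the header block
        print first ": " T[first]
    }
    !is_header($0) {
        # A non-header line ends any header run, so the buffer can be dropped.
        for (i = low; i <= NR; i++)
            delete T[i]
        low = NR + 1
    }
    ' "$@"
}
```

Because any non-header line ends the backward scan, this does not depend on an empty line between articles. It can still be fooled by an indented body line directly above a header block, or by a body line that begins with "Subject:". The wrapper is Bourne-shell syntax; from csh the awk program could be saved in a file and run with `awk -f`.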

> I'm _not_ looking for solutions written in C, perl or other languages.

Too bad, I only need to say formail -des
 
 
 
