challenge - sed and/or awk

Post by LC's No-Spam Newsreading accoun » Fri, 27 Mar 1998 04:00:00



I have a publication list retrieved from the net with lynx -dump. Each entry
is spread over several lines, like this:

  2. astro-ph/9803038 [[8]abs, [9]src, [10]ps, [11]other] :

          Title: Entropy-regularized Maximum-Likelihood cluster mass
          reconstruction
          Authors: [12]Stella Seitz (1,2), [13]Peter Schneider (2),
          [14]Matthias Bartelmann (2) ((1) Universitaetssternwarte
          Muenchen, (2) MPI fuer Astrophysik, Garching, Germany)
          Comments: 19 pages including 7 postscript figures; submitted to
          Astronomy and Astrophysics

I want to turn each such entry into a SINGLE LINE with four tab-separated
fields (the number, the authors without any [n] references or (xxxx)
affiliations, the title, and the rest), so that I can later merge and sort
publication lists obtained with different search criteria (authors) and
eliminate duplicates, i.e. something like this:

astro-ph/9803038 TAB Title: Entropy-regularized Maximum-Likelihood cluster
mass reconstruction TAB Authors: Stella Seitz, Peter Schneider, Matthias
Bartelmann TAB Comments: 19 pages including 7 postscript figures; submitted to
Astronomy and Astrophysics

This is meant as a challenge for you; I already have a solution.

I managed to do the first part (getting rid of unwanted stuff and joining
lines) with an awk script.
I then tried desperately to get rid of all the [n] and (xxx) fields (note that
there are also nested parenthesized fields of the form ((XX) xxxx)) and of
multiple blanks using sed, but did not succeed, so I wrote a little Fortran
program to do that.
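
For comparison, here is a minimal sketch of how both parts might be done in awk
alone. It is not the awk/Fortran solution mentioned above. The function name
flatten_entries is made up for illustration, the field order follows the sample
line, and the script assumes that every entry starts with a line like
"2. astro-ph/9803038 ..." and that Title:, Authors: and Comments: are the only
labels in the dump. The nested parentheses are handled by deleting innermost
groups repeatedly until none remain.

```sh
# flatten_entries: hypothetical helper; reads a lynx dump on stdin and
# writes one tab-separated line per entry on stdout.
flatten_entries() {
    awk '
    # Collapse runs of blanks and trim both ends.
    function squeeze(s) {
        gsub(/[ \t]+/, " ", s)
        sub(/^ /, "", s)
        sub(/ $/, "", s)
        return s
    }
    # Drop [n] link markers, then delete parenthesised groups from the
    # innermost outwards until none are left (this copes with nesting).
    function strip_refs(s) {
        gsub(/\[[0-9]+\]/, "", s)
        while (gsub(/\([^()]*\)/, "", s) > 0) { }
        s = squeeze(s)
        gsub(/ ,/, ",", s)
        sub(/,$/, "", s)
        return s
    }
    # Print the entry collected so far as four tab-separated fields.
    function emit() {
        if (id != "")
            printf "%s\t%s\t%s\t%s\n", id, squeeze(title),
                   strip_refs(authors), squeeze(rest)
        id = title = authors = rest = cur = ""
    }
    # Entry header, e.g. "2. astro-ph/9803038 [[8]abs, ...] :"
    $1 ~ /^[0-9]+\.$/ && $2 ~ /^[a-z-]+\/[0-9]+$/ { emit(); id = $2; next }
    # Skip blank lines and anything before the first entry.
    id == "" || NF == 0 { next }
    # A label switches the current field; other lines are continuations.
    $1 == "Title:"    { cur = "title" }
    $1 == "Authors:"  { cur = "authors" }
    $1 == "Comments:" { cur = "rest" }
    cur == "title"    { title   = title   " " $0 }
    cur == "authors"  { authors = authors " " $0 }
    cur == "rest"     { rest    = rest    " " $0 }
    END { emit() }
    '
}
```

Several dumps could then be merged with something like
"cat list1.txt list2.txt | flatten_entries | sort -u" (the file names are
placeholders). The same innermost-first loop can also be written in sed as
sed -e :a -e 's/([^()]*)//g' -e ta, which repeats the substitution for as long
as it still removes something.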
