Tokenizer

Post by Chris » Sun, 13 Feb 2000 04:00:00



Hi all....

Hopefully someone can help me here.
I need to break up a long sentence into smaller chunks (e.g. single words, or
strings enclosed by "    ").

E.g. I have the following sentence in a file called "in.dat":

##   In the Microeconomic we study the overall or "aggregate performance" of
an "economy structure"        ##

and I want to break it down to:

In
the
Microeconomic
we
study
the
overall
or
"aggregate performance"
of
an
"economy structure"

The problem I am having is that I am not able to keep a string that is
enclosed in quotes (") as a whole word.
For example, this is the output of the shell script that I wrote to do the task:
In
the
Microeconomic
we
study
the
overall
or
"aggregate
performance"
of
an
"economy
structure"

In my script, I am unable to keep "aggregate performance" and "economy
structure" from being chopped.

Can someone help?

thanks in advance

Chris

 
 
 

Tokenizer

Post by Hugo van der Sande » Sun, 13 Feb 2000 04:00:00



> Hi all....

> Hopefully someone can help me here..
> I am in the following situation
> I need to break up a long sentence to smaller chunk ( eg as a word or string
> closed by "    ")

I'd do this with perl:

echo 'my "somewhat quoted" string' | perl -nle 'print $& while /".*?"|\S+/g'

Hugo
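
For readers without Perl, the same regular-expression idea can be sketched in Python. This is only an illustration and not part of Hugo's post; the name tokenize is made up for the example. Like the one-liner above, it looks at one line at a time, so a quoted string that wraps onto the next line would not be kept together.

```python
import re

# A token is either a double-quoted run (non-greedy) or a run of
# non-whitespace characters, tried in that order.
TOKEN = re.compile(r'".*?"|\S+')


def tokenize(line):
    """Return the tokens of one line, keeping "quoted phrases" whole."""
    return TOKEN.findall(line)


if __name__ == "__main__":
    for token in tokenize('my "somewhat quoted" string'):
        print(token)
```

The quoted alternative has to come first: otherwise \S+ would consume the opening quote together with the first word.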

 
 
 

Tokenizer

Post by Paul Dlu » Sun, 13 Feb 2000 04:00:00


Perl:


Python:
test = "this is a really weird test string"

# split on single spaces and print one word per line
for word in test.split(" "):
    print(word)

C:
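#include <stdio.h>
#include <string.h>

int main(void)
{
    char test[] = "this is a really weird test string";
    char *word;

    /* Fetch the first token in the loop initialiser so that it is printed too. */
    for (word = strtok(test, " "); word != NULL; word = strtok(NULL, " ")) {
        printf("%s\n", word);
    }
    return 0;
}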

(yes I know strsep is new and better but I did this quickly and strtok I
remember how to use :) )

Hope this helps




 
 
 

Tokenizer

Post by Eddie Cor » Tue, 15 Feb 2000 04:00:00




Here's a solution using lex that is a) small and neat and b) copes with quoted strings that wrap.

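%{
#include <stdio.h>
void unf(void);   /* turns newlines inside a quoted string into spaces */
%}
%%
\"[^"]*\"    { unf(); printf("%s\n", yytext); }
[^ \t\n]+    { printf("%s\n", yytext); }
.|\n         { /* skip whitespace between tokens */ }
%%
void unf(void) {
  char *p = yytext;
  while (*p != '\0') {
    if (*p == '\n') *p = ' ';
    p++;
  }
}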

Run lex <file>,
then cc lex.yy.c -ll   (read man lex if that doesn't work)

(oops, I meant to say wraps onto the next line but somehow I've gotten into vi and I can't fix it eek)
:
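
The lex approach (match a quoted string even across a line break, then fold the newline into a space) can also be sketched in Python. This is an illustrative equivalent under my own assumptions, not Eddie's code; tokenize_text is a hypothetical name. As with the lex version, the "##" markers in in.dat would come out as ordinary tokens.

```python
import re

# A quoted string may span lines ([^"] also matches newlines); anything
# else is a run of non-whitespace characters.
TOKEN = re.compile(r'"[^"]*"|\S+')


def tokenize_text(text):
    """Split text into words, keeping quoted strings whole even if they wrap."""
    tokens = []
    for match in TOKEN.finditer(text):
        token = match.group(0)
        if token.startswith('"'):
            # Same job as unf(): fold embedded newlines into spaces.
            token = token.replace("\n", " ")
        tokens.append(token)
    return tokens


if __name__ == "__main__":
    sample = 'the overall or "aggregate\nperformance" of\nan "economy structure"'
    print("\n".join(tokenize_text(sample)))
```

The whole file has to be read into one string first (for example with open("in.dat").read()), because a line-by-line loop would never see both quotes of a wrapped string together.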

 
 
 

Tokenizer

Post by brian hile » Wed, 16 Feb 2000 04:00:00



> ##   In the Microeconomic we study the overall or "aggregate performance" of
> an "economy structure"        ##

Perl? Tcl? lex?!

Wow! Using a blowtorch to light a stove! Plain ol' ksh does this jus'
fine an' dandy, I reckon.

eval "set -- $(< in.dat)"  # one caveat: no "##"s in in.dat
for a                           # test...
do      print -r -- "$a"
done

-Brian

 
 
 

Tokenizer

Post by Richard Evitt » Wed, 16 Feb 2000 04:00:00


Chris,
Not quite the answers you were looking for to solve question 2, are
they?
There doesn't seem to be a very clean solution to this double quote
problem using the BOURNE SHELL.

BTW, how did you eliminate those extra spaces when you exchanged NICE
with ""?

regards,
Rich



 
 
 

Tokenizer

Post by Andrew Gabri » Wed, 16 Feb 2000 04:00:00




> Chris,
> Not quite the answers you were looking for to solve question 2, are
> they?
> There doesn't seem to be a very clean solution to this double quote
> problem using the BOURNE SHELL.

Well, this is pretty much how the shell itself tokenises, so why not
use the shell's own tokenizer...

-----------------------------8<------------------------------
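#!/bin/sh
# Blank out characters the shell would treat specially (comments, wildcards),
# then let eval/set do the quote-aware word splitting.
sed -e 's/[][#?\*]/ /g' | while read x
do
  eval "set -- $x"
  while [ $# -gt 0 ]
  do
    echo "$1"
    shift
  done
done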
-----------------------------8<------------------------------

The sed is just to strip out characters which the shell tokenizer
would treat specially (comments, wild carding) - I might not have
caught everything though.

--
Andrew Gabriel
Consultant Software Engineer
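
The "borrow the shell's tokenizer" idea also has a counterpart in Python's standard library: the shlex module splits a string using shell-like quoting rules without running eval, so nothing in the input gets executed. The sketch below is an illustration only; shell_like_tokens is a made-up name, and posix=False is chosen here because it leaves the quote characters on the token, which matches the output Chris asked for.

```python
import shlex


def shell_like_tokens(line):
    """Tokenize with shell-style quoting rules; posix=False keeps the quotes."""
    return shlex.split(line, posix=False)


if __name__ == "__main__":
    line = 'study the "aggregate performance" of an "economy structure"'
    for token in shell_like_tokens(line):
        print(token)
```

With the default posix=True the quotes would be stripped from the tokens, which is closer to what the eval/set script prints.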

 
 
 

1. tokenizer.cc in kdevelop 2.1.5 gives build error

Hello,

I've searched all over for a fix to this error, but can't find anything
relating to kdevelop 2.1.5.  Here is the build error isolated:

tokenizer.cc: In member function `virtual int yyFlexLexer::yylex()':
tokenizer.cc:1523: `yywrap' undeclared (first use this function)
tokenizer.cc:1523: (Each undeclared identifier is reported only once for each
   function it appears in.)
make[3]: *** [tokenizer.o] Error 1

And here is the rest of the output for the classparser directory:

Making all in classparser
make[3]: Entering directory `/downloads/src/kde/konstruct/apps/kdevelop/work/kdevelop-2.1.5_for_KDE_3.1/kdevelop/classparser'
source='tokenizer.cc' object='tokenizer.o' libtool=no \
depfile='.deps/tokenizer.Po' tmpdepfile='.deps/tokenizer.TPo' \
depmode=gcc3 /bin/sh ../../admin/depcomp \
g++ -DHAVE_CONFIG_H -I. -I. -I../.. -I/root/kde3.1.2/include -I/usr/X11R6/include   -DQT_THREAD_SUPPORT  -D_REENTRANT  -Wnon-virtual-dtor -Wno-long-long -Wundef -Wall -pedantic -W -Wpointer-arith -Wmissing-prototypes -Wwrite-strings -ansi -D_XOPEN_SOURCE=500 -D_BSD_SOURCE -Wcast-align -Wconversion -O2 -fno-exceptions -fno-check-new -ftemplate-depth-99  -c -o tokenizer.o `test -f 'tokenizer.cc' || echo './'`tokenizer.cc
tokenizer.cc: In member function `virtual int yyFlexLexer::yylex()':
tokenizer.cc:1523: `yywrap' undeclared (first use this function)
tokenizer.cc:1523: (Each undeclared identifier is reported only once for each
   function it appears in.)
make[3]: *** [tokenizer.o] Error 1
make[3]: Leaving directory `/downloads/src/kde/konstruct/apps/kdevelop/work/kdevelop-2.1.5_for_KDE_3.1/kdevelop/classparser'
make[2]: *** [all-recursive] Error 1
make[2]: Leaving directory `/downloads/src/kde/konstruct/apps/kdevelop/work/kdevelop-2.1.5_for_KDE_3.1/kdevelop'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/downloads/src/kde/konstruct/apps/kdevelop/work/kdevelop-2.1.5_for_KDE_3.1'
make: *** [all] Error 2

/downloads/src/kde/konstruct/apps/kdevelop/work/kdevelop-2.1.5_for_KDE_3.1]#

Happens with both gcc-3.2.2 and gcc-2.95.3.  Does anyone have any
suggestions?

Thanks,
zoltan

2. top under ultrix -- found

3. no-break-space handling in shell tokenizer

4. Virtual domain with CERN www server on Solaris