Fast memory search routine wanted

Fast memory search routine wanted

Post by Daniel Barro » Tue, 17 Jul 2001 03:18:27



I want to do a fast case insensitive memory search for a string.  At the
moment I am using:

for(j = 0; j < l; j++) { // l = length of file
        file[j] = tolower(file[j]);  // make the document lowercase

Quote:}                                // char by char

followed by a:

res = std::search(file, fileend, p, pend);

where fileend is file+l and p is a pointer to a string and pend is a pointer
to the end of that string.

I need to run the search about 20 times on memory blocks of about 200kbytes.

On my P166 it takes a few seconds.  Is there a faster way?  Also is there a
faster way of making it lower case?

--
Daniel Barron - use [at jadeb.com] for personal replys.
(Visit http://dansguardian.org/  - True web content filtering for all)

 
 
 

Fast memory search routine wanted

Post by Paul Pluzhniko » Tue, 17 Jul 2001 11:40:49




Quote:> I want to do a fast case insensitive memory search for a string.  At the
> moment I am using:
> res = std::search(file, fileend, p, pend);

I would expect strstr() to be faster. Also, since
it is part of glibc, you could make a  case-insensitive
version, and do away with the tolower() loop.

 
 
 

Fast memory search routine wanted

Post by Daniel Barro » Tue, 17 Jul 2001 19:33:11






> > I want to do a fast case insensitive memory search for a string.  At the
> > moment I am using:
> > res = std::search(file, fileend, p, pend);

> I would expect strstr() to be faster. Also, since
> it is part of glibc, you could make a  case-insensitive
> version, and do away with the tolower() loop.

Unfortunately the block of memory /may/ contains zeros which AFAIK would
cause strstr to not search the whole block.  But I suppose I could use the
source to make my own version.

--
Daniel Barron - use [at jadeb.com] for personal replys.
(Visit http://dansguardian.org/  - True web content filtering for all)

 
 
 

Fast memory search routine wanted

Post by Florian Bachman » Wed, 18 Jul 2001 08:45:12




Quote:> On my P166 it takes a few seconds.  Is there a faster way?  Also is
> there a faster way of making it lower case?

If you know you are only looking at ASCII character codes, you could
replace the tolower() call with something like

file[j] = (file[j] >= 'A') && (file[j] <= 'Z') ?
                'a' + file[j] - 'A' :
                file[j]

or

if ((file[j] >= 'A') && (file[j] <= 'Z')) file[j] = 'a' + file[j] - 'A';

But you would probably gain more by integrating the tolower() loop and the
search loop - which of course would require you to write a searching
algorithm on your own.

--

 
 
 

Fast memory search routine wanted

Post by Bohdan Vlasyu » Wed, 18 Jul 2001 16:40:09



> I want to do a fast case insensitive memory search for a string.  At
> the moment I am using:
> for(j = 0; j < l; j++) { // l = length of file
>         file[j] = tolower(file[j]);  // make the document lowercase
> }                                // char by char
> where fileend is file+l and p is a pointer to a string and pend is a
> pointer to the end of that string.

That's apparently wrong. I'd recommend you to take a look on any CS
book, there'd be nice examples there. TAOP is very nice to read too.

as an example -- that's my _old_ :-) app written for DOS for searching
the dictionary. It took 1.5 seconds on ancient laptop (24M/Cyrix 180)
to search for "zymotic" in ~6mb file (after few launches file was
cached). There were asm() pieces in original code, however, with GCC
there's no longer any need for it -- -O2 reduces time quite nice...

[Pattern must be shorter than number of bits in "unsigned long", or
try "unsigned long long"]

[On my C400 it runs fine]

# ./a.out Mueller7GPL.koi zYmOtiC

                ------- translating -------
at offset: 5714140      

zymotic  _a. 1> <*zymotic*> 2> <*zymotic*>; zymotic diseases <*zymotic*> <*diseases*>
---------------
100.00%                
Elapsed time: 2600
all's OK

[That's something quite hard to read, but it shows the point]
[Yes, I was depressed when wrote this ;-)]
[almost no error checking, yet it should work]

#include <sys/types.h>
#include <ctype.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <math.h>
#include <string.h>
#include <time.h>

#define ERROR(x) {perror("(fatal) "x);return -1;}
const unsigned bufl=(1<<15)-5000;
unsigned char ch[bufl*2];
#define nodebug
char pattern[32];
//=#ifdef debug//"aabac""int"#else"\n"  "ail" " "//zymotic"#endif
char pl;
unsigned long r,pr=~1l;
unsigned long sss[0x100];

void binp(unsigned long v) {
        for(int i=0;i<pl+3;i++){
                printf("%c",(!(v&1))|0x30);
                v>>=1;
        };
        printf("\n");

Quote:};

int main(int argc,char*argv[]) {
        if(argc!=3) {
                printf("Not enough parameters, exiting\n");
                return -1;
        };
        int inf=open(argv[1],O_RDONLY);
        strcpy(&pattern[1],argv[2]);
        pattern[0]='\n';
        pl=strlen(pattern);
        pattern[pl++]=' ';
        pattern[pl++]=' ';
        if(inf==-1)
                ERROR("opening input file");
        unsigned long l=0,s=0,fl=lseek(inf,0,SEEK_END);
        if(fl==-1)
                ERROR("counting file's length");
        lseek(inf,0,SEEK_SET);
        unsigned int i,j;
        for(i=0;i<0x100;i++) {
                sss[i]=0l;
                for(j=0;j<pl;j++) {
                        int foo=pattern[j]!=i;                  
                        if(islower(i))
                                foo&=(pattern[j]!=toupper(i));
                        if(isupper(i))
                                foo&=(pattern[j]!=tolower(i));
                        sss[i]|=foo<<(j+1);
                }
        };
        clock_t start, end;
        start = clock();
        int print=0,ob=0;
        printf("\n\t\t------- translating -------\n");
        while((l=(unsigned int)read(inf,ch,bufl*2))>0) {
                for(i=0;i<l;i++) {
                        if(print) {
                                if(ch[i]=='[')
                                        ob=1;
                                if(!ob)
                                        printf("%c",ch[i]);
                                if(ch[i]==']')
                                        ob=0;
                                if(ch[i]=='\n') {
                                        print=0;
                                        printf("---------------\n");
                                }
                        };
//                      r=((pr<<1)|1)&sss[ch[i]];
                        r=(pr<<1)|sss[ch[i]];
#ifdef debug
                        printf("\t\t-------\n");
                        printf("r: ");binp(r);
                        printf("p: ");binp(pr);
                        printf("s: ");binp(sss[ch[i]]);
#endif
                        pr=r;
//                      if(r&(1<<pl)) {
                        if(!(r&(1<<pl))) {
                                printf("at offset: %lu\n",s+i);
                                print=1;
                                for(int j=1;j<=pl-1;j++)
                                        printf("%c",ch[i-pl+j]);
                        };

                };
                s+=l;
                printf("%5.2f%%\t\t\t\r",s*100.0/fl);
        };
        end = clock();
        printf("\nElapsed time: %i\n", (end - start) / CLK_TCK);
        if(l==0) {
                printf("all's OK\n");
                return 0;
        } else
                ERROR("reading file");

Quote:};

 
 
 

Fast memory search routine wanted

Post by Daniel Barro » Thu, 19 Jul 2001 07:11:51




[snip]

Quote:> That's apparently wrong. I'd recommend you to take a look on any CS
> book, there'd be nice examples there. TAOP is very nice to read too

CS?
[snip]

Quote:> [That's something quite hard to read, but it shows the point]

I shall try to parse it later and see if I can work out the ideas.

Thanks for the reply.

[snip]

--
Daniel Barron - use [at jadeb.com] for personal replys.
(Visit http://dansguardian.org/  - True web content filtering for all)

 
 
 

Fast memory search routine wanted

Post by Daniel Barro » Thu, 19 Jul 2001 07:03:14




[snip]

Quote:

> if ((file[j] >= 'A') && (file[j] <= 'Z')) file[j] = 'a' + file[j] - 'A';

> But you would probably gain more by integrating the tolower() loop and the
> search loop - which of course would require you to write a searching
> algorithm on your own.

Thanks for the reply.  I suspect I'll use it :)

The searching of the memory searches several times for several strings so I
only need to tolower() it once.

--
Daniel Barron - use [at jadeb.com] for personal replys.
(Visit http://dansguardian.org/  - True web content filtering for all)

 
 
 

Fast memory search routine wanted

Post by Bohdan Vlasyu » Thu, 19 Jul 2001 15:35:25



>> That's apparently wrong. I'd recommend you to take a look on any CS
>> book, there'd be nice examples there. TAOP is very nice to read too
> CS?

Computer Science. Something that [here, at least] kids learn at
school.
 
 
 

Fast memory search routine wanted

Post by Daniel Barro » Sat, 21 Jul 2001 11:46:48





> >> That's apparently wrong. I'd recommend you to take a look on any CS
> >> book, there'd be nice examples there. TAOP is very nice to read too
> > CS?
> Computer Science. Something that [here, at least] kids learn at
> school.

We call it IT - Information Technology.

My main reason for asking for an algorithm is that I just expected that my
own would be slow.  However it appears not.  I quickly knocked this up:

char* MyClass::search(char* file, char* fileend, char* phrase, char* phraseend) {
    int f, p;  // counters
    int pl = phraseend - phrase;  // phrase length
    int fl = (int)(fileend - file) - pl;  // file length
    bool same;  // flag
    for(f = 0; f < fl; f++) {  // check whole file byte by byte
        if (file[f] == phrase[0]) {  // partial match?
            same = true;
            for (p = 1; p < pl; p++) {  // check the whole phrase
                if (file[f+p] != phrase[p]) {
                    same = false;
                    break;  // better luck next time
                }
            }
            if (same == true) {
                return (file + f);  // return the location where it was found
            }
        }
    }
    return fileend;  // did not find it so return the end of file location

Quote:}

On running it with no compiler optimisation, I found it to be 2-3 times
faster than STL::search().  I am both pleased and surprised.  I am sure there
are ways of speeding it up such as converting the phrase into int blocks of
4 bytes and comparing 32-bit rather than 8-bit.  I am sure it could be
sped up or improved in other ways.  But it's so much faster already :)

I hope this is of use to people in the future who use deja google.

--
Daniel Barron - use [at jadeb.com] for personal replys.
(Visit http://dansguardian.org/  - True web content filtering for all)

 
 
 

Fast memory search routine wanted

Post by Bohdan Vlasyu » Sat, 21 Jul 2001 16:16:48



>>>> That's apparently wrong. I'd recommend you to take a look on any
>>>> CS book, there'd be nice examples there.
>>> CS?
>> Computer Science. Something that [here, at least] kids learn at
>> school.
> We call it IT - Information Technology.

Umm.. Whateever.

Quote:> My main reason for asking for an algorithm is that I just expected
> that my own would be slow.  However it appears not.  I quickly
> knocked this up:

As usual, the answer is "YMMV"... I'm afraid it would work better on
random data, however, if data looks like "AAAAAAAAAA...AAAAAAAAAAAA"
and you need to find "AAAAAAAAAB", your idea would fail.

As for my opinios -- these functions in libc are USUALLY much betteer
that your own, because they're optimised for your kind of processor,
etc.. So every time you use it, you shouldn't worry about what
computer you're running on..

Quote:>     for(f = 0; f < fl; f++) {  // check whole file byte by byte
>         if (file[f] == phrase[0]) {  // partial match?
>             same = true;
>             for (p = 1; p < pl; p++) {  // check the whole phrase
>                 if (file[f+p] != phrase[p]) {
>                     same = false;
>                     break;  // better luck next time
>                 }
>             }
>             if (same == true) {
>                 return (file + f);  // return the location where it was found
>             }
>         }
>     }
>     return fileend;  // did not find it so return the end of file location
> }
> On running it with no compiler optimisation, I found it to be 2-3
> times faster than STL::search().  I am both pleased and surprised.
> I am sure there are ways of speeding it up such as converting the
> phrase into int blocks of 4 bytes and comparing 32-bit rather than
> 8-bit.  I am sure it could be sped up or improved in other ways.
> But it's so much faster already :)

The trick I'd recommend you is to append your desired sequence at the
end of input, so that it'd be matched anyway. Thus, your
for(f=0;f<fl;f++) loop would turn into for(f=0;1;f++) loop, which is
obviously more efficient. That's perhaps most common trick in such
cases..

Also, that is not best possible solution.. Actually, it's possible to
make it to be loop to look like

while(foo)
        foo = bar[foo, file[i++]];

and that's it! only one comparision!

however, you'd need to prerape bar[] array before..

 
 
 

Fast memory search routine wanted

Post by Daniel Barro » Sat, 21 Jul 2001 21:29:43





> >>>> That's apparently wrong. I'd recommend you to take a look on any
> >>>> CS book, there'd be nice examples there.
> >>> CS?
> >> Computer Science. Something that [here, at least] kids learn at
> >> school.
> > We call it IT - Information Technology.
> Umm.. Whateever.

I was just explaining why I did not know was CS was!  I tried looking in
several of my linux programming books, but none discussed this.  I suspect I
need a general efficient algorithm book instead.

Quote:

> > My main reason for asking for an algorithm is that I just expected
> > that my own would be slow.  However it appears not.  I quickly
> > knocked this up:
> As usual, the answer is "YMMV"... I'm afraid it would work better on
> random data, however, if data looks like "AAAAAAAAAA...AAAAAAAAAAAA"
> and you need to find "AAAAAAAAAB", your idea would fail.

Yes, you're right.  I had not thought of that.  However, luckily, my data is
essentially random - it's english text.

[snip]

Quote:> The trick I'd recommend you is to append your desired sequence at the
> end of input, so that it'd be matched anyway. Thus, your
> for(f=0;f<fl;f++) loop would turn into for(f=0;1;f++) loop, which is
> obviously more efficient. That's perhaps most common trick in such
> cases..

Nice trick.  Thanks.

Quote:

> Also, that is not best possible solution.. Actually, it's possible to
> make it to be loop to look like

> while(foo)
>    foo = bar[foo, file[i++]];

> and that's it! only one comparision!

> however, you'd need to prerape bar[] array before..

Prerape ehy?  Typos can be amusing.

This is even better, I'll give it a go.  Thanks for such useful help.

--
Daniel Barron - use [at jadeb.com] for personal replys.
(Visit http://dansguardian.org/  - True web content filtering for all)

 
 
 

Fast memory search routine wanted

Post by Daniel Barro » Sat, 21 Jul 2001 22:12:22






[snip]
> > The trick I'd recommend you is to append your desired sequence at the
> > end of input, so that it'd be matched anyway. Thus, your
> > for(f=0;f<fl;f++) loop would turn into for(f=0;1;f++) loop, which is
> > obviously more efficient. That's perhaps most common trick in such
> > cases..

What if I was doing a search for "ABCABC"?  And the end of the file happened
to look like this "NNNNNABC".  If the search string was appended to the end
it would look like "NNNNNABCABCABC" which would provide a match that did
not equal the end of the file and thus a false result.

It would not work in my example of a STL::search() replacement.  It's not a
problem as I can just change my usage of search() so that if the return
address is greater than the end address - phrase length, rather than just
checking if the returned address equals the file end address.  Or just
modify the code with the test.

Just thought you may be interested.

[snip]

--
Daniel Barron - use [at jadeb.com] for personal replys.
(Visit http://dansguardian.org/  - True web content filtering for all)

 
 
 

Fast memory search routine wanted

Post by Bohdan Vlasyu » Sat, 21 Jul 2001 22:37:24



>> Also, that is not best possible solution.. Actually, it's possible
>> to make it to be loop to look like
>> while(foo)
>>        foo = bar[foo, file[i++]];
>> and that's it! only one comparision!
>> however, you'd need to prerape bar[] array before..
> Prerape ehy?  Typos can be amusing.

:-\. Was this supposed to be `why' ??

I'll explain again.

It's simplified implementation of "state machine" (I don't know
correct terms, the best I can do is to translate ones used in our
literature), aka "automate".

"foo" is s.c. "state". In simplest case foo is number of next symbol
in pattern that would match next symbol from file, and 0 if pattern
matched reached, however, that's unimportant details... The main point
is that every input character changes "state" according to current
state -- that reflects in `foo = bar[foo, file[i]]'. "bar" is "state
table", i.e. an array that represents how state should be changed. In
general case, you can match any RE [regexp] in that way, may be with
some additions "if"'s.

So, most important and hard thing is to generate "bar". after it,
search is simple..

Please, read some book about it, we're getting offtopic.

I believe that some of Borland Pascal for DOS string functions used
this idea, but I never bothered to disassemble actual code.

P.s.: It is possible, and, once upon a time, I've implemented it in
pascal, but I was unexperienced that time, and the program was quite
awkward. I think it'd be quite easy to implement it if you'll have
some nice book...

 
 
 

Fast memory search routine wanted

Post by Bohdan Vlasyu » Sat, 21 Jul 2001 22:40:07



>>> The trick I'd recommend you is to append your desired sequence at
>>> the end of input
> What if I was doing a search for "ABCABC"?  And the end of the file
> happened to look like this "NNNNNABC".  it would provide a match
> that did not equal the end of the file and thus a false result.

I thought that's obvious -- match is matched if it's contained INSIDE
given text, i.e. *last* character of match belongs to text.
 
 
 

Fast memory search routine wanted

Post by Eric P. McC » Sun, 22 Jul 2001 00:13:07



> I was just explaining why I did not know was CS was!  I tried looking in
> several of my linux programming books, but none discussed this.  I suspect I
> need a general efficient algorithm book instead.

The standard by which all other computer books is judged is Donald
Knuth's _The Art of Computer Programming_.  It was written some time
ago and never completed, but probably 90% of everything in it still
applies.

You need to be willing to wade through the math - it's pretty
straightforward but there's a lot of it.

Quote:> > > My main reason for asking for an algorithm is that I just expected
> > > that my own would be slow.  However it appears not.  I quickly
> > > knocked this up:
> > As usual, the answer is "YMMV"... I'm afraid it would work better on
> > random data, however, if data looks like "AAAAAAAAAA...AAAAAAAAAAAA"
> > and you need to find "AAAAAAAAAB", your idea would fail.
> Yes, you're right.  I had not thought of that.  However, luckily, my data is
> essentially random - it's english text.

That's not random, nor is it even close.  I missed the original
question, but the non-randomness of the English language is the reason
why hashing on the first letter or several letters of a word is a bad
idea.  Some letters are far more likely to occur in words than others;
some are also far more likely to occur at the beginning of a word.

The above may not be relevant, but still it's wise not to think of
English as an evenly-distributed, random assortment of letters.

--

  "Knowing that a lot of people across the world with Geocities sites
absolutely despise me is about the only thing that can add a positive
spin to this situation."  - Something Awful, 1/11/2001

 
 
 

1. WANTED: a fast substring matching C routine

I am trying to write a program doing substring matching heavily; however,
it seems that my unix C (SunOS) strstr() doesn't have a good performance.
Any better idea?  Any faster code available?

Thank you.
--
Huang, Chih-Hsien
Rm 101, Computer Center, National Chiao Tung University, Hsinchu, Taiwan

2. Xdaliclock

3. Fast (real-time) memory allocation wanted.

4. Netatalk with ANA6944A (4xTP)

5. memory lacks / tools for memory debugging searched

6. Solaris 2.4 x86 openwin installation problem

7. Fast opmized BLAS(Level 1) routine is available

8. Redhat 8.0 KDE logging out funny - HELP!

9. FAST vector multiply routine

10. Ported fast Cray libm routines now available !

11. Name Comparison/Search Routine

12. Fast BLAS routine(Level 3)

13. searching routines for operand classification