hashing function for strings

hashing function for strings

Post by Mark Galas » Sat, 17 Feb 1990 06:12:10



I would like to know which is a good hash function for strings.
I was going to compute the index into the hash table with
        index = hashval(s) % HASH_LEN;
where s is a string, and HASH_LEN is a prime number and is also
the size of the hash table.

I have seen two schemes:

1. hashval(s) is the sum of the ascii values of each char in the string.

2. hashval(s) is the sum of the ascii values*256^i where i is the
   position of the char in the string.

Which of these is better, or is there any other?  Knuth seems to not
go into detail on hashval() for strings.

Thanks,
--
    {These opinions are mine, and should be everybody else's :-)}


 
 
 

hashing function for strings

Post by Aaron Wo » Sat, 17 Feb 1990 21:34:37


Here is a hash function that works well.  Aaron

/*
 * hht_ascii_hht_string - hash a string
 * intputs:
 * key - an asciz string
 * table - a hht table
 * outputs:
 *  hht - value >=0 and < hht_buck_size, an index into a hht table
 * whatch_out_for:
 *  table size must be prime
 *  unsigned long probibly needs to be 32 bits for this to work
 * algorithim stolen from:
 *  Compilers Principles, Techniques, And tools by Aho, Sethi,
 *  and Ullman  pg 436
 */
PUBLIC_PROC long hht_ascii_hht_string(table,key)
hht_table *table;
register unsigned char *key;
{
  register unsigned long hhtval=0;
  register unsigned long carry;
  for(;(*key)!=0;key++) {       /*scan the string*/
    hhtval=hhtval<<4;             /*make room for next character*/
    hhtval+=(*key);             /*add it in*/
    carry=hhtval&0xf0000000;        /*look at some high order bits*/
    if(carry!=0) {              /*about to overflow?*/
      hhtval^=(carry>>24);        /*wrap bits back into low order bits*/
      hhtval^=carry;            /*turn them off in hhtval*/
    }
  }
  hhtval%=table->hht_buck_size;      /*map it into range*/
  return(hhtval);

Quote:}


 
 
 

hashing function for strings

Post by Andrew P. Berm » Sat, 17 Feb 1990 11:48:21



>I would like to know which is a good hash function for strings.
>I was going to compute the index into the hash table with
>    index = hashval(s) % HASH_LEN;
>where s is a string, and HASH_LEN is a prime number and is also
>the size of the hash table.
>I have seen two schemes:
>1. hashval(s) is the sum of the ascii values of each char in the string.

>2. hashval(s) is the sum of the ascii values*256^i where i is the
>   position of the char in the string.
>Which of these is better, or is there any other?  Knuth seems to not
>go into detail on hashval() for strings.

It seems to me that (2) is superior in the sense that
hashval(X) = hashval(Y) implies X = Y, where as this is not
true in (1).
For example, in (1) hashval("abc") = hashval("acb") = hashval("bca")=...
Also , hashval("ae")=hashval("bd") = hashval("cc")=...  you get the point.
This becomes important if your strings are not randomly distributed,
which, if they bear any resemblance to the English language, they
won't be: ("eat","ate","tea","eta" are all words and map to the same value,
while "zxq","xzq",...  aren't).
So I would use (2).  Of course, you may have problems if your strings
are very long, i.e., a 9 byte string would overflow an integer.
And even in (2) you don't have a fully random distribution over the
domain (unless you use the full ascii character set in your strings).

Andrew P. Berman

>Thanks,
>--
>    {These opinions are mine, and should be everybody else's :-)}



 
 
 

hashing function for strings

Post by Chris Ho » Sun, 18 Feb 1990 03:35:35




Quote:

>I would like to know which is a good hash function for strings.

        [deletions]

I've found that if you want the hash table to be a useful
number (e.g. a power of two) you can take the component
characters of the string, multiply each by a different
prime number, and sum and mod the results; it gave quite an even
distribution (though I didn't have a tremendously large table).
_______________________________________________________________________________

_______________________________________________________________________________
   "All the world's a cage, and all the men and women merely orangutans..."

 
 
 

hashing function for strings

Post by Thomas Trusco » Sun, 18 Feb 1990 00:48:29


Quote:> I would like to know which is a good hash function for strings.

Here is my favorite string hasher:

/*
 * Handy string->hash index function.
 * Not wonderful, but adequate for chained hash table.
 */
hash(s, size)
register char *s;       /* note: 'unsigned char' might be faster */
unsigned int size;
{
    register int n, c;

    if (n = *s++)
        while (c = *s++)
            n = n*2 + c;
    /* n *= 2; for "linear open addressing" */
    return(((unsigned int)n) % size);

Quote:}

This routine is a compromise between just summing up the characters
(change the above "2" to "1")
which hashes too many things to the same value,
and constructing an exact numeric equivalent of the string
(change the above "2" to "256")
which on most systems overflows after 4 characters.

The main virtue of this function is its speed.
It can be made even faster by expanding it inline,
and by making "size" a constant.
Making size a power of 2 is even faster, but it weakens the hash.

My favorite hashing scheme is to hash into one of "size"
linked lists which is then searched linearly.
If you are instead going to hash into a table with "size"
entries and search linearly for a free slot,
I recommend uncommenting the "n *= 2;" line.
That will keep strings such as "tty01", "tty02", ...
from hashing to adjacent entries, which can result in long
linear searches if something else hashes there.

The merits of this function can be debated, but I have found
that instrumenting with the following routine
is more useful than visionary posturing.

ztabinfo(table, tabsize, name)
struct hashent *table[];
unsigned int tabsize;
char *name;
{
    register struct hashent *t;
    register int i, n;
    register int nent,  /* total # table entries */
                nprobe, /* total # probes needed to find all entries */
                np_exp, /* expected # probes needed (randomized table) */
                np_opt; /* optimal # probes needed (ideal distribution) */

    nprobe = nent = 0;
    for (i = 0; i < tabsize; i++) {
        n = 0;
        for (t = table[i]; t != nullp(struct hashent); t = t->x_next)
           n++;
        nent += n;
        nprobe += n*(n+1)/2;
    }

    /*
     * Expected # of probes (for successful search of all entries).
     * Cf. Knuth vol. 3 section 6.4 C(n) (equation 19).
     */
    np_exp = nent + (nent*(nent-1)+2*tabsize-1)/(2*tabsize);

    /*
     * Optimal # of probes (for successful search of all entries).
     * The optimal table has (nent-(nent%tabsize)) positions
     * containing (int)(nent/tabsize) entries, with the remaining
     * nent%tabsize positions containing 1+(int)(nent/tabsize) entries.
     */
    n = nent/tabsize;   /* optimal entier(entries/table position) */
    np_opt = n*(n+1)/2 + (nent%tabsize)*(n+1);  /* optimal # probes */

    printf("%s: entries %3d  probes %3d (expected %3d  optimal %3d).\n",
        name, nent, nprobe, np_exp, np_opt);

Quote:}

 
 
 

hashing function for strings

Post by Scott David Danie » Sun, 18 Feb 1990 11:52:07



>Here is a hash function that works well.  Aaron

...explanation of function interface...removing nice format for compactness...

Quote:>PUBLIC_PROC long hht_ascii_hht_string( table, key )
>    hht_table *table; register unsigned char *key; {
>  register unsigned long hhtval=0, carry;
>  for(;(*key)!=0;key++) {
>    hhtval=hhtval<<4; hhtval+=(*key); carry=hhtval&0xf0000000;
>    if(carry!=0) { hhtval^=(carry>>24); hhtval^=carry; }
>  }
>  hhtval%=table->hht_buck_size; return(hhtval);
>}

I was involved in a small study on proper potency for such hashings (actually
to decide a technical argument,) and found basically:
If you are hashing arbitrary text, (1) it is important to get the greatest
information spread early across the word, and (2) the rotate size should be
be relatively prime to the word size (to reduce reordering).  In all cases,
if you have typical data, you should certainly test your hash functions; you
may find that your subset of source is significantly biased.  Our best was to
use SHIFT_SIZE in the following as 5 for "basically alphabetic monocase data",
the hardware word size was thirty-two bits, and (since % produced inconsistent
results on our two target machines for full-sized words), we used WORD_SIZE=31.
I believe SHIFT_SIZE and WORD_SIZE should be relatively prime (and WORD_SIZE
can be below your native word size in the code below).  The relatively prime
requirement is caused by the fact that you want superposition of two characters
to occur as late (far into a word) as possible.  6 would be a better shift size
for multi-case hashing, but in that case I think you should avoid WORD_SIZE 32.

(Code made to mimic above interface)
#define SHIFT_SIZE 5
#define WORD_SIZE 31

#define KEEP_BLOCK_SIZE (WORD_SIZE-SHIFT_SIZE)
#define KEEP_MASK       ((1 << KEEP_BLOCK_SIZE) - 1)

PUBLIC_PROC long hht_ascii_hht_string( table, key )
  hht_table *table;
  register unsigned char *key;
{
  register long hhtval;                 /* could be unsigned if you prefer */

  for( hhtval = 0; (*key) != 0; key++ ) {               /*scan the string*/
    hhtval = ((hhtval & KEEP_MASK) << SHIFT_SIZE)       /*rotate former hash*/
           ^ (hhtval >> KEEP_BLOCK_SIZE)
           ^ *keep;                                     /*include new char*/
    /* at this point, it MAY help to do something like:
     * if( (1<<WORD_SIZE) & ++hhtval ) --hhtval;
     * -->(WORD_SIZE=sizeof(word)-1) makes this = if( 0 > ++hhtval ) --hhtval;
     * to keep permutations distinct.  For the identifiers we tested, it
     * did not help (the test just watches for overflow, so modulus worked.)
     */
    }
  return hhtval % table->hht_buck_size;         /*map it into range*/

Quote:}

What we found was that one, two, three, and four-letter elements were common,
so that our performance was increased as long as we kept the INFORMATION bits
separated as long as possible.  (Note: Ascii is a 7-bit code, the top two bits
will determine case for alpha-primarily code (like language identifiers), hence
5 bit shifts.  It does not hurt to overlap the "mostly constant" bits with the
data here, since that gives you the longest runs before you overlap information
bits.  A shift of 4 overlaps information bits in a two-character identifier.

-Scott Daniels

 
 
 

hashing function for strings

Post by Metafont Consultant Accou » Mon, 19 Feb 1990 10:17:02



Quote:>the hardware word size was thirty-two bits, and (since % produced inconsistent
>results on our two target machines for full-sized words),

Yep.  It is utterly amazing how many modulus inplementations exist
where (-1) mod m isn't m-1.  This makes cellular automaton programs
add half a dozen extra instructions per loop to treat a rectanglar
array as a toroid.  I don't know who originated this bogosity, but it
would be nice if hardware and compiler folks would extirpate it once
and for all from all machines and all languages, so programs using mod
could port without stumbling over this implementation error.
--

Again, my opinions, not the account furnishers'.
 
 
 

hashing function for strings

Post by Felix L » Tue, 20 Feb 1990 15:57:21



Quote:>Yep.  It is utterly amazing how many modulus inplementations exist
>where (-1) mod m isn't m-1.

There are three things that most people would like:
        quotient = -5 div 3 = -(5 div 3) = -1
        remainder = -5 mod 3 = (-5 + 6) mod 3 = 1
        3*quotient + remainder = -5

Unfortunately, they're mutually contradictory.  You must choose one of
them to violate.  Someone will be unhappy with your decision.
--

 
 
 

hashing function for strings

Post by Metafont Consultant Accou » Tue, 20 Feb 1990 20:59:58




>>Yep.  It is utterly amazing how many modulus inplementations exist
>>where (-1) mod m isn't m-1.

>There are three things that most people would like:
>    quotient = -5 div 3 = -(5 div 3) = -1
>    remainder = -5 mod 3 = (-5 + 6) mod 3 = 1
>    3*quotient + remainder = -5

>Unfortunately, they're mutually contradictory.  You must choose one of
>them to violate.  Someone will be unhappy with your decision.

Absolutely!  One may either choose to truncate integer divisions'
quotients towards minus infinity, a very unnatural feeling operation,
to dissociate the modulus function from the remainder of integer
division, or to violate the grade school concept of reconstructing the
dividend simply from the divisor, the quotient, and the remainder.

I see the idea of a modulus function that doesn't change if the axis
is shifted left or right by the mod base, rather than one that suffers
an ugly reflection at the origin, as inherently beautiful, in the math
sense.  It is also inherently more useful, in the graphics programming
sense.

Among your three items, the one that does not arouse any particular
religious fervor at thoughts of its loss is the second; I see no
particular reason to identify the mod function with the remainder of
integer division; they are very comfortable as independent concepts,
and, I think, were identified with one another on computers by the
historical accident that the remainder fell out simply from the most
common implementation of shift and subtract logic for division, which
leaves the remainder in one half of the dividend double register while
the quotient is built in the other half, and the by arithmetical
coincidence that they are identical when all the operands are
positive.
--

Again, my opinions, not the account furnishers'.

 
 
 

hashing function for strings

Post by Paul D. Crowl » Wed, 21 Feb 1990 03:30:48



>There are three things that most people would like:
>    quotient = -5 div 3 = -(5 div 3) = -1
>    remainder = -5 mod 3 = (-5 + 6) mod 3 = 1
>    3*quotient + remainder = -5

>Unfortunately, they're mutually contradictory.  You must choose one of
>them to violate.  Someone will be unhappy with your decision.

I'd _hate_ a computer that obeyed the first of these. To me, -5 div 3 is
-2. -5 div 3 = int(-5/3) = int(-1.666) = -2

Then again, I think month and day numbers should start at 0.

--
 \/ o\ "I say we grease him right away. No offense." - Aliens.

 
 
 

hashing function for strings

Post by Peter Desnoye » Wed, 21 Feb 1990 10:15:09



Quote:Consultant Account) writes:
> I see the idea of a modulus function that doesn't change if the axis
> is shifted left or right by the mod base, rather than one that suffers
> an ugly reflection at the origin, as inherently beautiful, in the math
> sense.  It is also inherently more useful, in the graphics programming
> sense.

It is more than just pretty. The integers mod M has M elements, usually
represented as 0..M-1.  A modulo operator that results in {integers mod M}
having 2M-1 elements is either incorrectly named or incorrectly implemented.

                                             Peter Desnoyers
                                             Apple ATG

 
 
 

hashing function for strings

Post by Metafont Consultant Accou » Wed, 21 Feb 1990 17:13:50



Desnoyers) writes:

>Consultant Account) writes:
>> I see the idea of a modulus function that doesn't change if the axis
>> is shifted left or right by the mod base, rather than one that suffers
>> an ugly reflection at the origin, as inherently beautiful, in the math
>> sense.  It is also inherently more useful, in the graphics programming
>> sense.

>It is more than just pretty. The integers mod M has M elements, usually
>represented as 0..M-1.  A modulo operator that results in {integers mod M}
>having 2M-1 elements is either incorrectly named or incorrectly implemented.

[I knew I couldn't get away without drawing a picture; my words have
never been that clear...]

Peter,

It isn't the case, for the implementions I have seen that I considered
flawed, that the x mod 4 mapping looks like:

                                  3           3
                               2           2
                            1           1    
 0           0           0           0           0
         -1          -1
      -2          -2
   -3          -3

-8 -7 -6 -5 -4 -3 -2 -1  0  1  2  3  4  5  6  7  8

but instead it looks like:

    3           3                 3           3
       2           2           2           2
          1           1     1           1    
 0           0           0           0           0

-8 -7 -6 -5 -4 -3 -2 -1  0  1  2  3  4  5  6  7  8

when I need it to look like:

          3           3           3           3
       2           2           2           2
    1           1           1           1    
 0           0           0           0           0

-8 -7 -6 -5 -4 -3 -2 -1  0  1  2  3  4  5  6  7  8

which latter is invarient under translations of the axis by multiples
of 4, the modulus base, a very desirable property for mapping a
rectangle onto a torus.
--

Again, my opinions, not the account furnishers'.

 
 
 

hashing function for strings

Post by Charles Linds » Wed, 21 Feb 1990 20:38:05




>>There are three things that most people would like:
>>        quotient = -5 div 3 = -(5 div 3) = -1
>>        remainder = -5 mod 3 = (-5 + 6) mod 3 = 1
>>        3*quotient + remainder = -5
>Absolutely!  One may either choose to truncate integer divisions'
>quotients towards minus infinity, a very unnatural feeling operation,

On the contrary, one must start by realising that it is the usual definition
of integer division in programming languages that is WRONG, and is the real
root of the problem.


Quote:>I'd _hate_ a computer that obeyed the first of these. To me, -5 div 3 is
>-2. -5 div 3 = int(-5/3) = int(-1.666) = -2

So I am not alone in my view.

Here is a nice example taken from an article by Lambert Meertens in ALGOL
Bulletin #39. I have transcribed it into ADA,

TYPE intarray IS ARRAY(integer RANGE <>) OF integer;

PROCEDURE mid(a: intarray) RETURN integer IS
                -- the "middle" value of the array a
        BEGIN RETURN a( (a'first + a'last)/2 ) END;

PROCEDURE at(a: intarray, newbound: integer) RETURN intarray IS
                -- a copy of a, s.t. a'first = newbound
        DECLARE
                newa: ARRAY(newbound .. a'last-a'first+newbound) OF integer := a;
        BEGIN RETURN newa; END;

somearray: intarray(1..4) := (1,2,3,4);

        -- so mid(somearray) gives '2'

FOR i IN REVERSE -4..+4
LOOP    put(mid(at(somearray, i))); END LOOP;
        -- prints "2 2 2 2 2 2 3 3 3"

How did we get into this mess? Once upon a time (c. 1950) some machines
(mostly in Europe) did integer division "right" and some (mostly in America)
did it "wrong". The difference was essentially between machines built by
mathematicians and machines built by engineers (note that, if you carefully
follow a textbook on Algebra, you will see how mathematicians treat it). Not
surprisingly, the "wrong" answer got built into FORTRAN. The last chance to
get it right was in ALGOL 60, when the Europeans were only interested in REAL
division. The Americans wanted a separate operator for INTEGER division, so
the Europeans asked the Americans to provide a definition for it.

I tried in 1968 to get it right in ALGOL 68, but all I achieved was to get the
MOD correct (and not the same as remainder).

I tried to persuade ADA to get it right, but all I achieved was to get a REM
operator in addition to MOD.

I tried to get the PASCAL standardizers to get it right, but although they
admitted the merit of my case, they got cold feet because, in the meantime,
all hardware was now being built the way languages were now defining it.

So here is my plea to our descendants who may someday colonize a planet in
Alpha Centauri. When you get there, and set up new standards to be applied in
that corner of the galaxy, please remember that the people on Terra got
integer division WRONG, and do not let the mistake propagate any further.

 
 
 

1. hash table and hash function

I've here one hash function and I'm trying to determine what table
size it works best on within a certain range. I copy about 25,000
words into tables of different sizes using the same hashing function
and I compare how long it takes to hash all of the words, the average
number of comparisons it does for each table size, and also the
standard deviation of that.
I notice that at times, the amount of time it takes to hash is greater
when the average number of comparisons performed are fewer.

The average number of comparison is computed by summing the total
number of comparisons necessary to find each element in the table,
divided by the total number of elements in the table.

The standard deviation is computed by finding the total overall
elements of the square of the number of comparisons necessary to find
that element, dividing by the number of elements, subtracting the
square of the average number of comparisons, and taking the square
root of that difference.

here is one such example:

Table size = 1043
number of elements = 25143
average number of compares = 13.077675819396973
standard deviation = 7.80722988434817
time to hash 25143 words = 118 milliseconds

Table size = 1049
number of elements = 25143
average number of compares = 13.036272048950195
standard deviation = 7.831248331181126
time to hash 25143 words = 126 milliseconds

which one of these table sizes is better for this hash function?

thanks for the help.

2. Hawk 1.0 / Fastrak 2.2.1

3. String hashing and polynomial evaluation

4. SMS_License_Manager - How to disable License Metering?

5. lower bounds on k-universal hash functions

6. Datagrid ... Combo instead of textBox

7. HELP ! wrt HASH FUNCTIONS

8. Installing a Creative Ensoniq PCI Card - Ultra AXi

9. Cryptographic hash functions

10. Multidimensional Hash Function

11. Hash function for a Set of integers

12. Q: non-arithmetic perfect hash functions

13. Independent hash functions