Standardized dictionary response summary

Post by S.. » Tue, 27 Jul 1993 23:24:14

Following is a summary of responses received regarding a search for a
standardized dictionary of American words:

 1) A standardized dictionary should include misspellings, abbreviations, slang,
    proper names, geographic names, acronyms, and common foreign words, as well
    as all word forms produced by the usual word endings.

 2) 16 bits (64K) might be sufficient for everyday vocabulary, 20 bits (1M)
    might be sufficient for the comprehensive dictionary described above,
    24 bits (16M) might be necessary to contain capitalization and common
    punctuation, and 32 bits (4G) would be required for international and  
    personalized dictionaries, forms, etc.

    Obviously, there is a tradeoff between the inherent compression ratio,
    which varies inversely with the size of the dictionary, and the
    cryptographic functionality, which varies directly with it.  The thought is
    to favor the comprehensive dictionary in order to address the issues of
    portability and extensibility.  Subsets could then be derived for specific
    applications.

 3) The definition of the storage format for the dictionary itself is left to
    the implementation.  Suggestions include encoding the dictionary, storing
    the words in a relational database, and other techniques aimed at improving
    performance or storage efficiency.  The body of the dictionary should
    probably be released as flat files collated in straight binary order, each
    with a record length matched to the word length.  The index of a word is
    then its record index in the file, which permits a somewhat efficient
    distribution mechanism.

 4) A standard character set which includes foreign characters and the
    diacritical marks common to many foreign languages should be identified.  An
    extension to ASCII, such as a combination of Microsoft Multilingual/Latin I
    (code page 850) and Slavic/Latin II (code page 852), but without the line
    and border drawing characters, is indicated.

 5) Dictionaries need to be updated relatively slowly.  Once or twice a year is
    the maximum frequency that would still allow the changes to propagate to
    the majority of users.

 6) A standardized index should be defined.  This index should include
    provisions for specifying run-length encoded character strings, multiple
    dictionaries, personalized extensions for both words and entire documents,
    version control information, common punctuation, and usual word endings.

 7) A repository, such as an anonymous ftp site, needs to be established.  FAQ
    lists, contributions, and a current release of the dictionaries would be
    provided from a well publicized location.
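
To make the index sizes in item 2 concrete, here is a minimal sketch in Python.
It is only an illustration and not something taken from the responses; the
function name index_bits is hypothetical.  It shows how many entries each
proposed index width can address, and the smallest width a dictionary of a
given size would need.

```python
def index_bits(entries: int) -> int:
    """Smallest number of bits that gives each of `entries` words a distinct index."""
    if entries < 1:
        raise ValueError("a dictionary needs at least one entry")
    # Indexes run from 0 to entries - 1, so size the width for the largest one.
    return max(1, (entries - 1).bit_length())


# The four tiers from item 2: a width of `bits` addresses 2**bits entries.
for bits in (16, 20, 24, 32):
    print(f"{bits} bits address {2 ** bits:,} entries")

# A dictionary that just outgrows the 16-bit tier needs a 17th bit.
print(index_bits(65536))  # 16
print(index_bits(65537))  # 17
```

Under this sketch a one-million-word dictionary still fits in 20 bits, which is
in line with the estimate given for the comprehensive dictionary.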

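The flat file layout suggested in item 3 might look roughly like the following
Python sketch.  It assumes one file per word length, plain ASCII bytes (the
character set of item 4 is still undecided), and collation by byte value.  The
function names are hypothetical and an in-memory buffer stands in for a real
file.

```python
import io


def build_flat_file(words, length):
    """Pack every word of exactly `length` bytes into sorted fixed-length records."""
    records = sorted(w.encode("ascii") for w in words if len(w) == length)
    return b"".join(records)


def word_at(f, length, index):
    """Return the word stored at record number `index`."""
    if index < 0:
        raise IndexError("record index out of range")
    f.seek(index * length)
    record = f.read(length)
    if len(record) != length:
        raise IndexError("record index out of range")
    return record.decode("ascii")


def index_of(f, length, word):
    """Binary-search the collated records; return the record index or -1."""
    target = word.encode("ascii")
    f.seek(0, io.SEEK_END)
    lo, hi = 0, f.tell() // length  # hi is the number of records
    while lo < hi:
        mid = (lo + hi) // 2
        f.seek(mid * length)
        if f.read(length) < target:
            lo = mid + 1
        else:
            hi = mid
    f.seek(lo * length)
    return lo if f.read(length) == target else -1


# Assumed sample data: only the five-letter words land in this file.
data = build_flat_file(["words", "codes", "index", "bits", "files"], 5)
f = io.BytesIO(data)
print(index_of(f, 5, "index"))  # 2
print(word_at(f, 5, 2))         # index
```

Because the records are fixed length and collated, no separate index file is
needed: the position of a word in the file is its index, and a lookup costs
only a binary search.
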
The summary indicates that a lot of work will need to take place for this
effort to succeed.  Criticisms and individual contributions will be required to
ensure that a quality product results.  Thanks to those who have replied so far
and to those who might wish to contribute in the future.

Take care.

         - Sig

Sigurd P. Crossland
Manager Advanced Technology Lab           Telephone: (703) 818-4202
GTE                                       Facsimile: (703) 818-4834

Chantilly, VA   22021                     Home: (703) 818-8942