Following is a summary of responses received regarding a search for a
standardized dictionary of American words:
1) A standardized dictionary should include misspellings, abbreviations, slang,
proper names, geographic names, acronyms, and common foreign words, as well as
all inflected forms of each word (word endings).
2) 16 bits (64K) might be sufficient for everyday vocabulary, 20 bits (1M)
might be sufficient for the comprehensive dictionary described above,
24 bits (16M) might be necessary to contain capitalization and common
punctuation, and 32 bits (4G) would be required for international and
personalized dictionaries, forms, etc.
Obviously, there exists a tradeoff between the inherent compression ratio,
which varies inversely, and the cryptographic functionality, which varies
directly with the size of the dictionary. The thought is to favor the
comprehensive dictionary in order to address the issues of portability and
extensibility. Subsets could then be derived for specific applications.
3) The definition of the storage format for the dictionary itself is left to
the implementation. Suggestions have been made to encode the dictionary,
store the words in a relational database, and other techniques aimed at
performance or storage efficiency enhancements. The body of the dictionary
should probably be released in straight binary collated flat files of a
record length associated with the word length. The index for the word will
then be the record index of the file, permitting a somewhat efficient lookup.
4) A standard character set which includes foreign characters and the
diacritical marks common to many foreign languages should be identified. An
extension to ASCII, such as a combination of Microsoft Multilingual/Latin I
(code page 850) and Slavic/Latin II (code page 852), but without the line
and border drawing characters, is indicated.
5) Dictionaries need to be updated relatively slowly. Once or twice a year is
the maximum frequency which would still allow the changes to propagate to
the majority of users.
6) A standardized index should be defined. This index should include
provisions for specifying run-length encoded character strings, multiple
dictionaries, personalized extensions for both words and entire documents,
version control information, common punctuation, and usual word endings.
7) A repository, such as an anonymous ftp site, needs to be established. FAQ
lists, contributions, and a current release of the dictionaries would be
provided from a well publicized location.
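
To make items 2 and 3 concrete, here is a minimal sketch of the flat-file idea
in Python. It assumes one collated (sorted) file per word length, with each
record exactly as long as the word it holds, so the record number serves as the
word's index. The function names (build_files, index_of, word_at, index_bits)
and the ASCII-only encoding are assumptions made for illustration; they are not
part of any agreed format.

```python
from collections import defaultdict


def build_files(words):
    """Group words by length; each 'file' is a sorted run of fixed-length records."""
    by_len = defaultdict(list)
    for w in set(words):
        by_len[len(w)].append(w.encode("ascii"))
    # One in-memory blob per word length stands in for one flat file on disk.
    return {n: b"".join(sorted(ws)) for n, ws in by_len.items()}


def word_at(files, length, index):
    """Fetch a word directly from its record index: a single slice, no search."""
    blob = files[length]
    return blob[index * length:(index + 1) * length].decode("ascii")


def index_of(files, word):
    """Binary-search the collated records; return the record index or -1."""
    key = word.encode("ascii")
    n = len(key)
    blob = files.get(n)
    if not blob:
        return -1
    count = len(blob) // n
    lo, hi = 0, count
    while lo < hi:
        mid = (lo + hi) // 2
        if blob[mid * n:(mid + 1) * n] < key:
            lo = mid + 1
        else:
            hi = mid
    if lo < count and blob[lo * n:(lo + 1) * n] == key:
        return lo
    return -1


def index_bits(n_records):
    """Smallest number of bits able to address n_records distinct entries."""
    return max(1, (n_records - 1).bit_length())


files = build_files(["cat", "dog", "the", "bird", "fish"])
i = index_of(files, "dog")
print(i, word_at(files, 3, i))
print(index_bits(64 * 1024), index_bits(1024 * 1024))
```

Index-to-word is a direct seek and word-to-index is a binary search, which is
roughly the "somewhat efficient" access item 3 has in mind. Note that the
record index here is only unique within one word-length file; a single global
index would also have to identify the file, for example by adding a per-length
base offset.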
The summary indicates a lot of work will need to take place for this effort to
succeed. Criticisms and individual contributions will be required to ensure
that a quality product results. Thanks to those who have replied so far and to
those who might wish to contribute in the future.
Sigurd P. Crossland
Manager Advanced Technology Lab Telephone: (703) 818-4202
GTE Facsimile: (703) 818-4834
Chantilly, VA 22021 Home: (703) 818-8942