American language standardized dictionary for text compression

Post by Sigurd Crossland » Wed, 30 Mar 1994 15:22:53



As an aid to those involved in natural language parsing, dictionary compression,
or textual encryption, I have been collecting and compiling a lengthy list of
words.  It is expected that a comprehensive standardized dictionary will
eventually result.  This dictionary should contain most common American words,
abbreviations, hyphenations, and even incorrect spellings.

The word lists are compiled from a number of sources: commercial news services,
Usenet news postings, existing dictionaries, name lists, company lists, UNIX man
pages, Project Gutenberg's e-texts, the WordNet project, received mailings, etc.
The texts are parsed and the words sorted by length into files for storage
efficiency and by ASCII collating sequence within files for retrieval
performance.  By definition, 'words' must begin and end with an alphabetic
character and may contain one of the characters "/-&+'248" embedded within the
string.  The words are supposed to be normalized to lower case except where an
unusual capitalization occurs.  At the moment the parser has a bug that lets
both upper-case and lower-case variants of some words through.
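
For anyone who wants to experiment with the word rule and file layout described
above, here is a minimal Python sketch.  It is not the parser used to build the
dictionary.  It assumes ASCII letters only, a minimum word length of two (the
shortest file is length02.txt), and that any mix of letters and the listed
special characters may appear between the first and last letter.  The function
names are made up for illustration, and case normalization is left out.

```python
import re
from bisect import bisect_left
from collections import defaultdict

# Assumed reading of the rule: a letter at each end, and any mix of letters
# and the characters / - & + ' 2 4 8 in between.
WORD_RE = re.compile(r"[A-Za-z](?:[A-Za-z/\-&+'248]*[A-Za-z])?")


def is_word(token):
    """Return True if token satisfies the (assumed) word definition."""
    return len(token) >= 2 and WORD_RE.fullmatch(token) is not None


def bucket_filename(length):
    """File name pattern taken from the directory listing, e.g. length07.txt."""
    return f"length{length:02d}.txt"


def bucket_by_length(tokens):
    """Group valid words by length; each group is de-duplicated and sorted.

    sorted() compares strings by code point, which is the ASCII collating
    sequence for ASCII text (so upper case sorts before lower case).
    """
    buckets = defaultdict(set)
    for token in tokens:
        if is_word(token):
            buckets[len(token)].add(token)
    return {n: sorted(words) for n, words in sorted(buckets.items())}


def contains(sorted_words, word):
    """Binary search in one sorted bucket."""
    i = bisect_left(sorted_words, word)
    return i < len(sorted_words) and sorted_words[i] == word
```

A lookup only has to search the one bucket matching the word's length, e.g.
contains(buckets[len(w)], w), which is why sorting within each file helps
retrieval.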

An anonymous ftp server has been built on wocket.vantage.gte.com which contains
the following files in the pub/standard_dictionary directory:

                    words         bytes

-r--r--r--                         4497 Feb 24 11:00 README
-r--r--r--                      8552448 Jan 28 12:00 dic-0194.tar
-r--r--r--                      4058075 Jan 28 12:02 dic-0194.tar.Z
-r--r--r--                      8880128 Feb 24 10:39 dic-0294.tar
-r--r--r--                      4220442 Feb 24 10:41 dic-0294.tar.Z
-r--r--r--                      3285891 Feb 28 12:45 dic-0294.tar.gz
-r--r--r--                     10403840 Mar 28 10:43 dic-0394.tar
-r--r--r--                      4950681 Mar 28 10:45 dic-0394.tar.Z
-r--r--r--                      3846113 Mar 28 11:18 dic-0394.tar.gz
-r--r--r--                      3818781 Mar 28 11:05 dic-0394.zip
-r--r--r--                      1269760 Aug 16  1993 dic-0893.tar
-r--r--r--                       523393 Aug 16  1993 dic-0893.tar.Z
-r--r--r--                       421239 Aug 16  1993 dic-0893.zip
-r--r--r--                      3186688 Sep 17  1993 dic-0993.tar
-r--r--r--                      1503561 Sep 17  1993 dic-0993.tar.Z
-r--r--r--                      7479296 Oct 26 17:29 dic-1093.tar
-r--r--r--                      3516519 Oct 26 17:32 dic-1093.tar.Z
-r--r--r--                      8273920 Dec 17 11:58 dic-1293.tar
-r--r--r--                      3918385 Dec 17 11:59 dic-1293.tar.Z

-r--r--r--           1067          4268 Mar 28 10:40 length02.txt
-r--r--r--          22790        113950 Mar 28 10:40 length03.txt
-r--r--r--          59156        354934 Mar 28 10:40 length04.txt
-r--r--r--          96155        673082 Mar 28 10:40 length05.txt
-r--r--r--         130085       1040743 Mar 28 10:40 length06.txt
-r--r--r--         141446       1273007 Mar 28 10:41 length07.txt
-r--r--r--         152579       1525780 Mar 28 10:41 length08.txt
-r--r--r--         110207       1212268 Mar 28 10:41 length09.txt
-r--r--r--          87648       1051762 Mar 28 10:41 length10.txt
-r--r--r--          65937        857170 Mar 28 10:41 length11.txt
-r--r--r--          47946        671243 Mar 28 10:41 length12.txt
-r--r--r--          32891        493352 Mar 28 10:41 length13.txt
-r--r--r--          21969        351504 Mar 28 10:41 length14.txt
-r--r--r--          14385        244545 Mar 28 10:41 length15.txt
-r--r--r--           9126        164268 Mar 28 10:41 length16.txt
-r--r--r--           5853        111207 Mar 28 10:41 length17.txt
-r--r--r--           3721         74420 Mar 28 10:41 length18.txt
-r--r--r--           2435         51135 Mar 28 10:41 length19.txt
-r--r--r--           1545         33990 Mar 28 10:41 length20.txt
-r--r--r--           1027         23621 Mar 28 10:41 length21.txt
-r--r--r--            690         16560 Mar 28 10:41 length22.txt
-r--r--r--            455         11375 Mar 28 10:41 length23.txt
-r--r--r--            292          7592 Mar 28 10:41 length24.txt
-r--r--r--            193          5211 Mar 28 10:41 length25.txt
-r--r--r--            121          3388 Mar 28 10:41 length26.txt
-r--r--r--             83          2407 Mar 28 10:41 length27.txt
-r--r--r--              1            30 Mar 28 10:41 length28.txt
-r--r--r--              0             0 Mar 28 10:41 length29.txt
-r--r--r--              0             0 Mar 28 10:41 length30.txt
-r--r--r--              0             0 Mar 28 10:41 length31.txt
-r--r--r--              1            34 Mar 28 10:41 length32.txt

                  1009804 Total

-r--r--r--                        11521 Aug 13  1993 tarread.com

The most recent compilation, dic-0394.tar, is composed of the 31 text files
and may be restored on an MS-DOS computer using the tarread.com utility program.

Any words for inclusion in future dictionaries should be sent directly to my
e-mail address or placed in the /pub/incoming directory.  Please compare your
dictionaries with standard Unix 'words' and submit only the differences.  Many
thanks to those who have submitted the 140,000 words during the last month.
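
One possible way to produce such a difference list is sketched below in Python.
It assumes both lists hold one word per line; the function name is only
illustrative, and the location of the Unix 'words' file varies by system
(/usr/share/dict/words is a common place, but that is an assumption here).

```python
def new_words(candidates, baseline):
    """Return the sorted words found in candidates but not in baseline.

    Both arguments are iterables of lines (open files work); surrounding
    whitespace is stripped and blank lines are ignored.
    """
    known = {line.strip() for line in baseline}
    found = {line.strip() for line in candidates if line.strip()}
    return sorted(found - known)


# Example use with files (paths are placeholders):
#
#   with open("mywords.txt") as mine, open("/usr/share/dict/words") as unix:
#       for word in new_words(mine, unix):
#           print(word)
```

The comparison is case-sensitive, so "Apple" and "apple" count as different
entries.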

Take care.

         - Sig

Sigurd P. Crossland
Advanced Technology Lab                   Telephone: (703) 818-8504
GTE                                       Facsimile: (703) 802-3110

Chantilly, VA   22021                     Home: (703) 818-8942