American language standardized text compression dictionary

Post by S.. » Thu, 26 May 1994 13:28:06



As an aid to those involved in natural language parsing, dictionary
compression, or textual encryption, I have been collecting and
compiling a lengthy list of words.  It is expected that a
comprehensive standardized dictionary will eventually result.  This
dictionary should contain most common American words, abbreviations,
hyphenations, and even incorrect spellings.

The word lists are compiled from a number of sources: commercial
news services, Usenet news postings, existing dictionaries, name
lists, company lists, UNIX man pages, Project Gutenberg's E-texts,
the WordNet project, received mailings, etc.  The texts are parsed,
and the words are sorted by length into files for storage efficiency
and by ASCII collating sequence within each file for retrieval
performance.  By definition, 'words' must begin and end with an
alphabetic character and may contain one of the characters
"/-&+'248" embedded within the string.
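
For anyone wanting to run their own texts through the same kind of
filter, here is a minimal Python sketch of the word rule and the
length/ASCII ordering described above.  It is not the program used to
build the dictionary.  It assumes that interior characters may be
letters or any of the listed characters, that tokens are split on
whitespace with surrounding punctuation trimmed, and that duplicates
are dropped; the names WORD_RE, is_word, bucket_by_length and
file_name are made up for the illustration.

```python
import re
from collections import defaultdict

# Assumed reading of the rule: first and last characters are ASCII letters;
# interior characters are letters or any of / - & + ' 2 4 8.
WORD_RE = re.compile(r"[A-Za-z][A-Za-z/\-&+'248]*[A-Za-z]")


def is_word(token):
    """True if the whole token satisfies the assumed word rule."""
    return WORD_RE.fullmatch(token) is not None


def bucket_by_length(text):
    """Group the distinct valid words in text by length, ASCII-sorted."""
    buckets = defaultdict(set)
    for token in text.split():
        # Trim common surrounding punctuation (an assumption, not the
        # original parser's behavior).
        token = token.strip(".,;:!?\"()[]")
        if is_word(token):
            buckets[len(token)].add(token)
    # sorted() on str compares code points, which is ASCII order here.
    return {n: sorted(ws) for n, ws in sorted(buckets.items())}


def file_name(length):
    """File name in the lengthNN.txt pattern used in the listing."""
    return f"length{length:02d}.txt"
```

Because the rule requires a letter at each end, the shortest
accepted word has two characters, which agrees with the listing
starting at length02.txt.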

An anonymous ftp server has been built on wocket.vantage.gte.com
(URL: ftp://wocket.vantage.gte.com/pub/standard_dictionary - IP
address: 131.131.98.182) which contains the following files in the
pub/standard_dictionary directory:

                    words         bytes

-r--r--r--                         5966 May 24 10:57 README
-r--r--r--                      8552448 Jan 28 12:00 dic-0194.tar
-r--r--r--                      8880128 Feb 24 10:39 dic-0294.tar
-r--r--r--                     10403840 Mar 28 10:43 dic-0394.tar
-r--r--r--                     10936320 Apr 27 09:22 dic-0494.tar
-r--r--r--                     16080896 May 24 10:34 dic-0594.tar
-r--r--r--                      7640419 May 24 10:42 dic-0594.tar.Z
-r--r--r--                      5917885 May 24 11:21 dic-0594.tar.gz
-r--r--r--                      5834014 May 24 11:54 dic-0594.zip
-r--r--r--                      1269760 Aug 16  1993 dic-0893.tar
-r--r--r--                      3186688 Sep 17  1993 dic-0993.tar
-r--r--r--                      7479296 Oct 26  1993 dic-1093.tar
-r--r--r--                      8273920 Dec 17 11:58 dic-1293.tar
-r--r--r--                         4779 Mar 29 12:52 index.txt

-r--r--r--           1120          4480 May 24 10:31 length02.txt
-r--r--r--          26188        130942 May 24 10:31 length03.txt
-r--r--r--          83578        501466 May 24 10:31 length04.txt
-r--r--r--         136368        954573 May 24 10:31 length05.txt
-r--r--r--         187427       1499479 May 24 10:31 length06.txt
-r--r--r--         207570       1868123 May 24 10:31 length07.txt
-r--r--r--         218916       2189150 May 24 10:31 length08.txt
-r--r--r--         170416       1874567 May 24 10:31 length09.txt
-r--r--r--         138538       1662946 May 24 10:31 length10.txt
-r--r--r--         105643       1373348 May 24 10:32 length11.txt
-r--r--r--          77521       1085293 May 24 10:32 length12.txt
-r--r--r--          54211        813152 May 24 10:32 length13.txt
-r--r--r--          36743        587888 May 24 10:32 length14.txt
-r--r--r--          25069        426173 May 24 10:32 length15.txt
-r--r--r--          16773        301914 May 24 10:32 length16.txt
-r--r--r--          11312        214928 May 24 10:32 length17.txt
-r--r--r--           7854        157080 May 24 10:32 length18.txt
-r--r--r--           5511        115731 May 24 10:32 length19.txt
-r--r--r--           3776         83072 May 24 10:32 length20.txt
-r--r--r--           2667         61341 May 24 10:32 length21.txt
-r--r--r--           1811         43464 May 24 10:32 length22.txt
-r--r--r--           1364         34100 May 24 10:32 length23.txt
-r--r--r--            951         24726 May 24 10:32 length24.txt
-r--r--r--            668         18036 May 24 10:32 length25.txt
-r--r--r--            513         14364 May 24 10:32 length26.txt
-r--r--r--            367         10643 May 24 10:32 length27.txt
-r--r--r--              1            30 May 24 10:32 length28.txt
-r--r--r--              0             0 May 24 10:32 length29.txt
-r--r--r--              0             0 May 24 10:32 length30.txt
-r--r--r--              0             0 May 24 10:32 length31.txt
-r--r--r--              1            34 May 24 10:32 length32.txt

    Total         1522877

-r--r--r--                        11521 Aug 13  1993 tarread.com

The most recent compilation is available as dic-0594.tar,
dic-0594.tar.Z, dic-0594.tar.gz, and dic-0594.zip.  Each archive
contains the 31 text files (length02.txt through length32.txt) and
may be restored on an MS-DOS computer using the tarread.com or
pkunzip.exe utility programs.
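
On systems without those utilities, the .tar and .tar.gz archives
can also be read with a short script.  The following is only a
sketch using Python's standard tarfile module; the function name
read_word_files is hypothetical, and the member names inside the
archives are assumed to follow the listing above.  Note that tarfile
does not handle the .tar.Z (compress) format.

```python
import tarfile


def read_word_files(archive_path):
    """Map each regular file in the archive to its list of lines."""
    words = {}
    # "r:*" lets tarfile detect a plain or gzip-compressed archive itself.
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue
            data = archive.extractfile(member).read()
            # splitlines() copes with both LF and CRLF line endings.
            text = data.decode("ascii", errors="replace")
            words[member.name] = text.splitlines()
    return words
```

Reading members in memory this way avoids writing anything to disk
until the contents have been looked over.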

Any words for inclusion in future dictionaries should be submitted to
my E-Mail address directly or placed in the /pub/incoming directory.
Please compare your dictionaries with the standard Unix 'words' file
and submit only the differences.  Many thanks to those who have
submitted the 460,000 words during the last month.
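
One way to produce such a difference list is sketched below in
Python.  It assumes both lists hold one word per line; the function
name new_words is made up for the example, and the comparison is
case-sensitive.

```python
def new_words(candidate_lines, reference_lines):
    """Return candidate words absent from the reference list, ASCII-sorted."""
    reference = {line.strip() for line in reference_lines if line.strip()}
    candidates = {line.strip() for line in candidate_lines if line.strip()}
    # Set difference keeps only the words the reference does not have.
    return sorted(candidates - reference)
```

Both arguments can be open file objects, for example your own list
and the system word list (often /usr/share/dict/words, though the
location varies from system to system).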

Take care.

         - Sig

Sigurd P. Crossland
Advanced Technology Lab        |  Telephone: (703) 818-8504
GTE                            |  Facsimile: (703) 802-3110

Chantilly, VA 22021            |  Home: (703) 818-8942