Free:

/usr/dict/words

Roget's 1911 Thesaurus is available by anonymous FTP from the Consortium for Lexical Research
   clr.nmsu.edu:/CLR/lexica/roget-1911 [128.123.1.12]
It is also available from
   src.doc.ic.ac.uk:/literary/collections/project_gutenberg/roget11.txt.Z
An old Webster's dictionary is in /text/dict/{DICT.Z,DICT.INDEX.Z}.

Project Gutenberg also has Roget's 1911 Thesaurus. The Project Gutenberg archive is at mrcnext.cso.uiuc.edu:/pub/etext/. The Project Gutenberg archive collects public domain electronic books. For more information, write to Michael S. Hart, Professor of Electronic Text, Executive Director of Project Gutenberg Etext, Illinois Benedictine College, 5700 College Road, Lisle, IL 60532 or send email to hart@vmd.cso.uiuc.edu.

For people without FTP, Austin Code Works sells floppy disks containing Roget's 1911 Thesaurus for $40.00. This money helps support the production of other useful texts, such as the 1913 Webster's dictionary.

The Online Book Initiative maintains a text repository on ftp.std.com (a public access UNIX system, 617-739-WRLD). See the README file on obi.std.com:/obi/. For more information, send email to obi@world.std.com, write to Software Tool & Die, 1330 Beacon Street, Brookline, MA 02146, or call 617-739-0202.

The CHILDES project at Carnegie Mellon University has a lot of data of children speaking to adults, as well as the adult written and adult spoken corpora from the CORNELL project. Contact Brian MacWhinney <brian@andrew.cmu.edu> for more information.

The Association for Computational Linguistics (ACL) has a Data Collection Initiative. For more information, contact Donald Walker at Bellcore, walker@flash.bellcore.com.

Two lists of common female first names (4967 names) and male first names (2924 names) are available for anonymous ftp from
   ftp.cs.cmu.edu:/user/ai/areas/nlp/corpora/names/
Read the file README first. Send mail to mkant@cs.cmu.edu for more information.

A list of 110,000 English words (one per line, in ASCII) is available in the PD1:<MSDOS.LINGUISTICS> directory on SIMTEL20 as the files WORDS1.ZIP, WORDS2.ZIP, WORDS3.ZIP, and WORDS4.ZIP. Although the list is in MS-DOS files, it can easily be used on other machines (but first you'll have to unzip the files on a DOS machine). The list includes inflected forms of the words, such as plural nouns and the -s, -ed, and -ing forms of verbs; thus the number of lexical stems in the list is considerably smaller than the total number of word forms. These files are available via FTP from WSMR-SIMTEL20.ARMY.MIL [192.88.110.20]. SIMTEL20 files are mirrored on wuarchive.wustl.edu.

The Collins English Dictionary encoded as a Prolog fact base is available from the Oxford Text Archive by anonymous ftp from
   ota.ox.ac.uk:/pub/ota/dicts/1192/ [129.67.1.165]
The Oxford Text Archive includes many other texts, dictionaries, thesauri, word lists, and so on, most of which are available for scholarly use and research only. See the files
   ota.ox.ac.uk:/pub/ota/textarchive.form
   ota.ox.ac.uk:/pub/ota/textarchive.info
   ota.ox.ac.uk:/pub/ota/textarchive.list
   ota.ox.ac.uk:/pub/ota/textarchive.sgml
for more information, or write to archive@ox.ac.uk, Oxford Text Archive, Oxford University Computing Services, 13 Banbury Road, Oxford OX2 6NN, UK, call 44-865-273238 or fax 44-865-273275.

Chuck Wooters <wooters@icsi.berkeley.edu> has extracted the most likely pronunciation for each of about 6100 words in the hand-labeled TIMIT database, and made them available by anonymous ftp from
   ftp.icsi.berkeley.edu:/pub/speech/TIMIT.mostlikely.Z

A list of homophones from general American English is available by anonymous ftp from svr-ftp.eng.cam.ac.uk:/comp.speech/data/ as the file homophones-1.01.txt. To receive the list by email, send mail to Evan.Antworth@sil.org. The list was compiled by Tony Robinson.

Sigurd P. Crossland <sig@seuss.vantage.gte.com> has been compiling a dictionary of English words, including most common American words, abbreviations, hyphenations, and even incorrect spellings. The most recent version is available by anonymous ftp from
   wocket.vantage.gte.com:/pub/standard_dictionary/dic-0394.tar.gz
The tar file includes 31 text files, one for each word length from 2 to 32. The compressed tar file takes up just over 4 MB of space and includes approximately 870,000 words.

WordNet is an English lexical reference system based on current psycholinguistic theories of human lexical memory. It organizes nouns, verbs, and adjectives into synonym sets corresponding to lexical concepts. The sets are linked by a variety of relations. Besides being of scientific interest, it makes a handy thesaurus. WordNet is available by anonymous ftp from
   clarity.princeton.edu:/pub/
If you retrieve a copy of WordNet by ftp, please send mail to wordnet@princeton.edu.

Commercial:

Illumind publishes the Moby Thesaurus (25,000 roots/1.2 million synonyms), Moby Words (560,000 entries), Moby Hyphenator (155,000 entries), Moby Part-of-Speech (214,000 entries), Moby Pronunciator (167,000 entries with IPA encoding, syllabification, and primary, secondary, and tertiary stress marks), and Moby Language (100,000 word word lists in five major world languages) lexical databases. All databases are supplied in pure ASCII, royalty-free, in both Macintosh and MS-DOS disk formats (also in .Z file formats). Both commercial (to resell derived structures as part of commercial applications) and educational/research licenses are available. Samples of each of the lexical databases are available by anonymous ftp from
   netcom.com:/pub/grady/Moby_Sampler.tar.Z [192.100.81.100]
For more information, write to Illumind, Attn: Grady Ward, 3449 Martha Court, Arcata, CA 95521, call/fax 707-826-7715, or send email to grady@netcom.com.

The Oxford Text Archive has hundreds of online texts in a wide variety of languages, including a few dictionaries (the OED, Collins, etc.). The Lancaster-Oslo-Bergen (LOB), Brown, and London-Lund corpora are also available from them. For more information, write to Oxford Electronic Publishing, Oxford University Press, 200 Madison Avenue, New York, NY 10016, call 212-889-0206, or send mail to archive@vax.oxford.ac.uk. (Their contact information in England is Oxford Text Archive, Oxford University Computing Service, 13 Banbury Road, Oxford OX2 6NN, UK, +44 (865) 273238.)

Mailing Lists:

CORPORA is a mailing list for Text Corpora. It welcomes information and questions about text corpora such as availability, aspects of compiling and using corpora, software, tagging, parsing, and bibliography. To be added to the list, send a message to corpora-request@x400.hd.uib.no. Contributions should be sent to corpora@x400.hd.uib.no.

Linguistic Data Consortium:

The Linguistic Data Consortium was established to broaden the collection and distribution of speech and natural language data bases for the purposes of research and technology development in automatic speech recognition, natural language processing, and other areas where large amounts of linguistic data are needed. Information about the LDC is available by anonymous ftp from
   ftp.cis.upenn.edu:/pub/ldc [130.91.6.8]
Documents available in this directory include a paper on the background, rationale and goals of the LDC, a brief list of available data bases, and some tables summarizing these corpora. For further information, contact Elizabeth Hodas <ehodas@walnut.ling.upenn.edu>, Mark Liberman <myl@unagi.cis.upenn.edu>, or Jack Godfrey <jgodfrey@unagi.cis.upenn.edu>.