[6-3] Where can I get machine-readable dictionaries, thesauri, and other text corpora?
Free:
/usr/dict/words
Roget's 1911 Thesaurus is available by anonymous FTP from the
Consortium for Lexical Research
clr.nmsu.edu:/CLR/lexica/roget-1911 [128.123.1.12]
It is also available from
src.doc.ic.ac.uk:/literary/collections/project_gutenberg/roget11.txt.Z
An old Webster's dictionary is in /text/dict/{DICT.Z,DICT.INDEX.Z}.
Project Gutenberg, which collects public domain electronic books, also
has Roget's 1911 Thesaurus. The Project Gutenberg archive is at
mrcnext.cso.uiuc.edu:/pub/etext/. For more
information, write to Michael S. Hart, Professor of Electronic Text,
Executive Director of Project Gutenberg Etext, Illinois Benedictine
College, 5700 College Road, Lisle, IL 60532 or send email to
hart@vmd.cso.uiuc.edu.
For people without FTP, Austin Code Works sells floppy disks
containing Roget's 1911 Thesaurus for $40.00. This money helps support
the production of other useful texts, such as the 1913 Webster's dictionary.
The Online Book Initiative maintains a text repository on
ftp.std.com (a public access UNIX system, 617-739-WRLD). See the
README file on obi.std.com:/obi/. For more information, send email to
obi@world.std.com, write to Software Tool & Die, 1330 Beacon Street,
Brookline, MA 02146, or call 617-739-0202.
The CHILDES project at Carnegie Mellon University has a large body of
data on children speaking to adults, as well as the adult written and adult
spoken corpora from the CORNELL project. Contact Brian MacWhinney
<brian@andrew.cmu.edu> for more information.
The Association for Computational Linguistics (ACL) has a Data
Collection Initiative. For more information, contact Donald Walker at
Bellcore, walker@flash.bellcore.com.
Two lists of common first names, female (4967 names) and male
(2924 names), are available by anonymous ftp from
ftp.cs.cmu.edu:/user/ai/areas/nlp/corpora/names/
Read the file README first. Send mail to mkant@cs.cmu.edu for more
information.
A list of 110,000 English words (one per line, in ASCII) is
available in the PD1:<MSDOS.LINGUISTICS> directory on SIMTEL20 as the
files WORDS1.ZIP, WORDS2.ZIP, WORDS3.ZIP, and WORDS4.ZIP. Although the
list is in MS-DOS files, it can easily be used on other machines (but
first you'll have to unzip the files on a DOS machine). The list
includes inflected forms of the words, such as plural nouns and the
-s, -ed, and -ing forms of verbs; thus the number of lexical stems in
the list is considerably smaller than the total number of word forms.
These files are available via FTP from WSMR-SIMTEL20.ARMY.MIL
[192.88.110.20]. SIMTEL20 files are mirrored on wuarchive.wustl.edu.
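The gap between word forms and lexical stems can be illustrated with a
minimal sketch. It assumes only the one-word-per-line ASCII layout described
above. The function names and the suffix-stripping rule are hypothetical,
and the rule is deliberately naive (it is not a real stemmer and is not
part of the SIMTEL20 distribution).

```python
import io

# Inflectional suffixes mentioned above, checked longest first.
SUFFIXES = ("ing", "ed", "s")


def crude_stem(word):
    """Strip at most one suffix. A naive stand-in for a real stemmer."""
    for suffix in SUFFIXES:
        # Keep at least three characters so short words are left alone.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def load_words(lines):
    """Read a one-word-per-line list, skipping blanks and lowercasing."""
    return [line.strip().lower() for line in lines if line.strip()]


def count_forms_and_stems(words):
    """Return (number of distinct word forms, number of distinct crude stems)."""
    forms = set(words)
    stems = {crude_stem(w) for w in forms}
    return len(forms), len(stems)


# A tiny in-memory sample stands in for the unzipped word list.
sample = io.StringIO("walk\nwalks\nwalked\nwalking\ntree\ntrees\n")
forms, stems = count_forms_and_stems(load_words(sample))
print(forms, stems)  # prints "6 2"
```

To run this on a real list, pass an open file object to load_words()
in place of the in-memory sample. Expect the crude rule to over-strip
some words (for example, "bus" becomes "bu"), so treat the stem count as
a rough estimate only.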
The Collins English Dictionary encoded as a Prolog fact base is
available from the Oxford Text Archive by anonymous ftp from
ota.ox.ac.uk:/pub/ota/dicts/1192/ [129.67.1.165]
The Oxford Text Archive includes many other texts, dictionaries,
thesauri, word lists, and so on, most of which are available for
scholarly use and research only. See the files
ota.ox.ac.uk:/pub/ota/textarchive.form
ota.ox.ac.uk:/pub/ota/textarchive.info
ota.ox.ac.uk:/pub/ota/textarchive.list
ota.ox.ac.uk:/pub/ota/textarchive.sgml
for more information, or write to archive@ox.ac.uk, Oxford Text Archive,
Oxford University Computing Services, 13 Banbury Road, Oxford OX2
6NN, UK, call 44-865-273238 or fax 44-865-273275.
Chuck Wooters <wooters@icsi.berkeley.edu> has extracted the most
likely pronunciation for each of about 6100 words in the hand-labeled
TIMIT database, and made them available by anonymous ftp from
ftp.icsi.berkeley.edu:/pub/speech/TIMIT.mostlikely.Z.
A list of homophones from general American English is available by
anonymous ftp from svr-ftp.eng.cam.ac.uk:/comp.speech/data/ as the file
homophones-1.01.txt. To receive the list by email, send mail to
Evan.Antworth@sil.org. The list was compiled by Tony Robinson.
Sigurd P. Crossland <sig@seuss.vantage.gte.com> has been compiling
a dictionary of English words, including most common American words,
abbreviations, hyphenations, and even incorrect spellings. The most
recent version is available by anonymous ftp from
wocket.vantage.gte.com:/pub/standard_dictionary/dic-0394.tar.gz
The tar file includes 31 text files, one for each word length from 2
to 32. The compressed tar file takes up just over 4 MB of space, and
includes approximately 870,000 words.
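A one-file-per-word-length layout like this can be reproduced for any
word list. The following is a rough sketch of that grouping. The function
name is hypothetical, and nothing here reflects the actual file names or
format inside the archive, which are not described above.

```python
from collections import defaultdict


def bucket_by_length(words, min_len=2, max_len=32):
    """Group words by length, keeping only lengths min_len..max_len."""
    buckets = defaultdict(list)
    for word in words:
        if min_len <= len(word) <= max_len:
            buckets[len(word)].append(word)
    # Return a plain dict ordered by length, with each bucket sorted.
    return {n: sorted(ws) for n, ws in sorted(buckets.items())}


# Illustrative input only; "a" is dropped because it is shorter than 2.
words = ["an", "to", "cat", "dog", "a", "tree"]
buckets = bucket_by_length(words)
for length, members in buckets.items():
    print(length, members)
```

With the default bounds of 2 and 32 there are at most 31 buckets, matching
the 31 files mentioned above.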
WordNet is an English lexical reference system based on current
psycholinguistic theories of human lexical memory. It organizes nouns,
verbs and adjectives into synonym sets corresponding to lexical
concepts. The sets are linked by a variety of relations. Besides being
of scientific interest, it makes a handy thesaurus. WordNet is available
by anonymous ftp from
clarity.princeton.edu:/pub/
If you retrieve a copy of WordNet by ftp, please send mail to
wordnet@princeton.edu.
Commercial:
Illumind publishes the Moby Thesaurus (25,000 roots/1.2 million
synonyms), Moby Words (560,000 entries), Moby Hyphenator (155,000
entries), Moby Part-of-Speech (214,000 entries), Moby
Pronunciator (167,000 entries with IPA encoding, syllabification, and
primary, secondary, and tertiary stress marks), and Moby Language
(100,000-word word lists in five major world languages) lexical
databases. All databases are supplied in pure ASCII, royalty-free, in
both Macintosh and MS-DOS disk formats (also in .Z file formats). Both
commercial (to resell derived structures as part of commercial
applications) and educational/research licenses are available. Samples
of each of the lexical databases are available by anonymous ftp from
netcom.com:/pub/grady/Moby_Sampler.tar.Z [192.100.81.100]. For more
information, write to Illumind, Attn: Grady Ward, 3449 Martha Court,
Arcata, CA 95521, call/fax 707-826-7715, or send email to
grady@netcom.com.
The Oxford Text Archive has hundreds of online texts in a wide variety
of languages, including a few dictionaries (the OED, Collins, etc.).
The Lancaster-Oslo-Bergen (LOB), Brown, and London-Lund corpora are also
available from them. For more information, write to Oxford Electronic
Publishing, Oxford University Press, 200 Madison Avenue, New York, NY
10016, call 212-889-0206, or send mail to archive@vax.oxford.ac.uk.
(Their contact information in England is Oxford Text Archive, Oxford
University Computing Service, 13 Banbury Road, Oxford OX2 6NN, UK, +44
(865) 273238.)
Mailing Lists:
CORPORA is a mailing list for Text Corpora. It welcomes information
and questions about text corpora such as availability, aspects of
compiling and using corpora, software, tagging, parsing, and
bibliography. To be added to the list, send a message to
corpora-request@x400.hd.uib.no. Contributions should be sent to
corpora@x400.hd.uib.no.
Linguistic Data Consortium:
The Linguistic Data Consortium was established to broaden the collection
and distribution of speech and natural language databases for the
purposes of research and technology development in automatic speech
recognition, natural language processing, and other areas where large
amounts of linguistic data are needed. Information about the LDC is
available by anonymous ftp from ftp.cis.upenn.edu:/pub/ldc [130.91.6.8].
Documents available in this directory include a paper on the background,
rationale and goals of the LDC, a brief list of available databases,
and some tables summarizing these corpora. For further information,
contact Elizabeth Hodas, <ehodas@walnut.ling.upenn.edu>, Mark Liberman
<myl@unagi.cis.upenn.edu>, or Jack Godfrey <jgodfrey@unagi.cis.upenn.edu>.