Published
Repositories
Language-dedicated repositories are a great source of data for language research. Below I've included a listing of some of the more commonly used repositories.
| Resource | Description |
|---|---|
| BYU corpora | A repository of corpora that includes billions of words of data. |
| COW (COrpora from the Web) | A collection of linguistically processed gigatoken web corpora |
| LRE Map | Repository of language resources collected during the submission process for the Language Resource and Evaluation Conference (LREC). |
| Leipzig Corpora Collection | Corpora in different languages using the same format and comparable sources. |
| Linguistic Data Consortium | Repository of language corpora |
| NLTK language data | Repository of corpora and language datasets included with the Python package NLTK. |
| OPUS - an open source parallel corpus | Repository of translated texts from the web. |
| TalkBank | Repository of language collections dealing with conversation, acquisition, multilingualism, and clinical contexts. |
| The Language Archive | Various corpora and language datasets |
| The Oxford Text Archive (OTA) | A collection of thousands of texts in more than 25 different languages. |
Corpora and datasets
Below I've included a listing of corpora and datasets that are available for language research. This list is not exhaustive, but includes a few of the more common corpora and datasets used in language research.
| Resource | Description |
|---|---|
| Atari Email Archive | A collection of messages sent at Atari from 1983 to 1992. |
| CHILDES Treebank | A corpus derived from several corpora in the American English section of CHILDES, with the goal of annotating transcriptions of child-directed speech utterances with phrase structure tree information. |
| Cornell Movie-Dialogs Corpus | A corpus containing a large metadata-rich collection of fictional conversations extracted from raw movie scripts. |
| Corpus Argentino | Corpus of Argentine Spanish |
| Corpus of Spanish in Southern Arizona | Spanish varieties spoken in Arizona. |
| Europarl Parallel Corpus | A parallel corpus extracted from the proceedings of the European Parliament from 1996 to 2011. |
| Google Ngram Viewer | Search interface for word and phrase frequencies over time in the Google Books corpus. |
| International Corpus of English (ICE) | The International Corpus of English (ICE) began in 1990 with the primary aim of collecting material for comparative studies of English worldwide. |
| OpenSubtitles2011 | A collection of documents from http://www.opensubtitles.org/. |
| Russian National Corpus | A corpus of modern Russian language incorporating over 300 million words. |
| The Big Bad NLP Database - Quantum Stat | NLP datasets |
| The Switchboard Dialog Act Corpus | A corpus of 1155 5-minute conversations in American English, comprising 205,000 utterances and 1.4 million words, from the Switchboard corpus of telephone conversations. |
| LANGSNAP | The aim of this repository is to promote research on the learning of French and Spanish as L2, by making parallel learner corpora for each language freely available to the research community. |
| Westbury Lab Usenet Corpus | A collection of public USENET postings gathered between Oct 2005 and Jan 2011, covering 47,860 English-language, non-binary-file newsgroups (see the list of newsgroups included with the corpus for details). |
Aggregated listings
The list of data available for language research is constantly growing, and I've documented only a few of the wide variety of resources. In Table 3 I've included attempts by others to summarize the corpus data and language resources available.
| Resource | Description |
|---|---|
| CLARIN Reference corpora | The CLARIN infrastructure offers access to 30 reference corpora for 21 languages. Most of the corpora are available through easy-to-use concordancers such as KonText and NoSketch Engine; the reference corpora are also well annotated, typically displaying rich morphosyntactic annotation. |
| Learner corpora around the world | A listing of learner corpora around the world |
| Machine Learning Datasets (Papers With Code) | A free and open resource with Machine Learning papers, code, and evaluation tables. |
| Stanford NLP corpora | Listing of corpora and language resources aimed at the NLP community. |
| Where can you find language data on the web? | Listing of various corpora and language datasets. |
| Wordbank | An open database of children's vocabulary development. |
Custom-built
Application programming interfaces (APIs)
There are many APIs available for accessing language corpora and datasets. Below I've included a few of the R packages that provide access to these resources.
| Resource | Description |
|---|---|
| wordbankr | R package interface to Wordbank, an open repository for developmental vocabulary data. |
| aRxiv | R package interface to query arXiv, a repository of electronic preprints for computer science, mathematics, physics, quantitative biology, quantitative finance, and statistics. |
| crminer | R package interface focusing on getting the user full text via the Crossref search API. |
| dvn | R package interface to access the Dataverse Network APIs. |
| fulltext | R package interface to query open access journals, such as PLOS. |
| gutenbergr | R package interface to download and process public domain works from the Project Gutenberg collection. |
| internetarchive | R package interface to query the Internet Archive. |
| newsflash | R package interface to query the Internet Archive and GDELT Television Explorer. |
| oai | R package interface to query any OAI-PMH repository, including Zenodo. |
| rfigshare | R package interface to query the data sharing platform FigShare. |
| rtweet | R client for interacting with Twitter's APIs. |
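
Packages like these typically do two things: build a query URL for a web API and convert the structured response into a table-like object. The following is a minimal, language-neutral sketch of that pattern in Python, not the implementation of any package listed above. It assumes arXiv's public query endpoint and its Atom-formatted responses; the helper names (`build_query_url`, `parse_feed`) and the sample feed are hypothetical and made up for illustration, so no network access is needed to run it.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# XML namespace used by Atom feeds
ATOM = "{http://www.w3.org/2005/Atom}"


def build_query_url(terms, max_results=5):
    """Build a query URL for the arXiv API (endpoint is an assumption)."""
    params = {
        "search_query": f"all:{terms}",
        "start": 0,
        "max_results": max_results,
    }
    return "http://export.arxiv.org/api/query?" + urlencode(params)


def parse_feed(xml_text):
    """Turn an Atom feed into a list of {'title', 'summary'} records."""
    root = ET.fromstring(xml_text)
    records = []
    for entry in root.iter(f"{ATOM}entry"):
        records.append({
            "title": entry.findtext(f"{ATOM}title", default="").strip(),
            "summary": entry.findtext(f"{ATOM}summary", default="").strip(),
        })
    return records


# Hypothetical response body, invented for this example
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <entry>
    <title>An invented paper on corpus linguistics</title>
    <summary>Placeholder abstract text.</summary>
  </entry>
  <entry>
    <title>Another invented paper</title>
    <summary>More placeholder text.</summary>
  </entry>
</feed>"""

url = build_query_url("corpus linguistics", max_results=2)
records = parse_feed(SAMPLE_FEED)
for record in records:
    print(record["title"])
```

In practice the response would be fetched from `url` rather than read from a string, and an R package would return the records as a data frame.
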
Other language resources
Data for language research is not limited to (primary) text sources. Other sources may include processed data from previous research: word lists, linguistic features, etc. Alone or in combination with text sources, this data can be a rich and viable source of data for a research project.
| Resource | Description |
|---|---|
| English Lexicon Project | Access to a large set of lexical characteristics, along with behavioral data from visual lexical decision and naming studies. |
| Grambank | Grambank, the result of a collaboration involving 100+ linguists, examines a range of grammatical phenomena, “from word order to verbal tense, nominal plurals, and many other well-studied comparative linguistic variables.” The project’s dataset, available to download and explore online, spans 195 such features across 2,400+ languages and dialects. For instance, feature GB030 asks, “Is there a gender distinction in independent 3rd person pronouns?” |
| PHOIBLE | PHOIBLE is a repository of cross-linguistic phonological inventory data, which have been extracted from source documents and tertiary databases and compiled into a single searchable convenience sample. Release 2.0 from 2019 includes 3020 inventories that contain 3183 segment types found in 2186 distinct languages. |
| The Collective Noun Catalog | Daniel E. Meyers (Miami University) has put together a large list of collective nouns. |
| The Corpus of Linguistic Acceptability (CoLA) | A corpus that consists of 10657 sentences from 23 linguistics publications, expertly annotated for acceptability (grammaticality) by their original authors. |
| The Moby lexicon project | Language wordlists and resources from the Moby project. |
| lingtypology | R package interface that connects with the Glottolog database and provides additional functionality for linguistic mapping. |
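
To illustrate how a word-level resource can be combined with a text source, here is a minimal Python sketch that counts the words in a short text and attaches a lexical measure to each one. The `LEXICON` values, the `rt_ms` field, and the helper names are hypothetical and invented for this example. They are not drawn from the English Lexicon Project or any other resource listed above.

```python
import re
from collections import Counter

# Hypothetical word list: word -> mean response time in ms (invented values)
LEXICON = {"the": 520.0, "cat": 560.0, "sat": 585.0}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def annotate(text, lexicon):
    """Count each word in the text and attach its lexicon value, if any."""
    counts = Counter(tokenize(text))
    return [
        {"word": word, "count": n, "rt_ms": lexicon.get(word)}
        for word, n in sorted(counts.items())
    ]


rows = annotate("The cat sat on the mat.", LEXICON)
for row in rows:
    print(row)
```

Words missing from the word list get `None` for the measure, which makes coverage gaps between the two sources easy to spot before analysis.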