Identifying data and data sources

Published

Repositories

Language-dedicated repositories are a great source of data for language research. Below I've included a listing of some of the more commonly used repositories.

Table 1: Data repositories
Resource	Description
BYU corpora	A repository of corpora that includes billions of words of data.
COW (COrpora from the Web)	A collection of linguistically processed gigatoken web corpora
LRE Map	Repository of language resources collected during the submission process for the Language Resources and Evaluation Conference (LREC).
Leipzig Corpora Collection	Corpora in different languages using the same format and comparable sources.
Linguistic Data Consortium	Repository of language corpora
NLTK language data	Repository of corpora and language datasets included with the Python package NLTK.
OPUS - an open source parallel corpus	Repository of translated texts from the web.
TalkBank	Repository of language collections dealing with conversation, acquisition, multilingualism, and clinical contexts.
The Language Archive	Various corpora and language datasets
The Oxford Text Archive (OTA)	A collection of thousands of texts in more than 25 different languages.

Corpora and datasets

Below I've included a listing of corpora and datasets that are available for language research. The list is not exhaustive; it covers a few of the corpora and datasets most commonly used in the field.

Table 2: Corpora and language datasets
Resource	Description
Atari Email Archive	A collection of messages sent at Atari from 1983 to 1992.
CHILDES Treebank	A corpus derived from several corpora from the American English section of CHILDES with the goal of annotating child-directed speech utterance transcriptions with phrase structure tree information.
Cornell Movie-Dialogs Corpus	A corpus containing a large metadata-rich collection of fictional conversations extracted from raw movie scripts.
Corpus Argentino	Corpus of Argentine Spanish
Corpus of Spanish in Southern Arizona	Spanish varieties spoken in Arizona.
Europarl Parallel Corpus	A parallel corpus extracted from the proceedings of the European Parliament from 1996 to 2011.
Google Ngram Viewer	Search interface to n-gram frequencies drawn from the Google Books corpus.
International Corpus of English (ICE)	The International Corpus of English (ICE) began in 1990 with the primary aim of collecting material for comparative studies of English worldwide.
OpenSubtitles2011	A collection of documents from http://www.opensubtitles.org/.
Russian National Corpus	A corpus of modern Russian language incorporating over 300 million words.
The Big Bad NLP Database (Quantum Stat)	A database of NLP datasets.
The Switchboard Dialog Act Corpus	A corpus of 1155 5-minute conversations in American English, comprising 205,000 utterances and 1.4 million words, from the Switchboard corpus of telephone conversations.
LANGSNAP	The aim of this repository is to promote research on the learning of French and Spanish as L2, by making parallel learner corpora for each language freely available to the research community.
Westbury Lab Usenet Corpus	A collection of public USENET postings gathered between October 2005 and January 2011, covering 47,860 English-language, non-binary-file newsgroups (see the list of newsgroups included with the corpus for details).

Aggregated listings

The list of data available for language research is constantly growing, and I've documented only a few of the wide variety of resources. In Table 3 I've included attempts by others to provide a summary of the corpus data and language resources available.

Table 3: Aggregated listings of language corpora and datasets
Resource	Description
CLARIN Reference corpora	The CLARIN infrastructure offers access to 30 reference corpora for 21 languages. Most of the corpora are available through easy-to-use concordancers such as KonText and NoSketch Engine; the reference corpora are also well annotated, typically displaying rich morphosyntactic annotation.
Learner corpora around the world	A listing of learner corpora around the world
Machine Learning Datasets (Papers With Code)	A free and open resource with Machine Learning papers, code, and evaluation tables.
Stanford NLP corpora	Listing of corpora and language resources aimed at the NLP community.
Where can you find language data on the web?	Listing of various corpora and language datasets.
Wordbank	An open database of children's vocabulary development.

Custom-built

Application programming interfaces (APIs)

There are many APIs available for accessing language corpora and datasets. Below I've included a few of the R packages that provide access to these resources.

Table 4: R package APIs to language corpora and datasets
Resource	Description
wordbankr	R package interface to Wordbank, an open repository for developmental vocabulary data.
aRxiv	R package interface to query arXiv, a repository of electronic preprints for computer science, mathematics, physics, quantitative biology, quantitative finance, and statistics.
crminer	R package interface focusing on getting the user full text via the Crossref search API.
dvn	R package interface to access the Dataverse Network APIs.
fulltext	R package interface to query open access journals, such as PLOS.
gutenbergr	R package interface to download and process public domain works from the Project Gutenberg collection.
internetarchive	R package interface to query the Internet Archive.
newsflash	R package interface to query the Internet Archive and GDELT Television Explorer
oai	R package interface to query any OAI-PMH repository, including Zenodo.
rfigshare	R package interface to query the data sharing platform FigShare.
rtweet	R client for interacting with Twitter's APIs
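
Packages like those in Table 4 typically work by turning a user's query into a request URL, sending it to the service, and parsing the response. As a rough sketch of that first step only, the Python snippet below assembles a query for the arXiv API (the service that aRxiv interfaces with). The endpoint and the `search_query`, `start`, and `max_results` parameters are my understanding of the public arXiv API and should be checked against its current documentation. The helper name `build_arxiv_query` is hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

# Public arXiv API endpoint (an assumption; verify against the API docs).
ARXIV_ENDPOINT = "http://export.arxiv.org/api/query"


def build_arxiv_query(terms, start=0, max_results=10):
    """Return a request URL that searches all arXiv fields for `terms`.

    Hypothetical helper for illustration; not part of any package.
    """
    params = {
        "search_query": f"all:{terms}",  # "all:" searches every field
        "start": start,                  # offset of the first result
        "max_results": max_results,      # number of results to return
    }
    # urlencode percent-encodes reserved characters such as ":".
    return f"{ARXIV_ENDPOINT}?{urlencode(params)}"


url = build_arxiv_query("morphology", max_results=5)
print(url)
```

The R packages listed above hide this step, along with the request itself and the parsing of the returned records, behind a single function call.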

Other language resources

Data for language research is not limited to (primary) text sources. Other sources may include processed data from previous research: word lists, linguistic features, etc. Alone or in combination with text sources, these resources can be a rich and viable basis for a research project.

Table 5: Other language resources
Resource	Description
English Lexicon Project	Access to a large set of lexical characteristics, along with behavioral data from visual lexical decision and naming studies.
Grambank	Grambank, the result of a collaboration involving 100+ linguists, examines a range of grammatical phenomena, “from word order to verbal tense, nominal plurals, and many other well-studied comparative linguistic variables.” The project’s dataset, available to download and explore online, spans 195 such features across 2,400+ languages and dialects. For instance, feature GB030 asks, “Is there a gender distinction in independent 3rd person pronouns?”
PHOIBLE	PHOIBLE is a repository of cross-linguistic phonological inventory data, which have been extracted from source documents and tertiary databases and compiled into a single searchable convenience sample. Release 2.0 from 2019 includes 3020 inventories that contain 3183 segment types found in 2186 distinct languages.
The Collective Noun Catalog	A large list of collective nouns compiled by Daniel E. Meyers (Miami University).
The Corpus of Linguistic Acceptability (CoLA)	A corpus that consists of 10657 sentences from 23 linguistics publications, expertly annotated for acceptability (grammaticality) by their original authors.
The Moby lexicon project	Language wordlists and resources from the Moby project.
lingtypology	R package interface that connects with the Glottolog database and provides additional functionality for linguistic mapping.
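
To make the idea of combining a lexical resource with a text source concrete, here is a minimal Python sketch. The `lexicon` dictionary is a toy stand-in for a word list with linguistic features; its contents and the example sentence are invented for illustration and do not come from any of the resources in Table 5.

```python
from collections import Counter

# Toy stand-in for a lexical resource (word -> grammatical category).
# These values are invented for illustration only.
lexicon = {"dog": "noun", "run": "verb", "quick": "adjective"}

# Toy stand-in for a (primary) text source.
text = "the quick dog will run and the dog will run again"

# Count how often each token occurs in the text.
counts = Counter(text.split())

# Keep only words found in the lexicon and attach their category.
combined = {
    word: {"count": n, "category": lexicon[word]}
    for word, n in counts.items()
    if word in lexicon
}

for word, info in combined.items():
    print(word, info["count"], info["category"])
```

With a real resource the lexicon would be read from a file, but the join on the word form would work the same way.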