06. Identifying data and data sources

guides

This guide will outline some key data and dataset resources that will be usefull for your own text analysis projects. This guide is not exhaustive, but it will provide you with a good starting point for you to start to explore different types of data and data sources.

Outcomes

Recognize the difference between various sources of data and datasets.
Identify data and/or datasets that are relevant to your research project.
Locate and access data and/datasets used in the textbook and resources.

Introduction

Finding data can be a challenging task, especially when you are looking for data that is relevant to your research project. Ideally you begin your data search with a research question in hand. This will allow you to vet sources as you encounter them. In other cases, however, you may begin to peruse available resources and brainstorm potential research questions that a given resource may support.

In either case, it is helpful to have a place to start. Below I’ve included various sources and data/datasets that will help you kickstart your exploration. There is a vast world of data available on the web and this list is by no means exhaustive.

Sources

Data sharing platforms

There are many data sharing platforms that include various types of research materials, often including datasets.

Platform	Description	URL
Dataverse	Dataverse is an open-source web application to share, preserve, cite, explore, and analyze research data.	https://dataverse.org/
Figshare	Figshare is a repository where users can make all of their research outputs available in a citable, shareable, and discoverable manner.	https://figshare.com/
Zenodo	Zenodo is a general-purpose open-access repository developed under the European OpenAIRE program and operated by CERN.	https://zenodo.org/
Dryad	Dryad is a curated general-purpose repository that makes the data underlying scientific publications discoverable, freely reusable, and citable.	https://datadryad.org/
Open Science Framework	The Open Science Framework (OSF) is a free, open-source web application built to help researchers manage their workflows.	https://osf.io/

Language data repositories

Platform	Description	URL
Linguistic Data Consortium	The Linguistic Data Consortium is an open consortium of universities, companies, and government research laboratories.	https://www.ldc.upenn.edu/
Open Language Archives Community	The Open Language Archives Community (OLAC) is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources.	http://www.language-archives.org/
The Language Archive	The Language Archive is a digital repository for language resources.	https://tla.mpi.nl/
The Language Bank	The Language Bank is a digital repository for language resources.	https://www.sprakbanken.se/
TalkBank	TalkBank is a system for sharing and studying conversational interactions.	https://talkbank.org/
Oxford Text Archive	The Oxford Text Archive develops, collects, catalogues, and preserves electronic literary and linguistic resources.	https://ota.bodleian.ox.ac.uk/

Developed corpora

Platform	Description	URL
British National Corpus	The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken language from a wide range of sources.	https://www.english-corpora.org/bnc/
American National Corpus	The American National Corpus (ANC) is a text corpus of American English.	https://anc.org/
Corpus of Contemporary American English	The Corpus of Contemporary American English (COCA) is the largest freely-available corpus of English, and the only large and balanced corpus of American English.	https://www.english-corpora.org/coca/

Referenced datasets

Textbook

Dataset	Location(s)	Description	URL
`masc`	Ch. 2 and 8	The `masc` dataset is drawn from the Manually Annotated Sub-Corpus (MASC) of the American National Corpus.	https://anc.org/data/masc/
`belc`	Ch. 3	The `belc` dataset is acquired from the TalkBank repository. It is a dataset that contains the results of a study on the use of English as a second language. On the written portion is used.	https://talkbank.org/
`cedel2`	Ch. 5 and 9	A corpus of Spanish as a second language. This dataset appears in chapter 5.	http://cedel2.learnercorpora.com/
`swda`	Ch. 5 and 10	The Switchboard Dialog Act Corpus (SWDA) is a corpus of telephone conversations.	https://catalog.ldc.upenn.edu/docs/LDC97S62/
`cabnc`	Ch. 5 and 6	The spoken portion of the British National Corpus. It is available through Talkbank.	https://ca.talkbank.org/access/CABNC.html
`europarl`	Ch. 6 and 7	The Europarl Parallel Corpus is a parallel corpus of the European Parliament proceedings.	https://www.statmt.org/europarl/
`enntt`	Ch. 6 and 7	The Europarl Corpus of Native and Non-Native and Translated Texts (ENNTT) is a parallel corpus of the European Parliament proceedings.	https://github.com/senisioi/enntt-release
`dative`	Ch. 10	The `dative` from the {languageR} package is a dataset that contains the results of a study on the use of dative constructions in English.	https://cran.r-project.org/web/packages/languageR/languageR.pdf