06. Identifying data and data sources
This guide outlines some key data and dataset resources that will be useful for your own text analysis projects. It is not exhaustive, but it provides a good starting point for exploring different types of data and data sources.
Introduction
Finding data can be challenging, especially when you need data that is relevant to your research project. Ideally, you begin your data search with a research question in hand, which allows you to vet sources as you encounter them. In other cases, however, you may begin by perusing available resources and brainstorming research questions that a given resource could support.
In either case, it is helpful to have a place to start. Below I’ve included various sources and data/datasets that will help you kickstart your exploration. There is a vast world of data available on the web and this list is by no means exhaustive.
Sources
Data sharing platforms
There are many data sharing platforms that include various types of research materials, often including datasets.
Platform | Description | URL |
---|---|---|
Dataverse | Dataverse is an open-source web application to share, preserve, cite, explore, and analyze research data. | https://dataverse.org/ |
Figshare | Figshare is a repository where users can make all of their research outputs available in a citable, shareable, and discoverable manner. | https://figshare.com/ |
Zenodo | Zenodo is a general-purpose open-access repository developed under the European OpenAIRE program and operated by CERN. | https://zenodo.org/ |
Dryad | Dryad is a curated general-purpose repository that makes the data underlying scientific publications discoverable, freely reusable, and citable. | https://datadryad.org/ |
Open Science Framework | The Open Science Framework (OSF) is a free, open-source web application built to help researchers manage their workflows. | https://osf.io/ |
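
Some of these platforms can also be searched programmatically. The following is a minimal sketch, not a definitive recipe: it assumes Zenodo's public records search endpoint at `https://zenodo.org/api/records` and its `q`, `type`, and `size` query parameters. The helper name `build_zenodo_search_url` is hypothetical, and the code only builds the search URL; it does not send a request.

```python
from urllib.parse import urlencode

# Assumed endpoint: Zenodo's public records search API.
ZENODO_RECORDS_API = "https://zenodo.org/api/records"


def build_zenodo_search_url(query, record_type="dataset", size=5):
    """Build a Zenodo search URL (hypothetical helper; no request is sent)."""
    # q: free-text search terms; type: kind of record; size: results per page.
    params = {"q": query, "type": record_type, "size": size}
    return f"{ZENODO_RECORDS_API}?{urlencode(params)}"


if __name__ == "__main__":
    # Example search terms; replace them with terms from your research question.
    print(build_zenodo_search_url("learner corpus"))
```

Pasting the resulting URL into a browser is a quick way to check whether a platform holds anything relevant before you invest time in it.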
Language data repositories
Platform | Description | URL |
---|---|---|
Linguistic Data Consortium | The Linguistic Data Consortium is an open consortium of universities, companies, and government research laboratories. | https://www.ldc.upenn.edu/ |
Open Language Archives Community | The Open Language Archives Community (OLAC) is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources. | http://www.language-archives.org/ |
The Language Archive | The Language Archive is a digital repository for language resources. | https://tla.mpi.nl/ |
The Language Bank | The Language Bank is a digital repository for language resources. | https://www.sprakbanken.se/ |
TalkBank | TalkBank is a system for sharing and studying conversational interactions. | https://talkbank.org/ |
Oxford Text Archive | The Oxford Text Archive develops, collects, catalogues, and preserves electronic literary and linguistic resources. | https://ota.bodleian.ox.ac.uk/ |
Developed corpora
Platform | Description | URL |
---|---|---|
British National Corpus | The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken language from a wide range of sources. | https://www.english-corpora.org/bnc/ |
American National Corpus | The American National Corpus (ANC) is a text corpus of American English. | https://anc.org/ |
Corpus of Contemporary American English | The Corpus of Contemporary American English (COCA) is the largest freely available corpus of English and the only large, balanced corpus of American English. | https://www.english-corpora.org/coca/ |
Referenced datasets
Textbook
Dataset | Location(s) | Description | URL |
---|---|---|---|
masc | Ch. 2 and 8 | The masc dataset is drawn from the Manually Annotated Sub-Corpus (MASC) of the American National Corpus. | https://anc.org/data/masc/ |
belc | Ch. 3 | The belc dataset is acquired from the TalkBank repository. It contains the results of a study on the use of English as a second language. Only the written portion is used. | https://talkbank.org/ |
cedel2 | Ch. 5 and 9 | A corpus of Spanish as a second language. | http://cedel2.learnercorpora.com/ |
swda | Ch. 5 and 10 | The Switchboard Dialog Act Corpus (SWDA) is a corpus of telephone conversations. | https://catalog.ldc.upenn.edu/docs/LDC97S62/ |
cabnc | Ch. 5 and 6 | The spoken portion of the British National Corpus. It is available through TalkBank. | https://ca.talkbank.org/access/CABNC.html |
europarl | Ch. 6 and 7 | The Europarl Parallel Corpus is a parallel corpus of the European Parliament proceedings. | https://www.statmt.org/europarl/ |
enntt | Ch. 6 and 7 | The Europarl Corpus of Native, Non-Native, and Translated Texts (ENNTT) is a corpus of native, non-native, and translated texts drawn from the European Parliament proceedings. | https://github.com/senisioi/enntt-release |
dative | Ch. 10 | The dative dataset from the {languageR} package contains the results of a study on the use of dative constructions in English. | https://cran.r-project.org/web/packages/languageR/languageR.pdf |
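
If you want to look up which datasets accompany a given chapter, the table above can be kept as a small lookup structure. This is only an illustrative sketch: the chapter numbers are copied from the Location(s) column, while the names `DATASET_CHAPTERS` and `datasets_in_chapter` are hypothetical and not part of the textbook's materials.

```python
# Chapter locations copied from the Location(s) column of the table above.
DATASET_CHAPTERS = {
    "masc": [2, 8],
    "belc": [3],
    "cedel2": [5, 9],
    "swda": [5, 10],
    "cabnc": [5, 6],
    "europarl": [6, 7],
    "enntt": [6, 7],
    "dative": [10],
}


def datasets_in_chapter(chapter):
    """Return the names of the datasets listed for a chapter, sorted alphabetically."""
    return sorted(
        name for name, chapters in DATASET_CHAPTERS.items() if chapter in chapters
    )


if __name__ == "__main__":
    # Print the datasets for a few chapters of interest.
    for chapter in (5, 6, 10):
        print(chapter, datasets_in_chapter(chapter))
```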