06. Identifying data and data sources
This guide outlines key data and dataset resources that will be useful for your own text analysis projects. It is not exhaustive, but it provides a good starting point for exploring different types of data and data sources.
Introduction
Data sharing platforms
There are many data sharing platforms that include various types of research materials, often including datasets.
Platform | Description | URL |
---|---|---|
Dataverse | Dataverse is an open-source web application to share, preserve, cite, explore, and analyze research data. | https://dataverse.org/ |
Figshare | Figshare is a repository where users can make all of their research outputs available in a citable, shareable, and discoverable manner. | https://figshare.com/ |
Zenodo | Zenodo is a general-purpose open-access repository developed under the European OpenAIRE program and operated by CERN. | https://zenodo.org/ |
Dryad | Dryad is a curated general-purpose repository that makes the data underlying scientific publications discoverable, freely reusable, and citable. | https://datadryad.org/ |
Open Science Framework | The Open Science Framework (OSF) is a free, open-source web application built to help researchers manage their workflows. | https://osf.io/ |
Language data repositories
Platform | Description | URL |
---|---|---|
Linguistic Data Consortium | The Linguistic Data Consortium is an open consortium of universities, companies, and government research laboratories. | https://www.ldc.upenn.edu/ |
Open Language Archives Community | The Open Language Archives Community (OLAC) is an international partnership of institutions and individuals who are creating a worldwide virtual library of language resources. | http://www.language-archives.org/ |
The Language Archive | The Language Archive, hosted at the Max Planck Institute for Psycholinguistics, is a digital repository for language resources. | https://tla.mpi.nl/ |
The Language Bank | The Language Bank (Språkbanken) is a Swedish digital repository for language resources. | https://www.sprakbanken.se/ |
TalkBank | TalkBank is a system for sharing and studying conversational interactions. | https://talkbank.org/ |
Oxford Text Archive | The Oxford Text Archive develops, collects, catalogues, and preserves electronic literary and linguistic resources. | https://ota.bodleian.ox.ac.uk/ |
Developed corpora
Platform | Description | URL |
---|---|---|
British National Corpus | The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken language from a wide range of sources. | https://www.english-corpora.org/bnc/ |
American National Corpus | The American National Corpus (ANC) is a text corpus of American English. | https://anc.org/ |
Corpus of Contemporary American English | The Corpus of Contemporary American English (COCA) is described by its maintainers as the largest freely available corpus of English and the only large, balanced corpus of American English. | https://www.english-corpora.org/coca/ |
Data sources in the textbook and resources
Textbook
Dataset | Location(s) | Description | URL |
---|---|---|---|
masc | Ch. 2 and 8 | The masc dataset is drawn from the Manually Annotated Sub-Corpus (MASC) of the American National Corpus. | https://anc.org/data/masc/ |
belc | Ch. 3 | The belc dataset is acquired from the TalkBank repository. It contains the results of a study on the use of English as a second language. Only the written portion is used. | https://talkbank.org/ |
cedel2 | Ch. 5 and 9 | A corpus of Spanish as a second language. | http://cedel2.learnercorpora.com/ |
swda | Ch. 5 | The Switchboard Dialog Act Corpus (SWDA) is a corpus of telephone conversations. | https://catalog.ldc.upenn.edu/docs/LDC97S62/ |
cabnc | Ch. 5 and 6 | The spoken portion of the British National Corpus. It is available through TalkBank. | https://ca.talkbank.org/access/CABNC.html |
europarl | Ch. 6 and 7 | The Europarl Parallel Corpus is a parallel corpus of the European Parliament proceedings. | https://www.statmt.org/europarl/ |
enntt | Ch. 6 and 7 | The Europarl Corpus of Native, Non-Native, and Translated Texts (ENNTT) is a corpus of European Parliament proceedings comprising native, non-native, and translated texts. | https://github.com/senisioi/enntt-release |
dative | Ch. 10 | The dative dataset from the {languageR} package contains the results of a study on the use of dative constructions in English. | https://cran.r-project.org/web/packages/languageR/languageR.pdf |