Corpora of Computer-Mediated Communication

This webpage provides a select list of corpora of computer-mediated communication, which supplements the following article:

What is Computer-Mediated Communication (CMC)?

Computer-Mediated Communication (CMC) is the research field that explores the social, communicative and linguistic impact of communication technologies, which have continually evolved in connection with the use of computer networks. The main focus of CMC research is on Internet-based technologies and their genres: e-mail, mailing lists, discussion groups (forums and bulletin boards), Internet Relay Chat (IRC) and webchats, Instant Messaging (ICQ, AIM etc.), MUDs, Voice-over-IP applications (Skype etc.), Web-based videoconferencing, weblogs and hypertext (including wikis).

Types and examples of CMC corpora

We differentiate the following types of CMC corpora:

project-related corpora
were compiled as an empirical basis for research questions in a particular project;
corpora for general use
do not directly pertain to a particular project, but were established rather as a data pool for the investigation of diverse potential research questions;
corpora of raw data
have been left in the condition in which they were originally acquired from the Internet;
annotated corpora
contain data that have been subjected to annotation processes (e.g. an SGML/XML-based annotation of data segments that may be relevant for purposes of analysis).
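
To illustrate what an XML-based annotation of data segments can look like in practice, the following Python sketch parses a small annotated chat log and counts messages and annotated segments. The element and attribute names (`chatlog`, `message`, `nickname`, `emoticon`, `action`) are invented for this illustration and are assumptions; they do not reproduce the annotation scheme of any corpus listed on this page.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation scheme: all element and attribute names below are
# invented for illustration and do not come from any corpus listed here.
ANNOTATED_LOG = """
<chatlog id="example-001">
  <message nickname="anna" time="20:15:02">hi all <emoticon>:-)</emoticon></message>
  <message nickname="ben" time="20:15:10">hello anna <action>waves</action></message>
  <message nickname="anna" time="20:15:31">how are you?</message>
</chatlog>
"""


def count_annotations(xml_text):
    """Return (messages per nickname, occurrences per annotation tag)."""
    root = ET.fromstring(xml_text)
    per_nick = {}
    per_tag = {}
    for message in root.iter("message"):
        nick = message.get("nickname")
        per_nick[nick] = per_nick.get(nick, 0) + 1
        # Direct children of a message are the annotated data segments.
        for segment in message:
            per_tag[segment.tag] = per_tag.get(segment.tag, 0) + 1
    return per_nick, per_tag


per_nick, per_tag = count_annotations(ANNOTATED_LOG)
print(per_nick)
print(per_tag)
```

In a corpus of raw data, the same queries would require heuristics over plain text (for example pattern matching for emoticons), which is one reason annotation makes a corpus easier to reuse.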

1) Examples: Project-related corpora of raw data

The CoSy:50 Corpus
(Simeon Yates)
50 submissions from 152 computer conferences (see Yates, Simeon J. 1996, Oral and written linguistic aspects of computer conferencing, in: Herring, Susan C. (ed), Computer-Mediated Communication. Linguistic, Social and Cross-Cultural Perspectives. Amsterdam/Philadelphia, pp. 29-46)
The Swiss German webchat corpora
(Beat Siebenhaar)
(see e.g. Siebenhaar, Beat, Die dialektale Verankerung regionaler Chats in der deutschsprachigen Schweiz, in: Eggers, Eckhard/Stellmacher, Dieter/Schmidt, Jürgen Erich (eds): Tagungsband IGDD-Kongress Marburg. Stuttgart)
The contrastive German-Swedish IRC-Corpus
(Christiane Pankow)
(see e.g. Pankow, Christiane 2003, Zur Darstellung nonverbalen Verhaltens in deutschen und schwedischen IRC-Chats. Eine Korpusuntersuchung, in: Linguistik online 15)

2) Examples: Corpora of raw data for general use

The Netscan Usenet Database
The Enron Email Dataset (> 0.5 million business-related e-mail messages)
The SpamAssassin Public Corpus (approx. 6,000 e-mail messages from the Apache SpamAssassin Project)
Korpus deutschsprachiger Newsgroups (see Feldweg, Helmut/Kibiger, Ralf/Thielen, Christine 1995, Zum Sprachgebrauch in deutschen Newsgruppen, in: Schmitz, Ulrich (ed), Neue Medien (Osnabrücker Beiträge zur Sprachtheorie 50), 143-154)
WWE-2006 weblog dataset: this corpus of weblog posts was temporarily available to the participants of the 3rd Annual Workshop on the Weblogging Ecosystem (see

3) Examples: Project-related annotated corpora

E-mail corpus from the COSMA project (160 e-mail messages with appointment arrangements; see Declerck, Thierry/Klein, Judith 1997, Ein Email-Korpus zur Entwicklung und Evaluierung der Analysekomponente eines Terminvereinbarungssystems)
Website corpus from the Hypnotic project (see Rehm, Georg 2001, Hypertextsorten: Definition, Struktur, Klassifikation)

4) Examples: Annotated corpora for general use

The Düsseldorf CMC Corpus (corpus resource at Düsseldorf University, Dieter Stein; no online access)
The Dortmund Chat Corpus (> 500 annotated chatlogs and a retrieval tool available online)

List compiled by Michael Beißwenger and Angelika Storrer, Technical University of Dortmund.

Last revised: 2008-02-09