The Beldeko corpus as a resource to investigate cohesion in German learner language: A preliminary analysis of corpus homogeneity


Abstract

We present Beldeko (Belgisches Deutschkorpus), a new learner corpus for the investigation of German as a foreign language (L2). Beldeko contains 301 summaries (70,774 tokens) produced by writers with Dutch as their first language (L1). The corpus was created to investigate novice academic writing in L2 German, more specifically the characteristics of academic writing by writers with L1 Dutch, a focus that is new in learner corpus research. In this presentation, we focus on the representativeness of Beldeko as a German learner corpus, especially with regard to selected cohesive devices, which are central to the advanced communicative competence of language learners.

To assess whether Beldeko is a representative resource for investigating cohesion in German learner language, its level of homogeneity first needs to be established. The texts of the corpus were written by 115 students majoring in German (CEFR level B2-C1). Under test conditions, they produced summaries of two popular science texts about language variation in contemporary German. With regard to corpus homogeneity, two hypotheses were investigated: (1) Given the writers’ common linguistic background and similar overall proficiency level, we hypothesized that the texts would show similar distributions and frequencies on annotation layers containing general linguistic information, e.g. part-of-speech (POS) tags. (2) Since the use of cohesive devices is strongly related to individual writing style and vocabulary size, we expected greater heterogeneity in the corpus with regard to these elements. The first condition in particular needs to be met to ensure that the corpus is balanced enough to serve as a resource for the analysis of cohesive devices.

For the statistical analysis of the corpus, the data were pre-processed and several linguistic annotation layers were added automatically (e.g. POS tags). Furthermore, we used an online tool for automated text analysis (CTAP: Weiss & Meurers, 2019) to investigate the distribution of cohesive devices, such as connectives. Subsequently, a descriptive statistical analysis was performed in R. The findings reveal a rather homogeneous picture of the corpus on the overall grammatical level: the texts show similar frequencies of POS tags. In contrast, the texts show a heterogeneous distribution of connectives. In conclusion, the results confirm that the corpus is suitable for the analysis of German learner language, more specifically for investigating the use of cohesive devices by advanced learners of German.
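The kind of comparison described above can be illustrated with a minimal sketch. It is not the procedure used for Beldeko, which relied on CTAP and R; it is a Python illustration under several assumptions. The coefficient of variation is assumed here as the dispersion measure, the STTS tag KON (coordinating conjunction) is assumed as a rough stand-in for connectives, and the three tagged texts are invented toy data. All function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def relative_frequency(tags, target):
    """Occurrences of `target` per 100 tokens in one tagged text."""
    if not tags:
        return 0.0
    return 100.0 * Counter(tags)[target] / len(tags)


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0.0 if the mean is 0)."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0


def cv_by_tag(texts, targets):
    """Dispersion of each target tag's relative frequency across all texts."""
    return {
        target: coefficient_of_variation(
            [relative_frequency(tags, target) for tags in texts.values()]
        )
        for target in targets
    }


# Invented toy data (NOT taken from Beldeko): one list of STTS tags per text.
texts = {
    "text_a": ["ART", "NN", "VVFIN", "ART", "NN", "KON", "ART", "NN", "VVFIN", "NN"],
    "text_b": ["NN", "KON", "NN", "VVFIN", "KON", "ART", "NN", "KON", "NN", "VVFIN"],
    "text_c": ["ART", "NN", "VVFIN", "ADV", "ART", "NN", "VVFIN", "ART", "NN", "NN"],
}

# Assumption: NN represents a general grammatical layer, KON a connective proxy.
cv = cv_by_tag(texts, ["NN", "KON"])
for tag, value in cv.items():
    print(f"{tag}: coefficient of variation = {value:.2f}")
```

In this toy setting, a low value for a general POS tag combined with a high value for the connective proxy would correspond to the pattern reported above: homogeneity on the grammatical level and heterogeneity in the use of connectives.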

References

Strobl, C. (2020). Beldeko Summary Corpus v1.0.0. Eurac Research CLARIN Centre. http://hdl.handle.net/20.500.12124/15

Weiss, Z., & Meurers, D. (2019). Broad linguistic modeling is beneficial for German L2 proficiency assessment. In Widening the Scope of Learner Corpus Research, Selected Papers from the Fourth Learner Corpus Research Conference (pp. 419-435). Presses Universitaires de Louvain.