The Beldeko corpus: A new resource for investigating L2 German texts written by L1 Dutch students

Date: December 13, 2022

Abstract

In this poster, we present a new learner corpus for investigating German as a foreign language (L2): Beldeko (Belgisches Deutschkorpus). The corpus was created to investigate academic writing in L2 German by advanced learners with Dutch as first language (L1). It contains summaries produced by L1 Dutch writers. Although several L2 German learner corpora are available, most of them are heterogeneous with regard to the learners’ L1s. This means that L1-specific characteristics of L2 German have not received due attention. One exception is the ALeSKo learner corpus (Zinsmeister & Breckle 2012), which consists of two subcorpora: L2 German essays written by L1 Chinese writers and comparable L1 German essays. The largest and best-known German learner corpus to date is the Falko corpus (Reznicek et al. 2012), which was compiled at the Humboldt-Universität zu Berlin. Other corpora with various L1s are the Kommentiertes Lernendenkorpus akademisches Schreiben (KoLaS; Knorr & Andresen 2017), which contains 854 academic texts produced by 233 students in the context of writing consultation given by peer tutors, and the MERLIN corpus (Abel et al. 2014), which contains 2,286 texts produced by learners of Italian, German and Czech taken from written exams of CEFR testing institutions. To date, no L2 German corpus produced by L1 Dutch students is available. The corpus being presented aims to close this gap.

The 301 summaries included in the Beldeko corpus (70,774 tokens) were written by 115 students with L1 Dutch. The texts were collected at Ghent University (in 2013 and 2014) and University College of Ghent (in 2013) as pretests, immediate posttests and delayed posttests in an intervention study on collaborative writing. Of these students, 82 produced three summaries each and 33 produced two summaries each. The task was to write summaries of two popular-science texts (newspaper articles, interviews or websites) about a topic related to language variation in contemporary German (Kiezdeutsch, Mundartdebatte in der Schweiz, Viadrinisch, Varianten-Wörterbuch des Deutschen).

For a research project aimed at investigating cohesive strategies deployed by L2 German writers with L1 Dutch, the corpus was pre-processed and several linguistic annotation layers were added automatically: PoS tags, morphological information, lexemes and universal dependencies. Moreover, a target hypothesis was added manually. In the course of the project, the corpus will be annotated with information about cohesive devices targeting several of the categories described by Halliday and Hasan (1976), starting with co-reference, conjunction and lexical cohesion. These categories are especially interesting in the context of Dutch–German influences, as studies into Dutch–German translations have found that co-reference (in German) shifts to lexical cohesion (in Dutch) (Van de Velde 2011), which Dendooven (2018) explains as the result of language-specific grammatical restrictions on the one hand (e.g., der Stuhl, auf dem er sitzt vs de stoel waarop hij zit) and of language-specific preferences on the other (e.g., man vs je).

An automatic pre-annotation of these categories has been performed with the help of CorZu (Tuggener 2016: co-reference), DiMLex (Scheffler & Stede 2016; Stede 2002: connectives) and GermaNet (Hamp & Feldweg 1997; Henrich & Hinrichs 2010: synonyms, hyponyms and hypernyms). Based on the automated pre-annotation, manual annotations will be conducted using the annotation platform INCEpTION (Klie et al. 2018) and guidelines based on the PDTB-3 scheme (Webber et al. 2019: connectives), the co-reference guidelines developed by Reznicek et al. (2012) and lexical cohesive devices as presented in Tanskanen (2006). These guidelines will be put to the test and possibly revised after a pilot phase, depending on inter-annotator agreement. The poster will introduce the corpus to the research community and show preliminary results of the analysis of cohesion retrieved from the automatic annotation. This includes an analysis of the homogeneity of the corpus to investigate learner-specific use of cohesive devices.
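The abstract does not say which agreement measure will be used in the pilot phase. As an illustration only, the sketch below assumes Cohen's kappa for two annotators who each assign one category label to the same items. The function name, the label names and the toy data are hypothetical and are not taken from the Beldeko project.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: kappa is used here only as one common agreement
    measure; the source does not name the metric actually applied.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy labels for six candidate cohesive devices.
annotator_a = ["coref", "coref", "conj", "lex", "conj", "lex"]
annotator_b = ["coref", "conj", "conj", "lex", "conj", "coref"]
print(round(cohen_kappa(annotator_a, annotator_b), 3))  # 0.5
```

Kappa corrects raw agreement for agreement expected by chance, which matters when one category (for example, co-reference) is much more frequent than the others. With more than two annotators or missing labels, a different measure would be needed.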

References

Abel, A., Wisniewski, K., Nicolas, L., Boyd, A., Hana, J., & Meurers, D. (2014). A trilingual learner corpus illustrating European Reference Levels. Ricognizioni – Rivista di Lingue, Letterature e Culture Moderne, 2(1), 111–126.

Dendooven, F. (2018). Die Übersetzung von Koreferenzmitteln: Eine Studie auf Basis eines deutsch–niederländischen Übersetzungskorpus von Museumstexten [Unpublished master’s thesis]. Ghent University.

Hamp, B., & Feldweg, H. (1997). GermaNet: A lexical-semantic net for German. In Proceedings of the ACL workshop Automatic Information Extraction and Building of Lexical Semantic Resources for NLP Applications (pp. 9–15). Association for Computational Linguistics.

Henrich, V., & Hinrichs, E. (2010). GernEdiT: The GermaNet Editing Tool. In Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC 2010) (pp. 2228–2235). ELRA.

Klie, J. C., Bugert, M., Boullosa, B., de Castilho, R. E., & Gurevych, I. (2018). The INCEpTION platform: Machine-assisted and knowledge-oriented interactive annotation. In Proceedings of the 27th International Conference on Computational Linguistics: System Demonstrations (pp. 5–9). Association for Computational Linguistics.

Knorr, D., & Andresen, M. (2017). Commented Learner Corpus Academic Writing (KoLaS). Archived in Hamburger Zentrum für Sprachkorpora. Version 2.0. http://hdl.handle.net/11022/0000-0001-B732-8.

Reznicek, M., Lüdeling, A., & Schwantuschke, F. (2012). Das Falko-Handbuch: Korpusaufbau und Annotationen: Version 2.01. Humboldt-Universität zu Berlin. Institut für deutsche Sprache und Linguistik - Korpuslinguistik. Retrieved from https://www.linguistik.hu-berlin.de/de/institut/professuren/korpuslinguistik/forschung/falko/Falko-Handbuch_Korpusaufbau%20und%20Annotationen_v2.01

Scheffler, T., & Stede, M. (2016). Adding semantic relations to a large-coverage connective lexicon of German. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 16) (pp. 1008–1013). ELRA.

Stede, M. (2002). DiMLex: A lexical approach to discourse markers. In A. Lenci & V. Di Tomaso (Eds.), Exploring the lexicon: Theory and computation (pp. 1–15). Edizioni dell’Orso.

Tanskanen, S. K. (2006). Collaborating towards coherence: Lexical cohesion in English discourse. John Benjamins. https://doi.org/10.1075/pbns.146

Tuggener, D. (2016). Incremental coreference resolution for German [Doctoral dissertation]. University of Zürich.

Van de Velde, M. (2011). Explizierung und Implizierung im Übersetzungspaar Deutsch–Niederländisch: Eine quantitative Untersuchung. In P. A. Schmitt, S. Herold, & A. Weilandt (Eds.), Translationsforschung (pp. 865–884). Peter Lang.

Webber, B., Prasad, R., Lee, A., & Joshi, A. (2019). The Penn Discourse Treebank 3.0 annotation manual. University of Pennsylvania.

Zinsmeister, H., & Breckle, M. (2012). The ALeSKo learner corpus. Multilingual Corpora and Multilingual Corpus Analysis, 14, 71–96. https://doi.org/10.1075/hsm.14.06zin

Helena Wedig
