The German Summary corpus (GerSumCo): A new resource for contrastive research into L2 German of advanced writers
Abstract
The German Summary corpus (GerSumCo) is a new corpus for contrastive research into German as a second language (L2) vs. as a first language (L1). GerSumCo was created to investigate cohesion in academic L2 German writing produced by advanced learners. Several corpora are already available for the contrastive investigation of German learner language, targeting diverse acquisition levels, text types and L1 backgrounds (e.g., KoLaS: Knorr & Andresen, 2017; Falko: Lüdeling et al., 2008). However, although summary writing is an interesting genre for the analysis of cohesion (as seen in Walter, 2007), the only existing corpus of summaries to date is the Falko summary subcorpus (Lüdeling et al., 2008). Preliminary analyses of the Falko summary L2 subcorpus revealed a high degree of patchwriting, i.e., students copying larger chunks of text verbatim from the original text. Since this creates a bias in the data, we decided to compile a new summary corpus. The specificity of our corpus is twofold: First, each summary draws on two different source texts, so students needed to create their own coherent flow, which diminishes the problem of patchwriting. Second, all summaries were written based on the same source texts and under comparable conditions: all students had to write a summary of two popular science texts about a topic related to language variation in contemporary German (e.g., Kiezdeutsch, Mundartdebatte in der Schweiz).
To date, GerSumCo consists of 89 summaries, written by 42 L2 German students with diverse L1s and 47 L1 German students, and the corpus is still growing. The texts were collected at several German universities during the academic year 2022–23. For a research project investigating the cohesive strategies deployed by L1 and L2 German writers, the corpus was pre-processed and general linguistic information (e.g., part-of-speech tags) was added automatically. The first analysis of the corpus focuses on connectives, a well-researched cohesive device in learner language. Connectives were pre-annotated automatically via DiMLex (Scheffler & Stede, 2016; Stede, 2002), and the pre-annotation was then corrected manually by three trained annotators using guidelines based on the PDTB-3 scheme (Webber et al., 2019). The poster will introduce the corpus to the research community and present first results of the contrastive analysis of connectives.
References
Knorr, D., & Andresen, M. (2017). Commented Learner Corpus Academic Writing (KoLaS). Hamburger Zentrum für Sprachkorpora. http://hdl.handle.net/11022/0000-0001-B732-8.
Lüdeling, A., Doolittle, S., Hirschmann, H., Schmidt, K., & Walter, M. (2008). Das Lernerkorpus Falko. Deutsch als Fremdsprache, 2, 67–73. https://doi.org/10.37307/j.2198-2430.2008.02.02
Scheffler, T., & Stede, M. (2016). Adding semantic relations to a large-coverage connective lexicon of German. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 16) (pp. 1008–1013). ELRA.
Stede, M. (2002). DiMLex: A lexical approach to discourse markers. In A. Lenci & V. Di Tomaso (Eds.), Exploring the lexicon: Theory and computation (pp. 1–15). Edizioni dell’Orso.
Walter, M. (2006). Hier wird die Wahl schwer, aber entscheidend. Konnektorenkontraste im Deutschen. Österreichisches Jahrbuch Deutsch als Fremdsprache 2006.
Webber, B., Prasad, R., Lee, A., & Joshi, A. (2019). The Penn Discourse Treebank 3.0 Annotation Manual. University of Pennsylvania.
