Swiss-AL: Platform for Language Data in Applied Sciences
On Challenges in the Field of Language Open Research Data
DOI:
https://doi.org/10.52825/cordi.v1i.249
Keywords:
Language Data, Corpus Linguistics, Interdisciplinarity
Abstract
Open Science is transforming the way researchers collect, process, analyze, and store empirical research data, particularly in the social sciences and humanities, where language data is crucial. This transformation especially concerns developers and providers of large language corpora and manifests itself in at least three challenges when providing these corpora as Open Research Data (ORD). These challenges concern the heterogeneous practices that researchers apply when working with language data, the research data lifecycle, and legal and ethical aspects. In this paper, we present Swiss-AL, a language data platform developed in Switzerland that is being transformed into an Open Research Data resource for Applied Sciences within the Swiss Open Science Strategy. The paper gives an overview of the data contained in Swiss-AL and the infrastructure used to process and analyze the data. Furthermore, it presents approaches to the three challenges for language ORD named above.
License
Copyright (c) 2023 Julia Krasselt, Philipp Dreesen, Peter Stücheli-Herlach
This work is licensed under a Creative Commons Attribution 4.0 International License.
Accepted 2023-06-30
Published 2023-09-07