Swiss-AL: Platform for Language Data in Applied Sciences: On Challenges in the Field of Language Open Research Data

Julia Krasselt; Philipp Dreesen; Peter Stücheli-Herlach; Dolores Lemmenmeier; Sooyeon Cho; Klaus Rothenhäusler; Matthias Fluor

doi:10.52825/cordi.v1i.249

Authors

Julia Krasselt ZHAW Zurich University of Applied Sciences https://orcid.org/0000-0003-1060-2657
Philipp Dreesen ZHAW Zurich University of Applied Sciences https://orcid.org/0000-0001-5291-2798
Peter Stücheli-Herlach ZHAW Zurich University of Applied Sciences
Dolores Lemmenmeier ZHAW Zurich University of Applied Sciences https://orcid.org/0000-0003-0541-6956
Sooyeon Cho ZHAW Zurich University of Applied Sciences https://orcid.org/0009-0005-4172-7008
Klaus Rothenhäusler ZHAW Zurich University of Applied Sciences https://orcid.org/0000-0003-4744-3362
Matthias Fluor ZHAW Zurich University of Applied Sciences

DOI:

https://doi.org/10.52825/cordi.v1i.249

Keywords:

Language Data, Corpus Linguistics, Interdisciplinarity

Abstract

Open Science is transforming the way researchers collect, process, analyze, and store empirical research data, particularly in the social sciences and humanities, where language data is crucial. This transformation especially concerns developers and providers of large language corpora, who face at least three challenges when providing these corpora as Open Research Data (ORD): the heterogeneous practices that researchers apply when working with language data, the research data lifecycle, and legal and ethical aspects. In this paper, we present Swiss-AL, a language data platform developed in Switzerland that is being transformed into an Open Research Data resource for the applied sciences within the Swiss Open Science Strategy. The paper gives an overview of the data contained in Swiss-AL and of the infrastructure used to process and analyze that data. Furthermore, it presents approaches to the three above-mentioned challenges to language ORD.
