National basic, applied and postdoctoral projects

Resources, tools and methods for the analysis of non-standard Slovene

COORDINATOR: dr. D. Fišer
DURATION: 01.07.14 - 30.06.17
 

Web page (Slovene): http://nl.ijs.si/janes/

The past decade has witnessed rapid growth of user-generated content, such as blogs, forums and social media. Such content is an important source of information for diverse fields, such as the social sciences, economics and computer science, in both research and business. However, when dealing with user-generated content, it is necessary to come to grips with the language of computer-mediated communication, which, owing to its social and technical characteristics, often differs markedly from standard Slovene. This language is characterized by colloquialisms and borrowings, dialect-specific phonetic orthography and syntax, specific abbreviations, and fast uptake of new vocabulary. Standard Slovene is well researched and well supported with linguistic resources and tools. In contrast, there are no representative corpora for studying non-standard language and no tools for its analysis and processing, and the characteristics of non-standard language are hardly ever included in language descriptions, textbooks or school curricula.
The proposed project aims to bridge this gap by developing an infrastructure and methodology for the analysis of user-generated content in (mostly non-standard) Slovene. It combines state-of-the-art methods from corpus and computational linguistics to enable a comprehensive study of a segment of the Slovene language that is changing rapidly and gaining importance in all our activities, but has so far been ignored for various reasons. In the scope of the project we will compile a large and representative corpus containing Slovene tweets, blogs, Internet forums, and comments on news articles and on Wikipedia entries. These text types cover a large portion of publicly available user-generated text.
The corpus will be linguistically annotated with standardized spelling, lemmas, parts of speech, syntactic structure and names, and will be freely available via a powerful concordancer to make it useful for theoretical and applied linguistic research. The corpus will be used for a series of linguistic analyses, in particular a comparison of non-standard Slovene with standard written and spoken Slovene, a study of offensive language, and three corpus-driven studies: of collocations, terms and semantic shifts in non-standard language. The project will also produce two datasets, a manually annotated corpus and a lexicon, which will be used to develop methods for the automatic processing of non-standard texts. Based on the lexicon, a web dictionary of non-standard Slovene will also be produced, useful for teachers, students, linguists, lexicographers and the general public.
At the end of the project, we will make the developed resources openly available for download under a Creative Commons license, so that they can be used for R&D in computational linguistics and other fields of automatic data processing. The developed tools will be incorporated into a workflow construction and execution environment, and a prototype platform for the continuous construction of a monitor corpus will be developed. After the end of the project, we plan to include the resources, workflows and platforms in the Slovene research infrastructure CLARIN, in order to ensure the longevity of the results and to maintain and further develop them. The project also aims to disseminate its results through two workshops and the publication of a book. The developed resources, tools and methods will enable the transfer of knowledge to all fields that deal with user-generated content. This will increase the e-inclusion of Slovene speakers, who are often tied to foreign-language applications, so that Slovene can function and develop in the digital age.
As the methodology for corpus construction and development of the tools will be language independent, it will also be useful for related languages that still lack them, giving the results an important multilingual dimension as well.