1.

The ssj500k training corpus for Slovene language processing

The ssj500k training corpus is the largest and most widely used open-source collection of training data for Slovene language processing. It has been manually annotated at the levels of segmentation, tokenisation, lemmatisation, JOS morphosyntax and dependency syntax, Universal Dependencies, semantic role labelling, named entities and verbal multi-word expressions. The annotation layers are based on linguistic guidelines that take into account the specificities of Slovene while allowing for cross-lingual connectivity. In the paper, we present the newest version of the corpus as well as the Q-CAT software that we developed for faster corpus annotation and easier analysis of richly annotated data. As a result, the corpus has become easier to upgrade and more widely available for empirically based grammatical analyses.

F.23 Development of new system-wide, normative and programme solutions, and methods

COBISS.SI-ID: 30599683

2.

Special issue and a public debate on the new grammar of modern Slovene

Under the umbrella of the J6-8256 project, we have produced a special issue of the Slovenščina 2.0 journal: “Grammar in Linguistic Description”. The issue comprises seven papers that contribute to contemporary Slovene studies, in part by introducing new approaches to the analysis of grammatical data. It also includes a transcript of an expert panel that we organised as part of a project dissemination event in 2018. The discussion brought together representatives of relevant Slovenian institutions and highlighted what kind of grammatical description of modern Slovene the community needs, given developments in the field and in society.

C.03 Guest-associated editor

COBISS.SI-ID: 298688512

3.

Corpus extraction tool LIST 1.0

With LIST, the user can extract frequency information at the level of characters, word parts, word forms, lemmas, and word strings (n-grams) from any suitably formatted text corpus. The tool supports a variety of corpus formats, regardless of the language and the chosen set of linguistic or other annotations. By developing a tool that is openly accessible and user-friendly, we have significantly improved the accessibility of empirical data on modern Slovene for the research and development community, thus facilitating corpus-based grammatical description, comparative corpus analyses and other kinds of corpus-based research.

F.06 Development of a new product

COBISS.SI-ID: 1538193091
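
The kind of frequency extraction described for LIST can be illustrated with a minimal Python sketch. This is not the LIST implementation, and it does not use LIST's formats or interface; the function names `ngrams` and `frequency_lists` and the sample sentences are hypothetical, introduced only for illustration. It assumes the corpus has already been tokenised into sentences, and it covers only characters, word forms and n-grams, since lemmas and word parts would require annotated input.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-token sequences in a tokenised sentence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequency_lists(sentences, n=2):
    """Count characters, word forms and n-grams over tokenised sentences.

    `sentences` is an iterable of token lists (hypothetical input format).
    N-grams are counted within sentences only, never across boundaries.
    """
    chars = Counter()
    forms = Counter()
    grams = Counter()
    for tokens in sentences:
        forms.update(tokens)              # word-form frequencies
        for token in tokens:
            chars.update(token)           # character frequencies
        grams.update(ngrams(tokens, n))   # n-gram frequencies
    return chars, forms, grams


# Illustrative input: two short tokenised Slovene sentences.
sample = [["to", "je", "korpus"], ["to", "je", "orodje"]]
chars, forms, grams = frequency_lists(sample, n=2)
print(forms.most_common(2))
print(grams.most_common(1))
```

Counting n-grams sentence by sentence is a design choice of this sketch; a tool working on real corpora would also need to handle the corpus format, annotation layers and filtering options.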

4.

Special award for e-publishing: Gigafida 2.0 corpus

Gigafida, currently available in version 2.0, is a reference corpus of written Slovene. It comprises texts that have been selected and automatically processed with the aim of creating a corpus that represents a sample of modern standard Slovene and can be used for research in linguistics and other branches of the humanities, for compiling modern dictionaries, grammars, and learning materials, as well as for developing language technologies for Slovene. At the 36th Slovenian Book Fair (2020), the Gigafida corpus received a special award in the field of e-publishing. The award is given, as part of the Book of the Year award, for a project with the most imaginative, fresh and specific solutions within digital platforms related to books: https://www.knjiznisejem.si/index.php/sl/nagrade.

E.01 National awards

COBISS.SI-ID: 18023939

5.

Valency lexicon extracted from the Gigafida 2.1 corpus

The valency lexicon was extracted from the Gigafida 2.1 corpus and contains valency patterns for 14,595 Slovenian verbs. It is the largest and the first openly accessible lexicon of its type, compiled through an interdisciplinary combination of linguistic and machine-based approaches. The valency patterns are linguistically defined and formalised using the JOS syntactic dependency system and the semantic role labelling system for Slovene. For every pattern in which a verb appears, and for each semantic role in the pattern, we have provided statistical data on its representation in the ssj500k and Gigafida 2.1 corpora. Each pattern also contains at least one example of authentic language use from both corpora. The open-access database provides a foundation for corpus studies at the levels of syntax and semantics, for grammatical description and for the development of language technologies for modern Slovene.

F.15 Development of a new information system/databases

COBISS.SI-ID: 62222339

J6-8256 — Final report
