National Open Science Portal

Annotated corpus of Slovenian language-related news articles MetaLangNEWS-Sl

Research data

Keywords: news corpus; news discourse

A comprehensive corpus of news articles on the topic of language, published in major Slovenian daily newspapers and news portals in the five-year period of January 1, 2015 - January 1, 2020. The corpus is designed to facilitate research on metalanguage (‘language about language’), linguistic ideolog ...

Year: 2020 Source: CLARIN.si

Annotated corpus of Slovenian language-related news comments MetaLangNEWS-COMMENTS-Sl

Ksenija Bogetić, Vuk Batanović

Research data

Keywords: news comments; computer-mediated communication

A comprehensive corpus of user comments on online news articles on the topic of language from major Slovenian daily newspapers and news portals, published in the five-year period of January 1, 2015 - January 1, 2020. The corpus is designed to facilitate research on metalanguage (‘language about lang ...

Year: 2020 Source: CLARIN.si

Annotated corpus of Croatian language-related news articles MetaLangNEWS-Hr

Ksenija Bogetić, Vuk Batanović

Research data

Keywords: news corpus; news discourse

A comprehensive corpus of news articles on the topic of language, published in major Croatian daily newspapers and news portals in the five-year period of January 1, 2015 - January 1, 2020. The corpus is designed to facilitate research on metalanguage (‘language about language’), linguistic ideologi ...

Year: 2020 Source: CLARIN.si

Annotated corpus of Croatian language-related news comments MetaLangNEWS-COMMENTS-Hr

Vuk Batanović, Ksenija Bogetić

Research data

Keywords: news comments; computer-mediated communication

A comprehensive corpus of user comments on online news articles on the topic of language from major Croatian daily newspapers and news portals, published in the five-year period of January 1, 2015 - January 1, 2020. The corpus is designed to facilitate research on metalanguage (‘language about langu ...

Year: 2020 Source: CLARIN.si

Annotated corpus of Serbian language-related news articles MetaLangNEWS-Sr

Ksenija Bogetić, Vuk Batanović

Research data

Keywords: news corpus; news discourse

A comprehensive corpus of news articles on the topic of language, published in major Serbian daily newspapers and news portals in the five-year period of January 1, 2015 - January 1, 2020. The corpus is designed to facilitate research on metalanguage (‘language about language’), linguistic ideologie ...

Year: 2020 Source: CLARIN.si

Annotated corpus of Serbian language-related news comments MetaLangNEWS-COMMENTS-Sr

Ksenija Bogetić, Vuk Batanović

Research data

Keywords: news comments; computer-mediated communication

A comprehensive corpus of user comments on online news articles on the topic of language from major Serbian daily newspapers and news portals, published in the five-year period of January 1, 2015 - January 1, 2020. The corpus is designed to facilitate research on metalanguage (‘language about langua ...

Year: 2020 Source: CLARIN.si

Training corpus SETimes.SR 1.0

Vuk Batanović, Nikola Ljubešić, Tanja Samardžić, Tomaž Erjavec

Research data

Keywords: part-of-speech tagging; dependency treebank; parsing; named entities; tokenisation; manual annotation; TEI

The SETimes.SR training corpus contains 86 726 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation, syntactic dependencies, and named entities. The annotations (and other aspects) of the corpus are documented in the teiHeader and ...

Year: 2018 Source: CLARIN.si

Training corpus hr500k 1.0

Nikola Ljubešić, Željko Agić, Filip Klubička, Vuk Batanović, Tomaž Erjavec

Research data

Keywords: part-of-speech tagging; dependency treebank; parsing; named entities; tokenisation; manual annotation; TEI; semantic role labelling

The hr500k training corpus contains about 500,000 tokens manually annotated on the levels of tokenisation, sentence segmentation, morphosyntactic tagging, lemmatisation and named entities. About half of the corpus is also manually annotated with syntactic dependencies. Furthermore, about a fifth of ...

Year: 2018 Source: CLARIN.si

Croatian Twitter training corpus ReLDI-NormTagNER-hr 2.1

Nikola Ljubešić, Tomaž Erjavec, Vuk Batanović, Maja Miličević, Tanja Samardžić

Research data

Keywords: computer-mediated communication; tokenisation; word normalisation; part-of-speech tagging; lemmatisation; named entities; manual annotation; TEI

ReLDI-NormTagNER-hr 2.1 is a manually annotated corpus of Croatian tweets. It is meant as a gold-standard training and testing dataset for tokenisation, sentence segmentation, word normalisation, morphosyntactic tagging, lemmatisation and named entity recognition of non-standard Croatian. Each tweet ...

Year: 2019 Source: CLARIN.si
