Croatian National Corpus

The Croatian National Corpus (Croatian: Hrvatski nacionalni korpus, HNK) is the largest and most important corpus of Croatian. Its compilation started in 1998 at the Institute of Linguistics[1] of the Faculty of Humanities and Social Sciences, University of Zagreb, following the ideas of Marko Tadić. The theoretical foundations for a general-purpose, representative, multi-million corpus of Croatian, and arguments for the need for one, had begun to appear even earlier.[2] The Croatian National Corpus is compiled from selected texts written in Croatian covering all fields, topics, genres and styles: from literary and scientific texts to textbooks, newspapers, user groups and chat rooms.

The initial composition was divided into two constituents:

  1. The 30-million corpus of contemporary Croatian (30m), which included samples of texts from 1990 onwards. The criteria for including text samples were that they be written by native speakers and cover different fields, genres and topics. Translated texts and poetry were excluded.
  2. The Croatian Electronic Text Archive (HETA), which included complete texts, particularly serial publications (volumes, series, editions etc.) that would have unbalanced the 30m had they been included there.

Since 2004, with the adoption of the concept of the 3rd generation corpus, the two-constituent structure has been abandoned in favor of several subcorpora and a larger size. Since 2005 HNK has had 105 million tokens and has been composed of a number of different subcorpora, which can be searched individually or together as a whole corpus. In 2004 HNK also migrated to a new server platform, the Manatee/Bonito server-client architecture. Searching the HNK (today still with free test access) requires Bonito,[3] a free client program. The author of this corpus manager is Pavel Rychlý[4] from the Natural Language Processing Laboratory[5] of the Faculty of Informatics,[6] Masaryk University in Brno, Czech Republic. Its interface supports complex and elaborate queries over the corpus, different types of statistical results, total or partial word lists with frequencies according to different query criteria, frequency distributions of types, automatic collocation detection etc.
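Two of the result types mentioned above, word lists with frequencies and automatic collocation detection, can be illustrated with a minimal sketch. This is not the Manatee/Bonito implementation or API: the function names `word_list` and `collocations`, the toy token sample, and the choice of pointwise mutual information (PMI) as the association score are all assumptions made for illustration only.

```python
import math
from collections import Counter


def word_list(tokens, min_freq=1):
    """Return (word, frequency) pairs, most frequent first, ties alphabetical."""
    counts = Counter(tokens)
    items = [(w, f) for w, f in counts.items() if f >= min_freq]
    return sorted(items, key=lambda wf: (-wf[1], wf[0]))


def collocations(tokens, min_freq=2):
    """Score adjacent word pairs by a simple PMI estimate.

    PMI is approximated as log2(f(a, b) * N / (f(a) * f(b))), where N is the
    number of tokens. Pairs seen fewer than min_freq times are skipped.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = len(tokens)
    scored = []
    for (a, b), f in bigrams.items():
        if f < min_freq:
            continue
        pmi = math.log2((f * n) / (unigrams[a] * unigrams[b]))
        scored.append(((a, b), f, pmi))
    # Highest score first; ties are broken by the pair itself.
    return sorted(scored, key=lambda s: (-s[2], s[0]))


# Hypothetical toy sample, not taken from the HNK.
tokens = (
    "hrvatski nacionalni korpus je korpus hrvatskoga jezika "
    "a hrvatski nacionalni korpus je velik"
).split()

for word, freq in word_list(tokens, min_freq=2):
    print(word, freq)

for pair, freq, score in collocations(tokens):
    print(pair, freq, round(score, 2))
```

A real corpus manager works over indexed, annotated text and typically offers several association measures; the sketch only shows the shape of the two computations.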

The latest version of this corpus (version 3)[7] has 216.8 million tokens. Online search is available via the Bonito 2 web interface, which is part of NoSketch Engine,[8] a limited version of the Sketch Engine software.

References

  1. ^ Institute of Linguistics
  2. ^ Tadić 1990, 1996 (archived 2006-02-10 at the Wayback Machine), 1998 (archived 2006-02-10 at the Wayback Machine)
  3. ^ Bonito
  4. ^ Rychlý, Pavel (2007). "Manatee/bonito–a modular corpus manager" (PDF). 1st Workshop on Recent Advances in Slavonic Natural Language Processing. Masaryk University: 65–70.
  5. ^ Natural Language Processing Laboratory (archived 2005-10-28 at the Wayback Machine)
  6. ^ Faculty of Informatics
  7. ^ Tadić, Marko (2009). "New version of the Croatian National Corpus". After Half a Century of Slavonic Natural Language Processing. Masaryk University: 199–205.
  8. ^ NoSketch Engine

External links

  * Free online search
  * Croatian National Corpus website (in Croatian)
  * Hrvatska jezična riznica, another online Croatian corpus, by the Institute of Croatian Language and Linguistics