TenTen Corpus Family

The TenTen Corpus Family (also called TenTen corpora) is a set of comparable web text corpora, i.e. collections of texts that have been crawled from the World Wide Web and processed to match the same standards. These corpora are made available through the Sketch Engine corpus manager. There are TenTen corpora for more than 35 languages. Their target size is 10 billion (10^10) words per language, which gave rise to the corpus family's name.[1]

In the creation of the TenTen corpora, data crawled from the World Wide Web are processed with natural language processing tools developed by the Natural Language Processing Centre at the Faculty of Informatics at Masaryk University (Brno, Czech Republic) and by the Lexical Computing company (developer of the Sketch Engine).

Corpus linguistics

In corpus linguistics, a text corpus is a large, structured collection of texts that is stored and processed electronically. Corpora are used to test hypotheses about a language, to validate linguistic rules and to study the frequency distribution of words and word sequences (n-grams) within a language.
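The frequency distribution of n-grams mentioned above can be illustrated with a minimal Python sketch. The tokenizer is deliberately naive and the function names (`tokenize`, `ngram_counts`) are assumptions made for this illustration; they are not part of any TenTen or Sketch Engine tooling.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (a naive tokenizer)."""
    return re.findall(r"\w+", text.lower())


def ngram_counts(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


sample = "the cat sat on the mat and the cat slept"
tokens = tokenize(sample)
unigrams = ngram_counts(tokens, 1)  # single-word frequencies
bigrams = ngram_counts(tokens, 2)   # two-word sequence frequencies
```

Calling `bigrams.most_common(3)` would list the most frequent word pairs; on a real corpus the same counts are computed over billions of tokens rather than one sentence.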

Electronically processed corpora can be searched quickly. Text processing procedures such as tokenization, part-of-speech tagging and word-sense disambiguation enrich corpus texts with detailed linguistic information. This makes it possible to narrow a search to particular parts of speech, word sequences or a specific part of the corpus.
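As a rough illustration of how part-of-speech annotation narrows a search, the sketch below filters a hand-annotated toy sentence. The `Token` type, the tag labels and the helper functions are hypothetical; a real corpus would be tagged automatically and queried through a corpus manager.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    lemma: str
    pos: str  # hypothetical tag labels, not a real tagset


# Hand-annotated toy sentence; real corpora are tagged automatically.
tagged = [
    Token("Corpora", "corpus", "NOUN"),
    Token("provide", "provide", "VERB"),
    Token("fast", "fast", "ADJ"),
    Token("search", "search", "NOUN"),
]


def find_by_pos(tokens, pos):
    """Return the tokens whose part-of-speech tag equals pos."""
    return [t for t in tokens if t.pos == pos]


def find_sequence(tokens, pattern):
    """Return every run of tokens whose tags match the given tag sequence."""
    n = len(pattern)
    return [
        tokens[i:i + n]
        for i in range(len(tokens) - n + 1)
        if [t.pos for t in tokens[i:i + n]] == list(pattern)
    ]


nouns = find_by_pos(tagged, "NOUN")
adj_noun = find_sequence(tagged, ["ADJ", "NOUN"])
```

Here `nouns` holds the two noun tokens and `adj_noun` holds the single adjective-noun sequence "fast search".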

The first text corpora were created in the 1960s, among them the 1-million-word Brown Corpus of American English. Over time, many further corpora were produced (such as the British National Corpus and the LOB Corpus), and work also began on larger corpora and on corpora covering languages other than English. This development was linked with the emergence of corpus creation tools that help achieve larger size, wider coverage, cleaner data etc.

Production of TenTen corpora

The procedure by which TenTen corpora are produced is based on the creators' earlier research in preparing web corpora and the subsequent processing thereof.[2][3][4]

First, a large amount of text data is downloaded from the World Wide Web by the dedicated SpiderLing web crawler.[5] Next, the texts are cleaned: the jusText tool[6] removes non-textual material such as navigation links, headers and footers from the HTML source code of the web pages, so that only full sentences are preserved. Finally, the ONION tool[6] is applied to remove duplicate text portions from the corpus, which naturally occur on the World Wide Web due to practices such as quoting, citing and copying.[1]
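The deduplication step can be illustrated with a simplified sketch based on overlapping word n-grams (shingles). This is not ONION's actual implementation: the function names, the n-gram length of 3 and the 0.5 threshold are assumptions chosen only to keep the example small.

```python
def shingles(text, n=3):
    """Return the set of word n-grams (shingles) in a paragraph."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def deduplicate(paragraphs, n=3, threshold=0.5):
    """Keep a paragraph only if less than `threshold` of its shingles were seen before."""
    seen = set()
    kept = []
    for paragraph in paragraphs:
        grams = shingles(paragraph, n)
        if not grams:
            continue  # skip empty paragraphs
        duplicated = len(grams & seen) / len(grams)
        if duplicated < threshold:
            kept.append(paragraph)
        seen |= grams  # remember shingles of every paragraph, kept or not
    return kept


paragraphs = [
    "Web corpora are built from pages crawled on the web.",
    "Web corpora are built from pages crawled on the web.",  # exact copy
    "As quoted elsewhere, web corpora are built from pages crawled on the web today.",
    "Deduplication removes repeated portions of text.",
]
unique = deduplicate(paragraphs)
```

In this toy input the exact copy and the near-duplicate quotation are dropped, while the first and last paragraphs are kept. Comparing shingles rather than whole paragraphs is what lets the sketch catch quoted or lightly edited text as well as verbatim copies.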

TenTen corpora data structure

TenTen corpora follow a specific metadata structure that is common to all of them. Metadata is contained in structural attributes that relate to individual documents and paragraphs in the corpus. Some TenTen corpora can feature additional specific attributes.

Document attributes

  • top-level domain – domain at the highest level of the hierarchical Domain Name System (e.g. "com")
  • website – identification string defining a realm of administrative autonomy within the Internet (e.g. "wikipedia.org")
  • web domain – collection of related web pages (e.g. "la.wikipedia.org")
  • crawl date – date when the document was downloaded from the Web
  • url – the Uniform Resource Locator referring to the document's source
  • wordcount – number of words in the document
  • length – classification of the document into a range by its length measured in thousands of words
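The document attributes listed above can be pictured as a small record derived from a URL and the document text. The following Python sketch is an illustration only: the field names, the `describe` helper and the length buckets are hypothetical, and taking the last two host labels as the website is a naive shortcut that fails for suffixes such as "co.uk".

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass
class DocAttributes:
    top_level_domain: str
    website: str
    web_domain: str
    crawl_date: str
    url: str
    wordcount: int
    length: str


def length_class(wordcount):
    """Bucket a document by length in thousands of words (bucket edges are hypothetical)."""
    thousands = wordcount // 1000
    if thousands < 1:
        return "0-1k"
    if thousands < 5:
        return "1k-5k"
    return "5k+"


def describe(url, text, crawl_date):
    """Derive the document attributes from a URL and the document text."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    wordcount = len(text.split())
    return DocAttributes(
        top_level_domain=labels[-1],
        website=".".join(labels[-2:]),  # naive: last two labels of the host name
        web_domain=host,
        crawl_date=crawl_date,
        url=url,
        wordcount=wordcount,
        length=length_class(wordcount),
    )


doc = describe("https://la.wikipedia.org/wiki/Lingua", "Lingua Latina est lingua", "2018-10-23")
```

For this URL the sketch yields "org" as the top-level domain, "wikipedia.org" as the website and "la.wikipedia.org" as the web domain, matching the examples in the list.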

Paragraph attributes

  • heading – a numeric attribute distinguishing headers and similar titles from ordinary body text (1 if the paragraph is a heading, 0 otherwise)
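A minimal sketch of how the heading attribute might be used to separate titles from body text is shown below. The `Paragraph` class and the `body_text` helper are hypothetical names for this illustration, not part of the corpus format itself.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    text: str
    heading: int  # 1 if the paragraph is a heading, 0 otherwise


def body_text(paragraphs):
    """Return only ordinary body paragraphs, skipping headings."""
    return [p.text for p in paragraphs if p.heading == 0]


document = [
    Paragraph("Production of TenTen corpora", heading=1),
    Paragraph("Text data is downloaded from the Web.", heading=0),
    Paragraph("Duplicate portions are then removed.", heading=0),
]
body = body_text(document)
```

Filtering on the attribute in this way keeps short title fragments out of analyses that expect full sentences.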

Available TenTen corpora

The following corpora can be accessed through the Sketch Engine as of October 2018:[7]

  1. arTenTen (Arabic web corpus)[8]
  2. beTenTen (Belarusian web corpus)[9]
  3. bgTenTen (Bulgarian web corpus)[10]
  4. caTenTen (Catalan web corpus)
  5. csTenTen (Czech web corpus)[11]
  6. daTenTen (Danish web corpus)
  7. deTenTen (German web corpus)
  8. elTenTen (Greek web corpus)
  9. enTenTen (English web corpus)[12]
  10. esTenTen (Spanish web corpus with European/American Spanish subcorpora)[13]
  11. etTenTen (Estonian web corpus)[14]
  12. fiTenTen (Finnish web corpus)
  13. frTenTen (French web corpus)
  14. heTenTen (Hebrew web corpus)
  15. hiTenTen (Hindi web corpus)
  16. huTenTen (Hungarian web corpus)
  17. itTenTen (Italian web corpus)
  18. jaTenTen (Japanese web corpus)
  19. kmTenTen (Khmer web corpus)
  20. koTenTen (Korean web corpus)
  21. loTenTen (Lao & Isan web corpus)
  22. ltTenTen (Lithuanian web corpus)
  23. lvTenTen (Latvian web corpus)
  24. mkTenTen (Macedonian web corpus)
  25. nlTenTen (Dutch web corpus)
  26. noTenTen (Norwegian web corpus)
  27. plTenTen (Polish web corpus)
  28. ptTenTen (Portuguese web corpus)
  29. roTenTen (Romanian web corpus)
  30. ruTenTen (Russian web corpus)
  31. skTenTen (Slovak web corpus)
  32. slTenTen (Slovenian web corpus)
  33. svTenTen (Swedish web corpus)
  34. thTenTen (Thai web corpus)
  35. tlTenTen (Tagalog web corpus)
  36. trTenTen (Turkish web corpus)[15]
  37. ukTenTen (Ukrainian web corpus)
  38. zhTenTen (Chinese Simplified characters web corpus)

See also

  • Text corpus
  • Sketch Engine
  • Web crawler (spider)
  • Data deduplication

References

  1. Jakubíček, Miloš; Kilgarriff, Adam; Kovář, Vojtěch; Rychlý, Pavel; Suchomel, Vít (July 2013). The Tenten Corpus Family (PDF). 7th International Corpus Linguistics Conference CL. Lancaster, UK: Lancaster University. pp. 125–127. Retrieved 13 June 2017.
  2. Baroni, Marco; Kilgarriff, Adam; Kovář, Vojtěch; Rychlý, Pavel; Suchomel, Vít (July 2013). Large linguistically-processed web corpora for multiple languages (PDF). 11th Conference of the European Chapter of the Association for Computational Linguistics: Posters & Demonstrations. Association for Computational Linguistics. Trento, Italy: Lancaster University. pp. 87–90. Retrieved 13 June 2017.
  3. Kilgarriff, Adam; Reddy, Siva; Pomikálek, Jan; Avinesh, PVS (May 2010). A Corpus Factory for Many Languages. 7th Language Resources and Evaluation Conference. Valletta, Malta: ELRA. Retrieved 13 June 2017.
  4. Sharoff, Serge (2006). "Creating general-purpose corpora using automated search engine queries" (PDF). In Baroni, Marco; Bernardini, Silvia (eds.). Wacky! Working papers on the Web as Corpus. Bologna, Italy: GEDIT. pp. 63–98. ISBN 978-88-6027-004-7.
  5. Suchomel, Vít; Pomikálek, Jan (17 April 2012). "Efficient web crawling for large text corpora" (PDF). Proceedings of the seventh Web as Corpus Workshop (WAC7). 7th Web as Corpus Workshop. Lyon, France: Association for Computational Linguistics (ACL) on Web as Corpus. pp. 39–43. Retrieved 13 June 2017.
  6. Pomikálek, Jan (2011). Removing boilerplate and duplicate content from web corpora (PhD). Faculty of Informatics, Masaryk University. Retrieved 17 April 2017.
  7. "TenTen Corpus Family". www.sketchengine.eu. Sketch Engine. Retrieved 23 October 2018.
  8. Belinkov, Y., Habash, N., Kilgarriff, A., Ordan, N., Roth, R., & Suchomel, V. (2013). arTen-Ten: a new, vast corpus for Arabic. Proceedings of WACL.
  9. "A new Belarusian corpus (beTenTen)". Sketch Engine. Lexical Computing. 2018-02-26. Retrieved 2018-04-06.
  10. Kilgarriff, A., Jakubíček, M., Pomikalek, J., Sardinha, T. B., & Whitelock, P. (2014). PtTenTen: a corpus for Portuguese lexicography. Working with Portuguese Corpora, 111-30.
  11. Suchomel, Vít (December 7–9, 2012). "Recent Czech Web Corpora". In Horák, A.; Rychlý, P. (eds.). Proceedings of Recent Advances in Slavonic Natural Language Processing, RASLAN 2012. Tribun EU. pp. 77–83.
  12. Kilgarriff, Adam (2012). "Getting to Know Your Corpus". Text, Speech and Dialogue. Lecture Notes in Computer Science. Vol. 7499. pp. 3–15. CiteSeerX 10.1.1.452.8074. doi:10.1007/978-3-642-32790-2_1. ISBN 978-3-642-32789-6.
  13. Kilgarriff, A., & Renau, I. (2013). esTenTen, a vast web corpus of Peninsular and American Spanish. Procedia - Social and Behavioral Sciences, 95, 12-19.
  14. Srdanović, I. (2016). A Research Project on Language Resources for Learners of Japanese. Inter Faculty, 6.
  15. Baisa, Vít; Suchomel, Vít (2015). "Turkic Language Support in Sketch Engine". Proceedings of the international conference "Turkic Languages processing: TurkLang 2015". Kazan: Academy of Sciences of the Republic of Tatarstan Press. pp. 214–223. ISBN 978-5-9690-0262-3 – via IS MU.

External links

  • TenTen Corpus Family (at the Sketch Engine website)
