Hamshahri Corpus

The Hamshahri Corpus (Persian: پیکره همشهری) is a sizable Persian corpus based on the Iranian newspaper Hamshahri, one of the first online Persian-language newspapers in Iran. It was initially collected and compiled by Ehsan Darrudi at the Database Research Group (DBRG)[1] of the University of Tehran. Later, a team headed by Abolfazl AleAhmad[2] built on this corpus to create the first Persian text collection suitable for information retrieval evaluation tasks.

The corpus was created by crawling the online news articles on Hamshahri's website and processing the HTML pages into a standard text corpus for modern information retrieval experiments.

Version 1.0

The collection contains more than 160,000 articles in subject categories including politics, city news, economics, reports, editorials, literature, sciences, society, foreign news, and sports. Document sizes range from short news items (under 1 KB) to rather long articles (e.g. 140 KB), with an average size of 1.8 KB.

The corpus is available for download in two formats:[2]

  • Tagged Text: 560 MB
  • In SQL Server 2000 Tables: 712 MB

Version 2.0

The second release of the Hamshahri Corpus was launched on 20 October 2008. It offers several new features and improvements:

  • More news: 323,616 text stories in 3,206 XML files (one file for each day)
  • Increased time span: from 22 June 1996 to 13 May 2007
  • Bigger in size: 1.42 GB uncompressed
  • Standard container: Unicode XML
  • Included images: images have been extracted from the news and preserved (available in an additional package), making the corpus suitable for image retrieval tasks
  • Categorized news: the news stories have been categorized semi-automatically, making the corpus appropriate for text categorization and classification tasks

The corpus is available for download in XML format.
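
Because version 2.0 ships as one Unicode XML file per day with categorized stories, a reader might load a daily file roughly as sketched below. This is only an illustration: the element names (HAMSHAHRI2, DOC, DOCID, CAT, TITLE, TEXT) and the inline sample are assumptions made for this example and are not confirmed by this article, so they should be checked against the downloaded files before use.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample standing in for one daily XML file.
# All element names here are assumptions, not the documented schema.
SAMPLE_DAY = """<HAMSHAHRI2>
  <DOC>
    <DOCID>example-001</DOCID>
    <CAT>Sport</CAT>
    <TITLE>عنوان نمونه</TITLE>
    <TEXT>متن نمونه خبر ورزشی</TEXT>
  </DOC>
  <DOC>
    <DOCID>example-002</DOCID>
    <CAT>Economy</CAT>
    <TITLE>عنوان دوم</TITLE>
    <TEXT>متن نمونه خبر اقتصادی</TEXT>
  </DOC>
</HAMSHAHRI2>"""


def iter_stories(xml_text):
    """Yield one dict per story found in the XML text of a daily file."""
    root = ET.fromstring(xml_text)
    for doc in root.iter("DOC"):
        yield {
            # findtext returns None for a missing child, so fall back to "".
            "id": (doc.findtext("DOCID") or "").strip(),
            "category": (doc.findtext("CAT") or "").strip(),
            "title": (doc.findtext("TITLE") or "").strip(),
            "text": (doc.findtext("TEXT") or "").strip(),
        }


stories = list(iter_stories(SAMPLE_DAY))
# Count stories per category, e.g. as a first step toward a classification task.
category_counts = Counter(story["category"] for story in stories)
```

For the real corpus, the same function could be applied to the decoded contents of each daily file in turn, so that only one day's stories are held in memory at a time.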

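See also

  • Bijankhan Corpus
  • Persian Today Corpus
  • Tehran Monolingual Corpus
  • Text corpus
  • Information retrieval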

References

  1. DBRG News. Database Research Group. Archived 2017-05-15 at the Wayback Machine.
  2. Hamshahri. Database Research Group. Archived 2017-05-14 at the Wayback Machine.

External links

  • Hamshahri Corpus Homepage (archived 2017-05-14 at the Wayback Machine)
  • irBlogs Collection Homepage

