Wikipedia

American National Corpus

The American National Corpus (ANC) is a text corpus of American English containing 22 million words of written and spoken data produced since 1990. The ANC currently covers a range of genres, including emerging ones such as email, tweets, and web data that are absent from earlier corpora such as the British National Corpus. It is annotated for part of speech and lemma, shallow parse, and named entities.

The ANC is available from the Linguistic Data Consortium. A fifteen-million-word subset of the corpus, called the Open American National Corpus (OANC), is freely available from the ANC website with no restrictions on its use.

The corpus and its annotations are provided according to the specifications of ISO/TC 37 SC4's Linguistic Annotation Framework. A freely provided transduction tool (ANC2Go) delivers the corpus and user-chosen annotations in multiple formats, including CoNLL IOB format, an XML format conformant to the XML Corpus Encoding Standard (XCES) (usable with the British National Corpus's XAIRA search engine), a UIMA-compliant format, and formats suitable for input to a wide variety of concordance software. Plugins to import the annotations into General Architecture for Text Engineering (GATE) are also available.
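As an illustration of what working with IOB-style output can look like, the following is a minimal sketch in Python. It assumes a common CoNLL-style layout (one token per line, whitespace-separated columns for token, part-of-speech tag, and IOB chunk tag, with a blank line between sentences). The exact columns produced by ANC2Go are not described here, so the sample data and the function names `read_iob` and `chunks` are hypothetical.

```python
from typing import List, Tuple

# Hypothetical sample, NOT actual ANC2Go output: token, POS tag, IOB chunk tag.
SAMPLE = """\
The DT B-NP
corpus NN I-NP
is VBZ B-VP
freely RB B-ADVP
available JJ B-ADJP
. . O
"""

Token = Tuple[str, str, str]


def read_iob(text: str) -> List[List[Token]]:
    """Split IOB text into sentences of (token, pos, chunk_tag) triples."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        token, pos, tag = line.split()
        current.append((token, pos, tag))
    if current:
        sentences.append(current)
    return sentences


def chunks(sentence: List[Token]) -> List[Tuple[str, List[str]]]:
    """Group tokens into (label, [tokens]) spans using the B-/I-/O prefixes."""
    spans: List[Tuple[str, List[str]]] = []
    open_label = None  # label of the chunk currently being extended
    for token, _pos, tag in sentence:
        prefix, _, label = tag.partition("-")
        if prefix == "B" or (prefix == "I" and label != open_label):
            # "B-X" starts a chunk; a stray "I-X" is leniently treated the same.
            spans.append((label, [token]))
            open_label = label
        elif prefix == "I":
            spans[-1][1].append(token)
        else:
            # "O": the token lies outside any chunk.
            open_label = None
    return spans


if __name__ == "__main__":
    for sentence in read_iob(SAMPLE):
        for label, tokens in chunks(sentence):
            print(label, " ".join(tokens))
```

The B- prefix opens a chunk, I- continues it, and O marks tokens outside any chunk, so the shallow parse can be recovered from a flat token list without any nested markup.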

The ANC differs from other corpora of English in that it is richly annotated: it carries several part-of-speech annotations (Penn tags, CLAWS5 and CLAWS7 tags), shallow parse annotations, and annotations for several types of named entities. Additional annotations are added to all or parts of the corpus as they become available, often through contributions from other projects. Unlike online searchable corpora, which because of copyright restrictions allow access only to individual sentences, the entire ANC is available, enabling research such as the development of statistical language models and full-text linguistic annotation.
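Full-text access matters for language modelling because such models are estimated from counts over running text rather than isolated sentences. The sketch below shows one simple case, a bigram model with add-one (Laplace) smoothing, trained on a tiny made-up sample. The sentences, the function names `train_bigram` and `bigram_prob`, and the choice of smoothing are illustrative assumptions and do not describe how the ANC has actually been used.

```python
from collections import Counter
from typing import List, Set, Tuple

BOS, EOS = "<s>", "</s>"  # sentence boundary markers


def train_bigram(
    sentences: List[List[str]],
) -> Tuple[Counter, Counter, Set[str]]:
    """Count contexts and bigrams over tokenized sentences."""
    contexts: Counter = Counter()
    bigrams: Counter = Counter()
    vocab: Set[str] = {EOS}  # every word that can be predicted
    for sent in sentences:
        tokens = [BOS] + sent + [EOS]
        vocab.update(sent)
        contexts.update(tokens[:-1])             # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))  # adjacent pairs
    return contexts, bigrams, vocab


def bigram_prob(
    prev: str, word: str, contexts: Counter, bigrams: Counter, vocab: Set[str]
) -> float:
    """Add-one smoothed estimate of P(word | prev)."""
    return (bigrams[(prev, word)] + 1) / (contexts[prev] + len(vocab))


if __name__ == "__main__":
    # Made-up sample text standing in for corpus sentences.
    data = [
        ["the", "corpus", "is", "open"],
        ["the", "corpus", "is", "annotated"],
    ]
    contexts, bigrams, vocab = train_bigram(data)
    print(bigram_prob("the", "corpus", contexts, bigrams, vocab))
```

With smoothing, unseen pairs receive a small non-zero probability, and the estimates for any given context still sum to one over the vocabulary.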

ANC annotations are automatically produced and unvalidated. A 500,000-word subset called the Manually Annotated Sub-Corpus (MASC) carries approximately 20 different kinds of linguistic annotation, all of which have been hand-validated or manually produced. These include Penn Treebank syntactic annotation, WordNet sense annotation, and FrameNet semantic frame annotations, among others. Like the OANC, MASC is freely available for any use, and can be downloaded from the ANC site or from the Linguistic Data Consortium. It is also distributed in part-of-speech tagged form with the Natural Language Toolkit.
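Part-of-speech tagged text is often stored as `word_TAG` items separated by whitespace. The sketch below assumes that layout (the underscore separator and the sample line are assumptions, not a confirmed description of the MASC files) and shows how such text can be turned into (word, tag) pairs and a tag frequency table. The function name `parse_tagged` is hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple

# Hypothetical sample in an assumed word_TAG layout.
SAMPLE = "The_DT corpus_NN is_VBZ freely_RB available_JJ ._."


def parse_tagged(text: str, sep: str = "_") -> List[Tuple[str, Optional[str]]]:
    """Split whitespace-separated word<sep>TAG items into (word, tag) pairs."""
    pairs: List[Tuple[str, Optional[str]]] = []
    for item in text.split():
        # Split on the LAST separator so words containing it stay intact.
        word, found, tag = item.rpartition(sep)
        if not found:
            # No separator present: keep the word and leave the tag unknown.
            pairs.append((item, None))
        else:
            pairs.append((word, tag))
    return pairs


if __name__ == "__main__":
    pairs = parse_tagged(SAMPLE)
    tag_counts = Counter(tag for _word, tag in pairs)
    for tag, count in tag_counts.most_common():
        print(tag, count)
```

If NLTK and its MASC data package are installed, its corpus reader should yield comparable (word, tag) pairs directly; that is stated here as an assumption rather than a tested claim.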

The ANC and its sub-corpora differ from similar corpora primarily in the range of linguistic annotations provided and the inclusion of modern genres that do not appear in resources like the British National Corpus. Also, because the initial target use of the corpora was the development of statistical language models, the full data and all annotations are available. This sets the ANC apart from the Corpus of Contemporary American English (COCA), which is available only selectively through a web browser.

Continued growth of the OANC and MASC relies on contributions of data and annotations from the computational linguistics and corpus linguistics communities.

See also

  • British National Corpus
  • Oxford English Corpus
  • Corpus of Contemporary American English (COCA)

References

  • Ide, N. (2008). The American National Corpus: Then, Now, and Tomorrow. In Michael Haugh, Kate Burridge, Jean Mulder and Pam Peters (eds.), Selected Proceedings of the 2008 HCSNet Workshop on Designing the Australian National Corpus: Mustering Languages. Cascadilla Proceedings Project, Somerville, MA.
  • Ide, N., Suderman, K. (2004). The American National Corpus First Release. Proceedings of the Fourth Language Resources and Evaluation Conference (LREC), Lisbon, 1681-84.
  • Ide, N., Baker, C., Fellbaum, C., Passonneau, R. (2010). The Manually Annotated Sub-Corpus: A Community Resource For and By the People. Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden.

External links

  • ANC Website
  • MASC website
  • ANC2Go
