Wikipedia

Word list

A word list (or lexicon) is a list of a language's lexicon (generally sorted by frequency of occurrence, either by levels or as a ranked list) within some given text corpus, serving the purpose of vocabulary acquisition. A lexicon sorted by frequency "provides a rational basis for making sure that learners get the best return for their vocabulary learning effort" (Nation 1997), but is mainly intended for course writers, not directly for learners. Frequency lists are also made for lexicographical purposes, serving as a sort of checklist to ensure that common words are not left out. Some major pitfalls are the corpus content, the corpus register, and the definition of "word". While word counting is a thousand years old, and very large analyses were still being done by hand in the mid-20th century, electronic natural language processing of large corpora such as movie subtitles (the SUBTLEX megastudy) has accelerated the research field.

In computational linguistics, a frequency list is a sorted list of words (word types) together with their frequency, where frequency here usually means the number of occurrences in a given corpus, from which the rank can be derived as the position in the list.

Type              Occurrences  Rank
the               3,789,654    1st
he                2,098,762    2nd
[...]
king              57,897       1,356th
boy               56,975       1,357th
[...]
stringyfy         5            34,589th
[...]
transducionalify  1            123,567th
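
The construction described above (count each word type, sort by count, read the rank off the position) can be sketched in a few lines of Python. This is only an illustration: the function name `frequency_list` and the very naive tokenization rule (lowercased runs of letters and apostrophes) are assumptions made here, not part of any cited study.

```python
import re
from collections import Counter

def frequency_list(text):
    """Return (type, occurrences, rank) tuples, most frequent first."""
    # Assumed, naive tokenization: lowercase, keep runs of letters/apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # The rank is simply the 1-based position in the sorted list.
    return [(word, n, rank) for rank, (word, n) in enumerate(ordered, start=1)]

sample = "the king and the boy saw the king"
for word, n, rank in frequency_list(sample):
    print(f"{word:<6}{n:>3}{rank:>5}")
```

Real frequency lists differ mainly in the tokenization and grouping decisions discussed under "Lexical unit", not in the counting step itself.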

Methodology

Factors

Nation (1997) noted how greatly computing capabilities have helped, making corpus analysis much easier. He cited several key issues which influence the construction of frequency lists:

  • corpus representativeness
  • word frequency and range
  • treatment of word families
  • treatment of idioms and fixed expressions
  • range of information
  • various other criteria

Corpora

Traditional written corpus

[Figure: Frequency of personal pronouns in Serbo-Croatian]

Most currently available studies are based on written text corpora, which are more readily available and easier to process.

SUBTLEX movement

However, New et al. 2007 proposed tapping into the large number of subtitles available online to analyse large amounts of speech. Brysbaert & New 2009 made a long critical evaluation of the traditional textual analysis approach and supported a move toward speech analysis and the analysis of film subtitles available online. This has been followed by a handful of follow-up studies,[1] providing valuable frequency counts for various languages. Within a few years, the SUBTLEX movement produced full studies for French (New et al. 2007), American English (Brysbaert & New 2009; Brysbaert, New & Keuleers 2012), Dutch (Keuleers & New 2010), Chinese (Cai & Brysbaert 2010), Spanish (Cuetos et al. 2011), Greek (Dimitropoulou et al. 2010), Vietnamese (Pham, Bolger & Baayen 2011), Brazilian Portuguese (Tang 2012) and European Portuguese (Soares et al. 2015), Albanian (Avdyli & Cuetos 2013), Polish (Mandera et al. 2014) and Catalan (2019[2]). SUBTLEX-IT (2015) provides raw data only.[1]

Lexical unit

In any case, the basic "word" unit must be defined. For Latin scripts, words are usually one or several characters separated by spaces or punctuation. Exceptions arise, however, such as English "can't", French "aujourd'hui", or idioms. It may also be preferable to group the words of a word family under its base word. Thus, possible, impossible and possibility are words of the same word family, represented by the base word *possib*. For statistical purposes, the occurrences of all these words are summed under the base word form *possib*, which allows both the concept and the individual forms to be ranked. Moreover, other languages may present specific difficulties. Such is the case of Chinese, which does not use spaces between words, and where a given string of several characters can be interpreted either as a phrase of single-character words or as one multi-character word.
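
A minimal sketch of these two decisions, assuming a regex-based tokenizer and a tiny hand-made family table, is given here. The names `tokenize`, `family_counts` and `WORD_FAMILIES` are hypothetical and introduced only for illustration; real word lists rely on curated resources rather than a three-entry dictionary.

```python
import re
from collections import Counter

# Hypothetical, hand-made family table mapping forms to a base word.
WORD_FAMILIES = {
    "possible": "possib",
    "impossible": "possib",
    "possibility": "possib",
}

def tokenize(text):
    # Runs of letters (accented ones included), with internal apostrophes kept
    # so that "can't" and "aujourd'hui" each remain a single token.
    return re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", text.lower())

def family_counts(text):
    # Sum occurrences under the base word; unlisted words stand for themselves.
    return Counter(WORD_FAMILIES.get(tok, tok) for tok in tokenize(text))

print(tokenize("I can't come aujourd'hui"))
print(family_counts("It is possible, not impossible: a possibility."))
```

This approach does not address Chinese or other scripts without word separators, which need a dedicated segmentation step before counting.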

Statistics

It seems that Zipf's law holds for frequency lists drawn from longer texts of any natural language. Frequency lists are a useful tool when building an electronic dictionary, which is a prerequisite for a wide range of applications in computational linguistics.
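
For reference, Zipf's law is commonly written as follows. The symbols are notation introduced here, not taken from the cited sources: $f(r)$ is the frequency of the item of rank $r$, $C$ is a corpus-dependent constant, and $s$ is an exponent that is close to 1 for many natural-language corpora.

```latex
f(r) \approx \frac{C}{r^{s}}, \qquad s \approx 1
```

With $s = 1$ the product of rank and frequency stays roughly constant, so the second-ranked item is expected to occur about half as often as the first.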

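German linguists define the Häufigkeitsklasse (frequency class) $N$ of an item in the list using the base 2 logarithm of the ratio between its frequency and the frequency of the most frequent item:

$$N = \left\lfloor 0.5 - \log_2\left(\frac{\text{Frequency of this item}}{\text{Frequency of most common item}}\right) \right\rfloor$$

where $\lfloor \ldots \rfloor$ is the floor function. The most common item belongs to frequency class 0 (zero) and any item that is approximately half as frequent belongs in class 1. For example, measured against the most frequent item of the example list above (3,789,654 occurrences), the misspelled word outragious with 76 occurrences has a ratio of 76/3789654 and belongs in class 16.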
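
The frequency class can be computed directly from two counts. The sketch below assumes the usual form of the definition, floor(0.5 − log2(ratio)); the function name `frequency_class` is chosen here for illustration, and the counts are those quoted in this article.

```python
import math

def frequency_class(count, top_count):
    """Frequency class: floor(0.5 - log2(count / top_count))."""
    return math.floor(0.5 - math.log2(count / top_count))

TOP = 3_789_654  # occurrences of the most frequent item ("the")

print(frequency_class(TOP, TOP))        # the most frequent item itself
print(frequency_class(2_098_762, TOP))  # "he", roughly half as frequent
print(frequency_class(76, TOP))         # the "outragious" example
```

Because the scale is logarithmic, each step up in class corresponds to roughly a halving of frequency relative to the most common item.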

Frequency lists, together with semantic networks, are used to identify the least common, specialized terms to be replaced by their hypernyms in a process of semantic compression.

Pedagogy

These lists are not intended to be given directly to students, but rather to serve as a guideline for teachers and textbook authors (Nation 1997). Paul Nation's summary of modern language teaching encourages teachers first to "move from high frequency vocabulary and special purposes [thematic] vocabulary to low frequency vocabulary, then to teach learners strategies to sustain autonomous vocabulary expansion" (Nation 2006).

Effects of word frequency

Word frequency is known to have various effects (Brysbaert et al. 2011; Rudell 1993). Memorization is positively affected by higher word frequency, likely because the learner is exposed to the word more often (Laufer 1997). Lexical access is positively influenced by high word frequency, a phenomenon called the word frequency effect (Segui et al.). The effect of word frequency is related to the effect of age of acquisition, the age at which the word was learned.

Languages

Below is a review of available resources.

English

Word counting is an ancient field,[3] with known discussion dating back to Hellenistic times. In 1944, Edward Thorndike, Irvin Lorge and colleagues[4] hand-counted 18,000,000 running words to provide the first large-scale English language frequency list, before modern computers made such projects far easier (Nation 1997). Twentieth-century works all suffer from their age. In particular, they miss words relating to technology: "blog", which in 2014 ranked #7665 in frequency[5] in the Corpus of Contemporary American English,[6] was first attested in 1999[7][8][9] and does not appear in any of these three lists.

The Teacher's Word Book of 30,000 Words (Thorndike and Lorge, 1944)

The Teacher's Word Book contains 30,000 lemmas or ~13,000 word families (Goulden, Nation and Read, 1990). A corpus of 18 million written words was analysed by hand. The size of its source corpus increased its usefulness, but its age, and language changes, have reduced its applicability (Nation 1997).

The General Service List (West, 1953)

The General Service List contains 2,000 headwords divided into two sets of 1,000 words. A corpus of 5 million written words was analyzed in the 1940s. The rate of occurrence (%) for the different meanings and parts of speech of each headword is provided. Various criteria, other than frequency and range, were carefully applied to the corpus. Thus, despite its age, some errors, and its corpus being entirely written text, it is still an excellent database of word frequency, frequency of meanings, and reduction of noise (Nation 1997). This list was updated in 2013 by Dr. Charles Browne, Dr. Brent Culligan and Joseph Phillips as the New General Service List.

The American Heritage Word Frequency Book (Carroll, Davies and Richman, 1971)

This book is based on a corpus of 5 million running words from written texts used in United States schools (various grades, various subject areas). Its value lies in its focus on school teaching materials, and in its tagging of each word with its frequency in each school grade and in each subject area (Nation 1997).

The Brown (Francis and Kucera, 1982), LOB and related corpora

These corpora now contain 1 million words of written text representing different dialects of English. They are used as sources for producing frequency lists (Nation 1997).

French

Traditional datasets

A review has been made by New & Pallier. An attempt was made in the 1950s–60s with the Français fondamental. It includes the F.F.1 list with 1,500 high-frequency words, completed by a later F.F.2 list with 1,700 mid-frequency words, and the most used syntax rules.[10] It is claimed that 70 grammatical words make up 50% of communicative sentences,[11][12] while 3,680 words provide about 95–98% coverage.[13] A list of 3,000 frequent words is available.[14]
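
Coverage figures of this kind are cumulative shares of running words. A minimal sketch of the computation is shown here; the helper names `coverage` and `words_needed` and the toy counts are made up for illustration and are not taken from any French corpus.

```python
def coverage(counts, top_n):
    """Share of all running words covered by the top_n most frequent types."""
    ordered = sorted(counts, reverse=True)
    return sum(ordered[:top_n]) / sum(ordered)

def words_needed(counts, target):
    """Smallest number of top-ranked types whose coverage reaches target."""
    ordered = sorted(counts, reverse=True)
    total = sum(ordered)
    running = 0
    for n, c in enumerate(ordered, start=1):
        running += c
        if running >= target * total:
            return n
    return len(ordered)

# Made-up occurrence counts for eight word types (100 running words in total).
toy_counts = [50, 20, 10, 8, 5, 4, 2, 1]
print(coverage(toy_counts, 1))
print(words_needed(toy_counts, 0.95))
```

In this toy corpus a single type covers half of the running words, which mirrors how a small set of grammatical words can account for a large share of a text.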

The French Ministry of Education also provides a ranked list of the 1,500 most frequent word families, compiled by the lexicologist Étienne Brunet.[15] Jean Baudot made a study on the model of the American Brown study, entitled "Fréquences d'utilisation des mots en français écrit contemporain".[16]

More recently, the Lexique3 project provides 142,000 French words, with orthography, phonetics, syllabification, part of speech, gender, number of occurrences in the source corpus, frequency rank, associated lexemes, etc., available under the open license CC-by-sa-4.0.[17]

Subtlex

Lexique3 is an ongoing project from which the SUBTLEX movement cited above originated. New et al. 2007 made a completely new count based on online film subtitles.

Spanish

There have been several studies of Spanish word frequency (Cuetos et al. 2011).[18]

Chinese

Chinese corpora have long been studied from the perspective of frequency lists. The historical way to learn Chinese vocabulary is based on character frequency (Allanic 2003). American sinologist John DeFrancis mentioned its importance for learning and teaching Chinese as a foreign language in Why Johnny Can't Read Chinese (DeFrancis 1966). As frequency toolkits, Da (Da 1998) and the Taiwanese Ministry of Education (TME 1997) provided large databases with frequency ranks for characters and words. The HSK list of 8,848 high- and medium-frequency words in the People's Republic of China, and the TOP list of about 8,600 common traditional Chinese words in the Republic of China (Taiwan), are two other lists of common Chinese words and characters. Following the SUBTLEX movement, Cai & Brysbaert 2010 made a rich study of Chinese word and character frequencies.

Other

Lists of the most frequently used words in different languages, based on Wikipedia or combined corpora, are also available.[19]

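See also

  • Letter frequency
  • Most common words in English
  • Long tail
  • Google Ngram Viewer (shows changes in word/phrase frequency and relative frequency over time)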

Notes

  1. ^ a b "Crr » Subtitle Word Frequencies".
  2. ^ Boada, Roger; Guasch, Marc; Haro, Juan; Demestre, Josep; Ferré, Pilar (1 February 2020). "SUBTLEX-CAT: Subtitle word frequencies and contextual diversity for Catalan". Behavior Research Methods. 52 (1): 360–375. doi:10.3758/s13428-019-01233-1. ISSN 1554-3528. PMID 30895456. S2CID 84843788.
  3. ^ Bontrager, Terry (1 April 1991). "The Development of Word Frequency Lists Prior to the 1944 Thorndike‐Lorge List". Reading Psychology. 12 (2): 91–116. doi:10.1080/0270271910120201. ISSN 0270-2711.
  4. ^ "APA PsycNet". psycnet.apa.org. Retrieved 2023-05-15.
  5. ^ "Words and phrases: Frequency, genres, collocates, concordances, synonyms, and WordNet".
  6. ^ "Corpus of Contemporary American English (COCA)".
  7. ^ "It's the links, stupid". The Economist. 20 April 2006. Retrieved 2008-06-05.
  8. ^ Merholz, Peter (1999). "Peterme.com". Internet Archive. Archived from the original on 1999-10-13. Retrieved 2008-06-05.
  9. ^ Kottke, Jason (26 August 2003). "kottke.org". Retrieved 2008-06-05.
  10. ^ "Le français fondamental". Archived from the original on 2010-07-04.
  11. ^ Ouzoulias, André (2004), Comprendre et aider les enfants en difficulté scolaire: Le Vocabulaire fondamental, 70 mots essentiels (PDF), Retz - Citing V.A.C Henmon (dead link, no Internet Archive copy, August 10, 2023)
  12. ^ Liste des "70 mots essentiels" recensés par V.A.C. Henmon
  13. ^ "Generalities".
  14. ^ "PDF 3000 French words".
  15. ^ "Maitrise de la langue à l'école: Vocabulaire". Ministère de l'éducation nationale.
  16. ^ Baudot, J. (1992), Fréquences d'utilisation des mots en français écrit contemporain, Presses de L'Université, ISBN 978-2-7606-1563-2
  17. ^ "Lexique".
  18. ^ "Spanish word frequency lists". Vocabularywiki.pbworks.com.
  19. ^ Most frequently used words in different languages, ezglot

References

Theoretical concepts

  • Nation, P. (1997), "Vocabulary size, text coverage, and word lists", in Schmitt; McCarthy (eds.), Vocabulary: Description, Acquisition and Pedagogy, Cambridge: Cambridge University Press, pp. 6–19, ISBN 978-0-521-58551-4
  • Laufer, B. (1997), "What's in a word that makes it hard or easy? Some intralexical factors that affect the learning of words.", Vocabulary: Description, Acquisition and Pedagogy, Cambridge: Cambridge University Press, pp. 140–155, ISBN 9780521585514
  • Nation, P. (2006), "Language Education - Vocabulary", Encyclopedia of Language & Linguistics, Oxford: 494–499, doi:10.1016/B0-08-044854-2/00678-7, ISBN 9780080448541.
  • Brysbaert, Marc; Buchmeier, Matthias; Conrad, Markus; Jacobs, Arthur M.; Bölte, Jens; Böhl, Andrea (2011). "The word frequency effect: a review of recent developments and implications for the choice of frequency estimates in German". Experimental Psychology. 58 (5): 412–424. doi:10.1027/1618-3169/a000123. PMID 21768069. database
  • Rudell, A.P. (1993), "Frequency of word usage and perceived word difficulty : Ratings of Kucera and Francis words", Most, vol. 25, pp. 455–463
  • Segui, J.; Mehler, Jacques; Frauenfelder, Uli; Morton, John (1982), "The word frequency effect and lexical access", Neuropsychologia, 20 (6): 615–627, doi:10.1016/0028-3932(82)90061-6, PMID 7162585, S2CID 39694258
  • Meier, Helmut (1967), Deutsche Sprachstatistik, Hildesheim: Olms (frequency list of German words)
  • DeFrancis, John (1966), Why Johnny can't read Chinese
  • Allanic, Bernard (2003), The corpus of characters and their pedagogical aspect in ancient and contemporary China (fr: Les corpus de caractères et leur dimension pédagogique dans la Chine ancienne et contemporaine) (These de doctorat), Paris: INALCO

Written texts-based databases

  • Da, Jun (1998), Jun Da: Chinese text computing, retrieved 2010-08-21.
  • Taiwan Ministry of Education (1997), 八十六年常用語詞調查報告書, retrieved 2010-08-21.
  • New, Boris; Pallier, Christophe, Manuel de Lexique 3 (in French) (3.01 ed.).
  • Gimenes, Manuel; New, Boris (2016), "Worldlex: Twitter and blog word frequencies for 66 languages", Behavior Research Methods, 48 (3): 963–972, doi:10.3758/s13428-015-0621-0, ISSN 1554-3528, PMID 26170053.

SUBTLEX movement

  • New, B.; Brysbaert, M.; Veronis, J.; Pallier, C. (2007). "SUBTLEX-FR: The use of film subtitles to estimate word frequencies" (PDF). Applied Psycholinguistics. 28 (4): 661. doi:10.1017/s014271640707035x. hdl:1854/LU-599589. S2CID 145366468. Archived from the original (PDF) on 2016-10-24.
  • Brysbaert, Marc; New, Boris (2009), "Moving beyond Kucera and Francis: a critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English" (PDF), Behavior Research Methods, 41 (4): 977–990, doi:10.3758/brm.41.4.977, PMID 19897807, S2CID 4792474
  • Keuleers, E.; Brysbaert, M.; New, B. (2010), "SUBTLEX-NL: A new measure for Dutch word frequency based on film subtitles", Behavior Research Methods, 42 (3): 643–650, doi:10.3758/brm.42.3.643, PMID 20805586
  • Cai, Q.; Brysbaert, M. (2010), "SUBTLEX-CH: Chinese Word and Character Frequencies Based on Film Subtitles", PLOS ONE, 5 (6): 8, Bibcode:2010PLoSO...510729C, doi:10.1371/journal.pone.0010729, PMC 2880003, PMID 20532192
  • Cuetos, F.; Glez-nosti, Maria; Barbón, Analía; Brysbaert, Marc (2011), "SUBTLEX-ESP : Spanish word frequencies based on film subtitles" (PDF), Psicológica, 32: 133–143
  • Dimitropoulou, M.; Duñabeitia, Jon Andoni; Avilés, Alberto; Corral, José; Carreiras, Manuel (2010), "SUBTLEX-GR: Subtitle-Based Word Frequencies as the Best Estimate of Reading Behavior: The Case of Greek", Frontiers in Psychology, 1 (December): 12, doi:10.3389/fpsyg.2010.00218, PMC 3153823, PMID 21833273
  • Pham, H.; Bolger, P.; Baayen, R.H. (2011), "SUBTLEX-VIE : A Measure for Vietnamese Word and Character Frequencies on Film Subtitles", ACOL
  • Brysbaert, M.; New, Boris; Keuleers, E. (2012), "SUBTLEX-US : Adding Part of Speech Information to the SUBTLEXus Word Frequencies" (PDF), Behavior Research Methods: 1–22 (databases)
  • Mandera, P.; Keuleers, E.; Wodniecka, Z.; Brysbaert, M. (2014). "Subtlex-pl: subtitle-based word frequency estimates for Polish" (PDF). Behav Res Methods. 47 (2): 471–483. doi:10.3758/s13428-014-0489-4. PMID 24942246. S2CID 2334688.
  • Tang, K. (2012), "A 61 million word corpus of Brazilian Portuguese film subtitles as a resource for linguistic research", UCL Work Pap Linguist (24): 208–214
  • Avdyli, Rrezarta; Cuetos, Fernando (June 2013), "SUBTLEX- AL: Albanian word frequencies based on film subtitles", ILIRIA International Review, 3 (1): 285–292, doi:10.21113/iir.v3i1.112, ISSN 2365-8592
  • Soares, Ana Paula; Machado, João; Costa, Ana; Iriarte, Álvaro; Simões, Alberto; de Almeida, José João; Comesaña, Montserrat; Perea, Manuel (April 2015), "On the advantages of word frequency and contextual diversity measures extracted from subtitles: The case of Portuguese", The Quarterly Journal of Experimental Psychology, 68 (4): 680–696, doi:10.1080/17470218.2014.964271, PMID 25263599, S2CID 5376519
