Automated Similarity Judgment Program

The Automated Similarity Judgment Program (ASJP) is a collaborative project applying computational approaches to comparative linguistics using a database of word lists. The database is open access and consists of 40-item basic-vocabulary lists for well over half of the world's languages.[1] It is continuously being expanded. In addition to isolates and languages of demonstrated genealogical groups, the database includes pidgins, creoles, mixed languages, and constructed languages. Words in the database are transcribed into a simplified standard orthography (ASJPcode).[2] The database has been used to estimate the dates at which language families diverged into daughter languages by a method related to, but distinct from, glottochronology,[3] to determine the homeland (Urheimat) of a proto-language,[4] to investigate sound symbolism,[5] to evaluate different phylogenetic methods,[6] and for several other purposes.

Automated Similarity Judgment Program
Producer: Max Planck Institute for the Science of Human History (Germany)
Languages: English
Access
Cost: Free
Coverage
Disciplines: Quantitative comparative linguistics
Links
Website: asjp.clld.org

ASJP is not widely accepted among historical linguists as an adequate method to establish or evaluate relationships between language families.[7]

It is part of the Cross-Linguistic Linked Data project hosted by the Max Planck Institute for the Science of Human History.[8]

History

Original goals

ASJP was originally developed as a means of objectively evaluating the similarity of words with the same meaning from different languages, with the ultimate goal of classifying languages computationally on the basis of the lexical similarities observed. In the first ASJP paper,[2] two semantically identical words from the compared languages were judged similar if they showed at least two identical sound segments. The similarity between two languages was calculated as the percentage of the compared words that were judged similar. This method was applied to 100-item word lists for 250 languages from language families including Austroasiatic, Indo-European, Mayan, and Muskogean.
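
The judgment rule described above can be sketched roughly as follows. This is a minimal illustration, not the project's original software. It assumes that each word is a string with one character per sound segment, and it reads "at least two identical sound segments" as two shared symbols counted with multiplicity and regardless of position, which is an assumption of this sketch. The function names and the sample words are hypothetical.

```python
from __future__ import annotations

from collections import Counter


def shared_segments(word_a: str, word_b: str) -> int:
    """Count the segments the two words have in common, with multiplicity."""
    common = Counter(word_a) & Counter(word_b)
    return sum(common.values())


def judged_similar(word_a: str, word_b: str, threshold: int = 2) -> bool:
    """Assumed reading of the rule: similar if >= `threshold` shared segments."""
    return shared_segments(word_a, word_b) >= threshold


def similarity_percentage(list_a: dict[str, str], list_b: dict[str, str]) -> float:
    """Percentage of the compared meanings whose word pair is judged similar."""
    meanings = list_a.keys() & list_b.keys()
    if not meanings:
        raise ValueError("the two word lists share no meanings")
    hits = sum(judged_similar(list_a[m], list_b[m]) for m in meanings)
    return 100.0 * hits / len(meanings)


# Invented word lists, for illustration only.
lang_x = {"water": "wat3r", "fire": "fair", "stone": "ston", "dog": "dog"}
lang_y = {"water": "vas3r", "fire": "foy3r", "stone": "Stain", "dog": "hunt"}
score = similarity_percentage(lang_x, lang_y)
```

Only meanings attested in both lists are compared, so a missing item lowers the number of comparisons rather than counting as a mismatch.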

ASJP Consortium

The ASJP Consortium, founded around 2008, came to involve around 25 professional linguists and other interested parties working as volunteer transcribers or assisting the project in other ways. The main driving force behind the founding of the consortium was Cecil H. Brown. Søren Wichmann is the daily curator of the project. A third central member of the consortium is Eric W. Holman, who has created most of the software used in the project.

Shorter word lists

While the word lists were originally based on the 100-item Swadesh list, it was statistically determined that a subset of 40 of the 100 items produced classificatory results just as good as, if not slightly better than, those of the whole list.[9] Word lists gathered subsequently therefore contain only 40 items (or fewer, when attestations for some are lacking).

Levenshtein distance

In papers published since 2008, ASJP has employed a similarity judgment program based on Levenshtein distance (LD). This approach was found to produce better classificatory results, measured against expert opinion, than the method used initially. LD is defined as the minimum number of successive changes necessary to convert one word into another, where each change is the insertion, deletion, or substitution of a symbol. Differences in word length can be corrected for by dividing LD by the number of symbols in the longer of the two compared words, which produces the normalized LD (LDN). The LDN divided (LDND) between two languages is then calculated by dividing the average LDN for all word pairs with the same meaning by the average LDN for all word pairs with different meanings. This second normalization is intended to correct for chance similarity.[10]
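
The three measures defined above can be sketched as follows. This is a minimal illustration under stated assumptions, not the ASJP software itself: each word is a plain string with one character per symbol, each language has exactly one word per meaning, and only meanings present in both lists are compared. The function names `levenshtein`, `ldn`, and `ldnd` are chosen for this sketch.

```python
from __future__ import annotations


def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def ldn(a: str, b: str) -> float:
    """LD divided by the length of the longer word."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def ldnd(list_a: dict[str, str], list_b: dict[str, str]) -> float:
    """Mean LDN over same-meaning pairs divided by mean LDN over different-meaning pairs."""
    meanings = sorted(list_a.keys() & list_b.keys())
    same = [ldn(list_a[m], list_b[m]) for m in meanings]
    diff = [ldn(list_a[m], list_b[n]) for m in meanings for n in meanings if m != n]
    if not same or not diff:
        raise ValueError("need at least two shared meanings")
    mean_diff = sum(diff) / len(diff)
    if mean_diff == 0:
        raise ValueError("different-meaning pairs are all identical")
    return (sum(same) / len(same)) / mean_diff
```

For two identical word lists with distinct words, every same-meaning LDN is 0, so the LDND is 0. Values near 1 indicate that words with the same meaning are about as different as words with unrelated meanings.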

Word list

ASJP uses the following 40-word list.[11] It is similar to, but not identical with, the Swadesh–Yakhontov list.

Body parts
  • eye
  • ear
  • nose
  • tongue
  • tooth
  • hand
  • knee
  • blood
  • bone
  • breast (woman’s)
  • liver
  • skin
Animals and plants
  • louse
  • dog
  • fish (noun)
  • horn (animal part)
  • tree
  • leaf
People
  • person
  • name (noun)
Nature
  • sun
  • star
  • water
  • fire
  • stone
  • path
  • mountain
  • night (dark time)
Verbs and adjectives
  • drink (verb)
  • die
  • see
  • hear
  • come
  • new
  • full
Numerals and pronouns
  • one
  • two
  • I
  • you
  • we

ASJPcode

The ASJP version from 2016 uses the following symbols to encode phonemes: p b f v m w 8 t d s z c n r l S Z C j T 5 y k g x N q X h 7 L 4 G ! i e E 3 a u o

They represent 7 vowels and 34 consonants, all found on the standard QWERTY keyboard.
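
The inventory above can be written down directly and used to check a transcription. The following is a small sketch. The constant names and the helper `unknown_symbols` are hypothetical, and the check covers only the 41 base symbols, so any additional mark used in a transcription would be reported as unknown.

```python
from __future__ import annotations

# The 7 vowel and 34 consonant symbols listed above.
ASJP_VOWELS = frozenset("ieE3auo")
ASJP_CONSONANTS = frozenset("pbfvmw8tdszcnrlSZCjT5ykgxNqXh7L4G!")
ASJP_SYMBOLS = ASJP_VOWELS | ASJP_CONSONANTS


def unknown_symbols(word: str) -> list[str]:
    """Return the characters of `word` that are not base ASJPcode symbols."""
    return [ch for ch in word if ch not in ASJP_SYMBOLS]


# An IPA superscript is not part of the inventory and is reported as unknown.
leftover = unknown_symbols("kʷat")
```

A check like this is mainly useful for catching IPA characters that were left in a word list by mistake.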

Sounds represented by ASJPcode[2]

ASJPcode | Description | IPA
i | high front vowel, rounded and unrounded | i, ɪ, y, ʏ
e | mid front vowel, rounded and unrounded | e, ø
E | low front vowel, rounded and unrounded | a, æ, ɛ, ɶ, œ, e
3 | high and mid central vowel, rounded and unrounded | ɨ, ɘ, ə, ɜ, ʉ, ɵ, ɞ
a | low central vowel, unrounded | ɐ
u | high back vowel, rounded and unrounded | ɯ, u, ɑ
o | mid and low back vowel, rounded and unrounded | ɤ, ʌ, ɑ, o, ɔ, ɒ
p | voiceless bilabial stop and fricative | p, ɸ
b | voiced bilabial stop and fricative | b, β
m | bilabial nasal | m
f | voiceless labiodental fricative | f
v | voiced labiodental fricative | v
8 | voiceless and voiced dental fricative | θ, ð
4 | dental nasal | n̪
t | voiceless alveolar stop | t
d | voiced alveolar stop | d
s | voiceless alveolar fricative | s
z | voiced alveolar fricative | z
c | voiceless and voiced alveolar affricate | t͡s, d͡z
n | voiceless and voiced alveolar nasal | n
S | voiceless postalveolar fricative | ʃ
Z | voiced postalveolar fricative | ʒ
C | voiceless palato-alveolar affricate | t͡ʃ
j | voiced palato-alveolar affricate | d͡ʒ
T | voiceless and voiced palatal stop | c, ɟ
5 | palatal nasal | ɲ
k | voiceless velar stop | k
g | voiced velar stop | ɡ
x | voiceless and voiced velar fricative | x, ɣ
N | velar nasal | ŋ
q | voiceless uvular stop | q
G | voiced uvular stop | ɢ
X | voiceless and voiced uvular fricative, voiceless and voiced pharyngeal fricative | χ, ʁ, ħ, ʕ
7 | voiceless glottal stop | ʔ
h | voiceless and voiced glottal fricative | h, ɦ
l | voiced alveolar lateral approximant | l
L | all other laterals | ʟ, ɭ, ʎ
w | voiced bilabial-velar approximant | w
y | palatal approximant | j
r | voiced apico-alveolar trill and all varieties of “r-sounds” | r, ʀ, etc.
! | all varieties of “click-sounds” | ǃ, ǀ, ǁ, ǂ

A ~ mark follows two consonants so that they are considered to be in the same position. Thus, kʷat becomes kw~at. Syllables like kat, wat, kaw and kwi are considered lexically similar to kw~at.

Similarly, a $ mark follows three consonants so that they are considered to be in the same position. ndy$im is considered similar to nim, dam and yim.
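
The grouping effect of the ~ and $ marks can be sketched as below. This is an illustrative reading of the two rules just described, not the project's own code. The function name `asjp_positions` is hypothetical, other marks are not handled, and the sketch shows only how symbols are grouped into positions, not how the ASJP software compares the resulting groups.

```python
from __future__ import annotations

# Number of preceding symbols that each mark folds into a single position.
GROUP_SIZES = {"~": 2, "$": 3}


def asjp_positions(word: str) -> list[str]:
    """Split an ASJPcode word into positions, merging symbols marked by ~ or $."""
    positions: list[str] = []
    for ch in word:
        size = GROUP_SIZES.get(ch)
        if size is None:
            positions.append(ch)  # ordinary symbol: one position of its own
            continue
        if len(positions) < size:
            raise ValueError(f"{ch!r} needs {size} preceding symbols in {word!r}")
        merged = "".join(positions[-size:])
        del positions[-size:]
        positions.append(merged)  # the marked symbols now share one position
    return positions


kw_at = asjp_positions("kw~at")
ndy_im = asjp_positions("ndy$im")
```

Under this reading, kw~at and kat both occupy three positions, which is consistent with the two being treated as close matches.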

" marks the preceding consonant as glottalized.

See also

  • Lexicostatistics
  • Historical linguistics

References

  1. Wichmann, Søren, André Müller, Annkathrin Wett, Viveka Velupillai, Julia Bischoffberger, Cecil H. Brown, Eric W. Holman, Sebastian Sauppe, Zarina Molochieva, Pamela Brown, Harald Hammarström, Oleg Belyaev, Johann-Mattis List, Dik Bakker, Dmitry Egorov, Matthias Urban, Robert Mailhammer, Agustina Carrizo, Matthew S. Dryer, Evgenia Korovina, David Beck, Helen Geyer, Patience Epps, Anthony Grant, and Pilar Valenzuela. 2013. The ASJP Database (version 16). http://asjp.clld.org/
  2. Brown, Cecil H., Eric W. Holman, Søren Wichmann, and Viveka Velupillai. 2008. Automated classification of the world's languages: A description of the method and preliminary results. STUF – Language Typology and Universals 61.4: 285–308.
  3. Holman, Eric W., Cecil H. Brown, Søren Wichmann, André Müller, Viveka Velupillai, Harald Hammarström, Sebastian Sauppe, Hagen Jung, Dik Bakker, Pamela Brown, Oleg Belyaev, Matthias Urban, Robert Mailhammer, Johann-Mattis List, and Dmitry Egorov. 2011. Automated dating of the world’s language families based on lexical similarity. Current Anthropology 52.6: 841–875.
  4. Wichmann, Søren, André Müller, and Viveka Velupillai. 2010. Homelands of the world’s language families: A quantitative approach. Diachronica 27.2: 247–276.
  5. Wichmann, Søren, Eric W. Holman, and Cecil H. Brown. 2010. Sound symbolism in basic vocabulary. Entropy 12.4: 844–858.
  6. Pompei, Simone, Vittorio Loreto, and Francesca Tria. 2011. On the accuracy of language trees. PLoS ONE 6: e20109.
  7. Cf. comments by Adelaar, Blust and Campbell in Holman, Eric W., et al. 2011. Automated dating of the world’s language families based on lexical similarity. Current Anthropology 52.6: 841–875.
  8. "Cross-Linguistic Linked Data". Retrieved February 22, 2020.
  9. Holman, Eric W., Søren Wichmann, Cecil H. Brown, Viveka Velupillai, André Müller, and Dik Bakker. 2008. Explorations in automated language classification. Folia Linguistica 42.2: 331–354.
  10. Wichmann, Søren, Eric W. Holman, Dik Bakker, and Cecil H. Brown. 2010. Evaluating linguistic distance measures. Physica A 389: 3632–3639 (doi:10.1016/j.physa.2010.05.011).
  11. http://asjp.clld.org/static/Guidelines.pdf

Sources

  • Søren Wichmann, Jeff Good (eds). 2014. Quantifying Language Dynamics: On the Cutting Edge of Areal and Phylogenetic Linguistics, p. 203. Leiden: Brill.
  • Brown, Cecil H., et al. 2008. Automated Classification of the World's Languages: A Description of the Method and Preliminary Results. Language Typology and Universals 61(4). November 2008. doi:10.1524/stuf.2008.0026
  • Wichmann, Søren, Eric W. Holman, and Cecil H. Brown (eds.). 2018. The ASJP Database (version 18).

External links

  • ASJP Database official home page
