Pointwise mutual information

In statistics, probability theory and information theory, pointwise mutual information (PMI),[1] or point mutual information, is a measure of association. It compares the probability of two events occurring together to what this probability would be if the events were independent.[2]

PMI (especially in its positive pointwise mutual information variant) has been described as "one of the most important concepts in NLP", where it "draws on the intuition that the best way to weigh the association between two words is to ask how much more the two words co-occur in [a] corpus than we would have a priori expected them to appear by chance."[2]

The concept was introduced in 1961 by Robert Fano under the name of "mutual information", but today that term is instead used for a related measure of dependence between random variables:[2] The mutual information (MI) of two discrete random variables refers to the average PMI of all possible events.

Definition

The PMI of a pair of outcomes x and y belonging to discrete random variables X and Y quantifies the discrepancy between the probability of their coincidence given their joint distribution and their individual distributions, assuming independence. Mathematically:[2]

\operatorname{pmi}(x;y) \equiv \log_2 \frac{p(x,y)}{p(x)\,p(y)} = \log_2 \frac{p(x\mid y)}{p(x)} = \log_2 \frac{p(y\mid x)}{p(y)}

(with the latter two expressions being equal to the first by Bayes' theorem). The mutual information (MI) of the random variables X and Y is the expected value of the PMI (over all possible outcomes).

The measure is symmetric (pmi(x;y) = pmi(y;x)). It can take positive or negative values, but is zero if X and Y are independent. Note that even though PMI may be negative or positive, its expected value over all joint events (MI) is non-negative. PMI is maximized when X and Y are perfectly associated (i.e. p(x|y) = 1 or p(y|x) = 1), yielding the following bounds:

-\infty \leq \operatorname{pmi}(x;y) \leq \min\left(-\log_2 p(x),\ -\log_2 p(y)\right)

Finally, pmi(x;y) will increase if p(x|y) is fixed but p(x) decreases.

Here is an example to illustrate:

x   y   p(x,y)
0   0   0.1
0   1   0.7
1   0   0.15
1   1   0.05

Using this table we can marginalize to get the following additional table for the individual distributions:

value   p(x)   p(y)
0       0.8    0.25
1       0.2    0.75

With this example, we can compute four values for pmi(x;y). Using base-2 logarithms:

pmi(x=0;y=0) = −1
pmi(x=0;y=1) = 0.222392
pmi(x=1;y=0) = 1.584963
pmi(x=1;y=1) = −1.584963

(For reference, the mutual information I(X;Y) would then be 0.2141709.)
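
The calculation above can be reproduced with a few lines of Python; the sketch below is only an illustration of the definition, using the base-2 logarithm from the standard library.

    from math import log2

    # Joint distribution p(x, y) from the example table above.
    p_xy = {(0, 0): 0.1, (0, 1): 0.7, (1, 0): 0.15, (1, 1): 0.05}

    # Marginal distributions obtained by summing out the other variable.
    p_x = {x: sum(p for (xi, _), p in p_xy.items() if xi == x) for x in (0, 1)}
    p_y = {y: sum(p for (_, yi), p in p_xy.items() if yi == y) for y in (0, 1)}

    def pmi(x, y):
        """Pointwise mutual information in bits (base-2 logarithm)."""
        return log2(p_xy[(x, y)] / (p_x[x] * p_y[y]))

    for x, y in p_xy:
        print(f"pmi(x={x};y={y}) = {pmi(x, y):.6f}")

    # Mutual information is the expectation of PMI under the joint distribution.
    mi = sum(p * pmi(x, y) for (x, y), p in p_xy.items())
    print(f"I(X;Y) = {mi:.7f}")  # ≈ 0.2141709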

Similarities to mutual information

Pointwise mutual information has many of the same relationships as mutual information. In particular,

\begin{aligned} \operatorname{pmi}(x;y) &= h(x) + h(y) - h(x,y) \\ &= h(x) - h(x\mid y) \\ &= h(y) - h(y\mid x) \end{aligned}

where h(x) is the self-information, or −log₂ p(x).
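
These identities can be checked numerically against the worked example above; the short sketch below verifies the first one for the outcome x = 0, y = 0.

    from math import log2

    h = lambda p: -log2(p)  # self-information of an outcome with probability p

    # From the example: p(x=0) = 0.8, p(y=0) = 0.25, p(x=0, y=0) = 0.1
    print(h(0.8) + h(0.25) - h(0.1))  # -1.0, matching pmi(x=0; y=0)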

Variants

Several variations of PMI have been proposed, in particular to address what has been described as its "two main limitations":[3]

  1. PMI can take both positive and negative values and has no fixed bounds, which makes it harder to interpret.[3]
  2. PMI has "a well-known tendency to give higher scores to low-frequency events", but in applications such as measuring word similarity, it is preferable to have "a higher score for pairs of words whose relatedness is supported by more evidence."[3]

Positive PMI

The positive pointwise mutual information (PPMI) measure is defined by setting negative values of PMI to zero:[2]

\operatorname{ppmi}(x;y) \equiv \max\left(\log_2 \frac{p(x,y)}{p(x)\,p(y)},\ 0\right)

This definition is motivated by the observation that "negative PMI values (which imply things are co-occurring less often than we would expect by chance) tend to be unreliable unless our corpora are enormous" and also by a concern that "it's not clear whether it's even possible to evaluate such scores of 'unrelatedness' with human judgment".[2] It also avoids having to deal with −∞ values for events that never occur together (p(x,y) = 0), by setting PPMI for these to 0.[2]
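
In NLP practice PPMI is commonly computed over a whole word-context co-occurrence matrix at once. The following NumPy sketch shows one possible layout (the matrix shape and variable names are illustrative assumptions, not a standard API):

    import numpy as np

    def ppmi_matrix(counts):
        """Positive PMI from a (words x contexts) co-occurrence count matrix.

        Cells that never co-occur would give pmi = -inf; PPMI clips all
        negative values (including -inf) to 0.
        """
        total = counts.sum()
        p_xy = counts / total                  # joint probabilities
        p_x = p_xy.sum(axis=1, keepdims=True)  # row (word) marginals
        p_y = p_xy.sum(axis=0, keepdims=True)  # column (context) marginals
        with np.errstate(divide="ignore"):     # log2(0) -> -inf is expected here
            pmi = np.log2(p_xy / (p_x * p_y))
        return np.maximum(pmi, 0)

    counts = np.array([[10.0, 0.0], [2.0, 8.0]])
    print(ppmi_matrix(counts))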

Normalized pointwise mutual information (npmi)

Pointwise mutual information can be normalized to the interval [−1, +1], resulting in −1 (in the limit) for never occurring together, 0 for independence, and +1 for complete co-occurrence.[4]

\operatorname{npmi}(x;y) = \frac{\operatorname{pmi}(x;y)}{h(x,y)}

where h(x,y) is the joint self-information, −log₂ p(x,y).
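
A minimal sketch for a single pair of outcomes, assuming the probabilities have already been estimated; returning −1 when p(x,y) = 0 makes the "never occur together" limit explicit:

    from math import log2

    def npmi(p_xy, p_x, p_y):
        """Normalized PMI: -1 never co-occur, 0 independent, +1 complete co-occurrence."""
        if p_xy == 0:
            return -1.0                  # limiting value for events that never co-occur
        pmi = log2(p_xy / (p_x * p_y))
        h_xy = -log2(p_xy)               # joint self-information
        return pmi / h_xy

    print(npmi(0.25, 0.25, 0.25))  # complete co-occurrence: 1.0
    print(npmi(0.20, 0.40, 0.50))  # independence: 0.0
    print(npmi(0.0, 0.40, 0.50))   # never co-occur: -1.0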

PMIk family

The PMIk measure (for k=2, 3 etc.), which was introduced by Béatrice Daille around 1994, and as of 2011 was described as being "among the most widely used variants", is defined as[5][3]

\operatorname{pmi}^k(x;y) \equiv \log_2 \frac{p(x,y)^k}{p(x)\,p(y)} = \operatorname{pmi}(x;y) + (k-1)\log_2 p(x,y)

In particular, pmi^1(x;y) = pmi(x;y). The additional factors of p(x,y) inside the logarithm are intended to correct the bias of PMI towards low-frequency events, by boosting the scores of frequent pairs.[3] A 2011 case study demonstrated the success of PMI3 in correcting this bias on a corpus drawn from English Wikipedia. Taking x to be the word "football", its most strongly associated words y according to the PMI measure (i.e. those maximizing pmi(x;y)) were domain-specific ("midfielder", "cornerbacks", "goalkeepers"), whereas the terms ranked most highly by PMI3 were much more general ("league", "clubs", "england").[3]
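
A small sketch of the family, using probability values taken from the worked example earlier in the article; k = 1 recovers ordinary PMI:

    from math import log2

    def pmi_k(p_xy, p_x, p_y, k=2):
        """PMI^k: PMI with the joint probability raised to the k-th power."""
        return log2(p_xy ** k / (p_x * p_y))

    # Values for x = 1, y = 0 from the example table: p(x,y)=0.15, p(x)=0.2, p(y)=0.25
    print(pmi_k(0.15, 0.2, 0.25, k=1))  # = pmi(x;y) ≈ 1.585
    print(pmi_k(0.15, 0.2, 0.25, k=3))  # = pmi(x;y) + 2*log2(0.15) ≈ -3.889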

Chain-rule

Like mutual information,[6] pointwise mutual information follows the chain rule, that is,

\operatorname{pmi}(x;yz) = \operatorname{pmi}(x;y) + \operatorname{pmi}(x;z\mid y)

This is proven through application of Bayes' theorem:

\begin{aligned}
\operatorname{pmi}(x;y) + \operatorname{pmi}(x;z\mid y) &= \log \frac{p(x,y)}{p(x)\,p(y)} + \log \frac{p(x,z\mid y)}{p(x\mid y)\,p(z\mid y)} \\
&= \log \left[ \frac{p(x,y)}{p(x)\,p(y)} \cdot \frac{p(x,z\mid y)}{p(x\mid y)\,p(z\mid y)} \right] \\
&= \log \frac{p(x\mid y)\,p(y)\,p(x,z\mid y)}{p(x)\,p(y)\,p(x\mid y)\,p(z\mid y)} \\
&= \log \frac{p(x,yz)}{p(x)\,p(yz)} \\
&= \operatorname{pmi}(x;yz)
\end{aligned}
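
The identity can also be checked numerically on any joint distribution over three variables; the sketch below uses a small made-up distribution over three binary variables (the numbers are arbitrary and only need to sum to 1):

    from itertools import product
    from math import log2, isclose

    # Arbitrary joint distribution p(x, y, z) over three binary variables.
    p = {xyz: v for xyz, v in zip(
        product((0, 1), repeat=3),
        [0.10, 0.05, 0.20, 0.15, 0.05, 0.10, 0.05, 0.30])}

    def marg(keep):
        """Marginal over the variable positions listed in `keep`."""
        out = {}
        for xyz, v in p.items():
            key = tuple(xyz[i] for i in keep)
            out[key] = out.get(key, 0.0) + v
        return out

    p_x, p_y, p_xy, p_yz = marg([0]), marg([1]), marg([0, 1]), marg([1, 2])

    for x, y, z in p:
        pmi_x_yz = log2(p[(x, y, z)] / (p_x[(x,)] * p_yz[(y, z)]))
        pmi_x_y = log2(p_xy[(x, y)] / (p_x[(x,)] * p_y[(y,)]))
        # pmi(x; z | y), with the conditional probabilities written out explicitly.
        pmi_x_z_given_y = log2((p[(x, y, z)] / p_y[(y,)]) /
                               ((p_xy[(x, y)] / p_y[(y,)]) * (p_yz[(y, z)] / p_y[(y,)])))
        assert isclose(pmi_x_yz, pmi_x_y + pmi_x_z_given_y)
    print("chain rule verified on all 8 outcomes")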

Applications

PMI can be used in various disciplines, e.g. in information theory, linguistics, or chemistry (in profiling and analysis of chemical compounds).[7] In computational linguistics, PMI has been used for finding collocations and associations between words. For instance, counts of occurrences and co-occurrences of words in a text corpus can be used to approximate the probabilities p(x) and p(x,y) respectively. The following table shows counts of pairs of words obtaining the highest and lowest PMI scores in the first 50 million words of Wikipedia (dump of October 2015),[citation needed] filtering by 1,000 or more co-occurrences. The frequency of each count can be obtained by dividing its value by 50,000,952. (Note: natural log is used to calculate the PMI values in this example, instead of log base 2.)

word 1   word 2      count(word 1)   count(word 2)   co-occurrences   PMI
puerto   rico        1938            1311            1159             10.0349081703
hong     kong        2438            2694            2205             9.72831972408
los      angeles     3501            2808            2791             9.56067615065
carbon   dioxide     4265            1353            1032             9.09852946116
prize    laureate    5131            1676            1210             8.85870710982
san      francisco   5237            2477            1779             8.83305176711
nobel    prize       4098            5131            2498             8.68948811416
ice      hockey      5607            3002            1933             8.6555759741
star     trek        8264            1594            1489             8.63974676575
car      driver      5578            2749            1384             8.41470768304
it       the         283891          3293296         3347             -1.72037278119
are      of          234458          1761436         1019             -2.09254205335
this     the         199882          3293296         1211             -2.38612756961
is       of          565679          1761436         1562             -2.54614706831
and      of          1375396         1761436         2949             -2.79911817902
a        and         984442          1375396         1457             -2.92239510038
in       and         1187652         1375396         1537             -3.05660070757
to       and         1025659         1375396         1286             -3.08825363041
to       in          1025659         1187652         1066             -3.12911348956
of       and         1761436         1375396         1190             -3.70663100173

Good collocation pairs have high PMI because the probability of co-occurrence is only slightly lower than the probabilities of occurrence of each word. Conversely, a pair of words whose probabilities of occurrence are considerably higher than their probability of co-occurrence gets a small PMI score.
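
A rough idea of how such a table can be produced from raw text is sketched below, using only the Python standard library. The tokenization and the restriction to adjacent word pairs are simplifying assumptions for illustration, not the exact procedure used for the Wikipedia counts above.

    import math
    import re
    from collections import Counter

    def pmi_collocations(text, min_cooccurrences=2):
        """Score adjacent word pairs by PMI (natural log, as in the table above)."""
        words = re.findall(r"[a-z]+", text.lower())
        unigrams = Counter(words)
        bigrams = Counter(zip(words, words[1:]))
        n = len(words)

        scores = {}
        for (w1, w2), c in bigrams.items():
            if c < min_cooccurrences:
                continue
            # Estimate p(x), p(y) and p(x,y) from counts, then apply the PMI formula.
            scores[(w1, w2)] = math.log((c / n) / ((unigrams[w1] / n) * (unigrams[w2] / n)))
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

    sample = "new york is big and new york is busy and the city is big"
    print(pmi_collocations(sample))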

References

  1. ^ Kenneth Ward Church and Patrick Hanks (March 1990). "Word association norms, mutual information, and lexicography". Comput. Linguist. 16 (1): 22–29.
  2. ^ a b c d e f g Dan Jurafsky and James H. Martin: Speech and Language Processing (3rd ed. draft), December 29, 2021, chapter 6
  3. ^ a b c d e f François Role, Mohamed Nadif. Handling the Impact of Low-frequency Events on Co-occurrence-based Measures of Word Similarity: A Case Study of Pointwise Mutual Information. Proceedings of KDIR 2011: International Conference on Knowledge Discovery and Information Retrieval, Paris, October 26–29, 2011.
  4. ^ Bouma, Gerlof (2009). "Normalized (Pointwise) Mutual Information in Collocation Extraction" (PDF). Proceedings of the Biennial GSCL Conference.
  5. ^ B. Daille. Approche mixte pour l'extraction automatique de terminologie : statistiques lexicales et filtres linguistiques [Mixed approach for automatic terminology extraction: lexical statistics and linguistic filters]. Thèse de doctorat en Informatique Fondamentale, Université Paris 7, 1994. p. 139.
  6. ^ Paul L. Williams. Information Dynamics: Its Theory and Application to Embodied Cognitive Systems.
  7. ^ Čmelo, I.; Voršilák, M.; Svozil, D. (2021-01-10). "Profiling and analysis of chemical compounds using pointwise mutual information". Journal of Cheminformatics. 13 (1): 3. doi:10.1186/s13321-020-00483-y. ISSN 1758-2946. PMC 7798221. PMID 33423694.
  • Fano, R M (1961). "chapter 2". Transmission of Information: A Statistical Theory of Communications. MIT Press, Cambridge, MA. ISBN 978-0262561693.

External links

  • Demo at Rensselaer MSR Server (PMI values normalized to be between 0 and 1)
