fbpx
Wikipedia

SimRank

SimRank is a general similarity measure, based on a simple and intuitive graph-theoretic model. SimRank is applicable in any domain with object-to-object relationships, that measures similarity of the structural context in which objects occur, based on their relationships with other objects. Effectively, SimRank is a measure that says "two objects are considered to be similar if they are referenced by similar objects." Although SimRank is widely adopted, it may output unreasonable similarity scores which are influenced by different factors, and can be solved in several ways, such as introducing an evidence weight factor,[1] inserting additional terms that are neglected by SimRank[2] or using PageRank-based alternatives.[3]

Introduction edit

Many applications require a measure of "similarity" between objects. One obvious example is the "find-similar-document" query, on traditional text corpora or the World-Wide Web. More generally, a similarity measure can be used to cluster objects, such as for collaborative filtering in a recommender system, in which “similar” users and items are grouped based on the users’ preferences.

Various aspects of objects can be used to determine similarity, usually depending on the domain and the appropriate definition of similarity for that domain. In a document corpus, matching text may be used, and for collaborative filtering, similar users may be identified by common preferences. SimRank is a general approach that exploits the object-to-object relationships found in many domains of interest. On the Web, for example, two pages are related if there are hyperlinks between them. A similar approach can be applied to scientific papers and their citations, or to any other document corpus with cross-reference information. In the case of recommender systems, a user’s preference for an item constitutes a relationship between the user and the item. Such domains are naturally modeled as graphs, with nodes representing objects and edges representing relationships.

The intuition behind the SimRank algorithm is that, in many domains, similar objects are referenced by similar objects. More precisely, objects   and   are considered to be similar if they are pointed from objects   and  , respectively, and   and   are themselves similar. The base case is that objects are maximally similar to themselves .[4]

It is important to note that SimRank is a general algorithm that determines only the similarity of structural context. SimRank applies to any domain where there are enough relevant relationships between objects to base at least some notion of similarity on relationships. Obviously, similarity of other domain-specific aspects are important as well; these can — and should be combined with relational structural-context similarity for an overall similarity measure. For example, for Web pages SimRank can be combined with traditional textual similarity; the same idea applies to scientific papers or other document corpora. For recommendation systems, there may be built-in known similarities between items (e.g., both computers, both clothing, etc.), as well as similarities between users (e.g., same gender, same spending level). Again, these similarities can be combined with the similarity scores that are computed based on preference patterns, in order to produce an overall similarity measure.

Basic SimRank equation edit

For a node   in a directed graph, we denote by   and   the set of in-neighbors and out-neighbors of  , respectively. Individual in-neighbors are denoted as  , for  , and individual out-neighbors are denoted as  , for  .

Let us denote the similarity between objects   and   by  . Following the earlier motivation, a recursive equation is written for  . If   then   is defined to be  . Otherwise,

 

where   is a constant between   and  . A slight technicality here is that either   or   may not have any in-neighbors. Since there is no way to infer any similarity between   and   in this case, similarity is set to  , so the summation in the above equation is defined to be   when   or  .

Matrix representation of SimRank edit

Given an arbitrary constant   between   and  , let   be the similarity matrix whose entry   denotes the similarity score  , and   be the column normalized adjacency matrix whose entry   if there is an edge from   to  , and 0 otherwise. Then, in matrix notations, SimRank can be formulated as

 

where   is an identity matrix.

Computing SimRank edit

A solution to the SimRank equations for a graph   can be reached by iteration to a fixed-point. Let   be the number of nodes in  . For each iteration  , we can keep   entries  , where   gives the score between   and   on iteration  . We successively compute   based on  . We start with   where each   is a lower bound on the actual SimRank score  :

 

To compute   from  , we use the basic SimRank equation to get:

 

for  , and   for  . That is, on each iteration  , we update the similarity of   using the similarity scores of the neighbours of   from the previous iteration   according to the basic SimRank equation. The values   are nondecreasing as   increases. It was shown in [4] that the values converge to limits satisfying the basic SimRank equation, the SimRank scores  , i.e., for all  ,  .

The original SimRank proposal suggested choosing the decay factor   and a fixed number   of iterations to perform. However, the recent research [5] showed that the given values for   and   generally imply relatively low accuracy of iteratively computed SimRank scores. For guaranteeing more accurate computation results, the latter paper suggests either using a smaller decay factor (in particular,  ) or taking more iterations.

CoSimRank edit

CoSimRank is a variant of SimRank with the advantage of also having a local formulation, i.e. CoSimRank can be computed for a single node pair.[6] Let   be the similarity matrix whose entry   denotes the similarity score  , and   be the column normalized adjacency matrix. Then, in matrix notations, CoSimRank can be formulated as:

 

where   is an identity matrix. To compute the similarity score of only a single node pair, let  , with   being a vector of the standard basis, i.e., the  -th entry is 1 and all other entries are 0. Then, CoSimRank can be computed in two steps:

  1.  
  2.  

Step one can be seen a simplified version of Personalized PageRank. Step two sums up the vector similarity of each iteration. Both, matrix and local representation, compute the same similarity score. CoSimRank can also be used to compute the similarity of sets of nodes, by modifying  .

Further research on SimRank edit

  • Fogaras and Racz [7] suggested speeding up SimRank computation through probabilistic computation using the Monte Carlo method.
  • Antonellis et al.[8] extended SimRank equations to take into consideration (i) evidence factor for incident nodes and (ii) link weights.
  • Yu et al.[9] further improved SimRank computation via a fine-grained memoization method to share small common parts among different partial sums.
  • Chen and Giles discussed the limitations and proper use cases of SimRank.[3]

Partial Sums Memoization edit

Lizorkin et al.[5] proposed three optimization techniques for speeding up the computation of SimRank:

  1. Essential nodes selection may eliminate the computation of a fraction of node pairs with a-priori zero scores.
  2. Partial sums memoization can effectively reduce repeated calculations of the similarity among different node pairs by caching part of similarity summations for later reuse.
  3. A threshold setting on the similarity enables a further reduction in the number of node pairs to be computed.

In particular, the second observation of partial sums memoization plays a paramount role in greatly speeding up the computation of SimRank from   to  , where   is the number of iterations,   is average degree of a graph, and   is the number of nodes in a graph. The central idea of partial sums memoization consists of two steps:

First, the partial sums over   are memoized as

 

and then   is iteratively computed from   as

 

Consequently, the results of  ,  , can be reused later when we compute the similarities   for a given vertex   as the first argument.

See also edit

Citations edit

  1. ^ I. Antonellis, H. Garcia-Molina and C.-C. Chang. Simrank++: Query Rewriting through Link Analysis of the Click Graph. In VLDB '08: Proceedings of the 34th International Conference on Very Large Data Bases, pages 408--421. [1]
  2. ^ W. Yu, X. Lin, W. Zhang, L. Chang, and J. Pei. More is Simpler: Effectively and Efficiently Assessing Node-Pair Similarities Based on Hyperlinks. In VLDB '13: Proceedings of the 39th International Conference on Very Large Data Bases, pages 13--24. [2]
  3. ^ a b H. Chen, and C. L. Giles. "ASCOS++: An Asymmetric Similarity Measure for Weighted Networks to Address the Problem of SimRank." ACM Transactions on Knowledge Discovery from Data (TKDD) 10.2 2015.[3]
  4. ^ a b G. Jeh and J. Widom. SimRank: A Measure of Structural-Context Similarity. In KDD'02: Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 538-543. ACM Press, 2002. (PDF). Archived from the original (PDF) on 2008-05-12. Retrieved 2008-10-02.{{cite web}}: CS1 maint: archived copy as title (link)
  5. ^ a b D. Lizorkin, P. Velikhov, M. Grinev and D. Turdakov. Accuracy Estimate and Optimization Techniques for SimRank Computation. In VLDB '08: Proceedings of the 34th International Conference on Very Large Data Bases, pages 422--433. (PDF). Archived from the original (PDF) on 2009-04-07. Retrieved 2008-10-25.{{cite web}}: CS1 maint: archived copy as title (link)
  6. ^ S. Rothe and H. Schütze. CoSimRank: A Flexible & Efficient Graph-Theoretic Similarity Measure. In ACL '14: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1392-1402 . [4]
  7. ^ D. Fogaras and B. Racz. Scaling link-based similarity search. In WWW '05: Proceedings of the 14th international conference on World Wide Web, pages 641--650, New York, NY, USA, 2005. ACM.
  8. ^ Antonellis, Ioannis, Hector Garcia Molina, and Chi Chao Chang. "Simrank++: query rewriting through link analysis of the click graph." Proceedings of the VLDB Endowment 1.1 (2008): 408-421. arXiv:0712.0499
  9. ^ W. Yu, X. Lin, W. Zhang. Towards Efficient SimRank Computation on Large Networks. In ICDE '13: Proceedings of the 29th IEEE International Conference on Data Engineering, pages 601--612. (PDF). Archived from the original (PDF) on 2014-05-12. Retrieved 2014-05-09.{{cite web}}: CS1 maint: archived copy as title (link)

Sources edit

  • Cai, Y.; Cong, G.; Jia, X.; Liu, H.; He, J.; Lu, J.; Du, X. (2009-12-01). "Efficient Algorithm for Computing Link-Based Similarity in Real World Networks". 2009 Ninth IEEE International Conference on Data Mining. pp. 734–739. doi:10.1109/ICDM.2009.136. ISBN 978-1-4244-5242-2. S2CID 9799597.

simrank, general, similarity, measure, based, simple, intuitive, graph, theoretic, model, applicable, domain, with, object, object, relationships, that, measures, similarity, structural, context, which, objects, occur, based, their, relationships, with, other,. SimRank is a general similarity measure based on a simple and intuitive graph theoretic model SimRank is applicable in any domain with object to object relationships that measures similarity of the structural context in which objects occur based on their relationships with other objects Effectively SimRank is a measure that says two objects are considered to be similar if they are referenced by similar objects Although SimRank is widely adopted it may output unreasonable similarity scores which are influenced by different factors and can be solved in several ways such as introducing an evidence weight factor 1 inserting additional terms that are neglected by SimRank 2 or using PageRank based alternatives 3 Contents 1 Introduction 2 Basic SimRank equation 3 Matrix representation of SimRank 4 Computing SimRank 5 CoSimRank 6 Further research on SimRank 6 1 Partial Sums Memoization 7 See also 8 Citations 9 SourcesIntroduction editMany applications require a measure of similarity between objects One obvious example is the find similar document query on traditional text corpora or the World Wide Web More generally a similarity measure can be used to cluster objects such as for collaborative filtering in a recommender system in which similar users and items are grouped based on the users preferences Various aspects of objects can be used to determine similarity usually depending on the domain and the appropriate definition of similarity for that domain In a document corpus matching text may be used and for collaborative filtering similar users may be identified by common preferences SimRank is a general approach that exploits the object to object relationships found in many domains of interest On the Web for example two pages are related if there are hyperlinks between them A similar approach can be applied to scientific papers and their citations or to any other document corpus with cross reference information In the case of recommender systems a user s preference for an item constitutes a relationship between the user and the item Such domains are naturally modeled as graphs with nodes representing objects and edges representing relationships The intuition behind the SimRank algorithm is that in many domains similar objects are referenced by similar objects More precisely objects a displaystyle a nbsp and b displaystyle b nbsp are considered to be similar if they are pointed from objects c displaystyle c nbsp and d displaystyle d nbsp respectively and c displaystyle c nbsp and d displaystyle d nbsp are themselves similar The base case is that objects are maximally similar to themselves 4 It is important to note that SimRank is a general algorithm that determines only the similarity of structural context SimRank applies to any domain where there are enough relevant relationships between objects to base at least some notion of similarity on relationships Obviously similarity of other domain specific aspects are important as well these can and should be combined with relational structural context similarity for an overall similarity measure For example for Web pages SimRank can be combined with traditional textual similarity the same idea applies to scientific papers or other document corpora For recommendation systems there may be built in known similarities between items e g both computers both clothing etc as well as similarities between users e g same gender same spending level Again these similarities can be combined with the similarity scores that are computed based on preference patterns in order to produce an overall similarity measure Basic SimRank equation editFor a node v displaystyle v nbsp in a directed graph we denote by I v displaystyle I v nbsp and O v displaystyle O v nbsp the set of in neighbors and out neighbors of v displaystyle v nbsp respectively Individual in neighbors are denoted as Ii v displaystyle I i v nbsp for 1 i I v displaystyle 1 leq i leq left I v right nbsp and individual out neighbors are denoted as Oi v displaystyle O i v nbsp for 1 i O v displaystyle 1 leq i leq left O v right nbsp Let us denote the similarity between objects a displaystyle a nbsp and b displaystyle b nbsp by s a b 0 1 displaystyle s a b in 0 1 nbsp Following the earlier motivation a recursive equation is written for s a b displaystyle s a b nbsp If a b displaystyle a b nbsp then s a b displaystyle s a b nbsp is defined to be 1 displaystyle 1 nbsp Otherwise s a b C I a I b i 1 I a j 1 I b s Ii a Ij b displaystyle s a b frac C left I a right left I b right sum i 1 left I a right sum j 1 left I b right s I i a I j b nbsp where C displaystyle C nbsp is a constant between 0 displaystyle 0 nbsp and 1 displaystyle 1 nbsp A slight technicality here is that either a displaystyle a nbsp or b displaystyle b nbsp may not have any in neighbors Since there is no way to infer any similarity between a displaystyle a nbsp and b displaystyle b nbsp in this case similarity is set to s a b 0 displaystyle s a b 0 nbsp so the summation in the above equation is defined to be 0 displaystyle 0 nbsp when I a displaystyle I a emptyset nbsp or I b displaystyle I b emptyset nbsp Matrix representation of SimRank editGiven an arbitrary constant C displaystyle C nbsp between 0 displaystyle 0 nbsp and 1 displaystyle 1 nbsp let S displaystyle mathbf S nbsp be the similarity matrix whose entry S a b displaystyle mathbf S a b nbsp denotes the similarity score s a b displaystyle s a b nbsp and A displaystyle mathbf A nbsp be the column normalized adjacency matrix whose entry A a b 1 I b displaystyle mathbf A a b tfrac 1 mathcal I b nbsp if there is an edge from a displaystyle a nbsp to b displaystyle b nbsp and 0 otherwise Then in matrix notations SimRank can be formulated as S max C AT S A I displaystyle mathbf S max C cdot mathbf A T cdot mathbf S cdot mathbf A mathbf I nbsp where I displaystyle mathbf I nbsp is an identity matrix Computing SimRank editA solution to the SimRank equations for a graph G displaystyle G nbsp can be reached by iteration to a fixed point Let n displaystyle n nbsp be the number of nodes in G displaystyle G nbsp For each iteration k displaystyle k nbsp we can keep n2 displaystyle n 2 nbsp entries sk displaystyle s k nbsp where sk a b displaystyle s k a b nbsp gives the score between a displaystyle a nbsp and b displaystyle b nbsp on iteration k displaystyle k nbsp We successively compute sk 1 displaystyle s k 1 nbsp based on sk displaystyle s k nbsp We start with s0 displaystyle s 0 nbsp where each s0 a b displaystyle s 0 a b nbsp is a lower bound on the actual SimRank score s a b displaystyle s a b nbsp s0 a b 1 if a b 0 if a b displaystyle s 0 a b begin cases 1 mbox mbox mbox if a b mbox 0 mbox mbox mbox if a neq b mbox end cases nbsp To compute sk 1 a b displaystyle s k 1 a b nbsp from sk displaystyle s k nbsp we use the basic SimRank equation to get sk 1 a b C I a I b i 1 I a j 1 I b sk Ii a Ij b displaystyle s k 1 a b frac C left I a right left I b right sum i 1 left I a right sum j 1 left I b right s k I i a I j b nbsp for a b displaystyle a neq b nbsp and sk 1 a b 1 displaystyle s k 1 a b 1 nbsp for a b displaystyle a b nbsp That is on each iteration k 1 displaystyle k 1 nbsp we update the similarity of a b displaystyle a b nbsp using the similarity scores of the neighbours of a b displaystyle a b nbsp from the previous iteration k displaystyle k nbsp according to the basic SimRank equation The values sk displaystyle s k nbsp are nondecreasing as k displaystyle k nbsp increases It was shown in 4 that the values converge to limits satisfying the basic SimRank equation the SimRank scores s displaystyle s nbsp i e for all a b V displaystyle a b in V nbsp limk sk a b s a b displaystyle lim k to infty s k a b s a b nbsp The original SimRank proposal suggested choosing the decay factor C 0 8 displaystyle C 0 8 nbsp and a fixed number K 5 displaystyle K 5 nbsp of iterations to perform However the recent research 5 showed that the given values for C displaystyle C nbsp and K displaystyle K nbsp generally imply relatively low accuracy of iteratively computed SimRank scores For guaranteeing more accurate computation results the latter paper suggests either using a smaller decay factor in particular C 0 6 displaystyle C 0 6 nbsp or taking more iterations CoSimRank editCoSimRank is a variant of SimRank with the advantage of also having a local formulation i e CoSimRank can be computed for a single node pair 6 Let S displaystyle mathbf S nbsp be the similarity matrix whose entry S a b displaystyle mathbf S a b nbsp denotes the similarity score s a b displaystyle s a b nbsp and A displaystyle mathbf A nbsp be the column normalized adjacency matrix Then in matrix notations CoSimRank can be formulated as S C AT S A I displaystyle mathbf S C cdot mathbf A T cdot mathbf S cdot mathbf A mathbf I nbsp where I displaystyle mathbf I nbsp is an identity matrix To compute the similarity score of only a single node pair let p 0 i ei displaystyle p 0 i e i nbsp with ei displaystyle e i nbsp being a vector of the standard basis i e the i displaystyle i nbsp th entry is 1 and all other entries are 0 Then CoSimRank can be computed in two steps p k Ap k 1 displaystyle p k Ap k 1 nbsp s i j k 0 Ck p k i p k j displaystyle s i j sum k 0 infty C k langle p k i p k j rangle nbsp Step one can be seen a simplified version of Personalized PageRank Step two sums up the vector similarity of each iteration Both matrix and local representation compute the same similarity score CoSimRank can also be used to compute the similarity of sets of nodes by modifying p 0 i displaystyle p 0 i nbsp Further research on SimRank editFogaras and Racz 7 suggested speeding up SimRank computation through probabilistic computation using the Monte Carlo method Antonellis et al 8 extended SimRank equations to take into consideration i evidence factor for incident nodes and ii link weights Yu et al 9 further improved SimRank computation via a fine grained memoization method to share small common parts among different partial sums Chen and Giles discussed the limitations and proper use cases of SimRank 3 Partial Sums Memoization edit Lizorkin et al 5 proposed three optimization techniques for speeding up the computation of SimRank Essential nodes selection may eliminate the computation of a fraction of node pairs with a priori zero scores Partial sums memoization can effectively reduce repeated calculations of the similarity among different node pairs by caching part of similarity summations for later reuse A threshold setting on the similarity enables a further reduction in the number of node pairs to be computed In particular the second observation of partial sums memoization plays a paramount role in greatly speeding up the computation of SimRank from O Kd2n2 displaystyle mathcal O Kd 2 n 2 nbsp to O Kdn2 displaystyle mathcal O Kdn 2 nbsp where K displaystyle K nbsp is the number of iterations d displaystyle d nbsp is average degree of a graph and n displaystyle n nbsp is the number of nodes in a graph The central idea of partial sums memoization consists of two steps First the partial sums over I a displaystyle I a nbsp are memoized as PartialI a sk j i I a sk i j j I b displaystyle text Partial I a s k j sum i in I a s k i j qquad forall j in I b nbsp and then sk 1 a b displaystyle s k 1 a b nbsp is iteratively computed from PartialI a sk j displaystyle text Partial I a s k j nbsp as sk 1 a b C I a I b j I b PartialI a sk j displaystyle s k 1 a b frac C I a I b sum j in I b text Partial I a s k j nbsp Consequently the results of PartialI a sk j displaystyle text Partial I a s k j nbsp j I b displaystyle forall j in I b nbsp can be reused later when we compute the similarities sk 1 a displaystyle s k 1 a nbsp for a given vertex a displaystyle a nbsp as the first argument See also editPageRankCitations edit I Antonellis H Garcia Molina and C C Chang Simrank Query Rewriting through Link Analysis of the Click Graph In VLDB 08 Proceedings of the 34th International Conference on Very Large Data Bases pages 408 421 1 W Yu X Lin W Zhang L Chang and J Pei More is Simpler Effectively and Efficiently Assessing Node Pair Similarities Based on Hyperlinks In VLDB 13 Proceedings of the 39th International Conference on Very Large Data Bases pages 13 24 2 a b H Chen and C L Giles ASCOS An Asymmetric Similarity Measure for Weighted Networks to Address the Problem of SimRank ACM Transactions on Knowledge Discovery from Data TKDD 10 2 2015 3 a b G Jeh and J Widom SimRank A Measure of Structural Context Similarity In KDD 02 Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining pages 538 543 ACM Press 2002 Archived copy PDF Archived from the original PDF on 2008 05 12 Retrieved 2008 10 02 a href Template Cite web html title Template Cite web cite web a CS1 maint archived copy as title link a b D Lizorkin P Velikhov M Grinev and D Turdakov Accuracy Estimate and Optimization Techniques for SimRank Computation In VLDB 08 Proceedings of the 34th International Conference on Very Large Data Bases pages 422 433 Archived copy PDF Archived from the original PDF on 2009 04 07 Retrieved 2008 10 25 a href Template Cite web html title Template Cite web cite web a CS1 maint archived copy as title link S Rothe and H Schutze CoSimRank A Flexible amp Efficient Graph Theoretic Similarity Measure In ACL 14 Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics Volume 1 Long Papers pages 1392 1402 4 D Fogaras and B Racz Scaling link based similarity search In WWW 05 Proceedings of the 14th international conference on World Wide Web pages 641 650 New York NY USA 2005 ACM 5 Antonellis Ioannis Hector Garcia Molina and Chi Chao Chang Simrank query rewriting through link analysis of the click graph Proceedings of the VLDB Endowment 1 1 2008 408 421 arXiv 0712 0499 W Yu X Lin W Zhang Towards Efficient SimRank Computation on Large Networks In ICDE 13 Proceedings of the 29th IEEE International Conference on Data Engineering pages 601 612 Archived copy PDF Archived from the original PDF on 2014 05 12 Retrieved 2014 05 09 a href Template Cite web html title Template Cite web cite web a CS1 maint archived copy as title link Sources editCai Y Cong G Jia X Liu H He J Lu J Du X 2009 12 01 Efficient Algorithm for Computing Link Based Similarity in Real World Networks 2009 Ninth IEEE International Conference on Data Mining pp 734 739 doi 10 1109 ICDM 2009 136 ISBN 978 1 4244 5242 2 S2CID 9799597 Retrieved from https en wikipedia org w index php title SimRank amp oldid 1167068177, wikipedia, wiki, book, books, library,

article

, read, download, free, free download, mp3, video, mp4, 3gp, jpg, jpeg, gif, png, picture, music, song, movie, book, game, games.