
Universal hashing

In mathematics and computing, universal hashing (in a randomized algorithm or data structure) refers to selecting a hash function at random from a family of hash functions with a certain mathematical property (see definition below). This guarantees a low number of collisions in expectation, even if the data is chosen by an adversary. Many universal families are known (for hashing integers, vectors, strings), and their evaluation is often very efficient. Universal hashing has numerous uses in computer science, for example in implementations of hash tables, randomized algorithms, and cryptography.

Introduction

Assume we want to map keys from some universe U into m bins (labelled [m] = {0, 1, ..., m − 1}). The algorithm will have to handle some data set S ⊆ U of |S| = n keys, which is not known in advance. Usually, the goal of hashing is to obtain a low number of collisions (keys from S that land in the same bin). A deterministic hash function cannot offer any guarantee in an adversarial setting if |U| > m·n, since the adversary may choose S to be precisely the preimage of a bin. This means that all data keys land in the same bin, making hashing useless. Furthermore, a deterministic hash function does not allow for rehashing: sometimes the input data turns out to be bad for the hash function (e.g. there are too many collisions), so one would like to change the hash function.

The solution to these problems is to pick a function randomly from a family of hash functions. A family of functions H = {h : U → [m]} is called a universal family if, for all x, y ∈ U with x ≠ y, |{h ∈ H : h(x) = h(y)}| ≤ |H|/m.

In other words, any two different keys of the universe collide with probability at most 1/m when the hash function h is drawn uniformly at random from H. This is exactly the probability of collision we would expect if the hash function assigned truly random hash codes to every key.

Sometimes, the definition is relaxed by a constant factor, only requiring collision probability O(1/m) rather than ≤ 1/m. This concept was introduced by Carter and Wegman[1] in 1977, and has found numerous applications in computer science (see, for example,[2]).

If we have an upper bound of ε < 1 on the collision probability, we say that we have ε-almost universality. So, for example, a universal family has (1/m)-almost universality.

Many, but not all, universal families have the following stronger uniform difference property:

For all x, y ∈ U with x ≠ y, when h is drawn randomly from the family H, the difference (h(x) − h(y)) mod m is uniformly distributed in [m].

Note that the definition of universality is only concerned with whether h(x) − h(y) = 0, which counts collisions. The uniform difference property is stronger.

(Similarly, a universal family can be XOR-universal if for all x, y ∈ U with x ≠ y, the value (h(x) ⊕ h(y)) mod m is uniformly distributed in [m], where ⊕ is the bitwise exclusive-or operation. This is only possible if m is a power of two.)

An even stronger condition is pairwise independence: we have this property when, for all x, y ∈ U with x ≠ y, the probability that x and y hash to any given pair of hash values z1, z2 is the same as if h were perfectly random: Pr[h(x) = z1 ∧ h(y) = z2] = 1/m^2. Pairwise independence is sometimes called strong universality.

Another property is uniformity. We say that a family is uniform if all hash values are equally likely: Pr[h(x) = z] = 1/m for any hash value z. Universality does not imply uniformity. However, strong universality does imply uniformity.

Given a family with the uniform difference property, one can produce a pairwise independent or strongly universal hash family by adding a uniformly distributed random constant with values in [m] to the hash functions. (Similarly, if m is a power of two, we can achieve pairwise independence from an XOR-universal hash family by taking an exclusive or with a uniformly distributed random constant.) Since a shift by a constant is sometimes irrelevant in applications (e.g. hash tables), a careful distinction between the uniform difference property and pairwise independence is sometimes not made.[3]

For some applications (such as hash tables), it is important for the least significant bits of the hash values to also be universal. When a family is strongly universal, this is guaranteed: if H is a strongly universal family with m = 2^L, then the family made of the functions h mod 2^L' for all h ∈ H is also strongly universal for L' ≤ L. Unfortunately, the same is not true of (merely) universal families. For example, the family made of the identity function h(x) = x is clearly universal, but the family made of the function h(x) = x mod 2^L' fails to be universal.

UMAC and Poly1305-AES and several other message authentication code algorithms are based on universal hashing.[4][5] In such applications, the software chooses a new hash function for every message, based on a unique nonce for that message.

Several hash table implementations are based on universal hashing. In such applications, typically the software chooses a new hash function only after it notices that "too many" keys have collided; until then, the same hash function continues to be used over and over. (Some collision resolution schemes, such as dynamic perfect hashing, pick a new hash function every time there is a collision. Other collision resolution schemes, such as cuckoo hashing and 2-choice hashing, allow a number of collisions before picking a new hash function.) A survey of the fastest known universal and strongly universal hash functions for integers, vectors, and strings is given by Thorup.[6]

Mathematical guarantees

For any fixed set S of n keys, using a universal family guarantees the following properties.

  1. For any fixed x in S, the expected number of keys in the bin h(x) is n/m. When implementing hash tables by chaining, this number is proportional to the expected running time of an operation involving the key x (for example a query, insertion or deletion).
  2. The expected number of pairs of keys x, y in S with x ≠ y that collide (h(x) = h(y)) is bounded above by n(n − 1)/(2m), which is of order O(n^2/m). When the number of bins m is chosen to be linear in n (i.e., m ∈ Ω(n)), the expected number of collisions is O(n). When hashing into n^2 bins, there are no collisions at all with probability at least one half (see the short calculation after this list).
  3. The expected number of keys in bins with at least t keys in them is bounded above by 2n/(t − 2(n/m − 1)).[7] Thus, if the capacity of each bin is capped at three times the average size (t = 3n/m), the total number of keys in overflowing bins is at most O(m). This only holds with a hash family whose collision probability is bounded above by 1/m. If a weaker definition is used, bounding it by O(1/m), this result is no longer true.[7]
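The "no collisions in n^2 bins" claim in the second guarantee is a one-line consequence of Markov's inequality applied to the expected number of collisions:

Pr[at least one collision] ≤ E[number of collisions] ≤ n(n − 1)/(2n^2) < 1/2.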

As the above guarantees hold for any fixed set S, they hold if the data set is chosen by an adversary. However, the adversary has to make this choice before (or independent of) the algorithm's random choice of a hash function. If the adversary can observe the random choice of the algorithm, randomness serves no purpose, and the situation is the same as deterministic hashing.

The second and third guarantees are typically used in conjunction with rehashing. For instance, a randomized algorithm may be prepared to handle some O(n) number of collisions. If it observes too many collisions, it chooses another random h from the family and repeats. Universality guarantees that the number of repetitions is a geometric random variable; for example, if the algorithm rehashes whenever it sees more than twice the expected number of collisions, then by Markov's inequality each attempt succeeds with probability at least 1/2, so the expected number of attempts is at most 2.

Constructions

Since any computer data can be represented as one or more machine words, one generally needs hash functions for three types of domains: machine words ("integers"); fixed-length vectors of machine words; and variable-length vectors ("strings").

Hashing integers

This section refers to the case of hashing integers that fit in machine words; thus, operations like multiplication, addition, division, etc. are cheap machine-level instructions. Let the universe to be hashed be {0, ..., |U| − 1}.

The original proposal of Carter and Wegman[1] was to pick a prime p ≥ |U| and define

h_{a,b}(x) = ((ax + b) mod p) mod m

where a, b are randomly chosen integers modulo p with a ≠ 0. (This is a single iteration of a linear congruential generator.)
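As an illustration, here is a minimal C sketch of one member of this family; the specific prime p = 2^61 − 1, the 32-bit key type, and the use of a 128-bit intermediate are assumptions of this example, not part of the original construction.

    #include <stdint.h>

    /* Carter-Wegman hash h_{a,b}(x) = ((a*x + b) mod p) mod m for one machine-word key.
       The caller draws a uniformly from {1, ..., p-1} and b from {0, ..., p-1}. */
    static const uint64_t P61 = (1ULL << 61) - 1;   /* a prime >= |U| for 32-bit keys */

    uint64_t carter_wegman(uint64_t a, uint64_t b, uint32_t x, uint64_t m) {
        /* a*x + b needs up to 93 bits; unsigned __int128 is a GCC/Clang extension. */
        unsigned __int128 t = (unsigned __int128)a * x + b;
        return (uint64_t)(t % P61) % m;
    }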

To see that H = {h_{a,b}} is a universal family, note that h_{a,b}(x) = h_{a,b}(y) only holds when

ax + b ≡ ay + b + i·m (mod p)

for some integer i between 0 and ⌊(p − 1)/m⌋. Since p ≥ |U|, if x ≠ y then their difference x − y is nonzero and has an inverse modulo p. Solving for a yields

a ≡ i·m·(x − y)^{−1} (mod p).

There are p − 1 possible choices for a (since a = 0 is excluded) and, varying i in the allowed range, ⌊(p − 1)/m⌋ possible non-zero values for the right-hand side. Thus the collision probability is

⌊(p − 1)/m⌋ / (p − 1) ≤ ((p − 1)/m) / (p − 1) = 1/m.

Another way to see that H is a universal family is via the notion of statistical distance. Write the difference h(x) − h(y) as

h(x) − h(y) ≡ (a(x − y) mod p) (mod m).

Since x − y is nonzero and a is uniformly distributed in {1, ..., p − 1}, it follows that a(x − y) modulo p is also uniformly distributed in {1, ..., p − 1}. The distribution of (h(x) − h(y)) mod m is thus almost uniform, up to a difference in probability of ±1/p between the samples. As a result, the statistical distance to a uniform family is O(m/p), which becomes negligible when p ≫ m.

The family of simpler hash functions

h_a(x) = (ax mod p) mod m

is only approximately universal: Pr[h_a(x) = h_a(y)] ≤ 2/m for all x ≠ y.[1] Moreover, this analysis is nearly tight; Carter and Wegman[1] show that Pr[h_a(1) = h_a(m + 1)] ≥ 2/(m + 1) whenever p ≡ 1 (mod m + 1).

Avoiding modular arithmetic

The state of the art for hashing integers is the multiply-shift scheme described by Dietzfelbinger et al. in 1997.[8] By avoiding modular arithmetic, this method is much easier to implement and also runs significantly faster in practice (usually by at least a factor of four[9]). The scheme assumes the number of bins is a power of two, m = 2^M. Let w be the number of bits in a machine word. Then the hash functions are parametrised over odd positive integers a < 2^w (that fit in a word of w bits). To evaluate h_a(x), multiply x by a modulo 2^w and then keep the high-order M bits as the hash code. In mathematical notation, this is

h_a(x) = (a·x mod 2^w) div 2^{w−M}.

This scheme does not satisfy the uniform difference property and is only (2/m)-almost-universal; for any x ≠ y, Pr[h_a(x) = h_a(y)] ≤ 2/m.
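For concreteness, a minimal C sketch of multiply-shift for w = 64 follows; the choice w = 64 and the function name are assumptions of this example, and the caller is expected to supply a random odd multiplier.

    #include <stdint.h>

    /* Multiply-shift for w = 64: h_a(x) = (a*x mod 2^64) div 2^(64-M).
       Unsigned overflow of uint64_t is exactly reduction mod 2^64.
       Requires an odd multiplier a and 1 <= M <= 64. */
    uint64_t multiply_shift(uint64_t a, uint64_t x, unsigned M) {
        return (a * x) >> (64 - M);   /* keep the high-order M bits */
    }

The multiplier a is drawn once (any random 64-bit value with its lowest bit forced to 1) and reused for all keys until the table is rehashed.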

To understand the behavior of the hash function, notice that, if ax mod 2^w and ay mod 2^w have the same highest-order M bits, then a(x − y) mod 2^w has either all 1's or all 0's as its highest-order M bits (depending on whether ax mod 2^w or ay mod 2^w is larger). Assume that the least significant set bit of x − y appears in position w − c. Since a is a random odd integer and odd integers have inverses in the ring Z_{2^w}, it follows that a(x − y) mod 2^w will be uniformly distributed among w-bit integers with the least significant set bit in position w − c. The probability that these bits are all 0's or all 1's is therefore at most 2/2^M = 2/m. On the other hand, if c < M, then the higher-order M bits of a(x − y) mod 2^w contain both 0's and 1's, so it is certain that h_a(x) ≠ h_a(y). Finally, if c = M, then bit w − M of a(x − y) mod 2^w is 1 and h_a(x) = h_a(y) if and only if bits w − 1, ..., w − M + 1 are also 1, which happens with probability 1/2^{M−1} = 2/m.

This analysis is tight, as can be shown with the example x = 2^{w−M−2} and y = 3x. To obtain a truly 'universal' hash function, one can use the multiply-add-shift scheme that picks higher-order bits

h_{a,b}(x) = ((ax + b) mod 2^{w+M}) div 2^w

where a is a random positive integer with a < 2^{2w} and b is a random non-negative integer with b < 2^{2w}. This requires doing arithmetic on 2w-bit unsigned integers. This version of multiply-shift is due to Dietzfelbinger, and was later analyzed more precisely by Woelfel.[10]
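A hedged C sketch of multiply-add-shift for w = 32 follows, so that all arithmetic fits in 64-bit unsigned integers; the 32-bit key type and the function name are assumptions of this example.

    #include <stdint.h>

    /* Multiply-add-shift for w = 32: h_{a,b}(x) = ((a*x + b) mod 2^(32+M)) div 2^32,
       i.e. bits 32 .. 32+M-1 of the value a*x + b. The caller supplies random
       64-bit parameters a and b; requires 1 <= M <= 32. */
    uint32_t multiply_add_shift(uint64_t a, uint64_t b, uint32_t x, unsigned M) {
        uint64_t t = a * (uint64_t)x + b;   /* wraps mod 2^64, which is harmless:
                                               only the low 32+M <= 64 bits are used */
        return (uint32_t)((t >> 32) & ((1ULL << M) - 1));
    }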

Hashing vectors

This section is concerned with hashing a fixed-length vector of machine words. Interpret the input as a vector x = (x_0, ..., x_{k−1}) of k machine words (integers of w bits each). If H is a universal family with the uniform difference property, the following family (dating back to Carter and Wegman[1]) also has the uniform difference property (and hence is universal):

h(x) = (Σ_{i=0}^{k−1} h_i(x_i)) mod m, where each h_i ∈ H is chosen independently at random.

If m is a power of two, one may replace summation by exclusive or.[11]

In practice, if double-precision arithmetic is available, this is instantiated with the multiply-shift family of hash functions.[12] Initialize the hash function with a vector a = (a_0, ..., a_{k−1}) of random odd integers on 2w bits each. Then if the number of bins is m = 2^M for M ≤ w:

h_a(x) = ((Σ_{i=0}^{k−1} x_i·a_i) mod 2^{2w}) div 2^{2w−M}.
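The following minimal C sketch instantiates this for w = 32 (so 2w = 64 and both the products and the sum wrap modulo 2^64 automatically); the function and parameter names are ours.

    #include <stdint.h>
    #include <stddef.h>

    /* Vector multiply-shift for w = 32:
       h(x) = ((sum_i x[i]*a[i]) mod 2^64) div 2^(64-M).
       a[] holds k random odd 64-bit integers; x[] holds the k 32-bit input words. */
    uint32_t vector_multiply_shift(const uint64_t *a, const uint32_t *x,
                                   size_t k, unsigned M) {
        uint64_t sum = 0;
        for (size_t i = 0; i < k; i++)
            sum += a[i] * (uint64_t)x[i];    /* wraparound = reduction mod 2^64 */
        return (uint32_t)(sum >> (64 - M));  /* high-order M bits, 1 <= M <= 32 */
    }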

It is possible to halve the number of multiplications, which roughly translates to a two-fold speed-up in practice.[11] Initialize the hash function with a vector a = (a_0, ..., a_{k−1}) of random odd integers on 2w bits each. The following hash family is universal:[13]

h_a(x) = ((Σ_{i=0}^{⌈k/2⌉} (x_{2i} + a_{2i})·(x_{2i+1} + a_{2i+1})) mod 2^{2w}) div 2^{2w−M}.
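A sketch of this pairwise variant, again for w = 32 and assuming the number of words k is even (otherwise pad the input with a zero word); one 64-bit multiplication covers two input words.

    #include <stdint.h>
    #include <stddef.h>

    /* Pairwise multiply-shift for w = 32. a[] holds k random odd 64-bit integers;
       x[] holds the k 32-bit input words, with k even. */
    uint32_t vector_pair_multiply_shift(const uint64_t *a, const uint32_t *x,
                                        size_t k, unsigned M) {
        uint64_t sum = 0;
        for (size_t i = 0; i < k; i += 2)
            sum += (a[i] + x[i]) * (a[i + 1] + x[i + 1]);  /* all arithmetic mod 2^64 */
        return (uint32_t)(sum >> (64 - M));
    }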

If double-precision operations are not available, one can interpret the input as a vector of half-words (w/2-bit integers). The algorithm will then use ⌈k/2⌉ multiplications, where k is the number of half-words in the vector. Thus, the algorithm runs at a "rate" of one multiplication per word of input.

The same scheme can also be used for hashing integers, by interpreting their bits as vectors of bytes. In this variant, the vector technique is known as tabulation hashing and it provides a practical alternative to multiplication-based universal hashing schemes.[14]

Strong universality at high speed is also possible.[15] Initialize the hash function with a vector a = (a_0, ..., a_k) of random integers on 2w bits. Compute

h_a^{strong}(x) = ((a_0 + Σ_{i=0}^{k−1} a_{i+1}·x_i) mod 2^{2w}) div 2^w.

The result is strongly universal on w bits. Experimentally, it was found to run at 0.2 CPU cycles per byte on recent Intel processors for w = 32.
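A minimal C sketch for w = 32, assuming the caller supplies a vector a of k + 1 random 64-bit integers:

    #include <stdint.h>
    #include <stddef.h>

    /* Strongly universal vector hash for w = 32:
       h(x) = ((a[0] + sum_i a[i+1]*x[i]) mod 2^64) div 2^32.
       a[] holds k + 1 random 64-bit integers; the result is a 32-bit hash value. */
    uint32_t vector_strong(const uint64_t *a, const uint32_t *x, size_t k) {
        uint64_t sum = a[0];
        for (size_t i = 0; i < k; i++)
            sum += a[i + 1] * (uint64_t)x[i];   /* wraparound = reduction mod 2^64 */
        return (uint32_t)(sum >> 32);           /* keep the high-order 32 bits */
    }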

Hashing strings

This refers to hashing a variable-sized vector of machine words. If the length of the string can be bounded by a small number, it is best to use the vector solution from above (conceptually padding the vector with zeros up to the upper bound). The space required is the maximal length of the string, but the time to evaluate h(s) is just the length of s. As long as zeroes are forbidden in the string, the zero-padding can be ignored when evaluating the hash function without affecting universality.[11] Note that if zeroes are allowed in the string, then it might be best to append a fictitious non-zero (e.g., 1) character to all strings prior to padding: this will ensure that universality is not affected.[15]

Now assume we want to hash x = (x_0, ..., x_ℓ), where a good bound on ℓ is not known a priori. A universal family proposed by Dietzfelbinger et al.[12] treats the string x as the coefficients of a polynomial modulo a large prime. If x_i ∈ [u], let p ≥ max{u, m} be a prime and define:

h_a(x) = h_int((Σ_{i=0}^{ℓ} x_i·a^{ℓ−i}) mod p), where a ∈ [p] is uniformly random and h_int is chosen randomly from a universal family mapping the integer domain [p] to [m].

Using properties of modular arithmetic, the above can be computed without producing large numbers for large strings, as follows:[16]

uint hash(String x, int a, int p)
    uint h = INITIAL_VALUE
    for (uint i = 0; i < x.length; ++i)
        h = ((h*a) + x[i]) mod p
    return h

This Rabin–Karp rolling hash is based on a linear congruential generator.[17] The above algorithm is also known as a multiplicative hash function.[18] In practice, the mod operator and the parameter p can be avoided altogether by simply allowing integers to overflow, because overflow is equivalent to reducing mod (Max-Int-Value + 1) in many programming languages. The table below shows values chosen to initialize h and a for some popular implementations; a C sketch of this overflow-based variant follows the table.

Implementation                              INITIAL_VALUE   a
Bernstein's hash function djb2[19]          5381            33
STLPort 4.6.2                               0               5
Kernighan and Ritchie's hash function[20]   0               31
java.lang.String.hashCode()[21]             0               31
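For illustration, the overflow-based variant described above can be written in C as follows, using the djb2 constants from the table; note that with fixed constants this is a single hash function rather than a randomly drawn member of a universal family, so it only illustrates the arithmetic.

    #include <stdint.h>
    #include <stddef.h>

    /* Multiplicative string hash relying on unsigned overflow (arithmetic mod 2^32)
       instead of an explicit "mod p". Constants are the djb2 values from the table. */
    uint32_t multiplicative_hash(const unsigned char *s, size_t len) {
        uint32_t h = 5381;                 /* INITIAL_VALUE */
        for (size_t i = 0; i < len; i++)
            h = h * 33 + s[i];             /* a = 33; overflow acts as mod 2^32 */
        return h;
    }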

Consider two strings x, y and let ℓ be the length of the longer one; for the analysis, the shorter string is conceptually padded with zeros up to length ℓ. A collision before applying h_int implies that a is a root of the polynomial with coefficients x − y. This polynomial has at most ℓ roots modulo p, so the collision probability is at most ℓ/p. The probability of collision through the random h_int brings the total collision probability to 1/m + ℓ/p. Thus, if the prime p is sufficiently large compared to the length of the strings hashed, the family is very close to universal (in statistical distance).

Other universal families of hash functions used to hash unknown-length strings to fixed-length hash values include the Rabin fingerprint and the Buzhash.

Avoiding modular arithmetic

To mitigate the computational penalty of modular arithmetic, three tricks are used in practice:[11]

  1. One chooses the prime p to be close to a power of two, such as a Mersenne prime. This allows arithmetic modulo p to be implemented without division (using faster operations like addition and shifts; see the sketch after this list). For instance, on modern architectures one can work with p = 2^61 − 1, while the x_i's are 32-bit values.
  2. One can apply vector hashing to blocks. For instance, one applies vector hashing to each 16-word block of the string, and applies string hashing to the ⌈k/16⌉ results. Since the slower string hashing is applied on a substantially smaller vector, this will essentially be as fast as vector hashing.
  3. One chooses a power of two as the divisor, allowing arithmetic modulo 2^w to be implemented without division (using faster operations of bit masking). The NH hash-function family takes this approach.
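As a sketch of the first trick, reduction modulo the Mersenne prime p = 2^61 − 1 can be done with shifts, masks and additions only, using the identity 2^61 ≡ 1 (mod p); the function name is ours.

    #include <stdint.h>

    /* Reduce v modulo the Mersenne prime p = 2^61 - 1 without division:
       since 2^61 = 1 (mod p), the high bits of v can be folded onto the low bits. */
    static const uint64_t P61 = (1ULL << 61) - 1;

    uint64_t mod_mersenne61(uint64_t v) {
        v = (v & P61) + (v >> 61);   /* v mod 2^61 plus the carried-out high bits */
        if (v >= P61)                /* the fold leaves a value below 2p, so one */
            v -= P61;                /* conditional subtraction finishes the job */
        return v;
    }

The multiplication h·a in the string hash produces up to 122 bits, which can be folded in the same spirit using a 128-bit intermediate, still without any division instruction.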

See also

  • K-independent hashing
  • Rolling hashing
  • Tabulation hashing
  • Min-wise independence
  • Universal one-way hash function
  • Low-discrepancy sequence
  • Perfect hashing

References

  1. ^ a b c d e Carter, Larry; Wegman, Mark N. (1979). "Universal Classes of Hash Functions". Journal of Computer and System Sciences. 18 (2): 143–154. doi:10.1016/0022-0000(79)90044-8. Conference version in STOC'77.
  2. ^ Miltersen, Peter Bro. "Universal Hashing" (PDF). Archived from the original (PDF) on 24 May 2011. Retrieved 24 June 2009.
  3. ^ Motwani, Rajeev; Raghavan, Prabhakar (1995). Randomized Algorithms. Cambridge University Press. p. 221. ISBN 0-521-47465-5.
  4. ^ David Wagner, ed. "Advances in Cryptology - CRYPTO 2008". p. 145.
  5. ^ Jean-Philippe Aumasson, Willi Meier, Raphael Phan, Luca Henzen. "The Hash Function BLAKE". 2014. p. 10.
  6. ^ Thorup, Mikkel (2015). "High Speed Hashing for Integers and Strings". arXiv:1504.06804 [cs.DS].
  7. ^ a b Baran, Ilya; Demaine, Erik D.; Pătraşcu, Mihai (2008). "Subquadratic Algorithms for 3SUM" (PDF). Algorithmica. 50 (4): 584–596. doi:10.1007/s00453-007-9036-3. S2CID 9855995.
  8. ^ Dietzfelbinger, Martin; Hagerup, Torben; Katajainen, Jyrki; Penttonen, Martti (1997). "A Reliable Randomized Algorithm for the Closest-Pair Problem" (Postscript). Journal of Algorithms. 25 (1): 19–51. doi:10.1006/jagm.1997.0873. Retrieved 10 February 2011.
  9. ^ Thorup, Mikkel (18 December 2009). "Text-book algorithms at SODA".
  10. ^ Woelfel, Philipp (1999). Efficient Strongly Universal and Optimally Universal Hashing. Mathematical Foundations of Computer Science 1999. LNCS. Vol. 1672. pp. 262–272. doi:10.1007/3-540-48340-3_24.
  11. ^ a b c d Thorup, Mikkel (2009). String hashing for linear probing. Proc. 20th ACM-SIAM Symposium on Discrete Algorithms (SODA). pp. 655–664. CiteSeerX 10.1.1.215.4253. doi:10.1137/1.9781611973068.72. ISBN 978-0-89871-680-1., section 5.3
  12. ^ a b Dietzfelbinger, Martin; Gil, Joseph; Matias, Yossi; Pippenger, Nicholas (1992). Polynomial Hash Functions Are Reliable (Extended Abstract). Proc. 19th International Colloquium on Automata, Languages and Programming (ICALP). pp. 235–246.
  13. ^ Black, J.; Halevi, S.; Krawczyk, H.; Krovetz, T. (1999). UMAC: Fast and Secure Message Authentication (PDF). Advances in Cryptology (CRYPTO '99)., Equation 1
  14. ^ Pătraşcu, Mihai; Thorup, Mikkel (2011). The power of simple tabulation hashing. Proceedings of the 43rd annual ACM Symposium on Theory of Computing (STOC '11). pp. 1–10. arXiv:1011.5200. doi:10.1145/1993636.1993638. ISBN 9781450306911.
  15. ^ a b Kaser, Owen; Lemire, Daniel (2013). "Strongly universal string hashing is fast". Computer Journal. 57 (11). Oxford University Press: 1624–1638. arXiv:1202.4961. doi:10.1093/comjnl/bxt070.
  16. ^ "Hebrew University Course Slides" (PDF).
  17. ^ Robert Uzgalis. "Library Hash Functions". 1996.
  18. ^ Kankowski, Peter. "Hash functions: An empirical comparison".
  19. ^ Yigit, Ozan. "String hash functions".
  20. ^ Kernighan; Ritchie (1988). "Chapter 6". The C Programming Language (2nd ed.). Prentice Hall. p. 118. ISBN 0-13-110362-8.
  21. ^ "String (Java Platform SE 6)". docs.oracle.com. Retrieved 2015-06-10.

Further reading

  • Knuth, Donald Ervin (1998). The Art of Computer Programming, Vol. III: Sorting and Searching (3rd ed.). Reading, Mass.: Addison-Wesley. ISBN 0-201-89685-0.

External links

  • Open Data Structures - Section 5.1.1 - Multiplicative Hashing, Pat Morin
