Home | History | Annotate | Download | only in hash
      1 /*
      2  * Copyright (C) 2011 The Guava Authors
      3  *
      4  * Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
      5  * in compliance with the License. You may obtain a copy of the License at
      6  *
      7  * http://www.apache.org/licenses/LICENSE-2.0
      8  *
      9  * Unless required by applicable law or agreed to in writing, software distributed under the License
     10  * is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
     11  * or implied. See the License for the specific language governing permissions and limitations under
     12  * the License.
     13  */
     14 
     15 package com.google.common.hash;
     16 
     17 import com.google.common.annotations.Beta;
     18 import com.google.common.primitives.Ints;
     19 
     20 import java.nio.charset.Charset;
     21 
     22 /**
     23  * A hash function is a collision-averse pure function that maps an arbitrary block of
     24  * data to a number called a <i>hash code</i>.
     25  *
     26  * <h3>Definition</h3>
     27  *
     28  * <p>Unpacking this definition:
     29  *
     30  * <ul>
     31  * <li><b>block of data:</b> the input for a hash function is always, in concept, an
     32  *     ordered byte array. This hashing API accepts an arbitrary sequence of byte and
     33  *     multibyte values (via {@link Hasher}), but this is merely a convenience; these are
     34  *     always translated into raw byte sequences under the covers.
     35  *
     36  * <li><b>hash code:</b> each hash function always yields hash codes of the same fixed bit
     37  *     length (given by {@link #bits}). For example, {@link Hashing#sha1} produces a
     38  *     160-bit number, while {@link Hashing#murmur3_32()} yields only 32 bits. Because a
     39  *     {@code long} value is clearly insufficient to hold all hash code values, this API
     40  *     represents a hash code as an instance of {@link HashCode}.
     41  *
     42  * <li><b>pure function:</b> the value produced must depend only on the input bytes, in
     43  *     the order they appear. Input data is never modified.
     44  *
     45  * <li><b>collision-averse:</b> while it can't be helped that a hash function will
     46  *     sometimes produce the same hash code for distinct inputs (a "collision"), every
     47  *     hash function strives to <i>some</i> degree to make this unlikely. (Without this
     48  *     condition, a function that always returns zero could be called a hash function. It
     49  *     is not.)
     50  * </ul>
     51  *
     52  * <p>Summarizing the last two points: "equal yield equal <i>always</i>; unequal yield
     53  * unequal <i>often</i>." This is the most important characteristic of all hash functions.
     54  *
     55  * <h3>Desirable properties</h3>
     56  *
     57  * <p>A high-quality hash function strives for some subset of the following virtues:
     58  *
     59  * <ul>
     60  * <li><b>collision-resistant:</b> while the definition above requires making at least
     61  *     <i>some</i> token attempt, one measure of the quality of a hash function is <i>how
     62  *     well</i> it succeeds at this goal. Important note: it may be easy to achieve the
     63  *     theoretical minimum collision rate when using completely <i>random</i> sample
     64  *     input. The true test of a hash function is how it performs on representative
     65  *     real-world data, which tends to contain many hidden patterns and clumps. The goal
     66  *     of a good hash function is to stamp these patterns out as thoroughly as possible.
     67  *
     68  * <li><b>bit-dispersing:</b> masking out any <i>single bit</i> from a hash code should
     69  *     yield only the expected <i>twofold</i> increase to all collision rates. Informally,
     70  *     the "information" in the hash code should be as evenly "spread out" through the
     71  *     hash code's bits as possible. The result is that, for example, when choosing a
     72  *     bucket in a hash table of size 2^8, <i>any</i> eight bits could be consistently
     73  *     used.
     74  *
     75  * <li><b>cryptographic:</b> certain hash functions such as {@link Hashing#sha512} are
     76  *     designed to make it as infeasible as possible to reverse-engineer the input that
     77  *     produced a given hash code, or even to discover <i>any</i> two distinct inputs that
     78  *     yield the same result. These are called <i>cryptographic hash functions</i>. But,
     79  *     whenever it is learned that either of these feats has become computationally
     80  *     feasible, the function is deemed "broken" and should no longer be used for secure
     81  *     purposes. (This is the likely eventual fate of <i>all</i> cryptographic hashes.)
     82  *
     83  * <li><b>fast:</b> perhaps self-explanatory, but often the most important consideration.
     84  *     We have published <a href="#noWeHaventYet">microbenchmark results</a> for many
     85  *     common hash functions.
     86  * </ul>
     87  *
     88  * <h3>Providing input to a hash function</h3>
     89  *
     90  * <p>The primary way to provide the data that your hash function should act on is via a
     91  * {@link Hasher}. Obtain a new hasher from the hash function using {@link #newHasher},
     92  * "push" the relevant data into it using methods like {@link Hasher#putBytes(byte[])},
     93  * and finally ask for the {@code HashCode} when finished using {@link Hasher#hash}. (See
     94  * an {@linkplain #newHasher example} of this.)
     95  *
     96  * <p>If all you want to hash is a single byte array, string or {@code long} value, there
     97  * are convenient shortcut methods defined directly on {@link HashFunction} to make this
     98  * easier.
     99  *
    100  * <p>Hasher accepts primitive data types, but can also accept any Object of type {@code
    101  * T} provided that you implement a {@link Funnel Funnel<T>} to specify how to "feed" data
    102  * from that object into the function. (See {@linkplain Hasher#putObject an example} of
    103  * this.)
    104  *
    105  * <p><b>Compatibility note:</b> Throughout this API, multibyte values are always
    106  * interpreted in <i>little-endian</i> order. That is, hashing the byte array {@code
    107  * {0x01, 0x02, 0x03, 0x04}} is equivalent to hashing the {@code int} value {@code
    108  * 0x04030201}. If this isn't what you need, methods such as {@link Integer#reverseBytes}
    109  * and {@link Ints#toByteArray} will help.
    110  *
    111  * <h3>Relationship to {@link Object#hashCode}</h3>
    112  *
    113  * <p>Java's baked-in concept of hash codes is constrained to 32 bits, and provides no
    114  * separation between hash algorithms and the data they act on, so alternate hash
    115  * algorithms can't be easily substituted. Also, implementations of {@code hashCode} tend
    116  * to be poor-quality, in part because they end up depending on <i>other</i> existing
    117  * poor-quality {@code hashCode} implementations, including those in many JDK classes.
    118  *
    119  * <p>{@code Object.hashCode} implementations tend to be very fast, but have weak
    120  * collision prevention and <i>no</i> expectation of bit dispersion. This leaves them
    121  * perfectly suitable for use in hash tables, because extra collisions cause only a slight
    122  * performance hit, while poor bit dispersion is easily corrected using a secondary hash
    123  * function (which all reasonable hash table implementations in Java use). For the many
    124  * uses of hash functions beyond data structures, however, {@code Object.hashCode} almost
    125  * always falls short -- hence this library.
    126  *
    127  * @author Kevin Bourrillion
    128  * @since 11.0
    129  */
    130 @Beta
    131 public interface HashFunction {
    132   /**
    133    * Begins a new hash code computation by returning an initialized, stateful {@code
    134    * Hasher} instance that is ready to receive data. Example: <pre>   {@code
    135    *
    136    *   HashFunction hf = Hashing.md5();
    137    *   HashCode hc = hf.newHasher()
    138    *       .putLong(id)
    139    *       .putString(name)
    140    *       .hash();}</pre>
    141    */
    142   Hasher newHasher();
    143 
    144   /**
    145    * Begins a new hash code computation as {@link #newHasher()}, but provides a hint of the
    146    * expected size of the input (in bytes). This is only important for non-streaming hash
    147    * functions (hash functions that need to buffer their whole input before processing any
    148    * of it).
    149    */
    150   Hasher newHasher(int expectedInputSize);
    151 
    152   /**
    153    * Shortcut for {@code newHasher().putLong(input).hash()}; returns the hash code for the
    154    * given {@code long} value, interpreted in little-endian byte order. The implementation
    155    * <i>might</i> perform better than its longhand equivalent, but should not perform worse.
    156    */
    157   HashCode hashLong(long input);
    158 
    159   /**
    160    * Shortcut for {@code newHasher().putBytes(input).hash()}. The implementation
    161    * <i>might</i> perform better than its longhand equivalent, but should not perform
    162    * worse.
    163    */
    164   HashCode hashBytes(byte[] input);
    165 
    166   /**
    167    * Shortcut for {@code newHasher().putBytes(input, off, len).hash()}. The implementation
    168    * <i>might</i> perform better than its longhand equivalent, but should not perform
    169    * worse.
    170    *
    171    * @throws IndexOutOfBoundsException if {@code off < 0} or {@code off + len > bytes.length}
    172    *   or {@code len < 0}
    173    */
    174   HashCode hashBytes(byte[] input, int off, int len);
    175 
    176   /**
    177    * Shortcut for {@code newHasher().putString(input).hash()}. The implementation <i>might</i>
    178    * perform better than its longhand equivalent, but should not perform worse. Note that no
    179    * character encoding is performed; the low byte and high byte of each character are hashed
    180    * directly (in that order). This is equivalent to using
    181    * {@code hashString(input, Charsets.UTF_16LE)}.
    182    */
    183   HashCode hashString(CharSequence input);
    184 
    185   /**
    186    * Shortcut for {@code newHasher().putString(input, charset).hash()}. Characters are encoded
    187    * using the given {@link Charset}. The implementation <i>might</i> perform better than its
    188    * longhand equivalent, but should not perform worse.
    189    */
    190   HashCode hashString(CharSequence input, Charset charset);
    191 
    192   /**
    193    * Returns the number of bits (a multiple of 32) that each hash code produced by this
    194    * hash function has.
    195    */
    196   int bits();
    197 }
    198