A hash function is any function that takes arbitrary-length input and produces a fixed-length output, formally expressed as H : {0,1} → {0,1} n. Think of H(m) as a "fingerprint" of m. While different messages should ideally have different fingerprints, this is impossible in practice due to the infinite number of possible inputs and only 2 n* possible outputs.
A true unique fingerprinting function would need to be injective (one-to-one), but no hash function can achieve this. Instead, we focus on computational difficulty: a hash function is collision-resistant if no polynomial-time algorithm can find two distinct inputs x and x' such that H(x) = H(x').
Defining Collision Resistance
Hash Function Families
To avoid adversaries with hard-coded collisions, we introduce hash function families. A hash function family ℋ is a set of functions where each H ∈ ℋ has the same output length. Security relies on the random selection of H from ℋ, similar to how cryptographic keys are chosen randomly.
Formal Security Definition
A hash function family ℋ is collision-resistant if the following two libraries are indistinguishable:
- ℒcr-real: Allows the adversary to query H(x) and returns the hash value.
- ℒcr-fake: Tracks hashed values and triggers self-destruct if a collision is detected.
The self-destruct mechanism simplifies security proofs by assuming no collisions occur during normal operation.
Weaker Notions of Collision Resistance
- Target Collision Resistance: Given H and H(x), it is infeasible to find any x' (possibly equal to x) such that H(x) = H(x').
- Second Preimage Resistance: Given H and x, it is infeasible to find x' ≠ x such that H(x) = H(x').
Both are weaker than full collision resistance, as collision-resistant hash families imply these properties.
Hash-Then-MAC: A Practical Application
A common way to construct a secure MAC for arbitrary-length messages is to use a hash function together with a fixed-length MAC.
Construction
Let M be a MAC scheme with message space ? = {0,1} n, and let ℋ be a hash family with output length n. The hash-then-MAC (HtM) scheme is defined as:
- MAC(k, m): Compute y = H(m), then output t = M.MAC(k, y).
- VER(k, m, t): Output M.VER(k, H(m), t).
Security Proof
If ℋ is collision-resistant and M is a secure MAC, then HtM is a secure MAC. The proof uses a hybrid argument:
- Start with the real MAC security game.
- Replace the MAC verification with a check on the hash value.
- Use collision resistance to ensure no collisions occur, allowing a transition to the ideal MAC security game.
A collision in H would break the scheme, as an adversary could forge a MAC for a different message with the same hash.
Merkle-Damgård Construction
Building a hash function for arbitrary-length inputs from a fixed-length compression function is achieved through the Merkle-Damgård (MD) construction.
Construction
Let h : {0,1} n + t → {0,1} n be a compression function. The MD transformation of h is MDh : {0,1} → {0,1} n*, defined as:
- Pad the input x to a multiple of t bits, appending a block encoding the original length of x.
- Initialize y0 to a fixed value (IV).
- For each block xi, compute yi = h(y_i-1 || xi).
- Output the final y_k+1.
Security
If the compression function family ℋ is collision-resistant, then so is MDℋ. The proof shows that any collision in MDh implies a collision in h.
Length-Extension Attacks
A common mistake is to construct a MAC as H(k || m). When H is a Merkle-Damgård hash, this is vulnerable to length-extension attacks.
The Attack
Knowing H(x), an adversary can compute H(x || z) for any z without knowing x. This is because the internal state after processing x is H(x), and the rest of the computation depends only on this state and the padding.
Implications for MACs
If t = H(k || m) is used as a MAC, an adversary can forge the MAC for m || z without knowing k, by exploiting the length-extension property.
Frequently Asked Questions
What is a hash function?
A hash function maps arbitrary-length inputs to fixed-length outputs. Ideally, it should be collision-resistant, meaning it is computationally infeasible to find two different inputs that produce the same output.
Why are hash function families used in security definitions?
Hash function families introduce randomness into the selection of the hash function, preventing adversaries from hard-coding collisions. This makes collision resistance achievable in practice.
What is the hash-then-MAC construction?
Hash-then-MAC combines a hash function and a fixed-length MAC to create a secure MAC for arbitrary-length messages. The message is first hashed, and then the hash value is MACed.
How does the Merkle-Damgård construction work?
Merkle-Damgård processes input in blocks using a compression function. Each block is combined with the current state, and the final state is the output. Padding ensures the input length is a multiple of the block size.
What are length-extension attacks?
Length-extension attacks allow an adversary to compute H(x || z) from H(x) without knowing x. This is a property of Merkle-Damgård hashes and can break certain MAC constructions.
How can I avoid length-extension attacks?
Use cryptographic constructions designed to resist length-extension, such as HMAC, or avoid using plain Merkle-Damgård hashes in contexts where length-extension is a threat. Always explore more strategies for secure hash function applications.