Hashing

Authors: Benjamin Qi, Andi Qu, Peng Bai

Contributors: Andrew Wang, Kevin Sheng

Quickly testing equality of substrings or sets with a small probability of failure.


String Hashing


CPH	26.3 - String Hashing
cp-algo	String Hashing	code
PAPS	14.3 - Hashing	many applications
rng-58	Hashing and Probability of Collision

Template

As mentioned in the articles above, there is no need to calculate modular inverses.

C++

class HashedString {
  private:
	// change M and B if you want
	static const long long M = 1e9 + 9;
	static const long long B = 9973;

	// pow[i] contains B^i % M
	static vector<long long> pow;

	// p_hash[i] is the hash of the first i characters of the given string

Java

import java.util.*;

public class HashedString {
	// Change M and B if you want
	public static final long M = (long)1e9 + 9;
	public static final long B = 9973;

	// pow[i] contains B^i % M
	private static ArrayList<Long> pow = new ArrayList<>();

Python

class HashedString:
	# Change M and B if you want
	M = int(1e9) + 9
	B = 9973

	# pow[i] contains B^i % M
	_pow = [1]

	def __init__(self, s: str):
		while len(self._pow) <= len(s):

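This implementation calculates the prefix hashes

$$\texttt{p\_hash}[i + 1] = \left(\sum_{x = 0}^i B^{i - x} \cdot S[x]\right) \bmod M$$

The hash of any particular substring $S[a : b]$ (both endpoints inclusive) is then calculated as

$$\left(\sum_{x = a}^b B^{b - x} \cdot S[x] \right) \bmod M = (\texttt{p\_hash}[b + 1] - \texttt{p\_hash}[a] \cdot B^{b - a + 1}) \bmod M$$

using prefix sums. This is nice because the highest power of $B$ in that polynomial will always be $B^{b - a}$ , no matter where the substring starts. Equal substrings therefore get equal hashes without any division by a power of $B$ , which is why no modular inverses are needed.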
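The two formulas above can be sketched in a few lines of Python. This is a minimal illustration rather than the template itself: the helper names `build_prefix_hashes`, `build_powers`, and `substring_hash` are made up for this example, and `M` and `B` are simply copied from the template.

```python
M = 10**9 + 9  # prime modulus, same value as in the template
B = 9973       # base, same value as in the template


def build_prefix_hashes(s):
    """Return p_hash, where p_hash[i] is the hash of the first i characters of s."""
    p_hash = [0] * (len(s) + 1)
    for i, ch in enumerate(s):
        p_hash[i + 1] = (p_hash[i] * B + ord(ch)) % M
    return p_hash


def build_powers(n):
    """Return pw, where pw[i] = B^i mod M for 0 <= i <= n."""
    pw = [1] * (n + 1)
    for i in range(1, n + 1):
        pw[i] = pw[i - 1] * B % M
    return pw


def substring_hash(p_hash, pw, a, b):
    """Hash of s[a..b], both endpoints inclusive."""
    # Python's % always returns a non-negative result, so no extra "+ M" is needed.
    return (p_hash[b + 1] - p_hash[a] * pw[b - a + 1]) % M


s = "abracadabra"
p_hash = build_prefix_hashes(s)
pw = build_powers(len(s))

# s[0..3] and s[7..10] are both "abra", so their hashes must agree.
first_abra = substring_hash(p_hash, pw, 0, 3)
second_abra = substring_hash(p_hash, pw, 7, 10)
```

Because every substring hash is normalized to start at $B^0$ on its last character, `first_abra` and `second_abra` are equal even though the two occurrences sit at different offsets.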

Collision Probability

In general, when using polynomial hashing modulo a prime modulus $M$ , the probability of two distinct strings having equal hashes, taken over all possible choices of the base $B$ , can be up to $\frac{n}{M}$ , where $n$ is the length of the longer of the two strings. See rng-58's blog post about hashing (linked above) for how to derive this probability using the Schwartz-Zippel lemma.

Since $10^9 + 9$ is prime, the probability of collision when using this hash is at most $\frac{N}{10^9 + 9} < 10^{-4}$ for $N=10^5$ , by the Schwartz-Zippel lemma. This means that if you select any two different strings of length at most $N=10^5$ and a base chosen uniformly at random modulo $10^9 + 9$ , the probability that they hash to the same value is less than $10^{-4}$ . Note that the base $9973$ in the code is fixed rather than random, so the bound only applies as long as the input was not constructed against that particular base.

Warning!

Given a set of the hashes of $m$ distinct strings with length up to $n$ , the probability that some two of the strings have equal hashes can be up to $\frac{m^2n}{M}$ by the birthday paradox. Assuming $m$ and $n$ are on the order of $10^5$ , the modulus $M=10^9+9$ from the template is nowhere close to large enough to avoid collisions. Use a larger prime modulus such as $2^{61}-1$ (and do multiplications using 128-bit integers).
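To get a feel for the numbers, the bound $\frac{m^2n}{M}$ can be evaluated directly. The function name `pairwise_collision_bound` is made up for this illustration, and the result is only the upper bound from the warning above, not a measured collision rate.

```python
def pairwise_collision_bound(m, n, modulus):
    """Upper bound m^2 * n / modulus on the chance that some two of m strings
    of length up to n collide, capped at 1 since it is a probability."""
    return min(1.0, m * m * n / modulus)


# m = n = 10^5, as in the warning above.
small_modulus_bound = pairwise_collision_bound(10**5, 10**5, 10**9 + 9)
large_modulus_bound = pairwise_collision_bound(10**5, 10**5, 2**61 - 1)
```

With the small modulus the bound is vacuous (it is capped at 1), while with $2^{61}-1$ it drops to roughly $4.3 \cdot 10^{-4}$ .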

Warning!

In contests with open hacking (in particular, Codeforces educational rounds), make sure to choose the base of your polynomial hash randomly, as mentioned here.

C++

In C++, a virtually unhackable way of generating $B$ in the implementation above is to use a random number generator seeded with a high-precision clock, as described here.

// seed the generator from a high-precision clock so that B differs between runs
mt19937 rng((uint32_t)chrono::steady_clock::now().time_since_epoch().count());
const long long B = uniform_int_distribution<long long>(0, M - 1)(rng);

Searching For Strings

CCC - Easy

Focus Problem – try your best to solve this problem before continuing!

Explanation - One Hash

We'll use a sliding window over $H$ to find the "matches" with $N$ .

Since we don't care about relative order when comparing two substrings, we can store frequency tables of the characters in the current window and in $N$ . When we slide the window, at most two values in that table change. To compare two substrings, we simply compare the 26 values in each table.

If we only needed to count the number of matches, then the above alone would suffice (in fact, IOI 2006 Writing is just that). However, we need to count the distinct permutations of $N$ in $H$ , so we need to be a bit more clever.

One way to solve this is by storing the polynomial hashes of each match in a set, since we expect different permutations to have different polynomial hashes. The answer would simply be the size of that set at the end.

Using a relatively small modulus such as $M=10^9+9$ will not pass (see the note above regarding the birthday paradox). Instead, we use $M=2^{61}-1$ .
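The approach described above can be sketched in Python as follows. This is an illustrative sketch, not the reference solution: the function name `count_distinct_permutations` is made up, the input is assumed to consist of lowercase letters only, and the base is fixed for reproducibility (choose it randomly if hacks are a concern).

```python
M = (1 << 61) - 1  # large prime modulus, as suggested above
B = 9973           # fixed base, assumed here only to keep the sketch reproducible


def count_distinct_permutations(needle, haystack):
    """Count distinct substrings of haystack that are permutations of needle."""
    n, h = len(needle), len(haystack)
    if n > h:
        return 0

    freq_target = [0] * 26
    for ch in needle:
        freq_target[ord(ch) - ord("a")] += 1

    # Prefix hashes and powers of B for the haystack.
    p_hash = [0] * (h + 1)
    pw = [1] * (h + 1)
    for i, ch in enumerate(haystack):
        p_hash[i + 1] = (p_hash[i] * B + ord(ch)) % M
        pw[i + 1] = pw[i] * B % M

    freq_curr = [0] * 26
    seen_hashes = set()
    for right in range(h):
        freq_curr[ord(haystack[right]) - ord("a")] += 1
        left = right - n + 1
        if left < 0:
            continue  # window is not full yet
        if left > 0:
            # Drop the character that just slid out of the window.
            freq_curr[ord(haystack[left - 1]) - ord("a")] -= 1
        if freq_curr == freq_target:  # compares all 26 counts
            seen_hashes.add((p_hash[right + 1] - p_hash[left] * pw[n]) % M)
    return len(seen_hashes)
```

For example, with needle `aab` and haystack `abacabaa`, the matching windows are `aba` (twice) and `baa`, so the function returns 2 barring a hash collision.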

Implementation

Time Complexity: $\mathcal O((|N| + |H|) \cdot \Sigma)$ , where $\Sigma$ is the size of the alphabet.

Failure Probability: $\mathcal O\left(\frac{|N||H|^2}{M}\right)$

C++

#include <bits/stdc++.h>
using namespace std;

using ll = long long;

// HashedString class as in the Template section above, with M = 2^61 - 1

int freq_target[26], freq_curr[26];
string n, h;

Explanation - Two Hashes

An alternative solution without frequency tables would be to hash the substrings that we're trying to match. Since order doesn't matter, we need to modify our hash function slightly.

In particular, instead of computing the polynomial hash of the substrings, compute the product $(B + s_1)(B + s_2) \dots (B + s_k) \bmod M$ as the hash (again, using two moduli). This hash is nice because the relative order of the letters doesn't matter, as multiplication is commutative. Furthermore, since any two strings with different frequency tables map to different polynomials in $B$ , they hash to the same value with probability at most $\frac{|N|}{M}$ over the choice of $B$ .

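Sliding the window means dividing out the factor $(B + s_i)$ of the character that leaves it, which requires a modular inverse. Since computing a modular inverse takes $\mathcal O(\log M)$ time, there's an extra $\log M$ factor in the time complexity.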
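A minimal Python sketch of this idea is shown below. It makes several simplifying assumptions that are not in the original solution: a single modulus $2^{61}-1$ is used instead of two moduli, the base is fixed, the function name `count_distinct_permutations_product` is made up, and the inverse is computed with Fermat's little theorem via Python's built-in three-argument `pow`.

```python
M = (1 << 61) - 1  # single large prime modulus (assumption made for brevity)
B = 9973           # fixed base; B + ord(ch) is never 0 mod M, so inverses exist


def count_distinct_permutations_product(needle, haystack):
    """Count distinct substrings of haystack that are permutations of needle,
    detecting permutations with an order-independent product hash."""
    n, h = len(needle), len(haystack)
    if n > h:
        return 0

    # Order-independent hash of the needle: product of (B + character).
    target = 1
    for ch in needle:
        target = target * (B + ord(ch)) % M

    # Ordinary polynomial prefix hashes, used to tell distinct matches apart.
    p_hash = [0] * (h + 1)
    pw = [1] * (h + 1)
    for i, ch in enumerate(haystack):
        p_hash[i + 1] = (p_hash[i] * B + ord(ch)) % M
        pw[i + 1] = pw[i] * B % M

    window = 1
    seen_hashes = set()
    for right in range(h):
        window = window * (B + ord(haystack[right])) % M
        left = right - n + 1
        if left < 0:
            continue  # window is not full yet
        if left > 0:
            # Divide out the character that left the window.
            # pow(x, M - 2, M) is the inverse of x modulo the prime M.
            window = window * pow(B + ord(haystack[left - 1]), M - 2, M) % M
        if window == target:
            seen_hashes.add((p_hash[right + 1] - p_hash[left] * pw[n]) % M)
    return len(seen_hashes)
```

Each window update performs one modular exponentiation, which is where the extra $\log M$ factor comes from.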

Implementation

Time Complexity: $\mathcal O((|N| + |H|) \log M)$

Failure Probability: $\mathcal O\left(\frac{|N||H|^2}{M}\right)$

C++

#include <bits/stdc++.h>
typedef long long ll;
using namespace std;

// (HashedString class omitted)

const auto M = HashedString::M;
const auto B = HashedString::B;
const auto mul = HashedString::mul;
const auto mod_mul = HashedString::mod_mul;

Problems

Source	Problem Name	Difficulty	Tags
CSES	Finding Periods	Very Easy	Hashing
Silver	Censoring	Easy	Hashing
CEOI	2017 - Palindromic Partitions	Easy	Greedy, Hashing
CF	Check Transcription	Easy	Hashing
CF	Fullmetal Alchemist II	Easy	Hashing
Gold	Bovine Genomics	Normal	Binary Search, Hashing
Gold	Lights Out	Normal	Hashing, Simulation
RMI	2017 - Hangman 2	Normal	Hashing
COCI	2017 - Osmosmjerka	Normal	Hashing, Probability
COCI	2021 - Sateliti	Hard	Binary Search, Hashing
CF	Liar	Hard	DP, Hashing
Baltic OI	2018 - Genetics	Hard	Hashing
COCI	2016 - Zamjene	Very Hard	DSU, Hashing
COI	2016 - Palinilap	Very Hard	Binary Search, Hashing

XOR Hashing/Zobrist Hashing

Resources
	CF	XOR Hashing

Hashing can also be used to check if sets of elements are equal. To do this, we first randomly generate a hash value for each unique element. Typically, the hash value is an integer in the range $[0, 2^{63}-1]$ since $2^{63}-1$ is the maximum value of a 64-bit signed integer. The hash of a set $S$ is the XOR sum of the hash values of all the elements in $S$ . Since $x \oplus x = 0$ for all $x$ , we can delete an element $s$ from set $S$ by applying the hash value of $s$ again on the hash. The probability of a collision of $N$ sets is approximately $\frac{N^2}{M}$ , where $M$ is the maximum possible hash value.
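As a rough illustration of the paragraph above, here is a small Python sketch. The class `XorSetHasher` and its method names are invented for this example; it draws each element's value from $[0, 2^{63}-1]$ using `random.Random.getrandbits`.

```python
import random


class XorSetHasher:
    """Assigns each distinct element a random 63-bit value and hashes a set
    as the XOR of its elements' values."""

    def __init__(self, seed=None):
        self._rng = random.Random(seed)
        self._value = {}

    def value_of(self, element):
        # Generate the element's random value lazily, on first use.
        if element not in self._value:
            self._value[element] = self._rng.getrandbits(63)
        return self._value[element]

    def hash_set(self, elements):
        result = 0
        for element in set(elements):  # each distinct element counts once
            result ^= self.value_of(element)
        return result


hasher = XorSetHasher()
full = hasher.hash_set([3, 1, 2])

# Deleting 2 from the set: XOR its value in a second time.
without_two = full ^ hasher.value_of(2)
```

Here `full` equals `hasher.hash_set([1, 2, 3])` regardless of insertion order, and `without_two` equals `hasher.hash_set([1, 3])`.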

Prefix Equality

AC - Easy

Focus Problem – try your best to solve this problem before continuing!

Explanation

For each distinct numerical value in the arrays, we generate a random positive 64-bit integer. With this map, we can build the prefix XOR hashes for $a$ and $b$ .

An issue we have to deal with is duplicate elements: XORing an element's hash into the prefix hash a second time cancels it out, which is equivalent to the element never having existed in the first place. To fix this, we use a set to keep track of the values already seen and only XOR an element into the prefix hash if it's new.

Now, to answer a query, we check if the XOR hashes at the given indices are the same.
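The steps above can be sketched in Python as follows. This is a hedged illustration rather than the reference solution: the function names `prefix_set_hashes` and `answer_queries` are made up, queries are assumed to be 1-indexed pairs of prefix lengths, and input parsing is left out.

```python
import random


def prefix_set_hashes(arr, value_of):
    """hashes[i] is the XOR hash of the set of distinct values in arr[:i]."""
    seen = set()
    hashes = [0]
    for x in arr:
        current = hashes[-1]
        if x not in seen:  # only XOR in values we have not seen before
            seen.add(x)
            current ^= value_of[x]
        hashes.append(current)
    return hashes


def answer_queries(a, b, queries):
    """For each (x, y), report whether the first x values of a and the first
    y values of b form the same set (up to hash collisions)."""
    rng = random.Random()
    # One random positive value below 2^63 per distinct number.
    value_of = {v: rng.randrange(1, 1 << 63) for v in set(a) | set(b)}
    hash_a = prefix_set_hashes(a, value_of)
    hash_b = prefix_set_hashes(b, value_of)
    return [hash_a[x] == hash_b[y] for x, y in queries]


a = [1, 2, 3, 4, 5]
b = [1, 2, 2, 4, 3]
results = answer_queries(a, b, [(2, 3), (3, 3), (4, 5)])
```

In this example the prefixes of length 2 and 3 both form the set $\{1, 2\}$ , the prefixes of length 3 and 3 differ in the element $3$ , and the prefixes of length 4 and 5 both form $\{1, 2, 3, 4\}$ .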

Implementation

Time Complexity: $\mathcal{O}(N\log N + Q)$

C++

#include <chrono>
#include <iostream>
#include <map>
#include <random>
#include <set>
#include <vector>

using std::cout;
using std::endl;
using std::vector;

Problems

Source	Problem Name	Difficulty	Tags
CF	Three Occurrences	Hard	Two Pointers, XOR Hashing
CF	Hyperregular Bracket Strings	Hard	Combinatorics, XOR Hashing
