Finding substrings in a numpy vectorized function
Hide 10 needles in the haystack
20220612174519
This document explores how a vectorized numpy solution for finding all occurrences of a sublist (needle) in a bigger list (haystack) works.
The solution (and question) comes from the Stack Overflow page called Python/NumPy first occurrence of subarray and is also referenced in the From Python to Numpy ebook.
import numpy as np
haystack = np.random.randint(1000, size=(int(1e6),))
needle = np.random.randint(1000, size=(100,))
place = np.random.randint(int(1e6 - 100 + 1), size=10)
for idx in place:
    haystack[idx:idx+100] = needle
Solution
def find_subsequence(seq, subseq):
    target = np.dot(subseq, subseq)
    candidates = np.where(np.correlate(seq, subseq, mode='valid') == target)[0]
    # some of the candidates entries may be false positives, double check
    check = candidates[:, np.newaxis] + np.arange(len(subseq))
    mask = np.all((np.take(seq, check) == subseq), axis=-1)
    return candidates[mask]
>>> find_subsequence(haystack, needle)
array([ 25719, 149766, 279629, 372581, 373305, 535210, 573245, 806295,
838102, 954196])
>>> np.all(np.sort(place) == find_subsequence(haystack, needle))
True
>>> %timeit find_subsequence(haystack, needle)
1 loop, best of 5: 113 ms per loop
Analysis
target = np.dot(needle, needle)
>>> target
31037624
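Note that target is simply the needle's dot product with itself, i.e. the sum of its squared elements:
>>> np.sum(needle ** 2) == target
True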
np.correlate computes the dot product between needle and every sliding window of the same size in haystack. This means that wherever this function returns a value equal to the target variable, you probably have a needle.
>>> np.correlate(haystack, needle, mode='valid')
array([26408171, 23900354, 25843323, ..., 22086714, 23579285, 22736943])
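To make the sliding-window picture concrete, here is a tiny check of my own (toy arrays, not the actual data) that np.correlate with mode='valid' matches the manual window dot products:
seq = np.array([1, 2, 3, 4, 5])
sub = np.array([2, 3])
# dot product of sub against every same-length window of seq
manual = np.array([np.dot(seq[i:i + len(sub)], sub)
                   for i in range(len(seq) - len(sub) + 1)])
>>> manual
array([ 8, 13, 18, 23])
>>> np.correlate(seq, sub, mode='valid')
array([ 8, 13, 18, 23])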
Naturally, the same dot product can be produced by many different pairs of vectors, so the positions where np.correlate equals target only indicate possible places where the needle exists.
candidates = np.where(np.correlate(haystack, needle, mode='valid') == target)[0]
>>> candidates.shape, candidates
((10,), array([ 25719, 149766, 279629, 372581, 373305, 535210, 573245, 806295,
838102, 954196]))
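In this run all ten candidates happen to be genuine, but a tiny made-up example (mine, not from the original answer) shows how a false positive can appear, because two different windows can give the same dot product with the needle:
toy_haystack = np.array([3, 1, 0, 1, 2])
toy_needle = np.array([1, 2])
toy_target = np.dot(toy_needle, toy_needle)  # 5
>>> np.correlate(toy_haystack, toy_needle, mode='valid')
array([5, 1, 2, 5])
>>> np.where(np.correlate(toy_haystack, toy_needle, mode='valid') == toy_target)[0]
array([0, 3])
Only index 3 is a real occurrence; index 0 (the window [3, 1] against [1, 2]) just happens to produce the same dot product, which is exactly why the double check below is needed.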
check = candidates[:, np.newaxis] + np.arange(len(needle))
>>> check.shape, check[:4, :10]
((10, 100),
array([[ 25719, 25720, 25721, 25722, 25723, 25724, 25725, 25726,
25727, 25728],
[149766, 149767, 149768, 149769, 149770, 149771, 149772, 149773,
149774, 149775],
[279629, 279630, 279631, 279632, 279633, 279634, 279635, 279636,
279637, 279638],
[372581, 372582, 372583, 372584, 372585, 372586, 372587, 372588,
372589, 372590]]))
check builds, for each candidate, a row of consecutive indices, equal in length to the needle and starting at the candidate position returned by the np.correlate step. These indices select the haystack elements that should match the actual needle values.
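The broadcasting trick is easier to see on a tiny made-up example: a column of candidate starts plus a row of offsets yields one row of indices per candidate.
>>> cands = np.array([2, 7])
>>> cands[:, np.newaxis] + np.arange(3)
array([[2, 3, 4],
       [7, 8, 9]])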
mask = np.all((np.take(haystack, check) == needle), axis=-1)
>>> mask
array([ True, True, True, True, True, True, True, True, True,
True])
mask performs the direct, element-by-element check that each candidate really is the start of a needle, by comparing the haystack values at the check indices with the needle itself.
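Applied to the toy false positive from earlier (again my own sketch), the double check keeps only the genuine occurrence:
toy_candidates = np.array([0, 3])  # from the toy correlate step above
toy_check = toy_candidates[:, np.newaxis] + np.arange(len(toy_needle))
toy_mask = np.all(np.take(toy_haystack, toy_check) == toy_needle, axis=-1)
>>> toy_mask
array([False,  True])
>>> toy_candidates[toy_mask]
array([3])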
Use cases
This vectorisation could be useful for searching for an embedding or a hash in a large list of hashes, where the hashes are larger than 64 bits (so they can't fit in a single array element). If that isn't the case (the hash fits in 64 bits), you don't need np.correlate at all: just call np.where and be done.
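As a rough illustration of that hash use case (my own sketch, not from the post): a 128-bit MD5 digest does not fit in a single 64-bit element, so it can be stored as 16 byte-sized values and located with the find_subsequence function defined above.
import hashlib
# 1000 hypothetical 16-byte digests, flattened into one array;
# casting to int64 avoids overflow in the dot products computed by np.correlate
digests = [hashlib.md5(str(i).encode()).digest() for i in range(1000)]
flat = np.frombuffer(b''.join(digests), dtype=np.uint8).astype(np.int64)
wanted = np.frombuffer(digests[123], dtype=np.uint8).astype(np.int64)
>>> find_subsequence(flat, wanted) // 16  # byte offset -> digest index
array([123])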
I wonder if this is, in practice, faster than using a plain dictionary with its O(1) lookup. I guess the dictionary may well be faster, but it is surely a lot more memory hungry! At the same time, if you need to optimise for memory, maybe you shouldn't use Python to begin with…
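For comparison, the dictionary lookup speculated about above might look like this (reusing the digests list from the previous sketch); it trades extra memory for constant-time lookups:
index = {d: i for i, d in enumerate(digests)}  # one-time build, extra memory
>>> index[digests[123]]
123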
Maybe a faster implementation via scipy.signal.correlate
NumPy notes in its docs for correlate:
numpy.correlate may perform slowly in large arrays (i.e. n = 1e5) because it does not use the FFT to compute the convolution; in that case, scipy.signal.correlate might be preferable.
I've modified the function below to use the FFT method from scipy to see how much of an improvement we get by using it.
from scipy.signal import correlate
def find_subsequence_scipy(seq, subseq):
    target = np.dot(subseq, subseq)
    candidates = np.where(correlate(seq, subseq, mode='valid', method='fft') == target)[0]
    # some of the candidates entries may be false positives, double check
    check = candidates[:, np.newaxis] + np.arange(len(subseq))
    mask = np.all((np.take(seq, check) == subseq), axis=-1)
    return candidates[mask]
>>> find_subsequence_scipy(haystack, needle), np.all(np.sort(place) == find_subsequence_scipy(haystack, needle))
(array([ 25719, 149766, 279629, 372581, 373305, 535210, 573245, 806295,
838102, 954196]), True)
>>> %timeit find_subsequence(haystack, needle)
1 loop, best of 5: 112 ms per loop
>>> %timeit find_subsequence_scipy(haystack, needle)
10 loops, best of 5: 89.9 ms per loop
We get a roughly 20% speedup (at the cost of adding a new dependency on scipy)!