The Knuth-Morris-Pratt (KMP) algorithm is an efficient string searching algorithm that finds occurrences of a “pattern” string within a “text” string.
The algorithm preprocesses the pattern to create a longest prefix-suffix (LPS) array, which allows the algorithm to skip unnecessary comparisons when a mismatch occurs. This array tells us, for each index in the pattern, the length of the longest proper prefix which is also a suffix.
For example, consider the pattern “ABABCABAB”. The LPS array for this pattern would be [0, 0, 1, 2, 0, 1, 2, 3, 4]. You can build the LPS array using the following code:
def build_lps(pattern: str):
lps = [0] * len(pattern)
j = 0 # length of current matched prefix
for i in range(1, len(pattern)):
while j > 0 and pattern[i] != pattern[j]:
j = lps[j - 1]
if pattern[i] == pattern[j]:
j += 1
lps[i] = j
return lps
KMP uses LPS to decide how far to “fall back” in the pattern after a mismatch. If we’ve matched j characters and text[i] != pattern[j], we set j = lps[j-1] (instead of restarting at 0), keeping the longest prefix that could still match at this position.
def kmp_search(text: str, pattern: str):
if pattern == "":
# empty pattern matches everywhere
return list(range(len(text) + 1))
lps = build_lps(pattern)
matches = []
j = 0 # index in pattern
for i in range(len(text)):
while j > 0 and text[i] != pattern[j]:
j = lps[j - 1]
if text[i] == pattern[j]:
j += 1
if j == len(pattern):
matches.append(i - len(pattern) + 1)
j = lps[j - 1]
return matches
The time complexity of the KMP algorithm is , where is the length of the text and is the length of the pattern. The space complexity is due to the LPS array. This makes KMP significantly more efficient than the naive approach, especially for large texts and patterns.