The Knuth-Morris-Pratt Algorithm (KMP) is widly applied in matching substrings problems.
It was introduced by D.E.Knuth、J,H,Morris and V.R.Pratt.
Partial Match Table
The key to KMP Algorithm is the partial match table, usually stored as a list named next
or fail
.
With its help, we could then avoid moving the pointers back to the beginning once fail matching.
Proper Prefix and Suffix
- Proper Prefixes:
All the substrings with one or more cut off the end. - Proper Suffixes:
All the substrings with one or more cut off the beginning.
Eg: abcde
has prefixes: a
,ab
,abc
,abcd
; suffixes:e
,de
,cde
,bcde
Partial Match Value
The length of the longest proper prefix that matches a proper suffix in the same substring.
Sample partial match table
String: s = abababca
, next = [0,0,1,2,3,4,0,1]
string: a b a b a b c a
index: 0 1 2 3 4 5 6 7
value: 0 0 1 2 3 4 0 1
Notes:
- For substrings
a
andab
, they don’t have any matching proper prefix and proper suffix. - Examining the first three characters
aba
, we see that proper prefixa
and proper suffixa
. Thus, the partial match value, which is the lenght of the longest proper matching prefix and suffix, is1
. - Similarly:
abab
has longest matchab
;ababa
:aba
ababab
:abab
abababc
does not have any matchabababca
only has matcha
of length 1
Code:
def pmv_match(sub_string):
"""
return: the length of the longest matching proper prefix and suffix
"""
n = len(sub_string) - 1
for i in range(n):
# start from the longest proper prefix
if sub_string[:n - i] == sub_string[i + 1:]:
return n - i
return 0
Code Implementation of Partial Match Table
Using pmv_match()
function:
def get_pmt(s):
"""
return: the list containing all the partial match values of the substrings
"""
n = len(s)
next = [0 for _ in range(n)]
for i in range(n):
next[i] = pmv_match(s[:i])
return next
Fast Implementation
def get_pmt(s):
n = len(s)
next = [0 for _ in range(n)]
for i in range(1, n):
j = next[i-1]
while j != 0 and s[j] != s[i]:
j = next[j]
if s[j] == s[i]:
next[i] = j + 1
return next
Apply Partial Match Table in KMP
With the help of the Partial Match Table next
, we can then easily match the strings.
Violent Algorithm
Let’ s first take a look at the violent algorithm when matching two strings.
Suppose we’re matching the pattern p = abababca
against the text s = bacbababaabcbab
For violent match, we use two pointers i
and j
and move them each along s
and p
,
- if current character match, i.e.
s[i] == p[j]
,i
andj
move along the string, - else, move
i
to the next search starting pointi - (j - 1)
and resetj = 0
i = 0, j = 0
s = bacbababaabcbab (i = 0) | (unmatch) p = abababca (j = 0)
i = i - j + 1 = 1, j = 0
i = 1, j = 0
s = bacbababaabcbab (i = 2) || p = abababca (j = 1)
i = 2, j = 0
…
def violent_match(s:str, p:str):
s_len = len(s)
p_len = len(p)
# set two pointers
i = 0
j = 0
while i < s_len and j < p_len:
if s[i] == p[j]:
# ①s[i] == p[j], i++, j++
i+=1
j+=1
else:
# ②s[i]! = p[j], i = i - (j - 1),j = 0
i = i - j + 1
j = 0
# if successfully match, return the location of start,otherwise return -1
if j == p_len:
return i - j
else:
return -1
However, it is easily to notice that once unmatched,
if we move back both i
and j
, some known unmatched character need to be reconsider again.
Therefore, the partial match table is introduced to skip these step and move the pointer to the next possible starting position.
KMP Algorithm
Recall the Partial Match table of p = abababca
:
string: a b a b a b c a
index: 0 1 2 3 4 5 6 7
value: 0 0 1 2 3 4 0 1
Similarly, we use two pointers i
and j
and move them each along s
and p
.
When moving the pointers, we use the values in the partial match table to skip ahead
rather than redoing unnecessary old comparisons following the formula:
position to be skipped = partial matched length - table[partial matched length - 1]
(table[partial matched length] > 1
)
The first partial match: i = 1, j = 0
s = bacbababaabcbab (i = 2) | p = abababca (j = 1)
partial matched lenght = 1
,table[partial matched length - 1]
(ortable[0]
) is0
, so we don’t get to skip ahead any.The next partial match:
s = bacbababaabcbab (i = 2) ||||| p = abababca (j = 1)
partial matched lenght = 5
,table[partial_match_length - 1]
(ortable[4]
) is3
,
That means we get to skip aheadpartial matched length - table[partial matched length - 1]
(or5 - table[4] = 2
) characters.Skip ahead 2 characters:
// x denotes a skip s = bacbababaabcbab xx||| p = abababca
This is a
partial matched length
of3
.
The value attable[partial_match_length - 1]
(ortable[2]
) is1
.
That means we get to skip aheadpartial matched length - table[partial matched length - 1]
(or3 - table[2]
or2
) characters:Skip ahead again 2 characters:
// x denotes a skip s = bacbababaabcbab xx| p = abababca
j + len(p) > len(s)
, given no match.
Code implementation
def kmp_search(s: str, p: str, next: [int]):
i = 0
j = 0
s_len = len(s)
p_len = len(p)
while i < s_len and j < p_len:
# ①if j = -1, or s[i] == p[j], i++,j++
if j == -1 or s[i] == p[j]:
i += 1
j += 1
else:
# ② if j != -1 and s[i] != p[j], i unchange and j = next[j]
j = next[j]
if j == p_len:
return i - j
else:
return -1