Understanding and Implementing the Knuth-Morris-Pratt (KMP) Algorithm

Sep 20, 2024

The Knuth-Morris-Pratt (KMP) algorithm is an efficient string-matching algorithm that improves upon the naive approach of substring searching. It's particularly useful when you need to find occurrences of a pattern within a larger text, especially when the pattern might appear multiple times or when the text is very long.

How KMP Works

The key idea behind KMP is to utilize the information gained from previous match attempts to skip unnecessary comparisons. It does this by preprocessing the pattern to create a "failure function" or "partial match table". This table helps the algorithm to decide how many characters to skip when a mismatch occurs.

Example

Let's consider a simple example to understand how KMP works:

Text: "ABABDABACDABABCABAB" Pattern: "ABABCABAB"

The KMP algorithm would preprocess the pattern "ABABCABAB" to create a partial match table. Then, it would use this table to efficiently find the pattern in the text, avoiding unnecessary comparisons.

Step 1:
ABABDABACDABABCABAB
ABABCABAB
^

Step 2:
ABABDABACDABABCABAB
ABABCABAB
 ^

Step 3:
ABABDABACDABABCABAB
ABABCABAB
  ^

Step 4:
ABABDABACDABABCABAB
ABABCABAB
   ^

Step 5:
ABABDABACDABABCABAB
ABABCABAB
    ^
    
Step 6: (Mismatch, shift pattern)
ABABDABACDABABCABAB
  ABABCABAB
  ^

Step 7:
ABABDABACDABABCABAB
  ABABCABAB
   ^

Step 8:
ABABDABACDABABCABAB
  ABABCABAB
    ^

Step 9: (Mismatch, shift pattern)
ABABDABACDABABCABAB
    ABABCABAB
    ^

Step 10:
ABABDABACDABABCABAB
    ABABCABAB
     ^

Step 11: (Mismatch, shift pattern)
ABABDABACDABABCABAB
     ABABCABAB
     ^

Step 12:
ABABDABACDABABCABAB
     ABABCABAB
      ^

Step 13: (Mismatch, shift pattern)
ABABDABACDABABCABAB
      ABABCABAB
      ^

Step 14:
ABABDABACDABABCABAB
      ABABCABAB
       ^

Step 15:
ABABDABACDABABCABAB
      ABABCABAB
        ^

Step 16:
ABABDABACDABABCABAB
      ABABCABAB
         ^

Step 17:
ABABDABACDABABCABAB
      ABABCABAB
          ^

Step 18:
ABABDABACDABABCABAB
      ABABCABAB
           ^

Step 19: (Match found!)
ABABDABACDABABCABAB
      ABABCABAB
            ^

Match found at index 10

Explanation:

The algorithm starts by comparing characters from left to right.
When a mismatch occurs (Step 5), the pattern shifts based on the pre-computed LPS (Longest Proper Prefix which is also Suffix) array.
This process continues, shifting the pattern efficiently to avoid unnecessary comparisons.
Finally, a complete match is found at index 10 (Step 19).

Time and Space Complexity

Time Complexity

The KMP algorithm has two main phases:

Preprocessing phase: O(m)
- Where m is the length of the pattern
- This phase computes the partial match table (LPS array)
Searching phase: O(n)
- Where n is the length of the text
- This phase performs the actual search

Therefore, the overall time complexity of KMP is O(m + n).

Compared to the naive string matching algorithm, which has a worst-case time complexity of O(mn), KMP is significantly more efficient, especially for large texts or when the pattern appears multiple times.

Space Complexity

The space complexity of KMP is O(m), where m is the length of the pattern. This space is used to store the partial match table (LPS array).

The space complexity is independent of the length of the text being searched, which makes KMP memory-efficient even for very large texts.

C++ Implementation

Here's a C++ template class that implements the KMP algorithm:

#include <vector>
#include <string>

template<typename T>
class KMPMatcher {
private:
    std::vector<T> pattern;
    std::vector<int> lps;

    void computeLPSArray() {
        int len = 0;
        int i = 1;
        lps[0] = 0;

        while (i < pattern.size()) {
            if (pattern[i] == pattern[len]) {
                len++;
                lps[i] = len;
                i++;
            } else {
                if (len != 0) {
                    len = lps[len - 1];
                } else {
                    lps[i] = 0;
                    i++;
                }
            }
        }
    }

public:
    KMPMatcher(const std::vector<T>& pat) : pattern(pat), lps(pat.size()) {
        computeLPSArray();
    }

    std::vector<int> search(const std::vector<T>& text) {
        std::vector<int> matches;
        int i = 0;  // index for text[]
        int j = 0;  // index for pattern[]

        while (i < text.size()) {
            if (pattern[j] == text[i]) {
                j++;
                i++;
            }

            if (j == pattern.size()) {
                matches.push_back(i - j);
                j = lps[j - 1];
            } else if (i < text.size() && pattern[j] != text[i]) {
                if (j != 0)
                    j = lps[j - 1];
                else
                    i = i + 1;
            }
        }

        return matches;
    }
};

Using the KMPMatcher Class

Here's how you can use the KMPMatcher class:

#include <iostream>
#include <vector>
#include <string>

int main() {
    std::string text = "ABABDABACDABABCABAB";
    std::string pattern = "ABABCABAB";

    std::vector<char> text_vec(text.begin(), text.end());
    std::vector<char> pattern_vec(pattern.begin(), pattern.end());

    KMPMatcher<char> kmp(pattern_vec);
    std::vector<int> matches = kmp.search(text_vec);

    std::cout << "Pattern found at indices: ";
    for (int index : matches) {
        std::cout << index << " ";
    }
    std::cout << std::endl;

    return 0;
}

This code will output:

Pattern found at indices: 10

Conclusion

The KMP algorithm is a powerful tool for efficient string matching. By preprocessing the pattern and using the information gained from previous comparisons, it significantly reduces the number of comparisons needed in many cases, especially when dealing with repetitive patterns or long texts.

The time complexity of O(m + n) makes it much more efficient than naive string matching algorithms, particularly for large inputs. Its space complexity of O(m) ensures that it remains memory-efficient even when searching through very large texts.

The template class provided here allows you to use KMP with any type that supports equality comparison, not just characters. This makes it versatile for various applications beyond just string matching.

Vi's Random Thoughts

Discussion about this post

Ready for more?