一、BF
BF,简单直观也是字符串处理库中使用最多的,在处理数据长度不大的时候效率不会比KMP和BM慢,示例代码如下:
/** * Code shared by String and StringBuffer to do searches. The * source is the character array being searched, and the target * is the string being searched for. * * @param source the characters being searched. * @param sourceOffset offset of the source string. * @param sourceCount count of the source string. * @param target the characters being searched for. * @param targetOffset offset of the target string. * @param targetCount count of the target string. * @param fromIndex the index to begin searching from. */ static int indexOf(char[] source, int sourceOffset, int sourceCount, char[] target, int targetOffset, int targetCount, int fromIndex) { if (fromIndex >= sourceCount) { return (targetCount == 0 ? sourceCount : -1); } if (fromIndex < 0) { fromIndex = 0; } if (targetCount == 0) { return fromIndex; } char first = target[targetOffset]; int max = sourceOffset + (sourceCount - targetCount); for (int i = sourceOffset + fromIndex; i <= max; i++) { /* Look for first character. */ if (source[i] != first) { while (++i <= max && source[i] != first); } /* Found first character, now look at the rest of v2 */ if (i <= max) { int j = i + 1; int end = j + targetCount - 1; for (int k = targetOffset + 1; j < end && source[j] == target[k]; j++, k++); if (j == end) { /* Found whole string. */ return i - sourceOffset; } } } return -1; }
二、KMP
在BF算法中,每次匹配pattern字符串后是简单的把text的下标回溯,pattern的下标设置为0,这种方法没有充分利用我们已经对比的字符中已知的信息,如果能把已知的信息利用上,就会减少很多重复工作。
KMP算法通过对pattern进行预计算来减少下标回溯的距离,进而加快查找的速度,预计算的数据存储在一个数组中,举例说明:
text:ababcdefgh pattern:ababd
ababcdefgh
a
aba
可以看到当pattern自身有重复的时候,KMP算法利用了这个信息,不需要重新匹配已经匹配的重复字符,匹配算法讲解如下:
对pattern预计算的结果存储在failure[]中,failure[0] = 0,假设failure[i] = k,即text[0 … k-1] == pattern[i-k, i-1],当pattern[i] == pattern[k],则pattern[0 … k] == pattern[i-k, i],进而failure[i+1] = failure[i] + 1 = k + 1;当pattern[i] != pattern[k],模式匹配失败,则k = failure[k]。KMP算法如下:
public class KMPMatch { private String string; private String pattern; private int[] failure; private int matchPoint; public KMPMatch(String string, String pattern) { this.string = string; this.pattern = pattern; failure = new int[pattern.length()]; computeFailure(); } public int getMatchPoint() { return matchPoint; } public boolean match() { // Tries to find an occurence of the pattern in the string int j = 0; if (string.length() == 0) return false; for (int i = 0; i < string.length(); i++) { while (j > 0 && pattern.charAt(j) != string.charAt(i)) { j = failure[j - 1]; } if (pattern.charAt(j) == string.charAt(i)) { j++; } if (j == pattern.length()) { matchPoint = i - pattern.length() + 1; return true; } } return false; } public boolean match1() { int i = 0; int j = 0; if (string.length() == 0) return false; while (i + pattern.length() - j <= string.length()) { if (j >= pattern.length()) { matchPoint = i - pattern.length(); return true; } if (string.charAt(i) == pattern.charAt(j)) { i++; j++; } else { if (j > 0) { j = failure[j - 1]; } else { i++; } } } return false; } /** * Computes the failure function using a boot-strapping process, * where the pattern is matched against itself. */ private void computeFailure() { int j = 0; for (int i = 1; i < pattern.length(); i++) { while (j > 0 && pattern.charAt(j) != pattern.charAt(i)) { j = failure[j - 1]; } if (pattern.charAt(j) == pattern.charAt(i)) { j++; } failure[i] = j; } } }
根据KMP的pattern的failure生成规则可以看出,kmp比较适合字符集基数较少,pattern中能够出现较多重复规则的情况下使用性能比较好,例如DNA等
三、BM
BM算法有两个要点,
(1)在对比text和pattern的时候,pattern从后往前对比
(2)在从后向前对比pattern和text的时候,c是text中当前对比的字符,如果相等正常向前对比就行;如果不相等,分两种情况处理:
1)如果c不包含在pattern中,这个时候text的对比索引可以向前进pattern的长度
2)如果c包含在pattern里,由于是从后向前对比的,所以找出pattern中最右边出现的c,将text的索引向前使text中的c跟pattern中的c对齐
由于要预先计算出text所属字符集所有字符在pattern中最后一次出现的位置,所以这个字符集不能太大,例如英文字符26个用BM来进行查找就比较合适。代码如下:
public class BoyerMoore { private final int R; // the radix private int[] right; // the bad-character skip array private char[] pattern; // store the pattern as a character array private String pat; // or as a string // pattern provided as a string public BoyerMoore(String pat) { this.R = 256; this.pat = pat; // position of rightmost occurrence of c in the pattern right = new int[R]; for (int c = 0; c < R; c++) right[c] = -1; for (int j = 0; j < pat.length(); j++) right[pat.charAt(j)] = j; } // pattern provided as a character array public BoyerMoore(char[] pattern, int R) { this.R = R; this.pattern = new char[pattern.length]; for (int j = 0; j < pattern.length; j++) this.pattern[j] = pattern[j]; // position of rightmost occurrence of c in the pattern right = new int[R]; for (int c = 0; c < R; c++) right[c] = -1; for (int j = 0; j < pattern.length; j++) right[pattern[j]] = j; } // return offset of first match; N if no match public int search(String txt) { int M = pat.length(); int N = txt.length(); int skip; for (int i = 0; i <= N - M; i += skip) { skip = 0; for (int j = M-1; j >= 0; j--) { if (pat.charAt(j) != txt.charAt(i+j)) { skip = Math.max(1, j - right[txt.charAt(i+j)]); break; } } if (skip == 0) return i; // found } return N; // not found } // return offset of first match; N if no match public int search(char[] text) { int M = pattern.length; int N = text.length; int skip; for (int i = 0; i <= N - M; i += skip) { skip = 0; for (int j = M-1; j >= 0; j--) { if (pattern[j] != text[i+j]) { skip = Math.max(1, j - right[text[i+j]]); break; } } if (skip == 0) return i; // found } return N; // not found }}