博客
关于我
强烈建议你试试无所不能的chatGPT,快点击我
BF&KMP&BM简析
阅读量:6715 次
发布时间:2019-06-25

本文共 6761 字,大约阅读时间需要 22 分钟。

hot3.png

一、BF

BF,简单直观也是字符串处理库中使用最多的,在处理数据长度不大的时候效率不会比KMP和BM慢,示例代码如下:

/**     * Code shared by String and StringBuffer to do searches. The     * source is the character array being searched, and the target     * is the string being searched for.     *     * @param   source       the characters being searched.     * @param   sourceOffset offset of the source string.     * @param   sourceCount  count of the source string.     * @param   target       the characters being searched for.     * @param   targetOffset offset of the target string.     * @param   targetCount  count of the target string.     * @param   fromIndex    the index to begin searching from.     */    static int indexOf(char[] source, int sourceOffset, int sourceCount,            char[] target, int targetOffset, int targetCount,            int fromIndex) {        if (fromIndex >= sourceCount) {            return (targetCount == 0 ? sourceCount : -1);        }        if (fromIndex < 0) {            fromIndex = 0;        }        if (targetCount == 0) {            return fromIndex;        }        char first = target[targetOffset];        int max = sourceOffset + (sourceCount - targetCount);        for (int i = sourceOffset + fromIndex; i <= max; i++) {            /* Look for first character. */            if (source[i] != first) {                while (++i <= max && source[i] != first);            }            /* Found first character, now look at the rest of v2 */            if (i <= max) {                int j = i + 1;                int end = j + targetCount - 1;                for (int k = targetOffset + 1; j < end && source[j]                        == target[k]; j++, k++);                if (j == end) {                    /* Found whole string. */                    return i - sourceOffset;                }            }        }        return -1;    }

二、KMP

在BF算法中,每次匹配pattern字符串后是简单的把text的下标回溯,pattern的下标设置为0,这种方法没有充分利用我们已经对比的字符中已知的信息,如果能把已知的信息利用上,就会减少很多重复工作。

KMP算法通过对pattern进行预计算来减少下标回溯的距离,进而加快查找的速度,预计算的数据存储在一个数组中,举例说明:

text:ababcdefgh           pattern:ababd

ababcdefgh

  a

babd
               BF算法

    aba

bd
             KMP算法

可以看到当pattern自身有重复的时候,KMP算法利用了这个信息,不需要重新匹配已经匹配的重复字符,匹配算法讲解如下:

对pattern预计算的结果存储在failure[]中,failure[0] = 0,假设failure[i] = k,即text[0 … k-1] == pattern[i-k, i-1],当pattern[i] == pattern[k],则pattern[0 … k] == pattern[i-k, i],进而failure[i+1] = failure[i] + 1 = k + 1;当pattern[i] != pattern[k],模式匹配失败,则k = failure[k]。KMP算法如下:

public class KMPMatch {    private String string;  private String pattern;  private int[] failure;  private int matchPoint;    public KMPMatch(String string, String pattern) {    this.string = string;    this.pattern = pattern;    failure = new int[pattern.length()];    computeFailure();  }        public int getMatchPoint() {    return matchPoint;  }      public boolean match() {    // Tries to find an occurence of the pattern in the string        int j = 0;    if (string.length() == 0) return false;        for (int i = 0; i < string.length(); i++) {      while (j > 0 && pattern.charAt(j) != string.charAt(i)) {        j = failure[j - 1];      }      if (pattern.charAt(j) == string.charAt(i)) { j++; }      if (j == pattern.length()) {        matchPoint = i - pattern.length() + 1;        return true;      }    }    return false;  }    public boolean match1() {       int i = 0;    int j = 0;    if (string.length() == 0) return false;        while (i + pattern.length() - j <= string.length()) {      if (j >= pattern.length()) {        matchPoint = i - pattern.length();                return true;      }      if (string.charAt(i) == pattern.charAt(j)) {        i++;        j++;              } else {        if (j > 0) { j = failure[j - 1]; }        else { i++; }      }    }      return false;      }    /**    * Computes the failure function using a boot-strapping process,   * where the pattern is matched against itself.   */  private void computeFailure() {    int j = 0;    for (int i = 1; i < pattern.length(); i++) {      while (j > 0 && pattern.charAt(j) != pattern.charAt(i)) { j = failure[j - 1]; }      if (pattern.charAt(j) == pattern.charAt(i)) { j++; }      failure[i] = j;    }  }  }

根据KMP的pattern的failure生成规则可以看出,kmp比较适合字符集基数较少,pattern中能够出现较多重复规则的情况下使用性能比较好,例如DNA等

三、BM

BM算法有两个要点,

(1)在对比text和pattern的时候,pattern从后往前对比

(2)在从后向前对比pattern和text的时候,c是text中当前对比的字符,如果相等正常向前对比就行;如果不相等,分两种情况处理:

          1)如果c不包含在pattern中,这个时候text的对比索引可以向前进pattern的长度

          2)如果c包含在pattern里,由于是从后向前对比的,所以找出pattern中最右边出现的c,将text的索引向前使text中的c跟pattern中的c对齐

由于要预先计算出text所属字符集所有字符在pattern中最后一次出现的位置,所以这个字符集不能太大,例如英文字符26个用BM来进行查找就比较合适。代码如下:

public class BoyerMoore {    private final int R;     // the radix    private int[] right;     // the bad-character skip array    private char[] pattern;  // store the pattern as a character array    private String pat;      // or as a string    // pattern provided as a string    public BoyerMoore(String pat) {        this.R = 256;        this.pat = pat;        // position of rightmost occurrence of c in the pattern        right = new int[R];        for (int c = 0; c < R; c++)            right[c] = -1;        for (int j = 0; j < pat.length(); j++)            right[pat.charAt(j)] = j;    }    // pattern provided as a character array    public BoyerMoore(char[] pattern, int R) {        this.R = R;        this.pattern = new char[pattern.length];        for (int j = 0; j < pattern.length; j++)            this.pattern[j] = pattern[j];        // position of rightmost occurrence of c in the pattern        right = new int[R];        for (int c = 0; c < R; c++)            right[c] = -1;        for (int j = 0; j < pattern.length; j++)            right[pattern[j]] = j;    }    // return offset of first match; N if no match    public int search(String txt) {        int M = pat.length();        int N = txt.length();        int skip;        for (int i = 0; i <= N - M; i += skip) {            skip = 0;            for (int j = M-1; j >= 0; j--) {                if (pat.charAt(j) != txt.charAt(i+j)) {                    skip = Math.max(1, j - right[txt.charAt(i+j)]);                    break;                }            }            if (skip == 0) return i;    // found        }        return N;                       // not found    }    // return offset of first match; N if no match    public int search(char[] text) {        int M = pattern.length;        int N = text.length;        int skip;        for (int i = 0; i <= N - M; i += skip) {            skip = 0;            for (int j = M-1; j >= 0; j--) {                if (pattern[j] != text[i+j]) {                    skip = Math.max(1, j - right[text[i+j]]);                    break;                }            }            if (skip == 0) return i;    // found        }        return N;                       // not found    }}

转载于:https://my.oschina.net/diwayou/blog/149804

你可能感兴趣的文章
kuangbin专题十二 HDU1078 FatMouse and Cheese )(dp + dfs 记忆化搜索)
查看>>
多行文本超出显示省略号
查看>>
转载~基于比较的排序算法的最优下界为什么是O(nlogn)
查看>>
在本机通过SQL远程操作数据库
查看>>
StringMVC返回字符串
查看>>
Windows完成端口网络模型
查看>>
CSS Hack整理
查看>>
leetcode 28. Implement strStr()
查看>>
nginx 服务器重启命令,关闭 (转)
查看>>
实用的正则表达式
查看>>
Hibernate中Criteria的完整用法
查看>>
LINUX创建用户的命令
查看>>
Spring MVC 学习总结(一)——MVC概要与环境配置 转载自【张果】博客
查看>>
POJ 2728 二分+最小生成树
查看>>
[LeetCode] Best Time to Buy and Sell Stock IV
查看>>
nuxt 2.0采坑计之 (引入静态文件css)
查看>>
I/O编程软件题(Java语言)
查看>>
时序逻辑、组合逻辑,我不再怕你了
查看>>
(三)mybatis之对Hibernate初了解
查看>>
git 分支( branch ) 的基本使用
查看>>