java 字符串对齐_最佳字符串对齐的Java实现
java 字符串對齊
有一陣子,我使用了Levenshtein distance的Apache Commons lang StringUtils實(shí)現(xiàn)。 它實(shí)現(xiàn)了一些眾所周知的技巧,通過僅掛接到兩個(gè)數(shù)組而不是為備忘錄表分配巨大的nxm表來使用較少的內(nèi)存。 它還僅檢查寬度為2 * k +1的“條帶”,其中k是最大編輯次數(shù)。
在levenshtein的大多數(shù)實(shí)際用法中,您只關(guān)心一個(gè)字符串是否在另一個(gè)字符串的少量編輯(1、2、3)之內(nèi)。 這避免了使levenstein變得“昂貴”的大部分n * m計(jì)算。 我們發(fā)現(xiàn),在ak <= 3的情況下,具有這些技巧的levenshtein的速度比Jaro-Winkler distance快,后者是一種近似編輯距離計(jì)算,被創(chuàng)建為更快的近似值(這有很多原因)。
不幸的是,Apache Commons Lang實(shí)現(xiàn)僅計(jì)算Levenshtein,而不計(jì)算可能更有用的Damerau-Levenshtein距離 。 Levenshtein定義了編輯操作的插入,刪除和替換。 Damerau變體將* transposition *添加到列表中,這對于我使用編輯距離的大多數(shù)位置都非常有用。 不幸的是,DL距離不是真正的度量標(biāo)準(zhǔn),因?yàn)樗豢紤]三角形不等式,但是有很多應(yīng)用不受此影響。 從該維基百科頁面可以看到,“最佳字符串對齊”和DL距離之間經(jīng)常會(huì)混淆。 實(shí)際上,OSA是一種更簡單的算法,并且需要較少的簿記,因此運(yùn)行時(shí)間可能略微更快。
我找不到任何使用我在Apache Commons Lang中看到的內(nèi)存技巧和“條帶化”技巧的OSA或DL實(shí)現(xiàn)。 因此,我使用這些技巧實(shí)現(xiàn)了自己的OSA。 在某些時(shí)候,我還將使用技巧來實(shí)現(xiàn)DL,并查看性能差異是什么:
這是Java中的OSA。 它是公共領(lǐng)域; 隨意使用。 單元測試如下。 唯一的依賴關(guān)系是Guava-,但它只是前提條件類和文檔注釋,因此如果您愿意,可以輕松刪除該依賴關(guān)系:
package com.github.steveash.util;import static com.google.common.base.Preconditions.checkArgument; import static com.google.common.base.Preconditions.checkNotNull; import static com.google.common.primitives.Shorts.checkedCast; import static java.lang.Math.abs; import static java.lang.Math.max;import java.util.Arrays;import com.google.common.annotations.VisibleForTesting;/*** Implementation of the OSA which is similar to the Damerau-Levenshtein in that it allows for transpositions to* count as a single edit distance, but is not a true metric and can over-estimate the cost because it disallows* substrings to edited more than once. See wikipedia for more discussion on OSA vs DL* <p/>* See Algorithms on Strings, Trees and Sequences by Dan Gusfield for more information.* <p/>* This also has a set of local buffer implementations to avoid allocating new buffers each time, which might be* a premature optimization* <p/>* @author Steve Ash*/ public class OptimalStringAlignment {private static final int threadLocalBufferSize = 64;private static final ThreadLocal<short[]> costLocal = new ThreadLocal<short[]>() {@Overrideprotected short[] initialValue() {return new short[threadLocalBufferSize];}};private static final ThreadLocal<short[]> back1Local = new ThreadLocal<short[]>() {@Overrideprotected short[] initialValue() {return new short[threadLocalBufferSize];}};private static final ThreadLocal<short[]> back2Local = new ThreadLocal<short[]>() {@Overrideprotected short[] initialValue() {return new short[threadLocalBufferSize];}};public static int editDistance(CharSequence s, CharSequence t, int threshold) {checkNotNull(s, "cannot measure null strings");checkNotNull(t, "cannot measure null strings");checkArgument(threshold >= 0, "Threshold must not be negative");checkArgument(s.length() < Short.MAX_VALUE, "Cannot take edit distance of strings longer than 32k chars");checkArgument(t.length() < Short.MAX_VALUE, "Cannot take edit distance of strings longer than 32k chars");if (s.length() + 1 > threadLocalBufferSize || t.length() + 1 > threadLocalBufferSize)return editDistanceWithNewBuffers(s, t, checkedCast(threshold));short[] cost = costLocal.get();short[] back1 = back1Local.get();short[] back2 = back2Local.get();return editDistanceWithBuffers(s, t, checkedCast(threshold), back2, back1, cost);}@VisibleForTestingstatic int editDistanceWithNewBuffers(CharSequence s, CharSequence t, short threshold) {int slen = s.length();short[] back1 = new short[slen + 1]; // "up 1" row in tableshort[] back2 = new short[slen + 1]; // "up 2" row in tableshort[] cost = new short[slen + 1]; // "current cost"return editDistanceWithBuffers(s, t, threshold, back2, back1, cost);}private static int editDistanceWithBuffers(CharSequence s, CharSequence t, short threshold,short[] back2, short[] back1, short[] cost) {short slen = (short) s.length();short tlen = (short) t.length();// if one string is empty, the edit distance is necessarily the length of the otherif (slen == 0) {return tlen <= threshold ? tlen : -1;} else if (tlen == 0) {return slen <= threshold ? slen : -1;}// if lengths are different > k, then can't be within edit distanceif (abs(slen - tlen) > threshold)return -1;if (slen > tlen) {// swap the two strings to consume less memoryCharSequence tmp = s;s = t;t = tmp;slen = tlen;tlen = (short) t.length();}initMemoiseTables(threshold, back2, back1, cost, slen);for (short j = 1; j <= tlen; j++) {cost[0] = j; // j is the cost of inserting this many characters// stripe boundsint min = max(1, j - threshold);int max = min(slen, (short) (j + threshold));// at this iteration the left most entry is "too much" so reset itif (min > 1) {cost[min - 1] = Short.MAX_VALUE;}iterateOverStripe(s, t, j, cost, back1, back2, min, max);// swap our cost arrays to move on to the next "row"short[] tempCost = back2;back2 = back1;back1 = cost;cost = tempCost;}// after exit, the current cost is in back1// if back1[slen] > k then we exceeded, so return -1if (back1[slen] > threshold) {return -1;}return back1[slen];}private static void iterateOverStripe(CharSequence s, CharSequence t, short j,short[] cost, short[] back1, short[] back2, int min, int max) {// iterates over the stripefor (int i = min; i <= max; i++) {if (s.charAt(i - 1) == t.charAt(j - 1)) {cost[i] = back1[i - 1];} else {cost[i] = (short) (1 + min(cost[i - 1], back1[i], back1[i - 1]));}if (i >= 2 && j >= 2) {// possible transposition to check forif ((s.charAt(i - 2) == t.charAt(j - 1)) &&s.charAt(i - 1) == t.charAt(j - 2)) {cost[i] = min(cost[i], (short) (back2[i - 2] + 1));}}}}private static void initMemoiseTables(short threshold, short[] back2, short[] back1, short[] cost, short slen) {// initial "starting" values for inserting all the lettersshort boundary = (short) (min(slen, threshold) + 1);for (short i = 0; i < boundary; i++) {back1[i] = i;back2[i] = i;}// need to make sure that we don't read a default value when looking "up"Arrays.fill(back1, boundary, slen + 1, Short.MAX_VALUE);Arrays.fill(back2, boundary, slen + 1, Short.MAX_VALUE);Arrays.fill(cost, 0, slen + 1, Short.MAX_VALUE);}private static short min(short a, short b) {return (a <= b ? a : b);}private static short min(short a, short b, short c) {return min(a, min(b, c));} }import org.junit.Testimport static com.github.steveash.util.OptimalStringAlignment.editDistance/*** @author Steve Ash*/ class OptimalStringAlignmentTest {@Testpublic void shouldBeZeroForEqualStrings() throws Exception {assert 0 == editDistance("steve", "steve", 1)assert 0 == editDistance("steve", "steve", 0)assert 0 == editDistance("steve", "steve", 2)assert 0 == editDistance("steve", "steve", 100)assert 0 == editDistance("s", "s", 1)assert 0 == editDistance("s", "s", 0)assert 0 == editDistance("s", "s", 2)assert 0 == editDistance("s", "s", 100)assert 0 == editDistance("", "", 0)assert 0 == editDistance("", "", 1)assert 0 == editDistance("", "", 100)}@Testpublic void shouldBeOneForSingleOperation() throws Exception {def a = "steve";for (int i = 0; i < 5; i++) {assertOneOp(new StringBuilder(a).insert(i, 'f'), a)assertOneOp(new StringBuilder(a).deleteCharAt(i), a)def sb = new StringBuilder(a)sb.setCharAt(i, 'x' as char);assertOneOp(sb, a)if (i > 1) {sb = new StringBuilder(a)char t = sb.charAt(i - 1)sb.setCharAt(i - 1, sb.charAt(i))sb.setCharAt(i, t)println "comparing " + sb.toString() + " -> " + aassertOneOp(sb, a)}}}@Testpublic void shouldCountTransposeAsOne() throws Exception {assert 3 == editDistance("xxsteve", "steev", 4)assert 3 == editDistance("xxsteve", "steev", 3)assert 3 == editDistance("steev", "xxsteve", 4)assert 3 == editDistance("steev", "xxsteve", 3)assert -1 == editDistance("steev", "xxsteve", 2)assert 4 == editDistance("xxtseve", "steev", 4)assert 5 == editDistance("xxtsevezx", "steevxz", 5)assert 6 == editDistance("xxtsevezx", "steevxzpp", 6)assert 7 == editDistance("xxtsfevezx", "steevxzpp", 7)assert 4 == editDistance("xxtsf", "st", 7)assert 4 == editDistance("evezx", "eevxzpp", 7)assert 7 == editDistance("xxtsfevezx", "steevxzpp", 7)}@Testpublic void shouldCountLeadingCharacterTranspositionsAsOne() throws Exception {assert 1 == editDistance("rosa", "orsa", 2)}private void assertOneOp(CharSequence a, CharSequence b) {assert 1 == editDistance(a, b, 1)assert 1 == editDistance(b, a, 1)assert 1 == editDistance(a, b, 2)assert 1 == editDistance(b, a, 2)}@Testpublic void shouldShortCutWhenSpecialCase() throws Exception {assert 1 == editDistance("s", "", 1)assert 1 == editDistance("", "s", 1)assert -1 == editDistance("s", "", 0)assert -1 == editDistance("", "s", 0)assert -1 == editDistance("st", "", 1)assert -1 == editDistance("", "st", 1)assert -1 == editDistance("steve", "ste", 0)assert -1 == editDistance("ste", "steve", 0)assert -1 == editDistance("stev", "steve", 0)assert -1 == editDistance("ste", "steve", 1)assert -1 == editDistance("steve", "ste", 1)assert 1 == editDistance("steve", "stev", 1)assert 1 == editDistance("stev", "steve", 1)} } 參考:來自我們的JCG合作伙伴史蒂夫·阿什(Steve Ash)的Java實(shí)現(xiàn)最佳字符串對齊 ,該博客來自Many Cups of Coffee博客。翻譯自: https://www.javacodegeeks.com/2013/11/java-implementation-of-optimal-string-alignment.html
java 字符串對齊
總結(jié)
以上是生活随笔為你收集整理的java 字符串对齐_最佳字符串对齐的Java实现的全部內(nèi)容,希望文章能夠幫你解決所遇到的問題。
- 上一篇: 戏剧性拉满 演员刘金怒摔iPhone后仍
- 下一篇: 腾讯大股东 Prosus 及南非报业集团