Java: an alternative to String.contains that can return a similarity



I have three strings:

String a = "Hello, how are you doing?";
String b = "Can I as you something?";
String c = "Hello, how are you doing? Can I ask you something?";

My goal is to determine whether string c is the result of merging strings a and b. Note that string b contains a typo: "as" should be "ask".

The current logic is:

// merge is true only if both a and b occur verbatim in c
boolean merge = c.contains(a) && c.contains(b);

The problem is that if string c changes slightly during the merge, String.contains() no longer works, because it returns false when checking string b.

Is there another approach that would work for my example?

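I tried string similarity measures (Jaccard, etc.), but they did not work: the sizes of a, b and c can vary, so it is not easy (or even possible) to get a reliable similarity percentage.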

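As far as I could find, there is no built-in function that does this, but I came up with something that will hopefully meet your needs. You can obviously change it (I tried to keep it as clean as possible).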

Step one: we need a function that takes two strings and returns the number of differences between them. I came up with a very simple one:

public static int getNumberDifferences(String a, String b)
{
    int maxLength = Math.max(a.length(), b.length());
    int minLength = Math.min(a.length(), b.length());
    int result = maxLength - minLength; // the difference in length between the two
    for (int i = 0; i < minLength; i++)
    {
        if (a.charAt(i) != b.charAt(i)) // If the characters are different
            result++; // Add one to the result
    }
    return result;
}

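In short, we walk through the strings and add one to the difference count every time the characters at the same position differ. (Note that I start with the difference in length between the two strings, so size differences are counted as well.)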
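To see how this positional count behaves, here is a small sketch. The class name PositionalDiffDemo is only illustrative; the method body is the same logic as above, copied so the example compiles on its own. It shows that a character missing at the end costs 1, while a character missing at the start shifts every later position and costs more.

```java
final class PositionalDiffDemo {
    // Same positional comparison as getNumberDifferences above.
    static int getNumberDifferences(String a, String b) {
        int maxLength = Math.max(a.length(), b.length());
        int minLength = Math.min(a.length(), b.length());
        int result = maxLength - minLength; // length difference counts first
        for (int i = 0; i < minLength; i++) {
            if (a.charAt(i) != b.charAt(i)) {
                result++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Last character missing: only the length difference is counted.
        System.out.println(getNumberDifferences("ask", "as")); // prints 1
        // First character missing: 'a' vs 's' and 's' vs 'k' also differ.
        System.out.println(getNumberDifferences("ask", "sk")); // prints 3
    }
}
```

So the count is sensitive to where a character is dropped, which is worth keeping in mind when picking a threshold.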

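Step two: we need another function that takes the words of each text (as arrays) and returns the total number of differences it encounters. I came up with another super simple function for this: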

public static int getNumberDifferences(String[] a, String[] b)
{
    int result = 0;
    for (int i = 0; i < Math.min(a.length, b.length); i++)
    {
        result += getNumberDifferences(a[i], b[i]);
    }
    return result;
}

In this function we simply add up the differences between each pair of corresponding words.
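As a quick check against the strings from the question (with the "as" typo kept in b), the sketch below combines both functions. The class name WordDiffDemo is an assumption for illustration only. It also shows one consequence of looping up to Math.min(a.length, b.length): words beyond the shorter array are not counted at all.

```java
final class WordDiffDemo {
    // Character-level count, same logic as the first function above.
    static int getNumberDifferences(String a, String b) {
        int minLength = Math.min(a.length(), b.length());
        int result = Math.max(a.length(), b.length()) - minLength;
        for (int i = 0; i < minLength; i++) {
            if (a.charAt(i) != b.charAt(i)) result++;
        }
        return result;
    }

    // Word-level sum, same logic as the second function above.
    static int getNumberDifferences(String[] a, String[] b) {
        int result = 0;
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            result += getNumberDifferences(a[i], b[i]);
        }
        return result;
    }

    public static void main(String[] args) {
        String a = "Hello, how are you doing?";
        String b = "Can I as you something?"; // typo from the question
        String c = "Hello, how are you doing? Can I ask you something?";

        // Only "as" vs "ask" differs, by one character.
        System.out.println(getNumberDifferences((a + " " + b).split(" "), c.split(" "))); // prints 1

        // Extra words in the longer array are ignored entirely.
        System.out.println(getNumberDifferences(new String[] {"Hello,"}, c.split(" "))); // prints 0
    }
}
```

If missing or extra words should count as differences, the word-level function would need to add something for the leftover words, for example their lengths.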

Finally, we print the result:

public static void main(String[] args)
{
    String a = "Hello, how are you doing?";
    String b = "Can I ask you something?";
    String c = "Hello, how are you doing? Can I ask you something?";
    int differences = getNumberDifferences(
        (a + " " + b)     // Join the two strings with a space in the middle
            .split(" "),  // Split them to take every word
        c.split(" "));    // Split c as well
    System.out.println(differences);
}

So the final code is:

public class Main {
    public static void main(String[] args)
    {
        String a = "Hello, how are you doing?";
        String b = "Can I ask you something?";
        String c = "Hello, how are you doing? Can I ask you something?";
        int differences = getNumberDifferences(
            (a + " " + b)     // Join the two strings with a space in the middle
                .split(" "),  // Split them to take every word
            c.split(" "));    // Split c as well
        System.out.println(differences);
    }

    public static int getNumberDifferences(String[] a, String[] b)
    {
        int result = 0;
        for (int i = 0; i < Math.min(a.length, b.length); i++)
        {
            result += getNumberDifferences(a[i], b[i]);
        }
        return result;
    }

    public static int getNumberDifferences(String a, String b)
    {
        int maxLength = Math.max(a.length(), b.length());
        int minLength = Math.min(a.length(), b.length());
        int result = maxLength - minLength; // the difference in length between the two
        for (int i = 0; i < minLength; i++)
        {
            if (a.charAt(i) != b.charAt(i)) // If the characters are different
                result++; // Add one to the result
        }
        return result;
    }
}

Let me know if this helps :)

As was correctly pointed out in the comments, the strings should be compared using the Levenshtein distance.

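You want to compare two strings using a similarity percentage, so we can express that percentage as the ratio between the distance separating the strings and the length of the reference string. If we require 100% similarity, the strings must be exactly equal and the distance between them is 0. Conversely, at 0% similarity the strings are completely different, and the distance is about as large as the length of the reference string (or larger).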
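The relationship described above can be written as similarity = 1 - distance / length(reference), clamped to the range 0 to 1. The following is a minimal sketch of that formula, not code from the answer: the class name SimilaritySketch and the methods levenshtein and similarity are names chosen here for illustration, and the distance is a standard two-row Levenshtein implementation.

```java
final class SimilaritySketch {
    // Standard Levenshtein distance, keeping only two rows of the table.
    static int levenshtein(String x, String y) {
        int[] prev = new int[y.length() + 1];
        int[] curr = new int[y.length() + 1];
        for (int j = 0; j <= y.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of y takes j insertions
        }
        for (int i = 1; i <= x.length(); i++) {
            curr[0] = i; // turning the first i chars of x into "" takes i deletions
            for (int j = 1; j <= y.length(); j++) {
                int substitution = prev[j - 1] + (x.charAt(i - 1) == y.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(substitution, Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[y.length()];
    }

    // similarity = 1 - distance / reference length, never below 0.
    static double similarity(String reference, String tested) {
        if (reference.isEmpty()) {
            return tested.isEmpty() ? 1.0 : 0.0; // avoid dividing by zero
        }
        double ratio = levenshtein(reference, tested) / (double) reference.length();
        return Math.max(0.0, 1.0 - ratio);
    }

    public static void main(String[] args) {
        String reference = "Hello, how are you doing? Can I ask you something?";
        String merged = "Hello, how are you doing?" + " " + "Can I as you something?";
        // One missing character over a 50-character reference: roughly 0.98.
        System.out.println(similarity(reference, merged));
    }
}
```

With the strings from the question, the single missing "k" gives a distance of 1 against a 50-character reference, so a threshold such as 0.95 would accept the merge. The threshold value itself is an assumption to be tuned for the data at hand.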

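I named the percentage allowedDiscrepancy because that is more descriptive. (It is the inverse of similarity: 0.0 requires identical strings, 1.0 accepts anything.) My code has a distance method that computes the distance between the reference string and another string, and a compareWithDiscrepancy method that relates that distance to the allowed discrepancy. Try it, it works.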

import java.util.Arrays;

public class StringUtils {
    public static void main(String[] args) {
        final String a = "Hello, how are you doing?";
        final String b = "Can I as you something?";
        final String c = "Hello, how are you doing? Can I ass you something?";
        // allowedDiscrepancy = 1.0 (100%): the strings may be completely different.
        // These two strings differ only slightly, so this must return true.
        assertTrue(compareWithDiscrepancy(c, String.format("%s %s", a, b), 1.0));
        // allowedDiscrepancy = 0.0 (0%): the strings must be exactly equal.
        // These two strings differ slightly (more than 0), so this must return false.
        assertFalse(compareWithDiscrepancy(c, String.format("%s %s", a, b), 0.0));

        final String sameA = "Hello.";
        final String sameB = "How are you?";
        final String sameC = String.format("%s %s", sameA, sameB);
        // allowedDiscrepancy = 1.0 (100%): the strings may be completely different.
        // These two strings are exactly equal, so this must return true.
        assertTrue(compareWithDiscrepancy(sameC, String.format("%s %s", sameA, sameB), 1));
        // allowedDiscrepancy = 0.0 (0%): the strings must be exactly equal.
        // These two strings are exactly equal, so this must return true as well.
        assertTrue(compareWithDiscrepancy(sameC, String.format("%s %s", sameA, sameB), 0));

        final String differentA = "Part 1.";
        final String differentB = "Part 2.";
        final String differentC = "Absolutely different string";
        // allowedDiscrepancy = 1.0 (100%): the strings may be completely different.
        // These two strings are completely different, so this must return true.
        assertTrue(compareWithDiscrepancy(differentC, String.format("%s %s", differentA, differentB), 1));
        // allowedDiscrepancy = 0.0 (0%): the strings must be exactly equal.
        // These two strings are completely different, so this must return false.
        assertFalse(compareWithDiscrepancy(differentC, String.format("%s %s", differentA, differentB), 0));
        System.out.println("Done!");
    }

    public static boolean compareWithDiscrepancy(final String referenceString, final String testedString, double allowedDiscrepancy) {
        // Clamp the threshold to [0, 1].
        if (allowedDiscrepancy < 0) allowedDiscrepancy = 0;
        if (allowedDiscrepancy > 1) allowedDiscrepancy = 1;
        int distance = distance(referenceString, testedString);
        // Distance relative to the reference length, capped at 1.
        double realDiscrepancy = distance * 1.0 / referenceString.length();
        if (realDiscrepancy > 1) realDiscrepancy = 1;
        return allowedDiscrepancy >= realDiscrepancy;
    }

    // Levenshtein distance via dynamic programming over a full table.
    static int distance(String x, String y) {
        int[][] dp = new int[x.length() + 1][y.length() + 1];
        for (int i = 0; i <= x.length(); i++) {
            for (int j = 0; j <= y.length(); j++) {
                if (i == 0) {
                    dp[i][j] = j;
                } else if (j == 0) {
                    dp[i][j] = i;
                } else {
                    dp[i][j] = min(dp[i - 1][j - 1]
                            + cost(x.charAt(i - 1), y.charAt(j - 1)),
                        dp[i - 1][j] + 1,
                        dp[i][j - 1] + 1);
                }
            }
        }
        return dp[x.length()][y.length()];
    }

    public static int cost(char a, char b) {
        return a == b ? 0 : 1;
    }

    public static int min(int... numbers) {
        return Arrays.stream(numbers)
            .min().orElse(Integer.MAX_VALUE);
    }

    // The original does not show where assertTrue/assertFalse come from (probably JUnit);
    // these minimal stand-ins keep the class self-contained.
    private static void assertTrue(boolean condition) {
        if (!condition) throw new AssertionError("expected true");
    }

    private static void assertFalse(boolean condition) {
        assertTrue(!condition);
    }
}
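Comparing c against a + " " + b works when c is exactly the concatenation plus a few edits. If a and b should instead each be located somewhere inside c, as String.contains does, one possible variation is approximate substring matching: the same dynamic programming table, but with the first row set to zero so a match may start at any position of the text, and the minimum taken over the last row so it may end anywhere. This is a sketch of that standard technique, not part of the answer above; the names FuzzyContains, bestSubstringDistance, fuzzyContains and the maxEdits parameter are assumptions made for illustration.

```java
final class FuzzyContains {
    // Smallest edit distance between pattern and any substring of text.
    static int bestSubstringDistance(String text, String pattern) {
        int[] prev = new int[text.length() + 1]; // row 0 stays all zeros: a match may start anywhere
        int[] curr = new int[text.length() + 1];
        for (int i = 1; i <= pattern.length(); i++) {
            curr[0] = i; // i pattern characters against an empty piece of text
            for (int j = 1; j <= text.length(); j++) {
                int substitution = prev[j - 1]
                        + (pattern.charAt(i - 1) == text.charAt(j - 1) ? 0 : 1);
                curr[j] = Math.min(substitution, Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        int best = prev[0];
        for (int d : prev) {
            best = Math.min(best, d); // the match may end anywhere in the text
        }
        return best;
    }

    // Like String.contains, but tolerating up to maxEdits single-character edits.
    static boolean fuzzyContains(String text, String pattern, int maxEdits) {
        return bestSubstringDistance(text, pattern) <= maxEdits;
    }

    public static void main(String[] args) {
        String a = "Hello, how are you doing?";
        String b = "Can I as you something?"; // typo from the question
        String c = "Hello, how are you doing? Can I ask you something?";
        boolean merge = fuzzyContains(c, a, 1) && fuzzyContains(c, b, 1);
        System.out.println(merge); // prints true
    }
}
```

Because each pattern is measured against its best-matching piece of c rather than against all of c, the differing sizes of a, b and c matter less. The tolerance could also be made proportional to the pattern length instead of a fixed number of edits.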
