Function/regex to match parts of a string within a larger string and highlight the parts



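I am trying to build a search that matches parts of a search string within a larger string and highlights them. See the example below:

Original string: Since there is limited overhead space on the plane, I assure you, there will be no fee for checking the bags, I can go ahead and fill out all the checked baggage forms for you.

Search & highlight: no fee, I fill out the forms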

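Desired result: Since there is limited overhead space on the plane, I assure you, there will be <b>no fee</b> for checking the bags, <b>I</b> can go ahead and <b>fill out</b> all <b>the</b> checked baggage <b>forms</b> for you.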

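I can search for the complete search string at once using a substring search, or search for one word at a time, but neither produces the desired result. The trick is probably to search recursively, starting with the complete search string and gradually breaking it into smaller parts until the parts match. A few assumptions:

  • The search must be as greedy as possible, i.e. it matches larger parts of the search string before trying to match smaller parts or single words.
  • The search always moves forward after any match it finds.

I hope this makes sense. Can anyone point me in the right direction? I have searched this site but did not find anything similar to what I want.

Thanks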

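Let me know if this helps you. It does not use a regex to find the strings, just IndexOf.

It first returns the words to highlight as Tuples, each holding the start index and end index of a word.

It then highlights the text by placing a prefix and a suffix around each word (here: HTML tags).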

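// Requires: using System; using System.Collections.Generic; using System.Linq; using System.Text.RegularExpressions;
static void Main(string[] args)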
{
    var input = "Since there is limited overhead space on the plane, I assure you, there will be no fee for checking the bags, I can go ahead and fill out all the checked baggage forms for you";
    var searchExpression = "no fee, I fill out the forms";
    var highlightedInput = HighlightString(input, searchExpression, "<b>", "</b>");
    Console.WriteLine(highlightedInput);
    Console.ReadLine();
}
public static IEnumerable<Tuple<int, int>> GetHighlights(string input, string searchExpression)
{
    // \W+ splits on runs of non-word characters (spaces, punctuation).
    var splitIntoWordsRegex = new Regex(@"\W+");
    var words = splitIntoWordsRegex.Split(searchExpression);
    return GetHighlights(input, words);
}
public static IEnumerable<Tuple<int, int>> GetHighlights(string input, IEnumerable<string> searchExpression)
{
    var highlights = new List<Tuple<int, int>>();
    var lastMatchedIndex = 0;
    foreach (var word in searchExpression)
    {
        var indexOfWord = input.IndexOf(word, lastMatchedIndex, StringComparison.CurrentCulture);
        if (indexOfWord < 0)
            continue; // word not found after the last match: skip it rather than inserting at -1
        var lastIndexOfWord = indexOfWord + word.Length;
        highlights.Add(new Tuple<int, int>(indexOfWord, lastIndexOfWord));
        lastMatchedIndex = lastIndexOfWord;
    }
    return highlights;
}
public static string HighlightString(string input, string searchExpression, string highlightPrefix, string highlightSufix)
{
    var highlights = GetHighlights(input, searchExpression).ToList();
    var output = input;
    for (int i = 0, j = highlights.Count; i<j; i++)
    {
        int diffInputOutput = output.Length - input.Length;
        output = output.Insert(highlights[i].Item1 + diffInputOutput, highlightPrefix);
        diffInputOutput = output.Length - input.Length;
        output = output.Insert(highlights[i].Item2 + diffInputOutput, highlightSufix);
    }
    return output;
}
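
The code above wraps every word separately, so "no fee" comes out as two highlights. As a minimal sketch of one way to get closer to the result asked for in the question, the highlights could be merged when only whitespace separates them. `HighlightMerger` and `MergeAdjacent` are hypothetical names introduced here for illustration; they are not part of the answer's code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class HighlightMerger
{
    // Hypothetical helper: joins highlights that are separated only by whitespace,
    // so that (start, end) ranges for "no" and "fee" become one range for "no fee".
    public static List<Tuple<int, int>> MergeAdjacent(string input, IEnumerable<Tuple<int, int>> highlights)
    {
        var merged = new List<Tuple<int, int>>();
        foreach (var current in highlights.OrderBy(h => h.Item1))
        {
            var last = merged.LastOrDefault();
            // Merge only when the gap between the two ranges is empty or whitespace.
            if (last != null && current.Item1 >= last.Item2 &&
                string.IsNullOrWhiteSpace(input.Substring(last.Item2, current.Item1 - last.Item2)))
            {
                merged[merged.Count - 1] = Tuple.Create(last.Item1, current.Item2);
            }
            else
            {
                merged.Add(current);
            }
        }
        return merged;
    }
}
```

The merged list could then be passed through the same prefix/suffix insertion loop as in HighlightString.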

=========================================

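To narrow the highlights down to the smallest span between the min and max index, you can use the code below. It is not the prettiest piece of work, but it does the job.

It gets all the indexes of each word (thanks to "Finding all positions of substring in a larger string in C#"), adds them to highlights, and then manipulates that collection to keep the matches as close together as you need.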

public static IEnumerable<Tuple<int, int>> GetHighlights(string input, IEnumerable<string> searchExpression)
{
    var highlights = new List<Tuple<string, int, int>>();
    // Finds all the indexes for 
    // all the words found.
    foreach (var word in searchExpression)
    {
        var allIndexesOfWord = AllIndexesOf(input, word, StringComparison.InvariantCultureIgnoreCase);
        highlights.AddRange(allIndexesOfWord.Select(index => new Tuple<string, int, int>(word, index, index + word.Length)));
    }
    // Reduce the scope of the highlights in order to 
    // keep the indexes as together as possible.
    var firstWord = searchExpression.First();
    var firstWordIndex = highlights.IndexOf(highlights.Last(x => String.Equals(x.Item1, firstWord)));
    var lastWord = searchExpression.Last();
    var lastWordIndex = highlights.IndexOf(highlights.Last(x => String.Equals(x.Item1, lastWord)));
    var sanitizedHighlights = highlights.SkipWhile((x, i) => i < firstWordIndex);
    sanitizedHighlights = sanitizedHighlights.TakeWhile((x, i) => i <= lastWordIndex);
    highlights = new List<Tuple<string, int, int>>();
    foreach (var word in searchExpression.Reverse())
    {
        var lastOccurence = sanitizedHighlights.Last((x) => String.Equals(x.Item1, word));
        sanitizedHighlights = sanitizedHighlights.TakeWhile(x => x.Item3 < lastOccurence.Item2);
        highlights.Add(lastOccurence);
    }
    highlights.Reverse();
    return highlights.Select(x => new Tuple<int, int>(x.Item2, x.Item3));
}
public static List<int> AllIndexesOf(string str, string value, StringComparison comparison)
{
    if (String.IsNullOrEmpty(value))
        throw new ArgumentException("the string to find may not be empty", "value");
    List<int> indexes = new List<int>();
    for (int index = 0; ; index += value.Length)
    {
        index = str.IndexOf(value, index, comparison);
        if (index == -1)
            return indexes;
        indexes.Add(index);
    }
}

Using this code and the text:

"No, about the fee, since there is limited overhead space on the plane, I assure you, there will be no fee for checking the bags, I can go ahead and fill out all the checked baggage forms for you."

I get the following result:

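No, about the fee, since there is limited overhead space on the plane, I assure you, there will be <b>no</b> <b>fee</b> for checking the bags, <b>I</b> can go ahead and <b>fill</b> <b>out</b> all <b>the</b> checked baggage <b>forms</b> for you.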

==================================================================

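EDIT 2: a regex approach, using the lessons learned from the previous attempts.
Note that no highlights are returned if any word in the expression cannot be found.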

public static IEnumerable<Tuple<int,int>> GetHighlights(string expression, string search)
{
    var highlights = new List<Tuple<string, int, int>>();
    // Each token is either a run of word characters or a run of non-whitespace (punctuation).
    var wordsToHighlight = new Regex(@"(\w+|[^\s]+)").
        Matches(search).
        Cast<Match>().
        Select(x => x.Value);
    foreach(var wordToHighlight in wordsToHighlight)
    {
        Regex findMatchRegex = null;
        if (new Regex(@"\W").IsMatch(wordToHighlight))
            // is punctuation: escape it so characters like '.' or '?' are matched literally
            findMatchRegex = new Regex(String.Format(@"({0})", Regex.Escape(wordToHighlight)), RegexOptions.IgnoreCase);
        else
            // is a word: the lookarounds require a whole-word match
            findMatchRegex = new Regex(String.Format(@"((?<!\w){0}(?!\w))", wordToHighlight), RegexOptions.IgnoreCase);
        var matches = findMatchRegex.Matches(expression).Cast<Match>().Select(match => new Tuple<string, int, int>(wordToHighlight, match.Index, match.Index + wordToHighlight.Length));
        if (matches.Any())
            highlights.AddRange(matches);
        else
            return new List<Tuple<int, int>>();
    }
    // Reduce the scope of the highlights in order to 
    // keep the indexes as together as possible.
    var firstWord = wordsToHighlight.First();
    var firstWordIndex = highlights.IndexOf(highlights.Last(x => String.Equals(x.Item1, firstWord)));
    var lastWord = wordsToHighlight.Last();
    var lastWordIndex = highlights.IndexOf(highlights.Last(x => String.Equals(x.Item1, lastWord)));
    var sanitizedHighlights = highlights.SkipWhile((x, i) => i < firstWordIndex);
    sanitizedHighlights = sanitizedHighlights.TakeWhile((x, i) => i <= lastWordIndex);
    highlights = new List<Tuple<string, int, int>>();
    foreach (var word in wordsToHighlight.Reverse())
    {
        var lastOccurence = sanitizedHighlights.Last((x) => String.Equals(x.Item1, word));
        sanitizedHighlights = sanitizedHighlights.TakeWhile(x => x.Item3 < lastOccurence.Item2);
        highlights.Add(lastOccurence);
    }
    highlights.Reverse();
    return highlights.Select(x => new Tuple<int, int>(x.Item2, x.Item3));
}

It should also be noted that this approach now takes punctuation into account. The results are below.

Input:
No, about the fee, since there is limited overhead space on the plane, I assure you, there will be no fee for checking the bags, I can go ahead and fill out all the checked baggage forms for you.

Search:
no fee, I fill out the forms

Output:
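No, about the fee, since there is limited overhead space on the plane, I assure you, there will be <b>no</b> <b>fee</b> for checking the bags<b>,</b> <b>I</b> can go ahead and <b>fill</b> <b>out</b> all <b>the</b> checked baggage <b>forms</b> for you.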

Input:
When First Class Glass receives your call, we will assign a repair person to visit you to assist.

Search:
we assign a repair person

Output:
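When First Class Glass receives your call, <b>we</b> will <b>assign</b> <b>a</b> <b>repair</b> <b>person</b> to visit you to assist.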
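
For comparison, the greedy rule from the question (try the longest run of search words first, then shorter runs, always moving forward) can be sketched directly. This is only an illustration under stated assumptions: `GreedyHighlighter` and `Highlight` are hypothetical names, punctuation in the search string is ignored, words in a run may be separated by any non-word characters in the input, and a search word that cannot be found is simply skipped.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;

public static class GreedyHighlighter
{
    // Hypothetical helper: highlights the longest possible runs of consecutive
    // search words, never looking back before the end of the previous match.
    public static string Highlight(string input, string search, string prefix, string suffix)
    {
        var words = Regex.Matches(search, @"\w+").Cast<Match>().Select(m => m.Value).ToList();
        var output = new StringBuilder();
        int position = 0; // matching resumes from this index in the input
        int first = 0;    // index of the first search word not yet handled

        while (first < words.Count)
        {
            Match found = null;
            int taken = 0;
            // Try the longest remaining run of words first, then shorter runs.
            for (int count = words.Count - first; count >= 1 && found == null; count--)
            {
                var run = words.Skip(first).Take(count).Select(w => Regex.Escape(w));
                var pattern = @"\b" + string.Join(@"\W+", run) + @"\b";
                var match = new Regex(pattern, RegexOptions.IgnoreCase).Match(input, position);
                if (match.Success)
                {
                    found = match;
                    taken = count;
                }
            }
            if (found == null)
            {
                first++; // assumption: a word that is not found is skipped
                continue;
            }
            // Copy the text before the match unchanged, then the wrapped match.
            output.Append(input, position, found.Index - position)
                  .Append(prefix).Append(found.Value).Append(suffix);
            position = found.Index + found.Length;
            first += taken;
        }
        return output.Append(input, position, input.Length - position).ToString();
    }
}
```

Called with the question's original string, the search "no fee, I fill out the forms" and the "<b>"/"</b>" pair, this sketch wraps "no fee", "I", "fill out", "the" and "forms". One design caveat: because the longest run wins wherever it occurs after the current position, a long run found late in the text can skip over earlier single-word matches.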
