Counting word combinations in a text abstract using Regex in VB.NET



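I have a dictionary of medical terms. The terms can be combinations of words, such as [Breast Cancer, Prostate Cancer], single words, such as [Breast, Prostate, Cancer], or even longer phrases like [Pancreatic Beta Cell Tumor].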

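I need to count the dictionary words that appear in article abstracts without counting anything twice. So if I count Breast Cancer as 1, I should not also count Breast and Cancer separately as additional cases when they appear together.

I pull the words from an MS SQL database, where I added a column that counts the whitespace between words. I sort by that column from largest to smallest, and then by word.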

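What I need to do is, once I have counted a word, replace it with a blank or "" so that it cannot be counted again on its own. I'm not worried about the abstract text itself; it can always be refreshed afterwards.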

My VB.NET code, in a .NET Web API, is:

While reader.Read() ' pulling words from the database
    word = reader("Word").ToString()
    Dim regex As Regex
    If word.StartsWith("ER") Then
        regex = New Regex("\s" + word + "\s", RegexOptions.None)
    Else
        regex = New Regex("\s" + word + "\s", RegexOptions.IgnoreCase)
    End If
    ' this assignment replaces the pattern chosen above
    regex = New Regex("\b(" + word + ")\b", RegexOptions.IgnoreCase)
    Dim match As Match = regex.Match(abstractText)
    If match.Success Then
        TotalAbstractCount += regex.Matches(abstractText).Count
        abstractCount += 1
        abstractWords.Add(word)
        abstractWordsCount.Add(word + " (" + count.ToString + ")")
        ' new code added to replace word/word string with blank
        Dim regex2 = New Regex(word, RegexOptions.IgnoreCase)
        abstractText = regex2.Replace(abstractText, " ")
    End If
    match = match.NextMatch()
End While

With this code, is there somewhere I can update the match to an empty string, or do I need to build a loop?

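Update: I just added the new regex2 code, but because it creates a new regex for every word, it seems to slow the whole process down. The end user is waiting for the results in real time. I haven't timed the whole process, but it appears to have gone from 1-1.5 seconds to 3-4 seconds.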

Also, if there is a faster way to do this in MS SQL Server 2016, I'm open to that.

Here's my (relatively untested) answer.

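The algorithm is:

  • Get the phrases
  • Get the text
  • Clean the text, turning every run of non-word characters into a single space
  • Loop over the phrases, find " phrase " (length 8) and replace it with spaces "       " (length 7); the change in length is the number of occurrences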
Imports System.Text.RegularExpressions
Imports System.Text
Imports System.Collections.Generic
Imports System
Imports System.Linq

Public Module Module1
    Public Sub Main()
        Dim phrases() As String = {"brEast", "bREast canCer", "caNCer"}
        Dim text As String = "Breast- cAncer is Cancer!! of .the breAst. we need to keep aBREAST of it as it is CANCERous. Breast Cancer is bad cancer"
        Dim cleaner = New Regex("\W+")
        'remove all non word characters, replacing them with a single space
        Dim cleanText = cleaner.Replace(text, " ")
        'put the text into a stringbuilder for much faster string manipulation
        'add space at the start and end - spaces delimit words for us
        Dim textSb As New StringBuilder(" " & cleanText.ToLower() & " ")
        'something to hold the counts of phrases
        Dim counts = New Dictionary(Of String, Integer)
        'Sort phrases from long to short, prevents "breast" ruining "breast cancer"
        Dim orderedPhrases = phrases.OrderByDescending(Function(p As String) p.Length)
        For Each phrase As String In orderedPhrases
            'capture the old length - we'll need this
            Dim prevLen As Integer = textSb.Length
            'replace all occurrences of the phrase in the text.
            'tack a space onto either end of the phrase to find whole words only
            'because the replacement str is 1 shorter than the find
            'the count of replacements is simply the change in length
            'also we need the replacement string to be spaces
            'because we rely on spaces at the start and end of a
            'find string to delimit a phrase. removing all spaces
            'could break our logic. If we replace with nothing:
            '"type 1 breast cancer cancer is bad" -> "type 1cancer is bad"
            'then we cannot now find " cancer "
            Dim findPhrase = " " & phrase.ToLower() & " "
            'note the Char literal " "c, which also compiles under Option Strict On
            Dim replPhrase = New String(" "c, findPhrase.Length - 1)
            textSb.Replace(findPhrase, replPhrase)
            'store the count of occurrences of this phrase
            counts(phrase) = prevLen - textSb.Length
        Next phrase
        'let's print our counts as proof it works
        For Each key As String In counts.Keys
            Console.Out.WriteLine(key & " count is " & counts(key))
        Next key
    End Sub
End Module

I don't have a large dataset to try it on, but I ran the whole method 100,000 times in 2.5 seconds.

https://dotnetfiddle.net/xxlpde

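Note: if you're doing this over a lot of texts (for example, in a loop over a collection of texts), you can pre-sort and pre-process the phrases once: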

Dim orderedPhrases = phrases.OrderByDescending(Function(p As String) p.Length).Select(Function(p As String) " " & p.ToLower() & " ").ToArray()

Naturally, findPhrase then just becomes phrase, because the spaces have already been added and ToLower() has already been applied.
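To make the note above concrete, here is a minimal, untested sketch of the same idea wrapped into a reusable function, so the phrases are prepared once and reused for every text. The names PhraseCounter, PreparePhrases and CountPhrases are hypothetical and not from the original code. The inner Do loop is also an addition, not part of the answer's code: a single StringBuilder.Replace pass does not match overlapping occurrences, so in a text like "cancer cancer" the two hits share one delimiting space and only the first is replaced. Repeating the replacement until the length stops changing is one way to pick up the rest.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Linq
Imports System.Text
Imports System.Text.RegularExpressions

Public Module PhraseCounter
    ' Built once and reused for every text.
    Private ReadOnly Cleaner As New Regex("\W+")

    ' Hypothetical helper: lower-case each phrase, pad it with spaces, and
    ' sort longest first so "breast cancer" is consumed before "breast".
    Public Function PreparePhrases(phrases As IEnumerable(Of String)) As String()
        Dim padded = phrases.Select(Function(p) " " & p.ToLower() & " ")
        Return padded.OrderByDescending(Function(p) p.Length).ToArray()
    End Function

    ' Hypothetical helper: count every prepared phrase in a single text.
    Public Function CountPhrases(text As String, prepared As String()) As Dictionary(Of String, Integer)
        Dim sb As New StringBuilder(" " & Cleaner.Replace(text, " ").ToLower() & " ")
        Dim counts As New Dictionary(Of String, Integer)
        For Each findPhrase As String In prepared
            Dim replPhrase As New String(" "c, findPhrase.Length - 1)
            Dim startLen As Integer = sb.Length
            Dim passLen As Integer
            ' Assumption (not in the original): repeat until nothing changes,
            ' so back-to-back repeats sharing one space are counted too.
            Do
                passLen = sb.Length
                sb.Replace(findPhrase, replPhrase)
            Loop While sb.Length < passLen
            ' Each replacement shortens the text by exactly one character.
            counts(findPhrase.Trim()) = startLen - sb.Length
        Next
        Return counts
    End Function

    Public Sub Main()
        Dim prepared = PreparePhrases({"breast", "breast cancer", "cancer"})
        Dim abstracts = {"Breast cancer is cancer of the breast.", "Cancer, cancer, cancer!"}
        For Each abstractText As String In abstracts
            For Each pair In CountPhrases(abstractText, prepared)
                Console.WriteLine(pair.Key & ": " & pair.Value)
            Next
        Next
    End Sub
End Module
```

The second sample text is there to exercise the back-to-back case. In a Web API the prepared array could be cached between requests, assuming the dictionary of terms changes rarely.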
