R - Regex: remove a sentence that starts with certain words, but only if it is the last sentence



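As the title says, I am trying to clean a large collection of short texts by removing sentences that start with certain words, but only when such a sentence is the last of more than one sentence in a text.

Say I want to cut the last sentence if it starts with "Jack is…".
Here is an example covering different cases: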

test_strings <- c("Jack is the tallest person.", 
"and Jack is the one who said, let there be fries.", 
"There are mirrors. And Jack is there to be suave.", 
"There are dogs. And jack is there to pat them. Very cool.", 
"Jack is your lumberjack. Jack, is super awesome.",
"Whereas Jack is, for the whole summer, sound asleep. Zzzz", 
"'Jack is so cool!' Jack is cool. Jack is also cold."
)

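This is the regex I currently have: "(?![A-Z'].+[\\.|'] )[Jj]ack,? is.+\\.$"

library(tidyverse)  # attaches purrr (map_chr) and stringr (str_replace)
map_chr(test_strings, ~str_replace(.x, "(?![A-Z'].+[\\.|'] )[Jj]ack,? is.+\\.$", "[TRIM]"))

This produces the following results: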

[1] "[TRIM]"                                                   
[2] "and [TRIM]"                                               
[3] "There are mirrors. And [TRIM]"                            
[4] "There are dogs. And [TRIM]"                               
[5] "Jack is your lumberjack. [TRIM]"                          
[6] "Whereas Jack is, for the whole summer, sound asleep. Zzzz"
[7] "'Jack is so cool!' Jack is cool. [TRIM]"  

## Basically my current regex is still too greedy. 
## No trimming should happen for the first 4 examples. 
## 5 - 7th examples are correct. 
## Explanations:
# (1) Wrong. Only one sentence; do not trim, but current regex trims it. 
# (2) Wrong. It is a sentence but does not start with 'Jack is'.
# (3) Wrong. Same situation as (2) -- the sentence starts with 'And' instead of 'Jack is'
# (4) Wrong. Same as (2) (3), but this time test with lowercase `jack`
# (5) Correct. Trim the second sentence as it is the last. Optional ',' removal is tested here.
# (6) Correct.
# (7) Correct. Sometimes texts do not begin with a letter.

Thanks for your help!

gsub("^(.*\\.)\\s*Jack,? is[^.]*\\.?$", "\\1 [TRIM]", test_strings, ignore.case = TRUE)
# [1] "Jack is the tallest person."                              
# [2] "and Jack is the one who said, let there be fries."        
# [3] "There are mirrors. And Jack is there to be suave."        
# [4] "There are dogs. And jack is there to pat them. Very cool."
# [5] "Jack is your lumberjack. [TRIM]"                          
# [6] "Whereas Jack is, for the whole summer, sound asleep. Zzzz"
# [7] "'Jack is so cool!' Jack is cool. [TRIM]"                  

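Breakdown:

  • ^(.*\.)\s*: since there must be at least one sentence before the one we trim, we need to find a preceding dot \.;
  • Jack,? is: taken from your regex;
  • [^.]*\.?$: zero or more non-dot characters, followed by an optional dot and the end of the string; to allow whitespace after the final period, change it to [^.]*\.?\s*$, although that does not seem necessary for your examples.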
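
The whitespace-tolerant variant from the last bullet could look like the sketch below. The vector padded is made up for illustration and is not part of the question; the only change to the pattern is the extra \\s* before $.

```r
# Hypothetical inputs (not from the question): text followed by trailing spaces.
padded <- c("Jack is your lumberjack. Jack, is super awesome.  ",
            "Jack is the tallest person. ")

# Same pattern as in the answer, plus \\s* before $ to tolerate trailing whitespace.
trimmed <- gsub("^(.*\\.)\\s*Jack,? is[^.]*\\.?\\s*$", "\\1 [TRIM]",
                padded, ignore.case = TRUE)
print(trimmed)
```

The first element has a preceding sentence, so its last sentence is replaced; the second is a single sentence and is left as it is, trailing space included.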

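You can match a dot (or use a character class such as [.!?] to match more characters), then match a last sentence that starts with Jack and ends with a dot (or, again, a character class to match more characters):

\.\K\h*[Jj]ack,? is[^.\n]*\.$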

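The pattern matches:

  • \.\K matches a . and then forgets what has been matched so far
  • \h*[Jj]ack,? is matches optional horizontal whitespace characters, then Jack or jack, an optional comma, and is
  • [^.\n]*\. matches zero or more characters other than . or a newline, followed by a .
  • $ asserts the end of the string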
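
To see what \K does in practice, one option is to extract the reported match instead of replacing it. This is a small sketch using base R's regexpr() and regmatches(); the variable names x, pattern and m are only illustrative.

```r
x <- "Jack is your lumberjack. Jack, is super awesome."
pattern <- "\\.\\K\\h*[Jj]ack,? is[^.\\n]*\\.$"

# With perl = TRUE the reported match starts after \K,
# so the dot that ends the previous sentence is not part of it.
m <- regmatches(x, regexpr(pattern, x, perl = TRUE))
print(m)
```

Because the preceding dot is excluded from the match, a replacement only needs to supply the space and the marker, which is why the replacement string used with sub() is " [TRIM]".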


Example code:

test_strings <- c("Jack is the tallest person.", 
"and Jack is the one who said, let there be fries.", 
"There are mirrors. And Jack is there to be suave.", 
"There are dogs. And jack is there to pat them. Very cool.", 
"Jack is your lumberjack. Jack, is super awesome.",
"Whereas Jack is, for the whole summer, sound asleep. Zzzz", 
"'Jack is so cool!' Jack is cool. Jack is also cold."
)
sub("\\.\\K\\h*[Jj]ack,? is[^.\\n]*\\.$", " [TRIM]", test_strings, perl=TRUE)

Output
[1] "Jack is the tallest person."                              
[2] "and Jack is the one who said, let there be fries."        
[3] "There are mirrors. And Jack is there to be suave."        
[4] "There are dogs. And jack is there to pat them. Very cool."
[5] "Jack is your lumberjack. [TRIM]"                          
[6] "Whereas Jack is, for the whole summer, sound asleep. Zzzz"
[7] "'Jack is so cool!' Jack is cool. [TRIM]"
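
If sentences can also end in ! or ?, the character class mentioned earlier can replace both dots. The following is only a sketch under that assumption; the vector mixed is invented for illustration and does not come from the question.

```r
# Hypothetical inputs: sentences may end in ., ! or ?
mixed <- c("Is that so? Jack is not sure!",
           "Jack is here! And Jack is there.",
           "Jack is the tallest person!")

# [.!?] stands in for \\. both before \\K and at the end of the last sentence.
out <- sub("[.!?]\\K\\h*[Jj]ack,? is[^.!?\\n]*[.!?]$", " [TRIM]", mixed, perl = TRUE)
print(out)
```

Only the first element is changed, to "Is that so? [TRIM]". In the second, the last sentence starts with "And", and the third is a single sentence.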
