R: split two words joined by a dot

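I have a large data frame containing news articles. I noticed that in some articles two words are joined by a dot, as in this example: The government.said it was important to quit. I am going to do some topic modeling, so I need every word separated.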

This is the code I am using to separate those words:

#String example
test <- c("i need.to separate the words connected by dots. however, I need.to keep having the dots separating sentences")
#Code to separate the words
test <- do.call(paste, as.list(strsplit(test, "\\.")[[1]]))
#This is what I get
> test
[1] "i need to separate the words connected by dots  however, I need to keep having the dots separating sentences"

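As you can see, this removes every period from the text, including the ones that end sentences. How can I get the following result instead?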

"i need to separate the words connected by dots. however, I need to keep having the dots separating sentences"

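Final note

My data frame consists of 1700 articles; all of the text is lowercase. The snippet above is only a small example of the problem I run into when trying to separate two words joined by a dot. Also, is there a way to use strsplit on a list?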

You can use

test <- c("i need.to separate the words connected by dots. however, I need.to keep having the dots separating sentences. Look at http://google.com for s.0.m.e more details.")
# Replace each dot that is in between word characters
gsub("\\b\\.\\b", " ", test, perl=TRUE)
# Replace each dot that is in between letters
gsub("(?<=\\p{L})\\.(?=\\p{L})", " ", test, perl=TRUE)
# Replace each dot that is in between word characters, but not in URLs
gsub("(?:ht|f)tps?://\\S*(*SKIP)(*F)|\\b\\.\\b", " ", test, perl=TRUE)

See the R demo online.

Output:

[1] "i need to separate the words connected by dots. however, I need to keep having the dots separating sentences. Look at http://google com for s 0 m e more details."
[1] "i need to separate the words connected by dots. however, I need to keep having the dots separating sentences. Look at http://google com for s.0.m e more details."
[1] "i need to separate the words connected by dots. however, I need to keep having the dots separating sentences. Look at http://google.com for s 0 m e more details."
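On the question of running this over a whole data frame or a list: gsub is vectorized over its input, so strsplit is not needed. The sketch below assumes a data frame named articles with a character column named text, and a list named article_list; all three names are hypothetical stand-ins for the real data.

```r
# Hypothetical data frame standing in for the 1700 articles
articles <- data.frame(
  text = c("the government.said it was important to quit.",
           "i need.to separate the words. however, i need.to keep the dots"),
  stringsAsFactors = FALSE
)

# Dot between two letters, written as an R string (backslashes doubled)
dot_between_letters <- "(?<=\\p{L})\\.(?=\\p{L})"

# gsub() is vectorized: one call processes every row of the column
articles$text <- gsub(dot_between_letters, " ", articles$text, perl = TRUE)

# For a list of character vectors, apply the same call to each element
article_list <- list(a = "one.two three.", b = c("four.five", "six. seven"))
cleaned_list <- lapply(article_list, function(x) {
  gsub(dot_between_letters, " ", x, perl = TRUE)
})

print(articles$text)
print(cleaned_list)
```

Sentence-ending dots survive because they are followed by a space or the end of the string rather than a letter.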

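Details

\b\.\b matches a dot enclosed by word boundaries, i.e. the dot must be immediately preceded and followed by a word character (a letter, digit or underscore).
(?<=\p{L})\.(?=\p{L}) matches a dot that is immediately preceded and followed by a letter ((?<=\p{L}) is a positive lookbehind and (?=\p{L}) is a positive lookahead).
(?:ht|f)tps?://\S*(*SKIP)(*F)|\b\.\b matches http/ftp or https/ftps, then ://, then any 0 or more non-whitespace characters, then discards that match and resumes searching from the position where the SKIP PCRE verb was encountered; anywhere else, the second alternative matches a dot enclosed by word boundaries.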
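To see the (*SKIP)(*F) behaviour in isolation, here is a minimal sketch on two made-up strings (the URLs and the variable names x and result are illustrative only). The URL alternative consumes each address and then fails on purpose, so the dots inside it are never offered to the second alternative.

```r
# Made-up inputs: each contains a URL plus two words joined by a dot
x <- c("visit https://example.org/a.b today.or tomorrow",
       "ftp://host.name/file.txt is.here")

# URL alternative is matched and then skipped; only other dots are replaced
pattern <- "(?:ht|f)tps?://\\S*(*SKIP)(*F)|\\b\\.\\b"
result <- gsub(pattern, " ", x, perl = TRUE)

print(result)
# [1] "visit https://example.org/a.b today or tomorrow"
# [2] "ftp://host.name/file.txt is here"
```

These backtracking control verbs are a PCRE feature, so the pattern only works with perl = TRUE.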
