r - How to extract the last 3 elements of a sentence for each row of a data frame



I have the following data frame:

df <- structure(list(matrix.unlist.all_dates...nrow...230..byrow...T. = c(
"Willem F. Duisenberg, President of the European Central Bank, Christian Noyer,  Vice-President of the European Central Bank,  Frankfurt am Main,  14 December 2000", 
"Willem F. Duisenberg,  President of the European Central  Bank,  Christian Noyer,  Vice-President of the European Central Bank,  Frankfurt am Main,  2 November 2000", 
"Willem F. Duisenberg,  President of the European Central  Bank,  Christian Noyer,  Vice-President of the European Central Bank,  Paris,  19 October 2000", 
"Willem F. Duisenberg,  President of the European Central  Bank,  Christian Noyer,  Vice-President of the European Central Bank,  Frankfurt am Main,  5 October 2000", 
"Willem F. Duisenberg,  President of the European Central Bank,  Christian Noyer,  Vice-President of the European Central Bank,  Frankfurt am Main,  14 September 2000", 
"Willem F. Duisenberg,  President of the European Central Bank,  Lucas Papademos,  Vice-President of the European Central Bank,  Frankfurt,  10 July 2003.", 
"Willem F. Duisenberg,  President of the European Central Bank,  Lucas Papademos,  Vice-President of the European Central Bank,    Frankfurt,  5 June 2003."
)), class = "data.frame", row.names = c(NA, -7L))

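As you can see, the text in every row follows a clear pattern: the last three words are the date. I only want to extract these three "words" (essentially the date).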

How would you do this? I tried substr, but it did not work because each row has a different length.

You can use a regular expression to extract the date.

# Backslashes must be doubled inside an R string literal
gsub(".* (\\d+ \\w+ \\d+)\\.?$", "\\1", df[, 1])

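The pattern (\d+ \w+ \d+) matches:

  1. one or more digits (\d+), followed by
  2. a space, followed by
  3. one or more word characters (\w+), here the month name, followed by
  4. a space, followed by
  5. one or more digits (\d+)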

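The parentheses therefore capture the date. The rest of the pattern (.* before the date and an optional trailing period after it) consumes everything else, so the whole string is replaced by the date alone: \1 refers to whatever was matched inside the parentheses.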
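As a quick check of the capture-group idea, here is a minimal sketch on two shortened strings. The vector x and its contents are made up for illustration (they only mimic the shape of the rows in the question), and sub is used instead of gsub because the pattern can match each string at most once.

```r
# Hypothetical sample strings, shortened from the rows in the question
x <- c(
  "Vice-President of the European Central Bank,  Frankfurt am Main,  14 December 2000",
  "Vice-President of the European Central Bank,    Frankfurt,  5 June 2003."
)

# Backslashes are doubled because the pattern is written as an R string literal
pattern <- ".* (\\d+ \\w+ \\d+)\\.?$"

# Replace the whole string with the captured group (the date)
dates <- sub(pattern, "\\1", x)
print(dates)
# [1] "14 December 2000" "5 June 2003"
```

Note that a string that does not match the pattern is returned unchanged by sub and gsub, so it may be worth checking the result for rows that still contain the full sentence.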

Another option is to use the word function from the stringr package (part of the tidyverse) to select the last three words directly:

library(stringr)
str_replace_all(word(df[,1], -3, -1), fixed("."), "")
# [1] "14 December 2000"  "2 November 2000"   "19 October 2000"   "5 October 2000"    "14 September 2000" "10 July 2003"      "5 June 2003"

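The str_replace_all function removes the period that may appear at the end of the string. The fixed helper indicates that . is a literal period rather than a regular-expression metacharacter.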
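If stringr is not available, the same "take the last three words" idea can be sketched in base R. This is only an illustrative alternative, not part of the answer above: the helper name last_three_words and the sample vector x are hypothetical.

```r
# Hypothetical sample strings, shortened from the rows in the question
x <- c(
  "Vice-President of the European Central Bank,  Frankfurt am Main,  14 December 2000",
  "Vice-President of the European Central Bank,    Frankfurt,  5 June 2003."
)

# Split one string on runs of spaces and keep the last three words
last_three_words <- function(s) {
  words <- strsplit(trimws(s), " +")[[1]]
  paste(tail(words, 3), collapse = " ")
}

dates <- vapply(x, last_three_words, character(1), USE.NAMES = FALSE)

# Drop a trailing period, if present
dates <- sub("\\.$", "", dates)
print(dates)
# [1] "14 December 2000" "5 June 2003"
```

Splitting on " +" rather than a single space keeps the double spaces in the data from producing empty words.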
