I have the following data frame:
df <- structure(list(matrix.unlist.all_dates...nrow...230..byrow...T. = c(
"Willem F. Duisenberg, President of the European Central Bank, Christian Noyer, Vice-President of the European Central Bank, Frankfurt am Main, 14 December 2000",
"Willem F. Duisenberg, President of the European Central Bank, Christian Noyer, Vice-President of the European Central Bank, Frankfurt am Main, 2 November 2000",
"Willem F. Duisenberg, President of the European Central Bank, Christian Noyer, Vice-President of the European Central Bank, Paris, 19 October 2000",
"Willem F. Duisenberg, President of the European Central Bank, Christian Noyer, Vice-President of the European Central Bank, Frankfurt am Main, 5 October 2000",
"Willem F. Duisenberg, President of the European Central Bank, Christian Noyer, Vice-President of the European Central Bank, Frankfurt am Main, 14 September 2000",
"Willem F. Duisenberg, President of the European Central Bank, Lucas Papademos, Vice-President of the European Central Bank, Frankfurt, 10 July 2003.",
"Willem F. Duisenberg, President of the European Central Bank, Lucas Papademos, Vice-President of the European Central Bank, Frankfurt, 5 June 2003."
)), class = "data.frame", row.names = c(NA, -7L))
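As you can see, the text in every row follows a clear pattern, and the last three words are the date. I want to extract only those three "words" (that is, the date).
How would you do this? I tried substr, but it did not work because the rows have different lengths.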
You can extract the date with a regular expression.
# Backslashes must be doubled inside an R string literal
gsub(".* (\\d+ \\w+ \\d+)\\.?$", "\\1", df[, 1])
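The pattern (\d+ \w+ \d+) matches
- one or more digits (\d+), followed by
- a space, followed by
- one or more word characters (\w+, here the month name), followed by
- a space, followed by
- one or more digits (\d+).
The parentheses therefore capture the date. The leading .* and the space after it consume everything before the date, and \.?$ allows an optional period at the end of the string. The whole string is then replaced by the date: \1 (written "\\1" in an R string) refers to whatever was matched inside the parentheses.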
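As a rough sketch of the capture-group idea described above, the snippet below applies the pattern to two shortened sample strings (the vector x is made up for illustration, not the full data from the question). It also shows regmatches() with regexpr() as a possible base R alternative that extracts the match instead of replacing everything around it.

```r
# Two shortened sample strings in the same shape as the question's data
# (x is a hypothetical stand-in for df[, 1])
x <- c(
  "Vice-President of the European Central Bank, Frankfurt am Main, 14 December 2000",
  "Vice-President of the European Central Bank, Frankfurt, 10 July 2003."
)

# Replace the whole string with the captured date; "\\1" is the first group
via_sub <- sub(".* (\\d+ \\w+ \\d+)\\.?$", "\\1", x)

# Alternative: locate the first "digits word digits" match and pull it out
via_regmatches <- regmatches(x, regexpr("\\d+ \\w+ \\d+", x))

print(via_sub)
print(via_regmatches)
```

One difference worth keeping in mind: sub() returns a string unchanged when the pattern does not match, while regmatches() drops non-matching elements, so the result can be shorter than the input. If real Date values are needed, as.Date(via_sub, format = "%d %B %Y") is one option, although %B depends on the current locale's month names.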
Another option is to use the word function from the stringr package (part of the tidyverse) to select the last three words directly.
library(stringr)
str_replace_all(word(df[,1], -3, -1), fixed("."), "")
# [1] "14 December 2000" "2 November 2000" "19 October 2000" "5 October 2000" "14 September 2000" "10 July 2003" "5 June 2003"
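The str_replace_all function removes the period that may appear at the end of the string. The fixed helper indicates that . is a literal period rather than a regular-expression metacharacter.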
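If stringr is not available, the same last-three-words idea can be approximated in base R. This is only a sketch under that assumption: last_three_words is a hypothetical helper name, and x is a shortened sample vector rather than the full data from the question.

```r
# Shortened sample strings (x is a hypothetical stand-in for df[, 1])
x <- c(
  "Vice-President of the European Central Bank, Frankfurt am Main, 14 December 2000",
  "Vice-President of the European Central Bank, Frankfurt, 10 July 2003."
)

# Split one string on single spaces and glue the last three pieces back together
last_three_words <- function(s) {
  words <- strsplit(s, " ", fixed = TRUE)[[1]]
  paste(tail(words, 3), collapse = " ")
}

dates <- vapply(x, last_three_words, character(1), USE.NAMES = FALSE)

# Drop a trailing period, if there is one
dates <- sub("\\.$", "", dates)

print(dates)
```

Like the word approach, this relies on the date always being the last three space-separated tokens, so it would give wrong results for rows that end differently.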