R: Extracting year, month, and day when dates are in a non-standard format

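I have a column of dates, and I want to extract the year, month, and day into separate columns. Unfortunately, the date column contains inconsistent entries, so the usual solutions of format(as.Date(), "%Y") or lubridate::year() do not work.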

Here is an example data frame:

dates_df <- data.frame(dates = c("1985-03-23", "", "1983", "1984-01"))

This is the desired result:

       dates year month  day
1 1985-03-23 1985     3   23
2            <NA>  <NA> <NA>
3       1983 1983  <NA> <NA>
4    1984-01 1984     1 <NA>

I can get the desired result with the following code, but it is very slow on large datasets (>100,000 rows):

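# Split each date on "-" and take the 1st, 2nd, and 3rd piece (NA when a piece is missing).
# The backslash must be doubled inside an R string: "\-" alone is an invalid escape.
dates_df$year <- sapply(dates_df$dates, function(x) unlist(strsplit(x, "\\-"))[1])
dates_df$month <- sapply(dates_df$dates, function(x) unlist(strsplit(x, "\\-"))[2])
dates_df$day <- sapply(dates_df$dates, function(x) unlist(strsplit(x, "\\-"))[3])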

My question:

Is there a more efficient (faster) way to extract year, month, and day columns from messy date data?

Use strsplit and adjust the lengths.

cbind(dates_df, t(sapply(strsplit(dates_df$dates, '-'), `length<-`, 3)))
#        dates    1    2    3
# 1 1985-03-23 1985   03   23
# 2            <NA> <NA> <NA>
# 3       1983 1983 <NA> <NA>
# 4    1984-01 1984   01 <NA>

With nice names:

cbind(dates_df, `colnames<-`(
t(sapply(strsplit(dates_df$dates, '-'), `length<-`, 3)), c('year', 'month', 'day')))
#        dates year month  day
# 1 1985-03-23 1985    03   23
# 2            <NA>  <NA> <NA>
# 3       1983 1983  <NA> <NA>
# 4    1984-01 1984    01 <NA>
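
The columns produced above are character, so the month comes back as "03" rather than 3 as in the desired result. If integer columns are wanted, a minimal sketch could look like the following. It assumes every non-empty piece is numeric (anything else would become NA with a warning), and the names `parts`, `mat`, and `result` are only illustrative, not part of the answer above.

```r
dates_df <- data.frame(dates = c("1985-03-23", "", "1983", "1984-01"),
                       stringsAsFactors = FALSE)

# Split every string once; fixed = TRUE treats "-" as a literal, not a regex
parts <- strsplit(dates_df$dates, "-", fixed = TRUE)

# Convert each piece vector to integer and index 1:3, which pads short
# vectors with NA; vapply returns a 3 x n matrix, so transpose it
mat <- t(vapply(parts, function(p) as.integer(p)[1:3], integer(3)))
colnames(mat) <- c("year", "month", "day")

result <- cbind(dates_df, mat)
```

Splitting once instead of three times per row is the same idea as the answer above; the only change is the integer conversion.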

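My first thought was to try tidyr::separate. I have not tested its speed, and it may fail if the real data contains date formats that are not represented in the example data.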

tidyr::separate(dates_df, 
dates, 
into = c('year', 'month', 'day'), 
remove = FALSE)
#-----
       dates year month  day
1 1985-03-23 1985    03   23
2                  <NA> <NA>
3       1983 1983  <NA> <NA>
4    1984-01 1984    01 <NA>
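
Two further arguments of tidyr::separate may be worth trying here, assuming tidyr is installed and the pieces are always separated by "-". `fill = "right"` pads short rows with NA instead of warning, and `convert = TRUE` runs type conversion on the new columns, which should turn the empty year into NA and "03" into 3. This is a sketch under those assumptions, and again not tested for speed.

```r
library(tidyr)

dates_df <- data.frame(dates = c("1985-03-23", "", "1983", "1984-01"),
                       stringsAsFactors = FALSE)

# sep = "-" splits on the hyphen only; fill = "right" pads missing pieces
# with NA; convert = TRUE converts the new columns to integer where possible
result <- separate(dates_df, dates,
                   into = c("year", "month", "day"),
                   sep = "-", fill = "right", convert = TRUE,
                   remove = FALSE)
```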
