R: Generate new columns from a column with a variable number of delimited entries



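I have a table of journal publications, and I want to extract the first, second-to-last, and last author of each one.

Unfortunately, the number of authors varies a lot: some publications have a single author, others have as many as 35.

If a publication has one author, I want only a first author. If it has two authors, I want the first and last author. If it has three authors, I want the first, second-to-last, and last author, and so on.

Original dataset: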

pub1 <- structure(list(publication = c("pub1", "pub2", "pub3", "pub4", 
"pub5", "pub6"), authors = c("author1", "author1, author2", "author1, author2, author3", 
"author1, author2, author3, author4", "author1, author2, author3, author4, author5", 
"author1, author2, author3, author4, author5, author6")), 
class = "data.frame", row.names = c(NA, -6L))

Here is the expected output:

pub2 <- structure(list(publication = c("pub1", "pub2", "pub3", "pub4", 
"pub5", "pub6"), authors = c("author1", "author1, author2", "author1, author2, author3", 
"author1, author2, author3, author4", "author1, author2, author3, author4, author5", 
"author1, author2, author3, author4, author5, author6"),
author_first = c("author1", "author1", "author1", "author1", "author1", "author1"),
author_second_last = c("", ""," author2", " author3", " author4", " author5"),
author_last = c("", " author2", " author3", " author4", " author5", " author6")),
class = "data.frame", row.names = c(NA, -6L))

I don't know how to do this.

Here is an idea using dplyr and stringr:

library(dplyr)
library(stringr)
author_position <- function(str, p, position) {
  stopifnot(is.numeric(position))
  # Split the string into a vector of pieces using a pattern (in this case `,`)
  # and trim the white space
  s <- str_trim(str_split(str, p, simplify = TRUE))
  len <- length(s)

  # Return NA if the absolute author position is greater than or equal to the
  # number of authors.
  # Caveat: if the position is 1, return the value at the first position
  if (abs(position) >= len) {
    if (position == 1) {
      first(s)
    } else {
      NA_character_
    }
  # Otherwise return the value at the selected position
  } else {
    nth(s, position)
  }
}
pub1 %>%
  rowwise() %>% # group by row
  mutate(author_first = author_position(authors, ",", 1),
         author_second_last = author_position(authors, ",", -2),
         author_last = author_position(authors, ",", -1))
# # A tibble: 6 × 5
# # Rowwise: 
#   publication authors                                              author_first author_second_last author_last
#   <chr>       <chr>                                                <chr>        <chr>              <chr>      
# 1 pub1        author1                                              author1      NA                 NA         
# 2 pub2        author1, author2                                     author1      NA                 author2    
# 3 pub3        author1, author2, author3                            author1      author2            author3    
# 4 pub4        author1, author2, author3, author4                   author1      author3            author4    
# 5 pub5        author1, author2, author3, author4, author5          author1      author4            author5    
# 6 pub6        author1, author2, author3, author4, author5, author6 author1      author5            author6 
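
For comparison, here is a minimal sketch of the same rule in base R, without dplyr or stringr. The helper name `pick_author` and its `sep` argument are hypothetical and not part of the answer above. It assumes the separator is a literal string such as a comma.

```r
# Hypothetical helper (an illustration, not the answer's function):
# the same first/last-fixed rule, written in base R and vectorised over `authors`.
pick_author <- function(authors, position, sep = ",") {
  stopifnot(is.numeric(position), length(position) == 1, position != 0)
  # Split every string on the literal separator; the result is a list of vectors
  pieces <- strsplit(authors, sep, fixed = TRUE)
  vapply(pieces, function(s) {
    s <- trimws(s)
    n <- length(s)
    if (abs(position) >= n) {
      # Position falls on or past the end: only position 1 is still returned
      if (position == 1) s[1] else NA_character_
    } else if (position > 0) {
      s[position]
    } else {
      # Negative positions count from the end (-1 is the last element)
      s[n + position + 1]
    }
  }, character(1))
}

authors <- c("author1", "author1, author2", "author1, author2, author3")
pick_author(authors, 1)   # "author1" "author1" "author1"
pick_author(authors, -2)  # NA NA "author2"
pick_author(authors, -1)  # NA "author2" "author3"
```

Because this sketch takes the whole column at once, it could be called inside `mutate()` without `rowwise()`.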

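Edit: the function now allows returning the author at any position, and I added comments.

The only constraint here is that the first and last authors are fixed. If you ask for the third-to-last author and the publication has only 3 authors, the function returns NA, because that author is technically the first author. Likewise, asking for the third author of a 3-author publication returns NA, because that author is the last author.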

pub1 %>%
  rowwise() %>% # group by row
  mutate(author_third = author_position(authors, ",", 3),
         author_third_last = author_position(authors, ",", -3))

# # A tibble: 6 × 4
# # Rowwise: 
#   publication authors                                              author_third author_third_last
#   <chr>       <chr>                                                <chr>        <chr>            
# 1 pub1        author1                                              NA           NA               
# 2 pub2        author1, author2                                     NA           NA               
# 3 pub3        author1, author2, author3                            NA           NA               
# 4 pub4        author1, author2, author3, author4                   author3      author2          
# 5 pub5        author1, author2, author3, author4, author5          author3      author3          
# 6 pub6        author1, author2, author3, author4, author5, author6 author3      author4  
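
The NA pattern in rows 1 to 3 follows from the `abs(position) >= len` check in `author_position`. A position of 3 or -3 is returned only when the publication has more than 3 authors. The short base R sketch below counts the authors per row to make that visible. The `authors` vector is a shortened copy of the example data, and `n_authors` is a name introduced here for illustration.

```r
# Shortened copy of the example column (illustration only)
authors <- c("author1",
             "author1, author2",
             "author1, author2, author3",
             "author1, author2, author3, author4")

# Number of comma-separated entries in each string
n_authors <- lengths(strsplit(authors, ",", fixed = TRUE))
n_authors      # 1 2 3 4

# Position 3 (or -3) is returned only where this is TRUE
n_authors > 3  # FALSE FALSE FALSE TRUE
```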
