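I have a table of journal publications, and I want to extract the first, second-to-last, and last authors.
Unfortunately, the number of authors varies widely: some publications have only one, others as many as 35.
If a publication has one author, I want only a first author. If it has two authors, I want the first and last authors. If it has three authors, I want the first, second-to-last, and last authors, and so on.
Original dataset: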
pub1 <- structure(list(publication = c("pub1", "pub2", "pub3", "pub4",
"pub5", "pub6"), authors = c("author1", "author1, author2", "author1, author2, author3",
"author1, author2, author3, author4", "author1, author2, author3, author4, author5",
"author1, author2, author3, author4, author5, author6")),
class = "data.frame", row.names = c(NA, -6L))
Here is the expected output:
pub2 <- structure(list(publication = c("pub1", "pub2", "pub3", "pub4",
"pub5", "pub6"), authors = c("author1", "author1, author2", "author1, author2, author3",
"author1, author2, author3, author4", "author1, author2, author3, author4, author5",
"author1, author2, author3, author4, author5, author6"),
author_first = c("author1", "author1", "author1", "author1", "author1", "author1"),
author_second_last = c("", ""," author2", " author3", " author4", " author5"),
author_last = c("", " author2", " author3", " author4", " author5", " author6")),
class = "data.frame", row.names = c(NA, -6L))
I don't know how to approach this.
Here is an idea using dplyr and stringr:
library(dplyr)
library(stringr)
author_position = function(str, p, position) {
  stopifnot(is.numeric(position))
  # Split the string into a vector of pieces using a pattern (in this case `,`)
  # and trim the white space
  s = str_trim(str_split(str, p, simplify = TRUE))
  len = length(s)
  # Return NA if the absolute value of the chosen position is greater than
  # or equal to the number of authors (the first and last slots are reserved)
  # Caveat: if the position is 1, always return the first author
  if (abs(position) >= len) {
    if (position == 1) {
      first(s)
    } else {
      # NA_character_ keeps the result type consistent with the other branches
      NA_character_
    }
  # Otherwise return the value at the selected position
  } else {
    nth(s, position)
  }
}
pub1 %>%
  rowwise() %>% # group by row
  mutate(author_first = author_position(authors, ",", 1),
         author_second_last = author_position(authors, ",", -2),
         author_last = author_position(authors, ",", -1))
# # A tibble: 6 × 5
# # Rowwise:
#   publication authors                                              author_first author_second_last author_last
#   <chr>       <chr>                                                <chr>        <chr>              <chr>
# 1 pub1        author1                                              author1      NA                 NA
# 2 pub2        author1, author2                                     author1      NA                 author2
# 3 pub3        author1, author2, author3                            author1      author2            author3
# 4 pub4        author1, author2, author3, author4                   author1      author3            author4
# 5 pub5        author1, author2, author3, author4, author5          author1      author4            author5
# 6 pub6        author1, author2, author3, author4, author5, author6 author1      author5            author6
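If empty strings are preferred over NA, as in the expected output in the question, a base R variant is one option. This is only a minimal sketch, not the approach used above: `extract_authors` is a hypothetical helper name, and it assumes authors are separated by a comma followed by optional whitespace.

```r
# Hypothetical helper (not from the answer above): base R only.
# Assumes authors are separated by "," plus optional whitespace.
extract_authors <- function(authors) {
  parts <- strsplit(authors, ",\\s*")
  pick <- function(p, from_end, min_len) {
    # Return "" when the publication has too few authors for this slot
    if (length(p) >= min_len) p[length(p) - from_end] else ""
  }
  data.frame(
    author_first       = vapply(parts, function(p) p[1], character(1)),
    author_second_last = vapply(parts, pick, character(1), from_end = 1, min_len = 3),
    author_last        = vapply(parts, pick, character(1), from_end = 0, min_len = 2),
    stringsAsFactors = FALSE
  )
}

authors <- c("author1", "author1, author2", "author1, author2, author3, author4")
res <- extract_authors(authors)
```

Something like `cbind(pub1, extract_authors(pub1$authors))` would then attach the three columns to the original data frame.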
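Edit: the function now allows returning the author at any position, and comments were added.
The only constraint here is that the first and last author slots are fixed. So if you ask for the third-to-last author and the publication has only 3 authors, it returns NA, because technically that author counts as the first author. Likewise, asking for the third author when there are only 3 authors returns NA, because that author counts as the last author.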
pub1 %>%
  rowwise() %>% # group by row
  mutate(author_third = author_position(authors, ",", 3),
         author_third_last = author_position(authors, ",", -3))
# # A tibble: 6 × 4
# # Rowwise:
#   publication authors                                              author_third author_third_last
#   <chr>       <chr>                                                <chr>        <chr>
# 1 pub1        author1                                              NA           NA
# 2 pub2        author1, author2                                     NA           NA
# 3 pub3        author1, author2, author3                            NA           NA
# 4 pub4        author1, author2, author3, author4                   author3      author2
# 5 pub5        author1, author2, author3, author4, author5          author3      author3
# 6 pub6        author1, author2, author3, author4, author5, author6 author3      author4
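The boundary rule can also be checked on its own. The sketch below uses a hypothetical predicate, `is_returned`, that mirrors the `if` conditions in `author_position`; it is an illustration only and assumes `position` is a nonzero whole number.

```r
# Hypothetical predicate mirroring the conditions in author_position():
# position 1 is always returned; any other position is returned only
# when its absolute value is smaller than the number of authors.
# Assumes `position` is a nonzero whole number.
is_returned <- function(position, n_authors) {
  position == 1 | abs(position) < n_authors
}

# With 3 authors, positions 3 and -3 land on the reserved last/first slots
three <- is_returned(c(1, 3, -3, -2, -1), n_authors = 3)
# With 4 authors, positions 3 and -3 are interior and therefore returned
four <- is_returned(c(3, -3), n_authors = 4)
```

This matches the table above: the pub3 row is NA in both columns, while the pub4 row is filled in.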