r-如何为每一行返回一系列列中的第一个非NULL值?第二个非NULL值呢



我有以下组织数据:

EmployeeID <- c(10:15)
Job.Title <- c("Program Manager", "Development Manager", "Developer" , "Developer", "Developer", "Summer Intern")
Level.1 <- c(1,1,1,1,1,1)
Level.2 <- c(2,2,2,2,2,2)
Level.3 <- c("",10,10,10,10,10)
Level.4 <- c("","",11,11,11,11)
Level.5 <- c("","","","","",12)
Level.6 <- c("","","","","","")
Pay.Type <- c("Salary", "Salary", "Salary", "Salary", "Salary", "Hourly")
acme = data.frame(EmployeeID, Job.Title, Level.1, Level.2, Level.3, Level.4, Level.5, Level.6, Pay.Type)
acme
EmployeeID           Job.Title Level.1 Level.2 Level.3 Level.4 Level.5 Level.6 Pay.Type
1         10     Program Manager       1       2                                   Salary
2         11 Development Manager       1       2      10                           Salary
3         12           Developer       1       2      10      11                   Salary
4         13           Developer       1       2      10      11                   Salary
5         14           Developer       1       2      10      11                   Salary
6         15       Summer Intern       1       2      10      11      12           Hourly

对于每一行,我需要标识Level.1到Level.6的第一个非NULL值,从右边开始依次为Level.6、Level.5和Level.4,依此类推。我还需要标识相同模式中的第二个非NULL值。每一行的已识别值都需要放在新的列中,因此最终的表如下所示:

EmployeeID           Job.Title Level.1 Level.2 Level.3 Level.4 Level.5 Level.6 Pay.Type Supervisor Manager
1         10     Program Manager       1       2                                   Salary          2       1
2         11 Development Manager       1       2      10                           Salary         10       2
3         12           Developer       1       2      10      11                   Salary         11      10
4         13           Developer       1       2      10      11                   Salary         11      10
5         14           Developer       1       2      10      11                   Salary         11      10
6         15       Summer Intern       1       2      10      11      12           Hourly         12      11

我们可以使用apply逐行获取所有不为null的索引,并选择第一个和第二个值分别获得两列。

acme[, c("Supervisor", "Manager")] <- t(apply(acme[, 8:3], 1, 
function(x) c(x[which(x != "")[1]], x[which(x != "")[2]])))
acme
#  EmployeeID           Job.Title Level.1 Level.2 Level.3 Level.4 Level.5 Level.6 Pay.Type Supervisor Manager
#1         10     Program Manager       1       2                                   Salary          2       1
#2         11 Development Manager       1       2      10                           Salary         10       2
#3         12           Developer       1       2      10      11                   Salary         11      10
#4         13           Developer       1       2      10      11                   Salary         11      10
#5         14           Developer       1       2      10      11                   Salary         11      10
#6         15       Summer Intern       1       2      10      11      12           Hourly         12      11

编辑

如果有很多列,我们需要找到开始列和结束列的索引。我们可以将grep用于相同的

mincol <- min(grep("Level", colnames(acme)))
maxcol <- max(grep("Level", colnames(acme)))
acme[, c("Supervisor", "Manager")] <- t(apply(acme[, maxcol:mincol], 1, 
function(x) c(x[which(x != "")[1]], x[which(x != "")[2]])))

应该起作用。

如果我们只需要Supervisor,我们可以忽略第二部分。

acme[, "Supervisor"] <- t(apply(acme[, maxcol:mincol], 1, 
function(x) x[which(x != "")[1]]))

这是一个data.table"一行":

library(data.table)
setDT(acme)[melt(acme, measure.vars = patterns("Level.\d"))[value != ""][
order(variable), .(Supervisor = value[.N], Manager = value[.N - 1]), by = EmployeeID], 
on = "EmployeeID"][]
EmployeeID           Job.Title Level.1 Level.2 Level.3 Level.4 Level.5 Level.6 Pay.Type Supervisor
#1:         10     Program Manager       1       2                                   Salary          2
#2:         11 Development Manager       1       2      10                           Salary         10
#3:         12           Developer       1       2      10      11                   Salary         11
#4:         13           Developer       1       2      10      11                   Salary         11
#5:         14           Developer       1       2      10      11                   Salary         11
#6:         15       Summer Intern       1       2      10      11      12           Hourly         12
Manager
#1:       1
#2:       2
#3:      10
#4:      10
#5:      10
#6:      11

它的工作原理

  1. data.frame被强制为data.table
  2. 并按顺序从宽格式改为长格式
  3. 以移除其中级别被给定为CCD_ 7的所有行
  4. 现在,数据按级别编号排序(隐含地表示为Level.1Level.2等)
  5. 为每个员工提取最后一个值(Supervisor)和倒数第二个值(Manager),创建由三列组成的中间结果
  6. 最后,将中间结果连接到acme以附加新列
  7. 并打印

注意:melt()将发出警告消息,指出并非所有级别列都具有相同的数据类型。这是由于在acme数据帧的定义中将整数值与字符("")混合造成的。最好使用NA而不是""。BTW:在这种情况下,使用na.rm = FALSEmelt()可以简化代码

注意:步骤4中的简单算术排序最多适用于9个级别(Level.1Level.9)。如果级别更多,则必须提取级别编号并将其强制为整数。

dplyrtidyr依赖于数据整形的解决方案。

library(tidyverse)
acme %>%
gather('level', 'value', starts_with('Level.')) %>%
group_by(EmployeeID) %>%
filter(value != '') %>%
summarise(Supervisor = last(value),
Manager = nth(value, -2)) %>%
left_join(acme)

我们可以用max.col来实现这一点。查找'Level'列的索引('i1'),将基于'i1''的'acme'子集转换为matrix(!=""),应用max.col并获得lastTRUE值的列索引,减去1获得倒数第二个TRUE值('i3'),使用行/列索引提取元素并创建'Supervisor'和'Manager'列

i1 <- grep("Level\.\d+", names(acme))
i2 <- max.col(acme[i1]!="", "last")
i3 <- i2-1
acme$Supervisor <- acme[i1][cbind(1:nrow(acme), i2)]
acme$Manager <-  acme[i1][cbind(1:nrow(acme), i3)]
acme
#  EmployeeID           Job.Title Level.1 Level.2 Level.3 Level.4 Level.5 Level.6 Pay.Type Supervisor Manager
#1         10     Program Manager       1       2                                   Salary          2       1
#2         11 Development Manager       1       2      10                           Salary         10       2
#3         12           Developer       1       2      10      11                   Salary         11      10
#4         13           Developer       1       2      10      11                   Salary         11      10
#5         14           Developer       1       2      10      11                   Salary         11      10
#6         15       Summer Intern       1       2      10      11      12           Hourly         12      11

注:此解决方案非常简单高效,无需任何不必要的整形

最新更新