为了解释我的问题,让我介绍一下我的数据集。
这是我的数据库的一个示例。
idno
是唯一标识符,day
是观察日,clockst
是一天的开始,start
是观察开始,end
观察结束,time
每集的持续时间。
在此数据库中,每个idno被观察两天,因此总共2880分钟。
idno day clockst start end time
1 1 Monday 1.30 1290 1310 20
2 1 Monday 1.50 1310 1320 10
3 1 Monday 2.00 1320 1440 120
4 1 Monday 4.00 0 385 385
5 1 Monday 10.25 385 405 20
6 1 Monday 10.45 405 450 45
7 1 Monday 11.30 450 485 35
8 1 Monday 12.05 485 495 10
9 1 Monday 12.15 495 515 20
10 1 Monday 12.35 515 600 85
11 1 Monday 14.00 600 615 15
12 1 Monday 14.15 615 640 25
13 1 Monday 14.40 640 705 65
14 1 Monday 15.45 705 710 5
15 1 Monday 15.50 710 725 15
我的主要问题是day
/clockst
与start
冲突.由于某些原因,在实验开始之前就已经观察了这些个体,所以基本上start == 0
之前的所有情节。(start == 0
表示实验开始)。
例如,在start == 0
之前观察第 1-3 行。
我想要的是删除这些行。 但是,由于以下问题,我不知道如何唯一标识每个idno/天。
查看示例的另一部分
idno day clockst start end time
30 1 Monday 21.10 1030 1055 25
31 1 Monday 21.35 1055 1110 55
32 1 Monday 22.30 1110 1155 45
33 1 Monday 23.15 1155 1170 15
34 1 Monday 23.30 1170 1290 120
35 1 Tuesday 0.15 1215 1310 95
36 1 Tuesday 1.50 1310 1320 10
37 1 Tuesday 2.00 1320 1440 120
38 1 Tuesday 4.00 0 385 385
39 1 Tuesday 10.25 385 405 20
40 1 Tuesday 10.45 405 450 45
你可以在这里看到,实验的日期和结束是不一样的。如果您查看第 33 - 34 行,您可以看到实验从星期一持续到星期二。
这就是为什么我无法通过idno和天来识别唯一的观察时间。
最终,我想要的是这个
idno day clockst start end time ep day_obs select
1 1 Monday 1.30 1290 1310 20 NA 1 remove
2 1 Monday 1.50 1310 1320 10 NA 1 remove
3 1 Monday 2.00 1320 1440 120 NA 1 remove
4 1 Monday 4.00 0 385 385 1 1 keep
5 1 Monday 10.25 385 405 20 2 1 keep
6 1 Monday 10.45 405 450 45 3 1 keep
7 1 Monday 11.30 450 485 35 4 1 keep
8 1 Monday 12.05 485 495 10 5 1 keep
9 1 Monday 12.15 495 515 20 6 1 keep
10 1 Monday 12.35 515 600 85 7 1 keep
11 1 Monday 14.00 600 615 15 8 1 keep
我想创建一个名为ep
的变量,以增加观察时间,并在"观察前"期间丢失,以便我删除 NA 行。
我还想创建一个观察天数(day_obs
)的变量,它不是基于day
而是基于实验的开始。
知道我怎么能做到这一点吗?
我将数据和输出的子样本放在csv文件中
library(RCurl)
df <- getURL("https://raw.githubusercontent.com/giacomovagni/df_germany_sample/master/df_germany_subsample.csv")
df <- read.csv(text = df)
output <- getURL("https://raw.githubusercontent.com/giacomovagni/df_germany_sample/master/df_germany_sample_output.csv")
output <- read.csv(text = output)
OP 已请求删除每个idno
第一次出现start == 0
之前的所有行。这相当于为每个idno
保留start == 0
首次出现(包括)之后的所有行。
变体 1:仅子集
这可以使用data.table
来实现:
library(data.table)
df <- fread("https://raw.githubusercontent.com/giacomovagni/df_germany_sample/master/df_germany_subsample.csv")
df2 <- df[df[, .I[seq(first(which(start == 0)), .N)], by = idno]$V1]
df2
idno day clockst start end time 1: 1 Monday 4.00 0 385 385 2: 1 Monday 10.25 385 405 20 3: 1 Monday 10.45 405 450 45 4: 1 Monday 11.30 450 485 35 5: 1 Monday 12.05 485 495 10 --- 98: 2 Tuesday 22.00 1080 1090 10 99: 2 Tuesday 22.10 1090 1135 45 100: 2 Tuesday 22.55 1135 1145 10 101: 2 Tuesday 23.05 1145 1170 25 102: 2 Tuesday 23.30 1170 1260 90
解释
df[, .I[seq(first(which(start == 0)), .N)], by = idno]
返回要为每个idno
保留的行的行索引。这些用于后续的子集df
idno V1 1: 1 4 2: 1 5 3: 1 6 4: 1 7 5: 1 8 --- 98: 2 104 99: 2 105 100: 2 106 101: 2 107 102: 2 108
备选案文2:备选办法和对其他问题的答复
除了过滤数据外,OP 还请求
- 为每个
idno
创建观察天数day_obs
计数,每次遇到start == 0
时都会提前,并且 - 每个
day_obs
和idno
的行计数(OP 称这有点误导性地增加了观察时间)。
day_obs
可以在未过滤的原始df
中创建
df[, day_obs := cumsum(start == 0 ), by = idno]
df
idno day clockst start end time day_obs 1: 1 Monday 1.30 1290 1310 20 0 2: 1 Monday 1.50 1310 1320 10 0 3: 1 Monday 2.00 1320 1440 120 0 4: 1 Monday 4.00 0 385 385 1 5: 1 Monday 10.25 385 405 20 1 --- 104: 2 Tuesday 22.00 1080 1090 10 2 105: 2 Tuesday 22.10 1090 1135 45 2 106: 2 Tuesday 22.55 1135 1145 10 2 107: 2 Tuesday 23.05 1145 1170 25 2 108: 2 Tuesday 23.30 1170 1260 90 2
请注意,start == 0
第一次出现之前的行可以按day_obs == 0
进行筛选。
现在,我们可以根据以下要求轻松创建ep
df[day_obs > 0, ep := rowid(idno, day_obs)]
df
idno day clockst start end time day_obs ep 1: 1 Monday 1.30 1290 1310 20 0 NA 2: 1 Monday 1.50 1310 1320 10 0 NA 3: 1 Monday 2.00 1320 1440 120 0 NA 4: 1 Monday 4.00 0 385 385 1 1 5: 1 Monday 10.25 385 405 20 1 2 --- 104: 2 Tuesday 22.00 1080 1090 10 2 18 105: 2 Tuesday 22.10 1090 1135 45 2 19 106: 2 Tuesday 22.55 1135 1145 10 2 20 107: 2 Tuesday 23.05 1145 1170 25 2 21 108: 2 Tuesday 23.30 1170 1260 90 2 22
数据
由于下载链接将来可能会断开,以下是下载数据的dput()
:
df <- setDT(
structure(list(idno = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L), day = c("Monday", "Monday", "Monday", "Monday", "Monday",
"Monday", "Monday", "Monday", "Monday", "Monday", "Monday", "Monday",
"Monday", "Monday", "Monday", "Monday", "Monday", "Monday", "Monday",
"Monday", "Monday", "Monday", "Monday", "Monday", "Monday", "Monday",
"Monday", "Monday", "Monday", "Monday", "Monday", "Monday", "Monday",
"Monday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday",
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday",
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday",
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday",
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday",
"Tuesday", "Tuesday", "Monday", "Monday", "Monday", "Monday",
"Monday", "Monday", "Monday", "Monday", "Monday", "Monday", "Monday",
"Monday", "Monday", "Monday", "Monday", "Monday", "Monday", "Monday",
"Monday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday",
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday",
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday",
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday",
"Tuesday"), clockst = c(1.3, 1.5, 2, 4, 10.25, 10.45, 11.3, 12.05,
12.15, 12.35, 14, 14.15, 14.4, 15.45, 15.5, 16.05, 16.1, 16.3,
16.5, 17.4, 17.55, 18.25, 19, 19.2, 19.35, 20, 20.25, 20.3, 20.55,
21.1, 21.35, 22.3, 23.15, 23.3, 0.15, 1.5, 2, 4, 10.25, 10.45,
11.3, 12, 12.1, 12.3, 12.5, 13.05, 13.55, 14.3, 14.5, 15.5, 16,
16.1, 17.4, 17.55, 18.25, 18.3, 20, 20.15, 20.2, 20.55, 21.1,
21.3, 21.45, 22, 23.15, 0.1, 1, 1.1, 4, 12.05, 12.25, 13, 13.2,
14.3, 15.35, 16, 16.3, 17, 18.3, 19.3, 20, 22, 23.1, 23.3, 1,
1.15, 4, 11.05, 11.3, 11.4, 12, 12.15, 13, 14, 14.3, 15.2, 17,
17.3, 18.3, 19.3, 21.2, 21.35, 21.5, 22, 22.1, 22.55, 23.05,
23.3), start = c(1290L, 1310L, 1320L, 0L, 385L, 405L, 450L, 485L,
495L, 515L, 600L, 615L, 640L, 705L, 710L, 725L, 730L, 750L, 770L,
820L, 835L, 865L, 900L, 920L, 935L, 960L, 985L, 990L, 1015L,
1030L, 1055L, 1110L, 1155L, 1170L, 1215L, 1310L, 1320L, 0L, 385L,
405L, 450L, 480L, 490L, 510L, 530L, 545L, 595L, 630L, 650L, 710L,
720L, 730L, 820L, 835L, 865L, 870L, 960L, 975L, 980L, 1015L,
1030L, 1050L, 1065L, 1080L, 1155L, 1210L, 1260L, 1270L, 0L, 485L,
505L, 540L, 560L, 630L, 695L, 720L, 750L, 780L, 870L, 930L, 960L,
1080L, 1150L, 1170L, 1260L, 1275L, 0L, 425L, 450L, 460L, 480L,
495L, 540L, 600L, 630L, 680L, 780L, 810L, 870L, 930L, 1040L,
1055L, 1070L, 1080L, 1090L, 1135L, 1145L, 1170L), end = c(1310L,
1320L, 1440L, 385L, 405L, 450L, 485L, 495L, 515L, 600L, 615L,
640L, 705L, 710L, 725L, 730L, 750L, 770L, 820L, 835L, 865L, 900L,
920L, 935L, 960L, 985L, 990L, 1015L, 1030L, 1055L, 1110L, 1155L,
1170L, 1290L, 1310L, 1320L, 1440L, 385L, 405L, 450L, 480L, 490L,
510L, 530L, 545L, 595L, 630L, 650L, 710L, 720L, 730L, 820L, 835L,
865L, 870L, 960L, 975L, 980L, 1015L, 1030L, 1050L, 1065L, 1080L,
1155L, 1215L, 1260L, 1270L, 1440L, 485L, 505L, 540L, 560L, 630L,
695L, 720L, 750L, 780L, 870L, 930L, 960L, 1080L, 1150L, 1170L,
1210L, 1275L, 1440L, 425L, 450L, 460L, 480L, 495L, 540L, 600L,
630L, 680L, 780L, 810L, 870L, 930L, 1040L, 1055L, 1070L, 1080L,
1090L, 1135L, 1145L, 1170L, 1260L), time = c(20L, 10L, 120L,
385L, 20L, 45L, 35L, 10L, 20L, 85L, 15L, 25L, 65L, 5L, 15L, 5L,
20L, 20L, 50L, 15L, 30L, 35L, 20L, 15L, 25L, 25L, 5L, 25L, 15L,
25L, 55L, 45L, 15L, 120L, 95L, 10L, 120L, 385L, 20L, 45L, 30L,
10L, 20L, 20L, 15L, 50L, 35L, 20L, 60L, 10L, 10L, 90L, 15L, 30L,
5L, 90L, 15L, 5L, 35L, 15L, 20L, 15L, 15L, 75L, 60L, 50L, 10L,
170L, 485L, 20L, 35L, 20L, 70L, 65L, 25L, 30L, 30L, 90L, 60L,
30L, 120L, 70L, 20L, 40L, 15L, 165L, 425L, 25L, 10L, 20L, 15L,
45L, 60L, 30L, 50L, 100L, 30L, 60L, 60L, 110L, 15L, 15L, 10L,
10L, 45L, 10L, 25L, 90L)), .Names = c("idno", "day", "clockst",
"start", "end", "time"), row.names = c(NA, -108L), class = "data.frame", index = structure(integer(0), "`__start`" = c(4L,
38L, 69L, 87L, 5L, 39L, 6L, 40L, 88L, 7L, 41L, 89L, 90L, 42L,
91L, 8L, 70L, 43L, 9L, 92L, 71L, 44L, 10L, 45L, 72L, 93L, 46L,
73L, 47L, 11L, 94L, 12L, 48L, 74L, 95L, 13L, 49L, 96L, 75L, 14L,
15L, 50L, 51L, 76L, 16L, 17L, 52L, 18L, 77L, 19L, 78L, 97L, 98L,
20L, 53L, 21L, 54L, 22L, 55L, 56L, 79L, 99L, 23L, 24L, 80L, 100L,
25L, 26L, 57L, 81L, 58L, 59L, 27L, 28L, 29L, 60L, 30L, 61L, 101L,
62L, 31L, 102L, 63L, 103L, 64L, 82L, 104L, 105L, 32L, 106L, 107L,
83L, 33L, 65L, 34L, 84L, 108L, 66L, 35L, 67L, 85L, 68L, 86L,
1L, 2L, 36L, 3L, 37L), "`__idno`" = integer(0)))
)