R语言 根据模糊的标识识别并递增序列



为了解释我的问题,让我介绍一下我的数据集。

这是我的数据库的一个示例。

idno是唯一标识符,day是观察日,clockst是一天的开始,start是观察开始,end观察结束,time每集的持续时间。

在此数据库中,每个idno被观察两天,因此总共2880分钟

idno    day clockst start  end time
1     1 Monday    1.30  1290 1310   20
2     1 Monday    1.50  1310 1320   10
3     1 Monday    2.00  1320 1440  120
4     1 Monday    4.00     0  385  385
5     1 Monday   10.25   385  405   20
6     1 Monday   10.45   405  450   45
7     1 Monday   11.30   450  485   35
8     1 Monday   12.05   485  495   10
9     1 Monday   12.15   495  515   20
10    1 Monday   12.35   515  600   85
11    1 Monday   14.00   600  615   15
12    1 Monday   14.15   615  640   25
13    1 Monday   14.40   640  705   65
14    1 Monday   15.45   705  710    5
15    1 Monday   15.50   710  725   15

我的主要问题是day/clockststart冲突.由于某些原因,在实验开始之前就已经观察了这些个体,所以基本上start == 0之前的所有情节。(start == 0表示实验开始)。

例如,在start == 0之前观察第 1-3 行。

我想要的是删除这些行。 但是,由于以下问题,我不知道如何唯一标识每个idno/天

查看示例的另一部分

idno     day clockst start  end time
30    1  Monday   21.10  1030 1055   25
31    1  Monday   21.35  1055 1110   55
32    1  Monday   22.30  1110 1155   45
33    1  Monday   23.15  1155 1170   15
34    1  Monday   23.30  1170 1290  120
35    1 Tuesday    0.15  1215 1310   95
36    1 Tuesday    1.50  1310 1320   10
37    1 Tuesday    2.00  1320 1440  120
38    1 Tuesday    4.00     0  385  385
39    1 Tuesday   10.25   385  405   20
40    1 Tuesday   10.45   405  450   45

你可以在这里看到,实验的日期和结束是不一样的。如果您查看第 33 - 34 行,您可以看到实验从星期一持续到星期二。

这就是为什么我无法通过idno和天来识别唯一的观察时间。

最终,我想要的是这个

idno     day clockst start  end time ep day_obs select
1      1  Monday    1.30  1290 1310   20 NA       1 remove
2      1  Monday    1.50  1310 1320   10 NA       1 remove
3      1  Monday    2.00  1320 1440  120 NA       1 remove
4      1  Monday    4.00     0  385  385  1       1   keep
5      1  Monday   10.25   385  405   20  2       1   keep
6      1  Monday   10.45   405  450   45  3       1   keep
7      1  Monday   11.30   450  485   35  4       1   keep
8      1  Monday   12.05   485  495   10  5       1   keep
9      1  Monday   12.15   495  515   20  6       1   keep
10     1  Monday   12.35   515  600   85  7       1   keep
11     1  Monday   14.00   600  615   15  8       1   keep

我想创建一个名为ep的变量,以增加观察时间,并在"观察前"期间丢失,以便我删除 NA 行。

我还想创建一个观察天数(day_obs)的变量,它不是基于day而是基于实验的开始

知道我怎么能做到这一点吗?

我将数据和输出的子样本放在csv文件中

library(RCurl)
df <- getURL("https://raw.githubusercontent.com/giacomovagni/df_germany_sample/master/df_germany_subsample.csv")
df <- read.csv(text = df)
output <- getURL("https://raw.githubusercontent.com/giacomovagni/df_germany_sample/master/df_germany_sample_output.csv")
output <- read.csv(text = output)

OP 已请求删除每个idno第一次出现start == 0之前的所有行。这相当于为每个idno保留start == 0首次出现(包括)之后的所有行。

变体 1:仅子集

这可以使用data.table来实现:

library(data.table)
df <- fread("https://raw.githubusercontent.com/giacomovagni/df_germany_sample/master/df_germany_subsample.csv")
df2 <- df[df[, .I[seq(first(which(start == 0)), .N)], by = idno]$V1]
df2
idno     day clockst start  end time
1:    1  Monday    4.00     0  385  385
2:    1  Monday   10.25   385  405   20
3:    1  Monday   10.45   405  450   45
4:    1  Monday   11.30   450  485   35
5:    1  Monday   12.05   485  495   10
---                                     
98:    2 Tuesday   22.00  1080 1090   10
99:    2 Tuesday   22.10  1090 1135   45
100:    2 Tuesday   22.55  1135 1145   10
101:    2 Tuesday   23.05  1145 1170   25
102:    2 Tuesday   23.30  1170 1260   90

解释

df[, .I[seq(first(which(start == 0)), .N)], by = idno]

返回要为每个idno保留的行的行索引。这些用于后续的子集df

idno  V1
1:    1   4
2:    1   5
3:    1   6
4:    1   7
5:    1   8
---         
98:    2 104
99:    2 105
100:    2 106
101:    2 107
102:    2 108

备选案文2:备选办法和对其他问题的答复

除了过滤数据外,OP 还请求

  • 为每个idno创建观察天数day_obs计数,每次遇到start == 0时都会提前,并且
  • 每个day_obsidno的行计数(OP 称这有点误导性地增加了观察时间)。

day_obs可以在未过滤的原始df中创建

df[, day_obs := cumsum(start == 0 ), by = idno]
df
idno     day clockst start  end time day_obs
1:    1  Monday    1.30  1290 1310   20       0
2:    1  Monday    1.50  1310 1320   10       0
3:    1  Monday    2.00  1320 1440  120       0
4:    1  Monday    4.00     0  385  385       1
5:    1  Monday   10.25   385  405   20       1
---                                             
104:    2 Tuesday   22.00  1080 1090   10       2
105:    2 Tuesday   22.10  1090 1135   45       2
106:    2 Tuesday   22.55  1135 1145   10       2
107:    2 Tuesday   23.05  1145 1170   25       2
108:    2 Tuesday   23.30  1170 1260   90       2

请注意,start == 0第一次出现之前的行可以按day_obs == 0进行筛选。

现在,我们可以根据以下要求轻松创建ep

df[day_obs > 0, ep := rowid(idno, day_obs)]
df
idno     day clockst start  end time day_obs ep
1:    1  Monday    1.30  1290 1310   20       0 NA
2:    1  Monday    1.50  1310 1320   10       0 NA
3:    1  Monday    2.00  1320 1440  120       0 NA
4:    1  Monday    4.00     0  385  385       1  1
5:    1  Monday   10.25   385  405   20       1  2
---                                                
104:    2 Tuesday   22.00  1080 1090   10       2 18
105:    2 Tuesday   22.10  1090 1135   45       2 19
106:    2 Tuesday   22.55  1135 1145   10       2 20
107:    2 Tuesday   23.05  1145 1170   25       2 21
108:    2 Tuesday   23.30  1170 1260   90       2 22

数据

由于下载链接将来可能会断开,以下是下载数据的dput()

df <- setDT(
structure(list(idno = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L), day = c("Monday", "Monday", "Monday", "Monday", "Monday", 
"Monday", "Monday", "Monday", "Monday", "Monday", "Monday", "Monday", 
"Monday", "Monday", "Monday", "Monday", "Monday", "Monday", "Monday", 
"Monday", "Monday", "Monday", "Monday", "Monday", "Monday", "Monday", 
"Monday", "Monday", "Monday", "Monday", "Monday", "Monday", "Monday", 
"Monday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", 
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", 
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", 
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", 
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", 
"Tuesday", "Tuesday", "Monday", "Monday", "Monday", "Monday", 
"Monday", "Monday", "Monday", "Monday", "Monday", "Monday", "Monday", 
"Monday", "Monday", "Monday", "Monday", "Monday", "Monday", "Monday", 
"Monday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", 
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", 
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", 
"Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", "Tuesday", 
"Tuesday"), clockst = c(1.3, 1.5, 2, 4, 10.25, 10.45, 11.3, 12.05, 
12.15, 12.35, 14, 14.15, 14.4, 15.45, 15.5, 16.05, 16.1, 16.3, 
16.5, 17.4, 17.55, 18.25, 19, 19.2, 19.35, 20, 20.25, 20.3, 20.55, 
21.1, 21.35, 22.3, 23.15, 23.3, 0.15, 1.5, 2, 4, 10.25, 10.45, 
11.3, 12, 12.1, 12.3, 12.5, 13.05, 13.55, 14.3, 14.5, 15.5, 16, 
16.1, 17.4, 17.55, 18.25, 18.3, 20, 20.15, 20.2, 20.55, 21.1, 
21.3, 21.45, 22, 23.15, 0.1, 1, 1.1, 4, 12.05, 12.25, 13, 13.2, 
14.3, 15.35, 16, 16.3, 17, 18.3, 19.3, 20, 22, 23.1, 23.3, 1, 
1.15, 4, 11.05, 11.3, 11.4, 12, 12.15, 13, 14, 14.3, 15.2, 17, 
17.3, 18.3, 19.3, 21.2, 21.35, 21.5, 22, 22.1, 22.55, 23.05, 
23.3), start = c(1290L, 1310L, 1320L, 0L, 385L, 405L, 450L, 485L, 
495L, 515L, 600L, 615L, 640L, 705L, 710L, 725L, 730L, 750L, 770L, 
820L, 835L, 865L, 900L, 920L, 935L, 960L, 985L, 990L, 1015L, 
1030L, 1055L, 1110L, 1155L, 1170L, 1215L, 1310L, 1320L, 0L, 385L, 
405L, 450L, 480L, 490L, 510L, 530L, 545L, 595L, 630L, 650L, 710L, 
720L, 730L, 820L, 835L, 865L, 870L, 960L, 975L, 980L, 1015L, 
1030L, 1050L, 1065L, 1080L, 1155L, 1210L, 1260L, 1270L, 0L, 485L, 
505L, 540L, 560L, 630L, 695L, 720L, 750L, 780L, 870L, 930L, 960L, 
1080L, 1150L, 1170L, 1260L, 1275L, 0L, 425L, 450L, 460L, 480L, 
495L, 540L, 600L, 630L, 680L, 780L, 810L, 870L, 930L, 1040L, 
1055L, 1070L, 1080L, 1090L, 1135L, 1145L, 1170L), end = c(1310L, 
1320L, 1440L, 385L, 405L, 450L, 485L, 495L, 515L, 600L, 615L, 
640L, 705L, 710L, 725L, 730L, 750L, 770L, 820L, 835L, 865L, 900L, 
920L, 935L, 960L, 985L, 990L, 1015L, 1030L, 1055L, 1110L, 1155L, 
1170L, 1290L, 1310L, 1320L, 1440L, 385L, 405L, 450L, 480L, 490L, 
510L, 530L, 545L, 595L, 630L, 650L, 710L, 720L, 730L, 820L, 835L, 
865L, 870L, 960L, 975L, 980L, 1015L, 1030L, 1050L, 1065L, 1080L, 
1155L, 1215L, 1260L, 1270L, 1440L, 485L, 505L, 540L, 560L, 630L, 
695L, 720L, 750L, 780L, 870L, 930L, 960L, 1080L, 1150L, 1170L, 
1210L, 1275L, 1440L, 425L, 450L, 460L, 480L, 495L, 540L, 600L, 
630L, 680L, 780L, 810L, 870L, 930L, 1040L, 1055L, 1070L, 1080L, 
1090L, 1135L, 1145L, 1170L, 1260L), time = c(20L, 10L, 120L, 
385L, 20L, 45L, 35L, 10L, 20L, 85L, 15L, 25L, 65L, 5L, 15L, 5L, 
20L, 20L, 50L, 15L, 30L, 35L, 20L, 15L, 25L, 25L, 5L, 25L, 15L, 
25L, 55L, 45L, 15L, 120L, 95L, 10L, 120L, 385L, 20L, 45L, 30L, 
10L, 20L, 20L, 15L, 50L, 35L, 20L, 60L, 10L, 10L, 90L, 15L, 30L, 
5L, 90L, 15L, 5L, 35L, 15L, 20L, 15L, 15L, 75L, 60L, 50L, 10L, 
170L, 485L, 20L, 35L, 20L, 70L, 65L, 25L, 30L, 30L, 90L, 60L, 
30L, 120L, 70L, 20L, 40L, 15L, 165L, 425L, 25L, 10L, 20L, 15L, 
45L, 60L, 30L, 50L, 100L, 30L, 60L, 60L, 110L, 15L, 15L, 10L, 
10L, 45L, 10L, 25L, 90L)), .Names = c("idno", "day", "clockst", 
"start", "end", "time"), row.names = c(NA, -108L), class = "data.frame", index = structure(integer(0), "`__start`" = c(4L, 
38L, 69L, 87L, 5L, 39L, 6L, 40L, 88L, 7L, 41L, 89L, 90L, 42L, 
91L, 8L, 70L, 43L, 9L, 92L, 71L, 44L, 10L, 45L, 72L, 93L, 46L, 
73L, 47L, 11L, 94L, 12L, 48L, 74L, 95L, 13L, 49L, 96L, 75L, 14L, 
15L, 50L, 51L, 76L, 16L, 17L, 52L, 18L, 77L, 19L, 78L, 97L, 98L, 
20L, 53L, 21L, 54L, 22L, 55L, 56L, 79L, 99L, 23L, 24L, 80L, 100L, 
25L, 26L, 57L, 81L, 58L, 59L, 27L, 28L, 29L, 60L, 30L, 61L, 101L, 
62L, 31L, 102L, 63L, 103L, 64L, 82L, 104L, 105L, 32L, 106L, 107L, 
83L, 33L, 65L, 34L, 84L, 108L, 66L, 35L, 67L, 85L, 68L, 86L, 
1L, 2L, 36L, 3L, 37L), "`__idno`" = integer(0)))
)

最新更新