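I am trying to define a new variable, word_duration, computed by subtracting the first start_time of each occurrence of a word from the end_time of that occurrence's last syllable.
Below is a minimal example, followed by how I would like the data frame to look with the new word_duration column: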
df <- data.frame("word" = c("each", "each", "unique", "unique", "word", "unique", "unique"),
"syllable" = c("ea", "ch", "u", "nique", "word", "u", "nique"),
"start_time" = c(41.48, 42.95, 43.49, 43.95, 44.07, 44.12, 44.19),
"end_time" = c(42.95, 43.49, 43.95, 44.07, 44.12, 44.19, 44.23))
word syllable start_time end_time word_duration
1 each ea 41.48 42.95 2.01
2 each ch 42.95 43.49 2.01
3 unique u 43.49 43.95 0.58
4 unique nique 43.95 44.07 0.58
5 word word 44.07 44.12 0.05
6 unique u 44.12 44.19 0.11
7 unique nique 44.19 44.23 0.11
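An example of how the new variable should be defined:
- the word "unique", for example, appears twice in the data frame and has two syllables
- the first syllable of the first "unique" starts at 43.49 seconds, and the last syllable of the first "unique" ends at 44.07
- so the word_duration of the first "unique" is 44.07 - 43.49 = 0.58 seconds
The individual word_durations should therefore be 2.01, 0.58, 0.05 and 0.11, but I am afraid I need some kind of for loop to define word_duration. It is further complicated by the fact that a word can appear several times in the data frame, so the calculation has to follow the row order rather than simply group by word. Any suggestions? Thanks for your help!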
You can split by the positions where word changes (stored in i), take the diff of the range of each piece, then unsplit the result and store it in df.
i <- c(0, cumsum(df$word[-1] != head(df$word, -1)))
df$word_duration <- unsplit(lapply(split(df[c("start_time", "end_time")], i),
function(x) diff(range(x))), i)
df
# word syllable start_time end_time word_duration
#1 each ea 41.48 42.95 2.01
#2 each ch 42.95 43.49 2.01
#3 unique u 43.49 43.95 0.58
#4 unique nique 43.95 44.07 0.58
#5 word word 44.07 44.12 0.05
#6 unique u 44.12 44.19 0.11
#7 unique nique 44.19 44.23 0.11
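To make the grouping step above easier to follow, here is a minimal base R sketch that prints the run index i and computes the same column with ave instead of split/unsplit. Using ave is a swapped-in alternative, not part of the answer above, and it assumes the rows are in time order within each word.

```r
df <- data.frame(
  word       = c("each", "each", "unique", "unique", "word", "unique", "unique"),
  syllable   = c("ea", "ch", "u", "nique", "word", "u", "nique"),
  start_time = c(41.48, 42.95, 43.49, 43.95, 44.07, 44.12, 44.19),
  end_time   = c(42.95, 43.49, 43.95, 44.07, 44.12, 44.19, 44.23)
)

# Run index: increases by one each time the word differs from the previous row,
# so the two separate occurrences of "unique" get different ids.
i <- c(0, cumsum(df$word[-1] != head(df$word, -1)))
i
# [1] 0 0 1 1 2 3 3

# Per run: latest end_time minus earliest start_time, repeated on every row of the run.
df$word_duration <- ave(df$end_time, i, FUN = max) - ave(df$start_time, i, FUN = min)
```

Because i is built from consecutive rows rather than from the word itself, repeated words are handled without any explicit loop.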
Here is one approach:
library(zoo) # for na.locf
library(data.table)
df <- data.frame(
"word" = c("each", "each", "unique", "unique", "word", "unique", "unique"),
"syllable" = c("ea", "ch", "u", "nique", "word", "u", "nique" ),
"start_time" = c(41.48, 42.95, 43.49, 43.95, 44.07, 44.12, 44.19),
"end_time" = c(42.95, 43.49, 43.95, 44.07, 44.12, 44.19, 44.23)
)
setDT(df)  # convert to a data.table by reference (no pipe operator needed)
df[, lead := word != shift(word,fill=TRUE) ]
df[ lead == TRUE , word_duration := shift( start_time,type="lead") - start_time ]
## fix the last word:
last_end_time <- last( df$end_time )
df[ lead == TRUE & is.na(word_duration), word_duration := last_end_time - start_time ]
## make sure NA's are filled with the common word_duration for the syllables
df[ , word_duration := na.locf( word_duration ) ]
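It identifies the start time of each word, then takes the start time of the next word as the end point, since that appears to be valid for the data you provided.
It then fixes the last word manually, because there is no following word to supply an end point.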
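If the data may contain pauses between words, so that the next word's start is not this word's end, a variant that reads end_time directly might be safer. The following is only a sketch under that assumption: it uses data.table's rleid to number consecutive runs of the same word, and dt is a name introduced here for illustration.

```r
library(data.table)

dt <- data.table(
  word       = c("each", "each", "unique", "unique", "word", "unique", "unique"),
  syllable   = c("ea", "ch", "u", "nique", "word", "u", "nique"),
  start_time = c(41.48, 42.95, 43.49, 43.95, 44.07, 44.12, 44.19),
  end_time   = c(42.95, 43.49, 43.95, 44.07, 44.12, 44.19, 44.23)
)

# rleid(word) gives one id per consecutive run of the same word.
# Within each run: end_time of the last syllable minus start_time of the first.
dt[, word_duration := end_time[.N] - start_time[1], by = rleid(word)]
```

This needs neither the manual fix for the last word nor the NA fill, because every row in a run receives the value directly.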