我有关于基于时间戳的计算的问题:我在一年中的每个小时都有一个大的数据帧df
,其中有Timestamp
(全年(、Export_Country
、Import_Country
和相应的Value
。例如,这里是示例数据帧df
:
df <- data.frame(Timestamp=c("2020-01-01 00:00:00.000","2020-01-01 00:00:00.000","2020-01-01 00:00:00.000","2020-01-01 00:00:00.000","2020-01-01 00:00:00.000","2020-01-01 00:00:00.000"),
Export_Country=c('AT','DE','CH','DE','CZ','DE'),
Import_Country=c('DE','AT','DE','CH','DE','CZ'),
Value=c(170.06,289.37,1133.47,0,68.29,0.32),
stringsAsFactors=FALSE)
我想写一个函数,可以计算两个国家内每个时间戳的净值。输出应该看起来像数据帧df1
:
df2<- data.frame(Timestamp=c("2F020-01-01 00:00:00.000","2020-01-01 00:00:00.000","2020-01-01 00:00:00.000"),
Export_Country=c('DE','CH','CZ'),
Import_Country=c('AT','DE','DE'),
Value=c(119.31,1133.47,67.97),
stringsAsFactors=FALSE)
我试着做一些类似的事情:
df3<- df %<>%
group_by(Timestamp,Export_Country,Import_Country) %>%
summarise(Value=sum(Value))
注意,这是str(mydataframe)
的输出
'data.frame': 65520 obs. of 4 variables:
$ DateTime : chr "2020-01-02 12:00:00.000" "2020-01-02 12:00:00.000" "2020-01-02 12:00:00.000" "2020-01-02 12:00:00.000" ...
$ Export_Country: Factor w/ 70 levels "AL","AT","BA",..: 15 13 15 10 13 2 53 13 46 10 ...
$ Import_Country: Factor w/ 70 levels "AL","AT","BA",..: 10 46 13 15 2 13 10 15 13 53 ...
$ FlowValue : num 417 251 898 0 1089 ...
有人能帮我吗?非常感谢。
使用tidyverse
,我们可以将数据转换为更长的格式和
# gets the next other country's index based on the current country index
funcp <- function(x) x + 1 - 2 * (x%%2 == 0)
df %>%
# pivoting to longer format in order to facilitate data manipulation
pivot_longer(cols=ends_with("Country"), values_to = "country") %>%
# remove _Country from (Import|Export)_Country and getting the real value of the transaction Imports = - Value
mutate(name=sub("_.+","", name), Value=Value*(1-2*(name=="Import"))) %>%
# adding a with column that contains the counterpart
tibble(with=.$country[funcp(1:nrow(.))]) %>%
# finally grouping by the Timestamp, the country and the counterpart to get the actual Net value
group_by(Timestamp, country, with) %>% summarise(Value=sum(Value)) -> df2
df2
#> # A tibble: 6 x 4
#> # Groups: Timestamp, country [4]
#> Timestamp country with Value
#> <chr> <chr> <chr> <dbl>
#> 1 2020-01-01 00:00:00.000 AT DE -119.
#> 2 2020-01-01 00:00:00.000 CH DE 1133.
#> 3 2020-01-01 00:00:00.000 CZ DE 68.0
#> 4 2020-01-01 00:00:00.000 DE AT 119.
#> 5 2020-01-01 00:00:00.000 DE CH -1133.
#> 6 2020-01-01 00:00:00.000 DE CZ -68.0
如果你只想得到正网络,那么你可以过滤结果:
df2 %>% filter(Value >=0)
#> # A tibble: 3 x 4
#> # Groups: Timestamp, country [3]
#> Timestamp country with Value
#> <chr> <chr> <chr> <dbl>
#> 1 2020-01-01 00:00:00.000 CH DE 1133.
#> 2 2020-01-01 00:00:00.000 CZ DE 68.0
#> 3 2020-01-01 00:00:00.000 DE AT 119.
注:捷克到德国的数值在打印过程中四舍五入,但在tibble 中等于67.97
我知道下面的函数很复杂,可能有更简单的解决方案,但它似乎有效。
fun <- function(X){
f <- function(x){
x[[3]]*(2*(x[[1]] < x[[2]]) - 1)
}
icontr <- grep("Country", names(X), value = TRUE)
X[["Value"]] <- f(X[c(icontr, "Value")])
X[icontr] <- t(apply(X[icontr], 1, sort))
fmla <- paste(c("Timestamp", icontr), collapse = "+")
fmla <- paste("Value", fmla, sep = "~")
fmla <- as.formula(fmla)
out <- aggregate(fmla, X, sum)
i <- out[["Value"]] < 0
tmp <- out[["Export_Country"]][i]
out[["Export_Country"]][i] <- out[["Import_Country"]][i]
out[["Import_Country"]][i] <- tmp
out[["Value"]][i] <- -out[["Value"]][i]
out
}
fun(df)
# Timestamp Export_Country Import_Country Value
#1 2020-01-01 00:00:00.000 DE AT 119.31
#2 2020-01-01 00:00:00.000 CH DE 1133.47
#3 2020-01-01 00:00:00.000 CZ DE 67.97
all.equal(fun(df), df2)
#[1] TRUE
也许您想将Import_Country
和Export_Country
组合成一个字符串?然后你可以group_by
这个,并在两个Value
之间取差。这假设每个Timestamp
只有两个国家/地区可以合并。它还一致地从导出中减去导入。
library(tidyverse)
df %>%
mutate(CountryDyad = paste(pmin(Export_Country, Import_Country),
pmax(Export_Country, Import_Country),
sep = "-")) %>%
group_by(Timestamp, CountryDyad) %>%
summarise(Value = Value[which(startsWith(CountryDyad, Import_Country))] -
Value[which(startsWith(CountryDyad, Export_Country))])
输出
Timestamp CountryDyad Value
<chr> <chr> <dbl>
1 2020-01-01 00:00:00.000 AT-DE 119.
2 2020-01-01 00:00:00.000 CH-DE -1133.
3 2020-01-01 00:00:00.000 CZ-DE -68.0