根据r中的两列计算每个时间小时的净值

我有关于基于时间戳的计算的问题：我在一年中的每个小时都有一个大的数据帧df，其中有Timestamp(全年(、Export_Country、Import_Country和相应的Value。例如，这里是示例数据帧df:

df <- data.frame(Timestamp=c("2020-01-01 00:00:00.000","2020-01-01 00:00:00.000","2020-01-01 00:00:00.000","2020-01-01 00:00:00.000","2020-01-01 00:00:00.000","2020-01-01 00:00:00.000"),
Export_Country=c('AT','DE','CH','DE','CZ','DE'), 
Import_Country=c('DE','AT','DE','CH','DE','CZ'), 
Value=c(170.06,289.37,1133.47,0,68.29,0.32),
stringsAsFactors=FALSE)

我想写一个函数，可以计算两个国家内每个时间戳的净值。输出应该看起来像数据帧df1:

df2<- data.frame(Timestamp=c("2F020-01-01 00:00:00.000","2020-01-01 00:00:00.000","2020-01-01 00:00:00.000"),
Export_Country=c('DE','CH','CZ'), 
Import_Country=c('AT','DE','DE'), 
Value=c(119.31,1133.47,67.97),
stringsAsFactors=FALSE)

我试着做一些类似的事情：

df3<- df %<>% 
group_by(Timestamp,Export_Country,Import_Country) %>% 
summarise(Value=sum(Value))

注意，这是str(mydataframe)的输出

'data.frame':   65520 obs. of  4 variables:
$ DateTime      : chr  "2020-01-02 12:00:00.000" "2020-01-02 12:00:00.000" "2020-01-02 12:00:00.000" "2020-01-02 12:00:00.000" ...
$ Export_Country: Factor w/ 70 levels "AL","AT","BA",..: 15 13 15 10 13 2 53 13 46 10 ...
$ Import_Country: Factor w/ 70 levels "AL","AT","BA",..: 10 46 13 15 2 13 10 15 13 53 ...
$ FlowValue     : num  417 251 898 0 1089 ...

有人能帮我吗？非常感谢。

使用tidyverse，我们可以将数据转换为更长的格式和

# gets the next other country's index based on the current country index
funcp <- function(x) x + 1 - 2 * (x%%2 == 0)
df %>% 
# pivoting to longer format in order to facilitate data manipulation
pivot_longer(cols=ends_with("Country"), values_to = "country") %>% 
# remove _Country from (Import|Export)_Country and getting the real value of the transaction Imports = - Value
mutate(name=sub("_.+","", name), Value=Value*(1-2*(name=="Import")))  %>% 
# adding a with column that contains the counterpart
tibble(with=.$country[funcp(1:nrow(.))]) %>% 
# finally grouping by the Timestamp, the country and the counterpart to get the actual Net value
group_by(Timestamp, country, with) %>% summarise(Value=sum(Value)) -> df2
df2
#> # A tibble: 6 x 4
#> # Groups:   Timestamp, country [4]
#>   Timestamp               country with    Value
#>   <chr>                   <chr>   <chr>   <dbl>
#> 1 2020-01-01 00:00:00.000 AT      DE     -119. 
#> 2 2020-01-01 00:00:00.000 CH      DE     1133. 
#> 3 2020-01-01 00:00:00.000 CZ      DE       68.0
#> 4 2020-01-01 00:00:00.000 DE      AT      119. 
#> 5 2020-01-01 00:00:00.000 DE      CH    -1133. 
#> 6 2020-01-01 00:00:00.000 DE      CZ      -68.0

如果你只想得到正网络，那么你可以过滤结果：

df2 %>% filter(Value >=0)
#> # A tibble: 3 x 4
#> # Groups:   Timestamp, country [3]
#>   Timestamp               country with   Value
#>   <chr>                   <chr>   <chr>  <dbl>
#> 1 2020-01-01 00:00:00.000 CH      DE    1133. 
#> 2 2020-01-01 00:00:00.000 CZ      DE      68.0
#> 3 2020-01-01 00:00:00.000 DE      AT     119.

注：捷克到德国的数值在打印过程中四舍五入，但在tibble 中等于67.97

我知道下面的函数很复杂，可能有更简单的解决方案，但它似乎有效。

fun <- function(X){
f <- function(x){
x[[3]]*(2*(x[[1]] < x[[2]]) - 1)
}
icontr <- grep("Country", names(X), value = TRUE)
X[["Value"]] <- f(X[c(icontr, "Value")])
X[icontr] <- t(apply(X[icontr], 1, sort))
fmla <- paste(c("Timestamp", icontr), collapse = "+")
fmla <- paste("Value", fmla, sep = "~")
fmla <- as.formula(fmla)
out <- aggregate(fmla, X, sum)
i <- out[["Value"]] < 0
tmp <- out[["Export_Country"]][i]
out[["Export_Country"]][i] <- out[["Import_Country"]][i]
out[["Import_Country"]][i] <- tmp
out[["Value"]][i] <- -out[["Value"]][i]
out
}

fun(df)
#                Timestamp Export_Country Import_Country   Value
#1 2020-01-01 00:00:00.000             DE             AT  119.31
#2 2020-01-01 00:00:00.000             CH             DE 1133.47
#3 2020-01-01 00:00:00.000             CZ             DE   67.97
all.equal(fun(df), df2)
#[1] TRUE

也许您想将Import_Country和Export_Country组合成一个字符串？然后你可以group_by这个，并在两个Value之间取差。这假设每个Timestamp只有两个国家/地区可以合并。它还一致地从导出中减去导入。

library(tidyverse)
df %>%
mutate(CountryDyad = paste(pmin(Export_Country, Import_Country),
pmax(Export_Country, Import_Country), 
sep = "-")) %>%
group_by(Timestamp, CountryDyad) %>%
summarise(Value = Value[which(startsWith(CountryDyad, Import_Country))] -
Value[which(startsWith(CountryDyad, Export_Country))])

输出

Timestamp               CountryDyad   Value
<chr>                   <chr>         <dbl>
1 2020-01-01 00:00:00.000 AT-DE         119. 
2 2020-01-01 00:00:00.000 CH-DE       -1133. 
3 2020-01-01 00:00:00.000 CZ-DE         -68.0

相关内容

最新更新

热门标签：