r语言 - 如何找到在数据库中添加/删除特定年份的国家/地区?



也许这是一个非常简单的问题,但我是R的新手,找不到答案。希望你会知道:)

我正在探索"世界上的自由"的数据库,其中包含2013年至2021年的数据。

在过滤时,我发现有些年份有209 个国家,有些年份有 210 个国家。 但是我无法弄清楚我可以使用哪个代码/函数来找出正在添加/删除的代码/功能(也许每次都是同一个国家,也许不是,我不知道)。

数据帧包含年份的变量(Edition)和国家名称(Country/Territory)。在这里,您可以一瞥数据框:

> glimpse(head (Freedom_df))
Rows: 6
Columns: 44
$ `Country/Territory` <chr> "Abkhazia", "Afghanistan", "Albania", "Algeria", "Andorra", "Angola"
$ Region              <chr> "Eurasia", "Asia", "Europe", "MENA", "Europe", "SSA"
$ `C/T`               <chr> "t", "c", "c", "c", "c", "c"
$ Edition             <dbl> 2021, 2021, 2021, 2021, 2021, 2021
$ Status              <chr> "PF", "NF", "PF", "NF", "F", "NF"
$ `PR rating`         <dbl> 5, 5, 3, 6, 1, 6
$ `CL rating`         <dbl> 5, 6, 3, 5, 1, 5
$ A1                  <dbl> 2, 1, 3, 1, 4, 0
$ A2                  <dbl> 2, 1, 3, 1, 4, 2
$ A3                  <dbl> 1, 1, 2, 1, 4, 1
$ A                   <dbl> 5, 3, 8, 3, 12, 3
$ B1                  <dbl> 2, 2, 3, 1, 4, 1
$ B2                  <dbl> 3, 2, 3, 1, 4, 1
$ B3                  <dbl> 2, 1, 3, 1, 4, 1
$ B4                  <dbl> 1, 2, 3, 1, 3, 2
$ B                   <dbl> 8, 7, 12, 4, 15, 5
$ C1                  <dbl> 1, 1, 3, 1, 3, 1
$ C2                  <dbl> 1, 1, 2, 1, 4, 1
$ C3                  <dbl> 2, 1, 2, 1, 4, 0
$ C                   <dbl> 4, 3, 7, 3, 11, 2
$ `Add Q`             <dbl> 0, 0, 0, 0, 0, 0
$ `Add A`             <chr> "N/A", "N/A", "N/A", "N/A", "N/A", "N/A"
$ PR                  <dbl> 17, 13, 27, 10, 38, 10
$ D1                  <dbl> 2, 2, 2, 1, 3, 1
$ D2                  <dbl> 2, 1, 4, 1, 3, 2
$ D3                  <dbl> 1, 1, 3, 2, 4, 2
$ D4                  <dbl> 3, 2, 4, 2, 4, 2
$ D                   <dbl> 8, 6, 13, 6, 14, 7
$ E1                  <dbl> 3, 2, 3, 1, 4, 2
$ E2                  <dbl> 2, 1, 3, 1, 4, 2
$ E3                  <dbl> 1, 1, 2, 1, 3, 2
$ E                   <dbl> 6, 4, 8, 3, 11, 6
$ F1                  <dbl> 1, 1, 2, 1, 4, 1
$ F2                  <dbl> 1, 0, 2, 1, 4, 1
$ F3                  <dbl> 1, 0, 2, 2, 4, 1
$ F4                  <dbl> 1, 1, 3, 2, 3, 2
$ F                   <dbl> 4, 2, 9, 6, 15, 5
$ G1                  <dbl> 1, 0, 3, 2, 4, 1
$ G2                  <dbl> 1, 1, 2, 2, 4, 1
$ G3                  <dbl> 2, 0, 2, 2, 3, 1
$ G4                  <dbl> 1, 1, 2, 1, 4, 0
$ G                   <dbl> 5, 2, 9, 7, 15, 3
$ CL                  <dbl> 23, 14, 39, 22, 55, 21
$ Total               <dbl> 40, 27, 66, 32, 93, 31

在这里,您可以看到我提到的根据年份拥有 209 或 210 个国家/地区的内容:

> count(Freedom_df, Edition) 
# A tibble: 9 x 2
Edition     n
*   <dbl> <int>
1    2013   209
2    2014   209
3    2015   210
4    2016   210
5    2017   209
6    2018   209
7    2019   209
8    2020   210
9    2021   210

下面是两个具有预期输出的可重现示例。

示例 1:在这种情况下,我假设有 209 个国家/地区始终保持不变,只有一个国家/地区被添加和删除。

# A tibble: 9 x 4
Edition     n    Added_country   Removed_country 
*   <dbl> <int>            <chr>             <chr>
1    2013   209               NA                NA
2    2014   209               NA                NA
3    2015   210   "country_name"                NA
4    2016   210               NA                NA
5    2017   209               NA    "country_name"
6    2018   209               NA                NA
7    2019   209               NA                NA
8    2020   210   "country_name"                NA
9    2021   210               NA                NA

示例 2:在这种情况下,我认为有 207 个国家/地区多年来(2013:2021 年)保持不变,另外 3 个国家/地区在保持相同计数的同时添加/删除。

# A tibble: 9 x 4
Edition     n    Different_country
*  <dbl> <int>                <chr>
1   2013   209          "country_A"
2   2013   209          "country_B"
3   2014   209          "country_A"
4   2014   209          "country_C"
5   2015   210          "country_A"
6   2015   210          "country_B"
7   2015   210          "country_C"
8   2016   210          "country_A"
9   2016   210          "country_B"
10   2016   210          "country_C"
11   2017   209          "country_B"
12   2017   209          "country_C"
13   2018   209          "country_B"
14   2018   209          "country_C"
15   2019   209          "country_B"
16   2019   209          "country_C"
17   2020   210          "country_A"
18   2020   210          "country_B"
19   2020   210          "country_C"
20   2021   210          "country_A"
21   2021   210          "country_B"
22   2021   210          "country_C"

我认为这足以解决它,如果您需要任何其他详细信息,请告诉我。 谢谢:)

编辑:由于原始数据已被@awaji98在他/她的答案中追踪,可以看出以下策略有效

freedom <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQwUAEPTrb4AmNNYSdqCupsrXJcDOODfxTEVSZyK-yIAA2ozTGJmWLOnJHa3v-czcBitkfCx2AU_iqj/pub?gid=0&single=true&output=csv")

freedom %>% group_by(Edition) %>%
summarise(countries = list(Country)) %>%
mutate(removed = map2(lag(countries), countries, setdiff),
added = map2(countries, lag(countries, default = list(countries[[1]])), setdiff)) %>%
group_by(Edition) %>%
mutate(added = toString(unlist(added)),
removed = toString(unlist(removed)),
countries = length(unlist(countries))) %>%
ungroup()
# A tibble: 9 x 4
Edition countries removed       added           
<dbl>     <int> <chr>         <chr>           
1    2013       209 ""            ""              
2    2014       209 ""            ""              
3    2015       210 ""            "Crimea"        
4    2016       210 ""            ""              
5    2017       209 "Puerto Rico" ""              
6    2018       209 ""            ""              
7    2019       209 ""            ""              
8    2020       210 ""            "Eastern Donbas"
9    2021       210 ""            ""

旧答案

让我们首先构建一个数据说自由,因为你没有添加任何

library(tidyverse)
set.seed(2021)
freedom <- data.frame(year = rep(2010:2014, each = 5),
country = c(sample(LETTERS[1:8], 5), 
sample(LETTERS[1:8], 5), 
sample(LETTERS[1:8], 5), 
sample(LETTERS[1:8], 5), 
sample(LETTERS[1:8], 5)),
val = round(100 * runif(25)))
freedom %>% pivot_wider(id_cols = country, names_from = year, names_sort = T, values_from = val) %>%
arrange(country)
#Let's have a look on this data
# A tibble: 8 x 6
country `2010` `2011` `2012` `2013` `2014`
<chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
1 A           40     NA     76     52     63
2 B           82     NA     73     NA     20
3 C           NA     90     NA     NA     43
4 D           NA     52     NA     NA     54
5 E           NA     93     29     36     NA
6 F           21     NA     NA     23     NA
7 G           49     23     31      1     NA
8 H           68     62     70     88     17

现在,下面的语法将给出按年添加和删除的国家/地区的列表。 所有国家/地区都将被视为第一年添加的内容

freedom %>% group_by(year) %>%
summarise(countries = list(country)) %>%
mutate(removed = map2(lag(countries), countries, setdiff),
added = map2(countries, lag(countries), setdiff)) %>%
select(-countries) %>%
unnest(c(added, removed))
# A tibble: 14 x 3
year removed added
<int> <chr>   <chr>
1  2010 NA      B    
2  2010 NA      G    
3  2010 NA      A    
4  2010 NA      F    
5  2010 NA      H    
6  2011 B       E    
7  2011 A       D    
8  2011 F       C    
9  2012 D       B    
10  2012 C       A    
11  2013 B       F    
12  2014 G       C    
13  2014 F       D    
14  2014 E       B

或者,如果不想在第一年看到新增内容,请执行此操作

freedom %>% group_by(year) %>%
summarise(countries = list(country)) %>%
#group_by(year) %>%
mutate(removed = map2(lag(countries), countries, setdiff),
added = map2(countries, lag(countries), setdiff)) %>%
select(-countries) %>%
filter(as.numeric(row_number()) != 1) %>%
unnest(c(added, removed))
# A tibble: 9 x 3
year removed added
<int> <chr>   <chr>
1  2011 B       E    
2  2011 A       D    
3  2011 F       C    
4  2012 D       B    
5  2012 C       A    
6  2013 B       F    
7  2014 G       C    
8  2014 F       D    
9  2014 E       B

或者,如果您只想查看添加和删除了多少,请执行此操作

freedom %>% group_by(year) %>%
summarise(countries = list(country)) %>%
#group_by(year) %>%
mutate(removed = unlist(map2(lag(countries), countries, function(x, y) length(setdiff(x, y)))),
added = unlist(map2(countries, lag(countries), function(x, y) length(setdiff(x, y))))) %>%
select(-countries)
# A tibble: 5 x 3
year removed added
<int>   <int> <int>
1  2010       0     5
2  2011       3     3
3  2012       2     2
4  2013       1     1
5  2014       3     3

进一步更新/编辑正如您所说,每年的国家/地区数量不同,您也可以采用以下策略

#let's edit the df to have unequal country count each year
freedom <- freedom[-c(9,21),]
#now
freedom %>% group_by(year) %>%
summarise(countries = list(country)) %>%
mutate(removed = map2(lag(countries), countries, setdiff),
added = map2(countries, lag(countries), setdiff)) %>%
group_by(year) %>%
mutate(added = toString(unlist(added)),
removed = toString(unlist(removed)),
countries = length(unlist(countries))) %>%
ungroup()
# A tibble: 5 x 4
year countries removed   added        
<int>     <int> <chr>     <chr>        
1  2010         5 ""        B, G, A, F, H
2  2011         4 "B, A, F" E, D         
3  2012         5 "D"       B, A         
4  2013         5 "B"       F            
5  2014         4 "G, F, E" D, B

我认为现在您可以安全地查看每年增加/删除的数据。

使用真实数据进行最终编辑

真实数据来源于此:

Freedom_df <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQwUAEPTrb4AmNNYSdqCupsrXJcDOODfxTEVSZyK-yIAA2ozTGJmWLOnJHa3v-czcBitkfCx2AU_iqj/pub?gid=0&single=true&output=csv")

应用第二次编辑中的代码:

Freedom_df %>% mutate(Country = factor(Country)) %>% 
count(Edition, Country, .drop = FALSE) %>% filter(n == 0)

结果:

Edition Country            n
<dbl> <fct>          <int>
1    2013 Crimea             0
2    2013 Eastern Donbas     0
3    2014 Crimea             0
4    2014 Eastern Donbas     0
5    2015 Eastern Donbas     0
6    2016 Eastern Donbas     0
7    2017 Eastern Donbas     0
8    2017 Puerto Rico        0
9    2018 Eastern Donbas     0
10    2018 Puerto Rico        0
11    2019 Eastern Donbas     0
12    2019 Puerto Rico        0
13    2020 Puerto Rico        0
14    2021 Puerto Rico        0

只是为了澄清这三个国家是哪一年,我们可以创建一个它们的向量并将其称为缺失:

missing <- Freedom_df %>% mutate(Country = factor(Country)) %>% 
count(Edition, Country, .drop = FALSE) %>% filter(n == 0 )%>% 
pull %>% as.character

然后查看日期中缺少的国家/地区出现在哪些年份

Freedom_df %>% filter(Country %in% missing) %>% 
select(Country,Edition) %>% arrange(Country,Edition)

结果:

# A tibble: 13 x 2
Country        Edition
<chr>            <dbl>
1 Crimea            2015
2 Crimea            2016
3 Crimea            2017
4 Crimea            2018
5 Crimea            2019
6 Crimea            2020
7 Crimea            2021
8 Eastern Donbas    2020
9 Eastern Donbas    2021
10 Puerto Rico       2013
11 Puerto Rico       2014
12 Puerto Rico       2015
13 Puerto Rico       2016

基于 OP 示例的第二次编辑

我不确定您的其他列名称是什么,但假设您分别拥有年份和国家/地区的版本和国家/地区, 您可以尝试使用将版本设置为因子变量的计数。

Freedom_df %>% 
# change country column to factor
mutate(Country = factor(`Country/Territory`)) %>%
# count with .drop = FALSE to add 0's for any missing countries 
count(Edition, Country, .drop = FALSE) %>%
# filter to find them
filter(n == 0)

例:

# Example dataframe
df <- structure(list(Edition = c(2021, 2021, 2021, 2021, 2021, 2021, 
2020, 2020, 2020, 2020, 2020, 2019, 2019, 2019, 2019, 2019, 2018, 
2018, 2018, 2018, 2018, 2018), Country = c("Barbados", "Belarus", 
"Belgium", "Belize", "Benin", "Bhutan", "Belarus", "Barbados", 
"Belize", "Bhutan", "Benin", "Belgium", "Barbados", "Belize", 
"Belarus", "Bhutan", "Barbados", "Belarus", "Belgium", "Belize", 
"Benin", "Bhutan")), row.names = c(NA, -22L), class = c("tbl_df", 
"tbl", "data.frame"))
# the code
df %>% mutate(Country = factor(Country)) %>% 
count(Edition, Country, .drop = FALSE) %>% 
filter(n == 0) 

结果:

Edition Country     n
<dbl> <fct>   <int>
1    2019 Benin       0
2    2020 Belgium     0

首次编辑如果所有数据都在一个数据框中,则可以先按年份拆分数据帧,然后按上述方式对列表对象使用anti_join。

## split the dataframe by year
df <- df %>% group_split(year) %>% setNames(unique(df$year))
## do the anti_join with smaller subset last
anti_join(df$`2021`,df$`2020`, by = "country")

以下是实践中的一个例子:

## An example df (reproduced by dput())
df <- structure(list(year = c(2020, 2020, 2020, 2020, 2020, 2020, 2020, 
2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 
2020, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 
2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021), 
country = c("Abkhazia*", "Afghanistan", "Albania", "Algeria", 
"Andorra", "Angola", "Antigua and Barbuda", "Argentina", 
"Armenia", "Australia", "Austria", "Azerbaijan", "Bangladesh", 
"Barbados", "Belarus", "Belgium", "Belize", "Benin", "Bhutan", 
"Abkhazia*", "Afghanistan", "Albania", "Algeria", "Andorra", 
"Angola", "Antigua and Barbuda", "Argentina", "Armenia", 
"Australia", "Austria", "Azerbaijan", "Bahrain", "Bangladesh", 
"Barbados", "Belarus", "Belgium", "Belize", "Benin", "Bhutan"
), score = c("40", "27", "66", "32", "93", "31", "85", "84", 
"55", "97", "93", "10", "39", "95", "11", "96", "87", "65", 
"61", "40", "27", "66", "32", "93", "31", "85", "84", "55", 
"97", "93", "10", "12", "39", "95", "11", "96", "87", "65", 
"61")), row.names = c(NA, -39L), class = c("tbl_df", "tbl", 
"data.frame"))

这是 df:

year country             score
<dbl> <chr>               <chr>
1  2020 Abkhazia*           40   
2  2020 Afghanistan         27   
3  2020 Albania             66   
4  2020 Algeria             32   
5  2020 Andorra             93   
6  2020 Angola              31   
7  2020 Antigua and Barbuda 85   
8  2020 Argentina           84   
9  2020 Armenia             55   
10  2020 Australia           97   
# … with 29 more rows
## split the dataframe by year
df <- df %>% group_split(year) %>% setNames(unique(df$year))

结果:

$`2020`
# A tibble: 19 x 3
year country             score
<dbl> <chr>               <chr>
1  2020 Abkhazia*           40   
2  2020 Afghanistan         27   
3  2020 Albania             66   
4  2020 Algeria             32   
5  2020 Andorra             93   
6  2020 Angola              31   
7  2020 Antigua and Barbuda 85   
8  2020 Argentina           84   
9  2020 Armenia             55   
10  2020 Australia           97   
11  2020 Austria             93   
12  2020 Azerbaijan          10   
13  2020 Bangladesh          39   
14  2020 Barbados            95   
15  2020 Belarus             11   
16  2020 Belgium             96   
17  2020 Belize              87   
18  2020 Benin               65   
19  2020 Bhutan              61   
$`2021`
# A tibble: 20 x 3
year country             score
<dbl> <chr>               <chr>
1  2021 Abkhazia*           40   
2  2021 Afghanistan         27   
3  2021 Albania             66   
4  2021 Algeria             32   
5  2021 Andorra             93   
6  2021 Angola              31   
7  2021 Antigua and Barbuda 85   
8  2021 Argentina           84   
9  2021 Armenia             55   
10  2021 Australia           97   
11  2021 Austria             93   
12  2021 Azerbaijan          10   
13  2021 Bahrain             12   
14  2021 Bangladesh          39   
15  2021 Barbados            95   
16  2021 Belarus             11   
17  2021 Belgium             96   
18  2021 Belize              87   
19  2021 Benin               65   
20  2021 Bhutan              61  

然后做anti_join

## do the anti_join with smaller subset last
anti_join(df$`2021`,df$`2020`, by = "country")

结果:

# A tibble: 1 x 3
year country score
<dbl> <chr>   <chr>
1  2021 Bahrain 12  

最新更新