也许这是一个非常简单的问题,但我是R的新手,找不到答案。希望你会知道:)
我正在探索"世界上的自由"的数据库,其中包含2013年至2021年的数据。
在过滤时,我发现有些年份有209 个国家,有些年份有 210 个国家。 但是我无法弄清楚我可以使用哪个代码/函数来找出正在添加/删除的代码/功能(也许每次都是同一个国家,也许不是,我不知道)。
数据帧包含年份的变量(Edition
)和国家名称(Country/Territory
)。在这里,您可以一瞥数据框:
> glimpse(head (Freedom_df))
Rows: 6
Columns: 44
$ `Country/Territory` <chr> "Abkhazia", "Afghanistan", "Albania", "Algeria", "Andorra", "Angola"
$ Region <chr> "Eurasia", "Asia", "Europe", "MENA", "Europe", "SSA"
$ `C/T` <chr> "t", "c", "c", "c", "c", "c"
$ Edition <dbl> 2021, 2021, 2021, 2021, 2021, 2021
$ Status <chr> "PF", "NF", "PF", "NF", "F", "NF"
$ `PR rating` <dbl> 5, 5, 3, 6, 1, 6
$ `CL rating` <dbl> 5, 6, 3, 5, 1, 5
$ A1 <dbl> 2, 1, 3, 1, 4, 0
$ A2 <dbl> 2, 1, 3, 1, 4, 2
$ A3 <dbl> 1, 1, 2, 1, 4, 1
$ A <dbl> 5, 3, 8, 3, 12, 3
$ B1 <dbl> 2, 2, 3, 1, 4, 1
$ B2 <dbl> 3, 2, 3, 1, 4, 1
$ B3 <dbl> 2, 1, 3, 1, 4, 1
$ B4 <dbl> 1, 2, 3, 1, 3, 2
$ B <dbl> 8, 7, 12, 4, 15, 5
$ C1 <dbl> 1, 1, 3, 1, 3, 1
$ C2 <dbl> 1, 1, 2, 1, 4, 1
$ C3 <dbl> 2, 1, 2, 1, 4, 0
$ C <dbl> 4, 3, 7, 3, 11, 2
$ `Add Q` <dbl> 0, 0, 0, 0, 0, 0
$ `Add A` <chr> "N/A", "N/A", "N/A", "N/A", "N/A", "N/A"
$ PR <dbl> 17, 13, 27, 10, 38, 10
$ D1 <dbl> 2, 2, 2, 1, 3, 1
$ D2 <dbl> 2, 1, 4, 1, 3, 2
$ D3 <dbl> 1, 1, 3, 2, 4, 2
$ D4 <dbl> 3, 2, 4, 2, 4, 2
$ D <dbl> 8, 6, 13, 6, 14, 7
$ E1 <dbl> 3, 2, 3, 1, 4, 2
$ E2 <dbl> 2, 1, 3, 1, 4, 2
$ E3 <dbl> 1, 1, 2, 1, 3, 2
$ E <dbl> 6, 4, 8, 3, 11, 6
$ F1 <dbl> 1, 1, 2, 1, 4, 1
$ F2 <dbl> 1, 0, 2, 1, 4, 1
$ F3 <dbl> 1, 0, 2, 2, 4, 1
$ F4 <dbl> 1, 1, 3, 2, 3, 2
$ F <dbl> 4, 2, 9, 6, 15, 5
$ G1 <dbl> 1, 0, 3, 2, 4, 1
$ G2 <dbl> 1, 1, 2, 2, 4, 1
$ G3 <dbl> 2, 0, 2, 2, 3, 1
$ G4 <dbl> 1, 1, 2, 1, 4, 0
$ G <dbl> 5, 2, 9, 7, 15, 3
$ CL <dbl> 23, 14, 39, 22, 55, 21
$ Total <dbl> 40, 27, 66, 32, 93, 31
在这里,您可以看到我提到的根据年份拥有 209 或 210 个国家/地区的内容:
> count(Freedom_df, Edition)
# A tibble: 9 x 2
Edition n
* <dbl> <int>
1 2013 209
2 2014 209
3 2015 210
4 2016 210
5 2017 209
6 2018 209
7 2019 209
8 2020 210
9 2021 210
下面是两个具有预期输出的可重现示例。
示例 1:在这种情况下,我假设有 209 个国家/地区始终保持不变,只有一个国家/地区被添加和删除。
# A tibble: 9 x 4
Edition n Added_country Removed_country
* <dbl> <int> <chr> <chr>
1 2013 209 NA NA
2 2014 209 NA NA
3 2015 210 "country_name" NA
4 2016 210 NA NA
5 2017 209 NA "country_name"
6 2018 209 NA NA
7 2019 209 NA NA
8 2020 210 "country_name" NA
9 2021 210 NA NA
示例 2:在这种情况下,我认为有 207 个国家/地区多年来(2013:2021 年)保持不变,另外 3 个国家/地区在保持相同计数的同时添加/删除。
# A tibble: 9 x 4
Edition n Different_country
* <dbl> <int> <chr>
1 2013 209 "country_A"
2 2013 209 "country_B"
3 2014 209 "country_A"
4 2014 209 "country_C"
5 2015 210 "country_A"
6 2015 210 "country_B"
7 2015 210 "country_C"
8 2016 210 "country_A"
9 2016 210 "country_B"
10 2016 210 "country_C"
11 2017 209 "country_B"
12 2017 209 "country_C"
13 2018 209 "country_B"
14 2018 209 "country_C"
15 2019 209 "country_B"
16 2019 209 "country_C"
17 2020 210 "country_A"
18 2020 210 "country_B"
19 2020 210 "country_C"
20 2021 210 "country_A"
21 2021 210 "country_B"
22 2021 210 "country_C"
我认为这足以解决它,如果您需要任何其他详细信息,请告诉我。 谢谢:)
编辑:由于原始数据已被@awaji98在他/她的答案中追踪,可以看出以下策略有效
freedom <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQwUAEPTrb4AmNNYSdqCupsrXJcDOODfxTEVSZyK-yIAA2ozTGJmWLOnJHa3v-czcBitkfCx2AU_iqj/pub?gid=0&single=true&output=csv")
freedom %>% group_by(Edition) %>%
summarise(countries = list(Country)) %>%
mutate(removed = map2(lag(countries), countries, setdiff),
added = map2(countries, lag(countries, default = list(countries[[1]])), setdiff)) %>%
group_by(Edition) %>%
mutate(added = toString(unlist(added)),
removed = toString(unlist(removed)),
countries = length(unlist(countries))) %>%
ungroup()
# A tibble: 9 x 4
Edition countries removed added
<dbl> <int> <chr> <chr>
1 2013 209 "" ""
2 2014 209 "" ""
3 2015 210 "" "Crimea"
4 2016 210 "" ""
5 2017 209 "Puerto Rico" ""
6 2018 209 "" ""
7 2019 209 "" ""
8 2020 210 "" "Eastern Donbas"
9 2021 210 "" ""
旧答案
让我们首先构建一个数据说自由,因为你没有添加任何
library(tidyverse)
set.seed(2021)
freedom <- data.frame(year = rep(2010:2014, each = 5),
country = c(sample(LETTERS[1:8], 5),
sample(LETTERS[1:8], 5),
sample(LETTERS[1:8], 5),
sample(LETTERS[1:8], 5),
sample(LETTERS[1:8], 5)),
val = round(100 * runif(25)))
freedom %>% pivot_wider(id_cols = country, names_from = year, names_sort = T, values_from = val) %>%
arrange(country)
#Let's have a look on this data
# A tibble: 8 x 6
country `2010` `2011` `2012` `2013` `2014`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 A 40 NA 76 52 63
2 B 82 NA 73 NA 20
3 C NA 90 NA NA 43
4 D NA 52 NA NA 54
5 E NA 93 29 36 NA
6 F 21 NA NA 23 NA
7 G 49 23 31 1 NA
8 H 68 62 70 88 17
现在,下面的语法将给出按年添加和删除的国家/地区的列表。 所有国家/地区都将被视为第一年添加的内容
freedom %>% group_by(year) %>%
summarise(countries = list(country)) %>%
mutate(removed = map2(lag(countries), countries, setdiff),
added = map2(countries, lag(countries), setdiff)) %>%
select(-countries) %>%
unnest(c(added, removed))
# A tibble: 14 x 3
year removed added
<int> <chr> <chr>
1 2010 NA B
2 2010 NA G
3 2010 NA A
4 2010 NA F
5 2010 NA H
6 2011 B E
7 2011 A D
8 2011 F C
9 2012 D B
10 2012 C A
11 2013 B F
12 2014 G C
13 2014 F D
14 2014 E B
或者,如果不想在第一年看到新增内容,请执行此操作
freedom %>% group_by(year) %>%
summarise(countries = list(country)) %>%
#group_by(year) %>%
mutate(removed = map2(lag(countries), countries, setdiff),
added = map2(countries, lag(countries), setdiff)) %>%
select(-countries) %>%
filter(as.numeric(row_number()) != 1) %>%
unnest(c(added, removed))
# A tibble: 9 x 3
year removed added
<int> <chr> <chr>
1 2011 B E
2 2011 A D
3 2011 F C
4 2012 D B
5 2012 C A
6 2013 B F
7 2014 G C
8 2014 F D
9 2014 E B
或者,如果您只想查看添加和删除了多少,请执行此操作
freedom %>% group_by(year) %>%
summarise(countries = list(country)) %>%
#group_by(year) %>%
mutate(removed = unlist(map2(lag(countries), countries, function(x, y) length(setdiff(x, y)))),
added = unlist(map2(countries, lag(countries), function(x, y) length(setdiff(x, y))))) %>%
select(-countries)
# A tibble: 5 x 3
year removed added
<int> <int> <int>
1 2010 0 5
2 2011 3 3
3 2012 2 2
4 2013 1 1
5 2014 3 3
进一步更新/编辑正如您所说,每年的国家/地区数量不同,您也可以采用以下策略
#let's edit the df to have unequal country count each year
freedom <- freedom[-c(9,21),]
#now
freedom %>% group_by(year) %>%
summarise(countries = list(country)) %>%
mutate(removed = map2(lag(countries), countries, setdiff),
added = map2(countries, lag(countries), setdiff)) %>%
group_by(year) %>%
mutate(added = toString(unlist(added)),
removed = toString(unlist(removed)),
countries = length(unlist(countries))) %>%
ungroup()
# A tibble: 5 x 4
year countries removed added
<int> <int> <chr> <chr>
1 2010 5 "" B, G, A, F, H
2 2011 4 "B, A, F" E, D
3 2012 5 "D" B, A
4 2013 5 "B" F
5 2014 4 "G, F, E" D, B
我认为现在您可以安全地查看每年增加/删除的数据。
使用真实数据进行最终编辑
真实数据来源于此:
Freedom_df <- read_csv("https://docs.google.com/spreadsheets/d/e/2PACX-1vQwUAEPTrb4AmNNYSdqCupsrXJcDOODfxTEVSZyK-yIAA2ozTGJmWLOnJHa3v-czcBitkfCx2AU_iqj/pub?gid=0&single=true&output=csv")
应用第二次编辑中的代码:
Freedom_df %>% mutate(Country = factor(Country)) %>%
count(Edition, Country, .drop = FALSE) %>% filter(n == 0)
结果:
Edition Country n
<dbl> <fct> <int>
1 2013 Crimea 0
2 2013 Eastern Donbas 0
3 2014 Crimea 0
4 2014 Eastern Donbas 0
5 2015 Eastern Donbas 0
6 2016 Eastern Donbas 0
7 2017 Eastern Donbas 0
8 2017 Puerto Rico 0
9 2018 Eastern Donbas 0
10 2018 Puerto Rico 0
11 2019 Eastern Donbas 0
12 2019 Puerto Rico 0
13 2020 Puerto Rico 0
14 2021 Puerto Rico 0
只是为了澄清这三个国家是哪一年,我们可以创建一个它们的向量并将其称为缺失:
missing <- Freedom_df %>% mutate(Country = factor(Country)) %>%
count(Edition, Country, .drop = FALSE) %>% filter(n == 0 )%>%
pull %>% as.character
然后查看日期中缺少的国家/地区出现在哪些年份
Freedom_df %>% filter(Country %in% missing) %>%
select(Country,Edition) %>% arrange(Country,Edition)
结果:
# A tibble: 13 x 2
Country Edition
<chr> <dbl>
1 Crimea 2015
2 Crimea 2016
3 Crimea 2017
4 Crimea 2018
5 Crimea 2019
6 Crimea 2020
7 Crimea 2021
8 Eastern Donbas 2020
9 Eastern Donbas 2021
10 Puerto Rico 2013
11 Puerto Rico 2014
12 Puerto Rico 2015
13 Puerto Rico 2016
基于 OP 示例的第二次编辑
我不确定您的其他列名称是什么,但假设您分别拥有年份和国家/地区的版本和国家/地区, 您可以尝试使用将版本设置为因子变量的计数。
Freedom_df %>%
# change country column to factor
mutate(Country = factor(`Country/Territory`)) %>%
# count with .drop = FALSE to add 0's for any missing countries
count(Edition, Country, .drop = FALSE) %>%
# filter to find them
filter(n == 0)
例:
# Example dataframe
df <- structure(list(Edition = c(2021, 2021, 2021, 2021, 2021, 2021,
2020, 2020, 2020, 2020, 2020, 2019, 2019, 2019, 2019, 2019, 2018,
2018, 2018, 2018, 2018, 2018), Country = c("Barbados", "Belarus",
"Belgium", "Belize", "Benin", "Bhutan", "Belarus", "Barbados",
"Belize", "Bhutan", "Benin", "Belgium", "Barbados", "Belize",
"Belarus", "Bhutan", "Barbados", "Belarus", "Belgium", "Belize",
"Benin", "Bhutan")), row.names = c(NA, -22L), class = c("tbl_df",
"tbl", "data.frame"))
# the code
df %>% mutate(Country = factor(Country)) %>%
count(Edition, Country, .drop = FALSE) %>%
filter(n == 0)
结果:
Edition Country n
<dbl> <fct> <int>
1 2019 Benin 0
2 2020 Belgium 0
首次编辑如果所有数据都在一个数据框中,则可以先按年份拆分数据帧,然后按上述方式对列表对象使用anti_join。
## split the dataframe by year
df <- df %>% group_split(year) %>% setNames(unique(df$year))
## do the anti_join with smaller subset last
anti_join(df$`2021`,df$`2020`, by = "country")
以下是实践中的一个例子:
## An example df (reproduced by dput())
df <- structure(list(year = c(2020, 2020, 2020, 2020, 2020, 2020, 2020,
2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020,
2020, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021,
2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021, 2021),
country = c("Abkhazia*", "Afghanistan", "Albania", "Algeria",
"Andorra", "Angola", "Antigua and Barbuda", "Argentina",
"Armenia", "Australia", "Austria", "Azerbaijan", "Bangladesh",
"Barbados", "Belarus", "Belgium", "Belize", "Benin", "Bhutan",
"Abkhazia*", "Afghanistan", "Albania", "Algeria", "Andorra",
"Angola", "Antigua and Barbuda", "Argentina", "Armenia",
"Australia", "Austria", "Azerbaijan", "Bahrain", "Bangladesh",
"Barbados", "Belarus", "Belgium", "Belize", "Benin", "Bhutan"
), score = c("40", "27", "66", "32", "93", "31", "85", "84",
"55", "97", "93", "10", "39", "95", "11", "96", "87", "65",
"61", "40", "27", "66", "32", "93", "31", "85", "84", "55",
"97", "93", "10", "12", "39", "95", "11", "96", "87", "65",
"61")), row.names = c(NA, -39L), class = c("tbl_df", "tbl",
"data.frame"))
这是 df:
year country score
<dbl> <chr> <chr>
1 2020 Abkhazia* 40
2 2020 Afghanistan 27
3 2020 Albania 66
4 2020 Algeria 32
5 2020 Andorra 93
6 2020 Angola 31
7 2020 Antigua and Barbuda 85
8 2020 Argentina 84
9 2020 Armenia 55
10 2020 Australia 97
# … with 29 more rows
## split the dataframe by year
df <- df %>% group_split(year) %>% setNames(unique(df$year))
结果:
$`2020`
# A tibble: 19 x 3
year country score
<dbl> <chr> <chr>
1 2020 Abkhazia* 40
2 2020 Afghanistan 27
3 2020 Albania 66
4 2020 Algeria 32
5 2020 Andorra 93
6 2020 Angola 31
7 2020 Antigua and Barbuda 85
8 2020 Argentina 84
9 2020 Armenia 55
10 2020 Australia 97
11 2020 Austria 93
12 2020 Azerbaijan 10
13 2020 Bangladesh 39
14 2020 Barbados 95
15 2020 Belarus 11
16 2020 Belgium 96
17 2020 Belize 87
18 2020 Benin 65
19 2020 Bhutan 61
$`2021`
# A tibble: 20 x 3
year country score
<dbl> <chr> <chr>
1 2021 Abkhazia* 40
2 2021 Afghanistan 27
3 2021 Albania 66
4 2021 Algeria 32
5 2021 Andorra 93
6 2021 Angola 31
7 2021 Antigua and Barbuda 85
8 2021 Argentina 84
9 2021 Armenia 55
10 2021 Australia 97
11 2021 Austria 93
12 2021 Azerbaijan 10
13 2021 Bahrain 12
14 2021 Bangladesh 39
15 2021 Barbados 95
16 2021 Belarus 11
17 2021 Belgium 96
18 2021 Belize 87
19 2021 Benin 65
20 2021 Bhutan 61
然后做anti_join
## do the anti_join with smaller subset last
anti_join(df$`2021`,df$`2020`, by = "country")
结果:
# A tibble: 1 x 3
year country score
<dbl> <chr> <chr>
1 2021 Bahrain 12