我的数据如下:
# A tibble: 6 x 4
name val time x1
<chr> <dbl> <date> <dbl>
1 C Farolillo 7 2016-04-20 51.5
2 C Farolillo 3 2016-04-21 56.3
3 C Farolillo 7 2016-04-22 56.3
4 C Farolillo 13 2016-04-23 57.9
5 C Farolillo 7 2016-04-24 58.7
6 C Farolillo 9 2016-04-25 59.0
我正在尝试使用pivot_wider
函数来扩展基于name
列的数据。我使用以下代码:
yy <- d %>%
pivot_wider(., names_from = name, values_from = val)
这给了我以下警告信息:
Warning message:
Values in `val` are not uniquely identified; output will contain list-cols.
* Use `values_fn = list(val = list)` to suppress this warning.
* Use `values_fn = list(val = length)` to identify where the duplicates arise
* Use `values_fn = list(val = summary_fun)` to summarise duplicates
输出看起来像:
time x1 out1 out2
2016-04-20 51.50000 <dbl> <dbl>
2 2016-04-21 56.34615 <dbl> <dbl>
3 2016-04-22 56.30000 <dbl> <dbl>
4 2016-04-23 57.85714 <dbl> <dbl>
5 2016-04-24 58.70968 <dbl> <dbl>
6 2016-04-25 58.96774 <dbl> <dbl>
我知道这里提到了这个问题,为了解决这个问题,他们建议使用汇总统计数据。然而,我有时间序列数据,因此不想使用汇总统计数据,因为每天都有一个值(而不是多个值(。
我知道问题是因为val
列有重复(即,在上面的例子中,7出现了3次
关于如何看待和克服这个问题,有什么建议吗?
数据:
d <- structure(list(name = c("C Farolillo", "C Farolillo", "C Farolillo",
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo",
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo",
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo",
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo",
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo",
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo",
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo",
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo",
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo",
"C Farolillo", "C Farolillo", "C Farolillo", "Plaza Eliptica",
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica",
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica",
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica",
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica",
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica",
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica",
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica",
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica",
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica",
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica",
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica",
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica",
"Plaza Eliptica", "Plaza Eliptica"), val = c(7, 3, 7, 13, 7,
9, 20, 19, 4, 5, 5, 2, 6, 6, 16, 13, 7, 6, 3, 3, 6, 10, 5, 3,
5, 3, 4, 4, 10, 11, 4, 13, 8, 2, 8, 10, 3, 10, 14, 4, 2, 4, 6,
6, 8, 8, 3, 3, 13, 10, 13, 32, 25, 31, 34, 26, 33, 35, 43, 22,
22, 21, 10, 33, 33, 48, 47, 27, 23, 11, 13, 25, 31, 20, 16, 10,
9, 23, 11, 23, 26, 16, 34, 17, 4, 24, 21, 10, 26, 32, 10, 5,
9, 19, 14, 27, 27, 10, 8, 28, 32, 25), time = structure(c(16911,
16912, 16913, 16914, 16915, 16916, 16917, 16918, 16919, 16920,
16921, 16922, 16923, 16923, 16924, 16925, 16926, 16927, 16928,
16929, 16930, 16931, 16932, 16933, 16934, 16935, 16936, 16937,
16938, 16939, 16940, 16941, 16942, 16943, 16944, 16945, 16946,
16947, 16948, 16949, 16950, 16951, 16952, 16953, 16954, 16955,
16956, 16957, 16958, 16959, 16960, 16911, 16912, 16913, 16914,
16915, 16916, 16917, 16918, 16919, 16920, 16921, 16922, 16923,
16923, 16924, 16925, 16926, 16927, 16928, 16929, 16930, 16931,
16932, 16933, 16934, 16935, 16936, 16937, 16938, 16939, 16940,
16941, 16942, 16943, 16944, 16945, 16946, 16947, 16948, 16949,
16950, 16951, 16952, 16953, 16954, 16955, 16956, 16957, 16958,
16959, 16960), class = "Date"), x1 = c(51.5, 56.3461538461538,
56.3, 57.8571428571429, 58.7096774193548, 58.9677419354839, 64.4615384615385,
61.9310344827586, 60.3214285714286, 59.4137931034483, 59.5806451612903,
57.3448275862069, 64.0333333333333, 64.0333333333333, 70.15625,
71.3636363636364, 62.8125, 56.4375, 56.4516129032258, 51.741935483871,
52.84375, 53.09375, 52.969696969697, 54, 54.3870967741936, 60.3870967741936,
64.4516129032258, 66.2903225806452, 68.2333333333333, 69.7741935483871,
70.5806451612903, 73.8275862068966, 72.8181818181818, 64.6764705882353,
64.4838709677419, 68.7741935483871, 62.1764705882353, 68.969696969697,
70.1935483870968, 59.6774193548387, 59.9677419354839, 63.125,
67.5882352941177, 71.4705882352941, 73.8529411764706, 76.1935483870968,
72.6451612903226, 76.0645161290323, 76.4193548387097, 81.7741935483871,
85.0645161290323, 51.5, 56.3461538461538, 56.3, 57.8571428571429,
58.7096774193548, 58.9677419354839, 64.4615384615385, 61.9310344827586,
60.3214285714286, 59.4137931034483, 59.5806451612903, 57.3448275862069,
64.0333333333333, 64.0333333333333, 70.15625, 71.3636363636364,
62.8125, 56.4375, 56.4516129032258, 51.741935483871, 52.84375,
53.09375, 52.969696969697, 54, 54.3870967741936, 60.3870967741936,
64.4516129032258, 66.2903225806452, 68.2333333333333, 69.7741935483871,
70.5806451612903, 73.8275862068966, 72.8181818181818, 64.6764705882353,
64.4838709677419, 68.7741935483871, 62.1764705882353, 68.969696969697,
70.1935483870968, 59.6774193548387, 59.9677419354839, 63.125,
67.5882352941177, 71.4705882352941, 73.8529411764706, 76.1935483870968,
72.6451612903226, 76.0645161290323, 76.4193548387097, 81.7741935483871,
85.0645161290323)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-102L))
为每个name
创建一个唯一的标识符行,然后使用pivot_wider
library(dplyr)
d %>%
group_by(name) %>%
mutate(row = row_number()) %>%
tidyr::pivot_wider(names_from = name, values_from = val) %>%
select(-row)
# A tibble: 51 x 4
# time x1 `C Farolillo` `Plaza Eliptica`
# <date> <dbl> <dbl> <dbl>
# 1 2016-04-20 51.5 7 32
# 2 2016-04-21 56.3 3 25
# 3 2016-04-22 56.3 7 31
# 4 2016-04-23 57.9 13 34
# 5 2016-04-24 58.7 7 26
# 6 2016-04-25 59.0 9 33
# 7 2016-04-26 64.5 20 35
# 8 2016-04-27 61.9 19 43
# 9 2016-04-28 60.3 4 22
#10 2016-04-29 59.4 5 22
# … with 41 more rows
通常错误
Warning message:
Values in `val` are not uniquely identified; output will contain list-cols.
最常见的原因是数据中的重复行(排除val列之后(,而不是val列中的重复。
which(duplicated(d))
# [1] 14 65
OP的数据似乎有两个重复的行,这导致了这个问题。删除重复的行也可以消除错误。
yy <- d %>% distinct() %>% pivot_wider(., names_from = name, values_from = val)
yy
# A tibble: 50 x 4
time x1 `C Farolillo` `Plaza Eliptica`
<date> <dbl> <dbl> <dbl>
1 2016-04-20 51.5 7 32
2 2016-04-21 56.3 3 25
3 2016-04-22 56.3 7 31
4 2016-04-23 57.9 13 34
5 2016-04-24 58.7 7 26
6 2016-04-25 59.0 9 33
7 2016-04-26 64.5 20 35
8 2016-04-27 61.9 19 43
9 2016-04-28 60.3 4 22
10 2016-04-29 59.4 5 22
# ... with 40 more rows
问题是由您想要扩展/透视范围的数据具有重复标识符这一事实引起的。虽然上面的两个建议,即用mutate(row = row_number())
从行号创建一个唯一的人工id,或者只过滤distinct
行,都会让你的数据透视范围更广,但它们会改变你的表的结构,这可能会有一个逻辑和组织问题,下次你尝试将任何内容加入其中时就会出现。
使用id_cols
参数explicity是一种更好的做法,可以看到您实际上希望在数据透视范围较宽后必须是唯一的,如果遇到问题,请首先重新组织原始表。当然,您可能会找到筛选到不同行或添加新ID的原因,很可能是为了避免代码早期的重复。
虽然在OP示例中不可见,但在某些情况下,当不需要时,接受的答案会重复行。这种方法在某些情况下避免了这种情况:
d %>%
pivot_wider(names_from = name, values_from = val
, values_fn = list) %>%
unnest(cols = everything() )
如果存在列表,为避免警告和错误,请保留values_fn = list
示例:
d1 <- tail(d)[1:5,]
d5<-d1
d5$name<-"some"
withlist <- tibble(d1, l = list(c(1,2),c(1,2),c(1,2),c(1,2),c(1,2) ) )
withlist2 <- tibble(d5, l = list( list(1,2),list(1,2),list(1,2,3),list(1,2),list(1,2) ) )
withl <- rbind(withlist,withlist2)
res<-withl %>%
pivot_wider(names_from = name, values_from = l
, values_fn = list)
as.data.frame(res)
# val time x1 Plaza Eliptica some
#1 27 2016-06-03 76.19 1, 2 1, 2
#2 10 2016-06-04 72.65 1, 2 1, 2
#3 8 2016-06-05 76.06 1, 2 1, 2, 3
#4 28 2016-06-06 76.42 1, 2 1, 2
#5 32 2016-06-07 81.77 1, 2 1, 2
这是游戏后期的一种选择,但可以保留非唯一的观察结果,但仍以为中心
table(d$name) # get the unique names_from and frequencies
#
# C Farolillo Plaza Eliptica
# 51 51
(d2 <- d %>% mutate(rno = rep(1:51, 2)) %>%
# repeat 1:51 2 times; unique id by names_from
pivot_wider(names_from = name, values_from = val))
# # A tibble: 51 × 5
# time x1 rno `C Farolillo` `Plaza Eliptica`
# <date> <dbl> <int> <dbl> <dbl>
# 1 2016-04-20 51.5 1 7 32
# 2 2016-04-21 56.3 2 3 25
# 3 2016-04-22 56.3 3 7 31
# 4 2016-04-23 57.9 4 13 34
# 5 2016-04-24 58.7 5 7 26
# 6 2016-04-25 59.0 6 9 33
# 7 2016-04-26 64.5 7 20 35
# 8 2016-04-27 61.9 8 19 43
# 9 2016-04-28 60.3 9 4 22
# 10 2016-04-29 59.4 10 5 22
# # … with 41 more rows