r语言 - pivot_wider问题"Values in `values_from` are not uniquely identified; output will contain list-c



我的数据如下:

# A tibble: 6 x 4
name          val time          x1
<chr>       <dbl> <date>     <dbl>
1 C Farolillo     7 2016-04-20  51.5
2 C Farolillo     3 2016-04-21  56.3
3 C Farolillo     7 2016-04-22  56.3
4 C Farolillo    13 2016-04-23  57.9
5 C Farolillo     7 2016-04-24  58.7
6 C Farolillo     9 2016-04-25  59.0

我正在尝试使用pivot_wider函数来扩展基于name列的数据。我使用以下代码:

yy <- d %>% 
pivot_wider(., names_from = name, values_from = val)

这给了我以下警告信息:

Warning message:
Values in `val` are not uniquely identified; output will contain list-cols.
* Use `values_fn = list(val = list)` to suppress this warning.
* Use `values_fn = list(val = length)` to identify where the duplicates arise
* Use `values_fn = list(val = summary_fun)` to summarise duplicates

输出看起来像:

time       x1        out1    out2 
2016-04-20  51.50000    <dbl>   <dbl>
2   2016-04-21  56.34615    <dbl>   <dbl>
3   2016-04-22  56.30000    <dbl>   <dbl>
4   2016-04-23  57.85714    <dbl>   <dbl>
5   2016-04-24  58.70968    <dbl>   <dbl>
6   2016-04-25  58.96774    <dbl>   <dbl>

我知道这里提到了这个问题,为了解决这个问题,他们建议使用汇总统计数据。然而,我有时间序列数据,因此不想使用汇总统计数据,因为每天都有一个值(而不是多个值(。

我知道问题是因为val列有重复(即,在上面的例子中,7出现了3次

关于如何看待和克服这个问题,有什么建议吗?

数据:

d <- structure(list(name = c("C Farolillo", "C Farolillo", "C Farolillo", 
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", 
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", 
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", 
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", 
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", 
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", 
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", 
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", 
"C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", "C Farolillo", 
"C Farolillo", "C Farolillo", "C Farolillo", "Plaza Eliptica", 
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", 
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", 
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", 
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", 
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", 
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", 
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", 
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", 
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", 
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", 
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", 
"Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", "Plaza Eliptica", 
"Plaza Eliptica", "Plaza Eliptica"), val = c(7, 3, 7, 13, 7, 
9, 20, 19, 4, 5, 5, 2, 6, 6, 16, 13, 7, 6, 3, 3, 6, 10, 5, 3, 
5, 3, 4, 4, 10, 11, 4, 13, 8, 2, 8, 10, 3, 10, 14, 4, 2, 4, 6, 
6, 8, 8, 3, 3, 13, 10, 13, 32, 25, 31, 34, 26, 33, 35, 43, 22, 
22, 21, 10, 33, 33, 48, 47, 27, 23, 11, 13, 25, 31, 20, 16, 10, 
9, 23, 11, 23, 26, 16, 34, 17, 4, 24, 21, 10, 26, 32, 10, 5, 
9, 19, 14, 27, 27, 10, 8, 28, 32, 25), time = structure(c(16911, 
16912, 16913, 16914, 16915, 16916, 16917, 16918, 16919, 16920, 
16921, 16922, 16923, 16923, 16924, 16925, 16926, 16927, 16928, 
16929, 16930, 16931, 16932, 16933, 16934, 16935, 16936, 16937, 
16938, 16939, 16940, 16941, 16942, 16943, 16944, 16945, 16946, 
16947, 16948, 16949, 16950, 16951, 16952, 16953, 16954, 16955, 
16956, 16957, 16958, 16959, 16960, 16911, 16912, 16913, 16914, 
16915, 16916, 16917, 16918, 16919, 16920, 16921, 16922, 16923, 
16923, 16924, 16925, 16926, 16927, 16928, 16929, 16930, 16931, 
16932, 16933, 16934, 16935, 16936, 16937, 16938, 16939, 16940, 
16941, 16942, 16943, 16944, 16945, 16946, 16947, 16948, 16949, 
16950, 16951, 16952, 16953, 16954, 16955, 16956, 16957, 16958, 
16959, 16960), class = "Date"), x1 = c(51.5, 56.3461538461538, 
56.3, 57.8571428571429, 58.7096774193548, 58.9677419354839, 64.4615384615385, 
61.9310344827586, 60.3214285714286, 59.4137931034483, 59.5806451612903, 
57.3448275862069, 64.0333333333333, 64.0333333333333, 70.15625, 
71.3636363636364, 62.8125, 56.4375, 56.4516129032258, 51.741935483871, 
52.84375, 53.09375, 52.969696969697, 54, 54.3870967741936, 60.3870967741936, 
64.4516129032258, 66.2903225806452, 68.2333333333333, 69.7741935483871, 
70.5806451612903, 73.8275862068966, 72.8181818181818, 64.6764705882353, 
64.4838709677419, 68.7741935483871, 62.1764705882353, 68.969696969697, 
70.1935483870968, 59.6774193548387, 59.9677419354839, 63.125, 
67.5882352941177, 71.4705882352941, 73.8529411764706, 76.1935483870968, 
72.6451612903226, 76.0645161290323, 76.4193548387097, 81.7741935483871, 
85.0645161290323, 51.5, 56.3461538461538, 56.3, 57.8571428571429, 
58.7096774193548, 58.9677419354839, 64.4615384615385, 61.9310344827586, 
60.3214285714286, 59.4137931034483, 59.5806451612903, 57.3448275862069, 
64.0333333333333, 64.0333333333333, 70.15625, 71.3636363636364, 
62.8125, 56.4375, 56.4516129032258, 51.741935483871, 52.84375, 
53.09375, 52.969696969697, 54, 54.3870967741936, 60.3870967741936, 
64.4516129032258, 66.2903225806452, 68.2333333333333, 69.7741935483871, 
70.5806451612903, 73.8275862068966, 72.8181818181818, 64.6764705882353, 
64.4838709677419, 68.7741935483871, 62.1764705882353, 68.969696969697, 
70.1935483870968, 59.6774193548387, 59.9677419354839, 63.125, 
67.5882352941177, 71.4705882352941, 73.8529411764706, 76.1935483870968, 
72.6451612903226, 76.0645161290323, 76.4193548387097, 81.7741935483871, 
85.0645161290323)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-102L))

为每个name创建一个唯一的标识符行,然后使用pivot_wider

library(dplyr)
d %>%
group_by(name) %>%
mutate(row = row_number()) %>%
tidyr::pivot_wider(names_from = name, values_from = val) %>%
select(-row)
# A tibble: 51 x 4
#   time          x1 `C Farolillo` `Plaza Eliptica`
#   <date>     <dbl>         <dbl>            <dbl>
# 1 2016-04-20  51.5             7               32
# 2 2016-04-21  56.3             3               25
# 3 2016-04-22  56.3             7               31
# 4 2016-04-23  57.9            13               34
# 5 2016-04-24  58.7             7               26
# 6 2016-04-25  59.0             9               33
# 7 2016-04-26  64.5            20               35
# 8 2016-04-27  61.9            19               43
# 9 2016-04-28  60.3             4               22
#10 2016-04-29  59.4             5               22
# … with 41 more rows

通常错误

Warning message:
Values in `val` are not uniquely identified; output will contain list-cols.

最常见的原因是数据中的重复行(排除val列之后(,而不是val列中的重复。

which(duplicated(d))
# [1] 14 65

OP的数据似乎有两个重复的行,这导致了这个问题。删除重复的行也可以消除错误。

yy <- d %>% distinct() %>% pivot_wider(., names_from = name, values_from = val)
yy
# A tibble: 50 x 4
time          x1 `C Farolillo` `Plaza Eliptica`
<date>     <dbl>         <dbl>            <dbl>
1 2016-04-20  51.5             7               32
2 2016-04-21  56.3             3               25
3 2016-04-22  56.3             7               31
4 2016-04-23  57.9            13               34
5 2016-04-24  58.7             7               26
6 2016-04-25  59.0             9               33
7 2016-04-26  64.5            20               35
8 2016-04-27  61.9            19               43
9 2016-04-28  60.3             4               22
10 2016-04-29  59.4             5               22
# ... with 40 more rows

问题是由您想要扩展/透视范围的数据具有重复标识符这一事实引起的。虽然上面的两个建议,即用mutate(row = row_number())从行号创建一个唯一的人工id,或者只过滤distinct行,都会让你的数据透视范围更广,但它们会改变你的表的结构,这可能会有一个逻辑和组织问题,下次你尝试将任何内容加入其中时就会出现。

使用id_cols参数explicity是一种更好的做法,可以看到您实际上希望在数据透视范围较宽后必须是唯一的,如果遇到问题,请首先重新组织原始表。当然,您可能会找到筛选到不同行或添加新ID的原因,很可能是为了避免代码早期的重复。

虽然在OP示例中不可见,但在某些情况下,当不需要时,接受的答案会重复行。这种方法在某些情况下避免了这种情况:

d %>%
pivot_wider(names_from = name, values_from = val
, values_fn = list) %>% 
unnest(cols = everything() )

如果存在列表,为避免警告和错误,请保留values_fn = list

示例:

d1 <- tail(d)[1:5,]
d5<-d1
d5$name<-"some"
withlist  <- tibble(d1, l = list(c(1,2),c(1,2),c(1,2),c(1,2),c(1,2) ) )
withlist2 <- tibble(d5, l = list( list(1,2),list(1,2),list(1,2,3),list(1,2),list(1,2) ) )
withl     <- rbind(withlist,withlist2)
res<-withl %>%
pivot_wider(names_from = name, values_from = l
, values_fn = list)     
as.data.frame(res)
#  val       time    x1 Plaza Eliptica    some
#1  27 2016-06-03 76.19           1, 2    1, 2
#2  10 2016-06-04 72.65           1, 2    1, 2
#3   8 2016-06-05 76.06           1, 2 1, 2, 3
#4  28 2016-06-06 76.42           1, 2    1, 2
#5  32 2016-06-07 81.77           1, 2    1, 2

这是游戏后期的一种选择,但可以保留非唯一的观察结果,但仍以为中心

table(d$name) # get the unique names_from and frequencies
# 
#    C Farolillo Plaza Eliptica 
#             51             51  
(d2 <- d %>% mutate(rno = rep(1:51, 2)) %>% 
# repeat 1:51 2 times; unique id by names_from
pivot_wider(names_from = name, values_from = val))
# # A tibble: 51 × 5
#    time          x1   rno `C Farolillo` `Plaza Eliptica`
#    <date>     <dbl> <int>         <dbl>            <dbl>
#  1 2016-04-20  51.5     1             7               32
#  2 2016-04-21  56.3     2             3               25
#  3 2016-04-22  56.3     3             7               31
#  4 2016-04-23  57.9     4            13               34
#  5 2016-04-24  58.7     5             7               26
#  6 2016-04-25  59.0     6             9               33
#  7 2016-04-26  64.5     7            20               35
#  8 2016-04-27  61.9     8            19               43
#  9 2016-04-28  60.3     9             4               22
# 10 2016-04-29  59.4    10             5               22
# # … with 41 more rows 

相关内容

最新更新