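Preparing data to show the number of transitions

Ultimately I want to create a nice circos plot, but to get there I need to show the number of people going from A to B, B to C, B to A, and so on.

My dataset: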

#Generate some sample data:
proc<-sample(c("EMR","RFA","Biopsies"), 100, replace = TRUE)
#Sample dates
dat<-sample(seq(as.Date('2013/01/01'), as.Date('2017/05/01'), by="day"), 100)
#Generate 20 hospital numbers in no particular order:
Id<-sample(c("P43","P63","K52","G24","S55","D07","U87","P22","Y76","I92","P22","P02","U22415","U23","S14","O34","T62","J32","F63","T43"), 100, replace = TRUE)
df<-data.frame(proc,dat,Id)

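If I were preparing the data for a Sankey diagram, I would do this:

library(data.table)
Sankey<-dcast(setDT(df)[, if(any(proc=="EMR"|proc=="RFA")) .SD, Id], Id~rowid(Id), value.var ="proc")

This gives me a nice table showing, in order, what happens to each patient at each time point.

But I want to go one step further and find the number of patients moving between each pair of proc types (i.e. "EMR", "RFA" and "Biopsies"), so that I can convert them into the format the circos plot needs, i.e. (the frequencies here are made up):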

origin   destination frequency
EMR       RFA           14
EMR       Biopsies      4
EMR       EMR           10
RFA       RFA           24
RFA       Biopsies      42
RFA       EMR           1
Biopsies  RFA           3
Biopsies  Biopsies      6
Biopsies  EMR           16

Or I suppose another way to display it would be:

                destination
origin       EMR   RFA   Biopsies
EMR           10    14          4
RFA            1    24         42
Biopsies      16     3          6
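
The two layouts carry the same information. As a minimal base R sketch, reusing the made-up frequencies from the question (the object names `flows` and `flow_matrix` are only illustrative), `xtabs()` turns the long origin/destination/frequency table into the matrix form:

```r
# Made-up frequencies copied from the example table in the question
flows <- data.frame(
  origin      = rep(c("EMR", "RFA", "Biopsies"), each = 3),
  destination = c("RFA", "Biopsies", "EMR",
                  "RFA", "Biopsies", "EMR",
                  "RFA", "Biopsies", "EMR"),
  frequency   = c(14, 4, 10, 24, 42, 1, 3, 6, 16)
)

# Fix the level order so rows and columns come out as EMR, RFA, Biopsies
lv <- c("EMR", "RFA", "Biopsies")
flows$origin      <- factor(flows$origin, levels = lv)
flows$destination <- factor(flows$destination, levels = lv)

# Cross-tabulate: rows are origins, columns are destinations
flow_matrix <- xtabs(frequency ~ origin + destination, data = flows)
flow_matrix
```

Going the other way, `as.data.frame(flow_matrix)` returns the long form again, with the counts in a column named `Freq`.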

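I would use dplyr for this task. The core of the analysis is the lag function, which retrieves each patient's previous position, and the summarise function, which counts the cases.

The whole analysis would look like this: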

# for reproducibility
set.seed(20170805)
# your data
proc<-sample(c("EMR","RFA","Biopsies"), 100, replace = TRUE)
#Sample dates
dat<-sample(seq(as.Date('2013/01/01'), as.Date('2017/05/01'), by="day"), 100)
#Generate 20 hospital numbers in no particular order:
Id<-sample(c("P43","P63","K52","G24","S55","D07","U87","P22","Y76","I92","P22","P02","U22415","U23","S14","O34","T62","J32","F63","T43"), 100, replace = TRUE)
# my approach using dplyr
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
# tibble() replaces the deprecated data_frame()
df <- tibble(proc, dat, Id)
df %>% 
# make sure that we progress in the direct order of time...
arrange(dat) %>% 
# for each patient:
group_by(Id) %>%
# find the last position
mutate(origin = lag(proc, 1), destination = proc) %>%
# for each origin, destination-pair...
group_by(origin, destination) %>% 
# summarise the number of pairs
summarise(n = n()) %>% 
# not really necessary, but gives slightly nicer output here...
ungroup()
#> # A tibble: 12 x 3
#>      origin destination     n
#>       <chr>       <chr> <int>
#>  1 Biopsies    Biopsies     5
#>  2 Biopsies         EMR     8
#>  3 Biopsies         RFA    11
#>  4      EMR    Biopsies    11
#>  5      EMR         EMR    11
#>  6      EMR         RFA    10
#>  7      RFA    Biopsies     6
#>  8      RFA         EMR    12
#>  9      RFA         RFA     8
#> 10     <NA>    Biopsies     8
#> 11     <NA>         EMR     4
#> 12     <NA>         RFA     6
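
The `<NA>` origins in that output are each patient's first recorded procedure, which has no predecessor. A small hand-made example (the toy data below is invented for illustration, not taken from the question) shows what the grouped `lag()` does and how those rows could be dropped if only genuine transitions are wanted:

```r
library(dplyr)

# Invented toy data: two patients, three procedures each
toy <- tibble(
  Id   = c("A", "A", "A", "B", "B", "B"),
  dat  = as.Date(c("2015-01-01", "2015-06-01", "2016-01-01",
                   "2014-03-01", "2014-09-01", "2015-02-01")),
  proc = c("EMR", "RFA", "RFA", "Biopsies", "EMR", "RFA")
)

transitions <- toy %>%
  arrange(dat) %>%                      # chronological order
  group_by(Id) %>%                      # lag within each patient only
  mutate(origin = lag(proc), destination = proc) %>%
  ungroup() %>%
  filter(!is.na(origin)) %>%            # drop each patient's first visit
  count(origin, destination)            # one row per origin/destination pair

transitions
```

Here patient A contributes EMR to RFA and RFA to RFA, and patient B contributes Biopsies to EMR and EMR to RFA, so the four transitions collapse into three rows.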

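I managed it with a dodgy method: basically pasting all the columns together, then using the stringr package to split them apart and tabulating the result.

library(data.table)
library(stringr)
Sankey<-dcast(setDT(df)[, if(any(proc=="EMR"|proc=="RFA")) .SD, Id], Id~rowid(Id), value.var ="proc")
# Paste each patient's procedures into one string such as "EMR-RFA-Biopsies"
Sankey$x <- apply( Sankey[ , 2:ncol(Sankey)] , 1 , paste , collapse = "-" )
# Extract origin-destination pairs; matches do not overlap, so the second pair in "EMR-RFA-Biopsies" is not counted
myList<-unlist(str_extract_all(Sankey$x,"[A-Za-z]+-[A-Za-z]+"))
table(myList)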
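
A regex-free way to count every consecutive pair, sketched in base R under the assumption that each patient's rows are first sorted by date (the helper name `adjacent_pairs` and the toy data are hypothetical, introduced only for this example):

```r
# Hypothetical helper: consecutive pairs "a-b" from an ordered vector
adjacent_pairs <- function(x) {
  if (length(x) < 2) return(character(0))
  paste(head(x, -1), tail(x, -1), sep = "-")
}

# Invented toy data: patient A has three procedures, patient B has two
toy <- data.frame(
  Id   = c("A", "A", "A", "B", "B"),
  dat  = as.Date(c("2015-01-01", "2015-06-01", "2016-01-01",
                   "2014-03-01", "2014-09-01")),
  proc = c("EMR", "RFA", "Biopsies", "RFA", "RFA"),
  stringsAsFactors = FALSE
)

# Sort by patient and date, then pair each procedure with the next one
toy <- toy[order(toy$Id, toy$dat), ]
pair_labels <- unlist(lapply(split(toy$proc, toy$Id), adjacent_pairs),
                      use.names = FALSE)
pair_counts <- table(pair_labels)
pair_counts
```

Because the pairs are built per patient from the vector itself, both EMR-RFA and RFA-Biopsies are counted for patient A, and no padding values enter the tally.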
