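Ultimately I want to create nice circos plots, but to get there I need to show the number of people who move from A to B, B to C, B to A, and so on.
My dataset: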
#Generate some sample data:
proc<-sample(c("EMR","RFA","Biopsies"), 100, replace = TRUE)
#Sample dates
dat<-sample(seq(as.Date('2013/01/01'), as.Date('2017/05/01'), by="day"), 100)
#Generate 20 hospital numbers in no particular order:
Id<-sample(c("P43","P63","K52","G24","S55","D07","U87","P22","Y76","I92","P22","P02","U22415","U23","S14","O34","T62","J32","F63","T43"), 100, replace = TRUE)
df<-data.frame(proc,dat,Id)
If I were preparing the data for a Sankey diagram, I would do this:
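library(data.table)
# order by date first so that rowid(Id) numbers each patient's procedures chronologically
Sankey<-dcast(setDT(df)[order(dat)][, if(any(proc=="EMR"|proc=="RFA")) .SD, Id], Id~rowid(Id), value.var ="proc")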
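This gives me a nice table showing, in order, what happens to each patient at each time point.
But I want to get to the next step, which is to find the number of patients who transition between each of the different proc types (i.e. "EMR", "RFA" and "Biopsies"), so that I can convert them into the format a circos plot needs, i.e. (the frequencies here are made up):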
origin    destination  frequency
EMR       RFA          14
EMR       Biopsies     4
EMR       EMR          10
RFA       RFA          24
RFA       Biopsies     42
RFA       EMR          1
Biopsies  RFA          3
Biopsies  Biopsies     6
Biopsies  EMR          16
Or I suppose another way to display it would be:
          destination
origin    EMR  RFA  Biopsies
EMR        10   14         4
RFA         1   24        42
Biopsies   16    3         6
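The two layouts above hold the same information, and going from the long table to the matrix can be sketched in base R with `xtabs`. This is only an illustration using the made-up frequencies from the question; the object names `transitions`, `lv` and `mat` are my own.

```r
# The made-up frequencies from the long table, one row per origin/destination pair
transitions <- data.frame(
  origin      = rep(c("EMR", "RFA", "Biopsies"), each = 3),
  destination = c("RFA", "Biopsies", "EMR",
                  "RFA", "Biopsies", "EMR",
                  "RFA", "Biopsies", "EMR"),
  frequency   = c(14, 4, 10, 24, 42, 1, 3, 6, 16)
)

# Fix the level order so rows and columns come out as EMR, RFA, Biopsies
lv <- c("EMR", "RFA", "Biopsies")
transitions$origin      <- factor(transitions$origin, levels = lv)
transitions$destination <- factor(transitions$destination, levels = lv)

# xtabs sums frequency within each origin/destination cell
mat <- xtabs(frequency ~ origin + destination, data = transitions)
mat
```

Setting the factor levels explicitly is a design choice: without it the rows and columns would be sorted alphabetically.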
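I would use dplyr for this task. The core of the analysis is the lag function, which retrieves each patient's previous position, and the summarise function, which counts the cases.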
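To make the role of `lag` concrete, here is a minimal sketch on hypothetical toy data (two invented patients, rows assumed to be in date order already). Within each group, `lag(proc)` shifts the procedures down by one, so every row is paired with the procedure that came before it; a patient's first row has no predecessor and gets `NA`.

```r
library(dplyr)

# Hypothetical toy data: two patients, rows already in date order
toy <- tibble(
  Id   = c("A", "A", "A", "B", "B"),
  proc = c("EMR", "RFA", "RFA", "Biopsies", "EMR")
)

toy_pairs <- toy %>%
  group_by(Id) %>%
  # the previous procedure of the same patient becomes the origin
  mutate(origin = lag(proc), destination = proc) %>%
  ungroup()

toy_pairs
```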
The whole analysis would go like this:
# for reproducibility
set.seed(20170805)
# your data
proc<-sample(c("EMR","RFA","Biopsies"), 100, replace = TRUE)
#Sample dates
dat<-sample(seq(as.Date('2013/01/01'), as.Date('2017/05/01'), by="day"), 100)
#Generate 20 hospital numbers in no particular order:
Id<-sample(c("P43","P63","K52","G24","S55","D07","U87","P22","Y76","I92","P22","P02","U22415","U23","S14","O34","T62","J32","F63","T43"), 100, replace = TRUE)
# my approach using dplyr
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
# tibble() replaces the deprecated data_frame()
df <- tibble(proc, dat, Id)
df %>%
  # make sure that we progress in chronological order...
  arrange(dat) %>%
  # for each patient:
  group_by(Id) %>%
  # find the previous position (NA for a patient's first procedure)
  mutate(origin = lag(proc, 1), destination = proc) %>%
  # for each origin, destination-pair...
  group_by(origin, destination) %>%
  # summarise the number of pairs
  summarise(n = n()) %>%
  # not really necessary, but gives slightly nicer output here...
  ungroup()
#> # A tibble: 12 x 3
#> origin destination n
#> <chr> <chr> <int>
#> 1 Biopsies Biopsies 5
#> 2 Biopsies EMR 8
#> 3 Biopsies RFA 11
#> 4 EMR Biopsies 11
#> 5 EMR EMR 11
#> 6 EMR RFA 10
#> 7 RFA Biopsies 6
#> 8 RFA EMR 12
#> 9 RFA RFA 8
#> 10 <NA> Biopsies 8
#> 11 <NA> EMR 4
#> 12 <NA> RFA 6
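The rows with an `<NA>` origin are each patient's first procedure, so they are not transitions. A possible follow-up, sketched here on hypothetical data rather than the sample above, drops those rows and reshapes the counts into the origin-by-destination matrix from the question. The names `visits`, `pair_counts` and `mat` are illustrative only.

```r
library(dplyr)

# Hypothetical visits (not the data above), one row per procedure
visits <- tibble(
  Id   = c("A", "A", "A", "B", "B", "C"),
  dat  = as.Date(c("2014-01-01", "2015-01-01", "2016-01-01",
                   "2014-06-01", "2013-06-01", "2015-03-01")),
  proc = c("EMR", "RFA", "RFA", "EMR", "Biopsies", "EMR")
)

pair_counts <- visits %>%
  arrange(dat) %>%
  group_by(Id) %>%
  mutate(origin = lag(proc), destination = proc) %>%
  ungroup() %>%
  # a patient's first procedure has no origin, so it is not a transition
  filter(!is.na(origin)) %>%
  count(origin, destination, name = "frequency")

# wide form: one row per origin, one column per destination
mat <- xtabs(frequency ~ origin + destination, data = pair_counts)
mat
```

As far as I know, `chordDiagram()` from the circlize package accepts either a matrix like `mat` or a three-column data frame like `pair_counts`, but check its documentation before relying on that.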
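I managed it with a hacky approach: basically pasting all the columns together, then using the stringr package to split them apart and tabulate.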
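library(data.table)
library(stringr)
# order by date first so that rowid(Id) numbers each patient's procedures chronologically
Sankey<-dcast(setDT(df)[order(dat)][, if(any(proc=="EMR"|proc=="RFA")) .SD, Id], Id~rowid(Id), value.var ="proc")
# paste each patient's procedures into one "A-B-C" string, skipping the NA padding
Sankey$x <- apply( Sankey[ , 2:ncol(Sankey)] , 1 , function(r) paste(na.omit(r), collapse = "-") )
# the lookahead lets overlapping pairs match, so "A-B-C" yields both "A-B" and "B-C"
myList<-unlist(lapply(str_match_all(Sankey$x,"(?<![A-Za-z])(?=([A-Za-z]+-[A-Za-z]+))"), function(m) m[, 2]))
table(myList)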
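For comparison, the same pair counting can be sketched in base R without any string pasting and re-splitting, by lining up each sequence with a copy of itself shifted by one. The sequences below are hypothetical and assumed to be in date order already; `seqs`, `pair_list` and `pair_tab` are names I made up for the example.

```r
# Hypothetical per-patient procedure sequences, already in date order
seqs <- list(
  A = c("EMR", "RFA", "Biopsies"),
  B = c("RFA", "RFA"),
  C = "EMR"
)

pair_list <- lapply(seqs, function(p) {
  # a single procedure has no transition
  if (length(p) < 2) return(character(0))
  # all but the last element are origins, all but the first are destinations
  paste(head(p, -1), tail(p, -1), sep = "-")
})

pair_tab <- table(unlist(pair_list))
pair_tab
```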