Reading a text file with R and formatting the extracted data into a table



I have a text file in the following basic format, repeated several thousand times:

Patient Name- John Smith
Number of dx codes: 123
Number of pr codes: 678
Charges: 910
Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Duis arcu ipsum, ultrices placerat mattis ac, venenatis eu magna. 
Donec interdum iaculis lacus. Nunc in placerat augue. 
In ut odio et dui aliquam sagittis at id augue. 
Patient Name- Jane Smith
Number of dx codes: 234
Number of pr codes: 567
Charges: 1011

What is the best way to convert the text above into the following format?

Patient Name    DxCodes    PrCodes    Charges
John Smith      123        678        910
Jane Smith      234        567        1011

I have been able to use str_extract from the stringr package to pull all the patient names into one data frame, and the DxCodes, PrCodes and Charges lines into another, as shown below:

Names
John Smith
Jane Smith

Number of dx codes: 123
Number of pr codes: 678
Charges: 910
Number of dx codes: 234
Number of pr codes: 567
Charges: 1011

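However, I am not sure how to go from these two data frames to the desired format. Should I have used a different approach from the start? Any help would be appreciated. Thanks!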

You can use a sequence of regular expressions and then assemble the pieces with data.frame().

# txt is a character vector holding the lines of the file (see the data at the end of this answer)
# Locate the line of each kind
inx1 <- grep("Patient Name", txt)
inx2 <- grep("Number of dx codes:", txt)
inx3 <- grep("Number of pr codes:", txt)
inx4 <- grep("Charges", txt)
# Strip the labels; "\\1" refers to the digits captured by the parentheses
PatientName <- sub("^Patient Name[- ]*", "", txt[inx1])
DxCodes <- sub("^.*: *([[:digit:]]*)$", "\\1", txt[inx2])
PrCodes <- sub("^.*: *([[:digit:]]*)$", "\\1", txt[inx3])
Charges <- sub("^.*: *([[:digit:]]*)$", "\\1", txt[inx4])
DxCodes <- as.integer(DxCodes)
PrCodes <- as.integer(PrCodes)
Charges <- as.integer(Charges)
result <- data.frame(PatientName, DxCodes, PrCodes, Charges)
result
#  PatientName DxCodes PrCodes Charges
#1  John Smith     123     678     910
#2  Jane Smith     234     567    1011

Data.

conn <- textConnection("
Patient Name- John Smith
Number of dx codes: 123
Number of pr codes: 678
Charges: 910
Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Duis arcu ipsum, ultrices placerat mattis ac, venenatis eu magna. 
Donec interdum iaculis lacus. Nunc in placerat augue. 
In ut odio et dui aliquam sagittis at id augue. 
Patient Name- Jane Smith
Number of dx codes: 234
Number of pr codes: 567
Charges: 1011
")
txt <- readLines(conn)
close(conn)
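
With a real file, the same steps can start from readLines() on a path instead of a text connection. The following is a minimal sketch under some assumptions: parse_patients() is a hypothetical helper name, the sample file is written to a temporary path only to keep the example self-contained, and the check on the label counts is an addition that is not part of the answer above. It guards against records with a missing line, which would otherwise shift values between patients or make data.frame() fail.

```r
# parse_patients() is a hypothetical helper name, not part of any package.
parse_patients <- function(txt) {
  patterns <- c(PatientName = "^Patient Name",
                DxCodes     = "^Number of dx codes:",
                PrCodes     = "^Number of pr codes:",
                Charges     = "^Charges:")
  # For each label, keep the lines that start with it
  hits <- lapply(patterns, grep, x = txt, value = TRUE)
  # Every label must occur the same number of times, once per patient
  if (length(unique(lengths(hits))) != 1L) {
    stop("the four labels do not occur the same number of times")
  }
  data.frame(
    PatientName = trimws(sub("^Patient Name[- ]*", "", hits$PatientName)),
    DxCodes     = as.integer(sub("^.*: *", "", hits$DxCodes)),
    PrCodes     = as.integer(sub("^.*: *", "", hits$PrCodes)),
    Charges     = as.integer(sub("^.*: *", "", hits$Charges)),
    stringsAsFactors = FALSE
  )
}

# Write a small sample to a temporary file and read it back
path <- tempfile(fileext = ".txt")
writeLines(c("Patient Name- John Smith",
             "Number of dx codes: 123",
             "Number of pr codes: 678",
             "Charges: 910",
             "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
             "Patient Name- Jane Smith",
             "Number of dx codes: 234",
             "Number of pr codes: 567",
             "Charges: 1011"), path)
txt <- readLines(path)
result <- parse_patients(txt)
unlink(path)
result
```

Anchoring each pattern with ^ keeps free-text lines that merely mention a label from being picked up.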

Here is an implementation that assumes a fixed order of the lines within each patient's block of text.

Data:

txt <- c(
'Patient Name- John Smith',
'Number of dx codes: 123',
'Number of pr codes: 678',
'Charges: 910',
'Lorem ipsum dolor sit amet, consectetur adipiscing elit. ',
'Duis arcu ipsum, ultrices placerat mattis ac, venenatis eu magna. ',
'Donec interdum iaculis lacus. Nunc in placerat augue. ',
'In ut odio et dui aliquam sagittis at id augue. ',
'Patient Name- Jane Smith',
'Number of dx codes: 234',
'Number of pr codes: 567',
'Charges: 1011')

Split the patients into separate vectors:

patients <- split(txt, cumsum(grepl("^Patient Name", txt)))
str(patients)
# List of 2
#  $ 1: chr [1:8] "Patient Name- John Smith" "Number of dx codes: 123" "Number of pr codes: 678" "Charges: 910" ...
#  $ 2: chr [1:4] "Patient Name- Jane Smith" "Number of dx codes: 234" "Number of pr codes: 567" "Charges: 1011"
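
The split works because cumsum() over the logical vector from grepl() increases by one at every "Patient Name" line, so all lines of one record share the same group number. A small sketch with made-up lines (the vector names lines, starts and group are only for illustration):

```r
lines <- c("Patient Name- A", "x", "y", "Patient Name- B", "z")
# TRUE on every line that starts a new record
starts <- grepl("^Patient Name", lines)
# Running count of record starts, used as the grouping key for split()
group <- cumsum(starts)
group
# [1] 1 1 1 2 2
```

Any lines that come before the first "Patient Name" line would end up in a group numbered 0.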

For each patient, parse out the relevant parts. This assumes the order of the lines (name, dx, pr, charges) is static, but it can easily be extended.

patients2 <- lapply(patients, function(pat) {
  # Lines 1-4 of each block are assumed to be name, dx, pr, charges, in that order
  # Remove only the leading label so that hyphenated names stay intact
  nm <- trimws(sub("^Patient Name-", "", pat[1]))
  dx <- as.integer(strsplit(pat[2], ":")[[1]][2])
  pr <- as.integer(strsplit(pat[3], ":")[[1]][2])
  ch <- as.integer(strsplit(pat[4], ":")[[1]][2])
  # Everything after the fourth line is kept as free text
  rest <- paste(pat[-(1:4)], collapse = "\n")
  data.frame(name = nm, dx = dx, pr = pr, charges = ch, rest = rest,
             stringsAsFactors = FALSE)
})
str(patients2)
# List of 2
#  $ 1:'data.frame':    1 obs. of  5 variables:
#   ..$ name   : chr "John Smith"
#   ..$ dx     : int 123
#   ..$ pr     : int 678
#   ..$ charges: int 910
#   ..$ rest   : chr "Lorem ipsum dolor sit amet, consectetur adipiscing elit. \nDuis arcu ipsum, ultrices placerat mattis ac, venenatis eu magna. \n"| __truncated__
#  $ 2:'data.frame':    1 obs. of  5 variables:
#   ..$ name   : chr "Jane Smith"
#   ..$ dx     : int 234
#   ..$ pr     : int 567
#   ..$ charges: int 1011
#   ..$ rest   : chr ""

Now combine them into a single data frame.

patients3 <- do.call(rbind.data.frame, patients2)
str(patients3)
# 'data.frame': 2 obs. of  5 variables:
#  $ name   : chr  "John Smith" "Jane Smith"
#  $ dx     : int  123 234
#  $ pr     : int  678 567
#  $ charges: int  910 1011
#  $ rest   : chr  "Lorem ipsum dolor sit amet, consectetur adipiscing elit. \nDuis arcu ipsum, ultrices placerat mattis ac, venenatis eu magna. \n"| __truncated__ ""
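
One possible way to extend this is to look each field up by its label inside the block instead of by position, so that reordered or missing lines give NA rather than wrong values. This is only a sketch under that assumption: get_field() is a hypothetical helper, and the sample data is altered on purpose (the second patient has pr before dx and no Charges line).

```r
txt <- c(
  "Patient Name- John Smith",
  "Number of dx codes: 123",
  "Number of pr codes: 678",
  "Charges: 910",
  "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
  "Patient Name- Jane Smith",
  "Number of pr codes: 567",
  "Number of dx codes: 234")

blocks <- split(txt, cumsum(grepl("^Patient Name", txt)))

# Hypothetical helper: value of the first line in `block` that starts with `label`,
# or NA if the block has no such line
get_field <- function(block, label) {
  hit <- grep(paste0("^", label), block, value = TRUE)
  if (length(hit) == 0L) return(NA_character_)
  # Drop everything up to and including the first ":" or "-"
  trimws(sub("^[^:-]*[:-]", "", hit[1]))
}

rows <- lapply(blocks, function(block) {
  data.frame(name    = get_field(block, "Patient Name"),
             dx      = as.integer(get_field(block, "Number of dx codes")),
             pr      = as.integer(get_field(block, "Number of pr codes")),
             charges = as.integer(get_field(block, "Charges")),
             stringsAsFactors = FALSE)
})
result <- do.call(rbind, rows)
result
```

The trade-off is one grep() per field per block, which is slower than positional indexing but does not depend on the layout of each record.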

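If your text really is as you presented it, one contiguous block or a single string, this uses regular-expression lookbehinds to pick out the numbers, assuming every record has DX, PR and charges: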

library(stringr)
df <- " 
Patient Name- John Smith
Number of dx codes: 123
Number of pr codes: 678
Charges: 910
Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Duis arcu ipsum, ultrices placerat mattis ac, venenatis eu magna. 
Donec interdum iaculis lacus. Nunc in placerat augue. 
In ut odio et dui aliquam sagittis at id augue. 
Patient Name- Jane Smith
Number of dx codes: 234
Number of pr codes: 567
Charges: 1011"
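# Each pattern uses a lookbehind, so only the digits after the label are matched.
# str_match_all() returns a list with one matrix per input string;
# [[1]][, 1] takes the column of full matches for the single string df.
df_b <- data.frame(
  dx      = as.integer(str_match_all(df, "(?<=dx codes: )[[:digit:]]+")[[1]][, 1]),
  pr      = as.integer(str_match_all(df, "(?<=pr codes: )[[:digit:]]+")[[1]][, 1]),
  charges = as.integer(str_match_all(df, "(?<=Charges: )[[:digit:]]+")[[1]][, 1]))
df_b
#   dx  pr charges
#1 123 678     910
#2 234 567    1011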
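
The result above does not include the patient name. As a sketch of how one pattern with capture groups could pull all four fields at once, the following uses base R only (gregexpr(), regmatches() and utils::strcapture()) rather than stringr. It assumes the four labelled lines of a record are always adjacent and in this order; the variable names are illustrative.

```r
txt <- paste(
  "Patient Name- John Smith",
  "Number of dx codes: 123",
  "Number of pr codes: 678",
  "Charges: 910",
  "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
  "Patient Name- Jane Smith",
  "Number of dx codes: 234",
  "Number of pr codes: 567",
  "Charges: 1011",
  sep = "\n")

# One capture group per field; the four lines must be consecutive
pattern <- paste0("Patient Name- *([^\n]+)\n",
                  "Number of dx codes: *([0-9]+)\n",
                  "Number of pr codes: *([0-9]+)\n",
                  "Charges: *([0-9]+)")

# Step 1: pull every complete record out of the single string
records <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))[[1]]

# Step 2: split each record into its groups; proto fixes column names and types
proto <- data.frame(PatientName = character(), DxCodes = integer(),
                    PrCodes = integer(), Charges = integer(),
                    stringsAsFactors = FALSE)
result <- utils::strcapture(pattern, records, proto, perl = TRUE)
result$PatientName <- trimws(result$PatientName)
result
```

Records that do not match the full pattern are silently skipped in step 1, so comparing nrow(result) with the number of "Patient Name" lines is a cheap sanity check.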
