Reading a fixed-width, multi-line file in R

I have data in a PDF file that I am reading into R.

library(pdftools)
library(readr)
library(stringr)
library(dplyr)
results <- pdf_text("health_data.pdf") %>% 
readr::read_lines()

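When I read it in this way, it returns a character vector. The text for a given column can be spread across several lines, and not every observation has data in every column.

A reproducible example is below: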

ex_result <- c("03/11/2012 BES 3RD          BES inc and corp           no-            no- sale -",
"           group with                           sale        no- sale",  
"           boxes",                                                                   
"03/11/2012 KRS six and    firefly                  45       mg/dL  100 - 200",        
"           seven",                                                                   
"03/11/2012 KRS core    ladybuyg            55       mg/dL  42 - 87")

I am trying to use read_fwf with fwf_widths, because I read that it can handle multi-line input if you give it the widths of the multi-line records.

ex_result_width <- read_fwf(ex_result, fwf_widths(
c(10, 24, 16, 7, 5, 15,100), 
c("date", "name","description", "value", "unit","range","ab_flag")))

I determined the widths by passing the longest string I could see in each column to nchar in the console.
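As a rough check of that measurement (a sketch in base R; the strings are copied from the desired output and the sample lines in this post), nchar on the merged values gives the first three widths used in the call above. These are lengths of the values after the continuation lines are joined, which may differ from the width a column occupies on a single physical line.

```r
# Values as they should look once the continuation lines are merged
joined_values <- c(
  date        = "03/11/2012",
  name        = "BES 3RD group with boxes",  # spread over three physical lines
  description = "BES inc and corp"
)
joined_widths <- nchar(joined_values)
joined_widths

# On the first physical line, the name cell only holds this much text
first_line_name_width <- nchar("BES 3RD")
first_line_name_width
```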

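With fwf_widths I can get the date column by setting the first entry of the widths argument to 10. For the name column, however, a width of 24 returns text from neighbouring columns concatenated together instead of the name assembled from its continuation lines. The error then cascades into the remaining columns, which end up holding the wrong data, and whatever is left over is dropped once the line runs out.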
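One possible contributor to the cascading columns is that, in the sample lines, the description column does not seem to start at the same character position in every record. The following base R sketch checks this on the three date-leading lines from ex_result; the marker strings in desc_text are picked by hand for illustration and are not part of any readr API.

```r
# First physical line of each record, copied from ex_result
first_lines <- c(
  "03/11/2012 BES 3RD          BES inc and corp           no-            no- sale -",
  "03/11/2012 KRS six and    firefly                  45       mg/dL  100 - 200",
  "03/11/2012 KRS core    ladybuyg            55       mg/dL  42 - 87"
)

# Text that opens the description cell on each of those lines (chosen by hand)
desc_text <- c("BES inc", "firefly", "ladybuyg")

# Character position at which the description starts on each line
desc_start <- mapply(
  function(line, txt) as.integer(regexpr(txt, line, fixed = TRUE)),
  first_lines, desc_text,
  USE.NAMES = FALSE
)
desc_start

# TRUE would mean a single set of fixed widths could line up with every record
length(unique(desc_start)) == 1
```

If the positions differ, no single widths vector will split every line correctly, and splitting on something other than fixed offsets may be worth considering.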

Ultimately, this is the desired output:

desired_output <-tibble(
date = c("03/11/2012","03/11/2012","03/11/2012"),
name = c("BES 3RD group with boxes", "KRS six and seven", "KRS core"),
description = c("BES inc and corp", "firefly", "ladybug"),
value = c("no-sale", "45", "55"),
unit = c("","mg/dL","mg/dL"),
range = c("no-sale no-sale", "100 - 200", "42 - 87"),
ab_flag = c("", "", ""))

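I would like to know:

How can I get fwf_widths to recognize multi-line text and missing columns?
Is there a better way to read the PDF file that accounts for multi-line values and missing columns? (I have been following this tutorial, but it seems to work with a more structured PDF file.)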

str_subset(ex_result, pattern = "\\d{2}\\/")

[1] "03/11/2012 BES 3RD          BES inc and corp           no-            no- sale -"
[2] "03/11/2012 KRS six and    firefly                  45       mg/dL  100 - 200"
[3] "03/11/2012 KRS core    ladybuyg            55       mg/dL  42 - 87"
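Filtering on the date pattern keeps the first line of each record but discards the continuation lines. A minimal sketch of an alternative, assuming every record starts with a dd/mm/yyyy-style date and every other line continues the record above it, is to flag the date-leading lines and use cumsum to assign each line to its record. This uses base R only and groups the lines; it does not yet split them into columns.

```r
# Sample lines copied from the question
ex_result <- c(
  "03/11/2012 BES 3RD          BES inc and corp           no-            no- sale -",
  "           group with                           sale        no- sale",
  "           boxes",
  "03/11/2012 KRS six and    firefly                  45       mg/dL  100 - 200",
  "           seven",
  "03/11/2012 KRS core    ladybuyg            55       mg/dL  42 - 87"
)

# Assumption: a line opens a new record when it begins with a date
starts_record <- grepl("^[0-9]{2}/[0-9]{2}/[0-9]{4}", ex_result)

# The running count of record starts gives each line the id of its record
record_id <- cumsum(starts_record)

# One character vector of physical lines per logical record
records <- split(ex_result, record_id)
lengths(records)
```

Each element of records can then be parsed on its own, for example by reading the first line for the single-line cells and appending the text found in the continuation lines.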
