How to vectorize a function



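I have a function that takes some free text and splits it into columns according to a list of words. It works well, but I have been advised that it would work better if it were vectorized.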

The function is called Extractor:

Extractor <- function(x, y, stra, strb, t) {
  x <- data.frame(x)
  t <- gsub("[^[:alnum:],]", " ", t)
  t <- gsub(" ", "", t, fixed = TRUE)
  x[, t] <-
    stringr::str_extract(x[, y], stringr::regex(paste(stra,
                                                      "(.*)", strb, sep = ""), 
                                                dotall = TRUE))
  x[, t] <- gsub("\\.*", "", x[, t])
  names(x[, t]) <- gsub(".", "", names(x[, t]), fixed = TRUE)
  x[, t] <- gsub("       ", "", x[, t])
  x[, t] <- gsub(stra, "", x[, t], fixed = TRUE)
  if (strb != "") {
    x[, t] <- gsub(strb, "", x[, t], fixed = TRUE)
  }
  x[, t] <- gsub("       ", "", x[, t])
  x[, t] <- ColumnCleanUp(x, t)
  return(x)
}
ColumnCleanUp <- function(x, y) {
  x <- (data.frame(x))
  x[, y] <- gsub("^\\.\n", "", x[, y])
  x[, y] <- gsub("^:", "", x[, y])
  x[, y] <- gsub(".", "\n", x[, y], fixed = TRUE)
  x[, y] <- gsub("\\s{5}", "", x[, y])
  x[, y] <- gsub("^\\.", "", x[, y])
  x[, y] <- gsub("$\\.", "", x[, y])
  return(x[, y])
}

I use it as follows:

HistolTree<-list("Hospital Number","Patient Name","DOB:","General Practitioner:",
"Date of Procedure:","Clinical Details:","Macroscopic description:","Histology:","Diagnosis:","")
for(i in 1:(length(HistolTree)-1)) {
Mypath<-Extractor(Mypath,"PathReportWhole",as.character(HistolTree[i]),
as.character(HistolTree[i+1]),as.character(HistolTree[i]))
}

An example input text is:

Mypath<-"Hospital Number 233456 Patient Name: Jonny Begood
DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and reflux Macroscopic description: 3 pieces of oesophagus, all good biopsies. Histology: These show chronic reflux and other bits n bobs. Diagnosis: Acid reflux likely"
Mypath<-data.frame(Mypath)
names(Mypath)<-"PathReportWhole"

The expected output is:

structure(list(PathReportWhole = structure(1L, .Label = "Hospital Number 233456 Patient Name: Jonny Begood\n    DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and reflux Macroscopic description: 3 pieces of oesophagus, all good biopsies. Histology: These show chronic reflux and other bits n bobs. Diagnosis: Acid reflux likely", class = "factor"), 
    HospitalNumber = " 233456 ", PatientName = " Jonny Begood", 
    DOB = " 13/01/77 ", GeneralPractitioner = NA_character_, 
    Dateofprocedure = NA_character_, ClinicalDetails = " Dyaphagia and reflux ", 
    Macroscopicdescription = " 3 pieces of oesophagus, all good biopsies\n ", 
    Histology = " These show chronic reflux and other bits n bobs\n ", 
    Diagnosis = " Acid reflux likely"), row.names = c(NA, -1L
), .Names = c("PathReportWhole", "HospitalNumber", "PatientName", 
"DOB", "GeneralPractitioner", "Dateofprocedure", "ClinicalDetails", 
"Macroscopicdescription", "Histology", "Diagnosis"), class = "data.frame")

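Basically, I call the function repeatedly in a loop (there is only one example here, but the real data frame has more than 2000 rows).

Is apply() a way of applying the function in a vectorized fashion? If not, could I have some pointers on how to vectorize it so that I can avoid the loop? I understand that vectorizing a function means applying it to a whole vector rather than looping, and that I need to convert the input list to a character vector, but I am stuck from there.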

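Rather than vectorizing your existing function, I would try to simplify your various regular expressions. I can see what you are doing: you have a data.frame with raw pathology data in an annoying single string, such as: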

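Hospital Number 233456 Patient Name: Jonny Begood DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and reflux Macroscopic description: 3 pieces of oesophagus, all good biopsies. Histology: These show chronic reflux and other bits n bobs. Diagnosis: Acid reflux likely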

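You are, I think, taking a sound approach by using the headers ("Hospital Number", "Patient Name:", ...) to extract the data ("233456", "Jonny Begood", ...). However, there is a simpler way to do this with regular expressions: use the headers as lookbehind and lookahead tokens. In the string above, the data for Hospital Number is everything between "Hospital Number" and "Patient Name:" with the whitespace trimmed, i.e. "233456". The same principle can be applied to extract each subsequent piece of data. A few more lines of code then put each piece of data into its own column of the data.frame.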
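As a minimal sketch of the lookbehind/lookahead idea on a single field: the example below uses base R (`regexpr()` and `regmatches()` with `perl = TRUE`) rather than stringr so that it runs without extra packages, and the shortened string `txt` is an illustrative assumption, not the full report.

```r
# Shortened sample text (assumption: only the first three fields are kept).
txt <- "Hospital Number 233456 Patient Name: Jonny Begood DOB: 13/01/77"

# (?<=...) is a lookbehind: the match must start right after the left header.
# (?=...)  is a lookahead: the match must stop right before the right header.
pattern <- "(?<=Hospital Number).*(?=Patient Name:)"

# Lookarounds need PCRE, hence perl = TRUE.
m <- regmatches(txt, regexpr(pattern, txt, perl = TRUE))

# The raw match still carries the surrounding spaces, so trim them.
hospital_number <- trimws(m)
print(hospital_number)
```

Neither header is part of the match, so no follow-up `gsub()` calls are needed to strip the delimiters.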

Here is your code to create the test data.frame:

Mypath<-"Hospital Number 233456 Patient Name: Jonny Begood DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and reflux Macroscopic description: 3 pieces of oesophagus, all good biopsies. Histology: These show chronic reflux and other bits n bobs. Diagnosis: Acid reflux likely"
Mypath<-data.frame(Mypath)
names(Mypath)<-"PathReportWhole"
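Note that this test string has a space between the patient name and "DOB:", whereas the sample in the question has a line break there. By default `.` does not match a newline, which is presumably why the question's code passes `dotall = TRUE`. A hedged base R sketch of the difference, using a two-field string made up for illustration:

```r
# Illustrative text with a line break between two fields (assumption).
txt <- "Patient Name: Jonny Begood\nDOB: 13/01/77"

# Without dotall, ".*" cannot cross the newline, so there is no match (-1).
plain <- regexpr("(?<=Patient Name:).*(?=DOB:)", txt, perl = TRUE)

# The inline flag (?s) turns on dotall, letting "." match the newline too.
dotall <- regexpr("(?s)(?<=Patient Name:).*(?=DOB:)", txt, perl = TRUE)

# trimws() strips the leading space and the trailing newline.
name <- trimws(regmatches(txt, dotall))
print(name)
```

If your real reports contain line breaks, the same `(?s)` prefix (or `stringr::regex(..., dotall = TRUE)`) can be applied to the patterns generated further down.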

Next, we create a character vector of the headers:

x <- c("Hospital Number", "Patient Name:", "DOB:", "General Practitioner:", "Date of Procedure:", "Clinical Details:", "Macroscopic description:", "Histology:", "Diagnosis:")

Note that these must match the headers contained in the data exactly. Also, we do not need an empty string as the final entry, as you had above.
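Because a single mismatched header silently produces `NA`, it can help to check the headers against the text before extracting. The following is only a sketch: `header_positions()` is a hypothetical helper, not part of any package, and the shortened `report` string is an assumption.

```r
report  <- "Hospital Number 233456 Patient Name: Jonny Begood DOB: 13/01/77"
headers <- c("Hospital Number", "Patient Name:", "DOB:")

# Hypothetical helper: position of each header in the text (-1 if absent).
# fixed = TRUE compares the header literally, not as a regular expression.
header_positions <- function(text, headers) {
  vapply(headers,
         function(h) as.integer(regexpr(h, text, fixed = TRUE)),
         integer(1))
}

pos <- header_positions(report, headers)
all_present <- all(pos > 0)        # every header occurs in the text
in_order    <- !is.unsorted(pos)   # and in the same order as in `headers`

# A header that differs only in case is reported as missing.
bad <- header_positions(report, "Patient name:")
```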

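Then we can write a function that takes as arguments the data.frame df, the name colName of the data.frame column containing the raw data (to keep the function as general as possible), and the vector of headers, headers: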

library(stringr)  # str_extract(), str_replace_all(), str_trim() and the %>% pipe

extractPath <- function(df, colName, headers) {
  # df: data.frame containing raw path data
  # colName: name of column containing data
  # headers: character vector of headers (delimiters in raw path data)
  for (i in seq_along(headers)) {
    # left delimiter
    delimLeft <- headers[i]
    # right delimiter, not relevant if at end of headers
    if (i < length(headers)) {
      delimRight <- headers[i+1]
      # regex to match everything between delimiting headers
      regex <- paste0("(?<=", delimLeft, ").*(?=", delimRight, ")")
    } else {
      # regex to match everything to right of final delimiting header
      regex <- paste0("(?<=", delimLeft, ").*$")
    }
    # generate column name for new column
    # use alpha characters only (i.e. ignore colon), and remove spaces
    columnName <- str_extract(delimLeft, "[[:alpha:] ]*") %>% str_replace_all(" ", "")
    # create new column of data, and trim whitespace
    df[[columnName]] <- str_extract(df[[colName]], regex) %>% str_trim()
  }
  # return output data.frame
  df
}

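Here I am using the tidyverse package ecosystem, namely dplyr and stringr. The function loops through each header, generates the appropriate regular expression, and then applies it to create a new column of data.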
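To make the generated patterns visible, here is a sketch that builds all of them at once with vectorized `paste0()` and applies them to one string. It uses base R instead of stringr so that it is self-contained, and the three-header subset is an assumption made for brevity.

```r
txt     <- "Hospital Number 233456 Patient Name: Jonny Begood DOB: 13/01/77"
headers <- c("Hospital Number", "Patient Name:", "DOB:")

# Pair each header with the one that follows it; the last has no successor.
left  <- headers
right <- c(headers[-1], NA)

# Between two headers: lookbehind + lookahead. After the last: match to end.
patterns <- ifelse(is.na(right),
                   paste0("(?<=", left, ").*$"),
                   paste0("(?<=", left, ").*(?=", right, ")"))
print(patterns)

# Apply each pattern to the text (assumes every pattern matches) and trim.
values <- vapply(patterns,
                 function(p) trimws(regmatches(txt, regexpr(p, txt, perl = TRUE))),
                 character(1), USE.NAMES = FALSE)
print(values)
```

The headers are pasted into the patterns unescaped, which is fine for these ones; headers containing regex metacharacters such as `(` or `.` would need escaping first.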

Call the function like this:

out <- extractPath(Mypath, "PathReportWhole", x)

Here is the output for the one-row test data.frame:

> glimpse(out)
Observations: 1
Variables: 10
$ PathReportWhole        <fctr> Hospital Number 233456 Patient Name: Jonny Begood DOB: 13/01/77 General Practitioner: Dr De'ath Date of Procedure: 13/01/99 Clinical Details: Dyaphagia and re...
$ HospitalNumber         <chr> "233456"
$ PatientName            <chr> "Jonny Begood"
$ DOB                    <chr> "13/01/77"
$ GeneralPractitioner    <chr> "Dr De'ath"
$ DateofProcedure        <chr> "13/01/99"
$ ClinicalDetails        <chr> "Dyaphagia and reflux"
$ Macroscopicdescription <chr> "3 pieces of oesophagus, all good biopsies."
$ Histology              <chr> "These show chronic reflux and other bits n bobs."
$ Diagnosis              <chr> "Acid reflux likely"

(You may want to tidy the data further, convert the character dates, etc.)
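As one possible way to convert the character dates, assuming they are always in day/month/two-digit-year form as in the sample, `as.Date()` with an explicit format works on the whole column at once. The small `out` data.frame here is a stand-in for the real output.

```r
# Stand-in for the extracted columns (assumption: dd/mm/yy throughout).
out <- data.frame(DOB = "13/01/77",
                  DateofProcedure = "13/01/99",
                  stringsAsFactors = FALSE)

# %d = day, %m = month, %y = two-digit year.
# R reads two-digit years 69-99 as 19xx and 00-68 as 20xx.
out$DOB             <- as.Date(out$DOB, format = "%d/%m/%y")
out$DateofProcedure <- as.Date(out$DateofProcedure, format = "%d/%m/%y")

str(out)
```

Because of the two-digit-year rule, a date of birth before 1969 would be read as a 20xx date, so check the century if your data includes older patients.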

I also tested it with a data.frame of several thousand rows, and it ran in a second or so.
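The reason this scales is that the regex functions are already vectorized over rows: the only loop is over the handful of headers, so no apply() is needed. Below is a base R sketch on replicated data; the 2000-row vector and the deliberately malformed row are assumptions for illustration.

```r
one     <- "Hospital Number 233456 Patient Name: Jonny Begood DOB: 13/01/77"
reports <- rep(one, 2000)
reports[5] <- "no headers in this row"   # malformed row to show NA handling

pattern <- "(?<=Patient Name:).*(?=DOB:)"

# One call processes every row; regexpr() returns -1 where nothing matches.
m   <- regexpr(pattern, reports, perl = TRUE)
hit <- as.integer(m) > 0

# regmatches() returns only the matched rows, so fill a pre-sized NA vector
# to keep the result aligned with the input rows.
patient <- rep(NA_character_, length(reports))
patient[hit] <- trimws(regmatches(reports, m))

table(patient, useNA = "ifany")
```

`stringr::str_extract()` in the function above behaves the same way, returning `NA` for rows with no match.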
