R - Loop through a data frame row by row to generate large nested XML records (HL7 format)



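I am currently trying to use R to convert the records in a data.frame into nested XML records. I have some experience parsing XML documents in R, but I have never needed to write them. I have looked for resources that explain how to do this, but everything I found was either very basic or focused only on reading XML into R rather than writing it.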

Here is an example of my data. The actual data has several hundred thousand rows.

example <- structure(list(patientid = c(10001, 10002, 10003, 10004, 10005, 10006, 10007, 10008, 10009, 100010), firstname = c("Jane1","Jane2", "Jane3", "Jane4", "Jane5", "Jane6", "Jane7", "Jane8", "Jane9","Jane10"), lastname = c("Doe1", "Doe2", "Doe3", "Doe4", "Doe5", "Doe6","Doe7", "Doe8", "Doe9", "Doe10"), middle = c("Middle1", "Middle2", "Middle3","Middle4", "Middle5", "Middle6", "Middle7", "Middle8", "Middle9", "Middle10"), dob = c("20150101", "20150102", "20150103", "20150104", "20150105","20150106", "20150107", "20150108", "20150109", "20150110"),organizationname = c("Practice 1", "Practice 2","Practice 3", "Practice 4","Practice 5", "Practice 6", "Practice 7","Practice 8", "Practice 9", "Practice 10"), organizationid = c(90L, 61L, 32L, 21L, 3L, 28L, 53L, 8L,60L, 3L), numericvalue1 = c(6.86105238215947, 13.0761869792404,1.33006454293633, 10.2726574035132, NA, NA, NA, NA, 20.2213535916207,43.123550939618), numericunitcd = c("%", "%", "%", "%", "%","%", "%", "%", "%", "%"), observationcode = c("ASCVD-10YR","ASCVD-10YR", "ASCVD-10YR", "ASCVD-10YR", "ASCVD-10YR", "ASCVD-10YR","ASCVD-10YR", "ASCVD-10YR", "ASCVD-10YR", "ASCVD-10YR"),text = c("ASCVD 10 Year Risk Score", "ASCVD 10 Year Risk Score","ASCVD 10 Year Risk Score", "ASCVD 10 Year Risk Score", "ASCVD 10 Year Risk Score","ASCVD 10 Year Risk Score", "ASCVD 10 Year Risk Score", "ASCVD 10 Year Risk Score","ASCVD 10 Year Risk Score", "ASCVD 10 Year Risk Score"),observationcodesystem = c("CUSTOM", "CUSTOM", "CUSTOM","CUSTOM", "CUSTOM", "CUSTOM", "CUSTOM","CUSTOM", "CUSTOM", "CUSTOM"), dateofobservation = c("20150716","20150716", "20150716", "20150716", "20150716", "20150716","20150716", "20150716", "20150716", "20150716"), providerid = c(400001,400002, 400003, 400004, 400005, 400006, 400007, 400008, 400009,4000010), providerfirst = c("Doogie1", "Doogie2", "Doogie3","Doogie4", "Doogie5", "Doogie6", "Doogie7", "Doogie8", "Doogie9","Doogie10"), providerlast = c("Howser1", "Howser2", "Howser3","Howser4", "Howser5", 
"Howser6", "Howser7", "Howser8", "Howser9","Howser10")), .Names = c("patientid", "firstname", "lastname","middle", "dob", "organizationname", "organizationid", "numericvalue1","numericunitcd", "observationcode", "text", "observationcodesystem","dateofobservation", "providerid", "providerfirst", "providerlast"), row.names = c(1L, 6L, 7L, 8L, 12L, 15L, 21L, 167392L, 167412L,167420L), class = "data.frame")

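Ultimately I need to write each row of data into the following structure (note: I could not find a way to highlight the fields in question, but they are the column names of the data.frame above, i.e. example$column):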

<HL7Message DomainID="1" DomainName="Domain" OrganizationID="example$organizationid" OrganizationName="example$organizationname" SourceSystem="DR">
    <MSH parentseq="-1" seq="1">
        <Segment component="-1" field="0" subcomponent="-1">MSH</Segment>
        <FieldSeparator component="-1" field="1" subcomponent="-1">|</FieldSeparator>
        <EncodingCharacters component="-1" field="2" subcomponent="-1">^~&amp;</EncodingCharacters>
        <SendingFacility component="-1" field="4" subcomponent="-1">
            <NamespaceID component="1" field="4" subcomponent="-1">RP-1</NamespaceID>
        </SendingFacility>
        <DateTime component="-1" field="7" subcomponent="-1">
            <Time component="1" field="7" subcomponent="-1">systemdatetime</Time>
        </DateTime>
        <MessageType component="-1" field="9" subcomponent="-1">
            <MessageCode component="1" field="9" subcomponent="-1">ADT</MessageCode>
            <TriggerEvent component="2" field="9" subcomponent="-1">A28</TriggerEvent>
        </MessageType>
    </MSH>
    <PID parentseq="-1" seq="2">
        <Segment component="-1" field="0" subcomponent="-1">PID</Segment>
        <SetID-PID component="-1" field="1" subcomponent="-1">1</SetID-PID>
        <PatientIdentifierList component="-1" field="3" subcomponent="-1">
            <IDNumber component="1" field="3" subcomponent="-1">example$patientid</IDNumber>
        </PatientIdentifierList>
        <PatientName component="-1" field="5" subcomponent="-1">
            <FamilyName component="1" field="5" subcomponent="-1">
                <Surname component="1" field="5" subcomponent="1">example$firstname</Surname>
            </FamilyName>
            <GivenName component="2" field="5" subcomponent="-1">example$lastname</GivenName>
            <SecondAndFurtherGivenNames component="3" field="5" subcomponent="-1">example$middle</SecondAndFurtherGivenNames>
        </PatientName>
        <DateTimeOfBirth component="-1" field="7" subcomponent="-1">
            <Time component="1" field="7" subcomponent="-1">example$dob</Time>
        </DateTimeOfBirth>
    </PID>
    <PV1 parentseq="-1" seq="3">
        <Segment component="-1" field="0" subcomponent="-1">PV1</Segment>
        <SetID-PV1 component="-1" field="1" subcomponent="-1">1</SetID-PV1>
        <PatientClass component="-1" field="2" subcomponent="-1">O</PatientClass>
        <AssignedPatientLocation component="-1" field="3" subcomponent="-1">
            <PointOfCare component="1" field="3" subcomponent="-1">example$organizationname</PointOfCare>
        </AssignedPatientLocation>
        <AttendingDoctor component="-1" field="7" subcomponent="-1">
            <IDNumber component="1" field="7" subcomponent="-1">example$providerid</IDNumber>
            <FamilyName component="2" field="7" subcomponent="-1">
                <Surname component="2" field="7" subcomponent="1">example$providerlast</Surname>
            </FamilyName>
            <GivenName component="3" field="7" subcomponent="-1">example$providerfirst</GivenName>
        </AttendingDoctor>
        <ReferringDoctor component="-1" field="8" subcomponent="-1">
        </ReferringDoctor>
    </PV1>
    <OBX parentseq="3" seq="4">
        <Segment component="-1" field="0" parentseq="-1" subcomponent="-1">OBX</Segment>
        <ObservationIdentifier component="-1" field="3" parentseq="-1" subcomponent="-1">
            <Identifier component="1" field="3" parentseq="-1" subcomponent="-1">example$observationcode</Identifier>
            <Text component="2" field="3" parentseq="-1" subcomponent="-1">example$text</Text>
            <NameofCodingSystem component="3" field="3" parentseq="-1" subcomponent="-1">example$observationcodesystem</NameofCodingSystem>
        </ObservationIdentifier>
        <ObservationValue component="-1" field="5" parentseq="-1" subcomponent="-1">
            <Identifier component="1" field="5" parentseq="-1" subcomponent="-1">example$numericvalue1</Identifier>
        </ObservationValue>
        <Units component="-1" field="6" parentseq="-1" subcomponent="-1">
            <Identifier component="1" field="6" parentseq="-1" subcomponent="-1">example$numericunitcd</Identifier>
        </Units>
        <ObservationResultStatus component="-1" field="11" parentseq="-1" subcomponent="-1">F</ObservationResultStatus>
        <DateTimeOfObservation component="-1" field="14" parentseq="-1" subcomponent="-1">
            <Time component="1" field="14" parentseq="-1" subcomponent="-1">example$dateofobservation</Time>
        </DateTimeOfObservation>
    </OBX>
    <ZPI parentseq="-1" seq="8">
        <Segment component="-1" field="0" subcomponent="-1">ZPI</Segment>
        <RecordType component="-1" field="1" subcomponent="-1">
            <Text component="2" field="1" subcomponent="-1">Risk Score</Text>
        </RecordType>
    </ZPI>
</HL7Message>

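I have looked at the saveXML {XML} and write.xml {kulife} functions, but I am still very confused. Do I need to write each part of a single row separately (MSH, PID, PV1, OBX, ZPI) and then concatenate them before looping on to the next row of data? Thanks to anyone who can help me better understand how to do this.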

You could simply save the template XML above to a file and gsub the values into it, particularly if you already know that the template XML is well-formed.

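library(XML)
# helper function to sanitize strings for XML (escapes &, <, > and quotes);
# insertEntities and XMLEntities are package internals, hence the ':::'
sanitize <- function(str) {
    XML:::insertEntities(str, XML:::XMLEntities)
}
xmlTemplate <- readLines('template.xml')
xmlLines <- sapply(seq_len(nrow(example)),
       function(i) {
           o <- xmlTemplate
           for (n in names(example)) {
               # successively replace each example$<column> placeholder with this row's value
               o <- gsub(paste0('example$', n), sanitize(example[i, n]), o, fixed = TRUE)
           }
           o
       })
# join the lines of all records into a single newline-separated string
out.xml <- paste(xmlLines, collapse = '\n')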
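Two details are worth checking before running this on the full data: values that contain XML special characters, and missing values such as the NA entries in numericvalue1. The following is a minimal base-R sketch of the same substitution idea on a toy template, not the code from the answer above. The names escape_xml, fill_template, toy_template and toy are hypothetical, and mapping NA to an empty string is an assumption you may want to change.

```r
# Escape the five XML special characters. "&" is handled first so that the
# entities produced by the later substitutions are not escaped a second time.
escape_xml <- function(x) {
  x <- as.character(x)
  x[is.na(x)] <- ""  # assumption: a missing value becomes an empty string
  x <- gsub("&", "&amp;", x, fixed = TRUE)
  x <- gsub("<", "&lt;", x, fixed = TRUE)
  x <- gsub(">", "&gt;", x, fixed = TRUE)
  x <- gsub("\"", "&quot;", x, fixed = TRUE)
  gsub("'", "&apos;", x, fixed = TRUE)
}

# Fill every example$<column> placeholder in the template with one row's values.
fill_template <- function(template, row) {
  # Longest column names first, so a name that happens to be a prefix of
  # another name cannot clobber the longer placeholder.
  cols <- names(row)[order(nchar(names(row)), decreasing = TRUE)]
  out <- template
  for (n in cols) {
    out <- gsub(paste0("example$", n), escape_xml(row[[n]]), out, fixed = TRUE)
  }
  out
}

# Toy template and data (hypothetical, much smaller than the HL7 template).
toy_template <- c(
  '<Patient org="example$organizationname">',
  '  <ID>example$patientid</ID>',
  '  <Value>example$numericvalue1</Value>',
  '</Patient>'
)
toy <- data.frame(
  patientid = c(10001, 10002),
  organizationname = c("Smith & Sons", "Practice <2>"),
  numericvalue1 = c(6.86, NA),
  stringsAsFactors = FALSE
)

records <- lapply(seq_len(nrow(toy)),
                  function(i) fill_template(toy_template, toy[i, ]))
cat(records[[1]], sep = "\n")
```

Because the matching is done with fixed = TRUE, the "$" in the placeholder is treated literally rather than as a regular-expression anchor.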

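The brew package lets you write templates similar to the one above for this purpose. This works fine as long as you know your template XML is well-formed and that no other occurrence of the string example${something} needs to be left as-is.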

If you really want to do it the XML way, you can read the template string into XML, set the appropriate attributes, and then write it back out:

library(XML)
nodes <- lapply(seq_len(nrow(example)), function(i) {
    # re-parse the template for every row so each record gets a fresh document
    xmlTemplate <- xmlTreeParse('template.xml', useInternalNodes = TRUE)
    n <- getNodeSet(xmlTemplate, '/HL7Message')[[1]]
    xmlAttrs(n) <- c(OrganizationID = example$organizationid[i], OrganizationName = example$organizationname[i])
    # and so on for all the other values you have to set.
    getNodeSet(xmlTemplate, '/HL7Message')[[1]]
    })
# Then write out all the nodes.
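As a rough illustration of the remaining steps, here is a small sketch that sets one attribute pair and one element value on a toy template and serialises each record with saveXML. The template, the toy data and the helper name build_record are hypothetical. It assumes that `xmlValue<-` can replace the text of an element with a single text child, which is how I understand the XML package to behave.

```r
library(XML)

# Toy stand-in for the HL7 template (hypothetical, heavily shortened).
toy_template <- '<HL7Message OrganizationID="?" OrganizationName="?">
  <PID><IDNumber>?</IDNumber></PID>
</HL7Message>'

toy <- data.frame(organizationid = c(90L, 61L),
                  organizationname = c("Practice 1", "Practice & 2"),
                  patientid = c(10001, 10002),
                  stringsAsFactors = FALSE)

build_record <- function(i) {
  # Parse a fresh copy of the template for this row.
  doc <- xmlParse(toy_template, asText = TRUE)
  root <- xmlRoot(doc)
  # Overwrite the attributes on the root element.
  xmlAttrs(root) <- c(OrganizationID = as.character(toy$organizationid[i]),
                      OrganizationName = toy$organizationname[i])
  # Navigate to one element by XPath and replace its text content.
  id_node <- getNodeSet(doc, "/HL7Message/PID/IDNumber")[[1]]
  xmlValue(id_node) <- as.character(toy$patientid[i])
  # Serialise the record; special characters are escaped on output.
  saveXML(root)
}

records <- vapply(seq_len(nrow(toy)), build_record, character(1))

# Write all records to one file (a temporary file in this sketch).
out_file <- tempfile(fileext = ".xml")
writeLines(records, out_file)
```

No manual sanitising is needed here because the serialiser handles escaping, for example the "&" in the second organisation name.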

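It is a bit more cumbersome, because you have to navigate to the appropriate attribute or value of each node and replace it, but I suppose it is more foolproof. Still, as mentioned, if you know your template XML is well-formed, the plain gsub + sanitize approach should be fine.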

Figured it out.

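library(stringr)
# Template lines with placeholders written as <<columnname>>;
# the file name here is an assumption, the original post does not show it.
xml_lines <- readLines("template.xml")
xml_replacer <- function(df, xml_template, unique_id = "patientid") {
  for (i in seq_len(nrow(df))) {
    # named vector: names are the <<column>> placeholders, values are this row's data
    replacements <- unlist(df[i, ])
    names(replacements) <- paste0("<<", names(df), ">>")
    xml_result <- str_replace_all(xml_template, replacements)
    # one output file per row, named after the row's unique id
    writeLines(xml_result, paste0(df[i, unique_id], "_xml_result.xml"))
  }
  return(TRUE)
}
xml_replacer(example, xml_lines)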
