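I have been trying to get an XML file into a data frame but am struggling. I have tried several approaches, and this is where I am.
My XML file contains about 20K segments that look like this: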
<?xml version="1.0"?>
<data experimentId="5244" savingTime="2018-01-06T14:25:48-0500" eventType="Workflow" userId="303">
<root>
<set id="ASSAY_WORKFLOW">
<row state="MODIFIED" pk="5905_Standard_Validation_Standard_Validation">
<field name="ASSAY_ID">5244</field>
<field name="WORKFLOW_ID">5905_Standard_Validation_Standard_Validation</field>
<field name="WORKFLOW_STATE">0</field>
<field name="ASSAY_WORKFLOW_STATE">InDelegation</field>
<field name="WORKFLOW_LAST_STEP_ID">17896</field>
</row>
</set>
<set id="WORKFLOW_STEPS">
<row state="NEW" pk="17896">
<field name="STEP_ID">17896</field>
<field name="WORKFLOW_ID">5905_Standard_Validation_Standard_Validation</field>
<field name="STEP_DATE">2018-01-06T14:25:45-0500</field>
<field name="STEP_DATE_TZ">America/New_York</field>
<field name="USER_ID">303</field>
<field name="USER_FULL_NAME">Ron Swanson</field>
<field name="NEW_WORKFLOW_ASSAY_STATE">InDelegation</field>
<field name="FORMER_WORKFLOW_ASSAY_STATE">Draft</field>
<field name="ROLE_ID">1</field>
</row>
</set>
<set id="WORKFLOW_STEP_VARIABLES">
<row state="NEW" pk="17896¤nextActorId">
<field name="STEP_ID">17896</field>
<field name="VARIABLE_ID">nextActorId</field>
<field name="VALUE">2</field>
</row>
<row state="NEW" pk="17896¤validateToPendingValidation">
<field name="STEP_ID">17896</field>
<field name="VARIABLE_ID">validateToPendingValidation</field>
<field name="VALUE">false</field>
</row>
<row state="NEW" pk="17896¤signToPendingSignature">
<field name="STEP_ID">17896</field>
<field name="VARIABLE_ID">signToPendingSignature</field>
<field name="VALUE">false</field>
</row>
<row state="NEW" pk="17896¤comment">
<field name="STEP_ID">17896</field>
<field name="VARIABLE_ID">comment</field>
<field name="VALUE">GH-VAP, IgG1 repeats,</field>
</row>
<row state="NEW" pk="17896¤actionDelegateU">
<field name="STEP_ID">17896</field>
<field name="VARIABLE_ID">actionDelegateU</field>
<field name="VALUE">directDelegateU</field>
</row>
</set>
<set id="WORKFLOW_ROLE_NAMES">
<row state="NEW" pk="1">
<field name="ROLE_ID">1</field>
<field name="LANGUAGE_ID">2</field>
<field name="DESCRIPTION">Author</field>
</row>
</set>
</root>
</data>
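Under each root node there are child elements that share the tag "field" and carry an attribute "name". The value of "name" identifies the column I want in the data frame, and the element's text is the value for that column.
I can pull everything out individually: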
library(XML)
xmlfilealt <- xmlParse("data/eln_audit_workflow.xml")
username <- xpathSApply(xmlfilealt, "//field[@name='USER_FULL_NAME']", xmlValue)
title <- xpathSApply(xmlfilealt, "//field[@name='VALUE']", xmlValue)
state <- xpathSApply(xmlfilealt, "//field[@name='ASSAY_WORKFLOW_STATE']", xmlValue)
actionDate <- xpathSApply(xmlfilealt, "//field[@name='STEP_DATE']", xmlValue)
actor <- xpathSApply(xmlfilealt, "//field[@name='DESCRIPTION']", xmlValue)
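I planned to create a data.frame from these vectors, but they all have slightly different lengths. I think that is because some elements may be missing in some root nodes. Can someone give me a hint on how to handle this?
Thanks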
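For child elements that may or may not exist, consider iterating over the parent nodes, here <row>, by node position. Then use XPath's concat to build a list of data frames, where each column evaluates to either the desired value or a zero-length string, so that every result has columns of equal length. Finally, rbind all data frames in the list.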
# COUNT EVERY <row> IN THE DOCUMENT
row_length <- length(xpathSApply(xmlfilealt, "//row"))
# BUILD A ONE-ROW DF PER <row>
# (//row)[i] IS THE i-TH <row> IN DOCUMENT ORDER; A BARE //row[i] WOULD MATCH THE i-TH <row> WITHIN EACH <set>
df_List <- lapply(seq_len(row_length), function(i){
  data.frame(
    username = xpathSApply(xmlfilealt, sprintf("concat((//row)[%s]/field[@name='USER_FULL_NAME'],'')", i), xmlValue),
    title = xpathSApply(xmlfilealt, sprintf("concat((//row)[%s]/field[@name='VALUE'],'')", i), xmlValue),
    state = xpathSApply(xmlfilealt, sprintf("concat((//row)[%s]/field[@name='ASSAY_WORKFLOW_STATE'],'')", i), xmlValue),
    actionDate = xpathSApply(xmlfilealt, sprintf("concat((//row)[%s]/field[@name='STEP_DATE'],'')", i), xmlValue),
    actor = xpathSApply(xmlfilealt, sprintf("concat((//row)[%s]/field[@name='DESCRIPTION'],'')", i), xmlValue),
    stringsAsFactors = FALSE
  )
})
# CONCATENATE ALL DFs
finaldf <- do.call(rbind, df_List)
# CONVERT ZERO-LENGTH STRINGS TO NA
finaldf[finaldf == ""] <- NA
finaldf
#      username                 title        state               actionDate  actor
# 1        <NA>                  <NA> InDelegation                     <NA>   <NA>
# 2 Ron Swanson                  <NA>         <NA> 2018-01-06T14:25:45-0500   <NA>
# 3        <NA>                     2         <NA>                     <NA>   <NA>
# 4        <NA>                 false         <NA>                     <NA>   <NA>
# 5        <NA>                 false         <NA>                     <NA>   <NA>
# 6        <NA> GH-VAP, IgG1 repeats,         <NA>                     <NA>   <NA>
# 7        <NA>       directDelegateU         <NA>                     <NA>   <NA>
# 8        <NA>                  <NA>         <NA>                     <NA> Author
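Two XPath details carry this approach, and a small check may make them easier to see. The sketch below uses a made-up two-set document (not the asker's file) to show how a positional predicate behaves with and without parentheses, and how concat turns a missing field into a zero-length string. The document content and variable names are assumptions for illustration only.

```r
library(XML)

# Hypothetical two-set document, trimmed down from the shape of the question's sample
doc <- xmlParse('<root>
  <set id="A">
    <row><field name="X">a1</field></row>
  </set>
  <set id="B">
    <row><field name="Y">b1</field></row>
    <row><field name="Y">b2</field></row>
  </set>
</root>', asText = TRUE)

# //row[1] matches every <row> that is the first <row> child of its parent
first_per_set <- getNodeSet(doc, "//row[1]")
length(first_per_set)   # 2

# (//row)[1] matches only the first <row> in document order
first_overall <- getNodeSet(doc, "(//row)[1]")
length(first_overall)   # 1

# concat() gives a zero-length string when the field is absent from that row ...
missing_val <- xpathSApply(doc, "concat((//row)[1]/field[@name='Y'],'')", xmlValue)
# ... and the field's text when it is present
present_val <- xpathSApply(doc, "concat((//row)[3]/field[@name='Y'],'')", xmlValue)
```

The distinction matters whenever rows are spread over several <set> parents, as they are in the question's file.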
This XML is quite inconsistent, which makes it hard to parse in a uniform way. I prefer the xml2 package because I find its syntax easier to use.
library(xml2)
# read the file (path taken from the question)
page <- read_xml("data/eln_audit_workflow.xml")
# parse all of the root nodes into separate nodes
rootnodes <- xml_find_all(page, "root")
# read the desired fields from each individual root node (NA when a field is missing)
assaystate <- sapply(rootnodes, function(xnode) { xml_text(xml_find_first(xnode, "set/row/field[@name='ASSAY_WORKFLOW_STATE']")) })
stepdate <- sapply(rootnodes, function(xnode) { xml_text(xml_find_first(xnode, "set/row/field[@name='STEP_DATE']")) })
name <- sapply(rootnodes, function(xnode) { xml_text(xml_find_first(xnode, "set/row/field[@name='USER_FULL_NAME']")) })
description <- sapply(rootnodes, function(xnode) { xml_text(xml_find_first(xnode, "set/row/field[@name='DESCRIPTION']")) })
# create the desired output
df <- data.frame(assaystate = assaystate, stepdate = stepdate, name = name, description = description, stringsAsFactors = FALSE)
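The advantage of this approach is that each root node is expected to contain each of the desired fields, so every root node contributes exactly one value per column. If a field/node is missing, xml2's xml_find_first returns a missing node, and xml_text turns that into NA.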
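As a quick check of that NA behaviour, here is a minimal sketch on a made-up document in which the second root node lacks the USER_FULL_NAME field. The inline XML is an assumption for illustration. It also relies on xml_find_first accepting a whole node set, which returns one result per input node.

```r
library(xml2)

# Hypothetical snippet: the second <root> has no USER_FULL_NAME field
doc <- read_xml('<data>
  <root><set><row><field name="USER_FULL_NAME">Ron Swanson</field></row></set></root>
  <root><set><row><field name="STEP_ID">1</field></row></set></root>
</data>')

roots <- xml_find_all(doc, "root")

# One result per root node; a missing match becomes NA after xml_text()
users <- xml_text(xml_find_first(roots, "set/row/field[@name='USER_FULL_NAME']"))
users
```

Because the result always has one entry per root node, vectors built this way line up and can go straight into data.frame.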
I ignored the VALUE field, since there are multiple fields named VALUE and it was unclear whether one or all of the values are needed.
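If all of the VALUE entries turn out to be needed, one option is to keep every field in a long table first and filter afterwards. The following is only a sketch under that assumption. The inline XML is a shortened, made-up variant of the question's WORKFLOW_STEP_VARIABLES set, and the pk values and column names are illustrative.

```r
library(xml2)

# Hypothetical excerpt with two rows, each holding a VALUE field
doc <- read_xml('<data><root>
  <set id="WORKFLOW_STEP_VARIABLES">
    <row state="NEW" pk="17896_nextActorId">
      <field name="STEP_ID">17896</field>
      <field name="VARIABLE_ID">nextActorId</field>
      <field name="VALUE">2</field>
    </row>
    <row state="NEW" pk="17896_signToPendingSignature">
      <field name="STEP_ID">17896</field>
      <field name="VARIABLE_ID">signToPendingSignature</field>
      <field name="VALUE">false</field>
    </row>
  </set>
</root></data>')

rows <- xml_find_all(doc, ".//row")

# One record per <field>, tagged with the pk of the <row> it belongs to
long <- do.call(rbind, lapply(rows, function(r) {
  f <- xml_find_all(r, "field")
  data.frame(
    pk    = rep(xml_attr(r, "pk"), length(f)),
    name  = xml_attr(f, "name"),
    value = xml_text(f),
    stringsAsFactors = FALSE
  )
}))

# Keep only the VALUE entries, one per row
values <- long[long$name == "VALUE", ]
```

From the long table it is then a separate decision whether to keep one VALUE per step or all of them.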