How to parse a large (6 GB) XML file in Python and save some elements in a list or table



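I am trying to parse an XML file (6 GB) with PowerShell and save some of its elements to a txt file. I want to write the concatenation of first name, middle name, and surname to the txt file, so that each person has one complete name wherever the parts are available, together with the date for one specific date type.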

<Person id="44855" action="chg" date="26-Aug-2022">
    <Gender>Male</Gender>
    <ActiveStatus>Active</ActiveStatus>
    <Deceased>No</Deceased>
    <NameDetails>
        <Name NameType="Primary Name">
            <NameValue>
                <FirstName>*****</FirstName>
                <Surname>******</Surname>
            </NameValue>
        </Name>
        <Name NameType="Low Quality AKA">
            <NameValue>
                <FirstName>****</FirstName>
            </NameValue>
        </Name>
        <Name NameType="Spelling Variation">
            <NameValue>
                <FirstName>*****</FirstName>
                <Surname>****</Surname>
            </NameValue>
        </Name>
    </NameDetails>
    <DateDetails>
        <Date DateType="Inactive as of">
            <Datevalue Day="12" Month="Oct" Year="2019"/>
        </Date>
        <Date DateType="Date of Birth">
            <Datevalue Day="1" Month="Jan" Year="1980"/>
        </Date>
    </DateDetails>
</Person>

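I tried parsing it with xml.etree.ElementTree but got a memory error, and pandas read_xml failed with a memory error as well. I don't have permission to install lxml, so I can't use lxml.etree. This is my first time posting a question and I'm not sure I'm doing it properly, so feel free to ask me anything. I really need help.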

My code so far looks like this (reused from a similar question on this platform):

using assembly System.Xml
using assembly System.Xml.Linq
$Filename = "C:\Users\c096830\Desktop\PFA2_202208262200_D.xml"
$reader = [System.Xml.XmlReader]::Create($Filename)
Write-Host "O script está a executar..."
while($reader.EOF -eq $False)
{
    if($reader.Name -ne "Person")
    {
        $reader.ReadToFollowing("Person")
    }
    if($reader.EOF -eq $False)
    {
        $xPerson = [System.Xml.Linq.XElement]::ReadFrom($reader)
        $Names = $xPerson.Descendants("Name")
        $Dates = $xPerson.Descendants("DateDetails")
        $activestatus = $xPerson.Descendants("ActiveStatus").Value
        if($activestatus -eq 'Inactive')
        {
            foreach($name in $Names)
            {
                $nameType = $name.Attribute("NameType").Value
                if($nameType -ne "Primary Name")
                {
                    $namevalue = $name.Elements("NameValue")
                    foreach($name in $namevalue)
                    {
                        $firstName = $name.Descendants("FirstName").Value
                        $middleName = $name.Descendants("MiddleName").Value
                        $surName = $name.Descendants("Surname").Value
                        $fullname = $firstname + " " + $middleName + " " + $surName
                    }
                }else{
                    $firstName = $name.Descendants("FirstName").Value
                    $middleName = $name.Descendants("MiddleName").Value
                    $surName = $name.Descendants("Surname").Value
                    $fullname = $firstname + " " + $middleName + " " + $surName
                }
            }
            foreach($date in $Dates)
            {
                if($date.Attributes("DateType").Value -like "Inactive as of")
                {
                    $datevalue = $date.Descendants("DateValue")
                    $day = $datevalue.Attribute("Day").Value
                    $month = $datevalue.Attribute("Month").Value
                    $year = $datevalue.Attribute("Year").Value
                    $fulldate = $day + "/" + $month + "/"+"year"
                    $namedate = $fullname, $fulldate|
                    Out-File -FilePath C:\Users\... -Append
                }
            }
        }
    }
}

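I think that at this size it is easiest to write a custom XML processor; xml.etree seems to be very inefficient for larger files.
Here is an article that shows how this works.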
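
As a rough illustration of the streaming idea, and assuming only the Python standard library is available (lxml cannot be installed here), `xml.etree.ElementTree.iterparse` can hand back one `Person` element at a time so that the full 6 GB tree is never built in memory. The helper names `iter_persons` and `inactive_ids`, the `Records` root element, and the inline sample are made up for this sketch; only `Person` and `ActiveStatus` come from the sample in the question.

```python
import io
import xml.etree.ElementTree as ET


def iter_persons(source):
    """Yield each completed <Person> element, then release its contents.

    `source` may be a file path or a binary file-like object.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "Person":
            yield elem
            # The caller is done with this Person by the time we resume here.
            elem.clear()  # free the Person's own subtree
            root.clear()  # drop finished children still referenced by the root


# Hypothetical miniature input; the real root element name is not known.
SAMPLE = b"""<Records>
  <Person id="1"><ActiveStatus>Active</ActiveStatus></Person>
  <Person id="2"><ActiveStatus>Inactive</ActiveStatus></Person>
</Records>"""


def inactive_ids(source):
    """Collect the id attribute of every Person whose ActiveStatus is Inactive."""
    return [p.get("id") for p in iter_persons(source)
            if p.findtext("ActiveStatus") == "Inactive"]


if __name__ == "__main__":
    print(inactive_ids(io.BytesIO(SAMPLE)))
```

For the real file, a path string can be passed to `iter_persons` instead of the `BytesIO` object. Anything needed from a `Person` has to be read before asking the generator for the next one, because the element is emptied at that point.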

Try the following PowerShell script:

using assembly System.Xml
using assembly System.Xml.Linq
$Filename = "C:\temp\test.xml"
$reader = [System.Xml.XmlReader]::Create($Filename)
Write-Host "O script está a executar..."
# Stream the file: only one Person element is materialized at a time.
while($reader.EOF -eq $False)
{
    # Advance to the next Person unless the reader is already positioned on one.
    if($reader.Name -ne "Person")
    {
        # Discard the Boolean return value so it is not written to the output.
        $null = $reader.ReadToFollowing("Person")
    }
    if($reader.EOF -eq $False)
    {
        # Load just this Person subtree; the reader ends up positioned after it.
        $xPerson = [System.Xml.Linq.XElement]::ReadFrom($reader)
        $Names = $xPerson.Descendants("Name")
        $Dates = $xPerson.Descendants("Date")
        $activestatus = $xPerson.Descendants("ActiveStatus").Value
        # The sample record is 'Active'; use the commented-out test to select inactive records instead.
        if($activestatus -eq 'Active')
#       if($activestatus -eq 'Inactive')
        {
            foreach($name in $Names)
            {
                $nameType = $name.Attribute("NameType").Value
                if($nameType -ne "Primary Name")
                {
                    $namevalue = $name.Elements("NameValue")
                    # Separate loop variable so the outer $name is not overwritten.
                    foreach($nv in $namevalue)
                    {
                        $firstName = $nv.Descendants("FirstName").Value
                        $middleName = $nv.Descendants("MiddleName").Value
                        $surName = $nv.Descendants("Surname").Value
                        $fullname = $firstname + " " + $middleName + " " + $surName
                    }
                }
                else
                {
                    $firstName = $name.Descendants("FirstName").Value
                    $middleName = $name.Descendants("MiddleName").Value
                    $surName = $name.Descendants("Surname").Value
                    $fullname = $firstname + " " + $middleName + " " + $surName
                }
            }
            # At this point $fullname holds the last name that was processed.
            foreach($date in $Dates)
            {
                Write-Host "date"
                if($date.Attributes("DateType").Value -like "Inactive as of")
                {
                    # Element name is case-sensitive: the sample uses "Datevalue".
                    $datevalue = $date.Descendants("Datevalue")
                    $day = $datevalue.Attribute("Day").Value
                    $month = $datevalue.Attribute("Month").Value
                    $year = $datevalue.Attribute("Year").Value
                    $fulldate = $month + " " + $day + " " + $year
                    Write-Host "fulldate = " $fulldate
                    $dateTime = [System.DateTime]::Parse($fulldate)
                    #Out-File -FilePath C:\temp\test.txt -Append
                    Write-Host "Date = "  $dateTime.ToString("MM-dd-yyyy")
                }
            }
        }
    }
}
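
Since the question title asks about Python, the per-person extraction in the script above could look roughly like the following with the standard library only. This is a sketch under assumptions, not a port of the script: the function names `full_names`, `date_of_type`, and `person_lines` are invented here, the names in `PERSON_XML` are placeholders for the masked values in the question, and `%b` in `strptime` assumes English month abbreviations (the default C locale). Unlike the script, it emits one line per `NameValue` and skips missing name parts so no double spaces appear.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Placeholder record shaped like the sample in the question.
PERSON_XML = """<Person id="44855">
  <ActiveStatus>Inactive</ActiveStatus>
  <NameDetails>
    <Name NameType="Primary Name">
      <NameValue><FirstName>Ann</FirstName><Surname>Lee</Surname></NameValue>
    </Name>
    <Name NameType="Low Quality AKA">
      <NameValue><FirstName>Annie</FirstName></NameValue>
    </Name>
  </NameDetails>
  <DateDetails>
    <Date DateType="Inactive as of"><Datevalue Day="12" Month="Oct" Year="2019"/></Date>
    <Date DateType="Date of Birth"><Datevalue Day="1" Month="Jan" Year="1980"/></Date>
  </DateDetails>
</Person>"""


def full_names(person):
    """Return one 'First Middle Surname' string per <NameValue>, skipping missing parts."""
    names = []
    for value in person.iter("NameValue"):
        parts = (value.findtext(tag, default="").strip()
                 for tag in ("FirstName", "MiddleName", "Surname"))
        joined = " ".join(p for p in parts if p)
        if joined:
            names.append(joined)
    return names


def date_of_type(person, date_type):
    """Return the first date with the given DateType as a datetime, or None."""
    for date in person.iter("Date"):
        if date.get("DateType") == date_type:
            dv = date.find("Datevalue")
            if dv is not None:
                text = f'{dv.get("Day")} {dv.get("Month")} {dv.get("Year")}'
                return datetime.strptime(text, "%d %b %Y")
    return None


def person_lines(person, date_type="Inactive as of"):
    """Build tab-separated 'name<TAB>MM-dd-yyyy' lines for one Person."""
    when = date_of_type(person, date_type)
    stamp = when.strftime("%m-%d-%Y") if when else ""
    return [f"{name}\t{stamp}" for name in full_names(person)]


if __name__ == "__main__":
    for line in person_lines(ET.fromstring(PERSON_XML)):
        print(line)
```

In a real run, each `Person` element would come from a streaming parser rather than `ET.fromstring`, and the lines would be written to an open text file instead of printed. `strptime` raises `ValueError` if a day, month, or year attribute is missing, so records with partial dates would need their own handling.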
