Extract data from an XLIFF file and create a DataFrame



I have an XLIFF file with the following structure.

<?xml version="1.0" encoding="UTF-8"?>
<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="1.2" xsi:schemaLocation="urn:oasis:names:tc:xliff:document:1.2 http://docs.oasis-open.org/xliff/v1.2/os/xliff-core-1.2-strict.xsd">
<file original="" datatype="plaintext" xml:space="preserve" source-language="en" target-language="es-419">
<header>
<tool tool-id="tool" tool-name="tool" />
</header>
<body>
<trans-unit id="tool-123456789-1" resname="123456::title">
<source>Name 1 </source>
<target state="final">Name 1 target language </target>
</trans-unit>
<trans-unit id="tool-123456780-1" resname="123456::summary">
<source>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</source>
<target state="final">Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.</target>
</trans-unit>
<trans-unit id="tool-123456790-1" resname="123456::relevant">
<source>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</source>
<target state="final">Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.</target>
</trans-unit>
<trans-unit id="tool-123456791-1" resname="123456::description">
<source>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</source>
<target state="final">Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.</target>
</trans-unit>
<trans-unit id="tool-123456792-1" resname="123456::654321::from_area_code">
<source>Lorem Ipsum </source>
<target state="final">Lorem Ipsum local</target>
</trans-unit>
<trans-unit id="tool-123456793-1" resname="123456::654321::852741::content">
<source>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</source>
<target state="final">Lorem Ipsum is simply dummy text of the printing and typesetting industry local.</target>
</trans-unit>
<trans-unit id="tool-123456792-1" resname="123456::654321::from_area_code">
<source>Lorem Ipsum </source>
<target state="final">Lorem Ipsum local</target>
</trans-unit>
<trans-unit id="tool-123456793-1" resname="123456::654321::852741::content">
<source>Lorem Ipsum is simply dummy text of the printing and typesetting industry.</source>
<target state="final">Lorem Ipsum is simply dummy text of the printing and typesetting industry local.</target>
</trans-unit>

</body>
</file>
</xliff>

I want to extract the contents of the trans-unit, source, and target tags to build a DataFrame with the following structure:

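TAG             SOURCE       TARGET
title           source text  target text
description     source text  target text
summary         source text  target text
relevant        source text  target text
from_area_code  source text  target text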

Try:

import pandas as pd
import xml.etree.ElementTree as ET

tree = ET.parse("your_file.xml")
root = tree.getroot()

data = []
# XLIFF 1.2 elements live in a namespace, so each tag name must be qualified with it
for tu in root.findall(".//{urn:oasis:names:tc:xliff:document:1.2}trans-unit"):
    source = tu.find(".//{urn:oasis:names:tc:xliff:document:1.2}source")
    target = tu.find(".//{urn:oasis:names:tc:xliff:document:1.2}target")
    data.append(
        {
            # keep only the last "::"-separated part of resname, e.g. "title"
            "TAG": tu.attrib["resname"].split("::")[-1],
            "SOURCE": source.text,
            "TARGET": target.text,
        }
    )

df = pd.DataFrame(data)
print(df)

Prints:

TAG                                                                      SOURCE                                                                                     TARGET
0           title                                                                     Name 1                                                                     Name 1 target language 
1         summary  Lorem Ipsum is simply dummy text of the printing and typesetting industry.  Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.
2        relevant  Lorem Ipsum is simply dummy text of the printing and typesetting industry.  Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.
3     description  Lorem Ipsum is simply dummy text of the printing and typesetting industry.  Lorem Ipsum is simply dummy text of the printing and typesetting industry local language.
4  from_area_code                                                                Lorem Ipsum                                                                           Lorem Ipsum local
5         content  Lorem Ipsum is simply dummy text of the printing and typesetting industry.           Lorem Ipsum is simply dummy text of the printing and typesetting industry local.
6  from_area_code                                                                Lorem Ipsum                                                                           Lorem Ipsum local
7         content  Lorem Ipsum is simply dummy text of the printing and typesetting industry.           Lorem Ipsum is simply dummy text of the printing and typesetting industry local.
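
The output above repeats the from_area_code and content rows because the sample file contains those trans-unit elements twice. A possible variation, sketched below, uses a namespace mapping instead of repeating the full URI, tolerates a trans-unit that has no target, strips surrounding whitespace, and drops exact duplicate rows. The `SAMPLE` string, the `x` prefix, and the `xliff_to_frame` helper are assumptions made for illustration and do not come from the original answer.

```python
import xml.etree.ElementTree as ET

import pandas as pd

# Hypothetical, shortened sample; a real file would be loaded with ET.parse(path).getroot().
SAMPLE = """<xliff xmlns="urn:oasis:names:tc:xliff:document:1.2" version="1.2">
  <file original="" datatype="plaintext" source-language="en" target-language="es-419">
    <body>
      <trans-unit id="tool-1" resname="123456::title">
        <source>Name 1 </source>
        <target state="final">Name 1 target language </target>
      </trans-unit>
      <trans-unit id="tool-2" resname="123456::654321::from_area_code">
        <source>Lorem Ipsum </source>
        <target state="final">Lorem Ipsum local</target>
      </trans-unit>
      <trans-unit id="tool-2" resname="123456::654321::from_area_code">
        <source>Lorem Ipsum </source>
        <target state="final">Lorem Ipsum local</target>
      </trans-unit>
      <trans-unit id="tool-3" resname="123456::note">
        <source>Untranslated</source>
      </trans-unit>
    </body>
  </file>
</xliff>
"""

# The prefix "x" is arbitrary; only the namespace URI has to match the file.
NS = {"x": "urn:oasis:names:tc:xliff:document:1.2"}


def xliff_to_frame(root):
    """Build a TAG/SOURCE/TARGET DataFrame from an XLIFF 1.2 root element."""
    rows = []
    for tu in root.iterfind(".//x:trans-unit", NS):
        rows.append(
            {
                # last "::"-separated part of resname, e.g. "title"
                "TAG": tu.get("resname", "").split("::")[-1],
                # findtext returns the default when the child element is missing
                "SOURCE": tu.findtext("x:source", default="", namespaces=NS).strip(),
                "TARGET": tu.findtext("x:target", default="", namespaces=NS).strip(),
            }
        )
    frame = pd.DataFrame(rows, columns=["TAG", "SOURCE", "TARGET"])
    # remove rows that are identical in all three columns, then renumber
    return frame.drop_duplicates().reset_index(drop=True)


df = xliff_to_frame(ET.fromstring(SAMPLE))
print(df)
```

If the TAG values are unique after deduplication, `df.set_index("TAG")` would give a frame keyed by tag, which may be closer to the structure asked for. Note that `findtext` only returns the text before the first child element, so a source or target containing inline XLIFF markup would need something like `"".join(element.itertext())` instead.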
