Hi everyone,
In the code below I am trying to convert an XML feed (https://issat.ttn.tn/cu/export/akouda.php) to a CSV file.
Code:
import requests
import xml.etree.ElementTree as Xet
import pandas as pd
from html import unescape
url = "https://issat.ttn.tn/cu/export/akouda.php"
s = unescape(requests.get(url).text)[5:-6]
df = pd.read_xml(s, xpath="//phases/* | //time")
#df["value"] = df["value"].ffill()
df
df.to_csv('output0.csv')
Here is part of the result:
,value,phases,id,act_energy,react_energy,current_inst,voltage_inst,power_inst,power_fact,thd
0,2022-04-14 15:45:00,,,,,,,,,
1,,,0.0,0.3000000000001819,0.4324445747717669,2.0,241.7,0.27,0.57,27.39
2,,,1.0,0.0,0.0,13.06,242.5,0.66,0.2,22.69
3,,,2.0,0.0,0.0,1.07,243.7,0.15,0.58,48.05
4,2022-04-14 15:30:00,,,,,,,,,
5,,,0.0,0.2999999999999545,0.108885460271677,1.02,240.4,0.23,0.94,23.7
6,,,1.0,0.0,0.0,14.54,241.0,0.86,0.24,23.99
7,,,2.0,0.0,0.0,1.07,243.5,0.15,0.59,48.08
8,2022-04-14 15:15:00,,,,,,,,,
9,,,0.0,0.3999999999998636,0.5618044649492236,0.7,243.1,0.1,0.58,42.46
10,,,1.0,0.0,0.0,17.82,241.9,1.99,0.46,33.59
11,,,2.0,0.0,0.0,1.08,246.3,0.15,0.58,51.09
12,2022-04-14 15:00:00,,,,,,,,,
13,,,0.0,0.6000000000001364,0.8427066974243144,0.71,241.7,0.1,0.58,44.02
14,,,1.0,0.0,0.0,18.74,240.5,2.21,0.49,31.3
15,,,2.0,0.0,0.0,1.08,245.3,0.15,0.58,51.77
I need to:
- remove the rows that have a date but no readings (rows 0, 4, 8, 12)
- keep only the rows with id 1
- drop the phases column
Can anyone help?
Consider running two read_xml calls, adjusting xpath and using attrs_only. Because both result sets are at the same level (one <phase> with @id=1 for each <time>), you can join the results:
...
time_df = pd.read_xml(s, xpath="//time", attrs_only=True, names=["time"])
phase_df = pd.read_xml(s, xpath="//phase[@id=1]")
time_phase_df = time_df.join(phase_df)
time_phase_df
time id act_energy ... power_inst power_fact thd
0 2022-04-15 00:00:00 1 0 ... 0.84 0.28 22.35
1 2022-04-14 23:45:00 1 0 ... 0.83 0.28 23.16
2 2022-04-14 23:30:00 1 0 ... 0.83 0.28 22.43
3 2022-04-14 23:15:00 1 0 ... 0.83 0.28 22.56
4 2022-04-14 23:00:00 1 0 ... 0.82 0.28 22.57
... .. ... ... ... ... ...
1289 2022-04-01 02:15:00 1 0 ... 0.69 0.25 22.70
1290 2022-04-01 02:00:00 1 0 ... 0.69 0.25 22.66
1291 2022-04-01 01:45:00 1 0 ... 0.69 0.25 22.46
1292 2022-04-01 01:30:00 1 0 ... 0.69 0.25 22.00
1293 2022-04-01 01:25:00 1 0 ... 0.69 0.25 22.34
The read_xml coming in pandas 1.5 will support parsing dates:
time_df = pd.read_xml(
    # names renames the "value" attribute to "time", so parse_dates uses the new name
    s, xpath="//time", attrs_only=True, names=["time"], parse_dates=["time"]
)
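The join above pairs the two frames by row position, so it relies on every <time> containing exactly one <phase> with id 1. As a cross-check, here is a minimal standard-library sketch that pairs each timestamp with its phase explicitly. It is not the pandas approach from this answer. The sample XML is an assumption: its layout is inferred from the column names in the question's output, the root name <data> is made up, and only two reading columns are kept. phase_rows and rows_to_csv are hypothetical helper names.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical sample: layout inferred from the question's output columns;
# the root element name <data> is invented for illustration.
SAMPLE = """\
<data>
  <time value="2022-04-14 15:45:00">
    <phases>
      <phase id="0"><current_inst>2.0</current_inst><voltage_inst>241.7</voltage_inst></phase>
      <phase id="1"><current_inst>13.06</current_inst><voltage_inst>242.5</voltage_inst></phase>
    </phases>
  </time>
  <time value="2022-04-14 15:30:00">
    <phases>
      <phase id="0"><current_inst>1.02</current_inst><voltage_inst>240.4</voltage_inst></phase>
      <phase id="1"><current_inst>14.54</current_inst><voltage_inst>241.0</voltage_inst></phase>
    </phases>
  </time>
</data>
"""


def phase_rows(xml_text, phase_id="1"):
    """Return one dict per <time> that contains a <phase> with the given id."""
    root = ET.fromstring(xml_text)
    rows = []
    for time_el in root.iter("time"):
        phase = time_el.find(f".//phase[@id='{phase_id}']")
        if phase is None:
            continue  # no matching phase under this timestamp
        row = {"time": time_el.get("value"), "id": phase.get("id")}
        # Each child element of <phase> becomes one column.
        row.update({child.tag: child.text for child in phase})
        rows.append(row)
    return rows


def rows_to_csv(rows):
    """Serialize rows as CSV text, using the first row's keys as the header."""
    if not rows:
        return ""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


rows = phase_rows(SAMPLE)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Because the timestamp is read from the same <time> element that holds the phase, a timestamp with a missing phase is skipped instead of shifting every later row by one.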
Try:
import requests
import pandas as pd
from html import unescape
url = "https://issat.ttn.tn/cu/export/akouda.php"
s = unescape(requests.get(url).text)[5:-6]
df = pd.read_xml(s, xpath="//phases/* | //time")
df["value"] = df["value"].ffill()
df = df.drop(columns="phases")
# if you want only id==1 you can skip this:
# df = df[~df.isna().any(axis=1)]
print(df[df["id"] == 1])
Prints:
value id act_energy react_energy current_inst voltage_inst power_inst power_fact thd
2 2022-04-14 23:15:00 1.0 0.0 0.0 12.06 241.0 0.83 0.28 22.56
6 2022-04-14 23:00:00 1.0 0.0 0.0 12.04 240.5 0.82 0.28 22.57
10 2022-04-14 22:45:00 1.0 0.0 0.0 12.04 240.2 0.82 0.28 22.56
14 2022-04-14 22:30:00 1.0 0.0 0.0 12.03 240.1 0.82 0.28 22.24
18 2022-04-14 22:15:00 1.0 0.0 0.0 12.01 240.1 0.82 0.28 22.52
22 2022-04-14 22:00:00 1.0 0.0 0.0 12.00 239.8 0.82 0.28 22.74
26 2022-04-14 21:45:00 1.0 0.0 0.0 11.96 239.9 0.82 0.28 22.58
...
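To see why the ffill step works, and to get from the filtered frame to the CSV file the question asks for, here is a small offline sketch. It rebuilds a frame from a few rows copied from the question's output (only two reading columns kept) instead of fetching the URL. Renaming value to time and casting id to int are optional choices made here, not part of the answer above.

```python
import io

import pandas as pd

# A few rows copied from the question's output, trimmed to two reading columns.
raw = io.StringIO(
    "value,phases,id,current_inst,voltage_inst\n"
    "2022-04-14 15:45:00,,,,\n"
    ",,0.0,2.0,241.7\n"
    ",,1.0,13.06,242.5\n"
    "2022-04-14 15:30:00,,,,\n"
    ",,0.0,1.02,240.4\n"
    ",,1.0,14.54,241.0\n"
)
df = pd.read_csv(raw)

# Copy each timestamp down onto the reading rows that follow it.
df["value"] = df["value"].ffill()
df = df.drop(columns="phases")

# Timestamp-only rows have id == NaN, so this filter removes them as well.
phase1 = df[df["id"] == 1].astype({"id": int}).rename(columns={"value": "time"})

csv_text = phase1.to_csv(index=False)  # pass a file path instead to write a file
print(csv_text)
```

The rows that carry only a date never match id == 1, which is why the separate isna filter in the answer can be skipped when only phase 1 is wanted.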