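I'm trying to scrape this table from an email and remove any special characters (\r\n, etc.) before writing it to a csv file.
I've managed to scrape the data, but the columns are wrapped in "\r\n", which I can't remove (I'm new to this).
The table I'm trying to scrape:
table-image
Python code: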
for emailid in items:
    # getting the mail content
    resp, data = m.fetch(emailid, '(UID BODY[TEXT])')
    text = str(data[0][1])
    tree = BeautifulSoup(text, "lxml")
    table_tag = tree.select("table")[0]
    tab_data = [[item.text for item in row_data.select("td")]
                for row_data in table_tag.select("tr")]
    print(table_tag)
    for data in tab_data:
        writer.writerow(data)
        print(' '.join(data))
Result:
\r\nQuick No\r\n\r\nOrder No=\r\n\r\n\r\nPart No\r\n\r\nDescription\r\n\r\nOM=\r\n\r\n\r\nOrder Qty\r\n\r\nReceived Qty\r\n\r\nReceived Date (dd/mm/yyyy)\r\n\r\nOther information\r\n\\r\nE03B1A\r\n\r\nE00015130\r\n\r\nYK71114105=\r\np>\\r\n\r\nCOLOUR TOP ASSY(r)=\r\n\r\n\r\n\r\nPECE\r\n\r\n1\r\n\r\n1\r\n\r\n06/10/2020=\r\np>\\r\n\r\n\\r\nE03B1E\r\n\r\nE00015134\r\n\r\nYK78804497=\r\np>\\r\n\r\nDIE BUTTON=\r\np>\\r\n\r\nPECE\r\n\r\n4\r\n\r\n4\r\n\r\n06/10/2020=\r\np>\\r\n\r\n
Expected result:
- Quick No Order No Part No
- nE03B1A nE00015130 nYK71114105
- nE03B1E nE00015134 nYK78804497
Thanks in advance (this is my first post, so please be gentle).
To remove those characters, you need to call .strip() on the strings. So try:
tab_data = [[item.text.strip() for item in row_data.select("td")]
            for row_data in table_tag.select("tr")]
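One thing that may explain why the "\r\n" survives: on Python 3, and assuming m is an imaplib connection, data[0][1] is a bytes object. Calling str() on bytes returns its repr, in which every line break becomes a literal backslash followed by r or n. .strip() only removes real whitespace, so it leaves those alone. A minimal sketch of the difference (the sample bytes and the utf-8 charset are assumptions, not taken from the actual email):

```python
# Stand-in for data[0][1]; the real payload comes from m.fetch(...)
raw = b"\r\nE03B1A\r\n"

as_repr = str(raw)              # repr of the bytes: backslash sequences become literal text
as_text = raw.decode("utf-8")   # real carriage return / line feed characters

cleaned_repr = as_repr.strip()  # nothing to strip: the string starts with b' and ends with '
cleaned_text = as_text.strip()  # real whitespace is removed

print(cleaned_repr)
print(cleaned_text)
```

If that is what is happening here, replacing str(data[0][1]) with a decode call before parsing should let .strip() do its job.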
But may I suggest letting pandas parse the table from the HTML:
import pandas as pd

for emailid in items:
    # getting the mail content
    resp, data = m.fetch(emailid, '(UID BODY[TEXT])')
    text = str(data[0][1])
    # parse the first <table> in the HTML into a DataFrame
    table = pd.read_html(text)[0]
    # strip surrounding whitespace from every string (object) column
    df_obj = table.select_dtypes(['object'])
    table[df_obj.columns] = df_obj.apply(lambda x: x.str.strip())
    print(table)
    table.to_csv('file.csv', index=False)
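The "=" right before many of the line breaks in the question's output looks like quoted-printable soft line breaks, which would also explain the cut-off tags. If that is the case, decoding the body before handing it to BeautifulSoup or pandas may give cleaner cells. A sketch using the standard-library quopri module; the sample bytes are made up, and whether this particular email really is quoted-printable is an assumption:

```python
import quopri

# Made-up fragment of a quoted-printable body: "=\r\n" is a soft line break
# and "=3D" encodes a literal "=" character.
raw_body = b'<td class=3D"x">DIE BUT=\r\nTON</td>'

decoded = quopri.decodestring(raw_body)  # bytes with the transfer encoding undone
html = decoded.decode("utf-8")           # the charset is an assumption

print(html)
```

The resulting html string could then be passed to the parser in place of text.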
import re
# [^a-zA-Z0-9] matches every character that is not a letter or digit
removedData = re.sub("[^a-zA-Z0-9]", "", dataForRemoveSlashNandR)
print(removedData)
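Dropping every non-alphanumeric character also removes spaces and slashes, so values like "COLOUR TOP ASSY" or "06/10/2020" would be mangled. A narrower sketch that only targets line-break debris, both real CR/LF characters and the literal backslash sequences seen in the question (clean_cell is a hypothetical helper name, not from the original code):

```python
import re

def clean_cell(cell: str) -> str:
    """Remove line-break debris from one table cell."""
    # Replace real CR/LF characters and the literal two-character
    # sequences backslash-r / backslash-n with a space.
    without_breaks = re.sub(r"\\r|\\n|[\r\n]", " ", cell)
    # Collapse runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", without_breaks).strip()

print(clean_cell("\\r\\nE03B1A\\r\\n"))
print(clean_cell("\r\nCOLOUR TOP ASSY\r\n"))
print(clean_cell("\r\n06/10/2020\r\n"))
```

Note that this would also damage a cell that legitimately contains a backslash followed by r or n, so it is only a fallback for when the text cannot be decoded properly upstream.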
Credit: chitown88
import pandas as pd

for emailid in items:
    # getting the mail content
    resp, data = m.fetch(emailid, '(UID BODY[TEXT])')
    text = str(data[0][1])
    table = pd.read_html(text)[0]
    # strip leading/trailing carriage returns and newlines from string columns
    df_obj = table.select_dtypes(['object'])
    table[df_obj.columns] = df_obj.apply(lambda x: x.str.strip("\r\n"))
    print(table)
    table.to_csv(outfile, index=False)
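If you would rather stay with the csv module as in the question, the same cleanup can happen right before each writerow call. A small sketch with made-up rows; the io.StringIO buffer only stands in for a real file opened with open("file.csv", "w", newline=""):

```python
import csv
import io

# Made-up rows shaped like the scraped cells, wrapped in real line breaks
rows = [
    ["\r\nE03B1A\r\n", "\r\nE00015130\r\n", "\r\nYK71114105\r\n"],
    ["\r\nE03B1E\r\n", "\r\nE00015134\r\n", "\r\nYK78804497\r\n"],
]

buffer = io.StringIO(newline="")  # stand-in for the output file
writer = csv.writer(buffer)
for row in rows:
    # strip each cell so no stray line breaks end up inside the csv fields
    writer.writerow([cell.strip() for cell in row])

csv_text = buffer.getvalue()
print(csv_text)
```

Opening the real file with newline="" is what the csv module documentation recommends, so the writer controls the line endings itself.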