我不时从AWS sns接收文件(json(。其中一些只包含一个对象,另一些则包含多个对象。
我正试图运行一个python脚本来提取几个列作为示例,但由于这几个对象或文件的格式,我遇到了多个错误。理想情况下,我希望将所有数据放入列中,特别是包含在";消息";部分(即eventType、from、to、bcc、messageId、timestamp等
有人能帮忙吗?感谢
脚本:
data = []
for line in open(folder + file, 'r', encoding='utf-8'):
data.append(json.loads(line))
pd.json_normalize(data)
#df = pd.DataFrame(((d['Message']) for d in data), columns=['Message'])
df = pd.DataFrame([(
data[0]['Timestamp'],
data[0]['Subject'],
data[0]['Message']['eventType'])
], columns=['Timestamp', 'Subject', 'eventType'])
df.to_csv(folder + 'testOutput.csv', index=False, encoding='utf-8')
print(df)
具有多个对象的示例文件:
{"Type":"Notification","MessageId":"0579da9d-671f-547f-879a-8151d6048","TopicArn":"arn:aws:sns:region:123456789:SES-notification-s","Subject":"Event Notification","Message":"{"eventType":"Click","mail":{"timestamp":"2022-10-21T14:39:47.003Z","source":"tester <fromemail@email.com>","sendingAccountId":"123456789","messageId":"01000183fafc5f3b-4f5f3dc2-281a-472a-85-915f85289858-000000","destination":["toemail@email.com","bccemail@email.com"],"headersTruncated":false,"headers":[{"name":"Content-Type","value":"multipart/mixed; boundary=\"===============4453571503606487627==\""},{"name":"MIME-Version","value":"1.0"},{"name":"Subject","value":"Email tracking test"},{"name":"From","value":"tester <fromemail@email.com>"},{"name":"To","value":"toemail@email.com"},{"name":"Bcc","value":"bccemail@email.com"},{"name":"Message-ID","value":"null"}],"commonHeaders":{"from":["tester <fromemail@email.com>"],"to":["toemail@email.com"],"bcc":["bccemail@email.com"],"messageId":"010001fafc5f33b-4f53dc2-281a-472a-8f65-915f889858-000000","subject":"Email tracking test"},"tags":{"ses:operation":["SendRawEmail"],"ses:configuration-set":["ses-email-tracking"],"ses:source-ip":["12.12.123.123"],"ses:from-domain":["domain.dev"],"ses:caller-identity":["caller_identity"]}},"click":{"timestamp":"2022-10-21T14:40:55.000Z","ipAddress":"66.66.666.666","userAgent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36","link":"https://website.com","linkTags":null}}n","Timestamp":"2022-10-21T14:40:55.066Z","UnsubscribeURL":"unsubscribeLink"}
{"Type":"Notification","MessageId":"03758739-75a9-5462-8361-dcf5c410e015","TopicArn":"arn:aws:sns:region:123456789:SES-notification-s","Subject":"Event Notification","Message":"{"eventType":"ClickNum2","mail":{"timestamp":"2022-10-21T14:39:47.003Z","source":"tester <fromemail@email.com>","sendingAccountId":"123456789","messageId":"01000183fafc5f3b-4f5f3dc2-281a-472a-85-915f85289858-000000","destination":["toemail2@email.com","bccemail2@email.com"],"headersTruncated":false,"headers":[{"name":"Content-Type","value":"multipart/mixed; boundary=\"===============4453571503606487627==\""},{"name":"MIME-Version","value":"1.0"},{"name":"Subject","value":"Email tracking test"},{"name":"From","value":"tester <fromemail@email.com>"},{"name":"To","value":"toemail2@email.com"},{"name":"Bcc","value":"bccemail2@email.com"},{"name":"Message-ID","value":"null"}],"commonHeaders":{"from":["tester <fromemail@email.com>"],"to":["toemail2@email.com"],"bcc":["bccemail2@email.com"],"messageId":"010001fafc5f33b-4f53dc2-281a-472a-8f65-915f889858-000000","subject":"Email tracking test"},"tags":{"ses:operation":["SendRawEmail"],"ses:configuration-set":["ses-email-tracking"],"ses:source-ip":["12.12.123.123"],"ses:from-domain":["domain.dev"],"ses:caller-identity":["caller_identity"]}},"click2":{"timestamp":"2022-10-21T15:45:50.000Z","ipAddress":"55.55.555.555","userAgent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36","link":"https://website.com","linkTags":null}}n","Timestamp":"2022-10-21T15:50:50.066Z","UnsubscribeURL":"unsubscribeLink"}
{"Type":"Notification","MessageId":"03758739-75a9-5462-8361-dcf5c410e015","TopicArn":"arn:aws:sns:region:123456789:SES-notification-s","Subject":"Event Notification","Message":"{"eventType":"ClickNum3","mail":{"timestamp":"2022-10-21T14:39:47.003Z","source":"tester <fromemail@email.com>","sendingAccountId":"123456789","messageId":"01000183fafc5f3b-4f5f3dc2-281a-472a-85-915f85289858-000000","destination":["toemail3@email.com","bccemail3@email.com"],"headersTruncated":false,"headers":[{"name":"Content-Type","value":"multipart/mixed; boundary=\"===============4453571503606487627==\""},{"name":"MIME-Version","value":"1.0"},{"name":"Subject","value":"Email tracking test"},{"name":"From","value":"tester <fromemail@email.com>"},{"name":"To","value":"toemail3@email.com"},{"name":"Bcc","value":"bccemail3@email.com"},{"name":"Message-ID","value":"null"}],"commonHeaders":{"from":["tester <fromemail@email.com>"],"to":["toemail3@email.com"],"bcc":["bccemail3@email.com"],"messageId":"010001fafc5f33b-4f53dc2-281a-472a-8f65-915f889858-000000","subject":"Email tracking test"},"tags":{"ses:operation":["SendRawEmail"],"ses:configuration-set":["ses-email-tracking"],"ses:source-ip":["12.12.123.123"],"ses:from-domain":["domain.dev"],"ses:caller-identity":["caller_identity"]}},"click3":{"timestamp":"2022-10-21T16:50:50.000Z","ipAddress":"44.44.444.444","userAgent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36","link":"https://website.com","linkTags":null}}n","Timestamp":"2022-10-21T16:50:50.066Z","UnsubscribeURL":"unsubscribeLink"}
您可以尝试包含从json到csv文件的所有字段。
import pandas as pd
import json
df=pd.read_json('test.json',lines=True)
newdf=[]
for i,row in df.iterrows():
data=json.loads(row['Message'])
row['timestamp']=data['mail']['timestamp']
row['eventType']=data['eventType']
newdf.append(row)
df=pd.DataFrame(newdf)
df.to_csv("test.csv",index=False)
如何解析整个输入:
import fileinput
import json
import pandas as pd
def parse_input(filename):
rows = (l for l in fileinput.input(filename))
parsed_1 = [json.loads(row) for row in rows]
for row in parsed_1:
row['Message'] = json.loads(row['Message'])
return parsed_1
input = parse_input('input.txt')
然后
df = pd.json_normalize(input)
并选择所需的列?
In [5]: df
Out[5]:
Type MessageId ... Message.click3.link Message.click3.linkTags
0 Notification 0579da9d-671f-547f-879a-8151d6048 ... NaN NaN
1 Notification 03758739-75a9-5462-8361-dcf5c410e015 ... NaN NaN
2 Notification 03758739-75a9-5462-8361-dcf5c410e015 ... https://website.com NaN
[3 rows x 39 columns]