Python: large nested line-delimited JSON files to CSV, extracting only specific key-value pairs



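Hi, I'm looking for some help parsing Twitter line-delimited JSON data into Python dictionaries and extracting it to smaller CSV files, using Python 3.8.5. The Twitter data has already been collected and saved in several gzip files, each about 450 MB compressed and more than 2.7 GB uncompressed. Each file contains about 800,000 lines. The JSON files contain all of the Twitter objects. I only want to extract certain key:value pairs, because I don't need all of the data. However, I'm having trouble extracting those specific keys because some of them are nested. Not all keys have values; in that case I would like "Null"/"None" to be returned. All the other posts and YouTube videos I have found deal with simple files or extract all of the keys.

I have managed to parse the JSON data line by line into Python dictionaries (note: I found that ujson works better in terms of memory load and speed):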

import gzip
import json
import ujson
import csv

tweets = []
with gzip.open('small_test_file.gz', 'r') as infile:
    for line in infile:
        # each line holds one complete JSON object
        tweets.append(ujson.loads(line))

# no explicit infile.close() needed: the with block closes the file
print("Finished processing: " + str(len(tweets)) + " lines")

These are the keys/columns I want:

header = ['id', 'created_at', 'screen_name', 'text', 'lang', 'place.country_code', 'place.name', 'coordinates_long', 'coordinates_lat']

This is the csv.DictWriter code I am using:

with open('clean_test_long.csv', 'w', encoding = 'utf-8') as outfile:    # opens the CSV output file
    header = ['id', 'created_at', 'screen_name', 'text', 'lang', 'place.country_code', 'place.name', 'coordinates_long', 'coordinates_lat']

    csv_writer = csv.DictWriter(outfile, fieldnames ='header', restval = None)

    csv_writer.writeheader()    # write header row using fieldnames
    for tweet in tweets:
        csv_writer.writerow(tweet['id'],
                            tweet['created_at'],
                            tweet['user']['screen_name'],
                            tweet['text'],
                            tweet['lang'],
                            tweet['place']['country_code'],
                            tweet['place']['name'],
                            tweet['coordinates']['coordinates'][0],
                            tweet['coordinates']['coordinates'][1])
outfile.close()

I get the following error:

tweet['coordinates']['coordinates'][0],
TypeError: 'NoneType' object is not subscriptable

I have also tried using .get on all of the objects, e.g. tweet.get('coordinates').get('coordinates')[0], to substitute the missing values, but that does not work.
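A likely reason the chained .get attempt fails: in the sample data the "coordinates" key is present but holds null, so tweet.get('coordinates') returns None (the fallback argument of .get only applies when the key is absent), and the next .get or [0] is then called on None. Below is a minimal sketch of a small lookup helper that tolerates this. The name `dig` and its `default` parameter are made up for illustration and are not part of any library.

```python
def dig(obj, *path, default=None):
    """Walk `path` (dict keys or list indexes) through nested data.

    Returns `default` as soon as a step is missing, out of range,
    or the current value is None / not subscriptable.
    """
    for step in path:
        try:
            obj = obj[step]
        except (KeyError, IndexError, TypeError):
            return default
    return default if obj is None else obj


# Trimmed-down tweet modelled on the second sample record
tweet = {"id": 1, "coordinates": None,
         "place": {"name": "Wexford", "country_code": "IE"}}

# The key exists, so the {} fallback is ignored and None comes back
assert tweet.get("coordinates", {}) is None

longitude = dig(tweet, "coordinates", "coordinates", 0)  # falls back to None
place_name = dig(tweet, "place", "name")                 # found: "Wexford"
```

The same call shape works for every column in the header, so a null "place" and a null "coordinates" are handled by one code path.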

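I have also tried pandas json_normalize, but for me it does not flatten the structure beyond the top level, and it struggles with the huge gzip files, which is why I want to clean the data first before analysing it in pandas.

Sample data lines: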

[
{
"truncated": false,
"contributors": null,
"place": null,
"reply_count": 0,
"retweeted": false,
"in_reply_to_status_id_str": null,
"source": "",
"in_reply_to_screen_name": null,
"id": 1233904784635256833,
"retweet_count": 0,
"filter_level": "low",
"user": {
"profile_background_image_url": "",
"profile_text_color": "333333",
"profile_background_tile": false,
"profile_background_image_url_https": "",
"profile_image_url_https": "",
"profile_background_color": "C0DEED",
"url": null,
"profile_sidebar_border_color": "C0DEED",
"location": null,
"default_profile": true,
"listed_count": 65,
"id": 1092190045,
"statuses_count": 62340,
"translator_type": "none",
"profile_image_url": "",
"is_translator": false,
"id_str": "1092190045",
"time_zone": null,
"friends_count": 24,
"profile_banner_url": "",
"favourites_count": 25,
"profile_sidebar_fill_color": "DDEEF6",
"description": null,
"protected": false,
"contributors_enabled": false,
"lang": null,
"name": "Rathausuhr Neukölln",
"notifications": null,
"following": null,
"created_at": "Tue Jan 15 14:06:09 +0000 2013",
"profile_use_background_image": true,
"utc_offset": null,
"follow_request_sent": null,
"screen_name": "rh_neukoelln",
"verified": false,
"geo_enabled": true,
"default_profile_image": false,
"profile_link_color": "1DA1F2",
"followers_count": 1653
},
"id_str": "1233904784635256833",
"in_reply_to_user_id": null,
"in_reply_to_status_id": null,
"lang": "de",
"favorited": false,
"favorite_count": 0,
"entities": {
"symbols": [],
"hashtags": [],
"urls": [],
"user_mentions": []
},
"coordinates": {
"type": "Point",
"coordinates": [
13.435,
52.481388
]
},
"in_reply_to_user_id_str": null,
"created_at": "Sun Mar 01 00:00:00 +0000 2020",
"timestamp_ms": "1583020800156",
"text": "schepper",
"quote_count": 0,
"geo": {
"type": "Point",
"coordinates": [
52.481388,
13.435
]
},
"is_quote_status": false
},
{
"truncated": false,
"contributors": null,
"place": {
"attributes": {},
"bounding_box": {
"type": "Polygon",
"coordinates": [
[
[
-7.017507,
52.122381
],
[
-7.017507,
52.797086
],
[
-6.141269,
52.797086
],
[
-6.141269,
52.122381
]
]
]
},
"full_name": "Wexford, Ireland",
"url": "",
"name": "Wexford",
"country_code": "IE",
"id": "0239f5fd632185d5",
"country": "Ireland",
"place_type": "city"
},
"in_reply_to_status_id": null,
"retweeted": false,
"in_reply_to_status_id_str": null,
"source": "",
"in_reply_to_screen_name": null,
"quoted_status": {
"display_text_range": [
0,
53
],
"truncated": false,
"place": null,
"in_reply_to_status_id": null,
"retweeted": false,
"in_reply_to_status_id_str": null,
"source": "",
"in_reply_to_screen_name": null,
"id": 1233902879301349379,
"retweet_count": 40,
"filter_level": "low",
"user": {
"profile_background_image_url": "",
"profile_text_color": "000000",
"profile_background_tile": false,
"profile_background_image_url_https": "",
"profile_image_url_https": "",
"profile_background_color": "000000",
"url": "",
"profile_sidebar_border_color": "000000",
"location": "NYC",
"default_profile": false,
"listed_count": 616,
"id": 249547283,
"statuses_count": 51127,
"translator_type": "none",
"profile_image_url": "",
"is_translator": false,
"id_str": "249547283",
"time_zone": null,
"friends_count": 1187,
"profile_banner_url": "",
"favourites_count": 88876,
"profile_sidebar_fill_color": "000000",
"description": "Host of the Michael Brooks Show, join: @tmbsfm Contributor/producer, @Majorityfm Co-host Woke Bros. Member of the Yacubian Left",
"protected": false,
"contributors_enabled": false,
"lang": null,
"name": "Michael Brooks",
"notifications": null,
"following": null,
"created_at": "Wed Feb 09 08:13:53 +0000 2011",
"profile_use_background_image": false,
"utc_offset": null,
"follow_request_sent": null,
"screen_name": "_michaelbrooks",
"verified": false,
"geo_enabled": true,
"default_profile_image": false,
"profile_link_color": "0065B3",
"followers_count": 79224
},
"possibly_sensitive": false,
"lang": "en",
"id_str": "1233902879301349379",
"in_reply_to_user_id": null,
"contributors": null,
"quoted_status_id": 1233899739906813952,
"reply_count": 35,
"quoted_status_id_str": "1233899739906813952",
"favorited": false,
"favorite_count": 423,
"entities": {
"symbols": [],
"hashtags": [],
"urls": [
{
"indices": [
54,
77
],
"expanded_url": "",
"display_url": "",
"url": ""
}
],
"user_mentions": []
},
"coordinates": null,
"in_reply_to_user_id_str": null,
"created_at": "Sat Feb 29 23:52:25 +0000 2020",
"text": "Are they genuinely nuts enough to think they can win? ",
"quote_count": 1,
"geo": null,
"is_quote_status": true
},

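You can simply use try/except and catch the exception (which means the coordinates value is null), or use an if statement to check that the coordinates key has a corresponding value and that the value is not None.

You can use either of the following basic solutions: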

  • Using try/except:
try:
    coordinates = tweet['coordinates']['coordinates'][0]
except (KeyError, TypeError, IndexError):
    # key missing, value is None, or the list is empty
    coordinates = None

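  • Checking that the value exists:
if tweet.get('coordinates') and tweet['coordinates'].get('coordinates'):
    coordinates = tweet['coordinates']['coordinates'][0]
else:
    # .get also covers the case where the key is present but None
    coordinates = None

I prefer the first solution; I would expect it to be processed faster.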
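To tie this back to the original goal, here is a rough sketch of the whole gzip-to-CSV step, streaming one line at a time instead of collecting every tweet in a list first. It uses the standard json module (ujson.loads could be swapped in) and the `x or {}` idiom as an alternative to try/except. The function names `row_from_tweet` and `tweets_to_csv` are my own, and I am assuming each line of the file is one complete tweet object.

```python
import csv
import gzip
import json

HEADER = ['id', 'created_at', 'screen_name', 'text', 'lang',
          'place.country_code', 'place.name',
          'coordinates_long', 'coordinates_lat']


def row_from_tweet(tweet):
    """Build one CSV row; `x or {}` turns a null or missing object into {}."""
    user = tweet.get('user') or {}
    place = tweet.get('place') or {}
    point = (tweet.get('coordinates') or {}).get('coordinates') or []
    # Pad so a missing or short point still unpacks into two values
    lon, lat = (list(point) + [None, None])[:2]
    return {
        'id': tweet.get('id'),
        'created_at': tweet.get('created_at'),
        'screen_name': user.get('screen_name'),
        'text': tweet.get('text'),
        'lang': tweet.get('lang'),
        'place.country_code': place.get('country_code'),
        'place.name': place.get('name'),
        'coordinates_long': lon,
        'coordinates_lat': lat,
    }


def tweets_to_csv(src_path, dst_path):
    """Stream a gzip file of line-delimited JSON into a CSV; return rows written."""
    count = 0
    with gzip.open(src_path, 'rt', encoding='utf-8') as infile, \
         open(dst_path, 'w', encoding='utf-8', newline='') as outfile:
        writer = csv.DictWriter(outfile, fieldnames=HEADER)
        writer.writeheader()
        for line in infile:
            line = line.strip()
            if not line:          # skip blank lines between records
                continue
            writer.writerow(row_from_tweet(json.loads(line)))
            count += 1
    return count
```

Calling tweets_to_csv('small_test_file.gz', 'clean_test_long.csv') would then produce the reduced file. Note that the csv module writes None as an empty field; if a literal "None" string is wanted, the values can be replaced in row_from_tweet before writing.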
