An efficient way to parse tweets from JSON-format files

I am parsing tweet data that is in JSON format and compressed with gzip.

Here is my code:

###Preprocessing
##Importing:
import os
import gzip
import json
import pandas as pd
from pandas.io.json import json_normalize
##Variables:
#tweets: DataFrame for merging. empty
tweets = pd.DataFrame()
idx = 0
#Parser provides parsing the input data and return as pd.DataFrame format
###Directory reading:
##Reading whole directory from
for root, dirs, files in os.walk('D:/twitter/salathe-us-twitter/11April1'):
    for file in files:
        #file tracking, #Memory Checker:
        print(file, tweets.memory_usage())
        # ext represent the extension.
        ext = os.path.splitext(file)[-1]
        if ext == '.gz':
            with gzip.open(os.path.join(root, file), "rt") as tweet_file:
                # print(tweet_file)
                for line in tweet_file:
                    try:
                        temp = line.partition('|')
                        date = temp[0]
                        tweet = json.loads(temp[2])
                        if tweet['user']['lang'] == 'en' and tweet['place']['country_code'] == 'US':
                            # Mapping for memory.
                            # The index must be sequence like series.
                            # temporary solve by listlizing int values: id, retweet-count.
                            #print(tweet)
                            temp_dict = {"id": tweet["user"]["id"],
                                         "text": tweet["text"],
                                         "hashtags": tweet["entities"]["hashtags"][0]["text"],
                                         "date":[int(date[:8])]}
                            #idx for DataFrame ix
                            temp_DF = pd.DataFrame(temp_dict, index=[idx])
                            tweets = pd.concat([tweets, temp_DF])
                            idx += 1
                    except:
                        continue
        else:
            with open(os.path.join(root, file), "r") as tweet_file:
                # print(tweets_file)
                for line in tweet_file:
                    try:
                        temp = line.partition('|')
                        #date
                        date = temp[0]
                        tweet = json.loads(temp[2])
                        if tweet['user']['lang'] == 'en' and tweet['place']['country_code'] == 'US':
                            # Mapping for memory.
                            # The index must be sequence like series.
                            # temporary solve by listlizing int values: id, retweet-count.
                            #print(tweet)
                            temp_dict = {"id": [tweet["user"]["id"]],
                                         "text": tweet["text"],
                                         "hashtags": tweet["entities"]["hashtags"][0]["text"],
                                         "date":[int(date[:8])]}
                            temp_DF = pd.DataFrame(temp_dict, index=[idx])
                            tweets = pd.concat([tweets, temp_DF])
                            idx += 1
                    except:
                        continue
##STORING PROCESS.
store = pd.HDFStore('D:/Twitter_project/mydata.h5')
store['11April1'] = tweets
store.close()

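My code can be divided into 3 parts: reading, processing to select columns, and storing. What I care about is parsing the files faster. So here is my question: it is too slow. How could it be made much faster? By reading with the pandas JSON reader? Well, I guess it is much faster than plain json.loads... But! My raw tweet data has multi-index values, so pandas read_json does not work. Overall, I am not sure whether I implemented my code well. Are there any problems, or is there a better way? I am fairly new to programming, so please teach me to do better.

P.S. The computer just shut down while the code was running. Why would that happen? A memory problem?

Thanks for reading.

P.P.S.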

20110331010003954|{"text":"#Honestly my toe still aint healed im suppose to be in that boot still!!!","truncated":false,"in_reply_to_user_id":null,"in_reply_to_status_id":null,"favorited":false,"source":"web","in_reply_to_screen_name":null,"in_reply_to_status_id_str":null,"id_str":"53320627431550976","entities":{"hashtags":[{"text":"Honestly","indices":[0,9]}],"user_mentions":[],"urls":[]},"contributors":null,"retweeted":false,"in_reply_to_user_id_str":null,"place":{"country_code":"US","country":"United States","bounding_box":{"type":"Polygon","coordinates":[[[-84.161625,35.849573],[-83.688543,35.849573],[-83.688543,36.067417],[-84.161625,36.067417]]]},"attributes":{},"full_name":"Knoxville, TN","name":"Knoxville","id":"6565298bcadb82a1","place_type":"city","url":"http://api.twitter.com/1/geo/id/6565298bcadb82a1.json"},"retweet_count":0,"created_at":"Thu Mar 31 05:00:02 +0000 2011","user":{"notifications":null,"profile_use_background_image":true,"default_profile":true,"profile_background_color":"C0DEED","followers_count":161,"profile_image_url":"http://a0.twimg.com/profile_images/1220577968/RoadRunner_normal.jpg","is_translator":false,"profile_background_image_url":"http://a3.twimg.com/a/1301071706/images/themes/theme1/bg.png","default_profile_image":false,"description":"Cool & Calm Basically Females are the way of life and key to my heart...","screen_name":"FranklinOwens","verified":false,"time_zone":"Central Time (US & Canada)","friends_count":183,"profile_text_color":"333333","profile_sidebar_fill_color":"DDEEF6","location":"","id_str":"63499713","show_all_inline_media":true,"follow_request_sent":null,"geo_enabled":true,"profile_background_tile":false,"contributors_enabled":false,"lang":"en","protected":false,"favourites_count":8,"created_at":"Thu Aug 06 18:24:50 +0000 2009","profile_link_color":"0084B4","name":"Franklin","statuses_count":5297,"profile_sidebar_border_color":"C0DEED","url":null,"id":63499713,"listed_count":0,"following":null,"utc_offset":-21600},"id":53320627431550976,"coordinates":null,"geo":null}

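This is just one line. I have more than 200 GB of gzip-compressed files. I guess the number at the start of the line is its date. I am not sure whether that is clear to you.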

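First of all, congratulations. As a software engineer you get better when you face real-world challenges like this one.

Now, about your solution. Every piece of software works in 3 phases:

1. Input data.
2. Process data.
3. Output data (response).

1. Input data

1.1. Boring stuff

Information should ideally be in one format. To achieve that we write parsers, APIs, wrappers and adapters. The idea behind all of them is to transform data into the same format. This helps avoid problems when working with different data sources: if one of them breaks, you fix only one adapter and that is it; all the other adapters, and your parser, still work.

1.2. Your case

Your data comes in the same schema but in different file formats. You can either convert it to a single format (read everything as json or txt), or extract the code that transforms the data into a separate function or module and reuse it, calling it 2 times. Example: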

def process_data(tweet_file):
    for line in tweet_file:
        pass  # do your stuff

# inside the directory loop, only the way the file is opened differs:
if ext == '.gz':
    with gzip.open(os.path.join(root, file), "rt") as tweet_file:
        process_data(tweet_file)
else:
    with open(os.path.join(root, file), "r") as tweet_file:
        process_data(tweet_file)
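A minimal sketch of the same idea, taken one step further: the choice of opener lives in one small helper, so the parsing loop is written once. The names `open_tweet_file` and `iter_tweet_lines` are hypothetical (they are not from the question), and the UTF-8 encoding is an assumption about the files.

```python
import gzip
import os


def open_tweet_file(path):
    """Return a text-mode file object, using gzip for '.gz' files."""
    ext = os.path.splitext(path)[-1]
    # Assumption: the files are UTF-8 encoded text.
    if ext == '.gz':
        return gzip.open(path, 'rt', encoding='utf-8')
    return open(path, 'r', encoding='utf-8')


def iter_tweet_lines(directory):
    """Yield every line of every file under `directory`, one at a time."""
    for root, dirs, files in os.walk(directory):
        for name in sorted(files):
            with open_tweet_file(os.path.join(root, name)) as tweet_file:
                for line in tweet_file:
                    yield line
```

Because this is a generator, only one line is held in memory at a time, which matters with 200 GB of input.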

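2. Process data

2.1. Boring stuff

This is most likely the bottleneck. Here your goal is to transform data from the given format into the desired one and perform some actions if required. This is where you get all the exceptions, all the performance issues and all the business logic. This is where the SE craft comes in handy: you create an architecture and you decide how many bugs to put into it.

2.2. Your case

The simplest way to deal with a problem is to know how to find it. If it is performance, add timestamps to track it. With experience, spotting problems gets easier. In this case pd.concat is the most likely cause of the slowdown. Each call copies all the data into a new instance, so you have 2 objects in memory when you only need 1. Try to avoid concat: gather all the data into a list and then put it into the DataFrame.

For instance, I would not put all the data into a DataFrame at the start. You can gather it and write it to a csv file, then build a DataFrame from that; pandas works well with csv files. Here is an example: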

import csv
import json

source_file = '11April1.txt'
result_file = 'output.csv'

# newline='' lets the csv module control line endings (Python 3)
with open(source_file) as source, open(result_file, 'w', newline='') as result:
    writer = csv.DictWriter(result, fieldnames=['id', 'text', 'hashtags', 'date', 'idx'])
    writer.writeheader()
    # get index together with a line
    for index, line in enumerate(source):
        try:
            # a handy way to get data in 1 func call; split on the first '|'
            # only, because the tweet text itself may contain that character
            date, data = line.split('|', 1)
            tweet = json.loads(data)
            if tweet['user']['lang'] != 'en' or tweet['place']['country_code'] != 'US':
                continue
            item = {"id": tweet["user"]["id"],
                    "text": tweet["text"],
                    "hashtags": tweet["entities"]["hashtags"][0]["text"],
                    "date": int(date[:8]),
                    "idx": index}
        except (ValueError, KeyError, IndexError, TypeError):
            # skip malformed lines and tweets without a place or a hashtag
            continue
        # either write it to the csv or save into the array
        # tweets.append(item)
        writer.writerow(item)
print("done")
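If you prefer the "gather everything into a list" route over the csv file, a rough sketch could look like the following. It also shows the timestamp idea from above. The helper names `parse_line`, `collect_rows` and `timed` are hypothetical, and skipping tweets that have no place or no hashtag is an assumption that mirrors what the try/except in the question ends up doing.

```python
import json
import time


def parse_line(line):
    """Turn one 'date|json' line into a flat dict, or None if filtered out."""
    date, _, data = line.partition('|')
    tweet = json.loads(data)
    user = tweet.get('user') or {}
    place = tweet.get('place') or {}
    hashtags = (tweet.get('entities') or {}).get('hashtags') or []
    if user.get('lang') != 'en' or place.get('country_code') != 'US':
        return None
    if not hashtags:
        # Assumption: tweets without hashtags are dropped, as in the question.
        return None
    return {'id': user['id'],
            'text': tweet['text'],
            'hashtags': hashtags[0]['text'],
            'date': int(date[:8])}


def collect_rows(lines):
    """Parse all lines into a plain list of dicts.

    list.append is amortized constant time, unlike re-concatenating a
    DataFrame on every row.
    """
    rows = []
    for line in lines:
        row = parse_line(line)
        if row is not None:
            rows.append(row)
    return rows


def timed(label, func, *args):
    """Run func(*args), print how long it took, and return its result."""
    start = time.perf_counter()
    result = func(*args)
    print(label, 'took', round(time.perf_counter() - start, 4), 'seconds')
    return result
```

With `rows = timed('parsing', collect_rows, tweet_file)` the DataFrame is then built once, with `pd.DataFrame(rows)`, instead of once per tweet. For 200 GB of input the list itself will not fit in RAM, so this only makes sense per file or per chunk.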

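3. Output data

3.1. Boring stuff

Once your data is processed and in the right format, you need to see the results, right? This is where HTTP responses and page loads happen, where pandas builds charts, and so on. You decide what kind of output you need; that is why you wrote the software in the first place, to get what you want out of a format you did not want to go through yourself.

3.2. Your case

You have to find an efficient way to get the desired output from the processed files. Maybe you need to put the data into HDF5 format and process it on Hadoop, in which case your software's output becomes someone else's software input, sexy right? :D Jokes aside, gather all the processed data from the csv or arrays and put it into HDF5 in chunks. This is important because you cannot load everything into RAM. RAM is called temporary memory for a reason: it is fast and very limited, so use it wisely. In my opinion this is why your PC shut down. Or there may be memory corruption due to the nature of some C libraries, which is OK sometimes.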
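A hedged sketch of what "put it into HDF5 in chunks" could look like. `chunked` and `rows_to_hdf` are hypothetical helper names, and the 560-byte string width is an assumption (140 characters at up to 4 bytes each). I have not run this against your data, so treat it as a starting point.

```python
from itertools import islice


def chunked(iterable, size):
    """Yield lists of at most `size` items, so only one chunk is in memory."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def rows_to_hdf(rows, h5_path, key, chunk_size=100000):
    """Append an iterable of row dicts to an HDF5 table, chunk by chunk."""
    # Imported lazily so chunked() stays usable without pandas/PyTables.
    import pandas as pd

    written = 0
    with pd.HDFStore(h5_path) as store:
        for chunk in chunked(rows, chunk_size):
            # Give every chunk a continuing index so row labels stay unique.
            index = range(written, written + len(chunk))
            frame = pd.DataFrame(chunk, index=index)
            # Assumption: 560 bytes fits any tweet text or hashtag. String
            # columns in an HDF5 table are fixed-width, so the first append
            # has to reserve enough room for later, longer values.
            store.append(key, frame,
                         min_itemsize={'text': 560, 'hashtags': 560})
            written += len(chunk)
    return written
```

`rows` can be any iterable of dicts with `text` and `hashtags` keys, for example a generator over the parsed tweets, so nothing forces the whole data set into memory at once.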

Overall, experiment, and come back to StackOverflow if anything comes up.