Python: filling a DataFrame: code optimization

I have a large dataset of about 4M rows. I need to clean it with a regular expression and put it into a pandas DataFrame. Here is my code:

import re
import pandas as pd

# 1) read a text file with the dataset (about 4M rows)
orgfile = open("good_dmoz.txt", "r")
# 2) create an empty dataframe
df0 = pd.DataFrame(columns=['url'])
# 3) compile the regex used to clean the data
regex = re.compile(r"(?<=')(.*?)(?=')")
# 4) clean each line and put the result into the dataframe
for line in orgfile:
    urls = regex.findall(line)
    df0 = df0.append({"url": urls[0]}, ignore_index=True)

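The code handles the task on a small sample, but processing the full data (4M rows) takes a very long time. My question is: can the code be optimized? By optimization I mean reducing the code's execution time.

Thanks!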

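I agree with the comments on this question. However, we all start somewhere. As others have mentioned, Shokan, the performance problem you are seeing is partly caused by append and the for loop. Try this:

1. Create a pandas DataFrame from the text file, with a single column and one row per line: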

df_rawtext = pd.read_csv('good_dmoz.txt', header = None, names = ['raw_data'], sep = '\n')
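
If read_csv refuses the newline separator (I believe recent pandas releases raise a ValueError for sep='\n'), one alternative is to read the lines with plain Python and wrap them in a DataFrame. This is only a minimal sketch: the helper name read_lines_to_frame is hypothetical, the UTF-8 encoding is an assumption about the file, and the column name raw_data matches the one used above.

```python
import pandas as pd


def read_lines_to_frame(path, column="raw_data", encoding="utf-8"):
    """Read a text file into a one-column DataFrame, one row per line.

    The encoding default is an assumption; adjust it to match the file.
    """
    with open(path, "r", encoding=encoding) as handle:
        # Strip only the trailing newline so the rest of each line is kept intact.
        lines = [line.rstrip("\n") for line in handle]
    # Build the DataFrame in one go from the list of lines.
    return pd.DataFrame({column: lines})
```

With this, df_rawtext = read_lines_to_frame('good_dmoz.txt') should give the same single-column layout as step 1.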

2. Test each row for the regex and keep only the rows that match:

PATTERN = r"(?<=')(.*?)(?=')"
df_rawtext = df_rawtext.loc[df_rawtext.iloc[:,0].str.contains(PATTERN)]

3. Extract the pattern:

df_rawtext['URL'] = df_rawtext['raw_data'].str.extract(PATTERN, expand = False)

I include step 2 here because step 3 will otherwise throw an error for rows that do not match:

ValueError: pattern contains no capture groups
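
For what it's worth, in the pandas versions I am familiar with, str.extract returns NaN for rows that do not match, and this particular ValueError is raised when the pattern itself has no capture group. If that holds for your version, steps 2 and 3 can be folded into one extract followed by dropna. A small sketch with made-up sample lines (the example.com URLs are placeholders, not real data):

```python
import pandas as pd

PATTERN = r"(?<=')(.*?)(?=')"

# Hypothetical sample lines; the second one contains no quoted text.
df_rawtext = pd.DataFrame({
    "raw_data": [
        "u'http://example.com/a' 1",
        "no url here",
        "u'http://example.com/b' 2",
    ]
})

# Rows that do not match the pattern get NaN in the new column.
df_rawtext["URL"] = df_rawtext["raw_data"].str.extract(PATTERN, expand=False)

# Dropping the NaN rows does the filtering that step 2 did.
df_urls = df_rawtext.dropna(subset=["URL"]).reset_index(drop=True)
```

This scans the column with the regex once instead of twice, which may matter at 4M rows.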

If anyone knows a better way, feel free to contribute. I'm eager to learn.
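
On the append point: each DataFrame.append call copies the whole frame, so the original loop does quadratic work overall (and, as far as I know, append was removed in pandas 2.0). If you would rather keep a plain loop, a sketch of the usual workaround is to collect the matches in a Python list and build the DataFrame once at the end. The function name extract_urls is hypothetical; the regex and the url column come from the question.

```python
import re

import pandas as pd

URL_RE = re.compile(r"(?<=')(.*?)(?=')")


def extract_urls(lines):
    """Return a one-column DataFrame holding the first quoted string of each line."""
    urls = []
    for line in lines:
        match = URL_RE.search(line)
        if match:  # skip lines that contain no quoted text
            urls.append(match.group(1))
    # Build the DataFrame once, after the loop, instead of growing it row by row.
    return pd.DataFrame({"url": urls})
```

Because a file object iterates line by line, it can be passed in directly: open good_dmoz.txt in a with block and call extract_urls on the handle.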
