Processing a large pandas DataFrame (fuzzy matching)

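I want to do fuzzy matching in which I match the strings in one column of a large DataFrame (130,000 rows) against a list (400 rows). I tested the code I wrote on a small sample (matching 3,000 rows against the 400) and it works fine. It is too long to paste here, but it roughly works like this: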

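1) Normalize the data in the columns.
2) Create the Cartesian product of the columns and compute the Levenshtein distance.
3) Select the highest-scoring matches and store the 'large_csv_names' in a separate list.
4) Compare the list of 'large_csv_names' with 'large_csv', pull out all intersecting data and write it to a CSV.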
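Since the original code is not shown, here is only a minimal sketch of what steps 1 to 3 could look like on a tiny sample. The function names (normalize, levenshtein, best_match) and the sample strings are made up for illustration, and the smallest edit distance is treated as the "highest-scoring" match.

```python
def normalize(text):
    """Step 1: lowercase, trim, and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def levenshtein(a, b):
    """Step 2: edit distance between two strings (row-by-row dynamic programming)."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def best_match(name, candidates):
    """Step 3: return (candidate, distance) for the closest candidate."""
    target = normalize(name)
    scored = ((c, levenshtein(target, normalize(c))) for c in candidates)
    return min(scored, key=lambda pair: pair[1])


# Hypothetical sample data, standing in for the 400-row list.
small_list = ["Acme Corporation", "Globex", "Initech"]
print(best_match("  ACME  corporaton", small_list))  # ('Acme Corporation', 1)
```

Because best_match keeps only one winner per name, it never has to hold every pair in memory at once.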

Because the Cartesian product contains more than 50 million records, I quickly run into memory errors.
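To make the sizes concrete, the arithmetic can be checked as follows. The row counts come from the question; the chunk size of 500 is just one of the values mentioned as an example, not a recommendation.

```python
big_rows = 130_000    # rows in the large DataFrame (from the question)
small_rows = 400      # entries in the list to match against
chunk_rows = 500      # example chunk size (an assumption)

full_pairs = big_rows * small_rows       # pairs in the full Cartesian product
chunk_pairs = chunk_rows * small_rows    # pairs held at once when chunking
n_chunks = -(-big_rows // chunk_rows)    # ceiling division

print(full_pairs)   # 52000000
print(chunk_pairs)  # 200000
print(n_chunks)     # 260
```

So processing one chunk at a time keeps at most 200,000 pairs alive instead of 52 million, at the cost of 260 passes.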

That is why I would like to know how to split the large dataset into chunks on which I can then run my script.

So far I have tried:

import numpy as np

df_split = np.array_split(df, x)  # x = number of chunks, e.g. 50 or 500
for chunk in df_split:
    ...  # steps 1-4 as above

and:

import pandas as pd

for chunk in pd.read_csv('large_csv.csv', chunksize=x):  # x = rows per chunk, e.g. 50 or 500
    ...  # steps 1-4 as above

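Neither of these approaches seemed to work. I would like to know how to run the fuzzy matching in chunks: cut the large CSV into pieces, run the code on one piece, pick up the next piece, run the code again, and so on.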
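One way the chunked loop could be wired up is sketched below. It is an illustration, not the original script: the column name "name", the sample data, the threshold of 0.85 and the helper best_match are all assumptions, an in-memory io.StringIO stands in for 'large_csv.csv', and difflib.SequenceMatcher from the standard library is swapped in for a Levenshtein scorer to keep the example dependency-free.

```python
import io
from difflib import SequenceMatcher

import pandas as pd

# Stand-in for 'large_csv.csv'; the column name "name" is an assumption.
large_csv = io.StringIO(
    "name\nAcme Corp\nGlobex Inc\nInitech LLC\nUmbrela Corp\nUnrelated Ltd\n"
)
small_list = ["acme corp", "globex inc", "umbrella corp"]  # the short list


def best_match(name, candidates):
    """Return (best candidate, similarity in [0, 1]) for one name."""
    scored = ((c, SequenceMatcher(None, name, c).ratio()) for c in candidates)
    return max(scored, key=lambda pair: pair[1])


threshold = 0.85  # hypothetical cut-off for accepting a match
results = []
for chunk in pd.read_csv(large_csv, chunksize=2):  # 2 rows at a time for the demo
    names = chunk["name"].str.lower().str.strip()  # normalize
    matches = [best_match(n, small_list) for n in names]
    chunk = chunk.assign(match=[m for m, _ in matches],
                         score=[s for _, s in matches])
    hits = chunk[chunk["score"] >= threshold]      # keep only good matches
    if not hits.empty:
        results.append(hits)

matched = pd.concat(results, ignore_index=True)
print(matched)
```

Only the rows of the current chunk and their single best match are held at a time, so memory use depends on the chunk size rather than on the full 130,000 x 400 product. For a very large result, each chunk's hits could instead be appended to the output file with to_csv(..., mode="a") rather than collected in a list.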

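In the meantime I have written a script that slices a DataFrame into chunks, each of which is then ready for further processing. Since I am new to Python the code may be a bit messy, but I still wanted to share it with anyone who might be stuck on the same problem.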


import pandas as pd
import math

# df is assumed to be the large DataFrame that was loaded earlier
partitions = 3                                   # number of chunks to split df into
length = len(df)
chunk_size = math.ceil(length / partitions)      # rows per chunk (the last one may be shorter)
block_counter0 = 0                               # begin index of the current slice
while block_counter0 < length:                   # stop slicing when df ends
    block_counter1 = min(block_counter0 + chunk_size, length)  # end index of the slice
    df1 = df.iloc[block_counter0:block_counter1]  # temporary df that forms the chunk
    for i in range(len(df1)):
        pass  # insert operations on row df1.iloc[i] here
    block_counter0 = block_counter1              # move on to the next chunk
