Processing a Pandas column faster



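I have a column in a pandas DataFrame where each element is a string of space-separated tokens, and each token is a float. For every row I only need the three smallest and the three largest floats.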

import math
from tqdm import tqdm

for index, rows in tqdm(data.iterrows()):
    s = rows['prob_tokens'].split(' ')
    x = [float(elem) for elem in s]
    x.sort()
    high_sum = 0
    low_sum = 0
    try:
        low_sum = math.log(x[0]) + math.log(x[1]) + math.log(x[2])
    except:
        low_sum = -10000000
    try:
        high_sum = math.log(x[-3]) + math.log(x[-1]) + math.log(x[-2])
    except:
        high_sum = -10000000
    data.loc[index, 'high_sum'] = high_sum
    data.loc[index, 'low_sum'] = low_sum

This is very inefficient and takes a long time on a file with 1M rows. Is there a faster way?

prob_tokens                                                      high_sum  low_sum
0.028424 0.000922 0.037654 0.563366 0.99988 0.916362 0.356194    -0.29     -5.03

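import numpy as np

def by_row(row):
    s = row["prob_tokens"].split(" ")
    x = [float(elem) for elem in s]
    x.sort()
    high_sum = 0
    low_sum = 0
    # Note: np.log does not raise for values <= 0; it returns -inf or nan
    # with a warning, so these fallbacks will rarely be reached.
    try:
        low_sum = np.log(x[:3]).sum()    # three smallest values
    except Exception:
        low_sum = -10000000
    try:
        high_sum = np.log(x[-3:]).sum()  # three largest values
    except Exception:
        high_sum = -10000000
    row["low_sum"] = low_sum
    row["high_sum"] = high_sum
    return row

# Create both result columns, then fill them in a single pass over the rows.
df["high_sum"] = np.nan
df["low_sum"] = np.nan
df = df.apply(by_row, axis=1)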

You only need to apply once :D
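The "apply once" idea can also be sketched on the column alone, which avoids building a full row object for every record. This is only an illustration: the helper name `log_sums` and the `SENTINEL` constant are made up for this example, and the fallback rules (fewer than three tokens, or a value that `math.log` rejects) are assumed from the question's code.

```python
import math
import pandas as pd

SENTINEL = -10000000  # fallback value taken from the question


def log_sums(tokens):
    """Return (high_sum, low_sum) for one space-separated string of floats."""
    x = sorted(float(t) for t in tokens.split(" "))
    if len(x) < 3:
        # The question's indexing would raise IndexError here.
        return SENTINEL, SENTINEL
    try:
        low = sum(math.log(v) for v in x[:3])
    except ValueError:  # math.log raises ValueError for values <= 0
        low = SENTINEL
    try:
        high = sum(math.log(v) for v in x[-3:])
    except ValueError:
        high = SENTINEL
    return high, low


df = pd.DataFrame({
    "prob_tokens": [
        "0.028424 0.000922 0.037654 0.563366 0.99988 0.916362 0.356194",
        "0.5 0.5",
    ]
})

# One pass over the column; each element of `pairs` is a (high, low) tuple.
pairs = df["prob_tokens"].apply(log_sums)
df["high_sum"] = [high for high, _ in pairs]
df["low_sum"] = [low for _, low in pairs]
```

Each string is parsed and sorted once, and both sums come out of the same call.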

Looping over a pandas DataFrame is very slow, and you should look for an alternative whenever possible. See this question for more information.

For this problem, create two functions, high_sum and low_sum, then use apply to run them over the column:

import math
import pandas as pd

def high_sum(x):
    # Sum of the logs of the three largest values.
    try:
        return sum([math.log(i) for i in sorted(x)[-3:]])
    except ValueError:  # math.log raises ValueError for values <= 0
        return -10000000

def low_sum(x):
    # Sum of the logs of the three smallest values.
    try:
        return sum([math.log(i) for i in sorted(x)[:3]])
    except ValueError:
        return -10000000

# Test data: one million identical rows, each holding a list of float strings.
temp = ['0.028424', '0.000922', '0.037654', '0.563366', '0.99988', '0.916362', '0.356194']
df = pd.DataFrame([[temp]]*int(1e6))
df["high_sum"] = df[0].apply(lambda x: high_sum([float(i) for i in x]))
df["low_sum"] = df[0].apply(lambda x: low_sum([float(i) for i in x]))

This takes ~1 s to run on my machine.
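If apply is still too slow, one possible alternative is to move the sorting and logarithms into NumPy so that no Python function runs per row. The following is a rough sketch, not a benchmarked solution: the function name `add_log_sums` and the `SENTINEL` constant are hypothetical, it assumes the column holds space-separated strings as in the question, and it maps every non-finite result (too few tokens, zero or negative values) to the question's fallback value.

```python
import numpy as np
import pandas as pd

SENTINEL = -10000000  # fallback value taken from the question


def add_log_sums(df, col="prob_tokens"):
    # Split each string into separate columns; shorter rows get missing values.
    parts = df[col].str.split(" ", expand=True)
    # Assumption: tokens that cannot be parsed become NaN instead of raising.
    arr = parts.apply(pd.to_numeric, errors="coerce").to_numpy(dtype=float)

    # Guarantee at least three columns so the slices below are well defined.
    if arr.shape[1] < 3:
        pad = np.full((arr.shape[0], 3 - arr.shape[1]), np.nan)
        arr = np.hstack([arr, pad])

    # np.sort places NaN last, so the first three columns are the smallest.
    low3 = np.sort(arr, axis=1)[:, :3]
    # Sorting the negated array puts the largest values first.
    high3 = -np.sort(-arr, axis=1)[:, :3]

    # log(0) gives -inf and log of a negative or NaN gives nan; silence the
    # warnings and replace those rows with the sentinel afterwards.
    with np.errstate(divide="ignore", invalid="ignore"):
        low = np.log(low3).sum(axis=1)
        high = np.log(high3).sum(axis=1)

    out = df.copy()
    out["low_sum"] = np.where(np.isfinite(low), low, SENTINEL)
    out["high_sum"] = np.where(np.isfinite(high), high, SENTINEL)
    return out


df = pd.DataFrame({
    "prob_tokens": [
        "0.028424 0.000922 0.037654 0.563366 0.99988 0.916362 0.356194",
        "0.5 0.5",
        "0.0 0.1 0.2 0.3",
    ]
})
result = add_log_sums(df)
```

The trade-off is memory: the expanded array has one column per token of the longest row, so this suits data where row lengths are similar.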
