I have a dataframe with a column like this:
import pandas as pd

data = [
    '[[0.1, 0.2, 0.3], [0, 0.5]]',
    '[[0.1, 0.2], [0.3, 0.4, 0.5], [0, 0.4]]'
]
df = pd.DataFrame(data, columns=['word_probs'])
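Each row is a paragraph, each inner list is a sentence, and each number is the probability of one word in that sentence. The number of sentences and words varies from row to row. I want another column, average_prob, that holds the average of the per-sentence averages for each row. For the data above that is 0.225 and 0.25.
The dtype of the word_probs column is string.
How can I achieve this? Thanks in advance!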
We first need to convert the strings to lists with ast, then explode:
import ast
import numpy as np

df.word_probs.map(ast.literal_eval).explode().map(np.mean).groupby(level=0).mean()
Out[408]:
0 0.225
1 0.250
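To make the chain above easier to follow, here is a sketch that splits it into its four steps. The intermediate names (parsed, sentences, sentence_means, average_prob) are only illustrative and are not part of the answer above.

```python
import ast

import numpy as np
import pandas as pd

data = [
    '[[0.1, 0.2, 0.3], [0, 0.5]]',
    '[[0.1, 0.2], [0.3, 0.4, 0.5], [0, 0.4]]',
]
df = pd.DataFrame(data, columns=['word_probs'])

# Step 1: parse each string into a real list of lists.
parsed = df['word_probs'].map(ast.literal_eval)

# Step 2: one row per sentence; the original row label is repeated in the index.
sentences = parsed.explode()

# Step 3: mean of the word probabilities within each sentence.
sentence_means = sentences.map(np.mean)

# Step 4: group by the original row label (index level 0)
# and average the sentence means.
average_prob = sentence_means.groupby(level=0).mean()
print(average_prob)
```

Because explode keeps the original row labels in the index, groupby(level=0) can put the sentences of each paragraph back together without any extra bookkeeping.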
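There is already a more compact answer, but in this messier code I have included a few lines on storing the computed averages in the dataframe.
The shortcut, using Beny's answer: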
import ast
import numpy as np
import pandas as pd

data = [
    '[[0.1, 0.2, 0.3], [0, 0.5]]',
    '[[0.1, 0.2], [0.3, 0.4, 0.5], [0, 0.4]]'
]
df = pd.DataFrame(data, columns=['word_probs'])
df['average_prob'] = df.word_probs.map(ast.literal_eval).explode().map(np.mean).groupby(level=0).mean()
print(df)
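The longer way, without the ast import
(This can also be skipped; I just thought I would include all the possible steps. For example, the list-append pattern I reuse below could be replaced with a generator.)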
import numpy as np
import pandas as pd

def row_averages(df: pd.DataFrame) -> list[float]:
    row_average_list: list[float] = []
    # create the new col (as floats) with the length of the old col
    df['average_prob'] = [0.0] * len(df['word_probs'])
    # i is used as the row label below, so this assumes the default 0..n-1 index
    for i, row in enumerate(df['word_probs']):
        temp: str = row
        # drop the outer brackets, then split on the boundary between inner lists
        segments_no_brackets = temp.strip('[[').strip(']]').split('], [')
        average_list: list[float] = []
        for seg in segments_no_brackets:
            list_of_str_float: list[str] = seg.split(', ')
            # any tallying data structure will do
            internal_list: list[float] = []
            for char in list_of_str_float:
                number = float(char)
                internal_list.append(number)
            inner_avg = np.mean(internal_list)
            average_list.append(inner_avg)
        row_average = np.mean(average_list)
        row_average_list.append(row_average)
        # enter into new col, overwriting the zeroes set above with the average
        df.at[i, 'average_prob'] = row_average
    print(df)
    # do a rounding here if you want to control sig figs
    return [float(f'{num:.3}') for num in row_average_list]
Printed output (df):
                                word_probs  average_prob
0              [[0.1, 0.2, 0.3], [0, 0.5]]         0.225
1  [[0.1, 0.2], [0.3, 0.4, 0.5], [0, 0.4]]         0.250
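As a rough illustration of the generator idea mentioned for the longer way, the sketch below replaces the append loops with generator expressions and statistics.fmean from the standard library. The helper name row_average is hypothetical. It assumes every string has exactly the '[[a, b], [c, d]]' shape shown in the question, with ', ' as the separator and no empty inner lists; anything else would need a real parser such as ast.literal_eval.

```python
from statistics import fmean

import pandas as pd


def row_average(text: str) -> float:
    """Average of the per-sentence averages for one '[[...], [...]]' string.

    Hypothetical helper: assumes the exact formatting used in the question.
    """
    # Drop the outer brackets, then split on the boundary between inner lists.
    segments = text.strip('[]').split('], [')
    # The inner generator yields the floats of one sentence;
    # the outer generator yields one mean per sentence.
    return fmean(
        fmean(float(token) for token in segment.split(', '))
        for segment in segments
    )


data = [
    '[[0.1, 0.2, 0.3], [0, 0.5]]',
    '[[0.1, 0.2], [0.3, 0.4, 0.5], [0, 0.4]]',
]
df = pd.DataFrame(data, columns=['word_probs'])

# map returns a Series aligned on df's index, so the assignment
# does not depend on the index being 0..n-1.
df['average_prob'] = df['word_probs'].map(row_average)
print(df)
```

This keeps the string handling of the loop version but avoids the temporary lists and the placeholder column of zeroes.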