A faster way to loop and update array values in Python



I have a dataframe test as follows:

Student_Id  Math  Physical  Arts Class Sub_Class
0        id_1     6         7     9     A         x
1        id_2     9         7     1     A         y
2        id_3     3         5     5     C         x
3        id_4     6         8     9     A         x
4        id_5     6         7    10     B         z
5        id_6     9         5    10     B         z
6        id_7     3         5     6     C         x
7        id_8     3         4     6     C         x
8        id_9     6         8     9     A         x
9       id_10     6         7    10     B         z
10      id_11     9         5    10     B         z
11      id_12     3         5     6     C         x
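My code (listed in full further down) defines two arrays, arr_list and array_top. I want to create a new column by looping over every row of the dataframe and filling it with values from the arrays, like this: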

for index, row in test.iterrows():
    test.loc[index, 'Highest_Score'] = arr_list[index][array_top[index]]

For larger datasets this loop takes too long. Is there a faster way to do this?

My code:

import pandas as pd
import numpy as np
# Create the dataframe
data = [
["id_1",6,7,9, "A", "x"],
["id_2",9,7,1, "A","y" ],
["id_3",3,5,5, "C", "x"],
["id_4",6,8,9, "A","x" ],
["id_5",6,7,10, "B", "z"],
["id_6",9,5,10,"B", "z"],
["id_7",3,5,6, "C", "x"],
["id_8",3,4,6, "C", "x"],
["id_9",6,8,9, "A","x" ],
["id_10",6,7,10, "B", "z"],
["id_11",9,5,10,"B", "z"],
["id_12",3,5,6, "C", "x"]

]
test = pd.DataFrame(data, columns = ['Student_Id', 'Math', 'Physical','Arts', 'Class', 'Sub_Class'])

# Create two arrays with the same number of rows as the test data
arr_list = np.array([[1, 2, 3], [4, 5, 6], [4, 5, 6], [4, 5, 6], [4, 5, 6], [4, 5, 6], [1, 2, 3], [4, 5, 6], [4, 5, 6], [4, 5, 6], [4, 5, 6], [4, 5, 6]])
array_top = np.array([[0],[1],[1],[2],[1], [0], [0],[1],[1],[2],[1], [0]])
# Create the column Highest_Score
for index, row in test.iterrows():
    test.loc[index, 'Highest_Score'] = arr_list[index][array_top[index]]

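Loop over the arrays first to build the new column, then assign it to the dataframe. This is much faster than looping over every row of the dataframe.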

71.7 µs vs 2.77 ms (about 39 times faster). My timings:

In [95]: %%timeit
...: new_test['Highest_Score'] = [arr_list[r][c][0] for r,c in enumerate(array_top)]
...:
...:
71.7 µs ± 1.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [96]: %%timeit
...: for index, row in test.iterrows():
...:       test.loc[index,'Highest_Score'] = arr_list [index][array_top [index]]
...:
2.77 ms ± 49.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

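As a general rule when adding new data to a pandas DataFrame, you want to do all the looping and assembling of data outside of pandas, then assign all of it at once.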
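The same rule can be pushed one step further by dropping the Python-level loop entirely. The following is only a sketch, not something timed in this answer: it assumes array_top always has shape (n, 1) and holds valid column indices into arr_list, as in the question, and it uses NumPy's take_along_axis in place of the list comprehension. The four-row data is a shortened, illustrative version of the question's arrays.

```python
import numpy as np
import pandas as pd

# Shortened, illustrative versions of the arrays from the question.
arr_list = np.array([[1, 2, 3], [4, 5, 6], [4, 5, 6], [4, 5, 6]])
array_top = np.array([[0], [1], [1], [2]])
test = pd.DataFrame({"Student_Id": ["id_1", "id_2", "id_3", "id_4"]})

# For each row i, pick arr_list[i, array_top[i, 0]] in one NumPy call.
# The result has shape (n, 1), so flatten it to one value per row.
highest = np.take_along_axis(arr_list, array_top, axis=1).ravel()

# Assign the whole column at once instead of writing row by row.
test["Highest_Score"] = highest
print(test)
```

An equivalent spelling is arr_list[np.arange(len(arr_list)), array_top.ravel()]. Either way the per-row work happens inside NumPy, so it would be worth timing against the list comprehension on your real data.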
