I have a DataFrame test as follows:
Student_Id Math Physical Arts Class Sub_Class
0 id_1 6 7 9 A x
1 id_2 9 7 1 A y
2 id_3 3 5 5 C x
3 id_4 6 8 9 A x
4 id_5 6 7 10 B z
5 id_6 9 5 10 B z
6 id_7 3 5 6 C x
7 id_8 3 4 6 C x
8 id_9 6 8 9 A x
9 id_10 6 7 10 B z
10 id_11 9 5 10 B z
11 id_12 3 5 6 C x
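My code defines two arrays, arr_list and array_top. I want to create a new column by looping over each row of the DataFrame and filling in the value from the arrays, like this: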
for index, row in test.iterrows():
    test.loc[index, 'Highest_Score'] = arr_list[index][array_top[index]]
For larger datasets this loop takes too long. Is there a faster way to do this?
My code:
import pandas as pd
import numpy as np
# Create the DataFrame
data = [
["id_1",6,7,9, "A", "x"],
["id_2",9,7,1, "A","y" ],
["id_3",3,5,5, "C", "x"],
["id_4",6,8,9, "A","x" ],
["id_5",6,7,10, "B", "z"],
["id_6",9,5,10,"B", "z"],
["id_7",3,5,6, "C", "x"],
["id_8",3,4,6, "C", "x"],
["id_9",6,8,9, "A","x" ],
["id_10",6,7,10, "B", "z"],
["id_11",9,5,10,"B", "z"],
["id_12",3,5,6, "C", "x"]
]
test = pd.DataFrame(data, columns = ['Student_Id', 'Math', 'Physical','Arts', 'Class', 'Sub_Class'])
#Create two arrays which are of same length as the test data
arr_list = np.array([[1, 2, 3], [4, 5, 6], [4, 5, 6], [4, 5, 6], [4, 5, 6], [4, 5, 6], [1, 2, 3], [4, 5, 6], [4, 5, 6], [4, 5, 6], [4, 5, 6], [4, 5, 6]])
array_top = np.array([[0],[1],[1],[2],[1], [0], [0],[1],[1],[2],[1], [0]])
# Create the column Highest_Score, one row at a time
for index, row in test.iterrows():
    test.loc[index, 'Highest_Score'] = arr_list[index][array_top[index]]
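Looping over the arrays first to build the new column, and then assigning it to the DataFrame, is much faster than looping over each row of the DataFrame:
71.7 µs vs 2.77 ms (about 39 times faster). My timings: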
In [95]: %%timeit
...: new_test['Highest_Score'] = [arr_list[r][c][0] for r,c in enumerate(array_top)]
...:
...:
71.7 µs ± 1.7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In [96]: %%timeit
...: for index, row in test.iterrows():
...:     test.loc[index,'Highest_Score'] = arr_list[index][array_top[index]]
...:
2.77 ms ± 49.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
As a general rule for adding new data to a pandas DataFrame, you want to do all the looping and compiling outside of pandas and then assign all the data at once.
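Taking that rule one step further, the Python-level loop can probably be dropped altogether by using NumPy integer-array indexing. This is a minimal sketch and was not part of the timings above. It assumes arr_list is a 2-D array and array_top holds exactly one valid column index per row, as in the question. The names rows, cols, highest and same are only illustrative.

```python
import numpy as np
import pandas as pd

# Same arrays as in the question: one row of candidates and one column index per student.
arr_list = np.array([[1, 2, 3], [4, 5, 6], [4, 5, 6], [4, 5, 6], [4, 5, 6], [4, 5, 6],
                     [1, 2, 3], [4, 5, 6], [4, 5, 6], [4, 5, 6], [4, 5, 6], [4, 5, 6]])
array_top = np.array([[0], [1], [1], [2], [1], [0], [0], [1], [1], [2], [1], [0]])

# Stand-in for the question's DataFrame; only the row count matters here.
test = pd.DataFrame({"Student_Id": [f"id_{i}" for i in range(1, 13)]})

# Pair each row number with its column index and pick all values in one step.
rows = np.arange(len(arr_list))   # 0, 1, ..., 11
cols = array_top.ravel()          # flatten shape (12, 1) to (12,)
highest = arr_list[rows, cols]

# Assign the whole column at once, outside of any loop.
test["Highest_Score"] = highest

# Equivalent alternative: take_along_axis keeps array_top in its (12, 1) shape.
same = np.take_along_axis(arr_list, array_top, axis=1).ravel()
```

Both expressions select arr_list[i, array_top[i, 0]] for every row i, so they should give the same column as the list comprehension. I have not timed them, so it is worth measuring with %%timeit on your own data before relying on any speed-up.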