Iterate over a pandas DataFrame, select rows by a condition, and when the condition is true, select some other rows that contain only unique values

I have a large (1M+ rows) dataframe, something like

Column A    Column B   Column C
0       'Aa'        'Ba'        14    
1       'Ab'        'Bc'        24           
2       'Ab'        'Ba'        24
...

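Basically I have a list of string pairs, each with a number, where the number depends only on column A. What I want to do is:

Iterate over the dataframe
For each row, check a condition
If the condition passes, select that row
Sample N other rows, so that each row that passed the condition ends up with N+1 rows
But sample them in such a way that each N+1 group contains only rows for which the condition also passed, with no repeated strings in column A or B
Repeats across different N+1 groups don't matter, and the resulting list of N+1 groups will be much longer than the initial df. My task requires that all entries are processed and passed along in N+1 groups that contain no duplicates.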
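
The steps above can be sketched roughly as follows. This is only an illustration under assumptions, not a tuned solution: the helper name `build_group`, the `condition` callable applied to column C, and the idea of visiting candidates in a shuffled order are not from the original post.

```python
import numpy as np
import pandas as pd


def build_group(df, seed_pos, condition, n, rng):
    """Return positions of one group: the seed row plus up to n other rows.

    Hypothetical helper: every added row passes `condition`, and no string
    from column A or B appears twice within the group.
    """
    a = df["A"].to_numpy()
    b = df["B"].to_numpy()
    c = df["C"].to_numpy()

    group = [seed_pos]
    used = {a[seed_pos], b[seed_pos]}  # strings already present in the group

    # Visit the rows in random order, each at most once.
    for pos in rng.permutation(len(df)):
        if len(group) == n + 1:
            break
        if pos == seed_pos or not condition(c[pos]):
            continue
        # Skip rows that would repeat a string (including A == B in one row).
        if a[pos] in used or b[pos] in used or a[pos] == b[pos]:
            continue
        used.update((a[pos], b[pos]))
        group.append(int(pos))
    return group


df = pd.DataFrame(
    {
        "A": ["Aa", "Ab", "Ab", "Ac", "Ad"],
        "B": ["Ba", "Bc", "Ba", "Bd", "Be"],
        "C": [14, 24, 24, 30, 16],
    }
)
rng = np.random.default_rng(0)
group = build_group(df, seed_pos=1, condition=lambda x: x > 15, n=2, rng=rng)
print(df.iloc[group])
```

Because each candidate is visited at most once, the loop ends even when fewer than N viable rows exist; the group is then simply shorter than N+1.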

For example, with the condition Column C > 15 and N = 5, then for a row that passed the condition:

Column A    Column B   Column C
78       'Ae'        'Bf'        16

we would have the N other rows of the group, for example:

Column A    Column B   Column C
78       'Ag'        'Br'        18
111      'Ah'        'Bg'        20
20       'An'        'Bd'        17
19       'Am'        'Bk'        18
301      'Aq'        'Bq'        32

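My initial code was a mess. I tried sampling rows at random until reaching N, checking the condition on each one, and building a dictionary of duplicates to check that they are unique. However, rolling random numbers again and again over a range millions of rows long proved to be too slow.

My second idea was to iterate forward from the row that passed the condition, searching for other rows that pass it and again checking them against a dictionary of duplicates. This started to look more feasible, but it has a problem: when the end of the df is reached without having found N viable rows, the iteration has to reset to the start of the df. It still feels slow. Like this: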

def build_groups(df, condition, N):
    in_data = []
    for i in range(len(df)):
        A = df.iloc[i]['A']
        B = df.iloc[i]['B']

        if condition(A):
            in_data.append([A, B])
            # strings already used in the current group
            dup_dict = {A: 1, B: 1}
            j = i
            k = 1        # rows in the current group so far, seed row included
            scanned = 0  # rows visited; stops the search after one full pass

            while k != N and scanned < len(df):
                other_A = df.iloc[j]['A']
                other_B = df.iloc[j]['B']

                if (condition(other_A) and
                        other_A not in dup_dict and
                        other_B not in dup_dict):
                    dup_dict[other_A] = 1
                    dup_dict[other_B] = 1
                    in_data.append([other_A, other_B])
                    k += 1

                j += 1
                scanned += 1

                # wrap around to the start when the end of df is reached
                if j == len(df):
                    j = 0

    return in_data
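
As a side note, the reset to the start of the df can be written with modular arithmetic instead of a manual index reset. A minimal sketch, where the generator name `wrapped_positions` is made up for illustration:

```python
def wrapped_positions(start, length):
    """Yield every position once, starting at `start` and wrapping at the end."""
    for offset in range(length):
        yield (start + offset) % length


print(list(wrapped_positions(3, 5)))  # [3, 4, 0, 1, 2]
```

Iterating over such a generator visits each row exactly once, so the search cannot loop forever when fewer than N viable rows exist.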

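My latest idea was to implement it via apply(), but it started getting too complicated, because I couldn't figure out how to properly index the df inside apply() and iterate forward, plus how to do the reset trick.

So there must be a leaner solution. Oh, and the original dataframe is more like ~60M rows long, but it is split via multiprocessing and distributed across the available CPU cores, so the size per task is smaller.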

Edit: the condition is dynamic, i.e. column C is compared against a random number on every check, so the rows should not be pre-masked.
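
One possible reading of such a dynamic condition, as a sketch only: a fresh random threshold is drawn for every check and column C is compared against it. The function name `passes` and the threshold range are assumptions, not details from the post.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)


def passes(c_values, low=10, high=30):
    """Hypothetical dynamic condition: C must exceed a freshly drawn threshold."""
    # One new integer threshold in [low, high) per value, redrawn on every call.
    thresholds = rng.integers(low, high, size=len(c_values))
    return c_values > thresholds


df = pd.DataFrame({"C": [14, 24, 24, 32, 5]})
mask = passes(df["C"].to_numpy())
print(mask)
```

Since the thresholds are redrawn on every call, a mask computed once cannot be reused for later checks, which is why pre-masking does not fit here.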

Edit 2: fixed some typos.

If I have this right, you are after something like this:

import pandas as pd

N = 5  # number of extra rows to sample per group

data = [
    ["Ag", "Br", 18],
    ["Ah", "Bg", 20],
    ["An", "Bd", 17],
    ["Am", "Bk", 18],
    ["Aq", "Bq", 32],
    ["Aq", "Aq", 16],
]
df = pd.DataFrame(data=data, columns=['A', 'B', 'C'])
# keep rows that pass the condition and whose A and B strings differ
temp_df = df[(df.C > 14) & (df.A != df.B)]  # e.g. condition_on_c = 14
# get the first row to sample
initial_row_index = temp_df.sample(1, random_state=42).index.values[0]
# replace=True means sampling with replacement, so you may get duplicate rows
# (definitely if N > len(temp_df) - 1)
output = temp_df[temp_df.index != initial_row_index].sample(N, replace=True)
output = pd.concat([temp_df.loc[[initial_row_index]], output])

If N = 5 we get

    A   B   C
1  Ah  Bg  20 # initial row
3  Am  Bk  18
4  Aq  Bq  32
2  An  Bd  17
4  Aq  Bq  32
4  Aq  Bq  32

You can see that the index is the original index from the dataframe you are sampling from, so you may want to reset it.
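
A minimal sketch of resetting the index with `reset_index`, using a small stand-in for `output` rather than the sampled result above:

```python
import pandas as pd

# Stand-in for the sampled frame, still carrying the original row labels.
output = pd.DataFrame(
    {"A": ["Ah", "Am", "Aq"], "B": ["Bg", "Bk", "Bq"], "C": [20, 18, 32]},
    index=[1, 3, 4],
)
# drop=True discards the old index instead of keeping it as a new column.
output = output.reset_index(drop=True)
print(output.index.tolist())  # [0, 1, 2]
```

Leaving out `drop=True` keeps the old labels in a column named `index`, which can be handy if the original row positions are still needed.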
