For loop to find concordance is taking too long on a large data set (14 hrs for 0.15 mln * 36k rows)



I am running this code in Python 3.5 to find concordance (logistic regression):

pairs_tested = 0
conc = 0
ties = 0
disc = 0
for i in ones2.index:
    for j in zeros2.index:
        pairs_tested = pairs_tested + 1
        if ones2.iloc[i, 1] > zeros2.iloc[j, 1]:
            conc = conc + 1
        elif ones2.iloc[i, 1] == zeros2.iloc[j, 1]:
            ties = ties + 1
        else:
            disc = disc + 1

# Calculate concordance, discordance and ties
concordance = conc / pairs_tested
discordance = disc / pairs_tested
ties_perc = ties / pairs_tested
print("Concordance = %r" % concordance)
print("Discordance = %r" % discordance)
print("Tied = %r" % ties_perc)
print("Pairs = %r" % pairs_tested)

There are 0.15 mln rows in zeros2 (pandas DataFrame) and 36k rows in ones2 (pandas DataFrame). Both tables have two variables:

[i] Responder (RESPONDER0 = 0 in zeros2, and RESPONDER1 = 1 in ones2).

[ii] Probability (PROB0 in zeros2 and PROB1 in ones2).
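For illustration, a toy version of the two frames with this layout might look like the snippet below (the column names RESPONDER*/PROB* are taken from the description above; the real frames are of course far larger):

import pandas as pd

# Toy stand-ins for the real frames (real sizes: ~0.15 mln and ~36k rows).
zeros2 = pd.DataFrame({'RESPONDER0': [0, 0, 0],
                       'PROB0':      [0.20, 0.55, 0.35]})
ones2 = pd.DataFrame({'RESPONDER1': [1, 1],
                      'PROB1':      [0.80, 0.55]})

# Column 1 (iloc[:, 1]) is the predicted probability used in the loops.
print(zeros2.iloc[0, 1], ones2.iloc[0, 1])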

My question: the for loop had already been running for 12 hours at the time of asking and was still going. I need help making this run faster. I am running it on a 64-bit Windows machine with 8 GB RAM.

Your code is doing about 5.4 billion comparisons, which is what the two nested loops (0.15 mln * 36k) amount to.
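A rough back-of-envelope comparison of the two approaches (row counts taken from the question; the exact split between the two frames does not change the product):

import math

n_zeros, n_ones = 150_000, 36_000          # approximate row counts from the question

nested_loop = n_zeros * n_ones             # every (one, zero) pair is compared once
sort_plus_bisect = n_zeros * math.log2(n_zeros) + n_ones * math.log2(n_zeros)

print(f"nested loop   : {nested_loop:,.0f} comparisons")          # 5,400,000,000
print(f"sort + bisect : {sort_plus_bisect:,.0f} comparisons (rough estimate)")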

I would do something like this (thanks to @Leon for helping me make this answer better):

from bisect import bisect_left, bisect_right

conc = 0
disc = 0
ties = 0
zeros2_list = sorted([zeros2.iloc[j, 1] for j in zeros2.index])
zeros2_length = len(zeros2_list)
for i in ones2.index:
    cur_disc = bisect_left(zeros2_list, ones2.iloc[i, 1])
    cur_ties = bisect_right(zeros2_list, ones2.iloc[i, 1]) - cur_disc
    disc += cur_disc
    ties += cur_ties
    conc += zeros2_length - cur_ties - cur_disc
pairs_tested = zeros2_length * len(ones2.index)
concordance = conc / pairs_tested
discordance = disc / pairs_tested
ties_perc = ties / pairs_tested
print("Concordance = %r" % concordance)
print("Discordance = %r" % discordance)
print("Tied = %r" % ties_perc)
print("Pairs = %r" % pairs_tested)
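For reference, the two bisect calls behave like this on a sorted list: bisect_left returns how many elements are strictly below the key and bisect_right how many are at or below it, so their difference is the number of ties. A minimal check:

from bisect import bisect_left, bisect_right

scores = sorted([0.1, 0.3, 0.3, 0.7, 0.9])    # toy sorted probabilities
key = 0.3
below = bisect_left(scores, key)              # 1 element strictly below 0.3
at_or_below = bisect_right(scores, key)       # 3 elements at or below 0.3
print(below, at_or_below - below, len(scores) - at_or_below)   # below / tied / above -> 1 2 2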

Or the other way around, like this:

conc = 0
disc = 0
ties = 0
zeros2_list = sorted([zeros2.iloc[j, 1] for j in zeros2.index])
ones2_list = sorted([ones2.iloc[i, 1] for i in ones2.index])
zeros2_length = len(zeros2_list)
ones2_length = len(ones2_list)
for i in zeros2.index:
    cur_conc = bisect_left(ones2_list, zeros2.iloc[i, 1])
    cur_ties = bisect_right(ones2_list, zeros2.iloc[i, 1]) - cur_conc
    conc += cur_conc
    ties += cur_ties
    disc += ones2_length - cur_ties - cur_conc
# We could also achieve the above like this:
# for i in zeros2_list:
#     cur_conc = bisect_left(ones2_list, i)
#     cur_ties = bisect_right(ones2_list, i) - cur_conc
#     conc += cur_conc
#     ties += cur_ties
#     disc += ones2_length - cur_ties - cur_conc
pairs_tested = zeros2_length * ones2_length
concordance = conc / pairs_tested
discordance = disc / pairs_tested
ties_perc = ties / pairs_tested
print("Concordance = %r" % concordance)
print("Discordance = %r" % discordance)
print("Tied = %r" % ties_perc)
print("Pairs = %r" % pairs_tested)
Latest update

I followed Sreyantha Chary's answer, which is elegant; however, the first part of the answer interchanges the concordance and discordance percentages. Posting the corrected code here for reference:

import pandas as pd
from bisect import bisect_left, bisect_right

Probability = model.predict_proba(data[predictors])
Probability1 = pd.DataFrame(Probability)
Probability1.columns = ['Prob_LoanStatus_0', 'Prob_LoanStatus_1']
TruthTable = pd.merge(data[[outcome]], Probability1[['Prob_LoanStatus_1']], how='inner', left_index=True, right_index=True)
zeros = TruthTable[TruthTable['Loan_Status'] == 0].reset_index().drop(['index'], axis=1)
ones = TruthTable[TruthTable['Loan_Status'] == 1].reset_index().drop(['index'], axis=1)

zeros_list = sorted([zeros.iloc[j, 1] for j in zeros.index])
zeros_length = len(zeros_list)
disc = 0
ties = 0
conc = 0
for i in ones.index:
    cur_conc = bisect_left(zeros_list, ones.iloc[i, 1])
    cur_ties = bisect_right(zeros_list, ones.iloc[i, 1]) - cur_conc
    conc += cur_conc
    ties += cur_ties
pairs_tested = zeros_length * len(ones.index)
disc = pairs_tested - conc - ties
print("Pairs = ", pairs_tested)
print("Conc = ", conc)
print("Disc = ", disc)
print("Tied = ", ties)
concordance = conc / pairs_tested
discordance = disc / pairs_tested
ties_perc = ties / pairs_tested
print("Concordance = %r" % concordance)
print("Discordance = %r" % discordance)
print("Tied = %r" % ties_perc)
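Alternatively, the per-row Python loop can be replaced with numpy.searchsorted, the vectorized counterpart of the bisect calls; a minimal sketch, assuming zeros and ones are the DataFrames built above and a reasonably recent pandas/NumPy:

import numpy as np

# Vectorized version of the corrected loop above; column 1 holds the predicted probability.
zero_probs = np.sort(zeros.iloc[:, 1].to_numpy())
one_probs = ones.iloc[:, 1].to_numpy()

below = np.searchsorted(zero_probs, one_probs, side='left')        # zeros strictly below each one-probability
at_or_below = np.searchsorted(zero_probs, one_probs, side='right')

conc = int(below.sum())
ties = int((at_or_below - below).sum())
pairs_tested = len(zero_probs) * len(one_probs)
disc = pairs_tested - conc - ties
print("Concordance = %r" % (conc / pairs_tested))
print("Discordance = %r" % (disc / pairs_tested))
print("Tied = %r" % (ties / pairs_tested))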