How to avoid for loops in R [triple loop a.k.a. triple threat]



Currently I am running into computation-time problems because I run a triple for loop in R to create anomaly thresholds at the day-of-week and hour level for every unique ID.

My original data frame: unique ID, event date hour, event date, event day of week, event hour, numeric variable 1, numeric variable 2, etc.
library(dplyr)       # data_frame()
library(MASS)        # ginv()
library(FactoMineR)  # PCA()

df <- read.csv("mm.csv", header=TRUE, sep=",")
for (i in unique(df$customer_id)) {
  #I initialize the output data frame so I can rbind as I loop through the grains. This data frame is always emptied out once we move onto the next customer_id
  output.df <- data_frame(seller_name = factor(), is_anomaly_date = integer(), event_date_hr = double(), event_day_of_wk = integer(), event_day = double(), ...)
  for (k in unique(df$event_day_of_wk)) {
    for (z in unique(df$event_hr)) {
      merchant.df = df[df$customer_id==i & df$event_day_of_wk==k & df$event_hr==z,10:19] #columns 10:19 are the 9 different numeric variables I am creating anomaly thresholds for
      #1st anomaly threshold - I have multiple different anomaly thresholds
      # TRANSFORM VARIABLES - sometime within the for loop I run another loop that transforms the subset of data within it.
      for(j in names(merchant.df)){
        merchant.df[[paste0(j,"_log")]] <- log(merchant.df[[j]]+1)
        #merchant.df[[paste0(j,"_scale")]] <- scale(merchant.df[[j]])
        #merchant.df[[paste0(j,"_cube")]] <- merchant.df[[j]]**3
        #merchant.df[[paste0(j,"_cos")]] <- cos(merchant.df[[j]])
      }
      mu_vector        = apply( merchant.df, 2, mean )
      sigma_matrix     = cov( merchant.df, use="complete.obs", method='pearson' )
      inv_sigma_matrix = ginv(sigma_matrix)
      det_sigma_matrix = det( sigma_matrix )
      z_probas = apply( merchant.df, 1, mv_gaussian, mu_vector, det_sigma_matrix, inv_sigma_matrix )
      eps = quantile(z_probas,0.01)
      mv_outliers = ifelse( z_probas<eps, TRUE, FALSE )
      #2nd anomaly threshold
      nov = ncol(merchant.df)
      pca_result <- PCA(merchant.df,graph = F, ncp = nov, scale.unit = T)
      pca.var <- pca_result$eig[['cumulative percentage of variance']]/100
      lambda <- pca_result$eig[, 'eigenvalue']
      anomaly_score = (as.matrix(pca_result$ind$coord) ^ 2) %*% (1 / as.matrix(lambda, ncol = 1))
      significance <- c (0.99)
      thresh = qchisq(significance, nov)
      pca_outliers = ifelse( anomaly_score > thresh , TRUE, FALSE )
      #This is where I bind the anomaly points with the original data frame and then I row bind to the final output data frame then the code goes back to the top and loops through the next hour and then day of the week. Temp.output.df is constantly remade and output.df is slowly growing bigger.
      temp.output.df <- cbind(merchant.df, mv_outliers, pca_outliers)
      output.df <- rbind(output.df, temp.output.df)
     }
    }
   #Again this is where I write the output for a particular unique_ID then output.df is recreated at the top for the next unique_ID
   write.csv(output.df, paste0(i, ".csv"), row.names=FALSE)  # one csv per customer_id
   }

The code above shows the idea of what I am doing. As you can see, I run 3 for loops in which I compute multiple anomaly detections at the lowest granularity (i.e., the hour-of-day-of-week level), and once that is done I output each unique customer_id level into a csv.

Overall the code runs very fast; however, doing a triple for loop is killing my performance. Does anyone know of any other way I could do an operation like this, given my original data frame and the need to output a csv at every unique_id level?

  • So don't use the triple loop. Use dplyr::group_by(customer_id, event_day_of_wk, event_hr), or the data.table equivalent. Both should be much faster.
  • There's no need to explicitly append with rbind and cbind on every iteration; that kills your performance.
  • Also, there's no need to cbind() your entire input df into the output df; your only actual outputs are mv_outliers and pca_outliers; you can join() the input and output dfs later on customer_id, event_day_of_wk, event_hr.
  • EDIT: since you want to collate all the results for each customer_id and then write.csv() them, that needs to happen at the outer level of the grouping, with group_by(event_day_of_wk, event_hr) at the inner level.
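The point about rbind() inside a loop deserves emphasis: every call copies everything accumulated so far, so the cost grows quadratically. A minimal sketch of the usual fix, assuming only a data frame df with a customer_id column (the per-group body here is a stand-in for the real computation):

```r
library(dplyr)  # for bind_rows()

ids <- unique(df$customer_id)
results <- vector("list", length(ids))  # pre-allocate one list slot per group
for (i in seq_along(ids)) {
  # stand-in for the real per-customer work
  results[[i]] <- df[df$customer_id == ids[i], ]
}
output.df <- bind_rows(results)  # bind once at the end, not inside the loop
```

Each iteration only fills a list slot (cheap); the single bind_rows() at the end does one allocation instead of many.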


# Here is pseudocode, you can figure out the rest, do things incrementally
# It looks like seller_name, is_anomaly_date, event_date_hr, event_day_of_wk, event_day,... are variables from your input
require(dplyr)
output.df <- df %>%
  group_by(customer_id) %>%
    group_by(event_day_of_wk, event_hr, .add = TRUE) %>%  # .add keeps the customer_id grouping
    # columns 10:19 ('foo','bar','baz'...) are the 9 different numeric variables I am creating anomaly thresholds
    # Either a) you can hardcode their names in mutate(), summarize() calls
    #  or b) you can reference the vars by string in mutate_(), summarize_() calls
    # TRANSFORM VARIABLES
    mutate(foo_log = log1p(foo), bar_log = log1p(bar), ...) %>%
    mutate(mu_vector = c(mean(foo_log), mean(bar_log)...) ) %>%
    # compute sigma_matrix, inv_sigma_matrix, det_sigma_matrix ...
    summarize(
       z_probas=mv_gaussian(mu_vector, det_sigma_matrix, inv_sigma_matrix),
       eps = quantile(z_probas,0.01),
       mv_outliers = (z_probas<eps)
    ) %>%
    # similarly, use mutate() and do.call() for your PCA invocation...
    # Your outputs are mv_outliers, pca_outliers
    # You don't necessarily need to `cbind(merchant.df, mv_outliers, pca_outliers)` i.e. cbind all your input data together with your output
    # Now remove all your temporary variables from your output:
    select(-foo_log, -bar_log, ...) %>%
    # or else just select(mv_outliers, pca_outliers) the variables you want to keep
  ungroup() %>%  # (this ends the group_by(event_day_of_wk, event_hr) and cbinds all the intermediate dataframes for you)
  write.csv( c(.$mv_outliers, .$pca_outliers), file='<this_customer_id>.csv')
ungroup()  # group_by(customer_id)
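One hedged way to make the pseudocode above concrete is group_modify(), which runs a function once per group and binds the results for you. This sketch assumes your df, your mv_gaussian() function, and hypothetical numeric column names in num_cols; the log(x+1) transform and 0.01 quantile come from your code, everything else is an assumption:

```r
library(dplyr)

num_cols <- c("foo", "bar", "baz")  # replace with your 9 numeric column names

flag_mv_outliers <- function(g, key) {
  num <- log1p(g[, num_cols])                 # the log(x+1) transform
  mu  <- colMeans(num, na.rm = TRUE)
  sig <- cov(num, use = "complete.obs")
  z   <- apply(num, 1, mv_gaussian, mu, det(sig), MASS::ginv(sig))
  g$mv_outliers <- z < quantile(z, 0.01)      # 1st anomaly threshold
  g                                           # return the group with its flags
}

for (id in unique(df$customer_id)) {
  out <- df %>%
    filter(customer_id == id) %>%
    group_by(event_day_of_wk, event_hr) %>%
    group_modify(flag_mv_outliers) %>%
    ungroup()
  write.csv(out, paste0(id, ".csv"), row.names = FALSE)  # one csv per customer_id
}
```

Calling MASS::ginv() with the namespace prefix avoids library(MASS) masking dplyr::select().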

See also "write.csv() in a dplyr chain".
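For the data.table route mentioned in the first bullet, the same per-group split looks roughly like this (again assuming your mv_gaussian() function; the column positions 10:19 come from your code):

```r
library(data.table)

dt <- as.data.table(df)
num_cols <- names(dt)[10:19]  # the numeric columns from your code

# := assigns the flag per group; .SD is the subset of num_cols for that group
dt[, mv_outliers := {
     num <- log1p(.SD)
     sig <- cov(num, use = "complete.obs")
     z   <- apply(num, 1, mv_gaussian, colMeans(num), det(sig), MASS::ginv(sig))
     z < quantile(z, 0.01)
   },
   by = .(customer_id, event_day_of_wk, event_hr),
   .SDcols = num_cols]
```

Afterwards a simple loop over unique(dt$customer_id) with fwrite() produces the per-customer csvs.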
