Deedle equivalent to pandas.merge



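I am looking to merge two Deedle (F#) frames based on a specific column in each frame, in a similar manner to pandas.DataFrame.merge. A good example is a primary frame that contains columns of data plus a (City, State) column, along with an info frame that contains the following columns: (City, State); Lat; Long. If I want to add the Lat and Long columns to my primary frame, I would merge the two frames on the (City, State) column.

Here is an example: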

    let primaryFrame =
            [(0, "Job Name", box "Job 1")
             (0, "City, State", box "Reno, NV")
             (1, "Job Name", box "Job 2")
             (1, "City, State", box "Portland, OR")
             (2, "Job Name", box "Job 3")
             (2, "City, State", box "Portland, OR")
             (3, "Job Name", box "Job 4")
             (3, "City, State", box "Sacramento, CA")] |> Frame.ofValues
    let infoFrame =
            [(0, "City, State", box "Reno, NV")
             (0, "Lat", box "Reno_NV_Lat")
             (0, "Long", box "Reno_NV_Long")
             (1, "City, State", box "Portland, OR")
             (1, "Lat", box "Portland_OR_Lat")
             (1, "Long", box "Portland_OR_Long")] |> Frame.ofValues
    // See the code for merge_On below.
    let mergedFrame = primaryFrame
                      |> merge_On infoFrame "City, State" null

This results in `mergedFrame` looking like this:

    > mergedFrame.Format();;
    val it : string =
      "     Job Name City, State    Lat             Long             
    0 -> Job 1    Reno, NV       Reno_NV_Lat     Reno_NV_Long     
    1 -> Job 2    Portland, OR   Portland_OR_Lat Portland_OR_Long 
    2 -> Job 3    Portland, OR   Portland_OR_Lat Portland_OR_Long 
    3 -> Job 4    Sacramento, CA <missing>       <missing>   

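I have come up with a way of doing this (the `merge_On` function used in the example above), but as a sales engineer who is new to F#, I imagine there is a more idiomatic or efficient way. Below is my code for it, together with a `removeDuplicateRows` function, which does what you would expect and is needed by `merge_On`. If you want to comment on a better way of doing that as well, please do.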

    let removeDuplicateRows column (frame : Frame<'a, 'b>) =
             let nonDupKeys = frame.GroupRowsBy(column).RowKeys
                              |> Seq.distinctBy (fun (a, b) -> a) 
                              |> Seq.map (fun (a, b) -> b)  
             frame.Rows.[nonDupKeys]

    let merge_On (infoFrame : Frame<'c, 'b>) mergeOnCol missingReplacement 
                  (primaryFrame : Frame<'a,'b>) =
          let frame = primaryFrame.Clone() 
          let infoFrame =  infoFrame                           
                           |> removeDuplicateRows mergeOnCol 
                           |> Frame.indexRows mergeOnCol
          let initialSeries = frame.GetColumn(mergeOnCol)
          let infoFrameRows = infoFrame.RowKeys
          for colKey in infoFrame.ColumnKeys do
              let newSeries =
                  [for v in initialSeries.ValuesAll do
                        if Seq.contains v infoFrameRows then  
                            let key = infoFrame.GetRow(v)
                            yield key.[colKey]
                        else
                            yield box missingReplacement ]
              frame.AddColumn(colKey, newSeries)
          frame

Thanks for your help!

UPDATE:

Switched Frame.indexRowsString to Frame.indexRows to handle cases where the values in `mergeOnCol` are not strings.

Got rid of infoFrame.Clone() as suggested by Tomas.

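Sadly, the way Deedle joins frames (only on row or column keys) means that it does not have a nice built-in function for joining frames on a non-key column.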

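As far as I can see, your approach looks very good to me. You do not need the Clone on infoFrame (because you are not mutating that frame), and I think you can replace infoFrame.GetRow with infoFrame.TryGetRow (then you will not need to get the keys in advance), but other than that, your code looks fine!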
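A minimal sketch of the TryGetRow suggestion, under stated assumptions: it is run as an F# script that can load Deedle from NuGet, the key column holds strings, and infoFrame has no duplicate keys. The names `infoByCity`, `lookupInfo` and `merged` and the "n/a" fallback value are made up for illustration and are not part of Deedle or of the original code.

```fsharp
#r "nuget: Deedle"
open Deedle

// Same sample data as in the question.
let primaryFrame =
    [ (0, "Job Name", box "Job 1"); (0, "City, State", box "Reno, NV")
      (1, "Job Name", box "Job 2"); (1, "City, State", box "Portland, OR")
      (2, "Job Name", box "Job 3"); (2, "City, State", box "Portland, OR")
      (3, "Job Name", box "Job 4"); (3, "City, State", box "Sacramento, CA") ]
    |> Frame.ofValues

let infoFrame =
    [ (0, "City, State", box "Reno, NV"); (0, "Lat", box "Reno_NV_Lat")
      (0, "Long", box "Reno_NV_Long"); (1, "City, State", box "Portland, OR")
      (1, "Lat", box "Portland_OR_Lat"); (1, "Long", box "Portland_OR_Long") ]
    |> Frame.ofValues

// Index the info frame by the merge column so rows can be looked up by key.
let infoByCity = infoFrame |> Frame.indexRowsString "City, State"

// Hypothetical helper: fetch one info cell for a key, or the fallback when
// the row or the cell is missing. No up-front list of row keys is needed.
let lookupInfo (colKey: string) (fallback: obj) (city: string) : obj =
    let row = infoByCity.TryGetRow<obj>(city)
    if row.HasValue then
        let cell = row.Value.TryGet(colKey)
        if cell.HasValue then cell.Value else fallback
    else fallback

// Copy the primary frame, then add one column per remaining info column.
let merged = primaryFrame.Clone()
let cities = merged.GetColumn<string>("City, State")
for colKey in infoByCity.ColumnKeys do
    merged.AddColumn(colKey, cities |> Series.mapValues (lookupInfo colKey (box "n/a")))
```

The explicit fallback is a design choice made only for this sketch; the question's version passes null to `merge_On`, which is what leaves the cells missing for Sacramento.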

I came up with an alternative and somewhat shorter way of doing this, which looks as follows:

    // Index the info frame by city/state, so that we can do lookup
    let infoByCity = infoFrame |> Frame.indexRowsString "City, State"
    // Create a new frame with the same row indices as 'primaryFrame' 
    // containing the additional information from infoFrame.
    let infoMatched = 
      primaryFrame.Rows
      |> Series.map (fun k row -> 
          // For every row, we get the "City, State" value of the row and then
          // find the corresponding row with additional information in infoFrame. Using 
          // 'ValueOrDefault' will automatically give missing when the key does not exist
          infoByCity.Rows.TryGet(row.GetAs<string>("City, State")).ValueOrDefault)
      // Now turn the series of rows into a frame
      |> Frame.ofRows
    // Now we have two frames with matching keys, so we can join!
    primaryFrame.Join(infoMatched)

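This is a bit shorter and perhaps more self-explanatory, but I have not run any tests to check which version is faster. Unless performance is a primary concern, I think going with the more readable version is a good default choice!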
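For checking either version against the intended behaviour, the left join itself can be sketched with plain F# collections, independent of Deedle. This is only an illustration of the semantics: the record types `Job` and `Info` and the names `infoLookup` and `leftJoined` are made up here, and keeping the first info row per key is an assumption that mirrors what `removeDuplicateRows` is meant to do.

```fsharp
// Hypothetical record types standing in for the rows of the two frames.
type Job = { Name: string; Location: string }
type Info = { CityState: string; Lat: string; Long: string }

let jobs =
    [ { Name = "Job 1"; Location = "Reno, NV" }
      { Name = "Job 2"; Location = "Portland, OR" }
      { Name = "Job 3"; Location = "Portland, OR" }
      { Name = "Job 4"; Location = "Sacramento, CA" } ]

let infos =
    [ { CityState = "Reno, NV"; Lat = "Reno_NV_Lat"; Long = "Reno_NV_Long" }
      { CityState = "Portland, OR"; Lat = "Portland_OR_Lat"; Long = "Portland_OR_Long" } ]

// Build a lookup keyed by the merge column, keeping the first row per key.
let infoLookup =
    infos
    |> List.distinctBy (fun i -> i.CityState)
    |> List.map (fun i -> i.CityState, i)
    |> Map.ofList

// Left join: every job is kept; Lat/Long are None when no info row matches.
let leftJoined =
    jobs
    |> List.map (fun j ->
        match Map.tryFind j.Location infoLookup with
        | Some i -> j.Name, j.Location, Some i.Lat, Some i.Long
        | None -> j.Name, j.Location, None, None)
```

Each `None` here corresponds to a `<missing>` cell in the merged frame.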
