Performance of a "left join" in F#



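I am effectively performing a SQL left join in F#: I take two CSV files and produce a third. My files are not very large (around 200,000 rows), but I still get terrible performance. In fact, doing it with VLOOKUP in Excel would be faster...

Both CSVs have an "identifier" column holding compatible values, but a value present in one CSV is not guaranteed to exist in the other.

I have tinkered with it, and I suspect that searching the other CSV for the matching row is what kills performance.

EDIT: Replacing the Array with a Map improved performance dramatically, but I think it could still be improved further.

Any ideas for improvement?

Some (pseudo) code: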

open FSharp.Data

type DataLeft =
    CsvProvider<Sample = "identifier;var1;var2", AssumeMissingValues = true, Schema = "identifier (string), var1, var2", Separators = ";", HasHeaders = true, Encoding = "UTF-8">

type DataRight =
    CsvProvider<Sample = "identifier;var3;var4", AssumeMissingValues = true, Schema = "identifier (string), var3 (float option), var4 (float option)", Separators = ";", HasHeaders = true, Encoding = "UTF-8">

type Output =
    CsvProvider<Sample = "identifier;var1;var2;var3;var4", AssumeMissingValues = true, Schema = "identifier (string), var1, var2, var3 (float option), var4 (float option)", Separators = ";", HasHeaders = true, Encoding = "UTF-8">

// leftPath and rightPath are defined elsewhere
let leftRows = DataLeft.Load(leftPath).Rows

// (slightly) more efficient to convert to array
let rightRows = DataRight.Load(rightPath).Rows |> Seq.toArray
// EDIT: let rightRows = DataRight.Load(rightPath).Rows |> Seq.map (fun row -> (row.Identifier, row)) |> Map.ofSeq

// Build one output row: left columns plus the matching right columns, if any
let getMissingVars (row : DataLeft.Row) =
    let id = row.Identifier
    // Linear scan of the right-hand array for every left row
    let rightRow = rightRows |> Array.tryFind (fun rRow -> rRow.Identifier = id)
    // EDIT: let rightRow = rightRows.TryFind(id)
    match rightRow with
    | None ->
        Output.Row(
            id,
            row.Var1,
            row.Var2,
            None,
            None)
    | Some realRow ->
        Output.Row(
            id,
            row.Var1,
            row.Var2,
            realRow.Var3,
            realRow.Var4)

let rows = leftRows |> Seq.map getMissingVars
let csv = new Output(rows)
csv.Save(path = "outputPath")

Simply creating a dictionary for the lookups solved the problem. I have given up trying to improve it further, so I am posting this as the answer.
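A rough back-of-the-envelope estimate of why the lookup structure matters so much. It assumes both files have about 200,000 rows (the figure from the question) and that most identifiers find a match; the variable names are only for illustration, and the counts are comparisons, not measured timings.

```fsharp
// Assumption: both files have roughly this many rows (figure taken from the question).
let n = 200000.0

// Array.tryFind scans linearly: about n / 2 comparisons per left row when a
// match exists, and a full scan of n comparisons when it does not.
let linearScanComparisons = n * (n / 2.0)                // 2.0e10

// Map is a balanced tree: on the order of log2 n comparisons per lookup.
let mapLookupComparisons = n * System.Math.Log(n, 2.0)   // roughly 3.5e6

printfn "linear scan: %g, Map: %g" linearScanComparisons mapLookupComparisons
```

That is a difference of several thousand times in the number of comparisons, which is consistent with the speed-up seen after the edit.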

As per the edit, replace

let rightRows = DataRight.Load(rightPath).Rows |> Seq.toArray

with

let rightRows =
    DataRight.Load(rightPath).Rows
    |> Seq.map (fun row -> (row.Identifier, row))
    |> Map.ofSeq

or some better dictionary. Then replace Array.tryFind with Map.tryFind.
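As for "some better dictionary", here is a minimal sketch using a mutable `System.Collections.Generic.Dictionary`, which gives hash-based lookups instead of tree lookups. The function name `leftJoin` and its key-selector parameters are hypothetical, not part of FSharp.Data, and plain tuples stand in for the CSV rows so that the sketch runs without the type provider. I have not benchmarked it against the Map version.

```fsharp
open System.Collections.Generic

// Hypothetical helper: pairs every left item with Some matching right item,
// or None when no right item has the same key.
let leftJoin (leftKey: 'L -> 'K) (rightKey: 'R -> 'K) (left: seq<'L>) (right: seq<'R>) : seq<'L * 'R option> =
    // Build the index once. Assigning through the indexer means the last
    // item wins on duplicate keys, which matches what Map.ofSeq does.
    let index = Dictionary<'K, 'R>()
    for r in right do
        index.[rightKey r] <- r
    left
    |> Seq.map (fun l ->
        match index.TryGetValue(leftKey l) with
        | true, r -> l, Some r
        | false, _ -> l, None)

// Illustrative data: (identifier, value) tuples instead of CSV rows.
let left = [ "a", 1.0; "b", 2.0; "c", 3.0 ]
let right = [ "a", "x"; "c", "z"; "d", "w" ]

// "b" has no match on the right; "d" exists only on the right and is dropped.
let joined = leftJoin fst fst left right |> Seq.toList
```

In the code from the question this would amount to keying the dictionary on `row.Identifier` and matching on the result of `TryGetValue` inside `getMissingVars` instead of calling `TryFind`.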
