Polars: intersection of list columns in a DataFrame

import polars as pl
df = pl.DataFrame({'a': [[1, 2, 3], [8, 9, 4]], 'b': [[2, 3, 4], [4, 5, 6]]})

Given a DataFrame df

a           b
[1, 2, 3]   [2, 3, 4]
[8, 9, 4]   [4, 5, 6]

I would like to get a column c that is the intersection of a and b

a           b           c
[1, 2, 3]   [2, 3, 4]   [2, 3]
[8, 9, 4]   [4, 5, 6]   [4]

I know I can use the apply function with Python's set intersection, but I want to do it using Polars expressions.

Polars >= 0.18.10

Use the list set operations:

df.select(
    intersection = pl.col('a').list.set_intersection('b'),
    difference = pl.col('a').list.set_difference('b'),
    union = pl.col('a').list.set_union('b')
)

Polars >= 0.18.5, Polars < 0.18.10

Use the set operations on lists (old names):

df.select(
    intersection = pl.col('a').list.intersection('b'),
    difference = pl.col('a').list.difference('b'),
    union = pl.col('a').list.union('b')
)

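Polars < 0.18.5

We can compute the intersection using the arr.eval expression. arr.eval lets us treat a list as a Series/column, so that we can use the same contexts and expressions that we use with columns and Series.

First, let's extend your example so that we can show what happens when the intersection is empty.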

df = pl.DataFrame(
    {
        "a": [[1, 2, 3], [8, 9, 4], [0, 1, 2]],
        "b": [[2, 3, 4], [4, 5, 6], [10, 11, 12]],
    }
)
df

shape: (3, 2)
┌───────────┬──────────────┐
│ a         ┆ b            │
│ ---       ┆ ---          │
│ list[i64] ┆ list[i64]    │
╞═══════════╪══════════════╡
│ [1, 2, 3] ┆ [2, 3, 4]    │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [8, 9, 4] ┆ [4, 5, 6]    │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [0, 1, 2] ┆ [10, 11, 12] │
└───────────┴──────────────┘

The Algorithm

There are two ways to accomplish this. The first can be extended to the intersection of more than two sets (see Other Notes below).

df.with_column(
    pl.col("a")
    .arr.concat('b')
    .arr.eval(pl.element().filter(pl.count().over(pl.element()) == 2))
    .arr.unique()
    .alias('intersection')
)

or

df.with_column(
    pl.col("a")
    .arr.concat('b')
    .arr.eval(pl.element().filter(pl.element().is_duplicated()))
    .arr.unique()
    .alias('intersection')
)

shape: (3, 3)
┌───────────┬──────────────┬──────────────┐
│ a         ┆ b            ┆ intersection │
│ ---       ┆ ---          ┆ ---          │
│ list[i64] ┆ list[i64]    ┆ list[i64]    │
╞═══════════╪══════════════╪══════════════╡
│ [1, 2, 3] ┆ [2, 3, 4]    ┆ [2, 3]       │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [8, 9, 4] ┆ [4, 5, 6]    ┆ [4]          │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [0, 1, 2] ┆ [10, 11, 12] ┆ []           │
└───────────┴──────────────┴──────────────┘

How it works

We first concatenate the two lists into a single list. Any element that is in both lists will appear twice.

df.with_column(
    pl.col("a")
    .arr.concat('b')
    .alias('ablist')
)

shape: (3, 3)
┌───────────┬──────────────┬────────────────┐
│ a         ┆ b            ┆ ablist         │
│ ---       ┆ ---          ┆ ---            │
│ list[i64] ┆ list[i64]    ┆ list[i64]      │
╞═══════════╪══════════════╪════════════════╡
│ [1, 2, 3] ┆ [2, 3, 4]    ┆ [1, 2, ... 4]  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [8, 9, 4] ┆ [4, 5, 6]    ┆ [8, 9, ... 6]  │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [0, 1, 2] ┆ [10, 11, 12] ┆ [0, 1, ... 12] │
└───────────┴──────────────┴────────────────┘

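Then we can use the arr.eval function, which lets us treat the concatenated list as if it were a Series/column. In this case, we use a filter context to find any element that appears more than once. (The polars.element expression plays the same role inside a list context that polars.col plays for a Series.)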

df.with_column(
    pl.col("a")
    .arr.concat('b')
    .arr.eval(pl.element().filter(pl.count().over(pl.element()) == 2))
    .alias('filtered')
)

shape: (3, 3)
┌───────────┬──────────────┬───────────────┐
│ a         ┆ b            ┆ filtered      │
│ ---       ┆ ---          ┆ ---           │
│ list[i64] ┆ list[i64]    ┆ list[i64]     │
╞═══════════╪══════════════╪═══════════════╡
│ [1, 2, 3] ┆ [2, 3, 4]    ┆ [2, 3, ... 3] │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [8, 9, 4] ┆ [4, 5, 6]    ┆ [4, 4]        │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ [0, 1, 2] ┆ [10, 11, 12] ┆ []            │
└───────────┴──────────────┴───────────────┘

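Note: the step above can also be expressed using the is_duplicated expression. (In the Other Notes section, we'll see that is_duplicated will not work when calculating the intersection of more than two sets.)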

df.with_column(
    pl.col("a")
    .arr.concat('b')
    .arr.eval(pl.element().filter(pl.element().is_duplicated()))
    .alias('filtered')
)

All that remains is to remove the duplicates from the result with the arr.unique expression, which gives the result shown at the beginning.
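
To make the three steps concrete (concatenate, keep elements that occur twice, drop duplicates), here is a plain-Python trace of the same idea on the three example rows. It is only an illustration of the algorithm, not the Polars implementation; the helper name intersect_by_counting is made up for this sketch, and it assumes each input list has no internal duplicates.

```python
from collections import Counter

def intersect_by_counting(a, b):
    """Intersection of two duplicate-free lists: concat -> filter -> unique."""
    combined = a + b                                    # step 1: concatenate
    counts = Counter(combined)                          # occurrences per element
    repeated = [x for x in combined if counts[x] == 2]  # step 2: keep elements seen twice
    return list(dict.fromkeys(repeated))                # step 3: drop duplicates, keep order

rows = [
    ([1, 2, 3], [2, 3, 4]),
    ([8, 9, 4], [4, 5, 6]),
    ([0, 1, 2], [10, 11, 12]),
]
results = [intersect_by_counting(a, b) for a, b in rows]
print(results)  # [[2, 3], [4], []]
```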

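Other Notes

I'm assuming that your lists are really sets, in that each element appears only once in each list. If the original lists contain duplicates, we can apply arr.unique to each list before the concatenation step.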
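
A small plain-Python illustration of why the de-duplication matters for a count-based intersection: if one list repeats an element, that element reaches a count of two on its own and is wrongly reported as shared. The helper and its dedupe_first flag are hypothetical names used only for this sketch.

```python
from collections import Counter

def intersect_by_counting(a, b, dedupe_first):
    """Count-based intersection; optionally de-duplicate each input first."""
    if dedupe_first:
        a = list(dict.fromkeys(a))
        b = list(dict.fromkeys(b))
    combined = a + b
    counts = Counter(combined)
    return list(dict.fromkeys(x for x in combined if counts[x] == 2))

a, b = [1, 1, 2], [2, 3]

# Without de-duplication, 1 occurs twice inside "a" alone and slips through.
naive = intersect_by_counting(a, b, dedupe_first=False)
# With de-duplication, only the element present in both lists remains.
fixed = intersect_by_counting(a, b, dedupe_first=True)

print(naive)  # [1, 2]
print(fixed)  # [2]
```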

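Also, this process can be extended to find the intersection of more than two sets. Simply concatenate all of the lists together, and change the filter step from == 2 to == n (where n is the number of sets). (Note: the is_duplicated expression above will not work with more than two sets.)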
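
The difference between the two filters shows up as soon as there are three sets. The plain-Python sketch below (helper names are made up, inputs are assumed duplicate-free) compares "count equals n" with "appears more than once", which is what an is_duplicated-style filter keeps.

```python
from collections import Counter

def intersect_many(*lists):
    """Keep elements whose count equals the number of lists."""
    n = len(lists)
    combined = [x for lst in lists for x in lst]
    counts = Counter(combined)
    return list(dict.fromkeys(x for x in combined if counts[x] == n))

def duplicated_only(*lists):
    """Keep elements that occur more than once (an is_duplicated-style filter)."""
    combined = [x for lst in lists for x in lst]
    counts = Counter(combined)
    return list(dict.fromkeys(x for x in combined if counts[x] > 1))

a, b, c = [1, 2, 3], [2, 3, 4], [3, 4, 5]

exact = intersect_many(a, b, c)    # only 3 is in all three lists
loose = duplicated_only(a, b, c)   # 2 and 4 are in just two lists but still pass

print(exact)  # [3]
print(loose)  # [2, 3, 4]
```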

The arr.eval method does have a parallel keyword. You can try setting it to True and see whether it yields better performance in your particular situation.

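Other Set Operations

Symmetric difference: change the filter criterion to == 1 (and omit the arr.unique step).

Union: use arr.concat followed by arr.unique.

Set difference: compute the intersection (as above), then concatenate it with the original list/set and filter for items that appear only once. Alternatively, for small list sizes, you can concatenate "a" to itself and then to "b", and filter for elements that occur twice (but not three times).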
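
The same counting trick covers each of these operations. Here is a plain-Python sketch, assuming duplicate-free inputs; keep_with_count is a hypothetical helper for illustration, not a Polars function.

```python
from collections import Counter

def keep_with_count(combined, wanted):
    """Return the distinct elements of `combined` that occur exactly `wanted` times."""
    counts = Counter(combined)
    return list(dict.fromkeys(x for x in combined if counts[x] == wanted))

a, b = [1, 2, 3], [2, 3, 4]

# Symmetric difference: elements that occur exactly once in a + b.
sym_diff = keep_with_count(a + b, 1)

# Union: concatenate, then drop duplicates.
union = list(dict.fromkeys(a + b))

# Set difference a - b, first variant: append the intersection to "a"
# and keep what occurs only once.
intersection = keep_with_count(a + b, 2)
diff_via_intersection = keep_with_count(a + intersection, 1)

# Set difference a - b, second variant: in a + a + b, elements only in "a"
# occur twice, shared elements three times, elements only in "b" once.
diff_via_double = keep_with_count(a + a + b, 2)

print(sym_diff)               # [1, 4]
print(union)                  # [1, 2, 3, 4]
print(diff_via_intersection)  # [1]
print(diff_via_double)        # [1]
```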
