Is there a pandas aggregation function that combines the features of "any" and "unique"?

I have a large dataset with data like this:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(
...     {'A': ['one', 'two', 'two', 'one', 'one', 'three'],
...      'B': ['a', 'b', 'c', 'a', 'a', np.nan]})          
>>> df
       A    B
0    one    a
1    two    b
2    two    c
3    one    a
4    one    a
5  three  NaN

There are two aggregation functions, "any" and "unique":

>>> df.groupby('A')['B'].any()
A
one       True
three    False
two       True
Name: B, dtype: bool
>>> df.groupby('A')['B'].unique()
A
one         [a]
three     [nan]
two      [b, c]
Name: B, dtype: object

But I would like to get the following result (or something close to it):

A
one           a
three     False
two        True

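I can do this with some complicated code, but I would rather find a suitable function in a Python package, or the simplest way to solve the problem. I would be grateful for any help.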

You can aggregate column B twice: once with Series.nunique, and once to collect its unique values with any missing values removed:

df1 = df.groupby('A').agg(count=('B', 'nunique'),
                          uniq_without_NaNs=('B', lambda x: x.dropna().unique()))
print(df1)
       count uniq_without_NaNs
A                             
one        1               [a]
three      0                []
two        2            [b, c]

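Then build a boolean Series that is True where count is greater than 1, and where count equals 1 replace the value with the first element of uniq_without_NaNs: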

out = df1['count'].gt(1).mask(df1['count'].eq(1), df1['uniq_without_NaNs'].str[0])
print (out)
A
one          a
three    False
two       True
Name: count, dtype: object

>>> g = df.groupby("A")["B"].agg
>>> nun = g("nunique")
>>> pd.Series(np.select([nun > 1, nun == 1],
...                     [True, g("unique").str[0]],
...                     default=False),
...           index=nun.index)
A
one          a
three    False
two       True
dtype: object

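- get hold of the groupby aggregator
- calculate the number of unique values per group
- if > 1, i.e., more than one unique value, put True
- if == 1, i.e., exactly one unique value, put that value
- otherwise, i.e., no unique values (all NaN), put False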
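
The steps above can also be folded into a single custom aggregation function. This is only a minimal sketch and not a built-in pandas feature: the helper name `any_or_unique` is made up for illustration, and it assumes missing values should be ignored, as in the answers here.

```python
import numpy as np
import pandas as pd


def any_or_unique(values: pd.Series):
    """Collapse one group (hypothetical helper, not part of pandas)."""
    uniques = values.dropna().unique()
    if len(uniques) == 1:
        # Exactly one distinct non-missing value: return that value.
        return uniques[0]
    # Several distinct values -> True, no values at all -> False.
    return len(uniques) > 1


df = pd.DataFrame(
    {"A": ["one", "two", "two", "one", "one", "three"],
     "B": ["a", "b", "c", "a", "a", np.nan]})

result = df.groupby("A")["B"].agg(any_or_unique)
print(result)
```

A plain Python function like this runs once per group, so on a very large dataset the vectorized versions in the answers may be faster.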

You can combine groupby with agg and use a boolean mask to select the right output:

# Your code
agg = df.groupby('A')['B'].agg(['any', 'unique'])
# Boolean mask to choose between 'any' and 'unique' column
m = agg['unique'].str.len().eq(1) & agg['unique'].str[0].notna()
# Final output
out = agg['any'].mask(m, other=agg['unique'].str[0])

Output:

>>> out
A
one          a
three    False
two       True
>>> agg
         any  unique
A                   
one     True     [a]
three  False   [nan]
two     True  [b, c]
>>> m
A
one       True  # choose 'unique' column
three    False  # choose 'any' column
two      False  # choose 'any' column
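
To see in isolation how `Series.mask` picks between the two columns, here is a small sketch on hand-written Series. The names `flags`, `values` and `use_value` are illustrative stand-ins for `agg['any']`, `agg['unique'].str[0]` and `m`.

```python
import pandas as pd

idx = ["one", "three", "two"]
flags = pd.Series([True, False, True], index=idx)       # stands in for agg['any']
values = pd.Series(["a", None, "b"], index=idx)         # stands in for agg['unique'].str[0]
use_value = pd.Series([True, False, False], index=idx)  # stands in for m

# Where use_value is True, take the entry from `values`;
# elsewhere keep the original boolean from `flags`.
out = flags.mask(use_value, other=values)
print(out)
```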

new_df = df.groupby('A')['B'].apply(lambda x: x.notna().any())
new_df = new_df.reset_index()
new_df.columns = ['A', 'B']

This gives you:

       A      B
0    one   True
1  three  False
2    two   True

Now, if we want to find the values themselves, we can do:

df.groupby('A')['B'].apply(lambda x: x[x.notna()].unique()[0] if x.notna().any() else np.nan)

Which gives:

A
one        a
three    NaN
two        b
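
As a shorter way to get the first non-missing value per group, `GroupBy.first()` skips missing values by default. This sketch assumes the same `df` as in the question. Like the lambda above, it does not tell you whether a group holds more than one distinct value.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"A": ["one", "two", "two", "one", "one", "three"],
     "B": ["a", "b", "c", "a", "a", np.nan]})

# first() returns the first non-missing entry of each group;
# a group with only missing values yields a missing value.
first_values = df.groupby("A")["B"].first()
print(first_values)
```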

The expression

series = df.groupby('A')['B'].agg(lambda x: pd.Series(x.unique()))

gives the following result:

one           a
three       NaN
two      [b, c]

where the simple values can be identified by their type:

series[series.apply(type) == str]

I think this is easy to use routinely, but it may not be the optimal solution.
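
The type check can also be written with `isinstance`, which accepts subclasses of `str` as well. This sketch uses a hand-built object Series named `mixed` (an illustrative stand-in for the aggregated result shown above, not output produced by that expression).

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the mixed result: a string, a NaN and a list.
mixed = pd.Series(["a", np.nan, ["b", "c"]],
                  index=["one", "three", "two"], dtype=object)

# Keep only the entries that are plain strings (single unique values).
is_simple = mixed.map(lambda v: isinstance(v, str))
simple = mixed[is_simple]
print(simple)
```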
