How do I extract specific rows from a pandas DataFrame, excluding messages that repeat with only one category?



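Please help: I need the right Python code to satisfy the conditions shown in the image. Condition: if a "Message" has more than one "Category", save the entire row in a new DataFrame.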

However, if a message is repeated but has only one category, its rows should not be saved.

df[df.duplicated(['Message'], keep=False)]


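I have already tried duplicated as shown above, but it prints every repeated message, including those that are repeated with only one category.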

I need the right Python code to produce output in the same format as in the image.

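You can do this in two lines: first drop the duplicates where both Message and Category are identical, then select every row whose Message still appears more than once. After the first step, any remaining repeats of a Message must differ in Category: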

df = df.drop_duplicates(subset=['Message', 'Category'])
df[df.duplicated(subset='Message', keep=False)]
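As a minimal sketch of what those two lines do, here is a self-contained run on a small made-up frame. The rows and the names sample, deduped, and result are assumptions for illustration, not your actual data:

```python
import pandas as pd

# Hypothetical sample: "A" has two categories, "B" repeats with a single category
sample = pd.DataFrame({
    "Message": ["A", "B", "A", "B", "C"],
    "Category": ["x", "y", "z", "y", "w"],
})

# Step 1: collapse rows whose (Message, Category) pair is identical
deduped = sample.drop_duplicates(subset=["Message", "Category"])

# Step 2: a Message that still appears more than once must have differing categories
result = deduped[deduped.duplicated(subset="Message", keep=False)]
print(result)
```

Only the two "A" rows should survive here: the repeated ("B", "y") pair collapses to a single row in step 1, so "B" no longer counts as duplicated in step 2.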

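Edit: if the naive drop_duplicates approach removes rows you want to keep (it happens to work for your example, but could cause problems on larger data), you can apply the unique-Category condition explicitly instead: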

valid = df.groupby('Message').Category.transform('nunique') > 1
df[df.duplicated(subset='Message', keep=False) & valid]

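This also ends up as two lines, which is nice, but it is a little harder to read. We group the DataFrame by Message to determine which messages have more than one unique Category value. We use the transform method of groupby specifically so that the result has the same shape as the original data along axis zero, which lets us use it as a boolean mask with the condition > 1.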
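To make the shape point concrete, the sketch below compares a plain per-group nunique with the transform version. The frame and the names sample, per_group, per_row, and result are made up for illustration:

```python
import pandas as pd

# Hypothetical sample: "A" has categories x and z, "B" only has y
sample = pd.DataFrame({
    "Message": ["A", "B", "A", "B", "A"],
    "Category": ["x", "y", "z", "y", "x"],
})

# Plain aggregation: one value per unique Message (2 entries here)
per_group = sample.groupby("Message")["Category"].nunique()

# transform broadcasts each group's value back onto every original row (5 entries)
per_row = sample.groupby("Message")["Category"].transform("nunique")

# Because per_row is aligned with sample's index, it works directly as a row mask
valid = per_row > 1
result = sample[valid]
print(result)
```

Note that both ("A", "x") rows are kept in this sketch, since nothing is dropped before the mask is applied.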

I recreated your dataset as follows:

data = {
    "Message": [
        "I like this product",
        "I am going to buy",
        "Quality is good",
        "Nice product",
        "Its weight is high",
        "I like this product",
        "Its working",
        "love it",
        "Design is good",
        "Its working",
        "Design is good",
        "Design is good",
        "Design is good",
        "Design is good",
    ],
    "Category": [
        "Satisfaction",
        "Intent",
        "Product Quality",
        "Product Quality",
        "Weight",
        "Product Quality",
        "Performance",
        "Satisfaction",
        "Design",
        "Product Quality",
        "Design",
        "Design",
        "Design",
        "Design",
    ],
}

Then we can find all the unique values in the df["Message"] column and determine which of them have more than one unique value in the df["Category"] column:

import pandas as pd

df = pd.DataFrame(data)

# Collect every message that appears with more than one distinct category
messages = df["Message"].unique()
multi_category = []
for message in messages:
    n_categories = df.loc[df["Message"] == message, "Category"].nunique()
    if n_categories > 1:
        multi_category.append(message)

# Keep only the rows whose message is in that list
final = df[df["Message"].isin(multi_category)]
print(final)

Finally, filtering the DataFrame on the "Message" column gives the following output:

               Message         Category
0  I like this product     Satisfaction
5  I like this product  Product Quality
6          Its working      Performance
9          Its working  Product Quality
