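Please help: I need the correct Python code to meet the conditions shown in the image. Condition: if a "Message" has more than one "Category", save the entire row in a new DataFrame.
However, if a message is repeated with only one category, its rows should not be saved.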
df[df.duplicated(['Message'], keep=False)]
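I have tried this duplicated approach, but it returns every repeated message, including messages that repeat with only one category.
I need the correct Python code to get the same output format as in the image.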
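You can do this in two lines: first drop the duplicates where both Message and Category are identical, then find all remaining duplicates of Message, which by then must have distinct Category values: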
df = df.drop_duplicates(subset=['Message', 'Category'])
df[df.duplicated(subset='Message', keep=False)]
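To make the two steps concrete, here is a minimal sketch on a small made-up frame (the messages "a", "b", "c" and categories "X", "Y", "Z" are placeholders, not the data from the question). It only illustrates how the two calls interact.

```python
import pandas as pd

# Hypothetical sample data, not taken from the question's image.
df = pd.DataFrame({
    "Message": ["a", "a", "b", "b", "c"],
    "Category": ["X", "Y", "Z", "Z", "X"],
})

# Step 1: collapse rows where Message and Category both repeat.
deduped = df.drop_duplicates(subset=["Message", "Category"])

# Step 2: a Message that still occurs more than once must have
# at least two distinct categories, so keep all of its rows.
result = deduped[deduped.duplicated(subset="Message", keep=False)]
print(result)
```

Here "b" collapses to a single row in step 1 and so drops out in step 2, while both rows of "a" survive.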
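Edit: if the naive drop_duplicates approach removes valid data (it happens to work for your example, but could cause problems with larger data), you can apply the distinct-Category condition explicitly: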
valid = df.groupby('Message').Category.transform('nunique') > 1
df[df.duplicated(subset='Message', keep=False) & valid]
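This also ends up being two lines, which is nice, but it is a little harder to read. We group the DataFrame by Message to determine which messages have more than one unique Category value. We use the groupby transform method specifically so that the result has the same length as the original data along axis zero, which lets us use it as a boolean mask with the condition > 1.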
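A small sketch of where the two approaches differ, again on made-up data (the values are placeholders). Message "a" has a repeated "X" row plus a "Y" row, so the drop_duplicates route loses one of its rows while the transform mask keeps all three.

```python
import pandas as pd

# Hypothetical data: "a" repeats category X and also has a Y row.
df = pd.DataFrame({
    "Message": ["a", "a", "a", "b", "b"],
    "Category": ["X", "X", "Y", "Z", "Z"],
})

# Distinct categories per message, broadcast back onto every row.
n_categories = df.groupby("Message")["Category"].transform("nunique")
mask = n_categories > 1

# Mask approach: keeps every original row of a multi-category message.
kept = df[mask]

# Naive approach: drops the second ("a", "X") row before filtering.
naive = df.drop_duplicates(subset=["Message", "Category"])
naive = naive[naive.duplicated(subset="Message", keep=False)]

print(kept)
print(naive)
```

Since a message with more than one distinct category necessarily appears more than once, the mask on its own should already imply the duplicated condition. Combining the two, as in the answer, is harmless.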
I recreated your dataset as follows:
data = {
    "Message": [
        "I like this product",
        "I am going to buy",
        "Quality is good",
        "Nice product",
        "Its weight is high",
        "I like this product",
        "Its working",
        "love it",
        "Design is good",
        "Its working",
        "Design is good",
        "Design is good",
        "Design is good",
        "Design is good",
    ],
    "Category": [
        "Satisfaction",
        "Intent",
        "Product Quality",
        "Product Quality",
        "Weight",
        "Product Quality",
        "Performance",
        "Satisfaction",
        "Design",
        "Product Quality",
        "Design",
        "Design",
        "Design",
        "Design",
    ],
}
Then we can find all the unique values in the df["Message"] column and determine which of them have more than one unique value in the df["Category"] column:
import pandas as pd

df = pd.DataFrame(data)

# Collect every message that appears with more than one distinct category.
messages = df.Message.unique()
multi_category = []
for message in messages:
    subset = len(df[df["Message"] == message]["Category"].unique())
    if subset > 1:
        multi_category.append(message)

# Keep only the rows whose message is in that list.
final = df[df["Message"].isin(multi_category)]
print(final)
Finally, we filter the DataFrame on the Message column and get the following output:
               Message         Category
0  I like this product     Satisfaction
5  I like this product  Product Quality
6          Its working      Performance
9          Its working  Product Quality
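If the explicit loop feels heavy, the same selection can probably be written with groupby and filter. This is only a sketch of an alternative, not part of the answer above, and it uses a shortened, hypothetical subset of the dataset to stay self-contained.

```python
import pandas as pd

# Shortened, hypothetical subset of the recreated dataset.
df = pd.DataFrame({
    "Message": [
        "I like this product", "Quality is good", "I like this product",
        "Its working", "Its working", "Design is good", "Design is good",
    ],
    "Category": [
        "Satisfaction", "Product Quality", "Product Quality",
        "Performance", "Product Quality", "Design", "Design",
    ],
})

# Keep every row of each message group that has more than one
# distinct category; rows keep their original order and index.
final = df.groupby("Message").filter(lambda g: g["Category"].nunique() > 1)
print(final)
```

On large frames the transform("nunique") mask is likely faster than a Python-level lambda per group, so this version is mainly about readability.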