How can I remove duplicate sentences from a list of strings in Python using regex?



I have a DataFrame like this:

df
Index   Lines
0  /// User states this is causing a problem and but the problem can only be fixed by the user. /// User states this is causing a problem and but the problem can only be fixed by the user.
1  //- How to fix the problem is stated below. Below are the list of solutions to the problem. //- How to fix the problem is stated below. Below are the list of solutions to the problem.
2 \ User describes the problem in the problem report.

I want to remove duplicate sentences, but not duplicate words.

I tried the solution below, but it also removes duplicate words in the process.

from collections import OrderedDict

df['cleaned'] = (df['Lines'].str.split()
                 .apply(lambda x: OrderedDict.fromkeys(x).keys())
                 .str.join(' '))

The result is:

Index   cleaned
0  /// User states this is causing a problem and but the can only be fixed by user.
1  //- How to fix the problem is stated below. Below are list of solutions problem.
2  User describes the problem in report.

But the expected output is:

Index   cleaned
0  /// User states this is causing a problem and but the problem can only be fixed by the user.
1  //- How to fix the problem is stated below. Below are the list of solutions to the problem.
2 \ User describes the problem in the problem report.

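How can I make it remove the duplicate sentences without removing duplicate words? Is there a way to get this done?

Is there a way in regex to grab the first sentence ending with "."? The idea would be to check whether that first sentence appears again in the larger string and, if it does, remove everything from the point where it repeats to the end.

Please advise or suggest. Thanks!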

IIUC:

# Assumes the index is named 'Index', so reset_index() yields an 'Index' column.
out = (df['Lines'].str.findall(r'[^.]+').explode().str.strip()
       .reset_index().drop_duplicates()
       .groupby('Index')['Lines']
       .apply(lambda x: '. '.join(x)))
>>> out[0]
/// User states this is causing a problem and but the problem can only be fixed by the user
>>> out[1]
//- How to fix the problem is stated below. Below are the list of solutions to the problem
>>> print(out[2])
\ User describes the problem in the problem report
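
Note that the joined result above has no final period, because the pattern `[^.]+` excludes it. As a minimal sketch of the same idea in plain Python, the helper below (the name `dedupe_sentences` is hypothetical, not part of pandas) keeps each sentence's closing period and strips surrounding whitespace before comparing, so a repeat that starts with a space still counts as a duplicate. It assumes that sentences end with "." and that no period appears inside a sentence (for example in an abbreviation).

```python
import re


def dedupe_sentences(text):
    """Drop repeated sentences from one string, keeping first occurrences in order."""
    # Each match is a run of non-period characters plus its closing period, if any.
    sentences = (s.strip() for s in re.findall(r"[^.]+\.?", text))
    # dict.fromkeys preserves insertion order, so it acts as an ordered set.
    unique = dict.fromkeys(s for s in sentences if s)
    return " ".join(unique)


# Sample rows copied from the question.
rows = [
    "/// User states this is causing a problem and but the problem can only "
    "be fixed by the user. /// User states this is causing a problem and but "
    "the problem can only be fixed by the user.",
    "//- How to fix the problem is stated below. Below are the list of "
    "solutions to the problem. //- How to fix the problem is stated below. "
    "Below are the list of solutions to the problem.",
    "\\ User describes the problem in the problem report.",
]
cleaned = [dedupe_sentences(row) for row in rows]
```

On a DataFrame this could be applied with `df['cleaned'] = df['Lines'].apply(dedupe_sentences)`. Duplicates are only detected within a single string, so a sentence that also occurs in another row is left alone.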

Since your DataFrame just stores strings, let's do it manually:

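seen = set()  # sentences already kept (shared across all rows)
for idx, row in df["Lines"].items():
    keep = []
    for line in row.split(". "):
        # if you want to clean up: drop leading/trailing slashes, dashes, spaces
        line = line.strip().strip("\\/-").strip()
        if not line:
            continue
        # splitting removed the ". " separator, so restore the period
        if not line.endswith("."):
            line += "."
        if line not in seen:
            keep.append(line)
            seen.add(line)
    df.loc[idx, "Lines"] = " ".join(keep)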

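We iterate over the column row by row and split each row on ". " (which breaks it into sentences); then, if a sentence has not been seen yet, we store it in a list. Finally we set the row back to that list, joined into one string again.

Since splitting removes the token we split on, we add a "." to every sentence that does not end with one.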
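
The idea raised in the question (take the first sentence ending in ".", then drop everything from the point where it repeats) can also be sketched directly. This is only an illustration under stated assumptions: the function name `cut_at_repeat` is hypothetical, the first "." is taken as the end of the first sentence, and the text is assumed to be a single line.

```python
import re


def cut_at_repeat(text):
    """Truncate text where its first sentence starts repeating."""
    # Lazily match everything up to and including the first period.
    first = re.match(r"\s*(.+?\.)", text)
    if first is None:
        return text  # no period found, nothing to compare against
    sentence = first.group(1)
    # Look for a second copy of that sentence after the first one ends.
    repeat_at = text.find(sentence, first.end())
    if repeat_at == -1:
        return text  # the first sentence never repeats
    return text[:repeat_at].rstrip()


repeated = (
    "//- How to fix the problem is stated below. Below are the list of "
    "solutions to the problem. //- How to fix the problem is stated below. "
    "Below are the list of solutions to the problem."
)
single = "\\ User describes the problem in the problem report."

trimmed = cut_at_repeat(repeated)
untouched = cut_at_repeat(single)
```

Unlike the set-based loop, this only handles the case where the whole block of sentences repeats from its first sentence onward; a duplicate that shows up in the middle without the first sentence repeating is left in place.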
