Replace repeated words in a pandas DataFrame but keep the middle word, using regex

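I have imported some tables into a pd.DataFrame. One column of the DataFrame holds company names, and I would like to clean it up by removing the repeated words.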

For example:

"Benz-Benz" => "Benz"
"Tesla 123-Tesla 123" => "Tesla 123"
"Apple Store Inc-Apple Store In" => "Apple Store Inc"

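So far I have worked out how to handle the first two cases with a regex. However, I can't figure out how to handle the third case.

Here is my code for the third case: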

df_comp['comp_no_duplicate'] = df_comp['comp_name'].str \
    .replace(r'(^\b[A-Z]{1,}.*\b)(.*)-{1}\b\1\b', r'\1\2', regex=True)

With this code, the third case ("Apple Store Inc-Apple Store In") still does not come out as "Apple Store Inc".

How do I write a regular expression for this case?

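Hard-coding that many (potential) rules can get tedious. Maybe you could take a different approach. You don't want duplicate terms, so why not filter out the terms that occur more than once?

There are several ways to do this, depending on what you need. You can keep the first occurrence or the last one, and you can go for speed (which sacrifices the order of the terms) or insist on preserving the order. Here are some implementation suggestions: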

import re
import pandas
from typing import List

# Your data
df = pandas.DataFrame(
    [
        {"text": "Benz-Benz"},
        {"text": "Tesla 123-Tesla 123"},
        {"text": "Apple Store Inc-Apple Store In"},
    ]
)

def unordered_deduplicate(text: str) -> str:
    """Take a string and remove duplicate terms, without preserving
    the order of the terms.
    Args:
        text (str): The input text
    Returns:
        str: The cleaned output
    """
    # Split on whitespace or hyphens; a set drops repeats but loses the order
    return " ".join(set(re.split(r"\s|-", text)))

def ordered_deduplicate(text: str) -> str:
    """Take a string and remove duplicate terms, only keeping the
    first occurrence of a term.
    Args:
        text (str): The input string
    Returns:
        str: The cleaned output
    """
    # Split on whitespace or hyphens
    terms = re.split(r"\s|-", text)
    # Start a seen-counter at zero for every distinct term
    unique_terms_count = {term: 0 for term in set(terms)}
    # Loop the terms
    cleaned: List[str] = []
    for term in terms:
        # Only keep them in the cleaned list if they haven't been seen before
        if unique_terms_count[term] == 0:
            cleaned.append(term)
        unique_terms_count[term] += 1
    return " ".join(cleaned)

# Create clean text columns in different ways
df["unordered_text"] = df["text"].apply(unordered_deduplicate)
df["ordered_text"] = df["text"].apply(ordered_deduplicate)
print(df)
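
The "keep the last occurrence" option mentioned above is not shown in the code, and exact-match filtering leaves the truncated "In" behind in the third example. The sketch below is one possible way to cover both. The helper names (split_terms, keep_first, keep_last, drop_truncated) are made up for illustration, and the prefix rule in drop_truncated is an assumption about what counts as a truncated duplicate; it is not part of the original suggestion.

```python
import re
from typing import List


def split_terms(text: str) -> List[str]:
    """Split on whitespace or hyphens and drop empty strings."""
    return [term for term in re.split(r"\s|-", text) if term]


def keep_first(text: str) -> str:
    """Order-preserving deduplication that keeps the first occurrence."""
    # dicts preserve insertion order (Python 3.7+), so fromkeys drops repeats
    return " ".join(dict.fromkeys(split_terms(text)))


def keep_last(text: str) -> str:
    """Order-preserving deduplication that keeps the last occurrence."""
    terms = split_terms(text)
    # Deduplicate the reversed list, then restore the original direction
    return " ".join(reversed(list(dict.fromkeys(reversed(terms)))))


def drop_truncated(text: str) -> str:
    """Like keep_first, but also skip a term that is a prefix of a kept term.

    Assumption: a later term such as "In" is a cut-off copy of "Inc".
    """
    kept: List[str] = []
    for term in split_terms(text):
        # An exact repeat is also a prefix, so this check covers both cases
        if not any(seen.startswith(term) for seen in kept):
            kept.append(term)
    return " ".join(kept)


if __name__ == "__main__":
    for name in ["Benz-Benz", "Tesla 123-Tesla 123", "Apple Store Inc-Apple Store In"]:
        print(name, "|", keep_first(name), "|", keep_last(name), "|", drop_truncated(name))
```

Each helper can be passed to df["text"].apply(...) like the functions above. For "Apple Store Inc-Apple Store In", keep_first gives "Apple Store Inc In" while drop_truncated gives "Apple Store Inc". The prefix rule is deliberately crude: it would also drop a legitimate short term such as "A" if "Apple" came before it, so check it against your real company names first.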
