Remove punctuation from a pandas column while preserving the nested list-of-lists structure

I know how to do this for a single list in a cell, but I need to keep the structure of a list of several lists, e.g. [["I","need","to","remove","punctuations","."],[...],[...]] -> [["I","need","to","remove","punctuations"],[...],[...]]

Every method I know flattens it into this -> ["I","need","to","remove","punctuations",...]

data["clean_text"] = data["clean_text"].apply(lambda x: [', '.join([c for c in s if c not in string.punctuation]) for s in x])
data["clean_text"] = data["clean_text"].str.replace(r'[^\w\s]+', '')
...

What is the best way to do this?

Following your approach, I would just add a list comprehension and a helper function:

import string

def clean_up(lst):
    # Keep the outer/inner list nesting; drop tokens found in string.punctuation
    return [[w for w in sublist if w not in string.punctuation] for sublist in lst]

data["clean_text"] = [clean_up(x) for x in data["text"]]

Output:

print(data) # -- with two different columns so we can see the difference
                              text  
0  [[I, need, to, remove, punctuations, .], [This, is, another, list, with, commas, ,, and, periods, .]]   
               clean_text  
0  [[I, need, to, remove, punctuations], [This, is, another, list, with, commas, and, periods]]
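One thing to be aware of: `w not in string.punctuation` is a substring test, so it removes single punctuation characters but lets a token such as "..." through. The sketch below is a possible variant, assuming the tokens are plain strings; the names `is_punct_token` and `clean_up_tokens` are made up for illustration and are not part of the answer above. Unlike the original check, it keeps empty strings.

```python
import string

PUNCT = set(string.punctuation)

def is_punct_token(token):
    """True when the token is non-empty and consists only of ASCII punctuation."""
    return bool(token) and all(ch in PUNCT for ch in token)

def clean_up_tokens(nested):
    """Drop punctuation-only tokens from each sublist, keeping the nesting."""
    return [[tok for tok in sublist if not is_punct_token(tok)] for sublist in nested]

sample = [["I", "need", "to", "remove", "punctuations", "."],
          ["Wait", "...", "really", "?!"],
          ["."]]
cleaned = clean_up_tokens(sample)
print(cleaned)
# [['I', 'need', 'to', 'remove', 'punctuations'], ['Wait', 'really'], []]
```

It should drop into the same place as the helper above, for example `data["clean_text"] = data["text"].apply(clean_up_tokens)`. A sublist made up only of punctuation becomes an empty list, so the number of sublists per row does not change.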

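If your DataFrame is not too large, you can try exploding the lists into rows, filtering out the rows that contain punctuation, and finally grouping the rows back together.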

df_ = df[['clean_text']].copy()
out = (df_.assign(g1=range(len(df)))
          .explode('clean_text', ignore_index=True)
          .explode('clean_text')
          .loc[lambda d: ~d['clean_text'].isin([',', '.'])]  # remove possible punctuation
          .groupby(level=0).agg({'clean_text': list, 'g1': 'first'})
          .groupby('g1').agg({'clean_text': list}))

print(df_)
clean_text
0  [[I, need, to, remove, punctuations, .], [Play, games, .]]

print(out)
clean_text
g1
0   [[I, need, to, remove, punctuations], [Play, games]]
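The filter above only covers ',' and '.'. A minimal, self-contained sketch of the same explode/filter/group chain with the filter widened to every character in `string.punctuation` might look like this; the sample DataFrame is invented for illustration, and it assumes each punctuation token is a single ASCII character.

```python
import string
import pandas as pd

# Hypothetical sample data: each cell holds a list of token lists
df = pd.DataFrame({"clean_text": [
    [["I", "need", "to", "remove", "punctuations", "."], ["Play", "games", "!"]],
    [["Another", ",", "row", "?"]],
]})

punct = list(string.punctuation)  # assumption: single-character punctuation tokens

out = (df[["clean_text"]]
       .assign(g1=range(len(df)))                   # remember the original row
       .explode("clean_text", ignore_index=True)    # one row per sublist
       .explode("clean_text")                       # one row per token
       .loc[lambda d: ~d["clean_text"].isin(punct)] # drop punctuation tokens
       .groupby(level=0).agg({"clean_text": list, "g1": "first"})  # rebuild sublists
       .groupby("g1").agg({"clean_text": list}))    # rebuild rows

result = out["clean_text"].tolist()
print(result)
# [[['I', 'need', 'to', 'remove', 'punctuations'], ['Play', 'games']], [['Another', 'row']]]
```

Note that with this approach a sublist consisting only of punctuation disappears entirely instead of becoming an empty list, because all of its rows are filtered out before grouping.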
