I know how to do this for a single list in a cell, but I need to preserve the structure of a list of lists, e.g. [["I","need","to","remove","punctuations","."],[...],[...]]
-> [["I","need","to","remove","punctuations"],[...],[...]]
Every method I know flattens it into this -> ["I","need","to","remove","punctuations",...]
data["clean_text"] = data["clean_text"].apply(lambda x: [', '.join([c for c in s if c not in string.punctuation]) for s in x])
data["clean_text"] = data["clean_text"].str.replace(r'[^\w\s]+', '')
...
What is the best way to do this?
Following your approach, I would just add a list comprehension and a helper function:
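import string

def clean_up(lst):
    # Keep the outer/inner list structure; drop tokens that are punctuation characters.
    return [[w for w in sublist if w not in string.punctuation] for sublist in lst]

# Each cell of data["text"] is expected to hold a list of lists of tokens.
data["clean_text"] = [clean_up(x) for x in data["text"]]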
Output:
print(data) # -- with two different columns so we can see the difference
text
0 [[I, need, to, remove, punctuations, .], [This, is, another, list, with, commas, ,, and, periods, .]]
clean_text
0 [[I, need, to, remove, punctuations], [This, is, another, list, with, commas, and, periods]]
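One caveat with the helper above: `w not in string.punctuation` is a substring test against one long string, so a multi-character token such as "..." or "?!" is kept, and an empty string would be dropped. A minimal sketch of a stricter variant follows, assuming tokens are plain strings; the names `PUNCT` and `is_punct` are hypothetical and not part of the original answer.

```python
import string

# Set of single punctuation characters for exact membership checks.
PUNCT = set(string.punctuation)

def is_punct(token):
    """Return True if the token is non-empty and consists only of punctuation."""
    return bool(token) and all(ch in PUNCT for ch in token)

def clean_up(nested):
    # Rebuild each inner list without punctuation-only tokens,
    # so the list-of-lists structure is preserved.
    return [[w for w in sub if not is_punct(w)] for sub in nested]

sample = [["I", "need", "to", "remove", "punctuations", "."],
          ["Wait", "...", "really", "?!"]]
cleaned = clean_up(sample)
print(cleaned)
```

It can be applied to a column in the same way as before, for example `data["clean_text"] = data["text"].apply(clean_up)`.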
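If your DataFrame is not too large, you can try using explode to turn the lists into rows, then filter out the rows that contain punctuation, and finally group the rows back into lists.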
df_ = df[['clean_text']].copy()
out = (df_.assign(g1=range(len(df)))                  # g1 remembers the original row
          .explode('clean_text', ignore_index=True)   # one row per inner list
          .explode('clean_text')                      # one row per token
          .loc[lambda d: ~d['clean_text'].isin([',', '.'])]  # remove possible punctuation
          .groupby(level=0).agg({'clean_text': list, 'g1': 'first'})  # rebuild inner lists
          .groupby('g1').agg({'clean_text': list}))   # rebuild outer lists
print(df_)
clean_text
0 [[I, need, to, remove, punctuations, .], [Play, games, .]]
print(out)
clean_text
g1
0 [[I, need, to, remove, punctuations], [Play, games]]
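For anyone who wants to run the explode approach end to end, here is a self-contained sketch. The sample DataFrame is made up for illustration, and filtering on `list(string.punctuation)` instead of only `','` and `'.'` is my own substitution, not part of the answer above. Note that an inner list containing nothing but punctuation disappears entirely with this method.

```python
import string
import pandas as pd

# Hypothetical sample data: each cell holds a list of token lists.
df = pd.DataFrame({
    "clean_text": [
        [["I", "need", "to", "remove", "punctuations", "."], ["Play", "games", "."]],
        [["Hello", ",", "world", "!"]],
    ]
})

punct = list(string.punctuation)  # single-character punctuation tokens to drop

out = (
    df[["clean_text"]]
    .assign(g1=range(len(df)))                  # remember the original row
    .explode("clean_text", ignore_index=True)   # one row per inner list
    .explode("clean_text")                      # one row per token
    .loc[lambda d: ~d["clean_text"].isin(punct)]
    .groupby(level=0).agg({"clean_text": list, "g1": "first"})  # rebuild inner lists
    .groupby("g1").agg({"clean_text": list})    # rebuild outer lists
)

result = out["clean_text"].tolist()
print(result)
```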