根据字数列表优化数据帧



我有一个类型为list的单词列表,其中包含大量英语单词。

我还有一个数据帧,看起来像:-

FileName        PageNo     LineNo   GOODS_DESC  
1   17743633 - 1    TM000002    69      Abuj Cen Le 
31  17743633 - 1    TM000007    126     Mr USD  
33  17743633 - 1    TM000008    22      TABLEAU EMBALLAGE
34  17743633 - 1    TM000008    24      LISA e EMBALV
46  17743633 - 1    TM000008    143     Cen 
47  17743633 - 1    TM000008    146     A Gl
50  17743633 - 1    TM000009    121     Ppvv Tn Ppvv In 
51  17743633 - 1    TM000009    129     SPECIFY
52  17743633 - 1    TM000009    136     Decrp G 
58  17743633 - 1    TM000009    97      Je ugn  
60  17743633 - 1    TM000009    108     De Veel 
61  17743633 - 1    TM000014    44      TYRE CHIPS SHREDDED TYRES   
63  17743633 - 1    TM000014    48      TYRE CHIPS SHREDDED TYRES

我只想保留单词列表中存在的"GOODS_DESC"列中的那些单词。

我想要的输出是:-

FileName        PageNo     LineNo   GOODS_DESC  
1   17743633 - 1    TM000002    69      NaN
31  17743633 - 1    TM000007    126     Mr USD  
33  17743633 - 1    TM000008    22      TABLEAU
34  17743633 - 1    TM000008    24      LISA  
46  17743633 - 1    TM000008    143     NaN 
47  17743633 - 1    TM000008    146     NaN
50  17743633 - 1    TM000009    121     NaN 
51  17743633 - 1    TM000009    129     SPECIFY
52  17743633 - 1    TM000009    136     NaN
58  17743633 - 1    TM000009    97      NaN 
60  17743633 - 1    TM000009    108     NaN
61  17743633 - 1    TM000014    44      TYRE CHIPS SHREDDED TYRES   
63  17743633 - 1    TM000014    48      TYRE CHIPS SHREDDED TYRES

我的方法也是提供输出,但我使用的是列表,而且速度很慢。我想让它快点。

for rows in df.itertuples():
a = []
flat_list = []
a.append(rows.GOODS_DESC)
flat_list = [item.strip() for sublist in a for item in sublist.split(' ') if item.strip()]
flat_list = list(sorted(set(flat_list), key=flat_list.index))
flat_list = [i for i in flat_list if i.lower() in word_list]
if(not flat_list):
df.drop(rows.Index,inplace=True)
continue
s=' '.join(flat_list)
df.loc[rows.Index,'GOODS_DESC']=s
df['GOODS_DESC'] = df['GOODS_DESC'].str.upper()

你的逻辑似乎过于复杂。您可以将单个列表理解与pd.Series.apply一起使用。我建议,如下所示,您使用set进行 O(1( 查找和str.casefold来匹配字符串,而不管大小写如何。

s = pd.Series(['Abuj Cen Le', 'Mr USD', 'TABLEAU EMBALLAGE', 'LISA e EMBALV'])
word_set = {i.casefold() for i in ['Mr', 'USD', 'TABLEAU', 'LISA']}
def apply_filter(x):
out = ' '.join([i for i in x.split() if i.casefold() in word_set])
return out if out else np.nan
res = s.apply(apply_filter)
print(res)
0        NaN
1     Mr USD
2    TABLEAU
3       LISA
dtype: object

您可以使用mergejoin来执行所需的操作。首先做一些准备:

#input file, you can have any number of columns in it
df_input = pd.DataFrame({'col1':range(1,5),
'GOODS_DESC':['Abuj Cen Le', 'Mr USD', 
'TABLEAU EMBALLAGE', 'LISA e EMBALV']})
# DF from the list of words
df_word = pd.DataFrame({'Word':[word.lower() for word in word_list]})
# create a df_stack with each word of your sentences as a row keeping indexes for join later
df_stack = (df_input['GOODS_DESC'].str.split(' ',expand=True).stack().reset_index())
# Column with same name as df_word and lowercase for merge after
df_stack['Word'] = df_stack[0].str.lower()

现在您可以使用merge然后join

df_join = df_stack.merge(df_word).groupby('level_0')[0].apply(lambda x: ' '.join(x))
df_output = df_input.join(df_join)
# get ride
df_output = df_output.drop('GOODS_DESC',1).rename(columns={0:'GOODS_DESC'})

它看起来像很多线条,但mergejoin是高效的,所以我希望它会足够快。

相关内容

  • 没有找到相关文章

最新更新