Speeding up millions of regex replacements in a DataFrame



I have:

  • A list of roughly 40k locations, each two or three words long (bigrams/trigrams):
    ['San Francisco CA', 'Oakland CA', 'San Diego CA',...]

  • A pandas DataFrame with millions of rows.

string_column              string_column_location_removed
Burger King Oakland CA     Burger King
Walmart Walnut Creek CA    Walmart

One approach is to use trrex, which builds a pattern equivalent to the one found in this resource (in fact, it was inspired by this answer):

from random import choice
from string import ascii_lowercase, digits

import pandas as pd
import trrex as tx

# Build a random lookup list of 40,000 ten-character strings
chars = ascii_lowercase + digits
locations_lookup_list = [''.join(choice(chars) for _ in range(10)) for _ in range(40000)]
# Add the two locations that actually occur in the sample data
locations_lookup_list.append('Walnut Creek CA')
locations_lookup_list.append('Oakland CA')

# Six sample strings repeated 250,000 times: 1.5 million rows
strings_for_df = ["Burger King Oakland CA", "Walmart Walnut Creek CA",
                  "Random Other Thing Here", "Another random other thing here",
                  "Really Appreciate the help on this", "Thank you so Much!"] * 250000
df = pd.DataFrame(strings_for_df, columns=["string_column"])

# Build a single pattern that matches any entry in the lookup list
pattern = tx.make(locations_lookup_list, suffix="", prefix="")
# Remove every match from each row
df["string_column_location_removed"] = df["string_column"].str.replace(pattern, "", regex=True)
print(df)

Output

                              string_column      string_column_location_removed
0                    Burger King Oakland CA                        Burger King 
1                   Walmart Walnut Creek CA                            Walmart 
2                   Random Other Thing Here             Random Other Thing Here
3           Another random other thing here     Another random other thing here
4        Really Appreciate the help on this  Really Appreciate the help on this
...                                     ...                                 ...
1499995             Walmart Walnut Creek CA                            Walmart 
1499996             Random Other Thing Here             Random Other Thing Here
1499997     Another random other thing here     Another random other thing here
1499998  Really Appreciate the help on this  Really Appreciate the help on this
1499999                  Thank you so Much!                  Thank you so Much!
[1500000 rows x 2 columns]
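
To make the idea of a single "equivalent pattern" more concrete, here is a minimal sketch of how a word list can be folded into a trie and then into one regex with shared prefixes factored out. It illustrates the general technique only and is not trrex's actual implementation; the helper names build_trie and trie_to_regex are hypothetical, and it uses only the standard library.

```python
import re


def build_trie(words):
    """Insert each word into a nested-dict trie; the key '' marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True
    return root


def trie_to_regex(node):
    """Recursively turn a trie node into a regex fragment (hypothetical helper)."""
    is_end = "" in node
    branches = [
        re.escape(ch) + trie_to_regex(child)
        for ch, child in sorted(node.items())
        if ch != ""
    ]
    if not branches:
        return ""
    if len(branches) == 1 and not is_end:
        return branches[0]
    body = "(?:" + "|".join(branches) + ")"
    # If a word ends here, whatever follows is optional.
    return body + "?" if is_end else body


words = ["San Francisco CA", "San Diego CA", "Oakland CA", "Walnut Creek CA"]
pattern = re.compile(trie_to_regex(build_trie(words)))

print(repr(pattern.sub("", "Burger King Oakland CA")))
print(repr(pattern.sub("", "Random Other Thing Here")))
```

Because every branch at a given node starts with a different character, the regex engine can rule out most alternatives after looking at a single character, rather than trying each of the 40k strings in turn as it would with a flat `a|b|c|...` alternation.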

Timing (of the str.replace call)

%timeit df["string_column"].str.replace(pattern, "", regex=True)
8.84 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The timing does not include the time needed to build the pattern.
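
Since the figure above excludes pattern construction, it can be worth measuring that one-off cost separately. The sketch below shows one way to do so with time.perf_counter. It is an assumption-laden illustration: build_pattern is a hypothetical stand-in that uses a plain alternation from the standard re module instead of tx.make, and the lookup list is much smaller than 40k so that it runs quickly.

```python
import re
import time
from random import choice, seed
from string import ascii_lowercase, digits

seed(0)  # make the random lookup list reproducible
chars = ascii_lowercase + digits
# Deliberately smaller than the 40k list used in the answer.
lookup = ["".join(choice(chars) for _ in range(10)) for _ in range(2000)]
lookup.append("Oakland CA")


def build_pattern(words):
    """Hypothetical stand-in for tx.make: longest-first alternation of escaped words."""
    ordered = sorted(words, key=len, reverse=True)
    return re.compile("|".join(re.escape(w) for w in ordered))


# Time the one-off pattern construction.
start = time.perf_counter()
pattern = build_pattern(lookup)
build_seconds = time.perf_counter() - start

# Time the replacements separately, reusing the compiled pattern.
samples = ["Burger King Oakland CA", "Thank you so Much!"] * 100
start = time.perf_counter()
cleaned = [pattern.sub("", s) for s in samples]
replace_seconds = time.perf_counter() - start

print(f"build: {build_seconds:.4f} s, replace: {replace_seconds:.4f} s")
```

The pattern is built once and reused, so its construction cost is paid a single time no matter how many rows are processed.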

Disclaimer: I'm the author of trrex.
