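I have:
- A list of about 40k locations, each a two- or three-word phrase:
['San Francisco CA', 'Oakland CA', 'San Diego CA',...]
- A Pandas DataFrame with millions of rows.
string_column | string_column_location_removed |
---|---|
Burger King Oakland CA | Burger King |
Walmart Walnut Creek CA | Walmart |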
You can use trrex, which builds a pattern equivalent to the one described in this resource (it was in fact inspired by this answer):
from random import choice
from string import ascii_lowercase, digits
import pandas as pd
import trrex as tx
# Build a random lookup list of 40,000 ten-character strings as stand-ins for real locations
chars = ascii_lowercase + digits
locations_lookup_list = [''.join(choice(chars) for _ in range(10)) for _ in range(40000)]
locations_lookup_list.append('Walnut Creek CA')
locations_lookup_list.append('Oakland CA')
strings_for_df = ["Burger King Oakland CA", "Walmart Walnut Creek CA",
"Random Other Thing Here", "Another random other thing here", "Really Appreciate the help on this",
"Thank you so Much!"] * 250000
df = pd.DataFrame(strings_for_df, columns=["string_column"])
pattern = tx.make(locations_lookup_list, suffix="", prefix="")
df["string_column_location_removed"] = df["string_column"].str.replace(pattern, "", regex=True)
print(df)
Output
string_column string_column_location_removed
0 Burger King Oakland CA Burger King
1 Walmart Walnut Creek CA Walmart
2 Random Other Thing Here Random Other Thing Here
3 Another random other thing here Another random other thing here
4 Really Appreciate the help on this Really Appreciate the help on this
... ... ...
1499995 Walmart Walnut Creek CA Walmart
1499996 Random Other Thing Here Random Other Thing Here
1499997 Another random other thing here Another random other thing here
1499998 Really Appreciate the help on this Really Appreciate the help on this
1499999 Thank you so Much! Thank you so Much!
[1500000 rows x 2 columns]
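To give a rough idea of why a trie-based pattern helps, here is a minimal sketch of building one by hand with only the standard library. It is not trrex's actual implementation; `build_trie`, `trie_to_regex`, and `make_pattern` are hypothetical helper names introduced for illustration. The idea is that phrases sharing a prefix share one branch of the pattern, so the regex engine does not have to try 40k separate alternatives at every position.

```python
import re


def build_trie(words):
    """Insert each word into a nested-dict trie; the "" key marks the end of a word."""
    trie = {}
    for word in words:
        node = trie
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True
    return trie


def trie_to_regex(node):
    """Recursively turn a trie node into a regex fragment."""
    branches = [
        re.escape(ch) + trie_to_regex(child)
        for ch, child in sorted(node.items())
        if ch != ""
    ]
    if not branches:
        # Only the end-of-word marker is left: nothing more to match.
        return ""
    body = "(?:" + "|".join(branches) + ")"
    # If a word can also end at this node, the continuation is optional.
    return body + "?" if "" in node else body


def make_pattern(words):
    """Hypothetical helper: compile a trie-based pattern for the given words."""
    return re.compile(trie_to_regex(build_trie(words)))


pattern = make_pattern(["Oakland CA", "Walnut Creek CA", "San Diego CA"])
cleaned = pattern.sub("", "Burger King Oakland CA")
unchanged = pattern.sub("", "Random Other Thing Here")
print(repr(cleaned), repr(unchanged))
```

Note that the space before the removed location stays in the result, so a follow-up strip of trailing whitespace may be wanted.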
Timing (of str.replace)
%timeit df["string_column"].str.replace(pattern, "", regex=True)
8.84 s ± 180 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The timing does not include the time needed to build the pattern.
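If the build cost matters for your use case, the two phases can be measured separately. The following is a small sketch using only the standard library; it uses a plain alternation as a stand-in for the real pattern and a short made-up list of strings, so the absolute numbers will not be comparable to the figure above.

```python
import re
from time import perf_counter

# Tiny stand-in data; the real lookup list has about 40k entries.
words = ["Oakland CA", "Walnut Creek CA", "San Diego CA"]
texts = ["Burger King Oakland CA", "Thank you so Much!"] * 1000

# Phase 1: build (compile) the pattern once.
start = perf_counter()
pattern = re.compile("|".join(re.escape(w) for w in words))
build_seconds = perf_counter() - start

# Phase 2: apply the already-built pattern to every string.
start = perf_counter()
cleaned = [pattern.sub("", t) for t in texts]
replace_seconds = perf_counter() - start

print(f"build: {build_seconds:.6f} s, replace: {replace_seconds:.6f} s")
```

Since the pattern is built once and reused for every row, the build time is a one-off cost.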
Disclaimer: I'm the author of trrex.