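The goal is to count how often given bigrams occur in a string.
In other words, how can I count the occurrences of substrings within a larger string?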
# Sample data with text
import pandas as pd

hi = {1: "My name is Lance John",
      2: "Am working at Savings Limited in Germany",
      3: "Have invested in mutual funds",
      4: "Savings Limited accepts mutual funds as investment option",
      5: "Savings Limited also accepts other investment option"}
hi = pd.DataFrame(list(hi.items()), columns=['id', 'notes'])
# have two categories with pre-defined words
name = ['Lance John', 'Germany']
finance = ['Savings Limited', 'investment option', 'mutual funds']
# want count of bigrams in each category for each record
# the output should look like this
ID name finance
1 1 0
2 1 1
3 0 1
4 0 3
5 0 2
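It can be done with regular expressions. We often think of regular expressions as "magic", because they can do all the work in a single function call.
I don't know whether finding different words in different groups this way performs better than a set of separate searches. It will certainly be faster than a manual search written in pure Python, though, because the matching runs inside the regex engine's optimized code in a tight loop.
So, if you had just one group, all you would need is your patterns separated by the "or" (|) regexp operator, which matches any one of the words. You can use the "finditer" regexp method together with the collections.Counter data structure to tally the occurrences of each word: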
In [56]: test = "parrot parrot bicycle parrot inquisition bicycle parrot"
In [57]: expression = re.compile("parrot|bicycle|inquisition")
In [58]: Counter(match.group() for match in expression.finditer(test))
Out[58]: Counter({'parrot': 4, 'bicycle': 2, 'inquisition': 1})
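Now extend the concept: put the related terms into named groups in the regular expression. A named group is a subpattern enclosed in parentheses, prefixed inside the parentheses with ?P<groupname>, where the group name goes between the < >. The body of each group is the sequence of words joined as above, and each group name is the name of your collection, so: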
expression = re.compile(r'(?P<finance>Savings Limited|investment option|mutual funds)|(?P<name>Lance John|Germany)')
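For the example you gave, this produces matches in the groups named finance and name. To tally them with Counter, read the name of the group that matched from each match object. The groupdict method is not enough on its own, because it returns every named group (with None for the groups that did not match), so use the match object's lastgroup attribute instead. With hi still being the original dict of notes:
In [65]: Counter(m.lastgroup for m in expression.finditer(hi[1]))
Out[65]: Counter({'name': 1})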
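To make the named-group tally concrete, here is a minimal sketch using the terms from the question. The helper name count_groups is made up for illustration, and reading the group name from lastgroup is one way to do it, relying on the fact that exactly one top-level named group takes part in each match.

```python
import re
from collections import Counter

# One named group per category; terms are the ones from the question.
expression = re.compile(
    r"(?P<finance>Savings Limited|investment option|mutual funds)"
    r"|(?P<name>Lance John|Germany)"
)

def count_groups(text):
    """Tally matches per named group (hypothetical helper for illustration)."""
    return Counter(m.lastgroup for m in expression.finditer(text))

text = "Am working at Savings Limited in Germany"
for m in expression.finditer(text):
    # groupdict() lists every named group; groups that did not match are None,
    # while lastgroup is the name of the group that actually matched.
    print(m.group(), m.groupdict(), m.lastgroup)

counts = count_groups(text)
print(counts)
```

A Counter returns 0 for keys it has never seen, so a category with no matches can be looked up without any special handling.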
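All that is left is to build the expression programmatically instead of hardcoding it. That can be done with two nested "join" operations: the outer one combines the groups, and the inner one joins the terms within each group.
It would be more elegant to keep your terms in a dictionary rather than giving each list its own isolated variable, so you would have: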
domains = {'finance': [...], 'names': [...]}
and the expression above could then be built with:
groups = []
for groupname, terms in domains.items():
    term_group = "|".join(re.escape(term) for term in terms)
    groups.append(r"(?P<{}>{})".format(groupname, term_group))
expression = re.compile("|".join(groups))
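The re.escape call in the loop matters as soon as a term contains a regex metacharacter. The sketch below shows the difference; the terms "Savings Ltd." and "A+ bonds" are invented for illustration and do not come from the question.

```python
import re

# Hypothetical terms containing regex metacharacters ('.' and '+').
terms = ["Savings Ltd.", "A+ bonds"]

# Without escaping, '.' matches any character and '+' acts as a quantifier.
unescaped = re.compile("|".join(terms))
# With re.escape, every term is matched literally.
escaped = re.compile("|".join(re.escape(term) for term in terms))

text = "Savings Ltdx is not Savings Ltd."
unescaped_hits = unescaped.findall(text)
escaped_hits = escaped.findall(text)
print(unescaped_hits)
print(escaped_hits)
```

The unescaped pattern also reports "Savings Ltdx", which is not one of the terms, while the escaped pattern reports only the literal "Savings Ltd.".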
Then it is just a matter of tallying:
# hi is the original dict of notes here, before the DataFrame conversion
data = []
for key, textline in hi.items():
    data.append((key, Counter(m.lastgroup for m in expression.finditer(textline))))
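Putting the pieces together, the following is a rough end-to-end sketch using only the standard library. The names notes and rows are chosen here for illustration, the category lists are stored in a domains dict as suggested above, and the group name is read from the match's lastgroup attribute.

```python
import re
from collections import Counter

# Notes and terms from the question; `notes` is the plain dict, not a DataFrame.
notes = {
    1: "My name is Lance John",
    2: "Am working at Savings Limited in Germany",
    3: "Have invested in mutual funds",
    4: "Savings Limited accepts mutual funds as investment option",
    5: "Savings Limited also accepts other investment option",
}
domains = {
    "name": ["Lance John", "Germany"],
    "finance": ["Savings Limited", "investment option", "mutual funds"],
}

# Outer join combines the named groups, inner join combines each group's terms.
expression = re.compile("|".join(
    "(?P<{}>{})".format(group, "|".join(re.escape(term) for term in terms))
    for group, terms in domains.items()
))

rows = []
for key, text in notes.items():
    counts = Counter(m.lastgroup for m in expression.finditer(text))
    # Counter yields 0 for categories that never matched in this note.
    rows.append({"id": key, **{group: counts[group] for group in domains}})

for row in rows:
    print(row)
```

Since rows is a list of dicts, it can be handed to pd.DataFrame(rows) if a DataFrame with one column per category is wanted.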
(As a side note, consider how unreadable it gets when you try to build the regexp with nested generator expressions):
expression = re.compile("|".join("(?P<{0}>{1})".format(
        groupname,
        "|".join(
            "{}".format(
                re.escape(term)) for term in domains[groupname]
        )
    ) for groupname in domains.keys() )
)
import numpy as np
import pandas as pd

hi = {1: "My name is Lance John. Lance John is senior marketing analyst",
      2: "Am working at Savings Limited in Germany",
      3: "Have invested in mutual funds",
      4: "Savings Limited accepts mutual funds as investment option",
      5: "Savings Limited also accepts other investment option"}
hi = pd.DataFrame(list(hi.items()), columns=['id', 'notes'])
name = ['Lance John', 'Germany', 'senior', 'working']
finance = ['Savings Limited', 'investment option', 'mutual funds']

def f(cell_value):
    # Count how often each term of the global `search` list occurs in the cell
    return [cell_value.count(s) for s in search]

search = name
df = hi['notes'].apply(f)
search = finance
df1 = hi['notes'].apply(f)
# count_nonzero gives the number of distinct terms found, not the total number of occurrences
df2 = pd.DataFrame({'name': df.apply(np.count_nonzero), 'finance': df1.apply(np.count_nonzero), 'text': hi['notes']})
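I was able to solve it with the help of the linked question on counting the occurrences of multiple substrings in a pandas cell.
I only modified the code there to use count_nonzero instead of a direct sum.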
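The choice between count_nonzero and a plain sum changes what is being counted. The sketch below illustrates the difference with the standard library only; per_term_counts and summary are names invented here, and the "distinct" figure is computed by hand as a stand-in for what np.count_nonzero would return on the same list.

```python
notes = {
    1: "My name is Lance John. Lance John is senior marketing analyst",
    2: "Am working at Savings Limited in Germany",
}
name = ['Lance John', 'Germany', 'senior', 'working']

def per_term_counts(text, terms):
    """Non-overlapping occurrences of each term in the text (illustrative helper)."""
    return [text.count(term) for term in terms]

summary = {}
for key, text in notes.items():
    counts = per_term_counts(text, name)
    total = sum(counts)                        # every occurrence counts
    distinct = sum(1 for c in counts if c)     # each term counts at most once
    summary[key] = (total, distinct)

print(summary)
```

For note 1, "Lance John" appears twice, so the total is 3 while the number of distinct terms found is 2. Which figure is wanted depends on whether repeated mentions should count.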