Search for and count bigrams in a string (counting substring occurrences in a string)



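The goal is to get the number of times each bigram occurs in a string.
In other words, how do I count the occurrences of a substring within a larger string?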

# Sample data with text
hi = {1: "My name is Lance John", 
  2: "Am working at Savings Limited in Germany",
  3: "Have invested in mutual funds",
  4: "Savings Limited accepts mutual funds as investment option",
  5: "Savings Limited also accepts other investment option"}
hi = pd.DataFrame(list(hi.items()), columns = ['id', 'notes'])
# have two categories with pre-defined words
name = ['Lance John', 'Germany']
finance = ['Savings Limited', 'investment option', 'mutual funds']
# want count of bigrams in each category for each record
# the output should look like this
ID name finance  
1    1    0  
2    1    2
3    0    1
4    0    3
5    0    2

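This can be done with regular expressions. We often think of regular expressions as "magic" because they can do everything in a single function call.

I don't know whether finding different words in different groups this way performs better than searching for each term separately. It will almost certainly be more efficient than a hand-written search loop in pure Python, though, because the search itself runs in highly optimized code in a tight loop.

So, if you had just a single group, all you would need is a pattern with your terms separated by the regexp "or" operator (|), which matches each of the words. You can use the regexp "finditer" method together with the collections.Counter data structure to tally the occurrences of each word: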

In [56]: test = "parrot parrot bicycle parrot inquisition bicycle parrot"
In [57]: expression = re.compile("parrot|bicycle|inquisition")
In [58]: Counter(match.group() for match in expression.finditer(test))
Out[58]: Counter({'parrot': 4, 'bicycle': 2, 'inquisition': 1})
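
To check the performance remark above on your own data, a rough timing sketch along these lines may help. The function names `count_with_regex` and `count_with_str_count` are made up for this illustration, and the input is simply the test string repeated; which approach comes out ahead will depend on the text and the number of terms.

```python
import re
import timeit
from collections import Counter

terms = ["parrot", "bicycle", "inquisition"]
test = "parrot parrot bicycle parrot inquisition bicycle parrot " * 1000

# re.escape guards against terms that contain regexp metacharacters.
expression = re.compile("|".join(re.escape(t) for t in terms))

def count_with_regex(text):
    # One pass over the text; every match is tallied by its matched string.
    return Counter(m.group() for m in expression.finditer(text))

def count_with_str_count(text):
    # One pass over the text per term, using the built-in str.count.
    return Counter({t: text.count(t) for t in terms})

# Both approaches should agree for non-overlapping terms like these.
assert count_with_regex(test) == count_with_str_count(test)

for fn in (count_with_regex, count_with_str_count):
    seconds = timeit.timeit(lambda: fn(test), number=20)
    print(f"{fn.__name__}: {seconds:.4f}s for 20 runs")
```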

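Now extend the concept: put the related terms inside regular expression named groups (a subpattern enclosed in parentheses, prefixed inside the parentheses with ?P<groupname>, with the group name written between the < >). The body of each group is the alternation of your words, as above, and each group name is the name of your collection. Thus:

 expression = re.compile(r'(?P<finance>Savings Limited|investment option|mutual funds)|(?P<name>Lance John|Germany)')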

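For the example you gave, this produces matches in the groups named finance or name. To tally them with a Counter, we need the name of the group that actually matched. The match object's groupdict method returns every named group (with None for the groups that did not take part in the match), so taking its first key is not reliable; the lastgroup attribute gives the name of the matching group directly:

In [65]: Counter(m.lastgroup for m in expression.finditer("My name is Lance John"))
Out[65]: Counter({'name': 1})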
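
As a small illustration of how a match reports which named group it came from, the sketch below prints both `groupdict()` and `lastgroup` for each match in one of the sample notes. It only uses the standard `re` and `collections` modules; the variable names `text` and `counts` are chosen for this example.

```python
import re
from collections import Counter

expression = re.compile(
    r"(?P<finance>Savings Limited|investment option|mutual funds)"
    r"|(?P<name>Lance John|Germany)"
)

text = "Am working at Savings Limited in Germany"

for m in expression.finditer(text):
    # groupdict() always contains both group names (None for the one that
    # did not participate); lastgroup names the group that matched.
    print(m.group(), m.groupdict(), m.lastgroup)

# Tally matches per category by the name of the matching group.
counts = Counter(m.lastgroup for m in expression.finditer(text))
print(counts)
```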

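All that remains is to build your expression programmatically instead of hard-coding it. This can be done with two nested "join" operations: the outer one joins the groups together, and the inner one joins the terms within each group.

It would be more elegant to keep your terms in a dictionary rather than naming each list as a separate variable, so you would have: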

 domains = {'finance': [...], 'names': [...]} 

and the expression above can be built with:

groups = []
for groupname, terms in domains.items():
    # escape each term so regexp metacharacters in it are matched literally
    term_group = "|".join(re.escape(term) for term in terms)
    groups.append(r"(?P<{}>{})".format(groupname, term_group)  ) 
expression = re.compile("|".join(groups))

Then, just do the tallying:

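data = []
# `hi` here is the original dict mapping id -> text, before it is turned into a DataFrame
for key, textline in hi.items():
    # m.lastgroup is the name of the group (category) that produced the match
    data.append((key, Counter(m.lastgroup for m in expression.finditer(textline))))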
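
Putting the pieces together, here is a minimal end-to-end sketch that goes from the sample notes to one row of counts per record. The helper names `build_expression` and `count_by_domain`, and the variable `notes`, are invented for this example, and it uses `lastgroup` to identify the matching category.

```python
import re
from collections import Counter

notes = {
    1: "My name is Lance John",
    2: "Am working at Savings Limited in Germany",
    3: "Have invested in mutual funds",
    4: "Savings Limited accepts mutual funds as investment option",
    5: "Savings Limited also accepts other investment option",
}
domains = {
    "name": ["Lance John", "Germany"],
    "finance": ["Savings Limited", "investment option", "mutual funds"],
}

def build_expression(domains):
    # One named group per category, each an alternation of its escaped terms.
    groups = []
    for groupname, terms in domains.items():
        term_group = "|".join(re.escape(term) for term in terms)
        groups.append("(?P<{}>{})".format(groupname, term_group))
    return re.compile("|".join(groups))

def count_by_domain(notes, domains):
    expression = build_expression(domains)
    rows = []
    for key, textline in notes.items():
        counts = Counter(m.lastgroup for m in expression.finditer(textline))
        row = {"id": key}
        # A Counter returns 0 for categories that never matched.
        row.update({groupname: counts[groupname] for groupname in domains})
        rows.append(row)
    return rows

rows = count_by_domain(notes, domains)
for row in rows:
    print(row)
```

If pandas is available, `pd.DataFrame(rows)` turns this list of dicts into a table with the columns id, name and finance.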

(As a side note, consider how unreadable it becomes if you try to build the regexp with nested generator expressions instead):

 expression = re.compile("|".join("(?P<{0}>{1})".format(
      groupname,
      "|".join(
          "{}".format(
                  re.escape(term)) for term in domains[groupname]
           )
       ) for groupname in domains.keys() )
 )

import numpy as np
import pandas as pd

hi = {1: "My name is Lance John. Lance John is senior marketing analyst", 
      2: "Am working at Savings Limited in Germany",
      3: "Have invested in mutual funds",
      4: "Savings Limited accepts mutual funds as investment option",
      5: "Savings Limited also accepts other investment option"}
hi = pd.DataFrame(list(hi.items()), columns = ['id', 'notes'])
name = ['Lance John', 'Germany', 'senior', 'working']
finance = ['Savings Limited', 'investment option', 'mutual funds']

def f(cell_value):
    # one count per term in the current global `search` list
    return [cell_value.count(s) for s in search]

search = name
df = hi['notes'].apply(f)

search = finance
df1 = hi['notes'].apply(f)
# count_nonzero: number of distinct terms that occur at least once in each note
df2 = pd.DataFrame({'name': df.apply(np.count_nonzero), 'finance': df1.apply(np.count_nonzero), 'text': hi['notes']})

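I was able to solve it using this link: "Count appearance of multiple substrings in a cell pandas".
I only modified the code to use count_nonzero instead of a direct sum, so that it counts unique appearances.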
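
To make the difference between a direct sum and count_nonzero concrete, here is a small plain-Python sketch on the first sample note. It avoids NumPy on purpose; the expression `sum(1 for c in counts if c != 0)` is used here as a stand-in for what `np.count_nonzero(counts)` reports, and the variable names are chosen for this example.

```python
name = ["Lance John", "Germany", "senior", "working"]
text = "My name is Lance John. Lance John is senior marketing analyst"

# Occurrences of each term in the note, in the order of the `name` list.
counts = [text.count(term) for term in name]

# A direct sum counts every occurrence, so the repeated "Lance John" counts twice.
total_occurrences = sum(counts)

# Counting non-zero entries counts each term at most once (unique appearances).
distinct_terms = sum(1 for c in counts if c != 0)

print(counts, total_occurrences, distinct_terms)
```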
