How to identify substrings in the order they appear in a string



I have the following list of sentences.

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use','data mining is the analysis step of the knowledge discovery in databases process or kdd']

I also have a set of selected concepts.

selected_concepts = ['machine learning','patterns','data mining','methods','database systems','interdisciplinary subfield','knowledege discovery','databases process','information','process']

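Now I want to pick out the concepts in selected_concepts from sentences, in the order in which they appear in each sentence.

That is, my output should look like this.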

output = [['data mining','process','patterns','methods','machine learning','database systems'],['data mining','interdisciplinary subfield','information'],['data mining','knowledge discovery','databases process']]

I can extract the concepts from each sentence as follows.

output = []
for sentence in sentences:
    sentence_tokens = []
    for item in selected_concepts:
        if item in sentence:
            sentence_tokens.append(item)
    output.append(sentence_tokens)

However, I am having trouble arranging the extracted concepts in sentence order. Is there an easy way to do this in Python?

One way to do this is to use the .find() method to get the position of each substring and then sort by that value. For example:

output = []
for sentence in sentences:
    sentence_tokens = []
    for item in selected_concepts:
        index = sentence.find(item)
        if index >= 0:
            sentence_tokens.append((index, item))
    sentence_tokens = [e[1] for e in sorted(sentence_tokens, key=lambda x: x[0])]
    output.append(sentence_tokens)
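
As a more compact variant of the same idea, sentence.find can be passed directly as the sort key. This is only a sketch: the function name concepts_in_order is made up for illustration, and the concept list is a subset of the one in the question.

```python
def concepts_in_order(sentence, concepts):
    """Return the concepts found in `sentence`, ordered by first occurrence."""
    found = [concept for concept in concepts if concept in sentence]
    # str.find returns the index of the first occurrence, so it can serve as the sort key.
    return sorted(found, key=sentence.find)


# Subset of the concepts from the question.
selected_concepts = ['machine learning', 'patterns', 'data mining', 'methods',
                     'database systems', 'information', 'process']
sentence = ('data mining is the process of discovering patterns in large data sets '
            'involving methods at the intersection of machine learning statistics '
            'and database systems')

print(concepts_in_order(sentence, selected_concepts))
# ['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems']
```

Note that a plain substring test also reports a short concept that only occurs inside a longer one, such as "process" inside "databases process".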

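You can use .find() and place each concept at its position. A plain list.insert(pos, item) is not enough here, because insert() appends the item whenever pos is past the end of the list, so the version below writes into a list with one slot per character instead. Something like:

output = []
for sentence in sentences:
    # One slot per character position in the sentence.
    slots = [None] * len(sentence)
    for item in selected_concepts:
        pos = sentence.find(item)
        if pos != -1:
            slots[pos] = item
    # Drop the empty slots; what remains is in sentence order.
    output.append([item for item in slots if item is not None])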

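The only issue is overlap in selected_concepts, for example "databases process" and "process". Concepts that land on the same position are not kept in the order they have in selected_concepts. You can work around that with something like the following, which folds each concept's index into its position:

output = []
selected_concepts_multiplier = len(selected_concepts)
for sentence in sentences:
    # Slots are used instead of list.insert(), which appends when the index is past the end.
    slots = [None] * (selected_concepts_multiplier * len(sentence))
    for k, item in enumerate(selected_concepts):
        pos = sentence.find(item)
        if pos != -1:
            # Same position: the concept listed earlier in selected_concepts comes first.
            slots[(selected_concepts_multiplier * pos) + k] = item
    output.append([item for item in slots if item is not None])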
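
The same tie-breaking idea can probably be expressed more simply by sorting on a (position, index) tuple. The sketch below is an illustration, not the answerer's code: the function name ordered_concepts and the extra concept 'data' are assumptions, added to force two concepts onto the same start position.

```python
def ordered_concepts(sentence, concepts):
    """Order concepts by position in `sentence`, then by index in `concepts`."""
    hits = []
    for k, item in enumerate(concepts):
        pos = sentence.find(item)
        if pos != -1:
            # Tuples sort by position first and by list index second.
            hits.append((pos, k, item))
    return [item for _, _, item in sorted(hits)]


# 'data mining' and 'data' both start at position 0.
print(ordered_concepts('data mining is the process',
                       ['process', 'data mining', 'data']))
# ['data mining', 'data', 'process']
```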

There is a built-in operator called "in". It checks whether one string occurs inside another string.

sentences = [
'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
'data mining is the analysis step of the knowledge discovery in databases process or kdd'
]
selected_concepts = [
    'machine learning',
    'patterns',
    'data mining',
    'methods',
    'database systems',
    'interdisciplinary subfield',
    'knowledege discovery',
    'databases process',
    'information',
    'process'
]
output = []  # prepare the output
for s in sentences:  # check each sentence
    output.append([])  # add a new list to output, making it a list of lists
    for c in selected_concepts:  # check every selected concept
        if c in s:  # if the concept occurs in the sentence
            output[-1].append(c)  # add it to the last list in output
print(output)

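You can take advantage of the fact that a regular expression searches the text in order, from left to right, and does not allow matches to overlap: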

import re
concept_re = re.compile(r'\b(?:' +
    '|'.join(re.escape(concept) for concept in selected_concepts) + r')\b')
output = [match
        for sentence in sentences for match in concept_re.findall(sentence)]
output
# => ['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems', 'data mining', 'interdisciplinary subfield', 'information', 'information', 'data mining', 'databases process']

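This should also be faster than searching for each concept separately, since the algorithm that regular expressions use is more efficient for this and is implemented entirely in low-level code.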

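There is one difference: if a concept is repeated within a sentence, your code reports it only once per sentence, whereas this code outputs every occurrence. If that difference matters, it is easy enough to deduplicate the list.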
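
A minimal sketch of that deduplication, which also keeps one list per sentence as the question asks. The function name concepts_per_sentence is hypothetical, and sorting the alternatives longest-first is an added precaution rather than part of the answer above: an alternation takes the first alternative that matches, not the longest one.

```python
import re


def concepts_per_sentence(sentences, concepts):
    """Return, for each sentence, the matched concepts in order, without repeats."""
    # Longer alternatives first, so a concept is not shadowed by a shorter one that is its prefix.
    alternatives = sorted(concepts, key=len, reverse=True)
    pattern = re.compile(
        r'\b(?:' + '|'.join(re.escape(c) for c in alternatives) + r')\b')
    # dict.fromkeys keeps the first occurrence of each match and preserves
    # insertion order (guaranteed from Python 3.7 on).
    return [list(dict.fromkeys(pattern.findall(s))) for s in sentences]


example_sentences = [
    'data mining is an interdisciplinary subfield of computer science and '
    'statistics with an overall goal to extract information from a data set and '
    'transform the information into a comprehensible structure for further use',
]
example_concepts = ['information', 'data mining', 'interdisciplinary subfield']

print(concepts_per_sentence(example_sentences, example_concepts))
# [['data mining', 'interdisciplinary subfield', 'information']]
```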

Here I used a simple re.findall approach. If the pattern matches in the string, re.findall returns the matches; otherwise it returns an empty list. Based on that, I wrote this code:

import re
selected_concepts = ['machine learning','patterns','data mining','methods','database systems','interdisciplinary subfield','knowledege discovery','databases process','information','process']
sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use','data mining is the analysis step of the knowledge discovery in databases process or kdd']
output = []
for sentence in sentences:
    matched_concepts = []
    for selected_concept in selected_concepts:
        if re.findall(selected_concept, sentence):
            matched_concepts.append(selected_concept)
    output.append(matched_concepts)
print(output)

Output:

[['machine learning', 'patterns', 'data mining', 'methods', 'database systems', 'process'], ['data mining', 'interdisciplinary subfield', 'information'], ['data mining', 'databases process', 'process']]
