spaCy: generalizing a language factory that takes regular expressions to create spans in a text



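With spaCy, you can define spans in a document that correspond to regular-expression matches in its text. I would like to generalize this into a language factory.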

The code that creates the spans looks like this:

    import re

    import spacy
    from spacy.language import Language
    from spacy.tokens import Span

    nlp = spacy.load("en_core_web_sm")
    text = "this is pepa pig text comprising a brake and fig. 45. The house is white."
    doc = nlp(text)

    def _component(doc, name, regular_expression):
        # Create the span group the first time this name is used.
        if name not in doc.spans:
            doc.spans[name] = []
        for i, match in enumerate(re.finditer(regular_expression, doc.text)):
            label = name + "_" + str(i)
            start, end = match.span()
            # Expand the character offsets outward to token boundaries.
            span = doc.char_span(start, end, alignment_mode="expand")
            span_to_add = Span(doc, span.start, span.end, label=label)
            doc.spans[name].append(span_to_add)
        return doc

    doc = _component(doc, 'pepapig', r"pepa\spig")
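For reference, `re.finditer` reports character offsets rather than token indices, which is why the function above routes them through `doc.char_span`. A minimal standard-library sketch of that first step, assuming the pattern is `pepa\spig`:

```python
import re

text = "this is pepa pig text comprising a brake and fig. 45. The house is white."

# Collect (start, end, matched_text) for every match of the pattern.
# The offsets are character positions in `text`, not token positions.
matches = [(m.start(), m.end(), m.group()) for m in re.finditer(r"pepa\spig", text)]
print(matches)
```

These character offsets are what `alignment_mode="expand"` then widens to whole tokens.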

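I want to generalize this into a factory. The factory would accept a list of named regular expressions, like this:

    [{'name':'pepapig','rex':r"pepa\spig"},{'name':'pepapig2','rex':r"george\spig"}]

This is how I tried to do it (the code does not work):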

    @Language.factory("myregexes6", default_config={})
    def add_regex_match_as_span(nlp, name, regular_expressions):
        for i, rex_d in enumerate(regular_expressions):
            print(rex_d)
            name = rex_d['name']
            rex = rex_d['rex']
            _component(doc, name=name, regular_expression=rex, DEBUG=False)
        return doc
    nlp.add_pipe(add_regex_match_as_span(nlp, "MC", regular_expressions=[{'name':'pepapig','rex':r"pepa\spig"},{'name':'pepapig2','rex':r"george\spig"}]))

I am looking for a way to make the code above work.

The error I get is:
[E966] `nlp.add_pipe` now takes the string name of the registered component factory, not a callable component. Expected string, but got this is pepa pig text comprising a brake and fig. 45. The house is white. (name: 'None').
- If you created your component with `nlp.create_pipe('name')`: remove nlp.create_pipe and call `nlp.add_pipe('name')` instead.
- If you passed in a component like `TextCategorizer()`: call `nlp.add_pipe` with the string name instead, e.g. `nlp.add_pipe('textcat')`.
- If you're using a custom component: Add the decorator `@Language.component` (for function components) or `@Language.factory` (for class components / factories) to your custom component and assign it a name, e.g. `@Language.component('your_name')`. You can then run `nlp.add_pipe('your_name')` to add it to the pipeline.

How can I save the factory in a .py file and load it again from another file?

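I think you need to follow what is described in the [documentation on custom components][1]. Here is how I would approach the problem you are facing.
I would first create the component. In this case it should be a class, because it takes parameters, that is, it has state. Here the parameter is `regex_list`, a list of `{'name': name, 'rex': rex}` dictionaries.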

    import re

    import spacy
    from spacy.language import Language
    from spacy.tokens import Span

    class RegExComponent:
        def __init__(self, regex_list):
            self.regex_list = regex_list

        def __call__(self, doc):
            for re_item in self.regex_list:
                # Create the span group the first time this name is used.
                if re_item['name'] not in doc.spans:
                    doc.spans[re_item['name']] = []
                for i, match in enumerate(re.finditer(re_item['rex'], doc.text)):
                    label = re_item['name'] + "_" + str(i)
                    start, end = match.span()
                    # Expand the character offsets outward to token boundaries.
                    span = doc.char_span(start, end, alignment_mode="expand")
                    span_to_add = Span(doc, span.start, span.end, label=label)
                    doc.spans[re_item['name']].append(span_to_add)
            return doc

Now that you have your component, you need a "factory" that creates it with the specified parameters. You can do it like this:

    @Language.factory("myregex", default_config={})
    def create_regex(nlp, name, regex_list):
        return RegExComponent(regex_list)

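`nlp` and `name` must always be present in the factory's signature, while `regex_list` is the input to your regex component.
Here is an example of how to use the newly created component: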

    regex_list = [{'name':'pepapig','rex':r"pepa\spig"},{'name':'pepapig2','rex':r"george\spig"}]
    nlp = spacy.load("en_core_web_sm")
    # Pass the registered factory name as a string; the arguments go in `config`.
    nlp.add_pipe("myregex", "MC", config={'regex_list': regex_list})
    text = "this is pepa pig text comprising a brake and fig. 45. The house is white. Hello george pig"
    doc = nlp(text)
    print(doc.spans)  # {'pepapig': [pepa pig], 'pepapig2': [george pig]}
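As a possible variation, the factory can declare a default for `regex_list` in `default_config`, so the component can also be added without a `config` argument. The sketch below is only one way to arrange this. The names `RegexSpanner` and `regex_spanner_sketch` are made up for illustration, and it assumes spaCy v3 with `spacy.blank("en")` so that no trained model has to be downloaded:

```python
import re

import spacy
from spacy.language import Language


class RegexSpanner:
    """Hypothetical variant of the component: compiles each pattern once."""

    def __init__(self, regex_list):
        self.patterns = [(item["name"], re.compile(item["rex"])) for item in regex_list]

    def __call__(self, doc):
        for name, pattern in self.patterns:
            spans = []
            for i, match in enumerate(pattern.finditer(doc.text)):
                start, end = match.span()
                # char_span can return None if the offsets cannot be aligned.
                span = doc.char_span(
                    start, end, label=f"{name}_{i}", alignment_mode="expand"
                )
                if span is not None:
                    spans.append(span)
            doc.spans[name] = spans
        return doc


# An empty list as default means the component is a no-op unless configured.
@Language.factory("regex_spanner_sketch", default_config={"regex_list": []})
def create_regex_spanner(nlp, name, regex_list):
    return RegexSpanner(regex_list)


nlp = spacy.blank("en")
nlp.add_pipe(
    "regex_spanner_sketch",
    config={"regex_list": [{"name": "pepapig", "rex": r"pepa\spig"}]},
)
doc = nlp("this is pepa pig text")
print([(span.text, span.label_) for span in doc.spans["pepapig"]])
```

Passing the label directly to `doc.char_span` avoids building a second `Span` object, and the `None` check guards against matches that cannot be mapped to tokens.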

If you define the component and its factory in a separate file, `custom_pipe.py`, you can use it like this:

    import spacy
    # Importing the module runs @Language.factory, which registers "myregex".
    from custom_pipe import RegExComponent

    regex_list = [
        {"name": "pepapig", "rex": r"pepa\spig"},
        {"name": "pepapig2", "rex": r"george\spig"},
    ]
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("myregex", "MC", config={"regex_list": regex_list})
    text = "this is pepa pig text comprising a brake and fig. 45. The house is white. Hello george pig"
    doc = nlp(text)
    print(doc.spans)
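Why a plain import is enough can be illustrated with a toy registry. This is only an analogy written with the standard library, not spaCy's actual implementation, and `REGISTRY` and `factory` are hypothetical names. The point is that a decorator with a side effect runs as soon as the module defining the function is executed, that is, at import time:

```python
import re

# Toy analogy only: maps a string name to the function that builds a component.
REGISTRY = {}


def factory(name):
    """Return a decorator that stores the decorated function under `name`."""
    def decorator(func):
        REGISTRY[name] = func
        return func
    return decorator


@factory("myregex")  # executed when this module is run or imported
def create_regex(regex_list):
    def component(text):
        # Map each pattern name to the list of strings it matches.
        return {item["name"]: re.findall(item["rex"], text) for item in regex_list}
    return component


# Later code only needs the string name to look the factory up.
component = REGISTRY["myregex"]([{"name": "pepapig", "rex": r"pepa\spig"}])
result = component("this is pepa pig text")
print(result)
```

In the same spirit, `nlp.add_pipe("myregex", ...)` can only find the factory if the file that defines it has already been imported.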
I hope this answer helps!

[1]: https://spacy.io/usage/processing-pipelines#example-stateful-components
