Grammar parser for parsing parliamentary debates?



I want to parse plain text that comes from a transcription tool (the goal is to render it as LegalDocML).

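My problem is that I don't know where to start, and grammar parsers have a fairly steep learning curve. I'm looking for guidance on which kind of parser suits this problem.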

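My intuition is that the text below is a candidate for an LR grammar tool, since there seem to be some clear delimiters (uppercase for the speaker, parentheses for the speaker's role, square brackets for the time of the speech). But there is also some need for NLP: in a grievance, the person being addressed usually appears only loosely, somewhere in the first sentence of the speech.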

Any suggestions would be much appreciated.

As an example:

Legislative Assembly
Thursday, 19 May 2022

THE SPEAKER (Mrs M.H. Roberts) took the chair at 9.00 am, acknowledged country and read prayers.
PAPER TABLED
A paper was tabled and ordered to lie upon the table of the house.
SMALL BUSINESS ASSISTANCE GRANTS
Statement by Minister for Small Business
Statement
MR D.T. PUNCH (Bunbury — Minister for Small Business) [9.01 am]: I would like to bring to the attention of the house some recent changes made by the McGowan government to the small business assistance grants. As I have previously advised the house, in February the state government announced a $67 million level 1 COVID-19 business assistance package, and more recently a $72 million package for businesses impacted by level 2 public health and social measures, taking the total committed to COVID-19 business support to almost $1.7 billion over the past two years. The level 1 package includes $42 million in rent relief assistance and the level 2 package includes a $66.8 million small business hardship grants program.
Last month, a revision and expansion of the small business hardship grants program was announced.
.
.
.
HOME INDEMNITY INSURANCE
Grievance
MR R.S. LOVE (Moore — Deputy Leader of the Opposition) [9.06 am]: I grieve today to the Parliamentary Secretary to the Minister for Commerce on behalf of Western Australian residents who have had their

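This problem definitely sits in the awkward valley between context-free parsing, which is too precise to handle unstructured discourse, and natural language parsing, which (as far as I know, in the current state of the art) is not designed to take advantage of subtle typographic cues.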

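For what it's worth, my recommendation is to use an ad hoc collection of regular expressions to try to capture the typographic conventions and the boilerplate phrases ("A paper was tabled and ordered to lie upon the table of the house."). That's what I did when I tried something similar with the Canadian equivalent a few decades ago (when Perl was the state of the art), and it mostly worked, although it needed a certain amount of manual intervention. (My style is to use sanity checks to detect cases that were not handled properly and to log them for future improvement.) How much work all of this takes depends on how precise you need the results to be.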
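To make the regex-plus-sanity-check idea concrete, here is a minimal Python sketch. It is only a sketch: the patterns are guessed from the single sample in the question, and every name in it (`classify`, `SPEAKER_RE`, `LOOKS_LIKE_TURN_RE`, the logger name, the `kind` labels) is made up for illustration. It also assumes the text inside the parentheses is "electorate, dash, role", which will not hold for every line.

```python
import logging
import re

logger = logging.getLogger("debate_sketch")  # hypothetical logger name

# Speaker turn, e.g. "MR D.T. PUNCH (Bunbury - Minister ...) [9.01 am]: text"
# Assumption: capitals for the name, one parenthesised group, a bracketed time.
SPEAKER_RE = re.compile(
    r"^(?P<speaker>[A-Z][A-Z.' -]+?)\s*"
    r"\((?P<paren>[^)]*)\)\s*"
    r"\[(?P<time>\d{1,2}\.\d{2}\s*[ap]m)\]:\s*"
    r"(?P<text>.*)$"
)

# A line written entirely in capitals is treated as a topic heading.
HEADING_RE = re.compile(r"^[A-Z][A-Z0-9 ,.'-]+$")

# Sanity check: starts with capitals and contains "[digit", so it was
# probably meant to be a speaker turn even though SPEAKER_RE rejected it.
LOOKS_LIKE_TURN_RE = re.compile(r"^[A-Z]{2,}.*\[\d")

# Fixed phrases that carry meaning as a whole (extend as you meet more).
BOILERPLATE = {
    "A paper was tabled and ordered to lie upon the table of the house.":
        "paper_tabled",
}


def classify(line):
    """Return a dict describing one transcript line."""
    line = line.strip()
    m = SPEAKER_RE.match(line)
    if m:
        # Assumed layout inside the parentheses: "Electorate <em dash> Role".
        electorate, _, role = m.group("paren").partition(" \u2014 ")
        return {
            "kind": "speech",
            "speaker": m.group("speaker"),
            "electorate": electorate,
            "role": role or None,
            "time": m.group("time"),
            "text": m.group("text"),
        }
    if line in BOILERPLATE:
        return {"kind": BOILERPLATE[line]}
    if HEADING_RE.match(line):
        return {"kind": "heading", "text": line}
    if LOOKS_LIKE_TURN_RE.match(line):
        # Record the failure so the patterns can be improved later.
        logger.warning("possible unparsed speaker turn: %r", line)
        return {"kind": "unparsed", "text": line}
    return {"kind": "text", "text": line}


if __name__ == "__main__":
    sample = ("MR D.T. PUNCH (Bunbury \u2014 Minister for Small Business) "
              "[9.01 am]: I would like to bring to the attention of the house")
    print(classify(sample))
    print(classify("PAPER TABLED"))
```

The `unparsed` branch is the important part: rather than silently treating a malformed turn as ordinary text, it logs the line so you can review it by hand and tighten the patterns over time. Continuation paragraphs (such as "Last month, a revision ...") come back as `text` and would need to be attached to the preceding speech by whatever loop drives `classify`.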

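If you have enough computing resources, you could quite possibly build a machine learning model that does a reasonable job. But you would still need to do a lot of validation and recalibration, unless you can tolerate errors.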
