Regex to extract bibliography text from a paragraph - Python



In my Python task, I have a bibliography as a single string (a paragraph), which I want to parse into a list of strings.

This is the whole string:

A. Berger and H. Printz. 1998. Recognition perfor- mance of a large-scale dependency-grammar lan- guage model. In Int'l Conference on Spoken Lan- guage Processing (ICSLP'98), Sydney, Australia. A. Blum. 1992. Learning boolean functions in an infinite attribute space. Machine Learning, 9(4):373-386. E. Brill. 1995. Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 21(4):543-565. C. Chelba and F. Jelinek. 1998. Exploiting syntac- tic structure for language modeling. In COLING- A CL '98. C. Cumby and D. Roth. 2000. Relational repre- sentations that facilitate learning. In Proc. of the International Conference on the Principles of Knowledge Representation and Reasoning. To ap- pear. I. Dagan, L. Lee, and F. Pereira. 1999. Similarity- based models of word cooccurrence probabilities. Machine Learning, 34(1-3):43-69. A. R. Golding and D. Roth. 1999. A Winnow based approach to context-sensitive spelling correction. Machine Learning, 34(1-3):107-130. Special Issue on Machine Learning and Natural Language. F. Jelinek. 1998. Statistical Methods for Speech Recognition. MIT Press. D. Jurafsky and J. H. Martin. 200. Speech and Lan- guage Processing. Prentice Hall. L. Lee and F. Pereira. 1999. Distributional similar- ity models: Clustering vs. nearest neighbors. In A CL 99, pages 33-40. L. Lee. 1999. Measure of distributional similarity. In A CL 99, pages 25-32. N. Littlestone. 1988. Learning quickly when irrel- evant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285-318. M. Munoz, V. Punyakanok, D. Roth, and D. Zimak. 1999. A learning approach to shallow parsing. In EMNLP-VLC'99, the Joint SIGDAT Conference on Empirical Methods in Natural Language Pro- cessing and Very Large Corpora, June. A. Ratnaparkhi, J. Reynar, and S. Roukos. 1994. A maximum entropy model for prepositional phrase attachment. In ARPA, Plainsboro, N J, March. R. Rosenfeld. 1996. 
A maximum entropy approach to adaptive statistical language modeling. Com- puter, Speech and Language, 10. D. Roth and D. Zelenko. 1998. Part of speech tagging using a network of linear separators. In COLING-ACL 98, The 17th International Conference on Computational Linguistics, pages 1136-1142. D. Roth. 1998. Learning to resolve natural language ambiguities: A unified approach. In Proc. Na- tional Conference on Artificial Intelligence, pages 806-813. D. Roth. 1999. Learning in natural language. In Proc. of the International Joint Conference of Ar- tificial Intelligence, pages 898-904. P. Tapanainen and T. Jrvinen. 1997. A non- projective dependency parser. In In Proceedings of the 5th Conference on Applied Natural Lan- guage Processing, Washington DC. D. Yarowsky. 1994. Decision lists for lexical ambi- guity resolution: application to accent restoration in Spanish and French. In Proc. of the Annual Meeting of the A CL, pages 88-95. D. Yuret. 1998. Discovery of Linguistic Relations Using Lexical Attraction. Ph.D. thesis, MIT. 131

This is the output I want:

A. Berger and H. Printz. 1998. Recognition performance of a large-scale dependency-grammar language model. In Int'l Conference on Spoken Language Processing (ICSLP'98), Sydney, Australia.

A. Blum. 1992. Learning boolean functions in an infinite attribute space. Machine Learning, 9(4):373-386.
E. Brill. 1995. Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 21(4):543-565.
C. Chelba and F. Jelinek. 1998. Exploiting syntactic structure for language modeling. In COLINGA CL '98.
C. Cumby and D. Roth. 2000. Relational representations that facilitate learning. In Proc. of the International Conference on the Principles of Knowledge Representation and Reasoning. To appear.
I. Dagan, L. Lee, and F. Pereira. 1999. Similaritybased models of word cooccurrence probabilities. Machine Learning, 34(1-3):43-69.
A. R. Golding and D. Roth. 1999. A Winnow based approach to context-sensitive spelling correction. Machine Learning, 34(1-3):107-130. Special Issue on Machine Learning and Natural Language.
F. Jelinek. 1998. Statistical Methods for Speech Recognition. MIT Press.
D. Jurafsky and J. H. Martin. 200. Speech and Language Processing. Prentice Hall. 

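and so on...

I tried different regular expressions but could not get a proper result, because the strings do not have any specific ending.

But every new string starts with the author name(s), followed by the year, then the paper title.

For example, in the first string the author name (A. Berger) is followed by "and" and another author name (H. Printz.), then the year 1998. But in the second string, the author name (A. Blum.) is followed directly by the year 1992.

Any kind of help would be appreciated.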

"unable to get a proper result, because string does not have any specific end. But every new string is starting with Author Name(s) following by year"

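That may be enough to go on. I wrote a regex that works on your whole sample,
but it is still subjective. Adding or removing any name form or punctuation
will blow it out of the water.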

((?:(?<![a-zA-Z])[A-Z]\.[ \t]+)+[A-Z][a-zA-Z]+(?:[ \t]*,[ \t]*(?:(?<![a-zA-Z])[A-Z]\.[ \t]+)+[A-Z][a-zA-Z]+)*(?:[ \t]*,)?(?:[ \t]+and[ \t]+(?:(?<![a-zA-Z])[A-Z]\.[ \t]+)+[A-Z][a-zA-Z]+)*[ \t]*\.[ \t]*\d{4}[ \t]*\.)(?!\S)

Replace with \r\n\1

See a sample here -> https://regex101.com/r/ylZKDH/1
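To make the pieces of that pattern easier to follow, here is a sketch of the same regex written with re.VERBOSE. The names AUTHOR and ENTRY_START are mine, not part of the original answer, and the comments describe how I read each piece; treat it as an annotated restatement rather than a different approach.

```python
import re

# One author: one or more initials ("A." or "A. R.") followed by a surname.
# The lookbehind stops a capital at the end of a word from counting as an initial.
AUTHOR = r"(?:(?<![a-zA-Z])[A-Z]\.[ \t]+)+[A-Z][a-zA-Z]+"

ENTRY_START = re.compile(
    rf"""
    (                                 # group 1: the author list plus the year
      {AUTHOR}                        #   first author
      (?:[ \t]*,[ \t]*{AUTHOR})*      #   further authors separated by commas
      (?:[ \t]*,)?                    #   optional comma before "and"
      (?:[ \t]+and[ \t]+{AUTHOR})*    #   "and" plus a final author
      [ \t]*\.[ \t]*\d{{4}}[ \t]*\.   #   period, four-digit year, period
    )
    (?!\S)                            # next character is whitespace or the end
    """,
    re.VERBOSE,
)

sample = "Sydney, Australia. A. Blum. 1992. Learning boolean functions."
match = ENTRY_START.search(sample)
print(match.group(1))
```

Because the year is written as \d{4}, an entry whose year was extracted with only three digits (such as "200.") is not recognized as the start of a new reference.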

Python sub sample, per request

>>> import re
>>>
>>> biblioStr = '''A. Berger and H. Printz. 1998. Recognition perfor- mance of a large-scale dependency-grammar lan- guage model. In Int'l Conference on Spoken Lan- guage Processing (ICSLP'98), Sydney, Australia. A. Blum. 1992. Learning boolean functions in an infinite attribute space. Machine Learning, 9(4):373-386. E. Brill. 1995. Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 21(4):543-565. C. Chelba and F. Jelinek. 1998. Exploiting syntac- tic structure for language modeling. In COLING- A CL '98. C. Cumby and D. Roth. 2000. Relational repre- sentations that facilitate learning. In Proc. of the International Conference on the Principles of Knowledge Representation and Reasoning. To ap- pear. I. Dagan, L. Lee, and F. Pereira. 1999. Similarity- based models of word cooccurrence probabilities. Machine Learning, 34(1-3):43-69. A. R. Golding and D. Roth. 1999. A Winnow based approach to context-sensitive spelling correction. Machine Learning, 34(1-3):107-130. Special Issue on Machine Learning and Natural Language. F. Jelinek. 1998. Statistical Methods for Speech Recognition. MIT Press. D. Jurafsky and J. H. Martin. 200. Speech and Lan- guage Processing. Prentice Hall. L. Lee and F. Pereira. 1999. Distributional similar- ity models: Clustering vs. nearest neighbors. In A CL 99, pages 33-40. L. Lee. 1999. Measure of distributional similarity. In A CL 99, pages 25-32. N. Littlestone. 1988. Learning quickly when irrel- evant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285-318. M. Munoz, V. Punyakanok, D. Roth, and D. Zimak. 1999. A learning approach to shallow parsing. In EMNLP-VLC'99, the Joint SIGDAT Conference on Empirical Methods in Natural Language Pro- cessing and Very Large Corpora, June. A. Ratnaparkhi, J. Reynar, and S. Roukos. 1994. A maximum entropy model for prepositional phrase attachment. In ARPA, Plainsboro, N J, March. R. Rosenfeld. 1996. A maximum entropy approach to adaptive statistical language modeling. Com- puter, Speech and Language, 10. D. Roth and D. Zelenko. 1998. Part of speech tagging using a network of linear separators. In COLING-ACL 98, The 17th International Conference on Computational Linguistics, pages 1136-1142. D. Roth. 1998. Learning to resolve natural language ambiguities: A unified approach. In Proc. Na- tional Conference on Artificial Intelligence, pages 806-813. D. Roth. 1999. Learning in natural language. In Proc. of the International Joint Conference of Ar- tificial Intelligence, pages 898-904. P. Tapanainen and T. Jrvinen. 1997. A non- projective dependency parser. In In Proceedings of the 5th Conference on Applied Natural Lan- guage Processing, Washington DC. D. Yarowsky. 1994. Decision lists for lexical ambi- guity resolution: application to accent restoration in Spanish and French. In Proc. of the Annual Meeting of the A CL, pages 88-95. D. Yuret. 1998. Discovery of Linguistic Relations Using Lexical Attraction. Ph.D. thesis, MIT. 131
... '''
>>>
>>> Rx = re.compile( r"((?:(?<![a-zA-Z])[A-Z]\.[ \t]+)+[A-Z][a-zA-Z]+(?:[ \t]*,[ \t]*(?:(?<![a-zA-Z])[A-Z]\.[ \t]+)+[A-Z][a-zA-Z]+)*(?:[ \t]*,)?(?:[ \t]+and[ \t]+(?:(?<![a-zA-Z])[A-Z]\.[ \t]+)+[A-Z][a-zA-Z]+)*[ \t]*\.[ \t]*\d{4}[ \t]*\.)(?!\S)" )
>>>
>>> print (re.sub( Rx, r'\r\n\1', biblioStr ))
A. Berger and H. Printz. 1998. Recognition perfor- mance of a large-scale dependency-grammar lan- guage model. In Int'l Conference on Spoken Lan- guage Processing (ICSLP'98), Sydney, Australia.
A. Blum. 1992. Learning boolean functions in an infinite attribute space. Machine Learning, 9(4):373-386.
E. Brill. 1995. Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics, 21(4):543-565.
C. Chelba and F. Jelinek. 1998. Exploiting syntac- tic structure for language modeling. In COLING- A CL '98.
C. Cumby and D. Roth. 2000. Relational repre- sentations that facilitate learning. In Proc. of the International Conference on the Principles of Knowledge Representation and Reasoning. To ap- pear.
I. Dagan, L. Lee, and F. Pereira. 1999. Similarity- based models of word cooccurrence probabilities. Machine Learning, 34(1-3):43-69.
A. R. Golding and D. Roth. 1999. A Winnow based approach to context-sensitive spelling correction. Machine Learning, 34(1-3):107-130. Special Issue on Machine Learning and Natural Language.
F. Jelinek. 1998. Statistical Methods for Speech Recognition. MIT Press. D. Jurafsky and J. H. Martin. 200. Speech and Lan- guage Processing. Prentice Hall.
L. Lee and F. Pereira. 1999. Distributional similar- ity models: Clustering vs. nearest neighbors. In A CL 99, pages 33-40.
L. Lee. 1999. Measure of distributional similarity. In A CL 99, pages 25-32.
N. Littlestone. 1988. Learning quickly when irrel- evant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285-318.
M. Munoz, V. Punyakanok, D. Roth, and D. Zimak. 1999. A learning approach to shallow parsing. In EMNLP-VLC'99, the Joint SIGDAT Conference on Empirical Methods in Natural Language Pro- cessing and Very Large Corpora, June.
A. Ratnaparkhi, J. Reynar, and S. Roukos. 1994. A maximum entropy model for prepositional phrase attachment. In ARPA, Plainsboro, N J, March.
R. Rosenfeld. 1996. A maximum entropy approach to adaptive statistical language modeling. Com- puter, Speech and Language, 10.
D. Roth and D. Zelenko. 1998. Part of speech tagging using a network of linear separators. In COLING-ACL 98, The 17th International Conference on Computational Linguistics, pages 1136-1142.
D. Roth. 1998. Learning to resolve natural language ambiguities: A unified approach. In Proc. Na- tional Conference on Artificial Intelligence, pages 806-813.
D. Roth. 1999. Learning in natural language. In Proc. of the International Joint Conference of Ar- tificial Intelligence, pages 898-904.
P. Tapanainen and T. Jrvinen. 1997. A non- projective dependency parser. In In Proceedings of the 5th Conference on Applied Natural Lan- guage Processing, Washington DC.
D. Yarowsky. 1994. Decision lists for lexical ambi- guity resolution: application to accent restoration in Spanish and French. In Proc. of the Annual Meeting of the A CL, pages 88-95.
D. Yuret. 1998. Discovery of Linguistic Relations Using Lexical Attraction. Ph.D. thesis, MIT. 131
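
In the output above, the D. Jurafsky entry stays on the F. Jelinek line because its year appears as "200." and the pattern asks for exactly four digits. The question also asked for a list of strings with the line-break hyphens rejoined. A possible sketch of both follows. It rests on assumptions of mine: the year is relaxed to \d{3,4}, the helper names split_entries and dehyphenate are hypothetical, and a hyphen followed by a space between two lowercase letters is treated as a line-break artifact, which is only a heuristic.

```python
import re

# Same author/year shape as the regex above, but accepting 3 or 4 digit years
# (an assumption, to catch the truncated "200." in the sample).
AUTHOR = r"(?:(?<![a-zA-Z])[A-Z]\.[ \t]+)+[A-Z][a-zA-Z]+"
ENTRY_START = re.compile(
    rf"{AUTHOR}(?:[ \t]*,[ \t]*{AUTHOR})*(?:[ \t]*,)?"
    rf"(?:[ \t]+and[ \t]+{AUTHOR})*[ \t]*\.[ \t]*\d{{3,4}}[ \t]*\.(?!\S)"
)

def split_entries(text):
    """Cut the text at every position where an author/year header begins.

    Text before the first detected header is discarded.
    """
    starts = [m.start() for m in ENTRY_START.finditer(text)]
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]

def dehyphenate(entry):
    """Join words split as 'perfor- mance' (heuristic: lowercase on both sides)."""
    return re.sub(r"(?<=[a-z])- (?=[a-z])", "", entry)

sample = (
    "A. Berger and H. Printz. 1998. Recognition perfor- mance of a model. "
    "A. Blum. 1992. Learning boolean functions. "
    "D. Jurafsky and J. H. Martin. 200. Speech and Lan- guage Processing. Prentice Hall."
)
entries = [dehyphenate(e) for e in split_entries(sample)]
for entry in entries:
    print(entry)
```

The lowercase-only rule leaves "COLING- A CL" untouched, but it also merges real compounds such as "Similarity- based" into one word, so the result may still need manual review.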