I have a file with the following contents:
SUBJECT COMPANY:
COMPANY DATA:
COMPANY CONFORMED NAME: MISCELLANEOUS SUBJECT CORP
CENTRAL INDEX KEY: 0000000000
STANDARD INDUSTRIAL CLASSIFICATION: []
IRS NUMBER: 123456789
STATE OF INCORPORATION: DE
FISCAL YEAR END: 1231
Then, later in the file, there is something like this:
<REPORTING-OWNER>
COMPANY DATA:
COMPANY CONFORMED NAME: MISCELLANEOUS OWNER CORP
CENTRAL INDEX KEY: 0101010101
STANDARD INDUSTRIAL CLASSIFICATION: []
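What I need to do is capture the company conformed name, central index key, IRS number, fiscal year end, or whatever else I decide to extract, but only from the subject company section, not from the reporting owner section. These lines can appear in any order and may be absent, but if they are present I want to capture their values.
The regular expression I have been trying to build looks like this: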
(?:COMPANY CONFORMED NAME:\s*(?'conformed_name'(?!(?:A|AN|THE)\b)[A-Z-/\=|&!#$(){}:;,@`. ]+)|CENTRAL INDEX KEY:\s*(?'cik'\d{10})|IRS NUMBER:\s*(?'IRS_number'\w{2}-?\w{7,8})|FISCAL YEAR END:\s*(?'fiscal_year_end'(?:0[1-9]|1[0-2])(?:0[1-9]|[1-2][0-9]|3[0-1])))
The expected result is:
conformed_name = "MISCELLANEOUS SUBJECT CORP"
CIK = "0000000000"
IRS_number = "123456789"
fiscal_year_end = "1231"
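Any flavor of regex is acceptable, since I will adapt to whatever fits the scenario best. Thank you for reading about my predicament and for any guidance you can offer.
I eventually figured it out myself. Here is the pattern: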
/SUBJECT COMPANY:\s+COMPANY DATA:(?:\s+(?:(?:COMPANY CONFORMED NAME:\s+(?'conformed_name'[^\n]+))|(?:CENTRAL INDEX KEY:\s+(?'CIK'\d{10}))|(?:STANDARD INDUSTRIAL CLASSIFICATION:\s+(?'assigned_SIC'[^\n]+))|(?:IRS NUMBER:\s+?(?'IRS_number'\w{2}-?\w{7,8}))|(?:STATE OF INCORPORATION:\s+(?'state_of_incorporation'\w{2}))|(?:FISCAL YEAR END:\s+(?'fiscal_year_end'(?:0[1-9]|1[0-2])(?:0[1-9]|[1-2][0-9]|3[0-1])))\n))+/s
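For anyone who wants to try the same idea outside a PCRE-style engine, here is a rough Python port of the pattern above. It is only a sketch: Python writes named groups as (?P<name>...) rather than (?'name'...), the trailing \n on the last alternative is left out, and the helper name parse_subject_company and the SAMPLE text are made up for illustration.

```python
import re

# Sample text modeled on the excerpt in the question (hypothetical data).
SAMPLE = """SUBJECT COMPANY:
COMPANY DATA:
COMPANY CONFORMED NAME: MISCELLANEOUS SUBJECT CORP
CENTRAL INDEX KEY: 0000000000
STANDARD INDUSTRIAL CLASSIFICATION: []
IRS NUMBER: 123456789
STATE OF INCORPORATION: DE
FISCAL YEAR END: 1231
<REPORTING-OWNER>
COMPANY DATA:
COMPANY CONFORMED NAME: MISCELLANEOUS OWNER CORP
CENTRAL INDEX KEY: 0101010101
STANDARD INDUSTRIAL CLASSIFICATION: []
"""

# The header anchors the match; the repeated group then consumes any of
# the known field lines, in any order, until an unknown line is reached.
PATTERN = re.compile(
    r"SUBJECT COMPANY:\s+COMPANY DATA:"
    r"(?:\s+(?:"
    r"COMPANY CONFORMED NAME:\s+(?P<conformed_name>[^\n]+)"
    r"|CENTRAL INDEX KEY:\s+(?P<CIK>\d{10})"
    r"|STANDARD INDUSTRIAL CLASSIFICATION:\s+(?P<assigned_SIC>[^\n]+)"
    r"|IRS NUMBER:\s+?(?P<IRS_number>\w{2}-?\w{7,8})"
    r"|STATE OF INCORPORATION:\s+(?P<state_of_incorporation>\w{2})"
    r"|FISCAL YEAR END:\s+(?P<fiscal_year_end>(?:0[1-9]|1[0-2])(?:0[1-9]|[12][0-9]|3[01]))"
    r"))+"
)


def parse_subject_company(text):
    """Return the captured subject-company fields, omitting absent ones."""
    match = PATTERN.search(text)
    if match is None:
        return {}
    return {name: value for name, value in match.groupdict().items() if value is not None}


if __name__ == "__main__":
    for name, value in parse_subject_company(SAMPLE).items():
        print(f"{name} = {value!r}")
```

Because the repetition stops at the first line that is not one of the listed fields, the <REPORTING-OWNER> block is never reached. The flip side is that any unlisted line inside the subject company section also ends the match early, so every line type that can appear there needs its own alternative.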
To match the company fields only when they are preceded by "SUBJECT COMPANY", use a lookbehind:
(?<=SUBJECT COMPANY:\t\n \n )(?:COMPANY CONFORMED NAME:\s*(?'conformed_name'(?!(?:A|AN|THE)\b)[A-Z-/\=|&!#$(){}:;,@`. ]+)|CENTRAL INDEX KEY:\s*(?'cik'\d{10})|IRS NUMBER:\s*(?'IRS_number'\w{2}-?\w{7,8})|FISCAL YEAR END:\s*(?'fiscal_year_end'(?:0[1-9]|1[0-2])(?:0[1-9]|[1-2][0-9]|3[0-1])))
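Note that a lookbehind only checks the text immediately before the match position, so by itself it cannot reach fields that sit several lines below the header. One alternative, offered as a sketch and not as the method from the answer above, is to work in two steps: cut out the subject company section first, then look for the field lines inside that slice. The function name subject_company_fields and the rule that a section ends at the next line starting with "<" are assumptions made for this example.

```python
import re

# One field per line: a known label, a colon, then the value.
FIELD = re.compile(
    r"^[ \t]*(?P<key>COMPANY CONFORMED NAME|CENTRAL INDEX KEY|IRS NUMBER|FISCAL YEAR END)"
    r":[ \t]*(?P<value>[^\n]*?)[ \t\r]*$",
    re.MULTILINE,
)

# Assumption: the next tag-like line (e.g. <REPORTING-OWNER>) ends the section.
SECTION_END = re.compile(r"^[ \t]*<", re.MULTILINE)


def subject_company_fields(text):
    """Return {label: value} for fields found in the subject company section."""
    start = text.find("SUBJECT COMPANY:")
    if start == -1:
        return {}
    end_match = SECTION_END.search(text, start)
    end = end_match.start() if end_match else len(text)
    section = text[start:end]
    return {m.group("key"): m.group("value") for m in FIELD.finditer(section)}


if __name__ == "__main__":
    sample = (
        "SUBJECT COMPANY:\n"
        "COMPANY DATA:\n"
        "COMPANY CONFORMED NAME: MISCELLANEOUS SUBJECT CORP\n"
        "CENTRAL INDEX KEY: 0000000000\n"
        "<REPORTING-OWNER>\n"
        "COMPANY DATA:\n"
        "COMPANY CONFORMED NAME: MISCELLANEOUS OWNER CORP\n"
    )
    print(subject_company_fields(sample))
```

Splitting the work this way keeps the field pattern independent of line order and of missing lines, and the per-field validation from the question (ten-digit CIK, MMDD fiscal year end) can be applied to the returned values afterwards.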