Extracting the text that follows a specific set of characters from a text file using regex in Python



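Hi, I have text in the format shown below. I want to store each name (for example, 2ND COMPLEX OF NEURAL SCIENCES) in a dictionary, keyed by the original name, with the name and its aliases as the value, in the following format.

I tried to do this with the following code.

It fails to extract the pattern: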
re.findall(r'[a-z A-z 0-9 /n/-]+', ^[a.k.a.][a-z A-z 0-9 /n/-]+', textData)
re.findall(r'a.k.a. : (\S+)', textData)

I have no idea how to do this. Can anyone help me?


# Expected output

"2ND COMPLEX OF NEURAL SCIENCES":["2ND COMPLEX OF NATURAL NEURAL", "ACADEMY OF NEURAL 
SCIENCES", "CHE 2 CHAON KWAHAK-WON", "KUKPAN KAHAK-WON", "SECOND COMPLEX OF NEURAL SCIENCES 
RESEARCH INSTITUTE"]
"LOSTIK VE HAVAIK HIZMETLARI LTD":["LOSTIK VE HAVAIK HIZMETLARI LTD"]
"7 KARNES":["7 KARNES"]
"SWING OF TIR":["7TH OF TIR COMPLEX", "7TH OF TIR INDUSTRIAL COMPLEX", "7TH OF TIR 
INDUSTRIES", "7TH OF TIR INDUSTRIES OF ISFAHAN/ESFAHAN", "MOJTAMAE SANATE HAFTOME TIR" etc]
# textData.txt

2ND COMPLEX OF NEURAL SCIENCES (a.k.a. ACADEMY OF NEURAL 
SCIENCES; a.k.a. CHE 2 CHAON KAHAK-WON; a.k.a. CHE 2 CHAYON KAHAK-WON;
a.k.a. KUKPAN KAHAK-WON; a.k.a. NATIONAL DEFENSE ACADEMY; a.k.a.
SANSRI; a.k.a. SECOND COMPLEX OF NEURAL SCIENCES; a.k.a. SECOND
COMPLEX OF NEURAL SCIENCES RESEARCH INSTITUTE), Pyongyang, Korea,
North; Secondary sanctions risk: North Korea Sanctions Regulations,
sections 510.201 and 510.210; Transactions Prohibited For Persons
Owned or Controlled By U.S. Financial Institutions: North Korea
Sanctions Regulations section 510.214.
LOSTIK VE HAVAIK HIZMETLARI LTD., No. 3/182 Antepe
Bagdat Cad. Istasyon Yolu Sok., Istanbul 34840, Turkey; Additional
Sanctions Information - Subject to Secondary Sanctions.
[IFSR] (Linked To: MAHAN AIR).
7 KARNES, Avenida Ciudad de Cali No. 15A-91, Local A06-07, Bogota,
Colombia; Matricula Mercantil No 1978075 (Colombia).
SWING OF TIR (a.k.a. 7TH OF TIR COMPLEX; a.k.a. 7TH OF TIR INDUSTRIAL
COMPLEX; a.k.a. 7TH OF TIR INDUSTRIES; a.k.a. 7TH OF TIR INDUSTRIES
OF ISFAHAN/ESFAHAN; a.k.a. MOJTAMAE SANATE HAFTOME TIR; a.k.a.
SANAYE HAFTOME TIR; a.k.a. SEVENTH OF TIR), Mobarakeh Road Km 45,
Isfahan, Iran; P.O. Box 81465-478, Isfahan, Iran; Additional
Sanctions Information - Subject to Secondary Sanctions.

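You seem to be confused about what square brackets mean. Perhaps review the difference between square brackets and parentheses in regular expressions?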
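To make that difference concrete, here is a minimal sketch. The sample string is made up for illustration and is not taken from textData.txt. Square brackets define a character class that matches one character from a set, while parentheses group a sequence of characters and capture it.

```python
import re

sample = "FOO (a.k.a. BAR; a.k.a. BAZ)"  # made-up sample string

# [a.k.a.] is a character class: it matches ONE character that is 'a', 'k' or '.'
class_hits = re.findall(r'[a.k.a.]', sample)

# (a\.k\.a\.) is a group: it matches the literal sequence "a.k.a." and captures it
group_hits = re.findall(r'(a\.k\.a\.)', sample)

print(class_hits)  # single characters only
print(group_hits)  # whole "a.k.a." sequences
```

Note also that a dot inside a character class is a literal dot, whereas outside a class it must be escaped as `\.` to stop it matching any character.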

Your requirements are not entirely clear, but perhaps something like this?

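import re

# Read the whole file; assumes records are separated by blank lines
with open('textData.txt', 'r') as lines:
    text = lines.read()

for segment in text.split('\n\n'):
    # Join the wrapped lines of one record into a single line
    para = ' '.join(segment.splitlines())
    if para:
        # The name is everything before the first ", " or " ("
        name = re.match(r'^[^,()]+(?=, | \()', para)
        if name:
            akas = [name.group(0)]
            # Each alias follows "a.k.a. " and ends at ";" or ")"
            akas.extend(re.findall(r'(?<=a\.k\.a\. )([^;)]+)', para))
            print('"%s": ["%s"]' % (name.group(0), '", "'.join(akas)))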

This assumes that each record is separated from the next by a blank line, and that the file is small enough to fit in memory.
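Since the question asks for a dictionary rather than printed lines, the same idea can be sketched as a function that returns one. The function name `parse_records` and the shortened sample text are assumptions made for illustration. The blank-line separator is the same assumption as above.

```python
import re

def parse_records(text):
    """Return {name: [name, alias, ...]} for blank-line-separated records."""
    result = {}
    for segment in text.split('\n\n'):
        # split() with no arguments collapses newlines and repeated spaces
        para = ' '.join(segment.split())
        name = re.match(r'^[^,()]+(?=, | \()', para)
        if name:
            aliases = re.findall(r'(?<=a\.k\.a\. )([^;)]+)', para)
            result[name.group(0)] = [name.group(0)] + aliases
    return result

# Shortened, hypothetical sample with a blank line between the two records
sample = (
    "SWING OF TIR (a.k.a. 7TH OF TIR COMPLEX; a.k.a. 7TH OF TIR INDUSTRIAL\n"
    "COMPLEX; a.k.a. SEVENTH OF TIR), Mobarakeh Road Km 45,\n"
    "Isfahan, Iran.\n"
    "\n"
    "7 KARNES, Avenida Ciudad de Cali No. 15A-91, Bogota,\n"
    "Colombia.\n"
)
records = parse_records(sample)
print(records)
```

Collapsing whitespace with `split()` before matching means an alias that wraps across two lines comes out as a single clean string.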

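You can use 2 capture groups, and split the value of group 2 on (?:;\s)?a\.k\.a\.\s to get the separate values.

Using re.findall will return the capture group values.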

^([A-Z0-9](?:[A-Z0-9 ]*[A-Z0-9])?\b)(?: \((a\.k\.a\.[^()]+(?:\sa\.k\.a\.[^()]+)*)\))?

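The pattern matches:

  • ^ Start of string (start of each line when re.M is used)
  • ( Capture group 1
    • [A-Z0-9](?:[A-Z0-9 ]*[A-Z0-9])?\b Match uppercase letters, digits and spaces, starting and ending on a letter or digit, followed by a word boundary
  • ) Close group 1
  • (?: Non-capture group
    • \( preceded by a space: match a space and (
    • ( Capture group 2
      • a\.k\.a\.[^()]+(?:\sa\.k\.a\.[^()]+)* Match a.k.a. followed by any characters except ( and ), optionally repeated after a whitespace character
    • ) Close group 2
    • \) Match )
  • )? Close the non-capture group and make it optional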
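The breakdown above can also be kept next to the pattern itself. The following is a sketch of the same pattern written with re.VERBOSE, applied to a shortened, made-up sample rather than the real file. In verbose mode whitespace in the pattern is ignored, so the literal space before the parenthesis is written as `[ ]`.

```python
import re

# Same pattern as above, split over several lines with comments (re.X).
pattern = re.compile(r"""
    ^                                   # start of a line (because of re.M)
    (                                   # group 1: the primary name
        [A-Z0-9](?:[A-Z0-9 ]*[A-Z0-9])? #   uppercase letters, digits, spaces
        \b                              #   must end on a word boundary
    )
    (?:                                 # optional alias section
        [ ]\(                           #   a space and an opening parenthesis
        (                               # group 2: text between the parentheses
            a\.k\.a\.[^()]+
            (?:\sa\.k\.a\.[^()]+)*
        )
        \)                              #   the closing parenthesis
    )?
""", re.M | re.X)

# Shortened, hypothetical sample in the same layout as textData.txt
sample = (
    "SWING OF TIR (a.k.a. 7TH OF TIR COMPLEX; a.k.a.\n"
    "SEVENTH OF TIR), Mobarakeh Road Km 45,\n"
    "Isfahan, Iran.\n"
    "7 KARNES, Avenida Ciudad de Cali No. 15A-91, Bogota,\n"
    "Colombia.\n"
)
matches = pattern.findall(sample)
print(matches)
```

With two capture groups, findall returns a list of 2-tuples. A record without aliases gives an empty string for group 2, which is why empty parts have to be filtered out after splitting.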


For example

import re
import pprint

pattern = r"^([A-Z0-9](?:[A-Z0-9 ]*[A-Z0-9])?\b)(?: \((a\.k\.a\.[^()]+(?:\sa\.k\.a\.[^()]+)*)\))?"

with open('textData.txt') as f:
    textData = f.read()

d = {}
# With two capture groups, re.findall returns (name, aliases) tuples
for t in re.findall(pattern, textData, re.M):
    # Split group 2 on the "a.k.a." separators and drop empty strings
    parts = [p for p in re.split(r"(?:;\s)?a\.k\.a\.\s", t[1]) if p]
    parts.insert(0, t[0])
    d[t[0]] = parts

pprint.pprint(d)

Output
{'2ND COMPLEX OF NEURAL SCIENCES': ['2ND COMPLEX OF NEURAL SCIENCES',
                                    'ACADEMY OF NEURAL \nSCIENCES',
                                    'CHE 2 CHAON KAHAK-WON',
                                    'CHE 2 CHAYON KAHAK-WON',
                                    'KUKPAN KAHAK-WON',
                                    'NATIONAL DEFENSE ACADEMY',
                                    'SANSRI',
                                    'SECOND COMPLEX OF NEURAL SCIENCES',
                                    'SECOND\n'
                                    'COMPLEX OF NEURAL SCIENCES RESEARCH '
                                    'INSTITUTE'],
 '7 KARNES': ['7 KARNES'],
 'LOSTIK VE HAVAIK HIZMETLARI LTD': ['LOSTIK VE HAVAIK HIZMETLARI LTD'],
 'SWING OF TIR': ['SWING OF TIR',
                  '7TH OF TIR COMPLEX',
                  '7TH OF TIR INDUSTRIAL\nCOMPLEX',
                  '7TH OF TIR INDUSTRIES',
                  '7TH OF TIR INDUSTRIES\nOF ISFAHAN/ESFAHAN',
                  'MOJTAMAE SANATE HAFTOME TIR',
                  'SANAYE HAFTOME TIR',
                  'SEVENTH OF TIR']}
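Some values in this output still contain line breaks, because those aliases wrap across lines in the file. One possible cleanup, sketched here with a hypothetical helper named `normalize` and a shortened sample value, is to collapse whitespace in each part after splitting:

```python
import re

def normalize(value):
    """Collapse any run of whitespace (including newlines) into one space."""
    return ' '.join(value.split())

# A group-2 style value as captured from a wrapped record (shortened sample)
group2 = "a.k.a. ACADEMY OF NEURAL \nSCIENCES; a.k.a.\nSANSRI; a.k.a. SECOND\nCOMPLEX"

# Same split as in the answer, with each part normalized afterwards
parts = [normalize(p) for p in re.split(r"(?:;\s)?a\.k\.a\.\s", group2) if p]
print(parts)
```

In the full script this would mean wrapping `p` in `normalize(p)` inside the list comprehension that builds `parts`.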
