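Hi, I have text in the following format. I want to store each name (for example, 2ND COMPLEX OF NEURAL SCIENCES) together with its aliases in a dictionary, with the original name included in the list of values, in the format shown below.
I tried to do this with the following code, but could not extract the pattern:
re.findall(r'[a-z A-z 0-9 /n/-]+', ^[a.k.a.][a-z A-z 0-9 /n/-]+', textData)
re.findall(r'a.k.a. : (\S+)', textData)
I have no idea how to do this. Could someone help me?
# Expected output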
"2ND COMPLEX OF NEURAL SCIENCES":["2ND COMPLEX OF NATURAL NEURAL", "ACADEMY OF NEURAL
SCIENCES", "CHE 2 CHAON KWAHAK-WON", "KUKPAN KAHAK-WON", "SECOND COMPLEX OF NEURAL SCIENCES
RESEARCH INSTITUTE"]
"LOSTIK VE HAVAIK HIZMETLARI LTD":["LOSTIK VE HAVAIK HIZMETLARI LTD"]
"7 KARNES":["7 KARNES"]
"SWING OF TIR":["7TH OF TIR COMPLEX", "7TH OF TIR INDUSTRIAL COMPLEX", "7TH OF TIR
INDUSTRIES", "7TH OF TIR INDUSTRIES OF ISFAHAN/ESFAHAN", "MOJTAMAE SANATE HAFTOME TIR" etc]
# textData.txt
2ND COMPLEX OF NEURAL SCIENCES (a.k.a. ACADEMY OF NEURAL
SCIENCES; a.k.a. CHE 2 CHAON KAHAK-WON; a.k.a. CHE 2 CHAYON KAHAK-WON;
a.k.a. KUKPAN KAHAK-WON; a.k.a. NATIONAL DEFENSE ACADEMY; a.k.a.
SANSRI; a.k.a. SECOND COMPLEX OF NEURAL SCIENCES; a.k.a. SECOND
COMPLEX OF NEURAL SCIENCES RESEARCH INSTITUTE), Pyongyang, Korea,
North; Secondary sanctions risk: North Korea Sanctions Regulations,
sections 510.201 and 510.210; Transactions Prohibited For Persons
Owned or Controlled By U.S. Financial Institutions: North Korea
Sanctions Regulations section 510.214.
LOSTIK VE HAVAIK HIZMETLARI LTD., No. 3/182 Antepe
Bagdat Cad. Istasyon Yolu Sok., Istanbul 34840, Turkey; Additional
Sanctions Information - Subject to Secondary Sanctions.
[IFSR] (Linked To: MAHAN AIR).
7 KARNES, Avenida Ciudad de Cali No. 15A-91, Local A06-07, Bogota,
Colombia; Matricula Mercantil No 1978075 (Colombia).
SWING OF TIR (a.k.a. 7TH OF TIR COMPLEX; a.k.a. 7TH OF TIR INDUSTRIAL
COMPLEX; a.k.a. 7TH OF TIR INDUSTRIES; a.k.a. 7TH OF TIR INDUSTRIES
OF ISFAHAN/ESFAHAN; a.k.a. MOJTAMAE SANATE HAFTOME TIR; a.k.a.
SANAYE HAFTOME TIR; a.k.a. SEVENTH OF TIR), Mobarakeh Road Km 45,
Isfahan, Iran; P.O. Box 81465-478, Isfahan, Iran; Additional
Sanctions Information - Subject to Secondary Sanctions.
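You seem to be confused about what square brackets mean. Perhaps review the difference between square brackets and parentheses in regular expressions?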
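To make the distinction concrete, here is a minimal sketch, assuming a shortened one-line sample taken from the data in the question. The variable names are illustrative only. Square brackets build a character class that matches a single character, while parentheses build a group around a sub-pattern.

```python
import re

sample = "SWING OF TIR (a.k.a. 7TH OF TIR COMPLEX; a.k.a. SEVENTH OF TIR)"

# Square brackets define a character class: [a.k.a.] matches ONE character
# that is 'a', 'k' or a literal '.', not the text "a.k.a.".
char_class_hits = re.findall(r"[a.k.a.]", sample)

# Parentheses define a group. The literal text is spelled out with escaped
# dots, and the group captures what follows up to the next ';' or ')'.
group_hits = re.findall(r"a\.k\.a\. ([^;)]+)", sample)

print(char_class_hits)  # single characters only
print(group_hits)       # whole alias strings
```

The first call returns only individual characters, which is why the attempt in the question cannot pick out whole aliases.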
Your requirements are not entirely clear, but perhaps something like this?
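import re

with open('textData.txt', 'r') as lines:
    text = lines.read()

# Records are separated by a blank line.
for segment in text.split('\n\n'):
    # Join the wrapped lines of one record into a single line.
    para = ' '.join(segment.splitlines())
    if para:
        # The name is everything before the first ", " or " (".
        name = re.match(r'^[^,()]+(?=, | \()', para)
        if name:
            akas = [name.group(0)]
            # Each alias runs from "a.k.a. " up to the next ';' or ')'.
            akas.extend(re.findall(r'(?<=a\.k\.a\. )([^;)]+)', para))
            print('"%s": ["%s"]' % (name.group(0), '", "'.join(akas)))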
This assumes that each record is separated from the others by a blank line, and that the file is small enough to fit in memory.
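If the file were too large to read in one go, the same idea could be applied line by line. The following is only a sketch under the same blank-line assumption. The helper names iter_records and parse_record are made up for illustration, and io.StringIO stands in for the real file.

```python
import io
import re
from itertools import groupby


def iter_records(lines):
    """Yield each blank-line-separated record as one space-joined string."""
    for is_blank, group in groupby(lines, key=lambda line: not line.strip()):
        if not is_blank:
            yield ' '.join(line.strip() for line in group)


def parse_record(para):
    """Return (name, [name, alias, ...]), or None if no name is found."""
    name = re.match(r'^[^,()]+(?=, | \()', para)
    if not name:
        return None
    akas = [name.group(0)]
    akas.extend(re.findall(r'(?<=a\.k\.a\. )([^;)]+)', para))
    return name.group(0), akas


# Stand-in for open('textData.txt'); any iterable of lines works.
sample = io.StringIO(
    "SWING OF TIR (a.k.a. 7TH OF TIR COMPLEX; a.k.a.\n"
    "SEVENTH OF TIR), Mobarakeh Road Km 45, Isfahan, Iran.\n"
    "\n"
    "7 KARNES, Avenida Ciudad de Cali No. 15A-91, Bogota,\n"
    "Colombia.\n"
)

result = dict(filter(None, map(parse_record, iter_records(sample))))
print(result)
```

Because groupby consumes the lines lazily, only one record is held in memory at a time, and the result is a dictionary rather than printed text.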
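You can use 2 capture groups, then split the value of group 2 on (?:;\s)?a\.k\.a\.\s to get the separate values.
When the pattern contains capture groups, re.findall returns the group values (here, one tuple of two strings per match).
^([A-Z0-9](?:[A-Z0-9 ]*[A-Z0-9])?\b)(?: \((a\.k\.a\.[^()]+(?:\sa\.k\.a\.[^()]+)*)\))?
The pattern matches:
^ : start of the string (with re.M, the start of each line)
( : capture group 1
[A-Z0-9](?:[A-Z0-9 ]*[A-Z0-9])?\b : match uppercase letters, digits and spaces, beginning and ending with a letter or digit and followed by a word boundary
) : close group 1
(?: : non-capturing group
a space, then \( : match a space followed by a literal (
( : capture group 2
a\.k\.a\.[^()]+(?:\sa\.k\.a\.[^()]+)* : match a.k.a. followed by any characters except ( and ), optionally repeated for further a.k.a. parts
) : close group 2
\) : match a literal )
)? : close the non-capturing group and make it optional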
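For example:
import re
import pprint

pattern = r"^([A-Z0-9](?:[A-Z0-9 ]*[A-Z0-9])?\b)(?: \((a\.k\.a\.[^()]+(?:\sa\.k\.a\.[^()]+)*)\))?"

with open('textData.txt') as f:
    textData = f.read()

d = {}
# Each match is a tuple: (name, text between the parentheses or '').
for t in re.findall(pattern, textData, re.M):
    # Split the alias text on the "a.k.a. " markers and drop empty pieces.
    parts = [p for p in re.split(r"(?:;\s)?a\.k\.a\.\s", t[1]) if p]
    # The original name goes first in the list of values.
    parts.insert(0, t[0])
    d[t[0]] = parts

pprint.pprint(d)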
Output
{'2ND COMPLEX OF NEURAL SCIENCES': ['2ND COMPLEX OF NEURAL SCIENCES',
                                    'ACADEMY OF NEURAL \nSCIENCES',
                                    'CHE 2 CHAON KAHAK-WON',
                                    'CHE 2 CHAYON KAHAK-WON',
                                    'KUKPAN KAHAK-WON',
                                    'NATIONAL DEFENSE ACADEMY',
                                    'SANSRI',
                                    'SECOND COMPLEX OF NEURAL SCIENCES',
                                    'SECOND\n'
                                    'COMPLEX OF NEURAL SCIENCES RESEARCH '
                                    'INSTITUTE'],
 '7 KARNES': ['7 KARNES'],
 'LOSTIK VE HAVAIK HIZMETLARI LTD': ['LOSTIK VE HAVAIK HIZMETLARI LTD'],
 'SWING OF TIR': ['SWING OF TIR',
                  '7TH OF TIR INDUSTRIAL\nCOMPLEX',
                  '7TH OF TIR INDUSTRIES',
                  '7TH OF TIR INDUSTRIES\nOF ISFAHAN/ESFAHAN',
                  'MOJTAMAE SANATE HAFTOME TIR',
                  'SANAYE HAFTOME TIR',
                  'SEVENTH OF TIR']}
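Aliases that wrap across lines in the input keep their line break in this output, whereas the expected output in the question shows them on one line. A possible post-processing step, sketched here on a small hand-written excerpt of the dictionary (the helper name normalize is an assumption, not part of the answer), collapses every run of whitespace to a single space.

```python
import re


def normalize(alias):
    """Collapse any run of whitespace (including newlines) to one space."""
    return re.sub(r"\s+", " ", alias).strip()


# A small excerpt shaped like the dictionary printed above.
d = {
    "SWING OF TIR": [
        "SWING OF TIR",
        "7TH OF TIR INDUSTRIAL\nCOMPLEX",
        "7TH OF TIR INDUSTRIES\nOF ISFAHAN/ESFAHAN",
    ],
    "2ND COMPLEX OF NEURAL SCIENCES": [
        "2ND COMPLEX OF NEURAL SCIENCES",
        "ACADEMY OF NEURAL \nSCIENCES",
    ],
}

cleaned = {name: [normalize(a) for a in aliases] for name, aliases in d.items()}
print(cleaned)
```

The same call could be applied to each element of parts before it is stored in d.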