The input looks like this:
assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note string 1
assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note string 2
assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note string 3
assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4
assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5
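If you look closely, lines 4 and 5 are duplicates of each other; the only difference is that (resid 44 and name H ) and (resid 53 and name H ) are swapped. My ideal output would look something like this: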
assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note string 1
assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note string 2
assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note string 3
assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! DUPLICATE ! note string 4 ! note string 5
So I started with the typical way of reading a file in Python.
txt = open(filename)
lines = txt.readlines()
print ( lines[0] )
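I obviously need to capture the strings between ( and ) and then do some kind of search. I captured those with a regex, which is the easy part. My idea was to use match[0] and match[1] in a nested loop and search with them. My failed attempt was: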
for i in lines:
    # match = re.search("\(.*?\)", i)
    match = re.findall('\(.*?\)',i)
    for x in i:
        mm = re.search("match[0] match[1]", lines)
        print ( mm )
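If I print match[0] and match[1], they give me what I want. What is the best way to do this search so that I can keep and carry over the comment flags? I think adding DUPLICATE to the note string would be trivial.
I am really only interested in a Python solution. I also need to use this inside a 400-line program I have been writing.
Thanks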
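Someone more proficient with regex may give you a better implementation for extracting the keys, but storing a tuple as the key and reversing it to check whether it already exists should work: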
lines = """assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note string 1
assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note string 2
assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note string 3
assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4
assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5"""
import re
from pprint import pprint as pp

d = {}
# Split on the whitespace that follows a closing parenthesis.
r1 = re.compile(r"(?<=\))\s")
# Greedy match from the first "(" to the last ")".
r2 = re.compile(r"\(.*\)")
for line in lines.splitlines():
    key = tuple(r1.split(r2.findall(line)[0]))
    # ("foo","bar") == ("bar","foo") , also check current key is not in d
    if tuple(reversed(key)) not in d and key not in d:
        d[key] = line
pp(list(d.values()))
Output:
['assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note '
'string 3',
'assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note '
'string 2',
'assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note '
'string 1',
'assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note '
'string 4']
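As a possible alternative for the key extraction, here is a minimal sketch that pulls both selections out with a single findall instead of a greedy match followed by a split. The names SELECTION and pair_key are hypothetical, and the sketch assumes the selections contain no nested parentheses.

```python
import re

# Matches one parenthesised selection such as "(resid 3 and name H )".
# Assumption: selections never contain nested parentheses.
SELECTION = re.compile(r"\([^)]*\)")

def pair_key(line):
    """Return the first two selections of an assign line as a tuple (hypothetical helper)."""
    return tuple(SELECTION.findall(line)[:2])

a = "assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4"
b = "assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5"

print(pair_key(a))
# The swapped line yields the same pair in reverse order.
print(pair_key(a) == tuple(reversed(pair_key(b))))
```

Taking only the first two matches keeps the key stable even if a comment happens to contain parentheses.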
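If order matters, use collections.OrderedDict (on Python 3.7 and later a plain dict also preserves insertion order). I am not sure exactly what you want to add to the string, but this appends DUPLICATE ! note string 5 and so on to the value of the existing key: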
from collections import OrderedDict
from pprint import pprint as pp
import re

d = OrderedDict()
r1 = re.compile(r"(?<=\))\s")
r2 = re.compile(r"\(.*\)")
for line in lines.splitlines():
    key = tuple(r1.split(r2.findall(line)[0]))
    # (resid 44 and name H ) (resid 53 and name H ) -> (resid 53 and name H ) (resid 44 and name H )
    rev_k = tuple(reversed(key))
    if rev_k in d:
        # the last four tokens are the comment, e.g. "! note string 5"
        d[rev_k] += " DUPLICATE " + " ".join(line.rsplit(None,4)[1:])
    elif key in d:
        d[key] += " DUPLICATE " + " ".join(line.rsplit(None,4)[1:])
    else:
        d[key] = line
pp(list(d.values()))
Output:
['assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note '
'string 1',
'assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note '
'string 2',
'assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note '
'string 3',
'assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note '
 'string 4 DUPLICATE ! note string 5']
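To get the exact layout asked for in the question (one line per pair, flagged with ! DUPLICATE and followed by every original comment), a sketch along the following lines might work. merge_duplicates and SELECTION are hypothetical names, and the code assumes that every line is an assign statement and that the comment starts at the first "!".

```python
import re
from collections import OrderedDict

# Assumption: selections contain no nested parentheses.
SELECTION = re.compile(r"\([^)]*\)")

def merge_duplicates(lines):
    """Merge lines whose two selections match in either order (hypothetical helper)."""
    merged = OrderedDict()  # key -> (text before the comment, list of comments)
    for line in lines:
        # Assumption: the comment starts at the first "!".
        head, sep, note = line.rstrip().partition("!")
        # Sorting makes (A, B) and (B, A) produce the same key.
        key = tuple(sorted(SELECTION.findall(head)))
        comment = "! " + note.strip() if sep else None
        if key not in merged:
            merged[key] = (head.rstrip(), [])
        merged[key][1].append(comment)
    result = []
    for head, comments in merged.values():
        parts = [head]
        if len(comments) > 1:
            parts.append("! DUPLICATE")
        parts.extend(c for c in comments if c)
        result.append(" ".join(parts))
    return result

sample = """assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note string 1
assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4
assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5"""

for merged_line in merge_duplicates(sample.splitlines()):
    print(merged_line)
```

The first occurrence of a pair decides which orientation is kept, so the merged line starts with (resid 44 ...) here because that line came first.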
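Depending on what you want to do, you could instead append each duplicate line together with its DUPLICATE ! note string ... marker, so that the original line seen before any dup is the first element and the rest are all the DUPLICATE ! note string ... entries: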
lines = """assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note string 1
assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note string 2
assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note string 3
assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4
assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5
assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 6"""
import re
from collections import defaultdict
from pprint import pprint as pp

d = defaultdict(list)
r1 = re.compile(r"(?<=\))\s")
r2 = re.compile(r"\(.*\)")
for line in lines.splitlines():
    key = tuple(r1.split(r2.findall(line)[0]))
    rev_k = tuple(reversed(key))
    if rev_k in d:
        d[rev_k].append(line + " DUPLICATE " + " ".join(line.rsplit(None,4)[1:]))
    elif key in d:
        # append, not +=: += with a string would extend the list one character at a time
        d[key].append(line + " DUPLICATE " + " ".join(line.rsplit(None,4)[1:]))
    else:
        d[key].append(line)
pp(list(d.values()))
Output:
[['assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note '
'string 1'],
['assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note '
'string 4',
'assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note '
'string 5 DUPLICATE ! note string 5',
'assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note '
'string 6 DUPLICATE ! note string 6'],
['assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note '
'string 3'],
['assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note '
'string 2']]
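Build a simple dict (or OrderedDict) with the sorted values as the key and the whole line (or just the comment) as the value.
Let's assume this is the part you want to be unique: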
>>> re.findall(r"\(.*?\)", lns[3])
['(resid 44 and name H )', '(resid 53 and name H )']
So you can prepare an order-independent key:
>>> tmp1 = set(re.findall(r"\(.*?\)", lns[3])) # Line 4
>>> tmp2 = set(re.findall(r"\(.*?\)", lns[4])) # Line 5
>>> tmp1
{'(resid 44 and name H )', '(resid 53 and name H )'}
>>> tmp2
{'(resid 44 and name H )', '(resid 53 and name H )'}
>>> tmp1 == tmp2
True
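But a set is unhashable, so you have to convert it to, for example, a tuple so that it can be used as a dictionary key. Sorting the elements first keeps their order deterministic:
Dictionaries' keys are almost arbitrary values. Values that are not hashable, that is, values containing lists, dictionaries or other mutable types (that are compared by value rather than by object identity) may not be used as keys.
key = tuple(sorted(set(re.findall(r"\(.*?\)", lns[3]))))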
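A frozenset is another hashable option, swapped in here for the tuple conversion described above rather than taken from the answer itself. The sketch below uses two hypothetical sample lines in a list called lns; note that a frozenset ignores order and would collapse a pair whose two selections are identical into a single element.

```python
import re

# Hypothetical sample: the same pair of selections in swapped order.
lns = [
    "assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4",
    "assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5",
]

# frozenset is immutable and hashable, so it can be a dict key directly.
key_a = frozenset(re.findall(r"\(.*?\)", lns[0]))
key_b = frozenset(re.findall(r"\(.*?\)", lns[1]))

seen = {key_a: lns[0]}
print(key_a == key_b)  # True: the order of the selections does not matter
print(key_b in seen)   # True: the swapped line finds the stored entry
```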
Rather than just storing the lines, perhaps you also need to count how often each key occurs?
import re

result = {}
with open(filename, 'r') as file:
    for line in file:
        key = tuple(sorted(set(re.findall(r"\(.*?\)", line))))
        if key in result:
            result[key][1] += 1  # index 1 holds the count
        else:
            result[key] = [line.strip(), 1]
for line, count in result.values():
    print('Seen line', line, count, 'times')
Or store every line under its key:
import collections

result = collections.defaultdict(list)
# ...
        key = tuple(sorted(set(re.findall(r"\(.*?\)", line))))
        result[key].append(line.strip())
# And nice printing
for key, lines in result.items():
    print('Seen', key, 'on following lines:')
    for l in lines:
        print('\t', l)
    print()
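If only the repeated pairs are of interest, the grouping above can be filtered afterwards. The following is a minimal sketch with a hypothetical helper named duplicate_groups; it accepts any iterable of lines, so an open file object should work as well as a list.

```python
import re
from collections import defaultdict

def duplicate_groups(lines):
    """Return {key: [(line number, line), ...]} for keys seen more than once (hypothetical helper)."""
    groups = defaultdict(list)
    for number, line in enumerate(lines, start=1):
        # Sorted tuple: the same two selections give the same key in either order.
        key = tuple(sorted(re.findall(r"\(.*?\)", line)))
        groups[key].append((number, line.strip()))
    return {key: found for key, found in groups.items() if len(found) > 1}

sample = [
    "assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note string 3",
    "assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4",
    "assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5",
]

for key, found in duplicate_groups(sample).items():
    print("Duplicate pair", key)
    for number, line in found:
        print("\t", number, line)
```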