Python: find duplicates and preserve the comment strings



The input looks like this:

assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note string 1
assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note string 2
assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note string 3
assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4
assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5

If you look closely, line 4 and line 5 are duplicates; only (resid 44 and name H ) and (resid 53 and name H ) are swapped. My ideal output would look something like this:

assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note string 1
assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note string 2
assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note string 3
assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! DUPLICATE ! note string 4 ! note string 5

So I started with the typical way of reading a file in Python.

txt = open(filename)
lines = txt.readlines()
print ( lines[0] )

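I obviously need to capture the strings between the () and then do some kind of search. I captured them with a regex; that part is child's play. My idea was to use match[0] and match[1] in a nested loop and search for them. My failed attempt: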

for i in lines:
#   match = re.search("\(.*?\)", i)
    match = re.findall('\(.*?\)',i)
    for x in i:
        mm = re.search("match[0] match[1]", lines)
        print ( mm )

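If I print match[0] and match[1], they give me what I want. What is the best way to do this search so that I can keep and carry over the comment strings? I think adding DUPLICATE to the note string will be trivial.

I am really only interested in a Python solution. I also need to use this inside a 400-line program I have been writing.

Thanks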

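Someone more proficient with regex may give you a better way to extract the keys, but storing a tuple as the key and reversing it to check whether it already exists should work: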

lines = """assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note string 1
assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note string 2
assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note string 3
assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4
assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5"""
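import re
from pprint import pprint as pp

d = {}
r1 = re.compile(r"(?<=\))\s")  # whitespace right after a closing parenthesis
r2 = re.compile(r"\(.*\)")     # greedy: from the first "(" to the last ")"
for line in lines.splitlines():
    key = tuple(r1.split(r2.findall(line)[0]))
    # ("foo", "bar") and ("bar", "foo") count as the same key,
    # so check both the reversed key and the key itself
    if tuple(reversed(key)) not in d and key not in d:
        d[key] = line
pp(list(d.values()))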

Output:

['assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note '
 'string 3',
 'assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note '
 'string 2',
 'assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note '
 'string 1',
 'assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note '
 'string 4']

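If order matters, use collections.OrderedDict. I am not sure exactly what you want to add to the string, but this appends DUPLICATE ! note string 5 and so on to the value already stored for the key: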

from collections import OrderedDict
from pprint import pprint as pp
import re

d = OrderedDict()
r1 = re.compile(r"(?<=\))\s")  # whitespace right after a closing parenthesis
r2 = re.compile(r"\(.*\)")     # greedy: from the first "(" to the last ")"
for line in lines.splitlines():
    key = tuple(r1.split(r2.findall(line)[0]))
    # (resid 44 and name H ) (resid 53 and name H ) -> (resid 53 and name H ) (resid 44 and name H )
    rev_k = tuple(reversed(key))
    if rev_k in d:
        d[rev_k] += " DUPLICATE " + " ".join(line.rsplit(None, 4)[1:])
    elif key in d:
        d[key] += " DUPLICATE " + " ".join(line.rsplit(None, 4)[1:])
    else:
        d[key] = line
pp(list(d.values()))

Output:

['assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note '
 'string 1',
 'assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note '
 'string 2',
 'assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note '
 'string 3',
 'assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note '
 'string 4 DUPLICATE ! note string 5']
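If the goal is the exact layout from the question (`! DUPLICATE ! note string 4 ! note string 5`), one possible approach is to keep the part before the comment and the comments separately and only join them at the end. The following is a minimal sketch, not the answer's own code: the names `PAREN`, `merge_duplicates` and `sample` are made up for illustration, it assumes every line carries exactly one `!` comment, and it relies on plain dicts preserving insertion order (Python 3.7+).

```python
import re

PAREN = re.compile(r"\(.*?\)")  # one parenthesised selection, non-greedy

def merge_duplicates(lines):
    """Merge lines whose two selections match in either order."""
    merged = {}  # key -> (body, [comments]); insertion order is kept
    for line in lines:
        # Assumption: the first "!" starts the comment.
        body, _, comment = line.partition("!")
        # Sorting makes (A, B) and (B, A) produce the same key.
        key = tuple(sorted(PAREN.findall(body)))
        if key in merged:
            merged[key][1].append(comment.strip())
        else:
            merged[key] = (body.rstrip(), [comment.strip()])
    out = []
    for body, comments in merged.values():
        tag = ["DUPLICATE"] if len(comments) > 1 else []
        out.append(" ! ".join([body] + tag + comments))
    return out

sample = """assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note string 1
assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note string 2
assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note string 3
assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4
assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5""".splitlines()

result = merge_duplicates(sample)
for merged_line in result:
    print(merged_line)
```

The first occurrence decides which orientation of the pair is kept, so the merged line starts with the text of line 4 rather than line 5.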

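Depending on what you want to do, you could instead append the original line plus DUPLICATE ! string ... each time, so the original string seen before any dup is the first element and the remaining elements are all the DUPLICATE ! string ... entries: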

lines = """assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note string 1
assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note string 2
assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note string 3
assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4
assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5
assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 6"""
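import re
from collections import defaultdict
from pprint import pprint as pp

d = defaultdict(list)
r1 = re.compile(r"(?<=\))\s")  # whitespace right after a closing parenthesis
r2 = re.compile(r"\(.*\)")     # greedy: from the first "(" to the last ")"
for line in lines.splitlines():
    key = tuple(r1.split(r2.findall(line)[0]))
    rev_k = tuple(reversed(key))
    if rev_k in d:
        d[rev_k].append(line + " DUPLICATE " + " ".join(line.rsplit(None, 4)[1:]))
    elif key in d:
        # append to the list; "+=" with a string would add it character by character
        d[key].append(line + " DUPLICATE " + " ".join(line.rsplit(None, 4)[1:]))
    else:
        d[key].append(line)

pp(list(d.values()))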

Output:

[['assign (resid 3 and name H ) (resid 18 and name H ) 2.5 2.5 2.5 ! note '
  'string 1'],
 ['assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note '
  'string 4',
  'assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note '
  'string 5 DUPLICATE ! note string 5',
  'assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note '
  'string 6 DUPLICATE ! note string 6'],
 ['assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note '
  'string 3'],
 ['assign (resid 16 and name H ) (resid 5 and name H ) 2.5 2.5 2.5 ! note '
  'string 2']]

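Build a simple dictionary (or OrderedDict) with the sorted values as the key and the whole line (or just the comment) as the value.

Let's assume this is the part you want to be unique: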

>>> re.findall(r"\(.*?\)", lns[3])
['(resid 44 and name H )', '(resid 53 and name H )']

So you can prepare an order-independent key:

>>> tmp1 = set(re.findall(r"\(.*?\)", lns[3])) # Line 4
>>> tmp2 = set(re.findall(r"\(.*?\)", lns[4])) # Line 5
>>> tmp1
{'(resid 44 and name H )', '(resid 53 and name H )'}
>>> tmp2
{'(resid 44 and name H )', '(resid 53 and name H )'}
>>> tmp1 == tmp2
True

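But a set is not hashable, so you have to convert it to something hashable, e.g. a tuple, before it can be used as a dictionary key. As the Python documentation puts it:

A dictionary's keys are almost arbitrary values. Values that are not hashable, that is, values containing lists, dictionaries or other mutable types (that are compared by value rather than by object identity) may not be used as keys.

Sort the set first so that both lines always produce the same tuple:

key = tuple(sorted(set(re.findall(r"\(.*?\)", lns[3]))))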
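As a possible alternative to converting the set into a tuple, a `frozenset` is hashable and compares equal regardless of element order, so it can serve as the dictionary key directly. The sketch below only illustrates that point; `PAREN`, `line4`, `line5` and `first_seen` are names chosen here, not part of the answer.

```python
import re

PAREN = re.compile(r"\(.*?\)")  # one parenthesised selection, non-greedy

line4 = "assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4"
line5 = "assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5"

# frozenset ignores order, so the swapped pair yields an equal key.
key4 = frozenset(PAREN.findall(line4))
key5 = frozenset(PAREN.findall(line5))

# A plain (mutable) set is rejected as a dictionary key.
try:
    {set(PAREN.findall(line4)): line4}
    set_is_hashable = True
except TypeError:
    set_is_hashable = False

first_seen = {key4: line4}
is_duplicate = key5 in first_seen

print(key4 == key5, set_is_hashable, is_duplicate)
```

One caveat: a line that names the same selection twice collapses to a single-element frozenset, which may or may not be what you want.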

Perhaps you need not only to store the lines but also to count how often each key occurs?

import re

result = {}
with open(filename, 'r') as file:
    for line in file:
        # sorted() gives the same tuple whichever order the pair appears in
        key = tuple(sorted(set(re.findall(r"\(.*?\)", line))))
        if key in result:
            result[key][1] += 1  # index 1 holds the counter
        else:
            result[key] = [line.strip(), 1]
for line, count in result.values():
    print('Seen line', line, count, 'times')
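If only the counts are needed, `collections.Counter` from the standard library can do the bookkeeping. This is a rough sketch under the same assumptions as above (two parenthesised selections per line); `PAREN`, `count_pairs` and `sample` are illustrative names, and the lines are passed in as a list instead of being read from a file.

```python
import re
from collections import Counter

PAREN = re.compile(r"\(.*?\)")  # one parenthesised selection, non-greedy

def count_pairs(lines):
    """Count how often each pair occurs, ignoring the order within a pair."""
    return Counter(tuple(sorted(PAREN.findall(line))) for line in lines)

sample = [
    "assign (resid 42 and name H ) (resid 55 and name H ) 2.5 2.5 2.5 ! note string 3",
    "assign (resid 44 and name H ) (resid 53 and name H ) 2.5 2.5 2.5 ! note string 4",
    "assign (resid 53 and name H ) (resid 44 and name H ) 2.5 2.5 2.5 ! note string 5",
]

counts = count_pairs(sample)
# Keep only the pairs that occur more than once.
repeated = {key: n for key, n in counts.items() if n > 1}
print(repeated)
```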

Or store every line under its key:

import collections
import re

result = collections.defaultdict(list)
# ...
        key = tuple(sorted(set(re.findall(r"\(.*?\)", line))))
        result[key].append(line.strip())
# And nice printing
for key, lines in result.items():
    print('Seen', key, 'on following lines:')
    for l in lines:
        print('\t', l)
    print()
