Swapping IDs and Python performance



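I'd appreciate help making my code run more efficiently. The code takes the first ID (RUID) in each row and replaces it with a de-identified ID (RESPID), using a key file that maps one to the other. The input data file is a large tab-delimited text file, about 2.5 GB. The data is very wide, with thousands of columns per row. I have a function that works, but it is extremely slow on the real data. My first file has been running for 4 days and the output has only reached 1.4 GB. I don't know which part of my code is the biggest problem, but I suspect it is where I rebuild the row and write each row out individually. Any suggestions for improvement would be greatly appreciated; 4 days of processing is far too long! Thanks very much.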

import datetime

def swap():
    #input files
    infile1 = open(r"Z:ped_test.txt", 'rb')
    keyfile = open(r"Z:ruid_respid_test.txt", 'rb')
    #output file
    outfile=open(r"Z:ped_testRESPID.txt", 'wb')
    # create dictionary of RUID-RESPID
    COLUMN = 1 #Column containing RUID
    RESPID={}
    for k in keyfile:
        kList = k.rstrip('\r\n').split('\t')
        if kList[0] not in RESPID and kList[0] != "":
            RESPID[kList[0]]=kList[1]
    #print RESPID
    print "creating RESPID-RUID xwalk dictionary is done"
    print "Start creating new file"
    print str(datetime.datetime.now())
    count=0
    for line in infile1:
        #if not re.match('#', line): #if there is a header
        sline = line.split()
        #slen = len(sline)
        RUID = sline[COLUMN]
        #print RUID
        C0 = sline[0]
        #print C0
        DAT=sline[2:]
        for key in RESPID:
            if key==RUID:
                NewID=RESPID[key]
        row=str(C0+'\t'+NewID)
        for a in DAT:
            row=row+'\t'+a
        #print row
        outfile.write(row)
        outfile.write('\n')
    infile1.close()
    keyfile.close()
    outfile.close()
    print "All Done: RESPID replacement is complete"
    print str(datetime.datetime.now())

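There are several places where you can speed this up. The main problem is that you enumerate every key in RESPID when you could simply read the value with the dictionary's `get` method. But because your lines are so wide, a few other tweaks will also make a difference.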

import datetime

def swap():
    #input files
    infile1 = open(r"Z:ped_test.txt", 'rb')
    keyfile = open(r"Z:ruid_respid_test.txt", 'rb')
    #output file
    outfile=open(r"Z:ped_testRESPID.txt", 'wb')
    # create dictionary of RUID-RESPID
    COLUMN = 1 #Column containing RUID
    RESPID={}
    for k in keyfile:
        kList = k.rstrip('\r\n').split('\t', 2)   # minor: just grab what you need
        if kList[0] and kList[0] not in RESPID: # minor: do the cheap test first
            RESPID[kList[0]]=kList[1]
    #print RESPID
    print "creating RESPID-RUID xwalk dictionary is done"
    print "Start creating new file"
    print str(datetime.datetime.now())
    for line in infile1:
        #if not re.match('#', line): #if there is a header
        # minor: just grab what you need; sline[2] is the untouched rest of
        # the line, including its line terminator
        sline = line.split('\t', 2)
        RUID = sline[COLUMN]
        # the biggie, just use a lookup
        #for key in RESPID:
        #   if key==RUID:
        #       NewID=RESPID[key]
        # fall back to the original ID when it is not in the key file
        row = '\t'.join([sline[0], RESPID.get(RUID, RUID), sline[2]])
        #row=str(C0+'\t'+NewID)
        #for a in DAT:
        #   row=row+'\t'+a
        #print row
        # no extra '\n' here: sline[2] already ends with the line terminator
        outfile.write(row)
    infile1.close()
    keyfile.close()
    outfile.close()
    print "All Done: RESPID replacement is complete"
    print str(datetime.datetime.now())
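
For anyone who wants to try the idea without the real files, here is a minimal sketch of the same two tweaks (a dictionary lookup with `get`, and splitting off only the first two columns). It is written for Python 3, unlike the Python 2 code in this thread. The helper names `build_id_map` and `swap_ids` and the sample rows are made up for illustration, and it assumes every data row has at least three tab-separated columns.

```python
def build_id_map(key_lines):
    """Build an {RUID: RESPID} dict from tab-delimited key lines."""
    id_map = {}
    for line in key_lines:
        fields = line.rstrip('\r\n').split('\t')
        # Skip blank or malformed lines; keep the first mapping seen per RUID.
        if len(fields) >= 2 and fields[0] and fields[0] not in id_map:
            id_map[fields[0]] = fields[1]
    return id_map


def swap_ids(data_lines, id_map):
    """Yield each data line with its second column replaced via id_map."""
    for line in data_lines:
        # Split off only the first two columns; the rest of the wide row
        # (including its line terminator) stays as one untouched string.
        # Assumption: every row has at least three columns.
        first, ruid, rest = line.split('\t', 2)
        # Assumption: IDs missing from the key file are passed through as-is.
        yield '\t'.join((first, id_map.get(ruid, ruid), rest))


# Hypothetical sample data standing in for the key file and the wide data file.
key_lines = ["R1\tX9\n", "R2\tX7\n", "\tignored\n", "R1\tduplicate\n"]
data_lines = ["fam1\tR1\tA\tG\tC\n", "fam2\tR2\tT\tT\tA\n", "fam3\tR3\tG\tG\tG\n"]

id_map = build_id_map(key_lines)
result = list(swap_ids(data_lines, id_map))
```

With real files you would pass the open file objects instead of the lists and write the output with something like `outfile.writelines(swap_ids(infile, id_map))`, so that only one row is held in memory at a time.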

You don't need to iterate over RESPID. Replace:

for key in RESPID:
    if key==RUID:
        NewID=RESPID[key]

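with

NewID = RESPID[RUID]

It does the same thing, because the key you are looking for is always the RUID. I'm sure this will cut the program's running time considerably, because RESPID is huge and you are checking every key once for each line in "ped_test.txt". One caveat: if an RUID is missing from the key file, RESPID[RUID] raises a KeyError, so use RESPID.get(RUID, RUID) if you would rather keep the original ID in that case.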
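
A small sketch of that equivalence, in Python 3. The dictionary contents and the function names `lookup_by_scan` and `lookup_direct` are invented for the illustration; only the two lookup styles come from the thread.

```python
# Hypothetical key table: R0 -> X0, R1 -> X1, ..., R999 -> X999.
RESPID = {"R%d" % i: "X%d" % i for i in range(1000)}


def lookup_by_scan(ruid):
    """The loop from the question: compare every key against the ID."""
    new_id = None
    for key in RESPID:
        if key == ruid:
            new_id = RESPID[key]
    return new_id


def lookup_direct(ruid):
    """A single hash lookup; returns None when the ID is absent."""
    return RESPID.get(ruid)


scan_result = lookup_by_scan("R742")
direct_result = lookup_direct("R742")
missing_result = lookup_direct("not-in-key-file")

# Plain indexing raises KeyError for an ID that is not in the dictionary.
try:
    RESPID["not-in-key-file"]
    raised_key_error = False
except KeyError:
    raised_key_error = True
```

Both functions return the same value for an ID that is present, but the scan touches every key on every call, while the direct lookup does roughly constant work regardless of how large the dictionary is.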
