Most efficient way to loop through file values and check a dictionary for any/all corresponding instances

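I have a file containing usernames, one per line. I need to compare each name in that file against all the values in a csv file and record every place the username appears in the csv file. I need the search to be as efficient as possible, because the csv file is 40K lines long.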

My sample persons.txt file:

Smith, Robert
Samson, David
Martin, Patricia
Simpson, Marge

My sample locations.csv file:

GreaterLocation,LesserLocation,GroupName,DisplayName,InBook
NorthernHemisphere,UnitedStates,Pilots,"Wilbur, Andy, super pilot",Yes
WesternHemisphere,China,Pilots,"Kirby, Mabry, loves pizza",Yes
WesternHemisphere,Japan,Drivers,"Samson, David, big kahuna",Yes
NortherHemisphere,Canada,Drivers,"Randos, Jorge",Yes
SouthernHemispher,Australia,Mechanics,"Freeman, Gordon",Yes
NortherHemisphere,Mexico,Pilots,"Simpson, Marge",Yes
SouthernHemispher,New Zealand,Mechanics,"Samson, David",Yes

My code:

import csv

def parse_files():
    with open('data_file/persons.txt', 'r') as user_list:
        lines = user_list.readlines()
        for user_row in lines:
            new_user = user_row.strip()
            per = []
            # the csv file is re-opened and re-read once per user
            with open('data_file/locations.csv', newline='') as target_csv:
                DictReader_loc = csv.DictReader(target_csv)

                for loc_row in DictReader_loc:
                    if new_user.lower() in loc_row['DisplayName'].lower():
                        per.append(DictReader_loc.line_num)
                        print(DictReader_loc.line_num, loc_row['DisplayName'])
            if len(per) > 0:
                print("\n"+new_user, per)
    print("Parse Complete")

def main():
    parse_files()

main()

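My code currently works. With the sample data in the example files, it finds both instances of "Samson, David" and the one instance of "Simpson, Marge" in the locations.csv file. I am hoping someone can give me guidance on how to transform either the persons.txt file or the locations.csv file (40K+ lines) so that the process is as efficient as possible. I think it currently takes 10-15 minutes. I know looping is not the most efficient approach, but I do need to check every name and see where it appears in the csv file.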

I think @Tomalak's SQLite solution is very useful, but if you want to stay closer to your original code, see the version below.

Essentially, it cuts down on the amount of file opening/closing/reading that is going on, which will hopefully speed things up.

Since your sample is so small, I could not take any real measurements.
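If you want a rough number without touching the real data, one option is to time the matching on synthetic input of about the same size. This is only a sketch: the helper names (`make_names`, `count_matches`, `time_matching`), the random 8-letter names and the default of 100 users are all assumptions for illustration, not anything taken from your files.

```python
import random
import string
import time


def make_names(count, rng):
    """Build random 'Last, First' strings (synthetic stand-ins for DisplayName)."""
    def word():
        return "".join(rng.choices(string.ascii_lowercase, k=8)).capitalize()
    return [f"{word()}, {word()}" for _ in range(count)]


def count_matches(users, display_names):
    """Nested-loop substring matching on data that is already in memory."""
    lowered = [d.lower() for d in display_names]  # lower-case each row once
    return {u: sum(u.lower() in d for d in lowered) for u in users}


def time_matching(n_rows=40_000, n_users=100, seed=0):
    """Return (elapsed seconds, match counts) for one synthetic run."""
    rng = random.Random(seed)
    display_names = make_names(n_rows, rng)
    users = rng.sample(display_names, n_users)  # users guaranteed to occur
    start = time.perf_counter()
    counts = count_matches(users, display_names)
    return time.perf_counter() - start, counts


if __name__ == "__main__":
    elapsed, _ = time_matching()
    print(f"matched in {elapsed:.3f} s")
```

Adjust `n_users` to the real length of persons.txt to get a more representative figure.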

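Going forward, you might consider using pandas for this kind of task. It is very convenient for working with csv files and is more optimized than the csv module.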

import csv

def parse_files():
    with open('persons.txt', 'r') as user_list:
        # sets are faster to match against than lists
        # do the lower() here to avoid repetition; skip blank lines
        user_set = {u.strip().lower() for u in user_list if u.strip()}
    # read the csv once and keep the rows in memory; a DictReader can only
    # be iterated a single time, so it cannot be re-used for every user
    with open('locations.csv', 'r', newline='') as target_csv:
        DictReader_loc = csv.DictReader(target_csv)
        rows = [(DictReader_loc.line_num, loc_row['DisplayName'])
                for loc_row in DictReader_loc]
    for user in user_set:
        per = []
        for line_num, display_name in rows:
            if user in display_name.lower():
                per.append(line_num)
                print(line_num, display_name)
        if len(per) > 0:
            print("\n"+user, per)
    print("Parse Complete")

def main():
    parse_files()

main()
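If the nested loop is still too slow, another option is to scan each csv row only once, using a single compiled regular expression built from all the names. The sketch below is a swapped-in technique, not the code above: `find_matches` is a hypothetical helper name, and it assumes that no name is contained in or overlaps another name, because an alternation reports only one match per position.

```python
import csv
import io
import re
from collections import defaultdict


def find_matches(names, csv_file):
    """Map each lower-cased name to the csv line numbers whose DisplayName contains it."""
    wanted = {n.strip().lower() for n in names if n.strip()}  # skip blank lines
    if not wanted:
        return {}
    # Longest names first, so the alternation prefers the longest candidate.
    ordered = sorted(wanted, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(n) for n in ordered))
    hits = defaultdict(list)
    reader = csv.DictReader(csv_file)
    for row in reader:
        found = {m.group(0) for m in pattern.finditer(row["DisplayName"].lower())}
        for name in found:  # a set, so each row is recorded once per name
            hits[name].append(reader.line_num)
    return dict(hits)


if __name__ == "__main__":
    sample = io.StringIO(
        "GreaterLocation,LesserLocation,GroupName,DisplayName,InBook\n"
        'WesternHemisphere,Japan,Drivers,"Samson, David, big kahuna",Yes\n'
        'NortherHemisphere,Mexico,Pilots,"Simpson, Marge",Yes\n'
    )
    print(find_matches(["Samson, David", "Smith, Robert"], sample))
    # {'samson, david': [2]}
```

With real files you would pass the open file objects instead, for example `find_matches(open('persons.txt'), open('locations.csv', newline=''))`. The work per row no longer grows with a Python-level loop over every user, which is where most of the time goes in the nested version.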
