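I have a file of user names, one per line. I need to compare each name in that file against every value in a CSV file and record each time the name appears in the CSV. The search needs to be as efficient as possible, because the CSV file is 40K lines long.
My sample persons.txt file: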
Smith, Robert
Samson, David
Martin, Patricia
Simpson, Marge
My sample locations.csv file:
GreaterLocation,LesserLocation,GroupName,DisplayName,InBook
NorthernHemisphere,UnitedStates,Pilots,"Wilbur, Andy, super pilot",Yes
WesternHemisphere,China,Pilots,"Kirby, Mabry, loves pizza",Yes
WesternHemisphere,Japan,Drivers,"Samson, David, big kahuna",Yes
NortherHemisphere,Canada,Drivers,"Randos, Jorge",Yes
SouthernHemispher,Australia,Mechanics,"Freeman, Gordon",Yes
NortherHemisphere,Mexico,Pilots,"Simpson, Marge",Yes
SouthernHemispher,New Zealand,Mechanics,"Samson, David",Yes
My code:
import csv

def parse_files():
    with open('data_file/persons.txt', 'r') as user_list:
        lines = user_list.readlines()
        for user_row in lines:
            new_user = user_row.strip()
            per = []
            # the csv is reopened and rescanned for every user
            with open('data_file/locations.csv', newline='') as target_csv:
                DictReader_loc = csv.DictReader(target_csv)
                for loc_row in DictReader_loc:
                    if new_user.lower() in loc_row['DisplayName'].lower():
                        per.append(DictReader_loc.line_num)
                        print(DictReader_loc.line_num, loc_row['DisplayName'])
            if len(per) > 0:
                print("\n"+new_user, per)
    print("Parse Complete")

def main():
    parse_files()

main()
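My code currently works. With the sample data in the example files, it finds the two occurrences of "Samson, David" and the one occurrence of "Simpson, Marge" in locations.csv. I am hoping someone can give me guidance on how to transform either the persons.txt file or the locations.csv file (40K+ lines) so that the process is as efficient as possible. I think it currently takes 10-15 minutes. I know looping is not the most efficient approach, but I do need to check every name and see where it appears in the CSV file.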
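I think @Tomalak's SQLite solution is very useful, but if you want something closer to your original code, see the version below.
In effect, it reduces the number of file opens, closes, and reads, which should speed things up.
Since your sample is so small, I could not take any meaningful measurements.
Going forward, you could consider using pandas for this kind of task; it is very convenient for working with CSV files and is better optimized than the csv module.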
import csv

def parse_files():
    with open('persons.txt', 'r') as user_list:
        # a set drops duplicate names
        # do the lower() here to avoid repetition
        user_set = {u.strip().lower() for u in user_list if u.strip()}
    # read the csv once and keep the rows in memory: a DictReader is
    # exhausted after one pass, so it cannot be looped over again per user
    with open('locations.csv', 'r', newline='') as target_csv:
        DictReader_loc = csv.DictReader(target_csv)
        rows = [(DictReader_loc.line_num, loc_row['DisplayName'])
                for loc_row in DictReader_loc]
    for user in user_set:
        per = []
        for line_num, display_name in rows:
            if user in display_name.lower():
                per.append(line_num)
                print(line_num, display_name)
        if len(per) > 0:
            print("\n"+user, per)
    print("Parse Complete")

def main():
    parse_files()

main()
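If substring matching is not strictly required, the inner loop over all rows can probably be dropped altogether. The sketch below is not from the original code: it assumes every DisplayName starts with "Last, First" (optionally followed by more comma-separated text, as in the sample data), so each row can be reduced to a key and looked up in a dict in a single pass over the CSV. The names `name_key` and `find_matches` are made up for illustration.

```python
import csv
from collections import defaultdict


def name_key(display_name):
    """Reduce 'Last, First, extra text' to the lowercase key 'last, first'."""
    parts = [p.strip().lower() for p in display_name.split(",")]
    return ", ".join(parts[:2])


def find_matches(person_lines, csv_file):
    """Map each person to the csv line numbers where they appear.

    person_lines: iterable of 'Last, First' strings (e.g. an open persons.txt)
    csv_file: open file, or any iterable of lines, with a DisplayName column
    """
    # key -> original spelling, so the report shows the name as written
    wanted = {name_key(line): line.strip() for line in person_lines if line.strip()}
    hits = defaultdict(list)

    reader = csv.DictReader(csv_file)
    for row in reader:  # single pass over the csv
        key = name_key(row["DisplayName"])
        if key in wanted:  # dict lookup instead of a loop over every user
            hits[wanted[key]].append(reader.line_num)
    return dict(hits)
```

To use it with the files from the question, open persons.txt and locations.csv (the latter with `newline=''`), pass both file objects to `find_matches`, and print the returned dict. The trade-off is that a name appearing in the middle of a DisplayName would no longer be found, so this only fits if the "Last, First" prefix assumption holds for the real data.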