处理大型数据集 - 行操作和列操作



我有一个看起来像这样的数据集:

Teams_and_seasons

我的输出应该是这样的:Teams_and_seasons_output

数据集列出了8支球队不同赛季的积分。代码应该按点降序排序数据集,然后将团队和点连接起来,并在一行中显示结果。

我在Excel中写了一个vba宏,但是当行数达到250K时,Excel缓慢地爬行到蜗牛速度,花了2小时。我想把这个处理改为Python,使它运行得更快。但是,我不是Python Numpy或Pandas方面的专家,我正在寻求帮助来实现这一目标。

谢谢你的帮助。数据集以csv文件的形式提供。

我会发布一个解决方案,但我不确定你会得到什么样的时间增加。你得测试一下。此外,我相信有更简洁的解决方案使用pandas或其他东西。

基本策略是对每一行进行排序,并跟踪哪个队与哪个分数相关联。

同样,如果你是python的新手,如果代码没有意义,你可以阅读列表推导。它们是一行for循环。例子:

a = [i for i in range(5)] #-> [0, 1, 2, 3, 4]
b = [[i for i in range(3)] for j in range(2)] #->  [[0, 1, 2], [0, 1, 2]]
# parse the csv file - there is also a csv library you can look at
def parseCSV(fname):
# extract the rows from the file
lines = None 
with open(fname) as f:
lines = [line.strip().split(',') for line in f]
# uncomment to view the data structure to ensure its consistent  
for line in lines:
#print(line, sep=" ")
continue
# extract the teams and remove from lines
teams = lines[0]
lines = lines[1:]
# not part of output, but if you need the seasons 
seasons = [line[0] for line in lines]
# extract the data - parse it to integer 
scores = [[int(li) for li in line[1:]] for line in lines]
# return the values 
return teams, seasons, scores
def sortRow(row, teams):
#associated each score with the team that scored it
row = [(team_index, score) for team_index, score in enumerate(row)]
# sort the row based on score (x[1]) - lambda x: x is an inline function 
# where x is the input (in this case (team_index, score) and x[1] tells 
# python to sort by the score value (since its at position 1)
# the [::-1] at the end reverse the array since we want descending order
sorted_row = sorted(row, key=lambda x: x[1])[::-1]

# transform index and value to string format requested in output
return [f'{teams[team_index]}-{score}' for team_index, score in sorted_row] 

# tie it all together 
formatted_data = []
teams, seasons, score_matrix = parseCSV('scores.csv')
for row in score_matrix:
formatted_row = sortRow(row, teams)
formatted_data.append(formatted_row)
# save to csv file all at once - to add to existing file replace 'w' with 'a' 
with open('sorted_scores.csv', 'w') as out:
for row in formatted_data:
out.write(f'{",".join(row).strip()}n')

这是scores。csv文件

Team1, Team2, Team3, Team4, Team5, Team6, Team7, Team8
Season1, 22, 12, 12, 22, 14, 18, 28, 10 
Season2, 14, 30, 16, 16, 18, 14, 16, 24
Season3, 24, 24, 16, 16, 28, 30, 16, 16
Season4, 20, 20, 14, 24, 12, 18, 20, 22
Season5, 30, 28, 10, 16, 24, 14, 14, 18

最新更新