Processing a CSV file with nearly identical records but different times: need to group them into one record



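I am trying to solve the lab below and am running into problems. The problem involves a CSV input, and the solution needs to meet several criteria. Any help or hints would be appreciated. My code and its output are at the end of the question.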

Each row contains the title, rating, and all showtimes of a unique movie.
A space is placed before and after each vertical separator ('|') in each row.
Column 1 displays the movie titles and is left justified with a minimum of 44 characters.
If the movie title has more than 44 characters, output the first 44 characters only.
Column 2 displays the movie ratings and is right justified with a minimum of 5 characters.
Column 3 displays all the showtimes of the same movie, separated by a space.

Here is the input:

16:40,Wonders of the World,G
20:00,Wonders of the World,G
19:00,End of the Universe,NC-17
12:45,Buffalo Bill And The Indians or Sitting Bull's History Lesson,PG
15:00,Buffalo Bill And The Indians or Sitting Bull's History Lesson,PG
19:30,Buffalo Bill And The Indians or Sitting Bull's History Lesson,PG
10:00,Adventure of Lewis and Clark,PG-13
14:30,Adventure of Lewis and Clark,PG-13
19:00,Halloween,R

Here is the expected output:

Wonders of the World                         |     G | 16:40 20:00
End of the Universe                          | NC-17 | 19:00
Buffalo Bill And The Indians or Sitting Bull |    PG | 12:45 15:00 19:30
Adventure of Lewis and Clark                 | PG-13 | 10:00 14:30
Halloween                                    |     R | 19:00

My code:

import csv
rawMovies = input()
repeatList = []
with open(rawMovies, 'r') as movies:
    moviesList = csv.reader(movies)
    for movie in moviesList:
        time = movie[0]
        #print(time)
        show = movie[1]
        if len(show) > 45:
            show = show[0:44]
        #print(show)
        rating = movie[2]
        #print(rating)
        print('{0: <44} | {1: <6} | {2}'.format(show, rating, time))

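My output does not right-align the rating, and I don't know how to filter out the duplicate movies without dropping the showtimes from the list: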

Wonders of the World                         | G      | 16:40
Wonders of the World                         | G      | 20:00
End of the Universe                          | NC-17  | 19:00
Buffalo Bill And The Indians or Sitting Bull | PG     | 12:45
Buffalo Bill And The Indians or Sitting Bull | PG     | 15:00
Buffalo Bill And The Indians or Sitting Bull | PG     | 19:30
Adventure of Lewis and Clark                 | PG-13  | 10:00
Adventure of Lewis and Clark                 | PG-13  | 14:30
Halloween                                    | R      | 19:00

You can collect the input into a dictionary, using the (title, rating) tuple as the key and gathering the showtimes in a list, then print the consolidated information. For example (you will need to adjust the filename):

import csv
movies = {}
with open("data.csv", "r") as file:
    for showtime, title, rating in csv.reader(file):
        movies.setdefault((title, rating), []).append(showtime)
for (title, rating), showtimes in movies.items():
    print(f"{title[:44]: <44} | {rating: >5} | {' '.join(showtimes)}")

Output:

Wonders of the World                         |     G | 16:40 20:00
End of the Universe                          | NC-17 | 19:00
Buffalo Bill And The Indians or Sitting Bull |    PG | 12:45 15:00 19:30
Adventure of Lewis and Clark                 | PG-13 | 10:00 14:30
Halloween                                    |     R | 19:00
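
As a quick check of the dictionary approach, here is a minimal sketch that reads from an in-memory string instead of a file. The sample rows and the names `sample` and `lines` are made up for illustration. It suggests that rows for the same movie get merged even when they are not adjacent, and that movies come out in the order they were first seen (dictionaries keep insertion order in Python 3.7+).

```python
import csv
import io

# Hypothetical in-memory sample; the second "Halloween" row is deliberately
# not adjacent to the first one.
sample = io.StringIO(
    "19:00,Halloween,R\n"
    "16:40,Wonders of the World,G\n"
    "21:30,Halloween,R\n"
)

# Group showtimes under a (title, rating) key, as in the answer above.
movies = {}
for showtime, title, rating in csv.reader(sample):
    movies.setdefault((title, rating), []).append(showtime)

# Build one formatted line per movie.
lines = [
    f"{title[:44]: <44} | {rating: >5} | {' '.join(showtimes)}"
    for (title, rating), showtimes in movies.items()
]
print("\n".join(lines))
```

Both Halloween showtimes end up on a single line, and that line is printed first because Halloween appears first in the sample.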

Since the input appears to come in contiguous blocks, you could also use itertools.groupby (from the standard library) and print while reading:

import csv
from itertools import groupby
from operator import itemgetter
with open("data.csv", "r") as file:
    for (title, rating), group in groupby(
        csv.reader(file), key=itemgetter(1, 2)
    ):
        showtimes = " ".join(time for time, *_ in group)
        print(f"{title[:44]: <44} | {rating: >5} | {showtimes}")

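Consider the maximum length of the rating strings (5 in the sample, for NC-17). Subtract the length of the current rating from that value, create a string of spaces of that length, and append the rating. So basically

your_desired_str = ' '*(5-len(Rating))+Rating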
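
To make the padding idea concrete, here is a small sketch that compares the manual approach with two built-in alternatives. It assumes a column width of 5, taken from the lab description; the variable names are made up for illustration.

```python
rating = "PG"
width = 5  # minimum column width, assumed from the lab description

# Manual padding: prepend the missing number of spaces.
manual = " " * (width - len(rating)) + rating

# The same result using str.rjust and a format specification.
with_rjust = rating.rjust(width)
with_format = f"{rating:>{width}}"

print(repr(manual), repr(with_rjust), repr(with_format))
```

All three expressions produce the same right-aligned string, so the format specification `>5` inside the f-string is usually the shortest way to write it.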

You can also replace

'somestr {value}'.format(value=value)

with an f-string, which is easier to read:

f'somestr {value}'

Here is what I ended up with, based on some suggestions from the community.

import csv

rawMovies = input()
outputList = []
with open(rawMovies, 'r') as movies:
    moviesList = csv.reader(movies)
    movieold = [' ', ' ', ' ']
    for movie in moviesList:
        if movieold[1] == movie[1]:
            # Same title as the previous row: append the showtime
            outputList[-1][2] += ' ' + movie[0]
        else:
            time = movie[0]
            # print(time)
            show = movie[1]
            if len(show) > 44:
                show = show[0:44]
            # print(show)
            rating = movie[2]
            outputList.append([show, rating, time])
        movieold = movie
        # print(rating)
#print(outputList)
for movie in outputList:
    print('{0: <44} | {1: <5} | {2}'.format(movie[0], movie[1].rjust(5), movie[2]))

I would use Python's groupby() function, which groups consecutive rows that share the same value.

For example:

import csv
from itertools import groupby

with open('movies.csv') as f_movies:
    csv_movies = csv.reader(f_movies)

    for title, entries in groupby(csv_movies, key=lambda x: x[1]):
        movies = list(entries)
        showtimes = ' '.join(row[0] for row in movies)
        rating = movies[0][2]

        print(f"{title[:44]: <44} | {rating: >5} | {showtimes}")

This gives you:

Wonders of the World                         |     G | 16:40 20:00
End of the Universe                          | NC-17 | 19:00
Buffalo Bill And The Indians or Sitting Bull |    PG | 12:45 15:00 19:30
Adventure of Lewis and Clark                 | PG-13 | 10:00 14:30
Halloween                                    |     R | 19:00

So how does groupby() work?

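When reading a CSV file, you get one row at a time. What groupby() does is group the rows into small lists of rows that share the same value. The value it looks at is given by the key parameter. In this case, the lambda function is passed one row at a time and returns the current value of x[1], which is the title. groupby() keeps reading rows until that value changes. It then returns the rows collected so far as entries, which is an iterator.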
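
The behaviour described above can be seen on a tiny in-memory list. This is only an illustrative sketch: the rows and the names `rows` and `groups` are made up, and the last row deliberately repeats an earlier title to show that groupby() only merges rows that are next to each other.

```python
from itertools import groupby

# Hypothetical rows in the same column order as the CSV: time, title, rating.
rows = [
    ["16:40", "A", "G"],
    ["20:00", "A", "G"],
    ["19:00", "B", "R"],
    ["21:00", "A", "G"],  # same title again, but not adjacent
]

# Collect each group's showtimes into a list so the result can be inspected.
groups = [
    (title, [row[0] for row in entries])
    for title, entries in groupby(rows, key=lambda x: x[1])
]
print(groups)
# [('A', ['16:40', '20:00']), ('B', ['19:00']), ('A', ['21:00'])]
```

Title "A" shows up as two separate groups because a row for "B" sits between its runs.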

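This approach assumes that the rows you want to group are on consecutive lines in the file. You could even write your own group-by generator function: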

import csv

def group_by_title(rows):
    # Yield (title, entries) for each run of consecutive rows sharing a title.
    title = None
    entries = []

    for row in rows:
        if title is not None and row[1] != title:
            yield title, entries
            entries = []

        title = row[1]
        entries.append(row)

    if entries:
        yield title, entries

with open('movies.csv') as f_movies:
    csv_movies = csv.reader(f_movies)

    for title, entries in group_by_title(csv_movies):
        showtimes = ' '.join(row[0] for row in entries)
        rating = entries[0][2]

        print(f"{title[:44]: <44} | {rating: >5} | {showtimes}")

# Alternative: collect the showtimes in a dictionary keyed by title
file_name = input()
my_movies = {}
with open(file_name, 'r') as f:
    rows = f.readlines()
    for row in rows:
        showtimes, title, rating = row.strip().split(",")
        if title in my_movies:
            my_movies[title]["showtimes"].append(showtimes)
        else:
            my_movies[title] = {"rating": rating, "showtimes": [showtimes]}


for movie, item in my_movies.items():
    showtimes = " ".join(item["showtimes"])
    ratings = item["rating"]
    title = movie[:44]
    print(f'{title:<44} | {ratings:>5} | {showtimes}')