My question is related to the one linked here.
The link above explains things well, but my case is a little different.
user meetings
178787 287750
178787 151515
178787 158478
576585 896352
576585 985639
576585 456988
The expected result is:
user meetings
178787 "[287750,151515,158478]"
576585 "[896352,985639,456988]"
How can I do this in Python with the code above? Thanks in advance.
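You can read the file line by line, split each line, and add the meeting to a dictionary keyed by user. This can be done very neatly with the method shown here.
Then we can write this dictionary back to the same file, using tabs so that everything lines up.
So, assuming your file is called f.csv, the code would look like this: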
d = {}
# [1:-1] drops the header line and the empty string after the final newline
for l in open('f.csv').read().split('\n')[1:-1]:
    u, m = l.split()
    d.setdefault(u, []).append(m)
with open('f.csv', 'w') as f:
    f.write('user\tmeetings\n')
    for u, m in d.items():
        f.write(u + '\t' + str(m) + '\n')
This produces the following output:
user meetings
178787 ['287750', '151515', '158478']
576585 ['896352', '985639', '456988']
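The lists above use Python's default list formatting, which differs slightly from the quoted form asked for in the question. Here is a minimal sketch of one way to get that form, assuming the same whitespace-separated input. `group_meetings` and `format_row` are hypothetical helper names, and the sample rows are the ones from the question:

```python
def group_meetings(lines):
    """Group meeting IDs by user, skipping the header line."""
    grouped = {}
    for line in lines[1:]:
        if not line.strip():
            continue  # ignore blank lines, e.g. after a trailing newline
        user, meeting = line.split()
        grouped.setdefault(user, []).append(meeting)
    return grouped


def format_row(user, meetings):
    """Render one row as: user "[m1,m2,m3]"."""
    return '{} "[{}]"'.format(user, ','.join(meetings))


sample = [
    'user meetings',
    '178787 287750',
    '178787 151515',
    '178787 158478',
    '576585 896352',
    '576585 985639',
    '576585 456988',
]
grouped = group_meetings(sample)
rows = [format_row(u, m) for u, m in grouped.items()]
print('\n'.join(rows))
```

Joining the IDs by hand instead of calling `str()` on the list avoids the quotes and spaces that the list formatting adds around each element.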
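Since user will be the key, let's fill a dictionary. Note: this ends up loading the whole file into memory once, but it does not require the file to be sorted by user first. Also note that the output will not be sorted, because dict.items() makes no promise of sorted order (it follows insertion order on Python 3.7+, and an arbitrary order before that).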
output = {}
with open('input.csv') as f:
    for line in f:
        # we strip newlines before splitting on whitespace
        user, meeting = line.strip('\r\n').split()
        if user == 'user':
            continue  # skip the header row
        if user not in output:
            # the user was not found in the dict
            output[user] = [meeting]  # add the user, with the first meeting
        else:  # user already exists in dict
            output[user].append(meeting)  # add meeting to user entry
# print output header
print("user meetings")  # I used a single space, feel free to use '\t' etc.
# let's retrieve all meetings per user
for user, meetings in output.items():  # in python2, use .iteritems() instead
    meetings = ','.join(meetings)  # format ["1","2","3"] to "1,2,3"
    print('{} "[{}]"'.format(user, meetings))
More advanced: sorted output. I do this by sorting the keys first. Note that this uses more memory, because I am also creating a list of the keys.
# same as before
output = {}
with open('input.csv') as f:
    for line in f:
        # we strip newlines before splitting on whitespace
        user, meeting = line.strip('\r\n').split()
        if user == 'user':
            continue  # skip the header row
        if user not in output:
            # the user was not found in the dict
            output[user] = [meeting]  # add the user, with the first meeting
        else:  # user already exists in dict
            output[user].append(meeting)  # add meeting to user entry
# print output header
print("user meetings")  # I used a single space, feel free to use '\t' etc.
# sort my dict keys before printing them:
for user in sorted(output.keys()):
    meetings = ','.join(output[user])
    print('{} "[{}]"'.format(user, meetings))
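One caveat with sorting the keys: they are strings here, so `sorted` compares them character by character. Here is a small sketch of passing `key=int` to sort by numeric value instead. It uses made-up IDs rather than the data above and assumes every ID is numeric:

```python
# Hypothetical user IDs of different lengths (not from the question's data).
output = {'1000': ['3'], '25': ['1'], '300': ['2']}

as_text = sorted(output)               # lexicographic: compares characters
as_numbers = sorted(output, key=int)   # numeric: compares integer values

print(as_text)
print(as_numbers)
```

With the question's data every ID has six digits, so both orderings agree. The difference only shows up once the IDs vary in length.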
from collections import defaultdict
import csv

inpath = ''  # Path to input CSV file
outpath = ''  # Path to output CSV file

output = defaultdict(list)  # Dictionary like {user_id: [meetings]}
# DictReader expects comma-separated input by default
with open(inpath, newline='') as f:
    for row in csv.DictReader(f):
        output[row['user']].append(row['meetings'])

with open(outpath, 'w') as f:
    for user, meetings in output.items():
        row = user + ',' + str(meetings) + '\n'
        f.write(row)
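If the quoted form from the question is wanted, the `csv` module's writer can add the quotes itself, because by default it quotes any field that contains the delimiter. Here is a minimal sketch under that assumption. It writes to an in-memory buffer instead of `outpath` and uses the question's values as sample data:

```python
import csv
import io
from collections import defaultdict

# Sample (user, meeting) pairs taken from the question.
pairs = [
    ('178787', '287750'), ('178787', '151515'), ('178787', '158478'),
    ('576585', '896352'), ('576585', '985639'), ('576585', '456988'),
]

output = defaultdict(list)
for user, meeting in pairs:
    output[user].append(meeting)

buffer = io.StringIO()  # stands in for the output file
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['user', 'meetings'])
for user, meetings in output.items():
    # The joined field contains commas, so the writer wraps it in quotes.
    writer.writerow([user, '[' + ','.join(meetings) + ']'])

result = buffer.getvalue()
print(result)
```

Letting the writer handle quoting also keeps the file readable by `csv.reader`, which parses each quoted list back as a single field.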
Pandas groupby provides a nice solution:
import pandas as pd
# read_csv has no `columns` argument; `usecols` selects columns by header name
df = pd.read_csv('myfile.csv', usecols=['user', 'meetings'])
df_grouped = df.groupby('user')['meetings'].apply(list).astype(str).reset_index()