Make a two-column CSV file show each user ID once, with that user's meetings as a list - python



My question is related to the one at this link.

The link above explains things well, but my case is a little different.

user     meetings
178787    287750
178787    151515
178787    158478
576585    896352
576585    985639
576585    456988

The expected result is

user       meetings
178787   "[287750,151515,158478]"
576585   "[896352,985639,456988]"

How can I do this in Python, building on the code above? Thanks in advance.

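You can read the file line by line, split each line, and add the meeting to a dictionary keyed by user. This can be done very neatly with the method shown here (dict.setdefault).

We can then write the dictionary back to the same file, using tabs so that everything lines up.

So, assuming your file is called f.csv, the code would look like this: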

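d = {}
with open('f.csv') as f:
    # splitlines() drops the newlines; [1:] skips the header row
    for l in f.read().splitlines()[1:]:
        if not l.strip():
            continue  # ignore blank lines
        u, m = l.split()
        # create the list on first sight of a user, then append
        d.setdefault(u, []).append(m)
with open('f.csv', 'w') as f:
    f.write('user\tmeetings\n')
    for u, m in d.items():
        f.write(u + '\t' + str(m) + '\n')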

This produces the desired output:

user    meetings
178787  ['287750', '151515', '158478']
576585  ['896352', '985639', '456988']
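Note that the list written above is Python's own list representation, with quotes around each ID, whereas the question asks for "[287750,151515,158478]". Here is a minimal sketch of one way to get that form, assuming whitespace-separated input lines. The helper names group_meetings and format_row are made up for illustration and are not part of the answer above.

```python
def group_meetings(lines):
    """Group meeting IDs by user (users stay in first-seen order on Python 3.7+)."""
    grouped = {}
    for line in lines:
        fields = line.split()
        # skip blank or malformed lines and the header row
        if len(fields) != 2 or fields[0] == 'user':
            continue
        user, meeting = fields
        grouped.setdefault(user, []).append(meeting)
    return grouped


def format_row(user, meetings):
    """Render one row as: user, a tab, then the quoted bracketed list."""
    return '{}\t"[{}]"'.format(user, ','.join(meetings))


# sample lines taken from the question's data
sample = [
    'user     meetings',
    '178787    287750',
    '178787    151515',
    '178787    158478',
    '576585    896352',
]
rows = [format_row(u, m) for u, m in group_meetings(sample).items()]
print('\n'.join(rows))
```

Joining the IDs by hand, rather than calling str() on the list, is what removes the inner quotes and the spaces after the commas.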

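Since user will be the key, let's populate a dictionary. Note: this ends up holding the whole file's contents in memory at once, but it does not require the file to be sorted by user first. Also note that the output is not sorted: before Python 3.7, dict.items() returned items in no guaranteed order, and from 3.7 on it returns them in insertion order, that is, the order in which each user first appears in the file.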

output = {}
with open('input.csv') as f:
    for line in f:
        # we strip newlines before splitting on whitespace
        fields = line.strip('\r\n').split()
        if not fields or fields[0] == 'user':
            continue  # skip blank lines and the header
        user, meeting = fields
        if user not in output:
            # the user was not found in the dict
            output[user] = [meeting]  # add the user, with the first meeting
        else:  # user already exists in dict
            output[user].append(meeting)  # add meeting to user entry
# print output header
print("user meetings")  # I used a single space, feel free to use '\t' etc.
# let's retrieve all meetings per user
for user, meetings in output.items():  # in python2, use .iteritems() instead
    meetings = ','.join(meetings)  # format ["1","2","3"] to "1,2,3"
    print('{} "[{}]"'.format(user, meetings))

More advanced: sorted output. I do this by sorting the keys first. Note that this uses more memory, because I am also creating a list of the keys.

# same as before
output = {}
with open('input.csv') as f:
    for line in f:
        # we strip newlines before splitting on whitespace
        fields = line.strip('\r\n').split()
        if not fields or fields[0] == 'user':
            continue  # skip blank lines and the header
        user, meeting = fields
        if user not in output:
            # the user was not found in the dict
            output[user] = [meeting]  # add the user, with the first meeting
        else:  # user already exists in dict
            output[user].append(meeting)  # add meeting to user entry
# print output header
print("user meetings")  # I used a single space, feel free to use '\t' etc.
# sort my dict keys before printing them:
for user in sorted(output.keys()):
    meetings = ','.join(output[user])
    print('{} "[{}]"'.format(user, meetings))
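One caveat when sorting the keys: they are strings here, so sorted() compares them character by character, and '99' would come after '178787'. If the user IDs can differ in length, passing key=int sorts them numerically instead. A small sketch with made-up IDs (not the question's data):

```python
# made-up sample: user IDs of different lengths, stored as strings
output = {'178787': ['287750'], '99': ['151515'], '576585': ['896352']}

text_order = sorted(output)              # compares character by character
numeric_order = sorted(output, key=int)  # compares by integer value

print(text_order)
print(numeric_order)
```

With the six-digit IDs from the question both orders are the same, so this only matters when ID lengths vary.
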

# alternative: collections.defaultdict plus the csv module
# (assumes a comma-separated file with a "user,meetings" header row)
from collections import defaultdict
import csv
inpath = ''  # Path to input CSV file
outpath = ''  # Path to output CSV file
output = defaultdict(list)  # Dictionary like {user_id: [meetings]}
with open(inpath, newline='') as f:
    for row in csv.DictReader(f):
        output[row['user']].append(row['meetings'])
with open(outpath, 'w') as f:
    for user, meetings in output.items():
        row = user + ',' + str(meetings) + '\n'
        f.write(row)
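
Because the list itself contains commas, writing it unquoted into a comma-separated row makes the file ambiguous to parse. One option, sketched here as an assumption rather than as part of the answer above, is to let csv.writer do the quoting: by default it quotes any field that contains the delimiter. The function name write_grouped is hypothetical, and io.StringIO stands in for real files.

```python
import csv
import io
from collections import defaultdict


def write_grouped(infile, outfile):
    """Read user,meetings rows and write one row per user."""
    grouped = defaultdict(list)
    for row in csv.DictReader(infile):
        grouped[row['user']].append(row['meetings'])
    writer = csv.writer(outfile, lineterminator='\n')
    writer.writerow(['user', 'meetings'])
    for user, meetings in grouped.items():
        # csv.writer quotes this field whenever it contains a comma
        writer.writerow([user, '[' + ','.join(meetings) + ']'])


source = io.StringIO(
    'user,meetings\n178787,287750\n178787,151515\n576585,896352\n'
)
target = io.StringIO()
write_grouped(source, target)
print(target.getvalue())
```

A user with a single meeting produces a field without commas, which the default quoting mode leaves unquoted; passing quoting=csv.QUOTE_ALL to csv.writer would quote every field if uniform quoting is wanted.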

Pandas groupby offers a nice solution:

import pandas as pd
# usecols selects the two columns by their header names
df = pd.read_csv('myfile.csv', usecols=['user', 'meetings'])
df_grouped = df.groupby('user')['meetings'].apply(list).astype(str).reset_index()
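
The grouped frame above stores each list as Python's list representation, for example [287750, 151515, 158478] with spaces after the commas. Here is a minimal sketch of producing the bracketed, space-free form from the question, assuming a comma-separated file with a header row. io.StringIO stands in for the file, and the sample rows are a subset of the question's data.

```python
import io

import pandas as pd

data = io.StringIO(
    'user,meetings\n'
    '178787,287750\n'
    '178787,151515\n'
    '576585,896352\n'
    '576585,985639\n'
)
df = pd.read_csv(data, dtype=str)  # read the IDs as strings, not integers

grouped = (
    df.groupby('user')['meetings']
      .apply(lambda s: '[' + ','.join(s) + ']')  # one string per user
      .reset_index()
)
csv_text = grouped.to_csv(index=False)  # returns a string when no path is given
print(csv_text)
```

Since each joined field contains commas, to_csv wraps it in double quotes, which matches the quoted form shown in the question.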
