Load random samples from a CSV with pandas



I have a CSV in the format:

Team, Player

What I want to do is apply a filter on the Team field and then take a random subset of 3 players from each team.

So, for example, my CSV looks like:

Man Utd, Ryan Giggs
Man Utd, Paul Scholes
Man Utd, Paul Ince
Man Utd, Danny Pugh
Liverpool, Steven Gerrard
Liverpool, Kenny Dalglish
...

I would like to end up with an XLS containing 3 random players from each team, or only 1 or 2 where a team has fewer than 3:

Man Utd, Paul Scholes
Man Utd, Paul Ince
Man Utd, Danny Pugh
Liverpool, Steven Gerrard
Liverpool, Kenny Dalglish

I started out using XLRD; my original post is here.

I am now trying pandas, as I believe it will be more flexible in the future.

So, in pseudocode, what I want to do is:

foreach(team in csv)
   print random 3 players + team they are assigned to

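I have been looking through pandas and trying to find the best way to do this, but I can't find anything similar to what I want to do (it's a difficult thing to Google!). Here is my attempt so far: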

import pandas as pd
from collections import defaultdict
import csv

columns = defaultdict(list) # each value in each column is appended to a list
with open(r'C:\Users\ADMIN\Desktop\CSV_1.csv') as f: # raw string, so the backslashes are not read as escapes
    reader = csv.DictReader(f) # read rows into a dictionary format
    for row in reader: # read a row as {column1: value1, column2: value2,...}
        print(row)
        #for (k,v) in row.items(): # go over each column name and value
        #    columns[k].append(v) # append the value into the appropriate list
                                 # based on column name k

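I have commented out the last two lines because I'm not sure whether I need them. Right now I am printing every row, so I just need to pick 3 random rows for each football team (or 1 or 2 where there are fewer).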

How can I accomplish this? Any tips/tricks?

Thanks.

First, use read_csv, which is better optimized:

import pandas as pd
df = pd.read_csv('DataFrame') 
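
The sample rows in the question have a space after each comma, which a plain read_csv call would keep as part of the second column's name and values. Here is a minimal sketch of one way to handle that, assuming the header row is "Team, Player" as shown in the question; the inline text only stands in for the real file.

```python
import io

import pandas as pd

# Inline stand-in for the CSV file (assumption: the header row is "Team, Player").
csv_text = "Team, Player\nMan Utd, Ryan Giggs\nLiverpool, Steven Gerrard\n"

# skipinitialspace=True drops the blank that follows each comma,
# so the second column is named "Player" rather than " Player".
df = pd.read_csv(io.StringIO(csv_text), skipinitialspace=True)
print(df.columns.tolist())
```

With a real file, the io.StringIO(...) argument would be replaced by the file path.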

Now, as a toy example, get a random subset by indexing the dataframe with randomly chosen row labels (replace 'x' with LIVFC, for example):

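In []
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['x'] = np.arange(0, 10, 1)
df['y'] = np.arange(0, 10, 1)
df['x'] = df['x'].astype(str)
df['y'] = df['y'].astype(str)
# .ix has been removed from pandas; .loc with random row labels does the same job here.
# np.random.randint excludes the upper bound, so every label exists in the index.
df['x'].loc[np.random.randint(0, len(df), 10)][:3]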
Out [382]:
0    0
3    3
7    7
Name: x, dtype: object

This will make you more familiar with pandas, but as of version 0.16.x there is now a built-in DataFrame.sample method:

df = pd.DataFrame(data)  # `data` is a placeholder for your own data
# Randomly sample 70% of your dataframe
df_percent = df.sample(frac=0.7)
# Randomly sample 7 elements from your dataframe
df_7 = df.sample(n=7)

For either approach above, you can get the rest of the rows by doing:

df_rest = df.loc[~df.index.isin(df_percent.index)]
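
For the per-team case in the question, one possible approach is to shuffle all rows with DataFrame.sample(frac=1) and then keep the first 3 rows of each team with groupby(...).head(3); teams with fewer than 3 players simply keep all of theirs. This is a minimal sketch rather than the only way to do it, and it assumes columns named Team and Player. The small frame is built inline from the question's sample rows for illustration.

```python
import pandas as pd

# Illustrative data taken from the sample rows in the question.
df = pd.DataFrame({
    "Team": ["Man Utd", "Man Utd", "Man Utd", "Man Utd", "Liverpool", "Liverpool"],
    "Player": ["Ryan Giggs", "Paul Scholes", "Paul Ince", "Danny Pugh",
               "Steven Gerrard", "Kenny Dalglish"],
})

# Shuffle every row; frac=1 returns the whole frame in random order.
shuffled = df.sample(frac=1)

# head(3) keeps at most 3 rows per team, so smaller teams keep all their rows.
# sort_index() restores the original row order for readability.
result = shuffled.groupby("Team").head(3).sort_index()
print(result)
```

Passing random_state to sample makes the selection reproducible, and result.to_csv or result.to_excel can then write the subset out.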
