CSV Pandas PYTHON



I have a CSV file that looks like this:

S1,    22,   MD  , 0.022, ,  523.324
S2,    22,   MD  , 4.32,  , 342.54 
S3,    22,   MD  , 3.54,  ,   0.32
S4,    22,   MD  , 4.32,  ,  0.54  
S1,    33,   MD  , 5.32,  ,  0.43
S2,    33,   MD  , 11.54, ,  0.65
S3,    33,   MD  , 22.5,  ,  0.324
S4,    33,   MD  , 45.89  ,  0.32
S1,    44,  MD  , 3.53   ,  3.32
S2,    44,  MD  ,  4.5   ,  0.322
S3,    44,  MD  , 43.65  ,   45.78
S4,    44,   MD,   43.54 , 0.321

The file has no header row, and I don't care about the MD column.

I need my output file to look like this:

Size,  S1,        S2,        S3,        S4
22,    0.022,     4.32,      45.89,     4.32
33,    5.32,      11.54,     22.5,      45.89
44,    3.53,      4.5,       43.65,     43.54
       3 values,  3 values,  3 values,  3 values

As you can see, the output file includes a header row. The last row shows the total number of values in each column.

My code so far:

import pandas as pd
import numpy as np
import csv

df = pd.read_csv(r'C:\testuser\user\desktop\file.csv', usecols=[0, 1, 2, 3, 4])

df.columns = pd.MultiIndex.from_tuples(zip(['Names', 'FileSize', 'x', 'y', 'z'], df.columns))  # add column headers... (this is done wrong)

df_out = df.groupby('Names', 'FileSize').count().reset_index()  # supposed to print the distinct values

df_out.to_csv('processed_data_out.csv', columns=['Names', 'FileSize', 'x', 'y', 'z'], header=False, index=False)

I am not using the last column in the output, because it should only be generated when the user asks to see that information.

pandas handles this kind of task very well.

Read the data:

import pandas as pd
df = pd.read_csv('data_in.csv', names=['Label','Requirements'], skiprows=1) # This assumes and skips the header row ('TSD' in your question)
>>> df
   Label  Requirements
0      A             1
1      A             2
2      A             3
3      A             4
4      A             5
5      B            11
6      B            22
7      B            45
8      C           NaN
9      C           NaN
10     C           NaN

Count the requirements:

df_out = df.groupby('Label').count().reset_index()
>>> df_out
  Label  Requirements
0     A             5
1     B             3
2     C             0
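
Label C ends up with 0 because `count()` only counts non-null values, while `size()` counts every row in the group. The following is a minimal sketch of that difference; the small frame is a hypothetical stand-in for the table above, not the original data.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the table above: label C has no requirement value.
df = pd.DataFrame({
    "Label": ["A", "A", "B", "C"],
    "Requirements": [1, 2, 11, np.nan],
})

# count() skips NaN, so a label with only missing values gets 0.
non_null = df.groupby("Label")["Requirements"].count()

# size() counts rows regardless of their content.
all_rows = df.groupby("Label").size()

print(non_null.to_dict())  # {'A': 2, 'B': 1, 'C': 0}
print(all_rows.to_dict())  # {'A': 2, 'B': 1, 'C': 1}
```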

Format the output as needed:

df_out['Output'] = df_out.apply(lambda row: "%s doesn't have any requirement" % row['Label'] if row['Requirements'] == 0 else '%s has %d requirements' % (row['Label'], row['Requirements']), axis=1)
>>> df_out
  Label  Requirements                          Output
0     A             5            A has 5 requirements
1     B             3            B has 3 requirements
2     C             0  C doesn't have any requirement

Export to CSV:

df_out.to_csv('processed_data_out.csv', columns=['Output'], header=False, index=False)
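
For the file layout in the question (name, size, MD, value), the reshaping into one column per S name could be done with a pivot. This is only a sketch under assumptions: the in-memory sample is a simplified four-field version of the question's rows, and the column names `Name`, `Size`, `MD` and `Value` are made up for illustration.

```python
import io
import pandas as pd

# Simplified sample in the question's layout: name, size, MD, value.
# For a real file, pass its path to read_csv instead of this buffer.
sample = io.StringIO(
    "S1, 22, MD, 0.022\n"
    "S2, 22, MD, 4.32\n"
    "S1, 33, MD, 5.32\n"
    "S2, 33, MD, 11.54\n"
)

raw = pd.read_csv(
    sample,
    header=None,                            # the file has no header row
    names=["Name", "Size", "MD", "Value"],  # hypothetical column names
    skipinitialspace=True,                  # drop the blanks after each comma
).drop(columns="MD")                        # the MD column is not needed

# One row per size, one column per S name.
wide = raw.pivot(index="Size", columns="Name", values="Value")

# Number of non-null values in each S column, for the trailing summary row.
counts = wide.count()

# Build the CSV text, then append the "N values" row by hand.
csv_text = wide.to_csv()
csv_text += "," + ",".join(f"{n} values" for n in counts) + "\n"
print(csv_text)
```

The summary row is appended as text rather than as a DataFrame row so that the numeric columns keep their float dtype.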

I would suggest using a dictionary:

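my_dict = {}
# your_file is the path to the input file
with open(your_file, 'r') as infile:
    for line in infile:
        # strip the trailing newline so a lone key is not stored as 'C\n'
        line_list = line.strip().split(' ')
        if len(line_list) == 2:
            key, requirement = line_list
            if key in my_dict:
                my_dict[key] += 1
            else:
                my_dict[key] = 1  # first requirement seen for this key
        elif len(line_list) == 1:
            key = line_list[0]
            if key not in my_dict:
                my_dict[key] = 0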

Then write the dictionary my_dict to another CSV file...
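
One possible way to do that write-out uses the standard `csv` module. This is a sketch, not part of the original answer: the helper `write_counts`, the header names and the sample counts are hypothetical, and an in-memory buffer stands in for the output file.

```python
import csv
import io


def write_counts(counts, outfile):
    """Write one 'key,count' row per dictionary entry to an open text file."""
    writer = csv.writer(outfile)
    writer.writerow(["Label", "Requirements"])  # hypothetical header row
    for key, count in counts.items():
        writer.writerow([key, count])


# Hypothetical counts standing in for the my_dict built above.
my_dict = {"A": 5, "B": 3, "C": 0}

buffer = io.StringIO()
write_counts(my_dict, buffer)
print(buffer.getvalue())

# For a real file, open it with newline="" as the csv module recommends:
# with open("counts_out.csv", "w", newline="") as f:
#     write_counts(my_dict, f)
```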

Edit: this assumes a space-delimited file, but you can change the delimiter in line.split(' ') to whatever separator your file uses...
