Converting a pandas DataFrame from TSV to Parquet



I'm trying to do a kind of pivot on a pandas DataFrame, but with a "twist", I guess? This is the input table (TSV) that gets loaded into pandas.

timestamp   sensor      type    unit    value
1607724078  sensor_a    string  gps     coords1
1607724078  sensor_b    int     bar     1
1607724079  sensor_a    string  gps     coords5
1607724079  sensor_b    int     bar     4
1607724080  sensor_a    string  gps     coords9
1607724080  sensor_b    int     bar     7

The output should look like this:

timestamp   sensor_a sensor_a_type sensor_a_unit sensor_b sensor_b_type sensor_b_unit
1607724078  coords1  string        gps           1        int           bar
1607724079  coords5  string        gps           4        int           bar
1607724080  coords9  string        gps           7        int           bar

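So, as you can see, the rows should be grouped by timestamp, and each distinct value of the sensor column should become its own column, filled with the values from the original value column. In addition, the original type and unit columns must be "name-merged" with the new sensor_x columns (sensor_x_type, sensor_x_unit) and keep their values.

The goal is to save the result as Parquet. Important: there can be more than two sensors.

So far I have managed to create a MultiIndexed DataFrame that contains all the information I need, but I can't write it to Parquet. I think the MultiIndex has to be flattened somehow, so that the sensor units and types get their own separate columns.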

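import pandas as pd

if __name__ == '__main__':
    # The file is tab-separated; use the timestamp column as the index
    df = pd.read_csv("data/test.tsv", delimiter='\t', index_col='timestamp')
    # Pivot so that each (sensor, unit, type) combination becomes a column
    df2 = df.pivot(columns=['sensor', 'unit', 'type'], values='value')
    print(df2)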

Output

sensor     sensor_a sensor_b sensor_c
unit            gps      bar  percent
type         string      int   double
timestamp                            
1607724078  coords1        1    11.11
1607724079  coords5        4    44.44
1607724080  coords9        7    77.77

Thanks in advance!

Here is a partial answer... only some of the columns still need renaming:

import pandas as pd
from io import StringIO

sList = '''timestamp   sensor      type    unit    value
1607724078  sensor_a    string  gps     coords1
1607724078  sensor_b    int     bar     1
1607724079  sensor_a    string  gps     coords5
1607724079  sensor_b    int     bar     4
1607724080  sensor_a    string  gps     coords9
1607724080  sensor_b    int     bar     7'''.split('\n')
# Turn the whitespace-separated rows into CSV text
s = '\n'.join([','.join(l.split()) for l in sList])
with StringIO(s) as sio:
    df = pd.read_csv(sio)
# Join the sensor_a rows to the sensor_b rows on timestamp
df[df.sensor == 'sensor_a'].merge(df[df.sensor == 'sensor_b'], on='timestamp', suffixes=['_a','_b'])

Output

    timestamp  sensor_a  type_a unit_a  value_a  sensor_b type_b unit_b  value_b
0  1607724078  sensor_a  string    gps  coords1  sensor_b    int    bar        1
1  1607724079  sensor_a  string    gps  coords5  sensor_b    int    bar        4
2  1607724080  sensor_a  string    gps  coords9  sensor_b    int    bar        7
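
One way the remaining renaming could be finished is sketched below. This is not the answerer's code: the names merged, rename_map, and result are made up for illustration, and the approach only covers the two-sensor case shown above. It drops the two constant sensor columns and maps value_x, type_x, and unit_x onto the names the question asks for.

```python
import pandas as pd
from io import StringIO

raw = """timestamp,sensor,type,unit,value
1607724078,sensor_a,string,gps,coords1
1607724078,sensor_b,int,bar,1
1607724079,sensor_a,string,gps,coords5
1607724079,sensor_b,int,bar,4
1607724080,sensor_a,string,gps,coords9
1607724080,sensor_b,int,bar,7
"""
df = pd.read_csv(StringIO(raw))

# Same merge as in the answer above
merged = df[df.sensor == 'sensor_a'].merge(
    df[df.sensor == 'sensor_b'], on='timestamp', suffixes=('_a', '_b'))

# The sensor_a / sensor_b columns only repeat the sensor name, so drop them
merged = merged.drop(columns=['sensor_a', 'sensor_b'])

# Build the old-name -> new-name mapping (hypothetical helper dict)
rename_map = {}
for suffix in ('a', 'b'):
    rename_map[f'value_{suffix}'] = f'sensor_{suffix}'
    rename_map[f'type_{suffix}'] = f'sensor_{suffix}_type'
    rename_map[f'unit_{suffix}'] = f'sensor_{suffix}_unit'

result = merged.rename(columns=rename_map)
print(result)
```

With more than two sensors this would need one merge per sensor, so a reshape-based approach is probably easier to generalize.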

Based on this solution, you can replace the columns with just the top level of the MultiIndex:

df2.columns = df2.columns.get_level_values(0)

This should produce something like:

sensor_a sensor_b sensor_c
1607724078  coords1        1    11.11
1607724079  coords5        4    44.44
1607724080  coords9        7    77.77
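
Note that keeping only the top level discards the unit and type levels. If those are needed as columns, a possible sketch (my own illustration, not part of the answer above) is to walk over the column tuples and write each level value into its own column. It assumes every sensor has exactly one unit and one type, reads the data from an in-memory string instead of data/test.tsv, and uses the made-up name flat for the result.

```python
import pandas as pd
from io import StringIO

# In-memory stand-in for data/test.tsv (tab-separated)
raw = (
    "timestamp\tsensor\ttype\tunit\tvalue\n"
    "1607724078\tsensor_a\tstring\tgps\tcoords1\n"
    "1607724078\tsensor_b\tint\tbar\t1\n"
    "1607724079\tsensor_a\tstring\tgps\tcoords5\n"
    "1607724079\tsensor_b\tint\tbar\t4\n"
    "1607724080\tsensor_a\tstring\tgps\tcoords9\n"
    "1607724080\tsensor_b\tint\tbar\t7\n"
)
df = pd.read_csv(StringIO(raw), delimiter='\t', index_col='timestamp')
df2 = df.pivot(columns=['sensor', 'unit', 'type'], values='value')

# Build a flat frame: one value column plus one type and one unit column per sensor
flat = pd.DataFrame(index=df2.index)
for sensor, unit, type_ in df2.columns:
    flat[sensor] = df2[(sensor, unit, type_)]   # the measured values
    flat[f'{sensor}_type'] = type_              # constant taken from the column level
    flat[f'{sensor}_unit'] = unit               # constant taken from the column level

print(flat)
```

Because the type and unit are broadcast from the column labels, they are filled in even for timestamps where a sensor has no value.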

You can use set_index and unstack to reshape the DataFrame, then flatten the MultiIndex with the following list comprehension:

import pandas as pd
from io import StringIO
#Input file
sList = '''timestamp   sensor      type    unit    value
1607724078  sensor_a    string  gps     coords1
1607724078  sensor_b    int     bar     1
1607724079  sensor_a    string  gps     coords5
1607724079  sensor_b    int     bar     4
1607724080  sensor_a    string  gps     coords9
1607724080  sensor_b    int     bar     7'''
#Read file in as dataframe
df = pd.read_csv(StringIO(sList), sep=r'\s\s+', engine='python')
#Reshape the dataframe
df_out = df.set_index(['timestamp', 'sensor']).unstack()
#Flatten multiindex
df_out.columns = [f'{j}_{i}' if i != 'value' else f'{j}' for i, j in  df_out.columns]
df_out = df_out.reset_index()
print(df_out)

Output:

timestamp sensor_a_type sensor_b_type sensor_a_unit sensor_b_unit sensor_a sensor_b
0  1607724078        string           int           gps           bar  coords1        1
1  1607724079        string           int           gps           bar  coords5        4
2  1607724080        string           int           gps           bar  coords9        7
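
Since the question mentions more than two sensors and Parquet as the target, here is a hedged sketch of the same set_index/unstack idea with a third sensor, followed by a write to Parquet. The sensor_c rows are taken from the output printed in the question; the names wide, path, and written are made up, and DataFrame.to_parquet only works if pyarrow or fastparquet is installed, which is why the call is guarded.

```python
import os
import tempfile
from io import StringIO

import pandas as pd

raw = """timestamp,sensor,type,unit,value
1607724078,sensor_a,string,gps,coords1
1607724078,sensor_b,int,bar,1
1607724078,sensor_c,double,percent,11.11
1607724079,sensor_a,string,gps,coords5
1607724079,sensor_b,int,bar,4
1607724079,sensor_c,double,percent,44.44
1607724080,sensor_a,string,gps,coords9
1607724080,sensor_b,int,bar,7
1607724080,sensor_c,double,percent,77.77
"""
# Read value as text so each resulting column has a single, consistent type
df = pd.read_csv(StringIO(raw), dtype={'value': str})

# Reshape: one row per timestamp, (field, sensor) pairs as columns
wide = df.set_index(['timestamp', 'sensor']).unstack()

# Flatten: the value column is named after the sensor, the rest get a suffix
wide.columns = [sensor if field == 'value' else f'{sensor}_{field}'
                for field, sensor in wide.columns]
wide = wide.reset_index()

# Write to a temporary location; skipped if no Parquet engine is available
path = os.path.join(tempfile.mkdtemp(), 'sensors.parquet')
try:
    wide.to_parquet(path, index=False)
    written = True
except ImportError:
    written = False

print(wide.columns.tolist())
print('written to Parquet:', written)
```

All values stay strings here; converting sensor_b or sensor_c to numeric types before writing would be a separate step driven by the type column.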