Convert a pandas dataframe of hexadecimal numbers to integers and floats, and split it into single columns



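I want to convert a pandas dataframe containing space-separated hexadecimal numbers into integers and floats (some columns contain only integers, others are floats). The dataframe has an index column, which is a time variable.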

The dataframe looks like this:

print(selected_df.XData)
DataSrvTime
2021-07-08T08:43:29.0616419     C7 10 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
2021-07-08T08:43:30.0866790     C2 16 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
2021-07-08T08:43:31.1107931     CB E 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
2021-07-08T08:43:32.1398927     BF 13 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
2021-07-08T08:43:33.1697282     BA 15 2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
...                        
2021-07-08T11:12:51.1695194     4 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
2021-07-08T11:12:52.2000730     5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
2021-07-08T11:12:53.2248873     4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
2021-07-08T11:12:54.2574457     2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
2021-07-08T11:12:56.3157504     6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
Name: XData, Length: 7799, dtype: object

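First, I split the data into single columns. Each row string begins with the delimiter, so the split produces an empty first column. I drop that column to keep only the columns that contain data, and then add column names: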

# Defining the column names:
header = ["00","01","02","03","04","05","06","07","08","09","10","11","12","13","14","15","16","17","18","19","20","21","22","23","MTF1","MTF3","MTF5","MTF7","SP","SFR","T","RH","PM_1","PM_2","PM_3","#RG","#RL","#RR","#RC","LS","Checksum"] 
# split data into single columns
x_df = selected_df.XData.str.split(' ', expand=True)
# dismiss first delimiter column
x_df.drop(0, inplace=True, axis=1)
#add column names
x_df.columns = header
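To illustrate the splitting step, here is a minimal sketch on a made-up four-column sample. The `sample` series, its values, and the shortened column names are assumptions for illustration only; the real data has 41 columns.

```python
import pandas as pd

# Hypothetical miniature of selected_df.XData: every row starts with a space.
sample = pd.Series(
    [" C7 10 2 1", " C2 16 1 0", " CB E 2 1"],
    index=pd.Index(
        [
            "2021-07-08T08:43:29.0616419",
            "2021-07-08T08:43:30.0866790",
            "2021-07-08T08:43:31.1107931",
        ],
        name="DataSrvTime",
    ),
    name="XData",
)

# Splitting on a single space turns the leading delimiter into an empty column 0.
parts = sample.str.split(" ", expand=True)

# Drop that empty column and label the remaining ones.
parts = parts.drop(columns=0)
parts.columns = ["00", "01", "02", "03"]

print(parts)
```

Calling `.str.strip()` before `.str.split(" ", expand=True)` would probably avoid the empty first column altogether, if that reads better.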

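Now I have tried different ways of converting the hexadecimal data to integers and/or floats, and all of them lead to errors. Maybe one of you has a better idea than I do.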

# Simply test apply solution without header yet:
res = x_df.apply(int, base = 16)

This leads to the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-107-004c8b72cd04> in <module>
---> 39 res = x_df.apply(int, base = 16)
c:\python38\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, raw, result_type, args, **kwargs)
8831             kwargs=kwargs,
8832         )
-> 8833         return op.apply().__finalize__(self, method="apply")
8834 
8835     def applymap(
c:\python38\lib\site-packages\pandas\core\apply.py in apply(self)
725             return self.apply_raw()
726 
--> 727         return self.apply_standard()
728 
729     def agg(self):
c:\python38\lib\site-packages\pandas\core\apply.py in apply_standard(self)
849 
850     def apply_standard(self):
--> 851         results, res_index = self.apply_series_generator()
852 
853         # wrap results
c:\python38\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
865             for i, v in enumerate(series_gen):
866                 # ignore SettingWithCopy here in case the user mutates
--> 867                 results[i] = self.f(v)
868                 if isinstance(results[i], ABCSeries):
869                     # If we have a view on v, we need to make a copy because
c:\python38\lib\site-packages\pandas\core\apply.py in f(x)
136 
137             def f(x):
--> 138                 return func(x, *args, **kwargs)
139 
140         else:
TypeError: int() can't convert non-string with explicit base

Running print(x_df.dtypes) shows that all columns have dtype "object". I thought str.split had already turned the split columns into strings?
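On the dtype question: "object" is how pandas has traditionally reported columns of Python strings, so the cells are most likely already str. The TypeError probably comes from DataFrame.apply handing int a whole column (a Series) per call rather than a single cell. A minimal sketch on a made-up two-column frame (the name `df` and its values are illustrative, not taken from the real data):

```python
import pandas as pd

# Hypothetical stand-in for x_df with two columns of hex strings.
df = pd.DataFrame({"00": ["C7", "C2", "CB"], "01": ["10", "16", "E"]})

# The individual cells are ordinary Python strings.
cell_is_str = isinstance(df.iloc[0, 0], str)

# DataFrame.apply passes each column as a Series, so int(Series, base=16) fails.
raised = False
try:
    df.apply(int, base=16)
except TypeError:
    raised = True

# Converting cell by cell works: map int(..., 16) over every column.
converted = df.apply(lambda col: col.map(lambda token: int(token, 16)))

print(converted)
```

This keeps the conversion in one expression without an explicit Python loop over the columns.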

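In the end, the first 24 columns of the result should have an integer dtype and the remaining columns should be floats, except for the last one, which is a simple checksum.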
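One way the target dtypes could be set is sketched below. It assumes that the "float" columns simply hold integer-valued hex numbers that should be stored as floats; if they instead encode IEEE 754 bit patterns or need a scale factor, a different decoding step would be required. The frame `df` and the lists `int_cols` and `float_cols` are hypothetical and much shorter than the real header.

```python
import pandas as pd

# Hypothetical stand-in: two integer columns, two float columns, one checksum.
df = pd.DataFrame(
    {
        "00": ["C7", "C2"],
        "01": ["10", "16"],
        "T": ["1A", "1B"],
        "RH": ["2F", "30"],
        "Checksum": ["FF", "0A"],
    }
)
int_cols = ["00", "01"]
float_cols = ["T", "RH"]

# Decode every cell from a hex string to an integer first.
decoded = df.apply(lambda col: col.map(lambda token: int(token, 16)))

# Assumption: float columns are plain numbers that only need a float dtype.
decoded = decoded.astype({name: "float64" for name in float_cols})

print(decoded.dtypes)
```

With the real column list, the equivalent selections would presumably be `header[:24]` for the integer columns and `header[24:-1]` for the float columns, leaving the checksum as an integer.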

Do I have to solve this with a loop?

Thanks for reading.

I found a partial solution (integer values only at this step):

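Using a for loop, I iterate over the columns. In this case I overwrite them; if that is not wanted, I could keep the original data and create a new dataframe with manipulated = x_df.copy().

To avoid the previous error, I had to use the loop variable to select the current column: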

# convert hex data to int
for column in x_df:
    # Series.apply calls int(value, base=16) once per cell
    x_df[column] = x_df[column].apply(int, base=16)
print(x_df)
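A variant of the loop that leaves the original untouched and tolerates gaps is sketched below. When rows have different numbers of fields, `str.split(..., expand=True)` pads the shorter ones with missing values, and `int(None, 16)` would raise. The helper `parse_hex` and the sample frame are hypothetical.

```python
import pandas as pd

def parse_hex(token):
    """Return the integer value of a hex string, or None for missing/empty cells."""
    if isinstance(token, str) and token.strip():
        return int(token, 16)
    return None

# Hypothetical stand-in for x_df, with one padded (missing) cell.
x_df = pd.DataFrame({"00": ["C7", "5"], "01": ["10", None]})

# Work on a copy so the hex strings in x_df stay available.
manipulated = x_df.copy()
for column in manipulated:
    manipulated[column] = manipulated[column].map(parse_hex)

print(manipulated)
```

Missing entries stay missing instead of raising, so columns with gaps can be inspected or filled afterwards.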
