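I want to convert a pandas DataFrame containing space-separated hexadecimal numbers into integers and floats (some columns hold only integers, others hold floats). The DataFrame has an index column, which is a time variable.
The data looks like this: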
print(selected_df.XData)
DataSrvTime
2021-07-08T08:43:29.0616419 C7 10 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
2021-07-08T08:43:30.0866790 C2 16 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
2021-07-08T08:43:31.1107931 CB E 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
2021-07-08T08:43:32.1398927 BF 13 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
2021-07-08T08:43:33.1697282 BA 15 2 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
...
2021-07-08T11:12:51.1695194 4 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
2021-07-08T11:12:52.2000730 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
2021-07-08T11:12:53.2248873 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
2021-07-08T11:12:54.2574457 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
2021-07-08T11:12:56.3157504 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0...
Name: XData, Length: 7799, dtype: object
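First, I split the data into separate columns. The first character of each data string is a delimiter, so the split produces an empty first column. I drop that column, keep only the columns that contain data, and assign column names: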
# Defining the column names:
header = ["00","01","02","03","04","05","06","07","08","09","10","11","12","13","14","15","16","17","18","19","20","21","22","23","MTF1","MTF3","MTF5","MTF7","SP","SFR","T","RH","PM_1","PM_2","PM_3","#RG","#RL","#RR","#RC","LS","Checksum"]
# split data into single columns
x_df = selected_df.XData.str.split(' ', expand=True)
# dismiss first delimiter column
x_df.drop(0, inplace=True, axis=1)
#add column names
x_df.columns = header
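As a side note on the split step, here is a minimal sketch of an alternative that avoids the empty first column altogether. It assumes the only separators are whitespace; the two-row `raw` Series and its index labels are hypothetical stand-ins for `selected_df.XData`. Calling `str.split` without a pattern splits on runs of whitespace and ignores leading whitespace, so there is nothing to drop afterwards.

```python
import pandas as pd

# Hypothetical sample: each string starts with the delimiter, as in XData.
raw = pd.Series([" C7 10 2", " C2 16 1"], index=["t0", "t1"], name="XData")

# Splitting on an explicit ' ' yields an empty first column ...
with_empty = raw.str.split(" ", expand=True)

# ... while splitting without a pattern skips the leading whitespace.
cols = raw.str.split(expand=True)

print(with_empty.shape)  # one extra, empty column
print(cols.shape)
```

Whether this is preferable depends on the data: if a field could legitimately be empty, splitting on a single space keeps its position, while whitespace splitting would shift the remaining columns.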
I have now tried several ways to convert the hex data to integers and/or floats, and all of them raise errors. Maybe one of you has a better idea than I do.
# Simply test apply solution without header yet:
res = x_df.apply(int, base = 16)
This produces the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-107-004c8b72cd04> in <module>
---> 39 res = x_df.apply(int, base = 16)
c:\python38\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, raw, result_type, args, **kwargs)
8831 kwargs=kwargs,
8832 )
-> 8833 return op.apply().__finalize__(self, method="apply")
8834
8835 def applymap(
c:\python38\lib\site-packages\pandas\core\apply.py in apply(self)
725 return self.apply_raw()
726
--> 727 return self.apply_standard()
728
729 def agg(self):
c:\python38\lib\site-packages\pandas\core\apply.py in apply_standard(self)
849
850 def apply_standard(self):
--> 851 results, res_index = self.apply_series_generator()
852
853 # wrap results
c:\python38\lib\site-packages\pandas\core\apply.py in apply_series_generator(self)
865 for i, v in enumerate(series_gen):
866 # ignore SettingWithCopy here in case the user mutates
--> 867 results[i] = self.f(v)
868 if isinstance(results[i], ABCSeries):
869 # If we have a view on v, we need to make a copy because
c:\python38\lib\site-packages\pandas\core\apply.py in f(x)
136
137 def f(x):
--> 138 return func(x, *args, **kwargs)
139
140 else:
TypeError: int() can't convert non-string with explicit base
Running print(x_df.dtypes) shows that every column has dtype "object". I thought str.split would already have turned the split columns into strings?
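To make the cause of the error concrete, here is a small sketch on a hypothetical two-row sample (the `sample` and `parts` names are made up for illustration). As far as I can tell, the cells really are Python strings even though the dtype is reported as "object"; the problem is that `DataFrame.apply` passes each whole column (a Series) to `int`, and `int` with an explicit base only accepts a string. `Series.apply`, in contrast, calls the function once per element.

```python
import pandas as pd

# Hypothetical sample in the same layout as XData (leading delimiter included).
sample = pd.Series([" C7 10 2", " C2 16 1"])
parts = sample.str.split(" ", expand=True).drop(0, axis=1)

# Collect the Python types of the individual cells.
cell_types = {type(v) for v in parts.to_numpy().ravel()}
print(cell_types)

# DataFrame.apply hands int() an entire column, which int() rejects.
try:
    parts.apply(int, base=16)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# Series.apply calls int(value, base=16) for each element instead.
first_col = parts[1].apply(int, base=16)
print(first_col.tolist())
```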
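In the result, the first 24 columns should be integers and the remaining columns should be floats, except for the last one, which is a simple checksum.
Do I have to solve this with a loop?
Thanks for reading.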
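I found a partial solution (integer values only, for now):
Using a for loop, I iterate over the columns. In this case I overwrite them; if that is not intended, I could keep the original data and work on a new DataFrame created with manipulated = x_df.copy(). To avoid the previous error, I had to use the loop variable to select the current column, so that int is applied to each element of one column rather than to the column as a whole: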
# convert hex data to int, one column at a time
for column in x_df:
    # Series.apply calls int(value, base=16) for every element
    x_df[column] = x_df[column].apply(int, base=16)
print(x_df)
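The same conversion can also be written without an explicit loop. The following is only a sketch under a few assumptions: `demo` is a hypothetical stand-in for `x_df`, and `hex_to_int` is a helper name introduced here, not part of pandas. The helper passes missing cells through as None, which matters if some rows have fewer fields than others, because `str.split(..., expand=True)` pads short rows with None. How the float columns are encoded is not stated above, so this covers the integer step only.

```python
import pandas as pd


def hex_to_int(token):
    """Parse one hex string such as 'C7'; return None for missing or empty cells."""
    return int(token, 16) if isinstance(token, str) and token else None


# Hypothetical sample standing in for x_df after the split and rename.
demo = pd.DataFrame(
    {"00": ["C7", "C2"], "01": ["10", "16"], "Checksum": ["E", "0"]}
)

# Apply the element-wise conversion to every column; returns a new DataFrame.
converted = demo.apply(lambda col: col.map(hex_to_int))
print(converted)
```

Because this returns a new DataFrame, the original strings in `demo` stay untouched, which covers the "keep the original data" case without a separate copy.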