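I'm working with a dataset in which some rows are missing one column, and the columns after it have been shifted left into the missing column's place, so it can look like this: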
date tap time count
0 20160730 on 02:30 415.0
1 20160730 on 02:30 18.0
2 20160730 on 02:30 24.0
3 20160730 on 02:30 31.0
4 20160730 on 13:30 64.0
... ... ... ... ...
169549 20170101 23:45 29 NaN
169550 20170101 23:45 34 NaN
169551 20170101 23:45 43 NaN
169552 20170101 23:45 42 NaN
169553 20170101 23:45 60 NaN
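Note that in the last 5 rows, the 'time' value sits in the 'tap' column and the 'count' value sits in the 'time' column. This happens throughout the dataset, not just in the last few rows.
I'm trying to write a function that does this: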
for each item in the 'tap' column
    if item is neither 'on' nor 'off', then
        the value of the 'count' column in that row takes on the value of the 'time' column
        the value of the 'time' column in that row takes on the value of the 'tap' column
        the value of the 'tap' column in that row is replaced by the string "N/A"
So hopefully the end result would look like this:
date tap time count
0 20160730 on 02:30 415.0
1 20160730 on 02:30 18.0
2 20160730 on 02:30 24.0
3 20160730 on 02:30 31.0
4 20160730 on 13:30 64.0
... ... ... ... ...
169549 20170101 N/A 23:45 29
169550 20170101 N/A 23:45 34
169551 20170101 N/A 23:45 43
169552 20170101 N/A 23:45 42
169553 20170101 N/A 23:45 60
So far I have only loaded the CSV file...
import pandas as pd
df = pd.read_csv('data.csv', dtype={
'date': str,
'tap': str,
'time': str,
'count': float})
I'm sure I'm missing something very simple, but I've spent hours on Google and just can't find the right syntax for this. Please let me know how to get this done.
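Use DataFrame.shift together with a condition built from Series.isin. You only need to convert all the columns to strings first, to avoid mismatched dtypes caused by the missing values (as in the last column):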
m = df['tap'].isin(['on','off'])
cols = ['tap','time','count']
df.loc[~m, cols] = df.loc[~m, cols].astype(str).shift(axis=1)
df['count'] = df['count'].astype(int)
print (df)
date tap time count
0 20160730 on 02:30 415
1 20160730 on 02:30 18
2 20160730 on 02:30 24
3 20160730 on 02:30 31
4 20160730 on 13:30 64
169549 20170101 NaN 23:45 29
169550 20170101 NaN 23:45 34
169551 20170101 NaN 23:45 43
169552 20170101 NaN 23:45 42
169553 20170101 NaN 23:45 60
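To try the mask-and-shift idea without the original CSV, here is a minimal self-contained sketch. The four-row frame is a made-up sample that only mimics the layout in the question, and casting to object dtype up front (instead of calling astype(str) on the slice) is my own variation, chosen so that no strings are written into a float column.

```python
import pandas as pd

# Hypothetical sample mimicking the question's layout:
# the last two rows lost their 'tap' value, so 'time' and 'count' slid left.
df = pd.DataFrame({
    "date": ["20160730", "20160730", "20170101", "20170101"],
    "tap": ["on", "off", "23:45", "23:45"],
    "time": ["02:30", "13:30", "29", "34"],
    "count": [415.0, 64.0, float("nan"), float("nan")],
})

cols = ["tap", "time", "count"]
# Object dtype lets each column hold strings or numbers while values move around.
df = df.astype({c: object for c in cols})

# True for rows whose 'tap' value is a genuine tap state.
m = df["tap"].isin(["on", "off"])
# Shift only the misaligned rows one column to the right; 'tap' becomes NaN.
df.loc[~m, cols] = df.loc[~m, cols].shift(axis=1)
# Restore a numeric dtype once every row has its count in the right column.
df["count"] = pd.to_numeric(df["count"]).astype(int)

print(df)
```

The final cast to int only works when no count is missing after the repair; if some rows genuinely lack a count, keeping the column as float is the safer choice.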
If you would rather assign the columns directly, without shifting:
import numpy as np
m = df['tap'].isin(['on','off'])
df.loc[~m, ['time','count']] = df.loc[~m, ['tap','time']].to_numpy()
df.loc[~m, 'tap'] = np.nan
df['count'] = df['count'].astype(int)
print (df)
date tap time count
0 20160730 on 02:30 415
1 20160730 on 02:30 18
2 20160730 on 02:30 24
3 20160730 on 02:30 31
4 20160730 on 13:30 64
169549 20170101 NaN 23:45 29
169550 20170101 NaN 23:45 34
169551 20170101 NaN 23:45 43
169552 20170101 NaN 23:45 42
169553 20170101 NaN 23:45 60
Try:
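# rows whose 'tap' is 'on' or 'off' are fine; every other row is shifted
valid = df['tap'].isin(['on', 'off'])
df.loc[~valid, 'count'] = df.loc[~valid, 'time'].astype(float)
df.loc[~valid, 'time'] = df.loc[~valid, 'tap']
df.loc[~valid, 'tap'] = None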
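The pseudocode in the question asks for a function and for the string "N/A" rather than NaN in the 'tap' column. One possible sketch of that, using Series.where instead of the approaches above, is shown here. The function name realign, its parameters, and the sample frame are all hypothetical; the column names come from the question.

```python
import pandas as pd

def realign(df, valid=("on", "off"), placeholder="N/A"):
    """Return a copy of df with left-shifted rows moved back into place."""
    # Rows whose 'tap' value is not a genuine tap state are the shifted ones.
    bad = ~df["tap"].isin(valid)
    out = df.copy()
    # Every new column is computed from the ORIGINAL df, so order does not matter.
    # where() keeps the existing value for good rows and takes the second
    # argument for bad rows.
    out["count"] = df["count"].where(~bad, pd.to_numeric(df["time"], errors="coerce"))
    out["time"] = df["time"].where(~bad, df["tap"])
    out["tap"] = df["tap"].where(~bad, placeholder)
    return out

# Hypothetical sample mimicking the layout in the question.
df = pd.DataFrame({
    "date": ["20160730", "20160730", "20170101", "20170101"],
    "tap": ["on", "off", "23:45", "23:45"],
    "time": ["02:30", "13:30", "29", "34"],
    "count": [415.0, 64.0, float("nan"), float("nan")],
})

result = realign(df)
print(result)
```

Because the function reads only from the original frame and writes to a copy, the input is left untouched, which makes it easy to compare before and after. With errors="coerce", a shifted row whose 'time' value is not numeric ends up with NaN in 'count' instead of raising an error.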