How to create a new column in a DataFrame by writing an if statement


import numpy as np
import pandas as pd

df['create_date'] = pd.to_datetime(df.create_date)
df['second_date'] = pd.to_datetime(df.second_date)
df
study_name  indicator    create_date   second_date
0       science          A     2022-02-25    2022-01-01
1       science      [A/C]     2022-02-25    2022-04-10
2          math          C     2022-03-02    2022-01-01
3          math        B/C     2022-03-02    2022-04-10 
4  entertainment        [E     2021-09-01           NaT
5     technology         D     2022-01-03    2022-01-01
6     technology     A/B/C     2022-01-03    2022-02-20
7     technology     A, B]     2022-02-20    2022-04-10
8         social         C     2021-10-25           NaT

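I wrote an if statement to handle the case where the create_date value falls between from_date and to_date and the case where it does not, but it raises an error: ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().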

from_date = '2022-01-01'
to_date = '2022-04-10'
def create_date_in_period(x):
    if (x >= from_date) & (x <= to_date):
        interval = abs(df['second_date'] - df['create_date']).dt.days
        return interval.div(99)
    else:
        return np.nan
df['INTERVAL'] = df.apply(create_date_in_period(df['create_date']), axis=1)
df

study_name  indicator    create_date   second_date  INTERVAL
0           science          A     2022-02-25    2022-01-01      0.55
1           science      [A/C]     2022-02-25    2022-04-10      0.44
2              math          C     2022-03-02    2022-01-01       0.6
3              math        B/C     2022-03-02    2022-04-10       0.4
4      entertainment        [E     2021-09-01           NaT       NaT
5         technology         D     2022-01-03    2022-01-01      0.02
6         technology     A/B/C     2022-01-03    2022-02-20      0.48 
7         technology     A, B]     2022-02-20    2022-04-10      0.49
8             social         C     2021-10-25           NaT       NaT

I googled this error, and the solution I found says it can be resolved by writing "&" or "|" instead of "and" and "or".
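To illustrate where the error comes from, here is a minimal sketch. The Series `s` and the names `start`, `end`, `mask` and `raised` are made up for this example. Comparing a whole column produces a boolean Series, and `if` cannot reduce several values to a single True/False, so swapping `and` for `&` alone does not help as long as the `if` still receives a Series.

```python
import pandas as pd

s = pd.Series(pd.to_datetime(['2022-02-25', '2021-09-01']))
start = pd.Timestamp('2022-01-01')
end = pd.Timestamp('2022-04-10')

# Element-wise comparison: the result is a boolean Series, one value per row.
mask = (s >= start) & (s <= end)

# `if` needs exactly one truth value, so a multi-element Series raises ValueError.
try:
    if mask:
        pass
    raised = False
except ValueError:
    raised = True

print(mask.tolist())
print(raised)
```

The mask itself is still useful: it can be passed to np.where or Series.where to pick values row by row, without any `if`.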

You can use

df['INTERVAL2'] = np.where(df['create_date'].between(from_date, to_date), df['second_date'].sub(df['create_date']).abs().dt.days.div(99), np.nan)

To fix your own code:

  1. You should pass rows, not a column, to DataFrame.apply
  2. from_date and to_date should be of datetime type
from_date = pd.to_datetime('2022-01-01')
to_date = pd.to_datetime('2022-04-10')
def create_date_in_period(row):
    # row is a single row of df, so x is a scalar Timestamp
    x = row['create_date']
    if (x >= from_date) & (x <= to_date):
        interval = abs(row['second_date'] - row['create_date']).days
        return interval / 99
    else:
        return np.nan
df['INTERVAL'] = df.apply(lambda row: create_date_in_period(row), axis=1)
# or
df['INTERVAL'] = df.apply(create_date_in_period, axis=1)
print(df)
study_name indicator create_date second_date  INTERVAL
0        science         A  2022-02-25  2022-01-01  0.555556
1        science     [A/C]  2022-02-25  2022-04-10  0.444444
2           math         C  2022-03-02  2022-01-01  0.606061
3           math       B/C  2022-03-02  2022-04-10  0.393939
4  entertainment        [E  2021-09-01         NaT       NaN
5     technology         D  2022-01-03  2022-01-01  0.020202
6     technology     A/B/C  2022-01-03  2022-02-20  0.484848
7     technology     A, B]  2022-02-20  2022-04-10  0.494949
8         social         C  2021-10-25         NaT       NaN
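Point 1 above (pass rows, not a column) can be illustrated with a small sketch. The frame and its columns `a` and `b` are hypothetical and unrelated to the question's data. With axis=1 the function is called once per row and receives that row as a Series indexed by column name; with the default axis=0 it is called once per column.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [10, 20]})

# axis=1: called once per row; row['a'] and row['b'] are scalars.
row_sums = df.apply(lambda row: row['a'] + row['b'], axis=1)

# axis=0 (default): called once per column; col is the whole column.
col_sums = df.apply(lambda col: col.sum())

print(row_sums.tolist())
print(col_sums.tolist())
```

Note that apply expects the function itself. Writing df.apply(f(df['create_date']), axis=1) calls f first with the whole column and hands its return value to apply, which is why the original if statement saw a Series.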

Don't use a loop or apply for this kind of task; it is inefficient.

Use vectorized code instead, with the help of between:

from_date = '2022-01-01'
to_date = '2022-04-10'
df['INTERVAL'] = (df['create_date']
                  .sub(df['second_date'])
                  .abs()
                  .dt.days
                  .div(99)
                  .where(df['create_date'].between(from_date, to_date))
                  )
print(df)

Output:

study_name indicator create_date second_date  INTERVAL
0        science         A  2022-02-25  2022-01-01  0.555556
1        science     [A/C]  2022-02-25  2022-04-10  0.444444
2           math         C  2022-03-02  2022-01-01  0.606061
3           math       B/C  2022-03-02  2022-04-10  0.393939
4  entertainment        [E  2021-09-01         NaT       NaN
5     technology         D  2022-01-03  2022-01-01  0.020202
6     technology     A/B/C  2022-01-03  2022-02-20  0.484848
7     technology     A, B]  2022-02-20  2022-04-10  0.494949
8         social         C  2021-10-25         NaT       NaN
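As a quick check that between matches the original >= / <= condition, the sketch below compares the two on boundary dates. The Series `dates` and the variable names are invented for the example. By default Series.between includes both endpoints, so a create_date equal to from_date or to_date counts as inside the period.

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(
    ['2021-12-31', '2022-01-01', '2022-04-10', '2022-04-11']))
start = pd.Timestamp('2022-01-01')
end = pd.Timestamp('2022-04-10')

# between() is inclusive on both ends by default.
via_between = dates.between(start, end)

# The same condition written with explicit element-wise operators.
via_operators = (dates >= start) & (dates <= end)

print(via_between.tolist())
print(via_between.equals(via_operators))
```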
