How to add a column derived from a string pattern in an existing column



df

         f
0   l2y_q1_eps_gg
1   l2y_q2_eps_gg
2   l2y_q3_eps_gg
3   l2y_q4_eps_gg
4   l1y_q1_eps_gg

Goal

         f          fr_date
0   l2y_q1_eps_gg   20190331
1   l2y_q2_eps_gg   20190630
2   l2y_q3_eps_gg   20190930
3   l2y_q4_eps_gg   20191231
4   l1y_q1_eps_gg   20200331
5   cy_q1_eps_gg    20210331

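The values in the fr_date column are the last day of each quarter in the given year, following the rules below. fr_date has type int:

  • l2y: 2019
  • l1y: 2020
  • cy: 2021
  • q1-q4: the last day of each quarter

Notes:

  • Values in column f start with the pattern l2y/l1y/cy + q1/q2/q3/q4
  • If the current year changes, the rules change with it. For example, if the current year is 2022, then l2y → 2020, l1y → 2021, cy → 2022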
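
The year rule in the notes can be sketched as a lookup relative to the current year. The names `YEAR_OFFSETS` and `year_for_code` and the `current_year` parameter are illustrative assumptions, not part of the question:

```python
from datetime import date

# Hypothetical table: how many years each prefix lies before the current year
YEAR_OFFSETS = {"l2y": 2, "l1y": 1, "cy": 0}

def year_for_code(code, current_year=None):
    """Map a prefix such as "l2y" to a calendar year."""
    if current_year is None:
        current_year = date.today().year
    return current_year - YEAR_OFFSETS[code]

print(year_for_code("l2y", current_year=2022))  # 2020
print(year_for_code("l1y", current_year=2022))  # 2021
print(year_for_code("cy", current_year=2022))   # 2022
```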

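You need two things: a translation function, and a way to apply that function to a column of a pandas DataFrame to get the new column.

The translation function

There are several ways to do this; here is one: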

from datetime import datetime
# Last days of quarters are always the same
last_quarter_days = {"q1": "0331", "q2": "0630", "q3": "0930", "q4": "1231"}
def translate_date(string):
    # Extract year and quarter from the full string, ignoring the remaining parts
    year_str, quarter_str, *_ = string.split("_")
    # Compute year automatically
    current_year = datetime.today().year
    if year_str == "cy":
        year = current_year
    else:
        # This is a dumb extractor, you could do a pattern search
        # and raise an exception if the string is not correct
        sub = int(year_str[1])
        year = current_year - sub
    # Translate the quarter string thanks to the translation table
    day = last_quarter_days[quarter_str]
    # return the date as an integer (but maybe you want a string?)
    return int("{year}{day}".format(year=year, day=day))

Which gives (when run in 2021):

>>> translate_date("cy_q1_eps_gg")                    
20210331
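
The comment in the function suggests a pattern search that raises an exception for malformed strings. A minimal sketch of that idea follows. The name `translate_date_strict`, the regular expression, and the `current_year` parameter are assumptions added for illustration:

```python
import re
from datetime import datetime

LAST_QUARTER_DAYS = {"q1": "0331", "q2": "0630", "q3": "0930", "q4": "1231"}
# Assumed format: "l<N>y" or "cy", then "_q1".."_q4", then "_..." or end of string
PATTERN = re.compile(r"^(?:l(\d)y|cy)_(q[1-4])(?:_|$)")

def translate_date_strict(string, current_year=None):
    match = PATTERN.match(string)
    if match is None:
        raise ValueError(f"unexpected format: {string!r}")
    if current_year is None:
        current_year = datetime.today().year
    years_back, quarter = match.groups()
    # years_back is None when the prefix is "cy"
    year = current_year - int(years_back or 0)
    return int(f"{year}{LAST_QUARTER_DAYS[quarter]}")

print(translate_date_strict("l2y_q3_eps_gg", current_year=2021))  # 20190930
```

Passing `current_year` explicitly keeps the result reproducible; leaving it out falls back to today's year, as in the function above.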

How to apply it to the DataFrame

Use the pandas map method.

df["fr_date"] = df["f"].map(translate_date)
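
As a self-contained illustration of `map`, the sketch below builds the sample frame and pins the year with `functools.partial` so that the output does not depend on when it is run. The `translate` function is a hypothetical variant of `translate_date` that takes the year as an argument:

```python
from functools import partial

import pandas as pd

LAST_QUARTER_DAYS = {"q1": "0331", "q2": "0630", "q3": "0930", "q4": "1231"}

def translate(string, current_year):
    # Hypothetical variant of translate_date with the year passed in
    year_str, quarter_str = string.split("_")[:2]
    years_back = 0 if year_str == "cy" else int(year_str[1])
    return int(f"{current_year - years_back}{LAST_QUARTER_DAYS[quarter_str]}")

df = pd.DataFrame(
    {"f": ["l2y_q1_eps_gg", "l2y_q4_eps_gg", "l1y_q1_eps_gg", "cy_q1_eps_gg"]}
)
# map calls the function once per value and collects the results in a new Series
df["fr_date"] = df["f"].map(partial(translate, current_year=2021))
print(df)
```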

You can use the QuarterEnd offset to compute the date on which each quarter ends:

import pandas as pd

# pd.datetime was removed in pandas 2.0; pd.Timestamp.now() gives the same year
current_year = pd.Timestamp.now().year
mapping = {"l2y": current_year - 2, "l1y": current_year - 1, "cy": current_year}
# The year code is everything before the first underscore
df["year"] = df.f.str.extract(r"([^_]+)")
df["year"] = df["year"].map(mapping)
# The quarter number is the digit after "_q"
df["quarter"] = df.f.str.extract(r"_q(\d)")
df["fr_date"] = df.apply(
    lambda x: (
        pd.Timestamp(year=int(x["year"]), month=int(x["quarter"]) * 3, day=1)
        + pd.tseries.offsets.QuarterEnd()
    ).strftime("%Y%m%d"),
    axis=1,
)
print(df[["f", "fr_date"]])

Prints (in 2021):

               f   fr_date
0  l2y_q1_eps_gg  20190331
1  l2y_q2_eps_gg  20190630
2  l2y_q3_eps_gg  20190930
3  l2y_q4_eps_gg  20191231
4  l1y_q1_eps_gg  20200331
5   cy_q1_eps_gg  20210331
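
The reason for starting from day 1 of the quarter's last month is how the offset rolls dates. The sketch below is an illustration added here, not part of the answer. It shows that adding `QuarterEnd()` to a date that is already a quarter end moves to the next quarter end:

```python
import pandas as pd

quarter_end = pd.tseries.offsets.QuarterEnd()

# First day of the quarter's last month: rolls forward to that quarter's end
print(pd.Timestamp(2019, 3, 1) + quarter_end)   # 2019-03-31 00:00:00
# Already on a quarter end: adding the offset moves to the NEXT quarter end
print(pd.Timestamp(2019, 3, 31) + quarter_end)  # 2019-06-30 00:00:00
# rollforward leaves a date that is already a quarter end unchanged
print(quarter_end.rollforward(pd.Timestamp(2019, 3, 31)))  # 2019-03-31 00:00:00
```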

Another option is to split the string on underscores and build the date step by step:

df = pd.concat([df, df['f'].str.split('_', expand=True)], axis=1)
df
               f    0   1    2   3
0  l2y_q1_eps_gg  l2y  q1  eps  gg
1  l2y_q2_eps_gg  l2y  q2  eps  gg
2  l2y_q3_eps_gg  l2y  q3  eps  gg
3  l2y_q4_eps_gg  l2y  q4  eps  gg
4  l1y_q1_eps_gg  l1y  q1  eps  gg
df['year']=df[0].map({'l2y':'2019','l1y':'2020','cy':'2021'})
df['quarter']=df[1].str.upper()
df['fr_date'] = df['year'] + '-' + df['quarter']
df = df.drop([0,1,2,3], axis=1)
print(df)
               f  year quarter  fr_date
0  l2y_q1_eps_gg  2019      Q1  2019-Q1
1  l2y_q2_eps_gg  2019      Q2  2019-Q2
2  l2y_q3_eps_gg  2019      Q3  2019-Q3
3  l2y_q4_eps_gg  2019      Q4  2019-Q4
4  l1y_q1_eps_gg  2020      Q1  2020-Q1
df['fr_date'] = pd.to_datetime([f'{x[:4]}{x[-2:]}' for x in df['fr_date']])
df
               f  year quarter    fr_date
0  l2y_q1_eps_gg  2019      Q1 2019-01-01
1  l2y_q2_eps_gg  2019      Q2 2019-04-01
2  l2y_q3_eps_gg  2019      Q3 2019-07-01
3  l2y_q4_eps_gg  2019      Q4 2019-10-01
4  l1y_q1_eps_gg  2020      Q1 2020-01-01

df['fr_date'] = pd.to_datetime(df['fr_date']) +  pd.tseries.offsets.QuarterEnd()
df['fr_date'] = df['fr_date'].dt.strftime('%Y%m%d')
df = df.drop(['year', 'quarter'], axis=1)
print(df)
               f   fr_date
0  l2y_q1_eps_gg  20190331
1  l2y_q2_eps_gg  20190630
2  l2y_q3_eps_gg  20190930
3  l2y_q4_eps_gg  20191231
4  l1y_q1_eps_gg  20200331
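
A shorter route to the same result might go through quarterly periods instead of `to_datetime` plus `QuarterEnd`. This is a sketch of a swapped-in technique, not the answer's method. The fixed `year_map` mirrors the hard-coded years above, and the variable names are assumptions:

```python
import pandas as pd

df = pd.DataFrame({"f": ["l2y_q1_eps_gg", "l2y_q4_eps_gg", "l1y_q1_eps_gg"]})
year_map = {"l2y": "2019", "l1y": "2020", "cy": "2021"}  # fixed years, as above

parts = df["f"].str.split("_", expand=True)
labels = parts[0].map(year_map) + parts[1].str.upper()  # e.g. "2019Q1"
# end_time is the last instant of each quarter, so its date is the quarter's last day
end = pd.PeriodIndex(labels, freq="Q").end_time
df["fr_date"] = (end.year * 10000 + end.month * 100 + end.day).to_numpy()
print(df)
```

Building the value arithmetically yields an integer column directly, which matches the int type requested in the question.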

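Create a function change_string and apply it to column f. The function does the following:

  • Creates a dictionary containing the year mapping
  • Uses a regular expression to extract the year code from the string, then looks up the year for that code in the dictionary
  • Uses a regular expression to extract the quarter from the string
  • Uses pd.Timestamp to build the first day of the quarter's last month (month = quarter * 3, day = 1), then adds pd.tseries.offsets.QuarterEnd() to get the quarter end
  • Finally, uses strftime to return the datetime in the required string format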

import re
from datetime import date
import pandas as pd

def change_string(data):
    this_year = date.today().year
    changes = {"cy": this_year, "l1y": this_year - 1, "l2y": this_year - 2}
    # The year code is "l<digit>y" or "cy" at the start of the string
    year = changes[re.findall(r"^(l\dy|cy)", data)[0]]
    quarter = int(re.findall(r"_q(\d)", data)[0])
    data = (pd.Timestamp(year=year, month=quarter * 3, day=1) + pd.tseries.offsets.QuarterEnd()).strftime("%Y%m%d")
    return data

df = pd.DataFrame({"f":["l2y_q1_eps_gg","l2y_q2_eps_gg","l2y_q3_eps_gg","l2y_q4_eps_gg","l1y_q1_eps_gg"]})
df["fr_date"] = df.f.apply(change_string)
print(df)

               f         fr_date
    0   l2y_q1_eps_gg   20190331
    1   l2y_q2_eps_gg   20190630
    2   l2y_q3_eps_gg   20190930
    3   l2y_q4_eps_gg   20191231
    4   l1y_q1_eps_gg   20200331
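
The same steps can also be done without a per-row function. The sketch below uses `str.extract` with named groups and integer arithmetic, which yields the int type the question asks for. The helper name `add_fr_date` and its `current_year` parameter are assumptions, and the code assumes every row matches the pattern:

```python
import pandas as pd

def add_fr_date(df, current_year):
    # Hypothetical helper; current_year is passed in so results are reproducible
    parts = df["f"].str.extract(r"^(?P<code>l\dy|cy)_q(?P<quarter>[1-4])")
    years = parts["code"].map(
        {"l2y": current_year - 2, "l1y": current_year - 1, "cy": current_year}
    )
    # Month and day of each quarter's last day, encoded as the integer MMDD
    month_day = parts["quarter"].astype(int).map({1: 331, 2: 630, 3: 930, 4: 1231})
    return df.assign(fr_date=years * 10000 + month_day)

df = pd.DataFrame(
    {"f": ["l2y_q1_eps_gg", "l2y_q2_eps_gg", "l2y_q3_eps_gg",
           "l2y_q4_eps_gg", "l1y_q1_eps_gg", "cy_q1_eps_gg"]}
)
print(add_fr_date(df, current_year=2021))
```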
