如何将每日数据批量转换为每周数据



我的Dataframe11,516,015rows,这是daily数据。ts_code的数据是重复的,ts_code+trade_date是唯一的。我需要将所有数据转换为weekly数据,其中一些数据如下:

>>>print(df.tail(10))
ts_code trade_date        open        high         low       close   pre_close    change   pct_chg           vol       amount  adj_factor
11516005  000011.SZ   20210602    49.83250    49.98750    48.12750    48.51500    50.25875  -1.74375 -3.469545  23642.818065   115647.098       3.875
11516006  000010.SZ   20210602    43.20775    43.63875    43.10000    43.31550    43.42325  -0.10775 -0.248139   4262.180974    18472.845      10.775
11516007  000009.SZ   20210602    88.81000    89.60929    87.83309    88.09952    89.25405  -1.15453 -1.293532  28523.113388   252740.741       8.881
11516008  000008.SZ   20210602    50.86616    51.31432    50.86616    51.09024    51.09024   0.00000  0.000000   5659.005712    28933.202      22.408
11516009  000007.SZ   20210602    30.65080    30.73364    30.15376    30.31944    30.65080  -0.33136 -1.081081   3568.358281    10841.980       8.284
11516010  000006.SZ   20210602   193.12203   193.12203   190.56654   191.29668   192.75696  -1.46028 -0.757576   1591.295642    30539.090      36.507
11516011  000005.SZ   20210602    16.68240    16.96044    16.40436    16.68240    16.58972   0.09268  0.558659   8889.586750    14812.102       9.268
11516012  000004.SZ   20210602    65.87744    69.81952    64.69888    67.34048    64.94272   2.39776  3.692115  11595.858760    77192.135       4.064
11516013  000002.SZ   20210602  4078.37650  4183.02918  4049.13531  4118.39076  4093.76660  24.62416  0.601504   5546.065718  2287264.276     153.901
11516014  000001.SZ   20210602  2673.79269  2677.15032  2616.71298  2673.79269  2677.15032  -3.35763 -0.125418   4445.341089  1176608.126     111.921
>>>print(df[df.ts_code=='000001.SZ'].tail(10))
ts_code trade_date        open        high         low       close   pre_close     change   pct_chg          vol       amount  adj_factor
11477270  000001.SZ   20210520  2620.07061  2668.19664  2604.40167  2665.95822  2641.33560   24.62262  0.932203  3625.667837   957478.853     111.921
11481565  000001.SZ   20210521  2672.67348  2696.17689  2577.54063  2629.02429  2665.95822  -36.93393 -1.385390  4818.896722  1263058.114     111.921
11485858  000001.SZ   20210524  2627.90508  2641.33560  2595.44799  2627.90508  2629.02429   -1.11921 -0.042571  3073.021417   806092.207     111.921
11490157  000001.SZ   20210525  2634.62034  2767.80633  2624.54745  2753.25660  2627.90508  125.35152  4.770017  8688.044424  2363145.902     111.921
11494461  000001.SZ   20210526  2757.73344  2811.45552  2742.06450  2799.14421  2753.25660   45.88761  1.666667  8213.505776  2286540.248     111.921
11498767  000001.SZ   20210527  2787.95211  2815.93236  2744.30292  2774.52159  2799.14421  -24.62262 -0.879648  4503.992638  1246712.048     111.921
11503076  000001.SZ   20210528  2762.21028  2765.56791  2704.01136  2742.06450  2774.52159  -32.45709 -1.169827  4399.538335  1200523.315     111.921
11507387  000001.SZ   20210531  2723.03793  2745.42213  2676.03111  2708.48820  2742.06450  -33.57630 -1.224490  4604.594401  1244209.045     111.921
11511699  000001.SZ   20210601  2708.48820  2714.08425  2630.14350  2677.15032  2708.48820  -31.33788 -1.157025  5584.456536  1490476.624     111.921
11516014  000001.SZ   20210602  2673.79269  2677.15032  2616.71298  2673.79269  2677.15032   -3.35763 -0.125418  4445.341089  1176608.126     111.921
>>>print(df[df.trade_date=='20210601'].tail(10))
ts_code trade_date        open        high         low       close   pre_close    change   pct_chg           vol       amount  adj_factor
11511690  000011.SZ   20210601    50.02625    50.29750    49.05750    50.25875    49.79375   0.46500  0.933852  18663.819355    92935.863       3.875
11511691  000010.SZ   20210601    43.63875    43.63875    43.10000    43.42325    43.63875  -0.21550 -0.493827   5253.744780    22713.494      10.775
11511692  000009.SZ   20210601    90.76382    90.76382    88.36595    89.25405    90.85263  -1.59858 -1.759531  42079.017003   376049.211       8.881
11511693  000008.SZ   20210601    50.64208    51.09024    50.41800    51.09024    50.64208   0.44816  0.884956   6865.234738    34876.868      22.408
11511694  000007.SZ   20210601    30.65080    30.89932    29.90524    30.65080    30.65080   0.00000  0.000000   4805.589087    14552.560       8.284
11511695  000006.SZ   20210601   192.75696   193.12203   190.93161   192.75696   192.02682   0.73014  0.380228   1634.333689    31401.554      36.507
11511696  000005.SZ   20210601    16.21900    16.86776    16.03364    16.58972    16.21900   0.37072  2.285714  10139.708675    16756.998       9.268
11511697  000004.SZ   20210601    65.30848    65.79616    64.29248    64.94272    65.18656  -0.24384 -0.374065   5646.429626    36524.364       4.064
11511698  000002.SZ   20210601  4136.85888  4150.70997  4073.75947  4093.76660  4109.15670 -15.39010 -0.374532   3962.999656  1622419.892     153.901
11511699  000001.SZ   20210601  2708.48820  2714.08425  2630.14350  2677.15032  2708.48820 -31.33788 -1.157025   5584.456536  1490476.624     111.921

将我的daily数据转换为weekly数据的功能如下,但只能转换一个ts_code

def daily2weekly(df):
period_type = 'W'
df = pd.DataFrame.from_records(df, index=pd.to_datetime(df['trade_date']))
df_weekly = df.resample(period_type).last()
df_weekly['open'] = df['open'].resample(period_type).first()
df_weekly['high'] = df['high'].resample(period_type).max()
df_weekly['low'] = df['low'].resample(period_type).min()
df_weekly['vol'] = df['vol'].resample(period_type).sum()
df_weekly['amount'] = df['amount'].resample(period_type).sum()
df_weekly = df_weekly[df_weekly['trade_date'].notna()]
df_weekly['pre_close'] = df_weekly['close'].shift(axis=0, periods=1)
df_weekly['change'] = df_weekly.close.sub(df_weekly.pre_close)
df_weekly['pct_chg'] = df_weekly.change.div(df_weekly.pre_close).mul(100)
df_weekly = pd.DataFrame.from_records(df_weekly, index=range(len(df_weekly)))
return df_weekly

用法:

>>>t = daily2weekly(df[df.ts_code == '000001.SZ'])
>>>print(t.tail(10))
ts_code trade_date        open        high         low       close   pre_close     change    pct_chg           vol        amount  adj_factor
1498  000001.SZ   20210402  2363.10144  2455.27128  2354.21760  2387.53200  2347.55472   39.97728   1.702933  30859.466717  7.418832e+06     111.048
1499  000001.SZ   20210409  2393.08440  2453.05032  2340.89184  2365.32240  2387.53200  -22.20960  -0.930233  15314.973615  3.665203e+06     111.048
1500  000001.SZ   20210416  2388.64248  2388.64248  2199.86088  2249.83248  2365.32240 -115.48992  -4.882629  27897.387796  6.350587e+06     111.048
1501  000001.SZ   20210423  2224.29144  2626.28520  2210.96568  2586.30792  2249.83248  336.47544  14.955577  46800.293747  1.157247e+07     111.048
1502  000001.SZ   20210430  2650.71576  2690.69304  2529.67344  2586.30792  2586.30792    0.00000   0.000000  28037.472084  7.243296e+06     111.048
1503  000001.SZ   20210507  2565.20880  2698.46640  2565.20880  2670.70440  2586.30792   84.39648   3.263203  11729.247082  3.103294e+06     111.048
1504  000001.SZ   20210514  2665.15200  2669.59392  2529.41460  2609.99772  2670.70440  -60.70668  -2.273059  25223.037449  6.546422e+06     111.921
1505  000001.SZ   20210521  2590.97115  2700.65373  2568.58695  2629.02429  2609.99772   19.02657   0.728988  20528.087401  5.424483e+06     111.921
1506  000001.SZ   20210528  2627.90508  2815.93236  2595.44799  2742.06450  2629.02429  113.04021   4.299702  28878.102590  7.903014e+06     111.921
1507  000001.SZ   20210602  2723.03793  2745.42213  2616.71298  2673.79269  2742.06450  -68.27181  -2.489796  14634.392027  3.911294e+06     111.921

在我使用groupbyapplyjoblib后,40个进程总共需要30秒。

from joblib import Parallel, delayed
from tqdm import tqdm

data_grouped = df.groupby('ts_code')
# n_jobs: number of processes
results = Parallel(n_jobs=40)(
delayed(daily2weekly)(group) for name, group in tqdm(data_grouped, total=len(data_grouped)))
data = pd.concat(results)
100%|██████████| 4366/4366 [00:29<00:00, 148.69it/s]

您可以按ts_coderesample()groupby()数据,并应用感兴趣的聚合函数。请注意,对数据重新采样后,可以选择要应用聚合函数的列。

我最近在时间序列中处理了类似的情况,如下面的代码所示:

import pandas as pd
import numpy as np
def random_dates(start, end, n=10):
# Function copied from @akilat90
# Available on https://stackoverflow.com/questions/50559078/generating-random-dates-within-a-given-range-in-pandas

start_u = pd.to_datetime(start).value//10**9
end_u = pd.to_datetime(end).value//10**9
return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
size = 1000
index = random_dates(start='2021-01-01', end='2021-06-30', n=size).sort_values()
collaborators = np.random.randint(low=1, high=4, size=size)
prices = np.random.uniform(low=5., high=25., size=size)
data = pd.DataFrame({'Collaborator': collaborators,
'Price': prices}, index=index)

data.groupby('Collaborator').resample('W')['Price'].sum()

这是输出:

Collaborator            
1             2021-01-03     59.562737
2021-01-10    151.625485
2021-01-17    330.762041
2021-01-24    128.766567
2021-01-31    161.683576

3             2021-06-06    256.446862
2021-06-13    227.216391
2021-06-20    130.374960
2021-06-27    164.481192
2021-07-04     43.876798

相关内容

  • 没有找到相关文章

最新更新