我的Dataframe
有11,516,015
rows
,这是daily
数据。ts_code
的数据是重复的,ts_code
+trade_date
是唯一的。我需要将所有数据转换为weekly
数据,其中一些数据如下:
>>>print(df.tail(10))
ts_code trade_date open high low close pre_close change pct_chg vol amount adj_factor
11516005 000011.SZ 20210602 49.83250 49.98750 48.12750 48.51500 50.25875 -1.74375 -3.469545 23642.818065 115647.098 3.875
11516006 000010.SZ 20210602 43.20775 43.63875 43.10000 43.31550 43.42325 -0.10775 -0.248139 4262.180974 18472.845 10.775
11516007 000009.SZ 20210602 88.81000 89.60929 87.83309 88.09952 89.25405 -1.15453 -1.293532 28523.113388 252740.741 8.881
11516008 000008.SZ 20210602 50.86616 51.31432 50.86616 51.09024 51.09024 0.00000 0.000000 5659.005712 28933.202 22.408
11516009 000007.SZ 20210602 30.65080 30.73364 30.15376 30.31944 30.65080 -0.33136 -1.081081 3568.358281 10841.980 8.284
11516010 000006.SZ 20210602 193.12203 193.12203 190.56654 191.29668 192.75696 -1.46028 -0.757576 1591.295642 30539.090 36.507
11516011 000005.SZ 20210602 16.68240 16.96044 16.40436 16.68240 16.58972 0.09268 0.558659 8889.586750 14812.102 9.268
11516012 000004.SZ 20210602 65.87744 69.81952 64.69888 67.34048 64.94272 2.39776 3.692115 11595.858760 77192.135 4.064
11516013 000002.SZ 20210602 4078.37650 4183.02918 4049.13531 4118.39076 4093.76660 24.62416 0.601504 5546.065718 2287264.276 153.901
11516014 000001.SZ 20210602 2673.79269 2677.15032 2616.71298 2673.79269 2677.15032 -3.35763 -0.125418 4445.341089 1176608.126 111.921
>>>print(df[df.ts_code=='000001.SZ'].tail(10))
ts_code trade_date open high low close pre_close change pct_chg vol amount adj_factor
11477270 000001.SZ 20210520 2620.07061 2668.19664 2604.40167 2665.95822 2641.33560 24.62262 0.932203 3625.667837 957478.853 111.921
11481565 000001.SZ 20210521 2672.67348 2696.17689 2577.54063 2629.02429 2665.95822 -36.93393 -1.385390 4818.896722 1263058.114 111.921
11485858 000001.SZ 20210524 2627.90508 2641.33560 2595.44799 2627.90508 2629.02429 -1.11921 -0.042571 3073.021417 806092.207 111.921
11490157 000001.SZ 20210525 2634.62034 2767.80633 2624.54745 2753.25660 2627.90508 125.35152 4.770017 8688.044424 2363145.902 111.921
11494461 000001.SZ 20210526 2757.73344 2811.45552 2742.06450 2799.14421 2753.25660 45.88761 1.666667 8213.505776 2286540.248 111.921
11498767 000001.SZ 20210527 2787.95211 2815.93236 2744.30292 2774.52159 2799.14421 -24.62262 -0.879648 4503.992638 1246712.048 111.921
11503076 000001.SZ 20210528 2762.21028 2765.56791 2704.01136 2742.06450 2774.52159 -32.45709 -1.169827 4399.538335 1200523.315 111.921
11507387 000001.SZ 20210531 2723.03793 2745.42213 2676.03111 2708.48820 2742.06450 -33.57630 -1.224490 4604.594401 1244209.045 111.921
11511699 000001.SZ 20210601 2708.48820 2714.08425 2630.14350 2677.15032 2708.48820 -31.33788 -1.157025 5584.456536 1490476.624 111.921
11516014 000001.SZ 20210602 2673.79269 2677.15032 2616.71298 2673.79269 2677.15032 -3.35763 -0.125418 4445.341089 1176608.126 111.921
>>>print(df[df.trade_date=='20210601'].tail(10))
ts_code trade_date open high low close pre_close change pct_chg vol amount adj_factor
11511690 000011.SZ 20210601 50.02625 50.29750 49.05750 50.25875 49.79375 0.46500 0.933852 18663.819355 92935.863 3.875
11511691 000010.SZ 20210601 43.63875 43.63875 43.10000 43.42325 43.63875 -0.21550 -0.493827 5253.744780 22713.494 10.775
11511692 000009.SZ 20210601 90.76382 90.76382 88.36595 89.25405 90.85263 -1.59858 -1.759531 42079.017003 376049.211 8.881
11511693 000008.SZ 20210601 50.64208 51.09024 50.41800 51.09024 50.64208 0.44816 0.884956 6865.234738 34876.868 22.408
11511694 000007.SZ 20210601 30.65080 30.89932 29.90524 30.65080 30.65080 0.00000 0.000000 4805.589087 14552.560 8.284
11511695 000006.SZ 20210601 192.75696 193.12203 190.93161 192.75696 192.02682 0.73014 0.380228 1634.333689 31401.554 36.507
11511696 000005.SZ 20210601 16.21900 16.86776 16.03364 16.58972 16.21900 0.37072 2.285714 10139.708675 16756.998 9.268
11511697 000004.SZ 20210601 65.30848 65.79616 64.29248 64.94272 65.18656 -0.24384 -0.374065 5646.429626 36524.364 4.064
11511698 000002.SZ 20210601 4136.85888 4150.70997 4073.75947 4093.76660 4109.15670 -15.39010 -0.374532 3962.999656 1622419.892 153.901
11511699 000001.SZ 20210601 2708.48820 2714.08425 2630.14350 2677.15032 2708.48820 -31.33788 -1.157025 5584.456536 1490476.624 111.921
将我的daily
数据转换为weekly
数据的功能如下,但只能转换一个ts_code
:
def daily2weekly(df):
period_type = 'W'
df = pd.DataFrame.from_records(df, index=pd.to_datetime(df['trade_date']))
df_weekly = df.resample(period_type).last()
df_weekly['open'] = df['open'].resample(period_type).first()
df_weekly['high'] = df['high'].resample(period_type).max()
df_weekly['low'] = df['low'].resample(period_type).min()
df_weekly['vol'] = df['vol'].resample(period_type).sum()
df_weekly['amount'] = df['amount'].resample(period_type).sum()
df_weekly = df_weekly[df_weekly['trade_date'].notna()]
df_weekly['pre_close'] = df_weekly['close'].shift(axis=0, periods=1)
df_weekly['change'] = df_weekly.close.sub(df_weekly.pre_close)
df_weekly['pct_chg'] = df_weekly.change.div(df_weekly.pre_close).mul(100)
df_weekly = pd.DataFrame.from_records(df_weekly, index=range(len(df_weekly)))
return df_weekly
用法:
>>>t = daily2weekly(df[df.ts_code == '000001.SZ'])
>>>print(t.tail(10))
ts_code trade_date open high low close pre_close change pct_chg vol amount adj_factor
1498 000001.SZ 20210402 2363.10144 2455.27128 2354.21760 2387.53200 2347.55472 39.97728 1.702933 30859.466717 7.418832e+06 111.048
1499 000001.SZ 20210409 2393.08440 2453.05032 2340.89184 2365.32240 2387.53200 -22.20960 -0.930233 15314.973615 3.665203e+06 111.048
1500 000001.SZ 20210416 2388.64248 2388.64248 2199.86088 2249.83248 2365.32240 -115.48992 -4.882629 27897.387796 6.350587e+06 111.048
1501 000001.SZ 20210423 2224.29144 2626.28520 2210.96568 2586.30792 2249.83248 336.47544 14.955577 46800.293747 1.157247e+07 111.048
1502 000001.SZ 20210430 2650.71576 2690.69304 2529.67344 2586.30792 2586.30792 0.00000 0.000000 28037.472084 7.243296e+06 111.048
1503 000001.SZ 20210507 2565.20880 2698.46640 2565.20880 2670.70440 2586.30792 84.39648 3.263203 11729.247082 3.103294e+06 111.048
1504 000001.SZ 20210514 2665.15200 2669.59392 2529.41460 2609.99772 2670.70440 -60.70668 -2.273059 25223.037449 6.546422e+06 111.921
1505 000001.SZ 20210521 2590.97115 2700.65373 2568.58695 2629.02429 2609.99772 19.02657 0.728988 20528.087401 5.424483e+06 111.921
1506 000001.SZ 20210528 2627.90508 2815.93236 2595.44799 2742.06450 2629.02429 113.04021 4.299702 28878.102590 7.903014e+06 111.921
1507 000001.SZ 20210602 2723.03793 2745.42213 2616.71298 2673.79269 2742.06450 -68.27181 -2.489796 14634.392027 3.911294e+06 111.921
在我使用groupby
、apply
和joblib
后,40个进程总共需要30秒。
from joblib import Parallel, delayed
from tqdm import tqdm
data_grouped = df.groupby('ts_code')
# n_jobs: number of processes
results = Parallel(n_jobs=40)(
delayed(daily2weekly)(group) for name, group in tqdm(data_grouped, total=len(data_grouped)))
data = pd.concat(results)
100%|██████████| 4366/4366 [00:29<00:00, 148.69it/s]
您可以按ts_code
、resample()
周groupby()
数据,并应用感兴趣的聚合函数。请注意,对数据重新采样后,可以选择要应用聚合函数的列。
我最近在时间序列中处理了类似的情况,如下面的代码所示:
import pandas as pd
import numpy as np
def random_dates(start, end, n=10):
# Function copied from @akilat90
# Available on https://stackoverflow.com/questions/50559078/generating-random-dates-within-a-given-range-in-pandas
start_u = pd.to_datetime(start).value//10**9
end_u = pd.to_datetime(end).value//10**9
return pd.to_datetime(np.random.randint(start_u, end_u, n), unit='s')
size = 1000
index = random_dates(start='2021-01-01', end='2021-06-30', n=size).sort_values()
collaborators = np.random.randint(low=1, high=4, size=size)
prices = np.random.uniform(low=5., high=25., size=size)
data = pd.DataFrame({'Collaborator': collaborators,
'Price': prices}, index=index)
data.groupby('Collaborator').resample('W')['Price'].sum()
这是输出:
Collaborator
1 2021-01-03 59.562737
2021-01-10 151.625485
2021-01-17 330.762041
2021-01-24 128.766567
2021-01-31 161.683576
3 2021-06-06 256.446862
2021-06-13 227.216391
2021-06-20 130.374960
2021-06-27 164.481192
2021-07-04 43.876798