pandas groupby object index turns into FUBAR



我有一只熊猫,我的分组操作将索引变成糊状。我需要日期作为我的索引,在每个股票组中排序

举例来说。像这样设置熊猫:

import pandas as pd
from StringIO import StringIO
text = """Date   Ticker        Open        High         Low   Adj_Close   Volume
    2015-04-09  vws.co  315.000000  316.100000  312.500000  311.520000  1686800
    2015-04-10  vws.co  317.000000  319.700000  316.400000  312.700000  1396500
    2015-04-13  vws.co  317.900000  321.500000  315.200000  315.850000  1564500
    2015-04-14  vws.co  320.000000  322.400000  318.700000  314.870000  1370600
    2015-04-15  vws.co  320.000000  321.500000  319.200000  316.150000   945000
    2015-04-16  vws.co  319.000000  320.200000  310.400000  307.870000  2236100
    2015-04-17  vws.co  309.900000  310.000000  302.500000  299.100000  2711900
    2015-04-20  vws.co  303.000000  312.000000  303.000000  306.490000  1629700
    2016-03-31     mmm  166.750000  167.500000  166.500000  166.630005  1762800
    2016-04-01     mmm  165.630005  167.740005  164.789993  167.529999  1993700
    2016-04-04     mmm  167.110001  167.490005  165.919998  166.399994  2022800
    2016-04-05     mmm  165.179993  166.550003  164.649994  165.809998  1610300
    2016-04-06     mmm  165.339996  167.080002  164.839996  166.809998  2092200
    2016-04-07     mmm  165.880005  167.229996  165.250000  167.160004  2721900"""
df = pd.read_csv(StringIO(text), delim_whitespace=1, parse_dates=[0], index_col=0)

和代码

import pandas as pd
from pandas.io.data import DataReader
import numpy as np
import time
import os
stocklist = ['vws.co','nflx','mmm']

print ('df.tail (Input df)n',df.tail(6),'n')

def Screener(group):
    def diff_calc(group):
        df['Difference'] = df['Adj_Close'].diff()
        return df['Difference']
    df['Difference'] = diff_calc(group)
    return df
if __name__ == '__main__':
    df = GetStock(stocklist, start, end)
    df['Adj_Close'] = df['Adj Close']
    for ticker in stocklist:
        ### groupby screeener (filtering to only rel ticker group)
        df = df.groupby('Ticker', as_index=False).Adj_Close.apply(Screener)
    df.reset_index().sort(['Ticker', 'Date'], ascending=[1,1]).set_index('Ticker')
    print ('(Output df)n',df,'n')
# Test the first 7 rows of each group for rolling_mean transgress groups...
df_test = df.groupby('Ticker').head(7).reset_index().set_index('Date')
print ('df_test (summary from df) (Output)n',df_test,'n')

显然我的索引现在搞砸了,我不知道这是怎么发生的。

(Output df)
                   Ticker    Open    High     Low  Adj Close  Adj_Close        Date                                                               
0 0 0 2016-05-20  vws.co  443.00  446.30  441.40     442.90     442.90   
      2016-05-23  vws.co  442.00  446.70  439.90     439.90     439.90   
      2016-05-24  vws.co  439.10  450.00  438.10     450.00     450.00   
      2016-05-25  vws.co  455.50  466.10  454.30     464.90     464.90   
      2016-05-26  vws.co  465.00  470.80  464.60     464.60     464.60   
      2016-05-27  vws.co  464.00  480.70  461.20     476.00     476.00   
      2016-05-30  vws.co  477.00  481.80  473.10     475.00     475.00   
      2016-05-31  vws.co  474.00  479.30  472.20     479.00     479.00   
      2016-06-01  vws.co  477.40  480.20  472.90     474.40     474.40   
      2016-05-20    nflx   90.08   93.28   89.98      92.49      92.49   
      2016-05-23    nflx   92.98   95.29   92.85      94.89      94.89   
      2016-05-24    nflx   95.98   99.14   95.75      97.89      97.89   
      2016-05-25    nflx   99.00  100.31   98.30     100.20     100.20   

我需要日期作为我的索引,在每个股票组中排序

谁能帮忙?

好的,

我终于找到了解决方案。这句台词是我的秘诀

df = df.reset_index(level=0, drop=True)

这个问题帮助我把索引恢复到我想要的样子。如何获取熊猫数据框中的行,一列中有最大值并保留原始索引?

下面的代码将让我摆脱索引中添加的不需要的 pr. 迭代 col。谢谢大家!

import pandas as pd
from pandas.io.data import DataReader
import numpy as np
import time
import os
from io import StringIO
text = """Date   Ticker        Open        High         Low   Adj_Close   Volume
    2015-04-09  vws.co  315.000000  316.100000  312.500000  311.520000  1686800
    2015-04-10  vws.co  317.000000  319.700000  316.400000  312.700000  1396500
    2015-04-13  vws.co  317.900000  321.500000  315.200000  315.850000  1564500
    2015-04-14  vws.co  320.000000  322.400000  318.700000  314.870000  1370600
    2015-04-15  vws.co  320.000000  321.500000  319.200000  316.150000   945000
    2015-04-16  vws.co  319.000000  320.200000  310.400000  307.870000  2236100
    2015-04-17  vws.co  309.900000  310.000000  302.500000  299.100000  2711900
    2015-04-20  vws.co  303.000000  312.000000  303.000000  306.490000  1629700
    2016-03-31     mmm  166.750000  167.500000  166.500000  166.630005  1762800
    2016-04-01     mmm  165.630005  167.740005  164.789993  167.529999  1993700
    2016-04-04     mmm  167.110001  167.490005  165.919998  166.399994  2022800
    2016-04-05     mmm  165.179993  166.550003  164.649994  165.809998  1610300
    2016-04-06     mmm  165.339996  167.080002  164.839996  166.809998  2092200
    2016-04-07     mmm  165.880005  167.229996  165.250000  167.160004  2721900"""
df = pd.read_csv(StringIO(text), delim_whitespace=1, parse_dates=[0], index_col=0)
runstart = time.time()     # Start script timer
stocklist = ['vws.co','nflx','mmm']#,'msft','tsla']
tickers =   []
def Screener(group):
    def diff_calc(group):
        df['Difference'] = df['Adj_Close'].diff()
        return df['Difference']
    df['Difference'] = diff_calc(group)
    return df
if __name__ == '__main__':
    for ticker in stocklist:
        ### groupby screeener (filtering to only rel ticker group)
        df = df.groupby('Ticker', as_index=False).Adj_Close.apply(Screener) #.reset_index()
        df = df.reset_index(level=0, drop=True)
    print ('(Output df)n',df,'n')
    # Test the first 7 rows of each group for rolling_mean transgress groups...
    df_test = df.groupby('Ticker').head(7).reset_index().set_index('Date')
    print ('df_test (summary from df) (Output)n',df_test,'n')

groupby 函数会自动将您分组的列设置为索引。要避免这种情况,请传递以下选项:在您的分组中as_index = False

df_test = df.groupby('Ticker', as_index = False).head(7).reset_index().set_index('Date')

GroupBy对象的diff方法将日期保留为索引:

# sort if needed (use 'mergesort' algorithm to preserve date order)
df = df.sort().sort_values('Ticker', kind='mergesort')
df['Difference'] = df.groupby('Ticker')['Adj_Close'].diff()

注意:如果输入未排序,则需要先对其进行排序。上面的代码首先按索引(即日期)排序,然后使用 mergesort 按股票代码排序以保留日期的顺序。

输出

In [21]: df[['Ticker', 'Adj_Close', 'Difference']]
Out[21]: 
            Ticker   Adj_Close  Difference
Date                                      
2015-04-09  vws.co  311.520000         NaN
2015-04-10  vws.co  312.700000    1.180000
2015-04-13  vws.co  315.850000    3.150000
2015-04-14  vws.co  314.870000   -0.980000
2015-04-15  vws.co  316.150000    1.280000
2015-04-16  vws.co  307.870000   -8.280000
2015-04-17  vws.co  299.100000   -8.770000
2015-04-20  vws.co  306.490000    7.390000
2016-03-31     mmm  166.630005         NaN
2016-04-01     mmm  167.529999    0.899994
2016-04-04     mmm  166.399994   -1.130005
2016-04-05     mmm  165.809998   -0.589996
2016-04-06     mmm  166.809998    1.000000
2016-04-07     mmm  167.160004    0.350006

最新更新