How to merge rows based on NaN index values



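I ran into a strange one today. I'm scraping thousands of PDFs with Tabula-py, and for whatever reason the same table with wrapped text (in different PDFs) sometimes comes back merged automatically along the table's actual row splits, while in other cases the pandas DataFrame contains many NaN rows to account for the wrapped text. The ratio is roughly 50:1 in favor of the merged form, so I'd like to automate the merge for the remaining cases. Here is an example: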

Desired DataFrame:

    Column1      | Column2     | Column3
A  Many Many ...  Lots and ...  This keeps..
B  lots of text.. Many Texts..  Johns and jo..
C   ...
D

DataFrame returned by the scrape:

        Column1      | Column2     | Column3
    A  Many Many       Lots         This keeps Just
   Nan Many Many       and lots     Keeps Going!
   Nan Texts           Nan          Nan
    B  lots of        Many Texts    John and
   Nan text           here          Johnson inc.
    C  ...

In this case the text should be merged upward, so that "Many Many Many Many Texts" all ends up in cell A of Column1, and so on.

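I've solved this with the code below, but it feels dirty. There is a lot of index setting and resetting just to avoid managing columns and to avoid dropping values I need. Does anyone know a better solution?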

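# Move the row labels into a regular column ('Unnamed: 0') so they can be filled
df = df.reset_index()
# Forward-fill the labels so wrapped rows inherit the label of the row above
df['Unnamed: 0'] = df['Unnamed: 0'].ffill()
df = df.fillna('')
df = df.set_index('Unnamed: 0')
# Join the text of all rows sharing a label; transform repeats the result on each row
df = df.groupby(level=0)[list(df.columns)].transform(lambda x: ' '.join(x))
df = df.reset_index()
# Keep a single copy of each merged row
df = df.drop_duplicates(keep='first')
df = df.set_index('Unnamed: 0')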

Cheers

Similar to Ben's idea:

# fill the missing index
df.index = df.index.to_series().ffill()

(df.stack()               # stack to kill the other NaN values
    .groupby(level=(0,1)) # group by (index, column)
    .apply(' '.join)      # join those strings
    .unstack(level=1)     # unstack to get columns back
)

Output:

                     Column1          Column2                       Column3
A  Many Many Many Many Texts    Lots and lots  This keeps Just Keeps Going!
B               lots of text  Many Texts here         John and Johnson inc.
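
For anyone who wants to run this end to end, here is a minimal sketch on the sample data from the question. The frame construction is my own reconstruction of the scraped table, and the explicit `dropna()` is an addition: as far as I know, newer pandas releases changed how `stack()` treats NaN cells, so relying on it to drop them silently may not be safe on every version.

```python
import numpy as np
import pandas as pd

# Sample frame reconstructed from the question's "scraped" table (assumed shape).
df = pd.DataFrame(
    {
        "Column1": ["Many Many", "Many Many", "Texts", "lots of", "text"],
        "Column2": ["Lots", "and lots", np.nan, "Many Texts", "here"],
        "Column3": ["This keeps Just", "Keeps Going!", np.nan, "John and", "Johnson inc."],
    },
    index=["A", np.nan, np.nan, "B", np.nan],
)

# Forward-fill the row labels so continuation rows inherit "A" or "B".
df.index = df.index.to_series().ffill().to_numpy()

merged = (
    df.stack()                # long format: (row label, column) -> text fragment
      .dropna()               # explicit, in case stack() kept the NaN cells
      .groupby(level=[0, 1])  # one group per (row label, column)
      .agg(" ".join)          # join fragments in their original order
      .unstack(level=1)       # spread the columns back out
)
print(merged)
```

Because the NaN cells are removed before joining, the merged strings carry no stray separators.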

Try this:

df.fillna('').groupby(df.index.to_series().ffill()).agg(' '.join)

Out[1390]:
                              Column1          Column2                       Column3
Unnamed: 0
A           Many Many Many Many Texts    Lots and lots  This keeps Just Keeps Going!
B                        lots of text  Many Texts here         John and Johnson inc.
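
One thing to watch with the `fillna('')` variant: the empty strings still take part in the join, so a column with a missing fragment can end up with a trailing space. A small sketch of the difference, using sample data I reconstructed from the question (the names `padded` and `clean` are just for illustration):

```python
import numpy as np
import pandas as pd

# Sample frame reconstructed from the question's "scraped" table (assumed shape).
df = pd.DataFrame(
    {
        "Column1": ["Many Many", "Many Many", "Texts", "lots of", "text"],
        "Column2": ["Lots", "and lots", np.nan, "Many Texts", "here"],
        "Column3": ["This keeps Just", "Keeps Going!", np.nan, "John and", "Johnson inc."],
    },
    index=["A", np.nan, np.nan, "B", np.nan],
)

# Forward-filled labels as a plain array, so no index alignment is involved.
keys = df.index.to_series().ffill().to_numpy()

# fillna('') leaves an empty string for each missing cell, which still gets joined.
padded = df.fillna("").groupby(keys).agg(" ".join)

# Dropping NaN inside each group joins only the real fragments.
clean = df.groupby(keys).agg(lambda s: " ".join(s.dropna()))

print(repr(padded.loc["A", "Column2"]))  # 'Lots and lots '
print(repr(clean.loc["A", "Column2"]))   # 'Lots and lots'
```

If the trailing whitespace does not matter downstream, the one-liner is fine as it stands; otherwise dropping NaN per group (or stripping afterwards) is a cheap fix.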
I think you can run the groupby directly on the forward-filled index, and then use agg instead of transform:

import numpy as np
import pandas as pd

# dummy input
df = pd.DataFrame({'a': list('abcdef'), 'b': list('123456')},
                  index=['A', np.nan, np.nan, 'B', 'C', np.nan])
print(df)
     a  b
A    a  1
NaN  b  2
NaN  c  3
B    d  4
C    e  5
NaN  f  6
# then group by the forward-filled index and aggregate
new_df = (df.fillna('')
            .groupby(pd.Series(df.index).ffill().values)[df.columns]
            .agg(lambda x: ' '.join(x)))
print(new_df)
       a      b
A  a b c  1 2 3
B      d      4
C    e f    5 6
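
Since the question is about thousands of PDFs where only some tables come back split, the same idea could be wrapped in a small helper that leaves already-merged frames alone. This is only a sketch: `merge_wrapped_rows` is a hypothetical name, not something pandas or Tabula-py provides, and it assumes the row labels live in the index and every cell is text.

```python
import numpy as np
import pandas as pd


def merge_wrapped_rows(df: pd.DataFrame, sep: str = " ") -> pd.DataFrame:
    """Collapse rows whose index label is NaN into the labelled row above.

    Hypothetical helper for illustration only.
    """
    if not df.index.isna().any():
        # Nothing was wrapped, so hand the frame back untouched.
        return df
    keys = df.index.to_series().ffill().to_numpy()
    # sort=False keeps the labels in the order they appear in the table.
    return df.groupby(keys, sort=False).agg(
        lambda s: sep.join(s.dropna().astype(str))
    )


# Same dummy input as in the answer above.
wrapped = pd.DataFrame(
    {"a": list("abcdef"), "b": list("123456")},
    index=["A", np.nan, np.nan, "B", "C", np.nan],
)
print(merge_wrapped_rows(wrapped))
```

One caveat: if a table starts with a NaN label, the forward fill has nothing to copy, and groupby skips NaN keys by default, so those leading rows would be dropped.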
