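I got a weird one today. I am scraping several thousand PDFs with Tabula-py and, for whatever reason, the same table with wrapped text (in different PDFs) is sometimes merged automatically based on the actual splits of the table, while in other cases the pandas dataframe comes back with many NaN rows to account for the wrapped text. Roughly 50 tables are merged correctly for every one that is not, so it makes sense to automate the merging. Here is an example:
Desired dataframe: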
      Column1        | Column2      | Column3
A     Many Many ...  | Lots and ... | This keeps..
B     lots of text.. | Many Texts.. | Johns and jo..
C     ...
D
Dataframe actually returned by the scrape:
      Column1   | Column2    | Column3
A     Many Many | Lots       | This keeps Just
NaN   Many Many | and lots   | Keeps Going!
NaN   Texts     | NaN        | NaN
B     lots of   | Many Texts | John and
NaN   text      | here       | Johnson inc.
C     ...
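In this case the text should be merged upward, so that "Many Many Many Many Texts" all ends up in cell A of Column1, and so on.
I have solved this with the code below, but it feels dirty. There is a lot of index setting just to avoid managing the columns and to avoid dropping values that are needed. Does anyone know a better solution?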
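df = df.reset_index()
# forward-fill the row labels so continuation rows inherit the label above
df['Unnamed: 0'] = df['Unnamed: 0'].ffill()
df = df.fillna('')
df = df.set_index('Unnamed: 0')
# join the text of every row that shares a label (group on the filled index)
df = df.groupby(df.index)[df.columns].transform(lambda x: ' '.join(x))
df = df.reset_index()
# transform repeats the joined text on every row, so keep one row per label
df = df.drop_duplicates(keep='first')
df = df.set_index('Unnamed: 0')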
Cheers
Similar to Ben's idea:
# fill the missing index labels
df.index = df.index.to_series().ffill()
(df.stack()                  # stack to drop the other NaN values
   .groupby(level=[0, 1])    # group by (index, column)
   .apply(' '.join)          # join those strings
   .unstack(level=1)         # unstack to get the columns back
)
Output:
                     Column1          Column2                       Column3
A  Many Many Many Many Texts    Lots and lots  This keeps Just Keeps Going!
B               lots of text  Many Texts here         John and Johnson inc.
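For anyone who wants to run the stack approach end to end, here is a minimal sketch. The input frame is a hand-built reconstruction of the scraped example from the question, so the cell boundaries are an assumption. The explicit `dropna()` is also an addition: depending on the pandas version, `stack` may keep NaN entries instead of dropping them, and dropping them explicitly makes the join independent of that.

```python
import numpy as np
import pandas as pd

# Assumed reconstruction of the scraped frame from the question.
df = pd.DataFrame(
    {
        "Column1": ["Many Many", "Many Many", "Texts", "lots of", "text"],
        "Column2": ["Lots", "and lots", np.nan, "Many Texts", "here"],
        "Column3": ["This keeps Just", "Keeps Going!", np.nan,
                    "John and", "Johnson inc."],
    },
    index=["A", np.nan, np.nan, "B", np.nan],
)

# Continuation rows inherit the label of the row above them.
df.index = df.index.to_series().ffill()

merged = (
    df.stack()                 # long format: one entry per (row label, column)
      .dropna()                # drop empty cells explicitly
      .groupby(level=[0, 1])   # group by (row label, column)
      .agg(" ".join)           # join the text fragments in their original order
      .unstack(level=1)        # back to one column per original column
)
print(merged)
```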
Try this:
df.fillna('').groupby(df.index.to_series().ffill()).agg(' '.join)
Out[1390]:
                              Column1          Column2
Unnamed: 0
A           Many Many Many Many Texts    Lots and lots
B                        lots of text  Many Texts here

                                 Column3
Unnamed: 0
A           This keeps Just Keeps Going!
B                  John and Johnson inc.
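One detail worth knowing about the one-liner: because empty cells are filled with `''` before joining, a group with a missing cell ends up with a stray space (for example `'Lots and lots '`). A possible variant, sketched below, drops the NaN values inside each group instead. The helper name `merge_wrapped_rows` and the sample frame are hypothetical, made up for illustration.

```python
import numpy as np
import pandas as pd


def merge_wrapped_rows(df: pd.DataFrame) -> pd.DataFrame:
    """Collapse continuation rows (NaN index label) into the labelled row above."""
    # Forward-filled labels used purely as group keys; df itself is not modified.
    keys = df.index.to_series().ffill().to_numpy()
    # Drop NaN cells per group before joining, so no padding spaces appear.
    return df.groupby(keys).agg(lambda s: " ".join(s.dropna()))


# Assumed reconstruction of the scraped frame from the question.
scraped = pd.DataFrame(
    {
        "Column1": ["Many Many", "Many Many", "Texts", "lots of", "text"],
        "Column2": ["Lots", "and lots", np.nan, "Many Texts", "here"],
        "Column3": ["This keeps Just", "Keeps Going!", np.nan,
                    "John and", "Johnson inc."],
    },
    index=["A", np.nan, np.nan, "B", np.nan],
)

out = merge_wrapped_rows(scraped)
print(out)
```

Wrapping it in a function also makes it easy to apply to every dataframe that comes back from the scrape.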
You can use ffill directly on the index inside the groupby, then use agg instead of transform.
import numpy as np
import pandas as pd

# dummy input
df = pd.DataFrame({'a': list('abcdef'), 'b': list('123456')},
                  index=['A', np.nan, np.nan, 'B', 'C', np.nan])
print(df)
     a  b
A    a  1
NaN  b  2
NaN  c  3
B    d  4
C    e  5
NaN  f  6
# then group by the forward-filled index and aggregate
new_df = (df.fillna('')
            .groupby(pd.Series(df.index).ffill().values)[df.columns]
            .agg(lambda x: ' '.join(x)))
print(new_df)
       a      b
A  a b c  1 2 3
B      d      4
C    e f    5 6
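To make the agg-versus-transform point concrete, here is a small sketch on the same dummy input. `transform` broadcasts each group's result back onto every original row, which is why the version in the question needed `drop_duplicates` afterwards, while `agg` returns one row per group. The variable names are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"a": list("abcdef"), "b": list("123456")},
    index=["A", np.nan, np.nan, "B", "C", np.nan],
)
# Forward-fill the labels so each continuation row belongs to a group.
df.index = df.index.to_series().ffill()
grouped = df.groupby(level=0)

# transform: same number of rows as the input, joined text repeated per row
transformed = grouped.transform(lambda s: " ".join(s))
# agg: one row per group, no duplicates to clean up
aggregated = grouped.agg(lambda s: " ".join(s))

print(transformed.shape)  # (6, 2)
print(aggregated.shape)   # (3, 2)
```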