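I am using the tabula package in Python 3 to extract data from tables in PDFs.
I am trying to import tables from multiple PDFs hosted online (e.g. http://trreb.ca/files/market-stats/community-reports/2019/Q4/Durham/AjaxQ42019.pdf), but I am struggling to import even a single table correctly.
Here is the code I ran: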
! pip install -q tabula-py
! pip install pandas
import pandas as pd
import tabula
from tabula import read_pdf
pdf = "http://trebhome.com/files/market-stats/community-reports/2019/Q4/Durham/AjaxQ42019.pdf"
data = read_pdf(pdf, output_format='dataframe', pages="all")
data
It gives the following output:
[ Community Sales Dollar Volume ... Active Listings Avg. SP/LP Avg. DOM
0 Ajax 391 $265,999,351 ... 73 100% 21
1 Central East 32 $21,177,488 ... 3 99% 26
2 Northeast Ajax 70 $50,713,199 ... 18 100% 21
3 South East 105 $68,203,487 ... 15 100% 20
[4 rows x 9 columns]]
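This seems to work, except that it skips rows: "Central West", which should follow "Central East", is missing, as are several others. Here is the actual table in question, from the PDF at the URL in the code above: Ajax Q4 2019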
I have also tried tinkering with some of the options of the read_pdf function, but with little success.
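The end goal is a script that goes through all of these "community reports" (there are a lot of them), extracts every such table from the PDFs, and combines them into a single data frame in Python for analysis.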
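A minimal sketch of what such a script might look like, under some assumptions. The names `combine_reports` and `read_tables_with_tabula` are hypothetical, the `Report` column is added purely for illustration, and the reader assumes that `tabula.read_pdf` returns a list of data frames (its default in recent tabula-py versions) and that `lattice=True` suits these particular PDFs. The reader is passed in as a parameter so the combining logic can be tried without tabula or Java installed.

```python
import pandas as pd


def read_tables_with_tabula(source):
    """Hypothetical reader: return a list of DataFrames for one PDF path or URL."""
    # Imported lazily so the combining logic below works without tabula/Java.
    import tabula
    return tabula.read_pdf(source, pages="all", lattice=True)


def combine_reports(sources, read_tables=read_tables_with_tabula):
    """Read every report in `sources` (label -> path or URL) into one DataFrame."""
    frames = []
    for label, source in sources.items():
        for table in read_tables(source):
            # Tag each row with the report it came from.
            frames.append(table.assign(Report=label))
    if not frames:
        # Nothing was read, so return an empty frame instead of failing in concat.
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

Calling `combine_reports({"Ajax Q4 2019": pdf})`, with `pdf` set to the URL used in the code above, would read that single report; more reports can be added to the dictionary as further label/URL pairs.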
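If the question is unclear, or more information is needed, please let me know! I am new to both Python and Stack Exchange, so I apologize if I have not framed this correctly.
Of course, any help would be greatly appreciated!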
Bryn
The following code almost works:
import pandas as pd
from tabula import convert_into

pdf = "http://trebhome.com/files/market-stats/community-reports/2019/Q4/Durham/AjaxQ42019.pdf"
# lattice mode uses the table's ruling lines to locate the cells
convert_into(pdf, "test.csv", pages="all", lattice=True)
# copy the CSV without its first (header) line
with open("test.csv", "r") as f, open("updated_test.csv", "w") as f1:
    next(f)  # skip header line
    for line in f:
        f1.write(line)
data = pd.read_csv("updated_test.csv")
# rename first column, drop unwanted rows
data.rename(columns={'Unnamed: 0': 'Community'}, inplace=True)
data.dropna(inplace=True)
data
and gives the output:
Community Year Quarter Sales Dollar Volume Average Price Median Price New Listings Active Listings Avg. SP/LP
1 Central 2019 Q4 44.0 $27,618,950 $627,703 $630,500 67.0 8.0 99%
2 Central East 2019 Q4 32.0 $21,177,488 $661,797 $627,450 34.0 3.0 99%
3 Central West 2019 Q4 57.0 $40,742,450 $714,780 $675,000 65.0 7.0 99%
4 Northeast Ajax 2019 Q4 70.0 $50,713,199 $724,474 $716,500 82.0 18.0 100%
5 Northwest Ajax 2019 Q4 49.0 $37,192,790 $759,037 $765,000 63.0 14.0 99%
6 South East 2019 Q4 105.0 $68,203,487 $649,557 $640,000 117.0 15.0 100%
7 South West 2019 Q4 34.0 $20,350,987 $598,558 $590,000 36.0 8.0 99%
The only problem here is that the last column, "Avg. DOM", is not picked up by the convert_into command.
That does not matter for my analysis, but it would certainly be a problem for anyone else trying to pull tables in a similar way.
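For the analysis step, note that values such as "$27,618,950" and "99%" in the output above are read in as strings. Here is a minimal sketch of one way to convert them to numbers. The helper name `clean_numeric_columns` is hypothetical, and the sample frame just repeats two rows from the output above.

```python
import pandas as pd


def clean_numeric_columns(df, money_cols, percent_cols):
    """Return a copy with "$1,234" and "99%" style strings converted to numbers."""
    out = df.copy()
    for col in money_cols:
        # Strip the dollar sign and thousands separators, then parse.
        stripped = out[col].astype(str).str.replace(r"[$,]", "", regex=True)
        out[col] = pd.to_numeric(stripped)
    for col in percent_cols:
        # Drop the trailing percent sign and express the value as a fraction.
        out[col] = pd.to_numeric(out[col].astype(str).str.rstrip("%")) / 100
    return out


# Two rows copied from the output above, used only as sample input.
sample = pd.DataFrame({
    "Community": ["Central", "Central East"],
    "Dollar Volume": ["$27,618,950", "$21,177,488"],
    "Avg. SP/LP": ["99%", "99%"],
})
cleaned = clean_numeric_columns(sample, ["Dollar Volume"], ["Avg. SP/LP"])
```

The same call should work on `data` by listing its price columns in `money_cols`, assuming they all follow the same "$" and comma format.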