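I'm trying to write a Python script that downloads the data from https://data.cms.gov/provider-data/dataset/g6vv-u9sr and performs various operations on the dataset. I'm having trouble pulling the data automatically, and I don't know how to write a query that returns the entire dataset (preferably as a CSV loaded into pandas). Any pointers?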
You can download the CSV data with the requests module, for example:
import requests
import pandas as pd
from io import StringIO
r = requests.get(
"https://data.cms.gov/provider-data/sites/default/files/resources/72ed1971c684c81da254c00145da1b47_1647887934/NH_Penalties_Mar2022.csv"
)
df = pd.read_csv(StringIO(r.text))
print(df.dtypes)
print(len(df))
Prints:
Federal Provider Number object
Provider Name object
Provider Address object
Provider City object
Provider State object
Provider Zip Code int64
Penalty Date object
Penalty Type object
Fine Amount float64
Payment Denial Start Date object
Payment Denial Length in Days float64
Location object
Processing Date object
dtype: object
27881
Edit: As @Parfait noted, you can pass the URL directly to pd.read_csv. In this case, however, you need to set the encoding= parameter explicitly ("latin1" or "iso_8859-1" works):
df = pd.read_csv(
"https://data.cms.gov/provider-data/sites/default/files/resources/72ed1971c684c81da254c00145da1b47_1647887934/NH_Penalties_Mar2022.csv",
encoding="iso_8859-1",
)
print(len(df))
Prints:
27881
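If you would rather not hard-code the encoding, one option is to take the raw bytes (for example r.content from requests) and try a list of encodings in order. This is only a minimal sketch: decode_with_fallback and the sample bytes are hypothetical names and data made up for illustration, not part of pandas or requests, and the assumption is that the file needs encoding= because it contains bytes that are not valid UTF-8.

```python
def decode_with_fallback(raw, encodings=("utf-8", "latin1")):
    """Return (text, encoding_used), trying each encoding in order.

    Hypothetical helper for illustration only.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate encoding
    raise ValueError(f"none of {encodings} could decode the data")


# Made-up sample: 0xE9 is "é" in latin1 but is not valid UTF-8 here,
# so the first attempt fails and the helper falls back to latin1.
sample = b"Provider Name,Fine Amount\nCaf\xe9 Care Center,650.0\n"
text, used = decode_with_fallback(sample)
print(used)
```

The decoded text can then be passed to pd.read_csv(StringIO(text)) as in the first snippet. Note that latin1 maps every possible byte to a character and so never fails; keep it last in the list, and treat it as a fallback that may yield wrong characters if the file is really in some other encoding.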