Automating data processing in Python with Pandas



TL;DR: I have a lot of data that needs to be processed in a consistent way, and I'm looking for a way to automate it.

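Hi everyone, I'm working with a dataset on forest fires in Portugal from 2013 to 2022. The main goal is to give the public official information on the number of fires, burnt area, and other metrics, so that the issue can be discussed based on facts rather than guesses.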

The problem is that I'm struggling to set up a workflow in which I don't have to repeat the same code for every dataframe.

I'd like to know if there is a way to turn this:

total_records_2022 = df_in_2022['id'].nunique()
total_records_2021 = df_in_2021['id'].nunique()
total_records_2020 = df_in_2020['id'].nunique()
total_records_2019 = df_in_2019['id'].nunique()
total_records_2018 = df_in_2018['id'].nunique()
total_records_2017 = df_in_2017['id'].nunique()
total_records_2016 = df_in_2016['id'].nunique()
total_records_2015 = df_in_2015['id'].nunique()
total_records_2014 = df_in_2014['id'].nunique()
total_records_2013 = df_in_2013['id'].nunique()

or this:

# GET TOTAL BURNT AREA FOR EACH YEAR
total_burn_area_measure = " ha"
df_in_2022_reset_burntarea = df_in_2022['icnf.burnArea.total'].fillna(0)
total_burnt_area_2022_number_full = df_in_2022['icnf.burnArea.total'].sum()
total_burnt_area_2022_number = "{:.2f}".format(total_burnt_area_2022_number_full)
total_burnt_area_2022 = total_burnt_area_2022_number + total_burn_area_measure

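into a loop that runs through all the dataframes and applies whatever data treatment I need. The full code can be found here and, as you can see, there is a lot of data to process, but it all needs to be processed in a consistent way.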

Any help or pointers would be greatly appreciated.

Full disclosure: this code will be used for non-commercial purposes.

So, after a lot of browsing, I came up with this solution. It may not be the best approach, I don't know, but it works.

# import libraries 
import json
import requests
import pandas as pd 
import datetime as dt 
from datetime import datetime, timedelta, date 
# Define Arrays
csv_git = [2013,2014,2015,2016,2017,2018]
csv_fogos = [2019,2020,2021]
years = [2013,2014,2015,2016,2017,2018,2019,2020,2021,2022]

# -------------------------------------
#     GET INITIAL DATA - ALL FOR 2022
# -------------------------------------
url_bar_2022 = "https://api.fogos.pt/v2/incidents/search?after=2022-01-01&limit=1000000"
# Get response from URL 
response_2022 = requests.get(url_bar_2022)
json_2022 = response_2022.json()
# Create dataframe for 2022 and treat the data 
df_in_2022 = pd.json_normalize(json_2022,'data')
df_in_2022.loc[:,'date'] = pd.to_datetime(df_in_2022['date'],format='%d-%m-%Y')
df_in_2022['month'] = pd.DatetimeIndex(df_in_2022['date']).month
df_in_2022 = df_in_2022.sort_values(by='district', ascending=True)
#df_in_ = {}
for i in csv_git:
    globals()[f"df_in_{i}"] = pd.read_csv(f'https://raw.githubusercontent.com/vostpt/ICNF_DATA/main/icnf_{i}_raw.csv')

for i in csv_fogos:
    globals()[f"df_in_{i}"] = pd.read_csv(f'assets/fogos_{i}.csv')

for i in csv_git:
    globals()[f"df_in_{i}"] = globals()[f"df_in_{i}"].rename(columns={"Unnamed: 0": "id","MES":"month","DISTRITO":"district","ANO":"year","CONCELHO":"concelho","AREATOTAL":"icnf.burnArea.total"})

for i in years:
    globals()[f"total_records_{i}"] = globals()[f"df_in_{i}"]['id'].nunique()

print(total_records_2022)
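
A possible alternative to writing into globals() is to keep the dataframes in a dictionary keyed by year and to put the per-year treatment into one function. The sketch below is only an illustration of that idea, not a tested replacement for the code above: the `frames` dictionary, the `summarize` helper, and the tiny in-memory dataframes are hypothetical stand-ins, and it assumes every dataframe already has the `id` and `icnf.burnArea.total` columns (that is, the renaming step has already been applied).

```python
import pandas as pd

# Hypothetical stand-in data. In the real workflow each value would be the
# dataframe loaded for that year (pd.read_csv or the API response).
frames = {
    2021: pd.DataFrame({"id": [1, 2, 2], "icnf.burnArea.total": [1.5, None, 2.0]}),
    2022: pd.DataFrame({"id": [10, 11, 12], "icnf.burnArea.total": [0.25, 4.0, None]}),
}


def summarize(df, unit=" ha"):
    """Compute the per-year figures from the question for one dataframe."""
    # Treat missing burnt-area values as 0 before summing.
    burnt = df["icnf.burnArea.total"].fillna(0).sum()
    return {
        "total_records": df["id"].nunique(),
        "total_burnt_area": f"{burnt:.2f}{unit}",
    }


# One loop applies the same treatment to every year.
summary = {year: summarize(df) for year, df in frames.items()}

# Optional: collect the results in a single table, one row per year.
summary_df = pd.DataFrame.from_dict(summary, orient="index")
summary_df.index.name = "year"

print(summary[2022]["total_records"])
print(summary_df)
```

With this layout, adding another metric means adding one more entry to the dictionary returned by `summarize`, and the result for a given year is looked up as `summary[2022]` instead of through a generated variable name.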
