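I have spent an hour on this and tried many different approaches, but this CSV file is too complicated for me (I'm a beginner) and I can't get it to work. I need a multidimensional array of the total cases per month for each country. Once that is solved, I will use those arrays as the values for a heatmap.
To be clearer: for France, for example, I need something like total_cases_france = (total cases in December, total cases in January, ..., total cases in November). I need to do this for every country and end up with a multidimensional array. The CSV file can be accessed from here.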
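I'm not sure whether you mean the sum for each month separately
- sum of all cases in December only
- sum of all cases in January only
- sum of all cases in February only, etc.
or the cumulative sum
- sum of all cases in December
- sum of all cases in December + January
- sum of all cases in December + January + February, etc.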
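To make the difference between the two readings concrete, here is a tiny sketch. The numbers are made up for illustration and do not come from the CSV.

```python
import pandas as pd

# Made-up monthly counts for one hypothetical country (not taken from the CSV).
monthly = pd.Series([2, 5, 10], index=["2019-12", "2020-01", "2020-02"])

# Reading 1: each month counted on its own.
per_month = monthly.tolist()

# Reading 2: running total up to and including each month.
cumulative = monthly.cumsum().tolist()

print(per_month)   # [2, 5, 10]
print(cumulative)  # [2, 7, 17]
```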
In both cases I would use the new_cases column and pandas with groupby(['location', 'year-month']).
First I need to create year-month from the year-month-day values in the date column:
df['year-month'] = df['date'].str[:7]
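Slicing the first seven characters works as long as every date is a zero-padded year-month-day string. If you are not sure about that, parsing the dates first is a possible alternative. This is only a sketch on a few made-up dates, not on the real file:

```python
import pandas as pd

# Made-up sample dates in the same year-month-day layout.
sample = pd.DataFrame({"date": ["2020-03-01", "2020-03-15", "2020-04-02"]})

# Variant used in this answer: plain string slicing.
by_slice = sample["date"].str[:7]

# Alternative: parse to datetime, then format back to year-month.
by_parse = pd.to_datetime(sample["date"]).dt.strftime("%Y-%m")

print(by_slice.tolist())
print(by_parse.tolist())
```

Both variants give the same year-month strings for input in this format.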
Next I can group by location and year-month
groups = df.groupby(['location', 'year-month'])
and sum new_cases within each group to get the monthly totals of new cases only
df_sum = groups['new_cases'].sum().reset_index()
Result
location year-month new_cases
0 Afghanistan 2019-12 0.0
1 Afghanistan 2020-01 0.0
2 Afghanistan 2020-02 1.0
3 Afghanistan 2020-03 140.0
4 Afghanistan 2020-04 1808.0
... ... ... ...
2112 Zimbabwe 2020-07 2518.0
2113 Zimbabwe 2020-08 3320.0
2114 Zimbabwe 2020-09 1425.0
2115 Zimbabwe 2020-10 525.0
2116 Zimbabwe 2020-11 858.0
I can combine this with cumsum() to get the total (cumulative) sum per country
df_sum['total_cases'] = df_sum.groupby('location')['new_cases'].cumsum()
Result
location year-month new_cases total_cases
0 Afghanistan 2019-12 0.0 0.0
1 Afghanistan 2020-01 0.0 0.0
2 Afghanistan 2020-02 1.0 1.0
3 Afghanistan 2020-03 140.0 141.0
4 Afghanistan 2020-04 1808.0 1949.0
... ... ... ... ...
2112 Zimbabwe 2020-07 2518.0 3092.0
2113 Zimbabwe 2020-08 3320.0 6412.0
2114 Zimbabwe 2020-09 1425.0 7837.0
2115 Zimbabwe 2020-10 525.0 8362.0
2116 Zimbabwe 2020-11 858.0 9220.0
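The important detail is that cumsum() is applied per location, so the running total starts again for every country. A small sketch on invented data (the countries "A" and "B" and all numbers are hypothetical) shows this:

```python
import pandas as pd

# Invented daily rows for two hypothetical countries.
toy = pd.DataFrame({
    "location": ["A", "A", "A", "B", "B", "B"],
    "date": ["2020-01-10", "2020-01-20", "2020-02-05",
             "2020-01-03", "2020-02-01", "2020-02-15"],
    "new_cases": [1, 2, 4, 10, 20, 30],
})

toy["year-month"] = toy["date"].str[:7]

# Monthly sum of new cases per country.
toy_sum = toy.groupby(["location", "year-month"])["new_cases"].sum().reset_index()

# Running total that restarts for each country.
toy_sum["total_cases"] = toy_sum.groupby("location")["new_cases"].cumsum()

print(toy_sum)
```

Country "A" ends at 7 and country "B" starts over at 10 instead of continuing from 7.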
I can also select the rows for a single country
df_sum[ df_sum['location'] == 'France' ]
df_sum[ df_sum['location'] == 'Germany' ]
Result
location year-month new_cases total_cases
671 France 2019-12 0.0 0.0
672 France 2020-01 6.0 6.0
673 France 2020-02 51.0 57.0
674 France 2020-03 44493.0 44550.0
675 France 2020-04 83892.0 128442.0
676 France 2020-05 23054.0 151496.0
677 France 2020-06 12764.0 164260.0
678 France 2020-07 22313.0 186573.0
679 France 2020-08 91370.0 277943.0
680 France 2020-09 272747.0 550690.0
681 France 2020-10 781294.0 1331984.0
682 France 2020-11 808224.0 2140208.0
location year-month new_cases total_cases
722 Germany 2019-12 0.0 0.0
723 Germany 2020-01 5.0 5.0
724 Germany 2020-02 52.0 57.0
725 Germany 2020-03 61856.0 61913.0
726 Germany 2020-04 97206.0 159119.0
727 Germany 2020-05 22363.0 181482.0
728 Germany 2020-06 12777.0 194259.0
729 Germany 2020-07 14439.0 208698.0
730 Germany 2020-08 33683.0 242381.0
731 Germany 2020-09 46838.0 289219.0
732 Germany 2020-10 229534.0 518753.0
733 Germany 2020-11 410380.0 929133.0
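Alternatively, I can use groupby('location') on df_sum to split it into a separate DataFrame per country and build a dictionary {"France": df_france, "Germany": df_germany, ...}. (Iterating over groups from above would give (location, year-month) pairs as keys instead of country names.)
data = {}
for country, values in df_sum.groupby('location'):
    data[country] = values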
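If what you want is closer to the total_cases_france = (...) form from the question, one possible way is to keep only the numbers for each country. This is a sketch, assuming a df_sum shaped like the one in this answer; here it is rebuilt by hand from a few of the France and Germany values shown in the results, and the name arrays is my own.

```python
import pandas as pd

# Hand-built stand-in for df_sum, using a few values from the results in this answer.
df_sum = pd.DataFrame({
    "location": ["France", "France", "Germany", "Germany"],
    "year-month": ["2020-01", "2020-02", "2020-01", "2020-02"],
    "new_cases": [6.0, 51.0, 5.0, 52.0],
})

# One tuple of monthly values per country, in year-month order.
arrays = {
    country: tuple(rows.sort_values("year-month")["new_cases"].tolist())
    for country, rows in df_sum.groupby("location")
}

print(arrays["France"])   # (6.0, 51.0)
print(arrays["Germany"])  # (5.0, 52.0)
```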
Full code
import pandas as pd

df = pd.read_csv('ex1.csv')
print(df.columns)

# 'date' holds year-month-day strings; keep only the year-month part
df['year-month'] = df['date'].str[:7]
#print(df['year-month'].head())

# new cases per country and month
groups = df.groupby(['location', 'year-month'])
df_sum = groups['new_cases'].sum().reset_index()
#print(df_sum)

# running total per country
df_sum['total_cases'] = df_sum.groupby('location')['new_cases'].cumsum()
print(df_sum)

print(df_sum[ df_sum['location'] == 'France' ])
print(df_sum[ df_sum['location'] == 'Germany' ])

# one DataFrame per country
data = {}
for country, values in df_sum.groupby('location'):
    data[country] = values
print(data)
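Since the end goal is a heatmap, a single 2D array (countries as rows, months as columns) may be more convenient than a dictionary. This is not part of the code above, just a sketch of one way to get there with pivot(). The df_sum here is a small hand-built stand-in with one month deliberately left out for Germany, and filling the gap with 0 is my assumption about how missing months should be treated.

```python
import pandas as pd

# Hand-built stand-in for df_sum; Germany has no 2020-01 row on purpose.
df_sum = pd.DataFrame({
    "location": ["France", "France", "Germany"],
    "year-month": ["2020-01", "2020-02", "2020-02"],
    "new_cases": [6.0, 51.0, 52.0],
})

# Rows = countries, columns = months; months missing for a country become 0.
table = df_sum.pivot(index="location", columns="year-month", values="new_cases").fillna(0)

# Plain 2D NumPy array with the same layout.
matrix = table.to_numpy()

print(table.index.tolist())    # row labels (countries)
print(table.columns.tolist())  # column labels (months)
print(matrix)
```

The row and column labels stay available in table.index and table.columns, so they can be used as tick labels when plotting.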