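I have a dataset (an Excel file) with three fields: District (string), Land Use (string), and Temperature (numeric). The number of distinct districts and land use types is limited, while the temperature values vary from record to record.
There are many thousands of records, so it is a fairly large dataset.
Part of it is shown in the table below: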
| District| Land Use | Temperature |
|---------|-------------|-------------|
| B | Desert | 43.3 |
| A | Residential | 23.1 |
| C | Forest | 14.6 |
| B | Forest | 18.3 |
| A | Wetland | 15.8 |
| B | Residential | 25.9 |
| C | Agricultural| 37.0 |
| A | Residential | 29.1 |
| B | Desert | 44.5 |
| C | Residential | 31.6 |
| A | Forest | 17.4 |
| B | Residential | 23.2 |
| A | Forest | 18.8 |
| C | Agricultural| 36.7 |
| A | Residential | 29.2 |
| C | Forest | 17.6 |
| A | Agricultural| 36.9 |
| B | Desert | 15.5 |
....
| H | Residential | 26.9 |
| I | Agricultural| 27.0 |
| N | Residential | 22.1 |
| B | Desert | 47.5 |
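Is there an automated way to cluster (group) the whole dataset so that each district is described statistically by its own land use types (mean, median, standard deviation, etc.)?
I would like to get something like this: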
Temperature District A
Residential mean = xxx , Std. = xxx
Agricultural mean = xxx , Std. = xxx
Forest mean = xxx , Std. = xxx
Wetland mean = xxx , Std. = xxx
Temperature District B
Residential mean = xxx , Std. = xxx
Agricultural mean = xxx , Std. = xxx
Forest mean = xxx , Std. = xxx
Desert mean = xxx , Std. = xxx
Temperature District C
Residential mean = xxx , Std. = xxx
Agricultural mean = xxx , Std. = xxx
Forest mean = xxx , Std. = xxx
....
Temperature District N
Residential mean = xxx , Std. = xxx
Agricultural mean = xxx , Std. = xxx
Forest mean = xxx , Std. = xxx
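Although it is not exactly the format you specified, you can compute the mean and std for each land use within each district with groupby() and agg(), and keep the result in a DataFrame. agg() accepts several aggregation functions at once.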
import pandas as pd

# A few rows from the question's table
data = {'District': ['B', 'A', 'C', 'B', 'A', 'B', 'C'],
        'Land Use': ['Desert', 'Residential', 'Forest', 'Forest', 'Wetland', 'Residential', 'Agricultural'],
        'Temperature': [43.3, 23.1, 14.6, 18.3, 15.8, 25.9, 37.0]
        }
df = pd.DataFrame(data)
# Group by district and land use, then aggregate the temperature column
df_stats = df.groupby(['District', 'Land Use'])['Temperature'].agg(['mean', 'std'])
Output:
mean std
District Land Use
A Residential 23.1 ...
Wetland 15.8 ...
B Desert 43.3 ...
Forest 18.3 ...
Residential 25.9 ...
C Agricultural 37.0 ...
Forest 14.6 ...
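If you also want the median and something closer to the layout in the question, the same groupby()/agg() approach can be extended. The following is only a minimal sketch: the sample rows come from the question's table, `format_report` is a hypothetical helper name, and the Excel file name in the comment is an assumption, not something taken from the question.

```python
import pandas as pd

# Sample rows from the question's table. For the real data you would load the
# spreadsheet instead, e.g. df = pd.read_excel("temperatures.xlsx")
# (hypothetical file name).
data = {
    "District": ["B", "A", "C", "B", "A", "B", "C", "A", "B"],
    "Land Use": ["Desert", "Residential", "Forest", "Forest", "Wetland",
                 "Residential", "Agricultural", "Residential", "Desert"],
    "Temperature": [43.3, 23.1, 14.6, 18.3, 15.8, 25.9, 37.0, 29.1, 44.5],
}
df = pd.DataFrame(data)

# Named aggregation: one output column per statistic.
stats = df.groupby(["District", "Land Use"])["Temperature"].agg(
    mean="mean", median="median", std="std", count="count"
)


def format_report(stats):
    """Return the per-district summary as a list of text lines."""
    lines = []
    for district, block in stats.groupby(level="District"):
        lines.append(f"Temperature District {district}")
        # Each block keeps the (District, Land Use) MultiIndex.
        for (_, land_use), row in block.iterrows():
            lines.append(
                f"{land_use} mean = {row['mean']:.2f} , Std. = {row['std']:.2f}"
            )
    return lines


report = format_report(stats)
print("\n".join(report))
```

Note that pandas computes the sample standard deviation (ddof=1) by default, so any district and land use combination with a single record gets NaN for std. The `count` column makes those cases easy to spot.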