Statistical and hierarchical analysis across three fields



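I have a dataset (an Excel file) with three fields: District (string), Land Use (string), and Temperature (numeric). The number of distinct districts and land uses is limited, while the temperature values vary from record to record.

There are thousands of records, so it is effectively a large dataset.

Part of the data looks like this: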

| District| Land Use    | Temperature |
|---------|-------------|-------------|
| B       | Desert      | 43.3        |
| A       | Residential | 23.1        |
| C       | Forest      | 14.6        |
| B       | Forest      | 18.3        |
| A       | Wetland     | 15.8        |
| B       | Residential | 25.9        |
| C       | Agricultural| 37.0        |
| A       | Residential | 29.1        |
| B       | Desert      | 44.5        |
| C       | Residential | 31.6        |
| A       | Forest      | 17.4        |
| B       | Residential | 23.2        |
| A       | Forest      | 18.8        |
| C       | Agricultural| 36.7        |
| A       | Residential | 29.2        |
| C       | Forest      | 17.6        |
| A       | Agricultural| 36.9        |
| B       | Desert      | 15.5        |
....
| H       | Residential | 26.9        |
| I       | Agricultural| 27.0        |
| N       | Residential | 22.1        |
| B       | Desert      | 47.5        |

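Is there an automatic way to group the whole dataset and describe each district statistically by its own land uses (mean, median, standard deviation, etc.)?

I would like to get something like this: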

Temperature District A
Residential   mean = xxx , Std. = xxx
Agricultural  mean = xxx , Std. = xxx
Forest        mean = xxx , Std. = xxx
Wetland       mean = xxx , Std. = xxx
Temperature District B
Residential   mean = xxx , Std. = xxx
Agricultural  mean = xxx , Std. = xxx
Forest        mean = xxx , Std. = xxx
Desert        mean = xxx , Std. = xxx
Temperature District C
Residential   mean = xxx , Std. = xxx
Agricultural  mean = xxx , Std. = xxx
Forest        mean = xxx , Std. = xxx
....
Temperature District N
Residential   mean = xxx , Std. = xxx
Agricultural  mean = xxx , Std. = xxx
Forest        mean = xxx , Std. = xxx

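It is not exactly the format you specified, but you can get the mean and standard deviation for each district and land use and store them in a DataFrame using groupby() followed by agg(). agg() accepts several aggregation functions at once.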

import pandas as pd

# Sample rows taken from the table in the question
data = {'District': ['B', 'A', 'C', 'B', 'A', 'B', 'C'],
        'Land Use': ['Desert', 'Residential', 'Forest', 'Forest', 'Wetland', 'Residential', 'Agricultural'],
        'Temperature': [43.3, 23.1, 14.6, 18.3, 15.8, 25.9, 37.0]}
df = pd.DataFrame(data)
# Group by district and land use, then compute the mean and standard deviation
df_stats = df.groupby(['District', 'Land Use'])['Temperature'].agg(['mean', 'std'])

Output:

                       mean   std
District Land Use
A        Residential   23.1   ...
         Wetland       15.8   ...
B        Desert        43.3   ...
         Forest        18.3   ...
         Residential   25.9   ...
C        Agricultural  37.0   ...
         Forest        14.6   ...
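
If you want something closer to the layout in the question, one option is to iterate over the grouped result and print it district by district. The following is only a sketch under a few assumptions: the sample rows are copied from the question's table, the helper name `format_report` is made up for illustration, and the median is added because the question mentions it. With the real data, the DataFrame would presumably be loaded with `pd.read_excel` instead of being built by hand.

```python
import pandas as pd

# Sample rows copied from the question's table; with the real file the
# frame would be loaded from Excel instead of being typed in.
data = {
    "District": ["B", "A", "C", "B", "A", "B", "C", "A", "B"],
    "Land Use": ["Desert", "Residential", "Forest", "Forest", "Wetland",
                 "Residential", "Agricultural", "Residential", "Desert"],
    "Temperature": [43.3, 23.1, 14.6, 18.3, 15.8, 25.9, 37.0, 29.1, 44.5],
}
df = pd.DataFrame(data)

# One row per (District, Land Use) pair, one column per statistic.
stats = (
    df.groupby(["District", "Land Use"])["Temperature"]
      .agg(["mean", "median", "std"])
)

def format_report(stats):
    """Hypothetical helper: turn the grouped stats into report lines."""
    lines = []
    # Walk the first index level so each district gets its own header.
    for district, block in stats.groupby(level="District"):
        lines.append(f"Temperature District {district}")
        for (_, land_use), row in block.iterrows():
            lines.append(
                f"{land_use:<13} mean = {row['mean']:.2f} , "
                f"median = {row['median']:.2f} , Std. = {row['std']:.2f}"
            )
    return lines

report = format_report(stats)
print("\n".join(report))
```

Note that `std` is the sample standard deviation (`ddof=1`) by default, so any district and land use combination with a single record shows `nan` in the Std. column.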
