How do I handle multi-response data to build frequencies in Python 3?



I am working with a multi-response dataset and want to build some frequency tables with Python pandas. Here is my dataset:

Student Id  |1st_Lang   |2nd_Lang   |Core_Sub_1 |Core_Sub_2 |Core_Sub_3 |Additional
1       |Bengali    |English    |Math       |Life Sc    |Physical Sc    |Work Education
2       |Bengali    |English    |Geography  |Life Sc    |Physical Sc    |Physical Education
3       |Bengali    |English    |History    |Geography  |Economics  |Life Sc
4       |English    |Hindi      |History    |Geography  |Economics  |Life Sc
5       |Hindi      |English    |Math       |Life Sc    |Physical Sc    |Work Education

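Sample student data with student IDs and the subjects each student has chosen as languages, core subjects, and an additional subject.

I want to generate the frequency of the subjects the students are studying.

Example: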

English - 5
Bengali - 3
Hindi - 2
Geography - 3
... etc.

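I would also like to find the frequency of the subjects studied by those students whose language (from the 1st_Lang and 2nd_Lang columns) is English or Hindi.

Could you please help me do this in Python?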

Since we do not need "Student Id" as data, we set it aside as the index (or drop it):

df = df.set_index("Student Id")
# df = df.drop(columns="Student Id")
           1st_Lang 2nd_Lang Core_Sub_1 Core_Sub_2   Core_Sub_3          Additional
Student Id
1           Bengali  English       Math    Life Sc  Physical Sc      Work Education
2           Bengali  English  Geography    Life Sc  Physical Sc  Physical Education
3           Bengali  English    History  Geography    Economics             Life Sc
4           English    Hindi    History  Geography    Economics             Life Sc
5             Hindi  English       Math    Life Sc  Physical Sc      Work Education
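The steps here assume that df has already been loaded. As a minimal sketch for reproducing it, the frame could be built directly from the table in the question; the names rows and columns are only illustrative, and in practice the data would more likely come from a file.

```python
import pandas as pd

# Rows copied from the table in the question.
rows = [
    (1, "Bengali", "English", "Math", "Life Sc", "Physical Sc", "Work Education"),
    (2, "Bengali", "English", "Geography", "Life Sc", "Physical Sc", "Physical Education"),
    (3, "Bengali", "English", "History", "Geography", "Economics", "Life Sc"),
    (4, "English", "Hindi", "History", "Geography", "Economics", "Life Sc"),
    (5, "Hindi", "English", "Math", "Life Sc", "Physical Sc", "Work Education"),
]
columns = ["Student Id", "1st_Lang", "2nd_Lang",
           "Core_Sub_1", "Core_Sub_2", "Core_Sub_3", "Additional"]

# Build the frame and move the student ID into the index, as in the step above.
df = pd.DataFrame(rows, columns=columns).set_index("Student Id")
print(df)
```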

Stacking df gives us a Series (with a MultiIndex):

ser = df.stack()
Student Id
1           1st_Lang                 Bengali
            2nd_Lang                 English
            Core_Sub_1                  Math
            Core_Sub_2               Life Sc
            Core_Sub_3           Physical Sc
            Additional        Work Education
2           1st_Lang                 Bengali
            2nd_Lang                 English
            Core_Sub_1             Geography
            Core_Sub_2               Life Sc
            Core_Sub_3           Physical Sc
            Additional    Physical Education
3           1st_Lang                 Bengali
            2nd_Lang                 English
            Core_Sub_1               History
            Core_Sub_2             Geography
            Core_Sub_3             Economics
            Additional               Life Sc
4           1st_Lang                 English
            2nd_Lang                   Hindi
            Core_Sub_1               History
            Core_Sub_2             Geography
            Core_Sub_3             Economics
            Additional               Life Sc
5           1st_Lang                   Hindi
            2nd_Lang                 English
            Core_Sub_1                  Math
            Core_Sub_2               Life Sc
            Core_Sub_3           Physical Sc
            Additional        Work Education
dtype: object

We can now count the frequencies:

ser.value_counts()
Life Sc               5
English               5
Physical Sc           3
Bengali               3
Geography             3
Work Education        2
Hindi                 2
Math                  2
History               2
Economics             2
Physical Education    1
dtype: int64

Now for the students studying Hindi, set up the criterion:

critH = df[["1st_Lang", "2nd_Lang"]].eq("Hindi")
            1st_Lang  2nd_Lang
Student Id
1              False     False
2              False     False
3              False     False
4              False      True
5               True     False

A student qualifies if Hindi is either the first or the second language:

critH=critH.any(axis=1)
Student Id
1    False
2    False
3    False
4     True
5     True
dtype: bool

Select the matching rows (students) and count the frequencies in one step:

df.loc[critH].stack().value_counts()
Life Sc           2
Hindi             2
English           2
History           1
Work Education    1
Math              1
Economics         1
Physical Sc       1
Geography         1
dtype: int64
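The question asks about students whose language is English or Hindi, while the steps above filter on Hindi only. One possible way to cover both, sketched here under the assumption that the same columns are used, is to replace eq with isin and pass a list of languages. The names langs, crit and freq are illustrative, and the frame is rebuilt so that the snippet runs on its own.

```python
import pandas as pd

# Rebuild the sample data from the question so this snippet is self-contained.
rows = [
    (1, "Bengali", "English", "Math", "Life Sc", "Physical Sc", "Work Education"),
    (2, "Bengali", "English", "Geography", "Life Sc", "Physical Sc", "Physical Education"),
    (3, "Bengali", "English", "History", "Geography", "Economics", "Life Sc"),
    (4, "English", "Hindi", "History", "Geography", "Economics", "Life Sc"),
    (5, "Hindi", "English", "Math", "Life Sc", "Physical Sc", "Work Education"),
]
columns = ["Student Id", "1st_Lang", "2nd_Lang",
           "Core_Sub_1", "Core_Sub_2", "Core_Sub_3", "Additional"]
df = pd.DataFrame(rows, columns=columns).set_index("Student Id")

# Languages of interest (illustrative name).
langs = ["English", "Hindi"]

# True for a student if either language column holds one of the listed languages.
crit = df[["1st_Lang", "2nd_Lang"]].isin(langs).any(axis=1)

# Keep the matching students, flatten all their choices, and count each value.
freq = df.loc[crit].stack().value_counts()
print(freq)
```

In this sample every student has English as a first or second language, so the filter keeps all five rows and the counts equal the overall frequencies; on a larger dataset the two results would generally differ.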
