Applying chi-square to a dataset with categorical variables

My dataset has the following columns:

Voted?         Political Category
Yes            Right
No             Left
Not Answered   Center
Yes            Right
Yes            Right
No             Right

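I need to compute chi-square to see which category is most associated with the people who voted. Both columns contain strings. To apply chi-square, how do I give each value a numeric representation?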

You can use pd.factorize to encode the categorical variables:

import pandas as pd

# pd.factorize returns (codes, uniques); [0] keeps only the integer codes
df['nVoted?'] = pd.factorize(df['Voted?'])[0]
df['nCategory'] = pd.factorize(df['Political Category'])[0]
print(df)
# Output
         Voted? Political Category  nVoted?  nCategory
0           Yes              Right        0          0
1            No               Left        1          1
2  Not Answered             Center        2          2
3           Yes              Right        0          0
4           Yes              Right        0          0
5            No              Right        1          0
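
To make the encoding easier to read back, here is a minimal sketch that keeps both values returned by pd.factorize. The DataFrame is rebuilt from the sample rows in the question, and the name code_to_label is only illustrative. The codes are numbered in order of first appearance, so they are labels and carry no ordering.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Voted?": ["Yes", "No", "Not Answered", "Yes", "Yes", "No"],
    "Political Category": ["Right", "Left", "Center", "Right", "Right", "Right"],
})

# factorize returns a pair: the integer codes and the unique labels,
# numbered in order of first appearance.
codes, uniques = pd.factorize(df["Voted?"])

# Map each integer code back to its original label (illustrative helper name).
code_to_label = dict(enumerate(uniques))

print(codes.tolist())
print(code_to_label)
```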

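After that, you can use scipy.stats.chisquare. Note that chisquare runs a one-way goodness-of-fit test on observed frequencies; to test whether two categorical columns are associated, scipy.stats.chi2_contingency on a contingency table of counts is the usual choice.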
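
As a rough sketch of a test of independence between the two columns, one option is to build a contingency table with pd.crosstab and pass it to scipy.stats.chi2_contingency. This swaps in chi2_contingency in place of chisquare, and it works on the string columns directly, so the numeric codes are not needed for this step. The DataFrame is again rebuilt from the question's sample rows; six rows are far too few for a meaningful result, so treat the output as illustrative only.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Sample data copied from the question.
df = pd.DataFrame({
    "Voted?": ["Yes", "No", "Not Answered", "Yes", "Yes", "No"],
    "Political Category": ["Right", "Left", "Center", "Right", "Right", "Right"],
})

# Observed counts for every (Voted?, Political Category) pair.
table = pd.crosstab(df["Voted?"], df["Political Category"])

# Chi-square test of independence on the table of counts.
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

Comparing table against expected cell by cell gives a sense of which category appears among voters more often than independence would predict.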
