Applying chi-square to a dataset with categorical variables



My dataset has the following columns:

Voted? Political Category
Yes            Right
No             Left
Not Answered   Center
Yes            Right
Yes            Right
No             Right

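I need to compute chi-square to see which category is most strongly associated with the people who voted. Both columns contain strings. How can I give each value a numeric representation so that I can apply chi-square?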

You can use pd.factorize to encode the categorical variables:

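import pandas as pd

# df is the DataFrame from the question; factorize returns (codes, uniques), so [0] keeps the integer codes
df['nVoted?'] = pd.factorize(df['Voted?'])[0]
df['nCategory'] = pd.factorize(df['Political Category'])[0]
print(df)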
# Output
         Voted? Political Category  nVoted?  nCategory
0           Yes              Right        0          0
1            No               Left        1          1
2  Not Answered             Center        2          2
3           Yes              Right        0          0
4           Yes              Right        0          0
5            No              Right        1          0

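After that, you can use scipy.stats.chisquare. Note that chisquare is a goodness-of-fit test on a single set of observed counts. To test whether the two columns are associated, the usual tool is scipy.stats.chi2_contingency, applied to a table of counts such as the one pd.crosstab builds; that route works on the string columns directly, so the integer codes are not strictly required.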
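
For the association the question asks about, here is a minimal sketch using a contingency table. It assumes the six rows shown in the question are the whole dataset and rebuilds them by hand; the variable names (observed, p_value, and so on) are illustrative, not from the original post. pd.crosstab counts each (Voted?, Political Category) pair, and scipy.stats.chi2_contingency runs the chi-square test of independence on those counts.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Assumption: the six sample rows from the question are the full dataset.
df = pd.DataFrame({
    "Voted?": ["Yes", "No", "Not Answered", "Yes", "Yes", "No"],
    "Political Category": ["Right", "Left", "Center", "Right", "Right", "Right"],
})

# Observed counts for every (Voted?, Political Category) combination.
observed = pd.crosstab(df["Voted?"], df["Political Category"])

# Chi-square test of independence on the contingency table.
chi2, p_value, dof, expected = chi2_contingency(observed)

print(observed)
print(f"chi2={chi2:.3f}, p={p_value:.3f}, dof={dof}")
```

With only six rows the expected counts are far below the usual rule of thumb of about five per cell, so the p-value here is illustrative only. On a real dataset, comparing observed with expected cell by cell shows which category is over- or under-represented among voters.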
