My dataset has the following columns:
Voted?        Political Category
Yes           Right
No            Left
Not Answered  Center
Yes           Right
Yes           Right
No            Right
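I need to compute a chi-square statistic to see which category is most strongly associated with people who voted. Both columns contain strings. How can I give each value a numeric representation so that I can apply chi-square?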
You can use pd.factorize to encode the categorical variables:
import pandas as pd
# pd.factorize returns (codes, uniques); [0] keeps only the integer codes
df['nVoted?'] = pd.factorize(df['Voted?'])[0]
df['nCategory'] = pd.factorize(df['Political Category'])[0]
print(df)
# Output
         Voted? Political Category  nVoted?  nCategory
0           Yes              Right        0          0
1            No               Left        1          1
2  Not Answered             Center        2          2
3           Yes              Right        0          0
4           Yes              Right        0          0
5            No              Right        1          0
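If you later need to map the codes back to the original labels, the second value returned by pd.factorize holds them. A minimal sketch, assuming the sample rows from the question are rebuilt by hand as `df`:

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "Voted?": ["Yes", "No", "Not Answered", "Yes", "Yes", "No"],
    "Political Category": ["Right", "Left", "Center", "Right", "Right", "Right"],
})

# codes: one integer per row, assigned in order of first appearance.
# uniques: the distinct labels, positioned so that uniques[code] is the label.
codes, uniques = pd.factorize(df["Voted?"])

# Recover the original strings from the integer codes.
decoded = uniques[codes]

print(codes)
print(list(uniques))
print(list(decoded))
```

The codes are arbitrary identifiers: their order reflects first appearance in the column, not any ranking of the categories.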
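After that, you can use scipy.stats.chisquare. Note that chisquare is a goodness-of-fit test on a single set of observed frequencies; to test whether two categorical columns are independent, scipy.stats.chi2_contingency applied to a contingency table of counts (for example, one built with pd.crosstab) is the usual choice.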
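As a rough sketch of a test of independence on the two columns, the example below builds a contingency table with pd.crosstab and passes it to scipy.stats.chi2_contingency. This swaps in chi2_contingency in place of chisquare, and it assumes the six sample rows from the question are the whole dataset; with counts this small the chi-square approximation is unreliable, so treat the numbers as illustrative only.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Sample data copied from the question.
df = pd.DataFrame({
    "Voted?": ["Yes", "No", "Not Answered", "Yes", "Yes", "No"],
    "Political Category": ["Right", "Left", "Center", "Right", "Right", "Right"],
})

# Count how often each (Voted?, Political Category) pair occurs.
# crosstab works directly on the string columns.
table = pd.crosstab(df["Voted?"], df["Political Category"])
print(table)

# Chi-square test of independence on the table of counts.
chi2, p_value, dof, expected = chi2_contingency(table)

print("chi2 =", chi2)
print("p-value =", p_value)
print("degrees of freedom =", dof)
print("expected counts under independence:")
print(expected)
```

Because pd.crosstab counts the string labels directly, the factorized columns are not needed for this step. Comparing `table` with `expected` cell by cell gives a first indication of which category contributes most to any association.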