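Python Pandas and NumPy problem with machine-learning classification

Here is what I am trying to do. The original dataset has two columns: a person's full name (e.g. Justine Davidson) and their ethnicity (e.g. English). I want to train a Naive Bayes classifier to predict a person's ethnicity from features of their name. To extract those features, I break the full name into 3-character substrings (e.g. Justine Davidson => jus, ust, sti, and so on). Here is my code.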

import pandas as pd
from pandas import DataFrame
import re
import numpy as np
import nltk
from nltk.classify import NaiveBayesClassifier as nbc
# Get csv file into data frame
data = pd.read_csv(r"C:\Users\KubiK\Desktop\OddNames_sampleData.csv")
frame = DataFrame(data)
frame.columns = ["name", "ethnicity"]
name = frame.name
ethnicity = frame.ethnicity
# Remove missing ethnicity data cases
index_missEthnic = frame.ethnicity.isnull()
index_missName = frame.name.isnull()
frame2 = frame.loc[~index_missEthnic, :]
frame3 = frame2.loc[~index_missName, :]
# Make all letters into lowercase
frame3.loc[:, "name"] = frame3["name"].str.lower()
frame3.loc[:, "ethnicity"] = frame3["ethnicity"].str.lower()
# Remove all non-alphabetical characters in Name
frame3.loc[:, "name"] = frame3["name"].str.replace(r'[^a-zA-Z\s-]', '') # Retain space and hyphen
# Replace empty space as "#"
frame3.loc[:, "name"] = frame3["name"].str.replace(r'[\s]', '#')
# Find the longest name in the dataset
##frame3["name_length"] = frame3["name"].str.len()
##nameLength = frame3.name_length
##print nameLength.max() # Longest name has !!!40 characters!!! including spaces and hyphens
# Add "?" to fill spaces up to 43 characters
frame3["name_filled"] = frame3["name"].str.pad(side="right", width=43, fillchar="?")
# Split into three-character strings
for i in range(1, 41):
    substr = "substr" + str(i)
    frame3[substr] = frame3["name_filled"].str[i-1:i+2]
# Count number of letter characters
frame3["name_len"] = frame3["name"].map(lambda x : len(re.findall('[a-zA-Z]', x)))
# Count number of vowel letter
frame3["vowel_len"] = frame3["name"].map(lambda x : len(re.findall('[aeiouAEIOU]', x)))
# Count number of consonant letter
frame3["consonant_len"] = frame3["name"].map(lambda x : len(re.findall('[b-df-hj-np-tv-z]', x)))
# Count number of in-between-string (not any) spaces
frame3["space_len"] = frame3["name"].map(lambda x : len(re.findall('[#]', x)))
# Space-name ratio
frame3["SN_ratio"] = frame3["space_len"]/frame3["name_len"]
# Vowel-name ratio
frame3["VN_ratio"] = frame3["vowel_len"]/frame3["name_len"]
# Recategorize ethnicity
frame3["ethnicity2"] = ""
frame3["ethnicity2"][frame3["ethnicity"] == "chinese"] = "chinese"
frame3["ethnicity2"][frame3["ethnicity"] != "chinese"] = "non-chinese"
# Test outputs
##print frame3
# Run naive bayes
featuresets = [((substr1, substr2), ethnicity2) for index, (substr1, substr2, ethnicity2) in frame3.iterrows()]
train_set, test_set = featuresets[:400], featuresets[400:]
classifier = nbc.train(train_set)
# Predict
print classifier.classify(ethnic_features('Anderson Silva'))

Sample data:

Name                 Ethnicity
J-b'te Letourneau    Scotish
Jane Mc-earthar      French
Li Chen              Chinese
Amabil?? Bonneau     English

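When I run the program, it has two problems:

The first is non-fatal. The following warning appears many times throughout the code, but the program keeps running without terminating: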

C:\Users\KubiK\Desktop\FamSeach_NameHandling4.py:57: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  frame3["space_len"] = frame3["name"].map(lambda x : len(re.findall('[#]', x)))

The second is fatal (it terminates the program):

Traceback (most recent call last):
  File "C:\Users\KubiK\Desktop\FamSeach_NameHandling4.py", line 71, in <module>
    featuresets = [(substr1, ethnicity2) for index, (substr1, substr2, ethnicity2) in frame3.iterrows()]
ValueError: too many values to unpack

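The error occurs because frame3 has more than 3 columns.

iterrows() is an iterator over (index, row) tuples. Each row here is a pd.Series whose index holds the column names and whose values are all the values in that row.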
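
To make that concrete, here is a minimal sketch using a small made-up frame (the `demo` DataFrame and its values are assumptions for illustration, not the asker's data). It shows what iterrows() hands back and reproduces the same unpacking failure when the row has more values than there are names:

```python
import pandas as pd

# Hypothetical frame for illustration only: three wanted columns plus one extra.
demo = pd.DataFrame(
    {
        "substr1": ["li#", "jan", "ama"],
        "substr2": ["i#c", "ane", "mab"],
        "ethnicity2": ["chinese", "non-chinese", "non-chinese"],
        "name_len": [6, 13, 13],
    }
)

# iterrows() yields (index, row); row is a Series indexed by column name.
index, row = next(demo.iterrows())
row_labels = list(row.index)

# Unpacking a four-value row into three names raises ValueError,
# which is the same kind of failure as in the question.
try:
    substr1, substr2, ethnicity2 = row
    unpack_error = None
except ValueError as exc:
    unpack_error = str(exc)
```

Here `row_labels` holds all four column names, which is why three target names are not enough.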

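Your frame3 DataFrame has many columns: name, ethnicity, name_filled, name_len, and so on. You are trying to unpack all of those values into just three variables (substr1, substr2 and ethnicity2), hence the "too many values to unpack" error. To fix this, select only the columns you need: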

featuresets = [(substr1, ethnicity2) for index, (substr1, substr2, ethnicity2) in frame3[['substr1', 'substr2', 'ethnicity2']].iterrows()]
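
As for the non-fatal SettingWithCopyWarning in the question, one common remedy is to take an explicit copy when filtering out the missing rows, and to avoid chained indexing such as frame3["ethnicity2"][mask] = value. The following is only a sketch under assumptions: the inline `frame` stands in for the CSV, and `clean` is a hypothetical name for what the question calls frame3.

```python
import pandas as pd

# Hypothetical stand-in for the CSV in the question.
frame = pd.DataFrame(
    {
        "name": ["Li Chen", "Jane Mc-earthar", None, "J-b'te Letourneau"],
        "ethnicity": ["Chinese", "French", "English", None],
    }
)

# Drop rows missing either value and take an explicit copy, so later
# assignments modify an independent DataFrame rather than a slice of `frame`.
clean = frame.dropna(subset=["name", "ethnicity"]).copy()

clean["name"] = clean["name"].str.lower()
clean["ethnicity"] = clean["ethnicity"].str.lower()

# Build the binary label in one assignment instead of chained indexing:
# keep "chinese" where it matches, otherwise use "non-chinese".
clean["ethnicity2"] = clean["ethnicity"].where(
    clean["ethnicity"] == "chinese", "non-chinese"
)
```

Because `clean` is its own DataFrame, adding columns to it leaves `frame` untouched.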
