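I want to run the k-nearest neighbors method on a dataset in which some ID-related columns are stored as strings. I need to convert all of these strings to numbers so that I can use them in the KNN algorithm. How can I convert the string IDs in the dataset to unique ints? (Because these strings are IDs, it is important that the same string maps to the same int in every column.) Should I use a hash instead of casting to int?
I tried casting the strings to int, but I got the following error:
ValueError: invalid literal for int() with base 10: 'VkSa32MyS738HMkfk4tEfk'
Here is the dataset: http://gitlab.rahnemacollege.com/rahnemacollege/tuning-registration-JusticeInWork/raw/master/dataset.csv
Here is the relevant piece of code: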
for i in range(1, 24857):
    df.iloc[i,0]=int(df.iloc[i,0])
    df.iloc[i,1]=int(df.iloc[i,1])
    df.iloc[i,3]=int(df.iloc[i,3])
    df.iloc[i,8]=int(df.iloc[i,8])
    df.iloc[i,9]=int(df.iloc[i,9])
    df.iloc[i,10]=int(df.iloc[i,10])
    df.iloc[i,11]=int(df.iloc[i,11])
    df.iloc[i,12]=int(df.iloc[i,12])
Here is my full code:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from google.colab import files
!pip install sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
#-----------------read file-------------------
uploaded = files.upload()
with open('dataset.csv', 'r') as data:
    df3 = pd.read_csv(data , encoding = ('ansi'))
lst = ['id', 'Prold', 'ProCreationId', 'CustCreatonRate', 'TaskCreationTimestamp', 'Price', 'ServiceId', 'CategoryId', 'ZoneId', 'TaskState', 'TargetProId', 'isFraud']
df = pd.DataFrame(df3)
print (df)
#----------------------preprocessing----------------
for i in range(1, 24857):
    df.iloc[i,0]=int(df.iloc[i,0])
    df.iloc[i,1]=int(df.iloc[i,1])
    df.iloc[i,3]=int(df.iloc[i,3])
    df.iloc[i,8]=int(df.iloc[i,8])
    df.iloc[i,9]=int(df.iloc[i,9])
    df.iloc[i,10]=int(df.iloc[i,10])
    df.iloc[i,11]=int(df.iloc[i,11])
    df.iloc[i,12]=int(df.iloc[i,12])
#----------------------set data-----------------------
x = df.iloc[:,0:12]
y = df.iloc[:,13]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
#-------------------------normalize-----------------
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
#-----------------------------knn----------------
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
#-------------------------result-----------------
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
How can I fix this?
Thanks for your consideration.
You can try categorical data:
In [553]: x = pd.Series(['a', 'a', 'a', 'b', 'b', 'c']).astype('category')
In [554]: x
Out[554]:
0    a
1    a
2    a
3    b
4    b
5    c
dtype: category
Categories (3, object): [a, b, c]
In [555]: x.cat.codes
Out[555]:
0    0
1    0
2    0
3    1
4    1
5    2
dtype: int8
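The same conversion can be applied to the string columns of a DataFrame. The following is only a minimal sketch: the frame is made up for illustration (the column names `ProId`, `TargetProId` and `Price` echo the question, the values are invented), and the list of string columns is written out by hand as an assumption about which columns hold IDs.

```python
import pandas as pd

# Hypothetical sample data; the values are invented for illustration.
df = pd.DataFrame({
    "ProId": ["VkSa32", "Qz81Lm", "VkSa32", "Ab90Xy"],
    "TargetProId": ["Qz81Lm", "VkSa32", "Ab90Xy", "Qz81Lm"],
    "Price": [100, 250, 100, 80],
})

# Assumption: these are the columns that hold string IDs.
string_cols = ["ProId", "TargetProId"]

# Replace each string column with its integer category codes.
encoded = df.copy()
for col in string_cols:
    encoded[col] = df[col].astype("category").cat.codes

print(encoded)
```

Note that the codes are assigned per column, from that column's own sorted set of categories. Two columns agree on the code for a given string only if they happen to contain exactly the same set of values, as they do in this toy frame.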
Reference: https://pandas.pydata.org/pandas-docs/version/0.16.2/categorical.html
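The question also asks that the same string get the same int in every column. Category codes are computed per column, so one way to meet that requirement is to build a single mapping over all ID columns and apply it to each of them. The sketch below uses `pd.factorize` for this; the sample frame and the `id_cols` list are assumptions made for illustration, not taken from the actual dataset.

```python
import pandas as pd

# Hypothetical sample data; the values are invented for illustration.
df = pd.DataFrame({
    "ProId": ["VkSa32", "Qz81Lm", "VkSa32"],
    "TargetProId": ["Ab90Xy", "VkSa32", "Qz81Lm"],
})

# Assumption: these columns draw their values from the same pool of IDs.
id_cols = ["ProId", "TargetProId"]

# Flatten all ID columns into one array and number the distinct strings
# in order of first appearance.
codes, uniques = pd.factorize(df[id_cols].to_numpy().ravel())
mapping = {value: code for code, value in enumerate(uniques)}

# Apply the one shared mapping to every ID column.
shared = df.copy()
for col in id_cols:
    shared[col] = df[col].map(mapping)

print(mapping)
print(shared)
```

As for hashing: Python's built-in `hash()` is randomized per process for strings unless `PYTHONHASHSEED` is fixed, so its values are not stable between runs, and any hash can in principle collide. An explicit mapping like the one above avoids both issues.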