How to generate unique ints from unique strings in a DataFrame



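I want to run the k-nearest neighbors method on a dataset in which some of the Id-related columns are stored as strings. I need to convert all of those strings to numbers so that the KNN algorithm can use them. How can I convert the string Ids in the dataset into unique ints? (Because these strings are Ids, it is important that the same string maps to the same int within each column.) Should I use a hash instead of casting to int?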

I tried casting the strings to int, but it raised the following error:

ValueError: invalid literal for int() with base 10: 'VkSa32MyS738HMkfk4tEfk'

Here is the dataset: http://gitlab.rahnemacollege.com/rahnemacollege/tuning-registration-JusticeInWork/raw/master/dataset.csv

Here is the relevant part of the code:

for i in range(1, 24857):
    df.iloc[i,0]=int(df.iloc[i,0])
    df.iloc[i,1]=int(df.iloc[i,1])
    df.iloc[i,3]=int(df.iloc[i,3])
    df.iloc[i,8]=int(df.iloc[i,8])
    df.iloc[i,9]=int(df.iloc[i,9])
    df.iloc[i,10]=int(df.iloc[i,10])
    df.iloc[i,11]=int(df.iloc[i,11])
    df.iloc[i,12]=int(df.iloc[i,12])

Here is my full code:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from google.colab import files
!pip install sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
#-----------------read file-------------------
uploaded = files.upload()
with open('dataset.csv', 'r') as data:
    df3 = pd.read_csv(data , encoding = ('ansi'))
lst = ['id', 'Prold', 'ProCreationId', 'CustCreatonRate', 'TaskCreationTimestamp', 'Price', 'ServiceId', 'CategoryId', 'ZoneId', 'TaskState', 'TargetProId', 'isFraud']
df = pd.DataFrame(df3)
print (df)
#----------------------preprocessing----------------
for i in range(1, 24857):
    df.iloc[i,0]=int(df.iloc[i,0])
    df.iloc[i,1]=int(df.iloc[i,1])
    df.iloc[i,3]=int(df.iloc[i,3])
    df.iloc[i,8]=int(df.iloc[i,8])
    df.iloc[i,9]=int(df.iloc[i,9])
    df.iloc[i,10]=int(df.iloc[i,10])
    df.iloc[i,11]=int(df.iloc[i,11])
    df.iloc[i,12]=int(df.iloc[i,12])
#----------------------set data-----------------------
x = df.iloc[:,0:12]
y = df.iloc[:,13]
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
#-------------------------normalize-----------------
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
#-----------------------------knn----------------
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
#-------------------------result-----------------
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

How can I fix this?

Thank you for your consideration.

You can try converting the column to categorical data and using the category codes:

In [553]: x = pd.Series(['a', 'a', 'a', 'b', 'b', 'c']).astype('category')
In [554]: x
Out[554]: 
0    a
1    a
2    a
3    b
4    b
5    c
dtype: category
Categories (3, object): [a, b, c]
In [555]: x.cat.codes
Out[555]: 
0    0
1    0
2    0
3    1
4    1
5    2
dtype: int8
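
To apply the same idea to several Id columns of a DataFrame at once, something like the following might work. This is only a sketch: the small DataFrame and its values are made up for illustration (the column names ServiceId and ZoneId are borrowed from the question, but the data is not), and `id_columns` is an assumed list of the string Id columns.

```python
import pandas as pd

# Made-up stand-in for the real dataset: two string Id columns and one numeric column.
df = pd.DataFrame({
    "ServiceId": ["VkSa32", "Qz91Lm", "VkSa32", "Ab12Cd"],
    "ZoneId": ["z1", "z2", "z1", "z1"],
    "Price": [10.0, 12.5, 10.0, 8.0],
})

# Assumed list of the columns that hold string Ids.
id_columns = ["ServiceId", "ZoneId"]

for col in id_columns:
    # Within a column, equal strings always receive the same integer code.
    df[col] = df[col].astype("category").cat.codes

print(df)
```

The codes are assigned per column, so the same int in two different columns does not refer to the same string. Missing values receive the code -1. Note also that the codes are arbitrary labels, so numeric distances between them carry no meaning of their own.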

Reference: https://pandas.pydata.org/pandas-docs/version/0.16.2/categorical.html
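
On the hash question: Python's built-in hash() for strings is salted per process by default, so the values change between runs, and distinct strings can in principle collide. One alternative is pd.factorize, which assigns dense codes in order of first appearance and also returns the unique values, so the mapping can be stored and reused. A minimal sketch with made-up Ids:

```python
import pandas as pd

# Made-up Ids for illustration; not taken from the dataset in the question.
ids = pd.Series(["VkSa32", "Qz91Lm", "VkSa32", "Ab12Cd"])

# codes[i] is the integer for ids[i]; uniques[k] is the string behind code k.
codes, uniques = pd.factorize(ids)

# Keep the string-to-int mapping so the same encoding can be reapplied to new data.
mapping = {s: k for k, s in enumerate(uniques)}

print(codes.tolist())
print(mapping)
```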
