Regression on a large dataset: why does accuracy drop?



I'm trying to predict the views of olx ads. I wrote a scraper that collected all the data (50,000 ads). When I ran linear regression on 1,400 samples I got 66% accuracy, but when I then tested it on 52,000 samples it dropped to 8%. Below are the plots of ImgCount vs. Views and Price vs. Views.

Is there something wrong with my data, or how should I perform regression on it? I know the data is polarized (heavily skewed).

What I want to know is why my accuracy drops when I use the large dataset.
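One thing worth checking for a skewed target like view counts (this is a sketch on synthetic data, not the olx data, and all parameter values here are illustrative): a handful of very large values can dominate the R² score, so a model that looks fine on a small sample may score near zero on the full, heavy-tailed data. Log-transforming the target is one common mitigation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(5000, 1))
# Simulated view counts: multiplicative growth gives a heavy right tail.
y = np.exp(0.5 * X[:, 0] + rng.normal(0, 1, size=5000))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit once on the raw target and once on its log1p transform.
raw = LinearRegression().fit(X_train, y_train)
logged = LinearRegression().fit(X_train, np.log1p(y_train))

r_raw = raw.score(X_test, y_test)
r_log = logged.score(X_test, np.log1p(y_test))
print("R^2 on raw views   :", r_raw)   # dragged down by the extreme tail
print("R^2 on log1p(views):", r_log)   # noticeably higher
```

On this synthetic data the log-transformed fit scores much better, because the transform tames the extreme values that otherwise dominate the squared-error criterion.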

Thanks for your help.

Code:

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

url = '/home/msz/olx/olx/with_images.csv'
df = pd.read_csv(url, index_col='url')

# Strip currency formatting so 'price' can be parsed as an integer.
# regex=False keeps '.' as a literal dot rather than a regex wildcard
# (with the default regex behaviour, replacing '.' would erase everything).
df['price'] = df['price'].str.replace('.', '', regex=False)
df['price'] = df['price'].str.replace(',', '', regex=False)
df['price'] = df['price'].str.replace('Rs', '', regex=False)
df['price'] = df['price'].astype(int)

# Clean the free-text column: replace commas and strip tab/newline
# characters ('\t' and '\n'; plain 't' and 'n' would delete those
# letters from every word).
df['text'] = df['text'].str.replace(',', ' ', regex=False)
df['text'] = df['text'].str.replace('\t', '', regex=False)
df['text'] = df['text'].str.replace('\n', '', regex=False)

X = df[['price', 'img']]
y = df['views']
print("X is like ", X.shape)
print("Y is like ", y.shape)

df.plot(y='views', x='img', style='x')
plt.title('ImgCount vs Views')
plt.xlabel('ImgCount')
plt.ylabel('Views')
plt.show()

df.plot(y='views', x='price', style='x')
plt.title('Price vs Views')
plt.xlabel('Price')
plt.ylabel('Views')
plt.show()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.451, random_state=0)
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Note: for a regressor, .score() returns the R^2 coefficient of
# determination, not a classification accuracy.
score = regressor.score(X_test, y_test)
print('R^2 score is : ', score * 100)
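A side note on the score being printed: for scikit-learn regressors, `.score()` returns R² (the coefficient of determination), not a percentage of correct predictions, and it can even go negative on data the model fits poorly. A minimal sketch on synthetic data (not the olx data) showing how cross-validation gives a more stable R² estimate than a single train/test split:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 2))
# Target that is roughly linear in the features, plus noise.
y = 3 * X[:, 0] + 2 * X[:, 1] + rng.normal(0, 5, size=1000)

model = LinearRegression()
# 5-fold cross-validation reports R^2 per fold (the default scoring for
# a regressor); a stable mean across folds is more trustworthy than the
# score from one arbitrary split.
scores = cross_val_score(model, X, y, cv=5)
print("R^2 per fold:", np.round(scores, 3))
print("mean R^2    :", scores.mean())
```

If the per-fold scores vary wildly, that itself is a sign the single-split score on the small 1,400-sample subset was not representative of the full dataset.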

Linear regression is a basic algorithm that works mainly on linear datasets. If you have a large non-linear dataset, you have to use another algorithm, such as k-nearest neighbors or perhaps a decision tree. Personally, though, I prefer to use Naive Bayes and other classifiers.
