Regression on large dataset: Why does accuracy drop?

I am trying to predict the views on OLX's ads. I wrote a scraper to scrape all the ads (50,000). When I performed linear regression on 1,400 samples I got 66% accuracy, but when I ran it on 52,000 samples it dropped to 8%. Here are the ImgCount vs Views and Price vs Views plots.
Is there a problem with my data, and how can I perform regression on it? I know that this data is very polarized.
I wanted to know why my accuracy dropped when I used the large dataset.
Thank you for the help.
CODE:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import PolynomialFeatures
import seaborn as sns
url = '/home/msz/olx/olx/with_images.csv'
df = pd.read_csv(url, index_col='url')
# strip the currency marker and separators from 'price', then cast to int
# (regex=False so '.' is treated literally, not as a regex wildcard)
df['price'] = df['price'].str.replace('.', '', regex=False)
df['price'] = df['price'].str.replace(',', '', regex=False)
df['price'] = df['price'].str.replace('Rs', '', regex=False)
df['price'] = df['price'].astype(int)
# normalise whitespace in the ad text
df['text'] = df['text'].str.replace(',', ' ')
df['text'] = df['text'].str.replace('\t', '')
df['text'] = df['text'].str.replace('\n', '')
X = df[['price', 'img']]
y = df['views']
print ("X is like ", X.shape)
print ("Y is like ", y.shape)
df.plot(y='views', x='img', style='x')
plt.title('ImgCount vs Views')
plt.xlabel('ImgCount')
plt.ylabel('Views')
plt.show()
df.plot(y='views', x='price', style='x')
plt.title('Price vs Views')
plt.xlabel('Price')
plt.ylabel('Views')
plt.show()
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.451, random_state=0)
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
score = regressor.score(X_test, y_test)  # score() returns R^2 for regressors, not classification accuracy
print('Accuracy is : ', score * 100)

Linear regression is a basic algorithm that works mostly on linearly related data. If you have a large, non-linear dataset you have to use another algorithm, such as k-nearest neighbours or perhaps a decision tree. Personally, I prefer a Naive Bayes classifier, among others.
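As a minimal sketch of that suggestion (not from the original answer; it reuses the question's X_train/X_test split, and n_neighbors=5 is an arbitrary starting point), a k-nearest-neighbours regressor could be swapped in like this:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
# k-NN is distance-based, so scale the features first;
# otherwise 'price' would dominate 'img'
knn = make_pipeline(MinMaxScaler(), KNeighborsRegressor(n_neighbors=5))
knn.fit(X_train, y_train)
print('R^2:', knn.score(X_test, y_test))  # score() is R^2 for regressors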

Related

Running multiple machine learning models using scikit learn

I am trying to run machine learning models on some data. However, I run out of RAM or the kernel dies. I tried using dask and dropping lots of data, but the result is the same. I want to run the data through multiple models. Does anyone know a fix?
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import dask.dataframe as dd
%matplotlib inline
data_path = "/Users/natowei/Documents/Youtube Data/YouTubeDataset_withChannelElapsed.csv"
data = pd.read_csv(data_path)
data = data.iloc[500000:]
data.head()
#Predicting the total channel View Count, eliminating datasets that are not valuable in prediction
X = data.drop(['videoViewCount', 'index', 'channelId', 'videoId', 'videoPublished',
               'dislikes/views', 'likes/views', 'comments/views',
               'views/subscribers', 'views/elapsedtime'], axis=1)
Y = data['videoViewCount']
from dask_ml.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2)
train_data = X_train.join(Y_train)
from sklearn.naive_bayes import GaussianNB
bayes = GaussianNB()
import joblib
from dask.distributed import Client
client = Client(processes=False)
with joblib.parallel_backend('dask'):
    bayes.fit(X_train, Y_train)
    print(bayes.score(X_test, Y_test))
from sklearn.tree import DecisionTreeClassifier
decision = DecisionTreeClassifier()
with joblib.parallel_backend('dask'):
    decision.fit(X_train, Y_train)
    print(decision.score(X_test, Y_test))
I have also tried to chunk the data, but it does not seem to help much. Basically, all I need is a result score for the different machine learning models.
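One memory-bounded pattern worth noting here (a sketch, not from the post: since videoViewCount is a count, it swaps the classifiers for an incremental regressor, reuses the question's data_path and column names, and assumes the remaining columns are numeric) is to stream the CSV in chunks and train with partial_fit, so the full dataset never sits in RAM:
import pandas as pd
from sklearn.linear_model import SGDRegressor
model = SGDRegressor()
drop_cols = ['videoViewCount', 'index', 'channelId', 'videoId', 'videoPublished',
             'dislikes/views', 'likes/views', 'comments/views',
             'views/subscribers', 'views/elapsedtime']
for chunk in pd.read_csv(data_path, chunksize=100_000):
    X_chunk = chunk.drop(drop_cols, axis=1)
    y_chunk = chunk['videoViewCount']
    # update the model on this chunk only, then discard it
    # (note: SGD is sensitive to feature scale; scaling may be needed)
    model.partial_fit(X_chunk, y_chunk)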

pd.scatter_matrix not working on pandas version 1.4.2

Here is my code:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
fruits = pd.read_table('readonly/fruit_data_with_colors.txt')
from matplotlib import cm
X = fruits[['height', 'width', 'mass', 'color_score']]
y = fruits['fruit_label']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
cmap = cm.get_cmap('gnuplot')
scatter = pd.scatter_matrix(X_train, c= y_train, marker = 'o', s=40, hist_kwds={'bins':15}, figsize=(9,9), cmap=cmap)
My course used pandas version '0.19.2', where pd.scatter_matrix works fine. But I got the error message below when I ran it in my Jupyter Notebook with pandas '1.4.2'.
AttributeError: module 'pandas' has no attribute 'scatter_matrix'
How can I make it run on my Jupyter Notebook?
I guess it has now changed to pandas.plotting.scatter_matrix.
Have a look at the documentation below.
https://pandas.pydata.org/docs/reference/api/pandas.plotting.scatter_matrix.html
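A minimal sketch of the updated call, reusing the variables from the question:
from pandas.plotting import scatter_matrix
# same arguments as before; only the import location has changed
scatter = scatter_matrix(X_train, c=y_train, marker='o', s=40,
                         hist_kwds={'bins': 15}, figsize=(9, 9), cmap=cmap)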

Why can't I draw a chart? TypeError: unhashable type: 'numpy.ndarray'

I want to see the results of the regression in a graph, but it turns out as a blank chart.
I also tried plain values instead of a DataFrame, but the result was the same. The dataset contains 537,577 rows.
TypeError: unhashable type: 'numpy.ndarray'
# 1. libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as sm
# 2. data preprocessing
veriler = pd.read_csv("BlackFriday.csv")
print(veriler)
# missing values
from sklearn.preprocessing import Imputer
imputer = Imputer(missing_values="NaN", strategy="mean", axis=0)
pro2 = veriler.iloc[:, 9:11].values
pro2 = imputer.fit_transform(pro2)
print(veriler)
# train-test split
from sklearn.cross_validation import train_test_split
x_train, x_test, y_train, y_test = train_test_split(s, y, test_size=0.33,
                                                    random_state=0)
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(s.values, y.values)
plt.scatter(s.values, y.values)
plt.plot(s, lin_reg.predict(s.values))
Try this:
plt.scatter([s.values], [y.values])
It should work with lists.
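Another common fix (an assumption, not from the answer above): flatten the (n, 1) column arrays to 1-D with .ravel() before plotting, since matplotlib's scatter and plot expect 1-D sequences:
import matplotlib.pyplot as plt
# .ravel() turns an (n, 1) column into a flat (n,) array
plt.scatter(s.values.ravel(), y.values.ravel())
plt.plot(s.values.ravel(), lin_reg.predict(s.values).ravel(), color='red')
plt.show()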

Not able to convert string to float in Python, and how to train the model with this dataset

I have a dataset with the columns age (float), gender (str), region (str) and charges (float).
I want to predict charges using age, gender and region as features. How can I do that in scikit-learn?
I have tried something, but it shows "ValueError: could not convert string to float: 'northwest'".
import pandas as pd
import numpy as np
df = pd.read_csv('Desktop/insurance.csv')
X = df.loc[:,['age','sex','region']].values
y = df.loc[:,['charges']].values
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
from sklearn import svm
clf = svm.SVC(C=1.0, cache_size=200,decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf')
clf.fit(X_train, y_train)
The column region contains strings, which can't be used as such by the SVM classifier, since it expects numeric vectors.
Therefore you have to turn this column into something numeric. Here is an example that converts region into a categorical series:
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
df = pd.DataFrame({'age': [20, 30, 40, 50],
                   'sex': ['male', 'female', 'female', 'male'],
                   'region': ['northwest', 'southwest', 'northeast', 'southeast'],
                   'charges': [1000, 1000, 2000, 2000]})
df.sex = (df.sex == 'female')
df.region = pd.Categorical(df.region)
df.region = df.region.cat.codes
X = df.loc[:,['age','sex','region']]
y = df.loc[:,['charges']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
clf = svm.SVC(C=1.0, cache_size=200,decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf')
clf.fit(X_train, y_train)
Another way to approach this problem is to use one-hot vector encoding:
import pandas as pd
from sklearn import svm
from sklearn.model_selection import train_test_split
df = pd.DataFrame({'age': [20, 30, 40, 50],
                   'sex': ['male', 'female', 'female', 'male'],
                   'region': ['northwest', 'southwest', 'northeast', 'southeast'],
                   'charges': [1000, 1000, 2000, 2000]})
df.sex = (df.sex == 'female')
df = pd.concat([df, pd.get_dummies(df.region)], axis=1).drop('region', axis=1)
X = df.drop('charges', axis=1)
y = df.charges
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
clf = svm.SVC(C=1.0, cache_size=200,decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf')
clf.fit(X_train, y_train)
Yet another approach is to perform label encoding:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df.region = le.fit_transform(df.region)
This list of methods is of course non-exhaustive, and they perform differently depending on your problem.
Handling non-numeric data is a non-trivial task and requires some knowledge of the existing techniques (I encourage you to search Kaggle's forums, where you can find valuable information).
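For completeness, a pipeline-friendly variant of the one-hot approach (a sketch going beyond the answer's examples, using scikit-learn's ColumnTransformer so train and test data are encoded consistently):
import pandas as pd
from sklearn import svm
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
df = pd.DataFrame({'age': [20, 30, 40, 50],
                   'sex': ['male', 'female', 'female', 'male'],
                   'region': ['northwest', 'southwest', 'northeast', 'southeast'],
                   'charges': [1000, 1000, 2000, 2000]})
# one-hot encode the string columns inside the model itself;
# 'age' is passed through unchanged
pre = ColumnTransformer([('onehot', OneHotEncoder(), ['sex', 'region'])],
                        remainder='passthrough')
clf = make_pipeline(pre, svm.SVC(gamma='auto'))
clf.fit(df[['age', 'sex', 'region']], df['charges'])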

How can I improve numpy's broadcasting

I'm trying to implement k-NN with the Mahalanobis distance in Python with NumPy. However, the code below runs very slowly when I use broadcasting.
Please show me how I can speed up NumPy here or implement this better.
from __future__ import division
from sklearn.utils import shuffle
from sklearn.metrics import f1_score
from sklearn.datasets import fetch_mldata
from sklearn.cross_validation import train_test_split
import numpy as np
import matplotlib.pyplot as plt
mnist = fetch_mldata('MNIST original')
mnist_X, mnist_y = shuffle(mnist.data, mnist.target.astype('int32'))
mnist_X = mnist_X/255.0
train_X, test_X, train_y, test_y = train_test_split(mnist_X, mnist_y, test_size=0.2)
k = 2
def data_gen(n):
    return train_X[train_y == n]
train_X_num = [data_gen(i) for i in range(10)]
# inverse covariance matrix per digit, regularised so it is invertible
inv_cov = [np.linalg.inv(np.cov(train_X_num[i], rowvar=0) + np.eye(784) * 0.00001)
           for i in range(10)]
d = {}
for i in range(10):
    ivec = train_X_num[i]                   # shape: (number of 'i' data, 784)
    ivec = ivec - test_X[:, np.newaxis, :]  # too slow, and uses huge memory
    iinv_cov = inv_cov[i]
    # calculate x.T inverse(sigma) x and keep the k+1 smallest distances
    d[i] = np.sort(np.add.reduce(np.dot(ivec, iinv_cov) * ivec, axis=2), axis=1)[:, :k+1]
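One standard way to avoid the huge (n_test, n_i, 784) intermediate above (a sketch, not from the original post) is to expand the quadratic form: (x - y)^T S (x - y) = x^T S x - 2 x^T S y + y^T S y, which holds because the inverse covariance S is symmetric. Only an (n_test, n_i) matrix is ever materialised:
import numpy as np
def mahalanobis_sq(test_X, class_X, inv_cov):
    # squared Mahalanobis distance between every test row and every class row,
    # without building the (n_test, n_class, 784) difference tensor
    tS = test_X @ inv_cov                                     # (n_test, 784)
    t_sq = np.einsum('ij,ij->i', tS, test_X)                  # x^T S x, (n_test,)
    c_sq = np.einsum('ij,ij->i', class_X @ inv_cov, class_X)  # y^T S y, (n_class,)
    cross = tS @ class_X.T                                    # x^T S y, (n_test, n_class)
    return t_sq[:, None] - 2 * cross + c_sq[None, :]
# usage, mirroring the loop in the question:
# d[i] = np.sort(mahalanobis_sq(test_X, train_X_num[i], inv_cov[i]), axis=1)[:, :k+1]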