How to make user interest prediction for article reading - pandas

I am trying to predict user interest in the daily articles read on a website, using the sample data below:
from datetime import date, timedelta
import pandas as pd
import numpy as np
sdate = date(2019,1,1) # start date
edate = date(2019,1,7) # end date -6days
required_dates = pd.date_range(sdate,edate-timedelta(days=1),freq='d')
# initialize list of lists
data = [['2019-01-01', 1000, 101], ['2019-01-03', 1000, 201], ['2019-01-02', 1500, 301],
        ['2019-01-02', 1400, 101], ['2019-01-04', 1500, 201], ['2019-01-01', 2000, 201],
        ['2019-01-04', 2000, 101], ['2019-01-04', 1400, 301], ['2019-01-05', 1400, 301], ['2019-01-05', 1400, 301]]
# Create the pandas DataFrame
df1 = pd.DataFrame(data, columns = ['OnlyDate', 'ArticleID','UserID'])
df1=df1[['OnlyDate','UserID','ArticleID']]
df1.sort_values(by=['UserID','ArticleID'],inplace=True)
df1.reset_index(inplace=True,drop=True)
# raw data
raw_data= df1
# Final Data
final_data = (df1.groupby(['OnlyDate', 'UserID', 'ArticleID'])
                 .size()
                 .unstack('OnlyDate', fill_value=0)
                 .unstack('UserID', fill_value=0)
                 .unstack()
                 .reset_index(name='InterestValue'))
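In other words, the chain above counts reads per (date, user, article) and zero-fills every missing combination before flattening back to long form. A rough equivalent, for clarity (full_index and final_data_alt are just illustrative names; row order may differ from the unstack version):
counts = df1.groupby(['OnlyDate', 'UserID', 'ArticleID']).size()
full_index = pd.MultiIndex.from_product(
    [df1['OnlyDate'].unique(), df1['UserID'].unique(), df1['ArticleID'].unique()],
    names=['OnlyDate', 'UserID', 'ArticleID'])
# every date/user/article combination gets a row; unseen combinations get 0
final_data_alt = counts.reindex(full_index, fill_value=0).reset_index(name='InterestValue')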
My data looks like this:
Now I am using an XGBoost model:
import xgboost as xgb
from sklearn.model_selection import KFold, cross_val_score, train_test_split
# converting data for model
final_data['OnlyDate']=pd.to_datetime(final_data['OnlyDate'],format="%Y-%m-%d")
final_data['OnlyDate']= final_data['OnlyDate'].dt.strftime('%Y%m%d')
final_data['OnlyDate']=final_data['OnlyDate'].astype(np.int64)
final_data.info()
# splitting data
X, y = final_data.drop('InterestValue',axis=1), final_data.InterestValue
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=44)
print(X.shape,y.shape,X_train.shape, X_test.shape, y_train.shape, y_test.shape)
xgb_model = xgb.XGBClassifier().fit(X_train, y_train)
print('Accuracy of XGB classifier on training set: {:.2f}'
      .format(xgb_model.score(X_train, y_train)))
print('Accuracy of XGB classifier on test set: {:.2f}'
      .format(xgb_model.score(X_test[X_train.columns], y_test)))
#making prediction here
y_pred = xgb_model.predict(X_test)
#Checking how data looks after prediction
X_test_afterPrediction = X_test.copy()
X_test_afterPrediction['InterestValue']= y_test
X_test_afterPrediction['PredictedValues'] = y_pred
X_test_afterPrediction
The prediction output looks like this:
Currently, with my original dataset, only about 20% of the predictions are correct.
What other approaches or models should I use to improve the prediction rate?
Edit: with a multivariate LSTM I am able to predict for a single user's data at a time, with a 28% prediction rate.

Well, this is a really, really broad question. Did you do any EDA? My first question would be: what is the distribution of InterestValue? And beyond that, what is the distribution of any of those data fields?
If InterestValue is mostly zeros, then the model is going to fit mostly zeros; in that case, stratification in your sampling will help a lot. Is this a binary prediction or multiclass?
Also, what sort of tuning did you do on this model? It looks like you are just using the default hyperparameters.
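On the stratification point: scikit-learn's train_test_split accepts a stratify argument. A minimal sketch against the question's final_data (it assumes every InterestValue class occurs at least twice in the real dataset, otherwise stratify will raise an error):
from sklearn.model_selection import train_test_split
import xgboost as xgb

X, y = final_data.drop('InterestValue', axis=1), final_data.InterestValue
# keep the class proportions of InterestValue identical in train and test,
# which matters when the label is dominated by zeros
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=44, stratify=y)
xgb_model = xgb.XGBClassifier().fit(X_train, y_train)
print(xgb_model.score(X_test, y_test))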

Related

tensorflow model trained on keras.preprocessing.timeseries_dataset_from_array yields unexpected output shape of (sequence_length, 1)

I'm trying to train a TensorFlow model where my inputs are a lagged timeseries of multiple features and I want to predict a single value.
Somehow the output shape ends up as an array of (lag/sequence_length, 1) when my lagged dataset has more than one feature, but I haven't been able to figure out exactly why. Here is a minimal example of what I'm trying to do:
import tensorflow as tf
from tensorflow.keras import layers
import numpy as np
import pandas as pd
# generate some dummy data
x0 = np.array(range(300))
x1 = np.array(range(300)) * 2
df = pd.DataFrame({"x0": x0, "x1": x1})
y = np.array(range(100))
# also tried reshaping my y, but no help
# y = np.array(range(100)).reshape(100,1)
# make a dataset with lagged values
ds = tf.keras.preprocessing.timeseries_dataset_from_array(
    data=df,
    targets=y,
    sequence_length=3,
    sequence_stride=1,
    sampling_rate=1,
    batch_size=5
)
# show an example of what we are working with
list(ds.take(1))
# define simple model and train it
model = tf.keras.Sequential(
    [
        layers.Dense(32),
        layers.Dense(1),
    ]
)
model.compile(loss="mse", optimizer=tf.optimizers.Adam())
model.fit(ds, epochs=4)
# make predictions on dataset
predictions = model.predict(ds)
# show predictions
predictions
print(predictions.shape)
"""
(100, 3, 1)
"""
If I create the dataset with only a single feature as:
ds = tf.keras.preprocessing.timeseries_dataset_from_array(
    data=x1,
    targets=y,
    sequence_length=3,
    sequence_stride=1,
    sampling_rate=1,
    batch_size=5
)
my outputs have the expected shape.
I would appreciate any pointers. I'm guessing something is getting broadcast, which then results in the output I'm seeing, but I haven't been able to figure out exactly what is going on.
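One guess, as a sketch rather than a confirmed fix: Dense layers act only on the last axis, so with two features each (3, 2) window keeps its sequence axis and the output becomes (batch, 3, 1). Flattening the window first collapses it to one prediction per sample:
# same ds as above; only the model changes
model = tf.keras.Sequential(
    [
        layers.Flatten(),   # (batch, 3, 2) -> (batch, 6)
        layers.Dense(32),
        layers.Dense(1),    # -> (batch, 1)
    ]
)
model.compile(loss="mse", optimizer=tf.optimizers.Adam())
model.fit(ds, epochs=4)
print(model.predict(ds).shape)  # should now be (100, 1) rather than (100, 3, 1)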

running LinearRegression model using scikit-learn with different pandas dataframe (loop question)

I have a dataframe with cost, wind, solar and hour of day, and I would like to use the linear regression model from scikit-learn to find out how wind and solar impact the cost. I have labelled each hour with P1-P24 (24 hours a day), i.e. each row, depending on the hour of the day, is assigned a P(1-24).
Therefore I have assigned the corresponding rows of wind/solar/cost to a different dataframe for each hour of the day.
The code runs okay and does everything I wanted. However, I struggle to build a for loop that runs repeatedly for every hour to get linreg.intercept_, linreg.coef_ and np.sqrt(metrics.mean_squared_error(y_test, y_pred)) from scikit-learn on the various pandas dataframes (P1 to P24).
So at the moment I have to manually change the P number 24 times to find the corresponding intercept/coefficients/mean squared error for each hour.
I have some code below, but I always struggle to build the for loop.
I tried to build the for loop using for i in [P1, P2, ...], but the dataframe became a list and I also struggled to incorporate it into the scikit-learn part.
b is the original dataframe with columns: cost, Period (half-hourly, therefore I have periods 1 to 48), wind, solar.
# import the dataframe
import pandas as pd
import numpy as np
b = pd.read_csv('/Users/Downloads/cost_latest.csv')
To split it into hourly dataframes:
P1 = b[b['Period'].isin(['01','02'])]
P2 = b[b['Period'].isin(['03','04'])]...
The scikit-learn part:
feature_cols = ['wind','Solar']
X = P1[feature_cols]
y = P1['Price']
And here is my issue: I need to change P1 to P2...P24 before running the following code to get my parameters.
The following is the scikit-learn part:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)
print(linreg.intercept_)
print(linreg.coef_)
list(zip(feature_cols, linreg.coef_))
y_pred = linreg.predict(X_test)
from sklearn import metrics
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
I think there is a smarter way that avoids manually editing the P value and runs everything in one go; I welcome your advice and suggestions.
Thanks.
X = P1[feature_cols]
y = P1['Price']
Just use this:
for P in [P1, P2, P3, P4, P5, P6, P7]:
    X = P[feature_cols]
    y = P['Price']
All together:
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
import numpy as np

all_intercepts = []
all_coefs = []
for P in [P1, P2, P3, P4, P5, P6, P7]:
    X = P[feature_cols]
    y = P['Price']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    print(linreg.intercept_)
    print(linreg.coef_)
    print(list(zip(feature_cols, linreg.coef_)))
    y_pred = linreg.predict(X_test)
    print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    all_intercepts.append(linreg.intercept_)
    all_coefs.append(linreg.coef_)
print(all_intercepts)
print(all_coefs)
P will take the value of your dataframes P1, P2, ... on each iteration.
Alternatively, put all the Pn dataframes in a list and run your code:
all_P = [P1, P2, P3]
for P in all_P:
    X = P[feature_cols]
    y = P['Price']
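If keeping 24 separate P1...P24 variables is itself the chore, a dict built in a loop avoids it entirely (a rough sketch, assuming Period holds zero-padded strings '01'-'48' as in the isin calls above; hourly is just an illustrative name):
hourly = {}
for hour in range(1, 25):
    # each hour covers two half-hourly periods, e.g. hour 1 -> '01' and '02'
    periods = ['%02d' % (2 * hour - 1), '%02d' % (2 * hour)]
    hourly[hour] = b[b['Period'].isin(periods)]

# hourly[1] plays the role of P1, hourly[2] of P2, and so on,
# so the regression loop above can simply iterate over hourly.values()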

ValueError: expected dense_1_input to have shape (None, 4) but got (78,2)

I don't fundamentally understand the shapes of arrays or how to determine the epochs and batch sizes for training data. My data has 6 columns: column 0 is the independent variable (a string), columns 1-4 are the deep neural network inputs, and column 5 is the binary outcome of those inputs. I have 99 rows of data.
I want to understand how to get rid of this error.
#Importing Datasets
dataset=pd.read_csv('TestDNN.csv')
x = dataset.iloc[:,[1,5]].values # lower bound independent variable to upper bound in a matrix (in this case up to not including column5)
y = dataset.iloc[:,5].values # dependent variable vector
#Splitting data into Training and Test Data
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.2, random_state=0)
#Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test=sc.transform(x_test)
# PART2 - Making ANN, deep neural network
#Importing the Keras libraries and packages
import keras
from keras.models import Sequential
from keras.layers import Dense
#Initialising ANN
classifier = Sequential()
#Adding the input layer and first hidden layer
classifier.add(Dense(activation='relu', input_dim=4, units=2,
                     kernel_initializer="uniform"))  # rectifier activation function
#Adding second hidden layer
classifier.add(Dense(activation='relu', units=2,
                     kernel_initializer="uniform"))  # rectifier activation function
#Adding the Output Layer
classifier.add(Dense(activation='sigmoid', units=1,
                     kernel_initializer="uniform"))
#Compiling ANN - stochastic gradient descent
classifier.compile(optimizer='adam', loss='binary_crossentropy',
                   metrics=['accuracy'])
#Fit ANN to training set
#PART 3 - Making predictions and evaluating the model
#Fitting classifier to the training set
classifier.fit(x_train, y_train, batch_size=32, epochs=5)  # original batch size was 10 and epochs were 100
The problem is with the definition of x. This line:
x = dataset.iloc[:,[1,5]].values
... tells pandas to take columns 1 and 5 only, so it has shape (78, 2). You probably meant to take all columns before the 5th:
x = dataset.iloc[:,:5].values
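A note to double-check (my own reading of the question, not part of the original answer): the question says column 0 is a string and the network expects input_dim=4, so selecting only the four numeric input columns may be what is actually needed:
x = dataset.iloc[:, 1:5].values  # columns 1-4, the four DNN inputs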

Linear Regression overfitting

I'm taking course 2, on linear regression, in this Coursera specialization (https://www.coursera.org/specializations/machine-learning).
I've solved the exercises using GraphLab but wanted to try out sklearn for the experience and learning. I'm using sklearn and pandas for this.
The model overfits the data. How can I fix this? This is the code.
These are the coefficients I'm getting:
[ -3.33628603e-13 1.00000000e+00]
poly1_data = polynomial_dataframe(sales["sqft_living"], 1)
poly1_data["price"] = sales["price"]
model1 = LinearRegression()
model1.fit(poly1_data, sales["price"])
print(model1.coef_)
plt.plot(poly1_data['power_1'], poly1_data['price'], '.',poly1_data['power_1'], model1.predict(poly1_data),'-')
plt.show()
The plotted line looks like this. As you can see, it connects every data point.
And this is the plot of the input data.
I wouldn't even call this overfitting. I'd say you aren't doing what you think you're doing. In particular, you forgot to add a column of 1's to your design matrix, X. For example:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# generate some univariate data
x = np.arange(100)
y = 2*x + x*np.random.normal(0, 1, 100)
df = pd.DataFrame([x, y]).T
df.columns = ['x', 'y']
You're doing the following:
model1 = LinearRegression()
X = df["x"].values.reshape(1,-1)[0] # reshaping data
y = df["y"].values.reshape(1,-1)[0]
model1.fit(X,y)
Which leads to:
plt.plot(df['x'].values, df['y'].values,'.')
plt.plot(X[0], model1.predict(X)[0],'-')
plt.show()
Instead, you want to add a column of 1's to your design matrix (X):
X = np.column_stack([np.ones(len(df['x'])),df["x"].values.reshape(1,-1)[0]])
y = df["y"].values.reshape(1,-1)
model1.fit(X,y)
And (after some reshaping) you get:
plt.plot(df['x'].values, df['y'].values,'.')
plt.plot(df['x'].values, model1.predict(X),'-')
plt.show()
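As a side note, scikit-learn's LinearRegression fits an intercept by default (fit_intercept=True), so an alternative sketch is to pass a properly shaped 2-D feature matrix and let the estimator handle the constant term (model2 and X2 are just illustrative names):
model2 = LinearRegression()   # fit_intercept=True by default
X2 = df[['x']]                # 2-D shape (n_samples, 1); no manual column of ones needed
model2.fit(X2, df['y'])

plt.plot(df['x'], df['y'], '.')
plt.plot(df['x'], model2.predict(X2), '-')
plt.show()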

My TensorFlow Gradient Descent diverges

import tensorflow as tf
import pandas as pd
import numpy as np
def normalize(data):
    return data - np.min(data) / np.max(data) - np.min(data)
df = pd.read_csv('sat.csv', skipinitialspace=True)
x_reading = df['reading_score']
x_math = df['math_score']
x_reading, x_math = np.array(x_reading[df.reading_score != 's']), np.array(x_math[df.math_score != 's'])
x_data = normalize(np.float32(np.array([x_reading, x_math])))
y_writing = df[['writing_score']]
y_data = normalize(np.float32(np.array(y_writing[df.writing_score != 's'])))
W = tf.Variable(tf.random_uniform([1, 2], -.5, .5)) #float32
b = tf.Variable(tf.ones([1]))
y = tf.matmul(W, x_data) + b
loss = tf.reduce_mean(tf.square(y - y_data.T))
optimizer = tf.train.GradientDescentOptimizer(0.005)
train = optimizer.minimize(loss)
init = tf.initialize_all_variables()
with tf.Session() as sess:
    sess.run(init)
    for step in range(1000):
        sess.run(train)
        print(step, sess.run(W), sess.run(b), sess.run(loss))
Here's my code. My sat.csv contains data on reading, writing and math scores from the SAT. As you can guess, the differences between the features are not that big.
This is a part of sat.csv:
DBN,SCHOOL NAME,Num of Test Takers,reading_score,math_score,writing_score
01M292,HENRY STREET SCHOOL FOR INTERNATIONAL STUDIES,29,355,404,363
01M448,UNIVERSITY NEIGHBORHOOD HIGH SCHOOL,91,383,423,366
01M450,EAST SIDE COMMUNITY SCHOOL,70,377,402,370
01M458,FORSYTH SATELLITE ACADEMY,7,414,401,359
01M509,MARTA VALLE HIGH SCHOOL,44,390,433,384
01M515,LOWER EAST SIDE PREPARATORY HIGH SCHOOL,112,332,557,316
01M539,"NEW EXPLORATIONS INTO SCIENCE, TECHNOLOGY AND MATH HIGH SCHOOL",159,522,574,525
01M650,CASCADES HIGH SCHOOL,18,417,418,411
01M696,BARD HIGH SCHOOL EARLY COLLEGE,130,624,604,628
02M047,47 THE AMERICAN SIGN LANGUAGE AND ENGLISH SECONDARY SCHOOL,16,395,400,387
I've only used the math, writing and reading scores. My goal for the code above is to predict the writing score given the math and reading scores.
I've never seen TensorFlow's gradient descent diverge on such simple data. What could be wrong?
Here are a few options you could try:
Normalise your input and output data (a corrected sketch of this follows the list)
Set smaller initial values for your weights
Use a lower learning rate
Divide your loss by the number of samples you have (also, not putting your data in a placeholder is already uncommon).
Let me know which (if any) of these options helped, and good luck!
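On the first point: the question's normalize function already tries to do min-max scaling, but operator precedence makes it compute data - (min/max) - min instead. A minimal corrected sketch, assuming the intent is to map values to [0, 1]:
def normalize(data):
    # parentheses matter here: subtract the minimum first, then divide by the range
    return (data - np.min(data)) / (np.max(data) - np.min(data))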