Running a LinearRegression model using scikit-learn on different pandas DataFrames (loop question) - pandas

I have a dataframe with cost, wind, solar and hour of day, and I would like to use scikit-learn's linear regression model to find how wind and solar impact the cost. I have labelled each hour with P1-P24 (24 hours a day), i.e. each row is assigned a P(1-24) depending on the hour of the day.
I have therefore assigned the corresponding rows of wind/solar/cost to a different dataframe for each hour of the day.
The code does everything I want it to. However, I am struggling to build a for loop that runs repeatedly for every hour to get linreg.intercept_, linreg.coef_ and np.sqrt(metrics.mean_squared_error(y_test, y_pred)) from scikit-learn for each of the pandas dataframes (P1 to P24).
So at the moment I have to manually change the P number 24 times to find the corresponding intercept/coefficient/mean squared error for each hour.
I have some code below for the work, but I always struggle to build the for loop.
I tried to build the loop using for i in [P1, P2, ...], but then the dataframes end up in a list and I also struggle to feed them into the scikit-learn part.
b is the original dataframe with columns: cost, Period (half-hourly, so I have periods 1 to 48), wind, solar.
# import the dataframe
import pandas as pd
b = pd.read_csv('/Users/Downloads/cost_latest.csv')
To split it into hourly dataframes:
P1 = b[b['Period'].isin(['01','02'])]
P2 = b[b['Period'].isin(['03','04'])]...
The scikit-learn part:
feature_cols = ['wind', 'Solar']
X = P1[feature_cols]
y = P1['Price']
And here is my issue: I need to change P1 to P2...P24 before running the following code to get my parameters.
The following is the scikit-learn part:
import numpy as np
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)
print(linreg.intercept_)
print(linreg.coef_)
list(zip(feature_cols, linreg.coef_))
y_pred = linreg.predict(X_test)
from sklearn import metrics
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
I think there is a smarter way to avoid manually editing the P value below and to run everything in one go. I welcome your advice and suggestions.
Thanks.
X = P1[feature_cols]
y = P1['Price']

Just use this:
for P in [P1, P2, P3, P4, P5, P6, P7]:
    X = P[feature_cols]
    y = P['Price']
All together:
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

all_intercepts = []
all_coefs = []
for P in [P1, P2, P3, P4, P5, P6, P7]:
    X = P[feature_cols]
    y = P['Price']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    print(linreg.intercept_)
    print(linreg.coef_)
    list(zip(feature_cols, linreg.coef_))
    y_pred = linreg.predict(X_test)
    print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    all_intercepts.append(linreg.intercept_)
    all_coefs.append(linreg.coef_)
print(all_intercepts)
print(all_coefs)
P will be your dataframe P1, P2, ... on each iteration.

Put all Pn dataframes in a list and run your code:
all_P = [P1, P2, P3]
for P in all_P:
    X = P[feature_cols]
    y = P['Price']
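If you would rather not create the 24 variables P1...P24 by hand at all, you can also build each hourly dataframe inside the loop itself. A minimal sketch, assuming the zero-padded Period strings and the column names ('wind', 'Solar', 'Price') from the question's code:
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

b = pd.read_csv('/Users/Downloads/cost_latest.csv')
feature_cols = ['wind', 'Solar']
results = {}  # hour -> (intercept, coefficients, rmse)

for hour in range(1, 25):
    # the two half-hourly periods that make up this hour, e.g. hour 1 -> '01' and '02'
    periods = [f'{2 * hour - 1:02d}', f'{2 * hour:02d}']
    P = b[b['Period'].isin(periods)]

    X = P[feature_cols]
    y = P['Price']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    y_pred = linreg.predict(X_test)

    results[hour] = (linreg.intercept_,
                     linreg.coef_,
                     np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

print(results)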

Related

tensorflow model trained on keras.preprocessing.timeseries_dataset_from_array yields unexpected output shape of (sequence_length, 1)

I'm trying to train a tensorflow model where my inputs are a lagged time series of multiple features and I want to predict a single value.
Somehow the output shape ends up as an array of (lag/sequence_length, 1) when my lagged dataset has more than one feature, but I haven't been able to figure out why exactly that is. Here is a minimal example of what I'm trying to do:
import tensorflow as tf
from tensorflow.keras import layers
import numpy as np
import pandas as pd
# generate some dummy data
x0 = np.array(range(300))
x1 = np.array(range(300)) * 2
df = pd.DataFrame({"x0": x0, "x1": x1})
y = np.array(range(100))
# also tried reshaping my y, but no help
# y = np.array(range(100)).reshape(100,1)
# make a dataset with lagged values
ds = tf.keras.preprocessing.timeseries_dataset_from_array(
    data=df,
    targets=y,
    sequence_length=3,
    sequence_stride=1,
    sampling_rate=1,
    batch_size=5
)
# show an example of what we are working with
list(ds.take(1))
# define simple model and train it
model = tf.keras.Sequential(
    [
        layers.Dense(32),
        layers.Dense(1),
    ]
)
model.compile(loss="mse", optimizer=tf.optimizers.Adam())
model.fit(ds, epochs=4)
# make predictions on dataset
predictions = model.predict(ds)
# show predictions
predictions
print(predictions.shape)
"""
(100, 3, 1)
"""
If I create the dataset with only a single feature as:
ds = tf.keras.preprocessing.timeseries_dataset_from_array(
    data=x1,
    targets=y,
    sequence_length=3,
    sequence_stride=1,
    sampling_rate=1,
    batch_size=5
)
my outputs are of the expected shape.
I would appreciate any pointers. I'm guessing something is probably getting broadcast, which then results in the output I'm seeing, but I haven't been able to figure out what exactly is going on.
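One likely explanation: Dense layers only transform the last axis of their input. With the two-column DataFrame, each sample produced by timeseries_dataset_from_array has shape (sequence_length, n_features) = (3, 2), so Dense(32) and Dense(1) are applied per timestep and the time dimension is carried through to the output, giving (3, 1) per sample; with the single 1-D array each sample is just a (3,) vector, hence the expected shape. A minimal sketch of one way to collapse the time dimension, reusing the ds built above (Flatten is an illustrative choice here, not the only possible fix):
model = tf.keras.Sequential(
    [
        layers.Flatten(),  # (3, 2) per sample -> (6,), so the Dense layers see one flat vector
        layers.Dense(32),
        layers.Dense(1),
    ]
)
model.compile(loss="mse", optimizer=tf.optimizers.Adam())
model.fit(ds, epochs=4)
print(model.predict(ds).shape)  # now (100, 1) instead of (100, 3, 1)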

How to make user interest prediction for article reading

I am trying to predict user interest in the daily articles read on a website, using the sample data below:
from datetime import date, timedelta
import pandas as pd
import numpy as np
sdate = date(2019,1,1) # start date
edate = date(2019,1,7) # end date -6days
required_dates = pd.date_range(sdate,edate-timedelta(days=1),freq='d')
# initialize list of lists
data = [['2019-01-01', 1000,101], ['2019-01-03', 1000,201] ,['2019-01-02', 1500,301],
['2019-01-02', 1400,101],['2019-01-04', 1500,201],['2019-01-01', 2000,201],
['2019-01-04', 2000,101],['2019-01-04', 1400,301],['2019-01-05', 1400,301],['2019-01-05', 1400,301]]
# Create the pandas DataFrame
df1 = pd.DataFrame(data, columns = ['OnlyDate', 'ArticleID','UserID'])
df1=df1[['OnlyDate','UserID','ArticleID']]
df1.sort_values(by=['UserID','ArticleID'],inplace=True)
df1.reset_index(inplace=True,drop=True)
# raw data
raw_data= df1
# Final Data
final_data = (df1.groupby(['OnlyDate','UserID','ArticleID'])
                 .size()
                 .unstack('OnlyDate', fill_value=0)
                 .unstack('UserID', fill_value=0)
                 .unstack()
                 .reset_index(name='InterestValue'))
My data looks like this:
Now I am using an XGB model:
import xgboost as xgb
from sklearn.model_selection import KFold, cross_val_score, train_test_split
# converting data for model
final_data['OnlyDate']=pd.to_datetime(final_data['OnlyDate'],format="%Y-%m-%d")
final_data['OnlyDate']= final_data['OnlyDate'].dt.strftime('%Y%m%d')
final_data['OnlyDate']=final_data['OnlyDate'].astype(np.int64)
final_data.info()
# splitting data
X, y = final_data.drop('InterestValue',axis=1), final_data.InterestValue
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=44)
print(X.shape,y.shape,X_train.shape, X_test.shape, y_train.shape, y_test.shape)
xgb_model = xgb.XGBClassifier().fit(X_train, y_train)
print('Accuracy of XGB classifier on training set: {:.2f}'
      .format(xgb_model.score(X_train, y_train)))
print('Accuracy of XGB classifier on test set: {:.2f}'
      .format(xgb_model.score(X_test[X_train.columns], y_test)))
#making prediction here
y_pred = xgb_model.predict(X_test)
#Checking how data looks after prediction
X_test_afterPrediction = X_test.copy()
X_test_afterPrediction['InterestValue']= y_test
X_test_afterPrediction['PredictedValues'] = y_pred
X_test_afterPrediction
The prediction output looks like this:
Currently, with my original dataset, only about 20% of the predictions are correct.
Which other approaches or models should I use to improve my prediction rate?
Edit: with a multivariate LSTM I am able to predict for a single user's data at a time, with a 28% prediction rate.
Well, this is a really broad question. Did you do any EDA? My first question would be: what is the distribution of InterestValue? And beyond that, what is the distribution of ANY of those data fields?
If InterestValue is mostly zeros, then the model is going to predict mostly zeros; in that case, stratification in your sampling will help a lot (see the sketch below). Is this a binary prediction or multiclass?
Also, what sort of tuning did you do on this model? It looks like you are just using the default hyperparameters.
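For example, a quick check of the target distribution and a stratified split could look like this (a sketch, reusing final_data, X, y and train_test_split from the question's code):
# how unbalanced is the target? a heavy majority class would explain the low accuracy
print(final_data['InterestValue'].value_counts(normalize=True))

# stratified split: keeps the InterestValue class proportions the same in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=44, stratify=y)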

KNN: give more weight to specific features in the distance

I'm using the Kobe Bryant Dataset.
I wish to predict shot_made_flag with KNeighborsRegressor.
I've used game_date to extract year and month features:
# convert season to years
kobe_data_encoded['season'] = kobe_data_encoded['season'].apply(lambda x: int(re.compile('(\d+)-').findall(x)[0]))
# add year and month using game_date
kobe_data_encoded['year'] = kobe_data_encoded['game_date'].apply(lambda x: int(re.compile('(\d{4})').findall(x)[0]))
kobe_data_encoded['month'] = kobe_data_encoded['game_date'].apply(lambda x: int(re.compile('-(\d+)-').findall(x)[0]))
kobe_data_encoded = kobe_data_encoded.drop(columns=['game_date'])
I wish to use the season, year and month features to give them more weight in the distance function, so that events with a closer date to the current event become closer neighbours, while still maintaining reasonable distances to other potential data points. For example, I don't want an event within the same day to become the nearest neighbour just because of the date features; it should still take the other features such as shot_range into account.
To give them more weight I've tried to use the metric argument with a custom distance function, but the arguments of that function are just numpy arrays without the column information of the pandas dataframe, so I'm not sure how to implement what I'm trying to do.
EDIT:
Using larger weights for the date features, and finding the optimal k with 10-fold CV over k in [1, 100]:
from IPython.display import display
import numpy as np
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
# scaling
min_max_scaler = preprocessing.MinMaxScaler()
scaled_features_df = kobe_data_encoded.copy()
column_names = ['loc_x', 'loc_y', 'minutes_remaining', 'period',
                'seconds_remaining', 'shot_distance', 'shot_type', 'shot_zone_range']
scaled_features = min_max_scaler.fit_transform(scaled_features_df[column_names])
scaled_features_df[column_names] = scaled_features
not_classified_df = scaled_features_df[scaled_features_df['shot_made_flag'].isnull()]
classified_df = scaled_features_df[scaled_features_df['shot_made_flag'].notnull()]
X = classified_df.drop(columns=['shot_made_flag'])
y = classified_df['shot_made_flag']
cv = StratifiedKFold(n_splits=10, shuffle=True)
neighbors = [x for x in range(1, 100)]
cv_scores = []
weight = np.ones((X.shape[1],))
weight[[X.columns.get_loc("season"),
        X.columns.get_loc("year"),
        X.columns.get_loc("month")]] = 5
weight = weight / weight.sum()  # normalize weights

def my_distance(x, y):
    dist = (x - y) ** 2
    return np.dot(dist, weight)
for k in neighbors:
    print('k: ', k)
    knn = KNeighborsClassifier(n_neighbors=k, metric=my_distance)
    cv_scores.append(np.mean(cross_val_score(knn, X, y, cv=cv, scoring='roc_auc')))
# optimal k (roc_auc: higher is better, so take the max score)
optimal_k_index = cv_scores.index(max(cv_scores))
optimal_k = neighbors[optimal_k_index]
print('best k: ', optimal_k)
plt.plot(neighbors, cv_scores)
plt.xlabel('Number of Neighbors K')
plt.ylabel('ROC AUC')
plt.show()
This runs really slowly; any idea how to make it faster?
The idea of the weighted features is to find neighbours closer to the data point's date in order to avoid data leakage, with CV used to find the optimal k.
First, you have to prepare a 1D numpy weight array, specifying a weight for each feature. You could do something like:
weight = np.ones((M,))  # M is the number of features
weight[[1, 7, 10]] = 2  # increase the weight of the features at indexes 1, 7 and 10
weight = weight / weight.sum()  # normalize the weights
You can use kobe_data_encoded.columns to find the indexes of the season, year and month features in your dataframe, and replace the second line above accordingly.
Now define a distance function, which per the guidelines has to take two 1D numpy arrays:
def my_dist(x, y):
    global weight  # 1D array, same shape as x or y
    dist = (x - y) ** 2  # 1D array, same shape as x or y
    return np.dot(dist, weight)  # a scalar float
And initialize KNeighborsRegressor as:
knn = KNeighborsRegressor(metric=my_dist)
EDIT:
To make things efficient, you can precompute the distance matrix and reuse it in the KNN. This should bring a significant speedup by reducing the number of calls to my_dist, since this non-vectorized custom Python distance function is quite slow. So now:
Xv = X.values  # X is a DataFrame in the question, so index the underlying array
dist = np.zeros((len(Xv), len(Xv)))  # computing the NxN distance matrix
for i in range(len(Xv)):  # you can halve this by using the fact that dist[i,j] = dist[j,i]
    for j in range(len(Xv)):
        dist[i, j] = my_dist(Xv[i], Xv[j])

for k in neighbors:
    print('k: ', k)
    knn = KNeighborsClassifier(n_neighbors=k, metric='precomputed')  # note: metric='precomputed'
    cv_scores.append(np.mean(cross_val_score(knn, dist, y, cv=cv, scoring='roc_auc')))  # note: passing dist instead of X
I couldn't test it, so let me know if something isn't alright.
Just to add to Shihab's answer regarding the distance computation: you can use scipy's pdist, as suggested in this post, which is faster and more efficient.
from scipy.spatial.distance import pdist, squareform
# create the custom weight array
weight = ...
# calculate pairwise distances, using the Minkowski norm with custom weights
distances = pdist(X, 'minkowski', p=2, w=weight)
# reformat the result as a square matrix
distances_as_2d_matrix = squareform(distances)
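The resulting square matrix can then be plugged straight into the metric='precomputed' loop from the previous answer; a short sketch, reusing the neighbors, cv, y and cv_scores variables from the question:
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k, metric='precomputed')
    cv_scores.append(np.mean(cross_val_score(knn, distances_as_2d_matrix, y, cv=cv, scoring='roc_auc')))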

Linear Regression overfitting

I'm taking course 2 of this Coursera course on linear regression (https://www.coursera.org/specializations/machine-learning).
I've solved the training exercise using GraphLab, but wanted to try out sklearn for the experience and for learning. I'm using sklearn and pandas for this.
The model overfits the data. How can I fix this? This is the code.
These are the coefficients I'm getting:
[ -3.33628603e-13   1.00000000e+00]
poly1_data = polynomial_dataframe(sales["sqft_living"], 1)
poly1_data["price"] = sales["price"]
model1 = LinearRegression()
model1.fit(poly1_data, sales["price"])
print(model1.coef_)
plt.plot(poly1_data['power_1'], poly1_data['price'], '.',poly1_data['power_1'], model1.predict(poly1_data),'-')
plt.show()
The plotted line looks like this; as you can see, it connects every data point.
And this is the plot of the input data.
I wouldn't even call this overfitting. I'd say you aren't doing what you think you're doing. In particular, you forgot to add a column of 1's to your design matrix, X. For example:
# generate some univariate data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
x = np.arange(100)
y = 2*x + x*np.random.normal(0, 1, 100)
df = pd.DataFrame([x, y]).T
df.columns = ['x', 'y']
You're doing the following:
model1 = LinearRegression()
X = df["x"].values.reshape(1,-1)[0] # reshaping data
y = df["y"].values.reshape(1,-1)[0]
model1.fit(X,y)
Which leads to:
plt.plot(df['x'].values, df['y'].values,'.')
plt.plot(X[0], model1.predict(X)[0],'-')
plt.show()
Instead, you want to add a column of 1's to your design matrix (X):
X = np.column_stack([np.ones(len(df['x'])),df["x"].values.reshape(1,-1)[0]])
y = df["y"].values.reshape(1,-1)
model1.fit(X,y)
And (after some reshaping) you get:
plt.plot(df['x'].values, df['y'].values,'.')
plt.plot(df['x'].values, model1.predict(X),'-')
plt.show()
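For completeness, and as an addition rather than part of the answer above: sklearn's LinearRegression fits an intercept by default (fit_intercept=True), so an alternative to stacking the column of 1's manually is to pass the single feature as a 2-D column and let the model estimate the intercept. A minimal sketch, assuming the same df as above:
model1 = LinearRegression()  # fit_intercept=True by default
X = df[['x']].values         # 2-D design matrix with one feature column
model1.fit(X, df['y'].values)
print(model1.intercept_, model1.coef_)
plt.plot(df['x'].values, df['y'].values, '.')
plt.plot(df['x'].values, model1.predict(X), '-')
plt.show()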

Scikit-learn prediction intervals for future values?

The numbers predicted by my code below are very specific and I do not get any exact matches, but some are pretty close. For example, on a certain date there were actually 388 events and the model might predict 397.
Can I output a range like 370-410 instead? Or see the percentage chance that the value will fall within a given range? Or should I bin the values and check accuracy that way?
Code:
def make_prediction(label, prediction):
    X = df[[col1, col2, col3]].values
    y = df[label].values
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    X_train.shape, X_test.shape
    clf = linear_model.LinearRegression()
    clf.fit(X_train, y_train)
    output = clf.predict(X)
    result = np.c_[X, output]
    df_result = pd.DataFrame(result, columns=[col1, col2, col3, prediction])
    return df_result
So the code above produces a value for each row (which is a date in this case, but I number them from 1 onward based on the first value in the data set). How do I predict future values? When I run the code above I only get predicted values for the existing data; how can I use the model on other data sets or on future dates?
Assuming that you require binning on top of predicted values, you can use pandas cut() as follows:
In [1]: import pandas as pd
In [2]: df = pd.DataFrame([270,201,375,370,410,510], columns=['prediction'])
In [3]: bins = [0,370,420,600]
In [4]: group_labels = ['(0-370]', '(371-420]', '(421-600]']
In [5]: df['prediction_range'] = pd.cut(df.prediction, bins, labels=group_labels)
In [6]: df
Out[6]:
   prediction prediction_range
0         270          (0-370]
1         201          (0-370]
2         375        (371-420]
3         370          (0-370]
4         410        (371-420]
5         510        (421-600]
Reference: Binning Data In Pandas
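As for predicting future values, which the binning above does not cover: a fitted LinearRegression can be applied to any new rows that have the same feature columns. A minimal sketch, reusing the names from the question's make_prediction (future_df is hypothetical, a dataframe of rows for dates not yet seen):
# keep the fitted model around instead of only returning the dataframe of predictions
clf = linear_model.LinearRegression()
clf.fit(df[[col1, col2, col3]].values, df[label].values)  # label and col1..col3 as in the question

# future_df is hypothetical: it holds the feature values for the future dates of interest
future_predictions = clf.predict(future_df[[col1, col2, col3]].values)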