tensorflow model trained on keras.preprocessing.timeseries_dataset_from_array yields unexpected output shape of (sequence_length, 1)

I'm trying to train a tensorflow model where my inputs are a lagged timeseries of multiple features and I want to predict a single value.
Somehow each output ends up with shape (lag/sequence_length, 1) when my lagged dataset has more than one feature, but I haven't been able to figure out why exactly that is. Here is a minimal example of what I'm trying to do:
import tensorflow as tf
from tensorflow.keras import layers
import numpy as np
import pandas as pd
# generate some dummy data
x0 = np.array(range(300))
x1 = np.array(range(300)) * 2
df = pd.DataFrame({"x0": x0, "x1": x1})
y = np.array(range(100))
# also tried reshaping my y, but no help
# y = np.array(range(100)).reshape(100,1)
# make a dataset with lagged values
ds = tf.keras.preprocessing.timeseries_dataset_from_array(
    data=df,
    targets=y,
    sequence_length=3,
    sequence_stride=1,
    sampling_rate=1,
    batch_size=5
)
# show an example of what we are working with
list(ds.take(1))
# define simple model and train it
model = tf.keras.Sequential(
    [
        layers.Dense(32),
        layers.Dense(1),
    ]
)
model.compile(loss="mse", optimizer=tf.optimizers.Adam())
model.fit(ds, epochs=4)
# make predictions on dataset
predictions = model.predict(ds)
# show predictions
predictions
print(predictions.shape)
"""
(100, 3, 1)
"""
If I create the dataset with only a single feature as:
ds = tf.keras.preprocessing.timeseries_dataset_from_array(
    data=x1,
    targets=y,
    sequence_length=3,
    sequence_stride=1,
    sampling_rate=1,
    batch_size=5
)
My outputs have the expected shape of (100, 1).
Would appreciate any pointers. I'm guessing something is probably getting broadcast, which then results in the output I'm seeing, but I haven't been able to figure out what exactly is going on.
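For what it's worth, a Dense layer acts only on the last axis of its input: with two features each sample is a (3, 2) window, so Dense maps it timestep-by-timestep to (3, 1), and predicting over the 100 samples returns (100, 3, 1). With a single 1-D feature each sample is just (3,), which is why that case comes out as expected. A minimal sketch of one fix, keeping the model above but flattening each window before the dense layers:

model = tf.keras.Sequential(
    [
        layers.Flatten(),  # (batch, 3, 2) -> (batch, 6)
        layers.Dense(32),
        layers.Dense(1),   # (batch, 1)
    ]
)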

Related

How to make user interest prediction for article reading

I am trying to predict user interest in the daily articles read on a website, using the sample data below:
from datetime import date, timedelta
import pandas as pd
import numpy as np
sdate = date(2019,1,1) # start date
edate = date(2019,1,7) # end date -6days
required_dates = pd.date_range(sdate,edate-timedelta(days=1),freq='d')
# initialize list of lists
data = [['2019-01-01', 1000, 101], ['2019-01-03', 1000, 201], ['2019-01-02', 1500, 301],
        ['2019-01-02', 1400, 101], ['2019-01-04', 1500, 201], ['2019-01-01', 2000, 201],
        ['2019-01-04', 2000, 101], ['2019-01-04', 1400, 301], ['2019-01-05', 1400, 301], ['2019-01-05', 1400, 301]]
# Create the pandas DataFrame
df1 = pd.DataFrame(data, columns = ['OnlyDate', 'ArticleID','UserID'])
df1=df1[['OnlyDate','UserID','ArticleID']]
df1.sort_values(by=['UserID','ArticleID'],inplace=True)
df1.reset_index(inplace=True,drop=True)
# raw data
raw_data= df1
# Final Data
final_data = (df1.groupby(['OnlyDate','UserID','ArticleID'])
                 .size()
                 .unstack('OnlyDate', fill_value=0)
                 .unstack('UserID', fill_value=0)
                 .unstack()
                 .reset_index(name='InterestValue'))
My data looks like this:
Now I am using an XGB model:
import xgboost as xgb
from sklearn.model_selection import KFold, cross_val_score, train_test_split
# converting data for model
final_data['OnlyDate']=pd.to_datetime(final_data['OnlyDate'],format="%Y-%m-%d")
final_data['OnlyDate']= final_data['OnlyDate'].dt.strftime('%Y%m%d')
final_data['OnlyDate']=final_data['OnlyDate'].astype(np.int64)
final_data.info()
# splitting data
X, y = final_data.drop('InterestValue',axis=1), final_data.InterestValue
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=44)
print(X.shape,y.shape,X_train.shape, X_test.shape, y_train.shape, y_test.shape)
xgb_model = xgb.XGBClassifier().fit(X_train, y_train)
print('Accuracy of XGB classifier on training set: {:.2f}'
      .format(xgb_model.score(X_train, y_train)))
print('Accuracy of XGB classifier on test set: {:.2f}'
      .format(xgb_model.score(X_test[X_train.columns], y_test)))
# making predictions
y_pred = xgb_model.predict(X_test)
# checking how the data looks after prediction
X_test_afterPrediction = X_test.copy()
X_test_afterPrediction['InterestValue'] = y_test
X_test_afterPrediction['PredictedValues'] = y_pred
X_test_afterPrediction
The prediction output looks like this:
Currently, with my original dataset, only about 20% of the predictions are correct.
What other approaches or models should I try to improve my prediction rate?
Edit: with a multivariate LSTM I am able to predict for a single user's data at a time, with a 28% prediction rate.
Well, this is a really broad question. Did you do any EDA? My first question would be: what is the distribution of InterestValue? And beyond that, what is the distribution of any of those data fields?
If InterestValue is mostly zeros, then the model is going to fit mostly zeros; in that case, stratifying your sampling will help a lot (see the sketch below). Is this a binary or a multiclass prediction?
Also, what sort of tuning did you do on this model? It looks like it is just using the default hyperparameters.
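As an illustration of that stratification point, a minimal sketch of a stratified split, assuming X and y are the features and target built above and that InterestValue is treated as a class label:

from sklearn.model_selection import train_test_split

# stratify=y keeps the class proportions of InterestValue the same in the
# train and test splits, which matters when one class dominates
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=44, stratify=y
)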

tensorflow error while training model - Labels dtype should be integer Instead got <dtype: 'string'>

I am currently learning tensorflow and am new to the concept. I am trying multi-class classification using LinearClassifier.
I have a dataset where I reduced the number of input variables to 30 using PCA (done with scikit-learn) and named the resulting columns PCA_Col_0 through PCA_Col_29.
I then created a tensorflow feature column for each of the 30 variables using the following code:
feat_cols = ['PCA_Col_0', ..., 'PCA_Col_29']  # the 30 column names, as strings
d = {}
for item in feat_cols:
    d[item] = tf.feature_column.numeric_column(item)
feat_cols2 = list(d.values())
I then initialized the model:
import tensorflow as tf
n_classes = 3914
model = tf.estimator.LinearClassifier(feature_columns = feat_cols2, n_classes = n_classes)
input_fn = tf.estimator.inputs.pandas_input_fn(x = DF_Final_V1[feat_cols], y = DF_Final_V1['nUnique_ID'], shuffle = False)
model.train(input_fn)
I get the error Labels dtype should be integer Instead got <dtype: 'string'> from tensorflow.
I have verified the following:
The Input dataset has only float64 entries
There are no null or nan values in the input dataset
the tf.feature_column shows dtype as float32
Why isn't my model training and why am I getting this error?
Credit to @Bruce Swain in the comments.
The code worked after I converted the label values to integers in the range 0 to n_classes - 1.
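A minimal sketch of one way to do that conversion, assuming the labels in nUnique_ID are strings (the exact encoding used in the fix is not shown above):

import pandas as pd

# pd.factorize maps each distinct label to an integer 0 .. n_classes - 1,
# which is the dtype LinearClassifier expects for its labels
DF_Final_V1['nUnique_ID'], label_index = pd.factorize(DF_Final_V1['nUnique_ID'])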

Applying Keras Model to give an output value for each row

I am learning keras and would like to understand how I can apply a (sequential) classifier to all rows in my data set, not just the x% left for test validation.
The confusion I am having is that when I define my data split, I have one portion for train and one for test. How would I apply the model to the full data set to get the predicted value for each row? The end goal is to produce the predicted value for every customer in the data set, concatenated to the original data.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

dataset = pd.read_csv('BankCustomers.csv')
X = dataset.iloc[:, 3:13]
y = dataset.iloc[:, 13]
feature_train, feature_test, label_train, label_test = train_test_split(X, y, test_size=0.2, random_state=0)

# scale the features: fit on train only, then apply the same transform to test
sc = StandardScaler()
feature_train = sc.fit_transform(feature_train)
feature_test = sc.transform(feature_test)
For completeness, the classifier looks like this:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Initialising the ANN
classifier = Sequential()
# Adding the input layer and the first hidden layer
classifier.add(Dense(activation="relu", input_dim=11, units=6, kernel_initializer="uniform"))
# Adding the second hidden layer
classifier.add(Dense(activation="relu", units=6, kernel_initializer="uniform"))
# Adding the output layer
classifier.add(Dense(activation="sigmoid", units=1, kernel_initializer="uniform"))
# Compiling the ANN
classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Fitting the ANN to the training set ("nb_epoch" is long deprecated; current Keras uses "epochs")
classifier.fit(feature_train, label_train, batch_size=10, epochs=100)
The course I am doing suggests ways to get accuracy and predictions for the test set, as below, but not for the full data set.
from sklearn.metrics import confusion_matrix, accuracy_score

# Predicting the test set results
label_pred = classifier.predict(feature_test)
label_pred = (label_pred > 0.5)  # False/True depending on whether the score is below or above 50%
cm = confusion_matrix(label_test, label_pred)
accuracy = accuracy_score(label_test, label_pred)
I tried concatenating the model's output on both the training and test data, but I was then unsure how to work out which index matched up with the original data set (i.e. I don't know which rows of the 20% test data correspond to which rows of the original set).
I apologise in advance if this question is superfluous; I have been looking for answers on Stack Overflow and via the course, but so far no luck.
You can use pandas indexes to sort your results back into the original order.
Predict on both feature_train and feature_test (not sure why you'd want to predict on feature_train, though).
Add a new column to each of feature_train and feature_test containing the predictions. Note that in your code the scaler overwrote these with NumPy arrays, so keep (or re-wrap) them as DataFrames with their original indexes for this to work:

# assign the raw prediction arrays rather than wrapping them in pd.Series:
# a fresh Series is indexed 0..n-1 and would misalign with the shuffled index
feature_train["predictions"] = classifier.predict(feature_train).ravel()
feature_test["predictions"] = classifier.predict(feature_test).ravel()
If you look at the indexes of each data frame above, you can see they are shuffled (because of train_test_split).
You can now concatenate them, call sort_index, and retrieve the predictions column, which holds the predictions in the order of the rows of your initial dataframe (X):
pd.concat([feature_train, feature_test], axis=0).sort_index()
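Alternatively, a minimal sketch of predicting on the full data set directly, assuming sc and classifier are the fitted objects above and X still holds the unscaled features for every customer:

# scale ALL rows with the scaler fitted on the training data, then predict
full_scaled = sc.transform(X)
predictions_all = (classifier.predict(full_scaled) > 0.5).ravel()

# attach the predictions to the original rows of the source data
dataset_with_preds = dataset.copy()
dataset_with_preds['predicted'] = predictions_all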

running a LinearRegression model using scikit-learn on different pandas dataframes (loop question)

I have a dataframe with cost, wind, solar and hour of day, and I would like to use the linear regression model from scikit-learn to find how wind and solar impact the cost. I have labelled each hour with P1-P24 (24 hours a day), i.e. each row is assigned a P(1-24) depending on the hour of the day.
I have therefore assigned the corresponding rows of wind/solar/cost to a different dataframe for each hour of the day.
The code runs okay and does everything I wanted. However, I struggle to build a for loop that runs repeatedly for every hour to find linreg.intercept_, linreg.coef_ and np.sqrt(metrics.mean_squared_error(y_test, y_pred)) from scikit-learn on the various pandas dataframes (P1 to P24).
So at the moment I have to manually change the P number 24 times to find the corresponding intercept/coefficients/mean squared error for each hour.
I have some code below for the work, but I always struggle to build the for loop.
I tried to build the for loop using for i in [P1, P2, ...], but it turned the dataframes into a list and I also struggled to incorporate it into the scikit-learn part.
b is the original dataframe with columns: cost, Period (half-hourly, therefore I have periods 1 to 48), wind, solar.
# import the dataframe
import pandas as pd
b = pd.read_csv('/Users/Downloads/cost_latest.csv')
To split it into hourly frames:
P1 = b[b['Period'].isin(['01','02'])]
P2 = b[b['Period'].isin(['03','04'])]...
The scikit-learn part:
feature_cols = ['wind','Solar']
X = P1[feature_cols]
y = P1['Price']
And here is my issue: I need to change P1 to P2 ... P24 before running the following code to get my parameters.
The following is the scikit-learn part:
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
linreg = LinearRegression()
# fit the model to the training data (learn the coefficients)
linreg.fit(X_train, y_train)
print(linreg.intercept_)
print(linreg.coef_)
list(zip(feature_cols, linreg.coef_))
y_pred = linreg.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
I think there is a smarter way to avoid manually editing the P value below and to run everything in one go; I welcome your advice and suggestions.
Thanks!
X = P1[feature_cols]
y = P1['Price']
Just use this:
for P in [P1, P2, P3, P4, P5, P6, P7]:
    X = P[feature_cols]
    y = P['Price']
All together:
import numpy as np
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

all_intercepts = []
all_coefs = []
for P in [P1, P2, P3, P4, P5, P6, P7]:
    X = P[feature_cols]
    y = P['Price']
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
    linreg = LinearRegression()
    linreg.fit(X_train, y_train)
    print(linreg.intercept_)
    print(linreg.coef_)
    list(zip(feature_cols, linreg.coef_))
    y_pred = linreg.predict(X_test)
    print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
    all_intercepts.append(linreg.intercept_)
    all_coefs.append(linreg.coef_)
print(all_intercepts)
print(all_coefs)
P will be each of your dataframes P1, P2, ... in turn, one per iteration.
Put all Pn dataframes in a list and run your code:
all_P = [P1, P2, P3]
for P in all_P:
    X = P[feature_cols]
    y = P['Price']
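As a further sketch (assuming the zero-padded period labels shown above), the 24 hourly frames can also be built programmatically instead of being named by hand:

# map hour h (1..24) to its two half-hourly periods, e.g. hour 1 -> '01', '02'
hourly = {h: b[b['Period'].isin([f'{2 * h - 1:02d}', f'{2 * h:02d}'])]
          for h in range(1, 25)}

for h, P in hourly.items():
    X = P[feature_cols]
    y = P['Price']
    # ... fit and report exactly as in the loop above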

How can I compare columns for equality in a matrix-multiplication manner?

I am using Keras (with tensorflow as the backend). What I want to do is write a lambda layer that takes two tensors as input, compares every combination of two columns from them using an indicator function, and produces a new tensor with 0-1 values. Here is an example.
Input:
x = K.variable(np.array([[1,2,3],[2,3,4]]))
y = K.variable(np.array([[1,2,3],[2,3,4]]))
Output:
z = K.variable(np.array([[1,0],[0,1]]))
As far as I know, tensorflow provides tf.equal() to compare tensors elementwise. But if I apply it here, I get:
>>> z = tf.equal(x, y)
>>> K.eval(z)
array([[ True,  True,  True],
       [ True,  True,  True]], dtype=bool)
It only compares elements in the same position.
So my questions are:
1. Is there a tensorflow API that produces my desired output, or do I need to write my own function?
2. If it is the latter, there is another problem. I noticed that in keras the input is a mini-batch, so the input shape looks like (None, m, n). When writing my own method, how can I deal with the first dimension, which is None?
Any reply would be appreciated!
You could use broadcasting:
import numpy as np
import tensorflow as tf

x = tf.constant(np.array([[1, 2, 3], [2, 3, 4]]))
y = tf.constant(np.array([[1, 2, 3], [2, 3, 4]]))

# insert singleton axes so broadcasting compares every vector of x
# against every vector of y: (1, 2, 3) vs (2, 1, 3) -> (2, 2, 3)
x_ = tf.expand_dims(x, 0)
y_ = tf.expand_dims(y, 1)

# a pair matches only if all elements along the last axis are equal,
# giving a (2, 2) boolean matrix
res = tf.reduce_all(tf.equal(x_, y_), axis=-1)

sess = tf.Session()
sess.run(res)
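For the batch question, a minimal sketch of the same idea inside a Lambda layer, where the inputs arrive as (None, m, n); the batch axis simply broadcasts through unchanged (input_a and input_b are hypothetical Keras tensors):

from tensorflow.keras import layers
import tensorflow.keras.backend as K

def pairwise_equal(tensors):
    a, b = tensors                # each: (batch, m, n)
    a_ = K.expand_dims(a, 1)      # (batch, 1, m, n)
    b_ = K.expand_dims(b, 2)      # (batch, m, 1, n)
    # all-equal along the feature axis -> (batch, m, m) of 0/1 values
    return K.cast(K.all(K.equal(a_, b_), axis=-1), 'float32')

# hypothetical usage with two symbolic inputs of shape (m, n) each
z = layers.Lambda(pairwise_equal)([input_a, input_b])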