I have created two models so far: an LSTM and an LSTM with self-attention. Now I am working on my first transformer model, built for multivariate time series prediction (a many-to-one classification model).
I have hourly varying data, i.e., 8 different features (hour, month, temperature, humidity, wind speed, solar radiation, etc.), and with them I am trying to predict a time sequence (the energy consumption of a building). So my input has the shape X.shape = (8783, 168, 8), i.e., 8783 time sequences, each containing 168 hourly entries/vectors, with each vector containing 8 features. My output has the shape Y.shape = (8783, 1), i.e., 8783 sequences, each containing 1 output value (the building's energy consumption for the next hour).
I took as a model an example from the official Keras site. It was created for classification problems, so I converted my output to classes: n_classes = len(np.unique(Y_train)) = 156
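For reference, a window tensor of this shape can be built by sliding a 168-step window over the hourly records. A minimal numpy sketch, where data (shape (n_hours, 8)) and target (shape (n_hours,)) are hypothetical arrays of the raw features and consumption values:
import numpy as np

n_past = 168
X = np.stack([data[i - n_past:i] for i in range(n_past, len(data))])  # (n_samples, 168, 8)
Y = target[n_past:].reshape(-1, 1)                                    # (n_samples, 1)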
Input shape (X_train) = (8783, 168, 8)
Output shape (Y_train) = (8783, 1)
n_classes = 156
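A minimal sketch of the class conversion, assuming Y_train holds the raw consumption values; np.unique with return_inverse=True gives both the distinct values and the integer label of each sample:
import numpy as np

classes, Y_class = np.unique(Y_train.ravel(), return_inverse=True)
n_classes = len(classes)           # 156 in my case
Y_class = Y_class.reshape(-1, 1)   # integer labels for sparse_categorical_crossentropy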
In the softmax activation layer, I set the output to n_classes, but the model is not fitting well. Below I attach the model, and I would like to know:
I) Have I done something wrong in the model? Is the model architecture fine? Are there other parts of the code I need to change for it to work for my problem?
II) Can a transformer work at all on multivariate problems of my kind (8 input features, 1 output feature), or do transformers only work on univariate problems?
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.utils import plot_model

def build_transformer_model(input_shape, head_size, num_heads, ff_dim, num_transformer_blocks,
                            mlp_units, dropout=0, mlp_dropout=0):
    inputs = keras.Input(shape=input_shape)
    x = inputs
    for _ in range(num_transformer_blocks):
        block_input = x
        # Normalization and attention
        x = layers.LayerNormalization(epsilon=1e-6)(x)
        x = layers.MultiHeadAttention(
            key_dim=head_size, num_heads=num_heads, dropout=dropout
        )(x, x)
        x = layers.Dropout(dropout)(x)
        res = x + block_input  # fixed: was `x + inputs`, which always skipped back to the model input
        # Feed-forward part
        x = layers.LayerNormalization(epsilon=1e-6)(res)
        x = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(x)
        x = layers.Dropout(dropout)(x)
        x = layers.Conv1D(filters=inputs.shape[-1], kernel_size=1)(x)
        x = x + res
    x = layers.GlobalAveragePooling1D(data_format="channels_first")(x)
    for dim in mlp_units:
        x = layers.Dense(dim, activation="relu")(x)
        x = layers.Dropout(mlp_dropout)(x)
    x = layers.Dense(n_classes, activation="softmax")(x)  # n_classes comes from the enclosing scope
    return keras.Model(inputs, x)
model_tr = build_transformer_model(input_shape=(X_train.shape[1], X_train.shape[2]), head_size=256,
                                   num_heads=4, ff_dim=4, num_transformer_blocks=4,
                                   mlp_units=[128], mlp_dropout=0.4, dropout=0.2)
model_tr.compile(loss="sparse_categorical_crossentropy",
                 optimizer=keras.optimizers.Adam(learning_rate=0.0001),
                 metrics=["sparse_categorical_accuracy"])
plot_model(model_tr, to_file='model_plot.png', show_shapes=True, show_layer_names=True)
model_tr.summary()
m_tr_history = model_tr.fit(x=X_train, y=Y_train, validation_split=0.15, batch_size=64, epochs=100, verbose=1)
model_tr.save('halka_transformer.h5')
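After training, predicted classes could be mapped back to energy values via the classes array from np.unique shown above (a sketch, assuming a held-out X_test):
probs = model_tr.predict(X_test)    # X_test is hypothetical here, shape (n_samples, 168, 8)
pred_class = probs.argmax(axis=-1)  # most likely class per sample
pred_energy = classes[pred_class]   # back to the original consumption values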
[Figure: training history plotted against the no. of epochs]
Thanks in advance for your valuable assistance.
Related
I have created a transformer model for time series predictions, set up as a regression problem.
Details about the Dataset
I have hourly varying data with a single feature (lagged energy use data). (The model could be improved by increasing the number of lagged energy use values, which provide more information to the model.) I am trying to predict a time sequence (the energy consumption of a building). So my input has the shape X.shape = (8783, 168, 1), i.e., 8783 time sequences, each containing one week of lagged energy use data, i.e., 24*7 = 168 hourly entries/vectors, with each vector containing a lagged energy use value as input. My output has the shape Y.shape = (8783, 1), i.e., 8783 sequences, each containing 1 output value (the building's energy consumption for the next hour).
Model Details
I took as a model an example from the official Keras site. It was created for classification problems, so I modified it for my regression problem by changing the activation of the last output layer from sigmoid to relu. Input shape (train_f) = (8783, 168, 1); output shape (train_P) = (8783, 1). When I trained the model for 100 epochs, it converged in fewer epochs than my reference models (LSTM and LSTM with self-attention). After training, when the model makes predictions on the test data, the prediction performance is also good compared to the reference models.
Since the same model predicts well, in order to improve its performance I am now feeding in lagged energy use data of 1 month, i.e., 168*4 = 672 hourly entries/vectors, with each vector containing a lagged energy use value as input. So my input going into the model now has the shape X.shape = (8783, 672, 1). Both the training and prediction accuracy drop in comparison to the weekly input data, as seen below.
**Lagged energy use data for 1 week, i.e., X.shape = (8783, 168, 1)**

|               | MSE    | RMSE   | MAE    | R-Score |
|---------------|--------|--------|--------|---------|
| Training data | 1.0489 | 1.0242 | 0.6395 | 0.9707  |
| Testing data  | 0.6221 | 0.7887 | 0.5648 | 0.9171  |

**Lagged energy use data for 1 month, i.e., X.shape = (8783, 672, 1)**

|               | MSE    | RMSE   | MAE    | R-Score |
|---------------|--------|--------|--------|---------|
| Training data | 1.6424 | 1.2816 | 0.7326 | 0.9567  |
| Testing data  | 1.4991 | 1.2244 | 0.9233 | 0.6903  |
I believe that providing more information to the model should result in better predictions. Do you have any suggestions on how to improve the model's prediction/test accuracy? Is there something wrong with the model?
import numpy as np
import pandas as pd
import tensorflow as tf
from math import sqrt
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

df_energy = pd.read_excel("/content/drive/MyDrive/Architecture Topology/Building_energy_consumption_record.xlsx")
extract_for_normalization = list(df_energy)[1]
df_data_float = df_energy[extract_for_normalization].astype(float)
df_data_array = df_data_float.to_numpy()
df_data_array_1 = df_data_array.reshape(-1, 1)
train_X, test_X = train_test_split(df_data_array_1, train_size=0.7, shuffle=False)
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_train_X = scaler.fit_transform(train_X)
**Converting train_X into the required shape (samples, sequence length, features)**
train_f = []  # feature inputs from the training data
train_p = []  # prediction targets
n_future = 1   # number of future steps (hours) we want to predict
n_past = 672   # number of past time steps used as input for training
for val in range(n_past, len(scaled_train_X) - n_future + 1):
    train_f.append(scaled_train_X[val - n_past:val, 0:scaled_train_X.shape[1]])
    train_p.append(scaled_train_X[val + n_future - 1:val + n_future, -1])
train_f, train_p = np.array(train_f), np.array(train_p)
**Transformer Model**
def transformer_encoder(inputs, head_size, num_heads, ff_dim, dropout=0):
    # Normalization and attention
    x = layers.LayerNormalization(epsilon=1e-6)(inputs)
    x = layers.MultiHeadAttention(
        key_dim=head_size, num_heads=num_heads, dropout=dropout
    )(x, x)
    x = layers.Dropout(dropout)(x)
    res = x + inputs
    # Feed-forward part
    x = layers.LayerNormalization(epsilon=1e-6)(res)
    x = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(x)
    x = layers.Dropout(dropout)(x)
    x = layers.Conv1D(filters=inputs.shape[-1], kernel_size=1)(x)
    return x + res
def build_model(
    input_shape,
    head_size,
    num_heads,
    ff_dim,
    num_transformer_blocks,
    mlp_units,
    dropout=0,
    mlp_dropout=0,
):
    inputs = keras.Input(shape=input_shape)
    x = inputs
    for _ in range(num_transformer_blocks):
        x = transformer_encoder(x, head_size, num_heads, ff_dim, dropout)
    x = layers.GlobalAveragePooling1D(data_format="channels_first")(x)
    for dim in mlp_units:
        x = layers.Dense(dim, activation="relu")(x)
        x = layers.Dropout(mlp_dropout)(x)
    outputs = layers.Dense(train_p.shape[1])(x)  # linear output for regression; train_p from the enclosing scope
    return keras.Model(inputs, outputs)
input_shape = (train_f.shape[1], train_f.shape[2])
model = build_model(
    input_shape,
    head_size=256,
    num_heads=4,
    ff_dim=4,
    num_transformer_blocks=4,
    mlp_units=[128],
    mlp_dropout=0.4,
    dropout=0.25,
)
model.compile(loss=tf.keras.losses.mean_absolute_error,
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
              metrics=["mse"])
model.summary()
history = model.fit(train_f, train_p, epochs=100, batch_size=32, validation_split=0.25, verbose=1)
trainYPredict = model.predict(train_f)
**Inverse transform the predictions and keep the last value (the output)**
trainYPredict1 = np.repeat(trainYPredict, scaled_train_X.shape[1], axis=-1)
trainYPredict_actual = scaler.inverse_transform(trainYPredict1)[:, -1]
train_p_actual = np.repeat(train_p, scaled_train_X.shape[1], axis=-1)
train_p_actual1 = scaler.inverse_transform(train_p_actual)[:, -1]
Prediction_mse = mean_squared_error(train_p_actual1, trainYPredict_actual)
print("Mean Squared Error of prediction is:", str(Prediction_mse))
Prediction_rmse = sqrt(Prediction_mse)
print("Root Mean Squared Error of prediction is:", str(Prediction_rmse))
prediction_r2 = r2_score(train_p_actual1, trainYPredict_actual)
print("R2 score of predictions is:", str(prediction_r2))
prediction_mae = mean_absolute_error(train_p_actual1, trainYPredict_actual)
print("Mean absolute error of prediction is:", prediction_mae)
**Testing of the model**
scaled_test_X = scaler.transform(test_X)
test_q = []
test_r = []
for val in range(n_past, len(scaled_test_X) - n_future + 1):
    test_q.append(scaled_test_X[val - n_past:val, 0:scaled_test_X.shape[1]])
    test_r.append(scaled_test_X[val + n_future - 1:val + n_future, -1])
test_q, test_r = np.array(test_q), np.array(test_r)
testPredict = model.predict(test_q)
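For completeness, the test-set metrics can be computed the same way as the training metrics above; a sketch reusing the repeat/inverse-transform steps:
testPredict1 = np.repeat(testPredict, scaled_train_X.shape[1], axis=-1)
testPredict_actual = scaler.inverse_transform(testPredict1)[:, -1]
test_r1 = np.repeat(test_r, scaled_train_X.shape[1], axis=-1)
test_r_actual = scaler.inverse_transform(test_r1)[:, -1]
test_mse = mean_squared_error(test_r_actual, testPredict_actual)
print("Test MSE:", test_mse, "Test RMSE:", sqrt(test_mse))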
I have created a transformer model for multivariate time series predictions (a many-to-one model).
Details about the Dataset
I have hourly varying data, i.e., 8 different features (hour, month, temperature, humidity, wind speed, solar radiation, etc.), and with them I am trying to predict a time sequence (the energy consumption of a building). So my input has the shape X.shape = (8783, 168, 8), i.e., 8783 time sequences, each containing 168 hourly entries/vectors, with each vector containing 8 features. My output has the shape Y.shape = (8783, 1), i.e., 8783 sequences, each containing 1 output value (the building's energy consumption for the next hour).
Model Details
I took as a model an example from the official Keras site. It was created for classification problems, so I modified it for my regression problem by changing the activation of the last output layer from sigmoid to relu.
Input shape (train_f) = (8783, 168, 8)
Output shape (train_P) = (8783,1)
When I train the model for 100 epochs, it converges in fewer epochs than my reference models (LSTM and LSTM with self-attention). After training, when the model makes predictions on the test data, the prediction performance is worse compared to the reference models.
I would be grateful if you could have a look at the code and let me know potential steps to improve the prediction/test accuracy.
Here is the code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import missingno as msna
import tensorflow as tf
from math import sqrt
from tensorflow import keras
from tensorflow.keras import layers
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

df_weather = pd.read_excel(r"Downloads\WeatherData.xlsx")
df_energy = pd.read_excel(r"Downloads\Building_energy_consumption_record.xlsx")
visa = pd.concat([df_weather, df_energy], axis=1)
df_data = visa.loc[:, ~visa.columns.isin(["Time1", "TD", "U", "DR", "FX"])]  # fixed: closing bracket was missing
msna.bar(df_data)
plt.figure(figsize=(16, 6))
sb.heatmap(df_data.corr(), annot=True, linewidths=1, fmt=".2g", cmap='coolwarm')
plt.xticks(rotation='horizontal')  # orientation of the axis tick labels
extract_for_normalization = list(df_data)[1:9]
df_data_float = df_data[extract_for_normalization].astype(float)
train_X, test_X = train_test_split(df_data_float, train_size=0.7, shuffle=False)
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_train_X = scaler.fit_transform(train_X)
**Converting train_X into the required shape (samples, sequence length, features)**
train_f = []  # feature inputs from the training data
train_p = []  # prediction targets
n_future = 1   # number of future steps (hours) we want to predict
n_past = 168   # number of past time steps used as input for training
for val in range(n_past, len(scaled_train_X) - n_future + 1):
    train_f.append(scaled_train_X[val - n_past:val, 0:scaled_train_X.shape[1]])
    train_p.append(scaled_train_X[val + n_future - 1:val + n_future, -1])
train_f, train_p = np.array(train_f), np.array(train_p)
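A quick shape check after the conversion (the expected values, per the shapes stated above, are in the comment):
print(train_f.shape, train_p.shape)  # expected: (8783, 168, 8) (8783, 1)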
**Transformer Model**
def transformer_encoder(inputs, head_size, num_heads, ff_dim, dropout=0):
    # Normalization and attention
    x = layers.LayerNormalization(epsilon=1e-6)(inputs)
    x = layers.MultiHeadAttention(
        key_dim=head_size, num_heads=num_heads, dropout=dropout
    )(x, x)
    x = layers.Dropout(dropout)(x)
    res = x + inputs
    # Feed-forward part
    x = layers.LayerNormalization(epsilon=1e-6)(res)
    x = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(x)
    x = layers.Dropout(dropout)(x)
    x = layers.Conv1D(filters=inputs.shape[-1], kernel_size=1)(x)
    return x + res
def build_model(
    input_shape,
    head_size,
    num_heads,
    ff_dim,
    num_transformer_blocks,
    mlp_units,
    dropout=0,
    mlp_dropout=0,
):
    inputs = keras.Input(shape=input_shape)
    x = inputs
    for _ in range(num_transformer_blocks):
        x = transformer_encoder(x, head_size, num_heads, ff_dim, dropout)
    x = layers.GlobalAveragePooling1D(data_format="channels_first")(x)
    for dim in mlp_units:
        x = layers.Dense(dim, activation="relu")(x)
        x = layers.Dropout(mlp_dropout)(x)
    outputs = layers.Dense(train_p.shape[1])(x)  # linear output for regression; train_p from the enclosing scope
    return keras.Model(inputs, outputs)
input_shape = (train_f.shape[1], train_f.shape[2])
model = build_model(
    input_shape,
    head_size=256,
    num_heads=4,
    ff_dim=4,
    num_transformer_blocks=4,
    mlp_units=[128],
    mlp_dropout=0.4,
    dropout=0.25,
)
model.compile(loss=tf.keras.losses.mean_absolute_error,
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
              metrics=["mse"])
model.summary()
history = model.fit(train_f, train_p, epochs=100, batch_size=32, validation_split=0.15, verbose=1)
trainYPredict = model.predict(train_f)
**Inverse transform the predictions and keep the last value (the output)**
trainYPredict1 = np.repeat(trainYPredict, scaled_train_X.shape[1], axis=-1)
trainYPredict_actual = scaler.inverse_transform(trainYPredict1)[:, -1]
train_p_actual = np.repeat(train_p, scaled_train_X.shape[1], axis=-1)
train_p_actual1 = scaler.inverse_transform(train_p_actual)[:, -1]
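The repeat trick above is needed because MinMaxScaler.inverse_transform expects the same number of columns the scaler was fitted on (8 here). A minimal standalone sketch with a hypothetical 3-column scaler:
import numpy as np
from sklearn.preprocessing import MinMaxScaler

sc = MinMaxScaler().fit(np.random.rand(10, 3))    # fitted on 3 columns
pred = np.random.rand(5, 1)                       # one-column predictions
tiled = np.repeat(pred, 3, axis=-1)               # tile out to 3 columns
pred_actual = sc.inverse_transform(tiled)[:, -1]  # keep only the target column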
Prediction_mse = mean_squared_error(train_p_actual1, trainYPredict_actual)
print("Mean Squared Error of prediction is:", str(Prediction_mse))
Prediction_rmse = sqrt(Prediction_mse)
print("Root Mean Squared Error of prediction is:", str(Prediction_rmse))
prediction_r2 = r2_score(train_p_actual1, trainYPredict_actual)
print("R2 score of predictions is:", str(prediction_r2))
prediction_mae = mean_absolute_error(train_p_actual1, trainYPredict_actual)
print("Mean absolute error of prediction is:", prediction_mae)
**Testing of the model**
scaled_test_X = scaler.transform(test_X)
test_q = []
test_r = []
for val in range(n_past, len(scaled_test_X) - n_future + 1):
    test_q.append(scaled_test_X[val - n_past:val, 0:scaled_test_X.shape[1]])
    test_r.append(scaled_test_X[val + n_future - 1:val + n_future, -1])
test_q, test_r = np.array(test_q), np.array(test_r)
testPredict = model.predict(test_q)
The training and validation loss plot is also attached: [Figure: training and validation loss]
I would like to use a neural network in Keras that takes 2 inputs of different sizes (a vector v and a matrix A) and outputs a vector u, which is v after being acted upon by A.
I have managed to input the matrix and the vector. The problem is that when I try to use the vector u as the target when fitting the model, it complains:
ValueError: Data cardinality is ambiguous:
x sizes: 70, 312
y sizes: 70
Make sure all arrays contain the same number of samples.
Zero padding (padding the smaller input up to a common size) would likely be the best decision in your situation. The padded entries are simply zeroed out to account for the absence of data; the same idea is frequently used at the borders of images in CNNs.
Another, simpler option is an RNN, which can easily accommodate your variable-length inputs.
Here is the example code. I think it can help you.
from tensorflow.keras.layers import Input, Conv2D, Dropout, GlobalAveragePooling2D, Dense
from tensorflow.keras.models import Model

input_layer = Input(shape=(None, None, channels))  # None allows variable spatial size; channels defined by you
x = Conv2D(16, (4, 4), activation='relu')(input_layer)
x = Conv2D(32, (4, 4), activation='relu')(x)
x = Dropout(0.2)(x)
x = Conv2D(64, (4, 4), activation='relu')(x)
x = Dropout(0.5)(x)
x = Conv2D(128, (1, 1))(x)
x = GlobalAveragePooling2D()(x)  # collapses the variable spatial dims to a fixed-size vector
output_layer = Dense(5, activation="softmax")(x)
model = Model(inputs=input_layer, outputs=output_layer)
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])
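For completeness, the RNN alternative mentioned above could look like this (an illustrative sketch, not from the original answer; the feature count and layer sizes are arbitrary):
from tensorflow import keras
from tensorflow.keras import layers

rnn_in = keras.Input(shape=(None, 8))               # None = variable sequence length
h = layers.LSTM(32)(rnn_in)                         # fixed-size output regardless of length
rnn_out = layers.Dense(5, activation="softmax")(h)
rnn_model = keras.Model(rnn_in, rnn_out)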
I am new to deep learning with PyTorch. I am more experienced with TensorFlow, so I should say I am not new to deep learning itself.
Currently, I am working on a simple ANN classification problem. There are only 2 classes, so quite naturally I am using a Softmax + BCELoss combination.
The dataset is like this:
Shape of X_train: (891, 7)
Shape of Y_train: (891,)
Shape of x_test: (418, 7)
I transformed the X_train and others to torch tensors as train_data and so on. The next step is:
train_ds = TensorDataset(train_data, train_label)
# Define data loader
batch_size = 32
train_dl = DataLoader(train_ds, batch_size, shuffle=True)
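A quick sanity check on the loader (the expected shapes for a full batch are in the comment):
xb, yb = next(iter(train_dl))
print(xb.shape, yb.shape)  # expected: torch.Size([32, 7]) torch.Size([32])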
I made the model class like:
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        # an affine operation: y = Wx + b
        self.fc1 = nn.Linear(7, 32)
        self.bc1 = nn.BatchNorm1d(32)
        self.fc2 = nn.Linear(32, 64)
        self.bc2 = nn.BatchNorm1d(64)
        self.fc3 = nn.Linear(64, 128)
        self.bc3 = nn.BatchNorm1d(128)
        self.fc4 = nn.Linear(128, 32)
        self.bc4 = nn.BatchNorm1d(32)
        self.fc5 = nn.Linear(32, 10)
        self.bc5 = nn.BatchNorm1d(10)
        self.fc6 = nn.Linear(10, 1)
        self.bc6 = nn.BatchNorm1d(1)
        self.drop = nn.Dropout2d(p=0.5)

    def forward(self, x):
        # note: this re-initializes fc1's weights on every forward pass
        # (and xavier_uniform is deprecated in favor of xavier_uniform_)
        torch.nn.init.xavier_uniform(self.fc1.weight)
        x = self.fc1(x)
        x = self.bc1(x)
        x = F.relu(x)
        x = self.drop(x)
        x = self.fc2(x)
        x = self.bc2(x)
        x = F.relu(x)
        #x = self.drop(x)
        x = self.fc3(x)
        x = self.bc3(x)
        x = F.relu(x)
        x = self.drop(x)
        x = self.fc4(x)
        x = self.bc4(x)
        x = F.relu(x)
        #x = self.drop(x)
        x = self.fc5(x)
        x = self.bc5(x)
        x = F.relu(x)
        x = self.drop(x)
        x = self.fc6(x)
        x = self.bc6(x)
        x = torch.sigmoid(x)
        return x
model = Net()
The loss function and the optimizer are defined:
loss = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.00001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
Finally, the task is to run the training loop for a number of epochs:
num_epochs = 1000
# Repeat for the given number of epochs
for epoch in range(num_epochs):
    # Train with batches of data
    for xb, yb in train_dl:
        # 1. Forward pass
        pred = model(xb)
        yb = torch.unsqueeze(yb, 1)
        #print(pred, yb)
        print('grad', model.fc1.weight.grad)
        # 2. Compute the loss
        l = loss(pred, yb)
        #print('loss', l)
        # 3. Compute gradients
        l.backward()
        # 4. Update parameters using gradients
        optimizer.step()
        # 5. Reset the gradients to zero
        optimizer.zero_grad()
    # Print the progress
    if (epoch + 1) % 10 == 0:
        print('Epoch [{}/{}], Loss: {:.4f}'.format(epoch + 1, num_epochs, l.item()))
I can see in the output that after each iteration over the batches, the gradients of the weights are non-zero, and after this zero_grad is applied.
However, the model is pretty bad: I get an F1 score of only around 50%! And the model is bad even when I ask it to predict on train_dl itself!
I am wondering what the reason is. Are the gradients of the weights non-zero but not being applied properly? Is the optimizer not optimizing the weights? Or something else?
Can someone please have a look?
I have already tried different loss functions and optimizers. I have tried smaller datasets, bigger batches, and different hyperparameters.
Thanks! :)
First of all, you don't use a softmax activation with BCE loss unless you have 2 output nodes, which is not the case here. In PyTorch, BCELoss doesn't apply any activation function before calculating the loss, unlike CrossEntropyLoss, which has a built-in softmax. So, if you want to use BCE, you have to use sigmoid (or any function f: R -> [0, 1]) at the output layer.
Moreover, you should call optimizer.zero_grad() for each batch if you want to do mini-batch SGD (which is the default). If you don't, the gradients accumulate across batches and you end up effectively doing full-batch gradient descent, which is quite slow and gets stuck in local minima easily.
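A minimal sketch of the pattern this answer describes (layer sizes are illustrative; BCEWithLogitsLoss applies the sigmoid internally, so the model outputs raw logits and training is numerically more stable):
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(7, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.BCEWithLogitsLoss()  # sigmoid is built in
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for xb, yb in train_dl:    # train_dl as defined in the question
    optimizer.zero_grad()  # reset gradients once per batch
    pred = model(xb)
    l = loss_fn(pred, yb.unsqueeze(1).float())
    l.backward()
    optimizer.step()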
I am trying to create a Keras model with three inputs. Only one of them goes through the first few layers, and the other two are concatenated at a dense layer. How would I achieve this without disconnecting the graph? The code is shown below.
import keras
from keras.layers import Input, Dense, Dropout, Activation
from keras.models import Model

input_img = Input(shape=(784,))
input_1 = Input(shape=(1,))
input_2 = Input(shape=(1,))
x = Dense(48, kernel_initializer='normal', activation="relu")(input_img)
x = Dropout(0.2)(x)
x = Dense(24, activation="tanh")(x)
x = Dropout(0.3)(x)
x = Dense(1)(x)
x = keras.layers.concatenate([x, input_1, input_2])
x = Activation("sigmoid")(x)
x = Model(input_img, x)  # this is where the graph gets disconnected (see the answer below)
x.compile(loss="binary_crossentropy", optimizer='adam')
To give a more general overview of what I'm attempting: I am essentially trying to create a convolutional neural network with additional features added at the dense layer for classification.
Since your model has three inputs, i.e. input_img, input_1 and input_2, you need to pass a list of these three inputs when defining your model, as follows:
x = Model([input_img, input_1, input_2], x)
Hope this helps.
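When fitting, the three inputs must likewise be passed as a list, in the same order (a sketch, where img_data, feat_1, feat_2 and labels are hypothetical training arrays):
x = Model([input_img, input_1, input_2], x)
x.compile(loss="binary_crossentropy", optimizer='adam')
x.fit([img_data, feat_1, feat_2], labels, epochs=10, batch_size=32)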