Related
I have a machine learning model that i feel is still getting false positives. It can largely detect attacks that i produce separately from the training / test set, maybe at a 80% rate? But for me that is not enough. I also tried to drop columns with high correlation. My biggest problem is my understanding of whether to use one-hot-encoding or not. I can switch between both one hot and sparse and i don't notice a difference at all in my dataset.
The dataset is like this:
column 1 - column 2 - column 3 - etc, all containing stuff like packet properties, and then at the end, the class. So, class 1, class 2 or class 3. Any one row can only belong to one class, it can't be two attack types, it has to distinguish between all attack types and then assign this particular row the best attack type match! This is different from one-hot where it's meant that, if i understand right, a row can belong to multiple attack types. I however notice that nobody ever uses sparse_categorical_crossentropy when it comes to even the iris dataset which is very similar to mine, as it has more than 3 classes.
I can paste my code here and if somebody knows where i am going wrong! :Z
label_encoder = preprocessing.LabelEncoder()
y = ConcatenateAttackList['Label']
encoded_y = label_encoder.fit_transform(y)
y = np_utils.to_categorical(encoded_y)
x = ConcatenateAttackList.drop(['Label', ], axis = 1).astype(float)
sc = MinMaxScaler()
print('x_train, y_train, fitting and transforming.')
x = sc.fit_transform(x)
x,y = oversample.fit_resample(x,y)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42,stratify=y,
shuffle=True)
len(x_train)
len(y_train)
X = pd.DataFrame(x_train)
print('x_train, y_train, fitted and transformed.')
with tf.device("CPU"):
train = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(4*256).batch(256)
validate = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(256)
model = Sequential()
print('Model initialized.')
model.add(Dense(64,input_dim=len(X.columns),activation='relu')) # input layer
model.add(tf.keras.layers.BatchNormalization())
model.add(Dense(32, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(tf.keras.layers.BatchNormalization())
model.add(Dense(6, activation='softmax'))
print('Nodes added to layers.')
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics='categorical_accuracy')
print('Compiled.')
callback=tf.keras.callbacks.EarlyStopping(monitor='val_loss', mode='auto',
patience=50, min_delta=0, restore_best_weights=True, verbose=2)
print('EarlyStopping CallBack executed.')
print('Beginning fitting...')
model_hist = model.fit(x_train, y_train, epochs=231, batch_size=256, verbose=1,
callbacks=[callback],validation_data=validate)
print('Fitting completed.')
model.save("sets/mymodel5.h5")
dump(sc, 'sets/scaler_transformTCPDCV5.joblib')
print('Model saved.')
# loss history
plt.plot(model_hist.history['loss'], label="Training Loss")
plt.plot(model_hist.history['val_loss'], label="Validation Loss")
plt.legend()
#------------PREDICTION
tester = pd.read_csv('AttackTestFile.csv', sep=r'\s*,\s*', engine='python')
ColumnsForWindowsCIC = pd.read_csv('ColumnsForWindowsCIC.csv')
tester.columns = ColumnsForWindowsCIC.columns
tester = deleteRedudancy(tester)
x = tester.drop(['Label', ], axis = 1)
fit_new_input = sc.transform(x)
predict_y=model.predict(fit_new_input)
predict_y
classes_y=np.argmax(predict_y,axis=1)
classes_y
predict = label_encoder.inverse_transform(classes_y)
predict
I have created a transformer model for multivariate time series predictions for a linear regression problem.
Details about the Dataset
I have the hourly varying data i.e., single feature (lagged energy use data). The model improvement could be done by increasing the number of lagged energy use data, which provide more information to the model) to predict the time sequence (energy consumption of a building). So my input has the shape X.shape = (8783, 168, 1) i.e., 8783 time sequences, each sequence contains lagged energy use data of one week i.e., 24*7 =168 hourly entries/vectors and each vector contains lagged energy use data as input. My output has the shape Y.shape = (8783,1) i.e., 8783 sequences each containing 1 output value (i.e., building energy consumption value after every hour).
Model Details
I took as a model an example from the official keras site. It is created for classification problems, I modified it for my regression problem by changing the activation of last output layer from sigmoid to relu. Input shape (train_f) = (8783, 168, 1) Output shape (train_P) = (8783,1) When I trained the model for 100 no. of epochs it converges very well for less number of epochs as compared to my reference models (i.e., LSTMs and LSTMS with self attention). After training, when the model is asked to make prediction by feeding in the test data, the prediction performance is also good as compare to the reference models.
For the same model predicting well, in order to improve its performance now I am feeding in the lagged data of energy use of 1 month i.e., 168*4 = 672 hourly entries/vectors and each vector contains lagged energy use data as input. So my input going into the model now has the shape X.shape = (8783, 672, 1). Both the training and prediction accuracy drops in comparison to weekly input data as seen below.
**lagged energy use data for 1 week i.e., X.shape = (8783, 168, 1)**
**MSE RMSE MAE R-Score**
Training data 1.0489 1.0242 0.6395 0.9707
Testing data 0.6221 0.7887 0.5648 0.9171
**lagged energy use data for 1 week i.e., X.shape = (8783, 672, 1)**
**MSE RMSE MAE R-Score**
Training data 1.6424 1.2816 0.7326 0.9567
Testing data 1.4991 1.2244 0.9233 0.6903
I believe that providing more information to the model should result in better predictions. Any suggestions, how to improve the model prediction/test accuracy? Is there something wrong with the model?
df_energy = pd.read_excel("/content/drive/MyDrive/Architecture Topology/Building_energy_consumption_record.xlsx")
extract_for_normalization = list(df_energy)[1]
df_data_float = df_energy[extract_for_normalization].astype(float)
df_data_array = df_data_float.to_numpy()
df_data_array_1 = df_data_array.reshape(-1,1)
from sklearn.model_selection import train_test_split
train_X, test_X = train_test_split(df_data_array_1, train_size = 0.7, shuffle = False)
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_train_X=scaler.fit_transform(train_X)
**Converting train_X into required shape (inputs,sequences, features)**
train_f = [] #features input from training data
train_p = [] # prediction values
n_future = 1 #number of days we want to predict into the future
n_past = 672 # no. of time series input features to be considered for training
for val in range(n_past, len(scaled_train_X) - n_future+1):
train_f.append(scaled_train_X[val - n_past:val, 0:scaled_train_X.shape[1]])
train_p.append(scaled_train_X[val + n_future - 1:val + n_future, -1])
train_f, train_p = np.array(train_f), np.array(train_p)
**Transformer Model**
def transformer_encoder(inputs, head_size, num_heads, ff_dim, dropout=0):
# Normalization and Attention
x = layers.LayerNormalization(epsilon=1e-6)(inputs)
x = layers.MultiHeadAttention(
key_dim=head_size, num_heads=num_heads, dropout=dropout
)(x, x)
x = layers.Dropout(dropout)(x)
res = x + inputs
# Feed Forward Part
x = layers.LayerNormalization(epsilon=1e-6)(res)
x = layers.Conv1D(filters=ff_dim, kernel_size=1, activation="relu")(x)
x = layers.Dropout(dropout)(x)
x = layers.Conv1D(filters=inputs.shape[-1], kernel_size=1)(x)
return x + res
def build_model(
input_shape,
head_size,
num_heads,
ff_dim,
num_transformer_blocks,
mlp_units,
dropout=0,
mlp_dropout=0,
):
inputs = keras.Input(shape=input_shape)
x = inputs
for _ in range(num_transformer_blocks):
x = transformer_encoder(x, head_size, num_heads, ff_dim, dropout)
x = layers.GlobalAveragePooling1D(data_format="channels_first")(x)
for dim in mlp_units:
x = layers.Dense(dim, activation="relu")(x)
x = layers.Dropout(mlp_dropout)(x)
outputs = layers.Dense(train_p.shape[1])(x)
return keras.Model(inputs, outputs)
input_shape = (train_f.shape[1], train_f.shape[2])
model = build_model(
input_shape,
head_size=256,
num_heads=4,
ff_dim=4,
num_transformer_blocks=4,
mlp_units=[128],
mlp_dropout=0.4,
dropout=0.25,
)
model.compile(loss=tf.keras.losses.mean_absolute_error,
optimizer=tf.keras.optimizers.Adam(learning_rate=0.0001),
metrics=["mse"])
model.summary()
history = model.fit(train_f, train_p, epochs=100, batch_size = 32, validation_split = 0.25, verbose = 1)
trainYPredict = model.predict(train_f)
**Inverse transform the prediction and keep the last value(output)**
trainYPredict1 = np.repeat(trainYPredict, scaled_train_X.shape[1], axis = -1)
trainYPredict_actual = scaler.inverse_transform(trainYPredict1)[:, -1]
train_p_actual = np.repeat(train_p, scaled_train_X.shape[1], axis = -1)
train_p_actual1 = scaler.inverse_transform(train_p_actual)[:, -1]
Prediction_mse=mean_squared_error(train_p_actual1 ,trainYPredict_actual)
print("Mean Squared Error of prediction is:", str(Prediction_mse))
Prediction_rmse =sqrt(Prediction_mse)
print("Root Mean Squared Error of prediction is:", str(Prediction_rmse))
prediction_r2=r2_score(train_p_actual1 ,trainYPredict_actual)
print("R2 score of predictions is:", str(prediction_r2))
prediction_mae=mean_absolute_error(train_p_actual1 ,trainYPredict_actual)
print("Mean absolute error of prediction is:", prediction_mae)
**Testing of model**
scaled_test_X = scaler.transform(test_X)
test_q = []
test_r = []
for val in range(n_past, len(scaled_test_X) - n_future+1):
test_q.append(scaled_test_X[val - n_past:val, 0:scaled_test_X.shape[1]])
test_r.append(scaled_test_X[val + n_future - 1:val + n_future, -1])
test_q, test_r = np.array(test_q), np.array(test_r)
testPredict = model.predict(test_q)
When training XGBoost iteratively for data too large to fit in memory, one may want to use "batches". The problem is, however, that each batch may not contain all 0,...,C labels. This leads to the error ValueError: The label must consist of integer labels of form 0, 1, 2, ..., [num_class-1] -
Is there a way to train XGBoost where we just have some subset of the labels, which may not contain zero?
The code has structure similar to this:
train = module.trainloader
test = module.valloader
# Train on one minibatch to get started
sample = next(iter(loader))
X = xgb.DMatrix(sample[0].numpy(), label=sample[1].numpy())
params = {
'learning_rate': 0.007,
'updater':'refresh',
'process_type': 'update',
}
# Get initial model training
model = xgb.train(params, dtrain=X)
for i, (trainsample, valsample) in enumerate(zip(train, test)):
X_train, y_train = trainsample
X_test, y_test = valsample
X_train = xgb.DMatrix(X_train, labels=y_train)
X_test = xgb.DMatrix(X_test)
model = xgb.train(params, dtrain=X_train, xgb_model=model)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
I'm working on image classification task for diabetic retinopathy with fundus image data. There are 5 classes. The data distribution is 1805 images (class 1), 370 images (class 2), 999 images (class 3), 193 images (class 4), 295 images (class 5).
Here are the steps that I have tried to run:
Preprocessing (resized 224 * 224)
The divide of train and test data is 85% : 15%
x_train, xtest, y_train, ytest = train_test_split(
x_train, y_train,
test_size = 0.15,
random_state=SEED,
stratify = y_train
)
Data agumentation
ImageDataGenerator(
zoom_range=0.15,
fill_mode='constant',
cval=0.,
horizontal_flip=True,
vertical_flip=True,
)
Training with the ResNet-50 model and cross-validation
def getResNet():
modelres = ResNet50(weights=None, include_top=False, input_shape= (IMAGE_HEIGHT,IMAGE_HEIGHT, 3))
x = modelres.output
x = GlobalAveragePooling2D()(x)
x = Dense(5, activation= 'softmax')(x)
model = Model(inputs = modelres.input, outputs = x)
return model
num_folds = 5
skf = StratifiedKFold(n_splits = 5, shuffle=True, random_state=2021)
cvscores = []
fold = 1
for train, val in skf.split(x_train, y_train.argmax(1)):
print('Fold: ', fold)
Xtrain = x_train[train]
Xval = x_train[val]
Ytrain = y_train[train]
Yval = y_train[val]
data_generator = create_datagen().flow(Xtrain, Ytrain, batch_size=32, seed=2021)
model = getResNet()
model.compile(loss='categorical_crossentropy',
optimizer=Adam(lr=0.0001),
metrics=['accuracy'])
with tf.compat.v1.device('/device:GPU:0'):
model_train = model.fit(data_generator,
validation_data=(Xval, Yval),
epochs=30, batch_size = 32, verbose=1)
model_name = 'cnn_keras_aug_Fold_'+str(fold)+'.h5'
model.save(model_name)
scores = model.evaluate(xtest, ytest, verbose=0)
print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
cvscores.append(scores[1] * 100)
fold = fold +1
The maximum results I got from this method were training accuracy of 81.2%, validation accuracy of 72.2%, and test accuracy of 70.73%.
Can anyone give me an idea to improve the model so that I can get the test accuracy above 90% as possible?
Later, I will use this model as a pre-trained model to train diabetic retinopathy data as well but from other sources.
BTW, I've tried replacing my preprocessing with this method:
def preprocessing(path):
image = cv2.imread(path)
image = crop_image_from_gray(image)
green = image[:,:,1]
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8,8))
cl = clahe.apply(green)
image[:,:,0] = image[:,:,0]
image[:,:,2] = image[:,:,2]
image[:,:,1] = cl
image = cv2.resize(image, (224,224))
return image
I've also tried to replace my model with VGG16, EfficientNetB0. However, none of that had much effect on my results. I'm still stucked with about 70% accuracy.
Please help me come up with ideas to improve my modeling results. I hope.
Your training accuracy is 81.2%. It is generally impossible to have testing accuracy higher that training accuracy, i.e. with current setup you will not achieve 90%.
However, your validation (and also testing) accuracy is about 70-72%. I can suggest that on your small dataset your model is overfitting. So if you add model regularization (e.g. dropout), it is possible that the gap between your training and your validation (and test) will decrease. This way you can improve your validation score.
To further increase the score, you need to check your data manually and try to understand which classes contribute the most to the errors and figure out how those errors can be reduced (e.g. updating your preprocessing pipeline).
I am working on a triplet loss based model for this Kaggle competition.
Short Description- In this competition, we have been challenged to build an algorithm to identify individual whales in images by analyzing a database of containing more than 25,000 images, gathered from research institutions and public contributors.
https://www.kaggle.com/c/humpback-whale-identification?rvi=1
I have decided to use a Siamese network architecture and train it to give me encodings which I can then use to calculate the distance between two pictures of whales. If this distance is below a particular threshold the two pictures belong to the same whale and if this distance is greater then, they aren't the same whale.
This is the Triplet loss function(learnt it from Andrew's deeplearning specialization) I used but i also normalized the encoding's to make the loss function more interpretable(easier to determine margin and split point) across different models(if that makes sense).(First, tried it without the normalization and when it didnt work i tried normalizing.) I also have tried changing alpha(margin) and varied it from 0.2 to 0.6.
from tensorflow.nn import l2_normalize as norm_l2
def triplet_loss(y_true, y_pred, alpha = 0.3):
"""
Arguments:
y_true -- true labels, required when you define a loss in Keras, you don't need it in this function.
y_pred -- python list containing three objects:
anchor -- the encodings for the anchor images, of shape (None, 128)
positive -- the encodings for the positive images, of shape (None, 128)
negative -- the encodings for the negative images, of shape (None, 128)
Returns:
loss -- real number, value of the loss
"""
anchor, positive, negative = y_pred[0], y_pred[1], y_pred[2]
anchor, positive, negative = norm_l2(anchor), norm_l2(positive), norm_l2(negative)
# Step 1: Compute the (encoding) distance between the anchor and the positive
pos_dist = tf.reduce_sum(tf.square(tf.subtract(anchor,positive)), axis = -1)
# Step 2: Compute the (encoding) distance between the anchor and the negative
neg_dist = tf.reduce_sum(tf.square(tf.subtract(anchor,negative)), axis = -1)
# Step 3: subtract the two previous distances and add alpha.
basic_loss = tf.add(tf.subtract(pos_dist, neg_dist), alpha)
# Step 4: Take the maximum of basic_loss and 0.0. Sum over the training examples.
loss = tf.reduce_sum(tf.maximum(basic_loss, 0.0))
return loss
This is an example of one of the model architectures i tried out. I have tried using pretrained Facenet, ResNet, DenseNet and Xception till now. I have tried Freezing different numbers of layers in each.
R = tf.keras.applications.ResNet50(include_top=False, weights = 'imagenet', input_shape=(224,224,3))
lr = 0.0001
optimizer = Adam(learning_rate=lr)
R.compile(optimizer=optimizer, loss = triplet_loss)
for layer in R.layers[0:30]:
layer.trainable = False
em_Rmodel = Sequential([
R,
GlobalAveragePooling2D(),
#tf.keras.layers.GlobalMaxPooling2D(),
Dense(512, activation='relu'),
bn(),
Dense(256, activation = 'sigmoid'),
Dense(128, activation = 'sigmoid')
])
def make_tripletModel(model):
#I was manually changing the input shape to fit the default shape of pretrained networks
A = Input(shape = (224, 224, 3), name='anchor')
P = Input(shape = (224, 224, 3), name = 'anchorPositive')
N = Input(shape = (224, 224, 3), name = 'anchorNegative')
enc_A = model(A)
enc_P = model(P)
enc_N = model(N)
tripletModel = Model(inputs=[A, P, N], outputs=[enc_A, enc_P, enc_N])
return tripletModel
tripletModel = make_tripletModel(em_Rmodel)
I have been training using semi-hard triplets and have also been augmenting data properly to generate more training images.
This is the batch generator that i used for training. crop_batch is a function that crops images to show only the whale's tail, using which one can identify whales. It uses a DenseNet trained on more than 1000 images with whale tails and the bounding box surrounding it. Does the work sufficiently well.
def batch_generator_RN(batch_size = batch_size, ishape = (256, 256, 3), model_input_shape = (224, 224, 3)):
triplet_generator = get_triplets()
y_val = np.zeros((batch_size, 2, 1))
anchors = np.zeros((batch_size, ishape[0], ishape[1], ishape[2]))
positives = np.zeros((batch_size, ishape[0], ishape[1], ishape[2]))
negatives = np.zeros((batch_size, ishape[0], ishape[1], ishape[2]))
while True:
for i in range(batch_size):
anchors[i], positives[i], negatives[i] = next(triplet_generator)
anc = crop_batch(anchors, batch_size= batch_size, img_shape=model_input_shape)
pos = crop_batch(positives, batch_size= batch_size, img_shape=model_input_shape)
neg = crop_batch(negatives, batch_size= batch_size, img_shape=model_input_shape)
x_data = {'anchor': anc,
'anchorPositive': pos,
'anchorNegative': neg
}
yield (x_data, [y_val, y_val, y_val])
And finally, this, in general, is how i have been trying to train these models. I have tried reducing and increasing learning rate, batch_size = 16.
lr = 0.0001
optimizer = Adam(learning_rate=lr)
tripletModel.compile(optimizer = optimizer, loss = triplet_loss)
es = EarlyStopping(monitor='loss', patience=20, min_delta=0.05, restore_best_weights=True)
#mc = ModelCheckpoint('Rmodel.h5', monitor='loss', save_best_only=True, save_weights_only=True)
rlr = ReduceLROnPlateau(monitor='loss',min_delta=0.05,factor = 0.1,patience = 5, verbose = 1, min_lr = 0)
gen = batch_generator(batch_size)
tripletModel.fit(gen, steps_per_epoch=64, epochs = 40, callbacks=[es, rlr])
So after training all these models, in some models the triplet loss does go down for a while but then plateaus and basically learns nothing meaningful(which basically means that just by looking at the distance between two embeddings i cant figure out if they are the same whale or not.). In other models, immediately after the first or the second epoch the weights converge and don't change at all and doesn't learning anything.
I have tried a very wide range of learning rates and i am pretty sure that it isnt the problem.
Please tell me if i should add all the code files for you to understand the problem better. The reason i havent done it yet because i havent cleaned it but will gladly do so if required. Thanks.
When you say that it doesn't learn anything, is it that the loss reaches a plateau and thus it stops decreasing or it does decrease significantly but when you predict the embeddings of both same and different whales are are similar in value?
The triples_loss() fn and batch_generator_RN() fn are correct, the problem is not related to the data generation.
However, I suspect that your learning rate is too high while you freeze a lot of layers, i.e. numerous trainable parameters are frozen, which may lead to your network being unable to converge.
My suggestion is to unfreeze all the layers and decrease the learning rate to 0.00001 and start training again, regardless of the architecture that you use (Xception/ResNet etc.)