How can I improve my intrusion detection model and decrease false positives? - tensorflow

I have a machine learning model that I feel is still producing too many false positives. It can largely detect attacks that I generate separately from the training/test set, at maybe an 80% rate, but for me that is not enough. I have also tried dropping columns with high correlation. My biggest problem is deciding whether to use one-hot encoding or not: I can switch between one-hot and sparse labels and I don't notice any difference at all on my dataset.
The dataset looks like this:
column 1 - column 2 - column 3 - etc., all containing things like packet properties, and then the class at the end: class 1, class 2 or class 3. Any one row can belong to only one class; it can't be two attack types. The model has to distinguish between all the attack types and assign each row the single best-matching attack type! This is different from one-hot encoding, which, if I understand correctly, means a row can belong to multiple attack types. I notice, however, that nobody ever uses sparse_categorical_crossentropy even on the iris dataset, which is very similar to mine in that it has more than two classes.
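To make the comparison concrete, this is a minimal sketch (with made-up data, not my real dataset) of the two setups I keep switching between:

import numpy as np
import tensorflow as tf

# toy stand-in data: 4 numeric features, 6 classes (purely illustrative)
x = np.random.rand(100, 4).astype("float32")
y_int = np.random.randint(0, 6, size=100)                  # integer class ids 0..5
y_onehot = tf.keras.utils.to_categorical(y_int, 6)         # one-hot version of the same labels

def make_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(16, activation='relu', input_dim=4),
        tf.keras.layers.Dense(6, activation='softmax'),
    ])

# setup A: one-hot labels + categorical_crossentropy
model_a = make_model()
model_a.compile(loss='categorical_crossentropy', optimizer='adam')
model_a.fit(x, y_onehot, epochs=1, verbose=0)

# setup B: integer labels + sparse_categorical_crossentropy
model_b = make_model()
model_b.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
model_b.fit(x, y_int, epochs=1, verbose=0)

Both train the same 6-way softmax; only the label format fed to the loss differs.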
I'll paste my code here in case somebody can spot where I am going wrong!
# (imports omitted in this snippet: sklearn preprocessing/model_selection, an imblearn
#  oversampler, tensorflow.keras, pandas, numpy, matplotlib and joblib's dump)
label_encoder = preprocessing.LabelEncoder()
y = ConcatenateAttackList['Label']
encoded_y = label_encoder.fit_transform(y)
y = np_utils.to_categorical(encoded_y)
x = ConcatenateAttackList.drop(['Label'], axis=1).astype(float)
sc = MinMaxScaler()
print('x_train, y_train, fitting and transforming.')
x = sc.fit_transform(x)
x, y = oversample.fit_resample(x, y)  # 'oversample' is an imblearn sampler created earlier
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.33, random_state=42,
                                                    stratify=y, shuffle=True)
len(x_train)
len(y_train)
X = pd.DataFrame(x_train)
print('x_train, y_train, fitted and transformed.')

with tf.device("CPU"):
    train = tf.data.Dataset.from_tensor_slices((x_train, y_train)).shuffle(4 * 256).batch(256)
    validate = tf.data.Dataset.from_tensor_slices((x_test, y_test)).batch(256)

model = Sequential()
print('Model initialized.')
model.add(Dense(64, input_dim=len(X.columns), activation='relu'))  # input layer
model.add(tf.keras.layers.BatchNormalization())
model.add(Dense(32, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(tf.keras.layers.BatchNormalization())
model.add(Dense(6, activation='softmax'))  # one output unit per class
print('Nodes added to layers.')
model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['categorical_accuracy'])
print('Compiled.')
callback = tf.keras.callbacks.EarlyStopping(monitor='val_loss', mode='auto', patience=50,
                                            min_delta=0, restore_best_weights=True, verbose=2)
print('EarlyStopping callback created.')
print('Beginning fitting...')
model_hist = model.fit(x_train, y_train, epochs=231, batch_size=256, verbose=1,
                       callbacks=[callback], validation_data=validate)
print('Fitting completed.')
model.save("sets/mymodel5.h5")
dump(sc, 'sets/scaler_transformTCPDCV5.joblib')
print('Model saved.')

# loss history
plt.plot(model_hist.history['loss'], label="Training Loss")
plt.plot(model_hist.history['val_loss'], label="Validation Loss")
plt.legend()

# ------------ PREDICTION
tester = pd.read_csv('AttackTestFile.csv', sep=r'\s*,\s*', engine='python')
ColumnsForWindowsCIC = pd.read_csv('ColumnsForWindowsCIC.csv')
tester.columns = ColumnsForWindowsCIC.columns
tester = deleteRedudancy(tester)
x = tester.drop(['Label'], axis=1)
fit_new_input = sc.transform(x)
predict_y = model.predict(fit_new_input)
predict_y
classes_y = np.argmax(predict_y, axis=1)
classes_y
predict = label_encoder.inverse_transform(classes_y)
predict
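For reference, this is roughly how I check where the false positives come from on the separate attack file (a sketch using sklearn's confusion_matrix and classification_report on the predictions above; it assumes the test file's Label column holds the true attack names):

from sklearn.metrics import confusion_matrix, classification_report

# compare decoded predictions against the true labels from the attack file
true_labels = tester['Label']
print(confusion_matrix(true_labels, predict))
print(classification_report(true_labels, predict))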

Related

CNN with imbalanced data stuck with 70% testing accuracy

I'm working on an image classification task for diabetic retinopathy with fundus image data. There are 5 classes. The class distribution is 1805 images (class 1), 370 images (class 2), 999 images (class 3), 193 images (class 4), and 295 images (class 5).
Here are the steps I have tried:
Preprocessing (resized to 224 x 224)
Train/test split of 85% : 15%
x_train, xtest, y_train, ytest = train_test_split(
    x_train, y_train,
    test_size=0.15,
    random_state=SEED,
    stratify=y_train
)
Data augmentation
ImageDataGenerator(
    zoom_range=0.15,
    fill_mode='constant',
    cval=0.,
    horizontal_flip=True,
    vertical_flip=True,
)
Training with the ResNet-50 model and cross-validation
def getResNet():
    modelres = ResNet50(weights=None, include_top=False,
                        input_shape=(IMAGE_HEIGHT, IMAGE_HEIGHT, 3))
    x = modelres.output
    x = GlobalAveragePooling2D()(x)
    x = Dense(5, activation='softmax')(x)
    model = Model(inputs=modelres.input, outputs=x)
    return model

num_folds = 5
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2021)
cvscores = []
fold = 1
for train, val in skf.split(x_train, y_train.argmax(1)):
    print('Fold: ', fold)
    Xtrain = x_train[train]
    Xval = x_train[val]
    Ytrain = y_train[train]
    Yval = y_train[val]
    data_generator = create_datagen().flow(Xtrain, Ytrain, batch_size=32, seed=2021)
    model = getResNet()
    model.compile(loss='categorical_crossentropy',
                  optimizer=Adam(lr=0.0001),
                  metrics=['accuracy'])
    with tf.compat.v1.device('/device:GPU:0'):
        model_train = model.fit(data_generator,
                                validation_data=(Xval, Yval),
                                epochs=30, batch_size=32, verbose=1)
    model_name = 'cnn_keras_aug_Fold_' + str(fold) + '.h5'
    model.save(model_name)
    scores = model.evaluate(xtest, ytest, verbose=0)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
    fold = fold + 1
The best results I got with this approach were a training accuracy of 81.2%, a validation accuracy of 72.2%, and a test accuracy of 70.73%.
Can anyone give me an idea of how to improve the model so that I can get the test accuracy above 90%?
Later, I will also use this model as a pre-trained model to train on diabetic retinopathy data from other sources.
BTW, I've tried replacing my preprocessing with this method:
def preprocessing(path):
    image = cv2.imread(path)
    image = crop_image_from_gray(image)
    green = image[:, :, 1]
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    cl = clahe.apply(green)
    image[:, :, 0] = image[:, :, 0]
    image[:, :, 2] = image[:, :, 2]
    image[:, :, 1] = cl
    image = cv2.resize(image, (224, 224))
    return image
I've also tried replacing my model with VGG16 and EfficientNetB0. However, none of that had much effect on my results; I'm still stuck at about 70% accuracy.
Please help me come up with ideas to improve my modeling results.
Your training accuracy is 81.2%. It is generally not possible to get a testing accuracy higher than the training accuracy, i.e. with the current setup you will not reach 90%.
However, your validation (and also testing) accuracy is about 70-72%, which suggests that your model is overfitting on this small dataset. If you add regularization (e.g. dropout), the gap between your training and your validation (and test) accuracy is likely to decrease, and this way you can improve your validation score.
To further increase the score, you need to check your data manually, understand which classes contribute the most to the errors, and figure out how those errors can be reduced (e.g. by updating your preprocessing pipeline).
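For example, a minimal sketch of what adding dropout to your getResNet head could look like (the 0.5 rate is only a starting point to tune, and Dropout would need to be imported from tensorflow.keras.layers):

def getResNet():
    modelres = ResNet50(weights=None, include_top=False,
                        input_shape=(IMAGE_HEIGHT, IMAGE_HEIGHT, 3))
    x = modelres.output
    x = GlobalAveragePooling2D()(x)
    x = Dropout(0.5)(x)   # regularization before the classifier; adjust the rate as needed
    x = Dense(5, activation='softmax')(x)
    return Model(inputs=modelres.input, outputs=x)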

NaN loss in CNN-LSTM on Keras for Time Series forecasting

I have to predict the time dependence of soil moisture from rainfall and several other time series. I have forecasts for all of them, and the only thing left to do is the soil-moisture prediction.
Following a guide, I built a CNN model, because ARIMA models can't take outside stochastic influences into account.
The model works, but not as it should.
If you look at the attached plot, you'll see that the forecasted series (yellow, smsfu_sum) doesn't depend on rain (the aprec series) the way it does in the training set. I want a sharp peak in the forecast, but changing the kernel and pooling sizes doesn't help.
So I tried to train a CNN-LSTM model based on this guide.
Here is the code for the model architecture:
def build_model(train, n_input):
    # prepare data
    train_x, train_y = to_supervised(train, n_input)
    # define parameters
    verbose, epochs, batch_size = 1, 20, 32
    n_timesteps, n_features, n_outputs = train_x.shape[1], train_x.shape[2], train_y.shape[1]
    # reshape output into [samples, timesteps, features]
    train_y = train_y.reshape((train_y.shape[0], train_y.shape[1], 1))
    # define model
    model = Sequential()
    model.add(Conv1D(filters=64, kernel_size=3, activation='softmax', input_shape=(n_timesteps, n_features)))
    model.add(Conv1D(filters=64, kernel_size=3, activation='softmax'))
    model.add(MaxPooling1D(pool_size=2))
    model.add(Flatten())
    model.add(RepeatVector(n_outputs))
    model.add(LSTM(200, activation='relu', return_sequences=True))
    model.add(TimeDistributed(Dense(100, activation='softmax')))
    model.add(TimeDistributed(Dense(1)))
    model.compile(loss='mse', optimizer='adam')
    # fit network
    model.fit(train_x, train_y, epochs=epochs, batch_size=batch_size, verbose=verbose)
    return model
I used a batch size of 32 and split the data with this function:
def to_supervised(train, n_input, n_out=300):
    # flatten data
    data = train.reshape((train.shape[0]*train.shape[1], train.shape[2]))
    X, y = list(), list()
    in_start = 0
    # step over the entire history one time step at a time
    for _ in range(len(data)):
        # define the end of the input sequence
        in_end = in_start + n_input
        out_end = in_end + n_out
        # ensure we have enough data for this instance
        if out_end <= len(data):
            X.append(data[in_start:in_end, :])
            y.append(data[in_end:out_end, 2])
        # move along one time step
        in_start += 1
    return array(X), array(y)
I use n_input = 1000 and n_out = 480 (that is the horizon I have to predict).
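Just to illustrate the shapes the function produces, here is a toy call (made-up data, not my real series):

import numpy as np

# hypothetical input: 10 contiguous blocks of 200 time steps with 3 series each;
# 'array' inside to_supervised is assumed to be numpy's (from numpy import array)
train = np.random.rand(10, 200, 3)
X, y = to_supervised(train, n_input=24, n_out=12)
print(X.shape, y.shape)   # (1965, 24, 3) and (1965, 12) -- y takes column index 2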
On the very first training iteration the loss goes to NaN.
How should I fix it? There are no missing values in my data; I dropped every NaN.

Defining inputs during model training, Functional API in TensorFlow

I am trying to use the Functional API in TensorFlow (https://keras.io/guides/functional_api/) to build a deep learning model. So, this is my model:
first_inputs = Input(shape=(100,))
first_dense = Dense(1)(first_inputs)
second_input = Input(shape=(1,))
merge = concatenate([first_dense, second_input])
output = Dense(1)(merge)
model = Model(inputs=[first_inputs, second_input], outputs=output)
model.compile(optimizer=ada_grad, loss='binary_crossentropy',
              metrics=['accuracy'])
I use train_test_split as you can see:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.01, random_state=42)
How can I use model.fit here and specify which columns of x_train go to first_inputs and which go to second_input? And how can I do the same with model.evaluate?
You cannot do it that way. Multiple inputs should be passed to fit as a list of arrays, e.g.:
X = np.random.randn(1234, 101)
X1, X2 = X[:, :100], X[:, 100:]   # keep X2 two-dimensional to match Input(shape=(1,))
Y = np.random.randn(1234, 1)
model.fit([X1, X2], Y)
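Applied to your own split, a sketch (assuming x_train and x_test are NumPy arrays with 101 columns, where the first 100 columns feed first_inputs and the last column feeds second_input) would be:

# slicing with 100: keeps both pieces two-dimensional
x1_train, x2_train = x_train[:, :100], x_train[:, 100:]
x1_test, x2_test = x_test[:, :100], x_test[:, 100:]

model.fit([x1_train, x2_train], y_train, epochs=10, batch_size=32)
model.evaluate([x1_test, x2_test], y_test)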

Interpreting the Values of Confusion Matrix from machine learning

I used confusion_matrix() to evaluate a model that has been trained to detect DDoS attacks.
The confusion matrix I get with my test data set is shown in the attached image.
I believe the false negative value should not be 0 if the model correctly detected the traffic that is not DDoS.
Below is the code with which I have implemented my ML model. Could you please give me a suggestion for making the model correctly recognize benign traffic?
model.add(Dense(units=64, activation='relu', input_dim=7))  # input layer
model.add(Dropout(0.5))
model.add(Dense(units=128, activation='relu'))  # hidden layer
model.add(Dropout(0.2))
model.add(Dense(units=64, activation='relu'))  # hidden layer
model.add(Dropout(0.2))
model.add(Dense(units=1, activation='sigmoid'))  # output layer
model.compile(loss='binary_crossentropy',
              optimizer=Adam(learning_rate=0.0001),
              metrics=['accuracy'])

CSV_FILE = "ddos.csv"
df = pd.read_csv(CSV_FILE)
df.loc[(df.Label == "ddos"), "Label"] = 1.0
df.loc[(df.Label == "Benign"), "Label"] = 0.0

# Data set
x_train = np.array(df[["Flow Duration", "Tot Fwd Pkts", "TotLen Fwd Pkts",
                       "Flow IAT Mean", "Flow IAT Std", "Flow IAT Max", "Flow IAT Min"]])
x_train = x_train.astype(float)
normalized_x = preprocessing.normalize(x_train)
y_train = np.array(df[["Label"]])
y_train = np.array(y_train, dtype='float')
normalized_y = preprocessing.normalize(y_train)

hist = model.fit(normalized_x, normalized_y, epochs=3, batch_size=128)
y_pred = model.predict(x_train)
y_pred = preprocessing.normalize(y_pred)
cf_matrix = confusion_matrix(y_test, np.rint(y_pred))
Notice that my dataset is not imbalanced, i.e. it has exactly 50% DDoS and 50% normal traffic.
Although such questions cannot actually be answered with any degree of certainty, there are indeed some serious issues with your code.
First, you should not normalize your labels y_train; this is a binary classification problem, and the labels are expected to be exactly 0/1. Remove the following lines:
normalized_y = preprocessing.normalize(y_train)
y_pred = preprocessing.normalize(y_pred)
change the labels into integers (not floats), i.e.:
df.loc[(df.Label == "ddos"), "Label"] = 1
df.loc[(df.Label == "Benign"), "Label"] = 0
and your model fit to:
hist = model.fit(normalized_x, y_train, epochs=3, batch_size=128)
Second, although you train with normalized_x, you subsequently request predictions with x_train, which is again wrong; your predictions should be:
y_pred = model.predict(normalized_x)
Third, dropout should not be used by default, but only if we have signs of overfitting - but to do so, our model has first to be able to start learning something, which is not the case here. Comment out all dropout layers, and start putting them back into the model only in case of overfitting.
Last, you should start with the default settings of Adam, which usually (and reportedly) work well out of the box, i.e.:
optimizer=Adam()
And of course you should consider running the model for more than epochs=3.
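Putting these suggestions together, the relevant part of your pipeline would look roughly like this (a sketch, not a drop-in script; it assumes the Label column only contains "ddos"/"Benign", and the epoch count is just a placeholder to tune):

df.loc[(df.Label == "ddos"), "Label"] = 1
df.loc[(df.Label == "Benign"), "Label"] = 0

x = df[["Flow Duration", "Tot Fwd Pkts", "TotLen Fwd Pkts",
        "Flow IAT Mean", "Flow IAT Std", "Flow IAT Max", "Flow IAT Min"]].astype(float).to_numpy()
y = df["Label"].astype(int).to_numpy()
normalized_x = preprocessing.normalize(x)   # normalize features only, not labels

# same architecture, dropout layers left out until there are signs of overfitting
model = Sequential()
model.add(Dense(units=64, activation='relu', input_dim=7))
model.add(Dense(units=128, activation='relu'))
model.add(Dense(units=64, activation='relu'))
model.add(Dense(units=1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

hist = model.fit(normalized_x, y, epochs=50, batch_size=128)   # more than 3 epochs; adjust as needed
y_pred = model.predict(normalized_x)                           # predict on the normalized features
cf_matrix = confusion_matrix(y, np.rint(y_pred).ravel())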

tensorflow 2 evaluate inconsistent with sklearn accuracy_score

I'm trying to train a model to predict gender using the CelebA dataset and TensorFlow.
This is my model:
train_data_gen = train_image_generator.flow_from_dataframe(
    dataframe=train_split,
    directory=celeba.images_folder,
    x_col='id',
    y_col='Male',
    target_size=(IMG_WIDTH, IMG_HEIGHT),
    batch_size=batch_size,
    classes=['1', '0']
)

base_model = tf.keras.applications.MobileNetV2(input_shape=IMG_SHAPE,
                                               include_top=False,
                                               weights='imagenet')

model = tf.keras.Sequential([
    base_model,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(512, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(2),
    tf.keras.layers.Softmax()
])

base_learning_rate = 0.001
model.compile(optimizer=tf.keras.optimizers.RMSprop(lr=base_learning_rate),
              loss=tf.keras.losses.CategoricalCrossentropy(from_logits=True),
              metrics=['accuracy'])
Then I use the following to evaluate the model
test_data_gen = test_image_generator.flow_from_dataframe(
    dataframe=test_split,
    directory=celeba.images_folder,
    x_col='id',
    y_col='Male',
    target_size=(IMG_WIDTH, IMG_HEIGHT),
    batch_size=batch_size,
    classes=['1', '0']
)

model = tf.keras.models.load_model("cp-0004.ckpt")

# Re-evaluate the model
loss, acc = model.evaluate(test_data_gen, verbose=2)
which gives an accuracy of 0.87.
But when I use the following, I get 0.51 accuracy!
pred_test = model.predict(test_data_gen)
pred_df = pd.DataFrame(pred_test, columns=["Male", "Female"])
pred_df[pred_df > 0.5] = "1"
pred_df[pred_df < 0.5] = "0"
# test_split_raw = celeba.split('test', drop_zero=False)
confusion_matrix(test_split["Male"].astype(int).values, np.argmax(pred_df.values, 1))
Can anyone explain why the accuracy from the evaluate function is different?
You want to check test_image_generator.flow_from_dataframe: the default value of shuffle is True, so your generator yields your test data in random order.
Your model then predicts on those randomly ordered images, but you compare against your ordered dataframe. If you want to compare against test_split["Male"], set shuffle to False. If you don't set shuffle to False, you will always get ~0.5 accuracy (if your data is equally distributed).
Another hint: you should use the .evaluate() method when you have labeled data; .evaluate() also reports accuracy. Use .predict() only for new, unlabeled data.
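A minimal sketch of that fix, reusing the arguments from your generator (the only change is shuffle=False, then comparing against the generator's own class order):

test_data_gen = test_image_generator.flow_from_dataframe(
    dataframe=test_split,
    directory=celeba.images_folder,
    x_col='id',
    y_col='Male',
    target_size=(IMG_WIDTH, IMG_HEIGHT),
    batch_size=batch_size,
    classes=['1', '0'],
    shuffle=False                      # keep the order aligned with test_split
)

pred_test = model.predict(test_data_gen)
pred_classes = np.argmax(pred_test, axis=1)
# test_data_gen.classes holds the integer labels in generator order
# (class indices follow classes=['1', '0'], i.e. index 0 = '1', index 1 = '0')
print((pred_classes == test_data_gen.classes).mean())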