Question on ElasticNet algorithm implemented in Cleverhans - tensorflow

I'm trying to use the Elastic-Net algorithm implemented in Cleverhans to generate adversarial samples in a classification task. The main problem is that I want the adversarial sample to be classified as a target class (different from the original one) with high confidence, but I'm not able to get good results.
The system that I'm trying to fool is a DNN with a softmax output over 10 classes.
For instance:
Given a sample of class 3, I want to generate an adversarial sample of class 0.
Using the default hyperparameters of the ElasticNetMethod in Cleverhans I'm able to obtain a successful attack, so the class assigned to the adversarial sample becomes class 0, but the confidence is quite low (about 30%). This also happens when I try different values for the hyperparameters.
My goal is to obtain a much higher confidence (at least 90%).
For other algorithms like "FGSM" or "MadryEtAl" I can achieve this with a loop in which the algorithm is applied until the sample is classified as the target class with a confidence greater than 90%, but I can't apply this iteration to the EAD algorithm because at each step it yields the adversarial sample generated at the first step, and in the following iterations it remains unchanged. (I know that this may happen because the algorithm is different from the other two mentioned, but I'm trying to find a way to reach my goal.)
This is the code that I'm currently using to generate adversarial samples:
ead_params = {'binary_search_steps': 9, 'max_iterations': 100, 'learning_rate': 0.001, 'clip_min': 0, 'clip_max': 1, 'y_target': target}
adv_x = image
founded_adv = False
threshold = 0.9
wrap = KerasModelWrapper(model)
ead = ElasticNetMethod(wrap, sess=sess)
while (not founded_adv):
    adv_x = ead.generate_np(adv_x, **ead_params)
    prediction = model.predict(adv_x).tolist()
    pred_class = np.argmax(prediction[0])
    confidence = prediction[0][pred_class]
    if (pred_class == 0 and confidence >= threshold):
        founded_adv = True
The while loop should keep generating samples until the target class is reached with a confidence of at least 90%. This code works with FGSM and Madry, but runs forever with EAD.
Library version:
Tensorflow: 2.2.0
Keras: 2.4.3
Cleverhans: 2.0.0-451ccecad450067f99c333fc53592201
Can anyone help me?
Thanks a lot.

For anyone interested in this problem, the previous code can be modified as follows to work properly:
FIRST SOLUTION:
prediction = model.predict(image)
initial_predicted_class = np.argmax(prediction[0])
ead_params = {'binary_search_steps': 9, 'max_iterations': 100, 'learning_rate': 0.001, 'confidence': 1, 'clip_min': 0, 'clip_max': 1, 'y_target': target}
adv_x = image
founded_adv = False
threshold = 0.9
wrap = KerasModelWrapper(model)
ead = ElasticNetMethod(wrap, sess=sess)
while (not founded_adv):
    adv_x = ead.generate_np(adv_x, **ead_params)
    prediction = model.predict(adv_x).tolist()
    pred_class = np.argmax(prediction[0])
    confidence = prediction[0][pred_class]
    if (pred_class == initial_predicted_class and confidence >= threshold):
        founded_adv = True
    else:
        ead_params['confidence'] += 1
This uses the confidence parameter implemented in the library: we increase it by 1 at every iteration in which the sample is not yet classified with a confidence above the threshold.
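The escalation loop above can also be written with an upper bound on the number of attempts, so that it cannot run forever. The following is only a minimal sketch: attack_fn and predict_fn are hypothetical stand-ins (not Cleverhans APIs) for something like a call to ead.generate_np with a given confidence and a call to model.predict, and restarting each attempt from the clean input is an assumption of this sketch rather than what the code above does.

```python
def attack_until_confident(x, attack_fn, predict_fn, target_class,
                           threshold=0.9, start_confidence=1,
                           step=1, max_attempts=50):
    """Raise the attack's confidence setting until the target class is
    predicted with probability >= threshold, or give up."""
    confidence = start_confidence
    for _ in range(max_attempts):
        adv_x = attack_fn(x, confidence)   # hypothetical attack call
        probs = predict_fn(adv_x)          # hypothetical: list of class probabilities
        pred_class = max(range(len(probs)), key=probs.__getitem__)
        if pred_class == target_class and probs[pred_class] >= threshold:
            return adv_x, confidence
        confidence += step                 # ask for a stronger attack next time
    return None, confidence                # no success within the budget
```

Returning None when the budget is exhausted makes it easy to tell a failed attack apart from a slow one.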
SECOND SOLUTION:
prediction = model.predict(image)
initial_predicted_class = np.argmax(prediction[0])
ead_params = {'beta': 5e-3, 'binary_search_steps': 6, 'max_iterations': 10, 'learning_rate': 3e-2, 'clip_min': 0, 'clip_max': 1}
threshold = 0.96
adv_x = image
founded_adv = False
wrap = KerasModelWrapper(model)
ead = ElasticNetMethod(wrap, sess=sess)
while (not founded_adv):
    eps_hyp = 0.5
    new_adv_x = ead.generate_np(adv_x, **ead_params)
    pert = new_adv_x - adv_x
    new_adv_x = adv_x - eps_hyp * pert
    new_adv_x = (new_adv_x - np.min(new_adv_x)) / (np.max(new_adv_x) - np.min(new_adv_x))
    adv_x = new_adv_x
    prediction = model.predict(new_adv_x).tolist()
    pred_class = np.argmax(prediction[0])
    confidence = prediction[0][pred_class]
    print(pred_class)
    print(confidence)
    if (pred_class == initial_predicted_class and confidence >= threshold):
        founded_adv = True
The second solution makes the following modifications to the original code:
-initial_predicted_class is the class predicted by the model on the benign sample ("0" in our example).
-We don't pass the target class in the parameters of the algorithm (ead_params).
-We obtain the perturbation produced by the algorithm as pert = new_adv_x - adv_x, where adv_x is the original image (in the first step of the while loop) and new_adv_x is the perturbed sample generated by the algorithm.
-This is useful because the original EAD algorithm computes the perturbation that maximizes the loss w.r.t. class "0", whereas in our case we want to minimize it.
-So we compute the new perturbed image as new_adv_x = adv_x - eps_hyp*pert (where eps_hyp is an epsilon hyperparameter that I introduced to reduce the perturbation), and then we normalize the new perturbed image.
-I've tested the code on a large number of images, and the confidence always increases, so I think it can be a good solution for this purpose.
I think the second solution yields finer perturbations.

How to make spacy train faster on NER for Persian language

I have a blank spaCy model. I generated the config file with the quickstart widget on the Training Pipelines & Models page, using these settings:
Language = Arabic
Components = ner
Hardware = CPU
Optimize for = accuracy
Then in the config file I changed:
[nlp]
lang = "ar"
to
[nlp]
lang = "fa"
because there is no pretrained GPU (transformer) model for Persian.
As you know, the accuracy setting is very slow, and I have 400,000 records.
This is my config file:
[paths]
train = null
dev = null
vectors = null
[system]
gpu_allocator = null
[nlp]
lang = "fa"
pipeline = ["tok2vec","ner"]
batch_size = 1000
[components]
[components.tok2vec]
factory = "tok2vec"
[components.tok2vec.model]
@architectures = "spacy.Tok2Vec.v2"
[components.tok2vec.model.embed]
@architectures = "spacy.MultiHashEmbed.v2"
width = ${components.tok2vec.model.encode.width}
attrs = ["ORTH", "SHAPE"]
rows = [5000, 2500]
include_static_vectors = true
[components.tok2vec.model.encode]
@architectures = "spacy.MaxoutWindowEncoder.v2"
width = 256
depth = 8
window_size = 1
maxout_pieces = 3
[components.ner]
factory = "ner"
[components.ner.model]
@architectures = "spacy.TransitionBasedParser.v2"
state_type = "ner"
extra_state_tokens = false
hidden_width = 64
maxout_pieces = 2
use_upper = true
nO = null
[components.ner.model.tok2vec]
@architectures = "spacy.Tok2VecListener.v1"
width = ${components.tok2vec.model.encode.width}
[corpora]
[corpora.train]
@readers = "spacy.Corpus.v1"
path = ${paths.train}
max_length = 0
[corpora.dev]
@readers = "spacy.Corpus.v1"
path = ${paths.dev}
max_length = 0
[training]
dev_corpus = "corpora.dev"
train_corpus = "corpora.train"
[training.optimizer]
@optimizers = "Adam.v1"
[training.batcher]
@batchers = "spacy.batch_by_words.v1"
discard_oversize = false
tolerance = 0.2
[training.batcher.size]
@schedules = "compounding.v1"
start = 100
stop = 1000
compound = 1.001
[initialize]
vectors = ${paths.vectors}
How can I make the training process faster?
To speed up training you have a few options.
Change the evaluation frequency. It's not in the config the widget generates, but there's an eval_frequency option - it should be filled in if you use fill-config as recommended. The default value is relatively low, and evaluation is slow. You should increase this value a lot if you have a large amount of training data.
Use the efficiency presets instead of accuracy. If speed is an issue then you should try this. For your pipeline, the relevant options are whether to include static vectors or not, and the width or depth of your tok2vec. Note this alone won't affect speed that much, but because it definitely reduces memory usage it can be usefully combined with the next option.
Increase batch size. In training, the time to process a batch is relatively constant, so larger batches mean fewer batches for the same data, which means faster training. How large a batch you can handle depends on the size of your documents and your hardware.
Use less training data. This is very rarely something that I'd recommend, but if you have 400,000 records you shouldn't need that many to get a good NER model. (How many classes do you have?) Try 10,000 to start with and see how your model performs, and scale up until you get the accuracy/speed tradeoff you want. This will also help you figure out if there is some kind of issue with your data more quickly.
For tips on faster inference (not training), see the spaCy speed FAQ.
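The "use less training data" suggestion can be tried with a reproducible random subsample. This is only a sketch under assumptions: records is a hypothetical in-memory list standing in for the 400,000 annotated examples, and subsample is an illustrative helper, not a spaCy function; the sampled examples would still have to be converted to whatever training format you use.

```python
import random

def subsample(records, n, seed=0):
    """Return n records drawn uniformly at random without replacement
    (or all of them if fewer than n are available)."""
    if n >= len(records):
        return list(records)
    return random.Random(seed).sample(records, n)

# Hypothetical stand-in for 400,000 annotated examples.
records = [f"example_{i}" for i in range(400_000)]
pilot = subsample(records, 10_000)
```

Fixing the seed keeps the pilot set stable between runs, so changes in accuracy come from the config rather than from a different sample.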
You might just be using one core of your CPU, as that is kind of the Python default iirc. I would look into parallelizing the job with joblib and increasing your chunk size.
See: https://prrao87.github.io/blog/spacy/nlp/performance/2020/05/02/spacy-multiprocess.html#Option-3:-Parallelize-the-work-using-joblib

Getting nans for gradient

I am trying to create a search relevance model where I take the dot product between query vector and resulting documents. I add a positional bias term on top to take into account the fact that position 1 is more likely to be clicked on. The final (unnormalised) log likelihood calculation is as follows:
query = self.query_model(query_input_ids, query_attention_mask)
docs = self.doc_model(doc_input_ids, doc_attention_mask)
positional_bias = self.position_model()
if optimizer_idx is not None:
    if optimizer_idx == 0:
        docs = docs.detach()
        positional_bias = positional_bias.clone().detach()
    elif optimizer_idx == 1:
        query = query.detach()
        positional_bias = positional_bias.clone().detach()
    else:
        query = query.detach()
        docs = docs.detach()
similarity = (docs @ query.unsqueeze(-1)).squeeze()
click_log_lik = (similarity + positional_bias)\
    .reshape(doc_mask.shape)\
    .masked_fill_((1 - doc_mask).bool(), float("-inf"))
The query and doc models are each simply a DistilBERT model with a projection layer on top of the CLS token. The models can be seen here: https://pastebin.com/g21g9MG3
When inspecting the first gradient descent step, the gradients contain NaNs, but only for the query model and not the doc model. My hypothesis is that normalizing the return values of the doc and query models (return F.normalize(out, dim=-1)) is somehow interfering with the gradients.
Does anyone know (1) whether my hypothesis is true and, more importantly, (2) how I can fix the NaN gradients?
Additional Info:
None of the losses are inf or nan.
query is BS x 768
docs is BS x DOC_RESULTS x 768
positional_bias is DOC_RESULTS
DOC_RESULTS is 10 in my case.
The masked_fill in the last line is because occasionally I have less than 10 data points for a query.
Update 1
The following changes made no difference to nans:
Changing masked_fill from -inf to 1e5.
Changing the projection from F.normalize(out, dim=-1) to out / 100.
Removing the positional bias altogether, again with no luck.
If it helps anyone, and you come across this while using Transformers this is what I did:
So in the end the bug was due to the fact that I was masking away NaNs. Since I had some documents with zero length, the output of the transformer was NaN for them. I was hoping that masked_fill would fix this problem, but it doesn't. The solution in my case was to put only non-zero-length sequences through the transformers, and then pad the result with zeros to fill the batch size.
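One plausible mechanism for why masking alone does not help, shown as a toy sketch in plain Python rather than the actual model: the backward pass of a dot product multiplies the upstream gradient by the other operand, and 0.0 * nan is still nan. All values below are made up, and grad_wrt_query is an illustrative helper, not a PyTorch function; filtering rows here stands in for never encoding the empty sequences at all.

```python
import math

nan = float("nan")
docs = [[1.0, 2.0],   # normal document
        [nan, nan]]   # zero-length document: encoder output is NaN
# Upstream gradient of the loss w.r.t. each similarity score;
# the masked (second) document receives exactly zero.
grad_sim = [0.3, 0.0]

def grad_wrt_query(docs, grad_sim):
    """Backward pass of sim[j] = dot(docs[j], query):
    d loss / d query[k] = sum_j grad_sim[j] * docs[j][k]."""
    dim = len(docs[0])
    return [sum(g * d[k] for g, d in zip(grad_sim, docs)) for k in range(dim)]

masked_only = grad_wrt_query(docs, grad_sim)   # 0.0 * nan poisons the sum

# Keep only documents whose encoder output is finite (the non-empty ones).
keep = [j for j, d in enumerate(docs) if not any(math.isnan(v) for v in d)]
filtered = grad_wrt_query([docs[j] for j in keep], [grad_sim[j] for j in keep])
```

Here masked_only contains only NaNs even though the masked score contributes a zero gradient, while filtered is finite.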

Calculating YOLO mAP against test dataset

AFAIK YOLO calculates mAP against the validation dataset during training. Is it possible to calculate the same against an unseen test dataset?
Command:
./darknet detector map obj.data yolo-obj.cfg yolo-obj_best.weights
obj.data:
classes = 1
train = train.txt
valid = test.txt
names = classes.txt
backup = backup
I have pointed valid to the test dataset, which contains annotated images, but I always get the following result:
calculation mAP (mean average precision)...
44
detections_count = 50, unique_truth_count = 43
class_id = 0, name = traffic_light, ap = 100.00% (TP = 43, FP = 0)
for conf_thresh = 0.25, precision = 1.00, recall = 1.00, F1-score = 1.00
for conf_thresh = 0.25, TP = 43, FP = 0, FN = 0, average IoU = 85.24 %
IoU threshold = 50 %, used Area-Under-Curve for each unique Recall
mean average precision (mAP@0.50) = 1.000000, or 100.00 %
Total Detection Time: 118 Seconds
It's not that I'm not happy with 100% mAP, but it's definitely wrong isn't it?
Any advice would be greatly appreciated.
Regards,
Setnug
"Now is it possible to calculate the same against unseen test dataset ?"
Yes. mAP calculation only needs images with corresponding labels/annotations.
"I have directed valid to test dataset containing annotated images."
Yes, this is the way to do what you want.
There is a possibility that what you're seeing here is this known bug, provided you're using old code and haven't updated since. In that case, I suggest pulling the latest darknet and trying again.
Note that if the model is trained really well and your test set is simple (even though it's unseen) or visually similar to the training set, it's possible to get such numbers as well, since we're talking about a small number of test samples.
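As a sanity check, the precision, recall and F1 figures in the log follow directly from the reported TP/FP/FN counts under the standard definitions, so the 100% values are at least internally consistent. The helper below is purely illustrative and is not part of darknet.

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive
    and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Counts reported in the log above at conf_thresh = 0.25.
print(detection_metrics(tp=43, fp=0, fn=0))  # (1.0, 1.0, 1.0)
```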

After quantisation in neural network, will the output need to be scaled with the inverse of the weight scaling

I'm currently writing a script to quantise a Keras model down to 8 bits. I'm doing a fairly basic linear scaling on the weights, by assuming a normal distribution of weights and biases, and then interpolating all the values within 2 standard deviations of the mean, to the range [-128, 127].
This all works, and I can run the model through inference, but my output image is terrible. I know there will be a small performance hit, but I'm seeing roughly a 10x performance degradation.
My question is, after this scaling of the weights, do I need to do the inverse scaling operation to my output? None of the papers I've been reading seem to mention this, but I'm unsure why else my results would be so bad.
The network is for image demosaicing. It takes in a RAW image, and is meant to output an image with very low noise, and no demosaicing artefacts. My full precision model is very good, with image PSNRs of around 40-43dB, but after quantisation, I'm getting 4-8dB, and incredibly bad looking images.
Code for anyone who's bothered to read it
for i in layer_index:
    count = count + 1
    layer = model.get_layer(index=i)
    weights = layer.get_weights()
    weights_act = weights[0]
    bias_act = weights[1]
    std = np.std(weights_act)
    if (std > max_std):
        max_std = std
    mean = np.mean(weights_act)
    mean_of_mean = mean_of_mean + mean
mean_of_mean = mean_of_mean / count
max_bound = mean_of_mean + 2*max_std
min_bound = mean_of_mean - 2*max_std
print(max_bound, min_bound)
for i in layer_index:
    layer = model.get_layer(index=i)
    weights = layer.get_weights()
    weights_act = weights[0]
    bias_act = weights[1]
    weights_shape = weights_act.shape
    bias_shape = bias_act.shape
    new_weights = np.empty(weights_shape, dtype=np.int8)
    print(new_weights.dtype)
    new_biass = np.empty(bias_shape, dtype=np.int8)
    for a in range(weights_shape[0]):
        for b in range(weights_shape[1]):
            for c in range(weights_shape[2]):
                for d in range(weights_shape[3]):
                    new_weight = (((weights_act[a,b,c,d] - min_bound) * (127 - (-128)) / (max_bound - min_bound)) + (-128))
                    new_weights[a,b,c,d] = np.int8(new_weight)
                    #print(new_weights[a,b,c,d], weights_act[a,b,c,d])
    for e in range(bias_shape[0]):
        new_bias = (((bias_act[e] - min_bound) * (127 - (-128)) / (max_bound - min_bound)) + (-128))
        new_biass[e] = np.int8(new_bias)
    new_weight_layer = (new_weights, new_biass)
    layer.set_weights(new_weight_layer)
You aren't doing what you think you're doing. I'll explain.
If you want to take a pre-trained model and quantize it, you have to add scales after each operation that involves weights. Let's take the convolution operation as an example.
As we know, the convolution operation is linear. In my explanation I will ignore the bias for the sake of simplicity (adding it is relatively easy). Let's assume X is our input, Y is our output and W is the weights; the convolution can be written as:
Y = W*X
where '*' represents the convolution operation. What you are basically doing is taking the weights, multiplying them by some scalar (let's call it 'a') and shifting them by some other scalar (let's call it 'b'), so in your model you use W' where: W' = Wa + b
So if we return to the convolution operation, in your quantized network you are basically computing: Y' = W'*X = (Wa + b)*X
Because convolution is linear we get: Y' = a(W*X) + b*X, where b*X denotes convolving X with a kernel whose entries all equal b.
Don't forget that in your network you want to receive Y, not Y', at the output of the convolution, so you must shift and rescale to get the correct answer: Y = (Y' - b*X) / a.
So after that explanation (which I hope was clear enough) I hope you can see what the problem in your network is: you apply this scale and shift to all of the weights and never compensate for it. I think your confusion comes from reading papers that trained models in quantized mode from the beginning, rather than taking a pretrained model and quantizing it.
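The compensation step can be checked on a toy 1-D example. This is only a sketch under assumptions: the weights, input, scale a and shift b are made-up numbers, conv1d is an illustrative helper rather than a Keras or TensorFlow function, and rounding to int8 is ignored so that only the scale-and-shift effect is visible.

```python
def conv1d(w, x):
    """'Valid' 1-D correlation of kernel w with signal x."""
    k = len(w)
    return [sum(w[i] * x[n + i] for i in range(k))
            for n in range(len(x) - k + 1)]

W = [0.5, -0.25, 0.125]          # toy float weights (made up)
X = [1.0, 2.0, 4.0, 8.0, 16.0]   # toy input (made up)
a, b = 64.0, 3.0                 # toy scale and shift (made up)

Y = conv1d(W, X)                  # reference output, Y = W*X
W_q = [a * w + b for w in W]      # W' = Wa + b
Y_q = conv1d(W_q, X)              # what the modified network computes, Y'
ones = conv1d([1.0] * len(W), X)  # X convolved with an all-ones kernel
# Undo the shift, then the scale: Y = (Y' - b*X) / a
Y_rec = [(yq - b * s) / a for yq, s in zip(Y_q, ones)]
```

Without the last line the network output Y_q is far from Y; with it, Y_rec matches Y for these values.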
For your problem I think the TensorFlow graph transform tool might help. Take a look at:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/graph_transforms/README.md
If you wish to read more about quantizing a pre-trained model, you can find more information here (for more academic material, search scholar.google.com):
https://www.tensorflow.org/lite/performance/post_training_quantization

Why shuffling data gives significantly higher accuracy?

In Tensorflow, I've written a big model for a two-class image classification problem. My question concerns the following code snippet:
X, y, X_val, y_val = prepare_data()
probs = calc_probs(model, session, X)
accuracy = float(np.equal(np.argmax(probs, 1), np.argmax(y, 1)).sum()) / probs.shape[0]
loss = log_loss(y, probs)
X is an np.array of shape (25000,244,244,3). That code results in accuracy=0.5834 (close to random accuracy) and loss=2.7106. But when I shuffle the data, by adding these 3 lines after the first line:
sample_idx = random.sample(range(0, X.shape[0]), 25000)
X = X[sample_idx]
y = y[sample_idx]
the results become reasonable: accuracy=0.9933 and loss=0.0208.
Why can shuffling the data give significantly higher accuracy? What could be the reason for that?
The function calc_probs is mainly a run call:
probs = session.run(model.probs, feed_dict={model.X: X})
Update:
After hours of debugging, I figured out that evaluating a single image gives a different result each time. For example, if you run the following line of code multiple times, you get a different result each time:
session.run(model.probs, feed_dict={model.X: [X[20]]})
My data is normally sorted, X contains class 1 samples first then class 2. And in calc_probs function, I run using each batch of the data sequentially. So, without shuffling, each run has data of a single class.
I've also noted that with shuffling, if batch size is very small, I get the random accuracy.
There is some mathematical justification for this in the context of the randomized Kaczmarz algorithm. The regular Kaczmarz algorithm is an old algorithm which can be seen as a non-shuffling SGD on a least squares problem, and there are guaranteed faster convergence rates that come out if you use randomization; see the references in http://www.cs.ubc.ca/~nickhar/W15/Lecture21Notes.pdf
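For readers unfamiliar with it, here is a minimal sketch of randomized Kaczmarz in plain Python, using the common variant that picks each row with probability proportional to its squared norm. The 3x2 system is a made-up example whose solution is x = [1, 2]; the function name and defaults are illustrative, not taken from the linked notes.

```python
import random

def randomized_kaczmarz(A, b, iters=500, seed=0):
    """Solve a consistent system A x = b by repeatedly projecting x onto
    the hyperplane of one randomly chosen row (row i is picked with
    probability proportional to its squared norm)."""
    rng = random.Random(seed)
    norms = [sum(v * v for v in row) for row in A]
    x = [0.0] * len(A[0])
    for _ in range(iters):
        i = rng.choices(range(len(A)), weights=norms)[0]
        residual = b[i] - sum(a * xi for a, xi in zip(A[i], x))
        scale = residual / norms[i]
        x = [xi + scale * a for xi, a in zip(x, A[i])]
    return x

A = [[2.0, 1.0], [1.0, 3.0], [1.0, -1.0]]
b = [4.0, 7.0, -1.0]   # consistent with x = [1, 2]
x = randomized_kaczmarz(A, b)
```

Replacing the random choice with a fixed cyclic order gives the classical, non-shuffling version that the answer compares against.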