How to properly quantize CNN into 4-bit using Tensorflow QAT? - tensorflow

I am trying to perform 4-bit quantization and used this example.
First of all, I received the following warnings:
WARNING:tensorflow:AutoGraph could not transform <bound method Default8BitQuantizeConfig.set_quantize_activations of <tensorflow_model_optimization.python.core.quantization.keras.default_8bit.default_8bit_quantize_registry.Default8BitQuantizeConfig object at 0x7fb0208015c0>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: expected an indented block (<unknown>, line 14)
WARNING: AutoGraph could not transform <bound method Default8BitQuantizeConfig.set_quantize_activations of <tensorflow_model_optimization.python.core.quantization.keras.default_8bit.default_8bit_quantize_registry.Default8BitQuantizeConfig object at 0x7fb020806550>> and will run it as-is.
Please report this to the TensorFlow team. When filing the bug, set the verbosity to 10 (on Linux, `export AUTOGRAPH_VERBOSITY=10`) and attach the full output.
Cause: expected an indented block (<unknown>, line 14)
Then, after reading this doc, I found that it is possible to quantize my network to 4 bits, but I couldn't understand whether this is possible only for Dense layers or for all layers (like Conv2D).
I also don't understand how to work with the weights, since NumPy only gives them to me as float32.
UPD: I finally figured out how to perform quantization aware training:
import tensorflow as tf
from tensorflow import keras
import tensorflow_model_optimization as tfmot

LastValueQuantizer = tfmot.quantization.keras.quantizers.LastValueQuantizer
MovingAverageQuantizer = tfmot.quantization.keras.quantizers.MovingAverageQuantizer

class DefaultDenseQuantizeConfig(tfmot.quantization.keras.QuantizeConfig):
    # Configure how to quantize weights.
    def get_weights_and_quantizers(self, layer):
        return [(layer.kernel, LastValueQuantizer(num_bits=4, symmetric=True, narrow_range=False, per_axis=False))]

    # Configure how to quantize activations.
    def get_activations_and_quantizers(self, layer):
        return [(layer.activation, MovingAverageQuantizer(num_bits=4, symmetric=False, narrow_range=False, per_axis=False))]

    def set_quantize_weights(self, layer, quantize_weights):
        # Add this line for each item returned in `get_weights_and_quantizers`, in the same order.
        layer.kernel = quantize_weights[0]

    def set_quantize_activations(self, layer, quantize_activations):
        # Add this line for each item returned in `get_activations_and_quantizers`, in the same order.
        layer.activation = quantize_activations[0]

    # Configure how to quantize outputs (may be equivalent to activations).
    def get_output_quantizers(self, layer):
        return []

    def get_config(self):
        return {}

QAT_model = tfmot.quantization.keras.quantize_annotate_model(keras.Sequential([
    tfmot.quantization.keras.quantize_annotate_layer(
        tf.keras.layers.Dense(2, activation='relu', input_shape=x_train.shape[1:]),
        DefaultDenseQuantizeConfig()),
    tfmot.quantization.keras.quantize_annotate_layer(
        tf.keras.layers.Dense(2, activation='relu'),
        DefaultDenseQuantizeConfig()),
    tfmot.quantization.keras.quantize_annotate_layer(
        tf.keras.layers.Dense(10, activation='softmax'),
        DefaultDenseQuantizeConfig()),
]))

with tfmot.quantization.keras.quantize_scope(
        {'DefaultDenseQuantizeConfig': DefaultDenseQuantizeConfig}):
    # Use `quantize_apply` to actually make the model quantization aware.
    quantized_model = tfmot.quantization.keras.quantize_apply(QAT_model)

quantized_model.summary()

quantized_model.compile(optimizer='adam',                       # good default optimizer to start with
                        loss='sparse_categorical_crossentropy', # how we calculate "error"; the network aims to minimize loss
                        metrics=['accuracy'])                   # what to track

quantized_model.fit(x_train, y_train, epochs=3)
val_loss, val_acc = quantized_model.evaluate(x_test, y_test)
But I still can't understand how to access the 4-bit quantized weights.
I used np.array(quantized_model.get_weights()), but of course it gave me float32; moreover, the number of elements in the quantized model's weight list is smaller than in the original model. How can this be explained?
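For context: tfmot QAT keeps latent float32 weights and only simulates quantization with fake-quant operations in the graph, so get_weights() returning float32 is expected, and the weight list also contains the quantizers' bookkeeping variables (e.g. min/max ranges). As a rough, non-authoritative sketch, assuming the symmetric, per-tensor, num_bits=4 configuration above, this is how a float kernel maps onto a 4-bit grid (the weight index below is hypothetical; pick the actual kernel array from your model):

import numpy as np

def sketch_quantize_4bit_symmetric(kernel, num_bits=4, narrow_range=False):
    # Symmetric per-tensor quantization: integer levels in [-8, 7] ([-7, 7] if narrow_range).
    qmin = -(2 ** (num_bits - 1)) + (1 if narrow_range else 0)
    qmax = 2 ** (num_bits - 1) - 1
    max_abs = np.max(np.abs(kernel))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(kernel / scale), qmin, qmax)
    return q.astype(np.int8), (q * scale).astype(np.float32)  # integer levels and dequantized floats

kernel = quantized_model.get_weights()[1]       # hypothetical index: pick the latent float kernel
int_levels, dequantized = sketch_quantize_4bit_symmetric(kernel)
print(np.unique(int_levels))                    # at most 16 distinct 4-bit levels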

Related

Implementing a neural network with dynamic graph structure in TensorFlow

I am working on a project that requires a neural network with a dynamic graph structure, meaning the number of layers and the connections between them can change during runtime. I have been researching TensorFlow and its capabilities for building dynamic neural networks, but I am having trouble finding any clear examples or documentation on how to implement this.
I have tried creating a custom class for the neural network that builds the graph as it is trained, but I am getting errors when trying to run the training process. Here is a simplified version of my current implementation:
class DynamicNN(tf.keras.Model):
    def __init__(self, input_shape):
        super(DynamicNN, self).__init__()
        self.input_shape = input_shape
        self.layers = []

    def add_layer(self, layer):
        self.layers.append(layer)

    def call(self, inputs):
        x = tf.reshape(inputs, [-1, self.input_shape])
        for layer in self.layers:
            x = layer(x)
        return x

model = DynamicNN(input_shape=784)
model.add_layer(tf.keras.layers.Dense(64, activation='relu'))
model.add_layer(tf.keras.layers.Dense(10, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=10)
But it is giving me the following error:
InvalidArgumentError: You must feed a value for placeholder tensor 'dynamicnn_input' with dtype float and shape [?,784]
How can I implement a neural network with a dynamic graph structure in TensorFlow? Are there any specific techniques or functions that I should be using? Are there any known limitations of TensorFlow in this regard?

Converting PyTorch transforms.compose method into Keras

I understand that we use transforms.Compose to transform images via torchvision.transforms. I want to do the same in Keras, but after hours of searching the internet I couldn't figure out how to write an equivalent method. Below is the Torch way:
# preprocessing
data_transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[.5], std=[.5])
])
Can someone please point me in the right direction?
This is fairly straightforward in Tensorflow: Tensorflow recommends making the pre-processing/augmentation part of the model itself.
I do not have your complete code, but I assume you are using the tf.data.Dataset API to create your dataset. This is the recommended way of building an input pipeline in Tensorflow.
Having said that, you can simply prepend the augmentation layers to your model.
import tensorflow as tf
from tensorflow.keras import layers

# Pre-processing pipeline, e.g.:
# Step 1: Image resizing.
# Step 2: Image rescaling.
# Step 3: Image normalization.
pre_processing_pipeline = tf.keras.Sequential([
    layers.Resizing(IMG_SIZE, IMG_SIZE),
    layers.Rescaling(1./255),
    layers.Normalization(mean=[.5], variance=[.5]),
])

# Data augmentation as part of the input pipeline:
# Step 1: Random flip.
# Step 2: Random rotation.
data_augmentation = tf.keras.Sequential([
    layers.RandomFlip("horizontal_and_vertical"),
    layers.RandomRotation(0.2),
])

# Then add both to your model.
# This would be different in your case as you might be using a pre-trained model.
model = tf.keras.Sequential([
    # Add the preprocessing layers you created earlier.
    pre_processing_pipeline,
    data_augmentation,
    layers.Conv2D(16, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(),
    # Rest of your model.
])
For a complete list of layers check out this link. The above-given code can be found on the website here.
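If you prefer to keep the preprocessing in the input pipeline rather than in the model, a minimal sketch (assuming a tf.data.Dataset named train_ds that yields (image, label) pairs, plus the two Sequential pipelines defined above) could look like this:

import tensorflow as tf

# Deterministic preprocessing applies to training and validation data alike.
train_ds = train_ds.map(lambda x, y: (pre_processing_pipeline(x), y),
                        num_parallel_calls=tf.data.AUTOTUNE)

# Random augmentation is usually applied to the training split only.
train_ds = train_ds.map(lambda x, y: (data_augmentation(x, training=True), y),
                        num_parallel_calls=tf.data.AUTOTUNE)

train_ds = train_ds.prefetch(tf.data.AUTOTUNE)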

Shouldn't same neural network weights produce same results?

So I am working with different deep learning frameworks as part of my research and have observed something weird (at least I cannot explain the cause of it).
I trained a fairly simple MLP model (on the MNIST dataset) in Tensorflow, extracted the trained weights, created the same model architecture in PyTorch and applied the trained weights to the PyTorch model. My expectation was to get the same test accuracy from both the Tensorflow and PyTorch models, but this isn't the case; I get different results.
So my question is: if a model is trained to some optimal value, shouldn't the trained weights produce the same results every time testing is done on the same dataset (regardless of the framework used)?
PyTorch Model:
import torch
from torch import nn, Tensor
import torch.nn.functional as F

class Net(nn.Module):
    def __init__(self) -> None:
        super(Net, self).__init__()
        self.fc1 = nn.Linear(784, 24)
        self.fc2 = nn.Linear(24, 10)

    def forward(self, x: Tensor) -> Tensor:
        x = torch.flatten(x, 1)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x
Tensorflow Model:
import tensorflow as tf
from tensorflow.keras import layers, models

def build_model() -> tf.keras.Model:
    # Build model layers
    model = models.Sequential()
    # Flatten layer
    model.add(layers.Flatten(input_shape=(28, 28)))
    # Fully connected layers
    model.add(layers.Dense(24, activation='relu'))
    model.add(layers.Dense(10))
    # Compile the model
    model.compile(
        optimizer='sgd',
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=['accuracy']
    )
    # Return the newly built model
    return model
To extract weights from Tensorflow model and apply them to Pytorch model I use following functions:
Extract Weights:
def get_weights(model):
    # fetch latest weights
    weights = model.get_weights()
    # transpose weights
    t_weights = []
    for w in weights:
        t_weights.append(np.transpose(w))
    # return
    return t_weights
Apply Weights:
from collections import OrderedDict
import torch

def set_weights(model, weights):
    """Set model weights from a list of NumPy ndarrays."""
    state_dict = OrderedDict(
        {k: torch.Tensor(v) for k, v in zip(model.state_dict().keys(), weights)}
    )
    model.load_state_dict(state_dict, strict=True)
Providing the solution in the answer section for the benefit of the community. From the comments:
If you are using the same weights in the same manner then the results should be
the same, though floating-point rounding error should also be accounted for.
Also, it doesn't matter whether the model is trained at all. You can think of
your model architecture as a chain of matrix multiplications with element-wise
nonlinearities in between. How big is the difference? Are you comparing model
outputs, or metrics computed over the dataset? As a suggestion, initialize the
model with some random values in Keras, do a forward pass for a single batch,
and compare the outputs with the PyTorch model. (Paraphrased from jdehesa and Taras Sereda.)
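For example, a minimal sketch of that suggested check, assuming keras_model is the compiled build_model() instance and torch_model is a Net() instance to which the transposed weights have already been applied:

import numpy as np
import torch

x = np.random.rand(32, 28, 28).astype(np.float32)   # one random MNIST-shaped batch

tf_logits = keras_model.predict(x)                                 # Keras forward pass
torch_logits = torch_model(torch.from_numpy(x)).detach().numpy()   # PyTorch forward pass

# With identical weights, the raw logits should agree up to floating-point rounding error.
print(np.abs(tf_logits - torch_logits).max())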

Unable to build `Dense` layer with non-floating point dtype Error

I am currently learning deep learning and Keras. When I execute this code I get a weird error: "TypeError: Unable to build `Dense` layer with non-floating point dtype", and I can't figure out what the problem is. What am I missing? How do I fix this error?
The error shows up at model.fit(...).
def create_nerual_network():
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Flatten())
    model.add(tf.keras.layers.Dense(128, activation=tf.nn.relu))   # Simple Dense layer
    model.add(tf.keras.layers.Dense(128, activation=tf.nn.relu))   # Simple Dense layer
    model.add(tf.keras.layers.Dense(2, activation=tf.nn.softmax))  # Output layer
    return model

train_images, train_labels = load_dataset()  # this function works fine

model = create_nerual_network()
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])

model.fit(train_images, train_labels, epochs=15, verbose=2)
train_loss, train_acc = model.evaluate(train_images, train_labels)
It is interesting that you do not specify your input shape anywhere before compiling the model, but maybe newer versions of Keras can figure this out from the provided input.
In that case I am quite certain that the problem is with train_images: look at the dtype of this array. It is probably uint8, which is the usual format for images, as they use 8-bit integers for each color channel.
It is common practice to at least normalize your data before training and to always convert it to float.
Try putting this before calling model.fit:
train_images = train_images / 256.
This will normalize your images into the range [0, 1) and convert the array to float. You may also need to check the dtype of your labels.
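For example, a quick check along these lines (assuming train_images and train_labels are NumPy arrays):

import numpy as np

print(train_images.dtype)   # likely uint8 for raw image data
print(train_labels.dtype)   # integer labels are fine for sparse_categorical_crossentropy

# Scale to [0, 1) and cast to float32 explicitly
train_images = (train_images / 256.).astype(np.float32)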

What is the structure of a Keras model if input_shape is omitted and why does it perform better?

I omitted the input_shape in the first layer of my Keras model by mistake. Eventually I noticed this and fixed it – and my model's performance dropped dramatically.
Looking at the structure of the model with and without input_shape, I discovered that the better-performing model has the output shape of multiple. Moreover, plotting it with plot_model shows no connections between the layers:
When it comes to performance, the model I understand (with input_shape) achieves a validation loss of 4.0513 (MSE) after 10 epochs with my test code (below), while the "weird" model manages 1.3218 – and the difference only increases with more epochs.
Model definition:
model = keras.Sequential()
model.add(keras.layers.Dense(64, activation=tf.nn.relu, input_shape=(1001,)))
#                                        add or remove this ^^^^^^^^^^^^^^^^^^^
model.add(keras.layers.Dropout(0.05))
...
(never mind the details, this is just a model that demonstrates the difference in performance with and without input_shape)
So what is happening in the better-performing model? What is multiple? How are the layers really connected? How could I build this same model while also specifying input_shape?
Complete script:
import tensorflow as tf
from tensorflow import keras
import numpy as np
from collections import deque
import math, random

def func(x):
    return math.sin(x)*5 + math.sin(x*1.8)*4 + math.sin(x/4)*5

def get_data():
    x = 0
    dx = 0.1
    q = deque()
    r = 0
    data = np.zeros((100000, 1002), np.float32)
    while True:
        x = x + dx
        sig = func(x)
        q.append(sig)
        if len(q) < 1000:
            continue
        arr = np.array(q, np.float32)
        for k in range(10):
            xx = random.uniform(0.1, 9.9)
            data[r, :1000] = arr[:1000]
            data[r, 1000] = 5*xx  # scale for easier fitting
            data[r, 1001] = func(x + xx)
            r = r + 1
            if r >= data.shape[0]:
                break
        if r >= data.shape[0]:
            break
        q.popleft()
    inputs = data[:, :1001]
    outputs = data[:, 1001]
    return (inputs, outputs)

np.random.seed(1)
tf.set_random_seed(1)
random.seed(1)

model = keras.Sequential()
model.add(keras.layers.Dense(64, activation=tf.nn.relu, input_shape=(1001,)))
#                                        add or remove this ^^^^^^^^^^^^^^^^^^^
model.add(keras.layers.Dropout(0.05))
model.add(keras.layers.Dense(64, activation=tf.nn.relu))
model.add(keras.layers.Dropout(0.05))
model.add(keras.layers.Dense(64, activation=tf.nn.relu))
model.add(keras.layers.Dropout(0.05))
model.add(keras.layers.Dense(64, activation=tf.nn.relu))
model.add(keras.layers.Dropout(0.05))
model.add(keras.layers.Dense(1))

model.compile(
    loss='mse',
    optimizer=tf.train.RMSPropOptimizer(0.0005),
    metrics=['mae', 'mse'])

inputs, outputs = get_data()
hist = model.fit(inputs, outputs, epochs=10, validation_split=0.1)
print("Final val_loss is", hist.history['val_loss'][-1])
TL;DR
The reason that the results are different is that the two models have different initial weights. The fact that one performs (significantly) better than the other is purely by chance, and as @today mentioned, the results they obtain are approximately similar.
Details
As the documentation for tf.set_random_seed explains, random operations use two seeds, the graph-level seed and the operation specific seed; tf.set_random_seed sets the graph-level seed:
Operations that rely on a random seed actually derive it from two seeds: the graph-level and operation-level seeds. This sets the graph-level seed.
Taking a look at the definition for Dense we see that the default kernel initializer is 'glorot_uniform' (let's only consider the kernel initializer here but the same holds for the bias initializer). Walking farther through the source code we'll eventually find out that this fetches the GlorotUniform with default arguments. Specifically the random number generator seed for that specific operation (namely weight initialization) is set to None. Now if we check where this seed is used, we find it is passed to random_ops.truncated_normal for example. This in turn (as do all random operations) fetches now the two seeds, one being the graph-level seed and the other the operation specific seed: seed1, seed2 = random_seed.get_seed(seed). We can check the definition of the get_seed function and we find that if the operation specific seed is not given (which is our case) then it is derived from properties of the current graph: op_seed = ops.get_default_graph()._last_id. The corresponding part of the tf.set_random_seed docs read:
If the graph-level seed is set, but the operation seed is not: The system deterministically picks an operation seed in conjunction with the graph-level seed so that it gets a unique random sequence.
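To make this concrete, here is a small illustration (TF 1.x semantics, values hypothetical) of the graph-level versus operation-level seeds:

import tensorflow as tf

tf.set_random_seed(1)                  # graph-level seed
a = tf.random_uniform([1])             # no op seed: one is derived from the graph state
b = tf.random_uniform([1], seed=42)    # explicit operation-level seed

with tf.Session() as sess:
    print(sess.run(a), sess.run(b))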
Now coming back to original problem, it makes a difference for the graph structure if input_shape is defined or not. Again looking at a bit of source code we find that Sequential.add builds the inputs and outputs of the network incrementally only if input_shape was specified; otherwise it just stores a list of layers (model._layers); compare model.inputs, model.outputs for the two definitions. The output is incrementally built by calling the layers directly which dispatches to Layer.__call__. This wrapper builds the layer, sets the layer's inputs and outputs and adds some metadata to the outputs; also it uses an ops.name_scope to group operations. We can see this from the visualization provided by Tensorboard (example for the simplified model architecture of Input -> Dense -> Dropout -> Dense):
Now in the case where we didn't specify input_shape, all the model has is a list of layers. Even after having called compile the model is actually not compiled (just attributes such as the optimizer are set). Instead it is compiled "on the fly" when data is passed to the model for the first time. This happens in model._standardize_weights: the model output is obtained via self.call(dummy_input_values, training=training). Checking this method we find that it builds the layers (note that the model is not yet built) and then computes the output incrementally by using Layer.call (not __call__). This leaves out all the metadata and also the grouping of operations, and hence results in a different structure of the graph (though its computational operations are all the same). Again checking Tensorboard we find:
Expanding both graphs we would find that they contain the same operations, grouped differently together. However this has the effect that the keras.backend.get_session().graph._last_id is different for both definitions and hence results in a different seed for the random operations:
# With `input_shape`:
>>> keras.backend.get_session().graph._last_id
303
# Without `input_shape`:
>>> keras.backend.get_session().graph._last_id
7
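Following the suggestion above to compare model.inputs and model.outputs, a small check (a sketch for TF 1.x tf.keras, using getattr in case the attributes are not populated for the deferred-build model) shows the structural difference:

from tensorflow import keras

with_shape = keras.Sequential([keras.layers.Dense(64, input_shape=(1001,))])
without_shape = keras.Sequential([keras.layers.Dense(64)])

# The graph-network version has concrete input/output tensors right away,
# while the deferred-build version has nothing to report until data is passed in.
print(getattr(with_shape, 'inputs', None), getattr(with_shape, 'outputs', None))
print(getattr(without_shape, 'inputs', None), getattr(without_shape, 'outputs', None))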
Performance results
I used the OP's code with some modifications in order to have similar random operations:
Added the steps described here to ensure reproducibility in terms of randomization,
Set random seeds for Dense and Dropout variable initialization,
Removed validation_split since the splitting happens before "on the fly" compilation of the model without input_shape and hence might interfere with the seed,
Set shuffle = False since this might use a separate operation specific seed.
This is the complete code (in addition I performed export PYTHONHASHSEED=0 before running the script):
from collections import deque
from functools import partial
import math
import random
import sys

import numpy as np
import tensorflow as tf
from tensorflow import keras

seed = int(sys.argv[1])

np.random.seed(1)
tf.set_random_seed(seed)
random.seed(1)

session_conf = tf.ConfigProto(intra_op_parallelism_threads=1,
                              inter_op_parallelism_threads=1)
sess = tf.Session(graph=tf.get_default_graph(), config=session_conf)
keras.backend.set_session(sess)

def func(x):
    return math.sin(x)*5 + math.sin(x*1.8)*4 + math.sin(x/4)*5

def get_data():
    x = 0
    dx = 0.1
    q = deque()
    r = 0
    data = np.zeros((100000, 1002), np.float32)
    while True:
        x = x + dx
        sig = func(x)
        q.append(sig)
        if len(q) < 1000:
            continue
        arr = np.array(q, np.float32)
        for k in range(10):
            xx = random.uniform(0.1, 9.9)
            data[r, :1000] = arr[:1000]
            data[r, 1000] = 5*xx  # scale for easier fitting
            data[r, 1001] = func(x + xx)
            r = r + 1
            if r >= data.shape[0]:
                break
        if r >= data.shape[0]:
            break
        q.popleft()
    inputs = data[:, :1001]
    outputs = data[:, 1001]
    return (inputs, outputs)

Dense = partial(keras.layers.Dense,
                kernel_initializer=keras.initializers.glorot_uniform(seed=1))
Dropout = partial(keras.layers.Dropout, seed=1)

model = keras.Sequential()
model.add(Dense(64, activation=tf.nn.relu,
                # input_shape=(1001,)
                ))
model.add(Dropout(0.05))
model.add(Dense(64, activation=tf.nn.relu))
model.add(Dropout(0.05))
model.add(Dense(64, activation=tf.nn.relu))
model.add(Dropout(0.05))
model.add(Dense(64, activation=tf.nn.relu))
model.add(Dropout(0.05))
model.add(Dense(1))

model.compile(
    loss='mse',
    optimizer=tf.train.RMSPropOptimizer(0.0005)
)

inputs, outputs = get_data()

shuffled = np.arange(len(inputs))
np.random.shuffle(shuffled)
inputs = inputs[shuffled]
outputs = outputs[shuffled]

hist = model.fit(inputs, outputs[:, None], epochs=10, shuffle=False)
np.save('without.{:d}.loss.npy'.format(seed), hist.history['loss'])
With this code I'd actually expect to obtain similar results for both approaches; however, it turns out that they are not equal:
for i in $(seq 1 10)
do
python run.py $i
done
Plot the mean loss +/- 1 std. dev.:
Initial weights and initial prediction
I verified that the initial weights and an initial prediction (before fitting) are the same for the two versions:
inputs, outputs = get_data()

mode = 'without'
pred = model.predict(inputs)
np.save(f'{mode}.prediction.npy', pred)

for i, layer in enumerate(model.layers):
    if isinstance(layer, keras.layers.Dense):
        w, b = layer.get_weights()
        np.save(f'{mode}.{i:d}.kernel.npy', w)
        np.save(f'{mode}.{i:d}.bias.npy', b)
and
for i in 0 2 4 8
do
    for data in bias kernel
    do
        diff -q "with.$i.$data.npy" "without.$i.$data.npy"
    done
done
Influence of Dropout
I checked the performance after removing all Dropout layers, and in that case the performance is actually equal. So the crux seems to lie with the Dropout layers. In fact, the performance of the models without Dropout layers is the same as that of the model with Dropout layers but without input_shape. So it seems that without input_shape the Dropout layers are not effective.
Basically the difference between the two versions is that one uses __call__ and the other uses call to compute the outputs (as explained above). Since the performance is similar to the model without Dropout layers, a possible explanation could be that the Dropout layers don't drop anything when input_shape is not specified. This could be caused by training=False, i.e. the layers don't recognize they are in training mode. However, I don't see a reason why this would happen. Also we can consider again the Tensorboard graphs.
Specifying input_shape:
Not specifying input_shape:
where the switch also depends on the learning phase (as before):
To verify the training kwarg let's subclass Dropout:
class Dropout(keras.layers.Dropout):
    def __init__(self, rate, noise_shape=None, seed=None, **kwargs):
        super().__init__(rate, noise_shape=noise_shape, seed=1, **kwargs)

    def __call__(self, inputs, *args, **kwargs):
        training = kwargs.get('training')
        if training is None:
            training = keras.backend.learning_phase()
        print('[__call__] training: {}'.format(training))
        return super().__call__(inputs, *args, **kwargs)

    def call(self, inputs, training=None):
        if training is None:
            training = keras.backend.learning_phase()
        print('[call] training: {}'.format(training))
        return super().call(inputs, training)
I obtain similar outputs for both versions; however, the calls to __call__ are missing when input_shape is not specified:
[__call__] training: Tensor("keras_learning_phase:0", shape=(), dtype=bool)
[call] training: Tensor("keras_learning_phase:0", shape=(), dtype=bool)
[__call__] training: Tensor("keras_learning_phase:0", shape=(), dtype=bool)
[call] training: Tensor("keras_learning_phase:0", shape=(), dtype=bool)
[__call__] training: Tensor("keras_learning_phase:0", shape=(), dtype=bool)
[call] training: Tensor("keras_learning_phase:0", shape=(), dtype=bool)
[__call__] training: Tensor("keras_learning_phase:0", shape=(), dtype=bool)
[call] training: Tensor("keras_learning_phase:0", shape=(), dtype=bool)
So I suspect that the problem lies somewhere within __call__ but right now I can't figure out what it is.
System
I'm using Ubuntu 16.04, Python 3.6.7 and Tensorflow 1.12.0 via conda (no GPU support):
$ uname -a
Linux MyPC 4.4.0-141-generic #167-Ubuntu SMP Wed Dec 5 10:40:15 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
$ python --version
Python 3.6.7 :: Anaconda, Inc.
$ conda list | grep tensorflow
tensorflow 1.12.0 mkl_py36h69b6ba0_0
tensorflow-base 1.12.0 mkl_py36h3c3e929_0
Edit
I also had keras and keras-base installed (keras-applications and keras-preprocessing are required by tensorflow):
$ conda list | grep keras
keras 2.2.4 0
keras-applications 1.0.6 py36_0
keras-base 2.2.4 py36_0
keras-preprocessing 1.0.5 py36_0
After removing all keras* and tensorflow* packages and then reinstalling tensorflow, the discrepancy vanished. Even after reinstalling keras the results remained similar. I also checked a different virtualenv where tensorflow is installed via pip; there was no discrepancy there either. Right now I can't reproduce this discrepancy anymore. It must have been a broken installation of tensorflow.