Serialized neural network size in Torch after splitting layers - serialization

After training a neural network, say a multi-layer perceptron, at prediction time I want to split the first layer from all the others.
The only way I have found to end up with files of the correct size is the following:
I loop through all the layers and add each one to one of two containers (the first layer, or all the others), which I then save separately with the torch.save function. The odd bit is that I need to retrieve the parameters of each layer before adding it to either container; otherwise, once saved, both files (the first layer and all the other layers) have the same file size.
A code snippet is more helpful than the explanation above:
local function split_model(network)
    -- For some reason, if getParameters() is not called on each layer first,
    -- all the saved models end up with the same file size.
    local first_layer = nn.Sequential()
    local all_the_rest = nn.Sequential()
    for i = 1, network:size() do
        local l = network:get(i)
        local l_params, _ = l:getParameters()
        if i == 1 then
            first_layer:add(l)
        else
            all_the_rest:add(l)
        end
    end
    return first_layer, all_the_rest
end

local first_layer, all_the_rest = split_model(network)
torch.save("checkpoints/mlp.t7", network)
torch.save("checkpoints/first_layer.t7", first_layer)
torch.save("checkpoints/all_the_rest.t7", all_the_rest)

The same question was posted in Google Groups; this is the answer by Alban Desmaison:
Hi,
The reason for this behavior is the way getParameters works:
https://github.com/torch/nn/blob/master/doc/module.md#flatparameters-flatgradparameters-getparameters
To be able to return a flat tensor containing all the parameters, it actually creates a single Storage containing all the weights, and each module's weight tensor then points into this storage. When you save any element of the network, its weight tensor has to be saved, and to do so the underlying storage is saved as well. Hence, if you called getParameters on the complete network and then save any single module, you will save all of the network's weights. When you instead call getParameters on a single module, it re-creates this single storage for that module only, so when you save it, it contains only the weights that you want. BUT note that the flattened parameters returned by the getParameters you did on the complete network are no longer valid!
You have two solutions here:
- If you don't need the params coming from getParameters on the complete network, you can just call getParameters on each subset of your network before saving it. This changes the underlying storage so that it only contains that subset of the network, and you will save only what you need (shared storages are stored only once).
- If you want to keep using the params from the original getParameters, you can do the same as above but call getParameters on cloned versions of the subsets and save those.
And because a code snippet is always better:
require 'nn'
local subset1 = nn.Linear(2,2)
local subset2 = nn.Linear(2,2)
local network = nn.Sequential():add(subset1):add(subset2)
print("Before getParameters:", subset1.weight:storage():size()) -- 4 elements
network_params,_ = network:getParameters()
print("After getParameters:", subset1.weight:storage():size()) -- 12 elements
subset1.weight:random() -- Change weights to see if linking is still working
print("network_params is valid?", network_params[1] == subset1.weight[1][1]) -- true
-- Keeping network_params valid
local clone_subset1 = subset1:clone()
print("Cloned subset1 before getParameters:", clone_subset1.weight:storage():size()) -- 12 elements
clone_subset1:getParameters()
print("Cloned subset1 after getParameters:", clone_subset1.weight:storage():size()) -- 6 elements (4 weights + 2 bias)
subset1.weight:random() -- Change weights to see if linking is still working
print("network_params is valid?", network_params[1] == subset1.weight[1][1]) -- true
-- Not keeping network_params valid (should be faster)
local clone_subset1 = subset1:clone()
print("subset1 before getParameters:", subset1.weight:storage():size()) -- 12 elements
subset1:getParameters()
print("subset1 after getParameters:", subset1.weight:storage():size()) -- 6 elements (4 weights + 2 bias)
subset1.weight:random() -- Change weights to see if linking is still working
print("network_params is valid?", network_params[1] == subset1.weight[1][1]) -- false

Related

TF object detection: return subset of inference payload

Problem
I'm working on training and deploying an instance segmentation model using TF's object detection API. I'm able to successfully train the model, package it into a TF Serving Docker image (latest tag as of Oct 2020), and process inference requests via the REST interface. However, the amount of data returned from an inference request is very large (hundreds of MB). This is a big problem when the inference request and the processing don't happen on the same machine, because all of that returned data has to go over the network.
Is there a way to trim down the number of outputs (either during model export or within the TF Serving image) to allow faster round-trip times during inference?
Details
I'm using the TF OD API (with TF2) to train a Mask R-CNN model, which is a modified version of this config. I believe the full list of outputs is described in code here. The list of items I get during inference is also pasted below. For a model with 100 object proposals, that information is ~270 MB if I just write the returned inference as JSON to disk.
inference_payload['outputs'].keys()
dict_keys(['detection_masks', 'rpn_features_to_crop', 'detection_anchor_indices', 'refined_box_encodings', 'final_anchors', 'mask_predictions', 'detection_classes', 'num_detections', 'rpn_box_predictor_features', 'class_predictions_with_background', 'proposal_boxes', 'raw_detection_boxes', 'rpn_box_encodings', 'box_classifier_features', 'raw_detection_scores', 'proposal_boxes_normalized', 'detection_multiclass_scores', 'anchors', 'num_proposals', 'detection_boxes', 'image_shape', 'rpn_objectness_predictions_with_background', 'detection_scores'])
I already encode the images within my inference requests as base64, so the request payload is not too large when going over the network. It's just that the inference response is gigantic in comparison. I only need 4 or 5 of the items out of this response, so it'd be great to exclude the rest and avoid passing such a large package of bits over the network.
Things I've tried
I've tried setting the score_threshold to a higher value during the export (code example here) to reduce the number of outputs. However, this seems to just threshold the detection_scores. All the extraneous inference information is still returned.
I also tried just manually excluding some of these inference outputs by adding the names of keys to remove here. That also didn't seem to have any effect, and I'm worried this is a bad idea because some of those keys might be needed during scoring/evaluation.
I also searched here and on tensorflow/models repo, but I wasn't able to find anything.
I was able to find a hacky workaround. In the export process (here), some of the components of the prediction dict are deleted. I added additional items to the non_tensor_predictions list, which contains all keys that will get removed during the postprocess step. Augmenting this list cut down my inference outputs from ~200MB to ~12MB.
Full code for the if self._number_of_stages == 3 block:
if self._number_of_stages == 3:
    non_tensor_predictions = [
        k for k, v in prediction_dict.items() if not isinstance(v, tf.Tensor)]
    # Add additional keys to delete during postprocessing
    non_tensor_predictions = non_tensor_predictions + [
        'raw_detection_scores', 'detection_multiclass_scores', 'anchors',
        'rpn_objectness_predictions_with_background', 'detection_anchor_indices',
        'refined_box_encodings', 'class_predictions_with_background',
        'raw_detection_boxes', 'final_anchors', 'rpn_box_encodings',
        'box_classifier_features']
    for k in non_tensor_predictions:
        tf.logging.info('Removing {0} from prediction_dict'.format(k))
        prediction_dict.pop(k)
    return prediction_dict
I think there's a more "proper" way to deal with this using signature definitions during the creation of the TF Serving image, but this worked for a quick and dirty fix.
I've run into the same problem. The exporter_main_v2 code states that the outputs should be:
and the following output nodes returned by the model.postprocess(..):
  * `num_detections`: Outputs float32 tensors of the form [batch]
    that specifies the number of valid boxes per image in the batch.
  * `detection_boxes`: Outputs float32 tensors of the form
    [batch, num_boxes, 4] containing detected boxes.
  * `detection_scores`: Outputs float32 tensors of the form
    [batch, num_boxes] containing class scores for the detections.
  * `detection_classes`: Outputs float32 tensors of the form
    [batch, num_boxes] containing classes for the detections.
I've submitted an issue on the tensorflow object detection GitHub repo; I hope we will get feedback from the tensorflow dev team.
The GitHub issue can be found here
If you are using the exporter_main_v2.py file to export your model, you can try this hacky way to solve the problem.
Just add the following code in the function _run_inference_on_images of the exporter_lib_v2.py file:
detections[classes_field] = (
    tf.cast(detections[classes_field], tf.float32) + label_id_offset)

############# START ##########
ignored_model_output_names = ["raw_detection_boxes", "raw_detection_scores"]
for key in ignored_model_output_names:
    if key in detections.keys():
        del detections[key]
############# END ##########

for key, val in detections.items():
    detections[key] = tf.cast(val, tf.float32)
Therefore, the generated model will not output the values of ignored_model_output_names.
Please let me know if this can solve your problem.
Another approach would be to alter the signatures of the saved model:
import tensorflow as tf
from os import path

model = tf.saved_model.load(path.join("models", "efficientdet_d7_coco17_tpu-32", "saved_model"))
infer = model.signatures["serving_default"]
outputs = infer.structured_outputs
for o in ["raw_detection_boxes", "raw_detection_scores"]:
    outputs.pop(o)

tf.saved_model.save(
    model,
    export_dir="export",
    signatures={"serving_default": infer},
    options=None
)
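As a quick sanity check (a sketch assuming the "export" directory written above), you can reload the re-exported model and list the outputs declared by its serving signature:
import tensorflow as tf

# Reload the re-exported model and inspect its declared outputs; the popped
# keys should no longer be listed.
reexported = tf.saved_model.load("export")
print(list(reexported.signatures["serving_default"].structured_outputs.keys()))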

When doing a forward pass in MXNet, how to handle a varying 'batch size' in data_shapes?

Hi, I have a question: how can I make predictions with input data whose shape is not fixed? I will try to describe it clearly:
I use MTCNN for face detection (it's fine if you're unfamiliar with it), which employs 3 networks: PNet, RNet, and ONet. PNet detects a large number of proposal face bounding boxes, these boxes are then refined coarse-to-fine by the remaining networks one after another, and finally I get precise face bounding boxes. The image fed to PNet has no fixed size, the number of proposal boxes output by PNet is also not fixed, and the same goes for RNet and ONet. Following another MTCNN implementation, I set large data_shapes (e.g., image size, batch size) when I bind the modules, initialize everything to zero, and then run prediction. That works, but isn't it redundant computation? (Question 1)
PNet:
max_img_w = 1000
max_img_h = 1000
sym, arg_params, aux_params = mx.model.load_checkpoint('det1', 0)
self.PNets = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
self.PNets.bind(data_shapes=[('data', (1, 3, max_img_w, max_img_h))], for_training=False)
self.PNets.set_params(arg_params, aux_params)
RNet:
sym, arg_params, aux_params = mx.model.load_checkpoint('det2', 0)
self.RNet = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
self.RNet.bind(data_shapes=[('data', (2048, 3, 24, 24))], for_training=False)
self.RNet.set_params(arg_params, aux_params, allow_missing=True)
ONet:
sym, arg_params, aux_params = mx.model.load_checkpoint('det3', 0)
self.ONet = mx.mod.Module(symbol=sym, context=ctx, label_names=None)
self.ONet.bind(data_shapes=[('data', (256, 3, 48, 48))], for_training=False)
self.ONet.set_params(arg_params, aux_params, allow_missing=True)
I also tried mx.mod.Module.reshape before predicting, to adjust the data shape according to the previous network's output, but I get this error: (Question 2)
AssertionError: Shape of unspecified array arg:prob1_label changed. This can cause the new executor to not share parameters with the old one. Please check for error in the network. If this is intended, set partial_shaping=True to suppress this warning.
One more thing: the MTCNN code (https://github.com/pangyupo/mxnet_mtcnn_face_detection) primarily uses a deprecated function to load the models:
self.PNet = mx.model.FeedForward.load('det1', 0)
A single line that works with arbitrary data_shapes; why was this function deprecated? (Question 3)
One small difference I found is that after loading the model, FeedForward takes no memory before making a prediction, while mx.mod.Module takes up memory as soon as it is loaded, and its usage increases noticeably after making one prediction.
You can use MXNet's imperative API, Gluon, which lets you use different batch sizes.
If, as in this case, your model was trained using the symbolic API or has been exported in the serialized MXNet format (e.g. '-0001.params', '-symbol.json'), you can load it in Gluon this way:
import mxnet as mx
from mxnet import gluon

ctx = mx.cpu()
sym = mx.sym.load_json(open('det1-symbol.json', 'r').read())
PNet = gluon.nn.SymbolBlock(outputs=sym, inputs=mx.sym.var('data'))
PNet.load_params('det1-0001.params', ctx=ctx)
Then you can use it the following way:
# a given batch size (1)
data1 = mx.nd.ones((1, C, W, H))
output1 = PNet(data1)
# a different batch size (5)
data2 = mx.nd.ones((5, C, W, H))
output2 = PNet(data2)
And it would work.
You can get started with MXNet Gluon with the official 60 minutes crash course

How to assign values from one graph to another graph with the same structure in tensorflow?

I'm trying to implement DQN in tensorflow. I have one target network and one training network that share the same structure. At the beginning of every 10000 training steps, I want to load the values from a checkpoint into both the target network and the training network, then apply stop_gradient to the target network. However, I tried the following approaches, and none of them worked:
1. Put the two networks in one graph. However, every time I load them, I don't know how to assign the values of the training-network part to the target-network part. (They are saved as different variables, since one is stop-gradient.)
2. Define two graphs using tf.Graph() and run two sessions respectively. However, I can't load the checkpoint of one graph into another, even though they have the same structure. After all, they are two different graphs.
Can anyone give me some advice? Much appreciated!
The typical approach would be to put everything in one graph, put your two networks in two name scopes, create a tf.assign op for each variable from one scope to the other, and use tf.group to construct a final "copying" operation. Let's assume that the function create_net() builds a single network:
with tf.name_scope('main_network'):
    main_net = create_net()

with tf.name_scope('target_network'):
    target_network = create_net()

main_variables = tf.get_collection(tf.GraphKeys.VARIABLES, scope='main_network')
target_variables = tf.get_collection(tf.GraphKeys.VARIABLES, scope='target_network')

# I am assuming get_collection returns variables in the same order, please double
# check this is actually happening

assign_ops = []
for main_var, target_var in zip(main_variables, target_variables):
    assign_ops.append(tf.assign(target_var, tf.identity(main_var)))

copy_operation = tf.group(*assign_ops)
Now executing copy_operation in session.run should copy your main network parameters to the target network. The above code should be considered pseudo-code rather than something you can copy and paste.
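For concreteness, a rough usage sketch (not taken from the answer; train_op here is a hypothetical op that updates only the main network) of running the copy every 10000 steps:
import tensorflow as tf

# Hypothetical training loop: assumes `train_op` (updates main_network only)
# and `copy_operation` from the snippet above already exist in the graph.
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1, 100001):
        sess.run(train_op)            # gradient step on the main network
        if step % 10000 == 0:
            sess.run(copy_operation)  # target_network <- main_network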

Reading sequential data from TFRecords files within the TensorFlow graph?

I'm working with video data, but I believe this question should apply to any sequential data. I want to pass my RNN 10 sequential examples (video frames) from a TFRecords file. When I first start reading the file, I need to grab 10 examples and use them to create a sequence example, which is then pushed onto the queue for the RNN to take when it's ready. However, now that I have the 10 frames, the next time I read from the TFRecords file I only need to take 1 example and just shift the other 9 over. But when I hit the end of the first TFRecords file, I need to restart the process on the second TFRecords file. It's my understanding that the cond op will process the ops required under each condition even if that condition is not the one that is to be used. This would be a problem when using a condition to check whether to read 10 examples or only 1. Is there any way to resolve this problem and still get the desired result outlined above?
You can use the recently added Dataset.window() transformation in TensorFlow 1.12 to do this:
filenames = tf.data.Dataset.list_files(...)

# Define a function that will be applied to each filename, and return the
# sequences in that file.
def get_examples_from_file(filename):
    # Read and parse the examples from the file using the appropriate logic.
    examples = tf.data.TFRecordDataset(filename).map(...)

    # Selects a sliding window of 10 examples, shifting along 1 example at a time.
    sequences = examples.window(size=10, shift=1, drop_remainder=True)

    # Each element of `sequences` is a nested dataset containing 10 consecutive
    # examples. Use `Dataset.batch()` and get the resulting tensor to convert it
    # to a tensor value (or values, if there are multiple features in an example).
    return sequences.map(
        lambda d: tf.data.experimental.get_single_element(d.batch(10)))

# Alternatively, you can use `filenames.interleave()` to mix together sequences
# from different files.
sequences = filenames.flat_map(get_examples_from_file)
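A possible continuation (a sketch, not part of the original answer) is to batch the resulting 10-frame windows and pull them through a TF 1.12-style iterator for the RNN input pipeline:
# Hypothetical follow-up: batch the sliding windows and fetch them as tensors.
batched = sequences.batch(32)
iterator = batched.make_one_shot_iterator()
next_sequence_batch = iterator.get_next()  # shape [32, 10, ...] per feature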

Incorporating very large constants in Tensorflow

For example, the comments for the Tensorflow image captioning example model state:
NOTE: This script will consume around 100GB of disk space because each image
in the MSCOCO dataset is replicated ~5 times (once per caption) in the output.
This is done for two reasons:
1. In order to better shuffle the training data.
2. It makes it easier to perform asynchronous preprocessing of each image in
TensorFlow.
The primary goal of this question is to see if there is an alternative to this type of duplication. In my use case, storing the data in this way would require each image to be duplicated in the TFRecord files many more times, on the order of 20 - 50 times.
I should note first that I have already fed the images through VGGnet to extract 4096 dim features, and I have these stored as a mapping between filename and the vectors.
Before switching over to Tensorflow, I had been feeding batches containing filename strings and then looking up the corresponding vector on a per-batch basis. This allows me to store all of the image data in ~15GB without needing to duplicate the data on disk.
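For illustration only (the names below are hypothetical, not from the question), that per-batch lookup might look roughly like this:
import numpy as np

# Stand-in for the pickled filename -> 4096-dim VGG feature mapping described above.
img_feat = {"img_%d.jpg" % i: np.random.rand(4096).astype(np.float32)
            for i in range(10)}

def lookup_batch(filenames):
    # Stack the precomputed features for a batch of filename strings.
    return np.stack([img_feat[f] for f in filenames])

print(lookup_batch(["img_0.jpg", "img_3.jpg"]).shape)  # (2, 4096)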
My first attempt to do this in Tensorflow involved storing indices in the TFExample buffers and then doing a "preprocessing" step to slice into the corresponding matrix:
import numpy as np
import pandas as pd
import tensorflow as tf

img_feat = pd.read_pickle("img_feats.pkl")
img_matrix = np.stack(img_feat)
preloaded_images = tf.Variable(img_matrix)
first_image = tf.slice(preloaded_images, [0, 0], [1, 4096])
However, in this case, Tensorflow disallows a variable larger than 2GB. So my next thought was to partition this across several variables:
img_tensors = []
for i in range(NUM_SPLITS):
    with tf.Graph().as_default():
        img_tensors.append(tf.Variable(img_matrices[i], name="preloaded_images_%i" % i))

first_image = tf.concat(1, [tf.slice(t, [0, 0], [1, 4096 // NUM_SPLITS]) for t in img_tensors])
In this case, I'm forced to store each partition on a separate graph, because it seems any one graph cannot be this large either. However, now the concat fails because each tensor I am concatenating is on a separate graph.
Any advice on incorporating a large amount (~15GB) of preloaded data into the Tensorflow graph would be appreciated.
Potentially related is this question; however, in this case I'd like to override the decoding of the actual JPEG file with the preprocessed value in a tensor op.