How to predict a mini-batch of examples in C++ with MXNet?

In the Python interface, we can pass a mini-batch of examples to make a prediction, e.g. net([[1,2],[3,4],[5,6]]).
But in C++ I can't find a way to do this.
Suppose calling the net to predict a single example takes 10 ms. If there are 10,000 examples to predict, that is 100 seconds:
void OneInputOneOutputPredict(PredictorHandle pred_hnd, std::vector<mx_float> vector_data, std::vector<mx_float> &output)
{
    MXPredSetInput(pred_hnd, "data", vector_data.data(), vector_data.size());
    // Do Predict Forward
    MXPredForward(pred_hnd);

    mx_uint output_index = 0;
    mx_uint *shape = nullptr;
    mx_uint shape_len = 0;
    MXPredGetOutputShape(pred_hnd, output_index, &shape, &shape_len);

    size_t size = 1;
    for (mx_uint i = 0; i < shape_len; ++i) size *= shape[i];

    std::vector<float> data(size);
    assert(0 == MXPredGetOutput(pred_hnd, output_index, &(data[0]), size));
    output = data;
}
// Looping over 10,000 examples this way takes a very long time:
for (int step = 0; step < 10000; step++)
    OneInputOneOutputPredict(pred_hnd, vector_data, vector_label);

Could we vectorize this code, or is there some other way in C++ to make the prediction faster?

Originally, input_shape_data looks like this:
const mx_uint input_shape_data[2] = {1, static_cast<mx_uint>(data_len)};
Now, if I want to predict a mini-batch (batch size 3):
const mx_uint input_shape_data[2] = {3, static_cast<mx_uint>(data_len)};
When using a seq2seq model, if the data looks like [[1,2],[3,4],[5,6]], just flatten it into one list {1,2,3,4,5,6} and everything works.
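A rough sketch of what this looks like end to end (untested; the helper name OneBatchPredict is just for illustration): create the predictor with the batch dimension in input_shape_data as above, flatten the whole mini-batch into one buffer, and run a single forward pass per batch using the same c_predict_api.h calls as in the question.

#include <mxnet/c_predict_api.h>  // header path may differ depending on how MXNet is installed
#include <vector>
#include <cassert>

// Hypothetical helper: predict a whole mini-batch in one forward pass.
// batch_data holds batch_size * data_len floats in row-major order
// (e.g. [[1,2],[3,4],[5,6]] flattened to {1,2,3,4,5,6}).
void OneBatchPredict(PredictorHandle pred_hnd,
                     const std::vector<mx_float> &batch_data,
                     std::vector<mx_float> &output)
{
    MXPredSetInput(pred_hnd, "data", batch_data.data(), batch_data.size());
    MXPredForward(pred_hnd);

    mx_uint output_index = 0;
    mx_uint *shape = nullptr;
    mx_uint shape_len = 0;
    MXPredGetOutputShape(pred_hnd, output_index, &shape, &shape_len);

    size_t size = 1;
    for (mx_uint i = 0; i < shape_len; ++i) size *= shape[i];

    output.resize(size); // one row of output per example in the batch
    assert(0 == MXPredGetOutput(pred_hnd, output_index, output.data(), size));
}

With batch size 3, the 10,000 single-example calls above shrink to roughly 3,334 forward passes, so the per-call overhead is amortized across the batch.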

Related

Indexing in Rust ndarray crate based on a boolean mask

I would like to efficiently index into an ndarray using a boolean mask. To better convey what I mean, I have some working NumPy code and then my attempt in Rust ndarray, which works but is extremely inefficient.
Numpy:
import numpy as np
shape = (100, 100, 100)
grouping_array = np.random.randint(0, 100, size=shape)
data_array = np.random.rand(*shape)
for i in range(1, 100):
    ith_mean = data_array[grouping_array == i].mean()
    print(ith_mean)
Rust ndarray:
fn group_means(
    data: &Array<f32, IxDyn>,
    grouping_var: &Array<f32, IxDyn>,
    n_groups: i32,
) {
    for group in 1..n_groups {
        // Boolean mask for the current group.
        let index_array = grouping_var.mapv(|x| x == group as f32);
        // Zero out everything outside the group, then average.
        let group_data = Array::from_iter(
            data.iter()
                .zip(index_array.iter())
                .map(|(x, y)| if *y { *x } else { 0. }),
        );
        let group_mean = group_data.mean().unwrap();
        println!("group {}; mean {}", group, group_mean);
    }
}
Here each iteration of the n_groups loop takes about as long as the whole NumPy script, which finishes in less than a second. Is there a better way to do this in the rust-ndarray version?
This is likely not a surprise to others, but since my grouping_var array should (in my use case) always be a 3D array, I changed its type (and therefore also index_array) from &Array<f32, IxDyn> to &Array<f32, Ix3>, which dramatically improved performance.

How to use the 'sphereize data' option with PCA in TensorFlow

I have used PCA with the 'Sphereize data' option on the following page successfully: https://projector.tensorflow.org/
I wonder how to run the same computation locally using the TensorFlow API. I found PCA in the API documentation, but I am not sure whether sphereizing the data is available somewhere in the API too.
The "sphereize data" option normalizes the data by shifting each point by the centroid and making it unit norm.
Here is the code used in TensorBoard (in TypeScript):
normalize() {
  // Compute the centroid of all data points.
  let centroid = vector.centroid(this.points, (a) => a.vector);
  if (centroid == null) {
    throw Error('centroid should not be null');
  }
  // Shift all points by the centroid and make them unit norm.
  for (let id = 0; id < this.points.length; ++id) {
    let dataPoint = this.points[id];
    dataPoint.vector = vector.sub(dataPoint.vector, centroid);
    if (vector.norm2(dataPoint.vector) > 0) {
      // If we take the unit norm of a vector of all 0s, we get a vector of
      // all NaNs. We prevent that with a guard.
      vector.unit(dataPoint.vector);
    }
  }
}
You can reproduce that normalization using the following Python function:
def sphereize_data(x):
    """
    x is a 2D Tensor of shape (num_vectors, dim_vectors).
    Shift by the centroid, then scale each point to unit norm
    (tf.math.div_no_nan guards against all-zero rows).
    """
    centroids = tf.reduce_mean(x, axis=0, keepdims=True)
    centered = x - centroids
    return tf.math.div_no_nan(centered, tf.norm(centered, axis=1, keepdims=True))

TensorFlow Model is still floating point after Post-training quantization

After applying post-training quantization, my custom CNN model shrank to 1/4 of its original size (from 56.1 MB to 14 MB). I put the image (100x100x3) that is to be predicted into a ByteBuffer as 100x100x3 = 30,000 bytes. However, I got the following error during inference:
java.lang.IllegalArgumentException: Cannot convert between a TensorFlowLite buffer with 120000 bytes and a ByteBuffer with 30000 bytes.
at org.tensorflow.lite.Tensor.throwExceptionIfTypeIsIncompatible(Tensor.java:221)
at org.tensorflow.lite.Tensor.setTo(Tensor.java:93)
at org.tensorflow.lite.NativeInterpreterWrapper.run(NativeInterpreterWrapper.java:136)
at org.tensorflow.lite.Interpreter.runForMultipleInputsOutputs(Interpreter.java:216)
at org.tensorflow.lite.Interpreter.run(Interpreter.java:195)
at gov.nih.nlm.malaria_screener.imageProcessing.TFClassifier_Lite.recongnize(TFClassifier_Lite.java:102)
at gov.nih.nlm.malaria_screener.imageProcessing.TFClassifier_Lite.process_by_batch(TFClassifier_Lite.java:145)
at gov.nih.nlm.malaria_screener.Cells.runCells(Cells.java:269)
at gov.nih.nlm.malaria_screener.CameraActivity.ProcessThinSmearImage(CameraActivity.java:1020)
at gov.nih.nlm.malaria_screener.CameraActivity.access$600(CameraActivity.java:75)
at gov.nih.nlm.malaria_screener.CameraActivity$8.run(CameraActivity.java:810)
at java.lang.Thread.run(Thread.java:762)
The input image size for the model is 100x100x3. I'm currently predicting one image at a time, so I'm making the ByteBuffer 100x100x3 = 30,000 bytes. However, the log above says the TensorFlowLite buffer has 120,000 bytes. This makes me suspect that the converted tflite model is still in float format. Is this expected behavior? How can I get a quantized model that takes its input image in 8-bit precision, like the example from the TensorFlow official repository?
In the example code, the ByteBuffer used as input for tflite.run() is in 8-bit precision for the quantized model.
But I also read in the Google docs: "At inference, weights are converted from 8-bits of precision to floating-point and computed using floating point kernels." These two statements seem to contradict each other.
private static final int BATCH_SIZE = 1;
private static final int DIM_IMG_SIZE = 100;
private static final int DIM_PIXEL_SIZE = 3;
private static final int BYTE_NUM = 1;

imgData = ByteBuffer.allocateDirect(BYTE_NUM * BATCH_SIZE * DIM_IMG_SIZE * DIM_IMG_SIZE * DIM_PIXEL_SIZE);
imgData.order(ByteOrder.nativeOrder());

... ...

int pixel = 0;
for (int i = 0; i < DIM_IMG_SIZE; ++i) {
    for (int j = 0; j < DIM_IMG_SIZE; ++j) {
        final int val = intValues[pixel++];
        imgData.put((byte) ((val >> 16) & 0xFF));
        imgData.put((byte) ((val >> 8) & 0xFF));
        imgData.put((byte) (val & 0xFF));
        // imgData.putFloat(((val >> 16) & 0xFF) / 255.0f);
        // imgData.putFloat(((val >> 8) & 0xFF) / 255.0f);
        // imgData.putFloat((val & 0xFF) / 255.0f);
    }
}

... ...

tfLite.run(imgData, labelProb);
Post-training quantization code:
import tensorflow as tf
import sys
import os
saved_model_dir = '/home/yuh5/Downloads/malaria_thinsmear.h5.pb'
input_arrays = ["input_2"]
output_arrays = ["output_node0"]
converter = tf.contrib.lite.TocoConverter.from_frozen_graph(saved_model_dir, input_arrays, output_arrays)
converter.post_training_quantize = True
tflite_model = converter.convert()
open("thinSmear_100.tflite", "wb").write(tflite_model)
Post-training quantization does not change the format of the input or output layers; they stay float32, so you feed data in the same format as used for training. That is exactly what the error shows: 100 x 100 x 3 values x 4 bytes per float = 120,000 bytes, whereas your ByteBuffer holds 1 byte per value (30,000 bytes).
You may look into quantization-aware training to generate fully quantized models, but I have no experience with it.
As for the sentence "At inference, weights are converted from 8-bits of precision to floating-point and computed using floating point kernels": it means the weights are "de-quantized" to floating-point values in memory and computed with FP instructions, instead of performing integer operations.
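If you want to confirm what the converted model actually expects, you can inspect the input tensor's type and size after loading the .tflite file. Below is a rough sketch using the TFLite C++ API (untested; header locations and the exact API vary between releases, and newer Java releases expose a similar check via Interpreter.getInputTensor). For the model above it should report a float32 input of 120,000 bytes.

#include <cstdio>
#include <memory>
// Header locations vary across TF releases; these are the current tensorflow/lite paths.
#include "tensorflow/lite/interpreter.h"
#include "tensorflow/lite/kernels/register.h"
#include "tensorflow/lite/model.h"

int main() {
    // Load the converted model and build an interpreter.
    auto model = tflite::FlatBufferModel::BuildFromFile("thinSmear_100.tflite");
    tflite::ops::builtin::BuiltinOpResolver resolver;
    std::unique_ptr<tflite::Interpreter> interpreter;
    tflite::InterpreterBuilder(*model, resolver)(&interpreter);
    interpreter->AllocateTensors();

    // Post-training quantization leaves the input as float32, so `bytes`
    // should be 100 * 100 * 3 * sizeof(float) = 120000.
    const TfLiteTensor* input = interpreter->tensor(interpreter->inputs()[0]);
    std::printf("float32 input? %d, bytes: %zu\n",
                input->type == kTfLiteFloat32, input->bytes);
    return 0;
}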

In TensorFlow's C++ API, how to generate a graph file for visualization with TensorBoard?

There is a way to create a file with Python that can be visualized by TensorBoard (see here). I have tried this code and it works well.
import tensorflow as tf

a = tf.add(1, 2)
b = tf.multiply(a, 3)
c = tf.add(4, 5)
d = tf.multiply(c, 6)
e = tf.multiply(4, 5)
f = tf.div(c, 6)
g = tf.add(b, d)
h = tf.multiply(g, f)

with tf.Session() as sess:
    print(sess.run(h))

with tf.Session() as sess:
    writer = tf.summary.FileWriter("output", sess.graph)
    print(sess.run(h))
    writer.close()
Now I am using the TensorFlow C++ API to create my computations. How can I visualize them with TensorBoard?
There is also a FileWriter interface in the C++ API, but I have not seen any example. Is it the same interface?
See my answer here, which gives you a 26-liner in C++ to do this:
#include <tensorflow/core/util/events_writer.h>
#include <string>
#include <iostream>

void write_scalar(tensorflow::EventsWriter* writer, double wall_time, tensorflow::int64 step,
                  const std::string& tag, float simple_value) {
    tensorflow::Event event;
    event.set_wall_time(wall_time);
    event.set_step(step);
    tensorflow::Summary::Value* summ_val = event.mutable_summary()->add_value();
    summ_val->set_tag(tag);
    summ_val->set_simple_value(simple_value);
    writer->WriteEvent(event);
}

int main(int argc, char const *argv[]) {
    std::string event_file = "./events";
    tensorflow::EventsWriter writer(event_file);
    // Start at 1 to avoid dividing by zero in the fake loss value.
    for (int i = 1; i <= 150; ++i)
        write_scalar(&writer, i * 20, i, "loss", 150.f / i);
    return 0;
}
Looks like you want tensorflow::EventsWriter from tensorflow/core/util/events_writer.h. You'll need to manually create an Event object to use it, though.
The Python code in tf.summary.FileWriter handles a lot of the details for you, so I'd suggest only using the C++ API if absolutely necessary. Is there a compelling reason to implement your training in C++?
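Beyond scalars, if what you want in TensorBoard is the graph itself, one option is to serialize the GraphDef into the Event proto's graph_def field and write it with the same EventsWriter. This is a minimal sketch, assuming the C++ client API headers (scope.h, standard_ops.h) are available in your build; I haven't compiled it:

#include <tensorflow/cc/framework/scope.h>
#include <tensorflow/cc/ops/standard_ops.h>
#include <tensorflow/core/framework/graph.pb.h>
#include <tensorflow/core/util/events_writer.h>

int main() {
    using namespace tensorflow;

    // Build a tiny graph with the C++ client API.
    Scope root = Scope::NewRootScope();
    auto a = ops::Const(root.WithOpName("a"), 1);
    auto b = ops::Const(root.WithOpName("b"), 2);
    auto c = ops::Add(root.WithOpName("c"), a, b);

    // Serialize the graph into the Event proto's graph_def field.
    GraphDef graph_def;
    TF_CHECK_OK(root.ToGraphDef(&graph_def));

    Event event;
    event.set_graph_def(graph_def.SerializeAsString());

    // Same EventsWriter as above; point `tensorboard --logdir .` at this directory.
    EventsWriter writer("./events");
    writer.WriteEvent(event);
    return 0;
}

TensorBoard should then render the nodes in its Graphs tab, the same way it shows a graph written by tf.summary.FileWriter("output", sess.graph) in Python.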

org.apache.commons.math3.transform FastFourierTransformer returns different value when input is Complex[] and Double[]

For this question, I'm using the Commons Math library from Apache.
My aim is to get my input back after performing an inverse Fourier transform on the absolute values of the forward Fourier transform of the input.
When I perform an inverse Fourier transform on the Complex results of the forward Fourier transform of the input, I get the correct output.
What am I possibly doing wrong?
public void fourierTestTemp() {
    double[] input = new double[]{1, 0, 0, 0, 0, 0, 0, 66, 888, 0, 0, 0, 0, 0, 0, 0}; // Length = 16
    double[] result = new double[input.length]; // Holds the results of the Fourier transform
    FastFourierTransformer transformer = new FastFourierTransformer(DftNormalization.UNITARY); // The FastFourierTransformer class from Apache
    Complex[] complx = transformer.transform(input, TransformType.FORWARD); // Apply forward transform to the input double[]
    // Go through the Complex results and take their absolute values
    for (int i = 0; i < complx.length; i++) {
        result[i] = complx[i].abs();
    }
    // Perform the inverse transform on the absolute values from the forward transform
    complx = transformer.transform(result, TransformType.INVERSE);
    // Go through the Complex results and take their absolute values
    for (int i = 0; i < complx.length; i++) {
        result[i] = complx[i].abs();
    }
    // Print results
    for (int i = 0; i < result.length; i++) {
        System.out.print(result[i] + ",");
    }
}
ifft(abs(fft(x))) is only the identity if x is strictly symmetric (i.e., it can be constructed out of only the cosine basis vectors of the DFT). Your test vector is not.
Cosines are symmetric functions; sines are anti-symmetric.
If x is not symmetric, fft(x) will not be purely real, so abs() discards the phase of some components, which distorts the ifft output waveform.
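To make that argument concrete, here is the underlying math (a short derivation, independent of Commons Math), using the unitary DFT convention that DftNormalization.UNITARY corresponds to. For a real input $x$ of length $N$, write each coefficient in polar form, $X_k = |X_k|\,e^{i\varphi_k}$, with $X_{N-k} = \overline{X_k}$ and hence $|X_{N-k}| = |X_k|$. Taking absolute values replaces $X$ by $Y_k = |X_k|$, i.e. it sets every phase $\varphi_k$ to zero. Since $Y$ is real and satisfies $Y_k = Y_{N-k}$, its inverse transform

$$y_n = \frac{1}{\sqrt{N}} \sum_{k=0}^{N-1} Y_k\, e^{2\pi i k n / N}$$

is always real and even, $y_n = y_{N-n}$. So ifft(abs(fft(x))) can only reproduce $x$ when $x$ itself has that symmetry; the test vector {1, 0, ..., 66, 888, 0, ..., 0} does not, so the printed values cannot match the input.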