Below is the pseudo-code of what I want to do.
I already know how to move a tensor to the GPU (.cuda()),
but I have no idea how to make a new tensor from a GPU pointer.
Is there a method I've missed?
I don't want to copy devPtr back to the host; I just want to make a GPU tensor from the pointer.
int main(void) {
float* devPtr;
cudaMalloc((void**)&devPtr, sizeof(float)*HOSTDATA_SIZE);
cudaMemcpy(devPtr, hostData, sizeof(float)*HOSTDATA_SIZE, cudaMemcpyHostToDevice);
torch::Tensor inA = /* make Tensor with devPtr which is already in GPU */;
torch::Tensor inB = torch::randn({1, 10, 512, 512}).cuda();
torch::Tensor out = torch::matmul(inA, inB);
std::cout << out << std::endl;
return 0;
}
I think this should work; can you confirm?
// Keep the sizes in a std::vector: an IntArrayRef built from a braced list would dangle.
std::vector<int64_t> dims = {1, 10, 512, 512};
auto gpu_tensor = torch::from_blob(devPtr, dims, torch::TensorOptions().device(torch::kCUDA));
Be careful: torch::from_blob does not take ownership of the pointer. If you need gpu_tensor to be independent of the lifetime of devPtr, you need to clone it.
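The ownership point can be illustrated without libtorch. The sketch below is only an analogy in host memory: BorrowedView and cloneView are hypothetical names standing in for a from_blob tensor and for clone(), not part of any library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a tensor created with from_blob: it only
// remembers the pointer and never copies or frees the memory.
struct BorrowedView {
    float* data;
    std::size_t size;
};

// Hypothetical stand-in for clone(): copies the elements into storage
// that the returned vector owns.
std::vector<float> cloneView(const BorrowedView& view) {
    assert(view.data != nullptr || view.size == 0);
    return std::vector<float>(view.data, view.data + view.size);
}
```

A view keeps following the original buffer, so it becomes invalid once that buffer is freed, while the clone stays usable on its own.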
I was developing an image classifier app in Android Studio with the Fashion-MNIST dataset, but I have a small problem. When I try to run the app, I get this common error:
java.lang.IllegalArgumentException: Cannot copy to a TensorFlowLite tensor (serving_default_conv2d_input:0) with 3136 bytes from a Java Buffer with 9408 bytes.
I know this is probably a mismatch between the model's input tensor and the buffer in my code. But it's confusing, because my code automatically takes the image size from the model, along with all the other info it needs. So I was wondering where the error is...
// Reads type and shape of input and output tensors, respectively.
int imageTensorIndex = 0;
int[] imageShape = tflite.getInputTensor(imageTensorIndex).shape(); // {1, height, width, 1}
imageSizeY = imageShape[1];
imageSizeX = imageShape[2];
DataType imageDataType = tflite.getInputTensor(imageTensorIndex).dataType();
int probabilityTensorIndex = 0;
int[] probabilityShape =
tflite.getOutputTensor(probabilityTensorIndex).shape(); // {1, 10}
DataType probabilityDataType = tflite.getOutputTensor(probabilityTensorIndex).dataType();
// Creates the input tensor.
inputImageBuffer = new TensorImage(imageDataType);
Maybe this is the problem... I'm creating imageShape like this: {1, height, width, 1}, and the data type is FLOAT32, so it is supposed to be {1, height, width, 4}, right?
Another reason could be the metadata. I populated the model with metadata and I have a .json file, but I don't know how to use it.
Note: if you want the notebook used to create the .tflite file, here you go.
The tensor buffer size is the element size (float32: 4 bytes) multiplied by the flat size of the tensor shape (1 * height * width * 1).
So the code snippet above needs to prepare float input data with the shape (1, height, width, 1), not (1, height, width, 4).
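As a quick check of the numbers in the error message, here is a minimal sketch of that size rule. The 28x28 image size is an assumption (3136 / 4 = 784 = 28 * 28), and tensorByteSize is a hypothetical helper, not a TensorFlow Lite call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Bytes needed for a tensor: product of the dimensions times the element size.
std::size_t tensorByteSize(const std::vector<std::size_t>& shape,
                           std::size_t bytesPerElement) {
    assert(bytesPerElement > 0);
    std::size_t count = 1;
    for (std::size_t dim : shape) count *= dim;
    return count * bytesPerElement;
}
```

With float32 elements, tensorByteSize({1, 28, 28, 1}, 4) gives the 3136 bytes the model expects, and tensorByteSize({1, 28, 28, 3}, 4) gives the 9408 bytes found in the Java buffer. That would be consistent with a three-channel image being loaded where a single channel is expected, although this is an inference from the numbers rather than something the question confirms.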
I'm trying to use the TensorFlow C API to load and execute a graph. It keeps failing and I can't figure out why.
I first use this Python script to create a very simple graph and save it to a file.
import tensorflow as tf
graph = tf.Graph()
with graph.as_default():
input = tf.placeholder(tf.float32, [10, 3], name='input')
output = tf.reduce_sum(input**2, name='output')
tf.train.write_graph(graph, '.', 'test.pbtxt')
Then I use this C++ code to load it in.
#include <fstream>
#include <iostream>
#include <string>
#include <c_api.h>
using namespace std;
int main() {
ifstream graphFile("test.pbtxt");
string graphText((istreambuf_iterator<char>(graphFile)), istreambuf_iterator<char>());
TF_Buffer* buffer = TF_NewBufferFromString(graphText.c_str(), graphText.size());
TF_Graph* graph = TF_NewGraph();
TF_ImportGraphDefOptions* importOptions = TF_NewImportGraphDefOptions();
TF_Status* status = TF_NewStatus();
TF_GraphImportGraphDef(graph, buffer, importOptions, status);
cout<<TF_GetCode(status)<<endl;
return 0;
}
The status code it prints is 3, or TF_INVALID_ARGUMENT. Which argument is invalid and why? I verified the file contents are loaded correctly into graphText, and all the other arguments are trivial.
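First of all, I think you should write the graph with as_graph_def() as a serialized binary GraphDef, which is what TF_GraphImportGraphDef expects, rather than the text format. In your case: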
with open('test.pb', 'wb') as f:
f.write(graph.as_graph_def().SerializeToString())
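Because test.pb is a binary file, the C++ side should also read it as raw bytes. A minimal sketch, where readBinaryFile is a hypothetical helper that could replace the ifstream lines in the question:

```cpp
#include <cassert>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Reads a whole file as raw bytes. std::ios::binary disables newline
// translation, which matters for a serialized protobuf on some platforms.
std::string readBinaryFile(const std::string& path) {
    assert(!path.empty());
    std::ifstream file(path, std::ios::binary);
    if (!file) {
        throw std::runtime_error("cannot open " + path);
    }
    return std::string((std::istreambuf_iterator<char>(file)),
                       std::istreambuf_iterator<char>());
}
```

The returned string can contain embedded zero bytes, so pass its data() and size() to TF_NewBufferFromString rather than relying on a null terminator.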
Apart from that, I recommend not using the C API directly, as it is error-prone and makes it easy to leak memory. Instead, I tried your code using cppflow, a C++ wrapper, and it works like a charm. I used the following code:
// Load model
Model model("../test.pb");
// Declare tensors by name
auto input = new Tensor(model, "input");
auto output = new Tensor(model, "output");
// Feed data
std::vector<float> data(30, 1);
input->set_data(data);
// Run and show
model.run(input, output);
std::cout << output->get_data<float>()[0] << std::endl;
In the Python interface, we can pass a mini-batch of examples to make predictions, like net([[1,2],[3,4],[5,6]]).
But in C++, I can't find a way to do this.
Suppose calling the net to predict a single example takes 10 ms. If there are 10000 examples to predict, that is 100 s.
void OneInputOneOutputPredict(PredictorHandle pred_hnd, std::vector<mx_float> vector_data, std::vector<mx_float> &output)
{
MXPredSetInput(pred_hnd, "data", vector_data.data(), vector_data.size());
// Do Predict Forward
MXPredForward(pred_hnd);
mx_uint output_index = 0;
mx_uint *shape = 0;
mx_uint shape_len;
MXPredGetOutputShape(pred_hnd, output_index, &shape, &shape_len);
size_t size = 1;
for (mx_uint i = 0; i < shape_len; ++i) size *= shape[i];
std::vector<float> data(size);
// Call outside assert so the read still happens when NDEBUG is defined.
int status = MXPredGetOutput(pred_hnd, output_index, &(data[0]), size);
assert(status == 0);
output = data;
}
//very long time
for(int step=0;step<10000;step++)
OneInputOneOutputPredict(pred_hnd, vector_data, vector_label);
Could we vectorize the code, or is there some other way in C++ to make prediction faster?
Originally, input_shape_data looks like this:
const mx_uint input_shape_data[4] = {1, static_cast<mx_uint>(data_len)};
Now, if I want to predict a mini-batch (batch size 3):
const mx_uint input_shape_data[4] = {3, static_cast<mx_uint>(data_len)};
If you are using a seq2seq model and the data looks like [[1,2],[3,4],[5,6]], just flatten it to the list {1,2,3,4,5,6} and everything is OK.
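The flattening step can be sketched as below. flattenBatch is a hypothetical helper, and it uses float on the assumption that mx_float is a float typedef in the predict API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Concatenates equally sized examples into one contiguous buffer,
// example after example, matching an input shape of {batch, data_len}.
std::vector<float> flattenBatch(const std::vector<std::vector<float>>& batch) {
    std::vector<float> flat;
    if (batch.empty()) return flat;
    const std::size_t dataLen = batch.front().size();
    flat.reserve(batch.size() * dataLen);
    for (const std::vector<float>& example : batch) {
        assert(example.size() == dataLen);  // every example must have the same length
        flat.insert(flat.end(), example.begin(), example.end());
    }
    return flat;
}
```

The resulting vector can then be passed to MXPredSetInput in a single call, with flat.size() as the size argument, so one forward pass covers the whole mini-batch.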
I'm writing a custom TensorFlow op following the tutorial, and I'm having trouble understanding how to read from and write to Tensors.
Let's say I have a Tensor in my OpKernel that I get from
const Tensor& values_tensor = context->input(0); (where context is an OpKernelContext*)
If that Tensor has shape, say, [2, 10, 20], how can I index into it (e.g. auto x = values_tensor[1, 4, 12], etc.)?
Equivalently, if I have
Tensor *output_tensor = NULL;
OP_REQUIRES_OK(context, context->allocate_output(
0,
{batch_size, value_len - window_size, window_size},
&output_tensor
));
How can I assign to output_tensor, like output_tensor[1, 2, 3] = 11, etc.?
Sorry for the dumb question, but the docs are really tripping me up here, and the examples in the TensorFlow kernel code for built-in ops somehow obfuscate this to the point that I get very confused :)
Thank you!
The easiest way to read from and write to tensorflow::Tensor objects is to convert them to an Eigen tensor, using the tensorflow::Tensor::tensor<T, NDIMS>() method. Note that you have to specify the (C++) type of the elements in the tensor as template parameter T.
For example, to read a particular value from a DT_FLOAT tensor:
const Tensor& values_tensor = context->input(0);
auto x = values_tensor.tensor<float, 3>()(1, 4, 12);
To write a particular value to a DT_FLOAT tensor:
Tensor* output_tensor = ...;
output_tensor->tensor<float, 3>()(1, 2, 3) = 11.0f;
There are also convenience methods for accessing a scalar, vector, or matrix: scalar<T>(), vec<T>(), and matrix<T>().
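For intuition about what the three-index call resolves to, here is a minimal sketch of the row-major offset arithmetic, assuming the row-major layout that TensorFlow tensors use. rowMajorOffset is a hypothetical helper, not a TensorFlow function.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Flat offset of element (i, j, k) in a row-major 3-D tensor of shape dims.
std::size_t rowMajorOffset(const std::array<std::size_t, 3>& dims,
                           const std::array<std::size_t, 3>& index) {
    for (std::size_t axis = 0; axis < 3; ++axis) {
        assert(index[axis] < dims[axis]);  // each index must be within its dimension
    }
    return (index[0] * dims[1] + index[1]) * dims[2] + index[2];
}
```

For the shape [2, 10, 20] from the question, element (1, 4, 12) sits at offset (1 * 10 + 4) * 20 + 12 = 292 in the flattened buffer.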
I have started learning OpenCL and I am currently trying to test how much I can improve performance for a simple skeletal animation algorithm. To do this I have written a program that performs skeletal animation from randomly generated vertices and transformation matrices twice, once with an SSE-optimized linear algebra library in plain C++, and once using my own OpenCL kernel on the GPU (I'm testing on an Nvidia GTX 460).
I started off with a simple kernel where each work-item transforms exactly one vertex, with all values read from global memory. Because I was not satisfied with the performance of this kernel, I tried to optimize a little. My current kernel looks like this:
inline float4 MultiplyMatrixVector(float16 m, float4 v)
{
return (float4) (
dot(m.s048C, v),
dot(m.s159D, v),
dot(m.s26AE, v),
dot(m.s37BF, v)
);
}
kernel void skelanim(global const float16* boneMats, global const float4* vertices, global const float4* weights, global const uint4* indices, global float4* resVertices)
{
int gid = get_global_id(0);
int lid = get_local_id(0);
local float16 lBoneMats[NUM_BONES];
async_work_group_copy(lBoneMats, boneMats, NUM_BONES, 0);
barrier(CLK_LOCAL_MEM_FENCE);
for (int i = 0 ; i < NUM_VERTICES_PER_WORK_ITEM ; i++) {
int vidx = gid*NUM_VERTICES_PER_WORK_ITEM + i;
float4 vertex = vertices[vidx];
float4 w = weights[vidx];
uint4 idx = indices[vidx];
resVertices[vidx] = (MultiplyMatrixVector(lBoneMats[idx.x], vertex * w.x)
+ MultiplyMatrixVector(lBoneMats[idx.y], vertex * w.y)
+ MultiplyMatrixVector(lBoneMats[idx.z], vertex * w.z)
+ MultiplyMatrixVector(lBoneMats[idx.w], vertex * w.w));
}
}
Now I process a constant number of vertices per work-item, and I prefetch all the bone matrices into local memory only once for each work-item, which I believed would lead to way better performance because the matrices for multiple vertices could be read from the faster local memory afterwards. Unfortunately, this kernel performs worse than my first attempt, and even worse than the CPU-only implementation.
Why is performance so bad with this should-be optimization?
If it helps, here is how I execute the kernel:
#define NUM_BONES 50
#define NUM_VERTICES 30000
#define NUM_VERTICES_PER_WORK_ITEM 100
#define NUM_ANIM_REPEAT 1000
uint64_t PerformOpenCLSkeletalAnimation(Matrix4* boneMats, Vector4* vertices, float* weights, uint32_t* indices, Vector4* resVertices)
{
File kernelFile("/home/alemariusnexus/test/skelanim.cl");
char opts[256];
sprintf(opts, "-D NUM_VERTICES=%u -D NUM_REPEAT=%u -D NUM_BONES=%u -D NUM_VERTICES_PER_WORK_ITEM=%u", NUM_VERTICES, NUM_ANIM_REPEAT, NUM_BONES, NUM_VERTICES_PER_WORK_ITEM);
cl_program prog = BuildOpenCLProgram(kernelFile, opts);
cl_kernel kernel = clCreateKernel(prog, "skelanim", NULL);
cl_mem boneMatBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_BONES*sizeof(Matrix4), boneMats, NULL);
cl_mem vertexBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*sizeof(Vector4), vertices, NULL);
cl_mem weightBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*4*sizeof(float), weights, NULL);
cl_mem indexBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*4*sizeof(uint32_t), indices, NULL);
cl_mem resVertexBuf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, NUM_VERTICES*sizeof(Vector4), NULL, NULL);
uint64_t s, e;
s = GetTickcount();
clSetKernelArg(kernel, 0, sizeof(cl_mem), &boneMatBuf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &vertexBuf);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &weightBuf);
clSetKernelArg(kernel, 3, sizeof(cl_mem), &indexBuf);
clSetKernelArg(kernel, 4, sizeof(cl_mem), &resVertexBuf);
size_t globalWorkSize[] = { NUM_VERTICES / NUM_VERTICES_PER_WORK_ITEM };
size_t localWorkSize[] = { NUM_BONES };
for (size_t i = 0 ; i < NUM_ANIM_REPEAT ; i++) {
clEnqueueNDRangeKernel(cq, kernel, 1, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
}
clEnqueueReadBuffer(cq, resVertexBuf, CL_TRUE, 0, NUM_VERTICES*sizeof(Vector4), resVertices, 0, NULL, NULL);
e = GetTickcount();
return e-s;
}
I guess there are more things that could be optimized, maybe batching some of the other global reads together, but first I would really like to know why this first optimization didn't work.
Two things are affecting the performance in your exercise.
1) OpenCL C is based on C99, where inline is only a hint: the compiler may ignore the keyword and emit a regular call, or it may inline the function. It is not required to inline.
So you are better off defining MultiplyMatrixVector as a preprocessor macro, though this is not a major problem in your case.
2) You treat the local memory (the LDM) incorrectly.
Although its latency is lower than that of global memory when it is accessed properly, local memory is subject to bank conflicts.
Your vertex index is calculated with a stride of 100 per work-item. The number of banks depends on the GPU in use, but it is usually 16 or 32, i.e. you can access up to 16 (32) four-byte LDM variables in one cycle without penalty if all of them are in different banks. Otherwise you get a bank conflict (two or more threads accessing the same bank), and the accesses are serialized.
The 50 threads in your work group (localWorkSize is NUM_BONES) access the array in LDM with no special arrangement to avoid bank conflicts. Moreover, the array elements are float16, i.e. a single element spans all 16 banks (or half of 32 banks). Thus you have a bank conflict in each row of the MultiplyMatrixVector function. The cumulative degree of that conflict is at least 16x32 (here 16 is the number of vector elements you access and 32 is the size of a half-wavefront or half-warp).
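The bank argument can be made concrete with a simplified model. The sketch below assumes 4-byte-wide banks with consecutive 4-byte words mapped to consecutive banks; real hardware may differ, and bankOfWord and banksTouched are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Assumed model: banks are 4 bytes wide and consecutive 4-byte words
// map to consecutive banks.
std::size_t bankOfWord(std::size_t wordIndex, std::size_t numBanks) {
    assert(numBanks > 0);
    return wordIndex % numBanks;
}

// Banks hit when work-items read the same component of different
// float16 matrices stored back to back.
std::set<std::size_t> banksTouched(std::size_t numMatrices,
                                   std::size_t component,
                                   std::size_t numBanks) {
    const std::size_t wordsPerMatrix = 16;  // a float16 holds 16 floats
    assert(component < wordsPerMatrix);
    std::set<std::size_t> banks;
    for (std::size_t m = 0; m < numMatrices; ++m) {
        banks.insert(bankOfWord(m * wordsPerMatrix + component, numBanks));
    }
    return banks;
}
```

Under this model, a given component of all 50 matrices falls into a single bank when there are 16 banks and into only two banks when there are 32, so work-items that read the same component of different matrices compete for the same bank.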
The solution here is not to copy that array to LDM, but to allocate it on the host with CL_MEM_READ_ONLY (which you already do) and declare the boneMats kernel argument with the __constant qualifier.
The OpenCL runtime then allocates the memory in the GPU's constant area, and access to that array is fast:
kernel void skelanim(__constant const float16* boneMats,
global const float4* vertices,
global const float4* weights,
global const uint4* indices,
global float4* resVertices)
It looks like EACH thread in a work group is copying the same 50 float16 matrices before the computation starts. This will saturate the global memory bandwidth.
Try this:
if ( lid == 0 )
{
async_work_group_copy(lBoneMats, boneMats, NUM_BONES, 0);
}
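This is meant to do the copy only once per work group. Be aware, though, that the OpenCL specification requires async_work_group_copy to be encountered by all work-items in a work group with the same arguments, otherwise the result is undefined. If you guard the copy like this, a plain copy loop followed by the barrier is the safer choice.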