How do you create an LSTM layer in CNTK from C++? - cntk

CNTK generally provides a great C++ API, but I'm struggling to find out how to build LSTM or GRU layers from the C++ API. The only function I can find is OptimizedRNNStack. That function seems quite self-explanatory, except for the weights variable. So far I haven't managed to figure out how to initialize that weights variable. Looking at, it seems to initialize the weights with:
ParameterTensor {0:0, initFilterRank=0, initOutputRank=-1, init=init,
but I can't figure out how to translate that into C++. For context - I'm trying to use CTC to build an OCR pipeline. Building everything in C++ is great, because I can use all the native data synthesis tools, and the entire pipeline can be trained and tested end-to-end. However, if I must build the model in Brainscript, I guess that's fine too.

The C++ API does not have a Layers library equivalent. I have struggled to do that, as the static-typing nature of C++ makes it hard to support all those options. Let me share a private piece of C++ code that creates a GRU similar to the one in the Layers library (without all the options).
Sorry this is not directly copy-pastable; please try to change the return value to Function, and change the lambda signature by creating two PlaceholderVariables, dh, and x. That funny let is short for const auto.
static BinaryModel GRU(size_t outputDim, const DeviceDescriptor& device)
{
    let activation = [](const Variable& x) { return Tanh(x); };
    auto W = Parameter({ outputDim * 3, NDShape::InferredDimension }, DataType::Float, GlorotUniformInitializer(), device, L"W");
    auto R = Parameter({ outputDim * 2, outputDim }, DataType::Float, GlorotUniformInitializer(), device, L"R");
    auto R1 = Parameter({ outputDim , outputDim }, DataType::Float, GlorotUniformInitializer(), device, L"R1");
    auto b = Parameter({ outputDim * 3 }, 0.0f, device, L"b");
    let stackAxis = vector<Axis>{ Axis(0) };
    let stackedDim = (int)outputDim;
    let one = Constant::Scalar(1.0f, device); // for "1 -"...
    // e.g.
    return BinaryModel({ W, R, R1, b }, [=](const Variable& dh, const Variable& x)
    {
        let& dhs = dh;
        // projected contribution from input(s), hidden, and bias
        let projx3 = b + Times(W, x);
        let projh2 = Times(R, dh);
        let zt_proj = Slice(projx3, stackAxis, 0 * stackedDim, 1 * stackedDim) + Slice(projh2, stackAxis, 0 * stackedDim, 1 * stackedDim);
        let rt_proj = Slice(projx3, stackAxis, 1 * stackedDim, 2 * stackedDim) + Slice(projh2, stackAxis, 1 * stackedDim, 2 * stackedDim);
        let ct_proj = Slice(projx3, stackAxis, 2 * stackedDim, 3 * stackedDim);
        let zt = Sigmoid(zt_proj)->Output(); // fun update gate z(t)
        let rt = Sigmoid(rt_proj); // reset gate r(t)
        let rs = dhs * rt; // "cell" c
        let ct = activation(ct_proj + Times(R1, rs));
        let ht = (one - zt) * ct + zt * dhs; // hidden state ht / output
        return ht;
    });
}
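To make the gate arithmetic in that snippet easier to check by hand, here is a minimal sketch of one GRU step in plain C++, with no CNTK types at all. The names `Vec`, `Mat`, `matVec` and `gruStep` are made up for illustration and are not part of the CNTK API; the shapes follow the snippet (W is 3n x m, R is 2n x n, R1 is n x n, b has 3n entries).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper types for illustration only; not part of CNTK.
using Vec = std::vector<float>;
using Mat = std::vector<Vec>; // row-major: M[row][col]

static float sigmoid(float v) { return 1.0f / (1.0f + std::exp(-v)); }

// y = M * x
static Vec matVec(const Mat& M, const Vec& x) {
    Vec y(M.size(), 0.0f);
    for (std::size_t r = 0; r < M.size(); ++r) {
        assert(M[r].size() == x.size());
        for (std::size_t c = 0; c < x.size(); ++c) y[r] += M[r][c] * x[c];
    }
    return y;
}

// One GRU step with the same slicing as the snippet:
// rows [0, n) feed the update gate, [n, 2n) the reset gate, [2n, 3n) the candidate.
Vec gruStep(const Mat& W, const Mat& R, const Mat& R1, const Vec& b,
            const Vec& dh, const Vec& x) {
    const std::size_t n = dh.size();
    assert(W.size() == 3 * n && R.size() == 2 * n && R1.size() == n && b.size() == 3 * n);

    Vec projx3 = matVec(W, x);
    for (std::size_t i = 0; i < 3 * n; ++i) projx3[i] += b[i];
    const Vec projh2 = matVec(R, dh);

    Vec zt(n), rs(n);
    for (std::size_t i = 0; i < n; ++i) {
        zt[i] = sigmoid(projx3[i] + projh2[i]);                  // update gate z(t)
        const float rt = sigmoid(projx3[n + i] + projh2[n + i]); // reset gate r(t)
        rs[i] = dh[i] * rt;                                      // reset-scaled previous state
    }

    const Vec r1rs = matVec(R1, rs);
    Vec ht(n);
    for (std::size_t i = 0; i < n; ++i) {
        const float ct = std::tanh(projx3[2 * n + i] + r1rs[i]); // candidate state
        ht[i] = (1.0f - zt[i]) * ct + zt[i] * dh[i];             // new hidden state
    }
    return ht;
}
```

With all weights and biases set to zero, both gates evaluate to 0.5 and the candidate to 0, so the new state is simply half the previous one, which is a quick way to sanity-check a port.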


Is write_image atomic? Is it better to use atomic_max?

Full disclosure: I am cross-posting from the Khronos OpenCL forums, since I have not received any reply there so far:
I’m writing a connected components labelling algorithm for images (2d and 3d); I found no existing implementations and decided to write one based on pointer jumping and a “recollection step” (btw: if you are aware of an easy-to-use, production ready connected component labelling let me know).
The “recollection” step kernel pseudocode for 2d images is as follows:
1) global_id = (x,y)
2) read v from img[x,y], decode it to a pair (tx,ty)
3) read v1 from img[tx,ty]
4) do some calculations to extract a boolean value C and a target value T from v1, v, and the neighbours of (x,y) and (tx,ty)
5) *** IF ( C ) THEN WRITE T INTO (tx,ty).
Q1: all the kernels where “C” is true will compete for writing. Suppose it does not matter which one wins (writes last). I’ve done some tests on an intel GPU, and (with filtering disabled, and clamping enabled) there seems to be no issue at all, write_image seems to be atomic, there is a winning value and my algorithm converges very fast. Can I safely assume that write_image on “unfiltered” images is atomic?
Q2: What I really need is to write into (tx,ty) the maximum T obtained from each kernel. That would involve using buffers instead of images, do clamping myself (or use a larger buffer padded with zeroes), and ** using atomic_max in each kernel**. I did not do this yet out of laziness since I need to change my code to use a buffer just to test it, but I believe it would be far slower. Am I right?
For completeness, here is my actual kernel (to be optimized, any suggestions welcome!)
__kernel void color_components2(/* base image */ __read_only image2d_t image,
/* uint32 */ __read_only image2d_t inputImage1,
__write_only image2d_t outImage1) {
int2 gid = (int2)(get_global_id(0), get_global_id(1));
int x = gid.x;
int y = gid.y;
int lock = 0;
int2 size = get_image_dim(inputImage1);
const sampler_t sampler =
uint4 base = read_imageui(image, sampler, gid);
uint4 ui4a = read_imageui(inputImage1, sampler, gid);
int2 t = (int2)(ui4a[0] % size.x, ui4a[0] / size.x);
unsigned int m = ui4a[0];
unsigned int n = ui4a[0];
if (base[0] > 0) {
for (int a = -1; a <= 1; a++)
for (int b = -1; b <= 1; b++) {
uint4 tmpa =
read_imageui(inputImage1, sampler, (int2)(t.x + a, t.y + b));
m = max(tmpa[0], m);
uint4 tmpb = read_imageui(inputImage1, sampler, (int2)(x + a, y + b));
n = max(tmpb[0], n);
if(n > m) write_imageui(outImage1,t,(uint4)(n,0,0,0));
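For what Q2 describes (many writers, and the largest T must survive), the usual building block is an atomic maximum. OpenCL C provides `atomic_max` for 32-bit integers in global memory; the sketch below shows the same idea on the host side in C++ with a compare-and-swap loop. It is only an illustration of the semantics, not a measurement of how fast the OpenCL version would be, and the function name `atomicMax` is made up here.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Raises `cell` to at least `candidate` without losing a larger value
// that another thread may have written in the meantime.
void atomicMax(std::atomic<std::uint32_t>& cell, std::uint32_t candidate) {
    std::uint32_t seen = cell.load(std::memory_order_relaxed);
    // On failure compare_exchange_weak refreshes `seen` with the current value,
    // so the loop ends once the cell already holds >= candidate or our store lands.
    while (seen < candidate &&
           !cell.compare_exchange_weak(seen, candidate, std::memory_order_relaxed)) {
    }
    // Holds as long as every writer goes through atomicMax (values never decrease).
    assert(cell.load(std::memory_order_relaxed) >= candidate);
}
```

The difference from a plain "last writer wins" store is that a smaller T arriving late can never overwrite a larger one, which is the property step 5 needs if the maximum matters.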

Tensorflow retrained graph in C# (Tensorflowsharp)

I'm trying to use a retrained Inception model with TensorFlowSharp in Unity.
The retrained model was prepared with optimize_for_inference and works like a charm in Python.
But it is pretty inaccurate in C#.
The code works like this:
First I get the picture:
//webcamtexture transformed to picture in jpg
var pic = _texture.EncodeToJpg();
//added Picture to queue for the object detection thread
After that a thread will handle each collected picture
public void HandlePicture(byte[] picture)
{
    var tensor = ImageUtil.CreateTensorFromImageFile(picture);
    var runner = session.GetRunner();
    runner.AddInput(g_input, tensor).Fetch(g_output);
    var output = runner.Run();
    var bestIdx = 0;
    float best = 0;
    var result = output[0];
    var rshape = result.Shape;
    var probabilities = ((float[][])result.GetValue(jagged: true))[0];
    for (int r = 0; r < probabilities.Length; r++)
    {
        if (probabilities[r] > best)
        {
            bestIdx = r;
            best = probabilities[r];
        }
    }
    Debug.Log("Tensorflow thinks this is: " + labels[bestIdx] + " Prob : " + best * 100);
}
So my guesses are:
1. It has something to do with retrained graphs (because I can't find any application or test where one is used and working).
2. It has something to do with how I transform the picture into a tensor (if that is wrong I could use some help there; the code is further down).
To transform the picture I'm also using a graph, as in the TensorFlowSharp example:
public static class ImageUtil
// Convert the image in filename to a Tensor suitable as input to the Inception model.
public static TFTensor CreateTensorFromImageFile(byte[] contents, TFDataType destinationDataType = TFDataType.Float)
// DecodeJpeg uses a scalar String-valued tensor as input.
var tensor = TFTensor.CreateString(contents);
TFGraph graph;
TFOutput input, output;
// Construct a graph to normalize the image
ConstructGraphToNormalizeImage(out graph, out input, out output, destinationDataType);
// Execute that graph to normalize this one image
using (var session = new TFSession(graph))
var normalized = session.Run(
inputs: new[] { input },
inputValues: new[] { tensor },
outputs: new[] { output });
return normalized[0];
// The inception model takes as input the image described by a Tensor in a very
// specific normalized format (a particular image size, shape of the input tensor,
// normalized pixel values etc.).
// This function constructs a graph of TensorFlow operations which takes as
// input a JPEG-encoded string and returns a tensor suitable as input to the
// inception model.
private static void ConstructGraphToNormalizeImage(out TFGraph graph, out TFOutput input, out TFOutput output, TFDataType destinationDataType = TFDataType.Float)
// Some constants specific to the pre-trained model at:
// - The model was trained after with images scaled to 224x224 pixels.
// - The colors, represented as R, G, B in 1-byte each were converted to
// float using (value - Mean)/Scale.
const int W = 299;
const int H = 299;
const float Mean = 128;
const float Scale = 1;
graph = new TFGraph();
input = graph.Placeholder(TFDataType.String);
output = graph.Cast(graph.Div(
x: graph.Sub(
x: graph.ResizeBilinear(
images: graph.ExpandDims(
input: graph.Cast(
graph.DecodeJpeg(contents: input, channels: 3), DstT: TFDataType.Float),
dim: graph.Const(0, "make_batch")),
size: graph.Const(new int[] { W, H }, "size")),
y: graph.Const(Mean, "mean")),
y: graph.Const(Scale, "scale")), destinationDataType);
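Since the comment in that helper describes the conversion as (value - Mean)/Scale, it may help to see what range the constants actually produce. The sketch below is plain C++ rather than TensorFlowSharp, and `normalizePixels` is a made-up name; it only evaluates the formula on a few 8-bit values. Which range the retrained graph expects is an assumption to verify against the Python script that works.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Maps 8-bit channel values to floats via (value - mean) / scale.
std::vector<float> normalizePixels(const std::vector<std::uint8_t>& rgb,
                                   float mean, float scale) {
    assert(scale != 0.0f);
    std::vector<float> out;
    out.reserve(rgb.size());
    for (std::uint8_t v : rgb) {
        out.push_back((static_cast<float>(v) - mean) / scale);
    }
    return out;
}
```

With Mean = 128 and Scale = 1, as in the code above, the inputs 0, 128 and 255 become -128, 0 and 127. With Scale = 128 they become -1, 0 and roughly 0.99. If the Python pipeline feeds the second range and the C# one feeds the first, that alone could explain a large accuracy gap, so it is worth comparing the two.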

calculating forward kinematics using D-H matrix

I have a 6-DOF robot arm model:
robot arm structure
I want to calculate forward kinematics, so I use the D-H matrix. The D-H parameters are:
static const std::vector<float> theta = {
// d
static const std::vector<float> d = {
// a
static const std::vector<float> a = {
// alpha
static const std::vector<float> alpha = {
and the calculation:
glm::mat4 Robothand::armForKinematics() noexcept
{
    glm::mat4 pose(1.0f);
    float cos_theta, sin_theta, cos_alpha, sin_alpha;
    for (auto i = 0; i < 6;i++)
    {
        cos_theta = cosf(glm::radians(theta[i]));
        sin_theta = sinf(glm::radians(theta[i]));
        cos_alpha = cosf(glm::radians(alpha[i]));
        sin_alpha = sinf(glm::radians(alpha[i]));
        glm::mat4 Ai = {
            cos_theta, -sin_theta * cos_alpha,sin_theta * sin_alpha, a[i] * cos_theta,
            sin_theta, cos_theta * cos_alpha, -cos_theta * sin_alpha,a[i] * sin_theta,
            0, sin_alpha, cos_alpha, d[i],
            0, 0, 0, 1 };
        pose = pose * Ai;
    }
    return pose;
}
The problem is that I can't get the correct result. For example, to calculate the transformation matrix from the first joint to the 4th joint, I change the for loop to i < 3 and get the pose matrix; I can then get the origin of the 4th coordinate system with pose * (0,0,0,1). But the result (380.948, 382.331, 0) does not seem correct, because it should move along the x-axis, not the y-axis. I have read many books and materials about the D-H matrix, but I can't figure out what's wrong.
I have figured it out myself. The real problem is glm::mat4: it is column-major, so the values in an initializer list fill columns first, not rows, and the matrix I wrote out row by row was actually the transpose of what I intended. I changed the code and got the correct result:
for (int i = 0; i < joint_num; ++i)
{
    pose = glm::rotate(pose, glm::radians(degrees[i]), glm::vec3(0, 0, 1));
    pose = glm::translate(pose,glm::vec3(0,0,d[i]));
    pose = glm::translate(pose, glm::vec3(a[i], 0, 0));
    pose = glm::rotate(pose,glm::radians(alpha[i]),glm::vec3(1,0,0));
}
then I can get the position by:
auto pos = pose * glm::vec4(x,y,z,1);
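For anyone who wants to check the numbers without glm, here is a small sketch that builds the same D-H matrix in an explicitly row-major 4x4 and chains two links. `Mat4`, `dhMatrix`, `multiply` and `transpose` are hypothetical helpers written for this post, not glm functions, and angles are taken in radians.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Row-major 4x4: m[row][col]. Illustration only; glm::mat4 stores columns first.
using Mat4 = std::array<std::array<double, 4>, 4>;

// Standard D-H link transform Rz(theta) * Tz(d) * Tx(a) * Rx(alpha).
Mat4 dhMatrix(double thetaRad, double d, double a, double alphaRad) {
    const double ct = std::cos(thetaRad), st = std::sin(thetaRad);
    const double ca = std::cos(alphaRad), sa = std::sin(alphaRad);
    Mat4 m{};
    m[0] = {ct, -st * ca, st * sa, a * ct};
    m[1] = {st, ct * ca, -ct * sa, a * st};
    m[2] = {0.0, sa, ca, d};
    m[3] = {0.0, 0.0, 0.0, 1.0};
    return m;
}

Mat4 multiply(const Mat4& A, const Mat4& B) {
    Mat4 C{};
    for (std::size_t r = 0; r < 4; ++r)
        for (std::size_t c = 0; c < 4; ++c)
            for (std::size_t k = 0; k < 4; ++k)
                C[r][c] += A[r][k] * B[k][c];
    return C;
}

// What a column-major library effectively sees when handed the rows above in order.
Mat4 transpose(const Mat4& A) {
    Mat4 T{};
    for (std::size_t r = 0; r < 4; ++r)
        for (std::size_t c = 0; c < 4; ++c)
            T[c][r] = A[r][c];
    return T;
}
```

In the row-major matrix the translation sits in the last column (`m[0][3]`, `m[1][3]`, `m[2][3]`). After `transpose` it ends up in the last row instead, which is why multiplying by (0,0,0,1) gave a position that pointed the wrong way.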

Bidirectional path tracing

I'm making a bidirectional path tracer and I have some troubles.
To be clear :
1) One point light
2) All objects are diffuse
3) All objects are spheres, even walls (they are very large)
The light emission is a 3D vector. The BRDF of a sphere is a 3D vector. Hard coded.
In the main function below I generate EyePath and LightPath then I connect them. At least I try.
In this post I will talk about the main function, then EyePath, then LightPath. The discussion of the connecting function will come once EyePath and LightPath are good.
First questions :
Is the generation of the first light point correct?
Do I need to compute this point's colour according to the emission of the light source, or is it just the emission? The alternative line is commented out where I'm filling the Vertices structure.
Do I need to translate fromLight in order to put it on the sphere?
The code below is taken from the main function. Above it there are two for loops going through all pixels. camera.o is the eye. cameraRayDir is the direction to the current pixel.
//The path light starting point is at the same position as the light
Ray fromLight(Vec(0, 24.3, 0), Vec());
Sphere light = spheres[7];
#define PDF 0.15915494309 // 1 / (2 * PI)
for(int i = 0; i < samps; ++i)
std::vector<Vertices> PathEye;
std::vector<Vertices> PathLight;
Vec cameraRayDir = cx * (double(x) / w - .5) + cy * (double(y) / h - .5) + camera.d;
Ray rayEye(camera.o, cameraRayDir.norm());
// Hemisphere oriented towards the top
fromLight.d = generateRayInHemisphere(fromLight.o,Vec(0,1,0)).d;
double f = clamp(;
Vertices vert;
vert.d = fromLight.d;
vert.x = fromLight.o; = 7;
vert.cos = f;
vert.n = Vec(0,1,0).norm();
// this one ?
//vert.couleur = spheres[7].e * f / PDF;
// Or this one ?
vert.couleur = spheres[7].e;
int sizeEye = generateEyePath(PathEye, rayEye, maxDepth);
int sizeLight = generateLightPath(PathLight, fromLight, maxDepth);
for (int s = 0; s < sizeLight; ++s)
for (int t = 1; t < sizeEye; ++t)
int depth = t + s - 1;
if ((s == 0 && t == 0) || depth < 0 || depth > maxDepth)
pixelValue = pixelValue + connectPaths(PathEye, PathLight, s, t);
For the EyePath I intersect the geometry, then I compute the illumination according to the distance to the light. The colour is black if the point is in shadow.
Second question: for the eye path and the direct illumination, is the computation correct? In a lot of code I've seen, people use the pdf even in direct illumination. But I'm only using a point light and spheres.
int generateEyePath(std::vector<Vertices>& v, Ray eye, int maxDepth)
double t;
int id = 0;
Vertices vert;
int RussianRoulette;
while(v.size() <= maxDepth)
if(distribRREye(generatorRREye) < 10)
// Intersect all the geometry
// id is the id of the intersected geometry in an array
intersect(eye, t, id);
const Sphere& obj = spheres[id];
// Intersection point
Vec x = eye.o + eye.d * t;
// normal
Vec n = (x - obj.p).norm();
Vec direction = light.p - x;
// Shadow ray
Ray RaytoLight = Ray(x, direction.norm());
const float distance = direction.length();
// shadow
const bool visibility = intersect(RaytoLight, t, id);
const Sphere &lumiere = spheres[id];
float degree = clamp( - x).norm()));
// If the intersected geometry is not a light, then in shadow
if(lumiere.e.x == 0)
vert.couleur = Vec();
else // else we compute the colour
// obj.c is the brdf, lumiere.e is the emission
vert.couleur = (obj.c).mult(lumiere.e / (distance * distance)) * degree;
vert.x = x; = id;
vert.n = n;
vert.d = eye.d.norm();
vert.cos = degree;
eye = generateRayInHemisphere(x,n);
return v.size();
For the LightPath, I compute the colour at a given point from the previous point's colour and the values at this point, as in ordinary path tracing.
Third question: is the colour computation correct?
int generateLightPath(std::vector<Vertices>& v, Ray fromLight, int maxDepth)
double t;
int id = 0;
Vertices vert;
Vec previous;
while(v.size() <= maxDepth)
if(distribRRLight(generatorRRLight) < 10)
previous = v.back().couleur;
intersect(fromLight, t, id);
// intersected geometry
const Sphere& obj = spheres[id];
// Intersection point
Vec x = fromLight.o + fromLight.d * t;
// normal
Vec n = (x - obj.p).norm();
double f = clamp(;
// obj.c is the brdf
vert.couleur = previous.mult(((obj.c / M_PI) * f) / PDF);
vert.x = x; = id;
vert.n = n;
vert.d = fromLight.d.norm();
vert.cos = f;
fromLight = generateRayInHemisphere(x,n);
return v.size();
For the moment I get this result.
The connecting function will come once EyePath and LightPath are good.
Thank you all
Try the spherical reference scene mentioned in this paper. I think then you can work out most of your questions by yourself since it has an analytical solution.
It would save you time to implement path tracing and light tracing first and verify your understanding with them, then try to combine them with weights.
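In the same spirit of checking against a known answer, here is a tiny sketch that tests the "cosine divided by PDF" weighting in isolation. It assumes directions are sampled uniformly over the hemisphere, which is what a PDF of 1 / (2 * PI) implies; whether generateRayInHemisphere really samples that way is something to confirm. The function name `estimateCosineIntegral` is made up. The integral of cos(theta) over the hemisphere is exactly pi, so the estimate should converge there.

```cpp
#include <cassert>
#include <cmath>
#include <random>

// Monte Carlo estimate of the hemisphere integral of cos(theta),
// using uniform direction sampling with pdf = 1 / (2 * pi) per solid angle.
double estimateCosineIntegral(int samples, unsigned seed) {
    assert(samples > 0);
    const double kPi = std::acos(-1.0);
    const double pdf = 1.0 / (2.0 * kPi);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> u01(0.0, 1.0);
    double sum = 0.0;
    for (int i = 0; i < samples; ++i) {
        // For directions uniform on the hemisphere, cos(theta) is uniform in [0, 1);
        // the azimuth does not affect this integrand, so it is not sampled here.
        const double cosTheta = u01(rng);
        sum += cosTheta / pdf;
    }
    return sum / samples;
}
```

If a diffuse BRDF of albedo / pi is multiplied in, the same estimator converges to the albedo, which makes a convenient check that a light path vertex neither gains nor loses energy on a white surface.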

Line/Ray-intersection not working as expected

I've been working on cobbling together a ray tracer. You know, for fun. So far most things are going as planned, but as soon as I started transforming my test spheres, it all went awry.
The fundamental concept is using one of standard shapes as origin, transforming the camera rays into object space, and then intersecting.
As long as the sphere is identical in object space and world space, it works as expected, but as soon as the spheres are scaled, normals and intersection points go wild.
I've been wracking my brains, and poring over this code over and over, but I just can't find the mistake. Fresh eyes would be much appreciated.
@implementation RTSphere
- (CGFloat)intersectsRay:(RTRay *)worldRay atPoint:(RTVector *)intersection normal:(RTVector *)normal material:(RTMaterial **)material {
RTRay *objectRay = [worldRay rayByTransformingByMatrix:self.inverseTransformation];
RTVector D = objectRay.direction;
RTVector O = objectRay.start;
CGFloat A, B, C;
A = RTVectorDotProduct(D, D);
B = 2 * RTVectorDotProduct(D,O);
C = RTVectorDotProduct(O, O) - 0.25;
CGFloat BB4AC = B * B - 4 * A * C;
if (BB4AC < 0.0) {
return -1.0;
CGFloat t0 = (-B - sqrt(BB4AC)) / 2 * A;
CGFloat t1 = (-B + sqrt(BB4AC)) / 2 * A;
if (t0 > t1) {
CGFloat tmp = t0;
t0 = t1;
t1 = tmp;
if (t1 < 0.0) {
return -1.0;
CGFloat t;
if (t0 < 0.0) {
t = t1;
} else {
t = t0;
if (material) {
*material = self.material;
if (intersection) {
RTVector isect_o = RTVectorAddition(objectRay.start, RTVectorMultiply(objectRay.direction, t));
*intersection = RTVectorMatrixMultiply(isect_o, self.transformation);
if (normal) {
RTVector normal_o = RTVectorSubtraction(isect_o, RTMakeVector(0.0, 0.0, 0.0));
RTVector normal_w = RTVectorUnit(RTVectorMatrixMultiply(normal_o, self.transformationForNormal));
*normal = normal_w;
return t;
Why are the normals and intersection points not translating into world space as expected?
Edit: I'm moderately confident that my vector and matrix functions are mathematically sound; and I'm thinking it's chiefly a method error, but I recognize that I could be wrong.
There is a lot of RT* code here "behind the scenes" that we have no way to know is correct, so I would start by making sure you have good unit tests of those math functions. The ones I would most suspect, from my experience managing transforms, are rayByTransformingByMatrix: and the value of inverseTransformation. I've found that this is very easy to get wrong when you combine transformations. Rotating and scaling is not the same as scaling and rotating.
At what point does it go wrong for you? Are you sure objectRay itself is correct? (If it isn't, then the rest of this function doesn't matter.) Again, unit test is your friend. You should hand-calculate several situations and then write unit tests to ensure that your methods return the right answers.
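One line in the posted method is a good candidate for exactly that kind of hand-calculated test. In C-family languages `x / 2 * A` parses as `(x / 2) * A`, so the two roots are only right when A is 1. That holds for a unit-length direction, but presumably not once the ray has been transformed by a scaling matrix, which would match the symptom. A sketch of the root computation in plain C++, with the made-up name `solveQuadratic`:

```cpp
#include <cassert>
#include <cmath>
#include <utility>

// Solves A*t^2 + B*t + C = 0 for real roots, returned in ascending order.
// Returns false when the discriminant is negative (the ray misses).
bool solveQuadratic(double A, double B, double C, double& t0, double& t1) {
    assert(A != 0.0);
    const double disc = B * B - 4.0 * A * C;
    if (disc < 0.0) return false;
    const double root = std::sqrt(disc);
    t0 = (-B - root) / (2.0 * A); // parentheses keep A in the denominator
    t1 = (-B + root) / (2.0 * A);
    if (t0 > t1) std::swap(t0, t1);
    return true;
}
```

A hand-checkable case: a sphere of radius 0.5 at the origin, a ray starting at (0, 0, -2) with the non-unit direction (0, 0, 2). Then A = 4, B = -8, C = 3.75, and the roots are 0.75 and 1.25, which put the hits at z = -0.5 and z = 0.5. Without the parentheses the same inputs give 12 and 20.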