Combined vectorized functions in Numba - GPU

I'm using Numba (version 0.37.0) to optimize code for the GPU.
I would like to combine vectorized functions (using Numba's @vectorize decorator).
Imports & Data:
import numpy as np
from math import sqrt
from numba import vectorize, guvectorize
angles = np.random.uniform(-np.pi, np.pi, 10)
coords = np.stack([np.cos(angles), np.sin(angles)], axis=1)
This works as expected:
@guvectorize(['(float32[:], float32[:])'], '(i)->()', target='cuda')
def l2_norm(vec, out):
    acc = 0.0
    for value in vec:
        acc += value**2
    out[0] = sqrt(acc)

l2_norm(coords)
Output:
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.], dtype=float32)
But I'd like to avoid that for loop inside l2_norm by calling another vectorized function instead.
I've tried this:
@vectorize(["float32(float32)"], target="cuda")
def power(value):
    return value**2

@guvectorize(['(float32[:], float32[:])'], '(i)->()', target='cuda')
def l2_norm_power(vec, out):
    acc = 0.0
    acc = power(vec)
    acc = acc.sum()
    out[0] = sqrt(acc)

l2_norm_power(coords)
But this raises a TypingError:
TypingError: Failed at nopython (nopython frontend)
Untyped global name 'power': cannot determine Numba type of <class
'numba.cuda.dispatcher.CUDAUFuncDispatcher'>
Any idea how to perform this combination?
Any suggestions about other ways to optimize l2_norm with Numba?

I think you can only call device=True functions from other CUDA functions:
3.13.2. Example: Calling Device Functions
All CUDA ufunc kernels have the ability to call other CUDA device functions:
from numba import vectorize, cuda

# define a device function
@cuda.jit('float32(float32, float32, float32)', device=True, inline=True)
def cu_device_fn(x, y, z):
    return x ** y / z

# define a ufunc that calls our device function
@vectorize(['float32(float32, float32, float32)'], target='cuda')
def cu_ufunc(x, y, z):
    return cu_device_fn(x, y, z)
Note that you can therefore rewrite your helper as a cuda.jit function with device=True and call it from the guvectorize kernel:
@cuda.jit(device=True)
def sum_of_squares(arr):
    acc = 0
    for item in arr:
        acc += item ** 2
    return acc

@guvectorize(['(float32[:], float32[:])'], '(i)->()', target='cuda')
def l2_norm_power(vec, out):
    acc = sum_of_squares(vec)
    out[0] = sqrt(acc)

l2_norm_power(coords)
But that probably defeats the purpose of splitting it.
Since numba.vectorize doesn't support device=True, that's not possible for these functions. But that's actually a good thing, because vectorize allocates an array to put the values in: that intermediate array is unnecessary, and dynamic array allocation on the GPU is also very inefficient (and forbidden in Numba):
3.5.5. Numpy support
Due to the CUDA programming model, dynamic memory allocation inside a kernel is inefficient and is often not needed. Numba disallows any memory allocating features. This disables a large number of NumPy APIs. For best performance, users should write code such that each thread is dealing with a single element at a time.
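To make that intermediate concrete: on the default (CPU) target the same split does work at the host level, but power materializes a full temporary array before the reduction (a minimal sketch, assuming coords is cast to float32 to match the signature):

@vectorize(["float32(float32)"])  # default CPU target
def power_cpu(value):
    return value**2

squared = power_cpu(coords.astype(np.float32))  # whole intermediate array allocated here
norms = np.sqrt(squared.sum(axis=1))            # reduction happens in a second pass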
Given all that, I would simply use your original approach:
@guvectorize(['(float32[:], float32[:])'], '(i)->()', target='cuda')
def l2_norm(vec, out):
    acc = 0.0
    for value in vec:
        acc += value**2
    out[0] = sqrt(acc)

l2_norm(coords)
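As a quick sanity check, the GPU result can be compared against plain NumPy on the host (nothing Numba-specific, just np.linalg.norm along the rows):

np.testing.assert_allclose(l2_norm(coords),
                           np.linalg.norm(coords, axis=1),
                           rtol=1e-6)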

Related

Slow computation on Google Colab while solving a partial differential equation

I'm using Google Colab to solve the homogeneous heat equation. I had made a program earlier with SciPy using sparse matrices, which worked up to N = 10 (a hyperparameter), but I need to run it for N = 4 ... 1000 and so it won't work on my PC. I therefore converted the code to TensorFlow, but here I'm unable to use sparse matrices like I could in SciPy, and even the GPU/TPU computation is slow, slower than my PC. Problems that I'm facing in the code and need a solution for:
1) tf.contrib is removed, so I have to use an older version of TensorFlow for the odeint function. Where is it in 2.0?
2) It would be good if the computation could use sparse matrices, since the matrices are tridiagonal. I know about the sparse_dense_mul() function, but that returns a dense tensor and wouldn't do the job. The "func" function applies time-independent boundary conditions and then requires a matrix multiplication of (n×n) with (n×1), which gives (n×1), with multiple matrices.
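For reference, a sparse (n×n) times dense (n×1) product is available without densifying the matrix via tf.sparse.sparse_dense_matmul; a minimal sketch with hypothetical sizes:

import numpy as np
import tensorflow as tf

n = 5  # hypothetical size
# tridiagonal index pattern, already in lexicographic order
indices = [[i, i + j] for i in range(n) for j in (-1, 0, 1) if 0 <= i + j < n]
values = np.random.rand(len(indices)).astype(np.float32)
A = tf.sparse.SparseTensor(indices=indices, values=values, dense_shape=(n, n))
y = tf.constant(np.random.rand(n, 1), dtype=tf.float32)
Ay = tf.sparse.sparse_dense_matmul(A, y)  # dense (n, 1) result; A stays sparse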
Also, the program was running faster before I created the class.
It's also giving this warning:
WARNING: Logging before flag parsing goes to stderr.
W0829 09:12:24.415445 139855355791232 lazy_loader.py:50]
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
* https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
* https://github.com/tensorflow/addons
* https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.
W0829 09:12:24.645356 139855355791232 deprecation.py:323] From /usr/local/lib/python3.6/dist-packages/tensorflow/contrib/integrate/python/ops/odes.py:233: div (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Deprecated in favor of operator or tf.math.divide.
When I run the loop over range(2, 10), the tqdm bar does not display and the cell keeps running forever, but it works fine for range(2, 5), where the tqdm bar does appear.
# find a way to use sparse matrices
import math
import numpy as np
import pandas as pd
import tensorflow as tf
from tqdm import tqdm_notebook

# aliases assumed from context (the original snippet uses ts/tl/tm without defining them)
ts = tf.sparse
tl = tf.linalg
tm = tf.math

class Heat:
    def __init__(self, N):
        self.N = N
        self.H = 1/N
        self.A = ts.to_dense(ts.SparseTensor(
            indices=[[0, 0], [0, 1]]
                    + [[i, i+j] for i in range(1, N) for j in [-1, 0, 1]]
                    + [[N, N-1], [N, N]],
            values=self.H*np.array([1/3, 1/6] + [1/6, 2/3, 1/6]*(N-1) + [1/6, 1/3],
                                   dtype=np.float32),
            dense_shape=(N+1, N+1)))
        self.D = ts.to_dense(ts.SparseTensor(
            indices=[[0, 0], [0, 1]]
                    + [[i, i+j] for i in range(1, N) for j in [-1, 0, 1]]
                    + [[N, N-1], [N, N]],
            values=N*np.array([1-(1), -1-(-1)] + [-1, 2, -1]*(N-1) + [-1-(-1), 1-(1)],
                              dtype=np.float32),
            dense_shape=(N+1, N+1)))
        self.domain = tf.linspace(0.0, 1.0, N+1)

        def f(k):
            if k == 0:
                return (1 + math.pi**2)*(math.pi*self.H - math.sin(math.pi*self.H))/(math.pi**2*self.H)
            elif k == N:
                return -(1 + math.pi**2)*(-math.pi*self.H + math.sin(math.pi*self.H))/(math.pi**2*self.H)
            else:
                return -2*(1 + math.pi**2)*(math.cos(math.pi*self.H) - 1)*math.sin(math.pi*self.H*k)/(math.pi**2*self.H)

        # caution! shape changed: (1, N+1) is different from (N+1,)
        self.F = tf.constant([f(k) for k in range(N+1)], shape=(N+1,), dtype=tf.float32)
        self.exact = tm.scalar_mul(scalar=np.exp(1), x=tf.sin(math.pi*self.domain))

    def error(self):
        return np.linalg.norm(self.exact.numpy() - self.approx, 2)

    def func(self, y, t):
        y = tf.Variable(y)
        y = y[0].assign(0.0)
        y = y[self.N].assign(0.0)
        if self.N**2 > 100:
            y_dash = tl.matvec(tf.linalg.inv(self.A),
                               tl.matvec(a=tm.negative(self.D), b=y, a_is_sparse=True)
                               + tm.scalar_mul(scalar=math.exp(t), x=self.F))  # caution! F is (1, N+1), others too
        else:
            y_dash = tl.matvec(tf.linalg.inv(self.A),
                               tl.matvec(a=tm.negative(self.D), b=y)
                               + tm.scalar_mul(scalar=math.exp(t), x=self.F))  # caution! F is (1, N+1), others too
        # !! y_dash performs Hadamard-product-like multiplication, not matrix-like multiplication; returns 2-D
        y_dash = tf.Variable(y_dash)
        y_dash = y_dash[0].assign(0.0)
        y_dash = y_dash[self.N].assign(0.0)
        return y_dash

    def algo_1(self):
        self.approx = tf.contrib.integrate.odeint(
            func=self.func,
            y0=tf.sin(tm.scalar_mul(scalar=math.pi, x=self.domain)),
            t=tf.constant([0.0, 1.0]),
            rtol=1e-06,
            atol=1e-12,
            method='dopri5',
            options={"max_num_steps": 10**10},
            full_output=False,
            name=None
        ).numpy()[1]

    def algo_2(self):
        self.approx = tf.contrib.integrate.odeint_fixed(
            func=self.func,
            y0=tf.sin(tm.scalar_mul(scalar=math.pi, x=self.domain)),
            t=tf.constant([0.0, 1.0]),
            dt=tf.constant([self.H**2], dtype=tf.float32),
            method='rk4',
            name=None
        ).numpy()[1]

df = pd.DataFrame(columns=["NumBasis", "Errors"])
Ns = [2**r for r in range(2, 10)]
l = []
for i in tqdm_notebook(Ns):
    heateqn = Heat(i)
    heateqn.algo_1()
    l.append([i, heateqn.error()])
    df.append({"NumBasis": i, "Errors": heateqn.error()}, ignore_index=True)
    tf.keras.backend.clear_session()

Simple element-wise multiplication with Keras over TF

I am trying to implement the following in TensorFlow:
Input * const
i.e. an element-wise multiplication over a 6x640x800 tensor.
Here is the code:
ssValues = np.zeros(shape=(6,640,800), dtype=np.float16)
inputPlaceHolder = tf.compat.v1.placeholder(shape=(6,640,800), name='InputTensor', dtype=tf.dtypes.float16)
inputLayer = tf.keras.Input(shape=(6,640,800,),
                            batch_size=1,
                            name='inputLayer',
                            dtype=tf.dtypes.float16,
                            tensor=inputPlaceHolder)
ssConstant = tf.constant(ssValues, dtype=tf.dtypes.float16, shape=(6,640,800), name='ss')
ssm = tf.keras.layers.Multiply()([inputPlaceHolder, inputPlaceHolder])
model = tf.keras.models.Model(inputs=inputLayer, outputs=ssm)
input = np.zeros(shape=(6,640,800), dtype=np.float16)
output = model.predict(input)
I get the following error:
ValueError: ('Error when checking model input: expected no data, but got:', array([[[1., 1., 1., ..., 1., 1., 1.],
How can I overcome this error and run the predict function?
Why doesn't tf.keras.layers.multiply return a Layer object?
Your issue comes from the fact that you declared your operation on a v1 placeholder, when it should simply use inputLayer (which already acts as a placeholder for inputs following the provided specification).
Additionally, you wrote a multiplication that returns $x \times x$, when I think you wanted $x \times \text{constant}$; so here is the code:
inputLayer = tf.keras.Input(shape=(6,640,800,),
                            batch_size=1,
                            name='inputLayer',
                            dtype=tf.dtypes.float16)
ssConstant = tf.constant(  # also fixed a shape issue here
    ssValues, dtype=tf.dtypes.float16, shape=(1,6,640,800), name='ss'
)
ssm = tf.keras.layers.Multiply(dtype=tf.dtypes.float16)([inputLayer, ssConstant])
model = tf.keras.models.Model(inputs=inputLayer, outputs=ssm)
inputs = np.zeros(shape=(1,6,640,800), dtype=np.float16)
output = model.predict(inputs)
Furthermore, since this is not an actual model, in the sense that it uses a constant and no learnable weights, you might want to use tf.keras.backend.function instead of tf.keras.Model (but that is really up to you).
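A minimal sketch of that alternative, reusing the inputLayer, ssm, and inputs defined above (K.function compiles a callable from graph inputs to outputs without building a Model):

from tensorflow.keras import backend as K

multiply_fn = K.function([inputLayer], [ssm])  # inputs -> outputs, no Model object
result, = multiply_fn([inputs])                # returns a list of NumPy arrays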
Note that the shapes are probably not suited to what you actually want, with the batch-size of 1... Please consider using a batch-size of 6 to remove the useless dimensions.
When you use Input(shape), you already have a placeholder. It doesn't make sense to create a placeholder and then pass it to Input(tensor=placeholder), because this is not how Keras works.
You must:
inputs = Input(shape=(6,640,800))
ssm = Multiply()([inputs, inputs])
model = Model(inputs, ssm)
Since you always have a batch size with Keras:
input = np.zeros(shape=(1,6,640,800))

How can I update a tensor (weight value) when using two separate networks?

I've been trying to make an AI for blackjack using RL. Now I'm trying to build two separate networks, which is one way of implementing DQN. I've searched the web, found an approach, and tried to use it, but it failed.
This error has occurred:
TypeError: Using a tf.Tensor as a Python bool is not allowed. Use if t is not None: instead of if t: to test if a tensor is defined, and use TensorFlow ops such as tf.cond to execute subgraphs conditioned on the value of a tensor.
Code:
import gym
import tensorflow as tf
import matplotlib.pyplot as plt
import numpy as np

def one_hot(x):
    s = np.identity(600)
    b = s[x[0] * 20 + x[1] * 2 + x[2]]
    return b.reshape(1, 600)

def boolstr_to_floatstr(v):
    if v == True:
        return 1
    elif v == False:
        return 0

env = gym.make('Blackjack-v0')
learning_rate = 0.5
state_number = 600
action_number = 2

# network for updates
X = tf.placeholder(tf.float32, shape=[1, state_number], name='input_data')
W1 = tf.Variable(tf.random_uniform([state_number, 128], 0, 0.01))
layer1 = tf.nn.tanh(tf.matmul(X, W1))
W2 = tf.Variable(tf.random_uniform([128, 256], 0, 0.01))
layer2 = tf.nn.tanh(tf.matmul(layer1, W2))
W3 = tf.Variable(tf.random_uniform([256, action_number], 0, 0.01))
Qpred = tf.matmul(layer2, W3)  # Q prediction

# target network
X1 = tf.placeholder(shape=[1, state_number], dtype=tf.float32)
W4 = tf.Variable(tf.random_uniform([state_number, 128], 0, 0.01))
layer3 = tf.nn.tanh(tf.matmul(X1, W4))
W5 = tf.Variable(tf.random_uniform([128, 256], 0, 0.01))
layer4 = tf.nn.tanh(tf.matmul(layer3, W5))
W6 = tf.Variable(tf.random_uniform([256, action_number], 0, 0.01))
target = tf.matmul(layer4, W6)  # target

update1 = W4.assign(W1)
update2 = W5.assign(W2)
update3 = W6.assign(W3)

Y = tf.placeholder(shape=[1, action_number], dtype=tf.float32)
loss = tf.reduce_sum(tf.square(Y - Qpred))  # cost(W) = (Ws - y)^2
train = tf.train.GradientDescentOptimizer(learning_rate=learning_rate).minimize(loss)

num_episodes = 1000
dis = 0.99  # discount factor
rList = []  # record the rewards
init = tf.global_variables_initializer()

with tf.Session() as sess:
    sess.run(init)
    for i in range(num_episodes):
        s = env.reset()
        rALL = 0
        done = False
        e = 1./((i/100)+1)  # constant for the exploit-or-explore tradeoff
        total_loss = []
        while not done:
            s = np.asarray(s)
            s[2] = boolstr_to_floatstr(s[2])
            # print(np.shape(one_hot(s)))
            # print(one_hot(s))
            Qs = sess.run(Qpred, feed_dict={X: one_hot(s).astype(np.float32)})
            if np.random.rand(1) < e:  # try something new
                a = env.action_space.sample()
            else:
                a = np.argmax(Qs)  # otherwise pick the highest-valued action I know of
            s1, reward, done, _ = env.step(a)
            s1 = np.asarray(s1)
            s1[2] = boolstr_to_floatstr(s1[2])
            if done:
                Qs[0, a] = reward
            else:
                Qs1 = sess.run(target, feed_dict={X1: one_hot(s1)})
                Qs[0, a] = reward + dis*np.max(Qs1)  # optimal Q
            sess.run(train, feed_dict={X: one_hot(s), Y: Qs})
            if i % 10 == 0:  # update the target network from Qpred
                sess.run(update1, update2, update3)
            if reward == 1:
                rALL += reward
            else:
                rALL += 0
            s = s1
        rList.append(rALL)

print('success rate: ' + str(sum(rList)/num_episodes))
print("Final Q-table values")
I need to print the success rate at the end; before DQN it was around 38%. If there is something wrong in my code from the standpoint of the DQN algorithm, please tell me.
If you want to share the weights between different networks, simply create the layers with the same names inside a scope, using with tf.variable_scope(self.name, reuse=tf.AUTO_REUSE):, and the weights between the networks will then be shared automatically.
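A minimal sketch of that pattern (a hypothetical q_network helper, reusing the X and X1 placeholders from the question; TF 1.x graph mode):

def q_network(x, name="q_net"):
    # variables made with tf.get_variable inside an AUTO_REUSE scope are
    # created on the first call and reused (shared) on every later call
    with tf.variable_scope(name, reuse=tf.AUTO_REUSE):
        w1 = tf.get_variable("w1", shape=[600, 128],
                             initializer=tf.random_uniform_initializer(0, 0.01))
        h1 = tf.nn.tanh(tf.matmul(x, w1))
        w2 = tf.get_variable("w2", shape=[128, 2],
                             initializer=tf.random_uniform_initializer(0, 0.01))
        return tf.matmul(h1, w2)

q_main = q_network(X)     # first call creates q_net/w1 and q_net/w2
q_target = q_network(X1)  # same scope name, so the very same variables are reused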

Interpreting TensorFlow/TensorBoard "subtraction" operation

The following code is adapted from a simple learning example that I have bent out of shape to understand the TensorBoard graph visualizations:
import tensorflow as tf
import numpy as np
sess = tf.InteractiveSession()
# Create 100 phony x, y data points in NumPy, y = x * 0.1 + 0.3
x_data = np.random.rand(10).astype("float32")
y_data = x_data * 0.1 + 0.3
W = tf.Variable(tf.random_uniform([1], -1.0, 1.0, name = "internal_W"), name = "external_W")
b = tf.Variable(2*tf.zeros([1], name = "internal_b"), name = "doubled_b")
y = (W * x_data + b)
l1 = (y - y_data)
l2 = (y_data - y )
writer = tf.train.SummaryWriter("/tmp/test1", sess.graph_def)
init = tf.initialize_all_variables()
# Launch the graph.
sess = tf.Session()
sess.run(init)
print(sess.run(y))
print('---')
print((y_data))
print('---')
print(sess.run(l1))
print('---')
print(sess.run(l2))
A sample output of the print statements is:
[ 0.84253538 0.31011301 0.11627766 0.35491142 0.65550905 0.1798114
0.13632762 0.02010157 0.42960873 0.04218956]
---
[ 0.39195824 0.33384719 0.31269109 0.33873668 0.37154531 0.31962547
0.31487945 0.302194 0.3468895 0.30460477]
---
[ 0.45057714 -0.02373418 -0.19641343 0.01617473 0.28396374 -0.13981406
-0.17855182 -0.28209242 0.08271924 -0.2624152 ]
---
[-0.45057714 0.02373418 0.19641343 -0.01617473 -0.28396374 0.13981406
0.17855182 0.28209242 -0.08271924 0.2624152 ]
Clearly, the subtractions are working properly: the inputs to the subtraction are in different order and yield different outputs. However, the graph visualization (screenshot omitted) suggests otherwise.
Notice the "Sub" operators, which appear not to reverse the order of the operands as the code does. (Highlighting either operator yields no additional insight.) Am I missing something obvious, or do the node visualizations completely obscure the order of operands?
After futzing around with this, my considered answer to my own question is: "Yes, this is working as intended." The inputs to a node show only what the inputs are, not any particular relationship to the operation, the node, or each other; indeed, if one added a variable to itself in an operation node, the input variable would show up only once.
This is not a design choice I would have made, but that does seem to be the intent.
I still encourage others who may have more insight to comment or fully answer.
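To see that last point in the running example, one can wire the same tensor into both operand slots of a Sub node (a small sketch, assuming the session and tensors from the question):

l3 = y - y            # same tensor feeds both operands of this Sub op
print(sess.run(l3))   # all zeros; in the graph, y shows up as an input only once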

How to perform a coordinate affine transformation using Python? Part 2

I have the same problem as described here:
how to perform coordinates affine transformation using python?
I was trying to use the method described there, but for some reason I get error messages.
The changes I made to the code were to swap in my own primary-system and secondary-system points. I created the secondary coordinate points by using a different origin. In the real case for which I am studying this topic, there will be some errors when measuring the coordinates.
primary_system1 = (40.0, 1160.0, 0.0)
primary_system2 = (40.0, 40.0, 0.0)
primary_system3 = (260.0, 40.0, 0.0)
primary_system4 = (260.0, 1160.0, 0.0)
secondary_system1 = (610.0, 560.0, 0.0)
secondary_system2 = (610.0,-560.0, 0.0)
secondary_system3 = (390.0, -560.0, 0.0)
secondary_system4 = (390.0, 560.0, 0.0)
The error I get when executing is the following:
Traceback (most recent call last):
  File "affine_try.py", line 57, in <module>
    secondary_system3, secondary_system4 )
  File "affine_try.py", line 22, in solve_affine
    A2 = y * x.I
  File "/usr/lib/python2.7/dist-packages/numpy/matrixlib/defmatrix.py", line 850, in getI
    return asmatrix(func(self))
  File "/usr/lib/python2.7/dist-packages/numpy/linalg/linalg.py", line 445, in inv
    return wrap(solve(a, identity(a.shape[0], dtype=a.dtype)))
  File "/usr/lib/python2.7/dist-packages/numpy/linalg/linalg.py", line 328, in solve
    raise LinAlgError, 'Singular matrix'
numpy.linalg.linalg.LinAlgError: Singular matrix
What might be the problem?
The problem is that your matrix is singular, meaning it's not invertible. Since you're trying to take the inverse of it, that's a problem. The thread that you linked to is a basic solution to your problem, but it's not really the best solution. Rather than just inverting the matrix, what you actually want to do is solve a least-squares minimization problem to find the optimal affine transform matrix for your possibly noisy data. Here's how you would do that:
import numpy as np

primary = np.array([[40., 1160., 0.],
                    [40., 40., 0.],
                    [260., 40., 0.],
                    [260., 1160., 0.]])
secondary = np.array([[610., 560., 0.],
                      [610., -560., 0.],
                      [390., -560., 0.],
                      [390., 560., 0.]])

# Pad the data with ones, so that our transformation can do translations too
n = primary.shape[0]
pad = lambda x: np.hstack([x, np.ones((x.shape[0], 1))])
unpad = lambda x: x[:, :-1]
X = pad(primary)
Y = pad(secondary)

# Solve the least squares problem X * A = Y
# to find our transformation matrix A
A, res, rank, s = np.linalg.lstsq(X, Y)

transform = lambda x: unpad(np.dot(pad(x), A))

print "Target:"
print secondary
print "Result:"
print transform(primary)
print "Max error:", np.abs(secondary - transform(primary)).max()
The reason that your original matrix was singular is that your third coordinate is always zero, so there's no way to tell what the transform on that coordinate should be (zero times anything gives zero, so any value would work).
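One way to check this directly, reusing the padded X from above:

print np.linalg.matrix_rank(X)  # prints 3, not 4: the zero column makes X singular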
Printing the value of A tells you the transformation that least-squares has found:
A[np.abs(A) < 1e-10] = 0 # set really small values to zero
print A
results in
[[  -1.    0.    0.    0.]
 [   0.    1.    0.    0.]
 [   0.    0.    0.    0.]
 [ 650. -600.    0.    1.]]
which is equivalent to x2 = -x1 + 650, y2 = y1 - 600, z2 = 0 where x1, y1, z1 are the coordinates in your original system and x2, y2, z2 are the coordinates in your new system. As you can see, least-squares just set all the terms related to the third dimension to zero, since your system is really two-dimensional.
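As a quick usage check, applying the fitted transform to a hypothetical new point in the primary system (not part of the original data):

print transform(np.array([[50., 100., 0.]]))
# [[ 600. -500.    0.]]  i.e. x2 = -50 + 650, y2 = 100 - 600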