Why does pytorch matmul get different results when executed on cpu and gpu? - numpy

I am trying to figure out the rounding difference between numpy/pytorch, gpu/cpu, float16/float32 numbers and what I'm finding confuses me.
The basic version is:
a = torch.rand(3, 4, dtype=torch.float32)
b = torch.rand(4, 5, dtype=torch.float32)
print(a.numpy()#b.numpy() - a#b)
The result is all zeros as expected, however
print((a.cuda()#b.cuda()).cpu() - a#b)
gets non-zero results. Why is Pytorch float32 matmul executed differently on gpu and cpu?
An even more confusing experiment involves float16, as follows:
a = torch.rand(3, 4, dtype=torch.float16)
b = torch.rand(4, 5, dtype=torch.float16)
print(a.numpy()#b.numpy() - a#b)
print((a.cuda()#b.cuda()).cpu() - a#b)
these two results are all non-zero. Why are float16 numbers handled differently by numpy and torch? I know cpu can only do float32 operations and numpy convert float16 to float32 before computing, however the torch calculation is also executed on cpu.
And guess what, print((a.cuda()#b.cuda()).cpu() - a.numpy()#b.numpy()) gets an all zero result! This is pure fantasy for me...
The environment is as follow:
python: 3.8.5
torch: 1.7.0
numpy: 1.21.2
cuda: 11.1
gpu: GeForce RTX 3090
On the advice of some of the commenters, I add the following equal test
(a.numpy()#b.numpy() - (a#b).numpy()).any()
((a.cuda()#b.cuda()).cpu() - a#b).numpy().any()
(a.numpy()#b.numpy() - (a#b).numpy()).any()
((a.cuda()#b.cuda()).cpu() - a#b).numpy().any()
((a.cuda()#b.cuda()).cpu().numpy() - a.numpy()#b.numpy()).any()
respectively directly following the above five print functions, and the results are:
False
True
True
True
False
And for the last one, I've tried several times and I think I can rule out luck.

The differences are mostly numerical, as mentioned by #talonmies. CPU/GPU and their respectively BLAS libraries are implemented differently and use different operations/order-of-operation, hence the numerical difference.
One possible cause is sequential operation vs. reduction (https://discuss.pytorch.org/t/why-different-results-when-multiplying-in-cpu-than-in-gpu/1356/3), e.g. (((a+b)+c)+d) will have different numerical properties as compared with ((a+b)+(c+d)).
This question also talks about fused operations (multiply-add) which can cause numerical differences.
I did a little bit of testing, and find that the GPU's output in float16 mode can be matched if we promote the datatype to float32 before computation and demote it afterward. This can be caused by internal intermediate casting or the better numerical stability of fused operations (torch.backends.cudnn.enabled does not matter). This does not solve the case in float32 though.
import torch
def test(L, M, N):
# test (L*M) # (M*N)
for _ in range(5000):
a = torch.rand(L, M, dtype=torch.float16)
b = torch.rand(M, N, dtype=torch.float16)
cpu_result = a#b
gpu_result = (a.cuda()#b.cuda()).cpu()
if (cpu_result-gpu_result).any():
print(f'({L}x{M}) # ({M}x{N}) failed')
return
else:
print(f'({L}x{M}) # ({M}x{N}) passed')
test(1, 1, 1)
test(1, 2, 1)
test(4, 1, 4)
test(4, 4, 4)
def test2():
for _ in range(5000):
a = torch.rand(1, 2, dtype=torch.float16)
b = torch.rand(2, 1, dtype=torch.float16)
cpu_result = a#b
gpu_result = (a.cuda()#b.cuda()).cpu()
half_result = a[0,0]*b[0,0] + a[0,1]*b[1,0]
convert_result = (a[0,0].float()*b[0,0].float() + a[0,1].float()*b[1,0].float()).half()
if ((cpu_result-half_result).any()):
print('CPU != half')
return
if (gpu_result-convert_result).any():
print('GPU != convert')
return
else:
print('All passed')
test2()
Output:
(1x1) # (1x1) passed
(1x2) # (2x1) failed
(4x1) # (1x4) passed
(4x4) # (4x4) failed
All passed
You can tell that when the inner dimension is 1, it passes the check (no multiply-add/reduction needed).

Related

Convert an TF Agents ActorDistributionNetwork into a Tensorflow lite model

I would like to convert the ActorDistributionModel from a trained PPOClipAgent into a Tensorflow Lite model for deployment. How should I accomplish this?
I have tried following this tutorial (see section at bottom converting policy to TFLite), but the network outputs a single action (the policy) rather than the density function over actions that I desire.
I think perhaps something like this could work:
tf.compat.v2.saved_model.save(actor_net, saved_model_path, signature=?)
... if I knew how to set the signature parameter. That line of code executes without error when I omit the signature parameter, but I get the following error on load (I assume because the signature is not set up correctly):
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_path)
File "/home/ais/salesmentor.ai/MDPSolver/src/solver/ppo_budget.py", line 336, in train_eval
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_path)
File "/home/ais/.local/lib/python3.9/site-packages/tensorflow/lite/python/lite.py", line 1275, in from_saved_model
raise ValueError("Only support a single signature key.")
ValueError: Only support a single signature key.
This appears to work. I won't accept the answer until I have completed an end-to-end test, though.
def export_model(actor_net, observation_spec, saved_model_path):
predict_signature = {
'action_pred':
tf.function(func=lambda x: actor_net(x, None, None)[0].logits,
input_signature=(tf.TensorSpec(shape=observation_spec.shape),)
)
}
tf.saved_model.save(actor_net, saved_model_path, signatures=predict_signature)
# Convert to TensorFlow Lite model.
converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_path,
signature_keys=["action_pred"])
converter.target_spec.supported_ops = [
tf.lite.OpsSet.TFLITE_BUILTINS, # enable TensorFlow Lite ops.
tf.lite.OpsSet.SELECT_TF_OPS # enable TensorFlow ops.
]
tflite_policy = converter.convert()
with open(os.path.join(saved_model_path, 'policy.tflite'), 'wb') as f:
f.write(tflite_policy)
The solution wraps the actor_net in a lambda because I was unable to figure out how to specify the signature with all three expected arguments. Through the lambda, I convert the function into using a single argument (a tensor). I expect to pass None to the other two arguments in my use case, so there is nothing lost in this approach.
I see you using CartPole as the model simulation, Agent DQN, and Model learning and Evaluation from links provided TF-Agent Checkpointer. For simple understanding, you need to understand about the distributions and your model limits ( less than 6 actions determining at a time ).
Discretes Distribution, answer the question to the points but the links is how they implement AgentDQN on TF- Agent.
temp = tf.random.normal([10], 1, 0.2, tf.float32), mean is one and the standard deviation is 0.2. Overall of result summation product is nearby one and its variance is 0.2, when they have 10 actions to determine the possibility of the result is the same action is 1 from 5 or 0.5. random normal
Coefficient is ladder steps or you understand as IF and ELSE conditions or SWITCH conditions such as at the gap of 0 to 5, 5 to 10, 10 to 15, and continue.
The matrixes product from the Matrix coefficients and randoms is selected 4 - 5 actions sorted by priority, significant and select the most effects in rows.
The ArgMax is 0 to 9 which is actions 0 - 9 that respond to the environment input co-variances.
Sample: To the points, random distributions and selective agents ( we call selective agent maybe the questioner has confused with NN DQN )
temp = tf.random.normal([10], 1, 0.2, tf.float32)
temp = np.asarray(temp) * np.asarray([ coefficient_0, coefficient_1, coefficient_2, coefficient_3, coefficient_4, coefficient_5, coefficient_6, coefficient_7, coefficient_8, coefficient_9 ])
temp = tf.nn.softmax(temp)
action = int(np.argmax(temp))

Is there a Rolling implementation of PCA in python?

For time-series analysis, it's useful to have rolling PCA functions to analyse how the dynamics of the time-series changes over time to avoid look-ahead bias.
We may want to answer the question: 'how many principle components are needed to keep 90% of the variance?'. The number of principle components to explain 90% variance may change over time, depending on the dynamics of the time-series.
In addition, we may want to reduce the number of components p in a given dataset to k < p on a rolling basis to more easily visualise the data.
While scikit has a PCA module, it does not support rolling calculations. Similarly with numpy SVD. We could use these packages in a manual for loop, but for large arrays (>10,000 rows) it would become very slow.
Is there a fast rolling implementation of PCA in python to address some of the questions above?
While I didn't manage to find a rolling implementation of PCA, it is a relatively straightforward matter to use the packages and tools mentioned in the question to code a manual rolling PCA function. In addition, we will use numba to gain a small speed-up, as it supports numpy.linalg.svd and numpy.linalg.eig.
The code in this answer is inspired by the excellent explanations of PCA here and here
import numpy as np
from numpy.linalg import eig
from numba import njit
import numpy.typing as npt
#njit
def rolling_pca(
arr: npt.NDArray[np.float64],
n_components: int,
window: int,
min_periods: int
) -> npt.NDArray[np.float64]:
"""Perform PCA on the covariance matrix of arr.
Return the lower dimensional array.
Data is assumed to have non-zero mean, so will be demeaned
in the process.
Args:
arr: Input data. Shape (n_samples, n_variables).
n_components: Number of components to reduce data matrix to.
Must be less than arr.shape[1].
window: Sliding window size.
min_periods: Minimum number of observations required to perform calculation.
Returns:
Reduced data matrix. Shape (n_samples, n_components)
"""
# create a copy to ensure we don't change data in place
arr_copy = arr.copy()
n = arr_copy.shape[0]
# create an empty array which will be populated with the output
reduced_out = np.full((n, n_components), np.nan)
# iterate over each row (timestamp in a timeseries)
for i in range(min_periods, n + 1):
if i < window:
lookback = i
else:
lookback = window
start_idx = i - lookback
curr_arr = arr_copy[start_idx: i, :]
# demean returns
curr_arr = curr_arr - (np.sum(curr_arr, axis=0) / lookback)
# calculate the covariance matrix
cov = (curr_arr.T # curr_arr) / (lookback - 1)
# get the eigenvectors
# sort eigvals to get top largest corresponding eigenvectors
evals, evecs = eig(cov)
idx = np.argsort(evals)[:n_components]
evecs = evecs[:, idx]
# multiply the top eigenvectors by the current array to get a reduced matrix
reduced = (evecs.T # curr_arr.T).T
reduced_out[start_idx: i, :] = reduced
return reduced_out
After profiling the code, the two slowest parts are, as expected, the calls to eig() and the matrix multiplication curr_arr.T # curr_arr. As the array curr_arr is limited by the window size, a pure numpy (no numba) implementation of matrix multiplcation is faster than using numba. This is because the arrays used in the matrix multiplication are small, and are not contiguous (see this post for more details). I didn't get around to resolving this issue, but if anyone has any suggestions, it would speed up this function quite a bit more.
I've compared the average timings between 3 implementations to see the effect of the speedup that numba offers. The 3 implementations are:
Manual for loop using numba, exactly as above
Manual for loop without numba, but otherwise same as the code above
Manual for loop using Sklearn instead of numpy eig, no numba (as numba does not support Sklearn)
Note that the following parameters are fixed so we get as fair a comparison as possible between implementations
Window size = 120
Minimum number of periods = 22
Input data number of variables = 20
Number of components to reduce to via PCA = 10
Number of iterations to time function over so as to get an average timing per implementation = 10
Only the row size (number of samples) is allowed to vary so we can visualise how execution time varies with array length.
We can see for large arrays (100k rows), we can decrease the time from about 14.19s using Sklearn, to 5.36s using numba, about a 2.6X speedup.
PCA Reconstruction
We implement some code similar to what we used above to reconstruct the original data matrix, using only the top principle components. Namely, we use SVD to decompose the matrix X into 3 matrices, U, S, and V^T. With these matrices, we can calculate how much variance is kept cumulatively by the components, and only keep the top k components that explain a desired amount of variance.
import numpy as np
from numpy.linalg import svd
from numba import njit
import numpy.typing as npt
#njit
def __get_num_top_components(
singular_values: npt.NDArray[np.float64],
threshold: float,
n: int,
) -> int:
"""Get the number of top eigen-components required by threshold.
Args:
singular_values: Singular values from SVD.
threshold: Minimum amount of explained variance to be kept.
n: Number of samples in data matrix.
Returns:
Required number of components to keep.
"""
evals = singular_values ** 2 / (n - 1)
evals = evals / np.sum(evals)
cumsum_evals = np.cumsum(evals)
top_k = np.argwhere(cumsum_evals > threshold).min()
return top_k
#njit
def rolling_pca_reconstruction(
arr: npt.NDArray[np.float64],
threshold: float,
window: int,
min_periods: int
) -> npt.NDArray[np.float64]:
"""Perform PCA on arr and return reconstructed matrix.
This method follows the logic succinctly outlined here:
https://stats.stackexchange.com/a/134283/178320
Args:
arr: Input data. Shape (n_samples, n_variables).
threshold: Minimum amount of explained variance to be kept.
Must be a number in (0., 1.).
window: Sliding window size.
min_periods: Minimum number of observations required to perform calculation.
Returns:
Reconstructed data matrix. Shape (n_samples, n_variables)
"""
arr_copy = arr.copy()
n = arr_copy.shape[0]
p = arr_copy.shape[1]
recon_out = np.full((n, p), np.nan)
for i in range(min_periods, n + 1):
if i < window:
lookback = i
else:
lookback = window
start_idx = i - lookback
curr_arr = arr_copy[start_idx: i, :]
# demean data
curr_arr = curr_arr - (np.sum(curr_arr, axis=0) / lookback)
# perform SVD on data, no need for full matrices, this is faster
u, s, vh = svd(curr_arr, full_matrices=False)
# calculate the number of components that explains threshold variance
top_k = __get_num_top_components(
singular_values=s,
threshold=threshold,
n=lookback,
)
# reconstruct the data matrix using the top_k components
tmp_recon = u[:, :top_k] # np.diag(s)[:top_k, :top_k] # vh[:, :top_k].T
recon_out[start_idx: i, :] = tmp_recon
return recon_out
The output of rolling_pca_reconstruction() is the reconstructed data, of same dimension as the input data arr. One useful modification that could be made to this code is to record top_k at each iteration, to understand how many components are needed to explain treshold variance over time.

Tensorflow graph execution ignores equality condition in earger execution mode

I stumbled on some weird tensorflow behaviour. After tf.print everywhere, it led me to the cause as shown in the following code but don't know why it happened unless either threading race condition or graph construction omitted the code segment. Don't see either of them should happen.
# Ragged tensor may have empty rows. So, for tensor arithmetic operation,
# we need to create zero-padded tensors to replace them.
# This implementation only keeps the first entry of each row.
# So, the output tensor is a normal tensor.
def pad_empty_ragged_tensor(ragtensor):
tf.print("Ragged tensor padding empty tensor...", output_stream=sys.stdout)
batch_size = ragtensor.shape[0]
n_rows = ragtensor.row_lengths()
tf.print("row_lengths(): ", n_rows, output_stream=sys.stdout)
new_tensor = []
for i in range(batch_size):
tf.print("n_rows[i]: ", n_rows[i], output_stream=sys.stdout)
if tf.equal(n_rows[i], 0): # Tried n_rows[i] == 0 too
tf.print("Create zero padded tensor...", output_stream=sys.stdout)
num_zeros = ragtensor.shape[-1]
tensor = tf.tile([[0]], [1, num_zeros])
tensor = tf.cast(tensor, dtype=ragtensor.dtype)
else:
tf.print("Take first entry from the row", output_stream=sys.stdout)
tensor = ragtensor[i,0:1]
new_tensor.append(tensor)
tensor = tf.stack(new_tensor, axis=0) # [batch, 1, [y, x, h, w]]
tensor.set_shape([batch_size, 1, ragtensor.shape[-1]])
tf.print("The padded tensor shape: ", tensor.shape, output_stream=sys.stdout)
return tensor
Here is a segment of the print trace:
row_lengths(): [1 1 0 ... 1 1 1]
n_rows[i]: 1
Take first entry from the row
n_rows[i]: 1
Take first entry from the row
n_rows[i]: 0
Take first entry from the row
n_rows[i]: 1
Take first entry from the row
As shown, if tf.equal(n_rows[i], 0): # Tried n_rows[i] == 0 too condition block was never called. It falls into 'else' condition every time even if the equality condition was met. Could anyone hint me what went wrong?
BTW, debugging tensorflow runtime is difficult too. Breakpoint in VSCode didn't hit once graph execution runs. tfdbg is not working with eager execution either. A suggestion on this is very beneficial to me too.
My dev env:
OS: Ubuntu18.04
Python: 3.6
Tensorflow-gpu: 1.14
GPU: RTX2070
Cuda: 10.1
cudnn: 7.6
IDE: VS code
Tensorflow mode: Eager execution
Thanks in advance

Why does Tensorflow output these simple results?

I'm brand new to Tensorflow, but I'm trying to figure out why these results end in ...001, ...002, etc.
I'm following the tutorial here: https://www.tensorflow.org/get_started/get_started
Code:
"""This is a Tensorflow learning script."""
import tensorflow as tf
sess = tf.Session()
W = tf.Variable([.3], dtype=tf.float32)
b = tf.Variable([-.3], dtype=tf.float32)
x = tf.placeholder(tf.float32)
linear_model = W*x + b
sess.run(tf.global_variables_initializer()) #This is the same as the above 2 lines
print(sess.run(linear_model, {x: [1, 2, 3, 4]}))
It looks like a simple math function where if I was using 2 as an input, it would be (0.3 * 2) + -0.3 = 0.3.
Output:
[ 0. 0.30000001 0.60000002 0.90000004]
I would expect:
[ 0. 0.3 0.6 0.9]
That's probably a floating point error, because you introduced your variables as a tf.float32 dtype. You could use tf.round (https://www.tensorflow.org/api_docs/python/tf/round) but it doesn't seem to have round-to-the-nearest decimal place capability yet. For that, check out the response in: tf.round() to a specified precision.
The issue is that a floating point variable (like tf.float32) simply cannot store exactly 0.3 due to being stored in binary. It's like trying to store exactly 1/3 in decimal, it'd be 0.33... but you'd have to go out to infinity to get the exact number (which isn't possible our mortal realm!).
See the python docs for more in depth review of the subject.
Tensorflow doesn't have a way to deal with decimal numbers yet (as far as I know)! But once the numbers are returned to python you could round & then convert to a Decimal.

NumPy vectorization with integration

I have a vector and wish to make another vector of the same length whose k-th component is
The question is: how can we vectorize this for speed? NumPy vectorize() is actually a for loop, so it doesn't count.
Veedrac pointed out that "There is no way to apply a pure Python function to every element of a NumPy array without calling it that many times". Since I'm using NumPy functions rather than "pure Python" ones, I suppose it's possible to vectorize, but I don't know how.
import numpy as np
from scipy.integrate import quad
ws = 2 * np.random.random(10) - 1
n = len(ws)
integrals = np.empty(n)
def f(x, w):
if w < 0: return np.abs(x * w)
else: return np.exp(x) * w
def temp(x): return np.array([f(x, w) for w in ws]).sum()
def integrand(x, w): return f(x, w) * np.log(temp(x))
## Python for loop
for k in range(n):
integrals[k] = quad(integrand, -1, 1, args = ws[k])[0]
## NumPy vectorize
integrals = np.vectorize(quad)(integrand, -1, 1, args = ws)[0]
On a side note, is a Cython for loop always faster than NumPy vectorization?
The function quad executes an adaptive algorithm, which means the computations it performs depend on the specific thing being integrated. This cannot be vectorized in principle.
In your case, a for loop of length 10 is a non-issue. If the program takes long, it's because integration takes long, not because you have a for loop.
When you absolutely need to vectorize integration (not in the example above), use a non-adaptive method, with the understanding that precision may suffer. These can be directly applied to a 2D NumPy array obtained by evaluating all of your functions on some regularly spaced 1D array (a linspace). You'll have to choose the linspace yourself since the methods aren't adaptive.
numpy.trapz is the simplest and least precise
scipy.integrate.simps is equally easy to use and more precise (Simpson's rule requires an odd number of samples, but the method works around having an even number, too).
scipy.integrate.romb is in principle of higher accuracy than Simpson (for smooth data) but it requires the number of samples to be 2**n+1 for some integer n.
#zaq's answer focusing on quad is spot on. So I'll look at some other aspects of the problem.
In recent https://stackoverflow.com/a/41205930/901925 I argue that vectorize is of most value when you need to apply the full broadcasting mechanism to a function that only takes scalar values. Your quad qualifies as taking scalar inputs. But you are only iterating on one array, ws. The x that is passed on to your functions is generated by quad itself. quad and integrand are still Python functions, even if they use numpy operations.
cython improves low level iteration, stuff that it can convert to C code. Your primary iteration is at a high level, calling an imported function, quad. Cython can't touch or rewrite that.
You might be able to speed up integrand (and on down) with cython, but first focus on getting the most speed from that with regular numpy code.
def f(x, w):
if w < 0: return np.abs(x * w)
else: return np.exp(x) * w
With if w<0 w must be scalar. Can it be written so it works with an array w? If so, then
np.array([f(x, w) for w in ws]).sum()
could be rewritten as
fn(x, ws).sum()
Alternatively, since both x and w are scalar, you might get a bit of speed improvement by using math.exp etc instead of np.exp. Same for log and abs.
I'd try to write f(x,w) so it takes arrays for both x and w, returning a 2d result. If so, then temp and integrand would also work with arrays. Since quad feeds a scalar x, that may not help here, but with other integrators it could make a big difference.
If f(x,w) can be evaluated on a regular nx10 grid of x=np.linspace(-1,1,n) and ws, then an integral (of sorts) just requires a couple of summations over that space.
You can use quadpy for fully vectorized computation. You'll have to adapt your function to allow for vector inputs first, but that is done rather easily:
import numpy as np
import quadpy
np.random.seed(0)
ws = 2 * np.random.random(10) - 1
def f(x):
out = np.empty((len(ws), *x.shape))
out0 = np.abs(np.multiply.outer(ws, x))
out1 = np.multiply.outer(ws, np.exp(x))
out[ws < 0] = out0[ws < 0]
out[ws >= 0] = out1[ws >= 0]
return out
def integrand(x):
return f(x) * np.log(np.sum(f(x), axis=0))
val, err = quadpy.quad(integrand, -1, +1, epsabs=1.0e-10)
print(val)
[0.3266534 1.44001826 0.68767868 0.30035222 0.18011948 0.97630376
0.14724906 2.62169217 3.10276876 0.27499376]