Don't contract x*x*x to pow(x, 3) in sympy's `printing.ccode` method

I have a sympy equation that I need to translate to CUDA.
In its default configuration, sympy.printing.ccode will transform the expression x*x into pow(x, 2), which CUDA unfortunately handles a bit strangely (e.g. pow(0.1, 2) is 0 according to CUDA).
I would prefer sympy.printing.ccode to leave these kinds of expressions unaltered, or put another way, I would like it to expand any instance of pow into a simple product. E.g. pow(x,4) would become x*x*x*x -- does anyone know how to make this happen?

This should do it:
>>> import sympy as sp
>>> from sympy.utilities.codegen import CCodePrinter
>>> print(sp.__version__)
0.7.6.1
>>> x = sp.Symbol('x')
>>> CCodePrinter().doprint(x*x*x)
'pow(x, 3)'
>>> class MyCCodePrinter(CCodePrinter):
...     def _print_Pow(self, expr):
...         # only expand positive integer exponents; fall back to pow() otherwise
...         if expr.exp.is_integer and expr.exp.is_number and expr.exp > 0:
...             return '(' + '*'.join([self._print(expr.base)]*expr.exp) + ')'
...         else:
...             return super(MyCCodePrinter, self)._print_Pow(expr)
...
>>> MyCCodePrinter().doprint(x*x*x)
'x*x*x'
Note that this was a proposed change (with a restriction on the size of the exponent) for a while. Back then the motivation was performance of regular C code, but flags such as -ffast-math made the point moot. However, if this is something that is useful for CUDA code, we should definitely support the behaviour through a setting; feel free to open an issue for it if you think it is needed.
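For reference, newer SymPy versions (1.1+, if I remember correctly) ship a ready-made helper for exactly this in sympy.codegen.rewriting; a minimal sketch:

import sympy as sp
from sympy.codegen.rewriting import create_expand_pow_optimization

x = sp.Symbol('x')
# expand integer powers up to the given limit into repeated products
expand_pow = create_expand_pow_optimization(4)
print(sp.ccode(expand_pow(x**4)))  # x*x*x*x
print(sp.ccode(expand_pow(x**5)))  # pow(x, 5): above the limit, so left alone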

Related

How to setup a batched matrix multiplication in Numba with np.dot() using contiguous arrays

I am trying to speed up a batched matrix multiplication problem with numba, but it keeps telling me that it would be faster with contiguous arrays.
Note: I'm using numba version 0.55.1, and numpy version 1.21.5
Here's the problem:
import numpy as np
import numba as nb
@nb.njit(parallel=True)  # jit-compile with parallel loops; needed for nb.prange and to reproduce the warning
def numbaFastMatMult(mat, vec):
    result = np.zeros_like(vec)
    for n in nb.prange(vec.shape[0]):
        result[n, :] = np.dot(vec[n, :], mat[n, :, :])
    return result

D, N = 10, 1000
mat = np.random.normal(0, 1, (N, D, D))
vec = np.random.normal(0, 1, (N, D))
result = numbaFastMatMult(mat, vec)

n = 0  # check contiguity for an arbitrary batch index
print(mat.data.contiguous)
print(vec.data.contiguous)
print(mat[n, :, :].data.contiguous)
print(vec[n, :].data.contiguous)
clearly all the relevant data is contiguous (run the above code snippet and see the results of the print() calls).
But, when I run this code, I get the following warning:
NumbaPerformanceWarning: np.dot() is faster on contiguous arrays, called on (array(float64, 1d, C), array(float64, 2d, A))
result[n,:] = np.dot(vec[n,:], mat[n,:,:])
2 Extra comments:
This is just a toy problem for replication. I'm actually using something with many more data points, so hoping this will speed up.
I think the "right" way to solve this is with np.tensordot. However, I want to understand what's going on for future reference. For example, this discussion addresses a similar issue, but as far as I can tell, doesn't address why the warning shows up directly.
I've tried adding a decorator:
nb.float64[:,::1](nb.float64[:,:,::1],nb.float64[:,::1]),
I've tried reordering the arrays so the batch index is first (n in the above code)
I've tried printing whether the "mat" variable is contiguous from inside the function
I'll leave this up, but I figured it out:
Outside of a numba function:
mat[n,:,:].data.contiguous==True
but inside a numba function, mat[n,:,:] is no longer contiguous.
Changing my code above to np.dot(vec[n], mat[n]) removed the warning.
I'm making this the "correct" answer since it solved my problem. However, according to max9111's response, this behavior may be a bug!
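For reference, a minimal sketch of the fix (the function name numbaFastMatMult_fixed is mine, and the parallel=True decorator mirrors the setup implied by the question):

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def numbaFastMatMult_fixed(mat, vec):
    result = np.zeros_like(vec)
    for n in nb.prange(vec.shape[0]):
        # mat[n] is typed by Numba as a C-contiguous 2d array, whereas
        # mat[n, :, :] gets a generic ('A') layout and triggers the warning
        result[n, :] = np.dot(vec[n], mat[n])
    return result

D, N = 10, 1000
mat = np.random.normal(0, 1, (N, D, D))
vec = np.random.normal(0, 1, (N, D))
result = numbaFastMatMult_fixed(mat, vec)  # no NumbaPerformanceWarning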

How to make an np.array in numba with input-dependent rank?

I would like to @numba.njit this simple function that returns an array with a shape, in particular a rank, that depends on the input i:
E.g. for i = 4 the shape should be shape=(2, 2, 2, 2, 4)
import numpy as np
from numba import njit
@njit
def make_array_numba(i):
    shape = np.array([2] * i + [i], dtype=np.int64)
    return np.empty(shape, dtype=np.int64)
make_array_numba(4).shape
I tried many different ways, but always fail at the fact that I can't generate the shape tuple that numba wants to see in np.empty / np.reshape / np.zeros /...
In normal numpy one can pass lists / np.arrays as the shape, or I can generate a tuple on the fly such as (2,) * i + (i,).
Output:
>>> empty(array(int64, 1d, C), dtype=class(int64))
There are 4 candidate implementations:
- Of which 4 did not match due to:
Overload in function '_OverloadWrapper._build.<locals>.ol_generated': File: numba/core/overload_glue.py: Line 131.
With argument(s): '(array(int64, 1d, C), dtype=class(int64))':
Rejected as the implementation raised a specific error:
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
No implementation of function Function(<intrinsic stub>) found for signature:
>>> stub(array(int64, 1d, C), class(int64))
There are 2 candidate implementations:
- Of which 2 did not match due to:
Intrinsic of function 'stub': File: numba/core/overload_glue.py: Line 35.
With argument(s): '(array(int64, 1d, C), class(int64))':
No match.
This is not possible with @njit alone. The reason is that Numba needs to determine a type for the array, independently of any variable values, in order to compile the function before executing it. The catch is that the number of dimensions of an array is part of its type, so here Numba cannot determine the array's type, since it depends on a value that is not a compile-time constant.
The only way to solve this problem (assuming you do not want to linearize your array) is to recompile the function for each possible i, which is certainly overkill and completely defeats the benefit of using Numba (at least in your example). Note that @generated_jit can be used in such a case, when you really want to recompile the function for different values or input types. I strongly advise you not to use it for your current use-case. If you try, you will run into other, similar issues, because an array cannot be indexed with a runtime-defined number of indices, and the resulting code will quickly become insane.
A more general and cleaner solution is simply to linearize the array. This means flattening it and performing some fancy indexing computation like (((... + z) * stride_z) + y) * stride_y + x. The size and the indices can be computed at runtime, independently of the typing system. Note that such manual indexing can be quite slow, but Numpy would not use faster code in this case anyway.
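To make the linearization idea concrete, here is a minimal sketch; the helper names (make_flat_array, flat_index) and the C-order index computation are my own illustration, not part of the original answer:

import numpy as np
from numba import njit

@njit
def make_flat_array(i):
    # flat buffer standing in for an array of shape (2, ..., 2, i) with i leading axes
    return np.empty(2**i * i, dtype=np.int64)

@njit
def flat_index(idx, i):
    # idx: 1d int64 array of length i+1 holding one index per axis (C order)
    k = 0
    for d in range(i):        # i leading axes of extent 2
        k = k * 2 + idx[d]
    return k * i + idx[i]     # trailing axis of extent i

arr = make_flat_array(4)
arr[flat_index(np.array([1, 0, 1, 1, 3], dtype=np.int64), 4)] = 42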

Efficient solving of generalised eigenvalue problems in python

Given an eigenvalue problem Ax = λBx, what is the more efficient way to solve it out of the two shown here?
import scipy as sp
import numpy as np
def geneivprob(A, B):
    # Use scipy
    lamda, eigvec = sp.linalg.eig(A, B)
    return lamda, eigvec

def geneivprob2(A, B):
    # Reduce the problem to a standard symmetric eigenvalue problem
    Linv = np.linalg.inv(np.linalg.cholesky(B))
    C = Linv @ A @ Linv.transpose()
    #C = np.asmatrix((C + C.transpose())*0.5,np.float32)
    lamda, V = np.linalg.eig(C)
    return lamda, Linv.transpose() @ V
I saw the second version in a codebase and was wondering if it was better than simply using scipy.
Well, there is no obvious advantage in using the second approach; maybe for some class of matrices it will be better. I would suggest you test with the problems you actually want to solve. Since you are transforming the eigenvectors, this also transforms how errors affect the solution, and maybe that is the reason for using the second method: not efficiency, but numerical accuracy or convergence.
Another thing is that the second method will only work for symmetric positive-definite B, since it relies on the Cholesky factorization of B.
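Incidentally, if A is symmetric and B is symmetric positive-definite, SciPy can solve the generalised problem directly with scipy.linalg.eigh, which is usually the more convenient route; a minimal sketch (the random test matrices are just for illustration):

import numpy as np
from scipy.linalg import eigh

rng = np.random.default_rng(0)
M = rng.standard_normal((50, 50))
A = M + M.T                      # symmetric A
B = M @ M.T + 50 * np.eye(50)    # symmetric positive-definite B

lamda, V = eigh(A, B)            # solves A x = lambda B x
print(np.allclose(A @ V, B @ V * lamda))  # True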

Solution to transcendental equation with both Mathematica and Python

I have the following issue with finding the roots of a non-linear equation. The equation is the following:
tanh[ 5* log [ (2/t)^(0.00990099) (1+x)^(0.990099) (1-x)^(-1) ] ]-x = 0
Solving this with NSolve, for {t, 0, 100} returns the following with Mathematica:
This is what I was expecting from plotting the resulting roots versus the time parameter within this range. Now, I have tried to replicate this result with Python using scipy.optimize.root, but it seems that my code returns as a solution whatever value I use as the initial condition, hence it is nothing else than the identity map. This can also be seen in the pic below, where I used an initial condition of 0.7:
I have provided the code below:
import math
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import root
# Setting up the function
def delta(v, t):
    epsilon = 10**(-20)
    return np.tanh( 5*np.log( (2/(1.0*t+epsilon))**(0.00990099)*(1+v+epsilon)**(0.990099)*(1-v+epsilon)**(-1))) - v

# Setting up the time parameter
time = np.linspace(0, 101)
res = [root(delta, 0.7, args=(t, )).x[0] for t in time]
print(res)
plt.plot(time, res)
plt.savefig("plot.png")
I am not really sure if I am using scipy.optimize.root correctly, since the function looks fine as far as the behaviour I expect from it. Perhaps there is a mistake in the way I pass the args?
The root-finding methods that begin with a bracketing interval [a, b] (one where f(a) and f(b) have opposite signs) are generally more robust than the methods that begin from a single starting point x0. The reason is that the former have a definite interval to work with and can refine it iteratively. The bisection method is a classical example, but it's slow. SciPy implements more sophisticated methods such as brentq. It works fine here with the bracket [-0.1, 0.1] (which should be enough, judging from the Mathematica plot).
Also, t=0 is problematic in the equation, as it's not even defined then. Put a small positive number like 0.01 instead.
from scipy.optimize import brentq

time = np.linspace(0.01, 101, 500)
res = [brentq(delta, -0.1, 0.1, args=(t, )) for t in time]
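For completeness, a sketch of how the pieces fit together (the residual is the one from the question, with the epsilon padding dropped since t starts at 0.01 and the bracket keeps x well away from 1):

import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import brentq

def delta(v, t):
    # residual of tanh(5*log((2/t)^0.00990099 * (1+v)^0.990099 / (1-v))) - v = 0
    return np.tanh(5*np.log((2/t)**(0.00990099) * (1+v)**(0.990099) * (1-v)**(-1))) - v

time = np.linspace(0.01, 101, 500)
res = [brentq(delta, -0.1, 0.1, args=(t,)) for t in time]

plt.plot(time, res)
plt.savefig("plot_brentq.png")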

Why the difference between octave's prctile and numpy's percentile?

I've been rewriting a matlab/octave program into numpy and ran across a difference in some resultant values.
This occurs with both the percentile/prctile and the standard-deviation functions.
In Numpy:
>>> import matplotlib.mlab as ml
>>> import numpy
>>> t = numpy.linspace(0,100, 100)
>>> numpy.percentile(t,95)
95.0
>>> numpy.std(t)
29.157646512850626
>>> ml.prctile(t,95)
95.000000000000014
In Octave:
octave:1> t = linspace(0,100,100)';
octave:2> prctile(t,95)
ans = 95.454545
octave:3> std(t)
ans = 29.304537
Although the array values of 't' are the same, the results differ more than I would expect.
In the numpy help(numpy.std) they specifically mention that the algorithm is:
std = sqrt(mean(abs(x - x.mean())**2))
I implemented that in Octave and got the exact answer numpy gives, so it seems the standard-deviation functions differ.
But why/how? And which is correct? (if there is such a thing)
And even prctile/percentile?
Just in case, since I'm on Linux (aptosid)...
GNU Octave, version 3.6.2
numpy.__version__ '1.6.2rc1'
Numpy simply uses a different algorithm when the percentile lies between two data points. Octave, Matlab and R always center it exactly between two points when needed (I believe); numpy does a bit more than that. If you check http://en.wikipedia.org/wiki/Percentile you will see there are a couple of ways to calculate percentiles.
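As a concrete illustration, newer NumPy releases (1.22+, if I recall correctly) expose the different conventions through the method argument of numpy.percentile; the 'hazen' method reproduces the Matlab/Octave-style result for this example (a sketch worth verifying against your Octave version):

import numpy as np

t = np.linspace(0, 100, 100)

# NumPy's default ('linear') interpolation
print(np.percentile(t, 95))                   # 95.0

# 'hazen' uses (i - 0.5)/n plotting positions, which matches Octave's
# prctile result of ~95.4545 for this example
print(np.percentile(t, 95, method='hazen'))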
It seems like Octave assumes ddof=1, at least by default, and numpy uses 0 by default:
>>> numpy.std(t, ddof=0)
29.157646512850633
>>> numpy.std(t, ddof=1)
29.304537349375785