Why do the trigonometric functions in Julia seem to be slower than in Numpy?

I'm new to Julia, so I may be doing something wrong. But I ran a simple test of trigonometric functions, and Julia seems to be significantly slower than Numpy. I need some help to see why.
--- Julia version:
x = rand(100000);
y = similar(x);
@time y .= sin.(x);
--- Numpy version:
import numpy
x = numpy.random.rand(100000)
y = numpy.zeros(x.shape)
%timeit y = numpy.sin(x)
The Julia version regularly gives 1.3 ~ 1.5 ms, but the Numpy version usually gives 0.9 ~ 1 ms. The difference is quite significant. Why is that? Thanks.

x = rand(100000);
y = similar(x);
f(x,y) = (y .= sin.(x));
@time f(x,y)
@time f(x,y)
@time f(x,y)
Gives
julia> @time f(x,y);
  0.123145 seconds (577.97 k allocations: 29.758 MiB, 5.70% gc time)
julia> @time f(x,y);
  0.000515 seconds (6 allocations: 192 bytes)
julia> @time f(x,y);
  0.000512 seconds (6 allocations: 192 bytes)
The first time you call a function, Julia compiles it. Broadcast expressions generate and use an anonymous function, so if you broadcast in global scope it will be compiled each time. Julia works best inside function scopes.
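If you want timings that exclude compilation and global-scope overhead altogether, BenchmarkTools is the usual tool. A minimal sketch (the helper name g! and the use of BenchmarkTools are my additions, not part of the question):
using BenchmarkTools

x = rand(100000)
y = similar(x)
g!(y, x) = (y .= sin.(x); nothing)   # wrap the broadcast in a function

@btime g!($y, $x)   # $ interpolates the globals so only the kernel is measured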

Related

Which is faster: read(), readline(), or readlines() for file IO in Julia?

Please correct me if I'm wrong:
read is efficient because, as I assume:
a) read fetches the whole file content into memory in one go, similar to Python.
b) readline and readlines bring one line at a time into memory.
To expand on the comment, here is an example benchmark (which additionally shows how you can perform such tests yourself).
First create some random test data:
open("testdata.txt", "w") do f
for i in 1:10^6
println(f, "a"^100)
end
end
We will want to read in the data in four ways (and calculate the aggregate length of lines):
f1() = sum(length(l) for l in readlines("testdata.txt"))
f2() = sum(length(l) for l in eachline("testdata.txt"))

function f3()
    s = 0
    open("testdata.txt") do f
        while !eof(f)
            s += length(readline(f))
        end
    end
    s
end

function f4()
    s = 0
    for c in read("testdata.txt", String)
        s += c != '\n' # assume Linux line endings for simplicity
    end
    s
end
Now we compare the performance and memory usage of the given options:
julia> using BenchmarkTools
julia> @btime f1()
  239.857 ms (2001558 allocations: 146.59 MiB)
100000000
julia> @btime f2()
  179.480 ms (2001539 allocations: 137.59 MiB)
100000000
julia> @btime f3()
  189.643 ms (2001533 allocations: 137.59 MiB)
100000000
julia> @btime f4()
  158.055 ms (13 allocations: 96.32 MiB)
100000000
If you run it on your machine you should get similar results.
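As one more point of comparison (my addition, not part of the benchmark above), the per-line String allocations can be avoided entirely by reading the raw bytes and counting the non-newline bytes, which gives the same total for this ASCII test file:
# Read the file as a byte vector and count bytes that are not '\n'.
f5() = count(b -> b != UInt8('\n'), read("testdata.txt"))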

Julia code optimization, difference between structs and primitive types? (memory allocs)

I have some code to optimize, with some critical parts in which I want to avoid memory allocations (which would trigger the GC).
To be more precise I have a Real number type
struct AFloat{T<:AbstractFloat} <: Real
    value::T
    j::Int
end
that I must track in order to perform automatic differentiation. Thus, for any arithmetic operation I have to do some registration on a tape. Performance is really important here (it makes a real difference if you have one more allocation per arithmetic op!). I have the choice between AFloat{T} or simply using a primitive type to track the index j:
primitive type AFloat64 <: Real sizeof(Int) end
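To illustrate what I mean by a registration, here is a simplified sketch of such an overload (the global tape and its contents are only illustrative, not my actual implementation):
const tape = Vector{Tuple{Int,Int}}()   # illustrative tape: pairs of operand indices

function Base.:+(a::AFloat{T}, b::AFloat{T}) where {T}
    push!(tape, (a.j, b.j))                    # register the operation
    AFloat{T}(a.value + b.value, length(tape)) # new value refers to the new tape entry
end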
However, I am confused with these results:
First part: ok
using BenchmarkTools
struct A n::Int64 end
vA=A[A(1)];
@time push!(vA,A(2))
v=Int64[1];
@time push!(v,2)
returns
0.000011 seconds (6 allocations: 224 bytes)
0.000006 seconds (5 allocations: 208 bytes)
which is consistent with:
@btime push!(vA,A(2))
@btime push!(v,2)
that returns
46.410 ns (1 allocation: 16 bytes)
37.890 ns (0 allocations: 0 bytes)
-> I would conclude that pushing a primitive type avoids one memory allocation compared to a struct (is that right?)
Part two: ...problematic...?!
Here I am confused and I cannot interpret these results:
foo_A() = A(1);
foo_F64() = Float64(1.);
foo_I64() = Int64(1);
@time foo_A()
@time foo_F64()
@time foo_I64()
returns
0.000004 seconds (5 allocations: 176 bytes)
0.000005 seconds (5 allocations: 176 bytes)
0.000005 seconds (4 allocations: 160 bytes)
Q1: how do I interpret the difference between foo_F64() and foo_I64() (5 allocs vs 4 allocs)?
Moreover, the results seem inconsistent with the @btime outputs:
@btime foo_A()
  3.179 ns (0 allocations: 0 bytes)
@btime foo_F64()
  3.801 ns (0 allocations: 0 bytes)
@btime foo_I64()
  3.180 ns (0 allocations: 0 bytes)
Q2: which gives the right answer, @time or @btime? Why?
To summarize: in Julia, is there a difference in terms of performance and memory allocation between foo_A and foo_Primitive, where:
struct A n::Int64 end
foo_A() = A(1)
foo_Primitive() = Int64(1)
I am aware that with such small expressions there are real risks of side effects when using @time or @btime. Ideally, it would be better to have some knowledge of Julia's internals to answer, but I don't.
julia> versioninfo()
Julia Version 0.6.2
Commit d386e40c17 (2017-12-13 18:08 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Xeon(R) CPU E5-2603 v3 @ 1.60GHz
WORD_SIZE: 64
BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
LAPACK: libopenblas64_
LIBM: libopenlibm
LLVM: libLLVM-3.9.1 (ORCJIT, haswell)
The allocations you're seeing are just side effects of timing from the REPL, as far as I can tell.
If you put those in a function, the allocations are the same whether you're using a struct or a primitive type:
julia> function f()
           vA = [A(1)]
           @time push!(vA, A(2))
           v = [1]
           @time push!(v, 2)
       end
f (generic function with 1 method)
julia> f();
0.000000 seconds (1 allocation: 32 bytes)
0.000000 seconds (1 allocation: 32 bytes)
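To address Q2 directly: for micro-benchmarks like these, @btime is the more trustworthy number, because @time at the REPL also measures compilation (on the first call) and global-scope overhead, while BenchmarkTools runs many samples after warm-up. A minimal sketch repeating the question's foo_A:
using BenchmarkTools

struct A n::Int64 end   # definition repeated from the question
foo_A() = A(1)

@time foo_A()    # first call: includes JIT compilation and REPL overhead
@time foo_A()    # second call: smaller, but still not a clean measurement
@btime foo_A()   # many warm samples; expect a few ns and 0 allocations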

How to show field values in Julia

I was wondering if there is a way to show field values in Julia.
For example, this Python program gets the instance variable wealth from the Consumer class:
class Consumer:

    def __init__(self, w):
        "Initialize consumer with w dollars of wealth"
        self.wealth = w

    def earn(self, y):
        "The consumer earns y dollars"
        self.wealth += y

    def spend(self, x):
        "The consumer spends x dollars if feasible"
        new_wealth = self.wealth - x
        if new_wealth < 0:
            print("Insufficient funds")
        else:
            self.wealth = new_wealth

c1 = Consumer(10)  # create an instance with initial wealth 10
c1.spend(5)
c1.wealth
The wealth variable is now 5. I want to know how I can translate this code to Julia.
The simplest approach is pretty much like Python:
mutable struct Consumer
    wealth
end

function earn(c::Consumer, y)
    c.wealth += y
end

function spend(c::Consumer, y)
    c.wealth -= y
end
And now you can use it like:
julia> c1 = Consumer(10)
Consumer(10)
julia> spend(c1, 5)
5
julia> c1.wealth
5
You can read more about it here.
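If the goal is to inspect field values generically rather than access a known field by name, Julia's built-in reflection helps; this is an extra note, not part of the answer above:
c1 = Consumer(10)

fieldnames(typeof(c1))   # lists the field names of the struct, here just :wealth
getfield(c1, :wealth)    # access a field by Symbol, equivalent to c1.wealth
dump(c1)                 # prints the whole object with all of its field values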
But probably in Julia you would write it like:
mutable struct ConsumerTyped{T<:Real}
    wealth::T
end

function earn(c::ConsumerTyped, y)
    c.wealth += y
end

function spend(c::ConsumerTyped, y)
    c.wealth -= y
end
On the surface this will work almost the same. The difference is T, which specifies the type of wealth. There are two benefits: you get type control in your code, and the functions will run faster.
Given such a definition the only thing you need to know is that the constructor can be called in two flavors:
c2 = ConsumerTyped{Float64}(10) # explicitly specifies T
c3 = ConsumerTyped(10) # T implicitly derived from the argument
Now let us compare the performance of both types:
julia> using BenchmarkTools
julia> c1 = Consumer(10)
Consumer(10)
julia> c2 = ConsumerTyped(10)
ConsumerTyped{Int64}(10)
julia> @benchmark spend(c1, 1)
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 56.434 ns (0.00% GC)
median time: 57.376 ns (0.00% GC)
mean time: 60.126 ns (0.84% GC)
maximum time: 847.942 ns (87.69% GC)
--------------
samples: 10000
evals/sample: 992
julia> @benchmark spend(c2, 1)
BenchmarkTools.Trial:
memory estimate: 16 bytes
allocs estimate: 1
--------------
minimum time: 29.858 ns (0.00% GC)
median time: 30.791 ns (0.00% GC)
mean time: 32.835 ns (1.63% GC)
maximum time: 966.188 ns (90.20% GC)
--------------
samples: 10000
evals/sample: 1000
and you see that you get ~2x speedup.
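A quick way to see where the speedup comes from (my suggestion, not shown in the benchmark above) is @code_warntype, which flags the type-unstable field access in the untyped version:
@code_warntype spend(c1, 1)   # Consumer: wealth is inferred as Any, so spend is type-unstable
@code_warntype spend(c2, 1)   # ConsumerTyped{Int64}: everything is inferred as concrete types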
Julia doesn't support classes (in terms of OOP).
However, there are composite types which can represent the variables of your class:
mutable struct Consumer
    wealth::Float64
end
Now, since Julia doesn't support classes, all methods have to live outside this type which allows one of the key features of Julia, multiple dispatch, to also work with user-defined types. (https://docs.julialang.org/en/stable/manual/methods/, https://www.juliabloggers.com/julia-in-ecology-why-multiple-dispatch-is-good/)
Hence, you would have to add a method like this:
function earn!(consumer::Consumer, y::Float64)
    println("The consumer earns y dollars")
    consumer.wealth = consumer.wealth + y
end
(The spend function can be implemented similarly; a sketch follows below.)
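For completeness, a spend! sketch mirroring the Python version's feasibility check could look like this (an illustrative addition, not from the original answer):
function spend!(consumer::Consumer, x::Float64)
    new_wealth = consumer.wealth - x
    if new_wealth < 0
        println("Insufficient funds")
    else
        consumer.wealth = new_wealth
    end
end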

Optimizing runtime with lambdify and function evaluation

I am currently optimizing the runtime of my code, and it is still not within the time-consumption bounds I would like. I have gotten to the point where 80% of the time is spent running lambdify() on my sympy Matrix expressions and evaluating the resulting lambda functions when performing Gauss quadrature. All other aspects of the code are sufficiently optimized, so I was hoping someone could help me optimize the substantial "bottleneck" of lambdifying and evaluating sympy expressions.
The code is written on a 64-bit Windows 7 machine with Python 3.5.2 (examples below, illustrating the code, are performed on Jupyter QtConsole) and the following module versions:
Sympy: 1.0
Numpy: 1.11.1
Numba: 0.27
Lambdify()
I believe the reason lambdify() takes a lot of time is the complexity of the sympy expressions (which involve multiplications of sympy Piecewise() expressions). Simplification of these expressions is not possible, as they are wavelet functions created from Legendre scaling functions using the standard Alpert algorithm. A smaller example of such a Matrix, and a timing comparison with lambdifying a "simpler" Matrix, is given here:
from sympy import *
import numpy as np
import timeit
xi1 = symbols('xi1')
xi2 = symbols('xi2')
M = Matrix([[-0.0015625*(3.46410161513775*(0.00624999999999998*xi2 -
0.99375)*Piecewise((-1, 0.00624999999999998*xi2 - 0.99375 >= 0),
(1, 0.00624999999999998*xi2 - 0.99375 < 0)) +
1.73205080756888)*Piecewise((1, And(0.00624999999999998*xi2 -
0.99375 <= 1, 0.00624999999999998*xi2 -
0.99375 >= -1)), (0, True))],
[-0.00156249999999999*(0.0187499999999999*xi2 + 2.0*Piecewise((-1,
0.00624999999999998*xi2 - 0.99375 >= 0), (1,
0.00624999999999998*xi2 - 0.99375 < 0)) - 2.98125)*Piecewise((1,
And(0.00624999999999998*xi2 - 0.99375 <= 1,
0.00624999999999998*xi2 - 0.99375 >= -1)), (0, True))],
[-0.00270632938682636*xi1*(3.46410161513775*
(0.00624999999999998*xi2 - 0.99375)*Piecewise((-1,
0.00624999999999998*xi2 - 0.99375 >= 0), (1,
0.00624999999999998*xi2 - 0.99375 < 0)) +
1.73205080756888)*Piecewise((1, And(0.00624999999999998*xi2 -
0.99375 <= 1, 0.00624999999999998*xi2 - 0.99375 >= -1)), (0,
True))]])
M_simpl = Matrix([(xi2**2),(xi2**2)*xi1,(xi2**2)*(xi1**2)])
Time comparison yields:
import timeit
%timeit lambdify([xi1,xi2], M, 'numpy')
10 loops, best of 3: 23 ms per loop
%timeit lambdify([xi1,xi2], M_simpl, 'numpy')
100 loops, best of 3: 2.47 ms per loop
This shows that the more complex expression is handled almost 10x slower than the simpler Matrix, which makes a significant contribution to the runtime when lambdify() is applied to several matrices of this type.
Researching the subject, I have learned of the faster ufuncify() function in sympy.utilities.autowrap, which seems to work best with a Fortran or C++ backend. However, this is not the best alternative in my case, as the function does not yet extend to sympy Matrices, and I would like the code to be general enough that other Windows users adapting it do not need to install a C++ compiler, etc.
So, is there any way of achieving a speed-up of the lambdify() function for these types of sympy expressions without using other compilers?
Lambda function evaluation
The lambdified functions of the sympy Matrices above also perform differently when it comes to evaluation at specific coordinates. This is illustrated with the following simple 5-point quadrature example:
# Quadrature coordinates
xi_v = np.array([[-1,-1], [-0.5,-0.5], [0,0], [0.5,0.5], [1,1]])
# Quadrature weights
w = np.array([3, 2, 1, 2, 3])
# Quadrature
def quad_func(func, xi_v, w):
    G = np.zeros((3, 1))
    for i in range(0, len(w), 1):
        G += w[i]*func(*xi_v[i,:])
    return G
# Testing time usage
f = lambdify([xi1,xi2], M, 'numpy')
%timeit quad_func(f, xi_v, w)
1000 loops, best of 3: 852 µs per loop
f_simpl = lambdify([xi1,xi2], M_simpl, 'numpy')
%timeit quad_func(f_simpl, xi_v, w)
10000 loops, best of 3: 33.9 µs per loop
My first instinct was to introduce jit from the numba module in order to speed up the evaluation. However, this resulted in a pop-up window stating that Python has stopped working, and the kernel restarted (this happens for both f and f_simpl):
import numba
quad_func_jit = numba.jit(quad_func)
quad_func_jit(f, xi_v, w)
Kernel died, restarting
So again, is there any way to speed up these lambda function evaluations in order to reduce the total runtime? Or possibly some way of avoiding the crash with numba.jit?
I was interested in the code that lambdify produces (lambdify converts sympy syntax into numpy code), so using the inspect module I printed it:
f = lambdify([xi1,xi2], M, 'numpy')
import inspect
lines = inspect.getsource(f)
print(lines)
(The code for M can be taken from the question; I will not repeat it here for the sake of brevity.) The print statement gave me this massive function, which seems to be correct:
def _lambdifygenerated(xi1, xi2):
    return array([
        [(-0.0015625*(0.0216506350946109*xi2 - 3.44245098004314)
          *select([greater_equal(0.00624999999999998*xi2 - 0.99375, 0), True], [-1, 1], default=nan)
          - 0.00270632938682638)
         *select([logical_and.reduce((greater_equal(0.00624999999999998*xi2 - 0.99375, -1),
                                      less_equal(0.00624999999999998*xi2 - 0.99375, 1))), True],
                 [1, 0], default=nan)],
        [(-2.92968749999997e-5*xi2
          - 0.00312499999999998*select([greater_equal(0.00624999999999998*xi2 - 0.99375, 0), True],
                                       [-1, 1], default=nan)
          + 0.00465820312499997)
         *select([logical_and.reduce((greater_equal(0.00624999999999998*xi2 - 0.99375, -1),
                                      less_equal(0.00624999999999998*xi2 - 0.99375, 1))), True],
                 [1, 0], default=nan)],
        [-0.00270632938682636*xi1
         *((0.0216506350946109*xi2 - 3.44245098004314)
           *select([greater_equal(0.00624999999999998*xi2 - 0.99375, 0), True], [-1, 1], default=nan)
           + 1.73205080756888)
         *select([logical_and.reduce((greater_equal(0.00624999999999998*xi2 - 0.99375, -1),
                                      less_equal(0.00624999999999998*xi2 - 0.99375, 1))), True],
                 [1, 0], default=nan)]])
However, this function uses a lot of numpy functions that numba does not support, such as select, which makes it impossible to use numba here. So, to answer your question: no, it is (sadly) not possible to combine lambdify and numba to create JIT-compiled sympy expressions.

Why is Fortran slow in the Julia benchmark "rand_mat_mul"?

The benchmark results on the home page of Julia (http://julialang.org/) show that Fortran is about 4x slower than Julia/Numpy in the "rand_mat_mul" benchmark.
I cannot understand why Fortran is slower when it is calling the same Fortran library (BLAS).
I have also performed a simple test for matrix multiplication involving Fortran, Julia and Numpy and got similar results:
Julia
n = 1000; A = rand(n,n); B = rand(n,n);
@time C = A*B;
>> elapsed time: 0.069577896 seconds (7 MB allocated)
Numpy in IPython
from numpy import *
n = 1000; A = random.rand(n,n); B = random.rand(n,n);
%time C = dot(A,B);
>> Wall time: 98 ms
Fortran
PROGRAM TEST
  IMPLICIT NONE
  INTEGER, PARAMETER :: N = 1000
  INTEGER :: I,J
  REAL*8 :: T0,T1
  REAL*8 :: A(N,N), B(N,N), C(N,N)
  CALL RANDOM_SEED()
  DO I = 1, N, 1
    DO J = 1, N, 1
      CALL RANDOM_NUMBER(A(I,J))
      CALL RANDOM_NUMBER(B(I,J))
    END DO
  END DO
  call cpu_time(t0)
  CALL DGEMM ( "N", "N", N, N, N, 1.D0, A, N, B, N, 0.D0, C, N )
  call cpu_time(t1)
  write(unit=*, fmt="(a24,f10.3,a1)") "Time for Multiplication:",t1-t0,"s"
END PROGRAM TEST
gfortran test_blas.f90 libopenblas.dll -O3 & a.exe
>> Time for Multiplication: 0.296s
I have changed the timing function to system_clock() (cpu_time() measures CPU time summed over all threads of the multithreaded BLAS, whereas system_clock() measures wall-clock time, which is what Julia and Numpy report), and the result turns out to be (I ran it five times in one program):
Time for Multiplication: 92ms
Time for Multiplication: 92ms
Time for Multiplication: 89ms
Time for Multiplication: 85ms
Time for Multiplication: 94ms
It is approximately the same as Numpy, but still about 20% slower than Julia.
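For reference, a fairer way to time the Julia side is to exclude the first (compiling) call and repeat the measurement, e.g. (a suggested sketch, not the original test):
n = 1000
A = rand(n, n); B = rand(n, n)
A * B                # warm-up: the first call includes compilation
for k in 1:5
    @time A * B      # subsequent calls measure only the BLAS (dgemm) wall time
end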