I am trying to reduce the memory allocation of an inner loop in my code. Below is the part that is not working as expected.
using Random
using StatsBase
using BenchmarkTools
using Distributions
a_dist = Distributions.DiscreteUniform(1, 99)
v_dist = Distributions.DiscreteUniform(1, 2)
population_size = 10000
population = [rand(a_dist, population_size) rand(v_dist, population_size)]
find_all_it3(f::Function, A) = (p[2] for p in eachrow(A) if f(p[1]))
@btime begin
    c_pool = find_all_it3(x -> (x < 5), population)
    c_pool_dict = countmap(c_pool, alg=:dict)
end
@btime begin
    c_pool_indexes = findall(x -> (x < 5), view(population, :, 1))
    c_pool_dict = countmap(population[c_pool_indexes, 2], alg=:dict)
end
I was hoping that the generator (find_all_it3) would not need to allocate much memory.
However, as per the @btime output, it seems that there is an allocation on each loop iteration.
98.040 μs (10006 allocations: 625.64 KiB)
18.894 μs (18 allocations: 11.95 KiB)
Now, in my scenario the speed and allocations of findall eventually become an issue, hence I was trying to find a better alternative through generators/iterators so that fewer allocations occur. Is there a way to do that? Are there options to consider?
I don't have an explanation for it, but here are the results of a few tests I made:
The best time is achieved with view(population, :, 1) .< 5 (test4).
Using broadcast! reduces allocations a bit (test5).
The best way to reduce allocations is to write your own loop (test6).
using BenchmarkTools
using StatsBase
population_size = 10000
population = [rand(1:99, population_size) rand(1:2, population_size)]
find_all_it(f::Function, A) = (p[2] for p in eachrow(A) if f(p[1]))
function test1(population)
    c_pool = find_all_it(x -> x < 5, population)
    c_pool_dict = countmap(c_pool, alg=:dict)
end

function test3(population)
    c_pool_indexes = findall(x -> x < 5, view(population, :, 1))
    c_pool_dict = countmap(view(population, c_pool_indexes, 2), alg=:dict)
end

function test4(population)
    c_pool_indexes = view(population, :, 1) .< 5
    c_pool_dict = countmap(view(population, c_pool_indexes, 2), alg=:dict)
end

function test5(c_pool_indexes, population)
    broadcast!(<, c_pool_indexes, view(population, :, 1), 5)
    c_pool_dict = countmap(view(population, c_pool_indexes, 2), alg=:dict)
end

function test6(population)
    d = Dict{Int,Int}()
    for i in eachindex(view(population, :, 1))
        if population[i, 1] < 5
            d[population[i, 2]] = 1 + get(d, population[i, 2], 0)
        end
    end
    return d
end
julia> @btime test1(population);
68.200 μs (10004 allocations: 625.59 KiB)
julia> @btime test3(population);
14.800 μs (14 allocations: 9.00 KiB)
julia> @btime test4(population);
7.250 μs (8 allocations: 9.33 KiB)
julia> temp = zeros(Bool, population_size);
julia> @btime test5(temp, population);
16.599 μs (5 allocations: 3.78 KiB)
julia> @btime test6(population);
11.299 μs (4 allocations: 608 bytes)
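A further variant along the same lines (a sketch of my own, not benchmarked here): the per-iteration allocations of test1 most likely come from eachrow producing a view for every row. Zipping the two columns keeps the single-pass, iterator style without any per-row views:
function test7(population)
    d = Dict{Int,Int}()
    # iterate the two columns in lockstep; the two column views are
    # created once, not once per row
    for (a, v) in zip(view(population, :, 1), view(population, :, 2))
        if a < 5
            d[v] = 1 + get(d, v, 0)
        end
    end
    return d
end
This keeps the structure of test6, so the only allocations left should be for the Dict itself.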
Related
I am trying to modify a DataFrame df to only contain rows for which the values in the column closing_price are between 99 and 101 and trying to do this with the code below.
However, I get the error
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()
and I am wondering if there is a way to do this without using loops.
df = df[(99 <= df['closing_price'] <= 101)]
Consider also Series.between():
df = df[df['closing_price'].between(99, 101)]
You should use () to group your boolean vector to remove ambiguity.
df = df[(df['closing_price'] >= 99) & (df['closing_price'] <= 101)]
There is a nicer alternative: use the query() method:
In [58]: df = pd.DataFrame({'closing_price': np.random.randint(95, 105, 10)})
In [59]: df
Out[59]:
closing_price
0 104
1 99
2 98
3 95
4 103
5 101
6 101
7 99
8 95
9 96
In [60]: df.query('99 <= closing_price <= 101')
Out[60]:
closing_price
1 99
5 101
6 101
7 99
UPDATE: answering the comment:
"I like the syntax here but fell down when trying to combine with an expression; df.query('(mean + 2*sd) <= closing_price <= (mean + 2*sd)')"
In [161]: qry = "(closing_price.mean() - 2*closing_price.std())" +\
...: " <= closing_price <= " + \
...: "(closing_price.mean() + 2*closing_price.std())"
...:
In [162]: df.query(qry)
Out[162]:
closing_price
0 97
1 101
2 97
3 95
4 100
5 99
6 100
7 101
8 99
9 95
newdf = df.query('closing_price.mean() <= closing_price <= closing_price.std()')
or
mean = df['closing_price'].mean()
std = df['closing_price'].std()
newdf = df.query('@mean <= closing_price <= @std')
If one has to call pd.Series.between(l,r) repeatedly (for different bounds l and r), a lot of work is repeated unnecessarily. In this case, it's beneficial to sort the frame/series once and then use pd.Series.searchsorted(). I measured a speedup of up to 25x, see below.
def between_indices(x, lower, upper, inclusive=True):
    """
    Returns the smallest and largest index i for which
    lower <= x[i] <= upper holds, under the assumption that x is sorted.
    """
    i = x.searchsorted(lower, side="left" if inclusive else "right")
    j = x.searchsorted(upper, side="right" if inclusive else "left")
    return i, j

# Sort x once before repeated calls of between_indices()
x = x.sort_values().reset_index(drop=True)
# x = x.sort_values(ignore_index=True)  # for pandas>=1.0
ret1 = between_indices(x, lower=0.1, upper=0.9)
ret2 = between_indices(x, lower=0.2, upper=0.8)
ret3 = ...
Benchmark
Measure repeated evaluations (n_reps=100) of pd.Series.between() as well as the method based on pd.Series.searchsorted(), for different arguments lower and upper. On my MacBook Pro 2015 with Python v3.8.0 and Pandas v1.0.3, the code below results in the following output:
# pd.Series.searchsorted()
# 5.87 ms ± 321 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# pd.Series.between(lower, upper)
# 155 ms ± 6.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
# Logical expressions: (x>=lower) & (x<=upper)
# 153 ms ± 3.52 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
import numpy as np
import pandas as pd

def between_indices(x, lower, upper, inclusive=True):
    # Assumption: x is sorted.
    i = x.searchsorted(lower, side="left" if inclusive else "right")
    j = x.searchsorted(upper, side="right" if inclusive else "left")
    return i, j

def between_fast(x, lower, upper, inclusive=True):
    """
    Equivalent to pd.Series.between() under the assumption that x is sorted.
    """
    i, j = between_indices(x, lower, upper, inclusive)
    return x.iloc[i:j]
    # Alternative via a boolean mask (mask creation is slow):
    # mask = np.zeros_like(x, dtype=bool)
    # mask[i:j] = True
    # mask = pd.Series(mask, index=x.index)
    # return x[mask]

def between(x, lower, upper, inclusive=True):
    mask = x.between(lower, upper, inclusive=inclusive)
    return x[mask]

def between_expr(x, lower, upper, inclusive=True):
    if inclusive:
        mask = (x >= lower) & (x <= upper)
    else:
        mask = (x > lower) & (x < upper)
    return x[mask]

def benchmark(func, x, lowers, uppers):
    for l, u in zip(lowers, uppers):
        func(x, lower=l, upper=u)

n_samples = 1000
n_reps = 100
x = pd.Series(np.random.randn(n_samples))
# Sort the Series.
# For pandas>=1.0:
# x = x.sort_values(ignore_index=True)
x = x.sort_values().reset_index(drop=True)

# Assert equivalence of the different methods.
assert between_fast(x, 0, 1, True).equals(between(x, 0, 1, True))
assert between_expr(x, 0, 1, True).equals(between(x, 0, 1, True))
assert between_fast(x, 0, 1, False).equals(between(x, 0, 1, False))
assert between_expr(x, 0, 1, False).equals(between(x, 0, 1, False))

# Benchmark repeated evaluations of between().
uppers = np.linspace(0, 3, n_reps)
lowers = -uppers
%timeit benchmark(between_fast, x, lowers, uppers)
%timeit benchmark(between, x, lowers, uppers)
%timeit benchmark(between_expr, x, lowers, uppers)
Instead of this
df = df[(99 <= df['closing_price'] <= 101)]
You should use this
df = df[(df['closing_price'] >= 99) & (df['closing_price'] <= 101)]
We have to use NumPy's bitwise logic operators |, &, ~, ^ to compound queries.
Also, the parentheses are important for operator precedence.
For more info, see Comparisons, Masks, and Boolean Logic.
If you're dealing with multiple values and multiple inputs, you could also set up an apply function like this. In this case, filtering a DataFrame for GPS locations that fall within certain ranges.
def filter_values(lat, lon):
    if abs(lat - 33.77) < .01 and abs(lon - -118.16) < .01:
        return True
    elif abs(lat - 37.79) < .01 and abs(lon - -122.39) < .01:
        return True
    else:
        return False

df = df[df.apply(lambda x: filter_values(x['lat'], x['lon']), axis=1)]
I heard that being conscious of type stability contributes a lot to high performance in Julia programming, so I tried to measure how much time I could save by rewriting a type-unstable function as a type-stable version.
As many people say, I assumed that type-stable code would of course outperform the type-unstable version. However, the result was otherwise:
# type-unstable vs type-stable
# type-unstable
function positive(x)
    if x < 0
        return 0.0
    else
        return x
    end
end

# type-stable
function positive_safe(x)
    if x < 0
        return zero(x)
    else
        return x
    end
end

@time for n in 1:100_000_000
    a = 2^( positive(-n) + 1 )
end
@time for n in 1:100_000_000
    b = 2^( positive_safe(-n) + 1 )
end
result:
0.040080 seconds
0.150596 seconds
I cannot believe this. Are there mistakes in my code, or is this real?
Any information would be appreciated.
Context
Operating System and version: Windows 10
Browser and version: Google Chrome 90.0.4430.212 (Official Build) (64-bit)
JupyterLab version: 3.0.14
@btime result
Just replacing @time with @btime in my code above:
@btime for n in 1:100_000_000
    a = 2^( positive(-n) + 1 )
end
# -> 1.500 ns
@btime for n in 1:100_000_000
    b = 2^( positive_safe(-n) + 1 )
end
# -> 503.146 ms
Still weird.
The exact same code DNF showed me:
using BenchmarkTools
@btime 2^(positive(-n) + 1) setup=(n=rand(1:10^8))
# -> 32.435 ns (0 allocations: 0 bytes)
@btime 2^(positive_safe(-n) + 1) setup=(n=rand(1:10^8))
# -> 3.103 ns (0 allocations: 0 bytes)
Works as expected.
I still don't understand what is happening.
I feel like I need to understand the usage of @btime and the benchmarking process better.
By the way, as I said above, I'm running these benchmarks in JupyterLab.
The problem with your benchmark is that you are testing logically different code:
2 ^ (integer value)
and
2 ^ (float value)
But the most crucial part: if a and b are not defined before the loop, the Julia compiler may remove the block entirely. Your timing depends very much on whether a and b were defined beforehand, and whether they were defined in global scope.
And the power operation is the time-consuming central part of your code, not the type-unstable part (positive returns a Float64 in your case, while positive_safe returns an Int).
Code similar to your case (by logic) could look like this:
# type-unstable
function positive(x)
    if x < 0
        return 0.0
    else
        return x
    end
end

# type-stable
function positive_safe(x)
    if x < 0
        return 0.0
    else
        return Float64(x)
    end
end

function test1()
    a = 0.0
    for n in 1:100_000_000
        a += 2^( positive(-n) + 1 )
    end
    a
end

function test2()
    b = 0.0
    for n in 1:100_000_000
        b += 2^( positive_safe(-n) + 1 )
    end
    b
end

@btime test1()
@btime test2()
98.045 ms (0 allocations: 0 bytes)
2.0e8
97.948 ms (0 allocations: 0 bytes)
2.0e8
The results are almost the same, since the type instability is not the bottleneck in this case.
If we instead test a function that discards its result (similar to your case when a/b were not defined):
function test3()
    b = 0.0
    for n in 1:100_000_000
        b += 2^( positive_safe(-n) + 1 )
    end
    nothing
end

@btime test3()
The benchmark will show:
1.611 ns
This is not because my laptop did 100_000_000 iterations in 1.611 ns, but because the Julia compiler is smart enough to see that the test3 function may be replaced with nothing.
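If you want a loop like test3 to survive for timing purposes even though its result is unused, one option (a sketch assuming Julia 1.8 or newer, not part of the original answer) is Base.donotdelete, which forbids the compiler from optimizing the value away:
function test3_kept()
    b = 0.0
    for n in 1:100_000_000
        b += 2^( positive_safe(-n) + 1 )
    end
    Base.donotdelete(b)  # keep the accumulator alive despite returning nothing
    nothing
end

@btime test3_kept()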
This is a benchmarking problem. The @time macro is not suitable for microbenchmarks. Use the BenchmarkTools.jl package, and read its user manual; it is easy to make mistakes when benchmarking.
Here's how to do it:
jl> using BenchmarkTools
jl> @btime 2^(positive(-n) + 1) setup=(n=rand(1:10^8))
6.507 ns (0 allocations: 0 bytes)
2.0
jl> @btime 2^(positive_safe(-n) + 1) setup=(n=rand(1:10^8))
2.100 ns (0 allocations: 0 bytes)
2
As you see, the type stable function is faster.
The problem, as Vitaliy said, is that floating-point powers, which can be computed via logarithms, can be faster than integer powers, which are computed as repeated multiplications:
using BenchmarkTools

# float vs int, type-unstable vs type-stable
# type-unstable
function positive_float_unstable(x)
    if x < 0
        return 0.0
    else
        return x
    end
end

# type-unstable
function positive_int_unstable(x)
    if x < 0
        return 0
    else
        return x
    end
end

# type-stable
function positive_float_stable(x)
    if x < 0
        return 0.0
    else
        return Float64(x)
    end
end

# type-stable
function positive_int_stable(x)
    if x < 0
        return 0
    else
        return Int(x)
    end
end

println("unstable float")
@btime for n in 1:100_000_000
    a = 2^( positive_float_unstable(-n) + 1 )
end

println("unstable int")
@btime for n in 1:100_000_000
    b = 2^( positive_int_unstable(-n) + 1 )
end

println("stable float")
@btime for n in 1:100_000_000
    a = 2^( positive_float_stable(-n) + 1 )
end

println("stable int")
@btime for n in 1:100_000_000
    b = 2^( positive_int_stable(-n) + 1 )
end
Results:
unstable float
1.300 ns (0 allocations: 0 bytes)
unstable int
179.232 ms (0 allocations: 0 bytes)
stable float
1.300 ns (0 allocations: 0 bytes)
stable int
178.990 ms (0 allocations: 0 bytes)
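Note that the 1.300 ns float timings above cannot correspond to 100_000_000 actual iterations; they are another instance of the dead-code elimination discussed earlier. To compare just the power operation itself, a safer pattern (a sketch along the lines of DNF's setup trick above; timings will vary) is to benchmark a single call:
using BenchmarkTools
# time one 2^x call in isolation; setup keeps x a runtime value,
# so the result cannot be constant-folded away
@btime 2^x setup=(x = rand(1:62))            # integer exponent
@btime 2^x setup=(x = Float64(rand(1:62)))   # floating-point exponent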
I am looking for a Julia alternative with the same behavior as more_itertools.consecutive_groups in Python.
I came up with a simple implementation but speed is an issue here and I'm not sure if the code is optimized enough.
function consecutive_groups(array)
    groups = Vector{eltype(array)}[]
    j = 0
    for i = 1:length(array)-1
        if array[i] + 1 != array[i+1]
            push!(groups, array[j+1:i])
            j = i
        end
    end
    push!(groups, array[j+1:end])
    return groups
end
Your implementation is already quite fast. If you know that the consecutive groups will be large you might want to just increase the index instead of pushing every element:
function consecutive_groups_2(v)
    n = length(v)
    groups = Vector{Vector{eltype(v)}}()
    i = j = 1
    while i <= n && j <= n
        j = i
        while j < n && v[j] + 1 == v[j + 1]
            j += 1
        end
        push!(groups, v[i:j])
        i = j + 1
    end
    return groups
end
which is roughly 33% faster on large groups:
julia> x = collect(1:100000);
julia> @btime consecutive_groups(x)
165.939 μs (4 allocations: 781.45 KiB)
1-element Array{Array{Int64,1},1}:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10 … 99991, 99992, 99993, 99994, 99995, 99996, 99997, 99998, 99999, 100000]
julia> @btime consecutive_groups_2(x)
114.830 μs (4 allocations: 781.45 KiB)
1-element Array{Array{Int64,1},1}:
[1, 2, 3, 4, 5, 6, 7, 8, 9, 10 … 99991, 99992, 99993, 99994, 99995, 99996, 99997, 99998, 99999, 100000]
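As a quick sanity check of the behavior on a less trivial (hypothetical) input, matching more_itertools.consecutive_groups, this call should split at each gap in the run of consecutive values:
julia> consecutive_groups_2([1, 2, 3, 7, 8, 10])
3-element Array{Array{Int64,1},1}:
 [1, 2, 3]
 [7, 8]
 [10]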
I have a function I'm jitting with @jit(nopython=True).
Inside, it has a loop that does a bunch of stuff, calculates a correlation, and then assigns that to a preallocated output array. Both the target array and the correlation have the same type (np.float32), but for some reason the assignment makes the function take 100X as long.
To make things even more strange, if I instead assign a meaningless float, np.float32(i*1.01), instead of my correlation value, the function runs at an appropriate speed.
Given that everything is the same type, shouldn't they both run at the same speed?
corrs = np.zeros(a.shape[0], dtype=np.float32)
for i in range(lb, a.shape[0]):
    # a bunch of calculations happens here
    correl = np.float32(covar/(a_std*b_std))
    testval = np.float32(i*1.01)
    # doing this makes the function take FOREVER
    # corrs[i] = correl
    # but doing this runs very quickly, even though it is also a np.float32
    # corrs[i] = testval
Here is a runnable example. I added an argument called assign: if True it assigns the value I want to assign, and if False it assigns my useless test value.
import numpy as np
from numba import jit

@jit(nopython=True)
def hist_corr_loop(a, b, lb=1000, assign=True):
    flb = np.float32(lb)
    a_mu, b_mu = a[0], b[0]
    for i in range(1, lb):
        a_mu += a[i]
        b_mu += b[i]
    a_mu = a_mu/flb
    b_mu = b_mu/flb

    a_var, b_var = np.float32(0.0), np.float32(0.0)
    for i in range(lb):
        a_var += np.square(a[i] - a_mu)
        b_var += np.square(b[i] - b_mu)
    a_var = a_var/flb
    b_var = b_var/flb

    corrs = np.zeros(a.shape[0], dtype=np.float32)
    for i in range(lb, a.shape[0]):
        # calculate new means and stdevs
        _a_mu = a_mu
        _b_mu = b_mu
        a_mu = _a_mu + (a[i] - a[i-lb])/flb
        b_mu = _b_mu + (b[i] - b[i-lb])/flb
        a_var += (a[i] - a[i-lb])*(a[i] - a_mu + a[i-lb] - _a_mu)/flb
        b_var += (b[i] - b[i-lb])*(b[i] - b_mu + b[i-lb] - _b_mu)/flb
        a_std = np.sqrt(a_var)
        b_std = np.sqrt(b_var)

        covar = np.float32(0.0)
        for j in range(i-lb+1, i+1):
            covar += (a[j] - a_mu)*(b[j] - b_mu)
        covar = covar/flb
        correl = np.float32(covar/(a_std*b_std))

        testval = np.float32(i*1.01)
        if assign:
            corrs[i] = correl
        else:
            corrs[i] = testval
    return corrs
to run:
n = 10000000
a = np.random.random(n)
b = np.random.random(n)
%timeit hist_corr_loop(a,b,1000, True)
%timeit hist_corr_loop(a,b, 1000, False)
I get
%timeit hist_corr_loop(a,b,1000, True)
10.5 s ± 52.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit hist_corr_loop(a,b, 1000, False)
220 ms ± 1.05 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
10 seconds vs 220 ms.
I have two arrays A (4000,4000) of which only the diagonal is filled with data, and B (4000,5), filled with data. Is there a way to multiply (dot) these arrays that is faster than the numpy.dot(a,b) function?
So far I found that (A * B.T).T should be faster (where A is one dimensional (4000,), filled with the diagonal elements), but it turned out to be roughly twice as slow.
Is there a faster way to calculate B.dot(A) in the case where A is a diagonal array?
You could simply extract the diagonal elements and then perform broadcasted elementwise multiplication.
Thus, a replacement for B*A would be -
np.multiply(np.diag(B)[:,None], A)
and for A.T*B -
np.multiply(A.T,np.diag(B))
Runtime test -
In [273]: # Setup
...: M,N = 4000,5
...: A = np.random.randint(0,9,(M,N)).astype(float)
...: B = np.zeros((M,M),dtype=float)
...: np.fill_diagonal(B, np.random.randint(11,99,(M)))
...: A = np.matrix(A)
...: B = np.matrix(B)
...:
In [274]: np.allclose(B*A, np.multiply(np.diag(B)[:,None], A))
Out[274]: True
In [275]: %timeit B*A
10 loops, best of 3: 32.1 ms per loop
In [276]: %timeit np.multiply(np.diag(B)[:,None], A)
10000 loops, best of 3: 33 µs per loop
In [282]: np.allclose(A.T*B, np.multiply(A.T,np.diag(B)))
Out[282]: True
In [283]: %timeit A.T*B
10 loops, best of 3: 24.1 ms per loop
In [284]: %timeit np.multiply(A.T,np.diag(B))
10000 loops, best of 3: 36.2 µs per loop
It appears that my initial claim of (A * B.T).T being slower is incorrect.
from timeit import default_timer as timer
import numpy as np

##### Case 1
a = np.zeros((4000, 4000))
np.fill_diagonal(a, 10)
b = np.ones((4000, 5))
dot_list = []

def time_dot(a, b):
    start = timer()
    c = np.dot(a, b)
    end = timer()
    return end - start

for i in range(100):
    dot_list.append(time_dot(a, b))
print(np.mean(np.asarray(dot_list)))

##### Case 2
a = np.ones((4000,))
a = a * 10
b = np.ones((4000, 5))
shortcut_list = []

def time_quicker(a, b):
    start = timer()
    c = (a*b.T).T
    end = timer()
    return end - start

for i in range(100):
    shortcut_list.append(time_quicker(a, b))
print(np.mean(np.asarray(shortcut_list)))

##### Case 3
a = np.zeros((4000, 4000))  # diagonal matrix
np.fill_diagonal(a, 10)
b = np.ones((4000, 5))
case3_list = []

def time_multiply(a, b):
    start = timer()
    c = np.multiply(b.T, np.diag(a))
    end = timer()
    return end - start

for i in range(100):
    case3_list.append(time_multiply(a, b))
print(np.mean(np.asarray(case3_list)))
results in:
0.119120892431
0.00010633951868
0.00214490709662
So the second method is the fastest.