RAPIDS/NUMBA: Faster way to parallelize a for-loop on small data? - data-science

If I have data that easily fits into memory, but I need to iterate over it hundreds or thousands of times, is there a faster way?
For instance, suppose I have 400k datapoints and I need to run 1000 filters over them. Doing that with a for-loop is 4-10 times slower than doing a single operation on data of length 400k*1000.
# setup
import cudf
import numpy as np
import cupy as cp
from numba import cuda

cp.random.seed(42)

signal_ranges = []
signal_len = 1000
data_size = 400000

for signal in range(signal_len):
    s_low = cp.random.rand(1, dtype='float')

    def get_high():
        return cp.random.rand(1, dtype='float')

    s_high = 0
    while s_high <= s_low:
        s_high = get_high()
    signal_ranges.append((s_low, s_high))
EXAMPLE 1 - length 400k*1000
@cuda.jit
def filter_signal(in_col, s1, s2, out):
    i = cuda.grid(1)
    if i < in_col.size:  # boundary guard
        out[i] = 1 if in_col[i] <= s1 else -1 if in_col[i] >= s2 else 0
%%timeit -r 1
s1 = float(signal_ranges[0][0])
s2 = float(signal_ranges[0][1])
cu_df_big = cudf.DataFrame(cp.random.rand(data_size * signal_len), columns=['in1'])
cu_df_big['0'] = 0
size = len(cu_df_big)
filter_signal.forall(size)(cu_df_big['in1'], s1, s2, cu_df_big['0'])

314 ms
EXAMPLE 2 - 400k iterated 1000 times
@cuda.jit
def filter_signal(in_col, s1, s2, out):
    i = cuda.grid(1)
    if i < in_col.size:  # boundary guard
        out[i] = 1 if in_col[i] <= s1 else -1 if in_col[i] >= s2 else 0
%%timeit -r 1
cu_df = cudf.DataFrame(cp.random.rand(data_size), columns=['in1'])
size = len(cu_df)
col_id = 0
for sigs in signal_ranges:
    s1 = float(sigs[0])
    s2 = float(sigs[1])
    col = str(col_id)
    cu_df[col] = 0
    filter_signal.forall(size)(cu_df['in1'], s1, s2, cu_df[col])
    col_id += 1

2.3 s
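One thing worth trying (a sketch only; the kernel name filter_all, the device arrays and the block sizes are my own choices, not anything from the post above): copy the filter bounds into device arrays once and launch a single 2D kernel, so all 1000 filters are applied in one launch instead of paying kernel-launch and column-creation overhead 1000 times inside the Python loop.

import cupy as cp
from numba import cuda

lows = cp.asarray([float(lo) for lo, hi in signal_ranges])
highs = cp.asarray([float(hi) for lo, hi in signal_ranges])
data = cp.random.rand(data_size)
out = cp.zeros((signal_len, data_size), dtype=cp.int8)

@cuda.jit
def filter_all(in_col, s_low, s_high, out):
    row, col = cuda.grid(2)  # row = filter index, col = datapoint index
    if row < out.shape[0] and col < out.shape[1]:
        v = in_col[col]
        out[row, col] = 1 if v <= s_low[row] else -1 if v >= s_high[row] else 0

threads = (16, 16)
blocks = ((signal_len + threads[0] - 1) // threads[0],
          (data_size + threads[1] - 1) // threads[1])
filter_all[blocks, threads](data, lows, highs, out)

Whether this beats Example 1 is not a given, since the output is the same 400k*1000 elements either way; the saving relative to Example 2 is mainly the repeated kernel launches and cuDF column assignments inside the loop.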

Related

Need to plot multiple values over each number of iterations (python help)

I'm trying to plot the values one gets for 'f_12' against the number of iterations. It should look like points that oscillate strongly when the iteration count 'N' is low and then converge to roughly 0.204. I'm getting the correct outputs for 'f_12', but I'm having a really hard time producing the plot. New to Python here.
import time
import random
import numpy as np
import matplotlib.pyplot as plt

start = time.time()
# looking for F_12 via monte carlo method
# Inputs
# N = number of rays to generate
N = 1000
# w = width of plates
w = 1
# h = vertical separation of plates
# L = horizontal offset of plates (L=w=h)
L = 1
h = 1
p_points = 100
# counter for number of rays and number of hits
rays = 0
hits = 0
while rays < N:
    rays = rays + 1
    # random origin of rays along w on surface 1
    Rx = random.uniform(0, 1)
    Rt = random.uniform(0, 1)
    Rph = random.uniform(0, 1)
    x1 = Rx * w
    # polar and azimuth angles - random ray directions
    theta = np.arcsin(np.sqrt(Rt))
    phi = 2*np.pi*Rph
    # theta = np.arcsin(Rt)
    xi = x1 + h*np.tan(theta)*np.cos(phi)
    if xi >= L and xi <= (L+w):
        hit = 1
    else:
        hit = 0
    hits = hits + hit
    gap = N / p_points
    r = rays % gap
    if r == 0:
        F = hits / rays
        plt.figure(figsize=(8, 4))
        plt.plot(N, F, linewidth=2)
        plt.xlabel("N - Rays")
        plt.ylabel("F_12")
        plt.show()
f_12 = hits / N
print(f"F_12 = {f_12} at N = {N} iterations")
# Grab Current Time After Running the Code
end = time.time()
# Subtract Start Time from The End Time
total_time = end - start
f_time = round(total_time)
print(f"Running time = {f_time} seconds")

Can't get dimensions of arrays equal to plot with MatPlotLib

I am trying to plot arrays where one of them is calculated from my x-axis values inside a for loop. I've gone through my code multiple times and checked the lengths of my arrays along the way, but I can't find a solution that makes them equal length.
This is the code I have started with:
import numpy as np
import matplotlib.pyplot as plt

a = 1; b = 2; c = 3; d = 1; e = 2
t0 = 0
t_end = 10
dt = 0.05
t = np.arange(t0, t_end, dt)
n = len(t)
fout = 1
M = 1
Ca = np.zeros(n); Cb = np.zeros(n); Cc = np.zeros(n)
Ca[0] = a; Cb[0] = b
Cc[0] = 0
k1 = 1

def rA(Ca, Cb, Cc, t):
    return -k1 * Ca**a * Cb**b * dt

while e > 1e-3:
    t = np.arange(t0, t_end, dt)
    n = len(t)
    for i in range(1, n-1):
        Ca[i+1] = Ca[i] + rA(Ca[i], Cb[i], Cc[i], t[i])
    e = abs((M - Ca[n-1]) / M)
    M = Ca[n-1]
    dt = dt/2

plt.plot(t, Ca)
plt.grid()
plt.show()
Afterwards, I try to calculate a second function for different y-values. Within the for loop I added:
Cb[i+1] = Cb[i] + rB(Ca[i], Cb[i], Cc[i], t[i])
while also defining rB in a similar manner to rA. The error I received at this point is:
IndexError: index 200 is out of bounds for axis 0 with size 200
I feel like it has to do with the way I'm initializing the Ca array. To put it in MATLAB code, which I'm more familiar with, it looks like this:
Ca = zeros(1,n)
I have recreated this code in MATLAB and there I do get a plot, so I'm wondering where I'm going wrong here.
I thought my best course of action was to change n to an int by just setting it in the while loop, but after changing n = len(t) to n = 100 I received the following error message:
ValueError: x and y must have same first dimension, but have shapes (200,) and (400,)
As my previous question turned out to be something trivial that I kept overlooking, I feel like this is the same. But I have spent over an hour looking for and trying fixes without success.
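A minimal sketch of one likely fix (my reading of the problem, not the poster's final code): re-allocate the concentration arrays every time dt is halved so their length matches the new time grid, and loop over range(n - 1) so the i + 1 write stays in bounds; rB is defined like rA here purely for illustration.

import numpy as np
import matplotlib.pyplot as plt

a, b = 1, 2
k1 = 1
t0, t_end, dt = 0, 10, 0.05
e, M = 2, 1

while e > 1e-3:
    t = np.arange(t0, t_end, dt)
    n = len(t)
    Ca = np.zeros(n); Ca[0] = a  # re-allocate so the arrays match the new grid
    Cb = np.zeros(n); Cb[0] = b
    for i in range(n - 1):       # i + 1 never exceeds n - 1
        r = -k1 * Ca[i]**a * Cb[i]**b * dt  # rA
        Ca[i + 1] = Ca[i] + r
        Cb[i + 1] = Cb[i] + r               # rB, defined like rA for illustration
    e = abs((M - Ca[-1]) / M)
    M = Ca[-1]
    dt = dt / 2

plt.plot(t, Ca, label="Ca")
plt.plot(t, Cb, label="Cb")
plt.legend()
plt.grid()
plt.show()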

Probabilistic Record Linkage in Pandas

I have two dataframes (X & Y). I would like to link them together and to predict the probability that each potential match is correct.
import math
import numpy as np
import pandas as pd

X = pd.DataFrame({'A': ["One", "Two", "Three"]})
Y = pd.DataFrame({'A': ["One", "To", "Free"]})
Method A
I have not yet fully understood the theory but there is an approach presented in:
Sayers, A., Ben-Shlomo, Y., Blom, A.W. and Steele, F., 2015. Probabilistic record linkage. International journal of epidemiology, 45(3), pp.954-964.
Here is my attempt to implement it in Pandas:
# Probability that Matches are True Matches
m = 0.95
# Probability that non-Matches are True non-Matches
u = min(len(X), len(Y)) / (len(X) * len(Y))

# Priors
M_Pr = u
U_Pr = 1 - M_Pr
O_Pr = M_Pr / U_Pr  # Prior odds of a match

# Combine the dataframes
X['key'] = 1
Y['key'] = 1
Z = pd.merge(X, Y, on='key')
Z = Z.drop('key', axis=1)
X = X.drop('key', axis=1)
Y = Y.drop('key', axis=1)

# Levenshtein distance
def Levenshtein_distance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]

L_D = np.vectorize(Levenshtein_distance, otypes=[float])
Z["D"] = L_D(Z['A_x'], Z['A_y'])

# Max string length
def Max_string_length(X, Y):
    return max(len(X), len(Y))

M_L = np.vectorize(Max_string_length, otypes=[float])
Z["L"] = M_L(Z['A_x'], Z['A_y'])

# Agreement weight
def Agreement_weight(D, L):
    return 1 - (D / L)

A_W = np.vectorize(Agreement_weight, otypes=[float])
Z["C"] = A_W(Z['D'], Z['L'])

# Likelihood ratio
def Likelihood_ratio(C):
    return (m/u) - ((m/u) - ((1-m) / (1-u))) * (1-C)

L_R = np.vectorize(Likelihood_ratio, otypes=[float])
Z["G"] = L_R(Z['C'])

# Match weight
def Match_weight(G):
    return math.log(G) * math.log(2)

M_W = np.vectorize(Match_weight, otypes=[float])
Z["R"] = M_W(Z['G'])

# Posterior odds
def Posterior_odds(R):
    return math.exp(R / math.log(2)) * O_Pr

P_O = np.vectorize(Posterior_odds, otypes=[float])
Z["O"] = P_O(Z['R'])

# Probability
def Probability(O):
    return O / (1 + O)

Pro = np.vectorize(Probability, otypes=[float])
Z["P"] = Pro(Z['O'])
I have verified that this gives the same results as in the paper. A sensitivity check on m shows that it doesn't make a lot of difference.
Method B
These assumptions won't apply to all applications, but in some cases each row of X should match a row of Y. In that case:
- the probabilities for a given row's candidate matches should sum to 1, and
- if there are many credible candidates to match to, that should reduce the probability of getting the right one;
then:
X["I"] = X.index
# Combine the dataframes
X['key'] = 1
Y['key'] = 1
Z = pd.merge(X, Y, on='key')
Z = Z.drop('key',axis=1)
X = X.drop('key',axis=1)
Y = Y.drop('key',axis=1)
# Levenshtein distance
def Levenshtein_distance(s1, s2):
if len(s1) > len(s2):
s1, s2 = s2, s1
distances = range(len(s1) + 1)
for i2, c2 in enumerate(s2):
distances_ = [i2+1]
for i1, c1 in enumerate(s1):
if c1 == c2:
distances_.append(distances[i1])
else:
distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
distances = distances_
return distances[-1]
L_D = np.vectorize(Levenshtein_distance, otypes=[float])
Z["D"] = L_D(Z['A_x'], Z['A_y'])
# Max string length
def Max_string_length(X, Y):
return max(len(X), len(Y))
M_L = np.vectorize(Max_string_length, otypes=[float])
Z["L"] = M_L(Z['A_x'], Z['A_y'])
# Agreement weight
def Agreement_weight(D, L):
return 1 - ( D / L )
A_W = np.vectorize(Agreement_weight, otypes=[float])
Z["C"] = A_W(Z['D'], Z['L'])
# Normalised Agreement Weight
T = Z .groupby('I') .agg({'C' : sum})
D = pd.DataFrame(T)
D.columns = ['T']
J = Z.set_index('I').join(D)
J['P1'] = J['C'] / J['T']
Comparing it against Method A:
Method C
This combines method A with method B:
# Normalised Probability
U = Z.groupby('I').agg({'P': sum})
E = pd.DataFrame(U)
E.columns = ['U']
K = Z.set_index('I').join(E)
K['P1'] = J['P1']
K['P2'] = K['P'] / K['U']
We can see that method B (P1) doesn't take account of uncertainty whereas method C (P2) does.
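As a quick way to eyeball that on the toy frames (a hypothetical usage snippet; it assumes the Method A columns, including P, were also computed on the cross join Z built in Method B, since Method C reads Z['P']):

# Hypothetical check: show both normalised scores for every candidate pair
print(K.reset_index()[['A_x', 'A_y', 'P1', 'P2']])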

Calculate Percentile Ranks by Group using Numpy

I'm very new to Python, and I want to calculate percentile ranks by group. My group is the wildlife management unit (WMU, a string), and ranks are based on the value of predicted moose density (PMDEN3, a float). The rank value goes into the field RankMD.
My approach was to use a for loop to calculate the 3 ranks within each WMU, but the result is that 3 ranks are created for the entire dbf file (about 23,000 records), without respect to WMU. Any help is much appreciated.
import arcpy
import numpy as np

input = r'K:\Moose\KrigStratPython\TestRank3.dbf'
arr = arcpy.da.TableToNumPyArray(input, ('PMDEN3', 'Wmu'))
c_arr = [float(x[0]) for x in np.ndarray.flatten(arr)]

for Wmu in arr:
    ## to create 3 ranks, for example
    p1 = np.percentile(c_arr, 33)   # rank = 0
    p2 = np.percentile(c_arr, 67)   # rank = 1
    p3 = np.percentile(c_arr, 100)  # rank = 2

    # use cursor to update the new rank field
    with arcpy.da.UpdateCursor(input, ['PMDEN3', 'RankMD']) as cursor:
        for row in cursor:
            if row[0] < p1:
                row[1] = 0  # rank 0
            elif p1 <= row[0] and row[0] < p2:
                row[1] = 1
            else:
                row[1] = 2
            cursor.updateRow(row)
Your for loop is correct; however, your UpdateCursor is iterating over all rows in the table. To get your desired result you need to select out a subset of the table and then use the update cursor on that. You can do this by passing a query to the where_clause parameter of the UpdateCursor function.
So you would have a query like this:
current_wmu = WMU['wmu']  # the value of the WMU the for loop is currently on; I think it would be WMU['wmu'], but I'm not positive
where_clause = "WMU = '{}'".format(current_wmu)  # format the above variable into a query string
and then your UpdateCursor would now be:
with arcpy.da.UpdateCursor(input , ['PMDEN3','RankMD'], where_clause) as cursor:
Based on the suggestion from BigGerman, I revised my code and it is now working. The script loops through each WMU value and calculates the rank percentile within each group based on PMDEN. To improve the script, I should create the array of WMU values from my input file rather than creating it manually.
import arcpy
import numpy as np

# fields to be calculated
fldPMDEN = "PMDEN"
fldRankWMU = "RankWMU"

input = r'K:\Moose\KrigStratPython\TestRank3.dbf'
arcpy.MakeFeatureLayer_management(input, "stratLayerShpNoNullsLyr")

WMUs = ["10", "11A", "11B", "11Q", "12A"]
for current_wmu in WMUs:
    ## to create 3 ranks, for example
    where_clause = "Wmu = '{}'".format(current_wmu)  # format the current WMU into a query
    with arcpy.da.UpdateCursor("stratLayerShpNoNullsLyr", [fldPMDEN, fldRankWMU], where_clause) as cursor:
        arr1 = arcpy.da.TableToNumPyArray("stratLayerShpNoNullsLyr", [fldPMDEN, fldRankWMU], where_clause)
        c_arrS = [float(x[0]) for x in np.ndarray.flatten(arr1)]
        p1 = np.percentile(c_arrS, 33)   # rank = 3
        p2 = np.percentile(c_arrS, 67)   # rank = 2
        p3 = np.percentile(c_arrS, 100)  # rank = 1 (highest density)
        for row in cursor:
            if row[0] < p1:
                row[1] = 3  # rank 3 (lowest density)
            elif p1 <= row[0] and row[0] < p2:
                row[1] = 2
            else:
                row[1] = 1
            cursor.updateRow(row)
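A possible sketch of that improvement (hypothetical code, not part of the working script above): read the distinct Wmu values straight from the input table instead of hard-coding the list.

import arcpy
import numpy as np

input = r'K:\Moose\KrigStratPython\TestRank3.dbf'
wmu_arr = arcpy.da.TableToNumPyArray(input, ('Wmu',))
WMUs = np.unique(wmu_arr['Wmu']).tolist()  # distinct WMU codes, e.g. "10", "11A", ...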

Variable and string concat to select variable in lua

I have a set of variables that hold quantity info, and a variable x selects the one I use. How can I concatenate the letter s with the variable x and have it read as s2 or s3, etc.? The code I managed to find does not work.
x = 2
s1 = false
s2 = 64
s3 = 64
s4 = 64
s5 = 0

if s2 >= 0 then
    x = 2
elseif s3 >= 0 then
    x = 3
elseif s4 >= 0 then
    x = 4
elseif s5 >= 0 then
    x = 5
end

if turtle.placeDown() then
    tryUp()
    turtle.select(1)
    _G["s"..x] = _G["s"..x] - 1
end
Why would you need to do that?
My suggestion to improve your code would be something like this:
local s = {false, 64, 64, 64, 0}
local x

-- pick the first slot (from s2 onwards) with a non-negative quantity,
-- mirroring the original if/elseif chain
for i = 2, #s do
    if s[i] >= 0 then
        x = i
        break
    end
end

if turtle.placeDown() then
    tryUp()
    turtle.select(1)
    s[x] = s[x] - 1 -- decrement the selected quantity
end
Using a loop makes the code somewhat neater, and there is no real need for you to use global variables. If you insist on using _G with string concatenation as in your original code, try this:
x = 2
s1 = false
s2 = 64
s3 = 64
s4 = 64
s5 = 0

if s2 >= 0 then
    x = "2" -- notice the string here
elseif s3 >= 0 then
    x = "3"
elseif s4 >= 0 then
    x = "4"
elseif s5 >= 0 then
    x = "5"
end

if turtle.placeDown() then
    tryUp()
    turtle.select(1)
    _G["s"..x] = _G["s"..x] - 1
end
This replaces all the x values with strings instead of numbers, which was probably what was causing the error.