Python NumPy Matrix Update Confusion

This is part of my matrix factorization code (a rather unusual variant of NMF). My issue is that although I save copies of the old W and H matrices on every iteration, when I compare old_W and W after W finishes updating each time, they are actually the same! So the reported error is always 0 and the loop stops after the first iteration. However, the commented-out "print(old - new)" check shows that the element W[r][i] is in fact updated every time. What is it that I am not seeing?
import numpy as np
import random

# criterion, calcGw, calcGh, wPosNeg and hPosNeg are defined elsewhere in my code

def csmf(V, l, max_iter, err, alpha=0.01, beta=0.01, lamb=0.01):
    W = np.random.rand(V.shape[0], l)
    H = np.random.rand(l, V.shape[1])
    n = V.shape[0]
    N = V.shape[1]
    NwOone = 60
    NwOtwo = 60
    NhOone = 50
    NhOtwo = 50
    for t in range(max_iter):
        old_W = W  # save old values
        old_H = H
        old = criterion(V, old_W, old_H, l, alpha, beta, lamb)
        print("iteration ", t)
        ##### update W
        print("updating W")
        setw = range(0, n)
        subset_one = random.sample(setw, NwOone)
        subset_two = calcGw(V, W, H, n, l, alpha, beta, NwOtwo)
        chosen = np.intersect1d(subset_one, subset_two)
        for r in chosen:
            for i in range(len(W[0])):
                update = wPosNeg(W[r], N, i, l, V, r, beta, H)
                old = W[r][i]
                W[r][i] = update
                new = W[r][i]
                # print(old - new)
        ##### update H
        print("updating H")
        seth = range(0, N)
        subset_oneh = random.sample(seth, NhOone)
        subset_twoh = calcGh(V, W, H, N, l, NhOtwo, lamb)
        chosenh = np.intersect1d(subset_oneh, subset_twoh)
        for s in chosenh:  # column
            for i in range(len(H)):
                updateh = hPosNeg(H[i], n, i, l, V, s, lamb, W)
                H[i][s] = updateh
        ##### check err
        print("Checking criterion")
        print(criterion(V, W, H, l, alpha, beta, lamb))
        print(criterion(V, old_W, old_H, l, alpha, beta, lamb))
        actual = abs(criterion(V, W, H, l, alpha, beta, lamb) - criterion(V, old_W, old_H, l, alpha, beta, lamb))
        if actual <= err:
            return W, H, actual
    return W, H, actual

dmat = np.random.rand(100, 80)
W, H, err = csmf(dmat, 1, 10, 0.001, alpha=0.001, beta=0.001, lamb=0.001)
print(err)

In these lines:
old_W = W  # save old values
old_H = H
you are not saving a copy, you are keeping a reference: old_W and W are the same piece of memory. Try this instead:
old_W = W.copy()  # save old values
old_H = H.copy()
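
To see why this matters, here is a minimal standalone sketch (a toy array, not the asker's W) showing that plain assignment only creates a second name for the same NumPy array, while .copy() takes an independent snapshot:

import numpy as np

W = np.zeros((2, 2))
old_W = W          # a second name for the same array in memory
snap_W = W.copy()  # an independent snapshot of the current values

W[0, 0] = 1.0
print(old_W[0, 0])   # 1.0 -- the "old" values changed along with W
print(snap_W[0, 0])  # 0.0 -- the copy kept the original value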

Related

I am getting an error saying "object cannot be interpreted as an integer"

# filenames, dirpath, SAMPLE_RATE, num_segments and samples_per_segment
# are defined earlier in the surrounding code
for f in filenames:
    file_path = os.path.join(dirpath, f)
    signal, sample_rate = librosa.load(file_path, sr=SAMPLE_RATE)
    for d in range(num_segments):
        # calculate start and finish sample for current segment
        start = samples_per_segment * d
        finish = start + samples_per_segment
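
For context, this TypeError is raised when range() receives a float, which typically happens when a count like num_segments comes from ordinary division. A minimal sketch reproducing and fixing it (the names below are illustrative, not from the question):

samples = 10
segments = 3
num_segments = samples / segments   # plain division -> float (3.33...)

# range(num_segments) would raise:
# TypeError: 'float' object cannot be interpreted as an integer

for d in range(int(num_segments)):  # cast to int, or use samples // segments
    print(d)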

Need to plot multiple values over each number of iterations (python help)

I'm trying to plot the multiple values one gets for 'f_12' over a certain number of iterations. It should look like points oscillating strongly at low iteration counts 'N' and then converging to a rough value of 0.204. I'm getting the correct outputs for 'f_12', but I'm having a really hard time doing the plots. New to Python here.
import time
import random
import numpy as np
import matplotlib.pyplot as plt

start = time.time()
# looking for F_12 via monte carlo method
# Inputs
# N = number of rays to generate
N = 1000
# w = width of plates
w = 1
# h = vertical separation of plates
# L = horizontal offset of plates (L=w=h)
L = 1
h = 1
p_points = 100
# counter for number of rays and number of hits
rays = 0
hits = 0
while rays < N:
    rays = rays + 1
    # random origin of rays along w on surface 1
    Rx = random.uniform(0, 1)
    Rt = random.uniform(0, 1)
    Rph = random.uniform(0, 1)
    x1 = Rx * w
    # polar and azimuth angles - random ray directions
    theta = np.arcsin(np.sqrt(Rt))
    phi = 2*np.pi*Rph
    # theta = np.arcsin(Rt)
    xi = x1 + h*np.tan(theta)*np.cos(phi)
    if xi >= L and xi <= (L+w):
        hit = 1
    else:
        hit = 0
    hits = hits + hit
    gap = N / p_points
    r = rays % gap
    if r == 0:
        F = hits / rays
        plt.figure(figsize=(8, 4))
        plt.plot(N, F, linewidth=2)
        plt.xlabel("N - Rays")
        plt.ylabel("F_12")
        plt.show()
f_12 = hits / N
print(f"F_12 = {f_12} at N = {N} iterations")
# Grab Current Time After Running the Code
end = time.time()
# Subtract Start Time from The End Time
total_time = end - start
f_time = round(total_time)
print(f"Running time = {f_time} seconds")

How to speed up a simple linear algebra optimization problem in Julia?

I implemented the LSDD changepoint detection method described in [1] in Julia, to see if I could make it faster than the existing Python implementation [2], which is based on a grid search that looks for the optimal parameters.
I obtain the desired results, but despite my best efforts my grid-search version takes about the same time to compute as the Python one, which is still far too long for real applications.
I also tried using the Optim package, which only makes things worse (2 or 3 times slower).
Here is the grid search that I implemented:
using Random
using LinearAlgebra

function squared_distance(X::Array{Float64,1}, C::Array{Float64,1})
    sqd = zeros(length(X), length(C))
    for i in 1:length(X)
        for j in 1:length(C)
            sqd[i,j] = X[i]^2 + C[j]^2 - 2*X[i]*C[j]
        end
    end
    return sqd
end

function lsdd(x::Array{Float64,1}, y::Array{Float64,1}; folds = 5, sigma_list = nothing, lambda_list = nothing)
    lx, ly = length(x), length(y)
    b = min(lx+ly, 300)
    C = shuffle(vcat(x,y))[1:b]
    CC_dist2 = squared_distance(C,C)
    xC_dist2, yC_dist2 = squared_distance(x,C), squared_distance(y,C)
    Tx, Ty = length(x) - div(lx,folds), length(y) - div(ly,folds)
    # Define the training and testing data sets
    cv_split1, cv_split2 = floor.(collect(1:lx)*folds/lx), floor.(collect(1:ly)*folds/ly)
    cv_index1, cv_index2 = shuffle(cv_split1), shuffle(cv_split2)
    tr_idx1, tr_idx2 = [findall(x->x!=i, cv_index1) for i in 1:folds], [findall(x->x!=i, cv_index2) for i in 1:folds]
    te_idx1, te_idx2 = [findall(x->x==i, cv_index1) for i in 1:folds], [findall(x->x==i, cv_index2) for i in 1:folds]
    xTr_dist, yTr_dist = [xC_dist2[i,:] for i in tr_idx1], [yC_dist2[i,:] for i in tr_idx2]
    xTe_dist, yTe_dist = [xC_dist2[i,:] for i in te_idx1], [yC_dist2[i,:] for i in te_idx2]
    if sigma_list == nothing
        sigma_list = [0.25, 0.5, 0.75, 1, 1.2, 1.5, 2, 2.5, 2.2, 3, 5]
    end
    if lambda_list == nothing
        lambda_list = [1.00000000e-03, 3.16227766e-03, 1.00000000e-02, 3.16227766e-02,
                       1.00000000e-01, 3.16227766e-01, 1.00000000e+00, 3.16227766e+00,
                       1.00000000e+01]
    end
    # memory preallocation
    score_cv = zeros(length(sigma_list), length(lambda_list))
    H = zeros(b,b)
    hx_tr, hy_tr = [zeros(b,1) for i in 1:folds], [zeros(b,1) for i in 1:folds]
    hx_te, hy_te = [zeros(1,b) for i in 1:folds], [zeros(1,b) for i in 1:folds]
    #h_tr, h_te = zeros(b,1), zeros(1,b)
    theta = zeros(b)
    for (sigma_idx, sigma) in enumerate(sigma_list)
        # the expression of H is different in higher dimension
        #H = sqrt((sigma^2)*pi)*exp.(-CC_dist2/(4*sigma^2))
        set_H(H, CC_dist2, sigma, b)
        # check if the sum is performed along the right dimension
        set_htr(hx_tr, xTr_dist, sigma, Tx), set_htr(hy_tr, yTr_dist, sigma, Ty)
        set_hte(hx_te, xTe_dist, sigma, lx-Tx), set_hte(hy_te, yTe_dist, sigma, ly-Ty)
        for i in 1:folds
            h_tr = hx_tr[i] - hy_tr[i]
            h_te = hx_te[i] - hy_te[i]
            #set_h(h_tr, hx_tr[i], hy_tr[i], b)
            #set_h(h_te, hx_te[i], hy_te[i], b)
            for (lambda_idx, lambda) in enumerate(lambda_list)
                set_theta(theta, H, lambda, h_tr, b)
                score_cv[sigma_idx, lambda_idx] += dot(theta, H*theta) - 2*dot(theta, h_te)
            end
        end
    end
    # retrieve the value of the optimal parameters
    best_idx = findmin(score_cv)[2]           # CartesianIndex (sigma_idx, lambda_idx) of the minimum
    sigma_chosen = sigma_list[best_idx[1]]    # row index -> sigma
    lambda_chosen = lambda_list[best_idx[2]]  # column index -> lambda
    # calculating the new "optimal" solution
    H = sqrt((sigma_chosen^2)*pi)*exp.(-CC_dist2/(4*sigma_chosen^2))
    H_lambda = H + lambda_chosen*Matrix{Float64}(I, b, b)
    h = (1/lx)*sum(exp.(-xC_dist2/(2*sigma_chosen^2)), dims = 1) - (1/ly)*sum(exp.(-yC_dist2/(2*sigma_chosen^2)), dims = 1)
    theta_final = H_lambda\transpose(h)
    f = transpose(theta_final).*sum(exp.(-vcat(xC_dist2, yC_dist2)/(2*sigma_chosen^2)), dims = 1)
    L2 = 2*dot(theta_final, h) - dot(theta_final, H*theta_final)
    return L2
end

function set_H(H::Array{Float64,2}, dist::Array{Float64,2}, sigma::Float64, b::Int)
    for i in 1:b
        for j in 1:b
            H[i,j] = sqrt((sigma^2)*pi)*exp(-dist[i,j]/(4*sigma^2))
        end
    end
end

function set_theta(theta::Array{Float64,1}, H::Array{Float64,2}, lambda::Float64, h::Array{Float64,2}, b::Int)
    Hl = H + lambda*Matrix{Float64}(I, b, b)
    sol = copy(h)
    LAPACK.posv!('L', Hl, sol)  # solves Hl * sol = h in place, overwriting sol
    theta .= vec(sol)           # write into the caller's array; `theta = sol` would only rebind the local name
end

function set_htr(h, dists, sigma::Float64, T::Int)
    for (CVidx, dist) in enumerate(dists)
        for (idx, value) in enumerate((1/T)*sum(exp.(-dist/(2*sigma^2)), dims = 1))
            h[CVidx][idx] = value
        end
    end
end

function set_hte(h, dists, sigma::Float64, T::Int)
    for (CVidx, dist) in enumerate(dists)
        for (idx, value) in enumerate((1/T)*sum(exp.(-dist/(2*sigma^2)), dims = 1))
            h[CVidx][idx] = value
        end
    end
end

function set_h(h, h1, h2, b)
    for i in 1:b
        h[i] = h1[i] - h2[i]
    end
end
The set_H, set_h and set_theta functions are there because I read somewhere that modifying preallocated memory in place with a function was faster, but it did not make a great difference.
To test it, I use two random distributions as input data:
x,y = rand(500),1.5*rand(500)
lsdd(x,y) #returns a value around 0.3
Now here is the version of the code where I try to use Optim:
function Theta(sigma::Float64, lambda::Float64, x::Array{Float64,1}, y::Array{Float64,1}, folds::Int)
    lx, ly = length(x), length(y)
    b = min(lx+ly, 300)
    C = shuffle(vcat(x,y))[1:b]
    CC_dist2 = squared_distance(C,C)
    xC_dist2, yC_dist2 = squared_distance(x,C), squared_distance(y,C)
    # the subsets are not mutually exclusive!
    Tx, Ty = length(x) - div(lx,folds), length(y) - div(ly,folds)
    shuffled_x, shuffled_y = [shuffle(1:lx) for i in 1:folds], [shuffle(1:ly) for i in 1:folds]
    cv_index1, cv_index2 = floor.(collect(1:lx)*folds/lx)[shuffle(1:lx)], floor.(collect(1:ly)*folds/ly)[shuffle(1:ly)]
    tr_idx1, tr_idx2 = [i[1:Tx] for i in shuffled_x], [i[1:Ty] for i in shuffled_y]
    te_idx1, te_idx2 = [i[Tx:end] for i in shuffled_x], [i[Ty:end] for i in shuffled_y]
    xTr_dist, yTr_dist = [xC_dist2[i,:] for i in tr_idx1], [yC_dist2[i,:] for i in tr_idx2]
    xTe_dist, yTe_dist = [xC_dist2[i,:] for i in te_idx1], [yC_dist2[i,:] for i in te_idx2]
    score_cv = 0
    Id = Matrix{Float64}(I, b, b)
    H = sqrt((sigma^2)*pi)*exp.(-CC_dist2/(4*sigma^2))
    hx_tr, hy_tr = [transpose((1/Tx)*sum(exp.(-dist/(2*sigma^2)), dims = 1)) for dist in xTr_dist], [transpose((1/Ty)*sum(exp.(-dist/(2*sigma^2)), dims = 1)) for dist in yTr_dist]
    hx_te, hy_te = [(lx-Tx)*sum(exp.(-dist/(2*sigma^2)), dims = 1) for dist in xTe_dist], [(ly-Ty)*sum(exp.(-dist/(2*sigma^2)), dims = 1) for dist in yTe_dist]
    for i in 1:folds
        h_tr, h_te = hx_tr[i] - hy_tr[i], hx_te[i] - hy_te[i]
        #theta = (H + lambda * Id)\h_tr
        theta = copy(h_tr)
        Hl = H + lambda*Id
        LAPACK.posv!('L', Hl, theta)
        score_cv += dot(theta, H*theta) - 2*dot(theta, h_te)
    end
    return score_cv, (CC_dist2, xC_dist2, yC_dist2)
end

function cost(params::Array{Float64,1}, x::Array{Float64,1}, y::Array{Float64,1}, folds::Int)
    s, l = params[1], params[2]
    return Theta(s, l, x, y, folds)[1]
end

"""
Performs the optimization
"""
function lsdd3(x::Array{Float64,1}, y::Array{Float64,1}; folds = 4)
    start = [1, 0.1]
    b = min(length(x)+length(y), 300)
    lx, ly = length(x), length(y)
    #result = optimize(params -> cost(params,x,y,folds), fill(0.0,2), fill(50.0,2), start, Fminbox(LBFGS(linesearch=LineSearches.BackTracking())); autodiff = :forward)
    result = optimize(params -> cost(params,x,y,folds), start, BFGS(), Optim.Options(f_calls_limit = 5, iterations = 5))
    #bboptimize(rosenbrock2d; SearchRange = [(-5.0, 5.0), (-2.0, 2.0)])
    #result = optimize(cost, [0,0], [Inf,Inf], start, Fminbox(AcceleratedGradientDescent()))
    sigma_chosen, lambda_chosen = Optim.minimizer(result)
    CC_dist2, xC_dist2, yC_dist2 = Theta(sigma_chosen, lambda_chosen, x, y, folds)[2]
    H = sqrt((sigma_chosen^2)*pi)*exp.(-CC_dist2/(4*sigma_chosen^2))
    h = (1/lx)*sum(exp.(-xC_dist2/(2*sigma_chosen^2)), dims = 1) - (1/ly)*sum(exp.(-yC_dist2/(2*sigma_chosen^2)), dims = 1)
    theta_final = (H + lambda_chosen*Matrix{Float64}(I, b, b))\transpose(h)
    f = transpose(theta_final).*sum(exp.(-vcat(xC_dist2, yC_dist2)/(2*sigma_chosen^2)), dims = 1)
    L2 = 2*dot(theta_final, h) - dot(theta_final, H*theta_final)
    return L2
end
No matter which options I use in the optimizer, I always end up with something too slow. Maybe the grid search is the best option, but I don't know how to make it faster. Does anyone have an idea how I could proceed further?
[1] : http://www.mcduplessis.com/wp-content/uploads/2016/05/Journal-IEICE-2014-CLSDD-1.pdf
[2] : http://www.ms.k.u-tokyo.ac.jp/software.html

Probabilistic Record Linkage in Pandas

I have two dataframes (X & Y). I would like to link them together and to predict the probability that each potential match is correct.
import math
import numpy as np
import pandas as pd

X = pd.DataFrame({'A': ["One", "Two", "Three"]})
Y = pd.DataFrame({'A': ["One", "To", "Free"]})
Method A
I have not yet fully understood the theory but there is an approach presented in:
Sayers, A., Ben-Shlomo, Y., Blom, A.W. and Steele, F., 2015. Probabilistic record linkage. International journal of epidemiology, 45(3), pp.954-964.
Here is my attempt to implement it in pandas:
# Probability that Matches are True Matches
m = 0.95
# Probability that non-Matches are True non-Matches
u = min(len(X), len(Y)) / (len(X) * len(Y))

# Priors
M_Pr = u
U_Pr = 1 - M_Pr
O_Pr = M_Pr / U_Pr  # Prior odds of a match

# Combine the dataframes
X['key'] = 1
Y['key'] = 1
Z = pd.merge(X, Y, on='key')
Z = Z.drop('key', axis=1)
X = X.drop('key', axis=1)
Y = Y.drop('key', axis=1)

# Levenshtein distance
def Levenshtein_distance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1
    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2 + 1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]

L_D = np.vectorize(Levenshtein_distance, otypes=[float])
Z["D"] = L_D(Z['A_x'], Z['A_y'])

# Max string length
def Max_string_length(X, Y):
    return max(len(X), len(Y))

M_L = np.vectorize(Max_string_length, otypes=[float])
Z["L"] = M_L(Z['A_x'], Z['A_y'])

# Agreement weight
def Agreement_weight(D, L):
    return 1 - (D / L)

A_W = np.vectorize(Agreement_weight, otypes=[float])
Z["C"] = A_W(Z['D'], Z['L'])

# Likelihood ratio
def Likelihood_ratio(C):
    return (m/u) - ((m/u) - ((1-m) / (1-u))) * (1-C)

L_R = np.vectorize(Likelihood_ratio, otypes=[float])
Z["G"] = L_R(Z['C'])

# Match weight
def Match_weight(G):
    return math.log(G) * math.log(2)

M_W = np.vectorize(Match_weight, otypes=[float])
Z["R"] = M_W(Z['G'])

# Posterior odds
def Posterior_odds(R):
    return math.exp(R / math.log(2)) * O_Pr

P_O = np.vectorize(Posterior_odds, otypes=[float])
Z["O"] = P_O(Z['R'])

# Probability
def Probability(O):
    return O / (1 + O)

Pro = np.vectorize(Probability, otypes=[float])
Z["P"] = Pro(Z['O'])
I have verified that this gives the same results as in the paper. Here is a sensitivity check on m, showing that it doesn't make a lot of difference (plot not reproduced here).
Method B
These assumptions won't apply to all applications, but in some cases each row of X should match a row of Y. In that case:
- the probabilities should sum to 1
- if there are many credible candidates to match to, that should reduce the probability of getting the right one
then:
X["I"] = X.index
# Combine the dataframes
X['key'] = 1
Y['key'] = 1
Z = pd.merge(X, Y, on='key')
Z = Z.drop('key',axis=1)
X = X.drop('key',axis=1)
Y = Y.drop('key',axis=1)
# Levenshtein distance
def Levenshtein_distance(s1, s2):
if len(s1) > len(s2):
s1, s2 = s2, s1
distances = range(len(s1) + 1)
for i2, c2 in enumerate(s2):
distances_ = [i2+1]
for i1, c1 in enumerate(s1):
if c1 == c2:
distances_.append(distances[i1])
else:
distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
distances = distances_
return distances[-1]
L_D = np.vectorize(Levenshtein_distance, otypes=[float])
Z["D"] = L_D(Z['A_x'], Z['A_y'])
# Max string length
def Max_string_length(X, Y):
return max(len(X), len(Y))
M_L = np.vectorize(Max_string_length, otypes=[float])
Z["L"] = M_L(Z['A_x'], Z['A_y'])
# Agreement weight
def Agreement_weight(D, L):
return 1 - ( D / L )
A_W = np.vectorize(Agreement_weight, otypes=[float])
Z["C"] = A_W(Z['D'], Z['L'])
# Normalised Agreement Weight
T = Z .groupby('I') .agg({'C' : sum})
D = pd.DataFrame(T)
D.columns = ['T']
J = Z.set_index('I').join(D)
J['P1'] = J['C'] / J['T']
Comparing it against Method A (plot not reproduced here).
Method C
This combines Method A with Method B:

# Normalised Probability
U = Z.groupby('I').agg({'P': sum})
E = pd.DataFrame(U)
E.columns = ['U']
K = Z.set_index('I').join(E)
K['P1'] = J['P1']
K['P2'] = K['P'] / K['U']
We can see that method B (P1) doesn't take account of uncertainty whereas method C (P2) does.
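
For reference, a quick way to inspect the three scores side by side after running the snippets above (hypothetical usage; it assumes Method A's P column was computed on the same Z before Methods B and C ran):

# P  : Method A posterior probability for each candidate pair
# P1 : Method B agreement weight, normalised to sum to 1 per row of X
# P2 : Method C, Method A's P normalised to sum to 1 per row of X
print(K[['A_x', 'A_y', 'P', 'P1', 'P2']])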

Append array to itself by repeated function calls

I'm trying to append to an array by repeatedly calling a function. When I put the append command in a loop this works fine, but not when the loop calls the function that's supposed to do the appending.
import numpy as np

test_value = 555
i = 0
j = 0
test_array = np.empty([0, 3])

def test(test_value, i, j, test_array):
    test_temp = []
    test_temp.append(i)
    test_temp.append(j)
    test_temp.append(test_value)
    test_temp_1 = test_temp
    test_temp_2 = np.array(test_temp_1)
    test_temp_2 = np.reshape(test_temp_2, (1, 3))
    test_array = np.append(test_array, test_temp_2, axis=0)
    return test_array

for i in range(0, 10):
    i = i + 1
    j = j + 2
    test(test_value, i, j, test_array)
    print("test array", test_array)
Ideally, test_array would get a new row added on each pass through the loop, but the final print shows that test_array stays empty.
Cheers
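
The key detail is that np.append never modifies its argument in place; it returns a new array, and inside the function test_array is only a local name. A minimal sketch of the fix, assuming the intent is one new row per call, is to keep the returned array:

import numpy as np

test_value = 555
test_array = np.empty([0, 3])

def test(test_value, i, j, test_array):
    row = np.array([i, j, test_value]).reshape(1, 3)
    # np.append returns a new array; the caller must keep the result
    return np.append(test_array, row, axis=0)

j = 0
for i in range(1, 11):
    j = j + 2
    test_array = test(test_value, i, j, test_array)  # rebind to the result

print("test array", test_array)  # now has 10 rows instead of staying empty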