Pandas/NumPy: concisely label first N values matching a mask - pandas

I have a sorted Series like this:
[2, 4, 5, 6, 8, 9]
I want to produce another Series or ndarray of the same length, where the first two odd numbers and the first two even numbers are labeled sequentially:
[0, 1, 2, _, _, 3]
The _ values I don't really care about. They can be zero.
Now I do it this way:
src = pd.Series([2, 4, 5, 6, 8, 9])
odd = src % 2 != 0
where = np.hstack((np.where(odd)[0][:2], np.where(~odd)[0][:2]))
where.sort() # maintain ordering - thanks to #hpaulj
res = np.zeros(len(src), int)
res[where] = np.arange(len(where))
Can you do it more concisely? The input will never be empty, but there might be no odds or no evens (in which case the result could have length 1, 2, or 3 instead of 4).

Great Problem! I'm still exploring and learning.
I've basically stuck with what you've done so far with modest tweaks for efficiency. I'll update if I think of anything else cool.
conclusions
So far, I've thrashed around alot and haven't improved much.
my answer
fast
odd = src.values % 2
even = 1 - odd
res = ((odd.cumsum() * odd) < 3) * ((even.cumsum() * even) < 3)
(res.cumsum() - 1) * res
alternative 1
pretty fast
a = src.values
odd = (a % 2).astype(bool)
rng = np.arange(len(a))
# same reason these are 2, we have 4 below
where = np.append(rng[~odd][:2], rng[odd][:2])
res = np.zeros(len(a), int)
# nature of the problem necessitates that this is 4
res[where] = np.arange(4)
alternative 2
not as quick, but creative
a = src.values
odd = a % 2
res = np.zeros(len(src), int)
b = np.arange(2)
c = b[:, None] == odd
res[(c.cumsum(1) * c <= 2).all(0)] = np.arange(4)
alternative 3
still slow
odd = src.values % 2
a = (odd[:, None] == [0, 1])
b = ((a.cumsum(0) * a) <= 2).all(1)
(b.cumsum() - 1) * b
timing
code
def pir3(src):
odd = src.values % 2
a = (odd[:, None] == [0, 1])
b = ((a.cumsum(0) * a) <= 2).all(1)
return (b.cumsum() - 1) * b
def pir0(src):
odd = src.values % 2
even = 1 - odd
res = ((odd.cumsum() * odd) < 3) * ((even.cumsum() * even) < 3)
return (res.cumsum() - 1) * res
def pir2(src):
a = src.values
odd = a % 2
res = np.zeros(len(src), int)
c = b[:, None] == odd
res[(c.cumsum(1) * c <= 2).all(0)] = np.arange(4)
return res
def pir1(src):
a = src.values
odd = (a % 2).astype(bool)
rng = np.arange(len(a))
where = np.append(rng[~odd][:2], rng[odd][:2])
res = np.zeros(len(a), int)
res[where] = np.arange(4)
return res
def john0(src):
odd = src % 2 == 0
where = np.hstack((np.where(odd)[0][:2], np.where(~odd)[0][:2]))
res = np.zeros(len(src), int)
res[where] = np.arange(len(where))
return res
def john1(src):
odd = src.values % 2 == 0
where = np.hstack((np.where(odd)[0][:2], np.where(~odd)[0][:2]))
res = np.zeros(len(src), int)
res[where] = np.arange(len(where))
return res
src = pd.Series([2, 4, 5, 6, 8, 9])
src = pd.Series([2, 4, 5, 6, 8, 9] * 10000)

Related

Using recursion to iterate multiple times through rows in a dataframe - not returning the expected result

How to loop through a dataframe series multiple times using a recursive function?
I am trying to get a simple case to work and use it in a more complicated function.
I am using a simple dataframe:
df = pd.DataFrame({'numbers': [1,2,3,4,5]
I want to iterate through the rows multiple time and sum the values. Each iteration, the index starting point increments by 1.
def recursive_sum(df, mysum=0, count=0):
df = df.iloc[count:]
if len(df.index) < 2:
return mysum
else:
for i in range(len(df.index)):
mysum += df.iloc[i, 0]
count += 1
return recursive_sum(df, mysum, count)
I think I should get:
#Iteration 1: count = 0, len(df.index) = 5 < 2, mysum = 1 + 2 + 3 + 4 + 5 = 15
#Iteration 2: count = 1, len(df.index) = 4 < 2, mysum = 15 + 2 + 3 + 4 + 5 = 29
#Iteration 3: count = 2, len(df.index) = 3 < 2, mysum = 29 + 3 + 4 + 5 = 41
#Iteration 4: count = 2, len(df.index) = 2 < 2, mysum = 41 + 4 + 5 = 50
#Iteration 5: count = 2, len(df.index) = 1 < 2, mysum = 50
But I am returning 38.
Just fixed it:
def recursive_sum(df, mysum=0, count=0):
if(len(df.index) - count) < 2:
return mysum
else:
for i in range(count, len(df.index)):
mysum += df.iloc[0]
count += 1
return recursive_sum(df, mysum, count)

Algorithm to define a 2d grid

Suppose a grid is defined by a set of grid parameters: its origin (x0, y0), an angel between one side and x axis, and increments and - please see the figure below.
There are scattered points with known coordinates on the grid but they don’t exactly fall on grid intersections. Is there an algorithm to find a set of grid parameters to define the grid so that the points are best fit to grid intersections?
Suppose the known coordinates are:
(2 , 5.464), (3.732, 6.464), (5.464, 7.464)
(3 , 3.732), (4.732, 4.732), (6.464, 5.732)
(4 , 2 ), (5.732, 3 ), (7.464, 4 ).
I expect the algorithm to find the origin (4, 2), the angle 30 degree, and the increments both 2.
You can solve the problem by finding a matrix that transforms points from positions (0, 0), (0, 1), ... (2, 2) onto the given points.
Although the grid has only 5 degrees of freedom (position of the origin + angle + scale), it is easier to define the transformation using 2x3 matrix A, because the problem can be made linear in this case.
Let a point with index (x0, y0) to be transformed into point (x0', y0') on the grid, for example (0, 0) -> (2, 5.464) and let a_ij be coefficients of matrix A. Then this pair of points results in 2 equations:
a_00 * x0 + a_01 * y0 + a_02 = x0'
a_10 * x0 + a_11 * y0 + a_12 = y0'
The unknowns are a_ij, so these equations can be written in form
a_00 * x0 + a_01 * y0 + a_02 + a_10 * 0 + a_11 * 0 + a_12 * 0 = x0'
a_00 * 0 + a_01 * 0 + a_02 * 0 + a_10 * x0 + a_11 * y0 + a_12 = y0'
or in matrix form
K0 * (a_00, a_01, a_02, a_10, a_11, a_12)^T = (x0', y0')^T
where
K0 = (
x0, y0, 1, 0, 0, 0
0, 0, 0, x0, y0, 1
)
These equations for each pair of points can be combined in a single equation
K * (a_00, a_01, a_02, a_10, a_11, a_12)^T = (x0', y0', x1', y1', ..., xn', yn')^T
or K * a = b
where
K = (
x0, y0, 1, 0, 0, 0
0, 0, 0, x0, y0, 1
x1, y1, 1, 0, 0, 0
0, 0, 0, x1, y1, 1
...
xn, yn, 1, 0, 0, 0
0, 0, 0, xn, yn, 1
)
and (xi, yi), (xi', yi') are pairs of corresponding points
This can be solved as a non-homogeneous system of linear equations. In this case the solution will minimize sum of squares of distances from each point to nearest grid intersection. This transform can be also considered to maximize overall likelihood given the assumption that points are shifted from grid intersections with normally distributed noise.
a = (K^T * K)^-1 * K^T * b
This algorithm can be easily implemented if there is a linear algebra library is available. Below is an example in Python:
import numpy as np
n_points = 9
aligned_points = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]
grid_points = [(2, 5.464), (3.732, 6.464), (5.464, 7.464), (3, 3.732), (4.732, 4.732), (6.464, 5.732), (4, 2), (5.732, 3), (7.464, 4)]
K = np.zeros((n_points * 2, 6))
b = np.zeros(n_points * 2)
for i in range(n_points):
K[i * 2, 0] = aligned_points[i, 0]
K[i * 2, 1] = aligned_points[i, 1]
K[i * 2, 2] = 1
K[i * 2 + 1, 3] = aligned_points[i, 0]
K[i * 2 + 1, 4] = aligned_points[i, 1]
K[i * 2 + 1, 5] = 1
b[i * 2] = grid_points[i, 0]
b[i * 2 + 1] = grid_points[i, 1]
# operator '#' is matrix multiplication
a = np.linalg.inv(np.transpose(K) # K) # np.transpose(K) # b
A = a.reshape(2, 3)
print(A)
[[ 1. 1.732 2. ]
[-1.732 1. 5.464]]
Then the parameters can be extracted from this matrix:
theta = math.degrees(math.atan2(A[1, 0], A[0, 0]))
scale_x = math.sqrt(A[1, 0] ** 2 + A[0, 0] ** 2)
scale_y = math.sqrt(A[1, 1] ** 2 + A[0, 1] ** 2)
origin_x = A[0, 2]
origin_y = A[1, 2]
theta = -59.99927221917264
scale_x = 1.99995599951599
scale_y = 1.9999559995159895
origin_x = 1.9999999999999993
origin_y = 5.464
However there remains a minor issue: matrix A corresponds to an affine transform. This means that grid axes are not guaranteed to be perpendicular. If this is a problem, then the first two columns of the matrix can be modified in a such way that the transform preserves angles.
Update: I fixed the mistakes and resolved sign ambiguities, so now this algorithm produces the expected result. However it should be tested to see if all cases are handled correctly.
Here is another attempt to solve this problem. The idea is to decompose transformation into non-uniform scaling matrix and rotation matrix A = R * S and then solve for coefficients sx, sy, r1, r2 of these matrices given restriction that r1^2 + r2^2 = 1. The minimization problem is described here: How to find a transformation (non-uniform scaling and similarity) that maps one set of points to another?
def shift_points(points):
n_points = len(points)
shift = tuple(sum(coords) / n_points for coords in zip(*points))
shifted_points = [(point[0] - shift[0], point[1] - shift[1]) for point in points]
return shifted_points, shift
n_points = 9
aligned_points = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]
grid_points = [(2, 5.464), (3.732, 6.464), (5.464, 7.464), (3, 3.732), (4.732, 4.732), (6.464, 5.732), (4, 2), (5.732, 3), (7.464, 4)]
aligned_points, aligned_shift = shift_points(aligned_points)
grid_points, grid_shift = shift_points(grid_points)
c1, c2 = 0, 0
b11, b12, b21, b22 = 0, 0, 0, 0
for i in range(n_points):
c1 += aligned_points[i][0] ** 2
c2 += aligned_points[i][0] ** 2
b11 -= 2 * aligned_points[i][0] * grid_points[i][0]
b12 -= 2 * aligned_points[i][1] * grid_points[i][0]
b21 -= 2 * aligned_points[i][0] * grid_points[i][1]
b22 -= 2 * aligned_points[i][1] * grid_points[i][1]
k = (b11 ** 2 * c2 + b22 ** 2 * c1 - b21 ** 2 * c2 - b12 ** 2 * c1) / \
(b21 * b11 * c2 - b12 * b22 * c1)
# r1_sqr and r2_sqr might need to be swapped
r1_sqr = 2 / (k ** 2 + 4 + k * math.sqrt(k ** 2 + 4))
r2_sqr = 2 / (k ** 2 + 4 - k * math.sqrt(k ** 2 + 4))
for sign1, sign2 in [(1, 1), (-1, 1), (1, -1), (-1, -1)]:
r1 = sign1 * math.sqrt(r1_sqr)
r2 = sign2 * math.sqrt(r2_sqr)
scale_x = -b11 / (2 * c1) * r1 - b21 / (2 * c1) * r2
scale_y = b12 / (2 * c2) * r2 - b22 / (2 * c2) * r1
if scale_x >= 0 and scale_y >= 0:
break
theta = math.degrees(math.atan2(r2, r1))
There might be ambiguities in choosing r1_sqr and r2_sqr. Origin point can be estimated from aligned_shift and grid_shift, but I didn't implement it yet.
theta = -59.99927221917264
scale_x = 1.9999559995159895
scale_y = 1.9999559995159895

How to speed up simple linear algebra optimization probelm in Julia?

I implemented the LSDD changepoint detection method decribed in [1] in Julia, to see if I could make it faster than the existing python implementation [2], which is based on a grid search that looks for the optimal parameters.
I obtain the desired results but despite my best efforts, my grid search version of it takes about the same time to compute as the python one, which is still way too long for real applications.
I also tried using the Optimize package which only makes things worse (2 or 3 times slower).
Here is the grid search that I implemented :
using Random
using LinearAlgebra
function squared_distance(X::Array{Float64,1},C::Array{Float64,1})
sqd = zeros(length(X),length(C))
for i in 1:length(X)
for j in 1:length(C)
sqd[i,j] = X[i]^2 + C[j]^2 - 2*X[i]*C[j]
end
end
return sqd
end
function lsdd(x::Array{Float64,1},y::Array{Float64,1}; folds = 5, sigma_list = nothing , lambda_list = nothing)
lx,ly = length(x), length(y)
b = min(lx+ly,300)
C = shuffle(vcat(x,y))[1:b]
CC_dist2 = squared_distance(C,C)
xC_dist2, yC_dist2 = squared_distance(x,C), squared_distance(y,C)
Tx,Ty = length(x) - div(lx,folds), length(y) - div(ly,folds)
#Define the training and testing data sets
cv_split1, cv_split2 = floor.(collect(1:lx)*folds/lx), floor.(collect(1:ly)*folds/ly)
cv_index1, cv_index2 = shuffle(cv_split1), shuffle(cv_split2)
tr_idx1,tr_idx2 = [findall(x->x!=i,cv_index1) for i in 1:folds], [findall(x->x!=i,cv_index2) for i in 1:folds]
te_idx1,te_idx2 = [findall(x->x==i,cv_index1) for i in 1:folds], [findall(x->x==i,cv_index2) for i in 1:folds]
xTr_dist, yTr_dist = [xC_dist2[i,:] for i in tr_idx1], [yC_dist2[i,:] for i in tr_idx2]
xTe_dist, yTe_dist = [xC_dist2[i,:] for i in te_idx1], [yC_dist2[i,:] for i in te_idx2]
if sigma_list == nothing
sigma_list = [0.25, 0.5, 0.75, 1, 1.2, 1.5, 2, 2.5, 2.2, 3, 5]
end
if lambda_list == nothing
lambda_list = [1.00000000e-03, 3.16227766e-03, 1.00000000e-02, 3.16227766e-02,
1.00000000e-01, 3.16227766e-01, 1.00000000e+00, 3.16227766e+00,
1.00000000e+01]
end
#memory prealocation
score_cv = zeros(length(sigma_list),length(lambda_list))
H = zeros(b,b)
hx_tr, hy_tr = [zeros(b,1) for i in 1:folds], [zeros(b,1) for i in 1:folds]
hx_te, hy_te = [zeros(1,b) for i in 1:folds], [zeros(1,b) for i in 1:folds]
#h_tr,h_te = zeros(b,1), zeros(1,b)
theta = zeros(b)
for (sigma_idx,sigma) in enumerate(sigma_list)
#the expression of H is different for higher dimension
#H = sqrt((sigma^2)*pi)*exp.(-CC_dist2/(4*sigma^2))
set_H(H,CC_dist2,sigma,b)
#check if the sum is performed along the right dimension
set_htr(hx_tr,xTr_dist,sigma,Tx), set_htr(hy_tr,yTr_dist,sigma,Ty)
set_hte(hx_te,xTe_dist,sigma,lx-Tx), set_hte(hy_te,yTe_dist,sigma,ly-Ty)
for i in 1:folds
h_tr = hx_tr[i] - hy_tr[i]
h_te = hx_te[i] - hy_te[i]
#set_h(h_tr,hx_tr[i],hy_tr[i],b)
#set_h(h_te,hx_te[i],hy_te[i],b)
for (lambda_idx,lambda) in enumerate(lambda_list)
set_theta(theta,H,lambda,h_tr,b)
score_cv[sigma_idx,lambda_idx] += dot(theta,H*theta) - 2*dot(theta,h_te)
end
end
end
#retrieve the value of the optimal parameters
sigma_chosen = sigma_list[findmin(score_cv)[2][2]]
lambda_chosen = lambda_list[findmin(score_cv)[2][2]]
#calculating the new "optimal" solution
H = sqrt((sigma_chosen^2)*pi)*exp.(-CC_dist2/(4*sigma_chosen^2))
H_lambda = H + lambda_chosen*Matrix{Float64}(I, b, b)
h = (1/lx)*sum(exp.(-xC_dist2/(2*sigma_chosen^2)),dims = 1) - (1/ly)*sum(exp.(-yC_dist2/(2*sigma_chosen^2)),dims = 1)
theta_final = H_lambda\transpose(h)
f = transpose(theta_final).*sum(exp.(-vcat(xC_dist2,yC_dist2)/(2*sigma_chosen^2)),dims = 1)
L2 = 2*dot(theta_final,h) - dot(theta_final,H*theta_final)
return L2
end
function set_H(H::Array{Float64,2},dist::Array{Float64,2},sigma::Float64,b::Int16)
for i in 1:b
for j in 1:b
H[i,j] = sqrt((sigma^2)*pi)*exp(-dist[i,j]/(4*sigma^2))
end
end
end
function set_theta(theta::Array{Float64,1},H::Array{Float64,2},lambda::Float64,h::Array{Float64,2},b::Int64)
Hl = (H + lambda*Matrix{Float64}(I, b, b))
LAPACK.posv!('L', Hl, h)
theta = h
end
function set_htr(h::Array{Float64,1},dists::Array{Float64,2},sigma::Float64,T::Int16)
for (CVidx,dist) in enumerate(dists)
for (idx,value) in enumerate((1/T)*sum(exp.(-dist/(2*sigma^2)),dims = 1))
h[CVidx][idx] = value
end
end
end
function set_hte(h::Array{Float64,1},dists::Array{Float64,2},sigma::Array{Float64,1},T::Int16)
for (CVidx,dist) in enumerate(dists)
for (idx,value) in enumerate((1/T)*sum(exp.(-dist/(2*sigma^2)),dims = 1))
h[CVidx][idx] = value
end
end
end
function set_h(h,h1,h2,b)
for i in 1:b
h[i] = h1[i] - h2[i]
end
end
The set_H, set_h and set_theta functions are there because I read somewhere that modifying prealocated memory in place with a function was faster, but it did not make a great difference.
To test it, I use two random distribution as input data :
x,y = rand(500),1.5*rand(500)
lsdd(x,y) #returns a value around 0.3
Now here is the version of the code where I try to use Optimizer :
function Theta(sigma::Float64,lambda::Float64,x::Array{Float64,1},y::Array{Float64,1},folds::Int8)
lx,ly = length(x), length(y)
b = min(lx+ly,300)
C = shuffle(vcat(x,y))[1:b]
CC_dist2 = squared_distance(C,C)
xC_dist2, yC_dist2 = squared_distance(x,C), squared_distance(y,C)
#the subsets are not be mutually exclusive !
Tx,Ty = length(x) - div(lx,folds), length(y) - div(ly,folds)
shuffled_x, shuffled_y = [shuffle(1:lx) for i in 1:folds], [shuffle(1:ly) for i in 1:folds]
cv_index1, cv_index2 = floor.(collect(1:lx)*folds/lx)[shuffle(1:lx)], floor.(collect(1:ly)*folds/ly)[shuffle(1:ly)]
tr_idx1,tr_idx2 = [i[1:Tx] for i in shuffled_x], [i[1:Ty] for i in shuffled_y]
te_idx1,te_idx2 = [i[Tx:end] for i in shuffled_x], [i[Ty:end] for i in shuffled_y]
xTr_dist, yTr_dist = [xC_dist2[i,:] for i in tr_idx1], [yC_dist2[i,:] for i in tr_idx2]
xTe_dist, yTe_dist = [xC_dist2[i,:] for i in te_idx1], [yC_dist2[i,:] for i in te_idx2]
score_cv = 0
Id = Matrix{Float64}(I, b, b)
H = sqrt((sigma^2)*pi)*exp.(-CC_dist2/(4*sigma^2))
hx_tr, hy_tr = [transpose((1/Tx)*sum(exp.(-dist/(2*sigma^2)),dims = 1)) for dist in xTr_dist], [transpose((1/Ty)*sum(exp.(-dist/(2*sigma^2)),dims = 1)) for dist in yTr_dist]
hx_te, hy_te = [(lx-Tx)*sum(exp.(-dist/(2*sigma^2)),dims = 1) for dist in xTe_dist], [(ly-Ty)*sum(exp.(-dist/(2*sigma^2)),dims = 1) for dist in yTe_dist]
for i in 1:folds
h_tr, h_te = hx_tr[i] - hy_tr[i], hx_te[i] - hy_te[i]
#theta = (H + lambda * Id)\h_tr
theta = copy(h_tr)
Hl = (H + lambda*Matrix{Float64}(I, b, b))
LAPACK.posv!('L', Hl, theta)
score_cv += dot(theta,H*theta) - 2*dot(theta,h_te)
end
return score_cv,(CC_dist2,xC_dist2,yC_dist2)
end
function cost(params::Array{Float64,1},x::Array{Float64,1},y::Array{Float64,1},folds::Int8)
s,l = params[1],params[2]
return Theta(s,l,x,y,folds)[1]
end
"""
Performs the optinization
"""
function lsdd3(x::Array{Float64,1},y::Array{Float64,1}; folds = 4)
start = [1,0.1]
b = min(length(x)+length(y),300)
lx,ly = length(x),length(y)
#result = optimize(params -> cost(params,x,y,folds),fill(0.0,2),fill(50.0,2),start, Fminbox(LBFGS(linesearch=LineSearches.BackTracking())); autodiff = :forward)
result = optimize(params -> cost(params,x,y,folds),start, BFGS(),Optim.Options(f_calls_limit = 5, iterations = 5))
#bboptimize(rosenbrock2d; SearchRange = [(-5.0, 5.0), (-2.0, 2.0)])
#result = optimize(cost,[0,0],[Inf,Inf],start, Fminbox(AcceleratedGradientDescent()))
sigma_chosen,lambda_chosen = Optim.minimizer(result)
CC_dist2, xC_dist2, yC_dist2 = Theta(sigma_chosen,lambda_chosen,x,y,folds)[2]
H = sqrt((sigma_chosen^2)*pi)*exp.(-CC_dist2/(4*sigma_chosen^2))
h = (1/lx)*sum(exp.(-xC_dist2/(2*sigma_chosen^2)),dims = 1) - (1/ly)*sum(exp.(-yC_dist2/(2*sigma_chosen^2)),dims = 1)
theta_final = (H + lambda_chosen*Matrix{Float64}(I, b, b))\transpose(h)
f = transpose(theta_final).*sum(exp.(-vcat(xC_dist2,yC_dist2)/(2*sigma_chosen^2)),dims = 1)
L2 = 2*dot(theta_final,h) - dot(theta_final,H*theta_final)
return L2
end
No matter, which kind of option I use in the optimizer, I always end up with something too slow. Maybe the grid search is the best option, but I don't know how to make it faster... Does anyone have an idea how I could proceed further ?
[1] : http://www.mcduplessis.com/wp-content/uploads/2016/05/Journal-IEICE-2014-CLSDD-1.pdf
[2] : http://www.ms.k.u-tokyo.ac.jp/software.html

Probabilistic Record Linkage in Pandas

I have two dataframes (X & Y). I would like to link them together and to predict the probability that each potential match is correct.
X = pd.DataFrame({'A': ["One", "Two", "Three"]})
Y = pd.DataFrame({'A': ["One", "To", "Free"]})
Method A
I have not yet fully understood the theory but there is an approach presented in:
Sayers, A., Ben-Shlomo, Y., Blom, A.W. and Steele, F., 2015. Probabilistic record linkage. International journal of epidemiology, 45(3), pp.954-964.
Here is my attempt to implementat it in Pandas:
# Probability that Matches are True Matches
m = 0.95
# Probability that non-Matches are True non-Matches
u = min(len(X), len(Y)) / (len(X) * len(Y))
# Priors
M_Pr = u
U_Pr = 1 - M_Pr
O_Pr = M_Pr / U_Pr # Prior odds of a match
# Combine the dataframes
X['key'] = 1
Y['key'] = 1
Z = pd.merge(X, Y, on='key')
Z = Z.drop('key',axis=1)
X = X.drop('key',axis=1)
Y = Y.drop('key',axis=1)
# Levenshtein distance
def Levenshtein_distance(s1, s2):
if len(s1) > len(s2):
s1, s2 = s2, s1
distances = range(len(s1) + 1)
for i2, c2 in enumerate(s2):
distances_ = [i2+1]
for i1, c1 in enumerate(s1):
if c1 == c2:
distances_.append(distances[i1])
else:
distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
distances = distances_
return distances[-1]
L_D = np.vectorize(Levenshtein_distance, otypes=[float])
Z["D"] = L_D(Z['A_x'], Z['A_y'])
# Max string length
def Max_string_length(X, Y):
return max(len(X), len(Y))
M_L = np.vectorize(Max_string_length, otypes=[float])
Z["L"] = M_L(Z['A_x'], Z['A_y'])
# Agreement weight
def Agreement_weight(D, L):
return 1 - ( D / L )
A_W = np.vectorize(Agreement_weight, otypes=[float])
Z["C"] = A_W(Z['D'], Z['L'])
# Likelihood ratio
def Likelihood_ratio(C):
return (m/u) - ((m/u) - ((1-m) / (1-u))) * (1-C)
L_R = np.vectorize(Likelihood_ratio, otypes=[float])
Z["G"] = L_R(Z['C'])
# Match weight
def Match_weight(G):
return math.log(G) * math.log(2)
M_W = np.vectorize(Match_weight, otypes=[float])
Z["R"] = M_W(Z['G'])
# Posterior odds
def Posterior_odds(R):
return math.exp( R / math.log(2)) * O_Pr
P_O = np.vectorize(Posterior_odds, otypes=[float])
Z["O"] = P_O(Z['R'])
# Probability
def Probability(O):
return O / (1 + O)
Pro = np.vectorize(Probability, otypes=[float])
Z["P"] = Pro(Z['O'])
I have verified that this gives the same results as in the paper. Here is a sensitivity check on m, showing that it doesn't make a lot of difference:
Method B
These assumptions won't apply to all applications but in some cases each row of X should match a row of Y. In that case:
The probabilities should sum to 1
If there are many credible candidates to match to then that should reduce the probability of getting the right one
then:
X["I"] = X.index
# Combine the dataframes
X['key'] = 1
Y['key'] = 1
Z = pd.merge(X, Y, on='key')
Z = Z.drop('key',axis=1)
X = X.drop('key',axis=1)
Y = Y.drop('key',axis=1)
# Levenshtein distance
def Levenshtein_distance(s1, s2):
if len(s1) > len(s2):
s1, s2 = s2, s1
distances = range(len(s1) + 1)
for i2, c2 in enumerate(s2):
distances_ = [i2+1]
for i1, c1 in enumerate(s1):
if c1 == c2:
distances_.append(distances[i1])
else:
distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
distances = distances_
return distances[-1]
L_D = np.vectorize(Levenshtein_distance, otypes=[float])
Z["D"] = L_D(Z['A_x'], Z['A_y'])
# Max string length
def Max_string_length(X, Y):
return max(len(X), len(Y))
M_L = np.vectorize(Max_string_length, otypes=[float])
Z["L"] = M_L(Z['A_x'], Z['A_y'])
# Agreement weight
def Agreement_weight(D, L):
return 1 - ( D / L )
A_W = np.vectorize(Agreement_weight, otypes=[float])
Z["C"] = A_W(Z['D'], Z['L'])
# Normalised Agreement Weight
T = Z .groupby('I') .agg({'C' : sum})
D = pd.DataFrame(T)
D.columns = ['T']
J = Z.set_index('I').join(D)
J['P1'] = J['C'] / J['T']
Comparing it against Method A:
Method C
This combines method A with method B:
# Normalised Probability
U = Z .groupby('I') .agg({'P' : sum})
E = pd.DataFrame(U)
E.columns = ['U']
K = Z.set_index('I').join(E)
K['P1'] = J['P1']
K['P2'] = K['P'] / K['U']
We can see that method B (P1) doesn't take account of uncertainty whereas method C (P2) does.

Fast way combine multiple columns of type float into one column of type array(float)

I have a dataset like this:
df = pd.DataFrame({
"333-0": [123,123,123],
"5985-0.0": [1,2,3],
"5985-0.1":[1,2,3],
"5985-0.2":[1,2,3]
},
index = [0,1,2] )
Here, we have three columns ["5985-0.0", "5985-0.1", "5985-0.2"] that represent the first, second and third float readings of thing 5985-0 -- i.e. .x represents an array index.
I'd like to take multiple columns and collapse them into a single column 5985-0 containing some kind of list of float, which I can do like this:
srccols = ["5985-0.0", "5985-0.1", "5985-0.2"]
df["5985-0"] = df[srccols].apply(tuple, axis=1)
df.dropna(srccols, axis=1)
333-0 5985-0
0 123 (1, 1, 1)
1 123 (2, 2, 2)
2 123 (3, 3, 3)
which I can then store as an SQL table with an array column.
However, apply(tuple) is very slow. Is there a faster, more idiomatic pandas way to combine multiple columns into one.
(First person to say "normalized" gets a downvote).
My Choice
Assuming I know the columns
thing = '5985-0'
cols = ['5985-0.0', '5985-0.1', '5985-0.2']
k = len(cols)
v = df.values
l = [v[:, df.columns.get_loc(c)].tolist() for c in cols]
s = pd.Series(list(zip(*l)), name=thing)
df.drop(cols, 1).join(s)
333-0 5985-0
0 123 (1, 1, 1)
1 123 (2, 2, 2)
2 123 (3, 3, 3)
Base Case
Using filter, join, and apply(tuple, 1)
thing = '5985-0'
d = df.filter(like=thing)
s = d.apply(tuple, 1).rename(thing)
cols = d.columns
df.drop(cols, 1).join(s)
333-0 5985-0
0 123 (1, 1, 1)
1 123 (2, 2, 2)
2 123 (3, 3, 3)
Option 2
Using filter, join, pd.Series
thing = '5985-0'
d = df.filter(like=thing)
s = pd.Series(d.values.tolist(), name=thing)
cols = d.columns
df.drop(cols, 1).join(s)
333-0 5985-0
0 123 [1, 1, 1]
1 123 [2, 2, 2]
2 123 [3, 3, 3]
Option 3
Using filter, join, pd.Series, and zip
thing = '5985-0'
d = df.filter(like=thing)
s = pd.Series(list(zip(*d.values.T)), name=thing)
cols = d.columns
print(df.drop(cols, 1).join(s))
333-0 5985-0
0 123 (1, 1, 1)
1 123 (2, 2, 2)
2 123 (3, 3, 3)
Timing
Large Data Set
df = pd.concat([df] * 10000, ignore_index=True
%%timeit
thing = '5985-0'
d = df.filter(like=thing)
s = d.apply(tuple, 1).rename(thing)
cols = d.columns
df.drop(cols, 1).join(s)
1 loop, best of 3: 350 ms per loop
%%timeit
thing = '5985-0'
cols = ['5985-0.0', '5985-0.1', '5985-0.2']
k = len(cols)
v = df.values
l = [v[:, df.columns.get_loc(c)].tolist() for c in cols]
s = pd.Series(list(zip(*l)), name=thing)
df.drop(cols, 1).join(s)
100 loops, best of 3: 4.06 ms per loop
%%timeit
thing = '5985-0'
d = df.filter(like=thing)
s = pd.Series(d.values.tolist(), name=thing)
cols = d.columns
df.drop(cols, 1).join(s)
100 loops, best of 3: 4.56 ms per loop
%%timeit
thing = '5985-0'
d = df.filter(like=thing)
s = pd.Series(list(zip(*d.values.T)), name=thing)
cols = d.columns
df.drop(cols, 1).join(s)
100 loops, best of 3: 6.89 ms per loop