sum numpy array at given indices - numpy

I want to add the values of a vector:
a = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='d')
to the values of another vector:
c = np.array([10, 10, 10], dtype='d')
at positions given by another array (of the same size as a, with values 0 <= b[i] < len(c)):
b = np.array([2, 0, 1, 0, 2, 0, 1, 1, 0, 2], dtype='int32')
This is very simple to write in pseudo code:
for i in range(b.shape[0]):
    j = b[i]
    c[j] += a[i]
Something like this, but vectorized (the length of c is in the hundreds in the real case):
c[0] += np.sum(a[b==0]) # 27 (10 + 1 + 3 + 5 + 8)
c[1] += np.sum(a[b==1]) # 25 (10 + 2 + 6 + 7)
c[2] += np.sum(a[b==2]) # 23 (10 + 0 + 4 + 9)
My initial guess was:
c[b] += a
but with repeated indices only the last corresponding value of a gets added, because the in-place operation is buffered and does not accumulate.

You can use np.bincount to get ID-based weighted summations and then add c, like so:
np.bincount(b,a) + c
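
For completeness, there is also an unbuffered in-place alternative via NumPy's ufunc.at machinery, which accumulates over repeated indices exactly the way the loop does (a supplementary sketch, not part of the original answer; bincount is usually faster for large inputs):

import numpy as np

a = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='d')
b = np.array([2, 0, 1, 0, 2, 0, 1, 1, 0, 2], dtype='int32')
c = np.array([10, 10, 10], dtype='d')

# out-of-place: per-ID weighted sums of a, then add c
print(np.bincount(b, a) + c)  # [ 27.  25.  23.]

# in-place: unbuffered addition, so duplicate indices accumulate
np.add.at(c, b, a)
print(c)  # [ 27.  25.  23.]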


Using recursion to iterate multiple times through rows in a dataframe - not returning the expected result

How to loop through a dataframe series multiple times using a recursive function?
I am trying to get a simple case to work and use it in a more complicated function.
I am using a simple dataframe:
df = pd.DataFrame({'numbers': [1, 2, 3, 4, 5]})
I want to iterate through the rows multiple times and sum the values. Each iteration, the index starting point increments by 1.
def recursive_sum(df, mysum=0, count=0):
    df = df.iloc[count:]
    if len(df.index) < 2:
        return mysum
    else:
        for i in range(len(df.index)):
            mysum += df.iloc[i, 0]
        count += 1
        return recursive_sum(df, mysum, count)
I think I should get:
#Iteration 1: count = 0, len(df.index) = 5 >= 2, mysum = 1 + 2 + 3 + 4 + 5 = 15
#Iteration 2: count = 1, len(df.index) = 4 >= 2, mysum = 15 + 2 + 3 + 4 + 5 = 29
#Iteration 3: count = 2, len(df.index) = 3 >= 2, mysum = 29 + 3 + 4 + 5 = 41
#Iteration 4: count = 3, len(df.index) = 2 >= 2, mysum = 41 + 4 + 5 = 50
#Iteration 5: count = 4, len(df.index) = 1 < 2, mysum = 50
But I am returning 38.
Just fixed it:
def recursive_sum(df, mysum=0, count=0):
    if (len(df.index) - count) < 2:
        return mysum
    else:
        for i in range(count, len(df.index)):
            mysum += df.iloc[i, 0]
        count += 1
        return recursive_sum(df, mysum, count)
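
A quick sanity check of the fixed version against the sample frame (my addition):

import pandas as pd

df = pd.DataFrame({'numbers': [1, 2, 3, 4, 5]})
print(recursive_sum(df))  # 50, matching the expected trace above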

Algorithm to define a 2d grid

Suppose a grid is defined by a set of grid parameters: its origin (x0, y0), the angle between one side and the x axis, and the increments along the two sides. (The figure from the original post is not reproduced here.)
There are scattered points with known coordinates on the grid but they don’t exactly fall on grid intersections. Is there an algorithm to find a set of grid parameters to define the grid so that the points are best fit to grid intersections?
Suppose the known coordinates are:
(2 , 5.464), (3.732, 6.464), (5.464, 7.464)
(3 , 3.732), (4.732, 4.732), (6.464, 5.732)
(4 , 2 ), (5.732, 3 ), (7.464, 4 ).
I expect the algorithm to find the origin (4, 2), an angle of 30 degrees, and increments of 2 in both directions.
You can solve the problem by finding a matrix that transforms points from positions (0, 0), (0, 1), ... (2, 2) onto the given points.
Although the grid has only 5 degrees of freedom (position of the origin + angle + scale), it is easier to define the transformation using a 2x3 matrix A, because the problem can be made linear in this case.
Let a point with index (x0, y0) be transformed into the point (x0', y0') on the grid, for example (0, 0) -> (2, 5.464), and let a_ij be the coefficients of matrix A. Then this pair of points yields 2 equations:
a_00 * x0 + a_01 * y0 + a_02 = x0'
a_10 * x0 + a_11 * y0 + a_12 = y0'
The unknowns are a_ij, so these equations can be written in the form
a_00 * x0 + a_01 * y0 + a_02 + a_10 * 0 + a_11 * 0 + a_12 * 0 = x0'
a_00 * 0 + a_01 * 0 + a_02 * 0 + a_10 * x0 + a_11 * y0 + a_12 = y0'
or in matrix form
K0 * (a_00, a_01, a_02, a_10, a_11, a_12)^T = (x0', y0')^T
where
K0 = (
x0, y0, 1, 0, 0, 0
0, 0, 0, x0, y0, 1
)
These equations for each pair of points can be combined in a single equation
K * (a_00, a_01, a_02, a_10, a_11, a_12)^T = (x0', y0', x1', y1', ..., xn', yn')^T
or K * a = b
where
K = (
x0, y0, 1, 0, 0, 0
0, 0, 0, x0, y0, 1
x1, y1, 1, 0, 0, 0
0, 0, 0, x1, y1, 1
...
xn, yn, 1, 0, 0, 0
0, 0, 0, xn, yn, 1
)
and (xi, yi), (xi', yi') are pairs of corresponding points.
This can be solved as a non-homogeneous system of linear equations. In this case the solution minimizes the sum of squared distances from each point to its nearest grid intersection. The transform can also be viewed as the maximum-likelihood estimate under the assumption that the points are shifted from the grid intersections by normally distributed noise.
a = (K^T * K)^-1 * K^T * b
This algorithm can be easily implemented if a linear algebra library is available. Below is an example in Python:
import math
import numpy as np

n_points = 9
aligned_points = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]
grid_points = [(2, 5.464), (3.732, 6.464), (5.464, 7.464), (3, 3.732), (4.732, 4.732), (6.464, 5.732), (4, 2), (5.732, 3), (7.464, 4)]

K = np.zeros((n_points * 2, 6))
b = np.zeros(n_points * 2)
for i in range(n_points):
    K[i * 2, 0] = aligned_points[i][0]
    K[i * 2, 1] = aligned_points[i][1]
    K[i * 2, 2] = 1
    K[i * 2 + 1, 3] = aligned_points[i][0]
    K[i * 2 + 1, 4] = aligned_points[i][1]
    K[i * 2 + 1, 5] = 1
    b[i * 2] = grid_points[i][0]
    b[i * 2 + 1] = grid_points[i][1]

# operator '@' is matrix multiplication: a = (K^T * K)^-1 * K^T * b
a = np.linalg.inv(np.transpose(K) @ K) @ np.transpose(K) @ b
A = a.reshape(2, 3)
print(A)
[[ 1. 1.732 2. ]
[-1.732 1. 5.464]]
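
As a side note (my addition, not from the original answer), forming the normal equations explicitly can be numerically ill-conditioned; np.linalg.lstsq solves the same least-squares problem directly:

# equivalent to a = (K^T * K)^-1 * K^T * b, without the explicit inverse
a, residuals, rank, singular_values = np.linalg.lstsq(K, b, rcond=None)
A = a.reshape(2, 3)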
Then the parameters can be extracted from this matrix:
theta = math.degrees(math.atan2(A[1, 0], A[0, 0]))
scale_x = math.sqrt(A[1, 0] ** 2 + A[0, 0] ** 2)
scale_y = math.sqrt(A[1, 1] ** 2 + A[0, 1] ** 2)
origin_x = A[0, 2]
origin_y = A[1, 2]
theta = -59.99927221917264
scale_x = 1.99995599951599
scale_y = 1.9999559995159895
origin_x = 1.9999999999999993
origin_y = 5.464
However there remains a minor issue: matrix A corresponds to an affine transform. This means that the grid axes are not guaranteed to be perpendicular. If this is a problem, the first two columns of the matrix can be modified in such a way that the transform preserves angles.
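
One standard way to do that (a sketch of my own; the answer does not spell it out) is to replace the 2x2 linear part of A with the nearest rotation times a uniform scale, via the SVD (orthogonal Procrustes):

import numpy as np

def nearest_similarity(A):
    # A is the fitted 2x3 affine matrix; the translation column is kept
    M = A[:, :2]
    U, s, Vt = np.linalg.svd(M)
    R = U @ Vt        # nearest orthogonal matrix to M in the Frobenius norm
    # (if det(R) < 0 it is a reflection; negate one column of U to force a rotation)
    scale = s.mean()  # collapse the two singular values into one uniform scale
    A_fixed = A.copy()
    A_fixed[:, :2] = scale * R
    return A_fixed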
Update: I fixed the mistakes and resolved sign ambiguities, so now this algorithm produces the expected result. However it should be tested to see if all cases are handled correctly.
Here is another attempt to solve this problem. The idea is to decompose the transformation into a non-uniform scaling matrix and a rotation matrix, A = R * S, and then solve for the coefficients sx, sy, r1, r2 of these matrices given the restriction that r1^2 + r2^2 = 1. The minimization problem is described here: How to find a transformation (non-uniform scaling and similarity) that maps one set of points to another?
import math

def shift_points(points):
    # translate the points so that their centroid is at the origin
    n_points = len(points)
    shift = tuple(sum(coords) / n_points for coords in zip(*points))
    shifted_points = [(point[0] - shift[0], point[1] - shift[1]) for point in points]
    return shifted_points, shift

n_points = 9
aligned_points = [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1), (2, 2)]
grid_points = [(2, 5.464), (3.732, 6.464), (5.464, 7.464), (3, 3.732), (4.732, 4.732), (6.464, 5.732), (4, 2), (5.732, 3), (7.464, 4)]

aligned_points, aligned_shift = shift_points(aligned_points)
grid_points, grid_shift = shift_points(grid_points)

c1, c2 = 0, 0
b11, b12, b21, b22 = 0, 0, 0, 0
for i in range(n_points):
    c1 += aligned_points[i][0] ** 2
    c2 += aligned_points[i][1] ** 2  # sums of squared y-coordinates (was [0] by mistake)
    b11 -= 2 * aligned_points[i][0] * grid_points[i][0]
    b12 -= 2 * aligned_points[i][1] * grid_points[i][0]
    b21 -= 2 * aligned_points[i][0] * grid_points[i][1]
    b22 -= 2 * aligned_points[i][1] * grid_points[i][1]

k = (b11 ** 2 * c2 + b22 ** 2 * c1 - b21 ** 2 * c2 - b12 ** 2 * c1) / \
    (b21 * b11 * c2 - b12 * b22 * c1)

# r1_sqr and r2_sqr might need to be swapped
r1_sqr = 2 / (k ** 2 + 4 + k * math.sqrt(k ** 2 + 4))
r2_sqr = 2 / (k ** 2 + 4 - k * math.sqrt(k ** 2 + 4))

for sign1, sign2 in [(1, 1), (-1, 1), (1, -1), (-1, -1)]:
    r1 = sign1 * math.sqrt(r1_sqr)
    r2 = sign2 * math.sqrt(r2_sqr)
    scale_x = -b11 / (2 * c1) * r1 - b21 / (2 * c1) * r2
    scale_y = b12 / (2 * c2) * r2 - b22 / (2 * c2) * r1
    if scale_x >= 0 and scale_y >= 0:
        break

theta = math.degrees(math.atan2(r2, r1))
There might be ambiguities in choosing r1_sqr and r2_sqr. The origin can be estimated from aligned_shift and grid_shift, but I didn't implement it yet. The resulting parameters:
theta = -59.99927221917264
scale_x = 1.9999559995159895
scale_y = 1.9999559995159895
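
For completeness, here is a sketch of the missing origin estimation (my own addition, assuming the rotation matrix has the form R = [[r1, -r2], [r2, r1]] and scaling is applied before rotation). Centering removed the translation, so the origin is wherever aligned index (0, 0) lands:

# origin = grid_centroid - R @ S @ aligned_centroid
def estimate_origin(aligned_shift, grid_shift, r1, r2, scale_x, scale_y):
    sx = scale_x * aligned_shift[0]  # scale first ...
    sy = scale_y * aligned_shift[1]
    tx = grid_shift[0] - (r1 * sx - r2 * sy)  # ... then rotate and subtract
    ty = grid_shift[1] - (r2 * sx + r1 * sy)
    return tx, ty  # should match origin_x, origin_y from the first algorithm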

Julia equivalent to the python consecutive_groups() function

I am looking for a Julia alternative with the same behavior as more_itertools.consecutive_groups in Python.
I came up with a simple implementation but speed is an issue here and I'm not sure if the code is optimized enough.
function consecutive_groups(array)
    groups = Vector{eltype(array)}[]
    j = 0
    for i = 1:length(array)-1
        if array[i] + 1 != array[i+1]
            push!(groups, array[j+1:i])
            j = i
        end
    end
    push!(groups, array[j+1:end])
    return groups
end
Your implementation is already quite fast. If you know that the consecutive groups will be large you might want to just increase the index instead of pushing every element:
function consecutive_groups_2(v)
    n = length(v)
    groups = Vector{Vector{eltype(v)}}()
    i = j = 1
    while i <= n && j <= n
        j = i
        while j < n && v[j] + 1 == v[j + 1]
            j += 1
        end
        push!(groups, v[i:j])
        i = j + 1
    end
    return groups
end
which is roughly 33% faster on large groups:
julia> using BenchmarkTools

julia> x = collect(1:100000);

julia> @btime consecutive_groups(x)
  165.939 μs (4 allocations: 781.45 KiB)
1-element Array{Array{Int64,1},1}:
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10 … 99991, 99992, 99993, 99994, 99995, 99996, 99997, 99998, 99999, 100000]

julia> @btime consecutive_groups_2(x)
  114.830 μs (4 allocations: 781.45 KiB)
1-element Array{Array{Int64,1},1}:
 [1, 2, 3, 4, 5, 6, 7, 8, 9, 10 … 99991, 99992, 99993, 99994, 99995, 99996, 99997, 99998, 99999, 100000]

Pandas/NumPy: concisely label first N values matching a mask

I have a sorted Series like this:
[2, 4, 5, 6, 8, 9]
I want to produce another Series or ndarray of the same length, where the first two odd numbers and the first two even numbers are labeled sequentially:
[0, 1, 2, _, _, 3]
The _ values I don't really care about. They can be zero.
Now I do it this way:
src = pd.Series([2, 4, 5, 6, 8, 9])
odd = src % 2 != 0
where = np.hstack((np.where(odd)[0][:2], np.where(~odd)[0][:2]))
where.sort()  # maintain ordering - thanks to @hpaulj
res = np.zeros(len(src), int)
res[where] = np.arange(len(where))
Can you do it more concisely? The input will never be empty, but there might be no odds or no evens (in which case the result could have length 1, 2, or 3 instead of 4).
Great problem! I'm still exploring and learning.
I've basically stuck with what you've done so far, with modest tweaks for efficiency. I'll update if I think of anything else cool.
conclusions
So far I've thrashed around a lot and haven't improved much.
my answer
fast
odd = src.values % 2
even = 1 - odd
res = ((odd.cumsum() * odd) < 3) * ((even.cumsum() * even) < 3)
(res.cumsum() - 1) * res
alternative 1
pretty fast
a = src.values
odd = (a % 2).astype(bool)
rng = np.arange(len(a))
# same reason these are 2, we have 4 below
where = np.append(rng[~odd][:2], rng[odd][:2])
res = np.zeros(len(a), int)
# nature of the problem necessitates that this is 4
res[where] = np.arange(4)
alternative 2
not as quick, but creative
a = src.values
odd = a % 2
res = np.zeros(len(src), int)
b = np.arange(2)
c = b[:, None] == odd
res[(c.cumsum(1) * c <= 2).all(0)] = np.arange(4)
alternative 3
still slow
odd = src.values % 2
a = (odd[:, None] == [0, 1])
b = ((a.cumsum(0) * a) <= 2).all(1)
(b.cumsum() - 1) * b
timing
code
def pir3(src):
    odd = src.values % 2
    a = (odd[:, None] == [0, 1])
    b = ((a.cumsum(0) * a) <= 2).all(1)
    return (b.cumsum() - 1) * b

def pir0(src):
    odd = src.values % 2
    even = 1 - odd
    res = ((odd.cumsum() * odd) < 3) * ((even.cumsum() * even) < 3)
    return (res.cumsum() - 1) * res

def pir2(src):
    a = src.values
    odd = a % 2
    res = np.zeros(len(src), int)
    b = np.arange(2)  # was missing; needed for the broadcast below
    c = b[:, None] == odd
    res[(c.cumsum(1) * c <= 2).all(0)] = np.arange(4)
    return res

def pir1(src):
    a = src.values
    odd = (a % 2).astype(bool)
    rng = np.arange(len(a))
    where = np.append(rng[~odd][:2], rng[odd][:2])
    res = np.zeros(len(a), int)
    res[where] = np.arange(4)
    return res

def john0(src):
    odd = src % 2 == 0
    where = np.hstack((np.where(odd)[0][:2], np.where(~odd)[0][:2]))
    res = np.zeros(len(src), int)
    res[where] = np.arange(len(where))
    return res

def john1(src):
    odd = src.values % 2 == 0
    where = np.hstack((np.where(odd)[0][:2], np.where(~odd)[0][:2]))
    res = np.zeros(len(src), int)
    res[where] = np.arange(len(where))
    return res
src = pd.Series([2, 4, 5, 6, 8, 9])          # small test case
src = pd.Series([2, 4, 5, 6, 8, 9] * 10000)  # large test case
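
The timing results from the original answer are not reproduced here; a plain-timeit driver (my own sketch) to compare the candidates:

from timeit import timeit

for f in (pir0, pir1, pir2, pir3, john0, john1):
    per_call = timeit(lambda: f(src), number=100) / 100
    print('%s: %.3f ms' % (f.__name__, per_call * 1e3))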

Easiest way to generate random array from {1,-1} with predefined mean value in Numpy?

What is the best way in numpy to generate a random array with n values of the form
arr = [1, -1, -1, 1, 1, 1, ...]
whose average is as close as possible to a predefined value m, so that
print(np.sum(arr) / n)
# prints a value that is as close as possible to m
I have been experimenting with
numpy.random.choice([-1,1], size=n)
but can't seem to find a solution.
You can optionally pass a probability for each element in the array you give to random.choice. In this case the average, or expected value, is p - q, where p is the probability of drawing 1 and q the probability of drawing -1. Note that this is independent of n. Since the probability of -1 is 1 - p, you can solve 2p - 1 = m to get the value of p for a given m.
For example, for your average m to be .4 you would pass [.7, .3] as the probabilities:
numpy.random.choice([1, -1], n, p=[.7, .3])
Here's an example:
In [25]:
n = 10 ** 6
m = .4
p = (m + 1) / 2
np.random.choice([1, -1], n, p=[p, 1-p]).sum() / n
Out[25]:
0.39873799999999998
First, recognize that for a fixed n, you can't (in general) choose the array whose mean is exactly an arbitrary value m. I'll assume that either you are choosing m for which a solution is possible, or you are OK getting something close to the given mean.
The mean of n1 1s and n2 -1s is (n1 - n2) / n where n = n1 + n2. So you want m = (n1 - n2) / n = (n1 - (n - n1)) / n = (2*n1 - n) / n = 2*n1/n - 1. This gives n1 = (m + 1)*n/2 (which works for -1 <= m <= 1). So you can create an array with n1 1s and n - n1 -1s, and then randomize that array.
For example, suppose n is 100, and the desired mean is 0.8:
In [35]: n = 100
In [36]: m = 0.8
Compute the number of positive 1s:
In [37]: n1 = int(round((m + 1) * n / 2.0)) # rounded to the nearest int
Create the array of 1s and -1s:
In [38]: x = np.ones(n, dtype=int)
In [39]: x[:n-n1] = -1
Shuffle it:
In [40]: np.random.shuffle(x)
In [41]: x
Out[41]:
array([ 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1,
1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1,
-1, 1, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1,
-1, 1, 1, 1, 1, 1, 1, 1, 1, 1, -1, 1, 1, 1, 1])
In [42]: x.mean()
Out[42]: 0.80000000000000004
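
Wrapped up as a reusable helper (my own packaging of the steps above, using NumPy's Generator API):

import numpy as np

def random_pm1_with_mean(n, m, rng=None):
    # n values from {1, -1} whose mean is as close as possible to m
    rng = np.random.default_rng() if rng is None else rng
    n1 = int(round((m + 1) * n / 2.0))  # number of +1 entries
    x = np.ones(n, dtype=int)
    x[:n - n1] = -1
    rng.shuffle(x)
    return x

print(random_pm1_with_mean(100, 0.8).mean())  # 0.8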