How to "bin" a numpy array using custom (non-linearly spaced) buckets? - numpy

How to "bin" the bellow array in numpy so that:
import numpy as np
bins = np.array([-0.1 , -0.07, -0.02, 0. , 0.02, 0.07, 0.1 ])
array = np.array([-0.21950869, -0.02854823, 0.22329239, -0.28073936, -0.15926265,
                  -0.43688216, 0.03600587, -0.05101109, -0.24318651, -0.06727875])
That is, replace each of the values in array with the following:
-0.1 where `value` < -0.085
-0.07 where -0.085 <= `value` < -0.045
-0.02 where -0.045 <= `value` < -0.01
0.0 where -0.01 <= `value` < 0.01
0.02 where 0.01 <= `value` < 0.045
0.07 where 0.045 <= `value` < 0.085
0.1 where `value` >= 0.085
The expected output would be:
array = np.array([-0.1, -0.02, 0.1, -0.1, -0.1, -0.1, 0.02, -0.07, -0.1, -0.07])
I recognise that numpy has a digitize function; however, it returns the index of the bin, not the bin itself. That is:
np.digitize(array, bins)
np.array([0, 2, 7, 0, 0, 0, 5, 2, 0, 2])

Get those mid-values by averaging across consecutive bin values in pairs. Then, use np.searchsorted or np.digitize to get the indices using the mid-values. Finally, index into bins for the output.
Mid-values:
mid_bins = (bins[1:] + bins[:-1])/2.0
Indices with searchsorted or digitize:
idx = np.searchsorted(mid_bins, array)
idx = np.digitize(array, mid_bins)
Output:
out = bins[idx]
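Putting the pieces together, a self-contained sketch of the whole recipe:

```python
import numpy as np

bins = np.array([-0.1, -0.07, -0.02, 0., 0.02, 0.07, 0.1])
array = np.array([-0.21950869, -0.02854823, 0.22329239, -0.28073936, -0.15926265,
                  -0.43688216, 0.03600587, -0.05101109, -0.24318651, -0.06727875])

# midpoints between consecutive bin values act as the decision boundaries
mid_bins = (bins[1:] + bins[:-1]) / 2.0

# for each value, find which boundary interval it falls into
idx = np.searchsorted(mid_bins, array)

# indexing back into bins snaps each value to its bucket
out = bins[idx]
# array([-0.1 , -0.02,  0.1 , -0.1 , -0.1 , -0.1 ,  0.02, -0.07, -0.1 , -0.07])
```

The midpoints here are exactly the thresholds listed in the question (-0.085, -0.045, -0.01, 0.01, 0.045, 0.085), which is why this reproduces the expected output.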

Numpy array access function derivative

I am trying to come up with the derivative of the following function:
def f(x, item):
    return x[item]
def df_dx(x, item):
    pass
where item, as defined in the numpy docs, can be a list, slice, int, or list of slices.
There are so many edge cases, but it feels like there should be a very easy multiplication between two tensors that is essentially the answer. Can anyone help?
I tried something like this:
to_length = lambda i: i if type(i) == int else len(
    range(i.start, (i.stop if i.stop is not None else -1), (i.step if i.step is not None else 1)))
input_shape = list(X.shape)
seg_shape = []
if type(self.item) == int or type(self.item) == slice:
    seg_shape = [to_length(self.item)]
elif type(self.item) == list:
    seg_shape = [to_length(i) for i in self.item]
else:
    raise Exception(f"Unknown type {type(self.item)}")
v = np.zeros(seg_shape + input_shape)
if len(seg_shape) == 1:
    # np.fill_diagonal(v[:, self.i:self.i + self.n], 1)
    np.fill_diagonal(v[..., self.item], 1)
    return v
elif len(seg_shape) == 2:
    np.fill_diagonal(v[:, :, self.item[0], self.item[1]], 1)
    return v
But it didn't really work.
The function you defined isn't continuous. There is a numpy gradient function, which may do what you want.
import numpy as np
x = ( np.arange( 20 ) - 10 ) * .1
x = x*x
x
# array([1. , 0.81, 0.64, 0.49, 0.36, 0.25, 0.16, 0.09, 0.04, 0.01, 0. ,
# 0.01, 0.04, 0.09, 0.16, 0.25, 0.36, 0.49, 0.64, 0.81])
def f(x, item):
    return x[item]
def df_dx(x, item):
    return np.gradient(x)[item]
df_dx( x, None )
# array([[-0.19, -0.18, -0.16, -0.14, -0.12, -0.1 , -0.08, -0.06, -0.04,
# -0.02, 0. , 0.02, 0.04, 0.06, 0.08, 0.1 , 0.12, 0.14,
# 0.16, 0.17]])
df_dx( x, [ 2, 8, 10, 17 ] )
# array([-0.16, -0.04, 0. , 0.14])
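If what you are after is the derivative of the indexing operation itself (rather than a numerical gradient of the data, as above), note that f(x, item) = x[item] is linear in x, so its Jacobian is a selection matrix: the corresponding rows of the identity. A minimal sketch for 1-d x (the helper name df_dx_jacobian is mine, not from the question):

```python
import numpy as np

def df_dx_jacobian(x, item):
    # f(x, item) = x[item] is linear in x, so d f / d x is just the
    # rows of the identity matrix picked out by `item`.
    return np.eye(len(x))[item]

x = np.arange(5, dtype=float)
J = df_dx_jacobian(x, [1, 3])
# Each output element depends on exactly one input element,
# so J @ x reproduces x[[1, 3]].
```

This also works when item is an int or a slice, since np.eye(len(x))[item] follows the same indexing rules as x[item]; the multi-axis cases still need the zeros-plus-fill_diagonal bookkeeping from the attempt above.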

How to Normalize the values between zero and one?

I have a ndarray of shape (74,):
[-1.995 1.678 -2.535 1.739 -1.728 -1.268 -0.727 -3.385 -2.348
-3.021 0.5293 -0.4573 0.5137 -3.047 -4.75 -1.847 2.922 -0.989
-1.507 -0.9224 -2.545 6.957 0.9985 -2.035 -3.234 -2.848 -1.971
-3.246 2.057 -1.991 -6.27 9.22 0.4045 -2.703 -1.577 4.066
7.215 -4.07 12.98 -3.02 1.456 9.44 6.49 0.272 2.07
1.625 -3.531 -2.846 -4.914 -0.536 -3.496 -1.095 -2.719 -0.5825
5.535 -0.1753 3.658 4.234 4.543 -0.8384 -2.705 -2.012 -6.56
10.5 -2.021 -2.48 1.725 5.69 3.672 -6.855 -3.887 1.761
6.926 -4.848 ]
I need to normalize this vector so that the values lie in [0, 1] and the sum of the values inside the vector equals 1.
You can try this formula to make it between [0, 1]:
min_val = np.min(original_arr)
max_val = np.max(original_arr)
normalized_arr = (original_arr - min_val) / (max_val - min_val)
You can try this formula to make the sum of the array to be 1:
new_arr = original_arr / original_arr.sum()
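Note that the two requirements interact: after min-max scaling the values are in [0, 1] but generally do not sum to 1, and sum-normalising directly leaves negative inputs negative. To satisfy both at once, you can min-max scale first and then divide by the sum, a sketch:

```python
import numpy as np

arr = np.array([-1.995, 1.678, -2.535, 1.739, -1.728])  # shortened sample

# Step 1: min-max scale into [0, 1]
scaled = (arr - arr.min()) / (arr.max() - arr.min())

# Step 2: divide by the total so the entries sum to 1
# (each entry stays in [0, 1] because it is non-negative and <= the sum)
normalized = scaled / scaled.sum()
```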

Julia - Gurobi Callbacks on array of JuMP variables

In Gurobi and JuMP 0.21, it is well documented here on how you would access a variable with a callback:
using JuMP, Gurobi, Test
model = direct_model(Gurobi.Optimizer())
@variable(model, 0 <= x <= 2.5, Int)
@variable(model, 0 <= y <= 2.5, Int)
@objective(model, Max, y)
cb_calls = Cint[]
function my_callback_function(cb_data, cb_where::Cint)
    # You can reference variables outside the function as normal
    push!(cb_calls, cb_where)
    # You can select where the callback is run
    if cb_where != GRB_CB_MIPSOL && cb_where != GRB_CB_MIPNODE
        return
    end
    # You can query a callback attribute using GRBcbget
    if cb_where == GRB_CB_MIPNODE
        resultP = Ref{Cint}()
        GRBcbget(cb_data, cb_where, GRB_CB_MIPNODE_STATUS, resultP)
        if resultP[] != GRB_OPTIMAL
            return  # Solution is something other than optimal.
        end
    end
    # Before querying `callback_value`, you must call:
    Gurobi.load_callback_variable_primal(cb_data, cb_where)
    x_val = callback_value(cb_data, x)
    y_val = callback_value(cb_data, y)
    # You can submit solver-independent MathOptInterface attributes such as
    # lazy constraints, user-cuts, and heuristic solutions.
    if y_val - x_val > 1 + 1e-6
        con = @build_constraint(y - x <= 1)
        MOI.submit(model, MOI.LazyConstraint(cb_data), con)
    elseif y_val + x_val > 3 + 1e-6
        con = @build_constraint(y + x <= 3)
        MOI.submit(model, MOI.LazyConstraint(cb_data), con)
    end
    if rand() < 0.1
        # You can terminate the callback as follows:
        GRBterminate(backend(model))
    end
    return
end
# You _must_ set this parameter if using lazy constraints.
MOI.set(model, MOI.RawParameter("LazyConstraints"), 1)
MOI.set(model, Gurobi.CallbackFunction(), my_callback_function)
optimize!(model)
@test termination_status(model) == MOI.OPTIMAL
@test primal_status(model) == MOI.FEASIBLE_POINT
@test value(x) == 1
@test value(y) == 2
i.e., you would use x_val = callback_value(cb_data, x). However, how should you do this when you have an array of variables whose indices do not start at 1, i.e. my variables are not in a plain vector but are declared with:
@variable(m, x[i=1:n, j=i+1:n], Bin)
Should I access x with a double for loop over its two dimensions and call callback_value multiple times? If so, the indices for j will not be the same, will they?
Use broadcasting:
x_val = callback_value.(Ref(cb_data), x)
Or just call callback_value(cb_data, x[i, j]) when you need the value.
For example:
using JuMP, Gurobi
model = Model(Gurobi.Optimizer)
@variable(model, 0 <= x[i=1:3, j=i+1:3] <= 2.5, Int)
function my_callback_function(cb_data)
    x_val = callback_value.(Ref(cb_data), x)
    display(x_val)
    for i = 1:3, j = i+1:3
        con = @build_constraint(x[i, j] <= floor(Int, x_val[i, j]))
        MOI.submit(model, MOI.LazyConstraint(cb_data), con)
    end
end
MOI.set(model, MOI.LazyConstraintCallback(), my_callback_function)
optimize!(model)
yields
julia> optimize!(model)
Gurobi Optimizer version 9.1.0 build v9.1.0rc0 (mac64)
Thread count: 4 physical cores, 8 logical processors, using up to 8 threads
Optimize a model with 0 rows, 3 columns and 0 nonzeros
Model fingerprint: 0x5d543c3a
Variable types: 0 continuous, 3 integer (0 binary)
Coefficient statistics:
Matrix range [0e+00, 0e+00]
Objective range [0e+00, 0e+00]
Bounds range [2e+00, 2e+00]
RHS range [0e+00, 0e+00]
JuMP.Containers.SparseAxisArray{Float64,2,Tuple{Int64,Int64}} with 3 entries:
[1, 2] = -0.0
[2, 3] = -0.0
[1, 3] = -0.0
JuMP.Containers.SparseAxisArray{Float64,2,Tuple{Int64,Int64}} with 3 entries:
[1, 2] = 2.0
[2, 3] = 2.0
[1, 3] = 2.0
JuMP.Containers.SparseAxisArray{Float64,2,Tuple{Int64,Int64}} with 3 entries:
[1, 2] = 2.0
[2, 3] = 2.0
[1, 3] = 2.0
JuMP.Containers.SparseAxisArray{Float64,2,Tuple{Int64,Int64}} with 3 entries:
[1, 2] = 2.0
[2, 3] = -0.0
[1, 3] = -0.0
Presolve time: 0.00s
Presolved: 0 rows, 3 columns, 0 nonzeros
Variable types: 0 continuous, 3 integer (0 binary)
JuMP.Containers.SparseAxisArray{Float64,2,Tuple{Int64,Int64}} with 3 entries:
[1, 2] = -0.0
[2, 3] = -0.0
[1, 3] = -0.0
Found heuristic solution: objective 0.0000000
Explored 0 nodes (0 simplex iterations) in 0.14 seconds
Thread count was 8 (of 8 available processors)
Solution count 1: 0
Optimal solution found (tolerance 1.00e-04)
Best objective 0.000000000000e+00, best bound 0.000000000000e+00, gap 0.0000%
User-callback calls 31, time in user-callback 0.14 sec

Simple computation in numpy

I have a numpy array like this: a = [-- -- -- 1.90 2.91 1.91 2.92]
I need to find the percentage of values greater than 2, which here is 50%.
How can I get this in an easy way? Also, why does len(a) give 7 (instead of 4)?
Try this:
import numpy as np
import numpy.ma as ma

a = ma.array([0, 1, 2, 1.90, 2.91, 1.91, 2.92])
for i in range(3):
    a[i] = ma.masked
print(a)
print(np.sum(a > 2) / (len(a) - ma.count_masked(a)))
The last line prints 0.5, which is your 50%. It subtracts from the total length of your array (7) the number of masked elements (3), which you see as the three "--" in the output you posted.
Generally speaking, for an ordinary (unmasked) array you can simply use
a = np.array([...])
threshold = 2.0
fraction_higher = (a > threshold).sum() / len(a)  # in [0, 1]
percentage_higher = fraction_higher * 100
The array contains 7 elements, 3 of them masked. This code emulates the test case by generating a masked array as well:
# generate the test case: a masked array
a = np.ma.array([-1, -1, -1, 1.90, 2.91, 1.91, 2.92], mask=[1, 1, 1, 0, 0, 0, 0])
# check its format
print(a)
[-- -- -- 1.9 2.91 1.91 2.92]
# print the output
print(a[a > 2].count() / a.count())
0.5
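Since comparisons on masked arrays themselves ignore the masked entries, the fraction can also be written as a single mean; a sketch along the same lines:

```python
import numpy as np

a = np.ma.array([-1, -1, -1, 1.90, 2.91, 1.91, 2.92],
                mask=[1, 1, 1, 0, 0, 0, 0])

# (a > 2) is itself a masked boolean array; .mean() averages only
# over the unmasked entries, giving the fraction directly.
fraction = (a > 2).mean()
# 0.5
```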

transform pandas dataframe column via Interpolation

I am looking to apply a 1-d interpolation to a DataFrame and am not sure how to do this in an efficient way. Here goes:
In [8]: param
Out[8]:
alpha beta rho nu
0.021918 0.544953 0.5 -0.641566 6.549623
0.041096 0.449702 0.5 -0.062046 5.047923
0.060274 0.428459 0.5 -0.045312 3.625387
0.079452 0.424686 0.5 -0.049508 2.790139
0.156164 0.423139 0.5 -0.071106 1.846614
0.232877 0.414887 0.5 -0.040070 1.334070
0.328767 0.415757 0.5 -0.042071 1.109897
I would like the new index (but I don't mind a reset_index() if needed) to look like this:
np.array([0.02, 0.04, 0.06, 0.08, 0.1, 0.15, 0.2, 0.25])
So the corresponding values for alpha, beta, rho, nu need to be interpolated.
I came up with the following, which only works for one column and only if x and y have the same dimensions:
x = np.array([0.02, 0.04, 0.06, 0.08, 0.1, 0.15, 0.2, 0.25])
y = np.array(param.alpha)
f = interp1d(x, y, kind='cubic', fill_value='extrapolate')
f(x)
Appreciate any pointer towards an efficient solution. Thanks.
You could try using reindex and interpolate, then index selection with loc:
param.reindex(new_idx.tolist()+param.index.values.tolist())\
.sort_index()\
.interpolate(method='cubic')\
.bfill()\
.loc[new_idx]
Output:
alpha beta rho nu
0.02 0.544953 0.5 -0.641566 6.549623
0.04 0.452518 0.5 -0.073585 5.138333
0.06 0.428552 0.5 -0.044739 3.641854
0.08 0.424630 0.5 -0.049244 2.772958
0.10 0.423439 0.5 -0.047119 2.294109
0.15 0.423326 0.5 -0.069473 1.873499
0.20 0.419130 0.5 -0.060861 1.573724
0.25 0.412985 0.5 -0.029573 1.221732
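If pulling in scipy for interp1d is not desired, a numpy-only alternative (linear rather than cubic, so the numbers will differ slightly from the table above) is to run np.interp column by column against the original index; a sketch using the alpha column from the question:

```python
import numpy as np

# the original index and the alpha column from `param`
old_idx = np.array([0.021918, 0.041096, 0.060274, 0.079452,
                    0.156164, 0.232877, 0.328767])
alpha = np.array([0.544953, 0.449702, 0.428459, 0.424686,
                  0.423139, 0.414887, 0.415757])

new_idx = np.array([0.02, 0.04, 0.06, 0.08, 0.1, 0.15, 0.2, 0.25])

# np.interp does 1-d linear interpolation; values outside the original
# index are clamped to the endpoint values rather than extrapolated.
alpha_new = np.interp(new_idx, old_idx, alpha)
```

Looping this over param.columns (with old_idx = param.index.values) gives the full interpolated frame, at the cost of linear rather than cubic smoothness.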