Assign array values to contiguous intervals - numpy

I have a levels array
# 0 1 2 3 4
levels = np.array(( 0.2, 0.4, 0.6, 0.8 ))
and a values array, e.g.,
np.random.seed(20220204)
values = np.random.rand(6)
and finally a SLOW function
def map_into_levels(values, levels):
    result = []
    for n in np.asarray(values):
        for r, level in enumerate(levels):
            if n <= level:
                break
        else:
            r += 1
        result.append(r)
    return result
so that I have
In [153]: np.random.seed(20220204)
...: values = np.random.rand(6)
...: levels = np.array(( 0.2, 0.4, 0.6, 0.8 ))
...: result = map_into_levels(values, levels)
...: print(levels)
...: print(values)
...: print(result)
[0.2 0.4 0.6 0.8]
[0.00621839 0.23945242 0.87124946 0.56328486 0.5477085 0.88745812]
[0, 1, 4, 2, 2, 4]
Could you please point me towards a Numpy primitive that helps me to speed up the operations?

You need np.searchsorted, assuming levels is already sorted. It finds the indices where elements would have to be inserted to keep the array sorted:
np.searchsorted(levels, values)
# array([0, 1, 4, 2, 2, 4], dtype=int32)
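As a quick equivalence check (a minimal sketch; it relies on the default side='left', which matches the n <= level test in the loop, and on the seed used above):
import numpy as np

np.random.seed(20220204)
levels = np.array((0.2, 0.4, 0.6, 0.8))
values = np.random.rand(6)

# side='left' returns the first index i with value <= levels[i],
# i.e. the same bucket the Python loop computes
print(np.searchsorted(levels, values, side='left'))  # [0 1 4 2 2 4]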

Related

Pyspark - Looking to find indexes on the top N largest values in an array column

I'm looking to replace the functionality of the following numpy command:
top_n_idx = np.argsort(cosine_sim[idx])[::-1][1:11]
Sample Data:
array_col
[0.1,0.5,0.2,0.55,0.9]
[0.1,0.9,0.5,0.2,0.35]
Here is the code I have so far:
df.select("array_col", F.slice(F.sort_array(F.col("array_col"), asc=False), 1, 3).alias("top_scores")).show()
array_col              top_scores
[0.1,0.5,0.2,0.55,0.9] [0.9, 0.55, 0.5]
[0.1,0.9,0.5,0.2,0.35] [0.9, 0.5, 0.35]
Now, what I would like to do is find the indexes in array_col that correspond to the top_scores column.
array_col              top_scores        top_score_idx
[0.1,0.5,0.2,0.55,0.9] [0.9, 0.55, 0.5]  [5, 4, 2]
[0.1,0.9,0.5,0.2,0.35] [0.9, 0.5, 0.35]  [2, 3, 5]
I will ultimately use top_score_idx to grab the corresponding entries in another array column.
For Spark 2.4+, use the transform and array_position functions to map each entry of the top_scores array to its 1-based position in the array_col column.
df \
.select("array_col", F.slice(F.sort_array(F.col("array_col"), asc=False), 1, 3).alias("top_scores")) \
.withColumn("top_score_idx", F.expr("transform(top_scores, v -> array_position(array_col, v))")) \
.show()
# +--------------------------+----------------+-------------+
# |array_col                 |top_scores      |top_score_idx|
# +--------------------------+----------------+-------------+
# |[0.1, 0.5, 0.2, 0.55, 0.9]|[0.9, 0.55, 0.5]|[5, 4, 2] |
# |[0.1, 0.9, 0.5, 0.2, 0.35]|[0.9, 0.5, 0.35]|[2, 3, 5] |
# +--------------------------+----------------+-------------+
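Since array_position (and element_at) are 1-based, top_score_idx can be used directly to pull the matching entries out of another array column. A minimal sketch, assuming a hypothetical column other_col of the same length as array_col (the cast is there because array_position returns a bigint while element_at expects an int index):
df \
    .withColumn("top_other", F.expr("transform(top_score_idx, i -> element_at(other_col, cast(i as int)))")) \
    .show()
Note that array_position returns the position of the first match, so duplicate scores within a row will map to the same index.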

Vectorization of selective cumulative sum

I have a pandas Series where each element is a list with indices:
series_example = pd.Series([[1, 3, 2], [1, 2]])
In addition, I have an array with values associated to every index:
arr_example = np.array([3., 0.5, 0.25, 0.1])
I want to create a new Series with the cumulative sums of the elements of the array given by the indices in the row of the input Series. In the example, the output Series would have the following contents:
0 [0.5, 0.6, 0.85]
1 [0.5, 0.75]
dtype: object
The non-vectorized way to do it would be the following:
def non_vector_transform(series, array):
    series_output = pd.Series(np.zeros(len(series)), dtype=object)
    for i in range(len(series)):
        element_list = series[i]
        series_output[i] = []
        acum = 0
        for element in element_list:
            acum += array[element]
            series_output[i].append(acum)
    return series_output
I would like to do this in a vectorized way. Any vectorization magician to help me here?
Use Series.apply and np.cumsum:
import numpy as np
import pandas as pd
series_example = pd.Series([[1, 3, 2], [1, 2]])
arr_example = np.array([3., 0.5, 0.25, 0.1])
result = series_example.apply(lambda x: np.cumsum(arr_example[x]))
print(result)
Or if you prefer a for loop:
import numpy as np
import pandas as pd
series_example = pd.Series([[1, 3, 2], [1, 2]])
arr_example = np.array([3., 0.5, 0.25, 0.1])
# Copy only if you do not want to overwrite the original series
result = series_example.copy()
for i, x in result.items():  # Series.iteritems() was removed in pandas 2.0
    result[i] = np.cumsum(arr_example[x])
print(result)
Output:
0 [0.5, 0.6, 0.85]
1 [0.5, 0.75]
dtype: object
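Strictly speaking, apply still loops in Python under the hood. For large Series, a fully vectorized sketch (an addition, not part of the original answers) flattens all index lists, takes one global cumulative sum, and subtracts each row's starting offset:
import numpy as np
import pandas as pd

series_example = pd.Series([[1, 3, 2], [1, 2]])
arr_example = np.array([3., 0.5, 0.25, 0.1])

lengths = series_example.map(len).to_numpy()        # [3, 2]
bounds = np.cumsum(lengths)                         # row end positions: [3, 5]
flat = np.concatenate(series_example.to_numpy())    # [1, 3, 2, 1, 2]
csum = arr_example[flat].cumsum()                   # one global running sum

# Subtract the total accumulated before each row starts
starts = np.concatenate(([0.], csum[bounds[:-1] - 1]))
per_row = csum - np.repeat(starts, lengths)

result = pd.Series(np.split(per_row, bounds[:-1]))
print(result)  # [0.5, 0.6, 0.85] and [0.5, 0.75], as above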

Managing high dimensions in Numpy

I want to write a function of 4 variables : f(x1,x2,x3,x4), each in a different dimension.
This can be achieved by f(x1,x2[newaxis],x3[newaxis,newaxis],x4[newaxis,newaxis,newaxis]).
Do you know a smarter way ?
You're looking for np.ix_ [1]:
f(*np.ix_(x1, x2, x3, x4))
For example:
>>> np.ix_([1, 2, 3], [4, 5])
(array([[1],
[2],
[3]]), array([[4, 5]]))
[1] Or equivalently, np.meshgrid(..., sparse=True, indexing='ij')
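For example, a minimal sketch with a toy f (any expression built from broadcasting ufuncs behaves the same way):
import numpy as np

def f(a, b, c, d):
    # Broadcasting expands each open-grid axis automatically
    return a + b + c + d

x1, x2, x3, x4 = np.arange(4), np.arange(2), np.arange(5), np.arange(3)
out = f(*np.ix_(x1, x2, x3, x4))
print(out.shape)  # (4, 2, 5, 3): each variable occupies its own axis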
One way would be to reshape each array, giving it an appropriate number of singleton dimensions along its leading axes. To do this across all arrays, we could use a list comprehension.
Thus, one way to handle a generic number of input arrays would be -
L = [x1,x2,x3,x4]
out = [l.reshape([1]*i + [len(l)]) for i,l in enumerate(L)]
Sample run -
In [186]: # Initialize input arrays
...: x1 = np.random.randint(0,9,(4))
...: x2 = np.random.randint(0,9,(2))
...: x3 = np.random.randint(0,9,(5))
...: x4 = np.random.randint(0,9,(3))
...:
In [187]: A = x1,x2[None],x3[None,None],x4[None,None,None]
In [188]: L = [x1,x2,x3,x4]
...: out = [l.reshape([1]*i + [len(l)]) for i,l in enumerate(L)]
...:
In [189]: A
Out[189]:
(array([2, 1, 1, 1]),
array([[8, 2]]),
array([[[0, 3, 5, 8, 7]]]),
array([[[[6, 7, 0]]]]))
In [190]: out
Out[190]:
[array([2, 1, 1, 1]),
array([[8, 2]]),
array([[[0, 3, 5, 8, 7]]]),
array([[[[6, 7, 0]]]])]

Substitute entries of numpy array with numpy arrays

I have a numpy array A of size ((s1,...sm)) with integer entries and a dictionary D with integers as keys and numpy arrays of size ((t)) as values. I would like to evaluate the dictionary on every entry of the array A to get a new array B of size ((s1,...sm,t)).
For example
D={1:[0,1],2:[1,0]}
A=np.array([1,2,1])
The output should be
array([[0,1],[1,0],[0,1]])
Motivation: I have an array with indexes of unit vectors as entries and I need to transform it into an array with the vectors as entries.
If you can rename your keys to be 0-indexed, you might use direct array querying on your unit vectors:
>>> units = np.array([D[1], D[2]])
>>> B = units[A - 1] # -1 because 0 indexed: 1 -> 0, 2 -> 1
>>> B
array([[0, 1],
[1, 0],
[0, 1]])
And similarly for any shape:
>>> A = np.random.randint(0, 2, (10, 11, 12))  # random_integers is deprecated
>>> A.shape
(10, 11, 12)
>>> B = units[A]
>>> B.shape
(10, 11, 12, 2)
You can learn more about advanced indexing in the NumPy docs.
Alternatively, a plain list comprehension works for a 1-D A:
>>> np.asarray([D[key] for key in A])
array([[0, 1],
[1, 0],
[0, 1]])
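For an m-dimensional A, the same idea works by iterating the flattened view and restoring the shape afterwards (a sketch, not part of the original answer):
import numpy as np

D = {1: [0, 1], 2: [1, 0]}
A = np.array([[1, 2], [2, 1]])  # any shape works

# Look up each scalar entry, then append a trailing axis of size t
B = np.asarray([D[k] for k in A.flat]).reshape(A.shape + (-1,))
print(B.shape)  # (2, 2, 2)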
Here's an approach using np.searchsorted to locate the row indices into the dictionary's values, and then simply indexing into them to get the desired output, like so -
keys = sorted(D)  # searchsorted requires sorted keys
idx = np.searchsorted(keys, A)
out = np.asarray([D[k] for k in keys])[idx]
Sample run -
In [45]: A
Out[45]: array([1, 2, 1])
In [46]: D
Out[46]: {1: [0, 1], 2: [1, 0]}
In [47]: keys = sorted(D)
...: idx = np.searchsorted(keys, A)
...: out = np.asarray([D[k] for k in keys])[idx]
...:
In [48]: out
Out[48]:
array([[0, 1],
[1, 0],
[0, 1]])
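If the keys are arbitrary integers rather than a contiguous range, np.unique with return_inverse=True needs only one dictionary lookup per distinct key (a sketch, not part of the original answers):
import numpy as np

D = {1: [0, 1], 2: [1, 0]}
A = np.array([1, 2, 1])

uniq, inv = np.unique(A, return_inverse=True)  # distinct keys + positions
table = np.asarray([D[k] for k in uniq])       # one lookup per distinct key
B = table[inv].reshape(A.shape + (-1,))
print(B)
# [[0 1]
#  [1 0]
#  [0 1]]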

Extract triangles from delaunay filter in mayavi

How can I extract triangles from delaunay filter in mayavi?
I want to extract the triangles just like matplotlib does
import numpy as np
import matplotlib.delaunay as triang
from enthought.mayavi import mlab
x = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
z = np.zeros(9)
#matplotlib
centers, edges, triangles_index, neig = triang.delaunay(x,y)
#mayavi
vtk_source = mlab.pipeline.scalar_scatter(x, y, z, figure=False)
delaunay = mlab.pipeline.delaunay2d(vtk_source)
I want to extract the triangles from the mayavi delaunay filter to obtain the variables triangles_index and centers (just like matplotlib)
The only thing I've found is this
http://docs.enthought.com/mayavi/mayavi/auto/example_delaunay_graph.html
but that only gets the edges, and they are encoded differently than in matplotlib
To get the triangles index:
poly = delaunay.outputs[0]
tindex = poly.polys.data.to_array().reshape(-1, 4)[:, 1:]
poly is a PolyData object; poly.polys is a CellArray that stores the connectivity of each cell as [n, i0, ..., i_{n-1}], so each triangle occupies four entries [3, i0, i1, i2]. That is why the flat array is reshaped to (-1, 4) and the first column is dropped.
For detail about CellArray: http://www.vtk.org/doc/nightly/html/classvtkCellArray.html
To get the center of every circumcircle, you need to loop over every triangle and calculate the center:
centers = []
for i in range(poly.number_of_cells):
    cell = poly.get_cell(i)
    points = cell.points.to_array()[:, :-1].tolist()  # drop the z coordinate
    center = [0, 0]
    points.append(center)
    cell.circumcircle(*points)
    centers.append(center)
centers = np.array(centers)
cell.circumcircle() is a static function, so you need to pass all the points of the triangle as arguments; the center is returned by modifying the fourth argument in place.
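If you would rather avoid the per-cell VTK calls, the circumcenters can also be computed in one shot from tindex and the point coordinates with plain NumPy (a sketch, not part of the original answer):
import numpy as np

def circumcenters(pts, tris):
    # pts: (n, 2) float coordinates, tris: (m, 3) integer index array
    a, b, c = pts[tris[:, 0]], pts[tris[:, 1]], pts[tris[:, 2]]
    ab, ac = b - a, c - a
    d = 2.0 * (ab[:, 0] * ac[:, 1] - ab[:, 1] * ac[:, 0])  # 2 * cross(ab, ac)
    ab2 = (ab ** 2).sum(axis=1)
    ac2 = (ac ** 2).sum(axis=1)
    # Standard circumcenter formula, expressed as an offset from vertex a
    ux = (ac[:, 1] * ab2 - ab[:, 1] * ac2) / d
    uy = (ab[:, 0] * ac2 - ac[:, 0] * ab2) / d
    return a + np.stack([ux, uy], axis=1)

# e.g. circumcenters(np.column_stack([x, y]).astype(float), tindex)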
Here is the full code:
import numpy as np
from mayavi import mlab  # in very old installs: from enthought.mayavi import mlab
x = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2])
y = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
z = np.zeros(9)
vtk_source = mlab.pipeline.scalar_scatter(x, y, z, figure=False)
delaunay = mlab.pipeline.delaunay2d(vtk_source)
poly = delaunay.outputs[0]
tindex = poly.polys.data.to_array().reshape(-1, 4)[:, 1:]
centers = []
for i in range(poly.number_of_cells):
    cell = poly.get_cell(i)
    points = cell.points.to_array()[:, :-1].tolist()
    center = [0, 0]
    points.append(center)
    cell.circumcircle(*points)
    centers.append(center)
centers = np.array(centers)
print(centers)
print(tindex)
The output is:
[[ 1.5 0.5]
[ 1.5 0.5]
[ 0.5 1.5]
[ 0.5 0.5]
[ 0.5 0.5]
[ 0.5 1.5]
[ 1.5 1.5]
[ 1.5 1.5]]
[[5 4 2]
[4 1 2]
[7 6 4]
[4 3 1]
[3 0 1]
[6 3 4]
[8 7 4]
[8 4 5]]
The result may not be the same as matplotlib.delaunay, because there are many possible solutions.