From a multidimensional array I'd like to get the smallest absolute value that is above a tolerance value.
import numpy as np
np.random.seed(0)
matrix = np.random.randn(5,1,10)
tolerance = 0.1
np.amin(np.abs(matrix), axis=-1)
# array([[0.10321885],
# [0.12167502],
# [0.04575852], # <- should not appear, as below tolerance
# [0.15494743],
# [0.21274028]])
The code above returns the absolute minimum over the last dimension. But I'd like to ignore small values (near 0) when determining the minimum, so in my example with tolerance = 0.1 the third row should contain the second-smallest value.
With matrix[np.abs(matrix) >= tolerance] I can select the values above the tolerance, but this flattens the array, so np.amin(...) can no longer determine the minimum over the last dimension.
You can replace the values smaller than 0.1 with, for example, 1, using np.where:
np.where(np.abs(matrix) < 0.1, 1, np.abs(matrix))
Then apply np.amin on top:
np.amin(np.where(np.abs(matrix) < 0.1, 1, np.abs(matrix)), axis=-1)
Result:
array([[0.10321885],
[0.12167502],
[0.18718385],
[0.15494743],
[0.21274028]])
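Note that the replacement value 1 only works if the true minimum above tolerance is at most 1; using np.inf as the replacement avoids that assumption. A minimal sketch of the same idea:
import numpy as np
np.random.seed(0)
matrix = np.random.randn(5, 1, 10)
tolerance = 0.1
# inf can never win the minimum, so sub-tolerance values are safely ignored
masked = np.where(np.abs(matrix) < tolerance, np.inf, np.abs(matrix))
result = np.amin(masked, axis=-1)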
I have a pandas dataframe with two columns: time and value. The column time has timesteps from 0 to 200. Each entry of value is a numpy array with shape (100, 3), where every row is a 3-value tuple (left boundary, right boundary, count): the left/right boundary give the range over which a histogram bin is counted, and count is the number of counts in that bin.
I want to produce a plot where the x axis corresponds to time, the y axis corresponds to the bins in value, and the counts map to transparency.
In the desired plot, every less transparent spot signifies a higher density of the histogram; every point on the x axis is a timestep for which one histogram over the values on the y axis is produced.
I have tried setting the transparency to counts/max(all_count) and using fill_between, but I still can't reproduce that graph; my attempt is below.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
# df - it is my dataframe
# here is example of data, of first timestep
# for the first 5 bins:
# array([[-2.77325630e+00, -2.75546048e+00, 3.90000000e+01],
# [-2.75546048e+00, -2.73766467e+00, 1.75000000e+02],
# [-2.73766467e+00, -2.71986885e+00, 3.41000000e+02],
# [-2.71986885e+00, -2.70207303e+00, 9.55000000e+02],
# [-2.70207303e+00, -2.68427721e+00, 2.80700000e+03]])
fig, ax = plt.subplots()
for i, row in df.iterrows():
    left = np.array(row['value'])[:, 0]
    right = np.array(row['value'])[:, 1]
    count = np.array(row['value'])[:, 2]
    # normalize each timestep
    transparency = count / count.max()
    ax.fill_between(i, left, right, alpha=transparency, color='blue')
ax.set_xlabel("Time")
ax.set_ylabel("Bins in Value")
plt.show()
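For reference, one way to get this kind of plot is to build a 2D grid of normalized counts and let pcolormesh map them to color intensity instead of per-patch transparency. This is a minimal sketch on synthetic data, assuming every timestep shares the same bin edges:
import matplotlib.pyplot as plt
import numpy as np
# synthetic stand-in for the dataframe: 200 timesteps, 100 bins each
rng = np.random.default_rng(0)
edges = np.linspace(-3, 3, 101)               # edges of 100 shared bins
counts = rng.poisson(lam=50, size=(200, 100)).astype(float)
grid = counts.T / counts.max(axis=1)          # normalize each timestep (column)
fig, ax = plt.subplots()
# one shaded cell per (timestep, bin); darker blue = higher count
ax.pcolormesh(np.arange(counts.shape[0] + 1), edges, grid, cmap='Blues')
ax.set_xlabel("Time")
ax.set_ylabel("Bins in Value")
plt.show()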
import numpy as np
A = np.random.rand(3,3)
A[1,0] = -666
A[0,1] = -666
A[2,2] = -666
I have a matrix whose entries are all positive, except that -666 represents a missing value or outlier. How can I compute the column averages and row averages using the positive entries only?
If you want to exclude the negative values from the count in the average's denominator, you can use numpy.maximum to clip them to zero before summing along the required dimension/axis, and divide by the count of positive values along that axis:
np.sum(np.maximum(A, 0), axis=0) / np.sum(A > 0, axis=0)  # column averages; use axis=1 for rows
On the other hand, if you also want to count the negative values in the denominator:
np.mean(A * (A > 0), axis=0)
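An equivalent sketch that handles both axes at once: mask the sentinel with NaN and use np.nanmean, which ignores NaNs in both the sum and the count.
import numpy as np
A = np.random.rand(3, 3)
A[1, 0] = A[0, 1] = A[2, 2] = -666
A_nan = np.where(A > 0, A, np.nan)       # hide the non-positive entries
col_means = np.nanmean(A_nan, axis=0)    # column averages over positive entries
row_means = np.nanmean(A_nan, axis=1)    # row averages over positive entries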
I have two 2d point clouds (oldPts and newPts) which I wish to combine. They are mx2 and nx2 numpy integer arrays, with m and n of order 2000. newPts contains many duplicates or near duplicates of oldPts, and I need to remove these before combining.
So far I have used the histogram2d function to produce a 2d representation of oldPts (H). I then compare each newPt to an NxN area of H and accept the point if that area is empty. This last part I am currently doing with a Python loop which I would like to remove. Can anybody show me how to do this with broadcasting, or perhaps suggest a completely different approach to the problem? The working code is below.
npzfile = np.load(path+datasetNo+'\\temp.npz')
arrs = npzfile.files
oldPts = npzfile[arrs[0]]
newPts = npzfile[arrs[1]]
# remove all the negative values
oldPts = oldPts[oldPts.min(axis=1) >= 0, :]
newPts = newPts[newPts.min(axis=1) >= 0, :]
# round to integers
oldPts = np.around(oldPts).astype(int)
newPts = newPts.astype(int)
# put the oldPts into a 2d array (xMax, yMax are defined elsewhere)
H, xedg, yedg = np.histogram2d(oldPts[:, 0], oldPts[:, 1],
                               bins=[xMax, yMax],
                               range=[[0, xMax], [0, yMax]])
finalNewList = []
N = 5
for pt in newPts:
    if not H[max(0, pt[0]-N):min(xMax, pt[0]+N),
             max(0, pt[1]-N):min(yMax, pt[1]+N)].any():
        finalNewList.append(pt)
finalNew = np.array(finalNewList)
The right way to do this is to compute the distance between every pair of points and then accept only the new points that are "different enough" from every old point, using scipy.spatial.distance.cdist:
import numpy as np
from scipy.spatial.distance import cdist

oldPts = np.random.randn(1000, 2)
newPts = np.random.randn(2000, 2)

# dist[i, j] = Euclidean distance between oldPts[i] and newPts[j]
dist = cdist(oldPts, newPts)
print(dist.shape)  # (1000, 2000)

# keep only the new points farther than 5 from *every* old point
okIndex = np.min(dist, axis=0) > 5
print(np.sum(okIndex))   # number of accepted new points
finalNew = newPts[okIndex, :]
print(finalNew.shape)
Above I use the Euclidean distance of 5 as the threshold for "too close": any point in newPts that's farther than 5 from all points in oldPts is accepted into finalNew. You will have to look at the range of values in dist to find a good threshold, but your histogram can guide you in picking one.
(One good way to visualize dist is to use matplotlib.pyplot.imshow(dist).)
This is a more refined version of what you were doing with the histogram. In fact, you ought to be able to get essentially the same answer as the histogram by passing metric='chebyshev' to cdist (your square NxN window is a Chebyshev ball), assuming your histogram bin widths are the same in both dimensions, and using 5 again as the threshold.
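To finish the combining step from the question, the accepted new points can then be stacked onto the old ones, e.g.:
combined = np.vstack([oldPts, finalNew])  # all old points plus the accepted new ones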
(PS. If you're interested in another useful function in scipy.spatial.distance, check out my answer that uses pdist to find unique rows/columns in an array.)
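As a tiny, hypothetical illustration of that pdist idea, rows at (near-)zero pairwise distance are duplicates:
import numpy as np
from scipy.spatial.distance import pdist, squareform
pts = np.array([[0, 0], [1, 1], [0, 0]])   # row 2 duplicates row 0
d = squareform(pdist(pts))                 # symmetric pairwise-distance matrix
np.fill_diagonal(d, np.inf)                # ignore each row's distance to itself
dup_pairs = np.argwhere(d < 1e-12)         # index pairs of (near-)identical rows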
This Python code:
import numpy as np
import math as m
from scipy.integrate import quad
import scipy.optimize as optimization
import matplotlib.pyplot as plt
# Assumed constants (not given in the question):
H0 = 70.0   # Hubble constant, km/s/Mpc
c = 3.0e5   # speed of light, km/s

# Create toy data for curve_fit.
zo = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
mu = np.array([0.1, 0.9, 2.2, 2.8, 3.9, 5.1])
sig = np.array([1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
# Define Hubble function.
def Hubble(x, a, b):
    return H0 * m.sqrt(a*(1+x)**2 + 1/2 * a * (1+b)**3)

# Define distance function.
def Distancez(x, a, b):
    return c * (1+x) * np.asarray(quad(lambda tmp: 1/Hubble(tmp, a, b), 0, x))

def mag(x, a, b):
    return 5*np.log10(Distancez(x, a, b)) + 25
    #return a+b*x
# Compute chi-square manifold.
Steps = 101 # grid size
Chi2Manifold = np.zeros([Steps, Steps])  # allocate grid
amin = 0.2 # minimal value of a covered by grid
amax = 0.3 # maximal value of a covered by grid
bmin = 0.3 # minimal value of b covered by grid
bmax = 0.6 # maximal value of b covered by grid
for s1 in range(Steps):
    for s2 in range(Steps):
        # Current values of (a,b) at grid position (s1,s2).
        a = amin + (amax - amin)*float(s1)/(Steps-1)
        b = bmin + (bmax - bmin)*float(s2)/(Steps-1)
        # Evaluate chi-squared.
        chi2 = 0.0
        for n in range(len(zo)):
            residual = (mu[n] - mag(zo[n], a, b))/sig[n]
            chi2 = chi2 + residual*residual
        Chi2Manifold[Steps-1-s2, s1] = chi2  # write result to grid.
Throws this error message:
ValueError Traceback (most recent call last)
<ipython-input-136-d0ef47a881a7> in <module>()
36 residual = (mu[n] - mag(zo[n], a, b))/sig[n]
37 chi2 = chi2 + residual*residual
---> 38 Chi2Manifold[Steps-1-s2,s1] = chi2 # write result to grid.
ValueError: setting an array element with a sequence.
Note: If I define a simple mag function such as (a+b*x), I do not get any error message.
In fact all three functions Hubble, Distancez and mag have to be functions of the redshift z, which is an array.
Now do you think I need to redefine all these functions so that they output arrays? I mean, first create an array of redshifts, and then the outputs of the functions automatically become arrays?
I need the outputs of the Distancez() and mag() functions to be arrays. I managed to do this simply by changing the upper limit of the integral in the Distancez function from x to x.any(). Now I have an array, and this is what I want. However, I now see that the output value of, for example, Distancez(0.25, 0.5, 0.3) is different from when I just put x in the upper limit of the integral. Any help would be appreciated.
The ValueError says that it cannot assign an element of the array Chi2Manifold a value that is a sequence. chi2 is probably a numpy array because residual is a numpy array, because your mag() function returns a numpy array, all because your Distancez() function returns a numpy array -- you are telling it to do this with that np.asarray().
If Distancez() returned a scalar floating-point value you'd probably be set. Do you actually need np.asarray() in Distancez()? Note that quad() returns a (value, abserr) pair, so that asarray call produces a 2-element array; you probably intend to reduce it to a scalar by keeping only the integral value. I don't know what your Hubble() function is supposed to do and I'm not an astronomer, but in my experience distances are often scalars ;).
If chi2 is meant to be a sequence or numpy array, you probably want to assign it to an appropriately-sized slice of Chi2Manifold instead.
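For instance, a minimal sketch of that fix, assuming the intended argument order Hubble(z, a, b):
from scipy.integrate import quad
def Distancez(x, a, b):
    # quad returns a (value, abserr) pair; keep only the integral value
    integral, _ = quad(lambda tmp: 1/Hubble(tmp, a, b), 0, x)
    return c * (1 + x) * integral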
I have a 2-variable discrete function represented as a tuple through the following line of code:
hist_values, hist_x, hist_y = np.histogram2d(x, y)
You can think of it as a non-smooth 3d surface, with hist_values being the height of the surface over grid cells whose edge coordinates are given by (hist_x, hist_y).
Now I would like to collect those grid cells for which hist_values is above some threshold level.
You could simply compare hist_values with the threshold; this gives you a mask as an array of bool which can be used in slicing, e.g.:
import numpy as np

# prepare random input
arr1 = np.random.randint(0, 100, 1000)
arr2 = np.random.randint(0, 100, 1000)

# compute 2D histogram
hist_values, hist_x, hist_y = np.histogram2d(arr1, arr2)

threshold = 10  # example value; pick one appropriate for your data
mask = hist_values > threshold  # the array of `bool`
hist_values[mask]  # only the values above `threshold`
Of course, the values are then collected in a flattened array.
Alternatively, you could also use mask to instantiate a masked-array object (using numpy.ma, see docs for more info on it).
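For instance (note that numpy.ma masks out the entries where the mask is True, hence the negation):
import numpy.ma as ma
masked_hist = ma.masked_array(hist_values, mask=~mask)  # 2D shape preserved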
If you are after the coordinates at which this is happening, you should use numpy.where().
# i0 and i1 contain the indices in the 0 and 1 dimensions respectively
i0, i1 = np.where(hist_values > threshold)
# e.g. this will give you the first value satisfying your condition
hist_values[i0[0], i1[0]]
For the corresponding values of hist_x and hist_y, note that these are the boundaries of the bins, not, for example, the mid-values; therefore you can take either the lower or the upper bound of each bin.
# lower edges of `hist_x` and `hist_y` respectively...
hist_x[i0]
hist_y[i1]
# ... and upper edges
hist_x[i0 + 1]
hist_y[i1 + 1]
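If the mid-values are what you actually need, the two edges of each selected bin can simply be averaged:
# midpoints of the selected bins
x_mid = (hist_x[i0] + hist_x[i0 + 1]) / 2
y_mid = (hist_y[i1] + hist_y[i1 + 1]) / 2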