Remove duplicated values from numpy array - numpy

I have three numpy arrays
x =np.array([1,2,3,4,2,1,2,3,3,3])
y =np.array([10,20,30,40,20,10,20,30,39,39])
z =np.array([100,200,300,400,200,100,200,300,300,300])
I want to check if x[i]==x[j] and y[i]==y[j] and z[i]!=z[j]. If this is true I want to remove z[j].
In pseudo code:
label: check
for i in range(0, np.size(x)):
    for j in range(0, np.size(x)):
        if x[i] == x[j] and y[i] == y[j] and z[i] != z[j] and i < j:
            x = delete(x, j)
            y = delete(y, j)
            z = delete(z, j)
            print "start again from above"
            goto check
Since this relies on goto and I don't know any other way around it, is there a quick and elegant way to do this (maybe based on numpy's predefined functions)?

This should do it:
np.unique(np.array([x, y, z]), axis=1)
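A minimal sketch of what that call returns on the arrays from the question, plus a variant that keeps the first occurrence of each (x, y) pair in its original order. The names ux, uy, uz, x2, y2, z2 and keep are illustrative only, and the second variant is one possible reading of the requirement, not something the answer states.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 2, 1, 2, 3, 3, 3])
y = np.array([10, 20, 30, 40, 20, 10, 20, 30, 39, 39])
z = np.array([100, 200, 300, 400, 200, 100, 200, 300, 300, 300])

# Stack the arrays as rows and drop exact duplicate (x, y, z) columns.
# np.unique returns the remaining columns in sorted order.
ux, uy, uz = np.unique(np.array([x, y, z]), axis=1)

# Variant (assumption): keep the first occurrence of every (x, y) pair,
# whatever its z value, and preserve the original ordering.
_, first = np.unique(np.array([x, y]), axis=1, return_index=True)
keep = np.sort(first)
x2, y2, z2 = x[keep], y[keep], z[keep]

print(ux, uy, uz)
print(x2, y2, z2)
```

Note that the one-liner only removes columns where all three values repeat; if two entries share x and y but differ in z, both survive, so the second variant may be closer to the pseudocode.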


pyplot returns unexpected contour paths

I want to use pyplot.contour to extract isolines from 2D data.
My problem is that this method returns unexpected results: when I use levels clearly outside the data range, the contour result still contains paths.
Here is an example reproducing the issue:
import numpy
from matplotlib import pyplot
n = 256
x = numpy.linspace(-3., 3., n)
y = numpy.linspace(-3., 3., n)
X, Y = numpy.meshgrid(x, y)
Z = X * numpy.sinc(X ** 2 + Y ** 2)
levels = [1000]
print(f'data min : {Z.min()}')
print(f'data max : {Z.max()}')
print(f'levels : {levels}')
isolines = pyplot.contour(X, Y, Z, levels, colors='red')
for i, collection in enumerate(isolines.collections):
    npaths = len(collection.get_paths())
    print(f'collection[{i}] has {npaths} paths')
pyplot.show()
Which outputs
data min : -0.47993931267102286
data max : 0.47993931267102286
levels : [1000]
/path/to/issue.py:15: UserWarning: No contour levels were found within the data range.
isolines = pyplot.contour(X, Y, Z, levels, colors='red')
collection[0] has 1 paths
I expected the contour to be empty and not contain 1 path. Am I missing something obvious here?
As of 2023/01/11, it is a bug in matplotlib:
https://github.com/matplotlib/matplotlib/issues/23778
As the fix has not landed yet, my temporary workaround is to detect when the levels are outside the Z value range, and empty the contour collections in that case.
quadcontourset = pyplot.contour(X, Y, Z, levels)
# levels may be a plain list, so convert it before comparing and masking
levels = numpy.asarray(levels)
zmin = numpy.min(Z)
zmax = numpy.max(Z)
# keep only the levels that fall strictly inside the data range
inside = (levels > zmin) & (levels < zmax)
levels_in = levels[inside]
if levels_in.size == 0:
    quadcontourset.collections.clear()
I reproduced the issue with matplotlib 3.5.3. It is not fixed in the current 3.6.2 release, but a fix seems to be on track at
https://github.com/matplotlib/matplotlib/pull/24912
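The range check from the workaround can also be isolated in a small helper and run before plotting. This is only a sketch: levels_in_range is a hypothetical name, not a matplotlib function, and the grid size of 64 is an arbitrary choice to keep the example light.

```python
import numpy

def levels_in_range(Z, levels):
    """Return the requested levels that lie strictly inside Z's value range."""
    levels = numpy.asarray(levels, dtype=float)
    zmin, zmax = numpy.min(Z), numpy.max(Z)
    return levels[(levels > zmin) & (levels < zmax)]

# Same surface as in the question, on a coarser grid.
x = numpy.linspace(-3.0, 3.0, 64)
X, Y = numpy.meshgrid(x, x)
Z = X * numpy.sinc(X ** 2 + Y ** 2)

outside = levels_in_range(Z, [1000])      # nothing survives
mixed = levels_in_range(Z, [0.1, 1000])   # only 0.1 survives
print(outside.size, mixed)
```

With such a helper, pyplot.contour would only be called (or its result only used) when the returned array is non-empty.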

Map elements of multiple columns in Pandas

I'm trying to label some values in a DataFrame in Pandas based on the value itself, in-place.
import pandas as pd
df = pd.read_csv('data/extrusion.csv')
# get list of columns that contain thickness
columns = [c for c in df.columns if 'SDickeIst'.lower() in c.lower()]
# create a function that returns the class based on value
def get_label(ser):
    ser.map(lambda x : x if x == 0 else 1)
df[columns].apply(get_label)
I would expect that the apply function takes each column in particular and applies get_label on it. In turn, get_label gets the ser argument as a Series and uses map to map each element != 0 with 1.
get_label doesn't return anything.
You want to return ser.map(lambda x : x if x == 0 else 1).
def get_label(ser):
    return ser.map(lambda x : x if x == 0 else 1)
Besides that, apply doesn't act in-place; it always returns a new object. Therefore you need to assign the result back:
df[columns] = df[columns].apply(get_label)
But in this simple case, using DataFrame.where should be much faster if you are dealing with large DataFrames.
df[columns] = df[columns].where(lambda x: x == 0, 1)
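To see that both approaches label the values the same way, here is a small self-contained sketch. The DataFrame contents and the column names SDickeIst_1, SDickeIst_2 and other are made up for illustration, since the CSV from the question is not available.

```python
import pandas as pd

# Hypothetical stand-in for data/extrusion.csv.
df = pd.DataFrame({
    "SDickeIst_1": [0.0, 1.5, 2.0],
    "SDickeIst_2": [3.0, 0.0, 0.0],
    "other": [7, 8, 9],
})
columns = [c for c in df.columns if "sdickeist" in c.lower()]

def get_label(ser):
    # Keep zeros, replace everything else with 1.
    return ser.map(lambda v: v if v == 0 else 1)

via_apply = df[columns].apply(get_label)
# where() keeps values where the condition holds and writes 1 elsewhere.
via_where = df[columns].where(lambda frame: frame == 0, 1)

print(via_apply)
print(via_where)
```

Both results leave the column named other untouched, because only the selected columns are passed through.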

Creating 2d array and filling first columns of each row in numpy

I have written the following code for creating a 2D array and filling the first element of each row. I am new to numpy. Is there a better way to do this?
y = np.zeros(N*T1).reshape(N, T1)
x = np.linspace(0, L, num=N)
for k in range(0, N):
    y[k][0] = np.sin(PI*x[k]/L)
Simply do this:
y[:, 0] = np.sin(PI*x/L)
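Put together as a runnable sketch: the values of N, T1 and L below are assumptions, since the question does not give them, and np.pi stands in for the PI constant.

```python
import numpy as np

# N, T1 and L are not given in the question; these values are assumptions.
N, T1, L = 5, 4, 2.0

y = np.zeros((N, T1))             # 2D array of zeros, no reshape needed
x = np.linspace(0, L, num=N)
y[:, 0] = np.sin(np.pi * x / L)   # fill the first column in one step

print(y)
```

The slice y[:, 0] selects the first element of every row, so the whole column is assigned from the vectorized np.sin result without a Python loop.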

I am trying to take an 1D slice from 2D numpy array, but something goes wrong

I am trying to filter evident measurement mistakes from my data using the 3-sigma rule. x is a numpy array of measurement points and y is an array of measured values. To remove wrong points from my data, I zip x.tolist() and y.tolist(), then filter by the second element of each tuple, and then I need to convert my zip back into two lists. I tried to first convert my list of tuples into a list of lists, then convert it to a numpy 2D array and take two 1D slices of it. It looks like the first slice is correct, but then it outputs the following:
x = np.array(list(map(list, list(filter(flt, list(zap))))))[:, 0]
IndexError: too many indices for array
I don't understand what I am doing wrong. Here's the code:
x = np.array(readCol(0, l))
y = np.array(readCol(1, l))
n = len(y)
stdev = np.std(y)
mean = np.mean(y)
print("Stdev is: " + str(stdev))
print("Mean is: " + str(mean))
def flt(n):
    global mean
    global stdev
    global x
    if abs(n[1] - mean) < 3*stdev:
        return True
    else:
        print('flt function finds an error: ' + str(n[1]))
        return False
def filtration(N):
    print(Fore.RED + 'Filtration function launched')
    global y
    global x
    global stdev
    global mean
    zap = zip(x.tolist(), y.tolist())
    for i in range(N):
        print(Fore.RED + ' Filtration step number ' + str(i) + Style.RESET_ALL)
        y = np.array(list(map(list, list(filter(flt, list(zap))))))[:, 1]
        print(Back.GREEN + 'This is y: \n' + Style.RESET_ALL)
        print(y)
        x = np.array(list(map(list, list(filter(flt, list(zap))))))[:, 0]
        print(Back.GREEN + 'This is x: \n' + Style.RESET_ALL)
        print(x)
        print('filtration fuction main step')
        stdev = np.std(y)
        print('second step')
        mean = np.mean(y)
        print('third step')
Have you tried to test the problem line step by step?
x = np.array(list(map(list, list(filter(flt, list(zap))))))[:, 0]
for example:
temp = np.array(list(map(list, list(filter(flt, list(zap))))))
print(temp.shape, temp.dtype)
x = temp[:, 0]
Further breakdown might be needed, but since [:,0] is the only indexing operation in this line, I'd start there.
Without further study of the code and/or some examples, I'm not going to try to speculate what the nested lists are doing.
The error sounds like temp is not 2d, contrary to your expectations. That could be because temp is object dtype, and composed of lists that vary in length. That seems to be a common problem when people make arrays from downloaded databases.
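Another possible cause, assuming the code runs on Python 3: zip returns a one-shot iterator, so the first list(zap) consumes it and the second one sees nothing, which leaves a 1D empty array. The sketch below uses made-up data and the illustrative names pairs, first, second, xs and ys to show the effect in isolation.

```python
import numpy as np

pairs = zip([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])

first = np.array(list(map(list, pairs)))    # consumes the iterator
second = np.array(list(map(list, pairs)))   # nothing left to read

print(first.shape)    # 2D, one row per pair
print(second.shape)   # 1D and empty, so second[:, 0] raises IndexError

# Materialising the pairs once avoids the problem.
pairs = list(zip([1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
arr = np.array(pairs)
xs, ys = arr[:, 0], arr[:, 1]
print(xs, ys)
```

Printing temp.shape as suggested above would reveal this case too, since the empty array reports a single dimension.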

How to optimize the linear coefficients for numpy arrays in a maximization function?

I have to optimize the coefficients for three numpy arrays which maximizes my evaluation function.
I have a target array called train['target'] and three predictions arrays named array1, array2 and array3.
I want to find the best linear coefficients, i.e. x, y, z, for these three arrays, which will maximize the function
roc_auc_score(train['target'], x*array1 + y*array2 + z*array3)
The above function would be at its maximum when the prediction is closer to the target,
i.e. x*array1 + y*array2 + z*array3 should be closer to train['target'].
The range is x, y, z >= 0 and x, y, z <= 1.
Basically I am trying to find the weights x, y, z for each of the three arrays which would make
x*array1 + y*array2 + z*array3 closer to train['target'].
Any help in getting this would be appreciated.
I used pulp.LpProblem('Giapetto', pulp.LpMaximize) to do the maximization. It works for plain numbers, integers etc., but fails when I try it with arrays.
import numpy as np
import pulp
# create the LP object, set up as a maximization problem
prob = pulp.LpProblem('Giapetto', pulp.LpMaximize)
# set up decision variables
x = pulp.LpVariable('x', lowBound=0)
y = pulp.LpVariable('y', lowBound=0)
z = pulp.LpVariable('z', lowBound=0)
score = roc_auc_score(train['target'],x*array1+ y*array2 + z*array3)
prob += score
coef = x+y+z
prob += (coef==1)
# solve the LP using the default solver
optimization_result = prob.solve()
# make sure we got an optimal solution
assert optimization_result == pulp.LpStatusOptimal
# display the results
for var in (x, y, z):
    print('Optimal weekly number of {} to produce: {:1.0f}'.format(var.name, var.value()))
Getting error at the line
score = roc_auc_score(train['target'],x*array1+ y*array2 + z*array3)
TypeError: unsupported operand type(s) for /: 'int' and 'LpVariable'
Can't progress beyond this line when using arrays. Not sure if my approach is correct. Any help in optimizing the function would be appreciated.
When you add sums of array elements to a PuLP model, you have to use built-in PuLP constructs like lpSum to do it -- you can't just add arrays together (as you discovered).
So your score definition should look something like this:
score = pulp.lpSum([train['target'][i] - (x * array1[i] + y * array2[i] + z * array3[i]) for i in arr_ind])
A few notes about this:
[+] You didn't provide the definition of roc_auc_score so I just pretended that it equals the sum of the element-wise difference between the target array and the weighted sum of the other 3 arrays.
[+] I suspect your actual calculation for roc_auc_score is nonlinear; more on this below.
[+] arr_ind is a list of the indices of the arrays, which I created like this:
# build array index
arr_ind = range(len(array1))
[+] You also didn't include the arrays, so I created them like this:
array1 = np.random.rand(10, 1)
array2 = np.random.rand(10, 1)
array3 = np.random.rand(10, 1)
train = {}
train['target'] = np.ones((10, 1))
Here is my complete code, which compiles and executes, though I'm sure it doesn't give you the result you are hoping for, since I just guessed about target and roc_auc_score:
import numpy as np
import pulp
# create the LP object, set up as a maximization problem
prob = pulp.LpProblem('Giapetto', pulp.LpMaximize)
# dummy arrays since arrays weren't in OP code
array1 = np.random.rand(10, 1)
array2 = np.random.rand(10, 1)
array3 = np.random.rand(10, 1)
# build array index
arr_ind = range(len(array1))
# set up decision variables
x = pulp.LpVariable('x', lowBound=0)
y = pulp.LpVariable('y', lowBound=0)
z = pulp.LpVariable('z', lowBound=0)
# dummy roc_auc_score since roc_auc_score wasn't in OP code
train = {}
train['target'] = np.ones((10, 1))
score = pulp.lpSum([train['target'][i] - (x * array1[i] + y * array2[i] + z * array3[i]) for i in arr_ind])
prob += score
coef = x + y + z
prob += coef == 1
# solve the LP using the default solver
optimization_result = prob.solve()
# make sure we got an optimal solution
assert optimization_result == pulp.LpStatusOptimal
# display the results
for var in (x, y, z):
    print('Optimal weekly number of {} to produce: {:1.0f}'.format(var.name, var.value()))
Output:
Optimal weekly number of x to produce: 0
Optimal weekly number of y to produce: 0
Optimal weekly number of z to produce: 1
Process finished with exit code 0
Now, if your roc_auc_score function is nonlinear, you will have additional troubles. I would encourage you to try to formulate the score in a way that is linear, possibly using additional variables (for example, if you want the score to be an absolute value).
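As a rough sketch of the absolute-value idea, the model below minimizes the summed absolute error between the weighted prediction and the target, using one auxiliary variable per element. This is a stand-in objective, not ROC AUC, and the random arrays, the all-ones target and the names err and pred are assumptions made for illustration.

```python
import numpy as np
import pulp

rng = np.random.default_rng(0)
# Hypothetical data: three prediction arrays and a target of ones.
array1, array2, array3 = rng.random(10), rng.random(10), rng.random(10)
target = np.ones(10)

prob = pulp.LpProblem("weights", pulp.LpMinimize)
x = pulp.LpVariable("x", lowBound=0, upBound=1)
y = pulp.LpVariable("y", lowBound=0, upBound=1)
z = pulp.LpVariable("z", lowBound=0, upBound=1)
# err[i] is forced to be at least |target[i] - prediction[i]|.
err = [pulp.LpVariable(f"err_{i}", lowBound=0) for i in range(len(target))]

prob += pulp.lpSum(err)      # objective: total absolute error
prob += x + y + z == 1       # weights sum to one, as in the question
for i in range(len(target)):
    # Plain floats keep numpy scalars out of the PuLP expressions.
    pred = float(array1[i]) * x + float(array2[i]) * y + float(array3[i]) * z
    prob += err[i] >= float(target[i]) - pred
    prob += err[i] >= pred - float(target[i])

status = prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[status], x.value(), y.value(), z.value())
```

Because each err[i] is bounded below by both signed differences and the objective pushes it down, it settles at the absolute error, which keeps the whole model linear.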