How to Normalize the values between zero and one? - numpy

I have a ndarray of shape (74,):
[-1.995 1.678 -2.535 1.739 -1.728 -1.268 -0.727 -3.385 -2.348
-3.021 0.5293 -0.4573 0.5137 -3.047 -4.75 -1.847 2.922 -0.989
-1.507 -0.9224 -2.545 6.957 0.9985 -2.035 -3.234 -2.848 -1.971
-3.246 2.057 -1.991 -6.27 9.22 0.4045 -2.703 -1.577 4.066
7.215 -4.07 12.98 -3.02 1.456 9.44 6.49 0.272 2.07
1.625 -3.531 -2.846 -4.914 -0.536 -3.496 -1.095 -2.719 -0.5825
5.535 -0.1753 3.658 4.234 4.543 -0.8384 -2.705 -2.012 -6.56
10.5 -2.021 -2.48 1.725 5.69 3.672 -6.855 -3.887 1.761
6.926 -4.848 ]
I need to normalize this vector so that the values fall between [0, 1] and the sum of the values in the vector equals 1.

You can use min-max scaling to map the values into [0, 1]:
import numpy as np

min_val = np.min(original_arr)
max_val = np.max(original_arr)
normalized_arr = (original_arr - min_val) / (max_val - min_val)
And you can divide by the total to make the sum of the array equal 1:
new_arr = original_arr / original_arr.sum()
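Note that these two formulas address different requirements, and dividing the raw array by its sum will not keep the values in [0, 1] when the array contains negatives. To satisfy both constraints at once, a minimal sketch (assuming original_arr is the (74,) vector above) is to min-max scale first, which makes every value non-negative, and then divide by the sum:
scaled = (original_arr - original_arr.min()) / (original_arr.max() - original_arr.min())
result = scaled / scaled.sum()  # values stay in [0, 1] and now sum to 1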

Related

Numpy: fuzzy 'greater_than' operator, working on a list of values (requesting advice on existing code)

I have implemented a numpy function that:
- takes as inputs:
  - an n (rows) x m (columns) array of floats,
  - a threshold (float);
- for each row:
  - if the max value of the row is larger than or equal to threshold,
  - and if this max value is not preceded in the same row by a min value lower than or equal to -threshold,
  - then this row is flagged True (larger than),
  - else this row is flagged False (not larger than);
- returns this n (rows) x 1 (column) array of booleans.
What I have implemented works (at least on the provided example), but I am far from being an expert in numpy, and I wonder if there is a more efficient way of handling this (possibly avoiding the miscellaneous transpose & tile, for instance).
I would gladly accept any advice on how to make this function more efficient and/or readable.
import numpy as np
import pandas as pd
# Test data
threshold = 0.02  # 2%
df = pd.DataFrame({'variation_1': [0.01, 0.02, 0.005, -0.02, -0.01, -0.01],
                   'variation_2': [-0.01, 0.08, 0.08, 0.01, -0.02, 0.01],
                   'variation_3': [0.005, -0.03, -0.03, 0.002, 0.025, -0.03],
                   })
data = df.values
Checking expected results:
In [75]: df
Out[75]:
variation_1 variation_2 variation_3 # Expecting
0 0.010 -0.01 0.005 # False (no value larger than threshold)
1 0.020 0.08 -0.030 # True (1st value equal to threshold)
2 0.005 0.08 -0.030 # True (2nd value larger than threshold)
3 -0.020 0.01 0.002 # False (no value larger than threshold)
4 -0.010 -0.02 0.025 # False (2nd value lower than -threshold)
5 -0.010 0.01 -0.030 # False (no value larger than threshold)
Current function.
def greater_than(data: np.ndarray, threshold: float) -> np.ndarray:
    # Step 1.
    # Build the 'low_max' mask, filtering out rows whose 'max' is not greater than
    # or equal to 'threshold'. 'low_max' is reshaped like the input array for use
    # in the next step.
    data_max = np.amax(data, axis=1)
    low_max = np.transpose([data_max >= threshold] * data.shape[1])
    # Step 2.
    # Filtering values preceding the max of each row.
    max_idx = np.argmax(data, axis=1)                  # Get idx of max.
    max_idx = np.transpose([max_idx] * data.shape[1])  # Reshape like input array.
    # Create an array of indices.
    idx_array = np.tile(np.arange(data.shape[1]), (data.shape[0], 1))
    # Keep indices lower than the index of the max for each row, and filter out
    # rows with a max too low vs 'threshold' (from step 1).
    mask_max = (idx_array <= max_idx) & (low_max)
    # Step 3.
    # On a masked array re-using the mask from step 2 to filter out unqualifying
    # values, filter out rows with a 'min' preceding the 'max' that is lower than
    # or equal to '-threshold'.
    data = np.ma.array(data, mask=~mask_max)
    data_min = np.amin(data, axis=1)
    mask_min = data_min > -threshold
    # Return 'mask_min', filling masked values with 'False'.
    return np.ma.filled(mask_min, False)
Results.
res = greater_than(data, threshold)
In [78]: res
Out[78]: array([False, True, True, False, False, False])
Thanks in advance for any advice!
lesser = data <= -threshold
greater = data >= threshold
idx_lesser = np.argmax(lesser, axis=1)
idx_greater = np.argmax(greater, axis=1)
has_lesser = np.any(lesser, axis=1)
has_greater = np.any(greater, axis=1)
output = has_greater * (has_lesser * (idx_lesser > idx_greater) + np.logical_not(has_lesser))
yields your expected output on your data and should be quite fast. Also, I'm not entirely sure I understand your explanation, so if this doesn't work on your actual data, let me know.
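For readability, the same idea can be wrapped in a function using boolean operators in place of the arithmetic (a sketch; greater_than_fast is just an illustrative name, and the expression is logically equivalent to the one above):
def greater_than_fast(data: np.ndarray, threshold: float) -> np.ndarray:
    # Flag rows whose first value >= threshold comes before any value <= -threshold.
    lesser = data <= -threshold
    greater = data >= threshold
    idx_lesser = np.argmax(lesser, axis=1)    # position of first "too low" value (0 if none)
    idx_greater = np.argmax(greater, axis=1)  # position of first "high enough" value (0 if none)
    has_lesser = np.any(lesser, axis=1)
    has_greater = np.any(greater, axis=1)
    return has_greater & (~has_lesser | (idx_lesser > idx_greater))
On the test data above, greater_than_fast(data, threshold) returns the same array([False, True, True, False, False, False]).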

Simple computation in numpy

I have a numpy array like this: a = [-- -- -- 1.90 2.91 1.91 2.92]
I need to find the percentage of values greater than 2; here it is 50%.
How can I get this in an easy way? Also, why does len(a) give 7 (instead of 4)?
Try this:
import numpy as np
import numpy.ma as ma

a = ma.array([0, 1, 2, 1.90, 2.91, 1.91, 2.92])
for i in range(3):
    a[i] = ma.masked
print(a)
print(np.sum(a > 2) / (len(a) - ma.count_masked(a)))
The last line prints 0.5, which is your 50%. It subtracts from the total length of your array (7) the number of masked elements (3), which you see as the three "--" in the output you posted.
Generally speaking, for a plain array you can simply use
a = np.array([...])
threshold = 2.0
fraction_higher = (a > threshold).sum() / len(a)  # in [0, 1]
percentage_higher = fraction_higher * 100
Your array, however, contains 7 elements, 3 of them masked. This code emulates the test case, generating a masked array as well:
# generate the test case: a masked array
a = np.ma.array([-1, -1, -1, 1.90, 2.91, 1.91, 2.92], mask=[1, 1, 1, 0, 0, 0, 0])
# check its format
print(a)
[-- -- -- 1.9 2.91 1.91 2.92]
# print the output
print(a[a>2].count() / a.count())
0.5
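This also explains the len question: len(a) counts masked elements too, while a.count() counts only the unmasked ones. An equivalent one-liner (a sketch; compressed() returns the unmasked values as a plain array, and the mean of a boolean array is the fraction of True values):
print((a.compressed() > 2).mean())
0.5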

How to Index Pandas Dataframe with a List of Slices

I have two data frames, ret and bins. I would like to take index values from bins, create a range for every row in bins, and then use that list of ranges to select data from ret. Somehow this works when I pass an index of slices (manually typed in), but it doesn't work when I pass in a list saved in the variable a:
import numpy as np
import pandas as pd

ret = pd.DataFrame({'px': [.1, -.15, .30, -.20, .05]})
bins = pd.DataFrame({'t1': [3, 4]}, index=[1, 2])
a = []
for i, b in bins.iterrows():
    a.append(slice(i, b.t1))
print('a', a)
print('np.r_[a]', np.r_[a])
print('np.r_[slice]', np.r_[slice(1, 3, None), slice(1, 4, None)])
print(ret.iloc[np.r_[slice(1, 3, None), slice(1, 4, None)]])  # this WORKS
print(ret.iloc[a])  # this DOES NOT WORK
here is the output:
a [slice(1, 3, None), slice(2, 4, None)]
np.r_[a] [slice(1, 3, None) slice(2, 4, None)]
np.r_[slice] [1 2 1 2 3]
px
1 -0.15
2 0.30
1 -0.15
2 0.30
3 -0.20
...
TypeError: int() argument must be a string, a bytes-like object or a number, not 'slice'
Going to answer my own question here! The problem was that slice() is too cumbersome to use. It's easier to just flatten the lists of arrays. If anyone has any suggestions, please post here!
ret = pd.DataFrame({'px': [.1, -.15, .30, -.20, .05]})
bins = pd.DataFrame({'t1': [3,4]}, index=[1,2])
a = [ret[i:b.t1].index for i, b in bins.iterrows()]
out = [item for sublist in a for item in sublist]
print(ret.loc[out])
     px
1 -0.15
2  0.30
2  0.30
3 -0.20
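One suggestion: the slice-based approach can be made to work directly, because np.r_ accepts several slices at once when indexed with a tuple (a sketch, reusing ret and bins from above):
a = [slice(i, b.t1) for i, b in bins.iterrows()]
print(ret.iloc[np.r_[tuple(a)]])  # each slice is expanded to a range and concatenated
This yields the same rows as the flattened-index version.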

Calculating np.mean predict with percent filter

I need to find the np.mean of clf.predict only for the rows where one of the predicted class probabilities is more than 80%.
My current code:
clf = DecisionTreeClassifier(random_state=1)
clf.fit(X, Y)
dropIndexes = []
for i in range(len(X)):
    proba = clf.predict_proba([X.values[i]])
    if proba[0][0] < 80 and proba[0][1] < 80:
        dropIndexes.append(i)
# delete all rows where the predicted values are less than 80
X.drop(dropIndexes, inplace=True)
Y.drop(dropIndexes, inplace=True)
# returns the average of the array elements
print("ERR:", np.mean(Y != clf.predict(X)))
Is it possible to make this code faster?
Your loop is unnecessary, as predict_proba works on matrices. Note also that predict_proba returns probabilities in [0, 1], so the threshold should be 0.8, not 80. You can replace the loop with
prd = clf.predict_proba(X)
dropIndexes = (prd[:, 0] < 0.8) & (prd[:, 1] < 0.8)
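Since dropIndexes is now a boolean mask rather than a list of labels, a boolean selection is simpler than drop. A minimal sketch of the full computation under that assumption (keeping the rows where some class is predicted with at least 80% probability, and assuming X and Y are the pandas objects from the question):
prd = clf.predict_proba(X)
keep = (prd[:, 0] >= 0.8) | (prd[:, 1] >= 0.8)  # rows with a confident prediction
print("ERR:", np.mean(Y[keep] != clf.predict(X[keep])))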

How to "bin" a numpy array using custom (non-linearly spaced) buckets?

How to "bin" the bellow array in numpy so that:
import numpy as np

bins = np.array([-0.1, -0.07, -0.02, 0., 0.02, 0.07, 0.1])
array = np.array([-0.21950869, -0.02854823, 0.22329239, -0.28073936, -0.15926265,
                  -0.43688216, 0.03600587, -0.05101109, -0.24318651, -0.06727875])
That is, replace each of the values in array with the following:
-0.1 where `value` < -0.085
-0.07 where -0.085 <= `value` < -0.045
-0.02 where -0.045 <= `value` < -0.01
0.0 where -0.01 <= `value` < 0.01
0.02 where 0.01 <= `value` < 0.045
0.07 where 0.045 <= `value` < 0.085
0.1 where `value` >= 0.085
The expected output would be:
array = np.array([-0.1, -0.02, 0.1, -0.1, -0.1, -0.1, 0.02, -0.07, -0.1, -0.07])
I recognise that numpy has a digitize function; however, it returns the index of the bin, not the bin itself. That is:
np.digitize(array, bins)
np.array([0, 2, 7, 0, 0, 0, 5, 2, 0, 2])
Get the mid-values by averaging consecutive bin values in pairs. Then use np.searchsorted or np.digitize to get the indices from the mid-values. Finally, index into bins for the output.
Mid-values:
mid_bins = (bins[1:] + bins[:-1]) / 2.0
Indices with searchsorted or digitize (np.digitize matches the boundary conventions above; np.searchsorted would need side='right' for values that fall exactly on a midpoint):
idx = np.searchsorted(mid_bins, array)
idx = np.digitize(array, mid_bins)
Output:
out = bins[idx]
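Putting the steps together on the question's data (a sketch; it reproduces the expected output above):
mid_bins = (bins[1:] + bins[:-1]) / 2.0   # midpoints between consecutive bin values
out = bins[np.digitize(array, mid_bins)]  # map each value to the bin value of its interval
print(out)
[-0.1  -0.02  0.1  -0.1  -0.1  -0.1   0.02 -0.07 -0.1  -0.07]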