Numpy: fuzzy 'greater_than' operator working on a list of values (requesting advice on existing code)

I have implemented a numpy function that:
- takes as inputs:
  - an n (rows) x m (columns) array of floats,
  - a threshold (float);
- for each row:
  - if the max value of the row is greater than or equal to threshold,
  - and if this max value is not preceded in the same row by a min value lower than or equal to -threshold,
  - then the row is flagged True (greater than),
  - else the row is flagged False (not greater than);
- returns this n (rows) x 1 (column) array of booleans.
What I have implemented works (at least on the provided example), but I am far from being an expert in numpy, and I wonder whether there is a more efficient way of handling this (possibly avoiding the miscellaneous transpose and tile calls, for instance?).
I would gladly accept any advice on making this function more efficient and/or readable.
import numpy as np
import pandas as pd
# Test data
threshold = 0.02  # 2%
df = pd.DataFrame({'variation_1': [0.01, 0.02, 0.005, -0.02, -0.01, -0.01],
                   'variation_2': [-0.01, 0.08, 0.08, 0.01, -0.02, 0.01],
                   'variation_3': [0.005, -0.03, -0.03, 0.002, 0.025, -0.03],
                   })
data = df.values
Checking expected results:
In [75]: df
Out[75]:
variation_1 variation_2 variation_3 # Expecting
0 0.010 -0.01 0.005 # False (no value larger than threshold)
1 0.020 0.08 -0.030 # True (1st value equal to threshold)
2 0.005 0.08 -0.030 # True (2nd value larger than threshold)
3 -0.020 0.01 0.002 # False (no value larger than threshold)
4 -0.010 -0.02 0.025 # False (2nd value lower than -threshold)
5 -0.010 0.01 -0.030 # False (no value larger than threshold)
Current function.
def greater_than(data: np.ndarray, threshold: float) -> np.ndarray:
    # Step 1.
    # Build the 'low_max' mask flagging rows whose 'max' is greater than or
    # equal to 'threshold'. It is reshaped like the input array for the next step.
    data_max = np.amax(data, axis=1)
    low_max = np.transpose([data_max >= threshold] * data.shape[1])
    # Step 2.
    # Filter values preceding the max of each row.
    max_idx = np.argmax(data, axis=1)                  # Get idx of max.
    max_idx = np.transpose([max_idx] * data.shape[1])  # Reshape like input array.
    # Create an array of indices.
    idx_array = np.tile(np.arange(data.shape[1]), (data.shape[0], 1))
    # Keep indices lower than the index of the max for each row, and filter out
    # rows with a max too low vs 'threshold' (from step 1).
    mask_max = (idx_array <= max_idx) & low_max
    # Step 3.
    # On a masked array re-using the mask from step 2 to hide unqualifying values,
    # filter out rows with a 'min' preceding the 'max' that is lower than or
    # equal to '-threshold'.
    data = np.ma.array(data, mask=~mask_max)
    data_min = np.amin(data, axis=1)
    mask_min = data_min > -threshold
    # Return 'mask_min', filling masked values with 'False'.
    return np.ma.filled(mask_min, False)
Results.
res = greater_than(data, threshold)
In [78]: res
Out[78]: array([False, True, True, False, False, False])
Thanks in advance for any advice!

lesser = data <= -threshold
greater = data >= threshold
idx_lesser = np.argmax(lesser, axis=1)
idx_greater = np.argmax(greater, axis=1)
has_lesser = np.any(lesser, axis=1)
has_greater = np.any(greater, axis=1)
output = has_greater * (has_lesser * (idx_lesser > idx_greater) + np.logical_not(has_lesser))
yields your expected output on your data and should be quite fast. Also, I'm not entirely sure I understand your explanation so if this doesn't work on your actual data let me know.
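For reference, here is the same logic wrapped into a drop-in replacement for greater_than (the name greater_than_fast is mine), which reproduces the expected output on the test data above:
def greater_than_fast(data: np.ndarray, threshold: float) -> np.ndarray:
    lesser = data <= -threshold
    greater = data >= threshold
    # Index of the first qualifying value per row (0 when the row has none).
    idx_lesser = np.argmax(lesser, axis=1)
    idx_greater = np.argmax(greater, axis=1)
    has_lesser = np.any(lesser, axis=1)
    has_greater = np.any(greater, axis=1)
    # True when a value >= threshold exists and no value <= -threshold precedes it.
    return has_greater & (~has_lesser | (idx_lesser > idx_greater))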

Related

cumsum() on running streak of data

I'm attempting to determine the sum of a column for a running period where there is a negative gain, i.e. I want to determine the total for a losing streak.
I've set up a column that provides the number of consecutive losing days (Consecutive Losses), but I wish to sum up the total loss throughout the streak. What I have (Aggregate Consecutive Loss) 1) doesn't work (because it just cumsums() without resetting to zero at each streak) and 2) is incorrect, as I should in fact take the Open value at the start of the streak and the Close value at the end.
How can I correctly set up this Aggregate Consecutive Loss value in pandas?
import pandas as pd
import numpy as np
import yfinance as yf
def get(symbols, group_by_ticker=False, **kwargs):
    if not isinstance(symbols, list):
        symbols = [symbols, ]
    kwargs['auto_adjust'] = True
    kwargs['prepost'] = True
    kwargs['threads'] = True
    df = None
    if group_by_ticker:
        kwargs['group_by'] = 'ticker'
    df = yf.download(symbols, **kwargs)
    for t in symbols:
        df["Change Percent", t] = df["Close", t].pct_change() * 100
        df["Gain", t] = np.where(df['Change Percent', t] > 0, True, False).astype('bool')
        a = df['Gain', t] != True
        df['Consecutive Losses', t] = a.cumsum() - a.cumsum().where(~a).ffill().fillna(0).astype(int)
        x = df['Change Percent', t].where(df['Consecutive Losses', t] > 0)
        df['Aggregate Consecutive Loss', t] = x.cumsum() - x.cumsum().where(~a).ffill().fillna(0).astype(float)
    return df

data = get(["DOW", "IDX"], period="6mo")
data[['Change Percent', 'Gain', 'Consecutive Losses', 'Aggregate Consecutive Loss']].head(50)
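In the meantime, here is a minimal sketch of the reset-at-each-streak idea on a toy Series, grouping by a streak id built from the gain rows (the names change, losing, streak_id and agg_loss are mine, not from the question):
import pandas as pd

change = pd.Series([1.0, -0.5, -0.3, 0.2, -0.1, -0.4, -0.2])
losing = change < 0
# Each gain row increments the cumulative sum, so every losing streak
# gets its own group id.
streak_id = (~losing).cumsum()
# Sum the changes within each losing streak, zeroing out the gain rows.
agg_loss = change.where(losing, 0).groupby(streak_id).cumsum()
print(agg_loss)  # resets to 0 at each gain, accumulates within a streak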

How to concatenate two tensors with intervals in tensorflow?

I want to concatenate two tensors checkerboard-ly in tensorflow2, like the examples shown below:
example 1:
a = [[1,1],[1,1]]
b = [[0,0],[0,0]]
concated_a_and_b = [[1,0,1,0],[0,1,0,1]]
example 2:
a = [[1,1,1],[1,1,1],[1,1,1]]
b = [[0,0,0],[0,0,0],[0,0,0]]
concated_a_and_b = [[1,0,1,0,1,0],[0,1,0,1,0,1],[1,0,1,0,1,0]]
Is there a decent way in tensorflow2 to concatenate them like this?
A bit of background for this:
I first split a tensor c with a checkerboard mask into two halves a and b. After some transformations I have to concatenate them back into the original shape and order.
What I mean by checkerboard-ly:
Step 1: Generate a matrix with alternated values
You can do this by first concatenating into [1, 0] pairs, and then by applying a final reshape.
Step 2: Reverse some rows
I split the matrix into two parts, reverse the second part and then rebuild the full matrix by picking alternately from the first and second parts.
Code sample:
import math
import numpy as np
import tensorflow as tf
a = tf.ones(shape=(3, 4))
b = tf.zeros(shape=(3, 4))
x = tf.expand_dims(a, axis=-1)
y = tf.expand_dims(b, axis=-1)
paired_ones_zeros = tf.concat([x, y], axis=-1)
alternated_values = tf.reshape(paired_ones_zeros, [-1, a.shape[1] + b.shape[1]])
num_samples = alternated_values.shape[0]
middle = math.ceil(num_samples / 2)
is_num_samples_odd = middle * 2 != num_samples
# Gather first part of the matrix, don't do anything to it
first_elements = tf.gather_nd(alternated_values, [[index] for index in range(middle)])
# Gather second part of the matrix and reverse its elements
second_elements = tf.reverse(tf.gather_nd(alternated_values, [[index] for index in range(middle, num_samples)]), axis=[1])
# Pick alternatively between first and second part of the matrix
indices = np.concatenate([[[index], [index + middle]] for index in range(middle)], axis=0)
if is_num_samples_odd:
    indices = indices[:-1]
output = tf.gather_nd(
    tf.concat([first_elements, second_elements], axis=0),
    indices
)
print(output)
I know this is not a decent way, as it hurts time and space complexity, but it solves the problem above:
def concat(tf1, tf2):
    result = []
    for (index, (tf_item1, tf_item2)) in enumerate(zip(tf1, tf2)):
        item = []
        for (subitem1, subitem2) in zip(tf_item1, tf_item2):
            if index % 2 == 0:
                item.append(subitem1)
                item.append(subitem2)
            else:
                item.append(subitem2)
                item.append(subitem1)
        result.append(item)
    return result
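For what it's worth, here is a shorter vectorized sketch of the same interleave-and-swap idea, assuming TF2's broadcasting tf.where (the variable names are mine):
import tensorflow as tf

a = tf.ones(shape=(3, 4))
b = tf.zeros(shape=(3, 4))
n, m = a.shape
# Interleave the columns of a and b: rows look like [a0, b0, a1, b1, ...].
even_rows = tf.reshape(tf.stack([a, b], axis=-1), (n, 2 * m))
# Same interleave but starting with b: [b0, a0, b1, a1, ...].
odd_rows = tf.reshape(tf.stack([b, a], axis=-1), (n, 2 * m))
# On odd row indices pick the swapped variant (tf.where broadcasts the mask).
is_odd = (tf.range(n) % 2 == 1)[:, None]
output = tf.where(is_odd, odd_rows, even_rows)
print(output)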

Calculating np.mean predict with percent filter

I need to compute the np.mean of clf.predict results, but only for rows where one of the predicted class probabilities is more than 80%.
My current code:
from sklearn.tree import DecisionTreeClassifier
import numpy as np

clf = DecisionTreeClassifier(random_state=1)
clf.fit(X, Y)
dropIndexes = []
for i in range(len(X)):
    proba = clf.predict_proba([X.values[i]])
    if proba[0][0] < 80 and proba[0][1] < 80:
        dropIndexes.append(i)
# Delete all rows where the predicted probabilities are below the cutoff.
X.drop(dropIndexes, inplace=True)
Y.drop(dropIndexes, inplace=True)
# np.mean returns the average of the array elements.
print("ERR:", np.mean(Y != clf.predict(X)))
Is it possible to make this code more quickly?
Your loop is unnecessary, as predict_proba works on matrices. You can replace it with
prd = clf.predict_proba(X)
dropIndexes = (prd[:, 0] < 0.8) & (prd[:, 1] < 0.8)
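Putting it together, here is a minimal sketch of the fully vectorized version (X_kept and Y_kept are my names, with X and Y as in the question; note that predict_proba returns probabilities in [0, 1], so the question's threshold of 80 can never match and should be 0.8):
from sklearn.tree import DecisionTreeClassifier
import numpy as np

clf = DecisionTreeClassifier(random_state=1)
clf.fit(X, Y)
# Keep only rows where one predicted class probability is at least 80%.
prd = clf.predict_proba(X)
keep = (prd[:, 0] >= 0.8) | (prd[:, 1] >= 0.8)
X_kept, Y_kept = X[keep], Y[keep]
print("ERR:", np.mean(Y_kept != clf.predict(X_kept)))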

Python: checking which bins two time points belong to

I have a list of lists with two values that represent a start time-point and an end time-point. I would like to count how much of the time range between the two points fall into bins.
The bins are 0-300, 300-500 and 500-1200.
I would also like to bin them between 0-50, 50-100, 100-150 and so on.
The question is similar to Python: Checking to which bin a value belongs, but different since it involves a two-points time-range which can fall into separate bins at the same time.
I have created a for loop in the code below, which works, but I'm wondering if there is a faster, more pythonic way to calculate this, perhaps using pandas or numpy.
import numpy
x = numpy.array([[100, 150], [100, 125], [290, 310], [277, 330],
                 [300, 400], [480, 510], [500, 600]])
d = {'0-300': [0], '300-500': [0], '500-1200': [0]}
import pandas as pd
df = pd.DataFrame(data=d)
for i in x:
    start, end = i[0], i[1]
    if start <= 300 and end <= 300:                   # range falls only into the 1st bin
        df['0-300'][0] += end - start
    elif start <= 300 and end > 300:                  # range falls into the 1st and 2nd bins
        df['0-300'][0] += (300 - start)
        df['300-500'][0] += (end - 300)
    elif start >= 300 and end >= 300 and end <= 500:  # range falls only into the 2nd bin
        df['300-500'][0] += end - start
    elif start <= 500 and end > 500:                  # range falls into the 2nd and 3rd bins
        df['300-500'][0] += (500 - start)
        df['500-1200'][0] += (end - 500)
    elif start > 500:                                 # range falls only into the 3rd bin
        df['500-1200'][0] += end - start
df:
   0-300  300-500  500-1200
0    108      160       110
Thanks for reading.
For a generic number of bins, here's a vectorized way leveraging np.add.at to get the counts and then np.add.reduceat for getting binned summations -
bins = [0, 300, 500, 1200] # Declare bins
id_arr = np.zeros(bins[-1], dtype=int)
np.add.at(id_arr, x[:,0], 1)
np.add.at(id_arr, x[:,1], -1)
c = id_arr.cumsum()
out = np.add.reduceat(c, bins[:-1])
# Present in a dataframe format
col_names = [str(i) + '-' + str(j) for i, j in zip(bins[:-1], bins[1:])]
df_out = pd.DataFrame([out], columns=col_names)
Sample output -
In [524]: df_out
Out[524]:
0-300 300-500 500-1200
0 108 160 110
Here is one way of doing it
In [1]: counts = np.zeros(1200, dtype=int)
In [2]: for x_lower, x_upper in x: counts[x_lower:x_upper] += 1
In [3]: d['0-300'] = counts[0:300].sum()
In [4]: d['300-500'] = counts[300:500].sum()
In [5]: d['500-1200'] = counts[500:1200].sum()
In [6]: d
Out[6]: {'0-300': 108, '300-500': 160, '500-1200': 110}
However, in order to sum up the results for all bins, it is better to wrap those three summing steps into a loop, as sketched below.
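For example, a sketch of that generalization over an arbitrary bins list (reusing x, np and the counts idea from above):
bins = [0, 300, 500, 1200]
counts = np.zeros(bins[-1], dtype=int)
for x_lower, x_upper in x:
    counts[x_lower:x_upper] += 1
# Sum the per-unit counts over each bin's index range.
d = {'%d-%d' % (lo, hi): counts[lo:hi].sum() for lo, hi in zip(bins[:-1], bins[1:])}
# {'0-300': 108, '300-500': 160, '500-1200': 110}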

pandas histogram plot error: ValueError: num must be 1 <= num <= 0, not 1

I am drawing a histogram of a column from pandas data frame:
%matplotlib notebook
import matplotlib.pyplot as plt
import matplotlib
df.hist(column='column_A', bins = 100)
but got the following error:
62 raise ValueError(
63 "num must be 1 <= num <= {maxn}, not {num}".format(
---> 64 maxn=rows*cols, num=num))
65 self._subplotspec = GridSpec(rows, cols)[int(num) - 1]
66 # num - 1 for converting from MATLAB to python indexing
ValueError: num must be 1 <= num <= 0, not 1
Does anyone know what this error means? Thanks!
Problem
The problem you encounter arises when column_A does not contain numeric data. As you can see in the excerpt from pandas.plotting._core below, numeric data is essential for the function hist_frame (which you call via DataFrame.hist()) to work correctly.
def hist_frame(data, column=None, by=None, grid=True, xlabelsize=None,
               xrot=None, ylabelsize=None, yrot=None, ax=None, sharex=False,
               sharey=False, figsize=None, layout=None, bins=10, **kwds):
    # skipping part of the code
    # ...
    if column is not None:
        if not isinstance(column, (list, np.ndarray, Index)):
            column = [column]
        data = data[column]
    data = data._get_numeric_data()  # there is no numeric data in the column
    naxes = len(data.columns)        # so the number of axes becomes 0
    # naxes is passed to the subplot generating function as 0 and later
    # determines the number of columns as 0
    fig, axes = _subplots(naxes=naxes, ax=ax, squeeze=False,
                          sharex=sharex, sharey=sharey, figsize=figsize,
                          layout=layout)
    # skipping the rest of the code
    # ...
Solution
If your problem is to represent numeric data (but not of numeric dtype yet) with a histogram, you need to cast your data to numeric, either with pd.to_numeric or df.astype(a_selected_numeric_dtype), e.g. 'float64', and then proceed with your code (a short sketch follows the list below).
If your problem is to represent non-numeric data in one column with a histogram, you can call the function hist_series with the following line: df['column_A'].hist(bins=100).
If your problem is to represent non-numeric data in many columns with a histogram, you may resort to a handful of options:
- Use matplotlib and create subplots and histograms directly
- Update pandas at least to version 0.25
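For the first case, a minimal sketch of the cast (errors='coerce' turns entries that cannot be parsed into NaN, which the histogram then ignores):
import pandas as pd

# Cast the column to numeric before plotting; unparsable values become NaN.
df['column_A'] = pd.to_numeric(df['column_A'], errors='coerce')
df.hist(column='column_A', bins=100)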