apply generic function in a vectorized fashion using numpy/pandas - pandas

I am trying to vectorize my code and, thanks in large part to some users (https://stackoverflow.com/users/3293881/divakar, https://stackoverflow.com/users/625914/behzad-nouri), I was able to make huge progress. Essentially, I am trying to apply a generic function (in this case max_dd_array_ret) to each of the bins I found (see vectorize complex slicing with pandas dataframe for details on date vectorization and Start, End and Duration of Maximum Drawdown in Python for the rationale behind max_dd_array_ret). the problem is the following: I should be able to obtain the result df_2 and, to some degree, ranged_DD(asd_1.values, starts, ends+1) is what I am looking for, except for the tragic effect that it's as if the first two bins are merged and the last one is missing as it can be gauged by looking at the results.
any explanation and fix is very welcomed
import pandas as pd
import numpy as np
from time import time
from scipy.stats import binned_statistic
def max_dd_array_ret(xs):
xs = (xs+1).cumprod()
i = np.argmax(np.maximum.accumulate(xs) - xs) # end of the period
j = np.argmax(xs[:i])
max_dd = abs(xs[j]/xs[i] -1)
return max_dd if max_dd is not None else 0
def get_ranges_arr(starts,ends):
# Taken from https://stackoverflow.com/a/37626057/3293881
counts = ends - starts
counts_csum = counts.cumsum()
id_arr = np.ones(counts_csum[-1],dtype=int)
id_arr[0] = starts[0]
id_arr[counts_csum[:-1]] = starts[1:] - ends[:-1] + 1
return id_arr.cumsum()
def ranged_DD(arr,starts,ends):
# Get all indices and the IDs corresponding to same groups
idx = get_ranges_arr(starts,ends)
id_arr = np.repeat(np.arange(starts.size),ends-starts)
slice_arr = arr[idx]
return binned_statistic(id_arr, slice_arr, statistic=max_dd_array_ret)[0]
asd_1 = pd.Series(0.01 * np.random.randn(500), index=pd.date_range('2011-1-1', periods=500)).pct_change()
index_1 = pd.to_datetime(['2011-2-2', '2011-4-3', '2011-5-1','2011-7-2', '2011-8-3', '2011-9-1','2011-10-2', '2011-11-3', '2011-12-1','2012-1-2', '2012-2-3', '2012-3-1',])
index_2 = pd.to_datetime(['2011-2-15', '2011-4-16', '2011-5-17','2011-7-17', '2011-8-17', '2011-9-17','2011-10-17', '2011-11-17', '2011-12-17','2012-1-17', '2012-2-17', '2012-3-17',])
starts = asd_1.index.searchsorted(index_1)
ends = asd_1.index.searchsorted(index_2)
df_2 = pd.DataFrame([max_dd_array_ret(asd_1.loc[i:j]) for i, j in zip(index_1, index_2)], index=index_1)
print(df_2[0].values)
print(ranged_DD(asd_1.values, starts, ends+1))
results:
df_2
[ 1.75893509 6.08002911 2.60131797 1.55631781 1.8770067 2.50709085
1.43863472 1.85322338 1.84767224 1.32605754 1.48688414 5.44786663]
ranged_DD(asd_1.values, starts, ends+1)
[ 6.08002911 2.60131797 1.55631781 1.8770067 2.50709085 1.43863472
1.85322338 1.84767224 1.32605754 1.48688414]
which are identical except for the first two:
[ 1.75893509 6.08002911 vs [ 6.08002911
and the last two
1.48688414 5.44786663] vs 1.48688414]
p.s.:while looking in more detail at the docs (http://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binned_statistic.html) I found that this might be the problem
"All but the last (righthand-most) bin is half-open. In other words,
if bins is [1, 2, 3, 4], then the first bin is [1, 2) (including 1,
but excluding 2) and the second [2, 3). The last bin, however, is [3,
4], which includes 4. New in version 0.11.0."
problem is I don't how to reset it.

Related

passing panda dataframe data to functions and its not outputting the results

In my code, I am trying to extract data from csv file to use in the function, but it doesnt output anything, and gives no error. My code works because I tried it with just numpy array as inputs. not sure why it doesnt work with panda.
import numpy as np
import pandas as pd
import os
# change the current directory to the directory where the running script file is
os.chdir(os.path.dirname(os.path.abspath(__file__)))
# finding best fit line for y=mx+b by iteration
def gradient_descent(x,y):
m_iter = b_iter = 1 #starting point
iteration = 10000
n = len(x)
learning_rate = 0.05
last_mse = 10000
#take baby steps to reach global minima
for i in range(iteration):
y_predicted = m_iter*x + b_iter
#mse = 1/n*sum([value**2 for value in (y-y_predicted)]) # cost function to minimize
mse = 1/n*sum((y-y_predicted)**2) # cost function to minimize
if (last_mse - mse)/mse < 0.001:
break
# recall MSE formula is 1/n*sum((yi-y_predicted)^2), where y_predicted = m*x+b
# using partial deriv of MSE formula, d/dm and d/db
dm = -(2/n)*sum(x*(y-y_predicted))
db = -(2/n)*sum((y-y_predicted))
# use current predicted value to get the next value for prediction
# by using learning rate
m_iter = m_iter - learning_rate*dm
b_iter = b_iter - learning_rate*db
print('m is {}, b is {}, cost is {}, iteration {}'.format(m_iter,b_iter,mse,i))
last_mse = mse
#x = np.array([1,2,3,4,5])
#y = np.array([5,7,8,10,13])
#gradient_descent(x,y)
df = pd.read_csv('Linear_Data.csv')
x = df['Area']
y = df['Price']
gradient_descent(x,y)
My code works because I tried it with just numpy array as inputs. not sure why it doesnt work with panda.
Well no, your code also works with pandas dataframes:
df = pd.DataFrame({'Area': [1,2,3,4,5], 'Price': [5,7,8,10,13]})
x = df['Area']
y = df['Price']
gradient_descent(x,y)
Above will give you the same output as with numpy arrays.
Try to check what's in Linear_Data.csv and/or add some print statements in the gradient_descent function just to check your assumptions. I would suggest to first of all add a print statement before the condition with the break statement:
print(last_mse, mse)
if (last_mse - mse)/mse < 0.001:
break

Idiomatic tensorflow expression of for-loop which considers value of preceding iteration

It might not even be possible, but I would like to express the following code without the for-loop.
tf.scan is prohibitively slow and therefore not a good solution.
I am perfectly happy to accept any answer which gives a solution or an argument why this is not possible.
import tensorflow as tf
import matplotlib.pyplot as plt
# Some data
random_series = tf.reshape(tf.math.cumsum(tf.random.normal([100])),[1,-1])
# a mesh_grid of "co-distance"
random_mesh_gain = 1 - tf.matmul(random_series,tf.math.reciprocal(random_series), True, False)
# lower triangular matrix
random_tri = tf.linalg.band_part(random_mesh_gain, 0, -1)
# some lambda working on a series-type data
drop = 0.5
lambda_map = lambda series: tf.math.floor(series/interval)*interval - drop
# apply map
random_img = tf.map_fn(lambda_map, random_tri)
# init preceding and output list sl_full
preceding = - drop * tf.ones(tf.transpose(random_img).shape[0],dtype=tf.float32)
sl_full = []
for a in tf.transpose(random_img):
prec_a_max = tf.reduce_max(tf.stack([preceding, a]),axis=0)
preceding = prec_a_max
sl_full.append(prec_a_max)
# create tensor from list
sl_full = tf.transpose(tf.stack(sl_full))
plt.imshow(sl_full,origin="lower")

discrete numpy array to continuous array

I have some discrete data in an array, such that:
arr = np.array([[1,1,1],[2,2,2],[3,3,3],[2,2,2],[1,1,1]])
whose plot looks like:
I also have an index array, such that each unique value in arr is associated with a unique index value, like:
ind = np.array([[1,1,1],[2,2,2],[3,3,3],[4,4,4],[5,5,5]])
What is the most pythonic way of converting arr from discrete values to continuous values, so that the array would look like this when plotted?:
therefore, interpolating between the discrete points to make continuous data
I found a solution to this if anyone has a similar issue. It is maybe not the most elegant so modifications are welcome:
def ref_linear_interp(x, y):
arr = []
ux=np.unique(x) #unique x values
for u in ux:
idx = y[x==u]
try:
min = y[x==u-1][0]
max = y[x==u][0]
except:
min = y[x==u][0]
max = y[x==u][0]
try:
min = y[x==u][0]
max = y[x==u+1][0]
except:
min = y[x==u][0]
max = y[x==u][0]
if min==max:
sub = np.full((len(idx)), min)
arr.append(sub)
else:
sub = np.linspace(min, max, len(idx))
arr.append(sub)
return np.concatenate(arr, axis=None).ravel()
y = np.array([[1,1,1],[2,2,2],[3,3,3],[2,2,2],[1,1,1]])
x = np.array([[1,1,1],[2,2,2],[3,3,3],[4,4,4],[5,5,5]])
z = np.arange(1, 16, 1)
Here is an answer for the symmetric solution that I would expect when reading the question:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
# create the data as described
numbers = [1,2,3,2,1]
nblock = 3
df = pd.DataFrame({
"x": np.arange(nblock*len(numbers)),
"y": np.repeat(numbers, nblock),
"label": np.repeat(np.arange(len(numbers)), nblock)
})
Expecting a constant block size of 3, we could use a rolling window:
df['y-smooth'] = df['y'].rolling(nblock, center=True).mean()
# fill NaNs
df['y-smooth'].bfill(inplace=True)
df['y-smooth'].ffill(inplace=True)
plt.plot(df['x'], df['y-smooth'], marker='*')
If the block size is allowed to vary, we could determine the block centers and interpolate piecewise.
centers = df[['x', 'y', 'label']].groupby('label').mean()
df['y-interp'] = np.interp(df['x'], centers['x'], centers['y'])
plt.plot(df['x'], df['y-interp'], marker='*')
Note: You may also try
centers = df[['x', 'y', 'label']].groupby('label').min() to select the left corner of the labelled blocks.

Best way to find modes of an array along the column

Suppose that I have an array
a = np.array([[1,2.5,3,4],[1, 2.5, 3,3]])
I want to find the mode of each column without using stats.mode().
The only way I can think of is the following:
result = np.zeros(a.shape[1])
for i in range(len(result)):
curr_col = a[:,i]
result[i] = curr_col[np.argmax(np.unique(curr_col, return_counts = True))]
update:
There is some error in the above code and the correct one should be:
values, counts = np.unique(a[:,i], return_counts = True)
result[i] = values[np.argmax(counts)]
I have to use the loop because np.unique does not output compatible result for each column and there is no way to use np.bincount because the dtype is not int.
If you look at the numpy.unique documentation, this function returns the values and the associated counts (because you specified return_counts=True). A slight modification of your code is necessary to give the correct result. What you are trying todo is to find the value associated to the highest count:
import numpy as np
a = np.array([[1,5,3,4],[1,5,3,3],[1,5,3,3]])
result = np.zeros(a.shape[1])
for i in range(len(result)):
values, counts = np.unique(a[:,i], return_counts = True)
result[i] = values[np.argmax(counts)]
print(result)
Output:
% python3 script.py
[1. 5. 3. 4.]
Here is a code tha compares your solution with the scipy.stats.mode function:
import numpy as np
import scipy.stats as sps
import time
a = np.random.randint(1,100,(100,100))
t_start = time.time()
result = np.zeros(a.shape[1])
for i in range(len(result)):
values, counts = np.unique(a[:,i], return_counts = True)
result[i] = values[np.argmax(counts)]
print('Timer 1: ', (time.time()-t_start), 's')
t_start = time.time()
result_2 = sps.mode(a, axis=0).mode
print('Timer 2: ', (time.time()-t_start), 's')
print('Matrices are equal!' if np.allclose(result, result_2) else 'Matrices differ!')
Output:
% python3 script.py
Timer 1: 0.002721071243286133 s
Timer 2: 0.003339052200317383 s
Matrices are equal!
I tried several values for parameters and your code is actually faster than scipy.stats.mode function so it is probably close to optimal.

How to concatenate two tensors with intervals in tensorflow?

I want to concatenate two tensors checkerboard-ly in tensorflow2, like examples showed below:
example 1:
a = [[1,1],[1,1]]
b = [[0,0],[0,0]]
concated_a_and_b = [[1,0,1,0],[0,1,0,1]]
example 2:
a = [[1,1,1],[1,1,1],[1,1,1]]
b = [[0,0,0],[0,0,0],[0,0,0]]
concated_a_and_b = [[1,0,1,0,1,0],[0,1,0,1,0,1],[1,0,1,0,1,0]]
Is there a decent way in tensorflow2 to concatenate them like this?
A bit of background for this:
I first split a tensor c with a checkerboard mask into two halves a and b. A after some transformation I have to concat them back into oringnal shape and order.
What I mean by checkerboard-ly:
Step 1: Generate a matrix with alternated values
You can do this by first concatenating into [1, 0] pairs, and then by applying a final reshape.
Step 2: Reverse some rows
I split the matrix into two parts, reverse the second part and then rebuild the full matrix by picking alternatively from the first and second part
Code sample:
import math
import numpy as np
import tensorflow as tf
a = tf.ones(shape=(3, 4))
b = tf.zeros(shape=(3, 4))
x = tf.expand_dims(a, axis=-1)
y = tf.expand_dims(b, axis=-1)
paired_ones_zeros = tf.concat([x, y], axis=-1)
alternated_values = tf.reshape(paired_ones_zeros, [-1, a.shape[1] + b.shape[1]])
num_samples = alternated_values.shape[0]
middle = math.ceil(num_samples / 2)
is_num_samples_odd = middle * 2 != num_samples
# Gather first part of the matrix, don't do anything to it
first_elements = tf.gather_nd(alternated_values, [[index] for index in range(middle)])
# Gather second part of the matrix and reverse its elements
second_elements = tf.reverse(tf.gather_nd(alternated_values, [[index] for index in range(middle, num_samples)]), axis=[1])
# Pick alternatively between first and second part of the matrix
indices = np.concatenate([[[index], [index + middle]] for index in range(middle)], axis=0)
if is_num_samples_odd:
indices = indices[:-1]
output = tf.gather_nd(
tf.concat([first_elements, second_elements], axis=0),
indices
)
print(output)
I know this is not a decent way as it will affect time and space complexity. But it solves the above problem
def concat(tf1, tf2):
result = []
for (index, (tf_item1, tf_item2)) in enumerate(zip(tf1, tf2)):
item = []
for (subitem1, subitem2) in zip(tf_item1, tf_item2):
if index % 2 == 0:
item.append(subitem1)
item.append(subitem2)
else:
item.append(subitem2)
item.append(subitem1)
concated_a_and_b.append(item)
return concated_a_and_b