Optimize calculations involving a Pandas series - pandas

I'm trying to do some calculations involving a pandas Series, as shown below. I first extracted t from a DataFrame column and then used a for loop with "if...else..." for the remaining calculation, because I found that max(f_min, nan) always returned f_min. The code below works, but it looks rather cumbersome. Is there a better way to do this? Thank you so much for your help!
import numpy as np
import pandas as pd

f_min = 0.1
t_min = 0.   # degrees C
t_max = 35.
t_opt = 21.
t = pd.Series([np.nan, np.nan, np.nan, 37., 31., 23.],
              index=['08/22/2011 07', '08/22/2011 08', '08/22/2011 09',
                     '08/22/2011 10', '08/22/2011 11', '08/22/2011 12'],
              name='T')
# t = df['T']  # note: df.T would return the transposed frame, not the column
a = (t - t_min) / (t_opt - t_min)
bt = (t_max - t_opt) / (t_opt - t_min)
b = ((t_max - t) / (t_max - t_opt)) ** bt
d = a * b
# Floor every value at f_min, leaving NaN entries as NaN
i = 0
for x in d:
    if pd.isna(x):
        d.iloc[i] = np.nan
    else:
        f_temp = max(f_min, x)
        d.iloc[i] = f_temp
    i = i + 1

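Use either of the following. Both leave NaN values untouched. Note that clip returns a new Series, so assign the result back:
d = d.clip(lower=f_min)
or
d.loc[d < f_min] = f_min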
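A minimal sketch of both options on made-up values (the three numbers below are illustrative, not taken from the question's data), showing that the missing entry survives in each case:

```python
import numpy as np
import pandas as pd

f_min = 0.1
# Made-up values: one missing, one below the floor, one above it
d = pd.Series([np.nan, 0.05, 0.64], name="T")

# Option 1: clip returns a new Series and passes NaN through
clipped = d.clip(lower=f_min)

# Option 2: boolean mask; NaN < f_min is False, so NaN rows are skipped
masked = d.copy()
masked.loc[masked < f_min] = f_min

print(clipped)
print(masked)
```

Either form replaces the explicit loop; clip is usually the more readable of the two.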

Related

Capping multiple columns

I found an interesting snippet (by vrana95) that caps multiple columns. However, the function also modifies the main "df" instead of affecting only "final_df". Does anyone know why?
def cap_data(df):
    for col in df.columns:
        print("capping the ",col)
        if (((df[col].dtype)=='float64') | ((df[col].dtype)=='int64')):
            percentiles = df[col].quantile([0.01,0.99]).values
            df[col][df[col] <= percentiles[0]] = percentiles[0]
            df[col][df[col] >= percentiles[1]] = percentiles[1]
        else:
            df[col]=df[col]
    return df
final_df=cap_data(df)
As I wanted to cap only a few columns, I changed the for loop of the original snippet. It works, but I would like to know why this function changes both dataframes.
cols = ['score_3', 'score_6', 'credit_limit', 'last_amount_borrowed', 'reported_income', 'income']
def cap_data(df):
    for col in cols:
        print("capping the column:",col)
        if (((df[col].dtype)=='float64') | ((df[col].dtype)=='int64')):
            percentiles = df[col].quantile([0.01,0.99]).values
            df[col][df[col] <= percentiles[0]] = percentiles[0]
            df[col][df[col] >= percentiles[1]] = percentiles[1]
        else:
            df[col]=df[col]
    return df
final_df=cap_data(df)
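The function receives the DataFrame object itself, not a copy. It assigns into that object and then returns it, so df and final_df end up being the same object. A possible variant that leaves the input untouched is sketched below. The cols, lower_q and upper_q parameters and the sample data are assumptions for illustration, and clip is swapped in for the chained assignment of the original snippet:

```python
import pandas as pd

def cap_data(df, cols, lower_q=0.01, upper_q=0.99):
    """Return a capped copy of df; the input frame is not modified."""
    capped = df.copy()
    for col in cols:
        if pd.api.types.is_numeric_dtype(capped[col]):
            low, high = capped[col].quantile([lower_q, upper_q])
            capped[col] = capped[col].clip(lower=low, upper=high)
    return capped

# Made-up data with one extreme value
df = pd.DataFrame({"income": [1.0, 2.0, 3.0, 4.0, 1000.0],
                   "name": list("abcde")})
final_df = cap_data(df, ["income"])

print(df["income"].max())        # original is unchanged
print(final_df["income"].max())  # capped at the 99th percentile
```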

Numpy : multivariate indexing?

I wonder, is it possible to index several dimensions at once, with some broadcasting? Example:
Suppose I have an array A, shaped (n,d), and an indexing array I with integer values between 0 and d-1. Set B = A[:,I].
If shape(I) == (k,), for whatever k, then B has shape (n,k) and B[x,y] = A[x,I[y]].
But if shape(I) == (k,p) for whatever (k,p), then I want B to be shaped (n,k,p) with B[x,y,z] = A[x,I[y,z]].
1° How can I get this behavior?
2° Does it have a drawback I did not see?
You can do it exactly as you described it:
import numpy as np
n = 100
d = 20
k = 10
p = 17
A = np.random.random((n, d))
I = np.random.randint(low=0, high=d, size=(k, p))
B = A[:, I]
print(B.shape) # (n, k, p)
# Testing if the new array B is constructed as expected
x = 3
y = 5
z = 7
print(B[x, y, z])
print(A[x, I[y, z]])
print(B[x, y, z] == A[x, I[y, z]])
It's hard to say whether this is a good implementation without context. But in general it is a good idea to use NumPy and vectorization if you have speed in mind.
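As a small cross-check, np.take with axis=1 should give the same array as A[:, I]. One thing to keep in mind on the drawback question is that integer-array ("fancy") indexing returns a copy rather than a view, so B costs n*k*p elements of memory and writing to B does not change A. The sizes below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 5))                 # shape (n, d)
I = rng.integers(0, 5, size=(2, 3))    # shape (k, p), values in [0, d)

B = A[:, I]                            # shape (n, k, p)
B_take = np.take(A, I, axis=1)         # equivalent spelling

a_before = A.copy()
B[0, 0, 0] = -1.0                      # B is a copy, so A is unaffected

print(B.shape)
print(np.array_equal(A, a_before))
```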

Add a value to a column from an .apply function

I am trying to assign a simple value to a column of my pandas DataFrame, but it always shows NaN, and I can't find the reason why.
Here is my code.
def get_extra_hours(value):
    return f'{value[12] -40: .2f}'
raw_data = pd.read_csv('testdata.csv')
unified = raw_data.groupby('Employee').sum()
unified['Hourly Rate'] = raw_data.groupby('Employee').first()['Hourly Rate']
unified['Extra Hours'] = raw_data.apply(get_extra_hours, axis=1)
print(unified.to_string())
The data in value[12] is a float; I just need to subtract 40 from value[12] and return the result with 2 decimals. It can be a float or a string.
I made it work. I still don't understand why it didn't work before, but here is how I did it:
def get_extra_hours(value):
    x = value['Total Hours'] - 40
    if x > 0:
        return x
URL = f'https://api.mytimestation.com/v0.1/reports/?api_key={API_KEY}&Report_StartDate={date}&id={CODE}&exportformat=csv'
raw_data = pd.read_csv('testdata.csv')
unified = raw_data.groupby('Employee').sum()
unified['Hourly Rate'] = raw_data.groupby('Employee').first()['Hourly Rate']
unified['Extra Hours'] = unified.apply(get_extra_hours, axis=1)
print(unified.to_string())
I changed the assignment to unified['Extra Hours'] = unified.apply(get_extra_hours, axis=1)
and also changed the function get_extra_hours().
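One likely explanation for the NaN values is index alignment. raw_data.apply(..., axis=1) returns a Series labelled with raw_data's row numbers, while unified is labelled by employee name, so no labels match when the column is assigned. The sketch below uses made-up data to reproduce this, with column names assumed from the code above:

```python
import pandas as pd

# Made-up stand-in for testdata.csv
raw_data = pd.DataFrame({
    "Employee": ["Ann", "Ann", "Bob"],
    "Total Hours": [30.0, 15.0, 38.0],
})
unified = raw_data.groupby("Employee").sum()

# Result is indexed 0, 1, 2 (raw_data's rows) ...
misaligned = raw_data.apply(lambda row: row["Total Hours"] - 40, axis=1)
# ... so nothing lines up with the 'Ann'/'Bob' index and every cell is NaN
unified["Extra Hours (misaligned)"] = misaligned

# Computing from unified itself keeps the labels aligned;
# clip(lower=0) turns negative differences into 0
unified["Extra Hours"] = (unified["Total Hours"] - 40).clip(lower=0).round(2)

print(unified.to_string())
```

The vectorized last line is an alternative to apply, not the author's method. It returns 0 instead of a missing value for employees under 40 hours.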

Scipy Optimize minimize returns the initial value

I am building machine learning models for a certain data set. Then, based on the constraints and bounds for the outputs and inputs, I am trying to find the input parameters that minimize the predicted output.
The problem which I am facing is that, when the model is a linear regression model or something like lasso, the minimization works perfectly fine.
However, when the model is "Decision Tree", it constantly returns the very initial value that is given to it. So basically, it does not enforce the constraints.
import numpy as np
import pandas as pd
from scipy.optimize import minimize
I am using the very first sample from the input data set for the optimization. As it is only one sample, I need to reshape it to (1,-1) as well.
x = df_in.iloc[0,:]
x = np.array(x)
x = x.reshape(1,-1)
This is my Objective function:
def objective(x):
    x = np.array(x)
    x = x.reshape(1,-1)
    y = 0
    for n in range(df_out.shape[1]):
        y = Model[n].predict(x)
    Y = y[0]
    return Y
Here I am defining the bounds of inputs:
range_max = pd.DataFrame(range_max)
range_min = pd.DataFrame(range_min)
B_max=[]
B_min =[]
for i in range(range_max.shape[0]):
    b_max = range_max.iloc[i]
    b_min = range_min.iloc[i]
    B_max.append(b_max)
    B_min.append(b_min)
B_max = pd.DataFrame(B_max)
B_min = pd.DataFrame(B_min)
bnds = pd.concat([B_min, B_max], axis=1)
These are my constraints:
con_min = pd.DataFrame(c_min)
con_max = pd.DataFrame(c_max)
Here I am defining the constraint function:
def const(x):
    x = np.array(x)
    x = x.reshape(1,-1)
    Y = []
    for n in range(df_out.shape[1]):
        y = Model[n].predict(x)[0]
        Y.append(y)
    Y = pd.DataFrame(Y)
    a4 = []
    for k in range(Y.shape[0]):
        a1 = Y.iloc[k,0] - con_min.iloc[k,0]
        a2 = con_max.iloc[k, 0] - Y.iloc[k,0]
        a3 = [a2,a1]
        a4 = np.concatenate([a4, a3])
    return a4
c = const(x)
con = {'type': 'ineq', 'fun': const}
This is where I try to minimize. I do not pick a method, as the automatically picked method has worked so far.
sol = minimize(fun = objective, x0=x,constraints=con, bounds=bnds)
So the actual constraints are:
c_min = [0.20,1000]
c_max = [0.3,1600]
and the max and min range for the boundaries are:
range_max = [285,200,8,85,0.04,1.6,10,3.5,20,-5]
range_min = [215,170,-1,60,0,1,6,2.5,16,-18]
I think you should check the output of sol. Sometimes the algorithm cannot complete its line search, and in that case the optimizer returns the initial parameters unchanged. To check for this, look at the message attached to the result (sol.message) and at the sol.success flag. There can be various reasons for this behavior, so inspect sol and act accordingly.
Arad,
If you have not yet resolved your issue, try using scipy.optimize.differential_evolution instead of scipy.optimize.minimize. I ran into similar issues, particularly with decision trees, because their step-like predictions give gradients that are zero almost everywhere (and undefined at the steps), so a gradient-based method sees no direction to move in.
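The behavior can be reproduced without any model. In the sketch below, step_objective is a hypothetical piecewise-constant stand-in for a tree's prediction. It looks flat to a finite-difference gradient, so minimize stops at the starting point, while the gradient-free differential_evolution searches the whole bounded interval:

```python
import numpy as np
from scipy.optimize import differential_evolution, minimize

def step_objective(x):
    # Hypothetical stand-in for a decision tree: constant between integers
    return float(np.floor(x[0]) ** 2)

bounds = [(-5.0, 5.0)]
x0 = np.array([3.5])

# Gradient-based default: the estimated gradient is zero, so it stops at x0
sol = minimize(step_objective, x0=x0, bounds=bounds)
print(sol.success, sol.message)   # inspect why the solver stopped
print(sol.x, sol.fun)

# Population-based search does not rely on gradients
de = differential_evolution(step_objective, bounds)
print(de.x, de.fun)
```

For the constrained problem in the question, differential_evolution also accepts a constraints argument, but the constraints would need to be expressed as scipy.optimize.NonlinearConstraint objects rather than the dictionary form used with minimize.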

Pandas add column with formula using value of other column

I have an existing df. I want to extend it with a column RSI.
RSI is calculated using a function rsi_func(close), which returns a number. I've tried the approaches from the official pandas docs (see attempts 2 and 3), a Stack Overflow answer (see attempt 7), and many other examples, but I can't get it to work.
I've tried, without the numbering of course:
1) df['RSI'] = rsi_func(df['close'])
2) df.assign(RSI=lambda x: rsi_func(close))
3a) rsi = rsi_func(df['close'])
3b) print(rsi)
3c) df.assign(RSI=rsi)
4) df.assign(RSI=rsi_func(df['close']))
5) df.assign(RSI=lambda x: rsi_func(close))
6) df['RSI'] = df.apply(lambda x: rsi_func(x['close']))
7) df['RSI'] = df['close'].apply(rsi_func)
When I try 3a+b+c then a python list with RSI values is printed. But 3c doesn't append RSI to df. How can I create RSI with the return of rsi_func(close) and append it to df?
You can use map with the lambda expression
df['RSI'] = df['close'].map(lambda x: rsi_func(x))
Test using a basic dataframe:
import pandas as pd
df = pd.DataFrame({'close': [20.02, 20.04, 20.05]})
def rsi_func(close):
    return close / 10
df['RSI'] = df['close'].map(lambda x: rsi_func(x))
df
Out[11]:
   close    RSI
0  20.02  2.002
1  20.04  2.004
2  20.05  2.005
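As for attempt 3c, df.assign does not modify df in place; it returns a new DataFrame, so the result has to be assigned back. The sketch below uses a hypothetical placeholder rsi_func that returns one value per row as a list, which is what the question describes for 3a and 3b (the real RSI calculation is not shown in the question):

```python
import pandas as pd

def rsi_func(close):
    # Hypothetical placeholder: one value per row, returned as a plain list
    return [c / 10 for c in close]

df = pd.DataFrame({"close": [20.02, 20.04, 20.05]})

# assign returns a new frame; df itself is left as it was
df.assign(RSI=rsi_func(df["close"]))
has_rsi_before = "RSI" in df.columns

# Keeping the returned frame is what actually adds the column
df = df.assign(RSI=rsi_func(df["close"]))
print(df)
```

If rsi_func needs the whole price history rather than a single value, this form (or attempt 1) may fit better than map, which calls the function once per element.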