Transform pandas DataFrame columns via interpolation

I am looking to apply a 1-D interpolation to a DataFrame and am not sure how to do this in an efficient way. Here goes:
In [8]: param
Out[8]:
alpha beta rho nu
0.021918 0.544953 0.5 -0.641566 6.549623
0.041096 0.449702 0.5 -0.062046 5.047923
0.060274 0.428459 0.5 -0.045312 3.625387
0.079452 0.424686 0.5 -0.049508 2.790139
0.156164 0.423139 0.5 -0.071106 1.846614
0.232877 0.414887 0.5 -0.040070 1.334070
0.328767 0.415757 0.5 -0.042071 1.109897
I would like the new index (I don't mind calling reset_index() if needed) to look like this:
np.array([0.02, 0.04, 0.06, 0.08, 0.1, 0.15, 0.2, 0.25])
So the corresponding values for alpha, beta, rho, nu need to be interpolated.
Came up with the following which only works for one column and only if x and y have the same dimensions:
x = np.array([0.02, 0.04, 0.06, 0.08, 0.1, 0.15, 0.2, 0.25])
y = np.array(param.alpha)
f = interp1d(x, y, kind='cubic', fill_value='extrapolate')
f(x)
Appreciate any pointer towards an efficient solution. Thanks.

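You could reindex on the union of the old and new index values, interpolate, and then select the new index values with loc. The bfill() fills 0.02, which lies before the first original index value and so is not interpolated:
new_idx = np.array([0.02, 0.04, 0.06, 0.08, 0.1, 0.15, 0.2, 0.25])
param.reindex(new_idx.tolist()+param.index.values.tolist())\
    .sort_index()\
    .interpolate(method='cubic')\
    .bfill()\
    .loc[new_idx]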
Output:
alpha beta rho nu
0.02 0.544953 0.5 -0.641566 6.549623
0.04 0.452518 0.5 -0.073585 5.138333
0.06 0.428552 0.5 -0.044739 3.641854
0.08 0.424630 0.5 -0.049244 2.772958
0.10 0.423439 0.5 -0.047119 2.294109
0.15 0.423326 0.5 -0.069473 1.873499
0.20 0.419130 0.5 -0.060861 1.573724
0.25 0.412985 0.5 -0.029573 1.221732
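As an alternative closer to the attempt in the question, here is a minimal sketch that passes the whole frame to interp1d with axis=0, so every column is interpolated in one call. The param frame is rebuilt from the values shown in the question; the name result is only illustrative. Because fill_value='extrapolate' is used, the row at 0.02 is extrapolated rather than back-filled, so it may differ from the output above.

```python
import numpy as np
import pandas as pd
from scipy.interpolate import interp1d

# Data copied from the question; the index holds the x values.
param = pd.DataFrame(
    {
        "alpha": [0.544953, 0.449702, 0.428459, 0.424686, 0.423139, 0.414887, 0.415757],
        "beta": [0.5, 0.5, 0.5, 0.5, 0.5, 0.5, 0.5],
        "rho": [-0.641566, -0.062046, -0.045312, -0.049508, -0.071106, -0.040070, -0.042071],
        "nu": [6.549623, 5.047923, 3.625387, 2.790139, 1.846614, 1.334070, 1.109897],
    },
    index=[0.021918, 0.041096, 0.060274, 0.079452, 0.156164, 0.232877, 0.328767],
)
new_idx = np.array([0.02, 0.04, 0.06, 0.08, 0.1, 0.15, 0.2, 0.25])

# x is the old index, y is the full 2-D value array; axis=0 interpolates
# along the rows, i.e. each column independently.
f = interp1d(
    param.index.to_numpy(),
    param.to_numpy(),
    axis=0,
    kind="cubic",
    fill_value="extrapolate",
)

# Evaluate at the new x values and wrap the result back into a DataFrame.
result = pd.DataFrame(f(new_idx), index=new_idx, columns=param.columns)
print(result)
```

Note that x here is the original index and the new index is only used when evaluating f, which removes the "same dimensions" restriction mentioned in the question.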

Sampling a random integer 'N' times according to predetermined probabilities, where the probability is different each time

Let's say I want to sample 0 with probability p0 = 0.5, 1 with probability p1 = 0.3, or a 2 with probability p2 = 0.2. This is fairly simple to do:
p0 = 0.5
p1 = 0.3
p2 = 0.2
idx = np.random.choice(3, p=[p0, p1, p2])
Now, let's say I want to repeat this process N times, each time using different probabilities. Something like:
N = 4
p0 = np.array([0.5, 0.6, 0.7, 0.8])
p1 = np.array([0.3, 0.2, 0.2, 0.1])
p2 = np.array([0.2, 0.2, 0.1, 0.1])
idx = np.empty(N)
for i in range(N):
    idx[i] = np.random.choice(3, p=[p0[i], p1[i], p2[i]])
However, this is obviously slow. Ideally I'd like to do this avoiding loops. Is there a simple solution to this problem?
One way is to generate a uniform random array of size N, compare it to the cumulative probabilities, then take the index of the first True value in each column:
cum_probs = np.cumsum([p0,p1,p2],axis=0)
idx = np.argmax(np.random.uniform(size=N) < cum_probs, axis=0)
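For reference, here is a self-contained sketch of the same idea. The function name sample_per_column and the use of np.random.default_rng are illustrative choices, not part of the answer above. The last cumulative row is forced to 1.0 as a guard: if round-off left it slightly below 1, a large draw would match no row and argmax would silently return 0.

```python
import numpy as np

def sample_per_column(probs, rng):
    """Draw one category per column of probs (shape: K categories x N draws)."""
    probs = np.asarray(probs, dtype=float)
    cum_probs = np.cumsum(probs, axis=0)   # running total down each column
    cum_probs[-1] = 1.0                    # guard against floating-point round-off
    u = rng.random(probs.shape[1])         # one uniform draw in [0, 1) per column
    # u < cum_probs broadcasts to K x N; argmax returns the first True row.
    return np.argmax(u < cum_probs, axis=0)

rng = np.random.default_rng(0)
probs = np.array([
    [0.5, 0.6, 0.7, 0.8],   # p0
    [0.3, 0.2, 0.2, 0.1],   # p1
    [0.2, 0.2, 0.1, 0.1],   # p2
])
idx = sample_per_column(probs, rng)
print(idx)
```

Each column of probs is assumed to sum to 1; the result is an integer array, unlike the np.empty(N) float array used in the loop version.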

changing the axis values in matplotlib plot

Following on from this accepted answer on plotting a time series, I want to change the y-axis values after plotting by dividing them by a floating-point number (0.2), so that the y-axis reads 0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, without changing the time series itself.
The code is
data = pd.DataFrame(np.loadtxt("data_3_timeseries"), columns=list('ABC'))
data['B'] = data['B'].apply(lambda x: x + 0.3)
data['C'] = data['C'].apply(lambda x: x + 0.6)
ax = data.plot()
for col, place, line in zip(list('ABC'), [5, 8, 10], ax.lines):
    ax.plot(place, data[col][place], marker="*", c=line.get_color())
plt.show()
data_3_timeseries
-0.00831 -0.0213 -0.0182
0.0105 -0.00767 -0.012
0.00326 0.0148 -0.00471
-0.0263 -0.00318 0.011
0.012 0.0398 0.0117
-0.0156 -0.0133 0.02
-0.0482 -0.00783 -0.0162
0.0103 -0.00639 0.0103
0.0158 0.000788 -0.00484
-0.000704 -0.0236 0.00579
0.00151 -0.0135 -0.0195
-0.0163 -0.00185 0.00722
0.0207 0.00998 -0.0387
-0.0246 -0.0274 -0.0108
0.0123 -0.0155 0.0137
-0.00963 0.0023 0.0305
-0.0147 0.0255 -0.00806
0.000488 -0.0187 5.29e-05
-0.0167 0.0105 -0.0204
0.00653 0.0176 -0.00643
0.0154 -0.0136 0.00415
-0.0147 -0.00339 0.0175
-0.0238 -0.00284 0.0204
-0.00629 0.0205 -0.017
0.00449 -0.0135 -0.0127
0.00843 -0.0167 0.00903
-0.00331 7.2e-05 -0.00281
-0.0043 0.0047 0.00681
-0.0356 0.0214 0.0158
-0.0104 -0.0165 0.0092
0.00599 -0.0128 -0.0202
0.015 -0.0272 0.0117
0.012 0.0258 -0.0154
-0.00509 -0.0194 0.00219
-0.00154 -0.00778 -0.00483
-0.00152 -0.0451 0.0187
0.0271 0.0186 -0.0133
-0.0146 -0.0416 0.0154
-0.024 0.00295 0.006
-0.00889 -0.00501 -0.028
-0.00555 0.0124 -0.00406
-0.0185 -0.0114 0.0224
0.0143 0.0204 -0.0193
-0.0168 -0.00608 0.00178
-0.0159 0.0189 0.0109
-0.0213 -0.007 -0.0323
0.0031 0.0207 -0.00333
-0.0202 -0.0157 -0.0105
0.0159 0.00216 -0.0262
0.0105 -0.00292 0.00447
0.0126 0.0163 -0.0141
0.01 0.00679 0.025
0.0237 -0.0142 -0.0149
0.00394 -0.0379 0.00905
-0.00803 0.0186 -0.0176
-0.013 0.0162 0.0208
-0.00197 0.0313 -0.00804
0.0218 -0.0249 0.000412
-0.0164 0.00681 -0.0109
-0.0162 -0.00795 -0.00279
-0.01 -0.00977 -0.0194
-0.00723 -0.0464 0.00453
-0.000533 0.02 -0.0193
0.00706 0.0391 0.0194
I've tried to be as detailed with the comments as I can, I hope this will be clear:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
COLUMNS = ['A', 'B', 'C'] # If you have more columns you can add here
MARKS = [('A', 5), ('B', 8), ('C', 10), ('A', 20), ('C', 25)] # You can add here more marks
# Here You can add/edit colors for the lines and the markers, and add new columns if exists
COLORS_DICT = {'A': {'Line': 'Purple', 'Marker': 'Blue'},
               'B': {'Line': 'Red', 'Marker': 'Green'},
               'C': {'Line': 'Brown', 'Marker': 'Orange'}, }
FACTOR = 6 # the factor
SPACER = 1 # This spacer together with the factor will have the y axes with 0.5 gaps
MARKER = '*' # star Marker, can be altered.
LINE_WIDTH = 0.5 # the width of the lines
COLORS = True # True for colors False for black
data = pd.DataFrame(np.loadtxt("data_3_timeseries"), columns=COLUMNS)
for i, col in enumerate(COLUMNS): # iterating through the columns
    data[col] = data[col].apply(lambda x: x * FACTOR + i * SPACER) # applying each column the factor and the spacer
ax = data.plot()
ax.get_legend().remove() # removing the columns' legend (If Colors is False there's no need for legend)
for col, line in zip(COLUMNS, ax.lines): # iterating through the column and lines
    if COLORS:
        line.set_color(COLORS_DICT[col]['Line'])
    else:
        line.set_color('Black')
    line.set_linewidth(LINE_WIDTH)
for col, mark in MARKS:
    ax.plot(mark, data[col][mark], marker=MARKER, c=COLORS_DICT[col]['Marker'] if COLORS else 'Black')
plt.show()
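If the goal is only to relabel the y-axis and leave the plotted data untouched, a tick formatter is another option. The sketch below is an alternative to the rescaling approach above, not a restatement of it. It uses synthetic data in place of data_3_timeseries, and SCALE and scaled_label are illustrative names. Ticks are placed every 0.1 data units and labelled with the value divided by 0.2, which gives 0.0, 0.5, 1.0 and so on.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.ticker import FuncFormatter, MultipleLocator

SCALE = 0.2  # divisor from the question

def scaled_label(y, pos):
    """Return the tick label for data value y, divided by SCALE."""
    return f"{y / SCALE:.1f}"

# Synthetic stand-in for the three series, offset by 0.0, 0.3 and 0.6.
rng = np.random.default_rng(0)
fig, ax = plt.subplots()
for offset in (0.0, 0.3, 0.6):
    ax.plot(rng.normal(0.0, 0.02, 64) + offset)

# Put a tick every 0.1 data units and relabel it; the lines are not modified.
ax.yaxis.set_major_locator(MultipleLocator(0.1))
ax.yaxis.set_major_formatter(FuncFormatter(scaled_label))
# For interactive use, drop the matplotlib.use("Agg") line and call plt.show().
```

The data stored in the lines keeps its original values, so markers added later with ax.plot still use the original coordinates.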

Gekko - infeasible solution to optimal scheduling, comparison w/ gurobi

I am somewhat familiar with Gurobi, but I am transitioning to Gekko since the latter appears to have some advantages. I am running into one issue though, which I will illustrate using my imaginary apple orchard. The 5-week harvest period (horizon: T=5) is upon us, and my (very meagre) produce will be:
[3.0, 7.0, 9.0, 5.0, 4.0]
Some apples I keep for myself [2.0, 4.0, 2.0, 4.0, 2.0], the remaining produce I will sell in the farmer's market at the following prices: [0.8, 0.9, 0.5, 1.2, 1.5]. I have storage space with room for 6 apples, so I can plan ahead and sell apples at the most optimal moments, hence maximizing my revenue. I try to determine the optimal schedule with the following model:
from gekko import GEKKO
import numpy as np
m = GEKKO()
m.time = np.linspace(0,4,5)
orchard = m.Param([3.0, 7.0, 9.0, 5.0, 4.0])
demand = m.Param([2.0, 4.0, 2.0, 4.0, 2.0])
price = m.Param([0.8, 0.9, 0.5, 1.2, 1.5])
### manipulated variables
# selling on the market
sell = m.MV(lb=0)
sell.DCOST = 0
sell.STATUS = 1
# saving apples
storage_out = m.MV(value=0, lb=0)
storage_out.DCOST = 0
storage_out.STATUS = 1
storage_in = m.MV(lb=0)
storage_in.DCOST = 0
storage_in.STATUS = 1
### storage space
storage = m.Var(lb=0, ub=6)
### constraints
# storage change
m.Equation(storage.dt() == storage_in - storage_out)
# balance equation
m.Equation(sell + storage_in + demand == storage_out + orchard)
# Objective: argmax sum(sell[t]*price[t]) for t in [0,4]
m.Maximize(sell*price)
m.options.IMODE=6
m.options.NODES=3
m.options.SOLVER=3
m.options.MAX_ITER=1000
m.solve()
For some reason this is infeasible (error code = 2). Interestingly, if I set demand[0] to 3.0 instead of 2.0 (i.e. equal to orchard[0]), the model does produce a successful solution.
Why is this the case?
Even the "successful" output values are a bit weird: the storage space is not used a single time, and storage_out is not properly constrained in the last timestep. Clearly, I am not formulating the constraints correctly. What should I do to get realistic results that are comparable to the Gurobi output (see code below)?
output = {'sell' : list(sell.VALUE),
's_out' : list(storage_out.VALUE),
's_in' : list(storage_in.VALUE),
'storage' : list(storage.VALUE)}
df_gekko = pd.DataFrame(output)
df_gekko.head()
> sell s_out s_in storage
0 0.0 0.000000 0.000000 0.0
1 3.0 0.719311 0.719311 0.0
2 7.0 0.859239 0.859239 0.0
3 1.0 1.095572 1.095572 0.0
4 26.0 24.124924 0.124923 0.0
Gurobi model solved with demand = [3.0, 4.0, 2.0, 4.0, 2.0]. Note that Gurobi also produces a solution with demand = [2.0, 4.0, 2.0, 4.0, 2.0]. This only has a trivial impact on the outcome: the number of apples sold at t=0 becomes 1.
import gurobipy as gp
T = 5
m = gp.Model()
### horizon (five weeks)
### supply, demand and price data
orchard = [3.0, 7.0, 9.0, 5.0, 4.0]
demand = [3.0, 4.0, 2.0, 4.0, 2.0]
price = [0.8, 0.9, 0.5, 1.2, 1.5]
### manipulated variables
# selling on the market
sell = m.addVars(T)
# saving apples
storage_out = m.addVars(T)
m.addConstr(storage_out[0] == 0)
storage_in = m.addVars(T)
# storage space
storage = m.addVars(T)
m.addConstrs((storage[t]<=6) for t in range(T))
m.addConstrs((storage[t]>=0) for t in range(T))
m.addConstr(storage[0] == 0)
# storage change
#m.addConstr(storage[0] == (0 - storage_out[0]*delta_t + storage_in[0]*delta_t))
m.addConstrs(storage[t] == (storage[t-1] - storage_out[t] + storage_in[t]) for t in range(1, T))
# balance equation
m.addConstrs(sell[t] + demand[t] + storage_in[t] == (storage_out[t] + orchard[t]) for t in range(T))
# Objective: argmax sum(a_sell[t]*a_price[t] - b_buy[t]*b_price[t])
obj = gp.quicksum((price[t]*sell[t]) for t in range(T))
m.setObjective(obj, gp.GRB.MAXIMIZE)
m.optimize()
output:
sell storage_out storage_in storage
0 0.0 0.0 0.0 0.0
1 3.0 0.0 0.0 0.0
2 1.0 0.0 6.0 6.0
3 1.0 0.0 0.0 6.0
4 8.0 6.0 0.0 0.0
You can get a successful solution with:
m.options.NODES=2
The issue is that it is solving the balance equation in between the primary node points with NODES=3. Your differential equation has a linear solution so NODES=2 should be sufficiently accurate.
Here are a few other ways to improve the solution:
Set a small penalty on moving inventory into or out of storage. Otherwise the solver can return arbitrarily large values with storage_in = storage_out, because moving apples in and out in the same step costs nothing.
I used m.Minimize(1e-6*storage_in) and m.Minimize(1e-6*storage_out).
Because the initial condition is typically fixed, I added zero values at the beginning of each parameter (and extended the time horizon by one point) to make sure that the first point is calculated.
I also switched to integer variables, on the assumption that apples are sold and stored in whole units. An integer solution requires the APOPT solver, selected with SOLVER=1.
Successful solution
---------------------------------------------------
Solver : APOPT (v1.0)
Solution time : 0.058899999999999994 sec
Objective : -17.299986
Successful solution
---------------------------------------------------
Sell
[0.0, 0.0, 4.0, 1.0, 1.0, 8.0]
Storage Out
[0.0, 0.0, 1.0, 0.0, 0.0, 6.0]
Storage In
[0.0, 1.0, 0.0, 6.0, 0.0, 0.0]
Storage
[0.0, 1.0, 0.0, 6.0, 6.0, 0.0]
Here is the modified script.
from gekko import GEKKO
import numpy as np
m = GEKKO(remote=False)
m.time = np.linspace(0,5,6)
orchard = m.Param([0.0, 3.0, 7.0, 9.0, 5.0, 4.0])
demand = m.Param([0.0, 2.0, 4.0, 2.0, 4.0, 2.0])
price = m.Param([0.0, 0.8, 0.9, 0.5, 1.2, 1.5])
### manipulated variables
# selling on the market
sell = m.MV(lb=0, integer=True)
sell.DCOST = 0
sell.STATUS = 1
# saving apples
storage_out = m.MV(value=0, lb=0, integer=True)
storage_out.DCOST = 0
storage_out.STATUS = 1
storage_in = m.MV(lb=0, integer=True)
storage_in.DCOST = 0
storage_in.STATUS = 1
### storage space
storage = m.Var(lb=0, ub=6, integer=True)
### constraints
# storage change
m.Equation(storage.dt() == storage_in - storage_out)
# balance equation
m.Equation(sell + storage_in + demand == storage_out + orchard)
# Objective: argmax sum(sell[t]*price[t]) for t in [0,4]
m.Maximize(sell*price)
m.Minimize(1e-6 * storage_in)
m.Minimize(1e-6 * storage_out)
m.options.IMODE=6
m.options.NODES=2
m.options.SOLVER=1
m.options.MAX_ITER=1000
m.solve()
print('Sell')
print(sell.value)
print('Storage Out')
print(storage_out.value)
print('Storage In')
print(storage_in.value)
print('Storage')
print(storage.value)
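As a sanity check that does not need Gekko installed, the solution reported above can be verified against the constraints with plain NumPy. This is only a sketch: the arrays are copied from the printed output, and the storage check assumes that with NODES=2 and a unit time step the differential equation reduces to storage[t] - storage[t-1] = storage_in[t] - storage_out[t].

```python
import numpy as np

# Parameters from the modified script (with the leading zero point).
orchard = np.array([0.0, 3.0, 7.0, 9.0, 5.0, 4.0])
demand = np.array([0.0, 2.0, 4.0, 2.0, 4.0, 2.0])
price = np.array([0.0, 0.8, 0.9, 0.5, 1.2, 1.5])

# Solution values copied from the solver output shown above.
sell = np.array([0.0, 0.0, 4.0, 1.0, 1.0, 8.0])
s_out = np.array([0.0, 0.0, 1.0, 0.0, 0.0, 6.0])
s_in = np.array([0.0, 1.0, 0.0, 6.0, 0.0, 0.0])
storage = np.array([0.0, 1.0, 0.0, 6.0, 6.0, 0.0])

# Balance equation: everything that comes in must go somewhere.
balance_ok = np.allclose(sell + s_in + demand, s_out + orchard)

# Storage change per step (assumed one-step difference, see lead-in).
storage_ok = np.allclose(np.diff(storage), (s_in - s_out)[1:])

# Capacity bounds on the storage variable.
within_capacity = bool(np.all((storage >= 0) & (storage <= 6)))

# Revenue, i.e. the maximized part of the objective.
revenue = float(np.dot(sell, price))
print(balance_ok, storage_ok, within_capacity, revenue)
```

The revenue comes out at about 17.3; the reported objective of -17.299986 differs from -17.3 by the two 1e-6 penalty terms on storage_in and storage_out.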

Set color limits for matplotlib colormap

I made a function to get the hex code given to a set of data as follows:
from matplotlib import cm, colors
def get_color(series_data, cmap='Reds'):
    color_map = cm.get_cmap(cmap, 20)
    f = lambda x: colors.rgb2hex(color_map(x/series_data.max())[:3])
    return series_data.apply(f)
The cm.get_cmap(cmap, 20) generates a matplotlib.colors.LinearSegmentedColormap object that is ranged from the minimum value of the input series_data to its maximum.
I cannot see how I could define the color limits for the data to be evaluated. For instance, what if I wanted to set constant color limits, with 0 as the minimum and 100 as the maximum? How could I do that within my function?
I tried replacing series_data.max() with 100 to control the upper limit, but I couldn't control the lower limit (cmin).
The argument passed to color_map needs to be scaled to the [0, 1] range; values outside that range are mapped to the end colors. For instance, to map the value lo to the first color and the value hi to the last color:
from matplotlib import cm, colors
import numpy as np
import pandas as pd
def get_color(series_data, cmap='Reds', lo=None, hi=None):
    # Default to the data range when no explicit limits are given
    if lo is None:
        lo = series_data.min()
    if hi is None:
        hi = series_data.max()
    if lo == hi:
        raise ValueError('Invalid range.')
    color_map = cm.get_cmap(cmap, 20)
    # Rescale x so that lo maps to 0 and hi maps to 1
    f = lambda x: colors.rgb2hex(color_map((x-lo)/(hi-lo))[:3])
    return series_data.apply(f)
s = pd.Series(np.linspace(0,3,16))
colz = get_color(s, lo=1, hi=2)
for x, c in zip(s, colz):
    print('{:.2f} {}'.format(x,c))
The sample output is
0.00 #fff5f0
0.20 #fff5f0
0.40 #fff5f0
0.60 #fff5f0
0.80 #fff5f0
1.00 #fff5f0
1.20 #fdc7b0
1.40 #fc8363
1.60 #ed392b
1.80 #af1117
2.00 #67000d
2.20 #67000d
2.40 #67000d
2.60 #67000d
2.80 #67000d
3.00 #67000d
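The same scaling can also be written with matplotlib's colors.Normalize, which makes the clipping explicit. The following is a sketch under a few assumptions: get_color_norm and its levels parameter are illustrative names, the fixed limits 0 and 100 come from the question, and plt.get_cmap is used instead of cm.get_cmap because the latter has been deprecated in recent Matplotlib releases (and, to my knowledge, removed in 3.9).

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from matplotlib import colors

def get_color_norm(series_data, cmap="Reds", lo=0.0, hi=100.0, levels=20):
    """Map each value to a hex color using fixed limits lo and hi."""
    # clip=True pins values below lo to 0 and values above hi to 1.
    norm = colors.Normalize(vmin=lo, vmax=hi, clip=True)
    color_map = plt.get_cmap(cmap, levels)
    # Normalize and look up all values in one vectorized call (N x 4 RGBA).
    rgba = color_map(norm(series_data.to_numpy(dtype=float)))
    return pd.Series([colors.to_hex(c) for c in rgba], index=series_data.index)

s = pd.Series([-50.0, 0.0, 50.0, 100.0, 150.0])
hex_colors = get_color_norm(s)
print(hex_colors)
```

With these limits, -50 gets the same color as 0 and 150 gets the same color as 100, which mirrors the repeated end colors in the sample output above.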

How to "bin" a numpy array using custom (non-linearly spaced) buckets?

How can I "bin" the array below in numpy?
import numpy as np
bins = np.array([-0.1 , -0.07, -0.02, 0. , 0.02, 0.07, 0.1 ])
array = np.array([-0.21950869, -0.02854823, 0.22329239, -0.28073936, -0.15926265,
                  -0.43688216, 0.03600587, -0.05101109, -0.24318651, -0.06727875])
That is, replace each of the values in array with the following:
-0.1 where `value` < -0.085
-0.07 where -0.085 <= `value` < -0.045
-0.02 where -0.045 <= `value` < -0.01
0.0 where -0.01 <= `value` < 0.01
0.02 where 0.01 <= `value` < 0.045
0.07 where 0.045 <= `value` < 0.085
0.1 where `value` >= 0.085
The expected output would be:
array = np.array([-0.1, -0.02, 0.1, -0.1, -0.1, -0.1, 0.02, -0.07, -0.1, -0.07])
I recognise that numpy has a digitize function however it returns the index of the bin not the bin itself. That is:
np.digitize(array, bins)
np.array([0, 2, 7, 0, 0, 0, 5, 2, 0, 2])
Get those mid-values by averaging across consecutive bin values in pairs. Then, use np.searchsorted or np.digitize to get the indices using the mid-values. Finally, index into bins for the output.
Mid-values :
mid_bins = (bins[1:] + bins[:-1])/2.0
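Indices with searchsorted or digitize (use one or the other; they differ only when a value falls exactly on a mid-value, where np.digitize, like np.searchsorted with side='right', matches the half-open intervals in the question):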
idx = np.searchsorted(mid_bins, array)
idx = np.digitize(array, mid_bins)
Output :
out = bins[idx]
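Putting the steps together, here is a runnable sketch using the data from the question. It uses np.digitize so that a value sitting exactly on a mid-value goes to the upper bin, as the intervals in the question specify. The name expected is only there to compare against the output the question asks for.

```python
import numpy as np

bins = np.array([-0.1, -0.07, -0.02, 0.0, 0.02, 0.07, 0.1])
array = np.array([-0.21950869, -0.02854823, 0.22329239, -0.28073936, -0.15926265,
                  -0.43688216, 0.03600587, -0.05101109, -0.24318651, -0.06727875])

# Boundaries between neighbouring bins are the midpoints of consecutive values.
mid_bins = (bins[1:] + bins[:-1]) / 2.0

# For each value, find which interval between midpoints it falls into
# (0 means below the first midpoint, len(mid_bins) means above the last).
idx = np.digitize(array, mid_bins)

# Replace every value by the bin value of its interval.
out = bins[idx]

expected = np.array([-0.1, -0.02, 0.1, -0.1, -0.1, -0.1, 0.02, -0.07, -0.1, -0.07])
print(out)
print(np.allclose(out, expected))
```

Since mid_bins has one element fewer than bins, the indices returned by np.digitize always lie between 0 and len(bins) - 1, so bins[idx] never goes out of range.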