Optimization in Python with PuLP

I am trying to get the optimal solution.

Drug data:

D_name   Vial_size1   Vial_size2   Vial_size3   cost   units_needed
Act      120          400          0            $5     738
dug      80           200          400          $40    262

Data in Excel:

Vials         price   size
Vial size 1   5       120
Vial size 2   5       400
from pulp import *
import pandas as pd

prob = LpProblem("Dose_Vial", LpMinimize)

df = pd.read_excel(r'C:\Users\*****\Desktop\Vial.xls')
print(df)

# Create a list of the Vial_Size
Vial_Size = list(df['Vials'])
# Create a dictionary of units for all Vial_Size
size = dict(zip(Vial_Size, df['size']))
# Create a dictionary of price for all Vial_Size
Price = dict(zip(Vial_Size, df['Price']))
# print dictionaries
print(Vial_Size)
print(size)
print(Price)

vial_vars = LpVariable.dicts("Vials", size, lowBound=0, cat='Integer')

# start building the LP problem by adding the main objective function
prob += lpSum([Price[i] * vial_vars[i] * size[i] for i in size])
# adding constraints
prob += lpSum([size[f] * vial_vars[f] for f in size]) >= 738

# The status of the solution is printed to the screen
prob.solve()
print("Status:", LpStatus[prob.status])

# In case the problem is ill-formulated or there is not sufficient information,
# the solution may be infeasible or unbounded
for v in prob.variables():
    if v.varValue > 0:
        print(v.name, "=", format(round(v.varValue)))

obj = round(value(prob.objective))
print("The total cost of optimized vials: ${}".format(obj))

This prints:

Vials_Vial_Size_1 = 3
Vials_Vial_Size_2 = 1
The total cost of optimized vials: $3800
How do I set this up for 2 or more drugs and get the optimal solution?

Here is an approach to solving the first part of the question, finding vial combinations that minimize waste (I'm not sure what role the price plays?):
from pulp import *
import pandas as pd
import csv

drugs_dict = {"D_name": ['Act', 'dug'],
              "Vial_size1": [120, 80],
              "Vial_size2": [400, 200],
              "Vial_size3": [0, 400],
              "cost": [5, 40],
              "units_needed": [738, 262]}
df = pd.DataFrame(drugs_dict)

drugs = list(df['D_name'])
vial_1_size = dict(zip(drugs, drugs_dict["Vial_size1"]))
vial_2_size = dict(zip(drugs, drugs_dict["Vial_size2"]))
vial_3_size = dict(zip(drugs, drugs_dict["Vial_size3"]))
units_needed = dict(zip(drugs, drugs_dict["units_needed"]))

results = []
for drug in drugs:
    print(f"drug = {drug}")
    # setup minimum waste problem
    prob = LpProblem("Minimum Waste Problem", LpMinimize)
    # create decision variables
    vial_1_var = LpVariable("Vial_1", lowBound=0, cat='Integer')
    vial_2_var = LpVariable("Vial_2", lowBound=0, cat='Integer')
    vial_3_var = LpVariable("Vial_3", lowBound=0, cat='Integer')
    units = lpSum([vial_1_size[drug] * vial_1_var +
                   vial_2_size[drug] * vial_2_var +
                   vial_3_size[drug] * vial_3_var])
    # objective function
    prob += units
    # constraints
    prob += units >= units_needed[drug]
    prob.solve()
    print(f"units = {units.value()}")
    for v in prob.variables():
        if v.varValue > 0:
            print(v.name, "=", v.varValue)
    results.append([drug, units.value(),
                    int(vial_1_var.value() or 0),
                    int(vial_2_var.value() or 0),
                    int(vial_3_var.value() or 0)])

with open('vial_results.csv', 'w', newline='') as csvfile:
    csv_writer = csv.writer(csvfile)
    csv_writer.writerow(['drug', 'units', 'vial_1', 'vial_2', 'vial_3'])
    csv_writer.writerows(results)
Running gives:
drug = Act
units = 760.0
Vial_1 = 3.0
Vial_2 = 1.0
drug = dug
units = 280.0
Vial_1 = 1.0
Vial_2 = 1.0
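For the second part of the question (2 or more drugs), one option is to solve a single problem over all drug/vial-size combinations and bring the price back into the objective. This is a hedged sketch, not the original poster's model: it assumes the cost column is the price of one vial of that drug regardless of vial size, which may not match the real pricing.

# Hedged sketch: one LP over all drugs, minimizing total cost, assuming 'cost'
# is the price per vial of each drug (an assumption, adjust to the real pricing).
from pulp import LpProblem, LpMinimize, LpVariable, lpSum, LpStatus, value
import pandas as pd

drugs_dict = {"D_name": ['Act', 'dug'],
              "Vial_size1": [120, 80],
              "Vial_size2": [400, 200],
              "Vial_size3": [0, 400],
              "cost": [5, 40],
              "units_needed": [738, 262]}
df = pd.DataFrame(drugs_dict).set_index('D_name')

sizes = ['Vial_size1', 'Vial_size2', 'Vial_size3']
prob = LpProblem("Multi_Drug_Min_Cost", LpMinimize)

# one integer variable per (drug, vial size) combination
vials = LpVariable.dicts("n", [(d, s) for d in df.index for s in sizes],
                         lowBound=0, cat='Integer')

# objective: total cost across all drugs
prob += lpSum(df.loc[d, 'cost'] * vials[(d, s)] for d in df.index for s in sizes)

# each drug must cover its required units
for d in df.index:
    prob += lpSum(df.loc[d, s] * vials[(d, s)] for s in sizes) >= df.loc[d, 'units_needed']

prob.solve()
print("Status:", LpStatus[prob.status])
for (d, s), var in vials.items():
    if var.varValue and var.varValue > 0:
        print(d, s, int(var.varValue))
print("Total cost:", value(prob.objective))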


Need help unpacking a nested array into a pandas dataframe

I am running some code that generates an array with the following shape: (18433, 17, 600 to 885 in length). I need to unpack that into a pandas DataFrame with 17 columns and rows containing data for 18433 entities that each have 600 to 885 time-series entries. The code that generates the array is shown below. I am a relative Python newbie and have reached the extent of my skills. I tried unpacking using a for loop, but it takes forever. Are there any libraries or methods that are more efficient?
# Generate full monthly cash flow arrays
# define constant input parameters
eloss = 0
weight = 1.0
prod_wt = 1.0
inv_wt = 1.0
stx_oil = 0.0795
stx_gas = 0.0795
stx_ngl = 0.0795
adval = 0
aban = 150000

# Create function for slicing the volume array and calculating the monthly cash flow
def econ_ncf_iter(r):
    econ_ncf_iter = econ_cf(index = r, uid = prop_list.loc[r, 'PROPNUM'], wi = prop_list.loc[r, 'WI'],
                            nri = prop_list.loc[r, 'NRI'], roy = prop_list.loc[r, 'Royalty'], eloss = eloss,
                            weight = weight, prod_wt = prod_wt, inv_wt = inv_wt,
                            shrink = np.round(prop_list.loc[r, 'SHRINK'] / 100, 6),
                            btu = np.round(prop_list.loc[r, 'BTU'] / 1000, 6),
                            ngl_yield = np.round(prop_list.loc[r, 'NGL/GAS'], 6),
                            pri_oil = np.extract(oilprice[r][0] == prop_list.loc[r, 'PROPNUM'], oilprice[r][1]),
                            pri_gas = np.extract(gasprice[r][0] == prop_list.loc[r, 'PROPNUM'], gasprice[r][1]),
                            paj_oil = prop_list.loc[r, 'PAJ_OIL'],
                            paj_gas = np.extract(gasdiff[r][0] == prop_list.loc[r, 'PROPNUM'], gasdiff[r][1]),
                            paj_ngl = prop_list.loc[r, 'PAJ_NGL'], stx_oil = stx_oil, stx_gas = stx_gas, stx_ngl = stx_ngl,
                            adval = adval, opc_fix = np.round(prop_list.loc[r, 'OPC/T'], 2),
                            opc_oil = np.round(prop_list.loc[r, 'OIL_OPEX'], 2),
                            opc_gas = np.round(prop_list.loc[r, 'GAS_OPEX'], 2),
                            capex = np.round(prop_list.loc[r, 'CAPITAL'] * 1000, 2), aban = aban)
    return econ_ncf_iter

# generate net cash flow array
econ_ncf = lambda r: econ_ncf_iter(r)
vecon_ncf = np.vectorize(econ_ncf_iter, otypes = [object])
ncf_arr_packed = vecon_ncf(R)
I figured it out and it was pretty easy
from tqdm import tqdm

ncf_pd_dflist = []
columns = ['UID', 'Month', 'Grs Oil', 'Grs Gas', 'Net Oil', 'Net Gas', 'Net NGL', 'Oil Revenue', 'Gas Revenue',
           'NGL Revenue', 'Total Revenue', 'Total Tax', 'OPEX', 'Operating Income', 'Cumulative Op CF', 'Net Cashflow',
           'Cumulative Net CF']
pbar = tqdm(total=len(R))
for r in R:
    ncf_pd_dflist.append(pd.DataFrame(np.transpose(ncf_arr_packed[r])))
    pbar.update()
ncf_pd = pd.concat(ncf_pd_dflist)
ncf_pd.columns = columns
pbar.close()
Simple code to loop through the array and create a list of pandas DataFrames. After the loop finishes, I concatenate the list into a single DataFrame. This took about 5 seconds to complete.
Although you already figured out a solution, here's a general alternative without an explicit loop. It takes some simple steps:
1. If the desired horizontal axis (your middle axis) is not the last, swap them.
2. Reshape to a 2D array of the horizontal rows.
3. Make the DataFrame with a MultiIndex from the cartesian product of the other axes.
Assuming the array is arr:
x, y, z = arr.shape
df = pd.DataFrame(arr.swapaxes(1, 2).reshape(x*z, -1),
pd.MultiIndex.from_product([np.arange(x), np.arange(z)]))
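To make the reshape concrete, here is a hedged sketch with a small made-up array. Note it assumes a regular last axis, so a ragged 600-885 axis would first need padding or truncation to a fixed length.

import numpy as np
import pandas as pd

# small stand-in array: 3 entities, 4 columns, 5 time steps (regular, not ragged)
arr = np.arange(3 * 4 * 5).reshape(3, 4, 5)

x, y, z = arr.shape
# swap the column axis to the end, flatten entities x time into rows,
# and label the rows with an (entity, time) MultiIndex
df = pd.DataFrame(arr.swapaxes(1, 2).reshape(x * z, -1),
                  index=pd.MultiIndex.from_product([np.arange(x), np.arange(z)],
                                                   names=['entity', 't']))
print(df.shape)   # (15, 4): 3 entities * 5 time steps, 4 columns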

DataFrame results change to zero after adding return

I am trying to pass buy_list in the code below to df. This is a small section of the code; when the full code is executed I get the back test results shown in the linked image.
initial results
replacement_stocks = portfolio_size - len(kept_positions)
buy_list = ranking_table.loc[
    ~ranking_table.index.isin(kept_positions)][:replacement_stocks]
new_portfolio = pd.concat(
    (buy_list,
     ranking_table.loc[ranking_table.index.isin(kept_positions)])
)
When I define df as below, I get a 'df not defined' error:
replacement_stocks = portfolio_size - len(kept_positions)
buy_list = ranking_table.loc[
    ~ranking_table.index.isin(kept_positions)][:replacement_stocks]
new_portfolio = pd.concat(
    (buy_list,
     ranking_table.loc[ranking_table.index.isin(kept_positions)])
)
df1 = buy_list  # create df1 from buy_list
df2 = ranking_table.loc[
    ~ranking_table.index.isin(kept_positions)][:replacement_stocks]  # create df2 the same way as buy_list
I tried the solution in the link below:
Similar error with suggested fix
Following it, I still get a 'df not defined' error, and the output of my back test changes to 0% in all the months that previously had actual negative or positive % changes.
replacement_stocks = portfolio_size - len(kept_positions)
buy_list = ranking_table.loc[
    ~ranking_table.index.isin(kept_positions)][:replacement_stocks]
new_portfolio = pd.concat(
    (buy_list,
     ranking_table.loc[ranking_table.index.isin(kept_positions)])
)
return buy_list
df2 = ranking_table.loc[
    ~ranking_table.index.isin(kept_positions)][:replacement_stocks]
print(df2)
This is what I now end up with
Error message
I'd appreciate any suggestions on how I can fix this.
Thanks,
Last1
Below is the full code as requested; it's from a book I am working through, Trading Evolved by Andreas Clenow.
Thanks again.
%matplotlib inline
import zipline
from zipline.api import order_target_percent, symbol, \
    set_commission, set_slippage, schedule_function, \
    date_rules, time_rules, attach_pipeline, pipeline_output
from pandas import Timestamp
import matplotlib.pyplot as plt
import pyfolio as pf
import pandas as pd
import numpy as np
from scipy import stats
from zipline.finance.commission import PerDollar
from zipline.finance.slippage import VolumeShareSlippage, FixedSlippage
from zipline_norgatedata.pipelines import NorgateDataIndexConstituent
from zipline.pipeline import Pipeline
"""
Model Settings
"""
intial_portfolio = 100000
momentum_window1 = 125
momentum_window2 = 125
minimum_momentum = 40
portfolio_size = 30
vola_window = 20
# Trend filter settings
enable_trend_filter = True
trend_filter_symbol = '$SPXTR'
trend_filter_window = 200
"""
Commission and Slippage Settings
"""
enable_commission = True
commission_pct = 0.001
enable_slippage = True
slippage_volume_limit = 0.025
slippage_impact = 0.05
"""
Helper functions.
"""
def momentum_score(ts):
    """
    Input: Price time series.
    Output: Annualized exponential regression slope,
    multiplied by the R2
    """
    # Make a list of consecutive numbers
    x = np.arange(len(ts))
    # Get logs
    log_ts = np.log(ts)
    # Calculate regression values
    slope, intercept, r_value, p_value, std_err = stats.linregress(x, log_ts)
    # Annualize percent
    annualized_slope = (np.power(np.exp(slope), 252) - 1) * 100
    # Adjust for fitness
    score = annualized_slope * (r_value ** 2)
    return score

def volatility(ts):
    return ts.pct_change().rolling(vola_window).std().iloc[-1]
"""
Initialization and trading logic
"""
def make_pipeline():
    indexconstituent = NorgateDataIndexConstituent('$SPX')
    return Pipeline(
        columns={
            'NorgateDataIndexConstituent': indexconstituent},
        screen=indexconstituent)

def initialize(context):
    attach_pipeline(make_pipeline(), 'norgatedata_pipeline', chunks=9999, eager=True)
    # Set commission and slippage.
    if enable_commission:
        comm_model = PerDollar(cost=commission_pct)
    else:
        comm_model = PerDollar(cost=0.0)
    set_commission(comm_model)
    if enable_slippage:
        slippage_model = VolumeShareSlippage(volume_limit=slippage_volume_limit, price_impact=slippage_impact)
        set_slippage(slippage_model)
    else:
        slippage_model = FixedSlippage(spread=0.0)
    # Used only for progress output.
    context.last_month = intial_portfolio
    # Store index membership
    #context.index_members = pd.read_csv('../data/index_members/sp500.csv', index_col=0, parse_dates=[0])
    # Schedule rebalance monthly.
    schedule_function(
        func=rebalance,
        date_rule=date_rules.month_start(),
        time_rule=time_rules.market_open()
    )
def output_progress(context):
    """
    Output some performance numbers during backtest run
    """
    # Get today's date
    today = zipline.api.get_datetime().date()
    # Calculate percent difference since last month
    perf_pct = (context.portfolio.portfolio_value / context.last_month) - 1
    # Print performance, format as percent with two decimals.
    print("{} - Last Month Result: {:.2%}".format(today, perf_pct))
    # Remember today's portfolio value for next month's calculation
    context.last_month = context.portfolio.portfolio_value
def rebalance(context, data):
    # Write some progress output during the backtest
    output_progress(context)
    context.pipeline_data = pipeline_output('norgatedata_pipeline')
    todays_universe = context.pipeline_data.index
    # Check how long history window we need.
    hist_window = max(momentum_window1,
                      momentum_window2)
    # Get historical data
    hist = data.history(todays_universe, "close", hist_window, "1d")
    # Slice the history to match the two chosen time frames.
    momentum_hist1 = hist[(-1 * momentum_window1):]
    momentum_hist2 = hist[(-1 * momentum_window2):]
    # Calculate momentum values for the two time frames.
    momentum_list1 = momentum_hist1.apply(momentum_score)
    momentum_list2 = momentum_hist2.apply(momentum_score)
    # Now let's put the two momentum values together, and calculate mean.
    momentum_concat = pd.concat((momentum_list1, momentum_list2))
    mom_by_row = momentum_concat.groupby(momentum_concat.index)
    mom_means = mom_by_row.mean()
    # Sort by momentum value.
    ranking_table = mom_means.sort_values(ascending=False)

    """
    Sell Logic
    First we check if any existing position should be sold.
    * Sell if stock is no longer part of index.
    * Sell if stock has too low momentum value.
    """
    kept_positions = list(context.portfolio.positions.keys())
    for security in context.portfolio.positions:
        if (security not in todays_universe):
            order_target_percent(security, 0.0)
            kept_positions.remove(security)
        elif ranking_table[security] < minimum_momentum:
            order_target_percent(security, 0.0)
            kept_positions.remove(security)

    """
    Trend Filter Section
    """
    if enable_trend_filter:
        ind_hist = data.history(
            symbol(trend_filter_symbol),
            'close',
            trend_filter_window,
            '1d'
        )
        trend_filter = ind_hist.iloc[-1] > ind_hist.mean()
        if trend_filter == False:
            return

    """
    Stock Selection Logic
    Check how many stocks we are keeping from last month.
    Fill from top of ranking list, until we reach the
    desired total number of portfolio holdings.
    """
    replacement_stocks = portfolio_size - len(kept_positions)
    buy_list = ranking_table.loc[
        ~ranking_table.index.isin(kept_positions)][:replacement_stocks]
    new_portfolio = pd.concat(
        (buy_list,
         ranking_table.loc[ranking_table.index.isin(kept_positions)])
    )

    """
    Calculate inverse volatility for stocks,
    and make target position weights.
    """
    vola_table = hist[new_portfolio.index].apply(volatility)
    inv_vola_table = 1 / vola_table
    sum_inv_vola = np.sum(inv_vola_table)
    vola_target_weights = inv_vola_table / sum_inv_vola
    for security, rank in new_portfolio.iteritems():
        weight = vola_target_weights[security]
        if security in kept_positions:
            order_target_percent(security, weight)
        else:
            if ranking_table[security] > minimum_momentum:
                order_target_percent(security, weight)
def analyze(context, perf):
    perf['max'] = perf.portfolio_value.cummax()
    perf['dd'] = (perf.portfolio_value / perf['max']) - 1
    maxdd = perf['dd'].min()
    ann_ret = (np.power((perf.portfolio_value.iloc[-1] / perf.portfolio_value.iloc[0]), (252 / len(perf)))) - 1
    print("Annualized Return: {:.2%} Max Drawdown: {:.2%}".format(ann_ret, maxdd))
    return
start_date = Timestamp('2015-01-01', tz='UTC')
end_date = Timestamp('2020-03-14', tz='UTC')

perf = zipline.run_algorithm(
    start=start_date, end=end_date,
    initialize=initialize,
    analyze=analyze,
    capital_base=intial_portfolio,
    data_frequency='daily',
    bundle='norgatedata-sp500')
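A hedged guess, not from the original thread: the added return exits rebalance before the order_target_percent calls run (which would explain the flat 0% months), and the df2 lines after it are never reached. Below is a hypothetical, self-contained illustration (plain Python, no zipline) of collecting an intermediate result such as buy_list on the context object instead of returning it:

# Hypothetical illustration: an early return inside a scheduled function skips
# everything after it, so the ordering code never runs; collecting results on a
# shared context-like object avoids that.
class Context:
    def __init__(self):
        self.buy_lists = []
        self.orders = []

def rebalance_sketch(context, ranking):
    buy_list = ranking[:2]              # stand-in for the real selection logic
    context.buy_lists.append(buy_list)  # keep it for inspection after the run
    # a `return buy_list` here would stop before any "orders" below are placed
    for name in buy_list:
        context.orders.append(name)

ctx = Context()
rebalance_sketch(ctx, ['AAA', 'BBB', 'CCC', 'DDD'])
print(ctx.buy_lists)   # [['AAA', 'BBB']]
print(ctx.orders)      # ['AAA', 'BBB']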

avoiding data leakage with timed data and cross validation

I'm using the Kobe Bryant Dataset.
I wish to predict the shot_made_flag with KnnRegressor.
I'm trying to avoid data leakage by grouping the data by season, year, and month.
season is a pre-existing column, and year and month are columns I've added like so:
kobe_data_encoded['year'] = kobe_data_encoded['game_date'].apply(lambda x: int(re.compile('(\d{4})').findall(x)[0]))
kobe_data_encoded['month'] = kobe_data_encoded['game_date'].apply(lambda x: int(re.compile('-(\d+)-').findall(x)[0]))
Here's the full pre-processing code for the features:
import re
# drop unnecessary columns
kobe_data_encoded = kobe_data.drop(columns=['game_event_id', 'game_id', 'lat', 'lon', 'team_id', 'team_name', 'matchup', 'shot_id'])
# use one-hot encoding for action_type, combined_shot_type, shot_zone_area, shot_zone_basic, opponent
kobe_data_encoded = pd.get_dummies(kobe_data_encoded, prefix_sep="_", columns=['action_type'])
kobe_data_encoded = pd.get_dummies(kobe_data_encoded, prefix_sep="_", columns=['combined_shot_type'])
kobe_data_encoded = pd.get_dummies(kobe_data_encoded, prefix_sep="_", columns=['shot_zone_area'])
kobe_data_encoded = pd.get_dummies(kobe_data_encoded, prefix_sep="_", columns=['shot_zone_basic'])
kobe_data_encoded = pd.get_dummies(kobe_data_encoded, prefix_sep="_", columns=['opponent'])
# convert season to years
kobe_data_encoded['season'] = kobe_data_encoded['season'].apply(lambda x: int(re.compile('(\d+)-').findall(x)[0]))
# convert shot_type to numeric representation
kobe_data_encoded['shot_type'] = kobe_data_encoded['shot_type'].apply(lambda x: int(re.compile('(\d)PT').findall(x)[0]))
# add year and month using game_date
kobe_data_encoded['year'] = kobe_data_encoded['game_date'].apply(lambda x: int(re.compile('(\d{4})').findall(x)[0]))
kobe_data_encoded['month'] = kobe_data_encoded['game_date'].apply(lambda x: int(re.compile('-(\d+)-').findall(x)[0]))
kobe_data_encoded = kobe_data_encoded.drop(columns=['game_date'])
# convert shot_zone_range to an ordinal numeric representation
kobe_data_encoded.loc[kobe_data_encoded['shot_zone_range'] == 'Back Court Shot', 'shot_zone_range'] = 4
kobe_data_encoded.loc[kobe_data_encoded['shot_zone_range'] == '24+ ft.', 'shot_zone_range'] = 3
kobe_data_encoded.loc[kobe_data_encoded['shot_zone_range'] == '16-24 ft.', 'shot_zone_range'] = 2
kobe_data_encoded.loc[kobe_data_encoded['shot_zone_range'] == '8-16 ft.', 'shot_zone_range'] = 1
kobe_data_encoded.loc[kobe_data_encoded['shot_zone_range'] == 'Less Than 8 ft.', 'shot_zone_range'] = 0
# transform game_date to date time object
# kobe_data_encoded['game_date'] = pd.to_numeric(kobe_data_encoded['game_date'].str.replace('-',''))
kobe_data_encoded.head()
Then I've scaled the data using MinMaxScaler:
# scaling
min_max_scaler = preprocessing.MinMaxScaler()
scaled_features_df = kobe_data_encoded.copy()
column_names = ['loc_x', 'loc_y', 'minutes_remaining', 'period',
'seconds_remaining', 'shot_distance', 'shot_type', 'shot_zone_range']
scaled_features = min_max_scaler.fit_transform(scaled_features_df[column_names])
scaled_features_df[column_names] = scaled_features
And grouped by season, year, and month as stated above:
seasons_date = scaled_features_df.groupby(['season', 'year', 'month'])
I've been tasked with using KFold to find the best K using roc_auc score.
Here's my implementation:
neighbors = [x for x in range(1, 50) if x % 2 != 0]
cv_scores = []
for k in neighbors:
    print('k: ', k)
    knn = KNeighborsClassifier(n_neighbors=k, n_jobs=-1)
    scores = []
    accumelated_X = pd.DataFrame()
    accumelated_y = pd.Series()
    for group_name, group in seasons_date:
        print(group_name)
        group = group.drop(columns=['season', 'year', 'month'])
        not_classified_df = group[group['shot_made_flag'].isnull()]
        classified_df = group[group['shot_made_flag'].notnull()]
        X = classified_df.drop(columns=['shot_made_flag'])
        y = classified_df['shot_made_flag']
        accumelated_X = pd.concat([accumelated_X, X])
        accumelated_y = pd.concat([accumelated_y, y])
        cv = StratifiedKFold(n_splits=10, shuffle=True)
        scores.append(cross_val_score(knn, accumelated_X, accumelated_y, cv=cv, scoring='roc_auc'))
    cv_scores.append(np.mean(scores))  # scores is a list of arrays, so average with np.mean
#graphical view
#misclassification error
MSE = [1-x for x in cv_scores]
#optimal K
optimal_k_index = MSE.index(min(MSE))
optimal_k = neighbors[optimal_k_index]
print(optimal_k)
# plot misclassification error vs k
plt.plot(neighbors, MSE)
plt.xlabel('Number of Neighbors K')
plt.ylabel('Misclassification Error')
plt.show()
I'm not sure if I'm dealing with the data leakage correctly in this situation.
If I accumulate the previous seasons' data and then pass it to cross_val_score, I might still end up with data leakage, since the CV can split the data so that newer-season data is trained on while earlier-season data is tested on. Am I right here?
If so, I would like to know how to approach this situation, where I want to use K-Fold to find the best k on this timed data without data leakage.
Is it even sensible to use K-Fold to split the data, rather than splitting by game date, to avoid data leakage?
In short: since what you want to do sounds like time series modelling, you cannot use standard k-fold cross-validation.
You would be using data from the future to predict the past, which is not allowed.
A good approach can be found here: https://stats.stackexchange.com/questions/14099/using-k-fold-cross-validation-for-time-series-model-selection
fold 1 : training [1], test [2]
fold 2 : training [1 2], test [3]
fold 3 : training [1 2 3], test [4]
fold 4 : training [1 2 3 4], test [5]
fold 5 : training [1 2 3 4 5], test [6]
where the numbers are the chronological order of your datetime.
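A hedged sketch of that expanding-window scheme using scikit-learn's TimeSeriesSplit; the X and y below are random stand-ins rather than the Kobe data, and the real data would need to be sorted chronologically first.

import numpy as np
from sklearn.model_selection import TimeSeriesSplit, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# stand-in data: rows are assumed to be in chronological order
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))
y = rng.integers(0, 2, size=600)

# each split trains on an expanding window of the past and tests on the next block
tscv = TimeSeriesSplit(n_splits=5)

for k in [3, 5, 7]:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=tscv, scoring='roc_auc')
    print(k, scores.mean())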

Pandas accumulate data for linear regression

I am trying to adjust my data so that total_gross per day is accumulated. E.g.

Created   total_gross   total_gross_accumulated
Day 1     100           100
Day 2     100           200
Day 3     100           300
Day 4     100           400

Any idea how I have to change my code to make total_gross_accumulated available?
Here is my data.
my code:
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import linear_model

def load_event_data():
    df = pd.read_csv('sample-data.csv', usecols=['created', 'total_gross'])
    df['created'] = pd.to_datetime(df.created)
    return df.set_index('created').resample('D').sum().fillna(0)

event_data = load_event_data()
X = event_data.index
y = event_data.total_gross
plt.xticks(rotation=90)
plt.plot(X, y)
plt.show()
List comprehension is the most pythonic way to do this.
SHORT answer:
This should give you the new column that you want:
n = event_data.shape[0]
# skip line 0 and start by accumulating from 1 until the end
total_gross_accumulated =[event_data['total_gross'][:i].sum() for i in range(1,n+1)]
# add the new variable in the initial pandas dataframe
event_data['total_gross_accumulated'] = total_gross_accumulated
OR faster
event_data['total_gross_accumulated'] = event_data['total_gross'].cumsum()
LONG answer:
Full code using your data:
import pandas as pd

def load_event_data():
    df = pd.read_csv('sample-data.csv', usecols=['created', 'total_gross'])
    df['created'] = pd.to_datetime(df.created)
    return df.set_index('created').resample('D').sum().fillna(0)

event_data = load_event_data()

n = event_data.shape[0]
# skip line 0 and start by accumulating from 1 until the end
total_gross_accumulated = [event_data['total_gross'][:i].sum() for i in range(1, n+1)]
# add the new variable in the initial pandas dataframe
event_data['total_gross_accumulated'] = total_gross_accumulated
Results:
event_data.head(6)
# total_gross total_gross_accumulated
#created
#2019-03-01 3481810 3481810
#2019-03-02 4690 3486500
#2019-03-03 0 3486500
#2019-03-04 0 3486500
#2019-03-05 0 3486500
#2019-03-06 0 3486500
X = event_data.index
y = event_data.total_gross_accumulated
plt.xticks(rotation=90)
plt.plot(X, y)
plt.show()
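Since linear_model is imported in the question, here is a hedged sketch (not part of the original answer) of fitting a line to the accumulated series, continuing from the event_data DataFrame above; encoding the daily index as day numbers 0, 1, 2, ... is an assumption about how the dates should be represented.

import numpy as np
from sklearn.linear_model import LinearRegression

# encode the daily DatetimeIndex as day numbers 0, 1, 2, ...
X_days = np.arange(len(event_data)).reshape(-1, 1)
y_acc = event_data['total_gross_accumulated'].values

model = LinearRegression()
model.fit(X_days, y_acc)
print("slope per day:", model.coef_[0], "intercept:", model.intercept_)

# plot the fit against the accumulated data
plt.plot(event_data.index, y_acc, label='accumulated')
plt.plot(event_data.index, model.predict(X_days), label='linear fit')
plt.legend()
plt.show()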

Defining a soft constraint in cvxpy

I am using cvxpy to do a simple portfolio optimization.
I implemented the following dummy code
from cvxpy import *
import numpy as np
np.random.seed(1)
n = 10
Sigma = np.random.randn(n, n)
Sigma = Sigma.T.dot(Sigma)
orig_weight = [0.15,0.25,0.15,0.05,0.20,0,0.1,0,0.1,0]
w = Variable(n)
mu = np.abs(np.random.randn(n, 1))
ret = mu.T*w
lambda_ = Parameter(sign='positive')
lambda_ = 5
risk = quad_form(w, Sigma)
constraints = [sum_entries(w) == 1, w >= 0, sum_entries(abs(w-orig_weight)) <= 0.750]
prob = Problem(Maximize(ret - lambda_ * risk), constraints)
prob.solve()
print 'Solver Status : ',prob.status
print('Weights opt :', w.value)
I am constraining the portfolio to be fully invested and long only, with a turnover of <= 75%. However, I would like to use turnover as a "soft" constraint, in the sense that the solver uses as little as possible but as much as necessary; currently the solver almost fully maxes out turnover.
I basically want something like this which is convex and doesn't violate the DCP rules
sum_entries(abs(w-orig_weight)) >= 0.05
I would assume this should set a minimum threshold (5% here) and then use as much turnover until it finds a feasible solution.
I tried rewriting my objective function to
prob = Problem(Maximize(lambda_ * ret - risk - penalty * max(sum_entries(abs(w-orig_weight))+0.9,0)) , constraints)
where penalty is e.g. 2 and my constraint object still looks like
constraints = [sum_entries(w) == 1, w >= 0, sum_entries(abs(w-orig_weight)) <= 0.9]
I have never used soft-constraints and any explanation would be highly appreciated.
EDIT: Intermediate solution
from cvxpy import *
import numpy as np
np.random.seed(1)
n = 10
Sigma = np.random.randn(n, n)
Sigma = Sigma.T.dot(Sigma)
w = Variable(n)
mu = np.abs(np.random.randn(n, 1))
ret = mu.T*w
risk = quad_form(w, Sigma)
orig_weight = [0.15,0.2,0.2,0.2,0.2,0.05,0.0,0.0,0.0,0.0]
min_weight = [0.35,0.0,0.0,0.0,0.0,0,0.0,0,0.0,0.0]
max_weight = [0.35,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]
lambda_ret = Parameter(sign='positive')
lambda_ret = 5
lambda_risk = Parameter(sign='positive')
lambda_risk = 1
penalty = Parameter(sign='positive')
penalty = 100
penalized = True
if penalized == True:
    print '-------------- RELAXED ------------------'
    constraints = [sum_entries(w) == 1, w >= 0, w >= min_weight, w <= max_weight]
    prob = Problem(Maximize(lambda_ * ret - lambda_ * risk - penalty * max_entries(sum_entries(abs(w-orig_weight)))-0.01), constraints)
else:
    print '-------------- HARD ------------------'
    constraints = [sum_entries(w) == 1, w >= 0, w >= min_weight, w <= max_weight, sum_entries(abs(w-orig_weight)) <= 0.40]
    prob = Problem(Maximize(lambda_ret * ret - lambda_risk * risk), constraints)

prob.solve()
print 'Solver Status : ', prob.status
print('Weights opt :', w.value)

all_in = []
for i in range(n):
    all_in.append(np.abs(w.value[i][0] - orig_weight[i]))
print 'Turnover : ', sum(all_in)
The above code forces a specific increase in weight for item[0], here +20%; to maintain the sum() == 1 constraint, that has to be offset by a -20% decrease elsewhere, so I know it needs a minimum of 40% turnover. If you run the code with penalized = False, the <= 0.40 has to be hard-coded, and anything smaller than that will fail. The penalized = True case finds the minimum required turnover of 40% and solves the optimization. What I haven't figured out yet is how to set a minimum threshold in the relaxed case, i.e. do at least 45% (or more if required).
I found some explanation of the problem here, in chapter 4.6, page 37:
Boyd paper
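For anyone on a current cvxpy version, here is a hedged sketch (not the original poster's code; it uses the modern cp.* API instead of sum_entries/max_entries) of a soft turnover constraint that only penalizes turnover above a chosen target; target_turnover and penalty are illustrative values, not taken from the post.

import cvxpy as cp
import numpy as np

np.random.seed(1)
n = 10
Sigma = np.random.randn(n, n)
Sigma = Sigma.T.dot(Sigma)
mu = np.abs(np.random.randn(n))
orig_weight = np.array([0.15, 0.25, 0.15, 0.05, 0.20, 0, 0.1, 0, 0.1, 0])

w = cp.Variable(n)
ret = mu @ w
risk = cp.quad_form(w, Sigma)
turnover = cp.sum(cp.abs(w - orig_weight))

target_turnover = 0.05   # assumed soft target for turnover
penalty = 100            # assumed weight on turnover above the target

# Soft constraint: only turnover in excess of the target enters the objective,
# so the solver uses as little extra turnover as necessary.
objective = cp.Maximize(ret - 5 * risk - penalty * cp.pos(turnover - target_turnover))
prob = cp.Problem(objective, [cp.sum(w) == 1, w >= 0])
prob.solve()

print("status   :", prob.status)
print("weights  :", np.round(w.value, 4))
print("turnover :", np.sum(np.abs(w.value - orig_weight)))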