Is there any way to convert a portfolio class from PortfolioAnalytics into a data frame?

I'm trying to find the optimal weights for a specific target return using the PortfolioAnalytics library and ROI optimization. However, even though I know that the target return should be feasible and should lie on the efficient frontier, the ROI optimization does not find any solution.
The code that I'm using is the following:
for (i in 0:n) {
  target <- minret + i * Del
  p <- portfolio.spec(assets = colnames(t_EROAS))  # specification of asset classes
  p <- add.constraint(p, type = "full_investment")  # weights must sum to 1
  p <- add.constraint(portfolio = p, type = "box", min = 0, max = 1)  # long-only, no short positions
  p <- add.constraint(p,
                      type = "group",
                      groups = group_list,
                      group_min = VCONSMIN[, 1],
                      group_max = VCONSMAX[, 1])
  p <- add.constraint(p, type = "return", name = "mean", return_target = target)
  p <- add.objective(p, type = "risk", name = "var")
  eff.opt <- optimize.portfolio(t_EROAS, p, optimize_method = "ROI", trace = TRUE)
}
n = 30, but the loop only finds 27 portfolios, and the efficient frontier I'm building looks empty from portfolio 27 to portfolio 30; portfolios 28 and 29 seem to have no solution, but I'm not sure this is correct.
What I want is an efficient frontier in data frame format with a fixed number of portfolios, and this method seems to be the only way to achieve it. Any help or ideas would be appreciated.
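One direction, as a minimal sketch (not verified against your data): wrap the optimization in try() so that infeasible targets don't break the loop, and use extractStats() from PortfolioAnalytics, which returns a named vector of objective measures and weights, to build the frontier row by row.
library(PortfolioAnalytics)

frontier_rows <- vector("list", n + 1)
for (i in 0:n) {
  target <- minret + i * Del
  # ... build the portfolio specification `p` exactly as above ...
  eff.opt <- try(optimize.portfolio(t_EROAS, p, optimize_method = "ROI", trace = TRUE),
                 silent = TRUE)
  if (!inherits(eff.opt, "try-error")) {
    # extractStats() returns a named vector (objective measures plus weights)
    frontier_rows[[i + 1]] <- as.data.frame(t(extractStats(eff.opt)))
  }
}
frontier_df <- do.call(rbind, frontier_rows)  # one row per feasible target
Alternatively, create.EfficientFrontier() with n.portfolios set builds the frontier directly; its $frontier slot is a matrix that as.data.frame() converts.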

Related

Pandas rolling window on an offset between 4 and 2 weeks in the past

I have a datafile with quality scores from different suppliers over a time range of 3 years. The end goal is to use machine learning to predict the quality label (good or bad) of a shipment based on supplier information.
I want to use the mean historic quality data over a specific period of time as an input feature in this model, using a pandas rolling window. The problem with this method is that pandas only allows you to create a window from t=0-x until t=0 for your rolling window, as presented below:
df['average_score t-2w'] = df['score'].rolling(window='14d',closed='left').mean()
And this is where the problem comes in. For my feature I want to use quality data from a period of 2 weeks, but not the 2 weeks immediately before the corresponding shipment: the window should start at t = -4 weeks and end at t = -2 weeks.
You would imagine that this could be solved by using the same string of code but changing the window as presented below:
df['average_score t-2w'] = df['score'].rolling(window='28d' - '14d',closed='left').mean()
This, or any other denotation of this specific window, does not seem to work.
It seems like pandas does not offer a solution to this problem, so we made a workaround with the following solution:
import numpy as np
import pandas as pd

def time_shift_week(df):
    def _avg_score_interval_func(series):
        current_time = series.index[-1]
        result = series[(series.index > (current_time - pd.Timedelta(value=4, unit='w')))
                        & (series.index < (current_time - pd.Timedelta(value=2, unit='w')))]
        return result.mean() if len(result) > 0 else 0.0

    temp_df = (df.groupby(by=["supplier", "timestamp"], as_index=False)
                 .aggregate({"score": np.mean})
                 .set_index('timestamp'))
    temp_df["w-42"] = (
        temp_df
        .groupby(["supplier"])["score"]  # the aggregated column is named "score"
        .apply(lambda x:
               x.rolling(window='30D', closed='both')
                .apply(_avg_score_interval_func))
    )
    return temp_df.reset_index()
This results in a new df in which we find the average score per supplier per timestamp, which we can subsequently merge with the original data frame to obtain the new feature.
Doing it this way seems really cumbersome and overly complicated for the task I am trying to perform. Even though we have found a workaround, I am wondering if there is an easier method of doing this.
Is anyone aware of a less complicated way of performing this rolling window feature extraction?
While pandas does not have the custom date offset you need, calculating the mean is pretty simple: it's just sum divided by count. You can subtract the 14-day rolling window from the 28-day rolling window:
import numpy as np
import pandas as pd

# Some sample data. All scores are sequential for easy verification
idx = pd.MultiIndex.from_product(
    [list("ABC"), pd.date_range("2020-01-01", "2022-12-31")],
    names=["supplier", "timestamp"],
)
df = pd.DataFrame({"score": np.arange(len(idx))}, index=idx).reset_index()

# Rolling average of score with the custom window.
# closed="left" means the current row is excluded from the window.
score = df.set_index("timestamp").groupby("supplier")["score"]
r28 = score.rolling("28d", closed="left")
r14 = score.rolling("14d", closed="left")

# The 28d window minus the 14d window leaves exactly the span
# from t-28d to t-14d, i.e. 4 weeks back to 2 weeks back.
avg_score = (r28.sum() - r14.sum()) / (r28.count() - r14.count())
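To attach the result as a feature, a sketch (the column name avg_score_4w_2w is illustrative): avg_score carries a (supplier, timestamp) MultiIndex, so it can be merged straight back onto the original frame.
# Merge the custom-window averages back onto df (feature name is hypothetical)
feature = avg_score.rename("avg_score_4w_2w").reset_index()
df = df.merge(feature, on=["supplier", "timestamp"], how="left")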

How to apply a value from one row to all rows with the same ID in R

I currently have a large data set that is in long format. I am hoping to apply a value only given at baseline (repeat_instance = 0) to all follow up instances (repeat_instance = 1, 2, 3+) based on the record_id.
While I cannot share the actual data, I have created a simplified example below to illustrate the question.
record_id <- c(1,1,1,2,3,4,4,5,6,7,8,8,9,10,10,10)
repeat_instance <- c(0,1,2,0,0,0,1,0,0,0,0,1,0,0,1,2)
reason_for_visit <- c(1,NA,NA,1,2,1,NA,1,2,3,1,NA,1,1,NA,NA)
Current format: each record_id has reason_for_visit recorded only at baseline (repeat_instance = 0), with NA for all follow-up instances.
Desired outcome: the baseline reason_for_visit carried down to every follow-up instance of the same record_id.
I have seen solutions in Excel, but I am not sure how to do this in R.
We can use fill from tidyr
library(tidyr)
fill(df1, reason_for_visit)
data
df1 <- data.frame(record_id, repeat_instance, reason_for_visit)
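If the rows are not guaranteed to be sorted with the baseline first, a grouped variant is safer (a sketch using dplyr; fill() respects dplyr groups, so values cannot leak across record_ids):
library(dplyr)
library(tidyr)

df1 %>%
  group_by(record_id) %>%
  fill(reason_for_visit) %>%
  ungroup()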

Pandas manipulation: matching data from other columns to one column, applied uniquely to all rows

I have a model that predicts 10 words for a particular course in order of likelihood, and I'd like to keep the first 5 of those words that appear in the course's description.
This is the format of the data:
course_name course_title course_description predicted_word_10 predicted_word_9 predicted_word_8 predicted_word_7 predicted_word_6 predicted_word_5 predicted_word_4 predicted_word_3 predicted_word_2 predicted_word_1
Xmath 32 Precalculus Polynomial and rational functions, exponential... directed scholars approach build african different visual cultures placed global
Xphilos 2 Morality Introduction to ethical and political philosop... make presentation weekly european ways general range questions liberal speakers
My idea is for each row to start iterating from predicted_word_1 until I get the first 5 that are in the description. I'd like to save those words, in the order they appear, into additional columns description_word_1 ... description_word_5. (If there are <5 predicted words in the description I plan to return NaN in the corresponding columns.)
To clarify with an example: if the course_description of a course is 'Polynomial and rational functions, exponential and logarithmic functions, trigonometry and trigonometric functions. Complex numbers, fundamental theorem of algebra, mathematical induction, binomial theorem, series, and sequences. ' and its first few predicted words are irrelevantword1, induction, exponential, logarithmic, irrelevantword2, polynomial, algebra...
I would want to return induction, exponential, logarithmic, polynomial, algebra for that in that order and do the same for the rest of the courses.
My attempt was to define an apply function that takes in a row and iterates from the first predicted word until it finds the first 5 that are in the description, but the part I am unable to figure out is how to create the additional columns with the correct words for each course. This code currently only keeps the words of one course for all the rows.
def find_top_description_words(row):
    print(row['course_title'])
    description_words_index = 1
    for i in range(num_words_per_course):
        description = row.loc['course_description']
        word_i = row.loc['predicted_word_' + str(i + 1)]
        if (word_i in description) & (description_words_index <= 5):
            print(description_words_index)
            row['description_word_' + str(description_words_index)] = word_i
            description_words_index += 1

df.apply(find_top_description_words, axis=1)
The end goal of this data manipulation is to keep the top 10 predicted words from the model and the top 5 predicted words in the description so the dataframe would look like:
course_name course_title course_description top_description_word_1 ... top_description_word_5 predicted_word_1 ... predicted_word_10
Any pointers would be appreciated. Thank you!
If I understand correctly:
Create a new DataFrame with just the 10 predicted words:
pred_words_lists = df.apply(lambda x: list(x[3:].dropna())[::-1], axis = 1)
Note that each row now contains a list of predicted words, in a convenient order: the first non-empty predicted word comes first, the second comes second, and so on.
Now let's create a new DataFrame:
pred_words_df = pd.DataFrame(pred_words_lists.tolist())
pred_words_df.columns = df.columns[:2:-1]
And the final DataFrame:
final_df = df[['course_name', 'course_title', 'course_description']].join(pred_words_df.iloc[:,0:11])
Hope this works.
EDIT
def common_elements(xx, yy):
    temp = pd.Series(range(0, len(xx)), index=xx)
    return list(temp.reindex(yy).sort_values()[0:10].dropna().index)

pred_words_lists = df.apply(
    lambda x: common_elements(x[2].replace(',', '').split(), list(x[3:].dropna())),
    axis=1)
Does it satisfy your requirements?
Adapted solution (OP):
def get_sorted_descriptions_words(course_description, predicted_words, k):
    description_words = course_description.replace(',', '').split()
    predicted_words_list = list(predicted_words)
    predicted_words = pd.Series(range(0, len(predicted_words_list)),
                                index=predicted_words_list)
    predicted_words = predicted_words[~predicted_words.index.duplicated()]
    ordered_description = predicted_words.reindex(description_words).dropna().sort_values()
    ordered_description_list = pd.Series(ordered_description.index).unique()[:k]
    return ordered_description_list

df.apply(lambda x: get_sorted_descriptions_words(x['course_description'],
                                                 x.filter(regex=r'predicted_word_.*'),
                                                 k),
         axis=1)
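To expand the returned lists into the top_description_word_1 ... top_description_word_5 columns from the question, one possible sketch (assuming k = 5 and the df and function defined above):
# Each row of desc_words is an array of up to k in-description words
desc_words = df.apply(lambda x: get_sorted_descriptions_words(
    x['course_description'], x.filter(regex=r'predicted_word_.*'), 5), axis=1)

# Rows shorter than 5 are padded with NaN when building the frame
desc_df = pd.DataFrame(desc_words.tolist(), index=df.index).reindex(columns=range(5))
desc_df.columns = ["top_description_word_" + str(i + 1) for i in range(5)]
df = df.join(desc_df)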

Data Selection - Finding relations between dataframe attributes

Let's say I have a dataframe of 80 columns plus 1 target column; for example, a bank account table with 80 attributes for each record (account) and 1 target column which indicates whether the client stays or leaves.
What steps and algorithms should I follow to select the most effective columns, i.e. those with the highest impact on the target column?
There are a number of steps you can take, I'll give some examples to get you started:
A correlation coefficient, such as Pearson's r (for parametric data) or Spearman's rho (for ordinal data).
Feature importances. I like XGBoost for this, as it includes the handy xgb.ggplot.importance / xgb.plot_importance methods.
One of the many feature selection options, such as Python's sklearn.feature_selection methods; see the sketch below.
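As a small illustration of the last option, a sketch with scikit-learn (the column name "target" is an assumption for your stay/leave label):
import pandas as pd
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# X: the 80 attribute columns, y: the stay/leave target ("target" is assumed)
X = df.drop(columns=["target"])
y = df["target"]

selector = SelectKBest(score_func=mutual_info_classif, k=10).fit(X, y)

# Rank the attributes by mutual information with the target
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
print(scores.head(10))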
This is one way to do it using the Pearson correlation coefficient in RStudio. I used it once when exploring the red_wine dataset; my target variable was quality, and I wanted to know the effect of the rest of the columns on it.
The code below produces a correlation plot in which blue represents positive correlations and red negative ones; the closer the value is to 1 or -1, the darker the color.
library(dplyr)
library(corrplot)

c <- cor(
  red_wine %>%
    # first we remove unwanted columns
    dplyr::select(-X) %>%
    dplyr::select(-rating) %>%
    mutate(
      # now we translate quality to a number
      quality = as.numeric(quality)
    )
)

corrplot(c, method = "color", type = "lower", addCoef.col = "gray",
         title = "Red Wine Variables Correlations", mar = c(0, 0, 1, 0),
         tl.cex = 0.7, tl.col = "black", number.cex = 0.9)

Customizing new trading strategy in R using quantmod

I want to create a new custom TA indicator for a stock symbol in R, but I have no idea how to convert my SQL-style conditional strategy into a self-defined R function and add it to the chartSeries.
The questions are spelled out in the comments of the following code.
library("quantmod")
library("FinancialInstrument")
library("PerformanceAnalytics")
library("TTR")
stock <- getSymbols("002457.SZ",auto.assign=FALSE,from="2012-11-26",to="2014-01-30")
head(stock)
chartSeries(stock, theme = "white", subset = "2013-07-01/2014-01-30",TA = "addSMA(n=5,col=\"gray\");addSMA(n=10,col=\"yellow\");
addSMA(n=20,col=\"pink\");addSMA(n=30,col=\"green\");addSMA(n=60,col=\"blue\");addVo()")
Question: How can I rewrite the code below to make it available as a function in R?
#Signal design
#Today's volume is the lowest during the last 20 trading days
lowvolume <- VOL<=LLV(VOL,20);
#Several moving average lines stick together
X1:=ABS(MA(C,10)/MA(C,20)-1)<0.01;
X2:=ABS(MA(C,5)/MA(C,10)-1)<0.01;
X3:=ABS(MA(C,5)/MA(C,20)-1)<0.01;
#If the following condition is satisfied, then the signal appears
MA(C,5)>REF(MA(C,5),1) AND X1 AND X2 AND X3 AND lowvolume;
#Convert the above SQL-style code into the following R custom function
VOLINE <- function(x) {
}

#Create a new TA function for the chartSeries and then add it
addVoline <- newTA(FUN = VOLINE,
                   preFUN = Cl,
                   col = c(rep(3, 6),
                           rep("#333333", 6)),
                   legend = "VOLINE")
I don't think you need SQL in this case.
Try this:
require(quantmod)
# fetch the data
s <- get(getSymbols('yhoo'))
# add the indicators
s$ma5 <- SMA(Cl(s) ,5)
s$ma10 <- SMA(Cl(s) ,10)
s$ma20 <- SMA(Cl(s) ,20)
s$llv <- rollapply(Vo(s), 20, min)
# generate the signal
s$signal <- (s$ma10 / s$ma20 - 1 < 0.01 & s$ma5 / s$ma10 - 1 < 0.01 & s$ma5 / s$ma20 - 1 < 0.01 & Vo(s) == s$llv)
# draw
chart_Series(s)
add_TA(s$signal == 1, on = 1, col='red')
I'm not sure what REF means, but I'm sure you can work that out yourself.
This is the output (I can't seem to upload the photo, but you would see a chart with horizontal lines where the signal equals 1).
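On REF: in that formula language REF(x, 1) appears to be the value of x one bar back, so the missing condition can be sketched with an xts lag (assuming the s object from above):
# REF(MA(C,5),1) = yesterday's 5-day moving average; lag.xts shifts by one bar
s$ma5_prev <- lag.xts(s$ma5, k = 1)
s$signal <- s$signal & (s$ma5 > s$ma5_prev)  # require a rising 5-day MA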
Use the function as a wrapper for sqldf() in the sqldf package. The argument to sqldf() will be a select statement on the data frame that has the data.
A good tutorial for this can be found at Burns Statistics.