MATLAB - past variances

I have a table that looks like this:
firm id  year  profit
1        2000  10
1        2001  20
1        2002  15
2        1999  40
2        2000  55
2        2001  35
2        2002  65
3        2001  5
3        2002  20
3        2003  10
And I want to estimate, for each row, the variance of that firm's profits over the previous years.

Assuming that your table is already sorted by "Firm ID" and by "Year", and that there's only one entry for each year, you can loop through each firm:
profitVar = [];
ids = unique(yourTable.firmId); % id of each firm
% Loop through the firms
for i = 1:length(ids)
    id = ids(i);
    subData = find(yourTable.firmId == id); % get the data from the given firm only
    % Loop through the years
    for j = 1:length(subData)
        profitVar = [profitVar; var(yourTable.profit(subData(1:j-1)))];
    end
end
yourTable = addvars(yourTable, profitVar);
Note that this returns NaN only for each firm's first year, and not for the second, because for the second year it computes the variance of a single previous observation, which is always zero. If this is a problem, you could insert an exception in the inner loop, something like if (j == 2) profitVar = [profitVar; NaN];.
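For instance, the inner loop with that exception could look like this (a sketch; the first year still gets NaN because var of an empty vector is NaN):
for j = 1:length(subData)
    if j == 2
        profitVar = [profitVar; NaN]; % variance of a single prior year is always 0
    else
        profitVar = [profitVar; var(yourTable.profit(subData(1:j-1)))];
    end
end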

Can I reference a prior row's value and populate it in the current row in a new column?

I have the following data frame:
      Month  Day  Year     Open      High      Low    Close  Week Close  Week
0         1    1  2003   46.593    46.656   46.405   46.468      45.593     1
1         1    2  2003   46.538    46.66    46.47    46.673      45.593     1
2         1    3  2003   46.717    46.781   46.53    46.750      45.593     1
3         1    4  2003   46.815    46.843   46.68    46.750      45.593     1
4         1    5  2003   46.935    47.000   46.56    46.593      45.593     1
...     ...  ...   ...      ...       ...      ...      ...         ...   ...
7257     10   26  2022  381.619  387.5799  381.350  382.019     389.019    43
7258     10   27  2022  383.07   385.00    379.329  379.98      389.019    43
7259     10   28  2022  379.869  389.519   379.67   389.019     389.019    43
7260     10   31  2022  386.44   388.399   385.26   386.209     385.24     44
7261     11    1  2022  390.14   390.39    383.29   384.519     385.24     44
I want to create a new column titled 'Prior_Week_Close' which will reference the prior week's 'Week Close' value (and the last week of the prior year for the first week of every year). For example, row 7260's value for Prior_Week_Close should equal 389.019.
I'm trying:
SPY['prior_week_close'] = np.where(SPY['Week'].shift(1) == (SPY['Week'] - 1), SPY['Week_Close'].shift(1), np.nan)
TypeError: boolean value of NA is ambiguous
I thought about just using shift and creating a new column but some weeks only have 4 days and that would lead to inaccurate values.
Any help is greatly appreciated!
I was able to solve this by creating a new column called 'Overall_Week' (the week number in the entire data set, not just the calendar year) and using the following code:
def fn(s):
    result = SPY[SPY.Overall_Week == (s.iloc[0] - 1)]['Week_Close']
    if result.shape[0] > 0:
        return np.broadcast_to(result.iloc[0], s.shape)
    else:
        return np.broadcast_to(np.NaN, s.shape)

SPY['Prior_Week_Close'] = SPY.groupby('Overall_Week')['Overall_Week'].transform(fn)
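A shorter alternative sketch, assuming the same SPY frame and column names as above (i.e. one Week_Close value per Overall_Week): build a one-row-per-week lookup of Week_Close, then map each row's previous week number onto it. Weeks with no predecessor (the very first week) come out as NaN automatically.
# Sketch: assumes SPY carries the Overall_Week and Week_Close columns used above.
week_close = SPY.drop_duplicates('Overall_Week').set_index('Overall_Week')['Week_Close']
SPY['Prior_Week_Close'] = (SPY['Overall_Week'] - 1).map(week_close)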

Postgres calculate average using distinct IDs, values also distinct

I have a Postgres query that is supposed to calculate an average value based on a set of values. This set of values should be based on DISTINCT IDs.
The query is the following:
#{context.answers_base}
SELECT
    stores.name as store_name,
    answers_base.question_name as question_name,
    answers_base.question_id as question_id,
    (sum(answers_base.answer_value) / NULLIF(count(answers_base.answer_id),0)) as score, -- <--- this line is calculating wrong
    sum(answers_base.answer_value) as score_sum,
    count(answers_base.answer_id) as question_answer_count,
    count(DISTINCT answers_base.answer_id) as answer_count
FROM answers_base
INNER JOIN stores ON stores.id = answers_base.store_id
WHERE answers_base.answer_value IS NOT NULL
    AND answers_base.question_type_id = :question_type_id
    AND answers_base.scale = TRUE
#{context.filter_answers}
GROUP BY stores.name, answers_base.question_name, answers_base.question_id, answers_base.sort_order
ORDER BY stores.name, answers_base.sort_order
The thing is that on the indicated line, (sum(answers_base.answer_value) / NULLIF(count(answers_base.answer_id),0)), some values are counted more than once.
Part of the solution is making it DISTINCT based on ID, like so:
(sum(answers_base.answer_value) / NULLIF(count(DISTINCT answers_base.answer_id),0))
This results in an average that divides by the right number, but the sum it is dividing is still wrong.
Making the sum() DISTINCT as well does not work, because the values are not unique. The values are 0 / 25 / 50 / 75 / 100, so different IDs might contain the 'same' value.
(sum(DISTINCT answers_base.answer_value) / NULLIF(count(DISTINCT answers_base.answer_id),0))
How would I go about making this work?
Here are simplified versions of the table structures.
Table Answer

ID  answer_date
1   Feb 01, 2022
2   Mar 02, 2022
3   Mar 13, 2022
4   Mar 21, 2022
Table AnswerRow

ID  answer_id  answer_value
1   1          25
2   1          50
3   1          50
4   2          75
5   2          100
6   2          0
7   3          25
8   4          25
9   4          100
10  4          50
Answer 1's answer_rows:
25 + 50 + 50 -> average = 125 / 3
Answer 2's answer_rows:
75 + 100 + 0 -> average = 175 / 3
Answer 3's answer_rows:
25 -> average = 25 / 1
Answer 4's answer_rows:
25 + 100 + 50 -> average = 175 / 3
For some reason, we get duplicate answer_rows in the calculation.
Example of the problem: for answer_id = 1 we have the following answer_rows in the calculation, giving us a different average:
ID  answer_id  answer_value
1   1          25
2   1          50
3   1          50
3   1          50
3   1          50
3   1          50
Result: 25 + 50 + 50 + 50 + 50 + 50 -> 275 / 6
Desired result: 25 + 50 + 50 -> 125 / 3
Making answer_row_id distinct (see beginning of post) makes it possible for me to get:
25 + 50 + 50 + **50 + 50 + 50** -> 275 / **3**
But not:
25 + 50 + 50 -> 125 / 3
What I would like to achieve is a calculation that selects answer_rows distinctly based on their ID, and uses those same answer_rows for both the numerator and the denominator of the average (x / y).
answers_base is the following (simplified):
WITH answers_base as (
    SELECT
        answers.id as answer_id,
        answers.store_id as store_id,
        answer_rows.id as answer_row_id,
        question_options.answer_value as answer_value
    FROM answers
    INNER JOIN answer_rows ON answers.id = answer_rows.answer_id
    INNER JOIN stores ON stores.id = answers.store_id
    WHERE answers.status = 0
)
I think this would be best solved with a window function. Something along the lines of
SELECT
ROW_NUMBER() OVER (PARTITION BY answer_rows.id ORDER BY answer_rows.created_at DESC) AS duplicate_answers
...
WHERE
answer_rows.duplicate_answers = 1
This would filter out multiple rows with the same id, and only keep one entry. (I chose the "first by created_at", but you could change this to whatever logic suits you best.)
A benefit to this approach is that it makes the rationale behind the logic clear, contained and re-usable.
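For concreteness, here is a sketch of how this could slot into the existing answers_base CTE (the created_at ordering column and the simplified answer_value source are assumptions; the ::numeric cast avoids integer division):
WITH answers_base AS (
    SELECT
        answers.id AS answer_id,
        answers.store_id AS store_id,
        answer_rows.id AS answer_row_id,
        answer_rows.answer_value AS answer_value, -- simplified; the real query reads this from question_options
        ROW_NUMBER() OVER (PARTITION BY answer_rows.id
                           ORDER BY answer_rows.created_at DESC) AS duplicate_answers
    FROM answers
    INNER JOIN answer_rows ON answers.id = answer_rows.answer_id
    INNER JOIN stores ON stores.id = answers.store_id
    WHERE answers.status = 0
)
SELECT
    answer_id,
    sum(answer_value)::numeric / NULLIF(count(answer_row_id), 0) AS score
FROM answers_base
WHERE duplicate_answers = 1
GROUP BY answer_id;
With the sample data this yields 125 / 3 for answer 1 again, because each answer_row ID now contributes exactly once to both the sum and the count.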

remove outliers by group in sql

In SQL Server, I must delete outliers in a column for each group separately. Here are my columns:
select
    customer,
    sku,
    stuff,
    action,
    acnumber,
    year
from mytable
Sample data:
id  customer  sku  year  stuff  action
--------------------------------------
1   1         2    2017  10     0
2   1         2    2017  20     1
3   1         3    2017  30     0
4   1         3    2017  40     1
5   2         4    2017  50     0
6   2         4    2017  60     1
7   2         5    2017  70     0
8   2         5    2017  80     1
9   1         2    2018  10     0
10  1         2    2018  20     1
11  1         3    2018  30     0
12  1         3    2018  40     1
13  2         4    2018  50     0
14  2         4    2018  60     1
15  2         5    2018  70     0
16  2         5    2018  80     1
I must delete outliers from the stuff variable, but separately for each customer+sku+year group.
Anything below the 25th percentile or above the 75th percentile should be considered an outlier, and this principle must be respected within each group.
How do I clean the dataset for the next step?
Note that this dataset also contains the variable action (it takes the values 0 and 1). It is not a grouping variable, but outliers must be deleted only for the ZERO (0) category of action.
In R this would be done as:
remove_outliers <- function(x, na.rm = TRUE, ...) {
    qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
    H <- 1.5 * IQR(x, na.rm = na.rm)
    y <- x
    y[x < (qnt[1] - H)] <- NA
    y[x > (qnt[2] + H)] <- NA
    y
}
new <- remove_outliers(vpg$stuff)
vpg <- cbind(new, vpg)
Something like this, maybe. Note that window functions such as PERCENT_RANK() cannot be used directly in a DELETE ... WHERE clause, so compute the rank in a CTE and delete through it (which SQL Server allows). The WHERE action = 0 filter restricts both the percentile calculation and the deletion to the action = 0 rows:
WITH ranked AS (
    SELECT *, PERCENT_RANK() OVER (PARTITION BY customer, sku, year ORDER BY stuff) AS pr
    FROM mytable
    WHERE action = 0
)
DELETE FROM ranked WHERE pr < 0.25 OR pr > 0.75;
Be aware this removes everything outside the 25th-75th percentile band, which is what the question asks for literally; the R function instead removes values beyond 1.5 * IQR from the quartiles, so adjust the thresholds to whichever definition you actually intend.

SAS Proc Optmodel Constraint Syntax

I have an optimization exercise I am trying to work through and am stuck again on the syntax. Below is my attempt, and I'd really like a thorough explanation of the syntax in addition to the solution code. I think it's the specific index piece that I am having trouble with.
The problem:
I have an item that I wish to sell out of within ten weeks. I have a historical trend and wish to alter that trend by lowering price. I want maximum margin dollars. The below works, but I wish to add two constraints and can't sort out the syntax. I have spaces for these two constraints in the code, with my brief explanation of what I think they may look like. Here is a more detailed explanation of what I need each constraint to do.
inv_cap=There is only so much inventory available at each location. I wish to sell it all. For location 1 it is 800, location 2 it is 1200. The sum of the column FRC_UNITS should equal this amount, but cannot exceed it.
price_down_or_same=The price cannot bounce around, so each week's price must be less than or equal to the previous week's. So, price(i)<=price(i-1) where i=week.
Here is my attempt. Thank you in advance for assistance.
*read in data;
data opt_test_mkdown_raw;
input
ITM_NBR
ITM_DES_TXT $
LCT_NBR
WEEK
LY_UNITS
ELAST
COST
PRICE
TOTAL_INV;
cards;
1 stuff 1 1 300 1.2 6 10 800
1 stuff 1 2 150 1.2 6 10 800
1 stuff 1 3 100 1.2 6 10 800
1 stuff 1 4 60 1.2 6 10 800
1 stuff 1 5 40 1.2 6 10 800
1 stuff 1 6 20 1.2 6 10 800
1 stuff 1 7 10 1.2 6 10 800
1 stuff 1 8 10 1.2 6 10 800
1 stuff 1 9 5 1.2 6 10 800
1 stuff 1 10 1 1.2 6 10 800
1 stuff 2 1 400 1.1 6 9 1200
1 stuff 2 2 200 1.1 6 9 1200
1 stuff 2 3 100 1.1 6 9 1200
1 stuff 2 4 100 1.1 6 9 1200
1 stuff 2 5 100 1.1 6 9 1200
1 stuff 2 6 50 1.1 6 9 1200
1 stuff 2 7 20 1.1 6 9 1200
1 stuff 2 8 20 1.1 6 9 1200
1 stuff 2 9 5 1.1 6 9 1200
1 stuff 2 10 3 1.1 6 9 1200
;
run;
data opt_test_mkdown_raw;
set opt_test_mkdown_raw;
ITM_LCT_WK=cats(ITM_NBR, LCT_NBR, WEEK);
ITM_LCT=cats(ITM_NBR, LCT_NBR);
run;
proc optmodel;
*set variables and inputs;
set<string> ITM_LCT_WK;
number ITM_NBR{ITM_LCT_WK};
string ITM_DES_TXT{ITM_LCT_WK};
string ITM_LCT{ITM_LCT_WK};
number LCT_NBR{ITM_LCT_WK};
number WEEK{ITM_LCT_WK};
number LY_UNITS{ITM_LCT_WK};
number ELAST{ITM_LCT_WK};
number COST{ITM_LCT_WK};
number PRICE{ITM_LCT_WK};
number TOTAL_INV{ITM_LCT_WK};
*read data into procedure;
read data opt_test_mkdown_raw into
ITM_LCT_WK=[ITM_LCT_WK]
ITM_NBR
ITM_DES_TXT
ITM_LCT
LCT_NBR
WEEK
LY_UNITS
ELAST
COST
PRICE
TOTAL_INV;
var NEW_PRICE{i in ITM_LCT_WK};
impvar FRC_UNITS{i in ITM_LCT_WK}=(1-(NEW_PRICE[i]-PRICE[i])*ELAST[i]/PRICE[i])*LY_UNITS[i];
con ceiling_price {i in ITM_LCT_WK}: NEW_PRICE[i]<=PRICE[i];
/*con inv_cap {j in ITM_LCT}: sum{i in ITM_LCT_WK}=I want this to be 800 for location 1 and 1200 for location 2;*/
con supply_last {i in ITM_LCT_WK}: FRC_UNITS[i]>=LY_UNITS[i];
/*con price_down_or_same {j in ITM_LCT} : NEW_PRICE[week]<=NEW_PRICE[week-1];*/
*state function to optimize;
max margin=sum{i in ITM_LCT_WK}
(NEW_PRICE[i]-COST[i])*(1-(NEW_PRICE[i]-PRICE[i])*ELAST[i]/PRICE[i])*LY_UNITS[i];
/*expand;*/
solve;
*write output dataset;
create data results_MKD_maxmargin
from
[ITM_LCT_WK]={ITM_LCT_WK}
ITM_NBR
ITM_DES_TXT
LCT_NBR
WEEK
LY_UNITS
FRC_UNITS
ELAST
COST
PRICE
NEW_PRICE
TOTAL_INV;
*write results to window;
print
/*NEW_PRICE */
margin;
quit;
The main difficulty is that in your application, decisions are indexed by (Item,Location) pairs and Weeks, but in your code you have merged (Item,Location,Week) triplets. I rather like that use of the data step, but the result in this example is that your code is unable to refer to specific weeks and to specific pairs.
The fix that changes your code the least is to add these relationships by using defined sets and inputs that OPTMODEL can compute for you. Then you will know which triplets refer to each combination of (Item,Location) pair and week:
/* This code creates a set version of the Item x Location pairs
that you already have as strings */
set ITM_LCTS = setof{ilw in ITM_LCT_WK} itm_lct[ilw];
/* For each Item x Location pair, define a set of which
Item x Location x Week entries refer to that Item x Location */
set ILWperIL{il in ITM_LCTS} = {ilw in ITM_LCT_WK: itm_lct[ilw] = il};
With this relationship you can add the other two constraints.
I left your code as is, but applied to the new code a convention I find useful, especially when there are similar names like itm_lct and ITM_LCTS:
sets are in all caps;
input parameters start with lowercase;
outputs (vars, impvars, and constraints) start with uppercase.
Here is the new OPTMODEL code:
proc optmodel;
*set variables and inputs;
set<string> ITM_LCT_WK;
number ITM_NBR{ITM_LCT_WK};
string ITM_DES_TXT{ITM_LCT_WK};
string ITM_LCT{ITM_LCT_WK};
number LCT_NBR{ITM_LCT_WK};
number WEEK{ITM_LCT_WK};
number LY_UNITS{ITM_LCT_WK};
number ELAST{ITM_LCT_WK};
number COST{ITM_LCT_WK};
number PRICE{ITM_LCT_WK};
number TOTAL_INV{ITM_LCT_WK};
*read data into procedure;
read data opt_test_mkdown_raw into
ITM_LCT_WK=[ITM_LCT_WK]
ITM_NBR
ITM_DES_TXT
ITM_LCT
LCT_NBR
WEEK
LY_UNITS
ELAST
COST
PRICE
TOTAL_INV;
var NEW_PRICE{i in ITM_LCT_WK} <= price[i];
impvar FRC_UNITS{i in ITM_LCT_WK} =
(1-(NEW_PRICE[i]-PRICE[i])*ELAST[i]/PRICE[i]) * LY_UNITS[i];
/* ceiling_price moved to the bound on NEW_PRICE above:
con ceiling_price {i in ITM_LCT_WK}: NEW_PRICE[i] <= PRICE[i]; */
con supply_last{i in ITM_LCT_WK}: FRC_UNITS[i] >= LY_UNITS[i];
/* This code creates a set version of the Item x Location pairs
that you already have as strings */
set ITM_LCTS = setof{ilw in ITM_LCT_WK} itm_lct[ilw];
/* For each Item x Location pair, define a set of which
Item x Location x Week entries refer to that Item x Location */
set ILWperIL{il in ITM_LCTS} = {ilw in ITM_LCT_WK: itm_lct[ilw] = il};
/* I assume that for each item and location
the inventory is the same for all weeks for convenience,
i.e., that is not a coincidence */
num inventory{il in ITM_LCTS} = max{ilw in ILWperIL[il]} total_inv[ilw];
con inv_cap {il in ITM_LCTS}:
sum{ilw in ILWperIL[il]} Frc_Units[ilw] = inventory[il];
num lastWeek = max{ilw in ITM_LCT_WK} week[ilw];
/* Concatenating indexes is not the prettiest, but gets the job done here*/
con Price_down_or_same {il in ITM_LCTS, w in 2 .. lastWeek}:
    New_Price[il || w] <= New_Price[il || w - 1];
*state function to optimize;
max margin=sum{i in ITM_LCT_WK}
(NEW_PRICE[i]-COST[i])*(1-(NEW_PRICE[i]-PRICE[i])*ELAST[i]/PRICE[i])*LY_UNITS[i];
expand;
solve;
*write output dataset;
create data results_MKD_maxmargin
from
[ITM_LCT_WK]={ITM_LCT_WK}
ITM_NBR
ITM_DES_TXT
LCT_NBR
WEEK
LY_UNITS
FRC_UNITS
ELAST
COST
PRICE
NEW_PRICE
TOTAL_INV;
*write results to window;
print
NEW_PRICE FRC_UNITS
margin
;
quit;

SQLite, sliding to get results based on value and date

This question is posted on a suggestion in this thread.
I'm using SQLite/Database browser and my data looks like this:
data.csv
company year value
A 2000 15
A 2001 12
A 2002 20
B 2000 25
B 2001 20
B 2002 10
C 2000 18
C 2001 14
C 2002 22
etc..............
What I want to do is get all companies which have a value of <= 20 for all years in the data set. Using the above data, the query should return:
result.csv
company year value
A 2000 15
A 2001 12
A 2002 20
Thus excluding company C due to value > 20 in 2002 and company B for value > 20 in 2000.
You want all companies whose maximum value is no larger than 20:
SELECT *
FROM Data
WHERE company IN (SELECT company
FROM Data
GROUP BY company
HAVING max(value) <= 20)
Not sure if there are better solutions, but I think this will work:
select company
     , sum(case when value <= 20 then 1 else 0 end) s
     , count(*) c
from data
where year in (2000, 2001, 2002)
group by company
having s = c
It checks, for each company, whether the number of rows where the value is at most 20 equals the total row count, keeping only companies where every year passes.
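If your SQLite is recent enough for window functions (3.25+), the same idea can be written in one pass; this is a sketch of the first answer's max-per-company logic as a window:
SELECT company, year, value
FROM (SELECT *, max(value) OVER (PARTITION BY company) AS max_value
      FROM Data)
WHERE max_value <= 20;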