Compute weighted average by group in SQLite

I have quarterly data on portfolio holdings, let's call the table holdings,
portfolio date security dollar_amount
p1 03/31/2001 security1 50
p1 03/31/2001 security2 100
p2 03/31/2001 security1 25
p2 03/31/2001 security2 50
p1 06/30/2001 security1 50
p1 06/30/2001 security2 100
p1 06/30/2001 security3 50
p2 06/30/2001 security1 25
p2 06/30/2001 security3 50
and data on monthly returns for each security, let's call it returns
security date return
security1 03/31/2001 1
security2 03/31/2001 -1
security3 03/31/2001 2
security1 04/30/2001 3
security2 04/30/2001 -1
security3 04/30/2001 2
security1 05/31/2001 1
security2 05/31/2001 2
security3 05/31/2001 -1
security1 06/30/2001 2
security2 06/30/2001 -1
security3 06/30/2001 3
security1 07/31/2001 2
security2 07/31/2001 -3
security3 07/31/2001 1
security1 08/30/2001 2
security2 08/30/2001 -3
security3 08/30/2001 2
For each portfolio, here p1 and p2, I want to compute monthly weighted average returns: SUM(dollar_amount * return) / SUM(dollar_amount). However, I want to take into account that there are quarterly changes in holdings; that is, the weights should adjust every quarter.
Desired output:
portfolio date return
p1 03/31/2001 1/3*1 + 2/3*(-1) = -1/3
p2 03/31/2001 1/3*1 + 2/3*(-1) = -1/3
p1 04/30/2001 1/3*3 + 2/3*(-1) = 1/3
p2 04/30/2001 1/3*3 + 2/3*(-1) = 1/3
p1 05/31/2001 1/3*1 + 2/3*2 = 5/3
p2 05/31/2001 1/3*1 + 2/3*2 = 5/3
-- rebalancing, i.e. adjusting the weights according to holding data --
p1 06/30/2001 1/4*2 + 1/2*(-1) + 1/4*3 = 3/4
p2 06/30/2001 1/3*2 + 2/3*3 = 8/3
p1 07/31/2001 1/4*2 + 1/2*(-3) + 1/4*1 = -3/4
p2 07/31/2001 1/3*2 + 2/3*1 = 4/3
p1 08/30/2001 1/4*2 + 1/2*(-3) + 1/4*2 = -1/2
p2 08/30/2001 1/3*2 + 2/3*2 = 2
My final query will have to work with 53 quarters of holdings data and thus 159 months. The number of unique portfolios and securities are up to 13,000.
My question is whether there is a meaningful way to do it in a single SQLite query. If not, what do you think is the best way to do it?
The problems for me are
joining only the relevant (monthly) returns data for each quarter, e.g. returns from 03/31/2001, 04/30/2001, 05/31/2001 for the portfolio weights from 03/31/2001; otherwise the data would explode.
that the weighted average returns have to be computed per group, where a group is defined by quarter and portfolio.
The only way I can think of is to query the weighted average return per date and portfolio separately, which means looping through all of those combinations. I am aware that this is a computationally costly job, but I am looking for the fastest solution here.
Thanks for your help! I am using Python, sqlalchemy, sqlite3.
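For reference, a minimal setup sketch of the two tables (column types are my assumption; dates stored as ISO 'yyyy-mm-dd' strings so they compare and sort correctly):
CREATE TABLE holdings (portfolio TEXT, date TEXT, security TEXT, dollar_amount REAL);
CREATE TABLE returns (security TEXT, date TEXT, return REAL);
-- sample rows, e.g.
INSERT INTO holdings VALUES ('p1', '2001-03-31', 'security1', 50);
INSERT INTO returns VALUES ('security1', '2001-03-31', 1);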

This is my take on it, but SQLite doesn't have much support for working with dates, so it feels a little inefficient. I could not get it to work with your date format, so I had to change to 'yyyy-mm-dd', but maybe that is installation specific.
SELECT portfolio, r.date,
       1.0 * SUM(dollar_amount * return) / SUM(dollar_amount) AS return
FROM returns r
JOIN holdings h ON h.security = r.security
   -- match each return to the holdings quarter it falls in; this date window
   -- also handles quarters that cross a year boundary (e.g. 12/31 holdings)
   AND r.date >= h.date
   AND r.date < date(h.date, 'start of month', '+3 months')
GROUP BY portfolio, r.date
ORDER BY r.date
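Since the number of unique portfolios and securities goes up to 13,000, indexes on the join keys should keep this from degenerating into repeated full scans (index names are made up):
CREATE INDEX idx_holdings_security_date ON holdings(security, date);
CREATE INDEX idx_returns_security_date ON returns(security, date);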

Related

Postgres calculate average using distinct IDs, values also distinct

I have a postgres query that is supposed to calculate an average value based on a set of values. This set of values should be based on DISTINCT ID's.
The query is the following:
#{context.answers_base}
SELECT
stores.name as store_name,
answers_base.question_name as question_name,
answers_base.question_id as question_id,
(sum(answers_base.answer_value) / NULLIF(count(answers_base.answer_id),0)) as score, # <--- this line is calculating wrong
sum(answers_base.answer_value) as score_sum,
count(answers_base.answer_id) as question_answer_count,
count(DISTINCT answers_base.answer_id) as answer_count
FROM answers_base
INNER JOIN stores ON stores.id = answers_base.store_id
WHERE answers_base.answer_value IS NOT NULL AND answers_base.question_type_id = :question_type_id
AND answers_base.scale = TRUE
#{context.filter_answers}
GROUP BY stores.name, answers_base.question_name, answers_base.question_id, answers_base.sort_order
ORDER BY stores.name, answers_base.sort_order
The thing is that, on the indicated line (sum(answers_base.answer_value) / NULLIF(count(answers_base.answer_id),0)), some values are counted more than once.
Part of the solution is making it DISTINCT based on ID, like so:
(sum(answers_base.answer_value) / NULLIF(count(DISTINCT answers_base.answer_id),0))
This results in an average that is divided by the right number, but the sum it divides is still wrong.
Making sum() DISTINCT as well, as below, does not work, because the values are not unique. The values are either 0 / 25 / 50 / 75 / 100, so different IDs might contain the 'same' values.
(sum(DISTINCT answers_base.answer_value) / NULLIF(count(DISTINCT answers_base.answer_id),0))
How would I go about making this work?
Here are simplified versions of the table structures.
Table Answer
ID   answer_date
1    Feb 01, 2022
2    Mar 02, 2022
3    Mar 13, 2022
4    Mar 21, 2022
Table AnswerRow
ID   answer_id   answer_value
1    1           25
2    1           50
3    1           50
4    2           75
5    2           100
6    2           0
7    3           25
8    4           25
9    4           100
10   4           50
Answer 1's answer_rows:
25 + 50 + 50 -> average = 125 / 3
Answer 2's answer_rows:
75 + 100 + 0 -> average = 175 / 3
Answer 3's answer_rows:
25 -> average = 25 / 1
Answer 4's answer_rows:
25 + 100 + 50 -> average = 175 / 3
For some reason, we get duplicate answer_rows in the calculation.
Example of the problem; for answer_id=1 we have the following answer_rows in the calculation, giving us a different average:
ID   answer_id   answer_value
1    1           25
2    1           50
3    1           50
3    1           50
3    1           50
3    1           50
Result: 25 + 50 + 50 + 50 + 50 + 50 -> 275 / 6
Desired result: 25 + 50 + 50 -> 125 / 3
Making answer_row_id distinct (see beginning of post) makes it possible for me to get:
25 + 50 + 50 + **50 + 50 + 50** -> 275 / **3**
But not
25 + 50 + 50 -> 125 / 3
What I would like to achieve is a calculation that selects answer_rows distinctly by ID, and uses those same rows both for the numerator x and the denominator y in the average x / y.
answers_base is the following (simplified):
WITH answers_base as (
SELECT
answers.id as answer_id,
answers.store_id as store_id,
answer_rows.id as answer_row_id,
question_options.answer_value as answer_value
FROM answers
INNER JOIN answer_rows ON answers.id = answer_rows.answer_id
INNER JOIN stores ON stores.id = answers.store_id
WHERE answers.status = 0
)
I think this would be best solved with a window function. Something along the lines of
SELECT *
FROM (
    SELECT
        answer_rows.*,
        ROW_NUMBER() OVER (PARTITION BY answer_rows.id
                           ORDER BY answer_rows.created_at DESC) AS duplicate_answers
    FROM answer_rows
) deduped
WHERE duplicate_answers = 1
This would filter out multiple rows with the same id and keep only one entry per id. (The window alias has to be wrapped in a subquery, since a window function's result cannot be referenced in the same query's WHERE clause. I chose the most recent row by created_at, but you could change this to whatever logic suits you best.)
A benefit to this approach is that it makes the rationale behind the logic clear, contained and re-usable.
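Folded into your answers_base CTE, that could look roughly like this. This is only a sketch: I read answer_value straight from answer_rows (as in your simplified tables, leaving out the question_options join), created_at is assumed to exist on answer_rows, and the final grouping is per answer just to mirror your worked examples:
WITH answers_base AS (
    SELECT
        answers.id AS answer_id,
        answers.store_id AS store_id,
        answer_rows.id AS answer_row_id,
        answer_rows.answer_value AS answer_value,
        ROW_NUMBER() OVER (PARTITION BY answer_rows.id
                           ORDER BY answer_rows.created_at DESC) AS duplicate_answers
    FROM answers
    INNER JOIN answer_rows ON answers.id = answer_rows.answer_id
    INNER JOIN stores ON stores.id = answers.store_id
    WHERE answers.status = 0
)
SELECT
    answer_id,
    -- cast so Postgres does not truncate to integer division
    SUM(answer_value)::numeric / NULLIF(COUNT(answer_id), 0) AS score
FROM answers_base
WHERE duplicate_answers = 1
GROUP BY answer_id;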

Average difference between values SQL

I'm trying to find the difference between values using SQL, where each value is always larger than the previous one.
Example Data:
Car_ID | Trip_ID | Mileage
1 1 10,000
1 2 11,000
1 3 11,500
2 1 5,000
2 2 7,000
2 3 8,000
Expected Calculation:
Car_ID: 1
(Trip 2 - Trip 1) = 1,000
(Trip 3 - Trip 2) = 500
Average Difference: 750
Car_ID: 2
(Trip 2 - Trip 1) = 2,000
(Trip 3 - Trip 2) = 1,000
Average Difference: 1,500
Expected Output:
Car_ID | Average_Difference
1 750
2 1,500
You can use aggregation:
select car_id,
(max(mileage) - min(mileage)) / nullif(count(*) - 1, 0)
from t
group by car_id;
That is, the average as you have defined it is the maximum minus the minimum, divided by one less than the number of trips: the consecutive differences telescope, e.g. (11,000 - 10,000) + (11,500 - 11,000) = 11,500 - 10,000.
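Note the shortcut relies on mileage increasing with trip number. If you also want the individual trip-to-trip differences, a window-function version computes them explicitly (t is the same placeholder table name; requires lag() support):
select car_id, avg(diff) as average_difference
from (
    select car_id,
           mileage - lag(mileage) over (partition by car_id order by trip_id) as diff
    from t
) d
where diff is not null
group by car_id;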

Graphically represent SQL Data

Given a table with the following structure holding 11M+ transactions.
ID ProductKey CloseDate Part PartAge Sales
1 XXXXP1 5/10/15 P1 13 100
2 XXXXP2 6/1/16 P1 0 15
3 XXXXP3 4/1/08 P1 0 280
4 XXXXP1 3/18/11 P1 0 10
5 XXXXP3 6/29/15 P1 45 15
6 XXXXP1 8/11/13 P1 30 360
Products XXXXP1 and XXXXP3 appear more than once since they are resales. PartAge = 0 indicates it is a new sale. So these products went from:
New Sale --> ReSale --> ReSale
Using a self-joining query, I can retrieve all the products which were resales. But is there a way to display these in a pretty graph or tree format?
Something which depicts the life-span of the sale transaction of the product?
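For reference, the self-join I mean is roughly this (sales is a stand-in for the real table name):
select o.ProductKey,
       o.CloseDate as first_sale,
       r.CloseDate as resale_date
from sales o
join sales r
  on r.ProductKey = o.ProductKey
 and o.PartAge = 0    -- the original (new) sale
 and r.PartAge > 0    -- any subsequent resale
order by o.ProductKey, r.CloseDate;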
Any ideas will be appreciated.
TIA,
B

Proc Optmodel SAS Maintaining Defined Separation within Group

I am relatively new to proc optmodel and have been struggling with syntax/structure. I was able to get help once before and am stuck again.
Here is my dataset:
data have;
input NAME $ TEAM $ LEAD GRADE XXX MIN MAX YYY RATE;
cards;
HAL A 1 1 50 45 55 100 1.1
SAL A 0 2 55 0 9999 200 1
KIM A 0 3 70 0 9999 50 1.4
JIM B 1 2 100 90 110 300 .95
GIO B 0 3 120 0 9999 50 1
CAL B 0 4 130 0 9999 20 .9
TOM C 1 1 2 1 5 20 .7
SUE C 0 3 5 0 9999 10 .5
VAL D 1 7 20 15 25 100 .6
WHO D 0 4 10 0 9999 10 .9
;
run;
Here are the specifics:
1. Only the "team lead" has any meaningful constraints.
2. However, the other members of the team will be adjusted accordingly. The value of XXX will be ten percent lower or higher per unit of difference in grade from the team lead. So, if HAL's NEW_XXX is 50 (stays the same), then SAL's will be 10% higher than HAL's (2 is 1 unit greater than 1), which is 55. KIM's NEW_XXX is 60, since this is twenty percent higher than HAL's (3 is 2 units greater than 1). Similarly, WHO's NEW_XXX will be 30% lower than VAL's.
Does that make sense?
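In formula terms, for every non-lead member of a team:
NEW_XXX[member] = (1 + 0.1 * (GRADE[member] - GRADE[lead])) * NEW_XXX[lead]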
Below is what I have so far, which is the skeleton from a similar project.
proc optmodel;
*set variables and inputs;
set<string>NAME;
string TEAM{NAME};
number LEAD{NAME};
number GRADE{NAME};
number XXX{NAME};
number MIN{NAME};
number MAX{NAME};
number YYY{NAME};
number RATE{NAME};
set TEAMS = setof{i in NAME} TEAM[i];
set NAMEperTEAM{gi in TEAMS} = {i in NAME: TEAM[i] = gi};
var NEW_XXX{i in NAME}>=MIN[i]<=MAX[i];
*read data into procedure;
read data have into
NAME=[NAME]
TEAM
LEAD
GRADE
XXX
MIN
MAX
YYY
RATE;
*state function to optimize;
max metric=sum{gi in TEAMS}
sum{i in NAMEperTEAM[gi]}
(NEW_XXX[i])*(1-(NEW_XXX[i]-XXX[i])*RATE[i]/XXX[i])*YYY[i];
expand;
solve;
*write output dataset;
create data results
from [NAME]={NAME}
TEAM
LEAD
GRADE
XXX
NEW_XXX
MIN
MAX
RATE
YYY;
*write results to window;
print NEW_XXX metric;
quit;
If I understand this correctly, you need to set the non-team-lead NEW_XXX variables via equality constraints. That leaves only the team-lead NEW_XXX variables free for the optimization.
Let me know if this is what you are trying to accomplish.
Here's how I did it:
proc optmodel;
*set variables and inputs;
set<string> NAME;
string TEAM{NAME};
number LEAD{NAME};
number GRADE{NAME};
number XXX{NAME};
number MIN{NAME};
number MAX{NAME};
number YYY{NAME};
number RATE{NAME};
*read data into procedure;
read data have into
NAME=[NAME]
TEAM
LEAD
GRADE
XXX
MIN
MAX
YYY
RATE;
set TEAMS = setof{i in NAME} TEAM[i];
set NAMEperTEAM{gi in TEAMS} = {i in NAME: TEAM[i] = gi};
/*Helper array that gives me the team leader for each team*/
str LEADS{TEAMS};
for {i in NAME: LEAD[i] = 1} do;
LEADS[TEAM[i]] = i;
end;
var NEW_XXX{i in NAME} init XXX[i] >=MIN[i]<=MAX[i];
*state function to optimize;
max metric=sum{gi in TEAMS}(
sum{i in NAMEperTEAM[gi]} (
(NEW_XXX[i])*(1-(NEW_XXX[i]-XXX[i])*RATE[i]/XXX[i])*YYY[i]
)
);
/*Constrain the non-lead members*/
con NonLeads{i in NAME: LEAD[i] = 0}: NEW_XXX[i] = (1 + (GRADE[i] - GRADE[LEADS[TEAM[i]]]) * 0.1) * NEW_XXX[LEADS[TEAM[i]]] ;
expand;
solve;
*write output dataset;
create data results
from [NAME]={NAME}
TEAM
LEAD
GRADE
XXX
NEW_XXX
MIN
MAX
RATE
YYY;
*write results to window;
print new_xxx metric;
quit;

SQL Query - SUM of data in transaction (for each transaction)

I have this table:
Trans_ID Name Value Total_Item
100 I1 0.33333333 3
100 I2 0.33333333 3
400 I1 0.33333333 3
400 I2 0.33333333 3
800 I1 0.25 4
800 I2 0.25 4
900 I1 0.33333333 3
900 I2 0.33333333 3
1000 I1 0.2 5
1000 I2 0.2 5
I need to make it into:
ITEM VALUE
I1,I2 0.28999998
Value is calculated from the sum of the two items' values in each transaction, divided by the total number of rows (10 here).
EX: item I1 & I2 at trans 100
(0.33333333 + 0.33333333) = 0.666666666
trans 400
(0.33333333 + 0.33333333) = 0.666666666
trans 800
(0.25+0.25) = 0.5
trans 900
(0.33333333 + 0.33333333) = 0.666666666
trans 1000
(0.2+0.2) = 0.4
So Value will be:
(0.666666666+0.666666666+0.5+0.666666666+0.4)/10= 0.28999998
*since the total count in this example table is 10 rows; there are approx. 50k transactions in my real table
Please note that the Total_Item value is fixed for each transaction and is not a mistake (trans 100 only has 2 items listed here, but the 3 in Total_Item is intentional).
I'm working with MS Access (but a general SQL query is fine).
If you are correct about your process, then all the grouping you are using is not necessary: the value is the same without the grouping. That is,
((T100_I1 + T100_I2) + (T400_I1 + T400_I2)) / 4 = (T100_I1 + T100_I2 + T400_I1 + T400_I2) / 4
In other words, to get the value you described, you just need to sum all the values and divide by their count.
select sum(value)/count(*)
from table
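If you want to convince yourself, the grouped version returns the same number, because adding the per-transaction sums is the same as adding all the rows (your_table is a placeholder):
select sum(trans_sum) / sum(item_count) as value
from (
    select Trans_ID, sum(value) as trans_sum, count(*) as item_count
    from your_table
    group by Trans_ID
) as per_transaction;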