PostgreSQL - Completing a Series

I have a series, and here is a simple formula, where x = yesterday's value and y = the value from three days ago:
x + (x - y) / 2
In Excel, computing this series is easy. Here is a sample data set in which I would like to complete the series based on previous values. Note that the actual values simply come from the data set: we have data for 1/1/2018, 1/2/2018, and 1/3/2018, and we would like to predict 1/4/2018 through 1/8/2018 using the above formula:
      A (dt)     B (sum)               Excel equivalent
row1  1/1/2018   1        (actual)
row2  1/2/2018   2        (actual)
row3  1/3/2018   5        (actual)
row4  1/4/2018   7        (predicted)  =B3 + ((B3 - B1) / 2)
row5  1/5/2018   9.5      (predicted)  =B4 + ((B4 - B2) / 2)
row6  1/6/2018   11.75    (predicted)  =B5 + ((B5 - B3) / 2)
row7  1/7/2018   14.125   (predicted)  =B6 + ((B6 - B4) / 2)
row8  1/8/2018   16.4375  (predicted)  =B7 + ((B7 - B5) / 2)
I know that you can achieve a cumulative sum by using PARTITION BY; however, I am having trouble with modified cumulative sums such as the above. Is there a way to accomplish this in PostgreSQL?

This is a hard problem. Here is a solution using a recursive CTE:
with recursive cte as (
    -- anchor row: n = 1 carries the first actual value
    select 1 as n, 1::numeric as x, null::numeric as x_1, null::numeric as x_2
    union all
    select n + 1,
           (case n + 1
                when 2 then 2                 -- second actual value
                when 3 then 5                 -- third actual value
                else x + (x - x_2) / 2        -- predicted from the previous row
            end) as x,
           x as x_1,                          -- shift the history one column
           x_1 as x_2
    from cte
    where n < 10
)
select *
from cte;
Along with a db<>fiddle.
The idea is to pivot the historical values that you need in separate columns. Note that the formula is x + (x - x_2) / 2 rather than x_1 + (x_1 - x_3) / 2 because this is using the values from the previous row.
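If you want the predictions keyed by the actual dates rather than a row counter n, a variation along the same lines (a sketch only, assuming the actual rows live in a table called daily_totals(dt date, amount numeric); adjust the names to your schema) is to seed the recursion from the most recent actual row, pivot its two predecessors into x_1 and x_2 with LAG, and add one day per step:

with recursive actuals as (
    -- pivot the two previous values onto each actual row
    select dt,
           amount                            as x,
           lag(amount)    over (order by dt) as x_1,
           lag(amount, 2) over (order by dt) as x_2
    from daily_totals
),
cte as (
    -- seed with the most recent actual row
    select dt, x, x_1, x_2
    from actuals
    where dt = (select max(dt) from daily_totals)
    union all
    -- each step applies x + (x - x_2) / 2 and shifts the history
    select dt + 1,
           x + (x - x_2) / 2,
           x,
           x_1
    from cte
    where dt < date '2018-01-08'
)
select dt, x
from cte;

For the sample data this should reproduce 7, 9.5, 11.75, 14.125 and 16.4375 for 1/4/2018 through 1/8/2018.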

Related

Determine median and quartiles using columnar data in Snowflake

I am looking for a way to calculate the median, first and third quartiles from a data set based on certain parameters. I also would like to use these values for future coding.
Here is what the data looks like:
ID   Country  Gold Level  Silver Level  Diamond Level  Value
123  A        Y           N             Y              0.1
234  B        N           N             Y
365  C        Y           Y             Y              0.003
234  D        N           N             N              0.07
245  A        Y           Y             N              0.65
374  B        Y           N             N              0.87
937  D        N           N             Y              0.55
What I am looking for is the median, first and third quartiles based on country and level: give me the median, first and third quartiles for Country A and Gold Level = 'Y', then for Country A and Silver Level = 'Y', and so on.
Also, in some cases, as you see in row 2, there is a blank value. I would like to replace that value with 0.
Ideally, the output would look something like this:
Country  Level   Median  1st Quartile  3rd Quartile
A        Gold    0.08    0.075         0.2
A        Silver  0.2     0.01          0.5
B        Gold    0.07    0.079         0.4
You can use the PERCENTILE_CONT function in Snowflake.
The query would be:
SELECT
    Country,
    "Gold Level",
    "Silver Level",
    PERCENTILE_CONT(0.5)  WITHIN GROUP (ORDER BY COALESCE(Value, 0)) AS Median,
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY COALESCE(Value, 0)) AS First_Quartile,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY COALESCE(Value, 0)) AS Third_Quartile
FROM
    your_table
GROUP BY
    Country,
    "Gold Level",
    "Silver Level";

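If you want one row per country and level, as in the layout sketched above, one option (a sketch only, assuming the same table and quoted column names as in the query above) is to unpivot the three level flags with UNION ALL first and then aggregate:

WITH unpivoted AS (
    -- one row per (id, country, level) where the flag is 'Y'
    SELECT ID, Country, 'Gold'    AS Level, Value FROM your_table WHERE "Gold Level"    = 'Y'
    UNION ALL
    SELECT ID, Country, 'Silver'  AS Level, Value FROM your_table WHERE "Silver Level"  = 'Y'
    UNION ALL
    SELECT ID, Country, 'Diamond' AS Level, Value FROM your_table WHERE "Diamond Level" = 'Y'
)
SELECT
    Country,
    Level,
    PERCENTILE_CONT(0.5)  WITHIN GROUP (ORDER BY COALESCE(Value, 0)) AS Median,
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY COALESCE(Value, 0)) AS First_Quartile,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY COALESCE(Value, 0)) AS Third_Quartile
FROM unpivoted
GROUP BY Country, Level;
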
Computing weighted mean in pandas dataframe based on intervals

I have 2 dataframes, df1, df2.
df1 consists of 3 columns, start, end, id.
df2 consists of 4 columns, start, end, id, quantity.
Note that start < end always for both dataframes.
For df1, end - start for each row is always 15, and the [start, end] pair for each row is nonoverlapping and contiguous for each id, e.g.,
df1:
id start end
1 0 15
1 15 30
1 30 45
2 0 15
2 15 30
2 30 45
I need to create a 4th column, quantity_average, in df1, where the quantity_average for each row is the weighted average of all df2.quantity such that the corresponding id is the same in both and there is full/partial overlap between the start, end pairs in both dataframes.
The weight is defined as (min(df2.end, df1.end) - max(df2.start, df1.start)) / 15, i.e., proportional to the amount of overlap.
I will provide a full example. We will use the df1 above, and use
df2 =
id start end quantity
1 0 1.1 3.5
1 1.1 11.4 5.5
1 11.4 34 2.5
1 34 46 3
2 0 1.5 2.2
2 1.5 20 1.0
2 20 30 4.5
So we have the result for quantity_average to be:
1.1 / 15 * 3.5 + (11.4 - 1.1)/15 * 5.5 + (15 - 11.4) / 15 * 2.5 = 4.63333
(30 - 15) / 15 * 2.5 = 2.5
(34 - 30) / 15 * 2.5 = 0.66666
1.5 / 15 * 2.2 + (15 - 1.5) / 15 * 1.0 = 1.12
(20 - 15) / 15 * 1.0 + (30 - 20) / 15 * 4.5 = 3.33333333333
0 (id 2 has no df2 rows overlapping [30, 45])
I am wondering if there's a quick way to do this in pandas?
Here's one (not so simple) way to do it. It's fast in the sense that it uses vectorized functions, but its time and memory complexities are both O(len(df1) * len(df2)). Depending on the scale of your data sets, the memory requirements may overwhelm your computer's hardware.
The idea is to use numpy broadcasting to compare every row in df1 against every row in df2, searching for pairs that:
Have the same id
Have overlapping (start, end) intervals
... then perform calculations over them:
import numpy as np

# Extract the columns to numpy arrays.
# For the columns of df1, raise each by one dimension to prepare
# for numpy broadcasting
id1, start1, end1 = [col[:, None] for col in df1.to_numpy().T]
id2, start2, end2, quantity2 = df2.to_numpy().T

# Match each row in df1 to each row in df2.
# `is_match` is a matrix where if cell (i, j) is True, row i of
# df1 matches row j of df2
is_match = (id1 == id2) & (start1 <= end2) & (start2 <= end1)

# `start` is a matrix where cell (i, j) is the maximum start time
# between row i of df1 and row j of df2
start = np.maximum(
    np.tile(start1, len(df2)),
    np.tile(start2, (len(df1), 1))
)

# Likewise, `end` is a matrix where cell (i, j) is the minimum end
# time between row i of df1 and row j of df2
end = np.minimum(
    np.tile(end1, len(df2)),
    np.tile(end2, (len(df1), 1))
)

# This assumes that every row in df1 has a duration of 15
df1["quantity"] = (is_match * (end - start) * quantity2).sum(axis=1) / 15

# This allows each row in df1 to have a different duration
df1["quantity"] = (is_match * (end - start) * quantity2).sum(axis=1) / (end1 - start1)[:, 0]

group by dynamic interval with starting and ending point SQL Server

I have a table containing a column DED with numbers that can go from 0 to infinity. I am interested in grouping them into intervals that always start at 0 (lower bound closed, upper bound open) and getting the percentage totals.
Suppose I have a column with
DED AMT
0.0004 4
0.0009 1
0.001 2
0.002 1
0.009 4
0.01 5
0.04 6
0.09 3
0.095 1
0.9 3
1 2
100 1
500 1
so I would want the following intervals:
DED AMT PAMT
0-0.01 12 0.3529
0.01-0.02 5 0.1470
0.04-0.05 6 0.1764
0.09-0.1 4 0.1176
0.9-1 3 0.0882
1 2 0.0588
I have tried:
SELECT CAST(DED/.02*.02 AS VARCHAR) + ' - ' + CAST(DED/.02*.02 + .01 AS VARCHAR) AS DED,
       SUM(AMT) AS AMT,
       ISNULL(SUM(AMT) * 1.000 / NULLIF(SUM(SUM(AMT)) OVER (), 0), 0) AS PAMT
FROM MYTABLE
WHERE DED/.02*.02 <= 1
GROUP BY DED/.02*.02
Thanks for your help
SELECT
    ROUND(DED, 2, 1) AS DED_lower,
    ROUND(DED, 2, 1) + 0.01 AS DED_upper,
    SUM(AMT) AS SUM_AMT,
    -- the grand total has to be SUM(SUM(AMT)) OVER () because the query is grouped
    SUM(AMT) * 1.0 / SUM(SUM(AMT)) OVER () AS PAMT
FROM
    mytable
WHERE
    DED <= 1
GROUP BY
    ROUND(DED, 2, 1)
ROUND(DED, 2, 1) rounds down (truncates) to two decimal places, giving equal-sized bands 0.01 wide.
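For instance (a quick sketch; the displayed scale depends on the column's data type), the truncation flag maps each sample value onto the lower edge of its band:

SELECT
    ROUND(0.009, 2, 1) AS a,   -- 0.00 -> the 0.00 - 0.01 band
    ROUND(0.095, 2, 1) AS b,   -- 0.09 -> the 0.09 - 0.10 band
    ROUND(0.9,   2, 1) AS c;   -- 0.90 -> the 0.90 - 0.91 band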
Apologies for typos or formatting, I'm on my phone.

Efficient way in SQL to compute a set of rows against all other rows?

Let's say I have a table with data that looks like this:
d user val
1 1 .94
1 2 -.88
1 3 .24
1 4 .74
2 1 .35
2 2 .68
2 3 -.98
2 4 .62
3 1 -.81
3 2 .97
3 3 .29
3 4 ___ (this row doesn't exist in the database)
4 1 .76
4 2 .38
4 3 -.98
4 4 .15
5 1 .69
5 2 .27
5 3 -.49
5 4 -.59
For a given user (let's say 2), I would like the following output:
user calc
1 -.102
3 .668
4 -.1175
Generalized:
user calc
1 ((-.88 - .94) + (.68 - .35) + (.97 - -.81) + (.38 - .76) + (.27 - .69)) / 5
3 ((-.88 - .24) + (.68 - -.98) + (.97 - .29) + (.38 - -.98) + (.27 - -.49)) / 5
4 ((-.88 - .74) + (.68 - .62) + (.38 - .15) + (.27 - -.59)) / 4
Generalized Further:
user calc
1 sum of (user2's d value - user1's d value) / count
3 sum of (user2's d value - user3's d value) / count
4 sum of (user2's d value - user4's d value) / count
To explain further, I'd like to obtain an output that shows everyone's relation to a given user (in this case user 2). In my actual dataset there are hundreds of unsorted distinct users and d values, but I've tried to simplify the dataset for this question.
Also, please note that not all users have a d value, so it should only factor in matching sets. See how in the example above user 4 doesn't have a value for d=3, so that one is skipped in the calculation.
A join and aggregation should work:
select
t2.user, avg(t1.val - t2.val) as calc
from my_table t1
join my_table t2 on t1.d = t2.d and t1.user <> t2.user
where t1.user = 2
group by t2.user
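To sanity-check against the sample data, here is a quick sketch (it assumes a PostgreSQL-style VALUES list and renames the column to usr to avoid the reserved word user; adapt to your RDBMS):

with my_table (d, usr, val) as (
    values
        (1, 1, .94),  (1, 2, -.88), (1, 3, .24),  (1, 4, .74),
        (2, 1, .35),  (2, 2, .68),  (2, 3, -.98), (2, 4, .62),
        (3, 1, -.81), (3, 2, .97),  (3, 3, .29),
        (4, 1, .76),  (4, 2, .38),  (4, 3, -.98), (4, 4, .15),
        (5, 1, .69),  (5, 2, .27),  (5, 3, -.49), (5, 4, -.59)
)
select t2.usr, avg(t1.val - t2.val) as calc
from my_table t1
join my_table t2 on t1.d = t2.d and t1.usr <> t2.usr
where t1.usr = 2
group by t2.usr
order by t2.usr;
-- expected: 1 -> -0.102, 3 -> 0.668, 4 -> -0.1175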

Generating Leads and lags for non-consecutive time periods in SAS

With a SAS dataset like
Ob x year pid grp
1 3.88 2001 1 a
2 2.88 2002 1 a
3 0.13 2004 1 a
4 3.70 2005 1 a
5 1.30 2007 1 a
6 0.95 2001 2 b
7 1.79 2002 2 b
8 1.59 2004 2 b
9 1.29 2005 2 b
10 0.96 2007 2 b
I would like to get
Ob   x     year  pid  grp  X_F1  X_L1
1    3.88  2001  1    a    2.88  .
2    2.88  2002  1    a    .     3.88
3    0.13  2004  1    a    3.7   .
4    3.7   2005  1    a    .     0.13
5    1.3   2007  1    a    .     .
6    0.95  2001  2    b    1.79  .
7    1.79  2002  2    b    .     0.95
8    1.59  2004  2    b    1.29  .
9    1.29  2005  2    b    .     1.59
10   0.96  2007  2    b    .     .
where for observations with the same pid and each year t,
x_F1 is the value of x in year t+1 and
x_L1 is the value of x in year t-1
In my data set, not all pids have observations in successive years.
My attempt using the expand proc
proc expand data=have out=want method=none;
by pid; id year;
convert x = x_F1 / transformout=(lead 1);
convert x = x_F2 / transformout=(lead 2);
convert x = x_F3 / transformout=(lead 3);
convert x = x_L1 / transformout=(lag 1);
convert x = x_L2 / transformout=(lag 2);
convert x = x_L3 / transformout=(lag 3);
run;
did not account for the fact that years are not consecutive.
You could stick with proc expand to insert the missing years into your data (using the EXTRAPOLATE option). I've set FROM=DAY and TO=DAY so the ID variable is treated as a sequential integer, which works with your data because YEAR is stored as an integer rather than a date.
Like the other answers, it requires two passes of the data, but I don't think there's an alternative to that.
data have;
input x year pid grp $;
datalines;
3.88 2001 1 a
2.88 2002 1 a
0.13 2004 1 a
3.70 2005 1 a
1.30 2007 1 a
0.95 2001 2 b
1.79 2002 2 b
1.59 2004 2 b
1.29 2005 2 b
0.96 2007 2 b
;
run;
proc expand data = have out = have1
method=none extrapolate
from=day to=day;
by pid;
id year;
run;
proc expand data=have1 out=want method=none;
by pid; id year;
convert x = x_F1 / transformout=(lead 1);
convert x = x_F2 / transformout=(lead 2);
convert x = x_F3 / transformout=(lead 3);
convert x = x_L1 / transformout=(lag 1);
convert x = x_L2 / transformout=(lag 2);
convert x = x_L3 / transformout=(lag 3);
run;
or this can be done in one go, subject to whether the value of x is important in the final dataset (see comment below).
proc expand data=have1 out=want1 method=none extrapolate from=day to=day;
by pid; id year;
convert x = x_F1 / transformout=(lead 1);
convert x = x_F2 / transformout=(lead 2);
convert x = x_F3 / transformout=(lead 3);
convert x = x_L1 / transformout=(lag 1);
convert x = x_L2 / transformout=(lag 2);
convert x = x_L3 / transformout=(lag 3);
run;
Here is a simple approach using proc sql. It joins the data with itself twice, once for the forward lag and once for the backward lag, then takes the required values where they exist.
proc sql;
create table want as
select
a.*,
b.x as x_f1,
c.x as x_l1
from have as a
left join have as b
on a.pid = b.pid and a.year = b.year - 1
left join have as c
on a.pid = c.pid and a.year = c.year + 1
order by
a.pid,
a.year;
quit;
Caveats:
It will not extend well to larger numbers of lags.
This is probably not the quickest approach.
It requires that there be only one observation for each pid year pair, and would need modifying if this is not the case.
Sort your data per group and per year.
Compute x_L1 in a data step with LAG and a condition like this: if year and lag(year) are consecutive then x_L1 = lag(x).
Sort your data the other way around.
Compute x_F1 similarly.
I'm trying to write you working code right now.
If you provide me with a data sample (a data step with an infile, e.g.), I can better try and test it.
This seems to work with my data:
/*1*/
proc sort data=WORK.QUERY_FOR_EPILABO_CLEAN_NODUP out=test1(where=(year<>1996)) nodupkey;
by grp year;
run;
quit;
/*2*/
data test2;
*retain x;
set test1;
by grp;
x_L1=lag(x);
if first.grp then
x_L1=.;
yeardif=dif(year);
if (yeardif ne 1) then
x_L1=.;
run;
/*3*/
proc sort data=test2(drop=yeardif) out=test3;
by grp descending year;
run;
quit;
/*4*/
data test4;
*retain x;
set test3;
by grp;
x_F1=lag(x);
if first.grp then
x_F1=.;
yeardif=dif(year);
if (yeardif ne -1) then
x_F1=.;
run;