Determine median and quartiles using columnar data in Snowflake - sql

I am looking for a way to calculate the median, first and third quartiles from a data set based on certain parameters. I also would like to use these values for future coding.
Here is what the data looks like:
ID    Country   Gold Level   Silver Level   Diamond Level   Value
123   A         Y            N              Y               0.1
234   B         N            N              Y
365   C         Y            Y              Y               0.003
234   D         N            N              N               0.07
245   A         Y            Y              N               0.65
374   B         Y            N              N               0.87
937   D         N            N              Y               0.55
What I am looking for is the median, first and third quartiles by country and level: the median, first and third quartiles for Country A where Gold Level = 'Y', then for Country A where Silver Level = 'Y', and so on.
Also, in some cases (as you can see in row 2) the value is blank. I would like to replace that value with 0.
Ideally the output would look something like this:
Country   Level    Median   1st Quartile   3rd Quartile
A         Gold     0.08     0.075          0.2
A         Silver   0.2      0.01           0.5
B         Gold     0.07     0.079          0.4

You can use the PERCENTILE_CONT function in Snowflake.
The query would be:
SELECT
    Country,
    "Gold Level",
    "Silver Level",
    PERCENTILE_CONT(0.5)  WITHIN GROUP (ORDER BY COALESCE(Value, 0)) AS Median,
    PERCENTILE_CONT(0.25) WITHIN GROUP (ORDER BY COALESCE(Value, 0)) AS First_Quartile,
    PERCENTILE_CONT(0.75) WITHIN GROUP (ORDER BY COALESCE(Value, 0)) AS Third_Quartile
FROM your_table
GROUP BY Country, "Gold Level", "Silver Level";

Related

Combining aggregate and analytics functions in BigQuery to reduce table size

I am facing an issue that is more about query design than SQL specifics. Essentially, I am trying to reduce a dataset's size by carrying out the transformations described below.
I start with an irregularly sampled timeseries of Voltage measurements:
Seconds   Volts
0.1       2899
0.15      2999
0.17      2990
0.6       3001
0.98      2978
1.2       3000
1.22      3003
3.7       2888
4.1       2900
4.11      3012
4.7       3000
4.8       3000
I bin the data into buckets, where data points that are close to one another fall into the same bucket. In this example, I bin data into 1 second buckets simply by dividing the Seconds column by 1. I also add an ordering number to each group. I use the below query:
WITH qry1 AS (
    SELECT
        Seconds,
        Volts,
        DIV(CAST(Seconds AS NUMERIC), 1) AS BinNo,
        RANK() OVER (PARTITION BY DIV(CAST(Seconds AS NUMERIC), 1) ORDER BY Seconds) AS BinRank
    FROM project.rawdata
)
SELECT * FROM qry1
This gives:
Seconds   Volts   BinNo   BinRank
0.1       2899    0       1
0.15      2999    0       2
0.17      2990    0       3
0.6       3001    0       4
0.98      2978    0       5
1.2       3000    1       1
1.22      3003    1       2
3.7       2888    3       1
4.1       2900    4       1
4.11      3012    4       2
4.7       3000    4       3
4.8       3000    4       4
Now comes the part I am struggling with. I am attempting to get the following output from a query acting on the above table. Keeping the time order is important as I need to plot these values on a line style chart. For each group:
Get the first row ('first' meaning earliest Second value)
Get the Max and Min of the Volts field, and associate these with the earliest (can be latest too I guess) Seconds value
Get the last row (last meaning latest Second value)
The conditions for this query are:
If there is only one row in the group, simply assign the Volts value for that row as both the max and the min and only use the single Seconds value for that group
If there are only two rows in the group, assign each row's Volts value as both the max and the min for its own Seconds value.
(Now for the part I am struggling with.) If there are three or more rows in the group, extract the first and last rows as above, but also take the max and min of Volts over all rows in the group and assign them to an intermediate row between the first and last rows. The output would be as below. As mentioned, this intermediate row could be associated with any Seconds value between the first and last; here I have assigned it the first Seconds value of each group.
Seconds   Volts_min   Volts_max   OrderingCol
0.1       2899        2899        1
0.1       2899        3001        2
0.98      2978        2978        3
1.2       3000        3000        1
1.22      3003        3003        2
3.7       2888        2888        1
4.1       2900        2900        1
4.1       2900        3012        2
4.8       3000        3000        3
This will then allow me to plot these values using a custom charting library we have, without overloading memory. I can extract the first and last rows per group using analytic functions and a join, but I cannot get the intermediate values. The Ordering Column's goal is to let me sort the table before pulling the data into the dashboard. I am attempting to do this in BigQuery as a first preference.
Thanks :)
Below should do it
select Seconds, values.*,
  -- final ordering within each bin, computed over the rows that are kept
  row_number() over(partition by bin_num order by Seconds) as OrderingCol
from (
  select *,
    -- keep the first row, the last row, and (for bins with 3+ rows) the second row
    case
      when row_num = 1 or row_num = rows_count then true
      when rows_count > 2 and row_num = 2 then true
    end toShow,
    -- first/last rows carry their own Volts; the intermediate row carries the bin's min/max
    case
      when row_num = 1 then struct(first_row.Volts as Volts_min, first_row.Volts as Volts_max)
      when row_num = rows_count then struct(last_row.Volts as Volts_min, last_row.Volts as Volts_max)
      else struct(min_val as Volts_min, max_val as Volts_max)
    end values
  from (
    select *,
      div(cast(Seconds AS numeric), 1) as bin_num,
      row_number() over win_all as row_num,
      count(1) over win_all as rows_count,
      min(Volts) over win_all as min_val,
      max(Volts) over win_all as max_val,
      first_value(t) over win_with_order as first_row,
      last_value(t) over win_with_order as last_row
    from `project.dataset.table` t
    window
      win_all as (partition by div(cast(Seconds AS numeric), 1)),
      win_with_order as (partition by div(cast(Seconds AS numeric), 1) order by Seconds)
  )
)
where toShow
# order by Seconds
If applied to sample data in your question - output is

Get value of variable quantile per group

I have data that is categorized in groups, with a given quantile percentage per group. I want to create a threshold for each group that separates all values within the group based on that quantile percentage: if a group has q = 0.8, I want the lowest 80% of values to get 1 and the upper 20% to get 0.
So, given data like this:
I want objects 1, 2 and 5 to get result 1 and the other three result 0. In total my data consists of 7,000,000 rows with 14,000 groups. I tried doing this with groupby.quantile, but for that I need a constant quantile value, whereas my data has a different one for each group.
Setup:
import numpy as np
import pandas as pd

num = 7_000_000
grp_num = 14_000

qua = np.around(np.random.uniform(size=grp_num), 2)

df = pd.DataFrame({
    "Group": np.random.randint(low=0, high=grp_num, size=num),
    "Quantile": 0.0,
    "Value": np.random.randint(low=100, high=300, size=num)
}).sort_values("Group").reset_index(0, drop=True)

def func(grp):
    grp["Quantile"] = qua[grp.Group]
    return grp

df = df.groupby("Group").apply(func)
Answer: (This is basically a for loop, so for performance you can try to apply numba to this)
def func2(grp):
    return grp.Value < grp.Value.quantile(grp.Quantile.iloc[0])

df["result"] = df.groupby("Group").apply(func2).reset_index(0, drop=True)
print(df)
Outputs:
Group Quantile Value result
0 0 0.33 156 1
1 0 0.33 259 0
2 0 0.33 166 1
3 0 0.33 183 0
4 0 0.33 111 1
... ... ... ... ...
6999995 13999 0.83 194 1
6999996 13999 0.83 227 1
6999997 13999 0.83 215 1
6999998 13999 0.83 103 1
6999999 13999 0.83 115 1
[7000000 rows x 4 columns]
CPU times: user 14.2 s, sys: 362 ms, total: 14.6 s
Wall time: 14.7 s

group by dynamic interval with starting and ending point SQL Server

I have a table containing a column DED with numbers that can go from 0 to infinity. I am interested in grouping them into intervals that always start at 0 (lower bound closed, upper bound open) and getting the percentage totals.
Suppose I have a column with
DED AMT
0.0004 4
0.0009 1
0.001 2
0.002 1
0.009 4
0.01 5
0.04 6
0.09 3
0.095 1
0.9 3
1 2
100 1
500 1
so I would want the following intervals:
DED AMT PAMT
0-0.01 12 0.3529
0.01-0.02 5 0.1470
0.04-0.05 6 0.1764
0.09-0.1 4 0.1176
0.9-1 3 0.0882
1 2 0.0588
I have tried:
SELECT CAST(DED/.02*.02 AS VARCHAR) +' - '+CAST(DED/.02*.02 +.01 AS VARCHAR) AS DED,
SUM(AMT) AS AMT,ISNULL(SUM(AMT)*1.000/NULLIF(SUM(SUM(AMT)) OVER (),0),0) AS PAMT
FROM MYTABLE
WHERE DED/.02*.02<=1
GROUP BY DED/.02*.02
Thanks for your help
SELECT
    ROUND(DED, 2, 1) AS DED_lower,
    ROUND(DED, 2, 1) + 0.01 AS DED_upper,
    SUM(AMT) AS SUM_AMT,
    SUM(AMT) * 1.0 / SUM(SUM(AMT)) OVER () AS PAMT
FROM
    mytable
WHERE
    DED <= 1
GROUP BY
    ROUND(DED, 2, 1)
ROUND(DED, 2, 1) rounds down (truncates) to two decimal places, giving equal-sized bands 0.01 wide.
Apologies for typos or formatting, I'm on my phone.
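
The third argument is what makes this work; a quick way to see SQL Server's documented truncation behaviour, shown here on one of the sample values:

SELECT ROUND(0.0194, 2, 1) AS truncated,  -- 0.0100: extra digits are simply cut off
       ROUND(0.0194, 2)    AS rounded;    -- 0.0200: normal rounding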

PostgreSQL - Completing a Series

I have a series, and here is a simple formula, where x = yesterday, and y = three days ago:
x + (x - y) / 2
In Excel, computing the above series is easy. Below is a sample data set in which I would like to complete the series based on previous values. Note that the actual values simply come from the data set: we have data for 1/1/2018, 1/2/2018, and 1/3/2018, and we would like to predict 1/4/2018 through 1/8/2018 using the above formula:
A (dt) B (sum) excel equivalent
row1 1/1/2018 1 (actual)
row2 1/2/2018 2 (actual)
row3 1/3/2018 5 (actual)
row4 1/4/2018 7 (predicted) =B3 + ((B3 - B1) / 2)
row5 1/5/2018 9.5 (predicted) =B4 + ((B4 - B2) / 2)
row6 1/6/2018 11.75 (predicted) =B5 + ((B5 - B3) / 2)
row7 1/7/2018 14.125 (predicted) =B6 + ((B6 - B4) / 2)
row8 1/8/2018 16.4375 (predicted) =B7 + ((B7 - B5) / 2)
I know that you can achieve a cumulative sum using PARTITION BY; however, I am having trouble with modified cumulative sums such as the above. Is there a way to accomplish this in PostgreSQL?
Here is a screenshot of excel:
This is a hard problem. Here is a solution using a recursive CTE:
with recursive cte as (
    select 1 as n, 1::numeric as x, null::numeric as x_1, null::numeric as x_2
    union all
    select n + 1,
           (case n + 1 when 2 then 2 when 3 then 5
                 else x + (x - x_2) / 2
            end) as x,
           x as x_1, x_1 as x_2
    from cte
    where n < 10
)
select *
from cte;
Along with a db<>fiddle.
The idea is to pivot the historical values that you need in separate columns. Note that the formula is x + (x - x_2) / 2 rather than x_1 + (x_1 - x_3) / 2 because this is using the values from the previous row.
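
As a sketch of the same idea carried one step further (the dates and the hardcoded actuals 1, 2, 5 are taken from the sample data; everything else is assumed), the CTE can also carry the dt column so the output lines up with the question's layout:

with recursive cte as (
    select date '2018-01-01' as dt, 1 as n,
           1::numeric as x, null::numeric as x_1, null::numeric as x_2
    union all
    select dt + 1, n + 1,
           (case n + 1 when 2 then 2 when 3 then 5   -- actuals for 1/2 and 1/3
                 else x + (x - x_2) / 2              -- yesterday + (yesterday - three days ago) / 2
            end),
           x, x_1
    from cte
    where n < 8
)
select dt, x as sum
from cte
order by dt;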

How do I hardcode the result for one column based on another column's result

I am new to SQL and have browsed Stack Overflow, but I do not understand the trigger or SET approaches suggested for this. I believe this could be done with a CASE WHEN statement.
I have the following result using the query provided below.
The result shows two different Types for the Symbol X with two different Weights. I would like the result to show the sum of both the type 1 X and the type 2 X as a single type 1 X, so the result would show Type 1, Symbol X, Weight 2.95.
Type   Symbol   Price   Weight
1      g                1.17
1      h                1.24
1      x                1
2      x                1.95
2      a                2.4
2      b                1.16
2      c                2.9
2      d                0.97
2      e                1.11
2      f                1.54
SELECT 'Type', 'Symbol', 'Price', 'Weight'
UNION
SELECT type,
       symbol,
       IFNULL(price, ''),
       weight
FROM lcv
Thank you in advance.
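
Since a CASE expression was mentioned, here is a hedged sketch of that idea (assuming the lcv table with the type, symbol, price, weight columns from the query above, and that only symbol x needs to be collapsed): group on a derived type so the two x rows fold into one.

SELECT CASE WHEN symbol = 'x' THEN 1 ELSE type END AS type,
       symbol,
       IFNULL(MAX(price), '') AS price,   -- keep whichever price is present, '' if none
       SUM(weight) AS weight              -- for x this sums 1 + 1.95 = 2.95
FROM lcv
GROUP BY CASE WHEN symbol = 'x' THEN 1 ELSE type END,
         symbol;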