How do I get the median of prices? - sql

In the data set, every shop sells some books, and each shop has its own price for each book. The data contains the price information for each book. Using a query in Amazon Athena, I want to calculate the median price for each shop and each product over a specific time period.
But honestly, I have no idea how to do it. Here is my query so far:
SELECT product_id,
shop_id,
XXX AS median_price
FROM data_f
WHERE site_id = 10
AND year || month || day || hour >= '2020022500'
AND year || month || day || hour < '2020022600'
GROUP BY product_id, shop_id
Thanks!

Unfortunately, Athena doesn't support a median() aggregation function or the percentile() functions. Perhaps the simplest method is to use ntile(2) in a subquery and then take the maximum of the first tile (or the minimum of the second tile):
SELECT product_id, shop_id,
MAX(CASE WHEN tile2 = 1 THEN price END) as median
FROM (SELECT d.*, NTILE(2) OVER (PARTITION BY product_id, shop_id ORDER BY price) as tile2
FROM data_f d
WHERE site_id = 10 AND
action NOT IN ('base', 'delete') AND
year || month || day || hour >= '2020022500' AND
year || month || day || hour < '2020022600'
) d
GROUP BY product_id, shop_id;
Note: This is undoubtedly good enough for any practical purpose. However, "median" is usually defined as the average of the two middle values when the total number of rows is even. If you want to be pedantic:
SELECT product_id, shop_id,
(CASE WHEN COUNT(*) % 2 = 0
THEN (MAX(CASE WHEN tile2 = 1 THEN price END) +
MIN(CASE WHEN tile2 = 2 THEN price END)
) / 2.0
ELSE MAX(CASE WHEN tile2 = 1 THEN price END)
END) as median
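For reference, here is a sketch of that expression dropped into the full query, reusing the same NTILE subquery and filters as above:
SELECT product_id, shop_id,
       (CASE WHEN COUNT(*) % 2 = 0
             THEN (MAX(CASE WHEN tile2 = 1 THEN price END) +
                   MIN(CASE WHEN tile2 = 2 THEN price END)
                  ) / 2.0
             ELSE MAX(CASE WHEN tile2 = 1 THEN price END)
        END) as median
FROM (SELECT d.*, NTILE(2) OVER (PARTITION BY product_id, shop_id ORDER BY price) as tile2
      FROM data_f d
      WHERE site_id = 10 AND
            action NOT IN ('base', 'delete') AND
            year || month || day || hour >= '2020022500' AND
            year || month || day || hour < '2020022600'
     ) d
GROUP BY product_id, shop_id;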

The median value is the one in the middle when all values are listed in order, so let's create that order with dense_rank():
with q1 as
(
SELECT product_id,
shop_id,
price,
dense_rank() over (partition by product_id, shop_id order by price) as price_rank
FROM data_f
WHERE site_id = 10
AND action <> 'base'
AND action <> 'delete'
AND year || month || day || hour >= '2020022500'
AND year || month || day || hour < '2020022600'
)
, q2 as
(
select product_id, shop_id, max(price_rank) as mp
from q1
group by product_id, shop_id
)
select q1.*
from q1
join q2
  on q2.product_id = q1.product_id
 and q2.shop_id = q1.shop_id
where q1.price_rank = floor((q2.mp + 1) / 2)
Window functions are documented as part of the Presto functions documentation.

You can use approx_percentile
select approx_percentile(column_name, 0.5) from table
Solution from Philipp Johannis's answer to "Calculate Median for each group in AWS Athena table":
SELECT product_id,
shop_id,
approx_percentile(price, 0.5) AS median_price
FROM data_f
WHERE site_id = 10
AND year || month || day || hour >= '2020022500'
AND year || month || day || hour < '2020022600'
GROUP BY product_id, shop_id

Below is a query for calculating the median:
with res1 as
(select id,ROW_NUMBER() over (order by id) "median_row_num" from test ),
res2 as
(select count(median_row_num) as i from res1)
select id as "median" from res1 where res1.median_row_num = (select case when i%2 = 0 then i/2 else i/2+1 end from res2)
Note: Remember that the median is the middle element in a sorted list of numbers.
If a = [3,4,2,6,7],
the sorted list is a = [2,3,4,6,7].
The count of elements is 5, so the median is 4.
But if a = [2,3,4,6,7,8],
the count of elements is 6, which is even, so there are two middle elements, 4 and 6,
and the median is 5 ((4 + 6) / 2 = 5).
So the query above is correct for odd counts; for even counts it will always give you the lower of the two middle elements.
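If you also need the even case averaged, one possible sketch (using the same test table and id column as above, and assuming integer division truncates, as it does in Athena/Presto and Postgres) is to keep both middle row numbers and average them:
with res1 as
(select id, ROW_NUMBER() over (order by id) as median_row_num from test),
res2 as
(select count(*) as i from res1)
select avg(id) as median   -- one row for odd counts, the two middle rows for even counts
from res1
where median_row_num in (select (i + 1) / 2 from res2      -- lower middle (integer division)
                         union all
                         select (i + 2) / 2 from res2);     -- upper middle; same row for odd counts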

Related

SQL - Calculate percentage by group, for multiple groups

I have a table in GBQ in the following format :
UserId Orders Month
XDT 23 1
XDT 0 4
FKR 3 6
GHR 23 4
... ... ...
It shows the number of orders per user and month.
I want to calculate the percentage of users who have orders, I did it as following :
SELECT
HasOrders,
ROUND(COUNT(*) * 100 / CAST( SUM(COUNT(*)) OVER () AS float64), 2) Parts
FROM (
SELECT
*,
CASE WHEN Orders = 0 THEN 0 ELSE 1 END AS HasOrders
FROM `Table` )
GROUP BY
HasOrders
ORDER BY
Parts
It gives me the following result:
HasOrders Parts
0 35
1 65
I need to calculate the percentage of users who have orders, by month, in such a way that every month adds up to 100%.
Currently, to do this, I execute the query once per month, which is not practical:
SELECT
HasOrders,
ROUND(COUNT(*) * 100 / CAST( SUM(COUNT(*)) OVER () AS float64), 2) Parts
FROM (
SELECT
*,
CASE WHEN Orders = 0 THEN 0 ELSE 1 END AS HasOrders
FROM `Table` )
WHERE Month = 1
GROUP BY
HasOrders
ORDER BY
Parts
Is there a way to execute a single query and get this result?
HasOrders Parts Month
0 25 1
1 75 1
0 45 2
1 55 2
... ... ...
SELECT
SIGN(Orders),
ROUND(COUNT(*) * 100.000 / SUM(COUNT(*)) OVER (PARTITION BY Month), 2) AS Parts,
Month
FROM T
GROUP BY Month, SIGN(Orders)
ORDER BY Month, SIGN(Orders)
Demo on Postgres:
https://dbfiddle.uk/?rdbms=postgres_10&fiddle=4cd2d1455673469c2dfc060eccea8020
You've stated that it's important for the total to be 100%, so you might consider rounding down in the no-orders case and rounding up in the has-orders case for those scenarios where the percentage falls precisely on an odd multiple of 0.5%. Rounding toward even, or rounding the smallest value down, might also be better options:
WITH DATA AS (
SELECT SIGN(Orders) AS HasOrders, Month,
COUNT(*) * 100.000 / SUM(COUNT(*)) OVER (PARTITION BY Month) AS PartsPercent
FROM T
GROUP BY Month, SIGN(Orders)
ORDER BY Month, SIGN(Orders)
)
select HasOrders, Month, PartsPercent,
PartsPercent - TRUNC(PartsPercent) AS Fraction,
CASE WHEN HasOrders = 0
THEN FLOOR(PartsPercent) ELSE CEILING(PartsPercent)
END AS PartsRound0Down,
CASE WHEN PartsPercent - TRUNC(PartsPercent) = 0.5
AND MOD(TRUNC(PartsPercent), 2) = 0
THEN FLOOR(PartsPercent) ELSE ROUND(PartsPercent) -- halfway up
END AS PartsRoundTowardEven,
CASE WHEN PartsPercent - TRUNC(PartsPercent) = 0.5 AND PartsPercent < 50
THEN FLOOR(PartsPercent) ELSE ROUND(PartsPercent) -- halfway up
END AS PartsSmallestTowardZero
from DATA
It's usually not advisable to test floating-point values for equality, and I don't know how BigQuery's float64 will behave in the comparison against 0.5; one half is, however, exactly representable in binary. See these options applied to a case where the breakout is 101 vs 99. I don't have immediate access to BigQuery, so be aware that Postgres's rounding behavior is different:
https://dbfiddle.uk/?rdbms=postgres_10&fiddle=c8237e272427a0d1114c3d8056a01a09
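If you prefer not to rely on exact float equality at all, here is a minimal sketch of the toward-even variant, meant to replace the final select over the DATA CTE above; the 1e-9 tolerance and the INT64 cast for BigQuery's MOD are my own assumptions:
-- reuses the DATA CTE from the query above
select HasOrders, Month, PartsPercent,
       case when abs((PartsPercent - trunc(PartsPercent)) - 0.5) < 1e-9   -- epsilon test instead of = 0.5
                 and mod(cast(trunc(PartsPercent) as int64), 2) = 0
            then floor(PartsPercent) else round(PartsPercent)
       end as PartsRoundTowardEven
from DATA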
Consider the approach below:
select hasOrders, round(100 * parts, 2) as parts, month from (
select month,
countif(orders = 0) / count(*) `0`,
countif(orders > 0) / count(*) `1`
from your_table
group by month
)
unpivot (parts for hasOrders in (`0`, `1`))
with output containing one row per month and hasOrders value, the two shares of each month summing to 100%.
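For intuition, here is a minimal self-contained sketch (hypothetical two-month shares, BigQuery syntax) showing how UNPIVOT turns the per-month share columns back into rows:
with per_month as (
  -- hypothetical shares: month 1 = 35% / 65%, month 2 = 45% / 55%
  select 1 as month, 0.35 as `0`, 0.65 as `1` union all
  select 2, 0.45, 0.55
)
select hasOrders, round(100 * parts, 2) as parts, month
from per_month
unpivot (parts for hasOrders in (`0`, `1`))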

Use last value when current row is null , for PostgreSQL timeseries table

I have come across a problem for which I could not find an optimal solution. The idea is to get the price at each given time for a list of products from a list of shops, but because the prices are registered at different times, I get some nulls when grouping by time, and also an array of values. It therefore takes a couple of steps to obtain what I need. I am wondering if someone knows a better, faster way to achieve this. Below is my initial PostgreSQL table; of course this is just a snippet of it, to give the idea:
Initial Table
Desired results (intermediate table and final one)
And below is the PostgreSQL code that gives the result I want, but it seems very costly:
SELECT times,
first_value(price_yami_egg) OVER (PARTITION BY partition_price_yami_egg order by times) as price_yami_egg,
first_value(price_yami_salt) OVER (PARTITION BY partition_price_yami_salt order by times) as price_yami_salt,
first_value(price_dobl_egg) OVER (PARTITION BY partition_price_dobl_egg order by times) as price_dobl_egg,
first_value(price_dobl_salt) OVER (PARTITION BY partition_price_dobl_salt order by times) as price_dobl_salt
FROM(
SELECT times,
min(price_yami_egg) as price_yami_egg,
sum(case when min(price_yami_egg) is not null then 1 end) over (order by times) as partition_price_yami_egg,
min(price_yami_salt) as price_yami_salt,
sum(case when min(price_yami_salt) is not null then 1 end) over (order by times) as partition_price_yami_salt,
min(price_dobl_egg) as price_dobl_egg,
sum(case when min(price_dobl_egg) is not null then 1 end) over (order by times) as partition_price_dobl_egg,
min(price_dobl_salt) as price_dobl_salt,
sum(case when min(price_dobl_salt) is not null then 1 end) over (order by times) as partition_price_dobl_salt
FROM (
SELECT "time" AS times,
CASE WHEN shop_name::text = 'yami'::text AND product_name::text = 'egg'::text THEN price END AS price_yami_egg,
CASE WHEN shop_name::text = 'yami'::text AND product_name::text = 'salt'::text THEN price END AS price_yami_salt,
CASE WHEN shop_name::text = 'dobl'::text AND product_name::text = 'egg'::text THEN price END AS price_dobl_egg,
CASE WHEN shop_name::text = 'dobl'::text AND product_name::text = 'salt'::text THEN price END AS price_dobl_salt
FROM shop sh
) S
GROUP BY times
ORDER BY times) SS
Do you just want aggregation?
select time,
min(price) filter (where shop_name = 'Yami' and product_name = 'EGG'),
min(price) filter (where shop_name = 'Yami' and product_name = 'SALT'),
min(price) filter (where shop_name = 'Dobl' and product_name = 'EGG'),
min(price) filter (where shop_name = 'Dobl' and product_name = 'SALT')
from shop s
group by time;
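For what it's worth, the FILTER clause is just shorthand for a CASE expression inside the aggregate, so an equivalent sketch that also runs on engines without FILTER is:
-- equivalent without FILTER: MIN ignores NULLs, so the CASE acts as the filter
select time,
       min(case when shop_name = 'Yami' and product_name = 'EGG'  then price end) as yami_egg,
       min(case when shop_name = 'Yami' and product_name = 'SALT' then price end) as yami_salt,
       min(case when shop_name = 'Dobl' and product_name = 'EGG'  then price end) as dobl_egg,
       min(case when shop_name = 'Dobl' and product_name = 'SALT' then price end) as dobl_salt
from shop s
group by time;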
If your concern is NULL values in the result, then you can fill them in. This is a little tricky, but the idea is:
with t as (
select time,
min(price) filter (where shop_name = 'Yami' and product_name = 'EGG') as yami_egg,
min(price) filter (where shop_name = 'Yami' and product_name = 'SALT') as yami_salt,
min(price) filter (where shop_name = 'Dobl' and product_name = 'EGG') as dobl_egg,
min(price) filter (where shop_name = 'Dobl' and product_name = 'SALT') as dobl_salt
from shop s
group by time
)
select s.*,
max(yami_egg) over (partition by yami_egg_grp) as imputed_yami_egg,
max(yami_salt) over (partition by yami_salt_grp) as imputed_yami_salt,
max(dobl_egg) over (partition by dobl_egg_grp) as imputed_dobl_egg,
max(dobl_salt) over (partition by dobl_salt_grp) as imputed_dobl_salt
from (select t.*,
count(yami_egg) over (order by time) as yami_egg_grp,
count(yami_salt) over (order by time) as yami_salt_grp,
count(dobl_egg) over (order by time) as dobl_egg_grp,
count(dobl_salt) over (order by time) as dobl_salt_grp
from t
) s
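To see why this works, here is a stripped-down sketch of the gap-filling trick on a single column (hypothetical table series(time, val) with NULL gaps): count(val) only increments on non-NULL rows, so every NULL row shares a group number with the last non-NULL row before it, and max(val) within that group carries the value forward.
-- hypothetical table: series(time, val), where val is NULL between registered values
select time, val,
       max(val) over (partition by grp) as filled_val
from (select time, val,
             count(val) over (order by time) as grp   -- increments only on non-NULL rows
      from series
     ) s
order by time;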

How to get the difference between (multiple) two different rows?

I have a set of data containing some fields: month, customer_id, row_num (RANK), and verified_date.
The rank field indicates the first (1) and second (2) purchase of each customer. I would like to know the time difference between the first and second purchase for each customer, and to show only the customer's first month (the month where row_num = 1).
https://i.ibb.co/PjJk5Y0/Capture.png
So my expected result is like below image:
https://i.ibb.co/y5Mww7k/Capture-2.png
I'm using StandardSQL in Google Bigquery.
row_num, verified_date
from table
GROUP BY 1, 2
We can try using a pivot query here, aggregating by the customer_id:
SELECT
MAX(CASE WHEN row_num = 1 THEN month END) AS month,
customer_id,
1 AS row_num,
DATE_DIFF(MAX(CASE WHEN row_num = 2 THEN verified_date END),
MAX(CASE WHEN row_num = 1 THEN verified_date END), DAY) AS difference
FROM yourTable
GROUP BY
customer_id;

Group by in columns and rows, counts and percentages per day

I have a table that has data like the following.
attr |time
----------------|--------------------------
abc |2018-08-06 10:17:25.282546
def |2018-08-06 10:17:25.325676
pqr |2018-08-05 10:17:25.366823
abc |2018-08-06 10:17:25.407941
def |2018-08-05 10:17:25.449249
I want to group and count them by the attr column row-wise, and also create additional columns to show their counts per day and percentages, as shown below.
attr |day1_count| day1_%| day2_count| day2_%
----------------|----------|-------|-----------|-------
abc |2 |66.6% | 0 | 0.0%
def |1 |33.3% | 1 | 50.0%
pqr |0 |0.0% | 1 | 50.0%
I'm able to display one count by using GROUP BY, but I am unable to figure out how to separate them into multiple columns. I tried to generate the day1 percentage with:
SELECT attr, count(attr), count(attr) / sum(sub.day1_count) * 100 as percentage from (
SELECT attr, count(*) as day1_count FROM my_table WHERE DATEPART(week, time) = DATEPART(day, GETDate()) GROUP BY attr) as sub
GROUP BY attr;
But this is also not giving me the correct answer; I'm getting all zeroes for the percentage and a count of 1. Any help is appreciated. I'm trying to do this in Redshift, which follows PostgreSQL syntax.
Let's nail the logic before presenting:
with CTE1 as
(
select attr, DATEPART(day, time) as theday, count(*) as thecount
from MyTable
group by attr, DATEPART(day, time)
)
, CTE2 as
(
select theday, sum(thecount) as daytotal
from CTE1
group by theday
)
select t1.attr, t1.theday, t1.thecount, 1.0 * t1.thecount / t2.daytotal as percentofday
from CTE1 t1
inner join CTE2 t2
on t1.theday = t2.theday
From here you can pivot to create a day-by-day layout if you feel the need, as in the sketch below.
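For example, a pivot sketch reusing the CTEs above (the hard-coded day values 5 and 6 are just illustrative, taken from the sample data):
select attr,
       coalesce(max(case when theday = 6 then thecount end), 0)     as day1_count,
       coalesce(max(case when theday = 6 then percentofday end), 0) as day1_percent,
       coalesce(max(case when theday = 5 then thecount end), 0)     as day2_count,
       coalesce(max(case when theday = 5 then percentofday end), 0) as day2_percent
from (select t1.attr, t1.theday, t1.thecount,
             100.0 * t1.thecount / t2.daytotal as percentofday
      from CTE1 t1
      inner join CTE2 t2 on t1.theday = t2.theday
     ) x
group by attr;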
I am trying to enhance the query from @JohnHC. By the way, if you need 7 days, then you have to add those days to the CASE WHEN expressions:
with CTE1 as
(
select attr, time::date as theday, count(*) as thecount
from t group by attr,time::date
)
, CTE2 as
(
select theday, sum(thecount) as daytotal
from CTE1
group by theday
)
,
CTE3 as
(
select t1.attr, EXTRACT(DOW FROM t1.theday) as day_nmbr, t1.theday, t1.thecount, 1.0 * t1.thecount / t2.daytotal as percentofday
from CTE1 t1
inner join CTE2 t2
on t1.theday = t2.theday
)
select CTE3.attr,
max(case when day_nmbr=0 then CTE3.thecount end) as day1Cnt,
max(case when day_nmbr=0 then percentofday end) as day1,
max(case when day_nmbr=1 then CTE3.thecount end) as day2Cnt,
max( case when day_nmbr=1 then percentofday end) day2
from CTE3 group by CTE3.attr
http://sqlfiddle.com/#!17/54ace/20
In case that you have only 2 days:
http://sqlfiddle.com/#!17/3bdad/3 (days descending as in your example from left to right)
http://sqlfiddle.com/#!17/3bdad/5 (days ascending)
The main idea is already mentioned in the other answers. Instead of joining the CTEs to calculate the values, I am using window functions, which is a bit shorter and more readable, I think. The pivot is done the same way.
SELECT
attr,
COALESCE(max(count) FILTER (WHERE day_number = 0), 0) as day1_count, -- D
COALESCE(max(percent) FILTER (WHERE day_number = 0), 0) as day1_percent,
COALESCE(max(count) FILTER (WHERE day_number = 1), 0) as day2_count,
COALESCE(max(percent) FILTER (WHERE day_number = 1), 0) as day2_percent
/*
Add more days here
*/
FROM(
SELECT *, (count::float/count_per_day)::decimal(5, 2) as percent -- C
FROM (
SELECT DISTINCT
attr,
MAX(time::date) OVER () - time::date as day_number, -- B
count(*) OVER (partition by time::date, attr) as count, -- A
count(*) OVER (partition by time::date) as count_per_day
FROM test_table
)s
)s
GROUP BY attr
ORDER BY attr
A: counting the rows per day, and the rows per day and attr
B: for more readability I convert the date into a number. Here I take the difference between the current row's date and the maximum date available in the table, so I get a counter from 0 (first day) up to n - 1 (last day)
C: calculating the percentage and rounding
D: pivot by filtering on the day numbers. The COALESCE avoids NULL values and switches them to 0. To add more days you can replicate these columns.
Edit: Made the day counter more flexible for more days; new SQL Fiddle
Basically, I see this as conditional aggregation. But you need to get an enumerator for the date for the pivoting. So:
SELECT attr,
COUNT(*) FILTER (WHERE day_number = 1) as day1_count,
COUNT(*) FILTER (WHERE day_number = 1) / cnt as day1_percent,
COUNT(*) FILTER (WHERE day_number = 2) as day2_count,
COUNT(*) FILTER (WHERE day_number = 2) / cnt as day2_percent
FROM (SELECT attr,
DENSE_RANK() OVER (ORDER BY time::date DESC) as day_number,
1.0 * COUNT(*) OVER (PARTITION BY attr) as cnt
FROM test_table
) s
GROUP BY attr, cnt
ORDER BY attr;
Here is a SQL Fiddle.

Loop in Oracle SQL, comparing one month to another

I have to draft a SQL query which does the following:
Compare current week (e.g. week 10) amount to the average amount over previous 4 weeks (Week# 9,8,7,6).
Now I need to run the query on a monthly basis so say for weeks (10,11,12,13).
As of now I am running it four times giving the week parameter on each run.
For example my current query is something like this :
select account_id, curr.amount,hist.AVG_Amt
from
(
select
to_char(run_date,'IW') Week_ID,
sum(amount) Amount,
account_id
from Transaction t
where to_char(run_date,'IW') = '10'
group by account_id,to_char(run_date,'IW')
) curr,
(
select account_id,
sum(amount) / count(to_char(run_date,'IW')) as AVG_Amt
from Transactions
where to_char(run_date,'IW') in ('6','7','8','9')
group by account_id
) hist
where
hist.account_id = curr.account_id
and curr.amount > 2*hist.AVG_Amt;
As you can see, if I have to run the above query for week 11,12,13 I have to run it three separate times. Is there a way to consolidate or structure the query such that I only run once and I get the comparison data all together?
Just as additional info, I need to export the data to Excel, which I do from PL/SQL Developer after running the query.
Thanks!
-Abhi
You can use a correlated sub-query to get the sum of amounts for the last 4 weeks for a given week.
select
to_char(run_date,'IW') Week_ID,
sum(amount) curAmount,
(select sum(amount)/4.0 from transaction
where account_id = t.account_id
and to_char(run_date,'IW') between to_char(t.run_date,'IW')-4
and to_char(t.run_date,'IW')-1
) hist_amount,
account_id
from Transaction t
where to_char(run_date,'IW') in ('10','11','12','13')
group by account_id,to_char(run_date,'IW')
Edit: Based on OP's comment on the performance of the query above, this can also be accomplished using lag to get the previous row's value. Count of number of records present in the last 4 weeks can be achieved using a case expression.
with sum_amounts as
(select to_char(run_date,'IW') wk, sum(amount) amount, account_id
from Transaction
group by account_id, to_char(run_date,'IW')
)
select wk, account_id, amount,
1.0 * (lag(amount,1,0) over (partition by account_id order by wk) +
lag(amount,2,0) over (partition by account_id order by wk) +
lag(amount,3,0) over (partition by account_id order by wk) +
lag(amount,4,0) over (partition by account_id order by wk))
/ nullif(case when lag(amount,1,0) over (partition by account_id order by wk) <> 0 then 1 else 0 end +
case when lag(amount,2,0) over (partition by account_id order by wk) <> 0 then 1 else 0 end +
case when lag(amount,3,0) over (partition by account_id order by wk) <> 0 then 1 else 0 end +
case when lag(amount,4,0) over (partition by account_id order by wk) <> 0 then 1 else 0 end, 0)
as hist_avg_amount
from sum_amounts
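An equivalent, somewhat shorter sketch (same sum_amounts CTE; it assumes every account has a row for each of the previous four weeks, since the window counts rows rather than calendar weeks) uses a windowed AVG instead of summing individual LAGs:
with sum_amounts as
(select to_char(run_date,'IW') wk, sum(amount) amount, account_id
from Transaction
group by account_id, to_char(run_date,'IW')
)
select wk, account_id, amount,
       -- average of the previous four weekly rows for the same account
       avg(amount) over (partition by account_id
                         order by wk
                         rows between 4 preceding and 1 preceding) as hist_avg_amount
from sum_amounts;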
I think that is what you are looking for:
with lagt as (select to_char(run_date,'IW') Week_ID, sum(amount) Amount, account_id
from Transaction t
group by account_id, to_char(run_date,'IW'))
select Week_ID, account_id, amount,
(lag(amount,1,0) over (partition by account_id order by Week_ID) +
lag(amount,2,0) over (partition by account_id order by Week_ID) +
lag(amount,3,0) over (partition by account_id order by Week_ID) +
lag(amount,4,0) over (partition by account_id order by Week_ID)) / 4 as average
from lagt;