I have a question related to my previous one.
What I have is a database that looks like:
category price date
-------------------------
Cat1 37 2019-03
Cat2 65 2019-03
Cat3 34 2019-03
Cat1 45 2019-03
Cat2 100 2019-03
Cat3 60 2019-03
This db has hundred of categories and comes from another one that has different attributes for each observation.
With this code:
WITH table AS
(
SELECT
category, price, date,
substring(date, 1, 4) AS year,
substring(date, 6, 2) as month
FROM
original_table
WHERE
(year = "2019" or year = "2020")
AND (month = "03")
AND product = "XXXXX"
ORDER BY
anno
)
-- I get this from a bigger table, but prefer to make small steps
-- that anyone in the fute can understand where this comes from as
-- the original table is expected to grow fast
SELECT
category,
ROUND(1.0 * next_price/ price - 1, 2) Pct_change,
SUBSTR(Date, 1, 4) || '-' || SUBSTR(next_date, 1, 4) Period,
tipo_establecimiento
FROM
(SELECT
*,
LEAD(Price) OVER (PARTITION BY category ORDER BY year) next_price,
LEAD(year) OVER (PARTITION BY category ORDER BY year) next_date,
CASE
WHEN (category_2>= 35) AND (category_2 <= 61)
THEN 'S'
ELSE 'N'
END 'tipo_establecimiento'
FROM
table)
WHERE
next_date IS NOT NULL AND Pct_change >= 0
ORDER BY
Pct_change DESC
This code gets me a view of the data that looks like:
category Pct_change period
cat1 0.21 2019-2020
cat2 0.53 2019-2020
cat3 0.76 "
This is great! But my next view has to take this one and provide me with a range that shows how many categories are in each range.
It should look like:
range avg num_cat_in
[0.1- 0.4] 0.3 3
This last table is just an example of what I expect
I have been trying with a code that looks like this but i get nothing
WITH table AS (
SELECT category, price, date, substring(date, 1, 4) AS year, substring(date, 6, 2) as month
FROM original_table
WHERE (year= "2019" or year= "2020") and (month= "03") and product = "XXXXX"
order by anno
)
-- I get this from a bigger table, but prefer to make small steps that anyone in the future can understand where this comes from as the original table is expected to grow fast
SELECT category,
ROUND(1.0 * next_price/ price - 1, 2) Pct_change,
SUBSTR(Date, 1, 4) || '-' || SUBSTR(next_date, 1, 4) Period,
tipo_establecimiento
FROM (
SELECT *,
LEAD(Price) OVER (PARTITION BY category ORDER BY year) next_price,
LEAD(year) OVER (PARTITION BY category ORDER BY year) next_date,
CASE
WHEN (category_2>= 35) AND (category_2 <= 61)
THEN 'S'
ELSE 'N'
END 'tipo_establecimiento'
FROM table
)
WHERE next_date IS NOT NULL AND Pct_change>=0
ORDER BY Pct_change DESC
WHERE next_date IS NOT NULL AND Pct_change>=0
)
SELECT
count(CASE WHEN Pct_change> 0.12 AND Pct_change <= 0.22 THEN 1 END) AS [12 - 22],
count(CASE WHEN Pct_change> 0.22 AND Pct_change <= 0.32 THEN 1 END) AS [22 - 32],
count(CASE WHEN Pct_change> 0.32 AND Pct_change <= 0.42 THEN 1 END) AS [32 - 42],
count(CASE WHEN Pct_change> 0.42 AND Pct_change <= 0.52 THEN 1 END) AS [42 - 52],
count(CASE WHEN Pct_change> 0.52 AND Pct_change <= 0.62 THEN 1 END) AS [52 - 62],
count(CASE WHEN Pct_change> 0.62 AND Pct_change <= 0.72 THEN 1 END) AS [62 - 72],
count(CASE WHEN Pct_change> 0.72 AND Pct_change <= 0.82 THEN 1 END) AS [72 - 82]
Thank you!!!
cf. my comment, I'm first assuming that your ranges are not hard-coded and that you wish to evenly split your data across quantiles of Prc_change. What this means is the calculation will figure out the ranges which split your sample as uniformly as possible. In this case, the following would work (where theview is the name of your previous view which calculates percentages):
select
concat('[',min(Pct_change),'-',min(Pct_change),']') as `range`
, avg(Pct_change) as `avg`
, count(*) as num_cat_in
from(
select *
, ntile(5)over(order by Pct_change) as bin
from theview
) t
group by bin
order by bin;
Here is a fiddle.
If on the other hand your ranges are hard-coded, I assume the ranges are in a table such as the one I create:
create table theranges (lower DOUBLE, upper DOUBLE);
insert into theranges values (0,0.2),(0.2,0.4),(0.4,0.6),(0.6,0.8),(0.8,1);
(You have to make sure that the ranges are non-overlapping. By convention I include percentages in the range from the lower bound included to the upper bound excluded, except for the upper bound of 1 which is included.) It is then a matter of left-joining the tables:
select
concat('[',lower,'-',upper,']') as `range`
, avg(Pct_change) as `avg`
, sum(if(Pct_change is null, 0, 1)) as num_cat_in
from theranges left join theview on (Pct_change>=lower and if(upper=1,true,Pct_change<upper))
group by lower, upper
order by lower;
(Note that in the bit that says upper=1, you must change 1 to whatever your highest hard-coded range is; here I am assuming your percentages are between 0 and 1.)
Here is the second fiddle.
Related
This question is best asked using an example - if I have daily data (in this case, daily Domestic Box Office for the movie Elvis), how can I sum only the weekend values?
If the data looks like this:
Date
DBO
6/24/2022
12755467
6/25/2022
9929779
6/26/2022
8526333
6/27/2022
4253038
6/28/2022
5267391
6/29/2022
4010762
6/30/2022
3577241
7/1/2022
5320812
7/2/2022
6841224
7/3/2022
6290576
7/4/2022
4248679
7/5/2022
3639110
7/6/2022
3002182
7/7/2022
2460108
7/8/2022
3326066
7/9/2022
4324040
7/10/2022
3530965
I'd like to be able to get results that look like this:
Weekend
DBO Sum
1
31211579
2
18452612
3
11181071
Also - not sure how tricky this would be but would love to include percent change v. last weekend.
Weekend
DBO Sum
% Change
1
31211579
2
18452612
-41%
3
11181071
-39%
I tried this with CASE WHEN but I got the results in different columns, which was not what I was looking for.
SELECT
,SUM(CASE
WHEN DATE BETWEEN '2022-06-24' AND '2022-06-26' THEN index
ELSE 0
END) AS Weekend1
,SUM(CASE
WHEN DATE BETWEEN '2022-07-01' AND '2022-07-03' THEN index
ELSE 0
END) AS Weekend2
,SUM(CASE
WHEN DATE BETWEEN '2022-07-08' AND '2022-07-10' THEN index
ELSE 0
END) AS Weekend3
FROM Elvis
I would start by filtering the data on week-end days only. Then we can group by week to get the index sum ; the last step is to use window functions to compare each week-end with the previous one:
select iso_week,
row_number() over(order by iso_week) weekend_number,
sum(index) as dbo_sum,
( sum(index) - lag(sum(index) over(order by iso_week) )
/ nullif(lag(sum(index)) over(order by iso_week), 0) as ratio_change
from (
select e.*, extract(isoweek from date) iso_week
from elvis e
where extract(dayofweek from date) in (1, 7)
) e
group by iso_week
order by iso_week
Consider below
select *,
round(100 * safe_divide(dbo_sum - lag(dbo_sum) over(order by week), lag(dbo_sum) over(order by week)), 2) change_percent
from (
select extract(week from date + 2) week, sum(dbo) dbo_sum
from your_table
where extract(dayofweek from date + 2) in (1, 2, 3)
group by week
)
if applied to sample data in your question - output is
Unable to pivot multiple columns in snowflake and I would appreciate it if some one can help me:
I basically have the table attached in the screenshot in the left and need to change it to the format in the right. I wonder if pivot can work in this case ?
my current code:
select
CONCAT(RIGHT(TO_VARCHAR(YEAR(DATE)),2),'-Q',TO_VARCHAR(QUARTER(DATE)) ) closed_date,
IFNULL(sum(case when STAG='Closed' then REVENUE_AMOUNTS end),0) REVENUE AMER,
IFNULL(sum(case when STAG='Closed' then REVENUE_AMOUNTS end),0) REVENUE APAC,
IFNULL(sum(case when STAG='Closed' then REVENUE_AMOUNTS end),0) REVENUE EMEA
from REVENUE_TABLE
where 1=1
group by 1
order by 1 asc
link to screenshot
So assuming the SQL you have posted is more like this (with included fake data in a CTE)
WITH REVENUE_TABLE as (
SELECT * FROM VALUES
('Closed', 1, '2020-01-01'::date, 'amer'),
('Closed', 2, '2020-04-01'::date, 'apac'),
('Closed', 3, '2020-08-01'::date, 'emea'),
('Closed', 4, '2021-01-01'::date, 'emea')
v(stag, REVENUE_AMOUNTS, date, loc)
)
select
CONCAT(RIGHT(TO_VARCHAR(YEAR(DATE)),2),'-Q',TO_VARCHAR(QUARTER(DATE)) ) closed_date,
ZEROIFNULL(sum(IFF(loc='amer' AND STAG='Closed', REVENUE_AMOUNTS, null))) as REVENUE_AMER,
ZEROIFNULL(sum(IFF(loc='apac' AND STAG='Closed', REVENUE_AMOUNTS, null))) as REVENUE_APAC,
ZEROIFNULL(sum(IFF(loc='emea' AND STAG='Closed', REVENUE_AMOUNTS, null))) as REVENUE_EMEA
from REVENUE_TABLE
group by 1
order by 1 asc
I swapped you CASE for an IFF and puting the which column does it belong in. And I swapped IFNULL(x, 0) for ZEROIFNULL(x) while longer, it more intent clear.
which gives results that look like your existing output:
CLOSED_DATE
REVENUE_AMER
REVENUE_APAC
REVENUE_EMEA
20-Q1
1
0
0
20-Q2
0
2
0
20-Q3
0
0
3
21-Q1
0
0
4
so if that hold as "the way it is", then to get to "where you want to go" you need to find the distinct set of values or locations, and then join to your results based on that.
select l.loc,
ZEROIFNULL(sum(IFF(r.cd='20-Q1', r.REVENUE_AMOUNTS, null))) as "20-Q1",
ZEROIFNULL(sum(IFF(r.cd='20-Q2', r.REVENUE_AMOUNTS, null))) as "20-Q2",
ZEROIFNULL(sum(IFF(r.cd='20-Q3', r.REVENUE_AMOUNTS, null))) as "20-Q3",
ZEROIFNULL(sum(IFF(r.cd='21-Q1', r.REVENUE_AMOUNTS, null))) as "21-Q1"
from (
select distinct loc
FROM REVENUE_TABLE
) as l
left join (
select loc,
revenue_amounts,
CONCAT(RIGHT(TO_VARCHAR(YEAR(DATE)),2),'-Q',TO_VARCHAR(QUARTER(DATE)) ) cd
FROM REVENUE_TABLE
WHERE STAG='Closed'
) as r on l.loc = r.loc
group by 1
order by 1 asc;
gives:
LOC
20-Q1
20-Q2
20-Q3
21-Q1
amer
1
0
0
0
apac
0
2
0
0
emea
0
0
3
4
Now the downside of this pattern is you need to explicitly know the column names, but you have that problem in the PIVOT case as well. That could be worked around with Snowflake Scripting I believe.
I have a table in GBQ in the following format :
UserId Orders Month
XDT 23 1
XDT 0 4
FKR 3 6
GHR 23 4
... ... ...
It shows the number of orders per user and month.
I want to calculate the percentage of users who have orders, I did it as following :
SELECT
HasOrders,
ROUND(COUNT(*) * 100 / CAST( SUM(COUNT(*)) OVER () AS float64), 2) Parts
FROM (
SELECT
*,
CASE WHEN Orders = 0 THEN 0 ELSE 1 END AS HasOrders
FROM `Table` )
GROUP BY
HasOrders
ORDER BY
Parts
It gives me the following result:
HasOrders Parts
0 35
1 65
I need to calculate the percentage of users who have orders, by month, in a way that every month = 100%
Currently to do this I execute the query once per month, which is not practical :
SELECT
HasOrders,
ROUND(COUNT(*) * 100 / CAST( SUM(COUNT(*)) OVER () AS float64), 2) Parts
FROM (
SELECT
*,
CASE WHEN Orders = 0 THEN 0 ELSE 1 END AS HasOrders
FROM `Table` )
WHERE Month = 1
GROUP BY
HasOrders
ORDER BY
Parts
Is there a way execute a query once and have this result ?
HasOrders Parts Month
0 25 1
1 75 1
0 45 2
1 55 2
... ... ...
SELECT
SIGN(Orders),
ROUND(COUNT(*) * 100.000 / SUM(COUNT(*), 2) OVER (PARTITION BY Month)) AS Parts,
Month
FROM T
GROUP BY Month, SIGN(Orders)
ORDER BY Month, SIGN(Orders)
Demo on Postgres:
https://dbfiddle.uk/?rdbms=postgres_10&fiddle=4cd2d1455673469c2dfc060eccea8020
You've stated that it's important for the total to be 100% so you might consider rounding down in the case of no orders and rounding up in the case of has orders for those scenarios where the percentages falls precisely on an odd multiple of 0.5%. Or perhaps rounding toward even or round smallest down would be better options:
WITH DATA AS (
SELECT SIGN(Orders) AS HasOrders, Month,
COUNT(*) * 10000.000 / SUM(COUNT(*)) OVER (PARTITION BY Month) AS PartsPercent
FROM T
GROUP BY Month, SIGN(Orders)
ORDER BY Month, SIGN(Orders)
)
select HasOrders, Month, PartsPercent,
PartsPercent - TRUNCATE(PartsPercent) AS Fraction,
CASE WHEN HasOrders = 0
THEN FLOOR(PartsPercent) ELSE CEILING(PartsPercent)
END AS PartsRound0Down,
CASE WHEN PartsPercent - TRUNCATE(PartsPercent) = 0.5
AND MOD(TRUNCATE(PartsPercent), 2) = 0
THEN FLOOR(PartsPercent) ELSE ROUND(PartsPercent) -- halfway up
END AS PartsRoundTowardEven,
CASE WHEN PartsPercent - TRUNCATE(PartsPercent) = 0.5 AND PartsPercent < 50
THEN FLOOR(PartsPercent) ELSE ROUND(PartsPercent) -- halfway up
END AS PartsSmallestTowardZero
from DATA
It's usually not advisable to test floating-point values for equality and I don't know how BigQuery's float64 will work with the comparison against 0.5. One half is nevertheless representable in binary. See these in a case where the breakout is 101 vs 99. I don't have immediate access to BigQuery so be aware that Postgres's rounding behavior is different:
https://dbfiddle.uk/?rdbms=postgres_10&fiddle=c8237e272427a0d1114c3d8056a01a09
Consider below approach
select hasOrders, round(100 * parts, 2) as parts, month from (
select month,
countif(orders = 0) / count(*) `0`,
countif(orders > 0) / count(*) `1`,
from your_table
group by month
)
unpivot (parts for hasOrders in (`0`, `1`))
with output like below
I have a table with the following columns:
account, validity_date,validity_month,amount.
For each row i want to check if the value in field "amount' exist over the rows range of the next month. if yes, indicator=1, else 0.
account validity_date validity_month amount **required_column**
------- ------------- --------------- ------- ----------------
123 15oct2019 201910 400 0
123 20oct2019 201910 500 1
123 15nov2019 201911 1000 0
123 20nov2019 201911 500 0
123 20nov2019 201911 2000 1
123 15dec2019 201912 400
123 15dec2019 201912 2000
Can anyone help?
Thanks
validity_month/100*12 + validity_month MOD 100 calculates a month number (for comparing across years, Jan to previous Dec) and the inner ROW_NUMBER reduces multiple rows with the same amount per month to a single row (kind of DISTINCT):
SELECT dt.*
,CASE -- next row is from next month
WHEN Lead(nextMonth IGNORE NULLS)
Over (PARTITION BY account, amount
ORDER BY validity_date)
= (validity_month/100*12 + validity_month MOD 100) +1
THEN 1
ELSE 0
END
FROM
(
SELECT t.*
,CASE -- one row per account/month/amount
WHEN Row_Number()
Over (PARTITION BY account, amount, validity_month
ORDER BY validity_date ) = 1
THEN validity_month/100*12 + validity_month MOD 100
END AS nextMonth
FROM tab AS t
) AS dt
Edit:
The previous is for exact matching amounts, for a range match the query is probably very hard to write with OLAP-functions, but easy with a Correlated Subquery:
SELECT t.*
,CASE
WHEN
( -- check if there's a row in the next month matching the current amount +/- 10 percent
SELECT Count(*)
FROM tab AS t2
WHERE t2.account_ = t.account_
AND (t2.validity_month/100*12 + t2.validity_month MOD 100)
= ( t.validity_month/100*12 + t.validity_month MOD 100) +1
AND t2.amount BETWEEN t.amount * 0.9 AND t.amount * 1.1
) > 0
THEN 1
ELSE 0
END
FROM tab AS t
But then performance might be really bad...
Assuming the values are unique within a month and you have a value for each month for each account, you can simplify this to:
select t.*,
(case when lead(seqnum) over (partition by account, amount order by validity_month) = seqnum + 1
then 1 else 0
end)
from (select t.*,
dense_rank() over (partition by account order by validity_month) as seqnum
from t
) t;
Note: This puts 0 for the last month rather than NULL, but that can easily be adjusted.
You can do this without the subquery by using month arithmetic. It is not clear what the data type of validity_month is. If I assume a number:
select t.*,
(case when lead(floor(validity_month / 100) * 12 + (validity_month mod 100)
) over (partition by account, amount order by validity_month) =
(validity_month / 100) * 12 + (validity_month mod 100) - 1
then 1 else 0
end)
from t;
Just to add another way to do this using Standard SQL. This query will return 1 when the condition is met, 0 when it is not, and null when there isn't a next month to evaluate (as implied in your result column).
It is assumed that we're partitioning on the account field. Also includes a 10% range match on the amount field based on the comment made. Note that if you have an id field, you should include it (if two rows have the same account, validity_date, validity_month, amount there will only be one resulting row, due to DISTINCT).
Performance-wise, should be similar to the answer from #dnoeth.
SELECT DISTINCT
t1.account,
t1.validity_date,
t1.validity_month,
t1.amount,
CASE
WHEN t2.amount IS NOT NULL THEN 1
WHEN MAX(t1.validity_month) OVER (PARTITION BY t1.account) > t1.validity_month THEN 0
ELSE NULL
END AS flag
FROM `project.dataset.table` t1
LEFT JOIN `project.dataset.table` t2
ON
t2.account = t1.account AND
DATE_DIFF(
PARSE_DATE("%Y%m", CAST(t2.validity_month AS STRING)),
PARSE_DATE("%Y%m", CAST(t1.validity_month AS STRING)),
MONTH
) = 1 AND
t2.amount BETWEEN t1.amount * 0.9 AND t1.amount * 1.1;
I have a table containing orders, items, and prices. I am trying to generate histograms for each item based on the prices.
Create Table #Customer_Pricing
(
customer_id int,
item_id VARCHAR(10),
qty DECIMAL(5,2),
price DECIMAL(5,2),
)
;
GO
-- Insert Statements
Insert into #Customer_Pricing values(128456, 'SOM 555', 8, 2.50)
Insert into #Customer_Pricing values(123856, 'SOM 554', 1, 2.50)
Insert into #Customer_Pricing values(123456, 'SOM 554', 55, 2.00)
Insert into #Customer_Pricing values(123556, 'SOM 555', 2, 2.20)
Insert into #Customer_Pricing values(123456, 'SOM 553', 12, 2.13)
;
For each item, I wanted 3 bins so I determined the bin sizes by dividing the difference of the MAX-MIN by 3, then adding that value to the MIN.
WITH Stats_Table_CTE (item_id2,max_p, min_p, int_p, r1_upper, r2_lower, r2_upper, r3_lower)
AS
( SELECT item_id
,max(price)
,min(price)
,(max(price) - min(price))/3
,min(price)+(max(price) - min(price))/3-0.01
,min(price)+(max(price) - min(price))/3
,min(price)+((max(price) - min(price))/3)*2-0.01
,min(price)+((max(price) - min(price))/3)*2
FROM #Customer_Pricing
GROUP BY item_id)
Now, I need to count the frequencies for each range and each item. I have attempted to do so by using SUM(CASE...) but was unsuccessful.
SELECT item_id
,SUM(CASE WHEN price <= r1_upper, THEN 1 ELSE 0 END) AS r1_count
,SUM(CASE WHEN price >= r2_lower AND <= r2_upper, THEN 1 ELSE 0 END) AS r2_count
,SUM(CASE WHEN price >= r3_lower, THEN 1 ELSE 0 END) AS r3_count
FROM Stats_Table_CTE
GROUP BY item_id
I also attempted to use COUNT in the form
SELECT item_id, price
count(price <= r1_upper) AS r1_count.... but I got stuck
In one attempt, INNER JOINed the #Customer_Pricing table and Stats_Table_CTE but didn't know where to go from there.
Ideally, I would like the output table to appear as follows: *This is not the actual data, but I included it to show the desired format of the output.
Item ID min_p r1_upper (r2 bins) r3_lower max_p r1_count r2_ct
SOM 553 2.00 2.16 saving space 2.33 2.50 2 1
SOM 554 2.13 2.48 2.88 3.25 1 0
SOM 555 2.31 2.51 2.72 2.92 3 2
*The format of the output table is off, but I have item ID, the bins, and the counts across the top grouped by item
Here is my recommendation:
WITH Stats_Table_CTE AS (
SELECT item_id, max(price) as maxprice, min(price) as minprice,
(max(price) - min(price))/3 as binsize
FROM #Customer_Pricing
GROUP BY item_id
)
SELECT cp.item_id,
SUM(CASE WHEN price < minprice + binsize THEN 1 ELSE 0
END) AS r1_count
SUM(CASE WHEN price >= minprice + binsize AND price < minprice+ 2*binsize
THEN 1 ELSE 0
END) AS r2_count
SUM(CASE WHEN price >= minprice + 2*binsize
THEN 1 ELSE 0
END) AS r3_count
FROM #Customer_Pricing cp JOIN
Stats_Table_CTE st
ON st.item_id = cp.item_id
GROUP BY cp.item_id
The important part is the join back to #Customer_Pricing. Also important is the simplification of the logic -- you can define the bounds for the bins and use <, rather than having a lower and upper bound for each one. Also, your query had some syntax errors in it.
Note that in many databases, the CTE would not be necessary because you could just use window functions. Your question is not tagged with the database (although I could guess what it is), so that change seems unwarranted.