I'm working with the Iowa Liquor Sales dataset, which is available as "bigquery-public-data.iowa_liquor_sales.sales". The relevant columns and their data types are date (DATE), sale_dollars (FLOAT), item_description (STRING), and store_name (STRING).
I am trying to write a query that returns the top sale for each of the past three years (2021, 2020, 2019), along with the date, item_description, and store_name.
The code below works, but only covers one year. I know I could copy and paste it and change the dates each time, but that seems tedious. Is there a better way?
SELECT date, sale_dollars, item_description, store_name
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE date between '2021-01-01' and '2021-12-31'
ORDER BY sale_dollars DESC
LIMIT 1
date | sale_dollars | item_description | store_name
2021-04-19 | 250932.0 | Titos Handmade Vodka | Hy-Vee #3
When I tried different ways to return the max sale of 2019, 2020, and 2021 along with its date, item_description, and store_name, I ran into errors. The query below is the closest I got (it is missing date, item_description, and store_name).
SELECT
(SELECT MAX(sale_dollars)
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE date between '2021-01-01' and '2021-12-31') as sale_2021,
(SELECT MAX(sale_dollars)
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE date between '2020-01-01' and '2020-12-31') as sale_2020,
(SELECT MAX(sale_dollars)
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE date between '2019-01-01' and '2019-12-31') as sale_2019
How can I write a query that returns the max sale of each of the past three years along with its date, item, and store name?
Consider the query below:
SELECT EXTRACT(YEAR FROM date) year,
ARRAY_AGG(
STRUCT(date, sale_dollars, item_description, store_name)
ORDER BY sale_dollars DESC LIMIT 1
)[OFFSET(0)].*
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE date BETWEEN '2019-01-01' AND '2021-12-31'
GROUP BY 1;
Query results
+------+------------+--------------+----------------------+-------------------------------+
| year | date | sale_dollars | item_description | store_name |
+------+------------+--------------+----------------------+-------------------------------+
| 2020 | 2020-10-08 | 250932.0 | Titos Handmade Vodka | Hy-Vee #3 / BDI / Des Moines |
| 2019 | 2019-10-08 | 78435.0 | Makers Mark | Hy-Vee Food Store / Urbandale |
| 2021 | 2021-07-05 | 250932.0 | Titos Handmade Vodka | Hy-Vee #3 / BDI / Des Moines |
+------+------------+--------------+----------------------+-------------------------------+
Or you can get the same rows with a window function (add EXTRACT(YEAR FROM date) to the select list if you also want the year column):
SELECT date, sale_dollars, item_description, store_name
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE date BETWEEN '2019-01-01' AND '2021-12-31'
QUALIFY ROW_NUMBER() OVER (
PARTITION BY EXTRACT(YEAR FROM date) ORDER BY sale_dollars DESC
) = 1;
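To try the window-function idea without a BigQuery project, here is a minimal sketch using Python's built-in sqlite3 module. The sample rows are made up for illustration (they are not taken from the public dataset), and since SQLite has no QUALIFY clause, the ROW_NUMBER() filter is moved to an outer query. It assumes the bundled SQLite is 3.25 or newer, which is when window functions were added.

```python
import sqlite3

# Hypothetical sample rows standing in for the public dataset (not real data).
SAMPLE = [
    ("2019-03-01", 100.0, "Item A", "Store 1"),
    ("2019-10-08", 300.0, "Item B", "Store 2"),
    ("2020-05-05", 250.0, "Item C", "Store 1"),
    ("2020-06-06", 150.0, "Item A", "Store 3"),
    ("2021-07-05", 400.0, "Item D", "Store 2"),
    ("2021-01-02", 50.0, "Item B", "Store 3"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (date TEXT, sale_dollars REAL, "
    "item_description TEXT, store_name TEXT)"
)
conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", SAMPLE)

# Number the rows of each year from highest to lowest sale, then keep row 1.
QUERY = """
SELECT year, date, sale_dollars, item_description, store_name
FROM (
    SELECT strftime('%Y', date) AS year,
           date, sale_dollars, item_description, store_name,
           ROW_NUMBER() OVER (
               PARTITION BY strftime('%Y', date)
               ORDER BY sale_dollars DESC
           ) AS rn
    FROM sales
    WHERE date BETWEEN '2019-01-01' AND '2021-12-31'
)
WHERE rn = 1
ORDER BY year
"""

top_per_year = conn.execute(QUERY).fetchall()
for row in top_per_year:
    print(row)
```

If two sales tie for the yearly maximum, ROW_NUMBER() picks one of them arbitrarily; RANK() would keep all tied rows instead.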
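Since each of the three subqueries returns a single value, you can add them as extra columns to the first query, with its date range widened to all three years. Note that this returns only one row (the top sale across the whole range), with the three yearly maxima alongside it: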
SELECT
date, sale_dollars, item_description, store_name,
(SELECT MAX(sale_dollars)
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE date between '2021-01-01' and '2021-12-31') as sale_2021,
(SELECT MAX(sale_dollars)
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE date between '2020-01-01' and '2020-12-31') as sale_2020,
(SELECT MAX(sale_dollars)
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE date between '2019-01-01' and '2019-12-31') as sale_2019
FROM `bigquery-public-data.iowa_liquor_sales.sales`
WHERE date between '2019-01-01' and '2021-12-31'
ORDER BY sale_dollars DESC
LIMIT 1
I have a table in BigQuery like this (260,000 rows):
vendor date item_price
x 2021-07-08 23:41:10 451,5
y 2021-06-14 10:22:10 41,7
z 2020-01-03 13:41:12 74
s 2020-04-12 01:14:58 88
....
What I want is to group this data by month and, for each month, find the sum of the sales of only the top 20 vendors in that month. Expected output:
month sum_of_only_top20_vendor's_sales
2020-01 7857
2020-02 9685
2020-03 3574
2020-04 7421
.....
Consider the approach below:
select month, sum(sale) as sum_of_only_top20_vendor_sales
from (
select vendor,
format_datetime('%Y-%m', date) month,
sum(item_price) as sale
from your_table
group by vendor, month
qualify row_number() over(partition by month order by sale desc) <= 20
)
group by month
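The same top-N-per-month pattern can be checked locally. The following is only a sketch using Python's sqlite3 module with invented rows, N = 2 instead of 20, and strftime in place of format_datetime; SQLite has no QUALIFY, so the row_number() filter sits in an outer query.

```python
import sqlite3

# Hypothetical rows (vendor, timestamp, item_price); not the asker's data.
ROWS = [
    ("x", "2020-01-03 13:41:12", 10.0),
    ("x", "2020-01-10 09:00:00", 5.0),
    ("y", "2020-01-04 10:00:00", 20.0),
    ("z", "2020-01-05 11:00:00", 1.0),
    ("x", "2020-02-01 12:00:00", 7.0),
    ("y", "2020-02-02 12:00:00", 3.0),
    ("z", "2020-02-03 12:00:00", 4.0),
]
TOP_N = 2  # the question uses 20; 2 keeps the sample small

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (vendor TEXT, date TEXT, item_price REAL)")
conn.executemany("INSERT INTO your_table VALUES (?, ?, ?)", ROWS)

# Step 1: total per (vendor, month). Step 2: rank vendors within each month.
# Step 3: keep the top N ranks and sum them per month.
QUERY = """
SELECT month, SUM(sale) AS sum_of_top_vendor_sales
FROM (
    SELECT vendor, month, sale,
           ROW_NUMBER() OVER (PARTITION BY month ORDER BY sale DESC) AS rn
    FROM (
        SELECT vendor,
               strftime('%Y-%m', date) AS month,
               SUM(item_price) AS sale
        FROM your_table
        GROUP BY vendor, strftime('%Y-%m', date)
    )
)
WHERE rn <= ?
GROUP BY month
ORDER BY month
"""

monthly_top = conn.execute(QUERY, (TOP_N,)).fetchall()
print(monthly_top)
```

In January the vendor totals are x = 15, y = 20, z = 1, so the top two sum to 35; in February they are x = 7, y = 3, z = 4, so the top two sum to 11.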
Another solution that can potentially perform much better on really big data, at the cost of approximate results (APPROX_TOP_SUM is an approximate aggregate function):
select month,
(select sum(sum) from t.top_20_vendors) as sum_of_only_top20_vendor_sales
from (
select
format_datetime('%Y-%m', date) month,
approx_top_sum(vendor, item_price, 20) top_20_vendors
from your_table
group by month
) t
Or, with a little refactoring:
select month, sum(sum) as sum_of_only_top20_vendor_sales
from (
select
format_datetime('%Y-%m', date) month,
approx_top_sum(vendor, item_price, 20) top_20_vendors
from your_table
group by month
) t, t.top_20_vendors
group by month
I have a table as follows
user_id date month year visiting_id
123 11-04-2017 APRIL 2017 4500
123 12-05-2017 MAY 2017 4567
123 13-05-2017 MAY 2017 4568
123 17-05-2017 MAY 2017 4569
123 22-05-2017 MAY 2017 4570
123 11-06-2017 JUNE 2017 4571
123 12-06-2017 JUNE 2017 4572
I want to calculate the visiting count for the current month and last month at the monthly level as follows:
user_id month year visit_count_this_month visit_count_last_month
123 APRIL 2017 1 0
123 MAY 2017 4 1
123 JUNE 2017 2 4
I was able to calculate visit_count_this_month using the following query
SELECT v.user_id, v.month, v.year,
SUM(is_visit_this_month) as visit_count_this_month
FROM
(SELECT user_id, date, month, year,
CASE WHEN TO_CHAR(date, 'MM/YYYY') = TO_CHAR(date, 'MM/YYYY')
THEN 1 ELSE 0
END as is_visit_this_month
FROM visits
GROUP BY user_id, date, month, year
HAVING user_id = 123) v
GROUP BY v.user_id, v.month, v.year
However, I'm stuck with calculating visit_count_last_month. Similar to this, I also want to calculate visit_count_last_2months.
Can somebody help?
You can use a LATERAL JOIN like this:
SELECT user_id, month, year, COUNT(*) as visit_count_this_month, visit_count_last_month
FROM visits v
CROSS JOIN LATERAL (
SELECT COUNT(*) as visit_count_last_month
FROM visits
WHERE user_id = v.user_id
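-- compare calendar months, not exact days, so every visit in the previous month is counted
AND date_trunc('month', CAST(date AS date)) = date_trunc('month', CAST(v.date AS date)) - interval '1 month'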
) l
GROUP BY user_id, month, year, visit_count_last_month;
SQLFiddle - http://sqlfiddle.com/#!15/393c8/2
Assuming there are values for every month, you can get the counts per month first and use lag to get the previous month's values per user.
SELECT T.*
,COALESCE(LAG(visits,1) OVER(PARTITION BY USER_ID ORDER BY year,mth),0) as last_month_visits
,COALESCE(LAG(visits,2) OVER(PARTITION BY USER_ID ORDER BY year,mth),0) as last_2_month_visits
FROM (
SELECT user_id, extract(month from date) as mth, year, COUNT(*) as visits
FROM visits
GROUP BY user_id, extract(month from date), year
) T
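As a quick sanity check of the LAG approach, here is a sketch that runs the same idea against the sample rows from the question using Python's sqlite3 module. It assumes the dates are stored as ISO strings (YYYY-MM-DD) and uses strftime instead of extract, since that is what SQLite offers; window functions need SQLite 3.25 or newer.

```python
import sqlite3

# The question's sample visits, with dates rewritten in ISO format.
VISITS = [
    (123, "2017-04-11", 4500),
    (123, "2017-05-12", 4567),
    (123, "2017-05-13", 4568),
    (123, "2017-05-17", 4569),
    (123, "2017-05-22", 4570),
    (123, "2017-06-11", 4571),
    (123, "2017-06-12", 4572),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id INTEGER, date TEXT, visiting_id INTEGER)")
conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", VISITS)

# Count visits per user and month first, then look one and two rows back.
QUERY = """
SELECT user_id, yr, mth, visits,
       COALESCE(LAG(visits, 1) OVER (PARTITION BY user_id ORDER BY yr, mth), 0)
           AS last_month_visits,
       COALESCE(LAG(visits, 2) OVER (PARTITION BY user_id ORDER BY yr, mth), 0)
           AS last_2_month_visits
FROM (
    SELECT user_id,
           strftime('%Y', date) AS yr,
           strftime('%m', date) AS mth,
           COUNT(*) AS visits
    FROM visits
    GROUP BY user_id, strftime('%Y', date), strftime('%m', date)
)
ORDER BY user_id, yr, mth
"""

monthly_counts = conn.execute(QUERY).fetchall()
for row in monthly_counts:
    print(row)
```

Note that LAG looks at the previous row, not the previous calendar month, so this only matches the expected output when every month has at least one visit.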
If there can be missing months, it is best to generate all months within a specified timeframe and left join the table onto that. (This example shows it for all the months in 2017.)
select user_id,yr,mth,visits
,coalesce(lag(visits,1) over(PARTITION BY USER_ID ORDER BY yr,mth),0) as last_month_visits
,coalesce(lag(visits,2) OVER(PARTITION BY USER_ID ORDER BY yr,mth),0) as last_2_month_visits
from (select u.user_id,extract(year from d.dt) as yr, extract(month from d.dt) as mth,count(v.visiting_id) as visits
from generate_series(date '2017-01-01', date '2017-12-31',interval '1 month') d(dt)
cross join (select distinct user_id from visits) u
left join visits v on extract(month from v.date)=extract(month from d.dt) and extract(year from v.date)=extract(year from d.dt) and u.user_id=v.user_id
group by u.user_id,extract(year from d.dt), extract(month from d.dt)
) t
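To see why generating the months matters, here is a small sketch in Python's sqlite3 module where the user has no visits in May. SQLite has no generate_series by default, so a recursive CTE builds the month list instead; the table layout, the three-month range, and the sample rows are assumptions made for illustration.

```python
import sqlite3

# Hypothetical visits with a gap: nothing in May 2017.
VISITS = [
    (123, "2017-04-11", 4500),
    (123, "2017-06-11", 4571),
    (123, "2017-06-12", 4572),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (user_id INTEGER, date TEXT, visiting_id INTEGER)")
conn.executemany("INSERT INTO visits VALUES (?, ?, ?)", VISITS)

QUERY = """
WITH RECURSIVE months(dt) AS (
    -- one row per month start from April to June 2017
    SELECT '2017-04-01'
    UNION ALL
    SELECT date(dt, '+1 month') FROM months WHERE dt < '2017-06-01'
),
monthly AS (
    -- every (user, month) pair, with 0 visits where nothing matches
    SELECT u.user_id,
           strftime('%Y-%m', m.dt) AS ym,
           COUNT(v.visiting_id) AS visits
    FROM months m
    CROSS JOIN (SELECT DISTINCT user_id FROM visits) u
    LEFT JOIN visits v
           ON v.user_id = u.user_id
          AND strftime('%Y-%m', v.date) = strftime('%Y-%m', m.dt)
    GROUP BY u.user_id, strftime('%Y-%m', m.dt)
)
SELECT user_id, ym, visits,
       COALESCE(LAG(visits) OVER (PARTITION BY user_id ORDER BY ym), 0)
           AS last_month_visits
FROM monthly
ORDER BY user_id, ym
"""

with_gaps = conn.execute(QUERY).fetchall()
for row in with_gaps:
    print(row)
```

Because May appears as a row with 0 visits, June correctly reports 0 visits for the previous month rather than picking up April's count.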
Input and code:
with data as (
select 1 id, 'A' name, 'fruit' r_group, '2007' year, '04' month, 5 sales from dual union all
select 2 id, 'Z' name, 'fruit' r_group, '2007' year, '04' month, 99 sales from dual union all
select 3 id, 'A' name, 'fruit' r_group, '2008' year, '05' month, 10 sales from dual union all
select 4 id, 'B' name, 'vegetable' r_group, '2008' year, '07' month, 20 sales from dual
)
select year,
month,
r_group,
sum(sales) sales,
sum(opening) opening,
sum(closing) closing
from (
select t.*,
(sum(sales) over (partition by name, r_group
order by year, month
rows between unbounded preceding and current row
) -sales ) as opening,
sum(sales) over (partition by name, r_group
order by year, month
rows between unbounded preceding and current row
) as closing
from data t
)
group by year, month, r_group
order by year, month
Output:
year | month | r_group | sales | opening | closing |
2007 | 04 | fruit | 104 | 0 | 104 |
2008 | 05 | fruit | 10 | 5 | 15 |
2008 | 07 | vegetable | 20 | 0 | 20 |
I want the output to be like the following:
year | month | r_group | sales | opening | closing |
2007 | 04 | fruit | 104 | 0 | 104 |
2008 | 05 | fruit | 10 | 104 | 114 |
2008 | 07 | vegetable | 20 | 0 | 20 |
I can achieve the desired output only by adding a zero-valued record in the data for month=05 and for name = 'Z' like this:
select 1 id, 'A' name, 'fruit' r_group, '2007' year, '04' month, 5 sales from dual union all
select 2 id, 'Z' name, 'fruit' r_group, '2007' year, '04' month, 99 sales from dual union all
select 3 id, 'A' name, 'fruit' r_group, '2008' year, '05' month, 10 sales from dual union all
select 4 id, 'Z' name, 'fruit' r_group, '2008' year, '05' month, 0 sales from dual union all
select 5 id, 'B' name, 'vegetable' r_group, '2008' year, '07' month, 20 sales from dual
However, I want to know if I can do this as part of the select query without having to edit the data itself.
EDIT
The result of the inner select statement will populate a database table with the detailed version: year, month, name, r_group, opening, closing. Aggregation with the outer query will happen afterwards:
select t.*,
(sum(sales) over (partition by name, r_group
order by year, month
rows between unbounded preceding and current row
) -sales ) as opening,
sum(sales) over (partition by name, r_group
order by year, month
rows between unbounded preceding and current row
) as closing
from data t
Then I'll aggregate that using a third-party analytical tool, grouping on r_group only without including the name. But the year, month, name, r_group detail must exist in the background.
EDIT 2
In other words, I'm trying to dynamically add missing data. For instance, if name = 'Z' exists in 2007-04 but does NOT exist in 2008-05, then the cumulative function fails once it gets to 2008, because there is no name = 'Z' row in 2008 to carry the running total forward.
Instead of ending the frame at CURRENT ROW, you can end it at 1 PRECEDING to sum up to the previous row. Aggregate by r_group, year, and month first so that the window runs over one row per group and month.
with data as (
select 1 id, 'A' name, 'fruit' r_group, '2007' year, '04' month, 5 sales from dual union all
select 2 id, 'Z' name, 'fruit' r_group, '2007' year, '04' month, 99 sales from dual union all
select 3 id, 'A' name, 'fruit' r_group, '2008' year, '05' month, 10 sales from dual union all
select 4 id, 'B' name, 'vegetable' r_group, '2008' year, '07' month, 20 sales from dual )
select t.*,
coalesce(sum(sales) over (partition by r_group order by year, month rows between unbounded preceding and 1 preceding),0) opening,
sum(sales) over (partition by r_group order by year, month rows between unbounded preceding and current row) closing
from (
select year, month, r_group, sum(sales) sales
from data
group by year, month, r_group
) t
order by 3,1,2;
year month r_group sales opening closing
---------------------------------------------------
2007 04 fruit 104 0 104
2008 05 fruit 10 104 114
2008 07 vegetable 20 0 20
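The frame trick above is not Oracle-specific. As a rough check, this sketch replays the question's four rows through Python's sqlite3 module (an assumption for convenience; the original runs on Oracle with dual) and yields the opening and closing balances the asker wanted.

```python
import sqlite3

# The four sample rows from the question.
DATA = [
    (1, "A", "fruit", "2007", "04", 5),
    (2, "Z", "fruit", "2007", "04", 99),
    (3, "A", "fruit", "2008", "05", 10),
    (4, "B", "vegetable", "2008", "07", 20),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE data (id INTEGER, name TEXT, r_group TEXT, "
    "year TEXT, month TEXT, sales INTEGER)"
)
conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?, ?, ?)", DATA)

# opening = running total up to the previous row (empty frame -> NULL -> 0)
# closing = running total including the current row
QUERY = """
SELECT year, month, r_group, sales,
       COALESCE(SUM(sales) OVER (
           PARTITION BY r_group ORDER BY year, month
           ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING
       ), 0) AS opening,
       SUM(sales) OVER (
           PARTITION BY r_group ORDER BY year, month
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
       ) AS closing
FROM (
    SELECT year, month, r_group, SUM(sales) AS sales
    FROM data
    GROUP BY year, month, r_group
)
ORDER BY r_group, year, month
"""

balances = conn.execute(QUERY).fetchall()
for row in balances:
    print(row)
```

Dropping name from the partition is what lets the 99 from 'Z' in 2007-04 carry into the fruit opening balance for 2008-05.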
Group by R_GROUP, YEAR and MONTH first then use the analytical query:
SELECT t.*,
SUM( sales ) OVER ( PARTITION BY r_group ORDER BY year, month ) - sales
AS opening,
SUM( sales ) OVER ( PARTITION BY r_group ORDER BY year, month ) AS closing
FROM (
SELECT r_group,
year,
month,
SUM( sales ) AS sales
FROM data
GROUP BY r_group, year, month
) t
ORDER BY year, month
Update:
This will also include the name in the output:
SELECT t.*,
SUM( sales ) OVER ( PARTITION BY r_group, dt ) AS r_group_month_sales,
COALESCE(
SUM( sales ) OVER (
PARTITION BY r_group
ORDER BY dt
RANGE BETWEEN UNBOUNDED PRECEDING AND INTERVAL '1' MONTH PRECEDING
),
0
) AS opening,
SUM( sales ) OVER (
PARTITION BY r_group
ORDER BY dt
RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS closing
FROM (
SELECT d.*,
TO_DATE( year || month, 'YYYYMM' ) AS dt
FROM data d
) t
ORDER BY dt
Output:
ID NAME R_GROUP YEAR MONTH SALES DT R_GROUP_MONTH_SALES OPENING CLOSING
-- ---- --------- ---- ----- ----- ---------- ------------------- ------- -------
1 A fruit 2007 04 5 2007-04-01 104 0 104
2 Z fruit 2007 04 99 2007-04-01 104 0 104
3 A fruit 2008 05 10 2008-05-01 10 104 114
4 B vegetable 2008 07 20 2008-07-01 20 0 20
You can then do whatever processing you want on the output of this query.
Maybe something like this:
SELECT year,
month,
r_group,
MAX( r_group_month_sales ) AS sales,
MAX( opening ) AS opening,
MAX( closing ) AS closing,
YOUR_THIRD_PARTY_AGGREGATION_FUNCTION( column_names ) AS other
FROM (
-- insert the query above
)
GROUP BY year, month, r_group
ORDER BY year, month
I have two different result sets:
Result 1:
+--------------+--------------+
| YEAR_MONTH | UNIQUE_USERS |
+--------------+--------------+
| 2013-08 | 1111 |
+--------------+--------------+
| 2013-09 | 2222 |
+--------------+--------------+
Result 2:
+--------------+----------------+
| YEAR_MONTH | UNIQUE_ACTIONS |
+--------------+----------------+
| 2013-08 | 111111111 |
+--------------+----------------+
| 2013-09 | 222222222 |
+--------------+----------------+
The code for Result 1:
SELECT TO_CHAR(ACCESS_DATE, 'yyyy-mm') YEAR_MONTH, COUNT(DISTINCT EMPLOYEE_ID) UNIQUE_USERS
FROM CORE.DATE_TEST
GROUP BY TO_CHAR(ACCESS_DATE, 'yyyy-mm')
ORDER BY YEAR_MONTH ASC
The code for Result 2:
SELECT TO_CHAR(ACCESS_DATE, 'yyyy-mm') YEAR_MONTH, COUNT(DISTINCT EMPLOYEE_ACTION) UNIQUE_ACTIONS
FROM CORE.ACTION_TEST
GROUP BY TO_CHAR(ACCESS_DATE, 'yyyy-mm')
ORDER BY YEAR_MONTH ASC
However, I've tried to group them by simply doing this:
SELECT TO_CHAR(ACCESS_DATE, 'yyyy-mm') YEAR_MONTH, COUNT(DISTINCT EMPLOYEE_ID) UNIQUE_USERS, COUNT(DISTINCT EMPLOYEE_ACTION) UNIQUE_ACTIONS
FROM CORE.DATE_TEST, CORE.ACTION_TEST
GROUP BY TO_CHAR(ACCESS_DATE, 'yyyy-mm')
ORDER BY YEAR_MONTH ASC
And that doesn't work. I've also tried an INNER JOIN on the second result set (result set 1 had t1 as a variable name, and result set 2 had t2), and got the error, Invalid Identifier, on t2.
This is my desired output:
+--------------+--------------+----------------+
| YEAR_MONTH | UNIQUE_USERS | UNIQUE_ACTIONS |
+--------------+--------------+----------------+
| 2013-08 | 1111 | 111111111 |
+--------------+--------------+----------------+
| 2013-09 | 2222 | 222222222 |
+--------------+--------------+----------------+
How do I do that correctly? It doesn't necessarily need to be a three-column group by; it just needs to work.
Try:
select a.YEAR_MONTH, a.UNIQUE_USERS, b.UNIQUE_ACTIONS
from (
SELECT TO_CHAR(ACCESS_DATE, 'yyyy-mm') YEAR_MONTH,
COUNT(DISTINCT EMPLOYEE_ID) UNIQUE_USERS
FROM CORE.DATE_TEST
GROUP BY TO_CHAR(ACCESS_DATE, 'yyyy-mm')
) a
join (
SELECT TO_CHAR(ACCESS_DATE, 'yyyy-mm') YEAR_MONTH,
COUNT(DISTINCT EMPLOYEE_ACTION) UNIQUE_ACTIONS
FROM CORE.ACTION_TEST
GROUP BY TO_CHAR(ACCESS_DATE, 'yyyy-mm')
) b
on a.YEAR_MONTH = b.YEAR_MONTH
order by a.YEAR_MONTH ASC
If both tables have many records, a Cartesian join is a poor solution and may not actually provide the answer you want. I'd solve this problem something like this:
SELECT TO_CHAR (COALESCE (t1.year_month, t2.year_month), 'yyyy-mm')
AS year_month,
t1.unique_users,
t2.unique_actions
FROM (SELECT TRUNC (access_date, 'mm') AS year_month,
COUNT (DISTINCT employee_id) AS unique_users
FROM core.date_test
GROUP BY TRUNC (access_date, 'mm')) t1
FULL OUTER JOIN
(SELECT TRUNC (access_date, 'mm') AS year_month,
COUNT (DISTINCT employee_action) AS unique_actions
FROM core.action_test
GROUP BY TRUNC (access_date, 'mm')) t2
ON t1.year_month = t2.year_month
ORDER BY COALESCE (t1.year_month, t2.year_month) ASC
The reason a Cartesian join performs poorly is that every row in the first table must be matched with every row in the second table before the group by is applied. If each table has only 1000 rows, that's 1,000,000 values that the database has to construct.
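To make the row-count argument concrete, here is a small sketch with Python's sqlite3 module and made-up rows. It counts the rows a Cartesian join produces and then runs the aggregate-first join. An inner join is used instead of FULL OUTER JOIN only because older SQLite versions lack the latter; strftime stands in for TO_CHAR.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE date_test (access_date TEXT, employee_id INTEGER)")
conn.execute("CREATE TABLE action_test (access_date TEXT, employee_action TEXT)")

# Hypothetical sample rows: 3 in one table, 4 in the other.
conn.executemany("INSERT INTO date_test VALUES (?, ?)", [
    ("2013-08-01", 1), ("2013-08-15", 2), ("2013-09-03", 1),
])
conn.executemany("INSERT INTO action_test VALUES (?, ?)", [
    ("2013-08-02", "login"), ("2013-08-02", "edit"),
    ("2013-09-10", "login"), ("2013-09-11", "login"),
])

# A Cartesian join pairs every row with every row: 3 * 4 rows here.
cartesian_rows = conn.execute(
    "SELECT COUNT(*) FROM date_test, action_test"
).fetchone()[0]

# Aggregating each table first means the join only sees one row per month.
QUERY = """
SELECT a.year_month, a.unique_users, b.unique_actions
FROM (
    SELECT strftime('%Y-%m', access_date) AS year_month,
           COUNT(DISTINCT employee_id) AS unique_users
    FROM date_test
    GROUP BY strftime('%Y-%m', access_date)
) a
JOIN (
    SELECT strftime('%Y-%m', access_date) AS year_month,
           COUNT(DISTINCT employee_action) AS unique_actions
    FROM action_test
    GROUP BY strftime('%Y-%m', access_date)
) b ON a.year_month = b.year_month
ORDER BY a.year_month
"""

combined = conn.execute(QUERY).fetchall()
print(cartesian_rows)
print(combined)
```

With the inner join, a month that appears in only one of the tables is dropped; the FULL OUTER JOIN in the answer above keeps such months.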
SELECT TO_CHAR(d.ACCESS_DATE, 'yyyy-mm') YEAR_MONTH, COUNT(DISTINCT d.EMPLOYEE_ID) UNIQUE_USERS, COUNT(DISTINCT a.EMPLOYEE_ACTION) UNIQUE_ACTIONS
FROM CORE.DATE_TEST d, CORE.ACTION_TEST a
WHERE TO_CHAR(d.ACCESS_DATE, 'yyyy-mm') = TO_CHAR(a.ACCESS_DATE, 'yyyy-mm')
GROUP BY TO_CHAR(d.ACCESS_DATE, 'yyyy-mm')
ORDER BY YEAR_MONTH ASC
Hopefully this works. Because ACCESS_DATE exists in both tables, each column has to be qualified with the alias of the table it comes from, and the two tables have to be joined on the month.