BigQuery: is a "subtable" possible?

In BigQuery, using legacy SQL, I have created a monstrous query that returns the following visits per day for a site that I released on 2018-02-26:
Row date name release_date visits_count
1 20180226 a_name 20180226 2179
2 20180227 a_name 20180226 9522
3 20180228 a_name 20180226 1593
4 20180301 a_name 20180226 300
...
What I really want is
Row name release count_release count_release+1 count_release_rest
1 a_name 20180226 2179 9522 1893
That is, I want the actual visit count for the release date, the count for the day after the release date, and the sum of all counts after that.
I'm new to BigQuery (and fairly new to SQL). Is there a way to define my first result as a "subtable" or something like that so that I can do this, or what approach would you recommend?

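There are a lot of ways to achieve this. The simplest is to compare the dates inside a CASE expression and sum visits_count conditionally. Your existing query can go in the FROM clause as a subquery, which is the "subtable" you are asking about. Depending on how date and release_date are stored, you may need to convert them to comparable date values first.
select name,
sum(case when date = release_date then visits_count else 0 end) as release_count,
sum(case when date = DATE_ADD(release_date,1,"DAY") then visits_count else 0 end) as release_count1,
sum(case when date > DATE_ADD(release_date,1,"DAY") then visits_count else 0 end) as release_count_other
from (<your existing query>)
group by name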

Below is for BigQuery Standard SQL
#standardSQL
WITH `project.dataset.table` AS (
SELECT '20180226' date, 'a_name' name, '20180226' release_date, 2179 visits_count UNION ALL
SELECT '20180227', 'a_name', '20180226', 9522 UNION ALL
SELECT '20180228', 'a_name', '20180226', 1593 UNION ALL
SELECT '20180301', 'a_name', '20180226', 300
)
SELECT name, release_date,
SUM(CASE WHEN date = release_date THEN visits_count END) count_release,
SUM(CASE WHEN PARSE_DATE('%Y%m%d', date) = DATE_ADD(PARSE_DATE('%Y%m%d', release_date), INTERVAL 1 DAY) THEN visits_count END) count_release_next_day,
SUM(CASE WHEN PARSE_DATE('%Y%m%d', date) > DATE_ADD(PARSE_DATE('%Y%m%d', release_date), INTERVAL 1 DAY) THEN visits_count END) count_release_rest
FROM `project.dataset.table`
GROUP BY name, release_date
Alternatively, the query above can be refactored to avoid repeating PARSE_DATE, which makes it more compact and easier to maintain:
#standardSQL
WITH `project.dataset.table` AS (
SELECT '20180226' date, 'a_name' name, '20180226' release_date, 2179 visits_count UNION ALL
SELECT '20180227', 'a_name', '20180226', 9522 UNION ALL
SELECT '20180228', 'a_name', '20180226', 1593 UNION ALL
SELECT '20180301', 'a_name', '20180226', 300
)
SELECT name, release_date,
SUM(CASE WHEN date = release_date THEN visits_count END) count_release,
SUM(CASE WHEN visit = release_next_day THEN visits_count END) count_release_next_day,
SUM(CASE WHEN visit > release_next_day THEN visits_count END) count_release_rest
FROM `project.dataset.table`,
UNNEST([STRUCT<visit DATE, release_next_day DATE>(
PARSE_DATE('%Y%m%d', date),
DATE_ADD(PARSE_DATE('%Y%m%d', release_date), INTERVAL 1 DAY))]) x
GROUP BY name, release_date
In both cases the result is:
Row name release_date count_release count_release_next_day count_release_rest
1 a_name 20180226 2179 9522 1893
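The same conditional aggregation can be tried locally without a BigQuery project. The following is only a sketch under stated assumptions, not BigQuery code: it uses Python's built-in sqlite3 module, a hypothetical table named visits holding the sample rows, and SQLite's date() function as a stand-in for PARSE_DATE and DATE_ADD.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits (date TEXT, name TEXT, release_date TEXT, visits_count INTEGER)"
)
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?, ?)",
    [
        ("20180226", "a_name", "20180226", 2179),
        ("20180227", "a_name", "20180226", 9522),
        ("20180228", "a_name", "20180226", 1593),
        ("20180301", "a_name", "20180226", 300),
    ],
)


def iso(col):
    """Build a SQLite expression that reshapes a YYYYMMDD string into YYYY-MM-DD."""
    return f"substr({col}, 1, 4) || '-' || substr({col}, 5, 2) || '-' || substr({col}, 7, 2)"


# The CTE plays the role of the "subtable": dates are converted once,
# then every CASE expression reuses the converted columns.
QUERY = f"""
WITH parsed AS (
    SELECT name, release_date, visits_count,
           date({iso('date')}) AS visit,
           date({iso('release_date')}) AS release_day,
           date({iso('release_date')}, '+1 day') AS release_next_day
    FROM visits
)
SELECT name, release_date,
       SUM(CASE WHEN visit = release_day THEN visits_count END) AS count_release,
       SUM(CASE WHEN visit = release_next_day THEN visits_count END) AS count_release_next_day,
       SUM(CASE WHEN visit > release_next_day THEN visits_count END) AS count_release_rest
FROM parsed
GROUP BY name, release_date
"""

rows = conn.execute(QUERY).fetchall()
print(rows)  # [('a_name', '20180226', 2179, 9522, 1893)]
```

The ISO-formatted strings compare correctly with = and >, which is why the conversion is done before the CASE expressions rather than inside each one.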

Related

Sum of distinct values after grouping explodes a metric

I am using
with t1 as
(
SELECT
DATE_TRUNC(PARSE_DATE("%Y%m%d", date), MONTH) as month,
fullVisitorId,
product.productSKU,
product.v2ProductName,
case when hits.ecommerceaction.action_type = '2' then 1 else 0 end as pdp_visitor,
count(case when hits.ecommerceaction.action_type = '2' then fullvisitorid else null end) AS views_pdp,
count(case when hits.ecommerceaction.action_type = '3' then fullvisitorid else null end) AS add_cart,
count(case when hits.ecommerceaction.action_type = '6' then hits.transaction.transactionid else null end) AS conversions,
count(distinct(hits.transaction.transactionId)) as transaction_id_cnt,
FROM `table` AS nr,
UNNEST(hits) hits,
UNNEST(product) product
GROUP BY 1,2,3,4,5
)
select
month,
sum(views_pdp) as pdp
,sum(add_cart) as add_cart
,sum(conversions) as conversions
,sum(transaction_id_cnt)
from t1
group by 1
order by 1 desc;
Which returns
month pdp add_cart conversions f0_
2021-02-01 500 100 20 10
2021-01-01 600 200 30 20
I know that f0_ ( count(distinct(hits.transaction.transactionId)) ) is bad here because of product.productSKU and product.v2ProductName grouping.
In general, when a user places an order with 3 items in the basket, I want to count this as one order, whereas now it is counted as 3.
This count(distinct(hits.transaction.transactionId)) as transaction_id_cnt results in the correct output if I comment out product.productSKU and product.v2ProductName.
Running this query:
with t1 as
(
SELECT
DATE_TRUNC(PARSE_DATE("%Y%m%d", date), MONTH) as month,
fullVisitorId,
-- product.productSKU, # commented out
-- product.v2ProductName, # commented out
case when hits.ecommerceaction.action_type = '2' then 1 else 0 end as pdp_visitor,
count(case when hits.ecommerceaction.action_type = '2' then fullvisitorid else null end) AS views_pdp,
count(case when hits.ecommerceaction.action_type = '3' then fullvisitorid else null end) AS add_cart,
count(case when hits.ecommerceaction.action_type = '6' then hits.transaction.transactionid else null end) AS conversions,
count(distinct(hits.transaction.transactionId)) as transaction_id_cnt,
FROM `table` AS nr,
UNNEST(hits) hits,
UNNEST(product) product
GROUP BY 1,2,3
)
select
month,
sum(views_pdp) as pdp
,sum(add_cart) as add_cart
,sum(conversions) as conversions
,sum(transaction_id_cnt)
from t1
group by 1
order by 1 desc;
This returns what is expected, but now I no longer have productSKU and v2ProductName, which I need. I suspect the problem is that each order item becomes a separate row in BigQuery, so when I group by product name and SKU, I count the distinct transactions per product and then sum those counts.
How can I achieve the correct summation of count(distinct(hits.transaction.transactionId)) without losing the grouping by product.productSKU and product.v2ProductName which explodes this metric?
In the GROUP BY query you could collect them as arrays, so that you don't have to group by them:
ARRAY_AGG(DISTINCT product.productSKU IGNORE NULLS) AS productSKU_list,
ARRAY_AGG(DISTINCT product.v2ProductName IGNORE NULLS) AS productName_list,
Update per your comment below: if you want to use them in a further GROUP BY, save them as strings instead of arrays.
STRING_AGG(DISTINCT product.productSKU, ',') AS productSKU_list,
STRING_AGG(DISTINCT product.v2ProductName, ',') AS productName_list,
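To see why grouping by product inflates the distinct count, and how aggregating the SKUs into a list avoids it, here is a small local sketch. It is an illustration under assumptions, not the Google Analytics export schema: it uses Python's sqlite3 module, a hypothetical flat table order_items, and SQLite's GROUP_CONCAT(DISTINCT ...) as a stand-in for BigQuery's STRING_AGG(DISTINCT ...).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_items (month TEXT, visitor TEXT, transaction_id TEXT, sku TEXT)"
)
conn.executemany(
    "INSERT INTO order_items VALUES (?, ?, ?, ?)",
    [
        # One order (t1) with three items, and one order (t2) with a single item.
        ("2021-02-01", "v1", "t1", "sku_a"),
        ("2021-02-01", "v1", "t1", "sku_b"),
        ("2021-02-01", "v1", "t1", "sku_c"),
        ("2021-02-01", "v2", "t2", "sku_a"),
    ],
)

# Grouping by sku first: order t1 appears in three groups, so it is counted three times.
exploded = conn.execute("""
    SELECT month, SUM(cnt) FROM (
        SELECT month, visitor, sku, COUNT(DISTINCT transaction_id) AS cnt
        FROM order_items
        GROUP BY month, visitor, sku)
    GROUP BY month
""").fetchall()

# Keeping the SKUs as an aggregated string instead of a grouping key.
per_visitor = conn.execute("""
    SELECT month, visitor,
           GROUP_CONCAT(DISTINCT sku) AS sku_list,
           COUNT(DISTINCT transaction_id) AS cnt
    FROM order_items
    GROUP BY month, visitor
    ORDER BY visitor
""").fetchall()

fixed = conn.execute("""
    SELECT month, SUM(cnt) FROM (
        SELECT month, visitor, COUNT(DISTINCT transaction_id) AS cnt
        FROM order_items
        GROUP BY month, visitor)
    GROUP BY month
""").fetchall()

print(exploded)  # [('2021-02-01', 4)]
print(fixed)     # [('2021-02-01', 2)]
print(per_visitor)
```

The product information is still available in sku_list, but it no longer splits an order across several groups.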

MSSQL Group by and Select rows from grouping

I'm trying to figure out if what I'm trying to do is possible. Instead of resorting to multiple queries on a table, I wanted to group the records by business date and id then group by the id and select one date for a field and another date for the other field.
SELECT
*
{AMOUNT FROM DATE}
{AMOUNT FROM OTHER DATE}
FROM (
SELECT
date,
id,
SUM(amount) AS amount
FROM
table
GROUP BY id, date
) AS subquery
GROUP BY id
It seems that you're looking to do a pivot query. I usually use cross tabs for this. Based on the query you posted, it could look like:
SELECT
id,
SUM(CASE WHEN date = '20190901' THEN amount ELSE 0 END) AmountFromSept01,
SUM(CASE WHEN date = '20191001' THEN amount ELSE 0 END) AmountFromOct01
FROM (
SELECT
date,
id,
SUM(amount) AS amount
FROM
table
GROUP BY id, date
)AS subquery
GROUP BY id;
You could also use a CTE.
WITH CTE AS(
SELECT
date,
id,
SUM(amount) AS amount
FROM
table
GROUP BY id, date
)
SELECT
id,
SUM(CASE WHEN date = '20190901' THEN amount ELSE 0 END) AmountFromSept01,
SUM(CASE WHEN date = '20191001' THEN amount ELSE 0 END) AmountFromOct01
FROM CTE
GROUP BY id;
Or even be a rebel and do the operation directly.
SELECT
id,
SUM(CASE WHEN date = '20190901' THEN amount ELSE 0 END) AmountFromSept01,
SUM(CASE WHEN date = '20191001' THEN amount ELSE 0 END) AmountFromOct01
FROM table
GROUP BY id;
However, some people have tested this and found that pre-aggregating can improve performance.
If I understand you correctly, then you're just trying to pivot, but only with two particular dates:
select id,
date1 = sum(iif(date = '2000-01-01', amount, null)),
date2 = sum(iif(date = '2000-01-02', amount, null))
from [table]
group by id
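To check that the pre-aggregated and the direct forms of this pivot give the same result, the queries can be run against a small in-memory database. This is a sketch under assumptions, not SQL Server code: it uses Python's sqlite3 module, a hypothetical table named payments, and made-up amounts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (date TEXT, id INTEGER, amount INTEGER)")
conn.executemany(
    "INSERT INTO payments VALUES (?, ?, ?)",
    [
        ("20190901", 1, 10),
        ("20190901", 1, 5),
        ("20191001", 1, 7),
        ("20190901", 2, 3),
        ("20191101", 2, 100),  # a date outside the pivot, ignored by both CASE expressions
    ],
)

# The two pivot columns, shared by both query variants.
PIVOT = """
    SUM(CASE WHEN date = '20190901' THEN amount ELSE 0 END) AS AmountFromSept01,
    SUM(CASE WHEN date = '20191001' THEN amount ELSE 0 END) AS AmountFromOct01
"""

# Variant 1: aggregate per (id, date) first, then pivot.
pre_aggregated = conn.execute(f"""
    WITH cte AS (
        SELECT date, id, SUM(amount) AS amount
        FROM payments
        GROUP BY id, date
    )
    SELECT id, {PIVOT} FROM cte GROUP BY id ORDER BY id
""").fetchall()

# Variant 2: pivot the base table directly.
direct = conn.execute(
    f"SELECT id, {PIVOT} FROM payments GROUP BY id ORDER BY id"
).fetchall()

print(pre_aggregated)  # [(1, 15, 7), (2, 3, 0)]
print(direct)          # [(1, 15, 7), (2, 3, 0)]
```

With ELSE 0 an id without rows for a date shows 0. Dropping the ELSE branch, or using iif(..., amount, null) as in the last query above, shows NULL instead.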

How to group data by different date period in sql?

For example, I want the data to be grouped from 1/2/2015 to 6/2/2015, 7/2/2015 to 12/2/2015 etc. So far I could only group them according to 1 date range by using the WHERE condition.
SELECT type, count(*)
from table1
WHERE Day(datefield) <=6
Group by type
table1:
type, datefield
typeA, '2015-2-1'
typeB, '2015-2-2'
typeB, '2015-2-9'
typeA, '2015-2-18'
typeB, '2015-2-28'
desired result:
type, no. for day 1-6, no. for day 7-12, no. for day 13-18, no. for day 19-24, no. for day 25-31
type A, 1, 0, 1, 0, 0
type B, 1, 1, 0, 0, 1
First create a derived table where each datefield is replaced by a label ('1-6', '7-12', etc.).
Then group that table by its label and type and get the counts. Finally, pivot the labels into columns
using conditional aggregation (max(case when ...)).
select type,
max(case when date_label = '1-6' then date_count end) '1-6',
max(case when date_label = '7-12' then date_count end) '7-12',
max(case when date_label = '13-18' then date_count end) '13-18',
max(case when date_label = '19-24' then date_count end) '19-24',
max(case when date_label = '25-31' then date_count end) '25-31'
from (
select
type,
date_label,
count(*) date_count
from
(select type,
case when day(datefield) <=6 then '1-6'
when day(datefield) <= 12 then '7-12'
when day(datefield) <= 18 then '13-18'
when day(datefield) <= 24 then '19-24'
when day(datefield) <= 31 then '25-31'
else 'n/a' end as date_label
from table1) t1
group by type, date_label
) t2 group by type
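A single-pass variant of the same conditional aggregation can be tried locally. The sketch below makes several assumptions: it uses Python's sqlite3 module, stores the dates as ISO strings, extracts the day with SQLite's strftime('%d', ...) in place of DAY(), and uses SUM(CASE ... ELSE 0 END) so that empty buckets show 0, as in the desired result, rather than NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (type TEXT, datefield TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [
        ("typeA", "2015-02-01"),
        ("typeB", "2015-02-02"),
        ("typeB", "2015-02-09"),
        ("typeA", "2015-02-18"),
        ("typeB", "2015-02-28"),
    ],
)

# The derived table extracts the day of month once; each SUM then counts
# the rows whose day falls into one bucket.
QUERY = """
SELECT type,
       SUM(CASE WHEN d BETWEEN 1 AND 6 THEN 1 ELSE 0 END) AS "1-6",
       SUM(CASE WHEN d BETWEEN 7 AND 12 THEN 1 ELSE 0 END) AS "7-12",
       SUM(CASE WHEN d BETWEEN 13 AND 18 THEN 1 ELSE 0 END) AS "13-18",
       SUM(CASE WHEN d BETWEEN 19 AND 24 THEN 1 ELSE 0 END) AS "19-24",
       SUM(CASE WHEN d BETWEEN 25 AND 31 THEN 1 ELSE 0 END) AS "25-31"
FROM (SELECT type, CAST(strftime('%d', datefield) AS INTEGER) AS d FROM table1) AS t
GROUP BY type
ORDER BY type
"""

buckets = conn.execute(QUERY).fetchall()
print(buckets)  # [('typeA', 1, 0, 1, 0, 0), ('typeB', 1, 1, 0, 0, 1)]
```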

Multiple Queries in different table

So I have two tables, one is invalid table and the other is valid table.
valid table:
id
status
date
invalid table:
id
status
date
I have to produce a report with this output:
date on-time late total valid invalid1 invalid2 total rate
--------- ------- ---- ----- ----- -------- -------- ----- ----
9/10/2011 4 10 14 3 3 3 6
date: field common to the 2 tables and the field to group by (one row per day)
on-time: count of all the id on the valid table
late: count of all the records(id) on the invalid table
total: total of on-time and late
valid: count of id on the valid table with the "valid" status
invalid1: count of id on the invalid table with "invalid1" status
invalid2: count of id on the invalid table with "invalid2" status
total: total of valid, invalid1, invalid2
rate: average of totals
It's basically multiple queries against different tables. How can I achieve it?
Something like this?
SELECT
*,
(result.total + result._total) / 2 AS rate
FROM (
SELECT
date,
SUM(CASE WHEN data.valid = 1 THEN 1 ELSE 0 END) AS ontime,
SUM(CASE WHEN data.valid = 0 THEN 1 ELSE 0 END) AS late,
COUNT(*) AS total,
SUM(CASE WHEN data.valid = 1 AND data.status = 'valid' THEN 1 ELSE 0 END) AS valid,
SUM(CASE WHEN data.valid = 0 AND data.status = 'invalid1' THEN 1 ELSE 0 END) AS invalid1,
SUM(CASE WHEN data.valid = 0 AND data.status = 'invalid2' THEN 1 ELSE 0 END) AS invalid2,
SUM(CASE WHEN data.status IN ('valid', 'invalid1', 'invalid2') THEN 1 ELSE 0 END) AS _total
FROM (
SELECT
date,
status,
valid = 1
FROM
Valid
UNION ALL
SELECT
date,
status,
valid = 0
FROM
InValid ) AS data
GROUP BY
date) AS result
SELECT date, ontime, late, ontime+late total, valid, invalid1, invalid2, valid+invalid1+invalid2 total
FROM
(SELECT date,
COUNT(*) late,
COUNT(IIF(status = 'invalid1', 1, NULL)) invalid1,
COUNT(IIF(status = 'invalid2', 1, NULL)) invalid2
FROM invalid
GROUP BY date
) JOIN (
SELECT date,
COUNT(*) ontime,
COUNT(IIF(status = 'valid', 1, NULL)) valid
FROM valid
GROUP BY date
) USING (date)
First of all, it seems that you are holding exactly the same information in 2 tables. I would recommend merging those tables and adding a boolean column called valid to record the validity of each row.
A query against your existing DB structure might look something like this:
SELECT unioned.date, SUM(unioned.valid) AS valid, SUM(unioned.invalid1) AS invalid1, SUM(unioned.invalid2) AS invalid2 FROM (
( SELECT v.date AS date, COUNT(v.id) AS valid, 0 AS invalid1, 0 AS invalid2 FROM valid v WHERE v.status = 'valid' GROUP BY v.date)
UNION ALL
( SELECT i1.date AS date, 0 AS valid, COUNT(i1.id) AS invalid1, 0 AS invalid2 FROM invalid i1 WHERE i1.status = 'invalid1' GROUP BY i1.date)
UNION ALL
( SELECT i2.date AS date, 0 AS valid, 0 AS invalid1, COUNT(i2.id) AS invalid2 FROM invalid i2 WHERE i2.status = 'invalid2' GROUP BY i2.date)
) AS unioned GROUP BY unioned.date
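The approach of stacking both tables with a flag column and then aggregating conditionally can be tried on a small data set. This is a sketch under assumptions: it uses Python's sqlite3 module, hypothetical table names valid_table and invalid_table, made-up rows (including a hypothetical 'pending' status), and it leaves out the rate column because its definition is not fully specified in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE valid_table (id INTEGER, status TEXT, date TEXT)")
conn.execute("CREATE TABLE invalid_table (id INTEGER, status TEXT, date TEXT)")
conn.executemany(
    "INSERT INTO valid_table VALUES (?, ?, ?)",
    [
        (1, "valid", "2011-09-10"),
        (2, "valid", "2011-09-10"),
        (3, "valid", "2011-09-10"),
        (4, "pending", "2011-09-10"),
    ],
)
conn.executemany(
    "INSERT INTO invalid_table VALUES (?, ?, ?)",
    [
        (5, "invalid1", "2011-09-10"),
        (6, "invalid1", "2011-09-10"),
        (7, "invalid2", "2011-09-10"),
    ],
)

# Stack both tables, tagging each row with its origin, then count per date.
QUERY = """
SELECT date,
       SUM(CASE WHEN is_valid = 1 THEN 1 ELSE 0 END) AS ontime,
       SUM(CASE WHEN is_valid = 0 THEN 1 ELSE 0 END) AS late,
       COUNT(*) AS total,
       SUM(CASE WHEN is_valid = 1 AND status = 'valid' THEN 1 ELSE 0 END) AS valid,
       SUM(CASE WHEN is_valid = 0 AND status = 'invalid1' THEN 1 ELSE 0 END) AS invalid1,
       SUM(CASE WHEN is_valid = 0 AND status = 'invalid2' THEN 1 ELSE 0 END) AS invalid2
FROM (
    SELECT date, status, 1 AS is_valid FROM valid_table
    UNION ALL
    SELECT date, status, 0 AS is_valid FROM invalid_table
) AS data
GROUP BY date
"""

report = conn.execute(QUERY).fetchall()
print(report)  # [('2011-09-10', 4, 3, 7, 3, 2, 1)]
```

UNION ALL matters here: a plain UNION would drop duplicate (date, status, is_valid) rows and undercount.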

change rows to columns and count

how to calculate count based on rows?
SOURCE TABLE
each employee can take 2 days off
Employee-----First_Day_Off-----Second_Day_Off
1------------10/21/2009--------12/6/2009
2------------09/3/2009--------12/6/2009
3------------09/3/2009--------NULL
4
5
.
.
.
Now I need a table that shows the dates and the number of people taking that day off
Date---------First_Day_Off-------Second_Day_Off
10/21/2009---1-------------------0
12/06/2009---1--------------------1
09/3/2009----2--------------------0
Any ideas?
Oracle 9i+, using Subquery Factoring (WITH):
WITH sample AS (
SELECT a.employee,
a.first_day_off AS day_off,
1 AS day_number
FROM YOUR_TABLE a
WHERE a.first_day_off IS NOT NULL
UNION ALL
SELECT b.employee,
b.second_day_off,
2 AS day_number
FROM YOUR_TABLE b
WHERE b.second_day_off IS NOT NULL)
SELECT s.day_off AS date,
SUM(CASE WHEN s.day_number = 1 THEN 1 ELSE 0 END) AS first_day_off,
SUM(CASE WHEN s.day_number = 2 THEN 1 ELSE 0 END) AS second_day_off
FROM sample s
GROUP BY s.day_off
Version without subquery factoring (inline view):
SELECT s.day_off AS date,
SUM(CASE WHEN s.day_number = 1 THEN 1 ELSE 0 END) AS first_day_off,
SUM(CASE WHEN s.day_number = 2 THEN 1 ELSE 0 END) AS second_day_off
FROM (SELECT a.employee,
a.first_day_off AS day_off,
1 AS day_number
FROM YOUR_TABLE a
WHERE a.first_day_off IS NOT NULL
UNION ALL
SELECT b.employee,
b.second_day_off,
2 AS day_number
FROM YOUR_TABLE b
WHERE b.second_day_off IS NOT NULL) s
GROUP BY s.day_off
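The unpivot-then-count pattern above can be tried locally. The sketch below is not Oracle code and makes several assumptions: it uses Python's sqlite3 module, a hypothetical table named days_off holding the first three employees from the question, and ISO date strings instead of the original date format.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE days_off (employee INTEGER, first_day_off TEXT, second_day_off TEXT)"
)
conn.executemany(
    "INSERT INTO days_off VALUES (?, ?, ?)",
    [
        (1, "2009-10-21", "2009-12-06"),
        (2, "2009-09-03", "2009-12-06"),
        (3, "2009-09-03", None),
    ],
)

# Turn the two date columns into rows tagged with their origin (1 or 2),
# then count per date how many rows came from each column.
QUERY = """
WITH unpivoted AS (
    SELECT employee, first_day_off AS day_off, 1 AS day_number
    FROM days_off
    WHERE first_day_off IS NOT NULL
    UNION ALL
    SELECT employee, second_day_off, 2
    FROM days_off
    WHERE second_day_off IS NOT NULL
)
SELECT day_off,
       SUM(CASE WHEN day_number = 1 THEN 1 ELSE 0 END) AS first_day_off,
       SUM(CASE WHEN day_number = 2 THEN 1 ELSE 0 END) AS second_day_off
FROM unpivoted
GROUP BY day_off
ORDER BY day_off
"""

counts = conn.execute(QUERY).fetchall()
print(counts)  # [('2009-09-03', 2, 0), ('2009-10-21', 1, 0), ('2009-12-06', 0, 2)]
```

With these rows, 2009-12-06 is the second day off for both employee 1 and employee 2, so it is counted twice in the second column and not at all in the first.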
It is a bit awkward to handle these queries, since you have days off stored in different columns. A better layout would be to have something like
EMPLOYEE_ID DAY_OFF
Then you would have multiple rows if an employee took multiple days off
EMPLOYEE_ID DAY_OFF
1 10/21/2009
1 12/6/2009
2 09/3/2009
2 12/6/2009
3 09/3/2009
...
In that case, you could find out how many days off each person took by using the following query:
SELECT EMPLOYEE_ID, COUNT(*) AS NUM_DAYS_OFF FROM DAYS_OFF_TABLE GROUP BY EMPLOYEE_ID
And the number of people who took days off on each date like this:
SELECT DAY_OFF, COUNT(*) AS NUM_PEOPLE FROM DAYS_OFF_TABLE GROUP BY DAY_OFF
But I digress...
You can try to use an SQL CASE statement to help with this:
SELECT Employee, CASE
WHEN First_Day_Off is NULL AND Second_Day_Off is NULL THEN 0
WHEN First_Day_Off is NOT NULL AND Second_Day_Off is NULL THEN 1
WHEN First_Day_Off is NULL AND Second_Day_Off is NOT NULL THEN 1
ELSE 2
END AS NUM_DAYS_OFF
FROM DAYS_OFF_TABLE
(Note that you may need to adjust the syntax slightly depending on your database.)
Getting dates and number of people who took off on that day might be more complicated.
I don't know if this would work, but you can try it:
SELECT
Date_Off,
SUM(Num_People) AS Num_People
FROM
(SELECT
First_Day_Off AS Date_Off, COUNT(*) AS Num_People FROM DAYS_OFF_TABLE WHERE First_Day_Off IS NOT NULL GROUP BY First_Day_Off
UNION ALL
SELECT Second_Day_Off, COUNT(*) AS Num_People FROM DAYS_OFF_TABLE WHERE Second_Day_Off IS NOT NULL GROUP BY Second_Day_Off) days
GROUP BY
Date_Off