BigQuery missing rows with SUM OVER PARTITION BY - sql

TL;DR:
Given this table:
WITH subscriptions AS (SELECT TIMESTAMP("2020-11-01") as date, "premium" as product, 50 as diff
UNION ALL SELECT TIMESTAMP("2020-11-01"), "basic", 100
UNION ALL SELECT TIMESTAMP("2020-11-02"), "basic", -10
UNION ALL SELECT TIMESTAMP("2020-11-03"), "premium", 20
UNION ALL SELECT TIMESTAMP("2020-11-03"), "basic", 40
)
How do I get a table where the missing date/product combination (2020-11-02, premium) is included with a fallback diff value of 0?
Ideally, for multiple products. A list of all products can be obtained like this:
SELECT ARRAY_AGG(DISTINCT product) FROM subscriptions
I want to be able to get the subscription count per day, either for all products or just for some of them.
I think the easiest way to achieve this is by preparing a table that looks like this:
| date       | product | total |
|------------|---------|-------|
| 2020-11-01 | premium | 100   |
| 2020-11-01 | basic   | 50    |
With this table, I can easily group by date and product or just by date and sum the total.
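For example, with such a table in place (calling it totals here, purely for illustration), the per-day rollup would be a simple aggregate:

SELECT date, SUM(total) AS total_subscriptions
FROM totals  -- hypothetical name for the prepared result table
GROUP BY date
ORDER BY date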
Before I get to the result table, I generate an intermediate table where, for each day and product, I calculate the change in subscriptions: how many new subscribers there are for each product, and how many are no longer subscribed.
This table looks like this:
| date       | product | diff |
|------------|---------|------|
| 2020-11-01 | premium | 50   |
| 2020-11-01 | basic   | -20  |
Meaning on November 1st, the total count of premium subscribers increased by 50 and the total count of basic subscribers decreased by 20.
The problem now is that this temporary table is missing data points for days where a product saw no changes; see the example below.
When I started, there was no product column and I only had the date and diff columns.
To get from the second to the first table I used this query, which worked perfectly:
WITH subscriptions AS (SELECT TIMESTAMP("2020-11-01") as date, 150 as diff
UNION ALL SELECT TIMESTAMP("2020-11-02"), -10
UNION ALL SELECT TIMESTAMP("2020-11-03"), 60
)
SELECT
*,
SUM(diff) OVER (ORDER BY date) as total_subscriptions
FROM subscriptions
ORDER BY date
But when I add the product column and try to calculate the sum per day and product there are some data points missing.
WITH subscriptions AS (SELECT TIMESTAMP("2020-11-01") as date, "premium" as product, 50 as diff
UNION ALL SELECT TIMESTAMP("2020-11-01"), "basic", 100
UNION ALL SELECT TIMESTAMP("2020-11-02"), "basic", -10
UNION ALL SELECT TIMESTAMP("2020-11-03"), "premium", 20
UNION ALL SELECT TIMESTAMP("2020-11-03"), "basic", 40
)
SELECT
*,
SUM(diff) OVER (PARTITION BY product ORDER BY date) as total_subscriptions
FROM subscriptions
ORDER BY date
--
| date       | product | total |
|------------|---------|-------|
| 2020-11-01 | basic   | 100   |
| 2020-11-01 | premium | 50    |
| 2020-11-02 | basic   | 90    |
| 2020-11-03 | basic   | 130   |
| 2020-11-03 | premium | 70    |
If I now show the total number of subscriptions per day, I would get:
150 -> 90 -> 200
But I would expect:
150 -> 140 -> 200
Same goes for the total number of premium subscriptions per day:
50 -> 0 -> 70
But I would expect:
50 -> 50 -> 70
I believe the best option to fix this would be to add the missing date/product combinations.
How would I do this?

-- Try this. I am creating a list of products and adding a "Total" product to that list, then joining it with your table to get the data as per your requirement.
WITH subscriptions AS (
  SELECT TIMESTAMP("2020-11-01") AS date, "premium" AS product, 50 AS diff
  UNION ALL SELECT TIMESTAMP("2020-11-01"), "basic", 100
  UNION ALL SELECT TIMESTAMP("2020-11-02"), "basic", -10
  UNION ALL SELECT TIMESTAMP("2020-11-03"), "premium", 20
  UNION ALL SELECT TIMESTAMP("2020-11-03"), "basic", 40
),
product_name AS (
  SELECT product FROM subscriptions GROUP BY 1
  UNION ALL
  SELECT "Total" AS product
)
SELECT date, product, total_subscriptions
FROM (
  SELECT a.date,
    a.product,
    diff,
    SUM(diff) OVER (PARTITION BY a.product ORDER BY a.date) AS total_subscriptions
  FROM (
    SELECT date, a.product
    FROM product_name a
    JOIN subscriptions b ON 1 = 1
    WHERE a.product != 'Total'
    GROUP BY 1, 2
  ) a
  LEFT JOIN subscriptions b
    ON a.product = b.product
    AND a.date = b.date
  GROUP BY 1, 2, 3
)
GROUP BY 1, 2, 3
UNION ALL
SELECT date, product, total_subscriptions
FROM (
  SELECT date, a.product,
    diff,
    SUM(diff) OVER (PARTITION BY a.product ORDER BY date) AS total_subscriptions
  FROM product_name a
  JOIN subscriptions b ON 1 = 1
  WHERE a.product = 'Total'
  GROUP BY 1, 2, 3
)
GROUP BY 1, 2, 3
ORDER BY 1, 2

If I follow you correctly, one approach is to generate a fixed list of dates for the period you want and cross join it with the list of products. This gives you all possible combinations. Then you can bring in the subscriptions table with a left join, and finally perform the window sum:
select dt, p.product, sum(s.diff) over(partition by p.product order by dt) total
from unnest(generate_timestamp_array(
    timestamp('2020-11-01'),
    timestamp('2020-11-03'),
    interval 1 day)
) dt
cross join (
    select 'basic' product
    union all select 'premium'
) p
left join subscriptions s on s.product = p.product and s.date = dt
We can make the query more generic by dynamically generating the date range and the list of products:
select dt, p.product, sum(s.diff) over(partition by p.product order by dt) total
from (select min(date) min_dt, max(date) max_dt from subscriptions) d0
cross join unnest(generate_timestamp_array(d0.min_dt, d0.max_dt, interval 1 day)) dt
cross join (select distinct product from subscriptions) p
left join subscriptions s on s.product = p.product and s.date = dt
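Note that SUM() skips NULLs, so the running total already carries over the filled-in days. If you also want the diff itself to show the fallback value of 0 that the question asks for, wrapping it in ifnull is enough; a sketch building on the query above:

select dt, p.product,
    ifnull(s.diff, 0) diff,  -- fallback 0 for missing date/product combinations
    sum(ifnull(s.diff, 0)) over(partition by p.product order by dt) total
from (select min(date) min_dt, max(date) max_dt from subscriptions) d0
cross join unnest(generate_timestamp_array(d0.min_dt, d0.max_dt, interval 1 day)) dt
cross join (select distinct product from subscriptions) p
left join subscriptions s on s.product = p.product and s.date = dt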

Use GENERATE_TIMESTAMP_ARRAY:
WITH subscriptions AS (SELECT TIMESTAMP("2020-11-01") as date, "premium" as product, 50 as diff
UNION ALL SELECT TIMESTAMP("2020-11-01"), "basic", 100
UNION ALL SELECT TIMESTAMP("2020-11-02"), "basic", -10
UNION ALL SELECT TIMESTAMP("2020-11-03"), "premium", 20
UNION ALL SELECT TIMESTAMP("2020-11-03"), "basic", 40
),
dates AS (
SELECT *
FROM UNNEST(GENERATE_TIMESTAMP_ARRAY('2020-11-01 00:00:00', '2020-11-03 00:00:00', INTERVAL 1 DAY)) as date
),
products AS (
SELECT DISTINCT product FROM subscriptions
)
SELECT dates.date, products.product, subscriptions.diff
FROM dates
CROSS JOIN products
LEFT JOIN subscriptions
ON subscriptions.date = dates.date AND subscriptions.product = products.product
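This produces the full grid with NULL diff for the filled-in combinations. To get the fallback of 0 and the running total the question asks for, the final SELECT could be extended like this (a sketch reusing the same dates and products CTEs):

SELECT dates.date, products.product,
  IFNULL(subscriptions.diff, 0) AS diff,  -- fallback 0 for missing combinations
  SUM(IFNULL(subscriptions.diff, 0)) OVER (PARTITION BY products.product ORDER BY dates.date) AS total_subscriptions
FROM dates
CROSS JOIN products
LEFT JOIN subscriptions
ON subscriptions.date = dates.date AND subscriptions.product = products.product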

Related

Fill the data on the missing date range

I have a table with the existing data below:
Select Date, [Closing Quantity] from StockClosing
Date | Closing Quantity
---------------------------
20200828 | 5
20200901 | 10
20200902 | 8
20200904 | 15
20200905 | 18
There are some missing dates in the table, for example 20200829 to 20200831 and 20200903.
The closing quantity for a missing date should carry over from the previous day's closing quantity.
I would like to select the result over the full range of dates (showing every day) with the closing quantity. Expected result:
Date | Closing Quantity
---------------------------
20200828 | 5
20200829 | 5
20200830 | 5
20200831 | 5
20200901 | 10
20200902 | 8
20200903 | 8
20200904 | 15
20200905 | 18
Besides using a cursor/for loop to insert the missing dates and data one by one, is there any SQL command that can do it all at once?
You have the option to use a recursive CTE.
;WITH cte AS (
    SELECT MAX(date) date FROM YourTable
), cte1 AS (
    SELECT MIN(date) date FROM YourTable
    UNION ALL
    SELECT DATEADD(dd, 1, cte1.date) date FROM cte1 WHERE date < (SELECT date FROM cte)
)
SELECT c.date,
    ISNULL(y.[Closing Quantity],
        (SELECT TOP 1 a.[Closing Quantity] FROM YourTable a WHERE c.date > a.date ORDER BY a.date DESC)
    ) AS [Closing Quantity]
FROM cte1 c
LEFT JOIN YourTable y ON c.date = y.date
The easiest way to do this is to use LAST_VALUE along with the IGNORE NULLS option. Sadly, SQL Server does not support this. There is a workaround using analytic functions, but I would actually offer this simple option, which uses a correlated subquery to fill in the missing values:
WITH dates AS (
SELECT '20200828' AS Date UNION ALL
SELECT '20200829' UNION ALL
SELECT '20200830' UNION ALL
SELECT '20200831' UNION ALL
SELECT '20200901' UNION ALL
SELECT '20200902' UNION ALL
SELECT '20200903' UNION ALL
SELECT '20200904' UNION ALL
SELECT '20200905'
)
SELECT
d.Date,
(SELECT TOP 1 t2.closing FROM StockClosing t2
WHERE t2.Date <= d.Date AND t2.closing IS NOT NULL
ORDER BY t2.Date DESC) AS closing
FROM dates d
LEFT JOIN StockClosing t1
ON d.Date = t1.Date;
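On engines that do support IGNORE NULLS (BigQuery and Oracle, for example), the same gap fill can be written directly with a window function. A sketch, assuming the same dates and StockClosing tables as above:

SELECT d.Date,
    -- carry the last non-NULL closing value forward across missing dates
    LAST_VALUE(t1.closing IGNORE NULLS) OVER (ORDER BY d.Date) AS closing
FROM dates d
LEFT JOIN StockClosing t1
    ON d.Date = t1.Date;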

Find the customers and other metrics based on the purchase frequency new & repeat

I am trying to find the customer count and sales by the type of customer (New and Returning) and the number of times they have purchased.
| txn_date  | Customer_ID | Transaction_Number | Sales | Reference (not in the SQL table) | Customer type (not in the SQL table) |
|-----------|-------------|--------------------|-------|----------------------------------|--------------------------------------|
| 1/2/2019  | 1           | 12345              | $10   | Second Purchase SLS              | Repeat                               |
| 4/3/2018  | 1           | 65890              | $20   | First Purchase SLS               | Repeat                               |
| 3/22/2019 | 3           | 64453              | $30   | First Purchase SLS               | new                                  |
| 4/3/2019  | 4           | 88567              | $20   | First Purchase SLS               | new                                  |
| 5/21/2019 | 4           | 85446              | $15   | Second Purchase SLS              | new                                  |
| 1/23/2018 | 5           | 89464              | $40   | First Purchase SLS               | Repeat                               |
| 4/3/2019  | 5           | 99674              | $30   | Second Purchase SLS              | Repeat                               |
| 4/3/2019  | 6           | 32224              | $20   | Second Purchase SLS              | Repeat                               |
| 1/23/2018 | 6           | 46466              | $30   | First Purchase SLS               | Repeat                               |
| 1/20/2018 | 7           | 56558              | $30   | First Purchase SLS               | new                                  |
I am using the below code to get the aggregate sales and customer count for the total customers:
select seqnum, count(distinct customer_id), sum(sales) from (
select co.*,
row_number() over (partition by customer_id order by txn_date) as seqnum
from somya co)
group by seqnum
order by seqnum;
I want to get the same data by the customer type:
For example, for the new customers my result should show:

| New Customers | Customer_Count | Sum(Sales) |
|---------------|----------------|------------|
| 1st Purchase  | 3              | $80        |
| 2nd Purchase  | 1              | $15        |

| Returning Customers | Customer_Count | Sum(Sales) |
|---------------------|----------------|------------|
| 1st Purchase        | 3              | $90        |
| 2nd Purchase        | 3              | $60        |
I am trying the below query to get the data for new and repeat customers:
New Customers:
select seqnum, count(distinct customer_id), sum(sales)
from (
select co.*,
row_number() over (partition by customer_id order by trunc(txn_date)) as seqnum,
MIN (TRUNC (TXN_DATE)) OVER (PARTITION BY customer_id) as MIN_TXN_DATE
from somya co
)
where MIN_TXN_DATE between '01-JAN-19' and '31-DEC-19'
group by seqnum
order by seqnum asc;
Returning Customers:
select seqnum, count(distinct customer_id), sum(sales)
from (
select co.*,
row_number() over (partition by customer_id order by trunc(txn_date)) as seqnum,
MIN (TRUNC (TXN_DATE)) OVER (PARTITION BY customer_id) as MIN_TXN_DATE
from somya co
)
where MIN_TXN_DATE <'01-JAN-19'
group by seqnum
order by seqnum asc;
I am not able to figure out what is wrong with my query or if there is a problem with my logic.
This is just sample data; I have transactions from all years in my database, so I need to narrow down the transaction date in the query. But as soon as I narrow down the data using the transaction date, the repeat-customer query doesn't give me anything and the new-customer query gives me the total customers for that period.
If I understand correctly, you need to know the first time someone becomes a customer. And then use this:
select (case when first_year < 2019 then 'returning' else 'new' end) as custtype,
seqnum, count(*), sum(sales)
from (select co.*,
row_number() over (partition by customer_id, extract(year from txn_date) order by txn_date) as seqnum,
min(extract(year from txn_date)) over (partition by customer_id) as first_year
from somya co
) s
where txn_date >= date '2019-01-01' and
txn_date < date '2020-01-01'
group by (case when first_year < 2019 then 'returning' else 'new' end),
seqnum
order by custtype, seqnum;
You can categorize your sales data to assign a customer type and a purchase sequence using windowing functions, like this:
SELECT sd.txn_date,
sd.customer_id,
sd.transaction_number,
sd.sales,
case when min(txn_date) over ( partition by customer_id ) < DATE '2019-01-01'
AND max(txn_date) OVER ( partition by customer_id ) >= DATE '2019-01-01'
THEN 'Repeat'
ELSE 'New' END customer_type,
row_number() over ( partition by customer_id order by txn_date) purchase_sequence
FROM sales_data sd
+-----------+-------------+--------------------+-------+---------------+-------------------+
| TXN_DATE | CUSTOMER_ID | TRANSACTION_NUMBER | SALES | CUSTOMER_TYPE | PURCHASE_SEQUENCE |
+-----------+-------------+--------------------+-------+---------------+-------------------+
| 03-APR-18 | 1 | 65890 | 20 | Repeat | 1 |
| 02-JAN-19 | 1 | 12345 | 10 | Repeat | 2 |
| 22-MAR-19 | 3 | 64453 | 30 | New | 1 |
| 03-APR-19 | 4 | 88567 | 20 | New | 1 |
| 21-MAY-19 | 4 | 85446 | 15 | New | 2 |
| 23-JAN-18 | 5 | 89464 | 40 | Repeat | 1 |
| 03-APR-19 | 5 | 99674 | 30 | Repeat | 2 |
| 23-JAN-18 | 6 | 46466 | 30 | Repeat | 1 |
| 03-APR-19 | 6 | 32224 | 20 | Repeat | 2 |
| 20-JAN-18 | 7 | 56558 | 30 | New | 1 |
+-----------+-------------+--------------------+-------+---------------+-------------------+
Then, you can wrap that in a common table expression (aka "WITH" clause) and summarize by the customer type and purchase sequence:
WITH categorized_sales_data AS (
SELECT sd.txn_date,
sd.customer_id,
sd.transaction_number,
sd.sales,
case when min(txn_date) over ( partition by customer_id ) < DATE '2019-01-01' AND max(txn_date) OVER ( partition by customer_id ) >= DATE '2019-01-01' THEN 'Repeat' ELSE 'New' END customer_type,
row_number() over ( partition by customer_id order by txn_date) purchase_sequence
FROM sales_data sd)
SELECT customer_type, purchase_sequence, count(*), sum(sales)
FROM categorized_sales_data
group by customer_type, purchase_sequence
order by customer_type, purchase_sequence
+---------------+-------------------+----------+------------+
| CUSTOMER_TYPE | PURCHASE_SEQUENCE | COUNT(*) | SUM(SALES) |
+---------------+-------------------+----------+------------+
| New | 1 | 3 | 80 |
| New | 2 | 1 | 15 |
| Repeat | 1 | 3 | 90 |
| Repeat | 2 | 3 | 60 |
+---------------+-------------------+----------+------------+
Here's a full SQL with test data:
with sales_data (txn_date, Customer_ID, Transaction_Number, Sales ) as (
SELECT TO_DATE('1/2/2019','MM/DD/YYYY'), 1, 12345, 10 FROM DUAL UNION ALL
SELECT TO_DATE('4/3/2018','MM/DD/YYYY'), 1, 65890, 20 FROM DUAL UNION ALL
SELECT TO_DATE('3/22/2019','MM/DD/YYYY'), 3, 64453, 30 FROM DUAL UNION ALL
SELECT TO_DATE('4/3/2019','MM/DD/YYYY'), 4, 88567, 20 FROM DUAL UNION ALL
SELECT TO_DATE('5/21/2019','MM/DD/YYYY'), 4, 85446, 15 FROM DUAL UNION ALL
SELECT TO_DATE('1/23/2018','MM/DD/YYYY'), 5, 89464, 40 FROM DUAL UNION ALL
SELECT TO_DATE('4/3/2019','MM/DD/YYYY'), 5, 99674, 30 FROM DUAL UNION ALL
SELECT TO_DATE('4/3/2019','MM/DD/YYYY'), 6, 32224, 20 FROM DUAL UNION ALL
SELECT TO_DATE('1/23/2018','MM/DD/YYYY'), 6, 46466, 30 FROM DUAL UNION ALL
SELECT TO_DATE('1/20/2018','MM/DD/YYYY'), 7, 56558, 30 FROM DUAL ),
-- Query starts here
/* WITH */ categorized_sales_data AS (
SELECT sd.txn_date,
sd.customer_id,
sd.transaction_number,
sd.sales,
case when min(txn_date) over ( partition by customer_id ) < DATE '2019-01-01' AND max(txn_date) OVER ( partition by customer_id ) >= DATE '2019-01-01' THEN 'Repeat' ELSE 'New' END customer_type,
row_number() over ( partition by customer_id order by txn_date) purchase_sequence
FROM sales_data sd)
SELECT customer_type, purchase_sequence, count(*), sum(sales)
FROM categorized_sales_data
group by customer_type, purchase_sequence
order by customer_type, purchase_sequence
Response to comment from OP
all the customers whose first purchase date is in 2019 would be a new customer. Any customer who has transacted in 2019 but their first purchase date is before 2019 would be a repeat customer
So, change
case when min(txn_date) over ( partition by customer_id ) < DATE '2019-01-01'
AND max(txn_date) OVER ( partition by customer_id ) >= DATE '2019-01-01'
THEN 'Repeat' ELSE 'New' END customer_type
to
case when min(txn_date) over ( partition by customer_id )
BETWEEN DATE '2019-01-01' AND DATE '2020-01-01' - INTERVAL '1' SECOND
THEN 'New' ELSE 'Repeat' END customer_type
i.e., if and only if a customer's first purchase was in 2019 then they are "new".

Get last known record per month in BigQuery

An account balance collection that shows the account balance of a customer on a given day:
+---------------+---------+------------+
| customer_id | value | timestamp |
+---------------+---------+------------+
| 1 | -500 | 2019-10-12 |
| 1 | -300 | 2019-10-11 |
| 1 | -200 | 2019-10-10 |
| 1 | 0 | 2019-10-09 |
| 2 | 200 | 2019-09-10 |
| 1 | 600 | 2019-09-02 |
+---------------+---------+------------+
Notice that customer #2 had no updates to his account balance in October.
I want to get the last account balance per customer per month. If there has been no account balance update for a customer in a given month, the last known account balance should be transferred to the current month. The result should look like that:
+---------------+---------+------------+
| customer_id | value | timestamp |
+---------------+---------+------------+
| 1 | -500 | 2019-10-12 |
| 2 | 200 | 2019-10-10 |
| 2 | 200 | 2019-09-10 |
| 1 | 600 | 2019-09-02 |
+---------------+---------+------------+
Since the account balance of customer #2 was not updated in October but in September, we create a copy of the row from September changing the date to October. Any ideas how to achieve this in BigQuery?
Below is for BigQuery Standard SQL
#standardSQL
WITH customers AS (
SELECT DISTINCT customer_id FROM `project.dataset.table`
), months AS (
SELECT month FROM (
SELECT DATE_TRUNC(MIN(timestamp), MONTH) min_month, DATE_TRUNC(MAX(timestamp), MONTH) max_month
FROM `project.dataset.table`
), UNNEST(GENERATE_DATE_ARRAY(min_month, max_month, INTERVAL 1 MONTH)) month
)
SELECT customer_id,
IFNULL(value, LEAD(value) OVER(win)) value,
IFNULL(timestamp, DATE_ADD(LEAD(timestamp) OVER(win), INTERVAL DATE_DIFF(month, LEAD(month) OVER(win), MONTH) MONTH)) timestamp
FROM months, customers
LEFT JOIN (
SELECT DATE_TRUNC(timestamp, MONTH) month, customer_id,
ARRAY_AGG(STRUCT(value, timestamp) ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)].*
FROM `project.dataset.table`
GROUP BY month, customer_id
) USING(month, customer_id)
WINDOW win AS (PARTITION BY customer_id ORDER BY month DESC)
If applied to the sample data from your question, as in the example below:
#standardSQL
WITH `project.dataset.table` AS (
SELECT 1 customer_id, -500 value, DATE '2019-10-12' timestamp UNION ALL
SELECT 1, -300, '2019-10-11' UNION ALL
SELECT 1, -200, '2019-10-10' UNION ALL
SELECT 2, 200, '2019-09-10' UNION ALL
SELECT 2, 100, '2019-08-11' UNION ALL
SELECT 2, 50, '2019-07-12' UNION ALL
SELECT 1, 600, '2019-09-02'
), customers AS (
SELECT DISTINCT customer_id FROM `project.dataset.table`
), months AS (
SELECT month FROM (
SELECT DATE_TRUNC(MIN(timestamp), MONTH) min_month, DATE_TRUNC(MAX(timestamp), MONTH) max_month
FROM `project.dataset.table`
), UNNEST(GENERATE_DATE_ARRAY(min_month, max_month, INTERVAL 1 MONTH)) month
)
SELECT customer_id,
IFNULL(value, LEAD(value) OVER(win)) value,
IFNULL(timestamp, DATE_ADD(LEAD(timestamp) OVER(win), INTERVAL DATE_DIFF(month, LEAD(month) OVER(win), MONTH) MONTH)) timestamp
FROM months, customers
LEFT JOIN (
SELECT DATE_TRUNC(timestamp, MONTH) month, customer_id,
ARRAY_AGG(STRUCT(value, timestamp) ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)].*
FROM `project.dataset.table`
GROUP BY month, customer_id
) USING(month, customer_id)
WINDOW win AS (PARTITION BY customer_id ORDER BY month DESC)
-- ORDER BY month DESC, customer_id
result is

| Row | customer_id | value | timestamp  |
|-----|-------------|-------|------------|
| 1   | 1           | -500  | 2019-10-12 |
| 2   | 2           | 200   | 2019-10-10 |
| 3   | 1           | 600   | 2019-09-02 |
| 4   | 2           | 200   | 2019-09-10 |
| 5   | 1           | null  | null       |
| 6   | 2           | 100   | 2019-08-11 |
| 7   | 1           | null  | null       |
| 8   | 2           | 50    | 2019-07-12 |
The following query should mostly answer your question by creating a 'month-end' record for each customer for every month and getting the most recent balance:
with
-- Generate a set of months
month_begins as (
select dt from unnest(generate_date_array('2019-01-01','2019-12-01', interval 1 month)) dt
),
-- Get the month ends
month_ends as (
select date_sub(date_add(dt, interval 1 month), interval 1 day) as month_end_date from month_begins
),
-- Cross Join and group so we get 1 customer record for every month to account for
-- situations where customer doesn't change balance in a month
user_month_ends as (
select
customer_id,
month_end_date
from `project.dataset.table`
cross join month_ends
group by 1,2
),
-- Fan out so for each month end, you get all balances prior to month end for each customer
values_prior_to_month_end as (
select
customer_id,
value,
timestamp,
month_end_date
from `project.dataset.table`
inner join user_month_ends using(customer_id)
where timestamp <= month_end_date
),
-- Order by most recent balance before month end, even if it was more than 1+ months ago
ordered as (
select
*,
row_number() over (partition by customer_id, month_end_date order by timestamp desc) as my_row
from values_prior_to_month_end
),
-- Finally, select only the most recent record for each customer per month
final as (
select
* except(my_row)
from ordered
where my_row = 1
)
select * from final
order by customer_id, month_end_date desc
A few caveats:
- I did not order the results to match your desired result set, and I also kept a month-end date to illustrate the concept. You can easily change the ordering and exclude unneeded fields.
- In the month_begins CTE, I set a range of months into the future, so your result set will contain the most recent balance for 'future months'. To make this a bit prettier, consider changing '2019-12-01' to current_date() so the query always extends to the end of the current month; see the sketch after this list.
- Your timestamp field looks to be dates, so I used date logic, but you should be able to apply the same principles with timestamp logic if your underlying fields are actual timestamps.
- In your result set, I'm not sure why your 2nd row (customer 2) would have a timestamp of '2019-10-10'; that seems arbitrary, as customer 2 has no 2nd balance record.
- I purposefully split the logic into several CTEs so I could comment on each step more easily; you could definitely perform several steps in the same code block for a more condensed query.
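Following up on the second caveat, a one-line sketch of a month spine that always extends to the current month:

select dt
from unnest(generate_date_array('2019-01-01', current_date(), interval 1 month)) dt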

Frequency Distribution by Day

I have records of the number of calls coming into a call center. When a call comes into the call center, a ticket is opened.
So, let's say ticket 1 (T1) is opened on 8/1/19 and stays open till 8/5/19. If a person ran a query every day, then on 8/1 it would show 1 ticket open, and the same on day 2 through day 5. I want to get records by day to see how many tickets were open on each day.
In short, Frequency Distribution by Day.
| Ticket | Open_date | Close_date |
|--------|-----------|------------|
| T1     | 8/1/2019  | 8/5/2019   |
| T2     | 8/1/2019  | 8/6/2019   |
Result:

| Date      | # Tickets_Open |
|-----------|----------------|
| 8/1/2019  | 2              |
| 8/2/2019  | 2              |
| 8/3/2019  | 2              |
| 8/4/2019  | 2              |
| 8/5/2019  | 2              |
| 8/6/2019  | 1              |
| 8/7/2019  | 0              |
| 8/8/2019  | 0              |
| 8/9/2019  | 0              |
| 8/10/2019 | 0              |
We can handle your requirement via the use of a calendar table, which stores all dates covering the full range in your data set.
WITH dates AS (
SELECT '2019-08-01' AS dt UNION ALL
SELECT '2019-08-02' UNION ALL
SELECT '2019-08-03' UNION ALL
SELECT '2019-08-04' UNION ALL
SELECT '2019-08-05' UNION ALL
SELECT '2019-08-06' UNION ALL
SELECT '2019-08-07' UNION ALL
SELECT '2019-08-08' UNION ALL
SELECT '2019-08-09' UNION ALL
SELECT '2019-08-10'
)
SELECT
d.dt,
COUNT(t.Open_date) AS num_tickets_open
FROM dates d
LEFT JOIN tickets t
ON d.dt BETWEEN t.Open_date AND t.Close_date
GROUP BY
d.dt;
Note that in practice if you expect to have this reporting requirement in the long term, you might want to replace the dates CTE above with a bona-fide table of dates.
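For instance, such a calendar table could be created and seeded once with a recursive CTE. A sketch in SQL Server syntax, with illustrative names and an arbitrary date range:

CREATE TABLE calendar (dt DATE NOT NULL PRIMARY KEY);

WITH d AS (
    SELECT CAST('2019-01-01' AS DATE) AS dt
    UNION ALL
    SELECT DATEADD(day, 1, dt) FROM d WHERE dt < '2020-12-31'
)
INSERT INTO calendar (dt)
SELECT dt FROM d
OPTION (MAXRECURSION 0);  -- lift the default 100-level recursion cap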
This solution generates the list of dates from the tickets table using CTE recursion and calculates the count:
WITH Tickets(Ticket, Open_date, Close_date) AS
(
    SELECT 'T1', '8/1/2019', '8/5/2019'
    UNION ALL
    SELECT 'T2', '8/1/2019', '8/6/2019'
),
Ticket_dates(Ticket, Dates) AS
(
    SELECT t1.Ticket, CONVERT(DATETIME, t1.Open_date)
    FROM Tickets t1
    UNION ALL
    SELECT t1.Ticket, DATEADD(dd, 1, CONVERT(DATETIME, t1.Dates))
    FROM Ticket_dates t1
    INNER JOIN Tickets t2 ON t1.Ticket = t2.Ticket
    WHERE DATEADD(dd, 1, CONVERT(DATETIME, t1.Dates)) <= CONVERT(DATETIME, t2.Close_date)
)
SELECT CONVERT(VARCHAR, Dates, 1), COUNT(*)
FROM Ticket_dates
GROUP BY Dates
ORDER BY Dates
A "general purpose" trick is to generate a series of numbers, which can be done using CTE's but there are many alternatives, and from that create the needed range of dates. Once that exists then you can left join your ticket data to this and then count by date.
CREATE TABLE mytable(
Ticket VARCHAR(8) NOT NULL PRIMARY KEY
,Open_date DATE NOT NULL
,Close_date DATE NOT NULL
);
INSERT INTO mytable(Ticket,Open_date,Close_date) VALUES ('T1','8/1/2019','8/5/2019');
INSERT INTO mytable(Ticket,Open_date,Close_date) VALUES ('T2','8/1/2019','8/6/2019');
Also note I am using a cross apply in this example to "attach" the min and max dates of your tickets to each numbered row. You would need to include your own logic on what data to select here.
;WITH
cteDigits AS (
SELECT 0 AS digit UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3 UNION ALL SELECT 4 UNION ALL
SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
)
, cteTally AS (
SELECT
[1s].digit
+ [10s].digit * 10
+ [100s].digit * 100 /* add more like this as needed */
AS num
FROM cteDigits [1s]
CROSS JOIN cteDigits [10s]
CROSS JOIN cteDigits [100s] /* add more like this as needed */
)
select
n.num + 1 rownum
, dateadd(day,n.num,ca.min_date) as on_date
, count(t.Ticket) as tickets_open
from cteTally n
cross apply (select min(Open_date), max(Close_date) from mytable) ca (min_date, max_date)
left join mytable t on dateadd(day,n.num,ca.min_date) between t.Open_date and t.Close_date
where dateadd(day,n.num,ca.min_date) <= ca.max_date
group by
n.num + 1
, dateadd(day,n.num,ca.min_date)
order by
rownum
;
result:
+--------+---------------------+--------------+
| rownum | on_date | tickets_open |
+--------+---------------------+--------------+
| 1 | 01.08.2019 00:00:00 | 2 |
| 2 | 02.08.2019 00:00:00 | 2 |
| 3 | 03.08.2019 00:00:00 | 2 |
| 4 | 04.08.2019 00:00:00 | 2 |
| 5 | 05.08.2019 00:00:00 | 2 |
| 6 | 06.08.2019 00:00:00 | 1 |
+--------+---------------------+--------------+

Select distinct users group by time range

I have a table with the following info
|date | user_id | week_beg | month_beg|
SQL to create table with test values:
CREATE TABLE uniques
(
  date DATE,
  user_id INT,
  week_beg DATE,
  month_beg DATE
);
INSERT INTO uniques VALUES ('2013-01-01', 1, '2012-12-30', '2013-01-01');
INSERT INTO uniques VALUES ('2013-01-03', 3, '2012-12-30', '2013-01-01');
INSERT INTO uniques VALUES ('2013-01-06', 4, '2013-01-06', '2013-01-01');
INSERT INTO uniques VALUES ('2013-01-07', 4, '2013-01-06', '2013-01-01');
INPUT TABLE:
| date | user_id | week_beg | month_beg |
| 2013-01-01 | 1 | 2012-12-30 | 2013-01-01 |
| 2013-01-03 | 3 | 2012-12-30 | 2013-01-01 |
| 2013-01-06 | 4 | 2013-01-06 | 2013-01-01 |
| 2013-01-07 | 4 | 2013-01-06 | 2013-01-01 |
OUTPUT TABLE:
| date | time_series | cnt |
| 2013-01-01 | D | 1 |
| 2013-01-01 | W | 1 |
| 2013-01-01 | M | 1 |
| 2013-01-03 | D | 1 |
| 2013-01-03 | W | 2 |
| 2013-01-03 | M | 2 |
| 2013-01-06 | D | 1 |
| 2013-01-06 | W | 1 |
| 2013-01-06 | M | 3 |
| 2013-01-07 | D | 1 |
| 2013-01-07 | W | 1 |
| 2013-01-07 | M | 3 |
I want to calculate the number of distinct user_id's for a date:
1. For that date
2. For that week up to that date (week to date)
3. For the month up to that date (month to date)
1 is easy to calculate.
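For reference, a minimal sketch of 1 against the uniques table:

SELECT date, count(DISTINCT user_id) AS cnt
FROM uniques
GROUP BY date;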
For 2 and 3 I am trying to use such queries:
SELECT
date,
'W' AS "time_series",
(COUNT DISTINCT user_id) COUNT (user_id) OVER (PARTITION BY week_beg) AS "cnt"
FROM user_subtitles
SELECT
date,
'M' AS "time_series",
(COUNT DISTINCT user_id) COUNT (user_id) OVER (PARTITION BY month_beg) AS "cnt"
FROM user_subtitles
Postgres does not allow window functions for DISTINCT calculation, so this approach does not work.
I have also tried a GROUP BY approach, but it does not work, as it gives me numbers for whole weeks/months.
What's the best way to approach this problem?
Count all rows
SELECT date, '1_D' AS time_series, count(DISTINCT user_id) AS cnt
FROM uniques
GROUP BY 1
UNION ALL
SELECT DISTINCT ON (1)
date, '2_W', count(*) OVER (PARTITION BY week_beg ORDER BY date)
FROM uniques
UNION ALL
SELECT DISTINCT ON (1)
date, '3_M', count(*) OVER (PARTITION BY month_beg ORDER BY date)
FROM uniques
ORDER BY 1, time_series
Your columns week_beg and month_beg are 100% redundant and can easily be replaced by
date_trunc('week', date + 1) - 1 and date_trunc('month', date) respectively; see the sketch after these notes.
Your week seems to start on Sunday (off by one), therefore the + 1 .. - 1.
The default frame of a window function with ORDER BY in the OVER clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. That's exactly what you need.
Use UNION ALL, not UNION.
Your unfortunate choice for time_series (D, W, M) does not sort well, so I renamed it to make the final ORDER BY easier.
This query can deal with multiple rows per day. Counts include all peers for a day.
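A quick sketch of the replacement expressions from the first note (Postgres):

SELECT "date",
       (date_trunc('week', "date" + 1)::date - 1) AS week_beg,
       date_trunc('month', "date")::date AS month_beg
FROM uniques;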
More about DISTINCT ON:
Select first row in each GROUP BY group?
DISTINCT users per day
To count every user only once per day, use a CTE with DISTINCT ON:
WITH x AS (SELECT DISTINCT ON (1,2) date, user_id FROM uniques)
SELECT date, '1_D' AS time_series, count(user_id) AS cnt
FROM x
GROUP BY 1
UNION ALL
SELECT DISTINCT ON (1)
date, '2_W'
,count(*) OVER (PARTITION BY (date_trunc('week', date + 1)::date - 1)
ORDER BY date)
FROM x
UNION ALL
SELECT DISTINCT ON (1)
date, '3_M'
,count(*) OVER (PARTITION BY date_trunc('month', date) ORDER BY date)
FROM x
ORDER BY 1, 2
DISTINCT users over dynamic period of time
You can always resort to correlated subqueries. Tend to be slow with big tables!
Building on the previous queries:
WITH du AS (SELECT date, user_id FROM uniques GROUP BY 1,2)
,d AS (
SELECT date
,(date_trunc('week', date + 1)::date - 1) AS week_beg
,date_trunc('month', date)::date AS month_beg
FROM uniques
GROUP BY 1
)
SELECT date, '1_D' AS time_series, count(user_id) AS cnt
FROM du
GROUP BY 1
UNION ALL
SELECT date, '2_W', (SELECT count(DISTINCT user_id) FROM du
WHERE du.date BETWEEN d.week_beg AND d.date )
FROM d
GROUP BY date, week_beg
UNION ALL
SELECT date, '3_M', (SELECT count(DISTINCT user_id) FROM du
WHERE du.date BETWEEN d.month_beg AND d.date)
FROM d
GROUP BY date, month_beg
ORDER BY 1,2;
Faster with dense_rank()
@Clodoaldo came up with a major improvement: use the window function dense_rank(). Here is another idea for an optimized version. It should be even faster to exclude daily duplicates right away. The performance gain grows with the number of rows per day.
Building on a simplified and sanitized data model:
- without the redundant columns
- day as column name instead of date

date is a reserved word in standard SQL and a basic type name in PostgreSQL, and shouldn't be used as an identifier.
CREATE TABLE uniques(
day date -- instead of "date"
,user_id int
);
Improved query:
WITH du AS (
SELECT DISTINCT ON (1, 2)
day, user_id
,date_trunc('week', day + 1)::date - 1 AS week_beg
,date_trunc('month', day)::date AS month_beg
FROM uniques
)
SELECT day, count(user_id) AS d, max(w) AS w, max(m) AS m
FROM (
SELECT user_id, day
,dense_rank() OVER(PARTITION BY week_beg ORDER BY user_id) AS w
,dense_rank() OVER(PARTITION BY month_beg ORDER BY user_id) AS m
FROM du
) s
GROUP BY day
ORDER BY day;
Benchmarks of 4 faster variants show that it depends on your data distribution which is fastest for you.
All of them are about 10x as fast as the correlated-subqueries version (which isn't bad for correlated subqueries).
Without correlated subqueries:
with u as (
select
"date", user_id,
date_trunc('week', "date" + 1)::date - 1 week_beg,
date_trunc('month', "date")::date month_beg
from uniques
)
select
"date", count(distinct user_id) D,
max(week_dr) W, max(month_dr) M
from (
select
user_id, "date",
dense_rank() over(partition by week_beg order by user_id) week_dr,
dense_rank() over(partition by month_beg order by user_id) month_dr
from u
) s
group by "date"
order by "date"
Try
SELECT *
FROM (
    SELECT dates, count(DISTINCT user_id), 'D' AS timeseries FROM users_data GROUP BY dates
    UNION
    SELECT max(dates), count(DISTINCT user_id), 'W' FROM users_data GROUP BY date_part('year', dates), date_part('week', dates)
    UNION
    SELECT max(dates), count(DISTINCT user_id), 'M' FROM users_data GROUP BY date_part('year', dates), date_part('month', dates)
) temp
ORDER BY dates, timeseries
Try queries like this:
SELECT count(DISTINCT user_id), date_format(date, '%Y-%m-%d') AS date_period
FROM uniques
GROUP BY date_period