Join against date range, aggregate by SUM - sql

I need to gather the SUM of sales made on a certain category of item, grouped by day for a selected date range (anywhere from one week up to 12 weeks), and return 0 instead of NULL for days where no transactions occurred.
My original idea was to use a pre-populated table called "calendar" (shown below), which holds about 10 years of dates, and LEFT JOIN my "products" table against it so that days with no sales come back with a SUM of 0.
The result was too large to deal with, so I'm now trying to first copy the selected range of dates into an empty table called "datetable", which shares the same column names as "calendar". So I have 3 tables:
"calendar" table. It has 10 years worth of dates with following column names:
IsoDate DayNameOfWeek
2012-01-01 Sun
2012-01-02 Mon
2012-01-03 Tue
2012-01-04 Wed
2012-01-05 Thu
2012-01-06 Fri
2012-01-07 Sat
2012-01-08 Sun
2012-01-09 Mon
2012-01-10 Tue
etc. for 10 years
"datetable" table (this is created empty with two columns to prefill from "calendar" table so the date range data for the LEFT JOIN is more compact):
IsoDate DayNameOfWeek
"products" table. It is where I'm storing sales for each ProductCat:
ExpDate ProductCat Amount
2012-01-03 28 232
2012-01-04 29 100
2012-01-04 29 1002
2012-01-06 12 12
2012-01-06 29 9
2012-01-07 10 100
2012-01-07 29 122
2012-01-07 29 17
The output I'm looking for based on a single "ProductCat" number, in this case 29:
IsoDate DayNameOfWeek AmountSummed
2012-01-01 Sun 0
2012-01-02 Mon 0
2012-01-03 Tue 0
2012-01-04 Wed 1102
2012-01-05 Thu 0
2012-01-06 Fri 9
2012-01-07 Sat 139
2012-01-08 Sun 0
2012-01-09 Mon 0
2012-01-10 Tue 0
My code is below. The initial insert works fine but I'm not sure of the syntax that will make the second part with the JOIN and the SUM work:
INSERT INTO datetable (IsoDate, DayNameOfWeek)
SELECT IsoDate, DayNameOfWeek
FROM calendar
WHERE IsoDate
BETWEEN '2012-07-01' AND '2012-07-10'
SELECT ExpDate, SUM(IFNULL(Amount, 0))
AS AmountSummed
FROM products
WHERE ProductCat = 29
AND ExpDate BETWEEN '2012-07-01' AND '2012-07-10'
LEFT JOIN products
ON datetable.IsoDate=products.ExpDate
GROUP BY datetable.IsoDate
EDIT
This is the code that works now:
SELECT C.IsoDate,IFNULL(SUM(P.Amount),0) AS AmountSummed
FROM calendar C LEFT OUTER JOIN products P ON C.IsoDate=P.ExpDate
AND P.ProductCat = 29
WHERE C.IsoDate BETWEEN '2012-07-01' AND '2012-07-10'
GROUP BY C.IsoDate, C.DayNameOfWeek
ORDER BY C.IsoDate

You've pretty much got what you need. However, you don't need the datetable.
Your query should look like this:
SELECT C.IsoDate, C.DayNameOfWeek, IFNULL(SUM(P.Amount),0) AS AmountSummed
FROM calendar C LEFT JOIN products P ON C.IsoDate=P.ExpDate
AND P.ProductCat = 29
WHERE C.IsoDate BETWEEN '2012-07-01' AND '2012-07-10'
GROUP BY C.IsoDate, C.DayNameOfWeek
ORDER BY C.IsoDate
Note that the P.ProductCat filter belongs in the ON clause; putting it in the WHERE clause would discard the NULL-extended rows and you would lose the zero days.
If you really want to use your datetable, just substitute it in for calendar and remove the C.IsoDate BETWEEN '2012-07-01' AND '2012-07-10' filter (assuming the datetable was empty before you started), because datetable already contains only the dates you are looking for.
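For reference, a minimal sketch of that datetable variant, assuming datetable was filled by the INSERT shown in the question:
-- Hedged sketch: same aggregate, driven by the pre-filled datetable instead of calendar
SELECT D.IsoDate, D.DayNameOfWeek, IFNULL(SUM(P.Amount), 0) AS AmountSummed
FROM datetable D
LEFT JOIN products P
       ON D.IsoDate = P.ExpDate
      AND P.ProductCat = 29
GROUP BY D.IsoDate, D.DayNameOfWeek
ORDER BY D.IsoDate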

Related

Bigquery numeric datetime to string datetime

I have written a query in bigquery like below:
SELECT date_trunc(dd.created, week) AS week,
COUNT(DISTINCT dd.user) AS total,
COUNT(dd.upload) AS info
FROM
local.detail dd
LEFT JOIN local.list du ON dd.id = du.id
WHERE
regexp_extract(du.email, r'#(.+)') != 'gmail.com'
GROUP BY
date_trunc(dd.created, week);
Output:
week                  total  info
2020-02-02 00:00:00   625    382
2020-03-22 00:00:00   1059   329
I want the week column data formatted like this (just month and day):
week    total  info
Feb 02  625    382
Mar 03  1059   329
How can I write this in BigQuery to get this?
Use FORMAT_DATE for this.
E.g. FORMAT_DATE("%b %d", date_trunc(dd.created, week)); %b gives the abbreviated month name and %d the zero-padded day.
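A sketch of how that could be folded into the original query, with the formatting done in an outer SELECT so the grouping still happens on the full truncated week (assuming dd.created is a TIMESTAMP or DATETIME, as the 00:00:00 in the output suggests):
-- Hedged sketch: aggregate by week, then format the week start for display
SELECT FORMAT_DATE('%b %d', week_start) AS week, total, info
FROM (
  SELECT DATE(DATE_TRUNC(dd.created, WEEK)) AS week_start,
         COUNT(DISTINCT dd.user) AS total,
         COUNT(dd.upload) AS info
  FROM local.detail dd
  LEFT JOIN local.list du ON dd.id = du.id
  WHERE REGEXP_EXTRACT(du.email, r'#(.+)') != 'gmail.com'
  GROUP BY week_start
) AS weekly;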

SQL how to count but only count one instance if two columns match?

Wondering how to select from a table:
FIELDID personID purchaseID dateofPurchase
--------------------------------------------------
2 13 147 2014-03-21 00:00:00
3 15 165 2015-03-23 00:00:00
4 13 456 2018-03-24 00:00:00
5 1 133 2018-03-21 00:00:00
6 23 123 2013-03-22 00:00:00
7 25 456 2013-03-21 00:00:00
8 25 456 2013-03-23 00:00:00
9 22 456 2013-03-28 00:00:00
10 25 589 2013-03-21 00:00:00
11 82 147 1991-10-22 00:00:00
12 82 453 2003-03-22 00:00:00
I'd like to get a result table with two columns: the weekday and the number of purchases on each weekday, but purchases made by the same person on the same day should only be counted once. For example, since personID 25 purchased two things on 2013-03-21, that should only count as one 'Thursday' instead of two.
Basically, if the personID and the dateofPurchase are the same for more than one row, I only want to count it once.
Here is what I have currently. It does everything correctly except that it counts the above scenario under Thursday twice, when I only want to add one:
SELECT v.wkday as day, COUNT(*) as 'absences'
FROM dbo.AttendanceRecord pr CROSS APPLY
(VALUES (CASE WHEN DATEPART(WEEKDAY, date) IN (1, 7)
THEN 'Weekend'
ELSE DATENAME(WEEKDAY, date)
END)
) v(wkday)
GROUP BY v.wkday;
To clarify: if an item is purchased under at least one purchaseID on a specific day, the person counts as having purchased on that day and does not need to be counted again for each additional purchaseID on that day.
I think you want to count distinct persons, so that would be:
COUNT(DISTINCT personid) as absences
Note that single quotes are not appropriate around column aliases. If you need to escape an alias, use square brackets.
EDIT:
If you want to count distinct person-days, then you can use:
COUNT(DISTINCT CONCAT(personid, ':', dateofpurchase)) as absences
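Putting that together with the weekday bucketing from the question, a sketch could look like the following (dbo.PurchaseRecord is a hypothetical table name holding the purchase rows shown above, since the posted query references a different table):
-- Hedged sketch: count distinct person-days per weekday
-- dbo.PurchaseRecord is assumed; substitute your actual purchase table
SELECT v.wkday AS [day],
       COUNT(DISTINCT CONCAT(pr.personID, ':', CONVERT(varchar(10), pr.dateofPurchase, 120))) AS purchases
FROM dbo.PurchaseRecord pr CROSS APPLY
     (VALUES (CASE WHEN DATEPART(WEEKDAY, pr.dateofPurchase) IN (1, 7)
                   THEN 'Weekend'
                   ELSE DATENAME(WEEKDAY, pr.dateofPurchase)
              END)
     ) v(wkday)
GROUP BY v.wkday;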

Calculate Churn by aggregating by date range in SQL

I am trying to calculate the churn rate from data that has customer_id, group, and date columns. The aggregation is going to be by id, group, and date. The churn formula is (customers in previous cohort - customers in last cohort) / customers in previous cohort.
"Customers in previous cohort" refers to the cohort from the 28 days before the last window.
"Customers in last cohort" refers to the cohort from the most recent 28 days.
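For example, if the previous 28-day cohort had 200 distinct customers and the last 28-day cohort has 150, the churn rate is (200 - 150) / 200 = 0.25.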
I am not sure how to aggregate them by date range to calculate the churn.
Here is sample data that I copied from SQL Group by Date Range:
Date Group Customer_id
2014-03-01 A 1
2014-04-02 A 2
2014-04-03 A 3
2014-05-04 A 3
2014-05-05 A 6
2015-08-06 A 1
2015-08-07 A 2
2014-08-29 XXXX 2
2014-08-09 XXXX 3
2014-08-10 BB 4
2014-08-11 CCC 3
2015-08-12 CCC 2
2015-03-13 CCC 3
2014-04-14 CCC 5
2014-04-19 CCC 4
2014-08-16 CCC 5
2014-08-17 CCC 3
2014-08-18 XXXX 2
2015-01-10 XXXX 3
2015-01-20 XXXX 4
2014-08-21 XXXX 5
2014-08-22 XXXX 2
2014-01-23 XXXX 3
2014-08-24 XXXX 2
2014-02-25 XXXX 3
2014-08-26 XXXX 2
2014-06-27 XXXX 4
2014-08-28 XXXX 1
2014-08-29 XXXX 1
2015-08-30 XXXX 2
2015-09-31 XXXX 3
The goal is to calculate the churn rate every 28 days between 2014 and 2015 using the formula given above. So the data needs to be aggregated in rolling 28-day windows, with the churn calculated from those window counts.
Here is what I tried to aggregate the data by date range:
SELECT COUNT(distinct customer_id) AS count_ids, Group,
DATE_SUB(CAST(Date AS DATE), INTERVAL 56 DAY) AS Date_min,
DATE_SUB(CURRENT_DATE, INTERVAL 28 DAY) AS Date_max
FROM churn_agg
GROUP BY count_ids, Group, Date_min, Date_max
I hope someone can help me with the aggregation and churn calculation. I simply want to subtract each aggregated count_ids from the next aggregated count_ids 28 days later, i.e. a successive deduction over the same column (count_ids). I am not sure whether I need a rolling window or a simple aggregation to find the churn.
As corrected by @jarlh, the last date is not 2015-09-31 but 2015-09-30.
You can use this to create a calendar of 28-day periods:
create table daysby28 (i int, _Date date);
insert into daysby28 (i, _Date)
SELECT i, cast('2014-01-01' as date) + i*INTERVAL '28 day'
from generate_series(0,50) i
order by 1;
After you create the churn_agg table the way @jarlh did in his fiddle, this query gets you what you want:
with cte as
(
select count(Customer) as TotalCustomer, Cohort, CohortDateStart From
(
select distinct a.Customer_id as Customer, b.i as Cohort, b._Date as CohortDateStart
from churn_agg a left join daysby28 b on a._Date >= b._Date and a._Date < b._Date + INTERVAL '28 day'
) a
group by Cohort, CohortDateStart
)
select a.CohortDateStart,
1.0*(b.TotalCustomer - a.TotalCustomer)/(1.0*b.TotalCustomer) as Churn from cte a
left join cte b on a.cohort > b.cohort
and not exists(select 1 from cte c where c.cohort > b.cohort and c.cohort < a.cohort)
order by 1
The fiddle with everything put together is here.

Left join with nested selects and aggregate functions

Problem
I have one table of generated dates (s) which I want to join with another table (d) which is a list of dates where a specific occurrence has happened.
table s
Wednesday 23rd August 2017
Thursday 24th August 2017
Friday 25th August 2017
Saturday 26th August 2017
table d
day_created -------------------------------- count
Thursday 24th August 2017 ---------------- 45
Saturday 26th August 2017 ---------------- 32
I want to show rows where the occurrence does not take place, which I cannot do if I just have table d.
I want something that looks like:
day_created -------------------------------- count
Wednesday 23rd August --------------------- 0
Thursday 24th August 2017 ---------------- 45
Friday 25th August 2017 ------------------ 0
Saturday 26th August 2017 ---------------- 32
I've tried joining with a left join as follows:
SELECT day_created, COUNT(d.day_created) as total_per_day
FROM
(SELECT date_trunc('day', task_1.created_at) as day_created
FROM task_1
)
d
LEFT JOIN (
SELECT (generate_series('2017-05-01', current_date, '1 day'::INTERVAL)) as standard_date
)
s
ON d.day_created=s.standard_date
GROUP BY d.day_created
ORDER BY day_created DESC;
I don't get an error; however, the join isn't working as intended (i.e. it doesn't return the dates with no matches). What it returns is the dates from table d and their counts, but not the dates in between where there are 0 occurrences.
I've been going round in circles and have understood that I need to make table s (I think!) the left table, but I'm getting confused as a newbie with the syntax.
This is all in PostgreSQL 9.5.8.
Basically, you had the LEFT JOIN backwards. This should work, with some other simplifications and performance optimizations:
SELECT s.standard_date, COUNT(d.created_at) AS total_per_day
FROM generate_series('2017-05-01', current_date, interval '1 day') s(standard_date)
LEFT JOIN task_1 d ON d.created_at >= s.standard_date
AND d.created_at < s.standard_date + interval '1 day'
GROUP BY 1
ORDER BY 1;
This counts rows in d, like you commented. Does not sum values.
Be aware that generate_series() still returns timestamp with time zone, even if you pass date values to it. You may want to cast to date or format with to_char() for display in the outer SELECT. (But rather group and order by the original timestamp value, not the formatted string.)
There may be corner cases depending on the current time zone setting and on the actual, undisclosed table definition.
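For display, a minimal variation of the query above that casts the generated timestamp to date in the outer SELECT, while still grouping on the original timestamp (same assumptions about task_1 as above):
-- Hedged sketch: cast to date for display, keep grouping on the generated timestamp
SELECT s.standard_date::date AS day_created, COUNT(d.created_at) AS total_per_day
FROM generate_series('2017-05-01', current_date, interval '1 day') s(standard_date)
LEFT JOIN task_1 d ON d.created_at >= s.standard_date
                  AND d.created_at < s.standard_date + interval '1 day'
GROUP BY s.standard_date
ORDER BY s.standard_date;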
Related:
How to avoid a subquery in FILTER clause?
I have one table of generated dates (s)
In real databases, we don't store a generated series. We just generate them when needed.
which I want to join with another table (d) which is a list of dates where a specific occurrence has happened. [...] I want to show rows where the occurrence does not take place, which I cannot do if I just have table d.
Nah, you can do it.
CREATE TABLE d(day_created, count) AS VALUES
('24 August 2017'::date, 45),
('26 August 2017'::date, 32);
SELECT day_created, coalesce(count,0)
FROM (
SELECT d::date
FROM generate_series(
'2017-08-01'::timestamp without time zone,
'2017-09-01'::timestamp without time zone,
'1 day'
) AS gs(d)
) AS gs(day_created)
LEFT OUTER JOIN d USING(day_created)
ORDER BY day_created;
day_created | coalesce
-------------+----------
2017-08-01 | 0
2017-08-02 | 0
2017-08-03 | 0
2017-08-04 | 0
2017-08-05 | 0
2017-08-06 | 0
2017-08-07 | 0
2017-08-08 | 0
2017-08-09 | 0
2017-08-10 | 0
2017-08-11 | 0
2017-08-12 | 0
2017-08-13 | 0
2017-08-14 | 0
2017-08-15 | 0
2017-08-16 | 0
2017-08-17 | 0
2017-08-18 | 0
2017-08-19 | 0
2017-08-20 | 0
2017-08-21 | 0
2017-08-22 | 0
2017-08-23 | 0
2017-08-24 | 45
2017-08-25 | 0
2017-08-26 | 32
2017-08-27 | 0
2017-08-28 | 0
2017-08-29 | 0
2017-08-30 | 0
2017-08-31 | 0
2017-09-01 | 0
(32 rows)

Using SQL Server 2012 LAG

I am trying to write a query using the SQL Server 2012 LAG function to retrieve data from my [Order] table where the datetime difference between a row and the previous row is less than or equal to 2 minutes.
The result I'm expecting is
1234 April, 28 2012 09:00:00
1234 April, 28 2012 09:01:00
1234 April, 28 2012 09:03:00
5678 April, 28 2012 09:40:00
5678 April, 28 2012 09:42:00
5678 April, 28 2012 09:44:00
but I'm seeing
1234 April, 28 2012 09:00:00
1234 April, 28 2012 09:01:00
1234 April, 28 2012 09:03:00
5678 April, 28 2012 09:40:00
5678 April, 28 2012 09:42:00
5678 April, 28 2012 09:44:00
91011 April, 28 2012 10:00:00
The last row should not be returned. Here is what I have tried: SQL Fiddle
Anyone with ideas?
Okay, first of all: I added a row to show where someone else's answer (since deleted) didn't work.
Now for the logic in my query. You said you want each row that is within two minutes of another row. That means you have to look not only backwards with LAG(), but also forwards with LEAD(). Your query returned a row whenever the previous time was NULL, so it simply returned the first row of each OrderNumber whether that was right or not. By chance, the first row of each of your OrderNumbers did need to be returned, until you got to the last OrderNumber, where it broke. My query corrects that and should work for all of your data.
CREATE TABLE [Order]
(
OrderNumber VARCHAR(20) NOT NULL
, OrderDateTime DATETIME NOT NULL
);
INSERT [Order] (OrderNumber, OrderDateTime)
VALUES
('1234', '2012-04-28 09:00:00'),
('1234', '2012-04-28 09:01:00'),
('1234', '2012-04-28 09:03:00'),
('5678', '2012-04-28 09:40:00'),
('5678', '2012-04-28 09:42:00'),
('5678', '2012-04-28 09:44:00'),
('91011', '2012-04-28 10:00:00'),
('91011', '2012-04-28 10:25:00'),
('91011', '2012-04-28 10:27:00');
with Ordered as (
select
OrderNumber,
OrderDateTime,
LAG(OrderDateTime,1) over (
partition by OrderNumber
order by OrderDateTime
) as prev_time,
LEAD(OrderDateTime,1) over (
partition by OrderNumber
order by OrderDateTime
) as next_time
from [Order]
)
SELECT OrderNumber,
OrderDateTime
FROM Ordered
WHERE DATEDIFF(MINUTE,OrderDateTime,next_time) <= 2 --this says if the next value is less than or equal to two minutes away return it
OR DATEDIFF(MINUTE,prev_time,OrderDateTime) <= 2 --this says if the prev value is less than or equal to 2 minutes away return it
Results (remember I added a row):
OrderNumber OrderDateTime
-------------------- -----------------------
1234 2012-04-28 09:00:00.000
1234 2012-04-28 09:01:00.000
1234 2012-04-28 09:03:00.000
5678 2012-04-28 09:40:00.000
5678 2012-04-28 09:42:00.000
5678 2012-04-28 09:44:00.000
91011 2012-04-28 10:25:00.000
91011 2012-04-28 10:27:00.000