Left join with nested selects and aggregate functions - sql

Problem
I have one table of generated dates (s) which I want to join with another table (d) which is a list of dates where a specific occurrence has happened.
table s
Wednesday 23rd August 2017
Thursday 24th August 2017
Friday 25th August 2017
Saturday 26th August 2017
table d
day_created -------------------------------- count
Thursday 24th August 2017 ---------------- 45
Saturday 26th August 2017 ---------------- 32
I want to show rows where the occurrence does not take place, which I cannot do if I just have table d.
I want something that looks like:
day_created -------------------------------- count
Wednesday 23rd August 2017 --------------- 0
Thursday 24th August 2017 ---------------- 45
Friday 25th August 2017 ------------------ 0
Saturday 26th August 2017 ---------------- 32
I've tried joining with a left join as follows:
SELECT day_created, COUNT(d.day_created) as total_per_day
FROM
(SELECT date_trunc('day', task_1.created_at) as day_created
FROM task_1
)
d
LEFT JOIN (
SELECT (generate_series('2017-05-01', current_date, '1 day'::INTERVAL)) as standard_date
)
s
ON d.day_created=s.standard_date
GROUP BY d.day_created
ORDER BY day_created DESC;
I don't get an error, but the join isn't working: it returns the dates from table d with their counts, but not the dates in between where there are 0 occurrences (i.e. where the count would be null).
I've been going round in circles and have understood that I need to make table s (I think!) the left table, but I'm getting confused as a newbie with the syntax.
This is all in PostgreSQL 9.5.8.

Basically, you had the LEFT JOIN backwards. This should work, with some other simplifications and performance optimizations:
SELECT s.standard_date, COUNT(d.created_at) AS total_per_day
FROM generate_series('2017-05-01', current_date, interval '1 day') s(standard_date)
LEFT JOIN task_1 d ON d.created_at >= s.standard_date
AND d.created_at < s.standard_date + interval '1 day'
GROUP BY 1
ORDER BY 1;
This counts rows in d, like you commented. Does not sum values.
Be aware that generate_series() still returns timestamp with time zone, even if you pass date values to it. You may want to cast to date or format with to_char() for display in the outer SELECT. (But rather group and order by the original timestamp value, not the formatted string.)
There may be corner cases, depending on the current time zone setting and on the actual (undisclosed) table definition.
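To make the direction of the join concrete, here is a minimal Python sketch of the same idea outside the database: the calendar is the driving side, and the event counts are looked up per day. The function name daily_counts and the sample timestamps are made up for illustration; they stand in for task_1.created_at.

```python
from collections import Counter
from datetime import date, datetime, timedelta


def daily_counts(events, start, end):
    """Return (day, count) for every day in [start, end], including empty days."""
    # Aggregate side of the join: number of events per calendar day.
    per_day = Counter(ts.date() for ts in events)
    # Calendar side: one entry per day, like generate_series().
    calendar = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    # A Counter returns 0 for missing keys, which plays the role of the
    # unmatched rows that COUNT() turns into 0 after a LEFT JOIN.
    return [(day, per_day[day]) for day in calendar]


# Made-up sample timestamps (assumption, not data from the question).
events = [
    datetime(2017, 8, 24, 9, 0),
    datetime(2017, 8, 24, 17, 30),
    datetime(2017, 8, 26, 12, 0),
]
result = daily_counts(events, date(2017, 8, 23), date(2017, 8, 26))
for day, count in result:
    print(day, count)
```

If the calendar were not the driving side, the days without events would simply never appear, which is the symptom described in the question.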
Related:
How to avoid a subquery in FILTER clause?

I have one table of generated dates (s)
In real databases, we don't store a generated series. We just generate them when needed.
which I want to join with another table (d) which is a list of dates where a specific occurrence has happened. [...] I want to show rows where the occurrence does not take place, which I cannot do if I just have table d.
Nah, you can do it.
CREATE TABLE d(day_created, count) AS VALUES
('24 August 2017'::date, 45),
('26 August 2017'::date, 32);
SELECT day_created, coalesce(count,0)
FROM (
SELECT d::date
FROM generate_series(
'2017-08-01'::timestamp without time zone,
'2017-09-01'::timestamp without time zone,
'1 day'
) AS gs(d)
) AS gs(day_created)
LEFT OUTER JOIN d USING(day_created)
ORDER BY day_created;
day_created | coalesce
-------------+----------
2017-08-01 | 0
2017-08-02 | 0
2017-08-03 | 0
2017-08-04 | 0
2017-08-05 | 0
2017-08-06 | 0
2017-08-07 | 0
2017-08-08 | 0
2017-08-09 | 0
2017-08-10 | 0
2017-08-11 | 0
2017-08-12 | 0
2017-08-13 | 0
2017-08-14 | 0
2017-08-15 | 0
2017-08-16 | 0
2017-08-17 | 0
2017-08-18 | 0
2017-08-19 | 0
2017-08-20 | 0
2017-08-21 | 0
2017-08-22 | 0
2017-08-23 | 0
2017-08-24 | 45
2017-08-25 | 0
2017-08-26 | 32
2017-08-27 | 0
2017-08-28 | 0
2017-08-29 | 0
2017-08-30 | 0
2017-08-31 | 0
2017-09-01 | 0
(32 rows)

Related

Calculate Churn by aggregating by date range in SQL

I am trying to calculate the churn rate from data that has customer_id, group and date. The aggregation is going to be by id, group and date. The churn formula is (customers in previous cohort - customers in last cohort) / customers in previous cohort, where:
customers in previous cohort refers to the cohort from the 28-day period before the last one
customers in last cohort refers to the cohort from the last 28 days
I am not sure how to aggregate them by date range to calculate the churn.
Here is sample data that I copied from SQL Group by Date Range:
Date Group Customer_id
2014-03-01 A 1
2014-04-02 A 2
2014-04-03 A 3
2014-05-04 A 3
2014-05-05 A 6
2015-08-06 A 1
2015-08-07 A 2
2014-08-29 XXXX 2
2014-08-09 XXXX 3
2014-08-10 BB 4
2014-08-11 CCC 3
2015-08-12 CCC 2
2015-03-13 CCC 3
2014-04-14 CCC 5
2014-04-19 CCC 4
2014-08-16 CCC 5
2014-08-17 CCC 3
2014-08-18 XXXX 2
2015-01-10 XXXX 3
2015-01-20 XXXX 4
2014-08-21 XXXX 5
2014-08-22 XXXX 2
2014-01-23 XXXX 3
2014-08-24 XXXX 2
2014-02-25 XXXX 3
2014-08-26 XXXX 2
2014-06-27 XXXX 4
2014-08-28 XXXX 1
2014-08-29 XXXX 1
2015-08-30 XXXX 2
2015-09-31 XXXX 3
The goal is to calculate the churn rate every 28 days in between 2014 and 2015 by the formula given above. So, it is going to be aggregating the data by rolling it by 28 days and calculating the churn by the formula.
Here is what I tried to aggregate the data by date range:
SELECT COUNT(distinct customer_id) AS count_ids, Group,
DATE_SUB(CAST(Date AS DATE), INTERVAL 56 DAY) AS Date_min,
DATE_SUB(CURRENT_DATE, INTERVAL 28 DAY) AS Date_max
FROM churn_agg
GROUP BY count_ids, Group, Date_min, Date_max
Hope someone will help me with the aggregation and the churn calculation. I simply want to take the difference between each aggregated count_ids and the next aggregated count_ids, the one 28 days later, so this is a successive subtraction over the same column (count_ids). I am not sure if I have to use a rolling window or a simple aggregation to find the churn.
As corrected by @jarlh, it's not 2015-09-31 but 2015-09-30.
You can use this to create a 28-day calendar:
create table daysby28 (i int, _Date date);
insert into daysby28 (i, _Date)
SELECT i, cast('2014-01-01' as date) + i * INTERVAL '28 day'
from generate_series(0,50) i
order by 1;
Once you have created the churn_agg table from the fiddle that @jarlh posted, this query gives you what you want:
with cte as
(
select count(Customer) as TotalCustomer, Cohort, CohortDateStart From
(
select distinct a.Customer_id as Customer, b.i as Cohort, b._Date as CohortDateStart
from churn_agg a left join daysby28 b on a._Date >= b._Date and a._Date < b._Date + INTERVAL '28 day'
) a
group by Cohort, CohortDateStart
)
select a.CohortDateStart,
1.0*(b.TotalCustomer - a.TotalCustomer)/(1.0*b.TotalCustomer) as Churn from cte a
left join cte b on a.cohort > b.cohort
and not exists(select 1 from cte c where c.cohort > b.cohort and c.cohort < a.cohort)
order by 1
The fiddle of all together is here
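For readers who want to check the logic outside the database, here is a rough Python sketch of the same steps: assign each row to a 28-day cohort counted from a fixed origin, count distinct customers per cohort, then compare each cohort with the nearest earlier cohort that has data. The function name churn_by_cohort and the sample rows are assumptions for illustration, not the churn_agg data.

```python
from datetime import date


def churn_by_cohort(rows, origin, width_days=28):
    """rows: iterable of (day, customer_id). Returns {cohort_index: churn}."""
    # Distinct customers per cohort; the cohort index is the number of whole
    # 28-day periods between the origin and the row's date.
    cohorts = {}
    for day, customer in rows:
        index = (day - origin).days // width_days
        cohorts.setdefault(index, set()).add(customer)

    # Compare each cohort with the closest earlier cohort that has any rows,
    # mirroring the NOT EXISTS condition in the SQL self-join.
    ordered = sorted(cohorts)
    churn = {}
    for previous, current in zip(ordered, ordered[1:]):
        before = len(cohorts[previous])
        after = len(cohorts[current])
        churn[current] = (before - after) / before
    return churn


# Made-up rows (assumption): two customers in cohort 0, one in cohort 1,
# three in cohort 3, nothing in cohort 2.
rows = [
    (date(2014, 1, 5), 1),
    (date(2014, 1, 10), 2),
    (date(2014, 1, 20), 2),
    (date(2014, 2, 1), 1),
    (date(2014, 4, 1), 3),
    (date(2014, 4, 2), 4),
    (date(2014, 4, 3), 5),
]
churn = churn_by_cohort(rows, origin=date(2014, 1, 1))
print(churn)
```

A negative value means the cohort grew compared with the previous one. Note that an empty 28-day period is skipped rather than treated as 100% churn; whether that is what you want depends on your definition.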

How can I get a table of counts grouped by month on axis x and by hour on axis y in PostgreSQL?

There is a log table with a lot of events. I would like some statistics on it, i.e. how many events happened at each hour of the day in each month.
Data sample:
date_create | event
---------------------+---------------------------
2018-03-01 18:00:00 | Something happened
2018-03-05 18:15:00 | Something else happened
2018-03-06 19:00:00 | Something happened again
2018-04-01 18:00:00 | and again
The result should look like this:
hour | 03 | 04
------+----+----
18 | 2 | 1
19 | 1 | 0
I can do it with a CTE, but that means significant manual work each time. My guess is that it could be done with a function, but there is probably one available already.
You can use aggregation. I'm thinking:
select extract(hour from date_create) as hh,
sum(case when extract(month from date_create) = 3 then 1 else 0 end) as month_03,
sum(case when extract(month from date_create) = 4 then 1 else 0 end) as month_04
from t
group by hh
order by hh;
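The query above needs one column per month written out by hand. If the pivot can be done on the client instead, the month columns can be discovered from the data. A minimal Python sketch, assuming the date_create values have already been fetched as datetime objects (the function name hour_by_month is made up):

```python
from collections import Counter
from datetime import datetime


def hour_by_month(timestamps):
    """Return (months, rows) where rows maps hour -> counts, one per month."""
    # Count events per (hour, month) pair, like the conditional sums in SQL.
    counts = Counter((ts.hour, ts.month) for ts in timestamps)
    # Columns and rows are whatever months and hours occur in the data.
    months = sorted({month for _, month in counts})
    hours = sorted({hour for hour, _ in counts})
    # Missing (hour, month) pairs come back as 0 from the Counter.
    return months, {hour: [counts[(hour, m)] for m in months] for hour in hours}


# The sample rows from the question.
sample = [
    datetime(2018, 3, 1, 18, 0),
    datetime(2018, 3, 5, 18, 15),
    datetime(2018, 3, 6, 19, 0),
    datetime(2018, 4, 1, 18, 0),
]
months, table = hour_by_month(sample)
print("hour", months)
for hour, row in table.items():
    print(hour, row)
```

This groups by month number only, so the same month in different years is merged, just as extract(month from ...) does in the query.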

how to group dates from ms access database as week of month using excel vba

I am using MS access 2010 database and working with Excel VBA to connect to the database and make queries. Suppose I have a table named "MyTable" like this below:
----------------------
| Date | Count |
----------------------
|7/7/16 | 12 |
----------------------
|7/8/16 | 15 |
----------------------
|7/15/16 | 18 |
----------------------
|7/18/16 | 16 |
----------------------
|8/7/16 | 15 |
----------------------
|8/8/16 | 10 |
----------------------
|8/15/16 | 9 |
----------------------
|8/16/16 | 18 |
----------------------
Now I want to use query to get a table like this:
----------------------
|Week by Month | Sum |
----------------------
|July Week 2 | 27 |
----------------------
|July Week 3 | 18 |
----------------------
|July Week 4 | 16 |
----------------------
|Aug Week 2 | 25 |
----------------------
|Aug Week 3 | 27 |
----------------------
Use DatePart to get the week of the year, then subtract the week of the first day of the month (which gives a zero-based week of the month), and then add 1 (to get a one-based week of the month):
Public Function WeekOfMonth(x As Date) As Integer
WeekOfMonth = DatePart("ww", x) - _
DatePart("ww", DateSerial(Year(x), Month(x), 1)) _
+ 1
End Function
Note that the Access SQL version should be identical to what's after the = sign.
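As a cross-check of the arithmetic, here is a small Python sketch of the same week-of-month rule, assuming weeks start on Sunday (the DatePart default). It is a re-derivation, not a translation of the VBA calls: instead of subtracting two week-of-year numbers, it shifts the day of the month by the weekday of the 1st. The name week_of_month and the (month, week) grouping key are choices made for this example.

```python
from collections import defaultdict
from datetime import date


def week_of_month(d):
    """One-based week of the month, with weeks starting on Sunday."""
    first = d.replace(day=1)
    # Python's weekday() has Monday=0; convert to Sunday=0 .. Saturday=6.
    offset = (first.weekday() + 1) % 7
    # Days before the 1st in its week push later days into later weeks.
    return (d.day - 1 + offset) // 7 + 1


# The rows of MyTable from the question.
rows = [
    (date(2016, 7, 7), 12), (date(2016, 7, 8), 15),
    (date(2016, 7, 15), 18), (date(2016, 7, 18), 16),
    (date(2016, 8, 7), 15), (date(2016, 8, 8), 10),
    (date(2016, 8, 15), 9), (date(2016, 8, 16), 18),
]

# Sum Count per (month, week of month).
sums = defaultdict(int)
for day, count in rows:
    sums[(day.month, week_of_month(day))] += count

for (month, week), total in sorted(sums.items()):
    print(month, "week", week, total)
```

With the sample data this reproduces the sums in the desired table (27, 18, 16 for July and 25, 27 for August).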
I have solved this as below:
select weeknum, sum(count1) from (
select format(date1,'MMM') & " Week - " & int((datepart('d',date1,1,1) -1 ) / 7 + 1) as weeknum, count1 from MyTable)
group by weeknum
Show Week of Month where Week 1 is always the 1st Full Week of the Month starting in that month (First Sunday is 1 or 2 or 3 or 4 or 5 or 6 or 7), days of the month prior to the first Sunday are counted as week 4/5 of previous month.
After searching and failing to find EXACTLY the right answer for my situation, I modified ComIntern's solution as follows. This is used as a CONTROL on a REPORT, where [StartDate] is a criterion on the form that calls/generates the report:
=IIf((DatePart("ww",[StartDate]-7)-DatePart("ww",DateSerial(Year([StartDate]-7),Month([StartDate]-7),1))+1)="5","1",DatePart("ww",[StartDate])-DatePart("ww",DateSerial(Year([StartDate]),Month([StartDate]),1))+0)
This results in showing the Week of Month based on FULL weeks - and accounts for when the previous month's week 5 included 1 or more days from this month.
For example, Week 5 of Oct 2017 is 29 OCT - 04 NOV. If I did not include the IIF statement to adjust the formula, 05-11 NOV would be returned as Week 2, but for my reporting purposes it is Week 1 of NOV. I have tested this and it appears to ALWAYS work. If you need to see Week of Month based on FULL weeks, this should work for you!

check date ranges with other date ranges

I have the table Distractions with the following columns:
id startTime endTime (possibly null)
Also, I have two parameters that define a period: pstart and pend.
I have to find all distractions within the period and count the hours.
For example, we have:
Distractions:
`id` `startTime` `endTime`
1 01.01.2014 00:00 03.01.2014 00:00
2 25.03.2014 00:00 02.04.2014 00:00
3 27.03.2014 00:00 null
The columns contain a time part, but it is not used.
The period is pstart = 01.01.2014 and pend = 31.03.2014.
For the example above the result is:
for id = 1 - 72 hours
for id = 2 - 168 hours (7 days from 25 to 31 - end of period)
for id = 3 - 120 hours (5 days from 27 to 31 - the distraction is not completed, so the end of the period is used)
The sum is 360.
My code:
select
sum ((ds."endTime" - ds."startTime")*24) as hoursCount
from "Distractions" ds
--where ds."startTime" >= :pstart and ds."endTime" <= :pend
-- I don't know how to create where condition properly.
You'll have to take care of cases where date ranges are outside the input range, and also account for starttime and endtime being null.
This where clause should keep only the valid date ranges. I have substituted a null starttime with an earliest date and a null endtime with a date far in the future.
where coalesce(endtime,date'9999-12-31') >= :pstart
and coalesce(starttime,date'0000-01-01') <= :pend
Once you have filtered the records, you need to adjust the date values so that anything starting before the input :pstart is moved forward to :pstart, and anything ending after :pend is moved back to :pend. Subtracting these two should give the value you are looking for. But there is a catch: since the time is 00:00:00, subtracting the dates misses one full day, so add 1 to the difference.
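The filter-then-clip logic can be sketched in a few lines of Python, which may help when checking the numbers by hand. This is only an illustration under the same assumptions (whole days, both end days counted, null meaning open-ended); the function name clipped_hours is made up, and None stands in for SQL null.

```python
from datetime import date


def clipped_hours(start, end, pstart, pend):
    """Hours of the range [start, end] that fall inside [pstart, pend]."""
    # Clip: a missing or too-early start becomes pstart,
    # a missing or too-late end becomes pend.
    lo = pstart if start is None or start < pstart else start
    hi = pend if end is None or end > pend else end
    # No overlap with the period at all (same effect as the WHERE clause).
    if lo > hi:
        return 0
    # Both end days are counted, hence the + 1.
    return ((hi - lo).days + 1) * 24


pstart, pend = date(2014, 1, 1), date(2014, 3, 31)
# Same rows as in the schema setup (ids 1 to 7).
rows = {
    1: (date(2014, 1, 1), date(2014, 1, 3)),
    2: (date(2014, 3, 25), date(2014, 4, 2)),
    3: (date(2014, 3, 27), None),
    4: (None, date(2013, 4, 2)),
    5: (date(2015, 3, 25), date(2015, 4, 2)),
    6: (date(2013, 12, 25), date(2014, 4, 9)),
    7: (date(2013, 12, 26), date(2014, 1, 9)),
}
hours = {i: clipped_hours(s, e, pstart, pend) for i, (s, e) in rows.items()}
print(hours)
```

Rows 4 and 5 lie entirely outside the period, so they contribute 0 hours here, where the SQL version drops them in the where clause.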
SQL Fiddle
Oracle 11g R2 Schema Setup:
create table myt(
id number,
starttime date,
endtime date
);
insert into myt values( 1 ,date'2014-01-01', date'2014-01-03');
insert into myt values( 2 ,date'2014-03-25', date'2014-04-02');
insert into myt values( 3 ,date'2014-03-27', null);
insert into myt values( 4 ,null, date'2013-04-02');
insert into myt values( 5 ,date'2015-03-25', date'2015-04-02');
insert into myt values( 6 ,date'2013-12-25', date'2014-04-09');
insert into myt values( 7 ,date'2013-12-26', date'2014-01-09');
Query 1:
select id,
case when coalesce(starttime,date'0000-01-01') < date'2014-01-01'
then date'2014-01-01'
else starttime
end adj_starttime,
case when coalesce(endtime,date'9999-12-31') > date'2014-03-31'
then date'2014-03-31'
else endtime
end adj_endtime,
(case when coalesce(endtime,date'9999-12-31') > date'2014-03-31'
then date'2014-03-31'
else endtime
end -
case when coalesce(starttime,date'0000-01-01') < date'2014-01-01'
then date'2014-01-01'
else starttime
end
+ 1) * 24 hoursCount
from myt
where coalesce(endtime,date'9999-12-31') >= date'2014-01-01'
and coalesce(starttime,date'0000-01-01') <= date'2014-03-31'
order by 1
Results:
| ID | ADJ_STARTTIME | ADJ_ENDTIME | HOURSCOUNT |
|----|--------------------------------|--------------------------------|------------|
| 1 | January, 01 2014 00:00:00+0000 | January, 03 2014 00:00:00+0000 | 72 |
| 2 | March, 25 2014 00:00:00+0000 | March, 31 2014 00:00:00+0000 | 168 |
| 3 | March, 27 2014 00:00:00+0000 | March, 31 2014 00:00:00+0000 | 120 |
| 6 | January, 01 2014 00:00:00+0000 | March, 31 2014 00:00:00+0000 | 2160 |
| 7 | January, 01 2014 00:00:00+0000 | January, 09 2014 00:00:00+0000 | 216 |

Sql Server 2012 - Group data by varying timeslots

I have some data to analyze which is at half hour granularity, but would like to group it by 2, 3, 6, 12 hour and 2 days and 1 week to make some more meaningful comparisons.
|DateTime | Value |
|01 Jan 2013 00:00 | 1 |
|01 Jan 2013 00:30 | 1 |
|01 Jan 2013 01:00 | 1 |
|01 Jan 2013 01:30 | 1 |
|01 Jan 2013 02:00 | 2 |
|01 Jan 2013 02:30 | 2 |
|01 Jan 2013 03:00 | 2 |
|01 Jan 2013 03:30 | 2 |
Eg. 2 hour grouped result will be
|DateTime | Value |
|01 Jan 2013 00:00 | 4 |
|01 Jan 2013 02:00 | 8 |
To get the 2 hourly grouped result, I thought of this code -
CASE
WHEN DatePart(HOUR, DateTime) % 2 = 0 THEN
CAST(CAST(DatePart(HOUR, DateTime) AS varchar) + ':00' AS TIME)
ELSE
CAST(CAST(DatePart(HOUR, DateTime) - 1 AS varchar) + ':00' AS TIME) END Time
This seems to work alright, but I can't see how to generalize it to 3, 6, 12 hours.
I can for 6, 12 hours just use case statements and achieve result but is there any way to generalize so that I can achieve 2,3,6,12 hour granularity and also 2 days and a week level granularity? By generalize, I mean I could pass on a variable with desired granularity to the same query rather than having to constitute different queries with different case statements.
Is this possible? Please provide some pointers.
Thanks a lot!
I think you can use
Declare @Resolution int = 3 -- resolution in hours
Select
DateAdd(Hour,
DateDiff(Hour, 0, datetime) / @Resolution * @Resolution, -- integer arithmetic
0) as bucket,
Sum(values)
From
table
Group By
DateAdd(Hour,
DateDiff(Hour, 0, datetime) / @Resolution * @Resolution, -- integer arithmetic
0)
Order By
bucket
This calculates the number of hours since a known fixed date, rounds down to the resolution size you're interested in, then adds them back on to the fixed date.
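The same round-down trick can be tried out in Python to see where the bucket boundaries fall. This is a sketch, not the SQL Server implementation: EPOCH is set to 1900-01-01 on the assumption that this is what date 0 means in SQL Server, and the function names are made up.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1900, 1, 1)  # assumed equivalent of date 0 in SQL Server


def bucket_start(ts, resolution_hours):
    """Round ts down to the start of its bucket, counted from EPOCH."""
    # Whole hours since the fixed date (integer division of two timedeltas).
    hours = (ts - EPOCH) // timedelta(hours=1)
    # Integer arithmetic: drop the remainder, then add the hours back on.
    return EPOCH + timedelta(hours=hours // resolution_hours * resolution_hours)


def sum_by_bucket(rows, resolution_hours):
    """rows: iterable of (timestamp, value). Returns {bucket_start: sum}."""
    totals = defaultdict(int)
    for ts, value in rows:
        totals[bucket_start(ts, resolution_hours)] += value
    return dict(totals)


# The half-hourly sample from the question.
rows = [
    (datetime(2013, 1, 1, 0, 0), 1), (datetime(2013, 1, 1, 0, 30), 1),
    (datetime(2013, 1, 1, 1, 0), 1), (datetime(2013, 1, 1, 1, 30), 1),
    (datetime(2013, 1, 1, 2, 0), 2), (datetime(2013, 1, 1, 2, 30), 2),
    (datetime(2013, 1, 1, 3, 0), 2), (datetime(2013, 1, 1, 3, 30), 2),
]
two_hourly = sum_by_bucket(rows, 2)
three_hourly = sum_by_bucket(rows, 3)
print(two_hourly)
print(three_hourly)
```

Resolutions that divide 24 (2, 3, 6, 12) always line up with midnight. For 48 hours or a week, the buckets are aligned to the fixed date rather than to any particular weekday, so check where the boundaries land before relying on them.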
It will miss out buckets, though, if you have no data in them.