PostgreSQL query to count/group by day and display days with no data - sql

I need to create a PostgreSQL query that returns a day and the number of objects found for that day.
It's important that every single day appear in the results, even if no objects were found on that day. (This has been discussed before but I haven't been able to get things working in my specific case.)
First, I found a sql query to generate a range of days, with which I can join:
SELECT to_char(date_trunc('day', (current_date - offs)), 'YYYY-MM-DD')
AS date
FROM generate_series(0, 365, 1)
AS offs
Results in:
date
------------
2013-03-28
2013-03-27
2013-03-26
2013-03-25
...
2012-03-28
(366 rows)
Now I'm trying to join that to a table named 'sharer_emailshare' which has a 'created' column:
Table 'public.sharer_emailshare'
column | type
-------------------
id | integer
created | timestamp with time zone
message | text
to | character varying(75)
Here's the best GROUP BY query I have so far:
SELECT d.date, count(se.id) FROM (
select to_char(date_trunc('day', (current_date - offs)), 'YYYY-MM-DD')
AS date
FROM generate_series(0, 365, 1)
AS offs
) d
JOIN sharer_emailshare se
ON (d.date=to_char(date_trunc('day', se.created), 'YYYY-MM-DD'))
GROUP BY d.date;
The results:
date | count
------------+-------
2013-03-27 | 11
2013-03-24 | 2
2013-02-14 | 2
(3 rows)
Desired results:
date | count
------------+-------
2013-03-28 | 0
2013-03-27 | 11
2013-03-26 | 0
2013-03-25 | 0
2013-03-24 | 2
2013-03-23 | 0
...
2012-03-28 | 0
(366 rows)
If I understand correctly this is because I'm using a plain (implied INNER) JOIN, and this is the expected behavior, as discussed in the postgres docs.
I've looked through dozens of StackOverflow solutions, and all the ones with working queries seem specific to MySQL/Oracle/MSSQL and I'm having a hard time translating them to PostgreSQL.
The guy asking this question found his answer, with Postgres, but put it on a pastebin link that expired some time ago.
I've tried to switch to LEFT OUTER JOIN, RIGHT JOIN, RIGHT OUTER JOIN, CROSS JOIN, use a CASE statement to sub in another value if null, COALESCE to provide a default value, etc, but I haven't been able to use them in a way that gets me what I need.
Any assistance is appreciated! And I promise I'll get around to reading that giant PostgreSQL book soon ;)

You just need a left outer join instead of an inner join:
SELECT d.date, count(se.id)
FROM
(
SELECT to_char(date_trunc('day', (current_date - offs)), 'YYYY-MM-DD') AS date
FROM generate_series(0, 365, 1) AS offs
) d
LEFT OUTER JOIN sharer_emailshare se
ON d.date = to_char(date_trunc('day', se.created), 'YYYY-MM-DD')
GROUP BY d.date;
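If you also want the rows back in calendar order, as in the desired output, add an ORDER BY; a minimal variant of the query above, nothing else changed (count(se.id) rather than count(*) is also what keeps the empty days at 0):
SELECT d.date, count(se.id)
FROM
(
SELECT to_char(date_trunc('day', (current_date - offs)), 'YYYY-MM-DD') AS date
FROM generate_series(0, 365, 1) AS offs
) d
LEFT OUTER JOIN sharer_emailshare se
ON d.date = to_char(date_trunc('day', se.created), 'YYYY-MM-DD')
GROUP BY d.date
ORDER BY d.date DESC;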

Extending Gordon Linoff's helpful answer, I would suggest a few improvements:
Use ::date instead of date_trunc('day', ...)
Join on a date type rather than a character type (it's cleaner).
Use specific date ranges so they're easier to change later. In this case I select a year before the most recent entry in the table - something that couldn't have been done easily with the other query.
Compute the totals for an arbitrary subquery (using a CTE). You just have to cast the column of interest to the date type and call it date_column.
Include a column for cumulative total. (Why not?)
Here's my query:
WITH dates_table AS (
    SELECT created::date AS date_column
    FROM sharer_emailshare
    WHERE showroom_id = 5
)
SELECT series_table.date,
       COUNT(dates_table.date_column),
       SUM(COUNT(dates_table.date_column)) OVER (ORDER BY series_table.date)
FROM (
    SELECT (last_date - b.offs) AS date
    FROM (
        SELECT GENERATE_SERIES(0, last_date - first_date, 1) AS offs, last_date
        FROM (
            SELECT MAX(date_column) AS last_date,
                   (MAX(date_column) - '1 year'::interval)::date AS first_date
            FROM dates_table
        ) AS a
    ) AS b
) AS series_table
LEFT OUTER JOIN dates_table
    ON (series_table.date = dates_table.date_column)
GROUP BY series_table.date
ORDER BY series_table.date
I tested the query, and it produces the same results, plus the column for cumulative total.

I'll try to provide an answer that includes some explanation. I'll start with the smallest building block and work up.
If you run a query like this:
SELECT series.number FROM generate_series(0, 9) AS series(number)
You get output like this:
number
--------
0
1
2
3
4
5
6
7
8
9
(10 rows)
This can be turned into dates like this:
SELECT CURRENT_DATE + sequential_dates.date AS date
FROM generate_series(0, 9) AS sequential_dates(date)
Which will give output like this:
date
------------
2019-09-29
2019-09-30
2019-10-01
2019-10-02
2019-10-03
2019-10-04
2019-10-05
2019-10-06
2019-10-07
2019-10-08
(10 rows)
Then you can do a query like this (for example), joining the original query as a subquery against whatever table you're ultimately interested in:
SELECT sequential_dates.date,
COUNT(calendar_items.*) AS calendar_item_count
FROM (SELECT CURRENT_DATE + sequential_dates.date AS date
FROM generate_series(0, 9) AS sequential_dates(date)) sequential_dates
LEFT JOIN calendar_items ON calendar_items.starts_at::date = sequential_dates.date
GROUP BY sequential_dates.date
Which will give output like this:
date | calendar_item_count
------------+---------------------
2019-09-29 | 1
2019-09-30 | 8
2019-10-01 | 15
2019-10-02 | 11
2019-10-03 | 1
2019-10-04 | 12
2019-10-05 | 0
2019-10-06 | 0
2019-10-07 | 27
2019-10-08 | 24

Based on Gordon Linoff's answer, I realized another problem: I had a WHERE clause that I didn't mention in the original question.
Instead of a naked WHERE, I made a subquery:
SELECT d.date, count(se.id) FROM (
select to_char(date_trunc('day', (current_date - offs)), 'YYYY-MM-DD')
AS date
FROM generate_series(0, 365, 1)
AS offs
) d
LEFT OUTER JOIN (
SELECT * FROM sharer_emailshare
WHERE showroom_id=5
) se
ON (d.date=to_char(date_trunc('day', se.created), 'YYYY-MM-DD'))
GROUP BY d.date;
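Equivalently, the filter on the right-hand table can go straight into the ON clause of the outer join; in the ON clause it only restricts which rows match, whereas in a plain WHERE clause it would turn the query back into an inner join for the empty days. A sketch with the same names:
SELECT d.date, count(se.id) FROM (
SELECT to_char(date_trunc('day', (current_date - offs)), 'YYYY-MM-DD') AS date
FROM generate_series(0, 365, 1) AS offs
) d
LEFT OUTER JOIN sharer_emailshare se
ON d.date = to_char(date_trunc('day', se.created), 'YYYY-MM-DD')
AND se.showroom_id = 5
GROUP BY d.date;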

I like Jason Swett's SQL, but I ran into an issue where the count on some dates should be a zero rather than a one.
Running the statement select count(*) from public.post_call_info where timestamp::date = '2020-11-23' gives a count of zero, but the query below returns a one for that date.
Also, the + gave me a forward schedule, so I changed it to a minus to get 9 days of data prior to the current date.
SELECT sequential_dates.date,
COUNT(*) AS call_count
FROM (SELECT CURRENT_DATE - sequential_dates.date AS date
FROM generate_series(0, 9) AS sequential_dates(date)) sequential_dates
LEFT JOIN public.post_call_info ON public.post_call_info.timestamp::date =
sequential_dates.date
GROUP BY sequential_dates.date
order by date desc
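The zero-vs-one issue described above comes from COUNT(*): with a LEFT JOIN, a day with no calls still produces one NULL-extended row, and COUNT(*) counts it. Counting a column from the joined table skips those NULLs instead. A minimal sketch of the same query with only that change (column names taken from the query above):
SELECT sequential_dates.date,
       COUNT(public.post_call_info.timestamp) AS call_count   -- 0 on days with no calls
FROM (SELECT CURRENT_DATE - sequential_dates.date AS date
      FROM generate_series(0, 9) AS sequential_dates(date)) sequential_dates
LEFT JOIN public.post_call_info ON public.post_call_info.timestamp::date =
    sequential_dates.date
GROUP BY sequential_dates.date
ORDER BY date DESC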

Related

SQL: How to subtract 2 row values of the same column based on the same key

How to extract the difference of a specific column of multiple rows with same id?
Example table:
id | prev_val | new_val | date
---+----------+---------+------------------
 1 |        0 |       1 | 2020-01-01 10:00
 1 |        1 |       2 | 2020-01-01 11:00
 2 |        0 |       1 | 2020-01-01 10:00
 2 |        1 |       2 | 2020-01-02 10:00
expected result:
id | duration_in_hours
---+-------------------
 1 |                 1
 2 |                24
summary:
with id=1, (2020-01-01 10:00 - 2020-01-01 11:00) is 1 hour;
with id=2, (2020-01-01 10:00 - 2020-01-02 10:00) is 24 hours
Can we achieve this with SQL?
This solution should be an effective way (written against the question's prev_val column, using example as the table name):
with pd as (
    select
        id,
        max(date) filter (where prev_val = 0) as "prev",
        max(date) filter (where prev_val = 1) as "new"
    from
        example
    group by
        id )
select
    id,
    new - prev as diff
from
    pd;
If you need the difference between successive readings, something like this should work:
select a.id, a.new_val, a.date - b.date
from my_table a join my_table b
on a.id = b.id and a.prev_val = b.new_val
You could use min/max subqueries. For example:
SELECT mn.id, (mx.maxdate - mn.mindate) as "duration"
FROM (SELECT id, min(date) as mindate FROM example GROUP BY id) mn
JOIN (SELECT id, max(date) as maxdate FROM example GROUP BY id) mx ON
mx.id=mn.id
Let me know if you need help in converting duration to hours.
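For the hours part: in Postgres, subtracting two timestamps yields an interval, and EXTRACT(EPOCH FROM ...) turns that into seconds, so dividing by 3600 gives hours. A small self-contained example:
SELECT floor(extract(epoch FROM (timestamp '2020-01-02 10:00' - timestamp '2020-01-01 10:00')) / 3600) AS duration_in_hours;
-- returns 24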
You can use the lead()/lag() window functions to access data from the next/ previous row. You can further subtract timestamps to give an interval and extract the parts needed.
select id, floor( extract('day' from diff)*24 + extract('hour' from diff) ) "Time Difference: Hours"
from (select id, date_ts - lag(date_ts) over (partition by id order by date_ts) diff
from example
) hd
where diff is not null
order by id;
NOTE: Your expected results, as presented, are incorrect. The results would be -1 and -24 respectively.
DATE is a very poor choice for a column name. It is both a Postgres data type (at best leads to confusion) and a SQL Standard reserved word.

Is there any way to apply a loop in a Presto query

My use case is to use the Presto view below, on top of the table, to get a daily count by subtracting today's value from yesterday's. If there is no data in the table for a day, the view should dynamically use the next day's value and take the average for the missing day.
This is a Presto query; I have taken only one field in the query below.
CREATE OR REPLACE VIEW hive.facebook.post_metrics_daily AS
SELECT
a.post_id,
a.page,
a.dt,
a.created_time,
(
COALESCE(
(
CAST(a.likes AS integer)
- IF(
(CAST(b.likes AS integer) IS NULL),
0,
CAST(b.likes AS integer)
)
)
, 0
)
) likes
FROM
hive.facebook.post_metrics a
LEFT JOIN hive.facebook.post_metrics b
ON a.dt = (b.dt + INTERVAL '+1' DAY)
AND a.post_id = b.post_id
AND a.brandname = b.brandname
WHERE a.dt = date'2019-09-10'
If the data runs from the 9th to the 12th and the 10th is missing, the view should take the 11th day's data and average the 9th and 11th to produce the 10th's value. How can it be done? Can this formula be applied in the query, and if yes, how?
(today - yesterday) / (n + 1), where n is the number of days missing.
This is the sample data for likes. Where likes are missing, I need the average likes, and the number of missing days should be identified dynamically by the query.
Date         Likes-org.   missing likes   daily likes org.   expected likes
2019-10-17   20487        20487           20487              20487
2019-10-18   25384        25384           4897               4897
2019-10-19   26817        26817           1433               1433
2019-10-20   27499        missing likes   682                257
2019-10-21   27854        missing likes   355                258
2019-10-22   27987        missing likes   133                258
2019-10-23   28065        missing likes   78                 258
2019-10-24   28106        28106           41                 258
2019-10-25   28134        28134           28                 28
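For what it's worth, the expected-likes column in the sample appears to come from spreading the jump evenly across the gap: the last known total before the gap is 26817 (2019-10-19), the next known total is 28106 (2019-10-24), and there are 4 missing days in between, so each of those 5 days gets roughly the same share:
SELECT (28106 - 26817) / 5.0 AS spread_per_day;   -- about 257.8, matching the 257/258 values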
I think you just want lag():
SELECT pm.*,
(pm.likes +
LAG(pm.likes) OVER (PARTITION BY pm.post_id, pm.brand_name ORDER BY pm.dt)
) / 2
FROM hive.facebook.post_metrics pm ;
If you want to treat the missing days as 0s, you need date arithmetic. I think this would be:
SELECT pm.*,
( (pm.likes +
LAG(pm.likes) OVER (PARTITION BY pm.post_id, pm.brand_name ORDER BY pm.dt)
) /
DATE_DIFF('day',
LAG(pm.dt) OVER (PARTITION BY pm.post_id, pm.brand_name ORDER BY pm.dt),
dt
)
)
FROM hive.facebook.post_metrics pm ;
If you want this for a particular day, use a subquery for the above expression and then filter in the outer query.
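For example, something along these lines (a sketch only; the avg_likes alias is mine, the table and window are the same as above):
SELECT *
FROM (
    SELECT pm.*,
           (pm.likes +
            LAG(pm.likes) OVER (PARTITION BY pm.post_id, pm.brand_name ORDER BY pm.dt)
           ) / 2 AS avg_likes
    FROM hive.facebook.post_metrics pm
) t
WHERE t.dt = DATE '2019-09-10';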

Grouping Timestamps based on the interval between them

I have a table in Hive (SQL) with a bunch of timestamps that need to be grouped in order to create separate sessions based on the time difference between the timestamps.
Example:
Consider the following timestamps (given in HH:MM for simplicity):
9.00
9.10
9.20
9.40
9.43
10.30
10.45
11.25
12.30
12.33
and so on..
So now, all timestamps that fall within 30 mins of the next timestamp come under the same session,
i.e. 9.00,9.10,9.20,9.40,9.43 form 1 session.
But since the difference between 9.43 and 10.30 is more than 30 mins, the time stamp 10.30 falls under a different session. Again, 10.30 and 10.45 fall under one session.
After we have created these sessions, we have to obtain the minimum timestamp for that session and the max timestamp.
I tried to subtract the current timestamp with its LEAD and place a flag if it is greater than 30 mins, but I'm having difficulty with this.
Any suggestion from you guys would be greatly appreciated. Please let me know if the question isn't clear enough.
Expected Output for this sample data:
Session_start Session_end
9.00 9.43
10.30 10.45
11.25 11.25 (same because the next time is not within 30 mins)
12.30 12.33
Hope this helps.
So it's not MySQL but Hive. I don't know Hive, but if it supports LAG, as you say, try this PostgreSQL query. You will probably have to change the time difference calculation, that's usually different from one dbms to another.
select min(thetime) as start_time, max(thetime) as end_time
from
(
select thetime, count(gap) over (order by thetime rows between unbounded preceding and current row) as groupid
from
(
select thetime, case when thetime - lag(thetime) over (order by thetime) > interval '30 minutes' then 1 end as gap
from mytable
) times
) groups
group by groupid
order by min(thetime);
The query finds gaps, then uses a running total of gap counts to build group IDs, and the rest is aggregation.
SQL fiddle: http://www.sqlfiddle.com/#!17/8bc4a/6.
With MySQL lacking LAG and LEAD functions, getting the previous or next record is some work already. Here is how:
select
thetime,
(select max(thetime) from mytable afore where afore.thetime < mytable.thetime) as afore_time,
(select min(thetime) from mytable after where after.thetime > mytable.thetime) as after_time
from mytable;
Based on this we can build the whole query where we are looking for gaps (i.e. the time difference to the previous or next record is more than 30 minutes = 1800 seconds).
select
startrec.thetime as start_time,
(
select min(endrec.thetime)
from
(
select
thetime,
coalesce(time_to_sec(timediff((select min(thetime) from mytable after where after.thetime > mytable.thetime), thetime)), 1801) > 1800 as gap
from mytable
) endrec
where gap
and endrec.thetime >= startrec.thetime
) as end_time
from
(
select
thetime,
coalesce(time_to_sec(timediff(thetime, (select max(thetime) from mytable afore where afore.thetime < mytable.thetime))), 1801) > 1800 as gap
from mytable
) startrec
where gap;
SQL fiddle: http://www.sqlfiddle.com/#!2/d307b/20.
Try this..
SELECT MIN(session_time_tmp) session_start, MAX(session_time_tmp) session_end FROM
(
SELECT IF((TIME_TO_SEC(TIMEDIFF(your_time_field, COALESCE(#previousValue, your_time_field))) / 60) > 30 ,
#sessionCount := #sessionCount + 1, #sessionCount ) sessCount,
( #previousValue := your_time_field ) session_time_tmp FROM
(
SELECT your_time_field, #previousValue:= NULL, #sessionCount := 1 FROM yourtable ORDER BY your_time_field
) a
) b
GROUP BY sessCount
Just replace yourtable and your_time_field
Try this:
SELECT DATE_FORMAT(MIN(STR_TO_DATE(B.column1, '%H.%i')), '%H.%i') AS Session_start,
DATE_FORMAT(MAX(STR_TO_DATE(B.column1, '%H.%i')), '%H.%i') AS Session_end
FROM tableA A
LEFT JOIN ( SELECT A.column1, diff, IF(#diff:=diff < 30, #id, #id:=#id+1) AS rnk
FROM (SELECT B.column1, TIME_TO_SEC(TIMEDIFF(STR_TO_DATE(B.column1, '%H.%i'), STR_TO_DATE(A.column1, '%H.%i'))) / 60 AS diff
FROM tableA A
INNER JOIN tableA B ON STR_TO_DATE(A.column1, '%H.%i') < STR_TO_DATE(B.column1, '%H.%i')
GROUP BY STR_TO_DATE(A.column1, '%H.%i')
) AS A, (SELECT #diff:=0, #id:= 1) AS B
) AS B ON A.column1 = B.column1
GROUP BY IFNULL(B.rnk, 1);
Check the SQL FIDDLE DEMO
OUTPUT
| SESSION_START | SESSION_END |
|---------------|-------------|
| 9.00 | 9.43 |
| 10.30 | 10.45 |
| 11.25 | 11.25 |
| 12.30 | 12.33 |

How to fill the gaps?

Assuming I have two records, both with a date and a count:
--Date-- --Count--
2011-09-20 00:00:00 5
2011-09-16 00:00:00 8
How would you select this for filling the time gaps, always taking the last previous record?
So the output would be:
--Date-- --Count--
2011-09-20 00:00:00 5
2011-09-19 00:00:00 8
2011-09-18 00:00:00 8
2011-09-17 00:00:00 8
2011-09-16 00:00:00 8
I couldn't figure out a neat solution for this, yet.
I guess this could be done with DATEDIFF and a for-loop, but I hope it can be done more easily.
You have 2 issues you're trying to resolve. The first issue is how to fill the gaps. The second issue is populating the Count field for those missing records.
Issue 1: This can be resolved by either using a Dates Lookup table or by creating a recursive common table expression. I would recommend creating a Dates Lookup table for this if that is an option. If you cannot create such a table, then you're going to need something like this.
WITH CTE AS (
SELECT MAX(dt) maxdate, MIN(dt) mindate
FROM yourtable
),
RecursiveCTE AS (
SELECT mindate dtfield
FROM CTE
UNION ALL
SELECT DATEADD(day, 1, dtfield)
FROM RecursiveCTE R
JOIN CTE T
ON R.dtfield < T.maxdate
)
That should give you a list of dates starting with the MIN date in your table and ending with the MAX.
Issue 2: Here is where a correlated subquery would come in handy (as much as I generally stay away from them) to get the last cnt from your original table:
SELECT r.dtfield,
(SELECT TOP 1 cnt
FROM yourtable
WHERE dt <= r.dtfield
ORDER BY dt DESC) cnt
FROM RecursiveCTE r
SQL Fiddle Demo
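Put together as one statement, the above looks roughly like this (a sketch in SQL Server syntax, reusing the names from the answer; OPTION (MAXRECURSION 0) lifts the default 100-level cap on recursive CTEs, which longer date ranges would otherwise hit):
WITH CTE AS (
    SELECT MAX(dt) maxdate, MIN(dt) mindate
    FROM yourtable
),
RecursiveCTE AS (
    SELECT mindate dtfield
    FROM CTE
    UNION ALL
    SELECT DATEADD(day, 1, dtfield)
    FROM RecursiveCTE R
    JOIN CTE T
    ON R.dtfield < T.maxdate
)
SELECT r.dtfield,
       (SELECT TOP 1 cnt
        FROM yourtable
        WHERE dt <= r.dtfield
        ORDER BY dt DESC) cnt
FROM RecursiveCTE r
OPTION (MAXRECURSION 0);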
My solution goes like this.
Step 1: Have a date table which has all the dates - there are many ways to build one, e.g. "Get a list of dates between two dates".
Step 2: Do a left outer join from the date table to your result set, which gives you the result set below. Call this table TEST_DATE_COUNT:
--Date-- --Count--
2011-09-20 00:00:00 5
2011-09-19 00:00:00 0
2011-09-18 00:00:00 0
2011-09-17 00:00:00 0
2011-09-16 00:00:00 8
Step 3: Use a correlated subquery like the one below:
SELECT t1.date_x, t1.count_x,
(case when count_x=0 then (SELECT max(COUNT_X)
FROM TEST_DATE_COUNT r
WHERE r.DATE_X <= t1.DATE_X)
else COUNT_X
end)
cnt
FROM TEST_DATE_COUNT t1
Please let me know if this works. I tested and it worked.

PostgreSQL: How to return rows with respect to a found row (relative results)?

Forgive my example if it does not make sense. I'm going to try with a simplified one to encourage more participation.
Consider a table like the following:
dt | mnth | foo
--------------+------------+--------
2012-12-01 | December |
...
2012-08-01 | August |
2012-07-01 | July |
2012-06-01 | June |
2012-05-01 | May |
2012-04-01 | April |
2012-03-01 | March |
...
1997-01-01 | January |
If you look for the record with dt closest to today w/o going over, what would be the best way to also return the 3 records beforehand and 7 records after?
I decided to try windowing functions:
WITH dates AS (
select row_number() over (order by dt desc)
, dt
, dt - now()::date as dt_diff
from foo
)
, closest_date AS (
select * from dates
where dt_diff = ( select max(dt_diff) from dates where dt_diff <= 0 )
)
SELECT *
FROM dates
WHERE row_number - (select row_number from closest_date) >= -3
AND row_number - (select row_number from closest_date) <= 7 ;
I feel like there must be a better way to return relative records with a window function, but it's been some time since I've looked at them.
create table foo (dt date);
insert into foo values
('2012-12-01'),
('2012-08-01'),
('2012-07-01'),
('2012-06-01'),
('2012-05-01'),
('2012-04-01'),
('2012-03-01'),
('2012-02-01'),
('2012-01-01'),
('1997-01-01'),
('2012-09-01'),
('2012-10-01'),
('2012-11-01'),
('2013-01-01')
;
select dt
from (
(
select dt
from foo
where dt <= current_date
order by dt desc
limit 4
)
union all
(
select dt
from foo
where dt > current_date
order by dt
limit 7
)) s
order by dt
;
dt
------------
2012-03-01
2012-04-01
2012-05-01
2012-06-01
2012-07-01
2012-08-01
2012-09-01
2012-10-01
2012-11-01
2012-12-01
2013-01-01
(11 rows)
You could use the window function lead():
SELECT dt_lead7 AS dt
FROM (
SELECT *, lead(dt, 7) OVER (ORDER BY dt) AS dt_lead7
FROM foo
) d
WHERE dt <= now()::date
ORDER BY dt DESC
LIMIT 11;
Somewhat shorter, but the UNION ALL version will be faster with a suitable index.
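For example, a plain b-tree index on the date column is what both branches of the UNION ALL can use for their ORDER BY ... LIMIT (table name from the question, index name arbitrary):
CREATE INDEX foo_dt_idx ON foo (dt);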
That leaves a corner case where "date closest to today" is within the first 7 rows. You can pad the original data with 7 rows of -infinity to take care of this:
SELECT d.dt_lead7 AS dt
FROM (
SELECT *, lead(dt, 7) OVER (ORDER BY dt) AS dt_lead7
FROM (
SELECT '-infinity'::date AS dt FROM generate_series(1,7)
UNION ALL
SELECT dt FROM foo
) x
) d
WHERE d.dt <= now()::date        -- same as: WHERE dt <= now()::date ¹
ORDER BY d.dt_lead7 DESC         -- same as: ORDER BY dt DESC ¹
LIMIT 11;
I table-qualified the columns in the second query to clarify what happens. See below.
The result will include NULL values if the "date closest to today" is within the last 7 rows of the base table. You can filter those with an additional sub-select if you need to.
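Something like this, for instance (a sketch that simply wraps the query above and drops the padded NULLs):
SELECT dt
FROM (
    SELECT d.dt_lead7 AS dt
    FROM (
        SELECT *, lead(dt, 7) OVER (ORDER BY dt) AS dt_lead7
        FROM (
            SELECT '-infinity'::date AS dt FROM generate_series(1,7)
            UNION ALL
            SELECT dt FROM foo
        ) x
    ) d
    WHERE d.dt <= now()::date
    ORDER BY d.dt_lead7 DESC
    LIMIT 11
) sub
WHERE dt IS NOT NULL;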
¹ To address your doubts about output names versus column names in the comments, consider the following quotes from the manual.
Where to use an output column's name:
An output column's name can be used to refer to the column's value in
ORDER BY and GROUP BY clauses, but not in the WHERE or HAVING clauses;
there you must write out the expression instead.
Bold emphasis mine. WHERE dt <= now()::date references the column d.dt, not the output column of the same name - thereby working as intended.
Resolving conflicts:
If an ORDER BY expression is a simple name that matches both an output
column name and an input column name, ORDER BY will interpret it as
the output column name. This is the opposite of the choice that GROUP BY
will make in the same situation. This inconsistency is made to be
compatible with the SQL standard.
Bold emphasis mine again. ORDER BY dt DESC in the example references the output column's name - as intended. Anyway, either column would sort the same. The only difference could be with the NULL values of the corner case. But that falls flat, too, because:
the default behavior is NULLS LAST when ASC is specified or implied,
and NULLS FIRST when DESC is specified
As the NULL values come after the biggest values, the order is identical either way.
Or, without LIMIT (as per request in comment):
WITH x AS (
SELECT *
, row_number() OVER (ORDER BY dt) AS rn
, first_value(dt) OVER (ORDER BY (dt > '2011-11-02')
, dt DESC) AS dt_nearest
FROM foo
)
, y AS (
SELECT rn AS rn_nearest
FROM x
WHERE dt = dt_nearest
)
SELECT dt
FROM x, y
WHERE rn BETWEEN rn_nearest - 3 AND rn_nearest + 7
ORDER BY dt;
If performance is important, I would still go with @Clodoaldo's UNION ALL variant. It will be fastest. Database-agnostic SQL will only get you so far. Other RDBMS do not have window functions at all yet (MySQL), or use different function names (like first_val instead of first_value). You might just as well replace LIMIT with TOP n (MS SQL) or whatever the local dialect requires.
You could use something like that:
select * from foo
where dt between now()- interval '7 months' and now()+ interval '3 months'
This and this may help you.