How to calculate the number of each weekday between 2 dates in PostgreSQL?

There is a column with dates. I would like to calculate the number of each weekday (Monday to Sunday) from those dates to the present date. On Stack Overflow and elsewhere, I found answers that involved creating functions; I was hoping there is some built-in function that would do it. I found another solution here, which mentions DATEPART('day', start - stop) AS days, but that didn't work. If this is a recent addition to PostgreSQL then it won't work for me, because the tool we use at work for PostgreSQL doesn't accept some of the recent features (for example, PostgreSQL now accepts negative indexing but the tool doesn't).
What I want:
 start_date | day_of_week | no_of_days
------------+-------------+------------
 2022-04-01 |           1 |         10
 2022-04-01 |           2 |          9
 2022-05-15 |           2 |          3
 2022-06-01 |           5 |          1
start_date is the column of dates. For each of those dates I want the number of times each weekday occurs between that date and current_date. There were 10 Mondays between 1st April 2022 and 6th June 2022 (today), and that's the number I want for each day of the week.
How can I achieve this in PostgreSQL? I am on version 12.8.

This "simple" but optimized solution counts the number of occurrences for every weekday in the interval between start_date and the current date:
WITH cte(start_date) AS (
   VALUES
     ('2022-04-01'::date)
   , ('2022-05-15')
   , ('2022-06-01')
   )
SELECT c.start_date, sub.dow, sub.no_of_days
FROM   cte c
CROSS  JOIN LATERAL (
   SELECT dow, COALESCE(ct, 0) AS no_of_days
   FROM  (
      SELECT EXTRACT('isodow' FROM g)::int AS dow, count(*) AS ct
      FROM   generate_series(start_date, current_date, interval '1 day') g
      GROUP  BY 1
      ) g
   RIGHT  JOIN generate_series(1, 7) dow USING (dow)
   ) sub
ORDER  BY 1, 2;
db<>fiddle here
The upper bound (current_date) is included.
Every weekday is included, even when no_of_days is 0.
For very old dates (resulting in long intervals), an arithmetic solution will be cheaper than simply counting generated days. A bit more challenging, but not that hard.
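For illustration, here is a minimal sketch of that arithmetic approach (reusing the cte from above; the LATERAL alias first_occ is just a name introduced for this example). It finds the first occurrence of each weekday on or after start_date and counts whole weeks up to current_date, so no per-day rows are generated:
WITH cte(start_date) AS (
   VALUES
     ('2022-04-01'::date)
   , ('2022-05-15')
   , ('2022-06-01')
   )
SELECT c.start_date, w.dow
     , CASE WHEN f.first_occ > current_date THEN 0   -- weekday not reached yet
            ELSE (current_date - f.first_occ) / 7 + 1 END AS no_of_days
FROM   cte c
CROSS  JOIN generate_series(1, 7) AS w(dow)
CROSS  JOIN LATERAL (
   -- first date on or after start_date that falls on weekday w.dow
   SELECT c.start_date + ((w.dow - EXTRACT(isodow FROM c.start_date)::int + 7) % 7) AS first_occ
   ) f
ORDER  BY 1, 2;
Both bounds are included, matching the counting query above.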

Related

Convert integer years or months into days in SQL impala

I have two columns; both have integer values. One represents years and the other represents months.
My goal is to perform calculations in days (integer), so I have to convert both to calendar days, taking into consideration that we have years with both 365 and 366 days.
Example in pseudo code:
Select Convert(years_int) to days, Convert(months int) to days
from table.
Real Example:
if --> Years = 1 and Months = 12
1) Convert both to days to compare them: Years = 365 days; Months = 365 days
After conversion : (Years = Months) Returns TRUE.
The problem is when we have years = 10 (for example), we must take into account the fact that at least two of them have 366 days. The same with months: we have 30 and 31 days. So I need to compensate for that fact to get the most accurate possible value in days.
Thanks in advance
Converting from integers to a timestamp can be done in PostgreSQL. I do not have Impala, but hopefully the script below will help you get this done using Impala:
with
  year  as (select 2022 as y union select 2023),
  month as (select generate_series(1,12) as m),
  day   as (select generate_series(1,31) as d)
select y, m, d, dt
from (
  select
    y, m, d,
    -- day 1 of the month, plus (d-1) days, gives day d of that month
    to_date(ds, 'YYYYMMDD') + (((d-1)::char(2)) || ' day')::interval as dt
  from (
    select
      *,
      -- build 'YYYYMM01', the first day of the month, as a CHAR(8) string
      y::char(4) || right('0' || m::char(2), 2) || '01' as ds
    from year, month, day
  ) x
) y
where extract(year from dt) = y
  and extract(month from dt) = m
order by dt
;
see: DBFIDDLE
Functions used in this query and, where I can, a way to convert them to Impala (remember, I do not use that tool/language/dialect):
to_date(a,b): converts the string a to a date using the format b. In Impala you can use CAST(expression AS type FORMAT pattern).
y::char(4): casts y to a char(4). In Impala you can use CAST(expression AS type).
right(a,b): use right().
||: use concat().
generate_series(a,b): generates a series of numbers from a to (and including) b. A plain SQL alternative is to write SELECT 1 AS x UNION SELECT 2 UNION SELECT 3, which generates the same series as generate_series(1,3) in PostgreSQL.
extract(year from a): gets the year from the datetime field a, see YEAR().
One special case is this expression: to_date(ds,'YYYYMMDD')+(((d-1)::char(2))||' day')::interval
This converts ds (with datatype CHAR(8)) to a date, and then adds (using +) a number of days (like '4 day').
Because I included all days up to 31, this spills over into the next month for February, April, June, September and November, because those months do not have 31 days. This is corrected by the WHERE clause at the end (where extract(year from dt)=y and extract(month from dt)=m).
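A quick standalone check of that correction (not part of the script above): starting from the first of February and adding 29 days rolls over into March, so the extracted month no longer matches m = 2 and the WHERE clause drops that row:
SELECT to_date('20220201', 'YYYYMMDD') + interval '29 day' AS dt
     , extract(month FROM to_date('20220201', 'YYYYMMDD') + interval '29 day') = 2 AS month_still_matches;
-- dt = 2022-03-02 00:00:00, month_still_matches = false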

Query date ranges excluding weekends in PostgreSQL

I have the following PostgreSQL table:
id | date_slot
------+-------------------------
1 | [2023-02-08,2023-02-15)
2 | [2023-02-20,2023-02-26)
3 | [2023-02-27,2023-03-29)
I want to make a query that returns rows contained in these ranges but excludes weekends.
For example, the query I made returns the following, but does not exclude weekends:
SELECT * FROM table where '2023-02-11'::date <# date_slot;
id | date_slot
------+-------------------------
1 | [2023-02-08,2023-02-15)
2023-02-11 is a weekend so it must not return a result. How can I do that?
Selecting workday-only dateranges (without weekends):
You can check what day of the week the first day in the range falls on using extract() and, knowing the range's length from upper()-lower(), determine whether it will cross a weekend: online demo
select *
from test_table
where '2023-02-11'::date <# date_slot
and extract(isodow from lower(date_slot)
+ (not lower_inc(date_slot))::int)
+( (upper(date_slot) - (not upper_inc(date_slot))::int)
-(lower(date_slot) + (not lower_inc(date_slot))::int) ) < 6 ;
Cases where your ranges have varying upper and lower bound inclusivity are handled by lower_inc() and upper_inc() - their boolean result, when cast to int, just adds or subtracts a day to account for whether it's included by the range or not.
The range is on or includes a weekend if it starts on a weekend day or continues for too long from any other day of the week:
4 days, if it starts on a Monday (isodow=1)
3 days, if it starts on a Tuesday (isodow=2)
2 days, if it starts on a Wednesday (isodow=3)
1 day, if it starts on a Thursday (isodow=4)
0 days, if it starts on a Friday (isodow=5)
This means the isodow of range start date and the range length cannot sum up to more than 5 for the range not to overlap a weekend.
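For example, a quick spot check of that rule against the first range from the question, [2023-02-08,2023-02-15): it starts on a Wednesday (isodow 3) and its inclusive length is 6 days, so the sum is 9, not below 6, and the range overlaps the weekend of Feb 11/12:
SELECT extract(isodow FROM date '2023-02-08')                          -- 3 (Wednesday)
       + (date '2023-02-14' - date '2023-02-08') < 6 AS workdays_only; -- 3 + 6 = 9, returns false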
You can also enumerate the dates contained by these ranges using generate_series() and see if they include a Saturday (day 6) or a Sunday (0 as dow, 7 as isodow):
select *
from test_table
where '2023-02-11'::date <# date_slot
and not exists (
select true
from generate_series(
lower(date_slot) + (not lower_inc(date_slot))::int,
upper(date_slot) - (not upper_inc(date_slot))::int,
'1 day'::interval) as alias(d)
where extract(isodow from d) in (6,7) );
Selecting records based on workday-only dates:
First comment got it right
select *
from table_with_dateranges dr,
table_with_dates d
where d.date <# dr.date_slot
and extract(isodow from d.date) not in (6,7);

Count distinct customers, active within a year, for every week of the year

I am working with an existing E-commerce database. Actually, this process is usually done in Excel, but we want to try it directly with a query in PostgreSQL (version 10.6).
We define as an active customer a person who has bought at least once within 1 year. This means, if I analyze week 22 in 2020, an active customer will be the one that has bought at least once since week 22, 2019.
I want the output for each week of the year (2020). Basically what I need is ...
select
email,
orderdate,
id
from
orders_table
where
paid = true;
|---------------------|-------------------|-----------------|
| email | orderdate | id |
|---------------------|-------------------|-----------------|
| email1#email.com |2020-06-02 05:04:32| Order-2736 |
|---------------------|-------------------|-----------------|
I can't create new tables. And I would like to see the output like this:
Year| Week | Active customers
2020| 25 | 6978
2020| 24 | 3948
Depending on whether there is a year and week column, you can use an OVER (PARTITION BY ...) with extract:
SELECT
    extract(year from orderdate) AS year,
    extract(week from orderdate) AS week,
    sum(1) OVER (PARTITION BY extract(YEAR FROM orderdate),
                              extract(WEEK FROM orderdate)) AS customer_count_in_week
FROM ordertable
WHERE paid = true;
Which should bucket all orders by year and week, thus showing the total count per week in a year where paid is true.
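Note that a window function like this keeps one row per order; if you only want a single row per (year, week), a plain GROUP BY sketch over the same table does that. It counts paid orders per week, not the rolling distinct customers the question ultimately asks for (the next answer covers that):
SELECT extract(year from orderdate) AS year
     , extract(week from orderdate) AS week
     , count(*) AS orders_in_week
FROM   ordertable
WHERE  paid = true
GROUP  BY 1, 2
ORDER  BY 1, 2;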
references:
https://www.postgresql.org/docs/9.1/tutorial-window.html
https://www.postgresql.org/docs/8.1/functions-datetime.html
if I analyze week 22 in 2020, an active customer will be the one that has bought at least once since week 22, 2019.
Problems on your side
This method has some corner case ambiguities / issues:
Do you include or exclude "week 22 in 2020"? (I exclude it below to stay closer to "a year".)
A year can have 52 or 53 full weeks. Depending on the current date, the calculation is based on 52 or 53 weeks, causing a possible bias of almost 2 %!
If you start the time range on "the same date last year", then the margin of error is only 1 / 365 or ~ 0.3 %, due to leap years.
A fixed "period of 365 days" (or 366) would eliminate the bias altogether.
Problems on the SQL side
Unfortunately, window functions do not currently allow the DISTINCT key word (for good reasons). So something of the form:
SELECT count(DISTINCT email) OVER (ORDER BY year, week
GROUPS BETWEEN 52 PRECEDING AND 1 PRECEDING)
FROM ...
.. triggers:
ERROR: DISTINCT is not implemented for window functions
The GROUPS keyword has only been added in Postgres 10 and would otherwise be just what we need.
What's more, your odd frame definition wouldn't even work exactly, since the number of weeks to consider is not always 52, as discussed above.
So we have to roll our own.
Solution
The following simply generates all weeks of interest, and computes the distinct count of customers for each. Simple, except that date math is never entirely simple. But, depending on details of your setup, there may be faster solutions. (I had several other ideas.)
The time range for which to report may change. Here is an auxiliary function to generate weeks of a given year:
CREATE OR REPLACE FUNCTION f_weeks_of_year(_year int)
RETURNS TABLE(year int, week int, week_start timestamp)
LANGUAGE sql IMMUTABLE STRICT PARALLEL SAFE
ROWS 52 COST 10 AS
$func$
SELECT _year, d.week::int, d.week_start
FROM generate_series(date_trunc('week', make_date(_year, 01, 04)::timestamp) -- first day of first week
, LEAST(date_trunc('week', localtimestamp), make_date(_year, 12, 28)::timestamp) -- latest possible start of week
, interval '1 week') WITH ORDINALITY d(week_start, week)
$func$;
Call:
SELECT * FROM f_weeks_of_year(2020);
It returns 1 row per week, but stops at the current week for the current year. (Empty set for future years.)
The calculation is based on these facts:
The first ISO week of the year always contains January 04.
The last ISO week cannot start after December 28.
Actual week numbers are computed on the fly using WITH ORDINALITY. See:
PostgreSQL unnest() with element number
Aside, I stick to timestamp and avoid timestamptz for this purpose. See:
Generating time series between two dates in PostgreSQL
The function also returns the timestamp of the start of the week (week_start), which we don't need for the problem at hand. But I left it in to make the function more useful in general.
Makes the main query simpler:
WITH weekly_customer AS (
SELECT DISTINCT
EXTRACT(YEAR FROM orderdate)::int AS year
, EXTRACT(WEEK FROM orderdate)::int AS week
, email
FROM orders_table
WHERE paid
AND orderdate >= date_trunc('week', timestamp '2019-01-04') -- max range for 2020!
ORDER BY 1, 2, 3 -- optional, might improve performance
)
SELECT d.year, d.week
, (SELECT count(DISTINCT email)
FROM weekly_customer w
WHERE (w.year, w.week) >= (d.year - 1, d.week) -- row values, see below
AND (w.year, w.week) < (d.year , d.week) -- exclude current week
) AS active_customers
FROM f_weeks_of_year(2020) d; -- (year int, week int, week_start timestamp)
db<>fiddle here
The CTE weekly_customer folds to unique customers per calendar week once, as duplicate entries are just noise for our calculation. It's used many times in the main query. The cut-off condition is based on Jan 04 once more. Adjust to your actual reporting period.
The actual count is done with a lowly correlated subquery. Could be a LEFT JOIN LATERAL ... ON true instead. See:
What is the difference between LATERAL and a subquery in PostgreSQL?
Using row value comparison to make the range definition simple. See:
SQL syntax term for 'WHERE (col1, col2) < (val1, val2)'
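For completeness, a minimal sketch of the fixed "period of 365 days" variant mentioned under "Problems on your side". It reuses f_weeks_of_year() and the orders_table columns from above, trading the folded CTE for a plain scan per reported week:
SELECT d.year, d.week
     , (SELECT count(DISTINCT o.email)
        FROM   orders_table o
        WHERE  o.paid
        AND    o.orderdate >= d.week_start - interval '365 days'
        AND    o.orderdate <  d.week_start            -- exclude the current week
       ) AS active_customers
FROM   f_weeks_of_year(2020) d;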

How to select and group fortnightly in postgresql

I am trying to group the rows in a table fortnightly, but can't seem to work out how to do it, especially as the date_part function does not have a 'fortnight' keyword argument.
This is what I have so far:
CREATE TABLE foo(
dt DATE NOT NULL,
f1 REAL NOT NULL,
f2 REAL NOT NULL,
f3 REAL NOT NULL,
f4 REAL NOT NULL
);
SELECT AVG((f1+f2+f3+f4)/4) AS fld_avg
FROM (
    SELECT date_part('year', dt) AS year_part,
           date_part('fortnight', dt) AS fortnight_part,
           f1, f2, f3, f4
    FROM foo
    WHERE dt >= date_trunc('day', NOW() - interval '3 month')
) foo
GROUP BY year_part, fortnight_part
How may I rewrite (or modify) the query above so as to group data fortnightly?
Basic idea
What we need to do is take intervals of 14 consecutive days and map them to unique buckets, then group by those buckets. These buckets can be of any type (int, char, timestamp), as long as we have a unique value.
Division
A simple way to accomplish this is division. Divide by 14 days and truncate the result to date precision.
For example, we can extract the number of seconds since 1970-01-01, the UNIX epoch, and divide by the number of seconds in a fortnight: 14 * 24 * 60 * 60 = 14 * 86400 = 1209600. (I'll use Vao Tsun's example data)
WITH c(d) AS (values('2017.12.21'::date),('2017.12.31'),('2018.01.26'),('2018.02.01'))
SELECT (EXTRACT(EPOCH FROM d)::int/86400)/14 fortnight FROM c
which yields fortnights since 1970-01-01 (a Thursday):
fortnight
-----------
1251
1252
1254
1254
(4 rows)
The integer values we get represent the number of fortnights since 1970-01-01, but we don't have to care about this. The important thing is that it uniquely identifies a fortnight.
Due to 1970-01-01 being a Thursday, all fortnights will start on a Thursday. We might want to vary the starting point of our fortnight to a different day of the week (e.g. Monday) by adding:
WITH c(d) AS (values('2017.12.21'::date),('2017.12.31'),('2018.01.26'),('2018.02.01'))
SELECT (EXTRACT(EPOCH FROM d)::int/86400 + 4)/14 fortnight FROM c
By adding four days to Thursday we end up at Monday.
If you would rather have fortnights with respect to the beginning of the year, instead of some arbitrary absolute date such as 1970-01-01, we can use the day of the year instead:
WITH c(d) AS (values('2017.12.21'::date),('2017.12.31'),('2018.01.26'),('2018.02.01'))
SELECT EXTRACT(year FROM d) * 26 + EXTRACT(doy FROM d)::int/14 AS fortnight FROM c;
which yields
fortnight
-----------
52467
52468
52469
52470
(4 rows)
We need to multiply the extracted year by 26, because there are 26.1… fortnights in a year.
Truncation
Instead of division another approach is truncation. We map each day of a specific fortnight to the first timestamp of that fortnight.
WITH c(d) AS (values('2017.12.21'::date),('2017.12.31'),('2018.01.26'),('2018.02.01'))
SELECT d - make_interval(secs => EXTRACT(EPOCH FROM d)::int % (86400 * 14)) AS fortnight FROM c;
which yields
fortnight
---------------------
2017-12-14 00:00:00
2017-12-28 00:00:00
2018-01-25 00:00:00
2018-01-25 00:00:00
(4 rows)
This might seem a bit more complicated, but has some benefits. The result is still a date/time type, and other code does not need to worry about the fact that we used fortnights.
Again, instead of absolute fortnights, we can calculate this with respect to the beginning of the year:
WITH c(d) AS (values('2017.12.21'::date),('2017.12.31'),('2018.01.26'),('2018.02.01'))
SELECT d - make_interval(days => EXTRACT(dow FROM d)::int % 14) AS fortnight FROM c;
which yields
fortnight
---------------------
2017-12-17 00:00:00
2017-12-31 00:00:00
2018-01-21 00:00:00
2018-01-28 00:00:00
(4 rows)
The result is of type timestamp, you might want to have date instead. This can be addressed by casting:
(d - make_interval(days => EXTRACT(dow FROM d)::int % 14))::date
or subtracting int instead of interval from date:
d - (EXTRACT(dow FROM d)::int % 14)
There are much more possibilities. With this scheme, we can calculate the fortnight or any other interval with respect to the beginning of the month, some arbitrary date, etc.
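To tie this back to the original query, here is a minimal sketch (assuming the foo table from the question, and 2018-01-01, a Monday, as an arbitrary anchor date; rows dated before the anchor would need a modulo that handles negative remainders):
-- Group rows from the last 3 months into fortnights starting on Mondays,
-- counted from the (hypothetical) anchor date 2018-01-01.
SELECT dt - (dt - date '2018-01-01') % 14 AS fortnight_start
     , avg((f1 + f2 + f3 + f4) / 4)       AS fld_avg
FROM   foo
WHERE  dt >= date_trunc('day', now() - interval '3 month')
GROUP  BY 1
ORDER  BY 1;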
update
A fortnight is a two-week period: one week even, the other odd, e.g. weeks 1 and 2, 3 and 4, 5 and 6.
closer: 2 is even, mod(2,2)=0 and 1 is odd, mod(1,2)=1
4 is even, mod(4,2)=0 and 3 is odd, mod(3,2)=1
6 is even, mod(6,2)=0 and 5 is odd, mod(5,2)=1
Thus you can make the assumption that one week's consecutive number in the year divided by two leaves remainder 1, and the next week's number divided by two leaves remainder 0.
The general idea is to use the sequential number of the week in the year. To avoid Jan 1st being week one and Dec 31 possibly being week 53 (and thus two odd weeks in a row), I use IW:
week number of ISO 8601 week-numbering year (01-53; the first Thursday
of the year is in week 1)
Then I assume that if one week's number is odd, the next will be even, so we divide all the time into parts of two weeks: even + odd.
SQL Example:
o=# with c(d) as (values('2017.12.21'::date),('2017.12.31'),('2018.01.26'),('2018.02.01'))
select d,to_char(d,'IW'),right(to_char(d,'IW'),1)::int,mod(right(to_char(d,'IW'),1)::int, 2) from c;
d | to_char | right | mod
------------+---------+-------+-----
2017-12-21 | 51 | 1 | 1
2017-12-31 | 52 | 2 | 0
2018-01-26 | 04 | 4 | 0
2018-02-01 | 05 | 5 | 1
(4 rows)
mod is either 0 or 1 - group by this column
https://www.postgresql.org/docs/current/static/functions-math.html
https://www.postgresql.org/docs/current/static/functions-formatting.html
Of course you would need to add outer join on generate_series if you want data without gaps...
I post another answer to explain how I was wrong and why my "smart-n-neat"
way failed...
the schema build and queries are at:
https://www.db-fiddle.com/f/j5i2Td8CvxCVXQQYePKzCe/0
the first (and correct) query:
select distinct w2, avg(c) over (partition by w2)
from d
join generate_series('2016.11.28'::date,'2017.02.23'::date,'2 weeks'::interval) w2
on gs >= w2 and gs < w2 + '2 weeks'::interval
order by w2;
It is a long, simple and correct approach. The idea is to join on a two-week interval. It's working, reliable and all good.
Now the second query:
select distinct div(to_char(gs,'IW')::int,2), min(gs) over w, avg(c) over w
from d
window w as (partition by div(to_char(gs,'IW')::int,2))
order by min;
It is much shorter, neater and smarter, yet has a huge limitation and is unusable. Here's why:
My approach splits the next-to-last two-week interval into two parts: the last week of 2016 and the first week of 2017, thus cutting that result in half. If you multiply the sum of averages for those two weeks by a half, the results of both queries will match. Alas, introducing CASE WHEN logic for the edge weeks of the year turns the neat solution into something heavy and full of overhead, and thus the very point is lost.
TL;DR: the neat and lightweight solution works only within an interval of one year, farther than two weeks from the end or start of the year, and only if our fortnightly interval starts on a Monday.
Now the idea behind the lightweight solution: round(2/2, 0)=1 and round(3/2, 0)=1, so you can divide the year into intervals of two weeks and use that for grouping.
Also, I deliberately did not take this most recent New Year switch, because 2018 Jan 1 is a Monday, so IW is the same as WW, which usually is not the case.
Lastly, my first answer with odd and even weeks is not viable at all. It divides the year not into two-week intervals, but rather into two parts, even weeks and odd weeks... I deceived myself with a "something close" idea and worked with the remainder, while I should have used the opposite, the whole value of the division...

Choose active employees per month with dates formatted dd/mm/yyyy

I'm having a hard time explaining this through writing, so please be patient.
I'm making this project in which I have to choose a month and a year to find all the active employees during that month of that year, but in my database I'm storing the dates when they started and when they finished in dd/mm/yyyy format.
So if I have an employee who worked for 4 months, e.g. from 01/01/2013 to 01/05/2013, I'll have him in four months. I'd need to make him appear in 4 tables (one for every active month) with the other employees that are active during those months. In this case those will be: January, February, March and April of 2013.
The problem is I have no idea how to make a query here or php processing to achieve this.
All I can think is something like (I'd run this query for every month, passing the year and month as argument)
pg_query= "SELECT employee_name FROM employees
WHERE month_and_year between start_date AND finish_date"
But that can't be done, mainly because month_and_year must be a column not a variable.
Ideas anyone?
UPDATE
Yes, I'm very sorry that I forgot to say I was using DATE as data type.
The easiest solution I found was to use EXTRACT
select * from employees where extract (year FROM start_date)>='2013'
AND extract (month FROM start_date)='06' AND extract (month FROM finish_date)<='07'
This gives me all records from June of 2013; you can of course substitute the literal values for any variable of your preference.
There is no need to create a range to make an overlap:
select to_char(d, 'YYYY-MM') as "Month", e.name
from
(
select generate_series(
'2013-01-01'::date, '2013-05-01', '1 month'
)::date
) s(d)
inner join
employee e on
date_trunc('month', e.start_date)::date <= s.d
and coalesce(e.finish_date, 'infinity') > s.d
order by 1, 2
SQL Fiddle
If you want the months with no active employees to show, then change the inner join for a left join.
Erwin, about your comment:
the second expression would have to be coalesce(e.finish_date, 'infinity') >= s.d
Notice the requirement:
So if I have an employee who worked for 4 months eg. from 01/01/2013 to 01/05/2013 I'll have him in four months
From that I understand that the last active day is indeed the day before the finish date.
If I use your "fix" I will include employee f in month 05 from my example. He finished in 2013-05-01:
('f', '2013-04-17', '2013-05-01'),
SQL Fiddle with your fix
Assuming that you really are not storing dates as character strings, but are only outputting them that way, then you can do:
SELECT employee_name
FROM employees
WHERE start_date <= <last date of month> and
(finish_date >= <first date of month> or finish_date is null)
If you are storing them in this format, then you can do some fiddling with years and months.
This version turns the "dates" into strings of the form "YYYYMM". Just express the month you want like this and you can do the comparison:
select employee_name
from employees e
where right(start_date, 4)||substr(start_date, 4, 2) <= 'YYYYMM' and
(right(finish_date, 4)||substr(finish_date, 4, 2) >= 'YYYYMM' or finish_date is null)
NOTE: the expression 'YYYYMM' is meant to be the month/year you are looking for.
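For instance, to look for June 2013 with dates stored as 'DD/MM/YYYY' strings, the comparison value becomes '201306':
SELECT employee_name
FROM   employees e
WHERE  right(start_date, 4) || substr(start_date, 4, 2) <= '201306'
AND   (right(finish_date, 4) || substr(finish_date, 4, 2) >= '201306' OR finish_date IS NULL);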
First, you can generate multiple date intervals easily with generate_series(). To get lower and upper bound add an interval of 1 month to the start:
SELECT g::date AS d_lower
, (g + interval '1 month')::date AS d_upper
FROM generate_series('2013-01-01'::date, '2013-04-01', '1 month') g;
Produces:
d_lower | d_upper
------------+------------
2013-01-01 | 2013-02-01
2013-02-01 | 2013-03-01
2013-03-01 | 2013-04-01
2013-04-01 | 2013-05-01
The upper border of the time range is the first of the next month. This is on purpose, since we are going to use the standard SQL OVERLAPS operator further down. Quoting the manual at said location:
Each time period is considered to represent the half-open interval
start <= time < end [...]
Next, you use a LEFT [OUTER] JOIN to connect employees to these date ranges:
SELECT to_char(m.d_lower, 'YYYY-MM') AS month_and_year, e.*
FROM (
SELECT g::date AS d_lower
, (g + interval '1 month')::date AS d_upper
FROM generate_series('2013-01-01'::date, '2013-04-01', '1 month') g
) m
LEFT JOIN employees e ON (m.d_lower, m.d_upper)
OVERLAPS (e.start_date, COALESCE(e.finish_date, 'infinity'))
ORDER BY 1;
The LEFT JOIN includes date ranges even if no matching employees are found.
Use COALESCE(e.finish_date, 'infinity') for employees without a finish_date. They are considered to be still employed. Or maybe use current_date in place of infinity.
Use to_char() to get a nicely formatted month_and_year value.
You can easily select any columns you need from employees. In my example I take all columns with e.*.
The 1 in ORDER BY 1 is a positional reference to simplify the code. It orders by the first column, month_and_year.
To make this fast, create a multi-column index on these expressions. Like:
CREATE INDEX employees_start_finish_idx
ON employees (start_date, COALESCE(finish_date, 'infinity') DESC);
Note the descending order on the second index-column.
If you should have committed the folly of storing temporal data as string types (text or varchar) with the pattern 'DD/MM/YYYY' instead of date or timestamp or timestamptz, convert the string to date with to_date(). Example:
SELECT to_date('01/03/2013'::text, 'DD/MM/YYYY')
Change the last line of the query to:
...
OVERLAPS (to_date(e.start_date, 'DD/MM/YYYY')
,COALESCE(to_date(e.finish_date, 'DD/MM/YYYY'), 'infinity'))
You can even have a functional index like that. But really, you should use a date or timestamp column.