Find rows with adjourning date ranges and accumulate their durations

Find rows with adjourning date ranges and accumulate their durations - sql

My PostgreSQL database stores school vacation, public holidays and weekend dates for parents to plan their vacation. Many times school vacations are adjourned by weekends or public holidays. I want to display the total number of non-school days for a school vacation. That should include any adjourned weekend or public holiday.
Example Data
locations
SELECT id, name, is_federal_state
FROM locations
WHERE is_federal_state = true;
| id | name | is_federal_state |
|----|-------------------|------------------|
| 2 | Baden-Württemberg | true |
| 3 | Bayern | true |
holiday_or_vacation_types
SELECT id, name FROM holiday_or_vacation_types;
| id | name |
|----|-----------------------|
| 1 | Herbst |
| 8 | Wochenende |
"Herbst" is German for "autumn" and "Wochenende" is German for "weekend".
periods
SELECT id, starts_on, ends_on, holiday_or_vacation_type_id
FROM periods
WHERE location_id = 2
ORDER BY starts_on;
| id | starts_on | ends_on | holiday_or_vacation_type_id |
|-----|--------------|--------------|-----------------------------|
| 670 | "2019-10-26" | "2019-10-27" | 8 |
| 532 | "2019-10-28" | "2019-10-30" | 1 |
| 533 | "2019-10-31" | "2019-10-31" | 1 |
| 671 | "2019-11-02" | "2019-11-03" | 8 |
| 672 | "2019-11-09" | "2019-11-10" | 8 |
| 673 | "2019-11-16" | "2019-11-17" | 8 |
Task
I want to select all periods where location_id equals 2. And I want to calculate the duration of each period in days. That can be done with this SQL query:
SELECT id, starts_on, ends_on,
(ends_on - starts_on + 1) AS duration,
holiday_or_vacation_type_id
FROM periods
| id | starts_on | ends_on | duration | holiday_or_vacation_type_id |
|-----|--------------|--------------|----------|-----------------------------|
| 670 | "2019-10-26" | "2019-10-27" | 2 | 8 |
| 532 | "2019-10-28" | "2019-10-30" | 3 | 1 |
| 533 | "2019-10-31" | "2019-10-31" | 1 | 1 |
| 671 | "2019-11-02" | "2019-11-03" | 2 | 8 |
| 672 | "2019-11-09" | "2019-11-10" | 2 | 8 |
| 673 | "2019-11-16" | "2019-11-17" | 2 | 8 |
Any human looking at the calendar would see that the ids 670 (weekend), 532 (fall vacation) and 533 (fall vacation) are adjourned. So they add up to a 6 day vacation period. So far I do this with a program which computes this. But that takes quite a lot of resources (the actual table contains some 500,000 items).
Problem 1
Which SQL query would result in the following output (is adds a real_duration column)? Is that even possible with SQL?
| id | starts_on | ends_on | duration | real_duration | holiday_or_vacation_type_id |
|-----|--------------|--------------|----------|---------------|-----------------------------|
| 670 | "2019-10-26" | "2019-10-27" | 2 | 6 | 8 |
| 532 | "2019-10-28" | "2019-10-30" | 3 | 6 | 1 |
| 533 | "2019-10-31" | "2019-10-31" | 1 | 6 | 1 |
| 671 | "2019-11-02" | "2019-11-03" | 2 | 2 | 8 |
| 672 | "2019-11-09" | "2019-11-10" | 2 | 2 | 8 |
| 673 | "2019-11-16" | "2019-11-17" | 2 | 2 | 8 |
Problem 2
It is possible to list the adjourning periods in a part_of_range field? This would be the result. Can that be done with SQL?
| id | starts_on | ends_on | duration | part_of_range | holiday_or_vacation_type_id |
|-----|--------------|--------------|----------|---------------|-----------------------------|
| 670 | "2019-10-26" | "2019-10-27" | 2 | 670,532,533 | 8 |
| 532 | "2019-10-28" | "2019-10-30" | 3 | 670,532,533 | 1 |
| 533 | "2019-10-31" | "2019-10-31" | 1 | 670,532,533 | 1 |
| 671 | "2019-11-02" | "2019-11-03" | 2 | | 8 |
| 672 | "2019-11-09" | "2019-11-10" | 2 | | 8 |
| 673 | "2019-11-16" | "2019-11-17" | 2 | | 8 |

This is a gaps and islands problem. In this case you can use lag() to see where an island starts and then a cumulative sum.
The final operation is some aggregation (using window functions):
SELECT p.*,
(Max(ends_on) OVER (PARTITION BY location_id, grp) - Min(starts_on) OVER (PARTITION BY location_id, grp) ) + 1 AS duration,
Array_agg(p.id) OVER (PARTITION BY location_id)
FROM (SELECT p.*,
Count(*) FILTER (WHERE prev_eo < starts_on - INTERVAL '1 day') OVER (PARTITION BY location_id ORDER BY starts_on) AS grp
FROM (SELECT id, starts_on, ends_on, location_id, holiday_or_vacation_type_id,
lag(ends_on) OVER (PARTITION BY location_id ORDER BY (starts_on)) AS prev_eo
FROM periods
) p
) p;

Related

PostgreSQL: Count number of rows in table 1 for distinct rows in table 2

I am working with really big data that at the moment I become confused, looking like I'm just repeating one thing.
I want to count the number of trips per user from two tables, trips and session.
psql=> SELECT * FROM trips limit 10;
trip_id | session_ids | daily_user_id | seconds_start | seconds_end
---------+-----------------+---------------+---------------+-------------
400543 | {172079} | 17118 | 1575550944 | 1575551181
400542 | {172078} | 17118 | 1575541533 | 1575542171
400540 | {172077} | 17118 | 1575539001 | 1575539340
400538 | {172076} | 17117 | 1575540499 | 1575541999
400534 | {172074,172075} | 17117 | 1575537161 | 1575539711
400530 | {172073} | 17116 | 1575447043 | 1575447682
400529 | {172071} | 17115 | 1575496394 | 1575497803
400527 | {172070} | 17113 | 1575495241 | 1575496034
400525 | {172068} | 17115 | 1575485658 | 1575489378
400524 | {172067} | 17113 | 1575488721 | 1575490491
(10 rows)
psql=> SELECT * FROM session limit 10;
session_id | user_id | key | start_time | daily_user_id
------------+---------+--------------------------+------------+---------------
172079 | 43 | hLB8S7aSfp4gAFp7TykwYQ==+| 1575550921 | 17118
| | | |
172078 | 43 | YATMrL/AQ7Nu5q2dQTMT1A==+| 1575541530 | 17118
| | | |
172077 | 43 | fOLX4tqvsyFOP3DCyBZf1A==+| 1575538997 | 17118
| | | |
172076 | 7 | 88hwGj4Mqa58juy0PG/R4A==+| 1575540515 | 17117
| | | |
172075 | 7 | 1O+8X49+YbtmoEa9BlY5OQ==+| 1575538384 | 17117
| | | |
172074 | 7 | XOR7hsFCNk+soM75ZhDJyA==+| 1575537405 | 17117
| | | |
172073 | 42 | rAQWwYgqg3UMTpsBYSpIpA==+| 1575447109 | 17116
| | | |
172072 | 276 | 0xOsxRRN3Sq20VsXWjlrzQ==+| 1575511120 | 17114
| | | |
172071 | 7 | P4beN3W/ZrD+TCpZGYh23g==+| 1575496642 | 17115
| | | |
172070 | 43 | OFi30Zv9e5gmLZS5Vb+I7Q==+| 1575495238 | 17113
| | | |
(10 rows)
Goal: get the distribution of trips per user
Attempt:
psql=> SELECT COUNT(distinct trip_id) as trips
, count(distinct user_id) as users
, extract(year from to_timestamp(seconds_start)) as year_date
, extract(month from to_timestamp(seconds_start)) as month_date
FROM trips
INNER JOIN session
ON session_id = ANY(session_ids)
GROUP BY year_date, month_date
ORDER BY year_date, month_date;
+-------+-------+-----------+------------+
| trips | users | year_date | month_date |
+-------+-------+-----------+------------+
| 371 | 44 | 2016 | 3 |
| 12207 | 185 | 2016 | 4 |
| 3859 | 88 | 2016 | 5 |
| 1547 | 28 | 2016 | 6 |
| 831 | 17 | 2016 | 7 |
| 427 | 4 | 2016 | 8 |
| 512 | 13 | 2016 | 9 |
| 431 | 11 | 2016 | 10 |
| 1011 | 26 | 2016 | 11 |
| 791 | 15 | 2016 | 12 |
| 217 | 8 | 2017 | 1 |
| 490 | 17 | 2017 | 2 |
| 851 | 18 | 2017 | 3 |
| 1890 | 66 | 2017 | 4 |
| 2143 | 43 | 2017 | 5 |
| . | | | |
| . | | | |
| . | | | |
+-------+-------+-----------+------------+
This resultset count number of users and trips, my intention is actually to get an analysis of trips per user, like so:
+------+-------------+
| user | no_of_trips |
+------+-------------+
| 1 | 489 |
| 2 | 400 |
| 3 | 12 |
| 4 | 102 |
| . | |
| . | |
| . | |
+------+-------------+
How do I do this, please?

You seem to just want aggregation by user_id:
SELECT s.user_id, COUNT(distinct t.trip_id) as trips
FROM trips t INNER JOIN
session s
ON s.session_id = ANY(t.session_ids)
GROUP BY s.user_id ;
I'm pretty sure that the COUNT(DISTINCT) is unnecessary, so I would advise removing it:
SELECT s.user_id, COUNT(*) as trips
FROM trips t INNER JOIN
session s
ON s.session_id = ANY(t.session_ids)
GROUP BY s.user_id ;

Obtain MIN() and MAX() over not correlative values in PostgreSQL

I have a problem that I can't found a solution. This is my scenario:
parent_id | transaction_code | way_to_pay | type_of_receipt | unit_price | period | series | number_from | number_to | total_numbers
10 | 2444 | cash | local | 15.000 | 2018 | A | 19988 | 26010 | 10
This result's when a grouping parent_id, transaccion_code, way_to_pay, type_of_receipt, unit_price, periodo, series, MIN(number), MAX(number) and COUNT(number). But the grouping hides that the number is not correlative, because this is my childs situation:
parent_id | child_id | number
10 | 1 | 19988
10 | 2 | 19989
10 | 3 | 19990
10 | 4 | 19991
10 | 5 | 22001
10 | 6 | 22002
10 | 7 | 26007
10 | 8 | 26008
10 | 9 | 26009
10 | 10 | 26010
What is the magic SQL to achieve the following?
parent_id | transaction_code | way_to_pay | type_of_receipt | unit_price | period | series | number_from | number_to | total_numbers
10 | 2444 | cash | local | 15.000 | 2018 | A | 19988 | 19991 | 4
10 | 2444 | cash | local | 15.000 | 2018 | A | 22001 | 22002 | 2
10 | 2444 | cash | local | 15.000 | 2018 | A | 26007 | 26010 | 4

You can identify adjacent numbers by subtracting a sequence. It would help if you showed your query, but the idea is this:
select parent_id, transaccion_code, way_to_pay, type_of_receipt, unit_price, periodo, series,
min(number), max(number), count(*)
from (select t.*,
row_number() over
(partition by parent_id, transaccion_code, way_to_pay, type_of_receipt, unit_price, periodo, series
order by number
) as seqnum
from t
) t
group by parent_id, transaccion_code, way_to_pay, type_of_receipt, unit_price, periodo, series,
(number - seqnum);

How to calculate running total in SQL

I have my dataset in the given format
It's a month level data along with salary for each month.
I need to calculate cumulative salary for each month end. How can I do this
+----------+-------+--------+---------------+
| Account | Month | Salary | Running Total |
+----------+-------+--------+---------------+
| a | 1 | 586 | 586 |
| a | 2 | 928 | 1514 |
| a | 3 | 726 | 2240 |
| a | 4 | 538 | 538 |
| b | 1 | 956 | 1494 |
| b | 3 | 667 | 2161 |
| b | 4 | 841 | 3002 |
| c | 1 | 826 | 826 |
| c | 2 | 558 | 1384 |
| c | 3 | 558 | 1972 |
| c | 4 | 735 | 2707 |
| c | 5 | 691 | 3398 |
| d | 1 | 670 | 670 |
| d | 4 | 838 | 1508 |
| d | 5 | 1000 | 2508 |
+----------+-------+--------+---------------+
I need to calculate running total column which is cumulative column. How can I do efficiently in SQL?

You can use SUM with ORDER BY clause inside the OVER clause:
SELECT Account, Month, Salary,
SUM(Salary) OVER (PARTITION BY Account ORDER BY Month) AS RunningTotal
FROM mytable

equating an entry to an aggregated version of itself

I am trying to find if an entry's value is the max of the grouped value. Its purpose is to sit in a larger if logic.
Which I'd expect would look something like this:
SELECT
t.id as t_id,
sum(if(t.value = max(t.value), 1, 0)) AS is_max_value
FROM dataset.table AS t
GROUP BY t_id
The response is:
Error: Expression 't.value' is not present in the GROUP BY list
How should my code look to do this?

You first need to compile in a subquery the max value, then join again the value to the table.
Using the public data set available here is an example:
SELECT
t.word,
t.word_count,
t.corpus_date
FROM
[publicdata:samples.shakespeare] t
JOIN (
SELECT
corpus_date,
MAX(word_count) word_count,
FROM
[publicdata:samples.shakespeare]
GROUP BY
1 ) d
ON
d.corpus_date=t.corpus_date
AND t.word_count=d.word_count
LIMIT
25
Results:
+-----+--------+--------------+---------------+---+
| Row | t_word | t_word_count | t_corpus_date | |
+-----+--------+--------------+---------------+---+
| 1 | the | 762 | 1597 | |
| 2 | the | 894 | 1598 | |
| 3 | the | 841 | 1590 | |
| 4 | the | 680 | 1606 | |
| 5 | the | 942 | 1607 | |
| 6 | the | 779 | 1609 | |
| 7 | the | 995 | 1600 | |
| 8 | the | 937 | 1599 | |
| 9 | the | 738 | 1612 | |
| 10 | the | 612 | 1595 | |
| 11 | the | 848 | 1592 | |
| 12 | the | 753 | 1594 | |
| 13 | the | 740 | 1596 | |
| 14 | I | 828 | 1603 | |
| 15 | the | 525 | 1608 | |
| 16 | the | 363 | 0 | |
| 17 | I | 629 | 1593 | |
| 18 | I | 447 | 1611 | |
| 19 | the | 715 | 1602 | |
| 20 | the | 717 | 1610 | |
+-----+--------+--------------+---------------+---+
You can see that retains the word that have the maximum word_count in the partition defined by corpus_date

Use window function to "spread" the max value over all relevant records.
this way you can avoid the Join.
SELECT
*
FROM (
SELECT
corpus,
corpus_date,
word,
word_count,
MAX(word_count) OVER (PARTITION BY corpus) AS Max_Word_Count
FROM
[publicdata:samples.shakespeare] )
WHERE
word_count=Max_Word_Count

select
id,
value,
integer(value = max_value) as is_max_value
from (
select id, value, max(value) over(partition by id) as max_value
from dataset.table
)
Explanation:
Inner select - for each row/record calculates max of value among all rows with the same id
Outer select - for each row/record compares row's value with max value for respective group and then converts true or false into respectively 1 or 0 (as per expectation in question)

Getting average salary of last 3 salaries

+------------+--------+-----------+---------------+
| paydate | salary | ninumber | payrollnumber |
+------------+--------+-----------+---------------+
| 2015-05-15 | 1000 | jh330954b | 6 |
| 2015-04-15 | 1250 | jh330954b | 5 |
| 2015-03-15 | 800 | jh330954b | 4 |
| 2015-02-15 | 894 | jh330954b | 3 |
| 2015-05-15 | 500 | ew56780e | 6 |
| 2015-04-15 | 1500 | ew56780e | 5 |
| 2015-03-15 | 2500 | ew56780e | 4 |
| 2015-02-15 | 3000 | ew56780e | 3 |
| 2015-05-15 | 400 | rt321298z | 6 |
| 2015-04-15 | 582 | rt321298z | 5 |
| 2015-03-15 | 123 | rt321298z | 4 |
| 2015-02-15 | 659 | rt321298z | 3 |
+------------+--------+-----------+---------------+
The above list is the data in my database. I need to get the average of the previous 3 salaries for each individual and output this.
I don't know where to begin with this so I cannot provide any of my working so far.

In SQL Server, you can use row_number() to get the last three salaries in a subquery. Then use avg():
select ninumber, avg(salary)
from (select t.*,
row_number() over (partition by ninumber order by payrollnumber desc) as seqnum
from table t
) t
where seqnum <= 3
group by ninumber;

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas

Find rows with adjourning date ranges and accumulate their durations - sql

Related

PostgreSQL: Count number of rows in table 1 for distinct rows in table 2

Obtain MIN() and MAX() over not correlative values in PostgreSQL

How to calculate running total in SQL

equating an entry to an aggregated version of itself

Getting average salary of last 3 salaries

Categories

Resources