How to "expand" a SQL join so that each unique value in column A "gets" all the unique values of B?

I have a dataset with two columns: id and date. The dates are monthly and span from Mar-21 to Aug-21. I am sure this question could be applied to non-date values, but I think dates are more intuitive for this example.
id | date |
----+--------+--
a | Mar-21 |
a | Apr-21 |
a | Aug-21 | <---- 'a' is missing May-21, Jun-21 and Jul-21
b | Mar-21 |
b | May-21 | <---- 'b' is missing Apr-21
b | Jun-21 |
b | Jul-21 |
b | Aug-21 |
And I want this
id | date |
----+--------+--
a | Mar-21 |
a | Apr-21 |
a | May-21 | <---- 'a' gets May-21
a | Jun-21 | <---- ...and Jun-21
a | Jul-21 | <---- ...and now Jul-21
a | Aug-21 |
b | Mar-21 |
b | Apr-21 | <---- 'b' gets Apr-21
b | May-21 |
b | Jun-21 |
b | Jul-21 |
b | Aug-21 |
Basically I want to say "I want every single id to get all unique values of date."

Consider the approach below:
select id, format_date('%b-%y', dt) date
from unnest(generate_date_array('2021-03-01', '2021-08-01', interval 1 month)) dt,
(select distinct id from your_table)
-- order by id, dt
Applied to the sample data in your question, the output contains every month from Mar-21 to Aug-21 for each id.
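The same cross-join idea can be tried locally. The following is a minimal sketch using Python's built-in sqlite3 module rather than BigQuery, so generate_date_array is not available; it assumes instead that taking the distinct dates already present in the table is acceptable, which only covers months that appear for at least one id. The table name your_table comes from the answer; the column name dt and the ISO date strings are assumptions for illustration.

```python
import sqlite3

# In-memory database holding the sample rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (id TEXT, dt TEXT)")
conn.executemany(
    "INSERT INTO your_table VALUES (?, ?)",
    [
        ("a", "2021-03-01"), ("a", "2021-04-01"), ("a", "2021-08-01"),
        ("b", "2021-03-01"), ("b", "2021-05-01"), ("b", "2021-06-01"),
        ("b", "2021-07-01"), ("b", "2021-08-01"),
    ],
)

# Cross join the distinct ids with the distinct dates:
# every id is paired with every date that occurs anywhere in the table.
rows = conn.execute(
    """
    SELECT ids.id, dates.dt
    FROM (SELECT DISTINCT id FROM your_table) AS ids
    CROSS JOIN (SELECT DISTINCT dt FROM your_table) AS dates
    ORDER BY ids.id, dates.dt
    """
).fetchall()

for row in rows:
    print(row)
```

If a month were missing for every id, this variant would not produce it; generating the calendar explicitly, as the BigQuery query does, avoids that.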

Related

How to get the soonest date in relation to another date field

Say I have a date field in one table (table a):
+---------+------------+
| item_id | Date |
+---------+------------+
| 12333 | 10/12/2020 |
+---------+------------+
| 45678 | 10/12/2020 |
+---------+------------+
Then I have another table with another date, and it joins to the table above like so (they join on the primary key of table b):
+-------------+------------+-----------+------------+
| primary_key | date2 | item_id | Date |
| (table b) | (table b) | (table a) | (table a) |
+-------------+------------+-----------+------------+
| 45318 | 10/10/2020 | 12333 | 10/12/2020 |
+-------------+------------+-----------+------------+
| 45318 | 10/13/2020 | 12333 | 10/12/2020 |
+-------------+------------+-----------+------------+
| 45318 | 10/24/2020 | 12333 | 10/12/2020 |
+-------------+------------+-----------+------------+
| 75394 | 10/20/2020 | 45678 | 10/12/2020 |
+-------------+------------+-----------+------------+
You see the last column is from table a. I want to get table b's "date2" column to give me the soonest date after 10/12/2020, and remove the rest.
So for the example of 45318, I want to keep the second line only (the one that is 10/13/2020) since that is the soonest date after 10/12/2020.
If this doesn't make sense, let me know and I will fix it!
One method is apply:
select a.*, b.* -- or whatever columns you want
from a outer apply
(select top (1) b.*
from b
where b.item_id = a.item_id and
b.date2 >= '2020-10-12'
order by b.date2 asc
) b;
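OUTER APPLY is SQL Server syntax. As a rough, portable illustration of the same "earliest matching row per item" idea, here is a sketch using Python's sqlite3 module and a correlated subquery. It assumes that table b carries an item_id column (as the answer's query does), stores dates as ISO strings, and compares against each row's own date from table a with a strict "after" instead of the hard-coded literal; all of these are assumptions, not part of the original answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE a (item_id INTEGER, date TEXT);
    CREATE TABLE b (primary_key INTEGER, date2 TEXT, item_id INTEGER);
    INSERT INTO a VALUES (12333, '2020-10-12'), (45678, '2020-10-12');
    INSERT INTO b VALUES
        (45318, '2020-10-10', 12333),
        (45318, '2020-10-13', 12333),
        (45318, '2020-10-24', 12333),
        (75394, '2020-10-20', 45678);
    """
)

# For each row of a, pick the smallest date2 in b that is strictly later
# than a.date. ISO-formatted strings sort in chronological order.
rows = conn.execute(
    """
    SELECT a.item_id,
           a.date,
           (SELECT MIN(b.date2)
              FROM b
             WHERE b.item_id = a.item_id
               AND b.date2 > a.date) AS next_date2
    FROM a
    ORDER BY a.item_id
    """
).fetchall()

for row in rows:
    print(row)
```

Unlike the apply version, this only returns the date itself; fetching the other columns of the matching b row would need a join back to b.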

Complex nested aggregations to get order totals

I have a system to track orders and related expenditures. This is a Rails app running on PostgreSQL. 99% of my app gets by with plain old Rails Active Record calls. This one is ugly.
The expenditures table looks like this:
+----+----------+-----------+------------------------+
| id | category | parent_id | note |
+----+----------+-----------+------------------------+
| 1 | order | nil | order with no invoices |
+----+----------+-----------+------------------------+
| 2 | order | nil | order with invoices |
+----+----------+-----------+------------------------+
| 3 | invoice | 2 | invoice for order 2 |
+----+----------+-----------+------------------------+
| 4 | invoice | 2 | invoice for order 2 |
+----+----------+-----------+------------------------+
Each expenditure has many expenditure_items, and orders can be parents of invoices. The expenditure_items table looks like this:
+----+----------------+-------------+-------+---------+
| id | expenditure_id | cbs_item_id | total | note |
+----+----------------+-------------+-------+---------+
| 1 | 1 | 1 | 5 | Fuit |
+----+----------------+-------------+-------+---------+
| 2 | 1 | 2 | 15 | Veggies |
+----+----------------+-------------+-------+---------+
| 3 | 2 | 1 | 123 | Fuit |
+----+----------------+-------------+-------+---------+
| 4 | 2 | 2 | 456 | Veggies |
+----+----------------+-------------+-------+---------+
| 5 | 3 | 1 | 34 | Fuit |
+----+----------------+-------------+-------+---------+
| 6 | 3 | 2 | 76 | Veggies |
+----+----------------+-------------+-------+---------+
| 7 | 4 | 1 | 26 | Fuit |
+----+----------------+-------------+-------+---------+
| 8 | 4 | 2 | 98 | Veggies |
+----+----------------+-------------+-------+---------+
I need to track a few things:
amounts left to be invoiced on orders (that's easy)
above but rolled up for each cbs_item_id (this is the ugly part)
The cbs_item_id is basically an accounting code to categorize the money spent etc. I have visualized what my end result would look like:
+-------------+----------------+-------------+---------------------------+-----------+
| cbs_item_id | expenditure_id | order_total | invoice_total | remaining |
+-------------+----------------+-------------+---------------------------+-----------+
| 1 | 1 | 5 | 0 | 5 |
+-------------+----------------+-------------+---------------------------+-----------+
| 1 | 2 | 123 | 60 | 63 |
+-------------+----------------+-------------+---------------------------+-----------+
| | | | Rollup for cbs_item_id: 1 | 68 |
+-------------+----------------+-------------+---------------------------+-----------+
| 2 | 1 | 15 | 0 | 15 |
+-------------+----------------+-------------+---------------------------+-----------+
| 2 | 2 | 456 | 174 | 282 |
+-------------+----------------+-------------+---------------------------+-----------+
| | | | Rollup for cbs_item_id: 2 | 297 |
+-------------+----------------+-------------+---------------------------+-----------+
order_total is the sum of total for all the expenditure_items of the given order (category = 'order'). invoice_total is the sum of total for all the expenditure_items of invoices whose parent_id equals the order's expenditures.id. Remaining is calculated as the difference (but never less than 0). In real terms the idea here is that you place an order for $1000 and $750 of invoices come in. I need to calculate the $250 left on the order (remaining), broken down into each category (cbs_item_id). Then I need the roll-up of all the remaining values grouped by the cbs_item_id.
So for each cbs_item_id I need to group by each order, find the total for the order, find the total invoiced against the order, then subtract the two (the result also can't be negative). It has to be on a per-order basis: the overall aggregate difference will not return the expected results.
In the end I am looking for a result something like this:
+-------------+-----------+
| cbs_item_id | remaining |
+-------------+-----------+
| 1 | 68 |
+-------------+-----------+
| 2 | 297 |
+-------------+-----------+
I am guessing this might be a combination of GROUP BY and perhaps a sub query or even CTE (voodoo to me). My SQL skills are not that great and this is WAY above my pay grade.
Here is a fiddle for the data above:
http://sqlfiddle.com/#!17/2fe3a
Alternate fiddle:
https://dbfiddle.uk/?rdbms=postgres_11&fiddle=e9528042874206477efbe0f0e86326fb
This query produces the result you are looking for:
SELECT cbs_item_id, sum(GREATEST(order_total - invoice_total, 0)) AS remaining -- clamp per order: remaining can't be negative
FROM (
SELECT cbs_item_id
, COALESCE(e.parent_id, e.id) AS expenditure_id -- ①
, COALESCE(sum(total) FILTER (WHERE e.category = 'order' ), 0) AS order_total -- ②
, COALESCE(sum(total) FILTER (WHERE e.category = 'invoice'), 0) AS invoice_total
FROM expenditures e
JOIN expenditure_items i ON i.expenditure_id = e.id
GROUP BY 1, 2 -- ③
) sub
GROUP BY 1
ORDER BY 1;
① Note how I assume a saner table definition with expenditures.parent_id being integer, and true NULL instead of the string 'nil'. This allows the simple use of COALESCE.
② About the aggregate FILTER clause:
Aggregate columns with additional (distinct) filters
③ Using the short syntax with ordinal numbers of SELECT list items. Example:
Select first row in each GROUP BY group?
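To check the per-order logic against the sample data without a PostgreSQL instance, here is a sketch using Python's sqlite3 module. It is not the query above verbatim: it swaps the aggregate FILTER clause for SUM(CASE ...), uses SQLite's two-argument MAX in place of a clamp at zero (as the question requires), and assumes parent_id is an integer column with real NULLs, as note ① does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE expenditures (id INTEGER, category TEXT, parent_id INTEGER);
    CREATE TABLE expenditure_items (
        id INTEGER, expenditure_id INTEGER, cbs_item_id INTEGER, total INTEGER
    );
    INSERT INTO expenditures VALUES
        (1, 'order', NULL), (2, 'order', NULL),
        (3, 'invoice', 2), (4, 'invoice', 2);
    INSERT INTO expenditure_items VALUES
        (1, 1, 1, 5),  (2, 1, 2, 15),
        (3, 2, 1, 123), (4, 2, 2, 456),
        (5, 3, 1, 34), (6, 3, 2, 76),
        (7, 4, 1, 26), (8, 4, 2, 98);
    """
)

# Inner query: one row per (cbs_item_id, order), where an invoice is
# attributed to its parent order via COALESCE(parent_id, id).
# Outer query: clamp each order's remainder at 0, then sum per cbs_item_id.
rows = conn.execute(
    """
    SELECT cbs_item_id,
           SUM(MAX(order_total - invoice_total, 0)) AS remaining
    FROM (
        SELECT i.cbs_item_id,
               COALESCE(e.parent_id, e.id) AS order_id,
               SUM(CASE WHEN e.category = 'order'   THEN i.total ELSE 0 END) AS order_total,
               SUM(CASE WHEN e.category = 'invoice' THEN i.total ELSE 0 END) AS invoice_total
        FROM expenditures e
        JOIN expenditure_items i ON i.expenditure_id = e.id
        GROUP BY i.cbs_item_id, COALESCE(e.parent_id, e.id)
    ) AS sub
    GROUP BY cbs_item_id
    ORDER BY cbs_item_id
    """
).fetchall()

print(rows)
```

With the sample rows this yields 68 for cbs_item_id 1 and 297 for cbs_item_id 2, matching the roll-up table in the question.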
Can I get the total of all the remaining for all rows, or do I need to wrap that into another sub select?
There is a very concise option with GROUPING SETS:
...
GROUP BY GROUPING SETS ((1), ()) -- that's all :)
Related:
Converting rows to columns

Measure population on several dates

I want to measure the population of our municipality (which consists of several places). I've got two tables: the first is a calendar table with a row for the first day of every month.
My second table contains all the people who live or have lived in the municipality.
What I want is the population of each place on every first day of the month from my calendar table. I've put some raw data below (just a few records of the persons table, because it contains 100,000 records).
Calender table:
+----------+
| Date |
+----------+
| 1-1-2018 |
+----------+
| 1-2-2018 |
+----------+
| 1-3-2018 |
+----------+
| 1-4-2018 |
+----------+
Persons table
+-----+-----------+-----------+---------------+-------+
| BSN | Startdate | Enddate | Date of death | Place |
+-----+-----------+-----------+---------------+-------+
| 1 | 12-1-2000 | null | null | A |
+-----+-----------+-----------+---------------+-------+
| 2 | 10-5-2011 | null | 22-1-2018 | B |
+-----+-----------+-----------+---------------+-------+
| 3 | 16-12-2011| 10-2-2018 | null | B |
+-----+-----------+-----------+---------------+-------+
| 4 | 9-11-2012 | null | null | B |
+-----+-----------+-----------+---------------+-------+
| 5 | 8-9-2013 | null | 27-3-2018 | A |
+-----+-----------+-----------+---------------+-------+
| 6 | 7-10-2017 | 28-3-2018 | null | B |
+-----+-----------+-----------+---------------+-------+
My expected result:
+----------+-------+------------+
| Date | Place | Population |
+----------+-------+------------+
| 1-1-2018 | A | 2 |
+----------+-------+------------+
| 1-1-2018 | B | 4 |
+----------+-------+------------+
| 1-2-2018 | A | 2 |
+----------+-------+------------+
| 1-2-2018 | B | 3 |
+----------+-------+------------+
| 1-3-2018 | A | 2 |
+----------+-------+------------+
| 1-3-2018 | B | 2 |
+----------+-------+------------+
| 1-4-2018 | A | 1 |
+----------+-------+------------+
| 1-4-2018 | B | 1 |
+----------+-------+------------+
What I've tried so far, which doesn't seem to work:
SELECT a.Place
,c.Date
,(SELECT COUNT(DISTINCT(b.BSN))
FROM Person as b
WHERE b.Startdate < c.Date
AND (b.Enddate > c.Date OR b.Enddate is null)
AND (b.Date of death > c.Date OR b.Date of death is null)
AND a.Place = b.Place) as Population
FROM Person as a
JOIN Calender as c
ON a.Startdate <= c.Date
AND a.Enddate >= c.Date
GROUP BY Place, Date
I hope someone can help finding out the problem. Thanks in advance
First cross join Calender and the places to get the date/place pairs. Then left join the persons on the place and the date. Finally group by date and place to get the count of people for that day and place.
SELECT [ca].[Date],
[pl].[Place],
count([pe].[Place]) [Population]
FROM [Calender] [ca]
CROSS JOIN (SELECT DISTINCT
[pe].[Place]
FROM [Persons] [pe]) [pl]
LEFT JOIN [Persons] [pe]
ON [pe].[Place] = [pl].[Place]
AND [pe].[Startdate] <= [ca].[Date]
AND (coalesce([pe].[Enddate],
[pe].[Date of death]) IS NULL
OR coalesce([pe].[Enddate],
[pe].[Date of death]) > [ca].[Date])
GROUP BY [ca].[Date],
[pl].[Place]
ORDER BY [ca].[Date],
[pl].[Place];
Some notes and assumptions:
If you have a table listing the places, use that instead of the subquery aliased [pl]. I just had no other option with the given tables.
I believe the Date of death also implies an Enddate for the same day. You might want to consider a trigger that sets the Enddate automatically to the Date of death if it isn't null. That would make things easier and probably more consistent.
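The cross join plus left join can be checked against the sample data with a small sketch in Python's sqlite3 module. The lowercase table and column names (calender, persons, date_of_death) and the ISO date strings are assumptions made so the comparison works on text columns; the query logic mirrors the one above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE calender (date TEXT);
    CREATE TABLE persons (
        bsn INTEGER, startdate TEXT, enddate TEXT, date_of_death TEXT, place TEXT
    );
    INSERT INTO calender VALUES
        ('2018-01-01'), ('2018-02-01'), ('2018-03-01'), ('2018-04-01');
    INSERT INTO persons VALUES
        (1, '2000-01-12', NULL,         NULL,         'A'),
        (2, '2011-05-10', NULL,         '2018-01-22', 'B'),
        (3, '2011-12-16', '2018-02-10', NULL,         'B'),
        (4, '2012-11-09', NULL,         NULL,         'B'),
        (5, '2013-09-08', NULL,         '2018-03-27', 'A'),
        (6, '2017-10-07', '2018-03-28', NULL,         'B');
    """
)

# 1. Cross join dates with distinct places to get every date/place pair.
# 2. Left join the persons who live in that place on that date.
# 3. Count matched persons per pair (COUNT ignores the NULLs of unmatched pairs).
rows = conn.execute(
    """
    SELECT ca.date, pl.place, COUNT(pe.place) AS population
    FROM calender ca
    CROSS JOIN (SELECT DISTINCT place FROM persons) pl
    LEFT JOIN persons pe
           ON pe.place = pl.place
          AND pe.startdate <= ca.date
          AND (COALESCE(pe.enddate, pe.date_of_death) IS NULL
               OR COALESCE(pe.enddate, pe.date_of_death) > ca.date)
    GROUP BY ca.date, pl.place
    ORDER BY ca.date, pl.place
    """
).fetchall()

for row in rows:
    print(row)
```

The counts agree with the expected result table in the question for all four dates.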

Get weekly totals from database of daily events in SQL

I have a database of events linked to individual users (let's call them A, B, C), and listed by timestamp with timezone.
I need to put together a SQL query that tells me the total number of events from A, B, and C by week.
How would I do this?
Example Data:
| "UID" | "USER" | "EVENT" | "TIMESTAMP" |
| 1 | 'A' | "FLIGHT" | '2015-01-06 08:00:00-05' |
| 2 | 'B' | "FLIGHT" | '2015-01-07 09:00:00-05' |
| 3 | 'A' | "FLIGHT" | '2015-01-08 11:00:00-05' |
| 4 | 'A' | "FLIGHT" | '2015-01-08 12:00:00-05' |
| 5 | 'C' | "FLIGHT" | '2015-01-13 06:00:00-05' |
| 6 | 'C' | "FLIGHT" | '2015-01-14 09:00:00-05' |
| 7 | 'A' | "FLIGHT" | '2015-01-14 10:00:00-05' |
| 8 | 'A' | "FLIGHT" | '2015-01-06 12:00:00-05' |
Desired Output:
| Week | USER | FREQUENCY |
| 1 | A | 3 |
| 1 | B | 1 |
| 2 | A | 2 |
| 2 | C | 2 |
Looks like a simple aggregation to me:
select extract(week from "TIMESTAMP") as week,
"USER",
count(*)
from the_table
group by extract(week from "TIMESTAMP"), "USER"
order by extract(week from "TIMESTAMP"), "USER";
extract(week from ...) uses the ISO definition of the week.
Quote from the manual:
In the ISO week-numbering system, it is possible for early-January dates to be part of the 52nd or 53rd week of the previous year, and for late-December dates to be part of the first week of the next year
So it's better to use a display that includes both the week and the year. This can be done using to_char():
select to_char("TIMESTAMP", 'iyyy-iw') as week,
"USER",
count(*)
from the_table
group by to_char("TIMESTAMP", 'iyyy-iw'), "USER"
order by to_char("TIMESTAMP", 'iyyy-iw'), "USER";
If you want to limit that to a specific month, you can add the appropriate where condition.
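To see why the year matters in the week label, here is a small Python sketch that groups the sample events by ISO year and week using datetime.isocalendar(). The helper name iso_week_label is made up for illustration, the offsets are written as -05:00 so that datetime.fromisoformat accepts them, and the week is taken from the timestamp as written, without converting to another time zone.

```python
from collections import Counter
from datetime import datetime

# Sample events from the question: (user, timestamp with UTC offset).
events = [
    ("A", "2015-01-06 08:00:00-05:00"),
    ("B", "2015-01-07 09:00:00-05:00"),
    ("A", "2015-01-08 11:00:00-05:00"),
    ("A", "2015-01-08 12:00:00-05:00"),
    ("C", "2015-01-13 06:00:00-05:00"),
    ("C", "2015-01-14 09:00:00-05:00"),
    ("A", "2015-01-14 10:00:00-05:00"),
    ("A", "2015-01-06 12:00:00-05:00"),
]


def iso_week_label(ts: str) -> str:
    """Return 'ISOYEAR-WEEK', comparable to to_char(ts, 'iyyy-iw')."""
    iso_year, iso_week, _ = datetime.fromisoformat(ts).isocalendar()
    return f"{iso_year}-{iso_week:02d}"


# Count events per (week label, user), like GROUP BY week, "USER".
weekly = Counter((iso_week_label(ts), user) for user, ts in events)

for (week, user), count in sorted(weekly.items()):
    print(week, user, count)

# Dates near New Year can belong to a different ISO year than their calendar year.
print(iso_week_label("2014-12-29 00:00:00-05:00"))  # 2015-01
print(iso_week_label("2016-01-01 00:00:00-05:00"))  # 2015-53
```

With the sample rows as posted, the events fall into ISO weeks 2 and 3 of 2015, and user A has four events in the first of them because UID 8 is also dated 2015-01-06.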

Add Values to Grouping Column

I am having a lot of trouble with a scenario that I think some of you might have come across.
(The background is business trips. There are two tables: one holds the payments made on business trips and the other holds the trips themselves, so the first has more rows than the second, because there are more payments than trips.)
I have two tables, Table A and Table B.
Table A looks as follows
| TableA_ID | TableB_ID | PaymentMethod | ValuePayed |
| 52 | 1 | Method1 | 23,2 |
| 21 | 1 | Method2 | 23,2 |
| 33 | 2 | Method3 | 23,2 |
| 42 | 1 | Method2 | 14 |
| 11 | 14 | Method1 | 267 |
| 42 | 1 | Method2 | 14,7 |
| 13 | 32 | Method1 | 100,2 |
Table B looks like this
| TableB_ID | TravelExpenses | OperatingExpense |
| 1 | 23 | 12 |
| 1 | 234 | 24 |
| 2 | 12 | 7 |
| 1 | 432 | 12 |
| 14 | 110 | 12 |
I am trying to create a measure Table (Table C) that looks like this:
| TableC_ID | TypeofCost | Amount |
| 1 | Method1 | 100,2 |
| 2 | Method2 | 52 |
| 3 | TravelExpenses | 7 |
| 4 | OperatingExpense| 12 |
| 5 | Method3 | 12 |
| 6 | OperatingExpense| 7 |
| 7 | Method3 | 12 |
(the Amount results are to be Summed and Columns - Employee, Month, TypeofCost Grouped)
So I pretty much have to group not only by the PaymentMethod which I get from table A,
but also insert new values in the group (TravelExpenses and OperatingExpense)
Can anybody give me an idea of how this can be done in SQL?
Here is what I have tried so far
SELECT PaymentMethod as TypeofCost
,Sum(ValuePayed) as Amount
FROM TableA Left Outer Join TableB on TableA.TableB_ID = TableB.TableB_ID
GROUP PaymentMethod
UNION
SELECT 'TravelExpenses' as TypeofCost
,Sum(TableB.TravelExpenses) as Amount
FROM TableA Left Outer Join TableB on TableA.TableB_ID = TableB.TableB_ID
GROUP PaymentMethod
UNION
SELECT 'OperatingExpense' as TypeofCost
,Sum(TableB.OperatingExpense) as Amount
FROM TableA Left Outer Join TableB on TableA.TableB_ID = TableB.TableB_ID
GROUP PaymentMethod
It should be something like this:
Select
    row_number() OVER(ORDER BY u.TableB_ID) as TableC_ID,
    u.TypeofCost,
    u.Amount
from (
    -- payments, summed per trip and payment method
    Select
        a.TableB_ID,
        a.PaymentMethod as TypeofCost,
        SUM(a.ValuePayed) as Amount
    from
        TableA as a
    group by a.TableB_ID, a.PaymentMethod
    union
    -- travel expenses, summed per trip
    Select
        b1.TableB_ID,
        'TravelExpenses' as TypeofCost,
        SUM(b1.TravelExpenses) as Amount
    from
        TableB as b1
    group by b1.TableB_ID
    union
    -- operating expenses, summed per trip
    Select
        b2.TableB_ID,
        'OperatingExpense' as TypeofCost,
        SUM(b2.OperatingExpense) as Amount
    from
        TableB as b2
    group by b2.TableB_ID
) as u
EDIT: Generate TableC_ID
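The union approach can be tried against the sample rows with Python's sqlite3 module. This is only a sketch under a few assumptions: decimal commas are entered as decimal points, UNION ALL is used instead of UNION so that no rows are ever de-duplicated, the ordering inside row_number() is extended with TypeofCost to make the numbering deterministic, and the bundled SQLite is recent enough (3.25 or later) to support window functions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE TableA (
        TableA_ID INTEGER, TableB_ID INTEGER, PaymentMethod TEXT, ValuePayed REAL
    );
    CREATE TABLE TableB (
        TableB_ID INTEGER, TravelExpenses REAL, OperatingExpense REAL
    );
    INSERT INTO TableA VALUES
        (52, 1, 'Method1', 23.2), (21, 1, 'Method2', 23.2),
        (33, 2, 'Method3', 23.2), (42, 1, 'Method2', 14),
        (11, 14, 'Method1', 267), (42, 1, 'Method2', 14.7),
        (13, 32, 'Method1', 100.2);
    INSERT INTO TableB VALUES
        (1, 23, 12), (1, 234, 24), (2, 12, 7), (1, 432, 12), (14, 110, 12);
    """
)

# Each branch produces (TableB_ID, TypeofCost, Amount); the outer query
# numbers the combined rows to generate TableC_ID.
rows = conn.execute(
    """
    SELECT row_number() OVER (ORDER BY u.TableB_ID, u.TypeofCost) AS TableC_ID,
           u.TypeofCost,
           u.Amount
    FROM (
        SELECT TableB_ID, PaymentMethod AS TypeofCost, SUM(ValuePayed) AS Amount
        FROM TableA
        GROUP BY TableB_ID, PaymentMethod
        UNION ALL
        SELECT TableB_ID, 'TravelExpenses', SUM(TravelExpenses)
        FROM TableB
        GROUP BY TableB_ID
        UNION ALL
        SELECT TableB_ID, 'OperatingExpense', SUM(OperatingExpense)
        FROM TableB
        GROUP BY TableB_ID
    ) AS u
    ORDER BY TableC_ID
    """
).fetchall()

for row in rows:
    print(row)
```

Neither expense branch needs a join to TableA, which is also what keeps the TableB sums from being multiplied by the number of matching payments.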