SQL join by combination of ID and date range - sql

There are two tables: assignments and countries.
Assignments stores historical data about employees' assignments; the three main fields are person_id, effective_start_date and effective_end_date.
Countries stores info about employees who have taken trips abroad; the important fields are person_id, date_from, date_to, home_country and host_country.
I need to show all assignments together with the country the employee had been to at any point during the assignment. That means I need an outer join, but the only column I can join on is person_id, and there are multiple entries per person_id in each table.
So what I did was something like this:
select *
from assignments ass, employees emp
where
ass.person_id=emp.person_id
AND (emp.date_from(+) >= ass.assignment_start_date AND emp.date_to(+) <= ass.assignment_end_date)
OR (emp.date_from(+) >= ass.assignment_start_date AND emp.date_to(+) >= ass.assignment_end_date)
but it doesn't work because Oracle doesn't allow an OR condition in an outer join. I tried a UNION ALL approach, but the values returned are not quite what I expected: some values are missing, so the logic isn't correct either. If you have any advice, please post it in the same syntax I used (Oracle syntax, with joins made in the WHERE clause), so it's easier for me to understand.

To get your desired result, you can use the following approach:
- use ANSI-style JOINs instead of the outdated Oracle syntax (they're much more flexible and IMHO also more readable)
- concatenate the countries (e.g. using LISTAGG)
Query:
select ass.person_id,
assignment_start_date,
assignment_end_date,
listagg(home_country ||'-' || host_country, ';')
within group (order by date_from) as countries
from assignments ass
left join employees emp
on ass.person_id = emp.person_id
AND ((emp.date_from >= ass.assignment_start_date AND
emp.date_from <= ass.assignment_end_date)
OR (emp.date_to >= ass.assignment_start_date AND
emp.date_to <= ass.assignment_end_date))
group by ass.person_id, assignment_start_date, assignment_end_date
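One thing worth checking in a join like this is the overlap condition itself. A common way to test whether two date ranges overlap is "trip starts on or before the assignment ends AND trip ends on or after the assignment starts". Below is a minimal sketch of that, assuming Python's sqlite3 as a stand-in for Oracle (so group_concat replaces LISTAGG and dates are ISO strings). The sample rows are invented for illustration; only the table and column names come from the question.

```python
import sqlite3

# Hypothetical sample data; SQLite is used only so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE assignments (person_id INT,
                              assignment_start_date TEXT,
                              assignment_end_date TEXT);
    CREATE TABLE employees (person_id INT, date_from TEXT, date_to TEXT,
                            home_country TEXT, host_country TEXT);
    INSERT INTO assignments VALUES
        (1, '2015-01-01', '2015-06-30'),
        (2, '2015-01-01', '2015-12-31');
    -- This trip starts before and ends after assignment 1, so neither of
    -- its endpoints falls inside the assignment.
    INSERT INTO employees VALUES (1, '2014-11-01', '2015-09-30', 'LV', 'CN');
""")


def countries_per_assignment(date_condition):
    """Outer-join trips to assignments using the given date condition."""
    sql = f"""
        SELECT ass.person_id,
               group_concat(emp.home_country || '-' || emp.host_country, ';')
        FROM assignments ass
        LEFT JOIN employees emp
               ON ass.person_id = emp.person_id
              AND ({date_condition})
        GROUP BY ass.person_id, ass.assignment_start_date, ass.assignment_end_date
        ORDER BY ass.person_id
    """
    return conn.execute(sql).fetchall()


# General overlap test: the two ranges share at least one day.
overlap_rows = countries_per_assignment("""
    emp.date_from <= ass.assignment_end_date
    AND emp.date_to >= ass.assignment_start_date
""")

# Endpoint test: one end of the trip must lie inside the assignment.
endpoint_rows = countries_per_assignment("""
    (emp.date_from >= ass.assignment_start_date AND
     emp.date_from <= ass.assignment_end_date)
    OR (emp.date_to >= ass.assignment_start_date AND
        emp.date_to <= ass.assignment_end_date)
""")

print(overlap_rows)   # the enclosing trip is matched for person 1
print(endpoint_rows)  # the enclosing trip is not matched
```

With this data the general overlap test picks up the trip that encloses the whole assignment, while the endpoint test does not. Assignments without any trip still appear, with a NULL country list, because the date condition lives in the ON clause.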

I think this is what you want, using Oracle syntax:
select ass.person_id, assignment_start_date, assignment_end_date,
emp.home_country,emp.host_country
from assignments ass, employees emp
where
ass.person_id=emp.person_id(+)
AND (emp.date_from(+) <= ass.assignment_end_date AND emp.date_to(+) >= ass.assignment_start_date)

Related

PostgreSQL GROUP BY that includes zeros

I have a SQL query (postgresql) that looks something like this:
SELECT
my_timestamp::timestamp::date as the_date,
count(*) as count
FROM my_table
WHERE ...
GROUP BY the_date
ORDER BY the_date
The result is a table of YYYY-MM-DD, count pairs.
Now I've been asked to fill in the empty dates with zero. So if I was previously providing
2022-03-15 3
2022-03-17 1
I'd now want to return
2022-03-15 3
2022-03-16 0
2022-03-17 1
Now I can easily do this client-side (relative to the database) and let my program compute and return the zero-augmented list to its clients based on the original list from postgres. But perhaps it would better if I could just tell postgresql to include zeros.
I suspect this isn't easy at all, because postgres has no obvious way of knowing what I'm up to. But in the interests of learning more about postgres and SQL, I thought I'd have a try. The try isn't too promising thus far...
Any pointers before I conclude that I was right to leave this to my (postgres client) program?
Update
This is an interesting case where my simplification of the problem led to a correct answer that didn't work for me. For those who come after, I thought it worth documenting what followed, because it takes some fun twists through constructing SQL queries.
@a_horse_with_no_name responded with a query that I've verified works if I simplify my own query to match. Unfortunately, my query had some extra baggage that I didn't think pertinent, and so had trimmed out when posting the original question.
Here's my real (original) query, with all names preserved (if shortened):
-- current query
SELECT
LEAST(time1, time2, time3, time4)::timestamp::date as the_date,
count(*) as count
FROM reading_group_reader rgr
INNER JOIN ( SELECT group_id, group_type ::group_type_name
FROM (VALUES (31198, 'excerpt')) as T(group_id, group_type)) TT
ON TT.group_id = rgr.group_id
AND TT.group_type = rgr.group_type
WHERE LEAST(time1, time2, time3, time4) > current_date - 30
GROUP BY the_date
ORDER BY the_date;
If I translate that directly into the proposed solution, however, the inner join between reading_group_reader and the temporary table TT causes the left join to become inner (I think) and the date sequence drops its zeros again. Fwiw, the table TT is a table because sometimes it actually is a subselect.
So I transformed my query into this:
SELECT
g.dt::date as the_date,
count(*) as count
FROM generate_series(date '2022-03-06', date '2022-04-06', interval '1 day') as g(dt)
LEFT JOIN (
SELECT
LEAST(rgr.time1, rgr.time2, rgr.time3, rgr.time4)::timestamp::date as the_date
FROM reading_group_reader rgr
INNER JOIN (
SELECT group_id, group_type ::group_type_name
FROM (VALUES (31198, 'excerpt')) as T(group_id, group_type)) TT
ON TT.group_id = rgr.group_id
AND TT.group_type = rgr.group_type
) rgrt
ON rgrt.the_date = g.dt::date
GROUP BY g.dt
ORDER BY the_date;
but this outputs 1's instead of 0's at the places that should be 0.
The reason is that count(*) counts every row, including the single row the left join produces for a date with no match, so each empty date counts as one. I need to include an additional field from the joined table (which will be NULL when there is no match) and count that.
So this query finally does what I want:
SELECT
g.dt::date as the_date,
count(rgrt.device_id) as count
FROM generate_series(date '2022-03-06', date '2022-04-06', interval '1 day') as g(dt)
LEFT JOIN (
SELECT
LEAST(rgr.time1, rgr.time2, rgr.time3, rgr.time4)::timestamp::date as the_date,
rgr.device_id
FROM reading_group_reader rgr
INNER JOIN (
SELECT group_id, group_type ::group_type_name
FROM (VALUES (31198, 'excerpt')) as T(group_id, group_type)
) TT
ON TT.group_id = rgr.group_id
AND TT.group_type = rgr.group_type
) rgrt(the_date)
ON rgrt.the_date = g.dt::date
GROUP BY g.dt
ORDER BY g.dt;
And, of course, on re-reading the accepted answer, I eventually saw that he did count an unrelated field, which I'd simply missed on my first several readings.
You will need to join to a list of dates. This can e.g. be done using generate_series()
SELECT g.dt::date as the_date,
count(t.my_timestamp) as count
FROM generate_series(date '2022-03-01',
date '2022-03-31',
interval '1 day') as g(dt)
LEFT JOIN my_table as t
ON t.my_timestamp::date = g.dt::date
AND ... -- the original WHERE clause goes here!
GROUP BY the_date
ORDER BY the_date;
Note that the original WHERE conditions need to go into the join condition of the LEFT JOIN. You can't put them into a WHERE clause because that would turn the outer join back into an inner join (which means the missing dates wouldn't be returned).
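To see the difference between filtering in the ON clause and filtering in WHERE, here is a small sketch. It assumes Python's sqlite3 instead of PostgreSQL, so a recursive CTE stands in for generate_series() and date() replaces the ::date cast. The kind column and the sample rows are hypothetical, added only to have something to filter on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_table (my_timestamp TEXT, kind TEXT);
    INSERT INTO my_table VALUES
        ('2022-03-15 08:00:00', 'a'),
        ('2022-03-15 09:30:00', 'a'),
        ('2022-03-15 10:00:00', 'b'),
        ('2022-03-17 12:00:00', 'a');
""")

# Recursive CTE standing in for generate_series(): one row per day.
DAYS = """
    WITH RECURSIVE g(dt) AS (
        SELECT '2022-03-15'
        UNION ALL
        SELECT date(dt, '+1 day') FROM g WHERE dt < '2022-03-17'
    )
"""

# Filter inside the join condition: unmatched days survive with count 0.
filter_in_on = conn.execute(DAYS + """
    SELECT g.dt, count(t.my_timestamp)
    FROM g
    LEFT JOIN my_table t
           ON date(t.my_timestamp) = g.dt
          AND t.kind = 'a'
    GROUP BY g.dt
    ORDER BY g.dt
""").fetchall()

# Same filter in WHERE: rows where t.kind is NULL are discarded, so the
# empty day disappears and the outer join behaves like an inner join.
filter_in_where = conn.execute(DAYS + """
    SELECT g.dt, count(t.my_timestamp)
    FROM g
    LEFT JOIN my_table t
           ON date(t.my_timestamp) = g.dt
    WHERE t.kind = 'a'
    GROUP BY g.dt
    ORDER BY g.dt
""").fetchall()

print(filter_in_on)
print(filter_in_where)
```

The first result keeps 2022-03-16 with a count of 0; the second drops that date entirely.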

Oracle SQL - return multiple columns from subquery

Let's take a simple query in Oracle:
SELECT
CASE.ID,
CASE.TYPE,
CASE.DATE_RAISED
FROM
CASE
WHERE
CASE.DATE_RAISED > '2019-01-01'
Now let's say another table, EVENT, contains multiple events which may be associated with each case (linked via EVENT.CASE_ID), or there may be no events for a case at all. I want to report on the earliest-dated future event per case or, if nothing exists, return NULL. I can do this with a subquery in the SELECT clause, as follows:
SELECT
CASE.ID,
CASE.TYPE,
CASE.DATE_RAISED,
(
SELECT
MIN(EVENT.DATE)
FROM
EVENT
WHERE
EVENT.CASE_ID = CASE.ID
AND EVENT.DATE >= CURRENT_DATE
) AS MIN_EVENT_DATE
FROM
CASE
WHERE
CASE.DATE_RAISED > '2019-01-01'
This will return a table like this:
Case ID  Case Type  Date Raised  Min Event Date
76       A          03/01/2019   10/05/2019
43       B          02/02/2019   [NULL]
89       A          29/01/2019   08/07/2019
90       A          04/03/2019   [NULL]
102      C          15/04/2019   20/05/2019
Note that if there do not exist any Events which match the criteria, the line is still returned but without a value. This is because the subquery is in the SELECT clause. This works just fine.
My problem, however, is that I want to return more than one column from the EVENT table while still preserving the possibility that there are no matching rows in EVENT. The above code only returns EVENT.DATE as the single subquery result, to ONE column of the main query. But what if I also want to return EVENT.ID, or EVENT.TYPE, while still allowing them to be NULL (if no matching records from EVENT are found)?
I suppose I could use multiple subqueries in the SELECT clause: each returning just ONE column. But this seems horribly inefficient, given that each subquery would be based on the same criteria (the minimum-dated EVENT whose CASE ID matches that of the main query; or NULL if no such events found).
I suspect some nifty joins would be the answer - although I'm struggling to understand which ones exactly.
Please note that the above examples are vastly simplified versions of my actual code, which already contains multiple joins in the "old style" Oracle format, eg:
WHERE
CASE.ID(+) = EVENT.CASE_ID
There are reasons why this is so - therefore a request to anyone answering this, please would you demonstrate any solutions in this style of coding, as my SQL isn't far enough advanced to be able to re-factor the "newer" style joins into existing code.
You can use a join and window functions. For instance:
select c.*, e.*
from case c left join
     (select e.*,
             row_number() over (partition by e.case_id order by e.date) as seqnum
      from event e
      where e.date >= current_date  -- only future events
     ) e
     on e.case_id = c.id and e.seqnum = 1
where c.date_raised > date '2019-01-01'; -- assuming the value is a date
Is this what you mean? I just rewrote Gordon's answer with old Oracle join syntax and your code style.
SELECT
CASE.ID,
CASE.TYPE,
CASE.DATE_RAISED,
MIN_E.DATE AS MIN_EVENT_DATE
FROM
CASE,
(SELECT EVENT.*,
ROW_NUMBER() OVER (PARTITION BY EVENT.CASE_ID ORDER BY EVENT.DATE ASC) AS SEQNUM
FROM
EVENT
WHERE
EVENT.DATE >= CURRENT_DATE
) MIN_E
WHERE
CASE.DATE_RAISED > DATE '2019-01-01'
AND MIN_E.CASE_ID (+) = CASE.ID
AND MIN_E.SEQNUM (+) = 1;
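For anyone who wants to try the row-numbering idea on a small data set, here is a rough sketch. It assumes Python's sqlite3 (window functions need SQLite 3.25 or later) rather than Oracle, uses ANSI join syntax, and renames the tables to cases and events because CASE is a reserved word. The sample rows and the fixed "today" value are made up so the result is reproducible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE cases (id INT, type TEXT, date_raised TEXT);
    CREATE TABLE events (id INT, case_id INT, type TEXT, event_date TEXT);
    INSERT INTO cases VALUES (76, 'A', '2019-01-03'), (43, 'B', '2019-02-02');
    INSERT INTO events VALUES
        (1, 76, 'hearing', '2019-07-08'),
        (2, 76, 'review',  '2019-05-10'),
        (3, 76, 'intake',  '2019-01-05');  -- in the past relative to "today"
""")

# Number each case's future events from earliest to latest, then keep
# only number 1. Every column of that event row is available to the
# outer query, and cases without a future event get NULLs.
QUERY = """
    SELECT c.id, c.type, e.id, e.type, e.event_date
    FROM cases c
    LEFT JOIN (
        SELECT ev.*,
               row_number() OVER (PARTITION BY ev.case_id
                                  ORDER BY ev.event_date) AS seqnum
        FROM events ev
        WHERE ev.event_date >= :today
    ) e
      ON e.case_id = c.id AND e.seqnum = 1
    WHERE c.date_raised > '2019-01-01'
    ORDER BY c.id
"""

# A fixed date stands in for CURRENT_DATE (an assumption for repeatability).
rows = conn.execute(QUERY, {"today": "2019-04-01"}).fetchall()
print(rows)
```

Case 43 comes back with NULLs in all three event columns, and case 76 comes back with the id, type and date of its earliest future event, all from one pass over the events table.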
Create an object type with the columns you want and return it from the subquery. Your query will look like this:
SELECT
CASE.ID,
CASE.TYPE,
CASE.DATE_RAISED,
(
SELECT
t_your_new_type ( MIN(EVENT.DATE) , min ( EVENT.your_another_column ) )
FROM
EVENT
WHERE
EVENT.CASE_ID = CASE.ID
AND EVENT.DATE >= CURRENT_DATE
) AS MIN_EVENT_DATE
FROM
CASE
WHERE
CASE.DATE_RAISED > '2019-01-01'

SQL values disappear when using max dates

First time posting here, and I have a query that I hope someone may be able to help with. I have tried to search for the answer, but with no joy.
When I use the SQL below to find a value (in this case eb.annualvalue), it returns multiple values because no end dates have been entered into the eb table, and there are too many employees without end dates for me to close down.
LEFT JOIN
(
SELECT
eb.empid, eb.bencode, eb.currencycode AS [currencycode], eb.notes AS [notes], eb.annualvalue
FROM
employeebenefit AS [eb]
WHERE
eb.bencode IN ('US 401K Plan')
AND (eb.enddate IS NULL OR eb.enddate >= '20180101')
)
AS eb26
ON eb26.empid = e.empid
However, when I use MAX startdate (code below) it returns the correct number of rows, but the eb.annualvalue figure disappears.
LEFT JOIN
(
SELECT
eb.empid, eb.bencode, eb.currencycode AS [currencycode], eb.notes AS [notes], eb.annualvalue
FROM
employeebenefit AS [eb]
WHERE
eb.bencode IN ('US 401K Plan')
AND (eb.enddate IS NULL OR eb.enddate >= '20180101')
AND (eb.startdate = (SELECT MAX(eb.startdate) FROM employeebenefit AS [eb]))
)
AS eb26
ON eb26.empid = e.empid
Any help would be greatly appreciated. Thanks Dan.
This sounds like a greatest-n-per-group problem: you just want one row per employee, from a table with many rows per employee. I'm not 100% clear on how you want to select that one row, but I can give an example.
Ideally, you would use ROW_NUMBER(), but that is only available from SQL Server 2005 onward.
The two common alternatives are:
- Join on your data twice. Once to find the "highest date" per user, again to find the whole row.
- Use a correlated sub-query to work out an individual's best row (still really joining twice)
Simple-self-join:
LEFT JOIN
(
SELECT
empid,
MAX(startdate) AS max_startdate
FROM
employeebenefit
WHERE
bencode IN ('US 401K Plan')
AND (enddate IS NULL OR enddate >= '20180101')
GROUP BY
empid
)
latest_employeebenefit
ON latest_employeebenefit.empid = e.empid
LEFT JOIN
employeebenefit
ON employeebenefit.empid = latest_employeebenefit.empid
AND employeebenefit.startdate = latest_employeebenefit.max_startdate
AND employeebenefit.bencode IN ('US 401K Plan')
AND (employeebenefit.enddate IS NULL OR employeebenefit.enddate >= '20180101')
This has the "feature" that if two such records both match the max_startdate (a tie) then both will come through. Often that is impossible, often it's desirable, it depends on your data and your needs.
Correlated-sub-query for join:
LEFT JOIN
employeebenefit
ON employeebenefit.id =
(
SELECT TOP(1) lookup.id
FROM employeebenefit AS lookup
WHERE lookup.empid = e.empid -- the correlated bit
AND lookup.bencode IN ('US 401K Plan')
AND (lookup.enddate IS NULL OR lookup.enddate >= '20180101')
ORDER BY lookup.startdate DESC
)
This is slightly different in that it always returns just one row. If there can be a tie when only sorting by startdate it's generally best to add another column to the ORDER BY, even if it's just an id column, to ensure the results are deterministic.
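Here is a small sketch of the correlated sub-query variant with a tie-breaker in the ORDER BY. It assumes Python's sqlite3 rather than SQL Server, so LIMIT 1 plays the role of TOP(1). The employee table, the id column values and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (empid INT);
    CREATE TABLE employeebenefit (id INT, empid INT, bencode TEXT,
                                  startdate TEXT, enddate TEXT,
                                  annualvalue REAL);
    INSERT INTO employee VALUES (1), (2), (3);
    INSERT INTO employeebenefit VALUES
        (10, 1, 'US 401K Plan', '2016-01-01', NULL, 1000),
        (11, 1, 'US 401K Plan', '2017-01-01', NULL, 1500),
        (12, 2, 'US 401K Plan', '2017-06-01', NULL, 2000),
        (13, 2, 'US 401K Plan', '2017-06-01', NULL, 2500);  -- tie on startdate
""")

# For each employee, look up the id of the single "best" benefit row and
# join on it. Sorting by id as a second key makes the pick deterministic
# when two rows share the same startdate.
QUERY = """
    SELECT e.empid, eb.id, eb.annualvalue
    FROM employee e
    LEFT JOIN employeebenefit eb
           ON eb.id = (
               SELECT lookup.id
               FROM employeebenefit lookup
               WHERE lookup.empid = e.empid  -- the correlated bit
                 AND lookup.bencode = 'US 401K Plan'
                 AND (lookup.enddate IS NULL OR lookup.enddate >= '2018-01-01')
               ORDER BY lookup.startdate DESC, lookup.id DESC
               LIMIT 1
           )
    ORDER BY e.empid
"""

rows = conn.execute(QUERY).fetchall()
print(rows)
```

Employee 1 gets the row with the latest startdate, employee 2 gets exactly one of the two tied rows (the one with the higher id), and employee 3, who has no benefit rows, is still returned with NULLs.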
You can use the code below, if I understood your question:
OUTER APPLY
(
SELECT TOP 1
eb.empid, eb.bencode, eb.currencycode AS [currencycode], eb.notes AS [notes], eb.annualvalue
FROM
employeebenefit AS [eb]
WHERE
eb.empid = e.empid
AND eb.bencode IN ('US 401K Plan')
AND (eb.enddate IS NULL OR eb.enddate >= '20180101')
ORDER BY
eb.startdate DESC
)
AS eb26

SQL how to handle multiple value join

I need to join assignments and expatriates tables by a combination of ID, effective_start_date and effective_end_date.
I need to get data about employees who went to another country during their assignment (between effective_start_date and effective_end_date). But I also need to handle cases where, during one assignment, data has been entered about the employee going to two or more countries: I need to show only one, either the last one or the active one (if there is one).
In the results I'm getting multiple rows for person ID 123 because incorrect values were entered in the assignments table. I need to show only one row for person 123: the information about him going to China (the active one).
So basically, if during one assignment (between effective_start_date and effective_end_date) there is information about him going to 2 different countries, I need to show only one case. I need to correct my select statement so it handles this case somehow.
Edit: This also needs to work when the 2 cases about the employee going to another country are historical, so I don't think this can be done with sysdate.
Edit nr. 2: updated SQL Fiddle. I need to show BOTH expatriations for person 321 and only one for person 123; this is basically my main goal.
Edit nr. 3: still haven't found the solution.
select
ass.person_id,
ass.effective_start_date,
ass.effective_end_date,
exp.date_from,
exp.date_to,
exp.home_country,
exp.host_country
from expatriates exp, assignments ass
where
exp.person_id=ass.person_id
and exp.date_to >= ass.effective_start_date
and exp.date_to <= ass.effective_end_date
As @PuneetPandey already wrote, your logic will not catch all overlapping periods.
To get only one row you can use ROW_NUMBER, e.g.
select *
from
(
select
ass.person_id,
ass.effective_start_date,
ass.effective_end_date,
exp.date_from,
exp.date_to,
exp.home_country,
exp.host_country,
row_number()
over (partition by ass.person_id, ass.effective_start_date
order by exp.date_from desc) as rn -- latest expatriation first
from expatriates exp, assignments ass
where
exp.person_id=ass.person_id
and exp.date_to >= ass.effective_start_date
and exp.date_to <= ass.effective_end_date
) dt
where rn = 1
First of all, I think the query needs to be changed: replace
and exp.date_to <= ass.effective_end_date with and exp.date_from <= ass.effective_end_date.
Now, if you want any one of the visited countries, you can select distinct records by person_id as below:
select
distinct ass.person_id,
ass.effective_start_date,
ass.effective_end_date,
exp.date_from,
exp.date_to,
exp.home_country,
exp.host_country
from expatriates exp, assignments ass
where
exp.person_id=ass.person_id
and exp.date_to >= ass.effective_start_date
and exp.date_from <= ass.effective_end_date
Or, if you want a particular row, you could maintain another column for status, set it to 1 if the visit is active and 0 otherwise, and use the query below:
select
ass.person_id,
ass.effective_start_date,
ass.effective_end_date,
exp.date_from,
exp.date_to,
exp.home_country,
exp.host_country
from expatriates exp, assignments ass
where
exp.person_id=ass.person_id
and exp.date_to >= ass.effective_start_date
and exp.date_from <= ass.effective_end_date
and exp.status = 1
I think you need to join a third table, which will be a derived table like "X" below:
select
ass.person_id,
ass.effective_start_date,
ass.effective_end_date,
exp.date_from,
exp.date_to,
exp.home_country,
exp.host_country
from expatriates exp, assignments ass, (
SELECT e.person_id, MAX(e.date_from) md
FROM expatriates e
INNER JOIN assignments a ON e.person_id=a.person_id
and e.date_to >= a.effective_start_date
and e.date_to <= a.effective_end_date GROUP BY e.person_id) X
where exp.person_id= X.person_id
and exp.date_from= X.md
and ass.person_id= exp.person_id
I'm assuming that if a person gets fired, effective_end_date will be updated and no more expatriates records will appear. So I just select the last date_to in expatriates. That is why I don't see why you need to compare date ranges, and I removed that part from my WHERE clause.
WITH active_or_last_ass AS (
SELECT exp.person_id, date_from, max(exp.date_to) max_date
FROM expatriates exp
WHERE exp.date_from < sysdate
GROUP BY exp.person_id, date_from
)
select
ass.person_id,
ass.effective_start_date,
ass.effective_end_date,
exp.date_from,
exp.date_to,
exp.home_country,
exp.host_country
from
active_or_last_ass ala
inner join expatriates exp
on exp.person_id = ala.person_id
and exp.date_to = ala.max_date
inner join assignments ass
on exp.person_id = ass.person_id

Trying to select multiple columns on an inner join query with group and where clauses

I'm trying to run a query that gives me one SUM, selects two columns from a joined table, and groups that data by the unique ID I gave them. This is my original query, and it works.
SELECT Sum (Commission_Paid)
FROM [INTERN_DB2].[dbo].[PaymentList]
INNER JOIN [INTERN_DB2]..[RealEstateAgentList]
ON RealEstateAgentList.AgentID = PaymentList.AgentID
WHERE Close_Date >= '1/1/2013' AND Close_Date <= '12/31/2013'
GROUP BY RealEstateAgentList.AgentID
I've tried the query below, but I keep getting an error and I don't know why. It says it's a syntax error.
SELECT Sum (Commission_Paid)
FROM [INTERN_DB2].[dbo].[PaymentList]
INNERJOIN [INTERN_DB2]..[RealEstateAgentList](
Select First_Name, Last_Name
From [Intern_DB2]..[RealEstateAgentList]
Group By Last_name
)
ON RealEstateAgentList.AgentID = PaymentList.AgentID
WHERE Close_Date >= '1/1/2013' AND Close_Date <= '12/31/2013'
GROUP BY RealEstateAgentList.AgentID
Your query has multiple problems:
SELECT rl.AgentID, rl.first_name, rl.last_name, Sum(pl.Commission_Paid)
FROM [INTERN_DB2].[dbo].[PaymentList] pl inner join
(Select AgentID, min(first_name) as first_name, min(last_name) as last_name
From [Intern_DB2]..[RealEstateAgentList]
GROUP BY AgentID
) rl
ON rl.AgentID = pl.AgentID
WHERE Close_Date >= '2013-01-01' AND Close_Date <= '2013-12-31'
GROUP BY rl.AgentID, rl.first_name, rl.last_name;
Here are some changes:
- INNERJOIN --> inner join.
- Fixed the syntax of the subquery next to the table name.
- Removed columns for first and last name. They are not used.
- Changed the subquery to include AgentID.
- Added AgentID, first_name, and last_name to the outer aggregation, so you can tell where the values are coming from.
- Changed the date formats to a less ambiguous standard form.
- Added table alias for subquery.
I suspect the subquery on the agent list is not important. You can probably do:
SELECT rl.AgentId, rl.first_name, rl.last_name, Sum(pl.Commission_Paid)
FROM [INTERN_DB2].[dbo].[PaymentList] pl inner join
[Intern_DB2]..[RealEstateAgentList] rl
ON rl.AgentID = pl.AgentID
WHERE pl.Close_Date >= '2013-01-01' AND pl.Close_Date <= '2013-12-31'
GROUP BY rl.AgentID, rl.first_name, rl.last_name;
EDIT:
I'm glad this solution helped. As you continue to write queries, try to always do the following:
- Use table aliases that are abbreviations of the table names.
- Always use table aliases when referring to columns.
- When using date constants, either use "YYYY-MM-DD" format or use convert() to convert a string using the specified format. (The latter is actually the safer method, but the former is more convenient and works in almost all databases.)
- Pay attention to the error messages; they can be informative in SQL Server (unfortunately, other databases are not so clear).
- Format your query so other people can understand it. This will help you understand and debug your queries as well. I have a very particular formatting style (which no one is going to change at this point); the important thing is not the particular style but being able to "see" what the query is doing. My style is documented in my book "Data Analysis Using SQL and Excel".
There are other rules, but these are a good way to get started.
SELECT Sum (Commission_Paid)
FROM [INTERN_DB2].[dbo].[PaymentList] pl
INNER JOIN (
Select First_Name, Last_Name
From [Intern_DB2]..[RealEstateAgentList]
Group By Last_name
) x ON x.AgentID = pl.AgentID
WHERE Close_Date >= '1/1/2013'
AND Close_Date <= '12/31/2013'
GROUP BY RealEstateAgentList.AgentID
This is how the query should look... however, if you subquery first and last name, you'll also have to include them in the group by. Assuming Close_Date is in the PaymentList table, this is how I would write the query:
SELECT
al.AgentID,
al.First_Name,
al.Last_Name,
Sum(pl.Commission_Paid) AS Commission_Paid
FROM [INTERN_DB2].[dbo].[PaymentList] pl
INNER JOIN [Intern_DB2].dbo.[RealEstateAgentList] al ON al.AgentID = pl.AgentID
WHERE YEAR(pl.Close_Date) = 2013
GROUP BY al.AgentID, al.First_Name, al.Last_Name
Subqueries are evil, for the most part. There's no need for one here, because you can just get the columns from the join.
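If you want to check the plain-join version against a few rows, here is a quick sketch. It assumes Python's sqlite3 in place of SQL Server, with invented sample data and without the database and schema prefixes. It also swaps YEAR() for a half-open date range, which is my own substitution: it selects the same year and still works if Close_Date carries a time part.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE RealEstateAgentList (AgentID INT, First_Name TEXT, Last_Name TEXT);
    CREATE TABLE PaymentList (AgentID INT, Commission_Paid REAL, Close_Date TEXT);
    INSERT INTO RealEstateAgentList VALUES (1, 'Ann', 'Lee'), (2, 'Bob', 'Ray');
    INSERT INTO PaymentList VALUES
        (1, 100.0, '2013-02-01'),
        (1, 250.0, '2013-12-31'),
        (1, 999.0, '2014-01-01'),  -- outside 2013, must be excluded
        (2,  50.0, '2013-07-15');
""")

# Join once, filter on the payment date, and group by every non-aggregated
# column that appears in the SELECT list.
QUERY = """
    SELECT al.AgentID, al.First_Name, al.Last_Name,
           SUM(pl.Commission_Paid) AS Commission_Paid
    FROM PaymentList pl
    INNER JOIN RealEstateAgentList al ON al.AgentID = pl.AgentID
    WHERE pl.Close_Date >= '2013-01-01'
      AND pl.Close_Date <  '2014-01-01'
    GROUP BY al.AgentID, al.First_Name, al.Last_Name
    ORDER BY al.AgentID
"""

rows = conn.execute(QUERY).fetchall()
print(rows)
```

Each agent comes back once with their 2013 total, and the 2014 payment is left out.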