Why do I get extra rows in LEFT JOIN when joining to an ID and TIMESTAMP column? - sql

I have a table that contains multiple registration periods (date and time for the start of the registration, as well as date and time for when that instance of registration ends). For each row (registration period), there is a status column that contains the status at the end of the registration period. I was trying to get the status associated with the most recent end date of registration for a given ID. I've used a window function to get the most recent end date of interest per ID, and then I wanted to LEFT JOIN on ID and end date to get the status from the same table on which I used the window function. There should really be just one combination of end date and status per ID, but somehow I get more rows than there are in the left table.
As mentioned above, my approach was to use a window function to get MAX(end_date) per ID and one other column, let's call it enrollment_number. Then I LEFT JOIN this result to its parent table to bring in the status associated with that date only. Later, I'd like to use the result of this join to bring the status associated with the end date into other tables where I need it.
WITH
my_first_test AS
(
SELECT my_id,
enrollment_number,
MAX(end_date_of_enrollment) OVER (partition by my_id, enrollment_number) AS end_date_enrolled
FROM enrollments
)
SELECT mft.my_id, mft.end_date_enrolled, e.status
FROM my_first_test AS mft
LEFT JOIN enrollments AS e
ON mft.my_id = e.my_id AND mft.end_date_enrolled = e.end_date_of_enrollment;
The CTE returns 42917 rows, same number of rows as in the enrollments table, which it should be if I understand it correctly.
Then, I LEFT JOIN enrollments, to bring in information from the status column also contained in the enrollments table. The LEFT JOIN is done on my_id and end_date_enrolled.
I expect 42917 rows in the resulting table, because my_id and end_date_enrolled together should be unique. However, I get slightly more rows in my final table - 44408. I was wondering if the StackOverflow community would be able to help me solve this mystery. I am using SQL in AWS Redshift.

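You have duplicates in enrollments: more than one row shares the same my_id and end date, so every matching row of the left table is returned once per duplicate. You can find them with aggregation:
SELECT my_id, end_date_of_enrollment, COUNT(*)
FROM enrollments AS e
GROUP BY my_id, end_date_of_enrollment
HAVING COUNT(*) > 1;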
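To make the row multiplication concrete, here is a minimal reproduction. It is only a sketch: the four sample rows are invented, SQLite (through Python's sqlite3 module, which needs SQLite 3.25+ for window functions) stands in for Redshift, and the end-date column is assumed to be end_date_of_enrollment as in the CTE. The last query shows one possible way to keep a single row per ID and enrollment_number with ROW_NUMBER(); the tiebreak on status is an arbitrary assumption, not something the question specifies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE enrollments (
    my_id INTEGER,
    enrollment_number INTEGER,
    end_date_of_enrollment TEXT,
    status TEXT
);
-- my_id 2 has two rows with the same end date: the "duplicates".
INSERT INTO enrollments VALUES
    (1, 1, '2021-01-31 00:00:00', 'active'),
    (1, 1, '2021-03-31 00:00:00', 'closed'),
    (2, 1, '2021-02-28 00:00:00', 'active'),
    (2, 1, '2021-02-28 00:00:00', 'closed');
""")

# The query from the question: the CTE keeps one row per source row.
joined = conn.execute("""
WITH my_first_test AS (
    SELECT my_id,
           enrollment_number,
           MAX(end_date_of_enrollment)
               OVER (PARTITION BY my_id, enrollment_number) AS end_date_enrolled
    FROM enrollments
)
SELECT mft.my_id, mft.end_date_enrolled, e.status
FROM my_first_test AS mft
LEFT JOIN enrollments AS e
  ON mft.my_id = e.my_id
 AND mft.end_date_enrolled = e.end_date_of_enrollment
""").fetchall()

# The duplicate check from the answer.
duplicates = conn.execute("""
SELECT my_id, end_date_of_enrollment, COUNT(*)
FROM enrollments
GROUP BY my_id, end_date_of_enrollment
HAVING COUNT(*) > 1
""").fetchall()

# One possible fix: rank rows and keep exactly one per partition.
latest = conn.execute("""
SELECT my_id, enrollment_number, end_date_of_enrollment, status
FROM (
    SELECT e.*,
           ROW_NUMBER() OVER (
               PARTITION BY my_id, enrollment_number
               ORDER BY end_date_of_enrollment DESC, status  -- status: arbitrary tiebreak
           ) AS seqnum
    FROM enrollments AS e
) AS ranked
WHERE seqnum = 1
ORDER BY my_id
""").fetchall()

print(len(joined), duplicates, latest)
```

With this data the table has 4 rows but the join returns 6: the two CTE rows for my_id 2 each match both duplicate rows. The ROW_NUMBER() variant returns one row per ID and enrollment_number without joining back at all.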

Related

SQL - Min() on a Daily Query

I am trying to pull some specific information from an access control database.
I have a query providing results spanning several days. For a specific day, I need to get the first record of each person for that day. I have totally muddled the entire bit, hence my question.
This is the code used for the initial query:
Select
Message.TimeStamp_SPM,
Message.FirstName,
Message.LastName,
Message.CardNumber,
Message.MessageDescription,
Message.Description,
Department.Description As Description1
From
Message Inner Join
CardHolder On CardHolder.CardHolderID = Message.CardHolderID Inner Join
Department On CardHolder.DepartmentID = Department.DepartmentID
Where
Message.TimeStamp_SPM > Convert(datetime,'2021-03-02',120) And
Message.TimeStamp_SPM < Convert(datetime,'2021-03-03',120) And
Message.Description Not Like '%Truck%'
From this query I need to obtain the first record of each person for that specific date. Any advice on the most efficient way to obtain the desired result?
"From this query I need to obtain the first record of each person for that specific date."
Assuming "person" is CardHolderId, then include that in your query. You can then use window functions to get the earliest record for each CardHolderId:
with cte as (
<your query here with CardHolderId>
)
select t.*
from (select cte.*,
-- ascending order so that seqnum = 1 is the first record of the day
row_number() over (partition by CardHolderID order by TimeStamp_SPM asc) as seqnum
from cte
) t
where seqnum = 1;
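Because the question asks for the first record per person, a runnable sketch of the ROW_NUMBER() pattern has to order by the timestamp ascending. The example below is an illustration only: it uses SQLite through Python's sqlite3 module rather than SQL Server, a cut-down Message table with invented rows, and ISO-formatted text timestamps, all of which are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified stand-in for the Message table (columns trimmed for the demo).
CREATE TABLE Message (
    CardHolderID INTEGER,
    TimeStamp_SPM TEXT,
    MessageDescription TEXT
);
INSERT INTO Message VALUES
    (10, '2021-03-02 07:58:00', 'Entry'),
    (10, '2021-03-02 12:01:00', 'Exit'),
    (11, '2021-03-02 08:15:00', 'Entry'),
    (11, '2021-03-02 17:30:00', 'Exit'),
    (10, '2021-03-03 08:00:00', 'Entry');
""")

first_per_person = conn.execute("""
WITH cte AS (
    SELECT CardHolderID, TimeStamp_SPM, MessageDescription
    FROM Message
    WHERE TimeStamp_SPM >= '2021-03-02'
      AND TimeStamp_SPM <  '2021-03-03'
)
SELECT CardHolderID, TimeStamp_SPM, MessageDescription
FROM (
    SELECT cte.*,
           ROW_NUMBER() OVER (
               PARTITION BY CardHolderID
               ORDER BY TimeStamp_SPM ASC   -- earliest row gets seqnum = 1
           ) AS seqnum
    FROM cte
) AS t
WHERE seqnum = 1
ORDER BY CardHolderID
""").fetchall()

print(first_per_person)
```

Switching ASC to DESC would return the last record of the day for each card holder instead.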

Record with latest date, where date comes from a joined table

I have tried every answer that I have found for finding the last record, and I have failed to get a successful result. I currently have a query that lists active trailers. I need it to show only a single row for each trailer entry, where that row is based on a date in a joined table.
I have tables
trailer, company, equipment_group, movement, stop
In order to connect trailer to stop (which is where the date is), I have to join it to equipment_group, which joins to movement, which then joins to stop.
I have tried using MAX and GROUP BY, and PARTITION BY, both of which error out.
I have tried many solutions here, as well as these
https://thoughtbot.com/blog/ordering-within-a-sql-group-by-clause
https://www.geeksengine.com/article/get-single-record-from-duplicates.html
It seems that all of these solutions have the date in the same table as the thing that they want to group by, which I do not.
SELECT
trailer.*,
company.name,
equipment_group.currentmovement_id,
equipment_group.company_id,
movement.dest_stop_id, stop.location_id,
stop.*
FROM trailer
LEFT OUTER JOIN company ON (company.id = trailer.company_id)
LEFT OUTER JOIN equipment_group ON (equipment_group.id =
trailer.currenteqpgrpid)
LEFT OUTER JOIN movement ON (movement.id =
equipment_group.currentmovement_id)
LEFT OUTER JOIN stop ON (stop.id = movement.dest_stop_id)
WHERE trailer.is_active = 'A'
Using MAX and GROUP BY gives error "invalid in the select list... not contained in...aggregate function"
Well, I never did end up figuring that out, but once I joined movement to equipment_group on two conditions, all is well. Each extra record was created by a different company id; company id is in EVERY table.
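The poster does not show the two join conditions, so the following is only a guess at the shape of the fix: it assumes movement ids are unique per company rather than globally, and that both equipment_group and movement carry a company_id column. The table contents are invented, and SQLite via Python's sqlite3 module is used purely to make the sketch runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical, trimmed-down versions of the two tables.
CREATE TABLE equipment_group (id INTEGER, company_id TEXT, currentmovement_id INTEGER);
CREATE TABLE movement (id INTEGER, company_id TEXT, dest_stop_id INTEGER);
INSERT INTO equipment_group VALUES (1, 'A', 100);
-- The same movement id exists once per company.
INSERT INTO movement VALUES (100, 'A', 500), (100, 'B', 900);
""")

# Joining on the id alone matches the movement of every company.
one_condition = conn.execute("""
SELECT eg.id, m.company_id, m.dest_stop_id
FROM equipment_group AS eg
LEFT OUTER JOIN movement AS m
  ON m.id = eg.currentmovement_id
ORDER BY m.company_id
""").fetchall()

# Adding company_id to the join key restores one row per equipment group.
two_conditions = conn.execute("""
SELECT eg.id, m.company_id, m.dest_stop_id
FROM equipment_group AS eg
LEFT OUTER JOIN movement AS m
  ON m.id = eg.currentmovement_id
 AND m.company_id = eg.company_id
""").fetchall()

print(one_condition, two_conditions)
```

If the ids really are scoped per company, the same extra condition would presumably be needed on the other joins in the chain as well.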

Suppress Nonadjacent Duplicates in Report

Medical records in my Crystal Report are sorted in this order:
...
Group 1: Score [Level of Risk]
Group 2: Patient Name
...
Because patients are sorted by Score before Name, the report pulls in multiple entries per patient with varying scores - and since duplicate entries are not always adjacent, I can't use Previous or Next to suppress them. To fix this, I'd like to only display the latest entry for each patient based on the Assessment Date field - while maintaining the above order.
I'm convinced this behavior can be implemented with a custom SQL command to only pull in the latest entry per patient, but have had no success creating that behavior myself. How can I accomplish this compound sort?
Current SQL Statement in use:
SELECT "EpisodeSummary"."PatientID",
"EpisodeSummary"."Patient_Name",
"EpisodeSummary"."Program_Value",
"RiskRating"."Rating_Period",
"RiskRating"."Assessment_Date",
"RiskRating"."Episode_Number",
"RiskRating"."PatientID",
"Facility"."Provider_Name"
FROM (
"SYSTEM"."EpisodeSummary"
"EpisodeSummary"
LEFT OUTER JOIN "FOOBARSYSTEM"."RiskAssessment" "RiskRating"
ON (
("EpisodeSummary"."Episode_Number"="RiskRating"."Episode_Number")
AND
("EpisodeSummary"."FacilityID"="RiskRating"."FacilityID")
)
AND
("EpisodeSummary"."PatientID"="RiskRating"."PatientID")
), "SYSTEM"."Facility" "Facility"
WHERE (
"EpisodeSummary"."FacilityID"="Facility"."FacilityID"
)
AND "RiskRating"."PatientID" IS NOT NULL
ORDER BY "EpisodeSummary"."Program_Value"
The SQL code below may not be exactly correct, depending on the structure of your tables. It assumes the duplicate risk scores come from the RiskAssessment table. If that is not the case, the code may need to be altered.
Essentially, we create a derived table and generate a row number for each record, partitioned by PatientID and ordered by the assessment date, so the most recent date gets the lowest number (1). Then, in the join, we restrict the result set to record #1 only (each patient has its own #1).
If this doesn't work, let me know and provide some table details. Should the Facility table be the starting point? Are there multiple entries in EpisodeSummary per patient? Thanks!
SELECT es.PatientID
,es.Patient_Name
,es.Program_Value
,rrd.Rating_Period
,rrd.Assessment_Date
,rrd.Episode_Number
,rrd.PatientID
,f.Provider_Name
FROM SYSTEM.EpisodeSummary es
LEFT JOIN (
-- Derived table retrieving the most recent risk assessment for each patient
SELECT PatientID
,Assessment_Date
,Episode_Number
,FacilityID
,Rating_Period
,ROW_NUMBER() OVER (
PARTITION BY PatientID ORDER BY Assessment_Date DESC
) AS RN -- This code generates a row number for each record. The count is restarted for every patientID and the count starts at the most recent date.
FROM RiskAssessment
) rrd
ON es.patientID = rrd.patientid
AND es.episode_number = rrd.episode_number
AND es.facilityid = rrd.facilityid
AND rrd.RN = 1 --This only retrieves one record per patient (the most recent date) from the riskassessment table
INNER JOIN SYSTEM.Facility f
ON es.facilityid = f.facilityid
WHERE rrd.PatientID IS NOT NULL
ORDER BY es.Program_Value

Compare 2 tables and add missing records to the first, taking into account year/months

I have 2 tables, one with codes and budgets called FACT_QUANTITY_TMP and the other is a tree with all possible codes called C_DS_BD_AP_A.
All codes that exist are in this C_DS_BD_AP_A table, yet not all are in FACT_QUANTITY_TMP. Only those with budget get added by the ERP.
We need all codes to be in this FACT_QUANTITY_TMP table, just with budget to be 0 in that case.
I was trying first to get the missing codes by the following query:
SELECT T2.D_ACTIECODE From
(SELECT distinct
A.FULL_DATE as FULL_DATE, A.DIM03 as DIM03
FROM FACT_QUANTITY_TMP A) T1
RIGHT JOIN
(select distinct B.D_ACTIECODE AS D_ACTIECODE from C_DS_BD_AP_A B) T2
ON
T1.DIM03 = T2.D_ACTIECODE
where T1.DIM03 is null
order by T1.full_date
I get a list of my missing records, yet it doesn't take into account the FULL_DATE (year and month) of the destination table.
In short, FACT_QUANTITY_TMP needs to have all of its missing records added, for each year and month.
I'm looking for the best approach here; this query would be used in a stored procedure that runs automatically every month when the ERP data gets pulled.
You can generate the missing records by doing a cross join to generate all combinations and then removing those that are already there. For example:
select fd.FULL_DATE, c.D_ACTIECODE
from (select distinct FULL_DATE from fact_quantity_tmp) fd cross join
(select D_ACTIECODE from C_DS_BD_AP_A) c left join
fact_quantity_tmp fqt
on fqt.FULL_DATE = fd.FULL_DATE and fqt.dim03 = c.D_ACTIECODE
where fqt.FULL_DATE is null;
You can put an insert before this to insert these rows into the fact table.
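As a hedged illustration of the cross join, anti-join and insert together, the sketch below runs the pattern against SQLite through Python's sqlite3 module. The sample rows are invented, FULL_DATE is stored as a year-month string for brevity, and the BUDGET column name is an assumption, since the question never names the budget column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE C_DS_BD_AP_A (D_ACTIECODE TEXT);
-- BUDGET is a hypothetical column name for the budget amount.
CREATE TABLE FACT_QUANTITY_TMP (FULL_DATE TEXT, DIM03 TEXT, BUDGET REAL);
INSERT INTO C_DS_BD_AP_A VALUES ('A'), ('B'), ('C');
INSERT INTO FACT_QUANTITY_TMP VALUES
    ('2021-01', 'A', 100),
    ('2021-01', 'B', 50),
    ('2021-02', 'A', 70);
""")

# Every (month, code) combination that has no row in the fact table yet.
missing_sql = """
SELECT fd.FULL_DATE, c.D_ACTIECODE, 0 AS BUDGET
FROM (SELECT DISTINCT FULL_DATE FROM FACT_QUANTITY_TMP) AS fd
CROSS JOIN C_DS_BD_AP_A AS c
LEFT JOIN FACT_QUANTITY_TMP AS fqt
  ON fqt.FULL_DATE = fd.FULL_DATE
 AND fqt.DIM03 = c.D_ACTIECODE
WHERE fqt.FULL_DATE IS NULL
ORDER BY fd.FULL_DATE, c.D_ACTIECODE
"""

missing = conn.execute(missing_sql).fetchall()

# "Put an insert before this": add the missing combinations with budget 0.
conn.execute(
    "INSERT INTO FACT_QUANTITY_TMP (FULL_DATE, DIM03, BUDGET) " + missing_sql
)
conn.commit()

total_rows = conn.execute("SELECT COUNT(*) FROM FACT_QUANTITY_TMP").fetchone()[0]
still_missing = conn.execute(missing_sql).fetchall()

print(missing, total_rows, still_missing)
```

Running the statement a second time inserts nothing, which is convenient for a job that is scheduled every month.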

Return max date from multiple tables join oracle

I have 4 tables with the following relevant information I want to retrieve.
Table: Staff_profile (STAFF_ID, STAFF_USERNAME, STAFF_NAME, STAFF_JOB_ID, STAFF_FACULTY_ID, STAFF_OFF_TEL, STAFF_EMAIL) - holds staff information
Table: RFMUSERHISTORY (uh_staff_id, UH_DATETIME) - holds login history
Table: RFMUSERROLEJOBMAP (role_id, job_id) - maps role-2-job [this is because the job table pre-exists and this new app only picks certain job ids to use against its own roles table]
Table: RFMUSERROLE (USERROLE_CODE, USERROLE_ID) - holds user roles information
Now I want to get the last login (max date for that user in userhistory) details including role and staff details for any particular person who logs in. I have had trouble with my code and finally just resorted to selecting all the records for that user with the UH_datetime ordered desc so I can pick that latest topmost record.
Here is my current code (very inefficient as described above):
SELECT a.STAFF_ID, a.STAFF_USERNAME, a.STAFF_NAME, a.STAFF_JOB_ID, a.STAFF_FACULTY_ID,
a.STAFF_OFF_TEL, a.STAFF_EMAIL, to_CHAR(b.UH_DATETIME,'Dy DD-MM-YYYY HH24:MI:SS')
AS UH_DATETIME, e.USERROLE_CODE, e.USERROLE_ID
FROM STAFF_PROFILE a
LEFT JOIN RFMUSERHISTORY b ON STAFF_ID=b.uh_staff_id
LEFT JOIN RFMUSERROLEJOBMAP d ON a.STAFF_JOB_ID=d.job_id
LEFT JOIN RFMUSERROLE e ON d.role_id=e.userrole_id
WHERE STAFF_ID=:eid1 ORDER BY b.UH_DATETIME DESC
You could use an analytic function to rank the rows and then select the most recent one. If you're really just selecting the data for a single STAFF_ID, this is probably no more efficient than nesting your original query in an outer query that selects the row using a ROWNUM predicate. If you are selecting the data for multiple staff members, however, this should be more efficient.
SELECT *
FROM (
SELECT a.STAFF_ID,
a.STAFF_USERNAME,
a.STAFF_NAME,
a.STAFF_JOB_ID,
a.STAFF_FACULTY_ID,
a.STAFF_OFF_TEL,
a.STAFF_EMAIL,
to_CHAR(b.UH_DATETIME,'Dy DD-MM-YYYY HH24:MI:SS') AS UH_DATETIME,
e.USERROLE_CODE,
e.USERROLE_ID,
dense_rank() over (partition by a.staff_id order by b.uh_datetime desc) rnk
FROM STAFF_PROFILE a
LEFT JOIN RFMUSERHISTORY b ON STAFF_ID=b.uh_staff_id
LEFT JOIN RFMUSERROLEJOBMAP d ON a.STAFF_JOB_ID=d.job_id
LEFT JOIN RFMUSERROLE e ON d.role_id=e.userrole_id
WHERE STAFF_ID=:eid1
)
WHERE rnk = 1
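Oracle doesn't send rows over the network before you ask for them. If your client code only requests the first row, your query should be efficient enough.
Another option is to limit Oracle to one row with rownum. Because ROWNUM is assigned before ORDER BY is applied, the predicate has to go in an outer query wrapped around the ordered query:
where rownum = 1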