Query keeps giving me duplicate records. How can I fix this? - sql

I wrote a query which uses 2 temp tables. And then joins them into 1. However, I am seeing duplicate records in the student visit temp table. (Query is below). How could this be modified to remove the duplicate records of the visit temp table?
with clientbridge as (Select *
from (Select visitorid, --Visid
roomnumber,
room_id,
profid,
student_id,
ambc.datekey,
RANK() over(PARTITION BY visitorid,student_id,profid ORDER BY ambc.datekey desc) as rn
from university.course_office_hour_bridge cohd
--where student_id = '9999999-aaaa-6634-bbbb-96fa18a9046e'
)
where rn = 1 --visitorid = '999999999999999999999999999999'---'1111111111111111111111111111111' --and pai.datekey is not null --- 00000000000000000000000000
),
-----------------Data Header Table
studentvisit as
(SELECT
--Visit key will allow us to track everything they did within that visit.
distinct visid_visitorid,
--calcualted_visitorid,
uniquevisitkey,
--channel, -- says the room they're in. Channel might not be reliable would need to see how that operates
--office_list, -- add 7 to exact
--user_college,
--first_office_hour_name,
--first_question_time_attended,
studentaccountid_5,
profid_officenumber_8,
studentvisitstarttime,
room_id_115,
--date_time,
qqq144, --Course Name
qqq145, -- Course Office Hour Benefit
qqq146, --Course Office Hour ID
datekey
FROM university.office_hour_details ohd
--left_join niversity.course_office_hour_bridge cohd on ohd.visid_visitorid
where DateKey >='2022-10-01' --between '2022-10-01' and '2022-10-27'
and (qqq146 <> '')
)
select
*
from clientbridge ab inner join studentvisit sv on sv.visid_visitorid = cb.visitorid
I wrote a query which uses 2 temp tables. And then joins them into 1. However, I am seeing duplicate records in the student visit temp table. (Query is below). How could this be modified to remove the duplicate records of the visit temp table?

I think you may get have a better shot by joining the two datasets in the same query where you want the data ranked, otherwise your rank from query will be ignored within the results from the second query. Perhaps, something like ->
;with studentvisit as
(SELECT
--Visit key will allow us to track everything they did within that visit.
distinct visid_visitorid,
--calcualted_visitorid,
uniquevisitkey,
--channel, -- says the room they're in. Channel might not be reliable would need to see how that operates
--office_list, -- add 7 to exact
--user_college,
--first_office_hour_name,
--first_question_time_attended,
studentaccountid_5,
profid_officenumber_8,
studentvisitstarttime,
room_id_115,
--date_time,
qqq144, --Course Name
qqq145, -- Course Office Hour Benefit
qqq146, --Course Office Hour ID
datekey
FROM university.office_hour_details ohd
--left_join niversity.course_office_hour_bridge cohd on ohd.visid_visitorid
where DateKey >='2022-10-01' --between '2022-10-01' and '2022-10-27'
and (qqq146 <> '')
)
,clientbridge as (
Select
sv.*,
university.course_office_hour_bridge cohd, --Visid
roomnumber,
room_id,
profid,
student_id,
ambc.datekey,
RANK() over(PARTITION BY sv.visitorid,sv.student_id,sv,profid ORDER BY ambc.datekey desc) as rn
from university.course_office_hour_bridge cohd
inner join studentvisit sv on sv.visid_visitorid = cohd.visitorid
)
select
*
from clientbridge WHERE rn=1

Related

For some reason it says a table is not found? - it says the student visit is not found when i start the client bridge query

I wrote a query which uses 2 temp tables. And then joins them into 1. However, I am seeing duplicate records in the student visit temp table. (Query is below). How could this be modified to remove the duplicate records of the visit temp table?
with studentvisit as
(SELECT
--Visit key will allow us to track everything they did within that visit.
distinct visid_visitorid,
uniquevisitkey,
studentaccountid_5,
profid_officenumber_8,
studentvisitstarttime,
room_id_115,
qqq144, --Rewards Accrual Name
qqq145, -- Rewards accrual type what they earned
qqq146, --Rewards Accrual ID
datekey
FROM university.office_hour_details ohd
where DateKey >='2022-10-27' --between '2022-10-26' and '2022-10-27'
and (qqq146 <> '')
)
,clientbridge as (
Select
studentvisit.*, university.course_office_hour_bridge cohd,
cohd.visitorid,
cohd.roomnumber,
cohd.room_id,
cohd.profid,
cohd.student_id,
cohd.datekey,
RANK() over(PARTITION BY adobevisitorid,student_id,profid ORDER BY cohd.datekey desc) as rn
from university.course_office_hour_bridge cohd
inner join studentvisit sv on sv.visid_visitorid = cohd.adobevisitorid
)
select * from clientbridge WHERE rn = 1

Select other table as a column based on datetime in BigQuery [duplicate]

This question already has an answer here:
Full outer join and Group By in BigQuery
(1 answer)
Closed 5 months ago.
I have two tables which has a relationship, but I want to grouping them based on time. Here are the tables
I want select a receipt as a column based on published_at, it must be in between pickup_time and drop_time, so will get this result :
I tried with JOIN, but it seems like select rows with drop_time is NULL only
SELECT
t.source_id AS source_id,
t.pickup_time AS pickup_time,
t.drop_time AS drop_time,
ARRAY_AGG(STRUCT(r.source_id, r.receipt_id, r.published_at) ORDER BY r.published_at LIMIT 1)[SAFE_OFFSET(0)] AS receipt
FROM `my-project-gcp.data_source.trips` AS t
JOIN `my-project-gcp.data_source.receipts` AS r
ON
t.source_id = r.source_id
AND
r.published_at >= t.pickup_time
AND (
r.published_at <= t.drop_time
OR t.drop_time IS NULL
)
GROUP BY source_id, pickup_time, drop_time
and tried with sub-query, got
Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN
SELECT
t.source_id AS source_id,
t.pickup_time AS pickup_time,
t.drop_time AS drop_time,
ARRAY_AGG((
SELECT
STRUCT(r.source_id, r.receipt_id, r.published_at)
FROM `my-project-gcp.data_source.receipts` as r
WHERE
t.source_id = r.source_id
AND
r.published_at >= t.pickup_time
AND (
r.published_at <= t.drop_time
OR t.drop_time IS NULL
)
LIMIT 1
))[SAFE_OFFSET(0)] AS receipt
FROM `my-project-gcp.data_source.trips` as t
GROUP BY source_id, pickup_time, drop_time
Each source_id is a car and only one driver can drive a car at once.
We can partition therefore by that entry.
Your approach is working for small tables. Since there is no unique join key, the cross join fails on large tables.
I present here a solution with union all and look back technique. This is quite fast and works with up to middle large table sizes in the range of a few GB. It prevents the cross join, but is a quite long script.
In the table trips are all drives by the drivers are listed. The receipts list all fines.
We need a unique row identication of each trip to join on this one later on. We use the row number for this, please see table trips_with_rowid.
The table summery_tmp unions three tables. First we load the trips table and add an empty column for the fines. Then we load the trips table again to mark the times were no one was driving the car. Finally, we add the table receipts such that only the columns source_id, pickup_time and fine is filled.
This table is sorted by the pickup_time for each source_id and the table summary. So the fine entries are under the entry of the driver getting the car. The column row_id_new is filled for the fine entries by the value of the row_id of the driver getting the car.
Grouping by row_id_new and filtering unneeded entries does the job.
I changed the second of the entered times (lazyness), thus it differs a bit from your result.
With trips as
(Select 1 source_id ,timestamp("2022-7-19 9:37:47") pickup_time, timestamp("2022-07-19 9:40:00") as drop_time, "jhon" driver_name
Union all Select 1 ,timestamp("2022-7-19 12:00:01"),timestamp("2022-7-19 13:05:11"),"doe"
Union all Select 1 ,timestamp("2022-7-19 14:30:01"),null,"foo"
Union all Select 3 ,timestamp("2022-7-24 08:35:01"),timestamp("2022-7-24 09:15:01"),"bar"
Union all Select 4 ,timestamp("2022-7-25 10:24:01"),timestamp("2022-7-25 11:14:01"),"jhon"
),
receipts as
(Select 1 source_id, 101 receipt_id, timestamp("2022-07-19 9:37:47") published_at,40 price
Union all Select 1,102, timestamp("2022-07-19 13:04:47"),45
Union all Select 1,103, timestamp("2022-07-19 15:23:00"),32
Union all Select 3,301, timestamp("2022-07-24 09:15:47"),45
Union all Select 4,401, timestamp("2022-07-25 11:13:47"),45
Union all Select 5,501, timestamp("2022-07-18 07:12:47"),45
),
trips_with_rowid as
(
SELECT 2*row_number() over (order by source_id,pickup_time) as row_id, * from trips
),
summery_tmp as
(
Select *, null as fines from trips_with_rowid
union all Select row_id+1,source_id,drop_time,null,concat("no driver, last one ",driver_name),null from trips_with_rowid
union all select null,source_id, published_at, null,null, R from receipts R
),
summery as
(
SELECT last_value(row_id ignore nulls) over (partition by source_id order by pickup_time ) row_id_new
,*
from summery_tmp
order by 1,2
)
select source_id,min(pickup_time) pickup_time, min(drop_time) drop_time,
any_value(driver_name) driver_name, array_agg(fines IGNORE NULLS) as fines_Sum
from summery
group by row_id_new,source_id
having fines_sum is not null or (pickup_time is not null and driver_name not like "no driver%")
order by 1,2

Modify my SQL Server query -- returns too many rows sometimes

I need to update the following query so that it only returns one child record (remittance) per parent (claim).
Table Remit_To_Activate contains exactly one date/timestamp per claim, which is what I wanted.
But when I join the full Remittance table to it, since some claims have multiple remittances with the same date/timestamps, the outermost query returns more than 1 row per claim for those claim IDs.
SELECT * FROM REMITTANCE
WHERE BILLED_AMOUNT>0 AND ACTIVE=0
AND REMITTANCE_UUID IN (
SELECT REMITTANCE_UUID FROM Claims_Group2 G2
INNER JOIN Remit_To_Activate t ON (
(t.ClaimID = G2.CLAIM_ID) AND
(t.DATE_OF_LATEST_REGULAR_REMIT = G2.CREATE_DATETIME)
)
where ACTIVE=0 and BILLED_AMOUNT>0
)
I believe the problem would be resolved if I included REMITTANCE_UUID as a column in Remit_To_Activate. That's the REAL issue. This is how I created the Remit_To_Activate table (trying to get the most recent remittance for a claim):
SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
MAX(claim_id) AS ClaimID,
INTO Latest_Remit_To_Activate
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID
ORDER BY Claim_ID
Claims_Group2 contains these fields:
REMITTANCE_UUID,
CLAIM_ID,
BILLED_AMOUNT,
CREATE_DATETIME
Here are the 2 rows that are currently giving me the problem--they're both remitts for the SAME CLAIM, with the SAME TIMESTAMP. I only want one of them in the Remits_To_Activate table, so only ONE remittance will be "activated" per Claim:
enter image description here
You can change your query like this:
SELECT
p.*, latest_remit.DATE_OF_LATEST_REMIT
FROM
Remittance AS p inner join
(SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
claim_id,
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID
ORDER BY Claim_ID) as latest_remit
on latest_remit.claim_id = p.claim_id;
This will give you only one row. Untested (so please run and make changes).
Without having more information on the structure of your database -- especially the structure of Claims_Group2 and REMITTANCE, and the relationship between them, it's not really possible to advise you on how to introduce a remittance UUID into DATE_OF_LATEST_REMIT.
Since you are using SQL Server, however, it is possible to use a window function to introduce a synthetic means to choose among remittances having the same timestamp. For example, it looks like you could approach the problem something like this:
select *
from (
select
r.*,
row_number() over (partition by cg2.claim_id order by cg2.create_datetime desc) as rn
from
remittance r
join claims_group2 cg2
on r.remittance_uuid = cg2.remittance_uuid
where
r.active = 0
and r.billed_amount > 0
and cg2.active = 0
and cg2.billed_amount > 0
) t
where t.rn = 1
Note that that that does not depend on your DATE_OF_LATEST_REMIT table at all, it having been subsumed into the inline view. Note also that this will introduce one extra column into your results, though you could avoid that by enumerating the columns of table remittance in the outer select clause.
It also seems odd to be filtering on two sets of active and billed_amount columns, but that appears to follow from what you were doing in your original queries. In that vein, I urge you to check the results carefully, as lifting the filter conditions on cg2 columns up to the level of the join to remittance yields a result that may return rows that the original query did not (but never more than one per claim_id).
A co-worker offered me this elegant demonstration of a solution. I'd never used "over" or "partition" before. Works great! Thank you John and Gaurasvsa for your input.
if OBJECT_ID('tempdb..#t') is not null
drop table #t
select *, ROW_NUMBER() over (partition by CLAIM_ID order by CLAIM_ID) as ROW_NUM
into #t
from
(
select '2018-08-15 13:07:50.933' as CREATE_DATE, 1 as CLAIM_ID, NEWID() as
REMIT_UUID
union select '2018-08-15 13:07:50.933', 1, NEWID()
union select '2017-12-31 10:00:00.000', 2, NEWID()
) x
select *
from #t
order by CLAIM_ID, ROW_NUM
select CREATE_DATE, MAX(CLAIM_ID), MAX(REMIT_UUID)
from #t
where ROW_NUM = 1
group by CREATE_DATE

Using a stored procedure in Teradata to build a summarial history table

I am using Terdata SQL Assistant connected to an enterprise DW. I have written the query below to show an inventory of outstanding items as of a specific point in time. The table referenced loads and stores new records as changes are made to their state by load date (and does not delete historical records). The output of my query is 1 row for the specified date. Can I create a stored procedure or recursive query of some sort to build a history of these summary rows (with 1 new row per day)? I have not used such functions in the past; links to pertinent previously answered questions or suggestions on how I could get on the right track in researching other possible solutions are totally fine if applicable; just trying to bridge this gap in my knowledge.
SELECT
'2017-10-02' as Dt
,COUNT(DISTINCT A.RECORD_NBR) as Pending_Records
,SUM(A.PAY_AMT) AS Total_Pending_Payments
FROM DB.RECORD_HISTORY A
INNER JOIN
(SELECT MAX(LOAD_DT) AS LOAD_DT
,RECORD_NBR
FROM DB.RECORD_HISTORY
WHERE LOAD_DT <= '2017-10-02'
GROUP BY RECORD_NBR
) B
ON A.RECORD_NBR = B.RECORD_NBR
AND A.LOAD_DT = B.LOAD_DT
WHERE
A.RECORD_ORDER =1 AND Final_DT Is Null
GROUP BY Dt
ORDER BY 1 desc
Here is my interpretation of your query:
For the most recent load_dt (up until 2017-10-02) for record_order #1,
return
1) the number of different pending records
2) the total amount of pending payments
Is this correct? If you're looking for this info, but one row for each "Load_Dt", you just need to remove that INNER JOIN:
SELECT
load_Dt,
COUNT(DISTINCT record_nbr) AS Pending_Records,
SUM(pay_amt) AS Total_Pending_Payments
FROM DB.record_history
WHERE record_order = 1
AND final_Dt IS NULL
GROUP BY load_Dt
ORDER BY 1 DESC
If you want to get the summary info per record_order, just add record_order as a grouping column:
SELECT
load_Dt,
record_order,
COUNT(DISTINCT record_nbr) AS Pending_Records,
SUM(pay_amt) AS Total_Pending_Payments
FROM DB.record_history
WHERE final_Dt IS NULL
GROUP BY load_Dt, record_order
ORDER BY 1,2 DESC
If you want to get one row per day (if there are calendar days with no corresponding "load_dt" days), then you can SELECT from the sys_calendar.calendar view and LEFT JOIN the query above on the "load_dt" field:
SELECT cal.calendar_date, src.Pending_Records, src.Total_Pending_Payments
FROM sys_calendar.calendar cal
LEFT JOIN (
SELECT
load_Dt,
COUNT(DISTINCT record_nbr) AS Pending_Records,
SUM(pay_amt) AS Total_Pending_Payments
FROM DB.record_history
WHERE record_order = 1
AND final_Dt IS NULL
GROUP BY load_Dt
) src ON cal.calendar_date = src.load_Dt
WHERE cal.calendar_date BETWEEN <start_date> AND <end_date>
ORDER BY 1 DESC
I don't have access to a TD system, so you may get syntax errors. Let me know if that works or you're looking for something else.

SQL - Returning CTE with Top 1

I am trying to return a set of results and decided to try my luck with CTE, the first table "Vendor", has a list of references, the second table "TVView", has ticket numbers that were created using a reference from the "Vendor" table. There may be one or more tickets using the same ticket number depending on the state of that ticket and I am wanting to return the last entry for each ticket found in "TVView" that matches a selected reference from "Vendor". Also, the "TVView" table has a seed field that is incremented.
I got this to return the right amount of entries (meaning not showing the duplicate tickets but only once) but I cannot figure out how to add an additional layer to go back through and select the last entry for that ticket and return some other fields. I can figure out how to sum which is actually easy, but I really need the Top 1 of each ticket entry in "TVView" regardless if its a duplicate or not while returning all references from "Vendor". Would be nice if SQL supported "Last"
How do you do that?
Here is what I have done so far:
with cteTickets as (
Select s.Mth2, c.Ticket, c.PyRt from Vendor s
Inner join
TVView c on c.Mth1 = s.Mth1 and c.Vendor = s.Vendor
)
Select Mth2, Ticket, PayRt from cteTickets
Where cteTickets.Vendor >='20'
and cteTickets.Vendor <='40'
and cteTickets.Mth2 ='8/15/2014'
Group by cteTickets.Ticket
order by cteTickets.Ticket
Several rdbms's that support Common Table Expressions (CTE) that I am aware of also support analytic functions, including the very useful ROW_NUMBER(), so the following should work in Oracle, TSQL (MSSQL/Sybase), DB2, PostgreSQL.
In the suggestions the intention is to return just the most recent entry for each ticket found in TVView. This is done by using ROW_NUMBER() which is PARTITIONED BY Ticket that instructs row_number to recommence numbering for each change of the Ticket value. The subsequent ORDER BY Mth1 DESC is used to determine which record within each partition is assigned 1, here it will be the most recent date.
The output of row_number() needs to be referenced by a column alias, so using it in a CTE or derived table permits selection of just the most recent records by RN = 1 which you will see used in both options below:
-- using a CTE
WITH
TVLatest
AS (
SELECT
* -- specify the fields
, ROW_NUMBER() OVER (PARTITION BY Ticket
ORDER BY Mth1 DESC) AS RN
FROM TVView
)
SELECT
Mth2
, Ticket
, PayRt
FROM Vendor v
INNER JOIN TVLatest l ON v.Mth1 = l.Mth1
AND v.Vendor = l.Vendor
AND l.RN = 1
WHERE v.Vendor >= '20'
AND v <= '40'
AND v.Mth2 = '2014-08-15'
ORDER BY
v.Ticket
;
-- using a derived table instead
SELECT
Mth2
, Ticket
, PayRt
FROM Vendor v
INNER JOIN (
SELECT
* -- specify the fields
, ROW_NUMBER() OVER (PARTITION BY Ticket
ORDER BY Mth1 DESC) AS RN
FROM TVView
) TVLatest l ON v.Mth1 = l.Mth1
AND v.Vendor = l.Vendor
AND l.RN = 1
WHERE v.Vendor >= '20'
AND v <= '40'
AND v.Mth2 = '2014-08-15'
ORDER BY
v.Ticket
;
please note: "SELECT *" is a convenience or used as an abbreviation if full details are unknown. The queries above may not operate without correctly specifying the field list (eg. 'as is' they would fail in Oracle).