I've created an Oracle SQL query that joins about five tables, and I use it in the FROM clause of an Oracle Form. The problem is that some records are duplicated, and I only want to show one line in the form for each, not the duplicates. I've tried GROUP BY and PARTITION BY, but adding either makes the query too slow.
I'm now thinking of doing this in a procedure and bringing back just one of the duplicates if any occur. Would it be best to bring back an Oracle table of records from the database into the form? What would be the best way to look for a duplicate in an Oracle PL/SQL loop?
I've updated the question and added the full query below to explain it better. The surr_id, the first column in the select statement below, is unique, but what I want to show in the Oracle form is the production number along with the other columns, which are not unique. A production number can be duplicated, and sometimes three records share the same production number. Hope this helps. I was thinking of putting this in a loop, grabbing the first production number, and then only bringing back a record when the production number changes.
select x.surr_id ,
x.supplier_name as supplier ,
x.broadcaster_name as broadcaster ,
ptle.title as production_title ,
x.production_number as production_number ,
stle.title as series_title ,
x.production_source as supplied_source_ind ,
x.third_party_group_id ,
x.bro_broadcast_by_tp_surr_id ,
x.station_id from (select usage_headers.surr_id as surr_id ,
broad_supp.supplier_name as supplier_name ,
broad_supp.broadcaster_name as broadcaster_name ,
usage_headers.production_number as production_number ,
productions.production_source as production_source ,
broad_supp.station_id as station_id ,
usage_headers.prod_exploitation_cre_surr_id as prod_exploitation_cre_surr_id ,
usage_headers.bro_broadcast_by_tp_surr_id as bro_broadcast_by_tp_surr_id ,
productions.cre_surr_id as cre_surr_id ,
productions.prod_series_cre_surr_id as prod_series_cre_surr_id ,
broad_supp.third_party_group_id as third_party_group_id
from usage_headers, productions, (SELECT /*+ index (bro bro_pk) */
third_party.surr_id AS THIRD_PARTY_SURR_ID,
third_party.supplier_group_id AS THIRD_PARTY_GROUP_ID,
third_party.dn_root_tp_surr_id AS THIRD_PARTY_ROOT_ID,
third_party.supplier_name, bro.station_id AS STATION_ID,
bro.dn_tp_name AS BROADCASTER_NAME FROM ( SELECT tp.surr_id,
tp.name AS supplier_name,
tp.tp_surr_id AS supplier_group_id,
tp.dn_root_tp_surr_id FROM third_parties tp
CONNECT BY PRIOR tp.surr_id = tp.tp_surr_id
START WITH tp.surr_id IN (4251, 4247, 4237, 4034, 10157, 14362, 9834)) third_party
JOIN broadcasters bro ON (third_party.surr_id = bro.tp_surr_id)) broad_supp
where broad_supp.THIRD_PARTY_SURR_ID = usage_headers.bro_broadcast_by_tp_surr_id
AND usage_headers.prod_exploitation_cre_surr_id = productions.cre_surr_id
and usage_headers.prod_exploitation_cre_surr_id IS NOT NULL
and usage_headers.right_type in ('M','B')
AND usage_headers.udg_surr_id IS NOT NULL
AND NVL(usage_headers.dn_uls_usage_status,'3') NOT IN ('9', '11')
AND productions.production_source <> 'AP') x
LEFT OUTER JOIN titles ptle ON ( ptle.cre_surr_id = x.cre_surr_id AND ptle.tt_code = 'R')
LEFT OUTER JOIN titles stle ON ( stle.cre_surr_id = x.prod_series_cre_surr_id AND stle.tt_code = 'R')
thanks Guys in Advance
If you're getting records that are entirely duplicated then just adding a DISTINCT clause, so your SELECT becomes SELECT DISTINCT will ensure that only one of the records is returned. If even one column is different though then this won't work.
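To illustrate that last point with the situation in the question, where surr_id is unique but production_number repeats, here is a minimal sketch using Python's built-in sqlite3 module. The table usage_rows and its three rows are made up for the example and are not the real schema. DISTINCT removes nothing because surr_id differs on every row, while keeping only the lowest surr_id per production_number returns one line per production.

```python
import sqlite3

# Hypothetical stand-in for the query result: surr_id is unique,
# production_number is not.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE usage_rows ("
    "surr_id INTEGER PRIMARY KEY, production_number TEXT, supplier TEXT)"
)
conn.executemany(
    "INSERT INTO usage_rows VALUES (?, ?, ?)",
    [(1, "P100", "Acme"), (2, "P100", "Acme"), (3, "P200", "Beta")],
)

# DISTINCT compares whole rows, so the differing surr_id keeps every row.
distinct_rows = conn.execute(
    "SELECT DISTINCT surr_id, production_number, supplier FROM usage_rows"
).fetchall()

# Keep only the row with the smallest surr_id for each production_number.
one_per_production = conn.execute(
    """
    SELECT surr_id, production_number, supplier
    FROM usage_rows
    WHERE surr_id IN (
        SELECT MIN(surr_id) FROM usage_rows GROUP BY production_number
    )
    ORDER BY surr_id
    """
).fetchall()

print(len(distinct_rows))
print(one_per_production)
```

Which of the duplicate rows to keep (lowest surr_id here) is an arbitrary choice made for the sketch; the question does not say which one should win.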
Related
My question is definitely going to be a little different, so I hope I'm still adhering to the Stack Overflow question etiquette. With that in mind, I'll get straight to the point.
Essentially, since I am still learning SQL I was looking at examples of scheduled queries in GCP and came across something and I wanted to see if I understand what's going on. So I took the query and wrote some comments explaining what I think the lines in the query are doing. The context in the code itself is irrelevant, I'm more curious if I'm correctly understanding what each of the clauses is doing.
Would anyone be able to tell me if I am interpreting it correctly or if I misunderstood some stuff, based on my comments? The code and comments are below. Note that the comments come first and the queries I'm commenting on follow directly after.
-- Create temporary table with the subquery below via the WITH () clause
-- Table contains session date, which webpage, total sessions, total sessions with a logout, and total clicks
-- The data in this temporary table is coming from the `gcp-project-223467.web.top_level` table in BigQuery
-- The columns correspond to dates 01/01/2022 & onwards, and exclude the 'Home' and 'Team' pages
-- The resulting data in the temp table is grouped by date & page type (first and second columns of the resulting temp table)
WITH logins AS (
SELECT
session_date as date,
website_page as page,
SUM(sessions) AS sessions,
SUM(sessions_with_logout) AS logouts,
SUM(clicks) AS clicks
FROM `gcp-project-223467.web.top_level`
WHERE DATE_session >= "2022-01-01"
AND website_page NOT IN ('Home','Team')
AND clicks > 0
GROUP BY 1, 2
)
-- Select the data from the above subquery (via SELECT logins.*)
-- Left join another temp table with data coming from `ingka-web-analytics-prod.web_data.transactions` in BigQuery
-- Left join is being done according to the logins & login_days date_hit AND logins & login_days ´logins_web´ columns.
-- The specific data taken from the aforementioned BQ table is aggregated and filtered via CASE WHEN - THEN statements
-- Further conditions are specified via the WHERE statements
-- The resulting temporary table in the subquery under LEFT JOIN is named login_days.
-- The columns in the select statement before the left join (web logins, mobile logins etc)
-- are from the temporary table in the select statement under the left join statement
SELECT
logins.*,
logins_web,
mobile_logins,
logins_ios,
logins_android,
logins_final
FROM logins
LEFT JOIN (
SELECT
date_hit as date,
website_page as page,
SUM(CASE WHEN login_type = 'web' THEN SAFE_CAST(count_logins_final AS INT64) END ) AS logins_web,
COUNT(DISTINCT CASE WHEN login_type = 'mobile' THEN login_id END ) AS mobile_logins,
SUM(CASE WHEN login_type = 'ipad' THEN SAFE_CAST(count_logins_final AS INT64) END ) AS logins_ios,
COUNT(DISTINCT CASE WHEN login_type = 'android' THEN login_id END ) AS logins_android,
COUNT(DISTINCT login_id) AS logins_final,
FROM `gcp-project-223467.web.login_data`
WHERE date_hit >= "2022-01-01" AND website_page NOT IN ('Home','Team')
AND count_logins_final != 'NaN'
AND count_logins_final NOT LIKE '%,%'
AND count_logins_final > '0'
AND website_platform != 'ibes'
AND login_type = 'Successful'
GROUP BY 1, 2
)login_days
ON logins.date = login_days.date AND logins.page = login_days.page
WHERE sessions_with_logout > 0
I am getting the user's data by UUID: WHERE empl_user_pub_uuid = 'e2bb39f1f28011eab66c63cb4d9c7a34'.
Since I don't want to make an additional query to fetch additional user data I'm trying to sneak them through the INSERT.
WITH _u AS (
SELECT
eu.empl_user_pvt_uuid,
ee.email,
ep.name_first
FROM employees.users eu
LEFT JOIN (
SELECT DISTINCT ON (ee.empl_user_pvt_uuid)
ee.empl_user_pvt_uuid,
ee.email
FROM employees.emails ee
ORDER BY ee.empl_user_pvt_uuid, ee.t DESC
) ee ON eu.empl_user_pvt_uuid = ee.empl_user_pvt_uuid
LEFT JOIN (
SELECT DISTINCT ON (ep.empl_user_pvt_uuid)
ep.empl_user_pvt_uuid,
ep.name_first
FROM employees.profiles ep
) ep ON eu.empl_user_pvt_uuid = ep.empl_user_pvt_uuid
WHERE empl_user_pub_uuid = 'e2bb39f1f28011eab66c63cb4d9c7a34'
)
INSERT INTO employees.password_resets (empl_pwd_reset_uuid, empl_user_pvt_uuid, t_valid, for_empl_user_pvt_uuid, token)
SELECT 'f70a0346-a077-11eb-bd1a-aaaaaaaaaaaa', '6efc2b7a-f27e-11ea-b66c-de1c405de048', '2021-04-18 19:57:47.111365', _u.empl_user_pvt_uuid, '19d65aea-7c4a-41bc-b580-9d047f1503e6'
FROM _u
RETURNING _u.empl_user_pvt_uuid, _u.email, _u.name_first;
However I get:
[42P01] ERROR: missing FROM-clause entry for table "_u"
Position: 994
What am I doing wrong?
It's true, as has been noted, that the RETURNING clause of an INSERT only sees the inserted row. More specifically, quoting the manual here:
The optional RETURNING clause causes INSERT to compute and return
value(s) based on each row actually inserted (or updated, if an ON CONFLICT DO UPDATE clause was used). This is primarily useful for
obtaining values that were supplied by defaults, such as a serial
sequence number. However, any expression using the table's columns
is allowed. The syntax of the RETURNING list is identical to that
of the output list of SELECT. Only rows that were successfully
inserted or updated will be returned. [...]
Bold emphasis mine.
So nothing keeps you from adding a correlated subquery to the RETURNING list:
INSERT INTO employees.password_resets AS ep
(empl_pwd_reset_uuid , empl_user_pvt_uuid , t_valid , for_empl_user_pvt_uuid, token)
SELECT 'f70a0346-a077-11eb-bd1a-aaaaaaaaaaaa', '6efc2b7a-f27e-11ea-b66c-de1c405de048', '2021-04-18 19:57:47.111365', eu.empl_user_pvt_uuid , '19d65aea-7c4a-41bc-b580-9d047f1503e6'
FROM employees.users eu
WHERE empl_user_pub_uuid = 'e2bb39f1f28011eab66c63cb4d9c7a34'
RETURNING for_empl_user_pvt_uuid AS empl_user_pvt_uuid -- alias to meet your org. query
, (SELECT email
FROM employees.emails
WHERE empl_user_pvt_uuid = ep.empl_user_pvt_uuid
ORDER BY t DESC -- NULLS LAST ?
LIMIT 1
) AS email
, (SELECT name_first
FROM employees.profiles
WHERE empl_user_pvt_uuid = ep.empl_user_pvt_uuid
-- ORDER BY ???
LIMIT 1
) AS name_first;
This is also much more efficient than the query you had (or what was proposed) for multiple reasons.
We don't run the subqueries ee and ep over all rows of the tables employees.emails and employees.profiles. That would be efficient if we needed major parts of those tables, but we only fetch a single row of interest from each. With appropriate indexes, a correlated subquery is much more efficient for this. See:
Efficient query to get greatest value per group from big table
Two SQL LEFT JOINS produce incorrect result
Select first row in each GROUP BY group?
Optimize GROUP BY query to retrieve latest row per user
We don't add the overhead of one or more CTEs.
We only fetch additional data after a successful INSERT, so no time is wasted if the insert didn't go through for any reason. (See quote at the top!)
Plus, possibly most important, this is correct. We use data from the row that has actually been inserted - after inserting it. (See quote at the top!) After possible default values, triggers or rules have been applied. We can be certain that what we see is what's actually in the database (currently).
You have no ORDER BY for profiles.name_first. That's not right. Either there is only one qualifying row, then we need no DISTINCT nor LIMIT 1. Or there can be multiple, then we also need a deterministic ORDER BY to get a deterministic result.
And if emails.t can be NULL, you'll want to add NULLS LAST in the ORDER BY clause. See:
Sort by column ASC, but NULL values first?
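The "one deterministic row per user" idea can be tried out in isolation. The following is only a sketch with Python's built-in sqlite3 module, not Postgres, and the tables users and emails with the column user_id are simplified, hypothetical stand-ins for the real schema. It shows a correlated subquery with ORDER BY t DESC LIMIT 1 picking the latest email, and yielding NULL when a user has no email at all.

```python
import sqlite3

# Simplified, hypothetical tables; the real schema has more columns.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE users (user_id INTEGER PRIMARY KEY);
    CREATE TABLE emails (user_id INTEGER, email TEXT, t TEXT);
    INSERT INTO users VALUES (1), (2);
    INSERT INTO emails VALUES
        (1, 'old@example.com', '2021-01-01'),
        (1, 'new@example.com', '2021-03-01');
    """
)

# The correlated subquery runs once per user row and returns at most one
# email; ORDER BY makes the choice deterministic.
rows = conn.execute(
    """
    SELECT u.user_id,
           (SELECT e.email
            FROM emails e
            WHERE e.user_id = u.user_id
            ORDER BY e.t DESC
            LIMIT 1) AS email
    FROM users u
    ORDER BY u.user_id
    """
).fetchall()

print(rows)
```

The sample data deliberately contains no NULL timestamps, because default NULL ordering differs between database engines.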
Indexes
Ideally, you have these multicolumn indexes (with columns in this order):
users (empl_user_pub_uuid, empl_user_pvt_uuid)
emails (empl_user_pvt_uuid, email)
profiles (empl_user_pvt_uuid, name_first)
Then, if the tables are vacuumed enough, you get three index-only scans and the whole operation is lightning fast.
Get pre-INSERT values?
If you really want that (which I don't think you do), consider:
Return pre-UPDATE column values using SQL only
According to the Postgres docs, section 6.4, Returning Data From Modified Rows:
In an INSERT, the data available to RETURNING is the row as it was
inserted.
But here you are trying to return columns from the source table instead of the destination. RETURNING cannot return columns from the _u table, only from the inserted row of employees.password_resets. You can, however, put the insert in a second CTE and select the data from the source table as well. Please try the approach below.
WITH _u AS (
SELECT
eu.empl_user_pvt_uuid,
ee.email,
ep.name_first
FROM employees.users eu
LEFT JOIN (
SELECT DISTINCT ON (ee.empl_user_pvt_uuid)
ee.empl_user_pvt_uuid,
ee.email
FROM employees.emails ee
ORDER BY ee.empl_user_pvt_uuid, ee.t DESC
) ee ON eu.empl_user_pvt_uuid = ee.empl_user_pvt_uuid
LEFT JOIN (
SELECT DISTINCT ON (ep.empl_user_pvt_uuid)
ep.empl_user_pvt_uuid,
ep.name_first
FROM employees.profiles ep
) ep ON eu.empl_user_pvt_uuid = ep.empl_user_pvt_uuid
WHERE empl_user_pub_uuid = 'e2bb39f1f28011eab66c63cb4d9c7a34'
), I as
(
INSERT INTO employees.password_resets (empl_pwd_reset_uuid, empl_user_pvt_uuid, t_valid, for_empl_user_pvt_uuid, token)
SELECT 'f70a0346-a077-11eb-bd1a-aaaaaaaaaaaa', '6efc2b7a-f27e-11ea-b66c-de1c405de048', '2021-04-18 19:57:47.111365', _u.empl_user_pvt_uuid, '19d65aea-7c4a-41bc-b580-9d047f1503e6'
FROM _u
)
select _u.empl_user_pvt_uuid, _u.email, _u.name_first from _u
I have 2 tables in SQL Server: Table 1 and Table 2.
Table 1 has 500 records and Table 2 has millions of records.
Table 2 may or may not contain the 500 records of Table 1.
I have to compare Table 1 and Table 2, but the result should give me only the records of Table 1 whose data has changed in Table 2. That means the result should contain at most 500 records.
I don't have a primary key, but the columns in the 2 tables are the same. I have written the following query, but it takes a long time to process and I get a timeout exception. Please help.
With CTE_DUPLICATE(OLD_FIRSTNAME ,New_FirstName,
OLD_LASTNAME ,New_LastName,
OLD_MINAME ,New_MIName ,
OLD_FAMILYID,NEW_FAMILYID,ROWNUMBER)
as (
Select distinct
OLD.FIRST_NAME AS 'OLD_FIRSTNAME' ,New.First_Name AS 'NEW_FIRSTNAME',
OLD.LAST_NAME AS 'OLD_LASTNAME',New.Last_Name AS 'NEW_LASTNAME',
OLD.MI_NAME AS 'OLD_MINAME',New.MI_Name AS 'NEW_MINAME',
OLD.FAMILY_ID AS 'OLD_FAMILYID',NEW.FAMILY_ID AS 'NEW_FAMILYID',
row_number()over(partition by OLD.FIRST_NAME ,New.First_Name,
OLD.LAST_NAME ,New.Last_Name,
OLD.MI_NAME ,New.MI_Name ,
OLD.FAMILY_ID,NEW.FAMILY_ID
order by OLD.FIRST_NAME ,New.First_Name,
OLD.LAST_NAME ,New.Last_Name,
OLD.MI_NAME ,New.MI_Name ,
OLD.FAMILY_ID,NEW.FAMILY_ID )as rank
From EEMSCDBStatic OLD,EEMS_VIPFILE New where
OLD.MPID <> New.MPID and old.FIRST_NAME <> New.First_Name
and OLD.LAST_NAME <> New.Last_Name and OLD.MI_NAME <> New.MI_Name
and old.Family_Id<>New.Family_id
)
sELECT OLD_FIRSTNAME ,New_FirstName,
OLD_LASTNAME ,New_LastName,
OLD_MINAME ,New_MIName ,
OLD_FAMILYID,NEW_FAMILYID FROM CTE_DUPLICATE where rownumber=1
I think the main problem here is that your query is forcing the DB to fully multiply your tables, which means processing ~500M combinations. It happens because you're connecting any record from T1 with any record from T2 that has at least one different value, including MPID that looks like the unique identifier that must be used to connect records.
If MPID is really the column that identifies records in both tables then your query should have a bit different structure:
SELECT old.FIRST_NAME, new.First_Name,
old.LAST_NAME, new.Last_Name,
old.MI_NAME, new.MI_Name,
old.FAMILY_ID, new.FAMILY_ID
FROM EEMSCDBStatic old
INNER JOIN EEMS_VIPFILE new ON old.MPID = new.MPID
WHERE old.FIRST_NAME <> new.First_Name
AND old.LAST_NAME <> new.Last_Name
AND old.MI_NAME <> new.MI_Name
AND old.FAMILY_ID <> new.FAMILY_ID
ORDER BY old.FIRST_NAME, new.First_Name,
old.LAST_NAME, new.Last_Name,
old.MI_NAME, new.MI_Name,
old.FAMILY_ID, new.FAMILY_ID
A couple of other thoughts:
If you're looking for any change in a record (even if only one column has different values), you should use ORs in the WHERE clause, not ANDs. Now you're only looking for records that changed values in all columns. For instance, you'll fail to find a person who changed his or her first name but decided to keep last name.
You should obviously consider indexing your tables if it's possible.
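The difference between AND and OR in that WHERE clause is easy to see on a toy data set. This is a minimal sketch with Python's built-in sqlite3 module; the tables old_people and new_people and their three rows are invented for illustration, and it assumes MPID really identifies a record in both tables.

```python
import sqlite3

# Invented sample data: person 1 is unchanged, person 2 changed only the
# first name, person 3 changed both names.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE old_people (mpid INTEGER, first_name TEXT, last_name TEXT);
    CREATE TABLE new_people (mpid INTEGER, first_name TEXT, last_name TEXT);
    INSERT INTO old_people VALUES (1, 'Ann', 'Lee'), (2, 'Bob', 'Ray'), (3, 'Cy', 'Fox');
    INSERT INTO new_people VALUES (1, 'Ann', 'Lee'), (2, 'Rob', 'Ray'), (3, 'Sy', 'Wolf');
    """
)


def changed_ids(operator):
    """Return the mpids flagged as changed when the column comparisons
    are combined with the given operator ('AND' or 'OR')."""
    sql = f"""
        SELECT o.mpid
        FROM old_people o
        JOIN new_people n ON o.mpid = n.mpid
        WHERE o.first_name <> n.first_name {operator} o.last_name <> n.last_name
        ORDER BY o.mpid
    """
    return [row[0] for row in conn.execute(sql)]


all_columns_changed = changed_ids("AND")  # only rows where every column differs
any_column_changed = changed_ids("OR")    # rows where at least one column differs

print(all_columns_changed, any_column_changed)
```

Note that a <> comparison involving NULL is never true, so rows with NULL in a compared column would not be reported by either variant without extra handling.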
Surely it is pointless to use DISTINCT keyword together with ROWNUMBER.
See this sql query distinct with Row_Number.
You are doing a CROSS JOIN, which is terribly big in your case.
Perhaps in this condition
where OLD.MPID <> New.MPID and old.FIRST_NAME <> New.First_Name and ...
you wanted to have OR instead of AND?
It is also not entirely clear why you use ROWNUMBER at all; perhaps to find the best match.
All this is because, as @Shnugo correctly remarked, the logic behind your comparison is faulty: you must have some logic defined that would JOIN the tables (for example, first and second name must be the same).
I'm using parts of a replication script written by a well known blogger. I want to make the part I listed below add 1 more column from a totally different table that only holds 1 row. Basically that table with a single row has a site name on it, and I want that site name from that table to populate as part of this INSERT INTO.
I know SQL 2005 introduced OUTER APPLY, but I am not sure if that is the best method to go with. Any suggestions are welcome. Thanks.
Insert Into dbo.dba_replicationMonitor
(
monitorDate
, publicationName
, publicationDB
, iteration
, tracer_id
, distributor_latency
, subscriber
, subscriber_db
, subscriber_latency
, overall_latency
, SiteNameFromSiteInfoTable --Need to add this
)
Select
@currentDateTime
, @publicationToTest
, @publicationDB
, iteration
, tracer_id
, IsNull(distributor_latency, 0)
, subscriber
, subscriber_db
, IsNull(subscriber_latency, 0)
, IsNull(overall_latency,
IsNull(distributor_latency, 0) + IsNull(subscriber_latency, 0
)
, sitename = 'SELECT sitename FROM tblSiteInfo' --need this query to insert as well
)
From #tokenResults;
I was thinking of a variable but I don't think passing the variable will be enough. Any help is greatly appreciated. Thanks.
You can just join to the second table as normal. If there's only one row in this other table (and will only ever be one row), it's not going to double your results. So, like this:
INSERT INTO dbo.dba_replicationMonitor (_column_list_)
SELECT _#TokenResultsColumns_, b.sitename
FROM #TokenResults as a
JOIN tblSiteInfo as b
ON 1 = 1
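The claim that a one-row table does not multiply the results can be checked with a small sketch. It uses Python's built-in sqlite3 module rather than SQL Server, and the tables token_results, site_info and replication_monitor are cut-down, hypothetical versions of the ones in the question.

```python
import sqlite3

# Cut-down, hypothetical versions of the tables from the question.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE token_results (iteration INTEGER, tracer_id INTEGER);
    CREATE TABLE site_info (sitename TEXT);
    CREATE TABLE replication_monitor (
        iteration INTEGER, tracer_id INTEGER, sitename TEXT
    );
    INSERT INTO token_results VALUES (1, 101), (2, 102);
    INSERT INTO site_info VALUES ('HQ');

    -- Joining on an always-true condition pairs every token row with the
    -- single site row, so the row count stays the same.
    INSERT INTO replication_monitor (iteration, tracer_id, sitename)
    SELECT a.iteration, a.tracer_id, b.sitename
    FROM token_results AS a
    JOIN site_info AS b ON 1 = 1;
    """
)

rows = conn.execute(
    "SELECT iteration, tracer_id, sitename "
    "FROM replication_monitor ORDER BY iteration"
).fetchall()

print(rows)
```

One thing to keep in mind with this approach: if the single-row table were ever empty, the inner join would return no rows and nothing would be inserted.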
I'm trying to query a table that has a column containing raw XML data. In the query I select columns that hold plain data (int, varchar, etc.), but I also query the XML column. From the XML column I want to grab a value within the XML and return null if it doesn't exist. I have the following query that almost works but returns duplicates. Need help!
My root XML element is CodeFiveReport; within it is Properties, and within that is Property, which has a serial number. I'm trying to grab the serial number if it exists and display it.
select Distinct rs.Id
, rs.CaseNumber
, rs.StartDate
, rs.[Status]
, rs.PatrolDistrict
, rs.PrimaryUnit
, rs.Location
, rs.ReportType
, rs.IncidentType
, rs.UserId
, rs.UnitId
, rs.UnitCode
, rs.IsLocked
, rs.LockedBy
, rs.AgencyId
, rl.ReportName
, rl.ParentId
, TempTable.Party.value('(SerialNumber/text())[1]', 'varchar(50)') as SerialNumber
from dbo.vw_ReportSummary rs OUTER APPLY Report.nodes('/CodeFiveReport/Properties/Property') AS TempTable(Party)
left outer join dbo.ReportLookup rl on rs.Id = rl.Id
where rs.[Status] = 'Approved'
order by rs.Id
Well, I was able to solve the problem
I changed Report.nodes('/CodeFiveReport/Properties/Property') to Report.nodes('/CodeFiveReport/Properties')
In turn I also changed my TempTable query to: TempTable.Party.value('(Property/SerialNumber/text())[1]', 'varchar(50)') as SerialNumber and that seemed to fix the duplicates.
Thanks for your help everybody.
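Why the path change removes the duplicates can be mimicked outside SQL Server. The sketch below uses Python's xml.etree.ElementTree as a rough analogy for nodes() and value(); the sample document with two Property elements is an assumption about what the real reports look like. Shredding at the Property level yields one entry per Property, while shredding at the Properties level yields one entry per report and takes only the first serial number.

```python
import xml.etree.ElementTree as ET

# Assumed shape of a report: one Properties element with two Property children.
report = ET.fromstring(
    """
<CodeFiveReport>
  <Properties>
    <Property><SerialNumber>SN-1</SerialNumber></Property>
    <Property><SerialNumber>SN-2</SerialNumber></Property>
  </Properties>
</CodeFiveReport>
"""
)

# One result per Property node, comparable to
# nodes('/CodeFiveReport/Properties/Property'): two entries for one report.
per_property = [
    node.findtext("SerialNumber")
    for node in report.findall("Properties/Property")
]

# One result per Properties node, comparable to
# nodes('/CodeFiveReport/Properties'): a single entry, first serial only.
per_properties = [
    node.findtext("Property/SerialNumber")
    for node in report.findall("Properties")
]

# A report without any Property gives None, comparable to NULL.
empty = ET.fromstring("<CodeFiveReport><Properties/></CodeFiveReport>")
missing = empty.find("Properties").findtext("Property/SerialNumber")

print(per_property, per_properties, missing)
```

The trade-off is that with the changed path only the first serial number of each report is returned, so any further Property entries are no longer visible in the result.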
Hard to say without knowing your exact database schema. Assuming that this is T-SQL: have a look at CTEs (common table expressions) and split your statement into two steps. That usually makes these kinds of statements much simpler and often more efficient.