SQL Server Comparing 2 tables having Millions of Records

I have two tables in SQL Server: Table 1 and Table 2.
Table 1 has 500 records and Table 2 has millions of records.
Table 2 may or may not contain the 500 records of Table 1.
I have to compare Table 1 with Table 2, but the result should contain only the records of Table 1 whose data has changed in Table 2. That means the result should have at most 500 rows.
I don't have a primary key, but the columns in the two tables are the same. I have written the following query, but it takes a very long time to run and I get a timeout exception. Please help.
With CTE_DUPLICATE(OLD_FIRSTNAME ,New_FirstName,
OLD_LASTNAME ,New_LastName,
OLD_MINAME ,New_MIName ,
OLD_FAMILYID,NEW_FAMILYID,ROWNUMBER)
as (
Select distinct
OLD.FIRST_NAME AS 'OLD_FIRSTNAME' ,New.First_Name AS 'NEW_FIRSTNAME',
OLD.LAST_NAME AS 'OLD_LASTNAME',New.Last_Name AS 'NEW_LASTNAME',
OLD.MI_NAME AS 'OLD_MINAME',New.MI_Name AS 'NEW_MINAME',
OLD.FAMILY_ID AS 'OLD_FAMILYID',NEW.FAMILY_ID AS 'NEW_FAMILYID',
row_number()over(partition by OLD.FIRST_NAME ,New.First_Name,
OLD.LAST_NAME ,New.Last_Name,
OLD.MI_NAME ,New.MI_Name ,
OLD.FAMILY_ID,NEW.FAMILY_ID
order by OLD.FIRST_NAME ,New.First_Name,
OLD.LAST_NAME ,New.Last_Name,
OLD.MI_NAME ,New.MI_Name ,
OLD.FAMILY_ID,NEW.FAMILY_ID )as rank
From EEMSCDBStatic OLD,EEMS_VIPFILE New where
OLD.MPID <> New.MPID and old.FIRST_NAME <> New.First_Name
and OLD.LAST_NAME <> New.Last_Name and OLD.MI_NAME <> New.MI_Name
and old.Family_Id<>New.Family_id
)
sELECT OLD_FIRSTNAME ,New_FirstName,
OLD_LASTNAME ,New_LastName,
OLD_MINAME ,New_MIName ,
OLD_FAMILYID,NEW_FAMILYID FROM CTE_DUPLICATE where rownumber=1

I think the main problem here is that your query forces the DB to fully multiply your tables, which means processing at least ~500M combinations. It happens because you're connecting every record from T1 with every record from T2 whose values differ, including MPID, which looks like the unique identifier that should be used to connect records.
If MPID is really the column that identifies records in both tables then your query should have a bit different structure:
SELECT old.FIRST_NAME, new.First_Name,
old.LAST_NAME, new.Last_Name,
old.MI_NAME, new.MI_Name,
old.FAMILY_ID, new.Family_Id
FROM EEMSCDBStatic old
INNER JOIN EEMS_VIPFILE new ON old.MPID = new.MPID
WHERE old.FIRST_NAME <> new.First_Name
AND old.LAST_NAME <> new.Last_Name
AND old.MI_NAME <> new.MI_Name
AND old.FAMILY_ID <> new.Family_Id
ORDER BY old.FIRST_NAME, new.First_Name,
old.LAST_NAME, new.Last_Name,
old.MI_NAME, new.MI_Name,
old.FAMILY_ID, new.Family_Id
A couple of other thoughts:
If you're looking for any change in a record (even if only one column has different values), you should use ORs in the WHERE clause, not ANDs. Now you're only looking for records that changed values in all columns. For instance, you'll fail to find a person who changed his or her first name but decided to keep last name.
You should obviously consider indexing your tables if it's possible.
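To make the two thoughts above concrete, here is a minimal sketch of the join with ORs plus a supporting index. It assumes MPID really identifies a record in both tables. The index name is made up, and the INCLUDE list is only a guess at which columns are worth covering.

```sql
-- Hypothetical index name; lets each of the 500 rows seek into the large table by MPID.
CREATE INDEX IX_EEMS_VIPFILE_MPID
    ON EEMS_VIPFILE (MPID)
    INCLUDE (First_Name, Last_Name, MI_Name, Family_Id);

-- Rows of the small table whose counterpart differs in at least one column.
SELECT o.MPID,
       o.FIRST_NAME AS OLD_FIRSTNAME, n.First_Name AS NEW_FIRSTNAME,
       o.LAST_NAME  AS OLD_LASTNAME,  n.Last_Name  AS NEW_LASTNAME,
       o.MI_NAME    AS OLD_MINAME,    n.MI_Name    AS NEW_MINAME,
       o.FAMILY_ID  AS OLD_FAMILYID,  n.Family_Id  AS NEW_FAMILYID
FROM EEMSCDBStatic AS o
INNER JOIN EEMS_VIPFILE AS n
        ON o.MPID = n.MPID
WHERE o.FIRST_NAME <> n.First_Name
   OR o.LAST_NAME  <> n.Last_Name
   OR o.MI_NAME    <> n.MI_Name
   OR o.FAMILY_ID  <> n.Family_Id;
```

Keep in mind that <> never matches when one side is NULL, so a change from or to NULL would be missed by this version.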

It is pointless to use the DISTINCT keyword together with ROW_NUMBER.
See the question "sql query distinct with Row_Number".
You are doing a CROSS JOIN, which produces a huge intermediate result in your case.
Perhaps in the condition
where OLD.MPID <> New.MPID and old.FIRST_NAME <> New.First_Name and ...
you wanted to have OR instead of AND?
It is also not entirely clear why you use ROW_NUMBER at all, perhaps to find the best match.
All this is because, as @Shnugo correctly remarked, the logic behind your comparison is faulty: you must define some logic that JOINs the tables (for example, first and second name must be the same).

Related

RETURNING causes error: missing FROM-clause entry for table

I am getting the users data from UUID WHERE empl_user_pub_uuid = 'e2bb39f1f28011eab66c63cb4d9c7a34'.
Since I don't want to make an additional query to fetch additional user data I'm trying to sneak them through the INSERT.
WITH _u AS (
SELECT
eu.empl_user_pvt_uuid,
ee.email,
ep.name_first
FROM employees.users eu
LEFT JOIN (
SELECT DISTINCT ON (ee.empl_user_pvt_uuid)
ee.empl_user_pvt_uuid,
ee.email
FROM employees.emails ee
ORDER BY ee.empl_user_pvt_uuid, ee.t DESC
) ee ON eu.empl_user_pvt_uuid = ee.empl_user_pvt_uuid
LEFT JOIN (
SELECT DISTINCT ON (ep.empl_user_pvt_uuid)
ep.empl_user_pvt_uuid,
ep.name_first
FROM employees.profiles ep
) ep ON eu.empl_user_pvt_uuid = ep.empl_user_pvt_uuid
WHERE empl_user_pub_uuid = 'e2bb39f1f28011eab66c63cb4d9c7a34'
)
INSERT INTO employees.password_resets (empl_pwd_reset_uuid, empl_user_pvt_uuid, t_valid, for_empl_user_pvt_uuid, token)
SELECT 'f70a0346-a077-11eb-bd1a-aaaaaaaaaaaa', '6efc2b7a-f27e-11ea-b66c-de1c405de048', '2021-04-18 19:57:47.111365', _u.empl_user_pvt_uuid, '19d65aea-7c4a-41bc-b580-9d047f1503e6'
FROM _u
RETURNING _u.empl_user_pvt_uuid, _u.email, _u.name_first;
However I get:
[42P01] ERROR: missing FROM-clause entry for table "_u"
Position: 994
What am I doing wrong?

It's true, as has been noted, that the RETURNING clause of an INSERT only sees the inserted row. More specifically, quoting the manual here:
The optional RETURNING clause causes INSERT to compute and return
value(s) based on each row actually inserted (or updated, if an ON CONFLICT DO UPDATE clause was used). This is primarily useful for
obtaining values that were supplied by defaults, such as a serial
sequence number. However, any expression using the table's columns
is allowed. The syntax of the RETURNING list is identical to that
of the output list of SELECT. Only rows that were successfully
inserted or updated will be returned. [...]
Bold emphasis mine.
So nothing keeps you from adding a correlated subquery to the RETURNING list:
INSERT INTO employees.password_resets AS ep
(empl_pwd_reset_uuid , empl_user_pvt_uuid , t_valid , for_empl_user_pvt_uuid, token)
SELECT 'f70a0346-a077-11eb-bd1a-aaaaaaaaaaaa', '6efc2b7a-f27e-11ea-b66c-de1c405de048', '2021-04-18 19:57:47.111365', eu.empl_user_pvt_uuid , '19d65aea-7c4a-41bc-b580-9d047f1503e6'
FROM employees.users eu
WHERE empl_user_pub_uuid = 'e2bb39f1f28011eab66c63cb4d9c7a34'
RETURNING for_empl_user_pvt_uuid AS empl_user_pvt_uuid -- alias to meet your org. query
, (SELECT email
FROM employees.emails
WHERE empl_user_pvt_uuid = ep.for_empl_user_pvt_uuid -- the user the reset is for
ORDER BY t DESC -- NULLS LAST ?
LIMIT 1
) AS email
, (SELECT name_first
FROM employees.profiles
WHERE empl_user_pvt_uuid = ep.for_empl_user_pvt_uuid -- the user the reset is for
-- ORDER BY ???
LIMIT 1
) AS name_first;
This is also much more efficient than the query you had (or what was proposed) for multiple reasons.
We don't run the subqueries ee and ep over all rows of the tables employees.emails and employees.profiles. That would be efficient if we needed major parts of those tables, but we only fetch a single row of interest from each. With appropriate indexes, a correlated subquery is much more efficient for this. See:
Efficient query to get greatest value per group from big table
Two SQL LEFT JOINS produce incorrect result
Select first row in each GROUP BY group?
Optimize GROUP BY query to retrieve latest row per user
We don't add the overhead of one or more CTEs.
We only fetch additional data after a successful INSERT, so no time is wasted if the insert didn't go through for any reason. (See quote at the top!)
Plus, possibly most important, this is correct. We use data from the row that has actually been inserted - after inserting it. (See quote at the top!) After possible default values, triggers or rules have been applied. We can be certain that what we see is what's actually in the database (currently).
You have no ORDER BY for profiles.name_first. That's not right. Either there is only one qualifying row, then we need no DISTINCT nor LIMIT 1. Or there can be multiple, then we also need a deterministic ORDER BY to get a deterministic result.
And if emails.t can be NULL, you'll want to add NULLS LAST in the ORDER BY clause. See:
Sort by column ASC, but NULL values first?
Indexes
Ideally, you have these multicolumn indexes (with columns in this order):
users (empl_user_pub_uuid, empl_user_pvt_uuid)
emails (empl_user_pvt_uuid, email)
profiles (empl_user_pvt_uuid, name_first)
Then, if the tables are vacuumed enough, you get three index-only scans and the whole operation is lightning fast.
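As a rough sketch, the three indexes listed above could be created like this. The index names are hypothetical; only the tables and column order come from the list.

```sql
-- Hypothetical index names; column order as listed above.
CREATE INDEX users_pub_pvt_idx
    ON employees.users (empl_user_pub_uuid, empl_user_pvt_uuid);

CREATE INDEX emails_pvt_email_idx
    ON employees.emails (empl_user_pvt_uuid, email);

CREATE INDEX profiles_pvt_name_idx
    ON employees.profiles (empl_user_pvt_uuid, name_first);

-- Index-only scans rely on an up-to-date visibility map, hence the vacuum.
VACUUM (ANALYZE) employees.users;
VACUUM (ANALYZE) employees.emails;
VACUUM (ANALYZE) employees.profiles;
```

Since the email lookup sorts by t, it may be worth testing a variant of the emails index that also contains t. That is my assumption, not something verified against your data, so check the plan with EXPLAIN.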
Get pre-INSERT values?
If you really want that (which I don't think you do), consider:
Return pre-UPDATE column values using SQL only

According to the Postgres docs, section 6.4, Returning Data From Modified Rows:
In an INSERT, the data available to RETURNING is the row as it was
inserted.
But here you are trying to return columns from the source table instead of the destination. RETURNING cannot return columns from _u, only columns from the row inserted into employees.password_resets. You can, however, put the INSERT into a second CTE and select the data from the source CTE in the final query. Please try the approach below.
WITH _u AS (
SELECT
eu.empl_user_pvt_uuid,
ee.email,
ep.name_first
FROM employees.users eu
LEFT JOIN (
SELECT DISTINCT ON (ee.empl_user_pvt_uuid)
ee.empl_user_pvt_uuid,
ee.email
FROM employees.emails ee
ORDER BY ee.empl_user_pvt_uuid, ee.t DESC
) ee ON eu.empl_user_pvt_uuid = ee.empl_user_pvt_uuid
LEFT JOIN (
SELECT DISTINCT ON (ep.empl_user_pvt_uuid)
ep.empl_user_pvt_uuid,
ep.name_first
FROM employees.profiles ep
) ep ON eu.empl_user_pvt_uuid = ep.empl_user_pvt_uuid
WHERE empl_user_pub_uuid = 'e2bb39f1f28011eab66c63cb4d9c7a34'
), I as
(
INSERT INTO employees.password_resets (empl_pwd_reset_uuid, empl_user_pvt_uuid, t_valid, for_empl_user_pvt_uuid, token)
SELECT 'f70a0346-a077-11eb-bd1a-aaaaaaaaaaaa', '6efc2b7a-f27e-11ea-b66c-de1c405de048', '2021-04-18 19:57:47.111365', _u.empl_user_pvt_uuid, '19d65aea-7c4a-41bc-b580-9d047f1503e6'
FROM _u
)
select _u.empl_user_pvt_uuid, _u.email, _u.name_first from _u

SQL Server ISNULL multiple columns

I have the following query which works great but how do I add multiple columns in its select statement? Following is the query:
SELECT ISNULL(
(SELECT DISTINCT a.DatasourceID
FROM [Table1] a
WHERE a.DatasourceID = 5 AND a.AgencyID = 4 AND a.AccountingMonth = 201907), NULL) TEST
So currently I only get one column (TEST) but would like to add other columns such as DataSourceID, AgencyID and AccountingMonth.

If you want to output a row when the requested values are found and also output a row when they are not,
you can define a pseudo table of requested values in the FROM clause and LEFT OUTER JOIN it with your Table1.
SELECT ISNULL(Table1.DatasourceId, 999999),
Table1.AgencyId,
Table1.AccountingMonth,
COUNT(*) as count
FROM ( VALUES (5, 4, 201907 ),
(6, 4, 201907 ))
AS requested(DatasourceId, AgencyId, AccountingMonth)
LEFT OUTER JOIN Table1 ON requested.agencyid=Table1.AgencyId
AND requested.datasourceid = Table1.DatasourceId
AND requested.AccountingMonth = Table1.AccountingMonth
GROUP BY Table1.DatasourceId, Table1.AgencyId, Table1.AccountingMonth
Note that:
I have put an ISNULL on the first column, as you did, to output a particular value (999999) when no value is found.
I did not put ISNULL(..., NULL) on the other columns as in your query, since IMHO it is not necessary: if there is no value, a NULL is output anyway.
I added a COUNT(*) column to illustrate an aggregate; you could use another one (SUM, MIN, MAX) or none if you do not need it.
The set of requested values is provided as a table value constructor (see https://learn.microsoft.com/en-us/sql/t-sql/queries/table-value-constructor-transact-sql?view=sql-server-2017).
I have added multiple rows of requested conditions: you can request multiple datasources, agencies or months in one query, with one line for each in the output.
If you want only one row, put only one row in the "requested" pseudo table values.
There must be a GROUP BY, even if you do not use an aggregate (COUNT, SUM or other), in order to get the same behavior as your DISTINCT clause: it restricts the output to a single line per set of requested values.

It seems to me that you want to check whether data exists. I guess that your AgencyID is a foreign key to an Agency table, DataSourceID likewise to a DataSource table, and that you have an AccountingMonth table that holds all accounting periods:
SELECT ds.ID as DataSourceID , ag.ID as AgencyID , am.ID as AccountingMonth ,
COUNT(a.DataSourceID) as Count
FROM [Table1] a
RIGHT JOIN [Datasource] ds ON ds.ID = a.DataSourceID
RIGHT JOIN [Agency] ag ON ag.ID = a.AgencyID
RIGHT JOIN [AccountingMonth] am on am.ID = a.AccountingMonth
GROUP BY ds.ID, ag.ID, am.ID
This way you can see the count of records per group. Notice the RIGHT JOIN: you must use RIGHT JOIN if you want to include all records from the "right" table.
In your query you have DISTINCT a.DatasourceID and WHERE a.DatasourceID = 5, so it returns 5 if the table contains rows that match your WHERE criteria and NULL if there is no data. If you remove a.DatasourceID = 5 from the WHERE clause, your query could fail with an error because the subquery returns more than one value.

The way you are doing it only allows for one column and one record, and names it TEST. It does not look like you really need to test for NULL: you are returning NULL in place of NULL, so that does nothing to help you. Remove all the NULL testing and return a full record set. DISTINCT also limits your result to one record here. When working with a single table you don't need an alias, and if there are no spaces or keywords in the names, bracketed identifiers are not required. If you need to see whether you have an empty record set, test for it in the calling program.
SELECT DatasourceID, AgencyID,AccountingMonth
FROM Table1
WHERE DatasourceID = 5 AND AgencyID = 4 AND AccountingMonth = 201907

Select values based on DISTINCT combination of rest of columns Oracle DB

I want to select row IDs associated with distinct column combinations in the remainder of a table. For instance, if the distinct rows are
I want to get the row IDs associated with each row. I can't query for distinct IDs since they are the row's primary key (and hence are all distinct).
So far I have:
SELECT e.ID
FROM E_UPLOAD_TEST e
INNER JOIN (
SELECT DISTINCT WHAT, MATERIALS, ERROR_FIELD, UNITS, SEASONALITY, DATA_TYPE, DETAILS, METHODS, DATA_FORMAT
FROM E_UPLOAD_TEST) c
ON e.WHAT = c.WHAT AND e.MATERIALS = c.MATERIALS AND e.ERROR_FIELD = c.ERROR_FIELD AND e.DATA_TYPE = c.DATA_TYPE AND e.METHODS = c.METHODS AND e.DATA_FORMAT = c.DATA_FORMAT;
which runs but doesn't return anything. Am I missing a GROUP BY and/or MIN() statement?

@serg is correct. Every single row in your example has at least one column value that is null. That means that no row will match your join condition, which is why your query returns no rows.
Modifying your condition might get you what you want as long as your data isn't changing frequently. If it is changing frequently, then you probably want a single query for the entire job; otherwise you'll have to set up your transaction so that it is immune to data changes.
An example of such a condition change is this:
( (e.WHAT is null and c.WHAT is null) or (e.WHAT = c.WHAT) )
But such a change makes sense only if two rows having a null value in the same column means the same thing for both rows and it has to mean the same thing as time marches on. What "WHAT is null" means today might not be the same thing tomorrow. And that is probably why C. J. Date hates nulls so much.

Instead of comparing with =, use the DECODE function, which treats two null values as equal.
e.WHAT = c.WHAT -> DECODE(e.WHAT, c.WHAT, 1) = 1
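Applied to the whole join, the substitution above might look like the sketch below. I compare all nine columns from the DISTINCT list, which is an assumption on my part, since the original ON clause only listed six of them.

```sql
SELECT e.ID
FROM E_UPLOAD_TEST e
INNER JOIN (
    SELECT DISTINCT WHAT, MATERIALS, ERROR_FIELD, UNITS, SEASONALITY,
                    DATA_TYPE, DETAILS, METHODS, DATA_FORMAT
    FROM E_UPLOAD_TEST
) c
   -- DECODE returns 1 when both values are equal or both are NULL, else 0
   ON  DECODE(e.WHAT,        c.WHAT,        1, 0) = 1
   AND DECODE(e.MATERIALS,   c.MATERIALS,   1, 0) = 1
   AND DECODE(e.ERROR_FIELD, c.ERROR_FIELD, 1, 0) = 1
   AND DECODE(e.UNITS,       c.UNITS,       1, 0) = 1
   AND DECODE(e.SEASONALITY, c.SEASONALITY, 1, 0) = 1
   AND DECODE(e.DATA_TYPE,   c.DATA_TYPE,   1, 0) = 1
   AND DECODE(e.DETAILS,     c.DETAILS,     1, 0) = 1
   AND DECODE(e.METHODS,     c.METHODS,     1, 0) = 1
   AND DECODE(e.DATA_FORMAT, c.DATA_FORMAT, 1, 0) = 1;
```

With every column compared this way, each row of e matches exactly one distinct combination, so rows containing nulls are no longer dropped by the join.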

Oracle SQL not bringing back duplicates to an Oracle Form 10g

I've created an Oracle SQL query that links about five tables, and I'm using it as a FROM clause query in an Oracle Form. The problem with the query is that some records are duplicated, and I only want to show one line in the form, not any duplicate records. I've tried GROUP BY and PARTITION BY, but the query becomes too slow when I add them to the statement.
I'm now thinking of doing this as a procedure and bringing back just one of the duplicates if any occur. Would it be best to bring back an Oracle table of records from the database into the form? How would it be best to look for a duplicate in an Oracle PL/SQL loop?
I've updated the question and added the full query below to explain it better. The surr_id, the first column in the select statement below, is unique, but what I want to show in the Oracle form is the production number along with the other columns, which are not unique. There can be duplicates of a production number, and sometimes even three records with the same production number. Hope this helps. I was thinking of putting this in a loop, grabbing the first production number and then only bringing back a record when the production number changes.
select x.surr_id ,
x.supplier_name as supplier ,
x.broadcaster_name as broadcaster ,
ptle.title as production_title ,
x.production_number as production_number ,
stle.title as series_title ,
x.production_source as supplied_source_ind ,
x.third_party_group_id ,
x.bro_broadcast_by_tp_surr_id ,
x.station_id from (select usage_headers.surr_id as surr_id ,
broad_supp.supplier_name as supplier_name ,
broad_supp.broadcaster_name as broadcaster_name ,
usage_headers.production_number as production_number ,
productions.production_source as production_source ,
broad_supp.station_id as station_id ,
usage_headers.prod_exploitation_cre_surr_id as prod_exploitation_cre_surr_id ,
usage_headers.bro_broadcast_by_tp_surr_id as bro_broadcast_by_tp_surr_id ,
productions.cre_surr_id as cre_surr_id ,
productions.prod_series_cre_surr_id as prod_series_cre_surr_id ,
broad_supp.third_party_group_id as third_party_group_id
from usage_headers, productions, (SELECT /*+ index (bro bro_pk) */
third_party.surr_id AS THIRD_PARTY_SURR_ID,
third_party.supplier_group_id AS THIRD_PARTY_GROUP_ID,
third_party.dn_root_tp_surr_id AS THIRD_PARTY_ROOT_ID,
third_party.supplier_name, bro.station_id AS STATION_ID,
bro.dn_tp_name AS BROADCASTER_NAME FROM ( SELECT tp.surr_id,
tp.name AS supplier_name,
tp.tp_surr_id AS supplier_group_id,
tp.dn_root_tp_surr_id FROM third_parties tp
CONNECT BY PRIOR tp.surr_id = tp.tp_surr_id
START WITH tp.surr_id IN (4251, 4247, 4237, 4034, 10157, 14362, 9834)) third_party
JOIN broadcasters bro ON (third_party.surr_id = bro.tp_surr_id)) broad_supp
where broad_supp.THIRD_PARTY_SURR_ID = usage_headers.bro_broadcast_by_tp_surr_id
AND usage_headers.prod_exploitation_cre_surr_id = productions.cre_surr_id
and usage_headers.prod_exploitation_cre_surr_id IS NOT NULL
and usage_headers.right_type in ('M','B')
AND usage_headers.udg_surr_id IS NOT NULL
AND NVL(usage_headers.dn_uls_usage_status,'3') NOT IN ('9', '11')
AND productions.production_source <> 'AP') x
LEFT OUTER JOIN titles ptle ON ( ptle.cre_surr_id = x.cre_surr_id AND ptle.tt_code = 'R')
LEFT OUTER JOIN titles stle ON ( stle.cre_surr_id = x.prod_series_cre_surr_id AND stle.tt_code = 'R')
Thanks in advance, guys.

If you're getting records that are entirely duplicated then just adding a DISTINCT clause, so your SELECT becomes SELECT DISTINCT will ensure that only one of the records is returned. If even one column is different though then this won't work.

MS SQL update table with multiple conditions

Been reading this site for answers for quite a while and now asking my first question!
I'm using SQL Server
I have two tables, ABC and ABC_Temp.
The contents are inserted into the ABC_Temp first before making its way to ABC.
Table ABC and ABC_Temp have the same columns, except that ABC_Temp has an extra column called LastUpdatedDate, which contains the date of the last update. Because ABC_Temp can have more than 1 of the same record, it has a composite key of the item number and the last updated date.
The columns are: ItemNo | Price | Qty and ABC_Temp has an extra column: LastUpdatedDate
I want to create a statement that follows the following conditions:
Check if each of the attributes of ABC differ from the value of ABC_Temp for records with the same key, if so then do the update (Even if only one attribute is different, all other attributes can be updated as well)
Only update those that need changes, if the record is the same, then it would not update.
Since an item can have more than one record in ABC_Temp I only want the latest updated one to be updated to ABC
I am currently using 2005 (I think, not at work at the moment).
This will be in a stored procedure that is called from a VBScript scheduled task, so I believe it is a one-time thing. Also, I'm not trying to sync the two tables, as ABC_Temp would only contain new records bulk inserted from a text file through BCP. For the sake of context, this will be used in conjunction with an insert stored proc that checks if records exist.

UPDATE
ABC
SET
price = T1.price,
qty = T1.qty
FROM
ABC
INNER JOIN ABC_Temp T1 ON
T1.item_no = ABC.item_no
LEFT OUTER JOIN ABC_Temp T2 ON
T2.item_no = T1.item_no AND
T2.last_updated_date > T1.last_updated_date
WHERE
T2.item_no IS NULL AND
(
T1.price <> ABC.price OR
T1.qty <> ABC.qty
)
If NULL values are possible in the price or qty columns then you will need to account for that. In this case I would probably change the inequality statements to look like this:
COALESCE(T1.price, -1) <> COALESCE(ABC.price, -1)
This assumes that -1 is not a valid value in the data, so you don't have to worry about it actually appearing there.
Also, is ABC_Temp really a temporary table that's just loaded long enough to get the values into ABC? If not then you are storing duplicate data in multiple places, which is a bad idea. The first problem is that now you need these kinds of update scenarios. There are other issues that you might run into, such as inconsistencies in the data, etc.

You could use cross apply to seek the last row in ABC_Temp with the same key. Use a where clause to filter out rows with no differences:
update abc
set col1 = latest.col1
, col2 = latest.col2
, col3 = latest.col3
from ABC abc
cross apply
(
select top 1 *
from ABC_Temp tmp
where abc.key = tmp.key
order by
tmp.LastUpdatedDate desc
) latest
where abc.col1 <> latest.col1
or (abc.col2 <> latest.col2
or (abc.col2 is null and latest.col2 is not null)
or (abc.col2 is not null and latest.col2 is null))
or abc.col3 <> latest.col3
In the example, only col2 is nullable. Since null <> 1 is not true, you have to check differences involving null using the is null syntax.
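If several columns are nullable, spelling out every is null combination gets long. A possible alternative, sketched here with the column names from the question (ItemNo, Price, Qty, LastUpdatedDate) rather than the placeholders above, is to compare the two column lists with EXCEPT, which treats two NULLs as equal. I haven't run this against your schema, so treat it as a starting point.

```sql
UPDATE a
SET    Price = latest.Price,
       Qty   = latest.Qty
FROM   ABC AS a
CROSS APPLY
(
    -- Latest staged row for the same item
    SELECT TOP 1 tmp.Price, tmp.Qty
    FROM   ABC_Temp AS tmp
    WHERE  tmp.ItemNo = a.ItemNo
    ORDER BY tmp.LastUpdatedDate DESC
) AS latest
WHERE EXISTS
(
    -- Returns a row only when at least one column differs, NULLs included
    SELECT a.Price, a.Qty
    EXCEPT
    SELECT latest.Price, latest.Qty
);
```

The EXISTS ... EXCEPT test is empty when both column lists are identical, so unchanged rows are skipped without any sentinel value such as -1.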
