sql: Fast way to check if data are already in the data base


I have an MSSQL database table (called TABLE here) with four columns (ID, lookup, date, value). Using Python, I want to check whether a large amount of data is already in the database. The data I want to add is called to_be_added here and has the columns index, lookup, date, value.
To check whether the data already exists I use the following SQL. It returns the index of each to_be_added row that is not yet in the database. I first select the rows whose lookup is in the database and then perform the join only on that subset (called existing here).
SELECT to_be_added."index",existing."ID" FROM
(
(
select * from dbo.TABLE
where "lookup" in (1,2,3,4,5,6,7,...)
) existing
right join
(
select * from
( Values
(1, 1, '1/1/2000', 0.123),(2, 2, '1/2/2000', 0.456),(...,...,...)
)t1("index",lookup,date,value)
)to_be_added
on existing.lookup = to_be_added.lookup
and existing.date = to_be_added.date
)
WHERE existing."ID" IS NULL
I do it batchwise, as otherwise the SQL command gets too large to submit and the execution time is too long. As I have millions of rows to compare, I am looking for a more efficient command, because this becomes quite time consuming.
Any help appreciated

I would do the following:
Load the data from Excel into a table in your DB e.g. table = to_be_added
Run a query like this:
SELECT a."index"
FROM to_be_added a
LEFT OUTER JOIN existing e ON
a.lookup = e.lookup
and a.date = e.date
WHERE e.lookup IS NULL;
Ensure that table "existing" has an index on lookup+date
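The staging-table approach above can be sketched from Python as follows. This is a minimal illustration, not the answerer's code: it uses the standard-library sqlite3 module as a stand-in for SQL Server, the sample rows are made up, and the column idx is a hypothetical rename of index to avoid the reserved word. With SQL Server you would use a driver such as pyodbc and a real staging table instead.

```python
import sqlite3

# SQLite stands in for SQL Server; table and column names follow the question,
# except idx, which is a hypothetical rename of the reserved word "index".
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE existing (ID INTEGER PRIMARY KEY, lookup INTEGER, date TEXT, value REAL);
    CREATE INDEX ix_existing_lookup_date ON existing (lookup, date);
    CREATE TABLE to_be_added (idx INTEGER, lookup INTEGER, date TEXT, value REAL);
""")

# Rows already stored in the database (made-up sample data).
conn.executemany(
    "INSERT INTO existing (lookup, date, value) VALUES (?, ?, ?)",
    [(1, "2000-01-01", 0.123), (2, "2000-01-02", 0.456)],
)

# Candidate rows, bulk-loaded into the staging table in one call.
candidates = [
    (1, 1, "2000-01-01", 0.123),  # already present
    (2, 2, "2000-01-03", 0.789),  # known lookup, new date
    (3, 3, "2000-01-01", 0.111),  # new lookup
]
conn.executemany("INSERT INTO to_be_added VALUES (?, ?, ?, ?)", candidates)

# Anti-join: keep the staging rows that have no match on (lookup, date).
missing = [
    row[0]
    for row in conn.execute("""
        SELECT a.idx
        FROM to_be_added a
        LEFT JOIN existing e
               ON a.lookup = e.lookup AND a.date = e.date
        WHERE e.lookup IS NULL
        ORDER BY a.idx
    """)
]
print(missing)
```

Loading the candidates with a single bulk insert keeps the SQL text short, so the statement size no longer grows with the batch.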

Related

T-SQL Match records 1 to 1 without join condition

I have a group of entities, each of which needs to have a record from another table associated with it.
When I try to output an Id for the table to be matched on, it doesn't work, because you can only output from inserted, updated, etc.
DECLARE @SignatureGlobalIdsTbl table (ID int,
CompanyBankAccountId int);
INSERT INTO GlobalIds (TypeId)
-- I cannot output cba.Id into the table since it's not from inserted
OUTPUT Inserted.Id,
cba.Id
INTO @SignatureGlobalIdsTbl (ID,
CompanyBankAccountId)
SELECT (@DocumentsGlobalTypeKey)
FROM CompanyBankAccounts cba
INNER JOIN Companies c ON c.CompanyId = cba.CompanyId
WHERE SignatureDocumentId IS NULL
AND (SignatureFile IS NOT NULL
AND SignatureFile != '');
INSERT INTO Documents (DocumentPath,
DocumentType,
DocumentIsExternal,
OwnerGlobalId,
OwnerGlobalTypeID,
DocumentName,
Extension,
GlobalId)
SELECT SignatureFile,
@SignatureDocumentTypeKey,
1,
CompanyGlobalId,
@OwnerGlobalTypeKey,
[dbo].[fnGetFileNameWithoutExtension](SignatureFile),
[dbo].[fnGetFileExtension](SignatureFile),
documentGlobalId
FROM (SELECT c.GlobalId AS CompanyGlobalId,
cba.*,
s.ID AS documentGlobalId
FROM CompanyBankAccounts cba
INNER JOIN Companies c ON c.CompanyId = cba.CompanyId
CROSS JOIN @SignatureGlobalIdsTbl s) info
WHERE SignatureDocumentId IS NULL
AND (SignatureFile IS NOT NULL
AND SignatureFile != '');
I tried to use a cross join to prevent a Cartesian product, but that did not work. I also tried to output the row number over some value, but I could not get that stored in the table either.
If I have two separate queries which return the same number of records, how can I pair the records together without creating a Cartesian product?

'When I try to output an Id for the table ... it doesn't work.'
This seems to be because one of the columns you want to OUTPUT is not actually part of the insert. It's an annoying problem and I wish SQL Server would allow us to do it.
Someone may have a much better answer for this than I do, but the way I usually approach this is
Create a temporary table/etc of the data I want to insert, with a column for ID (starts blank)
Do an insert of the correct amount of rows, and get the IDs out into another temporary table,
Assign the IDs as appropriate within the original temporary table
Go back and update the inserted rows with any additional data needed (though that's probably not needed here given you're just inserting a constant)
What this does is to flag/get the IDs ready for you to use, then you allocate them to your data as needed, then fill in the table with the data. It's relatively simple although it does do 2 table hits rather than 1.
Also consider doing it all within a transaction to keep the data consistent (though also probably not needed here).
How can I pair the records together?
A cross join unfortunately multiplies the rows (number of rows on left times the number of rows on the right). It is useful in some instances, but possibly not here.
I suggest when you do your inserts above, you get an identifier (e.g., companyID) in your temp table and join on that.
If you don't have a matching record and just want to assign them in order, you can use an answer similar to my answer in another recent question How to update multiple rows in a temp table with multiple values from another table using only one ID common between them?
Further notes
I suggest avoiding table variables (e.g., DECLARE @yourtable TABLE) and using temporary tables (CREATE TABLE #yourtable) instead, for performance reasons. With only a small number of rows it's OK, but it gets worse as the row count grows, because SQL Server has traditionally assumed that a table variable holds only 1 row
In your bottom statement, why is there the SELECT statement in the FROM clause? Couldn't you just get rid of that select statement and have the FROM clause list the tables you want?
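The "assign them in order" idea mentioned above can be sketched by numbering both row sets and joining on the row number, which pairs rows one to one instead of multiplying them. This is only an illustration under assumptions: sqlite3 stands in for SQL Server (window functions need SQLite 3.25 or later), and the tables accounts and new_ids with their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (account_id INTEGER);  -- hypothetical rows that need an ID
    CREATE TABLE new_ids (id INTEGER);           -- hypothetical freshly generated IDs
    INSERT INTO accounts VALUES (10), (20), (30);
    INSERT INTO new_ids VALUES (501), (502), (503);
""")

# Number each set independently, then join on the row number.
# Each row finds exactly one partner, so there is no Cartesian product.
pairs = conn.execute("""
    WITH a AS (
        SELECT account_id, ROW_NUMBER() OVER (ORDER BY account_id) AS rn
        FROM accounts
    ),
    n AS (
        SELECT id, ROW_NUMBER() OVER (ORDER BY id) AS rn
        FROM new_ids
    )
    SELECT a.account_id, n.id
    FROM a
    JOIN n ON a.rn = n.rn
    ORDER BY a.rn
""").fetchall()
print(pairs)
```

The ORDER BY inside each ROW_NUMBER() decides which rows end up paired, so it should be a deterministic ordering.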

I figured out a way to have access to the output, by using a merge statement.
DECLARE @LogoGlobalIdsTbl TABLE (ID INT, companyBankAccountID INT)
MERGE GlobalIds
USING
(
SELECT (cba.CompanyBankAccountId)
FROM CompanyBankAccounts cba
INNER JOIN Companies c on c.CompanyId = cba.CompanyId
WHERE cba.LogoDocumentId IS NULL AND (cba.LogoFile IS NOT NULL AND cba.LogoFile != '')
) src ON (1=0)
WHEN NOT MATCHED
THEN INSERT ( TypeId )
VALUES (@DocumentsGlobalTypeKey)
OUTPUT [INSERTED].[Id], src.CompanyBankAccountId
INTO @LogoGlobalIdsTbl;

SQL Server : joining 2 tables where each row has unique PKEY value

I am attempting to join 2 tables that have the same number of columns, with each column named the same in both. Tb1 is an MS Access table that I have imported to SQL Server. Tb2 is a table that is updated from tb1 quarterly and is used to generate reports.
I have gone into design view and ensured that all column datatypes are the same and have the same names. Likewise, every row in each table is assigned a unique integer value in a column named PKEY.
What I would like to do is add all new entries present in tb1 (the MS Access table) to the existing tb2. I believe this can be done by writing a query that loads all unique pkeys found in tb1 (AKA load all keys that are NOT found in both tables, only load unique keys belonging to rows in the access table) and then appending these entries into Tb2.
Not really sure where to start when writing this query, I tried something like:
SELECT *
FROM tb1
WHERE PKEY <> Tb2.PKEY
Any help would be greatly appreciated. Thanks!

I would recommend not exists:
select tb1.*
from tb1
where not exists (select 1 from tb2 where tb2.pkey = tb1.pkey);
You can put an insert before this to insert the rows into the second table.
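Putting the insert in front of the NOT EXISTS query might look like the sketch below. It is an illustration only: sqlite3 stands in for SQL Server, and the name column and the sample rows are assumptions that are not part of the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical two-column versions of the tables from the question.
    CREATE TABLE tb1 (pkey INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tb2 (pkey INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO tb1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
    INSERT INTO tb2 VALUES (1, 'a');
""")

# Append only the tb1 rows whose key is not yet present in tb2.
conn.execute("""
    INSERT INTO tb2 (pkey, name)
    SELECT tb1.pkey, tb1.name
    FROM tb1
    WHERE NOT EXISTS (SELECT 1 FROM tb2 WHERE tb2.pkey = tb1.pkey)
""")
conn.commit()

keys = [row[0] for row in conn.execute("SELECT pkey FROM tb2 ORDER BY pkey")]
print(keys)
```

Because the filter checks tb2 itself, running the statement a second time inserts nothing.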

Insert into tb2 Select * from tb1 Where tb1.pkey not in (select pkey from tb2)
The script above inserts into tb2 the records returned by the outer select query.
The select query returns only records whose PKEY is not listed in the sub query.
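One difference between the NOT IN and NOT EXISTS forms is worth checking before choosing: if the subquery can return a NULL, NOT IN matches nothing. The sketch below demonstrates this with sqlite3 as a stand-in engine. The nullable pkey column is an assumption made for the demo; with the non-null unique PKEY described in the question, the two forms return the same rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tb1 (pkey INTEGER);
    CREATE TABLE tb2 (pkey INTEGER);  -- nullable on purpose for this demo
    INSERT INTO tb1 VALUES (1), (2), (3);
    INSERT INTO tb2 VALUES (1), (NULL);
""")

# NOT IN compares against every subquery value; a NULL makes each
# comparison "unknown", so no row passes the filter.
not_in = conn.execute(
    "SELECT pkey FROM tb1 WHERE pkey NOT IN (SELECT pkey FROM tb2) ORDER BY pkey"
).fetchall()

# NOT EXISTS only asks whether a matching row exists, so NULLs are harmless.
not_exists = conn.execute("""
    SELECT pkey FROM tb1
    WHERE NOT EXISTS (SELECT 1 FROM tb2 WHERE tb2.pkey = tb1.pkey)
    ORDER BY pkey
""").fetchall()

print(not_in, not_exists)
```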

SQL SELECT query where the IDs were already found

I have 2 tables:
Table A has 3 columns (for example) with opportunity sales header data:
OPP_ID, CLOSE_DTTM, STAGE
Table B has 3 columns with the individual line items for the Opportunities:
OPP_LINE_ID, OPP_ID, AMOUNT_USD
I have a select statement that correctly parses through Table A and returns a list of Opportunities. What I would like to do is, without joining the data, to have a SELECT statement that will get data from Table B but only for the OPP_IDs that were found in my first query.
The result should be 2 views/resultset (one for each select query) and not just 1 combined view where Table B is joined to Table A.
The reason why I want to keep them separate is because I will have to perform a few manipulations to the result from table B and i don't want the result from table A affected.

A subquery is all you need:
SELECT OPP_ID, CLOSE_DTTM, STAGE
FROM a
WHERE a.opp_id IN (SELECT opp_id FROM b)

Presuming you're using this in some client side data access library that represents B's data in some 2 dimensional collection and you want to manipulate it without affecting/ having A's data present in that collection:
Identify the records in A:
SELECT * FROM a WHERE somecolumn = 'somevalue'
Identify the records in B that relate to A, but don't return A's data:
SELECT b.* FROM a JOIN b ON a.opp_id = b.opp_id WHERE a.somecolumn = 'somevalue'
Just because JOIN is used doesn't mean your end-consuming program has to know about A's data. You could also use IN, like the other answer does; the optimizer will often produce the same plan for both forms anyway
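From the client side, the two-query approach could look like the sketch below. It is an illustration under assumptions: sqlite3 stands in for the real database, the sample rows are made up, and the stage filter is a hypothetical placeholder for whatever the real header query selects on.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (opp_id INTEGER PRIMARY KEY, close_dttm TEXT, stage TEXT);
    CREATE TABLE b (opp_line_id INTEGER PRIMARY KEY, opp_id INTEGER, amount_usd REAL);
    INSERT INTO a VALUES (1, '2020-01-01', 'Won'), (2, '2020-02-01', 'Lost'),
                         (3, '2020-03-01', 'Won');
    INSERT INTO b VALUES (10, 1, 100.0), (11, 1, 50.0), (12, 2, 75.0), (13, 3, 20.0);
""")

stage = "Won"  # hypothetical filter standing in for the real header query

# Result set 1: the opportunity headers.
headers = conn.execute(
    "SELECT * FROM a WHERE stage = ? ORDER BY opp_id", (stage,)
).fetchall()

# Result set 2: only B's columns, restricted to the same opportunities.
lines = conn.execute("""
    SELECT b.*
    FROM a
    JOIN b ON a.opp_id = b.opp_id
    WHERE a.stage = ?
    ORDER BY b.opp_line_id
""", (stage,)).fetchall()

# The line items can now be manipulated without touching the headers.
doubled = [(line_id, opp_id, amount * 2) for line_id, opp_id, amount in lines]
print(headers)
print(doubled)
```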

I tend to use exists for this type of query:
select b.*
from b
where exists (select 1 from a where a.opp_id = b.opp_id);
If you want two results sets, you need to run two queries. It is unclear what the second query is, perhaps the first query on A.

What if the column to be indexed is nvarchar data type in SQL Server?

I retrieve data by joining multiple tables. Because the FK column (EmployeeID) of the Event table contains no data, I have to join the two tables on the CardNo (nvarchar) fields. In addition, the CardNo values in the Event and Employee tables have different numbers of digits, so I also have to use SQL Server's RIGHT function, and this makes the query take approximately 10 times longer to execute. What should I do in this situation? Can I keep using the CardNo field without changing its data type to int, etc.? (Changing it might cause other problems, so it would be better to find a solution that leaves the data type as it is.) The query is below.
Query:
; WITH a AS (SELECT emp.EmployeeName, emp.Status, dep.DeptName, job.JobName, emp.CardNo
FROM TEmployee emp
LEFT JOIN TDeptA AS dep ON emp.DeptAID = dep.DeptID
LEFT JOIN TJob AS job ON emp.JobID = job.JobID),
b AS (SELECT eve.EventID, eve.EventTime, eve.CardNo, evt.EventCH, dor.DoorName
FROM TEvent eve LEFT JOIN TEventType AS evt ON eve.EventType = evt.EventID
LEFT JOIN TDoor AS dor ON eve.DoorID = dor.DoorID)
SELECT * FROM b LEFT JOIN a ON RIGHT(a.CardNo, 8) = RIGHT(b.CardNo, 8)
ORDER BY b.EventID ASC

You can add a computed column to your table like this:
ALTER TABLE TEmployee -- Don't start your table names with prefixes, you already know they're tables
ADD CardNoRight8 AS RIGHT(CardNo, 8) PERSISTED
ALTER TABLE TEvent
ADD CardNoRight8 AS RIGHT(CardNo, 8) PERSISTED
CREATE INDEX TEmployee_CardNoRight8_IDX ON TEmployee (CardNoRight8)
CREATE INDEX TEvent_CardNoRight8_IDX ON TEvent (CardNoRight8)
You don't need to persist the column since it already matches the criteria for a computed column to be indexed, but adding the PERSISTED keyword shouldn't hurt and might help the performance of other queries. It will cause a minor performance hit on updates and inserts, but that's probably fine in your case unless you're importing a lot of data (millions of rows) at a time.
The better solution though is to make sure that your columns that are supposed to match actually match. If the right 8 characters of the card number are something meaningful, then they shouldn't be part of the card number, they should be another column. If this is an issue where one table uses leading zeroes and the other doesn't then you should fix that data to be consistent instead of putting together work arounds like this.
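The idea of indexing the last eight characters can be tried out in a small sandbox. The sketch below is an analogue rather than the SQL Server syntax: it uses sqlite3, where an index on an expression plays the role of the indexed computed column and substr(CardNo, -8) plays the role of RIGHT(CardNo, 8). The sample card numbers are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TEmployee (EmployeeName TEXT, CardNo TEXT);
    CREATE TABLE TEvent (EventID INTEGER PRIMARY KEY, CardNo TEXT);

    -- SQLite analogue of an indexed computed column: an index on an expression.
    CREATE INDEX TEmployee_CardNoRight8_IDX ON TEmployee (substr(CardNo, -8));

    -- Made-up data: employee cards carry two extra leading digits.
    INSERT INTO TEmployee VALUES ('Alice', '0012345678'), ('Bob', '0087654321');
    INSERT INTO TEvent VALUES (1, '12345678'), (2, '87654321'), (3, '99999999');
""")

# Join on the last eight characters; events without an employee keep NULL.
rows = conn.execute("""
    SELECT eve.EventID, emp.EmployeeName
    FROM TEvent eve
    LEFT JOIN TEmployee emp
           ON substr(emp.CardNo, -8) = substr(eve.CardNo, -8)
    ORDER BY eve.EventID
""").fetchall()
print(rows)
```

Whether the engine actually uses the index depends on its optimizer and the data volume, so it is worth confirming with the execution plan on real data.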

This line is what is costing you 86% of the query time:
LEFT JOIN a ON RIGHT(a.CardNo, 8) = RIGHT(b.CardNo, 8)
This is happening because it has to run RIGHT() on those fields for every row and then match them with the other table. This is obviously going to be inefficient.
The most straightforward solution is probably to either remove the RIGHT() entirely or else to re-implement it as a built-in column on the table so it doesn't have to be calculated on the fly while the query is running.
While inserting the record, you would also have to insert the rightmost eight digits of the card number and store them in this field. My original thought was to use a computed column, but I wasn't sure those could be indexed, so a regular column is the safe choice. (SQL Server does allow an index on a computed column when its expression is deterministic, as RIGHT(CardNo, 8) is.)
; WITH a AS (
SELECT emp.EmployeeName, emp.Status, dep.DeptName, job.JobName, emp.CardNoRightEight
FROM TEmployee emp
LEFT JOIN TDeptA AS dep ON emp.DeptAID = dep.DeptID
LEFT JOIN TJob AS job ON emp.JobID = job.JobID
),
b AS (
SELECT eve.EventID, eve.EventTime, eve.CardNoRightEight, evt.EventCH, dor.DoorName
FROM TEvent eve LEFT JOIN TEventType AS evt ON eve.EventType = evt.EventID
LEFT JOIN TDoor AS dor ON eve.DoorID = dor.DoorID
)
SELECT *
FROM b
LEFT JOIN a ON a.CardNoRightEight = b.CardNoRightEight
ORDER BY b.EventID ASC

This will help you see how to add a calculated column to your database.
create table #temp (test varchar(30))
insert into #temp
values('000456')
alter table #temp
add test2 as right(test, 3) persisted
select * from #temp
The other alternative is to fix the data and the data entry so that both columns are the same data type and contain the same leading zeros (or remove them)

Many thanks for all of your help. With the help of your answers, I first reduced the query execution time from 2 minutes to 1 by using computed columns. After that, by creating an index on these columns, I reduced the execution time to 3 seconds. Wow, it is really perfect :)
Here are the steps, posted for those who suffer from a similar problem:
Step I: Add computed columns to the tables (the CardNo fields are nvarchar, so I cast them to int inside the expression, which also strips any leading zeros):
ALTER TABLE TEvent ADD CardNoRightEight AS RIGHT(CAST(CardNo AS int), 8)
ALTER TABLE TEmployee ADD CardNoRightEight AS RIGHT(CAST(CardNo AS int), 8)
Step II: Create an index on the computed columns so that the query executes faster:
CREATE INDEX TEmployee_CardNoRightEight_IDX ON TEmployee (CardNoRightEight)
CREATE INDEX TEvent_CardNoRightEight_IDX ON TEvent (CardNoRightEight)
Step III: Update the query to use the computed columns:
; WITH a AS (
SELECT emp.EmployeeName, emp.Status, dep.DeptName, job.JobName, emp.CardNoRightEight --emp.CardNo
FROM TEmployee emp
LEFT JOIN TDeptA AS dep ON emp.DeptAID = dep.DeptID
LEFT JOIN TJob AS job ON emp.JobID = job.JobID
),
b AS (
SELECT eve.EventID, eve.EventTime, evt.EventCH, dor.DoorName, eve.CardNoRightEight --eve.CardNo
FROM TEvent eve
LEFT JOIN TEventType AS evt ON eve.EventType = evt.EventID
LEFT JOIN TDoor AS dor ON eve.DoorID = dor.DoorID)
SELECT * FROM b LEFT JOIN a ON a.CardNoRightEight = b.CardNoRightEight --ON RIGHT(a.CardNo, 8) = RIGHT(b.CardNo, 8)
ORDER BY b.EventID ASC

Performance of nested select

I know this is a common question and I have read several other posts and papers but I could not find one that takes into account indexed fields and the volume of records that both queries could return.
My question is simple really. Which of the two is recommended here written in an SQL-like syntax (in terms of performance).
First query:
Select *
from someTable s
where s.someTable_id in
(Select someTable_id
from otherTable o
where o.indexedField = 123)
Second query:
Select *
from someTable s
where someTable_id in
(Select someTable_id
from otherTable o
where o.someIndexedField = s.someIndexedField
and o.anotherIndexedField = 123)
My understanding is that the second query will query the database for every tuple that the outer query returns, whereas the first query will evaluate the inner select first and then apply the filter to the outer query.
Now, the second query may run each lookup very fast, considering that someIndexedField is indexed, but if we have thousands or millions of records, wouldn't it be faster to use the first query?
Note: In an Oracle database.

In MySQL, if the nested select is over the same table as the outer select, the execution time of the query can be very long.
A good way to improve the performance in MySQL is to create a temporary table for the nested select and run the main select against that table.
For example:
Select *
from someTable s1
where s1.someTable_id in
(Select someTable_id
from someTable s2
where s2.Field = 123);
This can perform better as:
create temporary table temp_table as (
Select someTable_id
from someTable s2
where s2.Field = 123
);
Select *
from someTable s1
where s1.someTable_id in
(Select someTable_id
from temp_table s2);
I'm not sure about performance for a large amount of data.
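The temporary-table rewrite can be tried end to end with the sketch below. It is an illustration only: sqlite3 stands in for MySQL, the sample rows are made up, and nothing here measures performance. It only shows that the two-step form returns the same rows as the nested form.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE someTable (someTable_id INTEGER PRIMARY KEY, Field INTEGER);
    INSERT INTO someTable VALUES (1, 123), (2, 456), (3, 123);

    -- Step 1: materialize the inner select once.
    CREATE TEMPORARY TABLE temp_table AS
        SELECT someTable_id FROM someTable WHERE Field = 123;
""")

# Step 2: run the outer select against the materialized IDs.
via_temp = [row[0] for row in conn.execute("""
    SELECT s1.someTable_id
    FROM someTable s1
    WHERE s1.someTable_id IN (SELECT someTable_id FROM temp_table)
    ORDER BY s1.someTable_id
""")]

# The original nested form, for comparison.
nested = [row[0] for row in conn.execute("""
    SELECT s1.someTable_id
    FROM someTable s1
    WHERE s1.someTable_id IN
        (SELECT someTable_id FROM someTable s2 WHERE s2.Field = 123)
    ORDER BY s1.someTable_id
""")]
print(via_temp, nested)
```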

About the first query:
"the first query will evaluate the inner select first and then apply the
filter to the outer query."
It is not that simple.
In SQL it is mostly NOT possible to tell what will be executed first and what will be executed later,
because SQL is a declarative language.
Your "nested selects" are nested only visually, not technically.
Example 1: "someTable" has 10 rows and "otherTable" has 10000 rows.
In most cases the database optimizer will read "someTable" first and then check otherTable for a match. For that it may or may not use indexes, depending on the situation; my feeling is that in this case it will use the "indexedField" index.
Example 2: "someTable" has 10000 rows and "otherTable" has 10 rows.
In most cases the database optimizer will read all rows from "otherTable" into memory, filter them by 123, and then find the matches through the someTable PK (someTable_id) index. As a result, no indexes from "otherTable" will be used.
About the second query:
It is completely different from the first, so I don't know how to compare them:
The first query links the two tables by one pair: s.someTable_id = o.someTable_id
The second query links the two tables by two pairs: s.someTable_id = o.someTable_id AND o.someIndexedField = s.someIndexedField.
The common practice for linking two tables is your first query.
But o.someTable_id should be indexed.
So the common rules are:
all PKs should be indexed (they are indexed by default)
all columns used for filtering (such as those in the WHERE clause) should be indexed
all columns used to match rows between tables (including IN, JOIN, etc.) are also filtering columns, so they should be indexed too.
The DB engine will itself choose the best order of operations (or run them in parallel). In most cases you cannot determine this.
Use Oracle EXPLAIN PLAN (similar exists for most DBs) to compare execution plans of different queries on real data.
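Comparing plans can be scripted. The sketch below uses SQLite's EXPLAIN QUERY PLAN as a stand-in for Oracle's EXPLAIN PLAN, so the plan text will differ from what Oracle reports. The schema is a hypothetical reduction of the two queries in the question, and the helper name plan is made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema matching the column names used in the question.
    CREATE TABLE someTable (someTable_id INTEGER PRIMARY KEY, someIndexedField INTEGER);
    CREATE TABLE otherTable (someTable_id INTEGER, indexedField INTEGER,
                             someIndexedField INTEGER, anotherIndexedField INTEGER);
    CREATE INDEX ix_other_indexed ON otherTable (indexedField);
    CREATE INDEX ix_other_some ON otherTable (someIndexedField, anotherIndexedField);
""")


def plan(sql):
    """Return the human-readable steps SQLite reports for a query."""
    # The description is the last column of each EXPLAIN QUERY PLAN row.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


first_plan = plan("""
    SELECT * FROM someTable s
    WHERE s.someTable_id IN
        (SELECT o.someTable_id FROM otherTable o WHERE o.indexedField = 123)
""")
second_plan = plan("""
    SELECT * FROM someTable s
    WHERE s.someTable_id IN
        (SELECT o.someTable_id FROM otherTable o
         WHERE o.someIndexedField = s.someIndexedField
           AND o.anotherIndexedField = 123)
""")

for name, steps in (("first", first_plan), ("second", second_plan)):
    print(name)
    for step in steps:
        print("   ", step)
```

The plans only become meaningful on realistic data volumes and statistics, so the same comparison should be repeated on the real database.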

When I used
where not exists (select VAL_ID from #newVals where VAL_ID = OLDPAR.VAL_ID) directly, it took 20 sec. When I added the indexed temp table, it took 0 sec. I don't understand why. (As a C++ developer, I just imagine that internally there is a loop over the values.)
-- Temp table with an index gives me a big speedup
declare @newValID table (VAL_ID int INDEX IX1 CLUSTERED);
insert into @newValID select VAL_ID FROM #newVals
insert into #deleteValues
select OLDPAR.VAL_ID
from #oldVal AS OLDPAR
where
not exists (select VAL_ID from @newValID where VAL_ID=OLDPAR.VAL_ID)
or exists (select VAL_ID from #VaIdInternals where VAL_ID=OLDPAR.VAL_ID);
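The "loop over the values" intuition can be made concrete with a rough analogy in Python. This is not how SQL Server executes the query, and the function names are made up. It only contrasts a lookup that rescans an unindexed list for every outer row with a lookup into a structure built for fast membership tests, which is roughly the difference an index can make.

```python
def not_in_scan(old_vals, new_vals):
    """Anti-join by rescanning new_vals for every old value (no index)."""
    return [v for v in old_vals if all(v != n for n in new_vals)]


def not_in_lookup(old_vals, new_vals):
    """Anti-join using a set, which plays the role of the index here."""
    lookup = set(new_vals)  # built once, then each probe is cheap
    return [v for v in old_vals if v not in lookup]


old_vals = [1, 2, 3, 4, 5]
new_vals = [2, 4]

# Both return the same rows; only the amount of work per outer row differs.
print(not_in_scan(old_vals, new_vals))
print(not_in_lookup(old_vals, new_vals))
```

With n old values and m new values, the first version does on the order of n times m comparisons, while the second does roughly n plus m operations.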
