I have two tables - Keys and KeysTemp.
KeysTemp contains temporary data which should be merged with Keys using the Hash field.
Here is the query:
SELECT
r.[Id]
FROM
[KeysTemp] AS k
WHERE
r.[Hash] NOT IN (SELECT [Hash] FROM [Keys] WHERE [SourceId] = 10)
I have indexes on both tables for SourceId and Hash fields:
CREATE INDEX [IdxKeysTempSourceIdHash] ON [KeysTemp]
(
[SourceId],
[Hash]
);
I have the same index on the Keys table, but the query is still very slow.
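For completeness, that index on Keys would be spelled out like this (assuming an analogous name):
CREATE INDEX [IdxKeysSourceIdHash] ON [Keys]
(
[SourceId],
[Hash]
);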
There are 5 rows in the temporary table and about 60,000 in the main table. A query by hash takes about 27 milliseconds, but querying these 5 rows takes about 3 seconds.
I also tried splitting the index, i.e. creating separate indexes for SourceId and Hash, but it performs the same way. An OUTER JOIN works even worse here. How can I solve this issue?
UPDATE
If I remove WHERE [SourceId] = 10 from the query, it completes in 30 ms. That's great, but I need this condition :)
Thanks
Maybe
select k.id
from KeysTemp as k left outer join Keys as kk on (k.hash = kk.hash and kk.sourceid = 10)
where kk.hash is null;
? Assuming that r is k. Also, have you tried NOT EXISTS? I have no idea whether it behaves differently…
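For reference, a NOT EXISTS version might look like this (a sketch, with the same assumption that r means k):
select k.id
from KeysTemp as k
where not exists (select 1 from Keys as kk where kk.hash = k.hash and kk.sourceid = 10);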
I would do:
SELECT
k.[Id]
FROM
[KeysTemp] AS k
WHERE
k.[Id] NOT IN (SELECT A.[Id] FROM [KeysTemp] AS A, [Keys] AS B WHERE B.[SourceId] = 10 AND A.[Hash] = B.[Hash])
You list all elements in KeysTemp (only a few) that exist in Keys, then take the ones in KeysTemp that are not in that set.
If there are just a few new keys you could try this:
SELECT
k.[Id]
FROM
[KeysTemp] AS k
WHERE
k.[Id] NOT IN (SELECT kt.[Id] FROM [Keys] AS k1
INNER JOIN [KeysTemp] AS kt ON kt.Hash = k1.Hash
WHERE k1.[SourceId] = 10)
KeysTemp should have an index on the Hash column and Keys on the SourceId column.
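In DDL, those would be something like this (the index names are illustrative):
CREATE INDEX [IdxKeysTempHash] ON [KeysTemp] ([Hash]);
CREATE INDEX [IdxKeysSourceId] ON [Keys] ([SourceId]);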
I am trying to join a table with itself. Here is a MWE of the problem:
WITH ten_elems as (
SELECT letter, generate_uuid() randomid
FROM
UNNEST(SPLIT('aabcdefghij', '')) letter
),
l as (SELECT * FROM ten_elems),
r as (SELECT * FROM ten_elems)
--SELECT * FROM l INNER JOIN r on l.randomid = r.randomid
SELECT * FROM l INNER JOIN r on l.letter = r.letter
If you run this, you will see that the random IDs on the left and on the right are different. Obviously, if you uncomment the other join instead, it returns no results. The same happens for row_number() OVER (), and because my top-level elements are not unique I cannot simply use row_number() OVER (ORDER BY letter), as it will still (potentially) assign different IDs to the two "a" entries.
The actual table is obviously far more complex and contains arrays of arrays. However, as here, the top-level elements are not necessarily unique, so I need to generate UIDs before unnesting so that I can later join them back together correctly.
I understand that a work-around would be to save the table with the UID first, and then do the self-join, but I had hoped I wouldn't need to do that, as in general this data doesn't need identification at this level. So if there is some way of making the UID persistent through my queries, rather than generated anew on-demand, it would really help me.
WITH tables are kept in memory, and I think generate_uuid() is not persistent because it was designed to always regenerate unique values, even on in-memory access. If you create a true temporary table, that fixes the issue.
Here is an example script that creates a temporary table at your-project.dataset.test_guid_2 with a 5-second expiration and then uses it:
CREATE TABLE `your-project.dataset.test_guid_2`
OPTIONS(
expiration_timestamp=TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 5 SECOND)
) AS
SELECT letter, CAST(generate_uuid() AS STRING) randomid
FROM
UNNEST(SPLIT('abcdefghij', '')) letter;
WITH
l as (SELECT * FROM `your-project.dataset.test_guid_2`),
r as (SELECT * FROM `your-project.dataset.test_guid_2`)
--SELECT * FROM l INNER JOIN r on l.randomid = r.randomid
SELECT * FROM l INNER JOIN r on l.letter = r.letter
I have a table T with some 500,000 records. It is a hierarchical table.
My goal is to update the table by self-joining it, based on a condition for the parent-child relationship.
The update query is taking really long because the number of rows is really high. I have created a unique index on the columns that help identify the rows to update (meaning X and Y). After creating the index the cost has gone down, but the query is still very slow.
This is my query format:
update T
set (a1, b1)
= (select parent.a1, parent.b1
from T parent, T child
where parent.id = child.parent_id
and T.X = child.X
and T.Y = child.Y)
After creating the index, the execution plan shows that it is doing an index scan for CRS.PARENT but a full table scan for CRS.CHILD, and also during the update; as a result, the query takes forever to complete.
Please suggest any tips or recommendations to solve this problem
You are updating all 500,000 rows, so an index is a bad idea. 500,000 index lookups will take much longer than necessary.
You would be better served using a MERGE statement.
It is hard to tell exactly what your table structure is, but it would look something like this, assuming X and Y are the primary key columns in T (...could be wrong about that):
MERGE INTO T
USING ( SELECT TC.X,
TC.Y,
TP.A1,
TP.B1
FROM T TC
INNER JOIN T TP ON TP.ID = TC.PARENT_ID ) U
ON ( T.X = U.X AND T.Y = U.Y )
WHEN MATCHED THEN UPDATE SET T.A1 = U.A1,
T.B1 = U.B1;
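To sanity-check the source rows before running the MERGE, you could execute the USING subquery on its own (same assumed columns as above):
SELECT TC.X,
TC.Y,
TP.A1,
TP.B1
FROM T TC
INNER JOIN T TP ON TP.ID = TC.PARENT_ID;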
I retrieve data by joining multiple tables, as indicated in the image below. Since there is no data in the FK column (EmployeeID) of the Event table, I have to join the two tables on the CardNo (nvarchar) fields instead. Because the CardNo fields in the Event and Employee tables have different numbers of digits, I also have to use the RIGHT function of SQL Server, and this makes the query take approximately 10 times longer to execute. So, in this situation, what should I do? Can I use the CardNo field without changing its data type to int, etc.? (Other problems might appear after changing it, so it would be better to find a solution that leaves its data type alone.) The execution plan of the query is also below.
Query:
; WITH a AS (SELECT emp.EmployeeName, emp.Status, dep.DeptName, job.JobName, emp.CardNo
FROM TEmployee emp
LEFT JOIN TDeptA AS dep ON emp.DeptAID = dep.DeptID
LEFT JOIN TJob AS job ON emp.JobID = job.JobID),
b AS (SELECT eve.EventID, eve.EventTime, eve.CardNo, evt.EventCH, dor.DoorName
FROM TEvent eve LEFT JOIN TEventType AS evt ON eve.EventType = evt.EventID
LEFT JOIN TDoor AS dor ON eve.DoorID = dor.DoorID)
SELECT * FROM b LEFT JOIN a ON RIGHT(a.CardNo, 8) = RIGHT(b.CardNo, 8)
ORDER BY b.EventID ASC
You can add a computed column to your table like this:
ALTER TABLE TEmployee -- Don't start your table names with prefixes, you already know they're tables
ADD CardNoRight8 AS RIGHT(CardNo, 8) PERSISTED
ALTER TABLE TEvent
ADD CardNoRight8 AS RIGHT(CardNo, 8) PERSISTED
CREATE INDEX TEmployee_CardNoRight8_IDX ON TEmployee (CardNoRight8)
CREATE INDEX TEvent_CardNoRight8_IDX ON TEvent (CardNoRight8)
You don't need to persist the column since it already matches the criteria for a computed column to be indexed, but adding the PERSISTED keyword shouldn't hurt and might help the performance of other queries. It will cause a minor performance hit on updates and inserts, but that's probably fine in your case unless you're importing a lot of data (millions of rows) at a time.
The better solution, though, is to make sure that the columns that are supposed to match actually match. If the right 8 characters of the card number are something meaningful, then they shouldn't be part of the card number; they should be another column. If this is an issue where one table uses leading zeroes and the other doesn't, then you should fix that data to be consistent instead of putting together workarounds like this.
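For example, if one table lacks the leading zeros, a one-time cleanup might look like this (a sketch, assuming an 8-character card number is the canonical form):
UPDATE TEvent
SET CardNo = RIGHT('00000000' + CardNo, 8)
WHERE LEN(CardNo) < 8;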
This line is what is costing you 86% of the query time:
LEFT JOIN a ON RIGHT(a.CardNo, 8) = RIGHT(b.CardNo, 8)
This is happening because it has to run RIGHT() on those fields for every row and then match them with the other table. This is obviously going to be inefficient.
The most straightforward solution is probably to either remove the RIGHT() entirely or else to re-implement it as a built-in column on the table so it doesn't have to be calculated on the fly while the query is running.
When inserting a record, you would also have to insert the rightmost eight digits of the card number and store them in this field. My original thought was to use a computed column, but I wasn't sure those can be indexed, so you may have to use a regular column.
; WITH a AS (
SELECT emp.EmployeeName, emp.Status, dep.DeptName, job.JobName, emp.CardNoRightEight
FROM TEmployee emp
LEFT JOIN TDeptA AS dep ON emp.DeptAID = dep.DeptID
LEFT JOIN TJob AS job ON emp.JobID = job.JobID
),
b AS (
SELECT eve.EventID, eve.EventTime, eve.CardNoRightEight, evt.EventCH, dor.DoorName
FROM TEvent eve LEFT JOIN TEventType AS evt ON eve.EventType = evt.EventID
LEFT JOIN TDoor AS dor ON eve.DoorID = dor.DoorID
)
SELECT *
FROM b
LEFT JOIN a ON a.CardNoRightEight = b.CardNoRightEight
ORDER BY b.EventID ASC
This will help you see how to add a calculated column to your database.
create table #temp (test varchar(30))
insert into #temp
values('000456')
alter table #temp
add test2 as right(test, 3) persisted
select * from #temp
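And to confirm that the persisted computed column can be indexed (hypothetical index name):
create index ix_temp_test2 on #temp (test2)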
The other alternative is to fix the data and the data entry so that both columns have the same data type and contain the same leading zeros (or remove them).
Many thanks for all of your help. With the help of your answers, I managed to reduce the query execution time from 2 minutes to 1 in the first step, by using computed columns. After that, by creating an index on these columns, I reduced the execution time to 3 seconds. Wow, it is really perfect :)
Here are the steps, posted for those who suffer from a similar problem:
Step I: Add computed columns to the tables (as the CardNo fields are of nvarchar data type, I specify the data type of the computed columns as int):
ALTER TABLE TEvent ADD CardNoRightEight AS CAST(RIGHT(CardNo, 8) AS int)
ALTER TABLE TEmployee ADD CardNoRightEight AS CAST(RIGHT(CardNo, 8) AS int)
Step II: Create index for the computed columns in order to execute the query faster:
CREATE INDEX TEmployee_CardNoRightEight_IDX ON TEmployee (CardNoRightEight)
CREATE INDEX TEvent_CardNoRightEight_IDX ON TEvent (CardNoRightEight)
Step III: Update the query to use the computed columns:
; WITH a AS (
SELECT emp.EmployeeName, emp.Status, dep.DeptName, job.JobName, emp.CardNoRightEight --emp.CardNo
FROM TEmployee emp
LEFT JOIN TDeptA AS dep ON emp.DeptAID = dep.DeptID
LEFT JOIN TJob AS job ON emp.JobID = job.JobID
),
b AS (
SELECT eve.EventID, eve.EventTime, evt.EventCH, dor.DoorName, eve.CardNoRightEight --eve.CardNo
FROM TEvent eve
LEFT JOIN TEventType AS evt ON eve.EventType = evt.EventID
LEFT JOIN TDoor AS dor ON eve.DoorID = dor.DoorID)
SELECT * FROM b LEFT JOIN a ON a.CardNoRightEight = b.CardNoRightEight --ON RIGHT(a.CardNo, 8) = RIGHT(b.CardNo, 8)
ORDER BY b.EventID ASC
I am joining two tables with a left join:
The first table is quite simple
create table L (
id integer primary key
);
and contains only a handful of records.
The second table is
create table R (
L_id null references L,
k text not null,
v text not null
);
and contains millions of records.
The following two indexes are on R:
create index R_ix_1 on R(L_id);
create index R_ix_2 on R(k);
This select statement, imho, selects the wrong index:
select
L.id,
R.v
from
L left join
R on
L.id = R.L_id and
R.k = 'foo';
An EXPLAIN QUERY PLAN tells me that the select statement uses the index R_ix_2, and the execution of the select takes too much time. I believe the performance would be much better if SQLite chose to use R_ix_1 instead.
I also tried
select
L.id,
R.v
from
L left join
R indexed by R_ix_1 on
L.id = R.L_id and
R.k = 'foo';
but that gave me Error: no query solution.
Is there something I can do to make sqlite use the other index?
Your join condition relies on 2 columns, so your index should cover those 2 columns:
create index R_ix_1 on R(L_id, k);
If you run other queries that rely on only a single column, you can keep the old indexes, but you still need this two-column index as well:
create index R_ix_1 on R(L_id);
create index R_ix_2 on R(k);
create index R_ix_3 on R(L_id, k);
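You can verify which index the query ends up using with SQLite's EXPLAIN QUERY PLAN:
explain query plan
select L.id, R.v
from L left join R on L.id = R.L_id and R.k = 'foo';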
I wonder if the SQLite optimizer just gets confused in this case. Does this work better?
select L.id, R.v
from L left join
R
on L.id = R.L_id
where R.k = 'foo' or R.k is NULL;
EDIT:
Of course, SQLite will only use an index if the types of the columns are the same. The question doesn't specify the type of L_id. If it is not the same as the type of the primary key, then the index (probably) will not be used.
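For example, declaring the foreign key column with an explicit matching type would look like this (a sketch, assuming the primary key of L is an INTEGER):
create table R (
L_id integer references L,
k text not null,
v text not null
);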
I have a query that is very slow due to an IS NULL check in the WHERE clause. At least, that's what it looks like. The query needs over a minute to complete.
Simplified query:
SELECT DISTINCT TOP 100 R.TermID2, NP.Title, NP.JID
FROM Titles NP
INNER JOIN Term P
ON NP.TermID = P.ID
INNER JOIN Relation R
ON P.ID = R.TermID2
WHERE R.TermID1 IS NULL -- the culprit?
AND NP.JID = 3
I have non-unique, non-clustered and unique, clustered indices on all of the mentioned fields, as well as an extra index that covers R.TermID1 and has a filter TermID1 IS NULL.
Term has 2,835,302 records. Relation has 25,446,678 records, of which about 10% have a NULL TermID1.
The SQL plan in XML form is here: http://pastebin.com/raw.php?i=xcDs0VD0
So, I was messing around with the index of the largest table: adding filtered indexes, covering columns, changing around the clauses, etc.
At one point I simply deleted the index and recreated it with the old configuration, and it worked!
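For illustration, the rebuild amounted to something like this (a sketch; the index name and key column are hypothetical):
DROP INDEX IX_Relation_TermID1_Filtered ON Relation;
CREATE NONCLUSTERED INDEX IX_Relation_TermID1_Filtered
ON Relation (TermID2)
WHERE TermID1 IS NULL;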
You could remove the WHERE clause and put the conditions into the JOIN clauses.
SELECT DISTINCT TOP 100 R.TermID2, NP.Title, NP.JID
FROM Titles NP
INNER JOIN Term P
ON NP.TermID = P.ID AND NP.JID = 3
INNER JOIN Relation R
ON P.ID = R.TermID2 AND R.TermID1 IS NULL