I know it's better to use INSERT WHERE NOT EXISTS than a plain INSERT, since a plain INSERT can lead to duplicate records or unique-key violation issues.
But with respect to performance, will it make any big difference?
INSERT WHERE NOT EXISTS internally triggers an extra SELECT to check whether the record already exists. For large tables, which is recommended: INSERT or INSERT WHERE NOT EXISTS?
And could someone please explain the difference in execution cost between the two?
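For reference, a minimal sketch of the two forms I am asking about (target_table and its columns are just placeholders; the exact INSERT ... SELECT syntax varies slightly by database):

-- Plain INSERT: no existence check; a duplicate either gets inserted or raises a unique-key error
INSERT INTO target_table (id, name)
VALUES (1, 'foo');

-- INSERT ... WHERE NOT EXISTS: the row is skipped when a match is already present
INSERT INTO target_table (id, name)
SELECT 1, 'foo'
WHERE NOT EXISTS (
    SELECT 1 FROM target_table t WHERE t.id = 1
);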
Most Oracle IN clause queries involve a series of literal values; when the values come from another table, a standard join is usually better. In most cases the Oracle cost-based optimizer will create an identical execution plan for IN vs EXISTS, so there is no difference in query performance.
The EXISTS keyword evaluates to true or false, but the IN keyword compares all values in the corresponding subquery column. If you are using the IN operator, the SQL engine will scan all records fetched by the inner query. On the other hand, if we are using EXISTS, the SQL engine will stop the scanning process as soon as it finds a match.
The EXISTS subquery is used when we want to display all rows where we have a matching column in both tables. In most cases, this type of subquery can be re-written with a standard join to improve performance.
The EXISTS clause is much faster than IN when the subquery result set is very large. Conversely, the IN clause is faster than EXISTS when the subquery result set is very small.
Also, the IN clause can't match anything against NULL values, whereas the EXISTS clause is not affected by NULLs.
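A hedged illustration of the IN vs EXISTS comparison above (employees and departments are placeholder tables, not from the question):

-- IN: the subquery's values are collected and compared against the outer column
SELECT e.name
FROM employees e
WHERE e.dept_id IN (SELECT d.dept_id FROM departments d WHERE d.active = 1);

-- EXISTS: the engine can stop probing as soon as one matching row is found
SELECT e.name
FROM employees e
WHERE EXISTS (
    SELECT 1 FROM departments d
    WHERE d.dept_id = e.dept_id AND d.active = 1
);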
It's not a matter of "what's fastest" but a matter of "what's correct".
When you INSERT into a table (without any restriction) you simply add records to that table. If an identical record already was in there, this will result in there now being two such records. This may be fine or this may be an issue, depending on your needs (**).
When you add a WHERE NOT EXISTS() to your INSERT construction the system will only add records to the table that aren't there yet, thus avoiding the situation of ending up with multiple identical records.
(**: suppose you have a unique or primary key constraint on the target table; then the INSERT of a duplicate record will result in a UQ/PK violation error. If your question was "What's fastest: try to insert the row and simply ignore such an error, versus insert WHERE NOT EXISTS() and avoid the error?" then I can't give you a conclusive answer, but I'm fairly certain it will be a close call. What I can say, however, is that the WHERE NOT EXISTS() approach will look much nicer in code and (importantly!) it will also work for set-based operations, as sketched below; the try/catch approach will fail for the entire set even if only one record causes an issue.)
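To illustrate the set-based point, a minimal sketch (staging_table and target_table are hypothetical names): a single INSERT ... SELECT ... WHERE NOT EXISTS loads a whole batch while silently skipping the rows that are already present.

INSERT INTO target_table (id, name)
SELECT s.id, s.name
FROM staging_table s
WHERE NOT EXISTS (
    SELECT 1 FROM target_table t WHERE t.id = s.id
);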
INSERT will check the inserted data against any existing schema constraints: PK, FK, unique indexes, NOT NULLs and any other custom constraints, whatever the table schema demands. If those checks pass, the row will be inserted and processing moves on to the next row.
INSERT WHERE NOT EXISTS, prior to the above checks, will compare the data of all columns of the new row against the data of all rows already in the table. If even one column differs, the row is considered new, and it then proceeds exactly as the INSERT above.
The performance impact mostly depends on:
1. the number of existing rows in the table
2. the size of each row
So as the table gets larger and the rows get wider, the difference grows.
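One practical mitigation (my own addition, not part of the answer above): if the NOT EXISTS condition only compares a handful of key columns, an index on those columns turns the existence check into an index probe instead of a scan of the whole table.

-- Hypothetical: index the lookup columns used by the NOT EXISTS check
CREATE INDEX ix_target_lookup ON target_table (id);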
Related
So here's the specific situation: every entry in the database has a unique, indexed primary key, but each row also has a secondID referring to an attribute of the entry, and as such, the secondIDs are not unique. There is also another attribute of these rows, let's call it isTitle, which is NULL by default, but each group of entries with the same secondID has at least one entry with isTitle = 1.
Considering the conditions above, would a WHERE clause increase the processing speed of the query or not? See the following:
SELECT DISTINCT secondID FROM table;
vs.
SELECT DISTINCT secondID FROM table WHERE isTitle = 1;
EDIT:
The first query, without the WHERE clause, is faster, but could someone explain to me why? Algorithmically, the filtered version only adds one 'if' per row to the loop, so shouldn't it be faster?
In general, to benchmark query performance, you usually use statements that give you the execution plan of the query they receive as input (every small step the engine performs to answer your request).
You are not mentioning your database engine (e.g. PostgreSQL, SQL Server, MySQL), but for example in PostgreSQL the query is the following:
EXPLAIN SELECT DISTINCT secondID FROM table WHERE isTitle = 1;
Going back to your question: since isTitle is not indexed, I think the first thing the engine will do is a full scan of the table to check that attribute, and only then perform the SELECT. Hence, in my opinion, the first query:
SELECT DISTINCT secondID FROM table;
will be faster.
If you want to optimize it, you can create an index on the isTitle column. In that scenario, the query with the WHERE clause should become faster.
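For example (the index name is arbitrary, and the exact syntax depends on your engine):

CREATE INDEX idx_table_istitle ON table (isTitle);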
This is a very hard question to answer, particularly without specifying the database. Here are three important considerations:
Will the database engine use the index on secondID for select distinct? Any decent database optimizer should, but that doesn't mean that all do.
How wide is the table relative to the index? That is, is scanning the index really that much faster than scanning the table?
What is the ratio of isTitle = 1 to all rows with the same value of secondId?
For the first query, there are essentially two ways to process the query:
Scan the index, taking each unique value as it comes.
Scan the table, sort or hash the table, and choose the unique values.
If it is not obvious, (1) is much faster than (2), except perhaps in trivial cases where there are a small number of rows.
For the second query, the only real option is:
Scan the table, filter out the non-matching values, sort or hash the table, and choose the unique values.
The key issues here are how much data needs to be scanned and how much is filtered out. It is even possible -- if you had, say, zillions of rows per secondID, no additional columns, and a small number of values -- that this might be comparable to or slightly faster than (1) above. There is a little overhead for scanning an index, and sorting a small amount of data is often quite fast.
And, this method is almost certainly faster than (2).
As mentioned in the comments, you should test the queries on your system with your data (use a reasonable amount of data!). Or, update the table statistics and learn to read execution plans.
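For instance, in PostgreSQL (other engines have their own equivalents) you could refresh the statistics and look at the actual plan like this:

ANALYZE table;
EXPLAIN ANALYZE SELECT DISTINCT secondID FROM table WHERE isTitle = 1;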
I have a table of 50 million records with 17 columns. I want to publish this data into some other tables that I have built for the purpose.
I wrote a SQL script for this work, but it is very slow.
The main problem is that before inserting a record into a table, I must check that the record does not already exist in that table.
Of course, I have already done some optimization in my code; for example, I replaced a cursor with a WHILE loop. But the speed is still very low.
What can I do to increase the speed?
I must check that the record does not already exist in that table.
Let the database do the work via a unique constraint or index. Decide on the columns that cannot be identical and run something like:
create unique index unq_t_col1_col2_col3 on t(col1, col2, col3);
The database will then return an error if you attempt to insert a duplicate.
This is standard functionality and should be available in any database. But, you should tag your question with the database you are using and provide more information about what you mean by a duplicate.
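As a sketch of the resulting flow (the column names are just the placeholders from the index above, and the duplicate-skipping syntax shown is MySQL-specific, offered only as an example):

-- Fails with a duplicate-key error if (col1, col2, col3) is already present
INSERT INTO t (col1, col2, col3) VALUES (1, 2, 3);

-- MySQL, for instance, can silently skip rows that would violate the unique index
INSERT IGNORE INTO t (col1, col2, col3) VALUES (1, 2, 3);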
I'm working with PostgreSQL 9.1. Let's say I have a table where some columns have a UNIQUE constraint. The easiest example:
CREATE TABLE test (
    value INTEGER NOT NULL UNIQUE
);
Now, when inserting some values, I have to separately handle the case where the values to be inserted are already in the table. I have two options:
Make a SELECT beforehand to ensure the values are not in the table, or:
Execute the INSERT and watch for any errors the server might return.
The application utilizing the PostgreSQL database is written in Ruby. Here's how I would code the second option:
require 'pg'
db = PG.connect(...)
begin
  db.exec('INSERT INTO test VALUES (66)')
rescue PG::UniqueViolation
  # ... the values are already in the table
else
  # ... the values were brand new
end
db.close
Here's my thinking: suppose we make a SELECT first, before inserting. The SQL engine would have to scan the rows and return any matching tuples. If there are none, we make the INSERT, which presumably performs yet another scan to see whether the UNIQUE constraint is about to be violated. So, in theory, the second option would speed execution up by 50%. Is this how PostgreSQL would actually behave?
We're assuming there's no ambiguity when it comes to the exception itself (e.g. we only have one UNIQUE constraint).
Is it a common practice? Or are there any caveats to it? Are there any more alternatives?
It depends: if your application UI normally allows entering duplicate values, then it's strongly encouraged to check before inserting, because any error would invalidate the current transaction, consume sequence/serial values, fill the logs with error messages, etc.
But if your UI does not allow duplicates, and inserting a duplicate is only possible when somebody is using tricks (for example during vulnerability research) or is highly improbable, then I'd allow inserting without checking first.
As a unique constraint forces the creation of an index, this check is not slow, but it is definitely slightly slower than inserting and checking for the rare error. Postgres 9.5 will have ON CONFLICT DO NOTHING support, which will be both fast and safe; you'd check the number of rows inserted to detect duplicates.
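For reference, the 9.5 syntax would look roughly like this, using the test table from the question (the conflict target assumes the UNIQUE constraint on value):

INSERT INTO test (value)
VALUES (66)
ON CONFLICT (value) DO NOTHING;
-- a reported row count of 0 means the value was a duplicate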
You don't (and shouldn't) have to test before; you can test while inserting. Just add the test as a WHERE clause. The following insert inserts either zero or one tuple, depending on the existence of a row with the same value (and it certainly is not slower):
INSERT INTO test (value)
SELECT 55
WHERE NOT EXISTS (
    SELECT * FROM test
    WHERE value = 55
);
Though your error-driven approach may look elegant from the client side, from the database side it is a near-disaster: the current transaction is rolled back implicitly and all cursors (including prepared statements) are closed. (Thus: your application will have to rebuild the complete transaction, but without the error, and start again.)
Addition: when adding more than one row you can put the VALUES() into a CTE and refer to the CTE in the insert query:
WITH vvv(val) AS (
    VALUES (11), (22), (33), (44), (55), (66)
)
INSERT INTO test (value)
SELECT val FROM vvv
WHERE NOT EXISTS (
    SELECT *
    FROM test nx
    WHERE nx.value = vvv.val
);
-- SELECT * FROM test;
Nitpicker Question:
I'd like to have a function returning a boolean that checks whether a table has an entry or not. And I need to call this a lot, so some optimization is needed.
I use MySQL for now, but this should be fairly basic...
So should I use
select id from table where a=b limit 1;
or
select count(*) as cnt from table where a=b;
or something completely different?
I think the SELECT with LIMIT should stop after the first match, while COUNT(*) needs to check all entries, so the SELECT could be faster.
The simplest thing would be to run a few loops and test it, but my tests were not helpful. (My test system seemed to be in use by others too, which diluted my results.)
this "need" is often indicative of a situation where you are trying to INSERT or UPDATE. the two most common situations are bulk loading/updating of rows, or hit counting.
checking for existence of a row first can be avoided using the INSERT ... ON DUPLICATE KEY UPDATE statement. for a hit counter, just a single statement is needed. for bulk loading, load the data in to a temporary table, then use INSERT ... ON DUPLICATE KEY UPDATE using the temp table as the source.
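(Table and column names below are only examples, not from the question.)

-- Hit counter: insert the row, or bump the counter if the key already exists
-- (assumes page_id is the primary or unique key)
INSERT INTO hits (page_id, hit_count)
VALUES (42, 1)
ON DUPLICATE KEY UPDATE hit_count = hit_count + 1;

-- Bulk load: stage the data, then merge it in one statement
INSERT INTO target (id, name)
SELECT id, name FROM tmp_stage
ON DUPLICATE KEY UPDATE name = VALUES(name);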
But if you can't use this, then the fastest way will be select id from table where a=b limit 1; along with FORCE INDEX to make sure MySQL looks ONLY at the index.
The LIMIT 1 tells MySQL to stop searching after it finds one row. If there can be multiple rows that match the criteria, this is faster than COUNT(*).
There are more ways to optimize this, but the exact nature would depend on the number of rows and the spread of a and b. I'd go with the "where a=b" approach until you actually encounter performance issues. Databases are often so fast that most queries are no performance issue at all.
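If you want the boolean directly in SQL, one common pattern (my own addition, not from the answers above) is to wrap the probe in EXISTS, which stops at the first match just like LIMIT 1:

SELECT EXISTS(SELECT 1 FROM table WHERE a = b) AS has_entry;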
Will limiting a query to one result record improve performance in a large(ish) MySQL table if the table only has one matching result?
For example:
select * from people where name = "Re0sless" limit 1
if there is only one record with that name? And what about if name were the primary key / set to unique? And is it worth updating the query, or will the gain be minimal?
If the column has:
- a unique index: no, it's no faster.
- a non-unique index: maybe, because it will prevent sending any additional rows beyond the first matched, if any exist.
- no index: sometimes.
  - If 1 or more rows match the query: yes, because the full table scan will be halted after the first row is matched.
  - If no rows match the query: no, because it will need to complete a full table scan.
If you have a slightly more complicated query, with one or more joins, the LIMIT clause gives the optimizer extra information. If it expects to match two tables and return all rows, a hash join is typically optimal; a hash join is a type of join optimized for matching large numbers of rows.
Now if the optimizer knows you've passed LIMIT 1, it knows that it won't be processing large amounts of data, so it can revert to a nested-loop join.
Based on the database (and even database version) this can have a huge impact on performance.
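As a hedged illustration of the kind of query where this matters (the orders table and its join are placeholders, not from the question):

SELECT p.*
FROM people p
JOIN orders o ON o.person_id = p.id
WHERE p.name = "Re0sless"
LIMIT 1;
-- without the LIMIT, a hash join over the full result may be chosen;
-- with LIMIT 1, a nested-loop plan that stops early can win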
To answer your questions in order:
1) Yes, if there is no index on name. The query will end as soon as it finds the first record. Take off the limit and it has to do a full table scan every time.
2) No. Primary/unique keys are guaranteed to be unique, so the query should stop running as soon as it finds the row.
I believe the LIMIT is applied after the data set is found and while the result set is being built up, so I wouldn't expect it to make any difference at all. Making name the primary key will have a significant positive effect, though, as it will result in an index being created for the column.
If "name" is unique in the table, then there may still be a (very very minimal) gain in performance by putting the limit constraint on your query. If name is the primary key, there will likely be none.
Yes, you will notice a performance difference when dealing with the data. One record takes up less space than multiple records. Unless you are dealing with many rows this won't be much of a difference, but once you run the query the data has to be returned to you, which is costly, or handled programmatically. Either way, one record is easier to deal with than multiple.