Does the SQL NOT IN operator scale well? [closed]

I'm writing an app where users take quizzes, and I want to show each user only the quizzes they haven't attempted before. For this I'm using SELECT id, name, problem FROM quizzes WHERE id NOT IN (...).
Imagine that there will be thousands of ids and quizzes.
Is that OK? How does it scale? Do I need to redesign something, use a database better suited to this, or use another technique to achieve my goal?

If you have a fixed list, then it should be fine.
If you have a subquery, then I strongly encourage not exists:

select f.*
from foo f
where not exists (select 1 from bar b where b.quiz_id = f.quiz_id)

I recommend this based on the semantics of not exists versus not in: not exists handles NULL values more intuitively.
That said, with appropriate indexes, not exists often has the better performance in most databases as well.
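To see why the semantics matter, consider a sketch like the following (the completed_quizzes table is a made-up name for illustration). If the subquery returns even one NULL, not in matches nothing, while not exists behaves as expected:

-- Returns no rows at all if any completed_quizzes.quiz_id is NULL,
-- because "id NOT IN (..., NULL)" can never evaluate to true.
select q.id, q.name, q.problem
from quizzes q
where q.id not in (select c.quiz_id from completed_quizzes c);

-- Returns the expected rows regardless of NULLs in completed_quizzes.
select q.id, q.name, q.problem
from quizzes q
where not exists (select 1 from completed_quizzes c where c.quiz_id = q.id);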

You should also consider that each database engine imposes a limit on SQL statement length. Though I haven't tested those limits, a thousand values in an IN operator should still work well for most databases; I would expect that if you scale it up to 10k or more you could hit some databases' limits and your statements will fail.
I would suggest rethinking this solution unless you can verify that the worst possible case (with the maximum number of parameters) still works well.
Usually a subquery can do the job, instead of manually sending a thousand parameters or assembling a big SQL statement by concatenating strings.
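For example, instead of building up a literal list in application code, the list can come from a subquery over whatever table records attempts. A sketch, where completed_quizzes and the :user_id bind parameter are assumed names, not details from the original question:

select q.id, q.name, q.problem
from quizzes q
where not exists (
    select 1
    from completed_quizzes c
    where c.quiz_id = q.id
      and c.user_id = :user_id    -- only this user's attempts
);

This keeps the statement a constant size no matter how many quizzes the user has completed.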

Related

EndDate on a dimension table - should we go with NULL or a 99991231 date value? [closed]

I am building a data warehouse on SQL Server and I was wondering what the best approach is for handling the current record in a dimension table (SCD type 2) with respect to the 'end_date' attribute.
For the current record, we have the option of using a date literal such as '12/31/9999' or specifying it as NULL. The dimension tables also have a 'current_flag' attribute in addition to 'start_date' and 'end_date'.
It is probably a minor design decision, but I just wanted to see whether there are any advantages of one over the other in terms of query performance or in any other way.
I have seen systems written both ways. Personally, I go for the infinite end date (not NULL), and the reason is simple: it is easier to validate that the type-2 records are properly tiled, with no gaps or overlaps. I prefer one validation to two -- the other being the validation of the is_current flag. There is also only one correct way of accessing the data.
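For instance, with a non-NULL end date the tiling check can be a single window-function query. This is only a sketch; dim_customer, customer_id, and half-open [start_date, end_date) intervals are assumptions, not details from the question:

-- Flag any row whose successor does not start exactly where it ends.
select customer_id, start_date, end_date
from (
    select customer_id, start_date, end_date,
           lead(start_date) over (partition by customer_id
                                  order by start_date) as next_start
    from dim_customer
) t
where next_start is not null
  and next_start <> end_date;   -- gap if later, overlap if earlier

With NULL end dates, the comparison would need extra COALESCE handling, which is exactly the second validation being avoided.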
That said, a system that I'm currently working on also publishes a view with only the current records. That is handy.
That system is not in SQL Server. One optimization that you can attempt is clustering so the current records are all colocated -- assuming they are accessed much more often. You can do this with either method. Using a clustered index like this makes updates more expensive, but it can be handy for optimizing memory.

Can converting a SQL query to PL/SQL improve performance in Oracle 12c? [closed]

I have been given an 800-line SQL query which is taking around 20 hours to fetch around 400 million records.
There are 13 tables which are partitioned by month.
The tables have records ranging from 10k to 400 million in each partition.
The tables are indexed on primary keys.
The query uses many inline views, outer joins, and a few GROUP BY aggregations.
DBAs say we cannot add more indexes as it would slow down the performance since it is an OLTP system.
I have been asked to convert the query logic to PL/SQL, populate a table in chunks, and then do a select * from that table.
My end result should be a query which can be fed to my application.
So even after I use PL/SQL to populate a table in chunks, ultimately I need to fetch the data from that table with a query.
My question is: since PL/SQL would require both a select and an insert, is there any chance PL/SQL can be faster than plain SQL?
Are there any cases where PL/SQL is faster for a result that is also achievable in SQL?
I will be happy to provide more information if the given info doesn't suffice.
Implementing it as a stored procedure could be faster because the SQL will already be parsed and compiled when the procedure is created. However, given the volume of data you are describing, it's unclear if this will make a significant difference. All you can do is try it and see.
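The chunked population you describe would typically use bulk binds. A minimal sketch, assuming a staging table called staging_results and a placeholder view standing in for the real 800-line query (both names are made up):

DECLARE
    CURSOR src IS
        SELECT * FROM some_800_line_view;   -- stands in for the real query
    TYPE buf_t IS TABLE OF src%ROWTYPE;
    buf buf_t;
BEGIN
    OPEN src;
    LOOP
        -- Fetch and insert 10,000 rows at a time to bound memory use.
        FETCH src BULK COLLECT INTO buf LIMIT 10000;
        EXIT WHEN buf.COUNT = 0;
        FORALL i IN 1 .. buf.COUNT
            INSERT INTO staging_results VALUES buf(i);
        COMMIT;
    END LOOP;
    CLOSE src;
END;
/

Note that this still executes the same underlying SELECT, so it can only help with fetch and commit mechanics, not with a fundamentally slow query plan.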
I think you really need to identify where the performance problem is, i.e. where the time is being spent. For example (and I have seen examples of this many times), the majority of the time might be in fetching the 400M rows to whatever the "client" is. In that case, rewriting the query, in SQL or PL/SQL, will make no difference.
Anyway, once you can pin down the problem, you have a better chance of getting sound answers rather than guesses...

Best way to do an anti-join in Redshift [closed]

An anti-join is a way of getting tuples in one table that don't match in another table.
There are numerous ways we can implement an anti-join (each sketched below):
Correlated sub-query
Uncorrelated Sub-query
Outer Join and Check for NULL
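For concreteness, the three variants might look like this (a sketch, assuming two tables a and b joined on a column named key):

-- 1. Correlated sub-query
select a.* from a
where not exists (select 1 from b where b.key = a.key);

-- 2. Uncorrelated sub-query (beware NULLs in b.key)
select a.* from a
where a.key not in (select b.key from b);

-- 3. Outer join and check for NULL
select a.* from a
left join b on b.key = a.key
where b.key is null;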
Which is the optimal way to perform an anti-join in Redshift? The correlated sub-query is not optimal in this case, and Redshift's query engine does not decorrelate that query.
As @denismo said, it's hard without context, but you can try something like:

SELECT ...
FROM a LEFT OUTER JOIN b USING (key)
WHERE b.key IS NULL

It will return the rows in a whose key does not exist in b.
Without the actual table layout, statistics, and query, it's hard to give a definitive answer to this.
I could speculate that this kind of task would require a hash join on the Redshift side. If one of the tables is not too large, and both have the same dist keys, it can be done efficiently in memory. The outer join with a NULL check then sounds the most suitable, as it is the most predictable: it will do just that.
But the actual performance will depend on many factors; better to come back once you have the tables and query.
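As an illustration of the dist-key point, both tables can be distributed on the join column so the anti-join is evaluated without redistributing rows across nodes (a sketch; table and column names are invented):

CREATE TABLE a (key BIGINT, payload VARCHAR(100)) DISTSTYLE KEY DISTKEY (key);
CREATE TABLE b (key BIGINT) DISTSTYLE KEY DISTKEY (key);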

Represent Is"Something" data in SQL Server [closed]

Joe Celko (SQL guru) says that we should not use proprietary data types, and should especially refrain from machine-level things like bit or byte, since SQL is a high-level language; the basic principle of data modeling is data abstraction. Given that recommendation, for fields like "IsActive" etc., what would the correct choice of data type be: one that is very portable and one that is deciphered clearly by front-end layers? Thanks!
In SQL Server, I would go for the BIT data type, as it matches the abstract requirements you describe: it can have two values (which map to Yes and No by the widely used convention of Yes = 1 and No = 0), and it can hold an additional NULL value if desired.
Where possible, using native data types has all the benefits of performance, clarity, and understandability for others, not to mention the principle of not overcomplicating things when you can keep them simple.
SQL Server doesn't have a Boolean data type, so Boolean is out of the question. BIT is a numeric type that accepts the values 0 and 1, as well as NULL. I usually prefer a CHAR type with a CHECK constraint permitting values like 'Y'/'N' or 'T'/'F'. CHAR at least lets you extend the set of values beyond just two if you want to.
BIT has the potential disadvantage that it's non-standard, not particularly user-friendly, and not well understood even by SQL Server users. The semantics of BIT are peculiar in SQL Server, and even Microsoft's own products treat BIT in inconsistent ways.
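The CHAR-with-CHECK approach might look like the following sketch (the table and constraint names are invented for illustration):

CREATE TABLE accounts (
    id        INT PRIMARY KEY,
    is_active CHAR(1) NOT NULL
        CONSTRAINT ck_accounts_is_active CHECK (is_active IN ('Y', 'N'))
);

The constraint keeps the column as restrictive as BIT, while widening it later (say, adding 'P' for pending) is a one-line constraint change rather than a data type migration.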

SQL Server: internal workings [closed]

Some of these might make little sense, but:
Is SQL code interpreted or compiled (and if compiled, to what)?
What are joins translated into - I mean, into some kind of loops, or what?
Is algorithm complexity analysis applicable to a query? For example, is it possible to write a really bad select that is exponential in time in the number of rows selected? And if so, how does one analyze queries?
Well ... quite general questions, so some very general answers.
1) Is SQL code interpreted or compiled (and to what)?
SQL code is compiled into execution plans.
2) What are joins translated into - I mean, into some loops or what?
It depends on the join and the tables you're joining (as far as I know). SQL Server has a few physical join primitives (nested loops join, hash join, merge join); depending on the objects involved in your SQL code, the query optimizer tries to choose the best option.
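You can see which primitive the optimizer picked by asking SQL Server for the plan before running the statement. A sketch (the orders and customers tables are made-up examples):

SET SHOWPLAN_TEXT ON;
GO
SELECT o.id, c.name
FROM orders o
JOIN customers c ON c.id = o.customer_id;
-- The plan output will contain Nested Loops, Hash Match, or Merge Join.
GO
SET SHOWPLAN_TEXT OFF;
GO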
3) Is algorithm complexity analysis applicable to a query, for example is it possible to write a really bad select - exponential in time in the number of rows selected? And if so, how to analyze queries?
Not really sure what you mean by that, but there are cases where you can do really bad things, for example using

SELECT TOP 1 col FROM Table ORDER BY col DESC

on a table without an index on col to find the largest value for col, instead of

SELECT MAX(col) FROM Table
You should get your hands on some or all of the books from the SQL Server Internals series. They are really excellent and cover many things in great detail.
You'd get a lot of these answers by reading one of Itzik Ben-Gan's books. He covers the topics you mention in some detail.
http://tsql.solidq.com/books/index.htm