How does an IN query work in SQL

SELECT *
FROM TableName
WHERE columnA = value1
  AND columnB = value2
  AND column3 IN (list of ids);
How will the above query be executed by the database?
How is it different from the following loop of queries:
for x in list_of_ids:
    SELECT * FROM TableName WHERE columnA = value1 AND columnB = value2 AND column3 = x;

The second example is simply a loop in which the SELECT is executed once per id, X times in total.
The execution of the first example:
1. If the number of values is small, IN will be transformed into ORed conditions.
2. If the number of values is huge, some DB engines will optimize it by creating a spool with all the values and doing a join with your table. (Both rewrites are sketched below.)
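As a rough sketch (the ids 1, 2, 3 are hypothetical, the exact rewrite is engine-specific, and the VALUES table constructor is not supported by every engine), the two rewrites look like this:
-- Case 1: a small list becomes ORed conditions
SELECT * FROM TableName
WHERE columnA = value1 AND columnB = value2
  AND (column3 = 1 OR column3 = 2 OR column3 = 3);
-- Case 2: a huge list is materialized and joined against the table
SELECT t.*
FROM TableName t
JOIN (VALUES (1), (2), (3)) AS ids(id) ON t.column3 = ids.id
WHERE t.columnA = value1 AND t.columnB = value2;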
In any case, the first example will be faster.
Note: SQL queries are not valid in NoSQL databases. The above scenario applies to traditional relational DBs only.

The latter example is not a query. You have a loop (probably Python) that executes a specific query (with a specific value for column3) on each iteration.
Is this a fictitious example, or is it taken from a specific DB?

The first example should be more efficient, as it will perform the SELECT against the table once.
The second example will perform the SELECT X times and is therefore likely to be less efficient than the first.
The link below gives the detail on the IN clause that I think you are after:
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/4ca39765-6521-4dec-aec1-373b1adf0be7/how-does-the-in-clause-affect-performance?forum=transactsql

Will "Where 0=1" parse full table or just return column names

I came across this question:
SQL Server: Select Top 0?
I want to ask if I use the query
SELECT * FROM table WHERE 0=1
or
SELECT TOP 0 * FROM table
will it return just the column names instantly, or will it keep on parsing the whole table and in the end return zero results?
I have a production table with 10,000 rows - will it check the WHERE condition on each row?
The SQL Server query optimizer is smart enough to figure out that this WHERE condition can never ever produce a true result on any row, so it doesn't bother actually scanning the table.
If you look at the actual execution plan for such a query, it's easy to see that nothing is being done and the query returns immediately.
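One way to see this for yourself (a SQL Server sketch; "table" stands in for your table name, and exact plan output varies by version) is to enable the textual showplan before running the query:
SET SHOWPLAN_TEXT ON;
GO
SELECT * FROM table WHERE 0=1;
GO
-- The plan shows only a Constant Scan operator: no table or index is accessed.
SET SHOWPLAN_TEXT OFF;
GO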
MySQL is also smart enough to detect this and knows it's impossible to do; EXPLAIN (or its shorthand DESC) reports "Impossible WHERE":
DESC SELECT * FROM table WHERE 0=1;
In the query
SELECT * FROM table WHERE 0=1
the WHERE clause can never be true, so SQL Server is not going to scan your whole table.
And in the query
SELECT TOP 0 * FROM table
you are specifying TOP 0, so SQL Server never scans your table at all, since zero rows are requested.
Both queries will return only the column headers.
Both queries are used for getting an empty result set from the table:
SELECT TOP 0 * FROM table
SELECT * FROM table WHERE 0=1
They are also useful for getting the same column structure as the table, returning column details without any data, and checking connectivity to the database.
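For example (a SQL Server sketch; the name empty_copy is illustrative), the structure-copying use pairs nicely with SELECT ... INTO, which creates a new, empty table with the same column names and types:
SELECT TOP 0 *
INTO empty_copy   -- new table with table's columns, but no rows
FROM table;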

SELECT COUNT(*);

I have a database, database1, with two tables (Table 1, Table2) in it.
There are 3 rows in Table1 and 2 rows in Table2. Now if I execute the following SQL query SELECT COUNT(*); on database1, then the output is "1".
Does anyone have an idea what this "1" signifies?
The definition of the two tables is as below.
CREATE TABLE Table1
(
ID INT PRIMARY KEY,
NAME NVARCHAR(20)
)
CREATE TABLE Table2
(
ID INT PRIMARY KEY,
NAME NVARCHAR(20)
)
Normally all selects are of the form SELECT [columns, scalar computations on columns, grouped computations on columns, or scalar computations] FROM [table or joins of tables, etc]
Because this allows plain scalar computations we can do something like SELECT 1 + 1 FROM SomeTable and it will return a recordset with the value 2 for every row in the table SomeTable.
Now, if we didn't care about any table, but just wanted our scalar computed, we might want to do something like SELECT 1 + 1. This isn't allowed by the standard, but it is useful and most databases allow it (Oracle doesn't, unless that has changed recently).
Hence such bare SELECTs are treated as if they had a from clause which specified a table with one row and no column (impossible of course, but it does the trick). Hence SELECT 1 + 1 becomes SELECT 1 + 1 FROM ImaginaryTableWithOneRow which returns a single row with a single column with the value 2.
Mostly we don't think about this; we just get used to the fact that bare SELECTs give results, without considering that there must be some one-row source for them to return one row.
In doing SELECT COUNT(*) you did the equivalent of SELECT COUNT(*) FROM ImaginaryTableWithOneRow which of course returns 1.
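A quick illustration using the question's Table1 (the Oracle line is included for contrast, since Oracle requires a FROM clause):
SELECT COUNT(*);              -- SQL Server: 1 (counts the imaginary one-row table)
SELECT COUNT(*) FROM Table1;  -- 3 (counts the actual rows)
SELECT 1 + 1;                 -- 2, returned as a single row
SELECT COUNT(*) FROM DUAL;    -- Oracle: 1 (DUAL is the explicit one-row table)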
Along similar lines the following also returns a result.
SELECT 'test'
WHERE EXISTS (SELECT *)
The explanation for that behavior (from this Connect item) also applies to your question.
In ANSI SQL, a SELECT statement without a FROM clause is not permitted - you need to specify a table source. So the statement "SELECT 'test' WHERE EXISTS (SELECT *)" should give a syntax error. This is the correct behavior.
With respect to the SQL Server implementation, the FROM clause is optional and it has always worked this way. So you can do "SELECT 1" or "SELECT @v" and so on without requiring a table. In other database systems, there is a dummy table called "DUAL" with one row that is used to do such SELECT statements, like "SELECT 1 FROM dual;" or "SELECT @v FROM dual;". Now, coming to the EXISTS clause - the project list doesn't matter in terms of the syntax or result of the query, and SELECT * is valid in a sub-query. Couple this with the fact that we allow SELECT without FROM, and you get the behavior that you see. We could fix it, but there is not much value in doing it and it might break existing application code.
It's because you have executed select count(*) without specifying a table.
The count function returns the number of rows in the specified dataset. If you don't specify a table to select from, a bare select will only ever return a single row - therefore count(*) will return 1. (In some SQL implementations, such as Oracle, you have to specify a table or similar database object; Oracle includes a dummy table (called DUAL) which can be selected from when no specific table is required.)
You wouldn't normally execute a SELECT COUNT(*) without specifying a table to query against. Your database server is probably giving you a count of "1" based on the default system table it is querying.
Try using
select count(*) from Table1
Without a table name it makes no sense.
Without a table name it always returns 1, whatever the database (as long as it allows a bare SELECT).
Since this is tagged SQL Server, MSDN states:
COUNT always returns an int data type value.
Also,
COUNT(*) returns the number of items in a group. This includes NULL
values and duplicates.
Thus, since you didn't provide a table to COUNT from, it counts the implicit one-row source and returns 1.
The COUNT function returns the number of rows as its result. If you don't specify any table, it returns 1 by default; i.e., COUNT(*), COUNT(1), COUNT(2), ... will always return 1.
Select *
without a FROM clause is "select ALL from the Universe", since you have filtered out nothing.
In your case, you are asking "How many universes are there?"
This is exactly how I would teach it. On the first day I would write Select * on the board and ask what it means. Answer: Give me the world.
And from there I would teach how to filter the universe down to something meaningful.
I must admit, I had never thought of Select Count(*), which makes it more interesting, but it still brings back a true answer. We have only one world.
Without consulting Stephen Hawking, SQL will have to contend with only 1.
The result of the query is correct.

SQL WHERE ID IN (id1, id2, ..., idn)

I need to write a query to retrieve a big list of ids.
We do support many backends (MySQL, Firebird, SQLServer, Oracle, PostgreSQL ...) so I need to write a standard SQL.
The size of the id set could be big, the query would be generated programmatically. So, what is the best approach?
1) Writing a query using IN
SELECT * FROM TABLE WHERE ID IN (id1, id2, ..., idn)
My question here is: what happens if n is very big? Also, what about performance?
2) Writing a query using OR
SELECT * FROM TABLE WHERE ID = id1 OR ID = id2 OR ... OR ID = idn
I think that this approach does not have an n limit, but what about performance if n is very big?
3) Writing a programmatic solution:
foreach (var id in myIdList)
{
    var item = GetItemByQuery("SELECT * FROM TABLE WHERE ID = " + id);
    myObjectList.Add(item);
}
We experienced some problems with this approach when the database server is queried over the network. Normally it is better to do one query that retrieves all results than to make a lot of small queries. Maybe I'm wrong.
What would be a correct solution for this problem?
Option 1 is the only good solution.
Why?
Option 2 does the same, but you repeat the column name lots of times; additionally, the SQL engine doesn't immediately know that you want to check whether the value is one of the values in a fixed list. However, a good SQL engine could optimize it to the same performance as IN. There's still the readability issue, though...
Option 3 is simply horrible performance-wise. It sends a query every loop and hammers the database with small queries. It also prevents the engine from using any optimizations for "value is one of those in a given list".
An alternative approach might be to use another table to contain id values. This other table can then be inner joined on your TABLE to constrain returned rows. This will have the major advantage that you won't need dynamic SQL (problematic at the best of times), and you won't have an infinitely long IN clause.
You would truncate this other table, insert your large number of rows, then perhaps create an index to aid the join performance. It would also let you detach the accumulation of these rows from the retrieval of data, perhaps giving you more options to tune performance.
Update: Although you could use a temporary table, I did not mean to imply that you must or even should. A permanent table used for temporary data is a common solution with merits beyond that described here.
What Ed Guiness suggested is really a performance booster. I had a query like this:
select * from table where id in (id1, id2, ..., long list)
What I did:
DECLARE @temp TABLE (
    ID int
)
INSERT INTO @temp
SELECT * FROM dbo.fnSplitter('#idlist#')
Then I inner joined the temp table with the main table:
SELECT * FROM table INNER JOIN @temp t ON t.ID = table.id
And performance improved drastically.
First option is definitely the best option.
SELECT * FROM TABLE WHERE ID IN (id1, id2, ..., idn)
However, considering that the list of ids could be very huge, say millions, you should consider chunking it, as below:
Divide your list of ids into chunks of a fixed size, say 100
The chunk size should be decided based upon the memory size of your server
Suppose you have 10,000 ids: you will have 10000/100 = 100 chunks
Process one chunk at a time, resulting in 100 database calls for the select
Why should you divide into chunks?
You will never get an out-of-memory exception, which is very common in scenarios like yours.
You will have an optimized number of database calls, resulting in better performance.
It has always worked like a charm for me. Hope it works for my fellow developers as well :)
Doing a SELECT * FROM MyTable WHERE id IN (...) query on an Azure SQL table with 500 million records resulted in a wait time of over 7 minutes!
Doing this instead returned results immediately:
SELECT b.id, a.*
FROM MyTable a
JOIN (VALUES (250000), (2500001), (2600000)) AS b(id)
  ON a.id = b.id
Use a join.
In most database systems, IN (val1, val2, …) and a series of OR are optimized to the same plan.
The third way would be importing the list of values into a temporary table and joining it, which is more efficient in most systems if there are lots of values.
You may want to read this article:
Passing parameters in MySQL: IN list vs. temporary table
I think you mean SQL Server, but note that on Oracle there is a hard limit on how many IN elements you can specify: 1000.
Sample 3 would be the worst performer out of them all because you are hitting up the database countless times for no apparent reason.
Loading the data into a temp table and then joining on that would be by far the fastest. After that the IN should work slightly faster than the group of ORs.
For the 1st option: add the IDs into a temp table and inner join it with the main table.
CREATE TABLE #temp (Id INT)
INSERT INTO #temp (Id)
SELECT t.column1 FROM (VALUES (1),(2),(3),...,(10000)) AS t(column1)
Try this:
SELECT Position_ID, Position_Name
FROM position
WHERE Position_ID IN (6, 7, 8)
ORDER BY Position_Name

Selecting data effectively in SQL

I have a very large table with over 1000 records and 200 columns. When I try to retrieve records matching some criteria in the WHERE clause using a SELECT statement, it takes a lot of time. But most of the time I just want to select a single record that matches the criteria in the WHERE clause rather than all the records.
I guess there should be a way to select just a single record and exit which would minimize the retrieval time. I tried ROWNUM=1 in the WHERE clause but it didn't really work because I guess the engine still checks all the records even after finding the first record matching the WHERE criteria. Is there a way to optimize in case if I want to select just a few records?
Thanks in advance.
Edit:
I am using Oracle 10g.
The Query looks like,
SELECT *
FROM Really_Big_table
WHERE column1 IS NOT NULL
  AND column2 IS NOT NULL
  AND rownum = 1;
This seems to run slower than the version without rownum = 1.
rownum is what you want, but you need to perform your main query as a subquery.
For example, if your original query is:
SELECT col1, col2
FROM table
WHERE condition
then you should try
SELECT *
FROM (
    SELECT col1, col2
    FROM table
    WHERE condition
) WHERE rownum <= 1
See http://www.oracle.com/technology/oramag/oracle/06-sep/o56asktom.html for details on how rownum works in Oracle.
1,000 records isn't a lot of data in a table. 200 columns is a reasonably wide table. For this reason, I'd suggest you aren't dealing with a really big table - I've performed queries against millions of rows with no problems.
Here is a little experiment... how long does it take to run this compared to the "SELECT *" query?
SELECT Really_Big_table.Id
FROM Really_Big_table
WHERE column1 IS NOT NULL
  AND column2 IS NOT NULL
  AND rownum = 1;
An example:
SELECT ename, sal
FROM ( SELECT ename, sal, RANK() OVER (ORDER BY sal DESC) sal_rank
FROM emp )
WHERE sal_rank <= 1;
You should also index the columns used in the WHERE clause.
In SQL, most of the optimization comes in the form of indexes on the table (as a rough guide, index the columns that appear in the WHERE and ORDER BY clauses).
You did not specify what SQL database you are using, so I can't point to a good resource.
Here is an introduction to indexes on Oracle.
Here is another tutorial.
As for queries - you should always specify the columns you are returning and not use a blanket *.
It shouldn't take a lot of time to query a 1000-row table. There are exceptions, however; check whether you are in one of the following cases:
1. Lots of rows were deleted
The table had a massive amount of rows in the past. Since the High Water Mark (HWM) is still high (delete won't lower it) and FULL TABLE SCAN read all the data up to the high water mark, it may take a lot of time to return results even if the table is now nearly empty.
Analyse your table (dbms_stats.gather_table_stats('<owner>','<table>')) and compare the space actually used by the rows (space on disk) with the effective space (data), for example:
SELECT t.avg_row_len * t.num_rows data_bytes,
(t.blocks - t.empty_blocks) * ts.block_size bytes_used
FROM user_tables t
JOIN user_tablespaces ts ON t.tablespace_name = ts.tablespace_name
WHERE t.table_name = '<your_table>';
You will have to take into account the overhead of the rows and blocks as well as the space reserved for update (PCT_FREE). If you see that you use a lot more space than required (typical overhead is below 30%, YMMV) you may want to reset the HWM, either:
ALTER TABLE <your_table> MOVE; and then rebuild INDEX (ALTER INDEX <index> REBUILD), don't forget to collect stats afterwards.
or use DBMS_REDEFINITION.
2. The table has very large columns
Check if you have columns of LOB datatypes (CLOB, BLOB), LONG (irk), etc. Data over 4000 bytes in any of these columns is stored out of line (in a separate segment), which means that if you don't select these columns, you will only query the other, smaller columns.
If you are in this case, don't use SELECT *. Either leave out the large columns you don't need, or use SELECT rowid and then do a second query: SELECT * ... WHERE rowid = <rowid>.
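A sketch of that two-step pattern, using the question's Really_Big_table (the :rid bind variable is illustrative):
-- Step 1: locate the row, touching only the small inline columns
SELECT rowid FROM Really_Big_table
WHERE column1 IS NOT NULL AND column2 IS NOT NULL AND ROWNUM = 1;
-- Step 2: fetch the full row, large columns included, by its address
SELECT * FROM Really_Big_table WHERE rowid = :rid;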

SQL find non-null columns

I have a table of time-series data of which I need to find all columns that contain at least one non-null value within a given time period. So far I am using the following query:
select max(field1),max(field2),max(field3),...
from series where t_stamp between x and y
Afterwards I check each field of the result if it contains a non-null value.
The table has around 70 columns and a time period can contain >100k entries.
I wonder if there is a faster way to do this (using only standard SQL).
EDIT:
Unfortunately, refactoring the table design is not an option for me.
The EXISTS operation may be faster since it can stop searching as soon as it finds any row that matches the criteria (vs. the MAX which you are using). It depends on your data and how smart your SQL server is. If most of your columns have a high rate of non-null data then this method will find rows quickly and it should run quickly. If your columns are mostly NULL values then your method may be faster. I would give them both a shot and see how they are each optimized and how they run. Also keep in mind that performance may change over time if the distribution of your data changes significantly.
Also, I've only tested this on MS SQL Server. I haven't had to code strict ANSI compatible SQL in over a year, so I'm not sure that this is completely generic.
SELECT
    CASE WHEN EXISTS (SELECT * FROM Series WHERE t_stamp BETWEEN @x AND @y AND field1 IS NOT NULL) THEN 1 ELSE 0 END AS field1,
    CASE WHEN EXISTS (SELECT * FROM Series WHERE t_stamp BETWEEN @x AND @y AND field2 IS NOT NULL) THEN 1 ELSE 0 END AS field2,
    ...
EDIT: Just to clarify, the MAX method might be faster since it could determine those values with a single pass through the data. Theoretically, the method here could as well, and potentially with less than a full pass, but your optimizer may not recognize that all of the subqueries are related, so it might do separate passes for each. That still might be faster, but as I said it depends on your data.
It would be faster with a different table design:
create table series (fieldno integer, t_stamp date);
select distinct fieldno from series where t_stamp between x and y;
Having a table with 70 "similar" fields is not generally a good idea.
When you say "a faster way to do this", if you mean a faster way for the query to run, then yes, here's how to do it: break it out into one query per column:
select top 1 field1 from series where t_stamp between x and y and field1 is not null
select top 1 field2 from series where t_stamp between x and y and field2 is not null
select top 1 field3 from series where t_stamp between x and y and field3 is not null
This way, you won't be doing a table scan across the entire table to find the maximum value. Instead, the database engine will stop as soon as it finds a non-null value. Assuming your data isn't 99% nulls, this should give you faster execution - but at the expense of more programming time to set this up.
How about this... You query for a list of field names that you can iterate through.
select 'field1' as fieldname from series
where field1 is not null and t_stamp between x and y
UNION
select 'field2' from series where field2 is not null and t_stamp between x and y
... etc
Then you have a recordset that will only contain the names of the fields that are not null. You can loop over this recordset to build your real query as dynamic SQL, ignoring fields that don't have any data. The "select 'field2'" branch will not return a row when no rows match the WHERE clause.
Edit: I think I misread the question... this will give you all the rows with a non-null value. I'll leave it here in case it helps someone, but it's not the answer to your question. Thanks @Pax.
I think you want to use COALESCE:
SELECT ... WHERE COALESCE(field1, field2, field3) IS NOT NULL
For a start, this is a very bad idea with standard SQL since not all DBMSs sort with NULLs last.
There are all sorts of tricky ways you could do this and most would be interminably slow.
I'd suggest you (sort of) normalize the database some more, so that each of the columns is in a separate table, which would make the select easier - but that's probably not what you want.
After edit of question: if refactoring table design is not an option, your given solution is probably the best, especially if you have indexes on all the 70 columns.
Although that's likely to slow down inserts quite a bit, you may want to use a non-indexed table for maximum insert speed and transfer the data periodically (overnight?) to an indexed table which would run your selects at best speed (by avoiding a full table scan).
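A minimal sketch of that staging pattern (the names series_staging and series_indexed are illustrative, the transfer would run as a scheduled, e.g. overnight, job, and both tables are assumed to have the same column layout):
-- move the accumulated rows from the index-free staging table
-- into the indexed table that serves the SELECTs
INSERT INTO series_indexed
SELECT * FROM series_staging;
TRUNCATE TABLE series_staging;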
select count(field1),count(field2),count(field3),...
from series where t_stamp between x and y
will tell you how many non-null values are in each column. Unfortunately, it's not much better than the way you're doing it now.
Try this:
SELECT CASE WHEN field1 IS NOT NULL THEN '' ELSE 'contains null' END AS field1_stat,
       CASE WHEN field2 IS NOT NULL THEN '' ELSE 'contains null' END AS field2_stat,
       ... for every field to be checked
FROM series
WHERE t_stamp BETWEEN x AND y
GROUP BY CASE WHEN field1 IS NOT NULL THEN '' ELSE 'contains null' END,
         CASE WHEN field2 IS NOT NULL THEN '' ELSE 'contains null' END
         ... etc
This will give you a summary of the combinations of 'nulled' fields in the table.