Why does a query with "in" and "on" run indefinitely? - sql

I have three tables; table3 is basically the intermediate table between table1 and table2. When I execute the query that contains "in" and joins table1 and table3, it just keeps running and I never get a result. If I use id=134 instead of id in (134,267,390,4234 ... ), the result comes up. I don't understand why "in" has this effect; does anyone have an idea?
Query statement:
select count(*) from table1, table3 on id=table3.table1_id where table3.table2_id = 123 and id in (134,267,390,4234) and item = 30;
table structure:
table1:
id integer primary key,
item integer
table2:
id integer,
item integer
table3:
table1_id integer,
table2_id integer
-- the DB was 0.8 TB without indexes; with the three indices it is now 2.5 TB
indices on: table1.item, table3.table1_id, table3.table2_id
env: Linux, sqlite 3.7.17

from table1, table3 is a cross join on most databases, and with the size of your data a cross join is enormous; in SQLite, however, it's an inner join. From the SQLite SELECT docs:
Side note: Special handling of CROSS JOIN. There is no difference between the "INNER JOIN", "JOIN" and "," join operators. They are completely interchangeable in SQLite.
That's not your problem in this specific instance, but let's not tempt fate; always write out your joins explicitly.
select count(*)
from table1
join table3 on id=table3.table1_id
where table3.table2_id = 123
and id in (134,267,390,4234);
Since you're just counting, you don't need any data from table1 except the ID, and table3 already has table1_id, so there's no need to join with table1 at all. We can do this entirely with table3, the join table.
select count(*)
from table3
where table2_id = 123
and table1_id in (134,267,390,4234);
SQLite can use only one index per table in a query. For this to be performant on such a large data set, you need a composite index on both columns: table3(table1_id, table2_id). Presumably you don't want duplicates, so this should take the form of a unique index. That will cover queries for just table1_id as well as queries for both table1_id and table2_id; you should drop your table1_id index to save space and time.
create unique index table3_unique on table3(table1_id, table2_id);
The composite index will not help queries which use only table2_id, so keep your existing table2_id index.
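Putting the index changes together might look like this (a sketch; the name of your existing table1_id index is an assumption, since it wasn't given):
-- drop the single-column index made redundant by the composite index;
-- the index name here is hypothetical, substitute your own
drop index if exists table3_table1_id_idx;
-- the existing index on table3(table2_id) stays as-is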
Your query should now run lickety-split.
For more, read about the SQLite Query Optimizer.
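You can also ask SQLite how it intends to execute the query. With the composite index in place, EXPLAIN QUERY PLAN should report a search via the index rather than a scan of table3 (the exact output wording varies across SQLite versions):
explain query plan
select count(*)
from table3
where table2_id = 123
and table1_id in (134,267,390,4234);
-- expect something along the lines of:
-- SEARCH TABLE table3 USING COVERING INDEX table3_unique (table1_id=?)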
A terabyte is a lot of data. While SQLite technically can handle this, it might not be the best choice. It's great for small and simple databases, but it's missing a lot of features. You should look into a more powerful database such as PostgreSQL. It is not a magic bullet, and all the same principles apply, but it is much more appropriate for data at that scale.

Related

Joining multiple tables with single join clause (sqlite)

So I'm learning SQL (sqlite flavour) and looking through the sqlite JOIN-clause documentation, I figure that these two statements are valid:
SELECT *
FROM table1
JOIN (table2, table3) USING (id);
SELECT *
FROM table1
JOIN table2 USING (id)
JOIN table3 USING (id)
(or even, but that's beside the point:
SELECT *
FROM table1
JOIN (table2 JOIN table3 USING (id)) USING (id)
)
Now I've seen the second one (chained join) a lot in SO questions on JOIN clauses, but rarely the first (grouped table-query). Both queries execute in SQLiteStudio for the non-simplified case.
A minimal example is provided here, based on this code:
CREATE TABLE table1 (
id INTEGER PRIMARY KEY,
field1 TEXT
)
WITHOUT ROWID;
CREATE TABLE table2 (
id INTEGER PRIMARY KEY,
field2 TEXT
)
WITHOUT ROWID;
CREATE TABLE table3 (
id INTEGER PRIMARY KEY,
field3 TEXT
)
WITHOUT ROWID;
INSERT INTO table1 (field1, id)
VALUES ('FOO0', 0),
('FOO1', 1),
('FOO2', 2),
('FOO3', 3);
INSERT INTO table2 (field2, id)
VALUES ('BAR0', 0),
('BAR2', 1),
('BAR3', 3);
INSERT INTO table3 (field3, id)
VALUES ('PIP0', 0),
('PIP1', 1),
('PIP2', 2);
SELECT *
FROM table1
JOIN (table2, table3) USING (id);
SELECT *
FROM table1
JOIN table2 USING (id)
JOIN table3 USING (id);
Could someone explain why one would use one over the other and if they are not equivalent for certain input data, provide an example? The first certainly looks cleaner (at least less redundant) to me.
INNER JOIN ON vs WHERE clause has been suggested as a possible duplicate. While it touches on the use of , as a join operator, I feel the questions and especially the answers are more focussed on the readability aspect and use of WHERE vs JOIN. My question is more about the general validity and possible differences in outcome (given the necessary input to induce the difference).
SQLite does not enforce proper join syntax. It treats the join operator ([INNER] JOIN, LEFT [OUTER] JOIN, etc., even the comma of the outdated 1980s join syntax) as separate from the condition (ON, USING). That is not good, because it makes joins more prone to errors. The SQLite docs are hence a very bad reference for learning joins. (And SQLite itself is a bad system for learning them, because the DBMS doesn't detect standard SQL join violations.)
Stick to the syntax defined by the SQL standard (and don't ever use comma-separated joins):
FROM table [alias]
((([INNER] | [(LEFT|FULL) [OUTER]]) JOIN table [alias] (ON conditions | USING ( columns ))) | (CROSS JOIN table [alias]))
((([INNER] | [(LEFT|FULL) [OUTER]]) JOIN table [alias] (ON conditions | USING ( columns ))) | (CROSS JOIN table [alias]))
...
(Hope, I've got this right :-) And I also hope this is readable enough :-| I've omitted NATURAL JOIN and RIGHT [OUTER] JOIN here, because I don't recommend using them at all.)
For table you can place a table name, a view, or a subquery (the latter including parentheses, e.g. (select * from mytable)). Columns in USING have to be surrounded by parentheses (e.g. USING (a, b, c)). (You can of course use parentheses around ON conditions as well, if you find this more readable.)
In your case, a properly written query would be:
SELECT *
FROM table1
JOIN table2 USING (id)
JOIN table3 USING (id)
or
SELECT *
FROM table1 t1
JOIN table2 t2 ON t2.id = t1.id
JOIN table3 t3 ON t3.id = t1.id
for instance. The example suggests three 1:1 related tables, though. In real life these are extremely rare and a more typical example would be
SELECT *
FROM table1 t1
JOIN table2 t2 ON t2.t1_id = t1.id
JOIN table3 t3 ON t3.t2_id = t2.id
After fixing the syntax, these are not the same for all tables; read the syntax & definitions of the join operators in the manual. Comma is a cross join with lower precedence than the JOIN-keyword joins. Different DBMSs' SQLs have syntax variations; read the manual. Some allow a bare JOIN for CROSS JOIN.
USING returns only one column for each specified column name, & NATURAL is USING for all common columns; but other joins are based on cross join & return a column for every input column. So since here tables 2 & 3 both have id columns, the comma returns a table with 2 id columns. Then USING (id) doesn't make sense, since one operand has 2 id columns.
If only tables 1 & 3 have an id column, clearly the 2nd query can't join 1 & 2 using id.
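A quick way to see the duplicated column is to run the comma join by itself against the schema from the question; its result carries an id column from each table:
select * from table2, table3;
-- result columns: id, field2, id, field3
-- (two id columns: one from table2, one from table3)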
There are always many ways to express things. In particular, SQL DBMSs execute many different expressions the same way. Research relational query implementation/optimization in general, in SQL, & in your DBMS manual. Generally no simple query variations like these make a difference in execution, even for the simplest query engine. (We see that in SQLite cross join "is handled differently by the query optimizer".)
First learn to write straightforward queries & learn what the operators do & what their syntax & restrictions are.

Impact of index on different columns while join

Let's say we have two tables, tableA and tableB, and both tables have 5 million records each. They have common fields, id and name. I want to check what the impact of an index would be if applied to the join field while joining the tables, and what the impact of an index on the selected column would be. Below is the query:
select t1.name from tableA t1 inner join tableB t2 on t1.id = t2.id;
On which field shall I create an index in order to get results faster: shall I put the index on id or on name? Please help.
My expectation is that if we put the index on the id column, the query will return its result in a shorter time than if we put the index on the name field.
I'm looking for a performance improvement.
My expectation is that if we put the index on the id column, the query will return its result in a shorter time than if we put the index on the name field.
For this query:
select t1.name
from tableA t1 inner join
tableB t2
on t1.id = t2.id;
I would expect the best index to be tableB(id). This is the key used for the JOIN.
Under some circumstances, an index on tableA(id, name) might be the best alternative. This would be particularly true if tableA were much larger than tableB.
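In DDL, the two candidate indexes might look like this (a sketch; the index names are made up):
-- the join-key index, the first choice for this query
create index ix_tableB_id on tableB (id);
-- the covering alternative: join key plus the selected column,
-- potentially better when tableA is much larger than tableB
create index ix_tableA_id_name on tableA (id, name);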

Index suggestion for a table which has id columns

I have a table all of whose columns store ids of other tables (huge tables).
CREATE TABLE #mytable (
Table1Id int,
Table2Id int,
Table3Id int,
Table4Id int,
Table5Id int
)
Now my select joins to all the tables whose ids are stored in the columns of my table.
select T1.col1, T2.col1, T3.col1... from
#mytable MyTable inner join table1 T1 on MyTable.Table1Id = T1.Id
inner join table2 T2 on MyTable.Table2Id = T2.Id
inner join table3 T3 on MyTable.Table3Id = T3.Id
inner join table4 T4 on MyTable.Table4Id = T4.Id
inner join table5 T5 on MyTable.Table5Id = T5.Id
order by T1.col1, T2.col1
At the moment I only have an index on Table1Id, plus indexes on the id columns of all the other tables. Any suggestions to improve the performance?
You don't say which column your index is currently defined on, but based on your example query, you should create an index covering all five columns:
Table1Id, Table2Id, Table3Id, Table4Id, Table5Id
This allows the SQL engine to resolve the query just by reading the index, which should be faster than reading the index, then reading the table.
If you run queries that access only some of the columns, then you need an index for those columns as well. Let's say you run a query on Table3Id and Table4Id. Then you need to create an index on:
Table3Id, Table4Id
I can't tell from the information you provided in your question whether these indexes should be unique or non-unique. You would have to make that determination.
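A sketch of the suggested indexes (the names are made up, and whether they should be UNIQUE is your call, as noted above):
-- covers the example query: the engine can answer it from the index alone
create index IX_mytable_covering
on #mytable (Table1Id, Table2Id, Table3Id, Table4Id, Table5Id);
-- for queries that touch only these two columns
create index IX_mytable_t3_t4
on #mytable (Table3Id, Table4Id);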
Examine #mytable
You have no search criteria on that table
no where
no order by
no group by
You are just going to get those rows in no particular order.
There is no use for any index on #mytable
The index on Table1Id is not used by that query and will slow down inserts.
I suspect #mytable is just an output table and the where conditions are used to populate that table.
The join will use the ID on the table being joined.
So index ID on table1 through table5, and make it the PK (clustered) if you can.
If that index is fragmented, then defragment it.
Each join should then be an index seek, and you can't do better than that.
Verify the query plan has index seeks on the joins.
If you don't have index seeks on those joins, then post the query plan.
You could experiment with hints on the joins, but I suspect the query optimizer will get it right; it may be a big query, but it is not a complex query.
Since SQL Server reads whole pages, if you order #mytable by the individual columns you have a better chance of the needed page already being in memory.
A PK is free IF you can insert in the order of the PK.
In that case you would put the column with the most values in the last position.
Actually, you would put the column with the tightest groupings of the PK in the last position.
And then sort by PK.
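A sketch of the clustered-PK suggestion in SQL Server syntax (the constraint name is made up, and it assumes Id is unique and not null):
alter table table1
add constraint PK_table1 primary key clustered (Id);
-- repeat for table2 through table5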
For the statement that you have put in your question, there is probably little that you can do. In fact, indexes could even hurt under some circumstances if you are in a memory limited environment.
As a first step, though, you should have indexes in the numbered tables on the id column. That is, you should be storing and then joining on the primary key of these tables (the index is automatic on a primary key).
Generally, the purpose of indexes is to prevent scanning an entire table to find a particular set of records. In this case, it looks like you want all the records anyway, so full-table scans are necessary. That limits the applicability of indexes. There is a good chance that SQL Server will turn these joins into hash joins, which is an efficient way of joining a table when you need to read all the rows.
Additional indexes might be warranted depending on the WHERE and GROUP BY clauses.

Subquery VS join with respect to performance

Which is better in performance [Subquery] or [join]?
I have 3 tables related to each other, and I need to select data from one table that has some fields related to the other 2 tables. Which of the following 2 SQL statements is better from the point of view of performance?
select Table1.City, Table1.State, Table2.Name, Table1.Code, Table3.ClassName
from Table1
inner join Table2 on Table1.EmpId = Table2.Id
inner join Table3 on Table1.ClassId = Table3.Id
where Table1.Active = 1
OR
select City, State,
(select Name from Table2 where Id = Table1.EmpId) as Name,
Code,
(select ClassName from Table3 where Id = Table1.ClassId) as ClassName
from Table1
where Active = 1
I have tried the execution plan, but its statistics are not meaningful to me because the current data is test data, not real data, so I can't gauge the amount of data when the tables go live; of course it will be more than the test data.
Note: The Id field in Table2 and Table3 is a primary key.
Thanks in advance
The first approach, with joins, is by far the faster. In the second, the subquery will be executed once for each row. Some databases optimize nested queries into joins, though.
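Note that the scalar subqueries return NULL when no row matches, so the strictly equivalent join form of the second query uses LEFT JOIN rather than INNER JOIN (a sketch):
select t1.City, t1.State, t2.Name, t1.Code, t3.ClassName
from Table1 t1
left join Table2 t2 on t2.Id = t1.EmpId
left join Table3 t3 on t3.Id = t1.ClassId
where t1.Active = 1;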
Join vs. sub-query
I use subqueries often if I expect large joins with big tables or many joins.
Especially with left joins it can happen that the query exceeds the size of the join cache.

SQL: how do I speed up this query

Here is the situation. I have one table that contains records based on records in many different tables (t1 below). t2 is one of the tables that has information pulled from it into t1.
t1
table_oid -- which table the id is a FK to
id -- FK to the other table
store_num -- data field
t2
t2_id
Here is what I need to find: I need the largest t2_id where the store_num is not null in the corresponding record of t1. Here is the query I wrote:
select max(id) from t1
join t2 on t2.t2_id = t1.id
where store_num is not null
and table_oid = 1234;
However, this takes fairly long, and I think this should be a fast query. All the _ids have indexes on them (t1.id/t1.table_oid, t2.t2_id). The vast majority of entries in t1 have a store_num.
Mentally, I would get the t2_ids in descending order, then try them one by one against t1 until I found the first one that had a store_num.
select t2_id from t2 order by t2_id desc;
has an explain cost of 25612
select t1.* from t1 where table_oid = 1234
and id in (select max(t2_id) from t2);
has an explain cost of 8.
So why wouldn't the above query be a cost of at most 25612*8 = 204896? When I explain it, it comes back as more than 3 times that.
Really, my question is how do I re-write that query to run faster.
NOTE: I am using Oracle.
EDIT:
t2 has 11,895,731 rows
t1 has 473,235,192 rows
EDIT 2:
As I've tried different things, the part of the query that is taking the longest is the full scan on t1 looking for the store_num. Is there a way to keep this from doing a full scan, since I only need the biggest entry?
You say:
all _ids have indexes for them
But your query is:
...
where store_num is not null
and table_oid = 1234;
All of your _id indexes are useless for this query unless store_num and table_oid are also indexed, and are the first columns in said index.
So of course it has to do a full scan; it can give you back max(id) instantly without any filter conditions, but as soon as you put in the filter, it can't use the id index anymore because it doesn't know which part of the index matches those store_num is not null entries - not without a scan.
To speed the query up, you need to create an index on (store_num, table_oid, id). Standard disclaimers about creating indexes for a single ad-hoc query apply; having too many indexes will hurt insert/update performance.
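In Oracle, that index might be created like this (a sketch; the index name is made up):
create index ix_t1_store_oid_id
on t1 (store_num, table_oid, id);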
It really doesn't matter how you "rewrite" your query - this isn't like application code, the optimizer is going to rearrange all of the pieces of your query anyway. Unless you have sufficiently-selective indexes on your seek columns or the entire query is completely covered by a single index, it's going to be slow.
Not sure if these apply to Oracle. Do you have an index on the FK id column for the join? Also, avoid 'NOT IN' if you can; it's non-sargable in SQL and slows down a query.
Another option, which might be slower, is doing an outer join and then checking for null on that column (again, not sure if that applies only to SQL Server):
select max(id) from t1
left outer join t2 on t2.t2_id = t1.id
where t1... IS NULL
and table_oid = 1234;
The best way I can think of to have this run fast is to:
Create an index on (TABLE_OID, ID DESC, COVERED_ENTITY_ID) in that order. Why?
table_oid -- this is your primary access condition
id -- so you don't have to access a data block to read it,
-- and you get higher ID values first
covered_entity_id -- you're filtering the data based on this, null vs not null
That should prevent the need to access the 473m rows in T1 at all.
Ensure that there's an index on T2_ID.
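In DDL, those two indexes might look like this (a sketch; the names are made up):
create index ix_t1_oid_id_cei
on t1 (table_oid, id desc, covered_entity_id);
create index ix_t2_id
on t2 (t2_id);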
If all that's in place, a query like:
select max(id)
from t1
inner join t2
on t2.t2_id = t1.id
where covered_entity_id is not null
and table_oid = 1234;
should be (the optimizer is a finicky beast) able to do a semi-join driven by a fast full scan against the index on T1, never scanning the data blocks. Also consider writing it manually as:
select max(id)
from t1
where covered_entity_id is not null
and table_oid = 1234
and exists (select null
from t2
where t1.id = t2.t2_id);
or as:
select max(id)
from t1
where covered_entity_id is not null
and table_oid = 1234
and id in (select t2_id from t2);
The optimizer may produce slightly different plans for each of those forms.
In the following I assume covered_entity_id is the same as store_num - it would really make things easier for us if you were consistent in your naming.
The vast majority of entries in t1
have a store_num.
Given that this is the case, the following clause shouldn't have any impact on the performance of your query ...
where covered_entity_id is not null
However, you go on to say
the part of the query that is taking
the longest is the full scan on t1
looking for the store_num
This suggests the query is looking for covered_entity_id is not null first rather than the presumably far more selective table_oid = 1234. The solution might be as simple as re-writing the query like this ...
where table_oid = 1234
and covered_entity_id is not null;
... although I suspect not. You could try hinting to get the query to use the index on table_oid.
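For example, with Oracle's INDEX hint (the index name here is a placeholder for whatever your table_oid index is actually called):
select /*+ index(t1 ix_t1_table_oid) */ max(id)
from t1
where table_oid = 1234
and covered_entity_id is not null;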
The other thing is, how fresh are the statistics? When the optimizer chooses a radically bad execution plan it is often because the stats are out of date.
Incidentally, why are you joining to T2 at all? Your requirements could be met by selecting max(id) from T1 (unless you don't have a foreign key enforcing T1.ID references T2.T2_ID, and hence need to be sure).
edit
To check your statistics run this query:
select table_name
, num_rows
, last_analyzed
from user_tables
where table_name in ('T1', 'T2')
/
If the results show num_rows is widely divergent from the values you gave in your first edit, then you should re-gather statistics. If last_analyzed is something like the day you went live, then you definitely should re-gather. You may want to export your statistics first; refreshing the statistics can affect the execution plans (that is the object of the exercise), usually for the better, but sometimes things can get worse. Find out more.
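A sketch of the export-then-regather sequence with DBMS_STATS (the schema name and stats-table name are placeholders):
-- back up the current statistics to a stats table first
exec DBMS_STATS.CREATE_STAT_TABLE('MY_SCHEMA', 'STATS_BACKUP');
exec DBMS_STATS.EXPORT_TABLE_STATS('MY_SCHEMA', 'T1', stattab => 'STATS_BACKUP');
exec DBMS_STATS.EXPORT_TABLE_STATS('MY_SCHEMA', 'T2', stattab => 'STATS_BACKUP');
-- then refresh them
exec DBMS_STATS.GATHER_TABLE_STATS('MY_SCHEMA', 'T1');
exec DBMS_STATS.GATHER_TABLE_STATS('MY_SCHEMA', 'T2');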