PostgreSQL: why is INNER JOIN so much slower than WHERE?

I have 2 tables, and I copy a file name from one table to the other in an update operation. Using INNER JOIN makes the query run in 22 seconds when there are just ~4000 rows. Using a WHERE clause lets it run in about 200 milliseconds. How and why is this happening? Does the INNER JOIN result in additional looping?
Example 1 using INNER JOIN takes 22 seconds when table_a has about 4k records:
UPDATE table_a SET file_name = tmp.file_name
FROM (
    SELECT b.customer_id, b.file_name, b.file_id
    FROM table_b AS b
    WHERE b.status = 'A'
) tmp
INNER JOIN table_a AS a
    ON tmp.customer_id = a.customer_id AND tmp.file_id = a.file_id;
Example 2 using WHERE runs in about 200 ms.
UPDATE table_a AS a SET file_name = tmp.file_name
FROM (
    SELECT b.customer_id, b.file_name, b.file_id
    FROM table_b AS b
    WHERE b.status = 'A'
) tmp
WHERE tmp.customer_id = a.customer_id AND tmp.file_id = a.file_id;

The queries are doing totally different things. The first updates every row in table_a, and probably updates the same row multiple times.
The two references to table_a in the first version are independent: the UPDATE target and the alias a inside the FROM clause. Because no condition ties the UPDATE target to the join result, the effect is a cross join.
The second method is the correct syntax for what you want to do in Postgres.
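If the intermediate subquery is not actually needed, the same fix can be written as one direct join in the FROM/WHERE form. A minimal sketch, reusing the table and column names from the question:
-- The WHERE clause ties the UPDATE target (table_a) to the
-- joined table (table_b), so there is no cross join and each
-- row of table_a is matched only through its own keys.
UPDATE table_a AS a
SET file_name = b.file_name
FROM table_b AS b
WHERE b.status = 'A'
AND b.customer_id = a.customer_id
AND b.file_id = a.file_id;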

Converting Nested Subqueries into Mini Queries

I have a lot of trouble reading nested subqueries - I personally prefer to write several mini queries and work from there. I understand that more advanced SQL users find it more efficient to write nested subqueries.
For instance, in the following query:
select distinct b.table_b, a.*
from table_a a
inner join table_c b
    on a.id_1 = b.id_1
inner join (
    select a.id_1, max(a.var_1) as max_var_1
    from table_a a
    group by a.id_1
) c
    on a.id_1 = c.id_1 and a.var_1 = c.max_var_1;
Problem: I am trying to turn this into several different queries:
#PART 1:
create table table_1 as
select distinct b.table_b, a.*
from table_a a
inner join table_c b
    on a.id_1 = b.id_1;
#PART 2:
create table table_2 as
select a.id_1, max(a.var_1) as max_var_1
from table_a a
group by a.id_1;
#PART 3 (final result: final_table)
create table final_table as
select a.*, b.*
from table_1 a
inner join table_2 b
    on a.id_1 = b.id_1 and a.var_1 = b.max_var_1;
My Question: Can someone please tell me if this is correct? Is this how the above nested subquery can be converted into 3 mini queries?
Subqueries only need to be materialized into separate tables when you use them multiple times. Even then, if the subquery returns many records, inserting them into a table is not recommended: a plain SELECT only reads data from disk, while an INSERT also writes to disk, so inserting many records can take much longer than selecting them.
P.S. When a subquery result does have to be stored, a temporary table ("create temporary table") is usually used, as in the sketch below.
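A minimal sketch of that temp-table approach, reusing the table and column names from the question (tmp_max_var is a hypothetical name):
-- Materialize the aggregate once, then join against it.
-- A temporary table is dropped automatically at session end.
create temporary table tmp_max_var as
select id_1, max(var_1) as max_var_1
from table_a
group by id_1;

select distinct b.table_b, a.*
from table_a a
inner join table_c b
    on a.id_1 = b.id_1
inner join tmp_max_var c
    on a.id_1 = c.id_1 and a.var_1 = c.max_var_1;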
Another good option is a CTE (Common Table Expression). With a CTE, the database can keep the result of the SELECT in memory and execute the subquery only once; if the subquery is then referenced multiple times, the database reuses the stored result instead of re-executing it.
For the performance of your query, only #PART 2 needs to be split out; the other parts are unnecessary. But for better performance, I recommend writing your query without any inserts, using a CTE. For example:
with sub_query as (
select
id_1,
max(var_1) as max_var_1
from
table_a
group by id_1
)
select distinct b.table_b, a.*
from table_a a
inner join table_c b
on a.id_1 = b.id_1
inner join sub_query c
on a.id_1 = c.id_1 and a.var_1 = c.max_var_1;

What's the purpose of a JOIN where no column from 2nd table is being used?

I am looking through some Hive queries we run as part of analytics on our Hadoop cluster, and I am having trouble understanding one. This is the HiveQL query:
SELECT
    c_id, v_id, COUNT(DISTINCT(m_id)) AS participants,
    cast(date_sub(current_date, ${window}) as string) as event_date
from (
    select
        a.c_id, a.v_id, a.user_id,
        case
            when c.id1 is not null and a.timestamp <= c.stitching_ts then c.id2
            else a.m_id
        end as m_id
    from (
        select * from first
        where event_date <= cast(date_sub(current_date, ${window}) as string)
    ) a
    join (
        select * from second
    ) b on a.c_id = b.c_id
    left join third c
        on a.user_id = c.id1
) dx
group by c_id, v_id;
I have changed the names, but otherwise this is the SELECT statement used in an INSERT OVERWRITE into another table.
Regarding the join
join (
select * from second
) b on a.c_id = b.c_id
b is not used anywhere except in the join condition, so is this join serving any purpose at all?
Is it there to make sure the result only has entries whose c_id is present in the second table? Would a WHERE ... IN condition be better if that's all it is doing?
Or can I just remove this join without making any difference at all?
Thanks.
A join (inner, left, or right) can duplicate rows if the join key is not unique in the joined dataset. For example, if a contains a single row with c_id=1 and b contains two rows with c_id=1, the result will be two rows with a.c_id=1.
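A minimal standard-SQL illustration of that duplication, with hypothetical inline data (recent Hive versions accept this form too):
-- a has one row with c_id = 1; b has two rows with c_id = 1.
-- The inner join returns a's single row twice, once per match.
with a as (select 1 as c_id),
     b as (select 1 as c_id union all select 1 as c_id)
select a.c_id
from a join b on a.c_id = b.c_id;  -- returns two rows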
An inner join can also filter out rows whose join key is absent from the joined dataset. I believe that is what this join is meant to do.
If the goal is to keep only rows whose key is present in both datasets (a filter), you do not want duplication, and you do not use any columns from the joined dataset, then it is better to use LEFT SEMI JOIN instead of JOIN. It works purely as a filter, even if the joined dataset contains duplicate keys:
left semi join (
select c_id from second
) b on a.c_id = b.c_id
This is a much safer way to keep only the rows that exist in both a and b while avoiding unintended duplication.
You can replace the join with WHERE IN/EXISTS, but it makes no difference: it is implemented as the same JOIN. Check the EXPLAIN output and you will see the same query plan. Better to use LEFT SEMI JOIN; it implements an uncorrelated IN/EXISTS efficiently.
If you prefer to move it to the WHERE:
WHERE a.c_id IN (select c_id from second)
or correlated EXISTS:
WHERE EXISTS (select 1 from second b where a.c_id=b.c_id)
But as I said, all of them are implemented internally using the same JOIN operator.

INTERSECT two tables of 500M rows in Vertica

I am very new to Vertica, so I am looking for efficient ways to compare two tables of roughly 500-800 million rows each. I have a process that gets data from a Vertica view and dumps it into SQL Server for a later merge into a final table there. For a few large tables combined, it dumps about 3 billion rows daily. Instead of dumping all the data, I want to take a daily snapshot, compare it with the previous day's snapshot on the Vertica side only, and then push only the changed rows into SQL Server.
Let's say the previous snapshot is stored in tableA and today's snapshot in tableB. The PK on both tables is a column named OrderId.
The simplest way I can think of is:
SELECT * FROM tableB
WHERE OrderId NOT IN (
    SELECT OrderId FROM (
        SELECT * FROM tableA
        INTERSECT
        SELECT * FROM tableB
    ) unchanged
);
So my questions are:
Is there any other/better option in Vertica to get only the changed rows between two tables? Should I even consider doing this comparison on the Vertica side?
How long should such a comparison take?
What should I consider to improve the performance of such a query?
If your columns have no NULL values, then a massive LEFT JOIN with an IS NULL filter would seem to do what you want:
select b.*
from tableB b left join
     tableA a
     on b.OrderId = a.OrderId and
        b.col1 = a.col1 and
        . . . -- for all the columns you care about
where a.OrderId is null;
However, I think you want except:
select b.*
from tableB b
except
select a.*
from tableA a;
I imagine this would have reasonable performance.
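To push only the changed rows afterwards, the EXCEPT result can be materialized once and exported from there. A minimal sketch, with daily_diff as a hypothetical staging table name:
-- Vertica: store today's new/changed rows once, then export
-- only this (much smaller) table to SQL Server.
CREATE TABLE daily_diff AS
SELECT * FROM (
    SELECT * FROM tableB
    EXCEPT
    SELECT * FROM tableA
) d;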
Do you have a primary key in the two tables?
Then my technique, for a complete Change Data Capture, is:
SELECT
  'I' AS to_do
, newrows.*
FROM tb_today newrows
LEFT JOIN tb_yesterday oldrows USING(id)
WHERE oldrows.id IS NULL
UNION ALL
SELECT
  'U' AS to_do
, newrows.*
FROM tb_today newrows
JOIN tb_yesterday oldrows USING(id)
WHERE oldrows.fname <> newrows.fname
OR oldrows.lname <> newrows.lname
OR oldrows.bdate <> newrows.bdate
OR oldrows.sal <> newrows.sal
[...]
OR oldrows.lastcol <> newrows.lastcol
UNION ALL
SELECT
  'D' AS to_do
, oldrows.*
FROM tb_yesterday oldrows
LEFT JOIN tb_today newrows USING(id)
WHERE newrows.id IS NULL
;
Just leave out the last leg of the UNION ALL if you don't want to cater for DELETEs ('D').
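One caveat: plain <> comparisons never evaluate to true when either side is NULL, so changes to or from NULL would be missed by the 'U' leg above. A sketch of a NULL-safe version of that leg, assuming a sentinel string like '~' that never occurs in the real data:
-- NULL-safe 'U' leg: COALESCE maps NULL to a sentinel, so
-- value -> NULL and NULL -> value changes are also detected.
SELECT 'U' AS to_do, newrows.*
FROM tb_today newrows
JOIN tb_yesterday oldrows USING(id)
WHERE COALESCE(oldrows.fname, '~') <> COALESCE(newrows.fname, '~')
OR COALESCE(oldrows.lname, '~') <> COALESCE(newrows.lname, '~');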
Good luck
You can also do it nicely using joins:
SELECT b.*
FROM tableB AS b
LEFT JOIN tableA AS a ON a.OrderId = b.OrderId
WHERE a.OrderId IS NULL
The query above returns only the rows from tableB whose OrderId does not exist in tableA (i.e. newly added rows); rows whose key is present in both tables are skipped, even if their other columns changed.

Slow query performance

I'm having slow query performance, and sometimes it fails with the error "Can't allocate space for object 'temp work table'".
I have 2 tables and 1 view. The two tables are combined with a LEFT JOIN, and the view is used in a subquery. Below is the sample query.
SELECT a.*
FROM Table1 a LEFT JOIN Table2 b ON a.ID = b.ID
WHERE a.ID IN (SELECT ID
               FROM View1)
The above query is very slow, BUT when I use a #temp table it becomes faster.
SELECT ID
INTO #Temp
FROM View1
SELECT a.*
FROM Table1 a LEFT JOIN Table2 b ON a.ID = b.ID
WHERE a.ID IN (SELECT ID
FROM #Temp)
Could someone explain why the first SQL statement is so slow? And kindly give me advice, such as adding a new index?
Note: The first query statement cannot be altered or modified. I used the second query statement only to show my team that putting the third table's data into a temporary table and using that makes it faster.
Basically, in the first query you are accessing the view for each and every row, and in turn the view executes its query every time.
In the second one you execute the view's query just once and reuse the returned results through the temp table.
Try:
SELECT a.*
FROM Table1 a LEFT JOIN Table2 b ON a.ID = b.ID,
(SELECT ID
FROM View1) c
WHERE a.ID = c.ID;
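An EXISTS form is another rewrite worth comparing in the query plan; a sketch under the same assumed schema, not a guaranteed fix:
-- Lets the optimizer semi-join against the view instead of
-- re-evaluating it per row; an index on Table1.ID (and on the
-- ID column underlying View1) can help here.
SELECT a.*
FROM Table1 a LEFT JOIN Table2 b ON a.ID = b.ID
WHERE EXISTS (SELECT 1
              FROM View1 v
              WHERE v.ID = a.ID);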

Refactor SQL statement with lots of common INNER JOINS and columns

I need suggestions on how to refactor the following SQL expression. As you can see, all the selected columns except col_N are the same, and all the inner joins except the last one in the two subqueries are the same. This is just a snippet of my code, so I am not including the WHERE clause from my query. FYI, this is part of a stored procedure used by an SSRS report, and performance matters a lot to me because of the thousands of records involved:
SELECT col_A
, col_B, col_C,...
, '' As[col_N]
FROM table_A
INNER JOIN table_B
INNER JOIN table_C
INNER JOIN table_D1
UNION
SELECT col_A
, col_B, col_C,...
, (select E.field_2 from table_E AS E where D2.field_1 = E.field_1 AND A.field_1 = E.field_2) AS [col_N]
FROM table_A as A
INNER JOIN table_B
INNER JOIN table_C
INNER JOIN table_D2 as D2
Jean's first suggestion of creating a view by joining A, B, and C worked. I created a temp table by joining A, B, and C and then used it, achieving a significant performance improvement (query time reduced by half for a couple thousand records)!
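For reference, a CTE can factor out the shared joins without a temp table. A minimal sketch with hypothetical join keys, since the original snippet omits the ON conditions (b_id, c_id, and field_1 below are placeholders):
-- The shared A/B/C join is written once and reused by both
-- legs of the UNION; only the D1/D2 join and col_N differ.
WITH abc AS (
    SELECT A.col_A, B.col_B, C.col_C, A.field_1
    FROM table_A AS A
    INNER JOIN table_B AS B ON A.b_id = B.b_id
    INNER JOIN table_C AS C ON B.c_id = C.c_id
)
SELECT abc.col_A, abc.col_B, abc.col_C
     , '' AS [col_N]
FROM abc
INNER JOIN table_D1 AS D1 ON abc.field_1 = D1.field_1
UNION
SELECT abc.col_A, abc.col_B, abc.col_C
     , (SELECT E.field_2
        FROM table_E AS E
        WHERE D2.field_1 = E.field_1
          AND abc.field_1 = E.field_2) AS [col_N]
FROM abc
INNER JOIN table_D2 AS D2 ON abc.field_1 = D2.field_1;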