Refactor SQL statement with lots of common INNER JOINs and columns

I need suggestions on how to refactor the following SQL expression. As you can see, all the selected columns except col_N are the same, and all the inner joins except the last one in the two sub-queries are the same. This is just a snippet of my code, so I am not including the WHERE clause from my query. FYI: this is part of a stored procedure used by an SSRS report, and performance is a BIG concern for me because of the thousands of records involved:
SELECT col_A
, col_B, col_C,...
, '' AS [col_N]
FROM table_A
INNER JOIN table_B
INNER JOIN table_C
INNER JOIN table_D1
UNION
SELECT col_A
, col_B, col_C,...
, (select E.field_2 from table_E AS E where D2.field_1 = E.field_1 AND A.field_1 = E.field_2) AS [col_N]
FROM table_A as A
INNER JOIN table_B
INNER JOIN table_C
INNER JOIN table_D2 as D2

Jean's first suggestion of creating a view by joining A, B, and C worked. I created a temp table by joining A, B, and C and then used it in both halves of the UNION, which gave a significant performance improvement (query time dropped by half for a couple thousand records)!
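For reference, here is a minimal sketch of that refactor in SQL Server syntax (assumed here because this feeds an SSRS stored procedure); the ON conditions are placeholders, since the original snippet omits them:

-- Materialize the joins shared by both halves of the UNION once.
SELECT col_A, col_B, col_C, A.field_1
INTO #ABC
FROM table_A AS A
INNER JOIN table_B ON ...   -- original join conditions go here
INNER JOIN table_C ON ...;

-- Each half of the UNION then only adds its own final join.
SELECT col_A, col_B, col_C, '' AS [col_N]
FROM #ABC
INNER JOIN table_D1 ON ...
UNION
SELECT col_A, col_B, col_C,
    (SELECT E.field_2 FROM table_E AS E
     WHERE D2.field_1 = E.field_1 AND abc.field_1 = E.field_2) AS [col_N]
FROM #ABC AS abc
INNER JOIN table_D2 AS D2 ON ...;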

Related

Converting Nested Subqueries into Mini Queries

I have a lot of trouble reading nested subqueries - I personally prefer to write several mini queries and work from there. I understand that more advanced SQL users find it more efficient to write nested subqueries.
For instance, in the following query:
select distinct b.table_b, a.*
from table_a a
inner join table_c b
on a.id_1 = b.id_1
inner join ( select a.id_1, max(a.var_1) as max_var_1 from table_a a
group by a.id_1) c
on a.id_1 = c.id_1 and a.var_1 = c.max_var_1
Problem: I am trying to turn this into several different queries:
#PART 1:
create table table_1 as select distinct b.table_b, a.*
from table_a a
inner join table_c b
on a.id_1 = b.id_1
#PART 2:
create table table_2 as select a.id_1, max(a.var_1) as max_var_1 from table_a a
group by a.id_1
#PART 3 (final result: final_table)
create table final_table as select a.*, b.*
from table_1 a
inner join table_2 b
on a.id_1 = b.id_1 and a.var_1 = b.max_var_1
My Question: Can someone please tell me if this is correct? Is this how the above nested subquery can be converted into 3 mini queries?
Subqueries only need to be materialized into separate tables when you use them multiple times. Even then, if the subquery returns many records, inserting them into a table is not recommended: a plain SELECT only reads data from disk, while an INSERT also writes to disk, so inserting many records can take longer than just selecting them.
P.S. When a subquery result does need to be materialized, a temporary table (CREATE TEMPORARY TABLE) is usually used.
Another good option is a CTE (Common Table Expression). With a CTE, the database can keep the result of the SELECT in memory and execute the subquery only once; if the subquery is then referenced multiple times, the database reuses the stored result instead of re-executing it (whether a CTE is actually materialized or inlined depends on the database and version).
For the performance of your query, only #PART 2 is needed; the other parts are unnecessary. But for even better performance, I recommend writing the query without any inserts, using a CTE. For example:
with sub_query as (
select
id_1,
max(var_1) as max_var_1
from
table_a
group by id_1
)
select distinct b.table_b, a.*
from table_a a
inner join table_c b
on a.id_1 = b.id_1
inner join sub_query c
on a.id_1 = c.id_1 and a.var_1 = c.max_var_1;

PostgreSQL: why is INNER JOIN so much slower than WHERE?

I have 2 tables, and I copy a file name from one table to the other in an UPDATE operation. Using INNER JOIN makes the query run in 22 seconds when there are just ~4,000 rows. Using a WHERE clause lets it run in about 200 milliseconds. How and why is this happening? Does the INNER JOIN result in additional looping?
Example 1 using INNER JOIN - takes 22 seconds when table_a has about 4k records.
UPDATE table_a SET file_name = tmp.file_name FROM
(
SELECT b.customer_id, b.file_name, b.file_id FROM table_b AS b WHERE b.status = 'A'
) tmp
INNER JOIN table_a AS a
ON tmp.customer_id=a.customer_id AND tmp.file_id=a.file_id;
Example 2 using WHERE runs in about 200 ms.
UPDATE table_a AS a SET file_name = tmp.file_name FROM
(
SELECT b.customer_id, b.file_name, b.file_id FROM table_b AS b WHERE b.status = 'A'
) tmp
WHERE tmp.customer_id=a.customer_id AND tmp.file_id=a.file_id;
The two queries are doing totally different things. The first updates every row in table_a with the expression; I am guessing there are even multiple updates of the same row.
The two references to table_a in the first version are two different references to the table. The effect is a cross join, because no condition connects them.
The second method is the correct syntax for what you want to do in Postgres: in UPDATE ... FROM, the target table is joined to the FROM tables via the WHERE clause rather than being named a second time.
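If you want to see this directly, EXPLAIN makes the difference visible (a sketch, using the tables from the examples above):

-- The join version brings in a second, independent reference to table_a,
-- so the UPDATE target is effectively cross-joined to the join result.
EXPLAIN
UPDATE table_a SET file_name = tmp.file_name FROM
(
SELECT b.customer_id, b.file_name, b.file_id FROM table_b AS b WHERE b.status = 'A'
) tmp
INNER JOIN table_a AS a
ON tmp.customer_id = a.customer_id AND tmp.file_id = a.file_id;

-- The WHERE version joins the UPDATE target itself to tmp,
-- so each row of table_a is matched and updated once.
EXPLAIN
UPDATE table_a AS a SET file_name = tmp.file_name FROM
(
SELECT b.customer_id, b.file_name, b.file_id FROM table_b AS b WHERE b.status = 'A'
) tmp
WHERE tmp.customer_id = a.customer_id AND tmp.file_id = a.file_id;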

BigQuery Full outer join producing "left join" results

I have 2 tables, both of which contain distinct id values. Some of the id values might occur in both tables and some are unique to each table. Table1 has 10,910 rows and Table2 has 11,304 rows.
When running an inner join query:
SELECT COUNT(DISTINCT a.id)
FROM table1 a
JOIN table2 b on a.id = b.id
I get a total of 10,896, i.e. 10,896 ids shared across both tables.
However, when I run a FULL OUTER JOIN on the 2 tables like this:
SELECT COUNT(DISTINCT a.id)
FROM table1 a
FULL OUTER JOIN EACH table2 b on a.id = b.id
I get a total of 10,896 again, but I was expecting all 10,910 ids from table1.
I am wondering if there is an issue with my query syntax.
As you are using EACH, it looks like you are running your queries in Legacy SQL mode.
In BigQuery Legacy SQL, the COUNT(DISTINCT) function is probabilistic - it gives a statistical approximation and is not guaranteed to be exact.
You can use the EXACT_COUNT_DISTINCT() function instead - this one gives you an exact number, but is a little more expensive on the back end.
An even better option - just use Standard SQL.
For your specific query, you only need to remove the EACH keyword and it should work like a charm:
#standardSQL
SELECT COUNT(DISTINCT a.id)
FROM table1 a
JOIN table2 b on a.id = b.id
and
#standardSQL
SELECT COUNT(DISTINCT a.id)
FROM table1 a
FULL OUTER JOIN table2 b on a.id = b.id
I added the original query as a subquery and counted the ids, and it produced the expected results. Still a little strange, but it works.
SELECT EXACT_COUNT_DISTINCT(a.id)
FROM
(SELECT a.id,
        b.id
 FROM table1 a FULL OUTER JOIN EACH table2 b ON a.id = b.id)
It is because in both cases you are counting only the non-null rows for table a, by using count(distinct a.id).
Use count(*) and it should work.
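In standard SQL that would be, for example (note this counts every row the full outer join produces, including the rows that exist only in table2):

#standardSQL
SELECT COUNT(*)
FROM table1 a
FULL OUTER JOIN table2 b ON a.id = b.id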
You will have to add coalesce: with a full outer join, a.id is NULL on the rows that exist only in table2, so a count over a.id alone misses them. coalesce(a.id, b.id) takes the id from whichever side is non-null:
SELECT COUNT(DISTINCT coalesce(a.id,b.id))
FROM table1 a
FULL OUTER JOIN EACH table2 b on a.id = b.id
This query now takes full advantage of the full outer join :)

How to compare two tables in Postgresql?

I have two identical tables:
A: id1, id2, qty, unit
B: id1, id2, qty, unit
The pair (id1, id2) identifies each row and can appear only once in each table.
I have 140 rows in table A and 141 rows in table B.
I would like to find all the keys (id1, id2) that do not appear in both tables. There is one for sure, but there can be more (for example, if the tables contain entirely different data).
I wrote this query:
(TABLE a EXCEPT TABLE b)
UNION ALL
(TABLE b EXCEPT TABLE a) ;
But it's not what I want: it compares whole rows, whereas I don't care whether qty or unit differ; I only care about id1 and id2.
use a full outer join:
select a.*,b.*
from a full outer join b
on a.id1=b.id1 and a.id2=b.id2
this shows both tables side by side, with gaps where there is an unmatched row.
select a.*,b.*
from a full outer join b
on a.id1=b.id1 and a.id2=b.id2
where a.id1 is null or b.id1 is null;
that will only show unmatched rows.
or you can use not in
select * from a
where (id1,id2) not in
( select id1,id2 from b )
that will show rows from a not matched by b.
or the same result using a join
select a.*
from a left outer join b
on a.id1=b.id1 and a.id2=b.id2
where b.id1 is null
sometimes the join is faster than the "not in"
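One caveat the answer above doesn't mention: if id1 or id2 can be NULL in b, the "not in" form can return no rows at all because of SQL NULL semantics. A NOT EXISTS version avoids that pitfall and is otherwise equivalent:

select a.*
from a
where not exists
  ( select 1 from b
    where b.id1 = a.id1 and b.id2 = a.id2 )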
Here is an example of using EXCEPT to see which records are different. Reverse the SELECT statements to see the differences in the other direction: a EXCEPT s, then s EXCEPT a.
SELECT
a.address_entrytype,
a.address_street,
a.address_city,
a.address_state,
a.address_postal_code,
a.company_id
FROM
prospects.address a
except
SELECT
s.address_entrytype,
s.address_street,
s.address_city,
s.address_state,
s.address_postal_code,
s.company_id
FROM
prospects.address_short s
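Applied to the tables in the question, the same EXCEPT idea restricted to just the key columns would look like this (a sketch; it reports only the keys, not the differing qty/unit values):

(SELECT id1, id2 FROM a
 EXCEPT
 SELECT id1, id2 FROM b)
UNION ALL
(SELECT id1, id2 FROM b
 EXCEPT
 SELECT id1, id2 FROM a);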

Slow query performance

I'm having slow query performance, and sometimes it fails with the error "Can't allocate space for object 'temp work table'".
I have 2 tables and 1 view. The two tables are combined with a LEFT JOIN, and the view is used in a subquery. Below is a sample query.
SELECT a.*
FROM Table1 a LEFT JOIN Table2 b ON a.ID = b.ID
WHERE a.ID IN (SELECT ID
               FROM View1)
The above query is very slow, BUT when I use a #temp table it becomes faster.
SELECT ID
INTO #Temp
FROM View1
SELECT a.*
FROM Table1 a LEFT JOIN Table2 b ON a.ID = b.ID
WHERE a.ID IN (SELECT ID
FROM #Temp)
Could someone explain why the first SQL statement is so slow? And kindly give me advice, such as adding a new index?
Note: The first query statement cannot be altered or modified. I used the second query statement only to show my team that putting the third table's data into a temporary table and using that makes it faster.
Basically, in the first query you are accessing the view for each and every row, and in turn the view is executing its query.
In the second one you execute the view's query just once and use the returned results through the temp table.
Try:
SELECT a.*
FROM Table1 a LEFT JOIN Table2 b ON a.ID = b.ID,
(SELECT ID
FROM View1) c
WHERE a.ID = c.ID;
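An equivalent form with explicit join syntax (same semantics; note that, like the query above and unlike IN, the join will repeat rows of Table1 if View1 returns the same ID more than once):

SELECT a.*
FROM Table1 a
LEFT JOIN Table2 b ON a.ID = b.ID
INNER JOIN (SELECT ID
            FROM View1) c ON a.ID = c.ID;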