Best way to understand big and complex SQL queries with many subqueries - sql
I just started in a new project, in a new company.
I was given a big and complex SQL, with about 1000 lines and MANY subqueries, joins, sums, group by, etc.
This SQL is used for report generation (it has no inserts nor updates).
The SQL has some flaws, and my first job in the company is to identify and correct these flaws so that the report shows the correct values (I know the correct values by accessing a legacy system written in Cobol...)
How can I make it easier for me to understand the query, so I can identify the flaws?
As an experienced Java programmer, I know how to refactor complex, badly written, monolithic Java code into easier-to-understand code made of small pieces. But I have no clue how to do that with SQL.
The SQL looks like this:
SELECT columns
FROM
(SELECT columns
FROM
(SELECT DISTINCT columns
FROM table000 alias000
INNER JOIN
table000 alias000
ON column000 = table000.column000
LEFT JOIN
(SELECT columns
FROM (
SELECT DISTINCT columns
FROM columns
WHERE conditions) AS alias000
GROUP BY columns ) alias000
ON
conditions
WHERE conditions
) AS alias000
LEFT JOIN
(SELECT
columns
FROM many_tables
WHERE many_conditions
) )
) AS alias000
ON condition
LEFT JOIN (
SELECT columns
FROM
(SELECT
columns
FROM
many_tables
WHERE many_conditions
) ) ) AS alias001
,
(SELECT
many_columns
FROM
many_tables
WHERE many_conditions) AS alias001
) AS alias001
ON condition
LEFT JOIN
(SELECT
many_columns
FROM many_tables
WHERE many_conditions
) AS alias001
ON condition
,
(SELECT DISTINCT columns
FROM table001 alias001
INNER JOIN
table001 alias001
ON condition
LEFT JOIN
(SELECT columns
FROM (
SELECT DISTINCT columns
FROM tables
WHERE conditions
) AS alias001
GROUP BY
columns ) alias001
ON
condition
WHERE
conditions
) AS alias001
LEFT JOIN
(SELECT columns
FROM tables
WHERE conditions
) AS alias001
ON condition
LEFT JOIN (
SELECT columns
FROM
(SELECT columns
FROM tables
WHERE conditions ) AS alias001
,
(SELECT
columns
FROM
tables
WHERE conditions ) AS alias001
) AS alias001
ON condition
LEFT JOIN
(SELECT
columns
FROM
tables
WHERE conditions
) AS alias001
ON condition
WHERE
condition
) AS alias001
order by column001
How can I make it easier for me to understand the query, so I can identify the flaws?
I deal with code like this every day as we do a lot of reporting and exporting of complex data here.
Step one is to understand the meaning of what you are doing. If you don't understand the meaning, you can't evaluate whether you got the correct results. So understand exactly what you are trying to accomplish, and see if you can find the results you should see for one record in the user interface. It really helps to have something to compare to, so that you can see as you go through the query how adding in new things changes the results. If your query uses single letters or something else meaningless for the derived table aliases, then as you figure out what each derived table is supposed to be doing, replace the alias with something more meaningful, like Employees instead of A. This will make it easier for the next person who works on it to decode it later.
Then what you do is start at the innermost derived table (or subquery if you prefer, but when it is being used as a table, the term derived table is more accurate). First figure out what it is supposed to be doing. For instance, maybe it is getting all the employees who have less than satisfactory performance evaluations.
Run that and check the results to see if they look correct based on the meaning of what you are doing. For instance, if you are looking at unsatisfactory evaluations and you have 10,000 employees, would 5617 seem like a reasonable result set for that chunk of data? Look for repeated records. If the same person is in there three times, then you likely have a problem where you are joining one to many and getting the many back when you only want one. This can be fixed either by using aggregate functions and group by, or by putting in another derived table to replace the problem join.
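The repeated-record check above can be sketched as a small runnable harness. This uses Python's built-in sqlite3 module only so the SQL can be executed anywhere; the employees and evaluations tables and their columns are made up for illustration and are not from the original query.

```python
import sqlite3

# Hypothetical sample data: Ann has two unsatisfactory evaluations.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE evaluations (eval_id INTEGER PRIMARY KEY, emp_id INTEGER, rating TEXT);
    INSERT INTO employees VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO evaluations VALUES
        (10, 1, 'unsatisfactory'),
        (11, 1, 'unsatisfactory'),
        (12, 2, 'unsatisfactory');
""")

# The chunk under test: a one-to-many join that returns Ann twice.
chunk = """
    SELECT e.emp_id, e.name
    FROM employees e
    INNER JOIN evaluations v ON v.emp_id = e.emp_id
    WHERE v.rating = 'unsatisfactory'
"""

# Wrap the chunk and count rows per key to expose repeated records.
duplicates = conn.execute(f"""
    SELECT emp_id, COUNT(*) AS row_count
    FROM ({chunk}) AS chunk
    GROUP BY emp_id
    HAVING COUNT(*) > 1
""").fetchall()

# One possible fix: aggregate so each employee appears once.
deduplicated = conn.execute("""
    SELECT e.emp_id, e.name, COUNT(*) AS unsatisfactory_count
    FROM employees e
    INNER JOIN evaluations v ON v.emp_id = e.emp_id
    WHERE v.rating = 'unsatisfactory'
    GROUP BY e.emp_id, e.name
    ORDER BY e.emp_id
""").fetchall()

print(duplicates)
print(deduplicated)
```

The same wrapper query (group by the key, keep groups with more than one row) can be pointed at any derived table you have pulled out of the big query.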
Once you have the innermost part clear, start checking the results of the other derived tables, adding the code back in and checking the results until you find where records dropped out that should not have. (Hey, I had 137 employees at this stage and now I only have 116. What caused that?) Remember that this is only a clue to look at why that happened. As you build a complex query there will be times when the basic results should change and times when they should not, which is why understanding the meaning of the data is critical.
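A minimal sketch of that stage-by-stage comparison, again using Python's sqlite3 purely as a harness. The tables, the two stages, and the helper row_count are hypothetical; the idea is only to count rows per stage and then list the keys that disappeared.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, dept_id INTEGER);
    CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, dept_name TEXT);
    INSERT INTO employees VALUES (1, 10), (2, 10), (3, NULL), (4, 99);
    INSERT INTO departments VALUES (10, 'Sales');
""")

# Stage 1 is the inner chunk alone; stage 2 adds one more join back in.
stage1 = "SELECT emp_id FROM employees"
stage2 = """
    SELECT e.emp_id
    FROM employees e
    INNER JOIN departments d ON d.dept_id = e.dept_id
"""

def row_count(sql):
    """Count the rows a stage returns by wrapping it as a derived table."""
    return conn.execute(f"SELECT COUNT(*) FROM ({sql}) AS s").fetchone()[0]

counts = (row_count(stage1), row_count(stage2))

# EXCEPT lists the keys present before the join but missing after it.
dropped = [row[0] for row in conn.execute(f"{stage1} EXCEPT {stage2} ORDER BY 1")]

print(counts)
print(dropped)
```

Here the employee with a NULL dept_id and the one pointing at a missing department fall out at the inner join. Whether that is a flaw or the intended behaviour depends on what the report is supposed to mean.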
Some things in general to look out for:
How null values are handled can affect results
Mixing implicit and explicit joins can cause incorrect results in some
databases.
At any rate you should always replace all implicit joins with
explicit ones. That makes the code clearer and less likely to have
errors.
If you have implicit joins, look for accidental cross joins. They are
very easy to introduce even in short queries. In complex ones they
are much more likely, which is why implicit joins should never be
used.
If you have left joins, look out for places where they get
accidentally converted to inner joins by putting a where clause on
the left join table (other than a check for whether the id is null).
So this structure is a problem:
FROM table1 t1
LEFT JOIN Table2 t2 ON t1.t1id = T2.t1id
WHERE t2.somefield = 'test'
and should be
FROM table1 t1
LEFT JOIN Table2 t2 ON t1.t1id = T2.t1id
AND t2.somefield = 'test'
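To see the difference between the two forms above, here is a small sketch run through Python's sqlite3. The table and column names follow the snippet; the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (t1id INTEGER PRIMARY KEY);
    CREATE TABLE table2 (t2id INTEGER PRIMARY KEY, t1id INTEGER, somefield TEXT);
    INSERT INTO table1 VALUES (1), (2), (3);
    INSERT INTO table2 VALUES (100, 1, 'test'), (200, 2, 'other');
""")

# Filter in WHERE: rows with no match have NULL somefield, so they are
# discarded and the left join behaves like an inner join.
filter_in_where = conn.execute("""
    SELECT t1.t1id, t2.somefield
    FROM table1 t1
    LEFT JOIN table2 t2 ON t1.t1id = t2.t1id
    WHERE t2.somefield = 'test'
    ORDER BY t1.t1id
""").fetchall()

# Filter in ON: every table1 row is kept, matched or not.
filter_in_on = conn.execute("""
    SELECT t1.t1id, t2.somefield
    FROM table1 t1
    LEFT JOIN table2 t2 ON t1.t1id = t2.t1id
                       AND t2.somefield = 'test'
    ORDER BY t1.t1id
""").fetchall()

print(filter_in_where)
print(filter_in_on)
```

The first query returns only the matching row, while the second keeps all three table1 rows and fills the unmatched ones with NULL.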
Working from the middle is commonplace in SQL, and converting the set-based logic of SQL into sequential logic can lead to performance issues. Try hard to avoid this, although I know it will be very tempting to do so.
The first thing I would do is question the join syntax. Is this literally the way it is currently written?
select
from tb1, tb2, tb3, tb4, tb5 ...
left join ...
That from clause should look like this
From tb1
Inner join tb2 on .....
Inner join tb3 on .....
....
Left join
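Why the explicit form is safer can be shown with a small sketch, run here through Python's sqlite3. The three tables and the tb1_id columns are assumptions made up for the example; the point is that a forgotten predicate in a comma-separated from clause silently becomes a cross join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tb1 (id INTEGER PRIMARY KEY);
    CREATE TABLE tb2 (id INTEGER PRIMARY KEY, tb1_id INTEGER);
    CREATE TABLE tb3 (id INTEGER PRIMARY KEY, tb1_id INTEGER);
    INSERT INTO tb1 VALUES (1), (2);
    INSERT INTO tb2 VALUES (10, 1), (11, 2);
    INSERT INTO tb3 VALUES (20, 1), (21, 2);
""")

# Implicit joins: the predicate linking tb3 was forgotten, so every
# tb1/tb2 pair is combined with every tb3 row.
implicit_rows = conn.execute("""
    SELECT COUNT(*)
    FROM tb1, tb2, tb3
    WHERE tb2.tb1_id = tb1.id
""").fetchone()[0]

# Explicit joins: each table carries its own ON clause, so a missing
# condition is visible at the join itself.
explicit_rows = conn.execute("""
    SELECT COUNT(*)
    FROM tb1
    INNER JOIN tb2 ON tb2.tb1_id = tb1.id
    INNER JOIN tb3 ON tb3.tb1_id = tb1.id
""").fetchone()[0]

print(implicit_rows, explicit_rows)
```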
http://www-03.ibm.com/software/products/en/data-studio
IBM provides an Eclipse-based analysis tool that has the capability of generating a Visual EXPLAIN graph for complex queries. It shows how indexes are used, what internal result sets are produced and combined and so on.
Example:
SELECT * FROM EMPLOYEE, DEPARTMENT WHERE WORKDEPT=DEPTNO
The solution was to simplify the query using COMMON TABLE EXPRESSIONS.
This allowed me to break the big and complex SQL query into many small and easy to understand queries.
COMMON TABLE EXPRESSIONS:
Can be used to break up complex queries, especially complex joins and sub-queries
Is a way of encapsulating a query definition.
Persists only until the next query is run.
Correct use can lead to improvements in both code quality/maintainability and speed.
Can be used to reference the resulting table multiple times in the same statement (eliminate duplication in SQL).
Can be a substitute for a view when the general use of a view is not required; that is, you do not have to store the definition in metadata.
Example:
WITH cte (Column1, Column2, Column3)
AS
(
SELECT Column1, Column2, Column3
FROM SomeTable
)
SELECT * FROM cte
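When moving a derived table into a CTE, it helps to confirm that the rewritten query returns the same rows as the original. A minimal sketch of that check, using Python's sqlite3 as a harness; the sales table and the region_totals name are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES ('north', 10), ('north', 5), ('south', 7);
""")

# Original shape: the aggregation is buried in a derived table.
nested = conn.execute("""
    SELECT region, total
    FROM (SELECT region, SUM(amount) AS total
          FROM sales
          GROUP BY region) AS region_totals
    WHERE total > 8
    ORDER BY region
""").fetchall()

# Refactored shape: the same aggregation is named up front as a CTE.
with_cte = conn.execute("""
    WITH region_totals (region, total) AS (
        SELECT region, SUM(amount)
        FROM sales
        GROUP BY region
    )
    SELECT region, total
    FROM region_totals
    WHERE total > 8
    ORDER BY region
""").fetchall()

print(nested)
print(with_cte)
```

Comparing the two result sets after each extraction keeps the refactoring honest, one derived table at a time.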
My new SQL looks like this:
------------------------------------------
--COMMON TABLE EXPRESSION 001--
------------------------------------------
WITH alias001 (column001, column002) AS (
SELECT column005, column006
FROM table001
WHERE condition001
GROUP by column008
)
--------------------------------------------
--COMMON TABLE EXPRESSION 002 --
--------------------------------------------
, alias002 (column009) as (
select distinct column009 from table002
)
--------------------------------------------
--COMMON TABLE EXPRESSION 003 --
--------------------------------------------
, alias003 (column1, column2, column3) as (
SELECT '1' AS column1, '1' as column2, 'name001' AS column3 FROM SYSIBM.SYSDUMMY1
UNION ALL
SELECT '1' AS column1, '1.1' as column2, 'name002' AS column3 FROM SYSIBM.SYSDUMMY1
UNION ALL
SELECT '1' AS column1, '1.2' as column2, 'name003' AS column3 FROM SYSIBM.SYSDUMMY1
UNION ALL
SELECT '2' AS column1, '2' as column2, 'name004' AS column3 FROM SYSIBM.SYSDUMMY1
UNION ALL
SELECT '2' AS column1, '2.1' as column2, 'name005' AS column3 FROM SYSIBM.SYSDUMMY1
UNION ALL
SELECT '2' AS column1, '2.2' as column2, 'name006' AS column3 FROM SYSIBM.SYSDUMMY1
UNION ALL
SELECT '3' AS column1, '3' as column2, 'name007' AS column3 FROM SYSIBM.SYSDUMMY1
UNION ALL
SELECT '3' AS column1, '3.1' as column2, 'name008' AS column3 FROM SYSIBM.SYSDUMMY1
)
--------------------------------------------
--COMMON TABLE EXPRESSION 004 --
--------------------------------------------
, alias004 (column1) as (
select distinct column1 from table003
)
------------------------------------------------------
--COMMON TABLE EXPRESSION 005 --
------------------------------------------------------
, alias005 (column1, column2) as (
select column1, column2 from alias002, alias004
)
------------------------------------------------------
--COMMON TABLE EXPRESSION 006 --
------------------------------------------------------
, alias006 (column1, column2, column3, column4) as (
SELECT column1, column2, column3, sum(column0) as column4
FROM table004
LEFT JOIN table005 ON column01 = column02
group by column1, column2, column3
)
------------------------------------------------------
--COMMON TABLE EXPRESSION 007 --
------------------------------------------------------
, alias007 (column1, column2, column3, column4) as (
SELECT column1, column2, column3, sum(column0) as column4
FROM table006
LEFT JOIN table007 ON column01 = column02
group by column1, column2, column3
)
------------------------------------------------------
--COMMON TABLE EXPRESSION 008 --
------------------------------------------------------
, alias008 (column1, column2, column3, column4) as (
select column1, column2, column3, column4 from alias007 where column5 = 123
)
----------------------------------------------------------
--COMMON TABLE EXPRESSION 009 --
----------------------------------------------------------
, alias009 (column1, column2, column3, column4) as (
select column1, column2,
CASE WHEN column3 IS NOT NULL THEN column3 ELSE 0 END as column3,
CASE WHEN column4 IS NOT NULL THEN column4 ELSE 0 END as column4
from table007
)
----------------------------------------------------------
--COMMON TABLE EXPRESSION 010 --
----------------------------------------------------------
, alias010 (column1, column2, column3) as (
select column1, sum(column4), sum(column5)
from alias009
where column6 < 2005
group by column1
)
--------------------------------------------
-- MAIN QUERY --
--------------------------------------------
select j.column1, n.column2, column3, column4, column5, column6,
column3 + column5 AS column7,
column4 + column6 AS column8
from alias010 j
left join alias006 m ON (m.column1 = j.column1)
left join alias008 n ON (n.column1 = j.column1)
EDIT: I got downvoted on this answer, possibly because they thought I was proposing this as how you should build the final query. I should clarify that this is purely to try and understand what is going on. Once you understand the subqueries and how they link together, you would then use that knowledge to make the necessary changes to the query and rebuild it in an efficient way.
I've used the technique of intermediate temp tables to troubleshoot complex queries quite a bit. They break the logic up into smaller chunks and are also useful if the original query takes a long time. You can test how to combine these intermediate tables without the overhead of rerunning the whole query. Sometimes I'll use temporary views instead of temp tables because the query optimiser can continue to use indexes on the base tables. The temporary views would then get dropped once you've finished.
I would start from the innermost subqueries and work my way to the outside.
You're looking for subqueries which appear several times under slightly different guises, and also to give them a concise description - what are they designed to do?
Eg, replace
from (
select x1.y1, x1.y2, x1.y3 ...
from tb1, tb2, tb3, tb4, tb5 ...
left join ...
where ...
group by ...
) as a1
with
from daniel_view1 as a1
where daniel_view1 is
create view daniel_view1 as
select x1.y1, x1.y2, x1.y3 ...
from tb1, tb2, tb3, tb4, tb5 ...
left join ...
where ...
group by ...
That will already make it look cleaner. Then compare the views. Can any be merged together? You won't necessarily end up keeping the views in the final product, but they will help see the broader pattern without drowning in detail.
Alternatively, you could insert the subquery into a temp table
insert #daniel_work1
select x1.y1, x1.y2, x1.y3 ...
from tb1, tb2, tb3, tb4, tb5 ...
left join ...
where ...
group by ...
Then replace the subquery with
select ... from #daniel_work1 as a1
The other thing you could do is to see if you can break it up into sequential steps.
If you see
select ... from ...
union all
select ... from ...
this could become
insert #steps
select 'step1', ...#1...
insert #steps
select 'step2', ...#2...
union is trickier because set union removes duplicate rows (rows where all of their columns are the same as another row).
By storing intermediate results in temp tables, you can look inside the query as it unfolded, and replay difficult steps. I have 'step_id' as the first column of all my debugging temp tables, so if it gets filled in stages, then you see what data applies to what stage.
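A rough sketch of the step_id idea, using a SQLite temp table through Python's sqlite3 in place of the #steps table above. The orders and refunds tables stand in for the two halves of a union all and are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount INTEGER);
    CREATE TABLE refunds (id INTEGER PRIMARY KEY, amount INTEGER);
    INSERT INTO orders VALUES (1, 100), (2, 50);
    INSERT INTO refunds VALUES (1, -20);

    -- Debugging table: step_id comes first so each row shows its origin.
    CREATE TEMP TABLE steps (step_id TEXT, id INTEGER, amount INTEGER);

    -- Each half of the original UNION ALL becomes its own insert.
    INSERT INTO steps SELECT 'step1', id, amount FROM orders;
    INSERT INTO steps SELECT 'step2', id, amount FROM refunds;
""")

# Inspect what each step contributed before looking at the combined set.
per_step = conn.execute("""
    SELECT step_id, COUNT(*), SUM(amount)
    FROM steps
    GROUP BY step_id
    ORDER BY step_id
""").fetchall()

# The combined rows are what the original UNION ALL would have produced.
combined = conn.execute("""
    SELECT id, amount
    FROM steps
    ORDER BY step_id, id
""").fetchall()

print(per_step)
print(combined)
```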
There are a few tricks that give a clue about what is going on. If you see a table joined to itself like this:
select ... from mytable t1 inner join mytable t2 on t2.id < t1.id
it usually means they want a cross product of the table with itself, but without duplicates. You'll get keys 1 & 2 but not 2 & 1.
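A quick way to convince yourself of this is to run the self join on three rows. The sketch below uses Python's sqlite3 as a harness, with mytable reduced to a single id column for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE mytable (id INTEGER PRIMARY KEY);
    INSERT INTO mytable VALUES (1), (2), (3);
""")

# The strict inequality keeps each unordered pair once and drops
# both the mirrored pairs and a row paired with itself.
pairs = conn.execute("""
    SELECT t2.id, t1.id
    FROM mytable t1
    INNER JOIN mytable t2 ON t2.id < t1.id
    ORDER BY t2.id, t1.id
""").fetchall()

print(pairs)
```

Three rows give three pairs instead of the nine a full cross product would return.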
Related
Write a where clause that compares two columns to the same subquery?
I want to know if it's possible to make a where clause compare 2 columns to the same subquery. I know I could make a temp table/ variable table or write the same subquery twice. But I want to avoid all that if possible. The Subquery is long and complex and will cause significant overhead if I have to write it twice. Here is an example of what I am trying to do. SELECT * FROM Table WHERE (Column1 OR Column2) IN (Select column from TABLE) I'm looking for a simple answer and that might just be NO but if it's possible without anything too elaborate please clue me in. I updated the select to use OR instead of AND as this clarified my question a little better.
The example you've given would probably perform best using exists, such as:
select *
from t1
where exists ( select 1 from t2 where t2.col = t1.col1 and t2.col = t1.col2 );
To prevent writing the complicated subquery twice, you can use a CTE (Common Table Expression):
;WITH MyFirstCTE (x) AS
(
SELECT [column] FROM [TABLE1]
-- add all the very complicated stuff here
)
SELECT *
FROM Table2
WHERE Column1 IN (SELECT x FROM MyFirstCTE)
AND Column2 IN (SELECT x FROM MyFirstCTE)
Or using EXISTS:
;WITH MyFirstCTE (x) AS
(
SELECT [column] FROM [TABLE1]
-- add all the very complicated stuff here
)
SELECT *
FROM Table2
WHERE EXISTS (SELECT 1 FROM MyFirstCTE WHERE x = Column1)
AND EXISTS (SELECT 1 FROM MyFirstCTE WHERE x = Column2)
I used deliberately clumsy names, best to pick better ones. I started it with a ; because if it's not the first command in a larger script then a ; is needed to separate the CTE from the commands before it.
Return distinct values from multiple columns in one query
I have searched but did not find a good answer. I can get distinct values, but the problem is that I am applying the query to two columns. It should return distinct values, but it is returning these:
Au |FAA303
Au |FAA505
In my table I want Au to appear only once, since it is associated with both FAA303 and FAA505. What I want is like this:
Au |FAA303 |FAA505
This is my query in PostgreSQL. I am fairly new to database queries.
select distinct column1, column2 from table_name
The distinct keyword applies to the combination of all selected fields, not to the first one only. Suppressing repeated values is something you would typically do in an application that connects to your database and performs the query. Just to show you that it is possible in SQL, I provide you this query, but please consider doing this in the application instead:
select case row_number() over (partition by column1 order by column2)
when 1 then column1
end as column1,
column2
from (
select distinct column1, column2
from table_name
order by column1, column2
) as t
Is it possible to ORDER BY a computed column without including it in the result set?
I have this query:
SELECT Column1, Column2, Column3, /* computed column */ AS SortColumn
FROM Table1
ORDER BY SortColumn
SortColumn serves no other purpose than to define an order for sorting the result set. Thus I'd like to omit it from the result set to decrease the size of the data sent to the client. The following fails …
SELECT Column1, Column2, Column3
FROM (
SELECT Column1, Column2, Column3, /* computed column */ AS SortColumn
FROM Table1
ORDER BY SortColumn
) AS SortedTable1
… because of:
Msg 1033, Level 15, State 1
The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP or FOR XML is also specified.
So there's this hacky solution:
SELECT Column1, Column2, Column3
FROM (
SELECT TOP /* very high number */ Column1, Column2, Column3, /* computed column */ AS SortColumn
FROM Table1
ORDER BY SortColumn
) AS SortedTable1
Is there a clean solution I'm not aware of, since this doesn't sound like a rare scenario?
Edit: The solutions already given do indeed work fine for the query I referred to. Unfortunately, I left out an important detail: the (already existing) query consists of two SELECTs with a UNION in between, which changes the matter considerably (again simplified, and hopefully not too simplified):
SELECT Column1, Column2, Column3 FROM Table1
UNION ALL
SELECT Column1, Column2, Column3 FROM Table1
ORDER BY /* computed column */
Msg 104, Level 16, State 1
ORDER BY items must appear in the select list if the statement contains a UNION, INTERSECT or EXCEPT operator.
So this error message clearly says that I have to put the computed column in both of the select lists. So there we are again with the subquery solution, which doesn't reliably work, as pointed out in the answers.
You don't need to have a computed column in the select statement to use it in an order by SELECT Column1, Column2, Column3 FROM Table1 ORDER BY /* computed column */ If you need to do it using UNION, then do the UNION in a cte, and the order by in the select, making sure to include all the columns you need to do the calculation in the CTE WITH src AS ( SELECT Column1, Column2, Column3, /* computation */ ColumnNeededForOrderBy FROM Table1 UNION ALL SELECT Column1, Column2, Column3, /* computation */ ColumnNeededForOrderBy FROM Table2 ) SELECT Column1, Column2, Column3 FROM src ORDER BY ColumnNeededForOrderBy If you don't care to be specific with the column name, you can use the column index and skip the CTE. I don't like this because you might add a column to the query later and forget to update the index in the ORDER BY clause (I've done it before). Also, the query plans will likely be the same, so it's not like the CTE will cost you anything. SELECT Column1, Column2, Column3, /* computation */ FROM Table1 UNION ALL SELECT Column1, Column2, Column3, /* computation */ FROM Table2 ORDER BY 4
If, for whatever reason, it's not practical to do the calculation in the ORDER BY, you can do something quite similar to your attempt: SELECT Column1, Column2, Column3 FROM ( SELECT Column1, Column2, Column3, /* computed column */ AS SortColumn FROM Table1 ) AS SortedTable1 ORDER BY SortColumn Note that all that's changed here is that the ORDER BY is applied to the outer query. It's perfectly valid to reference columns in the ORDER BY that don't appear in the SELECT clause.
Just put the expression in the order by:
SELECT Column1, Column2, Column3
FROM Table1
ORDER BY <computed column>
The reason this is forbidden is that the ordering of the outer select has nothing to do with the ordering of the inner select - not by contract. So if you use order by without a top clause, you're obviously making a mistake. By using top the way you do, you simply hide the error, but you still have the same mistake. Your hack only works because the engine happened to preserve the order - but that's not a given, and there's no way to enforce that (other than using order by in the outer query). For example, a different index usage or parallel execution can scramble your data. So no, there isn't another way - you need to order by in the outer query, and that requires you to output the column you want to sort by in the subquery. And unless you're using *, it's not like it makes any difference - you don't need to select it in the outer select, just the inner one. And only the outer select is sent to the client :)
The only place for an ORDER BY is the outer most statement. Of course there are exceptions: If you for example need the TOP record for a filtered list (e.g. the last valid value on a given date). But in these cases you must combine ORDER BY with TOP. Only the outer most ORDER BY will sort the list you get.
After the edit, it looks like this is what you need:
SELECT Column1, Column2, Column3
FROM (
SELECT Column1, Column2, Column3 FROM Table1
UNION ALL
SELECT Column1, Column2, Column3 FROM Table1
) AS Combined
ORDER BY /* computed column */
SQL Server - improve performance of searching a values in table
I'm facing a problem with one query. The easiest way is to explain it step by step. First, I search for specific values in column1 of table1 using a query like this:
Query #1:
select column1 from table1 where column1 in('xxx','yyy','zzz') group by column1 having count(*) >3
So now I have a list of values from column1 that occur more than 3 times. Then I need to use that list in the where condition of another query:
select column1, column2, column3 from table1 where column1 in (query 1)
Unfortunately, when I use query 1 as a subquery, execution is really slow, and I need to find a different way to do this. Any suggestions on how I can improve the performance? Best regards and thank you in advance.
If they are the same table, then use window functions:
select t.*
from (select t.*, count(*) over (partition by column1) as cnt
from table1 t
where column1 in ('xxx', 'yyy', 'zzz')
) t
where cnt > 3;
Both this and your original query will benefit from having an index on table1(column1).
1)First of all take a look if the query is correctly indexed. Maybe you have to add an index on column1. 2) try with it: select column1, column2, column3 from table1 as T1 inner join ( select column1, column2, column3 from table1 where column1 in (query 1)) as T2 on t1.column1 = t2.column1
SQL Query to retrieve results from two equally designed tables
How can I query the results of two equally designed tables? if table1 contains 1 column with data: abc def hjj and table2 contains 1 column with data: uyy iuu pol then i want my query to return abc def hjj uyy iuu pol but I want to make sure that if I try to do the same task with multiple columns that the associations remain.
SELECT Column1, Column2, Column3 FROM Table1 UNION SELECT Column1, Column2, Column5 AS Column3 FROM Table2 ORDER BY Column1 Notice how I do an order by at the end and that Column5 in Table2 is the equivalent of Column3 in Table1. The Order By is of course optional, but allows you to control the order of items from both tables once they are combined.
Use a UNION:
SELECT * FROM TABLE_A
UNION
SELECT * FROM TABLE_B
UNION will give you all distinct results, whereas UNION ALL will give you every row from both sets, duplicates included.
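The difference is easy to check on two tiny tables that share one value. This sketch runs the SQL through Python's sqlite3; the single-column tables and their contents are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (col TEXT);
    CREATE TABLE table_b (col TEXT);
    INSERT INTO table_a VALUES ('abc'), ('def');
    INSERT INTO table_b VALUES ('def'), ('uyy');
""")

# UNION removes the duplicate 'def' that appears in both tables.
union_rows = conn.execute("""
    SELECT col FROM table_a
    UNION
    SELECT col FROM table_b
    ORDER BY col
""").fetchall()

# UNION ALL keeps every row from both tables.
union_all_rows = conn.execute("""
    SELECT col FROM table_a
    UNION ALL
    SELECT col FROM table_b
    ORDER BY col
""").fetchall()

print(union_rows)
print(union_all_rows)
```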
SELECT col FROM t1 UNION SELECT col FROM t2 Union reference.
sev, since union is the solution to what you described and you say that didn't work, perhaps you can provide the code you wrote that didn't work as clearly we are missing part of the picture. Are you positive the second table has the records you want? How do you know for sure?