Can I select several tables in the same WITH query? - sql

I have a long query with a with structure. At the end of it, I'd like to output two tables. Is this possible?
(The tables and queries are in Snowflake SQL, by the way.)
The code looks like this:
with table_a as (
select id,
product_a
from x.x ),
table_b as (
select id,
product_b
from x.y ),
table_c as (
..... many more alias tables and subqueries here .....
)
select * from table_g where z = 3 ;
But for the very last row, I'd like to query table_g twice, once with z = 3 and once with another condition, so I get two tables as the result. Is there a way of doing that (ending with two queries rather than just one) or do I have to re-run the whole code for each table I want as output?

One query = one result set. That's just the way RDBMSs work.
A CTE (WITH statement) is just syntactic sugar for a subquery.
For instance, a query similar to yours:
with table_a as (
select id,
product_a
from x.x ),
table_b as (
select id,
product_b
from x.y ),
table_c as (
select id,
product_c
from x.z )
select *
from table_a
inner join table_b on table_a.id = table_b.id
inner join table_c on table_b.id = table_c.id;
Is 100% identical to:
select *
from
(select id, product_a from x.x) table_a
inner join (select id, product_b from x.y) table_b
on table_a.id = table_b.id
inner join (select id, product_c from x.z) table_c
on table_b.id = table_c.id
The CTE version doesn't give you any extra features that aren't available in the non-cte version (with the exception of a recursive cte) and the execution path will be 100% the same (EDIT: Please see Simon's answer and comment below where he notes that Snowflake may materialize the derived table defined by the CTE so that it only has to perform that step once should the CTE be referenced multiple times in the main query). As such there is still no way to get a second result set from the single query.
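If the goal is only to avoid re-running the long chain of CTEs, one option is to materialize the final CTE into a temporary table once and then query that table as often as needed. The following is a minimal sketch, not something taken from the question's code: `src_a`, `src_b`, `table_g_result`, the sample rows and the second filter `z = 4` are all made up for illustration, and it assumes `CREATE TEMPORARY TABLE ... AS` is available, as it is in Snowflake.

```sql
-- Hypothetical stand-ins for x.x and x.y from the question.
CREATE TEMPORARY TABLE src_a (id INT, product_a VARCHAR(20), z INT);
CREATE TEMPORARY TABLE src_b (id INT, product_b VARCHAR(20));

INSERT INTO src_a VALUES (1, 'apple', 3), (2, 'pear', 4);
INSERT INTO src_b VALUES (1, 'bolt'), (2, 'nut');

-- Run the WITH query once and keep the final CTE for the session.
CREATE TEMPORARY TABLE table_g_result AS
WITH table_a AS (
    SELECT id, product_a, z FROM src_a
),
table_b AS (
    SELECT id, product_b FROM src_b
),
table_g AS (
    -- stand-in for the long chain of CTEs in the question
    SELECT table_a.id, table_a.product_a, table_b.product_b, table_a.z
    FROM table_a
    INNER JOIN table_b ON table_a.id = table_b.id
)
SELECT * FROM table_g;

-- Two separate statements, two result sets, CTEs evaluated only once.
SELECT * FROM table_g_result WHERE z = 3;
SELECT * FROM table_g_result WHERE z = 4;
```

This is still one result set per statement; the temporary table just means the expensive part is not repeated.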

While the two forms are syntactically equivalent, they don't necessarily get the same execution plan.
The first case is when one of the stages in the CTE is expensive and is reused many times by other CTEs or joins. Under Snowflake, when such a stage is written as a CTE, I have witnessed the "expensive" part run only a single time, which can be good. For example:
WITH expensive_select AS (
SELECT a.a, b.b, c.c
FROM table_a AS a
JOIN table_b AS b
JOIN table_c AS c
WHERE complex_filters
), do_some_thing_with_results AS (
SELECT a, stuff
FROM expensive_select
WHERE filters_1
), do_some_aggregation AS (
SELECT a, SUM(b) AS sum_b
FROM expensive_select
WHERE filters_2
GROUP BY a
)
SELECT a.a
,a.b
,b.stuff
,c.sum_b
FROM expensive_select AS a
LEFT JOIN do_some_thing_with_results AS b ON a.a = b.a
LEFT JOIN do_some_aggregation AS c ON a.a = c.a;
This was originally unrolled, and the expensive part was some VIEWs where the date range filter applied at the top level was not getting pushed down (due to window functions), which resulted in full table scans, multiple times. By pushing the filters into the CTE, the cost was paid once. (In our case, putting the date range filters in the CTE made Snowflake notice the filters and push them down into the view. Things can change, though: a few weeks later the original code ran as well as the modified version, so they "fixed" something.)
In other cases like this, the different paths that used the CTE each used smaller subsets of the results, so using the CTE reduced the remote IO and improved performance, although there were then more stalls in the execution plan.
I also use CTEs like this to make the code easier to read, by giving the CTE a meaningful name and then aliasing it to something short for use. Really love that.

Related

Converting Nested Subqueries into Mini Queries

I have a lot of trouble reading nested subqueries - I personally prefer to write several mini queries and work from there. I understand that more advanced SQL users find it more efficient to write nested subqueries.
For instance, in the following query:
select distinct b.table_b, a.*
from table_a a
inner join table_c b
on a.id_1 = b.id_1
inner join ( select a.id_1, max(a.var_1) as max_var_1 from table_a a
group by a.id_1) c
on a.id_1 = c.id_1 and a.var_1 = c.max_var_1
Problem: I am trying to turn this into several different queries:
#PART 1 :
create table table_1 as select distinct b.table_b, a.*
from table_a a
inner join table_c b
on a.id_1 = b.id_1
#PART 2:
create table table_2 as select a.id_1, max(a.var_1) as max_var_1 from table_a a
group by a.id_1
#PART 3 (final result: final_table)
create table final_table as select a.*, b.*
from table_1 a
inner join table_2 b
on a.id_1 = b.id_1 and a.var_1 = b.max_var_1
My Question: Can someone please tell me if this is correct? Is this how the above nested subquery can be converted into 3 mini queries?
Subqueries are normally written to separate tables only when you use them multiple times. Even then, if the subquery returns many records, inserting them into a separate table is not recommended: a plain select only reads data from disk, whereas an insert also has to write to disk, and inserting many records may take longer than selecting them.
P.S. When a subquery result does need to be stored, "create temporary table" is what is mostly used.
Another good way is to use a CTE (Common Table Expression). Depending on the database, the result of a CTE may be kept in memory so that the subquery is executed only once; if it is then referenced multiple times, the database reuses the stored result instead of executing it again.
For the performance of your query, only #PART 2 is worth keeping as a separate step; the others are unnecessary. But for better performance, I recommend writing the query without inserting at all, using a CTE. For example:
with sub_query as (
select
id_1,
max(var_1) as max_var_1
from
table_a
group by id_1
)
select distinct b.table_b, a.*
from table_a a
inner join table_c b
on a.id_1 = b.id_1
inner join sub_query c
on a.id_1 = c.id_1 and a.var_1 = c.max_var_1;

In SQL is there a way to use select * on a join?

Using Snowflake, I have 2 tables, one with many columns and the other with a few. When I try to select * on their join, I get the following error:
SQL compilation error:duplicate column name
This makes sense because my joining columns are in both tables. I could probably use select with column names instead of *, but is there a way I could avoid that? Or at least have the query infer the column names dynamically from any table it gets?
I am quite sure Snowflake will let you select all columns from both sides of a join of two or more tables via
SELECT a.*, b.*
FROM table_a AS a
JOIN table_b AS b
ON a.x = b.x
What you will not be able to do is refer to the column names in ORDER BY without qualifying them, thus this will not work:
SELECT a.*, b.*
FROM table_a AS a
JOIN table_b AS b
ON a.x = b.x
ORDER BY x
Even though some databases know that, because you have JOIN ON a.x = b.x, there is only one x, Snowflake will not allow it (well, it didn't last time I tried this).
But with the above you can use the aliased name or the output column position, thus both of the following will work:
SELECT a.*, b.*
FROM table_a AS a
JOIN table_b AS b
ON a.x = b.x
ORDER BY a.x
SELECT a.*, b.*
FROM table_a AS a
JOIN table_b AS b
ON a.x = b.x
ORDER BY 1 -- assuming x is the first column
In general the * and a.* forms are super convenient, but are actually bad for performance.
When selecting with *, you are at risk of getting the columns back in a different order if the table has been recreated, which makes the code that reads the results unstable. This also impacts VIEWs.
It also means all metadata for the table needs to be loaded to know what the complete form of the data will be. Whereas if you want x, y, z only, and a w is later added to the table, the query plan can still be compiled quickly.
Lastly, if you use SELECT * FROM table in a sub-select and only a subset of those columns is needed, the compiler has to prune the unused ones. And if all columns are attached to a correctly aliased table, then when a second table later gains a column with the same name, naked columns do not become ambiguous. That ambiguity only shows up when the SQL is run, which might be an "annual report" that doesn't happen that often. Wow, what a long "use aliases" rant.
You can prefix the name of the column with the name of the table:
select table_a.id, table_b.name from table_a join table_b using (id)
The same works in combination with *:
select table_a.id, table_b.* from table_a join table_b using (id)
It works in "join" and "where" parts of the statement as well
select table_a.id, table_b.* from table_a join table_b
on table_a.id = table_b.id where table_b.name LIKE 'b%'
You can use table aliases to make the statement shorter:
select a.id, b.* from table_a a join table_b b
on a.id = b.id
Aliases can also be applied to fields, for use in subqueries, client software and (depending on the SQL server) other parts of the statement, for example 'order by':
select a.id as a_id, b.* from table_a a join table_b b
on a.id = b.id order by a_id
If you're after a result that includes all the distinct non-join columns from each table in the join with the join columns included in the output only once (given they will be identical for an inner-join) you can use NATURAL JOIN.
e.g.
select * from d1 natural inner join d2 order by id;
See examples: https://docs.snowflake.com/en/sql-reference/constructs/join.html#examples
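For Snowflake specifically, a further option is the `EXCLUDE` keyword on a star expression, which drops named columns from the expansion. This is a minimal sketch, assuming a Snowflake version that supports `EXCLUDE`; the temporary tables `table_a` and `table_b`, their columns and the sample rows are hypothetical.

```sql
-- Hypothetical tables sharing the join column x.
CREATE TEMPORARY TABLE table_a (x INT, a1 VARCHAR, a2 VARCHAR);
CREATE TEMPORARY TABLE table_b (x INT, b1 VARCHAR);

INSERT INTO table_a VALUES (1, 'a', 'aa'), (2, 'b', 'bb');
INSERT INTO table_b VALUES (1, 'one'), (3, 'three');

-- All columns of table_a, plus every column of table_b except the
-- duplicated join key, so x appears only once in the output.
SELECT a.*, b.* EXCLUDE (x)
FROM table_a AS a
JOIN table_b AS b
  ON a.x = b.x;
```

Unlike NATURAL JOIN, the join condition stays explicit, so a column added later with a matching name does not silently change the join.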

Oracle SQL - Select not using index as expected

So I haven't used Oracle in more than 5 years and I'm out of practice. I've been on SQL Server all that time.
I'm looking at some of the existing queries and trying to improve them, but they're reacting really weirdly. According to the explain plan instead of going faster they're instead doing full table scans and not using the indexes.
In the original query, there is an equijoin done between two tables done in the where statement. We'll call them table A and B. I used an explain plan followed by SELECT * FROM table(DBMS_XPLAN.DISPLAY (FORMAT=>'ALL +OUTLINE')); and it tells me that Table A is queried by Local Index.
TABLE ACCESS BY LOCAL INDEX ROWID
SELECT A.*
FROM TableA A, TableB B
WHERE A.SecondaryID = B.ID;
I tried to change the query and join TableA with a new table (Table C). Table C is a subset of Table B with 700 records instead of 100K. However the explain plan tells me that Table A is now queried with a full lookup.
CREATE TABLE TableC
AS
SELECT * FROM TableB WHERE Active='Y';
SELECT A.*
FROM TableA A, TableC C
WHERE A.SecondaryID = C.ID;
Next step, I kept the join between tables A & C, but used a hint to tell it to use the index on Table A. However it still does a full lookup.
SELECT /*+ INDEX (A_NDX01) */ A.*
FROM TableA A, TableC C
WHERE A.SecondaryID = C.ID;
So I tried to change from a join to a simple Select of table A and use an IN statement to compare to table C. Still a full table scan.
SELECT A.*
FROM TableA A
WHERE A.SecondaryID in (SELECT ID FROM TableC);
Lastly, I took the previous statement and changed the subselect to pull the top 1000 records, and it used the index. The odd thing is that there are only 700 records in Table C.
SELECT A.*
FROM TableA A
WHERE A.SecondaryID in (SELECT ID FROM TableC WHERE rownum <1000
)
I was wondering if someone could help me figure out what's happening?
My best guess is that since TableC is a new table, maybe the optimizer doesn't know how many records are in it, and that's why it will only use the index if it knows that there are fewer than 1000 records?
I tried to run dbms_stats.gather_schema_stats on my schema though and it did not help.
Thank you for your help.
As a general rule, using an index will not always make your query go faster.
Hints are directives to the optimizer to use a particular access path, but that doesn't mean the optimizer will choose to obey them. In this case, the optimizer would have considered an index lookup on TableA to be more expensive in the following statements:
SELECT A.*
FROM TableA A, TableB B
WHERE A.SecondaryID = B.ID;
SELECT /*+ INDEX (A_NDX01) */ A.*
FROM TableA A, TableC C
WHERE A.SecondaryID = C.ID;
SELECT A.*
FROM TableA A
WHERE A.SecondaryID in (SELECT ID FROM TableC);
Internally, the optimizer might have converted all of these statements (including the IN form) into a join and, considering the data in TableA and TableC, decided to use a full table scan.
When you added the rownum condition, this plan conversion was not done, because view merging does not work when the query block contains rownum.
I believe this is what is happening when you did
SELECT A.*
FROM TableA A
WHERE A.SecondaryID in (SELECT ID FROM TableC WHERE rownum <1000)
Have a look at the following link
Oracle. Preventing merge subquery and main query conditions
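Two things may be worth checking here, offered as a sketch rather than a diagnosis. First, Oracle's `INDEX` hint takes the table alias followed by the index name; a hint that names only the index is not matched to any table in the query and is silently ignored. Second, statistics for the newly created `TableC` can be gathered for that table alone with `DBMS_STATS.GATHER_TABLE_STATS`. The index name `A_NDX01` comes from the question; that it is an index on `TableA.SecondaryID` and that `TableC` lives in the current user's schema are assumptions.

```sql
-- Gather optimizer statistics for the new table only
-- (assumes TableC is in the current user's schema).
BEGIN
  DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => 'TABLEC');
END;
/

-- INDEX hint written with both the table alias (A) and the index name.
SELECT /*+ INDEX(A A_NDX01) */ A.*
FROM TableA A, TableC C
WHERE A.SecondaryID = C.ID;
```

The lone `/` is the SQL*Plus terminator for the PL/SQL block; other clients may not need it.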

Which is better for performance, selecting all the columns or selecting only the required columns while performing a join?

I have been asked to do performance tuning of a SQL Server query which has many joins in it.
For example
LEFT JOIN
vw_BILLABLE_CENSUS_R CEN ON DE.Client = CEN.Client
AND CAL.REPORTING_MONTH = CEN.REPORTING_MONTH
There are almost 25 columns present in vw_BILLABLE_CENSUS_R, but we want to use only 3 of them. So I wanted to know whether, instead of selecting all the columns from the view or table, I should select only the required columns and then perform the join, like this:
LEFT JOIN (SELECT [Column_1], [Column_2], [Column_3]
FROM vw_BILLABLE_CENSUS_R) CEN ON DE.Client = CEN.Client
AND CAL.REPORTING_MONTH = CEN.REPORTING_MONTH
So will this improve the performance or not?
The important part is the columns you are actually using in the outermost SELECT, not the ones you are selecting to join. The SQL Server engine is smart enough to realize that it does not need to retrieve all columns from the referenced table (or view) if it doesn't need them.
So the following 2 queries should yield the exact same query execution plan:
SELECT
A.SomeColumn
FROM
MyTable AS A
LEFT JOIN (
SELECT
*
FROM
OtherTable AS B) AS X ON A.SomeColumn = X.SomeColumn
SELECT
A.SomeColumn
FROM
MyTable AS A
LEFT JOIN (
SELECT
B.SomeColumn
FROM
OtherTable AS B) AS X ON A.SomeColumn = X.SomeColumn
The difference would be if you actually use the selected column (in a conditional where or actually retrieving the value), as in here:
SELECT
A.SomeColumn,
X.* -- * has all X columns
FROM
MyTable AS A
LEFT JOIN (
SELECT
B.*
FROM
OtherTable AS B) AS X ON A.SomeColumn = X.SomeColumn
SELECT
A.SomeColumn,
X.* -- * has only X's SomeColumn
FROM
MyTable AS A
LEFT JOIN (
SELECT
B.SomeColumn
FROM
OtherTable AS B) AS X ON A.SomeColumn = X.SomeColumn
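One way to check this on your own server, sketched here with the hypothetical `MyTable` and `OtherTable` from above (both are assumed to exist), is to ask SQL Server for the estimated plan of the two forms and compare them. `SET SHOWPLAN_TEXT ON` makes the statements that follow return their plan text instead of executing; it has to be the only statement in its batch, hence the `GO` separators, which are a client-side batch separator in tools such as SSMS and sqlcmd.

```sql
SET SHOWPLAN_TEXT ON;
GO

-- Form 1: the derived table selects every column.
SELECT A.SomeColumn
FROM MyTable AS A
LEFT JOIN (SELECT * FROM OtherTable AS B) AS X
    ON A.SomeColumn = X.SomeColumn;

-- Form 2: the derived table selects only the join column.
SELECT A.SomeColumn
FROM MyTable AS A
LEFT JOIN (SELECT B.SomeColumn FROM OtherTable AS B) AS X
    ON A.SomeColumn = X.SomeColumn;
GO

SET SHOWPLAN_TEXT OFF;
GO
```

If the two plan texts match, the narrower derived table is not buying anything for that query.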
I would rather use this approach:
LEFT JOIN
vw_BILLABLE_CENSUS_R CEN ON DE.Client = CEN.Client
AND CAL.REPORTING_MONTH = CEN.REPORTING_MONTH
than this
LEFT JOIN (SELECT [Column_1], [Column_2], [Column_3]
FROM vw_BILLABLE_CENSUS_R) CEN ON DE.Client = CEN.Client
AND CAL.REPORTING_MONTH = CEN.REPORTING_MONTH
Since in this case:
you make your query simpler,
you do not have to rely on the query optimizer's smartness and expect that it will eliminate unnecessary columns and rows,
finally, you can select as many columns in the outer SELECT as necessary without using derived table techniques.
In some cases derived tables are welcome, for example when you want to eliminate duplicates on the fly in a table you want to join, but, imho, not in your case.
It depends on how many records are stored, but generally it will improve performance.
In this case, read @LukStorms' comments; I think he is right.

Returning only duplicate rows from two tables

Every thread I've seen so far has been to check for duplicate rows and avoiding them. I'm trying to get a query to only return the duplicate rows. I thought it would be as simple as a subquery, but I was wrong. Then I tried the following:
SELECT * FROM a
WHERE EXISTS
(
SELECT * FROM b
WHERE b.id = a.id
)
That was a bust too. How do I return only the duplicate rows? I'm currently going through two tables, but I'm afraid there are a large number of duplicates.
Use this query; it may be better to compare only the relevant columns rather than *.
SELECT * FROM a
INTERSECT
SELECT * FROM b
I am sure your posted code would work too like
SELECT * FROM a
WHERE EXISTS
(
SELECT 1 FROM b WHERE id = a.id
)
You can as well do an INNER JOIN like
SELECT a.* FROM a
JOIN b on a.id = b.id;
You can as well use an IN operator, saying
SELECT * FROM a where id in (select id from b);
If none of these fit, you can use UNION ALL (provided both tables satisfy the union restrictions) along with the ROW_NUMBER() function, like
SELECT * FROM (
SELECT *,
ROW_NUMBER() OVER(PARTITION BY id ORDER BY id) AS rn
FROM (
select * from a
union all
select * from b) xx ) yy
WHERE rn > 1; -- rn = 1 would return every id once, not only the duplicated ones
Note: there's an ambiguity as to what you mean by a duplicate row, and whether you're talking about duplicate keys, or all fields being the same. My answer deals with all fields being the same; some of the others are assuming it's just the keys. It's unclear which you intend.
You might try
SELECT a.id, a.col1, a.col2 FROM a INNER JOIN b ON a.id = b.id
WHERE a.col1 = b.col1 AND a.col2 = b.col2
adding in other columns as necessary. The database engine should be intelligent enough to do the comparisons on the indexed columns first, so it'll be efficient as long as you don't have rows that are different only on lots of non-indexed fields. (If you do, then I don't think anything will do it particularly efficiently.)
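The difference between "duplicate keys" and "duplicate rows" can be seen on a small example. This is only a sketch: the tables `a` and `b`, their columns and the sample rows are made up, and the temporary-table syntax varies between databases.

```sql
-- Hypothetical tables a and b with the same column layout.
CREATE TEMPORARY TABLE a (id INT, col1 VARCHAR(10), col2 VARCHAR(10));
CREATE TEMPORARY TABLE b (id INT, col1 VARCHAR(10), col2 VARCHAR(10));

INSERT INTO a VALUES (1, 'x', 'y'), (2, 'p', 'q'), (3, 'm', 'n');
INSERT INTO b VALUES (1, 'x', 'y'), (2, 'p', 'DIFFERENT'), (4, 'm', 'n');

-- Duplicate keys: ids 1 and 2 exist in both tables,
-- even though the id 2 rows differ in col2.
SELECT a.*
FROM a
WHERE EXISTS (SELECT 1 FROM b WHERE b.id = a.id);

-- Duplicate rows: only the id 1 row matches on every column.
SELECT * FROM a
INTERSECT
SELECT * FROM b;
```

Which of the two you want decides whether the key-based queries or the all-columns comparison is the right tool.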