When are extra columns removed in Teradata SQL?

I understand that the order of operations for SQL in Teradata is as follows:
FROM clause
WHERE clause
GROUP BY clause
HAVING clause
SELECT clause
ORDER BY clause
This is from this link.
Does this mean that any extra, unneeded columns in the tables I am joining are always removed at the very end (when SELECT is performed)? Do those extra unselected columns take up spool space until they are finally dropped?
So if I am joining Table A (5 columns) with Table B (10 columns), the intermediate result right after the join is 14 columns (with 1 common key). But let's say I'm ultimately only selecting 3 columns at the end.
Does the query optimizer always include all 14 columns in the intermediate result (thus taking up spool space) or is it smart enough to only include the needed 3 columns in the intermediate result?
If it is not smart enough to do this, then I could save spool space by rewriting every table I'm joining as a subquery of ONLY the columns I need from that table.
Thank you for your help.

You are confusing the compiling and execution of queries.
Those are not the "order of operations". What you have described is the order of "interpreting the query". This occurs during the compilation phase, when the identifiers (column and table names and aliases) are interpreted.
SQL is a descriptive language. A SQL query describes the result set. It does not describe how the data is processed (a procedural language would do that).
As for not reading columns: Teradata is generally smart enough to read only the columns it needs from the data pages and not carry unreferenced columns through the rest of the processing.
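Even so, the rewrite the asker describes is easy to sketch. Below is a minimal illustration using SQLite from Python (a stand-in engine; the table and column names are made up): each side of the join becomes a derived table that projects only the columns actually needed, so the join's intermediate result is 3 columns wide instead of 14, regardless of how clever the optimizer is.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Hypothetical tables: A has 5 columns, B has 10; "k" is the common join key.
cur.execute("CREATE TABLE a (k INTEGER, a1 TEXT, a2 TEXT, a3 TEXT, a4 TEXT)")
cur.execute("CREATE TABLE b (k INTEGER, b1 TEXT, b2 TEXT, b3 TEXT, b4 TEXT, "
            "b5 TEXT, b6 TEXT, b7 TEXT, b8 TEXT, b9 TEXT)")
cur.execute("INSERT INTO a VALUES (1, 'x', 'y', 'z', 'w')")
cur.execute("INSERT INTO b VALUES (1, 'p', 'q', 'r', 's', 't', 'u', 'v', 'm', 'n')")

# Each side of the join is a derived table projecting only the needed columns,
# so the joined intermediate result carries 3 columns, not 14.
rows = cur.execute("""
    SELECT sa.k, sa.a1, sb.b1
    FROM (SELECT k, a1 FROM a) AS sa
    JOIN (SELECT k, b1 FROM b) AS sb ON sa.k = sb.k
""").fetchall()
print(rows)  # [(1, 'x', 'p')]
```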

Related

Row Order in SQL

I wanted to know whether the row order returned by a query matters.
I'm not using a SQL server yet, just working with plain tables and Excel.
For example, if I do a left join on two tables, my take is that all the rows from the left (first-mentioned) table will come first in the resulting table, whether or not they have matches in the right one. But a classmate ordered the results so that the rows with matches came first and the ones without, with null values, came at the end.
SQL tables represent unordered sets. SQL results sets are unordered unless you explicitly have an ORDER BY for the outermost SELECT.
This is always true and is a fundamental part of the language. Your class should have covered this on day 1.
The results from a query without an ORDER BY may look like they are in a particular order. However, you should not depend on that -- or, you depend on that at your peril. The rule is simple: without an ORDER BY, you do not know the ordering of the result set.
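A quick sketch of the rule, using SQLite from Python (any engine behaves the same way in principle; the table here is made up): only the query with an ORDER BY has a defined order.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t (id INTEGER, name TEXT)")
cur.executemany("INSERT INTO t VALUES (?, ?)", [(3, 'c'), (1, 'a'), (2, 'b')])

# Without ORDER BY the row order is an implementation detail --
# it may *look* stable, but nothing guarantees it.
unordered = cur.execute("SELECT id FROM t").fetchall()

# With ORDER BY the order is part of the result's definition.
ordered = cur.execute("SELECT id FROM t ORDER BY id").fetchall()
print(ordered)  # [(1,), (2,), (3,)]
```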

MS Access SQL - Removing Duplicates From Query

MS Access SQL - This is a generic performance-related duplicates question. So, I don't have a specific example query, but I believe I have explained the situation below clearly and simply in 3 statements.
I have a standard/complex SQL query that selects many columns; some computed, some with an asterisk, and some by name - e.g. (tab1.*, (tab2.col1 & tab2.col2) AS computedFld1, tab3.col4, etc.).
This query Joins about 10 tables. And the Where clause is based on user specified filters that could be based on any of the fields present in all 10 tables.
Based on these filters, I can sometimes get records with the same tab4.ID value.
Question: What is the best way to eliminate duplicate result rows with the same tab4.ID value. I don't care which rows get eliminated. They will differ in non-important ways.
Or, if important, they will differ in that they will have different tab5.ID values; and I want to keep the result rows with the LARGEST tab5.ID values.
But if the first query performs better than the second, then I really don't care which rows get eliminated. The performance is more important.
I have worked on this most of the morning and I am afraid the answer is above my pay grade. I have tried GROUP BY tab4.ID, but then I can't use "*" in the SELECT clause; I have tried many other things too and just keep bumping my head against a wall.
Access does not support CTEs but you can do something similar with saved queries.
So first alias the columns that have same names in your query, something like:
SELECT tab4.ID AS tab4_id, tab5.ID AS tab5_id, ........
and then save your query for example as myquery.
Then you can use this saved query like this:
SELECT q1.*
FROM myquery AS q1
WHERE q1.tab5_id = (SELECT MAX(q2.tab5_id) FROM myquery AS q2 WHERE q2.tab4_id = q1.tab4_id)
This will return 1 row for each tab4_id if there are no duplicate tab5_ids for each tab4_id.
If there are duplicates then you must provide additional conditions.
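The correlated-subquery pattern above can be sketched end to end. This uses SQLite from Python rather than Access (the saved-query contents and column names are made up), but the WHERE ... = (SELECT MAX(...)) logic is identical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Hypothetical flattened result of the saved query, with the aliased key columns.
cur.execute("CREATE TABLE myquery (tab4_id INTEGER, tab5_id INTEGER, payload TEXT)")
cur.executemany("INSERT INTO myquery VALUES (?, ?, ?)", [
    (1, 10, 'keep'), (1, 5, 'drop'),
    (2, 7, 'keep'),
])

# Keep, per tab4_id, only the row with the largest tab5_id.
rows = cur.execute("""
    SELECT q1.tab4_id, q1.tab5_id, q1.payload
    FROM myquery AS q1
    WHERE q1.tab5_id = (SELECT MAX(q2.tab5_id)
                        FROM myquery AS q2
                        WHERE q2.tab4_id = q1.tab4_id)
    ORDER BY q1.tab4_id
""").fetchall()
print(rows)  # [(1, 10, 'keep'), (2, 7, 'keep')]
```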

UNION ALL vs UNPIVOT to change column values into rows

Why would one want to use one method over the other for taking several column values and moving them into rows with a label classifier and a value column?
UNPIVOT is better from a performance perspective because it only scans the rows once. UNION ALL is going to scan the rows once for every subquery. In theory, this doesn't have to happen, but I don't know of an optimizer that would only do one scan.
This is particularly important for large tables or if the "table" is really a complex SQL expression or view.
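UNPIVOT itself is vendor syntax (T-SQL, Oracle), but the UNION ALL form works everywhere. Here is a portable sketch using SQLite from Python (the sales table is made up): one SELECT branch per column being rotated into rows, which is where the one-scan-per-branch cost comes from.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE sales (id INTEGER, q1 INTEGER, q2 INTEGER)")
cur.execute("INSERT INTO sales VALUES (1, 100, 200)")

# The UNION ALL form: one branch (and, naively, one table scan) per column.
rows = cur.execute("""
    SELECT id, 'q1' AS quarter, q1 AS amount FROM sales
    UNION ALL
    SELECT id, 'q2', q2 FROM sales
    ORDER BY quarter
""").fetchall()
print(rows)  # [(1, 'q1', 100), (1, 'q2', 200)]
```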

What is the order of execution for this SQL statement

I have the SQL query below:
SELECT TOP 5 C.CustomerID,C.CustomerName,C.CustomerSalary
FROM Customer C
WHERE C.CustomerSalary > 10000
ORDER BY C.CustomerSalary DESC
What will be the execution order of the following, with a proper explanation?
TOP Clause
WHERE Clause
ORDER BY Clause
Check out the documentation for the SELECT statement, in particular this section:
Logical Processing Order of the SELECT statement
The following steps show the logical processing order, or binding
order, for a SELECT statement. This order determines when the objects
defined in one step are made available to the clauses in subsequent
steps. For example, if the query processor can bind to (access) the
tables or views defined in the FROM clause, these objects and their
columns are made available to all subsequent steps. Conversely,
because the SELECT clause is step 8, any column aliases or derived
columns defined in that clause cannot be referenced by preceding
clauses. However, they can be referenced by subsequent clauses such as
the ORDER BY clause. Note that the actual physical execution of the
statement is determined by the query processor and the order may vary
from this list.
which gives the following order:
FROM
ON
JOIN
WHERE
GROUP BY
WITH CUBE or WITH ROLLUP
HAVING
SELECT
DISTINCT
ORDER BY
TOP
So for the three clauses you asked about, the order is:
WHERE
ORDER BY
TOP
Here is a good article about that: http://blog.sqlauthority.com/2009/04/06/sql-server-logical-query-processing-phases-order-of-statement-execution/
Simply remember this phrase:
Fred Jones' Weird Grave Has Several Dull Owls
Take the first letter of each word, and you get this:
FROM
(ON)
JOIN
WHERE
GROUP BY
(WITH CUBE or WITH ROLLUP)
HAVING
SELECT
DISTINCT
ORDER BY
TOP
Hope that helps.
This is the exact logical order for your query:
1-FROM
2-WHERE
3-SELECT
4-ORDER BY
5-TOP
TOP, WHERE, and ORDER BY are not "executed" - they simply describe the desired result and the database query optimizer determines (hopefully) the best plan for the actual execution. The separation between "declaring the desired result" and how it is physically achieved is what makes SQL a "declarative" language.
Assuming there is an index on CustomerSalary, and the table is not clustered, your query will likely be executed as an index seek + table heap access, as illustrated in this SQL Fiddle (click on View Execution Plan at the bottom):
As you can see, first the correct CustomerSalary value is found through the Index Seek, then the row that value belongs to is retrieved from the table heap through RID Lookup (Row ID Lookup). The Top is just for show here (and has 0% cost), as is the Nested Loops for that matter - the starting index seek will return (at most) one row in any case. The whole query is rather efficient and will likely cost only a few I/O operations.
If the table is clustered, you'll likely have another index seek instead of the table heap access, as illustrated in this SQL Fiddle (note the lack of NONCLUSTERED keyword in the DDL SQL):
But beware: I was lucky this time to get the "right" execution plan. The query optimizer might have chosen a full table scan, which is sometimes actually faster on small tables. When analyzing query plans, always try to do that on realistic amounts of data!
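The WHERE-then-ORDER BY-then-TOP semantics can be checked on any engine. Below is a sketch using SQLite from Python (LIMIT plays the role of T-SQL's TOP; the Customer data is made up): the filter runs over all rows, the survivors are sorted, and only then is the result cut to 5 rows.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE customer (id INTEGER, salary INTEGER)")
cur.executemany("INSERT INTO customer VALUES (?, ?)",
                [(i, i * 1000) for i in range(1, 21)])

# WHERE filters first, ORDER BY sorts the survivors,
# and only then does the row limit apply (TOP in T-SQL, LIMIT here).
rows = cur.execute("""
    SELECT id, salary FROM customer
    WHERE salary > 10000
    ORDER BY salary DESC
    LIMIT 5
""").fetchall()
print(rows)  # [(20, 20000), (19, 19000), (18, 18000), (17, 17000), (16, 16000)]
```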
Visit https://msdn.microsoft.com/en-us/library/ms189499.aspx for a better explanation.
My $0.02 here.
There are two different concepts in action here: the logical execution order and the query execution plan. Another way to see it is to ask who answers the following questions:
How did SQL Server understand my SQL query?
What will it do to execute it in the best possible way, given the current schema and data?
The first question is answered by the logical execution order. Brian's answer shows what it is. It's the way SQL understood your command: "FROM the Customer table (aliased as C), consider only the rows WHERE C.CustomerSalary > 10000, ORDER them BY C.CustomerSalary in descending order, and SELECT the listed columns for the TOP 5 rows". The result set will obey that meaning.
The second question's answer is the query execution plan - and it depends on your schema (table definitions, selectivity of data, number of rows in the Customer table, defined indexes, etc.), since it is heavily dependent on the SQL Server optimizer's internal workings.
Here is the complete sequence for SQL Server:
1. FROM
2. ON
3. JOIN
4. WHERE
5. GROUP BY
6. WITH CUBE or WITH ROLLUP
7. HAVING
8. SELECT
9. DISTINCT
10. ORDER BY
11. TOP
So from the above list, you can easily see the execution sequence of TOP, WHERE and ORDER BY, which is:
1. WHERE
2. ORDER BY
3. TOP
Get more information about it from Microsoft

Avoid full table scan

I have a SQL SELECT query to be tuned. In the query there is a view in the FROM clause which is built from 4 tables. When this query is executed, full table scans take place on all four of these tables, which causes CPU spikes. The four tables have valid indexes built on them.
The query looks similar to this:
SELECT DISTINCT ID, TITLE,......
FROM FINDSCHEDULEDTESTCASE
WHERE STEP_PASS_INDEX = 1 AND LOWER(COMPAREANAME) = 'abc' ORDER BY ID;
The dots indicate that there are many more columns. Here FINDSCHEDULEDTESTCASE is a view on four tables.
Can someone guide me how to avoid full table scan on those four tables.
In any case, with your condition
AND LOWER(COMPAREANAME) = 'abc'
you'll get a full scan of the COMPAREANAME values, because LOWER must be computed for each value, so a plain index on COMPAREANAME cannot be used.
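One common fix, not mentioned above, is a function-based (expression) index, i.e. an index on LOWER(COMPAREANAME) itself. Whether that's available depends on the engine (Oracle has function-based indexes; SQL Server would use an indexed computed column). Here is a sketch using SQLite from Python, which supports expression indexes in version 3.9+ (the table is made up):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE testcase (id INTEGER, compareaname TEXT)")
cur.execute("INSERT INTO testcase VALUES (1, 'ABC')")

# A plain index on compareaname cannot serve LOWER(compareaname) = 'abc';
# an index on the expression itself can.
cur.execute("CREATE INDEX idx_lower ON testcase (LOWER(compareaname))")

plan = cur.execute("""
    EXPLAIN QUERY PLAN
    SELECT id FROM testcase WHERE LOWER(compareaname) = 'abc'
""").fetchall()
print(plan)  # the plan row should mention "USING INDEX idx_lower"
```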
It depends on so many things!
SELECT DISTINCT ID, TITLE, ......
Depending on how many columns you SELECT, it is possible that SQL Server decides to do a table scan instead of using your indexes.
Also, depending on your WHERE conditions, SQL Server can decide to do a table scan instead of using your indexes.
Which version of SQL Server are you using?
There can be ways to improve the indexes on the tables if, for example, the conditions in the WHERE clause represent less than 50% of the rows and you are using SQL Server 2008 (with filtered indexes: http://msdn.microsoft.com/en-us/library/ms188783.aspx ).
Or you can create indexes on views (http://msdn.microsoft.com/en-us/library/ms191432.aspx )
There really is not enough detail in your question to be able to really help you.