UNION ALL vs UNPIVOT to change column values into rows

Why would one want to use one method over the other for taking several column values and moving them into rows with a label classifier and a value column?

UNPIVOT is better from a performance perspective because it only scans the rows once. UNION ALL is going to scan the rows once for every subquery. In theory, this doesn't have to happen, but I don't know of an optimizer that would only do one scan.
This is particularly important for large tables or if the "table" is really a complex SQL expression or view.
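For example, given a hypothetical table sales(region, q1, q2, q3, q4), the two approaches would look roughly like this (UNPIVOT syntax shown in SQL Server form; Oracle's differs slightly):

-- UNPIVOT: the table is scanned once
SELECT region, quarter, amount
FROM sales
UNPIVOT (amount FOR quarter IN (q1, q2, q3, q4)) AS u;

-- UNION ALL: typically one scan per branch
SELECT region, 'q1' AS quarter, q1 AS amount FROM sales
UNION ALL SELECT region, 'q2', q2 FROM sales
UNION ALL SELECT region, 'q3', q3 FROM sales
UNION ALL SELECT region, 'q4', q4 FROM sales;

One behavioral difference to keep in mind: UNPIVOT silently drops rows where the value is NULL, while the UNION ALL form keeps them unless you filter them out explicitly.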

Related

When are extra columns removed in Teradata SQL?

I understand that the order of operations for SQL in Teradata is as follows:
FROM clause
WHERE clause
GROUP BY clause
HAVING clause
SELECT clause
ORDER BY clause
This is from this link.
Does this mean that any extra, unneeded columns in the tables I am joining are always removed at the very end (when SELECT is performed)? Do those extra unselected columns take up spool space until they are finally dropped?
So if I am joining Table A (5 columns) with Table B (10 columns), the intermediate result right after the join is 14 columns (with 1 common key). But let's say I'm ultimately only selecting 3 columns at the end.
Does the query optimizer always include all 14 columns in the intermediate result (thus taking up spool space) or is it smart enough to only include the needed 3 columns in the intermediate result?
If it is smart enough to do this, then I could save spool space by rewriting every table I'm joining to as a subquery of ONLY the columns I need from that table.
Thank you for your help.
You are confusing the compiling and execution of queries.
Those are not the "order of operations". What you have described is the order of "interpreting the query". This occurs during the compilation phase, when the identifiers (column and table names and aliases) are interpreted.
SQL is a descriptive language. A SQL query describes the result set. It does not describe how the data is processed (a procedural language would do that).
As for not reading columns: Teradata is probably smart enough to read only the columns it needs from the data pages and not carry unreferenced columns through the rest of the processing.
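If you want to make the projection explicit yourself, the rewrite described in the question would look roughly like this (table, key, and column names are hypothetical):

SELECT a.key_col, a.col_a1, b.col_b1
FROM (SELECT key_col, col_a1 FROM table_a) AS a
JOIN (SELECT key_col, col_b1 FROM table_b) AS b
  ON a.key_col = b.key_col;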

How can I optimize an SQL query (Using Indexes but not bitmap indexes)?

I have an SQL query, example:
SELECT * FROM TAB1 NATURAL JOIN TAB2 WHERE TAB1.COL1 = 'RED'
How can I optimize this query to use indexes but not bitmap indexes in Oracle?
NOTE: This answers the original version of the question.
First, don't use NATURAL JOIN. It is an abomination because it does not use properly declared foreign key relationships. It simply uses columns with the same name, and that can produce misleading results.
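As a rough sketch (the join key columns are assumptions), the explicit form spells out the join condition and uses a single-quoted string literal:

SELECT t1.*, t2.*
FROM TAB1 t1
INNER JOIN TAB2 t2
  ON t2.tab1_id = t1.id
WHERE t1.COL1 = 'RED';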
Second, the query is syntactically incorrect for two reasons. First, "Red" is a reference to a column, not a string value. Does the table have a column named "Red"? The second reason is that you have a self join, so ROW1 is ambiguous.
That brings up the larger issue: your query basically makes no sense at all. You are joining the table to itself and returning duplicate columns. What are the results? Pretty indeterminate:
If any column contains a NULL value, then no rows are returned.
If all the rows are duplicates (with no NULL values), then you'll get a result set with N^2 rows and duplicate columns, where N is the number of rows in the table.
I cannot think of any use for the query. I see no reason to try to optimize it.
If you have a real query that you want to discuss, I would suggest that you ask another question.

Optimize Oracle SELECT on large dataset

I am new to Oracle (working on 11gR2). I have a table TABLE with roughly 10 million records in it, and this pretty simple query:
SELECT t.col1, t.col2, t.col3, t.col4, t.col5, t.col6, t.col7, t.col8, t.col9, t.col10
FROM TABLE t
WHERE t.col1 = val1
AND t.col11 = val2
AND t.col12 = val3
AND t.col13 = val4
The query currently takes about 30 seconds to 1 minute.
My question is: how can I improve performance? After a lot of research, I am aware of the most classical ways to improve performance, but I have some problems:
Partitioning: not really an option; the table is used in another project and the change would be too impactful. Besides, it would only delay the problem, given the number of rows inserted into the table every day.
Add an index: the thing is, the columns used in the WHERE clause are not the ones returned by the query (except for one), so I have not been able to find an appropriate index yet. As far as I know, creating an index on 12-13 columns does not make a lot of sense (or does it?).
Materialized views: I must say I have never used them, but I understand the maintenance cost is pretty high, and my table is updated quite often.
I think the best way to do this would be to add an appropriate index, but I can't find the right columns on which it should be created.
An index makes sense provided that your query returns only a small percentage of all rows. You would create one index on all four columns used in the WHERE clause.
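A minimal sketch of such an index (TABLE is the placeholder table name from the question; the index name is arbitrary):

CREATE INDEX ix_table_where_cols ON TABLE (col1, col11, col12, col13);

If you also run queries that filter on only some of these columns, putting the most frequently used and most selective columns first lets those queries reuse the same index.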
If too many records match, then a full table scan will be done. You may be able to speed this up by having this done in parallel threads using the PARALLEL hint:
SELECT /*+parallel(t,4)*/
t.col1, t.col2, t.col3, t.col4, t.col5, t.col6, t.col7, t.col8, t.col9, t.col10
FROM TABLE t
WHERE t.col1 = val1 AND t.col11 = val2 AND t.col12 = val3 AND t.col13 = val4;
A table with 10 million records is quite a small table. You just need to create an appropriate index. Which columns to index depends on their content. For example, if a column contains only "1" and "0", or "yes" and "no", you shouldn't index it. The more distinct values a column contains, the more effective an index on it will be. You can also create an index on two or three (or more) columns, or a function-based index (in that case the index stores the results of your SQL function rather than the column values). You can also create more than one index on a table.
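For example, a function-based index as just mentioned might look like this (the column and expression are hypothetical):

CREATE INDEX ix_table_upper_col2 ON TABLE (UPPER(col2));
-- the index can then be used when a query filters on the same expression:
-- WHERE UPPER(col2) = 'SOME VALUE'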
And in any case, if your query selects more than 20-30% of all table records, an index will not help.
Also, you said that the table is used by many people. In that case, you should coordinate with them to avoid duplicating indexes.
Indexes on each of the columns referenced in the WHERE clause will help performance of a query against a table with a large number of rows, where you are seeking a small subset, even if the columns in the WHERE clause are not returned in the SELECT column list.
The downside of course is that indexes impede insert/update performance. So when loading the table with large numbers of records, you might need to disable/drop the indexes prior to loading and then re-create/enable them again afterwards.
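As a rough sketch of that load pattern in Oracle (the index name is an assumption), one common approach is to mark the index unusable before the load and rebuild it afterwards:

ALTER INDEX ix_table_where_cols UNUSABLE;
-- ... bulk load the data ...
ALTER INDEX ix_table_where_cols REBUILD;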

Join 50 million rows one to one

I have two tables having 50 million unique rows each.
Row number from one table corresponds to row number in the second table.
i.e.
the 1st row in the 1st table joins with the 1st row in the 2nd table, the 2nd row in the 1st table joins with the 2nd row in the 2nd table, and so on. Doing an inner join is costly.
It takes more than 5 hours on clusters. Is there an efficient way to do this in SQL?
To start with: tables are just sets. So the row number of a record can be considered pure coincidence. You must not join two tables based on row numbers. So you would join on IDs rather than on row numbers.
There is nothing more efficient than a simple inner join. Since the whole tables must be read, you might not even gain anything from indexes (but as we are talking about IDs, there will be indexes anyhow, so there is nothing to ponder there).
Depending on the DBMS you may be able to parallelize the query. In Oracle for example you would use a hint such as /*+ parallel( tablename , parallel_factor ) */.
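A rough Oracle sketch of that join on IDs with a parallel hint (table names, ID column, and parallel factor are assumptions):

SELECT /*+ parallel(a, 8) parallel(b, 8) */
       a.*, b.*
FROM table_a a
JOIN table_b b
  ON b.id = a.id;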
Try sorting both tables first (if they are not already sorted), then use a plain SELECT (perhaps with LIMIT to fetch the data part by part) on both tables and connect the data line by line however you want.

Selecting 'highest' X rows without sorting

I've got a table with a huge amount of data, let's say 10 GB of rows containing a bunch of crap. I need to select, for example, the X rows (X is usually below 10) with the highest values in the amount column.
Is there any way to do this without sorting the whole table? Sorting this amount of data is extremely time-expensive. I'd be OK with one scan through the whole table that selects the X highest values and leaves the rest untouched. I'm using SQL Server.
Create an index on amount; then SQL Server can select the top 10 from it and do bookmark lookups to retrieve the missing columns.
SELECT TOP 10 Amount FROM myTable ORDER BY Amount DESC
If it is indexed, the query optimizer should use the index.
If not, I do not see how one could avoid scanning the whole thing...
Whether an index is useful or not depends on how often you run that search.
You could also consider putting that query into an indexed view. I think this will give you the best benefit/cost ratio.
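For reference, a minimal sketch of the index-based approach, using the table and column names from the query above (the index name is an assumption):

CREATE NONCLUSTERED INDEX IX_myTable_Amount ON myTable (Amount DESC);
-- the SELECT TOP 10 ... ORDER BY Amount DESC query above can then read just the
-- first 10 index entries, plus bookmark lookups for any extra columns it selects.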