SQL Query Performance in case of simple union

SQL Query Performance in case of simple union - sql

I have a simple query tuning question, it is
can We improve the performance of a view which have definition as
SELECT * FROM A
UNION ALL
SELECT * FROM B
and this is performing so poorly that it is taking 12 seconds for 6.5k Records
Any help is appreciated.

The problem with a union view like that is that the original columns of the tables are not accessible to the optimizer via the view, so when you use the view like this:
select * from myview where foo = 5
the where clause is a filter on the entire rowset delivered by the view, so all the rows from both tables are processed after the union, so no indexes will be used.
To have any hope at performance, you have to somehow get the condition you want inside the view and applied to each table, but keep it variable so you can apply different criteria when you use it.
I found a work-around that does this. It's a little bit hacky, and doesn't work for concurrent use, but it works! Try this:
create table myview_criteria(val int);
insert into myview_criteria values (0); -- it should have exactly one row
create view myview as
SELECT * FROM A
WHERE foo = (select val from myview_criteria)
UNION ALL
SELECT * FROM B
WHERE foo = (select val from myview_criteria);
then to use it:
update myview_criteria set val = 5;
select * from myview;
Assuming there's an index on foo, that index will be used.
Caution here, because obviously this technique won't work when multiple executions are done concurrently.

is there a reason u keep these two not in the same table, no matter what you do, eventually it will be horrible. If possible consider a migration to one table. If not, you can always "materialize a view" by inserting them all into the same thing.
with that being said, you are not going to be selecting * on a union, if there are specific conditions on top of the view, performance can be improve by the indexing that column.
It would help if you tell everyone what specific db this is for.

My best advice is to limit the number or records that are returned from the source tables to only what you really need and making sure your data is properly indexed.
Don't use * unless its an ad-hoc script (even then it would still be better to list the columns) include only the columns that you really need for the task at hand.
Use proper indexes based on the most frequently used queries. Remove any unused indexes. Remove any duplicate indexes.
Pluralsight has a wonderful course on MS SQL Server query performance tuning.
http://www.pluralsight.com/training/Courses/TableOfContents/query-tuning-introduction
The course is presented by Vinod Kumar and Pinal Dave who are two very brilliant individuals.
You could also check out Pinal's article on SQL Server Performance to help point out areas that you may not have thought about.
http://blog.sqlauthority.com/sql-server-performance-tuning/

Related

Complexity comparison between temporary table + index creation vice multi-table group by without index

I have two potential roads to take on the following problem, the try it and see methodology won't pay off for this solution as the load on the server is constantly in flux. The two approaches I have are as follows:
select *
from
(
select foo.a,bar.b,baz.c
from foo,bar,baz
-- updated for clarity sake
where foo.a=b.bar
and b.bar=baz.c
)
group by a,b,c
vice
create table results as
select foo.a,bar.b,baz.c
from foo,bar,baz
where foo.a=b.bar
and b.bar=baz.c ;
create index results_spanning on results(a,b,c);
select * from results group by a,b,c;
So in case it isn't clear. The top query performs the group by outright against the multi-table select thus preventing me from using an index. The second query allows me to create a new table that stores the results of the query, proceeding to create a spanning index, then finishing the group by query to utilize the index.
What is the complexity difference of these two approaches, i.e. how do they scale and which is preferable in the case of large quantities of data. Also, the main issue is the performance of the overall select so that is what I am attempting to fix here.
Comments
Are you really doing a CROSS JOIN on three tables? Are those three
columns indexed in their own right? How often do you want to run the
query which delivers the end result?
1) No.
2) Yes, where clause omitted for the sake of discussion as this is clearly a super trivial example
3) Doesn't matter.
2nd Update
This is a temporary table as it is only valid for a brief moment in time, so yes this table will only be queried against one time.

If your query is executed frequently and unacceptably slow, you could look into creating materialized views to pre-compute the results. This gives you the benefit of an indexable "table", without the overhead of creating a table every time.
You'll need to refresh the materialized view (preferably fast if the tables are large) either on commit or on demand. There are some restrictions on how you can create on commit, fast refreshable views, and they will add to your commit time processing slightly, but they will always give the same result as running the base query. On demand MVs will become stale as the underlying data changes until these are refreshed. You'll need to determine whether this is acceptable or not.

So the question is, which is quicker?
Run a query once and sort the result set?
Run a query once to build a table, then build an index, then run the query again and sort the result set?
Hmmm. Tricky one.
The use cases for temporary tables are pretty rare in Oracle. They normally onlya apply when we need to freeze a result set which we are then going to query repeatedly. That is apparently not the case here.
So, take the first option and just tune the query if necessary.
The answer is, as is so often the case with tuning questions, it depends.
Why are you doing a GROUP BY in the first place. The query as you posted it doesn't do any aggregation so the only reason for doing GROUP BY woudl be to eliminate duplicate rows, i.e. a DISTINCT operation. If this is actually the case then you doing some form of cartesian join and one tuning the query would be to fix the WHERE clause so that it only returns discrete records.

Is using COUNT() or SELECT a good idea?

I've heard several times that you shouldn't perform COUNT(*) or SELECT * for performance reasons, but wasn't able to dig up some further information about it.
I can imagine that the database is then using all columns for the action, which can be an impressive performance loss, but I'm not sure about that. Does somebody have further information about the topic?

1. On count(*) vs. count(something else)
SQL is declarative in that you specify what you want. This is different from specifying how to get what you want. That means the database engine is free to realize your query in whatever way it thinks is the most efficient. Many database optimizers rewrites your query to a less costly alternative (if such a plan is available).
Given the following table:
table(
pk not null
,color not null
,nullable null
,unique(pk)
,index(color)
);
...all of the following are functionally equivalent (due to the mechanics of count and nulls):
1) select count(*) from table;
2) select count(1) from table;
3) select count(pk) from table;
4) select count(color) from table;
Regardless of which form you use, the optimizer is free to rewrite the query to another form if it is more efficient. (Again, not all optimizers are sophisticated enough to do this). The unique index(pk) would be smaller (bytes occupied) than the entire table. Therefore it would be more efficient to count the number of index entries rather than scanning through the entire table. In Oracle we have bitmap indexes, which also compress repeating strings. If we had used such an index on the color column, it would probably have been the smallest index to scan. Oracle also supports table compression which in some cases makes the physical table smaller than a composite index.
1. TL;DR;
Your specific dbms will have its own set of tools that enables different rewriting rules and in turn execution plans. That renders the question somewhat useless (unless we talk about a specific release of a specific dbms). I recommend COUNT(*) in all cases because it requires the least cognitive effort to grasp.
2. On select a,b,c vs. select *
There are very few valid uses of SELECT * in code you write and put into production. Imagine a table which contains Bluray movies (yes, the movies is stored as a blob in this table). So you slapped together your awesomesauce abstraction layer and put SELECT * FROM movies where id = ? in the getMovies(movie_id) method. I will refrain myself from explaining why SELECT name FROM movies will be transported across the network just a tad faster. Of course, in most realistic cases it won't have a noticable impact.
One last point on performance is that when all the referenced columns (selected, filtered) in your query exists as an index (called a covering index), the database need not touch the table at all. It can be fully resolved from scanning the index only. By selecting all columns you remove this option from the optimizer.
Another thing about SELECT * which is far more serious than anything, is that it creates an implicit dependency on a specific physical layout of the table. Let me explain. Consider the following tables:
table T1(name, id)
table T2(name, id)
The following statement...
insert into t1 select * from t2;
... will break or produce a different result if any of the following happens:
Any of the tables columns are rearranged for example T1(id, name)
T1 gets an additional not-null column
T2 gets another column
2. TL;DR; When possible, explicitly specify the columns you want (eventually, you'll have to do that anyway). Also, selecting fewer columns are faster than selecting more columns. A possitive side-effect on explicit selects is that it gives greater freedom to the optimizer.

COUNT(*) is different from COUNT(column1) !
COUNT(*) returns the number of records, and does NOT use more resources, while COUNT(column1) counts the number of records where column1 is non null.
For SELECT, it is different. SELECT * will of course request more data.

When using count(*) the * doesn't mean "all fields". Using count(field) will count all non-null values in the field, but count(*) will always count all records even if all fields in all records are null, so it doesn't need to check the data in the fields at all.
Using select * means that you almost always return more data than you are going to use, which of course is a waste. However, perhaps more serious is the maintainence problem; if you add fields to a table your query will return these too. That might mean that the record becomes too large to fit in the buffer, resulting in an error message.

Don't confuse the * in "COUNT(*)" with the * in "SELECT * ". They are completely unrelated but sometimes confused because it's such an odd syntax. There is nothing wrong with using COUNT(*), which just means "count rows".
SELECT * on the other hand means "select all columns". That's generally poor practice because it tightly couples your code to the database schema. That means when you change the table you probably have to change the code even if it should have been unaffected. It increases the impact of any schema change.
SELECT * may also cause a sub-optimal query plan. Either because you didn't really need all columns or because it forces the DBMS to do an extra lookup at runtime to get the list of columns.

It's absolutely true that "*" is "all columns". And you're right in the point of if you've a table with an incredible number of columns (say 100+), these kind of queries can be bad in terms of efficiency.
I believe that the best solution is creating database views previously filtering the amount of records evolved in the count operation, so, the performance impact isn't a big problem, because views can be cached.
In the other hand, it seems that "*" operator should be avoided when returning records, and it's brutally better to select the fields you really need to use in some business.

When using SELECT * it can have a performance hit. Applications which use the SELECT * syntax when they actually only need a handful of columns are transferring more data across the network than they need to consume, which is wasteful.
Also, in Microsoft SQL Server at least, there's a strange problem when you use SELECT * in a view and then add a column to the underlying table. The column headings and data returned by the view don't match each other following certain changes! See my blog post for further details of this particular problem.

Depending on the size of the database depends on how inefficient it becomes, the simnplest way to describe would be like so:
when you specifically do:
SELECT column1,column2,column3 FROM table1
Mysql knows exactly exactly what columns it looking for, but when you do
SELECT * FROM table1
Mysql does not know the columns you want, it knows you want all of them but not the names, so it has to perform extra tasks that analyse the table to discover the columns, thus resulting in using resources.

In case of COUNT(*) it depends on database and its version. For example in modern versions of MS SQL it doesn't matter [source needed].
So the best approach in case of COUNT(*) is to measure it.
Using SELECT * is really bad idea. * means read all columns which can be heavy IO and network operation (especially for various type of CHAR columns). Moreover -- rather rarely you need all columns.

Views or table functions or something else

I designed 5 stored procedures which almost use same join condition but parameters or values in where clause change for each on different runs.
Is it best solution to create a view with all join conditions without where clause and then query from view or work on view? Can views auto update itself if i create view?
Can i do sub-queries or query similar to (i think i read somewhere views do not support sub queries but not 100% sure)
select count(x1) as x1cnt, count(x2) as x2count
from (
select x1,x2,
(
case when x1 is 'y' then 1 else 0 end +
case when x2 is 'y' then 1 else 0 end
) per
from vw_viewname) v1
where v1.per = 1
Updated below:
In my queries i use joins similar to this also
select c1,c2,c3
FROM [[join conditions - 5 tables]]
Inner join
(
select x1,x2,x3, some case statements
FROM [[join conditions - 5 tables]]
where t1.s1 = val1 and t2.s2 = v2 etc
) s
on s.id = id
so i'm using join twice so i thought can i reduce it using some views

Leaving out their where clause could make the query run more slowly or just give more results than a specific query would. But you will have to determine if that is advantageous based on your system.
You will get the common view results table to work with. View basically run the query when you use them so you will get results as if you did the query yourself by some other mechanism. You can do sub queries on a view just as if it were another table. That should not be a problem. But if you have 5 different queries doing 5 specific things then it is probably beneficial to leave it as so. One or two of those may be called more and you would be trading off their performance with a general view table and gain nothing really for doing so other than view reuse.
I would only construct the view if you have some specific benefit from doing so.
Also I found this post that may be similar Dunno if you will find it helpful or not.
EDIT: Well, I think it would just make it worse really. You would just be calling the view twice and if its a generic view it means each of those calls is going to get a lot of generic results to deal with.
I would say just focus on optimizing those queries to give you exactly what you need. Thats really what you have 5 different procedure for anyway right? :)

It's 5 different queries so leave it like that.
It's seductive to encapsulate similar JOINs in a view, but before you know it you have views on top of views and awful performance. I've seen it many times.
The "subquery in a view" thing probably refers to indexed views which have limitations.

Unless your talking about an indexed view, the view will actually run the script to generate the view on demand. In that regard, it would be the same as using a subquery.
If I were you, I would leave it as it is. It may seem like you should compact your code (each of the 5 scripts have almost the same code), but its what is different that is important here.

You can have subqueries in a view, and that approach is perfectly acceptable.

SQL Server views do support sub-queries. And, in a sense, views to auto update themselves because a view is not a persisted object (unless you use an Indexed View). With a non Indexed View, each time you query the view, it is using the underlying tables. So, your view will be as up to date as the tables they are based upon.
It sounds to me like a view would be a good choice here.

It's fine to create a view, even if it contains a subselect. You can remove the where for the view.
Are you sure you want to use COUNT like that without a group by? It counts the number of rows which contain non-null values or the parameter.

I've done a lot of presentations recently on the simplification offered by the Query Optimiser. Essentially if you have planned your joins well enough, the system can see that they're redundant and ignore them completely.
http://msmvps.com/blogs/robfarley/archive/2008/11/09/join-simplification-in-sql-server.aspx
Stored procedures will do the same work each time (parameters having some effect), but a view (or inline TVF) will be expanded into the outer query and simplified out.

SQL query - Select * from view or Select col1, col2, ... colN from view [duplicate]

This question already has answers here:
What is the reason not to use select *?
(20 answers)
Closed 3 years ago.
We are using SQL Server 2005, but this question can be for any RDBMS.
Which of the following is more efficient, when selecting all columns from a view?
Select * from view
or
Select col1, col2, ..., colN from view

NEVER, EVER USE "SELECT *"!!!!
This is the cardinal rule of query design!
There are multiple reasons for this. One of which is, that if your table only has three fields on it and you use all three fields in the code that calls the query, there's a great possibility that you will be adding more fields to that table as the application grows, and if your select * query was only meant to return those 3 fields for the calling code, then you're pulling much more data from the database than you need.
Another reason is performance. In query design, don't think about reusability as much as this mantra:
TAKE ALL YOU CAN EAT, BUT EAT ALL YOU TAKE.

It is best practice to select each column by name. In the future your DB schema might change to add columns that you would then not need for a particular query. I would recommend selecting each column by name.

Just to clarify a point that several people have already made, the reason Select * is inefficient is because there has to be an initial call to the DB to find out exactly what fields are available, and then a second call where the query is made using explicit columns.
Feel free to use Select * when you are debugging, running casual queries or are in the early stages of developing a query, but as soon as you know your required columns, state them explicitly.

Select * is a poor programming practice. It is as likely to cause things to break as it is to save things from breaking. If you are only querying one table or view, then the efficiency gain may not be there (although it is possible if you are not intending to actually use every field). If you have an inner join, then you have at least two fields returning the same data (the join fields) and thus you are wasting network resources to send redundant data back to the application. You won't notice this at first, but as the result sets get larger and larger, you will soon have a network pipeline that is full and doesn't need to be. I can think of no instance where select * gains you anything. If a new column is added and you don't need to go to the code to do something with it, then the column shouldn't be returned by your query by definition. If someone drops and recreates the table with the columns in a different order, then all your queries will have information displaying wrong or will be giving bad results, such as putting the price into the part number field in a new record.
Plus it is quick to drag the column names over from the object browser, so that is just pure laziness not efficiency in coding.

It depends. Inheritance of views can be a handy thing and easy to maintain (SQL Anywhere):
create view v_fruit as select F.id, S.strain from F key join S;
create view v_apples as select v_fruit.*, C.colour from v_fruit key join C;

If you're really selecting all columns, it shouldn't make any noticeable difference whether you ask for * or if you are explicit. The SQL server will parse the request the same way in pretty much the same amount of time.

Always do select col1, col2 etc from view. There's no efficieny difference between the two methods that I know of, but using "select *" can be dangerous. If you modify your view definition adding new columns, you can break a program using "select *", whereas selecting a predefined set of columns (even all of them, named), will still work.

I guess it all depends on what the query optimizer does.
If I want to get every record in the row, I will generally use the "SELECT *..." option, since I then don't have to worry should I change the underlying table structure. As well, for someone maintaining the code, seeing "SELECT *" tells them that this query is intended to return every column, whereas listing the columns individually does not convey the same intention.

For performance - look at the query plan (should be no difference).
For maintainability. - always supply a fieldlist (that goes for INSERT INTO too).

select
column1
,column2
,column3
.
.
.
from Your-View
this one is more optimizer than Using the
select *
from Your View

Why is using '*' to build a view bad?

Why is using '*' to build a view bad ?
Suppose that you have a complex join and all fields may be used somewhere.
Then you just have to chose fields needed.
SELECT field1, field2 FROM aview WHERE ...
The view "aview" could be SELECT table1.*, table2.* ... FROM table1 INNER JOIN table2 ...
We have a problem if 2 fields have the same name in table1 and table2.
Is this only the reason why using '*' in a view is bad?
With '*', you may use the view in a different context because the information is there.
What am I missing ?
Regards

I don't think there's much in software that is "just bad", but there's plenty of stuff that is misused in bad ways :-)
The example you give is a reason why * might not give you what you expect, and I think there are others. For example, if the underlying tables change, maybe columns are added or removed, a view that uses * will continue to be valid, but might break any applications that use it. If your view had named the columns explicitly then there was more chance that someone would spot the problem when making the schema change.
On the other hand, you might actually want your view to blithely
accept all changes to the underlying tables, in which case a * would
be just what you want.
Update: I don't know if the OP had a specific database vendor in mind, but it is now clear that my last remark does not hold true for all types. I am indebted to user12861 and Jonny Leeds for pointing this out, and sorry it's taken over 6 years for me to edit my answer.

Although many of the comments here are very good and reference one common problem of using wildcards in queries, such as causing errors or different results if the underlying tables change, another issue that hasn't been covered is optimization. A query that pulls every column of a table tends to not be quite as efficient as a query that pulls only those columns you actually need. Granted, there are those times when you need every column and it's a major PIA having to reference them all, especially in a large table, but if you only need a subset, why bog down your query with more columns than you need.

Another reason why "*" is risky, not only in views but in queries, is that columns can change name or change position in the underlying tables. Using a wildcard means that your view accommodates such changes easily without needing to be changed. But if your application references columns by position in the result set, or if you use a dynamic language that returns result sets keyed by column name, you could experience problems that are hard to debug.
I avoid using the wildcard at all times. That way if a column changes name, I get an error in the view or query immediately, and I know where to fix it. If a column changes position in the underlying table, specifying the order of the columns in the view or query compensates for this.

These other answers all have good points, but on SQL server at least they also have some wrong points. Try this:
create table temp (i int, j int)
go
create view vtemp as select * from temp
go
insert temp select 1, 1
go
alter table temp add k int
go
insert temp select 1, 1, 1
go
select * from vtemp
SQL Server doesn't learn about the "new" column when it is added. Depending on what you want this could be a good thing or a bad thing, but either way it's probably not good to depend on it. So avoiding it just seems like a good idea.
To me this weird behavior is the most compelling reason to avoid select * in views.
The comments have taught me that MySQL has similar behavior and Oracle does not (it will learn about changes to the table). This inconsistency to me is all the more reason not to use select * in views.

Using '*' for anything production is bad. It's great for one-off queries, but in production code you should always be as explicit as possible.
For views in particular, if the underlying tables have columns added or removed, the view will either be wrong or broken until it is recompiled.

Using SELECT * within the view does not incur much of a performance overhead if columns aren't used outside the view - the optimizer will optimize them out; SELECT * FROM TheView can perhaps waste bandwidth, just like any time you pull more columns across a network connection.
In fact, I have found that views which link almost all the columns from a number of huge tables in my datawarehouse have not introduced any performance issues at all, even through relatively few of those columns are requested from outside the view. The optimizer handles that well and is able to push the external filter criteria down into the view very well.
However, for all the reasons given above, I very rarely use SELECT *.
I have some business processes where a number of CTEs are built on top of each other, effectively building derived columns from derived columns from derived columns (which will hopefully one day being refactored as the business rationalizes and simplifies these calculations), and in that case, I need all the columns to drop through each time, and I use SELECT * - but SELECT * is not used at the base layer, only in between the first CTE and the last.

The situation on SQL Server is actually even worse than the answer by #user12861 implies: if you use SELECT * against multiple tables, adding columns to a table referenced early in the query will actually cause your view to return the values of the new columns under the guise of the old columns. See the example below:
-- create two tables
CREATE TABLE temp1 (ColumnA INT, ColumnB DATE, ColumnC DECIMAL(2,1))
CREATE TABLE temp2 (ColumnX INT, ColumnY DATE, ColumnZ DECIMAL(2,1))
GO
-- populate with dummy data
INSERT INTO temp1 (ColumnA, ColumnB, ColumnC) VALUES (1, '1/1/1900', 0.5)
INSERT INTO temp2 (ColumnX, ColumnY, ColumnZ) VALUES (1, '1/1/1900', 0.5)
GO
-- create a view with a pair of SELECT * statements
CREATE VIEW vwtemp AS
SELECT *
FROM temp1 INNER JOIN temp2 ON 1=1
GO
-- SELECT showing the columns properly assigned
SELECT * FROM vwTemp
GO
-- add a few columns to the first table referenced in the SELECT
ALTER TABLE temp1 ADD ColumnD varchar(1)
ALTER TABLE temp1 ADD ColumnE varchar(1)
ALTER TABLE temp1 ADD ColumnF varchar(1)
GO
-- populate those columns with dummy data
UPDATE temp1 SET ColumnD = 'D', ColumnE = 'E', ColumnF = 'F'
GO
-- notice that the original columns have the wrong data in them now, causing any datatype-specific queries (e.g., arithmetic, dateadd, etc.) to fail
SELECT *
FROM vwtemp
GO
-- clean up
DROP VIEW vwTemp
DROP TABLE temp2
DROP TABLE temp1

It's because you don't always need every variable, and also to make sure that you are thinking about what you specifically need.
There's no point getting all the hashed passwords out of the database when building a list of users on your site for instance, so a select * would be unproductive.

Once upon a time, I created a view against a table in another database (on the same server) with
Select * From dbname..tablename
Then one day, a column was added to the targetted table. The view started returning totally incorrect results until it was redeployed.
Totally incorrect : no rows.
This was on Sql Server 2000.
I speculate that this is because of syscolumns values that the view had captured, even though I used *.

A SQL query is basically a functional unit designed by a programmer for use in some context. For long-term stability and supportability (possibly by someone other than you) everything in a functional unit should be there for a purpose, and it should be reasonably evident (or documented) why it's there - especially every element of data.
If I were to come along two years from now with the need or desire to alter your query, I would expect to grok it pretty thoroughly before I would be confident that I could mess with it. Which means I would need to understand why all the columns are called out. (This is even more obviously true if you are trying to reuse the query in more than one context. Which is problematic in general, for similar reasons.) If I were to see columns in the output that I couldn't relate to some purpose, I'd be pretty sure that I didn't understand what it did, and why, and what the consequences would be of changing it.

It's generally a bad idea to use *. Some code certification engines mark this as a warning and advise you to explicitly refer only the necessary columns. The use of * can lead to performance louses as you might only need some columns and not all. But, on the other hand, there are some cases where the use of * is ideal. Imagine that, no matter what, using the example you provided, for this view (aview) you would always need all the columns in these tables. In the future, when a column is added, you wouldn't need to alter the view. This can be good or bad depending the case you are dealing with.

I think it depends on the language you are using. I prefer to use select * when the language or DB driver returns a dict(Python, Perl, etc.) or associative array(PHP) of the results. It makes your code alot easier to understand if you are referring to the columns by name instead of as an index in an array.

No one else seems to have mentioned it, but within SQL Server you can also set up your view with the schemabinding attribute.
This prevents modifications to any of the base tables (including dropping them) that would affect the view definition.
This may be useful to you for some situations. I realise that I haven't exactly answered your question, but thought I would highlight it nonetheless.

And if you have joins using select * automatically means you are returning more data than you need as the data in the join fields is repeated. This is wasteful of database and network resources.
If you are naive enough to use views that call other views, using select * can make them even worse performers (This is technique that is bad for performance on its own, calling mulitple columns you don't need makes it much worse).

We Keep Coding

sql objective-c vba vb.net react-native apache vue.js tensorflow api pandas