How to convert rows to columns in an indexed view?

I use OUTER JOINs to get values stored in rows and show them as columns. When there is no value, I show NULL in the column.
Source table:
Id|Name|Value
01|ABCG|,,,,,
01|ZXCB|.....
02|GHJK|;;;;;
View:
Id|ABCG|ZXCB|GHJK
01|,,,,|....|NULL
02|NULL|NULL|;;;;
The query looks like:
SELECT DISTINCT
b.Id,
bABCG.Value AS "ABCG",
bZXCB.Value AS "ZXCB",
bGHJK.Value AS "GHJK"
FROM
Bars b
LEFT JOIN Bars bABCG ON b.Id = bABCG.Id AND bABCG.Name = 'ABCG'
LEFT JOIN Bars bZXCB ON b.Id = bZXCB.Id AND bZXCB.Name = 'ZXCB'
LEFT JOIN Bars bGHJK ON b.Id = bGHJK.Id AND bGHJK.Name = 'GHJK'
I want to remove the LEFT JOINs because they're not allowed in an indexed view. I tried replacing them with a correlated subquery, but subqueries aren't allowed either, and neither is UNION. I can't use INNER JOIN because I want the view to show the NULLs. What should I use?

You may be able to implement something similar using an actual table to store the results, and a set of triggers against the base tables to maintain the internal data.
I believe that, under the covers, this is what SQL Server does (in spirit, if not in actual implementation) when you create an indexed view. However, by examining the rules for indexed views, it's clear that the triggers should only use the inserted and deleted tables, and should not be required to scan the base tables to perform the maintenance - otherwise, for large tables, maintaining this indexed view would impose a serious performance penalty.
As an example of the above, whilst you can easily write a trigger for insert to maintain a MAX(column) column in the view, deletion would be more problematic - if you're deleting the current max value, you'd need to scan the table to determine the new maximum. For many of the other restrictions, try writing the triggers by hand, and most times there'll come a point where you need to scan the base table.
Now, in your particular case, I believe it could be reasonably efficient for these triggers to perform the maintenance - but you need to carefully consider all of the insert/update/delete scenarios, and make sure that your triggers actually faithfully maintain this data - e.g. if you update any ids, you may need to perform a mixture of updates, inserts and deletes.
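To make that concrete, here is a minimal sketch of the insert side only. It assumes a pre-created pivot table dbo.BarsPivot(Id, ABCG, ZXCB, GHJK) (a hypothetical name and shape) and that (Id, Name) is unique in Bars; the update and delete cases need the same careful treatment:
CREATE TRIGGER trg_Bars_Insert ON dbo.Bars
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;

    -- ensure every affected Id has a row in the pivot table
    INSERT INTO dbo.BarsPivot (Id)
    SELECT DISTINCT i.Id
    FROM inserted i
    WHERE NOT EXISTS (SELECT 1 FROM dbo.BarsPivot p WHERE p.Id = i.Id);

    -- copy each inserted value into its matching column
    UPDATE p
    SET ABCG = COALESCE(v.ABCG, p.ABCG),
        ZXCB = COALESCE(v.ZXCB, p.ZXCB),
        GHJK = COALESCE(v.GHJK, p.GHJK)
    FROM dbo.BarsPivot p
    JOIN (
        SELECT Id,
               MAX(CASE WHEN Name = 'ABCG' THEN Value END) AS ABCG,
               MAX(CASE WHEN Name = 'ZXCB' THEN Value END) AS ZXCB,
               MAX(CASE WHEN Name = 'GHJK' THEN Value END) AS GHJK
        FROM inserted
        GROUP BY Id
    ) v ON v.Id = p.Id;
END
Note that both statements read only from inserted and the maintained table itself, never scanning the base table, which is the property described above.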

The best you are going to be able to do is use INNER JOINs to get the matches, then UNION that with the LEFT JOINs filtered to return only the NULL rows. This probably won't solve your problem.
I don't know the specifics of your system but I am assuming that you are dealing with performance issues, which is why you want to use the indexed view. There are a few alternatives, but I think the following is the most appropriate.
Since you commented this is for a DW, I am going to assume that your system is more read-intensive than write-intensive and that data is loaded into it on a schedule by an ETL process. In this kind of high read/low write* situation, I would recommend you "materialize" this view: when the ETL process runs, generate the table with your initial SELECT statement that includes the LEFT JOINs. You take the hit on the write, and then all your reads will be on par with the performance of the indexed view (you would be doing the same thing the indexed view does, except in a batch instead of on a row-by-row basis). If your source DB and DW are on the same instance, this is a better choice than an indexed view because it won't affect the performance of the source system (indexed views slow down inserts). It is the same concept as the indexed view: you take the performance hit on the insert to speed up the select.
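As a rough sketch of that ETL step: the target table dbo.BarsPivot is hypothetical, and conditional aggregation is used here as a batch-friendly equivalent of the LEFT JOIN version (it produces the same NULL-padded result):
-- refresh the materialized table during the ETL load
TRUNCATE TABLE dbo.BarsPivot;

INSERT INTO dbo.BarsPivot (Id, ABCG, ZXCB, GHJK)
SELECT b.Id,
       MAX(CASE WHEN b.Name = 'ABCG' THEN b.Value END) AS ABCG,
       MAX(CASE WHEN b.Name = 'ZXCB' THEN b.Value END) AS ZXCB,
       MAX(CASE WHEN b.Name = 'GHJK' THEN b.Value END) AS GHJK
FROM dbo.Bars b
GROUP BY b.Id;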
I've been down this path before and come to the following conclusion:
An indexed view is more likely to be part of the solution than the entire solution.
*when I said "high read/low write" above you can also think of it as "high read/scheduled write"

SELECT DISTINCT
b.Id,
(Select bABCG.Value from Bars bABCG where bABCG.Id = b.Id and bABCG.Name = 'ABCG') AS "ABCG"
...
FROM
Bars b
You may have to add an aggregation on the value; I'm not sure how your data is organized.

Related

Does my previous SQL query/ies affect my current query?

I have multiple SQL queries that I run one after the other to get a set of data. In each query, a bunch of the joined tables are exactly the same as in the other queries. For example:
Query1
SELECT * FROM
Product1TableA A1
INNER JOIN Product1TableB B on A1.BId = B.Id
INNER JOIN CommonTable1 C on C.Id = B.CId
INNER JOIN CommonTable2 D on D.Id = B.DId
...
Query2
SELECT * FROM Product2TableA A2
INNER JOIN Product2TableB B on A2.BId = B.Id
INNER JOIN CommonTable1 C on C.Id = B.CId
INNER JOIN CommonTable2 D on D.Id = B.DId
...
I am playing around with re-ordering the joins (around two dozen tables joined per query), and I read here that the order should not really affect query execution unless SQL Server "gives up" during optimization because of how big the query is...
What I am wondering is if bunching up common table joins at the start of all my queries actually helps...
In theory, the order of the joins in the FROM clause makes no difference to query performance. For a small number of tables, there should be no difference; the optimizer should find the best execution path.
For a larger number of tables, the optimizer may have to short-circuit its search regarding join order. It would then be using heuristics -- and these could be affected by join order.
Earlier queries would have no effect on a particular execution plan.
If you are having problems with performance, I am guessing that join order is not the root cause. The most common problem I see in SQL Server is an inappropriate nested-loop join -- and that can be handled with an optimizer hint.
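For example, a query-level hint can restrict the optimizer to hash joins (a sketch against the hypothetical tables above):
SELECT A1.*, B.*
FROM Product1TableA A1
INNER JOIN Product1TableB B ON A1.BId = B.Id
OPTION (HASH JOIN); -- every join in the query must now use hashing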
I think I understood what he was trying to say and do:
What I am wondering is if bunching up common table joins at the start
of all my queries actually helps...
Imagine that you have some queries and every query has more than 3 inner joins. The queries are different but always have (for example) 3 tables in common that are joined on the same fields. Now the question is:
what will happen if every query starts with these 3 tables joined first, and all the other tables are joined after?
The answer is that it will change nothing, i.e. the optimizer will rearrange the tables in whatever way it thinks will lead to the optimal execution.
Things may change if, for example, you save the result of these 3 joins into a temporary table and then use the saved result to join with the other tables. But this depends on the filters that your queries use. If you have appropriate indexes and your query filters are selective enough (so that your query returns very few rows), there is no need to cache an intermediate, unfiltered result that has too many rows, because the optimizer can choose to filter every table first and only then join them.
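A sketch of that temp-table idea (all table and column names here are made up):
-- cache the shared joins once per batch
SELECT s1.Id, s1.Col1, s2.Col2, s3.Col3
INTO #SharedJoin
FROM Shared1 s1
JOIN Shared2 s2 ON s2.Shared1Id = s1.Id
JOIN Shared3 s3 ON s3.Shared2Id = s2.Id;

-- each query then joins against the cached result
SELECT p.*, s.*
FROM Product1TableA p
JOIN #SharedJoin s ON s.Id = p.SharedId;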
Gordon's answer is a good explanation, but this answer explains the JOIN behaviour and also notes that the SQL Server version is relevant:
Although the join order is changed in optimisation, the optimiser doesn't try all possible join orders. It stops when it finds what it considers a workable solution, as the very act of optimisation uses precious resources.
While the optimizer tries its best to choose a good order for the JOINs, having many JOINs increases the chance of getting a not-so-good plan.
Personally, I have seen many JOINs in some views within an ERP, and they usually ran OK. However, from time to time (depending on the client's data volume, instance configuration etc.), some selects from these views took much longer than expected.
If this data reaches an actual application (.NET, Java etc.), one option is to cache the information from all the small tables, store it as dictionaries (hashes), and perform O(1) lookups based on the keys.
This provides the advantages of reducing the JOIN count and not performing reads from the database for these tables (except once, when caching the data). However, it increases the complexity of the application (cache management).
Another solution is to use temporary tables and populate them in multiple queries to avoid many JOINs in a single query. This solution usually performs better and also improves debuggability (if the query does not return the correct data, or no data at all, which of the 10-15 JOINs is the problem?).
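A sketch of that staged approach, with made-up table names:
-- stage 1: filter the largest table first and keep only needed columns
SELECT o.OrderId, o.CustomerId, o.Total
INTO #RecentOrders
FROM dbo.Orders o
WHERE o.OrderDate >= '20240101';

-- stage 2: join the much smaller staged set to the rest
SELECT r.OrderId, c.Name, r.Total
FROM #RecentOrders r
JOIN dbo.Customers c ON c.CustomerId = r.CustomerId;
Each stage can be inspected and timed on its own.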
So, my answer to your question is: you might get some benefit from reordering the JOIN clauses, but I recommend avoiding lots of JOINs in the first place.

Performance for big query in SQL Server view

I have a big query for a view that takes a couple of hours to run, and I feel like it may be possible to improve its performance "a bit".
The problem is that I am not sure what I should do. The query SELECTs 39 values, LEFT OUTER JOINs 25 tables, and each table could have up to a couple of million rows.
Any tip is good. Is there any good way to attack this problem? I tried to look at the actual execution plan on a test with less data (it took about 10 minutes to run) but it's crazy big. Are there any general things I could do to make this faster? Do I have to tackle one small part at a time?
Maybe there is just one join that slows down everything? How do I detect it? In short, how do I work on a query like this?
As I said, all feedback is good. If there is more information I should show, tell me!
The query looks something like this:
SELECT DISTINCT
A.something,
A.somethingElse,
B.something,
C.somethingElse,
ISNULL(C.somethingElseElse, ''),
C.somethingElseElseElse,
CASE WHEN *** THEN D.something ELSE 0 END,
E.something,
...
U.something
FROM
TableA A
JOIN
TableB B on ...
JOIN
TableC C on ...
JOIN
TableD D on ...
JOIN
TableE E on ...
JOIN
TableF F on ...
JOIN
TableG G on ...
...
JOIN
TableU U on ...
Break your problem into manageable pieces. If the execution plan is too large for you to analyze, start with a smaller part of the query, check its execution plan and optimize it.
There is no general answer on how to optimize a query, since there is a whole bunch of possible reasons why a query can be slow. You have to check the execution plan.
Generally the most promising ways to improve performance are:
Indexing:
When you see a Clustered Index Scan or, even worse (because it means you don't have a clustered index), a Table Scan in your query plan for a table that you join, you need an index for your JOIN predicate. This is especially true if you have tables with millions of entries and you select only a small subset of them. Also check the index suggestions in the execution plan.
You will see that the index works when the Clustered Index Scan turns into an Index Seek.
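For example, with a hypothetical table TableB joined on a column AId:
-- index the column used in the JOIN predicate
CREATE NONCLUSTERED INDEX IX_TableB_AId ON dbo.TableB (AId);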
Index includes:
You are probably displaying columns from your joined tables that are different from the fields you join on (otherwise, why would you need the join?). SQL Server has to fetch those fields from the table, which you see in the execution plan as a Key Lookup.
Since you are taking 39 values from 25 tables, there will be very few fields per table that you need (mostly one or two), yet SQL Server has to load entire pages of the respective table to get the values from them.
In this case, you should INCLUDE the column(s) you want to display in your index to avoid the key lookups. This comes at the cost of increased index size, but since you only include a few columns, that cost should be negligible compared to the size of your tables.
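A sketch of such a covering index, with hypothetical table and column names:
CREATE NONCLUSTERED INDEX IX_TableB_AId_covering
ON dbo.TableB (AId)
INCLUDE (DisplayCol1, DisplayCol2); -- the one or two columns the SELECT actually reads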
Checking views that you join:
When you join VIEWs, you should be aware that it basically means an extension of your query (and therefore of the execution plan). Do the same performance optimizations for the view as you do for your main query. Also, check whether the view joins tables that you already join in the main query; those joins might be unnecessary.
Indexed views (maybe):
In general, you can add indexes to views you are joining to your query or create one or more indexed views for parts of your query. There are some caveats though:
Indexed views take storage space in your DB, because you store parts of the data multiple times.
There are a lot of restrictions to indexed views, most notably in your case that OUTER JOINs are forbidden. If you can transform at least some of your OUTER JOINs to INNER JOINs this might be an option.
When you join indexed views, don't forget to use WITH(NOEXPAND) in your join, otherwise they might be ignored.
Partitioned tables (maybe):
If you are running on the Enterprise Edition of SQL Server, you can partition your tables. That can be useful if the rows you join are always selected from a small subset of the available rows. You can make a partition for this subset and increase performance.
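A minimal sketch of table partitioning (boundaries and all names are purely illustrative):
-- partition rows by year so queries touching recent data scan one partition
CREATE PARTITION FUNCTION pf_ByYear (date)
AS RANGE RIGHT FOR VALUES ('20230101', '20240101');

CREATE PARTITION SCHEME ps_ByYear
AS PARTITION pf_ByYear ALL TO ([PRIMARY]);

CREATE TABLE dbo.BigFacts
(
    Id int NOT NULL,
    CreatedOn date NOT NULL
) ON ps_ByYear (CreatedOn);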
Summary:
Divide and conquer. Analyze your query bit by bit to optimize it. The most promising options are indexes and index includes; start there and work through the other options if you still have trouble.

Rolling up values from multiple child tables

What is the best way to roll up values from a series of child tables into a parent table in SQL Server?
Let's say we have a contracts table. This table has a series of child tables, such as contract_timesheets, contract_materials, contract_other_expenses - etc. What is the best way to pull costs / hours / etc out of those child tables and make them easily accessible in the parent table?
Option 1: My first thought would be to simply use a view. An example might be something like this:
SELECT
contract_code,
caption,
description,
(
SELECT SUM(t.hours * l.rate_hourly)
FROM timesheets t
JOIN labor l ON t.hr_code = l.hr_code
WHERE t.contract_code = c.contract_code
) AS labor_cost,
(
SELECT ...
) AS material_cost,
...
FROM contracts c
So we'll have a view that might have a dozen or more subqueries like that, many of which will themselves need joins to pull in all of the info we need.
This works completely fine, until we have hundreds of thousands of rows. Then things start to get noticeably slow. It's still workable, but if the row count gets much higher, or the server takes on much more other workload, I'm concerned that this approach won't hold up.
Is there a more efficient way to structure such a view?
Option 2: The other obvious solution is to roll those numbers up into physical fields in the parent table. The big issue with that is maintaining the numbers when the data might be accessed from a variety of clients. Maybe it's a report, maybe a form, maybe some integration service. So relying on some premade roll-up SQL file that is run as an event in the front end prior to displaying the report/chart/whatever isn't an ideal solution.
To ensure that the roll-up numbers stay in sync, we could attach a series of triggers to all of the child tables (and possibly to relatives of those, if the numbers in the child tables rely on something else). Every time the source numbers get updated, we roll them up into the parent. This seems like a lot of trouble, but if the triggers are written correctly, I suppose this would work fine.
Option 3: Do everything in the UI. This is also an option, but with a variety of clients accessing the data, it makes things unpleasant.
Option 4(?): Since most of these records are actually complete, with no need to add more data, I can also imagine some kind of hybrid system. The base table for the parent contract would have physical columns for the labor costs, material costs, and so on. When a contract is marked as Closed (or some other status indicating no more data needs to be entered), those physical columns would be filled in (otherwise they're NULL). The view which is accessible to the clients could then decide, based on the status (or just an ISNULL check), whether to return the data directly from the physical columns or to calculate it on the fly. I'm not sure how the performance would be with this, but it might be worth a look. It would mean that the roll-up numbers only need to be calculated for a few thousand rows at most; everything else would come from the physical fields.
So, what is the right way to do this? Am I missing other possibilities?
Try using an Indexed View. This "materializes" the view. Creating a clustered index on the view allows your queries to go directly to the index rather than to all of the underlying tables/queries that make up the view.
I think the view is probably the right answer, but the way you have the query written, with correlated subqueries in the SELECT list, may be what causes the performance degradation as the row count increases. If you write everything out as joins with GROUP BY, it might allow the query optimizer to simplify the plan for the view at execution time and give you better performance.
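For example, the labor_cost subquery might be rewritten along these lines (a sketch covering only the labor portion, assuming contract_code is the grain you want):
SELECT
c.contract_code,
c.caption,
c.description,
SUM(t.hours * l.rate_hourly) AS labor_cost
FROM contracts c
LEFT JOIN timesheets t ON t.contract_code = c.contract_code
LEFT JOIN labor l ON l.hr_code = t.hr_code
GROUP BY c.contract_code, c.caption, c.description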
Have you also looked into Indexed Views? There are a lot of restrictions to creating indexed views so they may not be a viable option for you, but it's something to consider. Essentially an indexed view is a sort of denormalization. It would allow SQL Server to keep the aggregations updated for you automatically as the underlying data in the tables changes. It may of course degrade performance for inserts, updates and deletes, but it's something to consider if the performance of the aggregations is critical.
To get the best read performance in this case, indexed views are the way to go.
CREATE VIEW dbo.labor_costs
WITH SCHEMABINDING
AS
SELECT t.contract_code, t.hr_code,
SUM(t.hours * l.rate_hourly) AS labor_cost,
COUNT_BIG(*) AS row_count -- required in an indexed view with GROUP BY
FROM dbo.timesheets t
JOIN dbo.labor l ON l.hr_code = t.hr_code
GROUP BY t.contract_code, t.hr_code
GO
CREATE UNIQUE CLUSTERED INDEX UX_LaborCosts
ON dbo.labor_costs (contract_code, hr_code)
Once you have the indexed view, you can left join to it, aggregating up to the contract level since the view's grain is (contract_code, hr_code). For example:
SELECT
c.contract_code,
c.caption,
c.description,
SUM(lb.labor_cost) AS labor_cost
FROM
dbo.contracts c
LEFT JOIN dbo.labor_costs lb WITH (NOEXPAND)
on c.contract_code = lb.contract_code
GROUP BY c.contract_code, c.caption, c.description

JOIN on concatenated column performance

I have a view that needs to join on a concatenated column. For example:
dbo.View1 INNER JOIN
dbo.table2 ON dbo.View1.combinedcode = dbo.table2.code
Inside 'View1' there is a column which is composed like so:
dbo.tableA.details + dbo.tableB.code AS combinedcode
Performing a join on this column is extremely slow. However, the actual 'View1' runs extremely quickly. The poor performance comes from the join, and there aren't even many rows in any of the tables or views. Does anyone know why this might be?
Thanks for any insight!
Since there's no index on combinedcode, the JOIN will most likely result in a full "table scan" of the view to calculate the code for every row.
If you want to speed things up, try making the view into an indexed view with an index on combinedcode to help the join.
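A sketch of that approach, assuming tableA and tableB join on a unique id column (both the join key and the uniqueness are assumptions here):
CREATE VIEW dbo.View1_Indexed
WITH SCHEMABINDING
AS
SELECT a.id, a.details + b.code AS combinedcode
FROM dbo.tableA a
JOIN dbo.tableB b ON b.id = a.id;
GO
-- the unique clustered index materializes the view, including the concatenation
CREATE UNIQUE CLUSTERED INDEX UX_View1_Indexed ON dbo.View1_Indexed (id);
-- a nonclustered index then makes combinedcode cheap to seek on
CREATE NONCLUSTERED INDEX IX_View1_combinedcode ON dbo.View1_Indexed (combinedcode);
Queries joining on combinedcode can then seek the persisted value instead of concatenating per row (on non-Enterprise editions, add WITH (NOEXPAND) to make sure the index is used).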
Another alternative, depending on your SQL Server version, is (as Parado answers) to create a temporary table for the join, although it's usually less performant, at least for single-shot queries.
Try this way:
select *
into #TemTap
from View1
/* where conditions on View1 */
After that you could create an index on the concatenated column:
create index IX_TemTap_combinedcode on #TemTap (combinedcode)
and then join against the temp table instead of the view:
#TemTap View1 INNER JOIN dbo.table2 ON View1.combinedcode = dbo.table2.code
It often works for me.
The reason is because the optimizer has no information about the concatenated column, so it cannot choose a reasonable join path. My guess, if you look at the execution plan, is that the join is using a "nested loop" join. (I'm tempted to add "dreaded" to that.)
You might be able to fix this by putting an index on table2(code). The optimizer should decide to use this index, getting around the bad join optimization.
You can also use query hints to force the use of a "hash join" or "merge join". I am finding myself doing this more often for complex queries, where changes to the data might affect the query plan. (Such hints go in when a query that has been taking 2 minutes for a year decides to take hours, fill the temporary database, and die when it runs out of space.) You can do this by adding OPTION (merge join, hash join) to the end of the query. You can also explicitly choose the type of join in the FROM clause.
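Both forms might look like this (a sketch against the view and table from the question):
-- query-level hint: restrict the whole query to hash or merge joins
SELECT v.*, t2.*
FROM dbo.View1 v
JOIN dbo.table2 t2 ON t2.code = v.combinedcode
OPTION (HASH JOIN, MERGE JOIN);

-- join-level hint: force this one join to use hashing
SELECT v.*, t2.*
FROM dbo.View1 v
INNER HASH JOIN dbo.table2 t2 ON t2.code = v.combinedcode;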
Finally, storing the intermediate results in a temporary table (as proposed by Parado) should give the optimizer enough information to choose the best join algorithm.
Using SQL functions in a join or WHERE condition is not advised. Here you are (indirectly, but still) joining on a concatenation, so the concatenation is performed for every row before it can be compared against the other table.
The solution would be to use an intermediate table, rather than this view, to hold the concatenated value.
If that's not possible, try using an indexed view; I know it's a hell of a task.
I would still prefer creating the intermediate table.
See this link for the restrictions on indexed views:
http://msdn.microsoft.com/en-us/library/ms191432.aspx#Restrictions

Use of views to protect the actual tables in SQL

How do views act as a mediator between the actual tables and an end user? What is the internal process that occurs when a view is created? I mean, when a view is created on a table, does it stand like a wall between the table and the end user, or something else? How do views protect the actual tables, only with the CHECK OPTION? And if a user inserts directly into the table, how do I protect the actual tables?
If he/she does not use insert into vw values(), but uses insert into table_name values(), then how is the table protected?
Non-materialized views are just prepackaged SQL queries. They execute the same as any derived table/inline view. Multiple references to the same view will run the query the view contains once per reference. For example:
CREATE VIEW vw_example AS
SELECT id, [column], date_column FROM ITEMS
SELECT x.*, y.*
FROM vw_example x
JOIN vw_example y ON y.id = x.id
...translates into:
SELECT x.*, y.*
FROM (SELECT id, [column], date_column FROM ITEMS) x
JOIN (SELECT id, [column], date_column FROM ITEMS) y ON y.id = x.id
Caching
The primary benefit is caching, because the query text will be identical. Query plans are cached in order to make later runs of the query faster, since the execution plan has already been generated. Caching often requires the query text to be identical down to case sensitivity, and cached plans eventually expire.
Predicate Pushing
Another potential benefit is that views often allow "predicate pushing", where criteria specified against the view can be pushed down by the optimizer into the query the view represents. This means the query can scan the table once, with the filter applied, rather than scanning the whole table just to present the result set to the outer/ultimate query.
SELECT x.*
FROM vw_example x
WHERE x.[column] = 'y'
...could be interpreted by the optimizer as:
SELECT id, [column], date_column
FROM ITEMS
WHERE [column] = 'y'
The decision for predicate pushing lies solely with the optimizer. I'm unaware of any ability for a developer to force the decision, only that it really depends on the query the view uses and what additional criteria is being applied.
Commentary on Typical Use of Non-materialized Views
Sadly, it's very common to see a non-materialized SQL view used for nothing more than encapsulation to simplify writing queries -- simplification which isn't a recommended practice either. SQL is SET based, and doesn't optimize well using procedural approaches. Layering views on top of one another is also not a recommended practice.
Updateable Views
Non-materialized views are also updatable, but there are restrictions, because a view can be made of numerous tables joined together. An updatable, non-materialized view can stop a user from inserting new records while still allowing updates to existing ones. The CHECK OPTION relies on the query used to create the view to enforce a degree of update restriction, but it is not enough to ensure that no unwanted change will ever happen. This demonstrates that the only reliable means of securing against unwanted adds/edits/deletes is to grant proper privileges to the user, preferably via a role.
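A brief sketch of both mechanisms, using made-up table, view, and role names:
-- WITH CHECK OPTION: writes through the view must satisfy the view's filter
CREATE VIEW dbo.vw_active_items AS
SELECT id, name, active
FROM dbo.items
WHERE active = 1
WITH CHECK OPTION;
GO
-- the real protection is permissions: allow the view, deny the base table
GRANT SELECT, UPDATE ON dbo.vw_active_items TO app_role;
DENY SELECT, INSERT, UPDATE, DELETE ON dbo.items TO app_role;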
Views do not protect tables, though they can be used in a permissions-based table-protection scheme. Views simply provide a convenient way to access tables. If you give a user access to views and not tables, then you have probably greatly restricted access.