Creating a view, SQL query performance

I am trying to create a view, but the SELECT statement from this view is taking more than 15 seconds. How can I make it faster? My query for the view is below.
create view Summary as
select distinct A.Process_date, A.SN, A.New, A.Processing,
    COUNT(case when B.type = 'Sold' and A.status = 'Processing' then 1 end) as Sold,
    COUNT(case when B.type = 'Repaired' and A.status = 'Processing' then 1 end) as Repaired,
    COUNT(case when B.type = 'Returned' and A.status = 'Processing' then 1 end) as Returned
from
    (select distinct M.Process_date, M.SN, max(P.enter_date) as enter_date, M.status,
        COUNT(case when M.status = 'New' then 1 end) as New,
        COUNT(case when M.status = 'Processing' and P.cn is null then 1 end) as Processing
    from DB1.dbo.Item_details M
    left outer join DB2.dbo.track_data P on M.SN = P.SN
    group by M.Process_date, M.SN, M.status) A
left outer join DB2.dbo.track_data B on A.SN = B.SN
where A.enter_date = B.enter_date or A.enter_date is null
group by A.Process_date, A.New, A.Processing, A.SN
After creating this view, my select query is:
select distinct process_date, sum(New), sum(Processing), sum(Sold), sum(Repaired), sum(Returned)
from Summary
where month(process_date) = 03 and year(process_date) = 2011
group by process_date
Please suggest what changes should be made for the query to perform faster.
Thank you
ARB

It is hard to give advice without seeing the actual data and the structure of the tables. I would rewrite the query keeping these principles in mind:
Use an inner join instead of an outer join if possible.
Get rid of the CASE operator inside the COUNT function. Build the query so the conditions sit in the WHERE clause, not in COUNT (see the sketch after this list).
Try not to use aggregated values in GROUP BY. Currently you group by the aggregated values New and Processing; use existing table columns in GROUP BY if possible.
If the query gets too complicated, break it into smaller queries and combine the results in the final query. Writing a stored procedure may help in this case.
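For instance, the Sold/Repaired/Returned conditions could move out of COUNT; a minimal sketch, assuming status and type take only the values used in the original query. Note it returns one row per type rather than one column per type, so the outer query would pivot or sum accordingly:
-- conditions in WHERE instead of inside COUNT
select M.Process_date, M.SN, B.type, count(*) as type_count
from DB1.dbo.Item_details M
inner join DB2.dbo.track_data B on B.SN = M.SN
where M.status = 'Processing'
and B.type in ('Sold', 'Repaired', 'Returned')
group by M.Process_date, M.SN, B.type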
I hope this helps.

For tuning a database query, I shall add a few items in addition to what Davyd has already listed:
Look at the tables and the indexing on those tables. Putting the right index in place and avoiding the wrong ones always speeds up the query.
Is there anything in the WHERE condition that is not part of any index? At times we put an index on a column and then, in the query, use a cast or convert on that column, so the underlying index is not effective. You may consider indexing the cast/convert of the column, or rewriting the predicate so the raw column is compared directly (see the sketch after this list).
Look at normal form conformity or over-normalisation.
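For example, the month() and year() calls in the question's final query wrap the process_date column, so an index on process_date cannot be used to seek. A sketch of an equivalent sargable filter:
-- a date-range predicate lets an index on process_date seek instead of scan
select process_date, sum(New), sum(Processing), sum(Sold), sum(Repaired), sum(Returned)
from Summary
where process_date >= '20110301' and process_date < '20110401'
group by process_date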
Good luck.

If you are using PostgreSQL, I suggest using a tool like http://explain.depesz.com/ to see more clearly which part of your query is slow. Depending on what you find, you can either optimize your indexes or rewrite part of your query. If you are using another database, I'm sure a similar tool exists.
If none of these ideas help, the final solution would be to create a materialized view. There is plenty of information on the web about this.
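In PostgreSQL, for instance, a minimal sketch (the name summary_cache is illustrative):
-- precompute the expensive view once...
CREATE MATERIALIZED VIEW summary_cache AS SELECT * FROM Summary;
-- ...then refresh on a schedule; the data is frozen between refreshes
REFRESH MATERIALIZED VIEW summary_cache;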
Good luck.

Related

SQL query efficiency of subquery in select vs inner join

I have a query with the following structure:
SELECT
Id,
(SELECT COUNT(1) AS [A1]
FROM [dbo].Table2 AS [Extent4]
WHERE (Table1.Id = [Extent4].Id2)) AS [C1]
FROM TPO_User
This query structure is usually used by LINQ as opposed to the following structure:
SELECT Id
FROM Table1
LEFT OUTER JOIN
(SELECT COUNT(1) AS [A1], [Extent4].Id2
FROM [dbo].Table2 AS [Extent4]
GROUP BY [Extent4].Id2) AS [C1] ON C1.Id2 = Table1.Id
When I compare them, the second query has a shorter duration. Could someone explain the exact difference in execution of such a query?
And is it ever worth having a subquery in your SELECT statement instead of an inner join?
I would expect both queries to have similar performance characteristics. When doing performance comparisons, you have to be sure you do them correctly. For instance, running two queries in a row is not a good comparison, because the table data has already been loaded into memory.
To really compare the queries, you need a quiescent server and cold caches. That said, the execution plan can be a big help in understanding what is happening.
I would expect the correlated subquery to have good performance with the right indexes. For your example, you want an index on Table2(Id2).
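A sketch of that index (the name is illustrative):
CREATE INDEX IX_Table2_Id2 ON dbo.Table2 (Id2);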
Which has better performance in general? Well, it is simple to devise scenarios where the correlated subquery is better. For instance, if TPO_User has 1 row and Table2 has 1,000,000 rows, then the correlated subquery will be better under almost any circumstances.
In my understanding:
the FROM clause is the definition of the target.
the SELECT clause is the projection (line-by-line) definition.
So the FROM clause loads the data you need into memory, and after that the projection is made on each line of your SELECT statement.
So if you run a query (or call a function...) in the SELECT clause, you are saying that you want this sub-job done for each line of your projection. Seems quite heavy ;)
A short reference on the order of operations in a SQL query: https://www.periscopedata.com/blog/sql-query-order-of-operations
Hope this helps (and please do not hesitate to correct me if I am wrong).
(And if I remember correctly, SQL Server now has an automatic feature to optimize queries. I think it would make this correction by itself, wouldn't it?)
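If that refers to SQL Server's automatic tuning (available from SQL Server 2017), it can be enabled per database; a sketch, noting that it corrects plan regressions rather than rewriting subqueries into joins:
-- allow the optimizer to fall back to the last known good plan after a regression
ALTER DATABASE CURRENT SET AUTOMATIC_TUNING (FORCE_LAST_GOOD_PLAN = ON);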

What's the best way to amalgamate the 2 queries below

I wrote the query below as part of a larger query to create a table. As I'm new to SQL, this was done in a very step-by-step manner so that I could easily understand the individual steps in the query and what each part was doing.
However, I've now been tasked with making the 2 parts of the query below more efficient by joining them together, and this is where I'm struggling.
I feel like I should be creating a single table rather than 2, and that the single table should contain all of the columns/values that I require. However, I am not at all sure of the syntax required to make this happen, or the order in which I need to rewrite the query.
Is anyone able to offer any advice?
Many thanks
sys_type as (
    select nvl(dw_start_date, sysdate) date_updated, id, descr
    from scd2.scd2_table_a
    inner join year_month_period
        on 1 = 1
    where batch_end_date between dw_start_date and nvl(dw_end_date, sysdate)),
sys_type_2 as (
    select -1 as sys_typ_id,
        'Unassigned' as sys_typ_desc,
        sysdate as date_updated
    from dual
    union
    select id as sys_typ_id, descr as sys_typ_desc, date_updated
    from sys_type),
Assuming you are using an Oracle database, the queries above seem fine. I don't think you can make them more efficient just by 'joining' them (joining defined very loosely here). Is there an actual performance issue?
I think you can get better results by tuning your first inline query, sys_type.
You have a cartesian product there. Do you need it? Why not move the condition from the WHERE clause into the join clause?
Basically:
sys_type as (
    select nvl(dw_start_date, sysdate) date_updated, id, descr
    from scd2.scd2_table_a
    inner join year_month_period
        on (batch_end_date between dw_start_date and nvl(dw_end_date, sysdate)))

How exactly is the value of count(*) determined in BigQuery?

I am joining a table of about 70000 rows with a slightly bigger second table through INNER JOIN EACH. Now count(a.business_column) and count(*) give different results: the former correctly reports back ~70000, while the latter gives ~200000. But this only happens when I select count(*) alone; when I select them together, they give the same result (~70000). How is this possible?
select
count(*)
/*,count(a.business_column)*/
from table_a a
inner join each table_b b
on b.key_column = a.business_column
UPDATE: For a step by step explanation on how this works, see BigQuery flattens when using field with same name as repeated field instead.
To answer the title question: COUNT(*) in BigQuery is always accurate.
The caveat is that in SQL COUNT(*) and COUNT(column) have semantically different meanings - and the sample query can be interpreted in different ways.
See: http://www.xaprb.com/blog/2009/04/08/the-dangerous-subtleties-of-left-join-and-count-in-sql/
There they have this sample query:
select user.userid, count(email.subject)
from user
inner join email on user.userid = email.userid
group by user.userid;
That query turns out to be ambiguous, and the article's author changes it to a more explicit one, adding this comment:
But what if that’s not what the author of the query meant? There’s no way to really know. There are several possible intended meanings for the query, and there are several different ways to write the query to express those meanings more clearly. But the original query is ambiguous, for a few reasons. And everyone who reads this query afterwards will end up guessing what the original author meant. “I think I can safely change this to…”
COUNT(*) counts the most repeated field in your query; if you want to count full records, use COUNT(0).
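Applied to the question's query, a sketch:
select count(0)
from table_a a
inner join each table_b b
on b.key_column = a.business_column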

Conditional aggregate database queries and their performance implications

I think this question is best asked with an example: if you want two counts from a table, say one of all the rows with a bit flag set to false and another of all the ones set to true, is there a best practice for this kind of query, and what are the performance implications of any approaches that could be taken?
To expand a little, basing it on the article below: how would separate queries compare, from a performance point of view, with the version that does the CASE evaluation in the SELECT list? Are there other methods?
http://www.codeproject.com/Articles/310674/Conditional-Sums-in-SQL-Aggregate-Methods
Other than Blam's way, I think there are three basic ways to get the desired result. I tested the three options below, as well as Blam's, on my system; the results I found were as follows. As a side note, we didn't have any bit data in our system, so I counted an indexed column with two values ('H' or 'R').
Using the Conditional Aggregates method resulted in the fastest performance. Blam's Grouping with an Aggregate method was the second fastest, consistently taking about 33% longer than the Conditional Aggregates. The Two Separate Select Statements method was the third fastest, consistently taking close to 50% longer than the Conditional Aggregates. Finally, the Joins method took the longest, close to 1000% slower than the Conditional Aggregates.

The joins were expected (by me) to take the longest, as you're joining to that table multiple times. I included the method because it was not discussed (possibly for obvious reasons) in the question; performance issues aside, it is a viable, if extremely slow, option. The two separate select statements also make sense, as you're running two separate aggregates and accessing the table two separate times.
I'm not sure what accounts for the differences between the conditional aggregate method and Blam's method. I've always been pleasantly surprised by the speed and performance of case statements, and today was no different.
I think the case statement method is, performance considerations aside, possibly the most versatile method. It works with just about any type of field and makes it easy to select a subset of values, whereas Blam's Grouping with an Aggregate method shows all possible column values unless a WHERE clause is included.
Conditional Aggregates
Select SUM(Case When bitcol = 1 Then 1 Else 0 End) as True_Count
, SUM(Case When bitcol = 0 Then 1 Else 0 End) as False_Count
From Table;
Two separate select statements
Select Count(1) as True_Count
From Table
Where bitcol = 1;
Select Count(1) as False_Count
From Table
Where bitcol = 0;
Using Joins
Select Count(T2.bitcol) as True_Count
, Count(T3.bitcol) as False_Count
From Table T1
Left Outer Join Table T2
on T1.ID = T2.ID and T2.bitcol = 1
Left Outer Join Table T3
on T1.ID = T3.ID and T3.bitcol = 0;
SELECT [bitCol], count(*)
FROM [table]
GROUP BY [bitCol]
If that column is indexed, this is an index scan followed by a stream aggregate.
I doubt you can do better than that.
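For completeness, a sketch of the supporting index (the name is illustrative):
CREATE INDEX IX_table_bitCol ON [table] ([bitCol]);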

Avoiding a problematic nested query in MySQL

I have this SQL query which, due to my own lack of knowledge and to problems with MySQL handling nested queries, is really slow to process. The query is:
SELECT DISTINCT PrintJobs.UserName
FROM PrintJobs
LEFT JOIN Printers
ON PrintJobs.PrinterName = Printers.PrinterName
WHERE Printers.PrinterGroup
IN (
SELECT DISTINCT Printers.PrinterGroup
FROM PrintJobs
LEFT JOIN Printers
ON PrintJobs.PrinterName = Printers.PrinterName
WHERE PrintJobs.UserName='<username/>'
);
I would like to avoid splitting this into two queries and inserting the values of the subquery into the main query programmatically.
This is probably not exactly what you are looking for; however, I will contribute my 2 cents. First off, you should show us your schema and exactly what you are trying to accomplish with that query. From the looks of it, you are not using numeric IDs in the table and are instead using varchar fields to join tables; this is not really a good idea performance-wise. Also, I am not sure why you are doing:
(select PrinterName, UserName
from PrintJobs) AS Table1
instead of just joining on PrintJobs? Similar for this one:
(select
PrinterName,
PrinterGroup
from Printers) as Table1
Maybe I am just not seeing it right. I would recommend simplifying the query as much as possible and trying it. Also, tell us exactly what you hope to accomplish with the query and give us some schema to work with.
Removed the bad query from the answer.
The query you have is pretty messed up. I'm not sure this will handle everything you need, but simplifying like this kills all the nested queries and is way faster. You can also use the EXPLAIN command to see how MySQL will execute your query.
SELECT DISTINCT PrintJobs.UserName
FROM PrintJobs
LEFT JOIN Printers ON PrintJobs.PrinterName = Printers.PrinterName
AND PrintJobs.UserName = '<username/>';
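To see how MySQL will execute it, prefix the statement with EXPLAIN:
EXPLAIN SELECT DISTINCT PrintJobs.UserName
FROM PrintJobs
LEFT JOIN Printers ON PrintJobs.PrinterName = Printers.PrinterName
AND PrintJobs.UserName = '<username/>';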