Optimize WHERE clause in query - sql

I have the following query:
SELECT table2.serialcode,
p.name,
date,
power,
behind,
direction,
length,
centerlongitude,
centerlatitude,
currentlongitude,
currentlatitude
FROM table1 as table2
JOIN pivots p ON p.serialcode = table2.serial
WHERE table2.serialcode = '49257'
and date = (select max(a.date) from table1 a where a.serialcode ='49257');
It seems it is retrieving the select max subquery for each join. It takes a lot of time. Is there a way to optimize it? Any help will be appreciated.

Sub selects that end up being evaluated "per row of the main query" can cause tremendous performance problems once you try to scale to larger number of rows.
Sub selects can almost always be eliminated with a data model tweak.
Here's one approach: add a new is_latest to the table to track if it's the max value (and for ties, use other fields like created time stamp or the row ID). Set it to 1 if true, else 0.
Then you can add where is_latest = 1 to your query and this will radically improve performance.
You can schedule the update to happen or add a trigger etc. if you need an automated way of keeping is_latest up to date.
Other approaches involve 2 tables - one where you keep only the latest record and another table where you keep the history.

declare #maxDate datetime;
select #maxDate = max(a.date) from table1 a where a.serialcode ='49257';
SELECT table2.serialcode,
p.name,
date,
power,
behind,
direction,
length,
centerlongitude,
centerlatitude,
currentlongitude,
currentlatitude
FROM table1 as table2
JOIN pivots p ON p.serialcode = table2.serial
WHERE table2.serialcode = '49257'
and date =#maxDate;

You can optimize this query using indexes. Here are somethat should help: table1(serialcode, serial, date), table1(serialcode, date), and pivots(serialcode).
Note: I find it very strange that you have columns called serial and serialcode in the same table, and the join is on serial.

Since you haven't mentioned which DB you are using, I would answer if it was for Oracle.
You can use WITH clause to take out the subquery and make it perform just once.
WITH d AS (
SELECT max(a.date) max_date from TABLE1 a WHERE a.serialcode ='49257'
)
SELECT table2.serialcode,
p.name,
date,
power,
behind,
direction,
length,
centerlongitude,
centerlatitude,
currentlongitude,
currentlatitude
FROM table1 as table2
JOIN pivots p ON p.serialcode = table2.serial
JOIN d on (table2.date = d.max_date)
WHERE table2.serialcode = '49257'
Please note that you haven't qualified date column, so I just assumed it belonged to table1 and not pivots. You can change it. An advise on the same note - always qualify your columns by using table.column format.

Related

Compare value of one field in one table to the total sum of one column in another table

I'm having trouble with executing a query to compare where the value of a column in one table is not equal to the sum of another column in a different table. Below is the query I have been trying to execute:
select id.invoice_no,sum(id.bank_charges),
from db2apps.invoice_d id
inner join db2apps.invoice_h ih on (id.invoice_no = ih.invoice_no)
group by id.invoice_no
having coalesce(sum(id.bank_charges), 0) != ih.tax_value
with ur;
I tried with joining on the tables, the group by having format, etc and have had no luck. I really want to select id.invoice_no, ih.tax_value, and sum(id.bank_charges) in the result set, and also grab the data where the sum(id.bank_charges) is not equal to the value of ih.tax_value. Any help would be appreciated.
Perhaps this solves your problem:
select ih.invoice_no, ih.tax_value, sum(id.bank_charges)
from db2apps.invoice_h ih left join
db2apps.invoice_d id
on id.invoice_no = ih.invoice_no
group by ih.invoice_no, ih.tax_value
having coalesce(sum(id.bank_charges), 0) <> ih.tax_value;
The most logical way is probably to SUM the invoice detail first.
SELECT IH.INVOICE_NO
, IH.TAX_VALUE
FROM
DB2APPS.INVOICE_H IH
JOIN
( SELECT INVOICE_NO
, COALESCE(SUM(BANK_CHARGES),0) AS BANK_CHARGES
FROM
DB2APPS.INVOICE_D
GROUP BY
INVOICE_NO
) ID
ON
ID.INVOICE_NO = IH.INVOICE_NO
WHERE
ID.BANK_CHARGE <> IH.TAX_VALUE
Generally, you never need to use HAVING in SQL and often your code will be clearer and easier to follow if you do avoid using it (even if it it sometimes a bit longer).
P.S. you can remove the COALESCE if BANK_CHARGES is NOT NULL.

Can any one please suggest which sql query is better approach for large data set in mysql

I have to find the max id row for same group in a table and show the roe details.Using following two approach we can achieve it. But want know which will be good approach for large data. Or any other new approach that will take less time to execute?
Approach 1:
select a.* from tab1 a left join (SELECT max(id) as id,name from tab1
GROUP by name) as tab2 on a.id=tab2.id where a.id=tab2.id
Approach 2:
SELECT id,name from tab1 where id in(SELECT MAX(id) FROM tab1 GROUP by name)
Taken from the manual (13.2.10.11 Rewriting Subqueries as Joins):
A LEFT JOIN can be faster than a subquery because
the server might be able to optimize it better.
So subqueries can be slower than LEFT [OUTER] JOINS, but in my opinion their strength is slightly higher readability. But since there is one LEFT JOIN and a subquery in the first approach, the second approach might be faster at large scale query.
You can also use a window function to avoid a self-join:
SELECT id, name
FROM (
SELECT
id,
name,
RANK() OVER(PARTITION BY name ORDER BY Id DESC) AS IdRankPerGroup
FROM tab1
) src
WHERE IdRankPerGroup = 1
The RANK() function orders each row per "name" group and assigns a ranking based on the "id" value within each group. Then in the outer query, you just get the rows with a ranking = 1.
Try all three queries, check out the EXPLAIN plans, and see which one works best with large amounts of data.

Need to make SQL subquery more efficient

I have a table that contains all the pupils.
I need to look through my registered table and find all students and see what their current status is.
If it's reg = y then include this in the search, however student may change from y to n so I need it to be the most recent using start_date to determine the most recent reg status.
The next step is that if n, then don't pass it through. However if latest reg is = y then search the pupil table, using pupilnumber; if that pupil number is in the pupils table then add to count.
Select Count(*)
From Pupils Partition(Pupils_01)
Where Pupilnumber in (Select t1.pupilnumber
From registered t1
Where T1.Start_Date = (Select Max(T2.Start_Date)
From registered T2
Where T2.Pupilnumber = T1.Pupilnumber)
And T1.reg = 'N');
This query works, but it is very slow as there are several records in the pupils table.
Just wondering if there is any way of making it more efficient
Worrying about query performance but not indexing your tables is, well, looking for a kind word here... ummm... daft. That's the whole point of indexes. Any variation on the query is going to be much slower than it needs to be.
I'd guess that using analytic functions would be the most efficient approach since it avoids the need to hit the table twice.
SELECT COUNT(*)
FROM( SELECT pupilnumber,
startDate,
reg,
rank() over (partition by pupilnumber order by startDate desc) rnk
FROM registered )
WHERE rnk = 1
AND reg = 'Y'
You can look execution plan for this query. It will show you high cost operations. If you see table scan in execution plan you should index them. Also you can try "exists" instead of "in".
This query MIGHT be more efficient for you and hope at a minimum you have indexes per "pupilnumber" in the respective tables.
To clarify what I am doing, the first inner query is a join between the registered table and the pupil which pre-qualifies that they DO Exist in the pupil table... You can always re-add the "partition" reference if that helps. From that, it is grabbing both the pupil AND their max date so it is not doing a correlated subquery for every student... get all students and their max date first...
THEN, join that result to the registration table... again by the pupil AND the max date being the same and qualify the final registration status as YES. This should give you the count you need.
select
count(*) as RegisteredPupils
from
( select
t2.pupilnumber,
max( t2.Start_Date ) as MostRecentReg
from
registered t2
join Pupils p
on t2.pupilnumber = p.pupilnumber
group by
t2.pupilnumber ) as MaxPerPupil
JOIN registered t1
on MaxPerPupil.pupilNumber = t1.pupilNumber
AND MaxPerPupil.MostRecentRec = t1.Start_Date
AND t1.Reg = 'Y'
Note: If you have multiple records in the registration table, such as a person taking multiple classes registered on the same date, then you COULD get a false count. If that might be the case, you could change from
COUNT(*)
to
COUNT( DISTINCT T1.PupilNumber )

Cumulative Summing Values in SQLite

I am trying to perform a cumulative sum of values in SQLite. I initially only needed to sum a single column and had the code
SELECT
t.MyColumn,
(SELECT Sum(r.KeyColumn1) FROM MyTable as r WHERE r.Date < t.Date)
FROM MyTable as t
Group By t.Date;
which worked fine.
Now I wanted to extend this to more columns KeyColumn2 and KeyColumn3 say. Instead of adding more SELECT statements I thought it would be better to use a join and wrote the following
SELECT
t.MyColumn,
Sum(r.KeyColumn1),
Sum(r.KeyColumn2),
Sum(r.KeyColumn3)
FROM MyTable as t
Left Join MyTable as r On (r.Date < t.Date)
Group By t.Date;
However this does not give me the correct answer (instead it gives values that are much larger than expected). Why is this and how could I correct the JOIN to give me the correct answer?
You are likely getting what I would call mini-Cartesian products: your Date values are probably not unique and, as a result of the self-join, you are getting matches for each of the non-unique values. After grouping by Date the results are just multiplied accordingly.
To solve this, the left side of the join must be rid of duplicate dates. One way is to derive a table of unique dates from your table:
SELECT DISTINCT Date
FROM MyTable
and use it as the left side of the join:
SELECT
t.Date,
Sum(r.KeyColumn1),
Sum(r.KeyColumn2),
Sum(r.KeyColumn3)
FROM (SELECT DISTINCT Date FROM MyTable) as t
Left Join MyTable as r On (r.Date < t.Date)
Group By t.Date;
I noticed that you used t.MyColumn in the SELECT clause, while your grouping was by t.Date. If that was intentional, you may be relying on undefined behaviour there, because the t.MyColumn value would probably be chosen arbitrarily among the (potentially) many in the same t.Date group.
For the purpose of this example, I assumed that you actually meant t.Date, so, I replaced the column accordingly, as you can see above. If my assumption was incorrect, please clarify.
Your join is not working cause he will find way more possibilities to join then your subselect would do.
The join is exploding your table.
The sub select does a sum of all records where the date is lower then the one from the current record.
The join joins every row multiple times aslong as the date is lower then the current record. This mean a single record could do as manny joins as there are records with a date lower. This causes multiple records. And in the end a higher SUM.
If you want the sum from mulitple columns you will have to use 3 sub query or define a unique join.

Updating with sub query on the same table

I was trying to achieve this query1:
UPDATE temp_svn1 t set closedate=(select max(date) from temp_svn1 p where p.id=t.id
Apparently MySQL does not allow such queries. So I came up with this query using inner joins but this is too slow. How can I write a better query for this? OR How can I achieve the logic of query 1?
UPDATE temp_svn1 AS out INNER JOIN (select id, close from temp_svn1 T inner join (select id as cat, max(date) as close from temp_svn1 group by cat) as in where T.id = in.cat group by id ) as result ON out.id = result.id SET out.closedate = result.close
Because of mysql peculiarities the only way you're going to get this to be faster is to split it into two statements. First, fetch the max, and then use it in an update. This is a known hack, and, afaik, there's not a "neater" way of doing it in a single statement.
Since your subquery is just returning a single value, you could do the query in two stages. Select the max(date) into a server-side variable, then re-use that variable in the outer query. Of course, this breaks things up into two queries and it'll no longer be atomic. But with appropriate transactions/locks, that becomes moot.
This works:
UPDATE temp_svn1
set closedate = (select max(date) from temp_svn1 p where p.id = temp_svn1.id)