SQL query performance issues

I have the below SQL query that is taking an inordinate amount of time to run. Indexes have been added to all of the join fields in each table. Record counts for each table are as follows:
CRM.ASSET_PLUS:15,766,000
CRM.EMPLOYEE: 44,300
CRM.ACCOUNT: 1,180,000
CRM.DATA_NOTIFICATIONS: 500
CRM.PROD_INT: 87,800
What can I do to make this query more efficient?
SELECT D.NAME AS UP_ACCOUNT_NAME,
B.FIRST_NAME,
B.LAST_NAME
FROM CRM.ASSET_PLUS A,
CRM.EMPLOYEE B,
CRM.ACCOUNT C,
CRM.ACCOUNT D,
CRM.DATA_NOTIFICATIONS E,
CRM.PROD_INT F
WHERE A.STATUS IN ('Active', 'Pending Install')
AND E.PROD_DEF_OLD = F.X_ITEM_NUMBER
AND F.ROW_ID = A.PRODUCT_ID
AND C.UP_ACCOUNT_ID = D.ACCOUNT_ID
AND C.ACCOUNT_ID = A.LOCATION_ACCOUNT_ID
AND D.MANAGER_ID = B.EMPLOYEE_ID
AND UPPER(D.NAME) LIKE '%BP%'
GROUP BY D.NAME,
B.FIRST_NAME,
B.LAST_NAME

Get rid of that GROUP BY.
You are only selecting from account joined to employee; every other table is used just to filter the result. Those extra joins can produce duplicate rows, and that is the only reason you need the GROUP BY.
It becomes unnecessary if you rewrite the query so that all the other tables move into the WHERE clause.
Also try to write explicit joins, so that filter predicates and join predicates are kept separate.
I created an example to demonstrate it:
example using SQL fiddle
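To illustrate the idea of moving the filter-only tables into the WHERE clause, here is a minimal sketch using Python's built-in sqlite3 module. The three small tables and their rows are made up for the demonstration and are much simpler than the CRM schema in the question (the account self-join is left out); it only shows that a join on a filter table duplicates rows while an EXISTS subquery does not.

```python
import sqlite3

# Hypothetical miniature schema: two accounts with the same manager,
# and three assets that all belong to the first account.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee (employee_id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    CREATE TABLE account  (account_id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER);
    CREATE TABLE asset    (asset_id INTEGER PRIMARY KEY, account_id INTEGER, status TEXT);
    INSERT INTO employee VALUES (1, 'Ann', 'Lee');
    INSERT INTO account  VALUES (10, 'BP North', 1), (11, 'BP South', 1);
    INSERT INTO asset    VALUES (100, 10, 'Active'), (101, 10, 'Active'), (102, 10, 'Retired');
""")

# Joining the filter table yields one output row per matching asset,
# so the same account shows up twice.
joined = conn.execute("""
    SELECT d.name, b.first_name, b.last_name
    FROM account d
    JOIN employee b ON b.employee_id = d.manager_id
    JOIN asset a    ON a.account_id = d.account_id
    WHERE a.status = 'Active'
""").fetchall()

# With the filter table inside EXISTS, each account appears at most once,
# so no GROUP BY is needed to remove duplicates.
semi = conn.execute("""
    SELECT d.name, b.first_name, b.last_name
    FROM account d
    JOIN employee b ON b.employee_id = d.manager_id
    WHERE EXISTS (SELECT 1 FROM asset a
                  WHERE a.account_id = d.account_id AND a.status = 'Active')
""").fetchall()

print(joined)
print(semi)
```

Whether the EXISTS form is actually faster on the real tables depends on the optimizer and the indexes, so compare the execution plans before and after.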

1) Get rid of this if you can:
AND UPPER(D.NAME) LIKE '%BP%'
1.1) If possible, do not let users put % at the front of a LIKE pattern, because a leading wildcard typically forces a full table scan.
Note: you can allow a single % at the beginning (a suffix search) by storing a computed column with reverse(d.name) and using D.reversedname LIKE 'PB%' instead of D.name LIKE '%BP'. The key is that the % must be at the end of the pattern, not at the beginning.
Check this out:
https://use-the-index-luke.com/sql/where-clause/searching-for-ranges/like-performance-tuning
Then create an index on d.name.
1.2) Do not use UPPER. Either:
-> change the column collation to a case-insensitive one (ending in _CI) and remove the UPPER (easier)
or
-> use a computed column to precalculate UPPER(D.NAME) in a new PERSISTED column and use that column in the predicate instead. If you choose this solution, do not forget to create an index on the new column.
2) Create indexes on all foreign keys used in the query's joins.
3) If possible, require a minimum number of characters for the LIKE pattern, in order to reduce the number of possible results.
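The reversed-column trick from point 1.1 can be sketched as follows, again with Python's sqlite3 module. The table, the column name name_rev and the sample names are assumptions made up for this example, and the column is filled from Python rather than by a real computed column. Note that the trick only covers suffix searches ('%BP'); it does not help with a "contains" pattern such as '%BP%'.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (account_id INTEGER PRIMARY KEY, name TEXT, name_rev TEXT)"
)

names = ["Alpha BP", "BP Beta", "Gamma", "delta bp"]
# name_rev stands in for the precomputed column: upper-cased and reversed.
conn.executemany(
    "INSERT INTO account (name, name_rev) VALUES (?, ?)",
    [(n, n.upper()[::-1]) for n in names],
)
conn.execute("CREATE INDEX ix_account_name_rev ON account (name_rev)")

# Suffix search with a leading wildcard on the original column.
suffix_scan = conn.execute(
    "SELECT name FROM account WHERE UPPER(name) LIKE '%BP' ORDER BY account_id"
).fetchall()

# The same search expressed as a prefix search on the reversed column.
prefix_on_rev = conn.execute(
    "SELECT name FROM account WHERE name_rev LIKE 'PB%' ORDER BY account_id"
).fetchall()

print(suffix_scan)
print(prefix_on_rev)
```

Both queries return the same rows here. Whether the engine really uses the index for the prefix form depends on the database and its collation settings, so check the query plan on your own system.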

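The first thing I see is that you express the inner joins through the FROM list and put the join conditions in the WHERE clause.
Use the syntax table1 AS t1 INNER JOIN table2 AS t2 ON t1.key = t2.key.
Then the WHERE clause contains only the filter conditions UPPER(D.NAME) LIKE '%BP%' AND A.STATUS IN ('Active', 'Pending Install'). This may help the DBMS optimize, although many optimizers treat both forms identically; at the very least it makes the query easier to read and tune.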
Do not swap INNER JOIN for LEFT JOIN in the hope of reducing the number of rows: a LEFT JOIN returns at least as many rows as the equivalent INNER JOIN, so it cannot shrink the intermediate result.

Related

Use HAVING or WHERE?

I am confused about when to use HAVING and when to use WHERE. I need to find all of the bugs for the software Debugger that pertain to /main.html.
This is my query:
select Tickets.TicketID, b.Data
from Bugs b
Inner Join Tickets
On b.TicketID = Tickets.TicketID
Inner Join Softwares
On Softwares.SoftwareId = Tickets.SoftwareID
where Softwares.URL = 'http://debugger.com' and Tickets.Title = '/main.html'
NOTE: THIS GIVES ME DESIRED RESULT
But I want to make sure I am not missing anything important here. Maybe should I use HAVING somewhere here?
Also in order to make the query perform better on a large dataset, I have created an index on foreign keys
create nonclustered index IX_Tickets_SoftwareId
on [dbo].[Tickets] ([SoftwareId])
go
create nonclustered index IX_Bugs_TicketsId
on [dbo].[Bugs] ([TicketsId])
Am I doing this right?
Your query is fine. You want to filter individual records, which is what the WHERE clause does.
The HAVING clause comes into play in aggregate queries, that is, queries that use GROUP BY. Its purpose is to filter groups of records using aggregate functions (such as SUM(), MAX() or the like). It makes no sense for your query, which does not use aggregation.
Incidentally, I note that you are not returning anything from the softwares table, so that join is used for filtering only. In such a situation, I find that EXISTS is more appropriate, because it is explicit about its purpose:
select t.ticketid, b.data
from bugs b
inner join tickets t on b.ticketid = t.ticketid
where t.title = '/main.html' and exists (
select 1
from softwares s
where s.softwareid = t.softwareid and s.url = 'http://debugger.com'
)
For performance, consider an index on softwares(softwareid, url), so that the subquery executes efficiently. An index on tickets(ticketid, title) might also help.
WHERE is used to filter records before any grouping takes place. HAVING is used to filter values after they have been grouped. Only columns or expressions in the group can be included in the HAVING clause.
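The difference between the two clauses can be seen in a small sketch with Python's sqlite3 module. The tables are a simplified, made-up version of the Bugs and Tickets tables from the question (the Softwares table is omitted), so treat the names and rows as assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tickets (ticket_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE bugs    (bug_id INTEGER PRIMARY KEY, ticket_id INTEGER, data TEXT);
    INSERT INTO tickets VALUES (1, '/main.html'), (2, '/main.html'), (3, '/about.html');
    INSERT INTO bugs VALUES (1, 1, 'a'), (2, 1, 'b'), (3, 2, 'c'), (4, 3, 'd'), (5, 3, 'e');
""")

# WHERE filters individual rows; no grouping is involved.
rows = conn.execute("""
    SELECT t.ticket_id, b.data
    FROM bugs b
    JOIN tickets t ON t.ticket_id = b.ticket_id
    WHERE t.title = '/main.html'
    ORDER BY b.bug_id
""").fetchall()

# HAVING filters whole groups after aggregation:
# keep only the tickets that have more than one bug.
groups = conn.execute("""
    SELECT t.ticket_id, COUNT(*) AS bug_count
    FROM bugs b
    JOIN tickets t ON t.ticket_id = b.ticket_id
    WHERE t.title = '/main.html'
    GROUP BY t.ticket_id
    HAVING COUNT(*) > 1
""").fetchall()

print(rows)
print(groups)
```

The first query is the shape of the query in the question and needs only WHERE. HAVING becomes relevant only once a condition refers to an aggregate such as COUNT(*).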

SQLite: distinguish between table and column alias

Can SQLite distinguish between a column from some aliased table, e.g. table1.column and a column that is aliased with the same name, i.e. column, in the SELECT statement?
This is relevant because I need to refer to the column that I construct in the SELECT statement later on in a HAVING clause, but must not confuse it with the column in aliased table. To my knowledge, I cannot alias the table to be constructed in my SELECT statement (without reverting to some nasty work-around like SELECT * FROM (SELECT ...) AS alias) to ensure both are distinguishable.
Here's a stripped down version of the code I am concerned with:
SELECT
a.entity,
b.DATE,
TOTAL(a.dollar_amount*b.ret_usd)/TOTAL(a.dollar_amount) AS ret_usd
FROM holdings a
LEFT JOIN returns b
ON a.stock = b.stock AND
a.DATE = b.DATE
GROUP BY
a.entity,
b.DATE
HAVING
ret_usd NOT NULL
Essentially, I want to get rid of groups for which I cannot find any returns and thus would show up with NULL values. I am not using an INNER JOIN because in my production code I merge multiple types of returns - for some of which I may have no data. I only want to drop those groups for which I have no returns for any of the return types.
To my understanding, the SQLite documentation does not address this issue.
LEFT JOIN all of the return tables, then add a WHERE clause, something like
COALESCE(b.ret_usd, c.ret_usd, d.ret_usd....) IS NOT NULL
You might need a similar strategy to decide which ret_usd to use inside TOTAL. FYI, TOTAL never returns NULL; it returns 0.0 when there are no non-NULL inputs.
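The point about TOTAL can be checked with a small sqlite3 sketch. The tables below are a cut-down, made-up version of the ones in the question (no DATE column), and returns_alt is a hypothetical second return table standing in for "multiple types of returns".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE holdings    (entity TEXT, stock TEXT, dollar_amount REAL);
    CREATE TABLE returns     (stock TEXT, ret_usd REAL);
    CREATE TABLE returns_alt (stock TEXT, ret_usd REAL);  -- hypothetical second return type
    INSERT INTO holdings    VALUES ('fund_a', 'X', 100.0), ('fund_b', 'Y', 50.0), ('fund_c', 'Z', 20.0);
    INSERT INTO returns     VALUES ('X', 0.5);
    INSERT INTO returns_alt VALUES ('Y', 0.2);
""")

# TOTAL yields 0.0 for a group with no matching returns, while SUM yields NULL.
# A HAVING test for NULL on a TOTAL-based column therefore does not drop such groups.
totals = conn.execute("""
    SELECT a.entity,
           TOTAL(a.dollar_amount * b.ret_usd) AS with_total,
           SUM(a.dollar_amount * b.ret_usd)   AS with_sum
    FROM holdings a
    LEFT JOIN returns b ON a.stock = b.stock
    GROUP BY a.entity
    ORDER BY a.entity
""").fetchall()

# The COALESCE filter keeps a row if any of the return tables has a value for it.
kept = conn.execute("""
    SELECT a.entity
    FROM holdings a
    LEFT JOIN returns b     ON a.stock = b.stock
    LEFT JOIN returns_alt c ON a.stock = c.stock
    WHERE COALESCE(b.ret_usd, c.ret_usd) IS NOT NULL
    GROUP BY a.entity
    ORDER BY a.entity
""").fetchall()

print(totals)
print(kept)
```

In this data fund_c has no return of either type, so it is the only entity removed by the COALESCE filter.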

Access SQL: Update list with Max() and Min() values not possible

I have a list of dates: list_of_dates.
I want to find the max and min dates for each number with this query (#1).
It works as it should, and it gives me the table MinMax.
Now I want to update another list (list_of_things) with these newly acquired values (#2).
However, that is not possible.
I assume it's due to DISTINCT and the fact that I always get two rows per number, each with the min and max values. Therefore an update is not possible.
Unfortunately I don't know any other way.
#1
SELECT a.number, b.MaxDateTime, c.MinDateTime
FROM (list_of_dates AS a
INNER JOIN (
SELECT a.number, MAX(a.dat) AS MaxDateTime
FROM list_of_dates AS a
GROUP BY a.number) AS b
ON a.number = b.number)
INNER JOIN (SELECT a.number, MIN(a.dat) AS MinDateTime
FROM list_of_dates AS a
GROUP BY a.number) AS c
ON a.number = c.number;
#2
UPDATE list_of_things AS a
LEFT JOIN MinMax AS b
ON a.number = b.number
SET a.latest = b.MaxDateTime, a.earliest = b.MinDateTime
No part of an MS Access update query can contain aggregation, else the resulting recordset becomes 'not updateable'.
In your case, the use of the min & max aggregate functions within the MinMax subquery cause the final update query to become not updateable.
Whilst it is not always advisable to store aggregated data (in favour of generating an output from transactional data using queries), if you really need to do this, here are two possible methods:
1. Using a Temporary Table to store the Aggregated Result
Run a select into query such as the following:
select
t.number,
max(t.dat) as maxdatetime,
min(t.dat) as mindatetime
into
temptable
from
list_of_dates t
group by
t.number
This generates a temporary table called temptable. Then run the following update query, which sources its data from this temporary table:
update
list_of_things t1 inner join temptable t2
on t1.number = t2.number
set
t1.latest = t2.maxdatetime,
t1.earliest = t2.mindatetime
2. Use Domain Aggregate Functions
Since domain aggregate functions (dcount, dsum, dmin, dmax etc.) are evaluated separately from the evaluation of the query, they do not break the updateable nature of a query.
As such, you might consider using a query such as:
update
list_of_things t1
set
t1.latest = dmax("dat","list_of_dates","number = " & t1.number),
t1.earliest = dmin("dat","list_of_dates","number = " & t1.number)
It's a shot in the dark, but try adding DISTINCTROW, as per "SQL Update woes in MS Access - Operation must use an updateable query".
Also try using an inner join. If you need to, you can run an update to a null value first for all the records in the query to simulate the effect of the outer join.

Whether Inner Queries Are Okay?

I often see something like...
SELECT events.id, events.begin_on, events.name
FROM events
WHERE events.user_id IN ( SELECT contacts.user_id
FROM contacts
WHERE contacts.contact_id = '1')
OR events.user_id IN ( SELECT contacts.contact_id
FROM contacts
WHERE contacts.user_id = '1')
Is it okay to have a query inside a query? Is it called an "inner query"? A "sub-query"? Does it count as three queries (in my example)? If it's bad to do so, how can I rewrite my example?
Your example isn't too bad. The biggest problems usually come from cases where there is what's called a "correlated subquery". That's when the subquery is dependent on a column from the outer query. These are particularly bad because the subquery effectively needs to be rerun for every row in the potential results.
You can rewrite your subqueries using joins and GROUP BY, but performance of either form can vary, especially depending on your RDBMS.
It varies from database to database, especially depending on whether the columns compared are indexed or not and nullable or not. But generally, if your query is not using columns from the table joined to, you should be using either IN or EXISTS:
SELECT e.id, e.begin_on, e.name
FROM EVENTS e
WHERE EXISTS (SELECT NULL
FROM CONTACTS c
WHERE ( c.contact_id = '1' AND c.user_id = e.user_id )
OR ( c.user_id = '1' AND c.contact_id = e.user_id ))
Using a JOIN (INNER or OUTER) can inflate records if the child table has more than one record related to a parent table record. That's fine if you need that information, but if not then you need to use either GROUP BY or DISTINCT to get a result set of unique values -- and that can cost you when you review the query costs.
EXISTS
Though EXISTS clauses look like correlated subqueries, they do not execute as such (RBAR: Row By Agonizing Row). EXISTS returns a boolean based on the criteria provided, and exits on the first instance that is true -- this can make it faster than IN when dealing with duplicates in a child table.
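To make the row-inflation point concrete, here is a small sqlite3 sketch that runs the IN form, the EXISTS form and a plain JOIN against the same data. The rows are made up for the example, and integer ids are used instead of the quoted '1' from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events   (id INTEGER PRIMARY KEY, user_id INTEGER, name TEXT);
    CREATE TABLE contacts (user_id INTEGER, contact_id INTEGER);
    INSERT INTO events VALUES (1, 2, 'party'), (2, 3, 'meeting'), (3, 4, 'solo'), (4, 1, 'own');
    -- users 1 and 2 list each other, so user 2 matches in both directions
    INSERT INTO contacts VALUES (1, 2), (2, 1), (1, 3);
""")

in_ids = conn.execute("""
    SELECT e.id FROM events e
    WHERE e.user_id IN (SELECT c.user_id FROM contacts c WHERE c.contact_id = 1)
       OR e.user_id IN (SELECT c.contact_id FROM contacts c WHERE c.user_id = 1)
    ORDER BY e.id
""").fetchall()

exists_ids = conn.execute("""
    SELECT e.id FROM events e
    WHERE EXISTS (SELECT NULL FROM contacts c
                  WHERE (c.contact_id = 1 AND c.user_id = e.user_id)
                     OR (c.user_id = 1 AND c.contact_id = e.user_id))
    ORDER BY e.id
""").fetchall()

# A plain JOIN returns one row per matching contact row, so event 1 appears twice.
join_ids = conn.execute("""
    SELECT e.id FROM events e
    JOIN contacts c
      ON (c.contact_id = 1 AND c.user_id = e.user_id)
      OR (c.user_id = 1 AND c.contact_id = e.user_id)
    ORDER BY e.id
""").fetchall()

print(in_ids, exists_ids, join_ids)
```

The IN and EXISTS forms agree, while the JOIN needs DISTINCT or GROUP BY to get back to the same set of events.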
You could JOIN to the Contacts table instead:
SELECT events.id, events.begin_on, events.name
FROM events
JOIN contacts
ON (events.user_id = contacts.user_id AND contacts.contact_id = '1')
OR (events.user_id = contacts.contact_id AND contacts.user_id = '1')
GROUP BY events.id
-- exercise: without the GROUP BY, how many duplicate rows can you end up with?
This leaves the following question up to the database: "Should we look through all the contacts table and find all the '1's in the various columns, or do something else?" where your original SQL didn't give it much choice.
The most common term for this sort of query is "subquery." There is nothing inherently wrong with using them, and they can make your life easier. However, performance can often be improved by rewriting queries with subqueries to use JOINs instead, because the server can find optimizations.
In your example, three queries are executed: the main SELECT query, and the two SELECT subqueries.
SELECT events.id, events.begin_on, events.name
FROM events
JOIN contacts
ON (events.user_id = contacts.user_id AND contacts.contact_id = '1')
OR (events.user_id = contacts.contact_id AND contacts.user_id = '1')
GROUP BY events.id
In your case, I believe the JOIN version will be better as you can avoid two SELECT queries on contacts, opting for the JOIN instead.
See the mysql docs on the topic.

SQL GROUP BY/COUNT even if no results

I am attempting to get the information from one table (games) and count the entries in another table (tickets) that correspond to each entry in the first. I want each entry in the first table to be returned even if there aren't any entries in the second. My query is as follows:
SELECT g.*, count(*)
FROM games g, tickets t
WHERE (t.game_number = g.game_number
OR NOT EXISTS (SELECT * FROM tickets t2 WHERE t2.game_number=g.game_number))
GROUP BY t.game_number;
What am I doing wrong?
You need to do a left-join:
SELECT g.Game_Number, g.PutColumnsHere, count(t.Game_Number)
FROM games g
LEFT JOIN tickets t ON g.Game_Number = t.Game_Number
GROUP BY g.Game_Number, g.PutColumnsHere
Alternatively, I think this is a little clearer with a correlated subquery:
SELECT g.Game_Number, G.PutColumnsHere,
(SELECT COUNT(*) FROM Tickets T WHERE t.Game_Number = g.Game_Number) Tickets_Count
FROM Games g
Just make sure you check the query plan to confirm that the optimizer interprets this well.
You need to learn more about how to use joins in SQL:
SELECT g.*, count(t.game_number)
FROM games g
LEFT OUTER JOIN tickets t
USING (game_number)
GROUP BY g.game_number;
Note that unlike some database brands, MySQL permits you to list many columns in the select-list even if you only GROUP BY their primary key. As long as the columns in your select-list are functionally dependent on the GROUP BY column, the result is unambiguous.
Other brands of database (Microsoft, Firebird, etc.) give you an error if you list any columns in the select-list without including them in GROUP BY or in an aggregate function.
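One detail worth checking with a LEFT JOIN is what gets counted. The following sqlite3 sketch uses two tiny made-up tables to compare COUNT(*) with a count of a column from the tickets side.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE games   (game_number INTEGER PRIMARY KEY);
    CREATE TABLE tickets (ticket_id INTEGER PRIMARY KEY, game_number INTEGER);
    INSERT INTO games   VALUES (1), (2);
    INSERT INTO tickets VALUES (10, 1), (11, 1);   -- game 2 has no tickets
""")

# COUNT(*) counts joined rows, including the NULL-padded row that the
# LEFT JOIN produces for a game without tickets.
# COUNT(t.game_number) counts only rows where a ticket actually matched.
rows = conn.execute("""
    SELECT g.game_number,
           COUNT(*)             AS count_star,
           COUNT(t.game_number) AS count_tickets
    FROM games g
    LEFT JOIN tickets t ON g.game_number = t.game_number
    GROUP BY g.game_number
    ORDER BY g.game_number
""").fetchall()

print(rows)
```

Game 2 gets a count of 1 from COUNT(*) but 0 from COUNT(t.game_number), so counting a column from the outer-joined table is the form that reports zero tickets correctly.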
"FROM games g, tickets t" is the problem line. It performs an inner join, and the WHERE clause can only filter the joined rows; it cannot add to them. I think you want a LEFT OUTER JOIN.