Optimizing a SQL Count Query - sql-server-2000

I am still fairly new to SQL, so I wanted to know whether I am doing this in the most optimized way.
SELECT DISTINCT ACCOUNTID, ACCOUNT_NAME,
(SELECT COUNT(*)
FROM TICKET
WHERE (ACCOUNTID = OPPORTUNITY.ACCOUNTID)) AS [Number Of Tickets],
(SELECT COUNT(*)
FROM TICKET
WHERE (ACCOUNTID = OPPORTUNITY.ACCOUNTID) AND (STATUSCODE = 1 OR
STATUSCODE = 2 OR
STATUSCODE = 3)) AS [Active Tickets]
from OPPORTUNITY
where AccountID > #LowerBound and AccountID < #UpperBound
What I am trying to do is get a list of all accounts and have it show how many tickets each account has and how many are active (have a status code of 1, 2, or 3). Is the SELECT inside the SELECT the correct way to do this, or is there a way it can be done using something like GROUP BY?
My biggest concern is speed: it takes 3-5 seconds just to pull around 20 records, and the query could potentially return thousands of results.
I am not the DBA, so changes to the table schema are not impossible but will require some pleading with upper management.
This is being run against SQL Server 2000.
EDIT --
Since all of the answers were asking about it, I checked: both Opportunity and Ticket have an ascending index on AccountID.

I think the query below should be logically equivalent and more efficient. Obviously, test both aspects at your end!
SELECT O.ACCOUNTID, O.ACCOUNT_NAME,
COUNT(T.ACCOUNTID) AS [Number Of Tickets], -- count the joined column, not *, so accounts with no tickets show 0
ISNULL(SUM(CASE WHEN T.STATUSCODE IN (1,2,3) THEN 1 ELSE 0 END),0)
AS [Active Tickets]
FROM OPPORTUNITY O
LEFT OUTER JOIN TICKET T ON T.ACCOUNTID = O.ACCOUNTID
WHERE O.ACCOUNTID > #LowerBound AND O.ACCOUNTID < #UpperBound
GROUP BY O.ACCOUNTID, O.ACCOUNT_NAME
If you can view the execution plans, then you should check that indexes exist on ACCOUNTID in both tables and are being used.

The SQL engine (even in 2000) is smart enough to optimize that SQL. Based on your performance numbers with so few results, I'm guessing the source data has a large number of records and does not have the indexes the SQL needs.
Make sure there is an index on Opportunity.AccountID and an index on Ticket.AccountID.
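For illustration, a minimal sketch of the indexes in question (the index names are illustrative, and extending the TICKET index to cover STATUSCODE is my assumption, not something from the question):
CREATE INDEX IX_Opportunity_AccountID ON OPPORTUNITY (ACCOUNTID);
CREATE INDEX IX_Ticket_AccountID_StatusCode ON TICKET (ACCOUNTID, STATUSCODE);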


Joining two data sets with timestamps that are slightly off

I have two tables, one with transactions and one that reflects an audit done on each of these transactions before they were allowed to process. I'd like to join the audit results to each transaction. Unfortunately, due to a system error, the TransID sometimes gets duplicated and thus does not always uniquely identify the transaction event. What DOES uniquely identify the event is the TransID + TransTimestamp. The row in the auditor's data that is uniquely associated with this transaction is the one with the same TransID and an AuditTimestamp that is the most recent timestamp coming before the TransTimestamp. The code I've tried below causes SQL Developer to run perpetually:
SELECT Trans.TransID, Audit.AuditTimestamp
FROM Trans
LEFT JOIN Audit ON Trans.TransID = Audit.TransID
WHERE Trans.TransTimestamp >= (SELECT MAX(Audit.AuditTimestamp)
FROM Audit
WHERE Trans.TransID = Audit.AuditID);
If a core piece of information such as a TransID is being duplicated, then it is a "very messy business" and no amount of workarounds will properly overcome this.
To attempt this query, I think you need to divide the transactions into those which have problems and those which don't, and below I have used COUNT(*) OVER() to establish whether a transaction has been duplicated or not. From there we can use a correlated subquery to locate a "probable match" for those transactions that need this approach. The probable match needs to have a timestamp >= the transaction's but within some allowed timeframe, e.g. 1 second.
From there we join the good with the bad to arrive at (hopefully) something reasonable. But although it may appear reasonable, it isn't guaranteed to be; you do need to fix the source problem.
WITH T AS (
SELECT
trans.*
, COUNT(*) OVER (PARTITION BY Trans.TransID) tran_count
FROM Trans
)
, A2 AS (
SELECT
T.TransID
, (SELECT MIN(Audit.AuditTimestamp) FROM Audit
WHERE T.TransID = Audit.TransID
AND Audit.AuditTimestamp > T.TransTimestamp
AND Audit.AuditTimestamp < (T.TransTimestamp + interval '1' second) /* change to suit */
) AS ProbableMatch
FROM T
WHERE T.tran_count > 1
)
SELECT
T.TransID
, T.tran_count
, COALESCE(A1.AuditTimestamp, A2.ProbableMatch) AS AuditTimestamp
, COALESCE(A1.ID, A3.ID) AS SourceAuditID
FROM T
LEFT JOIN Audit A1 ON T.TransID = A1.TransID AND T.tran_count = 1
LEFT JOIN A2 ON T.TransID = A2.TransID
LEFT JOIN Audit A3 ON A2.TransID = A3.TransID AND A2.ProbableMatch = A3.AuditTimestamp

Need to make SQL subquery more efficient

I have a table that contains all the pupils.
I need to look through my registered table, find all students, and see what their current status is.
If reg = y, then include the pupil in the search; however, a student may change from y to n, so I need the most recent record, using start_date to determine the most recent reg status.
The next step is that if the latest reg is n, don't pass it through. However, if the latest reg is y, then search the pupils table using pupilnumber; if that pupil number is in the pupils table, then add it to the count.
Select Count(*)
From Pupils Partition(Pupils_01)
Where Pupilnumber in (Select t1.pupilnumber
From registered t1
Where T1.Start_Date = (Select Max(T2.Start_Date)
From registered T2
Where T2.Pupilnumber = T1.Pupilnumber)
And T1.reg = 'N');
This query works, but it is very slow, as there are a great many records in the pupils table.
Just wondering if there is any way of making it more efficient
Worrying about query performance but not indexing your tables is, well, looking for a kind word here... ummm... daft. That's the whole point of indexes. Any variation on the query is going to be much slower than it needs to be.
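For example, a minimal sketch of the kind of indexes meant here (Oracle syntax to match the question; the names are illustrative, and pupilnumber may already be indexed if it is the primary key):
CREATE INDEX registered_pupil_date_ix ON registered (pupilnumber, start_date);
CREATE INDEX pupils_pupilnumber_ix ON pupils (pupilnumber);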
I'd guess that using analytic functions would be the most efficient approach since it avoids the need to hit the table twice.
SELECT COUNT(*)
FROM ( SELECT pupilnumber,
              start_date,
              reg,
              RANK() OVER (PARTITION BY pupilnumber ORDER BY start_date DESC) rnk
       FROM registered )
WHERE rnk = 1
AND reg = 'Y'
You can look at the execution plan for this query; it will show you the high-cost operations. If you see a table scan in the execution plan, you should index those tables. Also, you can try "exists" instead of "in".
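As a hedged sketch, the EXISTS rewrite of the question's query would look something like this (same tables and predicates, untested; re-add the Partition(Pupils_01) clause if it helps):
Select Count(*)
From Pupils p
Where Exists (Select 1
              From registered t1
              Where t1.Pupilnumber = p.Pupilnumber
              And t1.reg = 'N'
              And t1.Start_Date = (Select Max(t2.Start_Date)
                                   From registered t2
                                   Where t2.Pupilnumber = t1.Pupilnumber));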
This query MIGHT be more efficient for you, and I hope that at a minimum you have indexes on "pupilnumber" in the respective tables.
To clarify what I am doing: the first inner query is a join between the registered table and the pupils table, which pre-qualifies that the pupils DO exist in the pupils table... You can always re-add the "partition" reference if that helps. From that, it grabs both the pupil AND their max date, so it is not doing a correlated subquery for every student... get all students and their max date first...
THEN, join that result to the registered table... again by the pupil AND the max date being the same, and qualify the final registration status as YES. This should give you the count you need.
select
count(*) as RegisteredPupils
from
( select
t2.pupilnumber,
max( t2.Start_Date ) as MostRecentReg
from
registered t2
join Pupils p
on t2.pupilnumber = p.pupilnumber
group by
t2.pupilnumber ) as MaxPerPupil
JOIN registered t1
on MaxPerPupil.pupilNumber = t1.pupilNumber
AND MaxPerPupil.MostRecentReg = t1.Start_Date
AND t1.Reg = 'Y'
Note: If you have multiple records in the registration table, such as a person taking multiple classes registered on the same date, then you COULD get a false count. If that might be the case, you could change from
COUNT(*)
to
COUNT( DISTINCT T1.PupilNumber )

Query optimization for postgresql

I have to resolve a problem in my class about query optimization in postgresql.
I have to optimize the following query.
"The query determines the yearly loss in revenue if orders just with a quantity of more than the average quantity of all orders in the system would be taken and shipped to customers."
select sum(ol_amount) / 2.0 as avg_yearly
from orderline, (select i_id, avg(ol_quantity) as a
from item, orderline
where i_data like '%b'
and ol_i_id = i_id
group by i_id) t
where ol_i_id = t.i_id
and ol_quantity < t.a
Is it possible through indices or something else to optimize that query (a materialized view is possible as well)?
Execution plan can be found here. Thanks.
First, if you have to do searches from the back of the data, simply create an index on the reverse of the data:
create index on item(reverse(i_data));
Then query it like so:
select sum(ol_amount) / 2.0 as avg_yearly
from orderline, (select i_id, avg(ol_quantity) as a
from item, orderline
where reverse(i_data) like 'b%'
and ol_i_id = i_id
group by i_id) t
where ol_i_id = t.i_id
and ol_quantity < t.a
Remember that making indexes may not speed up the query when you have to retrieve something like 30% of the table. In this case a bitmap index might help you, but as far as I remember it is not available in Postgres. So, think about which table to index; maybe it would be worth indexing the big table by ol_i_id, as the join you are making only needs to match less than 10% of the big table, and the small table is loaded into RAM (I might be mistaken here, but at least in SAS a hash join means that you load the smaller table into RAM).
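Since the question says a materialized view is possible, one option is to precompute the per-item averages (a sketch in PostgreSQL 9.3+ syntax; the view name item_avg_qty is made up and this is untested):
create materialized view item_avg_qty as
select i_id, avg(ol_quantity) as a
from item, orderline
where i_data like '%b'
and ol_i_id = i_id
group by i_id;

refresh materialized view item_avg_qty; -- re-run whenever the base tables change

select sum(ol_amount) / 2.0 as avg_yearly
from orderline, item_avg_qty t
where ol_i_id = t.i_id
and ol_quantity < t.a;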
You may try aggregating data before doing any joins and reusing the grouped data. I assume that you need to do everything in one query without explicitly creating any staging tables by hand. Also, recently I have been working a lot with SQL Server, so I may mix the syntax, but give it a try. There are many assumptions I have made about the data and the structure of the table, but hopefully it will work.
;WITH GrOrderline AS (
SELECT ol_i_id, ol_quantity, SUM(ol_amount) AS Yearly, COUNT(*) AS cnt
FROM orderline
GROUP BY ol_i_id, ol_quantity
),
AvgOrderline AS (
SELECT
o.ol_i_id, SUM(o.ol_quantity * o.cnt)/SUM(o.cnt) AS AvgQ -- weight each quantity by its row count to reproduce AVG(ol_quantity)
FROM GrOrderline AS o
INNER JOIN item AS i ON (o.ol_i_id = i.i_id AND RIGHT(i.i_data, 1) = 'b')
GROUP BY o.ol_i_id
)
SELECT SUM(o.Yearly)/2.0 AS avg_yearly
FROM GrOrderline o INNER JOIN AvgOrderline a ON (o.ol_i_id = a.ol_i_id AND o.ol_quantity < a.AvgQ)

SQL Scheduling - Select All Rooms Available for Given Date Range

I'm using Microsoft's idea for storing resource and booking information. In short, resources, such as a hotel room, do not have date records and booking records have a BeginDate and EndDate.
I'm trying to retrieve room availability information using MS's queries but something tells me that MS's queries leave much to be desired. Here's the MS article I'm referring to: http://support.microsoft.com/kb/245074
How can I retrieve available rooms for a given date range? Here's my query that returns a simple list of bookings:
SELECT r.RoomID, b.BeginDate, b.EndDate
FROM tblRoom as r INNER JOIN tblBooking b ON r.RoomID = b.AssignedRoomID;
But I'm still baffled as to how I can get a list of available rooms for a given date range?
I am using Microsoft Access but I'd like to keep my queries DBMS agnostic, as much as possible. While this isn't really my question, if you feel that the data model I'm using is unsound, please say so, as I am willing to consider a better way of storing my data.
Edit1:
I failed to mention that I don't like MS's queries for two reasons. First of all, I'm really confused about the 3 different OR operators in the WHERE clause. Are those really necessary? Secondly, I don't like the idea of saving a query and using it as a table although I'm willing to do that if it gets the job done, which in this case I believe it does.
Edit2:
This is the solution I've landed on using the excellent answer given here. This is MS Access SQL dialect (forgive me):
SELECT * FROM tblRoom AS r
WHERE RoomID NOT IN
(SELECT AssignedRoomID as RoomID From tblBooking
WHERE assignedroomid IS NOT NULL AND assignedroomid = r.roomid AND
(BeginDate < #BookingInquiryEndDate AND EndDate > #BookingInquiryBeginDate)
)
You want all the rooms which do not have a booking in that date range.
If your SQL engine does subqueries...
Select * From Rooms r
where not exists
(Select * From Bookings
Where room = r.room
And startBooking < #endRange
And endBooking > #startRange)
HIK, to understand the need for the room = r.room clause, try these two queries.
Query One (with room = r.room clause)
Select r.*,
Case When Exists
(Select * From Bookings
Where room = r.room
And startBooking < #endRange
And endBooking > #startRange)
Then 'Y' Else 'N' End HasBooking
From Rooms r
Query Two(without room = r.room clause)
Select r.*,
Case When Exists
(Select * From Bookings
Where startBooking < #endRange
And endBooking > #startRange)
Then 'Y' Else 'N' End HasBooking
From Rooms r
Notice the first one returns different values in HasBooking for each row of the output, because the subquery is 'correlated' with the outer query... it is run over and over again, once for each outer-query result row.
The second one returns the same value for all rows... it is only executed once, because nothing in it depends on which row of the outer query it is being generated for.

Running an SQL query faster

SELECT projectID, urlID, COUNT(1) AS totalClicks, projectPage,
(SELECT COUNT(1)
FROM tblStatSessionRoutes, tblStatSessions
WHERE tblStatSessionRoutes.statSessionID = tblStatSessions.ID
AND tblStatSessions.projectID = tblAdClicks.projectID
AND (tblStatSessionRoutes.leftPageID = tblAdClicks.projectPage OR
tblStatSessionRoutes.rightPageID = tblAdClicks.projectPage)) AS totalViews
FROM tblAdClicks
WHERE projectID IN (SELECT projectID FROM tblProjects WHERE userID = 5)
GROUP BY projectID, urlID, projectPage
ORDER BY CASE projectID WHEN 170 THEN 1 ELSE 0 END, projectID
This is by no means an especially complex query, but because the database is normalised to a good level, and we are dealing with a significant amount of data, this query can be quite slow for the user.
Does anyone have tips on how to improve the speed of it? If I strategically denormalise parts of the database would this help? Will running it in a stored proc offer significant improvements?
The way I handle the data is efficient in my code, the bottleneck really is with this query.
Thanks!
De-normalising your database should be a last resort since (to choose just one reason) you don't want to encourage data inconsistencies which de-normalisation will allow.
First thing is to see if you can get some clues from the query execution plan. It could be, for example, that your sub-selects are costing too much, and would be better done first into temp tables which you then JOIN in your main query.
Also, if you see lots of table-scans, you could benefit from improved indexes.
If you haven't already, you should spend a few minutes re-formatting your query for readability. It's amazing how often the obvious optimisation will jump out at you while doing this.
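To illustrate the temp-table idea above, a minimal sketch (SQL Server syntax; the temp-table name is made up and this is untested):
SELECT projectID
INTO #userProjects
FROM tblProjects
WHERE userID = 5

SELECT a.projectID, a.urlID, COUNT(1) AS totalClicks, a.projectPage
FROM tblAdClicks a
INNER JOIN #userProjects p ON p.projectID = a.projectID
GROUP BY a.projectID, a.urlID, a.projectPage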
I would try to break up that
projectID IN (SELECT projectID FROM tblProjects WHERE userID = 5)
and use a JOIN instead:
SELECT
projectID, urlID, COUNT(1) AS totalClicks, projectPage,
(SELECT COUNT(1) ....) AS totalViews
FROM
dbo.tblAdClicks a
INNER JOIN
dbo.tblProjects p ON a.ProjectID = p.ProjectID
WHERE
p.UserID = 5
GROUP BY
a.projectID, a.urlID, a.projectPage
ORDER BY
CASE a.projectID
WHEN 170 THEN 1
ELSE 0
END, a.projectID
Not sure just how much this will help - should help a bit, I hope!
Other than that, I would check if you have indices on the relevant columns, e.g. on a.ProjectID (to help with the JOIN), and maybe on a.urlID and a.ProjectPage (to help with the GROUP BY)
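For example (index names are illustrative; check the execution plan before adding them):
CREATE INDEX IX_tblAdClicks_Project_Url_Page ON tblAdClicks (projectID, urlID, projectPage)
CREATE INDEX IX_tblProjects_User ON tblProjects (userID, projectID)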
If your DBMS has a tool that explains its query plan, use that first. (Your first correlated subquery might be running once per row.) Then make sure every column referenced in a WHERE clause has an index.
This subquery--WHERE projectID IN (SELECT projectID FROM tblProjects WHERE userID = 5)--can surely benefit from being cut and implemented as a view. Then join to the view.
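A minimal sketch of that view, mirroring the subquery as written (the view name is made up; in practice you would likely parameterise the user rather than hard-coding 5):
CREATE VIEW vwUser5Projects AS
SELECT projectID
FROM tblProjects
WHERE userID = 5
-- main query then becomes: ... WHERE projectID IN (SELECT projectID FROM vwUser5Projects)
-- or: INNER JOIN vwUser5Projects v ON v.projectID = tblAdClicks.projectID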
It's not unusual to treat clickstream data as a data warehouse application. If you need to go that route, I'd usually implement a separate data warehouse rather than denormalize a well-designed OLTP database.
I doubt that running it as a stored proc will help you.
I would try to remove the correlated subquery (the inner (SELECT COUNT(1) ...)). Having to join against your session routes where either the left page or the right page matches makes things a bit tricky. Something along the lines of (but I haven't tested this):
SELECT tblAdClicks.projectID, tblAdClicks.urlID, COUNT(1) AS totalClicks, tblAdClicks.projectPage,
SUM(CASE WHEN leftRoute.statSessionID IS NOT NULL OR rightRoute.statSessionID IS NOT NULL THEN 1 ELSE 0 END) AS totalViews
FROM tblAdClicks
JOIN tblProjects ON tblProjects.projectID = tblAdClicks.projectID
LEFT JOIN tblStatSessions ON tblStatSessions.projectID = tblAdClicks.projectID
LEFT JOIN tblStatSessionRoutes leftRoute ON leftRoute.statSessionID = tblStatSessions.ID AND leftRoute.leftPageID = tblAdClicks.projectPage
LEFT JOIN tblStatSessionRoutes rightRoute ON rightRoute.statSessionID = tblStatSessions.ID AND rightRoute.rightPageID = tblAdClicks.projectPage
WHERE tblProjects.userID = 5
GROUP BY tblAdClicks.projectID, tblAdClicks.urlID, tblAdClicks.projectPage
ORDER BY CASE tblAdClicks.projectID WHEN 170 THEN 1 ELSE 0 END, tblAdClicks.projectID
If I were to add some cache tables to help this, as I indicated, I'd try to reduce the two queries against tblStatSessionRoutes for the left and right pages to a single query. If you know that leftPageID will never be equal to rightPageID, it should be possible to simply use a trigger to populate an additional table with the left and right views in separate rows, for example.
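By way of illustration, a minimal sketch of such a trigger (SQL Server syntax; the cache table tblStatSessionPages and the trigger name are made up, and this is untested):
CREATE TABLE tblStatSessionPages (
statSessionID INT NOT NULL,
pageID INT NOT NULL
)
GO
CREATE TRIGGER trgStatSessionRoutes_Ins ON tblStatSessionRoutes
AFTER INSERT
AS
BEGIN
-- store the left and right pages as separate rows so views can be counted with one join
INSERT INTO tblStatSessionPages (statSessionID, pageID)
SELECT statSessionID, leftPageID FROM inserted
UNION ALL
SELECT statSessionID, rightPageID FROM inserted
END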