I've got a view defined that lists transactions together with a running total, something like
CREATE VIEW historyView AS
SELECT
a.createdDate,
a.value,
m.memberId,
SUM(a.value) OVER (ORDER BY a.createdDate) as runningTotal,
...many more columns...
FROM allocations a
JOIN member m ON m.id = a.memberId
JOIN ...many joins...
The biggest tables this query looks at have ~10 million rows, but on average when the view is queried it will only return a few tens of rows.
My issue is that when this SELECT statement is run directly for a given member, it executes extremely quickly and returns results in a couple of milliseconds. However, when queried as a view...
SELECT h.createdDate, h.value, h.runningTotal
FROM historyView h
WHERE h.username = 'blah@blah.com'
...the performance is dreadful. The two query plans are very different - in the first case it is pretty much ideal, but in the latter case there are loads of scans and hundreds of thousands or millions of rows being read. This is clearly because the filter on the member is applied last, after everything else has been done, rather than right up front at the start.
If I remove the SUM(x) OVER (ORDER BY y) clause, this problem goes away.
Is there something I can do to ensure that the SUM(x) OVER (ORDER BY y) clause does not ruin the query plan?
One solution to my problem is to let the query optimiser know it is safe to filter before running the windowed function by PARTITION'ing by that property. The change to the view is:
CREATE VIEW historyView AS
SELECT
a.createdDate,
a.value,
m.memberId,
SUM(a.value) OVER (PARTITION BY m.username ORDER BY a.createdDate) as runningTotal,
...many more columns...
FROM allocations a
JOIN member m ON m.id = a.memberId
JOIN ...many joins...
Unfortunately this only produces the correct plan if filtering by the member's username is part of the query.
That's probably because there's an index on m.username. Query tuning takes some trial and error.
When using window functions there is the concept of a 'POC' index (Partitioning, Ordering, Covering) to take into consideration - just search on Google (Itzik Ben-Gan has good references about this as well).
From the book 'High-Performance T-SQL Using Window Functions':
Absent a POC index, the plan includes a Sort iterator, and with large input sets, it can be quite expensive. Sorting has N * LOG(N) complexity, which is worse than linear. This means that with more rows, you pay more per row. For example 1000 * LOG(1000) = 3000 and 10000 * LOG(10000) = 40000. This means that 10 times more rows results in 13 times more work, and it gets worse the further you go.
Here's a reference link to get started on window functions and indexes.
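As a hedged illustration of the POC idea (Partitioning, Ordering, Covering): the window function wants an index keyed first on the partitioning column, then on the ordering column, and covering the aggregated column. The view above partitions by m.username, which lives in another table, so one index cannot supply all three roles; the sketch below assumes partitioning by a.memberId instead (equivalent if usernames map one-to-one to member ids), so that everything sits on the allocations table.
CREATE NONCLUSTERED INDEX IX_allocations_memberId_createdDate
ON allocations (memberId, createdDate)  -- P(artitioning), O(rdering)
INCLUDE (value);                        -- C(overing), so the running SUM needs no lookups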
Related
I've created a scraper that collects huge amounts of data into a Postgres database. One of the tables has more than 120 million records and is still growing.
This creates obvious problems with even simple selects, but when I run aggregate functions like COUNT(), it takes ages to get a result. I want to display this data through a web service, but it is definitely too slow to do that directly. I thought about materialized views, but even there, if I run a more advanced query (one with subqueries to show a trend) it throws an out-of-memory error, and if the query is simple it takes about an hour to complete. I am asking about general rules (I haven't managed to find any) for dealing with such huge databases.
The example queries which I use:
The simple query takes about an hour to complete (the Items table has 120 million records; ItemTypes has about 30k rows and holds the names and all other information for the Items):
SELECT
IT."name",
COUNT("Items".id) AS item_count,
(CAST(COUNT("Items".id) AS DECIMAL(10,1))/(SELECT COUNT(id) FROM "Items"))*100 as percentage_of_all
FROM "Items" JOIN "ItemTypes" IT on "Items"."itemTypeId" = IT.id
GROUP BY IT."name"
ORDER BY item_count DESC;
When I run the above query with a subquery that turns COUNT("Items".id) AS item_count into a % trend (the count from a week ago compared to the count now), it throws an error that memory was exceeded.
As I wrote above, I am looking for tips on how to optimize it. The first thing I plan to do to optimize the above query is to move the names from ItemTypes into Items, so that joining ItemTypes is no longer required, but I already tried to mock that up and the results aren't much better.
You don't need a subquery, so an equivalent version is:
SELECT IT."name",
COUNT(*) AS item_count,
COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () as percentage_of_all
FROM "Items" JOIN
"ItemTypes" IT
ON "Items"."itemTypeId" = IT.id
GROUP BY IT."name"
ORDER BY item_count DESC;
I'm not sure if this will fix your resource problem. In addition, this assumes that all items have a valid ItemType. If that is not the case, use a LEFT JOIN instead of JOIN.
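Since the question mentions materialized views: as a minimal sketch (assuming the result only needs to be minutes or hours fresh), the whole aggregate can be precomputed once, and the web service can read the tiny precomputed table instead of scanning 120 million rows. The view name and refresh approach below are assumptions on my part.
CREATE MATERIALIZED VIEW item_type_stats AS
SELECT IT."name",
       COUNT(*) AS item_count,
       COUNT(*) * 100.0 / SUM(COUNT(*)) OVER () AS percentage_of_all
FROM "Items"
JOIN "ItemTypes" IT ON "Items"."itemTypeId" = IT.id
GROUP BY IT."name";

-- a unique index allows REFRESH ... CONCURRENTLY, so readers are not blocked during the refresh
CREATE UNIQUE INDEX ON item_type_stats ("name");
REFRESH MATERIALIZED VIEW CONCURRENTLY item_type_stats;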
I've got this query here which uses dense_rank to number groups in order to select only the first group. It works, but it's slow, and tempdb (SQL Server) becomes so big that the disk fills up. Is it normal for dense_rank to be such a heavy operation? And how else should this be done, then, without resorting to writing code?
select
a,b,c,d
from
(select a,b,c,d,
dense_rank() over (order by s.[time] desc) as gn
from [Order] o
JOIN Scan s ON s.OrderId = o.OrderId
JOIN PriceDetail p ON p.ScanId = s.ScanId) as p
where p.OrderNumber = @OrderNumber
and p.Number = @Number
and p.Time > getdate() - 20
and p.gn = 1
group by a,b,c,d,p.gn
Any operation that has to sort a large data set may fill tempdb; dense_rank is no exception, just like rank, row_number, ntile and so on.
You are asking for what appears to be a global, complete sort of every scan entry since the database started. The way you expressed the query, the join must occur before the sort, so the sort will be both big and wide. After all is said and done, having consumed a lot of IO, CPU and tempdb space, you restrict the result to a small subset for only a specified order and some conditions (which mention columns not present in the projection, so this must be a made-up example rather than the real code).
You have a filter WHERE gn = 1 followed by a GROUP BY on gn. This is unnecessary: the predicate already fixes gn to a single value, so it cannot contribute anything to the GROUP BY.
You compute the dense_rank over every order scan and then you filter by p.OrderNumber = @OrderNumber AND p.gn = 1. This makes even less sense. This query will only return results if the @OrderNumber happens to contain the scan with rank 1 over all orders! It cannot possibly be correct.
Your query makes no sense. The fact that it is slow is just a bonus. Post your actual requirements.
If you want to learn about performance investigation, read How to analyse SQL Server performance.
PS. As a rule, computing ranks and selecting =1 can always be expressed as a TOP(1) correlated subquery, with usually much better results. Indexes help, obviously.
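To illustrate that point, here is a hedged sketch of the TOP(1) pattern applied to the tables in the question: fetch only the newest scan per order with CROSS APPLY instead of ranking every scan in the system. Which table actually holds OrderNumber, and what the real filters should be, are assumptions on my part.
SELECT o.OrderId, latest.ScanId, latest.[time]
FROM [Order] o
CROSS APPLY (SELECT TOP (1) s.ScanId, s.[time]
             FROM Scan s
             WHERE s.OrderId = o.OrderId
             ORDER BY s.[time] DESC) AS latest   -- only the newest scan per order
WHERE o.OrderNumber = @OrderNumber;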
PPS. Use of GROUP BY without any aggregate function is yet another serious code smell.
I have a performance issue with the following (example) select statement that returns the first row using a sub query:
SELECT ITEM_NUMBER,
PROJECT_NUMBER,
NVL((SELECT DISTINCT
FIRST_VALUE(L.LOCATION) OVER (ORDER BY L.SORT1, L.SORT2 DESC) LOCATION
FROM LOCATIONS L
WHERE L.ITEM_NUMBER=P.ITEM_NUMBER
AND L.PROJECT_NUMBER=P.PROJECT_NUMBER
),
P.PROJECT_NUMBER) LOCATION
FROM PROJECT P
The DISTINCT is causing the performance issue by performing a SORT UNIQUE, but I can't figure out an alternative.
I would however prefer something akin to the following, but referencing the outer query from two SELECT levels deep doesn't work:
SELECT ITEM_NUMBER,
PROJECT_NUMBER,
NVL((SELECT LOCATION
FROM (SELECT L.LOCATION LOCATION,
ROWNUM RN
FROM LOCATIONS L
WHERE L.ITEM_NUMBER=P.ITEM_NUMBER
AND L.PROJECT_NUMBER=P.PROJECT_NUMBER
ORDER BY L.SORT1, L.SORT2 DESC
) R
WHERE RN <=1
), P.PROJECT_NUMBER) LOCATION
FROM PROJECT P
Additionally:
- My permissions do not allow me to create a function.
- I am cycling through 10k to 100k records in the main query.
- The sub query could return 3 to 7 rows before limiting to 1 row.
Any assistance in improving the performance is appreciated.
It's difficult to understand without sample data and cardinalities, but does this get you what you want? A unique list of projects and items, with the first occurrence of a location?
SELECT
P.ITEM_NUMBER,
P.PROJECT_NUMBER,
MIN(L.LOCATION) KEEP (DENSE_RANK FIRST ORDER BY L.SORT1, L.SORT2 DESC) LOCATION
FROM
LOCATIONS L
INNER JOIN
PROJECT P
ON L.ITEM_NUMBER=P.ITEM_NUMBER
AND L.PROJECT_NUMBER=P.PROJECT_NUMBER
GROUP BY
P.ITEM_NUMBER,
P.PROJECT_NUMBER
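If projects without any LOCATIONS rows must still appear (the NVL fallback in the original suggests they should), a hedged variant of the same idea keeps them with an outer join; this sketch assumes (ITEM_NUMBER, PROJECT_NUMBER) is unique in PROJECT.
SELECT P.ITEM_NUMBER,
       P.PROJECT_NUMBER,
       NVL(MIN(L.LOCATION) KEEP (DENSE_RANK FIRST ORDER BY L.SORT1, L.SORT2 DESC),
           P.PROJECT_NUMBER) LOCATION
FROM PROJECT P
LEFT JOIN LOCATIONS L
  ON L.ITEM_NUMBER = P.ITEM_NUMBER
 AND L.PROJECT_NUMBER = P.PROJECT_NUMBER
GROUP BY P.ITEM_NUMBER, P.PROJECT_NUMBER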
I encountered a similar problem in the past -- and while this is not the ultimate solution (in fact it might just be cutting corners) -- the Oracle query optimizer can be adjusted with the OPTIMIZER_MODE init parameter.
Have a look at chapter 11.2.1 on http://docs.oracle.com/cd/B28359_01/server.111/b28274/optimops.htm#i38318
FIRST_ROWS
The optimizer uses a mix of cost and heuristics to find a best plan for fast delivery of the first few rows. Note: Using heuristics sometimes leads the query optimizer to generate a plan with a cost that is significantly larger than the cost of a plan without applying the heuristic. FIRST_ROWS is available for backward compatibility and plan stability; use FIRST_ROWS_n instead.
Of course there are tons of other factors you should analyse, like your indexes, join efficiency, query plan, etc.
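For example (a sketch only; the value 10 is arbitrary), the mode can be set for the session, or the equivalent hint can be applied to a single statement:
ALTER SESSION SET optimizer_mode = FIRST_ROWS_10;

-- or per statement, via a hint:
SELECT /*+ FIRST_ROWS(10) */ P.ITEM_NUMBER, P.PROJECT_NUMBER
FROM PROJECT P;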
I often find myself running a query to get the number of people who meet certain criteria, the total number of people in that population, and then finding the percentage that meets those criteria. I've been doing it the same way for a while, and I was wondering how SO would solve the same type of problem. Below is how I wrote the query:
select m.state_cd
,m.injurylevel
,COUNT(distinct m.patid) as pplOnRx
,x.totalPatientsPerState
,round((COUNT(distinct m.patid) /cast(x.totalPatientsPerState as float))*100,2) as percentPrescribedNarcotics
from members as m
inner join rx on rx.patid=m.PATID
inner join DrugTable as dt on dt.drugClass=rx.drugClass
inner join
(
select m2.state_cd, m2.injurylevel, COUNT(distinct m2.patid) as totalPatientsPerState
from members as m2
inner join rx on rx.patid=m2.PATID
group by m2.STATE_CD,m2.injuryLevel
) x on x.state_cd=m.state_cd and m.injuryLevel=x.injurylevel
where drugText like '%narcotics%'
group by m.state_cd,m.injurylevel,x.totalPatientsPerState
order by m.STATE_CD,m.injuryLevel
In this example not everyone who appears in the members table is in the rx table. The derived table makes sure that everyone who's in rx is also in members, without the drugText like '%narcotics%' condition. From what little I've played with it, it seems the OVER (PARTITION BY ...) clause might work here. I have no idea if it does; it just seems that way to me. How would someone else go about tackling this problem?
This is exactly what MDX and SSAS are designed to do. If you insist on doing it in SQL (nothing wrong with that), are you asking for a way to do it with better performance? In that case it would depend on how the tables are indexed, on tempdb speed, and, if the tables are partitioned, on that too.
Also, the distinct count is going to be one of the larger performance hits. The like '%narcotics%' in the predicate is going to force a full table scan and should be avoided at all costs (could this be an integer key in the data model?).
To answer your question, I'm not really sure windowing (OVER (PARTITION BY ...)) is going to perform any better. I would test it and see, but there is nothing "wrong" with the query.
You could rewrite the COUNT(DISTINCT)s as virtual tables or temp tables with GROUP BYs, or a combination of the two.
To illustrate, this is a stub for windowing that you could grow into the same query:
select a.state_cd,a.injurylevel,a.totalpatid, count(*) over (partition by a.state_cd, a.injurylevel)
from
(select state_cd,injurylevel,count(*) as totalpatid, count(distinct patid) as patid
from
#members
group by state_cd,injurylevel
) a
See what I mean about it not really being that helpful? Then again, sometimes rewriting a query slightly can improve performance by selecting a better execution plan, but rather than taking stabs in the dark, I'd first find the bottlenecks in the query you already have, since you took the time to write it.
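For what it's worth, one way to drop the derived table entirely is conditional aggregation rather than a window function: count every distinct patient once, and count the narcotics patients inside a CASE. This is only a sketch; it assumes drugText lives on DrugTable and that DrugTable can be LEFT JOINed without changing which patients belong in the per-state totals.
SELECT m.state_cd,
       m.injurylevel,
       COUNT(DISTINCT CASE WHEN dt.drugText LIKE '%narcotics%' THEN m.patid END) AS pplOnRx,
       COUNT(DISTINCT m.patid) AS totalPatientsPerState,
       ROUND(COUNT(DISTINCT CASE WHEN dt.drugText LIKE '%narcotics%' THEN m.patid END) * 100.0
             / COUNT(DISTINCT m.patid), 2) AS percentPrescribedNarcotics
FROM members AS m
INNER JOIN rx ON rx.patid = m.PATID
LEFT JOIN DrugTable AS dt ON dt.drugClass = rx.drugClass
GROUP BY m.state_cd, m.injurylevel
ORDER BY m.state_cd, m.injurylevel;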
We generate a lot of SQL procedurally and SQL Server is killing us. Because of some issues documented elsewhere we basically do SELECT TOP 2 ** 32 instead of TOP 100 PERCENT.
Note: we must use the subqueries.
Here's our query:
SELECT * FROM (
SELECT [me].*, ROW_NUMBER() OVER( ORDER BY (SELECT(1)) )
AS rno__row__index FROM (
SELECT [me].[id], [me].[status] FROM (
SELECT TOP 4294967296 [me].[id], [me].[status] FROM
[PurchaseOrders] [me]
LEFT JOIN [POLineItems] [line_items]
ON [line_items].[id] = [me].[id]
WHERE ( [line_items].[part_id] = ? )
ORDER BY [me].[id] ASC
) [me]
) [me]
) rno_subq
WHERE rno__row__index BETWEEN 1 AND 25
Are there better ways to do this that anyone can see?
UPDATE: here is some clarification on the whole subquery issue:
The key word of my question is "procedurally". I need the ability to reliably encapsulate result sets so that they can be stacked together like building blocks. For example, I want to get the first 10 CDs ordered by the name of the artist who produced them, and also get the related artist for each CD. What I do is assemble a monolithic subselect representing the CDs ordered by the joined artist names, then apply a limit to it, and then join the nested subselects to the artist table and only then execute the resulting query. The isolation is necessary because the code that requests the ordered CDs is unrelated and oblivious to the code selecting the top 10 CDs, which in turn is unrelated and oblivious to the code that requests the related artists.
Now you may say that I could move the inner ORDER BY into the OVER() clause, but then I break the encapsulation, as I would have to SELECT the columns of the joined table, so I can order by them later. An additional problem would be the merging of two tables under one alias; if I have identically named columns in both tables, the select me.* would stop right there with an ambiguous column name error.
I am willing to sacrifice a bit of the optimizer performance, but the 2**32 seems like too much of a hack to me. So I am looking for middle ground.
If you want top rows by me.id, just ask for that in the ROW_NUMBER's ORDER BY. Don't chase your tail around subqueries and TOP.
If you have a WHERE clause on a field from the outer-joined table, you don't actually need an outer JOIN: all the unmatched rows' fields will be NULL and get filtered out by the WHERE, so it is effectively an inner join.
WITH cteRowNumbered AS (
SELECT [me].id, [me].status,
ROW_NUMBER() OVER (ORDER BY me.id ASC) AS rno__row__index
FROM [PurchaseOrders] [me]
JOIN [POLineItems] [line_items] ON [line_items].[id] = [me].[id]
WHERE [line_items].[part_id] = ?)
SELECT me.id, me.status
FROM cteRowNumbered
WHERE rno__row__index BETWEEN 1 and 25
I use CTEs instead of subqueries just because I find them more readable.
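As an aside, on SQL Server 2012 or later the ROW_NUMBER scaffolding can be dropped entirely in favour of OFFSET/FETCH; a hedged sketch of the same first page of 25 rows:
SELECT [me].[id], [me].[status]
FROM [PurchaseOrders] [me]
JOIN [POLineItems] [line_items] ON [line_items].[id] = [me].[id]
WHERE [line_items].[part_id] = ?
ORDER BY [me].[id] ASC
OFFSET 0 ROWS FETCH NEXT 25 ROWS ONLY;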
Use:
SELECT x.*
FROM (SELECT po.id,
po.status,
ROW_NUMBER() OVER( ORDER BY po.id) AS rno__row__index
FROM [PurchaseOrders] po
JOIN [POLineItems] li ON li.id = po.id
WHERE li.part_id = ?) x
WHERE x.rno__row__index BETWEEN 1 AND 25
ORDER BY x.id ASC
Unless you've omitted details in order to simplify the example, there's no need for all your subqueries in what you provided.
Kudos to the only person who saw through the naysaying and actually tried the query on a large table we do not have access to. To all the rest saying this simply will not work (it will return random rows): we know what the manual says, and we know it is a hack - this is why we asked the question in the first place. However, outright dismissing a query without even trying it is rather shallow. Can someone provide us with a real example (with preceding CREATE/INSERT statements) demonstrating the above query malfunctioning?
Your update makes things much clearer. I think that the approach you're using is seriously flawed. While it's nice to be able to have encapsulated, reusable code in your applications, front-end applications are a much different animal than a database. They typically deal with small structures and small, discrete processes that run against those structures. Databases, on the other hand, often deal with tables that are measured in the millions of rows and sometimes more than that. Using the same methodologies will often result in code that simply performs so badly as to be unusable. Even if it works now, it's very likely that it won't scale and will cause major problems down the road.
Best of luck to you, but I don't think that this approach will end well in all but the smallest of databases.