Database query times out on Heroku - SQL

I'm stress testing an app by adding loads and loads of items and forcing it to do lots of work. Here's the query:
SELECT *, (
    SELECT price
    FROM prices
    WHERE widget_id = widgets.id
    ORDER BY id DESC
    LIMIT 1
) AS maxprice
FROM widgets
ORDER BY created_at DESC
LIMIT 20 OFFSET 0
That query selects from widgets (approx. 8,500 rows), and prices has around 777,000 entries in it.
The query is timing out on the test environment, which is using the basic Heroku shared database (193 MB in use of the 5 GB max).
What will solve that timeout issue? The prices update each hour, so every hour you get roughly 8,500 new rows.
That's a hugely excessive amount for the app (in reality it's unlikely it would ever have 8,500 widgets), but I'm wondering what's appropriate to solve this?
Is my query stupid? (i.e. is it a bad style of query to do that subselect - my SQL knowledge is terrible, one of the goals of this project is to improve it!)
Or am I just hitting a limit of a shared db, and should I expect to move onto a dedicated db (e.g. the minimum $200-per-month dedicated Postgres instance from Heroku) given the size of the prices table? Is there a deeper issue in terms of how I've designed the DB? (i.e. it's a one-to-many: one widget has many prices.) Is there a more sensible approach?
I'm totally new to the world of sql and queries etc. at scale, hence the utter ignorance expressed above. :)

Final version after comments below:
@Dave wants the latest price per widget. You could do that with subqueries and LIMIT 1 per widget, but in modern PostgreSQL, a window function does the job more elegantly. Consider first_value() / last_value():
SELECT DISTINCT   -- the window value is identical for all rows of a widget; DISTINCT collapses them
       w.*
     , first_value(p.price) OVER (PARTITION BY w.id
                                  ORDER BY p.created_at DESC) AS latest_price
FROM  (
    SELECT *
    FROM   widgets
    ORDER  BY created_at DESC
    LIMIT  20
    ) w
JOIN   prices p ON p.widget_id = w.id;
Original post for the maximum price per widget:
SELECT w.*
     , max(p.price) AS max_price
FROM  (
    SELECT *
    FROM   widgets
    ORDER  BY created_at DESC
    LIMIT  20
    ) w
JOIN   prices p ON p.widget_id = w.id
GROUP  BY w.col1, w.col2  -- spell out all columns of w.*
Fix the table aliases.
Retrieve all columns of widgets, like the question demonstrates.
In PostgreSQL 8.3 you must spell out all non-aggregated columns of the SELECT list in the GROUP BY clause. In PostgreSQL 9.1 or later, the primary key column would cover the whole table. I quote the manual here:
Allow non-GROUP BY columns in the query target list when the primary
key is specified in the GROUP BY clause
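A minimal sketch of what that buys you in 9.1+ (assuming widgets.id is the primary key; in 8.3 you would still have to list every column of w.*):
SELECT w.*, max(p.price) AS max_price
FROM   widgets w
JOIN   prices  p ON p.widget_id = w.id
GROUP  BY w.id;  -- the primary key covers all columns of widgets in 9.1+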
I advise never using mixed-case identifiers like maxWidgetPrice. Unquoted identifiers are folded to lower case by default in PostgreSQL. Do yourself a favor and use lower-case identifiers exclusively.
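A quick illustration of the folding (standard PostgreSQL behavior):
SELECT max(price) AS maxWidgetPrice FROM prices;   -- the alias comes back as maxwidgetprice
SELECT max(price) AS "maxWidgetPrice" FROM prices; -- you'd have to double-quote it everywhere to keep the case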
Always use explicit JOIN conditions where possible. It's the canonical SQL way and it's more readable.
OFFSET 0 is just noise
Indexes:
However, the key to performance is having the right indexes. I would go with two indexes like these:
CREATE INDEX widgets_created_at_idx ON widgets (created_at DESC);
CREATE INDEX prices_widget_id_idx ON prices(widget_id, price DESC);
The second one is a multicolumn index that should provide the best performance for retrieving the maximum price after you have determined the top 20 widgets using the first index. Not sure if PostgreSQL 8.3 (the default on the Heroku shared db) is already smart enough to make the most of it. PostgreSQL 9.1 certainly is.
For the latest price (see comments), use this index instead:
CREATE INDEX prices_widget_id_idx ON prices(widget_id, created_at DESC);
You don't have to (and shouldn't) just trust me. Test performance and query plans with EXPLAIN ANALYZE with and without indexes and see for yourself. Index creation should be very fast, even for a million rows.
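For instance, to see how the top-20-widgets step behaves with and without widgets_created_at_idx:
EXPLAIN ANALYZE
SELECT *
FROM   widgets
ORDER  BY created_at DESC
LIMIT  20;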
If you are considering a switch to a standalone PostgreSQL database on Heroku, you may be interested in this recent Heroku blog post:
The default is now PostgreSQL 9.1.
You can also cancel long-running queries there now.

I'm not quite clear on what you are asking, but here is my understanding:
Find the widgets you want to price. In this case it looks like you are looking for the most recent 20 widgets:
SELECT w.id
FROM widgets w
ORDER BY created_at DESC
LIMIT 20 OFFSET 0
For each of the 20 widgets you found, it seems you want to find the highest associated price from the prices table:
SELECT s.id, MAX(p.price) AS maxWidgetPrice
FROM (SELECT w.id
      FROM widgets w
      ORDER BY created_at DESC
      LIMIT 20 OFFSET 0
     ) s -- widget subset
   , prices p
WHERE s.id = p.widget_id
GROUP BY s.id
prices.widget_id needs to be indexed for this to be effective. You don't want to process the entire prices table each time if it is relatively large, just the subset of rows you need.
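Something along these lines, if it doesn't exist yet (the index name is just illustrative):
CREATE INDEX idx_prices_widget_id ON prices (widget_id);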
EDIT: added "group by" (and no, this was not tested)

Related

How to get value frequencies for large data

I have a table with millions of rows and 940 columns. I'm really hoping there is a way to summarize this data. I want to see frequencies for each value for EVERY column. I used this code with a few of the columns, but I won't be able to get many more columns in before the processing is too large.
SELECT
    f19_24
    ,f25_34
    ,f35_44
    ,f45_49
    ,f50_54
    ,f55_59
    ,f60_64
    ,count(1) AS Frequency
FROM
    (SELECT a.account, ntile(3) over (order by sum(a.seconds) desc) as ntile
        ,f19_24
        ,f25_34
        ,f35_44
        ,f45_49
        ,f50_54
        ,f55_59
        ,f60_64
     FROM demo as c
     JOIN aggregates a on c.customer_account = a.account
     WHERE a.month IN ('201804', '201805', '201806')
     GROUP BY a.account
        ,f19_24
        ,f25_34
        ,f35_44
        ,f45_49
        ,f50_54
        ,f55_59
        ,f60_64
    ) sub  -- the derived table needs an alias
WHERE ntile = 1
GROUP BY
    f19_24
    ,f25_34
    ,f35_44
    ,f45_49
    ,f50_54
    ,f55_59
    ,f60_64
The problem is that the GROUP BY will be far too cumbersome. Is there any other way??? It would be really helpful to be able to see where the high frequencies are in such a large dataset.
Using an index can help you get much faster results for this kind of query. The best thing to do would depend on what other fields the table has and what other queries run against it. Without more details, I'd suggest a non-clustered index on (month, account) that includes f19_24, f25_34, f35_44, f45_49, f50_54, f55_59, f60_64, on aggregates or demo or customer (I don't know which table includes these fields), for example this index:
CREATE NONCLUSTERED INDEX IX_fasterquery
ON aggregates (month, account)
INCLUDE (f19_24, f25_34, f35_44, f45_49, f50_54, f55_59, f60_64);
That's because with that index in place, SQL Server does not need to touch the actual table at all when running the query: it can find all rows with a given (month, account) in the index, and it can do that really fast because the index key is built precisely on those fields. The index also carries the f19_24, f25_34, f35_44, f45_49, f50_54, f55_59, f60_64 values for each row. In your case, turning this query into a stored procedure may also get you a better result, and the reason why I suggest this is here.

dense_rank filling up tempdb on SQL server?

I've got this query here which uses dense_rank to number groups in order to select the first group only. It is working, but it's slow, and tempdb (SQL Server) becomes so big that the disk fills up. Is it normal for dense_rank to be such a heavy operation? And how else should this be done, without resorting to coding?
select
a,b,c,d
from
(select a,b,c,d,
dense_rank() over (order by s.[time] desc) as gn
from [Order] o
JOIN Scan s ON s.OrderId = o.OrderId
JOIN PriceDetail p ON p.ScanId = s.ScanId) as p
where p.OrderNumber = #OrderNumber
and p.Number = #Number
and p.Time > getdate() - 20
and p.gn = 1
group by a,b,c,d,p.gn
Any operation that has to sort a large dataset may fill tempdb. dense_rank is no exception, just like rank, row_number, ntile etc etc.
You are asking for what appears to be a global, complete sort of every scan entry since the database started. The way you expressed the query, the join must occur before the sort, so the sort will be both big and wide. After all is said and done, consuming a lot of IO, CPU and tempdb space, you restrict the result to a small subset for only a specified order and some conditions (which mention columns not present in the projection, so this must be a made-up example, not the real code).
You have a filter on WHERE gn = 1 followed by a GROUP BY gn. This is unnecessary: gn is already pinned to a single value by the predicate, so it cannot contribute to the GROUP BY.
You compute the dense_rank over every order scan and then you filter by p.OrderNumber = #OrderNumber AND p.gn = 1. This makes even less sense. This query will only return results if the #OrderNumber happens to contain the scan with rank 1 over all orders! It cannot possibly be correct.
Your query makes no sense. The fact that it is slow is just a bonus. Post your actual requirements.
If you want to learn about performance investigation, read How to analyse SQL Server performance.
PS. As a rule, computing ranks and selecting =1 can always be expressed as a TOP(1) correlated subquery, with usually much better results. Indexes help, obviously.
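A hedged sketch of that shape, reusing the (made-up) names from the query above and assuming what you actually want is the latest scan for the given order; the placeholders are kept from the question:
select a, b, c, d
from [Order] o
join Scan s on s.OrderId = o.OrderId
join PriceDetail p on p.ScanId = s.ScanId
where o.OrderNumber = #OrderNumber
  and s.[time] = (select top (1) s2.[time]   -- TOP(1) correlated subquery instead of a global dense_rank
                  from Scan s2
                  where s2.OrderId = o.OrderId
                  order by s2.[time] desc)
The remaining filters (Number, the 20-day window) would carry over unchanged.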
PPS. Use of GROUP BY without any aggregate function is yet another serious code smell.

Only return rows that match all criteria

Here is a rough schema:
create table images (
    image_id     serial primary key,
    user_id      int references users(user_id),
    date_created timestamp with time zone
);
create table images_tags (
    images_tag_id serial primary key,
    image_id      int references images(image_id),
    tag_id        int references tags(tag_id)
);
The output should look like this:
{"images":[
{"image_id":1, "tag_ids":[1, 2, 3]},
....
]}
The user is allowed to filter images based on user ID, tags, and offset image_id. For instance, someone can say "user_id":1, "tags":[1, 2], "offset_image_id":500, which will give them all images that are from user_id 1, have both tags 1 AND 2, and an image_id of 500 or less.
The tricky part is the "have both tags 1 AND 2". It is more straight-forward (and faster) to return all images that have either 1, 2, or both. I don't see any way around this other than aggregating, but it is much slower.
Any help doing this quickly?
Here is the current query I am using which is pretty slow:
select * from (
select i.*,u.handle,array_agg(t.tag_id) as tag_ids, array_agg(tag.name) as tag_names from (
select i.image_id, i.user_id, i.description, i.url, i.date_created from images i
where (?=-1 or i.user_id=?)
and (?=-1 or i.image_id <= ?)
and exists(
select 1 from image_tags t
where t.image_id=i.image_id
and (?=-1 or user_id=?)
and (?=-1 or t.tag_id in (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?))
)
order by i.image_id desc
) i
left join image_tags t on t.image_id=i.image_id
left join tag using (tag_id) --not totally necessary
left join users u on i.user_id=u.user_id --not totally necessary
group by i.image_id,i.user_id,i.description,i.url,i.date_created,u.handle) sub
where (?=-1 or sub.tag_ids #> ?)
limit 100;
When the execution plan of this statement is determined, at prepare time, the PostgreSQL planner doesn't know which of these ?=-1 conditions will be true.
So it has to produce a plan that can maybe filter on a specific user_id, or maybe not, maybe filter on a range of image_id, or maybe not, and maybe filter on a specific set of tag_id, or maybe not. It's likely to be a dumb, unoptimized plan that can't take advantage of indexes.
While your current strategy of a big generic query that covers all cases is OK for correctness, for performance you might need to abandon it in favor of generating the minimal query for the parametrized conditions that are actually filled in.
In such a generated query, the ?=-1 or ... will disappear, only the joins that are actually needed will be present, and the dubious t.tag_id in (?,?,?,?,?,?,?,?,?,?,?,?,?,?,?,?) will go or be reduced to what's strictly necessary.
If it's still slow given certain sets of parameters, then you'll have a much easier starting point to optimize on.
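As a hedged sketch (parameter markers kept as ?): if only user_id, an image_id cutoff and two tags were actually supplied, the generated query might collapse to something like this:
select i.image_id, i.user_id, i.description, i.url, i.date_created
from images i
where i.user_id = ?
  and i.image_id <= ?
  and exists (select 1
              from image_tags t
              where t.image_id = i.image_id
                and t.tag_id in (?, ?))  -- only as many markers as tags actually supplied
order by i.image_id desc
limit 100;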
As for the gist of the question, testing the exact match on all tags, you might want to try the idiomatic form in an inner subquery:
SELECT image_id FROM image_tags
WHERE tag_id in (?,?,...)
GROUP BY image_id HAVING count(*)=?
where the last ? is the number of tags passed as parameters.
(and completely remove sub.tag_ids #> ? as an outer condition).
Among other things, your GROUP BY clause is likely wider than any of your indexes (and/or includes columns in unlikely combinations). I'd probably re-write your query as follows (turning @Daniel's subquery for the tags into a CTE):
WITH Tagged_Images AS (SELECT Image_Tags.image_id,
                              ARRAY_AGG(Tag.tag_id) AS tag_ids,
                              ARRAY_AGG(Tag.name) AS tag_names
                       FROM Image_Tags
                       JOIN Tag
                         ON Tag.tag_id = Image_Tags.tag_id
                       WHERE Image_Tags.tag_id IN (?, ?)
                       GROUP BY Image_Tags.image_id
                       HAVING COUNT(*) = ?)
SELECT Images.image_id, Images.user_id,
       Images.description, Images.url, Images.date_created,
       Tagged_Images.tag_ids, Tagged_Images.tag_names,
       Users.handle
FROM Images
JOIN Tagged_Images
  ON Tagged_Images.image_id = Images.image_id
LEFT JOIN Users
  ON Users.user_id = Images.user_id
WHERE Images.user_id = ?
  AND Images.date_created < ?
ORDER BY Images.date_created, Images.image_id
LIMIT 100
(Untested - no dataset provided. Note that I'm assuming you're building the criteria dynamically, to avoid condition flags.)
Here's some other stuff:
Note that Tagged_Images will have at minimum the indicated tags, but might have more. If you want images with only those tags (exactly 2, no more, no less), an additional level needs to be added to the CTE (see the sketch after these points).
There's a number of examples floating around of stored procs that turn comma-separated lists into virtual tables (heck, I've done it with recursive CTEs), which you could use for the IN() clause. It doesn't matter that much here, though, due to needing dynamic SQL anyways...
Assuming that Images.image_id is auto-generated, doing ranges searches or ordering by it is largely pointless. There are relatively few cases where humans care about the value held here. Except in cases where you're searching for one specific row (for updating/deleting/whatever), conceptual data sets don't really care either; the value of itself is largely meaningless. What does image_id < 500 actually tell me? Nothing - just that a given number was assigned to it. Are you using it to restrict based on "early" versus "late" images? Then use the proper data for that, which would be date_created. For pagination? Well, you have to do that after all the other conditions, or you get weird page lengths (like 0 in some cases). Generated keys should be relied on for one property only: uniqueness. This is the reason I stuck it at the end of the ORDER BY - to ensure a consistent ordering. Assuming that date_created has a high enough resolution as a timestamp, even this is unnecessary.
I'm fairly certain your LEFT JOIN to Users should probably be a regular (INNER) JOIN, but you didn't provide enough information for me to be sure.
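For that "only these tags" case, a hedged sketch of the extra level (portable SQL, assuming (image_id, tag_id) pairs are unique; the last two ? are again the number of requested tags):
SELECT it.image_id
FROM   Image_Tags it
GROUP  BY it.image_id
HAVING COUNT(CASE WHEN it.tag_id IN (?, ?) THEN 1 END) = ?  -- all requested tags are present
   AND COUNT(*) = ?                                         -- ...and the image has no other tags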
Aggregation is not likely to be the thing slowing you down. A query such as:
select images.image_id
from images
join images_tags on (images.image_id=images_tags.image_id)
where images_tags.tag_id in (1,2)
group by images.image_id
having count(*) = 2
will get you all of the images that have tags 1 and 2, and it will run quickly if you have indexes on both images_tags columns:
create index on images_tags(tag_id);
create index on images_tags(image_id);
The slowest part of the query is likely to be the IN part of the WHERE clause. You can speed that up if you are prepared to create a temporary table with the target tags in it:
create temp table target_tags(tag_id int primary key);
insert into target_tags values (1);
insert into target_tags values (2);
select images.image_id
from images
join images_tags on (images.image_id=images_tags.image_id)
join target_tags on images_tags.tag_id=target_tags.tag_id
group by images.image_id
having count(*) = (select count(*) from target_tags)

Oracle performance issue in getting first row in sub query

I have a performance issue on the following (example) select statement that returns the first row using a sub query:
SELECT ITEM_NUMBER,
PROJECT_NUMBER,
NVL((SELECT DISTINCT
FIRST_VALUE(L.LOCATION) OVER (ORDER BY L.SORT1, L.SORT2 DESC) LOCATION
FROM LOCATIONS L
WHERE L.ITEM_NUMBER=P.ITEM_NUMBER
AND L.PROJECT_NUMBER=P.PROJECT_NUMBER
),
P.PROJECT_NUMBER) LOCATION
FROM PROJECT P
The DISTINCT is causing the performance issue by performing a SORT and UNIQUE but I can't figure out an alternative.
I would however prefer something akin to the following but referencing within 2 select statements doesn't work:
SELECT ITEM_NUMBER,
PROJECT_NUMBER,
NVL((SELECT LOCATION
FROM (SELECT L.LOCATION LOCATION,
             ROWNUM RN
FROM LOCATIONS L
WHERE L.ITEM_NUMBER=P.ITEM_NUMBER
AND L.PROJECT_NUMBER=P.PROJECT_NUMBER
ORDER BY L.SORT1, L.SORT2 DESC
) R
WHERE RN <=1
), P.PROJECT_NUMBER) LOCATION
FROM PROJECT P
Additionally:
- My permissions do not allow me to create a function.
- I am cycling through 10k to 100k records in the main query.
- The sub query could return 3 to 7 rows before limiting to 1 row.
Any assistance in improving the performance is appreciated.
It's difficult to understand without sample data and cardinalities, but does this get you what you want? A unique list of projects and items, with the first occurrence of a location?
SELECT
P.ITEM_NUMBER,
P.PROJECT_NUMBER,
MIN(L.LOCATION) KEEP (DENSE_RANK FIRST ORDER BY L.SORT1, L.SORT2 DESC) LOCATION
FROM
LOCATIONS L
INNER JOIN
PROJECT P
ON L.ITEM_NUMBER=P.ITEM_NUMBER
AND L.PROJECT_NUMBER=P.PROJECT_NUMBER
GROUP BY
P.ITEM_NUMBER,
P.PROJECT_NUMBER
I encountered a similar problem in the past -- and while this is not the ultimate solution (in fact it might just be cutting corners) -- the Oracle query optimizer can be adjusted with the OPTIMIZER_MODE init param.
Have a look at chapter 11.2.1 on http://docs.oracle.com/cd/B28359_01/server.111/b28274/optimops.htm#i38318
FIRST_ROWS
The optimizer uses a mix of cost and heuristics to find a best plan
for fast delivery of the first few rows. Note: Using heuristics
sometimes leads the query optimizer to generate a plan with a cost
that is significantly larger than the cost of a plan without applying
the heuristic. FIRST_ROWS is available for backward compatibility and
plan stability; use FIRST_ROWS_n instead.
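For example, at session level (assuming you have the privilege to alter your session; FIRST_ROWS_n accepts n = 1, 10, 100 or 1000):
-- ask the optimizer to favor plans that return the first rows quickly
ALTER SESSION SET optimizer_mode = FIRST_ROWS_10;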
Of course there are tons of other factors you should analyse, like your indexes, join efficiency, the query plan, etc.

Comparison Group by VS Over Partition By

Assuming one table CAR with two columns CAR_ID (int) and VERSION (int).
I want to retrieve the maximum version of each car.
So there are two solutions (at least) :
select car_id, max(version) as max_version
from car
group by car_id;
Or :
select car_id, max_version
from ( select car_id, version
, max(version) over (partition by car_id) as max_version
from car
) max_ver
where max_ver.version = max_ver.max_version
Are these two queries similarly performant?
I know this is extremely old but thought it should be pointed out.
select car_id, max_version
from (select car_id
, version
, max(version) over (partition by car_id) as max_version
from car ) max_ver
where max_ver.version = max_ver.max_version
Not sure why you did option two like that... in this case the sub-select should theoretically be slower, because you're selecting from the same table twice and then joining the results back to itself.
Just remove version from your inline view and they are the same thing.
select car_id, max(version) over (partition by car_id) as max_version
from car
The performance really depends on the optimizer in this situation, but yes, as the original answer suggests, inline views can help because they narrow the result set. Though this is not a good example of that, since it's the same table with no filters in the selections given.
Partitioning is also helpful when you are selecting a lot of columns but need different aggregations that fit the result set. Otherwise you are forced to group by every other column.
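For instance, a small sketch on the same CAR table: window aggregates keep every detail row while still exposing per-group (or even whole-table) figures, with no GROUP BY forcing you to collapse the other columns:
select car_id,
       version,
       max(version) over (partition by car_id) as max_version_per_car,
       avg(version) over ()                    as avg_version_overall
from car;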
Yes, it may affect performance.
The second query is an example of an inline view.
It's a very useful method for building reports with various kinds of counts or other aggregate functions.
Oracle executes the subquery and then uses the resulting rows as a view in the FROM clause.
Where performance is a concern, an inline view is generally recommended over another subquery type.
And one more thing: the second query will return every row that carries the max version (possibly several per car_id if there are ties), while the first one will give you only one row per car_id.
see here
It will depend on your indexing scheme and the amount of data in the table. The optimizer will likely make different decisions based on the data that's actually inside the table.
I have found, at least in SQL Server (I know you asked about Oracle) that the optimizer is more likely to perform a full scan with the PARTITION BY query vs the GROUP BY query. But that's only in cases where you have an index which contains CAR_ID and VERSION (DESC) in it.
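A hedged sketch of such an index (SQL Server syntax; the name is illustrative):
CREATE INDEX IX_car_carid_version ON car (car_id, version DESC);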
The moral of the story is that I would test thoroughly to choose the right one. For small tables, it doesn't matter. For really, really big data sets, neither may be fast...