I have a query that I have made into a MySQL view. This view is central to our application, so we are looking at tuning it. There is a primary key on (Map_Id, User_No, X, Y). I am pretty comfortable tuning SQL Server queries but not sure how MySQL works in this respect. Would it help to add an index that also covers Points and Update_Stamp? About 90% of the operations on this table are reads, so although it gets lots of inserts, they are far outnumbered by the reads.
Description: Get the person with the most points for each x,y coord in a given map. Tie break by who has the latest update stamp and then by user id.
SELECT GP.Map_Id AS Map_Id,GP.User_No AS User_No,GP.X AS X,GP.Y AS Y, GP.Points AS Points,GP.Update_Stamp AS Update_Stamp
FROM (Grid_Points GP LEFT JOIN Grid_Points GP2
ON (
(
(GP2.Map_Id = GP.Map_Id) AND (GP2.X = GP.X) AND (GP2.Y = GP.Y) AND
((GP2.Points > GP.Points) OR ((GP2.Points = GP.Points) AND (GP2.Update_Stamp > GP.Update_Stamp)) OR
((GP2.Points = GP.Points) AND (GP2.Update_Stamp = GP.Update_Stamp) AND (GP2.User_No < GP.User_No)))
)
)
)
WHERE ISNULL(GP2.User_No);
Wow man, you really like to use parentheses. :-)
You're right, a compound index may help. You might even be able to make it a covering index. Probably either an index on Grid_Points(Map_Id,X,Y) or else an index on Grid_Points(Points,Update_Stamp,User_No) would be what I try.
Always test query optimization with EXPLAIN to see if the optimizer is using your index. Read that documentation section until you understand the cryptic notes in the EXPLAIN report.
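The EXPLAIN report will probably show you which index it decides to use. You should be aware that MySQL generally uses only one index per table reference in a given query (the index-merge optimization is the main exception), so GP and GP2 can each use a different index.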
Here's how I would write that query, relying on order of precedence between AND and OR instead of so many nested parentheses:
SELECT GP.Map_Id, GP.User_No, GP.X, GP.Y, GP.Points, GP.Update_Stamp
FROM Grid_Points GP LEFT JOIN Grid_Points GP2
ON GP2.Map_Id = GP.Map_Id AND GP2.X = GP.X AND GP2.Y = GP.Y
AND (
GP2.Points > GP.Points
OR
GP2.Points = GP.Points
AND GP2.Update_Stamp > GP.Update_Stamp
OR
GP2.Points = GP.Points
AND GP2.Update_Stamp = GP.Update_Stamp
AND GP2.User_No < GP.User_No
)
WHERE GP2.User_No IS NULL;
You are using my favorite method for finding greatest-n-per-group in MySQL. MySQL doesn't optimize GROUP BY very well (it often incurs a temporary table which gets serialized to disk), so the left outer join solution that you're using is usually a lot better, at least for MySQL. In other brands of RDBMS, this solution may not have such an advantage.
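To see the tie-breaking in action, here is a minimal sketch that runs the same left outer join against a few made-up rows. It uses Python's built-in sqlite3 module as a stand-in for MySQL, so it only illustrates the query logic, not MySQL's index choices; the sample data and the ORDER BY are assumptions added for the demo.

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Grid_Points (
        Map_Id INTEGER, User_No INTEGER, X INTEGER, Y INTEGER,
        Points INTEGER, Update_Stamp TEXT,
        PRIMARY KEY (Map_Id, User_No, X, Y)
    );
    INSERT INTO Grid_Points VALUES
        (1, 10, 0, 0, 5, '2009-01-01'),
        (1, 11, 0, 0, 7, '2009-01-02'),  -- most points on (0,0)
        (1, 10, 0, 1, 3, '2009-01-01'),
        (1, 11, 0, 1, 3, '2009-01-03'),  -- tie on points, later stamp wins
        (1, 10, 1, 1, 4, '2009-01-01'),  -- full tie, lower User_No wins
        (1, 12, 1, 1, 4, '2009-01-01');
""")

# A row is a winner when no other row on the same cell "beats" it,
# i.e. when the outer join finds no GP2 partner.
WINNERS_SQL = """
    SELECT GP.Map_Id, GP.User_No, GP.X, GP.Y, GP.Points, GP.Update_Stamp
    FROM Grid_Points GP LEFT JOIN Grid_Points GP2
      ON GP2.Map_Id = GP.Map_Id AND GP2.X = GP.X AND GP2.Y = GP.Y
     AND (
           GP2.Points > GP.Points
        OR GP2.Points = GP.Points AND GP2.Update_Stamp > GP.Update_Stamp
        OR GP2.Points = GP.Points AND GP2.Update_Stamp = GP.Update_Stamp
           AND GP2.User_No < GP.User_No
     )
    WHERE GP2.User_No IS NULL
    ORDER BY GP.X, GP.Y
"""

winners = conn.execute(WINNERS_SQL).fetchall()
for row in winners:
    print(row)
```

Each (X, Y) cell yields exactly one row, because the three OR branches together define a strict ordering among the rows of a cell.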
I wouldn't match it to itself, I'd do it as a "group by" and then possibly match back to get who the person is.
SELECT Map_Id, X, Y, max(Points)
FROM Grid_Points
GROUP BY Map_Id, X, Y;
This would give you a table of Map_Id, X, and Y, then the maximum points.
You could then join those results back to Grid_Points to find which user has that number of points.
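A rough sketch of that join-back, again using Python's sqlite3 as a stand-in for MySQL with made-up rows. The alias Max_Points is an assumption for the demo. Note that without further tie-breaking, two users tied on the maximum both come back:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Grid_Points (
        Map_Id INTEGER, User_No INTEGER, X INTEGER, Y INTEGER, Points INTEGER
    );
    INSERT INTO Grid_Points VALUES
        (1, 10, 0, 0, 5),
        (1, 11, 0, 0, 7),
        (1, 10, 1, 1, 4),
        (1, 12, 1, 1, 4);  -- two users tied on the maximum
""")

# Derived table M holds the maximum per cell; joining it back to
# Grid_Points recovers the user(s) who hold that maximum.
JOIN_BACK_SQL = """
    SELECT GP.Map_Id, GP.X, GP.Y, GP.User_No, GP.Points
    FROM Grid_Points GP
    JOIN (
        SELECT Map_Id, X, Y, MAX(Points) AS Max_Points
        FROM Grid_Points
        GROUP BY Map_Id, X, Y
    ) M
      ON M.Map_Id = GP.Map_Id AND M.X = GP.X AND M.Y = GP.Y
     AND M.Max_Points = GP.Points
    ORDER BY GP.X, GP.Y, GP.User_No
"""

rows = conn.execute(JOIN_BACK_SQL).fetchall()
for row in rows:
    print(row)
```

Cell (1, 1) returns both user 10 and user 12, so the Update_Stamp and User_No tie-breaks from the question would still have to be applied on top of this.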
Related
I have a SQLite database A with numeric columns for start and stop that is quite large (1M rows). And I have a second list of numeric ranges B beginning and end that is medium (10K rows).
I would like to find the set of entries in A that overlap with ranges in B.
I could do this with a python script that iterates through list B and does 10K database queries, but I'm wondering if there's a more SQLish way to do it. List B could potentially be slurped into the database as an indexed TEMP TABLE if that helps the process.
Possible simplification, though not optimal, is that list A could be treated as a single location, position, allowing us to only look for A.position that fall inside B.beginning and B.end.
One trick I use to speed this up is to define a CHUNK. This can be as simple as the midpoint of the start and end, divided by a chunksize and then cast as an integer. To build off @Gordon Linoff's answer, you could use a 10k window chunk as follows:
with a_chunk as (
select a.*, cast((a.start+a.end)/(2*10000) as integer) as CHUNK
from a
),
b_chunk as (
select b.*, cast((b.start+b.end)/(2*10000) as integer) as CHUNK
from b
)
select ac.*, bc.*
from a_chunk ac join b_chunk bc
on ac.CHUNK = bc.CHUNK
and ac.start < bc.end
and ac.end > bc.start;
This divides your search space so that rather than joining every row in a against every row in b, you're only joining entries within the same 10k-width window. This should still be an O(m*n) operation but will be considerably faster due to the restricted search space and smaller m/n sizes.
However, this comes with caveats. For instance, the intervals (9995, 9999) and (9998, 10008) will get placed in different chunks despite being clearly overlapping, and your resultant query would miss that. Therefore you can get your edge cases by replacing the single select statement with
select ac.*, bc.*
from a_chunk ac join
b_chunk bc
on ac.CHUNK = bc.CHUNK - 1
and ac.start < bc.end
and ac.end > bc.start
union
select ac.*, bc.*
from a_chunk ac join
b_chunk bc
on ac.CHUNK = bc.CHUNK
and ac.start < bc.end
and ac.end > bc.start
union
select ac.*, bc.*
from a_chunk ac join
b_chunk bc
on ac.CHUNK = bc.CHUNK + 1
and ac.start < bc.end
and ac.end > bc.start;
Even this isn't perfect though. If you have intervals significantly larger than your 10k window size, you could likely still overlook some results. Increasing the window size to accommodate this would come at the cost of joining more entries at a time, which the chunks were designed to avoid. The best balance will likely be finding an appropriate window size and then covering edge cases by including enough UNIONs to include on ac.CHUNK = bc.CHUNK + {-n...n} for however large you think n should be.
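The {-n...n} idea can also be written as one join with BETWEEN rather than a chain of UNIONs. Here is a minimal sketch using Python's sqlite3, assuming columns named start and stop and a hypothetical N for how many neighbouring chunks to search; it checks the chunked result against the plain nested-loop join on a tiny made-up data set:

```python
import sqlite3

CHUNK_SIZE = 10_000
N = 1  # neighbouring chunks searched on each side (assumed value)

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER PRIMARY KEY, start INTEGER, stop INTEGER);
    CREATE TABLE b (id INTEGER PRIMARY KEY, start INTEGER, stop INTEGER);
    INSERT INTO a VALUES (1, 9995, 9999), (2, 100, 200), (3, 50000, 50010);
    INSERT INTO b VALUES (1, 9998, 10008), (2, 150, 160), (3, 70000, 70010);
""")

# Same midpoint-based chunking as above; the join accepts any b chunk
# within :n chunks of the a chunk.
CHUNKED_SQL = """
    WITH a_chunk AS (
        SELECT a.*, CAST((a.start + a.stop) / (2 * :size) AS INTEGER) AS chunk
        FROM a
    ),
    b_chunk AS (
        SELECT b.*, CAST((b.start + b.stop) / (2 * :size) AS INTEGER) AS chunk
        FROM b
    )
    SELECT ac.id, bc.id
    FROM a_chunk ac JOIN b_chunk bc
      ON bc.chunk BETWEEN ac.chunk - :n AND ac.chunk + :n
     AND ac.start < bc.stop
     AND ac.stop > bc.start
    ORDER BY ac.id, bc.id
"""

# Reference result: compare every row of a with every row of b.
BRUTE_SQL = """
    SELECT a.id, b.id
    FROM a JOIN b ON a.start < b.stop AND a.stop > b.start
    ORDER BY a.id, b.id
"""

chunked = conn.execute(CHUNKED_SQL, {"size": CHUNK_SIZE, "n": N}).fetchall()
same_chunk_only = conn.execute(CHUNKED_SQL, {"size": CHUNK_SIZE, "n": 0}).fetchall()
brute = conn.execute(BRUTE_SQL).fetchall()

print(chunked)          # pairs found with neighbouring chunks included
print(same_chunk_only)  # pairs found when only the same chunk is searched
print(brute)            # pairs found by the full comparison
```

With N = 0 the (9995, 9999) / (9998, 10008) pair is lost, which is the edge case described above; with N = 1 the chunked result matches the full comparison for this data. Intervals much wider than the chunk size would still need a larger N.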
Rather than using a CTE, you can also speed this up in SQLite by storing CHUNK as a real column in your tables and then creating an index on each table for (CHUNK, start). (SQLite has no separate clustered-index syntax; an ordinary index is what applies here.) You may or may not benefit from including end in this index as well, though you'll have to run EXPLAIN QUERY PLAN on your specific case to see whether the optimizer actually uses it. The trade-off, of course, is increased storage space, which may not be ideal depending on what you're trying to do.
This admittedly feels like a hack and I'm trying to answer a similar question for my own project. I've heard that the only efficient solution is to manually take the data and implement an interval tree. However, with millions of rows, I'm not sure how efficient it would be to take this from sqlite and build a tree manually in your programming language of choice. If anyone has any better solutions I'd be happy to hear. At least in python, the ncls library seems like it could get the job done.
You can easily express this in SQL as a join. For partial overlaps, this would be:
select a.*, b.*
from a join
b
on a.start < b.end and a.end > b.start;
However, this will be slow, because it will be doing a nested loop comparison. So, although concise, this won't necessarily be much faster than running the 10K separate queries.
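For completeness, here is a small sketch of how that join might be driven from Python with list B loaded into a TEMP TABLE, as the question suggests. The column names (start/stop for A, beginning/end for B) follow the question; the sample rows, the index name, and the DISTINCT (to return each A entry once) are assumptions for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, start INTEGER, stop INTEGER)")
conn.executemany(
    "INSERT INTO a (start, stop) VALUES (?, ?)",
    [(0, 10), (20, 30), (40, 50)],
)

# Hypothetical list B held in Python.
ranges_b = [(5, 25), (100, 200)]

# "end" is quoted because END is an SQL keyword.
conn.execute('CREATE TEMP TABLE b (beginning INTEGER, "end" INTEGER)')
conn.executemany("INSERT INTO b VALUES (?, ?)", ranges_b)
conn.execute('CREATE INDEX b_range_idx ON b (beginning, "end")')

# One query instead of one query per range in B.
OVERLAP_SQL = """
    SELECT DISTINCT a.id, a.start, a.stop
    FROM a JOIN b
      ON a.start < b."end" AND a.stop > b.beginning
    ORDER BY a.id
"""

overlapping = conn.execute(OVERLAP_SQL).fetchall()
for row in overlapping:
    print(row)
```

This keeps everything in a single round trip, but the join itself is still the nested-loop comparison described above.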
I have the following piece of code which runs quickly (<1s):
SELECT
[Policy].[Value] AS [PolicyId]
,[Person].[Value] AS [PersonId]
,[Person].[Index] AS [PersonIndex]
FROM
[dbo].[View] AS [Policy]
INNER JOIN [dbo].[ViewPerson] AS [Person] WITH(INDEX([Index])) ON ([Policy].[CollectionId] = [Person].[CollectionId]
AND [Person].[Name] = 'PersonId' AND [Policy].[Name] = 'PolicyId')
WHERE
[Policy].[CollectionId] = 10003
-- AND [Policy].[Value] = [Person].[Value]
This will return 2 rows from my database. When I uncomment the last line to apply a stronger filter, it returns only 1 row from my database, but takes much longer to run (~20s).
Is there a method to reduce the time this query takes to run when a filter is applied to it? Ideally I'd like it to run at the same speed as the original.
You were told in the comments that forcing the engine to use a specific index is, in most cases, not the best idea. The engine is pretty good at finding the best plan, and it usually works best if you let it go its own way.
Secondly, you were already told that the execution plan is the best place to start. As we cannot see any details, the following is pure guesswork:
If I understand this correctly, your query uses CollectionId to filter for one given id (just a few Policy rows). For these rows, the JOIN on a VIEW (we have no idea what is behind it!) tries to link person rows.
The filter should work against a very reduced set.
Your observations lead me to assume that the second line in the WHERE is dealing with a much larger set. I'm pretty sure that the filter for CollectionId = 10003 is applied after the other filter. The execution plan will show the details.
What you can do:
Take away the index hint
Try to add the second line in the WHERE with AND to the ON-clause of the JOIN.
Something along this:
SELECT
[Policy].[Value] AS [PolicyId]
,[Person].[Value] AS [PersonId]
,[Person].[Index] AS [PersonIndex]
FROM
[dbo].[View] AS [Policy]
INNER JOIN [dbo].[ViewPerson] AS [Person] ON ([Policy].[CollectionId] = [Person].[CollectionId]
AND [Person].[Name] = 'PersonId'
AND [Policy].[Name] = 'PolicyId'
AND [Policy].[Value] = [Person].[Value])
WHERE
[Policy].[CollectionId] = 10003;
I'm developing a simple app to return a random selection of exercises, one for each bodypart.
bodypart is an indexed enum column on an Exercise model. DB is PostgreSQL.
The below achieves the result I want, but feels horribly inefficient (hitting the db once for every bodypart):
BODYPARTS = %w(legs core chest back shoulders).freeze
@exercises = BODYPARTS.map do |bp|
Exercise.public_send(bp).sample
end.shuffle
So, this gives a random exercise for each bodypart, and mixes up the order at the end.
I could also store all exercises in memory and select from them; however, I imagine this would scale horribly (there are only a dozen or so seed records at present).
@exercises = Exercise.all
BODYPARTS.map do |bp|
  @exercises.select { |e| e[:bodypart] == bp }.sample
end.shuffle
Benchmarking these shows the select approach as the more effective on a small scale:
Queries: 0.072902 0.020728 0.093630 ( 0.088008)
Select: 0.000962 0.000225 0.001187 ( 0.001113)
MrYoshiji's answer: 0.000072 0.000008 0.000080 ( 0.000072)
My question is whether there's an efficient way to achieve this output, and, if so, what that approach might look like. Ideally, I'd like to keep this to a single db query.
Happy to compose this using ActiveRecord or directly in SQL. Any thoughts greatly appreciated.
From my comment, you should be able to do this (thanks to PostgreSQL's DISTINCT ON):
Exercise.select('distinct on (bodypart) *')
.order('bodypart, random()')
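If you want to try the "one random row per bodypart in a single query" idea outside PostgreSQL, here is a rough sketch using Python's sqlite3. SQLite has no DISTINCT ON, so this swaps in a different technique, ROW_NUMBER() with a random ordering inside each partition, and it assumes a SQLite build with window functions (3.25 or later). The table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE exercises (id INTEGER PRIMARY KEY, name TEXT, bodypart TEXT);
    INSERT INTO exercises (name, bodypart) VALUES
        ('squat', 'legs'), ('lunge', 'legs'),
        ('plank', 'core'), ('crunch', 'core'),
        ('bench press', 'chest'), ('row', 'back'),
        ('overhead press', 'shoulders');
""")

# Number the rows of each bodypart in random order, keep the first of
# each, then shuffle the surviving rows with the outer ORDER BY.
ONE_PER_BODYPART_SQL = """
    SELECT id, name, bodypart
    FROM (
        SELECT e.*,
               ROW_NUMBER() OVER (
                   PARTITION BY bodypart ORDER BY random()
               ) AS rn
        FROM exercises e
    )
    WHERE rn = 1
    ORDER BY random()
"""

picks = conn.execute(ONE_PER_BODYPART_SQL).fetchall()
for row in picks:
    print(row)
```

The output differs from run to run, but it always contains exactly one exercise per bodypart.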
Postgres' DISTINCT ON is very handy and performance is typically great, too - for many distinct bodyparts with few rows each. But for only few distinct values of bodypart with many rows each (big table - and your use case) there are far superior query techniques.
This will be massively faster in such a case:
SELECT e.*
FROM unnest(enum_range(null::bodypart)) b(bodypart)
CROSS JOIN LATERAL (
SELECT *
FROM exercises
WHERE bodypart = b.bodypart
-- ORDER BY ??? -- for a deterministic pick
LIMIT 1 -- arbitrary pick!
) e;
This assumes that bodypart is the name of the enum type as well as of the table column.
enum_range is an enum support function that (quoting the manual):
Returns all values of the input enum type in an ordered array
I unnest it and run a LATERAL subquery for each value, which is very fast when supported with the right index. Detailed explanation for the query technique and the needed index (focus on chapter "2a. LATERAL join"):
Optimize GROUP BY query to retrieve latest record per user
For just an arbitrary row for each bodypart, a simple index on exercises(bodypart) does the job. But you can have a deterministic pick like "the latest entry" with the right multicolumn index and a matching ORDER BY clause and almost the same performance.
Related:
Is it a bad practice to query pg_type for enums on a regular basis?
Select first row in each GROUP BY group?
I have a table order, which is very straightforward, it is storing order data.
I have a view, which is storing currency pair and currency rate. The view is created as below:
create or replace view view_currency_rate as (
select c.* from currency_rate c, (
select curr_from, curr_to, max(rate_date) max_rate_date from currency_rate
where system_rate > 0
group by curr_from, curr_to) r
where c.curr_from = r.curr_from
and c.curr_to = r.curr_to
and c.rate_date = r.max_rate_date
and c.system_rate > 0
);
Nothing fancy here: this view returns the latest currency rate (curr_from -> curr_to) from the currency_rate table.
When I run the query below, it returns 80K rows (all the data) because I have plenty of records in the order table, and it takes less than 5 seconds.
First Query:
select * from
VIEW_CURRENCY_RATE c, order a
where
c.curr_from = A.CURRENCY;
I wanted to add another filter, which I thought would make it faster, so I added this:
Second Query:
select * from
VIEW_CURRENCY_RATE c, order a
where
a.id = 'xxxx'
and c.curr_from = A.CURRENCY;
And now it runs for over 1 minute! I have no idea what is happening here. I thought the Oracle optimizer might have gone wrong, so I tried another way: since the 80K rows can be returned quite fast, I tried to filter on top of that result by nesting the SQL as below:
select * from (
select * from
VIEW_CURRENCY_RATE c, order a
where
c.curr_from = A.CURRENCY
)
where id = 'xxxx';
It runs damn slow as well! I'm running out of ideas; can anyone explain what is happening with my script?
Updated on 6-Sep-2016
After I learned how to use 'explain plan', I captured the screen:
First query (fast one with 80K rows):
Second query (slow one):
The slow one completely breaks up the view and forms a new SQL statement! This is super weird; how can Oracle optimize it like that?
It seems the problem relates to the plan of the second query, because it uses nested loops in place of a hash join.
First check whether _hash_join_enabled is true; if it isn't, change it to true. If it is true, there is some problem with the Oracle optimizer. To test this, use the USE_HASH(tab2 tab1) hint.
Regards
mohsen
I am using Mike's solution: I rewrote the script and it is running fast now, although the root cause has not been determined. It is probably due to the Oracle optimizer working differently from what I expected.
I have a local Access database and in it a query which takes values from a form to populate a drop-down menu. The weird (to me) thing is that with most options this query is quick (blink of an eye), but with a few options it's very slow (>10 seconds).
What the query does is as follows: it populates a drop-down menu to record animals seen at a specific sighting, but only those animals which have not been recorded at that specific sighting yet (to avoid duplicate entries).
SELECT DISTINCT tblAnimals.AnimalID, tblAnimals.Nickname, tblAnimals.Species
FROM tblSightings INNER JOIN (tblAnimals INNER JOIN tblAnimalsatSighting ON tblAnimals.AnimalID = tblAnimalsatSighting.AnimalID) ON tblSightings.SightingID = tblAnimalsatSighting.SightingID
WHERE (((tblAnimals.Species)=[form]![Species]) AND ((tblAnimals.CurrentGroup)=[form]![AnimalGroup2]) AND ((tblAnimals.[Dead?])=False) AND ((Exists (select tblAnimalsatSighting.AnimalID FROM tblAnimalsatSighting WHERE tblAnimals.AnimalID = tblAnimalsatSighting.AnimalID AND tblAnimalsatSighting.SightingID = [form]![SightingID]))=False));
It performs well for all groups of 2 of the 4 possible species. For 1 species it performs well for 4 of the 5 groups, but not for the last group, and for the last species it performs very slowly for both groups. Does anybody have an idea what could cause this kind of behavior? Is it a problem with the query, or could duplicate entries in the tables cause it? I don't think it's duplicates in the tables: I've checked, and there are some, but they appear both for groups where there are problems and for groups where there aren't. Could I rewrite the query so it performs faster?
As noted in our comments above, you confirmed that the extra joins were not really needed and were in fact going to limit the results to animals that had already had a sighting. Those joins would also likely contribute to the slowdown.
I know that Access probably added most of the parentheses automatically, but I've removed them and converted the subquery to a NOT EXISTS form that's a lot more readable.
SELECT tblAnimals.AnimalID, tblAnimals.Nickname, tblAnimals.Species
FROM tblAnimals
WHERE
tblAnimals.Species = [form]![Species]
AND tblAnimals.CurrentGroup = [form]![AnimalGroup2]
AND tblAnimals.[Dead?] = False
AND NOT EXISTS (
SELECT tblAnimalsatSighting.AnimalID
FROM tblAnimalsatSighting
WHERE
tblAnimals.AnimalID = tblAnimalsatSighting.AnimalID
AND tblAnimalsatSighting.SightingID = [form]![SightingID]
);
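To check the logic of the NOT EXISTS form outside Access, here is a small sketch using Python's sqlite3 with made-up rows. The form references are replaced by named parameters, the [Dead?] column is renamed Dead, and the sample animals and sighting ids are all assumptions for the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tblAnimals (
        AnimalID INTEGER PRIMARY KEY, Nickname TEXT, Species TEXT,
        CurrentGroup TEXT, Dead INTEGER
    );
    CREATE TABLE tblAnimalsatSighting (SightingID INTEGER, AnimalID INTEGER);
    INSERT INTO tblAnimals VALUES
        (1, 'Ann', 'baboon', 'G1', 0),
        (2, 'Bob', 'baboon', 'G1', 0),
        (3, 'Cy',  'baboon', 'G1', 1),   -- dead, never offered
        (4, 'Dee', 'baboon', 'G2', 0);   -- different group
    INSERT INTO tblAnimalsatSighting VALUES (100, 1);  -- Ann already recorded
""")

# Animals of the chosen species and group that are alive and not yet
# recorded at the given sighting.
DROPDOWN_SQL = """
    SELECT a.AnimalID, a.Nickname, a.Species
    FROM tblAnimals a
    WHERE a.Species = :species
      AND a.CurrentGroup = :grp
      AND a.Dead = 0
      AND NOT EXISTS (
          SELECT 1
          FROM tblAnimalsatSighting s
          WHERE s.AnimalID = a.AnimalID
            AND s.SightingID = :sighting
      )
    ORDER BY a.AnimalID
"""

params = {"species": "baboon", "grp": "G1"}
at_sighting_100 = conn.execute(DROPDOWN_SQL, {**params, "sighting": 100}).fetchall()
at_sighting_101 = conn.execute(DROPDOWN_SQL, {**params, "sighting": 101}).fetchall()
print(at_sighting_100)
print(at_sighting_101)
```

For sighting 100 only Bob is offered, since Ann is already recorded there. For the empty sighting 101 both Ann and Bob appear, including Bob, who has never been recorded at any sighting and would have been dropped by the original inner joins.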