Problem taming MySQL query performance with OR statement

[Warning: long post ahead!]
I've been banging my head against this for quite some time now, but I can't find a common denominator for what is going on. I've found a workaround (see the end of the post), but my inner Zen is not satisfied yet.
I have a main table with forum messages (it's from Phorum); simplified, it looks like this (ignore the anon_user_id column for the moment, I will get to it later):
CREATE TABLE `test_msg` (
`message_id` int(10) unsigned NOT NULL auto_increment,
`status` tinyint(4) NOT NULL default '2',
`user_id` int(10) unsigned NOT NULL default '0',
`datestamp` int(10) unsigned NOT NULL default '0',
`anon_user_id` int(10) unsigned NOT NULL default '0',
PRIMARY KEY (`message_id`)
);
Messages can be anonymized by the software, in which case the user_id is set to 0. The software also allows posting completely anonymous messages, which we endorse. In our case we still need to know which user posted a message, so through the hook system provided by Phorum we maintain a second table accordingly:
CREATE TABLE `test_anon` (
`message_id` bigint(20) unsigned NOT NULL,
`user_id` bigint(20) unsigned NOT NULL,
KEY `fk_user_id` (`user_id`),
KEY `fk_message_id` (`message_id`)
);
For the view in the profile, I need to get a list of messages from a user, no matter whether they have been anonymized by him or not.
A user always has the right to see the messages he wrote anonymously or anonymized later.
Because user_id gets set to 0 when a message is anonymized, we can't simply filter on it in the WHERE clause; we need to join against our second table. Formulated as SQL, the above looks like this (the status = 2 is required; other states mean the post is hidden, pending approval, etc.):
SELECT * FROM test_msg AS m
LEFT JOIN test_anon ON test_anon.message_id = m.message_id
WHERE (test_anon.user_id = 20 OR m.user_id = 20)
AND m.status = 2
ORDER BY m.datestamp DESC
LIMIT 0,10
This query by itself, whenever the query cache is empty, takes a few seconds, currently around 4. Things get worse when multiple users issue the query while the query cache is empty (which simply happens: people post messages and the cached queries are invalidated); we faced this in our internal testing phase, and reports were that the system sometimes slowed down. We've seen queries taking 30 to 60 seconds because of the concurrency. I don't even want to imagine what happens when we expand our user base ...
Now, it's not like I didn't do any analysis of the bottleneck.
I tried rewriting the WHERE clause and added and deleted indexes like hell.
That is when I found out that when I do not use any index at all, the query performs lightning fast under certain conditions. Using no index, the query looks like:
SELECT * FROM test_msg AS m USE INDEX()
LEFT JOIN test_anon ON test_anon.message_id = m.message_id
WHERE (test_anon.user_id = 20 OR m.user_id = 20)
AND m.status = 2
ORDER BY m.datestamp DESC
LIMIT 0,10
Now here comes the certain condition: the LIMIT restricts the result to 10 rows. Assume my complete result is n = 26 rows. Anything from LIMIT 0,10 up to LIMIT 16,10 takes zero seconds (somewhere below 0.01 s): these are the cases where the result is always the full 10 rows.
Starting with LIMIT 17,10, the result contains only 9 rows, and from this point on the query takes around four seconds again. The same applies wherever the result set is smaller than the maximum number of rows requested through LIMIT. Irritating!
Going back to the first CREATE TABLE statement, I also conducted tests without the LEFT JOIN; we just assume user_id=0 and anon_user_id=<the previous user_id> for anonymized messages, in other words, completely bypassing the second table:
SELECT * FROM test_msg
WHERE status = 2 AND (user_id = 20 OR anon_user_id = 20)
ORDER BY datestamp DESC
LIMIT 20,10
Result: it did not matter. The query still takes 4 or 5 seconds; forcing it to not use an index with USE INDEX() does not speed this one up.
This is where I really got puzzled. An index will only ever be used for the status column; the OR prevents any other index from being used. That is also what the MySQL documentation told me in this regard.
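For reference, the optimizer's decision can be inspected with EXPLAIN; the key column of the output names the chosen index per table (NULL meaning a full scan), and "Using filesort" in the Extra column means the ORDER BY cannot be served from an index:
EXPLAIN SELECT * FROM test_msg AS m
LEFT JOIN test_anon ON test_anon.message_id = m.message_id
WHERE (test_anon.user_id = 20 OR m.user_id = 20)
AND m.status = 2
ORDER BY m.datestamp DESC
LIMIT 0,10;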
An alternate solution I tried: do not use the test_anon table to track only anonymized messages, but simply all messages. This allows me to write a query like this:
SELECT * FROM test_msg AS m, test_anon AS t
WHERE m.message_id = t.message_id
AND t.user_id = 20
AND m.status = 2
ORDER BY m.datestamp DESC
LIMIT 20,10
This query always gave me instant results (== < 0.01 seconds), no matter what LIMIT, etc.
Yes, I've found a solution. I've not yet rewritten the whole application to this model, though.
But I'd like to better understand what the rationale is behind the observed behavior (especially forcing no index speeding up queries). On paper, nothing looked wrong with the original approach.
Some numbers (they aren't that big anyway):
~ one million messages
message table data size is around 600MB
message table index size is around 350MB
number of anonymized messages in test_anon < 3% of all messages
number of messages from registered users < 25% of all messages
All tables are MyISAM; I tried InnoDB, but performance was much worse.

You in fact have two different queries here, which are better processed as separate queries.
To improve the LIMIT, you need to use the LIMIT on LIMIT technique:
SELECT *
FROM (
SELECT *
FROM test_msg AS m
WHERE m.user_id = 20
AND m.status = 2
ORDER BY
m.datestamp DESC
LIMIT 20
) q1
UNION ALL
SELECT *
FROM (
SELECT m.*
FROM test_msg m
JOIN test_anon a
ON a.message_id = m.message_id
WHERE a.user_id = 20
AND m.user_id = 0
AND m.status = 2
ORDER BY
m.datestamp DESC
LIMIT 20
) q2
ORDER BY
datestamp DESC
LIMIT 20
See this entry in my blog for more detail on this solution:
MySQL: LIMIT on LIMIT
You need to create two composite indexes for this to work fast:
test_msg (status, user_id, datestamp)
test_msg (status, user_id, message_id, datestamp)
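Spelled out as MySQL DDL (the index names here are just illustrative):
ALTER TABLE test_msg
ADD INDEX ix_status_user_date (status, user_id, datestamp),
ADD INDEX ix_status_user_msg_date (status, user_id, message_id, datestamp);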
Then you need to choose what the index will be used for in the second query: ordering or filtering.
In your query, the index cannot be used for both, since you're filtering on a range on message_id.
See this article for more explanation:
Choosing index
In a couple of words:
If there are lots of anonymous messages from this user, i.e. there is a high probability that a match will be found near the beginning of the index, then the index should be used for sorting. Use the first index.
If there are few anonymous messages from this user, i.e. there is a low probability that a match will be found near the beginning of the index, then the index should be used for filtering. Use the second index.
If there is a possibility to redesign the tables, just add another column is_anonymous to the table test_msg.
It will solve lots of problems.
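A rough sketch of what that redesign could look like (the flag semantics are my assumption: user_id stays intact and is_anonymous marks messages whose author must be hidden in the UI):
ALTER TABLE test_msg
ADD COLUMN is_anonymous TINYINT(1) NOT NULL DEFAULT 0;

-- The profile query then needs neither the OR nor the join:
SELECT * FROM test_msg
WHERE status = 2 AND user_id = 20
ORDER BY datestamp DESC
LIMIT 0, 10;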

The problem is that you're doing the join for the entire table. You need to tell the optimizer that you only need to join for two user IDs: zero and your desired user ID. Like this:
SELECT * FROM test_msg AS m
LEFT JOIN test_anon ON test_anon.message_id = m.message_id
WHERE (m.user_id = 20 OR m.user_id = 0)
AND (test_anon.user_id = 20 OR test_anon.user_id IS NULL)
AND m.status = 2
ORDER BY m.datestamp DESC
LIMIT 0,10
Does this work better?

Related

How to improve the efficiency of the below query in SQL Server?

I have a database on the order of ten million rows. The client needs to read data and perform calculations.
Due to the large amount of data, if it is kept in the application cache, memory will overflow and the application will crash.
If I use a SELECT statement to query data from the database in real time, it may take too long and the number of operations against the database may be too frequent.
Is there a better way to read the database data? I use C++ and C# to access the SQL Server database.
My database statement is similar to the following:
SELECT TOP 10 y.SourceName, MAX(y.EndTimeStamp - y.StartTimeStamp) AS ProcessTimeStamp
FROM
(
SELECT x.SourceName, x.StartTimeStamp, IIF(x.EndTimeStamp IS NOT NULL, x.EndTimeStamp, 134165256277210658) AS EndTimeStamp
FROM
(
SELECT
SourceName,
Active,
LEAD(Active) OVER(PARTITION BY SourceName ORDER BY TicksTimeStamp) NextActive,
TicksTimeStamp AS StartTimeStamp,
LEAD(TicksTimeStamp) OVER(PARTITION BY SourceName ORDER BY TicksTimeStamp) EndTimeStamp
FROM Table1
WHERE Path = N'App1' and TicksTimeStamp >= 132165256277210658 and TicksTimeStamp < 134165256277210658
) x
WHERE (x.Active = 1 and x.NextActive = 0) OR (x.Active = 1 and x.NextActive = null)
) y
GROUP BY y.SourceName
ORDER BY ProcessTimeStamp DESC, y.SourceName
The database structure is roughly as follows:
ID Path SourceName TicksTimeStamp Active
1 App1 Pipe1 132165256277210658 1
2 App1 Pipe1 132165256297210658 0
3 App1 Pipe1 132165956277210658 1
4 App2 Pipe2 132165956277210658 1
5 App2 Pipe2 132165956277210658 0
I use ExecuteReader from C#. The same SQL statement runs in SQL Server Management Studio in 4 s, but ExecuteReader takes 8-9 s. Does the slow time have anything to do with this interface?
I don't really 'get' the entire query but I'm wondering about this part:
WHERE (x.Active = 1 and x.NextActive = 0) OR (x.Active = 1 and x.NextActive = null)
SQL doesn't really like ORs, so why not convert this to
WHERE x.Active = 1 and ISNULL(x.NextActive, 0) = 0
This might cause a completely different query plan. (or not)
As @Charlieface mentioned, it's probably best to share the query plan so we can get an idea of what's going on.
PS: I'm also not sure what those 'TicksTimeStamp' values represent, but it looks like you're fetching a pretty wide range there; bigger volumes will also cause longer processing times. Even though you only return the top 10, it still has to go through the entire range to calculate those durations.
I agree with @Charlieface. I think the index you want is as follows:
CREATE INDEX idx ON Table1 (Path, TicksTimeStamp) INCLUDE (SourceName, Active);
You can add both indexes (with different names of course) and see which one the execution engine chooses.
I can suggest adding the following index which should help the inner query using LEAD:
CREATE INDEX idx ON Table1 (SourceName, TicksTimeStamp, Path) INCLUDE (Active);
The key point of the above index is that it should allow the lead values to be rapidly computed. It also has an INCLUDE clause for Active, to cover the entire select.
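If you want to measure rather than guess, SQL Server can report per-statement I/O and timing; a sketch of the session setup (re-run the original SELECT TOP 10 query between the two pairs of statements, once per candidate index, and compare logical reads and elapsed time in the Messages tab):
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- run the original query here

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;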

PostgreSQL - Optimizing query performance

I'm analyzing my queries performance using New Relic and this one in particular is taking a long time to complete:
SELECT "events".*
FROM "events"
WHERE ("events"."deleted_at" IS NULL AND
"events"."eventable_id" = $? AND
"events"."eventable_type" = $? OR
"events"."deleted_at" IS NULL AND
"events"."eventable_id" IN (SELECT "flow_recipients"."id" FROM "flow_recipients" WHERE "flow_recipients"."contact_id" = $?) AND "events"."eventable_type" = $?)
ORDER BY "events"."created_at" DESC
LIMIT $? OFFSET $?
Sometimes this query takes more than 8 seconds to complete, and I can't understand why. I have taken a look at the query plan, but I'm not sure I can understand it.
Is there something wrong with my indexes? Is there something I can optimize? How could I further investigate what's going on?
I suspect that the fact that I'm using SELECT events.* instead of selecting only the columns I'm interested in could have some impact, but I'm using a LIMIT of 15, so I'm not sure it would matter that much.
[EDIT]
I have an index on created_at column and another index on eventable_id and eventable_type columns. Apparently, this second index is not being used, and I don't know why.
The cause of the long execution time is that the optimizer hopes it can find enough matching rows quickly by scanning all rows in sorting order and picking out those that match the condition, but the executor actually has to scan 630835 rows until it finds enough matching rows.
For every row that is being examined, the subselect is executed.
You should rewrite that OR to a UNION:
SELECT * FROM events
WHERE deleted_at IS NULL
AND eventable_id = $?
AND eventable_type = $?
UNION
SELECT * FROM events e
WHERE deleted_at IS NULL
AND eventable_type = $?
AND EXISTS (SELECT 1
FROM flow_recipients f
WHERE f.id = e.eventable_id
AND f.contact_id = $?);
This query does the same thing if events has a primary key.
Useful indexes depend on the execution plan chosen, but these ones might be good:
CREATE INDEX ON events (eventable_type, eventable_id)
WHERE deleted_at IS NULL;
CREATE INDEX ON flow_recipients (contact_id);
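After adding the indexes it is worth re-checking the plan; EXPLAIN (ANALYZE, BUFFERS) shows actual row counts and whether the partial index is used (the parameter values below are made up for illustration):
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events
WHERE deleted_at IS NULL
AND eventable_id = 42
AND eventable_type = 'FlowRecipient'
ORDER BY created_at DESC
LIMIT 15;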

Nested subquery in Access alias causing "enter parameter value"

I'm using Access (I normally use SQL Server) for a little job, and I'm getting "enter parameter value" for Night.NightId in the statement below that has a subquery within a subquery. I expect it would work if I wasn't nesting it two levels deep, but I can't think of a way around it (query ideas welcome).
The scenario is pretty simple, there's a Night table with a one-to-many relationship to a Score table - each night normally has 10 scores. Each score has a bit field IsDouble which is normally true for two of the scores.
I want to list all of the nights, with a number next to each representing how many of the top 2 scores were marked IsDouble (would be 0, 1 or 2).
Here's the SQL, I've tried lots of combinations of adding aliases to the column and the tables, but I've taken them out for simplicity below:
select Night.*
,
( select sum(IIF(IsDouble,1,0)) from
(SELECT top 2 * from Score where NightId=Night.NightId order by Score desc, IsDouble asc, ID)
) as TopTwoMarkedAsDoubles
from Night
This is a bit of speculation. However, some databases have issues with correlation conditions in multiply nested subqueries. MS Access might have this problem.
If so, you can solve this by using aggregation with a where clause that chooses the top two values:
select s.nightid,
sum(IIF(IsDouble, 1, 0)) as TopTwoMarkedAsDoubles
from Score as s
where s.id in (select top 2 s2.id
from score as s2
where s2.nightid = s.nightid
order by s2.score desc, s2.IsDouble asc, s2.id
)
group by s.nightid;
If this works, it is a simple matter to join Night back in to get the additional columns, as sketched below.
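That final join-back might look like this (a sketch; it reuses the aggregate above as a derived table, assuming Access accepts the correlated subquery inside it):
select n.*, t.TopTwoMarkedAsDoubles
from Night as n
left join (
select s.nightid,
sum(IIF(s.IsDouble, 1, 0)) as TopTwoMarkedAsDoubles
from Score as s
where s.id in (select top 2 s2.id
from score as s2
where s2.nightid = s.nightid
order by s2.score desc, s2.IsDouble asc, s2.id)
group by s.nightid
) as t on t.nightid = n.NightId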
Your subquery can only see one level above it, so Night.NightId is totally unknown to it, which is why you are being prompted to enter a value. You can use a GROUP BY to get the value you want for each NightId, then correlate that back to the original Night table.
Select *
From Night
left join (
Select N.NightId
, sum(IIF(S.IsDouble,1,0)) as [Number of Doubles]
from Night N
inner join Score S
on S.NightId = N.NightId
group by N.NightId) NightsWithScores
on Night.NightId = NightsWithScores.NightId
Because of the IIF(S.IsDouble,1,0) I don't see the point in using TOP.

Optimizing MySQL statement with lots of count(row) and sum(row+row2)

I need to use the InnoDB storage engine on a table with about 1 million or so records in it at any given time. Records are inserted at a very fast rate and are then dropped within a few days, maybe a week. The ping table has about a million rows, whereas the website table has only about 10,000.
My statement is this:
select url
from website ws, ping pi
where ws.idproxy = pi.idproxy and pi.entrytime > curdate() - 3 and contentping+tcpping is not null
group by url
having sum(contentping+tcpping)/(count(*)-count(errortype)) < 500 and count(*) > 3 and
count(errortype)/count(*) < .15
order by sum(contentping+tcpping)/(count(*)-count(errortype)) asc;
I added an index on entrytime, yet no dice. Can anyone throw me a bone as to what I should look into for basic optimization of this query? The result set is only about 200 rows, so I'm not getting killed there.
In the absence of the schemas of the relations, I'll have to make some guesses.
If you're making WHERE a.attrname = b.attrname clauses, that cries out for a JOIN instead.
Using COUNT(*) is both redundant and sometimes less efficient than COUNT(some_specific_attribute). The primary key is a good candidate.
Why would you test contentping+tcpping IS NOT NULL, asking for a calculation that appears unnecessary, instead of just testing whether the attributes individually are null?
Here's my attempt at an improvement:
SELECT url
FROM website AS ws
JOIN ping AS pi
ON ws.idproxy = pi.idproxy
WHERE
pi.entrytime > CURDATE() - 3
AND pi.contentping IS NOT NULL
AND pi.tcpping IS NOT NULL
GROUP BY url
HAVING
SUM(pi.contentping + pi.tcpping) / (COUNT(pi.idproxy) - COUNT(pi.errortype)) < 500
AND COUNT(pi.idproxy) > 3
AND COUNT(pi.errortype) / COUNT(pi.idproxy) < 0.15
ORDER BY
SUM(pi.contentping + pi.tcpping) / (COUNT(pi.idproxy) - COUNT(pi.errortype)) ASC;
Performing lots of identical calculations in both the HAVING and ORDER BY clauses will likely be costing you performance. You could either put them in the SELECT clause, or create a view that has those calculations as attributes and use that view for accessing the values.
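For example (a sketch; MySQL lets HAVING and ORDER BY refer to select-list aliases, so each expression is written only once):
SELECT url,
SUM(pi.contentping + pi.tcpping) / (COUNT(pi.idproxy) - COUNT(pi.errortype)) AS avg_ping,
COUNT(pi.idproxy) AS samples,
COUNT(pi.errortype) / COUNT(pi.idproxy) AS error_rate
FROM website AS ws
JOIN ping AS pi ON ws.idproxy = pi.idproxy
WHERE pi.entrytime > CURDATE() - INTERVAL 3 DAY
AND pi.contentping IS NOT NULL
AND pi.tcpping IS NOT NULL
GROUP BY url
HAVING avg_ping < 500 AND samples > 3 AND error_rate < 0.15
ORDER BY avg_ping ASC;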

table design + SQL question

I have a table foodbar, created with the following DDL. (I am using MySQL 5.1.x.)
CREATE TABLE foodbar (
id INT NOT NULL AUTO_INCREMENT,
user_id INT NOT NULL,
weight double not null,
created_at date not null
);
I have four questions:
1. How may I write a query that returns a result set that gives me the following information: user_id, weight_gain where weight_gain is the difference between a weight and a weight that was recorded 7 days ago?
2. How may I write a query that will return the top N users with the biggest weight gain (again, say, over a week)? An 'obvious' way may be to use the query obtained in question 1 above as a subquery, but somehow picking the top N.
3. Since in question 2 (and indeed question 1) I am searching the records in the table using a calculated field, indexing would be preferable to optimise the query; however, since it is a calculated field, it is not clear which field to index (I'm guessing the 'weight' field is the one that needs indexing). Am I right in that assumption?
4. Assuming I had another field in the foodbar table (say 'height') and I wanted to select records from the table based on (say) the product (i.e. multiplication) of 'height' and 'weight', would I be right in assuming again that I need to index 'height' and 'weight'? Do I also need to create a composite key (say (height, weight))? If this question is not clear, I would be happy to clarify.
I don't see why you should need the synthetic key, so I'll use this table instead:
CREATE TABLE foodbar (
user_id INT NOT NULL
, created_at date not null
, weight double not null
, PRIMARY KEY (user_id, created_at)
);
How may I write a query that returns a result set that gives me the following information: user_id, weight_gain where weight_gain is the difference between a weight and a weight that was recorded 7 days ago.
SELECT curr.user_id, curr.weight - prev.weight
FROM foodbar curr, foodbar prev
WHERE curr.user_id = prev.user_id
AND curr.created_at = CURRENT_DATE
AND prev.created_at = CURRENT_DATE - INTERVAL 7 DAY
;
(the date arithmetic is written in MySQL syntax here; adjust for other engines, but you get the idea)
How may I write a query that will return the top N users with the biggest weight gain (again, say, over a week)? An 'obvious' way may be to use the query obtained in question 1 above as a subquery, but somehow picking the top N.
see above, add ORDER BY curr.weight - prev.weight DESC and LIMIT N
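Putting that together, a sketch (date arithmetic again in MySQL syntax, with N = 10 as an example):
SELECT curr.user_id, curr.weight - prev.weight AS weight_gain
FROM foodbar curr
JOIN foodbar prev
ON curr.user_id = prev.user_id
AND curr.created_at = CURRENT_DATE
AND prev.created_at = CURRENT_DATE - INTERVAL 7 DAY
ORDER BY weight_gain DESC
LIMIT 10;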
For the last two questions: don't speculate, examine execution plans. (PostgreSQL has EXPLAIN ANALYZE; I don't know the MySQL equivalent off-hand.) You'll probably find you need to index columns that participate in WHERE and JOIN, not the ones that form the result set.
I think that "just somebody" covered most of what you're asking, but I'll just add that indexing columns that take part in a calculation is unlikely to help you at all unless it happens to be a covering index.
For example, it doesn't help to order the following rows by X, Y if I want to get them in the order of their product X * Y:
X Y
1 8
2 2
4 4
The products would order them as:
X Y Product
2 2 4
1 8 8
4 4 16
If MySQL supports calculated columns in a table and allows indexing on those columns, then that might help.
I agree with just somebody regarding the primary key, but for what you're asking regarding the weight calculation, you'd be better off storing the delta rather than the weight:
CREATE TABLE foodbar (
user_id INT NOT NULL,
created_at date not null,
weight_delta double not null,
PRIMARY KEY (user_id, created_at)
);
It means you'd store the user's initial weight in, say, the user table, and when you write records to the foodbar table, a user could supply the weight at that time, but what gets stored is the difference from the previous weight. So you'd see values like:
user_id weight_delta
------------------------
1 2
1 5
1 -3
Looking at that, you know that user 1 gained 4 pounds/kilos/stones/etc.
This way you could use SUM, because it's possible for someone to have weighings every day - using just somebody's equation of curr.weight - prev.weight wouldn't work, regardless of time span.
Getting the top x is easy in MySQL - use the LIMIT clause, but mind that you provide an ORDER BY to make sure the limit is applied correctly.
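With the delta model the weekly gain reduces to a SUM, e.g. (a sketch, again with N = 10):
SELECT user_id, SUM(weight_delta) AS gain
FROM foodbar
WHERE created_at >= CURRENT_DATE - INTERVAL 7 DAY
GROUP BY user_id
ORDER BY gain DESC
LIMIT 10;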
It's not obvious, but there's some important information missing from the problem you're trying to solve. It becomes more noticeable when you think about realistic data going into this table. The problem is that you're unlikely to have a consistent, regular daily record of users' weights. So you need to clarify a couple of rules around determining 'current weight' and 'weight x days ago'. I'm going to assume the following simplistic rules:
The most recent weight reading is the 'current-weight'. (Even though that could be months ago.)
The most recent weight reading more than x days ago will be the weight assumed at x days ago. (Even though for example a reading from 6 days ago would be more reliable than a reading from 21 days ago when determining weight 7 days ago.)
Now to answer the questions:
1&2: Using the above extra rules provides an opportunity to produce two result sets: current weights, and previous weights:
Current weights:
select rd.*,
w.Weight
from (
select User_id,
max(Created_at) AS Read_date
from Foodbar
group by User_id
) rd
inner join Foodbar w on
w.User_id = rd.User_id
and w.Created_at = rd.Read_date
Similarly for the x days ago reading:
select rd.*,
w.Weight
from (
select User_id,
max(Created_at) AS Read_date
from Foodbar
where Created_at < DATEADD(dd, -7, GETDATE()) /*Or appropriate MySql equivalent*/
group by User_id
) rd
inner join Foodbar w on
w.User_id = rd.User_id
and w.Created_at = rd.Read_date
Now simply join these results as subqueries:
select cur.User_id,
cur.Weight as Cur_weight,
prev.Weight as Prev_weight,
cur.Weight - prev.Weight as Weight_change
from (
/*Insert query #1 here*/
) cur
inner join (
/*Insert query #2 here*/
) prev on
prev.User_id = cur.User_id
If I remember correctly the MySql syntax to get the top N weight gains would be to simply add:
ORDER BY cur.Weight - prev.Weight DESC limit N
2&3: Choosing indexes requires a little understanding of how the query optimiser will process the query:
The important thing when it comes to index selection is which columns you are filtering by or joining on. The optimiser will use an index if it is determined to be selective enough (note that sometimes your filters have to be extremely selective, returning < 1% of the data, to be considered useful). There's always a trade-off between the slow disk seek times of navigating indexes and simply processing all the data in memory.
3: Although weights feature significantly in what you display, their only relevance in terms of filtering (or selection) is in #2, to get the top N weight gains. That is a complex calculation based on a number of queries and a lot of prior processing, so Weight will provide zero benefit as an index.
Another note: even for #2 you have to calculate the weight change of all users in order to determine which have gained the most. Therefore, unless you have a very large number of readings per user, you will read most of the table (i.e. a table scan will be used to obtain the bulk of the data).
Where indexes can benefit:
You are trying to identify specific Foodbar rows based on User_id and Created_at.
You are also joining back to the Foodbar table again using User_id and Created_at.
This implies an index on (User_id, Created_at) would be useful (more so if it is the clustered index).
4: No; unfortunately it is mathematically impossible to determine the ordering of a product from the individual values H and W. E.g. take H=3, W=3 versus H=5, W=1: 3 is less than 5, yet the product 3*3 = 9 is greater than 5*1 = 5.
You would have to actually store the calculation and put an index on that additional column. However, as indicated in my answer to #3 above, it is still unlikely to prove beneficial.
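As an aside: the question targets MySQL 5.1, which has no computed columns, but MySQL 5.7+ supports exactly this through generated columns. A sketch (the height column is the hypothetical one from question 4):
ALTER TABLE foodbar ADD COLUMN height DOUBLE NOT NULL DEFAULT 0;
ALTER TABLE foodbar ADD COLUMN height_weight DOUBLE AS (height * weight) STORED;
ALTER TABLE foodbar ADD INDEX ix_height_weight (height_weight);

-- A range filter on the product can then use the index:
SELECT * FROM foodbar WHERE height_weight > 100;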