How to optimize an intersection date range query in postgres - sql

I have the following table
 #  column      type          nullable  default
 1  id          int4          NO        nextval('users_id_seq'::regclass)
 2  first_name  varchar(100)  YES
 3  last_name   varchar(100)  YES
 4  start       timestamptz   YES
 5  end         timestamptz   YES
 6  bio         text          YES
I've tried creating the following indexes
create index foo on users (start);
create index foo2 on users ("start", "end");
I've created 1 million random rows. I would like to optimize this query, or at least I would like it to use some indexes
explain select * from "users" where "start" < '2021-09-24 01:00:00.000' and "end" > '2018-09-24 01:00:00.000'
Original query was wrong as many commenters pointed out.
explain select * from "users" where "start" < '2021-09-24 01:00:00.000' and '2021-08-24 01:00:00.000' < "end" ;
The output is
Gather (cost=1000.00..50520.72 rows=52278 width=238)
Workers Planned: 2
-> Parallel Seq Scan on users (cost=0.00..44292.92 rows=21782 width=238)
" Filter: ((start < '2021-09-24 01:00:00+00'::timestamp with time zone) AND ('2021-08-24 01:00:00+00'::timestamp with time zone < ""end""))"
It runs in 2.5 s; however, my goal is not really to optimize speed. I'm more interested in how the indexes work. Specifically, I can see it uses a sequential scan. I'm curious what kind of index would actually be suited to such a query. As far as I can see, a B+tree would not really help in this situation, as it's not really possible to encode a date range efficiently into a B+tree; I think it reduces to this problem. I've read a little about covering indexes, but I don't think they apply here either?
Maybe I just haven't inserted enough rows though for this to really trouble postgres?

PostgreSQL supports range data types, including timestamp ranges, e.g. TSTZRANGE. Range types support GiST indexing (https://www.postgresql.org/docs/current/btree-gist.html) and are commonly used with exclusion constraints to prevent timeframes from overlapping, e.g. when booking time slots for your dentist.
Since your query is essentially of the type "give me all timeframes which overlap the given timeframe", a range type with GiST indexing looks like a perfect fit.
So instead of the two columns start and end you will have one column holding the range [start, end), meaning that the time interval is closed on the left and open on the right (start is included in the timeframe, but end is excluded from it).
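A minimal sketch of that approach, assuming the table were defined with a single range column (the column name during and the index name are made up for this sketch):

-- one range column replaces the separate "start" and "end" columns; stored as [start, end)
CREATE TABLE users (
    id         serial PRIMARY KEY,
    first_name varchar(100),
    last_name  varchar(100),
    during     tstzrange,
    bio        text
);

-- GiST index on the range column
CREATE INDEX users_during_gist ON users USING gist (during);

-- overlap query using the && operator; the planner can use the GiST index for this
SELECT *
FROM   users
WHERE  during && tstzrange('2021-08-24 01:00:00+00', '2021-09-24 01:00:00+00');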

Related

PostgreSQL - Optimizing query performance

I'm analyzing my queries' performance using New Relic, and this one in particular is taking a long time to complete:
SELECT "events".*
FROM "events"
WHERE ("events"."deleted_at" IS NULL AND
"events"."eventable_id" = $? AND
"events"."eventable_type" = $? OR
"events"."deleted_at" IS NULL AND
"events"."eventable_id" IN (SELECT "flow_recipients"."id" FROM "flow_recipients" WHERE "flow_recipients"."contact_id" = $?) AND "events"."eventable_type" = $?)
ORDER BY "events"."created_at" DESC
LIMIT $? OFFSET $?
Sometimes this query takes more than 8 seconds to complete, and I can't understand why. I have taken a look at the query explain, but I'm not sure I understand it:
Is there something wrong with my indexes? Is there something I can optimize? How could I further investigate what's going on?
I suspect that the fact that I'm using SELECT events.* instead of selecting only the columns I'm interested in could have some impact, but I'm using a LIMIT of 15, so I'm not sure it would impact that much.
[EDIT]
I have an index on created_at column and another index on eventable_id and eventable_type columns. Apparently, this second index is not being used, and I don't know why.
The cause of the long execution time is that the optimizer hopes it can find enough matching rows quickly by scanning all rows in sort order and picking out those that match the condition, but the executor actually has to scan 630835 rows until it finds enough matching rows.
For every row that is being examined, the subselect is executed.
You should rewrite that OR to a UNION:
SELECT * FROM events
WHERE deleted_at IS NULL
AND eventable_id = $?
AND eventable_type = $?
UNION
SELECT * FROM events e
WHERE deleted_at IS NULL
AND eventable_type = $?
AND EXISTS (SELECT 1
FROM flow_recipients f
WHERE f.id = e.eventable_id
AND f.contact_id = $?);
This query does the same thing if events has a primary key.
Useful indexes depend on the execution plan chosen, but these ones might be good:
CREATE INDEX ON events (eventable_type, eventable_id)
WHERE deleted_at IS NULL;
CREATE INDEX ON flow_recipients (contact_id);
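To check whether the planner actually uses those indexes, one option is to run the rewritten query under EXPLAIN (ANALYZE) with the parameters bound to representative literals (the values below are placeholders of my own, not from the original query):

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM events
WHERE deleted_at IS NULL
  AND eventable_id = 123                 -- placeholder value for $?
  AND eventable_type = 'SomeType'        -- placeholder value for $?
UNION
SELECT * FROM events e
WHERE deleted_at IS NULL
  AND eventable_type = 'SomeType'
  AND EXISTS (SELECT 1
              FROM flow_recipients f
              WHERE f.id = e.eventable_id
                AND f.contact_id = 456); -- placeholder value for $?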

how to speed up max() query

In PostgreSql 8.4 query
explain analyze SELECT
max( kuupaev||kellaaeg ) as res
from ALGSA
where laonr=1 and kuupaev <='9999-12-31' and
kuupaev||kellaaeg <= '9999-12-3123 59'
Takes 3 seconds to run:
"Aggregate (cost=3164.49..3164.50 rows=1 width=10) (actual time=2714.269..2714.270 rows=1 loops=1)"
" -> Seq Scan on algsa (cost=0.00..3110.04 rows=21778 width=10) (actual time=0.105..1418.743 rows=70708 loops=1)"
" Filter: ((kuupaev <= '9999-12-31'::date) AND (laonr = 1::numeric) AND ((kuupaev || (kellaaeg)::text) <= '9999-12-3123 59'::text))"
"Total runtime: 2714.363 ms"
How to speed it up in PostgreSQL 8.4.4 ?
Table structure is below.
The algsa table has an index on kuupaev; maybe this can be used?
Or is it possible to change the query, or add some other index, to make it fast? Existing columns in the table cannot be changed.
CREATE TABLE firma1.algsa
(
id serial NOT NULL,
laonr numeric(2,0),
kuupaev date NOT NULL,
kellaaeg character(5) NOT NULL DEFAULT ''::bpchar,
... other columns
CONSTRAINT algsa_pkey PRIMARY KEY (id),
CONSTRAINT algsa_id_check CHECK (id > 0)
);
CREATE INDEX algsa_kuupaev_idx ON firma1.algsa USING btree (kuupaev);
Update
Tried analyze verbose firma1.algsa;
INFO: analyzing "firma1.algsa"
INFO: "algsa": scanned 1640 of 1640 pages, containing 70708 live rows and 13 dead rows; 30000 rows in sample, 70708 estimated total rows
Query returned successfully with no result in 1185 ms.
but query run time was still 2.7 seconds.
Why are there 30000 rows in the sample? Isn't that too many; should it be decreased?
This was a known issue in old versions of PostgreSQL - but it looks like it might've been resolved by 8.4; in fact, the docs for 8.0 have the caveat but the docs for 8.1 do not.
So you don't need to upgrade major versions for this reason, at least. You should however upgrade to the current 8.4 series release 8.4.16, as you're missing several years worth of bug fixes and tweaks.
The real problem here is that you're using max on an expression, not a simple value, and there's no functional index for that expression.
You could try creating an index on the expression kuupaev||kellaaeg ... but I suspect you have data model problems, and that there's a better solution by fixing your data model.
It looks like kuupaev is kuupäev, or date, and kellaaeg might be time. If so: never use the concatenation (||) operator for combining dates and times; use interval addition, eg kuupaev + kellaaeg. Instead of char you should be using the data type time or interval with a CHECK constraint for kellaaeg, depending on what it means and whether it's limited to 24 hours or not. Or, better still, use a single field of type timestamp (for local time) or timestamp with time zone (for global time) to store the combined date and time.
If you do this, you can create a simple index on the combined column that replaces both kellaaeg and kuupaev and use that for min and max among other things. If you need just the date part or just the time part for some things, use the date_trunc, extract and date_part functions; see the documentation.
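A sketch of that data-model change, assuming kellaaeg holds values like '23 59' (hours and minutes separated by a space); the new column name algaeg and the index name are made up:

-- add a combined timestamp column and backfill it from the two existing columns
ALTER TABLE firma1.algsa ADD COLUMN algaeg timestamp;

UPDATE firma1.algsa
SET    algaeg = kuupaev + replace(kellaaeg, ' ', ':')::interval;

-- index that matches the original filter on laonr plus the max()
CREATE INDEX algsa_laonr_algaeg_idx ON firma1.algsa (laonr, algaeg);

-- the query then becomes a plain, index-friendly max()
SELECT max(algaeg) AS res
FROM   firma1.algsa
WHERE  laonr = 1;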
See this earlier answer for another example of where separate date and time columns are a bad idea.
You should still plan an upgrade to 9.2. The upgrade path from 8.4 to 9.2 isn't too rough, you really just have to watch out for the setting of standard_conforming_strings on by default and the change of bytea_output from escape to hex. Both can be set back to the 8.4 defaults during transition and porting work. 8.4 won't be supported for much longer.
My first instinct would be to try an index:
create index algsa_laonr_kuupaev_kellaaeg_idx
on ALGSA (laonr asc, (kuupaev||kellaaeg) desc)
... and try the query as:
SELECT kuupaev||kellaaeg as res
from ALGSA
where laonr=1 and
kuupaev||kellaaeg <= '9999-12-3123 59'
order by
laonr asc,
kuupaev||kellaaeg desc
limit 1

Is it faster to check that a Date is (not) NULL or compare a bit to 1/0?

I'm just wondering what is faster in SQL (specifically SQL Server).
I could have a nullable column of type Date and compare that to NULL, or I could have a non-nullable Date column and a separate bit column, and compare the bit column to 1/0.
Is the comparison to the bit column going to be faster?
In order to check that a column IS NULL SQL Server would actually just check a bit anyway. There is a NULL BITMAP stored for each row indicating whether each column contains a NULL or not.
I just did a simple test for this:
DECLARE @d DATETIME
       ,@b BIT = 0
SELECT 1
WHERE @d IS NULL
SELECT 2
WHERE @b = 0
The actual execution plan results show the computation as exactly the same cost relative to the batch.
Maybe someone can tear this apart, but to me it seems there's no difference.
MORE TESTS
SET DATEFORMAT ymd;
CREATE TABLE #datenulltest
(
dteDate datetime NULL
)
CREATE TABLE #datebittest
(
dteDate datetime NOT NULL,
bitNull bit DEFAULT (1)
)
INSERT INTO #datenulltest ( dteDate )
SELECT CASE WHEN CONVERT(bit, number % 2) = 1 THEN '2010-08-18' ELSE NULL END
FROM master..spt_values
INSERT INTO #datebittest ( dteDate, bitNull )
SELECT '2010-08-18', CASE WHEN CONVERT(bit, number % 2) = 1 THEN 0 ELSE 1 END
FROM master..spt_values
SELECT 1
FROM #datenulltest
WHERE dteDate IS NULL
SELECT 2
FROM #datebittest
WHERE bitNull = CONVERT(bit, 1)
DROP TABLE #datenulltest
DROP TABLE #datebittest
dteDate IS NULL result:
bitNull = 1 result:
OK, so this extended test comes up with the same responses again.
We could do this all day - it would take some very complex query to find out which is faster on average.
All other things being equal, I would say the bit would be faster because it is a "smaller" data type. However, if performance is very important here (and I assume it is, because of the question) then you should always do testing, as there may be other factors such as indexes and caching that affect this.
It sounds like you are trying to decide on a data type for a field which will record whether an event X has happened or not. So, either a timestamp (when X happened) or just a bit (1 if X happened, otherwise 0). In this case I would be tempted to go for the date, as it gives you more information (not only whether X happened, but also exactly when), which will most likely be useful in the future for reporting purposes. Only go against this if the minor performance gain really is more important.
Short answer: if you have only 1s and 0s, something like a bitmap index is very fast. NULLs are not indexed on certain SQL engines, so IS NULL and IS NOT NULL checks are slow. However, do think of the entity semantics before dishing this out. It is always better to have a semantic table definition, if you know what I mean.
The speed comes from the ability to use indexes, not from the data size in this case.
Edit
Please refer to Martin Smith's answer. That makes more sense for SQL Server; I got carried away with Oracle, my mistake here.
The bit will be faster, as loading the bit into memory will load only 1 byte while loading the date will take 8 bytes. The comparison itself will take the same time, but the loading from disk will take more time. Unless you use a very old server or need to load more than 10^8 rows, you won't notice anything.

Optimizing MySQL statement with a lot of count(row) and sum(row+row2)

I need to use the InnoDB storage engine on a table with about 1 million or so records in it at any given time. It has records being inserted into it at a very fast rate, which are then dropped within a few days, maybe a week. The ping table has about a million rows, whereas the website table has only about 10,000.
My statement is this:
select url
from website ws, ping pi
where ws.idproxy = pi.idproxy and pi.entrytime > curdate() - 3 and contentping+tcpping is not null
group by url
having sum(contentping+tcpping)/(count(*)-count(errortype)) < 500 and count(*) > 3 and
count(errortype)/count(*) < .15
order by sum(contentping+tcpping)/(count(*)-count(errortype)) asc;
I added an index on entrytime, yet no dice. Can anyone throw me a bone as to what I should look into for basic optimization of this query? The result set is only around 200 rows, so I'm not getting killed there.
In the absence of the schemas of the relations, I'll have to make some guesses.
If you're making WHERE a.attrname = b.attrname clauses, that cries out for a JOIN instead.
Using COUNT(*) is both redundant and sometimes less efficient than COUNT(some_specific_attribute). The primary key is a good candidate.
Why would you test contentping+tcpping IS NOT NULL, asking for a calculation that appears unnecessary, instead of just testing whether the attributes individually are null?
Here's my attempt at an improvement:
SELECT url
FROM website AS ws
JOIN ping AS pi
ON ws.idproxy = pi.idproxy
WHERE
pi.entrytime > CURDATE() - 3
AND pi.contentping IS NOT NULL
AND pi.tcpping IS NOT NULL
GROUP BY url
HAVING
SUM(pi.contentping + pi.tcpping) / (COUNT(pi.idproxy) - COUNT(pi.errortype)) < 500
AND COUNT(pi.idproxy) > 3
AND COUNT(pi.errortype) / COUNT(pi.idproxy) < 0.15
ORDER BY
SUM(pi.contentping + pi.tcpping) / (COUNT(pi.idproxy) - COUNT(pi.errortype)) ASC;
Performing lots of identical calculations in both the HAVING and ORDER BY clauses will likely be costing you performance. You could either put them in the SELECT clause, or create a view that has those calculations as attributes and use that view for accessing the values.
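A sketch of the first option, reusing the tables and expressions from the query above; MySQL lets HAVING and ORDER BY refer to SELECT aliases, and the date filter is written with an explicit INTERVAL:

SELECT url,
       SUM(pi.contentping + pi.tcpping)
           / (COUNT(pi.idproxy) - COUNT(pi.errortype)) AS avg_ping,
       COUNT(pi.idproxy)                               AS pings,
       COUNT(pi.errortype) / COUNT(pi.idproxy)         AS error_rate
FROM website AS ws
JOIN ping AS pi ON ws.idproxy = pi.idproxy
WHERE pi.entrytime > CURDATE() - INTERVAL 3 DAY
  AND pi.contentping IS NOT NULL
  AND pi.tcpping IS NOT NULL
GROUP BY url
HAVING avg_ping < 500
   AND pings > 3
   AND error_rate < 0.15
ORDER BY avg_ping ASC;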

table design + SQL question

I have a table foodbar, created with the following DDL. (I am using mySQL 5.1.x)
CREATE TABLE foodbar (
id INT NOT NULL AUTO_INCREMENT,
user_id INT NOT NULL,
weight double not null,
created_at date not null,
PRIMARY KEY (id)  -- MySQL requires an AUTO_INCREMENT column to be indexed
);
I have four questions:

1. How may I write a query that returns a result set that gives me the following information: user_id, weight_gain, where weight_gain is the difference between a weight and a weight that was recorded 7 days ago?

2. How may I write a query that will return the top N users with the biggest weight gain (again, say, over a week)? An 'obvious' way may be to use the query obtained in question 1 above as a subquery, but somehow picking the top N.

3. Since in question 2 (and indeed question 1) I am searching the records in the table using a calculated field, indexing would be preferable to optimise the query - however, since it is a calculated field, it is not clear which field to index (I'm guessing the 'weight' field is the one that needs indexing). Am I right in that assumption?

4. Assuming I had another field in the foodbar table (say 'height') and I wanted to select records from the table based on (say) the product (i.e. multiplication) of 'height' and 'weight' - would I be right in assuming again that I need to index 'height' and 'weight'? Do I also need to create a composite key (say (height, weight))? If this question is not clear, I would be happy to clarify.
I don't see why you should need the synthetic key, so I'll use this table instead:
CREATE TABLE foodbar (
user_id INT NOT NULL
, created_at date not null
, weight double not null
, PRIMARY KEY (user_id, created_at)
);
How may I write a query that returns a result set that gives me the following information: user_id, weight_gain where weight_gain is the difference between a weight and a weight that was recorded 7 days ago.
SELECT curr.user_id, curr.weight - prev.weight
FROM foodbar curr, foodbar prev
WHERE curr.user_id = prev.user_id
AND curr.created_at = CURRENT_DATE
AND prev.created_at = CURRENT_DATE - INTERVAL '7 days'
;
the date arithmetic syntax is probably wrong but you get the idea
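For reference, a version of the same idea with MySQL's interval syntax, still assuming exactly one reading per user per day:

SELECT curr.user_id,
       curr.weight - prev.weight AS weight_gain
FROM   foodbar curr
JOIN   foodbar prev
  ON   prev.user_id    = curr.user_id
 AND   prev.created_at = curr.created_at - INTERVAL 7 DAY
WHERE  curr.created_at = CURDATE();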
How may I write a query that will return the top N users with the biggest weight gain (again say over a week).? An 'obvious' way may be to use the query obtained in question 1 above as a subquery, but somehow picking the top N.
see above, add ORDER BY curr.weight - prev.weight DESC and LIMIT N
For the last two questions: don't speculate, examine execution plans (PostgreSQL has EXPLAIN ANALYZE; dunno about MySQL). You'll probably find you need to index the columns that participate in WHERE and JOIN, not the ones that form the result set.
I think that "just somebody" covered most of what you're asking, but I'll just add that indexing columns that take part in a calculation is unlikely to help you at all unless it happens to be a covering index.
For example, it doesn't help to order the following rows by X, Y if I want to get them in the order of their product X * Y:
X Y
1 8
2 2
4 4
The products would order them as:
X Y Product
2 2 4
1 8 8
4 4 16
If mySQL supports calculated columns in a table and allows indexing on those columns then that might help.
I agree with just somebody regarding the primary key, but for what you're asking regarding the weight calculation, you'd be better off storing the delta rather than the weight:
CREATE TABLE foodbar (
user_id INT NOT NULL,
created_at date not null,
weight_delta double not null,
PRIMARY KEY (user_id, created_at)
);
It means you'd store the user's initial weight in, say, the user table, and when you write records to the foodbar table, a user could supply the weight at that time, but the query would subtract the initial weight from the current weight. So you'd see values like:
user_id weight_delta
------------------------
1 2
1 5
1 -3
Looking at that, you know that user 1 gained 4 pounds/kilos/stones/etc.
This way you could use SUM, because it's possible for someone to have weighings every day - using just somebody's equation of curr.weight - prev.weight wouldn't work, regardless of time span.
Getting the top x is easy in MySQL - use the LIMIT clause, but mind that you provide an ORDER BY to make sure the limit is applied correctly.
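Putting those two points together, a sketch of the top N gainers over the last week using the weight_delta column:

SELECT   user_id,
         SUM(weight_delta) AS weight_gain
FROM     foodbar
WHERE    created_at >= CURDATE() - INTERVAL 7 DAY
GROUP BY user_id
ORDER BY weight_gain DESC
LIMIT    10;   -- top 10; substitute your N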
It's not obvious, but there's some important information missing from the problem you're trying to solve. It becomes more noticeable when you think about realistic data going into this table. The problem is that you're unlikely to have a consistent, regular daily record of users' weights. So you need to clarify a couple of rules around determining 'current weight' and 'weight x days ago'. I'm going to assume the following simplistic rules:
The most recent weight reading is the 'current-weight'. (Even though that could be months ago.)
The most recent weight reading more than x days ago will be the weight assumed at x days ago. (Even though for example a reading from 6 days ago would be more reliable than a reading from 21 days ago when determining weight 7 days ago.)
Now to answer the questions:
1&2: Using the above extra rules provides an opportunity to produce two result sets: current weights, and previous weights:
Current weights:
select rd.*,
w.Weight
from (
select User_id,
max(Created_at) AS Read_date
from Foodbar
group by User_id
) rd
inner join Foodbar w on
w.User_id = rd.User_id
and w.Created_at = rd.Read_date
Similarly for the x days ago reading:
select rd.*,
w.Weight
from (
select User_id,
max(Created_at) AS Read_date
from Foodbar
where Created_at < DATEADD(dd, -7, GETDATE()) /*Or appropriate MySql equivalent*/
group by User_id
) rd
inner join Foodbar w on
w.User_id = rd.User_id
and w.Created_at = rd.Read_date
Now simply join these results as subqueries
select cur.User_id,
cur.Weight as Cur_weight,
prev.Weight as Prev_weight,
cur.Weight - prev.Weight as Weight_change
from (
/*Insert query #1 here*/
) cur
inner join (
/*Insert query #2 here*/
) prev on
prev.User_id = cur.User_id
If I remember correctly the MySql syntax to get the top N weight gains would be to simply add:
ORDER BY cur.Weight - prev.Weight DESC limit N
2&3: Choosing indexes requires a little understanding of how the query optimiser will process the query:
The important thing when it comes to index selection is which columns you are filtering by or joining on. The optimiser will use an index if it is determined to be selective enough (note that sometimes your filters have to be extremely selective, returning < 1% of the data, to be considered useful). There's always a trade-off between the slow disk seek times of navigating indexes and simply processing all the data in memory.
3: Although weights feature significantly in what you display, their only relevance in terms of filtering (or selection) is in #2, to get the top N weight gains. That is a complex calculation based on a number of queries and a lot of processing that has gone before, so Weight will provide zero benefit as an index.
Another note is that even for #2 you have to calculate the weight change of all users in order to determine which have gained the most. Therefore, unless you have a very large number of readings per user, you will read most of the table (i.e. a table scan will be used to obtain the bulk of the data).
Where indexes can benefit:
You are trying to identify specific Foodbar rows based on User_id and Created_at.
You are also joining back to the Foodbar table again using User_id and Created_at.
This implies an index on User_id, Created_at would be useful (more so if this is the clustered index).
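A minimal sketch of that index (or, as the first answer suggests, simply make (user_id, created_at) the primary key); the index name is arbitrary:

CREATE INDEX idx_foodbar_user_created ON foodbar (user_id, created_at);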
4: No, unfortunately it is mathematically impossible to determine the ordering of the product from the ordering of the individual values H and W. E.g. H=3 and W=3 are both less than H=5, yet the product 3*3 = 9 is greater than 5*1 = 5.
You would have to actually store the calculation and put an index on that additional column. However, as indicated in my answer to #3 above, it is still unlikely to prove beneficial.
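If moving to a MySQL version with generated columns (5.7+) is ever an option, here is a sketch of storing and indexing the product. It assumes the hypothetical height column from question 4; the hw_product column and index names are made up:

-- stored generated column holding the product
ALTER TABLE foodbar
    ADD COLUMN hw_product DOUBLE AS (height * weight) STORED;

CREATE INDEX idx_foodbar_hw_product ON foodbar (hw_product);

-- filters on the product can then use the index
SELECT user_id, created_at
FROM   foodbar
WHERE  hw_product > 100;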