Speeding up PostgreSQL query where data is between two dates - sql

I have a large table (> 50m rows) which has some data with an ID and timestamp:
id, timestamp, data1, ..., dataN
...with a multi-column index on (id, timestamp).
I need to query the table to select all rows with a certain ID where the timestamp is between two dates, which I am currently doing using:
SELECT * FROM mytable WHERE id = x AND timestamp BETWEEN y AND z
This currently takes over 2 minutes on a high-end machine (2x 3 GHz dual-core Xeons w/HT, 16 GB RAM, 2x 1 TB drives in RAID 0), and I'd really like to speed it up.
I have found this tip which recommends using a spatial index, but the example it gives is for IP addresses. However, the speed increase (436s to 3s) is impressive.
How can I use this with timestamps?

That tip is only suitable when you have two columns A and B and use queries like:
where 'a' between A and B
not:
where A between 'a' and 'b'
Using an index on date(column) rather than on the raw column could speed things up a little.
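For completeness, a hedged sketch of such an expression index - it only helps if the query itself filters on the day-truncated value, and it assumes the column is a timestamp without time zone (otherwise the cast to date is not immutable and the index can't be created):
-- Sketch only: index the date part of the timestamp column alongside the id.
CREATE INDEX mytable_id_date_idx ON mytable (id, ("timestamp"::date));

SELECT *
FROM mytable
WHERE id = 42                                   -- placeholder id
  AND "timestamp"::date BETWEEN DATE '2010-01-01' AND DATE '2010-01-31';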

Could you EXPLAIN the query for us? Then we know how the database executes your query. And what about the configuration? What are the settings for shared_buffers and work_mem? When did you (or your system) last run VACUUM and ANALYZE? And finally, what OS and PostgreSQL version are you using?
You can create wonderful indexes, but without proper settings the database can't use them very efficiently.
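For reference, one way to gather that information (placeholder id and dates; adjust to your own query):
EXPLAIN ANALYZE
SELECT * FROM mytable
WHERE id = 42 AND "timestamp" BETWEEN '2010-01-01' AND '2010-02-01';

SHOW shared_buffers;
SHOW work_mem;
SELECT version();

-- Last (auto)vacuum / (auto)analyze of the table:
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'mytable';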

Make sure the index is on (TableID, TableTimestamp), and that you write the query like:
SELECT
....
FROM YourTable
WHERE TableID=..YourID..
AND TableTimestamp>=..startrange..
AND TableTimestamp<=..endrange..
If you apply functions to the table's TableTimestamp column in the WHERE clause, you will not be able to fully use the index.
If you are already doing all of this, then your hardware might not be up to the task.
If you are using version 8.2 or later, you should try:
WHERE (TableID, TableTimestamp) >= (..YourID.., ..startrange.. )
and (TableID, TableTimestamp) <= (..YourID.., ..endrange..)
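A minimal sketch of that row-wise form, assuming the (id, timestamp) index from the question and placeholder values:
EXPLAIN ANALYZE
SELECT *
FROM mytable
WHERE (id, "timestamp") >= (42, TIMESTAMP '2010-01-01')
  AND (id, "timestamp") <= (42, TIMESTAMP '2010-02-01');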

Related

Correct way to create one index with two columns in Timescaledb

I use a PostgreSQL extension called TimescaleDB. I have a table called timestampdb. I want to ask if this is the right way to create one index with two columns (timestamp1, id13).
Some of the queries I use look like this:
select * from timestampdb where timestamp1 >= '2020-01-01 00:05:00' and timestamp1 <= '2020-01-02 00:05:00' and id13 > '5'
select date_trunc('hour', timestamp1) as hour, avg(id13) from timestampdb where timestamp1 >= '2020-01-01 00:05:00' and timestamp1 <= '2020-01-02 00:05:00' group by hour order by hour
select date_trunc('hour', timestamp1) as hour, max(id13) from timestampdb where timestamp1 <= '2020-01-01 00:05:00' group by hour order by hour desc limit 5
After creating the table I do this:
SELECT create_hypertable('timestampdb', 'timestamp1'); and then CREATE INDEX ON timestampdb (timestamp1, id13);
Is this the proper way? Will this create one index with two columns? Or one index on timestamp1 and one index on (timestamp1, id13)?
Yes, this is the proper way. The call shown will indeed create one combined index on those two columns. You want to make sure that the dominant index column (i.e. the first one) is the time column, which it is in your code. That way TimescaleDB queries still find your data first by time (which is pretty important on large data sets), and your queries match that index too: they primarily search by time range.
You might want to check how Postgres executes your queries by running
EXPLAIN ANALYZE <your query>;
or by using pgAdmin and clicking the "Explain" button.
That way you can make sure your indexes are hit and see whether Postgres has enough shared buffers to cache the table pages or needs to read from disk (which is slower by effectively a factor of 1000 to 10000).
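For example, with the first query from the question (the BUFFERS option shows whether pages came from the cache or had to be read from disk):
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM timestampdb
WHERE timestamp1 >= '2020-01-01 00:05:00'
  AND timestamp1 <= '2020-01-02 00:05:00'
  AND id13 > '5';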
I always find those resources helpful:
TSDB YT Channel: https://youtube.com/c/TimescaleDB
TSDB Slack Channel: https://slack.timescale.com/

GCP: select query with unnest from array has very big process data to run compared to hardcoded values

In BigQuery on GCP, I am trying to grab some data from a table where the date is the same as a date in a list of values I have. If I hardcode the list of values in the select, it is vastly cheaper to run than if I use a temp structure like an array.
Is there a way to use the temp structure but avoid the enormous processing cost?
Why is it so expensive for something small and simple like this?
Please see the examples below:
1/ Array structure example: this query processes 144.8 GB
WITH
get_a AS (
  SELECT
    GENERATE_DATE_ARRAY('2000-01-01', '2000-01-02') AS array_of_dates
)
SELECT
  a.heading AS title,
  a.ingest_time AS proc_date
FROM
  `view_a.events` AS a,
  get_a AS b,
  UNNEST(b.array_of_dates) AS c
WHERE
  c IN (CAST(a.ingest_time AS DATE))
2/ Hardcoded example: this query processes 936.5 MB, over 154x less?
SELECT
  a.heading AS title,
  a.ingest_time AS proc_date
FROM
  `view_a.events` AS a
WHERE
  CAST(a.ingest_time AS DATE) IN ('2000-01-01', '2000-01-02')
Presumably, your view_a.events table is partitioned by the ingest_time.
The issue is that partition pruning is very conservative (buggy?). With the direct comparisons, BigQuery is smart enough to recognize exactly which partitions are used for the query. But with the generated version, BigQuery is not able to figure this out, so the entire table needs to be read.
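One thing worth testing (a sketch only; whether the partition filter gets pushed down depends on BigQuery's pruning rules, so check the estimated bytes with a dry run first): materialize the date list into a scripting variable and filter with IN UNNEST instead of joining against the unnested array.
-- Hypothetical workaround; verify the bytes processed before relying on it.
DECLARE date_list ARRAY<DATE> DEFAULT GENERATE_DATE_ARRAY('2000-01-01', '2000-01-02');

SELECT
  a.heading AS title,
  a.ingest_time AS proc_date
FROM
  `view_a.events` AS a
WHERE
  CAST(a.ingest_time AS DATE) IN UNNEST(date_list);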

Poor DB Performance when using ORDER BY

I'm working with a non-profit that is mapping out solar potential in the US. Needless to say, we have a ridiculously large PostgreSQL 9 database. Running a query like the one shown below is speedy until the ORDER BY line is uncommented, in which case the same query takes forever to run (185 ms without sorting compared to 25 minutes with it). What steps should be taken to ensure this and other queries run in a more manageable and reasonable amount of time?
select A.s_oid, A.s_id, A.area_acre, A.power_peak, A.nearby_city, A.solar_total
from global_site A cross join na_utility_line B
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
--order by A.area_acre
offset 0 limit 11;
The sort is not the problem - in fact the CPU and memory cost of the sort is close to zero, since Postgres has a top-N sort: the result set is scanned while maintaining a small sort buffer that holds only the top N rows.
select count(*) from (1 million row table) -- 0.17 s
select * from (1 million row table) order by x limit 10; -- 0.18 s
select * from (1 million row table) order by x; -- 1.80 s
So you see the top-10 sort only adds 10 ms to a dumb, fast count(*), versus a lot longer for a real sort. That's a very neat feature; I use it a lot.
OK, now without EXPLAIN ANALYZE it's impossible to be sure, but my feeling is that the real problem is the cross join. Basically you're filtering the rows in both tables using:
where (A.power_peak between 1.0 AND 100.0)
and A.area_acre >= 500
and A.solar_avg >= 5.0
AND A.pc_num <= 1000
and (A.fips_level1 = '06' AND A.fips_country = 'US' AND A.fips_level2 = '025')
and B.volt_mn_kv >= 69
and B.fips_code like '%US06%'
and B.status = 'active'
OK. I don't know how many rows are selected in both tables (only EXPLAIN ANALYZE would tell), but it's probably significant. Knowing those numbers would help.
Then we get the worst-case CROSS JOIN condition ever:
and ST_within(ST_Centroid(A.wkb_geometry), ST_Buffer((B.wkb_geometry), 1000))
This means all rows of A are matched against all rows of B (so this expression is going to be evaluated a large number of times), using a bunch of pretty complex, slow, CPU-intensive functions.
Of course it's horribly slow!
When you remove the ORDER BY, postgres just comes up (by chance ?) with a bunch of matching rows right at the start, outputs those, and stops since the LIMIT is reached.
Here's a little example:
Tables a and b are identical and contain 1000 rows, and a column of type BOX.
select * from a cross join b where (a.b && b.b) --- 0.28 s
Here 1000000 box overlap (operator &&) tests are completed in 0.28s. The test data set is generated so that the result set contains only 1000 rows.
create index a_b on a using gist(b);
create index b_b on b using gist(b);
select * from a cross join b where (a.b && b.b) --- 0.01 s
Here the index is used to optimize the cross join, and speed is ridiculous.
You need to optimize that geometry matching.
Add columns which will cache:
ST_Centroid(A.wkb_geometry)
ST_Buffer((B.wkb_geometry), 1000)
There is NO POINT in recomputing those slow functions a million times during your CROSS JOIN, so store the results in a column. Use a trigger to keep them up to date.
Add columns of type BOX which will cache:
Bounding Box of ST_Centroid(A.wkb_geometry)
Bounding Box of ST_Buffer((B.wkb_geometry), 1000)
Add GiST indexes on the BOXes.
Add a box-overlap test (using the && operator), which will use the indexes.
Keep your ST_Within, which will act as a final filter on the rows that pass the box test (see the sketch below).
Maybe you can just index the ST_Centroid and ST_Buffer columns... and use an (indexed) "contains" operator; see here:
http://www.postgresql.org/docs/8.2/static/functions-geometry.html
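A rough sketch of the caching approach above, using the table and column names from the question (the added column names and index names are made up, and the trigger maintenance is left out):
-- Cache the expensive geometry expressions (keep them fresh with triggers).
ALTER TABLE global_site ADD COLUMN centroid geometry;
UPDATE global_site SET centroid = ST_Centroid(wkb_geometry);

ALTER TABLE na_utility_line ADD COLUMN buffered geometry;
UPDATE na_utility_line SET buffered = ST_Buffer(wkb_geometry, 1000);

-- GiST indexes so the bounding-box test (&&) can use an index.
CREATE INDEX global_site_centroid_gist ON global_site USING gist (centroid);
CREATE INDEX na_utility_line_buffered_gist ON na_utility_line USING gist (buffered);

-- && is the cheap, indexed pre-filter; ST_Within stays as the exact final check.
SELECT A.s_oid, A.s_id, A.area_acre, A.power_peak, A.nearby_city, A.solar_total
FROM global_site A
JOIN na_utility_line B
  ON A.centroid && B.buffered
 AND ST_Within(A.centroid, B.buffered)
WHERE A.power_peak BETWEEN 1.0 AND 100.0
  -- ... remaining filters from the original query ...
ORDER BY A.area_acre
LIMIT 11;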
I would suggest creating an index on area_acre. You may want to take a look at the following: http://www.postgresql.org/docs/9.0/static/sql-createindex.html
I would recommend doing this sort of thing during off-peak hours, though, because it can be somewhat intensive with a large amount of data. One thing you will also have to look at with indexes is rebuilding them on a schedule to ensure performance over time. Again, this should happen outside of peak hours.
You may want to take a look at this article from a fellow SO'er and his experience with database slowdowns over time with indexes: Why does PostgresQL query performance drop over time, but restored when rebuilding index
If the A.area_acre field is not indexed that may slow it down. You can run the query with EXPLAIN to see what it is doing during execution.
First off, I would look at creating indexes, make sure your DB is being vacuumed, and increase the shared_buffers and work_mem settings for your install.
First thing to look at is whether you have an index on the field you're ordering by. If not, adding one will dramatically improve performance. I don't know postgresql that well but something similar to:
CREATE INDEX area_acre ON global_site(area_acre)
As noted in other replies, the indexing process is intensive when working with a large data set, so do this during off-peak.
I am not familiar with the PostgreSQL optimizations, but it sounds like what is happening when the query is run with the ORDER BY clause is that the entire result set is created, then it is sorted, and then the top 11 rows are taken from that sorted result. Without the ORDER BY, the query engine can just generate the first 11 rows in whatever order it pleases and then it's done.
Having an index on the area_acre field very possibly may not help for the sorting (ORDER BY) depending on how the result set is built. It could, in theory, be used to generate the result set by traversing the global_site table using an index on area_acre; in that case, the results would be generated in the desired order (and it could stop after generating 11 rows in the result). If it does not generate the results in that order (and it seems like it may not be), then that index will not help in sorting the results.
One thing you might try is to remove the "CROSS JOIN" from the query. I doubt that this will make a difference, but it's worth a test. Because a WHERE clause is involved joining the two tables (via ST_WITHIN), I believe the result is the same as an inner join. It is possible that the use of the CROSS JOIN syntax is causing the optimizer to make an undesirable choice.
Otherwise (aside from making sure indexes exist for fields that are being filtered), you could play a bit of a guessing game with the query. One condition that stands out is the area_acre >= 500. This means that the query engine is considering all rows that meet that condition. But then only the first 11 rows are taken. You could try changing it to area_acre >= 500 and area_acre <= somevalue. The somevalue is the guessing part that would need adjustment to make sure you get at least 11 rows. This, however, seems like a pretty cheesy thing to do, so I mention it with some reticence.
Have you considered creating expression-based indexes for the benefit of the hairier joins and WHERE conditions?
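If it helps, here's a sketch of what that could look like for the geometry expressions (the index name is made up; this only pays off if the query repeats the same expression with an indexable operator such as &&):
-- ST_Centroid is immutable, so an expression GiST index on it is allowed.
CREATE INDEX global_site_centroid_expr_gist
    ON global_site USING gist (ST_Centroid(wkb_geometry));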

SQL Performance: UNION or ORDER BY

The problem: we have a very complex search query. If its result yields too few rows, we expand the result by UNIONing the query with a less strict version of the same query.
We are discussing whether a different approach would be faster and/or better in quality. Instead of UNIONing, we would create a custom SQL function that returns a matching score, and then simply order by that score.
Regarding performance: will it be slower than a UNION?
We use PostgreSQL.
Any suggestions would be greatly appreciated.
Thank you very much
Max
A definitive answer can only be given if you measure the performance of both approaches in realistic environments. Everything else is guesswork at best.
There are so many variables at play here - the structure of the tables and the types of data in them, the distribution of the data, what kind of indices you have at your disposal, how heavy the load on the server is - it's almost impossible to predict any outcome, really.
So really - my best advice is: try both approaches, on the live system, with live data, not just with a few dozen test rows - and measure, measure, measure.
Marc
You want to order by the "return value" of your custom function? Then the database server can't use an index for that. The score has to be calculated for every record in the table (that hasn't been excluded by a WHERE clause) and stored in some temporary storage/table, and the ORDER BY is then performed on that temporary result. So this can easily end up slower than your UNION queries (depending on your UNION statements, of course).
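To illustrate the pattern being discussed (the table and the match_score function are hypothetical placeholders, not from the question): the ORDER BY on the computed score cannot use an index, so pre-filter as much as possible with indexable conditions.
-- match_score() and the candidates table are illustrative only.
SELECT c.*, match_score(c.title, c.body) AS score
FROM candidates c
WHERE c.category = 'books'          -- indexable pre-filter keeps the scored set small
ORDER BY score DESC
LIMIT 50;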
To add my little bit...
+1 to marc_s, completely agree with what he said - I would only add that you need a test DB server with realistic data volumes to test on, as opposed to the production server.
For the function approach, the function would be executed for each record and the rows then ordered by that result - this will not be an indexed column, so I'd expect to see a negative impact on performance. However, how big that impact is, and whether it is actually worse than the cumulative cost of the other approach, can only be known by testing.
In PostgreSQL 8.3 and below, UNION implied DISTINCT, which implied sorting. That means ORDER BY, UNION and DISTINCT were always of the same efficiency, since the latter two always used sorting.
On PostgreSQL 8.3, this query returns the sorted results:
SELECT *
FROM generate_series(1, 10) s
UNION
SELECT *
FROM generate_series(5, 15) s
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
Since PostgreSQL 8.4 it has been possible to use HashAggregate for UNION, which may be faster (and almost always is), but does not guarantee ordered output.
The same query returns the following on PostgreSQL 8.4:
SELECT *
FROM generate_series(1, 10) s
UNION
SELECT *
FROM generate_series(5, 15) s
10
15
8
6
7
11
12
2
13
5
4
1
3
14
9
As you can see, the results are not sorted.
PostgreSQL change list mentions this:
SELECT DISTINCT and UNION/INTERSECT/EXCEPT no longer always produce sorted output (Tom)
So in newer PostgreSQL versions, I'd advise using UNION, since it's more flexible.
In old versions, the performance will be the same.
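In other words, if you rely on the output being ordered, add an explicit ORDER BY rather than depending on how the UNION happens to be executed, e.g. for the example above:
SELECT *
FROM generate_series(1, 10) s
UNION
SELECT *
FROM generate_series(5, 15) s
ORDER BY 1;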

Creating a quicker MySQL Query

I'm trying to create a faster query; right now I have large databases. My table sizes are 5 columns, 530k rows, and 300 columns, 4k rows (sadly I have zero control over the architecture, otherwise I wouldn't be having this silly problem with a poor DB).
SELECT cast( table2.foo_1 AS datetime ) as date,
table1.*, table2.foo_2, foo_3, foo_4, foo_5, foo_6, foo_7, foo_8, foo_9, foo_10, foo_11, foo_12, foo_13, foo_14, foo_15, foo_16, foo_17, foo_18, foo_19, foo_20, foo_21
FROM table1, table2
WHERE table2.foo_0 = table1.foo_0
AND table1.bar1 >= NOW()
AND foo_20="tada"
ORDER BY
date desc
LIMIT 0,10
I've indexed table2.foo_0 and table1.foo_0, along with foo_20, in hopes that it would allow for faster querying. I'm still at nearly 7 seconds of load time. Is there something else I can do?
Cheers
I think an index on bar1 is the key. I always run into performance issues with dates because the comparison has to be made against each of the 530K rows.
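A minimal sketch of that index (the index name is made up):
CREATE INDEX ix_table1_bar1 ON table1 (bar1);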
Create the following indexes:
CREATE INDEX ix_table1_0_1 ON table1 (foo_1, foo_0)
CREATE INDEX ix_table2_20_0 ON table2 (foo_20, foo_0)
and rewrite your query like this:
SELECT cast( table2.foo_1 AS datetime ) as date,
table1.*, table2.foo_2, foo_3, foo_4, foo_5, foo_6, foo_7, foo_8, foo_9, foo_10, foo_11, foo_12, foo_13, foo_14, foo_15, foo_16, foo_17, foo_18, foo_19, foo_20, foo_21
FROM table1
JOIN table2
ON table2.foo_0 = table1.foo_0
AND table2.foo_20 = "tada"
WHERE table1.bar1 >= NOW()
ORDER BY
table1.foo_1 DESC
LIMIT 0, 10
The first index will be used for ORDER BY, the second one will be used for JOIN.
You, though, may benefit more from creating the first index like this:
CREATE INDEX ix_table1_0_1 ON table1 (bar1, foo_0)
which may apply more restrictive filtering on bar1.
I have a blog post on this:
Choosing index
, which advises on how to choose which index to create for cases like that.
Indexing table1.bar1 may improve the >= NOW() comparison.
A compound index on table2.foo_0 and table2.foo_20 will help.
An index on table2.foo_1 may help the sort.
Overall, pasting the output of your query with EXPLAIN prepended may also give some hints.
table2 needs a compound index on foo_0, foo_20, and bar1.
An index on table1.foo_0, table1.bar1 could help too, assuming that foo_20 belongs to table1.
See How to use MySQL indexes and Optimizing queries with explain.
Use compound indexes that correspond to your WHERE equalities (in general the leftmost columns in the index), WHERE comparisons to absolute values (middle), and the ORDER BY clause (rightmost, in the same order).
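A hypothetical illustration of that guideline (table and column names made up, not from the question): for a query like WHERE status = 'x' AND created_at >= NOW() ORDER BY created_at, the equality column goes leftmost and the comparison/ORDER BY column to its right.
-- Illustrative only.
CREATE INDEX ix_orders_status_created ON orders (status, created_at);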