How to optimize a query by index in PostgreSQL - sql

I want to fetch users that have 1 or more processed bets. I do this with the following SQL:
SELECT user_id FROM bets
WHERE bets.state in ('guessed', 'losed')
GROUP BY user_id
HAVING count(*) > 0;
But running EXPLAIN ANALYZE I noticed that no index is used and the query execution time is very high. I tried adding a partial index like:
CREATE INDEX processed_bets_index ON bets(state) WHERE state in ('guessed', 'losed');
But the EXPLAIN ANALYZE output did not change:
HashAggregate (cost=34116.36..34233.54 rows=9375 width=4) (actual time=235.195..237.623 rows=13310 loops=1)
Filter: (count(*) > 0)
-> Seq Scan on bets (cost=0.00..30980.44 rows=627184 width=4) (actual time=0.020..150.346 rows=626674 loops=1)
Filter: ((state)::text = ANY ('{guessed,losed}'::text[]))
Rows Removed by Filter: 20951
Total runtime: 238.115 ms
(6 rows)
There are only a few records with statuses other than (guessed, losed).
How do I create a proper index?
I'm using PostgreSQL 9.3.4.

I assume that state mostly consists of 'guessed' and 'losed', with maybe a few other states in there as well. So most probably the optimizer does not see the need to use the index, since it would still fetch most of the rows.
What you do need is an index on user_id, so perhaps something like this would work:
CREATE INDEX idx_bets_user_id_in_guessed_losed ON bets(user_id) WHERE state in ('guessed', 'losed');
Or, without using a partial index:
CREATE INDEX idx_bets_state_user_id ON bets(state, user_id);
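As a side note, the HAVING count(*) > 0 clause is redundant, since GROUP BY only produces groups that contain at least one row. A minimal sketch of the simplified query, which could be served by the partial index on user_id suggested above (whether the planner actually picks it still depends on the statistics):
SELECT DISTINCT user_id
FROM bets
WHERE state IN ('guessed', 'losed');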


Query Optimization with WHERE condition and a single JOIN

I have 2 tables with a one-to-many relationship.
Users -> 1 million rows (1)
Requests -> 10 million rows (n)
What I'm trying to do is fetch each user alongside the latest request they made, and be able to filter the whole dataset based on the (last) request's columns.
The current query fetches the correct results, but it is painfully slow: ~7-9 seconds.
SELECT *
FROM users AS u
INNER JOIN requests AS r
ON u.id = r.user_id
WHERE (r.created_at = u.last_request_date AND r.ignored = false)
ORDER BY u.last_request_date DESC
LIMIT 10 OFFSET 0
I have also tried moving the r.created_at comparison into a second ON condition instead of filtering in the WHERE clause, but without any difference in performance.
UPDATE:
Indexes:
Users: last_request_date
Requests: created_at, user_id(foreign)
Execution plan: https://explain.depesz.com/s/JsLr#source
Execution plan:
Limit (cost=1000.88..21080.19 rows=10 width=139) (actual time=15966.670..15990.322 rows=10 loops=1)
Buffers: shared hit=3962420 read=152361
-> Gather Merge (cost=1000.88..757990.77 rows=377 width=139) (actual time=15966.653..15990.138 rows=10 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=3962420 read=152361
-> Nested Loop (cost=0.86..756947.24 rows=157 width=139) (actual time=9456.384..10622.180 rows=7 loops=3)
Buffers: shared hit=3962420 read=152361
" -> Parallel Index Scan Backward using users_last_request_date on users ""User"" (cost=0.42..55742.72 rows=420832 width=75) (actual time=0.061..2443.484 rows=333340 loops=3)"
Buffers: shared hit=5102 read=15849
-> Index Scan using requests_user_id on requests (cost=0.43..1.66 rows=1 width=64) (actual time=0.010..0.010 rows=0 loops=1000019)
" Index Cond: (user_id = ""User"".id)"
" Filter: ((NOT ignored) AND (""User"".last_request_date = created_at))"
Rows Removed by Filter: 10
Buffers: shared hit=3957318 read=136512
Planning Time: 0.745 ms
Execution Time: 15990.489 ms
The biggest bottleneck in your execution plan was this part; the requests index might need one more column, created_at, because there is a filter cost:
-> Index Scan using requests_user_id on requests (cost=0.43..1.66 rows=1 width=64) (actual time=0.010..0.010 rows=0 loops=1000019)
" Index Cond: (user_id = ""User"".id)"
" Filter: ((NOT ignored) AND (""User"".last_request_date = created_at))"
Rows Removed by Filter: 10
Buffers: shared hit=3957318 read=136512
So you might try to create an index like the one below.
CREATE INDEX IX_requests ON requests (
user_id,
created_at
);
If ignored = false matches only a small portion of the requests table, you can try a partial index, which might help reduce storage and improve index performance.
CREATE INDEX FIX_requests ON requests (
user_id,
created_at
)
WHERE ignored = false;
As for the users table, I would use an index like the one below, because there is an ORDER BY on the last_request_date column and the users table joins the requests table by id:
CREATE INDEX IX_users ON users (
last_request_date,
id
);
NOTE
I would avoid using SELECT *, because it costs extra I/O and in most cases we do not need to select all columns from the table.
Try creating this BTREE index to handle the requests table lookup more efficiently.
CREATE INDEX id_ignored_date ON requests (user_id, ignored, created_at);
Your plan says
-> Index Scan using requests_user_id on requests
(cost=0.43..1.66 rows=1 width=64) (actual time=0.010..0.010 rows=0 loops=1000019)
"Index Cond: (user_id = ""User"".id)"
"Filter: ((NOT ignored) AND (""User"".last_request_date = created_at))"
and this index will move the Filter conditions into the Index Cond, which should be faster.
Pro tip: #Kendle is right. Don't use SELECT * in production software, especially performance-sensitive software, unless you have a good reason. It makes your RDBMS server, network, and client program work harder for no good reason.
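For example, a minimal sketch that selects only the columns the application actually needs (the column list here is just an assumption for illustration):
SELECT u.id, u.last_request_date, r.created_at
FROM users AS u
INNER JOIN requests AS r
ON u.id = r.user_id
WHERE r.created_at = u.last_request_date AND r.ignored = false
ORDER BY u.last_request_date DESC
LIMIT 10 OFFSET 0;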
Edit: Read this about how to use multicolumn BTREE indexes effectively. https://www.postgresql.org/docs/current/indexes-multicolumn.html
As you only need the last 10 users, I would suggest fetching only the last 100 records from requests. This may avoid a million join comparisons, though it is worth testing, as the query optimiser may already be doing this.
This number should be adjusted for your application. It may be that the last 10 records always belong to 10 different users, or that we need to fetch more than 100 to be sure of having 10 users.
SELECT *
FROM users AS u
INNER JOIN (select * from requests
where ignored = false
order by created_at desc
limit 100) AS r
ON u.id = r.user_id
WHERE (r.created_at = u.last_request_date)
ORDER BY u.last_request_date DESC
LIMIT 10 OFFSET 0

Why does my query use filtering instead of an index cond when I use an `OR` condition?

I have a transactions table in PostgreSQL with block_height and index as BIGINT columns. Those two values are used to determine the order of the transactions in this table.
So if I want to query transactions from this table that come after a given block_height and index, I'd have to put this in the condition:
If two transactions are in the same block_height, then check the ordering of their index
Otherwise compare their block_height
For example, if I want to get 10 transactions that came after block_height 10000 and index 5:
SELECT * FROM transactions
WHERE (
(block_height = 10000 AND index > 5)
OR (block_height > 10000)
)
ORDER BY block_height, index ASC
LIMIT 10
However, I find this query to be extremely slow; it took up to 60 seconds on a table with 50 million rows.
But if I split up the condition and run the parts individually like so:
SELECT * FROM transactions
WHERE block_height = 10000 AND index > 5
ORDER BY block_height, index ASC
LIMIT 10
and
SELECT * FROM transactions
WHERE block_height > 10000
ORDER BY block_height, index ASC
LIMIT 10
Both queries took at most 200ms on the same table! It is much faster to do both queries and then UNION the final result instead of putting an OR in the condition.
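For reference, the UNION workaround looks roughly like this (each branch limited separately, then combined; UNION ALL is enough since the two branches cannot overlap):
(SELECT * FROM transactions
WHERE block_height = 10000 AND index > 5
ORDER BY block_height, index
LIMIT 10)
UNION ALL
(SELECT * FROM transactions
WHERE block_height > 10000
ORDER BY block_height, index
LIMIT 10)
ORDER BY block_height, index
LIMIT 10;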
This is the part of the query plan for the slow query (OR-ed condition):
-> Nested Loop (cost=0.98..11689726.68 rows=68631 width=73) (actual time=10230.480..10234.289 rows=10 loops=1)
-> Index Scan using src_transactions_block_height_index on src_transactions (cost=0.56..3592792.96 rows=16855334 width=73) (actual time=10215.698..10219.004 rows=1364 loops=1)
Filter: (((block_height = $1) AND (index > $2)) OR (block_height > $3))
Rows Removed by Filter: 2728151
And this is the query plan for the fast query:
-> Nested Loop (cost=0.85..52.62 rows=1 width=73) (actual time=0.014..0.014 rows=0 loops=1)
-> Index Scan using src_transactions_block_height_index on src_transactions (cost=0.43..22.22 rows=5 width=73) (actual time=0.014..0.014 rows=0 loops=1)
Index Cond: ((block_height = $1) AND (index > $2))
I see the main difference to be the use of Filter instead of Index Cond between the query plans.
Is there any way to do this query in a performant way without resorting to the UNION workaround?
The fact that block_height is compared to two different parameters which you know just happen to be equal might be a problem. What if you use $1 twice, rather than $1 and $3?
But better yet, try a tuple comparison
WHERE (block_height, index) > (10000, 5)
This can become fast with a two-column index on (block_height, index).
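A minimal sketch of the rewritten query and a supporting index (the index name is just an assumption):
CREATE INDEX transactions_block_height_index_idx ON transactions (block_height, index);

SELECT * FROM transactions
WHERE (block_height, index) > (10000, 5)
ORDER BY block_height, index ASC
LIMIT 10;
The row comparison matches the leading columns of the index, so the planner can walk the index in order and stop after 10 rows.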

Postgres / Postgis query optimization

I have a query in Postgres / Postgis that is based around finding the points nearest to a given point, filtered by some other columns in the table.
The table consists of a bit over 10 million rows and the query looks like this:
SELECT t.id FROM my_table t
WHERE round(t.col1) IN (1,2,3)
ORDER BY t.geom <-> st_transform(st_setsrid('POINT(lon lat)'::geometry, 4326), 3857)
LIMIT 1000;
The geom column is indexed using GIST and col1 is also indexed.
When the WHERE clause matches many rows that are also near the point, this is blazing fast thanks to the geom index:
Limit (cost=0.42..10575.49 rows=1000 width=12) (actual time=0.150..6.742 rows=1000 loops=1)
-> Index Scan using my_table_geom_idx on my_table t (cost=0.42..2148612.35 rows=203177 width=12) (actual time=0.149..6.663 rows=1000 loops=1)
Order By: (geom <-> '.....'::geometry)
Filter: (round(t.col1) = ANY ('{1,2,3}'::double precision[]))
Rows Removed by Filter: 3348
Planning Time: 0.288 ms
Execution Time: 6.817 ms
The problem occurs when the WHERE clause does not find many rows that are close in distance to the given point. Example:
SELECT t.id FROM my_table t
WHERE round(t.col1) IN (1) -- 1 is very rare near the given point
ORDER BY t.geom <-> st_transform(st_setsrid('POINT(lon lat)'::geometry, 4326), 3857)
LIMIT 1000;
This query runs much much slower:
Limit (cost=0.42..14487.97 rows=1000 width=12) (actual time=8443.514..10629.745 rows=1000 loops=1)
-> Index Scan using my_table_geom_idx on my_table t (cost=0.42..1962368.41 rows=135452 width=12) (actual time=8443.513..10629.553 rows=1000 loops=1)
Order By: (t.geom <-> '.....'::geometry)
Filter: (round(t.col1) = ANY ('{1}'::double precision[]))
Rows Removed by Filter: 5866030
Planning Time: 0.292 ms
Execution Time: 10629.906 ms
I created an index on round(col1) to try to speed up searches on col1, but Postgres uses only the geom index, which works great when there are many nearby rows that fit the criteria but not so great when only a few rows match.
If I remove the LIMIT clause, Postgres uses the index on col1, which works great when there are few resulting rows but is very slow when the result contains many rows, so I would like to keep the LIMIT clause.
Any suggestions on how I could optimize this query or create an index that handles this?
EDIT:
Thank you for all the suggestions and feedback!
I tried the tip from #JGH and restricted my query with st_dwithin, so as not to order the entire table before limiting:
...where st_dwithin(geom, searchpoint, 10000)
This greatly reduced the time of the slow query, down to a few milliseconds. Restricting the search to a constant distance works well in my application, so I will use this as the solution.
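For reference, a sketch of the revised query under that assumption, where 10000 is the search radius in the units of SRID 3857 and the transformed point is the same expression as in the original query:
SELECT t.id FROM my_table t
WHERE round(t.col1) IN (1)
AND st_dwithin(t.geom, st_transform(st_setsrid('POINT(lon lat)'::geometry, 4326), 3857), 10000)
ORDER BY t.geom <-> st_transform(st_setsrid('POINT(lon lat)'::geometry, 4326), 3857)
LIMIT 1000;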

Optimizing a Django `.exists()` query

I have a .exists() query in an app I am writing. I want to optimize it.
The current ORM expression yields SQL that looks like this:
SELECT DISTINCT
(1) AS "a",
"the_model"."id",
... snip every single column on the_model
FROM "the_model"
WHERE (
...snip criteria...
LIMIT 1
The explain plan looks like this:
Limit (cost=176.60..176.63 rows=1 width=223)
-> Unique (cost=176.60..177.40 rows=29 width=223)
-> Sort (cost=176.60..176.67 rows=29 width=223)
Sort Key: id, ...SNIP...
-> Index Scan using ...SNIP... on ...SNIP... (cost=0.43..175.89 rows=29 width=223)
Index Cond: (user_id = 6)
Filter: ...SNIP...
If I manually modify the above SQL and remove the individual table columns so it looks like this:
SELECT DISTINCT
(1) AS "a"
FROM "the_model"
WHERE (
...snip criteria...
LIMIT 1
the explain plan shows a couple fewer steps, which is great.
Limit (cost=0.43..175.89 rows=1 width=4)
-> Unique (cost=0.43..175.89 rows=1 width=4)
-> Index Scan using ...SNIP... on ...SNIP... (cost=0.43..175.89 rows=29 width=4)
Index Cond: (user_id = 6)
Filter: ...SNIP...
I can go further by removing the DISTINCT keyword from the query, thus yielding an even shallower execution plan, although the cost saving here is minor:
Limit (cost=0.43..6.48 rows=1 width=4)
-> Index Scan using ..SNIP... on ..SNIP... (cost=0.43..175.89 rows=29 width=4)
Index Cond: (user_id = 6)
Filter: ..SNIP...
I can modify the ORM expression with .only('id') to select only one field. However, that does not give the result I would like; it still does an unnecessary sort on id. Ideally, I would like to do a .only(None), since none of the columns are needed here; they only add weight.
I would also like to remove the DISTINCT keyword, if possible. I don't think it adds much execution time if the columns are removed though.
It seems like this could be done across the board, as .exists() returns a boolean. None of the returned columns are used for anything. They only complicate the query and reduce performance.
I found that during my QuerySet construction, .distinct() was being called before my .exists(). It was causing the columns to be selected unnecessarily.
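For reference, once .distinct() is no longer applied before .exists(), the generated SQL should collapse to roughly this shape (criteria snipped as above), which corresponds to the shallowest plan shown earlier:
SELECT (1) AS "a"
FROM "the_model"
WHERE (
...snip criteria...
LIMIT 1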

Help to choose NoSQL database for project

There is a table:
doc_id(integer)-value(integer)
Approximately 100,000 distinct doc_id values and 27,000,000 rows.
The main query on this table searches for documents similar to the current document:
select the 10 documents with the maximum of
(count of values in common with the current document) / (count of values in the document).
We currently use PostgreSQL. The table weighs ~1.5 GB (with index). The average query time is ~0.5 s, which is too high, and in my opinion this time will grow exponentially as the database grows.
Should I move all this to a NoSQL database, and if so, which one?
QUERY:
EXPLAIN ANALYZE
SELECT D.doc_id as doc_id,
(count(D.doc_crc32) *1.0 / testing.get_count_by_doc_id(D.doc_id))::real as avg_doc
FROM testing.text_attachment D
WHERE D.doc_id !=29758 -- 29758 - is random id
AND D.doc_crc32 IN (select testing.get_crc32_rows_by_doc_id(29758)) -- get_crc32... is IMMUTABLE
GROUP BY D.doc_id
ORDER BY avg_doc DESC
LIMIT 10
Limit (cost=95.23..95.26 rows=10 width=8) (actual time=1849.601..1849.641 rows=10 loops=1)
-> Sort (cost=95.23..95.28 rows=20 width=8) (actual time=1849.597..1849.609 rows=10 loops=1)
Sort Key: (((((count(d.doc_crc32))::numeric * 1.0) / (testing.get_count_by_doc_id(d.doc_id))::numeric))::real)
Sort Method: top-N heapsort Memory: 25kB
-> HashAggregate (cost=89.30..94.80 rows=20 width=8) (actual time=1211.835..1847.578 rows=876 loops=1)
-> Nested Loop (cost=0.27..89.20 rows=20 width=8) (actual time=7.826..928.234 rows=167771 loops=1)
-> HashAggregate (cost=0.27..0.28 rows=1 width=4) (actual time=7.789..11.141 rows=1863 loops=1)
-> Result (cost=0.00..0.26 rows=1 width=0) (actual time=0.130..4.502 rows=1869 loops=1)
-> Index Scan using crc32_idx on text_attachment d (cost=0.00..88.67 rows=20 width=8) (actual time=0.022..0.236 rows=90 loops=1863)
Index Cond: (d.doc_crc32 = (testing.get_crc32_rows_by_doc_id(29758)))
Filter: (d.doc_id <> 29758)
Total runtime: 1849.753 ms
(12 rows)
1.5 GB is nothing. Serve it from RAM. Build a data structure that helps you search.
I don't think your main problem here is the kind of database you're using but the fact that you don't in fact have an "index" for what you're searching: similarity between documents.
My proposal is to determine once, for each of the 100,000 doc_ids, which 10 documents are most similar to it, and cache the result in a new table like this:
doc_id(integer) - similar_doc(integer) - score(integer)
where you insert 10 rows per document, each representing one of its 10 best matches. You'll get about 1,000,000 rows in total (10 per document), which you can access directly by index, which should take search time down to something like O(log n) (depending on the index implementation).
Then, on each insertion or removal of a document (or one of its values) you iterate through the documents and update the new table accordingly.
e.g. when a new document is inserted:
for each of the documents already in the table,
you calculate its match score, and
if that score is higher than the lowest score of the similar documents cached in the new table, you swap in the similar_doc and score of the newly inserted document.
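A minimal sketch of what the cache table and its lookup could look like (the table name is an assumption, following the column layout given above):
CREATE TABLE doc_similarity (
    doc_id      integer NOT NULL,
    similar_doc integer NOT NULL,
    score       integer NOT NULL,
    PRIMARY KEY (doc_id, similar_doc)
);

-- Fetching the 10 best matches for a document then becomes a simple index lookup.
SELECT similar_doc, score
FROM doc_similarity
WHERE doc_id = 29758
ORDER BY score DESC
LIMIT 10;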
If you're getting that bad performance out of PostgreSQL, a good start would be to tune PostgreSQL, your query, and possibly your data model. A query like that should run a lot faster on such a small table.
First, is 0.5 s a problem or not? And did you already optimize your queries, data model and configuration settings? If not, you can still get better performance. Performance is a choice.
Besides speed, there is also functionality; that's what you would lose.
What about pushing the function to a JOIN:
EXPLAIN ANALYZE
SELECT
D.doc_id as doc_id,
(count(D.doc_crc32) *1.0 / testing.get_count_by_doc_id(D.doc_id))::real as avg_doc
FROM
testing.text_attachment D
JOIN (SELECT testing.get_crc32_rows_by_doc_id(29758) AS r) AS crc ON D.doc_crc32 = r
WHERE
D.doc_id <> 29758
GROUP BY D.doc_id
ORDER BY avg_doc DESC
LIMIT 10