Index doesn't improve performance - sql

I have a simple table structure in my postgres database:
CREATE TABLE device
(
    id bigint NOT NULL,
    version bigint NOT NULL,
    device_id character varying(255),
    date_created timestamp without time zone,
    last_updated timestamp without time zone,
    CONSTRAINT device_pkey PRIMARY KEY (id)
);
I'm often querying data based on the device_id column. The table has 3.5 million rows, which leads to performance issues:
"Seq Scan on device (cost=0.00..71792.70 rows=109 width=8) (actual time=352.725..353.445 rows=2 loops=1)"
" Filter: ((device_id)::text = '352184052470420'::text)"
"Total runtime: 353.463 ms"
Hence I've created index on device_id column:
CREATE INDEX device_device_id_idx
    ON device
    USING btree
    (device_id);
However, my problem is that the database still uses a sequential scan rather than an index scan. The query plan after creating the index is the same:
"Seq Scan on device (cost=0.00..71786.33 rows=109 width=8) (actual time=347.133..347.508 rows=2 loops=1)"
" Filter: ((device_id)::text = '352184052470420'::text)"
"Total runtime: 347.538 ms"
The query returns only 2 rows, so I'm not selecting a big portion of the table. I don't really understand why the index is disregarded. What can I do to improve the performance?
edit:
My query:
select id from device where device_id ='357560051102491A';
I've run ANALYZE on the device table, which didn't help.
device_id also contains letters, not only digits.

You may need to look at the queries. To use an index, the queries need to be sargable. That means certain ways of constructing the queries are better than others. I am not familiar with Postgres, but in SQL Server this would include things such as (a very small sample of the bad constructs to avoid):
Not doing data transformations in the join - instead store the data properly
Not using correlated subqueries - use derived tables or temp tables instead
Not using OR conditions - use UNION ALL instead (see the sketch below)
Your first step should be to get a good book on performance tuning for your specific database. It will cover which constructions to avoid for your particular database engine.
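As a rough illustration of the OR point above (hypothetical table and column names, not taken from the question), an OR across two different columns can be split so each branch can use its own index:
-- Original form: the OR may force a full scan
SELECT order_id FROM orders WHERE customer_id = 42 OR salesperson_id = 42;
-- Rewritten with UNION ALL; the second branch excludes rows already returned by the first
SELECT order_id FROM orders WHERE customer_id = 42
UNION ALL
SELECT order_id FROM orders WHERE salesperson_id = 42 AND customer_id IS DISTINCT FROM 42;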

Indexes are not used when you cast a column to a different type:
((device_id)::text = '352184052470420'::text)
Instead, you can write it this way:
(device_id = ('352184052470420'::character varying))
(Or you could change device_id to text in the original table, if you prefer.)
Also, remember to run ANALYZE device after the index has been created, or it may not be used anyway.
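A quick way to check whether the rewrite (or the statistics refresh) makes the planner pick the index is to look at the plan again, roughly like this:
ANALYZE device;
EXPLAIN ANALYZE
SELECT id FROM device WHERE device_id = '352184052470420';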

It seems that time resolves everything. I'm not sure what happened, but it is currently working fine.
Since posting this question I haven't changed anything, and now I get this query plan:
"Bitmap Heap Scan on device (cost=5.49..426.77 rows=110 width=166)"
" Recheck Cond: ((device_id)::text = '357560051102491'::text)"
" -> Bitmap Index Scan on device_device_id_idx (cost=0.00..5.46 rows=110 width=0)"
" Index Cond: ((device_id)::text = '357560051102491'::text)"
Time breakdown (timezone GMT+2):
~15:50 I created the index
~16:00 I dropped and recreated the index several times, since it was not working
16:05 I ran ANALYZE device (didn't help)
16:44:49 from the app server request_log, I can see that requests executing the query still take around 500 ms
16:56:59 I see the first request that takes 23 ms (the index started to work!)
The question remains: why did it take around an hour and 10 minutes for the index to start being used? When I created indexes in the same database a few days ago, the changes were immediate.
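One guess is that autovacuum's background ANALYZE eventually refreshed the statistics (or the application's cached plans expired). If this happens again, the timestamps in pg_stat_user_tables can help confirm or rule that out, e.g.:
SELECT relname, last_analyze, last_autoanalyze, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'device';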

Related

Can't improve SQL join speed with indexes

I'm totally new to SQL and am trying to speed up join queries on very large data. I started adding indexes (though, to be honest, I don't have a deep understanding of them) and, not seeing much change, I decided to benchmark a simpler, simulated example. I'm using the psql interface of PostgreSQL 11.5 on macOS 10.14.6. The database server is hosted locally on my computer. I apologize for any lack of relevant information; this is my first time posting about SQL.
Databases' Structures
I created two initially identical databases, db and db_idx. I never put any index or key on the tables in db, while I try putting indexes and keys on the tables in db_idx. I then run simple join queries within db and db_idx separately and compare the performance. Specifically, db_idx consists of two tables:
A client table with 100,000 rows and the following structure:
Table "public.client"
Column | Type | Collation | Nullable | Default
-------------+---------+-----------+----------+---------
client_id | integer | | not null |
client_name | text | | |
Indexes:
"pkey_c" PRIMARY KEY, btree (client_id)
A client_additional table with 70,000 rows and the following structure:
Table "public.client_additional"
Column | Type | Collation | Nullable | Default
------------+---------+-----------+----------+---------
client_id | integer | | not null |
client_age | integer | | |
Indexes:
"pkey_ca" PRIMARY KEY, btree (client_id)
"cov_idx" btree (client_id, client_age)
The client_id column in the client_additional table contains a subset of client's client_id values. Note the primary keys and the other index I created on client_additional. I thought these would increase the benchmark query speed (see below), but they did not.
Importantly, the db database is exactly the same (same structure, same values) except that it has no indexes or keys.
Side note: the client and client_additional table should perhaps be a single table, since they give information at exactly the same level (client level). However, the database I'm using in real life came structured this way: some tables are split into several tables by "topic" although they give information at the same level. I don't know if that matters for my issue.
Benchmark Query
I'm using the following query, which mimics a lot what I need to do with real data:
SELECT
client_additional.client_id,
client_additional.client_age,
client.client_name
FROM client
INNER JOIN client_additional
ON client.client_id = client_additional.client_id;
Benchmark Results
On both databases, the benchmark query takes about 630 ms. Removing the keys and/or indexes in db_idx does not change anything. These benchmark results carry over to larger data sizes: speed is identical in the indexed and non-indexed cases.
That's where I am. How do I explain these results? Can I improve the join speed and how?
Use the EXPLAIN command to see how the SQL engine intends to resolve the query. (Different SQL engines present this in different ways.) You can conclusively see whether the index will be used.
Also, you'll first need to load the tables with a lot of test data, because EXPLAIN will tell you what the SQL engine intends to do right now, and this decision is based in part on the size of the table and various other statistics. If the table is virtually empty, the SQL engine might decide that the index wouldn't be helpful now.
SQL engines use all kinds of very clever tricks to optimize performance, so it's actually rather difficult to get a useful timing test. But, if EXPLAIN tells you that the index is being used, that's pretty much the answer that you're looking for.
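In PostgreSQL that means prefixing the query with EXPLAIN (or EXPLAIN ANALYZE to actually run it and get real timings), for example with the benchmark query from the question:
EXPLAIN ANALYZE
SELECT client_additional.client_id, client_additional.client_age, client.client_name
FROM client
INNER JOIN client_additional ON client.client_id = client_additional.client_id;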
Setting up a small test DB, adding some rows and running your query:
CREATE TABLE client
(
client_id integer PRIMARY KEY,
client_name text
);
CREATE TABLE client_additional
(
client_id integer PRIMARY KEY,
client_age integer
);
INSERT INTO client (client_id, client_name) VALUES (generate_series(1,100000),'Phil');
INSERT INTO client_additional (client_id, client_age) VALUES (generate_series(1,70000),21);
ANALYZE;
EXPLAIN ANALYZE SELECT
client_additional.client_id,
client_additional.client_age,
client.client_name
FROM
client
INNER JOIN
client_additional
ON
client.client_id = client_additional.client_id;
gave me this plan:
Hash Join  (cost=1885.00..3590.51 rows=70000 width=11) (actual time=158.958..441.222 rows=70000 loops=1)
  Hash Cond: (client.client_id = client_additional.client_id)
  ->  Seq Scan on client  (cost=0.00..1443.00 rows=100000 width=7) (actual time=0.019..100.318 rows=100000 loops=1)
  ->  Hash  (cost=1010.00..1010.00 rows=70000 width=8) (actual time=158.785..158.786 rows=70000 loops=1)
        Buckets: 131072  Batches: 1  Memory Usage: 3759kB
        ->  Seq Scan on client_additional  (cost=0.00..1010.00 rows=70000 width=8) (actual time=0.016..76.507 rows=70000 loops=1)
Planning Time: 0.357 ms
Execution Time: 506.739 ms
What you can see from this is that both tables were sequentially scanned, the values from each table were hashed, and a hash join was done. Postgres determined this was the optimal way to execute this query.
If you were to recreate the tables without the primary keys (and therefore remove the implicit index on the PK column of each), you would get exactly the same plan: Postgres has determined that the quickest way to execute this query is to ignore the indexes, hash each table's values, and then do a hash join on the two sets of hashed values to get the result.
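To try that on the test tables above without rebuilding them, the implicit primary-key indexes can be dropped directly (a sketch; the constraint names are the defaults Postgres generated, as seen in the plan further down):
ALTER TABLE client DROP CONSTRAINT client_pkey;
ALTER TABLE client_additional DROP CONSTRAINT client_additional_pkey;
-- re-running the EXPLAIN ANALYZE above still shows the same hash join plan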
After changing the number of rows in the client table like so:
TRUNCATE Client;
INSERT INTO client (client_id, client_name) VALUES (generate_series(1,200000),'phil');
ANALYZE;
Then I re-ran the same query and I see this plan instead:
Merge Join  (cost=1.04..5388.45 rows=70000 width=13) (actual time=0.050..415.503 rows=70000 loops=1)
  Merge Cond: (client.client_id = client_additional.client_id)
  ->  Index Scan using client_pkey on client  (cost=0.42..6289.42 rows=200000 width=9) (actual time=0.022..86.897 rows=70001 loops=1)
  ->  Index Scan using client_additional_pkey on client_additional  (cost=0.29..2139.29 rows=70000 width=8) (actual time=0.016..86.818 rows=70000 loops=1)
Planning Time: 0.517 ms
Execution Time: 484.264 ms
Here you can see that index scans were done, as Postgres has determined that this plan is a better one based on the current number of rows in the tables.
The point is that Postgres will use the indexes when it feels they will produce a faster result, but the thresholds before they are used are somewhat higher than you may have expected.
You have a primary key on the two tables which will be used for the joins. If you want to really see the queries slow down, remove the primary keys.
What is happening? Well, my guess is that the execution plans are the same with or without the secondary indexes. You would need to look at the plans themselves.
Unlike most other databases, Postgres gets less benefit from covering indexes, because row visibility information is stored only in the data pages. So, the data pages usually still need to be accessed.

Postgresql select count query takes long time

I have a table named events in my PostgreSQL 9.5 database, and this table has about 6 million records.
I am running a select count(event_id) from events query, but it takes 40 seconds. That is a very long time for a database. The event_id field of the table is the primary key and is indexed. Why does this take so long? (The server is an Ubuntu VM on VMware with 4 CPUs.)
Explain:
"Aggregate (cost=826305.19..826305.20 rows=1 width=0) (actual time=24739.306..24739.306 rows=1 loops=1)"
" Buffers: shared hit=13 read=757739 dirtied=53 written=48"
" -> Seq Scan on event_source (cost=0.00..812594.55 rows=5484255 width=0) (actual time=0.014..24087.050 rows=6320689 loops=1)"
" Buffers: shared hit=13 read=757739 dirtied=53 written=48"
"Planning time: 0.369 ms"
"Execution time: 24739.364 ms"
I know that this is an old question and the existing answer covers the vast majority of info around this, but I just ran into a situation where a table of 1.3 million rows was taking about 35 seconds to perform a simple SELECT COUNT(*). None of the other solutions helped. The issue ended up being that the table was just bloated and hadn't been vacuumed, so Postgres couldn't figure out the most optimal way to query the data. After I ran this, the query time dropped down to about 25ms!
VACUUM (ANALYZE, VERBOSE, FULL) my_table_name;
Hope this helps someone else!
Multiple factors play a big role in PostgreSQL's decision on how to execute a count(), but first of all, the column you use inside the count function does not matter. In fact, if you don't need a DISTINCT count, stick with count(*).
You can try the following to force an index-only scan:
SELECT count(*) FROM (SELECT event_id FROM events) t;
...if that still results in a sequential scan, then most likely the index is not much smaller than the table itself. To still see how an index-only scan would perform, you can enforce it with:
SELECT count(*) FROM (SELECT event_id FROM events ORDER BY 1) t;
If that is not much faster, you should also consider upgrading PostgreSQL to at least version 9.6, which introduces parallel sequential scans to speed these things up.
In addition, you can achieve dramatic speedups by choosing from a variety of techniques to provide counts, which largely depend on your use case and requirements:
Faster PostgreSQL Counting
Last but not least, please always provide the output of an extended explain, as @a_horse_with_no_name already recommended, e.g.:
EXPLAIN (ANALYZE, BUFFERS) SELECT count(event_id) FROM events;
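If an approximate count is good enough, one of the techniques from the article linked above is to read the planner's estimate from the catalog instead of scanning anything (a sketch; the figure is only as fresh as the last ANALYZE or autovacuum run):
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'events';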

Behavior of WHERE clause on a primary key field

select * from table where username = 'johndoe'
In Postgres, if username is not a primary key, I know it will iterate through all the records.
But if it is a primary key field, will the above SQL statement iterate through the entire table, or terminate as soon as username is matched? In other words, does WHERE act differently depending on whether it is running on a primary key column?
Primary keys (and all indexed columns for that matter) take advantage of indexes when those column(s) are used as filter predicates, like WHERE and JOIN...ON clauses.
As a real world example, my application has a table called Log_Games, which is a table with millions of rows, ID as the primary key, and a number of other non-indexed columns such as ParsedAt. Compare the below:
INDEXED QUERY
EXPLAIN ANALYZE
SELECT *
FROM "Log_Games"
WHERE "ID" = 792046
INDEXED QUERY PLAN
Index Scan using "Log_Games_pkey" on "Log_Games" (cost=0.43..8.45 rows=1 width=4190) (actual time=0.024..0.024 rows=1 loops=1)
Index Cond: ("ID" = 792046)
Planning time: 1.059 ms
Execution time: 0.066 ms
NON-INDEXED QUERY
EXPLAIN ANALYZE
SELECT *
FROM "Log_Games"
WHERE "ParsedAt" = '2015-05-07 07:31:24+00'
NON-INDEXED QUERY PLAN
Seq Scan on "Log_Games" (cost=0.00..141377.34 rows=18 width=4190) (actual time=0.013..793.094 rows=1 loops=1)
Filter: ("ParsedAt" = '2015-05-07 07:31:24+00'::timestamp with time zone)
Rows Removed by Filter: 1924676
Planning time: 0.794 ms
Execution time: 793.135 ms
The query with the indexed clause uses the index Log_Games_pkey, resulting in a query that executes in 0.066 ms. The query with the non-indexed clause reverts to a sequential scan, which means it goes from the start to the finish of the table to see which rows match, an operation that causes the execution time to blow out to 793.135 ms.
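If queries filtering on ParsedAt were common, the usual fix would be to index that column as well, so the same query can use an index scan (a sketch; the index name here is only illustrative):
CREATE INDEX "Log_Games_ParsedAt_idx" ON "Log_Games" ("ParsedAt");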
There are plenty of good resources around the web that can help you read execution plans and decide when they may need supporting indexes. A good place to start is the PostgreSQL documentation:
https://www.postgresql.org/docs/9.6/static/using-explain.html

Why isn't Postgres using the index with Distinct?

I have this table:
CREATE TABLE public.prodhistory (
curve_id int4 NOT NULL,
start_prod_date date NOT NULL,
prod_date date NOT NULL,
monthly_prod_rate float4 NOT NULL,
eff_date timestamp NOT NULL,
/* Keys */
CONSTRAINT prodhistorypk
PRIMARY KEY (curve_id, prod_date, start_prod_date, eff_date),
/* Foreign keys */
CONSTRAINT prodhistory2typecurves_fk
FOREIGN KEY (curve_id)
REFERENCES public.typecurves(curve_id)
) WITH (
OIDS = FALSE
);
CREATE INDEX prodhistory_idx_curve_id01
ON public.prodhistory
(curve_id);
with ~42M rows.
And I execute this query:
SELECT DISTINCT curve_id FROM prodhistory
Which I expected to be very quick, given the index. But no: 270 seconds. So I run EXPLAIN, and I get:
HashAggregate (cost=824870.03..824873.08 rows=305 width=4) (actual time=211834.018..211834.097 rows=315 loops=1)
Output: curve_id
Group Key: prodhistory.curve_id
-> Seq Scan on public.prodhistory (cost=0.00..718003.22 rows=42746722 width=4) (actual time=12.751..200826.299 rows=43218808 loops=1)
Output: curve_id
Planning time: 0.115 ms
Execution time: 211848.137 ms
I'm not too experienced in reading these plans, but a Seq Scan over the whole table seems bad.
Any thoughts? I'm sort of stumped.
This plan is chosen because PostgreSQL thinks it is cheaper.
You can compare by setting
SET enable_seqscan=off;
and then re-running your EXPLAIN (ANALYZE) statement. Compare cost and actual time in both cases and check if PostgreSQL estimated correctly or not.
If you find that using an Index Scan or Index Only Scan is actually cheaper, you could consider twiddling the cost parameters to match your machine better, e.g. lower random_page_cost or cpu_index_tuple_cost or raise cpu_tuple_cost.
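As a sketch of what such tuning might look like on a machine with fast storage (the values are only illustrative, not recommendations):
-- session-level experiment; persist via postgresql.conf or ALTER SYSTEM if it helps
SET random_page_cost = 1.1;        -- default is 4.0, which assumes slow spinning disks
SET cpu_index_tuple_cost = 0.0025; -- default is 0.005
EXPLAIN (ANALYZE) SELECT DISTINCT curve_id FROM prodhistory;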
PostgreSQL "index only scans" aren't always as cheap as you might think.
The reason is that each row needs to be checked as to whether it is visible to the MVCC snapshot or not.
Whether this is cheap or not depends on the table's visibility map.
If you force an index only scan (as per laurenz-albe's answer):
SET enable_seqscan=off;
Then run your query with:
EXPLAIN (ANALYZE ON, BUFFERS ON)
If the query plan output shows "Heap Fetches", as below, it means that the table's actual row data is being accessed, not just the index.
Index Only Scan using my_index on my_table (cost=0.42..17792.01 rows=595195 width=20) (actual time=37.942..2330.737 rows=539105 loops=1)
Heap Fetches: 234180
The official documentation describes this here:
https://www.postgresql.org/docs/current/indexes-index-only-scans.html
You may be able to resolve this by altering the way the table is updated, or by adjusting your autovacuum settings.
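If heap fetches turn out to be the problem, a plain VACUUM refreshes the visibility map so subsequent index-only scans can skip the heap for all-visible pages (a sketch, using the table name from the example plan above):
VACUUM (ANALYZE) my_table;
-- then re-check with EXPLAIN (ANALYZE ON, BUFFERS ON) and compare the Heap Fetches count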

Why does this simple query not use the index in postgres?

In my PostgreSQL database I have a table named "product". In this table I have a column named "date_touched" of type timestamp. I created a simple btree index on this column. This is the schema of my table (I omitted irrelevant column & index definitions):
Table "public.product"
Column | Type | Modifiers
---------------------------+--------------------------+-------------------
id | integer | not null default nextval('product_id_seq'::regclass)
date_touched | timestamp with time zone | not null
Indexes:
"product_pkey" PRIMARY KEY, btree (id)
"product_date_touched_59b16cfb121e9f06_uniq" btree (date_touched)
The table has ~300,000 rows and I want to get the n-th element from the table ordered by "date_touched". When I want to get the 1,000th element, it takes 0.2 s, but when I want to get the 100,000th element, it takes about 6 s. My question is: why does it take so much time to retrieve the 100,000th element, even though I've defined a btree index?
Here is my query with explain analyze, which shows that PostgreSQL does not use the btree index and instead sorts all rows to find the 100,000th element:
first query (1,000th element):
explain analyze
SELECT product.id
FROM product
ORDER BY product.date_touched ASC
LIMIT 1
OFFSET 1000;
QUERY PLAN
-----------------------------------------------------------------------------------------------------
Limit (cost=3035.26..3038.29 rows=1 width=12) (actual time=160.208..160.209 rows=1 loops=1)
-> Index Scan using product_date_touched_59b16cfb121e9f06_uniq on product (cost=0.42..1000880.59 rows=329797 width=12) (actual time=16.651..159.766 rows=1001 loops=1)
Total runtime: 160.395 ms
second query (100,000th element):
explain analyze
SELECT product.id
FROM product
ORDER BY product.date_touched ASC
LIMIT 1
OFFSET 100000;
QUERY PLAN
------------------------------------------------------------------------------------------------------
Limit (cost=106392.87..106392.88 rows=1 width=12) (actual time=6621.947..6621.950 rows=1 loops=1)
-> Sort (cost=106142.87..106967.37 rows=329797 width=12) (actual time=6381.174..6568.802 rows=100001 loops=1)
Sort Key: date_touched
Sort Method: external merge Disk: 8376kB
-> Seq Scan on product (cost=0.00..64637.97 rows=329797 width=12) (actual time=1.357..4184.115 rows=329613 loops=1)
Total runtime: 6629.903 ms
It is a very good thing that a SeqScan is used here. Your OFFSET 100000 is not a good fit for an IndexScan.
A bit of theory
Btree indexes contain two structures inside:
a balanced tree, and
a doubly linked list of keys.
The first structure allows fast key lookups; the second is responsible for the ordering. For bigger tables, the linked list cannot fit into a single page, so it becomes a list of linked pages, where each page's entries maintain the ordering specified during index creation.
It is wrong to think, though, that such pages sit together on disk. In fact, it is more likely that they are spread across different locations, and in order to read pages in the index's order, the system has to perform random disk reads. Random disk I/O is expensive compared to sequential access, so a good optimizer will prefer a SeqScan instead.
I highly recommend “SQL Performance Explained” book to better understand indexes. It is also available on-line.
What is going on?
Your OFFSET clause causes the database to read the index's linked list of keys (causing lots of random disk reads) and then discard all those results until it hits the wanted offset. It is good, in fact, that Postgres decided to use SeqScan + Sort here; this should be faster.
You can check this assumption by:
running EXPLAIN (analyze, buffers) on your big-OFFSET query,
then doing SET enable_seqscan TO 'off';
and running EXPLAIN (analyze, buffers) again, comparing the results.
In general, it is better to avoid OFFSET, as DBMSes do not always pick the right approach here. (BTW, which version of PostgreSQL are you using?)
Here's a comparison of how it performs for different offset values.
EDIT: In order to avoid OFFSET, one would have to base pagination on real data that exists in the table and is part of the index. For this particular case, the following might be possible:
show first N (say, 20) elements
include the maximal date_touched shown on the page in all the “Next” links. You can compute this value on the application side. Do the same for the “Previous” links, except include the minimal date_touched for these (a sketch for the “Previous” case follows the query below).
on the server side you will get the limiting value. Therefore, say for the “Next” case, you can do a query like this:
SELECT id
FROM product
WHERE date_touched > $max_date_seen_on_the_page
ORDER BY date_touched ASC
LIMIT 20;
This query makes best use of the index.
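And the corresponding “Previous” case, as a sketch, simply flips the comparison and the sort direction (the application then reverses the 20 rows before displaying them):
SELECT id
FROM product
WHERE date_touched < $min_date_seen_on_the_page
ORDER BY date_touched DESC
LIMIT 20;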
Of course, you can adjust this example to your needs. I used pagination as it is a typical case for the OFFSET.
One more note: querying 1 row many times, increasing the offset by 1 for each query, will be much more time consuming than doing a single batch query that returns all those records, which are then iterated over on the application side.
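For instance, instead of issuing LIMIT 1 OFFSET n once per row, a single batched query (sketch) returns the same rows in one round trip and is then iterated over in the application:
SELECT id, date_touched
FROM product
ORDER BY date_touched ASC
LIMIT 1000;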