I have something like the following table:
CREATE TABLE mytable
(
id serial NOT NULL,
search_col int4 NULL,
a1 varchar NULL,
a2 varchar NULL,
...
a50 varchar NULL,
CONSTRAINT mytable_pkey PRIMARY KEY (id)
);
CREATE INDEX search_col_idx ON mytable USING btree (search_col);
This table has approximately 5 million rows and it takes about 15 seconds to perform a search operation like
select *
from mytable
where search_col = 83310
It is crucial for me to improve performance, but even clustering the table on search_col did not bring a major benefit.
However, I tried the following:
create table test as (select id, search_col, a1 from mytable);
A search on this table, which has the same number of rows as the original, takes approximately 0.2 seconds. Why is that, and how can I use this for what I need?
Index Scan using search_col_idx on mytable (cost=0.43..2713.83 rows=10994 width=32802) (actual time=0.021..13.015 rows=12018 loops=1)
Seq Scan on test (cost=0.00..95729.46 rows=12347 width=19) (actual time=0.246..519.501 rows=12018 loops=1)
The result of DBeaver's Execution Plan
|Node type|Entity|Cost|Rows|Time|Condition|
|Index Scan|mytable|0.43 - 3712.86|12018|13.141|(search_col = 83310)|
Execution Plan from psql:
Index Scan using mytable_search_col_idx on mytable (cost=0.43..3712.86 rows=15053 width=32032) (actual time=0.015..13.889 rows=12018 loops=1)
Index Cond: (search_col = 83310)
Planning time: 0.640 ms
Execution time: 23.910 ms
(4 rows)
One way that the columns would impact the timing would be if the columns were large. Really large.
In most cases, a row resides on a single data page. The index points to the page and the size of the row has little impact on the timing, because the timing is dominated by searching the index and fetching the row.
However, if the columns are really large, then that can require reading many more bytes from disk, which takes more time.
That said, another possibility is that the statistics are out-of-date and the index isn't being used on the first query.
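Both possibilities can be checked directly. The sketch below assumes the table and index names from the question and uses only built-in size functions. Note that width=32802 in the plans is the planner's estimate of the average row width in bytes; if it is roughly right, 12018 matching rows amount to about 390 MB, which would have to be read (much of it probably from TOAST storage) and sent to the client. EXPLAIN ANALYZE does not include the time to fetch out-of-line values for output or to transfer rows, which would be consistent with a plan that reports about 24 ms while the client waits 15 seconds.

```sql
-- How much of the table lives outside the main heap (TOAST) and in indexes?
SELECT pg_size_pretty(pg_relation_size('mytable'))       AS heap_only,
       pg_size_pretty(pg_table_size('mytable'))          AS heap_plus_toast,
       pg_size_pretty(pg_total_relation_size('mytable')) AS total_with_indexes;

-- Rule out stale statistics before comparing plans.
ANALYZE mytable;

-- Fetch only the columns that are actually needed instead of SELECT *.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, search_col, a1
FROM mytable
WHERE search_col = 83310;
```

If heap_plus_toast is far larger than heap_only, the wide varchar columns are the likely cost, and selecting only the needed columns should give a speedup similar to the one seen with the narrow test table.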
Related
I have a table containing about 1 million records.
When I run select * from table, it times out, and I see the query is in the state IO: DataFileRead.
When I run select * from table where id>0 and id<=2147483647, where id is the primary key, it returns all the data in a couple of seconds.
Should I always include a where clause, even when returning all records?
Table schema
CREATE TABLE table
(
id integer NOT NULL GENERATED BY DEFAULT AS IDENTITY ( INCREMENT 1 START 1 MINVALUE 1 MAXVALUE 2147483647 CACHE 1 ),
batch_id integer,
area_id integer,
asset_group text COLLATE pg_catalog."default",
asset_id text COLLATE pg_catalog."default",
parent_id text COLLATE pg_catalog."default",
reference_key text COLLATE pg_catalog."default",
maintainer_code text COLLATE pg_catalog."default",
type_code text COLLATE pg_catalog."default",
super_type_code text COLLATE pg_catalog."default"
)
The primary key is an integer. If I specify the whole integer range, it returns data quickly, but without a where clause it takes about an hour.
Even if I list column names, for example select id,type_code from table, it is very slow compared to select id,type_code from table where id>0 and id<=2147483647
Below is the execution plan without using where:
Seq Scan on table (cost=0.00..6894676.46 rows=630746 width=379) (actual time=2590902.656..4068047.762 rows=792777 loops=1)
Planning Time: 0.095 ms
Execution Time: 4068076.818 ms
And when using where:
Bitmap Heap Scan on table (cost=597265.81..1252327.52 rows=630747 width=379) (actual time=72.493..211.108 rows=792777 loops=1)
Recheck Cond: ((id > 0) AND (id < 2147483647))
Heap Blocks: exact=30533
-> Bitmap Index Scan on pk_information_model_entry (cost=0.00..597108.12 rows=630747 width=0) (actual time=64.017..64.017 rows=792777 loops=1)
Index Cond: ((id > 0) AND (id < 2147483647))
Planning Time: 8.594 ms
Execution Time: 233.207 ms
I'm aware that using an index can improve it, but why does adding a where clause make such a difference?
Your table seems to be massively bloated (full of totally empty pages). Using the index allows PostgreSQL to skip reading those pages. You could fix it with a VACUUM FULL of the table, or by using something like pg_squeeze.
You might also want to investigate how it got that way in the first place, so you can prevent it from recurring.
To reduce planning time, PostgreSQL doesn't consider using an index unless it "might possibly be useful". But just overcoming extreme bloat is not considered to be "possibly useful", which is why it only uses the index after you introduce a dummy WHERE clause which references the column.
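The plans in the question support the bloat diagnosis. Assuming the default cost settings (seq_page_cost = 1, cpu_tuple_cost = 0.01), a Seq Scan estimate of 6894676.46 for 630746 rows corresponds to roughly 6.9 million 8 kB pages, i.e. more than 50 GB, while the bitmap scan touched only 30533 heap blocks (about 240 MB) to return every row. A rough way to confirm this and fix it is sketched below; my_table is a placeholder for the real table name, which the question does not give.

```sql
-- Pages and row estimate the planner is working with, plus on-disk size.
SELECT c.relname,
       c.relpages,
       c.reltuples,
       pg_size_pretty(pg_relation_size(c.oid)) AS heap_size
FROM pg_class c
WHERE c.relname = 'my_table';   -- placeholder name

-- Live vs. dead tuples and when the table was last vacuumed.
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'my_table';     -- placeholder name

-- Rewrite the table compactly. This takes an ACCESS EXCLUSIVE lock,
-- so reads and writes are blocked while it runs.
VACUUM (FULL, VERBOSE, ANALYZE) my_table;
```

A relpages value that is wildly out of proportion to reltuples (millions of pages for well under a million rows) is the signature of the problem described above.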
I have a simple count query that can use an Index Only Scan, but it still takes a long time in PostgreSQL!
I have a cars table with, among others, the columns type (bigint) and active (boolean), and a multi-column index on those two columns:
CREATE TABLE cars
(
id BIGSERIAL NOT NULL
CONSTRAINT cars_pkey PRIMARY KEY ,
type BIGINT NOT NULL ,
name VARCHAR(500) NOT NULL ,
active BOOLEAN DEFAULT TRUE NOT NULL,
created_at TIMESTAMP(0) WITH TIME ZONE default NOW(),
updated_at TIMESTAMP(0) WITH TIME ZONE default NOW(),
deleted_at TIMESTAMP(0) WITH TIME ZONE
);
CREATE INDEX cars_type_active_index ON cars(type, active);
I inserted 950k records of test data, of which 600k have type=1:
INSERT INTO cars (type, name) (SELECT 1, 'car-name' FROM generate_series(1,600000));
INSERT INTO cars (type, name) (SELECT 2, 'car-name' FROM generate_series(1,200000));
INSERT INTO cars (type, name) (SELECT 3, 'car-name' FROM generate_series(1,100000));
INSERT INTO cars (type, name) (SELECT 4, 'car-name' FROM generate_series(1,50000));
Let's run VACUUM ANALYZE and force PostgreSQL to use an Index Only Scan:
VACUUM ANALYSE;
SET enable_seqscan = OFF;
SET enable_bitmapscan = OFF;
OK, I have a simple query on type and active
EXPLAIN (VERBOSE, BUFFERS, ANALYSE)
SELECT count(*)
FROM cars
WHERE type = 1 AND active = true;
Result:
Aggregate (cost=24805.70..24805.71 rows=1 width=0) (actual time=4460.915..4460.918 rows=1 loops=1)
Output: count(*)
Buffers: shared hit=2806
-> Index Only Scan using cars_type_active_index on public.cars (cost=0.42..23304.23 rows=600590 width=0) (actual time=0.051..2257.832 rows=600000 loops=1)
Output: type, active
Index Cond: ((cars.type = 1) AND (cars.active = true))
Filter: cars.active
Heap Fetches: 0
Buffers: shared hit=2806
Planning time: 0.213 ms
Execution time: 4461.002 ms
(11 rows)
Looking at the EXPLAIN output:
It used an Index Only Scan. With an index only scan, depending on the visibility map, PostgreSQL sometimes needs to fetch from the table heap to check the visibility of a tuple. But I already ran VACUUM ANALYZE, so you can see Heap Fetches: 0; reading the index is enough to answer this query.
The index is quite small and fits entirely in the buffer cache (Buffers: shared hit=2806), so PostgreSQL does not need to fetch pages from disk.
Given that, I can't understand why PostgreSQL takes so long (4.5 s) to answer the query. 1M records is not a large number, everything is already cached in memory, and the index data is visible, so it does not need to fetch from the heap.
PostgreSQL 9.5.10 on x86_64-pc-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit
I tested it on Docker 17.09.1-ce, on a 2015 MacBook Pro.
I am still new to PostgreSQL and trying to map my knowledge to real cases.
Thanks so much,
It seems I found the reason: it is not a PostgreSQL problem, it is because of running in Docker. When I run it directly on my Mac, the time is around 100 ms, which is fast enough.
Another thing I figured out is why PostgreSQL still used a seq scan instead of an index only scan (which is why I had to disable enable_seqscan and enable_bitmapscan in my test):
The table is not much bigger than the index. If I add more columns to the table, or the columns get longer, the table grows and the index becomes more likely to be used.
random_page_cost is 4 by default. My disk is quite fast, so I can set it to 1-2, which helps the planner estimate costs more accurately.
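Both observations can be tried out in a session without touching the server configuration. This is a minimal sketch using the cars table and index from the question; the value 1.1 is only an illustrative choice for fast storage, not a recommendation taken from the original post.

```sql
-- Compare the heap and index sizes the planner is weighing against each other.
SELECT pg_size_pretty(pg_relation_size('cars'))                   AS table_size,
       pg_size_pretty(pg_relation_size('cars_type_active_index')) AS index_size;

-- Undo the forcing used in the test so the planner chooses freely again.
RESET enable_seqscan;
RESET enable_bitmapscan;

-- Lower the random I/O cost for this session only (illustrative value).
SHOW random_page_cost;
SET random_page_cost = 1.1;

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM cars
WHERE type = 1 AND active = true;

-- Return to the configured default.
RESET random_page_cost;
```

If the planner now picks the index only scan by itself and the actual time is no worse, the lower setting probably reflects the hardware better than the default of 4.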
select * from table where username = 'johndoe'
In Postgres, if username is not a primary key, I know it will iterate through all the records.
But if it is a primary key field, will the above SQL statement iterate through the entire table, or terminate as soon as username is matched? In other words, does "where" act differently depending on whether it runs on a primary key column?
Primary keys (and all indexed columns for that matter) take advantage of indexes when those column(s) are used as filter predicates, like WHERE and JOIN...ON clauses.
As a real world example, my application has a table called Log_Games, which is a table with millions of rows, ID as the primary key, and a number of other non-indexed columns such as ParsedAt. Compare the below:
INDEXED QUERY
EXPLAIN ANALYZE
SELECT *
FROM "Log_Games"
WHERE "ID" = 792046
INDEXED QUERY PLAN
Index Scan using "Log_Games_pkey" on "Log_Games" (cost=0.43..8.45 rows=1 width=4190) (actual time=0.024..0.024 rows=1 loops=1)
Index Cond: ("ID" = 792046)
Planning time: 1.059 ms
Execution time: 0.066 ms
NON-INDEXED QUERY
EXPLAIN ANALYZE
SELECT *
FROM "Log_Games"
WHERE "ParsedAt" = '2015-05-07 07:31:24+00'
NON-INDEXED QUERY PLAN
Seq Scan on "Log_Games" (cost=0.00..141377.34 rows=18 width=4190) (actual time=0.013..793.094 rows=1 loops=1)
Filter: ("ParsedAt" = '2015-05-07 07:31:24+00'::timestamp with time zone)
Rows Removed by Filter: 1924676
Planning time: 0.794 ms
Execution time: 793.135 ms
The query with the indexed clause uses the index Log_Games_pkey, resulting in a query that executes in 0.066 ms. The query with the non-indexed clause falls back to a sequential scan, which means it reads the table from start to finish to see which rows match, an operation that causes the execution time to blow out to 793.135 ms.
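If lookups on ParsedAt were frequent, a supporting index would be the usual remedy. The sketch below is an assumption about how that could look; the index name Log_Games_ParsedAt_idx is made up for illustration and is not part of the original schema.

```sql
-- Hypothetical supporting index on the previously non-indexed column.
CREATE INDEX "Log_Games_ParsedAt_idx" ON "Log_Games" ("ParsedAt");

-- Refresh statistics so the planner has current information.
ANALYZE "Log_Games";

-- Re-run the earlier query; for a single matching row the plan would be
-- expected to switch from a Seq Scan to an index-based scan.
EXPLAIN ANALYZE
SELECT *
FROM "Log_Games"
WHERE "ParsedAt" = '2015-05-07 07:31:24+00';
```

Each extra index slows down writes and takes space, so it only pays off for columns that are actually filtered on regularly.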
There are plenty of good resources around the web that can help you read execution plans and decide when they may need supporting indexes. A good place to start is the PostgreSQL documentation:
https://www.postgresql.org/docs/9.6/static/using-explain.html
I have this table:
CREATE TABLE public.prodhistory (
curve_id int4 NOT NULL,
start_prod_date date NOT NULL,
prod_date date NOT NULL,
monthly_prod_rate float4 NOT NULL,
eff_date timestamp NOT NULL,
/* Keys */
CONSTRAINT prodhistorypk
PRIMARY KEY (curve_id, prod_date, start_prod_date, eff_date),
/* Foreign keys */
CONSTRAINT prodhistory2typecurves_fk
FOREIGN KEY (curve_id)
REFERENCES public.typecurves(curve_id)
) WITH (
OIDS = FALSE
);
CREATE INDEX prodhistory_idx_curve_id01
ON public.prodhistory
(curve_id);
with ~42M rows.
And I execute this query:
SELECT DISTINCT curve_id FROM prodhistory
Which I expect would be very quick, given the index. But no, 270 secs. So I explain, and I get:
HashAggregate (cost=824870.03..824873.08 rows=305 width=4) (actual time=211834.018..211834.097 rows=315 loops=1)
Output: curve_id
Group Key: prodhistory.curve_id
-> Seq Scan on public.prodhistory (cost=0.00..718003.22 rows=42746722 width=4) (actual time=12.751..200826.299 rows=43218808 loops=1)
Output: curve_id
Planning time: 0.115 ms
Execution time: 211848.137 ms
I'm not too experienced in reading these plans, but a Seq Scan on the DB seems bad.
Any thoughts? I'm sort of stumped.
This plan is chosen because PostgreSQL thinks it is cheaper.
You can compare by setting
SET enable_seqscan=off;
and then re-running your EXPLAIN (ANALYZE) statement. Compare cost and actual time in both cases and check if PostgreSQL estimated correctly or not.
If you find that using an Index Scan or Index Only Scan is actually cheaper, you could consider twiddling the cost parameters to match your machine better, e.g. lower random_page_cost or cpu_index_tuple_cost or raise cpu_tuple_cost.
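The comparison described above can be kept contained in a single transaction, so the setting does not leak into the rest of the session. A minimal sketch against the prodhistory table from the question:

```sql
-- Baseline: the plan the planner prefers on its own.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT curve_id FROM prodhistory;

BEGIN;
-- SET LOCAL lasts only until the end of this transaction.
SET LOCAL enable_seqscan = off;

-- Same query with sequential scans discouraged.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT curve_id FROM prodhistory;
ROLLBACK;
```

Compare both the estimated cost and the actual time of the two plans. If the second plan has a higher estimated cost but a clearly lower actual time, the cost parameters are a candidate for adjustment.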
PostgreSQL "index only scans" aren't always as cheap as you might think.
The reason is that each row needs to be checked as to whether it is visible to the MVCC snapshot or not.
Whether this is cheap or not depends on the table's visibility map.
If you force an index only scan (as per laurenz-albe's answer):
SET enable_seqscan=off;
Then run your query with:
EXPLAIN (ANALYZE ON, BUFFERS ON)
If the query plan output shows a nonzero "Heap Fetches" count, as below, the table's actual row data is being accessed, not just the index.
Index Only Scan using my_index on my_table (cost=0.42..17792.01 rows=595195 width=20) (actual time=37.942..2330.737 rows=539105 loops=1)
Heap Fetches: 234180
The official documentation describes this here:
https://www.postgresql.org/docs/current/indexes-index-only-scans.html
You may be able to resolve this by altering the way the table is updated, or by adjusting your auto vacuum settings.
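How much of the table the visibility map covers can be read from pg_class. The sketch below uses the prodhistory table from the question; the scale factor of 0.02 is an illustrative value, not a tuned recommendation.

```sql
-- relallvisible close to relpages means most pages are marked all-visible,
-- so an index only scan can skip most heap fetches.
SELECT relname, relpages, relallvisible
FROM pg_class
WHERE relname = 'prodhistory';

-- A manual vacuum updates the visibility map immediately.
VACUUM (VERBOSE, ANALYZE) prodhistory;

-- Make autovacuum visit this table more often (illustrative value).
ALTER TABLE prodhistory SET (autovacuum_vacuum_scale_factor = 0.02);
```

Re-running the EXPLAIN (ANALYZE ON, BUFFERS ON) query after the vacuum should show a lower "Heap Fetches" number if the visibility map was the bottleneck.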
How does SQL actually run?
For example, if I want to find a row with row_id=123, will SQL query search row by row from the top of memory?
This is a topic of query optimization. Briefly speaking, based on your query, the database system first tries to generate and optimize a query plan that possibly has optimal performance, then executes that plan.
For selections like row_id = 123, the actual query plan depends on whether you have an index or not. If you do not, a table scan will be used to examine the table row by row. But if you do have an index on row_id, there is a chance to skip most of the rows by using the index. In this case, the DB will not search row by row.
If you're running PostgreSQL or MySQL, you can use
EXPLAIN SELECT * FROM table WHERE row_id = 123;
to see the query plan generated by your system.
For an example table,
CREATE TABLE test(row_id INT); -- without index
COPY test FROM '/home/user/test.csv'; -- 40,000 rows
The EXPLAIN SELECT * FROM test WHERE row_id = 123 outputs:
QUERY PLAN
------------------------------------------------------
Seq Scan on test (cost=0.00..677.00 rows=5 width=4)
Filter: (row_id = 123)
(2 rows)
which means the database will do a sequential scan on the whole table and find the rows with row_id = 123.
However, if you create an index on the column row_id:
CREATE INDEX test_idx ON test(row_id);
then the same EXPLAIN will tell us that the database will use an index scan to avoid going through the whole table:
QUERY PLAN
--------------------------------------------------------------------------
Index Only Scan using test_idx on test (cost=0.00..8.34 rows=5 width=4)
Index Cond: (row_id = 123)
(2 rows)
You can also use EXPLAIN ANALYZE to see actual performance of your SQL queries. On my machine, the total runtimes for sequential scan and index scan are 14.738 ms and 0.171 ms, respectively.
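The whole experiment can be reproduced without the CSV file. This sketch is PostgreSQL-specific and assumes no table named test exists yet; generate_series stands in for the COPY step, so the data (and therefore the exact timings and row estimates) will differ from the numbers quoted above.

```sql
-- Table without an index, filled with 40,000 generated rows.
CREATE TABLE test (row_id INT);
INSERT INTO test (row_id)
SELECT g FROM generate_series(1, 40000) AS g;
ANALYZE test;

-- Without an index: expect a Seq Scan over the whole table.
EXPLAIN ANALYZE
SELECT * FROM test WHERE row_id = 123;

-- With an index: expect an index-based scan instead.
CREATE INDEX test_idx ON test (row_id);

EXPLAIN ANALYZE
SELECT * FROM test WHERE row_id = 123;
```

EXPLAIN ANALYZE actually executes the statement, so the reported times are measured rather than estimated.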
For details of query optimization, refer to Chapters 15 and 16 of Database Systems: The Complete Book.