Why is this SQL statement so fast?

The table in question has 3.8M records. The data column has an expression index on a different key: "idx2_table_on_data_id" btree ((data ->> 'id'::text)). I assumed the sequential scan would be very slow, but it completes in just over 1 second. Note that data->'array' does not exist in many of the records. Why is this running so quickly? Postgres v10
db=> explain analyze select * from table where jsonb_array_length(data->'array') != 0;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------
Seq Scan on table (cost=0.00..264605.21 rows=3797785 width=681) (actual time=0.090..1189.997 rows=1762 loops=1)
Filter: (jsonb_array_length((data -> 'array'::text)) <> 0)
Rows Removed by Filter: 3818154
Planning time: 0.561 ms
Execution time: 1190.492 ms
(5 rows)

We could tell for sure if you had run EXPLAIN (ANALYZE, BUFFERS), but odds are that most of the data were cached in RAM.
Also jsonb_array_length(data->'array') is not terribly expensive if the JSON is short.
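To confirm the caching explanation, the same plan can be requested with buffer statistics; a minimal sketch using the query from the question:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM table WHERE jsonb_array_length(data->'array') != 0;
-- In the output, "Buffers: shared hit=N" counts pages served from
-- shared_buffers (RAM), while "read=N" counts pages fetched from disk;
-- a high hit count would confirm the table was mostly cached.
```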

Related

Postgres select query making sequential scan instead of index scan on table with 18 Million rows

I have a Postgres table which has almost 18 million rows and I am trying to run this query:
select * from answer where submission_id = 5 and deleted_at is NULL;
There is a partial index on the table on column submission_id. This is the command used to create the index:
CREATE INDEX answer_submission_id ON answer USING btree (submission_id) WHERE (deleted_at IS NULL)
This is the explain analyse of the above select query
Gather (cost=1000.00..3130124.70 rows=834 width=377) (actual time=7607.568..7610.130 rows=2 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=2144966 read=3
I/O Timings: read=6.169
-> Parallel Seq Scan on answer (cost=0.00..3129041.30 rows=348 width=377) (actual time=6501.951..7604.623 rows=1 loops=3)
Filter: ((deleted_at IS NULL) AND (submission_id = 5))
Rows Removed by Filter: 62213625
Buffers: shared hit=2144966 read=3
I/O Timings: read=6.169
Planning Time: 0.117 ms
Execution Time: 7610.154 ms
Ideally it should pick the answer_submission_id index, but Postgres is going for a sequential scan.
Any help would be really appreciated.
The execution plan shows a large deviation between the estimated row count (rows=834) and the actual row count (rows=2).
The PostgreSQL optimizer is a cost-based optimizer (CBO): queries are executed using the plan with the smallest estimated cost, so wrong statistics can make it choose a bad execution plan.
Here is a link describing how wrong statistics cause a slow query:
Why are bad row estimates slow in Postgres?
First, I would use this query to check when the table was last analyzed and vacuumed:
SELECT
schemaname, relname,
last_vacuum, last_autovacuum,
vacuum_count, autovacuum_count,
last_analyze,last_autoanalyze
FROM pg_stat_user_tables
where relname = 'tablename';
If your statistics are wrong, you can use ANALYZE "tablename" to collect new statistics from the table; how long ANALYZE takes depends on the table size.
For large tables, ANALYZE takes a random sample of the table contents, rather than examining every row. This allows even very large tables to be analyzed in a small amount of time. Note, however, that the statistics are only approximate, and will change slightly each time ANALYZE is run, even if the actual table contents did not change. This might result in small changes in the planner's estimated costs shown by EXPLAIN. In rare situations, this non-determinism will cause the planner's choices of query plans to change after ANALYZE is run. To avoid this, raise the amount of statistics collected by ANALYZE, as described below.
UPDATE and DELETE create dead tuples, which remain in the heap or indexes even though queries can no longer see them; VACUUM reclaims the storage occupied by dead tuples.
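A minimal sketch of the maintenance commands discussed above, using the answer table from the question:

```sql
-- Collect fresh planner statistics so row estimates improve
ANALYZE answer;

-- Reclaim space held by dead tuples and refresh statistics in one pass
VACUUM (ANALYZE) answer;
```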

Behavior of WHERE clause on a primary key field

select * from table where username = 'johndoe'
In Postgres, if username is not a primary key, I know it will iterate through all the records.
But if it is a primary key field, will the above SQL statement iterate through the entire table, or terminate as soon as username is matched? In other words, does WHERE act differently when it runs on a primary key column?
Primary keys (and all indexed columns for that matter) take advantage of indexes when those column(s) are used as filter predicates, like WHERE and JOIN...ON clauses.
As a real world example, my application has a table called Log_Games, which is a table with millions of rows, ID as the primary key, and a number of other non-indexed columns such as ParsedAt. Compare the below:
INDEXED QUERY
EXPLAIN ANALYZE
SELECT *
FROM "Log_Games"
WHERE "ID" = 792046
INDEXED QUERY PLAN
Index Scan using "Log_Games_pkey" on "Log_Games" (cost=0.43..8.45 rows=1 width=4190) (actual time=0.024..0.024 rows=1 loops=1)
Index Cond: ("ID" = 792046)
Planning time: 1.059 ms
Execution time: 0.066 ms
NON-INDEXED QUERY
EXPLAIN ANALYZE
SELECT *
FROM "Log_Games"
WHERE "ParsedAt" = '2015-05-07 07:31:24+00'
NON-INDEXED QUERY PLAN
Seq Scan on "Log_Games" (cost=0.00..141377.34 rows=18 width=4190) (actual time=0.013..793.094 rows=1 loops=1)
Filter: ("ParsedAt" = '2015-05-07 07:31:24+00'::timestamp with time zone)
Rows Removed by Filter: 1924676
Planning time: 0.794 ms
Execution time: 793.135 ms
The query with the indexed clause uses the index Log_Games_pkey, resulting in a query that executes in 0.066ms. The query with the non-indexed clause reverts to a sequential scan, which means it goes from the start to the finish of the table to see which rows match, an operation that causes the execution time to blow out to 793.135ms.
There are plenty of good resources around the web that can help you read execution plans and decide when they may need supporting indexes. A good place to start is the PostgreSQL documentation:
https://www.postgresql.org/docs/9.6/static/using-explain.html
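For completeness, if filtering on ParsedAt were a common query pattern, a plain b-tree index would let the planner avoid the sequential scan above (the index name here is illustrative):

```sql
CREATE INDEX "idx_Log_Games_ParsedAt" ON "Log_Games" ("ParsedAt");
```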

Why isn't Postgres using the index?

I have a table with an integer column called account_id. I have an index on that column.
But it seems Postgres doesn't want to use my index:
EXPLAIN ANALYZE SELECT "invoices".* FROM "invoices" WHERE "invoices"."account_id" = 1;
Seq Scan on invoices (cost=0.00..6504.61 rows=117654 width=186) (actual time=0.021..33.943 rows=118027 loops=1)
Filter: (account_id = 1)
Rows Removed by Filter: 51462
Total runtime: 39.917 ms
(4 rows)
Any idea why that would be?
Because of:
Seq Scan on invoices (...) (actual ... rows=118027 <— this
Filter: (account_id = 1)
Rows Removed by Filter: 51462 <— vs this
Total runtime: 39.917 ms
You're selecting so many rows that it's cheaper to read the entire table.
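One way to see this from the plan numbers: 118027 matching rows versus 51462 removed means roughly 70% of the table matches the predicate. The fraction can be computed directly (a sketch; the FILTER clause requires PostgreSQL 9.4+):

```sql
SELECT count(*) FILTER (WHERE account_id = 1)::float / count(*) AS match_fraction
FROM invoices;
-- A fraction this large makes a sequential scan cheaper than an index scan,
-- which would have to jump around the heap for most of the table anyway.
```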
Related earlier questions and answers from today for further reading:
Why doesn't Postgresql use index for IN query?
Postgres using wrong index when querying a view of indexed expressions?
(See also Craig's longer answer on the second one for additional notes on indexes subtleties.)

How to measure time in postgres

In Postgres I want to know the time taken to generate a plan. I know that \timing gives me the time taken to execute the query once the optimal plan is found. But I want to find out the time which Postgres takes in finding the optimal plan. Is it possible to determine this time in Postgres? If yes, then how?
Also, query planners at times do not find the optimal plan. Can I force Postgres to use the optimal plan? If yes, then how can I do so?
For the time taken to prepare a plan and the time taken to execute it, you can use explain (which merely finds a plan) vs explain analyze (which actually runs it) with \timing turned on:
test=# explain select * from test where val = 1 order by id limit 10;
QUERY PLAN
--------------------------------------------------------------------------------
Limit (cost=0.00..4.35 rows=10 width=8)
-> Index Scan using test_pkey on test (cost=0.00..343.25 rows=789 width=8)
Filter: (val = 1)
(3 rows)
Time: 0.759 ms
test=# explain analyze select * from test where val = 1 order by id limit 10;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.00..4.35 rows=10 width=8) (actual time=0.122..0.170 rows=10 loops=1)
-> Index Scan using test_pkey on test (cost=0.00..343.25 rows=789 width=8) (actual time=0.121..0.165 rows=10 loops=1)
Filter: (val = 1)
Rows Removed by Filter: 67
Total runtime: 0.204 ms
(5 rows)
Time: 1.019 ms
Note that there is a tiny overhead in both commands to actually output the plan.
In DB2, for example, when finding the optimal plan would take a lot of time, the database engine decides to use a suboptimal plan. I think it must be the case with Postgres also.
In Postgres, this only occurs if your query is gory enough that it cannot reasonably do an exhaustive search. When you reach the relevant thresholds (which are high, if your use-cases are typical), the planner uses the genetic query optimizer:
http://www.postgresql.org/docs/current/static/geqo-pg-intro.html
If that is the case, how can I fiddle with Postgres so that it chooses the optimal plan?
In more general use cases, there are many things that you can fiddle with, but be very wary of messing around with them (apart, perhaps, from collecting a bit more statistics on a select few columns using ALTER TABLE ... ALTER COLUMN ... SET STATISTICS):
http://www.postgresql.org/docs/current/static/runtime-config-query.html
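A sketch of the statistics tweak mentioned above, assuming the test table and val column from the earlier example:

```sql
-- Raise the per-column statistics target (default 100) so ANALYZE
-- collects a more detailed histogram for this column
ALTER TABLE test ALTER COLUMN val SET STATISTICS 1000;
ANALYZE test;  -- re-collect statistics at the new target
```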
If you run your queries from a SQL file with psql's \i command, add the following line to the file:
\timing true

Why does this SUM() function take so long in PostgreSQL?

This is my query:
SELECT SUM(amount) FROM bill WHERE name = 'peter'
There are 800K+ rows in the table. EXPLAIN ANALYZE says:
Aggregate (cost=288570.06..288570.07 rows=1 width=4) (actual time=537213.327..537213.328 rows=1 loops=1)
-> Seq Scan on bill (cost=0.00..288320.94 rows=498251 width=4) (actual time=48385.201..535941.041 rows=800947 loops=1)
Filter: ((name)::text = 'peter'::text)
Rows Removed by Filter: 8
Total runtime: 537213.381 ms
All rows are affected, and this is correct. But why so long? A similar query without WHERE runs way faster:
EXPLAIN ANALYZE SELECT SUM(amount) FROM bill
Aggregate (cost=137523.31..137523.31 rows=1 width=4) (actual time=2198.663..2198.664 rows=1 loops=1)
-> Index Only Scan using idx_amount on bill (cost=0.00..137274.17 rows=498268 width=4) (actual time=0.032..1223.512 rows=800955 loops=1)
Heap Fetches: 533399
Total runtime: 2198.717 ms
I have an index on amount and an index on name. Have I missed any indexes?
PS: I managed to solve the problem just by adding a new index ON bill(name, amount). I didn't get why it helped, so let's leave the question open for some time...
Since you are searching for a specific name, you should have an index that has name as the first column, e.g. CREATE INDEX IX_bill_name ON bill( name ).
But Postgres can still opt to do a full table scan if it estimates your index to not be selective enough, i.e. if it thinks it is faster to just scan all rows and pick the matching ones instead of consulting an index and jumping around in the table to gather the matching rows. Postgres uses a cost-based estimation technique that weighs random disk reads as more expensive than sequential reads.
As a rule of thumb, for an index to actually be used in this situation, no more than roughly 10% of the rows should match what you are searching for. Since most of your rows have name = 'peter', it is actually faster to do a full table scan.
As to why the SUM without filtering runs faster: it has to do with the overall width of the table. With a WHERE clause, Postgres has to sequentially read all rows in the table so it can disregard those that do not match the filter. Without a WHERE clause, Postgres can instead read all the amounts from the index. Because the index on amount contains only the amounts and pointers to the corresponding rows, and no other data from the table, it is simply less data to wade through. Based on the big difference in performance, I would guess you have quite a lot of other columns in your table.
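The (name, amount) index the asker added combines both effects; a sketch of why it works (the index name is illustrative):

```sql
-- A composite index containing both the filter column and the summed column
-- lets Postgres answer the query with an index-only scan:
CREATE INDEX idx_bill_name_amount ON bill (name, amount);

SELECT SUM(amount) FROM bill WHERE name = 'peter';
-- Postgres finds the name = 'peter' entries in the index and reads amount
-- straight from the index entries, never visiting the wide heap rows
-- (assuming the visibility map is reasonably current, e.g. after VACUUM).
```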