Understanding Rails / PG Explain - ruby-on-rails-3

I know it's kind of an awkward question, but I don't understand what EXPLAIN explains.
My query is User.last, and it took more than 0.5 seconds.
This is probably the simplest of queries, but I can't make sense of the EXPLAIN output.
I don't understand anything that comes after QUERY PLAN.
What's width? What is cost? How does it explain where the query spent its time?
[40] pry(main)> User.last
  User Load (671.0ms)  SELECT "users".* FROM "users" ORDER BY "users"."id" DESC LIMIT 1
  EXPLAIN (39.0ms)  EXPLAIN SELECT "users".* FROM "users" ORDER BY "users"."id" DESC LIMIT 1
EXPLAIN for: SELECT "users".* FROM "users" ORDER BY "users"."id" DESC LIMIT 1
                             QUERY PLAN
--------------------------------------------------------------------
 Limit  (cost=1.08..1.08 rows=1 width=2861)
   ->  Sort  (cost=1.08..1.09 rows=5 width=2861)
         Sort Key: id
         ->  Seq Scan on users  (cost=0.00..1.05 rows=5 width=2861)
(4 rows)

The query plan shows the planner's cost estimates for each part of the query. Taking the Seq Scan node as an example:
cost=0.00 - Estimated start-up cost (time expended before the output phase can begin, e.g., time to do the sorting in a sort node)
..1.05 - Estimated total cost (assuming all rows are retrieved, which they may not be: for example, a query with a LIMIT clause will stop short of paying the total cost of the Limit plan node's input node)
rows=5 - Estimated number of rows output by this plan node (again, only if executed to completion)
width=2861 - Estimated average width (in bytes) of rows output by this plan node
These definitions come from the PostgreSQL documentation ("Using EXPLAIN"), which covers the rest of the output in more detail.
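Note that plain EXPLAIN only prints estimates. To see where the time was actually spent, you can run the same statement under EXPLAIN ANALYZE from a database console (a sketch, not part of the original answer):

EXPLAIN ANALYZE SELECT "users".* FROM "users" ORDER BY "users"."id" DESC LIMIT 1;
-- Each plan node then also reports "actual time=<startup>..<total> rows=<n> loops=<n>",
-- plus a total runtime, so you can see which node the time went into.

If the reported runtime is tiny but the Rails log showed 671 ms, the difference most likely comes from something other than the plan itself (e.g. connection overhead or a cold cache on the first call).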

Related

Postgres query time execution varying a lot between two queries (not rendering issue with low results)

I have a table tableA which contains 600k rows of data (and other tables B, C, D etc...).
This simple query takes 700ms (??) to execute on PgAdmin:
select id from tableA limit 1
Here is the explain :
"Limit (cost=0.00..0.03 rows=1 width=93) (actual time=0.009..0.010 rows=1 loops=1)"
" -> Seq Scan on tableA (cost=0.00..16082.65 rows=619265 width=93) (actual time=0.008..0.008 rows=1 loops=1)"
"Planning Time: 0.058 ms"
"Execution Time: 0.026 ms"
My first question is: why does it take 700 ms when planning + execution time is less than 1 ms? I am far away from the db, so let's assume it's a network latency issue.
Now I have this other query :
SELECT "tableA"."id" AS "tag_id", "tableA"."shortId" AS "tag_shortId", "tableA"."externalId" AS "tag_externalId", "tableA"."index" AS "tag_index", "tableA"."createdDate" AS "tag_createdDate",
"tableA"."updatedDate" AS "tag_updatedDate", "tableA"."personId" AS "personId", "tableA"."createdById" AS "tag_createdById", "tableB"."id" AS "tableB_id", "tableB"."name" AS "tableB_name",
"tableB"."image" AS "tableB_image", "tableB"."address" AS "tableB_address", "tableB"."city" AS "tableB_city", "tableB"."state" AS "tableB_state", "tableB"."zip" AS "tableB_zip", "tableB"."country"
AS "tableB_country", "tableB"."capacity" AS "tableB_capacity", "tableB"."type" AS "tableB_type", "tableB"."externalURL" AS "tableB_externalURL", "tableB"."externalId" AS "tableB_externalId",
"tableB"."createdDate" AS "tableB_createdDate", "tableB"."updatedDate" AS "tableB_updatedDate", "tableB"."tagCount" AS "tableB_tagCount", "tableB"."brandId" AS "tableB_brandId",
"tableB"."ownerId" AS "tableB_ownerId", "tableB"."createdById" AS "tableB_createdById", "target"."id" AS "target_id", "target"."name" AS "target_name", "target"."image" AS "target_image",
"target"."startDate" AS "target_startDate", "target"."endDate" AS "target_endDate", "target"."isPerpetual" AS "target_isPerpetual", "target"."isLive" AS "target_isLive",
"target"."defaultExternalURL" AS "target_defaultExternalURL", "target"."externalId" AS "target_externalId", "target"."createdDate" AS "target_createdDate", "target"."updatedDate"
AS "target_updatedDate", "target"."personId" AS "target_tableBId", "target"."totoId" AS "target_totoId", "target"."brandId"
AS "target_brandId", "target"."createdById" AS "target_createdById"
FROM "tableA" "tableA"
LEFT JOIN "tag_data" "tagData" ON "tagData"."tagId"="tableA"."id"
LEFT JOIN "tableB" "tableB" ON "tableB"."id"="tableA"."personId"
LEFT JOIN "target" "target" ON "tableA"."personId" = "target"."personId"
AND (("target"."startDate" <= NOW() AND "target"."endDate" >= NOW()) OR "target"."isPerpetual" = true)
LIMIT 25
The query itself doesn't really matter; it's only there to show that it is a pretty big query with multiple conditional joins etc... This query runs in 400 ms, compared to 700 ms for the first query (???)
Here is the query explain :
*doing bunch of merge and index search etc...*
"Planning Time: 0.630 ms"
"Execution Time: 0.353 ms"
So we can see that query 2 takes longer to execute (about 1 ms), which makes sense compared to query 1.
My question is: why does query 1 take 700 ms (with explain < 1 ms and 1 row of data) while query 2 takes 400 ms (with explain of about 1 ms and 25 rows of data), especially if the main reason for the 700 ms is "poor network"?
Any help appreciated
PS: the times mentioned are averages and the DB is running on RDS with very low activity
UPDATE: "Why explain analyze and execution query time is different" does not answer this; I'm returning 1 row here, not thousands of rows that need "rendering".
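One way to separate server time from client/transfer time (a sketch, not from the thread; it assumes you can open a psql session, ideally from a host close to the RDS instance) is to compare psql's client-side timing with the server-side numbers:

\timing on
EXPLAIN (ANALYZE, BUFFERS) select id from tableA limit 1;
-- "Execution Time" is what the server spent on the query itself; the
-- "Time:" line printed by \timing also includes the network round trip
-- and the client's own work.

If the \timing figure from a nearby host is small while your usual client still shows hundreds of milliseconds, the overhead is in the round trips or the client tool rather than in PostgreSQL.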

PostgreSql Select query performance issue

I have a simple select query:
SELECT * FROM entities WHERE entity_type_id = 1 ORDER BY entity_id
Then I want to get the first 100 results, so I use this:
SELECT * FROM entities WHERE entity_type_id = 1 ORDER BY entity_id LIMIT 100
The problem is that the second query runs much slower than the first one. It takes less than a second to execute the first query and more than a minute to execute the second one.
These are execution plans for the queries:
without limit:
 Sort  (cost=26201.43..26231.42 rows=11994 width=72)
   Sort Key: entity_id
   ->  Index Scan using entity_type_id_idx on entities  (cost=0.00..24895.34 rows=11994 width=72)
         Index Cond: (entity_type_id = 1)
with limit:
 Limit  (cost=0.00..8134.39 rows=100 width=72)
   ->  Index Scan using xpkentities on entities  (cost=0.00..975638.85 rows=11994 width=72)
         Filter: (entity_type_id = 1)
I don't understand why these two plans are so different and why the performance decreases so much. How should I tweak the second query to make it work faster?
I use PostgreSql 9.2.
You want the 100 smallest entity_id values matching your condition. Now, if those were numbers 1..100 then clearly using the entity_id index is the best way to handle this - everything is pre-sorted. In fact, if the 100 you wanted were in the range 1..200 it would still make sense. Probably 1..1000 would too.
So - PostgreSQL thinks it will find lots of entity_type_id=1 values at the "start" of the table. It estimates a cost of 8134 vs 26231 to filter by type then sort. In your case it is wrong.
Now - either there is some correlation which isn't obvious (that's bad - we can't tell the planner about that at present), or we don't have up-to-date or sufficient stats.
Does an ANALYZE entities make any difference? You can see what values the planner knows about by reading the planner-stats page in the manuals.
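If updated statistics don't change the plan, a common fix in this situation is a multicolumn index that matches both the filter and the sort order, so the 100 rows can be read directly in entity_id order (a sketch; the index name is arbitrary):

CREATE INDEX entities_type_entity_id_idx ON entities (entity_type_id, entity_id);
-- The LIMIT 100 query can then use a single index scan on
-- (entity_type_id = 1) that is already sorted by entity_id, instead of
-- walking the primary-key index and filtering out non-matching rows.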

Optimization of aggregate SQL query

I am running a query against a table in PostgreSQL 9.2.
The table has a lot of fields, but the ones relevant to this is:
video_id BIGINT NOT NULL
day_date DATE NOT NULL
total_plays BIGINT default 0
total_playthrough_average DOUBLE PRECISION
total_downloads BIGINT default 0
The query takes this form:
SELECT
    SUM(total_plays) AS total_plays,
    CASE SUM(total_downloads)
        WHEN 0 THEN 100
        ELSE SUM(total_playthrough_average * total_downloads) / SUM(total_downloads)
    END AS total_playthrough_average
FROM
    mytable
WHERE
    video_id = XXXX
    -- Date parameter - exemplified by the current month
    AND day_date >= DATE('2013-09-01') AND day_date <= DATE('2013-09-30')
The point of the query is to find the playthrough_average (a score of how much of the video the average person sees, between 0 and 100) of all videos, weighted by the downloads each video has (so the average playthrough of a video with 100 downloads weighs more than that of a video with 10 downloads).
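(With made-up numbers for illustration: one row with playthrough 80 and 100 downloads combined with another row with playthrough 40 and 10 downloads gives (80*100 + 40*10) / (100 + 10) ≈ 76.4, rather than the unweighted average of 60.)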
The table uses the following index (among others):
"video_index1" btree (video_id, day_date, textfield1, textfield2, textfield3)
Doing an EXPLAIN ANALYZE on the query gives me this:
 Aggregate  (cost=153.33..153.35 rows=1 width=24) (actual time=6.219..6.221 rows=1 loops=1)
   ->  Index Scan using video_index1 on mytable  (cost=0.00..152.73 rows=40 width=24) (actual time=0.461..5.387 rows=105 loops=1)
         Index Cond: ((video_id = 6702200) AND (day_date >= '2013-01-01'::date) AND (day_date <= '2013-12-31'::date))
 Total runtime: 6.757 ms
This seems like everything is dandy, but this is only when I test with a query that has already been performed. When my program is running I get a lot of queries taking 10-30 seconds (usually every few seconds). I am running it with 6-10 simultaneous processes making these queries (among others).
Is there something I can tweak in the postgresql settings to get better performance out of this? The table is updated constantly, although maybe only once or twice per hour per video_id, with both INSERT and UPDATE queries.
Your summing does not make sense to me. I think what you want is
select
    sum(total_plays) as total_plays,
    sum(total_downloads) as total_downloads,
    sum(total_playthrough_average * total_downloads) as total_playthrough_average
from mytable
where
    video_id = 1
    and day_date between '2013-09-01' and '2013-09-30'
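If the downloads-weighted average from the question is still the goal, the division and the zero-downloads guard can be put back on top of that shape (a sketch; assuming the same table and columns):

SELECT
    SUM(total_plays) AS total_plays,
    SUM(total_downloads) AS total_downloads,
    CASE
        WHEN SUM(total_downloads) = 0 THEN 100
        ELSE SUM(total_playthrough_average * total_downloads) / SUM(total_downloads)
    END AS total_playthrough_average
FROM mytable
WHERE
    video_id = 1
    AND day_date BETWEEN '2013-09-01' AND '2013-09-30';
-- The searched CASE guards against dividing by zero when the video had
-- no downloads in the period.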

Postgres min function performance

I need the lowest value for runnerId.
This query:
SELECT "runnerId" FROM betlog WHERE "marketId" = '107416794' ;
takes 80 ms (1968 result rows).
This:
SELECT min("runnerId") FROM betlog WHERE "marketId" = '107416794' ;
takes 1600 ms.
Is there a faster way to find the minimum, or should I calculate the min in my Java program?
"Result (cost=100.88..100.89 rows=1 width=0)"
" InitPlan 1 (returns $0)"
" -> Limit (cost=0.00..100.88 rows=1 width=9)"
" -> Index Scan using runneridindex on betlog (cost=0.00..410066.33 rows=4065 width=9)"
" Index Cond: ("runnerId" IS NOT NULL)"
" Filter: ("marketId" = 107416794::bigint)"
CREATE INDEX marketidindex
ON betlog
USING btree
("marketId" COLLATE pg_catalog."default");
Another idea:
SELECT "runnerId" FROM betlog WHERE "marketId" = '107416794' ORDER BY "runnerId" LIMIT 1 >1600ms
SELECT "runnerId" FROM betlog WHERE "marketId" = '107416794' ORDER BY "runnerId" >>100ms
How can a LIMIT slow the query down?
What you need is a multi-column index:
CREATE INDEX betlog_mult_idx ON betlog ("marketId", "runnerId");
If interested, you'll find in-depth information about multi-column indexes in PostgreSQL, links and benchmarks under this related question on dba.SE.
How did I figure?
In a multi-column index, rows are ordered (and thereby clustered) by the first column of the index ("marketId"), and each cluster is in turn ordered by the second column of the index - so the first row matches the condition min("runnerId"). This makes the index scan extremely fast.
Concerning the paradox effect of LIMIT slowing down a query - the Postgres query planner has a weakness there. The common workaround is to use a CTE (not necessary in this case). Find more information under this recent, closely related question:
PostgreSQL query taking too long
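To confirm the new index is being used, re-running the query under EXPLAIN ANALYZE after creating it is enough (a sketch; the exact plan depends on your data and statistics):

CREATE INDEX betlog_mult_idx ON betlog ("marketId", "runnerId");
EXPLAIN ANALYZE
SELECT min("runnerId") FROM betlog WHERE "marketId" = '107416794';
-- With the two-column index in place, the plan should show a scan on
-- betlog_mult_idx that stops after the first matching row, instead of
-- walking runneridindex and filtering on "marketId".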
The min statement will be executed by PostgreSQL using a sequential scan of the entire table. You could optimize the query using the following approach:
SELECT col FROM sometable ORDER BY col ASC LIMIT 1;
When you had an index on ("runnerId") (or at least with "runnerId" as the high order column) but did not have the index on ("marketId", "runnerId") it compared the cost of passing all rows with a matching "marketId" using the index on that column and picking out the minimum "runnerId" from that set to the cost of scanning using the index on "runnerId" and stopping when it found the first row with a matching "marketId". Based on available statistics and the assumption that "marketId" values would be randomly distributed within the index entries for the index on "runnerId" it estimated a lower cost for the latter approach.
It also estimated the cost of scanning the whole table and picking the minimum from matching rows as well as probably a number of other alternatives. It does not always use a certain type of plan, but compares costs of all the alternatives.
The problem is that the assumption that values will be randomly distributed in the range is not necessarily true (as in this example), leading to a scan of a high percentage of the range to find the rows lurking at the end. For some values of "marketId", where the chosen value is available near the beginning of the "runnerId" index, this plan should be very fast.
There has been discussion in the PostgreSQL developer community of how we might bias against plans which are "risky" in terms of running long if the data distribution is not what was assumed, and there has been work on tracking multi-column statistics so that correlated values don't run into such problems. Expect improvements in this area in the next few releases. Until then, Erwin's suggestions are on target for how to work around the issue.
Basically it comes down to making a more attractive plan available or introducing an optimization barrier. In this case you can provide a more attractive option by adding the index on ("marketId", "runnerId") -- which allows a very direct way to get straight to the answer. The planner assigns a very low cost to that alternative, causing it to be chosen. If you preferred not to add the index, you could force an optimization barrier by doing something like this:
SELECT min("runnerId")
  FROM (SELECT "runnerId"
          FROM betlog
          WHERE "marketId" = '107416794'
          OFFSET 0) x;
When there is an OFFSET clause (even for an offset of zero) it forces the subquery to be planned separately and its results fed to the outer query. I would expect this to run in 80 ms rather than the 1600 ms you get without the optimization barrier. Of course, if you can add the index, the speed of the query when data is cached should be less than 1 ms.

PostgreSQL: NOT IN versus EXCEPT performance difference (edited #2)

I have two queries that are functionally identical. One of them performs very well, the other one performs very poorly. I do not see from where the performance difference arises.
Query #1:
SELECT id
FROM subsource_position
WHERE
id NOT IN (SELECT position_id FROM subsource)
This comes back with the following plan:
                                  QUERY PLAN
-------------------------------------------------------------------------------
 Seq Scan on subsource_position  (cost=0.00..362486535.10 rows=128524 width=4)
   Filter: (NOT (SubPlan 1))
   SubPlan 1
     ->  Materialize  (cost=0.00..2566.50 rows=101500 width=4)
           ->  Seq Scan on subsource  (cost=0.00..1662.00 rows=101500 width=4)
Query #2:
SELECT id FROM subsource_position
EXCEPT
SELECT position_id FROM subsource;
Plan:
                                            QUERY PLAN
-------------------------------------------------------------------------------------------------
 SetOp Except  (cost=24760.35..25668.66 rows=95997 width=4)
   ->  Sort  (cost=24760.35..25214.50 rows=181663 width=4)
         Sort Key: "*SELECT* 1".id
         ->  Append  (cost=0.00..6406.26 rows=181663 width=4)
               ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..4146.94 rows=95997 width=4)
                     ->  Seq Scan on subsource_position  (cost=0.00..3186.97 rows=95997 width=4)
               ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..2259.32 rows=85666 width=4)
                     ->  Seq Scan on subsource  (cost=0.00..1402.66 rows=85666 width=4)
(8 rows)
I have a feeling I'm missing either something obviously bad about one of my queries, or I have misconfigured the PostgreSQL server. I would have expected this NOT IN to optimize well; is NOT IN always a performance problem or is there a reason it does not optimize here?
Additional data:
=> select count(*) from subsource;
count
-------
85158
(1 row)
=> select count(*) from subsource_position;
count
-------
93261
(1 row)
Edit: I have now fixed the A-B != B-A problem mentioned below. But my problem as stated still exists: query #1 is still massively worse than query #2. This, I believe, follows from the fact that both tables have similar numbers of rows.
Edit 2: I'm using PostgreSQL 9.0.4. I cannot use EXPLAIN ANALYZE because query #1 takes too long. All of these columns are NOT NULL, so there should be no difference as a result of that.
Edit 3: I have an index on both these columns. I haven't yet gotten query #1 to complete (gave up after ~10 minutes). Query #2 returns immediately.
Query #1 is not an elegant way of doing this... (NOT) IN (SELECT ...) is fine for a few entries, but it can't use indexes (Seq Scan).
If you didn't have EXCEPT, the alternative would be to use a JOIN (HASH JOIN):
SELECT sp.id
FROM subsource_position AS sp
LEFT JOIN subsource AS s ON (s.position_id = sp.id)
WHERE s.position_id IS NULL
EXCEPT appeared in Postgres a long time ago... but in MySQL I believe this is still the only way to achieve this using indexes.
Since you are running with the default configuration, try bumping up work_mem. Most likely, the subquery ends up getting spooled to disk because you only allow 1 MB of work memory. Try 10 or 20 MB.
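For reference, work_mem can be raised for just the current session before re-running the query (a sketch; pick a value that fits your server's RAM and connection count):

SET work_mem = '20MB';   -- per-session; reverts when the connection closes
SELECT id
FROM subsource_position
WHERE id NOT IN (SELECT position_id FROM subsource);
-- A permanent change would go in postgresql.conf, followed by a reload.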
Your queries are not functionally equivalent so any comparison of their query plans is meaningless.
Your first query is, in set theory terms, this:
{subsource.position_id} - {subsource_position.id}
but your second is this:
{subsource_position.id} - {subsource.position_id}
Note that the operands are swapped. And A - B is not the same as B - A for arbitrary sets A and B.
Fix your queries to be semantically equivalent and try again.
If id and position_id are both indexed (either on their own or first column in a multi-column index), then two index scans are all that are necessary - it's a trivial sorted-merge based set algorithm.
Personally I think PostgreSQL simply doesn't have the optimization intelligence to understand this.
(I came to this question after diagnosing a query running for over 24 hours that I could perform with sort x y y | uniq -u on the command line in seconds. Database less than 50MB when exported with pg_dump.)
PS: there is a more interesting comment on this:
more work has been put into optimizing
EXCEPT and NOT EXISTS than NOT IN, because the latter is substantially
less useful due to its unintuitive but spec-mandated handling of NULLs.
We're not going to apologize for that, and we're not going to regard it as a bug.
What it comes down to is that EXCEPT is different from NOT IN with respect to NULL handling. I haven't looked up the details, but it means PostgreSQL (aggressively) doesn't optimize the latter.
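Since the quote above also singles out NOT EXISTS as a form the planner handles well, the usual rewrite is an anti-join via NOT EXISTS (a sketch; with id unique it returns the same rows as the EXCEPT query):

SELECT sp.id
FROM subsource_position AS sp
WHERE NOT EXISTS (
    SELECT 1
    FROM subsource AS s
    WHERE s.position_id = sp.id
);
-- Planned as an anti-join, much like the LEFT JOIN ... IS NULL form above,
-- and unlike NOT IN its behaviour stays predictable if position_id can be NULL.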
The second query makes use of the HASH JOIN feature of PostgreSQL. This is much faster than the Seq Scan of the first one.