Query issue - Postgres / pgAdmin 4 - SQL

My query is taking hours to run; it's over 9M rows in Postgres 9.3 via pgAdmin 4. The data is all strings because of the format it arrived in, and I had to use a subquery to break it into columns. So in order to create the aggregated views I cast each of the elements I need. Because there are millions of rows, some data elements are NULL somewhere in the table, so I needed the WHERE clause. I have indexes for each of the columns in the WHERE clause, all basic btree indexes. Example: Create index loan_age_idx on "gnma2" using btree ("Loan_Age"). But the WHERE clause, which is required because of the cast functions, is causing zero rows to be returned. What can I do to remove the WHERE conditions but keep the casts?
create table "t1" as
Select
gnma2."Issuer_ID",
gnma2."As_of_Date",
gnma2."Agency",
gnma2."Loan_Purpose",
gnma2."State",
gnma2."Months_Delinquent",
count(gnma2."Disclosure_Sequence_Number") as "Total_Loan_Count",
avg(cast(gnma2."Loan_Interest_Rate" as double precision))/1000 as "avg_int_rate",
avg(cast(gnma2."Original_Principal_Balance" as real))/100 as "avg_OUPB",
avg(cast(gnma2."Unpaid_Principal_Balance" as real))/100 as "avg_UPB",
avg(cast(gnma2."Loan_Age" as real)) as "avg_loan_age",
avg(cast(gnma2."Loan_To_Value" as real))/100 as "avg_LTV",
avg(cast(gnma2."Total_Debt_Expense_Ratio_Percent" as real))/100 as "avg_DTI",
avg(cast(gnma2."Credit_Score" as real)) as "avg_credit_score",
left(gnma2."First_Payment_Date",4) as "Origination_Yr"
From public."gnma2"
where
"Loan_Age" >= '1' and
"Loan_To_Value" >= '0' and
"Total_Debt_Expense_Ratio_Percent" >= '0' and
"Credit_Score" >= '0' and
"Loan_Interest_Rate" >= '0' and
"Original_Principal_Balance" >= '1' and
"Unpaid_Principal_Balance" >= '0'
Group By
gnma2."Issuer_ID",
gnma2."Agency",
gnma2."Loan_Purpose",
gnma2."State",
gnma2."As_of_Date",
gnma2."Months_Delinquent",
gnma2."First_Payment_Date";
QUERY PLAN
HashAggregate  (cost=3496556.07..3496556.65 rows=13 width=91) (actual time=124214.207..124214.207 rows=0 loops=1)
  ->  Bitmap Heap Scan on gnma2  (cost=166750.75..3496546.65 rows=130 width=91) (actual time=124214.202..124214.202 rows=0 loops=1)
        Recheck Cond: ("Loan_Age" >= '1'::bpchar)
        Rows Removed by Index Recheck: 4785233
        Filter: (("Loan_To_Value" >= '0'::bpchar) AND ("Total_Debt_Expense_Ratio_Percent" >= '0'::bpchar) AND ("Credit_Score" >= '0'::bpchar) AND ("Loan_Interest_Rate" >= '0'::bpchar) AND ("Original_Principal_Balance" >= '1'::bpchar) AND ("Unpaid_Principal_Balance" >= '0'::bpchar))
        Rows Removed by Filter: 9311198
        ->  Bitmap Index Scan on loan_age_indx  (cost=0.00..166750.72 rows=9029087 width=0) (actual time=5015.865..5015.865 rows=9311198 loops=1)
              Index Cond: ("Loan_Age" >= '1'::bpchar)
Total runtime: 124214.524 ms
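One way to drop the WHERE clause while keeping the casts is to make each cast defensive, so blanks and non-numeric strings turn into NULL and avg() simply ignores them. A sketch for a single select-list item (the regex guard is an assumption about what the raw strings can contain):
-- hypothetical drop-in replacement for one aggregate in the select list above;
-- rows whose "Loan_Age" is empty or non-numeric contribute NULL, which avg() skips
avg(case when gnma2."Loan_Age" ~ '^\s*\d+(\.\d+)?\s*$'
    then cast(gnma2."Loan_Age" as real) end) as "avg_loan_age"
Applying the same guard to the other cast columns makes the whole WHERE clause unnecessary.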

Related

Why does it do an index scan when there are indexes on all the relevant columns?

I have an index on car_id and an index on ended_at.
Is this query taking so long because I am ordering it by id and I have separate unique indexes?
Would it be better if I ordered it by ended_at and then made an index on both ended_at and car_id?
SELECT "trip_reports".*
FROM "trip_reports"
WHERE "trip_reports"."car_id" = $1 AND (ended_at < '2020-11-03 17:31:09')
ORDER BY "trip_reports"."id" DESC
LIMIT $2
Duration is 6.05 minutes.
The query plan:
Limit (cost=0.56..4512.17 rows=1 width=1156)
-> Index Scan Backward using trip_reports_pkey on trip_reports (cost=0.56..9830786.80 rows=2179 width=1156)
Filter: ((ended_at < '2020-11-03 20:55:57'::timestamp without time zone) AND (car_id = 103638))
EXPLAIN (ANALYZE, BUFFERS)
Limit (cost=0.56..4512.67 rows=1 width=1156) (actual time=976974.363..976974.363 rows=0 loops=1)
Buffers: shared hit=2071575 read=3222036
-> Index Scan Backward using trip_reports_pkey on trip_reports (cost=0.56..9831877.02 rows=2179 width=1156) (actual time=976974.361..976974.361 rows=0 loops=1)
Filter: ((ended_at < '2020-11-03 17:31:09'::timestamp without time zone) AND (car_id = 119780))
Rows Removed by Filter: 22862225
Buffers: shared hit=2071575 read=3222036
Planning time: 0.113 ms
Execution time: 976975.711 ms
In my opinion:
Try creating an index for the WHERE condition:
create index trip_reports_carid_endtime on trip_reports (car_id, ended_at);
(column order in index is important)
If you did not do it before:
vacuum (analyse) trip_reports;
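After creating the index, re-run the plan to confirm the planner switches from the backward scan of trip_reports_pkey to the new (car_id, ended_at) index. The car_id and timestamp below come from the posted plan; LIMIT 50 is only an illustrative value:
EXPLAIN (ANALYZE, BUFFERS)
SELECT "trip_reports".*
FROM "trip_reports"
WHERE "trip_reports"."car_id" = 119780
AND ended_at < '2020-11-03 17:31:09'
ORDER BY "trip_reports"."id" DESC
LIMIT 50;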

Is there a way to optimize casting a jsonb value as double precision in PostgreSQL by indexing it?

I have a PostgreSQL table with > 1M records, and there is a jsonb column. My application makes a lot of selects like this:
select from table_name where cast(json ->> 'value' as double precision) >= 1;
This query takes almost 1500 ms to execute. I can't change the query itself, so I need to build an index for this operation.
I tried to build btree index:
create index concurrently ix_json_value_as_number on table_name using btree(cast(json ->> 'value' as double precision));
but it didn't give me any query-time improvement. Any ideas?
P.S. I use Postgres 9.4.
P.P.S. The EXPLAIN output:
Aggregate (cost=176328.20..176328.20 rows=1 width=0) (actual time=1358.130..1358.130 rows=1 loops=1)
-> Bitmap Heap Scan on table_name (cost=3055.46..175455.10 rows=349240 width=0) (actual time=168.344..1285.381 rows=966650 loops=1)
Recheck Cond: (((bar->> 'foo'::text))::double precision >= 2.39999999999999991::double precision)
Rows Removed by Index Recheck: 36082
Heap Blocks: exact=58417 lossy=105558
-> Bitmap Index Scan on ix_json_value_as_number (cost=0.00..2968.15 rows=349240 width=0) (actual time=152.711..152.711 rows=966711 loops=1)
Index Cond: (((bar ->> 'foo'::text))::double precision >= 2.39999999999999991::double precision)
Planning time: 0.095 ms
Execution time: 1358.813 ms
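One observation on the plan above: the lossy heap blocks mean the bitmap no longer fit in work_mem, so whole pages had to be rechecked; also, the predicate matches roughly 966k of the >1M rows, so no index can narrow that down by much. A cheap experiment is to raise work_mem for the session (the value and the count(*) form below are assumptions, inferred from the Aggregate node in the plan):
SET work_mem = '64MB';  -- illustrative value only
SELECT count(*)
FROM table_name
WHERE cast(json ->> 'value' as double precision) >= 1;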

How to optimize a MAX SQL query with GROUP BY DATE

I'm trying to optimize a query from a table with 3M rows.
The columns are value, datetime and point_id.
SELECT DATE(datetime), MAX(value) FROM historical_points WHERE point_id=1 GROUP BY DATE(datetime);
This query takes 2 seconds.
I tried an index for the point_id = 1 filter, but the results were not much better.
Is it possible to index the MAX query or is there a better way to do it? Maybe with an INNER JOIN?
EDIT:
This is the EXPLAIN ANALYZE of a similar query that tackles the case better; it also has a performance problem.
EXPLAIN ANALYZE SELECT DATE(datetime), MAX(value), MIN(value) FROM buildings_hispoint WHERE point_id=64 AND datetime BETWEEN '2017-09-01 00:00:00' AND '2017-10-01 00:00:00' GROUP BY DATE(datetime);
GroupAggregate  (cost=84766.65..92710.99 rows=336803 width=68) (actual time=1461.060..2701.145 rows=21 loops=1)
  Group Key: (date(datetime))
  ->  Sort  (cost=84766.65..85700.23 rows=373430 width=14) (actual time=1408.445..1547.929 rows=523621 loops=1)
        Sort Key: (date(datetime))
        Sort Method: external sort  Disk: 11944kB
        ->  Bitmap Heap Scan on buildings_hispoint  (cost=10476.02..43820.81 rows=373430 width=14) (actual time=148.970..731.154 rows=523621 loops=1)
              Recheck Cond: (point_id = 64)
              Filter: ((datetime >= '2017-09-01 00:00:00+02'::timestamp with time zone) AND (datetime <= '2017-10-01 00:00:00+02'::timestamp with time zone))
              Rows Removed by Filter: 35712
              Heap Blocks: exact=14422
              ->  Bitmap Index Scan on buildings_measurementdatapoint_ffb10c68  (cost=0.00..10382.67 rows=561898 width=0) (actual time=125.150..125.150 rows=559333 loops=1)
                    Index Cond: (point_id = 64)
Planning time: 0.284 ms
Execution time: 2704.566 ms
Without seeing EXPLAIN output it is difficult to say much. My guess is that you need to include the DATE() call in the index definition:
CREATE INDEX historical_points_idx ON historical_points (DATE(datetime), point_id);
Also, if point_id has more distinct values than DATE(datetime), then you should reverse the column order:
CREATE INDEX historical_points_idx ON historical_points (point_id, DATE(datetime));
Keep in mind that column cardinality is very important to the planner; columns with high selectivity should go first.
SELECT DISTINCT ON (DATE(datetime)) DATE(datetime), value
FROM historical_points WHERE point_id=1
ORDER BY DATE(datetime) DESC, value DESC;
Put a computed (expression) index on DATE(datetime), value. [I hope those aren't your real column names. Using reserved words like VALUE as a column name is a recipe for confusion.]
The SELECT DISTINCT ON works like a GROUP BY. The ORDER BY replaces the MAX, and will be fast if indexed.
I owe this technique to @ErwinBrandstetter.
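A sketch of that expression index, with point_id first so the WHERE clause is covered as well (the index name is arbitrary, and it assumes datetime is a plain timestamp; a DATE() expression over timestamptz is not immutable and cannot be indexed directly):
CREATE INDEX historical_points_day_value_idx
ON historical_points (point_id, (DATE(datetime)) DESC, value DESC);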

Slow PostgreSQL query with (incorrect?) indexes

I have an Events table with 30 million rows. The following query returns in 25 seconds
SELECT DISTINCT "events"."id", "calendars"."user_id"
FROM "events"
LEFT JOIN "calendars" ON "events"."calendar_id" = "calendars"."id"
WHERE "events"."deleted_at" is null
AND tstzrange('2016-04-21T12:12:36-07:00', '2016-04-21T12:22:36-07:00') #> lower(time_range)
AND ("status" is null or (status->>'pre_processed') IS NULL)
status is a jsonb column with an index on status->>'pre_processed'. Here are the other indexes that were created on the events table. time_range is of type TSTZRANGE.
CREATE INDEX events_time_range_idx ON events USING gist (time_range);
CREATE INDEX events_lower_time_range_index on events(lower(time_range));
CREATE INDEX events_upper_time_range_index on events(upper(time_range));
CREATE INDEX events_calendar_id_index on events (calendar_id);
I'm definitely out of my comfort zone on this and am trying to reduce the query time. Here's the output of explain analyze
HashAggregate (cost=7486635.89..7486650.53 rows=1464 width=48) (actual time=26989.272..26989.306 rows=98 loops=1)
Group Key: events.id, calendars.user_id
-> Nested Loop Left Join (cost=0.42..7486628.57 rows=1464 width=48) (actual time=316.110..26988.941 rows=98 loops=1)
-> Seq Scan on events (cost=0.00..7475629.43 rows=1464 width=50) (actual time=316.049..26985.344 rows=98 loops=1)
Filter: ((deleted_at IS NULL) AND ((status IS NULL) OR ((status ->> 'pre_processed'::text) IS NULL)) AND ('["2016-04-21 19:12:36+00","2016-04-21 19:22:36+00")'::tstzrange #> lower(time_range)))
Rows Removed by Filter: 31592898
-> Index Scan using calendars_pkey on calendars (cost=0.42..7.50 rows=1 width=48) (actual time=0.030..0.031 rows=1 loops=98)
Index Cond: (events.calendar_id = (id)::text)
Planning time: 1.468 ms
Execution time: 26989.370 ms
And here is the explain analyze with the events.deleted_at part of the query removed
HashAggregate (cost=7487382.57..7487398.33 rows=1576 width=48) (actual time=23880.466..23880.503 rows=115 loops=1)
Group Key: events.id, calendars.user_id
-> Nested Loop Left Join (cost=0.42..7487374.69 rows=1576 width=48) (actual time=16.612..23880.114 rows=115 loops=1)
-> Seq Scan on events (cost=0.00..7475629.43 rows=1576 width=50) (actual time=16.576..23876.844 rows=115 loops=1)
Filter: (((status IS NULL) OR ((status ->> 'pre_processed'::text) IS NULL)) AND ('["2016-04-21 19:12:36+00","2016-04-21 19:22:36+00")'::tstzrange #> lower(time_range)))
Rows Removed by Filter: 31592881
-> Index Scan using calendars_pkey on calendars (cost=0.42..7.44 rows=1 width=48) (actual time=0.022..0.023 rows=1 loops=115)
Index Cond: (events.calendar_id = (id)::text)
Planning time: 0.372 ms
Execution time: 23880.571 ms
I added the index on the status column. Everything else was already there, and I'm unsure how to proceed. Any suggestions on how to get the query time down to a more manageable number?
The B-tree index on lower(time_range) can only be used for conditions involving the <, <=, =, >= and > operators. The #> operator may rely on these internally, but as far as the planner is concerned, this range check operation is a black box, and so it can't make use of the index.
You will need to reformulate your condition in terms of the B-tree operators, i.e.:
lower(time_range) >= '2016-04-21T12:12:36-07:00' AND
lower(time_range) < '2016-04-21T12:22:36-07:00'
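Plugged back into the query, that gives (same timestamps as the original; the planner can now use events_lower_time_range_index instead of scanning all 30M rows):
SELECT DISTINCT "events"."id", "calendars"."user_id"
FROM "events"
LEFT JOIN "calendars" ON "events"."calendar_id" = "calendars"."id"
WHERE "events"."deleted_at" is null
AND lower(time_range) >= '2016-04-21T12:12:36-07:00'
AND lower(time_range) < '2016-04-21T12:22:36-07:00'
AND ("status" is null or (status->>'pre_processed') IS NULL)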
So add an index for events.deleted_at to get rid of the nasty sequential scan. What does it look like after that?

Slow PostgreSQL query in production - help me understand this explain analyze output

I have a query that is taking 9 minutes to run on PostgreSQL 9.0.0 on x86_64-unknown-linux-gnu, compiled by GCC gcc (GCC) 4.1.2 20080704 (Red Hat 4.1.2-46), 64-bit
This query is automatically generated by hibernate for my application. It's trying to find all of the "teacher members" in a school. A membership is a user with a role in a group. There are several types of groups, but here what matters are schools and services. If someone is a teacher member in a service and a member in this school (15499) then they are what we are looking for.
This query used to run fine in production and still runs fine in development, but in production it is now taking several minutes to run. Can you help me understand why?
Here's the query:
select distinct user1_.ID as ID14_, user1_.FIRST_NAME as FIRST2_14_, user1_.LAST_NAME as LAST3_14_, user1_.STREET_1 as STREET4_14_, user1_.STREET_2 as STREET5_14_, user1_.CITY as CITY14_, user1_.us_state_id as us7_14_, user1_.REGION as REGION14_, user1_.country_id as country9_14_, user1_.postal_code as postal10_14_, user1_.USER_NAME as USER11_14_, user1_.PASSWORD as PASSWORD14_, user1_.PROFESSION as PROFESSION14_, user1_.PHONE as PHONE14_, user1_.URL as URL14_, user1_.bio as bio14_, user1_.LAST_LOGIN as LAST17_14_, user1_.STATUS as STATUS14_, user1_.birthdate as birthdate14_, user1_.ageInYears as ageInYears14_, user1_.deleted as deleted14_, user1_.CREATEDATE as CREATEDATE14_, user1_.audit as audit14_, user1_.migrated2008 as migrated24_14_, user1_.creator as creator14_
from DIR_MEMBERSHIPS membership0_
inner join DIR_USERS user1_ on membership0_.USER_ID=user1_.ID, DIR_ROLES role2_, DIR_GROUPS group4_
where membership0_.role=role2_.ID
and membership0_.GROUP_ID=group4_.id
and membership0_.GROUP_ID=15499
and case when membership0_.expires is null
then 1
else case when (membership0_.expires > CURRENT_TIMESTAMP and (membership0_.startDate is null or membership0_.startDate < CURRENT_TIMESTAMP))
then 1
else 0 end
end =1
and membership0_.deleted=false
and role2_.deleted=false
and role2_.NAME='ROLE_MEMBER'
and group4_.deleted=false
and user1_.STATUS='active'
and user1_.deleted=false
and (membership0_.USER_ID in (
select membership7_.USER_ID
from DIR_MEMBERSHIPS membership7_, DIR_USERS user8_, DIR_ROLES role9_
where membership7_.USER_ID=user8_.ID
and membership7_.role=role9_.ID
and case when membership7_.expires is null
then 1
else case when (membership7_.expires > CURRENT_TIMESTAMP
and (membership7_.startDate is null or membership7_.startDate < CURRENT_TIMESTAMP))
then 1
else 0 end
end =1
and membership7_.deleted=false
and role9_.NAME='ROLE_TEACHER_MEMBER'));
Explain analyze output:
HashAggregate (cost=61755.63..61755.64 rows=1 width=3334) (actual time=652504.302..652504.307 rows=4 loops=1)
-> Nested Loop (cost=4355.35..61755.56 rows=1 width=3334) (actual time=304.450..652504.217 rows=6 loops=1)
-> Nested Loop (cost=4355.35..61747.28 rows=1 width=3342) (actual time=304.419..652504.060 rows=6 loops=1)
-> Nested Loop Semi Join (cost=4355.35..61738.97 rows=1 width=32) (actual time=304.385..652503.961 rows=6 loops=1)
Join Filter: (user_id = user_id)
-> Nested Loop (cost=0.00..32.75 rows=1 width=16) (actual time=0.190..26.703 rows=758 loops=1)
-> Seq Scan on dir_roles role2_ (cost=0.00..1.25 rows=1 width=8) (actual time=0.032..0.038 rows=1 loops=1)
Filter: ((NOT deleted) AND ((name)::text = 'ROLE_MEMBER'::text))
-> Index Scan using dir_memberships_role_group_id_index on dir_memberships membership0_ (cost=0.00..31.49 rows=1 width=24) (actual time=0.151..25.626 rows=758 loops=1)
Index Cond: ((role = role2_.id) AND (group_id = 15499))
Filter: ((NOT deleted) AND (CASE WHEN (expires IS NULL) THEN 1 ELSE CASE WHEN ((expires > now()) AND ((startdate IS NULL) OR (startdate < now()))) THEN 1 ELSE 0 END END = 1))
-> Nested Loop (cost=4355.35..61692.86 rows=1069 width=16) (actual time=91.088..843.967 rows=79986 loops=758)
-> Nested Loop (cost=4355.35..54185.33 rows=1069 width=8) (actual time=91.065..555.830 rows=79986 loops=758)
-> Seq Scan on dir_roles role9_ (cost=0.00..1.25 rows=1 width=8) (actual time=0.006..0.013 rows=1 loops=758)
Filter: ((name)::text = 'ROLE_TEACHER_MEMBER'::text)
-> Bitmap Heap Scan on dir_memberships membership7_ (cost=4355.35..53983.63 rows=16036 width=16) (actual time=91.047..534.236 rows=79986 loops=758)
Recheck Cond: (role = role9_.id)
Filter: ((NOT deleted) AND (CASE WHEN (expires IS NULL) THEN 1 ELSE CASE WHEN ((expires > now()) AND ((startdate IS NULL) OR (startdate < now()))) THEN 1 ELSE 0 END END = 1))
-> Bitmap Index Scan on dir_memberships_role_index (cost=0.00..4355.09 rows=214190 width=0) (actual time=87.050..87.050 rows=375858 loops=758)
Index Cond: (role = role9_.id)
-> Index Scan using dir_users_pkey on dir_users user8_ (cost=0.00..7.01 rows=1 width=8) (actual time=0.003..0.003 rows=1 loops=60629638)
Index Cond: (id = user_id)
-> Index Scan using dir_users_pkey on dir_users user1_ (cost=0.00..8.29 rows=1 width=3334) (actual time=0.011..0.011 rows=1 loops=6)
Index Cond: (id = user_id)
Filter: ((NOT deleted) AND ((status)::text = 'active'::text))
-> Index Scan using dir_groups_pkey on dir_groups group4_ (cost=0.00..8.28 rows=1 width=8) (actual time=0.023..0.023 rows=1 loops=6)
Index Cond: (group4_.id = 15499)
Filter: (NOT group4_.deleted)
Total runtime: 652504.827 ms
(29 rows)
I have been reading forum posts and the user manual, but I can't figure out what would make this run faster, except maybe if it were possible to create indexes for the subquery that uses the now() function.
I rewrote your query and assume this will be faster:
SELECT u.id AS id14_, u.first_name AS first2_14_, u.last_name AS last3_14_, u.street_1 AS street4_14_, u.street_2 AS street5_14_, u.city AS city14_, u.us_state_id AS us7_14_, u.region AS region14_, u.country_id AS country9_14_, u.postal_code AS postal10_14_, u.user_name AS user11_14_, u.password AS password14_, u.profession AS profession14_, u.phone AS phone14_, u.url AS url14_, u.bio AS bio14_, u.last_login AS last17_14_, u.status AS status14_, u.birthdate AS birthdate14_, u.ageinyears AS ageinyears14_, u.deleted AS deleted14_, u.createdate AS createdate14_, u.audit AS audit14_, u.migrated2008 AS migrated24_14_, u.creator AS creator14_
FROM dir_users u
WHERE u.status = 'active'
AND u.deleted = FALSE
AND EXISTS (
SELECT 1
FROM dir_memberships m
JOIN dir_roles r ON r.id = m.role
JOIN dir_groups g ON g.id = m.group_id
WHERE m.group_id = 15499
AND m.user_id = u.id
AND (m.expires IS NULL
OR m.expires > now() AND (m.startdate IS NULL OR m.startdate < now()))
AND m.deleted = FALSE
AND r.deleted = FALSE
AND r.name = 'ROLE_MEMBER'
AND g.deleted = FALSE
)
AND EXISTS (
SELECT 1
FROM dir_memberships m
JOIN dir_roles r ON r.id = m.role
WHERE (m.expires IS NULL
OR m.expires > now() AND (m.startDate IS NULL OR m.startDate < now()))
AND m.deleted = FALSE
AND m.user_id = u.id
AND r.name = 'ROLE_TEACHER_MEMBER'
)
Rewrite with EXISTS
Replaced the weird case ... end = 1 expressions with simple expressions
Rewrote all JOINs with explicit join syntax to make it easier to read.
Transformed the big JOIN construct and the IN expression into two EXISTS semi-joins, which removes the need for DISTINCT. This should be quite a bit faster.
Lots of minor edits to make the query simpler, but they don't change the substance.
In particular, simpler aliases - what you had was noisy and confusing.
Indexes
If this isn't fast enough yet, and your write performance can deal with more indexes, add this partial multi-column index:
CREATE INDEX dir_memberships_g_id_u_id_idx ON dir_memberships (group_id, user_id)
WHERE deleted = FALSE;
The WHERE conditions have to match your query for the index to be useful!
I assume that you already have primary keys and indexes on relevant foreign keys.
Further:
CREATE INDEX dir_memberships_u_id_role_idx ON dir_memberships (user_id, role)
WHERE deleted = FALSE;
Why user_id a second time? See:
Working of indexes in PostgreSQL
Is a composite index also good for queries on the first field?
Also, since user_id is already used in another index, adding it here does not block any additional HOT updates (HOT can only be used when none of the updated columns are part of any index).
Why role?
I assume both columns are of type integer (4 bytes). I have seen in your detailed question that you run a 64-bit OS, where MAXALIGN is 8 bytes, so another integer will not make the index grow at all. I threw in role because it might be useful for the second EXISTS semi-join.
If you have many "dead" users, this might also help:
CREATE INDEX dir_users_id_idx ON dir_users (id)
WHERE status = 'active' AND deleted = FALSE;
As always, check with EXPLAIN to see whether the indexes actually get used. You wouldn't want useless indexes consuming resources.
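A quick way to spot indexes that are never used is the standard statistics view (a sketch; run it after the workload has been active for a while):
SELECT relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY relname, indexrelname;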
Are we fast yet?
Of course, all the usual advice for performance optimization applies, too.
The query, minus the last 4 conditions, i.e.
and group4_.deleted=false
and user1_.STATUS='active'
and user1_.deleted=false
and (membership0_.USER_ID in (...))
returns 758 rows. Each of these 758 rows then goes through the select membership7_.USER_ID ... subquery, which takes 843.967 milliseconds to run.
843.967 ms * 758 = 639,726.986 ms; there goes the 10 minutes.
As for tuning the query, I don't think you need DIR_USERS user8_ in the subquery. You can start by removing it, and also changing the subquery to use EXISTS instead of IN.
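A sketch of that change, replacing the IN (...) condition with a correlated EXISTS and dropping user8_ (this is only the fragment; the rest of the outer query stays the same):
and exists (
select 1
from DIR_MEMBERSHIPS membership7_, DIR_ROLES role9_
where membership7_.USER_ID = membership0_.USER_ID
and membership7_.role = role9_.ID
and (membership7_.expires is null
or (membership7_.expires > CURRENT_TIMESTAMP
and (membership7_.startDate is null or membership7_.startDate < CURRENT_TIMESTAMP)))
and membership7_.deleted = false
and role9_.NAME = 'ROLE_TEACHER_MEMBER')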
By the way, is the database being vacuumed? Even without any tuning, it doesn't look like that complex a query, or that much data, to require 10 minutes.