Better way to create index to optimize query on date range - SQL

I am running this query:
select * from users where user_id > 200 and date >= '2020-01-12' and date <= '2020-12-20';
I have created separate indexes on user_id and date. The table has over 2 billion records (2.5 billion, to be exact) and is partitioned by date range. Still, the query takes over a minute to execute.
Can anyone suggest a better way, such as a multi-column index on (user_id, date)?
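One idea worth testing is a composite index with the range-pruning column first. A minimal sqlite3 sketch (the table, data, and index name below are made up for illustration; behavior and plans will differ on your actual engine):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER, date TEXT)")
# Composite index: date first so the date range can seek into the
# index; user_id is then filtered within the matching index entries.
conn.execute("CREATE INDEX users_date_uid_idx ON users (date, user_id)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(i, f"2020-{1 + i % 12:02d}-15") for i in range(1000)],
)

# Show what the planner chose for the range query.
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT * FROM users "
    "WHERE user_id > 200 AND date >= '2020-01-12' AND date <= '2020-12-20'"
).fetchall()
print(plan)

count = conn.execute(
    "SELECT count(*) FROM users "
    "WHERE user_id > 200 AND date >= '2020-01-12' AND date <= '2020-12-20'"
).fetchone()[0]
print(count)
```

Note that with two range predicates a B-tree index can only seek on one of them (the leading column); the other is filtered. Which column should lead depends on which predicate is more selective in your data.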

Related

SQL query slow performance with LIKE 'CNTR%04052021%'

We have a database that is growing every day, roughly 40M records as of today. This table/database is located in Azure.
The table has a primary key 'ClassifierID', and the query runs on this primary key.
The primary key is in the format ID + timestamp (mmddyyyy HHMMSS), for example 'CNTR00220200 04052021 073000'.
Here is the query to get all the IDs by date:
Select distinct ScanID
From ClassifierResults
Where ClassifierID LIKE 'CNTR%04052020%'
Very simple and straightforward, but it sometimes takes over a minute to complete. Do you have any suggestions for how we can optimize the query? Thanks much.
The best thing here would be to fix your design so that a) you are not storing the ID and timestamp in the same text field, and b) you are storing the timestamp in a proper date/timestamp column. Using your single point of data, I would suggest the following table design:
ID | timestamp
CNTR00220200 | timestamp '2021-04-05 07:30:00'
Then, create an index on (ID, timestamp), and use this query:
SELECT *
FROM yourTable
WHERE ID LIKE 'CNTR%' AND
timestamp >= '2021-04-05' AND timestamp < '2021-04-06';
The above query searches for records having an ID starting with CNTR and falling exactly on the date 2021-04-05. Your SQL database should be able to use the composite index I suggested above on this query.
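The redesign above can be tried end-to-end with sqlite3 (table and column names below follow the answer; the sample rows are made up). The prefix match on ID plus the half-open range on the timestamp together line up with a composite (ID, ts) index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Needed in SQLite so that LIKE 'CNTR%' can be rewritten as an
# index-friendly range; Postgres/SQL Server have their own rules here.
conn.execute("PRAGMA case_sensitive_like = ON")
conn.execute("CREATE TABLE ClassifierResults (ID TEXT, ts TEXT, ScanID TEXT)")
conn.execute("CREATE INDEX cr_id_ts_idx ON ClassifierResults (ID, ts)")
conn.executemany(
    "INSERT INTO ClassifierResults VALUES (?, ?, ?)",
    [
        ("CNTR00220200", "2021-04-05 07:30:00", "S1"),
        ("CNTR00220200", "2021-04-06 08:00:00", "S2"),  # next day: excluded
        ("OTHR00000001", "2021-04-05 09:00:00", "S3"),  # wrong prefix: excluded
    ],
)

rows = conn.execute(
    "SELECT DISTINCT ScanID FROM ClassifierResults "
    "WHERE ID LIKE 'CNTR%' AND ts >= '2021-04-05' AND ts < '2021-04-06'"
).fetchall()
print(rows)
```

The half-open range (`>= day, < next day`) avoids any edge cases around 23:59:59 that a BETWEEN on the same day would invite.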

Is there any disadvantage in indexing a text date (e.g. 2018-06-03) to increase date range query performance in PostgreSQL?

Consider the following db schema
my_primary_id (text) primary index
my_date (timestamp with timezone)
Is there a way to index my_date such that I can have fast date range queries?
My first thought was to make my_date a secondary index. However, after thinking about it a bit: if each day I have 100k to 200k items, the cardinality of the my_date index will be similar to the number of rows in the table.
Since a bigger index means slower queries, I thought maybe I should store an extra column in `yyyy-mm-dd` format and index that instead?
Is there any disadvantage to that, if I can guarantee that my date range queries never return more than 5% of the table (preventing a seq scan)?
My query pattern is the following
select * from my_table
where my_date >= my_start_date and my_date < my_end_date
You can index the date part of a timestamp by casting it to `date` (note the extra parentheses Postgres requires around an index expression):
create index on the_table ((my_date::date));
To make a query use that index, you need to use the same expression in your query:
select *
from my_table
where my_date::date >= date '2018-01-01'
and my_date::date < date '2018-02-01';
I think an index on the timestamp column should be usable just as well, if you compare your column with a timestamp value:
select *
from my_table
where my_date >= timestamp '2018-01-01 00:00:00'
and my_date < timestamp '2018-02-01 00:00:00';
You can partition the table by date. This speeds up querying tremendously if you have millions of records in date order and only need to work with a subset.
https://www.postgresql.org/docs/current/static/ddl-partitioning.html
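The expression-index idea above can be sketched with sqlite3, which also supports indexes on expressions. Here `substr(my_date, 1, 10)` stands in for the Postgres `my_date::date` cast (all names and rows below are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (my_date TEXT)")
# Expression index on just the date part of the timestamp string
# (Postgres equivalent: CREATE INDEX ON my_table ((my_date::date)))
conn.execute("CREATE INDEX my_table_day_idx ON my_table (substr(my_date, 1, 10))")
conn.executemany(
    "INSERT INTO my_table VALUES (?)",
    [("2018-01-15 10:00:00",), ("2018-02-03 09:30:00",)],
)

# The query must repeat the indexed expression verbatim for the
# planner to consider the expression index.
rows = conn.execute(
    "SELECT my_date FROM my_table "
    "WHERE substr(my_date, 1, 10) >= '2018-01-01' "
    "AND substr(my_date, 1, 10) < '2018-02-01'"
).fetchall()
print(rows)
```

As the answer notes, a plain index on the timestamp column with timestamp-literal bounds usually works just as well and avoids the repeat-the-expression requirement.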

Is hive partitioning hierarchical in nature?

Say we have a table partitioned as:
CREATE EXTERNAL TABLE MyTable (
col1 string,
col2 string,
col3 string
)
PARTITIONED BY(year INT, month INT, day INT, hour INT, combination_id BIGINT);
Now obviously year is going to store the year value (e.g. 2016), month will store the month value (e.g. 7), day will store the day (e.g. 18), and hour will store the hour in 24-hour format (e.g. 13). And combination_id is going to be a combination of all these values, zero-padded on the left where a value has a single digit. So in this example the combination_id is 2016071813.
So we fire this query (let's call it Query A):
select * from mytable where combination_id = 2016071813
Now Hive doesn't know that combination_id is actually combination of year,month,day and hour. So will this query not take proper advantage of partitioning?
In other words, if I have another query, call it Query B, will it be more optimal than Query A, or is there no difference?
select * from mytable where year=2016 and month=7 and day=18 and hour=13
If the Hive partitioning scheme is really hierarchical in nature, then I'm thinking Query B should be better from a performance point of view. I actually want to decide whether to get rid of combination_id from the partitioning scheme altogether if it is not contributing to better performance at all.
The only real advantage of using combination_id is being able to use the BETWEEN operator in a select:
select * from mytable where combination_id between 2016071813 and 2016071823
But if this is not going to take advantage of partitioning scheme, it is going to hamper performance.
Yes, Hive partitioning is hierarchical.
You can verify this by printing the partitions of the table using the query below.
show partitions MyTable;
Output:
year=2016/month=5/day=5/hour=5/combination_id=2016050505
year=2016/month=5/day=5/hour=6/combination_id=2016050506
year=2016/month=5/day=5/hour=7/combination_id=2016050507
In your scenario, you don't need to specify combination_id as a partition column if you are not using it for querying.
You can partition either by
Year, month, day, hour columns
or
combination_id only
Partitioning by multiple columns helps performance in grouping operations.
Say you want to find the maximum of col1 for the month of March in the years 2016 and 2015.
Hive can easily fetch the records by going to the specific year partitions (year=2016, year=2015) and month partition (month=3).
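The hierarchical pruning described above can be illustrated with a small Python sketch (purely hypothetical: partitions are modeled as dicts rather than real HDFS directories). A predicate on the real partition columns lets whole subtrees be discarded, which is why Query B can be pruned while Query A's opaque combination_id cannot:

```python
# Model Hive's nested partition layout year=.../month=.../day=.../hour=...
partitions = [
    {"year": 2016, "month": 7, "day": 18, "hour": h} for h in range(24)
] + [
    {"year": 2015, "month": 3, "day": 1, "hour": h} for h in range(24)
]

def prune(parts, **predicates):
    """Keep only partitions matching every column=value predicate."""
    return [
        p for p in parts
        if all(p[col] == val for col, val in predicates.items())
    ]

# Query B's predicates name the actual partition columns, so the
# planner can discard everything outside year=2016/month=7/day=18/hour=13.
hit = prune(partitions, year=2016, month=7, day=18, hour=13)
print(len(hit))
```

A filter on combination_id alone gives the planner no column to prune on, so all 48 partitions would have to be listed.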

Create a unique index on a non-unique column

Not sure if this is possible in PostgreSQL 9.3+, but I'd like to create a unique index on a non-unique column. For a table like:
CREATE TABLE data (
id SERIAL
, day DATE
, val NUMERIC
);
CREATE INDEX data_day_val_idx ON data (day, val);
I'd like to be able to [quickly] query only the distinct days. I know I can use data_day_val_idx to help perform the distinct search, but this seems to add extra overhead when the number of distinct values is substantially less than the number of rows the index covers. In my case, only about 1 row in 30 has a distinct day.
Is my only option to create a relational table to only track the unique entries? Thinking:
CREATE TABLE days (
day DATE PRIMARY KEY
);
And update this with a trigger every time we insert into data.
An index can only index actual rows, not aggregated rows. So, yes, as far as the desired index goes, creating a table with unique values like you mentioned is your only option. Enforce referential integrity with a foreign key constraint from data.day to days.day. This might also be best for performance, depending on the complete situation.
However, since this is about performance, there is an alternative solution: you can use a recursive CTE to emulate a loose index scan:
WITH RECURSIVE cte AS (
( -- parentheses required
SELECT day FROM data ORDER BY 1 LIMIT 1
)
UNION ALL
SELECT (SELECT day FROM data WHERE day > c.day ORDER BY 1 LIMIT 1)
FROM cte c
WHERE c.day IS NOT NULL -- exit condition
)
SELECT day FROM cte;
Parentheses around the first SELECT are required because of the attached ORDER BY and LIMIT clauses. See:
Combining 3 SELECT statements to output 1 table
This only needs a plain index on day.
There are various variants, depending on your actual queries:
Optimize GROUP BY query to retrieve latest row per user
Unused index in range of dates query
Select first row in each GROUP BY group?
More in my answer to your follow-up question:
Counting distinct rows using recursive cte over non-distinct index
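The loose index scan above can be tried end-to-end with sqlite3, which also supports recursive CTEs. The sketch below adapts the recursion to SQLite's syntax, using `min(day)` in place of the parenthesized `ORDER BY 1 LIMIT 1` subqueries (SQLite optimizes `min()` on an indexed column into the same index seek):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (id INTEGER PRIMARY KEY, day TEXT, val REAL)")
conn.execute("CREATE INDEX data_day_idx ON data (day)")  # plain index on day
# 90 rows but only 3 distinct days, mimicking the ~1-in-30 ratio
conn.executemany(
    "INSERT INTO data (day, val) VALUES (?, ?)",
    [(d, i) for d in ("2024-01-01", "2024-01-02", "2024-01-03")
     for i in range(30)],
)

days = conn.execute("""
    WITH RECURSIVE cte AS (
        SELECT min(day) AS day FROM data          -- seed: smallest day
        UNION ALL
        SELECT (SELECT min(day) FROM data WHERE day > c.day)
        FROM cte c
        WHERE c.day IS NOT NULL                   -- exit condition
    )
    SELECT day FROM cte WHERE day IS NOT NULL
""").fetchall()
print(days)
```

Each recursion step is one index seek for the next larger day, so the work is proportional to the number of distinct days, not the number of rows.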

Efficient querying of multi-partition Postgres table

I've just restructured my database to use partitioning in Postgres 8.2. Now I have a problem with query performance:
SELECT *
FROM my_table
WHERE time_stamp >= '2010-02-10' and time_stamp < '2010-02-11'
ORDER BY id DESC
LIMIT 100;
There are 45 million rows in the table. Prior to partitioning, this would use a reverse index scan and stop as soon as it hit the limit.
After partitioning (on time_stamp ranges), Postgres does a full index scan of the master table and the relevant partition and merges the results, sorts them, then applies the limit. This takes way too long.
I can fix it with:
SELECT * FROM (
SELECT *
FROM my_table_part_a
WHERE time_stamp >= '2010-02-10' and time_stamp < '2010-02-11'
ORDER BY id DESC
LIMIT 100) t
UNION ALL
SELECT * FROM (
SELECT *
FROM my_table_part_b
WHERE time_stamp >= '2010-02-10' and time_stamp < '2010-02-11'
ORDER BY id DESC
LIMIT 100) t
UNION ALL
... and so on ...
ORDER BY id DESC
LIMIT 100
This runs quickly. The partitions whose timestamps are out of range aren't even included in the query plan.
My question is: Is there some hint or syntax I can use in Postgres 8.2 to prevent the query-planner from scanning the full table but still using simple syntax that only refers to the master table?
Basically, can I avoid the pain of dynamically building the big UNION query over each partition that happens to be currently defined?
EDIT: I have constraint_exclusion enabled (thanks #Vinko Vrsalovic)
Have you tried Constraint Exclusion (section 5.9.4 in the document you've linked to)?
Constraint exclusion is a query optimization technique that improves performance for partitioned tables defined in the fashion described above. As an example:
SET constraint_exclusion = on;
SELECT count(*) FROM measurement WHERE logdate >= DATE '2006-01-01';
Without constraint exclusion, the above query would scan each of the partitions of the measurement table. With constraint exclusion enabled, the planner will examine the constraints of each partition and try to prove that the partition need not be scanned because it could not contain any rows meeting the query's WHERE clause. When the planner can prove this, it excludes the partition from the query plan.
You can use the EXPLAIN command to show the difference between a plan with constraint_exclusion on and a plan with it off.
I had a similar problem that I was able to fix by casting the conditions in the WHERE clause.
EG: (assuming the time_stamp column is timestamptz type)
WHERE time_stamp >= '2010-02-10'::timestamptz and time_stamp < '2010-02-11'::timestamptz
Also, make sure the CHECK condition on the table is defined the same way...
EG:
CHECK (time_stamp < '2010-02-10'::timestamptz)
I had the same problem, and it boiled down to two causes in my case:
I had an indexed column of type timestamp WITH time zone and a partition constraint on this column with type timestamp WITHOUT time zone.
After fixing the constraints, an ANALYZE of all child tables was needed.
Edit: another bit of knowledge - it's important to remember that constraint exclusion (which allows PG to skip scanning some tables based on your partitioning criteria) doesn't work with, quote, "non-immutable functions such as CURRENT_TIMESTAMP".
I had queries with CURRENT_DATE, and that was part of my problem.
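Why CURRENT_DATE defeats constraint exclusion can be sketched in plain Python (a hypothetical model, not Postgres internals): the planner can only exclude a partition when the predicate is a plan-time constant it can compare against the partition's CHECK range, and a volatile function's value is unknown until run time:

```python
from datetime import date

# Each partition carries a CHECK-style half-open [start, end) range.
partitions = {
    "my_table_part_a": (date(2010, 2, 1), date(2010, 3, 1)),
    "my_table_part_b": (date(2010, 3, 1), date(2010, 4, 1)),
}

def partitions_to_scan(lo, hi):
    """Keep only partitions whose [start, end) range overlaps [lo, hi)."""
    return [
        name for name, (start, end) in partitions.items()
        if start < hi and lo < end
    ]

# A literal range lets the planner prove part_b cannot contain matches,
# so only part_a stays in the plan.
plan = partitions_to_scan(date(2010, 2, 10), date(2010, 2, 11))
print(plan)
# With CURRENT_DATE the bounds are unknown at plan time, so no proof is
# possible and every partition must stay in the plan.
```

This is why replacing CURRENT_DATE with a literal date (computed by the application) restores partition pruning on these older Postgres versions.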