Best way to exclude outdated data from a search in PostgreSQL - sql

I have a table containing the following columns:
an integer column named id
a text column named value
a timestamp column named creation_date
Currently, indexes have been created for the id and value columns.
I must search this table for a given value and want to make the search as fast as I can. But I don't really need to look through records that are older than one month. So, ideally, I would like to exclude them from the index.
What would be the best way to achieve this:
Perform table partitioning. Only search through the subtable for the appropriate month.
Create a partial index including only the recent records. Recreate it every month.
Something else?
(P.S.: "the best solution" means the solution that is the most convenient, fast, and easy to maintain.)

Partial index
A partial index would be perfect for that, or even a partial multicolumn index. But your condition
don't need to search value in records older than one month
is not stable. The condition of a partial index can only work with literals or IMMUTABLE functions, i.e., constant values. You mention "Recreate it every month", but that would not agree with your definition "older than one month". You see the difference, right?
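For illustration, a predicate like the following is rejected outright (a sketch; tbl stands in for your table):
CREATE INDEX tbl_recent_idx ON tbl (value)
WHERE creation_date >= now() - interval '1 month';
-- ERROR:  functions in index predicate must be marked IMMUTABLE
now() is only STABLE, not IMMUTABLE, so Postgres refuses it in an index predicate.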
If you only need the current (or last) month, index recreation as well as the query itself become quite a bit simpler!
I'll go with your definition "not older than one month" for the rest of this answer. I have had to deal with situations like this before. The following solution worked best for me:
Base your index condition on a fixed timestamp and use the same timestamp in your queries to convince the query planner it can use the partial index. This kind of partial index will stay useful over an extended period of time; only its effectiveness deteriorates as new rows are added and older rows drop out of your time frame. The index will return more and more false positives that an additional WHERE clause has to eliminate from your query. Recreate the index to update its condition.
Given your test table:
CREATE TABLE mytbl (
value text
,creation_date timestamp
);
Create a very simple IMMUTABLE SQL function:
CREATE OR REPLACE FUNCTION f_mytbl_start_ts()
RETURNS timestamp AS
$func$
SELECT '2013-01-01 0:0'::timestamp
$func$ LANGUAGE sql IMMUTABLE;
Use the function in the condition of the partial index:
CREATE INDEX mytbl_start_ts_idx ON mytbl(value, creation_date)
WHERE (creation_date >= f_mytbl_start_ts());
value comes first. Explanation in this related answer on dba.SE.
Input from @Igor in the comments made me improve my answer. A partial multicolumn index should make ruling out false positives from the partial index faster - it lies in the nature of the index condition that it grows increasingly outdated (but it's still a lot better than not having it).
Query
A query like this will make use of the index and should be perfectly fast:
SELECT value
FROM mytbl
WHERE creation_date >= f_mytbl_start_ts() -- !
AND creation_date >= (now() - interval '1 month')
AND value = 'foo';
The only purpose of the seemingly redundant WHERE clause: creation_date >= f_mytbl_start_ts() is to make the query planner use the partial index.
You can drop and recreate function and index manually.
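For example, a manual monthly rotation might look like this (the new cutoff date is only an illustration):
BEGIN;
DROP INDEX IF EXISTS mytbl_start_ts_idx;

CREATE OR REPLACE FUNCTION f_mytbl_start_ts()
  RETURNS timestamp AS
$func$
SELECT '2013-02-01 0:0'::timestamp  -- move the cutoff forward
$func$ LANGUAGE sql IMMUTABLE;

CREATE INDEX mytbl_start_ts_idx ON mytbl (value, creation_date)
WHERE creation_date >= f_mytbl_start_ts();
COMMIT;
Mind that plain DROP INDEX / CREATE INDEX block concurrent access to the table, so run this with little load (or see the automation below).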
Full automation
Or you can automate it in a bigger scheme with possibly lots of similar tables:
Disclaimer: This is advanced stuff. You need to know what you are doing and consider user privileges, possible SQL injection and locking issues with heavy concurrent load!
This "steering table" receives a line per table in your regime:
CREATE TABLE idx_control (
tbl text primary key -- plain, legal table names!
,start_ts timestamp
);
I would put all such meta objects in a separate schema.
For our example:
INSERT INTO idx_control(tbl, start_ts)
VALUES ('mytbl', '2013-1-1 0:0');
A "steering table" offers the additional benefit that you have an overview over all such tables and their respective settings in a central place and you can update some or all of them in sync.
Whenever you change start_ts in this table the following trigger kicks in and takes care of the rest:
Trigger function:
CREATE OR REPLACE FUNCTION trg_idx_control_upaft()
RETURNS trigger AS
$func$
DECLARE
_idx text := NEW.tbl || '_start_ts_idx';
_func text := 'f_' || NEW.tbl || '_start_ts';
BEGIN
-- Drop old idx
EXECUTE format('DROP INDEX IF EXISTS %I', _idx);
-- Create / change function; Keep placeholder with -infinity for NULL timestamp
EXECUTE format('
CREATE OR REPLACE FUNCTION %I()
RETURNS timestamp AS
$x$
SELECT %L::timestamp
$x$ LANGUAGE SQL IMMUTABLE', _func, COALESCE(NEW.start_ts, '-infinity'));
-- New Index; NULL timestamp removes idx condition:
IF NEW.start_ts IS NULL THEN
EXECUTE format('
CREATE INDEX %I ON %I (value, creation_date)', _idx, NEW.tbl);
ELSE
EXECUTE format('
CREATE INDEX %I ON %I (value, creation_date)
WHERE creation_date >= %I()', _idx, NEW.tbl, _func);
END IF;
RETURN NULL;
END
$func$ LANGUAGE plpgsql;
Trigger:
CREATE TRIGGER upaft
AFTER UPDATE ON idx_control
FOR EACH ROW
WHEN (OLD.start_ts IS DISTINCT FROM NEW.start_ts)
EXECUTE PROCEDURE trg_idx_control_upaft();
Now, a simple UPDATE on the steering table calibrates index and function:
UPDATE idx_control
SET start_ts = '2013-03-22 0:0'
WHERE tbl = 'mytbl';
You can run a cron job or call this manually.
Queries using the index don't change.
-> SQLfiddle.
I updated the fiddle with a small test case of 10k rows to demonstrate it works.
PostgreSQL will even do an index-only scan for my example query. Won't get any faster than this.

Related

Using Functions for Easy writing makes SQL really slow

create or replace FUNCTION FUNCTION_X
(
N_STRING IN VARCHAR2
) RETURN VARCHAR2 AS
BEGIN
RETURN UPPER(translate(N_STRING, 'ÁÇÉÍÓÚÀÈÌÒÙÂÊÎÔÛÃÕËÜáçéíóúàèìòùâêîôûãõëü','ACEIOUAEIOUAEIOUAOEUaceiouaeiouaeiouaoeu'));
END FUNCTION_X;
The SELECT takes around 5 seconds (80k+ rows):
SELECT TABLE_A.STRING_X
FROM TABLE_A
INNER JOIN TABLE_B ON TABLE_B.ID = TABLE_A.IDTB
WHERE
UPPER(UPPER(translate(TABLE_A.STRING_X,
'ÁÇÉÍÓÚÀÈÌÒÙÂÊÎÔÛÃÕËÜáçéíóúàèìòùâêîôûãõëü','ACEIOUAEIOUAEIOUAOEUaceiouaeiouaeiouaoeu')))
=
UPPER(translate(TABLE_B.N_STRING,
'ÁÇÉÍÓÚÀÈÌÒÙÂÊÎÔÛÃÕËÜáçéíóúàèìòùâêîôûãõëü','ACEIOUAEIOUAEIOUAOEUaceiouaeiouaeiouaoeu'))
Using the function takes over 3 minutes (80k+ rows):
SELECT TABLE_A.STRING_X
FROM TABLE_A
INNER JOIN TABLE_B ON TABLE_B.ID = TABLE_A.IDTB
WHERE
FUNCTION_X(TABLE_A.STRING_X) = FUNCTION_X(TABLE_B.N_STRING)
I don't know what makes it so heavy.
If your first query, with the UPPER(UPPER(translate(...))) inline in the query, takes only 5 seconds and the tables are big, I would look to see if you have a function-based index using those functions on either or both tables.
An index, as you probably know, stores a sorted version of the data so that rows can be found quickly. But it is only useful if you are searching on the data that is sorted in the index. (Think of an index in a book, in which keywords are sorted alphabetically -- useful for searching for a particular word, not so useful for finding references to words ending in the letter "r".)
If there is a function-based index on UPPER(UPPER(translate(...))) that is helping your original query, you are losing the benefit when your query specifies FUNCTION_X(...) instead. Oracle is not smart enough to realize they are the same function. You would need to create function-based indexes on the expression you actually use in the query -- i.e., on FUNCTION_X(...).
Also, you can help performance by telling Oracle that your function is deterministic (i.e., always returns the same value for the same input) and intended to be used in SQL queries. So, in addition to the function-based indexes, a better definition of your function would be:
create or replace FUNCTION FUNCTION_X
(
N_STRING IN VARCHAR2
) RETURN VARCHAR2
DETERMINISTIC -- add this
AS
PRAGMA UDF; -- add this too
BEGIN
RETURN UPPER(translate(N_STRING, 'ÁÇÉÍÓÚÀÈÌÒÙÂÊÎÔÛÃÕËÜáçéíóúàèìòùâêîôûãõëü','ACEIOUAEIOUAEIOUAOEUaceiouaeiouaeiouaoeu'));
END FUNCTION_X;
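With DETERMINISTIC in place, the function-based indexes mentioned above could then be created on the very expression the query uses (a sketch; the index names are illustrative):
CREATE INDEX TABLE_A_FX_IDX ON TABLE_A (FUNCTION_X(STRING_X));
CREATE INDEX TABLE_B_FX_IDX ON TABLE_B (FUNCTION_X(N_STRING));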
Joins, of course, are meant to exploit index values. The problem with your second query is that it forces the SQL engine to execute this function call for each and every row. It therefore cannot do anything better than a "full table scan," evaluating the function over each and every row ... actually over a Cartesian product of the two tables taken together!
You must find an alternative way to do that.

Creating index on timestamp column for query which uses year function

I have a HISTORY table with 9 million records. I need to find the number of records created per year and per month. I was using the first query below; however, it timed out several times.
SELECT
year(created) as year,
MONTHNAME(created) as month,
count(*) as ymcount
FROM
HISTORY
GROUP BY
year(created), MONTHNAME(created);
I decided to add a WHERE clause on year(created); this time the query took 30 minutes (yes, that long) to execute.
SELECT
year(created) as year,
MONTHNAME(created) as month,
count(*) as ymcount
FROM
HISTORY
WHERE
year(created) = 2010
GROUP BY
year(created), MONTHNAME(created) ;
I was planning to add an index on the created timestamp column; however, before doing so, I would like an opinion (since it's going to take a long time to index such a huge table).
Will adding an index on created(timestamp) column improve performance, considering year function is used on the column?
An index won't really help because you have formed the query such that it must perform a complete table scan, index or no index. You have to form the where clause so it is in the form:
where field op constant
where field is, of course, your field; op is =, <=, >=, <>, BETWEEN, IN, etc.; and constant is either a direct constant, such as 42, or an operation that can be executed once and the result cached, such as getdate().
Like this:
where created >= DateFromParts( @year, 1, 1 )
and created < DateFromParts( @year + 1, 1, 1 )
The DateFromParts function will generate a value which remains in effect for the duration of the query. If created is indexed, now the optimizer will be able to seek to exactly where the correct dates start and tell when the last date in the range has been processed and it can stop. You can keep year(created) everywhere else -- just get rid of it from the where clause.
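Applied to the original query, the rewrite might look like this (a sketch with literal bounds for 2010; keep year()/MONTHNAME() in the SELECT and GROUP BY, just not in the WHERE clause, and assume created compares against date literals in your DBMS):
SELECT
year(created) as year,
MONTHNAME(created) as month,
count(*) as ymcount
FROM
HISTORY
WHERE
created >= '2010-01-01'
AND created < '2011-01-01'
GROUP BY
year(created), MONTHNAME(created);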
This is called sargability and you can google all kinds of good information on it.
P.S. This is in SQL Server format, but you should be able to calculate "beginning of specified year" and "beginning of year after specified year" in whatever DBMS you're using.
An index will be used, when it helps narrow down the number of rows read.
It will also be used, when it avoids reading the table at all. This is the case, when the index contains all the columns referenced in the query.
In your case the only column referenced is created, so adding an index on this column should help reducing the necessary reads and improve the overall runtime of your query. However, if created is the only column in the table, the index won't change anything in the first query, because it doesn't reduce the number of pages to be read.
Even with a large table, you can test whether an index makes a difference. You can copy only part of the rows to a new table and compare the execution plans on the new table with and without an index, e.g.
insert into testhistory
select *
from history
fetch first 100000 rows only
You want what's known as a Calendar Table (the particular example uses SQL Server, but the solution should be adaptable). Then, you want lots of indices on it (since writes are few, and this is a primary dimension table for analysis).
Assuming you have a minimum Calendar Table that looks like this:
CREATE TABLE Calendar (isoDate DATE,
dayOfMonth INTEGER,
month INTEGER,
year INTEGER);
... with an index over [dayOfMonth, month, year, isoDate], your query can be re-written like this:
SELECT Calendar.year, Calendar.month,
COUNT(*) AS ymCount
FROM Calendar
JOIN History
ON History.created >= Calendar.isoDate
AND History.created < Calendar.isoDate + 1 MONTH
WHERE Calendar.dayOfMonth = 1
GROUP BY Calendar.year, Calendar.month
The WHERE Calendar.dayOfMonth = 1 automatically limits results to 12 per year. The start of the range is trivially located with the index (given the SARGable data), and the end of the range as well (yes, doing math on a column generally disqualifies indices... on the side the math is used. If the optimizer is at all smart, it's going to generate a virtual intermediate table containing the start/end of the range).
So, index-based (and likely index-only) access for the query. Learn to love indexed dimension tables, that can be used for range queries (Calendar Tables being one of the most useful).
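For completeness, the supporting index mentioned above could be declared like this (the index name is illustrative):
CREATE INDEX Calendar_dom_idx ON Calendar (dayOfMonth, month, year, isoDate);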
I'll assume you are using SQL Server based on your tags.
Yes, the index will make your query faster.
I recommend only using the 'created' column as a key for the index and not including any additional columns from the History table, because they will be unused and only result in more reads than necessary.
And of course, be mindful when you create indexes on tables that have a lot of INSERT, UPDATE, DELETE activity as your new index will make these actions more expensive when being performed on the table.
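For reference, that single-column index might be declared like this (the index name is illustrative):
CREATE INDEX IX_HISTORY_created ON HISTORY (created);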
As has been stated before, in your case an index won't be used, because the index is created on the column 'created' and you are querying on year(created).
What you can do is add two generated columns, year_gen = year(created) and month_gen = MONTHNAME(created), to your table and index these two columns. The DB2 query optimizer will automatically use these two generated columns and it will also use the indices created on them.
The code should be something like (but not 100% sure since I have no DB2 to test)
SET INTEGRITY FOR HISTORY OFF CASCADE DEFERRED #
ALTER TABLE HISTORY ADD COLUMN YEAR_GEN SMALLINT GENERATED ALWAYS AS (YEAR(CREATED)),
ADD COLUMN MONTH_GEN VARCHAR(20) GENERATED ALWAYS AS (MONTHNAME(CREATED)) #
SET INTEGRITY FOR HISTORY IMMEDIATE CHECKED FORCE GENERATED #
CREATE INDEX HISTORY_YEAR_IDX ON HISTORY (YEAR_GEN ASC) CLUSTER #
CREATE INDEX HISTORY_MONTH_IDX ON HISTORY (MONTH_GEN ASC) #
Just a sidenote: the SET INTEGRITY ... OFF is mandatory to add generated columns. Your table is inaccessible until you reset the integrity to CHECKED and force the re-calculation of the generated columns (this might take a while in your case).
Setting integrity OFF without CASCADE DEFERRED will set every table with a foreign key to the HISTORY table to OFF too. You will have to manually reset the integrity of these tables as well. If I remember correctly, using CASCADE DEFERRED in combination with incoming foreign keys may cause DB2 to set the integrity of your table to 'checked by user'.
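For a dependent table caught by the cascade, that manual reset might look like this (CHILD_TABLE is just a placeholder):
SET INTEGRITY FOR CHILD_TABLE IMMEDIATE CHECKED #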

Optimize performance for queries on recent rows of a large table

I have a large table:
CREATE TABLE "orders" (
"id" serial NOT NULL,
"person_id" int4,
"created" int4,
CONSTRAINT "orders_pkey" PRIMARY KEY ("id")
);
90% of all requests are about orders from the last 2-3 days by a person_id, like:
select * from orders
where person_id = 1
and created >= extract(epoch from current_timestamp)::int - 60 * 60 * 24 * 3;
How can I improve performance?
I know about Partitioning, but what about existing rows? And it looks like I need to create INHERITS tables manually every 2-3 days.
A partial, multicolumn index on (person_id, created) with a pseudo-IMMUTABLE condition would help (a lot). Needs to be recreated from time to time to keep performance up.
Note, if your table is not very big, you can largely simplify and use a plain multicolumn index.
Or consider table partitioning in Postgres 12 or later (where the feature finally matured).
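If you go the partitioning route instead, a minimal declarative-partitioning sketch could look like this (table and partition names and the epoch bounds for December 2013 are illustrative; a primary key on a partitioned table would have to include the partition key, so it is omitted here):
CREATE TABLE orders_part (
id serial NOT NULL,
person_id int4,
created int4
) PARTITION BY RANGE (created);

CREATE TABLE orders_2013_12 PARTITION OF orders_part
FOR VALUES FROM (1385856000) TO (1388534400);

CREATE INDEX ON orders_part (person_id, created);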
A primitive function provides a constant point in time, 3 or more days back (represented by a unix epoch in your case):
CREATE OR REPLACE FUNCTION f_orders_idx_start()
RETURNS int LANGUAGE sql IMMUTABLE PARALLEL SAFE COST 1 AS
'SELECT 1387497600';
PARALLEL SAFE only for Postgres 10 or later.
1387497600 being the result of:
SELECT extract(epoch from now())::integer - 259200;
-- 259200 being the result of 60 * 60 * 24 * 3
Base your partial index on this pseudo-IMMUTABLE condition:
CREATE INDEX orders_created_recent_idx ON orders (person_id, created)
WHERE created >= f_orders_idx_start();
Base your query on the same condition:
SELECT *
FROM orders
WHERE person_id = 1
AND created >= f_orders_idx_start() -- match partial idx condition
AND created >= extract(epoch from now())::integer - 259200; -- actual condition
The line AND created >= f_orders_idx_start() seems redundant, but is instrumental to convince Postgres to use the partial index.
A function to recreate function and index from time to time. Possibly with a cron-job every night:
CREATE OR REPLACE FUNCTION f_orders_reindex_partial()
RETURNS void AS
$func$
DECLARE
-- 3 days back, starting at 00:00
_start int := extract(epoch from now()::date -3)::int;
BEGIN
IF _start = f_orders_idx_start() THEN
-- do nothing, nothing changes.
ELSE
DROP INDEX IF EXISTS orders_created_recent_idx;
-- Recreate IMMUTABLE function
EXECUTE format('
CREATE OR REPLACE FUNCTION f_orders_idx_start()
RETURNS int LANGUAGE sql IMMUTABLE PARALLEL SAFE COST 1 AS
$$SELECT %s $$'
, _start
);
-- Recreate partial index
CREATE INDEX orders_created_recent_idx ON orders (person_id, created)
WHERE created >= f_orders_idx_start();
END IF;
END
$func$ LANGUAGE plpgsql;
Then, to rebase your index, call (ideally with little or no concurrent load):
SELECT f_orders_reindex_partial(); -- that's all
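If the pg_cron extension happens to be available (an assumption; it is not required for anything above), the nightly call could be scheduled from inside the database:
SELECT cron.schedule('5 3 * * *', 'SELECT f_orders_reindex_partial()');  -- every night at 03:05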
If you cannot afford dropping and recreating the index due to concurrent load, consider REINDEX CONCURRENTLY in Postgres 12 or later. It's dead simple:
REINDEX INDEX CONCURRENTLY orders_created_recent_idx;
All queries continue to work, even if you never call this function. Performance slowly deteriorates over time with the growing partial index.
I am using this regime successfully with a couple of big tables and similar requirements. Very fast.
For Postgres 9.2 or later, and if your table has only a few, small columns, and if the table is not heavily written, it might pay to make that a covering index:
CREATE INDEX orders_created_recent_idx ON orders (person_id, created, id)
WHERE created >= f_orders_idx_start();
In Postgres 11 or later, you might want to use INCLUDE instead:
CREATE INDEX orders_created_recent_idx ON orders (person_id, created) INCLUDE (id)
WHERE created >= f_orders_idx_start();
Suggestion:
As the table keeps growing, your query performance will gradually decrease. It may help to keep only 3-5 days of records (if you are very sure you only ever access the last 2-3 days) and periodically migrate the old records to a backup table.
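Such a periodic migration might be sketched like this (orders_backup is a hypothetical archive table with the same structure as orders):
WITH moved AS (
DELETE FROM orders
WHERE created < extract(epoch from now())::int - 60 * 60 * 24 * 5  -- keep the last 5 days
RETURNING *
)
INSERT INTO orders_backup
SELECT * FROM moved;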

How to create a stored function updating rows in Postgres?

I have used Postgres with my Django project for some time now but I never needed to use stored functions. It is very important for me to find the most efficient solution for the following problem:
I have a table, which contains the following columns:
number | last_update | growth_per_second
And I need an efficient solution to update the number based on last_update and the growth factor, and to set last_update to the current time. I will probably have 100k, maybe 150k rows. I need to update all rows at the same time if possible, but if that takes too long I can split it into smaller parts.
Store what you can't calculate
quickly.
Are you sure you need to maintain this information? If so, can you cache it if querying it is slow? You're setting yourself up for massive table thrash by trying to keep this information consistent in the database.
First if you want to go this route, start with the PostgreSQL documentation on server programming, then come back with a question based on what you have tried. You will want to get familiar with this area anyway because depending on what you are doing....
Now, assuming your data is all inserts and no updates, I would not store this information in your database directly. If it is a smallish amount of information you will end up with index scans anyway and if you are returning a smallish result set you should be able to calculate this quickly.
Instead I would do this: have your last_update column be a foreign key to the same table. Suppose your table looks like this:
CREATE TABLE hits (
id bigserial primary key,
number_hits bigint not null,
last_update_id bigint references hits(id),
....
);
Then I would create the following functions. Note the caveats below.
CREATE FUNCTION last_update(hits) RETURNS hits IMMUTABLE LANGUAGE SQL AS $$
SELECT * FROM hits WHERE id = $1.last_update_id;
$$;
This function allows you, on a small result set, to traverse to the last update record. Note the immutable designation here is only safe if you are guaranteeing that there are no updates or deletions on the hits table. If you do these, then you should change it to stable, and you lose the ability to index output. If you make this guarantee and then must do an update, then you MUST rebuild any indexes that use this (reindex table hits), and this may take a while....
From there, we can:
CREATE FUNCTION growth(hits) RETURNS numeric immutable language sql as $$
SELECT CASE WHEN ($1.last_update).number_hits = 0 THEN NULL
ELSE $1.number_hits / ($1.last_update).number_hits
END;
$$;
Then we can:
SELECT h.growth -- or alternatively growth(h)
FROM hits h
WHERE id = 12345;
And it will automatically calculate it. If we want to search on growth, we can index the output:
CREATE INDEX hits_growth_idx ON hits (growth(hits));
This will precalculate for searching purposes. This way if you want to do a:
SELECT * FROM hits WHERE growth = 1;
It can use an index scan on predefined values.
Of course you can use the same techniques to precalculate and store, but this approach is more flexible and if you have to work with a large result set, you can always self-join once, and calculate that way, bypassing your functions.
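For a large result set, that self-join could look like this (a sketch using only the columns defined above; the division mirrors the growth() function):
SELECT h.id,
       CASE WHEN p.number_hits = 0 THEN NULL
            ELSE h.number_hits / p.number_hits
       END AS growth
FROM hits h
LEFT JOIN hits p ON p.id = h.last_update_id;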

How do I find the last time that a PostgreSQL database has been updated?

I am working with a PostgreSQL database that gets updated in batches. I need to know the last time that the database (or a table in the database) was updated or modified; either will do.
I saw that someone on the PostgreSQL forum suggested using logging and querying your logs for the time. That will not work for me, as I do not have control over the client's codebase.
You can write a trigger to run every time an insert/update is made on a particular table. The common usage is to set a "created" or "last_updated" column of the row to the current time, but you could also update the time in a central location if you don't want to change the existing tables.
So, for example, a typical way is the following:
CREATE FUNCTION stamp_updated() RETURNS TRIGGER LANGUAGE 'plpgsql' AS $$
BEGIN
NEW.last_updated := now();
RETURN NEW;
END
$$;
-- repeat for each table you need to track:
ALTER TABLE sometable ADD COLUMN last_updated TIMESTAMP;
CREATE TRIGGER sometable_stamp_updated
BEFORE INSERT OR UPDATE ON sometable
FOR EACH ROW EXECUTE PROCEDURE stamp_updated();
Then to find the last update time, you need to select "MAX(last_updated)" from each table you are tracking and take the greatest of those, e.g.:
SELECT MAX(max_last_updated) FROM (
SELECT MAX(last_updated) AS max_last_updated FROM sometable
UNION ALL
SELECT MAX(last_updated) FROM someothertable
) updates
For tables with a serial (or similarly generated) primary key, you can try to avoid the sequential scan when finding the latest update time by using the primary key index, or you can create an index on last_updated.
-- get timestamp of row with highest id
SELECT last_updated FROM sometable ORDER BY sometable_id DESC LIMIT 1
Note that this can give slightly wrong results in the case of IDs not being quite sequential, but how much accuracy do you need? (Bear in mind that transactions mean that rows can become visible to you in a different order to them being created.)
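If you create an index on last_updated instead, the latest timestamp can be read directly (a sketch; the IS NOT NULL filter skips rows from before the column was added):
CREATE INDEX sometable_last_updated_idx ON sometable (last_updated);

SELECT last_updated
FROM sometable
WHERE last_updated IS NOT NULL
ORDER BY last_updated DESC
LIMIT 1;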
An alternative approach to avoid adding 'updated' columns to each table is to have a central table to store update timestamps in. For example:
CREATE TABLE update_log(table_name text NOT NULL, updated timestamp NOT NULL DEFAULT now());
CREATE FUNCTION stamp_update_log() RETURNS TRIGGER LANGUAGE 'plpgsql' AS $$
BEGIN
INSERT INTO update_log(table_name) VALUES(TG_TABLE_NAME);
RETURN NEW;
END
$$;
-- Repeat for each table you need to track:
CREATE TRIGGER sometable_stamp_update_log
AFTER INSERT OR UPDATE ON sometable
FOR EACH STATEMENT EXECUTE PROCEDURE stamp_update_log();
This will give you a table with a row for each table update: you can then just do:
SELECT MAX(updated) FROM update_log
To get the last update time. (You could split this out by table if you wanted). This table will of course just keep growing: either create an index on 'updated' (which should make getting the latest one pretty fast) or truncate it periodically if that fits with your use case, (e.g. take an exclusive lock on the table, get the latest update time, then truncate it if you need to periodically check if changes have been made).
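The periodic "check and reset" variant could be sketched as one transaction, for example:
BEGIN;
LOCK TABLE update_log IN ACCESS EXCLUSIVE MODE;
SELECT MAX(updated) FROM update_log;  -- note this value in your application
TRUNCATE update_log;
COMMIT;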
An alternative approach (which might be what the folks on the forum meant) is to set 'log_statement = mod' in the database configuration (either globally for the cluster, or on the database or user you need to track); all statements that modify the database will then be written to the server log. You'll then need to write something outside the database to scan the server log, filtering out tables you aren't interested in, etc.
It looks like you can use pg_stat_database to get a transaction count and check whether it changes from one backup run to the next - see this dba.se answer and comments for more details.
I like Jack's approach. You can query the table stats and learn the number of inserts, updates, deletes and so on:
select n_tup_upd from pg_stat_user_tables where relname = 'YOUR_TABLE';
Every update will increase the count by 1.
Bear in mind this method is viable when you have a single DB; multiple instances will probably require a different approach.
See the following article:
MySQL versus PostgreSQL: Adding a 'Last Modified Time' Column to a Table
http://www.pointbeing.net/weblog/2008/03/mysql-versus-postgresql-adding-a-last-modified-column-to-a-table.html
You can write a stored procedure in an "untrusted language" (e.g. plpythonu): this allows access to the files in the postgres "base" directory. Return the largest mtime of these files from the stored procedure.
But this is only a rough indicator, since vacuum will also change these files and their mtime.