Db2 max(DATE_FIELD) optimization - sql

I encountered this SQL statement in our legacy software:
select max(DATE_FIELD) from schema.VERY_LARGE_TABLE.
Now, this in itself would not be a problem, but VERY_LARGE_TABLE contains over 20,000,000 records and execution of the statement sometimes exceeds the specified timeout of 30 seconds. This is an issue because users then do not see the correct date.
I have stored the date in a cache, so that once it is obtained it is not fetched again that day, but this does not address the original issue.
I was wondering: is there a way to optimize the table or the SQL statement so that it performs faster than it does now?
[Edit] This behaviour also only occurs until the table is in the DB2 server's cache. After that, the query runs in approximately 2 seconds.
[Edit2] The platform is IBM DB2 9.7.6 for LUW.
[Edit3] DDL:
create table MYSCHEMA.VERY_LARGE_TABLE (
    ID integer not null generated always as identity (
        start with 1,
        increment by 1,
        minvalue 1,
        maxvalue 2147483647,
        no cycle,
        cache 20),
    CODE_FIELD varchar(15) not null,
    DATE_FIELD date default CURRENT DATE not null
)
in TS_LARGE;

create unique index MYSCHEMA.VERY_LARGE_TABLE_1
    on MYSCHEMA.VERY_LARGE_TABLE (CODE_FIELD asc, DATE_FIELD asc);
create index MYSCHEMA.VERY_LARGE_TABLE_ARCHIVE
    on MYSCHEMA.VERY_LARGE_TABLE (DATE_FIELD asc, CODE_FIELD asc);
create index MYSCHEMA.INDX_VERY_LARGE_TABLE_ID
    on MYSCHEMA.VERY_LARGE_TABLE (ID asc);

alter table MYSCHEMA.VERY_LARGE_TABLE
    add constraint MYSCHEMA.PK_VERY_LARGE_TABLE primary key (CODE_FIELD, DATE_FIELD);

A couple of things you can do:
Create an index on DATE_FIELD (a sketch follows below).
Execute the select statement with WITH UR appended, like
select max(DATE_FIELD) from schema.VERY_LARGE_TABLE with ur
This will speed up the process.
Try to increase the primary and secondary log file sizes, but get a DBA's help before doing this, as it will affect the database. If you are not sure, don't go for this option.
If there are bulk operations running on the table, issue frequent commits.
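A minimal sketch of the first suggestion in DB2 LUW syntax (the index name is illustrative):
-- Illustrative single-column index on the date column.
CREATE INDEX MYSCHEMA.IDX_VERY_LARGE_TABLE_DATE
    ON MYSCHEMA.VERY_LARGE_TABLE (DATE_FIELD);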

As long as you have an index on just DATE_FIELD DESCENDING, then try:
SELECT DATE_FIELD
FROM VERY_LARGE_TABLE
ORDER BY DATE_FIELD DESC
FETCH FIRST ROW ONLY
WITH UR
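A sketch of the descending index this answer assumes (the index name is illustrative):
-- Illustrative: with DATE_FIELD descending, the newest date is the first index entry.
CREATE INDEX MYSCHEMA.IDX_VERY_LARGE_TABLE_DATE_DESC
    ON MYSCHEMA.VERY_LARGE_TABLE (DATE_FIELD DESC);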

Why does using the MAX function in a query cause a PostgreSQL performance issue?

I have a table with three columns, time_stamp, device_id and status, such that status is of type json. The time_stamp and device_id columns are indexed. I need to grab the latest value of status with id 1.3.6.1.4.1.34094.1.1.1.1.1 which is not null.
You can find the execution times of the following query, with and without MAX, below.
Query with MAX:
SELECT DISTINCT MAX(time_stamp) FROM device.status_events WHERE
(device_id = 7) AND
(status->'1.3.6.1.4.1.34094.1.1.1.1.1' != '{}');
Query without MAX:
SELECT DISTINCT time_stamp FROM device.status_events WHERE
(device_id = 7) AND
(status->'1.3.6.1.4.1.34094.1.1.1.1.1' != '{}');
The first query takes about 3 sec and the second one takes just 3 msec, with two different plans. I think both queries should have the same query plan. Why does it not use the index when it wants to calculate MAX? How can I improve the running time of the first query?
PS: I use Postgres 9.6 (dockerized version).
Also, this is the table definition:
-- Table: device.status_events
-- DROP TABLE device.status_events;
CREATE TABLE device.status_events
(
    time_stamp timestamp with time zone NOT NULL,
    device_id bigint,
    status jsonb,
    is_active boolean DEFAULT true,
    CONSTRAINT status_events_device_id_fkey FOREIGN KEY (device_id)
        REFERENCES device.devices (id) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE CASCADE
)
WITH (
    OIDS=FALSE
);
ALTER TABLE device.status_events
    OWNER TO monitoring;
-- Index: device.status_events__time_stamp
-- DROP INDEX device.status_events__time_stamp;
CREATE INDEX status_events__time_stamp
    ON device.status_events
    USING btree
    (time_stamp);
The index you show us cannot produce the first plan you show us. With that index, the plan would have to be applying a filter for the jsonb column, which it isn't. So the index must be a partial index, with the filter being applied at the index level so that it is not needed in the plan.
PostgreSQL is using an index for the max query, it just isn't the index you want it to use.
All of your device_id = 7 rows must have low timestamps, but PostgreSQL doesn't know this. It thinks that by walking down the timestamps index it will quickly find a device_id = 7 row and then be done. But instead it needs to walk a large chunk of the index before finding such a row.
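To confirm which index the planner is picking, you can run EXPLAIN on the slow query (the plan output is omitted here, since it depends on your data):
-- Illustrative: show the chosen access path and actual timings for the MAX query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT MAX(time_stamp) FROM device.status_events WHERE
(device_id = 7) AND
(status->'1.3.6.1.4.1.34094.1.1.1.1.1' != '{}');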
You can force it away from the "wrong" index by changing the aggregate expression to something like:
MAX(time_stamp + interval '0')
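For clarity, a sketch of the first query with that tweak applied (everything else unchanged):
-- The no-op interval addition hides time_stamp from the planner's backwards index walk.
SELECT DISTINCT MAX(time_stamp + interval '0') FROM device.status_events WHERE
(device_id = 7) AND
(status->'1.3.6.1.4.1.34094.1.1.1.1.1' != '{}');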
Or you could instead build a more tailored index, which the planner will choose instead of the falsely attractive one:
create index on device.status_events (device_id , time_stamp)
where status->'1.3.6.1.4.1.34094.1.1.1.1.1' != '{}';
I believe this should generate a better plan
SELECT time_stamp FROM device.status_events WHERE
(device_id = 7) AND
(status->'1.3.6.1.4.1.34094.1.1.1.1.1' != '{}')
ORDER BY time_stamp DESC
LIMIT 1
Let me know how that works for you.

Big Table Optimisation

I have a 12 million row table, so not enormous, but I want to optimize it for reads as much as possible.
For example, currently running
SELECT *
FROM hp.historicalposition
WHERE instrumentid = 1167 AND fundid = 'XXX'
ORDER BY date;
returns 4,200 rows and takes about 4 seconds the first time it is run and 1 second the second time.
What indices might help, and are there any other suggestions?
CREATE TABLE hp.historicalposition
(
    date date NOT NULL,
    fundid character(3) NOT NULL,
    instrumentid integer NOT NULL,
    quantityt0 double precision,
    quantity double precision,
    valuation character varying,
    fxid character varying,
    localt0 double precision,
    localt double precision,
    CONSTRAINT attrib_fund_fk FOREIGN KEY (fundid)
        REFERENCES funds (fundid) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE NO ACTION,
    CONSTRAINT attrib_instr_fk FOREIGN KEY (instrumentid)
        REFERENCES instruments (instrumentid) MATCH SIMPLE
        ON UPDATE NO ACTION ON DELETE NO ACTION
)
Here is your query:
SELECT *
FROM hp.historicalposition
WHERE instrumentid = 1167 AND fundid = 'XXX'
ORDER BY date;
The best index is a composite index:
create index idx_historicalposition_instrumentid_fundid_date
    on hp.historicalposition(instrumentid, fundid, date);
This satisfies the where clause and can also be used for the order by.
You definitely need an `instrumentid, fundid` index:
create index historicalposition_instrumentid_fundid_idx
    on hp.historicalposition(instrumentid, fundid);
You can then organize your table data so that its physical order on disk matches this index:
cluster hp.historicalposition using historicalposition_instrumentid_fundid_idx;
General ideas, not necessarily all applicable to PostgreSQL (in fact, they come from the Oracle world):
Partition by time (e.g. day/week/whatever seems most applicable); a sketch follows after this list
If there is only one way of accessing the data and the table is of a write-once type, then using an index organised table (a.k.a. clustered index) could help. Also tweak the write settings not to leave any free space in the pages written to disk.
Consider using compression - to reduce the number of physical reads needed
Have a database job that regularly updates the statistics
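A minimal sketch of the time-partitioning idea using PostgreSQL declarative partitioning (PostgreSQL 10+ only; the partition name and bounds are illustrative, and the remaining columns are omitted for brevity):
-- Illustrative: a range-partitioned variant of the table, one partition per year.
CREATE TABLE hp.historicalposition_part
(
    date date NOT NULL,
    fundid character(3) NOT NULL,
    instrumentid integer NOT NULL
    -- remaining columns as in the original table
) PARTITION BY RANGE (date);

CREATE TABLE hp.historicalposition_2015
    PARTITION OF hp.historicalposition_part
    FOR VALUES FROM ('2015-01-01') TO ('2016-01-01');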

Implement a ring buffer

We have a table logging data. It is logging at say 15K rows per second.
Question: How would we limit the table size to the 1bn newest rows?
i.e. once 1bn rows is reached, it becomes a ring buffer, deleting the oldest row when adding the newest.
Triggers might load the system too much. Here's a trigger example on SO.
We are already using a bunch of tweaks to keep the speed up (such as stored procedures, Table Parameters etc).
Edit (8 years on):
My recent question/answer here addresses a similar issue using a time series database.
Unless there is something magic about 1 billion, I think you should consider other approaches.
The first that comes to mind is partitioning the data. Say, put one hour's worth of data into each partition. This will result in about 15,000*60*60 = 54 million records in a partition. About every 20 hours, you can remove a partition.
One big advantage of partitioning is that the insert performance should work well and you don't have to delete individual records. There can be additional overheads depending on the query load, indexes, and other factors. But, with no additional indexes and a query load that is primarily inserts, it should solve your problem better than trying to delete 15,000 records each second along with the inserts.
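A hedged sketch of that layout in SQL Server (all object names and boundary values are illustrative; TRUNCATE ... WITH PARTITIONS needs SQL Server 2016+, and older versions would SWITCH the oldest partition out instead):
-- Illustrative: one partition per hour, emptied wholesale instead of deleting rows.
CREATE PARTITION FUNCTION pfLogHour (datetime2(0))
AS RANGE RIGHT FOR VALUES ('2024-01-01T00:00:00', '2024-01-01T01:00:00', '2024-01-01T02:00:00');

CREATE PARTITION SCHEME psLogHour
AS PARTITION pfLogHour ALL TO ([PRIMARY]);

CREATE TABLE dbo.LogRows
(
    LoggedAt datetime2(0) NOT NULL,
    Message varchar(10) NOT NULL
) ON psLogHour (LoggedAt);

-- Empty the oldest hour in one fast operation (SQL Server 2016+).
TRUNCATE TABLE dbo.LogRows WITH (PARTITIONS (1));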
I don't have a complete answer but hopefully some ideas to help you get started.
I would add some sort of numeric column to the table. This value would increment by 1 until it reached the number of rows you wanted to keep. At that point the procedure would switch to update statements, overwriting the previous rows instead of inserting new ones. You obviously won't be able to use this column to determine the order of the rows, so if you don't have one already I would also add a timestamp column so you can order them chronologically later.
In order to coordinate the counter value across transactions you could use a sequence, then perform a modulo division to get the counter value.
In order to handle any gaps in the table (e.g. someone deleted some of the rows) you may want to use a merge statement. This should perform an insert if the row is missing or an update if it exists.
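A rough sketch of that approach in T-SQL (assumes SQL Server 2012+ for sequences; the table, sequence, and buffer size of 1,000,000,000 are all illustrative):
-- Illustrative ring-buffer table: Slot cycles from 0 to buffer size - 1.
CREATE TABLE dbo.LogBuffer
(
    Slot bigint NOT NULL PRIMARY KEY,
    LoggedAt datetime2(0) NOT NULL,
    Message varchar(100) NOT NULL
);

CREATE SEQUENCE dbo.LogSlotSeq AS bigint START WITH 0 INCREMENT BY 1;

-- Per logged row: take the next counter value modulo the buffer size,
-- then insert the slot if it is empty or overwrite it if it already exists.
DECLARE @next bigint = NEXT VALUE FOR dbo.LogSlotSeq;
DECLARE @slot bigint = @next % 1000000000;
DECLARE @message varchar(100) = 'some message';

MERGE dbo.LogBuffer AS target
USING (SELECT @slot AS Slot, SYSUTCDATETIME() AS LoggedAt, @message AS Message) AS src
    ON target.Slot = src.Slot
WHEN MATCHED THEN
    UPDATE SET LoggedAt = src.LoggedAt, Message = src.Message
WHEN NOT MATCHED THEN
    INSERT (Slot, LoggedAt, Message) VALUES (src.Slot, src.LoggedAt, src.Message);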
Hope this helps.
Here's my suggestion:
Pre-populate the table with 1,000,000,000 rows, including a row number as the primary key.
Instead of inserting new rows, have the logger keep a counter variable that increments each time, and update the appropriate row according to the row number.
This is actually what you would do with a ring buffer in other contexts. You wouldn't keep allocating memory and deleting; you'd just overwrite the same array over and over.
Update: the update doesn't actually change the data in place, as I thought it did. So this may not be efficient.
Just an idea that is too complicated to write in a comment.
Create a few log tables, 3 as an example, Log1, Log2, Log3
CREATE TABLE Log1 (
Id int NOT NULL
CHECK (Id BETWEEN 0 AND 9)
,Message varchar(10) NOT NULL
,CONSTRAINT [PK_Log1] PRIMARY KEY CLUSTERED ([Id] ASC) ON [PRIMARY]
)
CREATE TABLE Log2 (
Id int NOT NULL
CHECK (Id BETWEEN 10 AND 19)
,Message varchar(10) NOT NULL
,CONSTRAINT [PK_Log2] PRIMARY KEY CLUSTERED ([Id] ASC) ON [PRIMARY]
)
CREATE TABLE Log3 (
Id int NOT NULL
CHECK (Id BETWEEN 20 AND 29)
,Message varchar(10) NOT NULL
,CONSTRAINT [PK_Log3] PRIMARY KEY CLUSTERED ([Id] ASC) ON [PRIMARY]
)
Then create a partitioned view
CREATE VIEW LogView AS (
SELECT * FROM Log1
UNION ALL
SELECT * FROM Log2
UNION ALL
SELECT * FROM Log3
)
If you are on SQL2012 you can use a sequence
CREATE SEQUENCE LogSequence AS int
START WITH 0
INCREMENT BY 1
MINVALUE 0
MAXVALUE 29
CYCLE
;
And then start to insert values
INSERT INTO LogView (Id, Message)
SELECT NEXT VALUE FOR LogSequence
,'SomeMessage'
Now you just have to truncate the log tables on some kind of schedule.
If you don't have SQL 2012 you need to create the sequence some other way.
I'm looking for something similar myself (using a table as a circular buffer) but it seems like a simpler approach (for me) will be just to periodically delete old entries (e.g. the lowest IDs or lowest create/lastmodified datetimes or entries over a certain age). It's not a circular buffer but perhaps it is a close enough approximation for some. ;)
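For that route, a minimal sketch of an age-based cleanup in T-SQL (the table, column, and seven-day cutoff are all hypothetical):
-- Illustrative scheduled cleanup: remove everything older than the retention window.
DELETE FROM dbo.EventLog
WHERE CreatedAt < DATEADD(DAY, -7, SYSUTCDATETIME());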

Selecting the most optimal query

I have a table in an Oracle database which is called my_table, for example. It is a kind of log table. It has an incremental column named "id" and a "registration_number" column which is unique per registered user. Now I want to get the latest changes for registered users, so I wrote the queries below to accomplish this task:
First version:
SELECT t.*
FROM my_table t
WHERE t.id =
(SELECT MAX(id) FROM my_table t_m WHERE t_m.registration_number = t.registration_number
);
Second version:
SELECT t.*
FROM my_table t
INNER JOIN
( SELECT MAX(id) m_id FROM my_table GROUP BY registration_number
) t_m
ON t.id = t_m.m_id;
My first question is: which of the above queries is recommended, and why? And the second one: given that there are sometimes about 70,000 inserts into this table, but mostly the number of inserted rows is between 0 and 2,000, is it reasonable to add an index to this table?
An analytical query might be the fastest way to get the latest change for each registered user:
SELECT registration_number, id
FROM (
SELECT
registration_number,
id,
ROW_NUMBER() OVER (PARTITION BY registration_number ORDER BY id DESC) AS IDRankByUser
FROM my_table
)
WHERE IDRankByUser = 1
As for indexes, I'm assuming you already have an index by registration_number. An additional index on id will help the query, but maybe not by much and maybe not enough to justify the index. I say that because if you're inserting 70K rows at one time the additional index will slow down the INSERT. You'll have to experiment (and check the execution plans) to figure out if the index is worth it.
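To compare the two versions, a minimal sketch of checking a plan in Oracle with EXPLAIN PLAN and DBMS_XPLAN (shown for the first version; the second works the same way):
-- Illustrative: load the optimizer's plan for the first query, then display it.
EXPLAIN PLAN FOR
SELECT t.*
FROM my_table t
WHERE t.id =
    (SELECT MAX(id) FROM my_table t_m WHERE t_m.registration_number = t.registration_number);

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);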
To check which query is faster, you should look at the execution plan and cost; that will give you a fair idea. But I agree with the solution of Ed Gibbs, as analytics make the query run much faster.
If you feel this table is going to grow very big then I would suggest partitioning the table and using local indexes. They will definitely help you to form faster queries.
In cases where you want to insert lots of rows, indexes slow down insertion because with each insert the index also has to be updated (I would not recommend an index on ID). There are two solutions I can think of for this:
You can drop the index before insertion and then recreate it after insertion.
Use reverse key indexes. Check this link: http://oracletoday.blogspot.in/2006/09/there-is-option-to-create-index.html. A reverse key index can impact your query a bit, so there will be a trade-off.
If you are looking for a faster solution and there really is a need to maintain a list of the last activity for each user, then the most robust solution is to maintain a separate table with unique registration_number values and the rowid of the last record created in the log table.
E.g. (for demo only, not checked for syntax validity; sequences and triggers omitted):
create table my_log(id number not null, registration_number number, action_id varchar2(100))
/
create table last_user_action(registration_number number not null, last_action rowid)
/
alter table last_user_action
    add constraint pk_last_user_action primary key (registration_number) using index
/
create or replace procedure write_log(p_reg_num number, p_action_id varchar2)
is
    v_row_id rowid;
begin
    insert into my_log(registration_number, action_id)
    values(p_reg_num, p_action_id)
    returning rowid into v_row_id;

    update last_user_action
    set last_action = v_row_id
    where registration_number = p_reg_num;
end;
/
With such a schema you can simply query the last action for every user with good performance:
select *
from
    last_user_action lua,
    my_log l
where
    l.rowid (+) = lua.last_action
Rowid is a physical storage identity that directly addresses a storage block, and you can't rely on it after moving to another server, restoring from backups, etc. But if you need such functionality it's simple to also add the id column from the my_log table to last_user_action, and use one or the other depending on requirements.

SQL Server Index Usage with an Order By

I have a table named Workflow. There are 38M rows in the table. There is a PK on the following columns:
ID: Identity Int
ReadTime: dateTime
If I perform the following query, the PK is not used. The query plan shows an index scan being performed on one of the nonclustered indexes plus a sort. It takes a very long time with 38M rows.
Select TOP 100 ID From Workflow
Where ID > 1000
Order By ID
However, if I perform this query, a nonclustered index (on LastModifiedTime) is used. The query plan shows an index seek being performed. The query is very fast.
Select TOP 100 * From Workflow
Where LastModifiedTime > '6/12/2010'
Order By LastModifiedTime
So, my question is this. Why isn't the PK used in the first query, but the nonclustered index in the second query is used?
Without being able to fish around in your database, there are a few things that come to my mind.
Are you certain that the PK is (id, ReadTime) as opposed to (ReadTime, id)?
What execution plan does SELECT MAX(id) FROM WorkFlow yield?
What about if you create an index on (id, ReadTime) and then retry the test, or your query?
Since Id is an identity column, having ReadTime participate in the index is superfluous; the clustered key already points to the leaf data. I recommend you modify your indexes:
CREATE TABLE Workflow
(
    Id int IDENTITY,
    ReadTime datetime,
    -- ... other columns,
    CONSTRAINT PK_WorkFlow
        PRIMARY KEY CLUSTERED (Id)
)

CREATE INDEX idx_LastModifiedTime
    ON WorkFlow (LastModifiedTime)
Also, check that statistics are up to date.
Finally, if there are 38 million rows in this table, then the optimizer may conclude that the criterion > 1000 on a unique column is non-selective, because more than 99.997% of the Ids are > 1000 (if your identity seed started at 1). For an index to be considered helpful, the optimizer must conclude that fewer than 5% of the records would be selected. You can use an index hint to force the issue (as already stated by Dan Andrews, and sketched below). What is the structure of the non-clustered index that was scanned?
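A hedged sketch of such a hint, assuming the clustered primary key ends up named PK_WorkFlow as in the DDL above:
-- Illustrative: force a seek on the clustered PK rather than a nonclustered index scan.
SELECT TOP 100 ID
FROM Workflow WITH (INDEX(PK_WorkFlow))
WHERE ID > 1000
ORDER BY ID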