Optimize SQL scan query to get results from a Postgres DB

I'm working on some small SQL logic.
I have one table, Messages, containing message_id and accountid as columns.
Data keeps coming into this table, each row with a unique message_id.
My goal is to store this Messages table data in another database [from a Postgres (source) DB to a Postgres (destination) DB].
For this I have set up an ETL job, which transfers the data.
Here comes the problem: in the Postgres (source) DB where the Messages table is located, message_id is not in sorted order, and the data looks like this .....
My ETL job runs every half hour. Whenever it runs, it should take the data from the source DB to the destination DB on the basis of message_id. In the destination DB I have a stored procedure that gets the max(message_id) from the Messages table and stores that value in another table. In the ETL I use that value in the query I fire on the source DB, to fetch the rows whose message_id is greater than the one I got from the destination DB.
So it's a kind of incremental load process using ETL. But the query I'm using to get data from the source DB is like this: http://prnt.sc/b3u5il
SELECT * FROM (SELECT * FROM MESSAGES ORDER BY message_id) as a WHERE message_id >"+context.vid+"
This query scans the whole table every time it runs, so it takes a long time to execute. I'm getting my desired results, but is there any way to perform this process faster?
Can anyone help me optimize this query (I don't know whether it's possible or not)? Any other suggestions are welcome.
Thanks

The most efficient way to improve performance in your case is to add an index on your sort column, in this case message_id.
That way, your query will perform an index scan instead of the full table scan that is hampering performance.
You can create an index using the following statement:
CREATE INDEX index_name
ON table_name (column_name)
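Applied to the table from the question, that would look something like this (the index name is just an example):
-- Index on the source table's incremental column (name is illustrative)
CREATE INDEX idx_messages_message_id ON messages (message_id);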

Yes.
If message_id is not the leading column in the primary key or a secondary index, then create an index:
... ON MESSAGES (message_id)
And eliminate the inline view:
SELECT m.*
FROM MESSAGES m
WHERE m.message_id > ?
ORDER BY m.message_id
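With the index in place you can confirm (in Postgres) that the planner chooses an index scan rather than a sequential scan, for example:
-- Check the plan for the incremental-load query (the literal value is just an example)
EXPLAIN
SELECT m.*
FROM messages m
WHERE m.message_id > 12345
ORDER BY m.message_id;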

Create a B-tree index:
You can adjust the ordering of a B-tree index by including the options ASC, DESC, NULLS FIRST, and/or NULLS LAST when creating the index; for example:
CREATE INDEX test2_info_nulls_low ON test2 (info NULLS FIRST);
CREATE INDEX test3_desc_index ON test3 (id DESC NULLS LAST);

Related

Select data with a dynamic where clause on non-indexed column

I have a table with 30 columns and millions of entries.
I want to execute a stored procedure on this table to search data.
The search criteria are passed in a parameter to this SP.
If I search data with a dynamic WHERE clause on a non-indexed column, it takes a lot of time.
Below is an example:
Select counterparty_name from counterparty where counterparty_name = 'test'
In this example, this counterparty is in row number 5,000,000.
As explained, I can't create an index on this table.
I would like to know if the processing time is normal.
I would like to know if there is any recommendation that could improve the execution time?
Best regards.
If you do not have an index on the column then it will have to scan the clustered index in order to find the data (or maybe a smaller index that happens to include that column). As such it is going to take a long time.
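If you want to see what that scan actually costs, a minimal sketch (assuming SQL Server, which the mention of a clustered index suggests; table and column names are from the question) is to turn on I/O and time statistics and look at the execution plan:
-- Measure the cost of the search on the non-indexed column
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT counterparty_name
FROM counterparty
WHERE counterparty_name = 'test';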

Date column in delete statement along with the index column

I need help with the issue below.
I need to delete rows from a table that has a huge amount of data inserted into it on a daily basis. I have written a procedure that deletes rows based on a column that has an index on it, which to me should be enough, but my colleague suggested that I also use a date column in the delete, as this will use the date partition (the partition is based on date).
My question is which delete statement would be faster.
E.g.
1. Column name: FILE_NAME (has an index)
delete from table_name where column_name1=file_name
2. Column name1: FILE_NAME (has an index) and column name2: TXN_DATE (no index; the partition is on this column)
delete from table_name where column_name1=file_name and txn_date=date_value
Please advise.
Thanks
Yes, your colleague is right. The second query will be quicker.
The process is called partition pruning. Using the column on which the partitions are created will automatically hit only the partitions where the data is available.
You can also reference the partition directly if you can determine the name of the partition for the date_value, as in:
DELETE FROM table_name
PARTITION (partition_date_value)
WHERE column_name1=file_name;
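To confirm that the date predicate actually prunes partitions, a quick sketch (Oracle; names are from the question) is to explain the plan for the second delete and look at the PSTART/PSTOP columns in the output:
-- Explain the partition-aware delete and display the plan
EXPLAIN PLAN FOR
  DELETE FROM table_name
  WHERE column_name1 = :file_name
    AND txn_date = :date_value;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);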
References:
Examples for DELETE on Oracle Database SQL Language Reference
Partition Pruning
Another Partition Pruning website
If FILE_NAME has an index that actually improves navigation on your table, I think it would be faster to use the first one.

Selecting the most optimal query

I have a table in an Oracle database which is called my_table, for example. It is a kind of log table. It has an incremental column named "id" and a "registration_number" column which is unique per registered user. Now I want to get the latest changes for registered users, so I wrote the queries below to accomplish this task:
First version:
SELECT t.*
FROM my_table t
WHERE t.id =
(SELECT MAX(id) FROM my_table t_m WHERE t_m.registration_number = t.registration_number
);
Second version:
SELECT t.*
FROM my_table t
INNER JOIN
( SELECT MAX(id) m_id FROM my_table GROUP BY registration_number
) t_m
ON t.id = t_m.m_id;
My first question is: which of the above queries is recommended and why? And the second one: if there are sometimes about 70,000 inserts into this table, but mostly the number of inserted rows varies between 0 and 2,000, is it reasonable to add an index to this table?
An analytical query might be the fastest way to get the latest change for each registered user:
SELECT registration_number, id
FROM (
SELECT
registration_number,
id,
ROW_NUMBER() OVER (PARTITION BY registration_number ORDER BY id DESC) AS IDRankByUser
FROM my_table
)
WHERE IDRankByUser = 1
As for indexes, I'm assuming you already have an index by registration_number. An additional index on id will help the query, but maybe not by much and maybe not enough to justify the index. I say that because if you're inserting 70K rows at one time the additional index will slow down the INSERT. You'll have to experiment (and check the execution plans) to figure out if the index is worth it.
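For reference, the two indexes discussed above would be created along these lines (names are illustrative):
-- Assumed existing index, plus the additional index on id discussed above
CREATE INDEX ix_my_table_reg ON my_table (registration_number);
CREATE INDEX ix_my_table_id ON my_table (id);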
In order to check which query is faster, you should look at the execution plan and cost, and that will give you a fair idea. But I agree with Ed Gibbs's solution, as analytic functions make the query run much faster.
If you feel this table is going to grow very big, then I would suggest partitioning the table and using local indexes. They will definitely help you write faster queries.
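A rough sketch of what that could look like (the partitioning key, boundaries, and names are purely illustrative):
-- Illustrative range-partitioned log table with a local index (Oracle)
CREATE TABLE my_table_part (
  id                  NUMBER,
  registration_number NUMBER
)
PARTITION BY RANGE (id) (
  PARTITION p1    VALUES LESS THAN (1000000),
  PARTITION p_max VALUES LESS THAN (MAXVALUE)
);

CREATE INDEX ix_my_table_part_reg ON my_table_part (registration_number) LOCAL;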
In cases where you want to insert lots of rows, indexes slow down insertion, as with each insertion the index also has to be updated (I would not recommend an index on ID). There are two solutions I can think of for this (the first is sketched after this list):
You can drop the index before insertion and then recreate it after insertion.
Use reverse key indexes. Check this link: http://oracletoday.blogspot.in/2006/09/there-is-option-to-create-index.html. A reverse key index can impact your queries a bit, so there is a trade-off.
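A minimal sketch of the drop/recreate approach, reusing the illustrative index name from above:
-- Drop the index before the bulk insert, then rebuild it afterwards (Oracle)
DROP INDEX ix_my_table_reg;

-- ... perform the bulk insert into my_table here ...

CREATE INDEX ix_my_table_reg ON my_table (registration_number);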
If you are looking for a faster solution and there is a real need to maintain a list of the last activity for each user, then the most robust solution is to maintain a separate table with unique registration_number values and the rowid of the last record created in the log table.
E.g. (only for demo, not checked for syntax validity, sequences and triggers omitted):
create table my_log(id number not null, registration_number number, action_id varchar2(100))
/
create table last_user_action(registration_number number not null, last_action rowid)
/
alter table last_user_action
add constraint pk_last_user_action primary key (registration_number) using index
/
create or replace procedure write_log(p_reg_num number, p_action_id varchar2)
is
v_row_id rowid;
begin
insert into my_log(registration_number, action_id)
values(p_reg_num, p_action_id)
returning rowid into v_row_id;
update last_user_action
set last_action = v_row_id
where registration_number = p_reg_num;
end;
/
With such a schema you can simply query the last action for every user with good performance:
select
  lua.registration_number,
  l.*
from
  last_user_action lua,
  my_log l
where
  l.rowid (+) = lua.last_action
Rowid is a physical storage identity that directly addresses a storage block, and you can't use it after moving to another server, restoring from backups, etc. But if you need such functionality, it's simple to add the id column from the my_log table to last_user_action too, and use one or the other depending on requirements.

Avoid "SELECT TOP 1" and "ORDER BY" in Queries

I have the following table in SQL Server 2008 with a lot of data:
|ID|Name|Column_1|Column_2|
|..|....|........|........|
more than 18,000 records. I need to get the row with the lowest value of Column_1, which is a date but could be any data type (and is unsorted), so I use this statement:
SELECT TOP 1 ID, Name from table ORDER BY Column_1 ASC
But this is very, very slow, and I think I don't need to sort the whole table. My question is how to get the same result without using TOP 1 and ORDER BY.
I cannot see why 18,000 rows of information would cause too much of a slowdown, but that is obviously without seeing the data you are storing.
If you are regularly going to be using the Column_1 field, then I would suggest you place a non-clustered index on it... that will speed up your query.
You can do it by "designing" your table via SQL Server Management Studio, or directly via T-SQL...
CREATE INDEX IX_myTable_Column_1 ON myTable (Column_1 ASC)
More information on MSDN about creating indexes here
Update: thanks to comments by @GarethD, who helped me with this, as I wasn't actually aware of it.
As an extra part of the above T-SQL statement, it will increase the speed of your queries if you include the names of the other columns that will be used within the index...
CREATE INDEX IX_myTable_Column_1 ON myTable (Column_1 ASC) INCLUDE (ID, Name)
As GarethD points out, using this SQLFiddle as proof, the execution plan is much quicker as it avoids a "RID" (or Row Identifier) lookup.
More information on MSDN about creating indexes with include columns here
Thank you @GarethD
Would this work faster? When I read this question, this was the code that came to mind:
Select top 1 ID, Name
from table
where Column_1 = (Select min(Column_1) from table)

How to optimize the following delete SQL query?

I have the following delete query in Oracle. There will be about 1,000 records to be deleted from the database at a time.
I have used IN in the query. Is there any better way to write this query?
DELETE FROM BI_EMPLOYEE_ACTIVITY
WHERE EMPLOYEE_ID in (
SELECT
EMP_ID
FROM
BI_EMPLOYEE
WHERE
PRODUCT_ID = IN_PRODUCT_ID
);
It is not really possible to answer this question as we're missing a description of the data distribution: How many rows are in each table? What's the relationship between the tables? How many rows are affected by the delete?
I'll be assuming that both tables are large (since this is an optimization question) and that BI_EMPLOYEE and BI_EMPLOYEE_ACTIVITY have a parent-child 1..N relationship.
If there are few rows affected by the delete, this means that not many employees have the same PRODUCT_ID and each employee has few activities. In this case it would make sense to index both BI_EMPLOYEE (product_id) and BI_EMPLOYEE_ACTIVITY (employee_id).
This is probably not the case though, the delete probably affects lots of rows. In that case the indexes could be a hindrance. If the delete affects lots of rows, the fastest access path probably is FULL TABLE SCAN + HASH JOIN.
We need some metrics here: how many rows are deleted? How long does it take? This is because large DML will always take time, especially DELETE, since it produces the largest amount of undo.
There are alternatives to a large DELETE, as explained in "Deleting many rows from a big table" on AskTom:
recreate the table without the deleted rows (sketched below)
partition the data, do a parallel delete
partition the data so that the delete is done by dropping a partition
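A rough sketch of the first option (Oracle; table and column names are from the question, while the new table name and the product id literal are illustrative, and indexes, constraints and grants must be recreated afterwards):
-- Keep only the rows that should survive, then swap the tables
CREATE TABLE bi_employee_activity_keep AS
  SELECT a.*
    FROM bi_employee_activity a
   WHERE NOT EXISTS (SELECT 1
                       FROM bi_employee e
                      WHERE e.emp_id = a.employee_id
                        AND e.product_id = 42);  -- illustrative product id

DROP TABLE bi_employee_activity;
ALTER TABLE bi_employee_activity_keep RENAME TO bi_employee_activity;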
Putting an index on EMP_ID may help. I don't believe any other optimization is possible; the query is quite simple and straightforward.
Create an index on the PRODUCT_ID column. This would speed up the search. If the column is of varchar type, make use of a function-based index if you are converting values to uppercase or lowercase.
Maybe you can try EXISTS instead of IN:
DELETE FROM BI_EMPLOYEE_ACTIVITY
WHERE EXISTS (
SELECT
EMP_ID
FROM
BI_EMPLOYEE
WHERE
PRODUCT_ID = IN_PRODUCT_ID
AND
EMP_ID = EMPLOYEE_ID
);
Create an index on the BI_EMPLOYEE table on the PRODUCT_ID and EMP_ID columns, in this order (product_id in first place).
And create an index on the BI_EMPLOYEE_ACTIVITY table on the EMPLOYEE_ID column.
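In SQL terms, that would be something like the following (index names are just examples):
-- Composite index driving the subquery, plus an index supporting the outer delete
CREATE INDEX ix_bi_employee_prod_emp ON bi_employee (product_id, emp_id);
CREATE INDEX ix_bi_emp_act_employee ON bi_employee_activity (employee_id);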
I'll just add that, other than creating an index for the query, you need to take a look at locking when your table grows really big. Try to lock the table in exclusive mode (if possible), as this takes only a single lock from the database; and if that's not possible, try to commit the delete every 2,500 records or so, so that if you're stuck with row locking you don't end up starving the database of locks.
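A minimal sketch of the batched-commit approach (Oracle PL/SQL; the 2,500-row batch size comes from the advice above, while the product id value and the block structure are illustrative):
-- Delete in batches and commit after each one to keep the lock footprint small
DECLARE
  in_product_id bi_employee.product_id%TYPE := 42;  -- illustrative value
BEGIN
  LOOP
    DELETE FROM bi_employee_activity
     WHERE employee_id IN (SELECT emp_id
                             FROM bi_employee
                            WHERE product_id = in_product_id)
       AND ROWNUM <= 2500;
    EXIT WHEN SQL%ROWCOUNT = 0;
    COMMIT;
  END LOOP;
  COMMIT;
END;
/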