Massive Delete statement - How to improve query execution time?

I have a Spring batch that runs every day to:
Read CSV files and import them into our database
Aggregate this data and save the aggregated data into another table.
We have a table BATCH_LIST that contains information about all the batches that were already executed.
BATCH_LIST has the following columns:
1. BATCH_ID
2. EXECUTION_DATE
3. STATUS
Among the CSV files that are imported, we have one CSV file to feed an APP_USERS table, and another one to feed the ACCOUNTS table.
APP_USERS has the following columns:
1. USER_ID
2. BATCH_ID
-- more columns
ACCOUNTS has the following columns:
1. ACCOUNT_ID
2. BATCH_ID
-- more columns
In step 2, we aggregate data from ACCOUNTS and APP_USERS to insert rows into a USER_ACCOUNT_RELATION table. This table has exactly two columns: ACCOUNT_ID (referring to ACCOUNTS.ACCOUNT_ID) and USER_ID (referring to APP_USERS.USER_ID).
Now we want to add another step to our Spring batch. We want to delete all the data from the USER_ACCOUNT_RELATION table, but also from APP_USERS and ACCOUNTS, that is no longer relevant (i.e. data that was imported before sysdate - 2).
What has been done so far:
Get all the BATCH_IDs that we want to remove from the database:
SELECT BATCH_ID FROM BATCH_LIST WHERE trunc(EXECUTION_DATE) < sysdate - 2
For each BATCH_ID, we call the following method:
public void deleteAppUsersByBatchId(Connection connection, long batchId) throws SQLException {
    // prepared statements to delete user-account relations and users
}
And here are the two prepared statements:
DELETE FROM USER_ACCOUNT_RELATION
WHERE USER_ID IN (
SELECT USER_ID FROM APP_USERS WHERE BATCH_ID = ?
);
DELETE FROM APP_USERS WHERE BATCH_ID = ?
My issue is that it takes too long to delete the data for one BATCH_ID (more than 1 hour).
Note: I only mentioned the APP_USERS, ACCOUNTS and USER_ACCOUNT_RELATION tables, but I actually have around 25 tables to delete from.
How can I improve the query time?
(I've just tried changing the WHERE USER_ID IN (...) into an EXISTS. It is better, but still way too long.)
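For reference, the EXISTS variant presumably looks something like this (a sketch reconstructed from the statements above):
DELETE FROM USER_ACCOUNT_RELATION r
WHERE EXISTS (
    SELECT 1
    FROM APP_USERS u
    WHERE u.USER_ID = r.USER_ID
      AND u.BATCH_ID = ?
);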

If this will be your regular process, i.e. you only want to keep the last 2 days of data, you don't need indexes for this, since each run deletes roughly a third of all rows.
It's better to use just 3 set-based deletes instead of running the 3 deletes separately for every batch:
DELETE FROM USER_ACCOUNT_RELATION
WHERE USER_ID IN
(
    SELECT u.USER_ID
    FROM {USER} u
    JOIN {FILE} f
      ON u.FILE_ID = f.FILE_ID
    WHERE trunc(f.IMPORT_DATE) < (sysdate - 2)
);
DELETE FROM {USER}
WHERE FILE_ID IN (SELECT FILE_ID FROM {FILE} WHERE trunc(IMPORT_DATE) < (sysdate - 2));
DELETE FROM {ACCOUNT}
WHERE FILE_ID IN (SELECT FILE_ID FROM {FILE} WHERE trunc(IMPORT_DATE) < (sysdate - 2));
Just replace {USER}, {FILE}, {ACCOUNT} with your real table names.
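With the table names from the question, the first delete would look something like this (BATCH_LIST playing the role of {FILE}):
DELETE FROM USER_ACCOUNT_RELATION
WHERE USER_ID IN
(
    SELECT u.USER_ID
    FROM APP_USERS u
    JOIN BATCH_LIST b
      ON u.BATCH_ID = b.BATCH_ID
    WHERE trunc(b.EXECUTION_DATE) < sysdate - 2
);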
Obviously, with the Partitioning option it would be much easier: with daily interval partitioning you could simply drop the old partitions.
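For illustration, a minimal sketch of such an interval-partitioned table (assumes the Partitioning option is licensed; the columns and starting date are illustrative):
CREATE TABLE ACCOUNTS (
    ACCOUNT_ID  NUMBER,
    BATCH_ID    NUMBER,
    IMPORT_DATE DATE
)
PARTITION BY RANGE (IMPORT_DATE)
INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
(PARTITION p0 VALUES LESS THAN (DATE '2020-01-01'));

-- Dropping a day's data is then fast DDL instead of a slow DELETE:
ALTER TABLE ACCOUNTS DROP PARTITION FOR (DATE '2020-01-05') UPDATE INDEXES;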
But even in your case there is another, more difficult but really fast, solution: "partition views". For example, for ACCOUNT you can create 3 different tables ACCOUNT_1, ACCOUNT_2 and ACCOUNT_3, then create a partition view:
create view ACCOUNT as
select 1 table_id, a1.* from ACCOUNT_1 a1
union all
select 2 table_id, a2.* from ACCOUNT_2 a2
union all
select 3 table_id, a3.* from ACCOUNT_3 a3;
Then you can use an INSTEAD OF trigger on this view to insert each day's data into its own table: the first day into ACCOUNT_1, the second into ACCOUNT_2, etc., and truncate the oldest table each midnight. You can easily get the current table name using:
select 'ACCOUNT_' || (mod(to_number(to_char(sysdate, 'j')), 3) + 1) tab_name from dual;
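And a minimal sketch of such an INSTEAD OF trigger, assuming ACCOUNT has just the two columns mentioned in the question (extend the column lists to match your real table):
CREATE OR REPLACE TRIGGER ACCOUNT_INS
INSTEAD OF INSERT ON ACCOUNT
FOR EACH ROW
DECLARE
    -- same day-number formula as above picks the target table
    v_tab PLS_INTEGER := mod(to_number(to_char(sysdate, 'j')), 3) + 1;
BEGIN
    CASE v_tab
        WHEN 1 THEN INSERT INTO ACCOUNT_1 (ACCOUNT_ID, BATCH_ID) VALUES (:new.ACCOUNT_ID, :new.BATCH_ID);
        WHEN 2 THEN INSERT INTO ACCOUNT_2 (ACCOUNT_ID, BATCH_ID) VALUES (:new.ACCOUNT_ID, :new.BATCH_ID);
        WHEN 3 THEN INSERT INTO ACCOUNT_3 (ACCOUNT_ID, BATCH_ID) VALUES (:new.ACCOUNT_ID, :new.BATCH_ID);
    END CASE;
END;
/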

Related

Insert latest records efficiently in hive

I have around 90 tables in Hive; groups of 10 are combined using union all into 9 master tables.
These 90 base tables get new rows inserted every 15 minutes. We need to bring the newly inserted rows into the master tables every 15 minutes.
Checking the IDs with "not in" is consuming some time.
I have a timestamp column as well, but getting data based on that is also taking time.
Is there an efficient way of achieving this: "inserting newly added records from the base tables into the master tables every 15 minutes"?
I can think of two options.
Option 1 - You can create a new table to keep the max timestamp for each master/stage combination. The table should look like this:
masters, stages, mxts
master1, stage1, 2021-01-01 12:30:30
...
Then use it in SQL similar to this:
select s.*
from staging_table_1 s
join maxtimestamp m
  on s.`timestamp` > m.mxts and m.stages = 'stage1' and m.masters = 'master1'
union all
select s.*
from staging_table_2 s
join maxtimestamp m
  on s.`timestamp` > m.mxts and m.stages = 'stage2' and m.masters = 'master1'
And then insert the max timestamp into the new table after each load.
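For example, the refresh step could look like this (a sketch; the maxtimestamp table and the staging table names are assumptions carried over from above):
INSERT OVERWRITE TABLE maxtimestamp
SELECT 'master1' AS masters, 'stage1' AS stages, max(`timestamp`) AS mxts
FROM staging_table_1
UNION ALL
SELECT 'master1' AS masters, 'stage2' AS stages, max(`timestamp`) AS mxts
FROM staging_table_2;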
Option 2 - If you can add a new column to the master table called record_created_by to keep track of which stage created the data,
your insert statement would then be like this:
select s.*, 'master1~stage1' as record_created_by
from staging_table_1 s
join (select max(`timestamp`) mxts from master where record_created_by = 'master1~stage1') mx
  on s.`timestamp` > mx.mxts
union all
select s.*, 'master1~stage2' as record_created_by
from staging_table_2 s
join (select max(`timestamp`) mxts from master where record_created_by = 'master1~stage2') mx
  on s.`timestamp` > mx.mxts
Please note your first-time insert statement would be the same SQL as above but without the timestamp condition. If you have multiple stages, you can add them to the SQL in the same way.
The first option is way faster, but you need to create and maintain a new table.

Append Query Doesn't Append Missing Items

I have 2 tables. Table 1 has data from the bank account. Table 2 aggregates data from multiple other tables; to keep things simple, we will just have 2 tables. I need to append the data from table 1 into table 2.
I have a field in Table2, "SrceFK". The concept is that when a record from Table1 is appended, it fills Table2.SrceFK with the Table1 primary key and the table name. So record 302 will look like "BANK/302" after it appends. This way, when I run the append query, I can avoid duplicates.
The query is not working. I deleted the record from Table2, but when I run the query, it just says "0 records appended", even though the foreign key is not present.
I am new to SQL, Access, and programming in general. I understand basic concepts. I have googled this issue and looked on Stack Overflow, but no luck.
This is my full statement:
INSERT INTO Main ( SrceFK, InvoDate, Descrip, AMT, Ac1, Ac2 )
SELECT Bank.ID &"/"& "BANK", Bank.TransDate, Bank.Descrip, Bank.TtlAmt, Bank.Ac1, Bank.Ac2
FROM Bank
WHERE NOT EXISTS
(
SELECT * FROM Main
WHERE Main.SrceFK = Bank.ID &"/"& "BANK"
);
I expect the query to add records that aren't present in the table, as needed.

Random sample table with Hive, but including matching rows

I have a large table containing a userID column and other user variable columns, and I would like to use Hive to extract a random sample of users based on their userID. Furthermore, sometimes these users appear on multiple rows, and if a randomly selected userID is contained in other parts of the table, I would like to extract those rows too.
I had a look at the Hive sampling documentation and I see that something like this can be done to extract a 1% sample:
SELECT * FROM source
TABLESAMPLE (1 PERCENT) s;
but I am not sure how to add the constraint where I would like all other instances of those 1% userIDs selected too.
You can use rand() to split the data randomly with the proper percentage of userIDs in your category. I recommend rand(seed) because setting the seed makes the results repeatable.
select c.*
from
(select userID
      , if(rand(5555) < 0.1, 'test', 'train') as type  -- 0.1 = 10% of userIDs; use 0.01 for a 1% sample
 from
 (select userID
  from mytable
  group by userID
 ) a
) b
right outer join
(select *
 from mytable
) c
on b.userID = c.userID
where b.type = 'test'
;
This is set up for entity-level modeling purposes, which is why I have test and train as types.

Selectively retrieve data from tables when one record in first table is linked to multiple records in second table

I have 2 tables:
1. Tbl_Master: columns:
a. SEQ_id
b. M_Email_id
c. M_location_id
d. Del_flag
2. Tbl_User: columns
a. U_email_id
b. Last_logged_date
c. User_id
The first table is the master table; it has unique rows, i.e. a single record for each user in the system.
Each user can be uniquely identified by the email id in each table.
One user can have multiple profiles, which means that for one U_email_id in the Tbl_User table there can be many User_ids,
i.e. there can be multiple entries in the second table for each user.
Now I have to select only those users who logged in for the last time before, let's say, 2012, i.e. before 1-Jan-2012.
But if one user has 2 or more User_ids, and one User_id has a Last_logged_date before 2012
while another User_id has one after 2012, then such a user should be ignored.
In the end, all the resulting users will be marked for deletion by setting Del_flag in the master table to 'Yes'.
For example:
Records in Tbl_Master:
A123 ram@abc.com D234 No
A123 john@abc.com D256 No
Records in Tbl_User can be like:
ram@abc.com '11-Dec-2011' Ram1
ram@abc.com '05-Apr-2014' Ram2
john@abc.com '15-Dec-2010' John1
In such a case only John's record should be selected, not Ram's, since one of Ram's profiles has Last_logged_date > 1-Jan-2012.
Another possibility would be:
SELECT
m.M_Email_id,
MAX(u.Last_logged_date) AS last_login
FROM
Tbl_Master m
INNER JOIN
Tbl_User u on u.U_email_id = m.M_Email_id
GROUP BY m.M_Email_id
HAVING
-- Year(MAX(u.Last_logged_date)) < 2012 -- use the appropriate function of your DBMS
EXTRACT(YEAR FROM(MAX(u.Last_logged_date))) < 2012 -- should be the version for oracle
-- see http://docs.oracle.com/cd/B14117_01/server.101/b10759/functions045.htm#i1017161
Your UPDATE operation can use this select in the WHERE clause.
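For example (a sketch in Oracle syntax, reusing the aggregation above):
UPDATE Tbl_Master m
SET m.Del_flag = 'Yes'
WHERE m.M_Email_id IN
(
    SELECT u.U_email_id
    FROM Tbl_User u
    GROUP BY u.U_email_id
    HAVING EXTRACT(YEAR FROM MAX(u.Last_logged_date)) < 2012
);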
Try this; this answer is in SQL Server, as I haven't worked on Oracle.
select *
from Tbl_Master
outer apply
(
    select U_email_id, max(Last_logged_date) as LLogged, count(U_email_id) as RecCount
    from Tbl_User
    where Tbl_User.U_email_id = Tbl_Master.M_Email_id
    group by U_email_id
) as a
where Year(a.LLogged) < 2012
Hope it helps you.

Select from multiple tables with different columns

Say I have this SQL schema.
Table Job:
id, title, type, is_enabled
Table JobFileCopy:
job_id, from_path, to_path
Table JobFileDelete:
job_id, file_path
Table JobStartProcess:
job_id, file_path, arguments, working_directory
There are many other tables with a varying number of columns, and they all have a foreign key job_id that links to id in the Job table.
My questions:
Is this the right approach? I have no requirement to delete anything at any time; I will mostly need to select and insert.
Secondly, what is the best approach to get the list of jobs with the relevant details from all the different tables in a single database hit? E.g. I would like to select the top 20 jobs with their details, which can be in any of the tables (depending on the type column in the Job table) and which I don't know until runtime.
select (case when j.type = 'type1'
             then (select t1.field from table1 t1 where t1.job_id = j.id)
             else (select t2.field from table2 t2 where t2.job_id = j.id)
        end) as a
from Job j;
Could it be a solution for you?
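Alternatively, if each job has at most one row in its detail table, a single round trip with outer joins works too (a sketch against the schema above; the top-20 limit is omitted since its syntax varies by DBMS):
SELECT j.id, j.title, j.type,
       fc.from_path, fc.to_path,
       fd.file_path AS delete_path,
       sp.file_path AS process_path, sp.arguments, sp.working_directory
FROM Job j
LEFT JOIN JobFileCopy     fc ON fc.job_id = j.id
LEFT JOIN JobFileDelete   fd ON fd.job_id = j.id
LEFT JOIN JobStartProcess sp ON sp.job_id = j.id
WHERE j.is_enabled = 1;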