Insert latest records efficiently in Hive

I have around 90 tables in Hive; they are combined 10 at a time with UNION ALL into 9 master tables.
New rows are inserted into these 90 base tables every 15 minutes, and we need to bring the newly inserted rows into the master tables every 15 minutes as well.
Filtering on the ID column with "not in" is consuming a lot of time.
I also have a timestamp column, but fetching the new data based on that is slow too.
Is there an efficient way of achieving this: "inserting newly added records from the base tables into the master tables every 15 minutes"?

I can think of two options.
Option 1 - You can create a new table to keep the max timestamp for each master/stage combination (a DDL sketch follows the example). The table should look like this:
masters, stages, mxts
master1, stage1, 2021-01-01 12:30:30
...
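A minimal DDL sketch for such a tracking table, assuming Hive and the illustrative names above:
-- Hypothetical tracking table: one row per master/stage combination with the latest loaded timestamp.
CREATE TABLE IF NOT EXISTS maxtimestamp (
  masters STRING,
  stages  STRING,
  mxts    TIMESTAMP
);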
Then use it in your SQL along these lines:
select s.* from staging_table_1 s
join maxtimestamp m on s.`timestamp` > m.mxts and m.stages = 'stage1' and m.masters = 'master1'
union all
select s.* from staging_table_2 s
join maxtimestamp m on s.`timestamp` > m.mxts and m.stages = 'stage2' and m.masters = 'master1'
And then insert the max timestamp into the new table after every load.
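A hedged sketch of that refresh step, assuming the maxtimestamp table above and rebuilding it from the staging tables after each load (only two of the combinations are shown; the full statement would list one SELECT per master/stage pair):
-- Hypothetical refresh: rebuild the tracking table with the latest loaded timestamp per combination.
INSERT OVERWRITE TABLE maxtimestamp
SELECT 'master1', 'stage1', MAX(`timestamp`) FROM staging_table_1
UNION ALL
SELECT 'master1', 'stage2', MAX(`timestamp`) FROM staging_table_2;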
Option 2 - Add a new column called record_created_by to the master table, to keep track of which stage created the data.
Your insert statement would then be like this:
select s.*, 'master1~stage1' as record_created_by from staging_table_1 s
join (select max(`timestamp`) mxts from master1 where record_created_by = 'master1~stage1') mx on s.`timestamp` > mx.mxts
union all
select s.*, 'master1~stage2' as record_created_by from staging_table_2 s
join (select max(`timestamp`) mxts from master1 where record_created_by = 'master1~stage2') mx on s.`timestamp` > mx.mxts
Please note that your first-time insert statement would be the same SQL as above, but without the timestamp filter. If you have multiple stages, you can add them to this SQL in the same way.
The first option is considerably faster, but you need to create and maintain an extra table.


I don't understand how to do this task in SQL

There is a table with two fields: Id and Timestamp.
Id is an increasing sequence: each insertion of a new record generates ID(n) = ID(n-1) + 1. Timestamp is a timestamp which, when a record is inserted retroactively, can take any value less than the maximum timestamp of all previous records.
Retroactive insertion is the operation of inserting a record into a table in which
ID(n) > ID(n-1)
Timestamp(n) < max(timestamp(1):timestamp(n-1))
Example of a table:
ID | Timestamp
1  | 2016.09.11
2  | 2016.09.12
3  | 2016.09.13
4  | 2016.09.14
5  | 2016.09.09
6  | 2016.09.12
7  | 2016.09.15
IDs 5 and 6 were inserted retroactively (their timestamps are lower than those of earlier records).
I need a query that will return a list of all ids that fit the definition of insertion retroactively. How can I do this?
It can be rephrased as:
Find every entry for which, in the same table, there is an entry with a smaller id (a previous entry) having a greater timestamp.
It can be achieved using a WHERE EXISTS clause:
SELECT t.id, t.timestamp
FROM tbl t
WHERE EXISTS (
    SELECT 1
    FROM tbl t2
    WHERE t.id > t2.id
      AND t.timestamp < t2.timestamp
);
It should work with any DBMS, since it is standard SQL syntax.
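As an alternative sketch (not from the original answer), the same result can be obtained with a window function on databases that support them; the table and column names follow the query above:
-- For each row, compare its timestamp with the running maximum over all earlier ids.
SELECT id, timestamp
FROM (
    SELECT id,
           timestamp,
           MAX(timestamp) OVER (ORDER BY id
                                ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING) AS prev_max
    FROM tbl
) x
WHERE timestamp < prev_max;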

Update table based on LAG value and condition

I'm trying to update the table with a lagging value for a specific field when that field is different from '1900-01-01 00:00:00'.
select Ticket_ID, Business_Area, Priority, HF_Client_Name, closed_date, Closed_Date_ID, Next_Create_date from schema.table_name order by closed_date desc;
And this is the result I'm trying to get:
I truly appreciate any help I can get. My DB is MySQL.
I'm working on a solution with a LAG function, and will paste my code as soon as I have something worth showing.
Thank you all!
Rosa
I've managed to solve this by creating a temp table which contains all the tickets with 0. I joined my original table without the 0 tickets to this second temp table.
The most important piece was to create a temp table with the maximal next created date for each month:
select a.business_area, a.priority, a.hf_client_name, DATE_FORMAT(a.closed_date ,'%Y-%m') month_of_ticket, max(next_created_date) next_created_date
from tmp_next_lead_date a
group by a.business_area, a.priority, a.hf_client_name, DATE_FORMAT(a.closed_date ,'%Y-%m');
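For completeness, a hedged sketch of how that SELECT could be materialized as the temp table joined below (the name tmp_next_created_Date matches the join that follows; MySQL syntax assumed):
-- Hypothetical: wrap the aggregate above in a temporary table definition.
CREATE TEMPORARY TABLE tmp_next_created_Date AS
SELECT a.business_area, a.priority, a.hf_client_name,
       DATE_FORMAT(a.closed_date, '%Y-%m') AS month_of_ticket,
       MAX(next_created_date) AS next_created_date
FROM tmp_next_lead_date a
GROUP BY a.business_area, a.priority, a.hf_client_name, DATE_FORMAT(a.closed_date, '%Y-%m');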
After that I just joined the two tables based on month of ticket and the rest of the key.
select a.ticket_id, a.business_area, a.priority, a.hf_client_name, a.closed_date,
/*DATE_FORMAT(a.closed_date ,'%Y-%m') month_of_ticket,*/ a.closed_date_id, b.next_created_date
from tmp_missing_combination a
join tmp_next_created_Date b
on a.business_area=b.business_area
and a.hf_client_name=b.hf_client_name
and DATE_FORMAT(a.closed_date, '%Y-%m') = b.month_of_ticket;
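Since the question mentions LAG, here is a hedged alternative sketch (MySQL 8+ assumed; schema.table_name is the placeholder name from the question). It guesses that Next_Create_date is the field to backfill and that closed_date defines the row order, so treat those choices, and the direction of the '1900-01-01 00:00:00' condition, as assumptions:
-- Hypothetical: copy the previous row's Next_Create_date (via LAG) into rows holding
-- the '1900-01-01 00:00:00' placeholder; flip the WHERE if the intended condition is the opposite.
UPDATE schema.table_name t
JOIN (
    SELECT Ticket_ID,
           LAG(Next_Create_date) OVER (ORDER BY closed_date DESC) AS prev_next_create_date
    FROM schema.table_name
) l ON l.Ticket_ID = t.Ticket_ID
SET t.Next_Create_date = l.prev_next_create_date
WHERE t.Next_Create_date = '1900-01-01 00:00:00';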

Massive Delete statement - How to improve query execution time?

I have a Spring batch that will run every day to:
Read CSV files and import them into our database
Aggregate this data and save these aggregated data into another table.
We have a table BATCH_LIST that contains information about all the batches that were already executed.
BATCH_LIST has the following columns :
1. BATCH_ID
2. EXECUTION_DATE
3. STATUS
Among the CSV files that are imported, we have one CSV file to feed an APP_USERS table, and another one to feed the ACCOUNTS table.
APP_USERS has the following columns :
1. USER_ID
2. BATCH_ID
-- more columns
ACCOUNTS has the following columns :
1. ACCOUNT_ID
2. BATCH_ID
-- more columns
In step 2, we aggregate data from ACCOUNTS and APP_USERS to insert rows into a USER_ACCOUNT_RELATION table. This table has exactly two columns: ACCOUNT_ID (referring to ACCOUNTS.ACCOUNT_ID) and USER_ID (referring to APP_USERS.USER_ID).
Now we want to add another step to our Spring batch. We want to delete all the data from the USER_ACCOUNT_RELATION table, but also from APP_USERS and ACCOUNTS, that is no longer relevant (i.e. data that was imported before sysdate - 2).
What has been done so far :
Get all the BATCH_ID that we want to remove from the database
SELECT BATCH_ID FROM BATCH_LIST WHERE trunc(EXECUTION_DATE) < sysdate - 2
For each BATCH_ID, we are calling the following method:
public void deleteAppUsersByBatchId(Connection connection, long batchId) throws SQLException
// prepared statements to delete User account relation and user
And here are the two prepared statements :
DELETE FROM USER_ACCOUNT_RELATION
WHERE USER_ID IN (
SELECT USER_ID FROM APP_USERS WHERE BATCH_ID = ?
);
DELETE FROM APP_USERS WHERE BATCH_ID = ?
My issue is that it takes too long to delete data for one BATCH_ID (more than 1 hour).
Note: I only mentioned the APP_USERS, ACCOUNTS and USER_ACCOUNT_RELATION tables, but I actually have around 25 tables to delete from.
How can I improve the query time ?
(I've just tried to change the WHERE USER_ID IN (...) into an EXISTS. It is better, but still way too long.)
If that will be your regular process, i.e. you only want to keep the last 2 days, you don't need indexes, since every time you will delete about 1/3 of all rows.
It's better to use just 3 deletes instead of 3*7 separate deletes:
DELETE FROM USER_ACCOUNT_RELATION
WHERE USER_ID IN
(
    SELECT u.USER_ID
    FROM {USER} u
    JOIN {FILE} f
      ON u.FILE_ID = f.FILE_ID
    WHERE trunc(f.IMPORT_DATE) < (sysdate - 2)
);
DELETE FROM {USER}
WHERE FILE_ID in (select FILE_ID from {file} where trunc(IMPORT_DATE) < (sysdate - 2));
DELETE FROM {ACCOUNT}
WHERE FILE_ID in (select FILE_ID from {file} where trunc(IMPORT_DATE) < (sysdate - 2));
Just replace {USER}, {FILE}, {ACCOUNT} with your real table names.
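If helpful, the same pattern with the actual table names from the question would look roughly like this (assuming BATCH_LIST.EXECUTION_DATE drives the cut-off, as in the SELECT above):
-- Set-based deletes driven directly by BATCH_LIST, instead of one delete per BATCH_ID.
DELETE FROM USER_ACCOUNT_RELATION
WHERE USER_ID IN (
    SELECT u.USER_ID
    FROM APP_USERS u
    JOIN BATCH_LIST b ON u.BATCH_ID = b.BATCH_ID
    WHERE trunc(b.EXECUTION_DATE) < sysdate - 2
);
DELETE FROM APP_USERS
WHERE BATCH_ID IN (SELECT BATCH_ID FROM BATCH_LIST WHERE trunc(EXECUTION_DATE) < sysdate - 2);
DELETE FROM ACCOUNTS
WHERE BATCH_ID IN (SELECT BATCH_ID FROM BATCH_LIST WHERE trunc(EXECUTION_DATE) < sysdate - 2);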
Obviously, with the Partitioning option it would be much easier: with daily interval partitioning you could simply drop old partitions.
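A hedged illustration of that idea (Oracle interval partitioning; the column list, partition date and table name are made up for the example):
-- Hypothetical daily interval-partitioned table, partitioned by a load date.
CREATE TABLE ACCOUNTS_PART (
    ACCOUNT_ID   NUMBER,
    BATCH_ID     NUMBER,
    IMPORT_DATE  DATE
)
PARTITION BY RANGE (IMPORT_DATE)
INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
(
    PARTITION p0 VALUES LESS THAN (DATE '2023-01-01')
);
-- Old data is then removed by dropping whole partitions instead of deleting rows, e.g.:
-- ALTER TABLE ACCOUNTS_PART DROP PARTITION FOR (DATE '2023-01-05');  -- or drop by partition name on older versions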
But even in your case, there is also another more difficult but really fast solution - "partition views": for example for ACCOUNT, you can create 3 different tables ACCOUNT_1, ACCOUNT_2 and ACCOUNT_3, then create partition view:
create view ACCOUNT as
select 1 table_id, a1.* from ACCOUNT_1 a1
union all
select 2 table_id, a2.* from ACCOUNT_2 a2
union all
select 3 table_id, a3.* from ACCOUNT_3 a3;
Then you can use an INSTEAD OF trigger on this view to insert each day's data into its own table: the first day into ACCOUNT_1, the second into ACCOUNT_2, and so on, truncating the oldest table each midnight. You can easily get the target table name using:
select 'ACCOUNT_' || (mod(to_number(to_char(sysdate, 'j')), 3) + 1) tab_name from dual;
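A hedged sketch of such an INSTEAD OF trigger (Oracle PL/SQL assumed; only ACCOUNT_ID and BATCH_ID are shown, since the other columns aren't listed in the question):
-- Hypothetical: route inserts on the ACCOUNT view into the table for the current day.
create or replace trigger account_ioi
instead of insert on ACCOUNT
for each row
declare
    v_tab pls_integer := mod(to_number(to_char(sysdate, 'j')), 3) + 1;
begin
    if v_tab = 1 then
        insert into ACCOUNT_1 (ACCOUNT_ID, BATCH_ID) values (:new.ACCOUNT_ID, :new.BATCH_ID);
    elsif v_tab = 2 then
        insert into ACCOUNT_2 (ACCOUNT_ID, BATCH_ID) values (:new.ACCOUNT_ID, :new.BATCH_ID);
    else
        insert into ACCOUNT_3 (ACCOUNT_ID, BATCH_ID) values (:new.ACCOUNT_ID, :new.BATCH_ID);
    end if;
end;
/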

How can I schedule a query in Google BigQuery to append new data to table?

I'm trying the new query scheduling feature in Google BigQuery, but I can't seem to get it to append new records to my table correctly.
I set Custom schedule to every 15 minutes and the Destination table write preference to Append to table.
SELECT DATETIME_TRUNC(DATETIME(logtime, 'America/Los_Angeles'), MINUTE) log_minute,
COUNT(DISTINCT user_id) users,
COUNT(DISTINCT product_id) unique_products
FROM mytable
WHERE DATE(logtime, 'America/Los_Angeles') >= "2019-05-01"
GROUP BY log_minute
ORDER BY log_minute
I expected to see 1 row per log_minute, but I'm seeing duplicates: 1 row per log_minute for each scheduled run so that after an hour, there are 5 duplicates of each row (1 at the start + 1 for every 15 minutes).
I expected to see 1 row per log_minute, but I'm seeing duplicates: 1 row per log_minute for each scheduled run
Do you want to append new rows? Of course you'll see a new row every time the query runs - because you are appending rows.
If you want to UPDATE the existing ones instead and add new ones, schedule a MERGE.
https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax#merge_examples
Thanks for the tip, Felipe! For anyone who's trying to do the same thing, I edited the query to the following:
MERGE nextvr.sched_test_15min H
USING
(
SELECT TIMESTAMP(DATETIME_TRUNC(DATETIME(logtime, 'America/Los_Angeles'), MINUTE)) log_minute,
COUNT(DISTINCT user_id) users,
COUNT(DISTINCT product_id) products
FROM mytable
WHERE DATE(logtime, 'America/Los_Angeles') >= "2019-05-01"
GROUP BY log_minute
) N
ON H.log_minute = N.log_minute
WHEN MATCHED THEN
UPDATE
SET users = N.users, products = N.products
WHEN NOT MATCHED THEN
INSERT (log_minute, users, products)
VALUES (log_minute, users, products)
When creating the scheduled query, under Destination for query results section, leave the Table name field blank.

SQL Server Inner Join with Timestamps: is each record only assigned once?

I am working with timestamped records and need to do an inner join based on the timestamp difference. I have been using the DATEDIFF function and it seems to be working well. However, the amount of time between timestamps varies. To clarify, sometimes the record appears in table 2 within the same second as table 1, and sometimes the record in table 2 is up to 15 seconds behind the record in table 1. The records in table 1 are always timestamped before table 2. There is no other common field with which I can join, however there is a register number in each table that I am using to increase accuracy by ensuring that the registers are the same.
My question is: if I increase the timestamp difference to do the inner join (e.g. where the DATEDIFF = 1 or 2 or 3... or 15) will records only be joined once? Or would my table contain duplicate records from table 1 (e.g. where record 1 is joined to record 4 in table 2 where the diff is 4 seconds, and is also joined with record 7 from table 2 where the diff is 11 seconds)?
The reason my statement works now is that no registers have records with less than 6 seconds in between, so even if there are multiple timestamps that would match, the matching of registers eliminates this problem.
My Statement is currently working as:
SELECT *
INTO AtriumSequoiaJoin5
FROM Atrium INNER JOIN Sequoia ON Atrium.Reader = Sequoia.theader_pos_name
WHERE DATEDIFF(s, [Atrium].[Date2], [Sequoia].[theader_tdatetime]) BETWEEN 0 AND 5
ORDER BY Sequoia.theader_id;
You could CROSS APPLY to the record closest in proximity. That's by no means ideal, however; what if there are multiple records written at the same time? You should perhaps give the first table an identity field, then update the second table with SCOPE_IDENTITY().
SELECT *
INTO AtriumSequoiaJoin5
FROM Atrium CROSS APPLY (
    SELECT TOP 1 *
    FROM Sequoia
    WHERE Atrium.Reader = Sequoia.theader_pos_name
    ORDER BY ABS(DATEDIFF(millisecond, [Atrium].[Date2], [Sequoia].[theader_tdatetime]))
) DQ
ORDER BY DQ.theader_id;
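Since the question states that the Sequoia record always lands 0 to 15 seconds after the Atrium record, a hedged refinement of the sketch above would restrict the apply to that window, so a register's unrelated rows can never be picked up (the target table name here is only illustrative):
-- Hypothetical: only consider Sequoia rows 0-15 seconds after the Atrium row, take the earliest.
SELECT *
INTO AtriumSequoiaJoin6
FROM Atrium CROSS APPLY (
    SELECT TOP 1 *
    FROM Sequoia
    WHERE Atrium.Reader = Sequoia.theader_pos_name
      AND Sequoia.theader_tdatetime >= Atrium.Date2
      AND Sequoia.theader_tdatetime <= DATEADD(second, 15, Atrium.Date2)
    ORDER BY Sequoia.theader_tdatetime
) DQ
ORDER BY DQ.theader_id;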