Better way to copy only the updated data from Oracle to Solr - sql

We have an Oracle Database which has about 500 000 records and we have an etl job which runs every day and it will delete all the records from Oracle and reinsert the data (because some of the data is updated on every day) and in this process the lastupdatedTime for all the records is current day. This whole process is maintained by some other team.
And We have an application which is related to search, for this we are using Solr for the better performance. To use Solr, we are reading all the data from Oracle DB and storing into SolrCollection and this is a separate Job.
My questions, is there any better way to get only the modified records from Oracle and update only those into my Solr Collection. If we go with lastUpdatedTime in Oracle Db it wont help in my scenario.

Related

SSIS Incremental Load-15 mins

I have 2 tables. The source table being from a linked server and destination table being from the other server.
I want my data load to happen in the following manner:
Everyday at night I have scheduled a job to do a full dump i.e. truncate the table and load all the data from the source to the destination.
Every 15 minutes to do incremental load as data gets ingested into the source on second basis. I need to replicate the same on the destination too.
For incremental load as of now I have created scripts which are stored in a stored procedure but for future purposes we would like to implement SSIS for this case.
The scripts run in the below manner:
I have an Inserted_Date column, on the basis of this column I take the max of that column and delete all the rows that are greater than or equal to the Max(Inserted_Date) and insert all the similar values from the source to the destination. This job runs evert 15 minutes.
How to implement similar scenario in SSIS?
I have worked on SSIS using the lookup and conditional split using ID columns, but these tables I am working with have a lot of rows so lookup takes up a lot of the time and this is not the right solution to be implemented for my scenario.
Is there any way I can get Max(Inserted_Date) logic into SSIS solution too. My end goal is to remove the approach using scripts and replicate the same approach using SSIS.
Here is the general Control Flow:
There's plenty to go on here, but you may need to learn how to set variables from an Execute SQL and so on.

Deleting records in a table with billion records using spark or scala

we have a table in Azure Data Warehouse with 17 billion records. Now we have a scenario where we have to delete records from this table based on some where condition. We are writing Spark in Scala language in Azure Databricks notebooks.
We searched for different options to do this in Spark, but all suggested to first read the entire table, delete records from this and then overwrite the entire table in Data Warehosue. However this approach will not work in our case due to huge number of records in our table.
Can you please suggest how we can achieve this functionality using spark/scala?
1) checked if we can call stored procedure through spark/scala code in azure databricks but Spark do not support stored procedures.
2) Tried reading the entire table first to delete the records but it goes into never ending loop.
Is possible to create view with select clause as per your requirement, then using of the view

Either insert or update database records via Apache NiFi flow

I am trying to transfer data between two databases with similar structure of tables using NiFi. Example of data structure:
User: {varchar name, integer id}.
There are no "Maximum-value Columns" so it is impossible to determine if there is new data or not. So each time I create "snapshot" of the full table content. The problem is that it is unclear either particular record should be inserted or updated in the target database.
I created two branches of processors: with inserts and with updates. Only insert works for new records and only update for existing. But (!) PutSQL processor works with bunch of flow files.
For example batch size is 100 and processors work once a day. Assume there was 98 records yesterday. They will be inserted. Today there are 200 records (98 from yesterday and 102 new). In this flow if NiFi tries to update first 100 records and insert them then both actions will fail: first 98 records should be updated while last 2 should be inserted.
How to solve this issue? I know it is possible to use batch size 1 but it work too slow.
I recommend solving this in your SQL statements, since NiFi will not know the prior status of the records. A MERGE statement would be ideal, if your database supports it (Oracle, SQL Server, MySQL insert). Otherwise, you can craft both an INSERT and an UPDATE for each record in the source table, making them conditional on the user existing in the table.

Pentaho update/insert

I am trying to have a setup in Pentaho where :
My source data is in MySQL DB and target database is Amazon redshift.
I want to have incremental loads on Redshift database table, based on the last updated timestamp from MySQL DB table.
Primary key is student ID.
Can I implement this using update/insert in Pentaho ?
Insert/Update step in Pentaho Data Integration serves the purpose of inserting the row if it doesn't exist in the destination table or updating it if it's already there. It has nothing to do with incremental loads, but if your loads should be inserting or updating the record based on some Change Data Capture mechanism then this is the right step at the end of the process.
For example you could go one of two ways:
If you have a CDC then limit the data at Table Input for MySQL since you already know the last time a record has been modified (last load)
If you don't have a CDC and you are comparing entire tables then go for joining the sets to produce rows that has changed and then perform a load (slower solution)

Backing up portion of data in SQL

I have a huge schema containing billions of records, I want to purge data older than 13 months from it and maintain it as a backup in such a way that it can be recovered again whenever required.
Which is the best way to do it in SQL - can we create a separate copy of this schema and add a delete trigger on all tables so that when trigger fires, purged data gets inserted to this new schema?
Will there be only one record per delete statement if we use triggers? Or all records will be inserted?
Can we somehow use bulk copy?
I would suggest this is a perfect use case for the Stretch Database feature in SQL Server 2016.
More info: https://msdn.microsoft.com/en-gb/library/dn935011.aspx
The cold data can be moved to the cloud with your given date criteria without any applications or users being aware of it when querying the database. No backups required and very easy to setup.
There is no need for triggers, you can use job running every day, that will put outdated data into archive tables.
The best way I guess is to create a copy of current schema. In main part - delete all that is older then 13 months, in archive part - delete all for last 13 month.
Than create SP (or any SPs) that will collect data - put it into archive and delete it from main table. Put this is into daily running job.
The cleanest and fastest way to do this (with billions of rows) is to create a partitioned table probably based on a date column by month. Moving data in a given partition is a meta operation and is extremely fast (if the partition setup and its function is set up properly.) I have managed 300GB tables using partitioning and it has been very effective. Be careful with the partition function so dates at each edge are handled correctly.
Some of the other proposed solutions involve deleting millions of rows which could take a long, long time to execute. Model the different solutions using profiler and/or extended events to see which is the most efficient.
I agree with the above to not create a trigger. Triggers fire with every insert/update/delete making them very slow.
You may be best served with a data archive stored procedure.
Consider using multiple databases. The current database that has your current data. Then an archive or multiple archive databases where you move your records out from your current database to with some sort of say nightly or monthly stored procedure process that moves the data over.
You can use the exact same schema as your production system.
If the data is already in the database no need for a Bulk Copy. From there you can backup your archive database so it is off the sql server. Restore the database if needed to make the data available again. This is much faster and more manageable than bulk copy.
According to Microsoft's documentation on Stretch DB (found here - https://learn.microsoft.com/en-us/azure/sql-server-stretch-database/), you can't update or delete rows that have been migrated to cold storage or rows that are eligible for migration.
So while Stretch DB does look like a capable technology for archive, the implementation in SQL 2016 does not appear to support archive and purge.