NiFi to fetch huge amount of data from oracle failing - apache

We are using NiFi to pull data from oracle and perform some transformation. The pipeline works fine for small amount but fails with the error no output to read from socket when data volume is high -> 1million records.
Any Help or configuration changes that i need to do.

Related

Azure Data Factory - Rerun Failed Pipeline Against Azure SQL Table With Differential Date Filter

I am using ADF to keep an Azure SQL DB in sync with an on-prem DB. The on-prem DB is read only and the direction is one-way, from the Azure SQL DB to the on-prem DB.
My source table in the Azure SQL Cloud DB is quite large (10's of millions of rows) so I have the pipeline set to use an UPSERT (merge, trying to create a differential merge). I am using a filter on the Source table and the and the Filter Query has a WHERE condition that looks like this:
[HistoryDate] >= '#{formatDateTime(pipeline().parameters.windowStart, 'yyyy-MM-dd HH:mm' )}'
AND [HistoryDate] < '#{formatDateTime(pipeline().parameters.windowEnd, 'yyyy-MM-dd HH:mm' )}'
The HistoryDate column is auto-maintained in the source table with a getUTCDate() type approach. New records will always get a higher value and be included in the WHERE condition.
This works well, but here is my question: I am testing on my local machine before deploying to the client. When I am not working, my laptop hibernates and the pipeline rightfully fails because my local SQL Instance is "offline" during that run. When I move this to production this should not be an issue (computer hibernating), but what happens if the clients connection is temporarily lost (i.e, the client loses internet for a time)? Because my pipeline has a WHERE condition on the source to reduce the table size upsert to a practical number, any failure would result in a loss of any data created during that 5 minute window.
A failed pipeline can be rerun, but the run time would be different at that moment in time and I would effectively miss the block of records that would have been picked up if the pipeline had been run on time. pipeline().parameters.windowStart and pipeline().parameters.windowEnd will now be different.
As an FYI, I have this running every 5 minutes to keep the local copy in sync as close to real-time as possible.
Am I approaching this correctly? I'm sure others have this scenario and it's likely I am missing something obvious. :-)
Thanks...
Sorry to answer my own question, but to potentially help others in the future, it seems there was a better way to deal with this.
ADF offers a "Metadata-driven Copy Task" utility/wizard on the home screen that creates a pipeline. When I used it, it offers a "Delta Load" option for tables which takes a "Watermark". The watermark is a column for an incrementing IDENTITY column, increasing date or timestamp, etc. At the end of the wizard, it allows you to download a script that builds a table and corresponding stored procedure that maintains the values of each parameters after each run. For example, if I wanted my delta load to be based on an IDENTITY column, it stores the value of the max value of a particular pipeline run. The next time a run happens (trigger), it uses this as the MIN value (minus 1) and the current MAX value of the IDENTITY column to get the added records since the last run.
I was going to approach things this way, but it seems like ADF already does this heavy lifting for us. :-)

Bulk load into Snowflake with Petnatho Data Integration over JDBC is slow

We have several on premise databases and then so far had also our data warehouse as on premise. Now moving over to the cloud and data warehouse will be in Snowflake. But we still have more on premise source systems than in the cloud, so would like to stick with our on premise ETL solution. We are using Pentaho Data Integration (PDI) as our ETL tool.
The issue we have then is then that the PDI Table output step that is using the Snowflake JDBC driver is horribly slow for bulk loads into Snowflake. A year ago it was even worse, as it then just did INSERT INTO and COMMIT after every row. By today it has improved a lot, (when looking at the Snowflake history/logs) it now seems to do some kind of PUT to some temp Snowflake stage, but then from there still does some kind of INSERT to the target table and this is slow (in our test case then it took an hour to load 1 000 000 records in).
We have used the workaround for the bulk load into that we use SnowSQL (Snowflakes command line tool) scrips to make the bulk load into Snowflake that is orchestrated by PDI then. In our example case it takes then less than a minute to get the same 1 000 000 records into Snowflake.
All stuff that is then done inside the Snowflake database is just done via PDI SQL steps sent to Snowflake over JDBC and all our source system queries run fine with PDI. So the issue is only with the bulk load into Snowflake where we need to do some weird workaround:
Instead of:
PDI.Table input(get source data) >> PDI.Table output(write to Snowflake table)
we have then:
PDI.Table input(get source data) >> PDI.Write to local file >> Snowsql.PUT local file to Snowflake Stage >> Snowsql.COPY data from Snowflake Stage to Snowflake table >> PDI clear local file, also then clear Snowflake stage.
It works, but is much more complex than it needs to be (compared to previous on premise database load for example).
I don't even know if this issue is rather on the Snowflake (if the JDBC driver works not optimal) side or on the PDI side (if it just does not utilize the JDBC driver correctly), but would like to have it working better.
To bulk load in Snowflake, you need to do the put and copy.

OutputDataConversionError.TypeConversionError writing to Azure SQL DB using Stream Analytics from IoT Hub

I have wired up a Stream Analytics job to take data from an IoT Hub and write it to Azure SQL Database.
I am running into an issue with one input field which is a date/time object '2019-07-29T01:29:27.6246594Z' which always seems to result in an OutputDataConversionError.TypeConversionError -
[11:59:20 AM] Source 'eventssqldb' had 1 occurrences of kind 'OutputDataConversionError.TypeConversionError' between processing times '2019-07-29T01:59:20.7382451Z' and '2019-07-29T01:59:20.7382451Z'.
Input data sample (sourceeventtime is the problem - other datetime fields also fail).
{
"eventtype":"gamedata",
"scoretier":4,
"aistate":"on",
"sourceeventtime":"2019-07-28T23:59:24.6826565Z",
"EventProcessedUtcTime":"2019-07-29T00:13:03.4006256Z",
"PartitionId":1,
"EventEnqueuedUtcTime":"2019-07-28T23:59:25.7940000Z",
"IoTHub":{"MessageId":null,"CorrelationId":null,"ConnectionDeviceId":"testdevice","ConnectionDeviceGenerationId":"636996260331615896","EnqueuedTime":"2019-07-28T23:59:25.7670000Z","StreamId":null}
}
The target field in Azure SQL DB is datetime2 and the incoming value can be converted successfully by Azure SQL DB using a query on the same server.
I've tried a bunch of different techniques including CAST on Stream Analytics, and changing the compatibility level of the Stream Analytics job all to no avail.
Testing the query using a dump of the data in Stream Analytics results in no errors either.
I have the same data writing to Table Storage fine, but need to change to Azure SQL DB to enable shorter automated Power BI refresh cycles.
I have tried multiple Stream Analytics jobs and can recreate each time with Azure SQL DB.
Turns out that this appears to have been a cached error message being displayed in the Azure Portal.
On further investigation through reviewing detailed logs it appears another value that was too long for the target SQL DB field (i.e. would have been truncated) was the actual source of the failure. Resolving this removed the error.

Copy failed records to dynamo db

I am copying 50 million records to amazon dynamodb using hive script. The script failed after running for 2 days with an item size exceeded exception.
Now if I restart the script again, it will start the insertions again from first record. Is there a way where I can say like "Insert only those records which are not in dynamo db" ?
You can use conditional writes to only write the item if it the specified attributes are not equal to the values you provide. This is done by using the ConditionExpression for a PutItem request. However, it still uses write capacity even if a write fails (emphasis mine) so this may not even be the best option for you:
If a ConditionExpression fails during a conditional write, DynamoDB
will still consume one write capacity unit from the table. A failed
conditional write will return a ConditionalCheckFailedException
instead of the expected response from the write operation. For this
reason, you will not receive any information about the write capacity
unit that was consumed. However, you can view the
ConsumedWriteCapacityUnits metric for the table in Amazon CloudWatch
to determine the provisioned write capacity that was consumed from the
table.

SSIS data import with resume

I need to push a large SQL table from my local instance to SQL Azure. The transfer is a simple, 'clean' upload - simply push the data into a new, empty table.
The table is extremely large (~100 million rows) and consist only of GUIDs and other simple types (no timestamp or anything).
I create an SSIS package using the Data Import / Export Wizard in SSMS. The package works great.
The problem is when the package is run over a slow or intermittent connection. If the internet connection goes down halfway through, then there is no way to 'resume' the transfer.
What is the best approach to engineering an SSIS package to upload this data, in a resumable fashion? i.e. in case of connection failure, or to allow the job to be run only between specific time windows.
Normally, in a situation like that, I'd design the package to enumerate through batches of size N (1k row, 10M rows, whatever) and log to a processing table what the last successful batch transmitted would be. However, with GUIDs you can't quite partition them out into buckets.
In this particular case, I would modify your data flow to look like Source -> Lookup -> Destination. In your lookup transformation, query the Azure side and only retrieve the keys (SELECT myGuid FROM myTable). Here, we're only going to be interested in rows that don't have a match in the lookup recordset as those are the ones pending transmission.
A full cache is going to cost about 1.5GB (100M * 16bytes) of memory assuming the Azure side was fully populated plus the associated data transfer costs. That cost will be less than truncating and re-transferring all the data but just want to make sure I called it out.
Just order by your GUID when uploading. And make sure you use the max(guid) from Azure as your starting point when recovering from a failure or restart.