Stream catch-up in Azure Data Factory pipeline - azure-data-factory-2

I have a scenario where I need to do a catch-up of a stream that has already completed for that day. Whenever the pipeline reruns, it has to check whether the stream's completed date is earlier than the current date: if it is, the stream has to run; otherwise it has to exit.
We have tried creating a pipeline with If Condition and ForEach activities, but there seem to be limitations with that approach.
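The check described above boils down to a single date comparison. Here is a minimal sketch of that decision in Python, only to pin down the intended behaviour; the function name `should_run_catch_up` and the idea of passing the last completed date in (for example from a control table) are assumptions, not part of the original pipeline.

```python
from datetime import date
from typing import Optional


def should_run_catch_up(last_completed: Optional[date], today: date) -> bool:
    """Return True when the stream still has to run for `today`.

    `last_completed` is the date of the last successful run (hypothetical
    input, e.g. read from a control table). None means it has never run.
    """
    if last_completed is None:
        return True
    # Run only if the last completion is strictly before the current date.
    return last_completed < today


if __name__ == "__main__":
    print(should_run_catch_up(date(2024, 1, 1), date(2024, 1, 2)))
    print(should_run_catch_up(date(2024, 1, 2), date(2024, 1, 2)))
```

In a pipeline, the same comparison would presumably sit in the condition of the branching activity, with the "false" branch doing nothing so the run exits.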

Related

Why is my Upsolver Kafka data source stuck and/or not pulling any data?

The Kafka topic has messages, but the Upsolver data source is stuck or not pulling any new messages. We have about 15 such data sources; some are working fine, but some seem stuck. What is happening?
When a Kafka data source is created, there is a configuration option called "Read From Start". When it is true (checked), the Kafka data source will only pull data if there were messages in the topic at the time the data source was created. If the topic was empty, the data source will get stuck and stall. In that case, create the data source with Read From Start unchecked; it will then start pulling messages from the point of its creation.
Alternatively, if you can't or don't want to wait for data to arrive, you could publish a dummy sample message matching your schema to the topic before creating the data source with Read From Start = True.

Kafka S3 sink connector does not commit offset

I have the following case:
My Lambda is sending messages to a Kafka topic; these messages contain fields with different dates.
My Kafka connector has flush.size=1000 and partitions messages from the topic by the year, month, and day fields into the S3 bucket.
The problem is that Kafka Connect does not commit the offset on the topic. It reads the same offset all the time, so it overwrites the S3 object with the same data all the time.
When I change to flush.size=10, everything works fine.
How can I overcome this problem and keep flush.size=1000?
Offsets only get committed when an S3 file is written. If you're not sending 1000 events for each day partition, those records will be held in memory. They shouldn't be duplicated or overwritten in S3, since the sink connector has exactly-once delivery (as documented).
Lowering the flush size is one solution. Alternatively, you can add the scheduled rotation interval property (rotate.schedule.interval.ms) so that files are also flushed on a time basis.
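As a rough illustration of the second suggestion, here is a sketch of a connector configuration that keeps flush.size=1000 and adds a scheduled rotation. It is built as a Python dict only so it can be printed as JSON for the Connect REST API. The topic name, bucket name, and 10-minute interval are hypothetical placeholders. As far as I know, the `timezone` property has to be set when `rotate.schedule.interval.ms` is used, and the connector documentation notes trade-offs for wall-clock rotation, so check it before relying on this.

```python
import json


def build_s3_sink_config(flush_size: int = 1000,
                         rotate_schedule_ms: int = 600_000) -> dict:
    """Build an S3 sink connector config that also rotates files on a schedule."""
    return {
        "connector.class": "io.confluent.connect.s3.S3SinkConnector",
        "topics": "events",                  # hypothetical topic name
        "s3.bucket.name": "my-bucket",       # hypothetical bucket name
        "partitioner.class":
            "io.confluent.connect.storage.partitioner.FieldPartitioner",
        "partition.field.name": "year,month,day",
        # Keep the large flush size from the question...
        "flush.size": str(flush_size),
        # ...but also close and upload files on a wall-clock schedule.
        "rotate.schedule.interval.ms": str(rotate_schedule_ms),
        "timezone": "UTC",
    }


if __name__ == "__main__":
    print(json.dumps(build_s3_sink_config(), indent=2))
```

With a schedule in place, partitions that never reach 1000 records are still written out once the interval elapses, which is what lets the offsets be committed.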

Reading data from GCS with BigQuery fails with "Not Found", but the data (files) exists

I have a service that is constantly updating files in GCS bucket with hive format:
bucket
  device_id=aaaa
    month=01
      part-0.parquet
    month=02
      part-0.parquet
    ....
  device_id=bbbb
    month=01
      part-0.parquet
    month=02
      part-0.parquet
    ....
If we are currently in month=02 and I run the following in BigQuery:
SELECT DISTINCT event_id
FROM `project_id.dataset.table`
WHERE month = '02';
I get the error: Not found: Files /bigstore/bucket_name/device_id=aaaa/month=02/part-0.parquet
I checked, and the file was there when the query ran.
If I run
SELECT DISTINCT event_id
FROM `project_id.dataset.table`
WHERE month = '01';
I get results without any errors. I guess the error is related to the fact that I'm modifying the data while querying it. But as I understand this should not be the case with GCS, this is from their docs.
Because uploads are strongly consistent, you will never receive a 404 Not Found response or stale data for a read-after-write or read-after-metadata-update operation.
I saw some posts suggesting that this could be related to my bucket being multi-region.
Any other insights?
There are a few possible reasons for this error.
When you load data from Cloud Storage into a BigQuery table, the
dataset that contains the table must be in the same regional or
multi- regional location as the Cloud Storage bucket.
Due to consistency, for buckets, while metadata updates are strongly
consistent for read-after-metadata-update operations, the process
could take time to finish the changes.
Using a Multi-region bucket is not recommended.
In this case, it could be due to consistency, because you are updating the files in GCS at the same time as you are executing the query. On one run the parquet file was available to read and you didn't get the error; on the next run the file wasn't available because the service was updating it, and you got the error.
Unfortunately, there is no simple way to solve this problem, but here are some options:
You can add a Pub/Sub notification to the bucket and/or file and kick off
your query after the service has finished updating the files.
Make a workflow that blocks updates to the files in the
bucket until your query finishes.
If the query fails with "not found" for file ABCD and you have
verified ABCD exists in GCS, then retry the query X times.
You can back up your data to another location where the files are not
updated constantly, just once a day.
You could move the data into managed storage, where you won't have
this problem because you can do snapshotting.
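The retry option from the list above can be sketched as a small wrapper. This is a minimal example, assuming the query is passed in as a plain callable and that the transient failure can be recognised by "not found" in the exception message; `retry_on_not_found`, its parameters, and the backoff values are all hypothetical and not tied to any BigQuery client API.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_not_found(run_query: Callable[[], T],
                       max_attempts: int = 3,
                       base_delay_s: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `run_query`, retrying with exponential backoff on "not found" errors."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return run_query()
        except Exception as exc:
            transient = "not found" in str(exc).lower()
            # Re-raise anything that is not the transient error, or the last failure.
            if not transient or attempt == max_attempts:
                raise
            sleep(base_delay_s * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky_query():
        # Simulated query: fails twice with the error from the question, then succeeds.
        calls["n"] += 1
        if calls["n"] < 3:
            raise RuntimeError("Not found: Files /bigstore/bucket_name/...")
        return ["event-1", "event-2"]

    print(retry_on_not_found(flaky_query, sleep=lambda s: None))
```

Retrying only hides the race between the writer and the query, so it suits occasional failures better than a bucket that is rewritten continuously.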

Run Snowflake Task when a data share view is refreshed

Snowflake's documentation illustrates how to have a TASK run on a scheduled basis when inserts, updates, deletions, or other DML operations have run on a table, by creating a STREAM on that specific table.
Is there any way to have a TASK run if a view from an external Snowflake data share is refreshed, i.e. dropped and recreated?
As part of this proposed pipeline, we receive a one-time refresh of a view within a specific time period in a day and the goal would be to start a downstream pipeline that runs at most once during that time period, when the view is refreshed.
For example, for the following TASK schedule
'USING CRON 0,10,20,30,40,50 8-12 * * MON,WED,FRI America/New_York', the downstream pipeline should only run once every Monday, Wednesday, and Friday between 8 and 12.
Yes, I can point you to the documentation if you would like to see if this works for the tables you might already have set up:
Is there any way to have a TASK run if a view from an external
Snowflake data share is refreshed, i.e. dropped and recreated?
You could create a stored procedure that monitors the existence of the table. I have not tried that myself, though; I will see if I can ask an expert.
Separately, is there any way to guarantee that the task runs at most
once on a specific day or other time period?
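Yes, you can use CRON in the SCHEDULE parameter to restrict the task to specific days of the week or times. An example:
CREATE TASK delete_old_data
  WAREHOUSE = deletion_wh
  SCHEDULE = 'USING CRON 0 0 * * * UTC'
AS
  -- Placeholder body (hypothetical table and column); replace with your own statement.
  DELETE FROM old_data WHERE created_at < DATEADD(day, -30, CURRENT_TIMESTAMP());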
Reference: https://docs.snowflake.net/manuals/user-guide/tasks.html more specifically: https://docs.snowflake.net/manuals/sql-reference/sql/create-task.html#optional-parameters
A TASK can only be triggered by a calendar schedule, either directly or indirectly via a predecessor TASK being run by a schedule.
Since the tasks are only run on a schedule, they will not run more often than the schedule says.
A TASK can't be triggered by a data share change, so you have to monitor it on a calendar schedule.
This limitation is bound to be lifted sometime, but is valid as of Dec, 2019.
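Since the task can only poll on a schedule, the "at most once per refresh" requirement has to come from state that the polling logic keeps itself. Here is a minimal sketch of that decision in Python, assuming each poll can obtain the view's creation timestamp (or None while the view is dropped) and that the timestamp of the last processed refresh is stored somewhere; both inputs and the function name are hypothetical, and in Snowflake the same comparison would live in the stored procedure the task calls.

```python
from datetime import datetime
from typing import Optional


def should_start_pipeline(view_created_at: Optional[datetime],
                          last_processed_created_at: Optional[datetime]) -> bool:
    """Decide, on one scheduled poll, whether the downstream pipeline should start."""
    if view_created_at is None:
        # The view is currently dropped (refresh in progress): nothing to do yet.
        return False
    if last_processed_created_at is None:
        # No refresh has been processed so far.
        return True
    # Start only if the view was recreated after the refresh already handled.
    return view_created_at > last_processed_created_at


if __name__ == "__main__":
    seen = datetime(2019, 12, 2, 8, 5)
    print(should_start_pipeline(datetime(2019, 12, 4, 9, 30), seen))
    print(should_start_pipeline(seen, seen))
```

After a poll returns True, the caller would record `view_created_at` as the new last processed value, so later polls in the same window return False.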

OutputDataConversionError.TypeConversionError writing to Azure SQL DB using Stream Analytics from IoT Hub

I have wired up a Stream Analytics job to take data from an IoT Hub and write it to Azure SQL Database.
I am running into an issue with one input field which is a date/time object '2019-07-29T01:29:27.6246594Z' which always seems to result in an OutputDataConversionError.TypeConversionError -
[11:59:20 AM] Source 'eventssqldb' had 1 occurrences of kind 'OutputDataConversionError.TypeConversionError' between processing times '2019-07-29T01:59:20.7382451Z' and '2019-07-29T01:59:20.7382451Z'.
Input data sample (sourceeventtime is the problem - other datetime fields also fail).
{
  "eventtype":"gamedata",
  "scoretier":4,
  "aistate":"on",
  "sourceeventtime":"2019-07-28T23:59:24.6826565Z",
  "EventProcessedUtcTime":"2019-07-29T00:13:03.4006256Z",
  "PartitionId":1,
  "EventEnqueuedUtcTime":"2019-07-28T23:59:25.7940000Z",
  "IoTHub":{"MessageId":null,"CorrelationId":null,"ConnectionDeviceId":"testdevice","ConnectionDeviceGenerationId":"636996260331615896","EnqueuedTime":"2019-07-28T23:59:25.7670000Z","StreamId":null}
}
The target field in Azure SQL DB is datetime2 and the incoming value can be converted successfully by Azure SQL DB using a query on the same server.
I've tried a bunch of different techniques including CAST on Stream Analytics, and changing the compatibility level of the Stream Analytics job all to no avail.
Testing the query using a dump of the data in Stream Analytics results in no errors either.
I have the same data writing to Table Storage fine, but need to change to Azure SQL DB to enable shorter automated Power BI refresh cycles.
I have tried multiple Stream Analytics jobs and can reproduce the issue each time with Azure SQL DB.
It turns out that this appears to have been a cached error message being displayed in the Azure Portal.
On further investigation, reviewing the detailed logs showed that the actual source of the failure was another value that was too long for its target SQL DB field (i.e. it would have been truncated). Resolving this removed the error.
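Because the real cause was a value longer than its target column, a quick check of sample events against the column sizes can surface this kind of problem before digging into the job's logs. Here is a minimal sketch, assuming you know the maximum length of each string column; the limits in `max_lengths` below are made up for illustration and are not the actual schema.

```python
from typing import Dict, List, Tuple


def find_oversized_fields(event: dict,
                          max_lengths: Dict[str, int]) -> List[Tuple[str, int, int]]:
    """Return (field, actual_length, limit) for each string value that exceeds its limit."""
    problems = []
    for field, limit in max_lengths.items():
        value = event.get(field)
        # Only string values can overflow a character column.
        if isinstance(value, str) and len(value) > limit:
            problems.append((field, len(value), limit))
    return problems


if __name__ == "__main__":
    sample = {"eventtype": "gamedata", "scoretier": 4, "aistate": "on"}
    # Hypothetical column sizes, e.g. eventtype stored in a 5-character column.
    max_lengths = {"eventtype": 5, "aistate": 10}
    print(find_oversized_fields(sample, max_lengths))
```

Any field reported here would be truncated on insert, which is the situation that produced the conversion error above.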