Data Factory - Data Lake File Created Event Trigger fires twice

I'm developing a pipeline in Azure Data Factory V2. It has a very simple Copy activity. The pipeline has to start when a file is added to Azure Data Lake Storage Gen2. To do that I created an event trigger attached to ADLS Gen2 on 'Blob created', assigned the trigger to the pipeline, and mapped the trigger data @triggerBody().fileName to a pipeline parameter.
To test this I'm using Azure Storage Explorer to upload a file to the data lake. The problem is that the trigger in Data Factory fires twice, so the pipeline is started twice. The first pipeline run finishes as expected and the second one stays in processing.
Has anyone faced this issue? I have tried deleting the trigger in Data Factory and creating a new one, but the result was the same with the new trigger.

I'm having the same issue myself.
When writing a file to ADLS Gen2 there is an initial CreateFile operation followed by a FlushWithClose operation, and both raise a Microsoft.Storage.BlobCreated event type.
https://learn.microsoft.com/en-us/azure/event-grid/event-schema-blob-storage
If you want to ensure that the Microsoft.Storage.BlobCreated event is triggered only when a Block Blob is completely committed, filter the event for the FlushWithClose REST API call. This API call triggers the Microsoft.Storage.BlobCreated event only after data is fully committed to a Block Blob.
https://learn.microsoft.com/en-us/azure/event-grid/how-to-filter-events
You can filter out the CreateFile operation by navigating to Event Subscriptions in the Azure portal and choosing the correct topic type (Storage Accounts), subscription and location. Once you've done that you should be able to see the trigger and update its filter settings. I removed CreateFile.
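For illustration, the shape of the filter that does this is sketched below as a Python dict mirroring the advanced-filter JSON on the Event Grid subscription; the key data.api and the value FlushWithClose come from the linked article, everything else is a placeholder.

```python
# Sketch only: an Event Grid advanced filter that lets through the BlobCreated
# event raised by FlushWithClose (the fully committed blob) and drops the one
# raised by CreateFile. Mirrors the JSON under the subscription's "filter" node.
import json

event_subscription_filter = {
    "filter": {
        "advancedFilters": [
            {
                "operatorType": "StringIn",    # match on the exact operation name
                "key": "data.api",             # Storage REST operation that raised the event
                "values": ["FlushWithClose"],  # only fire once the data is committed
            }
        ]
    }
}

# Print it so it can be compared against the portal view or an ARM template.
print(json.dumps(event_subscription_filter, indent=2))
```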

On your Trigger definition, set 'Ignore empty blobs' to Yes.
The comment from @dtape is probably what's happening underneath; toggling this ignore-empty setting on effectively filters the CreateFile portion out (but not the data-written part).
This fixed the problem for me.
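In case it helps to see where that toggle lands, here is a rough sketch of a blob event trigger definition with the setting enabled, written as a Python dict that mirrors the trigger JSON; the trigger name, paths, scope and parameter mapping are placeholders, and the exact property layout can vary by ADF version.

```python
# Sketch of an ADF blob event trigger with 'Ignore empty blobs' switched on.
# Mirrors the JSON you can inspect in the trigger's code view; all names,
# paths and the storage account scope below are placeholders.
blob_event_trigger = {
    "name": "OnFileArrived",
    "properties": {
        "type": "BlobEventsTrigger",
        "typeProperties": {
            "blobPathBeginsWith": "/mycontainer/blobs/incoming/",
            "blobPathEndsWith": ".csv",
            "ignoreEmptyBlobs": True,  # drops the zero-byte event from CreateFile
            "events": ["Microsoft.Storage.BlobCreated"],
            "scope": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>",
        },
        "pipelines": [
            {
                "pipelineReference": {"referenceName": "CopyNewFile", "type": "PipelineReference"},
                "parameters": {"fileName": "@triggerBody().fileName"},
            }
        ],
    },
}
```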

Related

Google Data Fusion Salesforce to BigQuery Pipeline, automatic way of managing schema updates in Salesforce

Hey, I am trying to create some batch jobs that read from a couple of Salesforce objects and push them to BigQuery. Every time the batch process runs it truncates the table in BigQuery and pushes all the data from the Salesforce object back into BigQuery. Is it possible for Google Data Fusion to automatically detect changes to an object in Salesforce (like adding a new column or changing the data type of a column) and have them registered and pushed to BigQuery via Google Data Fusion?
For the Salesforce side of the puzzle you could look into https://developer.salesforce.com/docs/atlas.en-us.api_rest.meta/api_rest/resources_describeGlobal.htm and the If-Modified-Since header, which tells you whether the definition of the table(s) has changed. That URL covers all tables in the org, or you can run table-specific metadata describe calls with https://developer.salesforce.com/docs/atlas.en-us.api_rest.meta/api_rest/resources_sobject_describe.htm
But I can't tell you how to use it in your job.
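To make the idea concrete, here is a minimal sketch of that check in Python; the instance URL, API version, token and date are placeholders, and the 304 behaviour is the one described in the linked describeGlobal documentation.

```python
# Minimal sketch: ask Salesforce whether any object definitions have changed since
# the last run. A 304 means nothing changed; a 200 returns the current describe
# payload. Instance URL, API version, token and date below are placeholders.
import requests

INSTANCE_URL = "https://yourInstance.salesforce.com"
API_VERSION = "v56.0"
ACCESS_TOKEN = "<oauth-access-token>"

resp = requests.get(
    f"{INSTANCE_URL}/services/data/{API_VERSION}/sobjects/",  # describeGlobal
    headers={
        "Authorization": f"Bearer {ACCESS_TOKEN}",
        # RFC 1123 timestamp of the last successful schema sync
        "If-Modified-Since": "Tue, 01 Aug 2023 00:00:00 GMT",
    },
)

if resp.status_code == 304:
    print("No object definitions changed since the given date.")
else:
    resp.raise_for_status()
    names = [s["name"] for s in resp.json()["sobjects"]]
    print(f"{len(names)} sobjects returned; compare their describes against BigQuery.")
```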
You can use the answer provided by @eyescream as the condition or trigger for the update to BigQuery. You may push changes to BigQuery using the pre-built Streaming Source plugin approach from Data Fusion which, as mentioned in this documentation, it
tracks updates in Salesforce sObjects. Examples of sObjects are
opportunities, contacts, accounts, leads, any custom object, etc.
You may use this approach to automatically track changes and push them to BigQuery. You can also find the full Salesforce Streaming Source configuration reference in this documentation, which Google's official documentation also points to.
However, if you want a more dynamic approach for your overall use case, you may also use the integration of BigQuery with Salesforce directly. In that approach you will need to write your own code, in which you can use @eyescream's answer as the primary condition/trigger and then automatically push the update to your BigQuery schema.

Azure Storage V2 blob event - Not triggering ADF

I have an Azure Data Factory V2 (X) with a blob event trigger. The blob event trigger works fine and triggers Data Factory (X) when I manually upload a file using Storage Explorer. But when I use another data factory (Y) to write the file to the same blob storage instead of a manual write, the event doesn't trigger my Data Factory (X).
I have verified the following:
There are no multiple events under the 'Events' blade of the Blob Storage.
There is a System Topic and a Subscription created with the correct filters: 'BeginsWith' is my container name and 'EndsWith' is the file extension 'parquet'.
I have verified the related questions on Stack Overflow but this seems to be different.
Any clues what could be wrong or is this a known issue?
EDIT:
I checked the logs of the Event Topic & Subscription: when the file is written by ADF (Y) no event is generated, but with a manual upload/write the event gets triggered.
If your trigger's event is 'Blob created', then the blob event trigger basically depends on a new ETag: when a new ETag is created, your pipeline will be triggered.
On my side, I created three containers (test1, test2, test3) and set up a blob event trigger like yours on Data Factory X. A pipeline on Data Factory Y sends files from test1 to test2, and the triggered pipeline on Data Factory X sends files from test2 to test3. When a file is written by the pipeline on Data Factory Y, the pipeline on Data Factory X is triggered and the files are sent to test3 with no problem.
Basically, the principle of 'Blob created' is based on a new ETag appearing in your container, and when your data factory sends files to the target container a new ETag will certainly appear.
I notice you mentioned the BeginsWith and EndsWith filters, so I think that is where the problem comes from. Please check them, as sketched below.
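For reference, the subject filters on the Event Grid subscription behind the trigger usually need the full blob-path prefix, not just the container name. The sketch below is a Python dict mirroring the subscription's filter JSON, with placeholder container and folder names; the subject format is the documented one for blob events.

```python
# Sketch: subject filters on the Event Grid subscription backing the ADF trigger.
# Blob event subjects have the form
#   /blobServices/default/containers/<container>/blobs/<blob path>
# so 'BeginsWith' should carry that full prefix rather than only the container name.
# Container and folder names here are placeholders.
subject_filter = {
    "subjectBeginsWith": "/blobServices/default/containers/mycontainer/blobs/output/",
    "subjectEndsWith": ".parquet",
    "includedEventTypes": ["Microsoft.Storage.BlobCreated"],
}
```

If the prefix or suffix doesn't match the path and file names that Data Factory Y actually writes (for example a staging folder or a different extension), those events are silently filtered out even though a manual upload to a matching path still fires.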

Transferring Storage Accounts Table into Data Lake using Data Factory

I am trying to use Data Factory to transfer a table from Storage Accounts into Data Lake. Microsoft claims that one can "store files of arbitrary sizes and formats into Data Lake". I use the online wizard and try to create a pipeline. The pipeline gets created, but I then always get an error saying:
Copy activity encountered a user error: ErrorCode=UserErrorTabularCopyBehaviorNotSupported,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=CopyBehavior property is not supported if the source is tabular data source.,Source=Microsoft.DataTransfer.ClientLibrary,'.
Any suggestions on what I can do to be able to use Data Factory to transfer data from a Storage Accounts table into Data Lake?
Thanks.
Your case is supported by ADF. As for the error you hit, there is a known defect where, in some cases, the Copy Wizard mis-generates a "CopyBehavior" property that is not applicable. We are fixing that now.
As a workaround, go to the Azure portal -> Author and deploy -> select that pipeline -> find the "CopyBehavior": "MergeFiles" under AzureDataLakeStoreSink and remove that line -> then deploy and rerun the activity.
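For illustration only, the sink fragment to look for resembles the sketch below (written here as a Python dict standing in for the pipeline JSON, with all other sink properties omitted); the type name and the offending property come from the error message and the workaround above.

```python
# Sketch of the generated copy activity's sink fragment (other properties omitted).
# Deleting the "copyBehavior" line and redeploying works around the wizard defect.
sink = {
    "type": "AzureDataLakeStoreSink",
    "copyBehavior": "MergeFiles",  # <-- remove this line; not supported for a tabular source
}
```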
If you happened to author a run-once pipeline, please re-author a scheduled one, since the former is hard to update using JSON.
Thanks,
Linda

Detect plugin rollback

Pretty simple question, but I can't find anything about it.
I have a plugin in Dynamics CRM 2013 that listens to the create and update events of an account. Depending on some business rules, some info about this account is written to an external web service.
However, sometimes a create or update action can be rolled back from outside the scope of your plugin (for example by a third-party plugin), so the account won't be created or updated. The CRM plugin model handles this nicely by rolling back every SDK call made in the transaction. But as I've written some info to an external service, I need to know when a rollback occurred so that I can roll back the external operation manually.
Is there any way to detect a rollback in the plugin execution pipeline and execute some custom code? Alternative solutions are welcome too.
Thx in advance.
There is no trigger that can be subscribed to when the plugin rolls back, but you can determine when it happens after the fact.
Define a new entity (call it "TransactionTracker" or whatever makes sense) with these attributes:
An OptionSet attribute (call it "RollbackAction", or again, whatever makes sense).
A text attribute that serves as a data field.
Define a new workflow that gets kicked off when a "TransactionTracker" gets created:
Have its first step be a wait condition, defined as a process timeout that waits for 1 minute.
Have its next step be a custom workflow activity that uses the RollbackAction to determine how to parse the text attribute and decide whether the entity has been rolled back (if it's a create, does the record exist? If it's an update, is the entity's Modified On date >= the TransactionTracker's date?).
If it has been rolled back, perform whatever action is necessary; if it hasn't been rolled back, exit the workflow (or optionally delete the TransactionTracker entity).
Within your plugin, before making the external call, create a new OrganizationServiceProxy (since you are creating it rather than using the existing one, it operates outside the transaction and therefore will not get cleaned up when the transaction rolls back).
Create a "TransactionTracker" record with that out-of-transaction service, populating the attributes as necessary.
You may need to tweak the timeout, but besides that, it should work fine.

SSIS SQL Step Seems to fail

I have an SSIS package (SQL Server 2012) I'm working on which begins with a SQL task that checks whether a table exists, creates it if it doesn't, and then truncates it. The table is a work table for the Data Flow task that follows.
When I run the task by itself, it works. When I run the package (after deleting the table), it fails because it can't find the missing table (which the SQL task creates if it's missing). Is this because it's "pre-checking" the Data Flow task? How do I get around the issue?
When a package receives the signal to start, the SSIS engine looks at every component and verifies that it exists, that the metadata signatures match, etc. Then, when a component gets the signal that it can run, the metadata is rechecked prior to execution.
To get around this issue, set the DelayValidation property to True so that validation only occurs when the component is ready to execute.
Depending on how your package is structured, you might need to set this at both the Task (Data Flow) and the Package (control flow) level.