I have an Azure Data Factory V2 (X) with a blob event trigger. The trigger works fine and starts Data Factory (X) when I manually upload a file using Storage Explorer. But when I use another data factory (Y) to write the file to the same Blob Storage instead of writing it manually, the event doesn't trigger my Data Factory (X).
I have verified the following:
There are no duplicate events under the 'Events' blade of the Blob Storage account.
There is a System Topic and a Subscription created with the correct filters: 'Begins With' is set to my container name and 'Ends With' to the file extension 'parquet'.
I have gone through the related questions on Stack Overflow, but this one seems to be different.
Any clues what could be wrong or is this a known issue?
EDIT:
I checked the logs of the Event Topic & Subscription: when the file is written by ADF (Y), no event is generated, but with a manual upload/write the event is raised.
If your trigger's event type is 'Blob created', then the blob event trigger basically depends on a new ETag. When a new ETag is created, your pipeline will be triggered.
This is the trigger on my side:
On my side, I created 3 containers: test1, test2, test3. A pipeline on Data Factory Y sends files from test1 to test2, and a pipeline on Data Factory X has the above trigger; when triggered, it sends files from test2 to test3. When a file is written by the pipeline on Data Factory Y, the pipeline on Data Factory X is triggered, and the files are sent to test3 with no problem.
Basically, the 'Blob created' event is based on a new ETag appearing in your container. When your Data Factory sends files to the target container, there is no doubt that a new ETag will appear.
I notice you mentioned the 'Begins With' and 'Ends With' filters, so I think that is where the problem comes from. Please check them.
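In case it helps, the subject of a Blob Created event has the form /blobServices/default/containers/<container>/blobs/<path>, and the 'Begins With' / 'Ends With' filters are plain prefix/suffix matches against that subject. A small illustration of what the subscription compares (container and file names here are made up):

```python
# Illustration only: how the Event Grid subject filters are evaluated.
# Blob Storage event subjects look like:
#   /blobServices/default/containers/<container>/blobs/<path>

begins_with = "/blobServices/default/containers/mycontainer/"  # hypothetical container
ends_with = ".parquet"


def matches(subject: str) -> bool:
    # SubjectBeginsWith / SubjectEndsWith are simple prefix and suffix checks.
    return subject.startswith(begins_with) and subject.endswith(ends_with)


# A blob written by Data Factory Y produces the same kind of subject as a
# manually uploaded blob, so if one matches the filters, the other should too.
print(matches("/blobServices/default/containers/mycontainer/blobs/out/part-0001.parquet"))  # True
print(matches("/blobServices/default/containers/other/blobs/out/part-0001.parquet"))        # False
```

Note that a 'Begins With' filter set to just the bare container name (without the /blobServices/default/containers/ prefix) would never match at the Event Grid subscription level.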
Related
I am creating a C# program and want to execute it from a Custom Activity in Azure Data Factory. However, I am not clear on the steps I should follow.
I have followed a Microsoft article on this, but the steps are not clear, so please help.
The deployment happens at runtime. Basically, Data Factory passes the executable to the Batch service. If you haven't already done so, create an Azure Batch Linked Service to your Batch Account and reference it in the Custom Activity's "Azure Batch" tab.
You will need to upload the executable package to a folder in Azure Blob Storage. Make sure to include the EXE and any dependent DLLs. In the "Settings" tab, do the following:
Reference the Blob Storage Linked Service
Reference the folder path that holds the executable(s).
Specify the command to execute (which should be the ConsoleAppName.exe).
Here is a screen shot of the Settings:
If you need to pass parameters from ADF to Batch, they are called "Extended properties", and are handled differently in your Console app than typical parameters. More information can be found at this answer.
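To make the extended-properties handling a bit more concrete: when the Custom Activity runs, Data Factory drops an activity.json (along with linkedServices.json and datasets.json) into the Batch task's working directory, and the extended properties sit under typeProperties. The console app in the question is C#, but the idea is the same in any language; here is a rough sketch in Python, assuming the documented activity.json layout and a made-up property name:

```python
# Sketch: read ADF Custom Activity extended properties from activity.json,
# which Data Factory places in the Batch task's working directory at runtime.
import json

with open("activity.json") as f:
    activity = json.load(f)

# "myParam" is a made-up extended property name, used only for illustration.
extended = activity["typeProperties"].get("extendedProperties", {})
my_param = extended.get("myParam")
print(my_param)
```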
I'm developing a pipeline in Azure Data Factory V2. It has a very simple Copy activity. The pipeline has to start when a file is added to Azure Data Lake Storage Gen2. In order to do that I have created an event trigger attached to ADLS Gen2 on 'Blob created'. Then I assigned the trigger to the pipeline and associated the trigger data @triggerBody().fileName with a pipeline parameter.
To test this I'm using Azure Storage Explorer to upload a file to the data lake. The problem is that the trigger in Data Factory fires twice, causing the pipeline to start twice. The first pipeline run finishes as expected and the second one stays in processing.
Has anyone faced this issue? I have tried deleting the trigger in Data Factory and creating a new one, but the result was the same with the new trigger.
I'm having the same issue myself.
When writing a file to ADLS Gen2 there is an initial CreateFile operation and then a FlushWithClose operation, and they both raise a Microsoft.Storage.BlobCreated event type.
https://learn.microsoft.com/en-us/azure/event-grid/event-schema-blob-storage
If you want to ensure that the Microsoft.Storage.BlobCreated event is triggered only when a Block Blob is completely committed, filter the event for the FlushWithClose REST API call. This API call triggers the Microsoft.Storage.BlobCreated event only after data is fully committed to a Block Blob.
https://learn.microsoft.com/en-us/azure/event-grid/how-to-filter-events
You can filter out the CreateFile operation by navigating to Event Subscriptions in the Azure portal and choosing the correct topic type (Storage Accounts), subscription, and location. Once you've done that you should be able to see the trigger and update its filter settings. I removed CreateFile.
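To make the distinction concrete: both operations arrive as Microsoft.Storage.BlobCreated events and differ only in the api field of the event data, which is exactly what the advanced filter matches on. A small sketch of the payloads (nothing you have to write for ADF, just an illustration):

```python
# Illustration: both CreateFile and FlushWithClose arrive as
# Microsoft.Storage.BlobCreated; only FlushWithClose means the data is
# fully committed. An advanced filter on data.api performs this check.

def is_fully_committed(event: dict) -> bool:
    return (
        event.get("eventType") == "Microsoft.Storage.BlobCreated"
        and event.get("data", {}).get("api") == "FlushWithClose"
    )


sample_events = [
    {"eventType": "Microsoft.Storage.BlobCreated", "data": {"api": "CreateFile"}},
    {"eventType": "Microsoft.Storage.BlobCreated", "data": {"api": "FlushWithClose"}},
]
print([is_fully_committed(e) for e in sample_events])  # [False, True]
```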
On your Trigger definition, set 'Ignore empty blobs' to Yes.
The comment from @dtape is probably what's happening underneath, and toggling this 'Ignore empty blobs' setting on effectively filters the create portion out (but not the data-written part).
This fixed the problem for me.
I have a requirement where I want my Airflow job to read a file from S3 and post its contents to Slack.
Background
Currently, the Airflow job has an S3 key sensor that waits for a file to be put in an S3 location; if the file doesn't appear in the stipulated time, the job fails and pushes error messages to Slack.
What needs to be done now
If the Airflow job succeeds, it needs to check another S3 location and, if a file exists there, push its contents to Slack.
Is this use case possible with Airflow?
You have already figured out that the first step of your workflow has to be an S3KeySensor.
As for the subsequent steps, depending on what you mean by "..it needs to check another S3 location and if file there exists..", you can go about it in the following way:
Step 1
a. If the file at the other S3 location is also expected to appear there after some time, then of course you will require another S3KeySensor.
b. Otherwise, if the other file is simply expected to already be there (or not, but it does not need to be waited upon), perform the check for its presence using the check_for_key(..) function of S3Hook (this can be done within the python_callable of a simple PythonOperator, or any other custom operator that you are using for step 2).
Step 2
By now, it is ascertained that the second file is present in the expected location (or else we wouldn't have come this far). Now you just need to read the contents of this file using the read_key(..) function. After that you can push the contents to Slack using the call(..) function of SlackHook. You might be tempted to use SlackAPIOperator (which you can, of course), but reading the file from S3 and sending its contents to Slack should still be clubbed into a single task, so you are better off doing both in a generic PythonOperator by employing the same hooks that the native operators use. A rough sketch is given below.
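Here is a minimal sketch of that shape, assuming Airflow 1.10-style import paths and hook signatures (in Airflow 2 these live under provider packages); the bucket, keys, connection IDs, and Slack channel below are made up:

```python
from datetime import datetime

from airflow import DAG
from airflow.hooks.S3_hook import S3Hook
from airflow.hooks.slack_hook import SlackHook
from airflow.operators.python_operator import PythonOperator
from airflow.sensors.s3_key_sensor import S3KeySensor

BUCKET = "my-bucket"                 # hypothetical bucket
FIRST_KEY = "incoming/data.csv"      # the file the existing sensor waits for
SECOND_KEY = "reports/summary.txt"   # the file whose contents go to Slack


def push_s3_file_to_slack(**_):
    s3 = S3Hook(aws_conn_id="aws_default")
    # Step 1b: just check for presence; fail fast if the file is not there.
    if not s3.check_for_key(SECOND_KEY, bucket_name=BUCKET):
        raise ValueError("s3://{}/{} does not exist".format(BUCKET, SECOND_KEY))
    # Step 2: read the contents and post them to Slack in the same task.
    contents = s3.read_key(SECOND_KEY, bucket_name=BUCKET)
    SlackHook(slack_conn_id="slack_default").call(
        "chat.postMessage",
        api_params={"channel": "#alerts", "text": contents},
    )


with DAG(
    dag_id="s3_to_slack",
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    wait_for_first_file = S3KeySensor(
        task_id="wait_for_first_file",
        bucket_name=BUCKET,
        bucket_key=FIRST_KEY,
        aws_conn_id="aws_default",
        timeout=60 * 60,  # fail (and alert) if the file never shows up
    )
    send_to_slack = PythonOperator(
        task_id="send_to_slack",
        python_callable=push_s3_file_to_slack,
        provide_context=True,
    )

    wait_for_first_file >> send_to_slack
```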
I have some data stored in DynamoDB and some high-res images of each user stored in S3. The requirement is to be able to export a user's data on demand: via an API endpoint, collate all the data and send it as a response. We are using AWS Lambda with Node.js for the business logic, S3 for storing images, and a SQL DB for storing relational data.
I had set up a mechanism where API Gateway receives requests and puts them on an SQS queue. The queue triggers a Lambda which runs queries to gather all the data and image paths. We copy all the images and data into a new bucket with the custId as the folder name. Now here's where I'm stuck: how do I stream this data from our new AWS bucket? All the collected data is about 4 GB. I have tried to stream it via AWS Lambda but keep failing. I am able to stream single files but not all the data as a zip. I have done this in Node, but I would rather not set up an EC2 instance if possible and instead solve it directly with S3 and Lambdas.
I can't seem to find a way to stream an entire folder from AWS to the client as a response to an HTTP request.
Okay, found the answer. Instead of trying to return a zip stream, I'm now just zipping and saving the folder in the bucket itself and returning a signed URL for it. Several Node modules can zip S3 folders without loading entire files into memory. Using one of those, we zipped our folder and returned a signed URL. How it behaves under actual load remains to be seen; will test that soon.
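For reference, the same approach in a rough boto3 sketch (the actual implementation used Node modules; the bucket, prefix, and key names below are made up, and this simplified version reads each object into memory, so for ~4 GB you would stream into the zip instead):

```python
# Sketch: zip everything under an S3 prefix, put the zip back in the bucket,
# and hand the client a presigned (signed) URL instead of streaming the bytes.
import os
import tempfile
import zipfile

import boto3

s3 = boto3.client("s3")

BUCKET = "export-bucket"        # hypothetical bucket
PREFIX = "cust-123/"            # the customer's folder
ZIP_KEY = "cust-123/export.zip"


def zip_prefix_and_sign(bucket, prefix, zip_key, expires=3600):
    with tempfile.NamedTemporaryFile(suffix=".zip", delete=False) as tmp:
        with zipfile.ZipFile(tmp, "w", zipfile.ZIP_DEFLATED) as archive:
            paginator = s3.get_paginator("list_objects_v2")
            for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
                for obj in page.get("Contents", []):
                    key = obj["Key"]
                    if key == zip_key or key.endswith("/"):
                        continue  # skip the zip itself and folder markers
                    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
                    archive.writestr(key[len(prefix):], body)
        tmp_path = tmp.name

    s3.upload_file(tmp_path, bucket, zip_key)
    os.remove(tmp_path)

    # Time-limited download link returned as the API response.
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": zip_key},
        ExpiresIn=expires,
    )


print(zip_prefix_and_sign(BUCKET, PREFIX, ZIP_KEY))
```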
I have built, on top of the AWS S3 SDK, an operation which uses the copy operation of the Amazon SDK.
I'm using multipart copy because my object is larger than the maximum size allowed for a single copy (5 GB).
My question is: what happens if all parts of the multipart copy succeed except the last one?
Should I handle deleting the parts that have already been copied?
Generally, I'm expecting the copy operation to put the object in a temporary location and only move it to its final name (in the destination S3 bucket) once the operation has succeeded. Does it work like that?
If a part doesn't transfer successfully, you can send it again.
Until the parts are all copied and the multipart upload (including those created using upload-part-copy) is completed, you don't have an accessible object... but you are still being charged for storage of what you have successfully uploaded/copied, unless you clean up manually or configure the bucket to automatically purge incomplete multipart uploads.
Best practice is to do both: configure the bucket to discard incomplete multipart uploads, and also configure your code to clean up after itself. A rough sketch of both follows.
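A boto3 sketch of both halves, with a made-up bucket name and retention period:

```python
# Sketch: (1) lifecycle rule as a bucket-level safety net, (2) code-level
# cleanup of any multipart uploads that were left incomplete.
import boto3

s3 = boto3.client("s3")
bucket = "my-dest-bucket"  # hypothetical bucket name

# 1) Automatically abort multipart uploads not completed within 7 days.
s3.put_bucket_lifecycle_configuration(
    Bucket=bucket,
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "abort-incomplete-mpu",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "AbortIncompleteMultipartUpload": {"DaysAfterInitiation": 7},
            }
        ]
    },
)

# 2) Abort any multipart uploads that are currently sitting around incomplete.
for upload in s3.list_multipart_uploads(Bucket=bucket).get("Uploads", []):
    s3.abort_multipart_upload(
        Bucket=bucket, Key=upload["Key"], UploadId=upload["UploadId"]
    )
```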
It looks like the AWS SDK isn't writing/closing the object as an S3 object until it has finished copying the entire object successfully.
I ran a simple test to verify whether it writes the parts during the copy-part code path, and it does not write the object to S3.
So the answer is that a multipart copy won't write the object until all parts are copied successfully to the destination bucket.
There is no need for cleanup.
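For completeness, here is a rough boto3 sketch of the flow being discussed: the destination object only becomes visible once complete_multipart_upload succeeds (the names and part size below are made up):

```python
# Sketch: server-side multipart copy with boto3. The destination object does
# not exist as an S3 object until complete_multipart_upload succeeds.
import boto3

s3 = boto3.client("s3")

SRC_BUCKET, SRC_KEY = "source-bucket", "big-object"       # hypothetical
DST_BUCKET, DST_KEY = "dest-bucket", "big-object-copy"    # hypothetical
PART_SIZE = 500 * 1024 * 1024  # 500 MB per part

size = s3.head_object(Bucket=SRC_BUCKET, Key=SRC_KEY)["ContentLength"]
upload_id = s3.create_multipart_upload(Bucket=DST_BUCKET, Key=DST_KEY)["UploadId"]

try:
    parts, offset, part_number = [], 0, 1
    while offset < size:
        last_byte = min(offset + PART_SIZE, size) - 1
        resp = s3.upload_part_copy(
            Bucket=DST_BUCKET,
            Key=DST_KEY,
            UploadId=upload_id,
            PartNumber=part_number,
            CopySource={"Bucket": SRC_BUCKET, "Key": SRC_KEY},
            CopySourceRange="bytes={}-{}".format(offset, last_byte),
        )
        parts.append({"PartNumber": part_number, "ETag": resp["CopyPartResult"]["ETag"]})
        offset, part_number = last_byte + 1, part_number + 1

    # Only after this call does the copied object show up in the destination bucket.
    s3.complete_multipart_upload(
        Bucket=DST_BUCKET,
        Key=DST_KEY,
        UploadId=upload_id,
        MultipartUpload={"Parts": parts},
    )
except Exception:
    # Optional: abort so the already-copied (but invisible) parts stop
    # accruing storage charges, as the other answer recommends.
    s3.abort_multipart_upload(Bucket=DST_BUCKET, Key=DST_KEY, UploadId=upload_id)
    raise
```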