Azure Data Factory inactive pipelines/resources cost - azure-data-factory-2

I want to know if there is a cost associated with the pipelines, datasets, linked services, triggers, and mapping data flows of an Azure Data Factory when it is not in use.
If there is a cost associated with them, is there any way I can deprovision or disable the pipelines to avoid that cost as well?

My understanding is that inactive pipelines were only charged in ADF v1; they are no longer charged in ADF v2.
Please refer to the doc for pricing details:
https://azure.microsoft.com/en-us/pricing/details/data-factory/data-pipeline/

Related

Log Analytics retention policy and querying on logs

I would like to know how we can address this scenario in Azure Log Analytics: I need to generate kube-audit logs for different clusters every week and also retain these logs for approximately 400 days. Storing them in Log Analytics will cost me more, and it is not an optimized architecture because I will not need them that often. So I would like to know from the experts what the best way is to design the architecture so that the kube-audit logs can be retained for 400 days and be available for querying when required, without incurring too much cost.
PS: I also heard from my team that querying 400 days of logs always times out in KQL.
Log Analytics offerings:
Log Analytics now provides the capability to manage service tiers at the table scope. You can set your data to the Archive tier, which has no direct query capability but comes at a much lower cost; archive retention spans up to 7 years.
When needed, you can elevate a subset of your data into the Analytics tier so that it can be queried. The action of elevating your data is called a "search job".
Another option is to elevate an entire time range to the Analytics tier; this is called "restore logs".
Tables' different service tiers -
https://learn.microsoft.com/en-us/azure/azure-monitor/logs/data-retention-archive?tabs=api-1%2Capi-2
Search job offering -
https://learn.microsoft.com/en-us/azure/azure-monitor/logs/search-jobs?tabs=api-1%2Capi-2%2Capi-3
Restore logs -
https://learn.microsoft.com/en-us/azure/azure-monitor/logs/restore?tabs=api-1%2Capi-2
All of these are currently in public preview.
Both offerings, search jobs and restore logs, give you the capability to engage your data on demand; I can't comment or make suggestions regarding the actual cost.
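As a minimal sketch of the archive approach above, the snippet below sets a table's interactive retention to 30 days and its total retention to 400 days through the ARM Tables API, so anything older than 30 days falls into the lower-cost archive tier. This is an illustration, not an official sample: the subscription, resource group, workspace, and table names are placeholders, and you should confirm the current API version in the linked docs.

```python
# Hedged sketch: set per-table retention so data beyond the interactive
# period is kept in the archive tier (placeholder resource names).
import requests
from azure.identity import DefaultAzureCredential

SUB, RG, WS = "<subscription-id>", "<resource-group>", "<workspace-name>"
TABLE = "<kube-audit-table-name>"  # e.g. the table your AKS diagnostics land in

url = (
    "https://management.azure.com"
    f"/subscriptions/{SUB}/resourceGroups/{RG}"
    f"/providers/Microsoft.OperationalInsights/workspaces/{WS}"
    f"/tables/{TABLE}?api-version=2022-10-01"  # verify the api-version against the docs
)
token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token

body = {
    "properties": {
        "retentionInDays": 30,        # interactive (Analytics) retention
        "totalRetentionInDays": 400,  # days 31-400 are served from the archive tier
    }
}
resp = requests.patch(url, json=body, headers={"Authorization": f"Bearer {token}"})
resp.raise_for_status()
print(resp.json()["properties"])
```

When you later need to query an archived range, you would run a search job or restore against that table, as described in the links below.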
Azure Data Explorer solution:
Another option is to use Azure Storage to hold your data (as an example). Azure Data Explorer provides the capability to create an external table, which is a logical view on top of your data; the data itself is kept outside of the ADX cluster. You can query your data using ADX, but expect some degradation in query performance.
ADX external table offering -
https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/schema-entities/externaltables
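For illustration, here is a rough sketch of the external-table idea using the azure-kusto-data package. The cluster URL, database, storage container, column layout, and credentials are all placeholders; the exact external table syntax and storage connection string format should be checked against the link above.

```python
# Hedged sketch: create an ADX external table over audit logs exported to
# Blob Storage, then query it. All names below are placeholders.
from azure.kusto.data import KustoClient, KustoConnectionStringBuilder

kcsb = KustoConnectionStringBuilder.with_az_cli_authentication(
    "https://mycluster.westeurope.kusto.windows.net"
)
client = KustoClient(kcsb)

create_cmd = """.create external table KubeAuditExternal (Timestamp: datetime, Category: string, Log: dynamic)
kind=storage
dataformat=json
(
    h@'https://mystorage.blob.core.windows.net/kube-audit;<storage-account-key-or-SAS>'
)
"""
# Control commands (starting with a dot) go through execute_mgmt.
client.execute_mgmt("auditdb", create_cmd)

# The external table can then be queried like any other table, just slower,
# since the data stays in Blob Storage rather than in the ADX cluster.
result = client.execute("auditdb", "external_table('KubeAuditExternal') | take 10")
for row in result.primary_results[0]:
    print(row)
```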

Blob storage folders backups

We have a lot of pipelines in the Synapse workspace.
We are using a serverless SQL pool, which is set to online.
The dedicated SQL pool is paused, as we do not use it to hold data.
We are using a DevOps repository.
The support team will be doing some clean-up in the environment, e.g. running an old Terraform configuration to re-create the environment.
Question:
I understand that in our DevOps repository everything seems to be backed up except the blob storage folders.
How can we make sure that, in case something gets lost or goes wrong during the workspace clean-up, we will be able to get everything back?
Thank you
ADLS Gen2 has its own tools for ensuring that a DR event won't affect you. One of the most powerful tools is replication, including the geo-redundant storage options.
Data Lake Storage Gen2 already handles 3x replication under the hood to guard against localized hardware failures. Additionally, other replication options, such as ZRS or GZRS, improve HA, while GRS & RA-GRS improve DR. When building a plan for HA, in the event of a service interruption the workload needs access to the latest data as quickly as possible by switching over to a separately replicated instance locally or in a new region.
In a DR strategy, to prepare for the unlikely event of a catastrophic failure of a region, it is also important to have data replicated to a different region using GRS or RA-GRS replication. You must also consider your requirements for edge cases such as data corruption where you may want to create periodic snapshots to fall back to. Depending on the importance and size of the data, consider rolling delta snapshots of 1-, 6-, and 24-hour periods, according to risk tolerances.
For data resiliency with Data Lake Storage Gen2, it is recommended to geo-replicate your data via GRS or RA-GRS to satisfy your HA/DR requirements. Additionally, you should consider ways for the application using Data Lake Storage Gen2 to automatically fail over to the secondary region through monitoring triggers or the length of failed attempts, or at least send a notification to admins for manual intervention. Keep in mind that there is a tradeoff of failing over versus waiting for a service to come back online.
For more details refer to Best practices for using Azure Data Lake Storage Gen2.
There is also a great article that discusses the Azure Synapse Disaster Recovery Architecture.
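To complement the point above about periodic copies you can fall back to, here is a minimal sketch of a pre-clean-up backup: a server-side copy of a blob "folder" into a dated backup container using the azure-storage-blob package. The container and path names are placeholders, not from your workspace, and you would schedule or extend this to match your own layout.

```python
# Hedged sketch: copy a blob prefix into a dated backup container before the
# workspace clean-up. Placeholder names throughout.
from datetime import datetime, timezone
from azure.storage.blob import BlobServiceClient

svc = BlobServiceClient.from_connection_string("<storage-connection-string>")
src = svc.get_container_client("synapse-data")          # placeholder source container
dst_name = "backup-" + datetime.now(timezone.utc).strftime("%Y%m%d")
dst = svc.get_container_client(dst_name)
if not dst.exists():
    dst.create_container()

for blob in src.list_blobs(name_starts_with="curated/"):  # placeholder folder prefix
    source_url = src.get_blob_client(blob.name).url       # append a SAS token if the container is private
    # start_copy_from_url performs a server-side copy; nothing is downloaded locally.
    dst.get_blob_client(blob.name).start_copy_from_url(source_url)
```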

S3 Integration with Snowflake: Best way to implement multi-tenancy?

My team is planning on building a data processing pipeline that will involve S3 integration with Snowflake. This article from Snowflake shows that an AWS IAM role must be created in order for Snowflake to access S3's data.
However, in our pipeline, we need to ensure multi-tenancy and data isolation between users. For example, let's assume that Alice and Bob have files in S3 under "s3://bucket-alice/file_a.csv" and "s3://bucket-bob/file_b.csv" respectively. Then we want to make sure that, when staging Alice's data onto Snowflake, Alice can only access "s3://bucket-alice" and nothing under "s3://bucket-bob". This means that individual AWS IAM roles must be created for each user.
I do realize that Snowflake has its own access control system, but my team wants to make sure that data isolation is fully achieved in the S3-to-Snowflake stage of the pipeline, and not rely only on Snowflake's access control.
We are worried that this will not be scalable, as AWS sets a limit of 5000 IAM users, and that will not be enough as we scale our product. Is this the only way of ensuring data multi-tenancy, and does anyone have a real-world application example of something like this?
Have you explored leveraging Snowflake's internal stage instead? By default, every user gets their own internal stage that only they have permission to access from within Snowflake, and no access outside of Snowflake. Snowflake offers the ability to move data in and out of that internal stage using just about every driver/connector that Snowflake has available. That said, any pipeline/workflow leveraged by 5000+ users would be able to use these connectors to load data into the Snowflake internal stage (backed by S3) without the need for any additional AWS IAM users. Would that be a sufficient solution for your situation?
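For illustration, a minimal sketch of that flow with the Python connector is below: the user uploads a local file into their own user stage (@~) with PUT and then loads it with COPY INTO. The connection parameters, file path, and table name are placeholders; the key point is that no AWS IAM role or storage integration is involved.

```python
# Hedged sketch: load a file through a user's internal stage (@~) instead of
# an external S3 stage. Placeholder credentials, paths, and table names.
import snowflake.connector

conn = snowflake.connector.connect(
    user="alice", password="<password>", account="<org>-<account>",
    warehouse="LOAD_WH", database="RAW", schema="PUBLIC",
)
cur = conn.cursor()

# PUT uploads the local file into Alice's own internal stage; only Alice
# (and roles granted on her stage) can read it from within Snowflake.
cur.execute("PUT file:///tmp/file_a.csv @~/uploads/ AUTO_COMPRESS=TRUE")

# COPY INTO loads from the user stage into a table Alice is granted on.
cur.execute(
    "COPY INTO RAW.PUBLIC.ALICE_EVENTS FROM @~/uploads/ "
    "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1)"
)

cur.close()
conn.close()
```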

Get the cost of an individual pipeline execution in Data Factory

I'm looking into using Azure Data Factory V2 for integration imports and want to know if there's a way to track the cost of individual pipelines being run.
For example, if I had 3 pipelines which represented 3 different integrations, would there be a way to see the cost incurred from each?
Is there also a way to do this in near real time, so that during a month I could somehow put a budget on each integration/pipeline?
The Azure Data Factory bill only shows the total cost. We can't get the per-pipeline cost in Data Factory; we must calculate the price manually.
We can see the pipeline-level consumption under Monitor --> Pipeline runs --> Consumption:
The Azure documentation says that "The pipeline run consumption view shows you the amount consumed for each ADF meter for the specific pipeline run, but it does not show the actual price charged". We need to calculate the cost manually with the Pricing calculator.
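As a rough sketch of that manual calculation, you can multiply the figures from the Consumption view by your region's meter rates from the Pricing calculator. The rates below are placeholders, not current prices, and only two meters are shown for brevity.

```python
# Hedged sketch: estimate one pipeline run's cost from its Consumption view
# figures. Rates are placeholders; take real ones from the Pricing calculator.
ACTIVITY_RUN_RATE = 1.00 / 1000   # $ per 1,000 orchestration activity runs (placeholder)
DIU_HOUR_RATE = 0.25              # $ per DIU-hour of data movement (placeholder)

def pipeline_run_cost(activity_runs: int, diu_hours: float) -> float:
    """Approximate cost of a single pipeline run from its consumption figures."""
    return activity_runs * ACTIVITY_RUN_RATE + diu_hours * DIU_HOUR_RATE

# Example: a run with 12 activity runs and 0.5 DIU-hours of copy activity.
print(f"{pipeline_run_cost(activity_runs=12, diu_hours=0.5):.4f}")
```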
For your questions: Is there a way to see the cost incurred from each?
No, there isn't. You must calculate the cost manually.
Is there also a way to do this in near real time, so that during a month I could somehow put a budget on each integration/pipeline?
I'm afraid not.
Others have posted almost the same question; please see: Azure Data Factory Pipeline Consumption Details

ETL on Google Cloud - (Dataflow vs. Spring Batch) -> BigQuery

I am considering BigQuery for my data warehouse requirement. Right now, I have my data in Google Cloud (Cloud SQL and Bigtable). I have exposed REST APIs to retrieve data from both. Now, I would like to retrieve data from these APIs, do the ETL, and load the data into BigQuery. I am evaluating two ETL options (a daily job over hourly data) right now:
Use Java Spring Batch, create a microservice, and use Kubernetes as the deployment environment. Will it scale?
Use Cloud Dataflow for ETL
Then use the BigQuery batch insert API (for the initial load) and the streaming insert API (for incremental loads when new data is available in the source) to load a denormalized BigQuery schema.
Please let me know your opinions.
Without knowing your data volumes, specifically how much new or diff data you have per day and how you are doing paging with your REST APIs, here is my guidance...
If you go down the path of using Spring Batch, you are more than likely going to have to come up with your own sharding mechanism: how will you divide up REST calls to instantiate your Spring services? You will also be in the Kubernetes management space and will have to handle retries with the streaming API to BQ.
If you go down the Dataflow route, you will have to write some transform code to call your REST API and perform the paging to populate your PCollection destined for BQ. With the recent addition of Dataflow templates, you could create a pipeline that is triggered every N hours and parameterize your REST call(s) to just pull data ?since=latestCall. From there you could execute BigQuery writes (a rough sketch follows at the end of this answer). I recommend doing this in batch mode, as 1) it will scale better if you have millions of rows and 2) it will be less cumbersome to manage (during non-active times).
Since Cloud Dataflow has built-in retry logic for BigQuery and provides consistency across all input and output collections, my vote is for Dataflow in this case.
How big are your REST call results in record count?
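Here is the rough sketch mentioned above, using the Apache Beam Python SDK. It assumes a paged REST endpoint that accepts a ?since= parameter and a destination table named mydataset.events with an invented schema; all of those are placeholders, and a real pipeline would fan the page fetches out for parallelism rather than pulling everything from one seed element.

```python
# Hedged sketch: batch Dataflow pipeline that pulls incremental data from a
# REST API and appends it to BigQuery. Endpoint, schema, and table are placeholders.
import json
import requests
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

API_URL = "https://example.com/api/events"  # placeholder REST endpoint


def fetch_pages(since):
    """Call the REST API page by page and yield one raw record dict at a time."""
    page = 1
    while True:
        resp = requests.get(API_URL, params={"since": since, "page": page})
        resp.raise_for_status()
        rows = resp.json()
        if not rows:
            break
        for row in rows:
            yield row
        page += 1


def to_bq_row(record):
    """Shape a raw API record into a dict matching the (placeholder) BQ schema."""
    return {
        "id": str(record.get("id")),
        "payload": json.dumps(record),
        "updated_at": record.get("updated_at"),
    }


def run(since, pipeline_args=None):
    options = PipelineOptions(pipeline_args)  # e.g. --runner=DataflowRunner --project=...
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Seed" >> beam.Create([since])
            | "CallRestApi" >> beam.FlatMap(fetch_pages)
            | "ToRows" >> beam.Map(to_bq_row)
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                "mydataset.events",
                schema="id:STRING,payload:STRING,updated_at:TIMESTAMP",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )


if __name__ == "__main__":
    run(since="2024-01-01T00:00:00Z")
```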