Is there a way to reuse a single running Databricks cluster in multiple mapping data flows - azure-data-factory-2

Is there a way to reuse a Databricks cluster that is started by a web activity before we run the mapping data flows, so that all of the data flows use the same running cluster instead of each data flow instance spinning up its own cluster, which takes around 6 minutes of setup time per cluster?

Yes. Set the TTL on the Azure Integration Runtime, under "Data Flow Properties", to an amount of time that covers the gap between your data flow job executions. This way, we can set up a VM pool for you and reuse those resources to minimize the cluster start-up time: https://techcommunity.microsoft.com/t5/azure-data-factory/adf-adds-ttl-to-azure-ir-to-reduce-data-flow-activity-times/ba-p/878380.
To start the cluster, don't use a web activity. Use a "dummy" data flow as I demonstrate here: https://youtu.be/FFCbU4ujCiY?t=533.
In ADF, you cannot access the underlying compute engines (Databricks in this case), so you have to kick off a dummy data flow to warm it up.
That cluster start-up will take 5-6 mins. But now, if you use that same Azure IR in your subsequent activities, as long as they are scheduled to execute within that TTL window, ADF can grab existing VM resources to spin up the Spark clusters and marshal your data flow definition to the Spark job execution.
End-to-end that process should now take just 2 mins.
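If you manage your factory programmatically, the same TTL can be set on the IR's data flow properties. Below is a minimal sketch using the azure-mgmt-datafactory Python SDK; the subscription, resource group, factory and IR names, the core count and the 30-minute TTL are placeholder values, and you should verify the property names against your SDK version. It is only a scripted equivalent of the "Data Flow Properties" setting described above.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    ManagedIntegrationRuntime,
    IntegrationRuntimeComputeProperties,
    IntegrationRuntimeDataFlowProperties,
)

# Placeholder identifiers -- substitute your own.
subscription_id = "<subscription-id>"
resource_group = "my-rg"
factory_name = "my-adf"

client = DataFactoryManagementClient(DefaultAzureCredential(), subscription_id)

# A managed Azure IR whose data flow cluster pool stays warm for 30 minutes
# after the last execution (the TTL discussed above).
ir = IntegrationRuntimeResource(
    properties=ManagedIntegrationRuntime(
        compute_properties=IntegrationRuntimeComputeProperties(
            location="AutoResolve",
            data_flow_properties=IntegrationRuntimeDataFlowProperties(
                compute_type="General",
                core_count=8,
                time_to_live=30,  # minutes
            ),
        )
    )
)

client.integration_runtimes.create_or_update(
    resource_group, factory_name, "DataFlowWarmIR", ir
)
```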

Related

Cloud Pub/Sub to BigQuery through Dataflow SQL

I would like to understand the workings of a Dataflow pipeline.
In my case, I have something published to Cloud Pub/Sub periodically, which Dataflow then writes to BigQuery. The volume of messages that come through is in the thousands, so my publisher client has a batch setting of 1000 messages, 1 MB and 10 sec of latency.
The question is: when published in a batch as stated above, does Dataflow SQL take in the messages in the batch and write them to BigQuery all in one go, or does it write one message at a time?
On the other hand, is there any benefit of one over the other?
Please comment if any other details are required. Thanks
Dataflow SQL is just a way to define, with SQL syntax, an Apache Beam pipeline, and to run it on Dataflow.
Because it's Pub/Sub, a streaming pipeline is created based on your SQL definition. When you run your SQL command, a Dataflow job starts and waits for messages from Pub/Sub.
If you publish a bunch of messages, Dataflow is able to scale up to process them as soon as possible.
Keep in mind that Dataflow streaming never scales to 0, and therefore you will always pay for 1 or more VMs to keep your pipeline up and running.
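To make the batching point concrete: the batch settings described in the question live entirely in the publisher client. Pub/Sub still delivers individual messages to the streaming pipeline, and the BigQuery sink typically groups rows internally on its own schedule. Here is a minimal sketch with the google-cloud-pubsub library; the project and topic names are placeholders.

```python
from google.cloud import pubsub_v1

# Publisher-side batching: a batch is sent once it reaches 1000 messages,
# 1 MB, or 10 seconds of latency -- whichever comes first (mirrors the
# settings described in the question).
batch_settings = pubsub_v1.types.BatchSettings(
    max_messages=1000,
    max_bytes=1024 * 1024,
    max_latency=10,
)

publisher = pubsub_v1.PublisherClient(batch_settings=batch_settings)
topic_path = publisher.topic_path("my-project", "my-topic")  # placeholders

futures = []
for i in range(5000):
    # Each publish() call is buffered into a batch, but the subscriber side
    # (here, the Dataflow SQL pipeline) still receives individual messages.
    futures.append(publisher.publish(topic_path, data=f"event {i}".encode()))

for f in futures:
    f.result()  # block until every message has actually been sent
```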

Matillion: How to identify performance bottleneck

We're running Matillion (v1.54) on an AWS EC2 instance (CentOS), based on Tomcat 8.5.
We have developed a few ETL jobs by now, and their execution takes quite a lot of time (that is, up to hours). We'd like to speed up the execution of our jobs, and I wonder how to identify the bottleneck.
What confuses me is that both the m5.2xlarge EC2 instance (8 vCPU, 32G RAM) and the database (Snowflake) don't get very busy and seem to be sort of idle most of the time (regarding CPU and RAM usage as shown by top).
Our environment is configured to use up to 16 parallel connections.
We also added JVM options -Xms20g -Xmx30g to /etc/sysconfig/tomcat8 to make sure the JVM gets enough RAM allocated.
Our Matillion jobs do transformations and loads into a lot of tables, most of which can (and should) be done in parallel. Still, we see that most of the tasks are processed in sequence.
How can we enhance this?
By default there is only one JDBC connection to Snowflake, so your transformation jobs might be getting forced serial for that reason.
You could try bumping up the number of concurrent connections under the Edit Environment dialog.
There is more information here about concurrent connections.
If you do that, a couple of things to avoid are:
- Transactions (begin, commit etc.) will force transformation jobs to run in serial again.
- If you have a parameterized transformation job, only one instance of it can ever be running at a time. More information on that subject is here.
Because the Matillion server is just generating SQL statements and running them in Snowflake, the Matillion server is not likely to be the bottleneck. You should make sure that your orchestration jobs are submitting everything to Snowflake at the same time and there are no dependencies (unless required) built into your flow.
Steps that are chained one after another in your orchestration job will be done in sequence, while steps that branch out from the same component will be done in parallel (and will depend on Snowflake warehouse size to scale).
Also - try the Alter Warehouse Component with a higher concurrency level.
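That component ultimately issues a Snowflake ALTER WAREHOUSE statement. If you want to experiment outside Matillion, a minimal sketch with snowflake-connector-python looks like this; the account, credentials, warehouse name and the value 16 are placeholders.

```python
import snowflake.connector

# Placeholder connection details -- substitute your own.
conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
)

try:
    # Let more queries run concurrently on the warehouse
    # (Snowflake's default MAX_CONCURRENCY_LEVEL is 8).
    conn.cursor().execute(
        "ALTER WAREHOUSE ETL_WH SET MAX_CONCURRENCY_LEVEL = 16"
    )
finally:
    conn.close()
```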

ADF Dataflows; Do I have any control or influence over cluster startup time. (NOT "TTL")

Yes, I know about TTL; Yes, I'm configuring that; No, that's not what I'm asking about here.
Spinning up an initial cluster for a Dataflow takes around 5 minutes.
Acquiring compute from an existing "warm" cluster (i.e. one which has been kept alive using TTL) for a new data flow still appears to take 1-2 minutes.
Those are pretty large numbers, especially if you have a multi-step ETL process, and have broken up your pipeline to separate concerns (or if you're executing the dataflows in a loop, to process data per-source-day)
Controlling the TTL gives me some control over which of those two possibilities I'm triggering, but even 2 minutes can be a quite substantial overhead. (I have a pipeline where fully half the execution time is waiting for those 1-2 minute 'Acquire Compute' startups)
Do I have any control at all over how long startup takes in each case? Is there anything I can do to speed up the startup, or anything I should avoid doing to prevent making things even worse?
There's a new feature in town, to fix exactly this problem.
Release blog:
https://techcommunity.microsoft.com/t5/azure-data-factory/how-to-startup-your-data-flows-execution-in-less-than-5-seconds/ba-p/2267365
ADF has added a new option in the Azure Integration Runtime for data flow TTL: Quick re-use. ... By selecting the re-use option with a TTL setting, you can direct ADF to maintain the Spark cluster for that period of time after your last data flow executes in a pipeline. This will provide much faster sequential executions using that same Azure IR in your data flow activities.
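If you provision the IR through the SDK rather than the portal, the quick re-use option appears to map to the cleanup flag on the same data flow properties used for the TTL. Treat the field name as an assumption and verify it against your azure-mgmt-datafactory version.

```python
from azure.mgmt.datafactory.models import IntegrationRuntimeDataFlowProperties

# Assumed mapping of the portal's "quick re-use" toggle; verify the field
# name (cleanup) against your azure-mgmt-datafactory version.
data_flow_properties = IntegrationRuntimeDataFlowProperties(
    compute_type="General",
    core_count=8,
    time_to_live=30,   # keep the pool for 30 minutes after the last run
    cleanup=False,     # False = don't recycle the cluster; re-use it within the TTL
)
```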

resource management on spark jobs on Yarn and spark shell jobs

Our company has a 9-node cluster on Cloudera.
We have 41 long-running Spark Streaming jobs [YARN + cluster mode] and some regular spark-shell jobs scheduled to run at 1pm daily.
All jobs are currently submitted under user A's role [with root permission].
The issue I encountered is that while all 41 Spark Streaming jobs are running, my scheduled jobs are not able to obtain the resources to run.
I have tried the YARN fair scheduler, but the scheduled jobs still do not run.
We expect the Spark Streaming jobs to always be running, but to reduce the resources they occupy whenever other scheduled jobs start.
Please feel free to share your suggestions or possible solutions.
Your Spark Streaming jobs are consuming too many resources for your scheduled jobs to get started. This is either because they're always scaled to a point where there aren't enough resources left for the scheduled jobs, or because they aren't scaling back.
For the case where the streaming jobs aren't scaling back, you could check whether you have dynamic resource allocation enabled for your streaming jobs. One way of checking is via the spark shell, using spark.sparkContext.getConf.get("spark.streaming.dynamicAllocation.enabled"). If dynamic allocation is enabled, then you could look at reducing the minimum resources for those jobs.
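A rough PySpark equivalent of that check, plus the knobs for capping how much each streaming job can hold on to, might look like the sketch below. The executor counts are illustrative, and note that the DStream-specific spark.streaming.dynamicAllocation.* settings require the core spark.dynamicAllocation.enabled setting to be off.

```python
from pyspark.sql import SparkSession

# Minimal sketch: values are examples; tune min/max executors to leave
# headroom for the daily scheduled jobs.
spark = (
    SparkSession.builder
    .appName("streaming-job")
    # Streaming (DStream) dynamic allocation; the core dynamic allocation
    # flag must be false for the streaming-specific one to take effect.
    .config("spark.dynamicAllocation.enabled", "false")
    .config("spark.streaming.dynamicAllocation.enabled", "true")
    .config("spark.streaming.dynamicAllocation.minExecutors", "1")
    .config("spark.streaming.dynamicAllocation.maxExecutors", "4")
    .getOrCreate()
)

# The same check as in the Scala shell, from PySpark:
print(spark.sparkContext.getConf().get(
    "spark.streaming.dynamicAllocation.enabled", "false"))
```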

WebLogic WorkManager clustering/remote jobs

Does WebLogic WorkManager have the ability to execute jobs on other servers in the cluster to effectively parallelize jobs?
There are two Work Managers - one on the server side that handles thread prioritization/queueing, and the CommonJ Work Manager that can be used through the CommonJ API.
Within your application, you can define priorities within the container and also pursue parallel execution on the same server. However, if you are looking to process workload in parallel across multiple servers by having a single application server splitting up its current workload and redistributing it across the cluster, the bulk of the logic will have to be written into your application.
WebLogic does provide other mechanisms to make this easier (For example, you could have a primary node process the workload into units of work and put it on a durable distributed topic that the other servers read from) but it would be easier to use an existing product, such as Terracotta's EhCache or a compute cluster on Oracle's Coherence Grid.