Matillion: How to identify a performance bottleneck (JVM)

We're running Matillion (v1.54) on an AWS EC2 instance (CentOS), based on Tomcat 8.5.
We have developed a few ETL jobs by now, and their execution takes quite a long time (up to several hours). We'd like to speed up the execution of our jobs, and I wonder how to identify the bottleneck.
What confuses me is that both the m5.2xlarge EC2 instance (8 vCPU, 32 GB RAM) and the database (Snowflake) don't get very busy and seem to be mostly idle (regarding CPU and RAM usage as shown by top).
Our environment is configured to use up to 16 parallel connections.
We also added JVM options -Xms20g -Xmx30g to /etc/sysconfig/tomcat8 to make sure the JVM gets enough RAM allocated.
Our Matillion jobs do transformations and loads into a lot of tables, most of which can (and should) be done in parallel. Still, we see that most of the tasks are processed in sequence.
How can we enhance this?

By default there is only one JDBC connection to Snowflake, so your transformation jobs might be getting forced serial for that reason.
You could try bumping up the number of concurrent connections in the Edit Environment dialog.
There is more information here about concurrent connections.
If you do that, a couple of things to avoid are:
Transactions (begin, commit, etc.) will force transformation jobs to run in serial again.
If you have a parameterized transformation job, only one instance of it can ever be running at a time. More information on that subject is here.

Because the Matillion server just generates SQL statements and runs them in Snowflake, it is unlikely to be the bottleneck. You should make sure that your orchestration jobs submit everything to Snowflake at the same time and that no dependencies (unless required) are built into your flow.
Steps chained one after another in the orchestration job will be done in sequence; steps branching out in parallel from the same component will be done in parallel (and will depend on the Snowflake warehouse size to scale).
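To illustrate what "submitting everything at the same time" looks like at the SQL level, here is a hedged sketch that fires independent statements at Snowflake concurrently from Python. The connection parameters and statements are placeholders; this is roughly what Matillion's parallel branches do over their JDBC connections:

```python
# Hedged sketch: run independent Snowflake statements concurrently so
# the warehouse, not the client, becomes the limiting factor.
# Connection parameters and statements are placeholders.
from concurrent.futures import ThreadPoolExecutor
import snowflake.connector

# Independent statements with no dependencies between them (placeholders).
STATEMENTS = [
    "INSERT INTO sales_agg SELECT region, SUM(amount) FROM sales GROUP BY region",
    "INSERT INTO users_agg SELECT country, COUNT(*) FROM users GROUP BY country",
]

def run(sql):
    # One connection per statement keeps the executions independent.
    conn = snowflake.connector.connect(account="...", user="...", password="...")
    try:
        conn.cursor().execute(sql)
    finally:
        conn.close()

with ThreadPoolExecutor(max_workers=len(STATEMENTS)) as pool:
    list(pool.map(run, STATEMENTS))  # raises if any statement fails
```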
Also, try the Alter Warehouse component with a higher concurrency level.
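For reference, the same setting can be applied directly in SQL. A minimal sketch using the Snowflake Python connector (the connection parameters and the warehouse name ETL_WH are placeholders; MAX_CONCURRENCY_LEVEL defaults to 8):

```python
# Minimal sketch: raise the concurrency level of a Snowflake warehouse.
# Connection parameters and the warehouse name ETL_WH are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",    # placeholder
    user="your_user",          # placeholder
    password="your_password",  # placeholder
)
try:
    # MAX_CONCURRENCY_LEVEL caps how many queries run concurrently
    # on the warehouse; the default is 8.
    conn.cursor().execute("ALTER WAREHOUSE ETL_WH SET MAX_CONCURRENCY_LEVEL = 16")
finally:
    conn.close()
```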

Related

Singlestore (MemSQL)

I have a Singlestore (previously MemSQL) cloud database set up.
My software is running in the background, constantly writing to a table.
When I try to query this table, it takes 10+ seconds. When the software is shut off, the query takes milliseconds.
What would be the reason for this? And is there anything that can be done to mitigate against this?
From a high level, cluster resources are much more utilized while the background software constantly writes to the table. The same resources that handle the constant writes are concurrently trying to serve the query, so it makes sense that it's faster when there is no writing.
One 'knob to turn' with respect to database ingest performance is the partition count: you can try creating a test DB with more partitions than the current DB (say, 2x more). Then try querying the test DB, both while the background software is running and while it is not, and compare this to the DB with fewer partitions.
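A minimal sketch of setting up such a test database (SingleStore speaks the MySQL protocol, so pymysql works as a client; the host, credentials, partition count, and query are placeholders):

```python
# Minimal sketch: create a test database with more partitions and
# time a query against it, both with and without the background
# writer running. Host, credentials and names are placeholders.
import time
import pymysql

conn = pymysql.connect(host="svc-xxxx.singlestore.com",  # placeholder
                       user="admin", password="...", port=3306)
cur = conn.cursor()

# If the current DB has 8 partitions, try 16 in the test DB.
cur.execute("CREATE DATABASE test_db PARTITIONS 16")

# ... load the same table and data into test_db, then time the query:
start = time.monotonic()
cur.execute("SELECT COUNT(*) FROM test_db.my_table")  # placeholder query
cur.fetchall()
print(f"query took {time.monotonic() - start:.3f}s")
```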
For general guidance on troubleshooting query performance, see this section of the docs: https://docs.singlestore.com/managed-service/en/query-data/query-procedures/troubleshooting-poorly-performing-queries.html
If you're an active customer, you can file a support ticket for additional analysis of the backend workings.

ADF Dataflows: do I have any control or influence over cluster startup time? (NOT "TTL")

Yes, I know about TTL; Yes, I'm configuring that; No, that's not what I'm asking about here.
Spinning up an initial cluster for a Dataflow takes around 5 minutes.
Acquiring compute from an existing "warm" cluster (i.e. one which has been kept alive using TTL) for a new dataflow still appears to take 1-2 minutes.
Those are pretty large numbers, especially if you have a multi-step ETL process, and have broken up your pipeline to separate concerns (or if you're executing the dataflows in a loop, to process data per-source-day)
Controlling the TTL gives me some control over which of those two possibilities I'm triggering, but even 2 minutes can be quite a substantial overhead. (I have a pipeline where fully half the execution time is spent waiting for those 1-2 minute 'Acquire Compute' startups.)
Do I have any control at all over how long startup takes in each case? Is there anything I can do to speed up the startup, or anything I should avoid to prevent making things even worse?
There's a new feature in town to fix exactly this problem.
Release blog:
https://techcommunity.microsoft.com/t5/azure-data-factory/how-to-startup-your-data-flows-execution-in-less-than-5-seconds/ba-p/2267365
ADF has added a new option in the Azure Integration Runtime for data flow TTL: Quick re-use. ... By selecting the re-use option with a TTL setting, you can direct ADF to maintain the Spark cluster for that period of time after your last data flow executes in a pipeline. This will provide much faster sequential executions using that same Azure IR in your data flow activities.
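For reference, the TTL and re-use flag live on the Azure Integration Runtime's data-flow properties, so they can also be set programmatically. A hedged sketch using the azure-mgmt-datafactory SDK (subscription, resource group, factory, and IR names are placeholders; as far as I can tell, cleanup=False corresponds to the quick re-use option):

```python
# Hedged sketch: configure a data-flow TTL with quick re-use on an
# Azure Integration Runtime. All names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    ManagedIntegrationRuntime,
    IntegrationRuntimeComputeProperties,
    IntegrationRuntimeDataFlowProperties,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

ir = ManagedIntegrationRuntime(
    compute_properties=IntegrationRuntimeComputeProperties(
        location="AutoResolve",
        data_flow_properties=IntegrationRuntimeDataFlowProperties(
            compute_type="General",
            core_count=8,
            time_to_live=10,  # minutes the cluster stays warm after a run
            cleanup=False,    # False = quick re-use: keep the cluster for the next run
        ),
    )
)

client.integration_runtimes.create_or_update(
    "<resource-group>", "<factory-name>", "<ir-name>",
    IntegrationRuntimeResource(properties=ir),
)
```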

Is it possible to limit the number of Oozie workflows running at the same time?

This is not clear to me from the docs. Here's our scenario and why we need this, as succinctly as I can put it:
We have 60 coordinators running, launching workflows usually hourly, some of which have sub-workflows (some multiple in parallel). This works out to around 40 workflows running at any given time. However, when the cluster is under load or some underlying service is slow (e.g. Impala or HBase), workflows will run longer than usual and back up, so we can end up with 80+ workflows (including sub-workflows) running.
This sometimes results in ALL workflows hanging indefinitely, because we have only enough memory and cores allocated to this pool that oozie can start the launcher jobs (i.e. oozie:launcher:T=sqoop:W=JobABC:A=sqoop-d596:ID=XYZ), but not their corresponding actions (i.e. oozie:action:T=sqoop:W=JobABC:A=sqoop-d596:ID=XYZ).
We could simply allocate enough resources to the pool to accommodate for these spikes, but that would be a massive waste (hundreds of cores and GBs that other pools/tenants could never use).
So I'm trying to enforce some limit on the number of workflows running, even if that means some will be running behind sometimes. BTW, all our coordinators are configured with execution=LAST_ONLY, and any delayed workflow will simply catch up fully on the next run. We are on CDH 5.13 with Oozie 4.1; pools are set up with the DRF scheduler.
Thanks in advance for your ideas.
AFAIK there is no configuration parameter that lets you control the number of workflows running at a given time.
If your coordinators are scheduled to run in approximately the same time window, you could consider collapsing them into just one coordinator/workflow and using the fork/join control nodes to control the degree of parallelism. You can distribute your actions across K queues in your workflow, which ensures that you never have more than K actions running at the same time, limiting the load on the cluster.
We use a script to automatically generate the fork queues inside the workflow and distribute the actions (of course, this only works for actions that can run in parallel, i.e. with no data dependencies); see the sketch below.
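A minimal sketch of such a generator (action names are placeholders and the real action bodies are elided; it round-robins N actions into K serial chains joined by a single fork/join pair):

```python
# Minimal sketch: distribute N actions across K parallel "queues"
# inside a single Oozie workflow using fork/join. Each queue is a
# chain of actions that runs serially; the K chains run in parallel.
# Action names are placeholders; a <kill name="kill"> node is assumed
# to exist elsewhere in the workflow.

def build_fork_join(actions, k):
    queues = [actions[i::k] for i in range(k)]  # round-robin split
    queues = [q for q in queues if q]           # drop empty queues
    xml = ['<fork name="fork-queues">']
    xml += [f'  <path start="{q[0]}"/>' for q in queues]
    xml.append('</fork>')
    for q in queues:
        for i, name in enumerate(q):
            nxt = q[i + 1] if i + 1 < len(q) else "join-queues"
            xml.append(f'<action name="{name}">')
            xml.append('  <!-- real action body (sqoop, shell, ...) goes here -->')
            xml.append(f'  <ok to="{nxt}"/>')
            xml.append('  <error to="kill"/>')
            xml.append('</action>')
    xml.append('<join name="join-queues" to="end"/>')
    return "\n".join(xml)

print(build_fork_join([f"job-{i}" for i in range(10)], k=4))
```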
Hope this helps

ETL in Amazon EC2 and utilization doubts

I have been using Pentaho for some time. I just have a basic question on ETL infrastructure. I need to run a job on a remote EC2 instance to extract data from multiple databases, say around 2000. I need a machine capable of doing this in EC2. This ETL EC2 instance will serve only as a processing point; the storage is on another host. Now I need to know which instance type I should go for in Amazon.
These ETL jobs will just have a select query and put the results into the table output.
No complex transformation and no sorting.
Are the ETL processes CPU intensive or memory intensive?
How do I decide whether an ETL process is CPU intensive, memory intensive, or I/O intensive?
I would say it is all up to you. I am using an m3.medium instance, which is perfectly fine for the amount of data in my database. If you have no problem with the amount of time it takes to execute the transformation, choose some small instance; otherwise go with a larger one.
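One way to answer the CPU/memory/I/O question empirically is to sample resource usage on the ETL host while a representative job runs, and see which resource saturates. A minimal sketch using psutil (the sampling interval and duration are arbitrary):

```python
# Minimal sketch: sample CPU, memory and disk I/O while an ETL job
# runs, to see which resource the job actually saturates.
import psutil

prev_io = psutil.disk_io_counters()
for _ in range(60):                       # sample for ~60 seconds
    cpu = psutil.cpu_percent(interval=1)  # % CPU over the last second
    mem = psutil.virtual_memory().percent
    io = psutil.disk_io_counters()
    read_mb = (io.read_bytes - prev_io.read_bytes) / 1e6
    write_mb = (io.write_bytes - prev_io.write_bytes) / 1e6
    prev_io = io
    print(f"cpu={cpu:5.1f}%  mem={mem:5.1f}%  "
          f"disk read={read_mb:7.1f} MB/s  write={write_mb:7.1f} MB/s")
```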

Alternative for batch job scheduling (in compute pool)

Since I don't have root rights on the machines in a compute pool, and thus cannot adapt the load parameters of atd for batch, I'm looking for an alternative way to do job scheduling. Since the machines are used by multiple users, it should be able to take the load into account. Optionally, I'm looking for a way to do this for all the machines in the pool, i.e. there is one central queue of jobs that need to be run, and a script that distributes them (over ssh) to the machines that are under a certain load. Any ideas?
First: go talk to the system administrators of the compute pool. Enterprise-wide job schedulers have become a rather common component in infrastructures these days. Typically, though, these schedulers do not take system load into account.
If the above doesn't lead to a good solution, you should carefully consider what load your jobs will impose on the machine: your jobs could stress the CPU, consume large amounts of memory, or generate lots of network or disk I/O activity. Consequently, determining whether your job should start may depend on a lot of measurements, some of which you would not be able to take as an ordinary user (depending a bit on the kind of OS you are running and how tight security is). In any case, you would only be able to take the load into account at the job's start-up. Obviously, if every user did this, you'd be back at square one in no time...
It might be a better idea to check with your system administrators whether they have some sort of resource controls in place (e.g. projects in Solaris) through which they can make sure your batches are not tearing down the nodes in the compute pool. Next, write your batch jobs in such a way that they can cope with the OS declining requests for resources.
EDIT: As for the distributed nature: queue up the jobs and have clients on all nodes point to the same queue, consuming as much as they can within the resource controls; see the sketch below.
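A minimal sketch of such a per-node worker, assuming a directory shared over NFS acts as the central queue and an atomic rename claims a job; the queue paths and load threshold are placeholders:

```python
# Minimal sketch: per-node worker that pulls job scripts from a shared
# queue directory and only runs one when the local load is acceptable.
# Paths and the load threshold are placeholders.
import os
import time
import subprocess

QUEUE_DIR = "/shared/jobqueue/pending"  # shared over NFS by all pool nodes
RUN_DIR = "/shared/jobqueue/running"    # same filesystem, so rename is atomic
MAX_LOAD = 4.0                          # 1-minute load average threshold

while True:
    load1, _, _ = os.getloadavg()
    if load1 < MAX_LOAD:
        jobs = sorted(os.listdir(QUEUE_DIR))
        if jobs:
            job = os.path.join(QUEUE_DIR, jobs[0])
            claimed = os.path.join(RUN_DIR, jobs[0] + "." + str(os.getpid()))
            try:
                os.rename(job, claimed)  # atomic rename claims the job
            except OSError:
                continue                 # another node claimed it first
            subprocess.run(["/bin/sh", claimed])
            os.remove(claimed)
            continue
    time.sleep(30)                       # back off when idle or loaded
```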