ADF Dataflows; Do I have any control or influence over cluster startup time. (NOT "TTL") - azure-data-factory-2

Yes, I know about TTL; Yes, I'm configuring that; No, that's not what I'm asking about here.
Spinning up an initial cluster for a Dataflow takes around 5 minutes.
Starting acquiring compute from an existing "warm" cluster (i.e. one which has been left 'Alive' using TTL), for a new dataflow still appears to take 1-2 minutes.
Those are pretty large numbers, especially if you have a multi-step ETL process, and have broken up your pipeline to separate concerns (or if you're executing the dataflows in a loop, to process data per-source-day)
Controlling the TTL gives me some control over which of those two possibilities I'm triggering, but even 2 minutes can be a quite substantial overhead. (I have a pipeline where fully half the execution time is waiting for those 1-2 minute 'Acquire Compute' startups)
Do I have any control at all, over how long startup takes in each case? Is there anything that I can do to speed up the startup, or anything that I should avoid to prevent making things even worse!

There's a new feature in town, to fix exactly this problem.
Release blog:
https://techcommunity.microsoft.com/t5/azure-data-factory/how-to-startup-your-data-flows-execution-in-less-than-5-seconds/ba-p/2267365
ADF has added a new option in the Azure Integration Runtime for data flow TTL: Quick re-use. ... By selecting the re-use option with a TTL setting, you can direct ADF to maintain the Spark cluster for that period of time after your last data flow executes in a pipeline. This will provide much faster sequential executions using that same Azure IR in your data flow activities.

Related

Matillion: How to identify performance bottleneck

We're running Matillion (v1.54) on an AWS EC2 instance (CentOS), based on Tomcat 8.5.
We have developped a few ETL jobs by now, and their execution takes quite a lot of time (that is, up to hours). We'd like to speed up the execution of our jobs, and I wonder how to identify the bottle neck.
What confuses me is that both the m5.2xlarge EC2 instance (8 vCPU, 32G RAM) and the database (Snowflake) don't get very busy and seem to be sort of idle most of the time (regarding CPU and RAM usage as shown by top).
Our environment is configured to use up to 16 parallel connections.
We also added JVM options -Xms20g -Xmx30g to /etc/sysconfig/tomcat8 to make sure the JVM gets enough RAM allocated.
Our Matillion jobs do transformations and loads into a lot of tables, most of which can (and should) be done in parallel. Still we see, that most of the tasks are processed in sequence.
How can we enhance this?
By default there is only one JDBC connection to Snowflake, so your transformation jobs might be getting forced serial for that reason.
You could try bumping up the number of concurrent connections under the Edit Environment dialog, like this:
There is more information here about concurrent connections.
If you do that, a couple of things to avoid are:
Transactions (begin, commit etc) will force transformation jobs to
run in serial again
If you have a parameterized transformation job,
only one instance of it can ever be running at a time. More information on that subject is here
Because the Matillion server is just generating SQL statements and running them in Snowflake, the Matillion server is not likely to be the bottleneck. You should make sure that your orchestration jobs are submitting everything to Snowflake at the same time and there are no dependencies (unless required) built into your flow.
These steps will be done in sequence:
These steps will be done in parallel (and will depend on Snowflake warehouse size to scale):
Also - try the Alter Warehouse Component with a higher concurrency level

Mule is taking a long time for the simple select for the first execution

I am just using a HTTP listener and Select in mule flow. It is a get method, passing ID as an input, and the same ID is passed to select (input). It is taking 3 to 4 minutes of delay when we execute via mule for the first time, but in DB, it took only millisecond.
This delay only happens after adding the parameter in the select.
Someone help me, why there is a delay for the first time and how to resolve it?
Possible cause could be how you create Metadata. For example you use huge CSV file as example for your data structure. Mule reads whole file to have headers. It takes time.
Solution - if you create Metadata by example - use small examples with couple rows of data.
Usually the main points that cause performance issues in first executions are:
JVM and Mule Runtime warmup
Time to establish connections
The first one can not be avoided. For the second one usually a connection pool is used to mitigate it somewhat. Having said that 4 minutes is a very excessive time for either of those. You need to do some performance analysis, adding logs before and after operation in the flow, enabling debug logs for the database connector and even using a Java profiler connected to the Mule JVM to understand what could be happening.
You also have to consider if there is a high number of records that need to be processed, even if the database can answer quickly, it might take some time to format.

Microservices Architecture for highly frequent data access; in memory solutions?

let us define the following use case:
There has to be a simulation task fulfilled, which involves an iteration/simulation over [day1, day2, ..., dayN]. Every step of the iteration depends on the prior step, so the order is predefined.
The task has a state represented by Object1, this object is going to be changed within every step of the iteration.
The step of an iteration involves 2 different tasks: Task1 and Task2.
To fulfill Task1, data from Database1 is required.
For Task2 to be fulfilled, also external data is needed from a different database, namely Database2.
After Task1 has finished, Task2 needs to be applied.
Task1 and also Task2 needs to access Object1
After both tasks are done, the state of Object1 changes and one iteration step has finished.
This iteration/simulation task involves on average 10,000 iteration steps. And on average 100 iteration/simulation tasks need to be performed concurrently, started by several enduser.
Now we discuss a microservice architecture for the problem, due to the needed scalability of the application in production. Also for development purpose this is crucial, because Task1 and Task2 are recently added new features/parameters and scale differently in development.
So, to avoid the network bottleneck here, involving the constant
database access in every iteration and also the send data between
Task1 and Task2, what would be an appropriate system architecture to
this problem?
Should there be at least two different services for Task1 and
Task2 and maybe even one for the actual iteration/simulation state
control? Can someone maybe tell us a little bit more about the use of
an in memory data grid solution like hazlecast or only in-memory
database like redis for this problem?
The main question here is what are the arguments for a microservice
architecture due to probably communication/network bottleneck? The
only way to speed this up is to spawn all needed data for the
simulation task in memory and keep it there the whole time, to avoid
the network bottleneck?
Thanks for your answers and valuable input on this.
(This question is not about inter service communication, like messaging or REST http (pub/sub or req/resp), both could apply highly network load for this task.)
Now we discuss a microservice architecture for the problem, due to the needed scalability of the application in production. Also for development purpose this is crucial, because Task1 and Task2 are recently added new features/parameters and scale differently in development.
This is exactly what a stream processing platform is doing good. I recommend to use a system like Apache Kafka or Apache Pulsar for this problem.
Should there be at least two different services for Task1 and Task2 and maybe even one for the actual iteration/simulation state control?
Task1 and Task2 is what is called stream processors, they read (subscribe to) one topic, doing some operations/transformations and write (publishes) to another topic.
The main question here is what are the arguments for a microservice architecture due to probably communication/network bottleneck? The only way to speed this up is to spawn all needed data for the simulation task in memory and keep it there the whole time, to avoid the network bottleneck?
Again, this is exactly the problem that a system like Apache Kafka or Apache Pulsar is doing good. To scale writes and reads in a stream processing system, you can partition your topics.
With Hazelcast, you get the best of both worlds - data storage (cache in Hazelcast cluster) and compute/processing. Within the same Hazelcast cluster, you can create caches using Hazelcast data structures and load them with the data from database (pre-load warmup or on-demand loading of data in cache). Then you execute your tasks within the cluster using Hazelcast Jet APIs. This way, your tasks will have access to the data previously loaded into the cluster and the advantage - data is at nearest possible location to your tasks, therefore extremely low latency for tasks execution.
Another benefit of Jet - since Jet is a DAG implementation, you can connect multiple tasks with each other in direction that you like. For example, Task1 can input into Task2, Task2 can input into Task3, Task3 can input into Task1 and 2 both, and so on etc. This gives you full control over full job execution that may entail multiple tasks at different stages. Jet provides both Stream and Batch processing of tasks, with same flexibility in designing and execution of your jobs.
You may find it problematic to use Kafka for tasks execution if used outside of Kafka ecosystem. Jet is highly flexible and can be connected to any source/sink, including Kafka.

Bigquery streaming inserts taking time

During load testing of our module we found that bigquery insert calls are taking time (3-4 s). I am not sure if this is ok. We are using java biguqery client libarary and on an average we push 500 records per api call. We are expecting a million records per second traffic to our module so bigquery inserts are bottleneck to handle this traffic. Currently it is taking hours to push data.
Let me know if we need more info regarding code or scenario or anything.
Thanks
Pankaj
Since streaming has a limited payload size, see Quota policy it's easier to talk about times, as the payload is limited in the same way to both of us, but I will mention other side effects too.
We measure between 1200-2500 ms for each streaming request, and this was consistent over the last month as you can see in the chart.
We seen several side effects although:
the request randomly fails with type 'Backend error'
the request randomly fails with type 'Connection error'
the request randomly fails with type 'timeout' (watch out here, as only some rows are failing and not the whole payload)
some other error messages are non descriptive, and they are so vague that they don't help you, just retry.
we see hundreds of such failures each day, so they are pretty much constant, and not related to Cloud health.
For all these we opened cases in paid Google Enterprise Support, but unfortunately they didn't resolved it. It seams the recommended option to take for these is an exponential-backoff with retry, even the support told to do so. Which personally doesn't make me happy.
The approach you've chosen if takes hours that means it does not scale, and won't scale. You need to rethink the approach with async processes. In order to finish sooner, you need to run in parallel multiple workers, the streaming performance will be the same. Just having 10 workers in parallel it means time will be 10 times less.
Processing in background IO bound or cpu bound tasks is now a common practice in most web applications. There's plenty of software to help build background jobs, some based on a messaging system like Beanstalkd.
Basically, you needed to distribute insert jobs across a closed network, to prioritize them, and consume(run) them. Well, that's exactly what Beanstalkd provides.
Beanstalkd gives the possibility to organize jobs in tubes, each tube corresponding to a job type.
You need an API/producer which can put jobs on a tube, let's say a json representation of the row. This was a killer feature for our use case. So we have an API which gets the rows, and places them on tube, this takes just a few milliseconds, so you could achieve fast response time.
On the other part, you have now a bunch of jobs on some tubes. You need an agent. An agent/consumer can reserve a job.
It helps you also with job management and retries: When a job is successfully processed, a consumer can delete the job from the tube. In the case of failure, the consumer can bury the job. This job will not be pushed back to the tube, but will be available for further inspection.
A consumer can release a job, Beanstalkd will push this job back in the tube, and make it available for another client.
Beanstalkd clients can be found in most common languages, a web interface can be useful for debugging.

Does Redis persist data?

I understand that Redis serves all data from memory, but does it persist as well across server reboot so that when the server reboots it reads into memory all the data from disk. Or is it always a blank store which is only to store data while apps are running with no persistence?
I suggest you read about this on http://redis.io/topics/persistence . Basically you lose the guaranteed persistence when you increase performance by using only in-memory storing. Imagine a scenario where you INSERT into memory, but before it gets persisted to disk lose power. There will be data loss.
Redis supports so-called "snapshots". This means that it will do a complete copy of whats in memory at some points in time (e.g. every full hour). When you lose power between two snapshots, you will lose the data from the time between the last snapshot and the crash (doesn't have to be a power outage..). Redis trades data safety versus performance, like most NoSQL-DBs do.
Most NoSQL-databases follow a concept of replication among multiple nodes to minimize this risk. Redis is considered more a speedy cache instead of a database that guarantees data consistency. Therefore its use cases typically differ from those of real databases:
You can, for example, store sessions, performance counters or whatever in it with unmatched performance and no real loss in case of a crash. But processing orders/purchase histories and so on is considered a job for traditional databases.
Redis server saves all its data to HDD from time to time, thus providing some level of persistence.
It saves data in one of the following cases:
automatically from time to time
when you manually call BGSAVE command
when redis is shutting down
But data in redis is not really persistent, because:
crash of redis process means losing all changes since last save
BGSAVE operation can only be performed if you have enough free RAM (the amount of extra RAM is equal to the size of redis DB)
N.B.: BGSAVE RAM requirement is a real problem, because redis continues to work up until there is no more RAM to run in, but it stops saving data to HDD much earlier (at approx. 50% of RAM).
For more information see Redis Persistence.
It is a matter of configuration. You can have none, partial or full persistence of your data on Redis. The best decision will be driven by the project's technical and business needs.
According to the Redis documentation about persistence you can set up your instance to save data into disk from time to time or on each query, in a nutshell. They provide two strategies/methods AOF and RDB (read the documentation to see details about then), you can use each one alone or together.
If you want a "SQL like persistence", they have said:
The general indication is that you should use both persistence methods if you want a degree of data safety comparable to what PostgreSQL can provide you.
The answer is generally yes, however a fuller answer really depends on what type of data you're trying to store. In general, the more complete short answer is:
Redis isn't the best fit for persistent storage as it's mainly performance focused
Redis is really more suitable for reliable in-memory storage/cacheing of current state data, particularly for allowing scalability by providing a central source for data used across multiple clients/servers
Having said this, by default Redis will persist data snapshots at a periodic interval (apparently this is every 1 minute, but I haven't verified this - this is described by the article below, which is a good basic intro):
http://qnimate.com/redis-permanent-storage/
TL;DR
From the official docs:
RDB persistence [the default] performs point-in-time snapshots of your dataset at specified intervals.
AOF persistence [needs to be explicitly configured] logs every write operation received by the server, that will be played again at server startup, reconstructing the
original dataset.
Redis must be explicitly configured for AOF persistence, if this is required, and this will result in a performance penalty as well as growing logs. It may suffice for relatively reliable persistence of a limited amount of data flow.
You can choose no persistence at all.Better performance but all the data lose when Redis shutting down.
Redis has two persistence mechanisms: RDB and AOF.RDB uses a scheduler global snapshooting and AOF writes update to an apappend-only log file similar to MySql.
You can use one of them or both.When Redis reboots,it constructes data from reading the RDB file or AOF file.
All the answers in this thread are talking about the possibility of redis to persist the data: https://redis.io/topics/persistence (Using AOF + after every write (change)).
It's a great link to get you started, but it is defenently not showing you the full picture.
Can/Should You Really Persist Unrecoverable Data/State On Redis?
Redis docs does not talk about:
Which redis providers support this (AOF + after every write) option:
Almost none of them - redis labs on the cloud does NOT provide this option. You may need to buy the on-premise version of redis-labs to support it. As not all companies are willing to go on-premise, then they will have a problem.
Other Redis Providers does not specify if they support this option at all. AWS Cache, Aiven,...
AOF + after every write - This option is slow. you will have to test it your self on your production hardware to see if it fits your requirements.
Redis enterpice provide this option and from this link: https://redislabs.com/blog/your-cloud-cant-do-that-0-5m-ops-acid-1msec-latency/ let's see some banchmarks:
1x x1.16xlarge instance on AWS - They could not achieve less than 2ms latency:
where latency was measured from the time the first byte of the request arrived at the cluster until the first byte of the ‘write’ response was sent back to the client
They had additional banchmarking on a much better harddisk (Dell-EMC VMAX) which results < 1ms operation latency (!!) and from 70K ops/sec (write intensive test) to 660K ops/sec (read intensive test). Pretty impresive!!!
But it defenetly required a (very) skilled devops to help you create this infrastructure and maintain it over time.
One could (falsy) argue that if you have a cluster of redis nodes (with replicas), now you have full persistency. this is false claim:
All DBs (sql,non-sql,redis,...) have the same problem - For example, running set x 1 on node1, how much time it takes for this (or any) change to be made in all the other nodes. So additional reads will receive the same output. well, it depends on alot of fuctors and configurations.
It is a nightmare to deal with inconsistency of a value of a key in multiple nodes (any DB type). You can read more about it from Redis Author (antirez): http://antirez.com/news/66. Here is a short example of the actual ngihtmare of storing a state in redis (+ a solution - WAIT command to know how much other redis nodes received the latest change change):
def save_payment(payment_id)
redis.rpush(payment_id,”in progress”) # Return false on exception
if redis.wait(3,1000) >= 3 then
redis.rpush(payment_id,”confirmed”) # Return false on exception
if redis.wait(3,1000) >= 3 then
return true
else
redis.rpush(payment_id,”cancelled”)
return false
end
else
return false
end
The above example is not suffeint and has a real problem of knowing in advance how much nodes there actually are (and alive) at every moment.
Other DBs will have the same problem as well. Maybe they have better APIs but the problem still exists.
As far as I know, alot of applications are not even aware of this problem.
All in all, picking more dbs nodes is not a one click configuration. It involves alot more.
To conclude this research, what to do depends on:
How much devs your team has (so this task won't slow you down)?
Do you have a skilled devops?
What is the distributed-system skills in your team?
Money to buy hardware?
Time to invest in the solution?
And probably more...
Many Not well-informed and relatively new users think that Redis is a cache only and NOT an ideal choice for Reliable Persistence.
The reality is that the lines between DB, Cache (and many more types) are blurred nowadays.
It's all configurable and as users/engineers we have choices to configure it as a cache, as a DB (and even as a hybrid).
Each choice comes with benefits and costs. And this is NOT an exception for Redis but all well-known Distributed systems provide options to configure different aspects (Persistence, Availability, Consistency, etc). So, if you configure Redis in default mode hoping that it will magically give you highly reliable persistence then it's team/engineer fault (and NOT that of Redis).
I have discussed these aspects in more detail on my blog here.
Also, here is a link from Redis itself.