Why is the job allocating as many TMs as the default parallelism when starting - hadoop-yarn

If I set TM number = 6 with 8 slots and 4 GB of memory per TM, the job will allocate 48 TMs and use only one slot each, for a total memory usage of 192 GB. A few minutes later it will release 40 TMs and everything will be normal.
Is there any reason for this?

Since Flink 1.5, TaskManagers with multiple slots are not supported on YARN (using the default runtime mode). See the release notes for more details.
You can still use Flink with the legacy runtime by configuring mode: legacy.
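For example, switching back to the pre-1.5 runtime is a one-line flink-conf.yaml change (this is the mode key the release notes describe):
mode: legacy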

CodeBuild projects are not being queued when concurrent build is 1

We are using AWS CodeBuild along with GitHub webhooks to trigger a build process. When a PR is created for a branch that starts with a Jira ticket prefix, e.g. oscs-278, we build a new environment with Terraform. When we make commits to the PR it triggers the build process to update that environment.
This flow works well for us, especially since as of February 2021, AWS CodeBuild allows you to set concurrent builds to 1. This is important for us as we should only ever have one deployment at one time, the rest should be queued.
However, our current build process takes up to 15 minutes; if we commit to the branch within this time frame, the project is not queued while another build is in progress.
Is this likely to be an issue with the GitHub webhooks, or something to do with AWS CodeBuild?
From the AWS docs:
The maximum number of builds in a queue is five times the concurrent build limit.
So in theory, I should have 5 in the queue (maximum).
CodeBuild won't queue new builds if the number of currently running builds is at your limit (which is 1). Attempts to start more builds in this condition will fail with an error. The AWS Docs say:
If the build project has a concurrent build limit set, builds return an error if the number of running builds reaches the concurrent build limit for the project. For more information, see Enable concurrent build limit.
This applies for webhooks and attempts to start them manually. The same docs also say:
If the build project does not have a concurrent build limit set, builds are queued if the number of running builds reaches the concurrent build limit for the platform and compute type. The maximum number of builds in a queue is five times the concurrent build limit. For more information, see Quotas for AWS CodeBuild.
That section sort of hints that you can get queuing behavior if you reset your project concurrency limit to a high number (say, 60) and then set the "platform and compute type" concurrency limit to 1, but this isn't possible because that limit isn't user-adjustable (and it would probably apply across all projects).
In short, I don't think you can make CodeBuild queue builds after a configured concurrency limit is reached. A (rather complex) alternative is to do your own locking inside your buildspec.yml, as sketched below.
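As a sketch of that alternative, your buildspec.yml could call a small script that takes a mutex in a shared store before deploying. The example below uses a DynamoDB conditional write; the table name build-locks and its key schema are assumptions for illustration, not anything CodeBuild provides:

import time
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client('dynamodb')

def acquire_lock(lock_name='terraform-deploy', table='build-locks'):
    # Spin until the conditional put succeeds, i.e. no other build
    # currently holds the lock item.
    while True:
        try:
            dynamodb.put_item(
                TableName=table,
                Item={'LockName': {'S': lock_name}},
                ConditionExpression='attribute_not_exists(LockName)')
            return
        except ClientError as err:
            if err.response['Error']['Code'] != 'ConditionalCheckFailedException':
                raise
            time.sleep(30)  # another build holds the lock; wait and retry

def release_lock(lock_name='terraform-deploy', table='build-locks'):
    # Call this in a post_build phase (even on failure) so a broken
    # deployment does not wedge all subsequent builds.
    dynamodb.delete_item(TableName=table,
                         Key={'LockName': {'S': lock_name}})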

Dask Yarn failed to allocate number of workers

We have a CDH cluster (version 5.14.4) with 6 worker servers with a total of 384 vcores (64 cores per server).
We are running some ETL processes using dask version 2.8.1, dask-yarn version 0.8 with skein 0.8.
Currently we are having problems allocating the maximum number of workers.
We are not able to run a job with more than 18 workers! (We can see the actual number of workers in the dask dashboard.)
The definition of the cluster is as follows:
cluster = YarnCluster(environment='path/to/my/env.tar.gz',
                      n_workers=24,
                      worker_vcores=4,
                      worker_memory='64GB')
Even when increasing the number of workers to 50 nothing changes, although when changing the worker_vcores or worker_memory we can see the changes in the dashboard.
Any suggestions?
Update
Following jcrist's answer I realized that I didn't fully understand the terminology connecting the Yarn web UI application dashboard and the YarnCluster parameters.
From my understanding:
A Yarn container is equal to a dask worker.
Whenever a Yarn cluster is generated there are 2 additional workers/containers running (one for a scheduler and one for a logger, each with 1 vCore).
The relationship between n_workers * worker_vcores vs. n_workers * worker_memory is something I still need to fully grok.
There is another issue - while optimizing I tried using cluster.adapt(). The cluster was running with 10 workers, each with 10 threads and a limit of 100 GB, but the Yarn web UI showed only 2 containers running (my cluster has 384 vCores and 1.9 TB, so there is still plenty of room to expand). This is probably worth opening a different question.
There are many reasons why a job may be denied more containers. Do you have enough memory across your cluster to allocate that many 64 GiB chunks? Further, does 64 GiB tile evenly across your cluster nodes? Is your YARN cluster configured to allow jobs that large in this queue? Are there competing jobs that are also taking resources?
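To make the tiling question concrete: 24 workers at 64 GiB each is 24 × 64 GiB = 1536 GiB of worker memory, and YARN only places whole containers on a node. If each NodeManager advertised, say, 200 GiB to YARN (an assumed figure; check yarn.nodemanager.resource.memory-mb), only three 64 GiB containers would fit per node, capping a 6-node cluster at 18 workers no matter how large n_workers is.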
You can see the status of all containers using the ApplicationClient.get_containers method.
>>> cluster.application_client.get_containers()
You could filter on state REQUESTED to see just the pending containers
>>> cluster.application_client.get_containers(states=['REQUESTED'])
This should give you some insight as to what's been requested but not allocated.
If you suspect a bug in dask-yarn, feel free to file an issue (including logs from the application master for a problematic run), but I suspect this is more an issue with the size of containers you're requesting, and how your queue is configured/currently used.

Restoring Large State in Apache Flink Streaming Job

We have a cluster running Hadoop and YARN on AWS EMR with one core node and one master node, each with 4 vCores, 32 GB memory, and 32 GB disk. We only have one long-running YARN application, and within that, there are only one or two long-running Flink applications, each with a parallelism of 1. Checkpointing has a 10 minute interval with a minimum of 5 minutes between checkpoints. We use EventTime with a window of 10 minutes and a watermark duration of 15 seconds. The state is stored in S3 through the FsStateBackend with async snapshots enabled. Exactly-once checkpointing is enabled as well.
We have UUIDs set up for all operators but don't have HA set up for YARN or explicit max parallelism for the operators.
Currently, when restoring from a checkpoint (3 GB), the processing stalls at the windowing until an org.apache.flink.util.FlinkException: The assigned slot <container_id> was removed error is thrown during the next checkpoint. I have seen that all but the operator with the largest state (which is a ProcessFunction directly after the windowing) finish checkpointing.
I know it is strongly suggested to use RocksDB for production, but is that mandatory for a state that most likely won't exceed 50GB?
Where would be the best place to start addressing this problem? Parallelism?

Can VMs on Google Compute detect when they've been migrated?

Is it possible to notify an application running on a Google Compute VM when the VM migrates to different hardware?
I'm a developer for an application (HMMER) that makes heavy use of vector instructions (SSE/AVX/AVX-512). The version I'm working on probes its hardware at startup to determine which vector instructions are available and picks the best set.
We've been looking at running our program on Google Compute and other cloud engines, and one concern is that, if a VM migrates from one physical machine to another while running our program, the new machine might support different instructions, causing our program to either crash or execute more slowly than it could.
Is there a way to notify applications running on a Google Compute VM when the VM migrates? The only relevant information I've found is that you can set a VM to perform a shutdown/reboot sequence when it migrates, which would kill any currently-executing programs but would at least let the user know that they needed to restart the program.
We ensure that your VM instances never live migrate between physical machines in a way that would cause your programs to crash the way you describe.
However, for your use case you probably want to specify a minimum CPU platform version. You can use this to ensure that e.g. your instance has the new Skylake AVX instructions available. See the documentation on Specifying the Minimum CPU Platform for further details.
As per the Live Migration docs:
Live migration does not change any attributes or properties of the VM
itself. The live migration process just transfers a running VM from
one host machine to another. All VM properties and attributes remain
unchanged, including things like internal and external IP addresses,
instance metadata, block storage data and volumes, OS and application
state, network settings, network connections, and so on.
Google does provide a few controls for setting instance availability policies, which also let you control aspects of live migration. The docs also mention what you can look for to determine when live migration has taken place.
Live migrate
By default, standard instances are set to live migrate, where Google
Compute Engine automatically migrates your instance away from an
infrastructure maintenance event, and your instance remains running
during the migration. Your instance might experience a short period of
decreased performance, although generally most instances should not
notice any difference. This is ideal for instances that require
constant uptime, and can tolerate a short period of decreased
performance.
When Google Compute Engine migrates your instance, it reports a system
event that is published to the list of zone operations. You can review
this event by performing a gcloud compute operations list --zones ZONE
request or by viewing the list of operations in the Google Cloud
Platform Console, or through an API request. The event will appear
with the following text:
compute.instances.migrateOnHostMaintenance
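For example, to scan a zone's recent operations for that event (the zone name is illustrative):
gcloud compute operations list --zones us-central1-a | grep migrateOnHostMaintenance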
In addition, you can detect directly on the VM when a maintenance event is about to happen.
Getting Live Migration Notices
The metadata server provides information about an instance's
scheduling options and settings, through the scheduling/
directory and the maintenance-event attribute. You can use these
attributes to learn about a virtual machine instance's scheduling
options, and use this metadata to notify you when a maintenance event
is about to happen through the maintenance-event attribute. By
default, all virtual machine instances are set to live migrate so the
metadata server will receive maintenance event notices before a VM
instance is live migrated. If you opted to have your VM instance
terminated during maintenance, then Compute Engine will automatically
terminate and optionally restart your VM instance if the
automaticRestart attribute is set. To learn more about maintenance
events and instance behavior during the events, read about scheduling
options and settings.
You can learn when a maintenance event will happen by querying the
maintenance-event attribute periodically. The value of this
attribute will change 60 seconds before a maintenance event starts,
giving your application code a way to trigger any tasks you want to
perform prior to a maintenance event, such as backing up data or
updating logs. Compute Engine also offers a sample Python script
to demonstrate how to check for maintenance event notices.
You can use the maintenance-event attribute with the waiting for
updates feature to notify your scripts and applications when a
maintenance event is about to start and end. This lets you automate
any actions that you might want to run before or after the event. The
following Python sample provides an example of how you might implement
these two features together.
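A minimal sketch of that polling loop in Python, using the documented maintenance-event metadata path and its wait-for-change query parameters (the 503 retry is an assumption about transient metadata-server errors):

import requests

METADATA_URL = ('http://metadata.google.internal/computeMetadata/'
                'v1/instance/maintenance-event')

def wait_for_maintenance_event():
    # Long-poll the metadata server; the body changes from 'NONE'
    # about 60 seconds before a maintenance event starts.
    last_etag = '0'
    while True:
        resp = requests.get(
            METADATA_URL,
            params={'wait_for_change': 'true', 'last_etag': last_etag},
            headers={'Metadata-Flavor': 'Google'})
        if resp.status_code == 503:  # metadata server briefly unavailable
            continue
        resp.raise_for_status()
        last_etag = resp.headers.get('etag', last_etag)
        if resp.text != 'NONE':
            return resp.text

print('Maintenance event pending:', wait_for_maintenance_event())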
You can also choose to terminate and optionally restart your instance.
Terminate and (optionally) restart
If you do not want your instance to live migrate, you can choose to
terminate and optionally restart your instance. With this option,
Google Compute Engine will signal your instance to shut down, wait for
a short period of time for your instance to shut down cleanly,
terminate the instance, and restart it away from the maintenance
event. This option is ideal for instances that demand constant,
maximum performance, and your overall application is built to handle
instance failures or reboots.
Look at the Setting availability policies section for more details on how to configure this.
If you use an instance with a GPU or a preemptible instance, be aware that live migration is not supported:
Live migration and GPUs
Instances with GPUs attached cannot be live migrated. They must be set
to terminate and optionally restart. Compute Engine offers a 60 minute
notice before a VM instance with a GPU attached is terminated. To
learn more about these maintenance event notices, read Getting live
migration notices.
To learn more about handling host maintenance with GPUs, read
Handling host maintenance on the GPUs documentation.
Live migration for preemptible instances
You cannot configure a preemptible instance to live migrate. The
maintenance behavior for preemptible instances is always set to
TERMINATE by default, and you cannot change this option. It is also
not possible to set the automatic restart option for preemptible
instances.
As Ramesh mentioned, you can specify the minimum CPU platform to ensure you are only migrated to a host that supports at least the minimum CPU platform you specified. At a high level, when you specify a minimum CPU platform:
Compute Engine always uses the minimum CPU platform where available.
If the minimum CPU platform is not available, or the minimum CPU platform is older than the zone default and a newer CPU platform is available for the same price, Compute Engine uses the newer platform.
If the minimum CPU platform is not available in the specified zone and there are no newer platforms available without extra cost, the server returns a 400 error indicating that the CPU is unavailable.
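For example, at instance creation time (instance and zone names are illustrative):
gcloud compute instances create my-instance --zone us-central1-a --min-cpu-platform "Intel Skylake"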

How to limit resource usage of a job in YARN

I remember that some recent version of YARN has a configuration parameter which controls the amount of memory (or cores) a job can use. I tried to find it on the Web but couldn't. If you know the parameter, please let me know.
I know one way to go about this is to use some kind of scheduler but for now I need a job level control so that the job doesn't abuse the entire system.
Thanks!
You can control the maximum and minimum resources that are allocated to each container.
yarn.scheduler.minimum-allocation-mb: Minimum memory allocation for each container
yarn.scheduler.maximum-allocation-mb: Maximum memory allocation for each container
yarn.scheduler.minimum-allocation-vcores: Minimum core allocation for each container
yarn.scheduler.maximum-allocation-vcores: Maximum core allocation for each container
If you want to prevent user jobs from abusing the cluster, yarn.scheduler.maximum-allocation-* can be the solution, because the RM refuses any request above these limits by throwing an InvalidResourceRequestException.
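For example, in yarn-site.xml (the 8 GiB / 4-core ceilings here are illustrative values, not defaults):
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>8192</value> <!-- largest memory grant per container: 8 GiB -->
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>4</value> <!-- largest vcore grant per container -->
</property>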
ref: yarn-default.xml