Restoring Large State in Apache Flink Streaming Job - hadoop-yarn

We have a cluster running Hadoop and YARN on AWS EMR with one core and one master, each with 4 vCores, 32 GB mem, 32 GB disc. We only have one long-running YARN application, and within that, there are only one or two long-running Flink applications, each with a parallelism of 1. Checkpointing has a 10 minute interval with a minimum of 5 minutes between. We use EventTime with a window of 10 minutes and a watermark duration of 15 seconds. The state is stored in S3 through the FsStateBackend with async snapshots enabled. Exactly-Once checkpointing is enabled as well.
We have UUIDs set up for all operators but don't have HA set up for YARN or explicit max parallelism for the operators.
Currently, when restoring from a checkpoint (3GB) the processing holds at the windowing until a org.apache.flink.util.FlinkException: The assigned slot <container_id> was removed error is thrown during the next checkpoint. I have seen that all but the operator with the largest state (which is a ProcessFunction directly after the windowing), finish checkpointing.
I know it is strongly suggested to use RocksDB for production, but is that mandatory for a state that most likely won't exceed 50GB?
Where would be the best place to start addressing this problem? Parallelism?

Related

Would redis ever hit loadavg of 4 in a four core machine?

I am running a single instance of Redis in a 4 core machine. Is it correct to assume that the loadavg as seen via /proc/loadavg will be max 1 when the redis is at max capacity.
Max capacity defined as the max throughput that Redis can support.
This assumption stems from the fact that redis is single threaded and that the process will be attached to only one core.
Is there anything wrong with the assumption above? Will OS distribute the load to multiple cores in case one is loaded?
Checked with Redis maintainers. Redis is single threaded and will not hit a loadavg of 4 in a 4 core machine since that means that redis CPU is consuming CPU time from all the 4 cores which won't happen since the costly event loop operations happen on a single thread.

Efficient way to take hot snapshots from redis in production?

We have redis cluster which holds more than 2 million and these keys has been updated with the time interval of 1 minute. Now we have a requirement to take the snapshot of the redis db in a particular interval For eg every 10 minute. This snapshot should not pause the redis command execution.
Is there any async way of taking snapshot from redis ?
It would be really helpful if we get any suggestion on open source tools or frameworks.
The Redis BGSAVE is async and takes a snapshot.
It calls the fork() function of the OS. According to the Redis manual,
Fork() can be time consuming if the dataset is big, and may result in Redis to stop serving clients for some millisecond or even for one second if the dataset is very big and the CPU performance not great
Two million updates in one minutes, that is 30K+ QPS.
So you really have to try it out, run the benchmark that similutes your business, then issue BGSAVE, monitor the I/O and CPU usage of your system, and see if there's a spike in your redis calling latency.
Then issue LASTSAVE, which will tell you when your last success snapshot happened. So you can adjust your backup schedule.

Dask Yarn failed to allocate number of workers

We have a CDH cluster (version 5.14.4) with 6 worker servers with a total of 384 vcores (64 cores per server).
We are running some ETL processes using dask version 2.8.1, dask-yarn version 0.8 with skein 0.8 .
Currently we are having problem allocating the maximum number of workers .
We are not able to run a job with more the 18 workers! (we can see the actual number of workers in the dask dashboad.
The definition of the cluster is as follows:
cluster = YarnCluster(environment = 'path/to/my/env.tar.gz',
n_workers = 24,
worker_vcores = 4,
worker_memory= '64GB'
)
Even when increasing the number of workers to 50 nothing changes, although when changing the worker_vcores or worker_memory we can see the changes in the dashboard.
Any suggestions?
update
Following #jcrist answer I realized that I didn't fully understand the termenology between the Yarn web UI application dashboard and the Yarn Cluster parameters.
From my understanding:
a Yarn Container is equal to a dask worker.
When ever a Yarn cluster is generated there are 2 additional workers/containers that are running (one for a Schedualer and one for a logger - each with 1 vCore)
The limitation between the n_workers * worker_vcores vs. n_workers * worker_memory that I still need fully grok.
There is another issue - while optemizing I tried using cluster.adapt(). The cluster was running with 10 workers each with 10 ntrheads with a limit of 100GB but in the Yarn web UI there was only displayed 2 conteiners running (my cluster has 384 vCorres and 1.9TB so there is still plenty of room to expand). probably worth to open a different question.
There are many reasons why a job may be denied more containers. Do you have enough memory across your cluster to allocate that many 64 GiB chunks? Further, does 64 GiB tile evenly across your cluster nodes? Is your YARN cluster configured to allow jobs that large in this queue? Are there competing jobs that are also taking resources?
You can see the status of all containers using the ApplicationClient.get_containers method.
>>> cluster.application_client.get_containers()
You could filter on state REQUESTED to see just the pending containers
>>> cluster.application_client.get_containers(states=['REQUESTED'])
this should give you some insight as to what's been requested but not allocated.
If you suspect a bug in dask-yarn, feel free to file an issue (including logs from the application master for a problematic run), but I suspect this is more an issue with the size of containers you're requesting, and how your queue is configured/currently used.

Avoid redis processes to make RDB snapshots on the same time

Imagine setup of Redis Cluster for example, or just usual sharded setup, where we have N > 1 Redis processes per physical node. All our processes have same redis.conf and enabled SAVE options there with same SAVE period. So, if all our main Redis processes started on the same time - all of them will start SAVE on the same time or around it.
When we have 9 Redis processes and all of them start RDB snapshotting on the same time it:
Affects performance, because we make 9 forked processes that start consume CPU and do IO on the same time.
Requires too much reserved additional memory that can't be used as actual storage, because on write-heavy application Redis may use up to 2x the memory normally used during snapshotting. So... if we want to have redis processes for 100Gb on this node - we should take additional 100Gb for forking all processes on the same time to be safe.
Is there any best practice to modify this setup and make Redis processes start saving one by one or at least with some randomization?
I have only one idea with disabling schedule in redis.conf and write cron script that will start save one by one with time lag. But this solution looks like a hack and it should be some other practices here.

Infinispan Distributed Cache Issues

I am new to Infinispan. We have setup a Infinispan cluster so that we can make use of the Distributed Cache for our CPU and memory intensive task. We are using UDP as the communication medium and Infinispan MapReduce for distributed processing. The problem we are using facing is with the throughput. When we run the program on a single node machine, the total program completes in around 11 minutes i.e. processing a few hundred thousand records and finally emitting around 400 000 records.
However, when we use a cluster for the same dataset, we are seeing a throughput of only 200 records per second being updated between the nodes in the cluster, which impacts the overall processing time. Not sure which configuration changes are impacting the throughput so badly. I am sure its something to do with the configuration of either JGroups or Infinispan.
How can this be improved?