Cluster initialization taking a long time and sometimes throwing errors in Azure Databricks - jobs

I have scheduled a job to run at a specific time in Azure Databricks, but I am getting a cluster error while it is initializing.

This is due to the VM quota limit on your subscription. You can resolve this issue by requesting a quota increase.
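To confirm it is a quota issue before filing the increase request, you could compare current vCPU usage against the regional limits with the Azure CLI. A minimal sketch (the region and VM family below are placeholders, not from the question):

```shell
# List vCPU usage vs. quota limits for the region your Databricks
# workspace runs in (replace eastus with your region).
az vm list-usage --location eastus --output table

# Narrow it down to the VM family your cluster node type belongs to,
# e.g. a DSv3-based node type:
az vm list-usage --location eastus \
    --query "[?contains(name.value, 'DSv3')]" --output table
```

If the "CurrentValue" for the relevant family is at or near the "Limit", the cluster cannot acquire its driver and worker VMs and initialization will stall or fail.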


Apache Ignite: Getting affinity for too old topology version that is already out of history (try to increase 'IGNITE_AFFINITY_HISTORY_SIZE')

I am getting this exception intermittently while trying to run co-located join queries on cached data. Below are some specifics of the environment and how the caches are initialized.
Running embedded with a spring boot application
Deployed in Kubernetes environment with TcpDiscoveryJdbcIpFinder
Running on 3+ nodes
The caches are created dynamically using BinaryObjects and QueryEntity
The affinity keys are forced to be a static value using AffinityKeyMapper (for the same group of data)
I am getting "Getting affinity for too old topology version that is already out of history (try to increase 'IGNITE_AFFINITY_HISTORY_SIZE')" sporadically. Sometimes this happens continuously for a few minutes; sometimes it works on a second or third try, and sometimes we don't see this error for hours. I already increased IGNITE_AFFINITY_HISTORY_SIZE to 100000 and we are still getting this message.
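One thing worth double-checking: IGNITE_AFFINITY_HISTORY_SIZE is read as a JVM system property (or environment variable), so it has to reach the JVM of every embedded node, not just one of them. A minimal sketch of passing it at launch, assuming a Spring Boot fat jar (the jar name is a placeholder; the 100000 value mirrors the question):

```shell
# Append the Ignite affinity-history setting to the JVM options used to
# launch the Spring Boot app that embeds the Ignite node.
JAVA_OPTS="${JAVA_OPTS:-} -DIGNITE_AFFINITY_HISTORY_SIZE=100000"
echo "$JAVA_OPTS"

# Launch the application with those options (jar name is a placeholder):
# java $JAVA_OPTS -jar app.jar
```

In a Kubernetes deployment this would go into the container spec (e.g. via JAVA_OPTS or JAVA_TOOL_OPTIONS) so that all 3+ nodes pick it up consistently.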

How to debug aws fargate task running out of memory?

I'm running a Fargate task with CPU set to 2048 and memory set to 8192. After running for some time, the task is stopped with the error:
container was stopped as it ran out of memory.
The thing is, the task does not fail every time. If I run the same task 10 times, it fails 5 times and works 5 times. However, if I take an EC2 machine with 2 vCPUs and 4 GB of memory and run the same container there, it runs successfully (in fact, the memory usage on the EC2 instance is very low).
Can somebody please guide me on how to figure out the memory issue when running a Fargate task?
Thanks
The way to start would be enabling memory metrics from Container Insights for your Fargate tasks, and then correlating the memory-usage graph with your application logs.
The difference between running on EC2 vs. Fargate could be due to the fact that when you run a container on ECS Fargate, it runs on AWS's internal EC2 instances. A noisy-neighbour situation could possibly arise there, although the chances would be pretty low.
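If Container Insights is not yet enabled for the cluster, one way to switch it on is via the AWS CLI (the cluster name below is a placeholder). Once enabled, the MemoryUtilized and MemoryReserved metrics appear in CloudWatch per task:

```shell
# Enable CloudWatch Container Insights on an existing ECS cluster so that
# per-task memory metrics (MemoryUtilized / MemoryReserved) are recorded.
aws ecs update-cluster-settings \
    --cluster my-fargate-cluster \
    --settings name=containerInsights,value=enabled
```

Comparing MemoryUtilized against the task's 8192 MiB limit right before the stops should show whether the application genuinely spikes or the limit is hit by something else in the task (sidecars, logging agents, etc.).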

Using TPU on GKE: Error recorded from infeed: Socket closed

Once in a while, our TPUEstimator-based training job running on GKE with TPUs fails with:
Error recorded from infeed: Socket closed
An error was raised. This may be due to a preemption in a connected worker or parameter server. The current session will be closed and a new session will be created. This error may also occur due to a gRPC failure caused by high memory or network bandwidth usage in the parameter servers. If this error occurs repeatedly, try increasing the number of parameter servers assigned to the job. Error: Socket closed
I have two questions about that:
What is happening here? I checked the pod's memory usage and it did not spike. The TPU allocated to the pod is still there as well.
The job doesn't always raise the error to the pod. It continues to show as running unless someone manually checks its state and then restarts it. Is there any way to make it always restart automatically?
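On the automatic-restart question: if the trainer runs as a bare pod, nothing restarts it when the process wedges without exiting. One common pattern, sketched below under assumptions (the job name, image, heartbeat path, and timings are all placeholders, not from the question), is to run the trainer as a Kubernetes Job so failed containers are retried, plus a liveness probe so a hung-but-"Running" trainer gets killed and restarted:

```shell
# Sketch: wrap the trainer in a Kubernetes Job. restartPolicy retries the
# container when it exits non-zero, and the liveness probe restarts it when
# the trainer stops touching its heartbeat file (i.e. hangs silently).
cat <<'EOF' | kubectl apply -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: tpu-training
spec:
  backoffLimit: 10              # retry up to 10 times on failure
  template:
    spec:
      restartPolicy: OnFailure  # restart the container if it crashes
      containers:
      - name: trainer
        image: gcr.io/my-project/trainer:latest   # placeholder image
        livenessProbe:
          exec:
            # Trainer is assumed to touch /tmp/heartbeat each step;
            # probe fails if the file is older than 10 minutes.
            command: ["sh", "-c", "find /tmp/heartbeat -mmin -10 | grep -q ."]
          initialDelaySeconds: 600
          periodSeconds: 300
EOF
```

The heartbeat approach requires a small change to the training loop (periodically touching the file, e.g. from a TPUEstimator hook), but it converts the silent-hang case into a failure that Kubernetes can act on.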

Database Sync failing after a day

We set up database sync between two databases on the same server. It worked fine yesterday and then stopped working today. I tried killing connections to the database and stopping the web apps connected to it, thinking maybe it was a connection limit. I also reset the username and password after verifying the connection is correct.
This is the error we're getting:
Database re-provisioning failed with the exception "The current operation could not be completed because the database is not provisioned for sync or you not have permissions to the sync configuration tables." For more information, provide tracing ID ‘b4b76a8c-38ae-4b48-ad08-6c07933c23c1’ to customer support.
The error log indicates that a previous provisioning-related operation had not completed yet, so you were not able to re-provision the sync group at the same time.
May I know whether you are still experiencing this problem? If so, could you please provide the latest error log? I'll update the answer then.

Azure Cloud Service Deployment Error

I am trying to deploy a moderately sized project to the cloud as a service,
and it is giving me a fatal error. I am not able to figure out what the error means or what is causing it.
Azure Deployment Stack Trace :
Role instances recycled for a certain amount of times during an update or upgrade operation.
This indicates that the new version of your service or the configuration settings you provided
when configuring the service prevent role instances from running.
The most likely reason for this is that your code throws an unhandled exception.
Please consider fixing your service or changing your configuration settings so that
role instances do not throw unhandled exceptions.
Then start another update or upgrade operation. Until you start another update or upgrade
operation, Windows Azure will continue trying to update your service to the new version or
configuration you provided
For this kind of deployment error, you had better run your application in the Azure Compute Emulator first and then deploy to the cloud. That way you can get the unhandled-exception information. Also, don't forget to use try/catch in your code.