MapReduce job gets stuck in ACCEPTED state - Hive

I have a local installation of Hadoop on my MacBook Air (8 GB), but my MapReduce job gets stuck in the ACCEPTED state. I wanted to know if I am missing something in the configuration; what am I doing wrong here?

Open the Scheduler page in the ResourceManager web UI and check the "default" queue for this account name.
Most likely one of these pairs has almost identical values:
"Max Resources" and "Used Resources" - the account has used all the resources allowed to it and cannot allocate containers for a new job.
"Max AM Resource" and "Used AM Resource" - there are no resources left to allocate an ApplicationMaster container and start the job.
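If you prefer to check this from a script instead of the web UI, here is a minimal sketch against the ResourceManager REST API (it assumes the RM web UI listens on localhost:8088 and a CapacityScheduler-style response; the exact field names vary between Hadoop versions, so it just surfaces anything that looks like a used/max figure):

    # Sketch: inspect per-queue usage via the ResourceManager REST API.
    import json
    import urllib.request

    RM = "http://localhost:8088"  # assumption: default RM web UI address

    with urllib.request.urlopen(f"{RM}/ws/v1/cluster/scheduler") as resp:
        info = json.load(resp)["scheduler"]["schedulerInfo"]

    def walk(queue):
        # Field names differ between Hadoop versions, so print anything that
        # looks like a used/max resource or AM-resource figure.
        name = queue.get("queueName", "root")
        interesting = {k: v for k, v in queue.items()
                       if any(s in k.lower() for s in ("used", "max", "amresource"))}
        print(name, json.dumps(interesting, default=str))
        for child in (queue.get("queues") or {}).get("queue", []):
            walk(child)

    walk(info)

If "Used AM Resource" is already at "Max AM Resource", raising yarn.scheduler.capacity.maximum-am-resource-percent (or freeing up the queue) is what lets a new ApplicationMaster container start.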

Related

How to overcome Spark "No Space left on the device" error in AWS Glue Job

I used an AWS Glue job with PySpark to read data from S3 Parquet files amounting to more than 10 TB, but the job kept failing during the execution of a Spark SQL query with the error
java.io.IOException: No space left on the device
On analysis, I found that AWS Glue G.1X workers have 4 vCPUs, 16 GB of memory and 64 GB of disk, so we tried increasing the number of workers.
Even after increasing the number of Glue workers (G.1X) to 50, the job keeps failing with the same error.
Is there a way to configure the Spark local temp directory to point to S3 instead of the local filesystem? Or can we mount an EBS volume on the Glue workers?
I tried configuring the property in the SparkSession builder, but Spark still uses the local tmp directory:
SparkSession.builder.appName("app").config("spark.local.dir", "s3a://s3bucket/temp").getOrCreate()
As @Prajappati stated, there are several solutions.
These solutions are described in detail in the AWS blog that presents the S3 shuffle feature. I am going to omit the shuffle configuration tweaking since it is not very reliable. So, basically, you can either:
Scale up vertically, increasing the size of the machine (i.e. going from G.1X to G.2X), which increases the cost.
Disaggregate compute and storage: in this case, that means using S3 as the storage service for spills and shuffles.
At the time of writing, to configure this disaggregation, the job must be configured with the following settings:
Glue 2.0 Engine
Glue job parameters:
--write-shuffle-files-to-s3 = true (main parameter, required)
--write-shuffle-spills-to-s3 = true (optional)
--conf = spark.shuffle.glue.s3ShuffleBucket=S3://<your-bucket-name>/<your-path> (optional; if not set, the path --TempDir/shuffle-data is used instead)
Remember to assign the proper IAM permissions to the job so it can access the bucket and write under the S3 path provided or configured by default.
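If you create the job from code rather than the console, a minimal boto3 sketch might look like this (the job name, role and script location are hypothetical placeholders; the shuffle parameters are the ones from the list above):

    import boto3

    glue = boto3.client("glue")

    glue.create_job(
        Name="my-etl-job",                      # hypothetical name
        Role="MyGlueServiceRole",               # must be allowed to write to the shuffle bucket
        GlueVersion="2.0",                      # the S3 shuffle manager needs the Glue 2.0 engine
        WorkerType="G.1X",
        NumberOfWorkers=50,
        Command={"Name": "glueetl",
                 "ScriptLocation": "s3://my-bucket/scripts/job.py"},  # hypothetical
        DefaultArguments={
            "--write-shuffle-files-to-s3": "true",
            "--write-shuffle-spills-to-s3": "true",
            "--conf": "spark.shuffle.glue.s3ShuffleBucket=S3://my-bucket/shuffle/",
        },
    )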
According to the error message, it appears as if the Glue job is running out of disk space when writing a DynamicFrame.
As you may know, Spark will perform a shuffle on certain operations, writing the results to disk. When the shuffle is too large, the job fails with an error like this one.
There are two options to consider.
Upgrade your worker type to G.2X and/or increase the number of workers.
Implement AWS Glue Spark Shuffle manager with S3 [1]. To implement this option, you will need to downgrade to Glue version 2.0.
The Glue Spark shuffle manager will write the shuffle files and shuffle spills to S3, lowering the probability of your job running out of local disk space and failing.
Please add the following additional job parameters. You can do this via the following steps:
1) Open the "Jobs" tab in the Glue console.
2) Select the job you want to apply this to, then click "Actions" and then "Edit Job".
3) Scroll down and open the drop-down named "Security configuration, script libraries, and job parameters (optional)".
4) Under job parameters, enter the following key/value pairs:
Key: --write-shuffle-files-to-s3 Value: true
Key: --write-shuffle-spills-to-s3 Value: true
Key: --conf Value: spark.shuffle.glue.s3ShuffleBucket=S3://<your-bucket-name>/<your-path>
Remember to replace the angle brackets <> with the name of the S3 bucket where you would like to store the shuffle data.
5) Click "Save", then run the job.
FWIW, I discovered that the first thing you need to check is whether the Spark UI is enabled on the job: https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-jobs.html
The AWS documentation mentions that the logs generated for the Spark UI are flushed to the S3 path every 30 seconds, but it does not look like they are rotated on the worker. So sooner or later, depending on the workload and worker type, the workers run out of disk space and the run fails with Command failed with exit code 10.
The documentation states that spark.local.dir is used to specify a local directory only.
This error can be addressed by modifying the logging properties or, depending on the cluster manager used, the cluster manager properties (for example for YARN, as described in this answer).
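For completeness, here is a minimal sketch of what the documentation does allow, i.e. pointing spark.local.dir at a local path with enough space (the mount point below is just an example, not something Glue exposes):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .appName("app")
             # local filesystem path only; s3a:// URIs are not accepted here
             .config("spark.local.dir", "/mnt/large-disk/spark-tmp")
             .getOrCreate())

Note that on a cluster this value may be overridden by the cluster manager (YARN, for instance, sets its own local dirs), which is what the answer above refers to.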

Can VMs on Google Compute detect when they've been migrated?

Is it possible to notify an application running on a Google Compute VM when the VM migrates to different hardware?
I'm a developer for an application (HMMER) that makes heavy use of vector instructions (SSE/AVX/AVX-512). The version I'm working on probes its hardware at startup to determine which vector instructions are available and picks the best set.
We've been looking at running our program on Google Compute and other cloud engines, and one concern is that, if a VM migrates from one physical machine to another while running our program, the new machine might support different instructions, causing our program to either crash or execute more slowly than it could.
Is there a way to notify applications running on a Google Compute VM when the VM migrates? The only relevant information I've found is that you can set a VM to perform a shutdown/reboot sequence when it migrates, which would kill any currently-executing programs but would at least let the user know that they needed to restart the program.
We ensure that your VM instances never live migrate between physical machines in a way that would cause your programs to crash the way you describe.
However, for your use case you probably want to specify a minimum CPU platform version. You can use this to ensure that e.g. your instance has the new Skylake AVX instructions available. See the documentation on Specifying the Minimum CPU Platform for further details.
As per the Live Migration docs:
Live migration does not change any attributes or properties of the VM
itself. The live migration process just transfers a running VM from
one host machine to another. All VM properties and attributes remain
unchanged, including things like internal and external IP addresses,
instance metadata, block storage data and volumes, OS and application
state, network settings, network connections, and so on.
Google does provide a few controls to set the instance availability policies, which also let you control aspects of live migration. Here they also mention what you can look for to determine when live migration has taken place.
Live migrate
By default, standard instances are set to live migrate, where Google
Compute Engine automatically migrates your instance away from an
infrastructure maintenance event, and your instance remains running
during the migration. Your instance might experience a short period of
decreased performance, although generally most instances should not
notice any difference. This is ideal for instances that require
constant uptime, and can tolerate a short period of decreased
performance.
When Google Compute Engine migrates your instance, it reports a system
event that is published to the list of zone operations. You can review
this event by performing a gcloud compute operations list --zones ZONE
request or by viewing the list of operations in the Google Cloud
Platform Console, or through an API request. The event will appear
with the following text:
compute.instances.migrateOnHostMaintenance
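If you want to watch for that event from a script, here is a minimal sketch that simply wraps the gcloud command quoted above (it assumes the gcloud CLI is installed and authenticated; the zone is a placeholder):

    import subprocess

    # Run the documented command and keep only the migration system events.
    out = subprocess.run(
        ["gcloud", "compute", "operations", "list", "--zones", "us-central1-a"],
        capture_output=True, text=True, check=True,
    ).stdout

    for line in out.splitlines():
        if "compute.instances.migrateOnHostMaintenance" in line:
            print(line)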
In addition, you can detect directly on the VM when a maintenance event is about to happen.
Getting Live Migration Notices
The metadata server provides information about an instance's
scheduling options and settings, through the scheduling/
directory and the maintenance-event attribute. You can use these
attributes to learn about a virtual machine instance's scheduling
options, and use this metadata to notify you when a maintenance event
is about to happen through the maintenance-event attribute. By
default, all virtual machine instances are set to live migrate so the
metadata server will receive maintenance event notices before a VM
instance is live migrated. If you opted to have your VM instance
terminated during maintenance, then Compute Engine will automatically
terminate and optionally restart your VM instance if the
automaticRestart attribute is set. To learn more about maintenance
events and instance behavior during the events, read about scheduling
options and settings.
You can learn when a maintenance event will happen by querying the
maintenance-event attribute periodically. The value of this
attribute will change 60 seconds before a maintenance event starts,
giving your application code a way to trigger any tasks you want to
perform prior to a maintenance event, such as backing up data or
updating logs. Compute Engine also offers a sample Python script
to demonstrate how to check for maintenance event notices.
You can use the maintenance-event attribute with the waiting for
updates feature to notify your scripts and applications when a
maintenance event is about to start and end. This lets you automate
any actions that you might want to run before or after the event. The
following Python sample provides an example of how you might implement
these two features together.
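A minimal sketch of that polling approach (it has to run on the instance itself, since the metadata server is only reachable from inside the VM; the attribute value is NONE until roughly 60 seconds before the event, when it changes to MIGRATE_ON_HOST_MAINTENANCE or TERMINATE_ON_HOST_MAINTENANCE):

    import urllib.request

    URL = ("http://metadata.google.internal/computeMetadata/v1/"
           "instance/maintenance-event?wait_for_change=true")

    def wait_for_maintenance_event():
        # wait_for_change=true blocks until the attribute's value changes.
        req = urllib.request.Request(URL, headers={"Metadata-Flavor": "Google"})
        with urllib.request.urlopen(req) as resp:
            return resp.read().decode()

    while True:
        event = wait_for_maintenance_event()
        if event != "NONE":
            print("Maintenance event imminent:", event)
            # e.g. checkpoint work, drain traffic, or flush buffers here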
You can also choose to terminate and optionally restart your instance.
Terminate and (optionally) restart
If you do not want your instance to live migrate, you can choose to
terminate and optionally restart your instance. With this option,
Google Compute Engine will signal your instance to shut down, wait for
a short period of time for your instance to shut down cleanly,
terminate the instance, and restart it away from the maintenance
event. This option is ideal for instances that demand constant,
maximum performance, and your overall application is built to handle
instance failures or reboots.
Look at the Setting availability policies section for more details on how to configure this.
If you use an instance with a GPU or a preemptible instance, be aware that live migration is not supported:
Live migration and GPUs
Instances with GPUs attached cannot be live migrated. They must be set
to terminate and optionally restart. Compute Engine offers a 60 minute
notice before a VM instance with a GPU attached is terminated. To
learn more about these maintenance event notices, read Getting live
migration notices.
To learn more about handling host maintenance with GPUs, read
Handling host maintenance on the GPUs documentation.
Live migration for preemptible instances
You cannot configure a preemptible instance to live migrate. The
maintenance behavior for preemptible instances is always set to
TERMINATE by default, and you cannot change this option. It is also
not possible to set the automatic restart option for preemptible
instances.
As Ramesh mentioned, you can specify a minimum CPU platform to ensure you are only migrated to hardware that has at least the CPU platform you specified. At a high level it works like this:
In summary, when you specify a minimum CPU platform:
Compute Engine always uses the minimum CPU platform where available.
If the minimum CPU platform is not available or the minimum CPU platform is older than the zone default, and a newer CPU platform is
available for the same price, Compute Engine uses the newer platform.
If the minimum CPU platform is not available in the specified zone and there are no newer platforms available without extra cost, the
server returns a 400 error indicating that the CPU is unavailable.
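If you still want to re-probe the available instruction sets after a restart or maintenance event, along the lines of what the question describes HMMER doing at startup, here is a minimal Linux-only sketch that reads the flags line from /proc/cpuinfo:

    def cpu_flags():
        # Return the set of CPU feature flags reported by the kernel.
        with open("/proc/cpuinfo") as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
        return set()

    flags = cpu_flags()
    for isa in ("sse4_1", "avx", "avx2", "avx512f"):
        print(isa, "available" if isa in flags else "not available")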

How to limit resource usage of a job in YARN

I remember some recent version of YARN has a configuration parameter which controls the amount of memory (or cores) a job can use. I tried to find it from the Web but I couldn't yet. If you know the parameter, please let me know.
I know one way to go about this is to use some kind of scheduler but for now I need a job level control so that the job doesn't abuse the entire system.
Thanks!
You can control the maximum and minimum resources that are allocated to each container.
yarn.scheduler.minimum-allocation-mb: Minimum memory allocation for each container
yarn.scheduler.maximum-allocation-mb: Maximum memory allocation for each container
yarn.scheduler.minimum-allocation-vcores: Minimum core allocation for each container
yarn.scheduler.maximum-allocation-vcores: Maximum core allocation for each container
If you want to prevent user jobs from abusing the cluster, yarn.scheduler.maximum-allocation-* can be the solution, because the ResourceManager refuses any request that exceeds these limits by throwing an InvalidResourceRequestException.
ref: yarn-default.xml
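To double-check which limits the ResourceManager actually applied, one option is to read its /conf servlet, which returns the full effective configuration; a minimal sketch (it assumes the RM web UI is on localhost:8088):

    import urllib.request
    import xml.etree.ElementTree as ET

    # The RM's /conf servlet returns the effective configuration as XML.
    with urllib.request.urlopen("http://localhost:8088/conf?format=xml") as resp:
        conf = ET.parse(resp).getroot()

    wanted = {
        "yarn.scheduler.minimum-allocation-mb",
        "yarn.scheduler.maximum-allocation-mb",
        "yarn.scheduler.minimum-allocation-vcores",
        "yarn.scheduler.maximum-allocation-vcores",
    }
    for prop in conf.findall("property"):
        if prop.findtext("name") in wanted:
            print(prop.findtext("name"), "=", prop.findtext("value"))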

How to terminate/remove a job flow in Amazon EMR?

I created a job flow using Amazon Elastic MapReduce (Amazon EMR) and it failed for some unknown reason. Then I tried to terminate the job flow through the AWS Management Console, but the 'Terminate' button was disabled. Then I tried to terminate the job flow using the CLI; it showed that the job flow was terminated, but it still shows up as failed in the job flow list, both in the CLI and in the Elastic MapReduce tab of the management console.
Please let me know how to remove the job flow from the list.
When I tried to debug the job flow it shows two errors:
The debugging functionality is not available for this job flow because you did not specify an Amazon S3 Log Path when you created it.
Job flow failed with reason: Invalid bucket name 'testBucket': buckets names must contain only lowercase letters, numbers, periods (.), and dashes (-).
You are facing two issues here:
Job Flow Failure
First and foremost, the problem that triggered the failure of the Amazon EMR job flow can be remedied immediately:
I created a job flow using Amazon Elastic MapReduce (Amazon EMR) and
it failed due to some unknown reasons.
The reason for the job flow failure can actually be inferred from error 2 in the listing you provided:
Job flow failed with reason: Invalid bucket name 'testBucket': buckets
names must contain only lowercase letters, numbers, periods (.), and
dashes (-). [emphasis mine]
Your bucket name 'testBucket' clearly violates the stated lowercase naming requirement, thus changing the name to lowercase only (e.g. 'testbucket' or 'test-bucket') will allow you to run the job flow as desired.
Termination State
Furthermore, the job flow termination state is presumably no problem at all. While it can happen in rare cases that Amazon EC2 instances or other resources actually get stuck in some state, what you are seeing is perfectly reasonable and normal at first sight:
It may take a while to completely terminate a job flow in the first place, see TerminateJobFlows:
The call to TerminateJobFlows is asynchronous. Depending on the
configuration of the job flow, it may take up to 5-20 minutes for
the job flow to completely terminate and release allocated
resources, such as Amazon EC2 instances. [emphasis mine]
Even terminated EC2 resources may be listed for quite a while still, see e.g. the AWS team response to EC2 Instance stuck in "terminated" state:
Terminated means "gone forever"; although sometimes it hangs around in
the UI for several hours. [emphasis mine]
I regularly see this behavior with EC2 instances, which indeed usually only vanish from the instance listing several hours later. Consequently, I suspect that the terminated job flow has meanwhile vanished from your job flow list as well.
Update
I had actually suspected this to be the case, but am still unable to find related information in the official documentation; however, terminated job flows are apparently visible one way or another for up to two months, see e.g. the AWS team response to Console not showing jobs older than a month:
While the console lists all running job flows, it only displays
terminated job flows launched in the last month. Alternatively, you
can use the Ruby CLI to list all job flows launched in the last two
months with the following command: [...] [emphasis mine]
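For reference, here is a minimal boto3 sketch of terminating programmatically (in the current API, job flows are called clusters; the cluster ID below is a placeholder). As quoted above, the call is asynchronous, so the cluster may remain visible for a while after it returns:

    import boto3

    emr = boto3.client("emr")
    emr.terminate_job_flows(JobFlowIds=["j-XXXXXXXXXXXXX"])  # placeholder cluster ID

    # Poll the state while the cluster winds down.
    state = emr.describe_cluster(ClusterId="j-XXXXXXXXXXXXX")["Cluster"]["Status"]["State"]
    print(state)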
If your application is running on Hadoop YARN, you can always use YARN to manage your application:
yarn application -list
yarn application -kill <application_id>

Troubleshooting failover cluster problem in W2K8 / SQL05

I have an active/passive W2K8 (64) cluster pair, running SQL05 Standard. Shared storage is on a HP EVA SAN (FC).
I recently expanded the filesystem on the active node for a database, adding a drive designation. The shared storage drives are designated as F:, I:, J:, L: and X:, with SQL filesystems on the first 4 and X: used for a backup destination.
Last night, as part of a validation process (the passive node had been offline for maintenance), I moved the SQL instance to the other cluster node. The database in question immediately moved to Suspect status.
Review of the system logs showed that the database would not load because the file "K:\SQLDATA\whatever.ndf" could not be found. (Note that we do not have a K: drive designation.)
A review of the J: storage drive showed zero contents -- nothing -- this is where "whatever.ndf" should have been.
Hmm, I thought. Problem with the server. I'll just move SQL back to the other server and figure out what's wrong.
Still no database. Suspect. Uh-oh. "Whatever.ndf" had gone into the bit bucket.
I finally decided to just restore from the backup (which had been taken immediately before the validation test), so nothing was lost but a few hours of sleep.
The question: (1) Why did the passive node think the whatever.ndf files were supposed to go to drive "K:", when this drive didn't exist as a resource on the active node?
(2) How can I get the cluster nodes "re-synced" so that failover can be accomplished?
I don't know that there wasn't a "K:" drive as a cluster resource at some time in the past, but I do know that this drive did not exist on the original cluster at the time of resource move.
Random thought based on what happened to me a few months ago... it sounds quite similar.
Do you have NTFS mount points? I forget exactly what it was (I'm a code monkey and relied on the DBAs), but the mount points were either "double booked", not part of the cluster resource, or the SAN volumes were not configured correctly.
We had "zero size" drives (I used xp_fixeddrives) for our log files, but we could still write to them.
Assorted reboots and failovers were unsuccessful. Basically, what fixed it was a thorough review of all settings in the SAN management tool.
A possibility for your K: drive...
The other thing I've seen is that mounted drives have letters as well as being mounted in folders. I used to use mounted folders for SQL Server, but the backup system used a direct drive letter.