yarn usercache dir not resolved properly when running an example application - hadoop-yarn

I am using Hadoop 3.2.0 and trying to run a simple application in a Docker container. I have made the required configuration changes in both yarn-site.xml and container-executor.cfg to select the LinuxContainerExecutor and the Docker runtime.
I am following the distributed shell example from the Hortonworks blog: https://hortonworks.com/blog/trying-containerized-applications-apache-hadoop-yarn-3-1/
The problem is that when the application is submitted to YARN, it fails with the following directory creation error:
2019-02-14 20:51:16,450 INFO distributedshell.Client: Got application
report from ASM for, appId=2, clientToAMToken=null,
appDiagnostics=Application application_1550156488785_0002 failed 2
times due to AM Container for appattempt_1550156488785_0002_000002
exited with exitCode: -1000 Failing this attempt.Diagnostics:
[2019-02-14 20:51:16.282]Application application_1550156488785_0002
initialization failed (exitCode=20) with output: main : command
provided 0 main : user is myuser main : requested yarn user is
myuser Failed to create directory
/data/yarn/local/nmPrivate/container_1550156488785_0002_02_000001.tokens/usercache/myuser
- Not a directory
I have configured yarn.nodemanager.local-dirs in yarn-site.xml, and I can see the same value reflected in the YARN web UI at localhost:8088/conf:
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/data/yarn/local</value>
<final>false</final>
<source>yarn-site.xml</source>
</property>
I do not understand why it is trying to create the usercache directory inside the nmPrivate directory.
Note: I have verified myuser's permissions on the directories and have also tried clearing the directories manually, as suggested in a related post, but to no avail. I do not see any additional information about the container launch failure in any other logs.
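Roughly what I checked on the NodeManager host (a sketch only; the paths come from my yarn-site.xml above, and the exact layout may differ on your nodes):
# local-dirs itself plus the subdirectories the NodeManager normally creates
ls -ld /data/yarn/local /data/yarn/local/nmPrivate /data/yarn/local/usercache /data/yarn/local/filecache
# confirm the submitting user exists on the node (container-executor also
# rejects uids below its min.user.id setting in container-executor.cfg)
id myuser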
How do I debug why the usercache directory is not resolved properly?
I'd really appreciate any help on this.

I realized that this was all down to the users the YARN services were started as and their permissions on the directories the services work with.
After making the required changes, I am able to run the examples and other applications without issues.
Thanks to the Hadoop user community for pointing me in the right direction. Adding the link here for more details:
http://mail-archives.apache.org/mod_mbox/hadoop-user/201902.mbox/browser
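For anyone hitting the same error, this is roughly the ownership layout that worked for me (a sketch only, assuming the NodeManager runs as a dedicated yarn user in group hadoop; adjust the user, group, and paths to your setup):
# local-dirs must be writable by the NodeManager user; the NodeManager and
# container-executor then create nmPrivate, filecache and usercache/<user>
# underneath it themselves.
chown -R yarn:hadoop /data/yarn/local
chmod 755 /data/yarn/local

# The setuid container-executor binary has to be owned by root, group-owned by
# the configured NM group, and mode 6050, or LinuxContainerExecutor fails early.
# (The binary path may differ in your installation.)
chown root:hadoop $HADOOP_HOME/bin/container-executor
chmod 6050 $HADOOP_HOME/bin/container-executor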

Related

Nexus Repo - could not lock user prefs

I'm running Sonatype Nexus 3 inside a docker container, with the following startup command:
docker run -d -p 80:8081 --ulimit nofile=65536:65536 --name nexus -v nexus-data:/nexus-data -e INSTALL4J_ADD_VM_PARAMS="-Xms4g -Xmx4g -XX:MaxDirectMemorySize=6717m -Djava.util.prefs.userRoot=${NEXUS_DATA}/javaprefs" sonatype/nexus3
After updating the docker image version from 3.30.0 to 3.40.1, I keep getting the following warnings regarding user prefs.
2022-07-18 13:14:45,860+0000 WARN [Timer-0] *SYSTEM java.util.prefs - Couldn't flush user prefs: java.util.prefs.BackingStoreException: Couldn't get file lock.
2022-07-18 13:15:15,860+0000 WARN [Timer-0] *SYSTEM java.util.prefs - Could not lock User prefs. Unix error code 2.
As you can see from the startup command, the user prefs directory is inside the Docker volume, at /nexus-data/javaprefs. I have tried looking for existing locks inside the directory, but found none. I've also tried completely deleting the directory; the warning still came up, and the folder itself wasn't being recreated by Nexus.
I honestly don't even know whether this is an important issue, since there is little to no documentation about the user preferences folder.
Even a way to turn off the warning, which fires every 30 seconds, would be useful.
----UPDATE----
I've tried doing a clean installation of Nexus through Docker, following the instructions in the sonatype/nexus3 GitHub repository, and I still get these warnings.
I even tried on a different OS (Windows instead of Linux, through Docker Desktop), and with and without a volume for /nexus-data.
At this point I believe it to be a bug in a newer Nexus version.
TL;DR: Adding -Djava.util.prefs.userRoot=/nexus-data/javaprefs should solve the problem, assuming the Nexus data directory is at /nexus-data/.
I had the same issue after upgrading from 3.38.1 to 3.42.0. After some investigation I found that the java.util.prefs.userRoot property indeed got lost somewhere between those versions. The default value in vanilla Nexus 3.38.1 is /nexus-data/javaprefs.
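Applied to the startup command from the question, that would look something like the following (same image, ports and volume; the only change is spelling out /nexus-data/javaprefs instead of relying on ${NEXUS_DATA} being set in the host shell):
docker run -d -p 80:8081 --ulimit nofile=65536:65536 --name nexus \
  -v nexus-data:/nexus-data \
  -e INSTALL4J_ADD_VM_PARAMS="-Xms4g -Xmx4g -XX:MaxDirectMemorySize=6717m -Djava.util.prefs.userRoot=/nexus-data/javaprefs" \
  sonatype/nexus3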

Singularity: failed to resolve session directory

I wrote a Singularity container that works just fine on my computer. However, when a colleague of mine tries to run it, he gets the error output
FATAL: container creation failed: failed to resolve session directory /usr/local/var/singularity/mnt/session: lstat /usr/local/var: no such file or directory
In the past, he could run containers I built. In fact, he used to be able to run a container built from the same recipe. What changed is that the version of Singularity on the machine I use to build it was upgraded.
I entered the error in a search engine and found only a single hit, https://forum.image.sc/t/improving-cluster-supercomputer-performance-tesla-v100-volta-16-32gb-gpu/37459/8, in which it is not resolved.
Does anybody know a way to fix this? Or what the source of the problem is? Or a workaround, preferably one that does not require me to downgrade Singularity? (The machine on which I build it is shared between several users, which is why I don't want to do that.)
Okay, this was somewhat trivial to solve: we just had the colleague create the required folders,
mkdir -p /usr/local/var/singularity/mnt/{container,final,overlay,session}
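If you want to check where a given Singularity installation expects its session directory before creating anything, the compiled-in paths can be printed with singularity buildcfg (at least on the 3.x builds I have seen; treat this as a sketch):
# SESSIONDIR should match the path from the error message,
# e.g. /usr/local/var/singularity/mnt/session in this case.
singularity buildcfg | grep -i sessiondir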

MLflow Artifacts Storing But Not Listing In UI

I've run into an issue using MLflow server. When I first ran the command to start an MLflow server on an EC2 instance, everything worked fine. Now, although logs and artifacts are being stored to Postgres and S3, the UI is not listing the artifacts. Instead, the artifact section of the UI shows:
Loading Artifacts Failed
Unable to list artifacts stored under <s3-location> for the current run. Please contact your tracking server administrator to notify them of this error, which can happen when the tracking server lacks permission to list artifacts under the current run's root artifact directory.
But when I check in S3, I see the artifact in the location that the error shows. What could have started causing this? It used to work not long ago, and nothing was changed on the EC2 instance that is hosting MLflow.
I found the answer. The problem was that MLflow could not find boto3, so a conda installation of it fixed the issue. The log line revealing this was buried and hard to find in stdout.
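For reference, something along these lines in the environment the tracking server runs in, followed by a restart of the server process (conda here because that is what I used; pip install boto3 should work just as well):
# boto3 is what MLflow uses to talk to S3 when listing artifacts
conda install -c conda-forge boto3
# then restart the mlflow server process so it picks up the new package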

Jenkins + Phing: Build Failure - can't find build.xml

Trying to set up Jenkins on one of my servers for the first time and think I might be missing something.
Jenkins 1.545
Phing 2.6.1
Jenkins builds give me the following output.
Building in workspace /var/www/vhosts/domain.co.uk/httpdocs
looking for '/var/www/vhosts/domain.co.uk/httpdocs/build.xml' ...
looking for '/var/www/vhosts/domain.co.uk/httpdocs/build.xml' ...
looking for 'build.xml' ...
buildfile 'build.xml' not found.
Build step 'Invoke Phing targets' marked build as failure
Finished: FAILURE
If I run my build.xml on its own it works fine.
I'm using a custom workspace at the moment. Before that I tried a symlink from the default workspace to my webroot; with that setup it found the build file but failed when trying to run Phing. I know it's a permissions problem, but I'm not sure exactly where.
I'm running this on a Plesk web server and have tried adding the jenkins user to the psacln and psaserv groups, but that didn't work either.
I use Hudson, but I think it is the same problem.
Provide the full path to the build file in the build step's advanced settings:
${WORKSPACE}/build.xml
Make sure Jenkins runs as the correct user:
RUN_AS_USER=jenkins
Go to the custom workspace and run
chown -R jenkins:jenkins myworkspace
and, if that doesn't work,
chmod -R 777 myworkspace
which you can tighten later.
I hope it helps.
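To narrow the permissions problem down, it can also help to check whether the jenkins user can actually reach the build file; a quick sketch (using the workspace path from the build log above, adjust as needed):
# Can the jenkins user read the build file at all?
sudo -u jenkins ls -l /var/www/vhosts/domain.co.uk/httpdocs/build.xml
# namei shows the owner/permissions of every directory along the path;
# each one needs execute (traverse) permission for the jenkins user.
namei -l /var/www/vhosts/domain.co.uk/httpdocs/build.xml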

Deploying Custom Cartridges on Openshift Origin

I have created a new custom cartridge, which I packaged into an RPM using tito and installed using yum. The cartridge is copied by my spec file to the /usr/libexec/openshift/cartridges directory; however, when I log into the Origin home site and try to create an application, my cartridge does not show up. I went digging in the Ruby scripts and found that a script named cartridge_cache.rb seems to cache the cartridges it finds within the /usr/libexec/openshift/cartridges directory. I have tried to get Origin to reload the cache and include my new cartridge by removing all the cache files within the /var/www/openshift/broker/cache directory and then restarting the broker, but I have had no success.
Is there somewhere I need to hardcode my cartridge name into some global variable or something? Basically, does anyone know how to get a custom cartridge to show up on the web page for creating a new application?
UPDATE: I ran into a slide deck that had one slide on how to install the cartridge. However, I still have had no success. Here is what I have tried since the previous post:
moved my cartridge directory from /usr/libexec/openshift/cartridges to /usr/libexec/openshift/cartridges/v2
ran this command:
oo-admin-cartridge -a install -s /usr/libexec/openshift/cartridges/v2/myfirstcart
(the output stated that it installed the cartridge)
cleared the cache with
bundle exec rake tmp:clear
restarted the openshift broker service
Also, just to make sure the cache was cleared out, I went into the Rails console and ran Rails.cache.clear. Still no custom cartridge on the OpenShift web page.
It works for me after clearing the cache:
cd /var/www/openshift/broker
bundle exec rake tmp:clear
and restarting the broker service:
service openshift-broker restart
http://openshift.github.io/documentation/oo_administration_guide.html#clear-the-broker-application-cache
The MCollective service on the node server (if you have separate servers for the broker and node) must be restarted, e.g. with
service ruby193-mcollective restart
After that, you should clear the caches on the broker server, e.g. with
/usr/sbin/oo-admin-broker-cache --console
Then the new cartridges should be available.