OpenStack's virtual nodes permanently in paused state - virtual-machine

Recently I deployed Red Hat OpenStack 10 with Jenkins. I've found that my running nodes became paused after a while.
virsh list stdout:
 Id    Name            State
--------------------------------
 1     undercloud-0    paused
 2     compute-0       paused
 3     controller-0    paused
I tried to start or reboot the VMs, but it didn't help; the machines are still in the paused state. Are there any obvious things I might have missed?

I found that the host ran out of free disk space after OpenStack had been running for some time.
The RHEL machines had a smaller / partition and quite a big /home partition. I found the VM images stored in /var and just moved them into /home.
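A quick way to confirm which filesystem actually filled up (not part of the original steps, just a sanity check) is:
# df -h /var/lib/libvirt/images /home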
The steps are:
Stop all running VMs (virsh destroy forcefully powers a domain off; it does not delete anything):
# for i in $(virsh list --name); do virsh destroy $i; done
Create a new directory and move the images there:
# mkdir /home/_images
# mv /var/lib/libvirt/images/* /home/_images
Remove the old directory with images and create a symlink to the new directory.
# rmdir /var/lib/libvirt/images
# ln -s /home/_images /var/lib/libvirt/images
Start the VMs again (or reboot the host machine); the ideal order is 1. undercloud-0, 2. controller-0, 3. the compute-x nodes. Since virsh destroy leaves the domains shut off, start them with virsh start, either one by one in that order or in a loop over all defined domains:
# for i in $(virsh list --all --name); do virsh start $i; done

Related

How to fix these warnings "External file changes sync may be slow" and "The current inotify(7) watch limit is too low" in IntelliJ Project in Ubuntu

I am getting two warning messages in IntelliJ IDEA when I open my project.
1. IntelliJ IDEA cannot receive filesystem event notifications
for the project. Is it on a network drive?
2. The current inotify(7) watch limit is too low.
NOTE: I am using Ubuntu 20.04 LTS.
As described in the JetBrains documentation (the URL is in the comment header of the file below), add a new conf file:
$ sudo touch /etc/sysctl.d/60-jetbrains.conf
Open the file and add these lines
# Set inotify watch limit high enough for IntelliJ IDEA (PhpStorm, PyCharm, RubyMine, WebStorm).
# Create this file as /etc/sysctl.d/60-jetbrains.conf (Debian, Ubuntu), and
# run `sudo service procps start` or reboot.
# Source: https://confluence.jetbrains.com/display/IDEADEV/Inotify+Watches+Limit
#
# More information resources:
# -$ man inotify # manpage
# -$ man sysctl.conf # manpage
# -$ cat /proc/sys/fs/inotify/max_user_watches # print current value in use
fs.inotify.max_user_watches = 524288
Then reload the sysctl settings so the new limit takes effect (no reboot needed):
$ sudo sysctl -p --system
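To confirm the new limit is active, you can print the current value (the same check mentioned in the comments of the conf file above):
$ cat /proc/sys/fs/inotify/max_user_watches
524288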
For macOS users (my googling brought me here): if your project code is in Google Drive on your internal disk, you'll get this warning as well.
I believe it's because Google Drive has its own file system (FS), which doesn't implement all the features of a regular FS (for example, you can't create a symlink on Google Drive).

How to set mounted folder permission in podman

Abstract
When I mount a folder into my container and the path to the folder does not yet exist inside the container, podman will create it for me. I can set the permissions of the mounted folder on my host machine to match the container user, but the path folders that podman creates do not get the same permissions.
Steps to reproduce
For example, let's assume that in my image the user's home directory is empty. Then I do this on my host:
$ mkdir foo
$ podman unshare chown 1000:100 foo
$ podman run -v $PWD/foo:/home/myuser/bar/foo:z [...] some/image:latest
Inside my container that results in:
~ # ls -la
drwxr-xr-t 3 root root 4096 Jan 28 12:43 bar
~ # cd bar
~/bar # ls -la
drwxrwxr-x 2 1000 users 4096 Jan 28 12:42 foo
~/bar #
Is this behavior intentional?
Is there a way to tell podman to create the path with the same permissions as the destination folder?
I can imagine a workaround, but it would be nice if I could specify this in the run command.
Use Case
In my case I try to run different Jupyter notebooks as disposable containers directly from docker.io, but I do want to share the user settings. The user-settings folder is not present when the container mounts the volumes, so podman creates it, but as root. The jupyter user therefore cannot access the folders created by podman and fails.
I could create a build file from the images and create the folders in the build phase, but I use different images all the time and I don't want to create a custom image for every use case.
I could mount the volume at the parent folder, but all kinds of different stuff gets stored there and I don't want to share that with all the different containers.
I could keep the container around instead of disposing of it after the initial boot, but I don't know when, if at all, I will want to reuse it...
Maybe it is possible to map the jupyter user to your user with the --uidmap command-line option?
(untested)
$ mkdir foo
$ jupyterUID=1234 # Replace 1234 with the correct UID for the jupyter user
$ podman run -v $PWD/foo:/home/myuser/bar/foo:z [...] \
    --uidmap=0:1:$jupyterUID \
    --uidmap=$(expr $jupyterUID + 1):$(expr $jupyterUID + 1):$(expr 65536 - $jupyterUID - 1) \
    --uidmap=${jupyterUID}:0:1 \
    some/image:latest
I think something like this is needed when the container starts as the container root user and then runs a program as another user. If that other user would write files in a bind-mounted directory, the files would be owned by your normal user on the host. I don't know, though, if that is the case with your Jupyter container image.
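One way to look up the UID to plug into the command above (untested; it assumes the image's default user is the jupyter user and that the id tool exists in the image) is to ask the image itself:
$ podman run --rm some/image:latest id -u
For the official Jupyter docker-stacks images this typically prints 1000, but check your particular image.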
Edit 4 April 2022
A related Stackoverflow answer that I wrote:
https://stackoverflow.com/a/71741794/757777
I also wrote a troubleshooting tip about using --uidmap and --gidmap in the Podman troubleshooting guide.

Ubuntu Server Backup and Restore via tar

I'm trying to learn how to back up and restore my Ubuntu Server via tar so I know that I have a safe system. After I untar and reboot, I have several issues, but they all seem to be caused by a read-only file system. The source and destination servers both run Ubuntu Server on the same version, 18.04.5 LTS. The source server is a VPS with 6 GB RAM and 4 vCPUs. The destination server is a VM on my FreeNAS machine with 6 GB RAM and 2 vCPUs.
The primary applications that need to work are my Graylog server and Nagios server. I've mostly followed the instructions at Ubuntu.
First, my tar command is:
sudo tar -c --use-compress-program=pigz -f backup.tar.gz \
  --exclude=/backup.tar.gz --exclude=/dev --exclude=/usr --exclude=/sbin \
  --exclude=/proc --exclude=/sys --exclude=/tmp --exclude=/run \
  --exclude=/mnt --exclude=/media --exclude=/lost+found \
  --exclude=/home/*/.cache --exclude=/home/*/.gvfs \
  --exclude=/home/*/.local/share/Trash --exclude=/var/log \
  --exclude=/var/cache/apt/archives --exclude=/usr/src/linux-headers* \
  --one-file-system /
I use pigz to utilize the VPS's 4 vCPUs so the backup takes less time. I transfer this to my VM, which has a fresh copy of Ubuntu Server 18.04.5, and untar with:
sudo tar -xvpzf backup.tar.gz -C / --numeric-owner
After I reboot, I get the following as soon as I boot:
Unable to setup logging. [Errno 30] Read-only file system: '/var/log/landscape/sysinfo.log'
run-parts: /etc/update-motd.d/50-lanscape-sysinfo exited with return code 1
mktemp: failed to create file via template '/var/lib/update-notifier/tmp.XXXXXXXXXX': Read-only file system
run-parts: /etc/update-motd.d/95-hwe-eol exited with return code 1
/usr/lib/update-notifier/update-motd-fsck-at-reboot: 33: /usr/lib/update-motd-fsck-at-reboot: cannot create /var/lib/update-notifier/fsck-at-reboot: Read-only file system
I do see that some parts of the system work like the original source: my SSH port changed, the hostname changed, etc. But I get the errors above, and my Graylog and Nagios servers do not work.
So I'm wondering where I went wrong in my process and any help would be appreciated. The source is a live server with backups so I'm safe there. I'm just making sure I have my ducks in a row for the future.
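As a quick sanity check (not from the original post), given the errors above you can confirm whether / really is mounted read-only on the restored VM and try remounting it read-write before digging further:
$ findmnt -o TARGET,OPTIONS /
$ sudo mount -o remount,rw /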

Never successfully built a large hadoop&spark cluster

I was wondering if anybody could help me with this issue in deploying a spark cluster using the bdutil tool.
When the total number of cores increases (>= 1024), it fails every time for the following reasons:
Some machines never become sshable, like "Tue Dec 8 13:45:14 PST 2015: 'hadoop-w-5' not yet sshable (255); sleeping"
Some nodes fail with an "Exited 100" error when deploying spark worker nodes, like "Tue Dec 8 15:28:31 PST 2015: Exited 100 : gcloud --project=cs-bwamem --quiet --verbosity=info compute ssh hadoop-w-6 --command=sudo su -l -c "cd ${PWD} && ./deploy-core-setup.sh" 2>>deploy-core-setup_deploy.stderr 1>>deploy-core-setup_deploy.stdout --ssh-flag=-tt --ssh-flag=-oServerAliveInterval=60 --ssh-flag=-oServerAliveCountMax=3 --ssh-flag=-oConnectTimeout=30 --zone=us-central1-f"
In the log file, it says:
hadoop-w-40: ==> deploy-core-setup_deploy.stderr <==
hadoop-w-40: dpkg-query: package 'openjdk-7-jdk' is not installed and no information is available
hadoop-w-40: Use dpkg --info (= dpkg-deb --info) to examine archive files,
hadoop-w-40: and dpkg --contents (= dpkg-deb --contents) to list their contents.
hadoop-w-40: Failed to fetch http://httpredir.debian.org/debian/pool/main/x/xml-core/xml-core_0.13+nmu2_all.deb Error reading from server. Remote end closed connection [IP: 128.31.0.66 80]
hadoop-w-40: E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
I tried 16-core x 128-node, 32-core x 64-node, 32-core x 32-node, and other configurations with over 1024 cores, but either reason 1 or reason 2 above shows up.
I also tried modifying the ssh-flag to change the ConnectTimeout to 1200s, and changing bdutil_env.sh to set the polling interval to 30s, 60s, ...; none of this worked. There are always some nodes that fail.
Here is one of the configurations that I used:
time ./bdutil \
--bucket $BUCKET \
--force \
--machine_type n1-highmem-32 \
--master_machine_type n1-highmem-32 \
--num_workers 64 \
--project $PROJECT \
--upload_files ${JAR_FILE} \
--env_var_files hadoop2_env.sh,extensions/spark/spark_env.sh \
deploy
To summarize some of the information that came out of a separate email discussion: as IP mappings change and different Debian mirrors get assigned, there can be occasional problems where the concurrent calls to apt-get install during a bdutil deployment either overload some unbalanced servers or trigger DDoS protections, leading to deployment failures. These do tend to be transient, and at the moment it appears I can again deploy large clusters in zones like us-east1-c and us-east1-d successfully.
There are a few options you can take to reduce the load on the Debian mirrors:
Set MAX_CONCURRENT_ASYNC_PROCESSES to a much smaller value than the default 150 inside bdutil_env.sh, such as 10 to only deploy 10 at a time; this will make the deployment take longer, but would lighten the load as if you just did several back-to-back 10-node deployments.
If the VMs were successfully created but the deployment steps fail, instead of needing to retry the whole delete/deploy cycle, you can try ./bdutil <all your flags> run_command -t all -- 'rm -rf /home/hadoop' followed by ./bdutil <all your flags> run_command_steps to just run through the whole deployment attempt.
Incrementally build your cluster using resize_env.sh; initially set --num_workers 10 and deploy your cluster, and then edit resize_env.sh to set NEW_NUM_WORKERS=20, and run ./bdutil <all your flags> -e extensions/google/experimental/resize_env.sh deploy and it will only deploy the new workers 10-20 without touching those first 10. Then you just repeat, adding another 10 workers to NEW_NUM_WORKERS each time. If a resize attempt fails, you simply ./bdutil <all your flags> -e extensions/google/experimental/resize_env.sh delete to only delete those extra workers without affecting the ones you already deployed successfully.
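As a rough sketch of the first and third options above (variable names as quoted in this answer; the exact file layout may differ between bdutil versions):
# in bdutil_env.sh: deploy fewer nodes concurrently (option 1)
MAX_CONCURRENT_ASYNC_PROCESSES=10
# in extensions/google/experimental/resize_env.sh: grow the cluster in steps (option 3)
NEW_NUM_WORKERS=20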
Finally, if you're looking for more reproducible and optimized deployments, you should consider using Google Cloud Dataproc, which lets you use the standard gcloud CLI to deploy clusters, submit jobs, and further manage/delete clusters without needing to remember your bdutil flags or keep track of which clusters you have on your client machine. You can SSH into Dataproc clusters and use them basically the same way as bdutil clusters, with some minor differences, like Dataproc's DEFAULT_FS being HDFS, so any GCS paths you use should fully specify the complete gs://bucket/object name.

/var/run/redis/redis.pid exists, process is already running or crashed

Redis went quiet on me.
user@mycomputer:~$ redis-cli
Could not connect to Redis at 127.0.0.1:6379: Connection refused
I tried to restart the service like this:
sudo /etc/init.d/redis_6379 stop
/var/run/redis/redis.pid exists, process is already running or crashed
But no luck. The logs didn't show any errors either.
I got it fixed by backing up the .rdb file; mine is located at
/var/lib/redis
Check your config file "/etc/redis/redis.conf" for the rdb file's location and do this:
sudo mv /var/lib/redis/redis.rdb /var/lib/redis/redis_backup.rdb
Then recreate the redis.rdb file:
sudo touch redis.rdb
Run the redis-server with the conf and it should work
sudo redis-server /etc/redis/redis.conf
A tidier way to fix it: recreating the .rdb file, as suggested in one of the answers here, purges all the cache recorded so far, and Redis starts up fresh with no cached data.
The message "/var/run/redis/redis.pid exists, process is already running or crashed" is a warning that indicates a system crash or improper shutdown.
Just delete the /var/run/redis/redis.pid file and restart the server again.
Note: you might have lost the latest cache changes that weren't flushed to disk because of the untidy shutdown. This data loss can be minimized with a more frequent disk-flush configuration in the Redis conf file (in my case /etc/redis/6379.conf):
save 900 1
save 300 10
save 60 10000
Or try AOF persistence; see the Redis documentation on persistence for more details.
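As a rough sketch, enabling AOF in the same conf file usually comes down to two directives (appendfsync everysec is the common middle ground between durability and performance):
appendonly yes
appendfsync everysec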
Depending on how you installed Redis, the pid file can be found at /var/run/redis_6379.pid.
What happened is that Redis crashed, but the pid file is still there. So you just have to delete it:
sudo rm -f /var/run/redis_6379.pid
Then start redis again:
sudo /etc/init.d/redis_6379 start
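Once Redis is up again, you can quickly check that it accepts connections:
redis-cli ping
PONG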
If you can't find the pid file, I suggest installing Redis "more properly": follow the Redis quickstart guide, in the "Installing Redis more properly" section.
You can find it here:
https://redis.io/topics/quickstart
Then run redis-server with the config:
sudo redis-server redis.conf