Zombie processes while using use_multiprocessing=True in Keras model.fit()

I am encountering zombie processes when training a neural network with Keras' model.fit() method. Because of the <defunct> processes, training never finishes and all the affected processes have to be killed with SIGKILL. The problem is intermittent: restarting the training script does not reliably reproduce it, and the rerun sometimes completes. The problem does not occur when multiprocessing is disabled: model.fit(use_multiprocessing=False)
Here is the output of the ps aufx command:
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
...
root 774690 0.1 0.0 79032 70048 ? Ss Mai23 17:16 /usr/bin/python3 /usr/bin/tm legacy-worker run mlworker
root 1607844 0.0 0.0 2420 524 ? SNs Jun02 0:00 \_ /bin/sh -c /usr/bin/classifier-train
root 1607845 38.5 4.7 44686436 12505168 ? SNl Jun02 551:05 \_ /opt/venvs/classifier-training-repo/bin/python /usr/bin/classifier-train
root 1639337 0.0 3.7 43834076 10005208 ? SN Jun02 0:00 \_ /opt/venvs/classifier-training-repo/bin/python /usr/bin/classifier-train
root 1639339 0.0 0.0 0 0 ? ZN Jun02 0:00 \_ [classifier-train] <defunct>
root 1639341 0.0 0.0 0 0 ? ZN Jun02 0:00 \_ [classifier-train] <defunct>
root 1639343 0.0 0.0 0 0 ? ZN Jun02 0:00 \_ [classifier-train] <defunct>
root 1639345 0.0 0.0 0 0 ? ZN Jun02 0:00 \_ [classifier-train] <defunct>
root 1639347 0.0 0.0 0 0 ? ZN Jun02 0:00 \_ [classifier-train] <defunct>
root 1639349 0.0 0.0 0 0 ? ZN Jun02 0:00 \_ [classifier-train] <defunct>
Here are the relevant code snippets:
import numpy as np
from tensorflow import keras


def get_keras_model():
    # some code here
    model = keras.models.Model(
        inputs=(input_layer_1, input_layer_2),
        outputs=prediction_layer,
    )
    model.compile(loss=..., optimizer=..., metrics=...)
    return model


def preprocess(data):
    # Some code here to convert string values into numpy arrays of dtype=np.uint32
    return X, y


class DataSequence(keras.utils.Sequence):
    def __init__(self, data, preprocess_func, keys, batch_size=4096):
        self.keys = keys
        self.data = data
        self.batch_size = batch_size
        self.preprocess_func = preprocess_func

    def __len__(self):
        # returns the number of batches
        return int(np.ceil(len(self.keys) / float(self.batch_size)))

    def __getitem__(self, idx):
        # select the keys that belong to batch number idx
        keys = self.keys[idx * self.batch_size : (idx + 1) * self.batch_size]
        return self.preprocess_func([self.data[key] for key in keys])


def train(model, data, preprocess):
    train_sequence = DataSequence(data, preprocess, list(data.keys()))
    history = model.fit(
        x=train_sequence,
        epochs=15,
        steps_per_epoch=len(train_sequence),
        verbose=2,
        workers=8,
        use_multiprocessing=True,
    )
    return model, history


data = {
    "key_1": {"name": "black", "y": 0},
    "key_2": {"name": "white", "y": 1},
    # up to 70M docs in this dictionary
}
model = get_keras_model()
model, history = train(model, data, preprocess)  # model training hangs
Log Output:
Multiple "Caught signal 15. Terminating." log messages are displayed, even when the training script finishes execution without any zombie processes. The same is true of the "Exception in thread Thread-##" output: it also occurs when model training is not affected by zombie processes and finishes normally.
Jun 09 14:16:22 mlworker tm[575915]: 2022-06-09 14:16:22,024 - MainThread - INFO - Start working on fold 1/5
Jun 09 14:16:22 mlworker tm[575915]: 2022-06-09 14:16:22.725522: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instruc>
Jun 09 14:16:22 mlworker tm[575915]: To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
Jun 09 14:16:23 mlworker tm[575915]: 2022-06-09 14:16:23.439638: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1532] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 6882 MB memory: -> device: 0, name: Tesla P4, p>
Jun 09 14:16:23 mlworker tm[575915]: 2022-06-09 14:16:23,709 - MainThread - INFO - Fitting model ...
Jun 09 14:16:24 mlworker tm[575915]: Epoch 1/15
Jun 09 14:16:31 mlworker tm[575915]: 3/3 - 7s - loss: 6.9878 - acc: 1.0908e-04 - 7s/epoch - 2s/step
Jun 09 14:16:31 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:31 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:31 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:31 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:31 mlworker tm[575915]: Epoch 2/15
Jun 09 14:16:34 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:34 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:34 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:34 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:34 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:34 mlworker tm[575915]: 3/3 - 3s - loss: 6.9392 - acc: 0.0055 - 3s/epoch - 1s/step
...
Jun 09 14:16:48 mlworker tm[575915]: Epoch 7/15
Jun 09 14:16:51 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:51 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:51 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:51 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:51 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:51 mlworker tm[575915]: Exception in thread Thread-87:
Jun 09 14:16:51 mlworker tm[575915]: Traceback (most recent call last):
Jun 09 14:16:51 mlworker tm[575915]: File "/usr/lib/python3.9/threading.py", line 954, in _bootstrap_inner
Jun 09 14:16:51 mlworker tm[575915]: self.run()
Jun 09 14:16:51 mlworker tm[575915]: File "/usr/lib/python3.9/threading.py", line 892, in run
Jun 09 14:16:51 mlworker tm[575915]: self._target(*self._args, **self._kwargs)
Jun 09 14:16:51 mlworker tm[575915]: File "/opt/venvs/classifier-training-repo/lib/python3.9/site-packages/keras/utils/data_utils.py", line 759, in _run
Jun 09 14:16:51 mlworker tm[575915]: with closing(self.executor_fn(_SHARED_SEQUENCES)) as executor:
Jun 09 14:16:51 mlworker tm[575915]: File "/opt/venvs/classifier-training-repo/lib/python3.9/site-packages/keras/utils/data_utils.py", line 736, in pool_fn
Jun 09 14:16:51 mlworker tm[575915]: pool = get_pool_class(True)(
Jun 09 14:16:51 mlworker tm[575915]: File "/usr/lib/python3.9/multiprocessing/context.py", line 119, in Pool
Jun 09 14:16:51 mlworker tm[575915]: return Pool(processes, initializer, initargs, maxtasksperchild,
Jun 09 14:16:51 mlworker tm[575915]: File "/usr/lib/python3.9/multiprocessing/pool.py", line 212, in __init__
Jun 09 14:16:51 mlworker tm[575915]: self._repopulate_pool()
Jun 09 14:16:51 mlworker tm[575915]: File "/usr/lib/python3.9/multiprocessing/pool.py", line 303, in _repopulate_pool
Jun 09 14:16:51 mlworker tm[575915]: return self._repopulate_pool_static(self._ctx, self.Process,
Jun 09 14:16:51 mlworker tm[575915]: File "/usr/lib/python3.9/multiprocessing/pool.py", line 326, in _repopulate_pool_static
Jun 09 14:16:51 mlworker tm[575915]: w.start()
Jun 09 14:16:51 mlworker tm[575915]: File "/usr/lib/python3.9/multiprocessing/process.py", line 121, in start
Jun 09 14:16:51 mlworker tm[575915]: self._popen = self._Popen(self)
Jun 09 14:16:51 mlworker tm[575915]: File "/usr/lib/python3.9/multiprocessing/context.py", line 277, in _Popen
Jun 09 14:16:51 mlworker tm[575915]: return Popen(process_obj)
Jun 09 14:16:51 mlworker tm[575915]: File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 19, in __init__
Jun 09 14:16:51 mlworker tm[575915]: self._launch(process_obj)
Jun 09 14:16:51 mlworker tm[575915]: File "/usr/lib/python3.9/multiprocessing/popen_fork.py", line 73, in _launch
Jun 09 14:16:51 mlworker tm[575915]: os._exit(code)
Jun 09 14:16:51 mlworker tm[575915]: File "/usr/lib/python3/dist-packages/solute/click.py", line 727, in raiser
Jun 09 14:16:51 mlworker tm[575915]: raise Termination(128 + signo)
Jun 09 14:16:51 mlworker tm[575915]: solute.click.Termination: 143
Jun 09 14:16:52 mlworker tm[575915]: 3/3 - 3s - loss: 5.7624 - acc: 0.0726 - 3s/epoch - 1s/step
Jun 09 14:16:52 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:52 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:52 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:52 mlworker tm[575915]: Epoch 8/15
Jun 09 14:16:55 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:55 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:55 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:55 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:55 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:55 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:16:55 mlworker tm[575915]: 3/3 - 3s - loss: 5.6978 - acc: 0.1000 - 3s/epoch - 1s/step
...
Jun 09 14:17:02 mlworker tm[575915]: Epoch 11/15
Jun 09 14:17:05 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:17:05 mlworker tm[575915]: 3/3 - 3s - loss: 5.5029 - acc: 0.0804 - 3s/epoch - 1s/step
Jun 09 14:17:06 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:17:06 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:17:06 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:17:06 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:17:06 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:17:06 mlworker tm[575915]: Epoch 12/15
Jun 09 14:17:09 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:17:09 mlworker tm[575915]: Caught signal 15. Terminating.
Jun 09 14:17:09 mlworker tm[575915]: Caught signal 15. Terminating.
No further log output appears after the last message. The processes have to be killed with sudo kill -SIGKILL, and model training has to be restarted.
System information:
I have encountered the same problem on different machines with different GPUs and different Python versions.
OS Platform and Distribution: Debian GNU/Linux 11 (bullseye), Ubuntu 20.04.4 LTS
TensorFlow version: v2.9.0-18-gd8ce9f9c301 2.9.1 (Debian 11), v2.9.0-18-gd8ce9f9c301 2.9.1 (Ubuntu LTS)
Python version: Python 3.9.2 (Debian 11), Python 3.8.10 (Ubuntu LTS)
GPU model and memory: Tesla T4 (16 GB) on Debian 11, Tesla P4 (8 GB) on another Debian 11 machine, GeForce GTX 1080 Ti (12 GB) on Ubuntu LTS

We solved the problem with the following line at the start of the script:
signal.signal(signal.SIGTERM, signal.SIG_DFL)
Explanation:
We had a custom SIGTERM handler in our script. The forked worker processes inherited it, so it interfered with the SIGTERMs that the multiprocessing pool sends to its workers when it shuts down. This one line restores Python's default handler for SIGTERM and avoids running into unresponsive subprocesses.
There was no bug in the TensorFlow or Keras code :)
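For anyone who wants to see the fix in isolation, here is a minimal sketch. The Termination class and raising_handler are hypothetical stand-ins for the custom handler visible in the traceback (they are not the real solute.click code), and restore_default_sigterm is just a helper name chosen for this example.

```python
import signal


class Termination(SystemExit):
    """Hypothetical stand-in for the exception a custom handler raises."""


def raising_handler(signo, frame):
    # Hypothetical custom handler: it raises instead of letting the
    # process die from the signal, similar in spirit to the traceback.
    raise Termination(128 + signo)


def restore_default_sigterm():
    """Reset SIGTERM to the default action and return the previous handler."""
    return signal.signal(signal.SIGTERM, signal.SIG_DFL)


if __name__ == "__main__":
    # Simulate a framework that installed its own SIGTERM handler.
    signal.signal(signal.SIGTERM, raising_handler)

    # The one-line fix: run this before model.fit() starts worker processes,
    # so forked workers inherit the default action instead of the handler.
    previous = restore_default_sigterm()

    print("previous handler was custom:", previous is raising_handler)
    print("default restored:", signal.getsignal(signal.SIGTERM) == signal.SIG_DFL)
```

Note that this must run in the main thread, because Python only allows signal handlers to be changed there. If the rest of the application still needs the custom handler, the returned previous handler can be reinstalled after training.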

Related

GCE 8 GPU instance randomly reboots while training is running

I have an 8 GPU GCE instance that randomly reboots in the middle of a training routine. This has happened a couple of times. The instance also appears to stay down for quite a while before it comes back up. I found a dump in the kernel log that looks like it might be the cause (?). Any ideas what I can do about this?
The configuration is pretty ordinary: an Ubuntu instance running a Python 3 TensorFlow app that trains on images, with the Nvidia drivers and the CUDA toolkit installed.
The log is shown below. The last few lines show the system booting up, but only after nearly 10 hours, it appears:
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749736] Call Trace:
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749737] <IRQ> [<ffffffff813f8dd3>] dump_stack+0x63/0x90
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749746] [<ffffffff810ddd33>] __report_bad_irq+0x33/0xc0
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749747] [<ffffffff810de0c7>] note_interrupt+0x247/0x290
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749749] [<ffffffff810db277>] handle_irq_event_percpu+0x167/0x1d0
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749750] [<ffffffff810db31e>] handle_irq_event+0x3e/0x60
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749752] [<ffffffff810de639>] handle_fasteoi_irq+0x99/0x150
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749756] [<ffffffff8103119d>] handle_irq+0x1d/0x30
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749758] [<ffffffff8184341b>] do_IRQ+0x4b/0xd0
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749761] [<ffffffff81841502>] common_interrupt+0x82/0x82
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749764] [<ffffffff81085d5e>] ? __do_softirq+0x7e/0x290
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749766] [<ffffffff810860e3>] irq_exit+0xa3/0xb0
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749767] [<ffffffff818434e2>] smp_apic_timer_interrupt+0x42/0x50
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749769] [<ffffffff818417a2>] apic_timer_interrupt+0x82/0x90
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749770] <EOI> [<ffffffff81064606>] ? native_safe_halt+0x6/0x10
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749775] [<ffffffff81038e1e>] default_idle+0x1e/0xe0
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749776] [<ffffffff8103962f>] arch_cpu_idle+0xf/0x20
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749780] [<ffffffff810c454a>] default_idle_call+0x2a/0x40
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749781] [<ffffffff810c48b1>] cpu_startup_entry+0x2f1/0x350
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749798] [<ffffffff810517c4>] start_secondary+0x154/0x190
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749799] handlers:
Jun 7 19:23:59 gpu-8-2 kernel: [62064.752277] [<ffffffffc2b034e0>] nvidia_isr [nvidia] threaded [<ffffffffc2b03eb0>] nvidia_isr_kthread_bh [nvidia]
Jun 7 19:23:59 gpu-8-2 kernel: [62064.762984] [<ffffffffc2b034e0>] nvidia_isr [nvidia] threaded [<ffffffffc2b03eb0>] nvidia_isr_kthread_bh [nvidia]
Jun 7 19:23:59 gpu-8-2 kernel: [62064.773705] [<ffffffffc2b034e0>] nvidia_isr [nvidia] threaded [<ffffffffc2b03eb0>] nvidia_isr_kthread_bh [nvidia]
Jun 7 19:23:59 gpu-8-2 kernel: [62064.784444] [<ffffffffc2b034e0>] nvidia_isr [nvidia] threaded [<ffffffffc2b03eb0>] nvidia_isr_kthread_bh [nvidia]
Jun 7 19:23:59 gpu-8-2 kernel: [62064.795096] Disabling IRQ #10
Jun 8 05:27:43 gpu-8-2 kernel: [ 0.000000] Initializing cgroup subsys cpuset
Jun 8 05:27:43 gpu-8-2 kernel: [ 0.000000] Initializing cgroup subsys cpu
Jun 8 05:27:43 gpu-8-2 kernel: [ 0.000000] Initializing cgroup subsys cpuacct
Jun 8 05:27:43 gpu-8-2 kernel: [ 0.000000] Linux version 4.4.0-79-generic (buildd@lcy01-30) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #100-Ubuntu SMP Wed May 17 19:58:14 UTC 2017 (Ubuntu 4.4.0-79.100-generic 4.4.67)

Aerospike not starting

My aerospike nodes do not come back from restarts with the following logs:
Jun 07 2016 20:56:23 GMT: WARNING (as): (signal.c::161) SIGSEGV received, aborting Aerospike Community Edition build 3.7.2 os ubuntu12.04
Jun 07 2016 20:56:23 GMT: WARNING (as): (signal.c::163) stacktrace: found 7 frames
Jun 07 2016 20:56:23 GMT: WARNING (as): (signal.c::163) stacktrace: frame 0: /usr/bin/asd(as_sig_handle_segv+0x5d) [0x48a1d6]
Jun 07 2016 20:56:23 GMT: WARNING (as): (signal.c::163) stacktrace: frame 1: /lib/x86_64-linux-gnu/libc.so.6(+0x36cc0) [0x7f10fb8d0cc0]
Jun 07 2016 20:56:23 GMT: WARNING (as): (signal.c::163) stacktrace: frame 2: /usr/bin/asd(build_service_list+0x40) [0x4ab065]
Jun 07 2016 20:56:23 GMT: WARNING (as): (signal.c::163) stacktrace: frame 3: /usr/bin/asd(as_config_post_process+0x304) [0x4685bd]
Jun 07 2016 20:56:23 GMT: WARNING (as): (signal.c::163) stacktrace: frame 4: /usr/bin/asd(main+0x228) [0x45fd08]
Jun 07 2016 20:56:23 GMT: WARNING (as): (signal.c::163) stacktrace: frame 5: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f10fb8bbf45]
Jun 07 2016 20:56:23 GMT: WARNING (as): (signal.c::163) stacktrace: frame 6: /usr/bin/asd() [0x4608a9]
(gdb) info line *0x48a1d6
Line 163 of "base/signal.c" starts at address 0x48a1c7 <as_sig_handle_segv+78> and ends at 0x48a25e <as_sig_handle_segv+229>.
(gdb) info line *0x7efcdb681cc0
No line number information available for address 0x7efcdb681cc0
(gdb) info line *0x4ab065
Line 5479 of "base/thr_info.c" starts at address 0x4ab065 <build_service_list+64> and ends at 0x4ab06c <build_service_list+71>.
(gdb) info line *0x4685bd
Line 3338 of "base/cfg.c" starts at address 0x4685bd <as_config_post_process+772> and ends at 0x4685ca <as_config_post_process+785>.
(gdb) info line *0x45fd08
Line 416 of "base/as.c" starts at address 0x45fd08 <main+552> and ends at 0x45fd11 <main+561>.
(gdb) info line *0x7efcdb66cf45
No line number information available for address 0x7efcdb66cf45
(gdb) info line *0x4608a9
No line number information available for address 0x4608a9 <_start+41>
Wiping everything and setting up the node from scratch doesn't help. The i net command shows failed nodes, but I can't remove them because that operation requires a rolling restart, and nodes do not start after being stopped.
How can I figure out why Aerospike fails to start?

Got it. The crash happens here:
Line 5479 of "base/thr_info.c" starts at address 0x4ab065 <build_service_list+64> and ends at 0x4ab06c <build_service_list+71>.
I had 254 aliases on the internal interface. I removed 253 of them and now Aerospike starts fine.
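To check whether a node is in a similar situation, one can count the addresses configured on each interface. The sketch below parses the one-line output format of ip -o -4 addr show. The sample text is illustrative and simplified (not taken from the post), count_ipv4_addresses is a name made up for this example, and the post does not say how many addresses are too many.

```python
from collections import Counter


def count_ipv4_addresses(ip_addr_output):
    """Count IPv4 addresses per interface in `ip -o -4 addr show` output."""
    counts = Counter()
    for line in ip_addr_output.splitlines():
        fields = line.split()
        # One-line format: "<index>: <ifname> inet <addr>/<prefix> ..."
        if len(fields) >= 4 and fields[2] == "inet":
            counts[fields[1]] += 1
    return counts


# Illustrative, simplified sample: one primary address and two aliases on eth0.
SAMPLE = """\
1: lo    inet 127.0.0.1/8 scope host lo
2: eth0    inet 10.0.0.5/24 brd 10.0.0.255 scope global eth0
2: eth0    inet 10.0.0.6/24 brd 10.0.0.255 scope global secondary eth0:1
2: eth0    inet 10.0.0.7/24 brd 10.0.0.255 scope global secondary eth0:2
"""

if __name__ == "__main__":
    for name, total in count_ipv4_addresses(SAMPLE).items():
        print(name, total)
```

On a real node the input would come from running ip -o -4 addr show and passing its text to the function.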

Customized SUSE image not running in Google Compute Engine

I have uploaded a customized image and created a VM instance from it, but I am unable to SSH into it. Following the troubleshooting guidelines, I attached the root persistent disk, and from the log file "/var/log/messages" I found that the VM instance repeatedly boots and terminates. The log file is below:
"
Nov 26 11:40:28 linux syslog-ng[1997]: syslog-ng starting up; version='2.0.9'
Nov 26 11:40:33 linux rchal: CPU frequency scaling is not supported by your processor.
Nov 26 11:40:33 linux rchal: boot with 'CPUFREQ=no' in to avoid this warning.
Nov 26 11:40:33 linux rchal: Cannot load cpufreq governors - No cpufreq driver available
Nov 26 11:40:33 linux kernel: klogd 1.4.1, log source = /proc/kmsg started.
Nov 26 11:40:33 linux kernel: [ 18.645145] bootsplash: status on console 0 changed to on
Nov 26 11:40:57 linux kernel: [ 57.972129] Uniform Multi-Platform E-IDE driver
Nov 26 11:40:57 linux kernel: [ 57.988151] ide-cd driver 5.00
Nov 26 11:40:57 linux kernel: [ 58.089061] st: Version 20101219, fixed bufsize 32768, s/g segs 256
Nov 26 11:41:02 linux kernel: [ 62.944338] eth1: no IPv6 routers present
Nov 26 11:41:02 linux kernel: [ 63.259092] eth0: no IPv6 routers present
Nov 26 17:11:16 linux su: (to root) root on none
Nov 26 17:11:26 linux SuSEfirewall2: Setting up rules from /etc/sysconfig/SuSEfirewall2 ...
Nov 26 17:11:27 linux SuSEfirewall2: using default zone 'ext' for interface eth1
Nov 26 17:11:27 linux kernel: [ 88.008142] ip6_tables: (C) 2000-2006 Netfilter Core Team
Nov 26 17:11:27 linux kernel: [ 88.252544] ip_tables: (C) 2000-2006 Netfilter Core Team
Nov 26 17:11:27 linux kernel: [ 88.296835] nf_conntrack version 0.5.0 (7168 buckets, 28672 max)
Nov 26 17:11:28 linux SuSEfirewall2: batch committing...
Nov 26 17:11:29 linux SuSEfirewall2: Firewall rules successfully set
Nov 26 17:11:42 linux ifdown: eth0
Nov 26 17:11:44 linux ifdown: eth1
Nov 26 17:11:55 linux ifup: lo
Nov 26 17:11:55 linux ifup: lo
Nov 26 17:11:55 linux ifup: IP address: 127.0.0.1/8
Nov 26 17:11:55 linux ifup:
Nov 26 17:11:55 linux ifup:
Nov 26 17:11:55 linux ifup: IP address: 127.0.0.2/8
Nov 26 17:11:55 linux ifup:
Nov 26 17:11:56 linux ifup: eth0
Nov 26 17:11:56 linux ifup: eth0
Nov 26 17:11:57 linux ifup: IP address: 10.203.92.100/24
Nov 26 17:11:57 linux ifup:
Nov 26 17:11:58 linux SuSEfirewall2: /var/lock/SuSEfirewall2.booting exists which means system boot in progress, exit.
Nov 26 17:11:59 linux ifup: eth1
Nov 26 17:11:59 linux ifup: eth1
Nov 26 17:11:59 linux ifup: IP address: 192.168.17.250/24
Nov 26 17:11:59 linux ifup:
Nov 26 17:12:01 linux SuSEfirewall2: /var/lock/SuSEfirewall2.booting exists which means system boot in progress, exit.
Nov 26 17:12:02 linux ifup: tap0
Nov 26 17:12:03 linux kernel: [ 124.153528] tun: Universal TUN/TAP device driver, 1.6
Nov 26 17:12:03 linux kernel: [ 124.153528] tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
Nov 26 17:12:03 linux kernel: [ 124.219136] ADDRCONF(NETDEV_UP): tap0: link is not ready
Nov 26 17:12:04 linux SuSEfirewall2: /var/lock/SuSEfirewall2.booting exists which means system boot in progress, exit.
Nov 26 17:12:04 linux SuSEfirewall2: Setting up rules from /etc/sysconfig/SuSEfirewall2 ...
Nov 26 17:12:04 linux SuSEfirewall2: using default zone 'ext' for interface eth1
Nov 26 17:12:06 linux SuSEfirewall2: batch committing...
Nov 26 17:12:06 linux SuSEfirewall2: Firewall rules successfully set
Nov 26 17:12:31 linux SuSEfirewall2: batch committing...
Nov 26 17:12:31 linux SuSEfirewall2: Firewall rules unloaded.
Nov 26 17:12:31 linux SuSEfirewall2: Setting up rules from /etc/sysconfig/SuSEfirewall2 ...
Nov 26 17:12:32 linux SuSEfirewall2: using default zone 'ext' for interface eth1
Nov 26 17:12:33 linux SuSEfirewall2: batch committing...
Nov 26 17:12:33 linux SuSEfirewall2: Firewall rules successfully set
Nov 26 17:12:39 linux init: Re-reading inittab
Nov 26 17:12:45 linux ifdown: tap0
Nov 26 17:12:48 linux ifdown: eth0
Nov 26 17:12:50 linux ifdown: eth1
Nov 26 17:12:55 linux init: Entering runlevel: 3
Nov 26 17:12:56 linux SuSEfirewall2: batch committing...
Nov 26 17:12:56 linux SuSEfirewall2: Firewall rules set to CLOSE.
Nov 26 17:12:57 linux kernel: Kernel logging (proc) stopped.
Nov 26 17:12:57 linux kernel: Kernel log daemon terminating.
Nov 26 17:12:57 linux syslog-ng[1997]: Termination requested via signal, terminating;
Nov 26 17:12:57 linux syslog-ng[1997]: syslog-ng shutting down; version='2.0.9'
Nov 26 17:12:57 deepak syslog-ng[8245]: syslog-ng starting up; version='2.0.9'
Nov 26 17:12:57 deepak firmware.sh[8273]: Cannot find firmware file 'intel-ucode/06-17-0a'
Nov 26 17:13:02 deepak kernel: klogd 1.4.1, log source = /proc/kmsg started.
Nov 26 17:13:02 deepak kernel: [ 178.548747] microcode: CPU0 sig=0x1067a, pf=0x40, revision=0x60b
Nov 26 17:13:02 deepak kernel: [ 178.669173] microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
Nov 26 17:13:02 deepak kernel: [ 178.824111] microcode: CPU0 update to revision 0xa0b failed
Nov 26 17:13:05 deepak ifup: lo
Nov 26 17:13:05 deepak ifup: lo
Nov 26 17:13:05 deepak ifup: IP address: 127.0.0.1/8
Nov 26 17:13:05 deepak ifup:
Nov 26 17:13:05 deepak ifup:
Nov 26 17:13:06 deepak ifup: IP address: 127.0.0.2/8
Nov 26 17:13:06 deepak ifup:
Nov 26 17:13:07 deepak ifup: eth0
Nov 26 17:13:07 deepak ifup: eth0
Nov 26 17:13:07 deepak ifup: IP address: 10.203.92.100/24
Nov 26 17:13:07 deepak ifup:
Nov 26 17:13:08 deepak SuSEfirewall2: /var/lock/SuSEfirewall2.booting exists which means system boot in progress, exit.
Nov 26 17:13:09 deepak ifup: eth1
Nov 26 17:13:09 deepak ifup: eth1
Nov 26 17:13:09 deepak ifup: IP address: 192.168.17.250/24
Nov 26 17:13:09 deepak ifup:
Nov 26 17:13:10 deepak SuSEfirewall2: /var/lock/SuSEfirewall2.booting exists which means system boot in progress, exit.
Nov 26 17:13:16 deepak ifup: tap0
Nov 26 17:13:16 deepak kernel: [ 197.436436] ADDRCONF(NETDEV_UP): tap0: link is not ready
Nov 26 17:13:17 deepak SuSEfirewall2: /var/lock/SuSEfirewall2.booting exists which means system boot in progress, exit.
Nov 26 17:13:18 deepak auditd[9654]: Started dispatcher: /sbin/audispd pid: 9656
Nov 26 17:13:18 deepak kernel: [ 199.072126] auditd (9654): /proc/9654/oom_adj is deprecated, please use /proc/9654/oom_score_adj instead.
Nov 26 17:13:18 deepak audispd: priority_boost_parser called with: 4
Nov 26 17:13:18 deepak audispd: af_unix plugin initialized
Nov 26 17:13:18 deepak audispd: audispd initialized with q_depth=80 and 1 active plugins
Nov 26 17:13:18 deepak auditd[9654]: Init complete, auditd 1.7.7 listening for events (startup state disable)
Nov 26 17:13:18 deepak haveged: haveged starting up
Nov 26 17:13:18 deepak haveged: arch: x86 vendor: intel generic: 0 i_cache: 32 d_cache: 32 loop_idx: 30 loop_idxmax: 40 loop_sz: 31836 loop_szmax: 124334 etime: 30361 havege_ndpt 0
Nov 26 17:13:19 deepak kernel: [ 200.624132] BIOS EDD facility v0.16 2004-Jun-25, 1 devices found
Nov 26 17:13:20 deepak mcelog: mcelog read: No such device
Nov 26 17:21:10 deepak shadow[30512]: new group added - group=db2iadm1, gid=113, by=0
Nov 26 17:21:10 deepak shadow[30512]: running GROUPADD_CMD command - script=/usr/sbin/groupadd.local, account=db2iadm1, uid=113, gid=0, home=, by=0
Nov 26 17:21:11 deepak useradd[30526]: new account added - account=db2admin, uid=1005, gid=113, home=/home/db2admin, shell=/bin/bash, by=0
Nov 26 17:21:11 deepak useradd[30526]: account added to group - account=db2admin, group=video, gid=33, by=0
Nov 26 17:21:11 deepak useradd[30526]: account added to group - account=db2admin, group=dialout, gid=16, by=0
Nov 26 17:21:11 deepak useradd[30526]: home directory created - account=db2admin, uid=1005, home=/home/db2admin, by=0
Nov 26 17:21:11 deepak useradd[30526]: running USERADD_CMD command - script=/usr/sbin/useradd.local, account=db2admin, uid=1005, gid=113, home=/home/db2admin, by=0
Nov 26 17:21:11 deepak shadow[30530]: GID 113 is not unique - by=0
Nov 26 17:21:11 deepak shadow[30533]: new group added - group=db2fadm1, gid=114, by=0
Nov 26 17:21:11 deepak shadow[30533]: running GROUPADD_CMD command - script=/usr/sbin/groupadd.local, account=db2fadm1, uid=114, gid=0, home=, by=0
Nov 26 17:21:11 deepak useradd[30537]: new account added - account=db2fenc1, uid=1006, gid=114, home=/home/db2fenc1, shell=/bin/bash, by=0
Nov 26 17:21:11 deepak useradd[30537]: account added to group - account=db2fenc1, group=video, gid=33, by=0
Nov 26 17:21:11 deepak useradd[30537]: account added to group - account=db2fenc1, group=dialout, gid=16, by=0
Nov 26 17:21:11 deepak useradd[30537]: home directory created - account=db2fenc1, uid=1006, home=/home/db2fenc1, by=0
Nov 26 17:21:11 deepak useradd[30537]: running USERADD_CMD command - script=/usr/sbin/useradd.local, account=db2fenc1, uid=1006, gid=114, home=/home/db2fenc1, by=0
Nov 26 17:21:16 deepak su: (to db2admin) root on none
Nov 26 17:21:33 deepak su: (to db2admin) root on none
Nov 26 17:21:44 deepak su: (to db2admin) root on none
Nov 26 17:21:55 deepak su: (to db2admin) root on none
Nov 26 17:21:57 deepak su: (to db2admin) root on none
Nov 26 17:22:14 deepak su: (to db2admin) root on none
Nov 26 17:22:28 deepak su: (to db2admin) root on none
Nov 26 17:22:41 deepak su: (to db2admin) root on none
Nov 26 17:22:55 deepak su: (to db2admin) root on none
Nov 26 17:23:08 deepak su: (to db2admin) root on none
Nov 26 17:23:22 deepak su: (to db2admin) root on none
Nov 26 17:23:35 deepak su: (to db2admin) root on none
Nov 26 17:23:49 deepak su: (to db2admin) root on none
Nov 26 17:24:02 deepak su: (to db2admin) root on none
Nov 26 17:24:16 deepak su: (to db2admin) root on none
Nov 26 17:24:30 deepak su: (to db2admin) root on none
Nov 26 17:24:45 deepak su: (to db2admin) root on none
Nov 26 17:25:12 deepak su: (to db2admin) root on none
Nov 26 17:27:32 deepak su: (to db2admin) root on none
Nov 26 17:27:40 deepak su: (to db2admin) root on none
Nov 26 17:27:49 deepak su: (to db2admin) root on none
Nov 26 17:31:35 deepak su: (to db2admin) root on none
Nov 26 17:32:11 deepak auditd[9654]: The audit daemon is exiting.
Nov 26 17:32:12 deepak auditd[22290]: Started dispatcher: /sbin/audispd pid: 22292
Nov 26 17:32:12 deepak audispd: priority_boost_parser called with: 4
Nov 26 17:32:12 deepak audispd: af_unix plugin initialized
Nov 26 17:32:12 deepak audispd: audispd initialized with q_depth=80 and 1 active plugins
Nov 26 17:32:12 deepak auditd[22290]: Init complete, auditd 1.7.7 listening for events (startup state disable)
Nov 26 17:32:12 deepak shadow[22299]: group already exists - group=ns_admin, by=0
Nov 26 17:32:12 deepak shadow[22302]: account removed from group - account=sas, group=users, gid=100, by=0
Nov 26 17:32:12 deepak shadow[22302]: account removed from group - account=sas, group=ns_admin, gid=36, by=0
Nov 26 17:32:12 deepak shadow[22309]: account removed from group - account=mani, group=users, gid=100, by=0
Nov 26 17:32:12 deepak shadow[22309]: account removed from group - account=mani, group=ns_admin, gid=36, by=0
Nov 26 17:32:13 deepak shadow[22310]: account removed from group - account=vivek, group=users, gid=100, by=0
Nov 26 17:32:13 deepak shadow[22310]: account removed from group - account=vivek, group=ns_admin, gid=36, by=0
Nov 26 17:32:16 deepak sshd[22356]: Server listening on 0.0.0.0 port 4422.
Nov 26 17:32:16 deepak sshd[22356]: Server listening on :: port 4422.
Nov 26 17:32:16 deepak /usr/sbin/cron[22393]: (CRON) STARTUP (V5.0)
Nov 26 17:32:17 deepak smartd[22406]: smartd 5.39 2008-10-24 22:33 [x86_64-suse-linux-gnu] (openSUSE RPM) Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net
Nov 26 17:32:17 deepak smartd[22406]: Opened configuration file /etc/smartd.conf
Nov 26 17:32:17 deepak smartd[22406]: Drive: DEVICESCAN, implied '-a' Directive on line 26 of file /etc/smartd.conf
Nov 26 17:32:17 deepak smartd[22406]: Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
Nov 26 17:32:17 deepak smartd[22406]: Device: /dev/sda, type changed from 'scsi' to 'sat'
Nov 26 17:32:17 deepak smartd[22406]: Device: /dev/sda [SAT], opened
Nov 26 17:32:17 deepak smartd[22406]: Device: /dev/sda [SAT], not found in smartd database.
Nov 26 17:32:17 deepak smartd[22406]: Device: /dev/sda [SAT], lacks SMART capability
Nov 26 17:32:17 deepak smartd[22406]: Device: /dev/sda [SAT], to proceed anyway, use '-T permissive' Directive.
Nov 26 17:32:17 deepak smartd[22406]: Unable to monitor any SMART enabled devices. Try debug (-d) option. Exiting...
Nov 26 17:32:18 deepak SuSEfirewall2: Setting up rules from /etc/sysconfig/SuSEfirewall2 ...
Nov 26 17:32:18 deepak SuSEfirewall2: using default zone 'ext' for interface eth1
Nov 26 17:32:20 deepak SuSEfirewall2: batch committing...
Nov 26 17:32:21 deepak SuSEfirewall2: Firewall rules successfully set
Nov 26 17:33:14 deepak shutdown[22325]: shutting down for system reboot
Nov 26 17:33:14 deepak init: Switching to runlevel: 6
Nov 26 17:33:21 deepak kernel: [ 1401.996463] bootsplash: status on console 0 changed to on
Nov 26 17:33:24 deepak sshd[22356]: Received signal 15; terminating.
Nov 26 17:33:25 deepak auditd[22290]: The audit daemon is exiting.
Nov 26 17:33:25 deepak haveged: haveged stopping due to signal 15
Nov 26 17:33:26 deepak su: (to db2admin) root on /dev/console
Nov 26 17:33:45 deepak kernel: Kernel logging (proc) stopped.
Nov 26 17:33:45 deepak kernel: Kernel log daemon terminating.
Nov 26 17:33:45 deepak syslog-ng[8245]: Termination requested via signal, terminating;
Nov 26 17:33:45 deepak syslog-ng[8245]: syslog-ng shutting down; version='2.0.9'
Nov 26 17:34:51 deepak syslog-ng[1137]: syslog-ng starting up; version='2.0.9'
Nov 26 17:34:52 deepak firmware.sh[1165]: Cannot find firmware file 'intel-ucode/06-17-0a'
Nov 26 17:34:53 deepak rchal: CPU frequency scaling is not supported by your processor."
If anyone can tell from the log why this is happening or how to resolve it, please comment.

Look at the instance's serial port output to find more debug messages from your instance.
Because the instance keeps being rebooted and terminated, you will not be able to SSH into it. There are suggestions at this link for the error "CPU frequency scaling is not supported by your processor"
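As a rough sketch of how to pull the serial port output from a script, assuming the current gcloud CLI is installed and authenticated: the helper names below are made up for this example, and the instance name and zone are placeholders.

```python
import subprocess


def serial_port_output_command(instance, zone):
    """Build the gcloud argv that prints an instance's serial console output."""
    return [
        "gcloud", "compute", "instances", "get-serial-port-output",
        instance, "--zone", zone,
    ]


def fetch_serial_port_output(instance, zone):
    """Run the command and return its stdout (needs an authenticated gcloud)."""
    result = subprocess.run(
        serial_port_output_command(instance, zone),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # "my-suse-vm" and "us-central1-a" are placeholder values.
    print(" ".join(serial_port_output_command("my-suse-vm", "us-central1-a")))
```

The serial console usually contains boot-time kernel and init messages that never reach /var/log/messages, which is why it is worth checking when an instance reboots before SSH comes up.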

mod_wsgi is compiled in one version and running in a different version even after following the given steps

I am getting an error when I run the Apache server through my client. After going through the log, I understood that mod_wsgi was compiled against Python 2.6 but uses Python 2.7 at runtime. After some research on the Internet, I followed the steps below:
You have to recompile mod-python and/or mod-wsgi.
Remove mods
apt-get remove libapache2-mod-python libapache2-mod-wsgi
Get dependencies
apt-get build-dep libapache2-mod-python libapache2-mod-wsgi
Build mod-python
mkdir /tmp/python
cd /tmp/python
apt-get source libapache2-mod-python
cd libapache2-mod-python-[x.x.x]
dpkg-buildpackage -rfakeroot -b
Build mod-wsgi
mkdir /tmp/wsgi
cd /tmp/wsgi
apt-get source libapache2-mod-wsgi
cd mod-wsgi-[x.x.x]
dpkg-buildpackage -rfakeroot -b
Install newly compiled packages
dpkg -i /tmp/python/libapache2-mod-python-[x.x].deb /tmp/wsgi/libapache2-mod-wsgi-[x.x].deb
It was of no use. The version has now changed to 3.2: mod_wsgi is compiled for Python 3.2 instead of 2.6, but the Python used at runtime is still 2.7. I am also worried about the disk space consumed by the steps above. Please help me get my Apache server running successfully again.
error.log:
[Wed Aug 21 11:48:11 2013] [warn] mod_wsgi: Compiled for Python/2.7.2+.
[Wed Aug 21 11:48:11 2013] [warn] mod_wsgi: Runtime using Python/2.7.3.
[Wed Aug 21 11:48:11 2013] [notice] Apache/2.2.22 (Ubuntu) mod_wsgi/3.3 Python/2.7.3 configured -- resuming normal operations
[Wed Aug 21 11:48:36 2013] [notice] caught SIGTERM, shutting down
[Wed Aug 21 22:48:29 2013] [error] child process 1226 still did not exit, sending a SIGKILL
[Wed Aug 21 22:48:30 2013] [notice] caught SIGTERM, shutting down
[Wed Aug 21 22:56:17 2013] [warn] mod_wsgi: Compiled for Python/2.7.2+.
[Wed Aug 21 22:56:17 2013] [warn] mod_wsgi: Runtime using Python/2.7.3.
[Wed Aug 21 22:56:17 2013] [notice] Apache/2.2.22 (Ubuntu) mod_wsgi/3.3 Python/2.7.3 configured -- resuming normal operations
[Thu Aug 22 01:32:12 2013] [notice] caught SIGTERM, shutting down
[Thu Aug 22 01:32:26 2013] [warn] mod_wsgi: Compiled for Python/2.7.2+.
[Thu Aug 22 01:32:26 2013] [warn] mod_wsgi: Runtime using Python/2.7.3.
[Thu Aug 22 01:32:26 2013] [notice] Apache/2.2.22 (Ubuntu) mod_wsgi/3.3 Python/2.7.3 configured -- resuming normal operations
[Thu Aug 22 04:04:48 2013] [notice] child pid 11212 exit signal Segmentation fault (11)
[Thu Aug 22 04:04:48 2013] [notice] caught SIGTERM, shutting down
[Thu Aug 22 04:04:56 2013] [notice] mod_python: Creating 8 session mutexes based on 6 max processes and 25 max threads.
[Thu Aug 22 04:04:56 2013] [notice] mod_python: using mutex_directory /tmp
[Thu Aug 22 04:04:56 2013] [warn] mod_wsgi: Compiled for Python/3.2.3.
[Thu Aug 22 04:04:56 2013] [warn] mod_wsgi: Runtime using Python/2.7.3.
[Thu Aug 22 04:04:56 2013] [notice] Apache/2.2.22 (Ubuntu) mod_python/3.3.1 Python/2.7.3 mod_wsgi/3.3 configured -- resuming normal operations
Thank you

Don't load mod_python and mod_wsgi at the same time if you don't need to. They are likely compiled against different Python versions. See the following for an explanation of the mismatch you are seeing.
http://code.google.com/p/modwsgi/wiki/InstallationIssues#Python_Version_Mismatch
If you do need both, they must both be compiled for the same version.
These days there is generally no good reason to be using mod_python for new projects.
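To check which kind of mismatch a given error.log shows, the two mod_wsgi warning lines can be compared programmatically. The following is a minimal sketch, not part of mod_wsgi: the function names are hypothetical, and it assumes (as I understand the linked page) that a difference in the major.minor version is the serious case, while a patch-level difference such as 2.7.2+ vs 2.7.3 is usually harmless.

```python
import re

# Matches the two warning lines mod_wsgi writes to the Apache error log.
_VERSION_LINE = re.compile(
    r"mod_wsgi: (Compiled for|Runtime using) Python/(\d+)\.(\d+)"
)


def wsgi_python_versions(log_lines):
    """Return ((major, minor) compiled, (major, minor) runtime) from log lines.

    Later lines override earlier ones, so the result reflects the most
    recent Apache start in the log. Either element is None if not found.
    """
    compiled = runtime = None
    for line in log_lines:
        match = _VERSION_LINE.search(line)
        if match is None:
            continue
        version = (int(match.group(2)), int(match.group(3)))
        if match.group(1) == "Compiled for":
            compiled = version
        else:
            runtime = version
    return compiled, runtime


def is_major_minor_mismatch(compiled, runtime):
    """True only when both versions are known and differ in major.minor."""
    return compiled is not None and runtime is not None and compiled != runtime


if __name__ == "__main__":
    sample = [
        "[Thu Aug 22 04:04:56 2013] [warn] mod_wsgi: Compiled for Python/3.2.3.",
        "[Thu Aug 22 04:04:56 2013] [warn] mod_wsgi: Runtime using Python/2.7.3.",
    ]
    compiled, runtime = wsgi_python_versions(sample)
    print(compiled, runtime, is_major_minor_mismatch(compiled, runtime))
```

Run against the log in the question, the last Apache start (compiled for 3.2.3, running 2.7.3) is flagged, while the earlier starts (2.7.2+ vs 2.7.3) are not.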

Just to add: I had libapache2-mod-python installed, and I uninstalled it with
sudo apt-get remove libapache2-mod-python
After that I got past the above error:
[Thu Aug 22 01:32:26 2013] [warn] mod_wsgi: Compiled for Python/2.7.2+.
[Thu Aug 22 01:32:26 2013] [warn] mod_wsgi: Runtime using Python/2.7.3.

DigitalOcean Deploy - "ImportError: No module named flask.ext.wtf"

All, I'm trying to deploy a simple page using Flask/Apache on a DigitalOcean server. It all works fine locally, but not on the server.
__init__.py contains the statement:
from flask.ext.wtf import Form,TextField
try_me.wsgi is:
#!/usr/bin/python
import sys
import logging
logging.basicConfig(stream=sys.stderr)
sys.path.insert(0,"/var/www/try_me/")
from try_me import app as application
application.secret_key = 'Add your secret key'
Getting the following error (from /var/log/apache2/error.log):
[Mon Jul 29 14:16:50 2013] [warn] mod_wsgi: Compiled for Python/2.7.2+.
[Mon Jul 29 14:16:50 2013] [warn] mod_wsgi: Runtime using Python/2.7.3.
[Mon Jul 29 14:16:50 2013] [notice] Apache/2.2.22 (Debian) mod_wsgi/3.3 Python/2.7.3 configured -- resuming normal operations
[Mon Jul 29 14:16:57 2013] [error] [client 74.66.8.166] mod_wsgi (pid=2920): Target WSGI script '/var/www/try_me/try_me.wsgi' cannot be loaded as Python module.
[Mon Jul 29 14:16:57 2013] [error] [client 74.66.8.166] mod_wsgi (pid=2920): Exception occurred processing WSGI script '/var/www/try_me/try_me.wsgi'.
[Mon Jul 29 14:16:57 2013] [error] [client 74.66.8.166] Traceback (most recent call last):
[Mon Jul 29 14:16:57 2013] [error] [client 74.66.8.166] File "/var/www/try_me/try_me.wsgi", line 7, in <module>
[Mon Jul 29 14:16:57 2013] [error] [client 74.66.8.166] from try_me import app as application
[Mon Jul 29 14:16:57 2013] [error] [client 74.66.8.166] File "/var/www/try_me/try_me/__init__.py", line 4, in <module>
[Mon Jul 29 14:16:57 2013] [error] [client 74.66.8.166] from flask.ext.wtf import Form
[Mon Jul 29 14:16:57 2013] [error] [client 74.66.8.166] File "/usr/local/lib/python2.7/dist-packages/flask/exthook.py", line 87, in load_module
[Mon Jul 29 14:16:57 2013] [error] [client 74.66.8.166] raise ImportError('No module named %s' % fullname)
[Mon Jul 29 14:16:57 2013] [error] [client 74.66.8.166] ImportError: No module named flask.ext.wtf
I was able to import manually in a python interpreter (under virtualenv):
Python 2.7.3 (default, Jan 2 2013, 16:53:07)
[GCC 4.7.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from flask.ext.wtf import Form
>>> Form
<class 'flask_wtf.form.Form'>
Any ideas on how to proceed here?

SOLUTION (following Cathy's advice above and dAnjou's below): the virtualenv must be activated from the WSGI script. Executing activate_this.py at the top of the script solved the problem:
#!/usr/bin/python
activate_this = '/var/www/try_me/venv/bin/activate_this.py'
execfile(activate_this, dict(__file__=activate_this))
import sys
..
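Note that execfile exists only in Python 2, and activate_this.py is shipped by virtualenv but may not be present in every environment. As an alternative, here is a minimal sketch that puts the environment's site-packages directory on sys.path with the standard library's site.addsitedir. The helper name add_virtualenv is hypothetical, the lib/pythonX.Y/site-packages layout is an assumption that holds for typical Unix virtualenvs, and the sketch only helps if the virtualenv was built for the same Python version that mod_wsgi runs.

```python
import os
import site
import sys


def add_virtualenv(venv_root, python_dir=None):
    """Add a virtualenv's site-packages to sys.path and return its path.

    venv_root  -- root of the virtualenv, e.g. '/var/www/try_me/venv'
    python_dir -- name of the lib subdirectory; defaults to the running
                  interpreter's 'pythonX.Y' (assumed Unix layout)
    """
    if python_dir is None:
        python_dir = "python%d.%d" % sys.version_info[:2]
    site_packages = os.path.join(venv_root, "lib", python_dir, "site-packages")
    if not os.path.isdir(site_packages):
        raise IOError("no site-packages directory at %s" % site_packages)
    # addsitedir also processes any .pth files found in the directory.
    site.addsitedir(site_packages)
    return site_packages
```

In the WSGI script this would be called before the application import, for example add_virtualenv('/var/www/try_me/venv') followed by the existing sys.path.insert and from try_me import app as application lines.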
