GCE 8 GPU instance randomly reboots while training is running

I have an 8 GPU GCE instance that randomly reboots in the middle of a training run. This has happened a couple of times. The instance also appears to stay down for quite a while before it comes back up. I found a dump in the kernel log that looks like it might be the cause. Any ideas what I can do about this?
The configuration is pretty ordinary: an Ubuntu instance running a Python 3 TensorFlow app that trains on images, with the NVIDIA drivers and the CUDA toolkit installed.
The log is shown below. The last few lines show the system booting up again, but only after nearly 10 hours.
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749736] Call Trace:
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749737] <IRQ> [<ffffffff813f8dd3>] dump_stack+0x63/0x90
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749746] [<ffffffff810ddd33>] __report_bad_irq+0x33/0xc0
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749747] [<ffffffff810de0c7>] note_interrupt+0x247/0x290
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749749] [<ffffffff810db277>] handle_irq_event_percpu+0x167/0x1d0
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749750] [<ffffffff810db31e>] handle_irq_event+0x3e/0x60
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749752] [<ffffffff810de639>] handle_fasteoi_irq+0x99/0x150
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749756] [<ffffffff8103119d>] handle_irq+0x1d/0x30
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749758] [<ffffffff8184341b>] do_IRQ+0x4b/0xd0
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749761] [<ffffffff81841502>] common_interrupt+0x82/0x82
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749764] [<ffffffff81085d5e>] ? __do_softirq+0x7e/0x290
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749766] [<ffffffff810860e3>] irq_exit+0xa3/0xb0
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749767] [<ffffffff818434e2>] smp_apic_timer_interrupt+0x42/0x50
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749769] [<ffffffff818417a2>] apic_timer_interrupt+0x82/0x90
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749770] <EOI> [<ffffffff81064606>] ? native_safe_halt+0x6/0x10
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749775] [<ffffffff81038e1e>] default_idle+0x1e/0xe0
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749776] [<ffffffff8103962f>] arch_cpu_idle+0xf/0x20
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749780] [<ffffffff810c454a>] default_idle_call+0x2a/0x40
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749781] [<ffffffff810c48b1>] cpu_startup_entry+0x2f1/0x350
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749798] [<ffffffff810517c4>] start_secondary+0x154/0x190
Jun 7 19:23:59 gpu-8-2 kernel: [62064.749799] handlers:
Jun 7 19:23:59 gpu-8-2 kernel: [62064.752277] [<ffffffffc2b034e0>] nvidia_isr [nvidia] threaded [<ffffffffc2b03eb0>] nvidia_isr_kthread_bh [nvidia]
Jun 7 19:23:59 gpu-8-2 kernel: [62064.762984] [<ffffffffc2b034e0>] nvidia_isr [nvidia] threaded [<ffffffffc2b03eb0>] nvidia_isr_kthread_bh [nvidia]
Jun 7 19:23:59 gpu-8-2 kernel: [62064.773705] [<ffffffffc2b034e0>] nvidia_isr [nvidia] threaded [<ffffffffc2b03eb0>] nvidia_isr_kthread_bh [nvidia]
Jun 7 19:23:59 gpu-8-2 kernel: [62064.784444] [<ffffffffc2b034e0>] nvidia_isr [nvidia] threaded [<ffffffffc2b03eb0>] nvidia_isr_kthread_bh [nvidia]
Jun 7 19:23:59 gpu-8-2 kernel: [62064.795096] Disabling IRQ #10
Jun 8 05:27:43 gpu-8-2 kernel: [ 0.000000] Initializing cgroup subsys cpuset
Jun 8 05:27:43 gpu-8-2 kernel: [ 0.000000] Initializing cgroup subsys cpu
Jun 8 05:27:43 gpu-8-2 kernel: [ 0.000000] Initializing cgroup subsys cpuacct
Jun 8 05:27:43 gpu-8-2 kernel: [ 0.000000] Linux version 4.4.0-79-generic (buildd@lcy01-30) (gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.4) ) #100-Ubuntu SMP Wed May 17 19:58:14 UTC 2017 (Ubuntu 4.4.0-79.100-generic 4.4.67)
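
To put a number on how long the instance stayed down, the timestamps in the log can be compared directly. The following is a minimal sketch, not a diagnosis of the reboot itself: `parse_stamp`, `find_gaps`, the 5-minute threshold, and the assumed year are illustrative choices (syslog lines carry no year, so 2017 is taken from the kernel build date in the log).

```python
from datetime import datetime, timedelta

ASSUMED_YEAR = 2017  # assumption: syslog omits the year; taken from the kernel build date


def parse_stamp(line, year=ASSUMED_YEAR):
    """Parse the leading 'Mon D HH:MM:SS' syslog timestamp; return None if absent."""
    parts = line.split()[:3]
    if len(parts) < 3:
        return None
    try:
        return datetime.strptime(f"{year} {' '.join(parts)}", "%Y %b %d %H:%M:%S")
    except ValueError:
        return None  # continuation line or other text without a timestamp


def find_gaps(lines, threshold=timedelta(minutes=5)):
    """Return (before, after, gap) for consecutive timestamps further apart than threshold."""
    gaps = []
    previous = None
    for line in lines:
        current = parse_stamp(line)
        if current is None:
            continue
        if previous is not None and current - previous > threshold:
            gaps.append((previous, current, current - previous))
        previous = current
    return gaps


# Two lines copied from the log above: the last one before the silence and the first after it.
sample = [
    "Jun 7 19:23:59 gpu-8-2 kernel: [62064.795096] Disabling IRQ #10",
    "Jun 8 05:27:43 gpu-8-2 kernel: [ 0.000000] Initializing cgroup subsys cpuset",
]
for before, after, gap in find_gaps(sample):
    print(f"silent from {before} to {after}: {gap}")
```

For the two sample lines this reports a gap of 10:03:44, which matches the "nearly 10 hours" of downtime described above. Running it over the full log file (for example, `find_gaps(open(path))` with your own path) would list every such silence.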

Related

Redis server crash on MacOS 11

I installed Redis using the Rosetta terminal, but when I run "redis-server" I get this error. I am on the new 2020 MacBook Pro with Apple Silicon.
redis-server
42116:C 21 Nov 2020 20:07:12.620 # oO0OoO0OoO0Oo Redis is starting oO0OoO0OoO0Oo
42116:C 21 Nov 2020 20:07:12.620 # Redis version=6.0.9, bits=64, commit=00000000, modified=0, pid=42116, just started
42116:C 21 Nov 2020 20:07:12.620 # Warning: no config file specified, using the default config. In order to specify a config file use redis-server /path/to/redis.conf
42116:M 21 Nov 2020 20:07:12.620 * Increased maximum number of open files to 10032 (it was originally set to 2560).
=== REDIS BUG REPORT START: Cut & paste starting from here ===
42116:M 21 Nov 2020 20:07:12.622 # Redis 6.0.9 crashed by signal: 11, si_code: 2
42116:M 21 Nov 2020 20:07:12.622 # Crashed running the instruction at: 0x7fff20371430
42116:M 21 Nov 2020 20:07:12.622 # Accessing address: 0x3046d2000
42116:M 21 Nov 2020 20:07:12.622 # Killed by PID: 0, UID: 0
42116:M 21 Nov 2020 20:07:12.622 # Failed assertion: <no assertion failed> (<no file>:0)
------ STACK TRACE ------
EIP:
0 libsystem_platform.dylib 0x00007fff20371430 _platform_memset$VARIANT$Rosetta + 108
Backtrace:
0 redis-server 0x00000001000e4bb7 logStackTrace + 110
1 redis-server 0x00000001000e4fd5 sigsegvHandler + 271
2 libsystem_platform.dylib 0x00007fff2036ed7d _sigtramp + 29
3 libsystem_malloc.dylib 0x00007fff201547aa tiny_free_no_lock + 1116
4 redis-server 0x00000001001350c3 luaD_call + 97
5 ??? 0x0000000032aaaba2 0x0 + 850045858
------ INFO OUTPUT ------
# Server redis_version:6.0.9 redis_git_sha1:00000000 redis_git_dirty:0 redis_build_id:ec508acaad782189 redis_mode:standalone os:Darwin 20.1.0 x86_64 arch_bits:64 multiplexing_api:kqueue atomicvar_api:atomic-builtin gcc_version:4.2.1 process_id:42116 run_id:3456c4d545624d4cbf42d4b85695b8f4cb6ce250 tcp_port:6379 uptime_in_seconds:0 uptime_in_days:0 hz:10 configured_hz:10 lru_clock:12150112 executable:/Users/leonardo/Dropbox/dev/redis/redis-stable/redis-server config_file: io_threads_active:0
# Clients connected_clients:0 client_recent_max_input_buffer:0 client_recent_max_output_buffer:0 blocked_clients:0 tracking_clients:0 clients_in_timeout_table:0
# Memory used_memory:1019360 used_memory_human:995.47K used_memory_rss:0 used_memory_rss_human:0B used_memory_peak:1019360 used_memory_peak_human:995.47K used_memory_peak_perc:inf% used_memory_overhead:0 used_memory_startup:0 used_memory_dataset:1019360 used_memory_dataset_perc:100.00% allocator_allocated:0 allocator_active:0 allocator_resident:0 total_system_memory:8589934592 total_system_memory_human:8.00G used_memory_lua:37888 used_memory_lua_human:37.00K used_memory_scripts:0 used_memory_scripts_human:0B number_of_cached_scripts:0 maxmemory:0 maxmemory_human:0B maxmemory_policy:noeviction allocator_frag_ratio:nan allocator_frag_bytes:0 allocator_rss_ratio:nan allocator_rss_bytes:0 rss_overhead_ratio:nan rss_overhead_bytes:0 mem_fragmentation_ratio:nan mem_fragmentation_bytes:0 mem_not_counted_for_evict:0 mem_replication_backlog:0 mem_clients_slaves:0 mem_clients_normal:0 mem_aof_buffer:0 mem_allocator:libc active_defrag_running:0 lazyfree_pending_objects:0
# Persistence loading:0 rdb_changes_since_last_save:0 rdb_bgsave_in_progress:0 rdb_last_save_time:1605985632 rdb_last_bgsave_status:ok rdb_last_bgsave_time_sec:-1 rdb_current_bgsave_time_sec:-1 rdb_last_cow_size:0 aof_enabled:0 aof_rewrite_in_progress:0 aof_rewrite_scheduled:0 aof_last_rewrite_time_sec:-1 aof_current_rewrite_time_sec:-1 aof_last_bgrewrite_status:ok aof_last_write_status:ok aof_last_cow_size:0 module_fork_in_progress:0 module_fork_last_cow_size:0
# Stats total_connections_received:0 total_commands_processed:0 instantaneous_ops_per_sec:0 total_net_input_bytes:0 total_net_output_bytes:0 instantaneous_input_kbps:0.00 instantaneous_output_kbps:0.00 rejected_connections:0 sync_full:0 sync_partial_ok:0 sync_partial_err:0 expired_keys:0 expired_stale_perc:0.00 expired_time_cap_reached_count:0 expire_cycle_cpu_milliseconds:0 evicted_keys:0 keyspace_hits:0 keyspace_misses:0 pubsub_channels:0 pubsub_patterns:0 latest_fork_usec:0 migrate_cached_sockets:0 slave_expires_tracked_keys:0 active_defrag_hits:0 active_defrag_misses:0 active_defrag_key_hits:0 active_defrag_key_misses:0 tracking_total_keys:0 tracking_total_items:0 tracking_total_prefixes:0 unexpected_error_replies:0 total_reads_processed:0 total_writes_processed:0 io_threaded_reads_processed:0 io_threaded_writes_processed:0
# Replication role:master connected_slaves:0 master_replid:b00cc4f1203a9a29b81236248b7ebc68c567f4ad master_replid2:0000000000000000000000000000000000000000 master_repl_offset:0 second_repl_offset:-1 repl_backlog_active:0 repl_backlog_size:1048576 repl_backlog_first_byte_offset:0 repl_backlog_histlen:0
# CPU used_cpu_sys:0.004632 used_cpu_user:0.007445 used_cpu_sys_children:0.000000 used_cpu_user_children:0.000000
# Modules
# Commandstats
# Cluster cluster_enabled:0
# Keyspace
------ CLIENT LIST OUTPUT ------
------ REGISTERS ------
42116:M 21 Nov 2020 20:07:12.623 # RAX:00000003046d1c80 RBX:0000000000000013 RCX:00000003046d2000 RDX:00007f9b90d338ae RDI:00000003046d1c18 RSI:0000000000000000 RBP:00000003046d1a40 RSP:00000003046d1858 R8 :0000000000000000 R9 :00000003046d1910 R10:00000001001507b3 R11:ffffffffffffffff R12:00000003046d1ae0 R13:00000000000000ff R14:0000000100151127 R15:0000000100181740 RIP:00007fff20371430 EFL:0000000000000202 CS :000000000000002b FS:0000000000000000 GS:0000000000000000
42116:M 21 Nov 2020 20:07:12.623 # (00000003046d1867) -> 0000000108c36a00
42116:M 21 Nov 2020 20:07:12.623 # (00000003046d1866) -> 0000000000000006
42116:M 21 Nov 2020 20:07:12.623 # (00000003046d1865) -> 0000000000000000
42116:M 21 Nov 2020 20:07:12.623 # (00000003046d1864) -> 0000000000002800
42116:M 21 Nov 2020 20:07:12.623 # (00000003046d1863) -> 0000000000000000
42116:M 21 Nov 2020 20:07:12.623 # (00000003046d1862) -> 00007fff20152020
42116:M 21 Nov 2020 20:07:12.623 # (00000003046d1861) -> 000000010015fbca
42116:M 21 Nov 2020 20:07:12.623 # (00000003046d1860) -> 00000001000f34d6
42116:M 21 Nov 2020 20:07:12.623 # (00000003046d185f) -> 00000003046d19d0
42116:M 21 Nov 2020 20:07:12.623 # (00000003046d185e) -> 00007f9e85400000
42116:M 21 Nov 2020 20:07:12.623 # (00000003046d185d) -> 0000000100152d38
42116:M 21 Nov 2020 20:07:12.623 # (00000003046d185c) -> 00000000000018eb
42116:M 21 Nov 2020 20:07:12.623 # (00000003046d185b) -> 000000010014dbcd
42116:M 21 Nov 2020 20:07:12.623 # (00000003046d185a) -> 00000003046d1c18
42116:M 21 Nov 2020 20:07:12.623 # (00000003046d1859) -> 00007f9e85407da0
42116:M 21 Nov 2020 20:07:12.623 # (00000003046d1858) -> 0000000100103ebb
------ MODULES INFO OUTPUT ------
------ DUMPING CODE AROUND EIP ------
Symbol: _platform_memset$VARIANT$Rosetta (base: 0x7fff203713c4)
Module: /usr/lib/system/libsystem_platform.dylib (base 0x7fff2036b000)
$ xxd -r -p /tmp/dump.hex /tmp/dump.bin
$ objdump --adjust-vma=0x7fff203713c4 -D -b binary -m i386:x86-64 /tmp/dump.bin
------ 42116:M 21 Nov 2020 20:07:12.623 # dump of function (hexdump of 236 bytes): 81e6ff00000048b90101010101010101480faff14889f94883fa400f82360100004881fa008000000f82a00000000faef0480fc337480fc37708480fc37710480fc37718480fc37720480fc37728480fc37730480fc37738488d4f404883e1c04801fa488d41404829c27631480fc331480fc37108480fc37110480fc37118480fc37120480fc37128480fc37130480fc371384883c1404883ea4077cf4801d1480fc331480fc37108480fc37110480fc37118480fc37120480fc37128480fc37130480fc371380faef84889f8c3488937488977084889771048897718488977204889772848897730488977
=== REDIS BUG REPORT END. Make sure to include from START to END. ===
Please report the crash by opening an issue on github:
http://github.com/redis/redis/issues
Suspect RAM error? Use redis-server --test-memory to verify it.
zsh: segmentation fault redis-server

Running out of memory can cause the Redis service to crash. During peak times, the Redis service may need more memory than is currently allocated.
To check the current configuration and memory usage, run the following command in the CLI. It reports the role, peak used memory, maxmemory, evicted keys, and Redis uptime in days:
redis-cli -p REDIS_PORT -h REDIS_HOST info | egrep --color "(role|used_memory_peak|maxmemory|evicted_keys|uptime_in_days)"
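
The same fields can also be read programmatically from INFO-style text. This is a rough sketch under stated assumptions: `parse_info` and `memory_pressure` are hypothetical helper names, and the sample values are copied from the crash report in the question. Note that in that report maxmemory is 0, which Redis treats as "no limit", and used_memory is about 1 MB, so this check alone does not point to memory exhaustion there.

```python
def parse_info(text):
    """Turn 'key:value' lines of INFO output into a dict, skipping '# Section' headers."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, value = line.split(":", 1)
        fields[key] = value
    return fields


def memory_pressure(fields):
    """Return used_memory / maxmemory, or None when maxmemory is 0 (no limit configured)."""
    limit = int(fields.get("maxmemory", 0))
    if limit == 0:
        return None
    return int(fields["used_memory"]) / limit


# Values taken from the INFO section of the crash report in the question.
sample = """# Memory
used_memory:1019360
used_memory_peak:1019360
maxmemory:0
# Stats
evicted_keys:0
"""
info = parse_info(sample)
print("pressure:", memory_pressure(info), "evicted_keys:", info["evicted_keys"])
```

A ratio close to 1.0 together with a growing evicted_keys count would support the memory explanation; a result of None means no limit was configured at all.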

[UPDATE]: Per the latest on the Redis GitHub issue linked below, a fix was merged.
You can either build Redis locally from the latest master or wait for the next public release (the current version is 6.0.9), which will likely include the fix.
I believe the Redis team is still working on support here:
https://github.com/redis/redis/issues/8062
Per that link, you might be able to run Redis under sudo.

Customized SUSE image not running in Google Compute Engine

I uploaded a customized image and created a VM instance from it, but I am unable to SSH into it. Following the troubleshooting guidelines, I attached the root persistent disk and found in the log file "/var/log/messages" that the VM instance repeatedly boots and shuts down. The log file is below.
"
Nov 26 11:40:28 linux syslog-ng[1997]: syslog-ng starting up; version='2.0.9'
Nov 26 11:40:33 linux rchal: CPU frequency scaling is not supported by your processor.
Nov 26 11:40:33 linux rchal: boot with 'CPUFREQ=no' in to avoid this warning.
Nov 26 11:40:33 linux rchal: Cannot load cpufreq governors - No cpufreq driver available
Nov 26 11:40:33 linux kernel: klogd 1.4.1, log source = /proc/kmsg started.
Nov 26 11:40:33 linux kernel: [ 18.645145] bootsplash: status on console 0 changed to on
Nov 26 11:40:57 linux kernel: [ 57.972129] Uniform Multi-Platform E-IDE driver
Nov 26 11:40:57 linux kernel: [ 57.988151] ide-cd driver 5.00
Nov 26 11:40:57 linux kernel: [ 58.089061] st: Version 20101219, fixed bufsize 32768, s/g segs 256
Nov 26 11:41:02 linux kernel: [ 62.944338] eth1: no IPv6 routers present
Nov 26 11:41:02 linux kernel: [ 63.259092] eth0: no IPv6 routers present
Nov 26 17:11:16 linux su: (to root) root on none
Nov 26 17:11:26 linux SuSEfirewall2: Setting up rules from /etc/sysconfig/SuSEfirewall2 ...
Nov 26 17:11:27 linux SuSEfirewall2: using default zone 'ext' for interface eth1
Nov 26 17:11:27 linux kernel: [ 88.008142] ip6_tables: (C) 2000-2006 Netfilter Core Team
Nov 26 17:11:27 linux kernel: [ 88.252544] ip_tables: (C) 2000-2006 Netfilter Core Team
Nov 26 17:11:27 linux kernel: [ 88.296835] nf_conntrack version 0.5.0 (7168 buckets, 28672 max)
Nov 26 17:11:28 linux SuSEfirewall2: batch committing...
Nov 26 17:11:29 linux SuSEfirewall2: Firewall rules successfully set
Nov 26 17:11:42 linux ifdown: eth0
Nov 26 17:11:44 linux ifdown: eth1
Nov 26 17:11:55 linux ifup: lo
Nov 26 17:11:55 linux ifup: lo
Nov 26 17:11:55 linux ifup: IP address: 127.0.0.1/8
Nov 26 17:11:55 linux ifup:
Nov 26 17:11:55 linux ifup:
Nov 26 17:11:55 linux ifup: IP address: 127.0.0.2/8
Nov 26 17:11:55 linux ifup:
Nov 26 17:11:56 linux ifup: eth0
Nov 26 17:11:56 linux ifup: eth0
Nov 26 17:11:57 linux ifup: IP address: 10.203.92.100/24
Nov 26 17:11:57 linux ifup:
Nov 26 17:11:58 linux SuSEfirewall2: /var/lock/SuSEfirewall2.booting exists which means system boot in progress, exit.
Nov 26 17:11:59 linux ifup: eth1
Nov 26 17:11:59 linux ifup: eth1
Nov 26 17:11:59 linux ifup: IP address: 192.168.17.250/24
Nov 26 17:11:59 linux ifup:
Nov 26 17:12:01 linux SuSEfirewall2: /var/lock/SuSEfirewall2.booting exists which means system boot in progress, exit.
Nov 26 17:12:02 linux ifup: tap0
Nov 26 17:12:03 linux kernel: [ 124.153528] tun: Universal TUN/TAP device driver, 1.6
Nov 26 17:12:03 linux kernel: [ 124.153528] tun: (C) 1999-2004 Max Krasnyansky <maxk@qualcomm.com>
Nov 26 17:12:03 linux kernel: [ 124.219136] ADDRCONF(NETDEV_UP): tap0: link is not ready
Nov 26 17:12:04 linux SuSEfirewall2: /var/lock/SuSEfirewall2.booting exists which means system boot in progress, exit.
Nov 26 17:12:04 linux SuSEfirewall2: Setting up rules from /etc/sysconfig/SuSEfirewall2 ...
Nov 26 17:12:04 linux SuSEfirewall2: using default zone 'ext' for interface eth1
Nov 26 17:12:06 linux SuSEfirewall2: batch committing...
Nov 26 17:12:06 linux SuSEfirewall2: Firewall rules successfully set
Nov 26 17:12:31 linux SuSEfirewall2: batch committing...
Nov 26 17:12:31 linux SuSEfirewall2: Firewall rules unloaded.
Nov 26 17:12:31 linux SuSEfirewall2: Setting up rules from /etc/sysconfig/SuSEfirewall2 ...
Nov 26 17:12:32 linux SuSEfirewall2: using default zone 'ext' for interface eth1
Nov 26 17:12:33 linux SuSEfirewall2: batch committing...
Nov 26 17:12:33 linux SuSEfirewall2: Firewall rules successfully set
Nov 26 17:12:39 linux init: Re-reading inittab
Nov 26 17:12:45 linux ifdown: tap0
Nov 26 17:12:48 linux ifdown: eth0
Nov 26 17:12:50 linux ifdown: eth1
Nov 26 17:12:55 linux init: Entering runlevel: 3
Nov 26 17:12:56 linux SuSEfirewall2: batch committing...
Nov 26 17:12:56 linux SuSEfirewall2: Firewall rules set to CLOSE.
Nov 26 17:12:57 linux kernel: Kernel logging (proc) stopped.
Nov 26 17:12:57 linux kernel: Kernel log daemon terminating.
Nov 26 17:12:57 linux syslog-ng[1997]: Termination requested via signal, terminating;
Nov 26 17:12:57 linux syslog-ng[1997]: syslog-ng shutting down; version='2.0.9'
Nov 26 17:12:57 deepak syslog-ng[8245]: syslog-ng starting up; version='2.0.9'
Nov 26 17:12:57 deepak firmware.sh[8273]: Cannot find firmware file 'intel-ucode/06-17-0a'
Nov 26 17:13:02 deepak kernel: klogd 1.4.1, log source = /proc/kmsg started.
Nov 26 17:13:02 deepak kernel: [ 178.548747] microcode: CPU0 sig=0x1067a, pf=0x40, revision=0x60b
Nov 26 17:13:02 deepak kernel: [ 178.669173] microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba
Nov 26 17:13:02 deepak kernel: [ 178.824111] microcode: CPU0 update to revision 0xa0b failed
Nov 26 17:13:05 deepak ifup: lo
Nov 26 17:13:05 deepak ifup: lo
Nov 26 17:13:05 deepak ifup: IP address: 127.0.0.1/8
Nov 26 17:13:05 deepak ifup:
Nov 26 17:13:05 deepak ifup:
Nov 26 17:13:06 deepak ifup: IP address: 127.0.0.2/8
Nov 26 17:13:06 deepak ifup:
Nov 26 17:13:07 deepak ifup: eth0
Nov 26 17:13:07 deepak ifup: eth0
Nov 26 17:13:07 deepak ifup: IP address: 10.203.92.100/24
Nov 26 17:13:07 deepak ifup:
Nov 26 17:13:08 deepak SuSEfirewall2: /var/lock/SuSEfirewall2.booting exists which means system boot in progress, exit.
Nov 26 17:13:09 deepak ifup: eth1
Nov 26 17:13:09 deepak ifup: eth1
Nov 26 17:13:09 deepak ifup: IP address: 192.168.17.250/24
Nov 26 17:13:09 deepak ifup:
Nov 26 17:13:10 deepak SuSEfirewall2: /var/lock/SuSEfirewall2.booting exists which means system boot in progress, exit.
Nov 26 17:13:16 deepak ifup: tap0
Nov 26 17:13:16 deepak kernel: [ 197.436436] ADDRCONF(NETDEV_UP): tap0: link is not ready
Nov 26 17:13:17 deepak SuSEfirewall2: /var/lock/SuSEfirewall2.booting exists which means system boot in progress, exit.
Nov 26 17:13:18 deepak auditd[9654]: Started dispatcher: /sbin/audispd pid: 9656
Nov 26 17:13:18 deepak kernel: [ 199.072126] auditd (9654): /proc/9654/oom_adj is deprecated, please use /proc/9654/oom_score_adj instead.
Nov 26 17:13:18 deepak audispd: priority_boost_parser called with: 4
Nov 26 17:13:18 deepak audispd: af_unix plugin initialized
Nov 26 17:13:18 deepak audispd: audispd initialized with q_depth=80 and 1 active plugins
Nov 26 17:13:18 deepak auditd[9654]: Init complete, auditd 1.7.7 listening for events (startup state disable)
Nov 26 17:13:18 deepak haveged: haveged starting up
Nov 26 17:13:18 deepak haveged: arch: x86 vendor: intel generic: 0 i_cache: 32 d_cache: 32 loop_idx: 30 loop_idxmax: 40 loop_sz: 31836 loop_szmax: 124334 etime: 30361 havege_ndpt 0
Nov 26 17:13:19 deepak kernel: [ 200.624132] BIOS EDD facility v0.16 2004-Jun-25, 1 devices found
Nov 26 17:13:20 deepak mcelog: mcelog read: No such device
Nov 26 17:21:10 deepak shadow[30512]: new group added - group=db2iadm1, gid=113, by=0
Nov 26 17:21:10 deepak shadow[30512]: running GROUPADD_CMD command - script=/usr/sbin/groupadd.local, account=db2iadm1, uid=113, gid=0, home=, by=0
Nov 26 17:21:11 deepak useradd[30526]: new account added - account=db2admin, uid=1005, gid=113, home=/home/db2admin, shell=/bin/bash, by=0
Nov 26 17:21:11 deepak useradd[30526]: account added to group - account=db2admin, group=video, gid=33, by=0
Nov 26 17:21:11 deepak useradd[30526]: account added to group - account=db2admin, group=dialout, gid=16, by=0
Nov 26 17:21:11 deepak useradd[30526]: home directory created - account=db2admin, uid=1005, home=/home/db2admin, by=0
Nov 26 17:21:11 deepak useradd[30526]: running USERADD_CMD command - script=/usr/sbin/useradd.local, account=db2admin, uid=1005, gid=113, home=/home/db2admin, by=0
Nov 26 17:21:11 deepak shadow[30530]: GID 113 is not unique - by=0
Nov 26 17:21:11 deepak shadow[30533]: new group added - group=db2fadm1, gid=114, by=0
Nov 26 17:21:11 deepak shadow[30533]: running GROUPADD_CMD command - script=/usr/sbin/groupadd.local, account=db2fadm1, uid=114, gid=0, home=, by=0
Nov 26 17:21:11 deepak useradd[30537]: new account added - account=db2fenc1, uid=1006, gid=114, home=/home/db2fenc1, shell=/bin/bash, by=0
Nov 26 17:21:11 deepak useradd[30537]: account added to group - account=db2fenc1, group=video, gid=33, by=0
Nov 26 17:21:11 deepak useradd[30537]: account added to group - account=db2fenc1, group=dialout, gid=16, by=0
Nov 26 17:21:11 deepak useradd[30537]: home directory created - account=db2fenc1, uid=1006, home=/home/db2fenc1, by=0
Nov 26 17:21:11 deepak useradd[30537]: running USERADD_CMD command - script=/usr/sbin/useradd.local, account=db2fenc1, uid=1006, gid=114, home=/home/db2fenc1, by=0
Nov 26 17:21:16 deepak su: (to db2admin) root on none
Nov 26 17:21:33 deepak su: (to db2admin) root on none
Nov 26 17:21:44 deepak su: (to db2admin) root on none
Nov 26 17:21:55 deepak su: (to db2admin) root on none
Nov 26 17:21:57 deepak su: (to db2admin) root on none
Nov 26 17:22:14 deepak su: (to db2admin) root on none
Nov 26 17:22:28 deepak su: (to db2admin) root on none
Nov 26 17:22:41 deepak su: (to db2admin) root on none
Nov 26 17:22:55 deepak su: (to db2admin) root on none
Nov 26 17:23:08 deepak su: (to db2admin) root on none
Nov 26 17:23:22 deepak su: (to db2admin) root on none
Nov 26 17:23:35 deepak su: (to db2admin) root on none
Nov 26 17:23:49 deepak su: (to db2admin) root on none
Nov 26 17:24:02 deepak su: (to db2admin) root on none
Nov 26 17:24:16 deepak su: (to db2admin) root on none
Nov 26 17:24:30 deepak su: (to db2admin) root on none
Nov 26 17:24:45 deepak su: (to db2admin) root on none
Nov 26 17:25:12 deepak su: (to db2admin) root on none
Nov 26 17:27:32 deepak su: (to db2admin) root on none
Nov 26 17:27:40 deepak su: (to db2admin) root on none
Nov 26 17:27:49 deepak su: (to db2admin) root on none
Nov 26 17:31:35 deepak su: (to db2admin) root on none
Nov 26 17:32:11 deepak auditd[9654]: The audit daemon is exiting.
Nov 26 17:32:12 deepak auditd[22290]: Started dispatcher: /sbin/audispd pid: 22292
Nov 26 17:32:12 deepak audispd: priority_boost_parser called with: 4
Nov 26 17:32:12 deepak audispd: af_unix plugin initialized
Nov 26 17:32:12 deepak audispd: audispd initialized with q_depth=80 and 1 active plugins
Nov 26 17:32:12 deepak auditd[22290]: Init complete, auditd 1.7.7 listening for events (startup state disable)
Nov 26 17:32:12 deepak shadow[22299]: group already exists - group=ns_admin, by=0
Nov 26 17:32:12 deepak shadow[22302]: account removed from group - account=sas, group=users, gid=100, by=0
Nov 26 17:32:12 deepak shadow[22302]: account removed from group - account=sas, group=ns_admin, gid=36, by=0
Nov 26 17:32:12 deepak shadow[22309]: account removed from group - account=mani, group=users, gid=100, by=0
Nov 26 17:32:12 deepak shadow[22309]: account removed from group - account=mani, group=ns_admin, gid=36, by=0
Nov 26 17:32:13 deepak shadow[22310]: account removed from group - account=vivek, group=users, gid=100, by=0
Nov 26 17:32:13 deepak shadow[22310]: account removed from group - account=vivek, group=ns_admin, gid=36, by=0
Nov 26 17:32:16 deepak sshd[22356]: Server listening on 0.0.0.0 port 4422.
Nov 26 17:32:16 deepak sshd[22356]: Server listening on :: port 4422.
Nov 26 17:32:16 deepak /usr/sbin/cron[22393]: (CRON) STARTUP (V5.0)
Nov 26 17:32:17 deepak smartd[22406]: smartd 5.39 2008-10-24 22:33 [x86_64-suse-linux-gnu] (openSUSE RPM) Copyright (C) 2002-8 by Bruce Allen, http://smartmontools.sourceforge.net
Nov 26 17:32:17 deepak smartd[22406]: Opened configuration file /etc/smartd.conf
Nov 26 17:32:17 deepak smartd[22406]: Drive: DEVICESCAN, implied '-a' Directive on line 26 of file /etc/smartd.conf
Nov 26 17:32:17 deepak smartd[22406]: Configuration file /etc/smartd.conf was parsed, found DEVICESCAN, scanning devices
Nov 26 17:32:17 deepak smartd[22406]: Device: /dev/sda, type changed from 'scsi' to 'sat'
Nov 26 17:32:17 deepak smartd[22406]: Device: /dev/sda [SAT], opened
Nov 26 17:32:17 deepak smartd[22406]: Device: /dev/sda [SAT], not found in smartd database.
Nov 26 17:32:17 deepak smartd[22406]: Device: /dev/sda [SAT], lacks SMART capability
Nov 26 17:32:17 deepak smartd[22406]: Device: /dev/sda [SAT], to proceed anyway, use '-T permissive' Directive.
Nov 26 17:32:17 deepak smartd[22406]: Unable to monitor any SMART enabled devices. Try debug (-d) option. Exiting...
Nov 26 17:32:18 deepak SuSEfirewall2: Setting up rules from /etc/sysconfig/SuSEfirewall2 ...
Nov 26 17:32:18 deepak SuSEfirewall2: using default zone 'ext' for interface eth1
Nov 26 17:32:20 deepak SuSEfirewall2: batch committing...
Nov 26 17:32:21 deepak SuSEfirewall2: Firewall rules successfully set
Nov 26 17:33:14 deepak shutdown[22325]: shutting down for system reboot
Nov 26 17:33:14 deepak init: Switching to runlevel: 6
Nov 26 17:33:21 deepak kernel: [ 1401.996463] bootsplash: status on console 0 changed to on
Nov 26 17:33:24 deepak sshd[22356]: Received signal 15; terminating.
Nov 26 17:33:25 deepak auditd[22290]: The audit daemon is exiting.
Nov 26 17:33:25 deepak haveged: haveged stopping due to signal 15
Nov 26 17:33:26 deepak su: (to db2admin) root on /dev/console
Nov 26 17:33:45 deepak kernel: Kernel logging (proc) stopped.
Nov 26 17:33:45 deepak kernel: Kernel log daemon terminating.
Nov 26 17:33:45 deepak syslog-ng[8245]: Termination requested via signal, terminating;
Nov 26 17:33:45 deepak syslog-ng[8245]: syslog-ng shutting down; version='2.0.9'
Nov 26 17:34:51 deepak syslog-ng[1137]: syslog-ng starting up; version='2.0.9'
Nov 26 17:34:52 deepak firmware.sh[1165]: Cannot find firmware file 'intel-ucode/06-17-0a'
Nov 26 17:34:53 deepak rchal: CPU frequency scaling is not supported by your processor."
If anyone can tell from the log why this is happening or how to resolve it, please comment.

Look at the instance's serial port output to find more debug messages from your instance.
Because the instance keeps rebooting and shutting down, you will not be able to SSH into it. There are suggestions at this link for the error "CPU frequency scaling is not supported by your processor".
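
Alongside the serial port output, the "/var/log/messages" excerpt from the question can be summarized to see how often the system restarts. This is only a sketch that assumes the message formats seen in that excerpt; `summarize_boots` is a hypothetical helper, and a syslog-ng startup does not always mean a fresh boot (the logger can also be restarted on its own, for example after a hostname change).

```python
import re

# Patterns based on lines that appear in the /var/log/messages excerpt above.
START = re.compile(r"syslog-ng\[\d+\]: syslog-ng starting up")
REBOOT = re.compile(r"shutdown\[\d+\]: shutting down for system reboot")


def summarize_boots(lines):
    """Return timestamps of syslog-ng startups and of explicit reboot requests."""
    starts = [line[:15] for line in lines if START.search(line)]
    reboots = [line[:15] for line in lines if REBOOT.search(line)]
    return starts, reboots


# Lines copied from the log in the question.
sample = [
    "Nov 26 11:40:28 linux syslog-ng[1997]: syslog-ng starting up; version='2.0.9'",
    "Nov 26 17:12:57 deepak syslog-ng[8245]: syslog-ng starting up; version='2.0.9'",
    "Nov 26 17:33:14 deepak shutdown[22325]: shutting down for system reboot",
    "Nov 26 17:34:51 deepak syslog-ng[1137]: syslog-ng starting up; version='2.0.9'",
]
starts, reboots = summarize_boots(sample)
print("logger startups:", starts)
print("reboot requests:", reboots)
```

In the excerpt, the only explicit reboot request (17:33:14) comes from the shutdown command, which suggests looking at what invoked it rather than at the CPU frequency warning alone.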

iOS7.0 Intermittent Crashes

Team,
Since I upgraded my application to iOS 7.0 with Xcode 5, I see intermittent crashes in the application. If I compile with Xcode 4.2, I don't see any intermittent crashes on an iOS 7 device.
I suspect the crashes below are memory related, because they occur at different places in the application.
Please find the logs below.
Apr 1 14:07:57 My-AD-Username ADTCommercial[185] <Notice>: [AMPMessageDispatcherThread handleMessage:] message {jlodigkeit-41-1396382877074} received on Outbound-msg-dispatcher for processing, MIM: 29, MID: 0
Apr 1 14:07:57 My-AD-Username ReportCrash[192] <Notice>: ReportCrash acting against PID 185
Apr 1 14:07:57 My-AD-Username ReportCrash[192] <Notice>: Formulating crash report for process ADTCommercial[185]
Apr 1 14:07:57 My-AD-Username com.apple.debugserver-300.2[184] <Warning>: 21 +523.616314 sec [00b8/060b]: RNBRunLoopLaunchInferior DNBProcessLaunch() returned error: 'failed to get the task for process 185'
Apr 1 14:07:57 My-AD-Username com.apple.debugserver-300.2[184] <Warning>: error: failed to launch process /Developer/usr/bin/debugserver: failed to get the task for process 185
Apr 1 14:07:57 My-AD-Username com.apple.debugserver-300.2[184] <Warning>: 22 +0.002490 sec [00b8/1207]: error: ::read ( -1, 0x3469ec, 18446744069414585344 ) => -1 err = Bad file descriptor (0x00000009)
Apr 1 14:07:57 My-AD-Username com.apple.debugserver-300.2[184] <Warning>: Exiting.
Apr 1 14:07:57 My-AD-Username com.apple.launchd[1] (UIKitApplication:com.adt.commercial.mobility.dev[0xdd96][185]) <Warning>: (UIKitApplication:com.adt.commercial.mobility.dev[0xdd96]) Job appears to have crashed: Segmentation fault: 11
Apr 1 14:07:57 My-AD-Username backboardd[28] <Warning>: Application 'UIKitApplication:com.adt.commercial.mobility.dev[0xdd96]' exited abnormally with signal 11: Segmentation fault: 11
Apr 1 14:09:26 My-AD-Username ADTCommercial[194] <Notice>: [ADT_JobDetailsScreenServiceHistoryScreen loadData:create:preload:] called
Apr 1 14:09:26 My-AD-Username ADTCommercial[194] <Notice>: [AMPScreenManager viewWillPop:] popped ADT_ServiceHistoryDetailsScreen from history
Apr 1 14:09:27 My-AD-Username ADTCommercial[194] <Notice>: # -[JobDetailsScreen(Custom) refreshAction] #....
Apr 1 14:09:27 My-AD-Username ReportCrash[196] <Notice>: ReportCrash acting against PID 194
Apr 1 14:09:27 My-AD-Username ReportCrash[196] <Notice>: Formulating crash report for process ADTCommercial[194]
Apr 1 14:09:27 My-AD-Username com.apple.launchd[1] (UIKitApplication:com.adt.commercial.mobility.dev[0xd929][194]) <Warning>: (UIKitApplication:com.adt.commercial.mobility.dev[0xd929]) Job appears to have crashed: Segmentation fault: 11
Apr 1 14:09:27 My-AD-Username backboardd[28] <Warning>: Application 'UIKitApplication:com.adt.commercial.mobility.dev[0xd929]' exited abnormally with signal 11: Segmentation fault: 11
I believe these are memory-related crashes. Any idea why?
Thanks,
Ramesh

Error on startup for mongodb server

I'm totally new to MongoDB. I'm trying to install Locomotive CMS on my server, which is cool, but I've always used SQL/MySQL, so Mongo is unfamiliar to me.
I installed all the needed MongoDB modules, but when I run: sudo service mongod start I get an error code. When I look in the logs for the error, this is the output:
Fri Mar 21 18:13:47.186 [initandlisten] MongoDB starting : pid=5053 port=27017 dbpath=/var/lib/mongo 64-bit host=vagrant-centos64.vagrantup.com
Fri Mar 21 18:13:47.186 [initandlisten] db version v2.4.9
Fri Mar 21 18:13:47.186 [initandlisten] git version: 52fe0d21959e32a5bdbecdc62057db386e4e029c
Fri Mar 21 18:13:47.186 [initandlisten] build info: Linux ip-10-2-29-40 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_49
Fri Mar 21 18:13:47.186 [initandlisten] allocator: tcmalloc
Fri Mar 21 18:13:47.186 [initandlisten] options: { config: "/etc/mongod.conf", dbpath: "/var/lib/mongo", fork: "true", logappend: "true", logpath: "/var/log/mongo/mongod.log", pidfilepath: "/var/run/mo$
Fri Mar 21 18:13:47.192 [initandlisten] journal dir=/var/lib/mongo/journal
Fri Mar 21 18:13:47.192 [initandlisten] recover : no journal files present, no recovery needed
Fri Mar 21 18:13:47.192 [initandlisten]
Fri Mar 21 18:13:47.192 [initandlisten] ERROR: Insufficient free space for journal files
Fri Mar 21 18:13:47.192 [initandlisten] Please make at least 3379MB available in /var/lib/mongo/journal or use --smallfiles
Fri Mar 21 18:13:47.192 [initandlisten]
Fri Mar 21 18:13:47.193 [initandlisten] exception in initAndListen: 15926 Insufficient free space for journals, terminating
Fri Mar 21 18:13:47.193 dbexit:
Fri Mar 21 18:13:47.193 [initandlisten] shutdown: going to close listening sockets...
Fri Mar 21 18:13:47.193 [initandlisten] shutdown: going to flush diaglog...
Fri Mar 21 18:13:47.193 [initandlisten] shutdown: going to close sockets...
Fri Mar 21 18:13:47.193 [initandlisten] shutdown: waiting for fs preallocator...
Fri Mar 21 18:13:47.193 [initandlisten] shutdown: lock for final commit...
Fri Mar 21 18:13:47.193 [initandlisten] shutdown: final commit...
Fri Mar 21 18:13:47.193 [initandlisten] shutdown: closing all files...
Fri Mar 21 18:13:47.193 [initandlisten] closeAllFiles() finished
Fri Mar 21 18:13:47.193 [initandlisten] journalCleanup...
Fri Mar 21 18:13:47.193 [initandlisten] removeJournalFiles
Fri Mar 21 18:13:47.193 [initandlisten] shutdown: removing fs lock...
Fri Mar 21 18:13:47.193 dbexit: really exiting now
Also, when I run sudo service mongod status, the output is "mongod is stopped", so I know it's not running.
Judging by the log, the error has something to do with insufficient space, but my server has 15 GB free and I'm running with sudo, so I know it's not a permission error. How can I allocate more space, or better yet, what should I allocate more space to?
Any help is appreciated.

Add smallfiles = true to the config file that mongod reads (the log above shows /etc/mongod.conf; some installs use /etc/mongodb.conf instead), then try starting the service again. That should fix the issue.
What the setting does: set to true to modify MongoDB to use a smaller default data file size. Specifically, smallfiles reduces the initial size for data files and limits them to 512 megabytes. The smallfiles setting also reduces the size of each journal file from 1 gigabyte to 128 megabytes.
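The log asks for 3379MB in /var/lib/mongo/journal specifically. Free space elsewhere on the server does not help if that directory sits on a smaller filesystem, which is one possible explanation for the "15 GB free" confusion. Here is a minimal Python 3 sketch for checking the free space on the filesystem that holds a given path. The function names are made up for illustration, and the 3379 default is simply the figure from the log above.

```python
import shutil


def free_megabytes(path):
    """Return the free space, in whole MB, on the filesystem holding `path`."""
    return shutil.disk_usage(path).free // (1024 * 1024)


def has_room_for_journal(path, required_mb=3379):
    """True if the filesystem holding `path` has at least `required_mb` MB free.

    The 3379 default is the amount the mongod log asked for; it is not a
    general MongoDB constant.
    """
    return free_megabytes(path) >= required_mb
```

On the affected machine, calling has_room_for_journal("/var/lib/mongo") would show whether the dbpath from the log has enough room without smallfiles.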

mod_wsgi is compiled in one version and running in a different version even after following the given steps

I am getting an error when I access the Apache server from my client. After going through the log, I understood that mod_wsgi was compiled against Python 2.6 but uses Python 2.7 at runtime. After some research on the Internet, I followed the steps below:
You have to recompile mod-python and/or mod-wsgi.
Remove mods
apt-get remove libapache2-mod-python libapache2-mod-wsgi
Get dependencies
apt-get build-dep libapache2-mod-python libapache2-mod-wsgi
Build mod-python
mkdir /tmp/python
cd /tmp/python
apt-get source libapache2-mod-python
cd libapache2-mod-python-[x.x.x]
dpkg-buildpackage -rfakeroot -b
Build mod-wsgi
mkdir /tmp/wsgi
cd /tmp/wsgi
apt-get source libapache2-mod-wsgi
cd mod-wsgi-[x.x.x]
dpkg-buildpackage -rfakeroot -b
Install newly compiled packages
dpkg -i /tmp/python/libapache2-mod-python-[x.x].deb /tmp/wsgi/libapache2-mod-wsgi-[x.x].deb
It was of no use. The Python version mod_wsgi is compiled for has now changed from 2.6 to 3.2, but the Python used at runtime is still 2.7. I am also worried about the disk space consumed by the steps above. Please help: what should I do to get my Apache server running successfully again?
error.log:
[Wed Aug 21 11:48:11 2013] [warn] mod_wsgi: Compiled for Python/2.7.2+.
[Wed Aug 21 11:48:11 2013] [warn] mod_wsgi: Runtime using Python/2.7.3.
[Wed Aug 21 11:48:11 2013] [notice] Apache/2.2.22 (Ubuntu) mod_wsgi/3.3 Python/2.7.3 configured -- resuming normal operations
[Wed Aug 21 11:48:36 2013] [notice] caught SIGTERM, shutting down
[Wed Aug 21 22:48:29 2013] [error] child process 1226 still did not exit, sending a SIGKILL
[Wed Aug 21 22:48:30 2013] [notice] caught SIGTERM, shutting down
[Wed Aug 21 22:56:17 2013] [warn] mod_wsgi: Compiled for Python/2.7.2+.
[Wed Aug 21 22:56:17 2013] [warn] mod_wsgi: Runtime using Python/2.7.3.
[Wed Aug 21 22:56:17 2013] [notice] Apache/2.2.22 (Ubuntu) mod_wsgi/3.3 Python/2.7.3 configured -- resuming normal operations
[Thu Aug 22 01:32:12 2013] [notice] caught SIGTERM, shutting down
[Thu Aug 22 01:32:26 2013] [warn] mod_wsgi: Compiled for Python/2.7.2+.
[Thu Aug 22 01:32:26 2013] [warn] mod_wsgi: Runtime using Python/2.7.3.
[Thu Aug 22 01:32:26 2013] [notice] Apache/2.2.22 (Ubuntu) mod_wsgi/3.3 Python/2.7.3 configured -- resuming normal operations
[Thu Aug 22 04:04:48 2013] [notice] child pid 11212 exit signal Segmentation fault (11)
[Thu Aug 22 04:04:48 2013] [notice] caught SIGTERM, shutting down
[Thu Aug 22 04:04:56 2013] [notice] mod_python: Creating 8 session mutexes based on 6 max processes and 25 max threads.
[Thu Aug 22 04:04:56 2013] [notice] mod_python: using mutex_directory /tmp
[Thu Aug 22 04:04:56 2013] [warn] mod_wsgi: Compiled for Python/3.2.3.
[Thu Aug 22 04:04:56 2013] [warn] mod_wsgi: Runtime using Python/2.7.3.
[Thu Aug 22 04:04:56 2013] [notice] Apache/2.2.22 (Ubuntu) mod_python/3.3.1 Python/2.7.3 mod_wsgi/3.3 configured -- resuming normal operations
Thank you

Don't load mod_python and mod_wsgi at the same time if you don't need to. They are likely compiled against different Python versions. See the following for an explanation of the mismatch you are seeing.
http://code.google.com/p/modwsgi/wiki/InstallationIssues#Python_Version_Mismatch
If you do need both, they must both be compiled for the same version.
These days there is generally no good reason to be using mod_python for new projects.
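To see which interpreter mod_wsgi is really running, independent of the warnings in error.log, a tiny WSGI script that reports sys.version can help. This is only a sketch: where the file lives and how it is mapped in the Apache config (for example with WSGIScriptAlias) are assumptions, not details from the question.

```python
import sys


def application(environ, start_response):
    """Minimal WSGI app that reports the interpreter serving the request."""
    text = "Python version: %s\nPrefix: %s\n" % (sys.version, sys.prefix)
    body = text.encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

Requesting the mapped URL should then show the runtime version, which can be compared with the "Compiled for" line in the log.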

Just to add: I uninstalled libapache2-mod-python, which I had installed earlier:
sudo apt-get remove libapache2-mod-python
After that, I got past the error above:
[Thu Aug 22 01:32:26 2013] [warn] mod_wsgi: Compiled for Python/2.7.2+.
[Thu Aug 22 01:32:26 2013] [warn] mod_wsgi: Runtime using Python/2.7.3.
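The remaining warning pair differs only at patch level (2.7.2+ vs 2.7.3), while the earlier one in the question's log (3.2.3 vs 2.7.3) differs in major version. A rough sketch for scanning error.log and flagging only major.minor differences follows. It assumes that patch-level differences matter less, and the function name and regular expression are made up for illustration.

```python
import re

# Matches the two mod_wsgi warning lines shown in the log excerpts above.
PATTERN = re.compile(r"mod_wsgi: (Compiled for|Runtime using) Python/(\d+)\.(\d+)")


def find_mismatches(lines):
    """Return (compiled, runtime) major.minor pairs that differ."""
    compiled = None
    mismatches = []
    for line in lines:
        match = PATTERN.search(line)
        if not match:
            continue
        version = (int(match.group(2)), int(match.group(3)))
        if match.group(1) == "Compiled for":
            compiled = version
        elif compiled is not None:
            # A "Runtime using" line closes the pair opened by "Compiled for".
            if compiled != version:
                mismatches.append((compiled, version))
            compiled = None
    return mismatches
```

Passing the lines of error.log to find_mismatches returns an empty list when every pair agrees on major.minor.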
