Erlang crashes (must be ejabberd): Why and how to debug the logfile?

Every day I have a new Erlang crash report on my server. As ejabberd is the only Erlang thing I use, it must be the cause of the crash.
The logfile (erl_crash.dump) has almost 9,000 lines, so I have no idea how to debug it. But when I searched for "ejabberd" in that logfile, there were 5 occurrences, and every single one was related to "ejabberdctl".
I'm calling ejabberdctl from a PHP script (via exec()) to programmatically create users. Could that (somehow) be the cause of the crash?
In the /var/log/ejabberd directory, I've found some errors in erlang.log and ejabberd.log, but I don't really know how to resolve them:
=ERROR REPORT====
Mnesia('ejabberd@MYHOST'): ** ERROR ** (core dumped to file: "/var/lib/ejabberd/MnesiaCore.ejabberd@MYHOST_...")
** FATAL ** mnesia_monitor crashed: {badarg,
    [{ets,lookup,[mnesia_decision,'ejabberdctl@MYHOST']},
     {mnesia_recover,has_mnesia_down,1},
     {mnesia_monitor,handle_info,2},
     {gen_server,handle_msg,5},
     {proc_lib,init_p_do_apply,3}]}
state: {state,<0.65.0>,[],[],true,[],undefined,[]}
=ERROR REPORT====
Mnesia('ejabberd@MYHOST'): ** WARNING ** Mnesia is overloaded: {dump_log,time_threshold}
=CRASH REPORT====
crasher:
initial call: ejabberd_listener:init/3
pid: <0.366.0>
registered_name: []
exception exit: {timeout,
    {gen_server,call,[<0.682.0>,{become_controller,<0.685.0>}]}}
in function gen_server:call/2
in call from ejabberd_listener:accept/3
ancestors: [ejabberd_listeners,ejabberd_sup,<0.39.0>]
messages: [{#Ref<0.0.0.11304>,ok}]
links: [#Port<0.2761>,<0.274.0>]
dictionary: []
trap_exit: false
status: running
heap_size: 2584
stack_size: 24
reductions: 20938
neighbours:

The erl_crash.dump file contains the state of almost everything at the moment the Erlang VM crashed. There's a tool for analyzing it:
Start an Erlang shell and start the webtool:
somebody@somehost> erl
Erlang R15B02 (erts-5.9.2) [source] [smp:2:2] [async-threads:0] [kernel-poll:false]
Eshell V5.9.2 (abort with ^G)
1> webtool:start().
WebTool is available at http://localhost:8888/
Or http://127.0.0.1:8888/
{ok,<0.35.0>}
2>
Navigate to the address given above with your browser, and click WebTool -> Start Tools -> CrashDumpViewer -> Start, then CrashDumpViewer -> Load Crashdump.
Look for the Slogan in General Information; it's the summarized reason for the crash.
Look for processes in a state other than Waiting. Those processes were doing something when the Erlang VM crashed; they are the probable culprits.
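Note: newer Erlang/OTP releases have dropped webtool. Assuming the observer application is installed (it ships with standard OTP), the same viewer can be started directly from an Erlang shell:
1> crashdump_viewer:start().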

You can only run one ejabberdctl at a time. Executing it twice concurrently from your PHP script will generate a conflict in node naming and the crash you see.
Do not use ejabberdctl from code; rely on the API instead.
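For example, recent ejabberd versions expose the same commands over HTTP through mod_http_api (a sketch under that assumption: the port and path depend on your listener configuration, and older installs offer ejabberd_xmlrpc instead):
# Hypothetical example: create a user via ejabberd's HTTP API instead of exec()'ing ejabberdctl
curl -X POST http://localhost:5280/api/register \
     -H 'Content-Type: application/json' \
     -d '{"user": "newuser", "host": "MYHOST", "password": "secret"}'
This avoids spawning a throwaway Erlang node (the ejabberdctl@MYHOST in the Mnesia crash above) for every request.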

Is there any chance that you are starting ejabberd twice?

Do you have an erlang.log file? If so, you should find useful information about the crash in there.

You could use SSH port forwarding to expose webtool to your local machine, where you can point a browser at it. Exposing it to the whole internet would be a bad idea security-wise.
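For example, a plain local forward, assuming webtool is listening on its default port 8888 on the server as in the transcript above:
# Forward local port 8888 to the server's webtool, then browse http://localhost:8888/
ssh -L 8888:localhost:8888 somebody@somehost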

Related

libevent crash in event_add

I have been using python-libevent (https://pypi.org/project/python-libevent/) with multiple versions of Python without any issues. Now, when I try to migrate to Python 3.8, an event_add() call from my C code segfaults, and I am not getting much from the backtrace in gdb.
I tried with python3.8-dbg and rebuilt the libraries with debug symbols, but still got no clues:
(gdb) bt
#0 0x0000000000000000 in ?? ()
#1 0x00007fe4bc2f804e in event_add (ev=0x7fe4bba0d000, tv=0x7fe4bba0c020) at event.c:2443
where line 2443 is:
EVBASE_ACQUIRE_LOCK(ev->ev_base, th_base_lock);
Can anyone please help me with how to debug this further? I checked that I have enabled pthreads and debugging:
evthread_use_pthreads();
evthread_enable_lock_debugging();
event_enable_debug_mode();
event_enable_debug_logging(EVENT_DBG_ALL);
Has anything changed in the usage of the libpython APIs related to memory handling/capsules between 3.6 and 3.8? (My code works up until Python 3.6.)
libevent version: 2.1.8-stable
Python 3.8 version: 3.8.0
Any help/pointers on debugging this are much appreciated.
--More info--
After debugging with valgrind (helgrind tool), I see the following when my module calls event_add():
==13558== Process terminating with default action of signal 11 (SIGSEGV): dumping core
==13558== Bad permissions for mapped region at address 0x0
==13558== at 0x0: ???
Is there any restriction on adding an event to an event_base owned by a different module?
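For what it's worth, frame #0 at address 0x0 inside EVBASE_ACQUIRE_LOCK usually means either that ev->ev_base is NULL (the event was never assigned to a base) or that the threading callbacks were installed after the base was created: libevent requires evthread_use_pthreads() to be called before event_base_new(), otherwise the base has no lock to acquire. A debugging sketch under that assumption (it needs libevent debug symbols so gdb can see the internal struct fields):
(gdb) break event_add
(gdb) run
(gdb) print ev->ev_base
(gdb) print ev->ev_base->th_base_lock
If ev_base is non-NULL but th_base_lock is NULL or garbage, check the initialization order between the two modules.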

Working on dbt: want to give the details of Snowflake in the profiles.yml file but getting an error

Trying to give the details of Snowflake in the dbt profiles.yml file, but as soon as I run the command, i.e.
$ atom /home/myname/.dbt/profiles.yml
it gives the error below:
/usr/bin/atom: line 190: 1705 Trace/breakpoint trap (core dumped) nohup "$ATOM_PATH" --executed-from="$(pwd)" --pid=$$ "$@" > "$ATOM_HOME/nohup.out" 2>&1
Failed to move to new namespace: PID namespaces supported, Network namespace supported, but failed: errno = Permission denied
Things I tried: ran the commands below, but still no luck:
1)
$ google-chrome --no-gpu --no-sandbox --disable-setuid-sandbox --headless --dump-dom http://www.chromestatus.com
Error:
[0627/161930.251811:ERROR:udev_watcher.cc(61)] Failed to enable receiving udev events.
[0627/161932.565713:ERROR:platform_shared_memory_region_posix.cc(46)] Descriptor access mode (0) differs from expected (2)
[0627/161932.566251:WARNING:crash_handler_host_linux.cc(366)] Could not translate tid - assuming crashing thread is thread group leader; syscall_supported=0
[0627/161932.769040:WARNING:crash_handler_host_linux.cc(366)] Could not translate tid - assuming crashing thread is thread group leader; syscall_supported=0
--2020-06-27 16:19:32-- https://clients2.google.com/cr/report
Resolving clients2.google.com (clients2.google.com)... 2404:6800:4009:805::200e, 172.217.174.238
Connecting to clients2.google.com (clients2.google.com)|2404:6800:4009:805::200e|:443... [0627/161933.036124:ERROR:headless_shell.cc(399)] Abnormal renderer termination.
connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/plain]
Saving to: ‘/dev/fd/4’
0K
Crash dump id: e870824b56e91b9f
$ google-chrome
Error:
Failed to move to new namespace: PID namespaces supported, Network namespace supported, but failed: errno = Permission denied
Trace/breakpoint trap (core dumped)
[1772:1772:0100/000000.825375:ERROR:zygote_linux.cc(653)] write: Broken pipe (32)
[0627/162152.831614:ERROR:nacl_helper_linux.cc(308)] NaCl helper process running without a sandbox!
Most likely you need to configure your SUID sandbox correctly
Could anyone advise on the above issue?
This could be an Atom error; it could be packages or something they installed. I could be wrong, but I don't see anything in that error message that indicates it's a profiles.yml thing. I wonder if they can open other files in Atom just fine.
Alternatively, use a text editor that isn't based on Chrome.
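For what it's worth, the "Failed to move to new namespace ... Permission denied" line is Chrome's sandbox failing to create unprivileged user namespaces. A commonly suggested check on Debian-based systems (an assumption here; the thread doesn't confirm the distribution) is:
# 0 means unprivileged user namespaces are disabled in the kernel:
cat /proc/sys/kernel/unprivileged_userns_clone
# Enabling them often clears this namespace error for Chrome-based apps:
sudo sysctl -w kernel.unprivileged_userns_clone=1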
Thanks @jake and @Christine from Fishtown Analytics.
Here are a few helpful links.
Atom issue;
Blogpost!

Not able to execute plugin enable command RabbitMQ

I installed Erlang and RabbitMQ on Windows 7. The RabbitMQ service is running, but when I try to execute the plugin enable command, I get the error below:
=SUPERVISOR REPORT==== 9-Jul-2018::14:14:46.134000 ===
supervisor: {local,'Elixir.Logger.Supervisor'}
errorContext: start_error
=INFO REPORT==== 9-Jul-2018::14:14:46.149000 ===
application: logger
exited: {{shutdown,
{failed_to_start_child,'Elixir.Logger.ErrorHandler',noproc}},
{'Elixir.Logger.App',start,[normal,[]]}}
type: temporary
Could not start application logger: Logger.App.start(:normal, []) returned an error: shutdown: failed to start child: Logger.ErrorHandler
** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
You are using a version of RabbitMQ prior to 3.7.7 with Erlang 21. This is not supported and is clearly documented here: https://www.rabbitmq.com/which-erlang.html
Solution: upgrade to RabbitMQ 3.7.7 or downgrade Erlang to version 20.3.
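To confirm which Erlang/OTP release is actually on your PATH before choosing a direction, a quick check (assuming erl is callable from your shell; the quoting below works in both cmd.exe and bash):
erl -noshell -eval "erlang:display(erlang:system_info(otp_release)), halt()."
This prints the OTP major release, e.g. "21".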
Also, you could have found the answer to this yourself by searching Google with the following text:
rabbitmq "Elixir.Logger.ErrorHandler"
NOTE: the RabbitMQ team monitors the rabbitmq-users mailing list and only sometimes answers questions on StackOverflow.

Live Migration Failure: unable to execute QEMU command 'migrate': Migration disabled: failed to allocate shared memory

I have a 2 node OpenStack Mitaka environment consisting of a controller/compute node and a compute node.
I've followed the setup guide to enable instance live migration using LVM block storage. I.e.: There's no shared storage backend, just local LVM block storage.
Using OpenStack Horizon to perform the live migration, a success message is displayed; however, the migration is far from successful. This worked pretty much out of the box with our Juno installation. I've exhausted Google and cannot find any other instances of people facing the same problem. I thought it might have been a time synchronisation problem, so I have set both nodes to UTC. Still the problem persists.
Source machine /var/log/nova/nova-compute.log
2016-08-12 15:56:42.120 2230 ERROR nova.virt.libvirt.driver [req-b71ea7b0-5fa8-4b57-92d2-4edec62135c2 b017d86d1143461a92a267d4b912c104 88c686f09e1b427fb750f5c00716f84e - - -] [instance: 5763b6b6-370c-448c-8e8f-8b71eafaa8f1] Migration operation has aborted
2016-08-12 15:56:42.470 2230 ERROR nova.virt.libvirt.driver [req-b71ea7b0-5fa8-4b57-92d2-4edec62135c2 b017d86d1143461a92a267d4b912c104 88c686f09e1b427fb750f5c00716f84e - - -] [instance: 5763b6b6-370c-448c-8e8f-8b71eafaa8f1] Live Migration failure: internal error: unable to execute QEMU command 'migrate': Migration disabled: failed to allocate shared memory
Target node /var/log/libvirt/libvirtd.log
2016-08-12 15:56:41.864+0000: 2170: error : qemuMonitorJSONGetMigrationStatsReply:2443 : internal error: info migration reply was missing return status
2016-08-12 15:56:41.864+0000: 2170: error : virNetClientProgramDispatchError:177 : Cannot open log file: '/var/log/libvirt/qemu/instance-0000006a.log': Device or resource busy
There are no other events captured in the source or target nova or libvirt logs.
I should also note that I am trying to use qemu+tcp (libvirt listening enabled, default tcp port, no auth) rather than qemu+ssh in order to keep things simple while testing. In fact, I intend to only use qemu+tcp anyway.
Which version of Ubuntu did you deploy?
I had the same error with Ubuntu 14.04 and Mitaka, and I figured out that the default kernel (3.13) causes this problem.
I upgraded the kernel from 3.13 to 4.4 and the problem is gone now.
I hope my experience helps you solve this problem.
Thanks
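For reference, one way to get a 4.4-series kernel on Ubuntu 14.04 is the Xenial hardware-enablement stack (a sketch only; the answer above doesn't say which method or exact kernel version was used):
sudo apt-get update
sudo apt-get install --install-recommends linux-generic-lts-xenial
sudo reboot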

Zookeeper error connection loss exception

I'm running a SeqWare VM on an Amazon EC2 instance, and I'm trying to use the SeqWare query engine to query data from VCF files. When I first launch the instance and follow the instructions to import data, it works fine, and it continues to work until I stop the instance. When I restart it, it won't let me import anything, nor create a new workspace; it always returns the error below. I looked at the processes and found that none of the required nodes were running, so I logged in as root and went to the /etc/init.d directory to start everything again, at which point, when I try to import data, I don't even get an error and I have to kill the process.
[seqware@master target]$ java -classpath seqware-distribution-0.13.6.7-qe-full.jar com.github.seqware.queryengine.system.importers.SOFeatureImporter -i ../../seqware-queryengine/src/test/resources/com/github/seqware/queryengine/system/FeatureImporter/consequences_annotated.vcf ALL.chr3.phase1_release_v3.20101123.snps_indels_svs.genotypes.3_100001-101000.vcf -o keyValueVCF.out -r hg_19 -s c111aea5-5e18-4c62-a8a7-ec82fe151301 -a ad_hoc -w VCFVariantImportWorker
[SeqWare Query Engine] 0 [main] ERROR org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper - ZooKeeper exists failed after 3 retries
[SeqWare Query Engine] 1 [main] ERROR org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher - hconnection Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.exists(ZooKeeper.java:1021)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.exists(RecoverableZooKeeper.java:154)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.watchAndCheckExists(ZKUtil.java:226)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.start(ZooKeeperNodeTracker.java:82)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.setupZookeeperTrackers(HConnectionManager.java:580)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.<init>(HConnectionManager.java:569)
at org.apache.hadoop.hbase.client.HConnectionManager.getConnection(HConnectionManager.java:186)
at org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:100)
at com.github.seqware.queryengine.impl.HBaseStorage.<init>(HBaseStorage.java:89)
at com.github.seqware.queryengine.factory.SWQEFactory$Storage_Type$3.buildStorage(SWQEFactory.java:109)
at com.github.seqware.queryengine.factory.SWQEFactory.getStorage(SWQEFactory.java:174)
at com.github.seqware.queryengine.factory.SWQEFactory.getQueryInterface(SWQEFactory.java:199)
at com.github.seqware.queryengine.impl.SimpleModelManager.<init>(SimpleModelManager.java:49)
at com.github.seqware.queryengine.impl.HBaseModelManager.<init>(HBaseModelManager.java:36)
at com.github.seqware.queryengine.impl.MRHBaseModelManager.<init>(MRHBaseModelManager.java:32)
at com.github.seqware.queryengine.factory.SWQEFactory.getModelManager(SWQEFactory.java:211)
at com.github.seqware.queryengine.system.importers.FeatureImporter.performImport(FeatureImporter.java:66)
at com.github.seqware.queryengine.system.importers.SOFeatureImporter.runMain(SOFeatureImporter.java:141)
at com.github.seqware.queryengine.system.importers.SOFeatureImporter.main(SOFeatureImporter.java:60)
[SeqWare Query Engine] 3 [main] FATAL org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation - Unexpected exception during initialization, aborting
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/master
(same stack trace as above)
I figured it out. The Apache services were installed from the Cloudera package. They weren't being restarted when the instance restarted, and apparently just running their scripts from /etc/init.d was the incorrect way to do it. I found the commands to restart them in the Cloudera documentation.
I too faced this problem. I was able to solve it by providing the jute.maxbuffer parameter when starting ZooKeeper.
For more info you can refer to
https://issues.apache.org/jira/browse/SOLR-4793
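jute.maxbuffer is a plain Java system property, so one way to pass it is through the JVMFLAGS environment variable that zkServer.sh honours (a sketch; the 4 MB value here is an arbitrary example):
# Raise the znode data limit to 4 MB (value in bytes) before starting ZooKeeper:
export JVMFLAGS="-Djute.maxbuffer=4194304"
bin/zkServer.sh restart
Note that the property usually has to be raised on the clients as well, since both sides enforce the limit.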