php7.0.2 Program terminated with signal 11, Segmentation fault - php-7

I am running php-7.0.2 with codeigniter (a php mvc frame). I got some segmentation faults which caused core dumps. And, I found that these segmentation faults randomly occurred when the child php-fpm processes shutdown and restart. I don't know why.
Using gdb "bt" to display the core dump:
Core was generated by `php-fpm: pool www '.
Program terminated with signal 11, Segmentation fault.
\#0 zend_string_release (ht=0x114dae0) at /home/smt/phpng/php-7.0.2/Zend/zend_string.h:269
269 /home/smt/phpng/php-7.0.2/Zend/zend_string.h: No such file or directory.
in /home/smt/phpng/php-7.0.2/Zend/zend_string.h
Missing separate debuginfos, use: debuginfo-install php7-7.0.2-20160407105024.x86_64
(gdb) bt
\#0 zend_string_release (ht=0x114dae0) at /home/smt/phpng/php-7.0.2/Zend/zend_string.h:269
\#1 zend_hash_destroy (ht=0x114dae0) at /home/smt/phpng/php-7.0.2/Zend/zend_hash.c:1273
\#2 0x000000000080647b in module_destructor (module=0x14b6ae0)
at /home/smt/phpng/php-7.0.2/Zend/zend_API.c:2509
\#3 0x000000000080075c in module_destructor_zval (zv=<value optimized out>)
at /home/smt/phpng/php-7.0.2/Zend/zend.c:615
\#4 0x000000000080dcff in _zend_hash_del_el_ex (ht=0x1154780)
at /home/smt/phpng/php-7.0.2/Zend/zend_hash.c:1013
\#5 _zend_hash_del_el (ht=0x1154780) at /home/smt/phpng/php-7.0.2/Zend/zend_hash.c:1037
\#6 zend_hash_graceful_reverse_destroy (ht=0x1154780) at /home/smt/phpng/php-7.0.2/Zend/zend_hash.c:1489
\#7 0x0000000000800096 in zend_shutdown () at /home/smt/phpng/php-7.0.2/Zend/zend.c:840
\#8 0x00000000007a2a6a in php_module_shutdown () at /home/smt/phpng/php-7.0.2/main/main.c:2339
\#9 0x000000000089e45d in main (argc=<value optimized out>, argv=<value optimized out>)
at /home/smt/phpng/php-7.0.2/sapi/fpm/fpm/fpm_main.c:1997
(gdb) quit
The php-fpm.log is as following:
[20-Apr-2016 08:00:02] WARNING: [pool www] child 11751 exited on signal 11 (SIGSEGV - core dumped) after 3600.462022 seconds from start
I am very curious about this bug.
Until now, I am sure that the core dumps occurred when the fpm restarted. The restarts were caused by the command 'kill -10 fpm-master-process-ids'. Or, the fpm also restarted when it had processed 'pm.max_requests' requests.
However, the core dumps didn't occur at every restart and the probability of core dumps was very small. I cannot find the role.
Fortunately, I have installed the 7.0.5 version to replace the 7.0.2 version in our production environment and it had run for three days without core dumps.
I cannot find any modification in the changelogs from 7.0.2 to 7.0.5. This is exactly a very strange bug and I want to know the reason. who can tell me something about this bug?

After updating to 7.0.5, core dump has not occurred for 2 weeks. So, this bug has been fixed in 7.0.5!
I still don't know what case this bug.
I am a curious cat. #_#

Related

Apache Ignite: What causes the ArrayIndexOutOfBoundsException in StripedCompositeReadWriteLock?

We have services written in Kotlin, which are behaves as clients for Ignite. We use ignite-core and ignite-spring dependencies in these services.
These service starts to produce errors like that after several days of uptime:
java.lang.ArrayIndexOutOfBoundsException: Index -4 out of bounds for length 16
at org.apache.ignite.internal.util.StripedCompositeReadWriteLock.getReadHoldCount(StripedCompositeReadWriteLock.java:120)
at org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtPartitionTopologyImpl.holdsLock(GridDhtPartitionTopologyImpl.java:262)
at org.apache.ignite.internal.processors.cache.CacheGroupContext.isTopologyLocked(CacheGroupContext.java:678)
at org.apache.ignite.internal.processors.cache.GridCacheContext.toCacheObject(GridCacheContext.java:1863)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicSingleUpdateFuture.mapSingleUpdate(GridNearAtomicSingleUpdateFuture.java:553)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicSingleUpdateFuture.map(GridNearAtomicSingleUpdateFuture.java:458)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicSingleUpdateFuture.mapOnTopology(GridNearAtomicSingleUpdateFuture.java:447)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridNearAtomicAbstractUpdateFuture.map(GridNearAtomicAbstractUpdateFuture.java:255)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$26.apply(GridDhtAtomicCache.java:1159)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache$26.apply(GridDhtAtomicCache.java:1157)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.asyncOp(GridDhtAtomicCache.java:782)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.update0(GridDhtAtomicCache.java:1157)
at org.apache.ignite.internal.processors.cache.distributed.dht.atomic.GridDhtAtomicCache.putAsync0(GridDhtAtomicCache.java:661)
at org.apache.ignite.internal.processors.cache.GridCacheAdapter.putAsync(GridCacheAdapter.java:2998)
at org.apache.ignite.internal.processors.cache.GridCacheAdapter.putAsync(GridCacheAdapter.java:2981)
at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.putAsync0(IgniteCacheProxyImpl.java:1339)
... 21 common frames omitted
Wrapped by: org.apache.ignite.IgniteCheckedException: Index -4 out of bounds for length 16
at org.apache.ignite.internal.util.IgniteUtils.cast(IgniteUtils.java:7628)
at org.apache.ignite.internal.util.future.GridFutureAdapter.resolve(GridFutureAdapter.java:260)
at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:172)
at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:141)
at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl$7.applyx(IgniteCacheProxyImpl.java:1344)
at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl$7.applyx(IgniteCacheProxyImpl.java:1341)
at org.apache.ignite.internal.util.lang.IgniteClosureX.apply(IgniteClosureX.java:38)
at org.apache.ignite.internal.util.future.GridFutureChainListener.applyCallback(GridFutureChainListener.java:78)
at org.apache.ignite.internal.util.future.GridFutureChainListener.apply(GridFutureChainListener.java:70)
at org.apache.ignite.internal.util.future.GridFutureChainListener.apply(GridFutureChainListener.java:30)
at org.apache.ignite.internal.util.future.GridFutureAdapter.notifyListener(GridFutureAdapter.java:399)
at org.apache.ignite.internal.util.future.GridFutureAdapter.listen(GridFutureAdapter.java:354)
at org.apache.ignite.internal.util.future.GridFutureAdapter$ChainFuture.<init>(GridFutureAdapter.java:588)
at org.apache.ignite.internal.util.future.GridFutureAdapter.chain(GridFutureAdapter.java:361)
at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.putAsync0(IgniteCacheProxyImpl.java:1341)
at org.apache.ignite.internal.processors.cache.IgniteCacheProxyImpl.putAsync(IgniteCacheProxyImpl.java:1326)
at org.apache.ignite.internal.processors.cache.GatewayProtectedCacheProxy.putAsync(GatewayProtectedCacheProxy.java:880)
... 19 common frames omitted
The index specified in the ArrayIndexOutOfBoundsException is different for different exception instances but it's always a negative value.
After restarting the services, the error goes and the services work fine next several days. But then the error appears again.
I investigated it a bit and found the error may be causes by overflowing IDX_GEN value in StripedCompositeReadWriteLock since this value is always increased. If there are a lot of calls to the lock class it maybe overflowed. But it's just my guess and I'm not 100% sure.
Is there other possible causes for the error?
Or maybe some workaround how to avoid this issue. I'd like to avoid periodically restarting a whole service just to reset the IDX_GEN
Ignite version and ignite dependencies versions is 2.11.0
Java 18

Laravel 6.10 Application works locally but fails on production server with "Target class [] does not exist."

This problem deals specifically with Laravel 6.10 and queue processing. On my local machine, the program runs fine, and all queued jobs load well and process to completion. On my GoDaddy server, I get a mysterious error when the job tries to load that reads:
Error thrown on line 805 in /home/jaredclemence/public_html/theninjaassistant.com/vendor/laravel/framework/src/Illuminate/Container/Container.php with message:
Target class [] does not exist.
Trace:
#0 /home/jaredclemence/public_html/theninjaassistant.com/vendor/laravel/framework/src/Illuminate/Container/Container.php(681): Illuminate\Container\Container->build(NULL)
#1 /home/jaredclemence/public_html/theninjaassistant.com/vendor/laravel/framework/src/Illuminate/Container/Container.php(629): Illuminate\Container\Container->resolve(NULL, Array)
#2 /home/jaredclemence/public_html/theninjaassistant.com/vendor/laravel/framework/src/Illuminate/Foundation/Application.php(776): Illuminate\Container\Container->make(NULL, Array)
#3 /home/jaredclemence/public_html/theninjaassistant.com/vendor/laravel/framework/src/Illuminate/Queue/Jobs/Job.php(215): Illuminate\Foundation\Application->make(NULL)
#4 /home/jaredclemence/public_html/theninjaassistant.com/vendor/laravel/framework/src/Illuminate/Queue/Jobs/Job.php(88): Illuminate\Queue\Jobs\Job->resolve(NULL)
#5 /home/jaredclemence/public_html/theninjaassistant.com/vendor/laravel/framework/src/Illuminate/Queue/Worker.php(354): Illuminate\Queue\Jobs\Job->fire()
#6 /home/jaredclemence/public_html/theninjaassistant.com/vendor/laravel/framework/src/Illuminate/Queue/Worker.php(300): Illuminate\Queue\Worker->process('database', Object(Illuminate\Queue\Jobs\DatabaseJob), Object(Illuminate\Queue\WorkerOptions))
#7 /home/jaredclemence/public_html/theninjaassistant.com/vendor/laravel/framework/src/Illuminate/Queue/Worker.php(134): Illuminate\Queue\Worker->runJob(Object(Illuminate\Queue\Jobs\DatabaseJob), 'database', Object(Illuminate\Queue\WorkerOptions))
#8 /home/jaredclemence/public_html/theninjaassistant.com/vendor/laravel/framework/src/Illuminate/Queue/Console/WorkCommand.php(112): Illuminate\Queue\Worker->daemon('database', 'default', Object(Illuminate\Queue\WorkerOptions))
#9 /home/jaredclemence/public_html/theninjaassistant.com/vendor/laravel/framework/src/Illuminate/Queue/Console/WorkCommand.php(96): Illuminate\Queue\Console\WorkCommand->runWorker('database', 'default')
#10 [internal function]: Illuminate\Queue\Console\WorkCommand->handle()
#11 /home/jaredclemence/public_html/theninjaassistant.com/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(32): call_user_func_array(Array, Array)
#12 /home/jaredclemence/public_html/theninjaassistant.com/vendor/laravel/framework/src/Illuminate/Container/Util.php(36): Illuminate\Container\BoundMethod::Illuminate\Container\{closure}()
#13 /home/jaredclemence/public_html/theninjaassistant.com/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(90): Illuminate\Container\Util::unwrapIfClosure(Object(Closure))
#14 /home/jaredclemence/public_html/theninjaassistant.com/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(34): Illuminate\Container\BoundMethod::callBoundMethod(Object(Illuminate\Foundation\Application), Array, Object(Closure))
#15 /home/jaredclemence/public_html/theninjaassistant.com/vendor/laravel/framework/src/Illuminate/Container/Container.php(590): Illuminate\Container\BoundMethod::call(Object(Illuminate\Foundation\Application), Array, Array, NULL)
#16 /home/jaredclemence/public_html/theninjaassistant.com/vendor/laravel/framework/src/Illuminate/Console/Command.php(201): Illuminate\Container\Container->call(Array)
#17 /home/jaredclemence/public_html/theninjaassistant.com/vendor/symfony/console/Command/Command.php(255): Illuminate\Console\Command->execute(Object(Symfony\Component\Console\Input\ArgvInput), Object(Illuminate\Console\OutputStyle))
#18 /home/jaredclemence/public_html/theninjaassistant.com/vendor/laravel/framework/src/Illuminate/Console/Command.php(188): Symfony\Component\Console\Command\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Illuminate\Console\OutputStyle))
#19 /home/jaredclemence/public_html/theninjaassistant.com/vendor/symfony/console/Application.php(1011): Illuminate\Console\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#20 /home/jaredclemence/public_html/theninjaassistant.com/vendor/symfony/console/Application.php(272): Symfony\Component\Console\Application->doRunCommand(Object(Illuminate\Queue\Console\WorkCommand), Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#21 /home/jaredclemence/public_html/theninjaassistant.com/vendor/symfony/console/Application.php(148): Symfony\Component\Console\Application->doRun(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#22 /home/jaredclemence/public_html/theninjaassistant.com/vendor/laravel/framework/src/Illuminate/Console/Application.php(93): Symfony\Component\Console\Application->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#23 /home/jaredclemence/public_html/theninjaassistant.com/vendor/laravel/framework/src/Illuminate/Foundation/Console/Kernel.php(131): Illuminate\Console\Application->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#24 /home/jaredclemence/public_html/theninjaassistant.com/artisan(37): Illuminate\Foundation\Console\Kernel->handle(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#25 {main}
I believe the problem starts and will be solved by fixing the trace at item #4 where the following is called: Illuminate\Queue\Jobs\Job->resolve(NULL). I see this in both failing jobs on the GoDaddy server, but it does not happen in local. I don't know enough about laravel to understand where the NULL value is coming from and how to fix it. This occurs before the job class is loaded, but it does not happen for all queued jobs. Only jobs of this class.
In both the local copy and the production copy I use the GoDaddy databases, so both systems talk to the same database host. I use a database named CMP_dev and CMP_core to differentiate between the development and production tables. Because I am using the same database source, I can rule out changes in the mysql settings.
I upgraded all composer packages and retested. Then, I committed the composer.lock file and updated the GoDaddy server to match. So, I can rule out problems with old buggy code that have already been fixed by someone else.
The PHP version on the server is 7.3.11, and the PHP version on local dev is 7.3.6. The good news is that they are both 7.3.X, which reduces risk of language variations, but there still may be an issue between 7.3.6 and 7.3.11, but GoDaddy does not allow me to control the PHP setting beyond the minor version number of 7.3.
---- Added on January 09, 2020------
I thought it might be the difference in web servers. On my local machine, I use php artisan serve to host the software. On GoDaddy, I use Nginx. However, then I realized that it is not the server that runs the queue. The command line runs the queue, and both commands are being run using php artisan schedule:run. This rules out the web server software and all its components.
---------
I have successfully run queued mail jobs, which means that the queue works. This should localize the issue to the two classes that are generating problems for me. If I can find the issue with one, I will likely find the issue with the second, so I will include the first job causing issue here:
app/Jobs/ConvertCsvFileToIntermediateFile.php:
namespace App\Jobs;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use App\ContactCsvFile;
use Illuminate\Support\Facades\Log;
class ConvertCsvFileToIntermediateFile implements ShouldQueue
{
use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;
public $timeout = 500;
/** #var ContactCsvFile */
private $file;
private $delegate;
/**
* Create a new job instance.
*
* #return void
*/
public function __construct(ContactCsvFile $file, $delegate = null)
{
$this->file = $file;
$this->delegate = $delegate;
}
/**
* Execute the job.
*
* #return void
*/
public function handle()
{
$this->file->process($this->delegate);
}
}
The error is occurring before handle() is ever called. I know this because, when I include a dd($this->file) as the first line of handle, the line is never reached.
Also, I think it is important to note that when most jobs fail, their class name is listed in the queue:failed table. But in this case, the queue:failed table reads the time: "2020-01-08 09:23:23."
+----+------------+---------+---------------------+-----------+
| ID | Connection | Queue | Class | Failed At |
+----+------------+---------+---------------------+-----------+
| 2 | database | default | 2020-01-08 09:23:23 | |
+----+------------+---------+---------------------+-----------+
You have to read laravel 6.10 release notes.. Some functions deprecated and fixes.
I think you have to add php unit 9 instead of v8 not supported and add it to composer required not dev
I still do not know what the meaning of the error is or how to fix it. However, I found a way around the issue, which I will describe here in case it helps any others with this problem.
I began to suspect that the strange database entries that appeared in the queue:failed table were an indication that something terrible happened to cause the program to fail during execution, which maybe led to a partial write to the database. This does not make sense to me as a logical solution since I have the jobs running within a transaction that should roll back on failure.
That being said, I looked at a potential memory bloat issue and a timeout issue. I've been having trouble with the timeouts, but the classes are already set for public $timeout = 500;, which should guarantee a full 500 seconds before the job fails (These jobs were failing in less than one minute). Even though the class allows for 500 seconds, GoDaddy can limit the amount of time that a process runs, so it is possible that the process took too long.
GoDaddy can also limit the amount of memory allowed to each process. Since the jobs worked on large arrays of data, it is possible that the length of time issue and the memory bloat issue could be solved by creating one job for each item instead of creating one job to iterate the array.
This is how I solved the problem. I did create a job that processed just one item. Instead of sending the whole array to one job, I iterated the array and sent each item to a job. I thought that this would increase the amount of time it took for the system to process each item, since the program has to bootstrap the application for each job as it loads, and now I have hundreds of jobs instead of one. However, if the total time increased, the time to process a single job would certainly decrease, because the job only processed a single item and then closed. Once I made this change, the GoDaddy server stopped throwing the NULL errors, and everything ran well.
I was also surprised to see that the smaller jobs also completed faster than the single job with the array. Even though the system had to bootstrap the application on each job, the system ran much faster when memory was not overloaded.

HANA hdbindexserver start issue after power outage

There was a power outage for our 5+1 node HANA cell cluster.
After we booted up the servers, tried to start the HANA DB.
During HDB start with SIDADM we can see on the nodes 2-3-4-5:
FAIL: process hdbindexserver HDB Indexserver not running
So of course trying to start hdbindexserver with hand with SIDADM:
cd /usr/sap/SIDADM/HDB0x/exe; ./hdbindexserver
But this just produces error:
/usr/sap/SIDADM/HDB0x/foobar003/trace> cat indexserver_alert_foobar003.trc
...
[14268]{-1}[-1/-1] 2017-10-09 19:55:34.593776 e TrexNet Communication.cpp(00501) : no internal interface found
[14287]{-1}[-1/-1] 2017-10-09 19:56:01.428226 e Checkpoint CheckpointMgr.cc(00244) : Skip versions garbage collection savepoint: transaction distribution work failure: snapshot timestamp synchronization failed
[14287]{-1}[-1/-1] 2017-10-09 19:56:22.467184 e Row_Engine transdtx.cc(01410) : Unexpected ltt exception thrown: transaction distribution work failure (at foobar/ptime/storage/tm/transdtx.cc:1410 )
[14287]{-1}[-1/-1] 2017-10-09 19:56:22.467427 f PersistenceLayer PersistenceController.cpp(00679) : startup failed exception 1: no.71000145 (ptime/storage/tm/transdtx.cc:1512)
snapshot timestamp synchronization failed
...
The IPs are up. There is 1 TB of RAM.
The question: what could cause hdbindexserver to fail to start?
Looks like the indexserver process wasn't able to bind the internal network interface again:
Communication.cpp(00501) : no internal interface found
I'd look into the other tracefiles and the system log to check whether the configured NI is up and available.
It seems the persistence storage (disk where data and log file resides) is not responding within time and hence it's getting timed out. Can you check if you can access the data file and log file from the server.
Also check is network I/O slow or disk I/O slow on that server, causing the synchronization to timeout.
You can try stopping the system completely and try to bring HDB on just that server first to check if above issue exists.

RabbitMQ Crash Explanation

Our RabbitMQ service crashed twice with the following report in the $RABBITMQ_NODENAME-sasl.log:
=CRASH REPORT==== 7-Jun-2016::14:37:25 ===
crasher:
initial call: gen:init_it/6
pid: <0.223.0>
registered_name: []
exception exit: {{badmatch,
{[{msg_location,
<<162,171,39,113,226,229,228,92,227,253,48,186,
45,48,29,98>>,
1,357,0,583},
******************
16000 similar msg_location lines snipped
******************
1795219}},
[{rabbit_msg_store,combine_files,3,[]},
{rabbit_msg_store_gc,attempt_action,3,[]},
{rabbit_msg_store_gc,handle_cast,2,[]},
{gen_server2,handle_msg,2,[]},
{proc_lib,wake_up,3,
[{file,"proc_lib.erl"},{line,250}]}]}
in function gen_server2:terminate/3
ancestors: [msg_store_persistent,rabbit_sup,<0.159.0>]
messages: [{'$gen_cast',{combine,394,380}}]
links: [#Port<0.86370>,<0.218.0>,#Port<0.86369>]
dictionary: [{{"/var/lib/rabbitmq/mnesia/$RABBITMQ_NODENAME/msg_store_persistent/357.rdq",
fhc_file},
{file,1,false}},
{{"/var/lib/rabbitmq/mnesia/$RABBITMQ_NODENAME/msg_store_persistent/340.rdq",
fhc_file},
{file,1,true}},
{fhc_age_tree,{2,
{{1465,346244,764691},
#Ref<0.0.3145729.257998>,nil,
{{1465,346244,891543},
#Ref<0.0.3145729.258001>,nil,nil}}}},
{{#Ref<0.0.3145729.257998>,fhc_handle},
{handle,{file_descriptor,prim_file,{#Port<0.86369>,59}},
0,false,0,1048576,[],false,
"/var/lib/rabbitmq/mnesia/$RABBITMQ_NODENAME/msg_store_persistent/357.rdq",
[raw,binary,read_ahead,read],
[{write_buffer,1048576}],
false,true,
{1465,346244,764691}}},
{{#Ref<0.0.3145729.258001>,fhc_handle},
{handle,{file_descriptor,prim_file,{#Port<0.86370>,64}},
14212552,false,0,1048576,[],false,
"/var/lib/rabbitmq/mnesia/$RABBITMQ_NODENAME/msg_store_persistent/340.rdq",
[raw,binary,read_ahead,read,write],
[{write_buffer,1048576}],
true,true,
{1465,346244,891543}}}]
trap_exit: false
status: running
heap_size: 121536
stack_size: 27
reductions: 835024
neighbours:
We'd like to understand what this crash report means. Does it signify a bad message, RMQ can't find a message, or something completely different? We're using RabbitMQ 3.1.5 with Erlang 18, and while we know we're using an old version, we want to first know what's causing the crash before dedicating resources to an upgrade.
This message means that RabbitMQ message storage process has failed to combine files during garbage collection on message store. This can in theory cause message loss.
Note that 3.1.5 is not supported and has not been tested with OTP 10. This issue can be already fixed in newer versions though.

COM exception on "custom component" - how to identify DLL?

We have a large legacy VB app made up of a number of DLLs (a couple of dozen or so), all installed into a single COM+ Server Application. Every now and then, something happens that causes dllhost.exe to keel over (and automatically restart), leaving this message in the Windows Application Event log...
The system has called a custom component and that component has
failed and generated an exception. This indicates a problem with the
custom component. Notify the developer of this component that a failure has
occurred and provide them with the information below.
Server Application ID: {8CC02F18-2733-4A17-9E5C-1A70CB6B6977}
Server Application Instance ID: {1940A147-8A5E-45FA-86FE-DAF92A822597}
Server Application Name: MyTestApp
The serious nature of this error has caused the process to terminate.
Exception: C0000005
Address: 0x758DA3DA
Source: Complus
Event ID: 4786
Level: Error
Along side this is another log, specifically on dllhost.exe...
Faulting application name: dllhost.exe, version: 6.0.6000.16386, time stamp: 0x4549b14e
Faulting module name: msvcrt.dll, version: 7.0.6002.18005, time stamp: 0x49e0379e
Exception code: 0xc0000005
Fault offset: 0x0000a3da
Faulting process id: 0x83c
Faulting application start time: 0x01cb50c507ee0166
Faulting application path: %11
Faulting module path: %12
Report Id: %13
I know it's flagging a failure in the C runtime (msvcrt), but ideally I need to trace this back into the DLL that's called into msvcrt (probably with bad data/parameters). So without installing a debugger, is there any way to identify the DLL that causes this? I'm trying to see if there's a memory dump anywhere I can use to analyse offline - and thus tie the Address to something specific. But without that, I'm not sure that's possible. Can the COM subsystem be told to generate a minidump when a hosted application crashes? (yes it can [probably] - there's a checkbox on the 'Dump' tab).
This is on Windows Server 2008 R1 32-bit (but also be interested for Server 2003 as well).
It doesn't affect availability of the app -- COM+ simply restarts dllhost and the application continues, but it is an inconvienience that would be useful to fix.
Edit Okay, I've got a crash dump, I've got windbg, but it's not helping. Not sure if I'm being thick (a possibility) or something else :-) Output of !analyze -v is below , but it's not showing me anything in our DLLs, although it looks like it hasn't been able to resolve FAULTING_IP? I'm not sure where to turn next.
I'm wondering if any of my pdb's are dodgy and be worth generating new ones -- hooked into Microsoft's symbol server, so they shouldn't be, but not sure for what module it's (apparently) reporting wrong symbols for (BUGCHECK_STR and PRIMARY_PROBLEM_CLASS) (or are these symbols on the server the code was originally running on?). Would it be better to put the PDBs on the server itself?
If not, any other ideas? I've used windbg briefly before, but I'm no regular user of it, so maybe there's some more incantations I need to type to dig deeper? Guidance welcome :-)
*******************************************************************************
* *
* Exception Analysis *
* *
*******************************************************************************
FAULTING_IP:
+5c112faf02e0d82c
00000000 ?? ???
EXCEPTION_RECORD: ffffffff -- (.exr 0xffffffffffffffff)
ExceptionAddress: 00000000
ExceptionCode: 80000003 (Break instruction exception)
ExceptionFlags: 00000000
NumberParameters: 0
FAULTING_THREAD: 00000f1c
DEFAULT_BUCKET_ID: WRONG_SYMBOLS
PROCESS_NAME: dllhost.exe
ERROR_CODE: (NTSTATUS) 0x80000003 - {EXCEPTION} Breakpoint A breakpoint has been reached.
EXCEPTION_CODE: (HRESULT) 0x80000003 (2147483651) - One or more arguments are invalid
MOD_LIST: <ANALYSIS/>
NTGLOBALFLAG: 0
APPLICATION_VERIFIER_FLAGS: 0
MANAGED_STACK: !dumpstack -EE
OS Thread Id: 0xf1c (0)
Current frame:
ChildEBP RetAddr Caller,Callee
LAST_CONTROL_TRANSFER: from 77b15620 to 77b15e74
PRIMARY_PROBLEM_CLASS: WRONG_SYMBOLS
BUGCHECK_STR: APPLICATION_FAULT_WRONG_SYMBOLS
STACK_TEXT:
0022fa68 77b15620 77429884 00000064 00000000 ntdll!KiFastSystemCallRet
0022fa6c 77429884 00000064 00000000 00000000 ntdll!NtWaitForSingleObject+0xc
0022fadc 774297f2 00000064 ffffffff 00000000 kernel32!WaitForSingleObjectEx+0xbe
0022faf0 778e2c44 00000064 ffffffff 00e42374 kernel32!WaitForSingleObject+0x12
0022fb0c 778e2e32 00060848 0022fb5b 00000000 ole32!CSurrogateProcessActivator::WaitForSurrogateTimeout+0x55
0022fb24 00e413a4 0022fb40 00000000 00061d98 ole32!CoRegisterSurrogateEx+0x1e9
0022fcb0 00e41570 00e40000 00000000 00061d98 dllhost!WinMain+0xf2
0022fd40 7742d0e9 7ffde000 0022fd8c 77af19bb dllhost!_initterm_e+0x1a1
0022fd4c 77af19bb 7ffde000 dc2ccd29 00000000 kernel32!BaseThreadInitThunk+0xe
0022fd8c 77af198e 00e416e6 7ffde000 ffffffff ntdll!__RtlUserThreadStart+0x23
0022fda4 00000000 00e416e6 7ffde000 00000000 ntdll!_RtlUserThreadStart+0x1b
STACK_COMMAND: .cxr 00000000 ; kb ; dt ntdll!LdrpLastDllInitializer BaseDllName ; dt ntdll!LdrpFailureData ; ~0s; .ecxr ; kb
FOLLOWUP_IP:
dllhost!WinMain+f2
00e413a4 ff15a410e400 call dword ptr [dllhost!_imp__CoUninitialize (00e410a4)]
SYMBOL_STACK_INDEX: 6
SYMBOL_NAME: dllhost!WinMain+f2
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: dllhost
IMAGE_NAME: dllhost.exe
DEBUG_FLR_IMAGE_TIMESTAMP: 4549b14e
FAILURE_BUCKET_ID: WRONG_SYMBOLS_80000003_dllhost.exe!WinMain
BUCKET_ID: APPLICATION_FAULT_WRONG_SYMBOLS_dllhost!WinMain+f2
Do you have symbols for the VB dlls? Symbols are important to get the call-stack. I hope you have correct symbols. You can use ld * and then lme which should get you list of symbols that did not match within windbg. Also set the symbol path for MS symbols and as well as for your custom code using _NT_SYMBOL_PATH
One of the easiest option is to load the dump within DebugDiag which should give you reason for the failure along with call-stack. DebugDiag has debugger extensions for Complus.
And here is a command to native call stack for all the threads
~*ek
and this one switch to the current exception
.ecxr
Debug Mon / WinDbg is the best way to troubleshoot this issue.
you should be able to use the modules list in winDbg, or the lm command to list the loaded modules. The stack trace should then tell you which DLLs are involved. This should be possible even without the symbols for the process/dll.