Jaeger agent error when a distributed trace spans Node.js and Java services - jaeger

In our application, a Node.js front end talks to a Java Spring backend. Everything is containerized and running in Kubernetes. Some time ago we added support for Jaeger distributed tracing across the front end and backend services. Jaeger had been running fine until recently.
Our Elasticsearch cluster was out of date, so we upgraded it. That mandated an upgrade of Jaeger as well; we ended up with the following versions:
Jaeger Helm Chart: 0.13.3 from https://github.com/helm/charts/tree/master/incubator/jaeger
Jaeger Client for Node: 3.17.1
Jaeger Client for Java:
opentracing-spring-jaeger-cloud-starter 2.0.3
opentracing-spring-jaeger-web-starter 2.0.3
Both of the opentracing libraries depend on version 0.35.1 of the Jaeger Java client.
Since upgrading, traces created entirely on one side or the other seem to be fine. But traces that span the boundary (i.e., start on the Node.js front end and complete on the Java backend) generate errors like this in the jaeger-agent pod:
{"level":"error","ts":1574224941.7531824,"caller":"processors/thrift_processor.go:119",
"msg":"Processor failed","error":"*jaeger.Batch error reading struct: *jaeger.Span error
reading struct: *jaeger.Log error reading struct: *jaeger.Tag error reading struct:
error reading field 3: Invalid data length","stacktrace":"github.com/jaegertracing/jaeger/cmd/agent/app/processors.
(*ThriftProcessor).processBuffer\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/
agent/app/processors/thrift_processor.go:119\ngithub.com/jaegertracing/jaeger/cmd/agent/app/proc
essors.NewThriftProcessor.func2\n\t/home/travis/gopath/src/github.com/jaegertracing/jaeger/cmd/a
gent/app/processors/thrift_processor.go:83"}
For these traces, the Jaeger UI shows us the spans created by the front end before it invokes the backend API, but the child spans from the backend never show up as you would expect them to.
What might cause this sort of processor error?

It looks like you have mismatched versions of opentracing. The opentracing-spring-jaeger starters upgraded their opentracing dependency in version 2.x, so you may have introduced this breaking change when you upgraded the dependency.

Related

Issues with Scatter Gather in Mule 4.2.2

Is there any known issue with the scatter-gather scope in Mule Runtime 4.2.2 (on-prem)? I can see that a few of my requests do not move on past the scatter-gather component, with no error or exception. Under the scatter-gather component I invoke a SOAP service and a REST service and consume a file. It does not happen for all requests, only a few, and it causes those requests to time out. I have loggers before and after the component, and for the affected requests even the 'before' logger shows no activity. At the same time, other requests are processed successfully.
I am just using the basic configuration of the scatter-gather component.
There are at least 6 fixed issues with scatter-gather mentioned in the Mule 4.3.0 release notes, which means those issues were reported against previous versions. You can check them by searching for the ID number (MULE-NNNNN) in the Mule JIRA (https://www.mulesoft.org/jira) to see whether any of them affects version 4.2.2.
You also have to consider the possibility that the issue is related to something else, like threading.
It would be a good idea to test with Mule 4.3.0 to see if you notice any improvement.

Cannot install Glassfish update tool

Firstly, there are related posts:
GlassFish Server update center installation times out
Java EE 7 updatetool installation fails
I got my Java EE 7 SDK (Update 3) from here: http://www.oracle.com/technetwork/java/javaee/downloads/index.html
I have tried each of the solutions in the above posts and here: https://blogs.oracle.com/dipol/troubleshooting-glassfish-update-center
Including:
Running set PKG_CLIENT_CONNECT_TIMEOUT=300 and set PKG_CLIENT_READ_TIMEOUT=300 in the cmd prompt before running updatetool in C:\glassfish4\bin\updatetool.bat (c:\glassfish4 is my install directory; all settings were default, including "install update tool"...).
Setting the above-mentioned timeouts to much larger values - this doesn't appear to make a difference at all; the process basically bombs immediately.
Running C:\glassfish4\bin\updatetool.bat many times.
Triple-checking that I didn't somehow configure a proxy server in my sleep.
Using the update tool via the GlassFish admin console at http://localhost:4848 (it seems to show no available updates or add-ons, which seems odd...).
Running C:\glassfish4\bin\updatetool.bat gives me the screenshot below.
I have no idea why the error would be proxy-related, unless it happened to be something on their end. Interestingly, if I go directly to the URL mentioned (via Chrome) I get the following page:
What could possibly be going wrong here?
The updatetool was a commercial feature of Oracle GlassFish. Any update functionality relied on Oracle providing a site where updates could be hosted. Since Oracle GlassFish is no longer supported, this site no longer exists, so the updatetool won't work any more.
Rather than downloading GlassFish from Oracle, you should download it from the official open source site, hosted on GitHub. Alternatively, if you really do need support, you could try Payara Server, which is open source and derived from GlassFish, but has support available (disclaimer: I work for Payara).

GlassFish 4 Rolling Upgrade Issue on Single Cluster

I use a GlassFish 4.1 single cluster with two instances on the same node.
My steps for rolling upgrade:
1. Deploy the app with the old version, ClusterTest:1.0:
asadmin deploy --target=cluster1 --enabled=true --availabilityenabled=true --name=ClusterTest:1.0 ClusterTest.ear
2. Deploy the new version of the app in a disabled state, ClusterTest:1.1:
asadmin deploy --target=cluster1 --enabled=false --availabilityenabled=true --name=ClusterTest:1.1 ClusterTest.ear
3. Enable the new app on the 1st instance:
asadmin enable --target=instance1 ClusterTest:1.1
On the 1st instance the new app is available, but on the 2nd I get a 404 error (I expected the old version to still be available).
What am I doing wrong?
There are a lot of problems with rolling upgrades on GlassFish. Many of these problems have been fixed in the latest version of Payara Server. It may be that you aren't hitting any of these issues, but there is a very detailed discussion on the Payara GitHub repository:
https://github.com/payara/Payara/issues/455
You may also want to look at this video, which describes basic application versioning and may contain the information you need:
https://www.youtube.com/watch?v=6QVBsH6IjEA

Repeated IBM bluemix Node Red app crashing; status 1

My Node Red application in IBM BlueMix is repeatedly crashing - once an hour - with no real error message other than "exited with status: 1."
How can I troubleshoot this issue?
Is there someone from IBM BlueMix support who monitors this and could take a look?
I looked at my logs and there's nothing in there that really says what's going on.
Edit per requests:
The regular "OUT/ERR" log is scrolling so fast with HTTPD logs that I can't copy/paste it. Filtering to the "ERR" channel, the only thing I see is below. I believe this is an error that occurs during deploy, when the application restarts.
[App/0] ERR js-bson: Failed to load c++ bson extension, using pure JS version
My Node Red application is gathering data from Wink, LIFX, and other IoT services and compiles them together into a Freeboard dashboard.
Caught the crash in a screenshot here - not enough cred to post images, so it'll only post as a link.
The zlib error was fixed in the 0.13.2 Node-RED release (shipped 19/02/16).
If you re-stage your application it should pick up the new version of Node-RED.
You can re-stage the application using the cf command-line tool:
cf restage <app name>

What can I do when allow_store_upgrade fails?

I'm using neo4j in a GlassFish server through a modified version of Alex Smirnov's neo4j JCA connector.
My version is available here: https://github.com/Riduidel/neo4j-connector
I'm using this connector with neo4j 1.8.
As a consequence, when I want to use it, I first install the connector in my GlassFish application server, then use it from applications wishing to connect.
It works OK when used with fresh stores.
But when used with stores created by a previous version, I encounter weird bugs.
Typically, today I got the following stack trace:
javax.resource.spi.ResourceAllocationException: Error in allocating a connection. Cause: Failed to transition org.neo4j.kernel.InternalAbstractGraphDatabase$DefaultKernelExtensionLoader#3bbd53b1 from NONE to STOPPED
...
...
.../* JCA internal exception stack */
...
...
Caused by: com.sun.appserv.connectors.internal.api.PoolingException: Failed to transition org.neo4j.kernel.InternalAbstractGraphDatabase$DefaultKernelExtensionLoader#494b584c from NONE to STOPPED
at com.sun.enterprise.resource.pool.ConnectionPool.createSingleResource(ConnectionPool.java:924)
at com.sun.enterprise.resource.pool.ConnectionPool.createResource(ConnectionPool.java:1185)
at com.sun.enterprise.resource.pool.datastructure.RWLockDataStructure.addResource(RWLockDataStructure.java:98)
... 66 more
Caused by: org.neo4j.kernel.lifecycle.LifecycleException: Failed to transition org.neo4j.kernel.InternalAbstractGraphDatabase$DefaultKernelExtensionLoader#494b584c from NONE to STOPPED
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.init(LifeSupport.java:388)
at org.neo4j.kernel.lifecycle.LifeSupport.init(LifeSupport.java:82)
at org.neo4j.kernel.lifecycle.LifeSupport.start(LifeSupport.java:116)
at org.neo4j.kernel.InternalAbstractGraphDatabase.run(InternalAbstractGraphDatabase.java:227)
at org.neo4j.kernel.EmbeddedGraphDatabase.<init>(EmbeddedGraphDatabase.java:79)
at org.neo4j.kernel.EmbeddedGraphDatabase.<init>(EmbeddedGraphDatabase.java:70)
at com.netoprise.neo4j.AbstractNeo4jManagedConnectionFactory.createDatabase(AbstractNeo4jManagedConnectionFactory.java:165)
at com.netoprise.neo4j.AbstractNeo4jManagedConnectionFactory.createDatabase(AbstractNeo4jManagedConnectionFactory.java:127)
at com.netoprise.neo4j.Neo4jManagedConnectionFactory.createManagedConnection(Neo4jManagedConnectionFactory.java:163)
at com.sun.enterprise.resource.allocator.ConnectorAllocator.createResource(ConnectorAllocator.java:160)
at com.sun.enterprise.resource.pool.ConnectionPool.createSingleResource(ConnectionPool.java:907)
... 68 more
Caused by: java.lang.AssertionError
at org.neo4j.index.impl.lucene.LuceneDataSource.cleanWriteLocks(LuceneDataSource.java:265)
at org.neo4j.index.impl.lucene.LuceneDataSource.cleanWriteLocks(LuceneDataSource.java:260)
at org.neo4j.index.impl.lucene.LuceneDataSource.cleanWriteLocks(LuceneDataSource.java:260)
at org.neo4j.index.impl.lucene.LuceneDataSource.cleanWriteLocks(LuceneDataSource.java:260)
at org.neo4j.index.impl.lucene.LuceneDataSource.<init>(LuceneDataSource.java:185)
at org.neo4j.index.lucene.LuceneIndexProvider.load(LuceneIndexProvider.java:72)
at org.neo4j.kernel.InternalAbstractGraphDatabase$DefaultKernelExtensionLoader.loadIndexImplementations(InternalAbstractGraphDatabase.java:1171)
at org.neo4j.kernel.InternalAbstractGraphDatabase$DefaultKernelExtensionLoader.init(InternalAbstractGraphDatabase.java:1143)
at org.neo4j.kernel.lifecycle.LifeSupport$LifecycleInstance.init(LifeSupport.java:382)
... 78 more
A quick inspection reveals that this exception is linked to an undeletable "write.lock" file. My write.lock file can't be deleted because, I guess, the migration is not over.
How can I make sure the migration is done before the store is used, without migrating it outside of GlassFish?
Is there a way to have exclusive store migrations in that context? And if so, how?
And is that the solution to my problem?
EDIT 1 Added exception message.
EDIT 2 All this only happens when the loaded graph was previously used with Neo4j 1.5 and is now used with a Neo4j 1.8 connector. When the graph is created by the connector, absolutely no error happens.
EDIT 3 Strangely enough, this happens only as long as there is no debugger plugged into that code: as soon as I try to debug it, the issue stops appearing. This makes me think there may be a migration cleanup mechanism that removes the write lock once migration is done, and that this cleanup is not performed when using my neo4j JCA connector. Is that a valid observation?
I am not too familiar with the JCA connector, but to be sure, I would just write a very small migration Java class that opens the database, lets it migrate, and shuts down. Then try again with the JCA connector.
After further investigation, the truth turned out to lie not in multiple calls to the EmbeddedGraphDatabase constructor, but in multiple identical IndexProviders being loaded.
I use neo4j embedded in an open-source JCA connector.
In this connector, the org.neo4j.kernel.Service class is replaced by a custom one which contains a workaround for service loading with JBoss non-shared libraries.
Unfortunately, in our context, this workaround implies loading the index provider twice:
once using the EAR classloader
once using the Glassfish library classloader.
Why?
Because our neo4j instance is used both for application data AND for authentication, the neo4j connector jar is put in ${domain}/lib. As a consequence, due to classloader delegation in the application server, the EAR classloader delegates to the Glassfish library classloader and finds the LuceneIndexProvider that way. Then the Glassfish library classloader is directly used to load the same LuceneIndexProvider class.
We therefore end up with two LuceneIndexProvider objects, both trying to migrate the Lucene index. This leads to the AssertionError, as the write.lock file created by the first object should be deleted by the second one, which can't do that.
I then changed that very specific class slightly, to use the JBoss workaround only when the default loading mechanism does not return any class (see commit here). This small change worked like a charm, so I think you can consider this issue fixed.