I just upgraded from Chronicle Queue 4.6.109 to 5.16.16. Since the upgrade I noticed many warnings of the following form in all of our services that use Chronicle Queue:
2018-10-16 16:26:58,524 WARN [main ] SCQIndexing Took 256 us to linearScan by position from 379578902 to 379580588 = (0x169ff0ac-0x169fea16)=1686
2018-10-16 16:29:19,130 WARN [main ] SCQIndexing Took 315 us to linearScan by position from 411040047 to 411042086 = (0x18800126-0x187ff92f)=2039
2018-10-16 16:29:40,121 WARN [main ] SCQIndexing Took 73 us to linearScan by position from 415383606 to 415388071 = (0x18c251a7-0x18c24036)=4465
2018-10-16 16:34:03,655 WARN [main ] SCQIndexing Took 310 us to linearScan by position from 478146209 to 478150976 = (0x1c800140-0x1c7feea1)=4767
...
Is this something I really need to worry about, or is it just internal profiling output? Is there anything I can change in my code to eliminate these warnings, i.e. reduce the scan time (other than setting the property chronicle.queue.report.linear.scan.latency to false)?
This is solved with 5.17.0. See https://github.com/OpenHFT/Chronicle-Queue/issues/526
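If upgrading is not immediately an option, here is a minimal sketch of the workaround hinted at in the question: disabling the report via the system property, plus (an assumption, not something the project documents as a fix) making the queue index denser so each linear scan covers fewer entries. The path and spacing value are hypothetical.
import net.openhft.chronicle.queue.ExcerptAppender;
import net.openhft.chronicle.queue.impl.single.SingleChronicleQueue;
import net.openhft.chronicle.queue.impl.single.SingleChronicleQueueBuilder;

public class QueueSetup {
    public static void main(String[] args) {
        // Suppress the linear-scan latency report (the property named in the question);
        // set it before any queue is created.
        System.setProperty("chronicle.queue.report.linear.scan.latency", "false");

        // Assumption: a smaller index spacing indexes excerpts more often, shortening
        // the distance a linear scan has to cover, at the cost of a larger index.
        try (SingleChronicleQueue queue = SingleChronicleQueueBuilder
                .binary("/tmp/example-queue")   // hypothetical path
                .indexSpacing(16)
                .build()) {
            ExcerptAppender appender = queue.acquireAppender();
            appender.writeText("hello");
        }
    }
}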
Problem
When performing queries in my production environment, the index is not being used and a full scan is performed, but my development environment works fine and uses the index.
After digging deeper into the problem in production, it also seems that the index information is being saved to the storage backend, but the data is not; instead it is being stored locally. I have no idea why this is...
I will explain the architecture now:
Environments
The following describes my two environments. Importantly, the index in question is a composite index and as such uses the storage backend, but I have still included the index backend (Elasticsearch) in the architecture description.
Both local and production environment versions are the same, i.e.:
Janusgraph: 0.5.2
ScyllaDB: 0.5.2
Elasticsearch: 7.13.1
Local Environment
Services are running in docker-compose, consisting of a single JanusGraph instance, a single ScyllaDB instance, and a single Elasticsearch instance.
Production Environment
Running on AWS in a Kubernetes cluster managed with EKS, I have multiple JanusGraph deployments that connect to a ScyllaDB cluster (in the same k8s cluster, managed via Scylla For Kubernetes, https://operator.docs.scylladb.com/stable/) and to an Elasticsearch cluster.
Setup
The following is the simplest example I can give that exhibits the problem I describe.
I pre-create the indexes with the JanusGraph management system, for example:
# management.groovy
import org.janusgraph.graphdb.database.management.ManagementSystem
cluster = Cluster.open("/opt/janusgraph/my_scripts/gremlin.yaml")
client = cluster.connect()
graph = JanusGraphFactory.open("/opt/janusgraph/my_scripts/env.properties")
g = graph.traversal().withRemote(DriverRemoteConnection.using(client, "g"))
m = graph.openManagement()
uid_property = m.makePropertyKey("uid").dataType(String).make()
user_label = m.makeVertexLabel("User").make()
m.buildIndex("index::User::uid", Vertex.class).addKey(uid_property).indexOnly(user_label).buildCompositeIndex()
m.commit()
Upon inspection with m.printSchema() I can see that the indexes are ENABLED, in both my local environment and my production environment.
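Besides m.printSchema(), the index status can also be confirmed programmatically. A minimal sketch in the same Groovy style as the script above (index name taken from the script; SchemaStatus may need to be imported from org.janusgraph.core.schema if it is not already on the console's default imports):
// Block until the composite index reports ENABLED, then print the status report.
report = ManagementSystem.awaitGraphIndexStatus(graph, "index::User::uid").status(SchemaStatus.ENABLED).call()
println report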
I then proceed to import all the data that needs to exist on the graph; both the local and production environments import it without issue.
Performing Queries
The following outlines what happens when I run a query.
Local Environment
What we see here is a simple lookup just to check that the query is using the index:
gremlin> g.V().has("User", "uid", "00003b90-dcc2-494d-a179-ac9009029501").profile()
==>Traversal Metrics
Step Count Traversers Time (ms) % Dur
=============================================================================================================
JanusGraphStep([],[~label.eq(User), uid.eq(... 1 1 1.837 100.00
\_condition=(~label = User AND uid = 00003b90-dcc2-494d-a179-ac9009029501)
\_isFitted=true
\_query=multiKSQ[1]#4000
\_index=index::User::uid
\_orders=[]
\_isOrdered=true
optimization 0.038
optimization 0.497
backend-query 1 0.901
\_query=index::User::uid:multiKSQ[1]#4000
\_limit=4000
>TOTAL - - 1.837 -
Production Environment
Again, we run the query to see if it is using the index (which it is not):
g.V().has("User", "uid", "00003b90-dcc2-494d-a179-ac9009029501").profile()
==>Traversal Metrics
Step Count Traversers Time (ms) % Dur
=============================================================================================================
JanusGraphStep([],[~label.eq(User), uid.eq(... 1 1 11296.568 100.00
\_condition=(~label = User AND uid = 00003b90-dcc2-494d-a179-ac9009029501)
\_isFitted=false
\_query=[]
\_orders=[]
\_isOrdered=true
optimization 0.025
optimization 0.102
scan 0.000
\_condition=VERTEX
\_query=[]
\_fullscan=true
>TOTAL - - 11296.568 -
What happened? So far, my best guess:
The storage backend is NOT being used for storing data, but it is being used for storing information about the indexes.
Update (Aug 16 2021): after digging around some more, I found out something interesting.
It is now clear that the data is actually not being saved to the storage backend at all.
In my local environment I set the storage.directory environment variable to /var/lib/janusgraph/data, which mounts onto an empty directory; this directory remains empty. Any vertex/edge updates get saved to the ScyllaDB storage backend, and the data persists between JanusGraph instance restarts.
In my production environment, this directory (/var/lib/janusgraph/data) is populated with files:
-rw-r--r-- 1 janusgraph janusgraph 0 Aug 16 05:46 je.lck
-rw-r--r-- 1 janusgraph janusgraph 9650 Aug 16 05:46 je.config.csv
-rw-r--r-- 1 janusgraph janusgraph 450 Aug 16 05:46 je.info.0
-rw-r--r-- 1 janusgraph janusgraph 0 Aug 16 05:46 je.info.0.lck
drwxr-xr-x 2 janusgraph janusgraph 118 Aug 16 05:46 .
-rw-r--r-- 1 janusgraph janusgraph 7533 Aug 16 05:46 00000000.jdb
drwx------ 1 janusgraph janusgraph 75 Aug 16 05:53 ..
-rw-r--r-- 1 janusgraph janusgraph 19951 Aug 16 06:09 je.stat.csv
Any subsequent updates on the graph seem to be reflected here; the updates do not get written to the storage backend, and other JanusGraph instances on Kubernetes cannot see changes made by other instances. This leads me to conclude that the storage backend is not being used for storing data.
The domain names used for storage.hostname and index.hostname both resolve to IP addresses, confirmed with nslookup.
The endpoints must also be reachable, as the keyspace janusgraph is created with the replication factor I defined, and it retains the index information across restarts of the JanusGraph instances.
Idea 1 (Index is not enabled)
This was disproved by running m.printSchema(), which showed that all the indexes were ENABLED.
Idea 2 (Storage backends have different data)
I looked at the data stored in ScyllaDB and got a summary with nodetool cfstats; this does show a difference:
# Local
Keyspace : janusgraph
Read Count: 1688328
Read Latency: 2.5682805710738673E-5 ms
Write Count: 1055210
Write Latency: 1.702409946835227E-5 ms
...
Memtable cell count: 126411
Memtable data size: 345700491
Memtable off heap memory used: 480247808
# Production
Keyspace : janusgraph
Read Count: 6367
Read Latency: 2.1203078372860058E-5 ms
Write Count: 21
Write Latency: 0.0 ms
...
Memtable cell count: 4
Memtable data size: 10092
Memtable off heap memory used: 131072
Although I don't know how to explain the difference, it is clear that both backends contain all the data. This was verified with various count() queries over labels, such as g.V().hasLabel("User").count(), for which both environments report the same result.
Idea 3 (Elasticsearch Warnings)
When launching a gremlin console session, there is a difference in that the production environment shows:
07:27:09 WARN org.elasticsearch.client.RestClient - request [PUT http://*******<i_removed_the_domain>******:9200/_cluster/settings] returned 3 warnings: [299 Elasticsearch-7.13.4-c5f60e894ca0c61cdbae4f5a686d9f08bcefc942 "[node.data] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version."],[299 Elasticsearch-7.13.4-c5f60e894ca0c61cdbae4f5a686d9f08bcefc942 "[node.master] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version."],[299 Elasticsearch-7.13.4-c5f60e894ca0c61cdbae4f5a686d9f08bcefc942 "[node.ml] setting was deprecated in Elasticsearch and will be removed in a future release! See the breaking changes documentation for the next major version."]
but as my problem concerns composite indexes, I believe we can disregard the Elasticsearch warnings.
Idea 4 (ScyllaDB cluster node resources)
Another idea I had was increasing the node resources; even with 7 GB of RAM, the problem still persists.
Finally...
I don't know what to try next in order to solve this problem. This is my first time pushing JanusGraph into production, and perhaps I have missed something important. I have been stuck on this problem for quite a while, hence now asking the community here for help.
Thank you very much for reading this far, and hopefully helping me to solve this problem.
I solved the problem myself. I realised that the K8s Deployment .yaml file I use for deploying needed all environment variables to have the prefix janusgraph.; without it, the JanusGraph server was starting with all default settings rather than my selected ones.
Every time I created a Gremlin shell session (which connected to its localhost server), even though I was specifying all the correct endpoints and configuration, it was still saving the data according to the default JanusGraph settings. Even in this case, I don't know why the indexes were successfully created on my specified backend.
Nonetheless, the solution was to make sure the environment variables have the prefix janusgraph.
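For illustration, a minimal sketch of what the relevant part of the Deployment container spec looks like with the prefix applied (the image tag matches the version listed above; the service hostnames are hypothetical, and only the janusgraph. prefix on the variable names is the point):
containers:
  - name: janusgraph
    image: janusgraph/janusgraph:0.5.2
    env:
      # Without the janusgraph. prefix these are ignored and the server
      # falls back to its default (local) storage configuration.
      - name: janusgraph.storage.backend
        value: "cql"
      - name: janusgraph.storage.hostname
        value: "scylla-client.scylla.svc"
      - name: janusgraph.index.search.backend
        value: "elasticsearch"
      - name: janusgraph.index.search.hostname
        value: "elasticsearch-master.es.svc"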
I am writing a modified version of the Task Assignment example with my own domain model.
In my model each Task can have a NextTask and a PreviousTask and an Assignee. All 3 are configured as PlanningVariables:
...
/** PreviousTask is a calculated task that the Resource will complete before this one. */
@PlanningVariable(valueRangeProviderRefs = { "tasksRange" }, graphType = PlanningVariableGraphType.CHAINED)
public Task PreviousTask;
@InverseRelationShadowVariable(sourceVariableName = "PreviousTask")
public Task NextTask;
@AnchorShadowVariable(sourceVariableName = "PreviousTask")
public Resource Assignee;
I have been stuck on the Local Search Phase step for some time now as it appears to require an initialized state of my planning variables (Task.PreviousTask in this case).
Error log:
2020-07-16 15:00:15.341 INFO 4616 --- [pool-1-thread-1] o.o.core.impl.solver.DefaultSolver : Solving started: time spent (65), best score (-27init/[0]hard/[0/0/0/0]soft), environment mode (REPRODUCIBLE), random (JDK with seed 0).
2020-07-16 15:00:15.356 INFO 4616 --- [pool-1-thread-1] .c.i.c.DefaultConstructionHeuristicPhase : Construction Heuristic phase (0) ended: time spent (81), best score (-27init/[0]hard/[0/0/0/0]soft), score calculation speed (90/sec), step total (0).
2020-07-16 15:00:15.376 ERROR 4616 --- [pool-1-thread-1] o.o.c.impl.solver.DefaultSolverManager : Solving failed for problemId (5e433f57-8c75-4756-8a9c-4c4ca4a83d6d).
java.lang.IllegalStateException: Local Search phase (1) needs to start from an initialized solution, but the planning variable (Task.PreviousTask) is uninitialized for the entity (com.redhat.optaplannersbs.domain.Task#697ea710).
Maybe there is no Construction Heuristic configured before this phase to initialize the solution.
I've been poring over the documentation and trying to figure out what I've missed or broken compared to the source example, but I cannot work it out. I have no Construction Heuristic configured (same as the example), and I could not see where the example ever sets the previousTaskOrEmployee variable before solving.
I could supply a random PreviousTask in the initial solution model, but surely at least one of the tasks would have no previous task?
May I ask if you have <localSearch/> in your solverConfig.xml? If so, the <constructionHeuristic/> has to be there as well.
If neither of these phases (CH nor LS) is configured, both a Construction Heuristic and a Local Search are added with default parameters. But once <localSearch/> appears in the solverConfig.xml, it's considered an override of the defaults, and the user is expected to take care of initializing the solution (most often by providing a Construction Heuristic configuration).
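As a rough illustration, the same pairing can be expressed with OptaPlanner's programmatic config API instead of solverConfig.xml. This is only a sketch, assuming a recent OptaPlanner version with the fluent SolverConfig API; Task is from the question, while TaskAssignmentSolution and TaskConstraintProvider are hypothetical stand-ins for your solution and constraint provider classes.
import java.time.Duration;
import org.optaplanner.core.api.solver.Solver;
import org.optaplanner.core.api.solver.SolverFactory;
import org.optaplanner.core.config.constructionheuristic.ConstructionHeuristicPhaseConfig;
import org.optaplanner.core.config.localsearch.LocalSearchPhaseConfig;
import org.optaplanner.core.config.solver.SolverConfig;

public class SolverSetup {
    public static Solver<TaskAssignmentSolution> buildSolver() {
        SolverConfig solverConfig = new SolverConfig()
                .withSolutionClass(TaskAssignmentSolution.class)            // hypothetical solution class
                .withEntityClasses(Task.class)
                .withConstraintProviderClass(TaskConstraintProvider.class)  // hypothetical constraint provider
                // Declaring Local Search explicitly means a Construction Heuristic must be
                // declared too, so Task.PreviousTask is initialized before Local Search starts.
                .withPhases(new ConstructionHeuristicPhaseConfig(), new LocalSearchPhaseConfig())
                .withTerminationSpentLimit(Duration.ofSeconds(30));
        return SolverFactory.<TaskAssignmentSolution>create(solverConfig).buildSolver();
    }
}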
So I got Pepper robot charging working smoothly for one interval. I did it by creating a loop that calls goToStation() until it returns 0. Before that I deactivated the dialog topic and also paused awareness with pauseAwareness().
When it reached the station successfully I activated the dialog topic again and then resumed awareness. All the other features worked well after that. Then I told Pepper to leave the station and it did so successfully. Now, when I want Pepper to go to the charging station again, it just deactivates the topic and pauses awareness, but it will not go to the station.
The logs end here for the unsuccessful attempt to go back to charging:
[WARN ] audio.NuanceVoconEngine :xLhNmspConnectionEventCallback:0 Nuance Remote: Connection Failed.
[INFO ] audio.NuanceVoconEngine :xLhNmspConnectionEventCallback:0 eventType: 1 eventValue: 5 reason: 10 sessionId: <null> message: Disconnected.
[INFO ] Dialog :connectionChanged:0 Connection changed: 0
[WARN ] ConditionChecker :onEvent:0 Event queue is empty!
In the first charging attempt, which succeeds, the logs look like this afterwards:
[INFO ] audio.NuanceVoconEngine :xLhNmspConnectionEventCallback:0 Nuance Remote: Connection Established.
[INFO ] Dialog :connectionChanged:0 Connection changed: 1
[INFO ] audio.NuanceVoconEngine :xLhNmspConnectionEventCallback:0 eventType: 3 eventValue: 1 reason: 10 sessionId: db6dafbe-65d4-4348-8420-e772560e9fce message: <null>
...... (these logs continue with the ALRecharge manager doing the job)
What could be the cause of it working the first time and then simply not working the next time?
I have a Celery-based task queue with RabbitMQ as the broker. I am processing about 100 messages per day. I have no result backend set up.
I start the task master like this:
broker = os.environ.get('AMQP_HOST', None)
app = Celery(broker=broker)
server = QueueServer((default_http_host, default_http_port), app)
... and I start the worker like this:
broker = os.environ.get('AMQP_HOST', None)
app = Celery('worker', broker=broker)
app.conf.update(
CELERYD_CONCURRENCY = 1,
CELERYD_PREFETCH_MULTIPLIER = 1,
CELERY_ACKS_LATE = True,
)
The server runs correctly for quite some time, but after about two weeks it suddenly stops. I have tracked the stoppage down to RabbitMQ no longer receiving messages due to memory exhaustion:
Feb 25 02:01:39 render-mq-1 docker/e654ac167b10[2189]: vm_memory_high_watermark set. Memory used:252239992 allowed:249239961
Feb 25 02:01:39 render-mq-1 docker/e654ac167b10[2189]: =WARNING REPORT==== 25-Feb-2016::02:01:39 ===
Feb 25 02:01:39 render-mq-1 docker/e654ac167b10[2189]: memory resource limit alarm set on node rabbit#e654ac167b10.
Feb 25 02:01:39 render-mq-1 docker/e654ac167b10[2189]: **********************************************************
Feb 25 02:01:39 render-mq-1 docker/e654ac167b10[2189]: *** Publishers will be blocked until this alarm clears ***
Feb 25 02:01:39 render-mq-1 docker/e654ac167b10[2189]: **********************************************************
The problem is that I cannot figure out what needs to be configured differently to prevent this exhaustion. Obviously something somewhere is not being purged, but I don't understand what.
For instance, after about 8 days, rabbitmqctl status shows me this:
{memory,[{total,138588744},
{connection_readers,1081984},
{connection_writers,353792},
{connection_channels,1103992},
{connection_other,2249320},
{queue_procs,428528},
{queue_slave_procs,0},
{plugins,0},
{other_proc,13555000},
{mnesia,74832},
{mgmt_db,0},
{msg_index,43243768},
{other_ets,7874864},
{binary,42401472},
{code,16699615},
{atom,654217},
{other_system,8867360}]},
... when it was first started it was much lower:
{memory,[{total,51076896},
{connection_readers,205816},
{connection_writers,86624},
{connection_channels,314512},
{connection_other,371808},
{queue_procs,318032},
{queue_slave_procs,0},
{plugins,0},
{other_proc,14315600},
{mnesia,74832},
{mgmt_db,0},
{msg_index,2115976},
{other_ets,1057008},
{binary,6284328},
{code,16699615},
{atom,654217},
{other_system,8578528}]},
... even when all the queues are empty (except one job currently processing):
root#dba9f095a160:/# rabbitmqctl list_queues -q name memory messages messages_ready messages_unacknowledged
celery 61152 1 0 1
celery#render-worker-lg3pi.celery.pidbox 117632 0 0 0
celery#render-worker-lkec7.celery.pidbox 70448 0 0 0
celeryev.17c02213-ecb2-4419-8e5a-f5ff682ea4b4 76240 0 0 0
celeryev.5f59e936-44d7-4098-aa72-45555f846f83 27088 0 0 0
celeryev.d63dbc9e-c769-4a75-a533-a06bc4fe08d7 50184 0 0 0
I am at a loss to figure out how to find the reason for memory consumption. Any help would be greatly appreciated.
The logs say that you are using 252239992 bytes, which is about 250 MB, which is not that high.
How much memory do you have on this machine, and what is the vm_memory_high_watermark value for RabbitMQ? (You can check it by running rabbitmqctl eval "vm_memory_monitor:get_vm_memory_high_watermark().")
Maybe you should just increase the watermark.
Another option is making all your queues lazy: https://www.rabbitmq.com/lazy-queues.html
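For example (a hedged sketch; the policy name and queue pattern are arbitrary), lazy mode can be applied to the celery queues with a broker-side policy:
rabbitmqctl set_policy lazy-celery "^celery.*" '{"queue-mode":"lazy"}' --apply-to queues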
You don't seem to be generating a huge volume of messages, so the 2 GB memory consumption seems strangely high. Nonetheless, you could try getting RabbitMQ to delete old messages; in your Celery configuration, set
CELERY_DEFAULT_DELIVERY_MODE = 'transient'
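Following the app.conf.update pattern from the question, a minimal sketch of where this setting would go (the other keys are copied from the worker snippet above):
app.conf.update(
    CELERYD_CONCURRENCY = 1,
    CELERYD_PREFETCH_MULTIPLIER = 1,
    CELERY_ACKS_LATE = True,
    # Transient delivery: messages are not persisted to disk by RabbitMQ,
    # so they do not survive a broker restart.
    CELERY_DEFAULT_DELIVERY_MODE = 'transient',
)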
I am reviewing the ColdFusion Web Connector settings in workers.properties to hopefully address a sporadic response time issue.
I've been advised to inspect the output from the metrics.log file (CF Admin > Debugging & Logging > Debug Output Settings > Enable Metric Logging) and use this to inform the adjustments to the settings max_reuse_connections, connection_pool_size and connection_pool_timeout.
My question is: How do I interpret the metrics.log output to inform the choice of setting values? Is there any documentation that can guide me?
Examples from a 120-hour period:
95% of entries -
"Information","scheduler-2","06/16/14","08:09:04",,"Max threads: 150 Current thread count: 4 Current thread busy: 0 Max processing time: 83425 Request count: 9072 Error count: 72 Bytes received: 1649 Bytes sent: 22768583 Free memory: 124252584 Total memory: 1055326208 Active Sessions: 1396"
Occurred once -
"Information","scheduler-2","06/13/14","14:20:22",,"Max threads: 150 Current thread count: 10 Current thread busy: 5 Max processing time: 2338 Request count: 21 Error count: 4 Bytes received: 155 Bytes sent: 139798 Free memory: 114920208 Total memory: 1053097984 Active Sessions: 6899"
Environment:
3 x Windows 2008 R2 (hardware load balanced)
ColdFusion 10 (update 12)
Apache 2.2.21
Richard, I realize your question here is from 2014, and perhaps you have since resolved it, but I suspect your problem was that the port set in the CF admin (below the "metrics log" checkbox) was set to 8500, which is your internal web server (used by the CF admin only, typically, if at all). That's why the numbers are not changing. (And for those who don't enable the internal web server at installation of CF, or later, most values in the metrics log are null).
I address this problem in a blog post I happened to do just last week: http://www.carehart.org/blog/client/index.cfm/2016/3/2/cf_metrics_log_part1
Hope any of this helps.