Environment: Ignite 2.8.1, Java 11
My application runs out of memory a few minutes after startup. Analyzing the heap dump created on OOM, I see millions of instances of the class org.apache.ignite.internal.processors.continuous.GridContinuousMessage.
I do not see any direct references to these from my code.
Please suggest. Attaching snapshot.
You seem to have a Continuous Query running that is too slow or hanging and cannot process notifications in time, so they pile up.
Related
Environment: Ignite 2.8.1, Java 8
My application's heap fills up a few hours after startup. Analyzing the heap dump, I see instances of classes under org.apache.ignite.internal.processors.query.*. It looks like query state is not cleaned up from the heap after execution, and after some time the application fails because the heap is full.
One thing I have noticed is that all these entries belong to queries triggered via Ignite executor services or normal task scheduling services.
Please suggest. Attaching snapshot.
Ignite is just trying to execute the SQL queries that are being sent to it. You would need to investigate the client.
On the Ignite side, you can make sure you're using the right garbage collector and setting the heap size appropriately. There's no One Good Answer, but in general, use the G1 GC, and I'd start with 4 GB of heap space if you're using a lot of SQL.
I have a datapipeline component that reads SQS messages, generated at S3 upload trigger, and parses and publishes the message for a batchpipeline component.
I have recently observed that in the production system, my datapipeline keeps crashing with an OutOfMemory error under heavy load, but it never crashes when tested locally with similar loads. The batchpipeline never seems to crash in production.
How do I go about debugging it when I can't reproduce it locally?
I found a solution to the problem above after two weeks, so I figured I'd document it for others and my future self.
I wasn't able to replicate the issue because the aws command
aws s3 cp --recursive dir s3://input-queue/dir
somehow wasn't uploading files fast enough to stress my local datapipeline. So I brought down the datapipeline, and once there were 10k SQS messages in the queue, I started it again. As expected, it crashed with an Out Of Memory error after processing ~3000 messages. It turns out that the pipeline was able to handle continuous throughput, but it broke when it started with a backlog of 10k messages.
My hypothesis was that Java garbage collection was unable to clean up objects properly after execution. So I started analyzing the generated heap dump, and after some days of research I stumbled on the possible root cause of the Out of Memory error. There were ~5000 instances of my MessageHandlerTask class, when ideally they should have been GC'd after being processed rather than piling up.
Further investigation along that line of thought led me to the root cause: the code was using Executors.newFixedThreadPool() to create an ExecutorService for submitting tasks to. This implementation uses an unbounded queue of tasks, so if too many tasks are submitted, all of them wait in the queue, taking up a huge amount of memory.
That is what was happening here: messages were being polled faster than they could be processed. Whenever there was a message backlog, this created a large number of valid MessageHandlerTask instances that filled the heap.
The fix was to create a ThreadPoolExecutor with an ArrayBlockingQueue of capacity 100, so that there is a cap on the number of queued MessageHandlerTask instances and their member variables.
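A minimal sketch of such a bounded pool is below. The class name BoundedPool, the 60-second keep-alive, and the CallerRunsPolicy rejection handler are my assumptions for illustration; the original code may have handled a full queue differently. With CallerRunsPolicy, a submitter that finds the queue full runs the task itself, which slows down polling instead of letting tasks accumulate on the heap.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical helper; the name and the rejection policy are assumptions.
final class BoundedPool {
    private BoundedPool() {}

    /**
     * Creates a pool whose work queue holds at most queueCapacity tasks.
     * Note: ThreadPoolExecutor only starts threads beyond coreThreads
     * once the queue is full.
     */
    static ThreadPoolExecutor create(int coreThreads, int maxThreads, int queueCapacity) {
        return new ThreadPoolExecutor(
                coreThreads,
                maxThreads,
                60L, TimeUnit.SECONDS,                      // idle timeout for extra threads (assumed value)
                new ArrayBlockingQueue<>(queueCapacity),    // bounded queue caps waiting tasks
                new ThreadPoolExecutor.CallerRunsPolicy()); // full queue: submitter runs the task itself
    }
}
```

For example, BoundedPool.create(40, 40, 100) would give 40 worker threads and at most 100 waiting tasks, so the number of live task objects stays bounded no matter how large the backlog is.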
Having figured out the fix, I moved on to optimize the pipeline for maximum throughput by varying the maximumPoolSize of the ThreadPoolExecutor. It turned out there were some SQS connection exceptions happening at higher thread counts. Further investigation revealed that increasing the SQS connection pool size ameliorated this issue.
I ultimately settled on 40 threads for the given Xmx heap size of 1.5G and an SQS connection pool size of 80, so that the task threads do not run out of SQS connections while processing. This gave me a throughput of 44 messages/s with just a single instance of the datapipeline.
I also found out why the batchpipeline never crashed in production despite having a similar ExecutorService implementation: the datapipeline could be stressed by too many concurrent S3 uploads, but the messages for the batchpipeline were produced by the datapipeline at a gradual rate. Besides, the batchpipeline had a much higher throughput, which I benchmarked at 347 messages/s with a maximumPoolSize of 70.
We are currently upgrading a TYPO3 installation with about 60,000 pages to V9.
The upgrade wizard "Introduce URL parts ("slugs") to all existing pages" does not finish. In the browser (Install Tool) I get a timeout.
Calling it via
./vendor/bin/typo3cms upgrade:wizard pagesSlugs
results in the following error:
[ Symfony\Component\Process\Exception\ProcessSignaledException ]
The process has been signaled with signal "9".
After consulting my favourite internet search engine, I think that most likely means "out of memory".
Sadly, the database doesn't seem to be touched at all, so no pages got a slug. That means just running the process several times will not help. Observing the process, I see PHP take all the memory it can get and then fill the swap. When the swap is full, the process crashes.
Tested so far in a local Docker setup on a host with 16GB RAM and on a server with 8 cores but 8GB RAM (the DB is on an external machine).
Any ideas to fix that?
After debugging, I found out that the cause was messed-up relations in the database: there were non-deleted pages pointing to non-existent parents. This was mainly the result of a heavy cleanup of the database beforehand. The wizard does not check for this, which could be an improvement to it, but in this case the main problem is my database.
I'm running some queries over a TPC-H 100 GB dataset on Presto. I have 4 nodes: 1 master and 3 workers. When I run some of the queries (not all of them), I see in the Presto web interface that the nodes die during execution, and the query fails with the following error:
com.facebook.presto.operator.PageTransportTimeoutException: Encountered too many errors talking to a worker node. The node may have crashed or been under too much load. This is probably a transient issue, so please retry your query in a few minutes.
I rebooted all nodes and the Presto service, but the error remains. The problem doesn't occur if I run the same queries over a smaller dataset. Can someone provide some help with this problem?
Thanks
There are three possible causes for this kind of error. You can ssh into one of the workers while the query is running to find out which one applies.
High CPU
Tune down task.concurrency to, for example, 8.
High memory
In jvm.config, -Xmx should be no more than 80% of total memory. In config.properties, query.max-memory-per-node should be no more than half of the Xmx value.
Low open file limit
Set a larger open-file limit for the Presto process in /etc/security/limits.conf. The default is definitely way too low.
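The two memory rules of thumb above can be written down as a small helper for sanity-checking a node's settings. This is only a sketch of the arithmetic: the class and method names are hypothetical, and the 80% and 50% thresholds are the suggestions from this answer, not limits enforced by Presto.

```java
// Hypothetical helper; encodes the rules of thumb from the answer above.
final class PrestoMemoryRules {
    static final long GIB = 1024L * 1024L * 1024L;

    private PrestoMemoryRules() {}

    /** Suggested upper bound for -Xmx: 80% of the machine's physical memory. */
    static long maxHeapBytes(long physicalBytes) {
        return physicalBytes * 8 / 10;
    }

    /** Suggested upper bound for query.max-memory-per-node: half of -Xmx. */
    static long maxQueryMemoryPerNodeBytes(long heapBytes) {
        return heapBytes / 2;
    }

    /** True if the chosen heap size stays within the 80% suggestion. */
    static boolean heapFits(long heapBytes, long physicalBytes) {
        return heapBytes <= maxHeapBytes(physicalBytes);
    }
}
```

On a machine with 16 GiB of RAM, for instance, heapFits(16 * GIB, 16 * GIB) is false while heapFits(10 * GIB, 16 * GIB) is true, and a 10 GiB heap would suggest a query.max-memory-per-node of at most 5 GiB.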
It might be a configuration issue. For example, if the local maximum memory is not set appropriately and the query uses too much heap memory, a full GC might occur and cause such errors. I would suggest asking in the Presto Google Group and describing a way to reproduce the issue :)
I was running Presto on a Mac with 16GB of RAM. Below is the configuration in my jvm.config file.
-server
-Xmx16G
-XX:+UseG1GC
-XX:G1HeapRegionSize=32M
-XX:+UseGCOverheadLimit
-XX:+ExplicitGCInvokesConcurrent
-XX:+HeapDumpOnOutOfMemoryError
-XX:OnOutOfMemoryError=kill -9 %p
I was getting the following error even when running the query
Select now();
Query 20200817_134204_00005_ud7tk failed: Encountered too many errors talking to a worker node. The node may have crashed or be under too much load. This is probably a transient issue, so please retry your query in a few minutes.
I changed -Xmx16G to -Xmx10G and now it works fine.
I used the following link to install Presto on my system.
Link for Presto Installation
I am trying to load a dataset into GraphDB 7.0. I wrote a Python script in Sublime Text 3 to transform and load the data. The program suddenly stopped working and closed, the computer threatened to restart but didn't, and I lost several hours' worth of computing because GraphDB doesn't let me query the inserts. This is the error I get in GraphDB:
The currently selected repository cannot be used for queries due to an error:
org.openrdf.repository.RepositoryException: java.lang.RuntimeException: There is not enough memory for the entity pool to load: 65728645 bytes are required but there are 0 left. Maybe cache-memory/tuple-index-memory is too big.
I set the JVM as follows:
-Xms8g
-Xmx9g
I don't remember exactly what values I set for the cache and index memory.
For the record, the database I need to parse has about 300k records, and the program stopped at about 50k. How do I resolve this issue?
Open the workbench and check the amount of memory you have given to cache memory.
Xmx should be a value that is enough for
cache-memory + memory-for-queries + entity-pool-hash-memory
Sadly, the last term cannot be calculated easily because it depends on the number of entities in the repository. You will have to either:
Increase the Java heap by setting a bigger value for Xmx
Decrease the value for cache memory