I have checked out 19 projects from SVN; all of them are in the source code directory.
I run indexing from Jenkins with the following command:
C:\Jenkins\workspace\Grok-Multiple-Projects-Checkout-And-Indexing>java -Xmx12288m -Xms2048m -jar C:\grok_0.12.1\opengrok-0.12.1.5\lib\opengrok.jar -W C:\grok_0.12.1\Data\configuration.xml -c C:\grok_0.12.1\ctags58\ctags.exe -P -S -v -s C:\grok_0.12.1\src -d C:\grok_0.12.1\Data -i *.zip -i *.tmp -i *.db -i *.jar -i d:.svn -G -L polished -a on -T 8
The first time I run indexing with the command above, there are no errors!
However, consecutive runs produce a
java.lang.OutOfMemoryError: Java heap space
The run proceeds fine until a point in the logs where it hangs for approximately 30 minutes, and at some point memory consumption increases until it eats up all of the allocated 12 GB of RAM.
Here is the log:
09:38:40 Nov 01, 2016 9:38:45 AM org.opensolaris.opengrok.index.IndexDatabase$1 run
09:38:40 SEVERE: Problem updating lucene index database:
09:38:40 java.lang.OutOfMemoryError: Java heap space
09:38:40
09:38:41 Nov 01, 2016 9:38:45 AM org.opensolaris.opengrok.util.Statistics report
09:38:41 INFO: Done indexing data of all repositories (took 0:37:20)
09:38:41 Nov 01, 2016 9:38:45 AM org.opensolaris.opengrok.util.Statistics report
09:38:41 INFO: Total time: 0:37:21
09:38:41 Nov 01, 2016 9:38:45 AM org.opensolaris.opengrok.util.Statistics report
09:38:41 INFO: Final Memory: 19M/11,332M
Any ideas as to why it needs so much memory, and whether increasing it will solve the OOM error? Could it be a memory leak in OpenGrok?
I know this is an old question about an old OpenGrok version (I can tell because OpenGrok dropped the org.opensolaris class prefix in 2018); however, I think it still deserves an answer.
Assuming the indexer was indeed performing an incremental reindex, something must be consuming the heap excessively. For instance, it could be caused by merging the pre-existing history with the newly added (incremental) history.
Running the indexer with the -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/data/jvm/ Java options will create a heap dump that can then be analyzed with tools such as MAT (Eclipse) or YourKit.
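For example (only a sketch: the two -XX flags are standard HotSpot options, but the dump directory C:\grok_0.12.1\dumps is an assumed path, and the trailing options are the same ones from the question's command line):
java -Xmx12288m -Xms2048m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=C:\grok_0.12.1\dumps -jar C:\grok_0.12.1\opengrok-0.12.1.5\lib\opengrok.jar -W C:\grok_0.12.1\Data\configuration.xml <rest of the original indexer options>
The resulting .hprof file can then be opened in MAT or YourKit to see which objects dominate the heap when the OutOfMemoryError is thrown.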
As for memory leaks: it is not that memory leaks are impossible in Java (e.g. via thread-local storage); it is just quite improbable for them to manifest in the indexer.
I am trying to migrate from DSpace 6.4 to 7.1. The new DSpace is installed on another machine (a virtual machine running CentOS 7 with 8 GB of RAM).
I have created a full-site AIP backup with user passwords (total size of the packages: 11 GB).
I've tried to do a full restore but always get the same error.
So now I'm just trying to import only the first level, without children:
JAVA_OPTS="-Xmx2048m -Xss4m -Dfile.encoding=UTF-8" /dspace/bin/dspace packager -r -k -t AIP -e dinkwi.test#gmail.com -o skipIfParentMissing=true -i 123456789/0 /home/dimich/11111/repo.zip
It doesn't matter whether I use the -k or -f parameter; the output is always the same:
Ingesting package located at /home/dimich/11111/repo.zip
Exception: null
java.lang.StackOverflowError
at org.dspace.eperson.GroupServiceImpl.getChildren(GroupServiceImpl.java:788)
at org.dspace.eperson.GroupServiceImpl.getChildren(GroupServiceImpl.java:802)
.... (more than 1k lines)
at org.dspace.eperson.GroupServiceImpl.getChildren(GroupServiceImpl.java:802)
My dspace.log ends with:
2021-12-20 11:05:28,141 INFO unknown unknown org.dspace.eperson.GroupServiceImpl # dinkwi.test#gmail.com::update_group:group_id=9e6a2038-01d9-41ad-96b9-c6fb55b44381
2021-12-20 11:05:30,048 INFO unknown unknown org.dspace.eperson.GroupServiceImpl # dinkwi.test#gmail.com::update_group:group_id=23aaa7e9-ca2d-4af5-af64-600f7126e2be
2021-12-20 11:05:30,800 INFO unknown unknown org.springframework.cache.ehcache.EhCacheManagerFactoryBean # Shutting down EhCache CacheManager 'org.dspace.services'
So I just want to figure out the problem: is it a too-small stack, a bug in user/group handling that ends in an infinite loop/recursion, or maybe something else?
The main problem: I'm good with PHP/MySQL and have no experience with Java/PostgreSQL or with how to debug this.
Any help would be appreciated.
P.S. After a failed restore I always run the command
/dspace/bin/dspace cleanup -v
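One hedged way to tell a too-small stack apart from unbounded recursion (this is only a suggestion, not a confirmed fix) is to rerun the exact same packager call with a much larger thread stack, e.g. -Xss16m instead of -Xss4m:
JAVA_OPTS="-Xmx2048m -Xss16m -Dfile.encoding=UTF-8" /dspace/bin/dspace packager -r -k -t AIP -e dinkwi.test#gmail.com -o skipIfParentMissing=true -i 123456789/0 /home/dimich/11111/repo.zip
If the StackOverflowError in GroupServiceImpl.getChildren still appears with the larger stack, the group hierarchy most likely contains a cycle and the recursion never terminates; a bigger stack will not help in that case.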
I am getting this error during startup of a Scylla node while I am loading data:
Nov 12 21:55:13 usw1-im-stage-scylladb1 scylla[53703]: [shard 0] database - Keyspace product_prod: Reading CF cleanup_transaction id=bb0a0640-058f-11ea-b8e4-00000000000c version=dde3ee6f-185b-37ba-80fb-6425cce4532f
Nov 12 22:10:02 usw1-im-stage-scylladb1 systemd[1]: scylla-server.service start operation timed out. Terminating.
I am running this on Scylla Enterprise 2019.1.2.
Scylla's documentation includes KB articles and an FAQ. I think this KB article is spot-on, exactly what you are looking for:
https://docs.scylladb.com/troubleshooting/scylla_wont_start/#solution
Here is the solution suggested there:
Locate the systemd unit file scylla-server.service.
For CentOS operating systems it is expected to be at /usr/lib/systemd/system/scylla-server.service.
For Ubuntu operating systems it is expected to be at /etc/systemd/system/scylla-server.service.
Create the matching drop-in directory (if it does not exist):
CentOS
sudo mkdir /usr/lib/systemd/system/scylla-server.service.d
Ubuntu
sudo mkdir /etc/systemd/system/scylla-server.service.d
Create a file inside that directory named 10-timeout.conf, with the following contents:
[Service]
TimeoutStartSec=9000
Reload the systemd daemon for the new configuration to take effect:
systemctl daemon-reload
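On CentOS, for example, the whole change can be applied in one short sequence (a sketch based on the steps above; adjust the path for Ubuntu):
sudo mkdir -p /usr/lib/systemd/system/scylla-server.service.d
printf '[Service]\nTimeoutStartSec=9000\n' | sudo tee /usr/lib/systemd/system/scylla-server.service.d/10-timeout.conf
sudo systemctl daemon-reload
sudo systemctl start scylla-server
Using a drop-in file rather than editing scylla-server.service directly keeps the override intact across package upgrades.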
Here is an interesting mystery ...
This code ...
shuf $TRAINING_UNSHUFFLED > $TRAINING_SHUFFLED
wc -l $TRAINING_UNSHUFFLED
wc -l $TRAINING_SHUFFLED
shuf $VALIDATION_UNSHUFFLED > $VALIDATION_SHUFFLED
wc -l $VALIDATION_UNSHUFFLED
wc -l $VALIDATION_SHUFFLED
generates this error ...
shuf: read error: Bad file descriptor
8122 /nfs/digits/datasets/com-aosvapps-distracted-driving3/databases/TrainImagePathsAndLabels_AlpineTest1.csv
0 /nfs/digits/datasets/com-aosvapps-distracted-driving3/databases/TrainImagePathsAndLabels_AlpineTest1_Shuffled.csv
shuf: read error: Bad file descriptor
882 /nfs/digits/datasets/com-aosvapps-distracted-driving3/databases/ValImagePathsAndLabels_AlpineTest1.csv
0 /nfs/digits/datasets/com-aosvapps-distracted-driving3/databases/ValImagePathsAndLabels_AlpineTest1_Shuffled.csv
but ONLY when I run it as a background job like so ...
tf2$nohup ./shuffle.sh >> /tmp/shuffle.log 2>&1 0>&- &
[1] 6897
When I run it directly in an interactive shell, it seems to work fine.
tf2$./shuffle.sh > /tmp/shuffle.log
I am guessing that this has something to do with the fact that both the input and output files reside on an NFS share on a different AWS EC2 instance.
The severing of stdin, stdout and stderr in the background-process example is suspicious. It is done so that the process will not die if the terminal session is closed. I have many other commands that read from and write to this share without any problems at all; only the shuf command is being difficult.
I am curious what might be causing this and whether it is fixable without seeking an alternative to shuf.
I am using shuf (GNU coreutils) 8.21 on Ubuntu 14.04.5 LTS.
tf2$which shuf
/usr/bin/shuf
tf2$shuf --version
shuf (GNU coreutils) 8.21
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>.
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Written by Paul Eggert.
tf2$lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 14.04.5 LTS
Release: 14.04
Codename: trusty
UPDATE: eliminating the severing of stdin makes the problem go away.
i.e., if instead of doing this ...
$nohup ./shuffle.sh > /tmp/shuffle.log 2>&1 0>&- &
I do this ...
$nohup ./shuffle.sh > /tmp/shuffle.log 2>&1 &
the "Bad descriptor" error goes away.
However, the severing of stdin/stdout/stderr is there to ensure that killing the terminal session will not kill the process, so this solution is not entirely satisfactory.
Furthermore, it only seems be be necessary to do this for shuf. None of other commands which read files from this file system cause any errors.
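A possible middle ground (my own workaround sketch, assuming the only purpose of 0>&- was to detach the job from the terminal's input) is to redirect stdin from /dev/null instead of closing it, so shuf is handed a valid, empty descriptor:
nohup ./shuffle.sh >> /tmp/shuffle.log 2>&1 < /dev/null &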
This turned out to be a bug in glibc.
The details are here:
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=25029
The work-around is simple:
instead of
shuf $TRAINING_UNSHUFFLED > $TRAINING_SHUFFLED
do
shuf < $TRAINING_UNSHUFFLED > $TRAINING_SHUFFLED
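and the same change applies to the validation pair from the original script:
shuf < $VALIDATION_UNSHUFFLED > $VALIDATION_SHUFFLED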
Thanks to Pádraig Brady on the coreutils team.
I have an application that I want to monitor via perf stat while it is running inside a KVM VM.
After Googling I found that perf kvm stat can do this. However, there is an error when running the command:
sudo perf kvm stat record -p appPID
which only results in the usage help being printed ...
usage: perf kvm stat record [<options>]
-p, --pid <pid> record events on existing process id
-t, --tid <tid> record events on existing thread id
-r, --realtime <n> collect data with this RT SCHED_FIFO priority
--no-buffering collect data without buffering
-a, --all-cpus system-wide collection from all CPUs
-C, --cpu <cpu> list of cpus to monitor
-c, --count <n> event period to sample
-o, --output <file> output file name
-i, --no-inherit child tasks do not inherit counters
-m, --mmap-pages <pages[,pages]>
number of mmap data pages and AUX area tracing mmap pages
-v, --verbose be more verbose (show counter open errors, etc)
-q, --quiet don't print any message
Does anyone know what the problem is?
Use KVM with vPMU (virtualization of the PMU counters) - see https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Virtualization_Tuning_and_Optimization_Guide/sect-Virtualization_Tuning_Optimization_Guide-Monitoring_Tools-vPMU.html ("2.2. VIRTUAL PERFORMANCE MONITORING UNIT (VPMU)"). Then run perf record -p $pid and perf stat -p $pid inside the guest.
The host system has no knowledge (process tables) of guest processes (they are managed by the guest kernel, which can be non-Linux, or a different version of Linux with an incompatible table format), so the host kernel can't profile a specific guest process. It can only profile the whole guest, and for that there is the perf kvm command - https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Virtualization_Tuning_and_Optimization_Guide/chap-Virtualization_Tuning_Optimization_Guide-Monitoring_Tools.html#sect-Virtualization_Tuning_Optimization_Guide-Monitoring_Tools-perf_kvm
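As a rough sketch of that flow (the guest name myguest and the PID are placeholders, and the exact domain XML depends on your libvirt setup):
# on the host: expose the host PMU to the guest, then restart the guest
virsh edit myguest    # set <cpu mode='host-passthrough'/> in the domain XML
# inside the guest: profile the specific application
perf stat -p appPID sleep 30
perf record -p appPID sleep 30
From the host you can only profile the guest as a whole, e.g. sudo perf kvm stat record -a (using the --all-cpus option from the usage text above) followed by sudo perf kvm stat report.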
I was wondering if anybody could help me with this issue in deploying a Spark cluster using the bdutil tool.
When the total number of cores increases (>= 1024), the deployment fails every time, for one of the following reasons:
1. Some machines are never sshable, e.g. "Tue Dec 8 13:45:14 PST 2015: 'hadoop-w-5' not yet sshable (255); sleeping"
2. Some nodes fail with an "Exited 100" error when deploying the Spark worker nodes, e.g. "Tue Dec 8 15:28:31 PST 2015: Exited 100 : gcloud --project=cs-bwamem --quiet --verbosity=info compute ssh hadoop-w-6 --command=sudo su -l -c "cd ${PWD} && ./deploy-core-setup.sh" 2>>deploy-core-setup_deploy.stderr 1>>deploy-core-setup_deploy.stdout --ssh-flag=-tt --ssh-flag=-oServerAliveInterval=60 --ssh-flag=-oServerAliveCountMax=3 --ssh-flag=-oConnectTimeout=30 --zone=us-central1-f"
In the log file, it says:
hadoop-w-40: ==> deploy-core-setup_deploy.stderr <==
hadoop-w-40: dpkg-query: package 'openjdk-7-jdk' is not installed and no information is available
hadoop-w-40: Use dpkg --info (= dpkg-deb --info) to examine archive files,
hadoop-w-40: and dpkg --contents (= dpkg-deb --contents) to list their contents.
hadoop-w-40: Failed to fetch http://httpredir.debian.org/debian/pool/main/x/xml-core/xml-core_0.13+nmu2_all.deb Error reading from server. Remote end closed connection [IP: 128.31.0.66 80]
hadoop-w-40: E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
I tried 16-core × 128-node, 32-core × 64-node, 32-core × 32-node, and other configurations with more than 1024 cores, but either reason 1 or reason 2 above shows up.
I also tried modifying the ssh-flag to change ConnectTimeout to 1200s, and changing bdutil_env.sh to set the polling interval to 30s, 60s, ...; none of this works. There are always some nodes that fail.
Here is one of the configurations that I used:
time ./bdutil \
--bucket $BUCKET \
--force \
--machine_type n1-highmem-32 \
--master_machine_type n1-highmem-32 \
--num_workers 64 \
--project $PROJECT \
--upload_files ${JAR_FILE} \
--env_var_files hadoop2_env.sh,extensions/spark/spark_env.sh \
deploy
To summarize some of the information that came out of a separate email discussion: as IP mappings change and different Debian mirrors get assigned, there can be occasional problems where the concurrent calls to apt-get install during a bdutil deployment either overload some unbalanced servers or trigger DDoS protections, leading to deployment failures. These tend to be transient, and at the moment it appears I can deploy large clusters in zones like us-east1-c and us-east1-d successfully again.
There are a few options you can use to reduce the load on the Debian mirrors:
Set MAX_CONCURRENT_ASYNC_PROCESSES to a much smaller value than the default 150 inside bdutil_env.sh, such as 10, to deploy only 10 at a time; this will make the deployment take longer, but it lightens the load as if you had just done several back-to-back 10-node deployments (see the sketch after this list).
If the VMs were successfully created but the deployment steps fail, instead of needing to retry the whole delete/deploy cycle, you can try ./bdutil <all your flags> run_command -t all -- 'rm -rf /home/hadoop' followed by ./bdutil <all your flags> run_command_steps to just run through the whole deployment attempt.
Incrementally build your cluster using resize_env.sh; initially set --num_workers 10 and deploy your cluster, and then edit resize_env.sh to set NEW_NUM_WORKERS=20, and run ./bdutil <all your flags> -e extensions/google/experimental/resize_env.sh deploy and it will only deploy the new workers 10-20 without touching those first 10. Then you just repeat, adding another 10 workers to NEW_NUM_WORKERS each time. If a resize attempt fails, you simply ./bdutil <all your flags> -e extensions/google/experimental/resize_env.sh delete to only delete those extra workers without affecting the ones you already deployed successfully.
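Putting options 1 and 3 together, a minimal sketch looks like the following (the values are examples, and <all your flags> stands for whatever flags you already pass to bdutil):
# in bdutil_env.sh: throttle how many nodes are set up concurrently
MAX_CONCURRENT_ASYNC_PROCESSES=10
# initial small deployment
./bdutil <all your flags> --num_workers 10 deploy
# grow in steps: edit extensions/google/experimental/resize_env.sh (e.g. NEW_NUM_WORKERS=20), then
./bdutil <all your flags> -e extensions/google/experimental/resize_env.sh deploy
Repeat the last two steps, raising NEW_NUM_WORKERS by 10 each time, until you reach the target cluster size.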
Finally, if you're looking for more reproducible and optimized deployments, you should consider using Google Cloud Dataproc, which lets you use the standard gcloud CLI to deploy clusters, submit jobs, and further manage/delete clusters without needing to remember your bdutil flags or keep track of what clusters you have on your client machine. You can SSH into Dataproc clusters and use them in basically the same way as bdutil clusters, with some minor differences; for example, the Dataproc DEFAULT_FS is HDFS, so any GCS paths you use should fully specify the complete gs://bucket/object name.