How to dump Permgen? - jvm

I want to take a dump of the PermGen of an application server.
I do not want to use -XX:+TraceClassLoading -XX:+TraceClassUnloading because I do not want to restart the server, and I do not want to use JConsole either.
Is there any tool like jmap (which I used for the heap dump, but I could not find any option for PermGen there) that can dump the PermGen given only the PID?

jmap -permstat <pid>
is going to produce output like this:
30337 intern Strings occupying 2746200 bytes.
class_loader  classes  bytes     parent_loader  alive?  type
<bootstrap>   2031     7253392   null           live    <internal>
0x517474f0    1        1760      null           dead    sun/reflect/DelegatingClassLoader#0x43f95d38
0x4f83f670    1        1744      0x4ebfb8e8     dead    sun/reflect/DelegatingClassLoader#0x43f95d38
[...]
total = 287   10020    35889952  N/A            alive=3, dead=284  N/A
This is not a full dump, but it will let you do some investigation.
I am still looking into how to find more information.
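In the meantime, the text output above is easy to slice with standard tools. As a small sketch (not from the original answer), this counts the class loaders that jmap reports as dead, relying only on the columns shown above:
# count class loader rows that jmap -permstat marks as dead
jmap -permstat <pid> | grep -c ' dead '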

It is not possible to 'dump permgen' as it's done for the heap.
In addition to jmap -permstat, as others have presented, you can analyze a standard heap dump to shed some light on your permanent generation, as described in the blog entry 'The Unknown Generation: Perm':
Because a heap dump does not really contain a lot of information about perm space, perm problems are difficult to tackle. Recently, I found this great article by Sporar, Sundararajan and Kieviet. The authors shed some light on the permanent generation. Of course, I had to check right away if and how I can use the Eclipse Memory Analyzer to analyze this “unknown” generation. This is what this blog is about.
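If you go down that route with the Eclipse Memory Analyzer, the heap dump itself can be taken with jmap without restarting the server. A minimal sketch (the file name is just an example; replace <pid> with the server's process id):
# binary heap dump of live objects, to be opened in Eclipse MAT
jmap -dump:live,format=b,file=heap.hprof <pid>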



JVM Runtime.availableProcessors() returns 2 when it should be 4

I'm running OpenJDK 11 on Alpine Linux in a container in an AWS EKS cluster.
The application determines the size of a thread pool based on the number of CPUs as returned by Runtime.getRuntime().availableProcessors().
This call is returning 2 processors even though the container shows that 4 CPUs are available:
# cat /proc/cpuinfo | grep processor
processor : 0
processor : 1
processor : 2
processor : 3
Any idea why and how to solve the problem?
Update
Doing some more digging (prompted by some great questions from @gohm'c in the comments), I found a way to add some trace log prints to the JVM with -Xlog:os+container=trace:
[0.001s][trace][os,container] CPU Shares is: 1536
[0.001s][trace][os,container] CPU Share count based on shares: 2
Now, I have resources.requests.cpu: "1500m" defined.
I don't know why there is the slight discrepancy, but when I change the value of the CPU request, the CPU Shares value in the trace log changes accordingly.
I understand how the resources.limits.cpu value could affect the CPUs that the JVM sees. But why is the resources.requests.cpu value doing that? This seems like a bug to me. Any thoughts?
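A note on the numbers above (my own reading, not part of the original question): on cgroup v1, Kubernetes translates resources.requests.cpu into cpu.shares at 1024 shares per CPU, so 1500m becomes 1536 shares, and the trace line "CPU Share count based on shares: 2" is consistent with the JVM rounding 1536/1024 up to 2. If deriving the count from shares is not what you want, one possible workaround is to tell the JVM the processor count explicitly; a sketch (app.jar is a placeholder for the actual application):
# hypothetical launch command; forces the reported processor count to 4
java -XX:ActiveProcessorCount=4 -Xlog:os+container=trace -jar app.jar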

Question about initializing weedfs volume servers (disks) and fid

I have two questions about seaweedfs:
In each server I have 10 disks. How can I run the weedfs volume server on them?
Should I define "-dir=" 10 times in front of "./weed volume -max=100 -mserver=",
or should I make a systemd unit file for each disk?
For example:
for sdb
ExecStart=/home/weedfs/weed volume -max=100 -mserver=192.168.200.20:9333 -port=8080 -dataCenter=dc1 -dir="/srv/sdb/data"
for sdc
ExecStart=/home/weedfs/weed volume -max=100 -mserver=192.168.200.20:9333 -port=8080 -dataCenter=dc1 -dir="/srv/sdc/data"
What is the best solution?
Can I create and define the fid myself instead of asking the master API?
For example, instead of these steps:
a) curl http://localhost:9333/dir/assign
{"fid":"14,8e3cf10b7811f43a542cfa34","url":"192.168.200.20:8080","publicUrl":"192.168.200.20:8080","count":1}
b) curl -F file=@/home/eitaa/weedfs/weed http://192.168.200.20:8080/14,8e3cf10b7811f43a542cfa34
can I directly generate the fid myself (I mean this part: ",8e3cf10b7811f43a542cfa34") with a desired volume id (e.g. "8") and upload the file?
Or should I use the master API (assign a file key)?
Either way. Pick the one that is easier for you.
Possibly. You may need to run the volume server with "-index=leveldb" to optimize memory usage, in case the file keys are not monotonically increasing.
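On the first question, as far as I know the volume server also accepts comma-separated lists for -dir and -max, so a single unit file could cover several disks. A sketch following the sdb/sdc pattern above (paths and counts are illustrative; verify against your weed version):
# one volume server process serving multiple disks
ExecStart=/home/weedfs/weed volume -mserver=192.168.200.20:9333 -port=8080 -dataCenter=dc1 -dir="/srv/sdb/data,/srv/sdc/data" -max="100,100"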

Flink s3 read error: Data read has a different length than the expected

Using Flink 1.7.0, but also seen on Flink 1.8.0. We are getting frequent but somewhat random errors when reading gzipped objects from S3 through the Flink .readFile source:
org.apache.flink.fs.s3base.shaded.com.amazonaws.SdkClientException: Data read has a different length than the expected: dataLength=9713156; expectedLength=9770429; includeSkipped=true; in.getClass()=class org.apache.flink.fs.s3base.shaded.com.amazonaws.services.s3.AmazonS3Client$2; markedSupported=false; marked=0; resetSinceLastMarked=false; markCount=0; resetCount=0
at org.apache.flink.fs.s3base.shaded.com.amazonaws.util.LengthCheckInputStream.checkLength(LengthCheckInputStream.java:151)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.util.LengthCheckInputStream.read(LengthCheckInputStream.java:93)
at org.apache.flink.fs.s3base.shaded.com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:76)
at org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.fs.s3a.S3AInputStream.closeStream(S3AInputStream.java:529)
at org.apache.flink.fs.shaded.hadoop3.org.apache.hadoop.fs.s3a.S3AInputStream.close(S3AInputStream.java:490)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at org.apache.flink.fs.s3.common.hadoop.HadoopDataInputStream.close(HadoopDataInputStream.java:89)
at java.util.zip.InflaterInputStream.close(InflaterInputStream.java:227)
at java.util.zip.GZIPInputStream.close(GZIPInputStream.java:136)
at org.apache.flink.api.common.io.InputStreamFSInputWrapper.close(InputStreamFSInputWrapper.java:46)
at org.apache.flink.api.common.io.FileInputFormat.close(FileInputFormat.java:861)
at org.apache.flink.api.common.io.DelimitedInputFormat.close(DelimitedInputFormat.java:536)
at org.apache.flink.streaming.api.functions.source.ContinuousFileReaderOperator$SplitReader.run(ContinuousFileReaderOperator.java:336)
Within a given job, we generally see many or most of the files read successfully, but there is pretty much always at least one failure (say, out of 50 files).
It seems this error is actually originating in the AWS client, so perhaps Flink has nothing to do with it, but I'm hopeful someone might have an insight as to how to make this work reliably.
When the error occurs, it ends up killing the source and cancelling all the connected operators. I'm still new to Flink, but I would think this is something that could be recovered from a previous snapshot. Should I expect that Flink will retry reading the file when this kind of exception occurs?
Maybe you can try to allow more connections for s3a, for example:
flink:
  ...
  config: |
    fs.s3a.connection.maximum: 320
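On the recovery part of the question (my own note, not from the answer above): assuming checkpointing is enabled for the job, a restart strategy in flink-conf.yaml should let the job restart from the last successful checkpoint instead of staying cancelled after such a failure. A sketch with illustrative values:
# illustrative restart strategy; tune attempts and delay for your job
restart-strategy: fixed-delay
restart-strategy.fixed-delay.attempts: 3
restart-strategy.fixed-delay.delay: 10 s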

How to optimize golang program that spends most time in runtime.osyield and runtime.usleep

I've been working on optimizing code that analyzes social graph data (with lots of help from https://blog.golang.org/profiling-go-programs), and I've successfully reworked a lot of slow code.
All data is loaded into memory from the db first, and the data analysis from there appears CPU bound (max memory consumption < 10MB, CPU1 @ 100%).
But now most of my program's time seems to be in runtime.osyield and runtime.usleep. What's the way to prevent that?
I've set GOMAXPROCS=1 and the code does not spawn any goroutines (other than what the golang libraries may call).
This is my top10 output from pprof
(pprof) top10
62550ms of 72360ms total (86.44%)
Dropped 208 nodes (cum <= 361.80ms)
Showing top 10 nodes out of 77 (cum >= 1040ms)
      flat  flat%   sum%        cum   cum%
   20760ms 28.69% 28.69%    20850ms 28.81%  runtime.osyield
   14070ms 19.44% 48.13%    14080ms 19.46%  runtime.usleep
   11740ms 16.22% 64.36%    23100ms 31.92%  _/C_/code/sc_proto/cloudgraph.(*Graph).LeafProb
    6170ms  8.53% 72.89%     6170ms  8.53%  runtime.memmove
    4740ms  6.55% 79.44%    10660ms 14.73%  runtime.typedslicecopy
    2040ms  2.82% 82.26%     2040ms  2.82%  _/C_/code/sc_proto.mAvg
     890ms  1.23% 83.49%     1590ms  2.20%  runtime.scanobject
     770ms  1.06% 84.55%     1420ms  1.96%  runtime.mallocgc
     760ms  1.05% 85.60%      760ms  1.05%  runtime.heapBitsForObject
     610ms  0.84% 86.44%     1040ms  1.44%  _/C_/code/sc_proto/cloudgraph.(*Node).DeepestChildren
(pprof)
The _/C_/code/sc_proto/* functions are my code.
And the output from web:
(better, SVG version of graph here: https://goo.gl/Tyc6X4)
Found the answer myself, so I'm posting it here for anyone else who is having a similar problem. Special thanks to @JimB for sending me down the right path.
As can be seen from the graph, the paths which lead to osyield and usleep are garbage collection routines. This program was using a linked list which generated a lot of pointers, which created a lot of work for the gc, which occasionally blocked execution of my code while it cleaned up my mess.
Ultimately the solution to this problem came from https://software.intel.com/en-us/blogs/2014/05/10/debugging-performance-issues-in-go-programs (which was an awesome resource btw). I followed the instructions about the memory profiler there; and the recommendation to replace collections of pointers with slices cleared up my garbage collection issues, and my code is much faster now!
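For anyone following the same steps, this is roughly the profiling workflow I mean, assuming the program writes a heap profile somewhere (for example via runtime/pprof.WriteHeapProfile); the binary and profile file names below are placeholders:
# CPU profile (what produced the top10 listing above)
go tool pprof sc_proto cpu.prof
# allocation profile, to see which call sites generate the most garbage for the GC
go tool pprof -alloc_objects sc_proto mem.prof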

Authentication failures in cassandra when 1 of 16 nodes is down

I have a Cassandra cluster running:
Cassandra 2.0.11.83 | DSE 4.6.0 | CQL spec 3.1.1 | Thrift protocol 19.39.0
The cluster has 18 nodes, split among 3 datacenters, 6 in each. My system_auth keyspace has the following replication defined:
replication = {
  'class': 'NetworkTopologyStrategy',
  'DC1': '4',
  'DC2': '4',
  'DC3': '4'
}
and my authenticator/authorizer are set to:
authenticator: org.apache.cassandra.auth.PasswordAuthenticator
authorizer: org.apache.cassandra.auth.CassandraAuthorizer
This morning I brought down one of the nodes in DC1 for maintenance. Within a few seconds to a minute, client applications started logging exceptions like this:
"User my_application_user has no MODIFY permission on or any of its parents"
Running 'LIST ALL PERMISSIONS of my_application_user' on one of the other nodes shows that user to have SELECT and MODIFY on the keyspace xxxxx, so I am rather confused. Do I have a setup issue? Is this a bug of some sort?
Re-posting this as the answer, as BrianC suggested above.
So this is resolved... Here's the sequence of events that seems to have fixed it:
Add 18 more nodes
Run cleanup on original nodes (this was part of the original plan)
Run a scrub on 1 table, since it was throwing exceptions on cleanup
Run a repair on the system_auth KS on the original troubled node
Wait for repair service to complete a full pass on all keyspaces
Decommission the original 18 nodes.
Honestly, I don't know what fixed it. The system_auth repair makes the most sense, but what doesn't make sense is that it had run many passes before, so why it worked now I don't know. I hope this at least helps someone.
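For reference (my own note, not part of the original answer), the system_auth repair step above would look something like this, run on the affected node, after which LIST ALL PERMISSIONS OF my_application_user can be re-checked:
# repair only the authentication/authorization keyspace on this node
nodetool repair system_auth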