Debug Diag generates large memory dump files - debugdiag

I configured Debug Diag on Production, where I set a Crash rule for a specific app pool with action type Log Stack Trace. The problem is that it's generating dump files that are very large, approximately 700 MB each. I'm not sure why these files are so large. Is there a way to truncate them?

When you use the "Log Stack Trace" option, the call stack for the exception is logged to a text file (not a dump file) that Debug Diagnostic generates for the process to which it is attached. I am assuming that the dump is being generated because your process is crashing with a second-chance exception (that is, if you didn't change anything else in the default crash rule).
If you look at the name of the dump file, you should be able to identify the exact condition under which the dump was generated.


Why would std::process::Command::output fail?

What would cause std::process::Command::output to fail? If the callee program fails, the error will be captured as part of the resulting Output.stderr, so I guess output will only return an error if the OS fails to create a new process for some reason? Is that something that I can safely ignore for my simple CLI tool?
There could be some issue opening the binary being executed (e.g. access denied, or it doesn't exist)
When waiting for the process to finish, the waitpid syscall could be interrupted
Getting the output involves creating a pipe, which will fail if the file descriptor limit is hit (cat /proc/sys/fs/file-max to check)
It also involves opening a file, which will fail if the limit on open files is reached (ulimit -n to check)
You probably only need to worry about the first two: you can't do anything about hitting limits in the kernel.

Breaking/opening files after failed jobs (sas7bdat.lck issue)

Good day,
Tl;dr:
a) Is it feasible to recover data from a .lck file?
b) When the .lck issue appears, can SAS work around it automatically?
We have automated mundane jobs running on SAS machines. Every now and then a job fails. This sometimes leaves a locked file behind (<filename>.sas7bdat.lck instead of <filename>.sas7bdat).
This issue prevents re-running the program, because SAS sees that the specified filename already exists, tries to access it, and fails. Message:
Attempt to rename temporary member of <dataset> failed.
Currently we handle these by manually deleting the file and adjusting the generation number.
The question is twofold: a) Is it feasible to recover data from a .lck file? b) When the .lck issue appears, can SAS work around it automatically? (Note that we have a lot of jobs, and adding checking code to all of them is labour-intensive.)
The .sas7bdat.lck file is the one that SAS writes to as it's creating a data set. If the data step (or PROC) completes successfully, the original data set file is deleted and the .sas7bdat.lck file gets renamed to remove the .lck part. If any errors occur, the .lck file gets deleted and the original data set is left in place, unmodified. That's how SAS avoids overwriting existing data sets when errors occur.
Therefore, you should be able to just rename the file to remove the .lck, or maybe rename it to damaged.sas7bdat for example, and then try accessing the file. You can try a PROC DATASETS REPAIR (https://v8doc.sas.com/sashtml/proc/z0247721.htm) if you really need to get whatever data might be present.
The best solution will obviously be to correct whatever fault is causing your jobs to bomb out like this in the first place. No SAS program should ever leave .lck files lying about, even if it encounters errors - your jobs must actually be crashing the SAS environment itself, or perhaps they're being killed prematurely by another process. Simply accepting that this happens and trying to work around it is likely to just be storing up more problems for the future.

Weblogic Dump JFR without discarding old ones

I am using WebLogic Server and am trying to get JFRs for it. The command-line arguments I use are:
-XX:FlightRecorderOptions=defaultrecording=true,dumponexit=true,dumponexitpath=/my/path,repository=/some/path
There are two disadvantages here:
1) A maximum of 3 JFR files is stored, and data from before that is lost.
2) When there is an OOM, I execute a script that kills the server with signal 11 (SIGSEGV). This does not dump the currently active JFR recording.
How do I go about getting the data at the time of the crash while retaining all the JFR data? Space is not an issue here. If I specify maxage=0, the JFR is never dumped. If I specify maxsize, the files are deleted once the limit is reached.
I assume JDK 7/8, since it is 2018 and you are on WLS; that means recordings can only be dumped from the Java shutdown hook, so the process has to be allowed to exit cleanly. Try SIGTERM instead of signal 11:
kill -15 <pid>
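If upgrading is ever an option: from JDK 11 onwards the jdk.jfr API is generally available, so a recording can also be started and dumped from code instead of relying on dumponexit. The following is only a rough sketch, assuming a JDK 11+ JVM; the destination path is made up for illustration.

import java.nio.file.Paths;
import jdk.jfr.Configuration;
import jdk.jfr.Recording;

public class JfrOnExit {
    public static void main(String[] args) throws Exception {
        // Use the built-in "default" event configuration (low overhead).
        Recording recording = new Recording(Configuration.getConfiguration("default"));
        recording.setToDisk(true);                                  // spill event data to the disk repository
        recording.setDumpOnExit(true);                              // write the file when the JVM exits cleanly
        recording.setDestination(Paths.get("/my/path/exit.jfr"));   // illustrative path
        recording.start();

        // ... application work; a clean shutdown (e.g. SIGTERM, not SIGSEGV/SIGKILL)
        // ends up writing /my/path/exit.jfr
    }
}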
In JDK 9 and later, a dump can also be written natively if the JVM crashes. The file is located in the directory where the Java process was started and is called hs_err_pidXXX.jfr.
JDK 10 added support for Old Object Sample events, which can be used to diagnose memory leaks. If the application exits due to an OutOfMemoryError, it will write an Old Object Sample event with paths to GC roots (regardless of whether you have enabled the event). It should provide the information needed to solve the memory leak.
JDK 11.0.3 or later contains a command-line tool, jfr, which can be used to print the contents of a recording file:
$ jfr print --events OldObjectSample hs_oom_pidXXX.jfr
By looking at the allocationTime you can see when objects were allocated. Memory leaks typically accumulate allocations throughout the lifetime of the application, so if you ignore the early samples (static objects) and the late samples (short-lived objects), you are likely to find a leaking object and its path to the GC root. Just follow the reference chain until you find a reference that should not be there.
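If you prefer doing this inspection from code rather than with the jfr tool, the jdk.jfr.consumer API (JDK 11+) can read the same file. A minimal sketch, keeping the placeholder file name from above:

import java.nio.file.Paths;
import jdk.jfr.consumer.RecordedEvent;
import jdk.jfr.consumer.RecordingFile;

public class PrintOldObjectSamples {
    public static void main(String[] args) throws Exception {
        // Read every event from the emergency dump and keep only the Old Object Samples.
        for (RecordedEvent event : RecordingFile.readAllEvents(Paths.get("hs_oom_pidXXX.jfr"))) {
            if (event.getEventType().getName().equals("jdk.OldObjectSample")) {
                System.out.println(event);   // toString() includes allocationTime and the sampled object
            }
        }
    }
}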

IOMeter doesn't write log files due to full disk

Does anyone have a workaround for IOMeter not writing logs to disk? I believe this happens because the iobw.tst file takes up the whole disk. With the test running, I manually created a temporary 1 MB file while the disk was filling up, then deleted that 1 MB file once the disk was full and while the reads and writes were being performed; this consistently produces the full log file for the test. Similarly, clearing the Recycle Bin or temporary files at that point produces the same result.
Does anyone know of a way to reserve this space for the log file, using a configuration file or something along those lines? IOMeter is part of an automated suite of tests that I'm working on, and this issue is preventing full automation.
You have to compile Dynamo with the "DETAILS" and/or "DEBUG" flags turned on.
Dynamo will then store all the information in the ~/std.out log (if you're on Linux).

Checksum Exception when reading from or copying to hdfs in apache hadoop

I am trying to implement a parallelized algorithm using Apache Hadoop, but I am facing some issues when trying to transfer a file from the local file system to HDFS. A checksum exception is being thrown when trying to read from or transfer a file.
The strange thing is that some files are copied successfully while others are not (I tried with two files, one slightly bigger than the other, though both are small). Another observation I have made is that the Java FileSystem.getFileChecksum method returns null in all cases.
A little background on what I am trying to achieve: I am trying to write a file to HDFS so that I can use it as a distributed cache for the MapReduce job I have written.
I have also tried the hadoop fs -copyFromLocal command from the terminal, and the result is exactly the same behaviour as when it is done through the Java code.
I have looked all over the web, including other questions here on Stack Overflow, but I haven't managed to solve the issue. Please be aware that I am still quite new to Hadoop, so any help is greatly appreciated.
I am attaching the stack trace below, which shows the exceptions being thrown. (In this case I have posted the stack trace resulting from the hadoop fs -copyFromLocal command run from the terminal.)
name#ubuntu:~/Desktop/hadoop2$ bin/hadoop fs -copyFromLocal ~/Desktop/dtlScaleData/attr.txt /tmp/hadoop-name/dfs/data/attr2.txt
13/03/15 15:02:51 INFO util.NativeCodeLoader: Loaded the native-hadoop library
13/03/15 15:02:51 INFO fs.FSInputChecker: Found checksum error: b[0, 0]=
org.apache.hadoop.fs.ChecksumException: Checksum error: /home/name/Desktop/dtlScaleData/attr.txt at 0
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.readChunk(ChecksumFileSystem.java:219)
at org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInputChecker.java:237)
at org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.java:189)
at org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java:158)
at java.io.DataInputStream.read(DataInputStream.java:100)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:68)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:47)
at org.apache.hadoop.io.IOUtils.copyBytes(IOUtils.java:100)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:230)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:176)
at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:1183)
at org.apache.hadoop.fs.FsShell.copyFromLocal(FsShell.java:130)
at org.apache.hadoop.fs.FsShell.run(FsShell.java:1762)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
at org.apache.hadoop.fs.FsShell.main(FsShell.java:1895)
copyFromLocal: Checksum error: /home/name/Desktop/dtlScaleData/attr.txt at 0
You are probably hitting the bug described in HADOOP-7199. What happens is that when you download a file with copyToLocal, it also copies a .crc file into the same directory, so if you modify your file and then try to do copyFromLocal, Hadoop will compute a checksum of your new file, compare it to your local .crc file, and fail with a non-descriptive error message.
To fix it, check whether you have this .crc file; if you do, just remove it and try again.
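If you are doing the copy through the Java API, as mentioned in the question, the same clean-up can be scripted. This is only a minimal sketch assuming the stale sidecar checksum is the culprit; the HDFS URI is an assumption and the paths are the ones from the terminal session above:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CopyWithoutStaleCrc {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem local = FileSystem.getLocal(conf);                                // checksummed local FS
        FileSystem hdfs = FileSystem.get(URI.create("hdfs://localhost:9000"), conf); // illustrative URI

        Path src = new Path("/home/name/Desktop/dtlScaleData/attr.txt");
        Path crc = new Path(src.getParent(), "." + src.getName() + ".crc");          // hidden sidecar checksum

        // Remove the stale .crc file so the local read no longer fails its checksum check.
        if (local.exists(crc)) {
            local.delete(crc, false);
        }

        hdfs.copyFromLocalFile(src, new Path("/tmp/hadoop-name/dfs/data/attr2.txt"));
    }
}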
I faced the same problem and solved it by removing the .crc files.
OK, so I managed to solve this issue and I'm writing the answer here just in case someone else encounters the same problem.
What I did was simply create a new file and copy all the contents from the problematic file into it.
From what I can presume, it looks like some .crc file is being created and attached to that particular file, so by trying with another file, a different crc check is carried out. Another reason could be that I named the file attr.txt, which could conflict with some other resource. Maybe someone could expand on my answer further, since I am not 100% sure of the technical details and these are just my observations.
The .crc file holds the checksum for a particular block of data. The entire data set is split into blocks, and each block stores its metadata along with the .crc file inside the /hdfs/data/dfs/data folder. If someone modifies the .crc files, the stored and the recomputed checksums no longer match, and that causes the error. The best practice to fix this error is to overwrite the metadata file along with the .crc file.
I got the exact same problem and didn't find any solution. Since this was my first Hadoop experience, I could not follow some of the instructions on the internet. I solved this problem by formatting my namenode:
hadoop namenode -format