Why does a JProfiler snapshot file grow over time?

I set up a trigger that starts recording CPU, thread, and telemetry data, sleeps for a while, writes out a snapshot file, and then stops recording with a data reset.
The snapshot file seems to grow over time. Is this because of accumulated thread and telemetry data? If so, will it eventually eat up the whole heap if it runs for a very long time, say a year? Is there any way to reset this data?


Listen or wait for a specific time without using timer

Is there a way to listen or wait for a specific time (e.g. 11:30 am) every day? The only way I know is to set a timer that checks the current time every 60 seconds, which I have implemented using a BackgroundWorker. But is there a way to just wait and listen for the specified time (similar to monitoring for directory changes) and then take some action?
Thanks in advance.
Typically, rather than having a program resident in memory waiting, you would set up a Scheduled Task for this (or a cron job on Linux). The scheduled task will run the program at the appropriate time. The program can still check (validate) the expected time if needed, but it shouldn't sit in the background using up resources if it's only going to run once per day.
The scheduled task is also better because it will recover automatically from computer reboots, crashes, etc. If something happens that interrupts your program's normal running, the scheduled task will still be able to run.
This is especially important in the .NET world, because .NET requires you to be very careful when writing long-lived programs to avoid address space fragmentation. The .NET garbage collector is good at freeing old memory and returning it to the operating system, but over time your program's virtual address space can become fragmented, and eventually you may no longer be able to allocate new memory.
Even if this is part of a larger program, where there are also other things happening based on user interactions, it's still a good idea to split this off into a separate process.
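If the wait does have to live inside the program, the 60-second polling loop can be replaced by computing the delay to the next occurrence of the target time and sleeping once. Here is a minimal sketch in Python rather than .NET; the names next_run and seconds_until are made up for illustration, and it assumes naive local datetimes, so it ignores daylight-saving shifts and manual clock changes.

```python
from datetime import datetime, time, timedelta


def next_run(now: datetime, at: time) -> datetime:
    """Return the next moment at which the wall clock reads `at`."""
    candidate = datetime.combine(now.date(), at)
    if candidate <= now:
        # Today's slot has already passed, so use tomorrow's.
        candidate += timedelta(days=1)
    return candidate


def seconds_until(now: datetime, at: time) -> float:
    """Seconds to wait from `now` until the next occurrence of `at`."""
    return (next_run(now, at) - now).total_seconds()
```

A caller would sleep for seconds_until(datetime.now(), time(11, 30)), do the work, and then compute the next delay again. The same next_run check can also be used by a scheduled task to validate that it was started at the expected time.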

Is time stamp needed for a process?

Is a time stamp used by the OS to free a process's resources if the process holds them for a long time?
If yes, then the process state diagram has no transition between the blocked (wait) state and the terminated state; the two are connected only via the running state. So an anomaly in the concept arises: if a process has to quit from the wait state, it has to go via the running state.
The OS can't indiscriminately kill a process if it holds a resource for a long time.
Would you kill a 24/7 server that waits on a socket?
Would you kill a process that keeps a file open? After how long? What if the process actually needs to keep the file open for days? If you think it can't possibly need that long, consider this scenario: a computationally intensive process that takes days to compute all its data. Big data. It uses a file as a "cache/buffer", writing to it and reading from it continuously. As such, it holds the file open the whole time.
Would you kill a process that waits on a lock? After how much time? 10 minutes? 1 hour? 10 hours? 5 days? What if the purpose of that process is to do a cleanup or something similar after some other process has released the lock, whether that other process held it for 1 second or for 2 weeks?

Cassandra Commit and Recovery on a Single Node

I am a newbie to Cassandra - I have been searching for information related to commits and crash recovery in Cassandra on a single node. And, hoping someone can clarify the details.
I am testing Cassandra, so I set it up on a single node. I am using stresstool on datastax to insert millions of rows. What happens if there is an electrical failure or system shutdown? Will all the data that was in Cassandra's memory get written to disk when Cassandra restarts (I guess the commitlog acts as an intermediary)? How long does this process take?
Cassandra's commit log gives Cassandra durable writes. When you write to Cassandra, the write is appended to the commit log before the write is acknowledged to the client. This means every write that the client receives a successful response for is guaranteed to be written to the commit log. The write is also made to the current memtable, which will eventually be written to disk as an SSTable when large enough. This could be a long time after the write is made.
However, the commit log is not immediately synced to disk for performance reasons. The default is periodic mode (set by the commitlog_sync param in cassandra.yaml) with a period of 10 seconds (set by commitlog_sync_period_in_ms in cassandra.yaml). This means the commit log is synced to disk every 10 seconds. With this behaviour you could lose up to 10 seconds of writes if the server loses power. If you had multiple nodes in your cluster and used a replication factor of greater than one you would need to lose power to multiple nodes within 10 seconds to lose any data.
If this risk window isn't acceptable, you can use batch mode for the commit log. This mode won't acknowledge writes to the client until the commit log has been synced to disk. The time window is set by commitlog_sync_batch_window_in_ms, default is 50 ms. This will significantly increase your write latency and probably decrease the throughput as well so only use this if the cost of losing a few acknowledged writes is high. It is especially important to store your commit log on a separate drive when using this mode.
In the event that your server loses power, on startup Cassandra replays the commit log to rebuild its memtable. This process will take seconds (possibly minutes) on very write heavy servers.
If you want to ensure that the data in the memtables is written to disk you can run 'nodetool flush' (this operates per node). This will create a new SSTable and delete the commit logs referring to data in the memtables flushed.
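To make the risk window in periodic mode concrete, here is a toy model of the write path described above. It is a sketch only, not Cassandra code: the class ToyNode and its methods are invented for illustration, and the sync is triggered by the next write after the period has elapsed rather than by a background timer.

```python
class ToyNode:
    """Toy model of commit log + memtable in periodic sync mode."""

    def __init__(self, sync_period_s=10.0):
        self.sync_period_s = sync_period_s
        self.last_sync = 0.0
        self.log_buffer = []   # appended to the commit log, not yet synced
        self.log_on_disk = []  # commit log entries that have been synced
        self.memtable = {}

    def write(self, now, key, value):
        # Sync if the period has elapsed, then acknowledge immediately.
        if now - self.last_sync >= self.sync_period_s:
            self.sync(now)
        self.log_buffer.append((key, value))
        self.memtable[key] = value

    def sync(self, now):
        self.log_on_disk.extend(self.log_buffer)
        self.log_buffer.clear()
        self.last_sync = now

    def power_loss_and_restart(self):
        # Everything in memory is gone; replay the synced commit log.
        self.log_buffer.clear()
        self.memtable = dict(self.log_on_disk)
```

With a 10 second period, writes at t=1 and t=5 are synced when the write at t=12 arrives, but that third write is still only in the buffer. A power loss right after it leaves only the first two keys in the rebuilt memtable, even though all three writes were acknowledged.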
You are asking something like:
What happens if there is a network failure while data is being loaded into Oracle using SQL*Loader?
Or what happens if Sqoop stops processing due to some condition while transferring data?
Simply put, whatever data was transferred before the electrical failure or system shutdown will remain as it is.
Coming to the second question: whenever the memtable runs out of space, i.e. when the number of keys exceeds a certain limit (128 is the default) or when it reaches the time duration (cluster clock), its contents are stored in an SSTable, which is immutable.

Processing data while it is loading

We have a tool which loads data from some optical media, and once it's all copied to the hard drive runs it through a third-party tool for processing. I would like to optimise this process so each file is processed as it is read in. Trouble is, the third-party tool (which naturally I cannot change) has a 12 second startup overhead. What is the best way I can deal with this, in terms of finishing the entire process as soon as possible? I can pass any number of files to the processing tool in each run, so I need to be able to determine exactly when to run the tool to get the fastest result overall. The data being copied could be anything from one large file (which can't be processed until it's fully copied) to hundreds of small files.
The simplest approach would be to create and run two threads, one that runs the tool and one that loads data. Start a 12-second timer and start both threads. Each time a file finishes loading, check the elapsed time. If 12 seconds have passed, hand the loaded files to the thread running the tool. Continue loading data in parallel with the processing of the previous batch. Once the previous batch has been processed, restart the 12-second timer and keep checking it each time a file finishes loading. Repeat until no data remains.
For better results a more complex solution might be required. You can do some benchmarking to estimate the average data loading time. Since it may differ for small and large files, you may need several estimates for different categories of files (by size). Optimal resource utilization is reached when data is processed at the same rate as new data arrives. Processing time includes the 12-second startup. The benchmarking should give you a ratio of processing threads to reading threads (you can also decrease or increase the number of active reading threads according to the incoming file sizes). In effect, it's a variation of the producer-consumer problem with multiple producers and consumers.
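Before building the threads, it can help to compare batching policies on paper. The sketch below simulates a simpler policy than the timer above: whenever the tool is idle and at least one file has finished copying, run it on everything copied so far. The function name finish_time and the assumption of a constant per-file processing time are mine; real numbers would come from the benchmarking.

```python
def finish_time(ready_times, per_file_s, startup_s=12.0):
    """Return when the last file is processed under the greedy policy.

    ready_times: seconds at which each file finishes copying.
    per_file_s:  assumed processing time per file once the tool is up.
    startup_s:   fixed startup overhead paid on every run of the tool.
    """
    pending = sorted(ready_times)
    clock = 0.0
    i = 0
    while i < len(pending):
        # The tool is idle: wait until at least one file is ready.
        clock = max(clock, pending[i])
        # Take every file that has finished copying by now.
        j = i
        while j < len(pending) and pending[j] <= clock:
            j += 1
        # One run of the tool: startup overhead plus the batch itself.
        clock += startup_s + (j - i) * per_file_s
        i = j
    return clock
```

For files ready at 1, 2, 3 and 30 seconds with 1 second of processing each, this gives 43 seconds, against 46 for waiting until everything is copied (30 + 12 + 4). A single large file gets no benefit, since nothing can start until it is fully copied.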

How to terminate process with Lucene NRT Reader/ Writer gracefully?

We are using Lucene's near real-time search feature for full-text search in our application. Since commits are costly, say we commit to the index after every 10 documents are added (we expect around 150 to 200 documents per hour for indexing). Now, if I want to terminate my process, how do I make sure that all documents in memory are committed to disk before my process is killed? Is there a recommended approach here? Or is my document volume too low to bother about, and should I commit on every addition?
Should I track all uncommitted documents? And if the process gets killed before they are committed to disk, should I index the uncommitted ones again when the process starts up?
Lucene NRT is used in a process that runs embedded Jetty. Is it the right approach to send a shutdown command (invoke some servlet) to jetty and wait till all documents are committed and then terminate using System.exit()?
You could add a hook to commit all the buffered documents in the destroy method of your servlet and make sure that the embedded servlet container is shut down before calling System.exit (maybe by adding a shutdown hook to the JVM).
But this is still not perfect. If your process gets killed, all the buffered data will be lost. Another solution is to use soft commits. Soft commits are cheap commits (no fsync is performed) so no data will be lost if your process gets killed (but data could still be lost if the server shuts down unexpectedly).
To sum up:
shutdown hook
    best throughput
    data can be lost if the process gets killed
soft commit
    no data will be lost if the process gets killed
    data may be lost if the server shuts down unexpectedly
hard commit (default)
    no data loss at all
    slow (needs to perform an fsync)
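The shutdown-hook option can be sketched as follows. This is an illustration in Python, not Lucene or Jetty code: BufferedIndex is a hypothetical stand-in for an index writer that commits every N documents, and Python's atexit plays the role of a JVM shutdown hook.

```python
import atexit


class BufferedIndex:
    """Hypothetical stand-in for a writer that buffers documents."""

    def __init__(self, commit_every=10):
        self.commit_every = commit_every
        self.pending = []    # added but not yet committed
        self.committed = []  # durable after a commit

    def add(self, doc):
        self.pending.append(doc)
        if len(self.pending) >= self.commit_every:
            self.commit()

    def commit(self):
        # A real writer would flush its buffers and fsync here.
        self.committed.extend(self.pending)
        self.pending.clear()


index = BufferedIndex(commit_every=10)
# Runs on normal interpreter exit, but not if the process is killed
# by an unhandled signal or the machine loses power.
atexit.register(index.commit)
```

As with the JVM hook, this only covers an orderly shutdown, which is why the first option above still lists data loss when the process gets killed.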