Valgrind Massif: take snapshots manually in a programmatic way

I'm using Valgrind Massif to measure the memory usage of my program. I know (roughly) where I allocate and release memory, so I want to take snapshots manually. Moreover, I want to compare the results from multiple runs, and manually taken snapshots make that comparison more meaningful.
Is there a programmatic way to do this?

You can instruct massif to take a snapshot using massif monitor commands.
See https://www.valgrind.org/docs/manual/ms-manual.html#ms-manual.monitor-commands
Such monitor commands can be issued from a shell using vgdb,
or from within your program using the client request VALGRIND_MONITOR_COMMAND.
See https://www.valgrind.org/docs/manual/manual-core-adv.html#manual-core-adv.clientreq
Note that if a specific client request exists for a given monitor command, it is better to use that specific client request rather than the 'generic' VALGRIND_MONITOR_COMMAND.
In the case of massif, there are no specific client requests.

Related

Calling API from PigLatin

Complete newbie to PigLatin, but looking to pull data from the MetOffice DataPoint API e.g.:
http://datapoint.metoffice.gov.uk/public/data/val/wxfcs/all/xml/350509?res=3hourly&key=abc123....
...into Hadoop.
My question is "Can this be undertaken using PigLatin (from within Pig View, in Ambari)"?
I've hunted around for how to format a GET request in the code, but without luck.
Am I barking up the wrong tree? Should I be looking to use a different service within the Hadoop framework to accomplish this?
It is a very bad idea to call external services from inside map-reduce jobs. The reason is that jobs running on the cluster are very scalable, whereas the external system might not be. Modern resource managers like YARN make this even worse: when you swamp the external system with requests, your tasks on the cluster will spend most of their time sleeping, waiting for a reply from the server. The resource manager will see that the tasks are not using CPU and will schedule more of them, which will make even more requests to the external system, swamping it further. I've seen a modest 100-machine cluster put out 100K requests per second.
What you really want to do is either somehow get the bulk data from the web service, or set up a system with a queue and a small, controlled number of workers that pull from the external system at a set rate.
As for your original question, I don't think PigLatin provides such a service, but it can easily be done with a UDF, in either Python or Java. With Python you can use the excellent requests library, which will make your UDF about 6 lines of code. A Java UDF will be a little more verbose, but nothing terrible by Java standards.
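A sketch of such a Python UDF, using the stdlib urllib here so it runs standalone (the requests version would look almost identical). The `outputSchema` decorator is what Pig provides to a Python UDF at runtime, and both parameters are placeholders; note that inside Pig the UDF runs under Jython 2, where you would use urllib2 instead.

```python
from urllib.request import urlopen

try:
    from pig_util import outputSchema      # provided by Pig when the UDF runs
except ImportError:                        # stand-in so the sketch runs standalone
    def outputSchema(schema):
        def wrap(fn):
            return fn
        return wrap

@outputSchema('forecast:chararray')
def fetch_forecast(location_id, api_key):
    # URL shape taken from the question; location_id and api_key
    # are placeholder arguments supplied from the Pig script.
    url = ('http://datapoint.metoffice.gov.uk/public/data/val/wxfcs/all/xml/'
           '%s?res=3hourly&key=%s' % (location_id, api_key))
    return urlopen(url).read().decode('utf-8')
```

In a Pig script this would be registered with something like `REGISTER 'udf.py' USING jython AS metoffice;` and called inside a FOREACH, but per the caveats above, only ever at a very low request rate.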
"Can this be undertaken using PigLatin (from within Pig View, in Ambari)"?
No, by default Pig loads from HDFS storage, unless you write your own loader.
And I share the same point as @Vlad: this is not a good idea. There are many other components used for data ingestion, but this is not a use case for Pig!

How to design a multi-process program using Redis in Python

I just started using the Redis cache in Python. I read the tutorial but still feel confused about the concepts of "connection pool", "connection", etc.
I'm trying to write a program that will be invoked multiple times from the console, in different processes. They will all get and set the same shared in-memory Redis cache, using the same set of keys.
So to make it thread (process) safe, should I have one global connection pool and get connections from the pool in the different processes? Or should I have one global connection? What's the right way to do it?
Thanks,
Each instance of the program should spawn its own ConnectionPool. But this has nothing to do with thread safety: whether or not your code is thread safe depends on the type of operations you will be executing. If you have multiple instances which may read and write concurrently, you need to look into using transactions, which are built into Redis.
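A minimal sketch of that layout, assuming the redis-py package and default host/port: one pool created at process startup, plus a WATCH/MULTI transaction that makes a concurrent read-modify-write safe across processes.

```python
def make_client(host='localhost', port=6379):
    """Create one pool per process at startup and share the client;
    redis-py's Redis client is itself thread safe and checks a
    connection out of the pool for each command."""
    import redis  # assumed installed: pip install redis
    pool = redis.ConnectionPool(host=host, port=port, db=0)
    return redis.Redis(connection_pool=pool)

def safe_increment(r, key):
    """A read-modify-write made safe across processes: WATCH aborts
    the transaction if another process writes `key` between our read
    and our write, and we simply retry."""
    import redis
    with r.pipeline() as pipe:
        while True:
            try:
                pipe.watch(key)                 # start watching the key
                value = int(pipe.get(key) or 0) # read current value
                pipe.multi()                    # switch to buffered mode
                pipe.set(key, value + 1)
                pipe.execute()                  # raises WatchError on conflict
                return value + 1
            except redis.WatchError:
                continue                        # someone else wrote; retry
```

redis-py also offers `Redis.transaction(fn, *keys)`, which wraps exactly this kind of retry loop for you.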

Running multiple Kettle transformation on single JVM

We want to use pan.sh to execute multiple Kettle transformations. After exploring the script I found that it internally calls the spoon.sh script, which runs in PDI. The problem is that every time a new transformation starts, it creates a separate JVM for its execution (invoked via a .bat file); I want to group them under a single JVM to overcome the memory constraints that the multiple JVMs are putting on the batch server.
Could somebody guide me on how I can achieve this, or share documentation/resources with me?
Thanks for the good work.
Use Carte. This is exactly what it is for. You can start up a server (on the local box if you like) and then submit your jobs to it: one JVM, one heap, shared resources.
A further benefit is scalability: when your box becomes too busy, just add another one, also running Carte, and start sending some of the jobs to that other server.
There's an old but still current blog post here:
http://diethardsteiner.blogspot.co.uk/2011/01/pentaho-data-integration-remote.html
as well as docs on the Pentaho website.
Starting the server is as simple as:
carte.sh <hostname> <port>
There is also a status page which you can use to query your Carte servers, so if you have a cluster of servers you can pick a quiet one to send your job to.

Does StackExchange.Redis support MONITOR?

I recently migrated from Booksleeve to StackExchange.Redis.
For monitoring purposes, I need to use the MONITOR command.
In the wiki I read
From the IServer instance, the Server commands are available
But I can't find any method concerning MONITOR in IServer; after a quick search in the repository, it seems this command is not mapped, even though RedisCommand.MONITOR is defined.
So, is the MONITOR command supported by StackExchange.Redis?
Support for MONITOR is not provided, for multiple reasons:
invoking MONITOR is a path of no return; a monitor connection can never be anything except a monitor connection - it certainly doesn't play nicely with the multiplexer (although I guess a separate connection could be used)
MONITOR is not something that is generally encouraged - it has an impact; and when you do use it, it is a good idea to run it as close to the server as possible (typically in a terminal on the server itself)
it should typically be used only for short durations
But more importantly, perhaps, I simply haven't seen a suitable use case or had a request for it. If there is some scenario where MONITOR makes sense, I'm happy to consider adding some kind of support. What is it that you want to do with it here?
Note the caveat on the MONITOR page you link to:
In this particular case, running a single MONITOR client can reduce the throughput by more than 50%. Running more MONITOR clients will reduce throughput even more.

Top & httpd - demystifying what is actually running

I often use the "top" command to see what is taking up resources. Mostly it comes up with a long list of Apache httpd processes, which is not very useful. Is there any way to see a similar list, but such that I could see which PHP scripts etc. those httpd processes are actually running?
If you're concerned about long-running processes (i.e. requests that take more than a second or two to execute), you can get an idea of them using Apache's mod_status. See the documentation, and an example of the output (from www.apache.org). This isn't unique to PHP; it applies to anything running inside an apache process.
Note that the www.apache.org status output is publicly available, presumably for demonstration purposes - you'd want to restrict access to yours so that not everyone can see it.
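A sketch of the corresponding Apache configuration, assuming mod_status is already loaded; `Require local` (Apache 2.4 syntax) is what keeps the page from being publicly visible:

```apacheconf
# Expose the scoreboard at /server-status, but only to local requests.
<Location "/server-status">
    SetHandler server-status
    Require local
</Location>

# ExtendedStatus adds per-request detail (client, request line, state),
# which is what lets you see what each httpd process is serving.
ExtendedStatus On
```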
There's a top-like ncurses-based utility called apachetop which provides real-time log analysis for Apache. Unfortunately the project has been abandoned and the code suffers from some bugs, but it's still very much usable. Just don't run it as root: run it as any user with access to the web server log files and you should be fine.
PHP scripts usually finish so fast that top won't show you much - they'd zip by quite quickly. Most web requests are quick.
I think your best bet would be some kind of real-time log processor that keeps an eye on your access logs and updates stats for you: average run time, memory usage, and things like that.
You could make your PHP pages time themselves and write their path and execution time to a file or database. Note that this would slow everything down while you were monitoring, but it would serve as a good measuring method.
It wouldn't be very interactive, though. You'd be able to get daily or weekly results from it, but it would be hard to see anything meaningful within minutes or hours.