profile an awk command? - awk

Probably a silly question, since awk commands are usually pretty compact and do just one or two operations...
Is there a way to profile an awk command? I.e., if it uses gsub, split, or sorting of associative arrays, is there an easy way to find out which part is bogging down the whole operation?
EDIT: Specifically, I am looking for execution time for each subcommand, not how many times it was called. Is this possible?

From the gawk man page:
pgawk is the profiling version of gawk. It is identical in every way
to gawk, except that programs run more slowly, and it automatically
produces an execution profile in the file awkprof.out when done. See
the --profile option, below.
So the answer would be yes, if you are using the GNU implementation.
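A minimal sketch of what that looks like in practice (the script and data file names are placeholders; newer gawk releases drop the separate pgawk binary in favour of a --profile option on gawk itself):
pgawk -f myscript.awk data.txt            # older gawk: separate profiling binary
gawk --profile -f myscript.awk data.txt   # newer gawk: same effect via --profile
cat awkprof.out                           # pretty-printed program with execution counts
Note that the profile reports how many times each rule and statement ran, not how long each one took, so it answers "how often" more directly than "how long".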
And to forestall your next question, the man page goes on to say
dgawk is an awk debugger. Instead of running the program directly, it
loads the AWK source code and then prompts for debugging commands.
Unlike gawk and pgawk, dgawk only processes AWK program source provided
with the -f option. The debugger is documented in GAWK: Effective AWK
Programming.

There's an awk implementation with a debugger similar to gdb, called dgawk.
You say you want execution time for each subcommand.
Here's how I do it, regardless of language:
Give it enough workload so it runs long enough, and time it with a watch (N seconds).
Then do it again, and while it's running, hit Ctrl-C.
Do backtrace to examine the stack, and copy that into a text editor.
Do that several times, like 10.
Any subcommand will appear on the stack for the fraction of time it spends.
So if sort is taking 50% of the time (N/2 seconds), it will appear on about 5 of those samples.
This tells you about big time-takers, not little ones. I assume you are looking for the big ones.
(Some people say this isn't accurate, which is baloney. Sure the amount of time isn't very accurate - it doesn't need to be. The accuracy you need is in location - pinpointing where the problem is, and that's what it does.)
ADDED: You can almost do this with pgawk. If you run your program in profiling mode, each time you hit Ctrl-C (or whatever) it prints the call stack to the output file. The only problem is, it prints the function names but not what lines they are called from, which you might actually need.
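A hedged sketch of that workflow (the script and data names are placeholders; per the gawk documentation, sending SIGUSR1 to the profiling gawk makes it dump the profile and function call stack to awkprof.out and keep running):
pgawk -f long_running.awk big_data.txt &   # start the profiling run in the background
kill -USR1 $!                              # dump profile + call stack, keep running
# repeat the kill a few times while it runs, then read awkprof.out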

Here is the fine documentation about profiling gawk.

Build a profiling version of gawk for gprof, or use the kernel-based oprofile. You can then see in a lot of detail how much time is spent in various internal functions in gawk in response to your script and its data. Functions like gsub and split map to functions inside gawk.
For instance, gsub and its relatives are handled by the do_sub function in this source file:
http://git.savannah.gnu.org/cgit/gawk.git/tree/builtin.c
So you would look for how much time is spent in do_sub.
You want to compile and link gawk with the -pg GCC option. Successful runs of the program will then dump a profiling file gmon.out from which gprof will produce a report.
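A rough sketch of that build-and-profile cycle, assuming a gawk source tree and a script of your own (the configure invocation and file names are assumptions; the -pg/gprof mechanics are standard):
./configure CFLAGS="-O2 -pg" LDFLAGS="-pg"   # instrument gawk itself for gprof
make
./gawk -f yourscript.awk yourdata.txt        # a successful run writes gmon.out
gprof ./gawk gmon.out > profile.txt          # flat profile + call graph; look for do_sub, etc.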
I highly recommend oprofile as well, but going into it is a little out of scope for this answer.

Related

Gawk "system" call. Run external program in several instances

I am using Gawk on a Mac-based UNIX. I have a "system" call that runs a program. As I understand it, the program waits for the external program to finish before going on to the next line. I would like to start several instances of the external program, wait for them all to finish, "cat" the results, and then continue with the Gawk program. The value in this is to use all the processors on a Mac Pro machine. It takes days to run now, and running the external program in several instances would greatly help. Thank you in advance for any help on this.
I made a system call to a bash script that ran the various programs in the background, and used the bash "wait" command to make sure the programs were finished before returning to the GAWK program. That worked.
Got the idea from the son of a friend. Very helpful.
Ellis
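A minimal sketch of the approach Ellis describes (program and file names are made up, just to show the shape):
#!/bin/sh
# run_all.sh - hypothetical wrapper: launch several instances in parallel,
# then block until every one of them has finished
myprog chunk1.txt > out1.txt &
myprog chunk2.txt > out2.txt &
myprog chunk3.txt > out3.txt &
wait                                          # returns only when all background jobs are done
cat out1.txt out2.txt out3.txt > combined.txt
The gawk program then makes a single call such as system("./run_all.sh"), which does not return until the wait has collected every background job.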

Why are all programs divided into 200 basic blocks by Valgrind?

Why are all programs divided into 200 basic blocks by Valgrind? And how are they divided?
First Question
It's been some time since I've worked on a Valgrind tool (even longer than this question is old), but in case anyone is still interested, here's what I've dredged up from memory:
First, a distinction: a super block is a bit different from a basic block. Valgrind uses super blocks, not basic blocks. A super block may exit at any point, but a basic block will only ever exit by running off its end.
Valgrind doesn't divide a program into 200 super blocks. I'm pretty sure that it instead breaks programs up into super blocks of no more than 200 IRStatements (which may or may not translate directly into instructions).
The reason for this I'm pretty sure is for efficiency of the translator: at least with current versions of Valgrind I'm reasonably sure it doesn't translate your entire program up front. Translating the program into its IR format is time consuming and resource intensive, so the translator seeks to only translate as much of the program as it needs to. It does this by only translating code as it gets executed for the first time.
Second Question
Now, as to your second question... I'm not entirely sure what you're asking. If you're asking, "How does Valgrind decide how to divide up the program?", then the answer is that it decides similarly to a compiler. It starts converting the program into super blocks, and starts a new super block whenever it reaches the block limit size or detects that there is an entry point into the block from elsewhere (super blocks and basic blocks can only have one entry point).
If you instead meant, "Can I change the size of an IRSB super block?", then yes, there is an option you can pass back to Valgrind in your tools initialization code to tell it what size super blocks you want (although I don't recall if you can increase this to an arbitrary size). None of this is documented online, and only sparsely documented in the files themselves. You can take a look at the source to the other tools to see how they pass configuration options to Valgrind during initialization. That should at least give you a good idea on which headers to look at to figure out what option you need to pass back to Valgrind.

Is there any interactive console for some strong language for everyday work of processing strings?

I started working in IT recently, with some programming background.
There are so many occasions where there's a need to process large amounts of data, mainly strings I guess.
For example:
there are two large sets of lines, and we need all the lines that appear in both sets
replacing each run of one or more whitespace characters with a single line break
taking the 4th to 7th characters of each line and printing them on one line with a comma as the delimiter
These are not the best examples, but generally I mean any kind of parsing, manipulation, and querying of text.
Very often the task is extremely easy in any programming language, but it is just too frustrating to open that language's IDE...
I'm looking for some way to write code (with IntelliSense/autocomplete) in a quick, simple window, with simple input and output textboxes.
Do you understand my need? Can you think of anything that can help?
I know some of these problems can be solved using Excel, but I really prefer some good old programming, unless someone strongly believes I'm wrong.
If I build something myself, there will be an option to add any number of multiline textboxes. They'll be automatically named, although the names are changeable (the names will be the names of the variables).
You can likewise add any number of named output textboxes.
And you have the editor window, in which you write the procedure; it will have some interactive IntelliSense-like interface.
Can you see what I'm saying? Do you know of anything similar?
Seems like Python would be fine for this.
It has an interactive interpreter, quite nice abstraction facilities, and strings as objects with good libraries for processing them.
It sounds like a lot of what you want can be handled with regular expressions using sed, awk, or perl in a standard console. Autocomplete will be pretty limited, but your scripts will be short anyway - to deal with your third case above, for example:
sed 's/^...\(....\).*/\1/g' < input.txt | tr "\n" ',' > output.csv
What you can do is use an interactive regex tester. There are many online, like this one.
You could also look into tools like Data Wrangler from Stanford, which are designed to be more accessible than traditional shell tools while remaining just as powerful.
(Note that your first issue - intersecting sets of lines - is a bit different, and would be solved in the shell with comm. This page has a good explanation of how to use comm to perform set operations like "all the files in this file not in this file" or "only the files in this file also in this other file".)
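For instance, hedged one-liners for the first two cases in the question (file names are placeholders; comm needs sorted input, and the <(...) process substitution is bash-specific):
comm -12 <(sort set_a.txt) <(sort set_b.txt)   # lines present in both sets
comm -23 <(sort set_a.txt) <(sort set_b.txt)   # lines only in set_a.txt
tr -s ' \t' '\n' < input.txt                   # collapse each run of blanks into one line break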

Faster way of testing your prolog program

I am new to Prolog, and the task of launching the Prolog interpreter from the terminal, typing consult('some_prolog_program.pl'), and then testing the predicate you just wrote is very time consuming. Is there a way to run a scripted test to speed up development?
For example in C I can write a main where I would use the functions I defined, I can then execute:
make && ./a.out
to test the code, can I do something similar with Prolog?
You can have the interpreter always open and then recompile the file.
You can auto-run a predicate after compiling the file:
:- foo(4,2).
This will run foo(4,2) when the line is encountered in the file.
There are flags that can be used while launching (most) Prolog interpreters that allow you to compile a file and run predicates (check the man page). This way you could make a Bash script. The following will consult file.pl and run foo/0 using SWI-Prolog:
#!/bin/sh
exec swipl -q -f none -g "load_files([file],[silent(true)])" \
-t foo -- $*
This predicate will unify Arguments with a list of the flags you gave at the command line:
current_prolog_flag(argv, Arguments)
But unless you are going to run a lot of tests, I don't think that writing all this extra code will be faster.
Personally I really like the flexibility of testing any predicate at any time with or without tracing (see trace/0) without having to write extra code to call them (unlike in C).
P.S. about reloading the file without leaving the interpreter: You might have some problem if you have used dynamic predicates or global variables; you will have to do some cleaning.
You can invoke a test file from the command-line with prolog +l <file>
Also, you can build a single run_tests predicate that exercises a series of calls and validates the actual results against expected results. Here's an article with a good worked-out example: http://kenegozi.com/blog/2008/07/24/unit-testing-in-prolog
In SWI, you can load things as usual. Then, when you edit your files you simply say make. on the toplevel and it checks all dependencies automatically and only reloads the modified files.
For bigger projects it does make a lot of sense to use makefiles. In particular to do unit testing. See SWI's package plunit.
For simple scripts in SWI-Prolog, using REPL to test the code manually is usually good enough. Changed files can be reloaded via make/0 (?- make. on toplevel). Just keep the Prolog REPL running while editing, then save the edits, run make. in the REPL and hit ↑, ↑, Enter to execute the last query before the make. from history.
The main benefit of REPL is its interactivity:
You may fiddle with the arguments.
Transition to debugging or tracing (both command line and graphical) is easy.
You don't need to perform I/O to print the result. Output is handled by the toplevel, which prints the substitution. You see the whole substitution, not only its part you just happen to print (possibly accidentally overlooking other parts).
You may interactively choose how many substitutions you want to see for a goal that succeeds multiple times.
It is obvious if there is a choice point left after the last result returned by a non-deterministic predicate, which is hard to observe otherwise. In that case, false. is printed when backtracking beyond the last result.
If you need to preserve the test calls to repeat them later, create a protocol (transcript or "log" of the interactive session) and edit it to become a script, or even a test suite (see below). The protocol is a plain text file with escape sequences for the terminal, containing a verbatim copy of what you see during the interactive session. View the protocol using cat protocol.txt on Linux (and other *NIXes) or type protocol.txt on Windows.
If interactivity is not needed, perform the test calls from the command line non-interactively. Let's test the CLP(FD) factorial example n_factorial/2, saved in factorial.pl (don't forget to add :- use_module(library(clpfd)). when copying the code):
$ swipl -q -t "between(0, 9, N), n_factorial(N, F), format('~D ', F), fail." factorial.pl
1 1 2 6 24 120 720 5,040 40,320 362,880
On Windows, you may need to specify the full path to swipl.exe, as it's probably not in the PATH.
If the call is always the same, you may save it to a shell script or Makefile (run would be a good name for the target).
In your current workflow for testing functions in C, you create a new program and call the function under test from its entry point (main function). Prolog scripts can have an entry point, too. See library(main). Prolog does not require compilation, so you can just directly call the script (./test.pl) without calling Make first.
For larger projects, you may want to create a less ad-hoc test suite. A unit testing framework like PlUnit is needed. Its use is beyond the scope of this answer; see the documentation.
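A hedged sketch of what such a suite could look like for the factorial example above (the file and test names are made up; see the PlUnit documentation for the full story):
cat > factorial_tests.pl <<'EOF'
:- use_module(library(plunit)).
:- [factorial].                  % load the code under test
:- begin_tests(factorial).
test(zero) :- n_factorial(0, 1).
test(five) :- n_factorial(5, 120).
:- end_tests(factorial).
EOF
swipl -q -g run_tests -t halt factorial_tests.pl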

How would one go about testing an interpreter or a compiler?

I've been experimenting with creating an interpreter for Brainfuck, and while quite simple to make and get up and running, part of me wants to be able to run tests against it. I can't seem to fathom how many tests one might have to write to test all the possible instruction combinations to ensure that the implementation is proper.
Obviously, with Brainfuck, the instruction set is small, but I can't help but think that as more instructions are added, your test code would grow exponentially. More so than your typical tests at any rate.
Now, I'm about as newbie as you can get in terms of writing compilers and interpreters, so my assumptions could very well be way off base.
Basically, where do you even begin with testing on something like this?
Testing a compiler is a little different from testing some other kinds of apps, because it's OK for the compiler to produce different assembly-code versions of a program as long as they all do the right thing. However, if you're just testing an interpreter, it's pretty much the same as any other text-based application. Here is a Unix-centric view:
You will want to build up a regression test suite. Each test should have
Source code you will interpret, say test001.bf
Standard input to the program you will interpret, say test001.0
What you expect the interpreter to produce on standard output, say test001.1
What you expect the interpreter to produce on standard error, say test001.2 (you care about standard error because you want to test your interpreter's error messages)
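As a concrete, made-up example, the four files for one test might be created like this (the program prints a single "A" and reads no input):
printf '++++++++[>++++++++<-]>+.' > test001.bf   # 8*8 + 1 = 65, prints "A"
: > test001.0                                    # empty standard input
printf 'A' > test001.1                           # expected standard output
: > test001.2                                    # expected standard error (nothing)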
You will need a "run test" script that does something like the following
function fail {
echo "Unexpected differences on $1:"
diff $2 $3
exit 1
}
for testname
do
tmp1=$(tempfile)
tmp2=$(tempfile)
brainfuck $testname.bf < $testname.0 > $tmp1 2> $tmp2
cmp -s $testname.1 $tmp1 || fail "stdout" $testname.1 $tmp1
cmp -s $testname.2 $tmp2 || fail "stderr" $testname.2 $tmp2
done
You will find it helpful to have a "create test" script that does something like
brainfuck $testname.bf < $testname.0 > $testname.1 2> $testname.2
You run this only when you're totally confident that the interpreter works for that case.
You keep your test suite under source control.
It's convenient to embellish your test script so you can leave out files that are expected to be empty.
Any time anything changes, you re-run all the tests. You probably also re-run them all nightly via a cron job.
Finally, you want to add enough tests to get good test coverage of your compiler's source code. The quality of coverage tools varies widely, but GNU Gcov is an adequate coverage tool.
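A hedged sketch of how that could look with gcov, assuming the interpreter is a single brainfuck.c and the "run test" script above was saved as run-tests.sh (both names are assumptions):
gcc --coverage -O0 -o brainfuck brainfuck.c   # build with coverage instrumentation
./run-tests.sh test001 test002                # exercise the interpreter with the suite
gcov brainfuck.c                              # writes brainfuck.c.gcov with per-line hit counts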
Good luck with your interpreter! If you want to see a lovingly crafted but not very well documented testing infrastructure, go look at the test2 directory for the Quick C-- compiler.
I don't think there's anything 'special' about testing a compiler; in a sense it's almost easier than testing some programs, since a compiler has such a basic high-level summary - you hand in source, it gives you back (possibly) compiled code and (possibly) a set of diagnostic messages.
Like any complex software entity, there will be many code paths, but since it's all very data-oriented (text in, text and bytes out) it's straightforward to author tests.
I’ve written an article on compiler testing, the original conclusion of which (slightly toned down for publication) was: It’s morally wrong to reinvent the wheel. Unless you already know all about the preexisting solutions and have a very good reason for ignoring them, you should start by looking at the tools that already exist. The easiest place to start is Gnu C Torture, but bear in mind that it’s based on Deja Gnu, which has, shall we say, issues. (It took me six attempts even to get the maintainer to allow a critical bug report about the Hello World example onto the mailing list.)
I’ll immodestly suggest that you look at the following as a starting place for tools to investigate:
Software: Practice and Experience, April 2007. (Payware, not available to the general public; a free preprint is at http://pobox.com/~flash/Practical_Testing_of_C99.pdf.)
http://en.wikipedia.org/wiki/Compiler_correctness#Testing (Largely written by me.)
Compiler testing bibliography (Please let me know of any updates I’ve missed.)
In the case of brainfuck, I think testing it should be done with brainfuck scripts. I would test the following, though:
1: Are all the cells initialized to 0
2: What happens when you decrement the data pointer when it's currently pointing to the first cell? Does it wrap? Does it point to invalid memory?
3: What happens when you increment the data pointer when it's pointing at the last cell? Does it wrap? Does it point to invalid memory?
4: Does output function correctly?
5: Does input function correctly?
6: Does the [ ] stuff work correctly?
7: What happens when you increment a byte more than 255 times, does it wrap to 0 properly, or is it incorrectly treated as an integer or other value.
More tests are possible too, but this is probably where I'd start. I wrote a BF compiler a few years ago, and it had a few extra tests. In particular I tested the [ ] stuff heavily, by having a lot of code inside the block, since an early version of my code generator had issues there (on x86, using a jxx, I had issues when the block produced more than about 128 bytes of code, resulting in invalid x86 asm).
You can test with some already written apps.
The secret is to:
Separate the concerns
Observe the law of Demeter
Inject your dependencies
Well, software that is hard to test is a sign that the developer wrote it like it's 1985. Sorry to say that, but utilizing the three principles I presented here, even line-numbered BASIC would be unit testable (it IS possible to inject dependencies into BASIC, because you can do "goto variable").