Does an AWK script take up a lot of CPU? - scripting

Does AWK use a lot of processing power? If so, is there a better scripting language for the job? Or should I do it in C itself (where the rest of my code is)?

Depends on what you're telling it to do. Most of the work is passed to the regexp engine, which should be similar, no matter what language you use.
Now if you're using an awk script from inside a C program, and you have the resources to just implement the functionality in C too, you're best off doing that. You'll avoid the process creation/termination + communication overhead (which may or may not be a big part of the performance hit you'll get).
For more information, tell us more about your script!
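To make the process-creation point concrete, here is a minimal shell sketch. The sample data and the function names `gen`, `per_line`, and `single_pass` are invented for illustration. It contrasts starting one awk process per record, which is roughly what a C program pays if it shells out for every line, with a single awk process that reads the whole stream.

```shell
#!/bin/sh
# Hypothetical sample data: N lines of the form "a b <i>".
gen() {
    i=1
    while [ "$i" -le "$1" ]; do
        printf 'a b %s\n' "$i"
        i=$((i + 1))
    done
}

# One awk process per input line: pays process startup N times.
per_line() {
    gen "$1" | while IFS= read -r line; do
        printf '%s\n' "$line" | awk '{ print $3 }'
    done
}

# One awk process for the whole stream: pays process startup once.
single_pass() {
    gen "$1" | awk '{ print $3 }'
}

# Both approaches print the same thing; only the cost differs.
[ "$(per_line 5)" = "$(single_pass 5)" ] && echo "same output"
```

Timing each function with a larger count, for example `time per_line 1000 >/dev/null` against `time single_pass 1000 >/dev/null` in bash, should show how much of the cost is process startup rather than awk's own work. The actual numbers depend on the machine.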

If most of your code is in C, it is probably cleaner to use C to do your string processing rather than shelling out.
You can use PCRE directly in your program.

Related

What is a good, simple scripting language to embed into a macro-processor

I want to write a macro processor. So far I've done a very simple sketch of how it should look, and I came to the conclusion that inventing a completely new language would not be a good idea; I should reuse existing concepts instead. My sketch so far is a kind of irb with some TeX-like syntax and features, but I'm not sure what I should use as a Ruby substitute.
The language should be simple, yet powerful. I don't want to write an OS in it, but it should be less "raw" than e.g. bc or Forth. I don't care about execution time at all. Embedding should not be too hard, and it would be nice if the language itself were stable.
So far I've considered these:
Lua - It should process text easily. Lua does not even have a while(c=getchar()){}. I'm skeptical.
awk - Simple, text processing is easy, but never intended for embedding
perl - Way too complex, stable, but it seems almost dead.
python - Significant whitespace; won't it get in the way for inlined function definitions?
groovy/nice/java - Hard/impossible to embed? Also way too heavy.
javascript - Really like it (besides DOM) but is there a stable/embeddable implementation? I don't want to mess around with the API every 2 weeks when there's a new v8 version. As I said, I don't care about execution time.
I have not really found any pros/cons for
io
guile/scheme
TCL
Update: The language should have features such as function definition, library loading, or regexps (loops would also be very nice). I don't want to use a traditional macro language such as M4 because I want to be able to write in a more procedural (or maybe functional) style. Macro languages have their pros, but they require a completely new way of thinking about a problem, which is hard, especially for beginners. My aim is to use the best of both worlds.
Given that TCL is about string and array processing, and is intended for embedding, it would seem an obvious choice.
LuaTeX has a certain following. Presumably its developers have found a way to make Lua work for text processing, so you might like to look at that.
Scheme (including guile) is also very nice for scripting; alternatively you might look at whether there is a way you could embed an elisp processor (embed xemacs?), which after all is all about text processing.

overhead of exec() call?

I have a web script that is a simple wrapper around a perl program:
#!/bin/sh
. ~/.myinit   # pull in some configurations like [PRIVATE_DIR]
exec [PRIVATE_DIR]/myprog.pl
This is really just to keep the code better compartmentalized, since the main program (myprog.pl) runs on different machines with different configuration modes and it's cleaner to not have to update that for every installation. However, I would be willing to sacrifice cleanliness for efficiency.
So the question is, does this extra sh exec() call add any non-negligible overhead, considering it could be hit by the webserver quite frequently (we have 1000s of users on at a time)? I ask, because I know that people have gone to great lengths to embed programs into httpd to avoid having to make an extra fork/exec call. My assumption has been that this was due to the overhead of the program being called (eg mod_perl bypasses the extremely slow perl startup), not the process of calling itself. Would that be accurate?
I've tried to benchmark this but I can't get the resolution I need to see if it makes a difference. Any advice on that would also be appreciated.
Thanks!
Try a test?
Assuming you have a test environment separate from production.
(This is a little vague, and real testers will be able to suggest how to improve on it.)
Run your current 2-level design in a tight loop, using the same params each time, for 100,000 or 1,000,000 iterations.
Then 'fix' your code so the perl program is called directly, with the same params as above, looping for the same count.
Capture performance stats for both runs. The difference between the two should be (roughly) the cost of the extra call.
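The test described above could be set up roughly like this. It is only a sketch: `myprog.sh` stands in for `myprog.pl`, and the temporary directory, the `myinit` file, the helper `run_n`, and the loop count are all invented for the illustration.

```shell
#!/bin/sh
dir=$(mktemp -d)
trap 'rm -rf "$dir"' EXIT

# Stand-in for the real program (hypothetical).
cat > "$dir/myprog.sh" <<'EOF'
#!/bin/sh
echo "ok"
EOF

# Stand-in for ~/.myinit (hypothetical).
echo "PRIVATE_DIR=$dir" > "$dir/myinit"

# Two-level design: read the config, then exec the program.
cat > "$dir/wrapper.sh" <<EOF
#!/bin/sh
. "$dir/myinit"
exec "\$PRIVATE_DIR/myprog.sh"
EOF

chmod +x "$dir/myprog.sh" "$dir/wrapper.sh"

# Run script $1 a total of $2 times, discarding its output.
run_n() {
    i=0
    while [ "$i" -lt "$2" ]; do
        "$1" > /dev/null
        i=$((i + 1))
    done
}

run_n "$dir/wrapper.sh" 50   # with the extra sh + exec
run_n "$dir/myprog.sh" 50    # program called directly
```

Wrapping each `run_n` call in a timer (for example the `time` keyword in bash) with a much larger count, then dividing the difference by the count, gives a rough per-request cost for the wrapper. With a real Perl program the interpreter startup will probably dominate both numbers.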
If this works out, I'd appreciate seeing the results on this:-)
(There are many more posts for tag=testing than tag=benchmark)
I hope this helps.
P.S. as you appear to be a new user, if you get an answer that helps you please remember to mark it as accepted, or give it a + (or -) as a useful answer
It's easy for a perl program to take over some environment variables from a shell script; see the answer to Have perl execute shellscript & take over env vars.
Using this there is no need for exec at all which should alleviate all your worries about exec overhead. :-)

Is there still any reason to learn AWK?

I am constantly learning new tools, even old fashioned ones, because I like to use the right solution for the problem.
Nevertheless, I wonder if there is still any reason to learn some of them. awk for example is interesting to me, but for simple text processing, I can use grep, cut, sed, etc. while for complex ones, I'll go for Python.
Now I don't mean that it's not a powerful and handy tool. But since it takes time and energy to learn a new tool, is it worth it?
If you quickly learn the basics of awk, you can indeed do amazing things on the command line.
But the real reason to learn awk is to have an excuse to read the superb book The AWK Programming Language by Aho, Kernighan, and Weinberger.
The AWK Programming Language at archive.org
You would think, from the name, that it simply teaches you awk. Actually, that is just the beginning. Launching into the vast array of problems that can be tackled once one is using a concise scripting language that makes string manipulation easy — and awk was one of the first — it proceeds to teach the reader how to implement a database, a parser, an interpreter, and (if memory serves me) a compiler for a small project-specific computer language! If only they had also programmed an example operating system using awk, the book would have been a fairly complete survey introduction to computer science!
Famously clear and concise, like the original C Language book, it also is a wonderful example of friendly technical writing done right. Even the index is a piece of craftsmanship.
Awk? If you know it, you'll use it at the command-line occasionally, but for anything larger you'll feel trapped, unable to access the wider features of your system and the Internet that something like Python provides access to. But the book? You'll always be glad you read it!
I think it depends on the environment you find yourself in. If you are a *nix person, then knowing awk is a Good Thing. The only other scripting environment that can be found on virtually every *nix is sh. So while grep, sed, etc can surely replace awk on a modern mainstream linux distro, when you move to more exotic systems, knowing a little awk is going to be Real Handy.
awk can also be used for more than just text processing. For example one of my supervisors writes astronomy code in awk - that is how utterly old school and awesome he is. Back in his days, it was the best tool for the job... and now even though his students like me use python and what not, he sticks to what he knows and works well.
In closing, there is a lot of old code kicking around the world, so knowing a little awk isn't going to hurt. It will also make you a better *nix person :-)
The only reason I use awk is the auto-splitting:
awk '{print $3}' < file.in
This prints the third whitespace-delimited field in file.in. It's a bit easier than:
tr -s ' ' < file.in | cut -d' ' -f3
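The two commands above are not quite interchangeable, which is part of why the awk form is easier. A small sketch, with a made-up sample line and hypothetical helper names `third_awk` and `third_cut`:

```shell
#!/bin/sh
third_awk() { awk '{ print $3 }'; }
third_cut() { tr -s ' ' | cut -d' ' -f3; }

# Sample line with leading spaces and a run of spaces between fields.
line='  alpha beta   gamma'

# awk ignores leading whitespace and splits on runs of blanks and tabs.
printf '%s\n' "$line" | third_awk   # prints: gamma

# cut counts the empty field before the first space, so every field shifts.
printf '%s\n' "$line" | third_cut   # prints: beta
```

The `tr | cut` pipeline also treats tabs as ordinary characters, while awk's default splitting accepts them as separators.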
I think awk is great if your file contains columns/fields. I use it when processing/analyzing a particular column in a multicolumn file. Or if I want to add/delete a particular column(s).
e.g.
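awk -F '\t' '{ if ($2 > $3) print; }' <filename>
will print a line only if the 2nd column value in a tab-separated file is greater than the 3rd column value. (The '\t' must be quoted; otherwise the shell strips the backslash and awk splits on the letter t.)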
Of course I could use Perl or Python, but awk makes it so much simpler with a concise single line command.
Also learning awk is pretty low-cost. You can learn awk basics in less than an hour, so it's not as much effort as learning any other programming/scripting language.
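As a quick illustration of the tab-separated comparison above, here is a sketch on invented sample rows (the helper name `bigger_second` is hypothetical). It uses the shorter pattern-only form, where a true condition prints the line by default.

```shell
#!/bin/sh
# Print rows whose 2nd tab-separated column is greater than the 3rd.
bigger_second() { awk -F '\t' '$2 > $3'; }

printf 'a\t10\t9\nb\t2\t10\nc\t7\t7\n' | bigger_second
# prints the single row: a 10 9 (tab-separated)
```

Fields that look like numbers are compared numerically, so row `b` is dropped because 2 is less than 10, even though the string "2" sorts after "10".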
I use AWK occasionally for dealing with HTML. For instance, this code translates tables to csv files:
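BEGIN { s = ""; FS = "\n" }                      # FS="\n" makes $1 the whole line
/<td/ { gsub(/<[^>]*>/, ""); s = (s ", " $1) }   # strip the tags, append the cell
/<tr|<TR/ { print s; s = "" }                    # a new row starts: emit the previous one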
Which is great if you're screen scraping. Actually, it might be the case that I love AWK because it allows me to build the wrong solution to problems so quickly :) It's also mentioned in Jon Bentley's lovely Programming Pearls.
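The table-to-CSV idea above can be tried on a tiny input. This is a slightly tidied sketch, not the poster's exact script: it emits a row when it sees the closing `</tr>` and avoids a leading comma. The helper name `html_to_csv` and the sample table are invented, and the sketch assumes lowercase tags with one cell per line.

```shell
#!/bin/sh
html_to_csv() {
    awk '
        /<td/ {
            gsub(/<[^>]*>/, "")            # strip every tag on the line
            gsub(/^[ \t]+|[ \t]+$/, "")    # trim surrounding whitespace
            row = (row == "" ? $0 : row ", " $0)
        }
        /<\/tr>/ { print row; row = "" }   # row finished: emit and reset
    '
}

printf '<table>\n<tr>\n<td>a</td>\n<td>b</td>\n</tr>\n<tr>\n<td>c</td>\n<td>d</td>\n</tr>\n</table>\n' | html_to_csv
# prints:
# a, b
# c, d
```

Real-world HTML (several cells on one line, nested tags, cells containing commas) needs a proper parser; this only works for regular, machine-generated tables.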
6 years after asking this question I can now answer with certainty: no, learning awk is not worth it.
Basic tasks are handled easily by basic bash commands, or even GUI tools. More complex tasks are easily tackled with modern dynamic languages such as Python (a favorite of mine) or Ruby.
You should definitely learn a modern dynamic scripting language, as it will help you in so many tasks (web, admin, data crunching, automation, etc.). Once you have done so, learning a tool such as awk is completely useless; it will save you at best a few seconds every month.
I do use awk every so often. It's good for very simple text shuffling in the middle of a pipeline; it fills a very narrow niche right between not needing it at all and needing to whip out Perl/Python/whatever.
I wouldn't advise you spend a whole lot of time on it, but it might come in handy to know the basics of the syntax -- at least enough that you can consult the manual quickly should you ever want to use it.
Learning AWK was invaluable for me in my last contract working on an embedded Linux system on which neither Perl nor most other scripting languages were installed.
If you already know and use sed, you might as well pick up at least a bit of awk. They can be piped together for some pretty powerful tricks. Always impresses the audience.
Most awk one liners can be achieved with Perl one liners - if you choose to get into a Perl one liner mindset. Or, just use Perl three liners :)
If you're maintaining shell scripts written by someone who liked awk, then clearly, you're going to need to learn awk.
Even if there's no practical need, if you already know regex it won't take long to pick up the basics, and it's fun to see how things were designed back then. It's rather elegant.
Computerworld recently did an interview with Alfred V. Aho (one of the three creators of AWK) about AWK. It's quite an interesting read, so maybe you'll find some hints in it as to why it's a good idea to learn AWK.
awk has a very good utility-to-difficulty ratio, and "simple awk" works on every Unix/Linux/macOS system (and it can be installed on other systems too).
It was designed in a golden age when people hated typing, so scripts can be very short and fast to write. I will try to install mawk, a fast implementation that allegedly speeds up computation about 9 times; awk/gawk is rather slow, so if you want to use it instead of R etc., you may want mawk.
Nope.
Even though it might be interesting, you can do everything that awk can do using other, more powerful tools such as Perl.
Spend your time learning those more powerful tools - and only incidentally pick up some awk along the way.
It's useful mostly if you occasionally have to parse log files or program output for data while shell scripting, because that is very easy to achieve in awk and would take a few more lines of code in Python.
It certainly has more power than that, but these seem to be the tasks most people use it for.
Of course: I'm working in an environment where the only available languages are: (some shitty language which generates COBOL, OMG, OMG), bash (old version), perl (I haven't mastered it yet), sed, awk, and some other command line utilities. Knowing awk has saved me several hours (and has generated several text processing requests from my colleagues - they come to me at least three times a day).
I'd say it's probably not worth it anymore. I use it from time to time as a much more versatile stream editor than sed with searching abilities included, but if you are proficient with python I do not know a task which you would be able to finish that much faster to compensate for the time needed to learn awk.
The following command is probably the only one for which I've used awk in the last two years (it purges half-removed packages from my Debian/Ubuntu systems):
$ dpkg -l|awk '/^rc/ {print $2}'|xargs sudo dpkg -P
I'd say there is. For simple stuff, AWK is a lot easier on the inexperienced sysadmin / developer than Python. You can learn a little AWK and do a lot of things; learning Python means learning a whole new language (yes, I know AWK is a language in a sense too).
Perl might be able to do a lot of things AWK can do, but offered the choice in this day and age I would choose Python here. So yes, you should learn AWK, but learn Python too :-)
I was recently trying to visualize network pcap files logging a DoS attack, which amounted to over 20 GB. I needed the timestamps and the IP addresses. In my scenario, an AWK one-liner worked fabulously and pretty fast as well. I specifically used AWK to clean the extracted files and get the IP addresses and the total packet count from those IP addresses within grouped spans of time. I totally agree with what other people have written above. It depends on your needs.
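The per-IP, per-time-span counting described above might look something like the following sketch. The input format (epoch seconds, then source IP, one packet per line) is an assumption, as are the helper name `count_per_bucket` and the 60-second default span; the poster's actual one-liner is not shown in the thread.

```shell
#!/bin/sh
# Count packets per source IP within fixed time buckets.
# $1 = bucket width in seconds (defaults to 60).
count_per_bucket() {
    awk -v span="${1:-60}" '
        {
            bucket = int($1 / span) * span   # start of the bucket this packet falls in
            count[bucket " " $2]++
        }
        END { for (k in count) print k, count[k] }
    ' | sort -k1,1n -k2,2                    # for-in order is unspecified, so sort
}

printf '0 10.0.0.1\n30 10.0.0.1\n45 10.0.0.2\n61 10.0.0.1\n' | count_per_bucket 60
# prints:
# 0 10.0.0.1 2
# 0 10.0.0.2 1
# 60 10.0.0.1 1
```

Because awk streams the input and only keeps one counter per (bucket, IP) pair in memory, this style of script stays workable on very large files.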
awk is a powertool language, so you are likely going to find awk being used somewhere if you are an IT professional of any sort. If you can handle the syntax and regular expressions of grep and sed then you should have no problem picking up awk and it is probably worthwhile to.
Where I've found awk really shines is in simplifying things like processing multi-line records and mangling/interpolating multiple files simultaneously.
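Two small sketches of what that can look like. The data, file names, and helper names (`names_and_phones`, `join_names`) are invented. The first uses paragraph mode for multi-line records; the second uses the common `NR == FNR` idiom to load one file into an array and then look things up while reading a second file.

```shell
#!/bin/sh
# Paragraph mode: RS="" makes each blank-line-separated block one record,
# and FS="\n" makes each line of the block a field.
names_and_phones() {
    awk 'BEGIN { RS = ""; FS = "\n" } { print $1 ": " $NF }'
}

printf 'Ann\nLeeds\n555-0100\n\nBob\nYork\n555-0101\n' | names_and_phones
# prints:
# Ann: 555-0100
# Bob: 555-0101

# Two files: $1 holds "id name" lines, $2 holds "id score" lines.
# NR == FNR is only true while awk is still reading the first file.
join_names() {
    awk 'NR == FNR { name[$1] = $2; next }
         ($1 in name) { print name[$1], $2 }' "$1" "$2"
}

tmp=$(mktemp -d)
printf '1 Ann\n2 Bob\n' > "$tmp/names"
printf '2 70\n1 85\n3 40\n' > "$tmp/scores"
join_names "$tmp/names" "$tmp/scores"
# prints:
# Bob 70
# Ann 85
rm -rf "$tmp"
```

One caveat with the `NR == FNR` idiom: if the first file is empty, the condition also holds for the second file, so the lookup silently produces nothing.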
One reason NOT to learn awk is that it doesn't have non-greedy matches in regular expressions.
I have awk code that I now must rewrite, only because I discovered while debugging that there is no such thing as non-greedy matches in awk/gawk, so it can't properly execute some regexes.
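For readers who hit the same wall, the usual workaround when the terminator is a single character is a negated character class. A sketch with a made-up input line and hypothetical helper names `greedy_tag` and `first_tag`:

```shell
#!/bin/sh
# Greedy: .* runs to the last ">" on the line.
greedy_tag() { awk '{ if (match($0, /<.*>/)) print substr($0, RSTART, RLENGTH) }'; }

# Workaround: [^>]* cannot cross a ">", so the match stops at the first one.
first_tag() { awk '{ if (match($0, /<[^>]*>/)) print substr($0, RSTART, RLENGTH) }'; }

line='<b>bold</b> and <i>italic</i>'

printf '%s\n' "$line" | greedy_tag   # prints: <b>bold</b> and <i>italic</i>
printf '%s\n' "$line" | first_tag    # prints: <b>
```

This does not help when the terminator is a multi-character string; in that case a loop over `index()` and `substr()`, or a different tool, is probably the more practical route.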
It depends on your teammates, your leader, and the task you are working on.
if( team mates and leader ask to write awk ){
if( you can reject that){
if( awk code is very small){
learn little just like learn Regex
}else{
use python or even java
}
}else{
do as they ask
}
}
Now that Perl is ported to pretty much every significant platform, I'd say it's not worth it. Perl is more versatile than sed and awk together. As for auto-splitting, you can do it in perl like this:
perl -F':' -ane 'print $F[3],"\n";' /etc/passwd
EDIT: you might still want to get somewhat acquainted with awk, because some other tools are based on its philosophy of pattern-based actions (e.g. DTrace on Solaris).
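For comparison with the perl auto-split one-liner above, here is the awk counterpart as a sketch (the helper name `fourth_field` and the sample passwd line are invented). Perl's `$F[3]` is zero-based, so it corresponds to awk's `$4`.

```shell
#!/bin/sh
# Print the 4th colon-separated field of each line (the group ID in passwd format).
# Reads the files given as arguments, or standard input if there are none.
fourth_field() { awk -F: '{ print $4 }' "$@"; }

printf 'alice:x:1000:1001:Alice:/home/alice:/bin/sh\n' | fourth_field
# prints: 1001
```

On a real system the same function can be pointed at the file directly, as in `fourth_field /etc/passwd`.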
I work in an area where the files are in column format, so awk is invaluable to me for reformatting files so that different software can work together. For someone outside the IT profession, awk is enough and works perfectly. Nowadays computer speed is not an issue, so I can combine awk and Unix tools, piping many one-liner commands into a "script". With awk's searching by field and record, I can check file data very quickly instead of opening the file in vi. I have to say awk has brought joy to my job, especially since I am able to help co-workers sort things out quickly with it. Amazing code to me.
I am doing some coding in Python at present.
But I still do not know it well enough to use it easily for simple one-off file transformations.
With awk I can quickly develop a one-line piece of code on the Unix command line that does some pretty swish transformations. Every time I use awk, the piece of code I write is disposable and no more than a few lines long. Maybe an "if" statement and a "printf" statement here or there on the one line.
I have never written a piece of code that is more than 10 lines long with awk.
I saw some such scripts years ago.
But anything that required many lines of code, I would resort to python.
I love awk. It is a very powerful tool in combination with sed.
If you care at all about speed but don't want to deal with C/C++ or assembly, go for awk, specifically mawk 1.9.9.6.
It also lacks Perl's ugly syntax, Python 3's feature bloat, JavaScript's annoying UTF-16 setup, and C's memory-pointer pitfalls.
Most of the time, for implementations of the same pseudocode, awk only loses against specialized vectorized instructions such as AVX/SSE.

Should I choose scripting or compiled code for small tasks?

I'm a Java programmer, and I like my compiler, static analysis tools and unit testing frameworks as tools that help me quickly deliver robust and efficient code. The JRE is pretty much everywhere I would work, too.
Given that situation, I can't see a reason why I would ever choose to use shell scripting, vb scripting etc, no matter how small the task is if I wear one of my other hats like my cool black sysadmin fedora.
I don't wear the other hats too often. Under what circumstances should I choose scripting over writing compiled code?
Whatever you think will be most efficient for you!
I had a co-worker who seemed to use a different language for every task; Perl for quick text processing, PHP for small internal web applications, .NET for our main product, cygwin for filesystem stuff. He preferred to use the technology which was most specific to the task at hand.
Personally, I find that context switching between technologies is painful. My day-to-day work is in .NET, so that's pretty much the terms I think in. For most tasks I find it more efficient to knock something up in C# using SnippetCompiler than I would to hack around in PowerShell or a scripting environment.
If you are comfortable with Java, and the JRE is everywhere you work, then I would say keep using it. There are, however, languages like perl and python that are particularly suited to quickly solving problems. I would suggest learning either perl or python, and then use your judgement on when to use it.
If I have a small problem that I'd like to solve quickly, I tend to use a scripting language. The code tax is smaller, and, for me at least, the result comes faster.
I would say where it makes sense. If it's going to take you longer to open up your IDE, compile the code, etc. than it would to edit a script file and be done with it, then use a script file. If you're not going to be changing the thing often and are quicker at Java coding, then go that route :)
It is usually quicker to write scripts than compiled programmes. You don't have to worry so much about portability between different platforms and environments. A shell script will run pretty much everywhere on most platforms. Because you're a Java developer and you mention that you have Java everywhere, you might look at Groovy (http://groovy.codehaus.org/). It is a scripting language written in Java with the ability to use Java libraries.
The way I see it (others disagree) all your code needs to be maintainable. The smallest useful collection of code is that which a single person maintains. Even that benefits from the language and tools you mentioned.
However, there may obviously be tasks where specialised languages are more advantageous than a single general purpose language.
If you can write it quicker in Java, then go for it.
Just try and be aware of what the various scripting languages can do.
e.g. Don't make a full blown Java app when you can do the same with a bash one-liner.
Weigh the importance of the tool against popping open a text editor for a quick edit vs. opening IDE, recompiling, redeploying, etc.
Of course, the prime directive should be to "use whatever you're comfortable with." If Java is getting the job done right and on time, stick to it. But a lot of the scripting languages could save you some time because they're attuned to different problems. If you're using regular expressions, the scripting languages are a good fit. If you're dropping into shell commands, scripts are nice.
I tend to use Ruby scripts whenever I'm writing something that's small, because it's quick to write, easy to maintain, and (with Gems) easy to bolt on additional functionality without needing to use JARs or anything. Your mileage will, of course, vary.
At the end of the day this is a question that only you can answer for yourself. Based on the fact that you said "I can't see a reason why I would ever choose to use shell scripting , ..." then it's probably the case that you should never choose it right now.
But if I were you I would pick a scripting language like python, ruby or perl and start trying to solve some of these small problems with this language. Over time you will start to get a feel for when it is more appropriate to write a quick script than build a full-blown solution.
I use scripting languages for writing programs which are not expected to be maintained beyond a few executions. Most of these languages are light on boilerplate syntax and have a REPL. Both of these features enable rapid prototyping.
Since you already know Java, you can try JVM languages like Groovy, JRuby, BeanShell etc. Scala has much lighter syntax than Java, has a REPL, is statically typed and runs on the JVM - you might give that a shot as well.

Which scripting language to support in an existing codebase?

I'm looking at adding scripting functionality to an existing codebase and am weighing up the pros/cons of various packages. Lua is probably the most obvious choice, but I was wondering if people have any other suggestions based on their experience.
Scripts will be triggered upon certain events and may stay resident for a period of time. For example upon startup a script may define several options which the program presents to the user as a number of buttons. Upon selecting one of these buttons the program will notify the script where further events may occur.
These are the only real requirements;
Must be a cross-platform library that is compilable from source
Scripts must be able to call registered code-side functions
Code must be able to call script-side functions
Be used within a C/C++ codebase.
Based on my own experience:
Python. IMHO this is a good choice. We have a pretty big code base with a lot of users and they like it a lot.
Ruby. There are some really nice apps such as Google Sketchup that use this. I wrote a Sketchup plugin and thought it was pretty nice.
Tcl. This is the old-school embeddable scripting language of choice, but it doesn't have a lot of momentum these days. It's high quality though, they use it on the Hubble Space Telescope!
Lua. I've only done baby stuff with it but IIRC it only has a floating point numeric type, so make sure that's not a problem for the data you will be working with.
We're lucky to be living in the golden age of scripting, so it's hard to make a bad choice if you choose from any of the popular ones.
I have played around a little bit with SpiderMonkey. It seems like it would at least be worth a look in your situation. I have heard good things about Lua as well. The big argument for using JavaScript as the scripting language is that a lot of developers know it already and would probably be more comfortable from the get-go, whereas Lua would most likely have a bit of a learning curve.
I'm not completely positive, but I think that SpiderMonkey meets your 4 requirements.
I've used Python extensively for this purpose and have never regretted it.
Lua has the most straightforward C API for binding into a code base that I've ever used. In fact, I usually just roll bindings for it by hand, whereas for other languages you often wouldn't consider doing so without a generator like SWIG. Also, it's typically faster and more lightweight than the alternatives, and coroutines are a very useful feature that few other languages provide.
AngelScript
lets you call standard C functions and C++ methods with no need for proxy functions. The application simply registers the functions, objects, and methods that the scripts should be able to work with and nothing more has to be done with your code. The same functions used by the application internally can also be used by the scripting engine, which eliminates the need to duplicate functionality.
For the script writer the scripting language follows the widely known syntax of C/C++ (with minor changes), but without the need to worry about pointers and memory leaks.
The original question described Tcl to a "T".
Tcl was designed from the beginning to be an embedded scripting language. It has evolved to be a first-class dynamic language in its own right but is still used all over the world as an embedded language. It is available under the BSD license, so it is just about as free as it gets. It also compiles on pretty much any modern platform, and many not-so-modern ones. And not only does it work on desktop systems, there are variations available for mobile platforms.
Tcl excels as a "glue" language, where you can write performance-intensive functions in C while still benefiting from the advantages of a scripting language for less performance critical parts of the application.
Tcl also comes with a first-class GUI toolkit (Tk) that is arguably one of the easiest cross-platform GUI toolkits available. It also interfaces very nicely with SQLite and other databases, and has had built-in support for Unicode for quite some time.
If the scripting interface will be made available to your customers (as opposed to simply enabling your own engineers to work at the scripting level), Tcl is extremely easy to learn as there are a total of only 12 rules that govern the entire language (as of tcl 8.6). In fact, Tcl shines as a way to invent domain specific languages which is often how it is used as an end-user scripting solution.
There were some excellent suggestions already, but I just wanted to mention that Perl can also be called / can call to C/C++.
You probably could use any modern scripting / bytecode language.
If you're willing to put up with the growing pains of a new product, you could use the Parrot VM, which has support for many, if not all, of the languages listed on this page. Unfortunately it's not done yet, but that hasn't stopped some people from using it in a production environment.
I think most people are probably mentioning the scripting language that they are most familiar with. From my perspective, Tcl was designed specifically to interface with C, so your problem domain is tailor-made for the language. However, I'm sure Python, Perl, or Lua would be fine. You should probably choose the language that is most familiar to your current team, since that will reduce the learning time.