Will running a script using grep / awk on a log file ever impact the Application writing to the log file? - awk

I have used scripts to monitor and extract data from log files for years and never questioned the basic toolset that most people take for granted. In particular, grep and awk are used by almost everyone in the community.
I found the currently open grep bugs (some dating back a few years):
http://savannah.gnu.org/bugs/?group=grep
And from the man pages for GNU grep 2.6.3:
Known Bugs
Large repetition counts in the {n,m} construct may cause grep to use lots of memory. In addition, certain other obscure regular expressions require exponential time and space, and may cause grep to run out of memory.
Back-references are very slow, and may require exponential time.
And from the man pages for GNU Awk 3.1.7:
BUGS
The -F option is not necessary given the command line variable assignment feature; it remains only for backwards compatibility.
Syntactically invalid single character programs tend to overflow the parse stack, generating a rather unhelpful message. Such programs are surprisingly difficult to diagnose in the completely general case, and the effort to do so really is not worth it.
I was interested in the limitations, for example:
when using complex regexes,
extremely large files that are not rotated,
logs that are written to thousands of times within hundredths of a second.
Would it just be a case of monitoring the memory usage of the script to make sure it is not using massive amounts of memory?
Is it good practice to implement a timeout feature for scripts that might possibly take a long time to execute?
Are there other good standards and structures that people also use when building solutions using these tools?
I found this extremely helpful answer on the equivalent findstr command, which gave me a better understanding of scripting in a Windows environment:
What are the undocumented features and limitations of the Windows FINDSTR command?

The awk/grep commands both open the log file in read-only mode, so there is no risk of the log file being corrupted by simultaneous access from the application (write mode) and the awk/grep programs (read-only mode).
There is definitely CPU and memory usage by the awk/grep programs, which can impact the application writing to the log file. This impact is the same as that of any other process using system resources; grep and awk are no exceptions. Depending on what the grep/awk scripts are doing, they can consume a lot of CPU and RAM, and badly written code in any language can cause problems. As suggested in the comments, it is good practice to constrain the monitoring processes. ulimit and cgroups are the options available for constraining resources. Another good option is timeout, which kills the script if it takes longer than expected.
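As a sketch of that constraining approach, the following assumes a hypothetical log at /var/log/app.log and a hypothetical pattern; timeout, nice and ionice are standard Linux tools, and the limits shown are illustrative only.

    #!/bin/sh
    # Minimal sketch: constrain a log-scanning job so it cannot starve
    # the application that writes the log. Path, pattern and limits are
    # hypothetical placeholders.
    LOG=/var/log/app.log
    PATTERN='ERROR [0-9]+'

    # Cap the memory available to this shell and its children (in KB),
    # so a pathological regex cannot exhaust the RAM.
    ulimit -v 262144

    # Run at low CPU and I/O priority, and kill the scan if it runs for
    # more than 60 seconds.
    timeout 60 nice -n 19 ionice -c 3 grep -E "$PATTERN" "$LOG" > /tmp/errors.txt

If even that is too heavy, cgroups can put a hard ceiling on CPU and memory for the whole monitoring job rather than a single command.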

Related

Using multiple threads/cores for awk performance improvement

I have a directory with ~50k files. Each file has ~700000 lines. I have written an awk program to read each line and print only if there is an error. Everything is running perfectly fine, but the time taken is huge: around 4 days! Is there a way to reduce this time? Can we use multiple cores (processes)? Has anyone tried this before?
awk and gawk will not fix this for you by themselves. There is no magic "make it parallel" switch. You will need to rewrite to some degree:
shard by file - the simplest fix is to run multiple awks in parallel, one per file (a minimal sketch is shown after this list). You will need some sort of dispatch mechanism; Parallelize Bash script with maximum number of processes shows how you can write this yourself in shell. It will take more reading, but if you want more features, check out gearman or celery, which should be adaptable to your problem
better hardware - it sounds like you probably need a faster CPU to make this go faster, but it could also be an I/O issue. Having graphs of CPU and I/O from munin or some other monitoring system would help isolate which is the bottleneck. Have you tried running this job on an SSD-based system? That is often an easy win these days.
caching - there are probably some duplicate lines or files. If there are enough duplicates, it would be helpful to cache the processing in some way: if you calculate the CRC/md5sum of each file and store it in a database, you can compute the md5sum of a new file and skip processing if you have already handled it (a second sketch after this list shows one way to do this).
complete rewrite - scaling this with awk is going to get ridiculous at some point. Using some map-reduce framework might be a good idea.
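As a sketch of the "shard by file" option, assuming the logs live in a hypothetical /data/logs directory, results go to a hypothetical /data/errors directory, and the awk program is simply looking for lines containing "error":

    #!/bin/sh
    # One awk process per input file, at most one worker per CPU core.
    # LOGDIR, OUTDIR and the /error/ pattern are hypothetical placeholders.
    LOGDIR=/data/logs
    OUTDIR=/data/errors
    export OUTDIR
    mkdir -p "$OUTDIR"

    find "$LOGDIR" -type f -print0 |
      xargs -0 -P "$(nproc)" -n 1 sh -c '
        awk "/error/ { print }" "$1" > "$OUTDIR/$(basename "$1").err"
      ' sh

Because each input file gets its own output file, the workers never interleave their output; you can concatenate the .err files afterwards if you want a single report.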
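And a sketch of the caching idea, using a flat file of md5sums as a stand-in for the database; the paths are again hypothetical:

    #!/bin/sh
    # Skip any file whose checksum has already been processed.
    SEEN=/data/processed.md5      # hypothetical list of checksums already handled
    touch "$SEEN"

    for f in /data/logs/*; do
      sum=$(md5sum "$f" | cut -d' ' -f1)
      grep -qF "$sum" "$SEEN" && continue   # identical file already processed
      awk '/error/ { print FILENAME ": " $0 }' "$f" >> /data/errors.txt
      echo "$sum" >> "$SEEN"
    done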

Backing up large file directory

I'm in need of suggestions for backing up a very large file directory (500+ GB) on a weekly basis. (In a LAMP environment)
I've read about rsync, but I'm not sure how to trigger that with CRON.
Also, does anyone happen to know how much rsync's compression shrinks the file size? (Let's say for a 3 MB .jpeg.) I only ask because I can't tell how large a backup server I will need.
Just pointing me in the right direction will be enough, I don't expect anyone to do my homework for me. Thanks in advance for any help!
Here is a wiki page that answers much of your question.
I would read the whole page to grasp one concept at a time, but the "snapshot backup" is the rsync-script-to-beat-all-rsync-scripts: it does a TimeMachine-like backup with differential storage going backwards in time, which is quite handy. This is great if you need chronologically-aware but minimally-sized backups.
Arch (the distro this wiki covers) does a really nice thing where you can just drop your scripts into a known location; you will have to adapt that to calling a script as a cron job. Here is a fairly comprehensive introduction to cron.
I would also like to point out that rsync's compression operates on transmission, not on storage. The file should be identical on your backup disk, but may take less bandwidth to transfer.
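To give a concrete starting point, here is a minimal sketch of a weekly snapshot script in the spirit of the wiki's snapshot backup, assuming the data lives in /var/www and the backups go to /backup (both paths hypothetical).

    #!/bin/sh
    # Weekly rsync snapshot using hard links against the previous run.
    SRC=/var/www                 # hypothetical source directory
    DEST=/backup                 # hypothetical backup location
    STAMP=$(date +%Y-%m-%d)

    # -a preserves permissions and times, -z compresses the transfer only
    # (the stored copy stays full size, as noted above), and --link-dest
    # hard-links unchanged files against the previous snapshot, so each
    # week only costs the space of the changed files.
    rsync -az --delete --link-dest="$DEST/latest" "$SRC/" "$DEST/$STAMP/"
    ln -sfn "$DEST/$STAMP" "$DEST/latest"

To trigger it from cron, save the script somewhere like /usr/local/bin/weekly-backup.sh (hypothetical path), make it executable, and add a crontab entry such as "0 2 * * 0 /usr/local/bin/weekly-backup.sh" to run it every Sunday at 02:00.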
If the directory is that large, it's going to take some time regardless. I would run a compression job through cron, and then copy the compressed files with robocopy (Windows) or a robocopy equivalent on UNIX, also through cron.
You may want to look into a RAID arrangement (RAID 1, in particular) to deal with this giant amount of data, so the backup doesn't have to be a "job" and is done implicitly. But whether you can implement that probably depends on your resources and your situation (you will have worse write times... much worse).

Improve MPI program

I wrote an MPI program that seems to run OK, but I wonder about performance. The master thread needs to call MPI_Send 10 or more times, and the worker receives data 10 or more times and sends it. I wonder if this carries a performance penalty, and whether I could transfer everything in a single struct, or which other techniques I could benefit from.
Another general question: once an MPI program works more or less, what are the best optimization techniques?
It's usually the case that sending 1 large message is faster than sending 10 small messages. The time cost of sending a message is well modelled by considering a latency (how long it would take to send an empty message, which is non-zero because of the overhead of function calls, network latency, etc.) and a bandwidth (how much longer it takes to send an extra byte given that the network communication has already started). By bundling messages up into one message, you only incur the latency cost once, and this is often a win (although it's always possible to come up with cases where it isn't). The best way to know for any particular code is simply to try. Note that MPI datatypes give you very powerful ways to describe the layout of your data in memory, so that you can take it almost directly from memory to the network without having to do an intermediate copy into some buffer (so-called "marshalling" of the data).
As to more general optimization questions about MPI -- without knowing more, all we can do is give you advice which is so general as to not be very useful. Minimize the amount of communications which need to be done; wherever possible, use built-in MPI tools (collectives, etc) rather than implementing your own.
One way to fully understand the performance of your MPI application is to run it within the SimGrid platform simulator. The tooling and models provided are sufficient to get realistic timing predictions for mid-range applications (say, a few tens of thousands of lines of C or Fortran), and it can be combined with suitable visualization tools that help you fully understand what is going on in your application and the actual performance trade-offs you have to consider.
For a demo, please refer to this screencast: https://www.youtube.com/watch?v=NOxFOR_t3xI

MPI-2 file format options

I am trying to speed up my file I/O using MPI-2, but there doesn't appear to be any way to read/write formatted files. Many of my I/O files are formatted for ease of pre- and post-processing.
Any suggestions for an MPI-2 solution for formatted I/O?
The usual answer to using MPI-IO while generating some sort of portable, sensible file format is to use HDF5 or NetCDF4. There's a real learning curve to both (but also lots of tutorials out there), but the result is that you have portable, self-describing files that there are a zillion tools for accessing, manipulating, etc.
If by "formatted" output you mean plain human-readable text, then as someone who does a lot of this stuff, I wouldn't be doing my job if I didn't urge you to start moving away from that approach. We all by and large start that way, dumping plain text so we can quickly see what's going on; but it's just not a good approach for production runs. The files are bloated, the I/O is way slower (I routinely see a 6x slowdown using ASCII versus binary, partly because you're writing out small chunks at a time and partly because of the string conversions), and for what? If there's so little data being output that you actually can feasibly read and understand it, you don't need parallel I/O; if there are so many numbers that you can't plausibly flip through them all and understand what's going on, then what's the point?

How to prove that code isn't broken, but the hardware is?

I'm sure this happens everywhere. You can 'feel' that the network is slow, or the machine is slow, or something. But the server/chassis logs are not showing anything, so IT doesn't believe you. What do you do?
Your regressions are taking twice the time ... but that's not enough.
Okay, you transfer 100 GB using dd, etc. ... but that's not enough.
Okay, you get the server placed in a different chassis for two weeks and it works fine ... but that's not enough...
So HOW do you get IT to replace the chassis?
More specifically:
Is there any suite I can run on two setups (supposed to be identical) that can show up differences in network/CPU/disk access, and which IT will believe?
Computers don't age and slow down the same way we do. If your server is getting slower -- actually slower, not just feeling slower because every other computer you use is getting faster -- then there is a reason, and it is possible that you may be able to fix it. I'd try cleaning up some disk space, defragmenting the disk, and checking what other processes are running (perhaps someone's added more apps to the system and you're just not getting as many cycles).
If your app uses a database, you may want to analyze your query performance and see if some indices are in order. Queries that perform well when you have little data can start taking a long time as the amount of data grows if they have to use table scans. As a former "IT" guy, I'd also be reluctant to throw hardware at a problem because someone tells me the system is slowing down. I'd want to know what has changed and see if I could get the system running the way it should be. If the app has simply outgrown the hardware -- after you've made suitable optimizations -- then upgrading is a reasonable choice.
Run a standard benchmark suite. See if it pinpoints memory, cpu, bus or disk, when compared to a "working" similar computer.
See http://en.wikipedia.org/wiki/Benchmark_(computing)#Common_benchmarks for some tips.
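If you want something quick before reaching for a full benchmark suite, a rough sketch like the following, run on both the "good" machine and the suspect one, can already show a disk or network gap. goodhost is a placeholder for the healthy machine, and hdparm and iperf3 need to be installed.

    #!/bin/sh
    # Quick disk and network comparison; run the same commands on both machines.

    # Sequential disk write, forcing data to disk before timing stops.
    dd if=/dev/zero of=/tmp/ddtest bs=1M count=1024 conv=fdatasync
    rm -f /tmp/ddtest

    # Raw sequential read speed of the first disk (needs root).
    hdparm -t /dev/sda

    # Network throughput: start "iperf3 -s" on the other machine first,
    # then point the client at it.
    iperf3 -c goodhost -t 30

Numbers that differ consistently and by a wide margin between the two supposedly identical setups are the kind of evidence IT is more likely to accept.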
The only way to prove something is to do a stringent audit.
Now traditionally, we should keep the system constant between two different sets while altering only the variable we are interested in. In this case the variable is the hardware that your code is running on. So in simple terms, you should audit the running of your software on two different sets of hardware, one being the hardware you are unhappy about, and see the difference.
Now if you are to do this properly, which I am sure you are, you will first need to come up with a null hypothesis, something like:
"The slowness of the application is
unrelated to the specific hardware we
are using"
And now you set about disproving that hypothesis in favour of an alternative hypothesis. Once you have collected enough results, you can apply statistical analyses on them, to decide whether any differences are statistically significant. There are analyses to find out how much data you need, and then compare the two sets to decide if the differences are random, or not random (which would disprove your null hypothesis). The type of tests you do will mostly depend on your data, but clever people have made checklists to help us decide.
It sounds like your main problem is getting IT to listen to you, and raw technical data may not be persuasive to the right people. Getting backing from the business may help you, and that means talking about money.
Luckily, both platforms already contain a common piece of software - the application itself - designed to make or save money for someone. Why not measure how quickly it can do that e.g. how long does it take to process an order?
By measuring how long your application spends on each sub-task or data source, you can get a rough idea of which piece of the underlying hardware is underperforming. Writing to a local database or handling a data structure larger than RAM will impact the disk, making network calls will impact the network hardware, and CPU-bound calculations will show up in CPU usage.
This data will never be as precise as a benchmark, and it may require expensive coding, but it's easier to translate what it finds into money terms. Log4j's NDC and MDC features, and Spring's AOP, might be good enabling tools for you.
Run perfmon.msc from Start / Run in Windows 2000 through Vista, then just add counters for CPU, disk, etc.
For SQL queries, you should capture the actual queries and then run them manually to see if they are slow.
For instance, if using SQL Server, run the profiler from Tools, SQL Server Profiler. Then perform some operations in your program and look at the capture for any suspicious database calls. Copy and paste one of the queries into a new query window in Management Studio and run it.
For networking you should try artificially limiting your network speed to see how it affects your code (e.g. Traffic Shaper XP is a simple freeware limiter).