Using multiple threads/cores for awk performance improvement - gawk

I have a directory with ~50k files. Each file has ~700000 lines. I have written an awk program to read each line and print only if there is an error. Everything is running perfectly fine, but the time taken is huge: ~4 days! Is there a way to reduce this time? Can we use multiple cores (processes)? Has anyone tried this before?

awk and gawk will not fix this for you by themselves. There is no magic "make it parallel" switch. You will need to rewrite to some degree:
shard by file - the simplest way to fix this is to run multiple awks in parallel, one per file (see the sketch after this list). You will need some sort of dispatch mechanism. Parallelize Bash script with maximum number of processes shows how you can write this yourself in shell. It will take more reading, but if you want more features, check out gearman or celery, which should be adaptable to your problem
better hardware - it sounds like you probably need a faster CPU to make this go faster, but it could also be an I/O issue. Having graphs of CPU and I/O from munin or some other monitoring system would help isolate which is the bottleneck in this case. Have you tried running this job on an SSD-based system? That is often an easy win these days.
caching - there are probably some duplicate lines or files. If there are enough duplicates, it would be helpful to cache the processing in some way. If you calculate the CRC/md5sum for a file and store it in a database, you can compute the md5sum of a new file and skip processing it if you have already seen that sum.
complete rewrite - scaling this with awk is going to get ridiculous at some point. Using some map-reduce framework might be a good idea.
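For the "shard by file" route, a minimal sketch of the dispatch step might look like this; it assumes a hypothetical check_errors.awk holding your program and a ./logs directory holding the files, so adjust both names to your layout:

    # Run one awk per file, at most 8 at a time, and collect all output.
    # (With -P, output lines from different files can interleave in errors.out.)
    find ./logs -type f -print0 |
        xargs -0 -P 8 -n 1 awk -f check_errors.awk > errors.out

    # Roughly equivalent with GNU parallel, which also keeps each file's
    # output from interleaving:
    # find ./logs -type f | parallel -j 8 awk -f check_errors.awk {} > errors.out

With ~50k files of ~700k lines each, the per-process startup cost should be negligible, so the speedup ought to track the number of cores until the disk becomes the limit.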

Related

Will running a script using grep / awk on a log file ever impact the Application writing to the log file?

I have used scripts to monitor and extract data from log files for years and never questioned the basic toolset that most people take for granted. In particular grep and awk are used by almost everyone in the community.
I found the current grep bugs (some dating back a few years):
http://savannah.gnu.org/bugs/?group=grep
And from the man pages for GNU grep 2.6.3:
Known Bugs
Large repetition counts in the {n,m} construct may cause grep to use lots of memory. In addition, certain other obscure regular expressions require exponential time and space, and may cause grep to run out of memory.
Back-references are very slow, and may require exponential time.
And the man pages for GNU Awk 3.1.7:
BUGS
The -F option is not necessary given the command line variable assignment feature; it remains only for backwards compatibility.
Syntactically invalid single character programs tend to overflow the parse stack, generating a rather unhelpful message. Such programs are surprisingly difficult to diagnose in the completely general case, and the effort to do so really is not worth it.
I was interested in the limitations, for example:
when using complex regexes,
with extremely large files that are not rotated,
with logs that are written to thousands of times per hundredth of a second.
Would it just be a case of monitoring the memory usage of the script to make sure it is not using massive amounts of memory?
Is it good practice to implement a timeout feature for scripts that might possibly take a long time to execute?
Are there other good standards and structures that people also use when building solutions using these tools?
I found an extremely helpful answer about the equivalent findstr command, which gave me a better understanding of scripting in a Windows environment:
What are the undocumented features and limitations of the Windows FINDSTR command?
The awk/grep commands both read the log file in read-only mode, so there is no risk of the log file being corrupted by simultaneous access from the application (write mode) and the awk/grep programs (read-only mode).
The awk/grep programs definitely consume CPU and memory, which can impact the application writing to the log file. This impact is similar to that of any other process using system resources; the grep/awk commands are no exception. Depending on what the grep/awk scripts are doing, they can consume a lot of CPU/RAM, and badly written code in any language can cause problems. As suggested in the comments, it is good practice to constrain the monitoring processes. ulimit and cgroups are the options available for constraining resources. Another good option is to use timeout, which kills the script if it takes longer than expected.
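As a rough, hypothetical illustration of that last point (the log path, pattern, and limits below are placeholders, not anything from the question):

    # Keep a monitoring one-liner from running away with resources.
    ulimit -v 1048576                  # cap this shell's virtual memory at ~1 GB
    nice -n 19 timeout 300 \
        awk '/ERROR/' /var/log/app.log > /tmp/app_errors.txt
    # timeout(1) kills the awk after 300 seconds; nice keeps it low priority.
    # cgroups (e.g. via systemd-run) can impose similar limits if you need more control.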

Backing up large file directory

I'm in need of suggestions for backing up a very large file directory (500+ GB) on a weekly basis. (In a LAMP environment)
I've read about rsync, but I'm not sure how to trigger that with CRON.
Also, does anyone happen to know how much the compression in rsync shrinks the file size? (Let's say for a 3 MB .jpeg.) I only ask because I can't tell how large a backup server I will need.
Just pointing me in the right direction will be enough, I don't expect anyone to do my homework for me. Thanks in advance for any help!
Here is a wiki page that answers much of your question.
I would read the whole page to grasp one concept at a time, but the "snapshot backup" is the rsync-script-to-beat-all-rsync-scripts: it does a Time Machine-like backup with differential storage going backwards in time, which is quite handy. This is great if you need chronologically-aware but minimally-sized backups.
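The core trick in that kind of script is rsync's --link-dest option; a hedged sketch (all paths below are placeholders, and the wiki's version is more complete) looks like this:

    SRC=/var/www/uploads
    DEST=/mnt/backup
    TODAY=$(date +%Y-%m-%d)

    # New snapshot; unchanged files become hard links into the previous one,
    # so each snapshot looks complete but only changed files take new space.
    rsync -a --delete --link-dest="$DEST/latest" "$SRC/" "$DEST/$TODAY/"

    # Re-point "latest" at the snapshot we just made.
    rm -f "$DEST/latest"
    ln -s "$DEST/$TODAY" "$DEST/latest"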
Arch (the distro this wiki covers) does a really nice thing where you can just drop your scripts into a known location; you will have to adapt that to calling a script as a cron job. Here is a fairly comprehensive introduction to cron.
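For the weekly trigger, a crontab entry along these lines is enough (the script path and log file are hypothetical):

    # crontab -e: run the backup script every Sunday at 03:00.
    0 3 * * 0  /usr/local/bin/snapshot-backup.sh >> /var/log/backup.log 2>&1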
I would also like to point out that rsync's compression operates on transmission not on storage. The file should be identical on your backup disk, but may take less bandwidth to transfer.
It's going to take some time regardless if it is that large - I would run a compression job through cron, and then a big ol' robocopy (Windows) or a robocopy equivalent on UNIX of the compressed files, also through cron.
You may want to look into a RAID arrangement (RAID 1, particularly) to deal with this giant amount of data, so that it doesn't have to be a "job" and is done implicitly. Whether you can implement that probably depends on your resources and your situation (write times will be much worse).
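If you do go that way, creating the mirror is short with mdadm; this is only a sketch, and /dev/sdb and /dev/sdc are placeholder devices, so double-check yours before running anything destructive:

    mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb /dev/sdc
    mkfs.ext4 /dev/md0
    mount /dev/md0 /mnt/mirror
    # Note: a mirror protects against disk failure, not against accidental
    # deletion, so it complements rather than replaces the rsync snapshots.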

Why doesn't Hadoop file system support random I/O?

Distributed file systems like the Google File System and Hadoop's HDFS don't support random I/O.
(A file that has already been written cannot be modified; only writing and appending are possible.)
Why did they design the file system like this?
What are the important advantages of this design?
P.S. I know Hadoop will support modifying data that has already been written, but they said its performance will not be very good. Why?
Hadoop distributes and replicates files. Since the files are replicated, any write operation is going to have to find each replicated section across the network and update the file. This will heavily increase the time for the operation. Updating the file could push it over the block size and require the file to be split into two blocks, and then the second block to be replicated. I don't know the internals or when/how it would split a block... but it's a potential complication.
What if a job that has already performed an update fails or gets killed and is re-run? It could update the file multiple times.
The advantage of not updating files in a distributed system is that when you update a file you don't know who else is using it, and you don't know where the pieces are stored. There are potential timeouts (the node with the block is unresponsive), so you might end up with mismatched data (again, I don't know the internals of Hadoop, and an update with a node down might be handled; it's just something I'm brainstorming).
There are a lot of potential issues (a few laid out above) with updating files on HDFS. None of them are insurmountable, but they would require a performance hit to check for and account for.
Since HDFS's main purpose is to store data for use in MapReduce, row-level updates aren't that important at this stage.
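You can see the write-once model reflected directly in the standard HDFS shell (the file names below are just placeholders):

    hdfs dfs -put access.log /logs/access.log           # write a file once
    hdfs dfs -appendToFile more.log /logs/access.log    # appending is allowed
    hdfs dfs -cat /logs/access.log | head               # reads are unrestricted
    # There is no seek-and-overwrite command; to "edit" a file you rewrite it.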
I think it's because of the block size of the data, and because the whole idea of Hadoop is that you don't move data around; instead, you move the algorithm to the data.
Hadoop is designed for non-realtime batch processing of data. If you're looking at ways of implementing something more like a traditional RDBMS in terms of response time and random access, have a look at HBase, which is built on top of Hadoop.

Does it consume CPU when reading a large file

Suppose I want to do the following operations on my 2-core machine:
Read a very large file
Compute
Does the file reading operation need to consume 1 core? Previously I just created 2 threads, one to read the file and one to compute. Should I create an additional thread to do the computing?
Thanks.
Edit
Thanks guys, yes, we should always consider whether the file I/O blocks the computing. Now let's just assume the file I/O never blocks the computing; you can think of the computation as not depending on the file's data, and we just read the file in for future processing. Given 2 cores, a file to read in, and computing to do, is the best solution to create 3 threads, 1 for file reading and 2 for computing, since, as most of you have pointed out, file reading consumes very little CPU?
It depends on how your hardware is configured. Normally, reading is not CPU-intensive, thanks to DMA. It may be very expensive though, if it initiates swap-out of other applications. But there is more to it.
Don't read a huge file at once if you can
If your file is really big and you don't need to read the whole of it at once, you should use mmap or sequential processing. Try to consume it in chunks if possible.
For example, to sum all the values in a huge file, you don't need to load the file into memory. You can process it in small chunks, accumulating the sum. Memory is an expensive resource in most situations.
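A streaming tool makes that concrete; for example, this one-line awk sketch (the file name is a placeholder) sums the first column without ever holding the file in memory:

    # awk reads one line at a time, so memory use stays flat
    # regardless of how big huge.txt is.
    awk '{ sum += $1 } END { print sum }' huge.txt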
Reading is sequential
Does the file reading operation need to consume 1 core?
Yes, I think most low-level read operations are implemented sequentially (consume 1 core).
You can avoid blocking on the read operation if you use asynchronous I/O, but it is just a variation of the same "read by small chunks" technique. You can launch several small asynchronous read operations at once, but you always have to check whether an operation has finished before you use the result.
See also this Stack Overflow answer to a related question.
Reading and computing in parallel
Previously I just created 2 threads, one to read the file and one to compute. Should I create an additional thread to do the computing?
It depends. If you need all the data to start the computation, then there is no reason to start the computation in parallel; it will effectively have to wait until reading is done.
If you can start computing even with partial data, you likely don't need to read the whole file at once. And it is usually much better not to do so with huge files.
What is your bottleneck — computation or IO?
Finally, you should know whether your task is computation-bound or input-output-bound. If it is limited by the performance of the input-output subsystem, there is little benefit in parallelizing the computation. If the computation is very CPU-intensive and reading time is negligible, you can benefit from parallelizing the computation. Input-output is usually the bottleneck unless you are doing some number-crunching.
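One crude but handy check (the command and file names below are placeholders): run the job under time and compare wall-clock time with CPU time.

    time ./compute < huge.txt
    # If "user" + "sys" is much smaller than "real", the process spent most
    # of its life waiting on I/O; if they are close to "real" (per core used),
    # the job is CPU-bound and parallelizing the computation can pay off.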
This is a good candidate for parallelization, because you have two types of operations here - disk I/O (for reading the file), and CPU load (for your computations). So the first step would be to write your application such that the file I/O wasn't blocking the computation. You could do this by reading a little bit at a time from the file and handing it off to the compute thread.
But now you're saying you have two cores that you want to utilize. Your second thought about parallelizing the CPU-intensive part is correct, because we can only parallelize compute tasks if we have more than one processor to use. But, it might be the case that the blocking part of your application is still the file I/O - that depends on a lot of factors, and the only way to tell what level of parallelization is appropriate is to benchmark.
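The shell-level analogue of that split looks roughly like this (the program and file names are hypothetical):

    # A plain pipeline already overlaps reading with computing: both
    # processes run concurrently and the pipe buffer hands data across.
    cat huge.dat | ./compute

    # To also use the second core for the CPU-heavy part, fan the work out,
    # e.g. one worker per half of the file:
    split -n 2 huge.dat chunk_
    for c in chunk_*; do ./compute "$c" & done
    wait

Whether the second worker actually helps is exactly the benchmarking question above.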
Obligatory SO caveat: multithreading is hard and error-prone, and it's better to have correct code than fast code if you can only pick one. But I don't advocate against threads, as you may find others on the site do.
I would think this depends on the computation you are performing. If you are doing very heavy computations, then I would suggest threading the application. Reading a file demands very little from your CPU, and because of this, the overhead created by threading the application might slow it down.
Another thing to consider is whether you need to load the entire file before you can compute; if so, there is no point in threading it at all, as you will have to complete one action before you can perform the other.

Take advantage of multiple cores executing SQL statements

I have a small application that reads XML files and inserts the information on a SQL DB.
There are ~300,000 files to import, each one with ~1,000 records.
I started the application on 20% of the files and it has been running for 18 hours now; I hope I can improve this time for the rest of the files.
I'm not using a multi-threaded approach, but since the computer I'm running the process on has 4 cores, I was thinking of doing so to get some performance improvement (although I guess the main problem is the I/O and not just the processing).
I was thinking of using the BeginExecuteNonQuery() method on the SqlCommand object I create for each insertion, but I don't know if I should limit the maximum number of simultaneous threads (nor do I know how to do it).
What's your advice to get the best CPU utilization?
Thanks
If I understand you correctly, you are reading those files on the same machine that runs the database. Although I don't know much about your machine, I bet that your bottleneck is disk IO. This doesn't sound terribly computation intensive to me.
Have you tried using SqlBulkCopy? Basically, you load your data into a DataTable instance, then use the SqlBulkCopy class to load it to SQL Server. Should offer a HUGE performance increase without as much change to your current process as using bcp or another utility.
Look into bulk insert.
Imports a data file into a database table or view in a user-specified format.
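If you go the command-line route mentioned above, a bcp invocation is roughly this (the table, file, and server names are placeholders):

    # -c = character mode, -t, = comma field terminator, -T = trusted (Windows) auth
    bcp MyDb.dbo.Records in records.csv -S myserver -T -c -t,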