File I/O in GNU Parallel - file-io

I have a program that takes a single argument. I am using GNU Parallel to perform parameter sweeps on this argument. Each run generates a single result, and I want to append all results into a single file, say Results.txt.
What would be a correct way to do this?
I should not have each instance open the file and write to it, as this could create conflicts and also mess up the order of results. The only way I can think of doing this is having each run generate its output in a file with a unique name, and then, when GNU Parallel finishes running, merge the results into a single file using a script.
Is there a simpler way of achieving this?
What happens when multiple instances write to/read from the same file? Does GNU Parallel create multiple copies, one for each instance, as it does for stdout and stderr?
Thanks

If your command sends the result to stdout (standard output), the solution is trivial:
seq 1000 | parallel echo > Results.txt
GNU Parallel guarantees the output will not be mixed.

Normally GNU Parallel prints the output of a job as soon as that job completes. When jobs take different amounts of time, their output can therefore arrive out of order relative to the input.
To keep the output in the same order as the input, simply add the -k / --keep-order option.
Try for example:
parallel -j4 sleep {}\; echo {} ::: 2 1 4 3
parallel -j4 -k sleep {}\; echo {} ::: 2 1 4 3
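Putting both pieces together for the question above, a minimal sketch (the program name ./myprog and the parameter values are illustrative): each job prints its result to stdout, -k keeps the results in the same order as the parameters, and the single redirect means only one process ever writes to Results.txt.
parallel -k ./myprog {} ::: 0.1 0.2 0.5 1.0 > Results.txt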

Related

How does snakemake handle possible corruptions due to a rule run in parallel simultaneously appending to a single file?

I would like to learn how snakemake handles the following situations, and what the best practice is to avoid collisions/corruption.
rule something:
    input:
        expand("/path/to/out-{asd}.txt", asd=LIST)
    output:
        "/path/to/merged.txt"
    shell:
        "cat {input} >> {output}"
With snakemake -j10 the command will try to append to the same file simultaneously, and I could not figure out whether this can lead to corruption or whether it is already handled.
Also, how are more complicated cases handled, e.g. where it is not just cat but the output of another process, computed from each input, that is appended to the same file? Is the best practice to first write to individual files and then cat them together?
rule get_merged_total_distinct:
    input:
        expand("{dataset_id}/merge_libraries/{tomerge}_merged_rmd.bam", dataset_id=config["dataset_id"], tomerge=list(TOMERGE.keys())),
    output:
        "{dataset_id}/merge_libraries/merged_total_distinct.csv"
    params:
        with_dups="{dataset_id}/merge_libraries/{tomerge}_merged.bam"
    shell:
        """
        RCT=$(samtools view -#4 -c -F1 -F4 -q 30 {params.with_dups})
        RCD=$(samtools view -#4 -c -F1 -F4 -q 30 {input})
        printf "{wildcards.tomerge},${{RCT}},${{RCD}}\n" >> {output}
        """
or cases where an external script is being called to print the result to a single output file?
input:
    expand("infile/{x}", ...)  # expanded as above
output:
    "results/all.txt"
shell:
    """
    bash script.sh {params.x} {input} {params.y} >> {output}
    """
With your example, the shell directive will expand to
cat /path/to/out-SAMPLE1.txt /path/to/out-SAMPLE2.txt [...] >> /path/to/merged.txt
where SAMPLE1, etc., come from LIST. In this case there is no collision, corruption, or race condition. One process will run that command just as if you had typed it into your shell, and all the inputs will get cat'ed to the output. Since snakemake is pull-based, once the output exists the rule will only run again if the inputs change, at which point the new inputs would be appended to the old contents because of >>. For that reason I would recommend using > instead, so the old contents are removed; rules should be deterministic where possible.
Now, if you had done something like
rule something:
    input:
        "/path/to/out-{asd}.txt"
    output:
        touch("/path/to/merged-{asd}.txt")
    params:
        output="/path/to/merged.txt"
    shell:
        "cat {input} >> {params.output}"
# then invoke
snakemake -j10 /path/to/merged-{a..z}.txt
Things get messier. Snakemake will launch the jobs, up to 10 at a time, and they will all append to the single merged.txt. Note that the merged file is now a parameter and we are targeting some dummy files. This behaves as if you had opened a separate shell for each job and executed the commands
cat /path/to/out-a.txt >> /path/to/merged.txt
# ...
cat /path/to/out-z.txt >> /path/to/merged.txt
all at once. The output will have a random order and lines may be interleaved or interrupted.
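A quick way to see the ordering problem outside snakemake is a hedged sketch like the following (file and job names are illustrative): several background shells append to one file, and the resulting order changes from run to run.
rm -f merged.txt
for i in 1 2 3 4 5; do
    ( sleep "0.$((RANDOM % 5))"; echo "output of job $i" >> merged.txt ) &   # each job appends independently
done
wait
cat merged.txt   # line order differs between runs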
As some guidance:
Try to make outputs deterministic. Given the same inputs, you should always produce the same outputs. If possible, set random seeds and enforce input ordering. In the second example, you have no idea what the output will be.
Don't use the append operator. This follows from the first point. If the output already exists and needs to be updated, start from scratch.
If you need to append a bunch of outputs, say log files, or to create a summary, do so in a separate rule. This again follows from the first point, but it's the only reason I can think of to use append; a sketch of that pattern follows.
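A minimal sketch of that separate-merge pattern in plain shell (outside snakemake; the program name process_one and the file names are illustrative): each input gets its own output file, and one deterministic merge step builds the summary with > rather than >>.
# Hypothetical per-sample step: one output file per input, no shared file.
for sample in a b c; do
    process_one "infile/${sample}" > "results/${sample}.txt" &
done
wait
# Separate, deterministic merge step: overwrite in a fixed order so reruns are reproducible.
cat results/a.txt results/b.txt results/c.txt > results/all.txt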
Hope that helps. Otherwise you can comment or edit with a more realistic example of what you are worried about.

Read all lines at the same time individually - Solaris ksh

I need some help with a script. Solaris 10 and ksh.
I have a file called /temp.list with this content:
192.168.0.1
192.168.0.2
192.168.0.3
So, I have a script which reads this list and executes some commands using each line's value:
FILE_TMP="/temp.list"
while IFS= read line
do
ping $line
done < "$FILE_TMP"
It works, but it executes the command on line 1, and only when that finishes does it move on to line 2, and so on until the end. I would like to find a way to execute the ping command for every line of the list at the same time. Is there a way to do it?
Thank you in advance!
Marcus Quintella
As Ari suggested, googling "ksh multithreading" will produce a lot of ideas/solutions.
A simple example:
FILE_TMP="/temp.list"
while IFS= read line
do
ping $line &
done < "$FILE_TMP"
The trailing '&' tells the shell to kick the ping command off in the background, allowing the loop to continue while the ping is still running.
'course, this is just the tip of the proverbial iceberg, as you now need to consider:
multiple ping commands are going to be dumping output to stdout (i.e., you're going to get a mish-mash of ping output in your console), so you'll need to give some thought to what to do with the multiple streams of output (e.g., redirect to a common file? redirect to separate files?)
you need to have some idea of how you want to manage and (possibly) terminate commands running in the background [see jobs, ps, fg, bg, kill]
if running in a shell script, you'll likely find yourself wanting to suspend the main script until all background jobs have completed [see wait]; a sketch combining these points follows
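A hedged sketch that combines these points (the output directory is illustrative): each ping writes to its own file so the streams don't mix, and wait suspends the script until every background ping has finished.
FILE_TMP="/temp.list"
OUT_DIR="/tmp/ping_out"          # illustrative location for per-host output files
mkdir -p "$OUT_DIR"
while IFS= read -r line
do
    ping "$line" > "$OUT_DIR/$line.out" 2>&1 &   # one background ping per host
done < "$FILE_TMP"
wait                                             # block until all background pings complete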

ksh loop putting results of file into variable

I'm not sure of the best way to handle this; I'm guessing it's with a while loop. I have a .txt file with a set of numbers (these numbers can change based on another script that runs):
ex:
0
36
41
53
60
Each number is on its own line. For each number I want to take that number and execute a script using it. So in this example I would call a script to stop database 0, after that completes call a script to stop database 36, and so on until all the numbers in the list are done.
1) Is a while loop the best way to handle this?
2) I'm having trouble determining what the [[ condition ]] needs to be to get each number one at a time; where can I find some additional help on this?
while [[ condition ]] ; do
command1
done
For testing purposes, the file that contains all the numbers is test.txt. The script that will execute is a Python script: "amgr.py stop (number from test.txt)".
Here's a simpler way:
cat test.txt | xargs -n1 amgr.py stop
this will take each line of your file and run amgr.py once per line, with the line as an extra parameter:
amgr.py stop 0
amgr.py stop 36
and so on..
I ended up using this method to give me the results I was looking for.
while read -r line ; do
    amgr.py stop "$line"
done < test.txt

alternative to tail -F

I am monitoring a log file by doing "tail -n 0 -F filename". But this is taking up a lot of CPU, as there are many messages being written to the logfile. Is there a way I can open the file, read the new entries, close it, and repeat at a 5-second interval, so that I don't need to keep following the file? How can I remember the last read line so I can start from the next one in the next run? I am trying to do this in nawk by spawning a tail shell command.
You won't be able to magically use fewer resources to tail a file by writing your own implementation. If tail -f is using resources because the file is growing fast, a custom version won't help if you still want to view all lines as they are being written. You are simply limited by your hardware's I/O and/or CPU.
Try using --sleep-interval=S where "S" is a number of seconds (the default is 1.0 - you can specify decimals).
tail -n 0 --sleep-interval=.5 -F filename
If you have so many log entries that tail is bogging down the CPU, how are you able to monitor them?
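The question also asks how to remember the last read position between runs. A minimal sketch of that idea in plain shell (the log and state file paths are illustrative), assuming the log is only ever appended to:
LOG=filename                  # the log being monitored
STATE=/tmp/lastpos            # remembers how many bytes were already read
last=$(cat "$STATE" 2>/dev/null || echo 0)
size=$(wc -c < "$LOG")
if [ "$size" -gt "$last" ]; then
    tail -c +"$((last + 1))" "$LOG"   # print only the bytes appended since the last run
fi
echo "$size" > "$STATE"
# run this from cron, or in a loop with 'sleep 5' between iterations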

Nano hacks: most useful tiny programs you've coded or come across

It's the first great virtue of programmers. All of us have, at one time or another, automated a task with a bit of throw-away code. Sometimes it takes a couple of seconds tapping out a one-liner; sometimes we spend an exorbitant amount of time automating away a two-second task and then never use it again.
What tiny hack have you found useful enough to reuse? To go so far as to make an alias for?
Note: before answering, please check to make sure it's not already on favourite command-line tricks using BASH or perl/ruby one-liner questions.
I found this on dotfiles.org just today. It's very simple, but clever. I felt stupid for not having thought of it myself.
###
### Handy Extract Program
###
extract () {
    if [ -f "$1" ] ; then
        case "$1" in
            *.tar.bz2) tar xvjf "$1" ;;
            *.tar.gz)  tar xvzf "$1" ;;
            *.bz2)     bunzip2 "$1" ;;
            *.rar)     unrar x "$1" ;;
            *.gz)      gunzip "$1" ;;
            *.tar)     tar xvf "$1" ;;
            *.tbz2)    tar xvjf "$1" ;;
            *.tgz)     tar xvzf "$1" ;;
            *.zip)     unzip "$1" ;;
            *.Z)       uncompress "$1" ;;
            *.7z)      7z x "$1" ;;
            *)         echo "'$1' cannot be extracted via >extract<" ;;
        esac
    else
        echo "'$1' is not a valid file"
    fi
}
Here's a filter that puts commas in the middle of any large numbers in standard input.
$ cat ~/bin/comma
#!/usr/bin/perl -p
s/(\d{4,})/commify($1)/ge;
sub commify {
    local $_ = shift;
    1 while s/^([ -+]?\d+)(\d{3})/$1,$2/;
    return $_;
}
I usually wind up using it for long output lists of big numbers, and I tire of counting decimal places. Now instead of seeing
-rw-r--r-- 1 alester alester 2244487404 Oct 6 15:38 listdetail.sql
I can run that as ls -l | comma and see
-rw-r--r-- 1 alester alester 2,244,487,404 Oct 6 15:38 listdetail.sql
This script saved my career!
Quite a few years ago, I was working remotely on a client database. I updated a shipment to change its status. But I forgot the where clause.
I'll never forget the feeling in the pit of my stomach when I saw (6834 rows affected). I basically spent the entire night going through event logs and figuring out the proper status on all those shipments. Crap!
So I wrote a script (originally in awk) that would start a transaction for any updates, and check the rows affected before committing. This prevented any surprises.
So now I never do updates from command line without going through a script like this. Here it is (now in Python):
import sys
import subprocess as sp

pgm = "isql"
if len(sys.argv) == 1:
    print "Usage: \nsql sql-string [rows-affected]"
    sys.exit()
sql_str = sys.argv[1].upper()
max_rows_affected = 3
if len(sys.argv) > 2:
    max_rows_affected = int(sys.argv[2])
if sql_str.startswith("UPDATE"):
    sql_str = "BEGIN TRANSACTION\\n" + sql_str
    p1 = sp.Popen([pgm, sql_str], stdout=sp.PIPE,
                  shell=True)
    (stdout, stderr) = p1.communicate()
    print stdout
    # example -> (33 rows affected)
    affected = stdout.splitlines()[-1]
    affected = affected.split()[0].lstrip('(')
    num_affected = int(affected)
    if num_affected > max_rows_affected:
        print "WARNING! ", num_affected, "rows were affected, rolling back..."
        sql_str = "ROLLBACK TRANSACTION"
        ret_code = sp.call([pgm, sql_str], shell=True)
    else:
        sql_str = "COMMIT TRANSACTION"
        ret_code = sp.call([pgm, sql_str], shell=True)
else:
    ret_code = sp.call([pgm, sql_str], shell=True)
I use this script under assorted Linuxes to check whether a directory copy between machines (or to CD/DVD) worked or whether copying (e.g. ext3 utf8 filenames -> fuseblk) has mangled special characters in the filenames.
#!/bin/bash
## dsum Do checksums recursively over a directory.
## Typical usage: dsum <directory> > outfile
export LC_ALL=C # Optional - use sort order across different locales
if [ $# != 1 ]; then echo "Usage: ${0/*\//} <directory>" 1>&2; exit; fi
cd $1 1>&2 || exit
#findargs=-follow # Uncomment to follow symbolic links
find . $findargs -type f | sort | xargs -d'\n' cksum
Sorry, I don't have the exact code handy, but I coded a regular expression for searching source code in VS.Net that allowed me to search anything not in comments. It came in very useful on a particular project I was working on, where people insisted that commenting out code was good practice, in case you wanted to go back and see what the code used to do.
I have two Ruby scripts that I modify regularly to download all of various webcomics. Extremely handy! Note: they require wget, so probably Linux. Note 2: read these before you try them; they need a little bit of modification for each site.
Date based downloader:
#!/usr/bin/ruby -w
Day = 60 * 60 * 24
Fromat = "hjlsdahjsd/comics/st%Y%m%d.gif"
t = Time.local(2005, 2, 5)
MWF = [1,3,5]
until t == Time.local(2007, 7, 9)
  if MWF.include? t.wday
    `wget #{t.strftime(Fromat)}`
    sleep 3
  end
  t += Day
end
Or you can use the number based one:
#!/usr/bin/ruby -w
Fromat = "http://fdsafdsa/comics/%08d.gif"
1.upto(986) do |i|
  `wget #{sprintf(Fromat, i)}`
  sleep 1
end
Instead of having to repeatedly open files in SQL Query Analyser and run them, I found the syntax needed to make a batch file, and could then run 100 at once. Oh the sweet sweet joy! I've used this ever since.
isqlw -S servername -d dbname -E -i F:\blah\whatever.sql -o F:\results.txt
This goes back to my COBOL days but I had two generic COBOL programs, one batch and one online (mainframe folks will know what these are). They were shells of a program that could take any set of parameters and/or files and be run, batch or executed in an IMS test region. I had them set up so that depending on the parameters I could access files, databases(DB2 or IMS DB) and or just manipulate working storage or whatever.
It was great because I could test that date function without guessing, or test why there was truncation, or why there was a database ABEND. The programs grew in size as time went on to include all sorts of tests and became a staple of the development group. Everyone knew where the code resided and included it in their unit testing as well. Those programs got so large (most of the code was commented-out tests), and it was all contributed by people through the years. They saved so much time and settled so many disagreements!
I coded a Perl script to map dependencies, without going into an endless loop, for a legacy C program I inherited that also had a diamond dependency problem.
I wrote a small program that e-mailed me when I received e-mails from friends on a rarely used e-mail account.
I wrote another small program that sent me text messages if my home IP changes.
To name a few.
Years ago I built a suite of applications on a custom web application platform in Perl.
One cool feature was to convert SQL query strings into human readable sentences that described what the results were.
The code was relatively short but the end effect was nice.
I've got a little app that you run and it dumps a GUID into the clipboard. You can run it with /noui or not. With the UI, it's a single button that drops a new GUID every time you click it; without it, it drops a new one and then exits.
I mostly use it from within VS. I have it set up as an external tool mapped to a shortcut. I'm writing an app that relies heavily on XAML and GUIDs, so I always find I need to paste a new GUID into the XAML...
Any time I write a clever list comprehension or use of map/reduce in python. There was one like this:
if reduce(lambda x, c: locks[x] and c, locknames, True):
print "Sub-threads terminated!"
The reason I remember that is that I came up with it myself, then saw the exact same code on somebody else's website. Nowadays it'd probably be done like:
if all(map(lambda z: locks[z], locknames)):
print "ya trik"
I've got 20 or 30 of these things lying around, because once I coded up the framework for my standard console app in Windows I could pretty much drop in any logic I wanted, so I have a lot of these little things that solve specific problems.
I guess the ones I'm using a lot right now are a console app that takes stdin and colorizes the output based on XML profiles that match regular expressions to colors (I use it for watching my log files from builds), and a command-line launcher so I don't pollute my PATH env var, which would exceed the limit on some systems anyway, namely Win2K.
I'm constantly connecting to various Linux servers from my own desktop throughout my workday, so I created a few aliases that will launch an xterm on those machines and set the title, background color, and other tweaks:
alias x="xterm" # local
alias xd="ssh -Xf me@development_host xterm -bg aliceblue -ls -sb -bc -geometry 100x30 -title Development"
alias xp="ssh -Xf me@production_host xterm -bg thistle1 ..."
I have a bunch of servers I frequently connect to, as well, but they're all on my local network. This Ruby script prints out the command to create aliases for any machine with ssh open:
#!/usr/bin/env ruby
require 'rubygems'
require 'dnssd'
handle = DNSSD.browse('_ssh._tcp') do |reply|
print "alias #{reply.name}='ssh #{reply.name}.#{reply.domain}';"
end
sleep 1
handle.stop
Use it like this in your .bash_profile:
eval `ruby ~/.alias_shares`