I just finished my bowtie2 alignment jobs, run via Snakemake.
But as you know, bowtie2 writes an alignment summary like this:
23774776 reads; of these:
23774776 (100.00%) were paired; of these:
5928889 (24.94%) aligned concordantly 0 times
17845887 (75.06%) aligned concordantly exactly 1 time
0 (0.00%) aligned concordantly >1 times
----
5928889 pairs aligned concordantly 0 times; of these:
1214536 (20.49%) aligned discordantly 1 time
----
4714353 pairs aligned 0 times concordantly or discordantly; of these:
9428706 mates make up the pairs; of these:
6563535 (69.61%) aligned 0 times
2843810 (30.16%) aligned exactly 1 time
21361 (0.23%) aligned >1 times
86.20% overall alignment rate
This summary was written to the following files:
snakejob.align.601.sh.e6589895
snakejob.align.602.sh.e6591632
snakejob.align.603.sh.e6591988
snakejob.align.604.sh.e6591623
snakejob.align.605.sh.e6591927
snakejob.align.606.sh.e6591628
snakejob.align.607.sh.e6590473
snakejob.align.608.sh.e6591280
snakejob.align.609.sh.e6590190
snakejob.align.610.sh.e6590903
There was no sample name in the summary. I think the snakejob id (6**) may be related to the sample name.
I have checked the files in the hidden folder .snakemake/metadata; the content of each file looks like:
{"rule": "PE", "shellcmd": "/soft/samtools/samtools view -bF 12 /home/RAD/01align/out/R40.bam > /home/RAD/01align/out/R40.PE.bam && echo '3 done'", "params": [], "version": null, "incomplete": false, "input": ["/home/RAD/01align/out/R40.bam"], "code": "gAMoQxR0AABkAQBkAgB8CgCDAQEBZAAAU3EAKFgFAAAAaW5wdXRxAVgGAAAAb3V0cHV0cQJYBgAAAHBhcmFtc3EDWAkAAAB3aWxkY2FyZHNxBFgHAAAAdGhyZWFkc3EFWAkAAAByZXNvdXJjZXNxBlgDAAAAbG9ncQdYBwAAAHZlcnNpb25xCFgEAAAAcnVsZXEJWAkAAABjb25kYV9lbnZxClgMAAAAYmVuY2hfcmVjb3JkcQt0cQxdcQ0oTlhYAAAAL25mcy9iaW9zb2Z0L3NhbXRvb2xzL3NhbXRvb2xzIHZpZXcgLWJGIDEyIHtpbnB1dC5iYW19ID4ge291dHB1dC5QRWJhbX0gJiYgZWNobyAnMyBkb25lJ3EOaAtlWAUAAABzaGVsbHEPhXEQdHERLg==", "log": []}
The code section may contain some information that would help me recover the sample name, but I don't know how to decode this code value.
I hope someone could help me out.
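For what it's worth, the code value looks like base64-encoded serialized Python, and you can at least peek at any readable strings buried inside such a blob. This is a generic sketch, not a parser for Snakemake's format, and the payload below is a made-up stand-in rather than the real metadata blob:

```python
import base64
import re

def extract_strings(b64_value, min_len=6):
    """Base64-decode a blob and pull out printable-ASCII runs of min_len or more."""
    raw = base64.b64decode(b64_value)
    return [m.decode() for m in re.findall(rb"[ -~]{%d,}" % min_len, raw)]

# Hypothetical payload (NOT the real metadata blob): serialized bytes with a
# shell command embedded among non-printable framing bytes.
blob = base64.b64encode(b"\x80\x03X\x1e\x00\x00\x00samtools view -bF 12 R40.bam\x00")
print(extract_strings(blob))  # → ['samtools view -bF 12 R40.bam']
```

Embedded paths and shell-command templates, if present, show up as plain strings in the output.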
The portable solution to this is to specify a log file for the rule. See the docs. Also have a look at the best practice workflow(s) from the Snakemake workflows project.
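A minimal sketch of what that looks like (hypothetical rule name and paths; the point is that the log path contains the sample wildcard, so each summary file is unambiguously named, and bowtie2 writes its summary to stderr):

```
rule align:
    input:
        "reads/{sample}_1.fq.gz",
        "reads/{sample}_2.fq.gz"
    output:
        "out/{sample}.bam"
    log:
        "logs/align/{sample}.log"
    shell:
        "bowtie2 -x index -1 {input[0]} -2 {input[1]} 2> {log} "
        "| samtools view -b - > {output}"
```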
Related
I have a text file where the first few lines are text and the remaining lines contain data in the form of real numbers. I need only the array of real numbers, stored in a new file. I can read the total number of lines in the file correctly, but I cannot work out how to read the real numbers starting from a particular line number.
Below is a part of the file. I also have many files like this to read.
AptitudeQT paperI: 12233105
Latitude : 30.00 S
Longitude: 46.45 E
Attemptone Time: 2017-03-30-09-03
End Time: 2017-03-30-14-55
Height(agl): m
Pressure: hPa
Temperature: deg C
Humidity: %RH
Uvelocity: cm/s
Vvelocity: cm/s
WindSpeed: cm/s
WindDirection: deg
---------------------------------------
10 1008.383 27.655 62.200 -718.801 -45.665 720.250 266.500
20 1007.175 27.407 62.950 -792.284 -18.481 792.500 268.800
There are many examples of how to skip/read lines like this.
But to sum it up, option A is to skip the header and read only the data:
! Skip first 17 lines
do i = 1, 17
   read (unit,*,IOSTAT=stat) ! Dummy read
   if ( stat /= 0 ) stop "error"
end do

! Read data
do i = 1, 1000
   read (unit,*,IOSTAT=stat) data(:,i)
   if ( stat /= 0 ) stop "error"
end do
If you have many files like this, I suggest wrapping this in a subroutine/function.
Option B is to use the unix tail utility to discard the header (more info here):
tail -n +18 file.txt
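If a small Python preprocessing script is acceptable, the same skip-and-parse idea takes only a few lines (a sketch; the 17-line header length is carried over from the Fortran example above and should be adjusted to your files):

```python
def read_numeric_rows(path, skip=17):
    """Skip a fixed-length header, then parse each remaining line as floats."""
    rows = []
    with open(path) as fh:
        for lineno, line in enumerate(fh):
            if lineno < skip:
                continue  # header line: discard
            rows.append([float(x) for x in line.split()])
    return rows
```

Each data row comes back as a list of floats, so the first data line above would become [10.0, 1008.383, 27.655, ...].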
I'm trying to take N files, which, incidentally, are all syslog log files, and interlace them based on the timestamp which is the first part of the line. I can do this naively but I fear that my approach will not scale well with any more than just a handful of these files.
So let's say I just have two files, 1.log and 2.log. 1.log looks like this:
2016-04-06T21:13:23.655446+00:00 foo 1
2016-04-06T21:13:24.384521+00:00 bar 1
and 2.log looks like this:
2016-04-06T21:13:24.372946+00:00 foo 2
2016-04-06T21:13:24.373171+00:00 bar 2
Given that example, I would want the output to be:
2016-04-06T21:13:23.655446+00:00 foo 1
2016-04-06T21:13:24.372946+00:00 foo 2
2016-04-06T21:13:24.373171+00:00 bar 2
2016-04-06T21:13:24.384521+00:00 bar 1
As that would be the lines of the files, combined, and sorted by the timestamp with which each line begins.
We can assume that each file is internally sorted before the program is run. (If it isn't, rsyslog and I have some talking to do.)
So quite naively I could write something like this, ignoring memory concerns and whatnot:
interlaced_lines = []
first_lines = [[f.readline(), f] for f in files]
while first_lines:
    first_lines.sort()
    oldest_line, f = first_lines[0]
    while oldest_line and (len(first_lines) == 1 or (first_lines[1][0] and oldest_line < first_lines[1][0])):
        interlaced_lines.append(oldest_line)
        oldest_line = f.readline()
    if oldest_line:
        first_lines[0][0] = oldest_line
    else:
        first_lines = first_lines[1:]
I fear that this might be quite slow, reading line by line like this. However, I'm not sure how else to do it. Can I perform this task faster with a different algorithm or through parallelization? I am largely indifferent to which languages and tools to use.
As it turns out, since each file is internally presorted, I can get pretty far with sort --merge. With over 2GB of logs it sorted them in 15 seconds. Using my example:
% sort --merge 1.log 2.log
2016-04-06T21:13:23.655446+00:00 foo 1
2016-04-06T21:13:24.372946+00:00 foo 2
2016-04-06T21:13:24.373171+00:00 bar 2
2016-04-06T21:13:24.384521+00:00 bar 1
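For reference, the same k-way merge of pre-sorted streams is available in the Python standard library as heapq.merge, which reads the inputs lazily instead of loading them into memory (a sketch, not a benchmark; the file names match the example above):

```python
import heapq
from contextlib import ExitStack

def interleave(paths, out_path):
    # heapq.merge assumes each input is already sorted; it streams lines
    # one at a time without loading whole files into memory.
    with ExitStack() as stack, open(out_path, "w") as out:
        files = [stack.enter_context(open(p)) for p in paths]
        out.writelines(heapq.merge(*files))

# interleave(["1.log", "2.log"], "merged.log")
```

Since ISO 8601 timestamps sort lexicographically, plain string comparison of the lines gives the right order, just as with sort --merge.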
I have a problem with ksh in that a while loop is failing to obey the "while" condition. I should add now that this is ksh88 on my client's Solaris box. (That's a separate problem that can't be addressed in this forum. ;) I have seen Lance's question and some similar but none that I have found seem to address this. (Disclaimer: NO I haven't looked at every ksh question in this forum)
Here's a very cut down piece of code that replicates the problem:
#!/usr/bin/ksh
#
go=1
set -x
tail -0f loop-test.txt | while [[ $go -eq 1 ]]
do
    read lbuff
    set $lbuff
    nwords=$#
    printf "Line has %d words <%s>\n" $nwords "${lbuff}"
    if [[ "${lbuff}" = "0" ]]
    then
        printf "Line consists of %s; time to absquatulate\n" $lbuff
        go=0    # Violate the WHILE condition to get out of loop
    fi
done
printf "\nLooks like I've fallen out of the loop\n"
exit 0
The way I test this is:
Run loop-test.sh in background mode
In a different window I run commands like "echo some nonsense >>loop-test.txt" (w/o the quotes, of course)
When I wish to exit, I type "echo 0 >>loop-test.txt"
What happens? It indeed sets go=0 and displays the line:
Line consists of 0; time to absquatulate
but does not exit the loop. To break out I append one more line to the txt file. The loop does NOT process that line and just falls out of the loop, issuing that "fallen out" message before exiting.
What's going on with this? I don't want to use "break" because in the actual script, the loop is monitoring the log of a database engine and the flag is set when it sees messages that the engine is shutting down. The actual script must still process those final lines before exiting.
Open to ideas, anyone?
Thanks much!
-- J.
OK, that flopped pretty quick. After reading a few other posts, I found an answer given by dogbane that sidesteps my entire pipe-to-while scheme. His is the second answer to a question (from 2013) where I see neeraj is using the same scheme I'm using.
What was wrong? The pipe-to-while has always worked for input that ends, like a file or a command whose output terminates. But a tail -0f command never produces an EOF, so the while loop, running in a subshell on the receiving end of the pipe, never knows when to terminate.
Dogbane's solution: Don't use a pipe. Applying his logic to my situation, the basic loop is:
while read line
do
    # put loop body here
done < <(tail -0f ${logfile})
No subshell, no problem.
Caveat about that syntax: there must be a space between the two < operators; otherwise it looks like a here-document with bad syntax.
Er, one more catch: The syntax did not work in ksh, not even in the mksh (under cygwin) which emulates ksh93. But it did work in bash. So my boss is gonna have a good laugh at me, 'cause he knows I dislike bash.
So thanks MUCH, dogbane.
-- J
After articulating the problem and sleeping on it, the reason for the described behavior came to me: After setting go=0, the control flow of the loop still depends on another line of data coming in from STDIN via that pipe.
And now that I have realized the cause of the weirdness, I can speculate on an alternative way of reading from the stream. For the moment I am thinking of the following solution:
Open the input file as STDIN (Need to research the exec syntax for that)
When the condition occurs, close STDIN (Again, need to research the syntax for that)
It should then be safe to use the more intuitive "while read lbuff" at the top of the loop.
I'll test this out today and post the result. I hope someone else benefits from the method (if it works).
Can someone tell me if it is possible to save a variable (a number: the total row count of a SQL table) in memory or in a file, then, say 5 minutes later, check whether the number is still the same, and send me a warning or alert for Nagios if it changed?
Sounds like you are looking to do something like this:
#!/bin/sh
OLD_NUM=`command_to_get_number`
while true
do
    sleep 5m
    NEW_NUM=`command_to_get_number`
    [ "$OLD_NUM" != "$NEW_NUM" ] && notify-send "Number changed"
    OLD_NUM="$NEW_NUM"
done
notify-send will give you a desktop notification; I'm not sure if there is a similar command to work with Nagios.
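If you want something Nagios can call directly, a check plugin is just a command that prints one status line and exits with the conventional codes (0 = OK, 2 = CRITICAL). A minimal sketch; the state-file path and the compare-against-previous-run idea are assumptions about your setup, and fetching the actual row count from SQL is left out:

```python
import os

def check_changed(current, state_file="/tmp/row_count.state"):
    """Compare a number against the value stored on the previous run.

    Returns a (message, exit_code) pair in the usual Nagios convention:
    0 = OK, 2 = CRITICAL. The first run has nothing to compare to, so it
    just records the value and reports OK.
    """
    previous = None
    if os.path.exists(state_file):
        with open(state_file) as fh:
            previous = fh.read().strip()
    with open(state_file, "w") as fh:
        fh.write(str(current))
    if previous is not None and previous != str(current):
        return ("CRITICAL - row count changed from %s to %s" % (previous, current), 2)
    return ("OK - row count is %s" % current, 0)
```

You would wire this up as the command behind a Nagios service check, printing the message and exiting with the returned code.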
I'm using the ocount tool from the oprofile suite to count three different HW performance counters:
ocount --events=rtm_retired:commit,rtm_retired:start,rtm_retired:aborted programA
The problem is that because the three counters share a prefix, the output is irksomely ambiguous.
Event counts (actual) for programA:
Event Count % time counted
rtm_retired 908 100.00
rtm_retired 908 100.00
rtm_retired 0 100.00
The ordering is correct given the command line, but if I'm dumping all this stuff into files as I do experiments with other counters, it's possible to lose track of what counter is what.
Looking at the ocount manpage, I can't seem to figure out a way to force it to give the full event name.
Added:
Looking at the sources, I'm not actually sure this is possible, as the three events above are just masks on the same counter, and the count-printing section of the sources seems to only deal with event names, not mask names.
Alas (but would love to be proven wrong).
If changing the source code for ocount isn't an option, you can always modify the output afterwards.
Try piping the output through this perl one liner:
ocount --events=rtm_retired:commit,rtm_retired:start,rtm_retired:aborted programA | \
perl -n -e '@suffix = ("commit", "start", "aborted"); if ( m/rtm_retired/ ) { $count++; s/rtm_retired/rtm_retired:$suffix[$count-1]/; } print $_;'
This should work as long as you make sure to keep track of the order of counters you pass to ocount and match the @suffix array to it.
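If Perl isn't to hand, the same relabeling is easy to sketch in Python; as with the Perl version, the suffix list is an assumption that must mirror the order of the --events argument:

```python
def relabel(lines, suffixes=("commit", "start", "aborted")):
    """Append the i-th unit-mask suffix to the i-th ambiguous rtm_retired line."""
    it = iter(suffixes)
    out = []
    for line in lines:
        if "rtm_retired" in line:
            # Replace only the first occurrence on the line, keeping the counts.
            line = line.replace("rtm_retired", "rtm_retired:" + next(it), 1)
        out.append(line)
    return out

# e.g. pipe the ocount output through it:
# import sys; print("\n".join(relabel(sys.stdin.read().splitlines())))
```

Lines without the shared prefix (headers, blank lines) pass through untouched.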