AWK print combination of string+bash variable+string - awk

I am trying to rename contigs in a fasta file with the isolate ID and numbering contigs from 1 to n using awk.
Fastafile:
>NODE_1_length_172477_cov_46.1343
GCAGGGCGCAGTTTTTGGAGGCTTGGCAAACCCGTGAGGGAAATTTGGCAGGCAAAATTT
TGGCGGTCGTGCCGAAAAAAGCGGAGGCGATTTCAAATAAATTGTTTTTCACACATCATC
CCAAGCGGCAGACGGAGTTTGCAGTCGGACAAATCAGGCAAGGGCGCGCAGAGTAAGTCA
The isolate ID is a variable, as I am trying to do this for multiple files. I have come as far as getting it to print isolateIDnumber, but I need >isolateID_number:
for file in /dir/*.fasta
do
    name=$(basename "$file" .fasta)
    awk '/^>/{print "'"$name"'" ++i; next}{print}' $file > rename.fasta
done
This gives me:
15AR07771
GCAGGGCGCAGTTTTTGGAGGCTTGGCAAACCCGTGAGGGAAATTTGGCAGGCAAAATTT
TGGCGGTCGTGCCGAAAAAAGCGGAGGCGATTTCAAATAAATTGTTTTTCACACATCATC
CCAAGCGGCAGACGGAGTTTGCAGTCGGACAAATCAGGCAAGGGCGCGCAGAGTAAGTCA
Desired output:
>15AR0777_1
GCAGGGCGCAGTTTTTGGAGGCTTGGCAAACCCGTGAGGGAAATTTGGCAGGCAAAATTT
TGGCGGTCGTGCCGAAAAAAGCGGAGGCGATTTCAAATAAATTGTTTTTCACACATCATC
CCAAGCGGCAGACGGAGTTTGCAGTCGGACAAATCAGGCAAGGGCGCGCAGAGTAAGTCA
The question is: where do I put the string so that it prints >15AR0777_1 instead of 15AR07771?
I tried a few variations of the following, but none worked:
awk '/^>/{print ">'"$name"'" "_" ++i; next}{print}' $file > rename.fasta
awk '/^>/{print ">'"$name"'" _++i; next}{print}' $file > rename.fasta
Thanks!

Use awk -v awk_var="$bash_var" to transport shell variables into awk scripts. From man awk:
-v var=val
--assign var=val
Assign the value val to the variable var, before execution of the program begins. Such variable values are available to the
BEGIN rule of an AWK program.
i.e.:
for file in dir/*.fasta
do
    name=$(basename "$file" .fasta)
    awk -v name="$name" '/^>/{print ">" name "_" ++i; next}{print}' "$file" > rename.fasta
done
Here is an all awk version for it:
awk '
FNR==1 {                            # new file: close the old output and build the new name
    close(f)                        # close the old output file
    i = 0                           # restart the contig numbering for each file
    n = FILENAME                    # get the filename of the new file
    gsub(/^.*\/|\.fasta$/, "", n)   # remove the path and the .fasta extension
    f = "rename_" n ".fasta"        # new output file
}
/^>/ {
    $0 = ">" n "_" ++i              # >name_number
}
{
    print > f                       # print to the output file
}' dir/*.fasta                      # process the .fasta files in dir
If there is a file dir/15AR07771.fasta, the script will produce a file ./rename_15AR07771.fasta from it. (Your version writes all output files to rename.fasta, and does not even append; you may want to fix that.)
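If you prefer to stay with the shell loop, one way to avoid clobbering rename.fasta on every iteration is to write one output file per input file (a sketch; the rename_ prefix just mirrors the all-awk version above):

```shell
for file in dir/*.fasta
do
    name=$(basename "$file" .fasta)
    # one output file per input file, instead of overwriting rename.fasta each time
    awk -v name="$name" '/^>/{print ">" name "_" ++i; next}{print}' \
        "$file" > "rename_${name}.fasta"
done
```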


How can I send the output of an AWK script to a file?

Within an AWK script, I'm needing to send the output of the script to a file while also printing it to the terminal. Is there a nice and tidy way I can do this without having a copy of every print redirect to the file?
I'm not particularly good at making SSCCE examples but here's my attempt at demonstrating my problem;
BEGIN{
    print "This is an awk script"
    # I don't want to have to do this for every print
    print "This is an awk script" > "thisiswhack.out"
}
{
    # data manip. stuff here
    # ...
    printf "%s %s %s\n", blah, blah, blah
    # I don't want to have to do this for every print again
    printf "%s %s %s\n", blah, blah, blah >> "thisiswhack.out"
}
END{
    print "Yay we're done!"
    # Seriously, there has to be a better way to do this within the script
    print "Yay we're done!" >> "thisiswhack.out"
}
Surely there must be a way to send the entire output of the script to an output file within the script itself, right?
The command to duplicate streams is tee, and we can use it inside awk:
awk '
BEGIN {tee = "tee out.txt"}
{print | tee}' in.txt
This invokes tee with the file argument out.txt, and opens a stream to this command.
The stream (and therefore tee) remains open until awk exits, or close(tee) is called.
Every time print | tee is used, the data is printed to that stream. tee then appends this data both to the file out.txt, and stdout.
The | command feature is POSIX awk. Also, the tee variable isn't compulsory; you could use the string "tee out.txt" directly.
Of course, we can use tee outside awk too: awk ... | tee out.txt.
GNU AWK's redirection allows sending output to a command rather than to a file, so I suggest the following use of that feature:
awk 'BEGIN{command="tee output.txt"}{print tolower($0) | command}' input.txt
Note: I use tolower($0) for demonstration purposes. I redirect print into the tee command, which writes both to the named file and to standard output, so you should get a lowercase version of input.txt written to output.txt and to standard output.
If you are not confined to a single awk invocation, then you might alternatively use tee outside, like so:
awk '{print tolower($0)}' input.txt | tee output.txt
awk '
function prtf(str) {
    printf "%s", str > "thisiswhack.out"
    printf "%s", str
    fflush()
}
function prt(str) {
    prtf( str ORS )
}
{
    # to print adding a newline at the end:
    prt( "foo" )
    # to print as-is without adding a newline:
    prtf( sprintf("%s, %s, %d", $2, "bar", 17) )
}
' file
In the above we are not spawning a subshell to call any other command, so it's efficient, and we're using fflush() after every print to ensure the two output streams (stdout and the extra file) don't get out of sync with respect to each other (e.g. stdout displaying less text than the file, or vice versa, if the command is killed).
The above always overwrites the contents of "thisiswhack.out" with whatever the script outputs. If you want to append instead, then change > to >>. If you want the option of doing both, introduce a variable (which I've named prtappend below) to control it, which you can set on the command line, e.g. change:
printf "%s", str > "thisiswhack.out"
to:
printf "%s", str >> "thisiswhack.out"
and add:
BEGIN {
    if ( !prtappend ) {
        printf "" > "thisiswhack.out"
    }
}
then if you do awk -v prtappend=1 '...' it'll append to thisiswhack.out instead of overwriting it.
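Assembled, the whole thing might look like the sketch below (same invented output file name as above; the main block simply echoes each input line for demonstration):

```shell
awk '
function prtf(str) {
    printf "%s", str >> "thisiswhack.out"
    printf "%s", str
    fflush()
}
function prt(str) {
    prtf( str ORS )
}
BEGIN {
    if ( !prtappend ) {
        printf "" > "thisiswhack.out"   # truncate unless -v prtappend=1 was given
    }
}
{
    prt( $0 )   # every line goes to both stdout and thisiswhack.out
}
' file
```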
Of course, the better approach if you're on a Unix system is to have your awk script called from a shell script with its output piped to tee, e.g.:
#!/usr/bin/env bash
awk '
{
print "foo"
printf "%s, %s, %d", $2, "bar", 17
}
' "${@:--}" |
tee 'thisiswhack.out'
Note that this is one more example of why you should not call awk from a shebang.

Count field separators on each line of input file and if missing/exceeding, output filename to error file

I have to validate the input file, Input.txt, for the proper number of field separators on each row; if even one row, including the header, is missing or exceeding the correct number of field separators, then print the name of the file to errorfiles.txt and exit.
I have another file, valid.txt, to use as a reference for the correct number of field separators; I compare the number of field separators on each row of the input file with the number in valid.txt.
awk -F '|' '{ print NF-1; exit }' valid.txt > fscount
awk -F '|' '(NF-1) != "cat fscount" { print FILENAME>"errorfiles.txt"; exit}' Input.txt
This is not working.
It is not fully clear what your requirement is: do you want to print the FILENAME for just the single input file provided, or perhaps to loop over a list of files in a directory, running this command on each?
Anyway, to use the content of the file in the context of awk, just use its -v switch together with input redirection on the file:
awk -F '|' -v count="$(<fscount)" -v fname="errorfiles.txt" '(NF-1) != (count+0) { print FILENAME > fname; close(fname); exit}' Input.txt
Notice the use of close(filename) here, which is generally required when you are manipulating files inside awk constructs. The close() call just closes the file descriptor associated with opening the file pointed by filename explicitly, instead of letting the OS do it.
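As a toy illustration of that close() pattern (the file names here are invented): splitting input into one file per key, closing each descriptor right after the write so awk never accumulates open files:

```shell
printf 'a 1\nb 2\na 3\n' |
awk '{
    f = $1 ".txt"     # one output file per value in column 1
    print $2 >> f     # append, so reopening the file does not clobber earlier lines
    close(f)          # release the file descriptor immediately
}'
```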
GNU awk solution:
awk -F '|' 'ARGIND==1{aimNF=NF; nextfile} ARGIND==2{if (NF!=aimNF) {print FILENAME > "errorfiles.txt"; exit}}' valid.txt Input.txt
You can do it with just one command: use awk to read both files, store the NF of the first file, and compare against it in the second file.
For other awks you can replace ARGIND==1 with FILENAME==ARGV[1], and so on.
Or, if you are sure the first file won't be empty, use NR==FNR and NR>FNR instead.
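That NR==FNR variant might look like the sketch below, keeping the file names from the question (nextfile is retained from the answer above; gawk and most modern awks support it):

```shell
awk -F '|' '
NR==FNR { aimNF = NF; nextfile }   # first file: record the expected NF, skip the rest
NR>FNR && NF != aimNF {            # second file: flag the first mismatching row
    print FILENAME > "errorfiles.txt"
    exit
}' valid.txt Input.txt
```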

Bash how to split file on empty line with awk

I have a text file (A.in) and I want to split it into multiple files. The split should occur every time an empty line is found. The filenames should be progressive (A1.in, A2.in, ...).
I found this answer that suggests using awk, but I can't make it work with my desired naming convention
awk -v RS="" '{print $0 > $1".txt"}' file
I also found other answers telling me to use the command csplit -l, but I can't make it match empty lines; I tried matching the pattern '' but I am not that familiar with regex, and I get the following:
bash-3.2$ csplit A.in ""
csplit: : unrecognised pattern
Input file:
A.in
4
RURDDD
6
RRULDD
KKKKKK
26
RRRULU
Desired output:
A1.in
4
RURDDD
A2.in
6
RRULDD
KKKKKK
A3.in
26
RRRULU
Another fix for the awk:
$ awk -v RS="" '{
    split(FILENAME,a,".")   # separate name and extension
    f = a[1] NR "." a[2]    # form the filename, using NR as the number
    print > f               # output to file
    close(f)                # in case there are MANY, to avoid running out of fds
}' A.in
In any normal case, the following script should work:
awk 'BEGIN{RS=""}{ print > ("A" NR ".in") }' file
The reason why this might fail is most likely due to some CRLF terminations (See here and here).
As mentioned by James, making it a bit more robust as:
awk 'BEGIN{RS=""}{ f = "A" NR ".in"; print > f; close(f) }' file
If you want to use csplit, the following will do the trick:
csplit --suppress-matched -f "A" -b "%0.2d.in" A.in '/^$/' '{*}'
See man csplit for understanding the above.
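For reference, a run against the question's A.in goes roughly like this (GNU csplit assumed; note that csplit's numbering starts at 00, so with the "%0.2d.in" suffix format the pieces come out as A00.in, A01.in, A02.in rather than A1.in):

```shell
printf '4\nRURDDD\n\n6\nRRULDD\nKKKKKK\n\n26\nRRRULU\n' > A.in
csplit --suppress-matched -f "A" -b "%0.2d.in" A.in '/^$/' '{*}'
cat A00.in   # first block: 4 / RURDDD
```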
Input file content:
$ cat A.in
4
RURDDD
6
RRULDD
KKKKKK
26
RRRULU
AWK file content:
BEGIN{
    n=1
}
{
    if(NF!=0){
        print $0 >> "A" n ".in"
    }else{
        n++
    }
}
Execution:
awk -f ctrl.awk A.in
Output:
$ cat A1.in
4
RURDDD
$ cat A2.in
6
RRULDD
KKKKKK
$ cat A3.in
26
RRRULU
PS: One-liner execution without AWK file:
awk 'BEGIN{n=1}{if(NF!=0){print $0 >> "A"n".in"}else{n++}}' A.in

Trying to modify awk code

awk 'BEGIN{OFS=","} FNR == 1
{if (NR > 1) {print fn,fnr,nl}
fn=FILENAME; fnr = 1; nl = 0}
{fnr = FNR}
/ERROR/ && FILENAME ~ /\.gz$/ {nl++}
{
cmd="gunzip -cd " FILENAME
cmd; close(cmd)
}
END {print fn,fnr,nl}
' /tmp/appscraps/* > /tmp/test.txt
The above scans all files in a given directory, and prints the file name, the number of lines in each file, and the number of lines found containing 'ERROR'.
I'm now trying to make the script execute a command if any file it reads isn't a regular file, i.e., if the file is a gzip file, then run a particular command.
The above is my attempt to include the gunzip command and do it on my own. Unfortunately, it isn't working. Also, I cannot "gunzip" all the files in the directory beforehand, because not all files in the directory are gzip files; some are regular files.
So I need the script to treat any .gz file it finds differently, so it can read it, count and print the number of lines in it, and the number of lines it found matching the supplied pattern (just as it would if the file had been a regular file).
Any help?
This part of your script makes no sense:
{if (NR > 1) {print fn,fnr,nl}
fn=FILENAME; fnr = 1; nl = 0}
{fnr = FNR}
/ERROR/ && FILENAME ~ /\.gz$/ {nl++}
Let me restructure it a bit and comment it so it's clearer what it does:
{ # for every line of every input file, do the following:
    # If this is the 2nd or subsequent line, print the values of these variables:
    if (NR > 1) {
        print fn,fnr,nl
    }
    fn = FILENAME   # set fn to FILENAME. Since this will occur for the first line of
                    # every file, this is the value fn will have when printed above,
                    # so why not just get rid of fn and print FILENAME?
    fnr = 1         # set fnr to 1. This is immediately over-written below by
                    # setting it to FNR so this is pointless.
    nl = 0
}
{ # for every line of every input file, also do the following
  # (note the unnecessary "}" then "{" above):
    fnr = FNR       # set fnr to FNR. Since this will occur for the first line of
                    # every file, this is the value fnr will have when printed above,
                    # so why not just get rid of fnr and print FNR-1?
}
/ERROR/ && FILENAME ~ /\.gz$/ {
    nl++            # increment the value of nl. Since nl is always set to zero above,
                    # this will only ever set it to 1, so why not just set it to 1?
                    # I suspect the real intent is to NOT set it to zero above.
}
You also have the code above testing for a file name that ends in ".gz" but then you're running gunzip on every file in the very next block.
Beyond that, just call gunzip from shell as everyone else also suggested. awk is a tool for parsing text, it's not an environment from which to call other tools - that's what a shell is for.
For example, assuming your comment (prints the file name, number of lines in each file and number of lines found containing 'ERROR') accurately describes what you want your awk script to do, and assuming it makes sense to test for the word "ERROR" directly in a ".gz" file using awk:
for file in /tmp/appscraps/*.gz
do
    awk -v OFS=',' '/ERROR/{nl++} END{print FILENAME, NR+0, nl+0}' "$file"
    gunzip -cd "$file"
done > /tmp/test.txt
Much clearer and simpler, isn't it?
If it doesn't make sense to test for the word ERROR directly in a ".gz" file, then you can do this instead:
for file in /tmp/appscraps/*.gz
do
    zcat "$file" | awk -v file="$file" -v OFS=',' '/ERROR/{nl++} END{print file, NR+0, nl+0}'
    gunzip -cd "$file"
done > /tmp/test.txt
To handle gz and non-gz files as you've now described in your comment below:
for file in /tmp/appscraps/*
do
    case $file in
        *.gz ) cmd="zcat" ;;
        *    ) cmd="cat" ;;
    esac
    "$cmd" "$file" |
    awk -v file="$file" -v OFS=',' '/ERROR/{nl++} END{print file, NR+0, nl+0}'
done > /tmp/test.txt
I left out the gunzip since you don't need it as far as I can tell from your stated requirements. If I'm wrong, explain what you need it for.
I think it could be simpler than that.
With shell expansion you already have the file name (hence you can print it).
So you can do a loop over all the files, and for each do the following:
print the file name
zgrep -c ERROR $file (this outputs the number of lines containing 'ERROR')
zcat $file | wc -l (this outputs the line count)
zgrep and zcat work on both plain text files and gzipped ones.
Assuming you don't have any spaces in the paths/filenames:
for f in /tmp/appscraps/*
do
    n_lines=$(zcat "$f" | wc -l)
    n_errors=$(zgrep -c ERROR "$f")
    echo "$f $n_lines $n_errors"
done
This is untested but it should work.
You can execute the following command for each file:
gunzip -t FILENAME; echo $?
It will print the exit code: 0 for gzip files, or 1 for corrupt/other files. You can then test the exit code with an if to run the required processing.
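The exit status itself (rather than the printed code) is what an if can test directly, so wiring this into the earlier loop might look like the following sketch (the directory path comes from the question; stderr from gunzip -t is silenced for non-gzip files):

```shell
for file in /tmp/appscraps/*
do
    if gunzip -t "$file" 2>/dev/null; then
        zcat "$file"     # valid gzip data: decompress it
    else
        cat "$file"      # anything else: read it as-is
    fi |
    awk -v file="$file" -v OFS=',' '/ERROR/{nl++} END{print file, NR+0, nl+0}'
done > /tmp/test.txt
```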

choose the newest file and use getline to read it

Having problems with a small awk script: I'm trying to choose the newest of some log files and then use getline to read it. The problem is that it doesn't work if I don't send it any input first.
This works:
echo | myprog.awk
This does not:
myprog.awk
myprog.awk
BEGIN{
    # find the newest file
    command="ls -alrt | tail -1 | cut -c59-100"
    command | getline logfile
    close(command)
}
{
    while((getline < logfile) > 0){
        # do the magic
        print $0
    }
}
Your problem is that, while your program selects the logfile OK, the main {} block is executed for every line of the input file, and you have no input file, so it defaults to standard input. I don't know awk very well myself, so I don't know how to change the input (if possible) from within an awk script, so I would:
#! /bin/awk -f
BEGIN{
    # find the newest file
    command = "ls -1rt | tail -1 "
    command | getline logfile
    close(command)
    while((getline < logfile) > 0){
        # do the magic
        print $0
    }
}
or maybe
alias myprog.awk='awk "{print}" "$(ls -1rt | tail -1)"'
Again, this may be a little dirty. We'll wait for a better answer. :-)
Never parse ls. See this for the reason.
Why do you need to use getline? Let awk do the work for you.
#!/bin/bash
# get the newest file
files=(*) newest=${files[0]}
for f in "${files[@]}"; do
    if [[ $f -nt $newest ]]; then
        newest=$f
    fi
done
# process it with awk
awk '{
    # do the magic
    print $0
}' "$newest"