I have got the following two pieces of code:
nut=`awk "/$1/{getline; print}" ids_lengths.txt`
and
grep -v '#' neco.txt |
grep -v 'seq-name' |
grep -E '(\S+\s+){13}\bAC(.)+CA\b' |
awk '$6 >= 49 { print }' |
awk '$6 <= 180 { print }' |
awk '$4 > 1 { print }' |
awk '$5 < $nut { print }' |
wc -l
I would like my script to replace "nut" at this place:
awk '$5 < $nut { print }'
with the number returned from this:
nut=`awk "/$1/{getline; print}" ids_lengths.txt`
However, $1 in the code just above should represent not a column from ids_lengths.txt, but the first column of neco.txt! (similarly to how I use $6 and $4 in the main code).
Any help on how to solve these nested awks will definitely be appreciated :-)
edit:
Line of my input file (neco.txt) looks like this:
FZWTUY402JKYFZ 2 100.000 3 11 9 4.500 7 0 0 0 . TG TGTGTGTGT
The biggest problem is that I want to filter out those lines whose fifth column is less than a number that I get from another file (ids_lengths.txt) by searching with the first column (e.g. FZWTUY402JKYFZ). That's why I put the "nut" variable in my draft script :-)
ids_lengths.txt looks like this:
>FZWTUY402JKYFZ
153
>FZWTUY402JXI9S
42
>FZWTUY402JMZO4
158
You can combine the two grep -v operations and the four consecutive awk operations into one of each, using awk's -v option to import the shell's nut variable (inside single quotes the shell never expands $nut, so your draft never saw the value). This gives you useful economy without completely rewriting everything:
nut=`awk "/$1/{getline; print}" ids_lengths.txt`
grep -E -v '#|seq-name' neco.txt |
grep -E '(\S+\s+){13}\bAC(.)+CA\b' |
awk -v nut="$nut" '$6 >= 49 && $6 <= 180 && $4 > 1 && $5 < nut { print }' |
wc -l
I would not bother to make a single awk script both determine the value of nut and do the value-based filtering. It can be done, but it complicates things unnecessarily, unless you can demonstrate that the whole pipeline is a performance bottleneck in the production system, in which case you do the extra work (though I'd probably use Perl in that case; it can do the whole lot in one command).
Approximately:
awk -v select="$1" '$0 ~ select && FNR == NR { getline; nut = $0; } FNR == NR {next} $4 > 1 && $5 < nut && $6 >= 49 && $6 <= 180 && ! /#/ && ! /seq-name/ && $NF ~ /^AC.+CA$/ {count++} END {print count}' ids_lengths.txt neco.txt
The regex will need to be adjusted to something that AWK understands. I can't see how the regex matches the sample data you provided. Part of the solution may be to use a field count as one of the conditions. Perhaps NF == 13 or NF >= 13.
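For instance, a hedged sketch of that idea, replacing the grep stage with field-based tests (this assumes the AC...CA motif always sits in the last field of neco.txt, which may not hold for your real data):
awk 'NF >= 13 && $NF ~ /^AC.+CA$/ { print }' neco.txt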
Here's the script above broken out on multiple lines for readability:
awk -v select="$1" '
$0 ~ select && FNR == NR {
getline
nut = $0;
}
FNR == NR {next}
$4 > 1 &&
$5 < nut &&
$6 >= 49 &&
$6 <= 180 &&
! /#/ &&
! /seq-name/ &&
$NF ~ /^AC.+CA$/ {
count++
}
END {
print count
}' ids_lengths.txt neco.txt
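For reference, a sketch of how you might invoke it, assuming the program above is saved in a (hypothetical) file count.awk and the id to look up is the wrapper script's first argument:
awk -v select="$1" -f count.awk ids_lengths.txt neco.txt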
Related
I want to delete all lines in which MGD is not between 676 and 900.
I wrote a script
#!/bin/bash
for index in {1..100} # this runs on 100 files, which is why I use a for loop
do
awk 'BEGIN { FS = "MGD" };
{if ($2 >= 676 && $2 <= 900) print}' eq2_15_333_$index.ndx | tee eq3_15_333_$index.ndx
done
Input example
MGD816 SOL77
MGD71 SOL117
MGD7 SOL13194
MGD18 SOL235
MGD740 SOL340
MGD697 SOL396
MGD70 SOL9910
Expected output
MGD816 SOL77
MGD740 SOL340
MGD697 SOL396
I don't know what my script is doing wrong: I still get lines with MGD7 or MGD71, but MGD18 isn't in my output.
Edit
I tested this script and it works perfectly
awk '/^MGD/{val=substr($1,4);if(val+0 >= 676 && val+0 <= 900){print}}' new.txt | tee new2.txt
and I have output
MGD816 SOL77
MGD740 SOL340
MGD697 SOL396
Based on your shown samples, try the following. This is completely based on your shown attempt; it's untested but should work.
#!/bin/bash
for index in {1..100}
do
awk '/^MGD/{val=substr($1,4);if(val+0 >= 676 && val+0 <= 900){print}}' eq2_15_333_$index.ndx | tee eq3_15_333_$index.ndx
done
I want to explain why your original, i.e.
awk 'BEGIN { FS = "MGD" }; {if ($2 >= 676 && $2 <= 900) print}'
did not work as expected: you set "MGD" as FS, so awk split only at MGD. If you run awk 'BEGIN{FS="MGD"}{print $2}' file.txt and the content of file.txt is
MGD816 SOL77
MGD71 SOL117
MGD7 SOL13194
MGD18 SOL235
MGD740 SOL340
MGD697 SOL396
MGD70 SOL9910
output is
816 SOL77
71 SOL117
7 SOL13194
18 SOL235
740 SOL340
697 SOL396
70 SOL9910
Because $2 still contains the trailing text (e.g. "71 SOL117"), it is not a numeric string, so awk compares it to 676 and 900 as strings; that is why MGD71 slipped through ("7" > "6") while MGD18 was excluded ("1" < "6"). If you want $2 to be just the first number, you should specify an FS that matches either "MGD" or whitespace, i.e.
awk 'BEGIN{FS="MGD|[[:space:]]+"}...
I am using awk to calculate the % of each id using the command below, which runs and is very close, except when the number used in the calculation is zero. I am not sure how to code this condition into the awk, as it happens frequently. Thank you :).
file1
ABCA2 9 232
ABHD12 211 648
ABL2 83 0
file2
CC2D2A 442
(CCDC114) 0
awk with error
awk 'function ceil(v) {return int(v)==v?v:int(v+1)}
NR==FNR{f1[$1]=$2; next}
$1 in f1{print $1, ceil(10000*(1-f1[$1]/$3))/100 "%"}' all_sorted_genes_base_counts.bed all_sorted_total_base_counts.bed > total_panel_coverage.txt
awk: cmd. line:3: (FILENAME=file1 FNR=3) fatal: division by zero attempted
When you have a script that fails when parsing 2 input files, I can't imagine why you'd only show 1 sample input file and no expected output, thereby ensuring that:
we can't test our potential solutions against a sample you think is relevant, and
we have no way of knowing whether our script is doing what you want or not,
but in general to guard against a zero denominator you'd use code like:
awk '{print ($2 == 0 ? "NaN" : $1 / $2)}'
e.g.
$ echo '6 2' | awk '{print ($2 == 0 ? "NaN" : $1 / $2)}'
3
$ echo '6 0' | awk '{print ($2 == 0 ? "NaN" : $1 / $2)}'
NaN
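Applied to your own command, the same guard might look like this (an untested sketch that keeps your ceil() helper and file names, and assumes $3 of the second file is the denominator that can be zero; the ternary only evaluates the selected branch, so the division is skipped when $3 is 0):
awk 'function ceil(v) {return int(v)==v?v:int(v+1)}
NR==FNR{f1[$1]=$2; next}
$1 in f1{print $1, ($3 == 0 ? "NaN" : ceil(10000*(1-f1[$1]/$3))/100 "%")}' all_sorted_genes_base_counts.bed all_sorted_total_base_counts.bed > total_panel_coverage.txt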
I would like to extract the lines whose dates, in the second field ($2), fall between 5th Apr and 10th Apr. There are many gzip files in that directory.
Inputs.gz
Des1,DATE,Des1,Des2,Des3
ab,01-APR-15,10,0,4
ab,04-APR-15,25,0,12
ab,05-APR-15,40,0,6
ab,07-APR-15,55,0,6
ab,10-APR-15,70,0,1
ab,11-APR-15,85,0,1
I have tried the command below, but it is incomplete:
zcat Inputs*.gz | awk 'BEGIN{FS=OFS=","} { if ( (substr($2,1,2) >=5) && (substr($2,1,2) <=10) ) print $0 }' > Output.txt
Expected Output
ab,05-APR-15,40,0,6
ab,07-APR-15,55,0,6
ab,10-APR-15,70,0,1
Please suggest ...
Try this:
awk -F",|-" '$2 >= 5 && $2 <= 10'
It adds the date delimiter to the FS using the -F flag. To ensure that it's APR of 2015, you could separately add tests like:
awk -F",|-" '$2 >= 5 && $2 <= 10 && $3=="APR" && $4==15'
While this makes the date easy to parse up front, if you want to print it out again, you'll need to reconstruct it with something like _date = $2 "-" $3 "-" $4. And if you need to manipulate the data in general, you'd want to add back in the BEGIN {OFS=","} part (see the sketch after the output below).
The field numbering I used assumes there are no "-" delimiters in the first field.
I get the following output:
ab,05-APR-15,40,0,6
ab,07-APR-15,55,0,6
ab,10-APR-15,70,0,1
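Following up on the reconstruction note above, a sketch assuming the sample's five comma-separated columns:
$ zcat Inputs*.gz | awk -F",|-" 'BEGIN{OFS=","} $2 >= 5 && $2 <= 10 && $3=="APR" && $4==15 { print $1, $2 "-" $3 "-" $4, $5, $6, $7 }'
ab,05-APR-15,40,0,6
ab,07-APR-15,55,0,6
ab,10-APR-15,70,0,1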
If you have a whole mess of dates and you really only care about the one in the 2nd field via comma delimiters, you could use split like:
awk -F"," '{ split($2, darr, "-") } darr[1] >= 5 && darr[1] <= 10 && darr[2]=="APR" && darr[3]==15'
which is like saying:
for every line, parse the 2nd field into the darr array using the - delimiter
for every line, if the logic darr[1] >= 5 && darr[1] <= 10 && darr[2]=="APR" && darr[3]==15 is true print the whole line.
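Run against the sample data, that should give the expected three lines:
$ zcat Inputs*.gz | awk -F"," '{ split($2, darr, "-") } darr[1] >= 5 && darr[1] <= 10 && darr[2]=="APR" && darr[3]==15'
ab,05-APR-15,40,0,6
ab,07-APR-15,55,0,6
ab,10-APR-15,70,0,1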
Another simple solution using a regular expression:
awk -F',' '$2 ~ /([0][5-9]|10)-APR-15/{ print $0 }' txt
-F field separator
$2 second field
~ matches a regular expression
/([0][5-9]|10)-APR-15/ regular expression matching 05 to 09 or 10, followed by -APR-15
Using the internal field separator:
awk 'BEGIN{ FS="," } $2 ~ /([0][5-9]|10)-APR-15/{ print $0 }' txt
Using explicit date number declarations:
awk 'BEGIN{ FS="," } $2 ~ /(05|06|07|08|09|10)-APR-15/{ print $0 }' txt
I have a simple question but I could not figure it out.
I have a file, and I want to print all the lines that DO NOT match the condition I specify in the awk if condition. But I can only get it to print the lines that match the condition; how would the opposite work?
This is my code:
awk '{if ($18==0 && $19==0 && $20==0 && $21==0) print $0}' file
I also tried this:
awk '{if !($18==0 && $19==0 && $20==0 && $21==0) print $0}' file
But the second one doesn't work; any help is appreciated. Thank you.
Here you can do:
awk '$18+$19+$20+$21!=0' file
print $0 is not needed, since it's the default action.
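A quick demo of both points, using three small columns in place of $18..$21 (note the sum trick assumes the fields are non-negative counts, so they cannot cancel out):
$ printf 'A 0 0 0\nB 1 0 2\n' | awk '$2+$3+$4!=0'
B 1 0 2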
The negation (!) needs to be inside the parentheses:
awk '{if (!($18==0 && $19==0 && $20==0 && $21==0)) print $0}' file
And we add another set of parentheses inside to wrap the whole condition.
(FYI, it would have been more helpful if you had shown how it "didn't work", i.e., the syntax error on !. Please remember to include error messages or symptoms of something not working in future questions!)
You could also reverse your conditional statement. You want the opposite of:
awk '{if ($18==0 && $19==0 && $20==0 && $21==0) print $0}' file
Which can either be :
awk '{if ($18!=0 || $19!=0 || $20!=0 || $21!=0) print $0}' file
or
awk '{if (!($18==0 && $19==0 && $20==0 && $21==0)) print $0}' file
Example:
$ cat file
A 0 0 0
B 1 1 1
C 1 0 1
$ awk '$2+$3+$4!=0' file
B 1 1 1
C 1 0 1
$ awk '{if ($2!=0 || $3!=0 || $4!=0) print $0}' file
B 1 1 1
C 1 0 1
$ awk '{if (!($2==0 && $3==0 && $4==0)) print $0}' file
B 1 1 1
C 1 0 1
$ awk '{if (!($2==0 || $3==0 || $4==0)) print $0}' file
B 1 1 1
Consider the following command:
$ gawk -F"\t" "BEGIN{OFS=\"\t\"}{$2=$3=\"\"; print $0}" Input.tsv
When I set $2 = $3 = "", the intent is to get the same effect as writing:
print $1,$4,$5...$NF
However, what actually happens is that I get two empty fields, with the extra field delimiters still printing.
Is it possible to actually delete $2 and $3?
Note: If this were on Linux in bash, the correct statement would be the following, but Windows does not handle single quotes well in cmd.exe.
$ gawk -F'\t' 'BEGIN{OFS="\t"}{$2=$3=""; print $0}' Input.tsv
This is an oldie but goodie.
As Jonathan points out, you can't delete fields in the middle, but you can replace their contents with the contents of other fields. And you can make a reusable function to handle the deletion for you.
$ cat test.awk
function rmcol(col, i) {
    # shift every field after "col" one position to the left
    for (i=col; i<NF; i++) {
        $i = $(i+1)
    }
    # then drop the now-duplicated last field
    NF--
}
{
    rmcol(3)
}
1   # "1" is always true, so the default action prints the modified record
$ printf 'one two three four\ntest red green blue\n' | awk -f test.awk
one two four
test red blue
You can't delete fields in the middle, but you can delete fields at the end, by decrementing NF.
So you can shift all the later fields down to overwrite $2 and $3 then decrement NF by two, which erases the last two fields:
$ echo 1 2 3 4 5 6 7 | awk '{for(i=2; i<NF-1; ++i) $i=$(i+2); NF-=2; print $0}'
1 4 5 6 7
If you are just looking to remove columns, you can use cut:
$ cut -f 1,4- file.txt
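For example, with tab-separated input (cut's default delimiter):
$ printf 'one\ttwo\tthree\tfour\n' | cut -f 1,4-
one	four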
To emulate cut:
$ awk -F "\t" '{ for (i=1; i<=NF; i++) if (i != 2 && i != 3) { if (i == NF) printf $i"\n"; else printf $i"\t" } }' file.txt
Similarly:
$ awk -F "\t" '{ delim =""; for (i=1; i<=NF; i++) if (i != 2 && i != 3) { printf delim $i; delim = "\t"; } printf "\n" }' file.txt
HTH
The only way I can think to do it in Awk without using a loop is to use gsub on $0 to combine adjacent FS:
$ echo {1..10} | awk '{$2=$3=""; gsub(FS"+",FS); print}'
1 4 5 6 7 8 9 10
One way could be to remove fields like you do and remove extra spaces with gsub:
$ awk 'BEGIN { FS = "\t" } { $2 = $3 = ""; gsub( /\s+/, "\t" ); print }' input-file
In addition to the answer by Suicidal Steve, I'd like to suggest one more solution, using sed instead of awk.
It seems more complicated than the cut approach Steve suggested, but it can be the better solution because sed -i allows in-place editing.
$ sed -i 's/\(.*,\).*,.*,\(.*\)/\1\2/' FILENAME
To remove fields 2 and 3 from a given input file (assuming a tab field separator), you can remove the fields from $0 using gensub and regenerate it as follows:
awk -F '\t' 'BEGIN{OFS="\t"}\
{$0=gensub(/[^\t]*\t/,"",3);\
$0=gensub(/[^\t]*\t/,"",2);\
print}' Input.tsv
The method presented in ghoti's answer has some problems:
every assignment of $i = $(i+1) forces awk to rebuild the record $0. This implies that if you have 100 fields and you want to delete field 10, you rebuild the record 90 times.
changing the value of NF manually is not POSIX-compliant and leads to undefined behaviour (as mentioned in the comments).
A somewhat more cumbersome but robust way to delete a set of columns would be:
a single column:
awk -v del=3 '
BEGIN{FS=fs;OFS=ofs}
{ b=""; for(i=1;i<=NF;++i) if(i!=del) b=(b?b OFS:"") $i; $0=b }
# do whatever you want to do
' file
multiple columns:
awk -v del=3,5,7 '
BEGIN{FS=fs;OFS=ofs; del="," del ","}
{ b=""; for(i=1;i<=NF;++i) if (del !~ ","i",") b=(b?b OFS:"") $i; $0=b }
# do whatever you want to do
' file
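As a usage sketch with the default (space) separators, where the fs/ofs placeholders can simply be dropped:
$ echo a b c d e f g h | awk -v del=3,5,7 '
BEGIN{del="," del ","}
{ b=""; for(i=1;i<=NF;++i) if (del !~ ","i",") b=(b?b OFS:"") $i; $0=b }
{ print }'
a b d f h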
Well, if the goal is to remove the extra delimiters, then you can use tr on Linux. Example:
$ echo "1,2,,,5" | tr -s ','
1,2,5
echo one two three four five six | awk '{
    print $0
    is3 = $3
    $3 = ""
    print $0
    print is3
}'
one two three four five six
one two four five six
three