How to filter a file using awk with more than two pattern matches? - awk

I want to modify this command to filter rows that have the "val" flag and more than 2 "PASS" occurrences.
Any suggestions?
This command works only when matching a single PASS:
awk '{if(($5=="val") && ($0 ~ /PASS/ )) {print $0}}' sample.vcf

Assuming $5 is the flag field:
awk '{if(($5=="val") && ($0 ~ /^.*PASS.*PASS.*$/ )) {print $0}}' sample.vcf
BTW, {if(($5=="val") && ($0 ~ /PASS/ )) {print $0}} will not match any lines, since when $5 == "val", $0 never matches /PASS/.
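A way to generalize to any number of matches is to count them with gsub(), which returns the number of substitutions it made. A minimal sketch with made-up whitespace-delimited sample data (real VCFs are tab-delimited; the flag field number 5 is taken from the question):

```shell
# Hypothetical sample: field 5 is the flag, PASS may occur several times per line
printf 'a b c d val PASS x PASS\na b c d val PASS x FAIL\na b c d oth PASS x PASS\n' > sample.txt

# gsub() replaces PASS with itself and returns the number of occurrences
awk '$5 == "val" && gsub(/PASS/, "&") >= 2' sample.txt
```

Only the first line survives: it has the val flag and two PASS occurrences.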

Related

AWK: Compare two columns conditionally in one file

I have a pipe (|) delimited file where $1 holds IDs and $2 and $3 hold values. The file has ~5000 rows, with each ID in $1 repeated multiple times. The file looks like this:
a|1|2
a|2|0
a|3|3
a|4|0
b|5|3
b|2|4
I am trying to print the lines where $2 on the current line is <= the maximum $3 for that ID, so the output will be:
a|1|2
a|2|0
a|3|3
b|2|4
Any lead on this would be highly appreciated! Thank you.
It sounds like you just want to, for each $1, print those lines where $2 is less than or equal to the max $3:
$ cat tst.awk
BEGIN { FS="[|]" }
NR==FNR {
max[$1] = ( ($1 in max) && (max[$1] > $3) ? max[$1] : $3 )
next
}
$2 <= max[$1]
$ awk -f tst.awk file file
a|1|2
a|2|0
a|3|3
b|2|4
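The script above uses the standard two-pass idiom: the file is passed twice, NR==FNR is true only while reading the first copy (building max[]), and the bare condition $2 <= max[$1] filters during the second pass. A commented sketch with the sample data:

```shell
printf 'a|1|2\na|2|0\na|3|3\na|4|0\nb|5|3\nb|2|4\n' > file

awk '
BEGIN { FS = "[|]" }
NR == FNR {                  # first pass: FNR resets per file, so this runs only for the first copy
    max[$1] = (($1 in max) && (max[$1] > $3) ? max[$1] : $3)
    next
}
$2 <= max[$1]                # second pass: the default action prints the matching line
' file file
```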

How to exclude lines matching a regex pattern in a column by awk?

I want to exclude lines containing a specific string.
header
1:test
2:test
3:none
4:test
Why don't these commands work?
awk -F: 'FNR>1 {$0 !~ /none/} {print $1}' 1.txt
awk -F: 'FNR>1 {$2 !~ /none/} {print $1}' 1.txt
but this works:
awk '$0 !~ /none/ {print $0}' 1.txt
I intend to get
1
2
4
You need to provide the regex test as a condition, not as an action. You may use:
awk -F: 'FNR>1 && !/none/{print $1}' file
awk -F: 'FNR>1 && $2 !~ /none/{print $1}' file
Details
-F: - sets the field separator to a colon
FNR>1 && !/none/ - true if the number of records processed in the current file is greater than 1 and the line does not contain none (the $2 !~ /none/ variant is true when Field 2 does not match none)
{print $1} - print Field 1 value.
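The reason the original attempts fail is that inside { }, $0 !~ /none/ is just an expression whose value is computed and thrown away; it never suppresses the following {print $1}. A quick demo with the sample data from the question:

```shell
printf 'header\n1:test\n2:test\n3:none\n4:test\n' > 1.txt

# As a condition, the regex test gates the print action
awk -F: 'FNR>1 && $2 !~ /none/ {print $1}' 1.txt
```

This prints 1, 2 and 4, skipping the header and the none line.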

How to select 2 fields if they meet requirements & length in awk or sed?

I want to select 2 fields and output them to a file:
field $1: select the line if it contains the # symbol (for email)
field $2: select it if it has a certain character length, e.g. 40.
Only output the line if both requirements are met. How can this be done in awk or sed?
I was using this:
awk -F: '/\#/ {print $1 ":" $2 }' test.txt > file_output.txt
however this matches the # in both $1 and $2, which is not what I want.
Thanks,
Edit: here is an example:
email#some.com:123456789123456789123456789:blah:blah:blah
ignore:1234#56789
output needed:
email#some.com:123456789123456789123456789
You can use this:
awk -F: '{if ($1 ~ /\#/ && length($2) == 40) print $1 ":" $2 }' test.txt > file_output.txt
Test:
Sample file:
$ cat t
user#host1:0123456789012345678901234567890123456789
user#host2:0123456789012345678901234567890123456789
userhost3:0123456789012345678901234567890123456789
user#host4:012345677
awk output:
$ awk -F: '{if ($1 ~ /\#/ && length($2) == 40) print $0 }' t
user#host1:0123456789012345678901234567890123456789
user#host2:0123456789012345678901234567890123456789
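A slightly more idiomatic variant of the same logic sets OFS so print rejoins the two fields with a colon itself (the 40-character second field in the sample below is made up for the demo):

```shell
printf 'email#some.com:0123456789012345678901234567890123456789:blah\nignore:1234#56789\n' > test.txt

# $1 must contain "#" and $2 must be exactly 40 characters long
awk -F: -v OFS=: '$1 ~ /#/ && length($2) == 40 { print $1, $2 }' test.txt
```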

awk to extract the data between Dates

I would like to extract the lines whose date in the second field ($2) falls between 5th Apr and 10th Apr. There are many gzip files in that directory.
Inputs.gz
Des1,DATE,Des1,Des2,Des3
ab,01-APR-15,10,0,4
ab,04-APR-15,25,0,12
ab,05-APR-15,40,0,6
ab,07-APR-15,55,0,6
ab,10-APR-15,70,0,1
ab,11-APR-15,85,0,1
I have tried the command below, but it is incomplete:
zcat Inputs*.gz | awk 'BEGIN{FS=OFS=","} { if ( (substr($2,1,2) >=5) && (substr($2,1,2) <=10) ) print $0 }' > Output.txt
Expected Output
ab,05-APR-15,40,0,6
ab,07-APR-15,55,0,6
ab,10-APR-15,70,0,1
Please suggest ...
Try this:
awk -F",|-" '$2 >= 5 && $2 <= 10'
It adds the date delimiter to the FS using the -F flag. To ensure that it's APR of 2015, you could separately add tests like:
awk -F",|-" '$2 >= 5 && $2 <= 10 && $3=="APR" && $4==15'
While this makes the date easy to parse up front, if you want to print it out again, you'll need to reconstruct it with something like date = $2 "-" $3 "-" $4. And if you need to manipulate the data in general, you'd want to add back in the BEGIN {OFS=","} part.
The field numbering I used assumes there are no "-" delimiters in the first field.
I get the following output:
ab,05-APR-15,40,0,6
ab,07-APR-15,55,0,6
ab,10-APR-15,70,0,1
If you have a whole mess of dates and you really only care about the one in the 2nd field via comma delimiters, you could use split like:
awk -F"," '{ split($2, darr, "-") } darr[1] >= 5 && darr[1] <= 10 && darr[2]=="APR" && darr[3]==15'
which is like saying:
for every line, parse the 2nd field into the darr array using the - delimiter
for every line, if the logic darr[1] >= 5 && darr[1] <= 10 && darr[2]=="APR" && darr[3]==15 is true print the whole line.
Another simple solution, using a regular expression:
awk -F',' '$2 ~ /([0][5-9]|10)-APR-15/{ print $0 }' txt
-F',' - sets the field separator to a comma
$2 - the second field
~ - matches against a regular expression
/([0][5-9]|10)-APR-15/ - regular expression matching 05 to 09 or 10, followed by -APR-15
Using the internal field separator variable instead of -F:
awk 'BEGIN{ FS="," } $2 ~ /([0][5-9]|10)-APR-15/{ print $0 }' txt
Using explicit date numbers:
awk 'BEGIN{ FS="," } $2 ~ /(05|06|07|08|09|10)-APR-15/{ print $0 }' txt

nested awk commands?

I have the following two pieces of code:
nut=`awk "/$1/{getline; print}" ids_lengths.txt`
and
grep -v '#' neco.txt |
grep -v 'seq-name' |
grep -E '(\S+\s+){13}\bAC(.)+CA\b' |
awk '$6 >= 49 { print }' |
awk '$6 <= 180 { print }' |
awk '$4 > 1 { print }' |
awk '$5 < $nut { print }' |
wc -l
I would like my script to replace "nut" at this place:
awk '$4 < $nut { print }'
with the number returned from this:
nut=`awk "/$1/{getline; print}" ids_lengths.txt`
However, $1 in the code just above should refer not to a column of ids_lengths.txt, but to the first column of neco.txt! (similarly to how I use $6 and $4 in the main code).
Any help with solving these nested awks will definitely be appreciated :-)
edit:
Line of my input file (neco.txt) looks like this:
FZWTUY402JKYFZ 2 100.000 3 11 9 4.500 7 0 0 0 . TG TGTGTGTGT
The biggest problem is that I want to keep only the lines whose fifth column is less than a number that I look up in another file (ids_lengths.txt) using the first column (e.g. FZWTUY402JKYFZ). That's why I put the "nut" variable in my draft script :-)
ids_lengths.txt looks like this:
>FZWTUY402JKYFZ
153
>FZWTUY402JXI9S
42
>FZWTUY402JMZO4
158
You can combine the two grep -v operations and the four consecutive awk operations into one of each. This gives you useful economy without completely rewriting everything:
nut=`awk "/$1/{getline; print}" ids_lengths.txt`
grep -E -v '#|seq-name' neco.txt |
grep -E '(\S+\s+){13}\bAC(.)+CA\b' |
awk -v nut="$nut" '$6 >= 49 && $6 <= 180 && $4 > 1 && $5 < nut { print }' |
wc -l
I would not bother to make a single awk script determine the value of nut and do the value-based filtering. It can be done, but it complicates things unnecessarily — unless you can demonstrate that the whole thing is a bottleneck for the performance of the production system, in which case you do work harder (though I'd probably use Perl in that case; it can do the whole lot in one command).
Approximately:
awk -v select="$1" '$0 ~ select && FNR == NR { getline; nut = $0; } FNR == NR {next} $4 > 1 && $5 < nut && $6 >= 49 && $6 <= 180 && ! /#/ && ! /seq-name/ && $NF ~ /^AC.+CA$/ {count++} END {print count}' ids_lengths.txt neco.txt
The regex will need to be adjusted to something that AWK understands. I can't see how the regex matches the sample data you provided. Part of the solution may be to use a field count as one of the conditions. Perhaps NF == 13 or NF >= 13.
Here's the script above broken out on multiple lines for readability:
awk -v select="$1" '
$0 ~ select && FNR == NR {
getline
nut = $0;
}
FNR == NR {next}
$4 > 1 &&
$5 < nut &&
$6 >= 49 &&
$6 <= 180 &&
! /#/ &&
! /seq-name/ &&
$NF ~ /^AC.+CA$/ {
count++
}
END {
print count
}' ids_lengths.txt neco.txt
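If the length should be looked up per line (keyed on neco.txt's first column) rather than once for a single shell argument, a two-file array lookup avoids the shell variable entirely. A minimal sketch under the ids_lengths.txt format shown (a >ID line followed by a length line); the question's other filters ($4, $6 ranges, the AC...CA regex) can be &&-ed onto the final condition:

```shell
# Hypothetical miniature versions of the two input files
printf '>FZWTUY402JKYFZ\n153\n>FZWTUY402JXI9S\n42\n' > ids_lengths.txt
printf 'FZWTUY402JKYFZ 2 100.000 3 11 9 4.500 7 0 0 0 . TG TGTGTGTGT\nFZWTUY402JXI9S 2 100.000 3 50 9 4.500 7 0 0 0 . TG TGTGTGTGT\n' > neco.txt

awk '
NR == FNR {                      # first file: build an id -> length table
    if (/^>/) { id = substr($0, 2); next }
    len[id] = $0 + 0             # the line after the >id header is the length
    next
}
$5 < len[$1]                     # second file: keep lines whose 5th field is below the stored length
' ids_lengths.txt neco.txt
```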