Print rows where one column is the same but another is different - awk

Given files test1 and test2:
$ cat test*
alert_name,id,severity
bar,2,1
foo,1,0
alert_name,id,severity
foo,1,9
bar,2,1
I want to find rows where the name is the same but the severity has changed (i.e. foo) and print the change. I have got this far using awk:
awk 'BEGIN{FS=","} FNR>1 && NR==FNR { a[$1]; next } ($1 in a) && ($3 != a[$3]) {printf "Alert %s severity from %s to %s\n", $1, a[$3], $3}' test1 test2
which prints:
Alert foo severity from to 9
Alert bar severity from to 1
So the match is wrong, and I can't print a[$3].

You may try this awk:
awk -F, '$1 in sev && $3 != sev[$1] {
printf "Alert %s severity from %s to %s\n", $1, sev[$1], $3
}
{sev[$1] = $3}' test*
Alert foo severity from 0 to 9
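To spell out why this works where the original attempt didn't: the original stored a[$1] with no value and then indexed the array with $3 instead of $1. A commented version of the same single-pass approach:

```shell
# Single pass over test1 then test2. sev[] maps alert name -> last seen
# severity; a change is reported the moment a known name reappears with a
# different severity, so "from" comes from test1 and "to" from test2.
awk -F, '
$1 in sev && $3 != sev[$1] {   # seen this name before with another severity
    printf "Alert %s severity from %s to %s\n", $1, sev[$1], $3
}
{ sev[$1] = $3 }               # remember the latest severity for this name
' test1 test2
```

The header line never triggers a report because its third field ("severity") is identical in both files.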

mawk 'BEGIN { _+=(_^=FS=OFS=",")+_ }
FNR == NR || +$_<=+___[__=$!!_] ? !_*(___[$!!_]=$_) : \
$!_ = "Alert "__ " severity from "___[__]" to " $_' test*
Alert foo severity from 0 to 9


AWK: How to number auto-increment?

I have a file. Its content is:
20210126000880000003|3|33.00|20210126|15:30
1|20210126000000000000000000002207|1220210126080109|1000|100000000000000319|100058110000000325|402041000012|402041000012|PT07|621067000000123645|收款方户名|2021-01-26|2021-01-26|10.00|TN|NCS|12|875466
2|20210126000000000000000000002208|1220210126080110|1000|100000000000000319|100058110000000325|402041000012|402041000012|PT06|621067000000123645|收款方户名|2021-01-26|2021-01-26|20.00|TN|NCS|12|875466
3|20210126000000000000000000002209|1220210126080111|1000|100000000000000319|100058110000000325|402041000012|402041000012|PT08|621067000000123645|收款方户名|2021-01-26|2021-01-26|3.00|TN|NCS|12|875466
I use this awk command:
awk -F"|" 'NR==1{print $1};FNR==2{print $2,$3}' testfile
and get the following result:
20210126000880000003
20210126000000000000000000002207 1220210126080109
I want the number to auto-increment, so I tried:
awk -F"|" 'NR==1{print $1+1};FNR==2{print $2+1,$3+1}' testfile
But I get the following result:
20210126000880001024
20210126000000000944237587726336 1220210126080110
My question: I want the number to auto-increment. The hoped-for result is:
20210126000880000003
20210126000000000000000000002207|1220210126080109
-------------------------------------------------
20210126000880000004
20210126000000000000000000002208|1220210126080110
--------------------------------------------------
20210126000880000005
20210126000000000000000000002209|1220210126080111
How can I auto-increment it?
Thanks!
You may try this GNU awk command (-M enables arbitrary-precision arithmetic and needs gawk built with GMP/MPFR support; it is required here because these serial numbers are too large for ordinary double-precision arithmetic, which is also why your $1+1 attempt produced garbage):
awk -M 'BEGIN {FS=OFS="|"} NR == 1 {hdr = $1; next} NF>2 {print hdr++; print $2, $3; print "-------------------"}' file
20210126000880000003
20210126000000000000000000002207|1220210126080109
-------------------
20210126000880000004
20210126000000000000000000002208|1220210126080110
-------------------
20210126000880000005
20210126000000000000000000002209|1220210126080111
-------------------
A more readable version:
awk -M 'BEGIN {
FS=OFS="|"
}
NR == 1 {
hdr = $1
next
}
NF > 2 {
print hdr++
print $2, $3
print "-------------------"
}' file
Here is a POSIX awk solution that doesn't need -M; it shells out to bc for the big-number addition (close() ensures each pipe is cleaned up):
awk 'BEGIN {FS=OFS="|"} NR == 1 {hdr = $1; next} NF>2 {print hdr; print $2, $3; print "-------------------"; cmd = "echo " hdr " + 1 | bc"; cmd | getline hdr; close(cmd)}' file
20210126000880000003
20210126000000000000000000002207|1220210126080109
-------------------
20210126000880000004
20210126000000000000000000002208|1220210126080110
-------------------
20210126000880000005
20210126000000000000000000002209|1220210126080111
-------------------
Anubhava has the best solution, but for older versions of GNU awk that don't support -M (big numbers) you can try the following:
awk -F\| 'NR==1 { print $1;hed=$1;hed1=substr($1,(length($1)-1));next; } !/^$/ {print $2" "$3 } /^$/ { print "--------------------------------------------------";printf "%s%02d\n",substr(hed,1,(length(hed)-2)),++hed1 }' testfile
Explanation:
awk -F\| 'NR==1 { # Set field delimiter to | and process the first line
print $1; # Print the first field
hed=$1; # Set the variable hed to the first field
hed1=substr($1,(length($1)-1)); # Set a counter variable hed1 to the last two digits of hed ($1)
next;
}
!/^$/ {
print $2" "$3 # Where the line is not blank, print the second field, a space and the third field
}
/^$/ {
print "--------------------------------------------------"; # Where the line is blank, process
printf "%s%02d\n",substr(hed,1,(length(hed)-2)),++hed1 # print the header prefix, then the incremented counter zero-padded to two digits so the overall width is preserved
}' testfile
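If neither -M nor bc is available, one more portable option (a sketch of my own, not taken from the answers above) is to increment the serial number as a decimal string, which avoids floating-point limits entirely:

```shell
# Increment an arbitrarily long decimal string without big-number support.
# inc() walks from the last digit, turning trailing 9s into 0s until a
# digit can simply be bumped by one.
awk 'function inc(s,    i, c) {
    for (i = length(s); i >= 1; i--) {
        c = substr(s, i, 1)
        if (c != "9")
            return substr(s, 1, i - 1) (c + 1) substr(s, i + 1)
        s = substr(s, 1, i - 1) "0" substr(s, i + 1)   # 9 -> 0, carry on
    }
    return "1" s                                       # all 9s: prepend a 1
}
BEGIN {
    print inc("20210126000880000003")   # -> 20210126000880000004
    print inc("999")                    # -> 1000 (carry across all digits)
}'
```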

awk to search field in file2 using range of file1

I am trying to use awk to find all the $2 values in file2 which is ~30MB, that are between $2 and $3 in file1 which is ~2GB. If a value in $2 of file2 is between the file1 fields then it is printed along with the $6 value in file1. Both file1 and file2 are tab-delimited as well as the desired output. If there is nothing to print then the next line is processed. The awk below runs but is very slow (has been processing for ~ 1 day and still not done). Is there a better way to approach this or a better programming language?
For a line of file2 to be printed in the output, its $1 must match the $1 of a line in file1 and its $2 must fall within the range given by that line's $2 and $3. When a file2 line matches, it is printed along with the $6 value of the matching file1 line.
Thank you :).
file1 (~3MB)
1 948953 948956 chr1:948953-948956 . ISG15
1 949363 949858 chr1:949363-949858 . ISG15
2 800000 900500 chr1:800000-900500 . AGRN
file2 (~80MB)
1 12214 . C G
1 949800 . T G
2 900000 rs123 - A
3 900000 . C -
desired output tab-delimited
1 949800 . T G ISG15
2 900000 rs123 - A AGRN
awk
awk -F'\t' -v OFS='\t' '
NR == FNR {min[NR]=$2; max[NR]=$3; Gene[NR]=$NF; next}
{
for (id in min)
if (min[id] < $2 && $2 < max[id]) {
print $0, id, Gene[id]
break
}
}
' file1 file2
This would be faster than what you have since it only loops through the file1 contents that have the same $1 value as in file2 and stops searching after it finds a range that matches:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR==FNR {
c = ++num[$1]
beg[$1][c] = $2
end[$1][c] = $3
val[$1][c] = $NF
next
}
$1 in val {
for (c=1; c<=num[$1]; c++) {
if ( (beg[$1][c] <= $2) && ($2 <= end[$1][c]) ) {
print $0, val[$1][c]
break
}
}
}
$ awk -f tst.awk file1 file2
1 949800 . T G ISG15
2 900000 rs123 - A AGRN
Unfortunately, for unsorted input like yours there aren't many options to make it faster. If the ranges in file1 can overlap each other then remove the "break".
Could you please try the following and let me know if it helps you.
awk 'FNR==NR{A[$1]=$0;B[$1,++D[$1]]=$2;next} {++C[$1]}($2<B[$1,C[$1]] && $3>B[$1,C[$1]]){print A[$1]}' Input_file2 Input_file1
This reads the files one by one: first Input_file2, then Input_file1.
If the performance is the issue, you have to sort both files by the value (and range start).
With the files sorted your scans can be incremental (and consequently much faster)
Here is an untested script
$ awk '{line=$0; k=$2;
getline < "file1";
while (k >= $2) getline < "file1";
if(k <= $3) print line, $NF}' file2
You can try to create a dict from file1 using multidimensional array subscripts; this is computationally more efficient when the ranges are narrow (file1 is small compared to file2):
awk '
NR==FNR{for(i=$2;i<=$3;++i) d[$1,i] = $6; next}
d[$1,$2]{print $0, d[$1,$2]}' file1 file2
you get,
1 949800 . T G ISG15
2 900000 rs123 - A AGRN
One possible approach is to use AWK to generate another AWK file. Memory consumption should be low so for a really big file1 this might be a lifesaver. As for speed, that might depend on how smart the AWK implementation is. I haven't had a chance to try it on huge data sets; I am curious about your findings.
Create a file step1.awk:
{
sub(/^chr/, "", $1);
print "$1==\"" $1 "\" && " $2 "<$2 && $2<" $3 " { print $0 \"\\t" $6 "\"; }";
}
Apply that on file1:
$ awk -f step1.awk file1
$1=="1" && 948953<$2 && $2<948956 { print $0 "\tISG15"; }
$1=="1" && 949363<$2 && $2<949858 { print $0 "\tISG15"; }
Pipe the output to a file step2.awk and apply that on file2:
$ awk -f step1.awk file1 > step2.awk
$ awk -f step2.awk file2
1 949800 rs201725126 T G ISG15
Alternative: generating C
I rewrote step1.awk, making it generate C rather than AWK code. Not only will this solve the memory issue you reported earlier; it will also be a lot faster, since C is compiled to native code.
BEGIN {
print "#include <stdio.h>";
print "#include <string.h>";
print "int main() {";
print " char s[999];";
print " int a, b;";
print " while (fgets(s, sizeof(s), stdin)) {";
print " s[strlen(s)-1] = 0;";
print " sscanf(s, \"%d %d\", &a, &b);";
}
{
print " if (a==" $1 " && " $2 "<b && b<" $3 ") printf(\"%s\\t%s\\n\", s, \"" $6 "\");";
}
END {
print " }";
print "}";
}
Given your sample file1, this will generate the following C source:
#include <stdio.h>
#include <string.h>
int main() {
char s[999];
int a, b;
while (fgets(s, sizeof(s), stdin)) {
s[strlen(s)-1] = 0;
sscanf(s, "%d %d", &a, &b);
if (a==1 && 948953<b && b<948956) printf("%s\t%s\n", s, "ISG15");
if (a==1 && 949363<b && b<949858) printf("%s\t%s\n", s, "ISG15");
if (a==2 && 800000<b && b<900500) printf("%s\t%s\n", s, "AGRN");
}
}
Sample output:
$ awk -f step1.awk file1 > step2.c
$ cc step2.c -o step2
$ ./step2 < file2
1 949800 . T G ISG15
2 900000 rs123 - A AGRN
It may be inefficient but should work, however slowly:
$ awk 'NR==FNR{ a[$2]=$0; next }
{ for(i in a)
if(i>=$2 && i<=$3) print a[i] "\t" $6 }
' f2 f1
1 949800 . T G ISG15
3 900000 . C - AGRN
Basically it reads file2 into memory and, for every line in file1, goes through every entry of file2 (in memory). It doesn't read the 2 GB file into memory, so it has less looking up to do than your version.
You could speed it up by replacing the print a[i] "\t" $6 with {print a[i] "\t" $6; delete a[i]}.
EDIT: Added tab delimited to output and refreshed the output to reflect the changed data. Printing "\t" is enough as the files are already tab delimited and records do not get rebuilt at any point.
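Spelled out, the delete variant suggested above looks like this (a sketch; the comparisons are inherited unchanged from the answer's code):

```shell
# Same lookup as above, with the suggested delete: once an entry from f2
# has matched a range it is removed, so later f1 lines skip it and the
# in-memory table shrinks as the scan proceeds.
awk 'NR==FNR { a[$2] = $0; next }        # f2: index each line by its $2
     {
         for (i in a)
             if (i >= $2 && i <= $3) {   # position inside this range
                 print a[i] "\t" $6
                 delete a[i]             # never test this entry again
             }
     }' f2 f1
```

A side effect is that each f2 entry can match at most one range, which also prevents duplicate output when ranges overlap.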

awk Print Skipping a field

In the case where type is "" print the 3rd field out of sequence and then print the whole line with the exception of the 3rd field.
Given a tab-separated line a b c d e the idea is to print ba<tab>c<tab>a<tab>b<tab>d<tab>e
Setting $3="" seems to cause the subsequent print statement to lose the tab field separators and so is no good.
# $1 = year $2 = movie
BEGIN {FS = "\t"}
type=="" {printf "%s\t%s\t", $2 $1,$3; $3=""; print}
type!="" {printf "%s\t<%s>\t", $2 $1,type; print}
END {print ""}
Sticking in a for loop which I like a lot less as a solution results in a blank file.
# $1 = year $2 = movie
BEGIN {FS = "\t"}
type=="" {printf "%s\t%s\t%s\t%s\t", $2 $1,$3,$1,$2; for (i=4; i<=NF;i++) printf "%s\t",$i}
type!="" {printf "%s\t<%s>\t", $2 $1,type; print}
END {print ""}
You need to set OFS to a tab instead of its default single blank char, and you don't want to just set $3 to a blank char as then you'd get 2 tabs between $2 and $4.
$ cat tst.awk
BEGIN {FS = OFS = "\t"}
{
if (type == "") {
val = $3
for (i=3; i<NF; i++) {
$i = $(i+1)
}
NF--
}
else {
val = "<" type ">"
}
print $2 $1, val, $0
}
$
$ awk -f tst.awk file | tr '\t' '-'
ba-c-a-b-d-e
$
$ awk -v type="foo" -f tst.awk file | tr '\t' '-'
ba-<foo>-a-b-c-d-e
The |tr '\t' '-' is obviously just added to make visible where the tabs are.
If decrementing NF doesn't work in your awk to delete the last field in the record, replace it with sub(/\t[^\t]+$/,"").
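Written out in full, that sub() fallback looks like this (only the field-removal step differs from tst.awk):

```shell
# Variant of tst.awk for awks that don't rebuild the record when NF is
# decremented: after shifting fields 3..NF-1 left, sub() strips the
# now-duplicated last field from $0 instead of NF--.
awk 'BEGIN { FS = OFS = "\t" }
{
    if (type == "") {
        val = $3
        for (i = 3; i < NF; i++)
            $i = $(i + 1)            # shift fields left over $3
        sub(/\t[^\t]+$/, "")         # drop the duplicated last field
    } else {
        val = "<" type ">"
    }
    print $2 $1, val, $0
}' file
```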
One way
awk '{$3=""}1' OFS="\t" infile|column -t
explanation
{$3=""} sets the third column to an empty string.
1 is the same as print: print the line.
OFS="\t" sets the Output Field Separator variable to a tab; you may not need it, since the next command, column -t, reformats the output anyway.
column -t columnates lists.

If statement in GAWK gives an error

I have this piece of code:
gawk '{if (match($5,/hola/,a) && $6=="hola") {print $2"\t"$1"\t"$2"\t"$1"\t"$3} else if (match($5,/(_[joxT]+\.[0-9]*)/,a) && match($6,/(_[joxG]+\.[0-9]*)/,b)) {print $2""a[1]"\t"$1""b[1]} else (match($5,/(_[joxT]+\.[0-9]*)/,a) && $6=="hola") {print "hola"}}' pasted
I'm getting this error:
gawk: cmd. line:1: {if (match($5,/hola/,a) && $6=="hola") {print $2"\t"$1"\t"$2"\t"$1"\t"$3} else if (match($5,/(_[joxT]+\.[0-9]*)/,a) && match($6,/(_[joxG]+\.[0-9]*)/,b)) {print $2""a[1]"\t"$1""b[1]} else (match($5,/(_[joxT]+\.[0-9]*)/,a)) {print $1}}
gawk: cmd. line:1: ^ syntax error
Do you know where the error is?
Thanks.
Take pity on the next guy to maintain your code and indent. Not every program needs to be expressed on one line.
gawk '
BEGIN {OFS = "\t"}
{
if ($5 ~ /hola/ && $6 == "hola") {
print $2, $1, $2, $1, $3
}
else if (match($5, /(_[joxT]+\.[0-9]*)/, a) && match($6, /(_[joxG]+\.[0-9]*)/, b)) {
print $2 a[1], $1 b[1]
}
else if ($5 ~ /(_[joxT]+\.[0-9]*)/ && $6 == "hola") {
print "hola"
}
}
' pasted
Here, match() is used only where you need to capture part of the match.
gawk '{if (match($5,/hola/,a) && $6=="hola") {print $2"\t"$1"\t"$2"\t"$1"\t"$3} else if (match($5,/(_[joxT]+\.[0-9]*)/,a) && match($6,/(_[joxG]+\.[0-9]*)/,b)) {print $2""a[1]"\t"$1""b[1]} else if (match($5,/(_[joxT]+\.[0-9]*)/,a) && $6=="hola") {print "hola"}}' pasted

changing the appearance of awk output

I used the following code to extract protein residues from text files.
awk '{
if (FNR == 1 ) print ">" FILENAME
if ($5 == 1 && $4 > 30) {
printf "%s", $3
}
}
END { printf "\n"}' protein/*.txt > seq.txt
I got the following output when I used the above code.
>1abd
MDEKRRAQHNEVERRRRDKINNWIVQLSKIIPDSSMESTKSGQSKGGILSKASDYIQELRQSNHR>1axc
RQTSMTDFYHSKRRLIFS>1bxc
RQTSMTDFYHSKRRLIFSPRR>1axF
RQTSMTDFYHSKRR>1qqt
ARPYQGVRVKEPVKELLRRKRG
I would like to get the output as shown below. How do I change the above code to get the following output?
>1abd
MDEKRRAQHNEVERRRRDKINNWIVQLSKIIPDSSMESTKSGQSKGGILSKASDYIQELRQSNHR
>1axc
RQTSMTDFYHSKRRLIFS
>1bxc
RQTSMTDFYHSKRRLIFSPRR
>1axF
RQTSMTDFYHSKRR
>1qqt
ARPYQGVRVKEPVKELLRRKRG
This might work for you:
awk '{
if (FNR == 1 ) print newline ">" FILENAME
if ($5 == 1 && $4 > 30) {
newline="\n";
printf "%s", $3
}
}
END { printf "\n"}' protein/*.txt > seq.txt
With gawk version 4, you can write:
gawk '
BEGINFILE {print ">" FILENAME}
($5 == 1 && $4 > 30) {printf "%s", $3}
ENDFILE {print ""}
' filename ...
http://www.gnu.org/software/gawk/manual/html_node/BEGINFILE_002fENDFILE.html#BEGINFILE_002fENDFILE
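If gawk 4 isn't available, a portable approximation keys on FNR == 1 to detect each new file (a sketch; unlike BEGINFILE, this only fires for non-empty files):

```shell
# Portable approximation of the BEGINFILE/ENDFILE version: FNR == 1 marks
# the start of each input file, so we close the previous file's sequence
# line before printing the next header; END closes the final one.
awk '
FNR == 1 { if (NR != 1) print ""; print ">" FILENAME }
$5 == 1 && $4 > 30 { printf "%s", $3 }
END { print "" }
' protein/*.txt > seq.txt
```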