Awk to update file based on match and condition in another - awk

The awk below produces the tab-delimited file1 with the difference $3-$2 calculated for each line and printed in $6. Before the awk is executed, only 5 fields exist.
What I am having trouble with is updating each $2 value in file2 with the $7 value of file1 when the $1 value of file2 matches the $5 of file1 and $6 in file1 is not intron. If the value of $5 is intron, then the value of $7 in file1 is zero. So, for example, line 1 in file1 is intron, so it is equivalent to zero or skipped (those lines are not needed in the calculation).
It is possible that a $1 value in file2 does not exist in file1, and in that case the value of $2 in file2 is zero. Line 3 in file2 is an example and is set to zero because it does not exist in file1. Thank you :).
Awk w/ output
awk '
FNR==NR{                 # while still reading the first file (file1)
  b[$4]=$3-$2;
  next                   # move straight on to the next input line
}
{
  a[$5]+=($3-$2)
}
{
  split($1, b, " "); print b[0], a[b[0]]
}' OFS="\t" file1 file2
Output
-2135
-2135
-2222
-2351
-2351
-2414
File1 tab-delimited
chr5 86667863 86667879 RASA1 intron 16
chr5 86669977 86669995 RASA1 splicing 18
chr5 86670703 86670805 RASA1 exon 102
chr5 86679453 86679547 RASA1 intron 94
chr5 86679571 86679673 RASA1 exon 102
chr19 15088950 15088961 NOTCH2 intron 50
chr19 15288950 15288961 NOTCH3 intron 11
chr19 15308240 15308275 NOTCH3 exon 35
File2 space delimited
RASA1 2135
NOTCH2 0
GIMAP8 87
NOTCH3 129
FOXF2 0
PRB3 63
Desired output after file2 is updated
RASA1 222 `(102+102+18)`
NOTCH2 0
GIMAP8 0
NOTCH3 35 `(35)`
FOXF2 0
PRB3 0
Maybe adding a | after the first awk and piping into:
awk 'FNR==NR { a[$1]=$7; next } { if(a[$5]){$1=a[$5] }; print }'
to update file2?

Could you please try the following. It will print the output in the same order as the Input_file.
awk '
FNR==NR{
  if(!b[$1]++){
    c[++count]=$1
  }
  a[$1]
  next
}
($4 in a) && $5!="intron"{
  a[$4]+=$NF
}
END{
  for(i=1;i<=count;i++){
    print c[i],a[c[i]]?a[c[i]]:0
  }
}' Input_file2 Input_file1
Since your Input_file1 as shown is NOT TAB-delimited (contrary to the description), the command above reads it with the default field separator; in case it really is TAB-delimited, change Input_file2 Input_file1 to Input_file2 FS="\t" Input_file1. To get TAB-delimited output, either pipe the above code's output into the column -t command, or also set OFS="\t" next to FS="\t"; a full invocation along those lines is sketched after the output below.
Output will be as follows.
RASA1 222
NOTCH2 0
GIMAP8 0
NOTCH3 35
FOXF2 0
PRB3 0
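For reference, assuming Input_file1 really is TAB-delimited, the full TAB-aware invocation would look roughly like this (same program as above, only the trailing arguments change):
awk '
FNR==NR{ if(!b[$1]++){ c[++count]=$1 }; a[$1]; next }
($4 in a) && $5!="intron"{ a[$4]+=$NF }
END{ for(i=1;i<=count;i++){ print c[i], (a[c[i]]?a[c[i]]:0) } }
' Input_file2 FS="\t" OFS="\t" Input_file1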

If I understood it correctly, this should do what you expect:
$ awk 'FNR==NR {if ($5!="intron") a[$4]+=$3-$2; next}
       {$2=($1 in a)?a[$1]:0}1' file1 file2 > file2.updated
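Expanded with comments (a sketch of the same logic, not a different approach):
awk '
FNR==NR {                          # first pass: reading file1
    if ($5 != "intron")            # intron rows contribute nothing
        a[$4] += $3 - $2           # sum the interval length per gene name ($4)
    next                           # never fall through to the file2 block
}
{ $2 = ($1 in a) ? a[$1] : 0 }     # second pass: file2 - replace $2, defaulting to 0
1                                  # the bare 1 prints the (possibly modified) line
' file1 file2 > file2.updated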

How to get the file number that is being processed by an awk script?

Suppose I have 2 or more files being processed by an awk script.
$ cat file1
a
b
c
$ cat file2
d
e
How do I get the number of the file currently being processed? Is there a built-in in awk for that?
I want a script with the behavior of the one below. What could I use as my SOMEVARIABLE?
$ awk '{print FILENAME, NR, FNR, SOMEVARIABLE, $0}' file1 file2
file1 1 1 1 a
file1 2 2 1 b
file1 3 3 1 c
file2 4 1 2 d
file2 5 2 2 e
EDIT: Since the OP needs output in a specific format and does NOT want only the count of files, adding the following solution now, which also accounts for empty files (tested and written in GNU awk).
awk '
FNR==1{
  FNUM++
}
{
  print FILENAME, NR, FNR, FNUM, $0
}
ENDFILE{
  if(FNUM==prev){
    FNUM++
    print FILENAME, 0, 0, FNUM, "Empty file"
  }
  prev=FNUM
}' file1 file2
Output for a non-empty file1 and an empty file2 is as follows.
file1 1 1 1 a
file1 2 2 1 b
file1 3 3 1 c
file2 0 0 2 Empty file
Solutions for when one wants to know the total number of files processed by the awk command:
1st solution: Could you please try the following, using GNU awk (assuming you don't want to count empty files here).
awk 'NF{count++;nextfile} END{print count}' Input_file1 Input_file2
2nd solution: In case you only want to know the number of files passed to the awk command, try the following.
awk 'END {print ARGC-1}' Input_file1 Input_file2
Explanation of the above code with examples: let's say the following are the Input_files, where Input_file1 has contents and Input_file2 is an empty file:
cat Input_file1
a
b
c
> Input_file2
Now when we run the ARGC-based command, we get 2, the number of files passed.
awk 'END {print ARGC-1}' Input_file1 Input_file2
2
Now when I run my 1st command, it gives 1, since it does not count the empty file.
awk 'NF{count++;nextfile} END{print count}' Input_file1 Input_file2
1
Well... I managed to do it as follows:
$ awk 'BEGIN{FNUM=0} FNR==1{FNUM++} {print FILENAME, NR, FNR, FNUM, $0}' file1 file2
file1 1 1 1 a
file1 2 2 1 b
file1 3 3 1 c
file2 4 1 2 d
file2 5 2 2 e
I guess there is no built-in variable to help with that, so I created the variable FNUM (for file number). If there is a solution with a built-in variable, please give me a better answer.
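For what it is worth, GNU awk does provide a built-in that behaves exactly like FNUM: ARGIND holds the index in ARGV of the file currently being read, so under gawk the example can be written as:
$ gawk '{print FILENAME, NR, FNR, ARGIND, $0}' file1 file2
file1 1 1 1 a
file1 2 2 1 b
file1 3 3 1 c
file2 4 1 2 d
file2 5 2 2 e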

AWK - Match First 3 Fields, print $6 & $7 from both files on same line

My two input files have changed, and I need to match the first three fields of the two files. When a match is made, I want to print $1 (of the line which was matched), $6 and $7 of file1, and $6 and $7 of file2. The original code was an awk one-liner that matched just the first field.
File1
BSTN-SANJ BSTN SANJ 0 0 50 105910
MRFD-SANJ MRFD SANJ 0 0 40 69105
NYRK-SANJ NYRK SANJ 0 0 30 905010
SANJ-HMRD SANJ HMRD 0 0 25 69010
SANJ-NRFK SANJ NRFK 0 0 38 61506
File2
BSTN-SANJ BSTN SANJ 0 0 45 601251
MRFD-SANJ MRFD SANJ 0 0 39 919591
NYRK-SANJ NYRK SANJ 0 0 25 690155
Output
BSTN-SANJ 50 105910 45 601251
MRFD-SANJ 40 69105 39 919591
NYRK-SANJ 30 905010 25 690155
This will do:
awk -v OFS='\t' '
{key = $1 OFS $2 OFS $3}                 # composite key from the first three fields
NR == FNR {f2[key] = $6 OFS $7; next}    # first pass (file2): remember its $6 and $7
key in f2 {print $1, $6, $7, f2[key]}    # second pass (file1): print only lines whose key was seen
' file2 file1

Modify tab delimited txt file

I want to modify a tab-delimited txt file using Linux commands: sed, awk, or any other method.
This is an example of the tab-delimited txt file which I want to modify for R boxplot input:
----start of input format---------
chr8 38277027 38277127 Ex8_inner
25425 8 100 0.0800000
chr8 38277027 38277127 Ex8_inner
25426 4 100 0.0400000
chr9 38277027 38277127 Ex9_inner
25427 9 100 0.0900000
chr9 38277027 38277127 Ex9_inner
25428 1 100 0.0100000
chr10 38277027 38277127 Ex10_inner
30935 1 100 0.0100000
chr10 38277027 38277127 Ex10_inner
31584 1 100 0.0100000
all 687 1 1000 0.0010000
all 694 1 1000 0.0010000
all 695 1 1000 0.0010000
all 697 1 1000 0.0010000
all 699 6 1000 0.0060000
all 700 2 1000 0.0020000
all 723 7 1000 0.0070000
all 740 8 1000 0.0080000
all 742 1 1000 0.0010000
all 761 5 1000 0.0050000
all 814 2 1000 0.0020000
all 821 48 1000 0.0480000
------end of input file format------
I want it to be modified so that the 4th column of the odd rows becomes the 1st column, and the 2nd column of the even rows (whose 1st column is blank) becomes the 2nd column. Rows starting with "all" get deleted.
This is how output file should look:
-----start of the output file----
Ex8_inner 25425
Ex8_inner 25426
Ex9_inner 25427
Ex9_inner 25428
Ex10_inner 30935
Ex10_inner 31584
-----end of the output file----
EDIT: As the OP has changed the Input_file sample a bit, adding code for it too.
awk --re-interval 'match($0,/Exon[0-9]{1,}/){val=substr($0,RSTART,RLENGTH);getline;sub(/^ +/,"",$1);print val,$1}' Input_file
NOTE: My awk is an old version, so I added --re-interval to it; you need not add it in case you have a recent version.
A single awk like the following may also help with the same:
awk '/Ex[0-9]+_inner/{val=$NF;getline;sub(/^ +/,"",$1);print val,$1}' Input_file
Explanation of the above command:
awk '
/Ex[0-9]+_inner/{   ##If the line matches the string Ex, followed by digits, followed by _inner, do the following actions.
  val=$NF;          ##Create a variable named val whose value is $NF (the last field of the current line).
  getline;          ##Use the built-in getline statement to read the next input line into $0.
  sub(/^ +/,"",$1); ##Use sub to remove any leading spaces from the first field.
  print val,$1      ##Print the variable val and the first field.
}
' Input_file        ##Mention the Input_file name here.
another awk
$ awk '/^all/{next}
!/^chr/{printf "%s\n", $1; next}
{printf "%s ", $NF}' file
Ex8_inner 25425
Ex8_inner 25426
Ex9_inner 25427
Ex9_inner 25428
Ex10_inner 30935
Ex10_inner 31584
or perhaps
$ awk '!/^all/{if(/^chr/) printf "%s", $NF OFS; else print $1}' file
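Annotated, the printf-based pairing works roughly like this (a sketch of the same idea, with the separator hardcoded as a space):
awk '
/^all/   { next }                      # summary rows: skip entirely
/^chr/   { printf "%s ", $NF; next }   # odd rows: print the label (last field), no newline yet
         { print $1 }                  # even rows: the first field completes the pair and ends the line
' file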

awk to count lines in column of file

I have a large file, and I want to use awk to count the unique entries in a specific column, $5, taking only the part before the :, but I seem to be having trouble getting the syntax correct. Thank you :).
Sample Input
chr1 955542 955763 + AGRN:exon.1 1 0
chr1 955542 955763 + AGRN:exon.1 2 0
chr1 955542 955763 + AGRN:exon.1 3 0
chr1 955542 955763 + AGRN:exon.1 4 1
chr1 955542 955763 + AGRN:exon.1 5 1
awk -F: ' NR > 1 { count += $5 } -uniq' Input
Desired output
1
$ awk -F'[ \t:]+' '{a[$5]=1;} END{for (k in a)n++; print n;}' Input
1
-F'[ \t:]+'
This tells awk to use spaces, tabs, or colons as the field separator.
a[$5]=1
As we loop through each line, this adds an entry into associative array a for each value of $5 encountered.
END{for (k in a)n++; print n;}
After we have finished reading the file, this counts the number of keys in associative array a and prints the total.
The idiomatic, portable awk approach:
$ awk '{sub(/:.*/,"",$5)} !seen[$5]++{unq++} END{print unq}' file
1
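Annotated, that idiom reads (same program, just spaced out):
awk '
{ sub(/:.*/, "", $5) }      # strip everything from the first colon onward in $5
!seen[$5]++ { unq++ }       # count a value only the first time it is seen
END { print unq }           # print the number of distinct values
' file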
The briefer but gawk-only (courtesy of length(array)) approach:
$ awk '{sub(/:.*/,"",$5); seen[$5]} END{print length(seen)}' file
1

Obtain patterns from a file, compare to a column of another file, and replace with column of third file, using awk

I have three files, f1.txt, f2.txt, and f3.txt, with different numbers of rows, as given below. I am trying to match the patterns of file2 against file1, and if a match is found, replace the file1 content with the corresponding file3 value for that match. File2 and file3 are similar, but file3 has leading zeros.
File 1:
8841
841
526
548
547
88
98
File 2:
841
526
548
547
File 3:
00841
0526
000548
00547
Desired output, in File 1 or maybe another file:
8841
00841
0526
000548
00547
88
98
I am trying to use the single-line command from a previous post, but that only matches the files and does not replace with the values from the third file when a match is found. I am new to shell scripting, so please give me a single-line command or script which will achieve this. I am open to using sed or any other shell tool.
awk 'BEGIN{i=0}
  FNR==NR { a[i++]=$1; next }
  { for(j=0;j<i;j++)
      if(index($0,a[j]))
        print $0
  }' file2 file1
file2 is of no use; just use file1 and file3. The $0+0 forces a numeric conversion, which strips the leading zeros from the file3 values so that the plain file1 values can be used as lookup keys:
$ awk 'NR==FNR{a[$0+0]=$0; next} {print ($0 in a ? a[$0] : $0)}' file3 file1
8841
00841
0526
000548
00547
88
98
Using your file1 and file3 you can do something like:
$ cat file1
8841
841
526
548
547
88
98
$ cat file3
00841
0526
000548
00547
$ awk 'NR==FNR{x=$1;gsub(/^0+/,"",$1);a[$1]=x;next}($1 in a){print a[$1];next}1' file3 file1
8841
00841
0526
000548
00547
88
98
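Broken out with comments, that one-liner does the following (same logic, just indented):
awk '
NR==FNR {                    # first pass: file3, the zero-padded values
    x = $1                   # keep the padded original
    gsub(/^0+/, "", $1)      # strip leading zeros to form the lookup key
    a[$1] = x                # map unpadded value -> padded value
    next
}
($1 in a) { print a[$1]; next }   # second pass: file1, print the padded form when known
1                                 # otherwise print the line unchanged
' file3 file1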
You can avoid file3, and use printf in awk to format the output with leading zeros.
Using awk
awk 'NR==FNR{a[$1 FS $2 FS $3 FS $4];next} {if ($2 FS $3 FS $4 FS $5 in a) printf "%s %05d %04d %06d %05d %s %s",$1,$2,$3,$4,$5,$6,$7}' file2 file1
8841 00841 0526 000548 00547 88 98
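The field widths in the printf format are what recreate the leading zeros; for example, %05d pads to five digits and %06d to six:
$ awk 'BEGIN{ printf "%05d %06d\n", 841, 548 }'
00841 000548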