Replacing with condition on two files awk

Using these examples:
File1:
rs12124819 1 0.020242 776546 A G
rs28765502 1 0.022137 832918 T C
rs7419119 1 0.022518 842013 T G
rs950122 1 0.022720 846864 G C
File2:
1_752566 1 0 752566 G A
1_776546 1 0 776546 A G
1_832918 1 0 832918 T C
1_842013 1 0 842013 T G
I am trying to replace the 1st column of file2 with the corresponding 1st column of file1 when their 4th columns are equal.
Expected output:
rs12124819 1 0 752566 G A
rs28765502 1 0 776546 A G
rs7419119 1 0 832918 T C
rs950122 1 0 842013 T G
I tried to create two arrays but couldn't find the correct way to use them:
awk 'FNR==NR{a[$4],b[$1];next} ($4) in a{$1=b[FNR]}1' file1 file2 > out.txt
Thanks a lot!

With your shown samples, please try the following. Written and tested in GNU awk.
awk 'FNR==NR{a[$4]=$1;next} ($4 in a){$1=a[$4]} 1' file1 file2
Explanation: a detailed explanation of the above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition if FNR==NR which will be TRUE when file1 is being read.
a[$4]=$1 ##Creating array a whose index is $4 and value is $1.
next ##next will skip all further statements from here.
}
($4 in a){ ##Checking condition if 4th field is present in a then do following.
$1=a[$4] ##Setting value of 1st field of file2 as array a value with index of 4th column
}
1 ##1 will print edited/non-edited line.
' file1 file2 ##mentioning Input_file names here.
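To see what this key-based lookup does with the question's sample data, here is a minimal, self-contained sketch. The temporary directory and the `result` variable are only scaffolding for the illustration and are not part of the answer.

```shell
# Illustrative scaffolding: recreate the question's sample files in a temp dir.
tmpdir=$(mktemp -d)
printf '%s\n' \
    'rs12124819 1 0.020242 776546 A G' \
    'rs28765502 1 0.022137 832918 T C' \
    'rs7419119 1 0.022518 842013 T G' \
    'rs950122 1 0.022720 846864 G C' > "$tmpdir/file1"
printf '%s\n' \
    '1_752566 1 0 752566 G A' \
    '1_776546 1 0 776546 A G' \
    '1_832918 1 0 832918 T C' \
    '1_842013 1 0 842013 T G' > "$tmpdir/file2"

# First pass (file1): remember the ID for each position ($4).
# Second pass (file2): swap in the ID only when the position is known.
result=$(awk 'FNR==NR { a[$4] = $1; next } ($4 in a) { $1 = a[$4] } 1' \
    "$tmpdir/file1" "$tmpdir/file2")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

The first line of file2 keeps its original ID (1_752566) because position 752566 does not occur in file1. This differs from the question's expected output, which pairs the files line by line instead of by the 4th column.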

You may try this awk:
awk 'FNR==NR {map[FNR] = $1; next} {$1 = map[FNR]} 1' file1 file2 | column -t
rs12124819 1 0 752566 G A
rs28765502 1 0 776546 A G
rs7419119 1 0 832918 T C
rs950122 1 0 842013 T G

Another alternative (if the files are sorted on the join key, as in the sample data):
$ join -j4 -o1.1,2.2,2.3,2.4,2.5,2.6 file1 file2 | column -t
rs12124819 1 0 776546 A G
rs28765502 1 0 832918 T C
rs7419119 1 0 842013 T G
Note that your input files have only 3 matching records.
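If the files are not already sorted on the 4th column, one option is to sort them first. The following is only a sketch: the shuffled copy of file2, the `.sorted` file names, and the `result` variable are assumptions made for the illustration. `LC_ALL=C` is set so that sort and join agree on the ordering.

```shell
# Illustrative scaffolding: file2 is deliberately out of order here.
tmpdir=$(mktemp -d)
printf '%s\n' \
    'rs12124819 1 0.020242 776546 A G' \
    'rs28765502 1 0.022137 832918 T C' \
    'rs7419119 1 0.022518 842013 T G' \
    'rs950122 1 0.022720 846864 G C' > "$tmpdir/file1"
printf '%s\n' \
    '1_842013 1 0 842013 T G' \
    '1_752566 1 0 752566 G A' \
    '1_832918 1 0 832918 T C' \
    '1_776546 1 0 776546 A G' > "$tmpdir/file2"

# join needs both inputs sorted on the join field, so sort on column 4 first.
LC_ALL=C sort -k4,4 "$tmpdir/file1" > "$tmpdir/file1.sorted"
LC_ALL=C sort -k4,4 "$tmpdir/file2" > "$tmpdir/file2.sorted"

# Join on column 4 of both files; take the ID from file1 and the rest from file2.
result=$(LC_ALL=C join -1 4 -2 4 -o 1.1,2.2,2.3,2.4,2.5,2.6 \
    "$tmpdir/file1.sorted" "$tmpdir/file2.sorted")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

As with the join command above, lines without a partner in the other file are dropped, and the output comes out in key order rather than in the original order of file2.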

Related

using awk remove filtered group

I have this input:
1 a 0,9
1 b 0,8
1 c 0,1
2 d 0,5
3 e 0,1
3 f 0,7
4 g 0,4
4 h 0,3
4 i 0,2
4 j 0,1
Using awk, I want to remove filtered groups:
if the third column of any row is greater than 0.6, I want to remove every row that has the same first column.
Desired Output:
2 d 0,5
4 g 0,4
4 h 0,3
4 i 0,2
4 j 0,1
I have used this, but it doesn't delete the other rows of the group.
awk '($3 < 0.6)' file
With your shown samples, please try the following.
awk '
FNR==NR{
temp=$3
sub(/,/,".",temp)
if(temp+0>0.6){
noCount[$1]
}
next
}
!($1 in noCount)
' Input_file Input_file
Output will be as follows.
2 d 0,5
4 g 0,4
4 h 0,3
4 i 0,2
4 j 0,1
Explanation: a detailed explanation of the above.
awk ' ##Starting awk program from here.
FNR==NR{ ##This condition will be TRUE while Input_file is being read for the first time.
temp=$3 ##Creating temp with 3rd field value here.
sub(/,/,".",temp) ##Substituting comma with dot in temp here.
if(temp+0>0.6){ ##Checking condition if temp (forced to a number by +0) is greater than 0.6 then do following.
noCount[$1] ##Creating noCount with index of 1st field.
}
next ##next will skip all further statements from here.
}
!($1 in noCount) ##If 1st field is NOT present in noCount then print line.
' Input_file Input_file ##Mentioning Input_file names here.
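The command above reads Input_file twice, which is not possible when the data arrives on a pipe. As a hedged alternative, the sketch below reads the input once and buffers it in memory; the array names `drop`, `line`, and `group` and the `result` variable are made up for this illustration.

```shell
# Feed the question's sample rows through a pipe and filter in a single pass.
result=$(printf '%s\n' '1 a 0,9' '1 b 0,8' '1 c 0,1' '2 d 0,5' '3 e 0,1' \
    '3 f 0,7' '4 g 0,4' '4 h 0,3' '4 i 0,2' '4 j 0,1' |
awk '
{
    val = $3
    sub(/,/, ".", val)      # decimal comma -> decimal point
    if (val + 0 > 0.6)      # "+ 0" forces a numeric comparison
        drop[$1]            # mark the whole group for removal
    line[NR] = $0           # buffer the record ...
    group[NR] = $1          # ... and its group key
}
END {
    # Replay the buffered records, skipping every marked group.
    for (i = 1; i <= NR; i++)
        if (!(group[i] in drop))
            print line[i]
}')
printf '%s\n' "$result"
```

The trade-off is memory: every line is held until the end of input, so the two-pass version is the safer choice for very large files.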

Unix replacing values with 2 conditions from 2 files

I am having an issue that I almost solved thanks to this post. Using a dataset in the same format:
File 1
32074_32077 1 0.008348 834830 G A
32082_32085 1 0.008349 834928 A G
32085_32088 2 0.008350 834928 G A
32903_32906 5 0.008468 846808 C T
File 2
rs3094315 1 0.020130 752566 G A
rs12124819 1 0.020242 834928 A G
rs28765502 2 0.022137 834928 T C
rs7419119 3 0.022518 846808 T G
I would like to change the 1st column of file1 only if both $2 and $4 match the same line in file2. Otherwise I would like to keep the line as it is.
Expected output:
32074_32077 1 0.008348 834830 G A
rs12124819 1 0.008349 834928 A G
rs28765502 2 0.008350 834928 G A
32903_32906 5 0.008468 846808 C T
Using the answer from the linked post, I cannot get the expected output. I tried this:
awk 'FNR==NR{a[$4]=$1; b[$2]=$1; next} ($4 in a && $2 in b){$1=a[$4]} 1' file1 file2
It doesn't work as expected because the condition $2 in b is always true. I understand why, but I don't know how to work around it.
Thank you.
You may use this awk:
awk 'FNR==NR {a[$2,$4]=$1; next} ($2,$4) in a {$1 = a[$2,$4]} 1' file2 file1 |
column -t
32074_32077 1 0.008348 834830 G A
rs12124819 1 0.008349 834928 A G
rs28765502 2 0.008350 834928 G A
32903_32906 5 0.008468 846808 C T
This uses an array a with the composite key ($2,$4).
column -t is used only to show tabular output.
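In awk, `a[$2,$4]` is shorthand for `a[$2 SUBSEP $4]`: the two values are joined into a single string key, so a line matches only when both columns match together. The sketch below writes the key explicitly in the first rule and uses the comma form in the second, to show that they address the same element. The temp directory and the `result` variable are illustration scaffolding.

```shell
# Illustrative scaffolding: recreate the question's two files in a temp dir.
tmpdir=$(mktemp -d)
printf '%s\n' \
    '32074_32077 1 0.008348 834830 G A' \
    '32082_32085 1 0.008349 834928 A G' \
    '32085_32088 2 0.008350 834928 G A' \
    '32903_32906 5 0.008468 846808 C T' > "$tmpdir/file1"
printf '%s\n' \
    'rs3094315 1 0.020130 752566 G A' \
    'rs12124819 1 0.020242 834928 A G' \
    'rs28765502 2 0.022137 834928 T C' \
    'rs7419119 3 0.022518 846808 T G' > "$tmpdir/file2"

result=$(awk '
FNR == NR { a[$2 SUBSEP $4] = $1; next }   # file2: explicit form of a[$2,$4]
($2,$4) in a { $1 = a[$2,$4] }             # file1: same key via the comma shorthand
1
' "$tmpdir/file2" "$tmpdir/file1")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

The last line of file1 is left alone even though 846808 occurs in file2, because its 2nd column (5) does not match the 3 stored with that position.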

How can I use awk to calculate sum and replace column in file

I'm new to the site and to the programming world and I hope you have time to help me.
My problem is as follows: I have a file with several columns. The 2nd column contains values. I'm trying to add a given number to each of these values and to replace the second column with a new column containing the results.
Here an example of my input:
A B C
x 1 t
y 2 u
z 3 v
I want to add 5 to the values in column B and obtain an output like the one below:
A B C
x 6 t
y 7 u
z 8 v
The code I tried unsuccessfully is
zcat my_file.vcf.gz| tail -n +49 | awk 'BEGIN{FS=OFS="\t"} {print $0, $2+5}'>my.output.vcf
Thanks in advance
You can avoid tail, since printing from the 49th line onward can be handled within awk itself. You also need to add the value to the 2nd field and then print the whole line with the print command.
An important point: as per the OP's sample, if the 2nd field contains letters then 5 must NOT be added to it, so that condition is handled here too.
zcat my_file.vcf.gz |
awk '
BEGIN{ FS=OFS="\t" }
FNR>=49{
$2=($2~/[a-zA-Z]/?$2:$2+5)
print
}
' > my.output.vcf
You can use
awk 'BEGIN{FS=OFS="\t"} {$2+=5}1'
Here, $2+=5 adds 5 to the field 2 value, and 1 triggers the display of the record (row, line; same as print $0).
See an online awk demo:
#!/bin/bash
s='A B C
x 1 t
y 2 u
z 3 v'
awk 'BEGIN{FS=OFS="\t"} {$2+=5}1' <<< "$s"
Output:
A 5 C
x 6 t
y 7 u
z 8 v
Another form for clarity:
awk 'BEGIN{FS=OFS="\t"} {print $1, $2+5, $3}'
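The header line turns into `A 5 C` in the output above because awk converts a string to a number before adding, and a string that does not start with a number counts as 0. A small sketch of that coercion (the `result` variable is only there for the illustration):

```shell
# "B" has no numeric prefix (0), "3x" has the prefix 3, "" is empty (0).
result=$(awk 'BEGIN { print "B" + 5, "3x" + 5, "" + 5 }')
printf '%s\n' "$result"
```

This prints `5 8 5`, which is why the header has to be skipped or tested before the addition when it must survive unchanged.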
You can use:
awk 'BEGIN {FS=OFS="\t"} NR == 1 {print $0} NR > 1 {print $1,($2+5),$3;}'
output:
A B C
x 6 t
y 7 u
z 8 v
Maybe this can help you:
awk 'NR > 1 {$2 += 5} 1' file
Applied to your code:
zcat my_file.vcf.gz | tail -n +49 | awk 'NR > 1 {$2 += 5} 1' > my.output.vcf
cat boo
A B C
x 1 t
y 2 u
z 3 v
cat boo | awk 'BEGIN{FS=OFS="\t"} $2 ~ /^[0-9]+$/ {print $1, $2+5, $3} $2 !~ /^[0-9]+$/ {print} '
A B C
x 6 t
y 7 u
z 8 v

How to get the filenumber that is being processing by an awk script?

Suppose I have 2 or more files being processed by an awk script.
$ cat file1
a
b
c
$ cat file2
d
e
How do I get the number of the file being processed? Is there a built-in awk variable for that?
I want to have a script with the behavior of the one below. What could I use as my SOMEVARIABLE?
$ awk '{print FILENAME, NR, FNR, SOMEVARIABLE, $0}' file1 file2
file1 1 1 1 a
file1 2 2 1 b
file1 3 3 1 c
file2 4 1 2 d
file2 5 2 2 e
EDIT: Since the OP needs output in a specific format and does NOT want only a count of files, adding the following solution, which should count empty files too (written and tested in GNU awk).
awk '
FNR==1{
FNUM++
}
{
print FILENAME, NR, FNR, FNUM, $0
}
ENDFILE{
if(FNUM==prev){
FNUM++
print FILENAME, 0, 0, FNUM, "Empty file"
}
prev=FNUM
}' file1 file2
Output for a non-empty file1 and an empty file2 is as follows.
file1 1 1 1 a
file1 2 2 1 b
file1 3 3 1 c
file2 0 0 2 Empty file
Solutions when one wants to know the total number of files processed by the awk command:
1st solution: Try the following, using GNU awk (assuming you don't want to count empty files here).
awk 'NF{count++;nextfile} END{print count}' Input_file1 Input_file2
2nd solution: In case you only want to know the number of files passed to the awk command, try the following.
awk 'END {print ARGC-1}' Input_file1 Input_file2
Explanation of the above code with examples: Let's say the following are the input files, where Input_file1 has contents and Input_file2 is an empty file:
cat Input_file1
a
b
c
> Input_file2
Now when we run the ARGC command, we get 2 files as output.
awk 'END {print ARGC-1}' Input_file1 Input_file2
2
Now when I run my 1st command, it gives 1 file, since it does not count the empty file.
awk 'NF{count++;nextfile} END{print count}' Input_file1 Input_file2
1
Well... I managed to do it as follows:
$ awk 'BEGIN{FNUM=0} FNR==1{FNUM++} {print FILENAME, NR, FNR, FNUM, $0}' file1 file2
file1 1 1 1 a
file1 2 2 1 b
file1 3 3 1 c
file2 4 1 2 d
file2 5 2 2 e
I guess there is no built-in variable to help with that, so I created the variable FNUM (for file number). If there is a solution with a built-in variable, please give me a better answer.
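For what it's worth, GNU awk (not POSIX awk) does have a built-in for this: ARGIND, the index in ARGV of the file currently being read. The sketch below assumes gawk is installed and falls back to a placeholder message otherwise; the extra file named `empty`, the temp directory, and the `result` variable are made up for the illustration.

```shell
# Scaffolding for illustration: file1, an empty file, and file2 in a temp dir.
tmpdir=$(mktemp -d)
printf '%s\n' a b c > "$tmpdir/file1"
: > "$tmpdir/empty"
printf '%s\n' d e > "$tmpdir/file2"

if command -v gawk >/dev/null 2>&1; then
    # ARGIND is a GNU awk extension; it is not set by other awk implementations.
    result=$(cd "$tmpdir" && gawk '{ print FILENAME, NR, FNR, ARGIND, $0 }' file1 empty file2)
else
    result="gawk not found"
fi
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

Because ARGIND follows ARGV, file2 is reported as file number 3 here even though the empty file produced no records, whereas the FNR==1 counter would report 2. Command-line assignments such as var=value also occupy an ARGV slot, so the two approaches can differ there as well.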

Join two columns from different files with awk

I want to join two columns from two different files using awk. These files look like (A, B, C, 0, 1, 2, etc are columns)
file1:
A B C D E F
file2:
0 1 2 3 4 5
And I want to be able to select arbitrary columns in my output. I.e., I want the output to be:
A C E 4 5
I've seen a million answers with the following awk code (and very similar ones), offering no explanation. But none of them address the exact problem I want to solve:
awk 'FNR==NR{a[FNR]=$2;next};{$NF=a[FNR]};1' file2 file1
awk '
NR==FNR {A[$1,$3,$6] = $0; next}
($1 SUBSEP $2 SUBSEP $3) in A {print A[$1,$2,$3], $4}
' A.txt B.txt
But none of them seem to do what I want and I am not able to understand them.
So, how can I achieve the desired output using awk? (and please, offer an explanation, I want to actually learn)
Note:
I know I can do this using something like
paste <(awk '{print $1}' file1) <(awk '{print $2}' file2)
As I said, I'm trying to learn and understand awk.
With GNU awk for true multi-dimensional arrays and ARGIND:
$ awk -v flds='1 1 1 3 1 5 2 5 2 6' '
BEGIN{ nf = split(flds,o) }
{ f[ARGIND][1]; split($0,f[ARGIND]) }
NR!=FNR { for (i=2; i<=nf; i+=2) printf "%s%s", f[o[i-1]][o[i]], (i<nf?OFS:ORS) }
' file1 file2
A C E 4 5
The "flds" string is just a series of <file number> <field number in that file> pairs so you can print the fields from each file in whatever order you like, e.g.:
$ awk -v flds='1 1 2 2 1 3 2 4 1 5 2 6' 'BEGIN{nf=split(flds,o)} {f[ARGIND][1]; split($0,f[ARGIND])} NR!=FNR{for (i=2; i<=nf; i+=2) printf "%s%s",f[o[i-1]][o[i]], (i<nf?OFS:ORS)}' file1 file2
A 1 C 3 E 5
$ awk -v flds='2 1 1 2 2 3 1 4 2 5' 'BEGIN{nf=split(flds,o)} {f[ARGIND][1]; split($0,f[ARGIND])} NR!=FNR{for (i=2; i<=nf; i+=2) printf "%s%s",f[o[i-1]][o[i]], (i<nf?OFS:ORS)}' file1 file2
0 B 2 D 4
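For comparison, the same idea can be sketched in plain POSIX awk without ARGIND or arrays of arrays, at the cost of hard-coding the column choice. This is an illustrative assumption, not the answer's method: the array name `left`, the temp files, and the `result` variable are invented for the example.

```shell
# Illustrative scaffolding: the question's one-line files in a temp dir.
tmpdir=$(mktemp -d)
printf 'A B C D E F\n' > "$tmpdir/file1"
printf '0 1 2 3 4 5\n' > "$tmpdir/file2"

result=$(awk '
# While reading file1 (NR == FNR), keep columns 1, 3, 5 under the line number.
NR == FNR { left[FNR] = $1 OFS $3 OFS $5; next }
# While reading file2, print the saved part followed by its columns 5 and 6.
{ print left[FNR], $5, $6 }
' "$tmpdir/file1" "$tmpdir/file2")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

NR counts lines across all files while FNR restarts for each file, so NR == FNR is true only for the first file. The two files are paired by line number, which assumes they have the same number of lines.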