awk to substitute value of field based on another

I am trying to use awk to substitute the value of the Classification field NF+1 with the value of the CLINSIG field NF-1 if that value is Benign. I think the awk is close but currently I get an empty file. What's wrong?
input
Chr Start End Ref Alt Func.refGene PopFreqMax CLINSIG Classification
chr1 43395635 43395635 C T exonic 0.12 Benign VUS
chr1 43396414 43396414 G A exonic 0.14 Benign VUS
chr1 172410967 172410967 G A exonic 0.66 VUS
awk
awk -v OFS='\t' '{ if ($(NF-1) == "Benign") sub($(NF+1)=$(NF-1); print $0 }' input
desired output
Chr Start End Ref Alt Func.refGene PopFreqMax CLINSIG Classification
chr1 43395635 43395635 C T exonic 0.12 Benign Benign
chr1 43396414 43396414 G A exonic 0.14 Benign Benign
chr1 172410967 172410967 G A exonic 0.66 VUS

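You probably mean Classification is field NF, not NF+1: it is the last field on the line, so $(NF+1) refers to a field past the end of the record. The sub( call is also unbalanced, which is a syntax error, so awk prints nothing. A plain assignment is enough: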
$ awk -v OFS='\t' '$(NF-1)=="Benign" {$(NF)=$(NF-1)} {print $0 }' input
Chr Start End Ref Alt Func.refGene PopFreqMax CLINSIG Classification
chr1 43395635 43395635 C T exonic 0.12 Benign Benign
chr1 43396414 43396414 G A exonic 0.14 Benign Benign
chr1 172410967 172410967 G A exonic 0.66 VUS
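If the real file is tab-delimited and CLINSIG can be empty (as the third data row suggests), splitting on whitespace shifts the columns on such rows, so $(NF-1) becomes the PopFreqMax value. A minimal sketch that splits on tabs only, assuming tab-separated input; the function name fix_classification is made up for illustration:

```shell
# fix_classification is a hypothetical helper name; assumes tab-separated input.
# usage: fix_classification input > output
fix_classification() {
    awk 'BEGIN { FS = OFS = "\t" }                # split and join on tabs only
         $(NF-1) == "Benign" { $NF = $(NF-1) }    # copy CLINSIG into Classification
         { print }' "$@"
}
```

With tabs as the only separator, an empty CLINSIG stays an empty field and the row passes through unchanged.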

Related

Cut Columns and Append to Same File

I'm working with a tab-separated file on macOS. The file contains 15 columns and thousands of rows. I want to cut columns 1, 2, and 3 and then append columns 11, 12, and 13 below them as extra rows. I was hoping to do this in a pipe so that no extra files need to be created. The only post I found used a command called sponge, but I evidently don't have that on macOS, or it isn't in my Bash.
The input tsv file is actually being generated within the same line of code:
arbitrary command to generate input.tsv | cut -f1-3,11-13 | <Step to cut -f4-6 and append -f1-3> | sort > out.file
Input tsv
chr1 21018 21101 A B C D E F G chr1 20752 21209
chr10 74645 74836 A B C D E F G chr10 74638 74898
chr10 75267 75545 A B C D E F G chr10 75280 75917
chr4 212478 212556 A B C D E F G chr4 212491 213285
Desired Output tsv
chr1 21018 21101
chr1 20752 21209
chr10 74638 74898
chr10 74645 74836
chr10 75280 75917
chr4 212478 212556
chr4 212491 213285
Using perl and awk:
perl -pe 's/chr[0-9]+/\n$&/g' file | awk '/./{print $1, $2, $3}'
Output:
chr1 21018 21101
chr1 20752 21209
chr10 74645 74836
chr10 74638 74898
chr10 75267 75545
chr10 75280 75917
chr4 212478 212556
chr4 212491 213285
Here is a short awk solution:
awk '{print $1, $2, $3; print $11, $12, $13}' input.tsv
Output:
chr1 21018 21101
chr1 20752 21209
chr10 74645 74836
chr10 74638 74898
chr10 75267 75545
chr10 75280 75917
chr4 212478 212556
chr4 212491 213285
Explanation
{ # for each input line
print $1, $2, $3; # print fields 1 to 3 separated by OFS (a space by default), then a newline
print $11, $12, $13; # print fields 11 to 13 the same way, on a line of their own
}
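Since the question asks for a tab-separated result produced inside a pipe, the same idea can be wrapped so that it reads standard input and keeps tabs. This is only a sketch under those assumptions; split_pairs is a hypothetical name:

```shell
# split_pairs is a hypothetical helper name; assumes tab-separated input
# with at least 13 columns, read from stdin or from the files given.
split_pairs() {
    awk 'BEGIN { FS = OFS = "\t" }
         { print $1, $2, $3        # first coordinate triple
           print $11, $12, $13 }   # second triple, on its own line
        ' "$@"
}
```

It would slot into the pipeline from the question as: arbitrary command to generate input.tsv | split_pairs | sort > out.file, with no cut step needed.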

manipulating columns in a text file in awk

I have a tab-separated text file and want to apply a math operation to one column and write the result to a new tab-separated text file.
This is an example of my file:
chr1 144520803 144520804 12 chr1 144520813 58
chr1 144520840 144520841 12 chr1 144520845 36
chr1 144520840 144520841 12 chr1 144520845 36
chr1 144520848 144520849 14 chr1 144520851 32
chr1 144520848 144520849 14 chr1 144520851 32
I want to change the 4th column: divide every element in the 4th column by the sum of all elements in the 4th column, then multiply by 1000000, as in the expected output.
expected output:
chr1 144520803 144520804 187500 chr1 144520813 58
chr1 144520840 144520841 187500 chr1 144520845 36
chr1 144520840 144520841 187500 chr1 144520845 36
chr1 144520848 144520849 218750 chr1 144520851 32
chr1 144520848 144520849 218750 chr1 144520851 32
I am trying to do that in awk using the following command, but it does not return what I want. Do you know how to fix it?
awk '{print $1 "\t" $2 "\t" $3 "\t" $4/{sum+=$4}*1000000 "\t" $5 "\t" $6 "\t" $7}' myfile.txt > new_file.txt
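You need two passes over the file: one to compute the sum of column 4, and a second to scale each value. Something like this (file{,} is bash brace expansion for file file, so awk reads the same file twice):
$ awk -v OFS='\t' 'NR==FNR {sum+=$4; next}
{$4*=(1000000/sum)}1' file{,} > newfile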
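When the data arrives on a pipe and cannot be read twice, one alternative is to buffer the lines and do the scaling in the END block. This is a sketch rather than a replacement for the two-pass command: it assumes tab-separated input that fits in memory, and scale_col4 is a hypothetical name.

```shell
# scale_col4 is a hypothetical helper name; assumes tab-separated input
# small enough to buffer in memory.
scale_col4() {
    awk 'BEGIN { FS = OFS = "\t" }
         { line[NR] = $0; sum += $4 }          # buffer every line, add up column 4
         END {
             if (sum == 0) exit 1              # avoid dividing by zero
             for (i = 1; i <= NR; i++) {
                 $0 = line[i]                  # re-split the buffered line
                 $4 = $4 * 1000000 / sum       # scale column 4
                 print
             }
         }' "$@"
}
```

The trade-off is memory: the two-pass version holds only the running sum, while this one holds the whole input.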

awk to filter file using another capturing all instances

In the awk below I am trying to capture every occurrence of KCNMA1, the line in gene (a one-column list of names), in $8 of file, which is tab-delimited.
So in the example below, all lines where KCNMA1 appears in $8 would be printed to the output.
$8 may contain several ;-separated names, and a line should match whenever the name, in this case KCNMA1, is among them. The awk seems to capture 2 of the possible 4 conditions, but not all instances, as the current output shows. Thank you :).
gene
KCNMA1
file
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene
4629 chr10 78944590 78944590 G A intergenic NONE;KCNMA1 dist=NONE;dist=451371
4630 chr10 79396463 79396463 C T intronic KCNMA1 .
4631 chr10 79397777 79397777 C - exonic KCNMA1;X1X .
4632 chr10 81318663 81318663 C G exonic SFTPA2 .
4633 chr10 89397777 89397777 - GAA exonic NONE;X1X;KCNMA1 .
current output
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene
1 chr10 79396463 79396463 C T intronic KCNMA1 .
2 chr10 79397777 79397777 C - exonic KCNMA1;X1X .
desired output (tab-delimited)
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene
4629 chr10 78944590 78944590 G A intergenic NONE;KCNMA1 dist=NONE;dist=451371
4630 chr10 79396463 79396463 C T intronic KCNMA1 .
4631 chr10 79397777 79397777 C - exonic KCNMA1;X1X .
4633 chr10 89397777 89397777 - GAA exonic NONE;X1X;KCNMA1 .
awk
awk -F'\t' 'NR==FNR{a[$0];next} FNR==1{print} {x=$8; sub(/;.*/,"",x)} x in a{$1=++c; print}' gene file > out
For a single gene, just pass it as a variable:
$ awk -v gene='KCNMA1' -v d=';' 'NR==1 || d $8 d ~ d gene d' file
The counter you're using seems unnecessary, since you want to keep the original first field.
If you want to support a file-based gene list, you can use this:
$ awk -v d=';' 'NR==FNR {genes[$0]; next}
FNR==1;
{for(g in genes)
if(d $8 d ~ d g d) print}' gene file
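A variant of the file-based approach splits $8 on ";" and tests each piece for exact membership in the gene list. That avoids treating gene names as regular expressions and prints a line once even when several listed genes occur in it. A minimal sketch, assuming a tab-delimited file with the names in column 8; filter_genes is a hypothetical name:

```shell
# filter_genes is a hypothetical helper name.
# usage: filter_genes gene file > out
filter_genes() {
    awk -F'\t' 'NR == FNR { genes[$0]; next }     # first file: load the gene list
         FNR == 1 { print; next }                 # second file: keep the header
         { n = split($8, parts, ";")              # names in column 8
           for (i = 1; i <= n; i++)
               if (parts[i] in genes) { print; break } }' "$1" "$2"
}
```

Because no field is assigned, each matching line is printed exactly as it appears in the input, tabs included.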

awk split carrying-over whitespace

The awk split below appears to leave the whitespace after `$4` in the output, and I cannot seem to prevent it. What is the correct syntax? Thank you :).
input
chr1 955543 955763 + AGRN-6|pr=2|gc=75
chr1 957571 957852 + AGRN-7|pr=3|gc=61.2
chr1 970621 970740 + AGRN-8|pr=1|gc=57.1
Current output
chr1 955543 955763 + AGRN-6|gc=75
chr1 957571 957852 + AGRN-7|gc=61.2
chr1 970621 970740 + AGRN-8|gc=57.1
gawk '{print gensub(/(^[^|]+)\|[^|]+([|][^+]+).*/,"\\1\\2","g",$0)}' input
edit
chr1^I955543^I955763^I+ AGRN-6|gc=75$
chr1^I957571^I957852^I+ AGRN-7|gc=61.2$
chr1^I970621^I970740^I+ AGRN-8|gc=57.1$
desired
chr1^I955542^I955662^I+^IAGRN_70$
chr1^I955643^I955763^I+^IAGRN_71$
chr1^I957570^I957690^I+^IAGRN_72$
Another curious awk alternative:
awk '{print $1""$2}' FS='pr=[0-9]\\|' file
Results
chr1 955543 955763 + AGRN-6|gc=75
chr1 957571 957852 + AGRN-7|gc=61.2
chr1 970621 970740 + AGRN-8|gc=57.1
Explanation
The value of FS can be any regex, so we can use pr=[0-9]\| (with the pipe escaped so that it is matched literally) as the separator and print the fields before and after it.
awk rewrites the line using OFS whenever a field is modified. If you want to preserve the input spacing, you can choose a simpler solution with sed:
sed -r 's/\|.*\|/\|/' file
chr1 955543 955763 + AGRN-6|gc=75
chr1 957571 957852 + AGRN-7|gc=61.2
chr1 970621 970740 + AGRN-8|gc=57.1
awk '{n=split($5, a, "|"); print $1,$2,$3,$4" "a[1]"|"a[3]}' OFS="\t" input
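If the awk split line above is the attempt from the question, the literal " " between $4 and a[1] is the space that shows up in the output. A minimal sketch that instead assigns the trimmed value back to $5, so awk rejoins every field with OFS; drop_pr is a hypothetical name, and the five-column layout is assumed from the sample input:

```shell
# drop_pr is a hypothetical helper name; assumes five whitespace-separated
# columns, with the fifth of the form name|pr=...|gc=...
drop_pr() {
    awk 'BEGIN { OFS = "\t" }
         { n = split($5, a, "|")      # a[1] = name, a[2] = pr=..., a[3] = gc=...
           $5 = a[1] "|" a[3]         # assigning a field rebuilds $0 with OFS
           print }' "$@"
}
```

Every separator in the output, including the one after the + column, is then a tab.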

How to pull out all lines of a file matching each line from another file and output into separate rows?

This is a similar question to what has been previously asked (see below for link) but this time I would like to output the common strings into rows instead of columns as shown below:
I have two files, each with one column that look like this:
File 1
chr1 106623434
chr1 106623436
chr1 106623442
chr1 106623468
chr1 10699400
chr1 10699405
chr1 10699408
chr1 10699415
chr1 10699426
chr1 10699448
chr1 110611528
chr1 110611550
chr1 110611552
chr1 110611554
chr1 110611560
File 2
chr1 1066234
chr1 106994
chr1 1106115
I want to search file 1 and pull out all lines that match line 1 of file 2, and output all of those matches on their own line. Then I want to do the same for line 2 of file 2, and so on, until the matches for every line of file 2 have been found in file 1 and written to their own row. I am also working with very large files, so I need something that won't require file 2 to be completely stored in memory; otherwise it will not run to completion. Hopefully the output will look something like this:
chr1 106623434 chr1 106623436 chr1 106623442 chr1 106623468
chr1 10699400 chr1 10699405 chr1 10699408 chr1 10699415 chr1 10699426 chr1 10699448
chr1 110611528 chr1 110611550 chr1 110611552 chr1 110611554 chr1 110611560
Similar question at:
How to move all strings in one file that match the lines of another to columns in an output file?
As long as your patterns don't overlap completely, this should work:
$ while read p; do grep "$p" file1 | tr '\n' '\t'; echo ""; done < file2
chr1 106623434 chr1 106623436 chr1 106623442 chr1 106623468
chr1 10699400 chr1 10699405 chr1 10699408 chr1 10699415 chr1 10699426 chr1 10699448
chr1 110611528 chr1 110611550 chr1 110611552 chr1 110611554 chr1 110611560
You could do this as it uses close to zero memory but it'll be very slow since it reads the whole of "file1" once for every line of "file2":
$ cat tst.awk
{
    ofs = ors = ""
    while ( (getline line < "file1") > 0) {
        if (line ~ "^"$0) {
            printf "%s%s", ofs, line
            ofs = "\t"
            ors = "\n"
        }
    }
    printf ors
    close("file1")
}
$ awk -f tst.awk file2
chr1 106623434 chr1 106623436 chr1 106623442 chr1 106623468
chr1 10699400 chr1 10699405 chr1 10699408 chr1 10699415 chr1 10699426 chr1 10699448
chr1 110611528 chr1 110611550 chr1 110611552 chr1 110611554 chr1 110611560
You can try:
awk -v OFS="\t" '
NR==FNR {                   # first file only (file2)
    keys[++i] = $0          # store each pattern to search for; i counts the keys
    next                    # stop processing this record and go on to the next
}
{
    for (j = 1; j <= i; ++j)
        # if the line starts with the key, append it to the row for that key
        if ($0 ~ "^" keys[j])
            a[keys[j]] = a[keys[j]] (a[keys[j]] != "" ? OFS : "") $0
}
END {
    for (j = 1; j <= i; ++j) print a[keys[j]]   # print the formatted rows
}' file2 file1
You get:
chr1 106623434 chr1 106623436 chr1 106623442 chr1 106623468
chr1 10699400 chr1 10699405 chr1 10699408 chr1 10699415 chr1 10699426 chr1 10699448
chr1 110611528 chr1 110611550 chr1 110611552 chr1 110611554 chr1 110611560