awk to count and rename symbol in field - awk

I am trying to count occurrences of a symbol (-) in $5 (Ref) and output that symbol renamed, along with its count, using awk. The input file is tab-delimited, and the awk below is close but outputs extra data with incorrect counts, and I'm not sure how to fix it. Thank you :).
awk
awk -F'\t' 'BEGIN {printf "Category\tCount\n" } $5 ~ /-/ {printf "indel" } {a[$5]++} END { for (i in a) {printf "%s\t\t%s\n",i , a[i] }}' input
input
Index Mutation Call Start End Ref Alt Func.refGene Gene.refGene ExonicFunc.refGene Sanger
13 c.[1035-3T>C]+[1035-3T>C] 166170127 166170127 T C intronic SCN2A
16 c.[2994C>T]+[=] 166210776 166210776 C T exonic SCN2A synonymous SNV
19 c.[4914T>A]+[4914T>A] 166245230 166245230 T A exonic SCN2A synonymous SNV
20 c.[5109C>T]+[=] 166245425 166245425 C T exonic SCN2A synonymous SNV
21 c.[5139C>T]+[=] 166848646 166848646 G A exonic SCN1A synonymous SNV
22 c.3152_3153insAACCACT 166892841 166892841 - AGTGGTT exonic SCN1A frameshift insertion TP
23 c.2044-5delT 166898947 166898947 A - intronic SCN1A
25 c.1530_1531insA 166901684 166901684 - T exonic SCN1A frameshift insertion FP
current output
Category Count
indelindelindelindel 5
A 4
C 7
Ref 1
- 4
G 2
T 6
TCCT 1
desired output
Category Count
indel 2

here you go...
$ awk -F'\t' '$5=="-"{count++}
END{print "Category","Count";
print "indel",count}' file |
column -t
Category Count
indel 2
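The original attempt fails for two reasons: `printf "indel"` fires once per matching line with no newline (hence `indelindelindelindel`), and the `END` loop prints a count for every distinct $5 value rather than just `-`. A minimal runnable sketch of the accepted approach, on a hypothetical three-row tab-delimited excerpt of the sample:

```shell
#!/bin/sh
# Build a small tab-delimited sample: field 5 is the Ref column.
printf '22\tc.3152_3153insAACCACT\t166892841\t166892841\t-\tAGTGGTT\n'  > input.tsv
printf '23\tc.2044-5delT\t166898947\t166898947\tA\t-\n'                >> input.tsv
printf '25\tc.1530_1531insA\t166901684\t166901684\t-\tT\n'             >> input.tsv

# Count only the rows whose 5th field is exactly "-", and label them "indel".
awk -F'\t' '$5=="-"{count++} END{print "Category\tCount"; print "indel\t" count}' input.tsv
```

Note that `$5=="-"` tests exact equality, so it only matches when the whole Ref field is the dash; the original `$5 ~ /-/` regex also matched any field merely containing a dash.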

Related

Conserve header while joining files in bash

I have these 2 tab-separated files:
fileA.tsv
probeId sample1_betaval sample2_betaval sample3_betaval
a 1 2 3
b 4 5 6
c 7 8 9
fileB.tsv
probeId region gene
a intronic tp53
b non-coding NA
c exonic kras
As they are already sorted by probeId, I've merged both files:
join -j 1 fileA.tsv fileB.tsv -t $'\t' > complete.tsv
The problem is that the output does not preserve the headers:
a 1 2 3 intronic tp53
b 4 5 6 non-coding NA
c 7 8 9 exonic kras
While my desired output is:
probeId sample1_betaval sample2_betaval sample3_betaval region gene
a 1 2 3 intronic tp53
b 4 5 6 non-coding NA
c 7 8 9 exonic kras
How can I achieve that?
Add the --header option if your join provides it:
join --header -j 1 fileA.tsv fileB.tsv -t $'\t' > complete.tsv
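`--header` is a GNU extension, so it is not available everywhere. Where it is missing, a sketch of a portable fallback: join the header rows (they share the key `probeId`) and the bodies separately, then concatenate. The `printf` lines below just recreate the question's sample files:

```shell
#!/bin/sh
tab=$(printf '\t')

# Recreate the two sample files from the question.
printf 'probeId\tsample1_betaval\tsample2_betaval\tsample3_betaval\n'  > fileA.tsv
printf 'a\t1\t2\t3\nb\t4\t5\t6\nc\t7\t8\t9\n'                         >> fileA.tsv
printf 'probeId\tregion\tgene\n'                                       > fileB.tsv
printf 'a\tintronic\ttp53\nb\tnon-coding\tNA\nc\texonic\tkras\n'      >> fileB.tsv

# Join the header rows first, then the sorted bodies.
head -n1 fileA.tsv > headA; head -n1 fileB.tsv > headB
tail -n +2 fileA.tsv > bodyA; tail -n +2 fileB.tsv > bodyB
join -j 1 -t "$tab" headA headB  > complete.tsv
join -j 1 -t "$tab" bodyA bodyB >> complete.tsv
```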
Could you please try the following (in case you are OK with it).
awk '
FNR==NR{
array[$1]=$0
next
}
($1 in array){
print array[$1],$2,$3
}
' filea fileb | column -t
EDIT: In case fileb has many columns and you want to print all of them apart from the 1st, try the following.
awk '
FNR==NR{
array[$1]=$0
next
}
($1 in array){
val=$1
$1=""
sub(/^ +/,"")
print array[val],$0
}
' filea fileb | column -t

manipulating columns in a text file in awk

I have a tab-separated text file and want to do a math operation on one column and make a new tab-separated text file.
This is an example of my file:
chr1 144520803 144520804 12 chr1 144520813 58
chr1 144520840 144520841 12 chr1 144520845 36
chr1 144520840 144520841 12 chr1 144520845 36
chr1 144520848 144520849 14 chr1 144520851 32
chr1 144520848 144520849 14 chr1 144520851 32
I want to change the 4th column. In fact, I want to divide every single element in the 4th column by the sum of all elements in the 4th column and then multiply by 1000000, like the expected output.
expected output:
chr1 144520803 144520804 187500 chr1 144520813 58
chr1 144520840 144520841 187500 chr1 144520845 36
chr1 144520840 144520841 187500 chr1 144520845 36
chr1 144520848 144520849 218750 chr1 144520851 32
chr1 144520848 144520849 218750 chr1 144520851 32
I am trying to do that in awk using the following command, but it does not return what I want. Do you know how to fix it?
awk '{print $1 "\t" $2 "\t" $3 "\t" $4/{sum+=$4}*1000000 "\t" $5 "\t" $6 "\t" $7}' myfile.txt > new_file.txt
You need two passes: one to compute the sum, and a second to scale the field.
Something like this:
$ awk -v OFS='\t' 'NR==FNR {sum+=$4; next}
{$4*=(1000000/sum)}1' file{,} > newfile
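`file{,}` is bash brace expansion for `file file`, so awk reads the same file twice: on the first pass `NR==FNR` holds and only the sum accumulates; on the second pass each `$4` is rescaled and the bare `1` prints the modified record. Spelled out without the brace shorthand, recreating the question's sample (the 4th column sums to 64):

```shell
#!/bin/sh
# Recreate the sample; column 4 is 12+12+12+14+14 = 64.
printf 'chr1\t144520803\t144520804\t12\tchr1\t144520813\t58\n'  > myfile.txt
printf 'chr1\t144520840\t144520841\t12\tchr1\t144520845\t36\n' >> myfile.txt
printf 'chr1\t144520840\t144520841\t12\tchr1\t144520845\t36\n' >> myfile.txt
printf 'chr1\t144520848\t144520849\t14\tchr1\t144520851\t32\n' >> myfile.txt
printf 'chr1\t144520848\t144520849\t14\tchr1\t144520851\t32\n' >> myfile.txt

# Pass 1 (NR==FNR): accumulate the column-4 sum, skip to the next line.
# Pass 2: rescale column 4; the trailing "1" prints each modified record.
awk -v OFS='\t' 'NR==FNR{sum+=$4; next} {$4*=(1000000/sum)}1' myfile.txt myfile.txt > new_file.txt
```

Here 12/64 * 1000000 = 187500 and 14/64 * 1000000 = 218750, matching the expected output exactly.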

Modify tab delimited txt file

I want to modify a tab-delimited txt file using Linux commands such as sed or awk, or any other method.
This is an example of the tab-delimited txt file which I want to modify for R boxplot input:
----start of input format---------
chr8 38277027 38277127 Ex8_inner
25425 8 100 0.0800000
chr8 38277027 38277127 Ex8_inner
25426 4 100 0.0400000
chr9 38277027 38277127 Ex9_inner
25427 9 100 0.0900000
chr9 38277027 38277127 Ex9_inner
25428 1 100 0.0100000
chr10 38277027 38277127 Ex10_inner
30935 1 100 0.0100000
chr10 38277027 38277127 Ex10_inner
31584 1 100 0.0100000
all 687 1 1000 0.0010000
all 694 1 1000 0.0010000
all 695 1 1000 0.0010000
all 697 1 1000 0.0010000
all 699 6 1000 0.0060000
all 700 2 1000 0.0020000
all 723 7 1000 0.0070000
all 740 8 1000 0.0080000
all 742 1 1000 0.0010000
all 761 5 1000 0.0050000
all 814 2 1000 0.0020000
all 821 48 1000 0.0480000
------end of input file format------
I want it modified so that the 4th column of the odd rows becomes the 1st column, and the 2nd column of the even rows (whose 1st column is blank) becomes the 2nd column. Rows starting with "all" get deleted.
This is how output file should look:
-----start of the output file----
Ex8_inner 25425
Ex8_inner 25426
Ex9_inner 25427
Ex9_inner 25428
Ex10_inner 30935
Ex10_inner 31584
-----end of the output file----
EDIT: As the OP has changed the Input_file sample a bit, adding code for it too.
awk --re-interval 'match($0,/Exon[0-9]{1,}/){val=substr($0,RSTART,RLENGTH);getline;sub(/^ +/,"",$1);print val,$1}' Input_file
NOTE: My awk is an old version, so I added --re-interval; you need not add it if you have a recent version.
The following single awk may help you with the same too.
awk '/Ex[0-9]+_inner/{val=$NF;getline;sub(/^ +/,"",$1);print val,$1}' Input_file
Explanation: adding an explanation here as well.
awk '
/Ex[0-9]+_inner/{ ##Check whether the line matches "Ex", digits, then "_inner"; if so, do the following actions.
val=$NF; ##Create a variable named val whose value is $NF (the last field of the current line).
getline; ##Use getline, a built-in awk keyword, to move the cursor to the next line.
sub(/^ +/,"",$1); ##Use awk'\''s sub to strip the leading spaces from the first field.
print val,$1 ##Print the variable val and the first field value.
}
' Input_file ##Mention the Input_file name here.
another awk
$ awk '/^all/{next}
!/^chr/{printf "%s\n", $1; next}
{printf "%s ", $NF}' file
Ex8_inner 25425
Ex8_inner 25426
Ex9_inner 25427
Ex9_inner 25428
Ex10_inner 30935
Ex10_inner 31584
or perhaps
$ awk '!/^all/{if(/^chr/) printf "%s", $NF OFS; else print $1}' file
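A runnable sketch of the pairing logic on a fragment of the input (the even rows begin with a blank first column, so awk's default whitespace splitting makes the number `$1`), with the `all` rows dropped:

```shell
#!/bin/sh
# Rebuild a fragment of the input: odd rows carry the label in the last
# field; even rows start with a blank column followed by the number.
printf 'chr8\t38277027\t38277127\tEx8_inner\n'  > infile
printf '\t25425\t8\t100\t0.0800000\n'          >> infile
printf 'chr8\t38277027\t38277127\tEx8_inner\n' >> infile
printf '\t25426\t4\t100\t0.0400000\n'          >> infile
printf 'all\t687\t1\t1000\t0.0010000\n'        >> infile

# Print the label from each "chr" line, then the first field of the line after it.
awk '/^all/{next} !/^chr/{print $1; next} {printf "%s ", $NF}' infile
```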

awk to filter file using another capturing all instances

In the below awk I am trying to capture all instances of KCNMA1, a line in gene (which is a one-column list of names), that appear in $8 of file, which is tab-delimited.
So in the below example, all instances/lines where KCNMA1 appears in $8 would be printed to the output.
There could also be multiple ; in the field, but the name, in this case KCNMA1, will be included. The awk seems to capture only 2 of the possible 4 matching lines, as shown by the current output. Thank you :).
gene
KCNMA1
file
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene
4629 chr10 78944590 78944590 G A intergenic NONE;KCNMA1 dist=NONE;dist=451371
4630 chr10 79396463 79396463 C T intronic KCNMA1 .
4631 chr10 79397777 79397777 C - exonic KCNMA1;X1X .
4632 chr10 81318663 81318663 C G exonic SFTPA2 .
4633 chr10 89397777 89397777 - GAA exonic NONE;X1X;KCNMA1 .
current output
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene
1 chr10 79396463 79396463 C T intronic KCNMA1 .
2 chr10 79397777 79397777 C - exonic KCNMA1;X1X .
desired output (tab-delimited)
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene
4629 chr10 78944590 78944590 G A intergenic NONE;KCNMA1 dist=NONE;dist=451371
4630 chr10 79396463 79396463 C T intronic KCNMA1 .
4631 chr10 79397777 79397777 C - exonic KCNMA1;X1X .
4633 chr10 89397777 89397777 - GAA exonic NONE;X1X;KCNMA1 .
awk
awk -F'\t' 'NR==FNR{a[$0];next} FNR==1{print} {x=$8; sub(/;.*/,"",x)} x in a{$1=++c; print}' gene file > out
For a single gene, just pass it as a variable:
$ awk -v gene='KCNMA1' -v d=';' 'NR==1 || d $8 d ~ d gene d' file
The counter you're using seems unnecessary, since you want to keep the first field as-is.
If you want to support a file-based gene list, you can use this:
$ awk -v d=';' 'NR==FNR {genes[$0]; next}
FNR==1;
{for(g in genes)
if(d $8 d ~ d g d) print}' genes file
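The `d $8 d ~ d gene d` trick wraps both the field and the gene name in `;` delimiters, so `KCNMA1` matches whether it stands alone or sits anywhere in a `;`-separated list, while a longer name that merely contains it (say, a hypothetical `KCNMA1B`) would not. A runnable sketch on a reduced copy of the sample (only the first 8 columns, since the test only reads `$8`):

```shell
#!/bin/sh
# Reduced tab-delimited sample: field 8 is the Gene column.
printf 'R_Index\tChr\tStart\tEnd\tRef\tAlt\tFunc\tGene\n'                    > file
printf '4629\tchr10\t78944590\t78944590\tG\tA\tintergenic\tNONE;KCNMA1\n'   >> file
printf '4630\tchr10\t79396463\t79396463\tC\tT\tintronic\tKCNMA1\n'          >> file
printf '4631\tchr10\t79397777\t79397777\tC\t-\texonic\tKCNMA1;X1X\n'        >> file
printf '4632\tchr10\t81318663\t81318663\tC\tG\texonic\tSFTPA2\n'            >> file
printf '4633\tchr10\t89397777\t89397777\t-\tGAA\texonic\tNONE;X1X;KCNMA1\n' >> file

# Keep the header (NR==1) plus any row whose ;-wrapped $8 contains ;KCNMA1;.
awk -v gene='KCNMA1' -v d=';' 'NR==1 || d $8 d ~ d gene d' file
```

This prints the header plus all four KCNMA1 rows and drops the SFTPA2 row, matching the desired output.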

awk to count lines in column of file

I have a large file and want to use awk to count the unique entries in a specific column, $5, taking only the part before the :, but I seem to be having trouble getting the syntax correct. Thank you :).
Sample Input
chr1 955542 955763 + AGRN:exon.1 1 0
chr1 955542 955763 + AGRN:exon.1 2 0
chr1 955542 955763 + AGRN:exon.1 3 0
chr1 955542 955763 + AGRN:exon.1 4 1
chr1 955542 955763 + AGRN:exon.1 5 1
awk -F: ' NR > 1 { count += $5 } -uniq' Input
Desired output
1
$ awk -F'[ \t:]+' '{a[$5]=1;} END{for (k in a)n++; print n;}' Input
1
-F'[ \t:]+'
This tells awk to use spaces, tabs, or colons as the field separator.
a[$5]=1
As we loop through each line, this adds an entry into associative array a for each value of $5 encountered.
END{for (k in a)n++; print n;}
After we have finished reading the file, this counts the number of keys in associative array a and prints the total.
The idiomatic, portable awk approach:
$ awk '{sub(/:.*/,"",$5)} !seen[$5]++{unq++} END{print unq}' file
1
The briefer but gawk-only (courtesy of length(array)) approach:
$ awk '{sub(/:.*/,"",$5); seen[$5]} END{print length(seen)}' file
1
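A quick runnable check of the portable version, on a hypothetical sample where one gene spans two exon annotations: the `sub` strips everything from the first `:`, so `AGRN:exon.1` and `AGRN:exon.2` collapse to one key.

```shell
#!/bin/sh
# Sample with one gene split across two exon annotations in column 5.
printf 'chr1\t955542\t955763\t+\tAGRN:exon.1\t1\t0\n'  > file
printf 'chr1\t955542\t955763\t+\tAGRN:exon.1\t2\t0\n' >> file
printf 'chr1\t955542\t955763\t+\tAGRN:exon.2\t3\t0\n' >> file

# Strip the ":..." suffix, then count first-time-seen values of $5.
awk '{sub(/:.*/,"",$5)} !seen[$5]++{unq++} END{print unq}' file
```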