Apply a calculation on columns (awk) - awk

I am trying to apply this calculation to every column from column 41 to the end of the line:
awk '{ { split($10,a,":") } { split( a[4], b ,",") } {print b[1]+b[2]}}' filename
I know how to do this for a single column, but when I try to do it in a loop it fails:
awk '{for (i=10;i<=NF;i++) {split($i,a,":")} {split(a[4],b,",")} {print ( b[1]+b[2])}}' filename
The aim is to split each column and sum these numbers:
./.:0:.,.,.:0,0:0,0
Here is what my file looks like :
Contig POS ID REF ALT QUAL FILTER INFO FORMAT S155 S158 S168 S173 S175 S178 S180 S188 S189 S191 S193 S194 S196 S201 S205 S206 S208 S209 S210
NODE_14985_length_2800_cov_1.38384 67 999978 A C . PASS Ty=SNP;Rk=1;UL=19;UR=31;CL=.;CR=.;Genome=A;Sd=1 GT:DP:PL:AD:HQ ./.:8:.,.,.:8,0:71,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0 ./.:0:.,.,.:0,0:0,0
Here is my actual output:
awk '{for (i=10;i<=NF;i++) {split($i,a,":")} {split(a[4],b,",")} {print b[1]+b[2]}}' file.vcf | head
0
0
0
0
0
I want a matrix of the calculation for each column:
0 0 0 0
1 2 0 6
2 0 0 8
...
Thank you in advance for your help

Changed the printf and added a print at the end of each row (your printf needs at least a space to separate the results on the line).
Based on your sample, change the 41 to a number lower than 28 (awk sees only 28 fields in this dataset).
Your two split calls and the print come AFTER the loop; they must be INSIDE the loop body (check where the braces are).
Modified code:
awk 'NR > 1 {
    for (i = 41; i <= NF; i++) {
        split($i, a, ":")
        # print NF ":" i "[" $i "] a[4]:" a[4]
        split(a[4], b, ",")
        # print i ": " b[1] " + " b[2] " : " b[1] + b[2]
        printf("%d ", b[1] + b[2])
    }
    print ""
}' YourFile
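To try the loop against the sample line from the question, here is a minimal sketch. The function name sum_ad and the variable start are made up for illustration; start defaults to 10 because the sample columns in the posted data appear to begin at field 10. It builds each row in a string so there is no trailing space.

```shell
# Sketch only: sum_ad and start are hypothetical names, not from the question.
# For every line after the header, sum the two comma-separated numbers in the
# 4th ":"-separated subfield of each column from "start" to the last column.
sum_ad() {
    awk -v start="${2:-10}" 'NR > 1 {
        row = ""
        for (i = start; i <= NF; i++) {
            split($i, a, ":")          # e.g. ./.:8:.,.,.:8,0:71,0 -> a[4] = "8,0"
            split(a[4], b, ",")        # "8,0" -> b[1] = 8, b[2] = 0
            row = row (i > start ? " " : "") (b[1] + b[2])
        }
        print row                      # one row of sums per input line
    }' "$1"
}
```

Usage would be something like sum_ad file.vcf for the sample, or sum_ad file.vcf 41 if the real data starts at column 41.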

Related

awk to extract value in each line and create new file

In the below awk I am trying to extract the value of a substring in each line, and the 2 attempts do not produce the desired results. The first awk executes and returns no data,
and the second only extracts the value. Thank you :).
file
#CHROM POS ID REF ALT QUAL FILTER INFO
1 930215 CM1613956 A G . . PHEN="Retinitis_pigmentosa";RANKSCORE=0.21
awk 1
awk '/^#/ {for (I=1;I<NF;I++) if ($I == "RANKSCORE=") print $(I+1)}' file
awk 2
awk 'BEGIN{FS=OFS="\t"}; /^#/ {print $1,$2,$3} {sub(/.*RANKSCORE=/, ""); print}' file
#CHROM POS ID
#CHROM POS ID REF ALT QUAL FILTER INFO
0.21
0.99
desired (tab-delimited)
1 930215 CM1613956 A G . . 0.21
You may use this awk:
awk 'BEGIN {FS=OFS="\t"}
/^#/ {next}
$NF ~ /;RANKSCORE=/ {
sub(/.+=/, "", $NF)
} 1' file
1 930215 CM1613956 A G . . 0.21
With your shown samples please try following awk code.
awk -F';RANKSCORE=' '
BEGIN{ OFS ="\t" }
/^#/ { next }
NF==2 && match($0,/.* /){
print substr($0,RSTART,RLENGTH-1),$2
}
' Input_file
Explanation: Adding detailed explanation for above code.
awk -F';RANKSCORE=' ' ##Starting awk program from here, setting field separator as ;RANKSCORE=
BEGIN{ OFS ="\t" } ##Setting OFS as tab in BEGIN section of this code.
/^#/ { next } ##If a line starts from # then simply skip that line.
NF==2 && match($0,/.* /){ ##Check if NF is 2 AND matching till last occurrence of single space.
print substr($0,RSTART,RLENGTH-1),$2 ##Printing sub string till matched regex along with 2nd field.
}
' Input_file ##Mentioning Input_file name here.
Your RANKSCORE seems to appear in field 8.
match can locate it. substr can extract it.
$ awk -F'\t' -v OFS='\t' '
match($8,/RANKSCORE=[0-9.]+/){
$8 = substr($8, RSTART+10, RLENGTH-10)
print
}
' file
Or more safely, assuming semi-colon sub-delimiters, a couple of subs:
$ awk -F'\t' -v OFS='\t' '
sub(/^(.*;)?RANKSCORE=/,"",$8){
sub(/[^0-9.].*$/,"",$8)
print
}
' file
Assumptions:
we only want exact word matches on RANKSCORE (eg, do not match on old_RANKSCORE)
RANKSCORE=value could show up anywhere in a ;-delimited last field
Adding some lines with different locations of RANKSCORE:
#CHROM POS ID REF ALT QUAL FILTER INFO
1 930215 CM1613956 A G . . PHEN="Retinitis_pigmentosa";RANKSCORE=0.21
1 930215 CM1613956 A G . . RANKSCORE=3.235;PHEN="Retinitis_pigmentosa"
1 930215 CM1613956 A G . . stuff=123;old_RANKSCORE=7.7234;PHEN="Retinitis_pigmentosa"
1 930215 CM1613956 A G . . stuff=123;RANKSCORE=9.3325;PHEN="Retinitis_pigmentosa"
One awk idea:
awk '
BEGIN { FS=OFS="\t" }
/RANKSCORE/ { n=split($NF,a,"[;=]") # if line contains "RANKSCORE" then split last field on dual delimiters ";" and "="
for (i=1;i<=n;i=i+2) # loop through attribute names (odd-numbered indices) and ...
if (a[i] == "RANKSCORE") { # if attribute == "RANKSCORE" then ...
$NF=a[i+1] # use associated value (even-numbered index) as new value for last field
print # print new line
next # go to next input line
}
}
' file
This generates:
1 930215 CM1613956 A G . . 0.21
1 930215 CM1613956 A G . . 3.235
1 930215 CM1613956 A G . . 9.3325
no arrays needed :
{m,n,g}awk '!+_<+NF && sub(";.*$", _, $(NF=NF))^_'\
FS='[ \t]+([^ \t]*;)?RANKSCORE=' OFS='\t'
1 930215 CM1613956 A G . . 0.21
1 930215 CM1613956 A G . . 3.235
1 930215 CM1613956 A G . . 9.3325
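The one-liner above is terse, so here is how I read it, spelled out as a sketch (the function name rankscore_only is made up). The field separator swallows the whitespace before the last column plus everything up to and including an exact RANKSCORE=, so a matching line splits into exactly two fields: the first seven columns and whatever follows RANKSCORE=.

```shell
# Sketch only: rankscore_only is a hypothetical name.
# The FS regex is the same one used in the one-liner above.
rankscore_only() {
    awk -F'[ \t]+([^ \t]*;)?RANKSCORE=' -v OFS='\t' '
        NF > 1 {                    # the separator matched, so RANKSCORE= is present
            sub(/;.*$/, "", $2)     # drop any attributes after the value
            print $1, $2            # $1 still holds the first seven columns
        }' "$1"
}
```

Lines without an exact RANKSCORE attribute (the header, or old_RANKSCORE) stay as a single field and are skipped.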

awk to compare value of sub-string in field

In the awk below I am trying to extract and compare each substring in $4 that starts with p.. If the first three letters are the same as the last three (there is a number in between), then that p. is updated to p.(3 letters)(digits)(=); the () only show that there are three entries and are not needed. If the three letters are different, then that line is unchanged. Line 1 of the file below is an example. My actual data has about 10,000 rows with about 50 columns, but $4 is the only one that will have these p. values in it. The format of the p. will always be three letters, followed by a 1-4 digit number, followed by three more letters. I think the awk attempt below will extract each p. and split on the ;, but I am not sure how to check whether the three letters are the same. Thank you :).
file tab-delimited
Chr Start ExonicFunc.refGene AAChange.refGene
chr1 155880573 synonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu110Glu;RIT1:NM_001256822:exon2:c.31G>C:p.Glu110Glu
chr1 155880573 nonsynonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu11Gln
desired output tab-delimited
Chr Start ExonicFunc.refGene AAChange.refGene
chr1 155880573 synonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu110=;RIT1:NM_001256822:exon2:c.31G>C:p.Glu110=
chr1 155880573 nonsynonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu11Gln
awk
awk '
BEGIN { OFS="\t" }
$4 ~ /:NM/ {
ostring=""
# split $4 by ";" and cycle through them
nNM=split($4,NM,";")
for (n=1; n<=nNM; n++) {
if (n>1) ostring=(ostring ";") # append ";"
if (match(NM[n],/p[.].*/)) {
# copy up to "p."
ostring=(ostring substr(NM[n],1,RSTART+1))
# Get the substring after "p."
VAL=substr(NM[n],RSTART+2)
# Get its length
lenVAL=length(VAL)
# store aa array
aa=[{while(length($4)=3){print substr($044,1,3);gsub(/^./,"")}]}' file
Extended GNU awk solution:
awk 'NR==1; NR > 1{
len = split($4, a, /\<p\.[a-zA-Z]{3}[0-9]+[a-zA-Z]{3}\>/, seps);
if (len == 1){ print; next }
res = ""
for (i=1; i < len; i++) {
s = seps[i];
if (substr(s, 3, 3) == substr(s, length(s) - 2)) {
seps[i] = substr(s, 1, length(s) - 3)"=";
}
}
for (i=1; i <= len; i++)
res = res a[i] (seps[i]? seps[i]:"");
$4 = res; print
}' FS='\t' OFS='\t' file
The output:
Chr Start ExonicFunc.refGene AAChange.refGene
chr1 155880573 synonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu110=;RIT1:NM_001256822:exon2:c.31G>C:p.Glu110=
chr1 155880573 nonsynonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu11Gln
Time performance measurement:
Input testfile:
$ wc -l testfile
10000 testfile
time(awk 'NR==1; NR > 1{
len = split($4, a, /\<p\.[a-zA-Z]{3}[0-9]+[a-zA-Z]{3}\>/, seps);
if (len == 1){ print; next }
res = ""
for (i=1; i < len; i++) {
s = seps[i];
if (substr(s, 3, 3) == substr(s, length(s) - 2)) {
seps[i] = substr(s, 1, length(s) - 3)"=";
}
}
for (i=1; i <= len; i++)
res = res a[i] (seps[i]? seps[i]:"");
$4 = res; print
}' FS='\t' OFS='\t' testfile >/dev/null)
real 0m0.269s
user 0m0.256s
sys 0m0.000s
time(awk 'BEGIN { FS=OFS="\t" }
NR>1 {
head = ""
tail = $4
while ( match(tail,/(p\.([[:alpha:]]{3})[0-9]+)([[:alpha:]]{3})/,a) ) {
head = head substr(tail,1,RSTART-1) a[1] (a[2] == a[3] ? "=" : a[3])
tail = substr(tail,RSTART+RLENGTH)
}
$4 = head tail
}
{ print }' testfile >/dev/null)
real 0m0.470s
user 0m0.416s
sys 0m0.008s
With GNU awk for the 3rd arg to match():
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR>1 {
head = ""
tail = $4
while ( match(tail,/(p\.([[:alpha:]]{3})[0-9]+)([[:alpha:]]{3})/,a) ) {
head = head substr(tail,1,RSTART-1) a[1] (a[2] == a[3] ? "=" : a[3])
tail = substr(tail,RSTART+RLENGTH)
}
$4 = head tail
}
{ print }
$ gawk -f tst.awk file
Chr Start ExonicFunc.refGene AAChange.refGene
chr1 155880573 synonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu110=;RIT1:NM_001256822:exon2:c.31G>C:p.Glu110=
chr1 155880573 nonsynonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu11Gln
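Both answers above rely on GNU awk extensions (the 4th argument to split() and the 3rd argument to match()). If gawk is not available, the same head/tail loop can probably be written with plain match(), RSTART and RLENGTH. This is only a sketch: the function name collapse_synonymous is made up, and the character classes are repeated instead of using {3} because some awks do not support interval expressions.

```shell
# Sketch only: collapse_synonymous is a hypothetical name.
# Rewrites p.XxxNNNXxx to p.XxxNNN= in field 4 when both 3-letter codes match.
collapse_synonymous() {
    awk 'BEGIN { FS = OFS = "\t" }
    NR > 1 {
        head = ""
        tail = $4
        while (match(tail, /p\.[A-Za-z][A-Za-z][A-Za-z][0-9]+[A-Za-z][A-Za-z][A-Za-z]/)) {
            hit   = substr(tail, RSTART, RLENGTH)   # e.g. p.Glu110Glu
            first = substr(hit, 3, 3)               # letters right after "p."
            last  = substr(hit, RLENGTH - 2)        # final three letters
            head  = head substr(tail, 1, RSTART - 1) substr(hit, 1, RLENGTH - 3) (first == last ? "=" : last)
            tail  = substr(tail, RSTART + RLENGTH)  # keep scanning after this hit
        }
        $4 = head tail
    }
    { print }' "$1"
}
```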

awk to update value in field of out file using contents of another

In the out.txt below I am trying to use awk to update the contents of $9. The out.txt is created by the awk before the pipe |. If $9 contains a + or - then $8 of out.txt is used as a key to look up in $2 of file2. When a match (there will always be one) is found, the $3 value of file2 is appended to $9 of out.txt, separated by a :. So the original +6 in out.txt would become +6:NM_005101.3. The awk below is close but has syntax errors after the | that I cannot seem to fix. Thank you :).
out.txt tab-delimited
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene Inheritence ExonicFunc.IDP.refGene AAChange.IDP.refGene
1 chr1 948846 948846 - A upstream ISG15 -0 . . .
2 chr1 948870 948870 C G UTR5 ISG15 NM_005101.3:c.-84C>G . .
4 chr1 949925 949925 C T downstream ISG15 +6 . . .
5 chr1 207646923 207646923 G A intronic CR2 >50 . . .
8 chr1 948840 948840 - C upstream ISG15 -6 . . .
file2 space-delimited
2 ISG15 NM_005101.3 948846-948956 949363-949919
desired output tab-delimited
R_Index Chr Start End Ref Alt Func.IDP.refGene Gene.IDP.refGene GeneDetail.IDP.refGene Inheritence ExonicFunc.IDP.refGene AAChange.IDP.refGene
1 chr1 948846 948846 - A upstream ISG15 -0:NM_005101.3 . . .
2 chr1 948870 948870 C G UTR5 ISG15 NM_005101.3:c.-84C>G . .
4 chr1 949925 949925 C T downstream ISG15 +6:NM_005101.3 . . .
5 chr1 207646923 207646923 G A intronic CR2 >50 . . .
8 chr1 948840 948840 - C upstream ISG15 -6:NM_005101.3 . . .
Description
lines 1, 3, 5: `$9` updated with `:` and the value of `$3` in `file2`
lines 2 and 4 are skipped as these do not have a `+` or `-` in them
awk
awk -v extra=50 -v OFS='\t' '
NR == FNR {
count[$2] = $1
for(i = 1; i <= $1; i++) {
low[$2, i] = $(2 + 2 * i)
high[$2, i] = $(3 + 2 * i)
mid[$2, i] = (low[$2, i] + high[$2, i]) / 2
}
next
}
FNR != 1 && $9 == "." && $12 == "." && $8 in count {
for(i = 1; i <= count[$8]; i++)
if($4 >= (low[$8, i] - extra) && $4 <= (high[$8, i] + extra)) {
if($4 > mid[$8, i]) {
sign = "+"
value = high[$8, i]
}
else {
sign = "-"
value = low[$8, i]
}
diff = (value > $4) ? value - $4 : $4 - value
$9 = (diff > 50) ? ">50" : (sign diff)
break
}
if(i > count[$8]) {
$9 = ">50"
}
}
1
' FS='[- ]' file2 FS='\t' file1 | awk if($6 == "-" || $6 == "+") printf ":" ; 'FNR==NR {a[$2]=$3; next} a[$8]{$3=a[$8]}1' OFS='\t' file2 > final.txt
bash: syntax error near unexpected token `('
As far as I can tell, your awk code is OK and your bash usage is wrong.
FS='[- ]' file2 FS='\t' file1 |
awk if($6 == "-" || $6 == "+")
printf ":" ;
'FNR==NR {a[$2]=$3; next}
a[$8]{$3=a[$8]}1' OFS='\t' file2 > final.txt
bash: syntax error near unexpected token `('
I don't know what that's supposed to do. This for sure, though: on the second line, the awk code needs to be quoted (awk 'if(....). The bash error message stems from the fact that bash is interpreting the (unquoted) awk code, and ( is not a valid shell-script token after if.
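Going only by the description in the question, the stage after the pipe might look roughly like the sketch below. The function name append_transcript and the array name tx are made up. It assumes that only a $9 consisting of a sign followed by digits should be updated, which is a guess based on the desired output (the NM_005101.3:c.-84C>G row contains a - but is left alone there).

```shell
# Sketch only: append_transcript and tx are hypothetical names.
# $1 = file2 (space-delimited), $2 = out.txt (tab-delimited)
append_transcript() {
    awk 'NR == FNR { tx[$2] = $3; next }     # file2: gene name -> transcript id
         FNR > 1 && $9 ~ /^[+-][0-9]+$/ && ($8 in tx) { $9 = $9 ":" tx[$8] }
         { print }' FS=' ' "$1" FS='\t' OFS='\t' "$2"
}
```

The whole awk program sits inside one pair of single quotes, so bash never sees the parentheses. To use it at the end of the original pipeline, pass - as the second file name so awk reads standard input.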

awk match and find mismatch between files and output results

In the awk below I am using $5, $7 and $8 of file1 to search $3, $5 and $6 of file2. The header row should be skipped, and a new file is output listing which lines match and, for those that do not, which file the match is missing from. I want each match to use the 3 fields together as the key for the lookup, but my attempt does not skip the header and I get the current output below. I apologize for the long post and file examples, just trying to include everything to help get this working. Thank you :).
file1
Index Chromosomal Position Gene Inheritance Start End Ref Alt Func.refGene
98 48719928 FBN1 AD 48719928 48719929 AT - exonic
101 48807637 FBN1 AD 48807637 48807637 C T exonic
file2
R_Index Chr Start End Ref Alt Func.IDP.refGene
36 chr15 48719928 48719929 AT - exonic
37 chr15 48719928 48719928 A G exonic
38 chr15 48807637 48807637 C T exonic
awk
awk -F'\t' '
NR == FNR {
A[$25]; A[$26]; A[$27]
next
}
{
B[$3]; B[$5]; B[$6]
}
END {
print "Match"
OFS=","
for ( k in A )
{
if ( k && k in B )
printf "%s ", k
}
print "Missing from file1"
OFS=","
for ( k in B )
{
if ( ! ( k in A ) )
printf "%s ", k
}
print "Missing from file2"
OFS=","
for ( k in A )
{
if ( ! ( k in B ) )
printf "%s ", k
}
}
' file1 file2 > list
current output
Match
Missing from file1
A C Ref 48807637 Alt Start T G - AT 48719928 Missing from file2
desired output
Match 48719928 AT -, 48807637 C T
Missing from file1 48719928 A G
Missing from file2
You misunderstand awk syntax and are confusing awk with shell. When you wrote:
A[$25] [$26] [$27]
you probably meant:
A[$25]; A[$26]; A[$27]
(and similarly for B[]) and when you wrote:
IFS=
since IFS is a shell variable, not an awk one, you maybe meant
FS=
BUT since you're doing that in the END section and not calling split() and so not doing anything that would use FS idk what you were hoping to achieve with that. Maybe you meant:
OFS=
BUT you aren't doing anything that would use OFS and your desired output isn't comma-separated so idk what you'd be hoping to achieve with that either.
If that's not enough info for you to solve your problem yourself then reduce your example to something with 10 columns or less so we don't have to read a lot of irrelevant info to help you.
Program 1
This works, except the output format is different from what you request:
awk 'FNR==1 { next }
FNR == NR { file1[$5,$7,$8] = $5 " " $7 " " $8 }
FNR != NR { file2[$3,$5,$6] = $3 " " $5 " " $6 }
END { print "Match:"; for (k in file1) if (k in file2) print file1[k] # Or file2[k]
print "Missing in file1:"; for (k in file2) if (!(k in file1)) print file2[k]
print "Missing in file2:"; for (k in file1) if (!(k in file2)) print file1[k]
}' file1 file2
Output 1
Match:
48807637 C T
48719928 AT -
Missing in file1:
48719928 A G
Missing in file2:
Program 2
If you must have each set of values in a category comma-separated on a single line, then:
awk 'FNR==1 { next }
FNR == NR { file1[$5,$7,$8] = $5 " " $7 " " $8 }
FNR != NR { file2[$3,$5,$6] = $3 " " $5 " " $6 }
END {
printf "Match"
pad = " "
for (k in file1)
{
if (k in file2)
{
printf "%s%s", pad, file1[k]
pad = ", "
}
}
print ""
printf "Missing in file1"
pad = " "
for (k in file2)
{
if (!(k in file1))
{
printf "%s%s", pad, file2[k]
pad = ", "
}
}
print ""
printf "Missing in file2"
pad = " "
for (k in file1)
{
if (!(k in file2))
{
printf "%s%s", pad, file1[k]
pad = ", "
}
}
print ""
}' file1 file2
The code is a little bigger, but the format used exacerbates the difference. The change is all in the END block; the other code is unchanged. The sequences of actions in the END block no longer fit comfortably on a single line, so they're spread out for readability. You can apply a liberal smattering of semicolons and concatenate the lines to shrink the apparent size of the program if you desire.
It's tempting to try a function for the printing, but the conditions just make it too tricky to be worthwhile, I think — but I'm open to persuasion otherwise.
Output 2
Match 48807637 C T, 48719928 AT -
Missing in file1 48719928 A G
Missing in file2
This output will be a lot harder to parse than the one shown first, so doing anything automatically with it will be tricky. While there are only 3 entries to worry about, the line length isn't an issue. If you get to 3 million entries, the lines become very long and unmanageable.
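On the question of whether a function is worthwhile for the printing: one possible shape is sketched below, as a suggestion rather than a clear improvement. The names compare_keys, report and want are made up. The trick is to pass a flag saying whether the key must be present (1) or absent (0) in the other array, since `k in other` evaluates to 1 or 0.

```shell
# Sketch only: compare_keys, report and want are hypothetical names.
# $1 = file1, $2 = file2 (whitespace-separated, header on line 1 of each)
compare_keys() {
    awk '
    # Print label, then each value of have[] whose key membership in other[]
    # equals want (1 = must be present, 0 = must be absent), comma-separated.
    function report(label, have, other, want,    k, pad) {
        printf "%s", label
        pad = " "
        for (k in have)
            if ((k in other) == want) {
                printf "%s%s", pad, have[k]
                pad = ", "
            }
        print ""
    }
    FNR == 1  { next }
    FNR == NR { file1[$5,$7,$8] = $5 " " $7 " " $8; next }
              { file2[$3,$5,$6] = $3 " " $5 " " $6 }
    END {
        report("Match", file1, file2, 1)
        report("Missing in file1", file2, file1, 0)
        report("Missing in file2", file1, file2, 0)
    }' "$1" "$2"
}
```

The order of entries within a line still depends on how awk walks the arrays, as in the programs above.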

matching non-unique values to unique values

I have data which looks like this
1 3
1 2
1 9
5 4
4 6
5 6
5 8
5 9
4 2
I would like the output to be
1 3,2,9
5 4,6,8,9
4 6,2
This is just sample data but my original one has lots more values.
This worked. It basically creates a hash table, using the first column as the key and the second column of the line as the value:
awk '{line="";for (i = 2; i <= NF; i++) line = line $i ", "; table[$1]=table[$1] line;} END {for (key in table) print key " => " table[key];}' trial.txt
OUTPUT
4 => 6, 2
5 => 4, 6, 8, 9
1 => 3, 2, 9
I'd write
awk -v OFS=, '
{
key = $1
$1 = ""
values[key] = values[key] $0
}
END {
for (key in values) {
sub(/^,/, "", values[key])
print key " " values[key]
}
}
' file
If you want only the unique values for each key (requires GNU awk for multi-dimensional arrays)
gawk -v OFS=, '
{ for (i=2; i<=NF; i++) values[$1][$i] = i }
END {
for (key in values) {
printf "%s ", key
sep = ""
for (val in values[key]) {
printf "%s%s", sep, val
sep = ","
}
print ""
}
}
' file
or perl
perl -lane '
$key = shift @F;
$values{$key}{$_} = 1 for @F;
} END {
$, = " ";
print $_, join(",", keys %{$values{$_}}) for keys %values;
' file
if not concerned with the order of the keys, I think this is the idiomatic awk solution.
$ awk '{a[$1]=($1 in a?a[$1]",":"") $2}
END{for(k in a) print k,a[k]}' file |
column -t
4 6,2
5 4,6,8,9
1 3,2,9