counting the number of residues in a file - awk

I have a file as follows. I would like to count the number of each character.
>1DMLA
MTDSPGGVAPASPVEDASDASLGQPEEGAPCQVVLQGAELNGILQAFAPLRTSLLDSLLVMGDRGILIHNTIFGEQVFLP
LEHSQFSRYRWRGPTAAFLSLVDQKRSLLSVFRANQYPDLRRVELAITGQAPFRTLVQRIWTTTSDGEAVELASETLMKR
ELTSFVVLVPQGTPDVQLRLTRPQLTKVLNATGADSATPTTFELGVNGKFSVFTTSTCVTFAAREEGVSSSTSTQVQILS
NALTKAGQAAANAKTVYGENTHRTFSVVVDDCSMRAVLRRLQVGGGTLKFFLTTPVPSLCVTATGPNAVSAVFLLKPQK
>1DMLB
DDVAARLRAAGFGAVGAGATAEETRRMLHRAFDTLA
>2BHDC
MTDSPGGVAPASPVEDASDASLGQPEEGAPCQVVLQGAELNGILQAFAPLRTSLLDSLLVMGDRGILIHNTIFGEQVFLP
LEHSQFSRYRWRGPTAAFLSLVDQKRSLLSVFRANQYPDLRRVELAITGQAPFRTLVQRIWTTTSDGEAVELASETLMKR
ELTSFVVLVPQGTPDVQLRLTRPQLTKVLNATGADSATPTTFELGVNGKFSVFTTSTCVTFAAREEGVSSSTSTQVQILS
I tried the following code.
awk '/^>/ { res=substr($0, 2); } /^[^>]/ { print res " - " length($0); }' <file
The output of the above code is
1DMLA - 80
1DMLA - 80
1DMLA - 80
1DMLA - 79
1DMLB - 36
2BHDC - 80
2BHDC - 80
2BHDC - 80
My desired output is
1DMLA - 319
1DMLB - 36
2BHDC - 240
How do I change the above code for getting my desired output?

Here's one way using awk:
awk '/^>/ && r { print r, "-", s; r=s="" } /^>/ { r = substr($0, 2); next } { s += length } END { print r, "-", s }' file
Results:
1DMLA - 319
1DMLB - 36
2BHDC - 240

awk -vRS='>' '$1{gsub( "[\r]", "",$1 );
printf "%s - %d\n", $1, length($0) - length($1) - NF + 1}' input

This way:
awk -F\> '/^>/ {if (seqlen != ""){print seqlen}printf("%s - ",$2);seqlen=0;next}seqlen != ""{seqlen +=length($0)}END{print seqlen}' infile
Or formatted:
awk -F\> '/^>/ { if (seqlen != "")
print seqlen
printf("%s - ",$2)
seqlen=0
next }
seqlen != ""{seqlen+=length($0)}
END{
print seqlen}' infile
see:
Sequence length of FASTA file
Apart from the expected result, this will handle these unexpected file formats.
$ cat infile
MTDSPGGVAPASPVEDASDASLGQPEEGAPCQVVLQGAELNGILQAFAPLRTSLLDSLLVMGDRGILIHNTIFGEQVFLP
LEHSQFSRYRWRGPTAAFLSLVDQKRSLLSVFRANQYPDLRRVELAITGQAPFRTLVQRIWTTTSDGEAVELASETLMKR
ELTSFVVLVPQGTPDVQLRLTRPQLTKVLNATGADSATPTTFELGVNGKFSVFTTSTCVTFAAREEGVSSSTSTQVQILS
NALTKAGQAAANAKTVYGENTHRTFSVVVDDCSMRAVLRRLQVGGGTLKFFLTTPVPSLCVTATGPNAVSAVFLLKPQK
>1DMLB
>2BHDC
MTDSPGGVAPASPVEDASDASLGQPEEGAPCQVVLQGAELNGILQAFAPLRTSLLDSLLVMGDRGILIHNTIFGEQVFLP
LEHSQFSRYRWRGPTAAFLSLVDQKRSLLSVFRANQYPDLRRVELAITGQAPFRTLVQRIWTTTSDGEAVELASETLMKR
ELTSFVVLVPQGTPDVQLRLTRPQLTKVLNATGADSATPTTFELGVNGKFSVFTTSTCVTFAAREEGVSSSTSTQVQILS
$ awk -F\> '/^>/ {if (seqlen != ""){print seqlen}printf("%s - ",$2);seqlen=0;next}seqlen != ""{seqlen +=length($0)}END{print seqlen}' kk2
1DMLB - 0
2BHDC - 240

Related

How to print something multiple times in awk

I have a file sample.txt that looks like this:
Sequence: chr18_gl000207_random
Repeat 1
Indices: 2822--2996 Score: 135
Period size: 36 Copynumber: 4.8 Consensus size: 36
Consensus pattern (36 bp):
TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
Repeat 2
Indices: 2736--3623 Score: 932
Period size: 111 Copynumber: 8.1 Consensus size: 111
Consensus pattern (111 bp):
TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
GTGGTGGCTGTTGTGGTTGTAGCCTCAGTGGAAGTGCCTGCAGTTG
Repeat 3
Indices: 3421--3496 Score: 89
Period size: 39 Copynumber: 1.9 Consensus size: 39
Consensus pattern (39 bp):
AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT
I have used awk to extract values for parameters that are relevant for me like this:
paste <(awk '/Indices/ {print $2}' sample.txt) <(awk '/Period size/ {print $3}' sample.txt) <(awk '/Copynumber/ {print $5}' sample.txt) <(awk '/Consensus pattern/ {getline; print $0}' sample.txt)
Output:
2822--2996 36 4.8 TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
2736--3623 111 8.1 TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
3421--3496 39 1.9 AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT
Now I want to add the parameter Sequence to every row.
Desired output:
chr18_gl000207_random:2822--2996 36 4.8 TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
chr18_gl000207_random:2736--3623 111 8.1 TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
chr18_gl000207_random:3421--3496 39 1.9 AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT
I want to do this for many files in a loop so I need a solution that would work with a different number of Repeats as well.
$ cat tst.awk
BEGIN { OFS="\t" }
$1 == "Sequence:" { seq = $2; next }
$1 == "Indices:" { ind = $2; next }
$1 == "Period" { per = $3; cpy = $5; next }
$1 == "Consensus" { isCon=1; next }
isCon { print seq":"ind, per, cpy, $1; isCon=0 }
$ awk -f tst.awk file
chr18_gl000207_random:2822--2996 36 4.8 TCAGTTGCAGTGCTGGCTGTTGTTGTGGCAGACTGT
chr18_gl000207_random:2736--3623 111 8.1 TTGTGGCAGACTGTTCAGTTGCAGTGCTGGCTGTTGTTGTGGTTGCGGGTTCAGTAGAGGTGGTA
chr18_gl000207_random:3421--3496 39 1.9 AGTGCTGACTGTTGTGGTGGCAGCCTCAGTAGAAGTGGT

Concatenating array elements into a one string in for loop using awk

I am working on a variant calling format (vcf) file, and I tried to show you guys what I am trying to do:
Input:
1 877803 838425 GC G
1 878077 966631 C CCACGG
Output:
1 877803 838425 C -
1 878077 966631 - CACGG
In summary, I am trying to delete the first letters of longer strings.
And here is my code:
awk 'BEGIN { OFS="\t" } /#/ {next}
{
m = split($4, a, //)
n = split($5, b, //)
x = "-"
delete y
if (m>n){
for (i = n+1; i <= m; i++) {
y = sprintf("%s", a[i])
}
print $1, $2, $3, y, x
}
else if (n>m){
for (j = m+1; i <= n; i++) {
y = sprintf("%s", b[j]) ## Problem here
}
print $1, $2, $3, x, y
}
}' input.vcf > output.vcf
But,
I am getting the following error in line 15, not even in line 9
awk: cmd. line:15: (FILENAME=input.vcf FNR=1) fatal: attempt to use array y in a scalar context
I don't know how to concatenate array elements into a one string using awk.
I will be very happy if you guys help me.
Merry X-Mas!
You may try this awk:
awk -v OFS="\t" 'function trim(s) { return (length(s) == 1 ? "-" : substr(s, 2)); } {$4 = trim($4); $5 = trim($5)} 1' file
1 877803 838425 C -
1 878077 966631 - CACGG
More readable form:
awk -v OFS="\t" 'function trim(s) {
return (length(s) == 1 ? "-" : substr(s, 2))
}
{
$4 = trim($4)
$5 = trim($5)
} 1' file
You can use awk's substr function to process the 4th and 5th space delimited fields:
awk '{ substr($4,2)==""?$4="-":$4=substr($4,2);substr($5,2)==""?$5="-":$5=substr($5,2)}1' file
If the string from position 2 onwards in field 4 is equal to "", set field 4 to "-" otherwise, set field 4 to the extract of the field from position 2 to the end of the field. Do the same with field 5. Print lines modified or not with short hand 1.

Process multiple files with awk

I would like to count the number of points in each interval. I have the positions of the points in the first file and the intervals in the second. First I store the point attributes in two arrays(pos and name) and then i want to loop over them in order to determine wheter it belongs to the given interval ($1 is the name and $2 is the start and $3 is the end of the interval). I have the following code:
awk 'NR==FNR{name[NR]=$1;pos[NR]=$2;next}; {for (i in name) if (name[i] == $1 && pos[i] >= $2 && pos[i] <= $3) {sum[NR] += 1;}} END {for (i = 1; i <=length(sum); i++) {print sum[i]}} file1 file2 > out'
I have a syntax error: "syntax error near unexpected token `i"
I am beginner in awk. Any help is highly appriciated. Thanks
awk '
NR==FNR{
name[NR]=$1
pos[NR]=$2
next
}
{
for(i in name){
if(name[i] == $1 && pos[i] >= $2 && pos[i] <= $3){ sum[FNR] += 1; }
}
}
END {
for(i = 1; i <=FNR; i++){
print sum[i];
}
}
' points windows > output
points:
chr1 52
chr1 65
chr2 120
chr2 101
chr2 160
chr3 20
chr4 50
windows:
chr1 0 100
chr1 100 200
chr2 0 100
chr2 100 200
chr3 0 100
chr3 100 200
chr4 0 100
chr5 0 100
chr6 0 100
chr6 100 200
chr7 0 100
chr8 0 100
gave me the desired output:
2
3
1
1
Thank You
Your ' is in wrong place and awk command is not ending properly, could you please try following. Couldn't test it since no samples are given.
awk 'NR==FNR{name[NR]=$1;pos[NR]=$2;next}; {for (i in name) if (name[i] == $1 && pos[i] >= $2 && pos[i] <= $3) {sum[NR] += 1;}} END {for (i = 1; i <=length(sum); i++) {print sum[i]}}' file1 file2
Non-one liner form of above solution.
awk '
NR==FNR{
name[NR]=$1
pos[NR]=$2
next
}
{
for(i in name){
if(name[i] == $1 && pos[i] >= $2 && pos[i] <= $3){ sum[NR] += 1 }
}
}
END{
for(i = 1; i <=length(sum); i++){
print sum[i]
}
}
' file1 file2 > out
As per sir #Ed Morton 's comment following could be few recommendations: Again these are not tested since no samples were given but you could try to apply them.
sum[NR] should be as sum[FNR] if in case you want to put index as per line number of Input_file2, why because difference between NR and FNR is that NR's value will be keep keep increasing till all Input_file(s) are read but FNR value will be RESET to 1 whenever there is new Input_file is being read.
Then, length(sum) could be value of FNR because basically you may be looking for total number of times loop has to run which you could get by FNR value.

awk to compare value of sub-string in field

In the below awk I am trying to extract and compare each substring in $4 that stars with p.. If the first three letters is the same as the last three (there is a digit in between) then that p. is updated to p.(3 letters)(digit)(=) --- the () are only to show that there are 3 enteries and are not needed. If the 3 letters are different then that line is unchanged. In the below file line 1 in an example. In my actual data there are about 10,000 rows wth about 50 columns, but $4 is the only one that will have these values in ut, that is te p. The format of the p. will always be three letters followed by a 1-4 digit # followed by 3 more letters. The awk attempt below I think will extract each p. and split on the ;, but I am not sure how to compare to check if the three letters are the same. Thank you :).
file tab-delimited
Chr Start ExonicFunc.refGene AAChange.refGene
chr1 155880573 synonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu110Glu;RIT1:NM_001256822:exon2:c.31G>C:p.Glu110Glu
chr1 155880573 nonsynonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu11Gln
desired output tab-delimited
Chr Start ExonicFunc.refGene AAChange.refGene
chr1 155880573 synonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu110=;RIT1:NM_001256822:exon2:c.31G>C:p.Glu110=
chr1 155880573 nonsynonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu11Gln
awk
awk '
BEGIN { OFS="\t" }
$4 ~ /:NM/ {
ostring=""
# split $4 by ";" and cycle through them
nNM=split($4,NM,";")
for (n=1; n<=nNM; n++) {
if (n>1) ostring=(ostring ";") # append ";"
if (match(NM[n],/p[.].*/)) {
# copy up to "p."
ostring=(ostring substr(NM[n],1,RSTART+1))
# Get the substring after "p."
VAL=substr(NM[n],RSTART+2)
# Get its length
lenVAL=length(VAL)
# store aa array
aa=[{while(length($4)=3){print substr($044,1,3);gsub(/^./,"")}]}' file
Extended GNU awk solution:
awk 'NR==1; NR > 1{
len = split($4, a, /\<p\.[a-zA-Z]{3}[0-9]+[a-zA-Z]{3}\>/, seps);
if (len == 1){ print; next }
res = ""
for (i=1; i < len; i++) {
s = seps[i];
if (substr(s, 3, 3) == substr(s, length(s) - 2)) {
seps[i] = substr(s, 1, length(s) - 3)"=";
}
}
for (i=1; i <= len; i++)
res = res a[i] (seps[i]? seps[i]:"");
$4 = res; print
}' FS='\t' OFS='\t' file
The output:
Chr Start ExonicFunc.refGene AAChange.refGene
chr1 155880573 synonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu110=;RIT1:NM_001256822:exon2:c.31G>C:p.Glu110=
chr1 155880573 nonsynonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu11Gln
Time performance measurement:
Input testfile:
$ wc -l testfile
10000 testfile
time(awk 'NR==1; NR > 1{
len = split($4, a, /\<p\.[a-zA-Z]{3}[0-9]+[a-zA-Z]{3}\>/, seps);
if (len == 1){ print; next }
res = ""
for (i=1; i < len; i++) {
s = seps[i];
if (substr(s, 3, 3) == substr(s, length(s) - 2)) {
seps[i] = substr(s, 1, length(s) - 3)"=";
}
}
for (i=1; i <= len; i++)
res = res a[i] (seps[i]? seps[i]:"");
$4 = res; print
}' FS='\t' OFS='\t' testfile >/dev/null)
real 0m0.269s
user 0m0.256s
sys 0m0.000s
time(awk 'BEGIN { FS=OFS="\t" }
NR>1 {
head = ""
tail = $4
while ( match(tail,/(p\.([[:alpha:]]{3})[0-9]+)([[:alpha:]]{3})/,a) ) {
head = head substr(tail,1,RSTART-1) a[1] (a[2] == a[3] ? "=" : a[3])
tail = substr(tail,RSTART+RLENGTH)
}
$4 = head tail
}
{ print }' testfile >/dev/null)
real 0m0.470s
user 0m0.416s
sys 0m0.008s
With GNU awk for the 3rd arg to match():
$ cat tst.awk
BEGIN { FS=OFS="\t" }
NR>1 {
head = ""
tail = $4
while ( match(tail,/(p\.([[:alpha:]]{3})[0-9]+)([[:alpha:]]{3})/,a) ) {
head = head substr(tail,1,RSTART-1) a[1] (a[2] == a[3] ? "=" : a[3])
tail = substr(tail,RSTART+RLENGTH)
}
$4 = head tail
}
{ print }
$ gawk -f tst.awk file
Chr Start ExonicFunc.refGene AAChange.refGene
chr1 155880573 synonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu110=;RIT1:NM_001256822:exon2:c.31G>C:p.Glu110=
chr1 155880573 nonsynonymous SNV RIT1:NM_001256821:exon2:c.31G>C:p.Glu11Gln

awk match and find mismatch between files and output results

In the below awk I am using $5 $7 and $8 of file1 to search $3 $5 and $6 of file2. The header row is skipped and it then outputs a new file with what lines match and if they do not match what file the match is missing from. When I search for one match use 3 fields for the key for the lookup and do not skip the header I get current output. I apologize for the long post and file examples, just trying to include everything to help get this working. Thank you :).
file1
Index Chromosomal Position Gene Inheritance Start End Ref Alt Func.refGene
98 48719928 FBN1 AD 48719928 48719929 AT - exonic
101 48807637 FBN1 AD 48807637 48807637 C T exonic
file2
R_Index Chr Start End Ref Alt Func.IDP.refGene
36 chr15 48719928 48719929 AT - exonic
37 chr15 48719928 48719928 A G exonic
38 chr15 48807637 48807637 C T exonic
awk
awk -F'\t' '
NR == FNR {
A[$25]; A[$26]; A[$27]
next
}
{
B[$3]; B[$5]; B[$6]
}
END {
print "Match"
OFS=","
for ( k in A )
{
if ( k && k in B )
printf "%s ", k
}
print "Missing from file1"
OFS=","
for ( k in B )
{
if ( ! ( k in A ) )
printf "%s ", k
}
print "Missing from file2"
OFS=","
for ( k in A )
{
if ( ! ( k in B ) )
printf "%s ", k
}
}
' file1 file2 > list
current output
Match
Missing from file1
A C Ref 48807637 Alt Start T G - AT 48719928 Missing from file2
desired output
Match 48719928 AT -, 48807637 C T
Missing from file1 48719928 A G
Missing from file2
You misunderstand awk syntax and are confusing awk with shell. When you wrote:
A[$25] [$26] [$27]
you probably meant:
A[$25]; A[$26]; A[$27]
(and similarly for B[]) and when you wrote:
IFS=
since IFS is a shell variable, not an awk one, you maybe meant
FS=
BUT since you're doing that in the END section and not calling split() and so not doing anything that would use FS idk what you were hoping to achieve with that. Maybe you meant:
OFS=
BUT you aren't doing anything that would use OFS and your desired output isn't comma-separated so idk what you'd be hoping to achieve with that either.
If that's not enough info for you to solve your problem yourself then reduce your example to something with 10 columns or less so we don't have to read a lot of irrelevant info to help you.
Program 1
This works, except the output format is different from what you request:
awk 'FNR==1 { next }
FNR == NR { file1[$5,$7,$8] = $5 " " $7 " " $8 }
FNR != NR { file2[$3,$5,$6] = $3 " " $5 " " $6 }
END { print "Match:"; for (k in file1) if (k in file2) print file1[k] # Or file2[k]
print "Missing in file1:"; for (k in file2) if (!(k in file1)) print file2[k]
print "Missing in file2:"; for (k in file1) if (!(k in file2)) print file1[k]
}' file1 file2
Output 1
Match:
48807637 C T
48719928 AT -
Missing in file1:
48719928 A G
Missing in file2:
Program 2
If you must have each set of values in a category comma-separated on a single line, then:
awk 'FNR==1 { next }
FNR == NR { file1[$5,$7,$8] = $5 " " $7 " " $8 }
FNR != NR { file2[$3,$5,$6] = $3 " " $5 " " $6 }
END {
printf "Match"
pad = " "
for (k in file1)
{
if (k in file2)
{
printf "%s%s", pad, file1[k]
pad = ", "
}
}
print ""
printf "Missing in file1"
pad = " "
for (k in file2)
{
if (!(k in file1))
{
printf "%s%s", pad, file2[k]
pad = ", "
}
}
print ""
printf "Missing in file2"
pad = " "
for (k in file1)
{
if (!(k in file2))
{
printf "%s%s", pad, file1[k]
pad = ", "
}
}
print ""
}' file1 file2
The code is a little bigger, but the format used exacerbates the difference. The change is all in the END block; the other code is unchanged. The sequences of actions in the END block no longer fit comfortably on a single line, so they're spread out for readability. You can apply a liberal smattering of semicolons and concatenate the lines to shrink the apparent size of the program if you desire.
It's tempting to try a function for the printing, but the conditions just make it too tricky to be worthwhile, I think — but I'm open to persuasion otherwise.
Output 2
Match 48807637 C T, 48719928 AT -
Missing in file1 48719928 A G
Missing in file2
This output will be a lot harder to parse than the one shown first, so doing anything automatically with it will be tricky. While there are only 3 entries to worry about, the line length isn't an issue. If you get to 3 million entries, the lines become very long and unmanageable.