awk to extract and print first occurrence of patterns

I am trying to use awk to extract and print the first occurrence of NM_ and the portion of the NP_ entry starting with :p., with a : printed instead of the "|" between them. The input file is tab-delimited, but the output does not need to be. The below does execute but prints all the lines in the file, not just the patterns. There may be multiple NM_ or NP_ entries in my actual data of over 5000 lines, but only the first occurrence of each should be extracted and printed. I am still a little unclear on the RSTART and RLENGTH concepts but, using line 1 as an example from the input:
The NM variable would be NM_020469.2
The NP variable would be :p.Gly268Arg
I have included comments as well. Thank you :).
input
Input Variant HGVS description(s) Errors and warnings
rs41302905 NC_000009.11:g.136131316C>T|NM_020469.2:c.802G>A|NP_065202.2:p.Gly268Arg
rs8176745 NC_000009.11:g.136131347G>A|NM_020469.2:c.771C>T|NP_065202.2:p.Pro257=
desired output
rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=
awk
awk -F'[\t|]' 'NR>1{ # define FS as tab and `|` to split each line, and skip the header line
r=$1; nm=np=""; # save $1 in r and create 2 variables (one for nm, the other for np), setting them to null
for(i=2;i<=NF;i++) { # loop over the fields, starting from field 2
if ($i~/^NM_/) nm=$i; # capture the first NM_ field into nm
else if ($i~/^NP_/) np=substr($i,index($i,":")); # capture the portion of the NP_ field from the : onward (including the :)
if (nm && np) { print r,nm np; break } # once both are captured, print the desired output and stop
}
}' input

Awk solution:
awk -F'[\t|]' 'NR>1{
r=$1; nm=np="";
for(i=2;i<=NF;i++) {
if ($i~/^NM_/) nm=$i;
else if ($i~/^NP_/) np=substr($i,index($i,":"));
if (nm && np) { print r,nm np; break }
}
}' input
NR>1 - start processing from the 2nd record
r=$1; nm=np="" - initialization of the needed variables
for(i=2;i<=NF;i++) - iterating through the fields (starting from the 2nd)
if ($i~/^NM_/) nm=$i - capturing the NM_... item into variable nm
else if ($i~/^NP_/) np=substr($i,index($i,":")) - capturing the NP_... item into variable np (from the : to the end)
if (nm && np) { print r,nm np; break } - once both items have been captured, print them and break the loop to avoid further processing
The output:
rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=

Could you please try the following and let me know if this helps too.
awk '{
match($0,/NM_[^|]*/);
nm=substr($0,RSTART,RLENGTH);
match($0,/NP_[^|]*/);
np=substr($0,RSTART,RLENGTH);
split(np, a,":");
if(nm && np){
print $1,nm ":" a[2]
}
}
' Input_file
Output will be as follows.
rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=
PS: Since your sample Input_file doesn't show TABs in it, no field separator is set here; you could add -F"\t" after awk in case your Input_file is TAB-delimited, and if you want the output TAB-delimited too, add OFS="\t" before Input_file.
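Since the question mentions being unclear on RSTART and RLENGTH: match() returns the position where the regex first matches the string (0 if it doesn't) and, as a side effect, sets RSTART to that 1-based position and RLENGTH to the length of the matched text (RSTART=0 and RLENGTH=-1 on no match), so substr($0,RSTART,RLENGTH) is exactly the matched text. A minimal illustration using the HGVS string from line 1:
echo 'NC_000009.11:g.136131316C>T|NM_020469.2:c.802G>A|NP_065202.2:p.Gly268Arg' |
awk '{
if (match($0, /NM_[^|]*/))   # sets RSTART and RLENGTH
print "RSTART=" RSTART, "RLENGTH=" RLENGTH, "match=" substr($0, RSTART, RLENGTH)
}'
This prints RSTART=29 RLENGTH=20 match=NM_020469.2:c.802G>A.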

Short GNU awk solution (with match function):
awk 'match($0,/(NM_[^|]+).*NP_[^:]+([^[:space:]|]+)/,a){ print $1,a[1] a[2] }' input
The output:
rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=
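Note that the three-argument match(string, regexp, array) form used above is a GNU awk extension; a POSIX awk only has the two-argument form together with RSTART/RLENGTH. A portable sketch of the same idea:
awk 'NR>1{
nm = np = ""
if (match($0, /NM_[^|]*/)) nm = substr($0, RSTART, RLENGTH)   # first NM_... entry
if (match($0, /NP_[^|]*/)) {                                  # first NP_... entry
np = substr($0, RSTART, RLENGTH)
sub(/^[^:]*/, "", np)                                         # keep from the ":" onward
}
if (nm != "" && np != "") print $1, nm np
}' input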

Given your posted sample input, this is all you need to produce your desired output:
$ awk -F'[\t|]+' 'NR>1{sub(/[^:]+/,"",$4); print $1, $3 $4}' file
rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=
If that's not all you need then provide more truly representative input/output.
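To see why $3 and $4 are the right fields with that field separator, you can dump the fields of the first data line (a quick check against the posted sample):
$ awk -F'[\t|]+' 'NR==2{for (i=1; i<=NF; i++) print i": "$i}' file
1: rs41302905
2: NC_000009.11:g.136131316C>T
3: NM_020469.2:c.802G>A
4: NP_065202.2:p.Gly268Arg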

Another alternative awk proposal (note this hardcodes the NP_ accession from the sample):
awk 'NR>1{sub(/\|/," "); sub(/\|NP_065202.2/,""); print $1,$3}' file
rs41302905 NM_020469.2:c.802G>A:p.Gly268Arg
rs8176745 NM_020469.2:c.771C>T:p.Pro257=

Related

AWK calculate percentages for each input file and output summary to single file

I have hundreds of csv files with the same format. I want to summarise each file by counting occurrences of the word "Correct" in column 3 and calculating the percentage of "Correct"s per file (i.e. "Correct"s / total number of rows in that file). I am currently doing this for each file with a shell for-loop, but this isn't ideal for a number of reasons.
Minimal reproducible example:
cat file1.csv
id,prediction,evaluation
1,high,Correct
2,low,Correct
3,high,Incorrect
4,low,Incorrect
cat file2.csv
id,prediction,evaluation
1,high,Correct
2,low,Correct
3,high,Correct
4,low,Incorrect
Correct answer for each individual file:
awk 'BEGIN{FS=OFS=","; print "model,total_correct,accuracy"} NR>1{n++; if($3 == "Correct"){correct++}} END{print FILENAME, correct, correct / n}' file1.csv
model,total_correct,accuracy
file1.csv,2,0.5
awk 'BEGIN{FS=OFS=","; print "model,total_correct,accuracy"} NR>1{n++; if($3 == "Correct"){correct++}} END{print FILENAME, correct, correct / n}' file2.csv
model,total_correct,accuracy
file2.csv,3,0.75
My desired outcome:
model,total_correct,accuracy
file1.csv,2,0.5
file2.csv,3,0.75
Thanks for any advice.
With GNU awk you can try the following code, written and tested with the shown samples. It uses ENDFILE to make life easy. Two more conditions are added to the code: first, n is only incremented for non-empty lines; second, when computing the accuracy, it prints N/A rather than a division error in case a file has zero data rows. I have also changed NR>1 to FNR>1, since NR is a cumulative count across all files while FNR resets at the beginning of each Input_file.
awk '
BEGIN{
FS=OFS=","
print "model,total_correct,accuracy"
}
FNR>1{
if(NF) { n++ }
if($3 == "Correct"){ correct++ }
}
ENDFILE{
if(n>0){ printf("%s,%d,%.2f\n",FILENAME,correct,correct/n) }
else { printf("%s,%d,N/A\n",FILENAME,correct) }
correct=n=0
}
' *.csv
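ENDFILE is a GNU awk extension; if you only have a POSIX awk, here is a sketch of the same idea that flushes the counters whenever a new file begins (and once more in END for the last file):
awk '
BEGIN{
FS=OFS=","
print "model,total_correct,accuracy"
}
FNR==1 && NR>1{ print prev, correct, (n ? correct/n : "N/A"); correct=n=0 }
{ prev=FILENAME }
FNR>1{
if(NF){ n++ }
if($3 == "Correct"){ correct++ }
}
END{ print prev, correct, (n ? correct/n : "N/A") }
' *.csv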
With standard awk, you can increment counts in an array indexed by filename and by whether the third column is "Correct", and iterate through the filenames at the end to output the statistics:
awk '
BEGIN{FS=OFS=",";print"model,total_correct,accuracy"}
FNR>1{++r[FILENAME,$3=="Correct"]}
END{
for(i=1;i<ARGC;++i){
f=ARGV[i];
print f,c=r[f,1],c/(r[f,0]+c)
}
}' *.csv
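The r[FILENAME,$3=="Correct"] subscript is awk's pseudo-multidimensional array: the two subscripts are joined with the SUBSEP character, and $3=="Correct" evaluates to 1 or 0, so each file carries a pair of counters. A minimal illustration of the idiom:
awk 'BEGIN{
cnt["file1.csv",1]++              # stored as cnt["file1.csv" SUBSEP "1"]
for (k in cnt) {
split(k, part, SUBSEP)            # recover the two subscripts
print part[1], part[2], cnt[k]    # -> file1.csv 1 1
}
}'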

Printing blank lines when looping associative array in AWK

I am not clear on why blank lines are being printed instead of the correct values from the day[] array in AWK.
BEGIN{
day[1]="Sunday"
day["first"]="Sunday"
day[2]="Monday"
day["second"]="Monday"
day[4]="Wednesday"
day["fourth"]="Wednesday"
day[3]="Tuesday"
day["third"]="Tuesday"
for (i in day)
{
print $i
print day[$i]
}
}
Explicitly printing out individual array elements yields the expected values, as follows:
BEGIN{
day[1]="Sunday"
day["first"]="Sunday"
day[2]="Monday"
day["second"]="Monday"
day[4]="Wednesday"
day["fourth"]="Wednesday"
day[3]="Tuesday"
day["third"]="Tuesday"
print day[1]
print day["first"]
print day[2]
print day["second"]
print day[3]
print day["third"]
print day[4]
print day["fourth"]
}
I am running Linux Fedora 5.12.11-300.
Many thanks in advance,
Mary
You shouldn't use $ when printing i or the array value, as $ refers to a field's value in the awk language; use the following instead. Also, you don't need two print statements; you could use a single print with a newline (ORS) in it.
awk '
BEGIN{
day[1]="Sunday"
day["first"]="Sunday"
day[2]="Monday"
day["second"]="Monday"
day[4]="Wednesday"
day["fourth"]="Wednesday"
day[3]="Tuesday"
day["third"]="Tuesday"
for (i in day)
{
print i ORS day[i]
}
}'
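To see what went wrong: in awk, $ is the field-access operator, so $i means "field number i" of the current input record, not "the value of i" as in shell; in a BEGIN block no record has been read yet, so $i is empty and day[$i] looks up an empty key. A quick demonstration on a real record:
echo 'alpha beta gamma' | awk '{ i = 2; print i; print $i }'
2
beta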
Improved version: you also don't need two statements for the same value; you can define them in a single chained assignment. Even with different indexes having the same value it works, which saves a few lines of code :)
awk '
BEGIN{
day[1]=day["first"]="Sunday"
day[2]=day["second"]="Monday"
day[4]=day["fourth"]="Wednesday"
day[3]=day["third"]="Tuesday"
for (i in day)
{
print i OFS day[i]
}
}'
A less verbose way of doing this is by splitting input strings, e.g.:
awk -v days='Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday' \
-v cardinal='First,Second,Third,Fourth,Fifth,Sixth,Seventh' '
BEGIN {
split(days, days_ar, /,/)
split(cardinal, cardinal_ar, /,/)
for (i=1; i<=7; i++)
print cardinal_ar[i] " = " days_ar[i]
}' | column -t
Output:
First = Monday
Second = Tuesday
Third = Wednesday
Fourth = Thursday
Fifth = Friday
Sixth = Saturday
Seventh = Sunday
Thank you so much to all those who have contributed to answering my call for help.
I chose the first answer because I was interested in finding out why the array values weren't being printed. RavinderSingh13 correctly identified the reason, which was the use of $ with the variable i. I had mistakenly treated Awk like shell scripting. Below is the code that printed the array values without the $:
awk '
BEGIN{
day[1]=day["first"]="Sunday"
day[2]=day["second"]="Monday"
day[3]=day["third"]="Tuesday"
day[4]=day["fourth"]="Wednesday"
for (i in day)
{
print i,day[i]
}
}'
One interesting point that I have learnt about Awk scripting is that only the BEGIN section always runs, with or without an input file. The body and END sections, on the other hand, do not run unless the Awk statement is given at least one input file (or standard input).
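This is easy to verify from the command line:
awk 'BEGIN{print "begin"}'           # prints "begin" and exits; no input needed
awk 'END{print "end"}' /dev/null     # END runs once the (here empty) input is exhausted
awk 'END{print "end"}'               # without a file, this sits waiting for stdin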

awk to extract days from line

I have the following csv file
238013750030646-2;;"Default";"2020-10-01 00:40:36";;"opening";0;3591911;283940640
238013750030646-2;;"Default";"2020-10-03 00:40:36";;"closing line";0;89320;283940640
238013750030646-2;;"something-else";"2020-10-04 00:40:36";;"started";0;0;283940640
238013750030646-2;;"default else";"2020-10-08 05:42:06";;"opening";0;2410;283940640
I'm trying to store each line in a specific file matching the date from each line, with the date being in the 4th column of each line, so the first line ("2020-10-01 00:40:36") should go in output-01.csv, the second line in output-03.csv, etc.
This awk command
awk -F";|-" -vOFS='\t' '{print > "output-"$7".csv"}' testing.csv
half works but fails on line 3 because of the - in the 3rd column (that line ends up in output-10.csv), and on line 4 because of the space in the 3rd column
Is there a way to run the awk command twice? Then I could extract the date using the ; separator and then split using -.
Using gawk, this takes care of an unsorted file too:
awk 'match($0,/([0-9]{4})-([0-9]{2})-([0-9]{2})/,arr){
file=sprintf("output-%s.csv",arr[3]);
if(!seen[file]++){
print >file;
next
}
}{
print >>file;
close(file);
}' infile
Explanation:
awk 'match($0,/([0-9]{4})-([0-9]{2})-([0-9]{2})/,arr){ # match for regex
file=sprintf("output-%s.csv",arr[3]); # file variable using array arr value, 3rd index
if(!seen[file]++){ # if not seen file name before in array seen
print >file; # print content to file
next # go to next line
}
}{
print >>file; # append content to file
close(file); # close file
}' infile
Try this:
$ awk -F';' -v OFS='\t' '{split($4,a,/[- ]/); file = "output-"a[3]".csv";
$1=$1; print > file; close(file)}' testing.csv
split($4,a,/[- ]/) this will split 4th field further based on space or - characters, saved in array a
file = "output-"a[3]".csv" output filename
$1=$1 since there's no other command changing contents of input line, this is needed to rebuild input line, otherwise OFS will not be applied
print > file print input line to required file
close(file) calling close, useful if there are too many file names
You can also use file = "output-" substr($4,10,2) ".csv" instead of split if the 4th column is consistent as shown in the sample.
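That variant would look like this in full (assuming the 4th field always has the form "YYYY-MM-DD HH:MM:SS" including the opening quote, so the day sits at offset 10):
awk -F';' -v OFS='\t' '{file = "output-" substr($4,10,2) ".csv"; $1=$1; print > file; close(file)}' testing.csv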
With your shown samples, please try following, written and tested in GNU awk.
awk '
match($0,/[0-9]{4}(-[0-9]{2}){2}/){
outputFile="output-" substr($0,RSTART+8,RLENGTH-8)".csv"
print >> (outputFile)
close(outputFile)
}
' Input_file
Explanation: Adding detailed explanation for above.
awk ' ##Starting awk program from here.
match($0,/[0-9]{4}(-[0-9]{2}){2}/){ ##using match function to match yyyy-mm-dd here in line.
outputFile="output-" substr($0,RSTART+8,RLENGTH-8)".csv" ##Building outputFile from the day part (last 2 chars) of the matched date here.
print >> (outputFile) ##Printing current line into outputFile here.
close(outputFile) ##Closing output file to avoid too many files opened error.
}
' Input_file ##Mentioning Input_file name here.
To do this efficiently you should sort on the key field first:
awk -F';' '{print $4, NR, $0}' file |
sort -k1,1 -k3,3n |
awk '
{ curr=$1; sub(/([^ ]+ ){3}/,"") }   # remember the date key, then strip the decoration ($4 splits into 2 fields, plus NR)
curr != prev { close(out); out="output-" (++c) ".csv"; prev=curr }
{ print > out }
'
$ head output*.csv
==> output-1.csv <==
238013750030646-2;;"Default";"2020-10-01 00:40:36";;"opening";0;3591911;283940640
==> output-2.csv <==
238013750030646-2;;"Default";"2020-10-03 00:40:36";;"closing line";0;89320;283940640
==> output-3.csv <==
238013750030646-2;;"something-else";"2020-10-04 00:40:36";;"started";0;0;283940640
==> output-4.csv <==
238013750030646-2;;"default else";"2020-10-08 05:42:06";;"opening";0;2410;283940640
The above will work using any awk+sort in any shell on every Unix box. It's the common decorate/sort/undecorate idiom: prepend the sort key (and the line number, to keep the original order within each key), sort on them, then strip them off again before writing. Sorting first means each output file is written in one contiguous run, so only one file is open at a time. See the many similar examples on this site for more discussion.

AWK - get value between two strings over multiple lines

input.txt:
>block1
111111111111111111111
>block2
222222222222222222222
>block3
333333333333333333333
AWK command:
awk '/>block2.*>/' input.txt
Expected output
222222222222222222222
However, AWK is returning nothing. What am I misunderstanding?
Thanks!
If you want to print the line after the line containing >block2, then you could use:
awk '/^>block2$/ { nr=NR+1 } NR == nr { print }'
Track the record number plus 1 when you find the match; when the current record number matches the remembered one, print the current record.
If you want all the lines between the line >block2 and >block3, then you'd use:
awk '/^>block2$/,/^>block3/ {if ($0 !~ /^>block[23]$/) print }'
For all lines between the two markers, if the line doesn't match either marker, print it. The output is the same with the sample data file.
another awk
$ awk 'c&&c--; /^>block2/{c=1}' file
222222222222222222222
c specifies how many lines you want to print after the match. If you want the text between two markers
$ awk '/^>block3/{exit} s; /^>block2/{s=1}' file
222222222222222222222
if there are multiple instances and you want them all, just change exit to s=0
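i.e. the same one-liner with that change:
$ awk '/^>block3/{s=0} s; /^>block2/{s=1}' file
222222222222222222222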
You probably meant:
$ awk '/>/{f=/^>block2$/;next} f' file
222222222222222222222
On every ">" header line this sets the flag f (true only for >block2) and skips the header itself; the bare f then prints the following lines while the flag is set.
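The idiom also generalizes; for example, a sketch that takes the block name as a variable (blk is just an assumed name here):
$ awk -v blk='block2' '/^>/{f=($0 == ">" blk); next} f' file
222222222222222222222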

How to print fields for repeated key column in one line

I'd like to transform a table so that duplicated values in column #2 are collapsed into a single line that collects the corresponding values from column #1.
I.e. something like this...
MZ00024296 AC148152.3_FG005
MZ00047079 AC148152.3_FG006
MZ00028122 AC148152.3_FG008
MZ00032922 AC148152.3_FG008
MZ00048218 AC148152.3_FG008
MZ00024680 AC148167.6_FG001
MZ00013456 AC149475.2_FG003
to
AC148152.3_FG005 MZ00024296
AC148152.3_FG006 MZ00047079
AC148152.3_FG008 MZ00028122|MZ00032922|MZ00048218
AC148167.6_FG001 MZ00024680
AC149475.2_FG003 MZ00013456
As I need it for computations in R, I tried to use:
x=aggregate(mz_grmz,by=list(mz_grmz[,2]),FUN=paste(mz_grmz[,1],sep="|"))
but it doesn't work (wrong function):
Error in match.fun(FUN) :
'paste(mz_grmz[, 1], sep = "|")' is not a function, character or symbol
I also reminded myself of the unstack() function, but it isn't what I need.
I also tried to do it using awk; based on my basic knowledge I reworked code given here:
site1
#! /bin/sh
for y do
awk -v FS="\t" '{
for (x=1;x<=NR;x++) {
if (NR>2 && x=x+1) {
print $2"\t"x
}
else {print NR}
}
}' $y > $y.2
done
unfortunately it doesn't work; it only produces an enormous file with field #2 and some numbers.
I suppose it is an easy task, but it is above my skills right now.
Could somebody give me a hint? Maybe just the function to use in aggregate in R.
Thanks
You could do it in awk like this:
awk '
{
if ($2 in a)
a[$2] = a[$2] "|" $1
else
a[$2] = $1
}
END {
for (i in a)
print i, a[i]
}' INFILE > OUTFILE
to keep the output the same as the text in your question (empty lines etc.):
awk '{if($0 &&($2 in a))a[$2]=a[$2]"|"$1;else if ($0) a[$2]=$1;}\
END{for(x in a){print x,a[x];print ""}}' inputFile
test:
kent$ echo "MZ00024296 AC148152.3_FG005
MZ00047079 AC148152.3_FG006
MZ00028122 AC148152.3_FG008
MZ00032922 AC148152.3_FG008
MZ00048218 AC148152.3_FG008
MZ00024680 AC148167.6_FG001
MZ00013456 AC149475.2_FG003"|awk '{if($0 &&($2 in a))a[$2]=a[$2]"|"$1;else if ($0) a[$2]=$1;}END{for(x in a){print x,a[x];print ""}}'
AC149475.2_FG003 MZ00013456
AC148152.3_FG005 MZ00024296
AC148152.3_FG006 MZ00047079
AC148152.3_FG008 MZ00028122|MZ00032922|MZ00048218
AC148167.6_FG001 MZ00024680
This GNU sed solution might work for you:
sed -r '1{h;d};H;${x;s/(\S+)\s+(\S+)/\2\t\1/g;:a;s/(\S+\t)([^\n]*)(\n+)\1([^\n]*)\n*/\1\2|\4\3/;ta;p};d' input_file
Explanation: the -r option enables extended regexes to make them more readable. Read the whole file into the hold space (HS). Then, on end-of-file, switch to the HS and first swap and tab-separate the fields. Then compare the first fields on adjacent lines and, if they match, tack the second field of the second record onto the first line, separated by a |. This repeats until no further adjacent lines have duplicate first fields, then the file is printed out.
Explanation: Use the extended regex option-r to make regex's more readable. Read the whole file into the hold space (HS). Then on end-of-file, switch to the HS and firstly swap and tab separate fields. Then compare the first fields in adjacent lines and if they match, tag the second field from the second record to the first line separated by a |. Repeated until no further adjacent lines have duplicate first fields then print the file out.