formatted reading using awk - scripting

I am trying to read in a formatted file using awk. The content looks like the following:
1PS1 A1 1 11.197 5.497 7.783
1PS1 A1 1 11.189 5.846 7.700
.
.
.
Following c format, these lines are in following format
"%5d%5s%5s%5d%8.3f%.3f%8.3f"
where, first 5 positions are integer (1), next 5 positions are characters (PS1), next 5 positions are characters (A1), next 5 positions are integer (1), next 24 positions are divided into 3 columns of 8 positions with 3 decimal point floating numbers.
What I've been using is just calling these lines separated by columns using "$1, $2, $3". For example,
cat test.gro | awk 'BEGIN{i=0} {MolID[i]=$1; id[i]=$2; num[i]=$3; x[i]=$4;
y[i]=$5; z[i]=$6; i++} END { ...} >test1.gro
But I ran into some problems with this, and now I am trying to read these files in a formatted way as discussed above.
Any idea how I do this?

Looking at your sample input, it seems the format string is actually "%5d%-5s%5s%5d%8.3f%.3f%8.3f" with the first string field being left-justified. It's too bad awk doesn't have a scanf() function, but you can get your data with a few substr() calls
awk -v OFS=: '
{
a=substr($0,1,5)
b=substr($0,6,5)
c=substr($0,11,5)
d=substr($0,16,5)
e=substr($0,21,8)
f=substr($0,29,8)
g=substr($0,37,8)
print a,b,c,d,e,f,g
}
'
outputs
1:PS1 : A1: 1: 11.197: 5.497: 7.783
1:PS1 : A1: 1: 11.189: 5.846: 7.700
If you have GNU awk, you can use the FIELDWIDTHS variable like this:
gawk -v FIELDWIDTHS="5 5 5 5 8 8 8" -v OFS=: '{print $1, $2, $3, $4, $5, $6, $7}'
also outputs
1:PS1 : A1: 1: 11.197: 5.497: 7.783
1:PS1 : A1: 1: 11.189: 5.846: 7.700

You never said exactly which fields you think should have what number, so I'd like to be clear about how awk thinks that works (Your choice to be explicit about calling the whitespace in your output format string fields makes me worry a little. You might have a different idea about this than awk.).
From the manpage:
An input line is normally made up of fields separated by white space,
or by regular expression FS. The fields are denoted $1, $2, ..., while
$0 refers to the entire line. If FS is null, the input line is split
into one field per character.
Take note that the whitespace in the input line does not get assigned a field number and that sequential whitespace is treated as a single field separator.
You can test this with something like:
echo "1 2 3 4" | awk '{print "1:" $1 "\t2:" $2 "\t3:" $3 "\t4:" $4}'
at the command line.
All of this assumes that you have not diddles the FS variable, of course.

Related

awk: counting fields in a variable

Given a string like {running_db_nodes,[ejabberd#host002,ejabberd#host001]}, , how could the number of comma-delimited strings in square brackets be counted?
The useful substring can be extracted with gensub:
awk '/running_db_nodes/ {print gensub(/ {running_db_nodes,\[(.*)\]},/, "\\1", 1)}' .
A naive approach with NF gets fields from the original input string:
awk -F, '/running_db_nodes/ {nodes=gensub(/ {running_db_nodes,\[(.*)\]},/, "\\1", 1); print NF}'
How could the number of fields in a variable like nodes in the last example be extracted?
You can set your FS to characters [ and ], then split your $2 to an array and capture the count of elements returned from split():
echo "{running_db_nodes,[ejabberd#host002,ejabberd#host001]}," |
awk -F"[][]" '{print split($2,a,",")}'
2
With your shown samples only and with shown attempts please try following awk code.
echo "{running_db_nodes,[ejabberd#host002,ejabberd#host001]}," |
awk '
{
gsub(/.*\[|\].*$/,"")
print gsub(/,/,"&")+1
}
'
Explanation: Simple explanation would be:
gsub(/.*\[|\].*$/,""): Globally substituting everything from starting to till [ AND substituting from [ to till end of value with NULL in current line.
print gsub(/,/,"&")+1: Globally substituting , with itself(just to count it) and adding 1 to it and printing it as pre requirement.
A naive approach with NF gets fields from the original input string
gensub does not change string it is working on, you might use sub (or gsub) which will alter string it is working at which will alter relevant built-in variables values that is
echo "{running_db_nodes,[ejabberd#host002,ejabberd#host001]}" | awk 'BEGIN{FS=","}{sub(/^.*\[/,"");sub(/].*$/,"");print NF}'
gives output
2
Explanation: use sub to delete everything before [ and [, then ] and everything behind it, print number of fields.
(tested in GNU Awk 5.0.1)

split based on the last dot and create a new column with the last part of the string

I have a file with 2 columns. In the first column, there are several strings (IDs) and in the second values. In the strings, there are a number of dots that can be variable. I would like to split these strings based on the last dot. I found in the forum how remove the last past after the last dot, but I don't want to remove it. I would like to create a new column with the last part of the strings, using bash command (e.g. awk)
Example of strings:
5_8S_A.3-C_1.A 50
6_FS_B.L.3-O_1.A 20
H.YU-201.D 80
UI-LP.56.2011.A 10
Example of output:
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10
I tried to solve it by using the following command but it works if I have just 1 dot in the string:
awk -F' ' '{{split($1, arr, "."); print arr[1] "\t" arr[2] "\t" $2}}' file.txt
You may use this sed:
sed -E 's/^([[:blank:]]*[^[:blank:]]+)\.([^[:blank:]]+)/\1 \2/' file
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10
Details:
^: Start
([[:blank:]]*[^[:blank:]]+): Capture group #2 to match 0 or more whitespaces followed by 1+ non-whitespace characters.
\.: Match a dot. Since this regex pattern is greedy it will match until last dot
([^[:blank:]]+): Capture group #2 to match 1+ non-whitespace characters
\1 \2: Replacement to place a space between capture value #1 and capture value #2
Assumptions:
each line consists of two (white) space delimited fields
first field contains at least one period (.)
Sticking with OP's desire (?) to use awk:
awk '
{ n=split($1,arr,".") # split first field on period (".")
pfx=""
for (i=1;i<n;i++) { # print all but the nth array entry
printf "%s%s",pfx,arr[i]
pfx="."}
print "\t" arr[n] "\t" $2} # print last array entry and last field of line
' file.txt
Removing comments and reducing to a one-liner:
awk '{n=split($1,arr,"."); pfx=""; for (i=1;i<n;i++) {printf "%s%s",pfx,arr[i]; pfx="."}; print "\t" arr[n] "\t" $2}' file.txt
This generates:
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10
With your shown samples, here is one more variant of rev + awk solution.
rev Input_file | awk '{sub(/\./,OFS)} 1' | rev
Explanation: Simple explanation would be, using rev to print reverse order(from last character to first character) for each line, then sending its output as a standard input to awk program where substituting first dot(which is last dot as per OP's shown samples only) with spaces and printing all lines. Then sending this output as a standard input to rev again to print output into correct order(to remove effect of 1st rev command here).
$ sed 's/\.\([^.]*$\)/\t\1/' file
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10

Problems with awk substr

I am trying to split a file column using the substr awk command. So the input is as follows (it consists of 4 lines, one blank line):
#NS500645:122:HYGVMBGX2:4:21402:2606:16446:ACCTAGAAGG:R1
ACCTAGAAGGATATGCGCTTGCGCGTTAGAGATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTATGATCC
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
I want to split the second line by the pattern "GATC" but keeping it on the right sub-string like:
ACCTAGAAGGATATGCGCTTGCGCGTTAGA GATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTATGATCC
I want that the last line have the same length as the splitted one and regenerate the file like:
ACCTAGAAGGATATGCGCTTGCGCGTTAGA
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEE
GATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTAT
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
GATCC
EEEEE
For split the last colum I am using this awk script:
cat prove | paste - - - - | awk 'BEGIN
{FS="\t"; OFS="\t"}\ {gsub("GATC","/tGATC", $2); {split ($2, a, "\t")};\ for
(i in a) print substr($4, length(a[i-1])+1,
length(a[i-1])+length(a[i]))}'
But the output is as follows:
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
Being the second and third line longer that expected.
I check the calculated length that are passed to the substr command and are correct:
1 30
31 70
41 45
Using these length the output should be:
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEE
EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
EEEEE
But as I showed it is not the case.
Any suggestions?
I guess you're looking something like this, but your question formatting is really confusing
$ awk -v OFS='\t' 'NR==1 {next}
NR==2 {n=index($0,"GATC")}
/^[^+]/ {print substr($0,1,n-1),substr($0,n)}' file
ACCTAGAAGGATATGCGCTTGCGCGTTAGA GATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTATGATCC
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEE EEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE
I assumed your file is in this format
dummy header line to be ignored
ACCTAGAAGGATATGCGCTTGCGCGTTAGAGATCACTAGAGCTAAGGAATTTGAGATTACAGTAAGCTATGATCC
+
/AAAAEEEEEEEEEEAAEEEAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEEE

awk command or sed command

000Bxxxxx111118064085vxas - header
10000000001000000000053009-000000000053009-
10000000005000000000000000+000000000000000+
10000000030000000004025404-000000004025404-
10000000039000000000004930-000000000004930-
10000000088000005417665901-000005417665901-
90000060883328364801913 - trailer
In the above file we have header and trailer and the records which start with 1 is the detail record
in the detail record,want to sum the values starting from position 28 till 44 including the sign using awk/sed command
Here is sed, with help from bc to do the arithmetic:
sed -rn '
/header|trailer/! {
s/[[:digit:]]*[+-]([[:digit:]]+)([+-])$/\2\1/
H
}
$ {
x
s/\n//gp
}
' file | bc
I assume the +/- sign follows the number.
Using awk we can solve this problem making use of substr:
substr(s, m[, n ]):
Return the at most n-character substring of s that begins at position m, numbering from 1. If n is omitted, or if n specifies more characters than are left in the string, the length of the substring shall be limited by the length of the string s.
This allows us to take the string which represents the number. Here, I assumed that the sign before and after the number is same and thus the sign of the number :
$ echo "10000000001000000000053009-000000000053009-" \
| awk '{print length($0); print substr($0,27,43-27)}'
43
-000000000053009
Since awk implicitly converts strings to numbers if you do numeric operations on them we can write the following awk-code to achieve the requested :
$ awk '/header|trailer/{next}
{s+=substr($0,27,43-27)}
END{print s}' file.dat
-5421749244
Or in a single line:
$ awk '/header|trailer/{next}{s+=substr($0,27,43-27)} END{print s}' file.dat
-5421749244
The above examples just work on the example file given by the OP. However, if you have a file containing multiple blocks with header and trailer and you only want to use the text inside these blocks (exclude everything outside of the blocks), then you should handle it a bit differently :
$ awk '/header/{s=0;c=1;next}
/trailer/{S+=s;c=0;next}
c{s+=substr($0,27,43-27)}
END{print S}' file.dat
Here we do the following:
If a line with header is found, reset the block sum s to ZERO and set c=1 indicating that we take the next lines into account
If a line with trailer is found, add the block sum s to the overall sum S and set c=0 indicating to ignore the lines.
If c/=0 compute the block sum s
At the END, print the total sum S

Combine awk with sub to print multiple columns

Input:
MARKER POS EA NEA BETA SE N EAF STRAND IMPUTED
1:244953:TTGAC:T 244953 T TTGAC -0.265799 0.291438 4972 0.00133176 + 1
2:569406:G:A 569406 A G -0.17456 0.296652 4972 0.00128021 + 1
Desired output:
1 1:244953:TTGAC:T 0 244953
2 2:569406:G:A 0 569406
Column 1 in output file is first number from first column in input file
Tried:
awk '{gsub(/:.*/,"",$1);print $1,0,$2}' input
But it does not print $2 correctly
Thank you for any help
Your idea is right, but the reason it didn't work is that you've replaced the $1 value as part of the gsub() routine and have not backed it up. So next call to $1 will return the value after the call. So back it up as below. Also sub() is sufficient here for the first replacement part
awk 'NR>1{backup=$1; sub(/:.*/,"",backup);print backup,$1,0,$2}' file
Or use split() function to the first part of the first column. The call to the function returns the number of elements split by delimiter : and updates the elements to the array a. We print the element and subsequent columns as needed.
awk 'NR>1{n=split($1, a, ":"); print a[1],$1,"0", $2}' file
From GNU awk documentation under String functions
split(string, array [, fieldsep [, seps ] ])
Divide string into pieces separated by fieldsep and store the pieces in array and the separator strings in the seps array. The first piece is stored in array[1], the second piece in array[2], and so forth. The string value of the third argument, fieldsep, is a regexp describing where to split string.
Add a | column -t to beautify the result to make it appear more spaced out and readable
awk 'NR>1{n=split($1, a, ":"); print a[1],$1,"0", $2}' file | column -t
Could you please try following and let me know if this helps you?
awk -v s1=" " -F"[: ]" 'FNR>1{print $1 s1 $1 OFS $2 OFS $3 OFS $4 s1 "0" s1 $5}' OFS=":" Input_file