Combine awk with sub to print multiple columns - awk

Input:
MARKER POS EA NEA BETA SE N EAF STRAND IMPUTED
1:244953:TTGAC:T 244953 T TTGAC -0.265799 0.291438 4972 0.00133176 + 1
2:569406:G:A 569406 A G -0.17456 0.296652 4972 0.00128021 + 1
Desired output:
1 1:244953:TTGAC:T 0 244953
2 2:569406:G:A 0 569406
Column 1 in output file is first number from first column in input file
Tried:
awk '{gsub(/:.*/,"",$1);print $1,0,$2}' input
But it does not print $2 correctly
Thank you for any help

Your idea is right, but it fails because gsub() modifies $1 in place and you never saved the original value, so every later reference to $1 returns the truncated string. Save a copy first, as below. Note that sub() is sufficient here, since only one replacement is needed.
awk 'NR>1{backup=$1; sub(/:.*/,"",backup);print backup,$1,0,$2}' file
Or apply the split() function to the first column. The call returns the number of elements produced by splitting on the delimiter : and stores those elements in the array a. We then print the first element and the other columns as needed.
awk 'NR>1{n=split($1, a, ":"); print a[1],$1,"0", $2}' file
From GNU awk documentation under String functions
split(string, array [, fieldsep [, seps ] ])
Divide string into pieces separated by fieldsep and store the pieces in array and the separator strings in the seps array. The first piece is stored in array[1], the second piece in array[2], and so forth. The string value of the third argument, fieldsep, is a regexp describing where to split string.
Add a | column -t to beautify the result to make it appear more spaced out and readable
awk 'NR>1{n=split($1, a, ":"); print a[1],$1,"0", $2}' file | column -t
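To make the "work on a copy" point concrete, here is a minimal sketch that feeds the question's sample rows through a here-document. The function name first_id_cols and the variable chr are made up for illustration; the awk body is the sub() approach from above.

```shell
# Hypothetical helper: reads the question's layout on stdin, skips the header,
# and prints: leading number, full marker, literal 0, position.
first_id_cols() {
  awk 'NR>1 {
    chr = $1              # work on a copy so $1 stays intact
    sub(/:.*/, "", chr)   # drop everything from the first colon onward
    print chr, $1, 0, $2
  }'
}

first_id_cols <<'EOF'
MARKER POS EA NEA BETA SE N EAF STRAND IMPUTED
1:244953:TTGAC:T 244953 T TTGAC -0.265799 0.291438 4972 0.00133176 + 1
2:569406:G:A 569406 A G -0.17456 0.296652 4972 0.00128021 + 1
EOF
```

Because sub() only touches the copy, $1 and $2 still hold their original values when print runs.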

Could you please try following and let me know if this helps you?
awk -v s1=" " -F"[: ]" 'FNR>1{print $1 s1 $1 OFS $2 OFS $3 OFS $4 s1 "0" s1 $5}' OFS=":" Input_file

Related

awk: counting fields in a variable

Given a string like {running_db_nodes,[ejabberd#host002,ejabberd#host001]}, (including the trailing comma), how could the number of comma-delimited strings in square brackets be counted?
The useful substring can be extracted with gensub:
awk '/running_db_nodes/ {print gensub(/ {running_db_nodes,\[(.*)\]},/, "\\1", 1)}' .
A naive approach with NF gets fields from the original input string:
awk -F, '/running_db_nodes/ {nodes=gensub(/ {running_db_nodes,\[(.*)\]},/, "\\1", 1); print NF}'
How could the number of fields in a variable like nodes in the last example be extracted?
You can set your FS to characters [ and ], then split your $2 to an array and capture the count of elements returned from split():
echo "{running_db_nodes,[ejabberd#host002,ejabberd#host001]}," |
awk -F"[][]" '{print split($2,a,",")}'
2
With your shown samples only and with shown attempts please try following awk code.
echo "{running_db_nodes,[ejabberd#host002,ejabberd#host001]}," |
awk '
{
gsub(/.*\[|\].*$/,"")
print gsub(/,/,"&")+1
}
'
Explanation:
gsub(/.*\[|\].*$/,""): globally removes everything from the start of the line up to and including [, and everything from ] to the end of the line, by replacing both with an empty string.
print gsub(/,/,"&")+1: replaces each , with itself purely to count the commas (gsub() returns the number of substitutions), then adds 1 and prints the result, which is the number of items.
A naive approach with NF gets fields from the original input string
gensub() does not change the string it works on. You can use sub() (or gsub()) instead: these alter the string in place, and when that string is $0, awk re-splits the record and updates the relevant built-in variables such as NF, that is:
echo "{running_db_nodes,[ejabberd#host002,ejabberd#host001]}" | awk 'BEGIN{FS=","}{sub(/^.*\[/,"");sub(/].*$/,"");print NF}'
gives output
2
Explanation: use sub() to delete everything up to and including [, then ] and everything after it, and print the number of fields.
(tested in GNU Awk 5.0.1)
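To answer the literal question (counting fields in a variable rather than in $0), the return value of split() can be used directly on the variable. The following is a minimal sketch under that assumption; the function name count_nodes and the array name parts are hypothetical.

```shell
# Hypothetical helper: prints how many comma-separated items sit between [ and ].
count_nodes() {
  awk '/running_db_nodes/ {
    nodes = $0
    sub(/^[^[]*\[/, "", nodes)       # strip everything up to and including the first [
    sub(/\].*$/, "", nodes)          # strip the first ] and everything after it
    print split(nodes, parts, ",")   # split() returns the number of pieces
  }'
}

echo "{running_db_nodes,[ejabberd#host002,ejabberd#host001]}," | count_nodes
```

Since split() returns 0 for an empty string, an empty list [] is reported as 0 here, whereas counting commas and adding 1 would report 1.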

Countif like function in AWK with field headers

I am looking for a way to count how many times a value in a field appears across that field in a CSV file, much like COUNTIF in Excel, but I would like to use an awk command if possible.
So column 6 holds the range of values, and a new column 7 should hold the number of times that row's column 6 value appears in column 6, as shown below.
>awk -F, '{print $0}' file3
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ
>awk -F, '{print $6}' file3
test
SBCD
AWER
ASDF
ASDQ
ASDQ
What i want is:
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2
#adds field name count that I want:
awk -F, -v OFS=, 'NR==1{ print $0, "count"}
NR>1{ print $0}' file3
How do I get the output I want?
I have tried this from a previous, similar question, but no joy:
>awk -F, 'NR>1{c[$6]++;l[NR>1]=$0}END{for(i=0;i++<NR;){split(l[i],s,",");print l[i]","c[s[1]]}}' file3
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,
,
,
,
,
,
very similar question to this one
similar python related Q, for my ref
I would harness GNU AWK for this task following way, let file.txt content be
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row2_1,row2_2,row2_3,AWERF,row2_5,AWER
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ
then
awk 'BEGIN{FS=OFS=","}NR==1{print $0,"count";next}FNR==NR{arr[$6]+=1;next}FNR>1{print $0,arr[$6]}' file.txt file.txt
gives output
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2
Explanation: this is a two-pass approach, hence file.txt appears twice. I inform GNU AWK that , is both the field separator (FS) and the output field separator (OFS). For the first line (the header) I print it followed by count and instruct GNU AWK to go to the next line, so nothing else is done for the 1st line. During the first pass, i.e. while the global line number (NR) equals the line number within the current file (FNR), I count the occurrences of each value in the 6th field and store the counts in the array arr, then instruct GNU AWK to go to the next line, so nothing else is done in this pass. During the second pass, for all lines after the 1st (FNR>1), I print the whole line ($0) followed by the corresponding value from the array arr.
(tested in GNU Awk 5.0.1)
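The two-pass idea can also be sketched with the header handled in the second pass, which keeps the first pass purely about counting. This is an illustrative variant, not the answer's exact command: the function name add_count, the array name seen, and the temporary file standing in for file.txt are all assumptions.

```shell
# Assumption: a throwaway temp file stands in for file.txt.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
f1,f2,f3,f4,f5,test
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ
EOF

# Hypothetical helper: reads the same file twice.
# Pass 1 (NR==FNR): tally column 6. Pass 2: append the tally to each row.
add_count() {
  awk 'BEGIN { FS = OFS = "," }
       NR == FNR { if (FNR > 1) seen[$6]++; next }   # first read: count only
       FNR == 1  { print $0, "count"; next }         # second read: header
                 { print $0, seen[$6] }              # second read: data rows
  ' "$1" "$1"
}

add_count "$tmp"
rm -f "$tmp"
```

Reading the file twice avoids holding every line in memory, at the cost of requiring a real file rather than a pipe.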
You did not copy the code from the linked question properly. Why change l[NR] to l[NR>1] at all? On the other hand, you should change s[1] to s[6] since it's the sixth field that has the key you're counting:
awk -F, 'NR>1{c[$6]++;l[NR]=$0}END{for(i=1;i++<NR;){split(l[i],s,",");print l[i]","c[s[6]]}}'
You can also output the header with the new field name:
awk -F, -vOFS=, 'NR==1{print $0,"count"}NR>1{c[$6]++;l[NR]=$0}END{for(i=1;i++<NR;){split(l[i],s,",");print l[i],c[s[6]]}}'
One awk idea:
awk '
BEGIN { FS=OFS="," } # define input/output field delimiters as comma
{ lines[NR]=$0
if (NR==1) next
col6[NR]=$6 # copy field 6 so we do not have to parse the contents of lines[] in the END block
cnt[$6]++
}
END { for (i=1;i<=NR;i++)
print lines[i], (i==1 ? "count" : cnt[col6[i]] )
}
' file3
This generates:
f1,f2,f3,f4,f5,test,count
row1_1,row1_2,row1_3,SBCDE,row1_5,SBCD,1
row2_1,row2_2,row2_3,AWERF,row2_5,AWER,1
row3_1,row3_2,row3_3,ASDFG,row3_5,ASDF,1
row4_1,row4_2,row4_3,PRE-ASDQG,row4_5,ASDQ,2
row4_1,row4_2,row4_3,PRE-ASDQF,row4_5,ASDQ,2

Replace case of 2nd column of dataset within awk?

I'm trying the command
awk 'BEGIN{FS=","}NR>1{tolower(substr($2,2))} {print $0}' emp.txt
on the data below, but it is not working:
M_ID,M_NAME,DEPT_ID,START_DATE,END_DATE,Salary
M001,Richa,D001,27-Jan-07,27-Feb-07,150000
M002,Nitin,D002,16-Feb-07,16-May-07,40000
M003,AJIT,D003,8-Mar-07,8-Sep-07,70000
M004,SHARVARI,D004,28-Mar-07,28-Mar-08,120000
M005,ADITYA,D002,27-Apr-07,27-Jul-07,40000
M006,Rohan,D004,12-Apr-07,12-Apr-08,130000
M007,Usha,D003,17-Apr-07,17-Oct-07,70000
M008,Anjali,D002,2-Apr-07,2-Jul-07,40000
M009,Yash,D006,11-Apr-07,11-Jul-07,85000
M010,Nalini,D007,15-Apr-07,15-Oct-07,9999
Expected output
M_ID,M_NAME,DEPT_ID,START_DATE,END_DATE,Salary
M001,Richa,D001,27-Jan-07,27-Feb-07,150000
M002,Nitin,D002,16-Feb-07,16-May-07,40000
M003,Ajit,D003,8-Mar-07,8-Sep-07,70000
M004,Sharvari,D004,28-Mar-07,28-Mar-08,120000
M005,Aditya,D002,27-Apr-07,27-Jul-07,40000
M006,Rohan,D004,12-Apr-07,12-Apr-08,130000
M007,Usha,D003,17-Apr-07,17-Oct-07,70000
M008,Anjali,D002,2-Apr-07,2-Jul-07,40000
M009,Yash,D006,11-Apr-07,11-Jul-07,85000
M010,Nalini,D007,15-Apr-07,15-Oct-07,9999
With your shown samples, please try the following code in GNU awk. It uses GNU awk's match() function with the regex (^[^,]*,.)([^,]*)(.*), which creates 3 capturing groups and stores their values in an array named arr (indexed 1, 2, 3 and so on, one index per capturing group). If the match succeeds, the array elements are printed, with tolower() applied to the 2nd element of arr to lowercase it and produce the expected output.
awk '
FNR==1{
print
next
}
match($0,/(^[^,]*,.)([^,]*)(.*)/,arr){
print arr[1] tolower(arr[2]) arr[3]
}
' Input_file
You need to assign the result of tolower() to something, it doesn't operate in place. And in this case, you need to concatenate it with the first character of the field and assign that back to the field.
$2 = substr($2, 1, 1) tolower(substr($2, 2));
To get comma separators in the output file, you need to set OFS. So you need:
BEGIN {OFS=FS=","}
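Putting those two fragments together, a complete command might look like the following minimal sketch. The function name fix_name_case and the three-column sample data are illustrative only; the awk statements are the ones given above.

```shell
# Hypothetical helper: keeps the first character of column 2 as is and
# lowercases the rest, leaving the header row untouched.
fix_name_case() {
  awk 'BEGIN { OFS = FS = "," }
       NR > 1 { $2 = substr($2, 1, 1) tolower(substr($2, 2)) }
       { print }'
}

printf '%s\n' 'M_ID,M_NAME,DEPT_ID' 'M003,AJIT,D003' 'M004,SHARVARI,D004' | fix_name_case
```

Assigning to $2 makes awk rebuild $0 using OFS, which is why OFS has to be set to a comma as well.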
mawk, gawk, or nawk :
awk 'BEGIN { _+=_^=FS=OFS="," } NR<_ || $_ = substr( toupper($(+_)=\
tolower($_)), --_,_) substr($++_,_)'
M_ID,M_name,DEPT_ID,START_DATE,END_DATE,Salary
M001,Richa,D001,27-Jan-07,27-Feb-07,150000
M002,Nitin,D002,16-Feb-07,16-May-07,40000
M003,Ajit,D003,8-Mar-07,8-Sep-07,70000
M004,Sharvari,D004,28-Mar-07,28-Mar-08,120000
M005,Aditya,D002,27-Apr-07,27-Jul-07,40000
M006,Rohan,D004,12-Apr-07,12-Apr-08,130000
M007,Usha,D003,17-Apr-07,17-Oct-07,70000
M008,Anjali,D002,2-Apr-07,2-Jul-07,40000
M009,Yash,D006,11-Apr-07,11-Jul-07,85000
M010,Nalini,D007,15-Apr-07,15-Oct-07,9999

split based on the last dot and create a new column with the last part of the string

I have a file with 2 columns. The first column contains strings (IDs) and the second contains values. The strings contain a variable number of dots. I would like to split these strings at the last dot. I found in the forum how to remove the last part after the last dot, but I don't want to remove it. I would like to create a new column with the last part of the strings, using a bash command (e.g. awk).
Example of strings:
5_8S_A.3-C_1.A 50
6_FS_B.L.3-O_1.A 20
H.YU-201.D 80
UI-LP.56.2011.A 10
Example of output:
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10
I tried to solve it by using the following command but it works if I have just 1 dot in the string:
awk -F' ' '{{split($1, arr, "."); print arr[1] "\t" arr[2] "\t" $2}}' file.txt
You may use this sed:
sed -E 's/^([[:blank:]]*[^[:blank:]]+)\.([^[:blank:]]+)/\1 \2/' file
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10
Details:
^: Start
([[:blank:]]*[^[:blank:]]+): Capture group #1 to match 0 or more whitespace characters followed by 1+ non-whitespace characters.
\.: Match a dot. Since this regex pattern is greedy it will match until last dot
([^[:blank:]]+): Capture group #2 to match 1+ non-whitespace characters
\1 \2: Replacement to place a space between capture value #1 and capture value #2
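The same greedy-match idea carries over to awk without needing capture groups. Below is a minimal sketch using the POSIX match() function and RLENGTH; the function name split_last_dot and the variables head and tail are hypothetical, and it assumes two whitespace-separated fields per line.

```shell
# Hypothetical helper: splits field 1 at its last dot and prints
# the part before, the part after, and field 2.
split_last_dot() {
  awk '{
    if (match($1, /.*\./)) {                 # greedy: the match ends at the last dot
      head = substr($1, 1, RLENGTH - 1)      # text before the last dot
      tail = substr($1, RLENGTH + 1)         # text after the last dot
      print head, tail, $2
    } else {
      print                                  # no dot in field 1: pass the line through
    }
  }'
}

printf '%s\n' '6_FS_B.L.3-O_1.A 20' 'H.YU-201.D 80' | split_last_dot
```

Lines whose first field has no dot are printed unchanged rather than being dropped.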
Assumptions:
each line consists of two (white) space delimited fields
first field contains at least one period (.)
Sticking with OP's desire (?) to use awk:
awk '
{ n=split($1,arr,".") # split first field on period (".")
pfx=""
for (i=1;i<n;i++) { # print all but the nth array entry
printf "%s%s",pfx,arr[i]
pfx="."}
print "\t" arr[n] "\t" $2} # print last array entry and last field of line
' file.txt
Removing comments and reducing to a one-liner:
awk '{n=split($1,arr,"."); pfx=""; for (i=1;i<n;i++) {printf "%s%s",pfx,arr[i]; pfx="."}; print "\t" arr[n] "\t" $2}' file.txt
This generates:
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10
With your shown samples, here is one more variant of rev + awk solution.
rev Input_file | awk '{sub(/\./,OFS)} 1' | rev
Explanation: rev reverses each line character by character, and its output is piped to awk, which replaces the first dot (which is the last dot of the original line, given OP's shown samples) with a space and prints every line. That output is then piped to rev again to restore the original character order, undoing the effect of the 1st rev command.
$ sed 's/\.\([^.]*$\)/\t\1/' file
5_8S_A.3-C_1 A 50
6_FS_B.L.3-O_1 A 20
H.YU-201 D 80
UI-LP.56.2011 A 10

How to split a delimited string into an array in awk?

How to split the string when it contains pipe symbols | in it.
I want to split them to be in array.
I tried
echo "12:23:11" | awk '{split($0,a,":"); print a[3] a[2] a[1]}'
Which works fine. If my string is like "12|23|11" then how do I split them into an array?
Have you tried:
echo "12|23|11" | awk '{split($0,a,"|"); print a[3],a[2],a[1]}'
To split a string to an array in awk we use the function split():
awk '{split($0, array, ":")}'
# \/ \___/ \_/
# | | |
# string | delimiter
# |
# array to store the pieces
If no separator is given, it uses the FS, which defaults to the space:
$ awk '{split($0, array); print array[2]}' <<< "a:b c:d e"
c:d
We can give a separator, for example a colon (:):
$ awk '{split($0, array, ":"); print array[2]}' <<< "a:b c:d e"
b c
Which is equivalent to setting it through the FS:
$ awk -F: '{split($0, array); print array[2]}' <<< "a:b c:d e"
b c
In GNU Awk you can also provide the separator as a regexp:
$ awk '{split($0, array, ":*"); print array[2]}' <<< "a:::b c::d e"   # note the multiple colons
b c
And even see what the delimiter was on every step by using its fourth parameter:
$ awk '{split($0, array, ":*", sep); print array[2]; print sep[1]}' <<< "a:::b c::d e"
b c
:::
Let's quote the man page of GNU awk:
split(string, array [, fieldsep [, seps ] ])
Divide string into pieces separated by fieldsep and store the pieces in array and the separator strings in the seps array. The first piece is stored in array[1], the second piece in array[2], and so forth. The string value of the third argument, fieldsep, is a regexp describing where to split string (much as FS can be a regexp describing where to split input records). If fieldsep is omitted, the value of FS is used. split() returns the number of elements created. seps is a gawk extension, with seps[i] being the separator string between array[i] and array[i+1]. If fieldsep is a single space, then any leading whitespace goes into seps[0] and any trailing whitespace goes into seps[n], where n is the return value of split() (i.e., the number of elements in array).
Please be more specific! What do you mean by "it doesn't work"?
Post the exact output (or error message), your OS and awk version:
% awk -F\| '{
for (i = 0; ++i <= NF;)
print i, $i
}' <<<'12|23|11'
1 12
2 23
3 11
Or, using split:
% awk '{
n = split($0, t, "|")
for (i = 0; ++i <= n;)
print i, t[i]
}' <<<'12|23|11'
1 12
2 23
3 11
Edit: on Solaris you'll need to use the POSIX awk (/usr/xpg4/bin/awk) in order to process 4000 fields correctly.
I do not like the echo "..." | awk ... solution as it makes unnecessary fork and exec system calls.
I prefer Dimitre's solution with a little twist
awk -F\| '{print $3 $2 $1}' <<<'12|23|11'
Or a bit shorter version:
awk -F\| '$0=$3 $2 $1' <<<'12|23|11'
In this case the assignment builds the output record, and since the resulting value is non-empty, the pattern evaluates as true, so the record gets printed.
In this specific case the stdin redirection can be spared with setting an awk internal variable:
awk -v T='12|23|11' 'BEGIN{split(T,a,"|");print a[3] a[2] a[1]}'
I used ksh quite a while, but in bash this could be managed by internal string manipulation. In the first case the original string is split by internal terminator. In the second case it is assumed that the string always contains digit pairs separated by a one character separator.
T='12|23|11';echo -n ${T##*|};T=${T%|*};echo ${T#*|}${T%|*}
T='12|23|11';echo ${T:6}${T:3:2}${T:0:2}
The result in all cases is
112312
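The variants above hard-code three pieces. As a minimal sketch that handles any number of pieces, the count returned by split() can drive a reverse loop. The function name reverse_join and the variable names s, d and out are made up for illustration; it assumes the delimiter is a single character other than a space, since longer separators are treated as a regexp by split().

```shell
# Hypothetical helper: reverse_join STRING DELIM
# Splits STRING on DELIM and prints the pieces in reverse order, concatenated.
reverse_join() {
  awk -v s="$1" -v d="$2" 'BEGIN {
    n = split(s, a, d)                          # n is the number of pieces
    out = ""
    for (i = n; i >= 1; i--) out = out a[i]     # append pieces last to first
    print out
  }'
}

reverse_join '12|23|11' '|'
```

Like the -v T=... example above, this runs entirely in the BEGIN block, so no input stream is needed.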
Actually, awk has a feature called the 'Input Field Separator Variable' (FS). This is how to use it. It's not really an array, but it uses the internal $ variables. For splitting a simple string it is easier.
echo "12|23|11" | awk 'BEGIN {FS="|";} { print $1, $2, $3 }'
I know this is kind of an old question, but I thought someone might like my trick, especially since this solution is not limited to a specific number of items.
# Convert to an array
_ITEMS=($(echo "12|23|11" | tr '|' '\n'))
# Output array items
for _ITEM in "${_ITEMS[@]}"; do
echo "Item: ${_ITEM}"
done
The output will be:
Item: 12
Item: 23
Item: 11
Joke? :)
How about echo "12|23|11" | awk '{split($0,a,"|"); print a[3] a[2] a[1]}'
This is my output:
p2> echo "12|23|11" | awk '{split($0,a,"|"); print a[3] a[2] a[1]}'
112312
so I guess it's working after all..
echo "12|23|11" | awk '{split($0,a,"|"); print a[3] a[2] a[1]}'
should work.
code
awk -F"|" '{split($0,a); print a[1],a[2],a[3]}' <<< '12|23|11'
output
12 23 11
The challenge: parse and store split strings with spaces and insert them into variables.
Solution: the simplest choice is to convert the list of strings into an array and then assign its elements to variables by index. Here's an example of how you can convert and access the array.
Example: parse disk space statistics on each line:
sudo df -k | awk 'NR>1' | while read -r line; do
#convert into array:
array=($line)
#variables:
filesystem="${array[0]}"
size="${array[1]}"
usage="${array[4]}"
mountpoint="${array[5]}"
echo "filesystem:$filesystem|size:$size|usage:$usage|mountpoint:$mountpoint"
done
#output:
filesystem:/dev/dsk/c0t0d0s1|size:4000|usage:40%|mountpoint:/
filesystem:/dev/dsk/c0t0d0s2|size:5000|usage:50%|mountpoint:/usr
filesystem:/proc|size:0|usage:0%|mountpoint:/proc
filesystem:mnttab|size:0|usage:0%|mountpoint:/etc/mnttab
filesystem:fd|size:1000|usage:10%|mountpoint:/dev/fd
filesystem:swap|size:9000|usage:9%|mountpoint:/var/run
filesystem:swap|size:1500|usage:15%|mountpoint:/tmp
filesystem:/dev/dsk/c0t0d0s3|size:8000|usage:80%|mountpoint:/export
awk -F'[|]' '{print $1"\t"$2"\t"$3}' <<<'12|23|11'