Why doesn't awk and gsub remove only the dot? - awk

This awk command:
awk -F ',' 'BEGIN {line=1} {print line "\n0" gsub(/\./, ",", $2) "0 --> 0" gsub(/\./, ",", $3) "0\n" $10 "\n"; line++}' file
is supposed to convert these lines:
Dialogue: 0,1:51:19.56,1:51:21.13,Default,,0000,0000,0000,,Hello!
into these:
1273
01:51:19,560 --> 01:51:21,130
Hello!
But somehow I can't get gsub to replace the . with , and instead I get 010 for both gsub results. Can anyone spot the issue?
Thanks

The return value from gsub is not the result from the substitution. It returns the number of substitutions it performed.
You want to gsub first, then print the modified string, which is the third argument you pass to gsub.
awk -F ',' 'BEGIN {line=1}
{ gsub(/\./, ",", $2);
gsub(/\./, ",", $3);
print line "\n0" $2 "0 --> 0" $3 "0\n" $10 "\n";
line++}' file
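To see the return value in isolation, here is a minimal sketch that captures what gsub returns and prints it next to the modified record. The sample input a.b.c and the wrapper name count_dots are made up for illustration.

```shell
# count_dots is a hypothetical helper name; it reads lines on stdin.
count_dots() {
    awk '{
        n = gsub(/\./, ",")   # n is the number of replacements, not the new text
        print n, $0           # $0 now holds the modified record
    }'
}

echo "a.b.c" | count_dots
```

This should print 2 a,b,c: the count comes from gsub, the changed text from $0.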

Another way is to use GNU awk's gensub instead of gsub:
$ awk -F ',' '
{
print NR ORS "0" gensub(/\./, ",","g", $2) "0 --> 0" gensub(/\./, ",","g",$3) "0" ORS $10 ORS
}' file
Output:
1
01:51:19,560 --> 01:51:21,130
Hello!
It's not as readable as the gsub solution by @tripleee, but there is a place for it.
I also replaced the line counter with the built-in NR and the \ns with ORS.
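If gensub is not available (it is a GNU awk extension), a similar inline style can probably be had with a small user-defined function. This is only a sketch: dots_to_commas and srt_times are hypothetical names, and it relies on awk passing scalar arguments by value, so the field itself is left untouched.

```shell
# srt_times and dots_to_commas are hypothetical names, not builtins.
srt_times() {
    awk -F ',' '
    # Scalars are passed by value, so gsub here changes only the local copy.
    function dots_to_commas(s) {
        gsub(/\./, ",", s)
        return s
    }
    { print "0" dots_to_commas($2) "0 --> 0" dots_to_commas($3) "0" }
    '
}

echo "Dialogue: 0,1:51:19.56,1:51:21.13,Default,,0000,0000,0000,,Hello!" | srt_times
```

For the sample line this should print 01:51:19,560 --> 01:51:21,130.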

Related

awk to add prefix if not present in field

I am trying to add a prefix to a field in awk if it is not already present. That is if chr isn't present before the number it is inserted. However, if it is there it is skipped.
The first awk adds the prefix to each $2 even if it is already present, and the second awk does skip the $2 values that have chr in them, but does not add chr to the $2 values without it. Thank you :).
file
ASPA,17:3483575-3483585
ATM,11:108289609-108289613
ATP7B,13:51937469-51937480
ATR,chr3:142562768-142562773
BAG3,chr10:119670120-119670123
desired
ASPA,chr17:3483575-3483585
ATM,chr11:108289609-108289613
ATP7B,chr13:51937469-51937480
ATR,chr3:142562768-142562773
BAG3,chr10:119670120-119670123
awk
awk -F, '{$2="chr"$2; print}' file
awk 2
awk -F, '$2 !~/chr/{gsub("chr","chr",$2)}1' file
You can use:
awk 'BEGIN {FS=OFS=","} $2 !~ /^chr/ {$2="chr" $2} 1' file
ASPA,chr17:3483575-3483585
ATM,chr11:108289609-108289613
ATP7B,chr13:51937469-51937480
ATR,chr3:142562768-142562773
BAG3,chr10:119670120-119670123
Or without using any regex:
awk 'BEGIN {FS=OFS=","} index($2 , "chr") != 1 {$2="chr" $2} 1' file
Another solution that might be shortest of all:
awk '{sub(/,(chr)?/, ",chr")} 1' file
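The reason the short version works is that the optional group (chr)? consumes an existing prefix, so the replacement never doubles it. A small sketch to check both cases, with add_chr as a made-up wrapper name and two lines taken from the sample file:

```shell
# add_chr is a hypothetical wrapper name; input is comma-separated on stdin.
add_chr() {
    # The optional group (chr)? swallows an existing prefix, so the
    # replacement ",chr" never doubles it.
    awk '{ sub(/,(chr)?/, ",chr") } 1'
}

printf '%s\n' 'ASPA,17:3483575-3483585' 'ATR,chr3:142562768-142562773' | add_chr
```

Note that sub only touches the first comma on each line, which is fine here because the file has exactly two fields.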
1st solution: With your shown samples, please try the following awk code.
awk '
BEGIN{FS=OFS=":"}
{
split($1,arr,",")
if(int(arr[2]) || arr[2]==0){
$1=arr[1] ",chr" arr[2]
}
}
1
' Input_file
2nd solution: With GNU awk, using its match function, which stores the values of the capturing groups in an array, try the following code.
awk '
match($0,/^([^,]*,)([^:]*)(:.*)/,arr){
if(int(arr[2]) || arr[2]==0){
arr[2]="chr" arr[2]
}
print arr[1] arr[2] arr[3]
}
' Input_file
3rd solution (bonus): In case your 2nd field has negative integer values and you want to change, e.g., -11 to -chr11, you can try the following GNU awk code.
awk '
match($0,/^([^,]*,)(-)?([^:]*)(:.*)/,arr){
if(int(arr[3]) || arr[3]==0){
if(arr[2]=="-"){
arr[3]="-chr" arr[3]
}
else{
arr[3]="chr" arr[3]
}
$0=arr[1] arr[3] arr[4]
}
print
}
' Input_file
mawk NF=NF FS=',(chr)?' OFS=',chr'
ASPA,chr17:3483575-3483585
ATM,chr11:108289609-108289613
ATP7B,chr13:51937469-51937480
ATR,chr3:142562768-142562773
BAG3,chr10:119670120-119670123

Merge lines based on first column without delimiter

I need to merge all the lines that have the same value on the first column.
The input file is the following:
34600000031|(1|1|0|1|1|20190114180000|20191027185959)
34600000031|(2|2|0|2|2|20190114180000|20191027185959)
34600000031|(3|3|0|3|3|20190114180000|20191027185959)
34600000031|(4|4|0|4|4|20190114180000|20191027185959)
34600000015|(1|1|100|1|8|20190114180000|20191027185959)
34600000015|(2|2|100|2|9|20190114180000|20191027185959)
34600000015|(3|3|100|3|10|20190114180000|20191027185959)
34600000015|(4|4|100|4|11|20190114180000|20191027185959)
I was able to partially achieve it using the following:
awk -F'|' '$1!=p{if(p)print s; p=$1; s=$0; next}{sub(p,x); s=s $0} END{print s}' INPUT
The output is the following:
34600000031|(1|1|0|1|1|20190114180000|20191027185959)|(2|2|0|2|2|20190114180000|20191027185959)|(3|3|0|3|3|20190114180000|20191027185959)|(4|4|0|4|4|20190114180000|20191027185959)
34600000015|(1|1|100|1|8|20190114180000|20191027185959)|(2|2|100|2|9|20190114180000|20191027185959)|(3|3|100|3|10|20190114180000|20191027185959)|(4|4|100|4|11|20190114180000|20191027185959)
What I need (and i cannot find how) is the following:
34600000031|(1|1|0|1|1|20190114180000|20191027185959)(2|2|0|2|2|20190114180000|20191027185959)(3|3|0|3|3|20190114180000|20191027185959)(4|4|0|4|4|20190114180000|20191027185959)
34600000015|(1|1|100|1|8|20190114180000|20191027185959)(2|2|100|2|9|20190114180000|20191027185959)(3|3|100|3|10|20190114180000|20191027185959)(4|4|100|4|11|20190114180000|20191027185959)
I could do a sed after the initial awk but I don't believe that this is the proper way to do it.
You need to substitute the separator in the values too. Your fixed awk would look like this:
awk -F'|' '$1!=p{if(p)print s; p=$1; s=$0; next}{sub(p "\\|",x); s=s $0} END{print s}'
but it's also good to anchor the match at the beginning of the string:
awk -F'|' '$1!=p{if(p)print s; p=$1; s=$0; next}{sub("^" p "\\|",x); s=s $0} END{print s}'
I would do it a bit more simply. This version uses more memory (it stores everything in an array) but doesn't need the file to be sorted:
awk -F'|' '{ k=$1; sub("^" $1 "\\|", ""); a[k] = a[k] $0 } END{ for (i in a) print i "|" a[i] }'
For each line, remember the first field, replace the first field plus the following | with nothing, then append the rest to an array element indexed by the first field. At the end, print each element in the array as key, separator and value.
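One caveat with the array approach: the traversal order of for (i in a) is unspecified in awk, so the keys may not come out in input order. A possible variation, sketched here with made-up names (merge_by_key, merged, order), records each key the first time it is seen and uses substr instead of a regex to drop the key:

```shell
# merge_by_key, merged and order are hypothetical names for illustration.
merge_by_key() {
    awk -F'|' '
    {
        key  = $1
        rest = substr($0, length(key) + 2)       # skip the key and the "|" after it
        if (!(key in merged)) order[++n] = key   # remember first-seen order
        merged[key] = merged[key] rest
    }
    END {
        for (i = 1; i <= n; i++) print order[i] "|" merged[order[i]]
    }'
}

printf '%s\n' '1|(a|b)' '2|(c|d)' '1|(e|f)' | merge_by_key
```

With this unsorted toy input the output should be 1|(a|b)(e|f) followed by 2|(c|d).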
$ awk -F'|' '
{
curr = $1
sub(/^[^|]+\|/,"")
printf "%s%s", (curr==prev ? "" : ors curr FS), $0
ors = ORS
prev = curr
}
END { print "" }
' file
34600000031|(1|1|0|1|1|20190114180000|20191027185959)(2|2|0|2|2|20190114180000|20191027185959)(3|3|0|3|3|20190114180000|20191027185959)(4|4|0|4|4|20190114180000|20191027185959)
34600000015|(1|1|100|1|8|20190114180000|20191027185959)(2|2|100|2|9|20190114180000|20191027185959)(3|3|100|3|10|20190114180000|20191027185959)(4|4|100|4|11|20190114180000|20191027185959)

How to suppress the original line in gawk

Why does gawk write the input line first?
ws@i7$ echo "8989889898 jAAA_ALL_filenames.txt" | gawk 'match($0, /([X0-9\\\-]{9,13})/, arr); {print arr[1];}'
my output
8989889898 jAAA_ALL_filenames.txt
8989889898
I do not want the original input line to be printed as well.
Thanks
Walter
You have a stray semicolon in there.
$ echo "8989889898 jAAA_ALL_filenames.txt" | gawk 'match($0, /([X0-9\\\-]{9,13})/, arr); {print arr[1];}'
8989889898 jAAA_ALL_filenames.txt
8989889898
$ echo "8989889898 jAAA_ALL_filenames.txt" | gawk 'match($0, /([X0-9\\\-]{9,13})/, arr) {print arr[1];}'
8989889898
The semicolon after match($0, /([X0-9\\\-]{9,13})/, arr) means that your script is effectively:
match($0, /([X0-9\\\-]{9,13})/, arr) { print $0 } # default action block inserted
1 {print arr[1];} # default condition inserted
match returns a "true" value so the whole line gets printed.
To fix it, remove the semicolon:
match($0, /([X0-9\\\-]{9,13})/, arr) {print arr[1];}
Now the code only has one condition { action } structure, as you intended, so it does what you want.
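The same effect can be reproduced with a much smaller program that needs no GNU extensions. In this sketch (two_rules and one_rule are made-up names) the only difference between the two functions is the semicolon:

```shell
# two_rules and one_rule are hypothetical names for illustration.
two_rules() {
    # Rule 1: pattern /a/ with no action, so matching lines are printed as-is.
    # Rule 2: action with no pattern, so it runs for every line.
    awk '/a/; { print "seen: " $0 }'
}

one_rule() {
    # A single rule: the action runs only for lines matching /a/.
    awk '/a/ { print "seen: " $0 }'
}

printf '%s\n' a b | two_rules
printf '%s\n' a b | one_rule
```

The first call should print a, seen: a and seen: b; the second only seen: a.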

gawk sub() with ampersand and toupper() not working

I'm having trouble using toupper() inside a gawk sub(). I'm using the feature that & substitutes for the matched string.
$ gawk '{sub(/abc/, toupper("&")); print $0; }'
xabcx
xabcx
I expected:
xABCx
Variants with toupper() but without & and with & but without toupper() work:
$ gawk '{sub(/abc/, toupper("def")); print $0; }'
xabcx
xDEFx
$ gawk '{sub(/abc/, "-&-"); print $0; }'
xabcx
x-abc-x
It fails similarly with tolower(). Am I misunderstanding something about how & works?
(Tested with gawk 3.1.x and the latest, 4.1.3).
I think I see what's going on: the toupper function is being evaluated first, before sub constructs the replacement string.
So you get
sub(/abc/, toupper("def")) => sub(/abc/, "DEF")
and the not-so-useful
sub(/abc/, toupper("&")) => sub(/abc/, "&")
To get your desired results, you have to extract the match first, upper-case it, and then perform the substitution:
$ echo foobar | gawk '{sub(/o+/, toupper("&")); print}'
foobar
$ echo foobar | gawk '{
if (match($0, /o+/, m)) {
replacement = toupper(m[0])
sub(/o+/, replacement)
}
print
}'
fOObar
Alternatively, you don't need the sub, you can reconstruct the record thusly:
echo foobar | gawk '{
if (match($0, /o+/, m)) {
$0 = substr($0, 1, RSTART-1) toupper(m[0]) substr($0, RSTART+RLENGTH)
}
print
}'
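Both snippets above change only the first match and use gawk's three-argument match. If every match should be upper-cased, or only POSIX awk is available, a loop over match with RSTART and RLENGTH might do. This is a sketch with hypothetical names (upcase_matches, upcase_all), and anchored patterns such as ^o+ would need more care because the string is consumed piece by piece:

```shell
# upcase_matches and upcase_all are hypothetical names for illustration.
upcase_matches() {
    awk -v re="$1" '
    function upcase_all(s, re,    out) {
        out = ""
        # Stop on no match, or on an empty match to avoid looping forever.
        while (match(s, re) && RLENGTH > 0) {
            out = out substr(s, 1, RSTART - 1) toupper(substr(s, RSTART, RLENGTH))
            s = substr(s, RSTART + RLENGTH)   # continue after the match
        }
        return out s
    }
    { print upcase_all($0, re) }'
}

echo "foobar boo" | upcase_matches 'o+'
```

For this input the result should be fOObar bOO.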

How to print out a specific field in AWK?

A very simple question, which I found no answer to. How do I print out a specific field in awk?
awk '/word1/' will print out the whole sentence, when I need just word1. Or I need only a chain of patterns (word1 + word2) to be printed out from a text.
Well, if the pattern is a single word (which you want to print and which can't contain FS, the input field separator), why not:
awk -v MYPATTERN="INSERT_YOUR_PATTERN" '$0 ~ MYPATTERN { print MYPATTERN }' INPUTFILE
If your pattern is a regex:
awk -v MYPATTERN="INSERT_YOUR_PATTERN" '$0 ~ MYPATTERN { print gensub(".*(" MYPATTERN ").*","\\1","1",$0) }' INPUTFILE
If your pattern must be checked in every single field:
awk -v MYPATTERN="INSERT_YOUR_PATTERN" '$0 ~ MYPATTERN {
for (i=1;i<=NF;i++) {
if ($i ~ MYPATTERN) { print "Field " i " in " NR " row matches: " MYPATTERN }
}
}' INPUTFILE
Modify any of the above to your taste.
The fields in awk are represented by $1, $2, etc:
$ echo this is a string | awk '{ print $2 }'
is
$0 is the whole line, $1 is the first field, $2 is the next field ( or blank ),
$NF is the last field, $( NF - 1 ) is the 2nd to last field, etc.
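A quick way to try these references on one line, as a small sketch (show_fields is a made-up name and the labels are only there to make the output readable):

```shell
# show_fields is a hypothetical helper name; it reads lines on stdin.
show_fields() {
    # $1 is the first field, $NF the last, $(NF - 1) the one before it.
    awk '{ print "first=" $1, "last=" $NF, "second-to-last=" $(NF - 1) }'
}

echo "this is a string" | show_fields
```

This should print first=this last=string second-to-last=a.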
EDIT (in response to comment).
You could try:
awk '/crazy/{ print substr( $0, match( $0, "crazy" ), RLENGTH )}'
I know you can do this with awk; an alternative would be:
sed -nr "s/.*(PATTERN_TO_MATCH).*/\1/p" file
or you can use grep -o
Something like this perhaps:
awk '{split("bla1 bla2 bla3",a," "); print a[1], a[2], a[3]}'