How do I tell awk to use = as a separator (with white spaces removed too) - awk

Suppose I have the following file.
John=good
Tom = ok
Tim = excellent
I know the following lets me use = as a separator.
awk -F= '{print $1,$2}' file
This gives me the following results.
John good
Tom   ok
Tim   excellent
I would like the white spaces to be ignored, so that only the names and their performances are printed out.
One way around this is to run another awk on the results.
awk -F= '{print$1,$2}' file | awk '{print $1,$2}'
But I wanted to know if I could do this in one awk?

Include them in the separator definition; it's a regexp.
jinx:1654 Z$ awk -F' *= *' '{print $1, $2}' foo.ds
John good
Tom ok
Tim excellent

The FS variable can be set to a regular expression.
From the AWK manual:
The following table summarizes how fields are split, based on the value of FS.
FS == " "
Fields are separated by runs of whitespace. Leading and trailing whitespace are ignored. This is the default.
FS == any single character
Fields are separated by each occurrence of the character. Multiple successive occurrences delimit empty fields, as do leading and trailing occurrences.
FS == regexp
Fields are separated by occurrences of characters that match regexp. Leading and trailing matches of regexp delimit empty fields.
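As a rough illustration of the three cases in that table, here is a small sketch. The padded sample line and the variable names are made up for the example; only the -F values come from the discussion above.

```shell
# Hypothetical sample record with padding around the "=" sign.
line='  Tom = ok  '

# Default FS (" "): runs of whitespace separate fields, edges are ignored.
default_nf=$(printf '%s\n' "$line" | awk '{ print NF }')

# Single-character FS: "=" splits, but the surrounding blanks stay in the fields.
single_f1=$(printf '%s\n' "$line" | awk -F= '{ print "[" $1 "]" }')

# Regexp FS: the blanks around "=" become part of the separator itself.
regex_f2=$(printf '%s\n' 'Tom = ok' | awk -F' *= *' '{ print "[" $2 "]" }')

printf '%s\n' "$default_nf" "$single_f1" "$regex_f2"
```

Note that the regexp separator only absorbs blanks next to the =; blanks at the very start or end of the record would still end up in the first or last field.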

Related

Regex match last three integers and one character before that integers

I have been trying this for a long time, but my searches keep failing. I have the test data below.
mhie0104:x:104:600:Martinescu Horia:/home/scs/gr911/mhie0104:/bin/bash
mlie0105:x:105:600:Martinescu Laurentiu:/home/scs/gr911/mlie0105:/bin/bash
mmie0106:x:106:600:Martinescu Marius:/home/scs/gr911/mmie0106:/bin/bash
mnie0107:x:107:600:Martinescu Nicolae:/home/scs/gr911/mnie0107:/bin/bash
mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash
I am trying to find the users whose names end in exactly three digits. Here is what I did:
awk -F: '$1 ~ /*[a-z]//d{3}/'
My understanding of the above regex is:
"*" at the beginning should match any characters
[a-z] should match the character just before the digits
Finally, three digits
I also tried the variation below:
awk -F: '$1 ~ /*?//d{3}/'
What I need from the above test data is:
mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash
1st solution: If you only want to check the last 4 characters of the 1st field, where the 4th character from the end is NOT a digit, try the following code.
awk -F':' '$1 ~ /[^0-9][0-9]{3}$/' Input_file
Explanation:
The field separator is set to : for every line of Input_file.
The 1st field is then checked against /[^0-9][0-9]{3}$/: if the 4th character from the end is anything other than a digit and the last 3 characters are digits, the line is printed.
2nd solution: If no character of the 1st field other than the last 3 may be a digit, and the last 3 characters must all be digits, try the following code.
awk -F':' '
substr($1,1,length($1)-3)!~/[0-9]/ && int(substr($1,length($1)-2))~/^[0-9]{3}$/
' Input_file
Explanation:
First, the field separator is set to : for this awk program.
The condition substr($1,1,length($1)-3)!~/[0-9]/ uses awk's substr function to check that everything in the 1st field apart from the last 3 characters contains no digit.
The second condition, int(substr($1,length($1)-2))~/^[0-9]{3}$/, checks that the last 3 characters are 3 digits.
If both conditions are TRUE, the line is printed.
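A possible variation on the substr idea, sketched under a few assumptions: the helper name check is hypothetical, the tail is compared as a string (int() would turn a tail such as 007 into 7, which no longer matches three digits), and [0-9][0-9][0-9] is spelled out in case the awk in use lacks {3}.

```shell
# Hypothetical helper: prints field 1 when it is "no digits, then exactly 3 digits".
check() {
  printf '%s\n' "$1" | awk -F: '
    {
      head = substr($1, 1, length($1) - 3)   # everything except the last 3 characters
      tail = substr($1, length($1) - 2)      # the last 3 characters, kept as a string
    }
    head !~ /[0-9]/ && tail ~ /^[0-9][0-9][0-9]$/ { print $1 }'
}

check 'mpiel110:x:110:600'   # prints the username
check 'mhie0104:x:104:600'   # prints nothing, the head contains a digit
```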
You can't use the \d notation here; it is Perl-style regex syntax, which awk does not support.
Solution:
$ awk -F: '$1 ~ /[a-zA-Z][0-9]{3}$/' file
mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash
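If the awk at hand does not understand the {3} interval either, the repetition can be written out by hand. A minimal sketch, assuming a two-line sample taken from the question's data:

```shell
# Two of the records from the question, used as sample input.
sample='mhie0104:x:104:600:Martinescu Horia:/home/scs/gr911/mhie0104:/bin/bash
mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash'

# A letter followed by three digits at the end of field 1.
# [0-9][0-9][0-9] replaces both \d{3} (Perl only) and [0-9]{3} (needs interval support).
matches=$(printf '%s\n' "$sample" |
  awk -F: '$1 ~ /[a-zA-Z][0-9][0-9][0-9]$/ { print $1 }')

printf '%s\n' "$matches"
```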
In Perl you can use a negative lookbehind:
$ perl -F: -ne ' print if $F[0]=~/(?<!\d)\d{3}$/ ' gameiswar.txt
mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash
$
For this particular task, sed might be used as well:
sed '/^[^0-9]*[0-9]\{3\}:/!d' file
I'm not sure whether a username can consist only of digits, but since it is the first field and : is the delimiter, the pattern can be anchored at the start of the line and end at the first :.
Here, ([^:]*[^0-9])? matches optional repetitions of any char except : followed by a char other than 0-9:
awk '/^([^:]*[^0-9])?[0-9]{3}:/' file
If there has to be a leading char a-z
awk '/^[^:]*[a-z][0-9]{3}:/' file
Output
mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash
mawk '!_<NF' FS='^[^:]*[a-z][0-9][0-9][0-9]:'
— or —
gawk '!_<NF' FS='^[^:]*[a-z][0-9]{3}:'
mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash

Awk multi character field separator containing caret not working as expected

I have tried multiple google searches, but none of the proposed answers are working for my example below. NF should be 3, but I keep getting 1.
# cat a
1^%2^%3
# awk -F^% '{print NF}' a
1
# awk -F'^%' '{print NF}' a
1
# awk -F "^%" '{print NF}' a
1
The -F option in awk takes a regular expression as its value, so ^ is interpreted as the regex anchor and has to be stripped of its special meaning. You do that by escaping it with a literal backslash \ character:
awk -F'\\^%' '{ print NF }'
From the GNU Awk manual on escape sequences:
The backslash character itself is another character that cannot be included normally; you must write \\ to put one backslash in the string or regexp. Thus, the string whose contents are the two characters " and \ must be written \"\\.
You should escape ^ to remove its special meaning, because the field separator is treated as a regex. Once you escape it as \\^ it is treated as a literal character, ^% is matched as a plain string, and you get 3 as the answer.
awk -F'\\^%' '{print NF}' Input_file
Here is an SO answer that works as a good example. It doesn't discuss the ^ character specifically, but it shows how to use escape sequences in an awk field separator.
https://stackoverflow.com/a/44072825/5866580
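To see the effect of the escaping side by side, here is a small sketch. The variable names are only for illustration, and the split() form is an equivalent I am adding, not something from the answers above.

```shell
line='1^%2^%3'

# Unescaped: ^ acts as a start-of-record anchor, so the separator never matches.
nf_plain=$(printf '%s\n' "$line" | awk -F'^%' '{ print NF }')

# Escaped on the command line: the string \\^% is reduced to the regexp \^%.
nf_escaped=$(printf '%s\n' "$line" | awk -F'\\^%' '{ print NF }')

# Same separator as a regexp literal given to split(); one backslash is enough here
# because a /.../ literal is not processed as a string first.
n_split=$(printf '%s\n' "$line" | awk '{ print split($0, parts, /\^%/) }')

printf '%s %s %s\n' "$nf_plain" "$nf_escaped" "$n_split"
```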

To find ";" then delete spaces up to next character

I have many lines starting with ";", then 1 or more spaces, followed by some other character(s) on the same line. I need to remove the spaces following the ";" up to but not including the characters that follow.
I tried a few variations of the following code because it worked great on lines with empty spaces, but I am not very familiar with awk.
awk '{gsub(/^ +| +$/,"")}1' filea>fileb
Sample input:
; 4
; group 452
; ring
Output wanted:
;4
;group 452
;ring
To remove any white space after the first semicolon, try:
$ awk '{sub(/^;[[:blank:]]+/, ";")} 1' filea
;4
;group 452
;ring
The regex ^;[[:blank:]]+ matches the first semicolon and any blanks or tabs which follow it. The function sub replaces this with ;. Since this only occurs once on the line (at the beginning), there is no need for gsub.
[:blank:] is a unicode-safe way of specifying blank space.
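A quick sketch of how that sub call treats different lines. The input is made up, and [ \t] is used here as a stand-in for [[:blank:]] on the assumption that some older awks lack POSIX character classes:

```shell
# Three sample lines: spaces after ";", a tab after ";", and no ";" at all.
cleaned=$(printf ';   4\n;\tgroup 452\nno semicolon here\n' |
  awk '{ sub(/^;[ \t]+/, ";") } 1')   # only a leading ";" plus blanks is rewritten

printf '%s\n' "$cleaned"
```

Lines that do not start with ; pass through unchanged, and the space inside "group 452" is kept because the match is anchored at the start of the line.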
awk '{sub(/^; +/,";")}1' file
;4
;group 452
;ring
sed would also work:
sed -E 's/^(;){1}([[:blank:]]+)/\1/' file
The parentheses form capture groups, and a \number refers to the corresponding group.
In ^(;){1}([[:blank:]]+) we match the start of the line (^) and a ; that occurs once ({1}), followed by one or more blank characters ([[:blank:]]+), and then replace the whole match with the first group.
This defines the field separator FS to be the start of the record followed by a ; and a set of spaces. It then redefines the output field separator OFS to be just a ;. The conversion from FS to OFS is done by reassigning $1 to itself.
awk 'BEGIN{FS="^; *";OFS=";"}{$1=$1}1'
or
awk -F'^; *' -vOFS=";" '{$1=$1}1'

Can RS be set "empty" to split string characters to records?

Is there a way in awk—gawk most likely—to set the record separator RS to empty value to process each character of a string as a separate record? Kind of like setting the FS to empty to separate each character in its own field:
$ echo abc | awk -F '' '{print $2}'
b
but to separate them each as a separate record, like:
$ echo abc | awk -v RS='?' '{print $0}'
a
b
c
The most obvious one:
$ echo abc | awk -v RS='' '{print $0}'
abc
didn't award me (as that one was apparently meant for something else per GNU awk documentation).
Am I basically stuck using for etc.?
EDIT:
@xhienne's answer was what I was looking for, but even using that (20 chars and a questionable variable A :):
$ echo abc | awk -v A="\n" -v RS='(.)' -v ORS="" '{print(RT==A?NR:RT)}'
abc4
wouldn't help me shorten my earlier code using length. Then again, how could I win the Pyth code: +Qfql+Q :D.
If you just want to print one character per line, @klashxx's answer is OK. But sed 's/./&\n/g' would be shorter, since you are golfing.
If you truly want a separate record for each character, the closest solution I have found is:
echo -n abc | awk -v RS='(.)' '{ print RT }'
(use gawk; your input character is in RT, not $1)
[update] If RS is set to the null string, it means to awk that records are separated by blank lines. If I had just defined RS='.', the record separator would have been a mere dot (i.e. a fixed string). But if its length is more than one character, one feature of gawk is to consider RS as a regex. So, what I did here is to give gawk a regex meaning "each character" as a record separator. And I use another feature of gawk: to retrieve the string that matched the regex in the special variable RT (record terminator)
Here are the relevant parts of the gawk manual:
Normally, records are separated by newline characters. You can control how records are separated by assigning values to the built-in variable RS. If RS is any single character, that character separates records. Otherwise, RS is a regular expression. Text in the input that matches this regular expression separates the record.
If RS is set to the null string, then records are separated by blank lines.
Gawk sets RT to the input text that matched the character or regular expression specified by RS.
It is not possible.
The empty string "" (a string without any characters) has a special
meaning as the value of RS. It means that records are separated by one
or more blank lines and nothing else.
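To make that special meaning concrete, a small sketch with made-up input: two blocks of text separated by blank lines come out as two records.

```shell
# Sample input: "a" and "b" on consecutive lines, two blank lines, then "c".
input='a
b


c'

# With RS set to the empty string, blank lines separate the records.
records=$(printf '%s\n' "$input" | awk -v RS= 'END { print NR }')

# The first record still contains its inner newline; show it as "+".
first=$(printf '%s\n' "$input" | awk -v RS= 'NR == 1 { gsub(/\n/, "+"); print }')

printf '%s records, first is %s\n' "$records" "$first"
```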
A simple alternative:
echo abc | awk 'BEGIN{FS="";OFS="\n"}$1=$1'
No, there is no setting of RS that will do what you want. It looks like your requirement is to append a newline after every character that is not a newline. If so, this will produce the output you want:
$ echo 'abc' | awk -v ORS= 'gsub(/[^\n]/,"&\n")'
a
b
c
That will work on any awk on any UNIX system.
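For completeness, the plain loop the question was hoping to avoid might look like the sketch below. It is longer, but it gives access to each character inside awk rather than only printing it.

```shell
# Walk the record one character at a time with substr().
chars=$(printf 'abc\n' |
  awk '{ for (i = 1; i <= length($0); i++) print substr($0, i, 1) }')

printf '%s\n' "$chars"
```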

Getting numerical sub-string of fields using awk

I was wondering how I can get the numerical sub-string of fields using awk in a text file like the one shown below. I am already familiar with the substr() function. However, since the lengths of the fields are not fixed, I have no idea how to separate the text from the numerical part.
A.txt
"Asd.1"
"bcdujcd.2"
"mshde.3333"
"deuhdue.777"
P.S. All the numbers are separated from text part with a single dot (.).
You could try this:
rt$ echo "bcdujcd.2"|awk -F'[^0-9]*' '$0=$2'
If you don't care about any non-digit parts of the line and only want to see the digit parts as output you could use:
awk '{gsub(/[^[:digit:]]+/, " ")}7' A.txt
which will generate:
1
2
3333
777
as output (there's a leading space on each line for the record).
If there can only be one number field per line, then the replacement above can be "" instead of " " in the gsub and the leading space will go away. The replacement with the space keeps multiple numerical fields separated by a space if they occur on a single line (i.e. "foo.88.bar.11" becomes 88 11 instead of 8811).
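The difference between the two replacements can be checked with a short sketch. It uses [^0-9] in place of [^[:digit:]], on the assumption that this is the same set for plain ASCII input:

```shell
line='foo.88.bar.11'

# Replace every run of non-digits with a space: numbers stay separated.
spaced=$(printf '%s\n' "$line" | awk '{ gsub(/[^0-9]+/, " ") } 1')

# Replace every run of non-digits with nothing: numbers run together.
joined=$(printf '%s\n' "$line" | awk '{ gsub(/[^0-9]+/, "") } 1')

printf '[%s] [%s]\n' "$spaced" "$joined"
```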
If you just need the second (period delimited) field of each line of that sort then awk -F. '{print $2}' will do that.
$ awk -F'[".]' '{print $3}' file
1
2
3333
777
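Another option, sketched here as an alternative rather than taken from the answers above, is to let match() locate the first run of digits and cut it out with substr(), which does not depend on the quotes or the dot being in fixed positions:

```shell
# Two lines in the same shape as A.txt.
nums=$(printf '"Asd.1"\n"mshde.3333"\n' |
  awk 'match($0, /[0-9]+/) { print substr($0, RSTART, RLENGTH) }')

printf '%s\n' "$nums"
```

match() sets RSTART and RLENGTH to the position and length of the match, and lines without any digit are skipped because match() returns 0 for them.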