get only string after matching pattern and exclude everything else - awk

I would like to get the string that comes after a matching pattern and exclude everything else. For example, given:
Nov 17 21:52:06 web01-san roundcube: <he1v330n> User dxxssjksdfd [121.177.26.200]; \
Message for undisclosed-recipients:, stanpiatt#yahoo.com
Nov 17 21:48:26 web01-san roundcube: <fqu8k29l> User cxcnjdfdssd [121.177.26.200]; \
Message for undisclosed-recipients:, stanpiatt#yahoo.com
I would like to get ONLY the string after the pattern User and exclude everything else, so the output should be
User dxxssjksdfd
User cxcnjdfdssd
I've tried grep -Po 'User\K[^\s]*' but it doesn't give what I want. How can I do that ?

$ cat infile
Nov 17 21:52:06 web01-san roundcube: <he1v330n> User dxxssjksdfd [121.177.26.200]; \
Message for undisclosed-recipients:, stanpiatt#yahoo.com
Nov 17 21:48:26 web01-san roundcube: <fqu8k29l> User cxcnjdfdssd [121.177.26.200]; \
Message for undisclosed-recipients:, stanpiatt#yahoo.com
Using grep
$ grep -Po 'User [^\s]*' infile
User dxxssjksdfd
User cxcnjdfdssd
Using awk
$ awk 'match($0,/User [^ ]*/){ print substr($0, RSTART,RLENGTH)}' infile
User dxxssjksdfd
User cxcnjdfdssd
Using GNU awk
$ awk 'match($0,/User [^ ]*/,arr){ print arr[0]}' infile
User dxxssjksdfd
User cxcnjdfdssd
Explanation:
/User [^\s]*/
User matches the characters User literally (case sensitive)
Match a single character not present in the list below [^\s]*
* Quantifier — Matches between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\s matches any whitespace character (equal to [\r\n\t\f\v ])
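The attempt in the question, 'User\K[^\s]*', comes up empty because the character right after User is a space, so [^\s]* matches zero characters. If only the name itself is wanted, without the leading User, the match()/substr() approach can skip the five characters of "User ". A minimal sketch in POSIX awk; the function name extract_user_name is made up for illustration:

```shell
# Hypothetical helper: print only the word that follows "User ".
extract_user_name() {
    awk 'match($0, /User [^ ]+/) {
        # RSTART points at the "U"; skip the 5 characters of "User "
        print substr($0, RSTART + 5, RLENGTH - 5)
    }' "$@"
}

printf '%s\n' \
    'Nov 17 21:52:06 web01-san roundcube: <he1v330n> User dxxssjksdfd [121.177.26.200];' \
    'Message for undisclosed-recipients:, stanpiatt#yahoo.com' |
    extract_user_name
```

Lines without "User " produce no output, so the continuation lines are dropped.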

Solution 1st:
The following awk should help.
awk -v RS=" " '/User/{getline;print "User",$0}' Input_file
Output will be as follows.
User dxxssjksdfd
User cxcnjdfdssd
Solution 2nd: You could also loop through the fields of each line.
awk '{for(i=1;i<=NF;i++){if($i ~ /User/){print $i,$(i+1)}}}' Input_file
Solution 3rd: Using awk's sub function.
awk 'sub(/.*User/,""){print "User",$1}' Input_file
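Note that $i ~ /User/ also matches fields that merely contain User, such as UserAgent. A stricter sketch of the field loop, assuming the keyword always appears as a field of its own; the function name print_user_field is hypothetical:

```shell
# Hypothetical helper: print "User <name>" only when a field is exactly "User".
print_user_field() {
    awk '{
        for (i = 1; i < NF; i++)      # stop before NF so $(i+1) always exists
            if ($i == "User")
                print $i, $(i + 1)
    }' "$@"
}

printf '%s\n' \
    'roundcube: <he1v330n> User dxxssjksdfd [121.177.26.200];' \
    'roundcube: UserAgent curl [121.177.26.200];' |
    print_user_field
```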

Related

How to match pattern after first 32 letters using the grep?

I was trying to filter lines with pattern 04:26. I expected the command,
cat file1.txt | grep -E '04:26'
to filter the lines which contain 04:26 after timestamps. Instead, I got the second line also.
file1.txt
2022-12-23T04:26:47.748412+00:00 raspberrypi dnsmasq-dhcp[698]: DHCPREQUEST(eth0) 192.168.42.17 04:c8:07:23:04:26
2022-12-23T04:26:47.749307+00:00 raspberrypi dnsmasq-dhcp[698]: DHCPACK(eth0) 192.168.42.17 04:c8:07:23:34:13
How to mask the first 32 letters of timestamps from matching?
You may use this grep:
grep -E '^.{32,}04:26' file
2022-12-23T04:26:47.748412+00:00 raspberrypi dnsmasq-dhcp[698]: DHCPREQUEST(eth0) 192.168.42.17 04:c8:07:23:04:26
Breakdown:
^: Start
.{32,}: Match 32 or more characters
04:26: Match 04:26
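The same idea can be sketched without a regex interval, which may help where {32,} is not supported: drop the first 32 characters with substr() and look for the literal string with index(). The function name match_after_timestamp is an assumption for illustration:

```shell
# Hypothetical helper: print lines containing "04:26" anywhere after column 32.
match_after_timestamp() {
    # substr($0, 33) is the line without its first 32 characters;
    # index() returns 0 (false) when the literal string is absent.
    awk 'index(substr($0, 33), "04:26")' "$@"
}

printf '%s\n' \
    '2022-12-23T04:26:47.748412+00:00 raspberrypi dnsmasq-dhcp[698]: DHCPREQUEST(eth0) 192.168.42.17 04:c8:07:23:04:26' \
    '2022-12-23T04:26:47.749307+00:00 raspberrypi dnsmasq-dhcp[698]: DHCPACK(eth0) 192.168.42.17 04:c8:07:23:34:13' |
    match_after_timestamp
```

Unlike a cut-based pipeline, this prints the whole original line, timestamp included.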
Alternatively you can use this grep as well:
grep ' .*04:26' file
This works because the timestamp, which should be ignored, is the text before the first space on each line.
An awk solution:
awk '$NF ~ /04:26/' file
With your shown samples, please try the following awk code. It sets the field separator to the first 32 characters of the line, then checks whether the 2nd field contains : followed by 04:26; if so, the line is printed.
awk -F'^.{32}' '$2~/^.*:04:26/' Input_file
With awk checking that there are 2 or more fields as the value that you don't want to match is in the first field.
awk 'NF > 1 && $NF ~ /04:26/' file
Or with awk checking that the line has more than 37 characters and match 04:26 in last field.
awk 'length($0) > 37 && index($NF, "04:26")' file
Or grep matching 32 or more characters and then match 04:26
grep -E '^.{32,}04:26' file
Output
2022-12-23T04:26:47.748412+00:00 raspberrypi dnsmasq-dhcp[698]: DHCPREQUEST(eth0) 192.168.42.17 04:c8:07:23:04:26
There are many simple ways to do this while avoiding edge cases. The cleanest is to describe precisely what you are searching for, here a MAC address. awk is the robust choice, but you can also do it with grep pipelines:
grep for MAC-address:
$ grep -E '([[:xdigit:]]{2}:){5}[[:xdigit:]]{2}' file
$ awk '/([[:xdigit:]]{2}:){5}[[:xdigit:]]{2}/' file
grep for a MAC address that ends with 04:26:
$ grep -E '([[:xdigit:]]{2}:){4}04:26' file
$ awk '/([[:xdigit:]]{2}:){4}04:26/' file
grep for a MAC address in the last field that ends with 04:26:
$ grep -E '([[:xdigit:]]{2}:){4}04:26[[:blank:]]*$' file
$ awk '$NF~/([[:xdigit:]]{2}:){4}04:26/' file
grep for a MAC address that contains 04:26:
$ grep -oE '([[:xdigit:]]{2}:){5}[[:xdigit:]]{2}' file | grep '04:26' | grep -Ff - file
$ awk 'match($0,/([[:xdigit:]]{2}:){5}[[:xdigit:]]{2}/) && substr($0,RSTART,RLENGTH)~/04:26/' file
How to mask first 32 letters
You might use cut to get the 33rd and following characters of each line. Let file1.txt content be
2022-12-23T04:26:47.748412+00:00 raspberrypi dnsmasq-dhcp[698]: DHCPREQUEST(eth0) 192.168.42.17 04:c8:07:23:04:26
2022-12-23T04:26:47.749307+00:00 raspberrypi dnsmasq-dhcp[698]: DHCPACK(eth0) 192.168.42.17 04:c8:07:23:34:13
then
cut --characters=33- file1.txt
gives output
raspberrypi dnsmasq-dhcp[698]: DHCPREQUEST(eth0) 192.168.42.17 04:c8:07:23:04:26
raspberrypi dnsmasq-dhcp[698]: DHCPACK(eth0) 192.168.42.17 04:c8:07:23:34:13
which could then be combined with your code as follows
cut --characters=33- file1.txt | grep -E '04:26'
which results in the output
raspberrypi dnsmasq-dhcp[698]: DHCPREQUEST(eth0) 192.168.42.17 04:c8:07:23:04:26
Explanation: --characters= selects certain characters from each line; 33- means the 33rd character and everything after it.
(tested in GNU grep 3.4)

Extract substring from a field with single awk in AIX

I have a file file with content like:
stringa 8.0.1.2 stringx
stringb 12.01.0.0 stringx
I have to get a substring from field 2 (first two values with the dot).
I am currently doing cat file | awk '{print $2}' | awk -F. '{print $1"."$2}' and getting the expected output:
8.0
12.01
The query is how to do this with single awk?
I have tried with match() but not seeing an option for a back reference.
Any help would be appreciated.
You can do something like this.
$ awk '{ split($2,str,"."); print str[1]"."str[2] }' file
8.0
12.01
Also, keep in mind that your cat is not needed. Simply give the file directly to awk.
With GNU grep please try following command once.
grep -oP '^\S+\s+\K[[:digit:]]+\.[[:digit:]]+' Input_file
Explanation: This uses GNU grep. -o prints only the matched part and -P enables PCRE. The regex matches the leading non-space characters and the whitespace that follows, then \K discards what has been matched so far. It then matches one or more digits, a dot, and one or more digits, and prints that part of each line where a match is found.
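Since -P is a GNU grep extension, it may not be available on a stock AIX system. A rough equivalent using only basic regular expressions in sed is sketched here; the function name first_two_parts is made up for illustration, and the sketch assumes fields are separated by spaces:

```shell
# Hypothetical helper: print "digits.digits" from the start of the second field.
first_two_parts() {
    # Skip the first field and the spaces after it, capture digits.digits,
    # discard the rest of the line; -n plus the p flag prints only on a match.
    sed -n 's/^[^ ]* *\([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p' "$@"
}

printf '%s\n' 'stringa 8.0.1.2 stringx' 'stringb 12.01.0.0 stringx' | first_two_parts
```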
I would use awk's split function as follows. Let file.txt content be
stringa 8.0.1.2 stringx
stringb 12.01.0.0 stringx
then
awk '{split($2,arr,".");print arr[1]"."arr[2]}' file.txt
output
8.0
12.01
Explanation: split the 2nd field at . and put the elements into the array arr.
(tested in gawk 4.2.1)
You could match digits . digits from the second column and print if there is a match:
awk 'match($2, /^[[:digit:]]+\.[[:digit:]]+/) {
print substr($2, RSTART, RLENGTH)
}
' file
Output
8.0
12.01
Also with GNU awk and gensub():
awk '{print gensub(/([[:digit:]]+[.][[:digit:]]+)(.*)/,"\\1","g",$2)}' file
8.0
12.01
gensub() provides the ability to specify components of a regexp in the replacement text using parentheses in the regexp to mark the components and then specifying \\n in the replacement text, where n is a digit from 1 to 9.
You should perhaps not use awk at all (or any other external program, for that matter) but rely on the field-splitting capabilities of the shell and some variable expansion. For instance:
# printf "%s\n%s\n" "stringa 8.0.1.2 stringx" \
"stringb 12.01.0.0 stringx" |\
while read first second third junk ; do
printf "=%s= =%s= =%s=\n" "$first" "$second" "$third"
done
=stringa= =8.0.1.2= =stringx=
=stringb= =12.01.0.0= =stringx=
As you can see the value is captured in the variable "$second" already and you just need to further isolate the parts you want to see - the first and second part separated by a dot. You can do that either with parameter expansion:
# variable="8.0.1.2"
# echo ${variable%.*.*}
8.0
or like this:
# variable="12.01.0.0"
# echo ${variable%${variable#*.*.}}
12.01
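or you can use a further read statement to separate the parts and then put them back together (this relies on the shell running the last command of a pipeline in the current shell, as ksh does; in bash the variables set by read would be lost):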
# variable="12.01.0.0"
# echo ${variable} | IFS=. read parta partb junk
# echo ${parta}.${partb}
12.01
So, putting all together:
# printf "%s\n%s\n" "stringa 8.0.1.2 stringx" \
"stringb 12.01.0.0 stringx" |\
while read first second third junk ; do
printf "%s\n" "$second" | IFS=. read parta partb junk
printf "%s.%s\n" "$parta" "$partb"
done
8.0
12.01
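For shells that run every pipeline stage in a subshell, the inner read can be fed from a here-document instead of a pipe. This is only a sketch of that variation; the function name major_minor is hypothetical:

```shell
# Hypothetical helper: read "name version rest" lines, print the first two
# dot-separated parts of the version.
major_minor() {
    while read -r first second rest; do
        # The here-document feeds $second to read without a pipeline,
        # so parta and partb are set in the shell that runs the loop.
        IFS=. read -r parta partb junk <<EOF
$second
EOF
        printf '%s.%s\n' "$parta" "$partb"
    done
}

printf '%s\n' 'stringa 8.0.1.2 stringx' 'stringb 12.01.0.0 stringx' | major_minor
```

Setting IFS only on the read command keeps the change local to that one call.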

awk command to print columns with colum data

cat file1.txt | awk -F '{print $1 "|~|" $2 "|~|" $3}' > file2.txt
I am using above command to filter first three columns from file1 and put into file.
But only getting the column names and not the column data.
How to do that?
|~| - is the delimiter.
file1.txt has values as :
a|~|b|~|c|~|d|~|e
1|~|2|~|3|~|4|~|5
11|~|22|~|33|~|44|~|55
111|~|222|~|333|~|444|~|555
my expected output is :
a|~|b|~|c
1|~|2|~|3
11|~|22|~|33
111|~|222|~|333
With your shown samples, please try the following awk code. It sets the field separator to |~|, removes leading whitespace from each line, and prints the first three fields.
awk -F'\\|~\\|' -v OFS='|~|' '{sub(/^[[:blank:]]+/,"");print $1,$2,$3}' Input_file
In case you want to keep the leading spaces (which were in the initial post before the edit), try the following:
awk -F'\\|~\\|' -v OFS='|~|' '{print $1,$2,$3}' Input_file
NOTE: After a chat with the user, it turned out the code was not working because gunzip -c file was being used incorrectly: its output was saved into a variable, and the awk program was run on that variable. Correcting that command produced the right file, and the awk program ran fine on it. Adding this as a reference for future readers.
One approach would be:
awk -v FS="," -v OFS="|~|" '{gsub(/[|][~][|]/,","); sub(/^\s*/,""); print $1,$2,$3}' file1.txt
The approach simply replaces all "|~|" with a "," setting the output file separator to "|~|". All leading whitespace is trimmed with sub().
Example Use/Output
With your data in file1.txt, you would have:
$ awk -v FS="," -v OFS="|~|" '{gsub(/[|][~][|]/,","); sub(/^\s*/,""); print $1,$2,$3}' file1.txt
a|~|b|~|c
1|~|2|~|3
11|~|22|~|33
111|~|222|~|333
Let me know if this is what you intended. You can simply redirect, e.g. > file2.txt to write to the second file.
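Replacing the delimiter with a comma can go wrong if the data itself contains commas. One way to sidestep that, sketched here under the assumption that no field contains the literal |~|, is to use the delimiter directly as the field separator and put each | in a bracket expression so no backslash escaping is needed. The function name first_three_columns is made up for illustration:

```shell
# Hypothetical helper: keep the first three |~|-separated columns.
first_three_columns() {
    # [|] matches a literal pipe; ~ has no special meaning inside a regex.
    awk -F'[|]~[|]' -v OFS='|~|' '{ print $1, $2, $3 }' "$@"
}

printf '%s\n' 'a|~|b|~|c|~|d|~|e' '1|~|2|~|3|~|4|~|5' | first_three_columns
```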
For such cases, my bash+awk script rcut comes in handy:
rcut -Fd'|~|' -f-3 ip.txt
The -F option enables fixed string input delimiter (which is given using the -d option). And by default, the output field separator will also be same as -d when -F is active. -f-3 is similar to cut syntax to specify first three fields.
For better speed, use hck command:
hck -Ld'|~|' -D'|~|' -f-3 ip.txt
Here, -L enables literal field separator and -D specifies output field separator.
Another benefit is that hck supports -z option to automatically handle common compressed formats based on filename extension (adding this since OP had an issue with compressed input).
Another way:
sed 's/|~|/\t/g' file1.txt | awk '{print $1"|~|"$2"|~|"$3}' > file2.txt
First replace the |~| delimiter with a tab, then use awk's default field separator and print the columns you need.

what's the '~!' function in awk

I made a typo in awk and got an unexpected result.
So I got confused about this.
My passwd file has 39 lines.
# wc -l /etc/passwd
39 /etc/passwd
now the '!~' in awk shows
# awk -F':' '$1 !~ /root/' /etc/passwd | wc -l
37
while the '~!' got a different result
# awk -F':' '$1 ~! /root/' /etc/passwd | wc -l
1
after that I deleted wc command to see which line it got
# awk -F':' '$1 ~! /root/' /etc/passwd
vbirduser1:x:1000:1001::/home/vbirduser1:/bin/bash
It's the first regular user on my Linux system.
So, what does '~!' mean in awk?
https://www.gnu.org/software/gawk/manual/html_node/Comparison-Operators.html
x !~ y True if the string x does not match the regexp denoted by y
In your case, awk -F':' '$1 !~ /root/' /etc/passwd prints every line whose first field does not match /root/, so you get 37 lines.
~! is not an operator. The expression '$1 ~! /root/' does two things:
!/root/: evaluates $0 !~ /root/, which is 1 if the line does not contain root and 0 otherwise
$1 ~ !/root/: I added spaces to make it clearer. Here $1 is matched against the result of the previous step (0 or 1), used as a regex
vbirduser1 contains 1 and its line does not contain root, so it is in the output. Any other line without root whose first field contains 1, or any line with root whose first field contains 0, would also be printed.
Expressions like this should be avoided in production code.
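The two steps can be written out explicitly to check the reading above. This is a sketch on made-up passwd-style lines (only the vbirduser1 line comes from the question); the function name tilde_bang is hypothetical:

```shell
# Hypothetical helper: the spelled-out form of  $1 ~! /root/
tilde_bang() {
    # ($0 !~ /root/) is 1 or 0; that number is then used as the regex for $1.
    awk -F: '$1 ~ ($0 !~ /root/)' "$@"
}

printf '%s\n' \
    'root:x:0:0:root:/root:/bin/bash' \
    'daemon:x:2:2:daemon:/sbin:/sbin/nologin' \
    'vbirduser1:x:1000:1001::/home/vbirduser1:/bin/bash' \
    'user0:x:1002:1002::/root:/bin/bash' |
    tilde_bang
```

This prints the vbirduser1 line (no root in the line, so $1 is matched against 1) and the user0 line (root appears in the line, so $1 is matched against 0).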

awk to transpose lines of a text file

A .csv file that has lines like this:
20111205 010016287,1.236220,1.236440
It needs to read like this:
20111205 01:00:16.287,1.236220,1.236440
How do I do this in awk? Experimenting, I got this far. I need to do it in two passes I think. One sub to read the date&time field, and the next to change it.
awk -F, '{print;x=$1;sub(/.*=/,"",$1);}' data.csv
Use that awk command:
echo "20111205 010016287,1.236220,1.236440" | \
awk -F'[ ,]' '{printf "%s %s:%s:%s.%s,%s,%s\n", \
$1,substr($2,1,2),substr($2,3,2),substr($2,5,2),substr($2,7,3),$3,$4}'
Explanation:
-F'[ ,]': sets the field separator to a space or a comma
printf "%s %s:%s:%s.%s,%s,%s\n": formats the output
substr($2,1,2), substr($2,3,2), substr($2,5,2), substr($2,7,3): cut the second field ($2) into hours, minutes, seconds and milliseconds
Or use that sed command:
echo "20111205 010016287,1.236220,1.236440" | \
sed 's/\([0-9]\{8\}\) \([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{2\}\)\([0-9]\{3\}\)/\1 \2:\3:\4.\5/g'
Explanation:
[0-9]\{8\}: first match a 8-digit pattern and save it as \1
[0-9]\{2\}...: after a space match 3 times a 2-digit pattern and save them to \2, \3 and \4
[0-9]\{3\}: and at last match 3-digit pattern and save it as \5
\1 \2:\3:\4.\5: format the output
sed is better suited to this job since it's a simple substitution on single lines:
$ sed -r 's/( ..)(..)(..)/\1:\2:\3./' file
20111205 01:00:16.287,1.236220,1.236440
but if you prefer here's GNU awk with gensub():
$ awk '{print gensub(/( ..)(..)(..)/,"\\1:\\2:\\3.",1)}' file
20111205 01:00:16.287,1.236220,1.236440
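Both commands rely on GNU features (sed -r and gensub()). The same substitution can be sketched with basic regular expressions, which should also work with a non-GNU sed; the function name format_time is made up for illustration:

```shell
# Hypothetical helper: turn " HHMMSSmmm" into " HH:MM:SS.mmm".
format_time() {
    # After the first space, capture three pairs of characters and
    # re-emit them with ":" between them and "." before the remainder.
    sed 's/ \(..\)\(..\)\(..\)/ \1:\2:\3./' "$@"
}

printf '%s\n' '20111205 010016287,1.236220,1.236440' | format_time
```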