Filter fields with multiple delimiters - awk

I've done extensive searching for a solution but can't quite find what I need. Have a file like this:
aaa|bbb|ccc|ddd~eee^fff^ggg|hhh|iii
111|222|333|444~555^666^777|888|999
AAA|BBB|CCC||EEE|FFF
What I want to do is use awk or something else to return the lines from this file with a change to field 4 (pipe-delimited). Field 4 uses a tilde and carets as sub-delimiters, which is where I'm struggling. We want the lines returned like this:
aaa|bbb|ccc|eee|hhh|iii
111|222|333|555|888|999
AAA|BBB|CCC||EEE|FFF
If field 4 is empty, the line is returned as is. But when field 4 has multiple values, we want only the first value right after the tilde returned.

awk -F "[|^~]" 'BEGIN{OFS="|"}NF==6{print} NF==9{print $1,$2,$3,$5,$8,$9}' tmp.txt
aaa|bbb|ccc|eee|hhh|iii
111|222|333|555|888|999
AAA|BBB|CCC||EEE|FFF
use a regular expression as your delimiter
count the fields to decide what to do
set the output delimiter to pipe
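To see why the field counts 6 and 9 drive this command, here is a quick check (a sketch using the sample lines from the question) that prints NF for each line when |, ^ and ~ are all treated as delimiters:

```shell
# Sample lines copied from the question.
input='aaa|bbb|ccc|ddd~eee^fff^ggg|hhh|iii
111|222|333|444~555^666^777|888|999
AAA|BBB|CCC||EEE|FFF'

# Print the number of fields awk sees on each line.
counts=$(printf '%s\n' "$input" | awk -F '[|^~]' '{print NF}')
printf '%s\n' "$counts"
```

Lines with a populated field 4 split into 9 fields and lines with an empty field 4 into 6. A line with any other count matches neither condition and is not printed.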

$ awk -F'|' '{sub(/^[^~]*~/, "", $4); sub(/\^.*/, "", $4)} 1' OFS='|' file
aaa|bbb|ccc|eee|hhh|iii
111|222|333|555|888|999
AAA|BBB|CCC||EEE|FFF
This approach makes no assumption about the contents of fields other than field 4. The other fields may, for example, contain ~ or ^ characters and that will not affect the results.
How it works
-F'|'
This sets the field delimiter on input to |.
sub(/^[^~]*~/, "", $4)
If field 4 contains a ~, this removes the first ~ and everything before the first ~.
sub(/\^.*/, "", $4)
If field 4 contains ^, this removes the first ^ and everything after it.
1
This is awk's cryptic shorthand for print-the-line.
OFS='|'
This sets the field separator on output to |.
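The claim that other fields may contain ~ or ^ can be checked with a small sketch. The input below is hypothetical (field 2 of the first line contains a caret), and -v OFS is used instead of the trailing OFS= operand only so that the command reads from a pipe without a file name:

```shell
# Hypothetical input: field 2 of the first line contains a caret.
input='aaa|b^b|ccc|ddd~eee^fff^ggg|hhh|iii
AAA|BBB|CCC||EEE|FFF'

# Only field 4 is edited; every other field is passed through untouched.
result=$(printf '%s\n' "$input" |
  awk -F'|' -v OFS='|' '{sub(/^[^~]*~/, "", $4); sub(/\^.*/, "", $4)} 1')
printf '%s\n' "$result"
```

The caret in field 2 survives because both sub() calls are restricted to $4.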


Regex match last three integers and one character before that integers

I have been trying this for a long time, but my search is failing. I have the test data below:
mhie0104:x:104:600:Martinescu Horia:/home/scs/gr911/mhie0104:/bin/bash
mlie0105:x:105:600:Martinescu Laurentiu:/home/scs/gr911/mlie0105:/bin/bash
mmie0106:x:106:600:Martinescu Marius:/home/scs/gr911/mmie0106:/bin/bash
mnie0107:x:107:600:Martinescu Nicolae:/home/scs/gr911/mnie0107:/bin/bash
mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash
I am trying to find the users whose names end in exactly three digits. This is what I tried:
awk -F: '$1 ~ /*[a-z]//d{3}/'
My understanding of the regex above is:
"*" at the beginning should match any characters
[a-z] should match any character just before the digits
Finally, three digits
I also tried the variation below:
awk -F: '$1 ~ /*?//d{3}/'
So what I need from the test data above is:
mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash
1st solution: If you only need to check the last 4 characters of the 1st field, where the 4th character from the end is NOT a digit and the last 3 are digits, try the following code.
awk -F':' '$1 ~ /[^0-9][0-9]{3}$/' Input_file
Explanation:
Set the field separator to : for every line of Input_file.
Then test the 1st field against /[^0-9][0-9]{3}$/: if the 4th character from the end is anything other than a digit and the last 3 characters are digits, print the line.
2nd solution: If no character of the 1st field before its last 3 characters may be a digit, and the last 3 characters must all be digits, try the following code.
awk -F':' '
substr($1,1,length($1)-3)!~/[0-9]/ && substr($1,length($1)-2)~/^[0-9]{3}$/
' Input_file
Explanation:
First, set the field separator to : for this awk program.
The condition substr($1,1,length($1)-3)!~/[0-9]/ uses awk's substr function to check that everything in the 1st field apart from its last 3 characters contains no digit.
The condition substr($1,length($1)-2)~/^[0-9]{3}$/ checks that the last 3 characters are 3 digits. The substring is matched as a string; converting it with int() first would drop leading zeros (int("007") is 7), so a name ending in 007 would fail the 3-digit test.
If both conditions are TRUE, the line is printed.
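The two solutions differ only when a digit appears earlier in the name. A minimal sketch of that difference, assuming a hypothetical username m1iel110 and bare usernames on each line (so $0 stands in for $1); [0-9][0-9][0-9] is written out instead of {3} because some older awk builds do not support interval expressions:

```shell
# Hypothetical usernames; "m1iel110" has a digit before its final three digits.
users='mhie0104
mpiel110
m1iel110'

# Check 1: only the character right before the last three digits must be a non-digit.
first=$(printf '%s\n' "$users" | awk '/[^0-9][0-9][0-9][0-9]$/')

# Check 2: nothing before the last three characters may be a digit.
second=$(printf '%s\n' "$users" | awk '
  substr($0, 1, length($0) - 3) !~ /[0-9]/ &&
  substr($0, length($0) - 2) ~ /^[0-9][0-9][0-9]$/')

printf 'check 1:\n%s\ncheck 2:\n%s\n' "$first" "$second"
```

Check 1 accepts both mpiel110 and m1iel110, while check 2 accepts only mpiel110.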
You can't use the \d notation in awk; it is Perl-style regex syntax. Use [0-9] instead.
Solution:
$ awk -F: '$1 ~ /[a-zA-Z][0-9]{3}$/' file
mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash
You can use negative lookbehind in perl
$ perl -F: -ne ' print if $F[0]=~/(?<!\d)\d{3}$/ ' gameiswar.txt
mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash
For this particular task, sed might be used as well:
sed '/^[^0-9]*[0-9]\{3\}:/!d' file
I am not sure whether a username may consist only of digits, but since the username is the first field and : is the delimiter, the pattern can be anchored at the start of the line and end at the first :.
Here, ([^:]*[^0-9])? matches optional repetitions of any char except : followed by a char other than 0-9:
awk '/^([^:]*[^0-9])?[0-9]{3}:/' file
If there has to be a leading char a-z
awk '/^[^:]*[a-z][0-9]{3}:/' file
Output
mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash
mawk '!_<NF' FS='^[^:]*[a-z][0-9][0-9][0-9]:'
— or —
gawk '!_<NF' FS='^[^:]*[a-z][0-9]{3}:'
mpiel110:x:110:600:Malinescu Paul:/home/scs/gr911/mpie110:/bin/bash

Gawk matching one word - one unexpected match

I want to get all rows where the string in Column 3 contains the exact word "aa" (case-insensitive match).
The gawk command used in the awk file is:
$3 ~ /\<aa\>/
The BEGIN statement specifies: IGNORECASE = 1
The command returns 20 rows. What is puzzling is this value in Column 3 in the returned rows:
aA.AHAB
How do I avoid this row as it is not a word by itself because there is dot following the first two aa's and not a space?
A is a word character. . is not a word character. \> matches the zero-width string at the end of a word. Such a zero-width string occurs between the second A and the dot in aA.AHAB, so the pattern matches.
To search for the string aa delimited by space characters (or start/end of field):
$3 ~ /(^|[ ])aa([ ]|$)/
Add any other characters that you care about inside the set ([ ]).
Note that by default, awk splits records into fields on whitespace, so you will not get any spaces in $3 unless you have changed the value of FS.
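A minimal sketch of the space-delimited match, under two assumptions that are not in the question: the columns are tab-separated (so $3 can contain spaces), and tolower() stands in for IGNORECASE, which is gawk-specific:

```shell
# Hypothetical tab-separated rows; only column 3 matters here.
result=$(printf '1\t2\taA.AHAB\n1\t2\tfoo AA bar\n1\t2\taa\n1\t2\tbaa\n' |
  awk -F'\t' 'tolower($3) ~ /(^|[ ])aa([ ]|$)/ {print $3}')
printf '%s\n' "$result"
```

aA.AHAB is rejected because the character after the first two letters is a dot, which is not in the [ ] set.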
1st solution: To match a field that is exactly aa, try:
awk 'BEGIN{IGNORECASE=1} $3 ~ /^aa$/' Input_file
2nd solution: Without the IGNORECASE option, try:
awk 'tolower($3)=="aa"' Input_file
Question: Why does the awk regex pattern /\<aa\>/ match a string like "aa.bbb"?
We can quickly verify this with:
$ echo aa.bbb | awk '/\<aa\>/'
aa.bbb
The answer is simply found in the manual of gnu awk:
3.7 gawk-Specific Regexp Operators
GNU software that deals with regular expressions provides a number of additional regexp operators. These operators are described in this section and are specific to gawk; they are not available in other awk implementations. Most of the additional operators deal with word matching. For our purposes, a word is a sequence of one or more letters, digits, or underscores (‘_’):
\<: Matches the empty string at the beginning of a word. For example, /\<away/ matches "away" but not "stowaway".
\>: Matches the empty string at the end of a word. For example, /stow\>/ matches "stow" but not "stowaway".
source: GNU awk manual: Section 3 :: Regular Expressions
So to come back to the example from above, the string "aa.bbb" contains two words, "aa" and "bbb", since the <dot>-character is not part of the character set that can build up a word. The empty strings matched here are the empty string before "aa.bbb" and the empty string between the characters a and . (an empty string is really an empty string, length 0, 0 characters, commonly written as "").
Solution to the OP: Since FS is most likely the default value, the field $3 cannot have a space. So the following two solutions are possible:
$3 ~ /^aa$/
$3 == "aa"
If the field separator FS is defined in the code, the following might work:
(" " $3 " ") ~ / aa /
$3 ~ /(^|[ ])aa([ ]|$)/ # See solution of JHNC
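The padding idea can be tried out as follows. This is only a sketch: it assumes tab-separated input so that $3 may contain spaces, and the sample rows are made up. Padding $3 with a space on each side lets a single pattern cover the start, middle and end of the field:

```shell
# Hypothetical tab-separated rows; column 3 is the field under test.
result=$(printf 'x\ty\taa.bbb\nx\ty\tcc aa dd\nx\ty\taa\n' |
  awk -F'\t' '(" " $3 " ") ~ / aa / {print $3}')
printf '%s\n' "$result"
```

The row with aa.bbb is skipped because the padded string never contains aa followed by a space.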

AWK doesn't update record with new separator

I'm a bit confused with awk (I'm totally new to awk)
find static/*
static/conf
static/conf/server.xml
My goal is to remove 'static/' from the result.
First step:
find static/* | awk -F/ '{print $(0)}'
static/conf
static/conf/server.xml
Same result. I expected it. Now deleting the first part:
find static/* | awk -F/ '{$1="";print $(0)}'
 conf
 conf server.xml
That's nearly good, but I don't know why the delimiter is killed.
But I can deal with it just adding the delimiter to the output:
find static/* | awk -F/ '{$1="";OFS=FS;print $(0)}'
 conf
/conf/server.xml
OK, now I'm completely lost.
Why is a '/' on the second line and not on the first? In both cases I deleted the first column.
Any explanations or ideas?
BTW my preferred output would be
conf
conf/server.xml
Addendum: thank you for your kind answers. they will help me to fix the problem.
However I want to understand why the first '/' is deleted in my last try. To make it a bit clearer:
find static/* | awk -F/ '{$1="";OFS="#";print $(0)}'
 conf
^ a space and no / ?
#conf#server.xml
but I don't know why the delimiter is killed.
Whenever you redefine a field in awk using a statement like:
$n = new_value
awk will rebuild the current record $0 and automatically replace all field separators defined by FS, by the output field separator OFS (see below). The default value of OFS is a single space. This implies the following:
awk -F/ '{$1="";print $(0)}'
The field separator FS is set to a single <slash>-character. The first field is reset to "" which enables the re-evaluation of $0 by which all regular expression matches corresponding to FS are replaced by the string OFS which is currently a single space.
awk -F/ '{$1="";OFS=FS;print $(0)}'
The same action applies as earlier. However, after the re-computation of $0, the output field separator OFS is set to FS. This implies that from record 2 onward, you will not replace FS with a space, but with the value of FS.
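The timing described above can be reproduced with the two paths from the question. This sketch first changes OFS after the assignment to $1 (so the first record is still rebuilt with the default space) and then sets both separators in BEGIN so every record is rebuilt with /:

```shell
# The two paths from the question.
paths='static/conf
static/conf/server.xml'

# OFS changes after $1 is assigned: record 1 is rebuilt with " ", record 2 with "#".
late=$(printf '%s\n' "$paths" | awk -F/ '{$1=""; OFS="#"; print $0}')

# OFS is set before any record is read: both records are rebuilt with "/".
early=$(printf '%s\n' "$paths" | awk 'BEGIN{FS=OFS="/"} {$1=""; print $0}')

printf 'late:\n%s\nearly:\n%s\n' "$late" "$early"
```

In the second variant each line still starts with the separator that followed the emptied first field, which is why the remaining / has to be stripped separately.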
A possible solution along the same lines:
awk 'BEGIN{FS=OFS="/"}{$1=""}{print substr($0,2)}'
The substring function substr is needed to remove the first /
DESCRIPTION
The awk utility shall interpret each input record as a sequence of fields where, by default, a field is a string of non- <blank> non- <newline> characters. This default <blank> and <newline> field delimiter can be changed by using the FS built-in variable or the -F sepstring option. The awk utility shall denote the first field in a record $1, the second $2, and so on. The symbol $0 shall refer to the entire record; setting any other field causes the re-evaluation of $0. Assigning to $0 shall reset the values of all other fields and the NF built-in variable.
Variables and Special Variables
References to nonexistent fields (that is, fields after $NF), shall evaluate to the uninitialized value. Such references shall not create new fields. However, assigning to a nonexistent field (for example, $(NF+2)=5) shall increase the value of NF; create any intervening fields with the uninitialized value; and cause the value of $0 to be recomputed, with the fields being separated by the value of OFS. Each field variable shall have a string value or an uninitialized value when created. Field variables shall have the uninitialized value when created from $0 using FS and the variable does not contain any characters.
source: POSIX standard: awk utility
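The rule about assigning to a nonexistent field can be observed directly. A small sketch, using a made-up two-field record and - as OFS so the rebuilt separators are visible:

```shell
# Assign to field NF+2 of a two-field record, then show NF and the rebuilt $0.
result=$(printf 'a b\n' | awk -v OFS='-' '{$(NF + 2) = 5; print NF; print $0}')
printf '%s\n' "$result"
```

NF grows to 4, the intervening third field is created empty, and $0 is recomputed with OFS between all four fields.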
Be aware that the default field separator FS=" " follows special rules: fields are separated by runs of blanks and newlines, and leading and trailing blanks are ignored.
If you have GNU find you don't need awk at all.
$ find static/ -mindepth 1 -printf '%P\n'
conf
conf/server.xml
1st solution: Assuming the string static/ occurs only once in each line, set the field separator to the string static/ and print the last field, which is everything after static/.
find static/* | awk -F'static/' '{print $NF}'
2nd solution: A more generic solution. It matches everything from the first occurrence of / to the end of the line and prints the match without the leading /.
find static/* | awk 'match($0,/\/.*/){print substr($0,RSTART+1,RLENGTH)}'
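Both solutions can be compared without running find by feeding them the two paths from the question. In this sketch the match() variant omits the length argument of substr, which simply takes everything to the end of the line:

```shell
# The two paths from the question, standing in for the output of find.
paths='static/conf
static/conf/server.xml'

# Split on the string "static/" and print the last field.
by_fs=$(printf '%s\n' "$paths" | awk -F'static/' '{print $NF}')

# Find the first "/" and print everything after it.
by_match=$(printf '%s\n' "$paths" | awk 'match($0, /\/.*/) {print substr($0, RSTART + 1)}')

printf '%s\n' "$by_fs"
```

The two results are identical for these paths; they would differ for a path that contains static/ more than once.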
When you reset the first field value the field is still there. Just remove the initial / chars after that with sub(/^\/+/, "") (where ^\/+ pattern matches one or more / chars at the start of the string):
awk 'BEGIN{OFS=FS="/"} {$1="";sub(/^\/+/, "")}1'
Demo:
s="static/conf
static/conf/server.xml"
awk 'BEGIN{OFS=FS="/"} {$1="";sub(/^\/+/, "")}1' <<< "$s"
Output:
conf
conf/server.xml
Note that with BEGIN{OFS=FS="/"} you set the input/output field separator just once at the start, and 1 at the end triggers the default line print operation.

Find an exact match from a patterns file for another file using awk (patterns contain regex symbols to be ignored)

I have a file which has the following patterns.
NO_MATCH
NO_MATCH||NO_MATCH
NO_MATCH||NO_MATCH||NO_MATCH
NO_MATCH||NO_MATCH||NO_MATCH||NO_MATCH
These should be matched exactly with the 5th column of the target csv. I have tried:
awk 'NR==FNR{a[$0]=$0; next;} NR>FNR{if($5==a[$0])print $0}' pattern.csv input.csv > final_out.csv
But the || in the patterns file result in bad matches. The 5th column in the target csv looks something like this:
"AAAA||AAAA"
"BBBB||BBBB"
"NO_MATCH"
"NO_MATCH||NO_MATCH||NO_MATCH"
"NO_MATCH||BBBB"
I need to extract the 3rd and 4th lines.
Edit: I need exact match such as line 3 & 4. Hope this clears up the issue. The columns in the csv are double quoted as shown, and the quotes around fifth column should be removed.
awk 'BEGIN{FS=OFS=","} NR==FNR{a["\""$0"\""];next} ($5 in a){gsub(/^"|"$/,"",$5);print}' pattern.csv input.csv > final_out.csv
Store each line of pattern.csv as an array key, wrapped in double quotes. For each line in input.csv, if the fifth column is a key in the array, remove the quotes around it and print the line.
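A self-contained run of the command, as a sketch: the question does not show the first four columns, so the rows below are invented and only the fifth column follows the samples given. The files are written to a temporary directory:

```shell
# Build hypothetical pattern and input files in a temporary directory.
tmp=$(mktemp -d)
printf '%s\n' 'NO_MATCH' 'NO_MATCH||NO_MATCH||NO_MATCH' > "$tmp/pattern.csv"
printf '%s\n' \
  '"1","a","b","c","AAAA||AAAA"' \
  '"2","a","b","c","NO_MATCH"' \
  '"3","a","b","c","NO_MATCH||NO_MATCH||NO_MATCH"' \
  '"4","a","b","c","NO_MATCH||BBBB"' > "$tmp/input.csv"

# Exact lookup of the quoted fifth column; no regex matching is involved.
result=$(awk 'BEGIN{FS=OFS=","} NR==FNR{a["\""$0"\""];next} ($5 in a){gsub(/^"|"$/,"",$5);print}' \
  "$tmp/pattern.csv" "$tmp/input.csv")
printf '%s\n' "$result"
rm -rf "$tmp"
```

Because the patterns are used as array keys rather than regular expressions, the || in them needs no escaping. Note that splitting on , assumes no quoted column contains a comma.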

Getting numerical sub-string of fields using awk

I was wondering how I can get the numerical sub-string of fields using awk in a text file like the one shown below. I am already familiar with the substr() function. However, since the length of the fields is not fixed, I have no idea how to separate the text from the numerical part.
A.txt
"Asd.1"
"bcdujcd.2"
"mshde.3333"
"deuhdue.777"
P.S. All the numbers are separated from text part with a single dot (.).
You may try like this:
$ echo "bcdujcd.2"|awk -F'[^0-9]*' '$0=$2'
If you don't care about any non-digit parts of the line and only want to see the digit parts as output you could use:
awk '{gsub(/[^[:digit:]]+/, " ")}7' A.txt
which will generate:
1
2
3333
777
as output (there's a leading space on each line for the record).
If there can only be one number field per line, then the replacement above can be "" instead of " " in the gsub and the leading space will go away. The replacement with the space keeps multiple numerical fields separated by a space if they occur on a single line (i.e. "foo.88.bar.11" becomes 88 11 instead of 8811).
If you just need the second (period delimited) field of each line of that sort then awk -F. '{print $2}' will do that.
$ awk -F'[".]' '{print $3}' file
1
2
3333
777
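Another option, not taken from the answers above, is to let match() locate the dot followed by digits and cut the digits out with substr(). A sketch that relies on the stated rule that a single dot separates the text from the number:

```shell
# The contents of A.txt from the question.
input='"Asd.1"
"bcdujcd.2"
"mshde.3333"
"deuhdue.777"'

# RSTART points at the dot, so skip one character and shorten the length by one.
result=$(printf '%s\n' "$input" |
  awk 'match($0, /\.[0-9]+/) {print substr($0, RSTART + 1, RLENGTH - 1)}')
printf '%s\n' "$result"
```

Lines without a dot followed by digits are skipped rather than printed empty.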