How to find and match an exact string in a column using AWK?

I'm having trouble on matching an exact string that I want to find in a file using awk.
I have the file called "sup_groups.txt" that contains:
(the structure is: "group_name:pw:group_id:user1<,user2>...")
adm:x:4:syslog,adm1
admins:x:1006:adm2,adm12,manuel
ssl-cert:x:122:postgres
ala2:x:1009:aceto,salvemini
conda:x:1011:giovannelli,galise,aceto,caputo,haymele,salvemini,scala,adm2,adm12
adm1Group:x:1022:adm2,adm1,adm3
docker:x:998:manuel
Now, I want to extract the records whose user list contains the user "adm1" and print the first column (the group name). But as you can see there is also a user called "adm12", so when I do this:
awk -F: '$4 ~ "adm1" {print $1}' sup_groups.txt
the output is:
adm
admins
conda
adm1Group
The command of course also prints the records that merely contain the string "adm12", but I don't want those lines because I'm interested only in the user "adm1".
So, how can I change this command so that it prints just lines 1 and 6 (excluding 2 and 5)?
Thank you so much, and sorry for my bad English.
EDIT: Thank you for the answers; they gave me inspiration for the solution. I think this might work as well as your solutions, but simplified:
awk -F: '$4 ~ "adm,|adm1$|:adm1," {print $1}' sup_groups.txt
Basically I'm using ORs covering all the cases and excluding "adm12".
Let me know if you think this is correct.

1st solution: using awk's split function. With your shown samples, please try the following awk code.
awk -F':' '
{
num=split($4,arr,",")
for(i=1;i<=num;i++){
if(arr[i]=="adm1"){
print
}
}
}
' Input_file
Explanation: a detailed, commented version of the above.
awk -F':' ' ##Starting awk program from here setting field separator as : here.
{
num=split($4,arr,",") ##Using split to split 4th field into array arr with delimiter of ,
for(i=1;i<=num;i++){ ##Running for loop till value of num(total elements of array arr).
if(arr[i]=="adm1"){ ##Checking condition if arr[i] value is equal to adm1 then do following.
print ##printing current line here.
}
}
}
' Input_file ##Mentioning Input_file name here.
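For reference, here is a self-contained run of the split-based approach against the question's sample data; it prints $1 (the group name) rather than the whole record, as the question asks:

```shell
# Recreate the sample file from the question.
cat > sup_groups.txt <<'EOF'
adm:x:4:syslog,adm1
admins:x:1006:adm2,adm12,manuel
ssl-cert:x:122:postgres
ala2:x:1009:aceto,salvemini
conda:x:1011:giovannelli,galise,aceto,caputo,haymele,salvemini,scala,adm2,adm12
adm1Group:x:1022:adm2,adm1,adm3
docker:x:998:manuel
EOF

# Split the user list on "," and test each element for string equality,
# so "adm12" can never match "adm1".
awk -F':' '{
  n = split($4, arr, ",")
  for (i = 1; i <= n; i++)
    if (arr[i] == "adm1") print $1
}' sup_groups.txt
# prints:
# adm
# adm1Group
```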
2nd solution: Using regex and conditions in awk.
awk -F':' '$4~/^adm1,/ || $4~/,adm1,/ || $4~/,adm1$/' Input_file
Or, to also cover the case where the 4th field contains no comma at all (a single user), try the following:
awk -F':' '$4~/^adm1,/ || $4~/,adm1,/ || $4~/,adm1$/ || $4=="adm1"' Input_file
Explanation: set the field separator to : and print the line if the 4th field matches ^adm1, (starts with adm1,), contains ,adm1,, or matches ,adm1$ (ends with ,adm1).
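A quick sanity check of the anchored-regex approach, run against a trimmed sample containing just the tricky lines (and printing $1, as the question asks):

```shell
# Trimmed sample: the two lines that should match, plus an "adm12" decoy.
printf '%s\n' 'adm:x:4:syslog,adm1' \
              'admins:x:1006:adm2,adm12,manuel' \
              'adm1Group:x:1022:adm2,adm1,adm3' > sup_groups.txt

# adm1 must sit at the start, between commas, at the end, or alone.
awk -F':' '$4~/^adm1,/ || $4~/,adm1,/ || $4~/,adm1$/ || $4=="adm1" {print $1}' sup_groups.txt
# prints:
# adm
# adm1Group
```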

This should do the trick:
$ awk -F: '"," $4 "," ~ ",adm1," { print $1 }' file
The idea behind this is to wrap the group field in commas so that each entry in the list is delimited by a comma on both sides. So instead of searching for adm1 you search for ,adm1,.
So if your list looks like:
adm2,adm12,manuel
and, by adding commas, you convert it to:
,adm2,adm12,manuel,
you can always search for ,adm1, and get an exact match.
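A runnable sketch of the comma-wrapping trick on the question's data:

```shell
cat > sup_groups.txt <<'EOF'
adm:x:4:syslog,adm1
admins:x:1006:adm2,adm12,manuel
ssl-cert:x:122:postgres
ala2:x:1009:aceto,salvemini
conda:x:1011:giovannelli,galise,aceto,caputo,haymele,salvemini,scala,adm2,adm12
adm1Group:x:1022:adm2,adm1,adm3
docker:x:998:manuel
EOF

# "," $4 "," concatenates a comma on each side of the field before matching,
# so every user -- including the first and last -- is comma-delimited.
awk -F: '"," $4 "," ~ ",adm1," { print $1 }' sup_groups.txt
# prints:
# adm
# adm1Group
```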

Once you set up FS per the task requirements, the main body becomes just:
NF = !_ < NF
or, even more straightforward:
{m,n,g}awk -- --NF
=
{m,g}awk 'NF=!_<NF' OFS= FS=':[^:]*:[^:]*:[^:]*[^[:alpha:]]?adm[0-9]+.*$'
adm
admins
conda
adm1Group

Related

right pad regex with spaces using sed or awk

I have a file with two fields separated by :. Both fields are of varying length, and the second field can contain all sorts of characters (user input). I want the first field right-padded with spaces to a fixed length of 15 characters. For the first field I have a working regex: #.[A-Z0-9]{4,12}.
sample:
#ABC123:"wild things here"
#7X3Z:"":":#":";:*:-user input:""
#99999X999:"also, imagine: unicode, yay!"
desired output:
#ABC123 :"wild things here"
#7X3Z :"":":#":";:*:-user input:""
#99999X999 :"also, imagine: unicode, yay!"
There are plenty of examples of how to zero-pad a number, but surprisingly little about padding a field or a regex match in general. Any help using (preferably) sed or awk?
Here is another awk solution that would work with any version of awk:
awk 'BEGIN {FS=OFS=":"} {$1 = sprintf("%-15s", $1)} 1' file
#ABC123 :"wild things here"
#7X3Z :"":":#":";:*:-user input:""
#99999X999 :"also, imagine: unicode, yay!"
With perl:
$ perl -pe 's/^[^:]+/sprintf("%-15s",$&)/e' ip.txt
#ABC123 :"wild things here"
#7X3Z :"":":#":";:*:-user input:""
#99999X999 :"also, imagine: unicode, yay!"
The e flag allows you to use Perl code in the replacement section. $& holds the matched portion, which gets formatted by sprintf.
With awk:
# should work with any awk
awk 'match($0, /^[^:]+/){printf "%-15s%s\n", substr($0,1,RLENGTH), substr($0,RLENGTH+1)}'
# can be simplified with GNU awk
awk 'match($0, /^[^:]+/, m){printf "%-15s%s\n", m[0], substr($0,RLENGTH+1)}'
# or
awk 'match($0, /^([^:]+)(.+)/, m){printf "%-15s%s\n", m[1], m[2]}'
substr($0,1,RLENGTH) or m[0] gives the contents of the first field. I used 1 instead of the usual RSTART here since we are matching at the start of the line.
substr($0,RLENGTH+1) gives the rest of the line (i.e. everything from the first :).
See awk manual: String-Manipulation for details about match function.
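A self-contained run of the portable match()/substr() variant, with the sample input recreated inline:

```shell
cat > ip.txt <<'EOF'
#ABC123:"wild things here"
#7X3Z:"":":#":";:*:-user input:""
#99999X999:"also, imagine: unicode, yay!"
EOF

# Pad everything before the first ":" to 15 columns; keep the rest untouched.
awk 'match($0, /^[^:]+/){printf "%-15s%s\n", substr($0,1,RLENGTH), substr($0,RLENGTH+1)}' ip.txt
```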
Adding one more way of padding the 1st column here; anubhava's answer with sprintf is the better one, but this is an option. Here I have created a variable named spaces, which defines the target width to pad to.
awk -v spaces="15" 'BEGIN{FS=OFS=":"} {sub(/:/,sprintf("%"spaces-length($1)"s",":"))} 1' Input_file
Explanation: a detailed, commented version of the above.
awk -v spaces="15" ' ##Starting awk program from here, setting spaces to 15 here.
BEGIN{ ##Starting BEGIN section of this program from here.
FS=OFS=":" ##Setting FS and OFS as colon here.
}
{
sub(/:/,sprintf("%"spaces-length($1)"s",":")) ##Replacing the first colon with left-padding spaces followed by the colon.
}
1 ##Printing current line here.
' Input_file ##Mentioning Input_file name here.
I believe anubhava's solution of
awk 'BEGIN {FS=OFS=":"} {$1 = sprintf("%-15s", $1)} 1' file
can be simplified even further:
awk -F: 'BEGIN{OFS=FS} $1=sprintf("%-15s",$1)'
The braces and the final 1 are optional, since the assignment itself acts as an always-true pattern; note that the BEGIN block must copy OFS from FS (not the reverse), because -F: has already set FS.
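As a sanity check, here is the shortened form run against the sample. The copy direction in the BEGIN block matters: -F: has already set FS, so it is OFS that must be set from FS for the rebuilt record to keep its colons:

```shell
cat > file <<'EOF'
#ABC123:"wild things here"
#7X3Z:"":":#":";:*:-user input:""
#99999X999:"also, imagine: unicode, yay!"
EOF

# The assignment to $1 rebuilds the record using OFS and also yields a
# non-empty string, which doubles as an always-true pattern (implicit print).
awk -F: 'BEGIN{OFS=FS} $1=sprintf("%-15s",$1)' file
```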

How to find a match to a partial string and then delete the string from the reference file using awk?

I have a problem that I have been trying to solve, but have not been able to figure out how to do it. I have a reference file that has all of the devices in my inventory by bar code.
Reference file:
PTR10001,PRINTER,SN A
PTR10002,PRINTER,SN B
PTR10003,PRINTER,SN C
MON10001,MONITOR,SN A
MON10002,MONITOR,SN B
MON10003,MONITOR,SN C
CPU10001,COMPUTER,SN A
CPU10002,COMPUTER,SN B
CPU10003,COMPUTER,SN C
What I would like to do is make a file where I only have to put the abbreviation of what I need on it.
File 2 would look like this:
PTR
CPU
MON
MON
The desired output of this would be a file that would tell me what items by barcode that I need to pull off the shelf.
Desired output file:
PTR10001
CPU10001
MON10001
MON10002
As seen in the output, since I cannot have 2 of the same barcode, I need it to look through the reference file and find the first match. After the number is copied to the output file, I would like to remove the number from the reference file so that it doesn't repeat the number.
I have tried several iterations of awk, but have not been able to get the desired output.
The closest that I have gotten is the following code:
awk -F'/' '{ key = substr($1,1,3) } NR==FNR {id[key]=$1; next} key in id { $1=id[key] } { print }' $file1 $file2 > $file3
I am writing this in ksh, and would like use awk as I think this would be the best answer to the problem.
Thanks for helping me with this.
First solution:
From your detailed description, I assume order doesn't matter, as you want to know what to pull off the shelf. So you could do the opposite, first read file2, count the items, and then go to the shelf and get them.
awk -F, 'FNR==NR{c[$0]++; next} c[substr($1,1,3)]-->0{print $1}' file2 file1
output:
PTR10001
MON10001
MON10002
CPU10001
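The c[...]-->0 condition packs the bookkeeping into one expression: c[key]-- > 0 tests the old count and only afterwards decrements it (post-decrement), so each requested prefix prints exactly as many times as it was asked for. A self-contained run:

```shell
cat > file1 <<'EOF'
PTR10001,PRINTER,SN A
PTR10002,PRINTER,SN B
PTR10003,PRINTER,SN C
MON10001,MONITOR,SN A
MON10002,MONITOR,SN B
MON10003,MONITOR,SN C
CPU10001,COMPUTER,SN A
CPU10002,COMPUTER,SN B
CPU10003,COMPUTER,SN C
EOF
cat > file2 <<'EOF'
PTR
CPU
MON
MON
EOF

# Pass 1 (file2): count how many of each prefix are requested.
# Pass 2 (file1): print a barcode while its prefix still has requests left.
awk -F, 'FNR==NR{c[$0]++; next} c[substr($1,1,3)]-->0{print $1}' file2 file1
# prints, in file1 (shelf) order:
# PTR10001
# MON10001
# MON10002
# CPU10001
```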
Second solution:
Your awk is very close to what you want, but you need a second dimension in your array, and you must not overwrite the existing ids. We will do it with a pseudo-2-d array (BTW, GNU awk has true multidimensional arrays) where we store the ids like PTR10001,PTR10002,PTR10003; we retrieve them with split and also remove them from the shelf.
> cat tst.awk
BEGIN { FS="," }
NR==FNR {
key=substr($1,1,3)
ids[key] = (ids[key]? ids[key] "," $1: $1) #append new id.
next
}
$0 in ids {
split(ids[$0], tmp, ",")
print(tmp[1])
ids[$0]=substr(ids[$0],length(tmp[1])+2) #remove from shelf
}
Output
awk -f tst.awk file1 file2
PTR10001
CPU10001
MON10001
MON10002
Here we keep the order of file2 as this is based on the idea you have tried.
You could also try the following, written and tested with your shown samples in GNU awk.
awk '
FNR==NR{
iniVal[$0]++
next
}
{
counter=substr($0,1,3)
}
counter in iniVal{
if(++currVal[counter]<=iniVal[counter]){
print $1
if(currVal[counter]==iniVal[counter]){ delete iniVal[$0] }
}
}
' Input_file2 FS="," Input_file1
Explanation: a detailed, commented version of the above.
awk ' ##Starting awk program from here.
FNR==NR{ ##Checking condition if FNR==NR which is true when Input_file2 is being read.
iniVal[$0]++ ##Creating array iniVal with index of current line with increment of 1 each time it comes here.
next ##next will skip all further statements from here.
}
{
counter=substr($0,1,3) ##Setting counter to the first 3 characters of the current Input_file1 line.
}
counter in iniVal{ ##Checking if counter is present in iniVal then do following.
if(++currVal[counter]<=iniVal[counter]){ ##Checking if the currVal array value for counter is less than or equal to iniVal's; if so, do the following.
print $1 ##Printing 1st field of current line here.
if(currVal[counter]==iniVal[counter]){ ##Checking if currVal value is equal to iniVal with index of counter here.
delete iniVal[$0] ##If above condition is TRUE then deleting iniVal here.
}
}
}
' Input_file2 FS="," Input_file1 ##Mentioning Input_file names here.

AIX/KSH Extract string from a comma-separated line

I want to extract the part "virtual_eth_adapters" from the following comma-separated line:
lpar_io_pool_ids=none,max_virtual_slots=300,"virtual_serial_adapters=0/server/1/any//any/1,1/server/1/any//any/1","virtual_scsi_adapters=166/client/1/ibm/166/0,266/client/2/ibm/266/0",virtual_eth_adapters=116/0/263,proc_mode=shared,min_proc_units=0.5,desired_proc_units=2.0,max_proc_units=8.0
I'm using AIX with ksh.
I found a workaround with awk and the -F flag to separate the string on a delimiter and then print the item by its index, but if the input string changes, the index may differ...
1st solution: In case you want to also print the string virtual_eth_adapters in the output, try the following.
awk '
match($0,/virtual_eth_adapters[^,]*/){
print substr($0,RSTART,RLENGTH)
}
' Input_file
Output will be as follows.
virtual_eth_adapters=116/0/263
2nd solution: In case you want to print only the value for the string virtual_eth_adapters, try the following.
awk '
match($0,/virtual_eth_adapters[^,]*/){
print substr($0,RSTART+21,RLENGTH-21)
}
' Input_file
Output will be as follows.
116/0/263
Explanation: a commented version of the code.
awk ' ##Starting awk program here.
match($0,/virtual_eth_adapters[^,]*/){ ##Using awk's match function to match from the string virtual_eth_adapters up to (but not including) the first comma.
print substr($0,RSTART,RLENGTH) ##Printing the sub-string starting at RSTART with length RLENGTH; both variables are set whenever match() finds the regex.
}
' Input_file ##Mentioning Input_file name here.
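A runnable sketch using a trimmed-down sample line (shortened from the question; only the fields needed for the match are kept):

```shell
printf '%s\n' 'lpar_io_pool_ids=none,max_virtual_slots=300,virtual_eth_adapters=116/0/263,proc_mode=shared' > Input_file

# Whole match, key name included:
awk 'match($0,/virtual_eth_adapters[^,]*/){print substr($0,RSTART,RLENGTH)}' Input_file
# -> virtual_eth_adapters=116/0/263

# Value only: skip the 21 characters of "virtual_eth_adapters=":
awk 'match($0,/virtual_eth_adapters[^,]*/){print substr($0,RSTART+21,RLENGTH-21)}' Input_file
# -> 116/0/263
```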
I use this approach to get data out of the middle of lines.
awk -F'virtual_eth_adapters=' 'NF>1{split($2,a,",");print a[1]}' file
116/0/263
It's short and easy to learn (no field counting or regex math needed).
-F'virtual_eth_adapters=' splits the line on virtual_eth_adapters=
NF>1 is true only if there is more than one field (i.e. the line contains virtual_eth_adapters=)
split($2,a,",") splits the rest of the line into array a on ,
print a[1] prints the first element of array a
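The same trimmed sample line (shortened from the question) run through the field-separator variant:

```shell
printf '%s\n' 'lpar_io_pool_ids=none,max_virtual_slots=300,virtual_eth_adapters=116/0/263,proc_mode=shared' > file

# Everything after "virtual_eth_adapters=" lands in $2; cut it at the next comma.
awk -F'virtual_eth_adapters=' 'NF>1{split($2,a,",");print a[1]}' file
# -> 116/0/263
```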
And one more solution (assuming the position of the string is fixed):
awk -F\, '{print $7}'
If you need only the value, try this:
awk -F\, '{print $7}'|awk -F\= '{print $2}'
It is also possible to get the value this way:
awk -F\, '{split($7,a,"=");print a[2]}'

awk to parse field if specific value is in another

In the awk below I am trying to parse $2 on the _ only if $3 is a specific value (ID). I am reading that parsed value into an array and going to use it as a key in a lookup. The awk does execute, but the entire line with ID in $3 (line 2) prints, not just the desired portion. The print statement is only there to see the result (testing only) and will not be part of the script. Thank you :).
awk
awk -F'\t' '$3=="ID"
f="$(echo $2|cut -d_ -f1,1)"
{
print $f
}' file
file (tab-delimited):
R_Index locus type
17 chr20:31022959 NON
18 chr11:118353210-chr9:20354877_KMT2A-MLLT3.K8M9 ID
desired
$f = chr11:118353210-chr9:20354877
Your requirement is not completely clear, but you could try the following:
awk '{split($2,array,"_");if(array[2]=="KMT2A-MLLT3.K8M9"){print array[1]}}' Input_file
Or, if you want to change the 2nd field's value while printing all lines, try the following:
awk '{split($2,array,"_");if(array[2]=="KMT2A-MLLT3.K8M9"){$2=array[1]}} 1' Input_file
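Since the question asks to parse $2 only when $3 is "ID", here is a self-contained sketch combining that condition with the split idea (the tab-separated sample is recreated inline; no hard-coded suffix check is needed):

```shell
# Recreate the tab-delimited sample file.
printf '%s\t%s\t%s\n' \
  'R_Index' 'locus' 'type' \
  '17' 'chr20:31022959' 'NON' \
  '18' 'chr11:118353210-chr9:20354877_KMT2A-MLLT3.K8M9' 'ID' > file

# On rows whose 3rd field is "ID", split $2 on "_" and keep the part
# before the underscore.
awk -F'\t' '$3=="ID"{split($2,a,"_"); print a[1]}' file
# -> chr11:118353210-chr9:20354877
```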

How to print fields for repeated key column in one line

I'd like to transform a table in such a way that rows with duplicated
values in column #2 are collapsed into one row collecting the corresponding values from column #1.
I.e. something like this:
MZ00024296 AC148152.3_FG005
MZ00047079 AC148152.3_FG006
MZ00028122 AC148152.3_FG008
MZ00032922 AC148152.3_FG008
MZ00048218 AC148152.3_FG008
MZ00024680 AC148167.6_FG001
MZ00013456 AC149475.2_FG003
to
AC148152.3_FG005 MZ00024296
AC148152.3_FG006 MZ00047079
AC148152.3_FG008 MZ00028122|MZ00032922|MZ00048218
AC148167.6_FG001 MZ00024680
AC149475.2_FG003 MZ00013456
As I need it for computations in R, I tried to use:
x=aggregate(mz_grmz,by=list(mz_grmz[,2]),FUN=paste(mz_grmz[,1],sep="|"))
but it doesn't work (wrong function):
Error in match.fun(FUN) :
'paste(mz_grmz[, 1], sep = "|")' is not a function, character or symbol
I also remembered the unstack() function, but it isn't what I need.
I tried to do it using awk; based on my basic knowledge I reworked the code given here:
site1
#! /bin/sh
for y do
awk -v FS="\t" '{
for (x=1;x<=NR;x++) {
if (NR>2 && x=x+1) {
print $2"\t"x
}
else {print NR}
}
}' $y > $y.2
done
unfortunately it doesn't work; it only produces an enormous file with field #2 and some numbers.
I suppose it is an easy task, but it is above my skills right now.
Could somebody give me a hint? Maybe just a function to use in aggregate in R.
Thanks
You could do it in awk like this:
awk '
{
if ($2 in a)
a[$2] = a[$2] "|" $1
else
a[$2] = $1
}
END {
for (i in a)
print i, a[i]
}' INFILE > OUTFILE
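A self-contained run; note that for (i in a) yields the groups in unspecified order, so the output is piped through sort here for a stable order:

```shell
cat > INFILE <<'EOF'
MZ00024296 AC148152.3_FG005
MZ00047079 AC148152.3_FG006
MZ00028122 AC148152.3_FG008
MZ00032922 AC148152.3_FG008
MZ00048218 AC148152.3_FG008
MZ00024680 AC148167.6_FG001
MZ00013456 AC149475.2_FG003
EOF

# Append each $1 to the entry for its $2, "|"-separated, then dump the map.
awk '{ if ($2 in a) a[$2] = a[$2] "|" $1; else a[$2] = $1 }
     END { for (i in a) print i, a[i] }' INFILE | sort
# ->
# AC148152.3_FG005 MZ00024296
# AC148152.3_FG006 MZ00047079
# AC148152.3_FG008 MZ00028122|MZ00032922|MZ00048218
# AC148167.6_FG001 MZ00024680
# AC149475.2_FG003 MZ00013456
```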
To keep the output the same as the text in your question (empty lines etc.):
awk '{if($0 &&($2 in a))a[$2]=a[$2]"|"$1;else if ($0) a[$2]=$1;}\
END{for(x in a){print x,a[x];print ""}}' inputFile
test:
kent$ echo "MZ00024296 AC148152.3_FG005
MZ00047079 AC148152.3_FG006
MZ00028122 AC148152.3_FG008
MZ00032922 AC148152.3_FG008
MZ00048218 AC148152.3_FG008
MZ00024680 AC148167.6_FG001
MZ00013456 AC149475.2_FG003"|awk '{if($0 &&($2 in a))a[$2]=a[$2]"|"$1;else if ($0) a[$2]=$1;}END{for(x in a){print x,a[x];print ""}}'
AC149475.2_FG003 MZ00013456
AC148152.3_FG005 MZ00024296
AC148152.3_FG006 MZ00047079
AC148152.3_FG008 MZ00028122|MZ00032922|MZ00048218
AC148167.6_FG001 MZ00024680
This GNU sed solution might work for you:
sed -r '1{h;d};H;${x;s/(\S+)\s+(\S+)/\2\t\1/g;:a;s/(\S+\t)([^\n]*)(\n+)\1([^\n]*)\n*/\1\2|\4\3/;ta;p};d' input_file
Explanation: Use the extended regex option-r to make regex's more readable. Read the whole file into the hold space (HS). Then on end-of-file, switch to the HS and firstly swap and tab separate fields. Then compare the first fields in adjacent lines and if they match, tag the second field from the second record to the first line separated by a |. Repeated until no further adjacent lines have duplicate first fields then print the file out.