read file and extract variables based on what is in the line - awk

I have a file that looks like this:
$ cat file_test
garbage text A=one B=two C=three D=four
garbage text A= B=six D=seven
garbage text A=eight E=nine D=ten B=eleven
I want to go through each line and extract specific "variables" to use in a loop. If a line doesn't have a given variable, it should be set to an empty string.
So, for the above example, let's say I want to extract the variables A, B, and C; then for each line, the loop would have this:
garbage text A=one B=two C=three D=four
A = "one"
B = "two"
C = "three"
garbage text A= B=six D=seven
A = ""
B = "six"
C = ""
garbage text A=eight E=nine D=ten B=eleven
A = "eight"
B = "eleven"
C = ""
My original plan was to use sed but that won't work since the order of the "variables" is not consistent (the last line for example) and a "variable" may be missing (the second line for example).
My next thought is to go through line by line, then split the line into fields using awk and set variables based on each field but I have no clue where or how to start.
I'm open to other ideas or better suggestions.

The right answer depends on what you're going to do with the variables.
Assuming you need them as shell variables, here is a different approach:
$ while IFS= read -r line;
do A=""; B=""; C="";
source <(echo "$line" | grep -oP "(A|B|C)=\w*" );
echo "A=$A B=$B C=$C";
done < file
A=one B=two C=three
A= B=six C=
A=eight B=eleven C=
The trick is using source on the variable assignments that grep extracts from each line. Since assigned values carry over between iterations, you need to reset them before each new line.
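As a variation on the same idea, here is a minimal sketch that avoids sourcing text taken from the input: awk emits the three values joined by "|", and the shell loop splits them with read. The function name extract_abc and the "|" delimiter are assumptions made for illustration, and the sketch assumes the values themselves never contain "|".

```shell
# Hypothetical helper: print the values of A, B and C for each input line,
# joined by "|" (a missing or empty variable yields an empty slot).
extract_abc() {
  awk '{
    split("", val)                       # forget values from the previous line
    for (i = 1; i <= NF; i++) {
      eq = index($i, "=")                # position of the first "=" in the field
      if (eq > 1) val[substr($i, 1, eq - 1)] = substr($i, eq + 1)
    }
    print val["A"] "|" val["B"] "|" val["C"]
  }' "$@"
}

# Usage: read the three values into shell variables for each line.
printf '%s\n' 'garbage text A=one B=two C=three D=four' \
              'garbage text A= B=six D=seven' |
  extract_abc |
  while IFS='|' read -r A B C; do
    printf 'A="%s" B="%s" C="%s"\n' "$A" "$B" "$C"
  done
```

Because the while loop is on the right-hand side of a pipe here, the variables are only visible inside the loop body; in bash, feeding the loop with done < <(extract_abc file) keeps them in the current shell.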

If perl is your option, please try:
perl -ne 'undef %a; while (/([\w]+)=([\w]*)/g) {$a{$1}=$2;}
for ("A", "B", "C") {print "$_=\"$a{$_}\"\n";}' file_test
It parses each line for assignments with =, stores each key-value pair in an associative array %a, then finally reports the values for A, B and C.

I'm partial to the awk solution, e.g.
$ awk '{for (i = 1; i <= NF; i++) if ($i ~ /^[A-Za-z_][^=]*[=]/) print $i}' file
for (i = 1; i <= NF; i++) loops over each whitespace-separated field;
if ($i ~ /^[A-Za-z_][^=]*[=]/) tests whether the field begins with a letter or underscore, followed by any number of non-'=' characters and then an '='; if so,
print $i prints the field.

In my first 3 solutions, I am assuming that you need shell variables holding the values of A, B and C rather than simply printing them; if that is the case, the following may help you.
1st Solution: It considers that your variables A,B,C are always coming in same field number.
while read first second third fourth fifth sixth
do
    echo $third,$fourth,$fifth ##Printing values here.
    a_var=${third#*=}    ##Strip everything up to the first "=".
    b_var=${fourth#*=}
    c_var=${fifth#*=}
    echo "Using new values of variables here...."
    echo "NEW A="$a_var
    echo "NEW B="$b_var
    echo "NEW C="$c_var
done < "Input_file"
It simply prints the variable values for each line, since you have NOT said what you are going to do with them; you could use them as your use case requires.
2nd solution: This assumes the variables come in the same order, but it checks whether A really is in the 3rd field, B in the 4th, and so on, and sets the values accordingly.
while read first second third fourth fifth sixth
do
echo $third,$fourth,$fifth ##Printing values here.
a_var=$(echo "$third" | awk '$0 ~ /^A/{sub(/.*=/,"");print}')
b_var=$(echo "$fourth" | awk '$0 ~ /^B/{sub(/.*=/,"");print}')
c_var=$(echo "$fifth" | awk '$0 ~ /^C/{sub(/.*=/,"");print}')
echo "Using new values of variables here...."
echo "NEW A="$a_var
echo "NEW B="$b_var
echo "NEW C="$c_var
done < "Input_file"
3rd Solution: This looks like the best fit for your requirement, though I am not sure how efficient it is (I am still checking whether something better is possible). This code does NOT depend on the order of A, B and C in the line; it matches them wherever they appear. If a match is found, the variable gets that value; otherwise it is empty.
while read line
do
    a_var=$(echo "$line" | awk 'match($0,/A=[^ ]*/){val=substr($0,RSTART,RLENGTH);sub(/.*=/,"",val);print val}')
    b_var=$(echo "$line" | awk 'match($0,/B=[^ ]*/){val=substr($0,RSTART,RLENGTH);sub(/.*=/,"",val);print val}')
    c_var=$(echo "$line" | awk 'match($0,/C=[^ ]*/){val=substr($0,RSTART,RLENGTH);sub(/.*=/,"",val);print val}')
    echo "Using new values of variables here...."
    echo "NEW A="$a_var
    echo "NEW B="$b_var
    echo "NEW C="$c_var
done < "Input_file"
Output will be as follows.
Using new values of variables here....
NEW A=one
NEW B=two
NEW C=three
Using new values of variables here....
NEW A=
NEW B=six
NEW C=
Using new values of variables here....
NEW A=eight
NEW B=eleven
NEW C=
EDIT1: In case you simply want to print values of A,B,C then try following.
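awk '{
  delete a                                # reset the values for each line
  for (i = 1; i <= NF; i++)               # scan every field
    if ($i ~ /^[ABC]=/)                   # keep only the A=, B= and C= assignments
      a[substr($i, 1, 1)] = substr($i, 3) # store the value under its name
  print "A=" a["A"] ORS "B=" a["B"] ORS "C=" a["C"]
}' Input_file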

Another Perl
perl -lne ' %x = /(\S+)=(\S+)/g ; for("A","B","C") { print "$_ = $x{$_}" } %x=() '
with the input file
$ perl -lne ' %x = /(\S+)=(\S+)/g ; for("A","B","C") { print "$_ = $x{$_}" } %x=() ' file_test
A = one
B = two
C = three
A =
B = six
C =
A = eight
B = eleven
C =

A generic, self-documented awk version.
It assumes the separator between name and value is = and that = appears neither in the text before the variables nor in the variable content itself.
awk 'BEGIN {
   # load the list of variables and the order to print them in
   VarSize = split( "A B C", aIdx )
   # create a pattern filter to catch the variables in each line
   for ( Idx = 1; Idx <= VarSize; Idx++ ) VarEntry = ( VarEntry ? ( VarEntry "|^" ) : "^" ) aIdx[Idx] "="
   }
   {
   # reset variable values
   split( "", aVar )
   # for each part of the line
   for ( Fld=1; Fld<=NF; Fld++ ) {
      # if the part is a variable assignment
      if( $Fld ~ VarEntry ) {
         # separate variable name and content into an array
         split( $Fld, aTemp, /=/ )
         # put the variable content in the container for that variable name
         aVar[aTemp[1]] = aTemp[2]
         }
      }
   # print all variable content (empty or not) found on this line, in list order
   for ( Idx = 1; Idx <= VarSize; Idx++ ) printf( "%s = \042%s\042\n", aIdx[Idx], aVar[aIdx[Idx]] )
   }
   ' YourFile

It's unclear whether you're trying to set awk variables or shell variables, but here's how to populate an associative awk array and then use that to populate an associative shell array:
$ cat tst.awk
BEGIN { numKeys = split("A B C",keys) }
{
    delete f
    for (i=1; i<=NF; i++) {
        if ( split($i,t,/=/) == 2 ) {
            f[t[1]] = t[2]
        }
    }
    for (keyNr=1; keyNr<=numKeys; keyNr++) {
        key = keys[keyNr]
        printf "[%s]=\"%s\"%s", key, f[key], (keyNr<numKeys ? OFS : ORS)
    }
}
$ awk -f tst.awk file
[A]="one" [B]="two" [C]="three"
[A]="" [B]="six" [C]=""
[A]="eight" [B]="eleven" [C]=""
$ while IFS= read -r out; do declare -A arr="( $out )"; declare -p arr; done < <(awk -f tst.awk file)
declare -A arr=([A]="one" [B]="two" [C]="three" )
declare -A arr=([A]="" [B]="six" [C]="" )
declare -A arr=([A]="eight" [B]="eleven" [C]="" )
$ echo "${arr["A"]}"


extract info from a tag using awk

I have a multi-column file and I want to extract some info from column 71.
I want to extract by tag, where the value can be anything; for example, I want to extract just AC=* and AF=*, whatever their values are.
I found a similar question, "Extract columns with values matching a specific pattern", and gave it a try, but it didn't work.
Column 71 looks like this:
The code that i tried:
awk '{
for (i = 1; i <= NF; i++) {
if ($i ~ /AC|AF/) {
printf "%s %s ", $i, $(i + 1)
print ""
I keep getting syntax error.
output wanted :
Whenever you have name=value pairs, it's usually simplest to first create an array that maps names to values (n2v[] below) and then you can just access the values by their names.
$ cat file
AC=1;AC_AFR=2;AF=3 AC=4;AC_AFR=5;AF=6
$ cat tst.awk
{
    delete n2v
    split($2,tmp,/[;=]/)
    for (i=1; i in tmp; i+=2)
        n2v[tmp[i]] = tmp[i+1]
    prt("AC"); prt("AF")
}
function prt(name) { print name, "=", n2v[name] }
$ awk -f tst.awk file
AC = 4
AF = 6
Just change $2 to $71 for your real input.
Something like this should do it (the second version below needs GNU awk because of switch):
$ awk '{split($71,a,";");for(i in a )if(a[i]~/^AF/) print a[i]}' foo
You split the field $71 on ;s, then loop through the resulting array looking for the desired match. For multiple matches use switch:
$ awk '{
    split($71,a,";");
    b="";
    for(i in a )
        switch(a[i]) {
        case /^AF=/:
            b=b a[i] OFS;
            break;
        case /^AC=/:
            b=b a[i] OFS;
        }
    print b
}' foo
AC=14511 AF=0.137
EDIT: Now it buffers output to a variable and prints it in the end. You can control the separator with OFS.
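For awks without switch, the same name-to-value idea can be sketched portably. This is only an illustration: the function name pick_tags and the sample input line are assumptions, and the column number is passed as an argument rather than hard-coded as 71.

```shell
# Hypothetical helper: print AC and AF from the ";"-separated tags in one column.
# Usage: pick_tags COLUMN < input
pick_tags() {
  awk -v col="$1" '{
    split("", n2v)                          # reset the name-to-value map
    n = split($col, pair, ";")              # one array element per name=value tag
    for (i = 1; i <= n; i++) {
      eq = index(pair[i], "=")
      if (eq > 1) n2v[substr(pair[i], 1, eq - 1)] = substr(pair[i], eq + 1)
    }
    print "AC=" n2v["AC"], "AF=" n2v["AF"]  # fixed order, separated by OFS
  }'
}

# Made-up two-column sample; for the real file the column would be 71.
echo 'chr1 AC=14511;AC_AFR=2;AF=0.137' | pick_tags 2
```

Looking the tags up by exact name means AC_AFR is never mistaken for AC, and a missing tag simply prints an empty value.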

awk | Rearrange fields of CSV file on the basis of column value

I need your help writing an awk script for the problem below. I have a source file and the required output for it.
Source File
Output File
The fields are not organised in the source file.
In the output file, the fields are organised in a specific order; for example, all a values are in the 2nd column, followed by b and then c.
The value c occurs n times in the second line, so in the output its values are merged with the pipe symbol (|).
Please help.
Will work in any modern awk:
$ cat file
$ cat tst.awk
BEGIN{ FS="[,:]"; split("session,a,b,c",order) }
{
    split("",val) # or delete(val) in gawk
    for (i=1;i<NF;i+=2) {
        val[$i] = (val[$i]=="" ? "" : val[$i] "|") $(i+1)
    }
    for (i=1;i in order;i++) {
        name = order[i]
        printf "%s%s", (i==1 ? name ":" : "," name "="), val[name]
    }
    print ""
}
$ awk -f tst.awk file
If you actually want the e values printed, unlike your posted desired output, just add ,e to the string in the split() in the BEGIN section wherever you'd like those values to appear in the ordered output.
Note that when b was missing from the input on line 2 above, it output a null value as you said you wanted.
Try with:
awk '
    BEGIN {
        FS = "[,:]"
        OFS = ","
    }
    {
        for ( i = 1; i <= NF; i+= 2 ) {
            if ( $i == "session" ) { printf "%s:%s", $i, $(i+1); continue }
            hash[$i] = hash[$i] (hash[$i] ? "|" : "") $(i+1)
        }
        asorti( hash, hash_orig )
        for ( i = 1; i <= length(hash); i++ ) {
            printf ",%s:%s", hash_orig[i], hash[ hash_orig[i] ]
        }
        printf "\n"
        delete hash
        delete hash_orig
    }
' infile
It splits each line on any comma or colon and traverses the odd-numbered fields, saving each name and its values in a hash that is printed at the end (asorti requires GNU awk). It yields:

Why does awk "not in" array work just like awk "in" array?

Here's an awk script that attempts to compute the set difference of two files based on their first column:
BEGIN {
    file = ARGV[1]
    while (getline < file)
        Contained[$1] = $1
    delete ARGV[1]
}
$1 not in Contained{
    print $0
}
Here is TestFileA:
Here is TestFileB:
However, when I run the following command:
gawk -f Diff.awk TestFileA TestFileB
I get the output just as if the script had contained "in":
While I am uncertain about whether "not in" is correct syntax for my intent, I'm very curious about why it behaves exactly the same way as when I wrote "in".
I cannot find any doc about element not in array.
Try !(element in array).
I guess: awk sees not as an uninitialized variable, so not is evaluated as an empty string.
$1 not == $1 "" == $1
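That guess can be checked with a small experiment. In this sketch the function names and the two-line input are made up for illustration; the first pattern uses the mistaken "not in", the second uses the negated membership test.

```shell
# "not" is an unset variable, so ($1 not) is $1 concatenated with "",
# and the pattern behaves exactly like "$1 in seen".
not_in_typo() { awk 'BEGIN { seen["x"] } $1 not in seen'; }

# Negating the parenthesised membership test gives the intended meaning.
not_in_fixed() { awk 'BEGIN { seen["x"] } !($1 in seen)'; }

printf 'x\ny\n' | not_in_typo    # prints x, the line that IS in the array
printf 'x\ny\n' | not_in_fixed   # prints y, the line that is NOT in the array
```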
I figured this one out. The ( x in array ) returns a value, so to do "not in array", you have to do this:
if ( x in array == 0 )
print "x is not in the array"
or in your example:
($1 in Contained == 0){
    print $0
}
In my solution for this problem I use the following if-else statement:
if($1 in contained);else{print "Here goes your code for \"not in\""}
Not sure if this is anything like you were trying to do.
#!/bin/awk -f
# reads the first arg file in BEGIN and makes a hash of the tokens
# found in column one. The main rule then prints column one of any
# input line whose token does not match the tokens already defined
BEGIN {
    file = ARGV[1]
    while ((getline < file) > 0)
        Contained[$1] = $1
    # delete ARGV[1] # I don't know what you were thinking here
    # for(i in Contained) {print Contained[i]} # debugging, not just for sadists
    close (ARGV[1])
}
{
    if ($1 in Contained){} else { print $1 }
}
On the awk command line I use:
! ($1 in a)
where $1 is the value being tested and a is the array:
awk 'NR==FNR{a[$1];next}! ($1 in a) {print $1}' file1 file2
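To spell out how that one-liner works: NR==FNR is true only while the first file is being read, so the first rule records column one and skips to the next line, and the second rule then filters the remaining input. A minimal sketch follows, wrapped in a hypothetical set_diff function and printing whole lines rather than only $1; it assumes the first file is not empty, since otherwise NR==FNR would also hold for the second file.

```shell
# Hypothetical helper: print the lines of FILE2 whose first column
# does not occur in the first column of FILE1.
# Usage: set_diff FILE1 FILE2
set_diff() {
  awk 'NR == FNR { seen[$1]; next }   # first file: remember column one
       !($1 in seen)                  # second file: print lines not seen
  ' "$1" "$2"
}
```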

How to print specific duplicate line based on fields number

I need to print only one line out of each run of consecutive lines with the same first field, and it must be the one with the most elements in its last field. That is, the last field is a set of words, and I need to print the line whose last field contains the most elements. If several lines tie for the maximum, any of them is OK.
Example input:
Example output:
solution with awk would be nice, but no need of one liner.
generate index file
$ cat input.txt |
sed 's/,\[/|[/g' |
awk -F'|' '
{if(!gensub(/[[\])]/, "", "g", $NF))n=0;else n=split($NF, a, /,/); print NR,$1,n}
' |
sort -k2,2 -k3,3nr |
awk '$2!=x{x=$2;print $1}' >idx.txt
content of index file
$ cat idx.txt
select lines
$ awk 'NR==FNR{idx[$0]; next}; (FNR in idx)' idx.txt input.txt
Note: no space in input.txt
Use [ as the field delimiter, then split the last field on ,:
awk -F '[[]' '
{split($NF, f, /,/)}
length(f) > max[$1] {line[$1] = $0; max[$1] = length(f)}
END {for (l in line) print line[l]}
' filename
Since order is important, an update:
awk -F '[[]' '
{split($NF, f, /,/)}
length(f) > max[$1] {line[$1] = $0; max[$1] = length(f); nr[$1] = NR}
END {for (l in line) printf("%d\t%s\n", nr[l], line[l])}
' filename |
sort -n |
cut -f 2-
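The ordering can also be kept inside awk by remembering the order in which keys are first seen, which avoids the sort and cut steps. This is a sketch under assumptions: the example input is not shown here, so the sample lines below (a key, a comma, then a bracketed comma-separated list) are invented, as is the function name longest_per_key. Like the answer above, it groups by key across the whole file, not only across consecutive lines.

```shell
# Hypothetical helper: for each key (the text before "["), keep the line whose
# bracketed list has the most elements, and print keys in first-seen order.
longest_per_key() {
  awk -F '[[]' '
    {
      list = $NF
      sub(/\].*$/, "", list)                          # drop the closing bracket
      n = (list == "") ? 0 : split(list, parts, ",")  # count the list elements
      if (!($1 in best)) { order[++keys] = $1; best[$1] = -1 }
      if (n > best[$1]) { best[$1] = n; line[$1] = $0 }
    }
    END { for (i = 1; i <= keys; i++) print line[order[i]] }
  ' "$@"
}

# Invented sample input in the assumed format.
printf '%s\n' 'k1,[a,b]' 'k1,[a,b,c]' 'k2,[x]' | longest_per_key
```

On a tie the first line with the maximum count wins, which satisfies the "any of the max is ok" requirement.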
Something like this might work:
awk 'BEGIN {FS="["}
Ff != gensub("^([^,]+).*","\\1","g",$0) { Ff = gensub("^([^,]+).*","\\1","g",$0) ; Lf = $NF ; if (length(Ml) > 0) { print Ml } }
Ff == gensub("^([^,]+).*","\\1","g",$0) { if (length($NF) > length(Lf)) { Lf=$NF ; Ml=$0 } }
END {if (length(Ml) > 0) { print Ml } }' INPUTFILE
BUT it's not the solution you want to use, as this is rather a hack. It also fails if by "more elements" you mean that the last field contains more comma-separated elements, because it compares string lengths rather than element counts. (E.g. the above script happily reports [KABLAMMMMMMMMMMM!] as longer than [A,B,C].)
This might work for you:
sort -r file | sort -t, -k1,1 -u

awk Joining n fields with delimiter

How can I use awk to join various fields, given that I don't know how many of them I have? For example, given the input string
I use -F'/' as the delimiter, do some manipulation on aaa, bbb, ccc, ddd, eee (altering, removing...) and I want to join it back to print something like
... given that I don't know how many of them I have?
Ah, but you do know how many you have. Or you will soon, if you keep reading :-)
Before giving you a record to process, awk will set the NF variable to the number of fields in that record, and you can use for loops to process them (comments aren't part of the script, I've just put them there to explain):
$ echo pax/is/a/love/god | awk -F/ '{
    gsub (/god/,"dog",$5); # pax,is,a,love,dog
    $4 = ""; # pax,is,a,,dog
    $6 = $5; # pax,is,a,,dog,dog
    $5 = "rabid"; # pax,is,a,,rabid,dog
    printf "%s", $1; # output "pax"
    for (i = 2; i <= NF; i++) { # output ".<field>"
        if ($i != "") { # but only for non-blank fields (skip $4)
            printf ".%s", $i;
        }
    }
    printf "\n"; # finish line
}'
This shows manipulation of the values, as well as insertion and deletion.
The following will show you how to process each field and do some example manipulations on them.
The only caveat of using the output field separator OFS is that "deleted" fields will still have delimiters as shown in the output below; however it makes the code much simpler if you can live with that.
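awk '
BEGIN { FS = OFS = "/" }
{
    for (i = 1; i <= NF; i++) {
        if ($i == "aaa") $i = toupper($i)     # example manipulation: alter a field
        else if ($i ~ /c/) $i = ""            # example manipulation: "delete" a field
        else if ($i ~ /^eee$/) $i = $i "!"    # example manipulation: append to a field
    }
}1' <<<'aaa/bbb/ccc/ddd/eee'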
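If the leftover delimiters from "deleted" fields are a problem, the record can be rebuilt by hand, joining only the non-empty fields. A minimal sketch, assuming an empty field always means "deleted"; the function name join_nonempty is made up for illustration.

```shell
# Hypothetical helper: re-join "/"-separated fields, skipping empty ones.
join_nonempty() {
  awk -F'/' '
    {
      out = ""
      for (i = 1; i <= NF; i++)
        if ($i != "") out = (out == "" ? $i : out "/" $i)
      print out
    }
  ' "$@"
}

echo 'aaa//ccc/ddd/' | join_nonempty   # fields 2 and 5 are empty
```

The trade-off is that a field which was legitimately empty in the input is dropped as well.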
This might work for you:
echo "aaa/bbb/ccc/ddd/eee" |
awk 'BEGIN{FS=OFS="/"}{sub(/../,"",$4);NF=4;print}'
To delete fields not at the end use a function to shuffle the values:
echo "aaa/bbb/ccc/ddd/eee" |
awk 'func d(n){for(x=n;x<=NF-1;x++){y=x+1;$x=$y}NF--};BEGIN{FS=OFS="/"}{d(2);print}'
Deletes the second field.
awk -F'/' '{
    # I would suggest copying the fields into an array:
    for (i = 1; i <= NF; i++) { a[i] = $i }
    # Now manipulate your elements in the array,
    # then finally join and print them:
    output = ""
    for (i = 1; i <= NF; i++) { if (i in a) output = output a[i] "/" }
    sub(/\/$/, "", output)
    print output
    delete a
}'
Doing it this way you can delete elements as well. Note deleting an item can be done like delete array[index].