Understanding an awk command

I need help in understanding the following awk command
awk -F "<name>|</name>|<machine>|</machine>" '{if($0 ~ "<name>" && $0 ~ "</name>") nm=$2;else if($0 ~ "<machine>" && $0 ~ "</machine>") {print nm,$2}}' config.xml
This command gives me the WebLogic managed servers and their respective hosts in the following format:
managed_server1 host1
managed_server2 host2
managed_server3 host3

It's not a particularly well-written script, but it extracts values from lines in this format (and in this order):
<name>xxx</name>
<machine>yyy</machine>
and outputs
xxx yyy
It sets the field delimiter to the opening/closing XML tags. If the first tag pair is present on the line, it saves the value of the second field in a variable; when the second tag pair is present, it prints the previously saved value together with the current second field.
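The same program spread over multiple lines with comments (purely a reformatting of the one-liner above, not a behavioral change):
awk -F "<name>|</name>|<machine>|</machine>" '
{
    if ($0 ~ "<name>" && $0 ~ "</name>")
        nm = $2                # save the text between <name> and </name>
    else if ($0 ~ "<machine>" && $0 ~ "</machine>")
        print nm, $2           # print the saved name and the machine value
}
' config.xml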

I need to sum all the values in a column across multiple files

I have a directory with multiple csv text files, each with a single line in the format:
field1,field2,field3,560
I need to output the sum of the fourth field across all files in a directory (can be hundreds or thousands of files). So for an example of:
file1.txt
field1,field2,field3,560
file2.txt
field1,field2,field3,415
file3.txt
field1,field2,field3,672
The output would simply be:
1647
I've been trying a few different things, with the most promising being an awk command that I found here in response to another user's question. It doesn't quite do what I need it to do, and I am an awk newb so I'm unsure how to modify it to work for my purpose:
awk -F"," 'NR==FNR{a[NR]=$4;next}{print $4+a[FNR]:' file1.txt file2.txt
This correctly outputs 975.
However, if I try to pass it a 3rd file, rather than add field 4 from all 3 files, it adds file1 to file2, then file1 to file3:
awk -F"," 'NR==FNR{a[NR]=$4;next}{print $4+a[FNR]}' file1.txt file2.txt file3.txt
975
1232
Can anyone show me how I can modify this awk statement to accept more than two files or, ideally (because there are thousands of files to sum up), use a * glob to output the sum of the fourth field of all files in the directory?
Thank you for your time and assistance.
A couple issues with the current code:
NR==FNR is used to indicate special processing for the 1st file; in this case there is no processing that is 'special' for just the 1st file (ie, all files are to be processed the same)
an array (eg, a[NR]) is used to maintain a set of values; in this case you only have one global value to maintain so there is no need for an array
Since you're only looking for one global sum, simpler code should suffice:
$ awk -F',' '{sum+=$4} END {print sum+0}' file{1..3}.txt
1647
NOTES:
in the (unlikely?) case all files are empty, sum will be undefined, so print sum would display a blank line; sum+0 ensures we print 0 if sum remains undefined (ie, all files are empty)
for a variable number of files, file{1..3}.txt can be replaced with whatever pattern matches the desired set of files, eg, file*.txt, *.txt, etc (see the find sketch below for very large directories)
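As an aside (a note about scale, not part of the original answer): with thousands of files, a shell glob can overrun the kernel's argument-length limit (ARG_MAX); feeding the files through find avoids building one huge command line:
$ find . -maxdepth 1 -name 'file*.txt' -exec cat {} + | awk -F',' '{sum+=$4} END {print sum+0}'
1647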
Here we go (no need to test NR==FNR in a concatenation):
$ cat file{1,2,3}.txt | awk -F, '{count+=$4}END{print count}'
1647
Or the same, without wasting a pipe:
$ awk -F, '{count+=$4}END{print count}' file{1,2,3}.txt
1647
$ perl -MList::Util=sum0 -F, -lane'push @a,$F[3];END{print sum0 @a}' file{1..3}.txt
1647
$ perl -F, -lane'push @a,$F[3];END{foreach(@a){ $sum +=$_ };print "$sum"}' file{1..3}.txt
1647
$ cut -d, -f4 file{1..3}.txt | paste -sd+ - | bc
1647

Bash code for selecting a few columns from a variable

In a file I have a list of coordinates stored (see figure, to the left).
From there I want to copy the coordinates only (red marked) and put them in another file.
I copy the correct section from the file using COORD=`grep -B${i} '&END COORD' ${cpki_file}`. Then I tried to use awk to extract the required numbers from the COORD variable. It does output all the numbers in the file but deletes the spaces between values (figure, to the right).
How do I write out the red-marked section as it is?
N=200
NEndCoord=`grep -B${N} '&END COORD' ${cpki_file}|wc -l`
NCoord=`grep -B${N} '&END COORD' ${cpki_file}| grep -B200 '&COORD' |wc -l`
let i=$NEndCoord-$NCoord
COORD=`grep -B${i} '&END COORD' ${cpki_file}`
echo "$COORD" | awk '{ print $2 $3 $4 }'
echo "$COORD" | awk '{ print $2 $3 $4 }'>tmp.txt
When you start using combinations of grep, sed, awk, cut and the like, you should realize you can do it all in a single awk command. In the case of the OP, this would do exactly the same:
awk '/[&]END COORD/{p=0}
p { print $2,$3,$4 }
/[&]COORD/{p=1}' file
This parses the file while keeping track of a printing flag p. The flag is set when "&COORD" is found and unset when "&END COORD" is found. Printing happens only while the flag p is set. Since we don't want to print the line with "&END COORD", we have to unset the flag before we do the check for printing. The same holds for the line with "&COORD", but there we have to set the flag after the check for printing (it's a bit of a weird, reversed logic).
The problem with the above is that it will also process the lines
UNIT angstrom
If you want to have these removed, you might want to check the total number of columns:
awk '/[&]END COORD/{p=0}
p && (NF==4){ print $2,$3,$4 }
/[&]COORD/{p=1}' file
Or only print the lines that are non-empty and do not contain "UNIT":
awk '/[&]END COORD/{p=0}
p && (NF>0) && ($1 != "UNIT"){ print $2,$3,$4 }
/[&]COORD/{p=1}' file
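To make the behavior concrete, here is a hypothetical input in the layout the question implies (the actual file contents are only shown as a figure in the original question, so the sample data below is an assumption):
$ cat coord.txt
&COORD
UNIT angstrom
O    0.000    0.000    0.117
H    0.757    0.586   -0.471
H   -0.757    0.586   -0.471
&END COORD
$ awk '/[&]END COORD/{p=0}
p && (NF>0) && ($1 != "UNIT"){ print $2,$3,$4 }
/[&]COORD/{p=1}' coord.txt
0.000 0.000 0.117
0.757 0.586 -0.471
-0.757 0.586 -0.471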
sed one-liner:
sed -n '/^&COORD$/,/^UNIT/{s/.*[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)/\1\t\2\t\3/p}' <infile.txt >outfile.txt
Explanation:
Invocation:
sed: stream editor
-n: do not print unless explicitly told to (the p flag)
Commands in sed:
/^&COORD$/,/^UNIT/: Selects the range of lines from &COORD through UNIT.
{s/.*[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)/\1\t\2\t\3/p}: Process each selected line.
s/.*[[:space:]]\+\(.*\)[[:space:]]\+\(.*\)[[:space:]]\+\(.*\): Regex captures the space-delimited groups, except the first.
/\1\t\2\t\3/: Replace with tab-delimited values of the captured groups.
p: Explicit printout.

How do I obtain a specific row with the cut command?

Background
I have a file, named yeet.d, that looks like this
JET_FUEL = /steel/beams
ABC_DEF = /michael/jackson
....50 rows later....
SHIA_LEBEOUF = /just/do/it
....73 rows later....
GIVE_FOOD = /very/hungry
NEVER_GONNA = /give/you/up
I am familiar with the f and d options of the cut command. The f option allows you to specify which column(s) to extract, while the d option allows you to specify the delimiter.
Problem
I want this output returned using the cut command.
/just/do/it
From what I know, this is part of the command I want to enter:
cut -f2 -d= yeet.d
Given that I want the values to the right of the equals sign, I use the equals sign as the delimiter. However, this would return:
/steel/beams
/michael/jackson
....50 rows later....
/just/do/it
....73 rows later....
/very/hungry
/give/you/up
Which is more than what I want.
Question
How do I use the cut command to return only /just/do/it and nothing else from the situation above? This is different from How to get second last field from a cut command because I want to select a row within a large file, not just near the end or the beginning.
This looks like it would be easier to express with awk...
# awk -v _s="${_string}" '$3 == _s {print $3}' "${_path}"
## Above could be a more _scriptable_ form of the below example
awk -v _search="/just/do/it" '$3 == _search {print $3}' <<'EOF'
JET_FUEL = /steel/beams
SHIA_LEBEOUF = /just/do/it
NEVER_GONNA = /give/you/up
EOF
## Either way, output should be similar to
## /just/do/it
-v _something="Some Thing" bit allows for passing Bash variables to awk
$3 == _search bit tells awk to match only when column 3 is equal to the search string
To search for a sub-string within a line one can use $0 ~ _search
{print $3} bit tells awk to print column 3 for any matches
And the <<'EOF' bit tells Bash to not expand anything within the opening and closing EOF tags
... however, the above will still output duplicate matches, eg. if yeet.d somehow contained...
JET_FUEL = /steel/beams
SHIA_LEBEOUF = /just/do/it
NEVER_GONNA = /give/you/up
AGAIN = /just/do/it
... there'd be two /just/do/it lines output by awk.
Quickest way around that would be to pipe to head -1, but the better way would be to tell awk to exit after it's been told to print...
_string='/just/do/it'
_path='yeet.d'
awk -v _s="${_string}" '$3 == _s {print $3; exit}' "${_path}"
... though that now assumes that only the first match is wanted; obtaining the nth match is possible (see the sketch below), though strictly outside the scope of the question as written.
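For what it's worth, grabbing the nth match is only a small change (a sketch; the match counter _c and the requested index _n are hypothetical names, not from the question):
_string='/just/do/it'
_path='yeet.d'
awk -v _s="${_string}" -v _n=2 '$3 == _s && ++_c == _n {print $3; exit}' "${_path}"
... which prints the second match, if any, and then exits.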
Updates
Matching awk on the first column while printing the third column and exiting after the first match may look like...
_string='SHIA_LEBEOUF'
_path='yeet.d'
awk -v _s="${_string}" '$1 == _s {print $3; exit}' "${_path}"
... and generalize even further...
_string='^SHIA_LEBEOUF '
_path='yeet.d'
awk -v _s="${_string}" '$0 ~ _s {print $3; exit}' "${_path}"
... because awk totally gets regular expressions, mostly.
It depends on how you want to identify the desired line.
You could identify it by the line number. In this case you can use sed
cut -f2 -d= yeet.d | sed '53q;d'
This extracts the 53rd line.
Or you could identify it by a keyword. In this case use grep
cut -f2 -d= yeet.d | grep just
This extracts all lines containing the word just.
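If the row is instead identified by its key, sed alone can select the line and strip everything up to the value in one step (a sketch, assuming the KEY = value layout shown in the question):
sed -n 's/^SHIA_LEBEOUF = //p' yeet.d
This prints /just/do/it and nothing else.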

Use of `NF` in awk command

I am trying to understand what the difference is between two commands (I was expecting the same result from the two):
Case-I
echo 'one,two,three,four,five' |awk -v FS=, '{NF=3}1'
one two three
Case-II
echo 'one,two,three,four,five' |awk -v FS=, -v NF=3 '{$1=$1}1'
one two three four five
Here is my current understanding:
$1=$1 is used to force awk to rebuild the record from the variables defined. I am assigning FS with -v FS=",", which takes effect, unlike -v NF=3.
Question: Why does NF=3 not take effect, whereas FS=, does?
https://www.gnu.org/software/gawk/manual/gawk.html#Options:
-v var=val
--assign var=val
Set the variable var to the value val before execution of the program begins.
https://www.gnu.org/software/gawk/manual/gawk.html#Fields:
NF is a predefined variable whose value is the number of fields in the current record. awk automatically updates the value of NF each time it reads a record.
In your first program, you execute {NF=3} after each line is read, overwriting NF.
In your second program, you initially set NF=3 via -v, but that value is overwritten by awk when the first line of input is read.
FS is different because awk never sets this variable. It will keep whatever value you give it.
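A quick demonstration of that overwrite, using the question's input (NF holds the -v value in the BEGIN block, but is replaced once the first record is read):
$ echo 'one,two,three,four,five' | awk -F, -v NF=3 'BEGIN{print "BEGIN:"NF} {print "after read:"NF}'
BEGIN:3
after read:5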
NF is a predefined variable whose value is the number of fields in the current record. awk automatically updates the value of NF each time it reads a record.
Remember: whenever awk reads a record/line/row, it parses the fields using the field separator FS (default: whitespace), recalculates the number of fields, and updates the variable NF accordingly.
Therefore, the one below does not work.
Why doesn't this work?
You defined NF with -v, which happens before execution of the program.
Then awk read the record/line/row, parsed and recalculated the fields, and the variable NF was overwritten.
case - 1 :
echo 'one,two,three,four,five' |awk -v FS=, -v NF=3 '{$1=$1}1'
one two three four five
Why does this work?
awk read the record/line/row, parsed the fields, and calculated NF as 5
then you overwrote the variable NF inside the program
case - 2 :
echo 'one,two,three,four,five' |awk -v FS=, '{ NF=3 }1'
one two three
Because you have overwritten the variable NF after the record was read.
$ echo 'one,two,three,four,five' |awk -v FS=, '{print "Before:"NF; NF=3; print "After:"NF}1'
Before:5
After:3
one two three
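A closing detail (GNU awk behavior; POSIX leaves shrinking NF less well specified): assigning to NF rebuilds $0 by joining the remaining fields with OFS, which is why the commas become spaces in the output above. Setting OFS keeps them:
$ echo 'one,two,three,four,five' | awk -v FS=, -v OFS=, '{NF=3}1'
one,two,three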

Use AWK to search through fasta file, given a second file containing sequence names

I have 2 files. One is a fasta file containing multiple fasta sequences, while the other file includes the names of the candidate sequences I want to search for (file examples below).
seq.fasta
>Clone_18
GTTACGGGGGACACATTTTCCCTTCCAATGCTGCTTTCAGTGATAAATTGAGCATGATGGATGCTGATAATATCATTCCCGTGT
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
>Clone_27-1
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTC
>Clone_27-2
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTCGTTTTGTTCTAGATTAACTATCAGTTTGGTTCTGTTTGTCCTCGTACTGGGTTGTGTCAATGCACAACTT
>Clone_34-1
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCG
>Clone_34-3
GTTACGGGGGAATAACAAAACTCACCAACTAACAACTAACTACTACTTCACTTTTCAACTACTTTACTACAATACTAAGAATGAAAACCATTCTCCTCATTATCTTTGCTCTCGCTCTTTTCACAAGAGCTCAAGTCCCTGGCTACCAAGCCATCGATATCGCTGAAGCCCAATC
>Clone_44-1
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCC
>Clone_44-3
GTTACGGGGGAATCCGAATTCACAGATTCAATTACACCCTAAAATCTATCTTCTCTACTTTCCCTCTCTCCATTCTCTCTCACACACTGTCACACACATCCCGGCAGCGCAGCCGTCGTCTCTACCCTTCACCAGGAATAAGTTTATTTTTCTACTTAC
name.txt
Clone_23
Clone_27-1
I want to use AWK to search through the fasta file and obtain the fasta sequences of the candidates whose names are saved in the other file.
awk 'NR==FNR{a[$1]=$1} BEGIN{RS="\n>"; FS="\n"} NR>FNR {if (match($1,">")) {sub(">","",$1)} for (p in a) {if ($1==p) print ">"$0}}' name.txt seq.fasta
The problem is that I can only extract the sequence of the first candidate in name.txt, like this:
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
Can anyone help fix the one-line awk command above?
If it is ok or even desired to print the name as well, you can simply use grep:
grep -Ff name.txt -A1 seq.fasta
-f name.txt picks patterns from name.txt
-F treats them as literal strings rather than regular expressions
-A1 prints the matching line plus the subsequent line
If the names are not desired in output I would simply pipe to another grep:
above_command | grep -v '>'
An awk solution can look like this:
awk 'NR==FNR{n[$0];next} substr($0,2) in n && getline' name.txt seq.fasta
Better explained in a multiline version:
# True as long as we are reading the first file, name.txt
NR==FNR {
# Store the names in the array 'n'
n[$0]
next
}
# I use substr() to remove the leading `>` and check if the remaining
# string, which is the name, is a key of `n`. getline retrieves the next line.
# If it succeeds, the condition becomes true and awk will print that line
substr($0,2) in n && getline
$ awk 'NR==FNR{n[">"$0];next} f{print f ORS $0;f=""} $0 in n{f=$0}' name.txt seq.fasta
>Clone_23
GTTACGGGGGGCCGAAAAACACCCAATCTCTCTCTCGCTGAAACCCTACCTGTAATTTGCCTCCGATAGCCTTCCCCGGTGA
>Clone_27-1
GTTACGGGGACCACACCCTCACACATACAAACACAAACACTTCAAGTGACTTAGTGTGTTTCAGCAAAACATGGCTTC
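Note that all of the above assumes each sequence occupies a single line, as in seq.fasta. A sketch that also copes with sequences wrapped over several lines, by treating > as the record separator (assuming > never occurs inside the sequence data):
awk 'NR==FNR { n[$1]; next }       # read candidate names into array n
     $1 in n { printf ">%s", $0 }  # a record is the name plus all of its sequence lines
' name.txt RS='>' seq.fasta
With RS='>', each fasta entry becomes one record whose first field is the name, so the lookup works regardless of how many lines the sequence spans.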