printing previous field in AWK

I think awk will be the solution to my problem. My tools are limited because I'm using BusyBox on ESXi 4.0u1. I have a log file from a VM backup program (ghettoVCB). I need to scan this file for the expression
"Failed to clone disk : There is not enough space on the file system for the selected operation"
In my file, this is around line 43. The previous field (in awk terms) holds the VM name that I want to print to an output text file. In my example the VM name is TEST12-RH4-AtlassianTest.
awk 'RS=""
/There is not enough space/ {
print $17
} '
print $17 is hard-coded, and I don't want this. I want to find the field that is one less than the first field on the line returned by the regex above. Any suggestions are appreciated.

Update (Optimized version)
awk 'NR==1{print $NF}' RS="Failed to clone" input-awk.txt
Proof of Concept
$ awk 'NR==1{print $NF}' RS="Failed to clone" input-awk.txt
TEST12-RH4-AtlassianTest
Update 2 (Uber optimized version)
Technically, the following would be the uber optimized version but it leaves too much chance for false hits on the record separator, although it works for your sample input.
awk 'NR<2{print $NF}' RS="Fa" input-awk.txt
Update 3 (Ultimate mega-kill optimized version)
I wouldn't use this in production code, but it just goes to show you there is always a way to make it simpler. If somebody can beat this for code golf purposes, I'd certainly like to see it!
awk '!a++,$0=$NF' RS="Fa" input-awk.txt
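For the curious, here is the same one-liner again with my annotations (the comments are mine; the code is unchanged):
# !a++      range start: true only on the first record (a goes from 0 to 1)
# $0 = $NF  range "end": replaces the record with its last field; the assignment
#           yields a non-empty (truthy) string, so the range closes on the same
#           record and the default action prints the rewritten $0
awk '!a++,$0=$NF' RS="Fa" input-awk.txt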
Original
Assuming your VM name is always the last field in the record you want to print, this works:
awk '/not enough space/{split(pre,a);print a[pNF]}{pre=$0;pNF=NF}' input-awk.txt

So couldn't you use something like
awk '/not enough space/ {print foo} {foo=$17}'
that is, save the candidate field on every line and print the saved copy when the next line matches?

Related

awk - store first occurrence based on cell

I have a file (around 10k entries) with following format:
text1;text2;text3;lat;long
A;B;C;55.01;12.01
A;B;C;n/a;n/a
D;E;F;56.011;13.099
D;E;F;56.01;13.01
D;E;F;n/a;n/a
I;B;C;n/a;n/a
containing duplicates, some without coordinates (n/a), others with mildly contradicting LAT;LONG coordinates.
I only want to store the first unique value of [$1;$2;$3;$4;$5] as output, so the desired output should look like:
text1;text2;text3;lat;long
A;B;C;55.01;12.01
D;E;F;56.011;13.099
I;B;C;n/a;n/a
I'd assume that I want to create an array, but I struggle with the proper formatting of it... so any help is appreciated!
I'm glad you have it working, but personally, I would suggest something a little more along the lines of:
$ awk -F";" '!seen[$1,$2,$3] {print; seen[$1,$2,$3]=1}' file
Example Use/Output
With your data in file, you could then do:
$ awk -F";" '!seen[$1,$2,$3] {print; seen[$1,$2,$3]=1}' file
text1;text2;text3;lat;long
A;B;C;55.01;12.01
D;E;F;56.011;13.099
I;B;C;n/a;n/a
You can shorten it to roughly your example; it simply checks whether the unique index built from the first three fields has been set yet, and relies on the default print action to output the first record having each unique combination:
$ awk -F";" '!seen[$1,$2,$3]++' file
However, using the joined fields $1,$2,$3 as the index is about the only way you can ensure uniqueness.
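For context, the comma inside an array subscript joins the values with the built-in SUBSEP character ("\034" by default), so seen[$1,$2,$3] is really a single string key. A minimal illustration:
awk 'BEGIN {
    seen["A", "B", "C"] = 1          # key is "A" SUBSEP "B" SUBSEP "C"
    if (("A", "B", "C") in seen)     # the same comma syntax tests membership
        print "found"
}'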
If you say yours works, then it is certainly shorter. Let me know if you have further questions.
Found it by ceasing to look for ways of creating arrays.
I created a new $1 consisting of $1,$2,$3, but the other solution is indeed more elegant. Here is the command I came up with after merging the fields in the file (and setting them as the new $1), which I then didn't have to do:
awk -F';' '!seen[($1)]++' file1.csv > file2.csv

Using pipe and shell command in awk script

I am writing an awk script which needs to produce an output which needs to be sorted.
I am able to get the desired unsorted output in an awk array. I tried the following code to sort the array; it works, but I don't know why, or whether it is the expected behavior.
Sample Input to the question:
Ram,London,200
Alex,London,500
David,Birmingham,300
Axel,Mumbai,150
John,Seoul,450
Jen,Tokyo,600
Sarah,Tokyo,630
The expected output should be:
Birmingham,300
London,700
Mumbai,150
Seoul,450
Tokyo,1230
The following script is required to show the city name along with the respective cumulative total of the integers present in the third field.
BEGIN {
    FS = ","
    OFS = ","
}
{
    if ($2 in arr) {
        arr[$2] += $3;
    } else {
        arr[$2] = $3;
    }
}
END {
    for (i in arr) {
        print i, arr[i] | "sort"
    }
}
The following code is in question:
for (i in arr) {
    print i, arr[i] | "sort"
}
The output of the print is piped to sort, which is an external command.
So, how does this output travel from awk to bash?
Is this the expected behavior or a mere side effect?
Is there a better awk way to do it? I have tried asort and asorti already, but they exist in gawk, not in plain awk.
PS: I am trying to specifically write a .awk file for the task, without using bash commands. Please suggest the same.
Addressing your specific questions in order:
So, how does this output travel from awk to bash?
A pipe to a spawned process.
Is this the expected behavior or a mere side effect?
Expected.
Is there a better awk way to do it? I have tried asort and asorti already, but they exist in gawk, not in plain awk.
Yes, pipe the output of the whole awk command to sort.
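For example, assuming the script above is saved as citysum.awk (the file names here are placeholders):
awk -f citysum.awk input.csv | sort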
PS: I am trying to specifically write a .awk file for the task, without using bash commands. Please suggest the same.
See https://web.archive.org/web/20150928141114/http://awk.info/?Sorting for the implementation of a few common sorting algorithms in awk. See also https://rosettacode.org/wiki/Category:Sorting_Algorithms.
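As a taste of what those pages cover, here is a minimal portable-awk sketch (no gawk extensions) of an END rule that replaces the pipe entirely: it copies the array indices into a numerically indexed array, insertion-sorts them, and prints in order. The names n, idx, and key are mine:
END {
    for (i in arr)                   # copy the indices into a numeric array
        idx[++n] = i
    for (j = 2; j <= n; j++) {       # straight insertion sort on the index strings
        key = idx[j]
        for (k = j - 1; k >= 1 && idx[k] > key; k--)
            idx[k + 1] = idx[k]
        idx[k + 1] = key
    }
    for (j = 1; j <= n; j++)
        print idx[j], arr[idx[j]]    # emit in sorted order
}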
With respect to the question in your comments:
Since a process is spawned to sort from within the loop in the END rule, I was confused whether this will call sort on a single line, with the spawned process dying thereafter and a new sort process being spawned in the next iteration of the loop
The spawned process won't die until your awk script terminates or you call close("sort").
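If you do keep the pipe inside the END rule, closing it explicitly makes that single-process lifetime obvious (a sketch based on the script above):
END {
    for (i in arr)
        print i, arr[i] | "sort"     # every print feeds the same single sort process
    close("sort")                    # flush the pipe and reap that one process
}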
Could you please try changing your sort to sort -t',' -k1 in your code? Since your delimiter is a comma, you need to inform sort that your delimiter is different from the default (whitespace).
Also, you could remove the if/else block from your main block and use only arr[$2]+=$3. Keep the rest of the code as it is, apart from the sort changes mentioned above.
I am on mobile so I couldn't paste all the code, but the explanation should help you here.
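A sketch of that simplification (a reference to an unset awk array element yields 0 in a numeric context, so the else branch is redundant):
{ arr[$2] += $3 }    # missing keys start at 0; no if/else needed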
What I would suggest is piping the output of awk to sort, rather than worrying about piping the output within the END rule. While GNU awk provides asorti() to allow sorting the contents of an array, in this case, since it is just the output you want sorted, a single pipe to sort after your awk script completes is all you need, e.g.
$ awk -F, -v OFS=, '{a[$2]+=$3}END{for(i in a)print i, a[i]}' file | sort
Birmingham,300
London,700
Mumbai,150
Seoul,450
Tokyo,1230
And since it is a single pipe of the output, you incur no per-iteration overhead for the subshell required by the pipe.
If you want to avoid the pipe altogether, if you have bash, you can simply use process-substitution with redirection, e.g.
$ sort < <(awk -F, -v OFS=, '{a[$2]+=$3}END{for(i in a)print i, a[i]}' file)
(same result)
If you have GNU awk, then asorti() will sort the indices of a into a new array b (returning the count), and you can then output the sorted results within the END rule, e.g.
$ awk -F, -v OFS=, '{a[$2]+=$3}END{n=asorti(a,b);for(i=1;i<=n;i++)print b[i], a[b[i]]}' file
Birmingham,300
London,700
Mumbai,150
Seoul,450
Tokyo,1230

awk - How to extract quoted string in space delimited log file

I'm hoping there might be some simple way to do this, as I'm a total novice using awk.
I have a bunch of log files from an AWS load balancer, and I want to extract entries from these logs, where a particular response code was received.
Checking the response code is easy enough, I can do the following...
$9=="403" {print $0}
However, what I really want is just the request itself, $13. That column is quoted, though, and will contain spaces. It looks like so...
"GET https://[my domain name]:443/[my path] HTTP/2.0"
If I do the following...
$9=="403" {print $13}
I just get...
"GET
So what I think I need to do is have awk (or some other appropriate utility) extract the complete column 13, and then break that down into its individual fields, for method, URL etc.
Could you please try the following. Inside match I have used 443 in the regex, as per your sample; to look for 403 instead, change it to match($0,/\".*403.*\"/).
awk 'match($0,/\".*443.*\"/){print substr($0,RSTART,RLENGTH)}' Input_file
IMHO the advantage of this approach is that you need not hard-code any field number in your awk. One more thing: I have assumed that your Input_file has a "......403....." kind of section only once, and that you want to print only that.
One more awk, for the case where you may have multiple occurrences of "..." and want to pick only the one containing 403/443:
awk 'match($0,/\".*443[^"]*/){print substr($0,RSTART,RLENGTH+1)}' Input_file
EDIT: Or if your Input_file has "...443..." only once, or that text is the first quoted section on the line (with any other occurrences of ".." coming later), then you could try the following.
awk -F'"' '/443/{print $2}' Input_file
Newer versions of gawk have a built-in variable FPAT, which you can use to define fields by a regex pattern. For your logs, if there are no other quoted fields before field 9 and field 13:
awk -v FPAT='[^[:space:]]+|"[^"]*"' '$9 == "403"{print $13}' log_file
REF: https://www.gnu.org/software/gawk/manual/html_node/Splitting-By-Content.html
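Building on that, to then break the quoted request itself into method, URL, and protocol, one possible sketch (still gawk, and assuming the field positions from the question) is:
awk -v FPAT='[^[:space:]]+|"[^"]*"' '$9 == "403" {
    req = $13
    gsub(/^"|"$/, "", req)      # strip the surrounding quotes
    split(req, part, " ")       # part[1]=method, part[2]=URL, part[3]=protocol
    print part[1], part[2]
}' log_file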

Output field separators in awk after substitution in fields

Is it always the case, after modifying a specific field in awk, that information on the output field separator is lost? What happens if there are multiple field separators and I want them to be recovered?
For example, suppose I have a simple file example that contains:
a:e:i:o:u
If I just run an awk script that takes account of the input field separator and prints each line in my file, such as running
awk -F: '{print $0}' example
I will see the original line. If however I modify one of the fields directly, e.g. with
awk -F: '{$2=$2"!"; print $0}' example
I do not get back a modified version of the original line, rather I see the fields separated by the default whitespace separator, i.e:
a e! i o u
I can get back a modified version of the original by specifying OFS, e.g.:
awk -F: 'BEGIN {OFS=":"} {$2=$2"!"; print $0}' example
In the case, however, where there are multiple potential field separators, is there a simple way of restoring the original separators?
For example, if example had both : and ; as separators, I could use -F":|;" to process the file, but OFS would not be sufficient to restore the original separators in their relative positions.
More explicitly, if we switched to example2 containing
a:e;i:o;u
we could use
awk -F":|;" 'BEGIN {OFS=":"} {$2=$2"!"; print $0}' example2
(or -F"[:;]") to get
a:e!:i:o:u
but we've lost the distinction between : and ; which would have been maintained if we could recover
a:e!;i:o;u
You need to use GNU awk for the 4th arg to split() which saves the separators, like RT does for RS:
$ awk -F'[:;]' '{split($0,f,FS,s); $2=$2"!"; r=s[0]; for (i=1;i<=NF;i++) r=r $i s[i]; $0=r} 1' file
a:e!;i:o;u
There is no automatically populated array of FS-matching strings because of how expensive it'd be in time and memory to store the string that matches FS every time you split a record into fields. Instead, the GNU awk folks provided a 4th arg to split() so you can do it yourself if/when you want it. That was the result of a long conversation a few years ago in the comp.lang.awk newsgroup between experienced awk users and the gawk maintainers, who all eventually agreed that this was the best approach.
See split() at https://www.gnu.org/software/gawk/manual/gawk.html#String-Functions.
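To see what that fourth argument captures, a small demonstration (gawk only; the variable names are mine):
gawk 'BEGIN {
    n = split("a:e;i:o;u", f, /[:;]/, s)
    for (i = 1; i < n; i++)
        print "separator after field " i " is " s[i]
}'
This reports : ; : ; in order, which is exactly the information needed to reassemble the record.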

Weird AWK behavior while forcing expression to be a number (adding 0)

I noticed an odd behavior while populating an array in awk. The indices and values were both numbers, so adding 0 shouldn't have had any impact. For the sake of understanding, let's take the following example:
Here is a file that I wish to use for this demo:
$ cat file
2.60E5-2670161065730303122012098 Invnum987678
2.60E5-2670161065846403042011098 Invnum987912
2.60E5-2670161065916903012012075 Invnum987654
2.60E5-2670161066813503042011075 Invnum987322
2.60E5-2670161066835008092012075 Invnum987323
2.60E5-2670161067040701122012075 Invnum987324
2.60E5-2670161067106602122010074 Invnum987325
What I would like to do is create an index from $1 and assign it a value from $2. I will extract the pieces of each from $1 and $2 using the substr function.
$ awk '{p=substr($1,12)+0; A[p]=substr($2,7)+0;next}END{for(x in A) print x,A[x]}' file
Now, ideally, the output should have been as follows (ignoring the fact that associative arrays may print in random order):
161065730303122012098 987678
161065846403042011098 987912
161065916903012012075 987654
161066813503042011075 987322
161066835008092012075 987323
161067040701122012075 987324
161067106602122010074 987325
But, the output I got was as follows:
161066835008092012544 987323
161065846403042017280 987912
161067040701122019328 987324
161067106602122018816 987325
161066813503041994752 987322
161065916903012007936 987654
161065730303122014208 987678
If I remove the +0 from the above awk one-liner, the output is what I expect. What I would like to know is: why would adding 0 corrupt the keys?
The above test was done on:
$ awk -version
awk version 20070501
AWK stores numbers as double-precision floating point, which carries only about 15-17 significant decimal digits. Your keys are 21 digits long, so the +0 converts the exact substring into a rounded double, and the array index is rebuilt from that approximation; that is why the tails of the keys get corrupted. Without the +0 the substring stays a string and is used verbatim as the index. I get similarly skewed results on gawk, since it shares the same double-precision limitation.
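You can see the rounding directly, without any arrays (a minimal demo; the constant is the first key from your file):
awk 'BEGIN {
    s = "161065730303122012098"      # exact as a 21-digit string
    printf "%s\n%.0f\n", s, s + 0    # s+0 becomes a rounded double
}'
On a typical build the second line differs from the first in its trailing digits, because a double only holds about 15-17 significant decimal digits.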