Assigning the value of a field (positional variable) to a user-defined variable in gawk/awk - awk

I am creating a variable called "size" and trying to assign it a value from a gawk positional variable, but that does not seem to work. In the example below, I am trying to store the value of field 4 in the variable "size". When I print the variable size, the entire line is printed instead of just field 4.
How can I save the field value into a variable for later use?
prompt> echo "Live in a big city" | gawk '/Live/ {size=$4; print $size}'
The following is outputted:
Live in a big city
I would like to see just this:
big

Leave out the dollar sign. awk is like C, and unlike shell or Perl: you don't need any extra punctuation to dereference a variable. You only use a dollar sign to get the value of the n'th field on the current line.
echo "Live in a big city" | gawk '/Live/ {size=$4; print size}'
The reason you get the whole line printed is this: the awk variable size is assigned the value big. Then, in the print statement, awk dereferences the size variable and attempts to print $big. The string "big" is converted to a number and, as it does not begin with any digits, it is treated as 0. So you get print $0, and hence the complete line.
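A minimal sketch of the difference, using the input line from the question. The shell variable names and the awk variable n are only for illustration.

```shell
line="Live in a big city"

# size holds the string "big"; printing size gives the stored field value.
stored=$(echo "$line" | awk '/Live/ { size = $4; print size }')

# $size means "the field whose number is size"; "big" converts to 0, so this is $0.
whole=$(echo "$line" | awk '/Live/ { size = $4; print $size }')

# With a numeric variable, $n selects field n.
second=$(echo "$line" | awk '{ n = 2; print $n }')

printf '%s\n' "$stored" "$whole" "$second"
```

So the dollar sign is useful on a variable only when the variable holds a field number.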

Related

What ends up happening when we try to use regex modifiers in awk?

See this output:
❯ awk '/indubitably/i' /usr/share/dict/words | wc -l
102401
Awk does not complain about invalid syntax or anything like that, and just spits out all lines in the file (words indeed has 102401 words inside).
Since it's very reasonable as an awk newbie to try this as a guess for case insensitivity (I am aware that IGNORECASE=1; is the right way to do it) I'm now curious how awk actually interprets /indubitably/i.
Actually, there is nothing invalid about that syntax.
/indubitably/ on its own is a regex match against each input line, which yields 1 or 0. Following it with i concatenates that result with the uninitialized variable i, which defaults to an empty string (or false, in a boolean context).
The concatenation always produces a string, "1" or "0", and both are non-empty, so the pattern is true for every line, whether or not the regex matched.
At the pattern level, anything that evaluates to a non-zero number or a non-empty string counts as true, and a pattern with no action defaults to print. You can put almost anything there.
Writing a 1 there is just the conventional notation; you can even place NF, OFMT, FNR, SUBSEP, or RS at the pattern position (as long as the value is not zero or an empty string), and it will work as if you had written 1.
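A small sketch of this behaviour on two sample lines instead of the dictionary file. The variable names (words, all, flags, ci) are made up for the example, and the tolower() match at the end is one portable alternative to gawk's IGNORECASE, not something taken from the thread.

```shell
words=$(printf 'indubitably\nhello\n')

# /re/ i is the match result concatenated with the empty variable i,
# which is always a non-empty string, so every line is printed.
all=$(printf '%s\n' "$words" | awk '/indubitably/i')

# Storing the concatenation shows the string "1" or "0" for each line.
flags=$(printf '%s\n' "$words" | awk '{ flag = /indubitably/ i; print flag }')

# A portable case-insensitive match: lowercase the line before matching.
ci=$(printf 'Indubitably\nhello\n' | awk 'tolower($0) ~ /indubitably/')

printf '%s\n' "$all" "$flags" "$ci"
```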

How to remove unix timestamp specific data from a flatfile

I have a huge file containing a list like this
email#domain.com^B1569521698
email2#domain.com,#2domain.com^B1569521798
email3#domain.com,test#2domain.com^B1569521898
email10000#domain.com^B1569521998
..
..
The file is named /usr/local/email/whitelist
The number after ^B is a unix timestamp
I need to remove from the list all the rows having a timestamp smaller than
(e.g.) 1569521898.
I tried using various awk/sed combinations with no result.
The ^B character you see is a control character. The first 32 ASCII codes, 0 through 1F hex, form a set of non-printing characters called control characters, because they perform printer and display control operations rather than displaying symbols. This particular one (code 2) stands for STX, or Start of Text.
You can type control characters in a shell as Ctrl+v Ctrl+b, or you can use the octal representation directly (\002).
awk -F '\002' '($2 >= 1569521898)'
Since you have control characters in your Input_file, you could try the following. It was written and tested with the given samples only.
awk '
match($0,/\002[0-9]+/){
val=substr($0,RSTART+1,RLENGTH-1)
if(val>=1569521898){ print }
val=""
}
' Input_file
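To try either approach without touching the real whitelist, here is a sketch that builds a small sample file first. The temporary file and the variable names (sample, cutoff, kept) are assumptions for illustration; the cutoff value is the one from the question.

```shell
# Build a small sample file; printf turns \002 into the real STX byte.
sample=$(mktemp)
printf 'email#domain.com\0021569521698\nemail3#domain.com,test#2domain.com\0021569521898\nemail10000#domain.com\0021569521998\n' > "$sample"

cutoff=1569521898   # rows with an older timestamp are dropped

# Keep rows whose second STX-separated field is at or above the cutoff.
kept=$(awk -F '\002' -v cutoff="$cutoff" '$2 >= cutoff' "$sample")

# Show how many rows survived.
printf '%s\n' "$kept" | wc -l
rm -f "$sample"
```

To update the real file, a common pattern is to write the filtered output to a temporary file and then move it over the original, since awk cannot safely read and overwrite the same file in one redirection.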

awk print 4 columns with different colours - from a declared variable

I'm just after a little help pulling in a value from a variable. I'm writing a statement to print the contents of a file to a 4 columns output on screen, colouring the 3rd column depending on what the 4th columns value is.
The file has contents as follows...
Col1=date(yymmdd)
Col2=time(hhmmss)
Col3=Jobname(test1, test2, test3, test4)
Col4=Value(null, 0, 1, 2)
Column 4 should be a value of null, 0, 1 or 2 and this is the value that will determine the colour of the 3rd column. I'm declaring the colour codes in a variable at the top of the script as follows...
declare -A colours
colours["0"]="\033[0;31m"
colours["1"]="\033[0;34m"
colours["2"]="\033[0;32m"
(note I don't have a colour for a null value, I don't know how to code this yet but I'm wanting it to be red)
My code is as follows...
cat TestScript.txt | awk '{ printf "%20s %20s %20s %10s\n", "\033[1;31m"$1,"\033[1;32m"$2,${colours[$4]}$3,"\033[1;34m"$4}'
But I get a syntax error and can't for the life of me figure a way around it no matter what I do.
Thanks for any help
Amended code below to show the working solution.
I've removed the array that was originally set in bash and defined it inside the awk program instead...
cat TestScript.txt | awk 'BEGIN {
colours[0]="\033[0;31m"
colours[1]="\033[0;34m"
colours[2]="\033[0;32m"
}
{printf "%20s %20s %20s %10s\n","\033[1;31m"$1,"\033[1;32m"$2,colours[$4]$3,"\033[1;34m"$4}'
Just define the colours array in awk.
Either
BEGIN {
colours[0]="\033[0;31m"
colours[1]="\033[0;34m"
colours[2]="\033[0;32m"
}
or
BEGIN { split("\033[0;31m \033[0;34m \033[0;32m", colours) }
But with the second way, remember that the first index in the array is 1, not 0.
Then, in your printf statement, the use of the colours array must be changed to:
,colours[$4]$3,
But if you have defined the array using the second method, then a +1 is required:
,colours[$4+1]$3,
Best regards
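Putting the first variant together, and covering the null case the question mentions, a minimal sketch could look like this. The function name colour_jobs, the fallback and reset variables, and the simplified output format are assumptions for illustration; the fallback is red only because the question asks for that.

```shell
colour_jobs() {
  awk '
    BEGIN {
      colours[0] = "\033[0;31m"   # red
      colours[1] = "\033[0;34m"   # blue
      colours[2] = "\033[0;32m"   # green
      fallback   = "\033[0;31m"   # assumed: red for null or unknown values
      reset      = "\033[0m"      # switch colouring off after the job name
    }
    {
      # Use the colour for column 4 if one is defined, else the fallback.
      c = ($4 in colours) ? colours[$4] : fallback
      printf "%s %s %s%s%s %s\n", $1, $2, c, $3, reset, $4
    }'
}

printf '230101 120000 test1 2\n230101 120500 test3 null\n' | colour_jobs
```

The ($4 in colours) test avoids creating empty array entries for values that have no colour.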
In awk you can use the built-in ENVIRON array to access environment variables.
So instead of ${colours[$4]} (which is bash syntax, not awk syntax) you can write ENVIRON["something"]. Unfortunately, bash arrays cannot be accessed this way. So instead of keeping a colours array in the environment, you should export separate variables colours_0, ..., colours_2, and then you can use ENVIRON["colours_"$4].
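A sketch of the ENVIRON approach, assuming one exported variable per value of column 4. The reset variable and the function name colourise are additions for the example. Note that awk does not process escape sequences in ENVIRON values, so the variables must already contain the real ESC byte; printf takes care of that here.

```shell
# printf converts \033 into the actual escape character before export.
colours_0=$(printf '\033[0;31m')
colours_1=$(printf '\033[0;34m')
colours_2=$(printf '\033[0;32m')
reset=$(printf '\033[0m')
export colours_0 colours_1 colours_2 reset

colourise() {
  # Look up the colour by building the variable name from column 4.
  awk '{ printf "%s %s %s%s%s %s\n", $1, $2, ENVIRON["colours_" $4], $3, ENVIRON["reset"], $4 }'
}

printf '230101 120000 test1 0\n230101 120500 test2 2\n' | colourise
```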

How do associative arrays work in awk?

I wanted to remove duplicate lines from a file based on a column. A quick search led me to this page, which had the following solution:
awk '!x[$1]++' filename
It works, but I am not sure how it works. I know it uses associative arrays in awk, but I am not able to infer anything beyond that.
Update:
Thanks everyone for the explanation. With my new knowledge, I have written a blog post with further explanation of how it works.
That awk script !x[$1]++ fills an array named x. Suppose the first word ($1 refers to the first word in a line of text) in a line of text is line1. It effectively results in this operation on the array:
x["line1"]++
The "index" (the key) of the array is the text encountered in the file (line1 in this example), and the value associated with that key is an integer that is incremented by 1.
When a new key is encountered, the current value of that array element is zero (an unset element counts as 0), and it is then post-incremented to 1. The not operator ! is applied to the value before the increment, so it evaluates to non-zero (true) for each new key and the line is printed. The next time the same key is encountered, the value in the array is non-zero, so the not operation results in zero (false) and the line is not printed.
A less "clever" way of writing the same thing (but possibly more clear and less fun) would be this:
{
if (x[$1] == 0 )
print
x[$1]++
}
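A short sketch of both effects on made-up sample data: which lines survive, and what the array holds afterwards. The variable names input, unique, and counts are only for the example.

```shell
input=$(printf 'a 1\nb 2\na 3\nb 4\nc 5\n')

# Keep only the first line seen for each value of column 1.
unique=$(printf '%s\n' "$input" | awk '!x[$1]++')

# After the run, each array element holds how many times its key appeared.
counts=$(printf '%s\n' "$input" | awk '{ x[$1]++ } END { print x["a"], x["b"], x["c"] }')

printf '%s\n' "$unique" "$counts"
```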

Fortran read statement reading beyond an end of line

Do you know if the following statement is guaranteed to be true by one of the Fortran 90/95/2003 standards?
"Suppose a read statement for a character variable is given a blank line (i.e., containing only white spaces and new line characters). If the format specifier is an asterisk (*), it continues to read the subsequent lines until a non-blank line is found. If the format specifier is '(A)', a blank string is substituted to the character variable."
For example, please look at the following minimal program and input file.
program code:
PROGRAM chk_read
INTEGER, PARAMETER :: MAXLEN=30
CHARACTER(len=MAXLEN) :: str1, str2
str1='minomonta'
read(*,*) str1
write(*,'(3A)') 'str1_start|', str1, '|str1_end'
str2='minomonta'
read(*,'(A)') str2
write(*,'(3A)') 'str2_start|', str2, '|str2_end'
END PROGRAM chk_read
input file:
----'input.dat' content is below this line----
yamanakako
kawaguchiko
----'input.dat' content is above this line----
Please note that there are four lines in 'input.dat' and the first and third lines are blank (contain only white spaces and new line characters). If I run the program as
$ ../chk_read < input.dat > output.dat
I get the following output
----'output.dat' content is below this line----
str1_start|yamanakako |str1_end
str2_start| |str2_end
----'output.dat' content is above this line----
The first read statement for the variable 'str1' seems to look at the first line of 'input.dat', find a blank line, move on to the second line, find the character value 'yamanakako', and store it in 'str1'.
In contrast, the second read statement for the variable 'str2' seems to be given the third line, which is blank, and store the blank line in 'str2', without moving on to the fourth line.
I tried compiling the program by Intel Fortran (ifort 12.0.4) and GNU Fortran (gfortran 4.5.0) and got the same result.
Some background on why I am asking: I am writing a subroutine to read a data file that uses a blank line as a separator between data blocks. I want to make sure that the blank line, and only the blank line, is thrown away while reading the data. I also need the code to be standard-conforming and portable.
Thanks for your help.
From Fortran 2008 standard draft:
List-directed input/output allows data editing according to the type
of the list item instead of by a format specification. It also allows
data to be free-field, that is, separated by commas (or semicolons) or
blanks.
Then:
The characters in one or more list-directed records constitute a
sequence of values and value separators. The end of a record has the
same effect as a blank character, unless it is within a character
constant. Any sequence of two or more consecutive blanks is treated as
a single blank, unless it is within a character constant.
This implicitly states that in list-directed input, blank lines are treated as blanks until the next non-blank value.
When reading with a fmt='(A)' format descriptor, blank lines are read into str. On the other hand, fmt=*, which selects list-directed (free-form) I/O, skips blank lines until it finds a non-blank character string. To test this, do something like:
PROGRAM chk_read
INTEGER :: cnt
INTEGER, PARAMETER :: MAXLEN=30
CHARACTER(len=MAXLEN) :: str
cnt=1
do
read(*,fmt='(A)',end=100)str
write(*,'(I1,3A)')cnt,' str_start|', str, '|str_end'
cnt=cnt+1
enddo
100 continue
END PROGRAM chk_read
$ cat input.dat
yamanakako
kawaguchiko
EOF
Running the program gives this output:
$ a.out < input.dat
1 str_start| |str_end
2 str_start| |str_end
3 str_start| |str_end
4 str_start|yamanakako |str_end
5 str_start| |str_end
6 str_start|kawaguchiko |str_end
On the other hand, if you use list-directed input:
read(*,fmt=*,end=100)str
You end up with this output:
$ a.out < input.dat
1 str1_start|yamanakako |str1_end
2 str2_start|kawaguchiko |str2_end
This part of the F2008 standard draft probably addresses your question:
10.10.3 List-directed input
7 When the next effective item is of type character, the input form
consists of a possibly delimited sequence of zero or more
rep-chars whose kind type parameter is implied by the kind of the
effective item. Character sequences may be continued from the end of
one record to the beginning of the next record, but the end of record
shall not occur between a doubled apostrophe in an
apostrophe-delimited character sequence, nor between a doubled quote
in a quote-delimited character sequence. The end of the record does
not cause a blank or any other character to become part of the
character sequence. The character sequence may be continued on as many
records as needed. The characters blank, comma, semicolon, and slash
may appear in default, ASCII, or ISO 10646 character sequences.