awk command to print out entire row that contains a certain string - awk

output.txt
Test Results 1 PASSED with 2 minutes to process 0 issues
Test Results 2 PASSED with 10 minutes to process 0 issues
Test Results 3 FAILED ERROR 1 issues
Test Results 4 PASSED with 4 minutes to process 0 issues
Test Results 5 FAILED ERROR 3 issues
Test Results 6 PASSED with 19 minutes to process 0 issues
I need help coming up with an awk command to parse this text. I want to list only the rows that have more than 0 issues.
So in this case:
Test Results 3 FAILED ERROR 1 issues
Test Results 5 FAILED ERROR 3 issues

Try this:
$ awk '$(NF-1)' file
Test Results 3 FAILED ERROR 1 issues
Test Results 5 FAILED ERROR 3 issues
Here $(NF-1) is the issue count (the second-to-last field); awk treats a non-zero value as a true condition, so only those lines are printed.

Try the following
awk '{if (($0 ~ /[0-9] issues/) && ($(NF-1) != "0")) {print $0}}' output.txt
Here the line is matched against the "[0-9] issues" pattern, and the issue count (the second-to-last field) is checked to be non-zero.
It could have been done using
awk '{if ($0 ~ /[1-9] issues/) {print $0}}' output.txt
in case you are certain that there will only ever be 1-9 issues and never 10 or more (the regex would miss 10, 20, 100 ... issues, because the digit right before " issues" would be 0).
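If the count can reach 10 or more, you can avoid the regex altogether and compare the count numerically (a sketch assuming, as in the samples, that the count is always the second-to-last field):
awk '$(NF-1)+0 > 0' output.txt
The +0 forces a numeric comparison, so 10, 20, 100 ... issues all match.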

Related

Which AWK program can do this manipulation?

Given a file containing a structure arranged like the following (with fields separated by SP or HT)
4 5 6 2 9 8 4 8
m d 6 7 9 5 4 g
t 7 4 2 4 2 5 3
h 5 6 2 5 s 3 4
r 5 7 1 2 2 4 1
4 1 9 0 5 6 d f
x c a 2 3 4 5 9
0 0 3 2 1 4 q w
Which AWK program do I need to get the following output?
4 5
m d
t 7
h 5
r 5
4 1
x c
0 0
6 2
6 7
4 2
6 2
7 1
9 0
a 2
3 2
9 8
9 5
4 2
5 s
2 2
5 6
3 4
1 4
4 8
4 g
5 3
3 4
4 1
d f
5 9
q w
Thanks in advance for any and all help.
Postscript
Please bear in mind,
My input file is much larger than the one depicted in this question.
My computer science skills are seriously limited.
This task has been imposed on me.
awk -v n=4 '
function join(start, end, result, i) {
    for (i=start; i<=end; i++)
        result = result $i (i==end ? ORS : FS)
    return result
}
{
    c=0
    for (i=1; i<NF; i+=n) {
        c++
        col[c] = col[c] join(i, i+n-1)
    }
}
END {
    for (i=1; i<=c; i++)
        printf "%s", col[i]    # the value already ends with a newline
}
' file
The awk info page has a short primer on awk, so read that too.
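For reference, the benchmarks below assume this program is saved as columnize.awk. With n=2 it produces the output the OP asked for:
$ awk -f columnize.awk -v n=2 file
4 5
m d
t 7
...
q w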
Benchmarking
create an input file with about a million columns (2**20 = 1,048,576) and 8 rows (as specified by the OP)
#!perl
my $cols = 2**20;   # 1,048,576
my $rows = 8;
my @alphabet = ( 'a'..'z', 0..9 );
my $size = scalar @alphabet;
for ($r=1; $r <= $rows; $r++) {
    for ($c = 1; $c <= $cols; $c++) {
        my $idx = int rand $size;
        printf "%s ", $alphabet[$idx];
    }
    printf "\n";
}
$ perl createfile.pl > input.file
$ wc input.file
8 8388608 16777224 input.file
time various implementations: I use the fish shell, so the timing output is different from bash's
my awk
$ time awk -f columnize.awk -v n=4 input.file > output.file
________________________________________________________
Executed in 3.62 secs fish external
usr time 3.49 secs 0.24 millis 3.49 secs
sys time 0.11 secs 1.96 millis 0.11 secs
$ wc output.file
2097152 8388608 16777216 output.file
Timur's perl:
$ time perl -lan columnize.pl input.file > output.file
________________________________________________________
Executed in 3.25 secs fish external
usr time 2.97 secs 0.16 millis 2.97 secs
sys time 0.27 secs 2.87 millis 0.27 secs
Ravinder's awk
$ time awk -f columnize.ravinder input.file > output.file
________________________________________________________
Executed in 4.01 secs fish external
usr time 3.84 secs 0.18 millis 3.84 secs
sys time 0.15 secs 3.75 millis 0.14 secs
kvantour's awk, first version
$ time awk -f columnize.kvantour -v n=4 input.file > output.file
________________________________________________________
Executed in 3.84 secs fish external
usr time 3.71 secs 166.00 micros 3.71 secs
sys time 0.11 secs 1326.00 micros 0.11 secs
kvantour's second awk version: Ctrl-C interrupted after a few minutes
$ time awk -f columnize.kvantour2 -v n=4 input.file > output.file
^C
________________________________________________________
Executed in 260.80 secs fish external
usr time 257.39 secs 0.13 millis 257.39 secs
sys time 1.68 secs 2.72 millis 1.67 secs
$ wc output.file
9728 38912 77824 output.file
The $0=a[j] line is pretty expensive, as it has to parse the string into fields each time.
dawg's python
$ timeout 60s fish -c 'time python3 columnize.py input.file 4 > output.file'
[... 60 seconds later ...]
$ wc output.file
2049 8196 16392 output.file
Another interesting data point: different awk implementations. I'm on a Mac with GNU awk and mawk installed via Homebrew.
with many columns, few rows
$ time gawk -f columnize.awk -v n=4 input.file > output.file
________________________________________________________
Executed in 3.78 secs fish external
usr time 3.62 secs 174.00 micros 3.62 secs
sys time 0.13 secs 1259.00 micros 0.13 secs
$ time /usr/bin/awk -f columnize.awk -v n=4 input.file > output.file
________________________________________________________
Executed in 17.73 secs fish external
usr time 14.95 secs 0.20 millis 14.95 secs
sys time 2.72 secs 3.45 millis 2.71 secs
$ time mawk -f columnize.awk -v n=4 input.file > output.file
________________________________________________________
Executed in 2.01 secs fish external
usr time 1892.31 millis 0.11 millis 1892.21 millis
sys time 95.14 millis 2.17 millis 92.97 millis
with many rows, few columns: this test took over half an hour on a MacBook Pro (6-core Intel CPU, 16 GB RAM)
$ time mawk -f columnize.awk -v n=4 input.file > output.file
________________________________________________________
Executed in 32.30 mins fish external
usr time 23.58 mins 0.15 millis 23.58 mins
sys time 8.63 mins 2.52 millis 8.63 mins
Use this Perl script:
perl -lane '
    push @rows, [@F];
    END {
        my $delim = "\t";
        my $cols_per_group = 2;
        my $col_start = 0;
        while ( 1 ) {
            for my $row ( @rows ) {
                print join $delim, @{$row}[ $col_start .. ($col_start + $cols_per_group - 1) ];
            }
            $col_start += $cols_per_group;
            last if ($col_start + $cols_per_group - 1) > $#F;
        }
    }
' in_file > out_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array @F on whitespace or on the regex specified in the -F option.
This script reads the file into memory. This is okay for most modern computers and the file sizes in question.
Each line is split on whitespace (use -F'\t' for TAB as the delimiter) into array @F. The references to this array, one per line, are stored as elements in array @rows. After the file is read, at the end of the script (in the END { ... } block), the contents of the file are printed in groups of columns, with $cols_per_group columns per group. Columns are delimited by $delim.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
Could you please try the following, written and tested with ONLY the shown samples in GNU awk.
awk '
{
  for(i=1;i<=NF;i+=2){
    arr[i]=(arr[i]?arr[i] ORS :"")$i OFS $(i+1)
  }
}
END{
  for(i=1;i<=NF;i+=2){
    print arr[i]
  }
}' Input_file
Since we all love awk, here is another one:
awk -v n=2 '{for(i=1;i<=NF;++i) { j=int((i-1)/n); a[j] = a[j] $i (i%n==0 || i==NF ?ORS:OFS) }}
END{for(j=0;j<=int(NF/n);j++) printf "%s", a[j]}'
This will output exactly what is requested by the OP.
How does it work?
Awk performs actions per record/line it reads. For each record, it processes all the fields and appends them to a set of strings stored in an array a. It does this in such a way that a[0] contains the first n columns, a[1] the second set of n columns, and so on. The relation between field number and string index is given by int((i-1)/n).
When building the strings, we keep track of whether we need to append a field separator OFS or a newline (record separator ORS). We decide this based on the field number modulo the number of columns per group (i.e. n). Note that we always use ORS when we process the last field.
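As a concrete trace (my illustration, n=2, first input line only):
# fields: 4 5 6 2 9 8 4 8    (NF=8, n=2)
# i=1,2 -> j=0 : a[0] = "4 5\n"
# i=3,4 -> j=1 : a[1] = "6 2\n"
# i=5,6 -> j=2 : a[2] = "9 8\n"
# i=7,8 -> j=3 : a[3] = "4 8\n"
# later lines append to the same a[j] strings, and END prints a[0], a[1], ... in order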
An alternative approach:
Thanks to a comment by Dawg, a flaw in the above code was found: the program scales really badly on files with many rows. The real reason is not 100% known, but I assume it comes from constantly rewriting memory with operations like a[j] = a[j] $i (i%n==0 || i==NF ? ORS : OFS). This can be eliminated by buffering the entire file and doing all the work at the end:
awk -v n=2 '{a[NR]=$0}
END{ for(i=1;i<=NF;i+=n)
       for(j=1;j<=NR;++j) {
         $0=a[j]
         for(k=0;k<n&&(i+k<=NF);++k)
           printf "%s%s", $(i+k), ((i+k==NF || ((i+k) % n == 0)) ? ORS : OFS)
       }
}' file
Note: the latter is only efficient for a small number of columns. This is because of the constant re-splitting done with $0=a[j]; the split takes much more time with a large number of fields. The complexity of this approach is O(NF^2*NR).
A final alternative approach: while the first solution is fast for a large number of columns and a small number of rows, the second is fast for a small number of columns and a large number of rows. Below is a final version that is not as fast, but is stable and gives the same timing for a file and its transpose.
awk -v n=2 '{ for(i=1;i<=NF;i+=n) {
                s=""
                for(k=0;k<n&&(i+k<=NF);++k)
                    s=s $(i+k) ((i+k==NF || ((i+k) % n == 0)) ? ORS : OFS)
                a[i,NR]=s
              }
            }
            END{for(i=1;i<=NF;i+=n)for(j=1;j<=NR;++j) printf "%s",a[i,j]}' file
My old answer is below and no longer applicable...
You can use this awk for a file that could be millions of rows or millions of columns. The basic scheme is to suck all the values into a single array then use indexing arithmetic and nested loops at the end to get the correct order:
$ cat col.awk
{
    for (i=1; i<=NF; i++) {
        vals[++numVals] = $i
    }
}
END {
    for (col_offset=0; col_offset + cols <= NF; col_offset+=cols) {
        for (i=1; i<=numVals; i+=NF) {
            for (j=0; j<cols; j++) {
                printf "%s%s", vals[i+j+col_offset], (j<cols-1 ? FS : ORS)
            }
        }
    }
}
$ awk -f col.awk -v cols=2 file
4 5
m d
t 7
h 5
...
3 4
4 1
d f
5 9
q w
My old answer is based on the dramatic slowdown seen in most of these awks with a large number of rows.
See this question for more discussion regarding the slowdown.
The comments in the original answer below are no longer applicable.
OLD ANSWER
Only here for consistency...
The awk solutions here are all good for small files. What they all have in common is that the file either needs to fit in RAM, or the OS virtual memory has to be an acceptable fallback when it does not. But with a larger file, since the runtime of the awk grows dramatically (likely quadratically or worse, see the timings below), you can get a very bad result. With a 12MB version of your file, an in-memory awk becomes unusably slow.
This is the case when there are millions of rows, not millions of columns.
The only alternative to an in-memory solution is reading the file multiple times or managing temp files yourself. (Or use a scripting language that manages VM internally, such as Perl or Python... Timur Shatland's Perl is fast even on huge files.)
awk does not have an easy mechanism to loop over a file multiple times until a process is done. You would need to use the shell to do that and invoke awk multiple times.
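A rough sketch of that shell-driven approach (my illustration, assuming bash and that every line has the same number of fields; it is not part of the benchmarks below):
cols=2
nf=$(awk 'NR==1{print NF; exit}' file)       # column count from the first line
for ((start=1; start<=nf; start+=cols)); do  # one awk pass per group of columns
    awk -v s="$start" -v n="$cols" '{
        for (i=s; i<s+n && i<=NF; i++)
            printf "%s%s", $i, (i<s+n-1 && i<NF ? OFS : ORS)
    }' file
done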
Here is a python script that reads the file line by line and prints cols columns at a time until all the original columns have been printed:
$ cat pys.py
import sys

filename = sys.argv[1]
cols = int(sys.argv[2])
offset = 0
delimiter = " "

with open(filename, "r") as f:
    max_cols = len(f.readline().split())

while offset < max_cols:
    with open(filename, "r") as f:
        for line in f:
            col_li = line.rstrip().split()
            l = len(col_li)
            max_cols = l if l > max_cols else max_cols   # track the widest line seen
            print(delimiter.join(col_li[offset:offset + cols]))
    offset += cols
It is counterintuitive, but it is often significantly faster and more efficient to read a file multiple times than it is to gulp the entire thing -- if that gulp then results in a bad result with larger data.
So how does this perform compared to one of the awks in this post? Let's time it.
Given your example, the in memory awk will likely be faster:
$ cat file
4 5 6 2 9 8 4 8
m d 6 7 9 5 4 g
t 7 4 2 4 2 5 3
h 5 6 2 5 s 3 4
r 5 7 1 2 2 4 1
4 1 9 0 5 6 d f
x c a 2 3 4 5 9
0 0 3 2 1 4 q w
$ time python pys.py file 2 >file2
real 0m0.027s
user 0m0.009s
sys 0m0.016s
$ time awk -v n=2 '{for(i=1;i<=NF;++i) {
j=int((i-1)/n); a[j] = a[j] $i (i%n==0 || i==NF ?ORS:OFS) }}
END{for(j=0;j<=int(NF/n);j++) printf "%s", a[j]}' file >file3
real 0m0.009s
user 0m0.003s
sys 0m0.003s
And that is true. BUT, let's make the file 1000x bigger with this Python script:
txt='''\
4 5 6 2 9 8 4 8
m d 6 7 9 5 4 g
t 7 4 2 4 2 5 3
h 5 6 2 5 s 3 4
r 5 7 1 2 2 4 1
4 1 9 0 5 6 d f
x c a 2 3 4 5 9
0 0 3 2 1 4 q w
'''
with open('/tmp/file', 'w') as f:
    f.write(txt*1000)  # change the 1000 to the multiple desired
# the file will have 8000 lines and be about 125KB
Rerun those timings, same way, and you get:
#python
real 0m0.061s
user 0m0.044s
sys 0m0.015s
# awk
real 0m0.050s
user 0m0.043s
sys 0m0.004s
About the same time... Now make the file BIGGER by multiplying the original by 100,000 to get 800,000 lines and 12MB and run the timing again:
# python
real 0m3.475s
user 0m3.434s
sys 0m0.038s
#awk
real 22m45.118s
user 16m40.221s
sys 6m4.652s
With a 12MB file, the in-memory method becomes essentially unusable, since the VM system on this computer is subject to massive disk swapping to manage that particular type of memory allocation. It is likely O(n^2) or worse. This computer is a 2019 Mac Pro, 16-core Xeon with 192 GB of memory, so it is not the hardware...

How can I pass a predefined variable into an awk column function?

I'd like to pass a predefined variable as the column number for an awk script. I've stripped out the unnecessary bits and below is an example of what I'd like to get done. Further below is a portion of what I've tried so far.
Reason: This is a semi-long script that currently works though I'd like to define the columns early in the script as this would make the script much easier to update as columns change.
I'd like for the "state" variable to be passed on to awk's column identifier, eg:
#!/bin/bash
export state='$6'
cat ~/file | awk -v column="$state" 'state!="FAILED"'
Running the above code produces rows that do indeed have column 6 as "FAILED", so there must be something wrong, while awk '$6!="FAILED"' works as expected.
Different things I've tried so far:
1. defining $state as 6 rather than '$6' and including the $ in the awk != command.
2. awk '{ENVIRON["state"]!="FAILED"}', with the same modifications as 1.
This should work:
state=6
cat ~/file | awk -v column="$state" '$column != "FAILED"'
$var in awk will get the field specified by the value of variable var.
So, $NF will get the last field. Note that the awk variable here is column, not state.
For example:
% seq 1 20 | paste - - - -
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
17 18 19 20
% seq 1 20 | paste - - - - | awk -v column=3 '{print $column}'
3
7
11
15
19
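Applied back to the original script, a minimal sketch (assuming the same ~/file input) would be:
#!/bin/bash
state=6   # just the column number, without a leading $
awk -v column="$state" '$column != "FAILED"' ~/file
The cat is unnecessary; awk can read the file directly.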

awk: Search missing value in file

awk newbie here! I am asking for help to solve a simple specific task.
Here is file.txt
1
2
3
5
6
7
8
9
As you can see a single number (the number 4) is missing. I would like to print on the console the number 4 that is missing. My idea was to compare the current line number with the entry and whenever they don't match I would print the line number and exit. I tried
cat file.txt | awk '{ if ($NR != $1) {print $NR; exit 1} }'
But it prints only a newline.
I am trying to learn awk via this small exercise. I am therefore mainly interested in solutions using awk. I also welcome an explanation for why my code does not do what I would expect.
Try this -
awk '{ if (NR != $1) {print NR; exit 1} }' file.txt
4
Your own version fails because $NR is not the line number but the field whose index is NR; on line 2 that means $2, which is empty here, so the comparison fires and print $NR outputs just a newline. Use the plain variable NR instead.
Since you have a solution already, here is another approach, comparing with the previous value:
awk '$1!=p+1{print p+1} {p=$1}' file
Your positional comparison won't work if there is more than one missing value.
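If several consecutive numbers can be missing, a small variation on the same idea prints them all (a sketch assuming the list is sorted and starts at 1):
awk '{ while (++p < $1) print p }' file
It prints 4 for the sample above, and would print both 4 and 5 if both were missing.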
Maybe this will help:
seq $(tail -1 file)|diff - file|grep -Po '.*(?=d)'
4
Since I am learning awk as well
awk 'BEGIN{i=0}{i++;if(i!=$1){print i;i=$1}}' file
4
`awk` explanation: `i` tracks the expected value; it is incremented line by line with `i++`, and whenever it does not match `$1` the missing number is printed and `i` is resynchronised to `$1`.
cat file
1
2
3
5
6
7
8
9
11
12
13
15
awk 'BEGIN{i=0}{i++;if(i!=$1){print i;i=$1}}' file
4
10
14

Print every second consecutive field in two columns - awk

Assume the following file
#zvview.exe
#begin Present/3
77191.0000 189.320100 0 0 3 0111110 16 1
-8.072430+6-8.072430+6 77190 0 1 37111110 16 2
37 2 111110 16 3
8.115068+6 0.000000+0 8.500000+6 6.390560-2 9.000000+6 6.803440-1111110 16 4
9.500000+6 1.685009+0 1.000000+7 2.582780+0 1.050000+7 3.260540+0111110 16 5
37 2 111110 16 18
What I would like to do is print, in two columns, the fields of lines 6 and 7. This can be done using NR. The tricky part is the following: every second field should go into the second column, and an E should be added before the exponent sign, so that the output file will look like this
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
From the expected output you can see that in $6 I want to keep only the first 10 characters (length($6)=10).
How is it possible to do it in awk?
You can do it all in awk, but it's perhaps easier with the unix toolset:
$ sed -n '6,7p' file | cut -c2-66 | tr ' ' '\n' | pr -2ats' '
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
Here is an awk-only solution for comparison:
$ awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
for(i=1;i<=6;i+=2) {f[++c]=$i;s[c]=$(i+1)}}
END{for(i=1;i<=c;i++) print f[i],s[i]}' file
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
Perhaps a shorter version:
$ awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
for(i=1;i<=6;i+=2) print $i FS $(i+1)}' file
8.115068+6 0.000000+0
8.500000+6 6.390560-2
9.000000+6 6.803440-1
9.500000+6 1.685009+0
1.000000+7 2.582780+0
1.050000+7 3.260540+0
To convert the format to standard scientific notation, you can pipe the result to sed, or embed something similar in the awk script (using gsub).
... | sed 's/[+-]/E&/g'
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
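For example, embedding that conversion directly in the shorter awk version above (a sketch using gsub on each assembled pair; note it would also put an E before a leading minus sign, which does not occur on lines 6 and 7):
awk 'NR>=6 && NR<=7{$6=substr($6,1,10);
     for(i=1;i<=6;i+=2) { v=$i FS $(i+1); gsub(/[+-]/,"E&",v); print v }}' file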
With GNU awk for FIELDWIDTHS:
$ cat tst.awk
BEGIN { FIELDWIDTHS="9 2 9 2 9 2 9 2 9 2 9 2" }
NR>5 && NR<8 {
for (i=1;i<NF;i+=4) {
print $i "E" $(i+1), $(i+2) "E" $(i+3)
}
}
$ awk -f tst.awk file
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0
If you really want to get rid of the leading blanks then there are various ways to do it (simplest being gsub(/ /,"",$<field number>) on the relevant fields), but I left them in because the above allows your output to line up properly if/when your numbers start with a -, like they do on line 4 of your sample input.
If you don't have GNU awk, get it as you're missing a LOT of extremely useful functionality.
I tried to combine @karafka's answer with substr, and the following does the trick!
awk 'NR>=6 && NR<=7{$6=substr($6,1,10);for(i=1;i<=6;i+=2) print substr($i,1,8) "E" substr($i,9) FS substr($(i+1),1,8) "E" substr($(i+1),9)}' file
and the output is
8.115068E+6 0.000000E+0
8.500000E+6 6.390560E-2
9.000000E+6 6.803440E-1
9.500000E+6 1.685009E+0
1.000000E+7 2.582780E+0
1.050000E+7 3.260540E+0

move certain columns to end using awk

I have a large tab-delimited file with 1000 columns. I want to rearrange it so that certain columns are moved to the end.
Could anyone help using awk?
Example input:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Move columns 5,6,7,8 to the end.
Output:
1 2 3 4 9 10 11 12 13 14 15 16 17 18 19 20 5 6 7 8
This prints columns 1 to a, then b to the last, and then columns a+1 to b-1:
$ awk -v a=4 -v b=9 '{for (i=1;i<=NF;i+=i==a?b-a:1) {printf "%s\t",$i};for (i=a+1;i<b;i++) {printf "%s\t",$i};print""}' file
1 2 3 4 9 10 11 12 13 14 15 16 17 18 19 20 5 6 7 8
The columns are moved in this way for every line in the input file, however many lines there are.
How it works
-v a=4 -v b=9
This defines the variables a and b which determine the limits on which columns will be moved.
for (i=1;i<=NF;i+=i==a?b-a:1) {printf "%s\t",$i}
This prints all columns except the ones from a+1 to b-1.
In this loop, i is incremented by one except when i==a in which case it is incremented by b-a so as to skip over the columns to be moved. This is done with awk's ternary statement:
i += i==a ? b-a : 1
+= simply means "add to." i==a ? b-a : 1 is the ternary statement. The value that it returns depends on whether i==a is true or false. If it is true, the value before the colon is returned. If it is false, the value after the colon is returned.
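For instance, with a=4 and b=9, when i reaches 4 the increment jumps it straight to 9 (my illustration):
$ awk 'BEGIN{ a=4; b=9; i=4; i += (i==a ? b-a : 1); print i }'
9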
for (i=a+1;i<b;i++) {printf "%s\t",$i}
This prints columns a+1 to b-1.
print""
This prints a newline character to end the line.
Alternative solution that avoids printf
This approach assembles the output into the variable out and then prints with a plain print command, avoiding printf and the need for percent signs:
awk -v a=4 -v b=9 '{out="";for (i=1;i<=NF;i+=i==a?b-a:1) out=out $i"\t";for (i=a+1;i<b;i++) out=out $i "\t";print out}' file
One way to rearrange 2 columns ($5 becomes $20 and $20 becomes $5), the rest staying unchanged:
$ awk '{x=$5; $5=$20; $20=x; print}' file.txt
For 4 columns (here swapping $5 with $20 and $9 with $10):
$ awk '{
    x=$5; $5=$20; $20=x;
    y=$9; $9=$10; $10=y;
    print
}' file.txt
My approach:
awk 'BEGIN{ f[5];f[6];f[7];f[8] } \
{ for(i=1;i<=NF;i++) if(!(i in f)) printf "%s\t", $i; \
for(c in f) printf "%s\t", $c; printf "\n"} ' file
It's split into 3 parts:
The BEGIN{} part determines which fields should be moved to the end; the indexes of the array f are the fields to move. In the example it's 5, 6, 7 and 8.
Cycle through every field (it doesn't matter if there are 1000 fields or more) and check whether it is in the array. If not, print it.
Now we need the skipped fields. Cycle through the f array and print those values.
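One caveat: for (c in f) does not iterate in any guaranteed order in awk, so the moved columns may not come out as 5 6 7 8. If the columns to move are a contiguous range and their order matters, an explicit loop (a sketch, my own variation) is safer:
awk -v first=5 -v last=8 '{
    for (i=1; i<=NF; i++)
        if (i<first || i>last) printf "%s\t", $i   # keep everything outside the range
    for (i=first; i<=last; i++)
        printf "%s%s", $i, (i<last ? "\t" : "\n")  # append the moved range at the end
}' file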
Another way in awk
Switch last A-B with last N fields
awk -vA=4 -vB=8 '{x=B-A;for(i=A;i<=B;i++){y=$i;$i=$(t=(NF-x--));$t=y}}1' file
Put N fields from the end into position A
awk -vA=3 -vB=8 '{split($0,a," ");x=A++;while(x++<B)$x=a[NF-(B-x)];while(B++<NF)$B=a[A++]}1' file