Awk - store line that matched range pattern start

I use awk to operate on lines within a range, but I need to use the line that matched the range-start pattern in my action.
Now I am doing this:
awk '/BANANA/,/END/ {if ($0 ~ /BANANA/) line=$0; print line, $2}' infile.txt
Is there a more elegant way of doing this? A way that does not require me to store $0 at the beginning of the range? Does awk keep this line somewhere?
Thanks and best regards
EDIT (added samples):
infile.txt
few
r t y u i
few
BANANA
a b c d
a b c d
a b c d
a b c d
a b c d
a b c d
a b c d
END
r t y u i
ewqf
few
r t y u i
few
r t y u i
f
expected output
BANANA
BANANA b
BANANA b
BANANA b
BANANA b
BANANA b
BANANA b
BANANA b
BANANA

Never use a range expression: it makes trivial tasks very slightly briefer, but then it needs a complete rewrite as soon as you want to do anything the slightest bit more interesting. Always use a flag instead. Instead of:
awk '/BANANA/,/END/ { do something }' infile.txt
you should write:
awk '/BANANA/{f=1} f{ do something } /END/{f=0} ' infile.txt
and then enhancing that to do what you want now is simply:
awk '/BANANA/{f=1; line=$0} f{ print line, $2 } /END/{f=0} ' infile.txt
and any other changes (e.g. skip first line, skip last line, etc.) are equally trivial.
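For instance, a minimal sketch that prints only the lines strictly between the markers (skipping both the BANANA line and the END line) just reorders the blocks:
awk '/END/{f=0} f{print line, $2} /BANANA/{f=1; line=$0}' infile.txt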

The only "trick" I can suggest in your case is "assignment in condition":
awk '/BANANA/ && (r=$0),/END/{ print r, $2 }' infile.txt
(r=$0) assigns the current record (i.e. the BANANA line) to the variable r, and it only happens when the range-start pattern matches, thereby avoiding the check if ($0 ~ /BANANA/) on every record within the range.
The output:
BANANA
BANANA b
BANANA b
BANANA b
BANANA b
BANANA b
BANANA b
BANANA b
BANANA
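One caveat with the assignment-in-condition trick (my note, not part of the answer above): the range only opens if the whole expression evaluates to true, so if the line starting the range could itself be false in awk (an empty line, or a lone 0 with a different start pattern), a more defensive spelling is:
awk '/BANANA/ && ((r=$0) || 1),/END/{ print r, $2 }' infile.txt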

Find repeat in one column then subtract value in another column

My input file columns are:
a Otu1 w 4
b Otu1 x 1
c Otu2 y 12424
d Otu3 z 1756
For each repeated value in the second column, I want to subtract the corresponding values in the fourth column. My desired output would be:
a Otu1 w 3
c Otu2 y 12424
d Otu3 z 1756
I have tried the following awk script on a small file with two columns:
a 3
a 1
b 4
awk '$1 in a{print $1, a[$1]-$2} {a[$1]=$2}' small_input_file
This gives me only the subtracted value:
a 2
How can I modify this script for my input file with four columns?
Thanks.
A double-scan algorithm won't care how many records there are or whether they are consecutive (the file{,} below is shell brace expansion for file file, i.e. the same file is read twice):
$ awk 'NR==FNR {a[$2]=$2 in a?a[$2]-$4:$4; next}
!b[$2]++ {print $1,$2,$3,a[$2]}' file{,}
a Otu1 w 3
c Otu2 y 12424
d Otu3 z 1756
Here is a single pass that outputs in awk default order:
$ awk '{
    if ($2 in a)                  # current $2 seen before
        b[$2] -= $4               # subtract $4
    else {                        # first time this $2 is seen
        a[$2] = $0                # store the record
        b[$2] = $4                # and $4, keyed by $2
    }
}
END {                             # after processing
    for (i in a) {                # iterate all stored records
        sub(/[^ ]+$/, b[i], a[i]) # replace the last field with the computed value
        print a[i]                # output
    }
}' file
The output order appears random, because awk's for (i in a) traversal order is unspecified:
d Otu3 z 1756
a Otu1 w 3
c Otu2 y 12424
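If you need a deterministic order and have GNU awk, one option (not part of the original answer) is to set PROCINFO["sorted_in"] so the for (i in a) loop walks the keys in sorted order, which for this sample happens to match the input order:
gawk '{ if ($2 in b) b[$2]-=$4; else { a[$2]=$0; b[$2]=$4 } }
      END { PROCINFO["sorted_in"] = "@ind_str_asc"        # GNU awk extension: key-sorted for-in
            for (i in a) { sub(/[^ ]+$/, b[i], a[i]); print a[i] } }' file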

awk remove mirrored duplicates from 2 columns

Big question:
I want a list of the unique combinations between two fields in a data frame.
Example data:
A B
C D
E F
B A
C F
E F
I would like to be able to get the result of 4 unique combinations: AB, CD, EF, and CF. Since AB and BA contain the same components but in a different order, I only want one copy (it is a mutual relationship, so BA is the same thing as AB).
Attempt:
So far I have tried sorting and keeping unique lines:
sort file | uniq
but of course that produces 5 combinations:
A B
C D
E F
B A
C F
I do not know how to approach AB/BA being considered the same. Any suggestions on how to do this?
The idiomatic awk approach is to order the index parts:
$ awk '!seen[$1>$2 ? $1 FS $2 : $2 FS $1]++' file
A B
C D
E F
C F
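Spelled out, the same idea reads like this (a sketch for readability, not a different algorithm):
awk '{
    key = ($1 > $2 ? $1 FS $2 : $2 FS $1)   # build the key with the fields in a fixed order
    if (!seen[key]++)                        # true only the first time this unordered pair appears
        print
}' file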
Another bit of awk magic:
awk '!a[$1,$2] && !a[$2,$1]++' file
In awk:
$ awk '($1$2 in a){next}{a[$1$2];a[$2$1]}1' file
A B
C D
E F
C F
Explained:
($1$2 in a) { next } # if duplicate in hash, next record
{ a[$1$2]; a[$2$1] } 1 # hash reverse also and output
It works for single-character fields. If you want to use it for longer strings, add FS between the fields, like a[$1 FS $2] etc. (thanks @EdMorton).
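For instance, a multi-character-safe spelling of that one-liner would be (an untested sketch along the lines of that comment):
awk '(($1 FS $2) in a){next} {a[$1 FS $2]; a[$2 FS $1]} 1' file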

Compare columns in 2 files, append data for shared items and print the non shared items of the first file

I have found questions similar to mine, but none helped me with my specific problem (and I'm not quite sure whether there actually is an easy solution).
I have two files:
file1:
a b c
d e f
g h i
file2:
a b x y z
d e x
f h i
Desired Output:
a b c x y z
d e f x
g h i
So, I want all the rows and columns from file 1 and additionally, if there is a match of the first two columns in file 2, I want to append the rest of those columns (from file 2) to the ones in file 1 and write it in a new file.
I have tried awk, but so far I have only managed to append the columns for the rows that have a match; the other ones (in my example the "g,h,i" row) are not printed.
Another issue seems to be that the lines in file 2 do not always have the same number of columns.
Does anyone have an idea how to solve this?
Thank you!
Here is another awk approach:
awk '{k=$1 FS $2}
NR==FNR {sub(k,"",$0); a[k]=$0; next}
k in a {$0 = $0 a[k]}1' file2 file1
a b c x y z
d e f x
g h i
note the order of the files.
Use the following approach:
awk 'FNR==NR{k=$1$2; $1=$2=""; a[k]=$0; next}
{ if($1$2 in a){print $0a[$1$2] } else print $0}' file2 file1 | tr -s ' '
The output:
a b c x y z
d e f x
g h i
FNR==NR - ensures the block only runs while reading the first input file (file2 here)
k=$1$2 - k is the key for an associative array that accumulates all the column values from the second file except the first two columns (which form the key). For example, for the first file2 line the array will be indexed as a['ab']='x y z'
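A variant of the same idea (my sketch, assuming default space-separated input) that strips the key fields with sub() instead of blanking them, so the tr -s ' ' post-processing is not needed:
awk 'FNR==NR { k=$1 FS $2; sub(/^[^ ]+ +[^ ]+/, "", $0); a[k]=$0; next }
     { print $0 (($1 FS $2) in a ? a[$1 FS $2] : "") }' file2 file1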

Check which string in certain column is repeated

I want to see which string in my column 2 is repeated.
For example:
a apple
b peach
c grape
d peach
e peach
f apple
My output would be:
a apple
f apple
b peach
d peach
e peach
That is, showing the whole line for every value that is repeated in the second column.
If you do not want to store the whole file in memory, the best approach is to read the file twice.
$ awk 'FNR==NR {a[$2]++; next} a[$2]>1' file file
a apple
b peach
d peach
e peach
f apple
The first pass counts how many times each column-2 value appears; the second pass prints the rows whose second column was counted at least twice.
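Spelled out with comments (the same two-pass program, just expanded for readability):
awk '
    FNR==NR { count[$2]++; next }   # first pass: count how often each column-2 value appears
    count[$2] > 1                   # second pass: print lines whose column-2 value repeats
' file file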
As Jonathan Leffler suggests, to reproduce the exact output you expect, just pipe to sort, telling it to sort first by column 2 and then by column 1:
awk 'FNR==NR {a[$2]++; next} a[$2]>1' file file | sort -k2,2 -k1
A perl solution that doesn't read the file twice:
perl -lane 'push @{$s{$F[1]}},$_;
END{
do{print join "\n", @{$s{$_}} if scalar(@{$s{$_}})>1}for(keys %s)
}' file
This goes through the file and keeps each line in a hash whose key is the 2nd field and whose values are lists of lines. Then, at the end, it prints the lists whose key was seen more than once.
With GNU awk for true 2D arrays:
gawk '
{ vals[$2][++cnt[$2]] = $0 }
END {
for (fruit in vals)
if (cnt[fruit] > 1)
for (i=1; i<=cnt[fruit]; i++)
print vals[fruit][i]
}
' file
a apple
f apple
b peach
d peach
e peach

Split large single column into two columns

I need to split a single column of data in a large file into two columns as follows:
Input:
A
B
C
D
E
F
G
H
Desired output:
B A
D C
F E
H G
Is there an easy way of doing it with unix shell commands and/or a small shell script? awk?
$ awk 'NR%2{s=$0;next} {print $0,s}' file
B A
D C
F E
H G
You can use the following awk script:
awk 'NR % 2 != 0 {cache=$0}; NR % 2 == 0 {print $0 cache}' data.txt
Output:
BA
DC
FE
HG
It caches each odd line and, on each even line, prints the even line with the cached odd line appended (with no separator, hence the concatenated output).
I know this is tagged awk, but I just can't stop myself from posting a sed solution, since the question left it open for "easy way . . . with unix shell commands":
$ sed -n 'h;n;G;s/\n/ /g;p' data.txt
B A
D C
F E
H G
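And since the question allows generic unix shell commands, a paste-based sketch does the same pairing; the awk stage restores the even-line-first order shown above:
paste -d' ' - - < data.txt | awk '{print $2, $1}'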