awk: remove mirrored duplicates from 2 columns

Question:
I want a list of the unique combinations between two fields in a data frame.
Example data:
A B
C D
E F
B A
C F
E F
I would like to get 4 unique combinations as the result: AB, CD, EF, and CF. Since AB and BA contain the same components in a different order, I only want one copy (it is a mutual relationship, so BA is the same thing as AB).
Attempt:
So far I have tried sorting and keeping unique lines:
sort file | uniq
but of course that produces 5 combinations:
A B
C D
E F
B A
C F
I do not know how to approach AB/BA being considered the same. Any suggestions on how to do this?

The idiomatic awk approach is to order the index parts:
$ awk '!seen[$1>$2 ? $1 FS $2 : $2 FS $1]++' file
A B
C D
E F
C F

Another awk one-liner:
awk '!a[$1,$2] && !a[$2,$1]++' file
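If the short-circuit logic is hard to follow, here is a spelled-out equivalent (a sketch of the same idea, not the answerer's exact code): print only when neither orientation of the pair has been recorded, and mark the reversed pair so the mirror is skipped later:
awk '{ if (!a[$1,$2] && !a[$2,$1]) { a[$2,$1]=1; print } }' file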

In awk:
$ awk '($1$2 in a){next}{a[$1$2];a[$2$1]}1' file
A B
C D
E F
C F
Explained:
($1$2 in a) { next } # if duplicate in hash, next record
{ a[$1$2]; a[$2$1] } 1 # hash reverse also and output
It works for single-char fields. If you want to use it for longer strings, add FS between the fields, like a[$1 FS $2] etc. (thanks @EdMorton).
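A quick illustration of the collision with hypothetical multi-character fields: "AB C" and "A BC" both produce the concatenated key "ABC", so the plain version wrongly drops the second line, while FS-separated keys stay distinct:
$ printf 'AB C\nA BC\n' | awk '($1$2 in a){next}{a[$1$2];a[$2$1]}1'
AB C
$ printf 'AB C\nA BC\n' | awk '(($1 FS $2) in a){next}{a[$1 FS $2];a[$2 FS $1]}1'
AB C
A BC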

Remove empty columns by awk

I have an input file which is tab-delimited, and I want to remove all empty columns. The empty columns are: $13, $14, $15, $84, $85, $86, $87, $88, $89, $91, $94.
INPUT: tsv file with more than 90 columns
a b d e g...
a b d e g...
OUTPUT: tsv file without empty columns
a b d e g....
a b d e g...
Thank you
This might be what you want:
$ printf 'a\tb\tc\td\te\n'
a	b	c	d	e
$ printf 'a\tb\tc\td\te\n' | awk 'BEGIN{FS=OFS="\t"} {$2=$4=""} 1'
a		c		e
$ printf 'a\tb\tc\td\te\n' | awk 'BEGIN{FS=OFS="\t"} {$2=$4=RS; gsub("(^|"FS")"RS,"")} 1'
a	c	e
The first awk command only empties the fields, leaving their tab separators behind; the second deletes the fields entirely.
Note that the above doesn't remove all empty columns as some potential solutions might do, it only removes exactly the column numbers you want removed:
$ printf 'a\tb\t\td\te\n'
a	b		d	e
$ printf 'a\tb\t\td\te\n' | awk 'BEGIN{FS=OFS="\t"} {$2=$4=RS; gsub("(^|"FS")"RS,"")} 1'
a		e
Note that the originally empty third column survives (as the empty field between the two tabs); only columns 2 and 4 were deleted.
remove ALL empty columns:
If you have a tab-delimited file with empty columns and you want to remove all of them, you necessarily have runs of consecutive tabs. Hence you can just squeeze those runs to a single tab, and then delete the leading tab that remains if the first column was one of the removed ones:
sed 's/\t\+/\t/g;s/^\t//' <file>
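For example (GNU sed, which understands \t in patterns):
$ printf 'a\t\t\tb\t\tc\n' | sed 's/\t\+/\t/g;s/^\t//'
a	b	c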
remove SOME columns: see Ed Morton's answer above, or just use cut (--complement is a GNU extension):
cut --complement -f 13,14,15,84,85,86,87,88,89,91,94 <file>
remove selected columns if and only if they are empty:
Basically a simple adaptation of Ed Morton's approach. Note that col must be passed with -v so it is already assigned when the BEGIN block runs; a trailing col=... operand would be assigned only after BEGIN, so split would see an empty string:
awk -v col=13,14,15,84,85,86,87,88,89,91,94 '
    BEGIN{FS=OFS="\t"; n=split(col,a,",")}
    { for(i=1;i<=n;++i) if ($a[i]=="") $a[i]=RS; gsub("(^|"FS")"RS,"") }
    1' <file>
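A quick sanity check on a smaller case (hypothetical input; column 2 is empty and gets removed, column 4 is non-empty and is kept):
$ printf '1\t\t3\t4\n' | awk -v col=2,4 'BEGIN{FS=OFS="\t"; n=split(col,a,",")}
  { for(i=1;i<=n;++i) if ($a[i]=="") $a[i]=RS; gsub("(^|"FS")"RS,"") } 1'
1	3	4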

Awk - store line that matched range pattern start

I use awk to operate on lines within a range, but I need to use the line that matched the start of the range in my action.
Now I am doing this:
awk '/BANANA/,/END/ {if ($0 ~ /BANANA/) line=$0; print line, $2}' infile.txt
Is there a more elegant way of doing this? A way that does not require me to store $0 at the beginning of the range? Does awk keep this line somewhere?
Thanks and best regards
EDIT (added samples):
infile.txt
few
r t y u i
few
BANANA
a b c d
a b c d
a b c d
a b c d
a b c d
a b c d
a b c d
END
r t y u i
ewqf
few
r t y u i
few
r t y u i
f
expected output
BANANA
BANANA b
BANANA b
BANANA b
BANANA b
BANANA b
BANANA b
BANANA b
BANANA
Never use a range expression as they make trivial tasks very slightly briefer but then need a complete rewrite when you need to do anything the slightest bit more interesting. Always use a flag instead. Instead of:
awk '/BANANA/,/END/ { do something }' infile.txt
you should write:
awk '/BANANA/{f=1} f{ do something } /END/{f=0} ' infile.txt
and then to enhance that to do what you want now is simply:
awk '/BANANA/{f=1; line=$0} f{ print line, $2 } /END/{f=0} ' infile.txt
and any other changes (e.g. skip first line, skip last line, etc.) are equally trivial.
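For instance, excluding the BANANA and END delimiter lines themselves only takes reordering the blocks (a sketch, assuming the same infile.txt):
awk '/END/{f=0} f{print line, $2} /BANANA/{f=1; line=$0}' infile.txt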
The only "trick" I can suggest in your case is "assignment in condition":
awk '/BANANA/ && (r=$0),/END/{ print r, $2 }' infile.txt
(r=$0) assigns the current record (the BANANA line) to variable r once, when the range starts, thereby avoiding the per-record check if ($0 ~ /BANANA/) inside the range. The assignment doubles as the range-start condition: it is true because the assigned string is non-empty.
The output:
BANANA
BANANA b
BANANA b
BANANA b
BANANA b
BANANA b
BANANA b
BANANA b
BANANA

Compare columns in 2 files, append data for shared items and print the non shared items of the first file

I have found questions similar to mine, but none helped me with my specific problem (and I'm not quite sure whether there actually is such an easy solution...).
I have two files:
file1:
a b c
d e f
g h i
file2:
a b x y z
d e x
f h i
Desired Output:
a b c x y z
d e f x
g h i
So, I want all the rows and columns from file 1 and additionally, if there is a match of the first two columns in file 2, I want to append the rest of those columns (from file 2) to the ones in file 1 and write it in a new file.
I have tried with awk, but so far I have only managed to append the columns for the rows that have a match; the other rows (in my example the "g,h,i" row) are not printed.
Another issue seems to be that the rows in file 2 do not always have the same number of columns.
Does anyone have an idea how to solve this?
Thank you!
Here is another awk:
awk '{k=$1 FS $2}
NR==FNR {sub(k,"",$0); a[k]=$0; next}
k in a {$0 = $0 a[k]}1' file2 file1
a b c x y z
d e f x
g h i
Note the order of the files: file2 is read first, so the lookup table is complete before file1 is processed.
Use the following approach:
awk 'FNR==NR{k=$1$2; $1=$2=""; a[k]=$0; next}
{ if($1$2 in a){print $0a[$1$2] } else print $0}' file2 file1 | tr -s ' '
The output:
a b c x y z
d e f x
g h i
FNR==NR - true only while the first file given (file2) is being read
k=$1$2 - k is the key of an associative array which accumulates the column values of each second-file row except the first two (they become the key). For example, for the first file2 line the array will be indexed as a['ab']=' x y z'
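One caveat with k=$1$2: concatenating the fields with no separator can make different pairs collide. A tiny demonstration with hypothetical rows (the FS-separated key used in the first answer avoids this):
$ printf 'a bc 1\nab c 2\n' | awk '{print ($1$2), ($1 FS $2)}'
abc a bc
abc ab c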

Print Specific Columns using AWK

I'm trying to fetch the data from columns B to D of a tab-delimited file "FILE". The simple awk code I use fetches the data, but unfortunately prints it in a single column and removes the identifiers (shown below).
Any suggestions please.
CODE
awk '{for(i=2;i<=4;++i)print $i}' FILE
FILE
A B C D E F G
1_at 10.8435630935 10.8559287854 8.6666141543 8.820310681 9.9024050571 8.613199083 11.9807771094
2_at 4.7615531106 4.5209119307 11.2467919586 8.8105151099 7.1831990104 11.0645055836 4.3726598561
3_at 6.0025262754 5.4058080843 3.2475272982 3.1869728585 3.5654989547
OUTPUT OBTAINED
B
C
D
10.8435630935
10.8559287854
8.6666141543
4.7615531106
4.5209119307
11.2467919586
6.0025262754
5.4058080843
3.2475272982
Why don't you directly use cut?
$ cut -d$'\t' -f2-4 < file
B C D
10.8435630935 10.8559287854 8.6666141543
4.7615531106 4.5209119307 11.2467919586
6.0025262754 5.4058080843 3.2475272982
With awk you would need printf to avoid the newline that print appends after each field:
awk -F"\t" '{for(i=2;i<=4;++i) printf "%s%s", $i, (i==4?RS:FS)}' FILE

Split large single column into two columns

I need to split a single column of data in a large file into two columns as follows:
A
B          B A
C  ---->   D C
D          F E
E          H G
F
G
H
Is there an easy way of doing it with unix shell commands and/or small shell script? awk?
$ awk 'NR%2{s=$0;next} {print $0,s}' file
B A
D C
F E
H G
You can use the following awk script:
awk 'NR % 2 != 0 {cache=$0}; NR % 2 == 0 {print $0 cache}' data.txt
Output:
BA
DC
FE
HG
It caches each odd line and, for every even line, prints the line with the cache appended (note the concatenation has no separator, hence BA rather than B A).
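If the pairs are instead wanted in their original order, and the file might end on an unpaired line, a small variant of the same caching idea works (a sketch):
$ awk 'NR%2{s=$0; next} {print s, $0} END{if (NR%2) print s}' file
A B
C D
E F
G H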
I know this is tagged awk, but I just can't stop myself from posting a sed solution, since the question left it open for an "easy way ... with unix shell commands":
$ sed -n 'h;n;G;s/\n/ /g;p' data.txt
B A
D C
F E
H G