Split large single column into two columns - awk

I need to split a single column of data in a large file into two columns as follows:
A
B          B A
C  ---->   D C
D          F E
E          H G
F
G
H
Is there an easy way of doing it with unix shell commands and/or a small shell script? awk?

$ awk 'NR%2{s=$0;next} {print $0,s}' file
B A
D C
F E
H G
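If the file can have an odd number of lines, the one-liner above silently drops the leftover line; a variant that flushes it in an END block (assuming you want the leftover printed on its own) would be:
$ awk 'NR%2{s=$0; next} {print $0, s} END{if(NR%2) print s}' file
For the even-length sample above the output is unchanged; with a trailing ninth line, that line would be printed by itself at the end.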

You can use the following awk script:
awk 'NR % 2 != 0 {cache=$0}; NR % 2 == 0 {print $0, cache}' data.txt
Output:
B A
D C
F E
H G
It caches the value of each odd line and, on each even line, prints the line followed by the cached value.

I know this is tagged awk, but I just can't stop myself from posting a sed solution, since the question left it open for "easy way . . . with unix shell commands":
$ sed -n 'h;n;G;s/\n/ /g;p' data.txt
B A
D C
F E
H G
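Since the question leaves the door open to any unix shell command, paste can also pair the lines up, with a small awk swap to get the requested order (a sketch, assuming single-word whitespace-separated values):
$ paste -d ' ' - - < data.txt | awk '{print $2, $1}'
B A
D C
F E
H G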

Related

Move every first row down to the second row

I have the following sample text:
a b c
x_y_
d e f
x_y_
g h i
x_y_
k l m
x_y_
I need it to be formatted as follows:
x_y_ a b c
x_y_ d e f
x_y_ g h i
x_y_ k l m
Using sed, awk or something else in bash, how do we accomplish this?
Another awk:
$ awk 'NR%2==0{print $0,p}{p=$0}' file
Output:
x_y_ a b c
x_y_ d e f
x_y_ g h i
x_y_ k l m
Explained:
$ awk '
NR%2==0 {          # on every even numbered record
    print $0,p     # output current record and previous
}{
    p=$0           # buffer record for next round
}' file
Update:
In case of an odd number of records (handled mostly due to the peer pressure :), you need to deal with the left-over x y z line:
$ awk 'NR%2==0{print $0,p}{p=$0}END{if(NR%2)print}' file
Output:
...
x_y_ g h i
x_y_ k l m
x y z
With sed:
sed -E 'N;s/(.*)\n(.*)/\2 \1/g' sample.txt
a short pipeline:
tac file | paste -d ' ' - - | tac
$ awk 'NR%2{s=$0; next} {print $0, s}' file
x_y_ a b c
x_y_ d e f
x_y_ g h i
x_y_ k l m
1st solution: using paragraph mode (RS="") with newline as the field separator, so every line of the block becomes a field (tested with GNU awk):
awk -v RS="" -v FS="\n" '{for(i=2;i<=NF;i+=2){printf("%s\n",$i OFS $(i-1))}}' Input_file
OR(with print):
awk -v RS="" -v FS="\n" '{for(i=2;i<=NF;i+=2){print $i,$(i-1)}}' Input_file
2nd solution: on every line number that is evenly divisible by 2, print the current line followed by the previous one. If the total number of lines in Input_file is odd, the last remaining line is printed as well (by checking whether the prev variable is still set in the END block).
awk 'prev && FNR%2==0{print $0 OFS prev;prev="";next} {prev=$0} END{if(prev){print prev}}' Input_file
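The 2nd solution written out with comments (a restatement of the same code, no behaviour change):
awk '
prev && FNR%2==0 {            # even-numbered line, and an odd line is buffered
    print $0 OFS prev         # current (even) line first, buffered (odd) line second
    prev=""                   # clear the buffer
    next
}
{ prev=$0 }                   # buffer the odd-numbered line
END {
    if (prev) { print prev }  # odd total line count: print the leftover line
}' Input_file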
Output will be as follows.
x_y_ a b c
x_y_ d e f
x_y_ g h i
x_y_ k l m
This might work for you (GNU sed):
sed '1~2{h;d};G;s/\n/ /' file
Save odd-numbered lines in the hold space, append them to the even-numbered lines, and replace the intervening newline with a space.
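For readers less familiar with sed, the same script spread over lines with comments (GNU sed; the 1~2 first~step address is a GNU extension):
sed '
# odd-numbered lines (GNU address 1~2): copy to the hold space, then delete
1~2{h;d}
# even-numbered lines: append the held odd line (G), turn the newline into a space
G
s/\n/ /
' file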
Another variation (portable, without the GNU-only 1~2 address):
sed -n 'h;n;G;s/\n/ /p' file
There are many more ways to achieve this, as can be seen from the answers above.
How about this:
parallel -N2 echo "{2} {1}" :::: file
See the GNU parallel documentation for details on parallel.

awk remove mirrored duplicates from 2 columns

Big question:
I want a list of the unique combinations between two fields in a data frame.
Example data:
A B
C D
E F
B A
C F
E F
I would like to get a result of 4 unique combinations: AB, CD, EF, and CF. Since AB and BA contain the same components, just in a different order, I only want one copy (it is a mutual relationship, so BA is the same thing as AB).
Attempt:
So far I have tried sorting and keeping unique lines:
sort file | uniq
but of course that produces 5 combinations:
A B
C D
E F
B A
C F
I do not know how to approach AB/BA being considered the same. Any suggestions on how to do this?
The idiomatic awk approach is to order the index parts:
$ awk '!seen[$1>$2 ? $1 FS $2 : $2 FS $1]++' file
A B
C D
E F
C F
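The ternary just normalizes each pair into one canonical order before it is used as the array index; spelled out (same behaviour):
$ awk '{key = ($1 > $2) ? $1 FS $2 : $2 FS $1} !seen[key]++' file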
Another bit of awk magic:
awk '!a[$1,$2] && !a[$2,$1]++' file
In awk:
$ awk '($1$2 in a){next}{a[$1$2];a[$2$1]}1' file
A B
C D
E F
C F
Explained:
($1$2 in a) { next }     # if the pair is already hashed, skip to the next record
{ a[$1$2]; a[$2$1] } 1   # hash both orders, then print the record (the trailing 1)
It works for single-character fields. If you want to use it for longer strings, add FS between the fields, like a[$1 FS $2] etc. (thanks @EdMorton).
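With that change applied, a version that also handles multi-character fields would look like this (a sketch of the suggestion above):
$ awk '(($1 FS $2) in a){next} {a[$1 FS $2]; a[$2 FS $1]} 1' file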

Compare columns in 2 files, append data for shared items and print the non shared items of the first file

I have found questions similar to mine, but none helped me with my specific problem (and I'm not quite sure whether there actually is an easy solution...).
I have two files:
file1:
a b c
d e f
g h i
file2:
a b x y z
d e x
f h i
Desired Output:
a b c x y z
d e f x
g h i
So, I want all the rows and columns from file 1 and additionally, if there is a match of the first two columns in file 2, I want to append the rest of those columns (from file 2) to the ones in file 1 and write it in a new file.
I have tried awk, but so far I have only managed to append the columns for the rows that have a match; the other rows (in my example the "g h i" row) are not printed.
Another issue is that the lines in file 2 do not always have the same number of columns.
Does anyone have an idea how to solve this?
Thank you!
Here is another awk:
awk '{k=$1 FS $2}
NR==FNR {sub(k,"",$0); a[k]=$0; next}
k in a {$0 = $0 a[k]}1' file2 file1
a b c x y z
d e f x
g h i
Note the order of the files: NR==FNR is true only while reading the first file on the command line (file2 here), which is where the lookup array is built.
Use the following approach:
awk 'FNR==NR{k=$1$2; $1=$2=""; a[k]=$0; next}
{ if($1$2 in a){print $0a[$1$2] } else print $0}' file2 file1 | tr -s ' '
The output:
a b c x y z
d e f x
g h i
FNR==NR - is true only while reading the first file on the command line (file2 here).
k=$1$2 - k is the key for an associative array that stores everything from a file2 line except its first two columns (they become the key). For example, for the first file2 line the array entry is roughly a['ab']='  x y z'.
While reading file1, if the key $1$2 exists in the array, the stored remainder is appended to the line; otherwise the line is printed unchanged. The trailing tr -s ' ' squeezes the extra spaces left behind by blanking $1 and $2.
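The same one-liner written out with comments (a restatement, not a different approach):
awk '
FNR==NR {                    # while reading file2
    k = $1 $2                # build the key from the first two columns
    $1 = $2 = ""             # blank them so $0 keeps only the extra columns
    a[k] = $0                # remember the remainder under that key
    next
}
{                            # now reading file1
    if ($1 $2 in a) print $0 a[$1 $2]   # match: append the stored remainder
    else            print $0            # no match: print the line unchanged
}' file2 file1 | tr -s ' '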

Random selection of ids from a file

I have a text file in the following format; the letters are ids separated by spaces.
OG1: A B C D E
OG2: C F G D R
OG3: A D F F F
I would like to randomly extract one id from each group as
OG1: E
OG2: D
OG3: A
I tried using
shuf -n 1 data.txt
which gives me
OG2: C F G D R
awk to the rescue!
$ awk -v seed=$RANDOM 'BEGIN{srand(seed)} {print $1,$(rand()*(NF-1)+2)}' file
OG1: D
OG2: F
OG3: F
To skip a certain letter (here C), you can change the main block to
... {while ("C"==(r=$(rand()*(NF-1)+2))); print $1,r}' file
perl -lane 'print "$F[0] ".$F[rand($#F)+1]' data.txt
Explanation:
These command-line options are used:
-n loop around each line of the input file
-l removes newlines before processing, and adds them back in afterwards
-a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace.
-e execute the perl code
@F is the array of words in each line, indexed starting with $F[0]
$#F is the index of the last element of @F (the number of words minus one)
output:
OG1: A
OG2: F
OG3: F
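Since the question started from shuf, a per-line shuf loop also works (a sketch; it spawns one shuf per input line, so it is slow on large files, and assumes the ids contain no glob characters):
while read -r id rest; do
    printf '%s %s\n' "$id" "$(printf '%s\n' $rest | shuf -n 1)"
done < data.txt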

Count number of different lines

I have a file with a lot of repeated lines; it looks like this:
a
a
.
.
.
a
b
b
c
.
.
c
d
.
.
d
e
.
.
.
e
I need to count each line value only once, so for example if the only possible values are a, b, c, d and e, the number I'm interested in is 5.
here's how I've been counting all of the lines in the file:
wc -l file
which counts every repeated line (n copies of a, m copies of b, etc.) and doesn't give me the information I want.
I sense this can be done using awk, any ideas?
Does it have to be awk? One way using shell commands is:
$ sort input.txt | uniq -c
10 .
3 a
2 b
2 c
2 d
2 e
Using awk:
$ awk '{a[$0]++}END{for(i in a){print i, a[i]}}' input.txt
a 3
b 2
. 10
c 2
d 2
e 2
You don't really need to do any programming for this, e.g.
$ sort -u input.txt | wc -l
sort -u sorts the input file removing any duplicates and the output is then piped to wc -l to generate a count of these unique lines.
Given this file:
$ cat /tmp/lines.txt
a
a
.
.
.
a
b
b
c
.
.
c
d
.
.
d
e
.
.
.
e
You can also use Perl to filter which lines to count. In this case, only letters:
$ perl -lane '$c{$1}++ if /^(\w+)/; END {print "$_: $c{$_}" foreach (sort keys%c); $s = keys %c; print "total uniques: $s"}' /tmp/lines.txt
a: 3
b: 2
c: 2
d: 2
e: 2
total uniques: 5
The total number of unique values is the number of key/value pairs in the hash %c.
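If only the final count is needed, the same idea shrinks to a one-liner (a sketch, again counting only lines that start with a word character):
$ perl -lne '$c{$1}++ if /^(\w+)/; END { print scalar keys %c }' /tmp/lines.txt
5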
Similarly in awk, you can do:
$ awk '/\w+/{ a[$0]++}END{for(i in a){print i, a[i]; c++} print "unique lines:", c}' /tmp/lines.txt
a 3
b 2
c 2
d 2
e 2
unique lines: 5
Or, cobble together a grep/uniq/wc solution (this relies on identical lines being adjacent, as they are here; otherwise sort first):
$ grep -E '\w+' /tmp/lines.txt | uniq | wc -l
5
The idiomatic way to do this in awk:
awk '!seen[$0]++' file
That prints a line only the first time it is seen. To output just the count of distinct lines instead:
awk '!seen[$0]++{cnt++} END{print cnt+0}' file