Replace space in a specific column with a character - awk

I have data in format
A ((!(A1+A2)))
B (A1+A2)
C (A1 A2)
D (!(A1 A2) B1)
E (!A1+!A2)
F ((A1+A2) A3 A4)
G ((A1 A2)+(A3 A4))
I want output as
A ((!(A1+A2)))
B (A1+A2)
C (A1&A2)
D (!(A1&A2)&B1)
E (!A1+!A2)
F ((A1+A2)&A3&A4)
G ((A1&A2)+(A3&A4))
So whenever there is a space in column 2, I want it replaced with &.
I tried
sed 's/ /&/2' file
But there is no change
I also tried
awk -F' ' '{if($2==" ")$2="&";}1' file
This also makes no change; I just get the input file back.

You may use this awk with 2 spaces as the input/output field separator (this assumes the two columns in the file are separated by exactly two spaces):
awk 'BEGIN {FS=OFS="  "} {gsub(/ +/, "\\&", $2)} 1' file
A ((!(A1+A2)))
B (A1+A2)
C (A1&A2)
D (!(A1&A2)&B1)
E (!A1+!A2)
F ((A1+A2)&A3&A4)
G ((A1&A2)+(A3&A4))
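If the width of the gap between the two columns is not known in advance, one possible variation is to peel off the first column and substitute only in the rest of the line. This is a minimal sketch rather than the answer's own method: the names key and rest are made up for illustration, and it assumes the first column never contains a space.

```shell
# Two sample rows; the columns are separated by two spaces here,
# but any run of spaces after the first column is handled the same way.
result=$(printf 'C  (A1 A2)\nD  (!(A1 A2) B1)\n' | awk '{
  key = $1                      # first column (assumed to contain no spaces)
  rest = $0
  sub(/^[^ ]+ +/, "", rest)     # drop the first column and the gap after it
  gsub(/ +/, "\\&", rest)       # "\\&" yields a literal & in the replacement
  print key, rest
}')
printf '%s\n' "$result"
```

The output uses a single space between the columns because print joins its arguments with the default OFS.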

You can harness sed for this task in the following way:
sed 's/\([^ ]\) \([^ ]\)/\1\&\2/g'
gives for input
A ((!(A1+A2)))
B (A1+A2)
C (A1 A2)
D (!(A1 A2) B1)
E (!A1+!A2)
F ((A1+A2) A3 A4)
G ((A1 A2)+(A3 A4))
output
A ((!(A1+A2)))
B (A1+A2)
C (A1&A2)
D (!(A1&A2)&B1)
E (!A1+!A2)
F ((A1+A2)&A3&A4)
G ((A1&A2)+(A3&A4))
Explanation: I used capturing groups here. The 1st is any character but a space, the 2nd is also any character but a space, and there is a single space between them. Such a match is replaced by the content of the 1st group (\1), followed by a literal & (\&), followed by the content of the 2nd group (\2). Note that we want multiple replacements, hence g. Disclaimer: this solution assumes there are no leading or trailing spaces in your input, and that the two columns are separated by more than one space (a single separating space would be replaced as well).
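The backslash before & matters because an unescaped & in a sed replacement stands for the matched text itself, which is why the attempt in the question changes nothing. The short sketch below shows both behaviours on one sample line; the variable names are only for illustration, and the sample line assumes a two-space column separator.

```shell
line='C  (A1 A2)'

# Unescaped &: the 2nd space is replaced by itself, so the line is unchanged.
unescaped=$(printf '%s\n' "$line" | sed 's/ /&/2')

# Escaped \&: a literal ampersand is inserted between the two captured characters.
escaped=$(printf '%s\n' "$line" | sed 's/\([^ ]\) \([^ ]\)/\1\&\2/g')

printf '%s\n%s\n' "$unescaped" "$escaped"
```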

Related

Remove empty columns by awk

I have an input file, which is tab delimited, and I want to remove all empty columns. The empty columns are: $13=$14=$15=$84=$85=$86=$87=$88=$89=$91=$94
INPUT: tsv file with more than 90 columns
a b d e g...
a b d e g...
OUTPUT: tsv file without empty columns
a b d e g....
a b d e g...
Thank you
This might be what you want:
$ printf 'a\tb\tc\td\te\n'
a b c d e
$ printf 'a\tb\tc\td\te\n' | awk 'BEGIN{FS=OFS="\t"} {$2=$4=""} 1'
a c e
$ printf 'a\tb\tc\td\te\n' | awk 'BEGIN{FS=OFS="\t"} {$2=$4=RS; gsub("(^|"FS")"RS,"")} 1'
a c e
Note that the above doesn't remove all empty columns as some potential solutions might do, it only removes exactly the column numbers you want removed:
$ printf 'a\tb\t\td\te\n'
a b d e
$ printf 'a\tb\t\td\te\n' | awk 'BEGIN{FS=OFS="\t"} {$2=$4=RS; gsub("(^|"FS")"RS,"")} 1'
a e
remove ALL empty columns:
If you have a tab-delimited file with empty columns and you want to remove all of them, the empty columns show up as runs of consecutive tabs. Hence you could just replace each run with a single tab, and then delete the leading tab that is left over if the first column was one of the empty ones:
sed 's/\t\+/\t/g;s/^\t//' <file>
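Squeezing tabs works row by row, so a field that is empty in only some rows shifts the rest of that row to the left. A possible alternative, sketched here under the assumption that a column counts as empty only when it is empty in every row, is to read the file twice: once to record which columns ever hold data, and once to print only those. The names keep, max, out and sep, and the sample data, are made up for illustration.

```shell
tmp=$(mktemp)
# Column 3 is empty in every row; column 2 is empty in the second row only.
printf 'a\tb\t\td\ne\t\t\th\n' > "$tmp"

result=$(awk 'BEGIN { FS = OFS = "\t" }
  NR == FNR {                      # first pass: note columns that hold data
    for (i = 1; i <= NF; i++) if ($i != "") keep[i] = 1
    if (NF > max) max = NF
    next
  }
  {                                # second pass: print only the kept columns
    out = ""; sep = ""
    for (i = 1; i <= max; i++)
      if (i in keep) { out = out sep $i; sep = OFS }
    print out
  }' "$tmp" "$tmp")

printf '%s\n' "$result"
rm -f "$tmp"
```

Here column 3 is dropped, while the empty field in column 2 of the second row is kept so the remaining columns stay aligned.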
remove SOME columns: See Ed Morton's answer, or just use cut:
cut --complement -f 13,14,15,84,85,86,87,88,89,91,94 <file>
remove selected columns if and only if they are empty:
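Basically a simple adaptation of Ed Morton's answer. Note that col has to be passed with -v: an assignment placed after the program text is only processed after the BEGIN block has run, so split() would see an empty string.
awk -v col=13,14,15,84,85,86,87,88,89,91,94 'BEGIN{FS=OFS="\t"; n=split(col,a,",")}
{ for(i=1;i<=n;++i) if ($a[i]=="") $a[i]=RS; gsub("(^|"FS")"RS,"") }
1' <file>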
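As a small check of the conditional version, the sketch below runs the same logic on two made-up rows with col=2,4. It passes col with -v so the value is already set when the BEGIN block runs. The sample data is hypothetical.

```shell
# Row 1: columns 2 and 4 are empty. Row 2: only column 4 is empty.
result=$(printf 'a\t\tc\t\te\nf\tg\th\t\tj\n' |
  awk -v col=2,4 'BEGIN { FS = OFS = "\t"; n = split(col, a, ",") }
  {
    # Mark each listed column with RS only if it is empty in this row,
    # then delete every marker together with the separator before it.
    for (i = 1; i <= n; ++i) if ($a[i] == "") $a[i] = RS
    gsub("(^|" FS ")" RS, "")
  }
  1')
printf '%s\n' "$result"
```

Because the test is made per row, rows can end up with different field counts, as the second row here shows: it keeps its non-empty column 2.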

awk remove mirrored duplicates from 2 columns

Big question:
I want a list of the unique combinations between two fields in a data frame.
Example data:
A B
C D
E F
B A
C F
E F
I would like to be able to get the result of 4 unique combinations: AB, CD, EF, and CF. Since AB and BA contain the same components but in a different order, I only want one copy (it is a mutual relationship, so BA is the same thing as AB).
Attempt:
So far I have tried sorting and keeping unique lines:
sort file | uniq
but of course that produces 5 combinations:
A B
C D
E F
B A
C F
I do not know how to approach AB/BA being considered the same. Any suggestions on how to do this?
The idiomatic awk approach is to order the index parts:
$ awk '!seen[$1>$2 ? $1 FS $2 : $2 FS $1]++' file
A B
C D
E F
C F
Another bit of awk magic:
awk '!a[$1,$2] && !a[$2,$1]++' file
In awk:
$ awk '($1$2 in a){next}{a[$1$2];a[$2$1]}1' file
A B
C D
E F
C F
Explained:
($1$2 in a) { next } # if duplicate in hash, next record
{ a[$1$2]; a[$2$1] } 1 # hash reverse also and output
It works for single-character fields. If you want to use it for longer strings, add FS between the fields, like a[$1 FS $2] etc. (thanks @EdMorton).
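To see why the separator matters for longer fields, consider the made-up pairs "AB C" and "A BC": concatenated without a separator, both become "ABC" and are wrongly treated as the same pair. The sketch below runs the ordered-key version and the plain concatenation version on the same hypothetical input.

```shell
pairs='AB C
A BC
C AB'

# Ordered key with FS between the parts: only the true mirror (C AB) is dropped.
with_sep=$(printf '%s\n' "$pairs" | awk '!seen[$1 > $2 ? $1 FS $2 : $2 FS $1]++')

# Plain concatenation: "A BC" collides with "AB C" and is dropped as well.
without_sep=$(printf '%s\n' "$pairs" | awk '($1$2 in a){next}{a[$1$2];a[$2$1]}1')

printf 'with FS:\n%s\nwithout FS:\n%s\n' "$with_sep" "$without_sep"
```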

Random selection of ids from a file

I have a text file in the following format; the letters are ids separated by a space.
OG1: A B C D E
OG2: C F G D R
OG3: A D F F F
I would like to randomly extract one id from each group as
OG1: E
OG2: D
OG3: A
I tried using
shuf -n 1 data.txt
which gives me
OG2: C F G D R
awk to the rescue!
$ awk -v seed=$RANDOM 'BEGIN{srand(seed)} {print $1,$(rand()*(NF-1)+2)}' file
OG1: D
OG2: F
OG3: F
to skip a certain letter, you can change the main block to
... {while ("C"==(r=$(rand()*(NF-1)+2))); print $1,r}' file
perl -lane 'print "$F[0] ".$F[rand($#F)+1]' data.txt
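The field index in the awk command relies on awk truncating a fractional index. A variant that spells the arithmetic out might look like the sketch below: int(rand() * (NF - 1)) gives 0 to NF-2, and adding 2 maps that onto fields 2 to NF, skipping the group label in field 1. The name idx is made up, and ${RANDOM:-1} is only a fallback seed for shells that do not provide RANDOM.

```shell
pick=$(printf 'OG1: A B C D E\n' |
  awk -v seed="${RANDOM:-1}" 'BEGIN { srand(seed) }
  {
    idx = int(rand() * (NF - 1)) + 2   # integer in the range 2..NF
    print $1, $idx
  }')
printf '%s\n' "$pick"
```

The result differs from run to run, but it is always the group label followed by one of that group's ids.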
Explanation:
These command-line options are used:
-n loop around each line of the input file
-l removes newlines before processing, and adds them back in afterwards
-a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace.
-e execute the perl code
@F is the array of words in each line, indexed starting with $F[0]
$#F is the index of the last element of @F, which here equals the number of ids after the group label
output:
OG1: A
OG2: F
OG3: F

Using multiple conditions in awk

I want to extract information from a large file based on multiple conditions (from the same file) as well as pattern matching against another small file. This is the script I used:
awk 'BEGIN{FS=OFS="\t"}NR==FNR{a[$0]++;next}$1 in a {print $2,$4,$5}' file2.txt file1.txt >output.txt
Now, in the same awk script, I want to print ONLY the lines where the 4th column (any one character among A, T, G, C) matches the 5th column (likewise one of A, T, G, C); both columns are in file 1.
Hence, in a way, I want to merge the following script with the script mentioned above:
awk '$4 " "==$5{print $2,$4,$5}' file1.txt
Following is the representation of file1.txt:
SNP Name Sample ID GC Score Allele1 - Forward Allele2 - Forward
ARS-BFGL-BAC-10172 834269752 0.9374 A G
ARS-BFGL-BAC-1020 834269752 0.9568 A A
ARS-BFGL-BAC-10245 834269752 0.7996 C C
ARS-BFGL-BAC-10345 834269752 0.9604 A C
ARS-BFGL-BAC-10365 834269752 0.5296 G G
ARS-BFGL-BAC-10591 834269752 0.4384 A A
ARS-BFGL-BAC-10793 834269752 0.9549 C C
ARS-BFGL-BAC-10867 834269752 0.9400 G G
ARS-BFGL-BAC-10951 834269752 0.5453 T T
Following is the representation of file2.txt
ARS-BFGL-BAC-10172
ARS-BFGL-BAC-1020
ARS-BFGL-BAC-10245
ARS-BFGL-BAC-10345
ARS-BFGL-BAC-10365
ARS-BFGL-BAC-10591
ARS-BFGL-BAC-10793
ARS-BFGL-BAC-10867
ARS-BFGL-BAC-10951
Output should be:
834269752 A A
834269752 C C
834269752 G G
834269752 A A
834269752 C C
834269752 G G
834269752 T T
You can simply use boolean logic, and from your input file it seems you can get away with "normal" input field splitting, which will allow you to get rid of that space in the comparison:
awk 'BEGIN{OFS="\t"}
NR==FNR{a[$0]++;next}
($1 in a) && ($4==$5) {print $2,$4,$5}' file2.txt file1.txt > output.txt
As an example, here is my test file2.txt:
ARS-BFGL-BAC-1020
ARS-BFGL-BAC-10172
And here is the result of the command above:
834269752 A A
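For a self-contained check of the combined condition, the sketch below builds two small temporary files from a few of the question's rows and runs the same two-file pattern on them. It assumes file1.txt is tab-delimited, and the array name ids and the temporary directory are made up for illustration.

```shell
dir=$(mktemp -d)
printf 'ARS-BFGL-BAC-10172\nARS-BFGL-BAC-1020\n' > "$dir/file2.txt"
printf '%s\t%s\t%s\t%s\t%s\n' \
  ARS-BFGL-BAC-10172 834269752 0.9374 A G \
  ARS-BFGL-BAC-1020  834269752 0.9568 A A \
  ARS-BFGL-BAC-10245 834269752 0.7996 C C > "$dir/file1.txt"

matches=$(awk 'BEGIN { FS = OFS = "\t" }
  NR == FNR { ids[$1]; next }                    # first file: remember wanted SNP names
  ($1 in ids) && $4 == $5 { print $2, $4, $5 }   # second file: listed and both alleles equal
' "$dir/file2.txt" "$dir/file1.txt")

printf '%s\n' "$matches"
rm -rf "$dir"
```

Only ARS-BFGL-BAC-1020 passes both tests: 10172 has differing alleles, and 10245 is not listed in file2.txt.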

Split large single column into two columns

I need to split a single column of data in a large file into two columns as follows:
A
B          B A
C  ---->   D C
D          F E
E          H G
F
G
H
Is there an easy way of doing it with unix shell commands and/or small shell script? awk?
$ awk 'NR%2{s=$0;next} {print $0,s}' file
B A
D C
F E
H G
You can use the following awk script:
awk 'NR % 2 != 0 {cache=$0}; NR % 2 == 0 {print $0, cache}' data.txt
Output:
B A
D C
F E
H G
It caches the value of odd lines and outputs even lines + appends the cache to them.
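One thing to be aware of with this caching approach is that a trailing odd line is never printed when the input has an odd number of lines. If that case can occur, a possible extension is to flush the leftover value in an END block. This is a sketch only: the pending flag is made up, and printing the leftover line on its own is just one choice among several.

```shell
result=$(printf 'A\nB\nC\nD\nE\n' | awk '
  NR % 2 { s = $0; pending = 1; next }   # odd line: remember it
  { print $0, s; pending = 0 }           # even line: print it followed by the cached odd line
  END { if (pending) print s }           # unpaired last line: print it alone
')
printf '%s\n' "$result"
```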
I know this is tagged awk, but I just can't stop myself from posting a sed solution, since the question left it open for "easy way . . . with unix shell commands":
$ sed -n 'h;n;G;s/\n/ /g;p' data.txt
B A
D C
F E
H G