Random selection of ids from a file - awk

I have a text file in the following format; the letters are ids separated by spaces.
OG1: A B C D E
OG2: C F G D R
OG3: A D F F F
I would like to randomly extract one id from each group as
OG1: E
OG2: D
OG3: A
I tried using
shuf -n 1 data.txt
which gives me
OG2: C F G D R

awk to the rescue!
$ awk -v seed=$RANDOM 'BEGIN{srand(seed)} {print $1,$(rand()*(NF-1)+2)}' file
OG1: D
OG2: F
OG3: F
to skip a certain letter, you can change the main block to
... {while ("C"==(r=$(rand()*(NF-1)+2))); print $1,r}' file
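Putting that together, the full skip-a-letter command looks like this (note the extra parentheses around the assignment, which awk's grammar requires inside the comparison; also, every line must contain at least one id other than the skipped letter, or the loop never terminates):

```shell
# pick one random id per line, rerolling whenever "C" is drawn;
# each line needs at least one non-C id or this loops forever
awk -v seed=$RANDOM 'BEGIN{srand(seed)}
  {while ("C"==(r=$(rand()*(NF-1)+2))); print $1,r}' file
```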

perl -lane 'print "$F[0] ".$F[rand($#F)+1]' data.txt
Explanation:
These command-line options are used:
-n loop around each line of the input file
-l removes newlines before processing, and adds them back in afterwards
-a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace.
-e execute the perl code
@F is the array of words in each line, indexed starting with $F[0]
$#F is the index of the last word in @F (the number of words minus one)
output:
OG1: A
OG2: F
OG3: F
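For completeness, the original shuf attempt can be salvaged with a small shell loop: shuf -n 1 picks one whole line, so feed it the ids of each group separately (a sketch assuming the group label is always the first field):

```shell
# read the label and the rest of the line separately, then
# shuffle just the ids ($ids is deliberately unquoted so it
# word-splits into one id per line for shuf)
while read -r label ids; do
  printf '%s %s\n' "$label" "$(printf '%s\n' $ids | shuf -n 1)"
done < data.txt
```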

Related

Move every first row down to the second row

I have the following sample text:
a b c
x_y_
d e f
x_y_
g h i
x_y_
k l m
x_y_
I need it to be formatted as follows:
x_y_ a b c
x_y_ d e f
x_y_ g h i
x_y_ k l m
Using sed, awk or something else in bash, how do we accomplish this?
Another awk:
$ awk 'NR%2==0{print $0,p}{p=$0}' file
Output:
x_y_ a b c
x_y_ d e f
x_y_ g h i
x_y_ k l m
Explained:
$ awk '
NR%2==0 { # on every even numbered record
print $0,p # output current record and previous
}{
p=$0 # buffer record for next round
}' file
Update:
In case of an odd number of records (handled mostly due to peer pressure :), you need to deal with the left-over x y z:
$ awk 'NR%2==0{print $0,p}{p=$0}END{if(NR%2)print}' file
Output:
...
x_y_ g h i
x_y_ k l m
x y z
With sed:
sed -E 'N;s/(.*)\n(.*)/\2 \1/g' sample.txt
a short pipeline:
tac file | paste -d ' ' - - | tac
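The pipeline works because paste -d ' ' - - joins consecutive line pairs left-to-right; reversing the file first puts each x_y_ line ahead of its data line, and the second tac restores the original order:

```shell
# reverse, pair up consecutive lines, reverse back
printf 'a b c\nx_y_\nd e f\nx_y_\n' | tac | paste -d ' ' - - | tac
# → x_y_ a b c
#   x_y_ d e f
```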
$ awk 'NR%2{s=$0; next} {print $0, s}' file
x_y_ a b c
x_y_ d e f
x_y_ g h i
x_y_ k l m
1st solution: Could you please try the following; tested and created with GNU awk.
awk -v RS="" -v FS="\n" '{for(i=2;i<=NF;i+=2){printf("%s\n",$i OFS $(i-1))}}' Input_file
OR(with print):
awk -v RS="" -v FS="\n" '{for(i=2;i<=NF;i+=2){print $i,$(i-1)}}' Input_file
2nd solution: Check whether the line number is evenly divisible by 2, and if so print the previous and current lines' values. It also checks whether the total number of lines in Input_file is odd; if so, it prints the last remaining line too (by checking a flag variable's status).
awk 'prev && FNR%2==0{print $0 OFS prev;prev="";next} {prev=$0} END{if(prev){print prev}}' Input_file
Output will be as follows.
x_y_ a b c
x_y_ d e f
x_y_ g h i
x_y_ k l m
This might work for you (GNU sed):
sed '1~2{h;d};G;s/\n/ /' file
Save odd-numbered lines in the hold space, append each one to the following even-numbered line, and replace the newline with a space.
Another variation:
sed -n 'h;n;G;s/\n/ /p' file
There are many more ways to achieve this, as can be seen from the answers above.
How about this:
parallel -N2 echo "{2} {1}" :::: file
See here for parallel.

Remove empty columns by awk

I have an input file which is tab-delimited, and I want to remove all empty columns. The empty columns are: $13=$14=$15=$84=$85=$86=$87=$88=$89=$91=$94
INPUT: tsv file with more than 90 columns
a b d e g...
a b d e g...
OUTPUT: tsv file without empty columns
a b d e g....
a b d e g...
Thank you
This might be what you want:
$ printf 'a\tb\tc\td\te\n'
a b c d e
$ printf 'a\tb\tc\td\te\n' | awk 'BEGIN{FS=OFS="\t"} {$2=$4=""} 1'
a c e
$ printf 'a\tb\tc\td\te\n' | awk 'BEGIN{FS=OFS="\t"} {$2=$4=RS; gsub("(^|"FS")"RS,"")} 1'
a c e
Note that the above doesn't remove all empty columns as some potential solutions might do, it only removes exactly the column numbers you want removed:
$ printf 'a\tb\t\td\te\n'
a b d e
$ printf 'a\tb\t\td\te\n' | awk 'BEGIN{FS=OFS="\t"} {$2=$4=RS; gsub("(^|"FS")"RS,"")} 1'
a e
remove ALL empty columns:
If you have a tab-delimited file with empty columns and you want to remove all of them, it implies that you have multiple consecutive tabs. Hence you can just replace those with a single tab, and then delete a leading tab (left over whenever the first column itself was empty):
sed 's/\t\+/\t/g;s/^\t//' <file>
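Note that \t and \+ in sed are GNU extensions; a portable sketch of the same idea squeezes runs of tabs with tr and then strips a leading tab in a second step:

```shell
# squeeze runs of tabs down to one, then drop a leading tab if any
tab=$(printf '\t')
tr -s '\t' < file | sed "s/^${tab}//"
```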
remove SOME columns: See Ed Morton's answer or just use cut:
cut --complement -f 13,14,15,84,85,86,87,88,89,91,94 <file>
remove selected columns if and only if they are empty:
Basically a simple adaptation of Ed Morton's answer. Note that col must be passed with -v so it is already set when the BEGIN block runs; a trailing col=... assignment after the program would only take effect after BEGIN, leaving split() with an empty string:
awk -v col=13,14,15,84,85,86,87,88,89,91,94 '
BEGIN{FS=OFS="\t"; n=split(col,a,",")}
{ for(i=1;i<=n;++i) if ($a[i]=="") $a[i]=RS; gsub("(^|"FS")"RS,"") }
1' <file>
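A quick sanity check on a small hypothetical 5-column line, using col=2,4 in place of the long list: column 2 is empty and gets dropped, while column 4 is non-empty and survives:

```shell
# drop column 2 (empty here) but keep column 4 (non-empty)
printf 'a\t\tc\td\te\n' |
awk -v col=2,4 'BEGIN{FS=OFS="\t"; n=split(col,a,",")}
  {for(i=1;i<=n;++i) if ($a[i]=="") $a[i]=RS; gsub("(^|"FS")"RS,"")} 1'
# output: a, c, d, e (tab-separated; empty column 2 is gone)
```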

Using multiple conditions in awk

I want to extract information from a large file based on multiple conditions (from the same file) as well as a pattern search against another small file. The following is the script I used:
awk 'BEGIN{FS=OFS="\t"}NR==FNR{a[$0]++;next}$1 in a {print $2,$4,$5}' file2.txt file1.txt >output.txt
Now, I want to add a condition to the same awk script so that it ONLY prints lines where the element of the 4th column (any one character among ATGC) matches the element of the 5th column (any one character among ATGC); both columns are in file1.
Hence, in a way, I want to merge the following script with the script mentioned above:
awk '$4 " "==$5{print $2,$4,$5}' file1.txt
Following is the representation of file1.txt:
SNP Name Sample ID GC Score Allele1 - Forward Allele2 - Forward
ARS-BFGL-BAC-10172 834269752 0.9374 A G
ARS-BFGL-BAC-1020 834269752 0.9568 A A
ARS-BFGL-BAC-10245 834269752 0.7996 C C
ARS-BFGL-BAC-10345 834269752 0.9604 A C
ARS-BFGL-BAC-10365 834269752 0.5296 G G
ARS-BFGL-BAC-10591 834269752 0.4384 A A
ARS-BFGL-BAC-10793 834269752 0.9549 C C
ARS-BFGL-BAC-10867 834269752 0.9400 G G
ARS-BFGL-BAC-10951 834269752 0.5453 T T
Following is the representation of file2.txt
ARS-BFGL-BAC-10172
ARS-BFGL-BAC-1020
ARS-BFGL-BAC-10245
ARS-BFGL-BAC-10345
ARS-BFGL-BAC-10365
ARS-BFGL-BAC-10591
ARS-BFGL-BAC-10793
ARS-BFGL-BAC-10867
ARS-BFGL-BAC-10951
Output should be:
834269752 A A
834269752 C C
834269752 G G
834269752 A A
834269752 C C
834269752 G G
834269752 T T
You can simply use boolean logic, and from your input file it seems you can get away with "normal" input field splitting, which will allow you to get rid of that space in the comparison:
awk 'BEGIN{OFS="\t"}
NR==FNR{a[$0]++;next}
($1 in a) && ($4==$5) {print $2,$4,$5}' file2.txt file1.txt > output.txt
As an example, here is my test file2.txt:
ARS-BFGL-BAC-1020
ARS-BFGL-BAC-10172
And here is the result of the command above:
834269752 A A
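A self-contained check with a trimmed-down file1.txt, including one row (ARS-BFGL-BAC-10345, alleles A C) that the new $4==$5 condition filters out:

```shell
d=$(mktemp -d)
cat > "$d/file1.txt" <<'EOF'
SNP Name Sample ID GC Score Allele1 - Forward Allele2 - Forward
ARS-BFGL-BAC-1020 834269752 0.9568 A A
ARS-BFGL-BAC-10345 834269752 0.9604 A C
EOF
cat > "$d/file2.txt" <<'EOF'
ARS-BFGL-BAC-1020
ARS-BFGL-BAC-10345
EOF
# the header line is skipped automatically: its $1 ("SNP") is not in a
awk 'BEGIN{OFS="\t"}
  NR==FNR{a[$0]++;next}
  ($1 in a) && ($4==$5) {print $2,$4,$5}' "$d/file2.txt" "$d/file1.txt"
# prints only the 1020 row: 834269752 A A (tab-separated)
```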

Print Specific Columns using AWK

I'm trying to fetch the data from columns B to D of a tab-delimited file "FILE". The simple AWK code I use fetches the data, but unfortunately puts the output in a single column and removes the identifiers (shown below).
Any suggestions please.
CODE
awk '{for(i=2;i<=4;++i)print $i}' FILE
FILE
A B C D E F G
1_at 10.8435630935 10.8559287854 8.6666141543 8.820310681 9.9024050571 8.613199083 11.9807771094
2_at 4.7615531106 4.5209119307 11.2467919586 8.8105151099 7.1831990104 11.0645055836 4.3726598561
3_at 6.0025262754 5.4058080843 3.2475272982 3.1869728585 3.5654989547
OUTPUT OBTAINED
B
C
D
10.8435630935
10.8559287854
8.6666141543
4.7615531106
4.5209119307
11.2467919586
6.0025262754
5.4058080843
3.2475272982
Why don't you directly use cut?
$ cut -d$'\t' -f2-4 < file
B C D
10.8435630935 10.8559287854 8.6666141543
4.7615531106 4.5209119307 11.2467919586
6.0025262754 5.4058080843 3.2475272982
With awk you would need printf to avoid the newline that print adds after each field:
awk -F"\t" '{for(i=2;i<=4;++i) printf "%s%s", $i, (i==4?RS:FS)}' file
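On a small tab-separated sample the difference is visible: plain print emits one field per line, while the printf form keeps fields 2-4 on one row, separated by FS (tab) and terminated by RS (newline) after the last one:

```shell
printf '1_at\t10.84\t10.85\t8.66\t8.82\n' |
awk -F"\t" '{for(i=2;i<=4;++i) printf "%s%s", $i, (i==4?RS:FS)}'
# output: 10.84, 10.85, 8.66 on a single tab-separated row
```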

Split large single column into two columns

I need to split a single column of data in a large file into two columns as follows:
A
B B A
C ----> D C
D F E
E H G
F
G
H
Is there an easy way of doing it with unix shell commands and/or small shell script? awk?
$ awk 'NR%2{s=$0;next} {print $0,s}' file
B A
D C
F E
H G
You can use the following awk script:
awk 'NR % 2 != 0 {cache=$0}; NR % 2 == 0 {print $0 cache}' data.txt
Output:
BA
DC
FE
HG
It caches the value of odd lines and outputs even lines with the cached value appended. Note that print $0 cache concatenates with no separator, which is why the output reads BA rather than B A; use print $0, cache to insert a space (awk's OFS).
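The same cache pattern with a separator, end-to-end (the comma in print makes awk insert OFS, a space by default):

```shell
printf 'A\nB\nC\nD\n' |
awk 'NR % 2 != 0 {cache=$0; next} {print $0, cache}'
# → B A
#   D C
```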
I know this is tagged awk, but I just can't stop myself from posting a sed solution, since the question left it open for "easy way . . . with unix shell commands":
$ sed -n 'h;n;G;s/\n/ /g;p' data.txt
B A
D C
F E
H G