Awk formatted output how to include sign for floats - awk

I want to reformat a file by adjusting the spacing between fields. I have managed to get the content into this form:
 130GLU    HB2  383 0.058 5.178 2.925
 130GLU     CG  384 -0.108 5.065 2.887
 130GLU    HG1  385 -0.079 5.007 2.963
What I can't get right is the floats with minus signs. The above formatting is produced by awk '{printf " %6s%7s%5s %5.3f %5.3f %5.3f\n",$1,$2,$3,$4,$5,$6}' file. However, I want to get the following:
 130GLU    HB2  383  0.058  5.178  2.925
 130GLU     CG  384 -0.108  5.065  2.887
 130GLU    HG1  385 -0.079  5.007  2.963
In other words, account for the sign in counting the characters of the number, in this case the 5 part of %5.3f. How can I do this?

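Just a simple change: put a space flag between the % and the width, so that positive numbers get a leading blank where negative numbers get their minus sign. See if it works: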
awk '{printf " %6s%7s%5s % 5.3f % 5.3f % 5.3f\n",$1,$2,$3,$4,$5,$6}' file
Result:
 130GLU    HB2  383  0.058  5.178  2.925
 130GLU     CG  384 -0.108  5.065  2.887
 130GLU    HG1  385 -0.079  5.007  2.963
or, to print an explicit sign on every number:
awk '{printf " %6s%7s%5s %+5.3f %+5.3f %+5.3f\n",$1,$2,$3,$4,$5,$6}' file
Result:
 130GLU    HB2  383 +0.058 +5.178 +2.925
 130GLU     CG  384 -0.108 +5.065 +2.887
 130GLU    HG1  385 -0.079 +5.007 +2.963
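To compare the flags side by side, here is a small sketch. The function name show_flags is my own, and the two sample values are taken from the question. The space flag and the + flag both reserve a column for the sign; widening the field to %6.3f is a third option if right-alignment is all that is needed.

```shell
# Print each sample value with four different float formats.
show_flags() {
    awk 'BEGIN {
        n = split("0.058 -0.108", v, " ")
        for (i = 1; i <= n; i++)
            # plain, space flag, plus flag, wider field
            printf "[%5.3f] [% 5.3f] [%+5.3f] [%6.3f]\n", v[i], v[i], v[i], v[i]
    }'
}

show_flags
# [0.058] [ 0.058] [+0.058] [ 0.058]
# [-0.108] [-0.108] [-0.108] [-0.108]
```

Only the plain %5.3f column changes width between the two rows, which is the misalignment described in the question.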

Related

bcftools mpileup failed, format error of index

I'm trying to generate BCF files with bcftools, using a command of this general form:
bcftools mpileup -Ou -f ref.fai file.bam
The index was built with
samtools faidx index.fna
The BAM file was sorted and duplicate-marked with
samtools sort file.bam -o file_sorted.bam
samtools markdup file_sorted.bam file_sorted_md.bam
Then I ran
bcftools mpileup -Ou -f index.fna.fai file_sorted_md.bam > raw.bcf
and received the following error:
[E::fai_build_core] Format error, unexpected "N" at line 1
The index looks like this:
NM_001248012.2 2447 80 2447 2448
NM_001248013.1 612 2593 612 613
NM_001248014.1 1919 3266 1919 1920
NM_001248015.2 552 5264 552 553
NM_001248016.2 1074 5917 1074 1075
NM_001248017.2 788 7070 788 789
NM_001248019.2 1069 7940 1069 1070
NM_001248020.2 970 9088 970 971
NM_001248022.3 832 10137 832 833
NM_001248023.2 890 11048 890 891
NM_001248024.1 808 12017 808 809
NM_001248025.3 884 12904 884 885
NM_001248026.3 929 13867 929 930
I already tried removing the "N" from every line, and then removing all alphabetic characters, but I still get an error, now about the digits:
[E::fai_build_core] Format error, unexpected "0" at line 1
I also tried passing the BAM file through a list file:
ls | grep file_sorted_md.bam > bam_list.txt
bcftools mpileup -Ou -f index.fna.fai -b bam_list.txt > raw.bcf
but I get no further.
I also searched the web for this error, but only found other, unrelated errors.
What is the problem and how can I solve it?
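The problem was that, although an index must have been built, -f expects the reference FASTA itself rather than the .fai index. Given the .fai, bcftools tries to parse the index as FASTA and fails on its first character. So instead of passing the index,
bcftools mpileup -Ou -f index.fna.fai -b bam_list.txt > raw.bcf
the reference genome must be passed:
bcftools mpileup -Ou -f reference.fna -b bam_list.txt > raw.bcf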
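A quick way to check which file is about to go to -f: a FASTA file starts with ">", while a .fai index starts with a sequence name. The sketch below is only an illustrative pre-flight check. looks_like_fasta is a hypothetical helper, not a samtools or bcftools command, and the two demo files are made up so the check can be tried without real data.

```shell
# Hypothetical helper: succeeds only if the first byte of the file is '>'.
looks_like_fasta() {
    [ "$(head -c 1 "$1")" = ">" ]
}

# Made-up demo files: a one-record FASTA and a matching index line.
tmp=$(mktemp -d)
printf '>NM_001248012.2\nACGT\n' > "$tmp/reference.fna"
printf 'NM_001248012.2\t4\t16\t4\t5\n' > "$tmp/reference.fna.fai"

for f in "$tmp/reference.fna" "$tmp/reference.fna.fai"; do
    if looks_like_fasta "$f"; then
        echo "${f##*/}: looks like FASTA, suitable for -f"
    else
        echo "${f##*/}: not FASTA, do not pass to -f"
    fi
done

rm -rf "$tmp"
```

With these demo files, the first line reports the .fna as suitable and the second flags the .fai.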

Keep only the last duplicate line

I have this data. When several lines share the same first three fields ($1, $2, $3), how can I remove the earlier ones and keep only the last, using awk?
785016 AGTCGCGTCCGT 142
785031 CGGCGTCGACTA 705
785031 CGGCGTCGACTA 705 CACTCCCCTGGAG
848841 GCTCAGTCAAAC 1595
848841 GCTCAGTCAAAC 1595 matched
848847 CAAATCGAGATC 1672
880844 TGCCGACGACAT 520
880844 TGCCGACGACAT 520 GTGTTCCGATCAG
880851 GACGACAACGTC 582
The expected output is:
785016 AGTCGCGTCCGT 142
785031 CGGCGTCGACTA 705 CACTCCCCTGGAG
848841 GCTCAGTCAAAC 1595 matched
848847 CAAATCGAGATC 1672
880844 TGCCGACGACAT 520 GTGTTCCGATCAG
880851 GACGACAACGTC 582
With tac and awk:
tac file | awk '!a[$1,$2,$3]++' | tac
Output:
785016 AGTCGCGTCCGT 142
785031 CGGCGTCGACTA 705 CACTCCCCTGGAG
848841 GCTCAGTCAAAC 1595 matched
848847 CAAATCGAGATC 1672
880844 TGCCGACGACAT 520 GTGTTCCGATCAG
880851 GACGACAACGTC 582
See: man tac
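If tac is not available, a single awk pass can do a similar job. This is a minimal sketch, assuming the whole input fits in memory; the function name keep_last and the array names are my own. One difference from the tac pipeline: each kept line is printed at the position where its key first appeared, which gives the same result whenever duplicates are adjacent, as they are in the sample data.

```shell
# Keep only the last line for each ($1,$2,$3) key, in first-seen key order.
keep_last() {
    awk '
        {
            key = $1 SUBSEP $2 SUBSEP $3
            if (!(key in last)) order[++n] = key   # remember first-seen order of keys
            last[key] = $0                         # later lines overwrite earlier ones
        }
        END { for (i = 1; i <= n; i++) print last[order[i]] }
    ' "$@"
}

printf '%s\n' \
    '785031 CGGCGTCGACTA 705' \
    '785031 CGGCGTCGACTA 705 CACTCCCCTGGAG' \
    '848847 CAAATCGAGATC 1672' | keep_last
# 785031 CGGCGTCGACTA 705 CACTCCCCTGGAG
# 848847 CAAATCGAGATC 1672
```

Called as keep_last file, it reads the file instead of standard input.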

awk getting good distribution of random integer values between 2 inputs

How can I get a good distribution of random integer values between two inputs using awk?
I'm trying the following:
$ awk -v min=200 -v max=500 ' BEGIN { srand();for(i=0;i<10;i++) print int(min+rand()*100*(max/min)) } '
407
406
360
334
264
365
303
417
249
225
Is there a better way in awk?
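Sorry to inform you that your code is not even correct: the results are not confined to the range between min and max. Try it with min=max=10, where it prints values from 10 to 109 instead of always 10.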
Something like this will work. You can verify the uniformity as well.
$ awk -v min=200 -v max=210 ' BEGIN{srand();
for(i=0;i<10000;i++) a[int(min+rand()*(max-min))]++;
for(k in a) print k,a[k]}' | sort
200 1045
201 966
202 1014
203 1016
204 985
205 1010
206 988
207 1027
208 986
209 963
Note also that min is inclusive but max is not.
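If max should be inclusive as well, widening the range by one is enough. A minimal sketch, with rand_int as a name of my own choosing; it relies on rand() returning a value in [0, 1), so int() never reaches max+1.

```shell
# Print n random integers drawn uniformly from min..max, both ends inclusive.
rand_int() {
    awk -v min="$1" -v max="$2" -v n="$3" 'BEGIN {
        srand()   # seeded from the clock, so runs within the same second repeat
        for (i = 0; i < n; i++)
            print int(min + rand() * (max - min + 1))
    }'
}

rand_int 200 210 5
```

With min equal to max, for example rand_int 10 10 3, every value printed is 10.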

matching columns in two files with different line numbers

This is a frequently asked question, but I could not get it to work with my files, so any help will be highly appreciated.
I have two files. I want to compare their first fields and print the common lines into a third file. Here is an example of my files:
file 1:
gene1
gene2
gene3
file 2:
gene1|trans1|12|233|345 45
gene1|trans2|12|342|232 45
gene2|trans2|12|344|343 12
gene2|trans2|12|344|343 45
gene2|trans2|12|344|343 12
gene2|trans3|12|34r|343 325
gene2|trans2|12|344|343 545
gene3|trans4|12|344|333 454
gene3|trans2|12|343|343 545
gene3|trans3|12|344|343 45
gene4|trans2|12|344|343 2112
gene4|trans2|12|344|343 455
file 2 contains more fields. Please note that its first field is not identical to the lines of file 1: only the gene part (the text before the first |) matches.
The output should look like this:
gene1|trans1|12|233|345 45
gene1|trans2|12|342|232 45
gene2|trans2|12|344|343 12
gene2|trans2|12|344|343 45
gene2|trans2|12|344|343 12
gene2|trans3|12|34r|343 325
gene2|trans2|12|344|343 545
gene3|trans4|12|344|333 454
gene3|trans2|12|343|343 545
gene3|trans3|12|344|343 45
I use this code. It does not give me any error, but it does not give me any output either:
awk '{if (f[$1] != FILENAME) a[$1]++; f[$1] = FILENAME; } END{ for (i in a) if (a[i] > 1) print i; }' file1 file1
Thank you very much.
Something like this?
awk -F\| 'FNR==NR {a[$0]++;next} $1 in a' file1 file2
gene1|trans1|12|233|345 45
gene1|trans2|12|342|232 45
gene2|trans2|12|344|343 12
gene2|trans2|12|344|343 45
gene2|trans2|12|344|343 12
gene2|trans3|12|34r|343 325
gene2|trans2|12|344|343 545
gene3|trans4|12|344|333 454
gene3|trans2|12|343|343 545
gene3|trans3|12|344|343 45
In this example, grep is sufficient:
grep -w -f file1 file2
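For reference, the awk answer relies on the common two-file idiom: FNR==NR is true only while the first file is being read, so that block loads the gene names, and the bare pattern afterwards filters the second file. Below is a commented sketch of the same logic; the function name match_genes, the array name wanted, and the shortened sample data are my own.

```shell
# Print lines of the second file whose first |-separated field is listed in the first file.
match_genes() {
    awk -F'|' '
        FNR == NR { wanted[$0]; next }   # first file: store each whole line as a key
        $1 in wanted                     # second file: print when field 1 is a stored key
    ' "$1" "$2"
}

tmp=$(mktemp -d)
printf 'gene1\ngene3\n' > "$tmp/file1"
printf '%s\n' \
    'gene1|trans1|12|233|345 45' \
    'gene2|trans2|12|344|343 12' \
    'gene3|trans4|12|344|333 454' \
    'gene4|trans2|12|344|343 455' > "$tmp/file2"

match_genes "$tmp/file1" "$tmp/file2"
# gene1|trans1|12|233|345 45
# gene3|trans4|12|344|333 454

rm -rf "$tmp"
```

Unlike grep -w -f, this compares the first field exactly, so a gene name that happens to appear as a word later in a line does not cause a match.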

awk print multiple column file into single column

My file looks like this:
315
717
461 737
304
440
148 206 264 322 380 438 496
801
495
355
249 989
768
946
I want to print all of those values in a single-column file (one long first column).
If I try
awk '{print $1}' file > new_file;
awk '{print $2}' file >> new_file
the result has blank lines in between. How can I fix this?
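Perhaps a bit cryptic:
awk '1' RS='[[:space:]]+' inputfile
It says "print every record, treating any run of whitespace as the record separator". Note that a regular expression in RS works in gawk and mawk but is not guaranteed by POSIX.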
You can simply use something like:
awk '{ for (i=1; i<=NF; i++) print $i }' file
For each line, iterate through columns, and print each column in a new line.
You don't need anything as heavy as awk or sed for this: just translate spaces to newlines
tr ' ' '\n' < file
tr is purely a filter, so you have to redirect a file into it.
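If the input can contain runs of several spaces, plain tr would emit an empty line for each extra space. Adding -s squeezes those repeated newlines. A small sketch, with one_per_line as a name of my own:

```shell
# Turn spaces into newlines and squeeze repeated newlines into one.
one_per_line() {
    tr -s ' ' '\n'
}

printf '148  206 264\n801\n' | one_per_line
# 148
# 206
# 264
# 801
```

As with plain tr, the file has to be redirected in: one_per_line < file.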
A perl solution:
perl -pe 's/\s+(?=\S)/\n/g' infile
Output:
315
717
461
737
304
440
148
206
264
322
380
438
496
801
495
355
249
989
768
946