bcftools mpileup failed, format error of index - samtools

I'm trying to generate BCF files with bcftools, using the following general command:
bcftools mpileup -Ou -f ref.fai file.bam
The index was built with
samtools faidx index.fna
The BAM file was sorted and duplicate-marked with
samtools sort file.bam -o file_sorted.bam
samtools markdup file_sorted.bam file_sorted_md.bam
Then I ran
bcftools mpileup -Ou -f index.fna.fai file_sorted_md.bam > raw.bcf
and received the following error:
[E::fai_build_core] Format error, unexpected "N" at line 1
The first lines of the index look like this:
NM_001248012.2 2447 80 2447 2448
NM_001248013.1 612 2593 612 613
NM_001248014.1 1919 3266 1919 1920
NM_001248015.2 552 5264 552 553
NM_001248016.2 1074 5917 1074 1075
NM_001248017.2 788 7070 788 789
NM_001248019.2 1069 7940 1069 1070
NM_001248020.2 970 9088 970 971
NM_001248022.3 832 10137 832 833
NM_001248023.2 890 11048 890 891
NM_001248024.1 808 12017 808 809
NM_001248025.3 884 12904 884 885
NM_001248026.3 929 13867 929 930
I already tried removing the "N" from all lines, and then removing all alphabetic characters, but I still get an error, now about the digits:
[E::fai_build_core] Format error, unexpected "0" at line 1
I've also tried passing the BAM file via a text file
ls | grep file_sorted_md.bam > bam_list.txt
bcftools mpileup -Ou -f index.fna.fai -b bam_list.txt > raw.bcf
but that fails as well.
I also searched the web for this error, but only found unrelated errors.
What is the problem and how can I solve it?

The problem was that, although an index must have been built, the -f option expects the reference genome itself, not the index. So instead of passing the index,
bcftools mpileup -Ou -f index.fna.fai -b bam_list.txt > raw.bcf
the reference FASTA must be passed (bcftools then looks for the matching .fai next to it):
bcftools mpileup -Ou -f reference.fna -b bam_list.txt > raw.bcf
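One plausible reading of the error message, offered as an assumption rather than something the thread states: the file given to -f is parsed as FASTA, where every record starts with a ">" header line, so the first character of the .fai ("N", or "0" after editing) is rejected. A minimal sketch of a pre-flight check follows; the helper name looks_like_fasta and the two tiny files are made up for illustration.

```shell
# Hypothetical helper: succeeds if the file's first byte is '>',
# i.e. the file starts with a FASTA header line.
looks_like_fasta() {
    [ "$(head -c 1 "$1")" = ">" ]
}

# Tiny stand-in files for illustration only (not real data).
tmpdir=$(mktemp -d)
printf '>NM_001248012.2\nACGT\n' > "$tmpdir/reference.fna"
printf 'NM_001248012.2\t4\t16\t4\t5\n' > "$tmpdir/reference.fna.fai"

for f in "$tmpdir/reference.fna" "$tmpdir/reference.fna.fai"; do
    if looks_like_fasta "$f"; then
        echo "${f##*/}: starts with '>', looks like FASTA, usable with -f"
    else
        echo "${f##*/}: does not start with '>', do not pass it to -f"
    fi
done
```

A check like this only looks at the first byte, so it catches the index-versus-reference mix-up but does not validate the FASTA.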

Related

awk getting good distribution of random integer values between 2 inputs

How can I get a good distribution of random integer values between two inputs using awk?
I'm trying the following:
$ awk -v min=200 -v max=500 ' BEGIN { srand();for(i=0;i<10;i++) print int(min+rand()*100*(max/min)) } '
407
406
360
334
264
365
303
417
249
225
$
Is there a better way in awk?
Your code is not correct: try it with min=max=10 and it prints values from 10 to 109 instead of only 10.
Something like this will work, and you can verify the uniformity as well.
$ awk -v min=200 -v max=210 ' BEGIN{srand();
for(i=0;i<10000;i++) a[int(min+rand()*(max-min))]++;
for(k in a) print k,a[k]}' | sort
200 1045
201 966
202 1014
203 1016
204 985
205 1010
206 988
207 1027
208 986
209 963
Note also that min is inclusive but max is not.
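If max should be inclusive as well, widening the range by one is a common adjustment. The sketch below wraps that in a shell function; the name rand_int and its argument order are made up for illustration.

```shell
# rand_int MIN MAX COUNT: print COUNT integers drawn from MIN..MAX, both inclusive.
rand_int() {
    awk -v min="$1" -v max="$2" -v n="$3" 'BEGIN {
        srand()
        for (i = 0; i < n; i++)
            # rand() is in [0,1), so int(...) ranges over 0..(max-min)
            print min + int(rand() * (max - min + 1))
    }'
}

rand_int 200 210 5
```

Because srand() without an argument seeds from the time of day, two calls within the same second typically produce the same sequence.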

Awk value greater than 40

Can someone please help me? I'm trying to select the rows whose Capacity is greater than 40, but the row at 100% is not matched.
[root@localhost home]# df -Pk --block-size=1M
Filesystem 1048576-blocks Used Available Capacity Mounted on
/dev/mapper/rhel-root 22510 13135 9375 59% /
devtmpfs 905 0 905 0% /dev
tmpfs 920 1 920 1% /dev/shm
tmpfs 920 9 911 1% /run
tmpfs 920 0 920 0% /sys/fs/cgroup
/dev/sda1 1014 178 837 18% /boot
Linux_DB2 240879 96794 144086 41% /media/sf_Linux_DB2
tmpfs 184 1 184 1% /run/user/42
tmpfs 184 1 184 1% /run/user/0
/dev/sr0 56 56 0 100% /run/media/root/VBox_GAs_5.2.20
[root@localhost home]# df -Pk --block-size=1M | awk '$5 > 40'
Filesystem 1048576-blocks Used Available Capacity Mounted on
/dev/mapper/rhel-root 22510 13135 9375 59% /
Linux_DB2 240879 96794 144086 41% /media/sf_Linux_DB2
The line /dev/sr0 56 56 0 100% /run/media/root/VBox_GAs_5.2.20 doesn't come out.
Could you please try the following:
df -hP | awk '$5+0>40'
Explanation: The 5th field of the df output is a string such as 100%, not a number, so awk compares $5 with 40 as strings, and "100%" sorts before "40". Adding zero ($5+0) forces awk to convert the field to a number, using its leading digits and ignoring the trailing %, so the comparison becomes numeric and gives the right output. The -P option of df also matters here: it keeps each filesystem's entry on a single line, so the field positions stay consistent for awk.
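The difference between the two comparisons can be reproduced without df. This small sketch feeds a few percentage strings (taken from the output above) through both conditions:

```shell
# String comparison: "100%" sorts before "40", so the 100% line is lost.
printf '%s\n' '59%' '41%' '18%' '100%' | awk '$1 > 40'
# prints 59% and 41%

# Numeric comparison: +0 converts the leading digits to a number.
printf '%s\n' '59%' '41%' '18%' '100%' | awk '$1+0 > 40'
# prints 59%, 41% and 100%
```

Note that with $5+0 the header line is dropped too, since "Capacity"+0 is 0; a condition such as NR==1 || $5+0 > 40 would keep it.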

Awk formatted output how to include sign for floats

I want to reformat a file by adjusting the spacing between fields. I have managed to get the content into this form:
130GLU HB2 383 0.058 5.178 2.925
130GLU CG 384 -0.108 5.065 2.887
130GLU HG1 385 -0.079 5.007 2.963
What I can't get right is the floats with minus signs. The above formatting is produced by awk '{printf " %6s%7s%5s %5.3f %5.3f %5.3f\n",$1,$2,$3,$4,$5,$6}' file. However, I want to get the following:
130GLU HB2 383 0.058 5.178 2.925
130GLU CG 384 -0.108 5.065 2.887
130GLU HG1 385 -0.079 5.007 2.963
In other words, account for the sign in counting the characters of the number, in this case the 5 part of %5.3f. How can I do this?
Just a simple change - see if it works:
awk '{printf " %6s%7s%5s % 5.3f % 5.3f % 5.3f\n",$1,$2,$3,$4,$5,$6}' file
Result:
130GLU HB2 383 0.058 5.178 2.925
130GLU CG 384 -0.108 5.065 2.887
130GLU HG1 385 -0.079 5.007 2.963
or, to print an explicit sign on every number:
awk '{printf " %6s%7s%5s %+5.3f %+5.3f %+5.3f\n",$1,$2,$3,$4,$5,$6}' file
Result:
130GLU HB2 383 +0.058 +5.178 +2.925
130GLU CG 384 -0.108 +5.065 +2.887
130GLU HG1 385 -0.079 +5.007 +2.963
Edit:
awk '{printf " %6s%7s%5s % 5.3f % 5.3f % 5.3f\n",$1,$2,$3,$4,$5,$6}' file
Result:
130GLU HB2 383 0.058 5.178 2.925
130GLU CG 384 -0.108 5.065 2.887
130GLU HG1 385 -0.079 5.007 2.963
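To see what each flag does in isolation, here is a small sketch that prints one value with the plain format, the space flag, the plus flag, and a wider field (%6.3f, an alternative not mentioned above). The function name show_flags is made up for illustration; brackets mark the field boundaries.

```shell
# show_flags VALUE: print VALUE with four printf variants, bracketed to show padding.
show_flags() {
    echo "$1" | awk '{ printf "[%5.3f][% 5.3f][%+5.3f][%6.3f]\n", $1, $1, $1, $1 }'
}

show_flags 0.058    # [0.058][ 0.058][+0.058][ 0.058]
show_flags -0.108   # [-0.108][-0.108][-0.108][-0.108]
```

With the space flag, the plus flag, or the wider field, positive and negative values occupy the same number of characters, which is what keeps the columns aligned.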

How to append the count of numbers in each line of text using awk?

I have several very large text files and would like to prepend to each line the count of numbers on that line, followed by a space. Could anyone kindly suggest how to do this efficiently using awk?
Input:
10 109 120 200 1148 1210 1500 5201
9 139 1239 1439 6551
199 5693 5695
Desired Output:
8 10 109 120 200 1148 1210 1500 5201
5 9 139 1239 1439 6551
3 199 5693 5695
You can use
awk '{print NF,$0}' input.txt
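This prints the number of fields in the current line (NF) followed by the line itself ($0). The comma in print inserts the output field separator (OFS), which is a space by default.
This will also work for you:
awk '{$0=NF FS $0}7' file
Here the line is rebuilt as NF, the field separator, and the original line; the trailing 7 is an always-true pattern that triggers the default print action.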
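Both one-liners can be checked against the sample input from the question without creating a file. In this sketch the variable name sample is made up for illustration:

```shell
# The three sample lines from the question.
sample=$(printf '%s\n' '10 109 120 200 1148 1210 1500 5201' '9 139 1239 1439 6551' '199 5693 5695')

# Field count, OFS (a space by default), then the original line.
printf '%s\n' "$sample" | awk '{ print NF, $0 }'

# Same result by rewriting $0; the pattern 7 is simply "true".
printf '%s\n' "$sample" | awk '{ $0 = NF FS $0 } 7'
```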

awk print multiple column file into single column

My file looks like that:
315
717
461 737
304
440
148 206 264 322 380 438 496
801
495
355
249 989
768
946
I want to print all those columns in a single column file (one long first column).
If I try
awk '{print $1}' file > new_file
awk '{print $2}' file >> new_file
there are blank lines in between. How can I solve this?
Perhaps a bit cryptic:
awk '1' RS='[[:space:]]+' inputfile
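It says "print every record, treating any run of whitespace as a record separator". Note that a regular expression as RS works in gawk and mawk, but POSIX does not specify the behaviour of a multi-character RS.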
You can simply use something like:
awk '{ for (i=1; i<=NF; i++) print $i }' file
For each line, iterate through columns, and print each column in a new line.
You don't need as much as sed for this: just translate spaces to newlines
tr ' ' '\n' < file
tr is purely a filter, so you have to redirect a file into it.
A perl solution:
perl -pe 's/\s+(?=\S)/\n/g' infile
Output:
315
717
461
737
304
440
148
206
264
322
380
438
496
801
495
355
249
989
768
946
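The approaches above differ once the input has irregular spacing. The sketch below uses a made-up sample with a double space and a leading space (not from the question) to show that plain tr leaves empty lines, while tr -s and the awk loop do not:

```shell
# Made-up sample: a double space on line 2 and a leading space on line 3.
sample=$(printf '315\n461  737\n 304\n')

# Every space becomes a newline, so extra spaces turn into empty lines.
printf '%s\n' "$sample" | tr ' ' '\n'

# -s squeezes the repeated newlines produced by the translation.
printf '%s\n' "$sample" | tr -s ' ' '\n'

# awk splits on runs of whitespace, so no empty fields appear.
printf '%s\n' "$sample" | awk '{ for (i = 1; i <= NF; i++) print $i }'
```

For input that already contains empty lines, or that begins with a space, the awk loop is the more forgiving of the three.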