Faster way to split a string and count characters using R? - optimization

I'm looking for a faster way to calculate GC content for DNA strings read in from a FASTA file. This boils down to taking a string and counting the number of times that the letter 'G' or 'C' appears. I also want to specify the range of characters to consider.
I have a working function that is fairly slow, and it's causing a bottleneck in my code. It looks like this:
##
## count the number of GCs in the characters between start and stop
##
gcCount <- function(line, st, sp){
    chars = strsplit(as.character(line), "")[[1]]
    numGC = 0
    for(j in st:sp){
        ## nested ifs are faster than an OR (|) construction
        if(chars[[j]] == "g"){
            numGC <- numGC + 1
        }else if(chars[[j]] == "G"){
            numGC <- numGC + 1
        }else if(chars[[j]] == "c"){
            numGC <- numGC + 1
        }else if(chars[[j]] == "C"){
            numGC <- numGC + 1
        }
    }
    return(numGC)
}
Running Rprof gives me the following output:
> a = "GCCCAAAATTTTCCGGatttaagcagacataaattcgagg"
> Rprof(filename="Rprof.out")
> for(i in 1:500000){gcCount(a,1,40)};
> Rprof(NULL)
> summaryRprof(filename="Rprof.out")
$by.self
               self.time self.pct total.time total.pct
"gcCount"          77.36     76.8     100.74     100.0
"=="               18.30     18.2      18.30      18.2
"strsplit"          3.58      3.6       3.64       3.6
"+"                 1.14      1.1       1.14       1.1
":"                 0.30      0.3       0.30       0.3
"as.logical"        0.04      0.0       0.04       0.0
"as.character"      0.02      0.0       0.02       0.0

$by.total
               total.time total.pct self.time self.pct
"gcCount"          100.74     100.0     77.36     76.8
"=="                18.30      18.2     18.30     18.2
"strsplit"           3.64       3.6      3.58      3.6
"+"                  1.14       1.1      1.14      1.1
":"                  0.30       0.3      0.30      0.3
"as.logical"         0.04       0.0      0.04      0.0
"as.character"       0.02       0.0      0.02      0.0

$sampling.time
[1] 100.74
Any advice for making this code faster?

Better to not split at all, just count the matches:
gcCount2 <- function(line, st, sp){
    sum(gregexpr('[GCgc]', substr(line, st, sp))[[1]] > 0)
}
That's an order of magnitude faster.
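For example, on the 40-character test string from the question (the expected count of 16 is my own tally; verify it on your own data):
a <- "GCCCAAAATTTTCCGGatttaagcagacataaattcgagg"
gcCount2(a, 1, 40)  # 16: 8 uppercase G/C plus 8 lowercase g/c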
A small C function that just iterates over the characters would be yet another order of magnitude faster.

A one-liner:
table(strsplit(toupper(a), '')[[1]])
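To turn that table into a GC fraction, you could sum the relevant counts (a sketch; it assumes both "G" and "C" actually occur in the string, otherwise the subscript yields NA):
tab <- table(strsplit(toupper(a), '')[[1]])
sum(tab[c("G", "C")]) / sum(tab)  # 0.4 for the example string a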

I don't know that it's any faster, but you might want to look at the R package seqinR - http://pbil.univ-lyon1.fr/software/seqinr/home.php?lang=eng. It is an excellent, general bioinformatics package with many methods for sequence analysis. It's on CRAN (which seems to be down as I write this).
GC content would be:
mysequence <- s2c("agtctggggggccccttttaagtagatagatagctagtcgta")
GC(mysequence) # 0.4761905
That's from a string; you can also read in a FASTA file using read.fasta().
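A minimal sketch of the file-based workflow ("seqs.fasta" is a hypothetical path):
library(seqinr)
seqs <- read.fasta("seqs.fasta")  # a list with one character vector per sequence
gcPerSeq <- sapply(seqs, GC)      # GC fraction of each sequence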

There's no need to use a loop here.
Try this:
gcCount <- function(line, st, sp){
    chars = strsplit(as.character(line), "")[[1]][st:sp]
    length(which(tolower(chars) == "g" | tolower(chars) == "c"))
}
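A variant along the same lines (a sketch, not benchmarked here) avoids calling tolower() twice by testing set membership directly:
gcCount3 <- function(line, st, sp){
    chars <- strsplit(as.character(line), "")[[1]][st:sp]
    sum(chars %in% c("g", "G", "c", "C"))
}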

Try this function from the stringi package:
> stri_count_fixed("GCCCAAAATTTTCCGG",c("G","C"))
[1] 3 5
or you can use the regex version to count all four characters at once:
> stri_count_regex("GCCCAAAATTTTCCGGggcc",c("G|g|C|c"))
[1] 12
or you can use stri_trans_tolower first and then stri_count:
> stri_trans_tolower("GCCCAAAATTTTCCGGggcc")
[1] "gcccaaaattttccggggcc"
Time performance (x below is presumably the 40-character example string from the question):
> microbenchmark(gcCount(x,1,40),gcCount2(x,1,40), stri_count_regex(x,c("[GgCc]")))
Unit: microseconds
                             expr     min     lq  median      uq     max neval
                gcCount(x, 1, 40) 109.568 112.42 113.771 116.473 146.492   100
               gcCount2(x, 1, 40)  15.010  16.51  18.312  19.213  40.826   100
 stri_count_regex(x, c("[GgCc]"))  15.610  16.51  18.912  20.112  61.239   100
Another example, for a longer string; stri_dup replicates a string n times:
> stri_dup("abc",3)
[1] "abcabcabc"
As you can see, stri_count is faster for longer sequences :)
> y <- stri_dup("GCCCAAAATTTTCCGGatttaagcagacataaattcgagg",100)
> microbenchmark(gcCount(y,1,40*100),gcCount2(y,1,40*100), stri_count_regex(y,c("[GgCc]")))
Unit: microseconds
                             expr       min         lq     median        uq       max neval
          gcCount(y, 1, 40 * 100) 10367.880 10597.5235 10744.4655 11655.685 12523.828   100
         gcCount2(y, 1, 40 * 100)   360.225   369.5315   383.6400   399.100   438.274   100
 stri_count_regex(y, c("[GgCc]"))   131.483   137.9370   151.8955   176.511   221.839   100
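Note that the stri_count calls above scan the whole string; to honour the st/sp range of the original function you could take the substring first (a sketch using stringi's stri_sub):
gcCountStri <- function(line, st, sp){
    stri_count_regex(stri_sub(line, st, sp), "[GgCc]")
}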

Thanks to all for this post.
To optimize a script in which I wanted to calculate the GC content of 100M sequences of 200bp each, I ended up testing the different methods proposed here. Ken Williams' method performed best (2.5 hours), better than seqinr (3.6 hours). Using str_count from stringr reduced the time to 1.5 hours.
In the end I coded it in C++ and called it from R using Rcpp, which cut the computation time down to 10 minutes!
Here is the C++ code:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
float pGC_cpp(std::string s) {
    int count = 0;
    for (size_t i = 0; i < s.size(); i++) {
        if (s[i] == 'G') count++;
        else if (s[i] == 'C') count++;
    }
    float pGC = (float)count / s.size();
    pGC = pGC * 100;
    return pGC;
}
I call it from R by typing:
sourceCpp("pGC_cpp.cpp")
pGC_cpp("ATGCCC")
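For many sequences you can then apply the function across a character vector; for example (a small sketch, assuming pGC_cpp has been sourced as above):
seqs <- c("ATGCCC", "GGGGAA", "ATATAT")
sapply(seqs, pGC_cpp)  # GC percentage of each sequence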

Related

Create histogram like bins for a range including negative numbers

I have numbers in a range from -4 to 4, including 0, as in
-0.526350041828112
-0.125648350883331
0.991377353361933
1.079241128983
1.06322905224238
1.17477528478982
-0.0651086035371559
0.818471811380787
0.0355593553368815
I need to create histogram like buckets, and have being trying to use this
BEGIN { delta = (delta == "" ? 0.1 : delta) }
{
    bucketNr = int(($0+delta) / delta)
    cnt[bucketNr]++
    numBuckets = (numBuckets > bucketNr ? numBuckets : bucketNr)
}
END {
    for (bucketNr=1; bucketNr<=numBuckets; bucketNr++) {
        end = beg + delta
        printf "%0.1f %0.1f %d\n", beg, end, cnt[bucketNr]
        beg = end
    }
}
from Create bins with awk histogram-like
The output would look like
-2.4 -2.1 8
-2.1 -1.8 25
-1.8 -1.5 108
-1.5 -1.2 298
-1.2 -0.9 773
-0.9 -0.6 1067
-0.6 -0.3 1914
-0.3 0.0 4174
0.0 0.3 3969
0.3 0.6 2826
0.6 0.9 1460
0.9 1.2 752
1.2 1.5 396
1.5 1.8 121
1.8 2.1 48
2.1 2.4 13
2.4 2.7 1
2.7 3.0 1
I'm thinking I would have to run this twice, once with delta = 0.3 and once with delta = -0.3, and cat the two outputs together. But I'm not sure this intuition is correct.
This might work for you:
BEGIN { delta = (delta == "" ? 0.1 : delta) }
{
    bucketNr = int(($0<0 ? $0-delta : $0)/delta)
    cnt[bucketNr]++
    maxBucket = (maxBucket > bucketNr ? maxBucket : bucketNr)
    minBucket = (minBucket < bucketNr ? minBucket : bucketNr)
}
END {
    beg = minBucket*delta
    for (bucketNr=minBucket; bucketNr<=maxBucket; bucketNr++) {
        end = beg + delta
        printf "%0.1f %0.1f %d\n", beg, end, cnt[bucketNr]
        beg = end
    }
}
It's basically the code you posted + handling negative numbers.
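Assuming the script is saved as buckets.awk (a hypothetical filename) and the numbers sit one per line in data.txt, you would run it as:
awk -v delta=0.3 -f buckets.awk data.txt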

awk script to sum numbers in a column over a loop not working for some iterations in the loop

Sample input
12.0000 0.6000000 0.05
13.0000 1.6000000 0.05
14.0000 2.6000000 0.05
15.0000 3.0000000 0.05
15.0000 3.2000000 0.05
15.0000 3.4000000 0.05
15.0000 3.6000000 0.10
15.0000 3.8000000 0.10
15.0000 4.0000000 0.10
15.0000 4.2000000 0.11
15.0000 4.4000000 0.12
15.0000 4.6000000 0.13
15.0000 4.8000000 0.14
15.0000 5.0000000 0.15
15.0000 5.2000000 0.14
15.0000 5.4000000 0.13
15.0000 5.6000000 0.12
15.0000 5.8000000 0.11
15.0000 6.0000000 0.10
15.0000 6.2000000 0.10
15.0000 6.4000000 0.10
15.0000 6.6000000 0.05
15.0000 6.8000000 0.05
15.0000 7.0000000 0.05
Goal
Print line 1 in output as 0 0
For $2 = 5.000000, $3 = 0.15.
Print line 2 in output as 1 0.15
For $2 = 4.800000 through $2 = 5.200000, sum+=$3 for each line (i.e. 0.14 + 0.15 + 0.14 = 0.43).
Print line 3 in output as 2 0.43.
For $2 = 4.600000 through $2 = 5.400000, sum+=$3 for each line (i.e. 0.13 + 0.14 + 0.15 + 0.14 + 0.13 = 0.69).
Print line 4 in output as 3 0.69
Continue this pattern until $2 = 5.000000 +- 1.6 (9 lines total, plus line 1 as 0 0 = 10 total lines in output)
Desired Output
0 0
1 0.15
2 0.43
3 0.69
4 0.93
5 1.15
6 1.35
7 1.55
8 1.75
9 1.85
Attempt
Script 1
#!/bin/bash
for (( i=0; i<=8; i++ )); do
    awk '$2 >= 5.0000000-'$i'*0.2 {sum+=$3}
         $2 == 5.0000000+'$i'*0.2 {print '$i', sum; exit}' test.dat
done > test.out
produces
0 0.15
1 0.43
2 0.69
3 0.93
4 1.15
5 1.35
6 1.55
7 1.75
8 1.85
This is very close. However, the output is missing 0 0 for line 1, and because of this, lines 2 through 10 have $1 and $2 mismatched by 1 line.
Script 2
#!/bin/bash
for (( i=0; i<=8; i++ )); do
    awk ''$i'==0 {sum=0}
         '$i'>0 && $2 > 5.0000000-'$i'*0.2 {sum+=$3}
         $2 == 5.0000000+'$i'*0.2 - ('$i' ? 0.2 : 0) {print '$i', sum; exit}' test.dat
done > test.out
which produces
0 0
1 0.15
2 0.43
4 0.93
5 1.15
6 1.35
7 1.55
$1 and $2 are now correctly matched. However, I am missing the lines with $1=3, $1=8, and $1=9 completely. Adding the ternary operator causes my code to skip these iterations in the loop somehow.
Question
Can anyone explain what's wrong with script 2, or how to achieve the desired output in one line of code? Thank you.
Solution
I used Ed Morton's solution to solve this; both of his versions work, for different goals. Instead of using the modulus to save array space, I constrained the array to lines with $1 = 15.0000. I did this instead of the modulus so that I could also sum two other "key" variables, at different parts of the input, into separate output files.
Furthermore, as far as I understood the optimized version, it effectively summed only the lines with $2 >= 5.0000000 and counted them twice, to stand in for the lines with $2 <= 5.0000000. That works for the sample input here because I made $3 symmetric around 0.15; I modified it to sum the two sides separately, though.
awk 'BEGIN { key=5; range=9 }
    $1 == 15.0000 {
        a[NR] = $3
    }
    $2 == key { keyIdx = NR }
    END {
        print (0, 0) > "test.out"
        sum = a[keyIdx]
        for (delta=1; delta<=range; delta++) {
            print (delta, sum) > "test.out"
            plusIdx  = (keyIdx + delta)
            minusIdx = (keyIdx - delta)
            sum += a[plusIdx] + a[minusIdx]
        }
        exit
    }' test.dat
Is this what you're trying to do?
$ cat tst.awk
$2 == 5 { keyNr = NR }
        { nr2val[NR] = $3 }
END {
    print 0, 0
    sum = nr2val[keyNr]
    for (delta=1; delta<=9; delta++) {
        print delta, sum
        sum += nr2val[keyNr+delta] + nr2val[keyNr-delta]
    }
}
$ awk -f tst.awk file
0 0
1 0.15
2 0.43
3 0.69
4 0.93
5 1.15
6 1.35
7 1.55
8 1.75
9 1.85
We could optimize it to store only the last 2*range+1 values in nr2val[] (using the modulus NR % (2*range+1) as the index) and to do the calculation once we hit an NR that is range lines past the line where $2 == key, rather than after reading the whole input. That helps if the script is too slow or the input file is too big to store in memory, e.g.:
$ cat tst.awk
BEGIN { key=5; range=9; bufSize = 2*range + 1 }
{
    idx = NR % bufSize
    nr2val[idx] = $3
}
$2 == key { keyIdx = idx; endNr = NR+range }
NR == endNr { exit }
END {
    print 0, 0
    sum = nr2val[keyIdx]
    for (delta=1; delta<=range; delta++) {
        print delta, sum
        plusIdx  = (keyIdx + delta) % bufSize
        minusIdx = (keyIdx - delta + bufSize) % bufSize
        sum += nr2val[plusIdx] + nr2val[minusIdx]
    }
}
$ awk -f tst.awk file
0 0
1 0.15
2 0.43
3 0.69
4 0.93
5 1.15
6 1.35
7 1.55
8 1.75
9 1.85
I like your problem; it is a nice challenge.
My approach is to put as much as possible into the awk script and to scan the input file only once, because I/O is slower than computation (these days), doing all the computations (actually 9) on each relevant input line.
The required inputs are the variable F1 and the text file input.txt.
The execution command is:
awk -v F1=95 -f script.awk input.txt
So the logic is:
1. Initialize: compute the 9 range markers and store their values in arrays.
2. Store the 3rd input field in an ordered array field3; we use this array to compute the sums.
3. On each line whose 1st field equals 15.0000:
3.1 If a begin marker is found, record its position.
3.2 If an end marker is found, compute the sum and record its position.
4. Finalize: output all the computed results.
script.awk, including a few debug printouts (commented out) to assist in debugging:
BEGIN {
    itrtns = 8; # iteration count, consistent all over the program
    for (i = 0; i <= itrtns; i++) { # compute range markers per iteration
        F1start[i] = (F1 - 2 - i)/5 - 14; # print "F1start["i"]="F1start[i];
        F1stop[i] = (F1 - 2 + i)/5 - 14;  # print "F1stop["i"]="F1stop[i];
        b[i] = F1start[i] + (i ? 0.2 : 0); # print "b["i"]="b[i];
    }
}
{ field3[NR] = $3; } # store the 3rd input field in an ordered array
$1==15.0000 { # for each input line whose 1st field is 15.0000
    currVal = $2 + 0; # convert the 2nd input field to a numeric value
    for (i = 0; i <= itrtns; i++) { # scan each line for range markers
        # print "i="i, "currVal="currVal, "b["i"]="b[i], "F1stop["i"]="F1stop[i], isZero(currVal-b[i]), isZero(currVal-F1stop[i]);
        if (isZero(currVal - b[i])) { # begin marker found
            F1idx[i] = NR; # store the marker's position
            # print "F1idx["i"] =", F1idx[i];
        }
        if (isZero(currVal - F1stop[i])) { # end marker found
            for (s = F1idx[i]; s <= NR; s++) { sum[i] += field3[s]; } # compute its sum
            F2idx[i] = NR; # store the end marker's position (for the debug report)
            # print "field3["NR"]=", field3[NR];
        }
    }
}
END { # output the computed results
    for (i = 0; i <= itrtns; i++) { print i, sum[i], "rows("F1idx[i]"-"F2idx[i]")" }
}
function isZero(floatArg) { # floating point comparison with a tolerance
    tolerance = 0.00000000001;
    if (floatArg < tolerance && floatArg > -1 * tolerance)
        return 1;
    return 0;
}
input.txt is the sample input provided in the question.
The output for: awk -v F1=95 -f script.awk input.txt
0 0.13 rows(12-12)
1 0.27 rows(12-13)
2 0.54 rows(11-14)
3 0.79 rows(10-15)
4 1.02 rows(9-16)
5 1.24 rows(8-17)
6 1.45 rows(7-18)
7 1.6 rows(6-19)
8 1.75 rows(5-20)
The output for: awk -v F1=97 -f script.awk input.txt
0 0.15 rows(14-14)
1 0.29 rows(14-15)
2 0.56 rows(13-16)
3 0.81 rows(12-17)
4 1.04 rows(11-18)
5 1.25 rows(10-19)
6 1.45 rows(9-20)
7 1.65 rows(8-21)
8 1.8 rows(7-22)

Mixture prior not working in JAGS, only when likelihood term included

The code at the bottom will replicate the problem, just copy and paste it into R.
What I want is for the mean and precision to be (-100, 100) 30% of the time, and (200, 1000) for 70% of the time. Think of it as lined up in a, b, and p.
So 'pick' should be 1 30% of the time, and 2 70% of the time.
What actually happens is that on every iteration, pick is 2 (or 1 if the first element of p is the larger one). You can see this in the summary, where the quantiles for 'pick', 'testa', and 'testb' remain unchanged throughout. The strangest thing is that if you remove the likelihood loop, pick then works exactly as intended.
I hope this explains the problem, if not let me know. It's my first time posting so I'm bound to have messed things up.
library(rjags)
n = 10
y <- rnorm(n, 5, 10)
a = c(-100, 200)
b = c(100, 1000)
p = c(0.3, 0.7)
## Model
mod_str = "model{
    # Likelihood
    for (i in 1:n){
        y[i] ~ dnorm(mu, 10)
    }
    # ISSUE HERE: MIXTURE PRIOR
    mu ~ dnorm(a[pick], b[pick])
    pick ~ dcat(p[1:2])
    testa = a[pick]
    testb = b[pick]
}"
model = jags.model(textConnection(mod_str), data = list(y = y, n=n, a=a, b=b, p=p), n.chains=1)
update(model, 10000)
res = coda.samples(model, variable.names = c('pick', 'testa', 'testb', 'mu'), n.iter = 10000)
summary(res)
I think you are having problems for a couple of reasons. First, the data that you have supplied to the model (i.e., y) is not a mixture of normal distributions. As a result, the model itself has no need to mix. I would instead generate data something like this:
set.seed(320)
# number of samples
n <- 10
# Because it is a mixture of 2 we can just use an indicator variable.
# here, pick (in the long run), would be '1' 30% of the time.
pick <- rbinom(n, 1, p[1])
# generate the data. b is in terms of precision so we are converting this
# to standard deviations (which is what R wants).
y_det <- pick * rnorm(n, a[1], sqrt(1/b[1])) + (1 - pick) * rnorm(n, a[2], sqrt(1/b[2]))
# add a small amount of noise, can change to be more as necessary.
y <- rnorm(n, y_det, 1)
These data look more like what you would want to supply to a mixture model.
Following this, I would code the model in a way that mirrors the data generation process: I want an indicator variable that jumps between the two normal distributions, so mu may change for each element of y.
mod_str = "model{
    # Likelihood
    for (i in 1:n){
        y[i] ~ dnorm(mu[i], 10)
        mu[i] <- mu_ind[i] * a_mu + (1 - mu_ind[i]) * b_mu
        mu_ind[i] ~ dbern(p[1])
    }
    a_mu ~ dnorm(a[1], b[1])
    b_mu ~ dnorm(a[2], b[2])
}"
model = jags.model(textConnection(mod_str), data = list(y = y, n=n, a=a, b=b, p=p), n.chains=1)
update(model, 10000)
res = coda.samples(model, variable.names = c('mu_ind', 'a_mu', 'b_mu'), n.iter = 10000)
summary(res)
             2.5%    25%    50%    75% 97.5%
a_mu       -100.4 -100.3 -100.2 -100.1  -100
b_mu        199.9  200.0  200.0  200.0   200
mu_ind[1]     0.0    0.0    0.0    0.0     0
mu_ind[2]     1.0    1.0    1.0    1.0     1
mu_ind[3]     0.0    0.0    0.0    0.0     0
mu_ind[4]     1.0    1.0    1.0    1.0     1
mu_ind[5]     0.0    0.0    0.0    0.0     0
mu_ind[6]     0.0    0.0    0.0    0.0     0
mu_ind[7]     1.0    1.0    1.0    1.0     1
mu_ind[8]     0.0    0.0    0.0    0.0     0
mu_ind[9]     0.0    0.0    0.0    0.0     0
mu_ind[10]    1.0    1.0    1.0    1.0     1
If you supplied more data, you would (in the long run) have the indicator variable mu_ind take the value of 1 30% of the time. If you had more than 2 distributions you could instead use dcat. Thus, an alternative and more generalized way of doing this would be (and I am borrowing heavily from this post by John Kruschke):
mod_str = "model {
    # Likelihood:
    for( i in 1 : n ) {
        y[i] ~ dnorm( mu[i] , 10 )
        mu[i] <- muOfpick[ pick[i] ]
        pick[i] ~ dcat( p[1:2] )
    }
    # Prior:
    for ( i in 1:2 ) {
        muOfpick[i] ~ dnorm( a[i] , b[i] )
    }
}"
model = jags.model(textConnection(mod_str), data = list(y = y, n=n, a=a, b=b, p=p), n.chains=1)
update(model, 10000)
res = coda.samples(model, variable.names = c('pick', 'muOfpick'), n.iter = 10000)
summary(res)
              2.5%    25%    50%    75% 97.5%
muOfpick[1] -100.4 -100.3 -100.2 -100.1  -100
muOfpick[2]  199.9  200.0  200.0  200.0   200
pick[1]        2.0    2.0    2.0    2.0     2
pick[2]        1.0    1.0    1.0    1.0     1
pick[3]        2.0    2.0    2.0    2.0     2
pick[4]        1.0    1.0    1.0    1.0     1
pick[5]        2.0    2.0    2.0    2.0     2
pick[6]        2.0    2.0    2.0    2.0     2
pick[7]        1.0    1.0    1.0    1.0     1
pick[8]        2.0    2.0    2.0    2.0     2
pick[9]        2.0    2.0    2.0    2.0     2
pick[10]       1.0    1.0    1.0    1.0     1
The link above includes even more priors (e.g., a Dirichlet prior on the probabilities incorporated into the Categorical distribution).
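To check the mixing directly from the posterior draws, one could (a sketch, assuming res from the last coda.samples call) tabulate how often each observation is assigned to component 1:
m <- as.matrix(res)
pickCols <- grep("^pick\\[", colnames(m), value = TRUE)
colMeans(m[, pickCols] == 1)  # per-observation frequency of component 1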

go benchmark and gc: B/op alloc/op

benchmark code:
func BenchmarkSth(b *testing.B) {
    var x []int
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        x = append(x, i)
    }
}
result:
BenchmarkSth-4 50000000 20.7 ns/op 40 B/op 0 allocs/op
Questions:
Where did the 40 B/op come from? (Any way of tracing this, with instructions, is greatly appreciated.)
How is it possible to have 40 B/op while having 0 allocs/op?
Which one affects the GC, and how (B/op or allocs/op)?
Is it really possible to have 0 B/op using append?
The Go Programming Language Specification
Appending to and copying slices
The variadic function append appends zero or more values x to s of
type S, which must be a slice type, and returns the resulting slice,
also of type S.
append(s S, x ...T) S // T is the element type of S
If the capacity of s is not large enough to fit the additional values,
append allocates a new, sufficiently large underlying array that fits
both the existing slice elements and the additional values. Otherwise,
append re-uses the underlying array.
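You can watch this growth policy directly (a small sketch; the exact capacities are an implementation detail of the Go runtime and may differ between versions):
package main

import "fmt"

func main() {
    var x []int
    for i := 0; i < 10; i++ {
        x = append(x, i)
        fmt.Println(len(x), cap(x)) // cap typically grows 1, 2, 4, 8, 16, ...
    }
}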
For your example, on average, [40, 41) bytes per operation are allocated to increase the capacity of the slice when necessary. The capacity is increased using an amortized-constant-time algorithm: while the length is below 1024 the capacity is doubled; after that it grows by a factor of about 1.25. On average, there are [0, 1) allocations per operation.
For example,
func BenchmarkMem(b *testing.B) {
    b.ReportAllocs()
    var x []int64
    var a, ac int64
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        c := cap(x)
        x = append(x, int64(i))
        if cap(x) != c {
            a++
            ac += int64(cap(x))
        }
    }
    b.StopTimer()
    sizeInt64 := int64(8)
    B := ac * sizeInt64 // bytes
    b.Log("op", b.N, "B", B, "alloc", a, "lx", len(x), "cx", cap(x))
}
Output:
BenchmarkMem-4 50000000 26.6 ns/op 40 B/op 0 allocs/op
--- BENCH: BenchmarkMem-4
bench_test.go:32: op 1 B 8 alloc 1 lx 1 cx 1
bench_test.go:32: op 100 B 2040 alloc 8 lx 100 cx 128
bench_test.go:32: op 10000 B 386296 alloc 20 lx 10000 cx 12288
bench_test.go:32: op 1000000 B 45188344 alloc 40 lx 1000000 cx 1136640
bench_test.go:32: op 50000000 B 2021098744 alloc 57 lx 50000000 cx 50539520
For op = 50000000,
B/op = floor(2021098744 / 50000000) = floor(40.421974888) = 40
allocs/op = floor(57 / 50000000) = floor(0.00000114) = 0
Read:
Go Slices: usage and internals
Arrays, slices (and strings): The mechanics of 'append'
'append' complexity
To have zero B/op (and zero allocs/op) for append, allocate a slice with sufficient capacity before appending.
For example, with var x = make([]int64, 0, b.N),
func BenchmarkZero(b *testing.B) {
    b.ReportAllocs()
    var x = make([]int64, 0, b.N)
    var a, ac int64
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        c := cap(x)
        x = append(x, int64(i))
        if cap(x) != c {
            a++
            ac += int64(cap(x))
        }
    }
    b.StopTimer()
    sizeInt64 := int64(8)
    B := ac * sizeInt64 // bytes
    b.Log("op", b.N, "B", B, "alloc", a, "lx", len(x), "cx", cap(x))
}
Output:
BenchmarkZero-4 100000000 11.7 ns/op 0 B/op 0 allocs/op
--- BENCH: BenchmarkZero-4
bench_test.go:51: op 1 B 0 alloc 0 lx 1 cx 1
bench_test.go:51: op 100 B 0 alloc 0 lx 100 cx 100
bench_test.go:51: op 10000 B 0 alloc 0 lx 10000 cx 10000
bench_test.go:51: op 1000000 B 0 alloc 0 lx 1000000 cx 1000000
bench_test.go:51: op 100000000 B 0 alloc 0 lx 100000000 cx 100000000
Note the reduction in benchmark CPU time from around 26.6 ns/op to around 11.7 ns/op.

How to explain the strange results for while loop with floating point in swift

I have tested the while loop below and don't understand the result.
var x: Float = 0.0
var counter = 0
while x < 1.41 {
    x += 0.1
    counter += 1
}
print(counter) // 15
print(x)       // 1.5
How is it possible to get the result x = 1.5 with the loop condition x < 1.41? How can this result be explained?
Update:
And one more: why are the results different for Double and Float?
var x: Double = -0.5
var counter = 0
while x < 1.0 {
    x += 0.1
    counter += 1
}
print(counter) // 16
print(x)       // 1.1

var x: Float = -0.5
var counter = 0
while x < 1.0 {
    x += 0.1
    counter += 1
}
print(counter) // 15
print(x)       // 1.0
Update 2
And another one: why is there no difference between the < and <= conditions? Does that mean using <= makes no sense for floating point?
var x: Double = 0.0
var counter = 0
while x < 1.5 {
    x += 0.1
    counter += 1
}
print(counter) // 15
print(x)       // 1.5

var x: Double = 0.0
var counter = 0
while x <= 1.5 {
    x += 0.1
    counter += 1
}
print(counter) // 15
print(x)       // 1.5
What else would you expect? The loop is executed 15 times. After the 14th iteration, x is 1.4, which is still less than 1.41, so the body runs once more and adds another 0.1, making it 1.5.
If you expect the loop to terminate at 1.4, you should increment x before checking the while condition, not after it.
If you expect the loop to terminate at 1.41, your increment is wrong; you should do
x += 0.01
instead, making it 141 iterations.
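The root cause is that 0.1 has no exact binary floating point representation, so the accumulated value drifts away from the decimal value you expect. A minimal illustration:
let a = 0.1 + 0.2
print(a == 0.3)  // false
print(a)         // 0.30000000000000004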
As for the second question: I was aware that Float should not be used for monetary calculations and the like due to its lack of precision, but I trusted Double so far. Yet in run 15 the while loop claims the Double value is less than 1.0 while it is printed as 1.0. We have a precision problem here, as we can see if we subtract x from 1.0:
print(1.0-x)
which returns: 1.11022302462516e-16
At the same time, Float is imprecise in the other direction. In the last run it is a little bigger than 0.9 (0.9 + 5.96046e-08), making it bigger than 1.0 in the following run.
The reason why Double and Float are wrong in different directions is just a matter of how the values are stored, and the result will be different depending on the number. For example, with 2.0 both actual values are bigger: Double by 4.440892..e-16 and Float by 2.38419e-07. For 3.0 Double is bigger by 1.33226e-15 and Float smaller by 7.1525e-07.
The same problem occurs using x.isLess(than: 1.0), and this method is the basis for the < operator according to https://developer.apple.com/reference/swift/floatingpoint/1849403-isless
isLessThanOrEqualTo(1.0), on the other hand, seems to work reliably as expected.
This answer is pretty much a question itself by now, so I'm curious if anyone has an in-depth explanation of this...
Update
The more I think about it, the less it is a Swift problem. Basically, you have this issue in all floating point calculations, because they are never exact. Neither Float nor Double is exact; Double is just twice as accurate. This means that comparisons like == are useless with floating point values unless both sides are rounded first. Therefore, in loops like yours with a known precision (in your case one decimal place), good advice is to round to that precision before doing any kind of comparison. For example, this would fix the loop:
var x: Double = -0.5
var counter = 0
while (round(x * 1000) / 1000) < 1.0 {
    x += 0.1
    counter += 1
}
print(counter) // 15
print(x)       // 1.0

var x: Float = -0.5
var counter = 0
while (round(x * 1000) / 1000) < 1.0 {
    x += 0.1
    counter += 1
}
print(counter) // 15
print(x)       // 1.0