The performance difference between java.lang.System and Unsafe - jvm

System and Unsafe offer some overlapping functionality (for example, System.arraycopy vs. _UNSAFE.copyMemory).
In terms of implementation, it looks like both rely on JNI. Is that correct? (I could find unsafe.cpp but not the corresponding arraycopy implementation in the JVM source code.)
Also, if both rely on JNI, can I say that the invocation overhead of the two is similar?
I know Unsafe can manipulate off-heap memory, but let's restrict the comparison to on-heap memory here.
Thanks for the answer.

Both System.arraycopy and Unsafe.copyMemory are HotSpot intrinsics. This means the JVM does not go through the JNI implementation when these methods are called from a JIT-compiled method. Instead, it replaces the call with optimized, architecture-specific assembly code.
You may find the sources in stubGenerator_<arch>.cpp.
Here is a simple JMH benchmark:
import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.Setup;
import org.openjdk.jmh.annotations.State;
import java.util.concurrent.ThreadLocalRandom;
import static one.nio.util.JavaInternals.byteArrayOffset;
import static one.nio.util.JavaInternals.unsafe;
@State(Scope.Benchmark)
public class CopyMemory {
    @Param({"12", "123", "1234", "12345", "123456"})
    int size;
    byte[] src;
    byte[] dst;
    @Setup
    public void setup() {
        src = new byte[size];
        dst = new byte[size];
        ThreadLocalRandom.current().nextBytes(src);
    }
    @Benchmark
    public void systemArrayCopy() {
        System.arraycopy(src, 0, dst, 0, src.length);
    }
    @Benchmark
    public void unsafeCopyMemory() {
        unsafe.copyMemory(src, byteArrayOffset, dst, byteArrayOffset, src.length);
    }
}
It confirms the performance of both methods is similar:
Benchmark (size) Mode Cnt Score Error Units
CopyMemory.systemArrayCopy 12 avgt 16 5.294 ± 0.162 ns/op
CopyMemory.systemArrayCopy 123 avgt 16 7.057 ± 0.406 ns/op
CopyMemory.systemArrayCopy 1234 avgt 16 18.761 ± 0.492 ns/op
CopyMemory.systemArrayCopy 12345 avgt 16 353.386 ± 3.627 ns/op
CopyMemory.systemArrayCopy 123456 avgt 16 5234.125 ± 57.914 ns/op
CopyMemory.unsafeCopyMemory 12 avgt 16 5.028 ± 0.120 ns/op
CopyMemory.unsafeCopyMemory 123 avgt 16 8.055 ± 0.405 ns/op
CopyMemory.unsafeCopyMemory 1234 avgt 16 19.776 ± 0.523 ns/op
CopyMemory.unsafeCopyMemory 12345 avgt 16 353.549 ± 5.878 ns/op
CopyMemory.unsafeCopyMemory 123456 avgt 16 5246.298 ± 65.427 ns/op
If you run this JMH benchmark with the perfasm profiler (-prof perfasm), you'll see that both methods boil down to exactly the same assembly loop:
# systemArrayCopy
0.64% ↗ 0x00007fa95d4336d0: vmovdqu -0x38(%rdi,%rdx,8),%ymm0
2.81% │ 0x00007fa95d4336d6: vmovdqu %ymm0,-0x38(%rsi,%rdx,8)
5.67% │ 0x00007fa95d4336dc: vmovdqu -0x18(%rdi,%rdx,8),%ymm1
69.64% │ 0x00007fa95d4336e2: vmovdqu %ymm1,-0x18(%rsi,%rdx,8)
15.28% │ 0x00007fa95d4336e8: add $0x8,%rdx
╰ 0x00007fa95d4336ec: jle Stub::jbyte_disjoint_arraycopy+112 0x00007fa95d4336d0
# unsafeCopyMemory
1.08% ↗ 0x00007f2d39833af0: vmovdqu -0x38(%rdi,%rdx,8),%ymm0
3.09% │ 0x00007f2d39833af6: vmovdqu %ymm0,-0x38(%rcx,%rdx,8)
5.78% │ 0x00007f2d39833afc: vmovdqu -0x18(%rdi,%rdx,8),%ymm1
66.44% │ 0x00007f2d39833b02: vmovdqu %ymm1,-0x18(%rcx,%rdx,8)
19.00% │ 0x00007f2d39833b08: add $0x8,%rdx
╰ 0x00007f2d39833b0c: jle Stub::jlong_disjoint_arraycopy+48 0x00007f2d39833af0
When working with regular arrays in the Java heap, there is absolutely no need to use the Unsafe API. The standard System.arraycopy is very well optimized. The JDK class library itself uses System.arraycopy pretty much everywhere, including StringBuilder, ArrayList, ByteArrayOutputStream, etc.

Is there a Micropython library for Adafruit's TLC5947?

I'm working on an RPi Pico W based project, and I need to use a TLC5947 LED driver. The connection is SPI, which I'm told is pretty simple, but I tried to implement it myself and couldn't get it working. Adafruit has a CircuitPython module, but it doesn't seem to translate (directly, at least) into MicroPython.
Do I need to keep working on my own implementation, or is there a module already made?
My attempt (I assume write or pwmbuffer is the problem, FWIW; the comments are copied directly from Adafruit's C++ version of the library for Arduino):
from machine import Pin
import machine
class TLC5947:
    def __init__(self, clock: int = 2, data: int = 3, latch: int = 5):
        self.numdrivers = 1
        self.data = Pin(data, Pin.OUT)
        self.clock = Pin(clock, Pin.OUT)
        self.latch = Pin(latch, Pin.OUT)
        self.latch.low()
        self._spi = machine.SPI(0)
        # self.OE = OE
        self.pwmbuffer = [0] * (24 * 2 * self.numdrivers)  # memset(pwmbuffer, 0, 2 * 24 * n);
        # self.spi = machine.SPI(0)
    def write(self):
        self.latch.low()  # digitalWrite(_lat, LOW);
        # // 24 channels per TLC5974
        for c in range(24 * self.numdrivers - 1, -1, -1):  # for (int16_t c = 24 * numdrivers - 1; c >= 0; c--) {
            # // 12 bits per channel, send MSB first
            for b in range(11, -1, -1):  # for (int8_t b = 11; b >= 0; b--) {
                self.clock.low()  # digitalWrite(_clk, LOW);
                if self.pwmbuffer[c] & (1 << b):  # if (pwmbuffer[c] & (1 << b))
                    self.data.high()  # digitalWrite(_dat, HIGH);
                else:  # else
                    self.data.low()  # digitalWrite(_dat, LOW);
                #
                self.clock.high()  # digitalWrite(_clk, HIGH);
            # }
        # }
        self.clock.low()  # digitalWrite(_clk, LOW);
        self.latch.high()  # digitalWrite(_lat, HIGH);
        self.latch.low()  # digitalWrite(_lat, LOW);
    def setLed(self, lednum, r, g, b):
        self.setPWM(lednum * 3, r)
        self.setPWM(lednum * 3 + 1, g)
        self.setPWM(lednum * 3 + 2, b)
    def setPWM(self, chan: int, pwm: int):
        if (pwm > 4095):
            pwm = 4095
        try:
            self.pwmbuffer[chan] = pwm
        except:
            pass
Edit:
Got it. That repo refers to a folder structure like this:
project/
├── modules/
│ └──tlc5947-rgb-micropython/
│ ├──...
│ └──micropython.mk
└── micropython/
├──ports/
... ├──stm32/
...
But I don't have anything like that. Mine is:
project/
|_ .vscode/
| |_ ...
|_ lib/
|_ code.py
|_ i2c_display.py
|_ tlc5947_ME.py
|_ .picowgo # Used by the Pico-W-Go vscode extension to allow Pico programming in vscode
Via Awesome MicroPython, I found https://gitlab.com/peterzuger/tlc5947-rgb-micropython - it looks fairly up-to-date.
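If you do end up writing the driver yourself, one option is to pack the channel values into a byte buffer and hand that to hardware SPI instead of toggling pins bit by bit. The following is only a sketch under assumptions: pack_tlc5947 is a hypothetical helper name (not part of any library), and the bit order simply mirrors the loop in the question (last channel first, 12 bits per channel, MSB first). It is plain Python, so the packing can be checked on a desktop before trying it on the Pico.

```python
def pack_tlc5947(pwm):
    """Pack 12-bit channel values into bytes, last channel first, MSB first.

    Hypothetical helper: `pwm` holds one integer per channel (24 per chip).
    Two 12-bit values fill exactly three bytes.
    """
    n = len(pwm)
    if n == 0 or n % 24 != 0:
        raise ValueError("expected 24 values per TLC5947")
    out = bytearray(n * 3 // 2)
    i = 0
    # Walk the channels from the highest index down, two at a time.
    for c in range(n - 1, 0, -2):
        a = min(max(pwm[c], 0), 4095)      # shifted out first
        b = min(max(pwm[c - 1], 0), 4095)  # shifted out second
        out[i] = a >> 4                           # top 8 bits of a
        out[i + 1] = ((a & 0x0F) << 4) | (b >> 8)  # low 4 of a, top 4 of b
        out[i + 2] = b & 0xFF                     # low 8 bits of b
        i += 3
    return out


buf = pack_tlc5947([0] * 23 + [4095])
print(len(buf), bytes(buf[:3]))
```

On the board, the buffer could then presumably be sent with the SPI object's write method, followed by a high/low pulse on the latch pin, as at the end of write in the question. That part depends on your wiring and is untested here.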

pandas iterate over 3 data frames element wise into a function

I wrote:
def revertcheck(basevalue, first, second):
    if basevalue == 1:
        return 0
    elif basevalue > first and first > second:
        return -abs(first - second)
    elif basevalue < first and first < second:
        return -abs(first - second)
    else:
        return abs(first - second)
and now I have 3 same sized correlation matrices of the type
pandas.core.frame.DataFrame
I want to iterate over every element, and feed all those 3 values into my function at a time. Can someone give me a hint how to do that?
          AAPL      AMZN       BAC        GE        GM      GOOG        GS       SNP       XOM
AAPL  1.000000  0.567053  0.410656  0.232328  0.562110  0.616592  0.800797 -0.139989  0.147852
AMZN  0.567053  1.000000 -0.012830  0.071066  0.271695  0.715317  0.146355 -0.861710 -0.015936
BAC   0.410656 -0.012830  1.000000  0.953016  0.958784  0.680979  0.843638  0.466912  0.942582
GE    0.232328  0.071066  0.953016  1.000000  0.935008  0.741110  0.667574  0.308813  0.995237
GM    0.562110  0.271695  0.958784  0.935008  1.000000  0.857678  0.857719  0.206432  0.899904
GOOG  0.616592  0.715317  0.680979  0.741110  0.857678  1.000000  0.632255 -0.326059  0.675568
GS    0.800797  0.146355  0.843638  0.667574  0.857719  0.632255  1.000000  0.373738  0.623147
SNP  -0.139989 -0.861710  0.466912  0.308813  0.206432 -0.326059  0.373738  1.000000  0.369004
XOM   0.147852 -0.015936  0.942582  0.995237  0.899904  0.675568  0.623147  0.369004  1.000000
Let's assume basevalue, first and second are your three dataframes of exactly the same size and structure; then you can do what you want in a vectorised manner. Each mask overrides the ones before it, so the basevalue == 1 case, which the function checks first, has to be applied last:
output = abs(first - second)
output = output.mask((basevalue > first) & (first > second), -abs(first - second))
output = output.mask((basevalue < first) & (first < second), -abs(first - second))
output = output.mask(basevalue == 1, 0)

ceres solver analytical derivative doesn't work

template<typename ConcreteOccGridMapUtil>
class getResidual : public ceres::SizedCostFunction<1,3>
{
public:
    ConcreteOccGridMapUtil* occ;
    DataContainer dataPoints;
    getResidual(ConcreteOccGridMapUtil* occ, const DataContainer& dataPoints)
    {
        this->occ = occ;
        this->dataPoints = dataPoints;
    }
    virtual ~getResidual() {}
    virtual bool Evaluate(double const* const* parameters,
                          double* residuals,
                          double** jacobians) const
    {
        Eigen::Matrix<double, 3, 1> pose1(parameters[0][0],parameters[0][1],parameters[0][2]);
        Eigen::Vector3f pose = pose1.cast<float>();
        Eigen::Affine2f transform(occ->getTransformForState(pose)); // transform: rotation->translation
        float sinRot = std::sin(pose[2]);
        float cosRot = std::cos(pose[2]);
        int size = dataPoints.getSize();
        residuals[0] = 0;
        jacobians[0][0]=0;
        jacobians[0][1]=0;
        jacobians[0][2]=0;
        for (int i = 0; i < size; ++i)
        {
            const Eigen::Vector2f& currPoint (dataPoints.getVecEntry(i)); /// lidar point
            Eigen::Vector3f transformedPointData(occ->interpMapValueWithDerivatives(transform * currPoint)); /// {M,dM/dx,dM/dy}
            float funVal = 1.0f - transformedPointData[0];
            // float weight=util::WeightValue(funVal);
            float weight=1.0;
            residuals[0] += static_cast<double>(funVal);
            jacobians[0][0] += static_cast<double>(transformedPointData[1]);
            jacobians[0][1] += static_cast<double>(transformedPointData[2]);
            double rotDeriv = ((-sinRot * currPoint.x() - cosRot * currPoint.y()) * transformedPointData[1] + (cosRot * currPoint.x() - sinRot * currPoint.y()) * transformedPointData[2]);
            jacobians[0][2] += static_cast<double>(rotDeriv);
        }
        return true;
    }
};
My parameter to optimize is the pose = [x, y, theta].
My objective is to minimize the occupancy value for the pose and the laser points. Here I add them up manually into residuals[0].
I have 3 parameters [x, y, theta], so my Jacobian has 3 entries in jacobians[0].
But when I run the program, the report looks like this:
Solver Summary (v 1.12.0-eigen-(3.2.0)-lapack-suitesparse-(4.2.1)-openmp)
Original Reduced
Parameter blocks 1 1
Parameters 3 3
Residual blocks 1 1
Residual 1 1
Minimizer TRUST_REGION
Dense linear algebra library EIGEN
Trust region strategy LEVENBERG_MARQUARDT
Given Used
Linear solver DENSE_QR DENSE_QR
Threads 1 1
Linear solver threads 1 1
Linear solver ordering AUTOMATIC 1
Cost:
Initial 8.569800e+04
Final 8.569800e+04
Change 0.000000e+00
Minimizer iterations 1
Successful steps 1
Unsuccessful steps 0
Time (in seconds):
Preprocessor 0.0001
Residual evaluation 0.0000
Jacobian evaluation 0.0050
Linear solver 0.0000
Minimizer 0.0051
Postprocessor 0.0000
Total 0.0052
Termination: CONVERGENCE (Gradient tolerance reached. Gradient max norm: 0.000000e+00 <= 1.000000e-10)
Since I have set the jacobians, how can it say that the gradient norm is so small?
Two things.
1. You cannot set the Jacobian unconditionally; you need to check that the solver is actually asking for it and that the pointers are non-null.
2. There is something wrong with your Jacobian evaluation, because as far as Ceres can tell it is seeing a zero gradient. A simple thing to check would be to dump out the Jacobian and Jacobian' * residual from the CostFunction before returning.
For example, are you sure size != 0?
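Both points can be tried out away from the solver. The sketch below is Python rather than C++ and is not Ceres code: evaluate and numeric_gradient are hypothetical names, and the Gaussian "map" is a made-up smooth stand-in for the occupancy grid. It only fills the Jacobian when the caller asks for it (the Python analogue of checking the pointers for null) and compares the analytic derivatives with central finite differences, a swapped-in technique for "dumping the Jacobian". Since the toy residual is 1 - M, its Jacobian entries are the negated map derivatives.

```python
import math


def evaluate(params, points, want_jacobian=True):
    """Toy stand-in for Evaluate(): returns (residual, jacobian or None)."""
    x, y, theta = params
    s, c = math.sin(theta), math.cos(theta)
    residual = 0.0
    # Only allocate and fill the Jacobian when it was requested.
    jac = [0.0, 0.0, 0.0] if want_jacobian else None
    for px, py in points:
        # Rotate, then translate the scan point into the map frame.
        u = c * px - s * py + x
        v = s * px + c * py + y
        # Hypothetical smooth map M(u, v) = exp(-(u^2 + v^2)).
        m = math.exp(-(u * u + v * v))
        dm_du = -2.0 * u * m
        dm_dv = -2.0 * v * m
        residual += 1.0 - m
        if jac is not None:
            # d(1 - M)/dp = -dM/dp, hence the subtraction.
            jac[0] -= dm_du
            jac[1] -= dm_dv
            jac[2] -= (-s * px - c * py) * dm_du + (c * px - s * py) * dm_dv
    return residual, jac


def numeric_gradient(params, points, h=1e-6):
    """Central finite differences of the residual with respect to each parameter."""
    grad = []
    for i in range(len(params)):
        hi, lo = list(params), list(params)
        hi[i] += h
        lo[i] -= h
        diff = evaluate(hi, points, False)[0] - evaluate(lo, points, False)[0]
        grad.append(diff / (2.0 * h))
    return grad


pose = [0.3, -0.2, 0.5]
scan = [(1.0, 0.0), (0.5, 0.5), (-0.4, 0.8)]
_, analytic = evaluate(pose, scan)
print(analytic)
print(numeric_gradient(pose, scan))
```

If the two printed vectors disagree, or both are zero for a pose that is clearly not optimal, the problem is in the cost function rather than in the solver settings.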

PETSc - MatMultScale? Matrix X vector X scalar

I'm using PETSc and I want to compute something like y = 0.5*A*x.
I know I can do:
Mat A
Vec x,y
MatMult(A,x,y)
VecScale(y,0.5)
I was just curious if there is a function that would do all of these in one shot. It seems like it would save a loop.
MatMultScale(A,x,0.5,y)
Does such a function exist?
This function (or anything close to it) does not seem to be in the list of functions operating on Mat. So a brief answer to your question would be... no.
If you often use $y=\frac12 Ax$, a solution would be to scale the matrix once and for all, using MatScale(A,0.5);.
Would such a function be useful? One way to check is to use PETSc's -log_summary option to get some profiling information. If your matrix is dense, you will see that the time spent in MatMult() is much larger than the time spent in VecScale(). The question is only meaningful when handling a sparse matrix with a few nonzero entries per row.
Here is some code to test it, using 2 x identity as the matrix:
static char help[] = "Tests solving linear system on 0 by 0 matrix.\n\n";
#include <petscksp.h>
#undef __FUNCT__
#define __FUNCT__ "main"
int main(int argc,char **args)
{
Vec x, y;
Mat A;
PetscReal alpha=0.5;
PetscErrorCode ierr;
PetscInt n=42;
PetscInitialize(&argc,&args,(char*)0,help);
ierr = PetscOptionsGetInt(NULL,"-n",&n,NULL);CHKERRQ(ierr);
/* Create the vector*/
ierr = VecCreate(PETSC_COMM_WORLD,&x);CHKERRQ(ierr);
ierr = VecSetSizes(x,PETSC_DECIDE,n);CHKERRQ(ierr);
ierr = VecSetFromOptions(x);CHKERRQ(ierr);
ierr = VecDuplicate(x,&y);CHKERRQ(ierr);
/*
Create matrix. When using MatCreate(), the matrix format can
be specified at runtime.
Performance tuning note: For problems of substantial size,
preallocation of matrix memory is crucial for attaining good
performance. See the matrix chapter of the users manual for details.
*/
ierr = MatCreate(PETSC_COMM_WORLD,&A);CHKERRQ(ierr);
ierr = MatSetSizes(A,PETSC_DECIDE,PETSC_DECIDE,n,n);CHKERRQ(ierr);
ierr = MatSetFromOptions(A);CHKERRQ(ierr);
ierr = MatSetUp(A);CHKERRQ(ierr);
/*
This matrix is diagonal, two times identity
should have preallocated, shame
*/
PetscInt i,col;
PetscScalar value=2.0;
for (i=0; i<n; i++) {
col=i;
ierr = MatSetValues(A,1,&i,1,&col,&value,INSERT_VALUES);CHKERRQ(ierr);
}
ierr = MatAssemblyBegin(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
ierr = MatAssemblyEnd(A,MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
/*
let's do this 42 times for nothing :
*/
for(i=0;i<42;i++){
ierr = MatMult(A,x,y);CHKERRQ(ierr);
ierr = VecScale(y,alpha);CHKERRQ(ierr);
}
ierr = VecDestroy(&x);CHKERRQ(ierr);
ierr = VecDestroy(&y);CHKERRQ(ierr);
ierr = MatDestroy(&A);CHKERRQ(ierr);
ierr = PetscFinalize();
return 0;
}
The makefile :
include ${PETSC_DIR}/conf/variables
include ${PETSC_DIR}/conf/rules
include ${PETSC_DIR}/conf/test
CLINKER=g++
all : ex1
ex1 : main.o chkopts
${CLINKER} -w -o main main.o ${PETSC_LIB}
${RM} main.o
run :
mpirun -np 2 main -n 10000000 -log_summary -help -mat_type mpiaij
And here are the two resulting lines of -log_summary output that could answer your question:
Event Count Time (sec) Flops --- Global --- --- Stage --- Total
Max Ratio Max Ratio Max Ratio Mess Avg len Reduct %T %F %M %L %R %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
--- Event Stage 0: Main Stage
VecScale 42 1.0 1.0709e+00 1.0 2.10e+08 1.0 0.0e+00 0.0e+00 0.0e+00 4 50 0 0 0 4 50 0 0 0 392
MatMult 42 1.0 5.7360e+00 1.1 2.10e+08 1.0 0.0e+00 0.0e+00 0.0e+00 20 50 0 0 0 20 50 0 0 0 73
So the 42 VecScale() operations took about 1.1 seconds, while the 42 MatMult() operations took 5.7 seconds. Eliminating the VecScale() operation entirely would speed up this multiply-and-scale step by about 20% in the best case (6.8 s down to 5.7 s). The overhead due to the extra loop alone is even lower than that. I guess that's the reason why this function does not exist.
I apologize for the poor performance of my computer (392 Mflop/s for VecScale()...). I am curious to know what happens on yours!
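For anyone who wants to plug in their own -log_summary numbers, the best-case estimate is simple arithmetic. The snippet below is a small sketch (fusion_upper_bound is a hypothetical name) that assumes a fused routine would cost exactly as much as MatMult() alone, which is the most optimistic case.

```python
def fusion_upper_bound(mat_mult_s, vec_scale_s):
    """Best-case gain if the scaling cost disappeared completely.

    Returns (fraction of the combined time saved, speedup factor).
    """
    combined = mat_mult_s + vec_scale_s
    saved_fraction = vec_scale_s / combined
    speedup = combined / mat_mult_s
    return saved_fraction, speedup


# Max times (seconds) taken from the two log lines above.
saved, speedup = fusion_upper_bound(mat_mult_s=5.7360, vec_scale_s=1.0709)
print(f"time saved: {saved:.1%}, speedup: {speedup:.2f}x")
```

With the timings from this run, that is a little under 16% of the combined time, or a speedup of a little under 1.19x.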

Faster way to split a string and count characters using R?

I'm looking for a faster way to calculate GC content for DNA strings read in from a FASTA file. This boils down to taking a string and counting the number of times that the letter 'G' or 'C' appears. I also want to specify the range of characters to consider.
I have a working function that is fairly slow, and it's causing a bottleneck in my code. It looks like this:
##
## count the number of GCs in the characters between start and stop
##
gcCount <- function(line, st, sp){
  chars = strsplit(as.character(line),"")[[1]]
  numGC = 0
  for(j in st:sp){
    ##nested ifs faster than an OR (|) construction
    if(chars[[j]] == "g"){
      numGC <- numGC + 1
    }else if(chars[[j]] == "G"){
      numGC <- numGC + 1
    }else if(chars[[j]] == "c"){
      numGC <- numGC + 1
    }else if(chars[[j]] == "C"){
      numGC <- numGC + 1
    }
  }
  return(numGC)
}
Running Rprof gives me the following output:
> a = "GCCCAAAATTTTCCGGatttaagcagacataaattcgagg"
> Rprof(filename="Rprof.out")
> for(i in 1:500000){gcCount(a,1,40)};
> Rprof(NULL)
> summaryRprof(filename="Rprof.out")
self.time self.pct total.time total.pct
"gcCount" 77.36 76.8 100.74 100.0
"==" 18.30 18.2 18.30 18.2
"strsplit" 3.58 3.6 3.64 3.6
"+" 1.14 1.1 1.14 1.1
":" 0.30 0.3 0.30 0.3
"as.logical" 0.04 0.0 0.04 0.0
"as.character" 0.02 0.0 0.02 0.0
$by.total
total.time total.pct self.time self.pct
"gcCount" 100.74 100.0 77.36 76.8
"==" 18.30 18.2 18.30 18.2
"strsplit" 3.64 3.6 3.58 3.6
"+" 1.14 1.1 1.14 1.1
":" 0.30 0.3 0.30 0.3
"as.logical" 0.04 0.0 0.04 0.0
"as.character" 0.02 0.0 0.02 0.0
$sampling.time
[1] 100.74
Any advice for making this code faster?
Better to not split at all, just count the matches:
gcCount2 <- function(line, st, sp){
sum(gregexpr('[GCgc]', substr(line, st, sp))[[1]] > 0)
}
That's an order of magnitude faster.
A small C function that just iterates over the characters would be yet another order of magnitude faster.
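The same idea (take the substring, then count matches without splitting into single characters) is shown here in Python purely for illustration; it is not a replacement for the R code. gc_count and gc_percent are hypothetical names, and the range is 1-based and inclusive to match st and sp in the question.

```python
def gc_count(line, start, stop):
    """Count G and C (either case) in the 1-based inclusive range [start, stop]."""
    window = line[start - 1:stop].upper()
    # str.count scans the string in native code, so no per-character loop is needed.
    return window.count("G") + window.count("C")


def gc_percent(line):
    """GC content of the whole sequence as a percentage (0.0 for an empty string)."""
    if not line:
        return 0.0
    return 100.0 * gc_count(line, 1, len(line)) / len(line)


print(gc_count("GCCCAAAATTTTCCGGggcc", 1, 20))
print(gc_percent("ATGCCC"))
```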
A one liner:
table(strsplit(toupper(a), '')[[1]])
I don't know that it's any faster, but you might want to look at the R package seqinR - http://pbil.univ-lyon1.fr/software/seqinr/home.php?lang=eng. It is an excellent, general bioinformatics package with many methods for sequence analysis. It's in CRAN (which seems to be down as I write this).
GC content would be:
mysequence <- s2c("agtctggggggccccttttaagtagatagatagctagtcgta")
GC(mysequence) # 0.4761905
That's from a string, you can also read in a fasta file using "read.fasta()".
There's no need to use a loop here.
Try this:
gcCount <- function(line, st, sp){
  chars = strsplit(as.character(line),"")[[1]][st:sp]
  length(which(tolower(chars) == "g" | tolower(chars) == "c"))
}
Try this function from the stringi package:
> stri_count_fixed("GCCCAAAATTTTCCGG",c("G","C"))
[1] 3 5
or you can use regex version to count g and G
> stri_count_regex("GCCCAAAATTTTCCGGggcc",c("G|g|C|c"))
[1] 12
or you can use tolower function first and then stri_count
> stri_trans_tolower("GCCCAAAATTTTCCGGggcc")
[1] "gcccaaaattttccggggcc"
Timing comparison:
> microbenchmark(gcCount(x,1,40),gcCount2(x,1,40), stri_count_regex(x,c("[GgCc]")))
Unit: microseconds
expr min lq median uq max neval
gcCount(x, 1, 40) 109.568 112.42 113.771 116.473 146.492 100
gcCount2(x, 1, 40) 15.010 16.51 18.312 19.213 40.826 100
stri_count_regex(x, c("[GgCc]")) 15.610 16.51 18.912 20.112 61.239 100
Another example, for a longer string. stri_dup replicates a string n times:
> stri_dup("abc",3)
[1] "abcabcabc"
As the timings below show, stri_count_regex is faster for a longer sequence :)
> y <- stri_dup("GCCCAAAATTTTCCGGatttaagcagacataaattcgagg",100)
> microbenchmark(gcCount(y,1,40*100),gcCount2(y,1,40*100), stri_count_regex(y,c("[GgCc]")))
Unit: microseconds
expr min lq median uq max neval
gcCount(y, 1, 40 * 100) 10367.880 10597.5235 10744.4655 11655.685 12523.828 100
gcCount2(y, 1, 40 * 100) 360.225 369.5315 383.6400 399.100 438.274 100
stri_count_regex(y, c("[GgCc]")) 131.483 137.9370 151.8955 176.511 221.839 100
Thanks to all for this post.
To optimize a script in which I want to calculate the GC content of 100M sequences of 200 bp, I ended up testing the different methods proposed here. Ken Williams' method performed best (2.5 hours), better than seqinr (3.6 hours). Using stringr's str_count reduced this to 1.5 hours.
In the end I coded it in C++ and called it using Rcpp, which cuts the computation time down to 10 minutes!
Here is the C++ code:
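#include <Rcpp.h>
using namespace Rcpp;
// Percentage of 'G' and 'C' in s; only upper-case letters are counted.
// [[Rcpp::export]]
float pGC_cpp(std::string s) {
    int count = 0;
    // size_t index avoids a signed/unsigned comparison with s.size()
    for (std::size_t i = 0; i < s.size(); i++) {
        if (s[i] == 'G') count++;
        else if (s[i] == 'C') count++;
    }
    float pGC = (float)count / s.size();
    pGC = pGC * 100;
    return pGC;
}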
Which I call from R by typing:
library(Rcpp)
sourceCpp("pGC_cpp.cpp")
pGC_cpp("ATGCCC")