I have written the following function to calculate a check digit in R.
verhoeffCheck <- function(x)
{
## calculates check digit based on Verhoeff algorithm
## note that due to the way strsplit works, to call for vector x, use sapply(x,verhoeffCheck)
## check for string since leading zeros with numbers will be lost
if (class(x)!="character"){stop("Must enter a string")}
#split and convert to numbers
digs <- strsplit(x,"")[[1]]
digs <- as.numeric(digs)
digs <- rev(digs) ## right to left algorithm
## tables required for D_5 group
d5_mult <- matrix(c(
0:9,
c(1:4,0,6:9,5),
c(2:4,0:1,7:9,5:6),
c(3:4,0:2,8:9,5:7),
c(4,0:3,9,5:8),
c(5,9:6,0,4:1),
c(6:5,9:7,1:0,4:2),
c(7:5,9:8,2:0,4:3),
c(8:5,9,3:0,4),
9:0
),10,10,byrow=T)
d5_perm <- matrix(c(
0:9,
c(1,5,7,6,2,8,3,0,9,4),
c(5,8,0,3,7,9,6,1,4,2),
c(8,9,1,6,0,4,3,5,2,7),
c(9,4,5,3,1,2,6,8,7,0),
c(4,2,8,6,5,7,3,9,0,1),
c(2,7,9,3,8,0,6,4,1,5),
c(7,0,4,6,9,1,3,2,5,8)
),8,10,byrow=T)
d5_inv <- c(0,4:1,5:9)
## apply algoritm - note 1-based indexing in R
d <- 0
for (i in 1:length(digs)){
d <- d5_mult[d+1,(d5_perm[(i%%8)+1,digs[i]+1])+1]
}
d5_inv[d+1]
}
In order to run on a vector of strings, sapply must be used. This is in part because of the use of strsplit, which returns a list of vectors. This does impact on the performance even for only moderately sized inputs.
How could this function be vectorized?
I am also aware that some performance is lost in having to create the tables in each iteration. Would storing these in a new environment be a better solution?
We begin by defining the lookup matrices. I've laid them out in a way
that should make them easier to check against a reference, e.g.
http://en.wikipedia.org/wiki/Verhoeff_algorithm.
d5_mult <- matrix(as.integer(c(
0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
1, 2, 3, 4, 0, 6, 7, 8, 9, 5,
2, 3, 4, 0, 1, 7, 8, 9, 5, 6,
3, 4, 0, 1, 2, 8, 9, 5, 6, 7,
4, 0, 1, 2, 3, 9, 5, 6, 7, 8,
5, 9, 8, 7, 6, 0, 4, 3, 2, 1,
6, 5, 9, 8, 7, 1, 0, 4, 3, 2,
7, 6, 5, 9, 8, 2, 1, 0, 4, 3,
8, 7, 6, 5, 9, 3, 2, 1, 0, 4,
9, 8, 7, 6, 5, 4, 3, 2, 1, 0
)), ncol = 10, byrow = TRUE)
d5_perm <- matrix(as.integer(c(
0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
1, 5, 7, 6, 2, 8, 3, 0, 9, 4,
5, 8, 0, 3, 7, 9, 6, 1, 4, 2,
8, 9, 1, 6, 0, 4, 3, 5, 2, 7,
9, 4, 5, 3, 1, 2, 6, 8, 7, 0,
4, 2, 8, 6, 5, 7, 3, 9, 0, 1,
2, 7, 9, 3, 8, 0, 6, 4, 1, 5,
7, 0, 4, 6, 9, 1, 3, 2, 5, 8
)), ncol = 10, byrow = TRUE)
d5_inv <- as.integer(c(0, 4, 3, 2, 1, 5, 6, 7, 8, 9))
Next, we'll define the check function, and try it out with a test input.
I've followed the derivation in wikipedia as closely as possible.
p <- function(i, n_i) {
d5_perm[(i %% 8) + 1, n_i + 1] + 1
}
d <- function(c, p) {
d5_mult[c + 1, p]
}
verhoeff <- function(x) {
#split and convert to numbers
digs <- strsplit(as.character(x), "")[[1]]
digs <- as.numeric(digs)
digs <- rev(digs) ## right to left algorithm
## apply algoritm - note 1-based indexing in R
c <- 0
for (i in 1:length(digs)) {
c <- d(c, p(i, digs[i]))
}
d5_inv[c + 1]
}
verhoeff(142857)
## [1] 0
This function is fundamentally iterative, as each iteration depends on
the value of the previous. This means that we're unlikely to be able to
vectorise in R, so if we want to vectorise, we'll need to use Rcpp.
However, before we turn to that, it's worth exploring if we can do the
initial split faster. First we do a little microbenchmark to see if it's
worth bothering:
library(microbenchmark)
digits <- function(x) {
digs <- strsplit(as.character(x), "")[[1]]
digs <- as.numeric(digs)
rev(digs)
}
microbenchmark(
digits(142857),
verhoeff(142857)
)
## Unit: microseconds
## expr min lq median uq max neval
## digits(142857) 11.30 12.01 12.43 12.85 28.79 100
## verhoeff(142857) 32.24 33.81 34.66 35.47 95.85 100
It looks like it! On my computer, verhoeff_prepare() accounts for
about 50% of the run time. A little searching on stackoverflow reveals
another approach to turning a number into
digits:
digits2 <- function(x) {
n <- floor(log10(x))
x %/% 10^(0:n) %% 10
}
digits2(12345)
## [1] 5 4 3 2 1
microbenchmark(
digits(142857),
digits2(142857)
)
## Unit: microseconds
## expr min lq median uq max neval
## digits(142857) 11.495 12.102 12.468 12.834 79.60 100
## digits2(142857) 2.322 2.784 3.358 3.561 13.69 100
digits2() is a lot faster than digits() but it has limited impact on
the whole runtime.
verhoeff2 <- function(x) {
digs <- digits2(x)
c <- 0
for (i in 1:length(digs)) {
c <- d(c, p(i, digs[i]))
}
d5_inv[c + 1]
}
verhoeff2(142857)
## [1] 0
microbenchmark(
verhoeff(142857),
verhoeff2(142857)
)
## Unit: microseconds
## expr min lq median uq max neval
## verhoeff(142857) 33.06 34.49 35.19 35.92 73.38 100
## verhoeff2(142857) 20.98 22.58 24.05 25.28 48.69 100
To make it even faster we could try C++.
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
int verhoeff3_c(IntegerVector digits, IntegerMatrix mult, IntegerMatrix perm,
IntegerVector inv) {
int n = digits.size();
int c = 0;
for(int i = 0; i < n; ++i) {
int p = perm(i % 8, digits[i]);
c = mult(c, p);
}
return inv[c];
}
verhoeff3 <- function(x) {
verhoeff3_c(digits(x), d5_mult, d5_perm, d5_inv)
}
verhoeff3(142857)
## [1] 3
microbenchmark(
verhoeff2(142857),
verhoeff3(142857)
)
## Unit: microseconds
## expr min lq median uq max neval
## verhoeff2(142857) 21.00 22.85 25.53 27.11 63.71 100
## verhoeff3(142857) 16.75 17.99 18.87 19.64 79.54 100
That doesn't yield much of an improvement. Maybe we can do better if we
pass the number to C++ and process the digits in a loop:
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::export]]
int verhoeff4_c(int number, IntegerMatrix mult, IntegerMatrix perm,
IntegerVector inv) {
int c = 0;
int i = 0;
for (int i = 0; number > 0; ++i, number /= 10) {
int p = perm(i % 8, number % 10);
c = mult(c, p);
}
return inv[c];
}
verhoeff4 <- function(x) {
verhoeff4_c(x, d5_mult, d5_perm, d5_inv)
}
verhoeff4(142857)
## [1] 3
microbenchmark(
verhoeff2(142857),
verhoeff3(142857),
verhoeff4(142857)
)
## Unit: microseconds
## expr min lq median uq max neval
## verhoeff2(142857) 21.808 24.910 26.838 27.797 64.22 100
## verhoeff3(142857) 17.699 18.742 19.599 20.764 81.67 100
## verhoeff4(142857) 3.143 3.797 4.095 4.396 13.21 100
And we get a pay off: verhoeff4() is about 5 times faster than
verhoeff2().
If your input strings can contain different numbers of characters, then I don't see any way round lapply calls (or a plyr equivalent). The trick is to move them inside the function, so verhoeffCheck can accept vector inputs. This way you only need to create the matrices once.
verhoeffCheckNew <- function(x)
{
## calculates check digit based on Verhoeff algorithm
## check for string since leading zeros with numbers will be lost
if (!is.character(x)) stop("Must enter a string")
#split and convert to numbers
digs <- strsplit(x, "")
digs <- lapply(digs, function(x) rev(as.numeric(x)))
## tables required for D_5 group
d5_mult <- matrix(c(
0:9,
c(1:4,0,6:9,5),
c(2:4,0:1,7:9,5:6),
c(3:4,0:2,8:9,5:7),
c(4,0:3,9,5:8),
c(5,9:6,0,4:1),
c(6:5,9:7,1:0,4:2),
c(7:5,9:8,2:0,4:3),
c(8:5,9,3:0,4),
9:0
),10,10,byrow=T)
d5_perm <- matrix(c(
0:9,
c(1,5,7,6,2,8,3,0,9,4),
c(5,8,0,3,7,9,6,1,4,2),
c(8,9,1,6,0,4,3,5,2,7),
c(9,4,5,3,1,2,6,8,7,0),
c(4,2,8,6,5,7,3,9,0,1),
c(2,7,9,3,8,0,6,4,1,5),
c(7,0,4,6,9,1,3,2,5,8)
),8,10,byrow=T)
d5_inv <- c(0,4:1,5:9)
## apply algorithm - note 1-based indexing in R
sapply(digs, function(x)
{
d <- 0
for (i in 1:length(x)){
d <- d5_mult[d + 1, (d5_perm[(i %% 8) + 1, x[i] + 1]) + 1]
}
d5_inv[d+1]
})
}
Since d depends on what it was previously, the is no easy way to vectorise the for loop.
My version runs in about half the time for 1e5 strings.
rand_string <- function(n = 12)
{
paste(sample(as.character(0:9), sample(n), replace = TRUE), collapse = "")
}
big_test <- replicate(1e5, rand_string())
tic()
res1 <- unname(sapply(big_test, verhoeffCheck))
toc()
tic()
res2 <- verhoeffCheckNew(big_test)
toc()
identical(res1, res2) #hopefully TRUE!
See this question for tic and toc.
Further thoughts:
You may want additional input checking for "" and other strings that return NA when converted in numeric.
Since you are dealing exclusively with integers, you may get a slight performance benefit from using them rather than doubles. (Use as.integer rather than as.numeric and append L to the values in your matrices.)
Richie C answered the vectorisation question nicely; as for only creatig the tables once without cluttering the global name space, one quick solution that does not require a package is
verhoeffCheck <- local(function(x)
{
## calculates check digit based on Verhoeff algorithm
## note that due to the way strsplit works, to call for vector x, use sapply(x,verhoeffCheck)
## check for string since leading zeros with numbers will be lost
if (class(x)!="character"){stop("Must enter a string")}
#split and convert to numbers
digs <- strsplit(x,"")[[1]]
digs <- as.numeric(digs)
digs <- rev(digs) ## right to left algorithm
## apply algoritm - note 1-based indexing in R
d <- 0
for (i in 1:length(digs)){
d <- d5_mult[d+1,(d5_perm[(i%%8)+1,digs[i]+1])+1]
}
d5_inv[d+1]
})
assign("d5_mult", matrix(c(
0:9, c(1:4,0,6:9,5), c(2:4,0:1,7:9,5:6), c(3:4,0:2,8:9,5:7),
c(4,0:3,9,5:8), c(5,9:6,0,4:1), c(6:5,9:7,1:0,4:2), c(7:5,9:8,2:0,4:3),
c(8:5,9,3:0,4), 9:0), 10, 10, byrow = TRUE),
envir = environment(verhoeffCheck))
assign("d5_perm", matrix(c(
0:9, c(1,5,7,6,2,8,3,0,9,4), c(5,8,0,3,7,9,6,1,4,2),
c(8,9,1,6,0,4,3,5,2,7), c(9,4,5,3,1,2,6,8,7,0), c(4,2,8,6,5,7,3,9,0,1),
c(2,7,9,3,8,0,6,4,1,5), c(7,0,4,6,9,1,3,2,5,8)), 8, 10, byrow = TRUE),
envir = environment(verhoeffCheck))
assign("d5_inv", c(0,4:1,5:9), envir = environment(verhoeffCheck))
## Now just use the function
which keeps the data in the environment of the function. You can time it to see how much faster it is.
Hope this helps.
Allan