How to convert two bytes to floating point number - vb.net

I have some legacy files that need to be mined for data. The files were created by Lotus 1-2-3 Release 4 for DOS. I'm trying to read the files faster by parsing the bytes myself rather than using Lotus to open them.
Dim fileBytes() As Byte = My.Computer.FileSystem.ReadAllBytes(fiPath)
'I loop through all the data getting first/second bytes for each value
Do ...
    Dim FirstByte As Int16 = Convert.ToInt16(fileBytes(Index))
    Dim SecondByte As Int16 = Convert.ToInt16(fileBytes(Index + 1))
Loop ...
I can get integer values like this:
Dim value As Int16 = BitConverter.ToInt16(fileBytes, Index + 8) / 2
But floating-point numbers are more complicated. Only the smaller numbers are stored in two bytes; larger values take 10 bytes, but that's another question. Here we only have smaller values stored in two bytes. Below are some sample values. I entered the byte values into Excel and used =DEC2BIN() to convert them to binary, adding zeros on the left as needed to get 8 bits.
First Second
Byte Byte Value First Byte 2nd Byte
7 241 = -1.2 0000 0111 1111 0001
254 255 = -1 1111 1110 1111 1111
9 156 = -0.8 0000 1001 1001 1100
9 181 = -0.6 0000 1001 1011 0101
9 206 = -0.4 0000 1001 1100 1110
9 231 = -0.2 0000 1001 1110 0111
13 0 = 0 0000 1101 0000 0000
137 12 = 0.1 1000 1001 0000 1100
9 25 = 0.2 0000 1001 0001 1001
137 37 = 0.3 1000 1001 0010 0101
9 50 = 0.4 0000 1001 0011 0010
15 2 = 0.5 0000 1111 0000 0010
9 75 = 0.6 0000 1001 0100 1011
137 87 = 0.7 1000 1001 0101 0111
9 100 = 0.8 0000 1001 0110 0100
137 112 = 0.9 1000 1001 0111 0000
2 0 = 1 0000 0010 0000 0000
199 13 = 1.1 1100 0111 0000 1101
7 15 = 1.2 0000 0111 0000 1111
71 16 = 1.3 0100 0111 0001 0000
135 17 = 1.4 1000 0111 0001 0001
15 6 = 1.5 0000 1111 0000 0110
7 20 = 1.6 0000 0111 0001 0100
71 21 = 1.7 0100 0111 0001 0101
135 22 = 1.8 1000 0111 0001 0110
199 23 = 1.9 1100 0111 0001 0111
4 0 = 2 0000 0100 0000 0000
I'm hoping for a simple conversion method. Or maybe it'll be more complicated.
I looked at BCD: "BCD was used in many early decimal computers, and is implemented in the instruction set of machines such as the IBM System/360 series", and at the Intel BCD opcodes.
I do not know if this is BCD or what it is. How do I convert the two bytes into a floating-point number?

I used the information from the website pointed out by Andrew Morton in the comments. Basically, the stored 16-bit quantity is either a 15-bit two's complement integer (when the lsb is 0) or a 12-bit two's complement integer plus a processing code indicating a scale factor to be applied to that integer (when the lsb is 1). I am not familiar with vb.net, so I am providing ISO C code here. The program below successfully decodes all the data provided in the question.
Note: I am converting to an 8-byte double in the code below, while the question suggests that the original conversion may have been to a 10-byte long double (the 80-bit extended-precision format of the 8087 math coprocessor). It would seem like a good idea to try more test data to achieve full coverage of the eight scaling codes: large integers like 1,000,000 and 1,000,000,000; decimal fractions like 0.0003, 0.000005, and 0.00000007; and binary fractions like 0.125 (1/8) and 0.046875 (3/64).
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

typedef struct {
    uint8_t byte1;
    uint8_t byte2;
} num;

num data[] =
{
    {  7, 241}, {254, 255}, {  9, 156}, {  9, 181}, {  9, 206}, {  9, 231},
    { 13,   0}, {137,  12}, {  9,  25}, {137,  37}, {  9,  50}, { 15,   2},
    {  9,  75}, {137,  87}, {  9, 100}, {137, 112}, {  2,   0}, {199,  13},
    {  7,  15}, { 71,  16}, {135,  17}, { 15,   6}, {  7,  20}, { 71,  21},
    {135,  22}, {199,  23}, {  4,   0}
};

int data_count = sizeof (data) / sizeof (data[0]);

/* define operators that may look more familiar to vb.net programmers */
#define XOR ^
#define MOD %

int main (void)
{
    int i;
    uint8_t b1, b2;
    uint16_t h, code;
    int32_t n;
    double r;

    for (i = 0; i < data_count; i++) {
        b1 = data[i].byte1;
        b2 = data[i].byte2;
        /* data word */
        h = ((uint16_t)b2 * 256) + b1;
        /* h<0>=1 indicates stored integer needs to be scaled */
        if ((h MOD 2) == 1) {
            /* extract scaling code in h<3:1> */
            code = (h / 2) MOD 8;
            /* scaled 12-bit integer in h<15:4>. Extract, sign-extend to 32 bits */
            n = (int32_t)((((uint32_t)h / 16) XOR 2048) - 2048);
            /* convert integer to floating-point */
            r = (double)n;
            /* scale based on scaling code */
            switch (code) {
            case 0x0: r = r * 5000;  break;
            case 0x1: r = r * 500;   break;
            case 0x2: r = r / 20;    break;
            case 0x3: r = r / 200;   break;
            case 0x4: r = r / 2000;  break;
            case 0x5: r = r / 20000; break;
            case 0x6: r = r / 16;    break;
            case 0x7: r = r / 64;    break;
            }
        } else {
            /* unscaled 15-bit integer in h<15:1>. Extract, sign extend to 32 bits */
            n = (int32_t)((((uint32_t)h / 2) XOR 16384) - 16384);
            /* convert integer to floating-point */
            r = (double)n;
        }
        printf ("[%3d,%3d] n=%08x r=% 12.8f\n", b1, b2, n, r);
    }
    return EXIT_SUCCESS;
}
The output of this program is as follows:
[ 7,241] n=ffffff10 r= -1.20000000
[254,255] n=ffffffff r= -1.00000000
[ 9,156] n=fffff9c0 r= -0.80000000
[ 9,181] n=fffffb50 r= -0.60000000
[ 9,206] n=fffffce0 r= -0.40000000
[ 9,231] n=fffffe70 r= -0.20000000
[ 13, 0] n=00000000 r= 0.00000000
[137, 12] n=000000c8 r= 0.10000000
[ 9, 25] n=00000190 r= 0.20000000
[137, 37] n=00000258 r= 0.30000000
[ 9, 50] n=00000320 r= 0.40000000
[ 15, 2] n=00000020 r= 0.50000000
[ 9, 75] n=000004b0 r= 0.60000000
[137, 87] n=00000578 r= 0.70000000
[ 9,100] n=00000640 r= 0.80000000
[137,112] n=00000708 r= 0.90000000
[ 2, 0] n=00000001 r= 1.00000000
[199, 13] n=000000dc r= 1.10000000
[ 7, 15] n=000000f0 r= 1.20000000
[ 71, 16] n=00000104 r= 1.30000000
[135, 17] n=00000118 r= 1.40000000
[ 15, 6] n=00000060 r= 1.50000000
[ 7, 20] n=00000140 r= 1.60000000
[ 71, 21] n=00000154 r= 1.70000000
[135, 22] n=00000168 r= 1.80000000
[199, 23] n=0000017c r= 1.90000000
[ 4, 0] n=00000002 r= 2.00000000

Just a VB.Net translation of the C code posted by njuffa.
The original structure has been substituted with a Byte array, and the numeric data types have been adapted to .Net types. That's all.
Dim data As Byte(,) = New Byte(,) {
    {7, 241}, {254, 255}, {9, 156}, {9, 181}, {9, 206}, {9, 231}, {13, 0}, {137, 12}, {9, 25},
    {137, 37}, {9, 50}, {15, 2}, {9, 75}, {137, 87}, {9, 100}, {137, 112}, {2, 0}, {199, 13},
    {7, 15}, {71, 16}, {135, 17}, {15, 6}, {7, 20}, {71, 21}, {135, 22}, {199, 23}, {4, 0}
}

Dim byte1, byte2 As Byte
Dim word, code As UShort
Dim nValue As Integer
Dim result As Double

For i As Integer = 0 To (data.Length \ 2 - 1)
    byte1 = data(i, 0)
    byte2 = data(i, 1)
    word = (byte2 * 256US) + byte1
    If (word Mod 2) = 1 Then
        code = (word \ 2US) Mod 8US
        nValue = ((word \ 16) Xor 2048) - 2048
        Select Case code
            Case 0 : result = nValue * 5000
            Case 1 : result = nValue * 500
            Case 2 : result = nValue / 20
            Case 3 : result = nValue / 200
            Case 4 : result = nValue / 2000
            Case 5 : result = nValue / 20000
            Case 6 : result = nValue / 16
            Case 7 : result = nValue / 64
        End Select
    Else
        'unscaled 15-bit integer in h<15:1>. Extract, sign extend to 32 bits
        nValue = ((word \ 2) Xor 16384) - 16384
        result = nValue
    End If
    Console.WriteLine($"[{byte1,3:D}, {byte2,3:D}] number = {nValue:X8} result ={result,12:F8}")
Next
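If you just want to spot-check a few byte pairs outside .NET, here is a minimal Python sketch of the same decode logic (the bit layout and scale table are transcribed from njuffa's answer above, not from official Lotus documentation):

def decode_lotus16(b1, b2):
    # decode one little-endian 16-bit Lotus "small number" byte pair
    h = b2 * 256 + b1                       # assemble the 16-bit word
    if h & 1:                               # lsb = 1: scaled 12-bit integer
        code = (h >> 1) & 7                 # scaling code in bits 3..1
        n = ((h >> 4) ^ 2048) - 2048        # sign-extend bits 15..4
        if code == 0:
            return n * 5000
        if code == 1:
            return n * 500
        return n / (20, 200, 2000, 20000, 16, 64)[code - 2]
    return ((h >> 1) ^ 16384) - 16384       # lsb = 0: 15-bit integer

assert decode_lotus16(9, 156) == -0.8       # sample row from the question
assert decode_lotus16(2, 0) == 1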

Related

Remove nan from pandas binner

I have created the following pandas dataframe called train:
import pandas as pd
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as stats
ds = {
    'matchKey': [621062, 622750, 623508, 626451, 626611, 626796, 627114, 630055, 630225],
    'og_max_last_dpd': [10, 10, -99999, 10, 10, 10, 10, 10, 10],
    'og_min_last_dpd': [10, 10, -99999, 10, 10, 10, 10, 10, 10],
    'og_max_max_dpd': [0, 0, -99999, 1, 0, 5, 0, 4, 0],
    'Target': [1, 0, 1, 0, 0, 1, 1, 1, 0]
}
train = pd.DataFrame(data=ds)
The dataframe looks like this:
print(train)
matchKey og_max_last_dpd og_min_last_dpd og_max_max_dpd Target
0 621062 10 10 0 1
1 622750 10 10 0 0
2 623508 -99999 -99999 -99999 1
3 626451 10 10 1 0
4 626611 10 10 0 0
5 626796 10 10 5 1
6 627114 10 10 0 1
7 630055 10 10 4 1
8 630225 10 10 0 0
I have then binned the column called og_max_max_dpd using this code:
def mono_bin(Y, X, char, n=20):
    X2 = X.fillna(-99999)
    r = 0
    while np.abs(r) < 1:
        d1 = pd.DataFrame({"X": X2, "Y": Y, "Bucket": pd.qcut(X2, n, duplicates="drop")})  # ,include_lowest=True
        d2 = d1.groupby("Bucket", as_index=True)
        r, p = stats.spearmanr(d2.mean().X, d2.mean().Y)
        n = n - 1
    d3 = pd.DataFrame(d2.min().X, columns=["min_" + X.name])
    d3["max_" + X.name] = d2.max().X
    d3[Y.name] = d2.sum().Y
    d3["total"] = d2.count().Y
    d3[Y.name + "_rate"] = d2.mean().Y
    d4 = (d3.sort_values(by="min_" + X.name)).reset_index(drop=True)
    # print("=" * 85)
    # print(d4)
    ninf = float("-inf")
    pinf = float("+inf")
    array = []
    for i in range(len(d4) - 1):
        array.append(d4["max_" + char].iloc[i])
    return [ninf] + array + [pinf]
binner = mono_bin(train['Target'], train['og_max_max_dpd'], 'og_max_max_dpd')
I have printed out the binner which looks like this:
print(binner)
[-inf, -99999.0, nan, 0.0, nan, nan, 1.0, nan, nan, 4.0, nan, inf]
I want to remove the nan from that list so that the binner looks like this:
[-inf, -99999.0, 0.0, 1.0, 4.0, inf]
Does anyone know how to remove the nan?
You can simply use dropna to remove it from d4:
...
    d3[Y.name + "_rate"] = d2.mean().Y
    d4 = (d3.sort_values(by="min_" + X.name)).reset_index(drop=True)
    d4.dropna(inplace=True)
    # print("=" * 85)
    # print(d4)
    ninf = float("-inf")
...
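If you'd rather not touch mono_bin, you can also strip the nan entries from the returned list itself. A minimal sketch (nan must be detected with math.isnan, since nan != nan makes plain == comparisons useless):

import math

binner = [float("-inf"), -99999.0, float("nan"), 0.0, float("nan"), 1.0, 4.0, float("inf")]
cleaned = [b for b in binner if not math.isnan(b)]
print(cleaned)   # [-inf, -99999.0, 0.0, 1.0, 4.0, inf]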

How can I conditionally remove elements from level 1 of a nested list given the value of a level 2 element?

Platform: Mathematica
I have a table of x and y coordinates belonging to individual connected paths (trajectories):
{{Trajectory, Frame, x, y}, {1, 0, 158.22, 11.519}, {1, 1, 159.132, 11.637}, ... {6649, 1439, 148.35, 316.144}}
in table format it would look like this:
Trajectory Frame x y
------------------------------------------
1 0 158.22 11.519
1 1 159.13 11.637
1 2 158.507 11.68
1 3 157.971 11.436
1 4 158.435 11.366
1 5 158.626 11.576
2 0 141 12 remove this row, path too short!
2 1 143 15 remove this row, path too short!
2 2 144 16 remove this row, path too short!
2 3 147 18 remove this row, path too short!
3 0 120 400
3 1 121 401
3 2 121 396
3 3 122 394
3 4 121 392
3 5 120 390
3 6 124 388
3 7 125 379
...
I want to remove any elements/rows where the total length of the trajectory is less than "n" frames/rows/elements (5 frames for this example). The list is ~80k elements long, and I want to remove all the rows containing trajectories under the specified threshold.
For the given example, trajectory 2 exists across only 4 frames, so I want to delete all rows for Trajectory 2.
I am new to Mathematica and I don't even know where to begin. I thought perhaps creating a list that contains the trajectory numbers that have a Count[] value less than the threshold, then conditionally eliminating any elements that follow that pattern with something like DeleteCases[], but I wasn't able to get very far given my limited syntax knowledge.
I appreciate your help and look forward to a solution!
table = {{"Trajectory", "Frame", "x", "y"},
{1, 0, 158.22, 11.519}, {1, 1, 159.13, 11.637},
{1, 2, 158.507, 11.68}, {1, 3, 157.971, 11.436},
{1, 4, 158.435, 11.366}, {1, 5, 158.626, 11.576},
{2, 0, 141, 12}, {2, 1, 143, 15}, {2, 2, 144, 16},
{2, 3, 147, 18}, {3, 0, 120, 400}, {3, 1, 121, 401},
{3, 2, 121, 396}, {3, 3, 122, 394}, {3, 4, 121, 392},
{3, 5, 120, 390}, {3, 6, 124, 388}, {3, 7, 125, 379}};
traj = First /@ Rest[table];
n = 5;
under = First /@ Select[Tally[traj], Last[#] < n &];
discard = Flatten[Position[table[[All, 1]], #] & /@ under];
newtable = Delete[table, List /@ discard]
or alternatively, for the last two lines, this could be faster
discard = Position[table[[All, 1]], _?(MemberQ[under, #] &)];
newtable = Delete[table, discard]
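For readers coming from other tools, the same "drop trajectories shorter than n rows" filter can be cross-checked in pandas (offered only as a sketch of the logic, since the question itself is about Mathematica; the column names mirror the table above):

import pandas as pd

df = pd.DataFrame({"Trajectory": [1]*6 + [2]*4 + [3]*8,
                   "Frame": list(range(6)) + list(range(4)) + list(range(8))})
n = 5
# keep rows whose trajectory spans at least n frames
kept = df[df.groupby("Trajectory")["Frame"].transform("size") >= n]
print(kept["Trajectory"].unique())   # [1 3] -- trajectory 2 is dropped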

How to efficiently bucket pandas data frames and then perform groupby operations on those buckets?

I want to bucket volumes and build up a summary report over the aggregate data via those buckets. Currently I use apply to do this, but apply can be very slow for large data sets. Is there a generic form of the syntax given in create_lt_ten_bucket? I'm guessing this is more of a numpy thing, with which I am less familiar.
def create_buckets(df_internal, comparison_operator, column_to_bucket, min_value, max_value, ranges_pivots):
    low = [min_value] + ranges_pivots
    high = ranges_pivots + [max_value]
    ranges = list(zip(low, high))
    max_str_len = len(str(max(high + low)))

    def get_value(row):
        count = 0
        for l, h in ranges:
            if comparison_operator(l, row[column_to_bucket]) and comparison_operator(row[column_to_bucket], h):
                return "{}|{}_to_{}".format(str(count).zfill(max_str_len), l, h)
            count += 1
        return "OUTOFBAND"

    df_internal["{}_BUCKETED".format(column_to_bucket)] = df_internal.apply(get_value, axis=1)

def create_lt_ten_bucket(df_internal, column_to_bucket):
    df_internal["{}_is_lt_ten".format(column_to_bucket)] = df_internal[column_to_bucket] < 10
dftest = pd.DataFrame([1,2,3,4,5, 44, 250, 22], columns=["value_alpha"])
create_buckets(dftest, lambda v1,v2: v1 <= v2, "value_alpha", 0, 999, [1, 2, 5, 10, 25, 50, 100, 200])
display(dftest)
create_lt_ten_bucket(dftest, "value_alpha")
display(dftest)
dftest.groupby('value_alpha_BUCKETED').sum().sort_values('value_alpha_BUCKETED')
OUTPUT
value_alpha value_alpha_BUCKETED
0 1 000|0_to_1
1 2 001|1_to_2
2 3 002|2_to_5
3 4 002|2_to_5
4 5 002|2_to_5
5 44 005|25_to_50
6 250 008|200_to_999
7 22 004|10_to_25
dftest = pd.DataFrame([1,2,3,4,5, 44, 250, 22], columns=["value_alpha"])
create_buckets(dftest, lambda v1,v2: v1 <= v2, "value_alpha", 0, 999999999, [1, 2, 5, 10, 25, 50, 100, 200])
display(dftest)
create_lt_ten_bucket(dftest, "value_alpha")
display(dftest)
OUTPUT
value_alpha value_alpha_BUCKETED value_alpha_is_lt_ten
0 1 000|0_to_1 True
1 2 001|1_to_2 True
2 3 002|2_to_5 True
3 4 002|2_to_5 True
4 5 002|2_to_5 True
5 44 005|25_to_50 False
6 250 008|200_to_999 False
7 22 004|10_to_25 False
In the end I'm trying to get a summary of the data similar to this:
dftest.groupby('value_alpha_BUCKETED').sum().sort_values('value_alpha_BUCKETED')
value_alpha value_alpha_is_lt_ten
value_alpha_BUCKETED
000|0_to_1 1 1.0
001|1_to_2 2 1.0
002|2_to_5 12 3.0
004|10_to_25 22 0.0
005|25_to_50 44 0.0
008|200_to_999 250 0.0
I'm not entirely clear on what you're asking, but what you have is roughly pd.cut and pd.DataFrame.groupby:
dftest['new_bucket'] = pd.cut(dftest['value_alpha'], [0, 1, 2, 5, 10, 25, 50, 100, 200, 999])
dftest['value_alpha_is_lt_ten'] = dftest['value_alpha'] < 10
print(dftest.groupby("new_bucket").sum())
value_alpha value_alpha_is_lt_ten
new_bucket
(0, 1] 1 1.0
(1, 2] 2 1.0
(2, 5] 12 3.0
(5, 10] 0 0.0
(10, 25] 22 0.0
(25, 50] 44 0.0
(50, 100] 0 0.0
(100, 200] 0 0.0
(200, 999] 250 0.0
If you don't want the empty buckets, you could .query on values where value_alpha > 0, as in the sketch below.
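For example (a sketch that assumes, as in this data, every non-empty bucket has a positive value_alpha sum; the observed=True variant needs a pandas version where categorical groupers support that keyword):

# Option 1: summarize, then keep buckets whose sum is positive
summary = dftest.groupby("new_bucket").sum()
print(summary.query("value_alpha > 0"))

# Option 2: skip empty categorical buckets in the groupby itself
print(dftest.groupby("new_bucket", observed=True).sum())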

H264 encoding and decoding using Videotoolbox

I was testing encoding and decoding with VideoToolbox, converting the captured frames to H264 and using that data to display the video in an AVSampleBufferDisplayLayer.
While decompressing, CMVideoFormatDescriptionCreateFromH264ParameterSets fails with error code -12712.
I followed this code from mobisoftinfotech.com:
status = CMVideoFormatDescriptionCreateFromH264ParameterSets(
             kCFAllocatorDefault, 2,
             (const uint8_t *const *)parameterSetPointers,
             parameterSetSizes, 4, &_formatDesc);
My test project: videoCompressionTest. Can anyone figure out the problem?
I am not sure if you have figured out the problem yet. However, I found two places in your code that lead to the error. After fixing them and running your test app locally, it seems to work fine. (Tested with Xcode 9.4.1, macOS 10.13.)
The first one is in the -(void)CompressAndConvertToData:(CMSampleBufferRef)sampleBuffer method, where the while loop should be like this:
while (bufferOffset < blockBufferLength - AVCCHeaderLength) {
    // Read the NAL unit length
    uint32_t NALUnitLength = 0;
    memcpy(&NALUnitLength, bufferDataPointer + bufferOffset, AVCCHeaderLength);
    // Convert the length value from big-endian to little-endian
    NALUnitLength = CFSwapInt32BigToHost(NALUnitLength);
    // Write start code to the elementary stream
    [elementaryStream appendBytes:startCode length:startCodeLength];
    // Write the NAL unit without the AVCC length header to the elementary stream
    [elementaryStream appendBytes:bufferDataPointer + bufferOffset + AVCCHeaderLength
                           length:NALUnitLength];
    // Move to the next NAL unit in the block buffer
    bufferOffset += AVCCHeaderLength + NALUnitLength;
}

uint8_t *bytes = (uint8_t *)[elementaryStream bytes];
int size = (int)[elementaryStream length];
[self receivedRawVideoFrame:bytes withSize:size];
The second place is in the decompression code where you process NALU type 8, i.e. the block of code in the if (nalu_type == 8) statement. This is a tricky one.
To fix it, update
for (int i = _spsSize + 12; i < _spsSize + 50; i++)
to
for (int i = _spsSize + 12; i < _spsSize + 12 + 50; i++)
And you are free to remove this hack:
//was crashing here
if(_ppsSize == 0)
_ppsSize = 4;
Why? Let's print out the frame packet format.
po frame
▿ 4282 elements
- 0 : 0
- 1 : 0
- 2 : 0
- 3 : 1
- 4 : 39
- 5 : 100
- 6 : 0
- 7 : 30
- 8 : 172
- 9 : 86
- 10 : 193
- 11 : 112
- 12 : 247
- 13 : 151
- 14 : 64
- 15 : 0
- 16 : 0
- 17 : 0
- 18 : 1
- 19 : 40
- 20 : 238
- 21 : 60
- 22 : 176
- 23 : 0
- 24 : 0
- 25 : 0
- 26 : 1
- 27 : 6
- 28 : 5
- 29 : 35
- 30 : 71
- 31 : 86
- 32 : 74
- 33 : 220
- 34 : 92
- 35 : 76
- 36 : 67
- 37 : 63
- 38 : 148
- 39 : 239
- 40 : 197
- 41 : 17
- 42 : 60
- 43 : 209
- 44 : 67
- 45 : 168
- 46 : 0
- 47 : 0
- 48 : 3
- 49 : 0
- 50 : 0
- 51 : 3
- 52 : 0
- 53 : 2
- 54 : 143
- 55 : 92
- 56 : 40
- 57 : 1
- 58 : 221
- 59 : 204
- 60 : 204
- 61 : 221
- 62 : 2
- 63 : 0
- 64 : 76
- 65 : 75
- 66 : 64
- 67 : 128
- 68 : 0
- 69 : 0
- 70 : 0
- 71 : 1
- 72 : 37
- 73 : 184
- 74 : 32
- 75 : 1
- 76 : 223
- 77 : 205
- 78 : 248
- 79 : 30
- 80 : 231
… more
The first NALU start code found while handling if (nalu_type == 7) is the 0, 0, 0, 1 from index 15 to 18. The next 0, 0, 0, 1 (from 23 to 26) is for a type 6 NALU, and the start code that delimits the type 8 NALU is from 68 to 71. That's why I modified the for loop a bit, to scan from the start index (_spsSize + 12) with a range of 50.
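If you want to double-check those offsets, a small helper that scans a raw buffer for 4-byte start codes is handy. This is a Python sketch, independent of the test app (the NALU type is the low 5 bits of the byte that follows each start code):

def annexb_nalus(buf):
    # yield (offset, nalu_type) for every 00 00 00 01 start code in buf
    i = buf.find(b"\x00\x00\x00\x01")
    while i != -1 and i + 4 < len(buf):
        yield i, buf[i + 4] & 0x1F          # low 5 bits of the NAL header
        i = buf.find(b"\x00\x00\x00\x01", i + 4)

# the first bytes of the frame printed above
frame = bytes([0, 0, 0, 1, 39, 100, 0, 30, 172, 86, 193, 112, 247, 151, 64,
               0, 0, 0, 1, 40, 238, 60, 176, 0, 0, 0, 1, 6, 5, 35])
print(list(annexb_nalus(frame)))            # [(0, 7), (15, 8), (23, 6)] -- SPS, PPS, SEI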
I haven't fully tested your code to make sure encode and decode work properly as expected. However, I hope this finding would help you.
By the way, if there is any misunderstanding, I would love to learn from your comments.

Optimizing the Verhoeff Algorithm in R

I have written the following function to calculate a check digit in R.
verhoeffCheck <- function(x)
{
  ## calculates check digit based on Verhoeff algorithm
  ## note that due to the way strsplit works, to call for vector x, use sapply(x, verhoeffCheck)
  ## check for string since leading zeros with numbers will be lost
  if (class(x) != "character") {stop("Must enter a string")}

  # split and convert to numbers
  digs <- strsplit(x, "")[[1]]
  digs <- as.numeric(digs)
  digs <- rev(digs)  ## right to left algorithm

  ## tables required for D_5 group
  d5_mult <- matrix(c(
      0:9,
      c(1:4, 0, 6:9, 5),
      c(2:4, 0:1, 7:9, 5:6),
      c(3:4, 0:2, 8:9, 5:7),
      c(4, 0:3, 9, 5:8),
      c(5, 9:6, 0, 4:1),
      c(6:5, 9:7, 1:0, 4:2),
      c(7:5, 9:8, 2:0, 4:3),
      c(8:5, 9, 3:0, 4),
      9:0
    ), 10, 10, byrow = TRUE)

  d5_perm <- matrix(c(
      0:9,
      c(1, 5, 7, 6, 2, 8, 3, 0, 9, 4),
      c(5, 8, 0, 3, 7, 9, 6, 1, 4, 2),
      c(8, 9, 1, 6, 0, 4, 3, 5, 2, 7),
      c(9, 4, 5, 3, 1, 2, 6, 8, 7, 0),
      c(4, 2, 8, 6, 5, 7, 3, 9, 0, 1),
      c(2, 7, 9, 3, 8, 0, 6, 4, 1, 5),
      c(7, 0, 4, 6, 9, 1, 3, 2, 5, 8)
    ), 8, 10, byrow = TRUE)

  d5_inv <- c(0, 4:1, 5:9)

  ## apply algorithm - note 1-based indexing in R
  d <- 0
  for (i in 1:length(digs)) {
    d <- d5_mult[d + 1, (d5_perm[(i %% 8) + 1, digs[i] + 1]) + 1]
  }
  d5_inv[d + 1]
}
In order to run it on a vector of strings, sapply must be used. This is partly because of the use of strsplit, which returns a list of vectors. This hurts performance even for moderately sized inputs.
How could this function be vectorised?
I am also aware that some performance is lost in creating the tables on each call. Would storing them in a new environment be a better solution?
We begin by defining the lookup matrices. I've laid them out in a way
that should make them easier to check against a reference, e.g.
http://en.wikipedia.org/wiki/Verhoeff_algorithm.
d5_mult <- matrix(as.integer(c(
0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
1, 2, 3, 4, 0, 6, 7, 8, 9, 5,
2, 3, 4, 0, 1, 7, 8, 9, 5, 6,
3, 4, 0, 1, 2, 8, 9, 5, 6, 7,
4, 0, 1, 2, 3, 9, 5, 6, 7, 8,
5, 9, 8, 7, 6, 0, 4, 3, 2, 1,
6, 5, 9, 8, 7, 1, 0, 4, 3, 2,
7, 6, 5, 9, 8, 2, 1, 0, 4, 3,
8, 7, 6, 5, 9, 3, 2, 1, 0, 4,
9, 8, 7, 6, 5, 4, 3, 2, 1, 0
)), ncol = 10, byrow = TRUE)
d5_perm <- matrix(as.integer(c(
0, 1, 2, 3, 4, 5, 6, 7, 8, 9,
1, 5, 7, 6, 2, 8, 3, 0, 9, 4,
5, 8, 0, 3, 7, 9, 6, 1, 4, 2,
8, 9, 1, 6, 0, 4, 3, 5, 2, 7,
9, 4, 5, 3, 1, 2, 6, 8, 7, 0,
4, 2, 8, 6, 5, 7, 3, 9, 0, 1,
2, 7, 9, 3, 8, 0, 6, 4, 1, 5,
7, 0, 4, 6, 9, 1, 3, 2, 5, 8
)), ncol = 10, byrow = TRUE)
d5_inv <- as.integer(c(0, 4, 3, 2, 1, 5, 6, 7, 8, 9))
Next, we'll define the check function, and try it out with a test input.
I've followed the derivation in wikipedia as closely as possible.
p <- function(i, n_i) {
  d5_perm[(i %% 8) + 1, n_i + 1] + 1
}

d <- function(c, p) {
  d5_mult[c + 1, p]
}

verhoeff <- function(x) {
  # split and convert to numbers
  digs <- strsplit(as.character(x), "")[[1]]
  digs <- as.numeric(digs)
  digs <- rev(digs)  ## right to left algorithm

  ## apply algorithm - note 1-based indexing in R
  c <- 0
  for (i in 1:length(digs)) {
    c <- d(c, p(i, digs[i]))
  }
  d5_inv[c + 1]
}
verhoeff(142857)
## [1] 0
This function is fundamentally iterative, as each iteration depends on
the value of the previous. This means that we're unlikely to be able to
vectorise in R, so if we want to vectorise, we'll need to use Rcpp.
However, before we turn to that, it's worth exploring if we can do the
initial split faster. First we do a little microbenchmark to see if it's
worth bothering:
library(microbenchmark)
digits <- function(x) {
  digs <- strsplit(as.character(x), "")[[1]]
  digs <- as.numeric(digs)
  rev(digs)
}

microbenchmark(
  digits(142857),
  verhoeff(142857)
)
## Unit: microseconds
## expr min lq median uq max neval
## digits(142857) 11.30 12.01 12.43 12.85 28.79 100
## verhoeff(142857) 32.24 33.81 34.66 35.47 95.85 100
It looks like it! On my computer, digits() accounts for about 50% of the run time. A little searching on stackoverflow reveals another approach to turning a number into digits:
digits2 <- function(x) {
  n <- floor(log10(x))
  x %/% 10^(0:n) %% 10
}
digits2(12345)
## [1] 5 4 3 2 1

microbenchmark(
  digits(142857),
  digits2(142857)
)
## Unit: microseconds
## expr min lq median uq max neval
## digits(142857) 11.495 12.102 12.468 12.834 79.60 100
## digits2(142857) 2.322 2.784 3.358 3.561 13.69 100
digits2() is a lot faster than digits(), but it has limited impact on the overall runtime.
verhoeff2 <- function(x) {
  digs <- digits2(x)
  c <- 0
  for (i in 1:length(digs)) {
    c <- d(c, p(i, digs[i]))
  }
  d5_inv[c + 1]
}
verhoeff2(142857)
## [1] 0

microbenchmark(
  verhoeff(142857),
  verhoeff2(142857)
)
## Unit: microseconds
## expr min lq median uq max neval
## verhoeff(142857) 33.06 34.49 35.19 35.92 73.38 100
## verhoeff2(142857) 20.98 22.58 24.05 25.28 48.69 100
To make it even faster we could try C++.
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
int verhoeff3_c(IntegerVector digits, IntegerMatrix mult, IntegerMatrix perm,
                IntegerVector inv) {
  int n = digits.size();
  int c = 0;
  for (int i = 0; i < n; ++i) {
    int p = perm(i % 8, digits[i]);
    c = mult(c, p);
  }
  return inv[c];
}

verhoeff3 <- function(x) {
  verhoeff3_c(digits(x), d5_mult, d5_perm, d5_inv)
}
verhoeff3(142857)
## [1] 3
microbenchmark(
  verhoeff2(142857),
  verhoeff3(142857)
)
## Unit: microseconds
## expr min lq median uq max neval
## verhoeff2(142857) 21.00 22.85 25.53 27.11 63.71 100
## verhoeff3(142857) 16.75 17.99 18.87 19.64 79.54 100
That doesn't yield much of an improvement. Maybe we can do better if we
pass the number to C++ and process the digits in a loop:
#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]
int verhoeff4_c(int number, IntegerMatrix mult, IntegerMatrix perm,
                IntegerVector inv) {
  int c = 0;
  for (int i = 0; number > 0; ++i, number /= 10) {
    int p = perm(i % 8, number % 10);
    c = mult(c, p);
  }
  return inv[c];
}

verhoeff4 <- function(x) {
  verhoeff4_c(x, d5_mult, d5_perm, d5_inv)
}
verhoeff4(142857)
## [1] 3
microbenchmark(
  verhoeff2(142857),
  verhoeff3(142857),
  verhoeff4(142857)
)
## Unit: microseconds
## expr min lq median uq max neval
## verhoeff2(142857) 21.808 24.910 26.838 27.797 64.22 100
## verhoeff3(142857) 17.699 18.742 19.599 20.764 81.67 100
## verhoeff4(142857) 3.143 3.797 4.095 4.396 13.21 100
And we get a payoff: verhoeff4() is about 5 times faster than verhoeff2().
If your input strings can contain different numbers of characters, then I don't see any way round lapply calls (or a plyr equivalent). The trick is to move them inside the function, so verhoeffCheck can accept vector inputs. This way you only need to create the matrices once.
verhoeffCheckNew <- function(x)
{
  ## calculates check digit based on Verhoeff algorithm
  ## check for string since leading zeros with numbers will be lost
  if (!is.character(x)) stop("Must enter a string")

  # split and convert to numbers
  digs <- strsplit(x, "")
  digs <- lapply(digs, function(x) rev(as.numeric(x)))

  ## tables required for D_5 group
  d5_mult <- matrix(c(
      0:9,
      c(1:4, 0, 6:9, 5),
      c(2:4, 0:1, 7:9, 5:6),
      c(3:4, 0:2, 8:9, 5:7),
      c(4, 0:3, 9, 5:8),
      c(5, 9:6, 0, 4:1),
      c(6:5, 9:7, 1:0, 4:2),
      c(7:5, 9:8, 2:0, 4:3),
      c(8:5, 9, 3:0, 4),
      9:0
    ), 10, 10, byrow = TRUE)

  d5_perm <- matrix(c(
      0:9,
      c(1, 5, 7, 6, 2, 8, 3, 0, 9, 4),
      c(5, 8, 0, 3, 7, 9, 6, 1, 4, 2),
      c(8, 9, 1, 6, 0, 4, 3, 5, 2, 7),
      c(9, 4, 5, 3, 1, 2, 6, 8, 7, 0),
      c(4, 2, 8, 6, 5, 7, 3, 9, 0, 1),
      c(2, 7, 9, 3, 8, 0, 6, 4, 1, 5),
      c(7, 0, 4, 6, 9, 1, 3, 2, 5, 8)
    ), 8, 10, byrow = TRUE)

  d5_inv <- c(0, 4:1, 5:9)

  ## apply algorithm - note 1-based indexing in R
  sapply(digs, function(x)
  {
    d <- 0
    for (i in 1:length(x)) {
      d <- d5_mult[d + 1, (d5_perm[(i %% 8) + 1, x[i] + 1]) + 1]
    }
    d5_inv[d + 1]
  })
}
Since d depends on its previous value, there is no easy way to vectorise the for loop.
My version runs in about half the time for 1e5 strings.
rand_string <- function(n = 12)
{
  paste(sample(as.character(0:9), sample(n), replace = TRUE), collapse = "")
}

big_test <- replicate(1e5, rand_string())

tic()
res1 <- unname(sapply(big_test, verhoeffCheck))
toc()

tic()
res2 <- verhoeffCheckNew(big_test)
toc()

identical(res1, res2)  # hopefully TRUE!
See this question for tic and toc.
Further thoughts:
You may want additional input checking for "" and other strings that return NA when converted to numeric.
Since you are dealing exclusively with integers, you may get a slight performance benefit from using them rather than doubles. (Use as.integer rather than as.numeric and append L to the values in your matrices.)
Richie C answered the vectorisation question nicely; as for only creating the tables once without cluttering the global namespace, one quick solution that does not require a package is
verhoeffCheck <- local(function(x)
{
  ## calculates check digit based on Verhoeff algorithm
  ## note that due to the way strsplit works, to call for vector x, use sapply(x, verhoeffCheck)
  ## check for string since leading zeros with numbers will be lost
  if (class(x) != "character") {stop("Must enter a string")}

  # split and convert to numbers
  digs <- strsplit(x, "")[[1]]
  digs <- as.numeric(digs)
  digs <- rev(digs)  ## right to left algorithm

  ## apply algorithm - note 1-based indexing in R
  d <- 0
  for (i in 1:length(digs)) {
    d <- d5_mult[d + 1, (d5_perm[(i %% 8) + 1, digs[i] + 1]) + 1]
  }
  d5_inv[d + 1]
})

assign("d5_mult", matrix(c(
    0:9, c(1:4, 0, 6:9, 5), c(2:4, 0:1, 7:9, 5:6), c(3:4, 0:2, 8:9, 5:7),
    c(4, 0:3, 9, 5:8), c(5, 9:6, 0, 4:1), c(6:5, 9:7, 1:0, 4:2), c(7:5, 9:8, 2:0, 4:3),
    c(8:5, 9, 3:0, 4), 9:0), 10, 10, byrow = TRUE),
  envir = environment(verhoeffCheck))

assign("d5_perm", matrix(c(
    0:9, c(1, 5, 7, 6, 2, 8, 3, 0, 9, 4), c(5, 8, 0, 3, 7, 9, 6, 1, 4, 2),
    c(8, 9, 1, 6, 0, 4, 3, 5, 2, 7), c(9, 4, 5, 3, 1, 2, 6, 8, 7, 0), c(4, 2, 8, 6, 5, 7, 3, 9, 0, 1),
    c(2, 7, 9, 3, 8, 0, 6, 4, 1, 5), c(7, 0, 4, 6, 9, 1, 3, 2, 5, 8)), 8, 10, byrow = TRUE),
  envir = environment(verhoeffCheck))

assign("d5_inv", c(0, 4:1, 5:9), envir = environment(verhoeffCheck))

## Now just use the function
## Now just use the function
which keeps the data in the environment of the function. You can time it to see how much faster it is.
Hope this helps.
Allan