Slow AES GCM encryption and decryption with Java 8u20 - cryptography

I am trying to encrypt and decrypt data using AES/GCM/NoPadding. I installed the JCE Unlimited Strength Policy Files and ran the (simple-minded) benchmark below. I've done the same using OpenSSL and was able to achieve more than 1 GB/s encryption and decryption on my PC.
With the benchmark below I'm only able to get 3 MB/s encryption and decryption using Java 8 on the same PC. Any idea what I am doing wrong?
public static void main(String[] args) throws Exception {
    final byte[] data = new byte[64 * 1024];
    final byte[] encrypted = new byte[64 * 1024];
    final byte[] key = new byte[32];
    final byte[] iv = new byte[12];
    final Random random = new Random(1);
    random.nextBytes(data);
    random.nextBytes(key);
    random.nextBytes(iv);

    System.out.println("Benchmarking AES-256 GCM encryption for 10 seconds");
    long javaEncryptInputBytes = 0;
    long javaEncryptStartTime = System.currentTimeMillis();
    final Cipher javaAES256 = Cipher.getInstance("AES/GCM/NoPadding");
    byte[] tag = new byte[16];
    long encryptInitTime = 0L;
    long encryptUpdate1Time = 0L;
    long encryptDoFinalTime = 0L;
    while (System.currentTimeMillis() - javaEncryptStartTime < 10000) {
        random.nextBytes(iv); // GCM must not reuse a key/IV pair for encryption
        long n1 = System.nanoTime();
        javaAES256.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new GCMParameterSpec(16 * Byte.SIZE, iv));
        long n2 = System.nanoTime();
        javaAES256.update(data, 0, data.length, encrypted, 0);
        long n3 = System.nanoTime();
        javaAES256.doFinal(tag, 0);
        long n4 = System.nanoTime();
        javaEncryptInputBytes += data.length;
        encryptInitTime += n2 - n1;      // accumulate (was '=', which kept only the last iteration)
        encryptUpdate1Time += n3 - n2;
        encryptDoFinalTime += n4 - n3;
    }
    long javaEncryptEndTime = System.currentTimeMillis();
    System.out.println("Time init (ns): " + encryptInitTime);
    System.out.println("Time update (ns): " + encryptUpdate1Time);
    System.out.println("Time do final (ns): " + encryptDoFinalTime);
    System.out.println("Java calculated at " + (javaEncryptInputBytes / 1024 / 1024 / ((javaEncryptEndTime - javaEncryptStartTime) / 1000)) + " MB/s");

    System.out.println("Benchmarking AES-256 GCM decryption for 10 seconds");
    long javaDecryptInputBytes = 0;
    long javaDecryptStartTime = System.currentTimeMillis();
    final GCMParameterSpec gcmParameterSpec = new GCMParameterSpec(16 * Byte.SIZE, iv);
    final SecretKeySpec keySpec = new SecretKeySpec(key, "AES");
    long decryptInitTime = 0L;
    long decryptUpdate1Time = 0L;
    long decryptUpdate2Time = 0L;
    long decryptDoFinalTime = 0L;
    while (System.currentTimeMillis() - javaDecryptStartTime < 10000) {
        long n1 = System.nanoTime();
        javaAES256.init(Cipher.DECRYPT_MODE, keySpec, gcmParameterSpec);
        long n2 = System.nanoTime();
        int offset = javaAES256.update(encrypted, 0, encrypted.length, data, 0);
        long n3 = System.nanoTime();
        javaAES256.update(tag, 0, tag.length, data, offset);
        long n4 = System.nanoTime();
        javaAES256.doFinal(data, offset);
        long n5 = System.nanoTime();
        javaDecryptInputBytes += data.length;
        decryptInitTime += n2 - n1;
        decryptUpdate1Time += n3 - n2;
        decryptUpdate2Time += n4 - n3;
        decryptDoFinalTime += n5 - n4;
    }
    long javaDecryptEndTime = System.currentTimeMillis();
    System.out.println("Time init (ns): " + decryptInitTime);
    System.out.println("Time update 1 (ns): " + decryptUpdate1Time);
    System.out.println("Time update 2 (ns): " + decryptUpdate2Time);
    System.out.println("Time do final (ns): " + decryptDoFinalTime);
    System.out.println("Total bytes processed: " + javaDecryptInputBytes);
    System.out.println("Java calculated at " + (javaDecryptInputBytes / 1024 / 1024 / ((javaDecryptEndTime - javaDecryptStartTime) / 1000)) + " MB/s");
}
EDIT:
I leave it as a fun exercise to improve this simple-minded benchmark.
I've tested some more using the server VM, removed the nanoTime calls, and introduced a warm-up, but as I expected none of this improved the benchmark results. It is flat-lined at 3 MB/s.

Micro-benchmarking aside, the performance of the GCM implementation in JDK 8 (at least up to 1.8.0_25) is crippled.
I can consistently reproduce the 3 MB/s (on a Haswell i7 laptop) with a more mature micro-benchmark.
From a code dive, this appears to be due to a naive multiplier implementation and no hardware acceleration for the GCM calculations.
By comparison, AES (in ECB or CBC mode) in JDK 8 uses an AES-NI accelerated intrinsic and is (for Java at least) very quick, on the order of 1 GB/s on the same hardware. The overall AES/GCM performance, however, is completely dominated by the broken GCM performance.
There are plans to implement hardware acceleration, and there have been third-party submissions to improve performance, but these haven't made it into a release yet.
Something else to be aware of: the JDK GCM implementation also buffers the entire plaintext on decryption until the authentication tag at the end of the ciphertext has been verified, which cripples it for use with large messages.
Bouncy Castle has (at the time of writing) faster GCM implementations (and OCB, if you're writing open source software or are not encumbered by software patent laws).
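The decrypt-side buffering is easy to observe with the plain JCE API. In this sketch (class name, key/buffer sizes and the zeroed test key are my own choices for illustration), `update` typically releases no plaintext at all on decryption, and everything arrives only from `doFinal` once the tag has been checked:

```java
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class GcmBufferingDemo {
    // Decrypts ct; returns {bytes released by update(), bytes released by doFinal()}.
    static int[] decryptSplit(byte[] key, byte[] iv, byte[] ct) throws Exception {
        Cipher dec = Cipher.getInstance("AES/GCM/NoPadding");
        dec.init(Cipher.DECRYPT_MODE, new SecretKeySpec(key, "AES"), new GCMParameterSpec(128, iv));
        byte[] fromUpdate = dec.update(ct, 0, ct.length); // plaintext is buffered internally
        byte[] fromFinal = dec.doFinal();                 // tag verified here; buffered plaintext released
        return new int[]{ fromUpdate == null ? 0 : fromUpdate.length, fromFinal.length };
    }

    public static void main(String[] args) throws Exception {
        byte[] key = new byte[16], iv = new byte[12], plain = new byte[64 * 1024];
        Cipher enc = Cipher.getInstance("AES/GCM/NoPadding");
        enc.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new GCMParameterSpec(128, iv));
        byte[] ct = enc.doFinal(plain); // ciphertext || 16-byte tag
        int[] split = decryptSplit(key, iv, ct);
        System.out.println("from update(): " + split[0] + " bytes, from doFinal(): " + split[1] + " bytes");
    }
}
```

For a multi-megabyte message, that means the whole plaintext sits in the cipher's internal buffer until the very last call.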
Updated July 2015 - 1.8.0_45 and JDK 9
JDK 8+ will get an improved (and constant-time) Java implementation (contributed by Florian Weimer of Red Hat) - this has landed in JDK 9 EA builds, but apparently not yet in 1.8.0_45.
JDK 9 (since EA b72 at least) also has GCM intrinsics. AES/GCM speed on b72 is 18 MB/s without intrinsics enabled and 25 MB/s with intrinsics enabled, both of which are disappointing - for comparison, the fastest (not constant-time) BC implementation is ~60 MB/s and the slowest (constant-time, not fully optimised) is ~26 MB/s.
Updated Jan 2016 - 1.8.0_72:
Some performance fixes landed in JDK 1.8.0_60, and performance on the same benchmark is now 18 MB/s - a 6x improvement over the original, but still much slower than the BC implementations.

This has now been partially addressed in Java 8u60 with JDK-8069072. Without this fix I get 2.5 MB/s. With this fix I get 25 MB/s. Disabling GCM completely gives me 60 MB/s.
To disable GCM completely create a file named java.security with the following line:
jdk.tls.disabledAlgorithms=SSLv3,GCM
Then start your Java process with:
java -Djava.security.properties=/path/to/my/java.security ...
If this doesn't work, you may need to enable overriding security properties by editing /usr/java/default/jre/lib/security/java.security (the actual path may differ depending on OS) and adding:
policy.allowSystemProperty=true

The OpenSSL implementation is optimized with assembly routines that use the pclmulqdq instruction (on x86). It is very fast because the algorithm is parallelized.
The Java implementation is slow, but it has also been optimized in HotSpot with an assembly routine (not parallelized). You have to warm up the JVM before the HotSpot intrinsic kicks in; the default value of -XX:CompileThreshold is 10000.
// pseudocode
warmUp_GCM_cipher_loop10000_times();
do_benchmark();
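The pseudocode above can be fleshed out as follows (class name, sizes and iteration counts are my own choices, not from the answer): run well past the compile threshold before taking any timing, and change the IV on every encryption, since the JDK's GCM refuses to reuse a key/IV pair in encrypt mode.

```java
import javax.crypto.Cipher;
import javax.crypto.spec.GCMParameterSpec;
import javax.crypto.spec.SecretKeySpec;

public class GcmWarmupBench {
    static final byte[] key = new byte[16];
    static final byte[] iv = new byte[12];
    static final byte[] data = new byte[1024];
    static int ctr = 0;

    // One full encryption; the IV is changed each call because JDK GCM
    // rejects key/IV reuse in ENCRYPT_MODE.
    static byte[] encryptOnce(Cipher c) throws Exception {
        iv[0] = (byte) ctr; iv[1] = (byte) (ctr >>> 8); iv[2] = (byte) (ctr >>> 16);
        ctr++;
        c.init(Cipher.ENCRYPT_MODE, new SecretKeySpec(key, "AES"), new GCMParameterSpec(128, iv));
        return c.doFinal(data);
    }

    public static void main(String[] args) throws Exception {
        Cipher c = Cipher.getInstance("AES/GCM/NoPadding");
        // Warm-up: exceed -XX:CompileThreshold (default 10000) so any intrinsic is in place.
        for (int i = 0; i < 12_000; i++) encryptOnce(c);
        int reps = 5_000;
        long t0 = System.nanoTime();
        for (int i = 0; i < reps; i++) encryptOnce(c); // timed region
        double secs = (System.nanoTime() - t0) / 1e9;
        System.out.printf("%.1f MB/s%n", reps * (double) data.length / (1024 * 1024) / secs);
    }
}
```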


How To Stop JVM Skipping Loop

I have my own test class that is supposed to do timing without the JVM deleting anything. Some example times for 100,000,000 reps, comparing the native sine that Java calls via StrictMath.sin() to my own:
30 degrees
sineNative(): 18,342,858 ns (#1), 1,574,331 ns (#10)
sinCosTanNew6(): 13,751,140 ns (#1), 1,569,848 ns (#10)
60 degrees
sineNative(): 2,520,327,020 ns (#1), 2,520,108,337 ns (#10)
sinCosTanNew6(): 12,935,959 ns (#1), 1,565,365 ns (#10)
From 30 to 60 degrees the native time skyrockets by a factor of ~137 while mine stays roughly constant. Also, some of the times are impossibly low, even when repsDone returns == reps; I'd expect them to be > 1 ns * reps.
CPU: G3258 @ 4GHz
OS: Windows 7 HB SP1
Build Path: jre1.8.0_211
Reprex:
public final class MathTest {
    private static int sysReps = 1_000_000;
    private static double value = 0;
    private static final double DRAD_ANGLE_30 = 0.52359877559829887307710723054658d;
    private static final double DRAD_ANGLE_60 = 1.0471975511965977461542144610932d;

    private static double sineNative(double angle) {
        int reps = sysReps * 100;
        //int repsDone = 0;
        value = 0;
        long startTime, endTime, timeDif;
        startTime = System.nanoTime();
        for (int index = reps - 1; index >= 0; index--) {
            value = Math.sin(angle);
            //repsDone++;
        }
        endTime = System.nanoTime();
        timeDif = endTime - startTime;
        System.out.println("sineNative(): " + timeDif + "ns for " + reps + " sine " + value + " of angle " + angle);
        //System.out.println("reps done: "+repsDone);
        return value;
    }

    private static void testSines() {
        sineNative(DRAD_ANGLE_30);
        //sinCosTanNew6(IBIT_ANGLE_30);
    }

    /* Warm Up */
    private static void repeatAll(int reps) {
        for (int index = reps - 1; index >= 0; index--) {
            testSines();
        }
    }

    public static void main(String[] args) {
        repeatAll(10);
    }
}
I tried adding angle++ in the loop and that multiplies the times to a more reasonable level, but it messes with the math. I need a way to trick it into running all of the code all x times. Single-pass times are extremely volatile and calling nanoTime() takes time, so I need the average over a large number of reps.
The problem is that you never use/refer to the result returned by sineNative. The JIT compiler is clever enough to work out that you never use the return value, so eventually it will just do nothing. A very simple way to fix this is to add a dummy check on your return value, e.g. if (Math.sin(angle) > 1) { System.out.println("Impossible!"); }
If you are writing benchmarks like this it is useful to use something like JMH (https://github.com/openjdk/jmh), which automatically creates a blackhole for your return variable so that the JIT compiler will not optimise the computation away (see the example https://github.com/openjdk/jmh/blob/master/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_09_Blackholes.java).
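If you'd rather not pull in JMH, a minimal hand-rolled version of the same idea (class and field names are mine) is to feed every result into an accumulator that the program ultimately prints, and to read the angle from an array so the call can't be hoisted out of the loop as invariant:

```java
public class SinkBench {
    static double sink; // published result: keeps the JIT from proving the loop dead

    static long timeSin(int reps) {
        double[] angles = new double[1024];
        java.util.Arrays.fill(angles, 0.52359877559829887); // ~30 degrees in radians
        long t0 = System.nanoTime();
        double acc = 0;
        for (int i = 0; i < reps; i++) {
            acc += Math.sin(angles[i & 1023]); // result is consumed, so the call survives
        }
        long t1 = System.nanoTime();
        sink = acc; // publishing completes the data dependence
        return t1 - t0;
    }

    public static void main(String[] args) {
        for (int pass = 1; pass <= 10; pass++) {
            System.out.println("pass " + pass + ": " + timeSin(1_000_000) + " ns");
        }
        System.out.println(sink); // printing the accumulator makes it observably live
    }
}
```

The array read is a cheap trick: the JIT cannot easily prove all elements are equal, so it cannot replace the loop body with a single hoisted sin call, and the timing stays honest without changing the math.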

Is write_image atomic? Is it better to use atomic_max?

Full disclosure: I am cross-posting from the Khronos OpenCL forums, since I have not received any reply there so far:
https://community.khronos.org/t/is-write-image-atomic-is-it-better-than-atomic-max/106418
I’m writing a connected-components labelling algorithm for images (2D and 3D); I found no existing implementations and decided to write one based on pointer jumping and a “recollection step” (by the way: if you are aware of an easy-to-use, production-ready connected-component labelling implementation, let me know).
The “recollection” step kernel pseudocode for 2d images is as follows:
1) global_id = (x,y)
2) read v from img[x,y], decode it to a pair (tx,ty)
3) read v1 from img[tx,ty]
4) do some calculations to extract a boolean value C and a target value T from v1, v, and the neighbours of (x,y) and (tx,ty)
5) *** IF ( C ) THEN WRITE T INTO (tx,ty).
Q1: all the kernels where “C” is true will compete for the write. Suppose it does not matter which one wins (writes last). I’ve done some tests on an Intel GPU, and (with filtering disabled and clamping enabled) there seems to be no issue at all: write_image seems to be atomic, there is a winning value, and my algorithm converges very fast. Can I safely assume that write_image on “unfiltered” images is atomic?
Q2: What I really need is to write into (tx,ty) the maximum T obtained across kernels. That would involve using buffers instead of images, doing the clamping myself (or using a larger buffer padded with zeroes), and using atomic_max in each kernel. I have not done this yet, out of laziness, since I'd need to change my code to use a buffer just to test it, but I believe it would be far slower. Am I right?
For completeness, here is my actual kernel (to be optimized, any suggestions welcome!)
```
__kernel void color_components2(/* base image */ __read_only image2d_t image,
/* uint32 */ __read_only image2d_t inputImage1,
__write_only image2d_t outImage1) {
int2 gid = (int2)(get_global_id(0), get_global_id(1));
int x = gid.x;
int y = gid.y;
int lock = 0;
int2 size = get_image_dim(inputImage1);
const sampler_t sampler =
CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP | CLK_FILTER_NEAREST;
uint4 base = read_imageui(image, sampler, gid);
uint4 ui4a = read_imageui(inputImage1, sampler, gid);
int2 t = (int2)(ui4a[0] % size.x, ui4a[0] / size.x);
unsigned int m = ui4a[0];
unsigned int n = ui4a[0];
if (base[0] > 0) {
for (int a = -1; a <= 1; a++)
for (int b = -1; b <= 1; b++) {
uint4 tmpa =
read_imageui(inputImage1, sampler, (int2)(t.x + a, t.y + b));
m = max(tmpa[0], m);
uint4 tmpb = read_imageui(inputImage1, sampler, (int2)(x + a, y + b));
n = max(tmpb[0], n);
}
}
if(n > m) write_imageui(outImage1,t,(uint4)(n,0,0,0));
}
```

Ambiguous process calcChecksum

CONTEXT
I'm using code written to work with a GPS module that connects to the Arduino through serial communication. The module starts each packet with a header (0xb5, 0x62), continues with the information you requested, and ends with two bytes of checksum, CK_A and CK_B. I don't understand the code that calculates that checksum. More info about the checksum algorithm (8-bit Fletcher algorithm) is in the module protocol (https://www.u-blox.com/sites/default/files/products/documents/u-blox7-V14_ReceiverDescriptionProtocolSpec_%28GPS.G7-SW-12001%29_Public.pdf), page 74 (87 with index).
MORE INFO
I just wanted to understand the code; it works fine. In the UBX protocol I mentioned there is also a piece of code that explains how it works (it isn't written in C++).
struct NAV_POSLLH {
    //Here goes the struct
};

NAV_POSLLH posllh;

void calcChecksum(unsigned char* CK) {
    memset(CK, 0, 2);
    for (int i = 0; i < (int)sizeof(NAV_POSLLH); i++) {
        CK[0] += ((unsigned char*)(&posllh))[i];
        CK[1] += CK[0];
    }
}
In the link you provided, you can find a link to RFC 1145, which contains that 8-bit Fletcher algorithm as well and explains:
It can be shown that at the end of the loop A will contain the 8-bit
1's complement sum of all octets in the datagram, and that B will
contain (n)*D[0] + (n-1)*D[1] + ... + D[n-1].
n = sizeof byte D[];
Quote adjusted to C syntax
Try it with a couple of bytes, pen and paper, and you'll see :)
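That pen-and-paper exercise can also be sketched in a few lines of Java (class name mine). For the bytes {1, 2, 3}: CK_A is the plain sum 1+2+3 = 6, and CK_B is the sum of the running sums, 1 + 3 + 6 = 10, which matches the weighted form 3·D[0] + 2·D[1] + 1·D[2]:

```java
public class Fletcher8Demo {
    // Returns {CK_A, CK_B} over the payload bytes, as in the UBX checksum loop.
    static int[] checksum(int[] d) {
        int a = 0, b = 0;
        for (int x : d) {
            a = (a + x) & 0xFF; // CK_A: running sum of all bytes (mod 256)
            b = (b + a) & 0xFF; // CK_B: running sum of the running sums
        }
        return new int[]{a, b};
    }

    public static void main(String[] args) {
        int[] ck = checksum(new int[]{1, 2, 3});
        // CK_A = 1+2+3 = 6; CK_B = 1 + (1+2) + (1+2+3) = 10 = 3*1 + 2*2 + 1*3
        System.out.println("CK_A=" + ck[0] + " CK_B=" + ck[1]);
    }
}
```

The masking with 0xFF plays the role of the unsigned char overflow in the original C code.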

Debug data/neon performance hazards in arm neon code

Originally the problem appeared when I tried to optimize an algorithm for ARM NEON, and some minor part of it was taking 80% of the execution time according to the profiler. To see what could be done to improve it, I created an array of function pointers to different versions of my optimized function and ran them in a loop to see in the profiler which one performed better:
typedef unsigned (*CalcMaxFunc)(const uint16_t a[8][4], const uint16_t b[4][4]);

CalcMaxFunc CalcMaxFuncs[] =
{
    CalcMaxFunc_NEON_0,
    CalcMaxFunc_NEON_1,
    CalcMaxFunc_NEON_2,
    CalcMaxFunc_NEON_3,
    CalcMaxFunc_C_0
};

int N = sizeof(CalcMaxFuncs) / sizeof(CalcMaxFuncs[0]); // was sizeof(CalcMaxFunc), the typedef
unsigned ret = 0;
for (int i = 0; i < 10 * N; ++i)
{
    auto f = CalcMaxFuncs[i % N];
    unsigned retI = f(a, b);
    // just random code to ensure that cpu waits for the results
    // and compiler doesn't optimize it away
    if (retI > 1000000)
        break;
    ret |= retI;
}
I got surprising results: the performance of a function was totally dependent on its position within the CalcMaxFuncs array. For example, when I swapped CalcMaxFunc_NEON_3 to be first, it would be 3-4 times slower, and according to the profiler it would stall at the last bit of the function, where it moves data from a NEON register to an ARM register.
So, what makes it stall sometimes and not at other times? By the way, I profile on an iPhone 6 in Xcode, if that matters.
When I intentionally introduced NEON pipeline stalls by mixing in some floating-point division between calls to these functions in the loop, the unreliable behaviour disappeared; now all of them perform the same regardless of the order in which they are called. So why did I have that problem in the first place, and what can I do to eliminate it in real code?
Update:
I tried to create a simple test function and then optimize it in stages to see how I could avoid NEON->ARM stalls.
Here's the test runner function:
void NeonStallTest()
{
    int findMinErr(uint8_t* var1, uint8_t* var2, int size);
    srand(0);
    uint8_t var1[1280];
    uint8_t var2[1280];
    for (int i = 0; i < sizeof(var1); ++i)
    {
        var1[i] = rand();
        var2[i] = rand();
    }
#if 0 // early exit?
    for (int i = 0; i < 16; ++i)
        var1[i] = var2[i];
#endif
    int ret = 0;
    for (int i = 0; i < 10000000; ++i)
        ret += findMinErr(var1, var2, sizeof(var1));
    exit(ret);
}
And findMinErr is this:
int findMinErr(uint8_t* var1, uint8_t* var2, int size)
{
    int ret = 0;
    int ret_err = INT_MAX;
    for (int i = 0; i < size / 16; ++i, var1 += 16, var2 += 16)
    {
        int err = 0;
        for (int j = 0; j < 16; ++j)
        {
            int x = var1[j] - var2[j];
            err += x * x;
        }
        if (ret_err > err)
        {
            ret_err = err;
            ret = i;
        }
    }
    return ret;
}
Basically it computes the sum of squared differences between each uint8_t[16] block and returns the index of the block pair with the lowest squared difference. So then I rewrote it using NEON intrinsics (no particular attempt was made to make it fast, as that's not the point):
int findMinErr_NEON(uint8_t* var1, uint8_t* var2, int size)
{
    int ret = 0;
    int ret_err = INT_MAX;
    for (int i = 0; i < size / 16; ++i, var1 += 16, var2 += 16)
    {
        int err;
        uint8x8_t var1_0 = vld1_u8(var1 + 0);
        uint8x8_t var1_1 = vld1_u8(var1 + 8);
        uint8x8_t var2_0 = vld1_u8(var2 + 0);
        uint8x8_t var2_1 = vld1_u8(var2 + 8);
        int16x8_t s0 = vreinterpretq_s16_u16(vsubl_u8(var1_0, var2_0));
        int16x8_t s1 = vreinterpretq_s16_u16(vsubl_u8(var1_1, var2_1));
        uint16x8_t u0 = vreinterpretq_u16_s16(vmulq_s16(s0, s0));
        uint16x8_t u1 = vreinterpretq_u16_s16(vmulq_s16(s1, s1));
#ifdef __aarch64__1
        err = vaddlvq_u16(u0) + vaddlvq_u16(u1);
#else
        uint32x4_t err0 = vpaddlq_u16(u0);
        uint32x4_t err1 = vpaddlq_u16(u1);
        err0 = vaddq_u32(err0, err1);
        uint32x2_t err00 = vpadd_u32(vget_low_u32(err0), vget_high_u32(err0));
        err00 = vpadd_u32(err00, err00);
        err = vget_lane_u32(err00, 0);
#endif
        if (ret_err > err)
        {
            ret_err = err;
            ret = i;
#if 0 // enable early exit?
            if (ret_err == 0)
                break;
#endif
        }
    }
    return ret;
}
Now, if (ret_err > err) is clearly a data hazard. I then manually "unrolled" the loop by two and modified the code to use err0 and err1, checking them after performing the next round of computation. According to the profiler I got some improvement: in the simple NEON loop roughly 30% of the entire function was spent in the two lines vget_lane_u32 followed by if (ret_err > err); after unrolling by two, these operations took 25% (i.e. roughly a 10% overall speedup). Also, checking the armv7 version, there are only 8 instructions between when err0 is set (vmov.32 r6, d16[0]) and when it is accessed (cmp r12, r6).
Note that in the code the early exit is ifdef'ed out; enabling it makes the function even slower. When I unrolled by four, changed the code to use four errN variables, and deferred the check by two rounds, I still saw vget_lane_u32 taking too much time in the profiler. Checking the generated asm, it appears the compiler defeats all these attempts because it reuses some of the errN registers, which effectively makes the CPU access the results of vget_lane_u32 much earlier than I want (I aim to delay the access by 10-20 instructions). Only when I unrolled by 4 and marked all four errN as volatile did vget_lane_u32 disappear from the profiler entirely; however, the if (ret_err > errN) checks then became slow as hell (these probably ended up as regular stack variables), and overall those 4 checks in the 4x manual unroll started to take 40%. It looks like with proper manual asm it should be possible to make this work: have an early loop exit while avoiding NEON->ARM stalls and keeping some ARM logic in the loop. However, the extra maintenance required to deal with ARM asm makes that kind of code 10x more complex to maintain in a large project (that otherwise has no armasm).
Update:
Here's a sample stall when moving data from a NEON to an ARM register. To implement the early exit I need to move from NEON to ARM once per loop. This move alone takes more than 50% of the entire function according to the sampling profiler that comes with Xcode. I tried adding lots of nops before and/or after the mov, but nothing seems to affect the results in the profiler. I tried using vorr d0,d0,d0 as a nop: no difference. What's the reason for the stall, or is the profiler simply showing wrong results?

QR Code encode mode for short URLs

Usual URL-shortening techniques use only a few characters of the usual URL charset, because they don't need more. A typical short URL is http://domain/code, where code is an integer. Suppose that I can use any base (base10, base16, base36, base62, etc.) to represent the number.
QR Codes have many encoding modes, and we can optimize the QR Code (choosing the minimal version to obtain the lowest density), so we can test pairs of baseX-modeY...
What is the best base-mode pair?
NOTES
A guess...
Two modes fit the "URL shortening profile":
0010 - Alphanumeric encoding (11 bits per 2 characters)
0100 - Byte encoding (8 bits per character)
My choice was "upper-case base36" with Alphanumeric encoding (which also encodes "/", ":", etc.), but I have not seen any demonstration that it is always (for any URL length) the best. Is there a good guide or mathematical demonstration of this kind of optimization?
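One way to make the comparison concrete: counting data bits only (ignoring the mode indicator and character-count field, which also differ per mode and version), alphanumeric mode costs 11 bits per character pair plus 6 bits for an odd trailing character, while byte mode costs a flat 8 bits per character. A small Java sketch (the sample URL is made up for illustration):

```java
public class QrModeBits {
    // Data bits only; mode indicator and character-count field are ignored here.
    static int alphanumericBits(int len) { return 11 * (len / 2) + 6 * (len % 2); }

    static int byteBits(int len) { return 8 * len; }

    public static void main(String[] args) {
        String url = "HTTP://BIT.LY/XK9"; // hypothetical: upper-case so alphanumeric mode applies
        int n = url.length();             // 17 characters
        System.out.println("alphanumeric: " + alphanumericBits(n) + " bits"); // 11*8 + 6 = 94
        System.out.println("byte:         " + byteBits(n) + " bits");         // 8*17 = 136
    }
}
```

So for the same character count, alphanumeric costs roughly 5.5 bits per character against byte mode's 8, which is why the upper-case base36 choice tends to win despite its longer code string.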
The ideal (perhaps impracticable)
There is another variation: "encoding modes can be mixed as needed within a QR symbol" (Wikipedia)... So we can also use
HTTP://DOMAIN/ in Alphanumeric + change_mode + Numeric encoding (10 bits per 3 digits)
For long URLs (long integers), of course, this is the best solution (!), because it uses the full charset with no waste... Is it?
The problem is that this kind of optimization (mixed mode) is not accessible in the usual QR-Code image generators... Is it practicable? Is there a generator that uses it correctly?
An alternative answer format
The (practicable) question is about the best combination of base and mode, so we can express it as a function (e.g. in JavaScript):
function bestBaseMode(domain, number_range) {
    var dom_len = domain.length;
    var urlBase_len = dom_len + 8; // 8 = "http://".length + "/".length
    var num_min = number_range[0];
    var num_max = number_range[1];
    // ... check optimal base and mode
    return [base, mode];
}
Example-1: the domain is "bit.ly" and the code is an ISO3166-1-numeric country code,
ranging from 4 to 894. So urlBase_len=14, num_min=4 and num_max=894.
Example-2: the domain is "postcode-resolver.org" and the number_range parameter is the range of the most frequent postal-code integer representations, for instance a statistically inferred range from ~999 to ~9999999. So urlBase_len=29, num_min=999 and num_max=9999999.
Example-3: the domain is "my-example3.net" and number_range is a double SHA-1 code, i.e. a fixed-length code of 40 bytes (2 concatenated 40-hex-digit numbers). So num_max=num_min=Math.pow(2,320).
Nobody wanted my bounty... I lost it, and now I also need to do the work myself ;-)
about the ideal
The goQR.me support replied to the particular question about mixed encoding, noting that, unfortunately, it can't be relied upon:
sorry, our api does not support mixed qr code encoding.
Even the standard may defined it. Real world QR code scanner apps
on mobile phone have tons of bugs, we would not recommend to rely
on this feature.
functional answer
This function shows the answers in the console... It is a simplification and a "brute force" solution.
/**
 * Find the best base-mode pair for a short URL template as QR-Code.
 * @param msg for debug or report.
 * @param domain the string of the internet domain
 * @param digits10 the max. number of digits in a decimal representation
 * @return array of objects with equivalent valid answers.
 */
function bestBaseMode(msg, domain, digits10) {
    var commonBases = [2, 8, 10, 16, 36, 60, 62, 64, 124, 248]; // your config
    var dom_len = domain.length;
    var urlBase_len = dom_len + 8; // 8 = "http://".length + "/".length
    var numb = parseFloat("9".repeat(digits10));
    var scores = [];
    var best = 99999;
    for (var i in commonBases) {
        var b = commonBases[i];
        // formula at http://math.stackexchange.com/a/335063
        var digits = Math.floor(Math.log(numb) / Math.log(b)) + 1;
        var mode = 'alpha';
        var len = dom_len + digits;
        var lost = 0;
        if (b > 36) {
            mode = 'byte';
            lost = parseInt(urlBase_len * 0.25); // only 6 of 8 bits used at URL
        }
        var score = len + lost; // penalty
        scores.push({BASE: b, MODE: mode, digits: digits, score: score});
        if (score < best) best = score;
    }
    var r = [];
    for (var i in scores) {
        if (scores[i].score == best) r.push(scores[i]);
    }
    return r;
}
Running the question examples:
var x = bestBaseMode("Example-1", "bit.ly",3);
console.log(JSON.stringify(x)) // "BASE":36,"MODE":"alpha","digits":2,"score":8
var x = bestBaseMode("Example-2", "postcode-resolver.org",7);
console.log(JSON.stringify(x)) // "BASE":36,"MODE":"alpha","digits":5,"score":26
var x = bestBaseMode("Example-3", "my-example3.net",97);
console.log(JSON.stringify(x)) // "BASE":248,"MODE":"byte","digits":41,"score":61