Modulo arithmetic in BigQuery: compute `x % y`, where `x` is a 128-bit number

Taking the MD5 of a string as a 128-bit representation of an integer x, how do I compute x % y in Google BigQuery, where y will typically be relatively small (approx. 1000)?
BigQuery has an MD5 function, returning a result of type BYTES with 16 bytes (i.e. 128 bits).
(Background: this is to compute deterministic pseudo-random numbers. However, for legacy and compatibility reasons, I have no flexibility in the algorithm, even though we know it has a (very slight) bias.)
This needs to be done millions/billions of times every day for different input strings and different moduli, so hopefully it can be done efficiently. As a fallback, I can compute it externally in another language and upload to BigQuery afterwards; but it would be great if I could do this directly in BigQuery.
I have studied a lot of number theory, so maybe we can use some mathematical tricks. However, I'm still stuck on more basic BigQuery issues:
How do I convert a bytes array to some sort of "big integer" type?
Can I access a subrange of the bytes from a BYTES array?
Given one byte (or maybe four bytes?), can I convert it to an integer type on which I can apply arithmetic operations?

With the power of math and a longish SQL function. The trick is Horner's method: treat the 16 bytes as base-256 digits and reduce modulo m after each step, since MOD(MOD(a, m) * 256 + b, m) = MOD(a * 256 + b, m), and every intermediate value stays below m * 256, safely inside INT64 for small m:
CREATE TEMP FUNCTION modulo_md5(str ANY TYPE, m ANY TYPE) AS ((
SELECT MOD(MOD(MOD(MOD(MOD(MOD(MOD(MOD(MOD(MOD(MOD(MOD(MOD(MOD(MOD(MOD(0
* 256 + num[OFFSET(0)], m )
* 256 + num[OFFSET(1)], m )
* 256 + num[OFFSET(2)], m )
* 256 + num[OFFSET(3)], m )
* 256 + num[OFFSET(4)], m )
* 256 + num[OFFSET(5)], m )
* 256 + num[OFFSET(6)], m )
* 256 + num[OFFSET(7)], m )
* 256 + num[OFFSET(8)], m )
* 256 + num[OFFSET(9)], m )
* 256 + num[OFFSET(10)], m )
* 256 + num[OFFSET(11)], m )
* 256 + num[OFFSET(12)], m )
* 256 + num[OFFSET(13)], m )
* 256 + num[OFFSET(14)], m )
* 256 + num[OFFSET(15)], m )
FROM (SELECT TO_CODE_POINTS(MD5(str)) num)
));
SELECT title, modulo_md5(title, 177) result, TO_HEX(MD5(title)) md5
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
WHERE wiki='en'
LIMIT 100000
And now you can use it as a persistent shared UDF:
SELECT fhoffa.x.modulo_md5("any string", 177) result
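To answer the sub-questions directly, here is a minimal sketch of the byte-level building blocks the function relies on (all standard BigQuery functions; the alias h is just for illustration):
SELECT
  TO_CODE_POINTS(h) AS all_bytes,             -- BYTES -> ARRAY<INT64>, one 0-255 value per byte
  SUBSTR(h, 1, 4) AS first_four_bytes,        -- SUBSTR also slices BYTES (1-based position)
  TO_CODE_POINTS(h)[OFFSET(0)] AS first_byte  -- a plain INT64 you can do arithmetic on
FROM (SELECT MD5('any string') AS h);
With these, the nested-MOD function above is just Horner's method applied to the code-point array.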

Related

Size of buffer to hold base58 encoded data

When trying to understand how Base58Check works, I looked at the reference implementation by Bitcoin; when calculating the size needed to hold a base58-encoded string, it uses the following formula:
// https://github.com/bitcoin/libbase58/blob/master/base58.c#L155
size = (binsz - zcount) * 138 / 100 + 1;
where binsz is the size of the input buffer to encode, and zcount is the number of leading zeros in the buffer. Where do the 138 and 100 come from, and why?
tl;dr
It's a formula to approximate the output size during base58 <-> base256 conversion, i.e. the encoding/decoding parts where you're multiplying and mod'ing by 256 and 58.
Encoding output is ~138% of the input size (+1/rounded up):
n * log(256) / log(58) + 1
(n * 138 / 100 + 1)
Decoding output is ~73% of the input size (+1/rounded up):
n * log(58) / log(256) + 1
(n * 733 / 1000 + 1)
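In more detail (my derivation, not part of the original answer): an n-byte input is a number below $256^n$, and such a number needs at most $\lceil n \log_{58} 256 \rceil$ base-58 digits:
$$\lceil n \log_{58} 256 \rceil = \lceil 1.3657\ldots \cdot n \rceil \le \left\lfloor \frac{138\,n}{100} \right\rfloor + 1,$$
since 138/100 = 1.38 is a cheap integer ratio just above $\log_{58} 256 \approx 1.36566$; the decoding bound works the same way, with 733/1000 = 0.733 just above $\log_{256} 58 \approx 0.73221$.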

Precision of div in SQL

select 15000000.0000000000000 / 6060802.6136561442650
gives 2.47491973525125848
How can I get 2.4749197352512584803724193507358?
Thanks a lot
You can't, because of the rules for determining a result's precision and scale. In fact, your scale is so large that there's no way to shift the result into range (even specifying no scale for the left operand wouldn't help).
First...
The decimal data type supports precision up to 38 digits
... but "precision" here means the total number of digits. Which, yes, your result should fit, but the engine won't shift things for you. The relevant rule is:
Operation: e1 / e2
Result precision: p1 - s1 + s2 + max(6, s1 + p2 + 1)
Result scale: max(6, s1 + p2 + 1) *
* The result precision and scale have an absolute maximum of 38. When a result precision is greater than 38, the corresponding scale is reduced to prevent the integral part of a result from being truncated.
... and you're running afoul of that last note. Let's run the numbers.
Your operands have precisions (total digits) of 21 and 20 (p1 and p2, respectively)
Your operands have scales (digits after the decimal) of 13 (s1 and s2)
So:
21 - 13 + 13 + max(6, 13 + 20 + 1) <- The bit in max is the scale, too
21 + max(6, 34)
21 + 34
= 55, with a scale of 34
... except 55 > 38, so the number of digits needs to be reduced. Because the least significant digits are the ones after the decimal point, they're dropped from the scale (which also reduces the precision):
55 - 38 = 17 <- difference
55 - 17 = 38 <- final precision
34 - 17 = 17 <- final scale
Now, if we count the number of digits from the answer it gives you, .47491973525125848, you'll get 17 digits.
SQL Server can store decimal numbers with a maximum precision of 38.
SELECT CONVERT(decimal(38,37), 15000000.0000000000000 / 6060802.6136561442650)
AS TestValue returns 2.4749197352512584800000000000000000000 (note the inner division was already truncated, so the extra scale is only zero padding).
If there is a pattern in the first parameter, you may save some precision with re-formulation such as
select 1000000 * (15 / 6060802.6136561442650)
I can't test it in SQL Server; I only have Oracle available, where I get
2,47491973525125848037241935073575410941
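Another way to see the rules in action (my sketch, not from either answer above; it only applies the formulas quoted earlier): shrinking the divisor's declared precision shrinks max(6, s1 + p2 + 1), which leaves more of the 38-digit budget for the result's scale.
-- Divisor cast to decimal(16,9): p2 = 16, s2 = 9, so the result scale is
-- max(6, 13 + 16 + 1) = 30 and the precision 21 - 13 + 9 + 30 = 47,
-- capped at 38 by dropping 9 digits of scale: final scale 21 instead of 17.
-- Caveat: the divisor itself is rounded to 9 decimal places, so the extra
-- displayed digits are only as trustworthy as that rounding allows.
SELECT 15000000.0000000000000 / CAST(6060802.6136561442650 AS decimal(16,9));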

Explanation of Processing Image to Byte Array

Can someone explain to me how an image is converted to a byte array?
I need the theory.
I want to use the image for AES encryption (VB.NET), so after I use the OpenFileDialog, my app will load the image and then process it into a byte array, but I need an explanation of that process (how pixels turn into a byte array).
Thanks for the answer, and sorry for the beginner question.
A reference link is accepted :)
When you read the bytes from the image file via File.ReadAllBytes(), their meaning depends on the image's file format.
The image file format (e.g. Bitmap, PNG, JPEG2000) defines how pixel values are converted to bytes, and conversely, how you get pixel values back from bytes.
The PNG and JPEG formats are compressed formats, so it would be difficult for you to write code to do that. For Bitmaps, it would be rather easy because it's a simple format. (See Wikipedia.)
But it can be much simpler: you can just use .NET's Bitmap class to load any common image file into memory and then use Bitmap.GetPixel() to access pixels via their x, y coordinates.
Bitmap.GetPixel() is slow for larger images, though. To speed this up, you'll want to access the raw representation of the pixels directly in memory. No matter what kind of image you load with the Bitmap class, it always creates a bitmap representation of it in memory, whose exact layout depends on Bitmap.PixelFormat, and you can access it using the pattern below. The workflow would be:
Copy the in-memory bitmap to a byte array using Bitmap.LockBits() and Marshal.Copy().
Extract the R, G, B values from the byte array, e.g. with this formula in the case of PixelFormat.Format24bppRgb:
// Access pixel at (x, y). After Marshal.Copy the managed array starts at
// index 0; bitmapData.Scan0 was only needed as the unmanaged source address.
B = bytes[x * 3 + y * bitmapData.Stride + 0]
G = bytes[x * 3 + y * bitmapData.Stride + 1]
R = bytes[x * 3 + y * bitmapData.Stride + 2]
Or for PixelFormat.Format32bppArgb:
// Access pixel at (x, y)
B = bytes[x * 4 + y * bitmapData.Stride + 0]
G = bytes[x * 4 + y * bitmapData.Stride + 1]
R = bytes[x * 4 + y * bitmapData.Stride + 2]
A = bytes[x * 4 + y * bitmapData.Stride + 3]
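One detail worth knowing about the formulas above: Stride is the length of one pixel row in bytes including padding (GDI+ aligns rows to 4-byte boundaries), which is why the row offset is y * bitmapData.Stride rather than y * width * 3 or y * width * 4.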
Each pixel is made up of 3 or 4 bytes, depending on the image's pixel format: some formats use 3 bytes per pixel (red, green, and blue), while others require 4 (an alpha channel plus R, G and B).
You may use something like:
Dim NewByteArray As Byte() = File.ReadAllBytes("c:\folder\image")
NewByteArray will be filled with every byte of the image file, and you can process them with AES regardless of their position or meaning.

Shuffle data in a repeatable way (ability to get the same "random" order again)

This is the opposite of what most "random order" questions are about.
I want to select data from a database in random order. But I want to be able to repeat certain selects, getting the same order again.
Current (random) select:
SELECT custId, rand() AS random FROM
(
SELECT DISTINCT custId FROM dummy
) t
Using this, every key/row gets a random number. Ordering those ascending results in a random order.
But I want to repeat this select, getting the very same order again. My idea is to calculate a random number (r) once per session (e.g. "4") and use this number to shuffle the data in some way.
My first idea:
SELECT custId, custId * 4 AS random FROM
(
SELECT DISTINCT custId FROM dummy
) t
(in real life "4" would be something like 4005226664240702)
This results in a different number for each line but the same ones every run. By changing "r" to 5 all numbers will change.
The problem is: multiplication is not sufficient here. It just increases the numbers but keeps the order the same. Therefore I need some other kind of arithmetic function.
More abstract
Starting with my data (A-D). k is the key and r is the random number currently used:
k r
A = 1 4
B = 2 4
C = 3 4
D = 4 4
Doing some calculation using k and r in every line I want to get something like:
k r
A = 1 4 --> 12
B = 2 4 --> 13
C = 3 4 --> 11
D = 4 4 --> 10
The numbers can be whatever they want, but when I order them ascending I want to get a different order than the initial one. In this case: D, C, A, B.
Setting r to 7 should result in a different order (C, A, B, D):
k r
A = 1 7 --> 56
B = 2 7 --> 78
C = 3 7 --> 23
D = 4 7 --> 80
Every time I use r = 7 should result in the same numbers => same order.
I'm looking for a mathematical function to do the calculation with k and r. Seeding the RAND() function is not suitable because it isn't supported by some of the databases we have to support.
Please note that r is already a randomly generated number
Background
One table, two data consumers. One consumer will get a random 5% of the table, the other one the remaining 95%. They don't get the data directly but a generated SQL statement, so there are two statements which must never select the same row, yet still pick rows at random.
You could try to implement the multiply-with-carry (MWC) pseudorandom number generator. The C version goes like this (source: Wikipedia):
m_w = <choose-initializer>; /* must not be zero, nor 0x464fffff */
m_z = <choose-initializer>; /* must not be zero, nor 0x9068ffff */
uint get_random()
{
m_z = 36969 * (m_z & 65535) + (m_z >> 16);
m_w = 18000 * (m_w & 65535) + (m_w >> 16);
return (m_z << 16) + m_w; /* 32-bit result */
}
In SQL, you could create a table Random with two columns to contain w and z, and an ID column to identify each session. Perhaps your vendor supports variables, in which case you need not bother with the table.
Nonetheless, even if we use a table, we immediately run into trouble because ANSI SQL doesn't support unsigned INTs. In SQL Server I can switch to BIGINT; I'm unsure whether your vendor supports that.
CREATE TABLE Random (ID INT, [w] BIGINT, [z] BIGINT)
Initialize a new session, say number 3, by inserting 1 into z and the seed into w:
INSERT INTO Random (ID, w, z) VALUES (3, 8921, 1);
Then each time you wish to generate a new random number, do the computations:
UPDATE Random
SET
z = (36969 * (z % 65536) + z / 65536) % 4294967296,
w = (18000 * (w % 65536) + w / 65536) % 4294967296
WHERE ID = 3
(Note how I have replaced the bitwise operators with div and mod operations and how, after computing, you need to mod by 4294967296 to stay within the proper 32-bit unsigned int range.)
And select the new value:
SELECT (z * 65536 + w) % 4294967296
FROM Random
WHERE ID = 3
SQLFiddle demo
Not sure if this applies outside SQL Server, but typically when you use a RAND() function, you can specify a seed. Every time you specify the same seed, the randomization will be the same.
So it sounds like you just need to store the seed number and use it each time to get the same set of random numbers.
MSDN Article on RAND
Each vendor has solved this in its own way. Creating your own implementation will be hard, since random number generation is difficult.
Oracle
dbms_random can be initialized with a seed: http://docs.oracle.com/cd/B19306_01/appdev.102/b14258/d_random.htm#i998255
SQL Server
First call to RAND() can provide a seed: http://technet.microsoft.com/en-us/library/ms177610.aspx
MySQL
First call to RAND() can provide a seed: http://dev.mysql.com/doc/refman/4.1/en/mathematical-functions.html#function_rand
PostgreSQL
Use SET SEED or SELECT setseed(): http://www.postgresql.org/docs/8.3/static/sql-set.html
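If seeded RAND() isn't portable enough (the asker rules it out above), here is a vendor-neutral sketch along the lines the asker already tried; this is my addition, not part of the answers above. Plain multiplication preserves order, but multiplication modulo a prime p does not: for keys below p and any r with r % p <> 0, the map k -> (k * r) % p is a bijection, so each r yields one fixed, repeatable order.
-- Hypothetical sketch: p = 2147483647 (the prime 2^31 - 1); assumes custId < p
-- and r % p <> 0. Reducing r modulo p first keeps the product within BIGINT.
SELECT custId,
       ((custId % 2147483647) * (4005226664240702 % 2147483647)) % 2147483647 AS random
FROM (SELECT DISTINCT custId FROM dummy) t
ORDER BY random;
Note that a tiny r (like the 4 in the question) would not wrap around p and would keep the original order; r should be a large random number, which the asker already has.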

Checkerboard indexing in CUDA

So, here's the question. I want to do a computation in CUDA where I have a large 1D array (which represents a lattice), I partition it into subarrays of length #part, and I want each thread to do a couple of computations on each subarray.
More specifically, let's say that we have a number of threads, #threads, and a number of blocks, #blocks. The array is of size N = 2 * #part * #threads * #blocks. If we number the subarrays from 1 to 2 * #blocks * #threads, we want to first use the #threads * #blocks threads to do the computation on the even-numbered subarrays, and then the same number of threads on the odd-numbered ones.
I thought that I could have a local index in each thread which would denote where its subarray starts.
So, I used the following index :
localIndex = #part * (2 * threadIdx.x + var) + 2 * #part * #threads * blockIdx.x;
var is either 1 or 0, depending on whether we want the thread to work on a subarray with an even or an odd number.
I've tried to run it, and something seems to go wrong when I use more than one block. Have I done something wrong with the indexing?
Thanks.
Why is it important that the threads collectively do first the even, then the odd subarrays? Since block and thread execution order is not guaranteed anyway, there is no benefit to it.
Assuming you only use the x-dimension in your kernel launch configuration:
subArrayIndexEven = 2 * (blockIdx.x * blockDim.x + threadIdx.x) * part
subArrayIndexOdd = subArrayIndexEven + part
Check:
BLOCK_SIZE = 3
NUM_OF_BLOCKS = 2
PART = 4
N = 2 * 3 * 2 * 4 = 48
T(threadIdx.x, blockIdx.x)
T(0, 1) -> even = 2 * (1 * 3 + 0) * 4 = 24, odd = 28
T(1, 1) -> even = 2 * (1 * 3 + 1) * 4 = 32, odd = 36
T(2, 1) -> even = 2 * (1 * 3 + 2) * 4 = 40, odd = 44
// Assumes one thread per element of a subarray (i.e. #part * #threads * #blocks
// threads in total), so idx enumerates the elements of all even (or odd) subarrays.
int idx = blockDim.x * blockIdx.x + threadIdx.x;
int my_even_offset, my_odd_offset, my_even_idx, my_odd_idx;
int my_offset = idx / num_part; // which subarray pair; integer division, no float needed
// Each pair of subarrays spans 2 * num_part elements: the idx-th "even" element
// skips the odd subarray of every earlier pair, and the "odd" one additionally
// skips its own pair's even subarray.
my_even_offset = my_offset * num_part;
my_odd_offset = (my_offset + 1) * num_part;
my_even_idx = idx + my_even_offset;
my_odd_idx = idx + my_odd_offset;
// Do stuff with the indices.