BigQuery UDF using BYTES datatype - google-bigquery

I am currently trying to calculate the Hamming distance between two binary strings in BigQuery using JavaScript user-defined functions. My schema is quite simple:
row_id STRING
descriptors BYTES REPEATED
phash BYTES
What I find a bit confusing is that BigQuery apparently hands BYTES to JavaScript as a Base64 string. I imported the functions atob() and btoa() so I could work with the binary form of the byte strings instead of the Base64 representation.
My Query currently looks like this:
CREATE TEMP FUNCTION f_PHASH_distance(ph1 BYTES, ph2 BYTES)
RETURNS INT64
LANGUAGE js AS
"""
return HammingDistance(ph1, ph2);
"""
OPTIONS (
  library=["gs://test.appspot.com/HammingDistance.js",
           "gs://test.appspot.com/btoa_atob.js"]
);
SELECT f_PHASH_distance(phash, CAST("9Slp3g9OgVI=" AS BYTES))
FROM ims.images WHERE row_id = "2333USX"
And the phash of the row with id = "2333USX" is equal to "9Slp3g9OgVI=" in Base64, which means the Hamming distance should be 0. But instead of 0 I am currently getting 35 in BigQuery.
HammingDistance.js has the following content:
function HammingDistance(a, b){
  var count = 0;
  for(var i = 0; i < a.length; i++){
    // calculate XOR between the two chars
    var xor = a.charCodeAt(i) ^ b.charCodeAt(i);
    // count number of 1's on the result
    for(var j = 0; j < 16; j++){
      //add if LSB is 1
      count += xor % 2;
      //right shift the variable
      xor = xor >> 1;
    }
  }
  return count;
}
/**
 * Calculates the distance between the perceptual hashes of two images
 * encoded in Base64.
 */
function PHASHDistance(a, b){
  return HammingDistance(atob(a), atob(b));
}
Testing it in my browser's JS console, I do get the expected result, so I assume I am doing something wrong with the casts, but the documentation on UDFs with BYTES parameters is very scarce.
Any help would be much appreciated.

It looks like the problem is that you are casting "9Slp3g9OgVI=" to bytes rather than converting it to bytes from base64. I think you want this instead:
SELECT f_PHASH_distance(phash, FROM_BASE64("9Slp3g9OgVI="))
FROM ims.images WHERE row_id = "2333USX"
You might be better off using SQL functions rather than JavaScript functions, though, since JavaScript normally isn't as fast. Here's a Hamming distance implementation in SQL, assuming that the bytes have equal lengths:
#standardSQL
CREATE TEMP FUNCTION HammingDistance(b1 BYTES, b2 BYTES) AS (
  BIT_COUNT(b1 ^ b2)
);
WITH Input AS (
  SELECT b'defdef' AS bytes UNION ALL
  SELECT b'123de4' UNION ALL
  SELECT b'abc123'
)
SELECT HammingDistance(b'abcdef', bytes)
FROM Input;
It takes the bitwise XOR of the two byte values, then checks how many bits are not the same.
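If it helps to see the same XOR-and-popcount idea outside SQL, here is a minimal C sketch. The byte values are my own decoding of the question's Base64 string, so treat them as illustrative:
#include <stdio.h>
#include <stddef.h>

/* Hamming distance between two equal-length byte buffers:
   XOR each pair of bytes, then count the set bits of the result. */
static int hamming_distance(const unsigned char *a, const unsigned char *b, size_t len)
{
    int count = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char x = a[i] ^ b[i];
        while (x) {                        /* Kernighan's trick: clear the lowest set bit */
            x &= (unsigned char)(x - 1);
            count++;
        }
    }
    return count;
}

int main(void)
{
    /* "9Slp3g9OgVI=" decoded from Base64 (8 bytes) */
    unsigned char phash[] = { 0xF5, 0x29, 0x69, 0xDE, 0x0F, 0x4E, 0x81, 0x52 };
    printf("%d\n", hamming_distance(phash, phash, sizeof phash)); /* identical inputs -> 0 */
    return 0;
}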

In case someone is looking for a solution for comparing regular strings (not binary ones, as in this question), look at my answer here

Related

Get the most occurring number amongst several integers without using arrays

DISCLAIMER: A rather theoretical question here, not looking for a correct answer, just asking for some inspiration!
Consider this:
A function is called repetitively and returns integers based on seeds (the same seed returns the same integer). Your task is to find out which integer is returned most often. Easy enough, right?
But: You are not allowed to use arrays or fields to store return values of said function!
Example:
int mostFrequentNumber = 0;
int occurencesOfMostFrequentNumber = 0;
int iterations = 10000000;
for(int i = 0; i < iterations; i++)
{
    int result = getNumberFromSeed(i);
    int occurencesOfResult = magic();
    if(occurencesOfResult > occurencesOfMostFrequentNumber)
    {
        mostFrequentNumber = result;
        occurencesOfMostFrequentNumber = occurencesOfResult;
    }
}
If getNumberFromSeed() returns 2,1,5,18,5,6 and 5 then mostFrequentNumber should be 5 and occurencesOfMostFrequentNumber should be 3 because 5 is returned 3 times.
I know this could easily be solved using a two-dimensional list to store results and occurrences. But imagine for a minute that you cannot use any kind of arrays, lists, dictionaries, etc. (Maybe because the system that is running the code has such limited memory that you cannot store enough integers at once, or because your prehistoric programming language has no concept of collections.)
How would you find mostFrequentNumber and occurencesOfMostFrequentNumber? What does magic() do?? (Of course you do not have to stick to the example code. Any ideas are welcome!)
EDIT: I should add that the integers returned by getNumberFromSeed() are calculated from a seed, so the same seed returns the same integer (i.e. int result = getNumberFromSeed(5); would always assign the same value to result).
Make a hypothesis: assume that the distribution of the integers is, e.g., Normal.
Start simple. Have two variables:
- N, the number of elements read so far
- M1, the average of said elements.
Initialize both variables to 0. Every time you read a new value x, update N to be N + 1 and M1 to be M1 + (x - M1)/N.
At the end M1 will equal the average of all values. If the distribution was Normal this value will have a high frequency.
Now improve the above. Add a third variable:
- M2, the running sum of (x - M1)^2 over all values of x read so far.
Initialize M2 to 0. For every new value x that you read, update N as above, then M2 (using the value of M1 from before its update), and finally M1:
M2 := M2 + (x - M1)^2 * (N - 1) / N
At every step M2 / N is the variance of the values seen so far, and sqrt(M2 / N) their standard deviation.
As you proceed, keep a small memory of, say, 10 elements or so: remember the frequencies of only those values read so far whose distance to M1 is less than sqrt(M2 / N). This requires the use of a small additional array; however, that array will be very short compared to the high number of iterations you will run. This modification will let you guess the most frequent value better than simply answering with the mean (or average) as above.
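For inspiration, here is a minimal C sketch of the streaming mean/variance bookkeeping described above (Welford's method). getNumberFromSeed() is a hypothetical stand-in for the question's function:
#include <math.h>
#include <stdio.h>

/* Hypothetical stand-in for the question's seeded function. */
static int getNumberFromSeed(int seed) { return (seed * 31 + 7) % 100; }

int main(void)
{
    long long n = 0;
    double m1 = 0.0; /* running mean */
    double m2 = 0.0; /* running sum of squared deviations from the mean */
    for (int i = 0; i < 10000000; i++) {
        double x = getNumberFromSeed(i);
        n++;
        double delta = x - m1;  /* deviation from the old mean */
        m1 += delta / n;        /* update the mean */
        m2 += delta * (x - m1); /* equals delta^2 * (n - 1) / n */
    }
    /* m2 / n is the population variance of everything read so far */
    printf("mean = %f, sd = %f\n", m1, sqrt(m2 / n));
    return 0;
}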
UPDATE
Given that this is about insights for inspiration, there is plenty of room for considering and adapting the approach I've proposed to any particular situation. Here are some thoughts:
When I say assume that the distribution is Normal, you should think of it as: given that the problem has no exact solution, let's see if there is some qualitative information I can use to decide what kind of distribution the data would have. Given that the algorithm is intended to find the most frequent number, it should be fine to assume that the distribution is not uniform. Let's try Normal, LogNormal, etc. to see what can be found out (more on this below.)
If the game completely disallows the use of any array, then fine: keep track of only, say, 10 numbers. This would allow you to count the occurrences of the 10 best candidates, which will give more confidence to your answer. In doing this, choose your candidates around the theoretically most likely value according to the distribution of your hypothesis.
You cannot use arrays, but perhaps you can read the sequence of numbers two or three times, not just once. In that case you can read it once to check whether your hypothesis about its distribution is good or bad. For instance, if you compute not just the variance but also the skewness and the kurtosis, you will have more elements with which to check your hypothesis. If the first reading indicates that there is some bias, you could use a LogNormal distribution instead, etc.
Finally, in addition to providing the approximate answer you would be able to use the information collected during the reading to estimate an interval of confidence around your answer.
Alright, I found a decent solution myself:
int mostFrequentNumber = 0;
int occurencesOfMostFrequentNumber = 0;
int iterations = 10000000;
int maxNumber = -2147483648; // int.MinValue
int minNumber = 2147483647;  // int.MaxValue
//Step 1: find the largest and smallest number that _can_ occur
for(int i = 0; i < iterations; i++)
{
    int result = getNumberFromSeed(i);
    if(result > maxNumber)
    {
        maxNumber = result;
    }
    if(result < minNumber)
    {
        minNumber = result;
    }
}
//Step 2: for each possible number between minNumber and maxNumber, count occurrences
for(int thisNumber = minNumber; thisNumber <= maxNumber; thisNumber++)
{
    int occurenceOfThisNumber = 0;
    for(int i = 0; i < iterations; i++)
    {
        int result = getNumberFromSeed(i);
        if(result == thisNumber)
        {
            occurenceOfThisNumber++;
        }
    }
    if(occurenceOfThisNumber > occurencesOfMostFrequentNumber)
    {
        occurencesOfMostFrequentNumber = occurenceOfThisNumber;
        mostFrequentNumber = thisNumber;
    }
}
I must admit this may take a long time, on the order of (maxNumber - minNumber) * iterations calls in the worst case, depending on the smallest and largest possible values. But it will work without using arrays.

.NET SQL Computed column [duplicate]

I'm using the CHECKSUM function in SQL Server 2008 R2 and I would like to get the same int values in a C# app.
Is there an equivalent method in C# that returns the same values as the SQL CHECKSUM function?
Thanks
On SQL Server Forum, at this page, it's stated:
The built-in CHECKSUM function in SQL Server is built on a series of 4-bit left-rotational XOR operations. See this post for more explanation.
I was able to port BINARY_CHECKSUM to C# and it seems to be working... I'll be looking at the plain CHECKSUM later...
private int SQLBinaryChecksum(string text)
{
    long sum = 0;
    byte overflow;
    for (int i = 0; i < text.Length; i++)
    {
        // multiply by 16 (shift left 4) and XOR in the next character
        sum = (long)((16 * sum) ^ Convert.ToUInt32(text[i]));
        overflow = (byte)(sum / 4294967296);  // the bits shifted out above 2^32
        sum = sum - overflow * 4294967296;    // keep only the low 32 bits
        sum = sum ^ overflow;                 // fold the overflow back in: the rotation
    }
    // range adjustments from the original port so the result matches
    // SQL Server's signed output
    if (sum > 2147483647)
        sum = sum - 4294967296;
    else if (sum >= 32768 && sum <= 65535)
        sum = sum - 65536;
    else if (sum >= 128 && sum <= 255)
        sum = sum - 256;
    return (int)sum;
}
CHECKSUM docs don't disclose how it computes the hash. If you want a hash you can use in both T-SQL and C#, pick from the algorithms supported by HASHBYTES.
The T-SQL documentation does not specify what algorithm is used by checksum() outside of this:
CHECKSUM computes a hash value, called the checksum, over its list of arguments. The hash value
is intended for use in building hash indexes. If the arguments to CHECKSUM are columns, and an
index is built over the computed CHECKSUM value, the result is a hash index. This can be used for
equality searches over the columns.
It's unlikely to compute an MD5 hash, since its return value (the computed hash) is a 32-bit integer; an MD5 hash is 128 bits in length.
In case you need to do a checksum on a GUID, change dna2's answer to this:
private int SQLBinaryChecksum(byte[] text)
With a byte array, the value from SQL will match the value from C#. To test:
var a = Guid.Parse("DEAA5789-6B51-4EED-B370-36F347A0E8E4").ToByteArray();
Console.WriteLine(SQLBinaryChecksum(a));
vs SQL:
select BINARY_CHECKSUM(CONVERT(uniqueidentifier,'DEAA5789-6B51-4EED-B370-36F347A0E8E4'))
both answers will be -1897092103.
@Dan's implementation of BinaryChecksum can be greatly simplified in C# down to:
int SqlBinaryChecksum(string text)
{
    uint accumulator = 0;
    for (int i = 0; i < text.Length; i++)
    {
        // C# masks shift counts to 5 bits, so >> -4 behaves as >> 28,
        // making this a 4-bit left rotation
        var leftRotate4bit = (accumulator << 4) | (accumulator >> -4);
        accumulator = leftRotate4bit ^ text[i];
    }
    return (int)accumulator;
}
This also makes it clearer what the algorithm is doing: for each character, a 4-bit circular shift, then an XOR with the character's code unit.
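For reference, the same loop in C might look like the sketch below. Like the simplified C# version above, it omits the small-range adjustments of the earlier port, and since C does not allow negative shift counts, the rotation is spelled out (this treats input as single-byte characters):
#include <stdint.h>
#include <stdio.h>

/* Rotate the accumulator left by 4 bits, then XOR in each character. */
static int32_t sql_binary_checksum(const char *text)
{
    uint32_t acc = 0;
    for (const char *p = text; *p != '\0'; p++) {
        uint32_t rotated = (acc << 4) | (acc >> 28); /* 4-bit left rotation */
        acc = rotated ^ (uint32_t)(unsigned char)*p;
    }
    return (int32_t)acc;
}

int main(void)
{
    printf("%d\n", sql_binary_checksum("hello"));
    return 0;
}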

How to generate random number

I am working on a calculator application. All of the operations are clear to me except "rand". Could anyone tell me how to generate a random number when rand is selected?
For example:
initially I select one (1), then rand... it should then display a random number derived from one (1).
In Objective-C you can use the following (for a value between 0 and 1):
int r = arc4random() % 10;
float r2 = r / 10.0f;
Imagine that you want a number between 0 and 50 with two decimal places; then you should do:
int r = arc4random() % (50 * 100);
float r2 = r / 100.0f;
You will get something like 40.12.
NSInteger num = (arc4random() % maxnumber) + 1;
Check the synopsis...
LIBRARY
Standard C Library (libc, -lc)
SYNOPSIS
#include <stdlib.h>
u_int32_t
arc4random(void);
void
arc4random_stir(void);
void
arc4random_addrandom(unsigned char *dat, int datlen);
DESCRIPTION
The arc4random() function uses the key stream generator employed by the arc4 cipher, which uses 8*8 8-bit S-Boxes. The S-Boxes can be in about (2**1700) states. The arc4random() function returns pseudo-random numbers in the range of 0 to (2**32)-1, and therefore has twice the range of rand(3) and random(3).
The arc4random_stir() function reads data from /dev/urandom and uses it to permute the S-Boxes via arc4random_addrandom().
There is no need to call arc4random_stir() before using arc4random(), since arc4random() automatically initializes itself.
EXAMPLES
The following produces a drop-in replacement for the traditional rand() and random() functions using arc4random():
#define foo4random() (arc4random() % ((unsigned)RAND_MAX + 1))
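As a side note, on BSD-derived systems (which includes iOS and macOS), arc4random_uniform() avoids the modulo bias that arc4random() % n has when n does not evenly divide 2^32. A small C sketch:
#include <stdio.h>
#include <stdlib.h> /* arc4random_uniform() on BSD-derived systems */

int main(void)
{
    /* arc4random_uniform(n) returns a uniform value in [0, n) */
    unsigned die = arc4random_uniform(6) + 1; /* 1..6 */
    printf("%u\n", die);
    return 0;
}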

Fast FFT Bit Reversal, Can I Count Down Backwards Bit Reversed?

I'm using FFTs for audio processing, and I've come up with some potentially very fast ways of doing the bit reversal needed, which might be of use to others. But because of the size of my FFTs (8192), I'm trying to reduce memory usage / cache flushing due to the size of lookup tables or code, and to increase performance. I've seen lots of clever bit-reversal routines; they all allow you to feed them any arbitrary value and get a bit-reversed output, but FFTs don't need that flexibility since they go through the indices in a predictable sequence. First let me state what I have tried and/or figured out, since it may be the fastest approach to date and you can see the problem; then I'll ask the question.
1) I've written a program to generate straight-through, unlooped x86 source code that can be pasted into my FFT code, which reads an audio sample, multiplies it by a window value (that's a lookup table itself) and then just places the resulting value in its proper bit-reversed position via absolute offsets in the x86 addressing modes, like: movlps [edi+1876], xmm0. This is the absolute fastest way to do this for smaller FFT sizes. The problem is that when I write straight-through code to handle 8192 values, the code grows beyond the L1 instruction cache size and performance drops way down. Of course, in contrast, a 32K bit-reversal lookup table mixed with a 32K window table, plus other stuff, is also too big to fit the L1 data cache, and performance drops way down, but that's the way I'm currently doing it.
2) I've found patterns in the bit-reversal sequence that can be exploited to reduce lookup table size. For example, using 4-bit numbers (0..15), the bit-reversal sequence looks like: 0,8,4,12,2,10,6,14 | 1,9,5,13,3,11,7,15. The first thing that can be seen is that the last 8 numbers are the same as the first 8 plus 1, so I can chop my LUT in half. If I look at the differences between the numbers there is more redundancy, so if I start with a zero in a register and want to add values to it to get the next bit-reversed number, they would be: +0,+8,-4,+8,-10,+8,-4,+8, and the same for the second half. As can be seen, I could have a lookup table of just 0 and -10, because the +8's and -4's always show up in a predictable way. The code would be unrolled to handle 4 values per loop: one would be a lookup table read, and the other 3 would be straight code for +8, -4, +8, before looping around again. Then a second loop could handle the 1,9,5,13,3,11,7,15 sequence. This is great, because I can now chop down my lookup table by another factor of 4. This scales up the same way for an 8192-size FFT: I can now get by with a 4K LUT instead of 32K. I can exploit the same pattern and double the size of my code to chop down the LUT by another half yet again, however far I want to go. But in order to eliminate the LUT altogether, I'm back to the prohibitive code size.
For large FFT sizes, I believe that this #2 solution is the absolute fastest to date, since a relatively small percentage of lookup table reads need to be done, and every algorithm I currently find on the web requires too many serial/dependency calculations which can't be vectorized.
The question is, is there an algorithm that can increment numbers so the MSB acts like the LSB, and so on? In other words (in binary): 0000, 1000, 0100, 1100, 0010, etc… I've tried to think up some way, and so far, short of a bunch of nested loops, I can't seem to find a way for a fast and simple algorithm that is a mirror image of simply adding 1 to the LSB of a number. Yet it seems like there should be a way.
One other approach to consider: take a well known bit reversal algorithm - typically a few masks, shifts, and ORs - then implement this with SSE, so you get e.g. 8 x 16 bit bit reversals for the price of one. For 16 bits you need 5*log2(N) = 20 instructions, so the aggregate throughput would be 2.5 instructions per bit reversal.
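For reference, a scalar C version of that mask/shift/OR reversal for 16 bits might look like the sketch below; the SSE version would apply the same masks to 8 lanes at once:
#include <stdint.h>

/* Reverse the bits of a 16-bit value: swap adjacent bits, then bit pairs,
   then nibbles, then the two bytes. */
static uint16_t reverse16(uint16_t v)
{
    v = (uint16_t)(((v & 0x5555) << 1) | ((v >> 1) & 0x5555)); /* bits    */
    v = (uint16_t)(((v & 0x3333) << 2) | ((v >> 2) & 0x3333)); /* pairs   */
    v = (uint16_t)(((v & 0x0F0F) << 4) | ((v >> 4) & 0x0F0F)); /* nibbles */
    v = (uint16_t)((v << 8) | (v >> 8));                       /* bytes   */
    return v;
}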
This is the most trivial and straightforward solution (in C):
void BitReversedIncrement(unsigned *var, int bit)
{
    unsigned c, one = 1u << bit;
    do {
        c = *var & one;
        (*var) ^= one;
        one >>= 1;
    } while (one && c);
}
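To see it walk the sequence from the question, here is a small driver (using the function above) for a hypothetical 16-point FFT:
#include <stdio.h>

void BitReversedIncrement(unsigned *var, int bit); /* as defined above */

int main(void)
{
    const int logN = 4; /* N = 16 */
    unsigned idx = 0;
    for (int n = 0; n < (1 << logN); n++) {
        printf("%u ", idx); /* prints 0 8 4 12 2 10 6 14 1 9 5 13 3 11 7 15 */
        BitReversedIncrement(&idx, logN - 1);
    }
    putchar('\n');
    return 0;
}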
The main problem with it is the conditional branches, which are often costly on modern CPUs. You have one conditional branch per bit.
You can do reversed increments by working on several bits at a time, e.g. 3 if ints are 32-bit:
void BitReversedIncrement2(unsigned *var, int bit)
{
    unsigned r = *var, t = 0;
    while (bit >= 2 && !t)
    {
        unsigned tt = (r >> (bit - 2)) & 7;
        t = (07351624 >> (tt * 3)) & 7;
        r ^= ((tt ^ t) << (bit - 2));
        bit -= 3;
    }
    if (bit >= 0 && !t)
    {
        t = r & ((1 << (bit + 1)) - 1);
        r ^= t;
        t <<= 2 - bit;
        t = (07351624 >> (t * 3)) & 7;
        t >>= 2 - bit;
        r |= t;
    }
    *var = r;
}
This is better, you only have 1 conditional branch per 3 bits.
If your CPU supports 64-bit ints, you can work on 4 bits at a time:
void BitReversedIncrement3(unsigned *var, int bit)
{
    unsigned r = *var, t = 0;
    while (bit >= 3 && !t)
    {
        unsigned tt = (r >> (bit - 3)) & 0xF;
        t = (0xF7B3D591E6A2C48ULL >> (tt * 4)) & 0xF;
        r ^= ((tt ^ t) << (bit - 3));
        bit -= 4;
    }
    if (bit >= 0 && !t)
    {
        t = r & ((1 << (bit + 1)) - 1);
        r ^= t;
        t <<= 3 - bit;
        t = (0xF7B3D591E6A2C48ULL >> (t * 4)) & 0xF;
        t >>= 3 - bit;
        r |= t;
    }
    *var = r;
}
Which is even better. And the only look-up table (07351624 or 0xF7B3D591E6A2C48) is tiny and likely encoded as an immediate instruction operand.
You can further improve the code if the bit position of the reversed "1" is a known constant: just unroll the while loop into nested ifs and substitute the constant for that bit position.
For larger FFTs, paying attention to cache blocking (minimizing total uncovered cache miss cycles) can have a far larger effect on performance than optimization of the cycle count taken by indexing bit reversal. Make sure not to de-optimize a bigger effect by a larger cycle count while optimizing the smaller effect. For small FFTs, where everything fits in cache, LUTs can be a good solution as long as you pay attention to any load-use hazards by making sure things are or can be pipelined appropriately.

Fast way to sum Fourier series?

I have generated the coefficients using FFTW; now I want to reconstruct the original data, but using only the first numCoefs coefficients rather than all of them. At the moment I'm using the code below, which is very slow:
for ( unsigned int i = 0; i < length; ++i )
{
    double sum = 0;
    for ( unsigned int j = 0; j < numCoefs; ++j )
    {
        sum += ( coefs[j][0] * cos( j * omega * i ) ) + ( coefs[j][1] * sin( j * omega * i ) );
    }
    data[i] = sum;
}
Is there a faster way?
A much simpler solution would be to zero the unwanted coefficients and then do an IFFT with FFTW. This will be a lot more efficient than doing an IDFT as above.
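A minimal sketch of that with FFTW's real-to-complex interface, assuming the coefficients came from a forward r2c transform of `length` real samples (the function and variable names here are mine, not from the question):
#include <fftw3.h>

/* Zero everything above the first numCoefs bins, then inverse-transform.
   FFTW's c2r transform is unnormalized, so divide by length afterwards.
   Note: the c2r transform may overwrite the contents of coefs. */
static void reconstruct(fftw_complex *coefs, double *data,
                        unsigned length, unsigned numCoefs)
{
    unsigned bins = length / 2 + 1; /* size of an r2c spectrum */
    for (unsigned j = numCoefs; j < bins; j++) {
        coefs[j][0] = 0.0; /* real part */
        coefs[j][1] = 0.0; /* imaginary part */
    }
    fftw_plan p = fftw_plan_dft_c2r_1d((int)length, coefs, data, FFTW_ESTIMATE);
    fftw_execute(p);
    fftw_destroy_plan(p);
    for (unsigned i = 0; i < length; i++)
        data[i] /= length; /* undo FFTW's scaling */
}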
Note that you may get some artefacts in the time domain when you do this kind of thing - you're effectively multiplying by a step function in the frequency domain, which is equivalent to convolution with a sinc function in the time domain. To reduce the resulting "ringing" in the time domain you should use a window function to smooth out the transition between non-zero and zero coeffs.
If your numCoefs is anywhere near or greater than log(length), then an IFFT, which is O(n*log(n)) in computational complexity, will most likely be faster, as well as pre-optimized for you. Just zero all the bins except for the coefficients you want to keep, and make sure to also keep their negative-frequency complex conjugates if you want a real result.
If your numCoefs is small relative to log(length), then other optimizations you could try include using sinf() and cosf() if you don't really need more than 6 digits of precision, and pre-calculating omega * i outside the inner loop (although your compiler should be doing that for you unless you have optimization turned down or off).