What's the best way to store a two-dimensional sparse array (2D sparse matrix), and how much space will it take in VoltDB?

Question 1: Are there specialized databases for storing dense and sparse matrices? I googled but didn't find any...
The matrix in question is huge (10^5 by 10^5) but it's sparse, which means that most of its values are zeros and I only need to store the non-zero values. So I thought of making a table like this:
2D Matrix
-----------------
 X   Y    val
-----------------
 1   2    4.2
 5   1   91.0
 9   3  139.1
And so on: 3 columns, two for the coordinates and the third for the value of that cell in the sparse matrix. Question 2: Is this the best way to store a sparse matrix? I also considered MongoDB, but it seems that making one document per cell of the matrix would be too much overhead. Table-oriented databases are slow, but I can use VoltDB :) Side note: I thought of a Redis Hash but can't make it two-dimensional (I found a way to serialize 2D matrices into 1D, so that I can store them in a Redis Hash or even a List).
Question 3: How many bytes per row will VoltDB use? The coordinates will be integers ranging from 0 to 10^5, maybe more; the values of the cells will be floats.

Regarding Question 3, based on your example, the X and Y columns could be the INTEGER datatype in VoltDB, which is 4 bytes. The value column could be a FLOAT datatype, which is 8 bytes.
Each record would therefore be 16 bytes, so the nominal size in memory would be 16 bytes * row count. In general, you add 30% for overhead, and then 1GB per server for heap size to determine the overall memory needed. See the references below for more detail.
You will probably want to index this table, so assuming you wanted a compound index of (x,y), the size would be as follows:
Tree index:
(sum-of-column-sizes + 8 + 32) * rowcount
Hash index:
(((2 * rowcount) + 1) * 8) + ((sum-of-column-sizes + 32) * rowcount)
sum-of-column-sizes for (x,y) is 8 bytes.
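To make the arithmetic concrete, here is a rough sketch in C that plugs the numbers above into those formulas, with a compound tree index on (x,y). The row count of 10^7 non-zero cells is my own example, not from the question:

#include <stdio.h>

int main(void) {
    /* Assumed row count: 10^7 non-zero cells (hypothetical example). */
    const double rows = 1e7;

    /* Two INTEGER columns (4 bytes each) + one FLOAT column (8 bytes). */
    const double row_bytes = 4 + 4 + 8;              /* 16 bytes per row   */
    const double table_bytes = row_bytes * rows;     /* nominal table size */

    /* Compound index on (x, y): sum-of-column-sizes = 8 bytes. */
    const double key_bytes = 4 + 4;
    const double tree_index = (key_bytes + 8 + 32) * rows;
    const double hash_index = ((2 * rows + 1) * 8) + ((key_bytes + 32) * rows);

    /* Rule of thumb from the Planning Guide: add 30% overhead,
     * plus roughly 1 GB per server for heap. */
    const double total = (table_bytes + tree_index) * 1.30 + 1e9;

    printf("table       : %.1f MB\n", table_bytes / 1e6);
    printf("tree index  : %.1f MB\n", tree_index / 1e6);
    printf("hash index  : %.1f MB\n", hash_index / 1e6);
    printf("rough total : %.2f GB (table + tree index + 30%% + 1 GB heap)\n",
           total / 1e9);
    return 0;
}

For 10^7 rows this works out to roughly 160 MB for the table and about 480 MB for the tree index before the overhead factors are applied.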
References:
The available datatypes are listed in Appendix A of Using VoltDB:
http://community.voltdb.com/docs/UsingVoltDB/ddlref_createtable#TabDatatypes
Guidelines and formulas for estimating the memory size are in the VoltDB Planning Guide:
http://community.voltdb.com/docs/PlanningGuide/ChapMemoryRecs

The two most relevant questions are 1) how sparse? and 2) how do you want to use the data?
First, I'm assuming you want to read/write/process the data from within the database. If not, then there are many sparse matrix encodings that could even be packed into a blob and optionally compressed.
Assuming your data is fairly sparse, and assuming you want to use the data within the database, the x,y,value tuple storage is probably best. Chapter 4 of the VoltDB Planning Guide covers estimating memory usage and sizing your hardware.
http://community.voltdb.com/docs/PlanningGuide/ChapMemoryRecs
The short answer is that tables with numeric data are packed pretty tight. If the coordinates fit in SMALLINT, you've got 12 bytes of real data per row (short, short, double); note that SMALLINT tops out at 32,767, so coordinates up to 10^5 would need INTEGER and 16 bytes per row. You should see an average of just over 1 byte beyond that in overhead per row. You'll also need to add in the size of any indexes. The documentation describes the worst case; I think for an index on two small integer columns such as X and Y, the storage per key will be close to 28 bytes, including overhead.

Related

Get minimum Euclidean distance between a given vector and vectors in the database

I store 128D vectors in a PostgreSQL table as double precision[]:
create table tab (
  id integer,
  name character varying(200),
  vector double precision[]
)
For a given vector, I need to return one record from the database with the minimum Euclidean distance between this vector and the vector in the table entry.
I have a function that computes the Euclidean distance of two vectors according to the known formula sqrt((v1[1]-v2[1])^2 + (v1[2]-v2[2])^2 + ... + (v1[128]-v2[128])^2):
CREATE OR REPLACE FUNCTION public.euclidian(
  arr1 double precision[],
  arr2 double precision[])
  RETURNS double precision AS
$BODY$
  SELECT sqrt(SUM(tab.v)) AS euclidian
  FROM (SELECT UNNEST(vec_sub(arr1, arr2)) AS v) AS tab;
$BODY$
LANGUAGE sql IMMUTABLE STRICT;
Ancillary function:
CREATE OR REPLACE FUNCTION public.vec_sub(
  arr1 double precision[],
  arr2 double precision[])
RETURNS double precision[] AS
$BODY$
  SELECT array_agg(result)
  FROM (SELECT (tuple.val1 - tuple.val2) * (tuple.val1 - tuple.val2) AS result
        FROM (SELECT UNNEST($1) AS val1,
                     UNNEST($2) AS val2,
                     generate_subscripts($1, 1) AS ix) tuple
        ORDER BY ix) inn;
$BODY$
LANGUAGE sql IMMUTABLE STRICT;
Query:
select tab.id as tabid, tab.name as tabname,
       euclidian('{0.1,0.2,...,0.128}', tab.vector) as eucl
from tab
order by eucl ASC
limit 1;
Everything works fine while I have only several thousand records in tab. But the DB is going to grow, and I need to avoid a full scan of tab when running the query by adding some kind of index search. It would be great to filter out at least 80% of the records via an index; the remaining 20% can be handled by a full scan.
One of the current directions of the search for a solution: the PostGIS extension allows searching and sorting by distance (ST_3DDistance), filtering by distance (ST_3DWithin), etc. This works great and fast using indices. Is it possible to generalize this to N-dimensional space?
Some observations:
all coordinate values are between [-0.5, 0.5] (I do not know exactly; I think [-1.0, 1.0] are the theoretical limits)
the vectors are not normalized; the distance from (0,0,...,0) is in the range [1.2, 1.6].
This is a post translated from the Russian Stack Exchange.
Like @SaiBot hints at with locality-sensitive hashing (LSH), there are plenty of well-researched techniques that allow you to run approximate nearest neighbor (ANN) searches. You have to accept a speed/accuracy tradeoff, but this is reasonable for most production scenarios, since a brute-force approach to finding the exact neighbors tends to be computationally prohibitive.
This article is an excellent overview of the current state-of-the-art algorithms along with their pros and cons. Below, I've linked several popular open-source implementations. All 3 have Python bindings.
Facebook FAISS
Spotify Annoy
Google ScaNN
With 128-dimensional data constrained to PostgreSQL, you will have no choice but to apply a full scan for each query.
Even highly optimized index structures for indexing high-dimensional data like the X-Tree or the IQ-Tree will have problems with that many dimensions and usually offer no big benefit over the pure scan.
The main issue here is the curse of dimensionality that will let index structures degenerate above 20ish dimensions.
Newer work thus considers the problem of approximate nearest neighbor search, since in a lot of applications with this many dimensions it is sufficient to find a good answer, rather than the best one. Locality Sensitive Hashing is among these approaches.
Note: Even if an index structure is able to filter out 80% of the records, you will have to access the remaining 20% of the records by random access operations from disk (making the application I/O bound), which will be even slower than reading all the data in one scan and computing the distances.
You could look at computational geometry, a field dedicated to efficient algorithms. There is generally a tradeoff between the amount of data stored and the running time of an algorithm, so by storing some extra data we can speed up the queries. In particular, we are looking at nearest-neighbour search; the algorithm below uses a form of space partitioning.
Let's consider the 3D case. As the distances from the origin lie in a narrow range, the points are clustered around a fuzzy sphere. Divide the space into 8 sub-cubes (octants) depending on the sign of each coordinate and label these +++, ++-, etc. We can work out the minimum distance from the test vector to any vector in each cube.
Say our test vector is (0.4, 0.5, 0.6). The minimum distance from it to the +++ cube is zero. The minimum distance to the -++ cube is 0.4, as the closest vector in the -++ cube would be (-0.0001, 0.5, 0.6). Likewise, the minimum distance to +-+ is 0.5, to ++- it is 0.6, to --+ it is sqrt(0.4^2 + 0.5^2), etc.
The algorithm then becomes: first search the cube the test vector is in and find the minimum distance to all the vectors in that cube. If that distance is smaller than the minimum distance to the other cubes, then we are done. If not, search the next closest cube, until no vector in any other cube could be closer.
If we were to implement this in a database, we would compute a key for each vector. In 3D this is a 3-bit integer with 0 or 1 in each bit depending on the sign of the corresponding coordinate: + gives 0, - gives 1. So first SELECT ... WHERE key = 000, then WHERE key = 100, etc.
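To illustrate, here is a minimal sketch in C of the key and minimum-distance computations for the 3D case (the function names are mine, just for illustration):

#include <math.h>
#include <stdio.h>

/* Key: one bit per coordinate, 0 for '+' (>= 0), 1 for '-'. */
static unsigned octant_key(const double v[3]) {
    unsigned key = 0;
    for (int i = 0; i < 3; i++)
        if (v[i] < 0.0) key |= 1u << i;
    return key;
}

/* Minimum possible distance from the test vector to any point in the
 * octant identified by 'key': only coordinates whose sign disagrees
 * with the octant contribute. */
static double min_dist_to_octant(const double v[3], unsigned key) {
    double sum = 0.0;
    for (int i = 0; i < 3; i++) {
        int octant_negative = (key >> i) & 1u;
        int vector_negative = v[i] < 0.0;
        if (octant_negative != vector_negative)
            sum += v[i] * v[i];
    }
    return sqrt(sum);
}

int main(void) {
    const double test[3] = {0.4, 0.5, 0.6};
    printf("test vector is in octant %u\n", octant_key(test));
    for (unsigned key = 0; key < 8; key++)
        printf("octant %u%u%u: min distance %.4f\n",
               key & 1u, (key >> 1) & 1u, (key >> 2) & 1u,
               min_dist_to_octant(test, key));
    return 0;
}

For the example vector above this reproduces the distances given earlier: 0 for +++, 0.4 for -++, 0.5 for +-+, and so on.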
You can think of this as a type of hash function which has been specifically designed to make finding close points easy. I think this is called locality-sensitive hashing.
The high dimensionality of your data makes things much trickier. With 128 dimensions, just using the signs of the coordinates gives 2^128 = 3.4E+38 possibilities. This is far too many hash buckets, and some form of dimensionality reduction is needed.
You might be able to choose k points and partition space according to which of those each vector is closest to.
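A sketch of that last idea in C, bucketing each vector by its nearest of k anchor points (the anchor points would have to be chosen from your own data, e.g. by a clustering pass; all names here are hypothetical):

#include <stddef.h>
#include <stdio.h>

#define DIM 128

/* Squared Euclidean distance between two DIM-dimensional vectors. */
static double sq_dist(const double *a, const double *b) {
    double sum = 0.0;
    for (size_t i = 0; i < DIM; i++) {
        double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

/* Bucket id = index of the nearest of the k anchor points. */
static size_t bucket_id(const double *v, const double *anchors, size_t k) {
    size_t best = 0;
    double best_d = sq_dist(v, &anchors[0]);
    for (size_t j = 1; j < k; j++) {
        double d = sq_dist(v, &anchors[j * DIM]);
        if (d < best_d) { best_d = d; best = j; }
    }
    return best;
}

int main(void) {
    static double anchors[2 * DIM];   /* two placeholder anchors: origin and e_0 */
    anchors[1 * DIM + 0] = 1.0;

    static double v[DIM];
    v[0] = 0.9;                       /* closer to the second anchor */
    printf("bucket = %zu\n", bucket_id(v, anchors, 2));
    return 0;
}

The bucket id can then live in an ordinary indexed integer column, so a WHERE clause prunes most rows before the exact distance function runs on the remainder.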

Is there a CRC or cryptographic function for generating smaller unique results from unique inputs?

I have a manufacturer-unique 128-bit ID that I cannot change, and its size is just too large for our purpose (2^128 possible values). This is on an embedded microcontroller.
One idea is to compute a (run-time) CRC32 or hash to narrow the values down, but I am not sure about uniqueness; CRC32, for example, can only be unique for 2^32 values.
Or what kind of cryptographic function can I use to guarantee uniqueness of a 32-bit output based on a unique input?
Thanks for any clarification.
If you know all of these ID values in advance, then you can check them using a hash table. You can save space by storing only as many bits of each hash value as are necessary to tell them apart if they happen to land in the same bucket.
If not, then you're going to have a hard time, I'm afraid.
Let's assume these 128-bit IDs are produced as the output of a cryptographic hash function (e.g., MD5), so each ID resembles 128 bits chosen uniformly at random.
If you reduce these to 32-bit values, then the best you can hope to achieve is a set of 32-bit numbers where each bit is 0 or 1 with uniform probability. You could do this by calculating the CRC32 checksum, or by simply discarding 96 bits — it makes no difference.
32 bits is not enough to avoid collisions. The collision probability exceeds 1 in a million after just 93 inputs, and 1 in a thousand after 2,900 inputs. After 77,000 inputs, the collision probability reaches 50%. (Source).
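Those figures follow from the usual birthday-bound approximation p ≈ 1 - exp(-n(n-1)/2^33); a quick sanity check in C (my own sketch, not from the linked source):

#include <math.h>
#include <stdio.h>

/* Approximate probability of at least one collision among n uniformly
 * random 32-bit values: p ~= 1 - exp(-n*(n-1) / (2 * 2^32)). */
static double collision_probability(double n) {
    return 1.0 - exp(-n * (n - 1.0) / (2.0 * 4294967296.0));
}

int main(void) {
    printf("n =    93: p = %.2e\n", collision_probability(93));     /* ~1e-6 */
    printf("n =  2900: p = %.2e\n", collision_probability(2900));   /* ~1e-3 */
    printf("n = 77000: p = %.2f\n", collision_probability(77000));  /* ~0.50 */
    return 0;
}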
So instead, your only real options are to somehow reverse-engineer the ID values into something smaller, or implement some external means of replacing these IDs with sequential integers (e.g., using a hash table).

Mutable array (Objective-C) or variable array (C): any difference in performance?

I have a multidimensional array (3D matrix) of unknown size, where each element in this matrix is of type short int.
The size of the matrix can be approximated to be around 10 x 10 x 1,000,000.
As I see it I have two options: a mutable array (Objective-C) or a variable array (C).
Is there any difference in reading from and writing to these arrays?
How large will these files become when I save to file?
Any advice would be gratefully accepted.
Provided you know the size of the array at the point of creation, i.e. you don't need to dynamically change the bounds, then a C array of short int with these dimensions will win easily - for reasons such as no encoding of values as objects and direct indexing.
If you write the array in binary to a file then it will just be the number of elements multiplied by sizeof(short int), without any overhead. If you also need to store the dimensions, that is an extra 3 * sizeof(int): 12 or 24 bytes depending on the integer size used.
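As a rough sketch of that binary layout in C (the dimensions and file name are just the example figures from the question):

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    /* Hypothetical dimensions from the question: 10 x 10 x 1,000,000. */
    const int dims[3] = {10, 10, 1000000};
    const size_t count = (size_t)dims[0] * dims[1] * dims[2];

    short *data = calloc(count, sizeof *data);   /* ~200 MB of shorts */
    if (!data) return 1;

    FILE *f = fopen("matrix.bin", "wb");
    if (!f) { free(data); return 1; }

    /* 3 * sizeof(int) bytes for the dimensions, then the raw elements:
     * file size = 12 + count * sizeof(short), with no per-element overhead. */
    fwrite(dims, sizeof(int), 3, f);
    fwrite(data, sizeof(short), count, f);

    fclose(f);
    free(data);
    return 0;
}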
The mutable array will be slower (albeit not by much) since it's built on a C array. How large will the file be when you save this array?
It will take more than 10 x 10 x 1,000,000 bytes because you'll have to encode it in a way that lets you reconstruct the matrix. This part is really up to you. For a 3D array you'll need some special character/format to denote the structure. It depends on how you want to do this, but a plain text encoding will take 1 byte for every digit of every number, plus 1 character for the separator between elements in the same row, plus 1 newline per row of the 2nd dimension (times n), plus 1 more separator character for the 3rd dimension (times n*n).
It might be easier to stick each row into its own file, and then stack the columns below it like you normally would. Then in a new file, I would start putting the 3D elements such that each line lines up with the column number of the 2nd dimension. That's just me though; it's up to you.

How would you most efficiently store latitude and longitude data?

This question comes from a homework assignment I was given. You can base your storage system off of one of the three following formats:
DD MM SS.S
DD MM.MMM
DD.DDDDD
You want to maximize the amount of data you can store by using as few bytes as possible.
My solution is based off the first format. I used 3 bytes for latitude: 8 bits for the DD (-90 to 90), 6 bits for the MM (0-59), and 10 bits for the SS.S (0-59.9). I then used 25 bits for the longitude: 9 bits for the DDD (-180 to 180), 6 bits for the MM, and 10 for the SS.S. This solution doesn't fit nicely on a byte border, but I figured the next reading can be stored immediately following the previous one, and 8 readings would use only 49 bytes.
I'm curious what methods others can come up. Is there a more efficient method to storing this data? As a note, I considered an offset based storage, but the problem gave no indication of how much the values may change between readings, so I'm assuming any change is possible.
Your suggested method is not optimal. You are using 10 bits (1024 possible values) to store a value in the range (0..599). This is a waste of space.
If you use 3 bytes for latitude, you should map the range [0, 2^24-1] to the range [-90, 90]. Each of the 2^24 values then represents 180/2^24 degrees, which is about 0.0386 seconds.
If you want only 0.1 second accuracy, you'll need 23 bits for latitudes and 24 bits for longitudes (you'll get 0.077 second accuracy). That's 47 bits total instead of your 49 bits, with better accuracy.
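A small sketch in C of that fixed-point mapping (23-bit latitude, 24-bit longitude; the helper names and sample coordinates are mine):

#include <stdint.h>
#include <stdio.h>

/* Map latitude [-90, 90] onto [0, 2^23 - 1] and back.
 * Resolution: 180 / 2^23 degrees ~= 0.077 arc-seconds. */
static uint32_t encode_lat(double lat) {
    return (uint32_t)((lat + 90.0) / 180.0 * ((1 << 23) - 1) + 0.5);
}
static double decode_lat(uint32_t code) {
    return code * 180.0 / ((1 << 23) - 1) - 90.0;
}

/* Map longitude [-180, 180] onto [0, 2^24 - 1] and back. */
static uint32_t encode_lon(double lon) {
    return (uint32_t)((lon + 180.0) / 360.0 * ((1 << 24) - 1) + 0.5);
}
static double decode_lon(uint32_t code) {
    return code * 360.0 / ((1 << 24) - 1) - 180.0;
}

int main(void) {
    double lat = 51.477928, lon = -0.001545;   /* example coordinates */
    printf("lat: %.6f -> %u -> %.6f\n",
           lat, encode_lat(lat), decode_lat(encode_lat(lat)));
    printf("lon: %.6f -> %u -> %.6f\n",
           lon, encode_lon(lon), decode_lon(encode_lon(lon)));
    return 0;
}

A real record would of course pack the 23 + 24 = 47 bits together rather than hold each code in its own 32-bit integer.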
Can we do even better?
The exact number of bits needed for 0.1 second accuracy is log2(180*60*60*10 * 360*60*60*10) < 46.256. This means that you can use 46,256 bits (5,782 bytes) to store 1000 (lat, lon) pairs, but the mathematics involved will require dealing with very large integers.
Can we do even better?
It depends. If your data set has concentrations, you can store only some points and relative distances from these points, using fewer bits. Clustering algorithms should be used.
Sticking to existing technology:
If you used half-precision floating point numbers to store only the DD.DDDDD data, you could be a lot more space-efficient, but you'd have to accept the format's limited precision: the coordinates stored might not be exact, only close approximations of the original values.
This is due to the way floating point numbers are stored: essentially, a normalized significand is scaled by a power of two (the exponent) to produce the number, instead of every value being represented exactly (as with the integer encoding you used when calculating the numbers for your solution).
The next highest commonly used floating point number mechanism uses 32 bits (the type "float" in many programming languages) - still efficient, but larger than your custom format.
If, however, you designed your own custom floating point type and gradually added more bits, your results would become more exact while STILL being more efficient than the solution you first found. Just play around with the number of bits used for the significand and the exponent, and find out how close your floating point approximations come to the desired result in degrees!
Well, if this is for a large number of readings, then you may try a differential approach. Start with an absolute location and then save incremental changes, which should ideally require fewer bits, depending on the nature of the changes. This is effectively compressing the stream. But somehow I don't think that's what this homework is about.

Shared Memory Bank Conflicts in CUDA: How memory is aligned to banks

As far as my understanding goes, shared memory is divided into banks and accesses by multiple threads to a single data element within the same bank will cause a conflict (or broadcast).
At the moment I allocate a fairly large array which conceptually represents several pairs of two matrices:
__shared__ float A[34*N];
Where N is the number of pairs and the first 16 floats of a pair are one matrix and the following 18 floats are the second.
The thing is, access to the first matrix is conflict-free, but access to the second one has conflicts. These conflicts are unavoidable; however, my thinking is that because the second matrix has 18 elements, all subsequent matrices will be misaligned with respect to the banks and therefore more conflicts than necessary will occur.
Is this true, and if so, how can I avoid it?
Every time I allocate shared memory, does it start at a new bank? So could I potentially do
__shared__ float Apair1[34];
__shared__ float Apair2[34];
...
Any ideas?
Thanks
If your pairs of matrices are stored contiguously, and if you are accessing the elements linearly by thread index, then you will not have shared memory bank conflicts.
In other words if you have:
A[0]  <- mat1 element1
A[1]  <- mat1 element2
A[2]  <- mat1 element3
...
A[15] <- mat1 element16
A[16] <- mat2 element1
A[17] <- mat2 element2
...
A[33] <- mat2 element18
And you access this using:
float element;
element = A[pairindex * 34 + matindex * 16 + threadIdx.x];
Then adjacent threads are accessing adjacent elements in the matrix and you do not have conflicts.
In response to your comments (below) it does seem that you are mistaken in your understanding. It is true that there are 16 banks (in current generations, 32 in the next generation, Fermi) but consecutive 32-bit words reside in consecutive banks, i.e. the address space is interleaved across the banks. This means that provided you always have an array index that can be decomposed to x + threadIdx.x (where x is not dependent on threadIdx.x, or at least is constant across groups of 16 threads) you will not have bank conflicts.
When you access the matrices further along the array, you still access them in a contiguous chunk and hence you will not have bank conflicts. It is only when you start accessing non-adjacent elements that you will have bank conflicts.
The reduction sample in the SDK illustrates bank conflicts very well by building from a naive implementation to an optimised implementation, possibly worth taking a look.
Banks are set up such that each successive 32-bit word is in the next bank. So if you declare an array of 4-byte floats, each subsequent float in the array will be in the next bank (modulo 16 or 32, depending on your architecture). I'll assume you're on compute capability 1.x, so you have 16 banks.
If you have arrays of 18 and 16, things can be funny. You can avoid bank conflicts in the 16x16 array by declaring it like
__shared__ float sixteen[16][16+1];
which avoids bank conflicts when accessing transposed elements using threadIdx.x (as I assume you're doing if you're getting conflicts). When accessing elements in, say, the first column of an unpadded 16x16 matrix, they'll all reside in the 1st bank. What you want is to have each of them in a successive bank, and the padding does this for you. You treat the array exactly as you would before, as sixteen[row][column], or similarly for a flattened matrix, as sixteen[row*(16+1)+column], if you want.
For the 18x18 case, when accessing in the transpose, you're moving at an even stride. The answer again is to pad by 1.
__shared__ float eighteens[18][18+1];
So now, when you access in the transpose (say, elements in the first column), consecutive rows are (18+1) % 16 = 3 banks apart, so you'll hit banks 0, 3, 6, 9, 12, 15, 2, 5, 8 and so on; since 3 and 16 share no common factor, all 16 banks get used before any repeat, so you should get no conflicts.
The particular alignment shift due to having a matrix of size 18 isn't the problem, because the starting point of the array makes no difference; it's only the order in which you access it that matters. If you want to flatten the arrays I've proposed above and merge them into one, that's fine, as long as you access them in a similar fashion.
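A quick way to convince yourself is to print which bank each row of a column access lands in, for the unpadded and padded strides (plain C arithmetic, assuming the 16-bank, 32-bit-word layout of compute capability 1.x described above):

#include <stdio.h>

#define NUM_BANKS 16   /* compute capability 1.x; 32 on later hardware */

/* Bank of the 4-byte word at float index 'index' in shared memory. */
static int bank_of(int index) { return index % NUM_BANKS; }

static void show(const char *label, int stride) {
    printf("%s (row stride %d):", label, stride);
    for (int row = 0; row < 16; row++)            /* 16 threads, one per row */
        printf(" %d", bank_of(row * stride));     /* all reading column 0    */
    printf("\n");
}

int main(void) {
    show("unpadded 18x18", 18);   /* banks repeat -> 2-way conflicts    */
    show("padded   18x19", 19);   /* 16 distinct banks -> no conflicts  */
    return 0;
}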