Data Compression Through Polynomial Equations - vb.net

I am trying to develop a new method of compressing data. The idea is to take, say, a massive set of numbers and fit polynomial functions to the data. I would then store the polynomial and its starting parameters rather than each individual data value, so that at runtime the program opening the file just evaluates the polynomial to recreate the data.
My question: are there any libraries or functions in Visual Basic .NET that will take an array of data such as {0, 2, 4, 6, ..., Large Even} and return a "trendline" for it, much like Excel's chart trendline option?
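To make the "trendline" idea concrete, here is a minimal sketch of a degree-1 least-squares fit. It is written in Python rather than VB.NET purely for illustration; `fit_line` is a made-up name, not a library function, and the x values are assumed to be the array indices 0, 1, 2, ...

```python
def fit_line(ys):
    """Least-squares fit of y = slope * x + intercept, with x = 0, 1, 2, ..."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    # Standard closed-form solution for simple linear regression
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


data = [0, 2, 4, 6, 8, 10]
slope, intercept = fit_line(data)

# For lossless reconstruction, whatever the fit does not explain
# (the residuals) would still have to be stored next to the coefficients.
residuals = [y - round(slope * x + intercept) for x, y in enumerate(data)]
```

For perfectly regular data like the even numbers the residuals are all zero, so two coefficients describe the whole array. For real data the residuals are usually where most of the storage cost ends up.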

Related

Kotlin's Array vs ArrayList vs List for storing large amounts of data

I'm building a Deep Neural Network in Kotlin (I know Python would be better, but I have to do that in Kotlin).
For training the net I need a huge amount of data from the MNIST database, this means I need to read about 60,000 images from a single file in IDX format and store them for simultaneous use.
Every image consists of 784 Bytes. So the total size is:
784*60,000 = 47,040,000 = ~47 MB of training data.
Which ain't that much, since I'm running the JVM in an 8GB RAM env.
After reading an image I need to convert it to a KMatrix, a custom data structure for matrix math operations. Under the hood of a KMatrix there's an Array<Array<Double>>.
I need a structure to store all the images at once, so I'm currently using a List<KMatrix>, which basically translates to a List<Array<Array<Double>>>.
The problem is that while building the List<KMatrix> the JVM throws an OutOfMemoryError: GC overhead limit exceeded.
I wonder if the problem is which data structures I'm using (i.e. should I use an ArrayList instead of an Array?) or maybe how I'm building the entire thing up (i.e. I need some optimization work to do).
I'll put the code, if needed, as soon as I can.
Thanks for your help.
Self-answer with the summarized solution (thanks to the answers by @Tenfour04 and @gidds)
As @Tenfour04 stated, there are basically three alternatives to the Array<Array<Double>> for the KMatrix:
1) an Array<DoubleArray>, which maintains the same logic as the original while saving lots of memory and increasing performance;
2) a 1-dimensional DoubleArray, which saves a bit of extra memory and performance at the cost of the extra complexity of index mapping (the [i;j] element of the matrix is the [i * w + j] element of the array), which probably isn't worth it, as @gidds pointed out;
3) a 1-D DoubleBuffer created with ByteBuffer.allocateDirect(8 * size).asDoubleBuffer(), which improves performance even further but has only get and put methods, so it is useless if you need simple and direct set operations.
Conclusion
I chose option 2, since in my case I'm performing very intensive operations, but in common cases option 1 is probably the best, as it balances complexity and performance.
If you need the highest-performance structure and get/put methods are enough, I'd say that option 3 is what you're looking for.
Hope this helps someone
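For anyone who wants to see the index mapping of option 2 spelled out, here is a rough sketch in Python (not Kotlin). `FlatMatrix` is a hypothetical stand-in for the KMatrix, and the `array` module plays the role of a primitive DoubleArray.

```python
from array import array


class FlatMatrix:
    """Hypothetical h-by-w matrix stored row by row in one array of doubles."""

    def __init__(self, h, w):
        self.h, self.w = h, w
        # One contiguous block of h*w unboxed 8-byte doubles, all 0.0
        self.data = array("d", [0.0]) * (h * w)

    def get(self, i, j):
        # Row-major mapping: element [i;j] lives at flat index i * w + j
        return self.data[i * self.w + j]

    def set(self, i, j, value):
        self.data[i * self.w + j] = value


m = FlatMatrix(28, 28)          # one 28x28 MNIST image
m.set(3, 5, 0.5)

# Raw payload once each pixel is a double instead of a byte:
bytes_per_image = m.data.itemsize * len(m.data)   # 8 * 784
total_bytes = bytes_per_image * 60_000            # before any per-object overhead
```

Note that the 47 MB figure in the question counts one byte per pixel. Stored as 8-byte doubles the raw payload alone is 8 times that, and boxed Double objects add per-object overhead on top.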

VTK / ITK Dice Similarity Coefficient on Meshes

I am new to VTK and am trying to compute the Dice Similarity Coefficient (DSC), starting from 2 meshes.
DSC can be computed as 2 Vab / (Va + Vb), where Vab is the overlapping volume among mesh A and mesh B.
To read a mesh (i.e. an organ contour exported in .vtk format using 3D Slicer, https://www.slicer.org) I use the following snippet:
string inputFilename1 = "organ1.vtk";
// Get all data from the file
vtkSmartPointer<vtkGenericDataObjectReader> reader1 = vtkSmartPointer<vtkGenericDataObjectReader>::New();
reader1->SetFileName(inputFilename1.c_str());
reader1->Update();
vtkSmartPointer<vtkPolyData> struct1 = reader1->GetPolyDataOutput();
I can compute the volume of the two meshes using vtkMassProperties (although I observed some differences between the ones computed with VTK and the ones computed with 3D Slicer).
To then intersect the 2 meshes, I am trying to use vtkIntersectionPolyDataFilter. The output of this filter, however, is a set of lines that marks the intersection of the input vtkPolyData objects, and NOT a closed surface. I therefore need to somehow generate a mesh from these lines and compute its volume.
Do you know a good, accurate way to generate such a mesh, and how to do it?
Alternatively, I tried to use ITK as well. I found a package that is supposed to handle this problem (http://www.insight-journal.org/browse/publication/762, dated 2010) but I am not able to compile it against the latest version of ITK. It says that ITK must be compiled with the (now deprecated) ITK_USE_REVIEW flag ON. Needless to say, I compiled it with the new Module_ITKReview set to ON and also with backward compatibility but had no luck.
Finally, if you know of any alternative (scriptable) software/library to solve this problem, please let me know. I need to perform these computations automatically.
You could try vtkBooleanOperationPolyDataFilter
http://www.vtk.org/doc/nightly/html/classvtkBooleanOperationPolyDataFilter.html
filter->SetOperationToIntersection();
If your data is smooth and well-behaved, this filter works pretty well. However, sharp structures, e.g. the ones originating from running marching cubes on a binary image, can cause problems for it. That said, vtkPolyDataToImageStencil doesn't necessarily perform any better in this regard.
I once had the impression that boolean operations on polygons are not really ideal for "organs" of 100k polygons and more. It depends.
If you want to compute a Dice Similarity Coefficient, I suggest you first generate volumes (rasterize) from the meshes by use of vtkPolyDataToImageStencil.
Then it's easy to compute the DSC.
Good luck :)
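Once both meshes are rasterized onto the same voxel grid, the DSC itself is just a counting exercise. Here is a small sketch in plain Python, with sets of (x, y, z) voxel indices standing in for the binary volumes. The `dice` and `box` helpers are made up for illustration and are not part of VTK.

```python
def dice(a, b):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two sets of voxel indices."""
    if not a and not b:
        return 1.0  # convention chosen here: two empty masks count as identical
    return 2 * len(a & b) / (len(a) + len(b))


def box(x0, x1, y0, y1, z0, z1):
    """All voxels in the half-open box [x0,x1) x [y0,y1) x [z0,z1)."""
    return {(x, y, z)
            for x in range(x0, x1)
            for y in range(y0, y1)
            for z in range(z0, z1)}


a = box(0, 4, 0, 4, 0, 4)   # 64 voxels
b = box(2, 6, 0, 4, 0, 4)   # 64 voxels, sharing 2*4*4 = 32 voxels with a
score = dice(a, b)          # 2*32 / (64 + 64)
```

The voxel size cancels out of the ratio, but it does control how well the rasterized volumes approximate the meshes, so the result depends on the chosen grid spacing.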

Fastest way to apply arithmetic operations to System.Array in IronPython

I would like to add (arithmetics) two large System.Arrays element-wise in IronPython and store the result in the first array like this:
for i in range(arrA.Length):
    # System.Array.SetValue takes the value first, then the index
    arrA.SetValue(arrA.GetValue(i) + arrB.GetValue(i), i)
However, this seems very slow. Having a C background, I would like to use pointers or iterators, but I do not know the fast, idiomatic way to do this in IronPython. I cannot use Python lists, as my objects are strictly of type System.Array. The type is 3d float.
What is the fastest (or at least a fast) way to perform this computation?
Edit:
The number of elements is appr. 256^3.
3d float means that the array can be accessed like this: array.GetValue(indexX, indexY, indexZ). I am not sure how the respective memory is organized in IronPython's System.Array.
Background: I wrote an interface to an IronPython API, which gives access to data in a simulation software tool. I retrieve 3d scalar data and accumulate it to a temporal array in my IronPython script. The accumulation is performed 10,000 times and should be fast, so that the simulation does not take ages.
Is it possible to use the numpy library developed for IronPython?
https://pytools.codeplex.com/wikipage?title=NumPy%20and%20SciPy%20for%20.Net
It appears to be supported, and as far as I know it is as close as you can get in Python to C-style pointer functionality with arrays and such.
Create an array:
import numpy as np
x = np.array([[1, 2, 3], [4, 5, 6]], np.float32)
Multiply all elements by 3.0 in place:
x *= 3.0

Understanding Numpy internals for profiling purposes

Profiling a piece of numpy code shows that I'm spending most of the time within these two functions
numpy/matrixlib/defmatrix.py.__getitem__:301
numpy/matrixlib/defmatrix.py.__array_finalize__:279
Here's the Numpy source:
https://github.com/numpy/numpy/blob/master/numpy/matrixlib/defmatrix.py#L301
https://github.com/numpy/numpy/blob/master/numpy/matrixlib/defmatrix.py#L279
Question #1:
__getitem__ seems to be called every time I'm using something like my_array[arg] and it's getting more expensive if arg is not an integer but a slice. Is there any way to speed up calls to array slices?
E.g. in
for i in range(idx): res[i] = my_array[i:i+10].mean()
Question #2:
When exactly does __array_finalize__ get called and how can I speed up by reducing the number of calls to this function?
Thanks!
You could avoid matrices as much as possible and just use 2d numpy arrays. I typically only use matrices briefly to take advantage of the syntax for multiplication (but with the addition of the .dot method on arrays, I find I do that less and less as well).
But, to your questions:
1) There really is no short-cut to __getitem__ unless defmatrix over-rides __getslice__ which it could do but doesn't yet. There are the .item and .itemset methods which are optimized for integer getting and setting (and return Python objects rather than NumPy's array-scalars)
2) __array_finalize__ is called whenever an array object (or a subclass) is created. It is called from the C-function that every array-creation gets funneled through. https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/ctors.c#L1003
In the case of sub-classes defined purely in Python, it is calling back into the Python interpreter from C which has overhead. If the matrix class were a builtin type (a Cython-based cdef class, for example), then the call could avoid the Python interpreter overhead.
Question 1:
Since array slices can sometimes require a copy of the underlying data structure (holding the pointers to the data in memory) they can be quite expensive. If you're really bottlenecked by this in your above example, you can perform mean operations by actually iterating over the i to i+10 elements and manually creating the mean. For some operations this won't give any performance improvement, but avoiding creating new data structures will generally speed up the process.
Another note: if you're not using native types inside numpy, you will get a very large performance penalty when manipulating a numpy array. Say your array has dtype=float64 and your native machine float size is float32 -- this will cost a lot of extra computation power for numpy and performance overall will drop. Sometimes this is fine and you can just take the hit for maintaining a data type. Other times it's arbitrary what type the float or int is stored as internally. In these cases try dtype=float instead of dtype=float64. Numpy should default to your native type. I've had 3x+ speedups on numpy-intensive algorithms by making this change.
Question 2:
__array_finalize__ "is called whenever the system internally allocates a new array from obj, where obj is a subclass (subtype) of the (big)ndarray" according to SciPy. Thus this is a result described in the first question. When you slice and make a new array, you have to finalize that array by either making structural copies or wrapping the original structure. This operation takes time. Avoiding slices will save on this operation, though for multidimensional data it may be impossible to completely avoid calls to __array_finalize__.
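As a sketch of the "compute the mean by hand instead of slicing" idea, a running prefix sum gives every window mean without creating a slice object per iteration. This is shown with plain Python lists, `window_means` is a made-up helper, and it only returns full-width windows, so it is not a drop-in replacement for the loop in the question. The same idea carries over to arrays via np.cumsum.

```python
from itertools import accumulate


def window_means(values, width):
    """Mean of every full window values[i:i+width], using prefix sums."""
    if width <= 0:
        raise ValueError("width must be positive")
    # prefix[k] holds the sum of the first k values
    prefix = [0.0] + list(accumulate(values))
    return [(prefix[i + width] - prefix[i]) / width
            for i in range(len(values) - width + 1)]


means = window_means([1, 2, 3, 4, 5], 2)
```

One caveat: with floating-point data, subtracting large prefix sums can lose a little precision compared with summing each window directly.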

Compress sorted integers

I'm building an index which is just several sets of ordered 32-bit integers stored contiguously in a binary file. The problem is that this file grows pretty large. I've been thinking of adding some compression scheme, but that's a bit outside my expertise. So I'm wondering, what compression algorithm would work best in this case? Also, decompression has to be fast, since this index will be used to make lookups.
If you are storing integers which are close together (e.g. 1, 3, 4, 5, 9, 10, ...) rather than random 32-bit integers (982346..., 3487623412.., etc.), there is one thing you can do:
Find the differences between adjacent numbers, which would be 2, 1, 1, 4, 1, ... in our example, and then Huffman encode these differences.
I don't think Huffman encoding will work if you apply it directly to the original list of numbers you have.
But if you have a sorted list of nearby numbers, the odds are good that you will get a very good compression ratio by Huffman encoding the differences, maybe a better ratio than with the DEFLATE algorithm (LZ77 plus Huffman coding) used in Zip libraries.
Anyway thanks for posting this interesting question.
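The differencing step described above is tiny. Here is a minimal sketch in Python, where `delta_encode` and `delta_decode` are illustrative names and the first element is kept as its own "difference from zero" so the list can be rebuilt. The Huffman stage is left out.

```python
from collections import Counter
from itertools import accumulate


def delta_encode(sorted_values):
    """Replace each value by its difference from the previous one."""
    prev = 0
    deltas = []
    for v in sorted_values:
        deltas.append(v - prev)
        prev = v
    return deltas


def delta_decode(deltas):
    """Running sum undoes the differencing."""
    return list(accumulate(deltas))


values = [1, 3, 4, 5, 9, 10]
deltas = delta_encode(values)          # first entry is the start value itself
restored = delta_decode(deltas)

# A skewed histogram like this is what makes Huffman coding pay off
histogram = Counter(deltas[1:])
```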
Are the integers grouped in a dense way or a sparse way?
By dense I'm referring to:
[1, 2, 3, 4, 42, 43, 78, 79, 80, 81]
By sparse I'm referring to:
[1, 4, 7, 9, 19, 42, 53, 55, 78, 80]
If the integers are grouped in a dense way you could compress the first vector to hold three ranges:
[(1, 4), (42, 43), (78, 81)]
Which is a 40% compression. Of course this algorithm does not work well on sparse data as the compressed data would take up 100% more space than the original data.
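A quick sketch of that range idea in Python, assuming a sorted list without duplicates (`to_ranges` is just an illustrative name):

```python
def to_ranges(sorted_values):
    """Collapse runs of consecutive integers into (first, last) pairs."""
    ranges = []
    for v in sorted_values:
        if ranges and v == ranges[-1][1] + 1:
            ranges[-1][1] = v          # extend the current run
        else:
            ranges.append([v, v])      # start a new run
    return [tuple(r) for r in ranges]


dense = [1, 2, 3, 4, 42, 43, 78, 79, 80, 81]
sparse = [1, 4, 7, 9, 19, 42, 53, 55, 78, 80]

dense_ranges = to_ranges(dense)     # 3 pairs = 6 integers instead of 10
sparse_ranges = to_ranges(sparse)   # 10 pairs = 20 integers instead of 10
```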
As you've discovered, a sorted sequence of N 32-bit integers doesn't have 32*N bits of data. This is no surprise. Assuming no duplicates, for every sorted sequence there are N! unsorted sequences containing the same integers.
Now, how do you take advantage of the limited information in the sorted sequence? Many compression algorithms base their compression on the use of shorter bitstrings for common input values (Huffman uses only this trick). Several posters have already suggested calculating the differences between numbers, and compressing those differences. They assume it will be a series of small numbers, many of which will be identical. In that case, the difference sequence will be compressed well by most algorithms.
However, take the Fibonacci sequence. That's definitely sorted integers. The difference between F(n) and F(n+1) is F(n-1). Hence, compressing the sequence of differences is equivalent to compressing the sequence itself - it doesn't help at all!
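This is easy to check numerically. A small Python sketch, using the strictly increasing run 1, 2, 3, 5, 8, ... so the list is properly sorted:

```python
# Build the first ten terms of 1, 2, 3, 5, 8, ...
seq = [1, 2]
while len(seq) < 10:
    seq.append(seq[-1] + seq[-2])

# Differences between neighbouring terms
deltas = [b - a for a, b in zip(seq, seq[1:])]

# Apart from the first difference, the deltas are the sequence itself, shifted
same_as_original = deltas[1:] == seq[:-2]
```

So the differenced data has essentially the same values as the input, and differencing alone gains nothing here.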
So, what we really need is a statistical model of your input data. Given the sequence N[0]...N[x], what is the probability distribution of N[x+1]? We know that P(N[x+1] < N[x]) = 0, as the sequence is sorted. The differential/Huffman-based solutions presented work because they assume P(N[x+1] - N[x] = d) is quite high for small positive d and independent of x, so they can use a few bits for the small differences. If you can give another model, you can optimize for that.
If you need fast random-access lookup, then a Huffman-encoding of the differences (as suggested by Niyaz) is only half the story. You will probably also need some sort of paging/indexing scheme so that it is easy to extract the nth number.
If you don't do this, then extracting the nth number is an O(n) operation, as you have to read and Huffman decode half the file before you can find the number you were after. You have to choose the page size carefully to balance the overhead of storing page offsets against the speed of lookup.
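To illustrate the paging idea, here is a rough Python sketch where each page keeps its first value in the clear plus the deltas inside the page. The Huffman step is left out, and `build_pages` / `nth` are made-up names. A lookup then only has to decode one page instead of everything before the target.

```python
def build_pages(sorted_values, page_size):
    """Split into pages; each page stores its first value plus in-page deltas."""
    pages = []
    for start in range(0, len(sorted_values), page_size):
        chunk = sorted_values[start:start + page_size]
        deltas = [b - a for a, b in zip(chunk, chunk[1:])]
        pages.append((chunk[0], deltas))
    return pages


def nth(pages, page_size, n):
    """Recover the n-th value by decoding at most one page."""
    first, deltas = pages[n // page_size]
    return first + sum(deltas[: n % page_size])


values = [1, 3, 4, 5, 9, 10, 17, 18, 30, 31, 40]
pages = build_pages(values, page_size=4)
lookups = [nth(pages, 4, i) for i in range(len(values))]
```

Smaller pages mean less decoding per lookup but more uncompressed first values, which is the trade-off mentioned above.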
MSalters' answer is interesting but might distract you if you don't analyze it properly. There are only 47 Fibonacci numbers that fit in 32 bits.
But he is spot on on how to properly solve the problem by analyzing the series of increments to find patterns there to compress.
Things that matter: a) Are there repeated values? If so, how often? (if important, make it part of the compression, if not make it an exception.) b) Does it look quasi-random? This also can be good as a suitable average increment can likely be found.
The conditions on the lists of integers are slightly different, but the question "Compression for a unique stream of data" suggests several approaches which could help you.
I'd suggest prefiltering the data into a start value and a series of offsets. If you know that the offsets will reliably be small, you could even encode them as 1- or 2-byte quantities instead of 4 bytes. If you don't know this, each offset could still be 4 bytes, but since they will be small diffs, you'll get many more repeats than you would storing the original integers.
After prefiltering, run your output through the compression scheme of your choice - something that works on a byte level, like gzip or zlib, would probably do a really nice job.
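Here is a hedged sketch of that two-stage pipeline in Python, using only the standard library. The input is synthetic (sorted integers with random gaps of 1 to 10), so the sizes say nothing about real data; the helper names are made up.

```python
import random
import struct
import zlib
from itertools import accumulate


def pack_u32(values):
    """Pack integers as little-endian unsigned 32-bit values."""
    return struct.pack("<%dI" % len(values), *values)


def unpack_u32(blob):
    return list(struct.unpack("<%dI" % (len(blob) // 4), blob))


# Synthetic sorted data: made up for this example only
rng = random.Random(0)
values = list(accumulate(rng.randint(1, 10) for _ in range(5000)))

# Prefilter: keep the start value, then store offsets to the previous value
deltas = [values[0]] + [b - a for a, b in zip(values, values[1:])]

raw_size = len(zlib.compress(pack_u32(values)))
compressed = zlib.compress(pack_u32(deltas))
filtered_size = len(compressed)

# Decoding is the reverse: decompress, unpack, running sum
restored = list(accumulate(unpack_u32(zlib.decompress(compressed))))
```

On this kind of input the prefiltered stream compresses noticeably better than the raw integers, because the offsets repeat while the original values never do.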
I would imagine Huffman coding would be quite appropriate for this purpose (and relatively quick compared to other algorithms with similar compression ratios).
EDIT: My answer was only a general pointer. Niyaz's suggestion of encoding the differences between consecutive numbers is a good one. (However if the list is not ordered or the spacing of numbers is very irregular, I think it would be no less effective to use plain Huffman encoding. In fact LZW or similar would likely be best in this case, though possibly still not very good.)
I'd use something bog standard off the shelf before investing in your own scheme.
In Java for example you can use GZIPOutputStream to apply gzip compression.
Maybe you could store the differences between consecutive 32-bit integers as 16-bit integers.
A reliable and effective solution is to apply Quantile Compression (https://github.com/mwlon/quantile-compression/). Quantile Compression automatically takes deltas if appropriate, then gets close to the Shannon Entropy of a smooth distribution of those deltas. Regardless of how many repeated numbers or widely spread numbers you have, it will get you close to optimum.