Best Scala collection type for vectorized numerical computing

I'm looking for the proper data type (such as IndexedSeq[Double]) to use when designing a domain-specific numerical computing library. For this question, I'm limiting the scope to working with one-dimensional arrays of Double. The library will define a number of functions that are typically applied to each element of the 1D array.
Considerations:
Prefer immutable data types, such as Vector or IndexedSeq
Want to minimize data conversions
Reasonably efficient in space and time
Friendly for other people using the library
Elegant and clean API
Should I use something higher up the collections hierarchy, such as Seq?
Or is it better to just define the single-element functions and leave the mapping/iterating to the end user?
This seems less efficient (since some computations could be done once per set of calls), but at the same time it gives a more flexible API, since it would work with any type of collection.
Any recommendations?

If your computations are to do anything remotely computationally intensive, use Array, either raw or wrapped in your own classes. You can provide a collection-compatible wrapper, but make that an explicit wrapper for interoperability only. Everything other than Array is generic and thus boxed and thus comparatively slow and bulky.
If you do not use Array, people will be forced to abandon whatever things you have and just use Array instead when performance matters. Maybe that's okay; maybe you want the computations to be there for convenience not efficiency. In that case, I suggest using IndexedSeq for the interface, assuming that you want to let people know that indexing is not outrageously slow (e.g. is not List), and use Vector under the hood. You will use about 4x more memory than Array[Double], and be 3-10x slower for most low-effort operations (e.g. multiplication).
For example, this:
val u = v.map(1.0 / _) // v is Vector[Double]
is about three times slower than this:
val u = new Array[Double](v.length)
var j = 0
while (j < u.length) {
  u(j) = 1.0 / v(j) // v is Array[Double]
  j += 1
}
If you use the map method on Array, it's just as slow as the Vector[Double] way; operations on Array are generic and hence boxed. (And that's where the majority of the penalty comes from.)
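The primitive-versus-boxed distinction described above can also be seen from Java, which runs on the same JVM (a Scala Array[Double] is a JVM double[]). The following is only a minimal sketch: the class name Reciprocals and both method names are made up for illustration, and it makes no timing claims of its own.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper class, for illustration only.
final class Reciprocals {
    // Works directly on a primitive double[]; no Double objects are allocated per element.
    static double[] reciprocal(double[] v) {
        double[] u = new double[v.length];
        for (int j = 0; j < v.length; j++) {
            u[j] = 1.0 / v[j];
        }
        return u;
    }

    // Works on a generic collection; every element is a boxed java.lang.Double.
    static List<Double> reciprocalBoxed(List<Double> v) {
        return v.stream().map(x -> 1.0 / x).collect(Collectors.toList());
    }
}
```

Both methods compute the same values; the difference is only in how the elements are represented in memory.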

I am using Vectors all the time when I deal with numerical values, since it provides very efficient random access as well as append/prepend.
Also notice that the current default collection for immutable indexed sequences is Vector, so if you write code like for (i <- 0 until n) yield {...}, it returns IndexedSeq[...] but the runtime type is Vector. So it may be a good idea to always use Vectors, since some binary operators that take two sequences as input may benefit from the two arguments being of the same implementation type. (This is not really the case now, but someone has pointed out that vector concatenation could run in log(N) time, as opposed to the current linear time, which is due to the second parameter simply being treated as a general sequence.)
Nevertheless, I believe that Seq[Double] should already provide most of the function interfaces you need. And since mapping results from Range does not yield Vector directly, I usually put Seq[Double] as the argument type as my input, so that it has some generality. I would expect that efficiency is optimized in the underlying implementation.
Hope that helps.

Related

Why is Kotlin's Number-class missing operators?

In Kotlin, the Number type sounds quite useful: A type to use whenever I need something numeric.
When actually using it, however, I quickly noticed it is pretty useless: I cannot use any operators on these numbers. As soon as I need to do something with them, I need to explicitly convert them (even for comparing).
Why did the language designers choose to not include operators in the Number specification?
Thinking on this, I noticed it could be tricky to implement Number.plus(n: Number): Number, because n might be of a different type than this.
On the other hand, such implementations do exist in all Number subtypes I checked. And of course they are necessary if I want to type 1 + 1.2, which calls Int.plus(d: Double): Double
The result for me is that I have to call .toDouble() every time I use a number. This makes the code hard to read (compare a.toDouble() < b.toDouble() with a < b).
Is there any technical reason why operators were omitted from Number?

The problem is the implementation of the compareTo method. While it sounds reasonable and easy to add it in the first place, the devil lies in the details:
How would you compare instances of arbitrary Number classes to each other? Kotlin could implement the compare method using toDouble(); however this has problems with equality/precision: How do you compare a BigDecimal to a Double? Using toDouble() on the BigDecimal might lose precision, and two (actually different) BigDecimals might be considered equal using this method.
The mess gets even worse when you start to assume one or both types were supplied by libraries, where you cannot make assumptions on precision etc.
In Java, the Number type is not Comparable either.
Furthermore, some Number values like NaN might not be comparable at all.
If you need a Number to be comparable, you can easily implement your own compareTo-method as extension function. This has some additional limitations though, as most Number subtypes implement Comparable, and the extension function will lose against that implementation.
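The precision problem can be reproduced in Java, whose Number type is the one mentioned above. This is a sketch only: compareViaDouble is a hypothetical helper, not part of any library, and it shows just the BigDecimal case described in the answer.

```java
import java.math.BigDecimal;

// Hypothetical helper class, for illustration only.
final class NumberCompare {
    // Naive comparison of arbitrary Numbers by converting both to double.
    static int compareViaDouble(Number a, Number b) {
        return Double.compare(a.doubleValue(), b.doubleValue());
    }

    public static void main(String[] args) {
        BigDecimal x = new BigDecimal("0.1");
        BigDecimal y = new BigDecimal("0.1000000000000000000001");
        // BigDecimal's own comparison sees that x is smaller than y.
        System.out.println(x.compareTo(y) < 0);
        // Both values round to the same double, so the naive comparison calls them equal.
        System.out.println(compareViaDouble(x, y) == 0);
    }
}
```

The two BigDecimal values differ by 1e-22, which is far below the spacing of double values near 0.1, so the conversion collapses them into one value.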
Credit for this answer goes to Roland, I only extended his comments (see on the question) into an answer.

When writing big O notation can unknown variables be used?

I do not know if the language I am using in the title is correct, but here is an example that illustrates what I am asking.
What would the time complexity be for this non-optimal algorithm that removes character pairs from a string?
The function will loop through a string. When it finds two identical characters next to each other it will return a string without the found pair. It then recursively calls itself until no pair is found.
Example (each line is the return string from one recursive function call):
iabccba
iabba
iaa
i
Would it be fair to describe the time complexity as O(|Characters| * |Pairs|)?
What about O(|Characters|^2)? Can pairs be used to describe the time complexity even though the number of pairs is not knowable at the initial function call?
It was argued to me that this algorithm was O(n^2) because the number of pairs is not known.

You're right that this is strictly speaking O(|Characters| * |Pairs|)
However, in the worst case, the number of pairs can be of the same order of magnitude as the number of characters, for example in the string 'abcdeedcba' (10 characters, 5 pairs).
So it also makes sense to describe it as O(n^2) worst-case.
I think this largely depends on the problem you mean to solve and its definition.
For graph algorithms, for example, everyone is comfortable writing the complexity as O(|V| + |E|), although in the worst case of a dense graph |E| = |V|^2. In other problems we just look at the worst possible case and write O(n^2), without breaking it into more specific variables.
I'd say that if there's no special convention, or no special data in the problem regarding number of pairs, O(...) implies worst-case performance and hence O(n^2) would be more appropriate.
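To make the two variables concrete, here is a minimal Java sketch of the algorithm as the question describes it. The class name PairRemoval and the calls counter are assumptions added for illustration; the original code was not posted.

```java
// Hypothetical reconstruction of the algorithm described in the question.
final class PairRemoval {
    // Counts how many times removePairs has been invoked.
    static int calls = 0;

    static String removePairs(String s) {
        calls++;
        // Scan for the first pair of identical adjacent characters: O(|Characters|) per call.
        for (int i = 0; i + 1 < s.length(); i++) {
            if (s.charAt(i) == s.charAt(i + 1)) {
                // Remove the pair and start over on the shorter string.
                return removePairs(s.substring(0, i) + s.substring(i + 2));
            }
        }
        return s; // no pair found, recursion ends
    }

    public static void main(String[] args) {
        System.out.println(removePairs("iabccba")); // prints "i"
        System.out.println(calls);                  // prints 4
    }
}
```

Each call does one linear scan, and there is one call per removed pair plus a final call that finds nothing. For 'abcdeedcba' that is 6 calls on a 10-character input, which is why the number of calls grows linearly with n in the worst case and the overall bound becomes O(n^2).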

Why does the Java API use int instead of short or byte?

Why does the Java API use int, when short or even byte would be sufficient?
Example: The DAY_OF_WEEK field in class Calendar uses int.
If the difference is too minimal, then why do those datatypes (short, int) exist at all?

Some of the reasons have already been pointed out. For example, the fact that "...(Almost) All operations on byte, short will promote these primitives to int". However, the obvious next question would be: WHY are these types promoted to int?
So to go one level deeper: The answer may simply be related to the Java Virtual Machine Instruction Set. As summarized in the Table in the Java Virtual Machine Specification, all integral arithmetic operations, like adding, dividing and others, are only available for the type int and the type long, and not for the smaller types.
(An aside: The smaller types (byte and short) are basically only intended for arrays. An array like new byte[1000] will take 1000 bytes, and an array like new int[1000] will take 4000 bytes)
Now, of course, one could say that "...the obvious next question would be: WHY are these instructions only offered for int (and long)?".
One reason is mentioned in the JVM Spec mentioned above:
If each typed instruction supported all of the Java Virtual Machine's run-time data types, there would be more instructions than could be represented in a byte
Additionally, the Java Virtual Machine can be considered as an abstraction of a real processor. And introducing dedicated Arithmetic Logic Unit for smaller types would not be worth the effort: It would need additional transistors, but it still could only execute one addition in one clock cycle. The dominant architecture when the JVM was designed was 32bits, just right for a 32bit int. (The operations that involve a 64bit long value are implemented as a special case).
(Note: The last paragraph is a bit oversimplified, considering possible vectorization etc., but should give the basic idea without diving too deep into processor design topics)
EDIT: A short addendum, focussing on the example from the question, but in a more general sense: One could also ask whether it would not be beneficial to store fields using the smaller types. For example, one might think that memory could be saved by storing Calendar.DAY_OF_WEEK as a byte. But here, the Java Class File Format comes into play: All the Fields in a Class File occupy at least one "slot", which has the size of one int (32 bits). (The "wide" fields, double and long, occupy two slots). So explicitly declaring a field as short or byte would not save any memory either.

(Almost) All operations on byte, short will promote them to int, for example, you cannot write:
short x = 1;
short y = 2;
short z = x + y; //error
Arithmetics are easier and straightforward when using int, no need to cast.
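A short sketch of what the promotion rule means in practice. The class and method names are made up for illustration; the language rules shown (binary numeric promotion to int, and the implicit narrowing cast in compound assignment) are standard Java.

```java
// Hypothetical helper class, for illustration only.
final class ShortPromotion {
    static short addWithCast(short x, short y) {
        // x + y has type int, so an explicit narrowing cast is needed to store it in a short.
        return (short) (x + y);
    }

    static short increment(short x) {
        // Compound assignment includes an implicit narrowing cast, so this compiles.
        x += 1;
        return x;
    }
}
```

Note that the narrowing cast silently wraps around: incrementing Short.MAX_VALUE this way yields Short.MIN_VALUE.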
In terms of space, it makes very little difference. byte and short would complicate things, and I don't think this micro-optimization is worth it since we are talking about a fixed number of variables.
byte is relevant and useful when you program for embedded devices or deal with files/networks. Also, these primitives are limited; what if the calculations exceed their limits in the future? Think about an extension of the Calendar class that might involve bigger numbers.
Also note that on 64-bit processors, locals will be kept in registers and won't use any other resources, so using int, short and other primitives won't make any difference at all. Moreover, many Java implementations align variables* (and objects).
* byte and short occupy the same space as int if they are local variables, class variables or even instance variables. Why? Because in (most) computer systems, variable addresses are aligned, so for example if you use a single byte, you'll actually end up with two bytes - one for the variable itself and another for the padding.
On the other hand, in arrays, a byte takes 1 byte, a short takes 2 bytes and an int takes 4 bytes, because in an array only the start and maybe the end have to be aligned. This makes a difference if you use, for example, System.arraycopy(); there you will really notice a performance difference.

Because arithmetic operations are easier when using integers compared to shorts. Assume that the constants were indeed modeled by short values. Then you would have to use the API in this manner:
short month = Calendar.JUNE;
month = (short) (month + 1); // is July; month + 1 has type int, so the whole sum must be cast
Notice the explicit casting. Short values are implicitly promoted to int values when they are used in arithmetic operations. (On the operand stack, shorts are even expressed as ints.) This would be quite cumbersome to use which is why int values are often preferred for constants.
Compared to that, the gain in storage efficiency is minimal because there only exists a fixed number of such constants. We are talking about 40 constants. Changing their storage from int to short would save you 40 * 16 bits = 80 bytes. See this answer for further reference.

The design complexity of a virtual machine is a function of how many kinds of operations it can perform. It's easier to have four implementations of an instruction like "multiply"--one each for 32-bit integer, 64-bit integer, 32-bit floating-point, and 64-bit floating-point--than to have, in addition to the above, versions for the smaller numerical types as well. A more interesting design question is why there should be four types, rather than fewer (performing all integer computations with 64-bit integers and/or doing all floating-point computations with 64-bit floating-point values). The reason for using 32-bit integers is that Java was expected to run on many platforms where 32-bit types could be acted upon just as quickly as 16-bit or 8-bit types, but operations on 64-bit types would be noticeably slower. Even on platforms where 16-bit types would be faster to work with, the extra cost of working with 32-bit quantities would be offset by the simplicity afforded by only having 32-bit types.
As for performing floating-point computations on 32-bit values, the advantages are a bit less clear. There are some platforms where a computation like float a=b+c+d; could be performed most quickly by converting all operands to a higher-precision type, adding them, and then converting the result back to a 32-bit floating-point number for storage. There are other platforms where it would be more efficient to perform all computations using 32-bit floating-point values. The creators of Java decided that all platforms should be required to do things the same way, and that they should favor the hardware platforms for which 32-bit floating-point computations are faster than longer ones, even though this severely degraded both the speed and precision of floating-point math on a typical PC, as well as on many machines without floating-point units. Note, btw, that depending upon the values of b, c, and d, using higher-precision intermediate computations when computing expressions like the aforementioned float a=b+c+d; will sometimes yield results which are significantly more accurate than would be achieved if all intermediate operands were computed at float precision, but will sometimes yield a value which is a tiny bit less accurate. In any case, Sun decided everything should be done the same way, and they opted for using minimal-precision float values.
Note that the primary advantages of smaller data types become apparent when large numbers of them are stored together in an array; even if there were no advantage to having individual variables of types smaller than 64 bits, it's worthwhile to have arrays which can store smaller values more compactly; having a local variable be a byte rather than a long saves seven bytes; having an array of 1,000,000 numbers hold each number as a byte rather than a long saves 7,000,000 bytes. Since each array type only needs to support a few operations (most notably read one item, store one item, copy a range of items within an array, or copy a range of items from one array to another), the added complexity of having more array types is not as severe as the complexity of having more types of directly-usable discrete numerical values.
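The array arithmetic above can be checked with a few lines of Java. This is a back-of-the-envelope sketch: ArrayPayload and its methods are hypothetical names, and the calculation counts only the element payload, ignoring the array object header and any alignment padding the JVM adds.

```java
// Hypothetical helper class, for illustration only.
final class ArrayPayload {
    // Payload size of an array: element count times element width in bytes.
    static long payloadBytes(long elementCount, int bytesPerElement) {
        return elementCount * bytesPerElement;
    }

    // Bytes saved by storing each element as a byte instead of a long.
    static long savedBytes(long elementCount) {
        return payloadBytes(elementCount, Long.BYTES) - payloadBytes(elementCount, Byte.BYTES);
    }
}
```

With Long.BYTES equal to 8 and Byte.BYTES equal to 1, savedBytes(1_000_000) comes to 7,000,000, matching the figure in the text.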

If you used the philosophy where integral constants are stored in the smallest type that they fit in, then Java would have a serious problem: whenever programmers write code using integral constants, they have to pay careful attention to their code to check if the type of the constants matter, and if so look up the type in the documentation and/or do whatever type conversions are needed.
So now that we've outlined a serious problem, what benefits could you hope to achieve with that philosophy? I would be unsurprised if the only runtime-observable effect of that change would be what type you get when you look the constant up via reflection. (and, of course, whatever errors are introduced by lazy/unwitting programmers not correctly accounting for the types of the constants)
Weighing the pros and the cons is very easy: it's a bad philosophy.

Actually, there'd be a small advantage. If you have a
class MyTimeAndDayOfWeek {
    byte dayOfWeek;
    byte hour;
    byte minute;
    byte second;
}
then on a typical JVM it needs as much space as a class containing a single int. The memory consumption gets rounded up to the next multiple of 8 or 16 bytes (IIRC, that's configurable), so the cases where there are real savings are rather rare.
This class would be slightly easier to use if the corresponding Calendar methods returned a byte. But there are no such Calendar methods, only get(int), which must return an int because of other fields. Each operation on smaller types promotes to int, so you need a lot of casting.
Most probably, you'll either give up and switch to an int or write setters like
void setDayOfWeek(int dayOfWeek) {
    this.dayOfWeek = checkedCastToByte(dayOfWeek);
}
Then the type of DAY_OF_WEEK doesn't matter, anyway.
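The answer does not show checkedCastToByte, so the version below is an assumption about what such a helper might look like: it narrows an int to a byte and rejects values that do not fit. The getter is also added only so that the sketch is self-contained.

```java
// Sketch of the class from the answer, reduced to one field, with a hypothetical checked cast.
final class MyTimeAndDayOfWeek {
    private byte dayOfWeek;

    // Assumed implementation: narrow to byte and fail loudly if information would be lost.
    static byte checkedCastToByte(int value) {
        byte result = (byte) value;
        if (result != value) {
            throw new IllegalArgumentException("Out of range for byte: " + value);
        }
        return result;
    }

    void setDayOfWeek(int dayOfWeek) {
        this.dayOfWeek = checkedCastToByte(dayOfWeek);
    }

    int getDayOfWeek() {
        return dayOfWeek; // widened back to int on read
    }
}
```

With a setter like this, callers pass and receive int values everywhere, so the byte storage stays an internal detail of the class.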

Using variables smaller than the bus size of the CPU means more cycles are necessary. For example when updating a single byte in memory, a 64-bit CPU needs to read a whole 64-bit word, modify only the changed part, then write back the result.
Also, using a smaller data type requires overhead when the variable is stored in a register, since the behavior of the smaller data type has to be accounted for explicitly. Since the whole register is used anyway, there is nothing to be gained by using a smaller data type for method parameters and local variables.
Nevertheless, these data types might be useful for representing data structures that require specific widths, such as network packets, or for saving space in large arrays, sacrificing speed.

Does OptaPlanner support optimizations and constraints on continuous variables?

I'm reading contradictory things in the documentation.
On one hand, this passage seems to indicate that continuous planning variables are possible:
A planning value range is the set of possible planning values for a
planning variable. This set can be a discrete (for example row 1, 2, 3
or 4) or continuous (for example any double between 0.0 and 1.0).
On the other hand, when defining a Planning Variable, you must specify a ValueRangeProvider annotation on a field to use for the value set:
The Solution implementation has method which returns a Collection. Any
value from that Collection is a possible planning value for this
planning variable.
Both of these snippets are in the same section of the documentation (http://docs.jboss.org/drools/release/latest/optaplanner-docs/html_single/#d0e2518)
So, which is it? Can I use a full double as my planning variable, or do I need to restrict its range to the values in a specific Collection?
Looking at the actual algorithms that are provided, I don't see any that are suitable for optimizing continuous variables, so I doubt it's possible, but it'd be nice to have that clarified and made explicit.

We're working towards fully supporting continuous variables. But currently (in 6.0.0.CR2) it's not decently supported yet.
Value ranges can indeed be continuous ranges, but the plumbing to actually use them isn't there yet. We have made good progress recently, see https://issues.jboss.org/browse/PLANNER-160.
Here's how it will work:
You'll be able to use a @ValueRangeProvider annotation on a method that returns a ValueRange (instead of a Collection) too.
A ValueRange will be an interface that supports selecting a random value, getting a size, ...
Out-of-the-box we will support IntValueRange, DoubleValueRange, BigDecimalValueRange, ...
(Implementation detail: we'll retro-fit those Collection-returning methods into a CollectionValueRange.)
Then the ValueSelector implementations will use that directly.
As for the suitability to optimize continuous variables:
JIT random selection will be blazing fast and be very memory-efficient.
If you have an NP-complete/NP-hard problem, then OptaPlanner will be a great match. If you have only continuous variables (and not a single discrete variable), then it's unlikely that your problem is NP-complete (unless your constraints counterprove that) and in that case you're better off with a custom, handmade, polynomial algorithm anyway (because it's not NP-complete, so there's an "easy" solution).

Optimization of Function Calls in Haskell

Not sure what exactly to google for this question, so I'll post it directly to SO:
Variables in Haskell are immutable
Pure functions should result in same values for same arguments
From these two points it's possible to deduce that if you call somePureFunc somevar1 somevar2 in your code twice, it only makes sense to compute the value during the first call. The resulting value can be stored in some sort of a giant hash table (or something like that) and looked up during subsequent calls to the function. I have two questions:
Does GHC actually do this kind of optimization?
If it does, what is the behaviour in the case when it's actually cheaper to repeat the computation than to look up the results?
Thanks.

GHC doesn't do automatic memoization. See the GHC FAQ on Common Subexpression Elimination (not exactly the same thing, but my guess is that the reasoning is the same) and the answer to this question.
If you want to do memoization yourself, then have a look at Data.MemoCombinators.
Another way of looking at memoization is to use laziness to take advantage of memoization. For example, you can define a list in terms of itself. The definition below is an infinite list of all the Fibonacci numbers (taken from the Haskell Wiki)
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)
Because the list is realized lazily, it's similar to having precomputed (memoized) previous values; e.g. fibs !! 10 forces the list up to index 10, so a subsequent fibs !! 11 is much faster.

Saving every function call result (cf. hash consing) is valid but can be a giant space leak and in general also slows your program down a lot. It often costs more to check if you have something in the table than to actually compute it.
