Functional data-structures, OO notions of dispatched equality and comparison, StructuralEquality, and referential transparency

I have a very CPU intensive F# program that depends on persistent data-structures - about 40% of the total CPU time is spent in the Map module. So I thought I'd try out the PersistentHashMap in FSharpX collections. (BTW, this is already a big improvement over the previous version of F# in VS2013 where the same program spent 70% of its time in Map. I also notice that running programs with the debugger attached doesn't have the huge penalty it did before - good work guys...) There is also a hot-spot where I'm re-sorting all the time, where instead I should be adding to a Heap, so I thought I'd give that a go as well.
Two issues became immediately apparent:
(1) Swapping out one for the other from an interface perspective proved harder than it seems it should be, i.e., making a shim that let me switch from a Map to a PersistentHashMap while preserving both the module-based let-bound functions and the types necessary to use each map. I know that having full HM type-inference (and no type-classes) is orthogonal to LSP-style referential transparency for the most part - but maybe I was missing some way to do this better with a minimal amount of code.
(2) The biggest problem (which I'd like to focus on here) is the reliance of the F# functional data-structs on OO-style dispatched equality and comparison via the IComparable (when 't : comparison), etc., family of interfaces.
Even for OO programs ISTM that the idea of dispatching equality and comparison is a bad idea -- an object "knows" how to perform its own domain-specific tasks, but it doesn't "know" for the most part what notion of equality is going to be necessary at various points in the program for various purposes -- so equality/comparison should not be part of the object's interface, but when these concepts are needed, they should always be mentioned explicitly. For example, there should never be a .Sort(), only a .SortWith(...). One could argue that even something as basic as structural equality in F# could be explicit a.StructEq(b) or a ~= b - otherwise you always get object.Equals -- but even stipulating that doing things this way is the best for a multi-paradigm language that's a first-class .Net citizen, it seems like there should at least be the option of using passed-in comparison and equality functions, but this is not the case.
This means that: (a) type constraints are enforced even if you don't want them, causing ripples of broken inferred typing (and hundreds of wavy red lines with it being unclear where the actual "problem" is) and (b), that by implementing a notion of equality or comparison that makes one container type happy in one part of your program (and in my case I want to use the same container and item type with two different notions of ordering in two different places), it is likely to silently break (or cause inefficiency, if one subsumes the other) in other parts of the code that depended on the default/previous implementation.
The only way around this that I could think of is wrapping each item in an adapter object using a new...with object expression - but I really don't want to create so much garbage just to get the code to work.
So, ISTM that we could have a "pure" version of each persistent data struct that could be loaded if desired (even basics like List, etc.) that does not depend on dispatched equality/comparison/hashing and does not impose type constraints - all such needs should be met via passed-in fn's at the time of the call. (Dispatched eq/cmp would be used only for interop with BCL collections that don't accept delegates.) Then we could have an [EqCmpHashThrowNotImplemented] attribute, and I could be sure that there were no default operations happening at all, and I would feel better about the efficiency and predictability of my code. (And this also lets one change from a Record to a Class or vice versa w/o worrying about any changes in behavior due to default implementations.) Again, this would be optional, but enabled with a simple import. (Which does mean that each core collection type would have to be broken out into its own module, which isn't really a bad idea anyway.)
If I've overlooked a better way to do things or there are some patterns people are using here, I'd be interested.
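For comparison, here is a rough sketch of the "passed-in comparison" style the post asks for, written in Java rather than F# because java.util.TreeMap already accepts an explicit Comparator. The Job type, the ExplicitOrdering class and both comparators are hypothetical names made up for illustration. TreeMap is a mutable container, so this only shows the explicit-ordering part of the idea, not persistence.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeMap;

// Hypothetical item type: it deliberately does NOT implement Comparable,
// so no default ordering is attached to it.
final class Job {
    final String name;
    final int priority;

    Job(String name, int priority) {
        this.name = name;
        this.priority = priority;
    }
}

final class ExplicitOrdering {
    // Two notions of ordering for the same type, each named at its point of use.
    static final Comparator<Job> BY_NAME = Comparator.comparing((Job j) -> j.name);
    static final Comparator<Job> BY_PRIORITY =
            Comparator.comparingInt((Job j) -> j.priority).thenComparing(BY_NAME);

    // Returns the job names in the order induced by the supplied comparator.
    // Keys that compare as equal under that comparator collapse into one entry.
    static List<String> namesInOrder(List<Job> jobs, Comparator<Job> order) {
        TreeMap<Job, Boolean> sorted = new TreeMap<>(order);
        for (Job job : jobs) {
            sorted.put(job, Boolean.TRUE);
        }
        List<String> names = new ArrayList<>();
        for (Job job : sorted.keySet()) {
            names.add(job.name);
        }
        return names;
    }
}
```

The same Job values can sit in one container ordered by name and in another ordered by priority, with no adapter objects and no change to Job itself. Whether FSharpX or the F# core library could expose something similar is a separate question.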


Is polymorphism a waste when applied to classes whose exact type we know before run-time?

Run-time polymorphism lets the run-time dynamically select the exact concrete class behind an abstract class/interface. (Take the Animal/Dog or Vehicle/Car examples.)
But when we know the exact concrete class at coding-time (compile-time), do we really need to forcefully apply polymorphism?
When I write OO code, I tend to use the most general type I can on the left-hand side of the assignment. This immediately means that my answer to your question is no: it is not a waste.
Here's the example:
Animal x = new Dog();
The reason why I'm doing this is that I'm probably going to split beginning and end of the operation into two distinct operations. My methods are extremely short in practice.
Applied to the same example:
function moveDog() {
    move(new Dog());
}
function move(Animal animal) {
    // works with any Animal; the concrete type is irrelevant here
}
As you can see, it would make no sense for the move function to know what kind of animal it is really moving.
Generally, it is the compiler's duty to figure out whether, in a given code base, any call can actually reach an overridden move() method. Some compilers can detect that no override is ever involved and then remove the dynamic dispatch at compile time. With some luck, my code above would compile to the same thing whether the move function receives an Animal or a Dog.
Now, this is theory. In practice, there are two important things. First, compilers that are widely used have still not started using such aggressive optimization techniques as detecting static method calls, as opposed to calls that require dynamic dispatch. Second, the first thing doesn't matter too much with CPU power we have today.
I have been writing highly optimized code for fifteen years already and I have yet to meet a situation in which I had to factor polymorphic calls out. That is why I strongly recommend applying polymorphism as much as possible. When the time comes to add some classes to incorporate new features, polymorphic calls will likely be the tool that lets you add them seamlessly to the existing design. If you used overly concrete types during development, it could easily happen that you cannot add a new feature to the given code base.
But when we know the exact concrete class at coding-time (compile-time), do we really need to forcefully apply polymorphism?
Knowing the type at compile time is not necessarily a yes/no thing across all the code in an app and an object's entire lifetime, given techniques for type erasure. But, ignoring those classic uses of polymorphism, there are still other potential reasons such as...
(sorry - pretty obvious one this) to make it easier to change the implementation should another become available later
to make it easier to "mock" an implementation for testing (i.e. provide objects that pretend to provide some service or function, but have more scripted/controllable/observable behaviours to let tests put some dependent code through its paces)
hide aspects of the implementation that might otherwise have to be exposed (e.g. in C++, a class/struct definition must declare all the protected and private members)
this is sometimes for Intellectual Property protection; at other times, so that more changes can be made to the implementation without having to change the "header" file, which would typically trigger recompilation of a lot of dependent code
to aid in modelling and application design, using the "interfaces" to cleanly specify the intended APIs, which can then provide a more stable reference for comparison as the implementations are fleshed out
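The "mock for testing" reason in the list above can be sketched in Java as follows. TimeSource, SystemTimeSource, ManualTimeSource and SessionTimeout are hypothetical names invented for this illustration; only System.currentTimeMillis() is a real library call.

```java
// Hypothetical service interface; dependent code is written against this only.
interface TimeSource {
    long nowMillis();
}

// Production implementation backed by the system clock.
final class SystemTimeSource implements TimeSource {
    @Override
    public long nowMillis() {
        return System.currentTimeMillis();
    }
}

// Scripted test double: the test decides exactly when time moves.
final class ManualTimeSource implements TimeSource {
    private long now;

    ManualTimeSource(long start) {
        this.now = start;
    }

    void advance(long millis) {
        now += millis;
    }

    @Override
    public long nowMillis() {
        return now;
    }
}

// Dependent code: it cannot tell which implementation it was given.
final class SessionTimeout {
    private final TimeSource time;
    private final long startedAt;
    private final long limitMillis;

    SessionTimeout(TimeSource time, long limitMillis) {
        this.time = time;
        this.startedAt = time.nowMillis();
        this.limitMillis = limitMillis;
    }

    boolean expired() {
        return time.nowMillis() - startedAt >= limitMillis;
    }
}
```

Even though production code knows at compile time that it will always pass a SystemTimeSource, keeping SessionTimeout polymorphic lets a test drive it through expiry without sleeping.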

Scala immutable vs mutable. What is the way one should go?

I'm just learning to program in scala.
I have some experience in functional programming, as I have in object oriented programming.
My question is kind of simple, yet tricky:
Which structures should be used in Scala? Should we stick only to immutables, e.g. modifying a list by iterating through it and assembling a new one, or go for mutables? What is your opinion on that, and what are the performance and memory aspects?
I'm likely to program in a functional style, but it often expands to an insane amount of effort to do things which are easily done by using mutables. Is it situation dependent, what to use?
Prefer immutable to mutable state. Use mutable state only where it is absolutely necessary. Some notable reasons include:
Performance. The standard libraries make wide use of vars and while loops, even though this is not idiomatic Scala. This should not be emulated, however, except for cases where you have profiled to determine that modifying the code to be more imperative will bring a significant performance gain.
I/O. I/O, or interacting with the outside world is inherently state dependent, and thus must be dealt with in a mutable manner.
This is no different than the recommended coding style found in all major languages, imperative or functional. For example, in Java it is preferable to use data objects with only private final fields. Code written in an immutable (and functional) way is inherently easier to understand because when one sees a val, they know it will never change, reducing the possible number of states any particular object or function can be in.
In many cases, it also allows automatic parallel execution. For example, Scala collection classes have a par function, which returns a parallel collection that automatically runs calls to functions like map or reduce in parallel (since Scala 2.13, parallel collections ship in the separate scala-parallel-collections module).
(I thought this must be a duplicate but couldn't easily find an earlier similar one, so I venture to answer...)
There is no general answer to this question. The rule of thumb suggested by the creators of Scala is to start with immutable vals and structures and stick to them as long as it makes sense. You can almost always create a workable solution to your problem this way. But if not, of course be pragmatic and use mutability.
Once you have a solution, you can tweak it, test it, measure its performance etc. If you find that e.g. it is too slow or overly complex, identify the critical part of it, understand what makes it problematic and - if needed - reimplement it using mutable variables, ideally keeping it isolated from the rest of the program. Note though that in many cases, a better solution can be found from within the immutable realm as well, so try looking there first. Especially for a beginner like myself, it still happens regularly that the best solution I could come up with looked contorted and complex with no apparent way to improve it - until seeing a simple and elegant solution to the same problem in a few lines of code, created by an experienced Scala developer who controls more of the power of the language and its libraries.
I usually obey the following rules:
Never use static mutable vars
Keep all user defined data types (typically case classes) immutable unless they are very expensive to copy. This will simplify a lot of the application logic.
If a data structure/collection is inherently mutable (i.e. it's designed to change over time), using a mutable data structure/collection might be appropriate. An example might be a large game world that is updated when players move. Remember to (almost) never share these data structures between threads though.
It's fine to use mutable local vars in methods
Use immutable collections for function results. These can be strictly or lazily evaluated depending on what gives best performance in the used context. Be careful if you use a lazily evaluated result which depends on a mutable collection though.
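An earlier answer notes that the same advice applies in Java, so here is a small Java sketch of three of the rules above: an immutable user-defined type, mutable locals confined to one method, and a read-only collection as the result. GridPoint and PathBuilder are hypothetical names used only for this illustration. In Scala the equivalent would be a case class with copy plus an immutable List.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Immutable data type: all fields are final, and "modifying" it returns a copy.
final class GridPoint {
    final int x;
    final int y;

    GridPoint(int x, int y) {
        this.x = x;
        this.y = y;
    }

    GridPoint movedBy(int dx, int dy) {
        return new GridPoint(x + dx, y + dy);
    }
}

final class PathBuilder {
    // Mutation is confined to the locals of this method; callers only ever
    // see a read-only view of the finished list.
    static List<GridPoint> walkRight(GridPoint start, int steps) {
        List<GridPoint> acc = new ArrayList<>();
        GridPoint current = start;
        for (int i = 0; i < steps; i++) {
            current = current.movedBy(1, 0);
            acc.add(current);
        }
        return Collections.unmodifiableList(acc);
    }
}
```

Because the mutable list never escapes except through the unmodifiable wrapper, the method behaves like a pure function from the caller's point of view.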

Operator overloading - is it really reasonable to forbid?

Java forbids operator overloading, but coming from C++ I do not see any reason for that. In languages where operator symbols are symbols like any other, the same rules apply to "+" as to "plus" and there is no problem. So what is the point?
Edit: To be more concrete, show me which disadvantage overloaded "+" may have over overloaded "equals".
As with many other things in Java, this is a restriction because it may be confusing if used improperly. (Similarly, pointer arithmetic is forbidden because it is error prone.) I'm a big fan of Java, but I'm generally of the opinion that a feature shouldn't be forbidden just because it could be misused.
For instance, BigInteger would benefit greatly from overloading the + operator.
OK, I'll try my hand at this under the assumption that Gabriel Ščerbák is doing this for better reasons than railing against a language.
The issue for me is one of manageable complexity: How much of the code in front of me do I have to decode vs. simply read?
In most conventional languages, upon seeing the expression a + b I know what is going to happen. The variables a and b will be added together. I'm pretty confident that behind the scenes the code will be very concise, very fast native machine code that adds the two numbers, whether the numbers are short integers or double-precision or some mixture of the two. (In some languages I may have to also assume that these could be strings being concatenated, but that's a rant for an entirely different question -- but one that flavours this rant if you peer at it from the right angle.)
When I make my own user-defined type -- say the omnipresent Complex type (and why Complex isn't a standard data type in modern languages is way the Hell beyond me, but that, again, is a rant for a different question) -- if I overload an operator (or, rather, if the operator is overloaded for me -- I'm using a library, say), short of peering very closely at the code I will not know that I'm now calling (possibly-virtual) methods on objects instead of having very tight, concise code generated for me behind the scenes. I will not know of the hidden conversions, the hidden temporary variables, the ... well, everything that goes along with writing many operators. To find out what's really going on in my code I have to pay very close attention to every line and keep track of declarations that may be three screens away from my current location in the code. To say that this impedes my understanding of the code flowing before my eyes is an understatement. Important details are being lost because the syntactic sugar is making things taste too tasty.
When I'm forced to use explicit methods on the objects (or even static methods or global methods where that applies) this is a signal to me, while I'm reading, that tells me of the potential cost overheads and bottlenecks and the like. I know, without even having to think for an instant, that I'm dealing with a method, that I've got dispatching overhead, that I may have temporary object creation and deletion overhead, etc. Everything's in front of me right before my eyes -- or at least enough indicators are in front of me that I know to be more careful.
I'm not intrinsically opposed to operator overloading. There are times when it makes code clearer, yes indeed, especially when you have complicated calculations over many baffling expressions. I can understand, however, exactly why someone might not want to put that into their language.
There is a further reason not to like operator overloading from the language designer's viewpoint. Operator overloading makes for very, very, very difficult grammars. C++ is already infamous for being nigh-unparseable and some of its constructs, like operator overloading, are the cause of it. Again from the viewpoint of someone writing the language I can fully understand why operator overloading was left off as a bad idea (or a good idea that's bad in implementation).
(This is all, of course, in addition to the other reasons you've already rejected. I'll submit my own overloading of operator-,() in my old C++ days in that stew just to be really annoying.)
There is no problem with operator overloading itself, but with how it has actually been used. As long as you overload the operators to make sense, the language still makes sense, but if you give other meanings to operators, it makes the language inconsistent.
(One example is how the shift left (<<) and shift right (>>) operators have been overloaded in C++ to mean "output" and "input"...)
So, the reasoning when leaving out operator overloading was probably that the risk of misuse was greater than the benefits of having operator overloading.
I think that Java would benefit greatly from extending its operators to cover built-in Number object types. Early (pre-1.0) versions of Java were said to have it (in that there were no primitives - everything was an object) but the VM technology of the time made it prohibitive from a performance view.
But in terms of in general allowing user defined operator overloading, it is not in the spirit of the Java language. The main problem is simply that it is hard to implement an operator that is consistent with what you expect from mathematics across object types and it will open the door to a lot of bad implementations which lead to a lot of hard to find (therefore expensive) bugs. You can just look at how many bad equals implementations (as in violate the contract) there are in general Java code, and the problem would only get worse from there.
Of course there are languages that prioritize power and syntactical beauty over such concerns, and more power to them. It is just not Java.
Edit: How is a custom + operator different than a custom == implementation (captured in Java in the equals(Object) method)? It isn't, really. It is just that by allowing operator overloading, things that are intuitive to a sixth grader become untrue. The real world experience of equals(Object) implementations shows how such complex contracts become hard to enforce in the real world.
Further Edit: Let me clarify the above, as I shortened it while editing and lost the point. A + operator in math has certain properties, one of which is that it doesn't matter which order the numbers on either side appear - it has the same result. So consider even the simplest case of a + performing an add to a Collection:
Collection a = ...
Collection b = ...
a + b;
The intuitive understanding of + would lead to an expectation that a + b or b + a would give the same result, but of course they would not. Start mixing two object types that take each other as parameters in their plus method (say Collection and String) and things get harder to follow.
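A minimal Java sketch of that point, assuming "+" on collections would mean concatenation. The plus helper below is hypothetical, since Java has no such operator. It is shown next to BigInteger.add, which is what Java offers today in place of an overloaded + and which does commute.

```java
import java.math.BigInteger;
import java.util.ArrayList;
import java.util.List;

final class PlusDemo {
    // Hypothetical stand-in for an overloaded "+" on collections: concatenation.
    static <T> List<T> plus(List<T> a, List<T> b) {
        List<T> result = new ArrayList<>(a);
        result.addAll(b);
        return result;
    }

    // Without operator overloading, BigInteger arithmetic is an explicit method call.
    static BigInteger sum(BigInteger a, BigInteger b) {
        return a.add(b);
    }
}
```

With a = [1, 2] and b = [3], plus(a, b) gives [1, 2, 3] while plus(b, a) gives [3, 1, 2], so the operand order matters for lists in a way it does not for sum.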
Now certainly it is possible to design operators on objects which are well understood and lead to better, more readable and more understandable code than without them. But the point is that more often than not in home-grown corporate APIs what you would end up seeing is obfuscated code.
There are a few problems:
Overloading logical operators has side effects because of lazy evaluation.
Even in mathematical types there are ambiguities: is (3dpoint*3dpoint) a cross or a scalar product?
You can't define new operators, so people reuse existing operators in novel ways, e.g. "string1%string2" to mean split string1 on string2.
But you can't always protect idiots from themselves even with an outright ban.
The point is that whenever you see, for example, a plus sign being used in the code, you know exactly what it does given that you know the types of its operands (which you always do in Java, as it is statically typed).

How do you determine how coarse or fine-grained a 'responsibility' should be when using the single responsibility principle?

In the SRP, a 'responsibility' is usually described as 'a reason to change', so that each class (or object?) should have only one reason someone should have to go in there and change it.
But if you take this to the extreme fine-grain you could say that an object adding two numbers together is a responsibility and a possible reason to change. Therefore the object should contain no other logic, because it would produce another reason for change.
I'm curious whether anyone out there has any strategies for 'scoping' the single-responsibility principle that are slightly less subjective?
It comes down to the context of what you are modeling. I've done some extensive writing and presenting on the SOLID principles and I specifically address your question in my discussions of Single Responsibility.
The following first appeared in the Jan/Feb 2010 issue of Code Magazine, and is available online at "S.O.L.I.D. Software Development, One Step at a Time"
The Single Responsibility Principle says that a class should have one, and only one, reason to change.
This may seem counter-intuitive at first. Wouldn’t it be easier to say that a class should only have one reason to exist? Actually, no: one reason to exist could very easily be taken to an extreme that would cause more harm than good. If you take it to that extreme and build classes that have one reason to exist, you may end up with only one method per class. This would cause a large sprawl of classes for even the most simple of processes, causing the system to be difficult to understand and difficult to change.
The reason that a class should have one reason to change, instead of one reason to exist, is the business context in which you are building the system. Even if two concepts are logically different, the business context in which they are needed may necessitate them becoming one and the same. The key point of deciding when a class should change is not based on a purely logical separation of concepts, but rather the business’s perception of the concept. When the business perception and context has changed, then you have a reason to change the class. To understand what responsibilities a single class should have, you need to first understand what concept should be encapsulated by that class and where you expect the implementation details of that concept to change.
Consider an engine in a car, for example. Do you care about the inner working of the engine? Do you care that you have a specific size of piston, camshaft, fuel injector, etc? Or, do you only care that the engine operates as expected when you get in the car? The answer, of course, depends entirely on the context in which you need to use the engine.
If you are a mechanic working in an auto shop, you probably care about the inner workings of the engine. You need to know the specific model, the various part sizes, and other specifications of the engine. If you don’t have this information available, you likely cannot service the engine appropriately. However, if you are an average everyday person that only needs transportation from point A to point B, you will likely not need that level of information. The notion of the individual pistons, spark plugs, pulleys, belts, etc., is almost meaningless to you. You only care that the car you are driving has an engine and that it performs correctly.
The engine example drives straight to the heart of the Single Responsibility Principle. The contexts of driving the car vs. servicing the engine provide two different notions of what should and should not be a single concept, a reason for change. In the context of servicing the engine, every individual part needs to be separate. You need to code them as single classes and ensure they are all up to their individual specifications. In the context of driving a car, though, the engine is a single concept that does not need to be broken down any further. You would likely have a single class called Engine, in this case. In either case, the context has determined what the appropriate separation of responsibilities is.
I tend to think in terms of "velocity of change" of the business requirements rather than "reason to change".
The question is indeed how likely things are to change together, not whether they could change or not.
The difference is subtle, but helps me. Let's consider the example on Wikipedia about the reporting engine:
if the likelihood that the content and the template of the report change at the same time is high, it can be one component because they are apparently related. (It can also be two.)
but if the likelihood that the content changes without the template is significant, then it must be two components, because they are not related. (It would be dangerous to have one.)
But I know that's a personal interpretation of the SRP.
Also, a second technique that I like is: "Describe your class in one sentence". It usually helps me to identify if there is a clear responsibility or not.
I don't see performing a task like adding two numbers together as a responsibility. Responsibilities come in different shapes and sizes but they certainly should be seen as something larger than performing a single function.
To understand this better, it is probably helpful to clearly differentiate between what a class is responsible for and what a method does. A method should "do only one thing" (e.g. add two numbers, though for most purposes '+' is a method that does that already) while a class should present a single clear "responsibility" to its consumers. Its responsibility is at a much higher level than a method.
A class like Repository has a clear and singular responsibility. It has multiple methods like Save and Load, but a clear responsibility to provide persistence support for Person entities. A class may also co-ordinate and/or abstract the responsibilities of dependent classes, again presenting this as a single responsibility to other consuming classes.
The bottom line is that if the application of SRP is leading to single-method classes whose whole purpose seems to be just to wrap the functionality of that method in a class, then SRP is not being applied correctly.
A simple rule of thumb I use is that the level or granularity of responsibility should match the level or granularity of the "entity" in question. Obviously the purpose of a method will always be more precise than that of a class, or service, or component.
A good strategy for evaluating the level of responsibility can be to use an appropriate metaphor. If you can relate what you are doing to something that exists in the real world it can help give you another view of the problem you're trying to solve - including being able to identify appropriate levels of abstraction and responsibility.
@Derick Bailey: nice explanation.
Some additions: it is totally acceptable that the application of SRP is context-based.
The question still remains: are there any objective ways to determine whether a given class violates SRP?
Some design contexts are quite obvious (like the car example by Derick), but otherwise the context in which a class's behaviour has to be defined often remains fuzzy.
For such cases, it may well be helpful to analyse the fuzzy class behaviour by splitting its responsibilities into different classes and then measuring the impact of the new behavioural and structural relations that emerge from the split.
As soon as the split is done, the reasons to keep the responsibilities split, or to merge them back into a single responsibility, become obvious.
I have applied this approach, and it has led to good results for me.
But my search for 'objective ways of defining a class responsibility' still continues.
I respectfully disagree with Chris Nicola above when he says that "a class should present a single clear 'responsibility' to its consumers".
I think SRP is about having a good design inside the class, not about the class's consumers.
To me it's not very clear what a responsibility is, and the proof is the number of questions that this concept raises.
"single reason to change"
"if the description contains the word "and" then it needs to be split"
lead to the question: where is the limit? In the end, any class with 2 public methods has 2 reasons to change, doesn't it?
For me, the true SRP leads to the Facade pattern, where you have a class that simply delegates the calls to other classes.
For example:
class Modem
Refactors to ==>
class ModemSender
class ModemReceiver
class Modem
    send() -> ModemSender.send()
    receive() -> ModemReceiver.receive()
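Spelled out in Java, the refactoring above might look roughly like this. The in-memory Deque standing in for the communication line and the String message type are assumptions added only to make the sketch runnable.

```java
import java.util.Deque;

// Hypothetical collaborator: the in-memory queue stands in for a real line.
final class ModemSender {
    private final Deque<String> line;

    ModemSender(Deque<String> line) {
        this.line = line;
    }

    void send(String message) {
        line.addLast(message);
    }
}

final class ModemReceiver {
    private final Deque<String> line;

    ModemReceiver(Deque<String> line) {
        this.line = line;
    }

    // Returns the oldest pending message, or null when the line is empty.
    String receive() {
        return line.pollFirst();
    }
}

// Facade: Modem contains no logic of its own and only delegates.
final class Modem {
    private final ModemSender sender;
    private final ModemReceiver receiver;

    Modem(ModemSender sender, ModemReceiver receiver) {
        this.sender = sender;
        this.receiver = receiver;
    }

    void send(String message) {
        sender.send(message);
    }

    String receive() {
        return receiver.receive();
    }
}
```

Modem's only remaining reason to change is a change in which collaborators it wires together, while sending and receiving can each evolve independently.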
Opinions are welcome.

Passing object references needlessly through a middleman

I often find myself needing a reference to an object that is several objects away, or so it seems. The options I see are passing a reference through a middle-man or just making something available statically. I understand the danger of global scope, but passing a reference through an object that does nothing with it feels ridiculous. I'm okay with a little bit of passing around, I suppose. I suspect there's a line to be drawn somewhere.
Does anyone have insight on where to draw this line?
Or a good way to deal with the problem of distributing references amongst dependent objects?
Use the Law of Demeter (with moderation and good taste, not dogmatically). If you're coding a.b.c.d.e, something IS wrong -- you've nailed forevermore the implementation of a to have a b which has a c which... EEP!-) One or at the most two dots is the maximum you should be using. But the alternative is NOT to plump things into globals (and ensure thread-unsafe, buggy, hard-to-maintain code!), it is to have each object "surface" those characteristics it is designed to maintain as part of its interface to clients going forward, instead of just letting poor clients go through such unending chains of nested refs!
This smells of an abstraction that may need some improvement. You seem to be violating the Law of Demeter.
In some cases a global isn't too bad.
Consider, you're probably programming against an operating system's API. That's full of globals, you can probably access a file or the registry, write to the console. Look up a window handle. You can do loads of stuff to access state that is global across the whole computer, or even across the internet... and you don't have to pass a single reference to your class to access it. All this stuff is global if you access the OS's API.
So, when you consider the number of global things that often exist, a global in your own program probably isn't as bad as many people try and make out and scream about.
However, if you want to have very nice OO code that is all unit testable, I suppose you should be writing wrapper classes around any access to globals, whether they come from the OS or are declared by you, to encapsulate them. This means your class that uses this global state can get references to the wrappers, and they could be replaced with fakes.
Hmm, anyway. I'm not quite sure what advice I'm trying to give here, other than say, structuring code is all a balance! And, how to do it for your particular problem depends on your preferences, preferences of people who will use the code, how you're feeling on the day on the academic to pragmatic scale, how big the code base is, how safety critical the system is and how far off the deadline for completion is.
I believe your question is revealing something about your classes. Maybe the responsibilities could be improved? Maybe moving some code would solve the problem?
Tell, don't ask.
That's how it was explained to me. There is a natural tendency to call classes to obtain some data. Taken too far, asking too much typically leads to heavy "getter sequences". But there is another way. I must admit it is not easy to find, but it improves gradually, both in a specific code base and in the coder's habits.
Class A wants to perform a calculation and asks for B's data. Sometimes it is appropriate for A to tell B to do the job instead, possibly passing some parameters. For example, B's "getName()", which A uses to check the validity of the name, could be replaced by an "isValid()" method on B.
"Asking" has been replaced by "telling" (calling a method that executes the computation).
For me, this is the question I ask myself when I find too many getter calls. Gradually, the methods find their place in the correct object, and everything gets a bit simpler: I have fewer getters and fewer calls to them. I have less code, and it carries more meaning and aligns better with the functional requirements.
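A small Java sketch of the getName()/isValid() example. Customer plays the role of B and Registration the role of A. Both names, and the "non-blank name" validity rule, are assumptions made up for illustration.

```java
// Hypothetical class B: it owns the name and the rule about the name.
final class Customer {
    private final String name;

    Customer(String name) {
        this.name = name;
    }

    // "Ask" style: exposes the data so callers can apply the rule themselves.
    String getName() {
        return name;
    }

    // "Tell" style: the validity rule (assumed here: non-blank) lives with the data.
    boolean isValid() {
        return name != null && !name.trim().isEmpty();
    }
}

// Hypothetical class A: it no longer needs to know what makes a name valid.
final class Registration {
    static String register(Customer customer) {
        return customer.isValid() ? "registered" : "rejected";
    }
}
```

If the validity rule changes later, only Customer changes. Every caller that previously inspected getName() would otherwise have had to change too.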
Move the data around
There are other cases where I move some data. For example, if a field moves two objects up, the length of the "getter chain" is reduced by two.
I believe nobody can find the correct model at first.
I first think about it (using hand-written diagrams is quick and a big help), then code it, then think again facing the real thing... Then I code the rest, and any smells I feel in the code, I think again...
Split and merge objects
If a method on A needs data from C, with B as a middle man, I check whether A and C have something in common. Possibly, A or a part of A could become C (possible splitting of A, merging of A and C) ...
However, there are cases where I keep the getters of course.
But it's less likely a long chain will be created.
A long chain will probably get broken by one of the techniques above.
I have three patterns for this:
Pass the necessary reference to the object's constructor -- the reference can then be stored as a data member of the object, and doesn't need to be passed again; this implies that the object's factory has the necessary reference. For example, when I'm creating a DOM, I pass the element name to the DOM node when I construct the DOM node.
Let things remember their parent, and get references to properties via their parent; this implies that the parent or ancestor has the necessary property. For example, when I'm creating a DOM, there are various things which are stored as properties of the top-level DomDocument ancestor, and its child nodes can access those properties via the reference which each one has to its parent.
Put all the different things which are passed around as references into a single class, and then pass around just that one class instance as the only thing that's passed around. For example, there are many properties required to render a DOM (e.g. the GDI graphics handle, the viewport coordinates, callback events, etc.) ... I put all of these things into a single 'Context' instance which is passed as the only parameter to the methods of the DOM nodes to be rendered, and each method can get whichever properties it needs out of that context parameter.
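The first and third patterns can be sketched together in Java. RenderContext, DomNode and their fields are hypothetical. A StringBuilder stands in for the graphics handle, and truncating names to the viewport width is an invented stand-in for real rendering logic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical bundle of everything rendering needs, passed as one parameter.
final class RenderContext {
    final StringBuilder out;   // stands in for a graphics handle
    final int viewportWidth;   // illustrative: maximum characters per label

    RenderContext(StringBuilder out, int viewportWidth) {
        this.out = out;
        this.viewportWidth = viewportWidth;
    }
}

final class DomNode {
    private final String name;                       // passed once, at construction
    private final List<DomNode> children = new ArrayList<>();

    DomNode(String name) {
        this.name = name;
    }

    DomNode add(DomNode child) {
        children.add(child);
        return this;
    }

    // Each node takes what it needs from the context and forwards the same
    // instance to its children; no other references are threaded through.
    void render(RenderContext ctx) {
        String label = name.length() > ctx.viewportWidth
                ? name.substring(0, ctx.viewportWidth)
                : name;
        ctx.out.append('<').append(label).append('>');
        for (DomNode child : children) {
            child.render(ctx);
        }
    }
}
```

Adding another rendering property later means adding a field to RenderContext. The render signatures of intermediate nodes stay untouched.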