Serializing Begets Deep Cloning?

I was reading an article written by an ASF contributor, and he briefly mentioned that an "old Java trick" to deep clone an object is to serialize it and then deserialize it back into another object. When I read this I paused and thought, "hey, that's pretty smart." Unfortunately, neither deep cloning nor serialization was the subject of the article, so the author never gave an example of what he was talking about, and online searches haven't turned up anything along these lines.
I have to assume we're talking about something that looks like this:
public class Dog implements Serializable
{
    // ...

    public Dog deepClone()
    {
        Dog dogClone = null;
        try
        {
            FileOutputStream fout = new FileOutputStream("mydog.dat");
            ObjectOutputStream oos = new ObjectOutputStream(fout);
            oos.writeObject(this);
            oos.close();

            FileInputStream fin = new FileInputStream("mydog.dat");
            ObjectInputStream ois = new ObjectInputStream(fin);
            dogClone = (Dog) ois.readObject();
            ois.close();
        }
        catch (Exception e)
        {
            // Blah
        }
        return dogClone;
    }
}
Granted that I might be off a little bit (plus or minus a few lines of code), is this a generally accepted practice for deep cloning an object? Are there any pitfalls or caveats to this method?
Are there any synchronization/concurrency/thread-safety issues it doesn't address?
Because if this is a best-practices way of deep cloning objects, I'm going to use it religiously.

This is one common practice for deep cloning. The drawbacks are:
It is generally slow to do a serialization/deserialization. Custom cloning is faster.
It only clones serializable objects, obviously
It is difficult to know what you serialize. If your dog has an upward pointer to some larger structure (a pack of dogs), cloning a dog may clone a hundred other dogs if you don't pay attention. A manual clone of Dog would probably simply ignore the pack reference, creating a new individual dog object with the same properties, perhaps referencing the same pack of dogs, but not cloning the pack (see the sketch after this list).
Thread safety is not different from doing a manual clone. The properties will most likely be read sequentially from the source object by the serializer, and unless you take care of thread safety you may clone a dog that is partially changed while cloning.
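To make the pack pitfall concrete, here is a minimal Java sketch; the Pack class and the pack field are hypothetical additions to the question's Dog, not code from the article:
import java.io.Serializable;
import java.util.ArrayList;
import java.util.List;

class Pack implements Serializable {
    List<Dog> members = new ArrayList<>();
}

class Dog implements Serializable {
    String name;

    // Marking the back-reference transient keeps serialization from
    // following it, so cloning one dog no longer clones the entire pack.
    // Trade-off: the clone's pack field comes back null and must be
    // re-attached manually, which a hand-written clone would do directly.
    transient Pack pack;
}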
So I'd say it is probably not advisable to use this all the time. For a really simple object, making a simple manual clone/copy-constructor is simple and will perform much better. And for a complex object graph you may find that this runs the risk of cloning things you didn't intend to. So while it is useful, it should be used with caution.
By the way, in your example I'd use a memory stream rather than a file stream.
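For reference, a minimal sketch of that in-memory variant, swapping the temp file for ByteArrayOutputStream/ByteArrayInputStream (requires java.io.*; the same caveats as above still apply):
public Dog deepClone()
{
    try
    {
        // Serialize this object into an in-memory buffer instead of a file.
        ByteArrayOutputStream bout = new ByteArrayOutputStream();
        ObjectOutputStream oos = new ObjectOutputStream(bout);
        oos.writeObject(this);
        oos.close();

        // Deserialize the copy straight back out of the buffer.
        ObjectInputStream ois = new ObjectInputStream(
                new ByteArrayInputStream(bout.toByteArray()));
        Dog dogClone = (Dog) ois.readObject();
        ois.close();
        return dogClone;
    }
    catch (IOException | ClassNotFoundException e)
    {
        throw new RuntimeException("Deep clone failed", e);
    }
}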

Related

Abstract factory: ways of realization

I'm learning design patterns now and reading different resources for every pattern. I have a question about the Abstract Factory pattern. I have read about two ways of realizing it. I'll show how these factories are used, without their implementations. As an example, I took making different doors.
First way. We have a common DoorFactory class, which contains different methods for making different types of doors (each returning the appropriate Door class):
$doorFactory = new DoorFactory();
$door1 = $doorFactory->createWoodDoor();
$doorFactory = new DoorFactory();
$door2 = $doorFactory->createSteelDoor();
Second way. We have a parent class DoorFactory and child classes WoodDoorFactory and SteelDoorFactory. These classes implement the same method createDoor() (each returning the appropriate Door class):
$woodDoorFactory = new WoodDoorFactory();
$door1 = $woodDoorFactory->createDoor();
$steelDoorFactory = new SteelDoorFactory();
$door2 = $steelDoorFactory->createDoor();
Which way do you think is more optimal and canonical?
Please imagine the situation where your factory is passed around, and all the client code needs from the factory is to create a door (it does not care about wood or steel); you will see why the 2nd way is better. Let's say we have a class Client with a method foo which uses the factory (I'm using Java, but it should be easy to understand):
class Client {
    private DoorFactory factory;

    public Client(DoorFactory factory) {
        this.factory = factory;
    }

    public void foo() {
        Door door = factory.createDoor();
    }
}
Now you can pass a WoodDoorFactory or a SteelDoorFactory or WhateverDoorFactory to the constructor of Client.
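For concreteness, a minimal Java sketch of the second way; the factory names come from the question, while the Door subtypes are assumed:
interface Door { }
class WoodDoor implements Door { }
class SteelDoor implements Door { }

interface DoorFactory {
    Door createDoor();
}

// Each concrete factory knows how to build exactly one kind of door.
class WoodDoorFactory implements DoorFactory {
    public Door createDoor() { return new WoodDoor(); }
}

class SteelDoorFactory implements DoorFactory {
    public Door createDoor() { return new SteelDoor(); }
}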
Moreover, your 1st way may be a violation of the Single Responsibility Principle, since the DoorFactory class knows many things that are probably unrelated. Indeed, it knows how to create a wood door, which requires Wood APIs (just an example), and it knows how to create a steel door, which requires Steel APIs. This clearly reduces the opportunity to reuse the DoorFactory class in another environment that does not want to depend on Wood APIs or Steel APIs. This problem does not happen with your 2nd way.
As with the other answers, I also generally prefer the second method in practice. I've found it to be a more flexible and useful approach for Dependency Injection.
That said, I do think there are cases where the first approach works just as well, if not better; they just aren't as common. An example that comes to mind is an XML Document Object Model. If you have ever used Microsoft's C++ XML DOM Document API, then you will be familiar with this approach (see https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms760218(v%3dvs.85)).
In this case, there are a limited number of well-defined elements that can go into an XML document. There isn't a need to be able to dynamically extend the types of elements that can go into an XML document either - this is all predetermined by some standards committee. Therefore, the first factory approach works here, because you can predetermine all the different types of things you need to be able to create from the get-go.
The other advantage, in this case, is that by making the XML Document class the factory for all of the elements contained therein, the XML Document has complete control over the lifetime of those internal objects. NOTE: they disallow using the same sub-elements in multiple XML Document instances. If you wanted to take a node from one XML Document and place it in another, you would be required to go through the second XML Document to produce a new node element as well as a copy of any and all sub-elements.
A notable difference in this case from the example in the OP is that rather than the Factory Method being used to provide multiple ways of creating the same type of thing, here the Factory knows how to create a bunch of highly related (and connected) object types.
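A rough sketch of that document-as-factory style, loosely modelled on the DOM's createElement/createTextNode methods (simplified, hypothetical classes, not the actual Microsoft API):
import java.util.ArrayList;
import java.util.List;

abstract class XmlNode { }

class Element extends XmlNode {
    final String tagName;
    Element(String tagName) { this.tagName = tagName; }
}

class TextNode extends XmlNode {
    final String text;
    TextNode(String text) { this.text = text; }
}

// The document is the single factory for every node type it owns,
// which also gives it control over those nodes' lifetimes.
class XmlDocument {
    private final List<XmlNode> ownedNodes = new ArrayList<>();

    Element createElement(String tagName) {
        Element e = new Element(tagName);
        ownedNodes.add(e);
        return e;
    }

    TextNode createTextNode(String text) {
        TextNode t = new TextNode(text);
        ownedNodes.add(t);
        return t;
    }
}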

How to design the objects' relationship with ORM

I have been confused about ORM since I saw the following sample code:
public class Article
{
    public List<Comment> Comments;

    public void AddComment(Comment comment)
    {
        Comments.Add(comment);
    }

    // I'm surprised by this kind of operation:
    // how much of a performance hit must it be?
    public void Save()
    {
        // update the article and all its comments
    }
}
According to what I think, the responsibility of saving a comment should be assigned to the comment itself:
public class Comment
{
    public Article BelongArticle;

    // I think this is more direct than using the Article object,
    // but it's based on thinking about the database structure.
    // I was told one should "forget" the database, but that's really hard.
    public int ArticleId;

    public void Save()
    {
        // save the comment directly
    }
}
You are reaching conclusions without any real basis because you are just looking at some sample code and not considering what actually might be happening.
The whole point of using an ORM is so you can allow it to handle the database transactions while you work in an object oriented rather than a relational fashion in your application. You really cannot say anything about how the ORM performs when you do an update on Article by just looking at Article.Save. Article.Save is an OO construct, and what the ORM actually executes on the database is a relational action. What about Article.Save makes you think it is inefficient? Looking at that does not give you any information. You would have to look at what the ORM of choice is doing on the database.
Suppose the Article is a new object. In this case you have to save the Article, set the foreign key in the Comment, and then save the comment. Your "preferred" code does not show the full operation but it still must occur. The difference is the ORM gives you an object oriented way to do this - just call save on the article. Under the hood the same operations must occur either way. Maybe the ORM takes more steps than you could do it in manually, but maybe not.
Suppose Article is not a new object and you add a new Comment. Depending on the platform and on how your code is written, what happens with an ORM when you call Save this time could be no different from the approach you consider better. If there is nothing that needs to be updated in the Article, the ORM may simply save the Comment.
ORMs use various methods, but in general they maintain some kind of running account of objects that need to be updated.
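As a rough, hypothetical sketch of that bookkeeping (not any particular ORM's API), a unit of work might track new and dirty objects and flush only what actually changed:
import java.util.LinkedHashSet;
import java.util.Set;

class UnitOfWork {
    private final Set<Object> newObjects = new LinkedHashSet<>();
    private final Set<Object> dirtyObjects = new LinkedHashSet<>();

    void registerNew(Object o)   { newObjects.add(o); }
    void registerDirty(Object o) { dirtyObjects.add(o); }

    // On commit, only the tracked changes are written to the database.
    void commit() {
        for (Object o : newObjects)   { /* issue an INSERT for o */ }
        for (Object o : dirtyObjects) { /* issue an UPDATE for o */ }
        newObjects.clear();
        dirtyObjects.clear();
    }
}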
You cannot just say that the first approach is somehow inefficient just on face because you called a method on Article instead of Comment - what is actually happening will depend on the specific ORM platform you use as well as the state of the objects.

Object persistence terminology: 'repository' vs. 'store' vs. 'context' vs. 'retriever' vs. (...)

I'm not sure how to name data store classes when designing a program's data access layer (DAL).
(By data store class, I mean a class that is responsible to read a persisted object into memory, or to persist an in-memory object.)
It seems reasonable to name a data store class according to two things:
what kinds of objects it handles;
whether it loads and/or persists such objects.
⇒ A class that loads Banana objects might be called e.g. BananaSource.
I don't know how to go about the second point (i.e. the Source bit in the example). I've seen different nouns apparently used for just that purpose:
repository: this sounds very general. Does this denote something read-/write-accessible?
store: this sounds like something that potentially allows write access.
context: sounds very abstract. I've seen this with LINQ and object-relational mappers (ORMs).
P.S. (several months later): This is probably appropriate for containers that contain "active" or otherwise supervised objects (the Unit of Work pattern comes to mind).
retriever: sounds like something read-only.
source & sink: probably not appropriate for object persistence; a better fit with data streams?
reader / writer: quite clear in its intention, but sounds too technical to me.
Are these names arbitrary, or are there widely accepted meanings / semantic differences behind each? More specifically, I wonder:
What names would be appropriate for read-only data stores?
What names would be appropriate for write-only data stores?
What names would be appropriate for mostly read-only data stores that are occasionally updated?
What names would be appropriate for mostly write-only data stores that are occasionally read?
Does one name fit all scenarios equally well?
As no one has yet answered the question, I'll post what I have decided in the meantime.
Just for the record, I have pretty much decided on calling most data store classes repositories. First, it appears to be the most neutral, non-technical term from the list I suggested, and it seems to be well in line with the Repository pattern.
Generally, "repository" seems to fit well where data retrieval/persistence interfaces are something similar to the following:
public interface IRepository<TResource, TId>
{
    int Count { get; }

    TResource GetById(TId id);
    IEnumerable<TResource> GetManyBySomeCriteria(...);

    TId Add(TResource resource);
    void Remove(TId id);
    void Remove(TResource resource);

    ...
}
Another term I have decided on using is provider, which I'll be preferring over "repository" whenever objects are generated on-the-fly instead of being retrieved from a persistence store, or when access to a persistence store happens in a purely read-only manner. (Factory would also be appropriate, but sounds more technical, and I have decided against technical terms for most uses.)
P.S.: Some time has gone by since writing this answer, and I've had several opportunities at work to review someone else's code. One term I've thus added to my vocabulary is Service, which I am reserving for SOA scenarios: I might publish a FooService that is backed by a private Foo repository or provider. The "service" is basically just a thin public-facing layer above these that takes care of things like authentication, authorization, or aggregating / batching DTOs for proper "chunkiness" of service responses.
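As a hedged sketch of that layering, in Java, with FooService and Foo as the placeholder names from the paragraph above:
class Foo { }

interface FooRepository {
    Foo getById(long id);
}

// Thin public-facing layer over the private repository; it handles
// cross-cutting concerns such as authentication and authorization.
class FooService {
    private final FooRepository repository;

    FooService(FooRepository repository) { this.repository = repository; }

    Foo fetchFoo(long id, String authToken) {
        if (!isAuthorized(authToken)) {
            throw new SecurityException("not authorized");
        }
        return repository.getById(id);
    }

    private boolean isAuthorized(String authToken) {
        return authToken != null;   // placeholder check only
    }
}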
Well, to add something to your conclusion:
A repository: is meant to care about only one entity and follows certain patterns, like the interface you showed.
A store: is allowed to do a bit more, also working with other entities.
A reader/writer: these are separated so that only reading or only writing functionality can be semantically exposed and injected into other classes. This comes from the CQRS pattern (see the sketch after this list).
A context: is more or less bound to an ORM mapper, as you mentioned, and is usually used under the hood of a repository or store; some use it directly instead of building a repository on top, but it's harder to abstract.
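To illustrate the reader/writer split, a minimal Java sketch; Banana is the stand-in entity from the question, and all other names are hypothetical:
import java.util.HashMap;
import java.util.Map;

class Banana {
    long id;
}

// Classes that only need to read can be handed the reader alone.
interface BananaReader {
    Banana getById(long id);
}

interface BananaWriter {
    void add(Banana banana);
    void remove(long id);
}

// A concrete repository may implement both sides.
class InMemoryBananaRepository implements BananaReader, BananaWriter {
    private final Map<Long, Banana> store = new HashMap<>();

    public Banana getById(long id) { return store.get(id); }
    public void add(Banana banana) { store.put(banana.id, banana); }
    public void remove(long id)    { store.remove(id); }
}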

Is it good convention for a class to perform functions on itself?

I've always been taught that if you are doing something to an object, that should be an external thing, so one would Save(Class) rather than having the object save itself: Class.Save().
I've noticed that in the .Net libraries, it is common to have a class modify itself as with String.Format() or sort itself as with List.Sort().
My question is, in strict OOP is it appropriate to have a class which performs functions on itself when called to do so, or should such functions be external and called on an object of the class' type?
Great question. I have just recently reflected on a very similar issue and was eventually going to ask much the same thing here on SO.
In OOP textbooks, you sometimes see examples such as Dog.Bark(), or Person.SayHello(). I have come to the conclusion that those are bad examples. When you call those methods, you make a dog bark, or a person say hello. However, in the real world, you couldn't do this; a dog decides for itself when it's going to bark, and a person decides for themselves when they will say hello to someone. Therefore, these methods would more appropriately be modelled as events (where supported by the programming language).
You would e.g. have a function Attack(Dog), PlayWith(Dog), or Greet(Person) which would trigger the appropriate events.
Attack(dog) // triggers the Dog.Bark event
Greet(johnDoe) // triggers the Person.SaysHello event
As soon as you have more than one parameter, it won't be so easy deciding how to best write the code. Let's say I want to store a new item, say an integer, into a collection. There's many ways to formulate this; for example:
StoreInto(1, collection) // the "classic" procedural approach
1.StoreInto(collection) // possible in .NET with extension methods
Store(1).Into(collection) // possible by using state-keeping temporary objects
According to the thinking laid out above, the last variant would be the preferred one, because it doesn't force an object (the 1) to do something to itself. However, if you follow that programming style, it will soon become clear that this fluent interface-like code is quite verbose, and while it's easy to read, it can be tiring to write or even hard to remember the exact syntax.
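A minimal Java sketch of that third, fluent variant, using a state-keeping temporary object (all names hypothetical):
import java.util.Collection;

class Store<T> {
    private final T value;

    private Store(T value) { this.value = value; }

    // store(1) creates the temporary object that holds the state...
    static <T> Store<T> store(T value) { return new Store<>(value); }

    // ...and into(collection) completes the operation.
    void into(Collection<? super T> target) { target.add(value); }
}

// Usage: Store.store(1).into(collection);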
P.S.: Concerning global functions: In the case of .NET (which you mentioned in your question), you don't have much choice, since the .NET languages do not provide for global functions. I think these would be technically possible with the CLI, but the languages disallow that feature. F# has global functions, but they can only be used from C# or VB.NET when they are packed into a module. I believe Java also doesn't have global functions.
I have come across scenarios where this lack is a pity (e.g. with fluent interface implementations). But generally, we're probably better off without global functions, as some developers might always fall back into old habits, and leave a procedural codebase for an OOP developer to maintain. Yikes.
By the way, in VB.NET you can mimic global functions by using modules. Example:
Globals.vb:
Module Globals
Public Sub Save(ByVal obj As SomeClass)
...
End Sub
End Module
Demo.vb:
Imports Globals
...
Dim obj As SomeClass = ...
Save(obj)
I guess the answer is "It Depends"... for Persistence of an object I would side with having that behavior defined within a separate repository object. So with your Save() example I might have this:
repository.Save(class)
However with an Airplane object you may want the class to know how to fly with a method like so:
airplane.Fly()
This is one of the examples I've seen from Fowler about an anemic domain model. I don't think in this case you would want to have a separate service like this:
new airplaneService().Fly(airplane)
With static methods and extension methods it makes a ton of sense, as in your List.Sort() example. So it depends on your usage patterns. You wouldn't want to have to new up an instance of a ListSorter class just to be able to sort a list like this:
new listSorter().Sort(list)
In strict OOP (Smalltalk or Ruby), all methods belong to an instance object or a class object. In "real" OOP (like C++ or C#), you will have static methods that essentially stand completely on their own.
Going back to strict OOP, I'm more familiar with Ruby, and Ruby has several "pairs" of methods that either return a modified copy or return the object in place -- a method ending with a ! indicates that the message modifies its receiver. For instance:
>> s = 'hello'
=> "hello"
>> s.reverse
=> "olleh"
>> s
=> "hello"
>> s.reverse!
=> "olleh"
>> s
=> "olleh"
The key is to find some middle ground between pure OOP and pure procedural that works for what you need to do. A Class should do only one thing (and do it well). Most of the time, that won't include saving itself to disk, but that doesn't mean Class shouldn't know how to serialize itself to a stream, for instance.
I'm not sure what distinction you seem to be drawing when you say "doing something to an object". In many if not most cases, the class itself is the best place to define its operations, as under "strict OOP" it is the only code that has access to internal state on which those operations depend (information hiding, encapsulation, ...).
That said, if you have an operation which applies to several otherwise unrelated types, then it might make sense for each type to expose an interface which lets the operation do most of the work in a more or less standard way. To tie it in to your example, several classes might implement an interface ISaveable which exposes a Save method on each. Individual Save methods take advantage of their access to internal class state, but given a collection of ISaveable instances, some external code could define an operation for saving them to a custom store of some kind without having to know the messy details.
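A hedged Java sketch of that idea (names hypothetical; Java convention would call the interface Saveable rather than ISaveable): each save() uses private state, while external code only sees the interface.
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

interface Saveable {
    void save(OutputStream out) throws IOException;
}

class Document implements Saveable {
    private String contents = "";

    // save() has access to private state that external code does not.
    @Override
    public void save(OutputStream out) throws IOException {
        out.write(contents.getBytes(StandardCharsets.UTF_8));
    }
}

class Archiver {
    // External code can save a mixed batch without knowing any details.
    static void saveAll(Iterable<Saveable> items, OutputStream out) throws IOException {
        for (Saveable item : items) {
            item.save(out);
        }
    }
}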
It depends on what information is needed to do the work. If the work is unrelated to the class (mostly equivalently, can be made to work on virtually any class with a common interface), for example, std::sort, then make it a free function. If it must know the internals, make it a member function.
Edit: Another important consideration is performance. In-place sorting, for example, can be miles faster than returning a new, sorted copy. This is why quicksort is faster than merge sort in the vast majority of cases, even though merge sort has the better theoretical worst case: quicksort can be performed in-place, whereas I've never heard of a practical in-place merge sort. Just because it's technically possible to perform an operation within the class's public interface doesn't mean that you actually should.

DDD: Repositories are in-memory collections of objects?

I've noticed Repository is usually implemented in either of the following ways:
Method 1
void Add(object obj);
void Remove(object obj);
object GetBy(int id);
Method 2
void Save(object obj); // Used both for Insert and Update scenarios
void Remove(object obj);
object GetBy(int id);
Method 1 has collection semantics (which is how repositories are defined). We can get an object from a repository and modify it. But we don't tell the collection to update it. Implementing a repository this way requires another mechanism for persisting the changes made to an in-memory object. As far as I know, this is done using Unit of Work. However, some argue that UoW is only required when you need transaction control in your system.
Method 2 eliminates the need to have UoW. You can call the Save() method and it determines if the object is new and should be inserted, or is modified and should be updated. It then uses the data mappers to persist the changes to the database. Whilst this makes life much easier, a repository modelled this way doesn't have collection semantics. This model has DAO semantics.
I'm really confused about this. If repositories mimic in-memory collection of objects, then we should model them according to Method 1.
What are your thoughts on this?
Mosh
I personally have no issue with the Unit of Work pattern being a part of the solution. Obviously, you only need it for the CUD in CRUD. The fact that you are implementing a UoW pattern, though, does nothing more than dictate that you have a set of operations that need to go as a batch. That is slightly different than saying it needs to be a part of a transaction. If you abstract your repositories well enough, your UoW implementation can be agnostic to the backing mechanism that you are using - whether it is database, XML, etc.
As to the specific question, I think the difference between method one and method two is trivial, if for no other reason than most instances of method two contain a check to see if the identifier is set: if set, treat it as an update; otherwise, treat it as an insert. This logic is often built into the repository and is more for simplification of the exposed interface, in my opinion. The repository's purpose is to broker objects between a consumer and a data source and to remove the need for direct knowledge of the data source. I go with method two, because I trust the simple logic of detecting an identifier more than having to rely on tracking object states all over the application.
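A minimal Java sketch of that identifier check (method two); the Entity interface and the data-mapper calls are assumptions for illustration:
interface Entity {
    Long getId();   // null until the entity has been persisted
}

class Repository<T extends Entity> {
    public void save(T entity) {
        if (entity.getId() == null) {
            insert(entity);   // no identity yet: treat as an insert
        } else {
            update(entity);   // identity already set: treat as an update
        }
    }

    private void insert(T entity) { /* data mapper INSERT, assigns new id */ }
    private void update(T entity) { /* data mapper UPDATE */ }
}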
The fact that the terminology for repository usage is so similar to both data access and object collections lends to the confusion. I just treat them as their own first-class citizen and do what is best for the domain. ;-)
Maybe you want to have:
T Persist(T entityToPersist);
void Remove(T entityToRemove);
"Persist" being the same as "Save Or Update" or "Add Or Update" - ie. the Repo encapsulates creating new identities (the db may do this) but always returns the new instance with the identity reference.