Apache ACE XML repository - repository

Apache ACE currently uses XML file based repositories. Can we change them to be DBMS based? If so, are any guidelines available?

ACE uses two layers of abstraction when it comes to storage:
Repository
I'll start at the bottom. Here, ACE introduces the notion of a Repository, which is nothing more than a versioned BLOB of data. Each repository starts versioning at 1, and every time you commit a new BLOB, that version gets bumped. There are multiple such repositories, which can be addressed by name.
Writing a different implementation of this Repository interface is fairly straightforward, and you can use any back-end that supports some form of BLOB, including a DBMS. Do note that at this level, there is no notion of what's inside these BLOBs, so depending on your reasons for using a DBMS here, that might or might not be what you want.
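To make this concrete, here is a minimal sketch of what a DBMS-backed implementation could look like. The interface below is a simplified stand-in, not ACE's actual org.apache.ace.repository.Repository (check the ACE source for the real signatures), and the table name is made up; the point is only that one BLOB table keyed by repository name and version is enough.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    // Simplified stand-in for ACE's versioned-BLOB abstraction; illustrative only.
    interface VersionedBlobStore {
        byte[] checkout(long version) throws SQLException;
        boolean commit(byte[] data, long fromVersion) throws SQLException;
    }

    // Hypothetical JDBC-backed implementation: one row per (name, version) pair.
    class JdbcBlobStore implements VersionedBlobStore {
        private final Connection conn;
        private final String name; // repositories are addressed by name

        JdbcBlobStore(Connection conn, String name) {
            this.conn = conn;
            this.name = name;
        }

        public byte[] checkout(long version) throws SQLException {
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT data FROM repo_versions WHERE name = ? AND version = ?")) {
                ps.setString(1, name);
                ps.setLong(2, version);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? rs.getBytes(1) : null;
                }
            }
        }

        public boolean commit(byte[] data, long fromVersion) throws SQLException {
            // Only accept the commit if fromVersion is still the latest version;
            // a real implementation would do this in a transaction or rely on a
            // unique (name, version) constraint to stay safe under concurrency.
            if (latestVersion() != fromVersion) {
                return false;
            }
            try (PreparedStatement ps = conn.prepareStatement(
                    "INSERT INTO repo_versions (name, version, data) VALUES (?, ?, ?)")) {
                ps.setString(1, name);
                ps.setLong(2, fromVersion + 1);
                ps.setBytes(3, data);
                ps.executeUpdate();
            }
            return true;
        }

        private long latestVersion() throws SQLException {
            try (PreparedStatement ps = conn.prepareStatement(
                    "SELECT COALESCE(MAX(version), 0) FROM repo_versions WHERE name = ?")) {
                ps.setString(1, name);
                try (ResultSet rs = ps.executeQuery()) {
                    rs.next();
                    return rs.getLong(1);
                }
            }
        }
    }

The optimistic fromVersion check mirrors the "version gets bumped on every commit" behaviour described above.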
Object Graph
On top of this Repository, ACE uses an in-memory object graph of POJOs to represent its state. The POJOs hold metadata, such as an artifact's URL, bundle symbolic name, version, etc. They are currently persisted and restored using XStream (that's where the XML comes from). At this level you could opt for storing the graph in a completely different way as well (maybe even bypassing the underlying Repository entirely in favor of something else). Note though that ACE in general assumes that this whole graph of objects is versioned every time it is persisted (so no old data is ever overwritten).
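For context, the XStream round trip itself is only a couple of calls; the POJO below is purely illustrative (ACE's real metadata classes are richer), only the XStream calls are the library's actual API.

    import com.thoughtworks.xstream.XStream;

    // Illustrative POJO mirroring the metadata mentioned above; not an actual ACE class.
    class ArtifactMeta {
        String url;
        String bundleSymbolicName;
        String version;
    }

    public class GraphSnapshotDemo {
        public static void main(String[] args) {
            ArtifactMeta artifact = new ArtifactMeta();
            artifact.url = "http://example.org/bundles/foo-1.0.0.jar";
            artifact.bundleSymbolicName = "org.example.foo";
            artifact.version = "1.0.0";

            XStream xstream = new XStream();
            xstream.allowTypes(new Class[] { ArtifactMeta.class }); // needed on recent XStream versions

            String xml = xstream.toXML(artifact);                   // this XML is what ends up in the BLOB
            ArtifactMeta restored = (ArtifactMeta) xstream.fromXML(xml);
            System.out.println(xml);
            System.out.println(restored.bundleSymbolicName);
        }
    }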
Hopefully this explains a bit more about what's involved. If you want to discuss this some more, don't hesitate to subscribe to the ACE dev mailing list (see http://ace.apache.org/get-involved/mailing-lists.html for information on how to subscribe).

Related

Is it okay to have more than one repository for an aggregate in DDD?

I've read this question about something similar but it didn't quite solve my problem.
I have an application where I'm required to use data from an API. Problem is there are performance and technical limitations to doing this. The performance limitations are obvious. The technical limitations lie in the fact that the API does not support some of the more granular queries I need to make.
I decided to use MySQL as a queryable cache.
Since the data I needed to retrieve from the API did not change very often, I settled on refreshing the cache once a day, so I didn't need any complicated mapper that checked if we had the data in the cache and if not fell back to the API. That was my first design, but I realized that wasn't very practical when the API couldn't support most of the queries I needed to make anyway.
Now I have a set of two mappers for every aggregate. One for MySQL and one for the API.
My problem is now how I hide the complexities of persistence from the domain, and the fact that it seems that I need multiple repositories.
Ideally I would have an interface that both mappers adhered to, but as previously disclosed that's not possible.
Is it okay to have multiple repositories, one for each mapper?
Is it okay to have more than one repository for an aggregate in DDD?
Short answer: yes.
Longer answer: you won't find any suggestion of multiple repositories in the original book by Evans. As he described things, the domain model would have one representation of the aggregate, and the repository abstraction provided consumers with the illusion that the aggregate was stored in an in-memory collection.
Largely, this makes sense -- you are trying to ensure that writes to data within the aggregate boundary are consistent, so you need a single authority for change.
But... there's no particular reason that reads need to travel through the same code path as writes. Welcome to the world of CQRS. What that gives you immediately is the idea that the in-memory representation for reads might need to be optimized differently from the in-memory representation used for writes.
In its more general form, you get the idea that the concept that you are modeling might have different representations for each use case.
For your case, where it is sometimes appropriate to read from the RDBMS, sometimes from the API, sometimes both, this isn't quite an exact match -- the repository interface hides the implementation details from the consumer, but you still have to bother with the implementation.
One thing you might look at is your requirements; how fresh does the data need to be in each use case? A constraint that is often relaxed in the CQRS pattern is the idea that the effects of writes are immediately available for reading. The important question to ask would be, if the data hasn't been cached yet, can you simply report "data not available" without hitting the API?
If so, then use cases that access the cached data need only a single repository implementation.
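As a rough illustration of that split (all type, method and table names below are hypothetical, not taken from the question), the write side keeps a single repository per aggregate while a read-side query interface gets a cache-only implementation that simply reports absence rather than falling back to the API:

    import java.util.Optional;

    class Order { /* aggregate state and invariants live here */ }

    // Write side: the single authority for changes to the aggregate.
    interface OrderRepository {
        Optional<Order> byId(String id);
        void save(Order order);
    }

    record OrderSummary(String id, String status, int lineCount) {}

    // Read side: shaped per use case; an empty result means "data not available (yet)".
    interface OrderSummaryQuery {
        Optional<OrderSummary> findSummary(String id);
    }

    // Cache-only implementation: never touches the remote API; up to a day of
    // staleness is accepted, matching the once-a-day refresh described above.
    class MySqlOrderSummaryQuery implements OrderSummaryQuery {
        private final javax.sql.DataSource ds;

        MySqlOrderSummaryQuery(javax.sql.DataSource ds) { this.ds = ds; }

        public Optional<OrderSummary> findSummary(String id) {
            String sql = "SELECT id, status, line_count FROM order_summary WHERE id = ?";
            try (var conn = ds.getConnection(); var ps = conn.prepareStatement(sql)) {
                ps.setString(1, id);
                try (var rs = ps.executeQuery()) {
                    if (!rs.next()) return Optional.empty();
                    return Optional.of(new OrderSummary(rs.getString(1), rs.getString(2), rs.getInt(3)));
                }
            } catch (java.sql.SQLException e) {
                throw new RuntimeException(e);
            }
        }
    }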
If you are using an external API to read and modify data, you can cache the data locally for faster reads, but I would avoid having a domain repository for it.
From the domain perspective it seems that you need a service to query for some data (or just a Query in a CQRS implementation); that service can internally call the remote API or read from a local cache (MySQL, whatever).
When you read from your local cache you can build a repository to decouple your logic from the DB implementation, but this is a different concept from a domain repository; it is just a detail of your technical implementation and has nothing to do with your domain.
If the remote service starts offering the query you need, you will change how the query is executed, calling the remote API instead of the DB, but your domain model should not change.
A domain repository is used to load and persist your aggregates, whereas if you are working with external aggregates (in a different context or subdomain) you need to interact with them through services.

Schema versioning using Fluent NHibernate

I've tried reading some previous answers but it's not clear whether or not any of them apply to my situation, as far as I can see. Most of the questions seem to refer to web applications. I figure I'm better off stating my requirements and going from there instead of trying to reverse-engineer advice meant for a different situation. I'm essentially asking two questions:
What does (Fluent) NHibernate support that would, in principle, allow me to achieve the requirements? I'd prefer to use the Fluent API if possible;
What am I going to have to write myself to develop a working solution?
Broadly, the requirements are as follows:
What I'd like to do is use FNH to persist and rehydrate models for a desktop application that would have roughly the same usage model as MS Office, for example - that is, work is kept as self-contained files which are loaded into a local instance of the application.
The current version of the application must be able to import files from all previous versions and preserve all information except that which is declared to the user to be unsupported; by 'import' I mean 'transcribe the model information contained in file A into new file B such that file B is fully compatible with the current version, apart from that which is declared to be unsupported.'
The current version of the application must be able to export a current model to be compliant with only the most recent issue of the previous major version of the application. It is not required to supply legacy compatibility with any older revisions of the previous major version.
The nature of the product is such that updates to the file format happen fairly frequently - as a ballpark figure, aim to be able to release to users every six months or so if necessary - and the format changes in development much more frequently than that.
I have no objection to writing code to handle this, provided that:
The coding does not take an inordinate amount of time for arbitrarily complicated changes to the schema;
I am able to verify whether or not the translation between versions is complete by calling the FNH API through unit tests;
I can verify that any given model will round-trip correctly between versions and only lose data which is declared to the user to be unsupported between product versions;
So, to summarise:
What, if anything, does Fluent NHibernate supply to enable this kind of use-case?
Can the requirements be readily satisfied as they are, or will I have to make them more specific and constrained?
What should I expect to have to code myself?
I would suggest using a document database, something like RavenDB, MongoDB, etc., for what you are trying to do. I think these would be a better fit than trying to force an RDBMS (SQL Server, Oracle, etc.) and consequently NHibernate to do something it's not all that good at. Not to say that it can't, but you will end up jumping through all sorts of hoops to accomplish what you are asking.
One thing to note is that Fluent NHibernate only puts a fluent API over NHibernate's class mapping.

JBoss TreeCache vs PojoCache when using invalidation rather than replication

We are setting up a JBoss cluster and are building our own distributed cache solution on top of JBoss Cache (we can't use it as a 2nd-level cache for the ORM layer in our case). We want to use invalidation rather than replication as the cache mode. As far as I can see after (very) little testing, both solutions seem to work: objects are put into the cache and seem to be evicted when they are updated on any of the servers.
This leads me to believe that PojoCache with AOP instrumentation is only needed when using replication, so that you can replicate only updated field values and not whole objects. Am I correct here, or are there other advantages to using PojoCache over TreeCache in our scenario? And if PojoCache has advantages, do we still need AOP instrumentation and to annotate our entities with @PojoCacheable (yes, we are using JBoss Cache 1.4.1) since we are not using replication?
Regards
Jonas Heineson
PojoCache has the ability, through AOP, to:
only replicate changed fields and not whole objects. This makes a difference if, e.g., your person object contains a huge image of the person and you only change the password;
detect changes and thus automatically put them on the list to be replicated.
TreeCache (plain) does not need AOP, but as a result cannot replicate individual fields or detect what has changed, so you need to trigger replication yourself.
If you don't replicate, those points are probably irrelevant.
IIRC, you don't need the @PojoCacheable annotation for PojoCache - without it, you need to specify the classes to be enhanced in a different way.
I have the feeling that if you are not replicating, the plain TreeCache will be enough.
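For reference, plain TreeCache usage in this scenario looks roughly like the sketch below. It is written from memory against the JBoss Cache 1.4.x API, so treat the class and method names as assumptions to verify against the 1.4.1 docs; the configuration file name is made up, and the cache mode (e.g. INVALIDATION_SYNC) would normally be set in that XML file rather than in code.

    import org.jboss.cache.PropertyConfigurator;
    import org.jboss.cache.TreeCache;

    public class InvalidationCacheSketch {
        public static void main(String[] args) throws Exception {
            TreeCache cache = new TreeCache();
            // Hypothetical config file; it would contain something like
            //   <attribute name="CacheMode">INVALIDATION_SYNC</attribute>
            new PropertyConfigurator().configure(cache, "treecache-invalidation.xml");
            cache.startService();

            // Whole objects are stored. With invalidation, an update on one node makes
            // the other nodes drop (not receive) this entry, so they reload it on the
            // next cache miss instead of getting a replicated copy.
            cache.put("/users/42", "profile", loadUserFromDatabase(42));
            Object cached = cache.get("/users/42", "profile");
            System.out.println(cached);

            cache.stopService();
        }

        private static Object loadUserFromDatabase(int id) {
            return "user-" + id; // placeholder for a real lookup
        }
    }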

Relational database backend for mercurial or git

What I like about fossil is that it uses plain old sqlite to store changesets, files, etc. I can use its command line tool to query the repository, but if I want something not supported by it, I can fall back to writing an SQL query.
Mercurial and git are more mature; they have more libraries and more momentum, but they use their own repository formats. I wonder if it's possible to have sqlite as their repository backend. (I know there are tools to query a mercurial or git repo directly, but SQL seems easier.)
As Jefromi writes, Mercurial also uses a custom format to achieve high compression and fast access to any revision. This is the revlog format, an append-only data structure that takes advantage of the immutability of changesets in Mercurial.
However, it is of course possible to replace this storage format with another if you like. Google did this when they put Mercurial on Bigtable for code.google.com. One funny consequence of them using their own backend format is that you don't see any revision numbers in their web interface. In normal Mercurial, the revision numbers (the local-only integer you can use instead of the full changeset hash) are the index of the changesets in the revlog. When changesets are not stored in revlogs, there is no natural index and therefore Google shows you no revision numbers.
With git, the repository format is a pretty fundamental part of the way everything works. You'd have to do a lot of work to change that.
I haven't read any of mercurial's source, but I imagine the situation isn't much different.
As I suggested in my comment, I'm not really sure why you'd want to do this. For git to still be able to have all of its advantages, you'd have to store git objects in your sqlite database. You'd still need all of the low-level git tools to access and manipulate them - you're not going to be just looking up blobs and trees by their SHA1s and doing all the rest of the work yourself. (And even if for some reason you wanted to, you could do that just as easily by looking in the git objects directory.)
My suggestion would be that, if you find that there are operations you want to perform in git that are unsupported, you familiarize yourself with some of the plumbing commands and figure out how to write them as scripts. Git really does expose pretty much the lowest level of operations you could want.
P.S. If you should find a specific unsupported operation you want to do, and are having trouble finding the plumbing you need to perform it, or with the scripting necessary to implement it, post a question here! No reason to get stuck just because you can't use sql.
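To illustrate the "script the plumbing" suggestion, here is a small toy sketch that shells out to two real plumbing commands, git rev-list and git cat-file, and prints the author header of every commit reachable from HEAD. It is only a sketch of the approach, not a substitute for a proper git library.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.List;

    public class GitPlumbingSketch {
        public static void main(String[] args) throws Exception {
            // Plumbing command #1: every commit reachable from HEAD, one SHA1 per line.
            for (String sha : run("git", "rev-list", "HEAD")) {
                // Plumbing command #2: pretty-print the raw commit object and pick out
                // its "author" header line.
                for (String line : run("git", "cat-file", "-p", sha)) {
                    if (line.startsWith("author ")) {
                        System.out.println(sha + "  " + line);
                        break;
                    }
                }
            }
        }

        private static List<String> run(String... command) throws Exception {
            Process p = new ProcessBuilder(command).redirectErrorStream(true).start();
            List<String> lines = new ArrayList<>();
            try (BufferedReader r = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
                String line;
                while ((line = r.readLine()) != null) lines.add(line);
            }
            p.waitFor();
            return lines;
        }
    }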
It's possible with libgit2 backends:
https://github.com/libgit2/libgit2-backends/blob/master/sqlite/sqlite.c
I haven't done any measurements, but performance should suffer a bit. However, it's also more convenient (a single file for the entire repo history, a classic SQL query language, etc.).
Speaking for Git, you cannot use a different backend with the official binaries. The libgit2 project, however, allows you to use different backends to store the database, but you'll have to build all the binaries you wish to use for committing, merging, pushing, pulling, rebasing, etc. Also, you won't be able to modify your repository with the official binaries; you'll have to push it to a standard repo first.

Object serialization practical uses?

How many of the software projects you have worked on used object serialization? I personally never came across a scenario where it was used. One use case I can think of is server software storing objects to disk to save memory. Are there other types of software where object serialization is essential or preferred over a database?
I've used object serialization in a lot of my projects. Sometimes we use it to store computer-specific settings locally. I have also used XML serialization to simplify interaction and generation of XML documents. It is also very beneficial in communication protocols. Serialize on one end and re-inflate on the other end.
Well, converting objects to XML or JSON is a form of serialization that is quite common on the web. I've also worked on a project where objects were created and serialized to a binary file in one application and then imported into another custom application (though that's fragile since it uses C# and serialization has broken in the past between versions of the .NET framework). Also, application settings that have a complex structure may be useful to serialize. I also think remoting APIs use serialization to communicate. Basically, serialization in general is simply a way to store the states of your objects, and this has many different uses.
Here are a few uses I can think of:
Sending an object across the network; the most common example is serializing objects across a cluster.
Serializing an object for (a sort of) caching, i.e. saving its state in a file and reading it back later (a minimal sketch follows this list).
Serializing passive/huge data to a file to minimize memory consumption, reading it back whenever required.
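A minimal sketch of the file-based caching case using plain Java serialization (the Settings class is just an example):

    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    class Settings implements Serializable {
        private static final long serialVersionUID = 1L;
        String theme = "dark";
        int fontSize = 12;
    }

    public class SerializationDemo {
        public static void main(String[] args) throws Exception {
            File file = new File("settings.ser");

            // Save the object's state to a file...
            try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
                out.writeObject(new Settings());
            }

            // ...and read it back later.
            try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
                Settings restored = (Settings) in.readObject();
                System.out.println(restored.theme + " / " + restored.fontSize);
            }
        }
    }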
I'm using serialization to pass objects across a TCP socket. You put XmlSerializers on either side, and it parses your data into readily available objects. If you do a little ground work, you can get it so that you're basically passing objects back and forth, and it makes socket communication extremely easy, reducing it to nothing more than socket.Send(myObject);.
Interprocess communication is a biggie.
You can combine the DB and serialization, e.g. when you have to store an object with a lot of attributes (often dynamic, i.e. one object's attribute set will differ from another's) in a relational DB and you don't want to create a new column for each attribute.
We started out with a system that serialized all of the thousands of in-memory objects to disk every 15 minutes or so. When that started taking too long we switched over to a mixed mode of saving the objects into a relational db and pickle file (this was a python system btw). Eventually the majority of the data was stored in a relational database. Interestingly, the system was written in such a way that all of the application code couldn't care less what was going on down there. It was all done using XP and thousands of automated tests.
Document based applications such as word processors and vector graphics editors will often serialize the document model to disk when the user invokes the Save command. Serialization is often preferred over complex databases in these apps.
Using serialization saves you time each time you want to implement an import/export functionality.
Every time you need to export your system's data, create backups or store some kind of settings, you could use serialization instead and just save the state of the objects that represent the actual config, data or whatever else.
Only when you need a specific format for the exported/imported data does it make sense to build a custom parser and exporter/importer.
Serialization also copes well with change: whenever you change the structure of an object involved in the exchange functionality, it is automatically exportable and you don't have to change the logic behind your export/import parts.
We used it for backup & update functionality. Basically, serialized Hibernate objects were backed up, the DB schema was then altered by the update, and we delivered a helper class that "converted" the old objects to the new DB schema. This way we had a pretty solid update mechanism that wouldn't break easily and did an automatic backup at the same time.
I've used XML serialization heavily on one project. The technique was used to persist to the database data structures that had no common structure and therefore couldn't be stored directly. I also used serialization to keep application settings that could be changed at runtime separate.