Data replication or API Gateway Aggregation: which one to choose when using microservices? - asp.net-core

As an example, let's say that I'm building a simple social network. I currently have two services:
Identity, managing the users, their personal data (e-mail, password hashes, etc.), their public profiles (username), and authentication
Social, managing the users' posts, their friends and their feed
The Identity service can give the public profile of a user using its API at /api/users/{id}:
// GET /api/users/1 HTTP/1.1
// Host: my-identity-service
{
  "id": 1,
  "username": "cat_sun_dog"
}
The Social service can give a post with its API at /api/posts/{id}:
// GET /api/posts/5 HTTP/1.1
// Host: my-social-service
{
  "id": 5,
  "content": "Cats are great, dogs are too. But, to be fair, the sun is much better.",
  "authorId": 1
}
That's great, but my client, a web app, would like to show the post with the author's name, and it would preferably receive the following JSON data in a single REST request.
{
  "id": 5,
  "content": "Cats are great, dogs are too. But, to be fair, the sun is much better.",
  "author": {
    "id": 1,
    "username": "cat_sun_dog"
  }
}
I found two main ways to approach this.
Data replication
As described in Microsoft's guide for data and Microsoft's guide for communication between microservices, it's possible for a microservice to replicate the data it needs by setting up an event bus (such as RabbitMQ) and consuming events from other services:
And finally (and this is where most of the issues arise when building microservices), if your initial microservice needs data that's originally owned by other microservices, do not rely on making synchronous requests for that data. Instead, replicate or propagate that data (only the attributes you need) into the initial service's database by using eventual consistency (typically by using integration events, as explained in upcoming sections).
Therefore, the Social service can consume events produced by the Identity service such as UserCreatedEvent and UserUpdatedEvent. Then, the Social service can have in its very own database a copy of all the users, but only the required data (their Id and Username, nothing more).
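For illustration, a minimal sketch of such a consumer on the Social side (the event record, the IEventHandler<T> interface, ReplicatedUser and SocialDbContext are illustrative names rather than part of any particular bus library; EF Core is assumed for the local store, and a UserCreatedEvent handler would look the same):
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

// Hypothetical event contract published by the Identity service.
public record UserUpdatedEvent(int Id, string Username);

// Hypothetical handler abstraction; the real signature depends on the bus library
// (raw RabbitMQ client, MassTransit, CAP, ...).
public interface IEventHandler<TEvent>
{
    Task HandleAsync(TEvent @event);
}

// The Social service's local copy of a user: only the attributes it needs.
public class ReplicatedUser
{
    public int Id { get; set; }
    public string Username { get; set; }
}

public class SocialDbContext : DbContext
{
    public DbSet<ReplicatedUser> Users => Set<ReplicatedUser>();
}

public class UserUpdatedEventHandler : IEventHandler<UserUpdatedEvent>
{
    private readonly SocialDbContext _db;

    public UserUpdatedEventHandler(SocialDbContext db) => _db = db;

    public async Task HandleAsync(UserUpdatedEvent @event)
    {
        // Upsert the replicated user with only the replicated attributes.
        var user = await _db.Users.FindAsync(@event.Id);
        if (user is null)
        {
            _db.Users.Add(new ReplicatedUser { Id = @event.Id, Username = @event.Username });
        }
        else
        {
            user.Username = @event.Username;
        }
        await _db.SaveChangesAsync();
    }
}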
With this eventually consistent approach, the Social service now has all the required data for the UI, all in one request!
// GET /api/posts/5 HTTP/1.1
// Host: my-social-service
{
  "id": 5,
  "content": "Cats are great, dogs are too. But, to be fair, the sun is much better.",
  "author": {
    "id": 1,
    "username": "cat_sun_dog"
  }
}
Benefits:
Makes the Social service completely independent of the Identity service; it can work fine without it
Retrieving the data requires fewer network round trips
Provides data for cross-service validation (e.g. check if the given user exists)
Drawbacks and questions:
Takes some time for a change to propagate
The system is absolutely RUINED for some users if some messages fail to get through due to a disaster that fried all your replicated queues!
What if, one day, I need more data from the user, like their ProfilePicture?
What to do if I want to add a new service with the same replicated data?
API Gateway aggregation
As described in Microsoft's guide for data, it's possible to create an API gateway that aggregates data from two requests: one to the Social service, and another to the Identity service.
Therefore, we can have an API gateway action (/api/posts/{id}) implemented as such, in pseudo-code for ASP.NET Core:
[HttpGet("/api/posts/{id}")]
public async Task<IActionResult> GetPost(int id)
{
    var post = await _postService.GetPostById(id);
    if (post is null)
    {
        return NotFound();
    }

    var author = await _userService.GetUserById(post.AuthorId);

    return Ok(new
    {
        Id = post.Id,
        Content = post.Content,
        Author = new
        {
            Id = author.Id,
            Username = author.Username
        }
    });
}
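The _postService and _userService above would typically be thin HTTP clients injected into the gateway controller; one possible wiring in ASP.NET Core uses typed HttpClient registrations (a sketch; the interfaces, client classes, UserDto and base addresses are illustrative, and the post-service client would look analogous):
using System;
using System.Net.Http;
using System.Net.Http.Json;
using System.Threading.Tasks;

// In Program.cs (builder = WebApplication.CreateBuilder(args)):
builder.Services.AddHttpClient<IPostServiceClient, PostServiceClient>(client =>
    client.BaseAddress = new Uri("http://my-social-service"));
builder.Services.AddHttpClient<IUserServiceClient, UserServiceClient>(client =>
    client.BaseAddress = new Uri("http://my-identity-service"));

// The typed client just receives the configured HttpClient:
public record UserDto(int Id, string Username);

public interface IUserServiceClient
{
    Task<UserDto> GetUserById(int id);
}

public class UserServiceClient : IUserServiceClient
{
    private readonly HttpClient _http;

    public UserServiceClient(HttpClient http) => _http = http;

    public Task<UserDto> GetUserById(int id) =>
        _http.GetFromJsonAsync<UserDto>($"/api/users/{id}");
}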
Then, a client just uses the API gateway and gets all the data in one query, without any client-side overhead:
// GET /api/posts/5 HTTP/1.1
// Host: my-api-gateway
{
  "id": 5,
  "content": "Cats are great, dogs are too. But, to be fair, the sun is much better.",
  "author": {
    "id": 1,
    "username": "cat_sun_dog"
  }
}
Benefits:
Very easy to implement
Always gives up-to-date data
Gives a centralized place to cache API queries
Drawbacks and questions:
Increased latency: in this case, two sequential network round trips are required
The action breaks if the Identity service is down; this can be mitigated using the circuit breaker pattern, but the client still won't see the author's name (see the sketch after this list)
Unused data might still get queried and waste resources (but that's marginal most of the time)
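To illustrate both mitigations, here is a rough sketch (continuing the pseudo-code style from above) in which the gateway catches a failing Identity call, in practice typically via a resilience policy such as a Polly circuit breaker, and degrades the response instead of failing the whole request:
[HttpGet("/api/posts/{id}")]
public async Task<IActionResult> GetPost(int id)
{
    var post = await _postService.GetPostById(id);
    if (post is null)
    {
        return NotFound();
    }

    object author = null;
    try
    {
        // In practice this call would go through a resilience policy
        // (e.g. a Polly circuit breaker) so that a broken Identity service fails fast.
        var user = await _userService.GetUserById(post.AuthorId);
        author = new { user.Id, user.Username };
    }
    catch (HttpRequestException)
    {
        // Degrade gracefully: return the post without the author's details
        // instead of failing the whole request.
        author = new { Id = post.AuthorId, Username = (string)null };
    }

    return Ok(new { post.Id, post.Content, Author = author });
}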
Given those two options, aggregation at the API gateway and data replication on individual microservices using events, which one should be used in which situation, and how should they be implemented correctly?

In general, I strongly favor state replication via events in durable log-structured storage over services making synchronous (in the logical sense, even if executed in a non-blocking fashion) queries.
Note that all systems are, at a sufficiently high level, eventually consistent: because we don't stop the world to allow an update to a service to happen, there's always a delay from update to visibility elsewhere (including in a user's mind).
In general, if you lose your datastores, things get ruined. However, logs of immutable events give you active-passive replication for nearly free (you have a consumer of that log which replicates events to another datacenter): in a disaster you can make the passive side active.
If you need more events than you are already publishing, you just add a log. You can seed the log with a backfilled dump of synthesized events from the state before the log existed (e.g. dump out all the current ProfilePictures).
When you think of your event bus as a replicated log (e.g. by implementing it using Kafka), consumption of an event doesn't prevent arbitrarily many other consumers from coming along later (it's just incrementing your read-position in the log). So that allows for other consumers to come along and consume the log for doing their own remix. One of those consumers could be simply replicating the log to another datacenter (enabling that active-passive).
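As an illustration, assuming the bus is a Kafka topic consumed with the Confluent.Kafka .NET client (the names and addresses below are purely illustrative), a datacenter replicator is just one more consumer group tracking its own position in the same log:
using System.Threading;
using Confluent.Kafka;

class LogReplicator
{
    static void Main()
    {
        var config = new ConsumerConfig
        {
            BootstrapServers = "kafka:9092",            // illustrative address
            GroupId = "dc2-replicator",                 // its own group = its own read position in the log
            AutoOffsetReset = AutoOffsetReset.Earliest  // start from the beginning when no offset exists yet
        };

        using var consumer = new ConsumerBuilder<string, string>(config).Build();
        consumer.Subscribe("user-events");              // hypothetical topic/log name

        while (true)
        {
            var result = consumer.Consume(CancellationToken.None);

            // Forward the raw event to the passive datacenter. Other consumers reading
            // the same topic are unaffected; each one just tracks its own offset.
            ForwardToPassiveDatacenter(result.Message.Key, result.Message.Value);
        }
    }

    static void ForwardToPassiveDatacenter(string key, string value)
    {
        // Placeholder: produce to a mirrored topic, write to a remote store, etc.
    }
}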
Note that once you allow services to maintain their own views of the important bits of data from other services, you are in practice doing Command Query Responsibility Segregation (CQRS); it's thus a good idea to familiarize yourself with CQRS patterns.

Related

Gemfire spring example

The example at https://spring.io/guides/gs/caching-gemfire/ shows that if there is a cache miss, we have to fetch the data from a server and store in the cache.
Is this an example of Gemfire running as the Gemfire server or is it a Gemfire client? I thought a client would automatically fetch the data from a Server if there is a cache miss. If that is the case, would there ever be a cache miss for the client?
Regards,
Yash
First, I think you are missing the point of the core Spring Framework's Cache Abstraction. I encourage you to read more about the Cache Abstraction's intended purpose here.
In a nutshell, if one of your application objects makes a call to some "external", "expensive" service to access a resource, then caching may be applicable, especially if the inputs passed result in the exact same output every single time.
So, for a moment, let's imagine your application makes a call to the Geocoding API in the Google Maps API to translate addresses and (the inverse) latitude/longitude coordinates.
You might have an application Spring @Service component like so...
@Service("AddressService")
class MyApplicationAddressService {

    @Autowired
    private GoogleGeocodingApiDao googleGeocodingApiDao;

    @Cacheable("Address")
    public Address getAddressFor(Point location) {
        return googleGeocodingApiDao.convert(location);
    }
}

@Region("Address")
class Address {

    private Point location;
    private State state;
    private String street;
    private String city;
    private String zipCode;

    ...
}
Clearly, given a latitude/longitude (input), it should produce the same Address (result) every time. Also, since making a (network) call to an external API like Google's Geocoding service can be very expensive, both to access the resource and to perform the conversion, this type of service call is a perfect candidate for caching in our application.
Among many other caching providers (e.g. EhCache, Hazelcast, Redis, etc.), you can, of course, use Pivotal GemFire, or the open source alternative, Apache Geode, to back Spring's Caching Abstraction.
In your Pivotal GemFire/Apache Geode setup, you can of course use either the peer-to-peer (P2P) or the client/server topology; it doesn't really matter, and GemFire/Geode will do the right thing once "called upon".
But, as the Spring Cache Abstraction documentation states, when you make a call to one of your application component's methods (e.g. getAddressFor(:Point)) that supports caching (with @Cacheable), the interceptor will first "consult" the cache before making the method call. If the value is present in the cache, then that value is returned and the "expensive" method call (e.g. getAddressFor(:Point)) will not be invoked.
However, if there is a cache miss, then Spring will proceed with invoking the method, and upon successful return from the method invocation, cache the result of the call in the backing cache provider (such as GemFire/Geode), so that the next time the method is invoked with the same input, the cached value will be returned.
Now, if your application is using the client/server topology, then of course the client cache will forward the request on to the server if...
The corresponding client Region is a PROXY, or...
The corresponding client Region is a CACHING_PROXY, and the client's local client-side Region does not contain the requested Point for the Address.
I encourage you to read more about different client Region data management policies here.
To see another working example of Spring's Caching Abstraction backed by Pivotal GemFire in Action, have a look at...
caching-example
I used this example in my SpringOne-2015 talk to explain caching with GemFire/Geode as the caching provider. This particular example makes an external request to a REST API to get the "Quote of the Day".
Hope this helps!
Cheers,
John

Why providing pagination information in API response?

We are now designing our RESTful API and I have a question about how to expose the pagination information.
It seems some well-known services like GitHub or the Firefox Marketplace have something like the following in their APIs:
{
  "meta": {
    "limit": 3,
    "next": "/api/v1/apps/category/?limit=3&offset=6",
    "offset": 3,
    "previous": "/api/v1/apps/category/?limit=3&offset=0",
    "total_count": 16
  }
}
My question is:
Why should the server give the complete next/previous url in the response?
It seems to me that the client is making the first request. So it knows what parameters it used to call (offset/limit/api version). It is easy for the client to figure out what the next/previous url to call is. Why bother to calculate the redundant urls and give them to the client?
It's RESTful! That is specifically part of HATEOAS, or Hypermedia as the Engine of Application State.
Except for simple fixed entry points to the application, a client does not assume that any particular action is available for any particular resources beyond those described in representations previously received from the server.
and:
[HATEOAS] is a constraint of the REST application architecture that distinguishes it from most other network application architectures. The principle is that a client interacts with a network application entirely through hypermedia provided dynamically by application servers. A REST client needs no prior knowledge about how to interact with any particular application or server beyond a generic understanding of hypermedia.
...
A REST client enters a REST application through a simple fixed URL. All future actions the client may take are discovered within resource representations returned from the server.
(Emphasis added)
It seems to me that the client is making the first request. So it knows what parameters it used to call (offset/limit/api version).
Yes, the client makes the first request, but that doesn't mean it knows anything about URL discovery, pagination, limit/offset parameters, etc.
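To make that concrete, a paging client only ever follows the links the server hands back; here is a minimal C# sketch (the host, the entry-point path and the response items are illustrative, based on the example above) using HttpClient and System.Text.Json:
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

class PagingClient
{
    static async Task Main()
    {
        using var http = new HttpClient { BaseAddress = new Uri("https://api.example.com") };

        // Only the entry point is hard-coded; every subsequent page URL comes from "meta.next".
        string next = "/api/v1/apps/category/?limit=3";
        while (next != null)
        {
            using var doc = JsonDocument.Parse(await http.GetStringAsync(next));

            // ... process the page's items here ...

            // GetString() returns null when "next" is JSON null, which ends the loop.
            next = doc.RootElement.GetProperty("meta").GetProperty("next").GetString();
        }
    }
}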

Need some advice for a web service API?

My company has a product that I feel can benefit from a web service API. We are using MSMQ to route messages back and forth through the backend system. Currently we are building an ASP.Net application that communicates with a web service (WCF) that, in turn, talks to MSMQ for us. Later on down the road, we may have other client applications (not necessarily written in .Net). The message going into MSMQ is an object that has a property made up of an array of strings. There is also a property that contains the command (a string) that will be routed through the system. Personally, I am not a huge fan of this, but I was told it is for scalability and every system can use strings.
My thought, regarding the web services was to model some objects based on our data that can be passed into and out of the web services so they are easily consumed by the client. Initially, I was passing the message object, mentioned above, with the array of strings in it. I was finding that I was creating objects on the client to represent that data, making the client responsible for creating those objects. I feel the web service layer should really be handling this. That is how I have always worked with services. I did this so it was easier for me to move data around the client.
It was recommended to our group that we maintain the “single entry point” into the system by offering an object that contains commands and having one web service take care of everything. So, the web service would have one method in it, let’s call it MakeRequest, and it would return an object (either serialized XML or JSON). The suggestion was to have a base object that may contain some sort of list of commands that other objects can inherit from. Any other object may have its own command structure, but still inherit the base commands. What is passed back from the service is not clear right now, but it could be that “message object” with an object attached to it representing the data. I don’t know.
My recommendation was to model our objects after our actual data and create services for the types of data we are working with. We would create a base service interface that would house any common methods used by all services, for example GetById, GetByName, GetAll, Save, etc. Anything specific to a given service would be implemented in that specific implementation. So a User service may have a method GetUserByUsernameAndPassword, but since it implements the base interface it would also contain the “base” methods. We would have several methods in a service that would return the type of object expected, based on the service being called. We could house everything in one service, but I still would like to get something back that is more usable. I feel this approach keeps the client from having to make decisions about what commands to pass. When I connect to a User service and call the method GetById(int id), I would expect to get back a User object.
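As a rough sketch of that base-interface idea (the entity and method names are illustrative, not our actual contracts):
using System.Collections.Generic;

public interface IEntityService<T>
{
    T GetById(int id);
    T GetByName(string name);
    IList<T> GetAll();
    void Save(T entity);
}

// A specific service only adds what is unique to its entity,
// but still exposes the common base operations.
public interface IUserService : IEntityService<User>
{
    User GetUserByUsernameAndPassword(string username, string password);
}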
I had the luxury of working with MS when I started developing WCF services. So, I have a good foundation and understanding of the technology, but I am not the one designing it this time.
So, I am not opposed to the “single entry point” idea, but any thoughts about why either approach is more scalable than the other would be appreciated. I have never worked with such a systematic approach to a service layer before. Maybe I need to get over that?
I think there are merits to both approaches.
Typically, if you are writing an API that is going to be consumed by a completely separate group of developers (perhaps in another company), then you want the API to be as self-explanatory and discoverable as possible. Having specific web service methods that return specific objects is much easier to work with from the consumer's perspective.
However, many companies use web services as one of many layers to their applications. In this case, it may reduce maintenance to have a generic API. I've seen some clever mechanisms that require no changes whatsoever to the service in order to add another column to a table that is returned from the database.
My personal preference is for the specific API. I think that the specific methods are much easier to work with - and are largely self-documenting. The specific operation needs to be executed at some point, so why not expose it for what it is? You'd get laughed at if you wrote:
public void MyApiMethod(string operationToPerform, params object[] args)
{
    switch (operationToPerform)
    {
        case "InsertCustomer":
            InsertCustomer(args);
            break;
        case "UpdateCustomer":
            UpdateCustomer(args);
            break;
        ...
        case "Juggle5BallsAtOnce":
            Juggle5BallsAtOnce(args);
            break;
    }
}
So why do that with a Web Service? It'd be much better to have:
public void InsertCustomer(Customer customer)
{
    ...
}

public void UpdateCustomer(Customer customer)
{
    ...
}

...

public void Juggle5BallsAtOnce(bool useApplesAndEatThemConcurrently)
{
    ...
}
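In WCF terms, the specific approach maps naturally onto explicit service and data contracts; a sketch with illustrative names (Customer is just an example entity):
using System.Runtime.Serialization;
using System.ServiceModel;

[DataContract]
public class Customer
{
    [DataMember] public int Id { get; set; }
    [DataMember] public string Name { get; set; }
}

[ServiceContract]
public interface ICustomerService
{
    [OperationContract]
    Customer GetById(int id);

    [OperationContract]
    void InsertCustomer(Customer customer);

    [OperationContract]
    void UpdateCustomer(Customer customer);
}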

How to handle multiple storage backends transparently

I'm working with an application right now that uses a third-party API for handling some batch email-related tasks, and in order for that to work, we need to store some information in this service. Unfortunately, this information (first/last name, email address) is also something we want to use from our application. My normal inclination is to pick one canonical data source and stick with it, but round-tripping to a web service every time I want to look up these fields isn't really a viable option (we use some of them quite a bit), and the service's API requires the records to be stored there, so the duplication is sadly necessary.
But I have no interest in peppering every method throughout our business classes with code to synchronize data to the web service any time they might be updated, and I also don't think my entity should be aware of the service to update itself in a property setter (or whatever else is updating the "truth").
We use NHibernate for all of our DAL needs, and to my mind, this data replication is really a persistence issue - so I've whipped up a PoC implementation using an EventListener (both PostInsert and PostUpdate) that checks whether the entity is of type X and whether any of the fields [Y..Z] have changed, and if so updates the web service with the new state.
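For illustration, a minimal sketch of such a listener, assuming NHibernate's IPostUpdateEventListener (older, synchronous-only versions; newer versions also require an async overload), a hypothetical Person entity and a hypothetical IEmailServiceClient wrapper around the third-party API:
using System.Linq;
using NHibernate.Event;

public class ContactSyncListener : IPostUpdateEventListener
{
    // Hypothetical wrapper around the third-party email service's API.
    private readonly IEmailServiceClient _emailService;

    public ContactSyncListener(IEmailServiceClient emailService)
    {
        _emailService = emailService;
    }

    // Registered at configuration time, e.g. via cfg.AppendListeners(ListenerType.PostUpdate, ...).
    public void OnPostUpdate(PostUpdateEvent @event)
    {
        var person = @event.Entity as Person; // hypothetical entity type ("type X")
        if (person == null || @event.OldState == null)
            return;

        // Only push to the web service when one of the replicated fields changed.
        var watched = new[] { "FirstName", "LastName", "EmailAddress" };
        var propertyNames = @event.Persister.PropertyNames;

        bool changed = Enumerable.Range(0, propertyNames.Length)
            .Where(i => watched.Contains(propertyNames[i]))
            .Any(i => !Equals(@event.OldState[i], @event.State[i]));

        if (changed)
        {
            _emailService.UpsertContact(person.FirstName, person.LastName, person.EmailAddress);
        }
    }
}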
I feel like this strikes a good balance between ensuring that our data is the canonical source, making sure that it gets replicated transparently, and minimizing the chances for changes to fall through the cracks and get us into a mismatch situation (not the end of the world if e.g. the service is unreachable, we just do a manual batch update later, but for everybody's sanity in the general case, the goal is that we never have to think about it), but my colleagues and I still have a degree of discomfort with this way forward.
Is this a horrid idea that will invite raptors into my database at inopportune times? Is it a totally reasonable thing to do with an EventListener? Is it a serviceable solution to a less-than-ideal situation that we can just make do with and move on forever tainted? If we soldier on down this road, are there any gotchas I should be wary of in the Events pipeline?
In the case of unreliable data stores (the web service, in your case), I would introduce the concept of transactions (operations) and store them in a local database, then periodically pull them from the DB and execute them against the web service (the other data store).
Something like this:
public enum Operation
{
    Insert,
    Update,
    Delete
}

public class OperationContainer
{
    public Operation Operation; // whatever operations you need: CRUD, or something specific
    public object Data;         // your entity, business object or whatever
}

public class MyMailService
{
    public void SendMail(MailBusinessObject data)
    {
        DataAccessLair<MailBusinessObject>.Persist(data);
        var operation = new OperationContainer { Operation = Operation.Insert, Data = data };
        DataAccessLair<OperationContainer>.Persist(operation);
    }
}

public class Updater
{
    Timer everySec;

    public void OnEverySec()
    {
        var operation = DataAccessLair<OperationContainer>.GetFirstIn(); // FIFO
        var webServiceData = WebServiceData.Convert(operation);          // prepare the data for the web service
        try
        {
            new WebService().DoSomething(webServiceData);
            DataAccessLair<OperationContainer>.Remove(operation);        // remove only once the call succeeded
        }
        catch
        {
            // Leave the operation in the local queue; it will be retried on the next tick.
        }
    }
}
This is actually pretty close to the concept of a smart client - technically, not logically. Take a look at the book .NET Domain-Driven Design with C#: Problem-Design-Solution, chapter 10. Or take a look at the source code from the book; it's pretty close to your situation: http://dddpds.codeplex.com/

Best way to share data between .NET application instance?

I have created a WCF service (hosted in a Windows service) on a load-balanced server. Each service instance maintains a list of current users. E.g. instance A has users A001, A002, A005, instance B has users A003, A004, A008 and so on.
Each service has an interface used to get the user list, and I expect this method to return all users across all service instances. E.g. getting the user list from instance A or instance B will return A001, A002, A003, A004, A005 and A008.
Currently I'm thinking of storing the list of current users in a database, but this list seems to update very often.
I want to know: is there another way to share data between WCF services that suits my situation?
Personally, the database option sounds like overkill to me just based on the notion of storing current users. If you are actually storing more than that, then using a database may make sense. But assuming you simply want a list of current users from both instances of your WCF service, I would use an in-memory solution, something like a static generic dictionary. As long as the services can be uniquely identified, I'd use the unique service ID as the key into the dictionary and just pair each key with a generic list of user names (or some appropriate user data structure) for that service. Something like:
private static Dictionary<Guid, List<string>> _currentUsers;
Since this dictionary would be shared between two WCF services, you'll need to synchronize access to it. Here's an example.
using System;
using System.Collections;
using System.Collections.Generic;

public class MyWCFService : IMyWCFService
{
    private static Dictionary<Guid, List<string>> _currentUsers =
        new Dictionary<Guid, List<string>>();

    private void AddUser(Guid serviceID, string userName)
    {
        // Synchronize access to the collection via the SyncRoot property.
        lock (((ICollection)_currentUsers).SyncRoot)
        {
            // Check if the service's ID has already been added.
            if (!_currentUsers.ContainsKey(serviceID))
            {
                _currentUsers[serviceID] = new List<string>();
            }

            // Make sure to only store the user name once for each service.
            if (!_currentUsers[serviceID].Contains(userName))
            {
                _currentUsers[serviceID].Add(userName);
            }
        }
    }

    private void RemoveUser(Guid serviceID, string userName)
    {
        // Synchronize access to the collection via the SyncRoot property.
        lock (((ICollection)_currentUsers).SyncRoot)
        {
            // Check if the service's ID has already been added.
            if (_currentUsers.ContainsKey(serviceID))
            {
                // See if the user name exists.
                if (_currentUsers[serviceID].Contains(userName))
                {
                    _currentUsers[serviceID].Remove(userName);
                }
            }
        }
    }
}
Given that you don't want users listed twice for a specific service, it would probably make sense to replace the List<string> with HashSet<string>.
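For reference, a sketch of that variant; HashSet<string>.Add already ignores duplicates, so the explicit Contains check goes away:
private static Dictionary<Guid, HashSet<string>> _currentUsers =
    new Dictionary<Guid, HashSet<string>>();

private void AddUser(Guid serviceID, string userName)
{
    lock (((ICollection)_currentUsers).SyncRoot)
    {
        if (!_currentUsers.ContainsKey(serviceID))
        {
            _currentUsers[serviceID] = new HashSet<string>();
        }
        _currentUsers[serviceID].Add(userName); // duplicates are ignored by the HashSet
    }
}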
A database would seem to offer a persistent store which may be useful or important for your application. In addition it supports transactions etc which may be useful to you. Lots of updates could be a performance problem, but it depends on the exact numbers, what the query patterns are, database engine used, locality etc.
An alternative to this option might be some sort of in-memory caching server like memcached. Whilst this can be shared and accessed in a similar (sort of) way to a database server, there are some caveats. First, these platforms are generally not backed by some sort of permanent storage. What happens when the memcached server dies? Second, they may not be ACID-compliant enough for your use. What happens under load in terms of additions and updates?
I like the in-memory approach. Actually, I am designing the same mechanism for one of the projects I'm working on now. This is good for scenarios where you don't have the opportunity to access a database, or some people are really reluctant to create a table to store simple info like a list of users against a machine name.
The only change I'd make is that a node would return only the list of its own users to its peer, and the peer would combine that with its existing list, then return its existing list to the peer who called. That's how all the peers would stay in sync with the same list.
The DB option sounds good. If there are no performance issues, it is a simple design that should work. If you can afford to be semi-real-time and non-persistent, one way would be to maintain the list in memory in each service and have each service update the others when a new user joins. This can be done as some kind of broadcast via a centralised service, or using MSMQ, etc.
If you reconsider and host in IIS, you will find that with a single line in a config file you can make the ASP.NET Global, Application and Session objects available. This trick is also very handy because it means you can share session state between an ASP.NET application and a WCF service.
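Presumably the config switch referred to here is WCF's ASP.NET compatibility mode; a hedged sketch of what that looks like (service and contract names are illustrative):
// In web.config:
// <system.serviceModel>
//   <serviceHostingEnvironment aspNetCompatibilityEnabled="true" />
// </system.serviceModel>

using System.Collections.Generic;
using System.ServiceModel.Activation;
using System.Web;

[AspNetCompatibilityRequirements(RequirementsMode = AspNetCompatibilityRequirementsMode.Allowed)]
public class MyWCFService : IMyWCFService
{
    public void AddUser(string userName)
    {
        // With compatibility mode enabled and the service hosted in IIS,
        // HttpContext.Current (and with it Application/Session state) is available here.
        var app = HttpContext.Current.Application;
        app.Lock();
        try
        {
            var users = (List<string>)app["CurrentUsers"] ?? new List<string>();
            users.Add(userName);
            app["CurrentUsers"] = users;
        }
        finally
        {
            app.UnLock();
        }
    }
}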