Data Model/Schema decoupling in a Data Processing Pipeline using Event-Driven Architecture

I was wondering how microservices in a streaming pipeline based on Event-Driven Architecture can be truly decoupled from the data model perspective. We have implemented a data processing pipeline using Event-Driven Architecture where the data model is very critical. Although all the microservices are decoupled from the business perspective, they are not truly decoupled, because the data model is shared across all the services.
In the ingestion pipeline, we collect data from multiple sources, each with a different data model. Hence, a normalizer microservice is required to normalize those data models into a common data model that can be used by downstream consumers. The challenge is that the data model can change for any reason, and we should be able to manage that change easily. However, that level of change can break the consumer applications and can easily introduce a cascade of modifications across all the microservices.
Is there any solution or technology that can truly decouple microservices in this scenario?

This problem is solved by carefully designing the data model to ensure backward and forward compatibility. Such design is important for independent evolution of services, rolling upgrades, and so on. A data model is said to be backward compatible if a new client (using the new model) can read/write data written by another client (using the old model). Similarly, forward compatibility means a client (using the old data model) can read/write data written by another client (using the new data model).
Let's say a Person object is shared across services in a JSON-encoded format. Now one of the services introduces a new field, alternateContact. A service consuming this data and using the old data model can simply ignore this new field and continue its operation. If you're using the Jackson library, you'd use @JsonIgnoreProperties(ignoreUnknown = true). Thus the consuming service is designed for forward compatibility.
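For illustration, here is a minimal sketch of such a forward-compatible consumer; the Person class and its fields are made up for this example:

```java
import com.fasterxml.jackson.annotation.JsonIgnoreProperties;
import com.fasterxml.jackson.databind.ObjectMapper;

// Old data model: knows nothing about the new alternateContact field.
@JsonIgnoreProperties(ignoreUnknown = true)
class Person {
    public String name;
    public String contact;
}

public class ForwardCompatDemo {
    public static void main(String[] args) throws Exception {
        // JSON produced by a service using the new model (extra field).
        String json = "{\"name\":\"Jane\",\"contact\":\"555-1234\",\"alternateContact\":\"555-9999\"}";

        ObjectMapper mapper = new ObjectMapper();
        // Deserialization succeeds; the unknown field is silently dropped.
        Person p = mapper.readValue(json, Person.class);

        // Caution: writing p back re-serializes only the known fields,
        // so alternateContact is lost (the data-loss case described below).
        System.out.println(mapper.writeValueAsString(p));
    }
}
```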
The problem arises when a service using the old data model deserializes Person data written with the new model, updates one or more field values, and writes the data back. Since the unknown properties were ignored, the write results in data loss.
Fortunately, binary encoding formats such as Protocol Buffers (version 3.5 and later) preserve unknown fields when data is deserialized with the old model. Thus, when you serialize the data back, the new fields remain intact.
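A rough sketch of that behavior with the Java protobuf API (3.5+); PersonV1 and PersonV2 are assumed, illustrative generated classes rather than anything from the original pipeline:

```java
// Assumed proto3 definitions, compiled with protoc:
//
//   message PersonV1 {              // old model
//     string name              = 1;
//     string contact           = 2;
//   }
//
//   message PersonV2 {              // new model
//     string name              = 1;
//     string contact           = 2;
//     string alternate_contact = 3;
//   }

public class UnknownFieldsDemo {
    public static void main(String[] args) throws Exception {
        byte[] written = PersonV2.newBuilder()
                .setName("Jane")
                .setContact("555-1234")
                .setAlternateContact("555-9999")
                .build()
                .toByteArray();

        // A service using the old model reads and updates the record.
        // Field 3 is unknown to PersonV1, but it is retained, not dropped.
        PersonV1 old = PersonV1.parseFrom(written);
        PersonV1 updated = old.toBuilder()
                .setContact("555-0000")   // modify a known field
                .build();

        // The unknown field survives the round trip.
        PersonV2 roundTripped = PersonV2.parseFrom(updated.toByteArray());
        System.out.println(roundTripped.getAlternateContact()); // prints 555-9999
    }
}
```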
There may be other data model evolutions you need to deal with, such as field removal and field renaming. The basic idea is that you need to be aware of and plan for these possibilities early in the design phase. Common data encoding formats include JSON, Apache Thrift, Protocol Buffers, and Avro.


How to implement a dual-write system for a SQL database and Elasticsearch

I did my own research and found that there are several ways to do this, but the most accurate is Change Data Capture (CDC). However, I don't see its benefits compared to the asynchronous method, for example:
Synchronous double-write: Elasticsearch is updated synchronously when the DB is updated. This technical solution is the simplest, but it faces the largest number of problems, including data conflicts, data overwriting, and data loss. Make your choice carefully.
Asynchronous double-write: When the DB is updated, an MQ message is recorded and used to notify the consumer. This allows the consumer to query the DB afterwards so that the data is ultimately updated to Elasticsearch. This technical solution is highly coupled with business systems. Therefore, you need to write programs specific to the requirements of each business. As a result, rapid response is not possible.
Change Data Capture (CDC): Change data is captured from the DB, pushed to an intermediate program, and synchronously pushed to Elasticsearch by using the logic of the intermediate program. Based on the CDC mechanism, accurate data is returned at an extremely fast speed in response to queries. This solution is less coupled to application programs. Therefore, it can be abstracted and separated from business systems, making it suitable for large-scale use.
(Source: Alibabacloud.com)
Another article says that the asynchronous approach is also risky: if one datasource is down, we cannot easily roll back.
https://thorben-janssen.com/dual-writes/
So my question is: should I use CDC to perform persistence operations across multiple datasources? Why is CDC better than the asynchronous approach, given that it is based on the same principle?
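To make the CDC option above concrete: a change-capture tool (Debezium is a common example) streams row-level change events into a log such as Kafka, and a small indexer applies them to Elasticsearch. The following is only a rough sketch; the topic name, event shape, and the 7.x Elasticsearch high-level REST client are all assumptions:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.http.HttpHost;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.xcontent.XContentType;

public class CdcToElasticsearchIndexer {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "es-indexer");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
             RestHighLevelClient es = new RestHighLevelClient(
                     RestClient.builder(new HttpHost("localhost", 9200, "http")))) {

            // Topic name follows the usual Debezium convention <server>.<schema>.<table>;
            // adjust to whatever your CDC tool actually emits.
            consumer.subscribe(List.of("dbserver1.inventory.products"));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // The change event payload is indexed as-is here; in practice you
                    // would unwrap the CDC envelope and handle updates/deletes explicitly.
                    IndexRequest request = new IndexRequest("products")
                            .id(record.key())
                            .source(record.value(), XContentType.JSON);
                    es.index(request, RequestOptions.DEFAULT);
                }
                consumer.commitSync();
            }
        }
    }
}
```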

Operations performed on the communications between the server and clients

Part of federated learning research is based on operations performed on the communication between the server and clients, such as dropping part of the updates (dropping some gradients describing a model) exchanged between clients and the server, or discarding an update from a specific client in a certain communication round. I want to know if such capabilities are supported by the TensorFlow Federated (TFF) framework and how they are supported, because at first glance it seems to me that the level of abstraction of the TFF API does not allow such operations. Thank you.
TFF's language design intentionally avoids a notion of client identity; there is a desire to avoid making a "Client X" addressable and discarding its update or sending it different data.
However, there may be a way to run simulations of the type of computations mentioned. TFF does support expressing the following:
Computations that condition on properties of tensors, for example ignoring an update that has NaN values. One way this could be accomplished would be by writing a tff.tf_computation that conditionally zeros out the weight of updates before tff.federated_mean. This technique is used in tff.learning.build_federated_averaging_process().
Simulations that run different computations on different sets of clients (where a set may be a single client). Since the reference executor parameterizes clients by the data they possess, a writer of TFF could write two tff.federated_computations, apply them to different simulation data, and combine the results.

Data updates with featuretools / DFS

The ML 2.0 and AI PM papers imply that updating data, which could be either existing data or new data, happens dynamically (in real time). For example, the AI PM paper says, "Rather, we have demonstrated a complete system that works in the real world, on continually updating live data."
Do you mean the updated data is automatically pre-processed into appropriate feature vectors and included in the next model re-training cycle? Or is the model being updated dynamically?
In this case, the data update means new data is automatically appended to existing data and then transformed into new feature vectors. These feature vectors can be used to retrain the model or score using an existing model.
The automated part is that feature engineering on the new data may require historical data to compute, so the Featuretools APIs aim to abstract that away from the developer as much as possible. This is achieved using the EntitySet.concat(..) method.

Data model design guidelines with GEODE

We are soon going to start something with GEODE regarding reference data. I would like to get some guidelines for this.
As you know, in the financial reference data world there exist complex relationships between various reference data entities such as Instrument, Account, Client, etc., which might be stored in a database in 3NF.
If my queries are mostly read-intensive and require joins across tables (2-5 tables), what's the best way to handle this in an in-memory grid?
Case 1:
Separate regions for all tables in your database, and then do a similar join using OQL as you do in the database?
Even if you do so, you will have to design it carefully so that related entities are always co-located within the same partition.
Model 1-to-many and many-to-many relationships using an object graph?
Case 2:
If you know what your join queries look like, create a view model per join query with equi-join characteristics.
Confusion:
(1) I have one join query requiring Employee and Department using emp.deptId = dept.deptId [OK, fantastic, one region with such a view model exists]
(2) I have another join query requiring Employee, Department, Salary, and Address joins to address a different requirement.
So again I have to create a view model to address (2), which will contain Employee and Department data similar to (1). This may soon hit the memory threshold.
Changes in the database can still be managed by event listeners, but what are the recommendations for that?
Thanks,
Dharam
I think your general question is pretty broad and there isn't just one recommended approach to cover all use cases (primarily all your analytical views/models of your data as required by your application(s)).
Such questions involve many factors, such as the size of individual data elements, the volume of data, the frequency of access or access patterns originating from the application or applications, the timely delivery of information, how accurate the data needs to be, the size of your cluster, the physical resources of each (virtual) machine, and so on. Thus, any given approach will undoubtedly require application tuning, tuning GemFire accordingly and JVM tuning regardless of your data model. Still, a carefully crafted data model can determine the extent of such tuning.
In GemFire specifically, such tuning will involve different configuration such as, but not limited to: data management policies, eviction (Overflow) and expiration (LRU, or perhaps custom) settings along with different eviction/expiration thresholds, maybe storing data in Off-Heap memory, employing different partition strategies (PartitionResolver), and so on and so forth.
For example, if your Address information is relatively static, unchanging (i.e. actual "reference" data) then you might consider storing Address data in a REPLICATE Region. Data that is written to frequently (typically "transactional" data) is better off in a PARTITION Region.
Of course, as you know, any PARTITION data (managed in separate Regions) you "join" in a query (using OQL) must be collocated. GemFire/Geode does not currently support distributed joins.
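To make that concrete, here is a hedged sketch using the Geode Java API: Address as a REPLICATE region, and Department and Employee as colocated PARTITION regions routed by deptId. The entity names come from the question; the key format used by the resolver is an assumption made for this sketch.

```java
import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.EntryOperation;
import org.apache.geode.cache.PartitionAttributesFactory;
import org.apache.geode.cache.PartitionResolver;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.RegionShortcut;

// Routes Employee and Department entries with the same deptId to the same bucket,
// which is what makes an OQL equi-join across the two regions possible.
class DeptPartitionResolver implements PartitionResolver<String, Object> {
    @Override
    public Object getRoutingObject(EntryOperation<String, Object> op) {
        // Assumes keys look like "<deptId>|<entityId>" (or just "<deptId>" for Department).
        return op.getKey().split("\\|")[0];
    }
    @Override public String getName() { return "DeptPartitionResolver"; }
    @Override public void close() {}
}

public class ReferenceDataRegions {
    public static void main(String[] args) {
        Cache cache = new CacheFactory().create();

        // Static reference data: replicate it to every node.
        Region<String, Object> address = cache
                .<String, Object>createRegionFactory(RegionShortcut.REPLICATE)
                .create("Address");

        // Frequently written data: partition it, and colocate Employee with Department.
        Region<String, Object> department = cache
                .<String, Object>createRegionFactory(RegionShortcut.PARTITION)
                .setPartitionAttributes(new PartitionAttributesFactory<String, Object>()
                        .setPartitionResolver(new DeptPartitionResolver())
                        .create())
                .create("Department");

        Region<String, Object> employee = cache
                .<String, Object>createRegionFactory(RegionShortcut.PARTITION)
                .setPartitionAttributes(new PartitionAttributesFactory<String, Object>()
                        .setPartitionResolver(new DeptPartitionResolver())
                        .setColocatedWith("Department")
                        .create())
                .create("Employee");

        // The equi-join itself; for partitioned regions this has to be executed
        // server-side via the function service (see the function sketch further below).
        String oql = "SELECT e.name, d.name FROM /Employee e, /Department d "
                   + "WHERE e.deptId = d.deptId";
    }
}
```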
Additionally, certain nodes could host certain Regions, thus dividing your cluster into "transactional" vs. "analytical" nodes, where the analytical-based nodes are updated from CacheListeners on Regions in transactional nodes (be careful of this), or perhaps better yet, asynchronously using an AEQ with AsyncEventListeners. AEQs can be separately made highly available and durable as well. This transactional vs analytical approach is the basis for CQRS.
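A minimal sketch of that asynchronous wiring; the listener body and region names are illustrative only:

```java
import java.util.List;

import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.Region;
import org.apache.geode.cache.RegionShortcut;
import org.apache.geode.cache.asyncqueue.AsyncEvent;
import org.apache.geode.cache.asyncqueue.AsyncEventListener;
import org.apache.geode.cache.asyncqueue.AsyncEventQueue;

public class AnalyticalFeedSetup {

    // Receives batches of changes asynchronously, off the critical write path.
    static class AnalyticalModelUpdater implements AsyncEventListener {
        @Override
        public boolean processEvents(List<AsyncEvent> events) {
            for (AsyncEvent event : events) {
                // Re-shape the entry into the analytical view model here.
                System.out.println("update analytical view for key " + event.getKey());
            }
            return true; // tell the queue the batch was handled
        }
        @Override public void close() {}
    }

    public static void main(String[] args) {
        Cache cache = new CacheFactory().create();

        AsyncEventQueue queue = cache.createAsyncEventQueueFactory()
                .setPersistent(true)   // survive restarts (uses the default disk store)
                .setParallel(true)     // process per bucket-hosting member
                .create("employee-to-analytics", new AnalyticalModelUpdater());

        Region<String, Object> employee = cache
                .<String, Object>createRegionFactory(RegionShortcut.PARTITION)
                .addAsyncEventQueueId(queue.getId())
                .create("Employee");
    }
}
```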
The size of your data is also impacted by the form in which it is stored, i.e. serialized vs. not serialized, and GemFire's proprietary serialization format (PDX) is quite optimal compared with Java Serialization. It all depends on how "portable" your data needs to be and whether you can keep your data in serialized form.
Also, you might consider how expensive it is to join the data on the fly. Meaning, if you are able to aggregate, transform and enrich data at runtime relatively cheaply (compute vs. memory/storage), then you might consider using GemFire's Function Execution service, bringing your logic to the data rather than the data to your logic (the fundamental basis of MapReduce).
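A rough sketch of that pattern, running the earlier equi-join inside a function so it executes against the local, colocated buckets on each hosting member (region and field names again taken from the question):

```java
import java.util.ArrayList;

import org.apache.geode.cache.CacheFactory;
import org.apache.geode.cache.execute.Function;
import org.apache.geode.cache.execute.FunctionContext;
import org.apache.geode.cache.execute.RegionFunctionContext;
import org.apache.geode.cache.query.Query;
import org.apache.geode.cache.query.QueryService;
import org.apache.geode.cache.query.SelectResults;

// Runs the Employee/Department equi-join on the member's local colocated buckets.
public class EmployeeDepartmentJoinFunction implements Function<Object> {

    @Override
    public void execute(FunctionContext<Object> context) {
        RegionFunctionContext rfc = (RegionFunctionContext) context;
        QueryService queryService = CacheFactory.getAnyInstance().getQueryService();
        Query query = queryService.newQuery(
                "SELECT e.name, d.name FROM /Employee e, /Department d "
              + "WHERE e.deptId = d.deptId");
        try {
            // Passing the function context scopes the query to local colocated data.
            SelectResults<?> results = (SelectResults<?>) query.execute(rfc);
            context.getResultSender().lastResult(new ArrayList<>(results.asList()));
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    @Override public boolean hasResult() { return true; }
    @Override public String getId() { return "EmployeeDepartmentJoinFunction"; }
}

// Caller side (client or peer); getResult() aggregates the per-member lists:
//   Object joined = FunctionService
//           .onRegion(employeeRegion)
//           .execute(new EmployeeDepartmentJoinFunction())
//           .getResult();
```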
You should know, and I am sure you are aware, GemFire is a Key-Value store, therefore mapping a complex object graph into separate Regions is not a trivial problem. Dividing objects up by references (especially many-to-many) and knowing exactly when to eagerly vs. lazily load them is an overloaded problem, especially in a distributed, replicated data store such as GemFire where consistency and availability tradeoffs exist.
There are different APIs and frameworks to simplify persistence and querying with GemFire. One of the more notable approaches is Spring Data GemFire's extension of Spring Data Commons Repository abstraction.
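For example, a minimal sketch with Spring Data GemFire's repository abstraction, assuming an Employee entity keyed by a String id (the derived findByDeptId query is illustrative, and the annotation package differs between SDG 1.x and 2.x):

```java
import java.util.List;

import org.springframework.data.annotation.Id;
import org.springframework.data.gemfire.mapping.annotation.Region;
import org.springframework.data.repository.CrudRepository;

// Maps this type onto the /Employee region.
@Region("Employee")
class Employee {
    @Id
    String id;
    String name;
    String deptId;
}

// Spring Data generates the OQL behind these methods at runtime.
interface EmployeeRepository extends CrudRepository<Employee, String> {
    List<Employee> findByDeptId(String deptId);
}
```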
It also might be a matter of using the right data model for the job. If you have very complex data relationships, then perhaps creating analytical models using a graph database (such as Neo4j) would be a simpler option. Spring also provides great support for Neo4j, led by the Neo4j team.
Any design choice you make will undoubtedly involve a hybrid approach. Oftentimes the path is not clear, since it really "depends" (i.e. depends on the application and data access patterns, load, all that).
But one thing is for certain: make sure you have a good cursory knowledge and understanding of the underlying data store and its data management capabilities, particularly as they pertain to consistency and availability, beginning with this.
Note, there is also a GemFire Slack channel as well as an Apache DEV mailing list you can use to reach out to the GemFire experts and the community of (advanced) GemFire/Geode users if you have more specific problems as you proceed down this architectural design path.

Graph database implemented using key value store

I have a requirement for a graph database that needs to be backed up and potentially accessed at a lower level of abstraction. It also must be distributed for the sake of load balancing (single-master replication will do).
I know that it is possible to implement a graph database using a self-referencing key-value store. The Git object database is an example of this pattern. One of the frustrating things I find about most graph databases is that they do not "disclose" their underlying persistence layer in their public API.
Do any replicated graph databases exist that allow for an underlying key-value stores to be "plugged-in" or accessed directly?
In addition to Gremlin/TinkerPop, mentioned by @amirouche above (see the sketch after this list), I'm aware of two solutions:
Redis, complemented by its Graph module, matches your description.
Cayley could also be a solution, as it provides graph features over various SQL and NoSQL backends, some of which support distributed mode (PostgreSQL, MySQL, MongoDB, CockroachDB).
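As a small illustration of the Gremlin/TinkerPop route: the traversal code stays the same while the backing Graph implementation is swapped underneath. TinkerGraph below is the in-memory reference implementation; providers such as JanusGraph back the same API with key-value/columnar stores (Cassandra, HBase, BerkeleyDB). A minimal sketch:

```java
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.Vertex;
import org.apache.tinkerpop.gremlin.tinkergraph.structure.TinkerGraph;

public class PluggableGraphDemo {
    public static void main(String[] args) {
        // Swap TinkerGraph.open() for another TinkerPop provider (e.g. one backed
        // by a distributed key-value store) without touching the traversal code.
        Graph graph = TinkerGraph.open();
        GraphTraversalSource g = graph.traversal();

        Vertex alice = g.addV("person").property("name", "alice").next();
        Vertex bob = g.addV("person").property("name", "bob").next();
        g.addE("knows").from(alice).to(bob).iterate();

        System.out.println(g.V().has("person", "name", "alice")
                .out("knows").values("name").toList()); // [bob]
    }
}
```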