BigQuery distributed transactions - google-bigquery

I'm trying to architect a microservice-based system that uses BigQuery as one of its services. We need to preserve eventual consistency between BigQuery and the other microservices, so that changes to BigQuery (data uploads, table creation, etc.) are eventually propagated to the other services.
I'm wondering whether BigQuery has any mechanisms that support this kind of consistency. As far as I can tell, BigQuery does not support publishing its events to Pub/Sub, which would definitely solve the problem.
I'm thinking of using labels for this. I'm hoping that updates to data and labels are atomic within a single API call.
The idea is to keep two labels holding the current version and the committed version, and maybe a third holding the uncommitted operation type. A mutation increments the current version and queues a task that publishes the update to Pub/Sub and, on success, sets the committed version to match the current one. I do see a number of problems with this solution, though.
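To make the label idea concrete, here is a minimal in-memory sketch of the current/committed version protocol. It does not call BigQuery or Pub/Sub: `Table`, `mutate`, `reconcile` and the `publish` callable are hypothetical names, and the sketch simply assumes that data and labels can be updated atomically, which is the very property in question.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Stand-in for a BigQuery table: rows plus string labels (hypothetical)."""
    rows: list = field(default_factory=list)
    labels: dict = field(default_factory=lambda: {
        "current_version": "0",
        "committed_version": "0",
    })


def mutate(table, new_rows, op_type):
    """Apply a mutation and bump current_version in one step.

    Assumes the real API can change data and labels atomically.
    """
    table.rows.extend(new_rows)
    table.labels["current_version"] = str(int(table.labels["current_version"]) + 1)
    table.labels["uncommitted_op"] = op_type


def reconcile(table, publish):
    """Publish any uncommitted change, then mark it committed.

    `publish` is a hypothetical callable that sends an event and raises on
    failure. If it raises, the labels stay unequal and a later run retries,
    so delivery is at-least-once and consumers must tolerate duplicates.
    """
    current = table.labels["current_version"]
    if table.labels["committed_version"] == current:
        return False  # nothing to propagate
    publish({"version": current, "op": table.labels.get("uncommitted_op")})
    table.labels["committed_version"] = current
    table.labels.pop("uncommitted_op", None)
    return True
```

One weakness shows up immediately in this sketch: if several mutations happen before a reconcile run, only the last operation type is recorded, so the published event can describe the latest version but not every intermediate step.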
Basically, there is a broader question: how do APIs need to be designed to support eventual consistency with other systems, and is it possible to use an API that was not specifically designed for this in an eventually consistent distributed system?

Related

Latency while updating BigQuery schema

I am facing some issues regarding latency in updating BigQuery schema.
I have a table that receives streaming inserts, and the schema is updated automatically whenever needed. The issue is that the schema update doesn't seem to take effect for some time, and inserts made during that period drop the values of the new columns.
I found this answer from 2016 that says there could be delays of up to 5 minutes before changes take effect.
Is this still the case and how do you work around this? If a timeout is the answer, then how long should you wait before writing to the new columns?
For more detailed background on the subject, I would encourage you to read this well-written article, which walks through the life cycle of BigQuery streaming inserts using the tabledata.insertAll BigQuery REST API method.
As the documentation says, data availability and consistency are the most important requirements when ingesting data for real-time analysis:
Because BigQuery's streaming API is designed for high insertion
rates, modifications to the underlying table metadata are
eventually consistent when interacting with the streaming system. In
most cases metadata changes are propagated within minutes, but during
this period API responses may reflect the inconsistent state of the
table.
So the documentation confirms that when metadata changes are made while streaming inserts are in progress, there is a delay before they take effect. Even the caching mechanism that gathers table metadata does not guarantee that changes are picked up immediately, for example when streaming inserts reference a table or columns that did not exist a moment earlier. Because of the complexity of the serverless BigQuery platform, which was originally built on top of the Dremel model, it is hard to estimate the latency for a particular high-throughput streaming task, which is presumably why no figure is documented in the GCP knowledge base.
Meanwhile, in this Stack Overflow thread, @Sean Chen recommends applying BigQuery metadata changes before launching the streaming inserts.
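If the schema cannot be changed ahead of time, one way to apply the same idea on the client side is to hold back rows that use a recently added column until a grace period has passed. The sketch below is only an illustration: `SchemaChangeGate` is a hypothetical helper, not part of any BigQuery client library, and the 300-second default is an assumption taken from the "up to 5 minutes" figure mentioned in the question, not a documented guarantee.

```python
import time


class SchemaChangeGate:
    """Holds back rows that use recently added columns (hypothetical helper).

    `grace_seconds` is an assumed, tunable wait; the documentation quoted
    above only says that changes usually propagate "within minutes".
    """

    def __init__(self, grace_seconds=300.0, clock=time.monotonic):
        self.grace_seconds = grace_seconds
        self.clock = clock
        self.added_at = {}  # column name -> time the column was added
        self.pending = []   # rows waiting for the schema change to settle

    def column_added(self, name):
        """Record that a column was just added to the table schema."""
        self.added_at[name] = self.clock()

    def _is_settled(self, row, now):
        # A row is safe once every new column it uses is old enough.
        return all(
            now - self.added_at[col] >= self.grace_seconds
            for col in row
            if col in self.added_at
        )

    def submit(self, row):
        """Queue a row and return every row that is now safe to stream."""
        self.pending.append(row)
        return self.flush()

    def flush(self):
        """Return the rows that are safe to stream and keep the rest pending."""
        now = self.clock()
        ready, waiting = [], []
        for row in self.pending:
            (ready if self._is_settled(row, now) else waiting).append(row)
        self.pending = waiting
        return ready
```

Rows that do not touch a new column pass straight through, so the gate can reorder rows relative to submission order. The caller is expected to stream whatever `submit` or `flush` returns.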

Realtime queries in deepstream "cache" layer?

I see that, by using the RethinkDB connector, one can achieve realtime querying capabilities by subscribing to specifically named lists. I assume that this is not actually the fastest solution, as the query probably updates only after changes to records are written to the database. Is there a recommended approach to achieving realtime querying capabilities on the deepstream side?
There are some favourable properties, such as:
Number of unique queries is small compared to number of records or even number of connected clients
All manipulation of records that are subject to querying is done via RPC.
I can imagine multiple ways to do that:
Imitate the RethinkDB connector approach. But for that I am missing a list.listen() method. With it, I would be able to create a backend process that creates a list on demand and, on each RPC CRUD operation on records, updates all currently active lists (i.e. queries).
Reimplement basic list functionality in records and use the above approach with the already existing .listen().
Use .listen() in events?
Or do we have list.listen() and I just missed it? Or is there a more elegant way to do it?
Great question - generally lists are a client-side concept, implemented on top of records. Listen notifies you about clients subscribing to records, not necessarily changing them - change notifications arrive via mylist.subscribe(data => {}) or myRecord.subscribe(data => {}).
The tricky bit is the very limited querying capability of caches. Redis has a basic concept of secondary indices that can be searched for ranges and intersection, memcached and co are to my knowledge pure key-value stores, searchable only by ID - as a result the actual querying would make most sense on the database layer where your data will usually arrive in significantly less than 200ms.
The RethinkDB search provider offers support for RethinkDB's built-in realtime querying capabilities. Alternatively, you could use MongoDB and tail its operations log, or use Postgres and deepstream's built-in subscribe feature for change notifications.
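For comparison, the first approach from the question (a backend process that keeps each active query's result list up to date on every RPC write) can be modelled in a few lines. This is a plain in-memory Python sketch, not deepstream's API: `LiveQueries` and its methods are hypothetical names, and it relies on the stated assumption that all record changes go through the backend.

```python
class LiveQueries:
    """Keeps result lists for active queries up to date on every write."""

    def __init__(self):
        self.records = {}    # record id -> data dict
        self.queries = {}    # query name -> predicate over a data dict
        self.results = {}    # query name -> sorted list of matching ids
        self.listeners = {}  # query name -> list of callbacks

    def subscribe(self, name, predicate, callback):
        """Register a query on demand and send the initial result."""
        if name not in self.queries:
            self.queries[name] = predicate
            self.results[name] = sorted(
                rid for rid, data in self.records.items() if predicate(data)
            )
        self.listeners.setdefault(name, []).append(callback)
        callback(list(self.results[name]))

    def upsert(self, rid, data):
        """RPC-style write: store the record, then refresh affected lists."""
        self.records[rid] = data
        self._refresh(rid, data)

    def delete(self, rid):
        """RPC-style delete: drop the record from the store and all lists."""
        self.records.pop(rid, None)
        self._refresh(rid, None)

    def _refresh(self, rid, data):
        for name, predicate in self.queries.items():
            ids = self.results[name]
            matches = data is not None and bool(predicate(data))
            if matches == (rid in ids):
                continue  # membership unchanged, nothing to notify
            if matches:
                ids.append(rid)
                ids.sort()
            else:
                ids.remove(rid)
            for callback in self.listeners.get(name, []):
                callback(list(ids))
```

Each write costs one predicate evaluation per active query, which fits the case where the number of unique queries is small compared to the number of records.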

Azure SQL Replication

I have an application that, for performance reasons, will have completely independent standalone instances in several Azure data centers. The stack of Azure IaaS and PaaS components at each data center will be exactly the same. Primarily, there will be a front end application and a database.
So let's say I have the application hosted in 4 data centers. I would like the data coming into each Azure SQL database to be replicated asynchronously to all of the other 3 databases, in an eventually consistent manner. Each of these databases needs to be updatable.
Does anyone know if Active Geo-Replication can handle this scenario? I know I can do this using a VM and IaaS, but would prefer to use SQL Azure.
Thanks...
Peer-to-peer transactional replication supports what you're asking for, to some extent. I'm assuming that's what you're referring to when you mention setting it up in IaaS, but it seems self-defeating if you're looking to it for a boost in write performance (and it goes against their recommendations):
From https://msdn.microsoft.com/en-us/library/ms151196.aspx
Although peer-to-peer replication enables scaling out of read operations, write performance for the topology is like that for a single node. This is because ultimately all inserts, updates, and deletes are propagated to all nodes. Replication recognizes when a change has been applied to a given node and prevents changes from cycling through the nodes more than one time. We strongly recommend that write operations for each row be performed at only one node, for the following reasons:
If a row is modified at more than one node, it can cause a conflict or even a lost update when the row is propagated to other nodes.
There is always some latency involved when changes are replicated. For applications that require the latest change to be seen immediately, dynamically load balancing the application across multiple nodes can be problematic.
This makes me think that you'd be better off using Active Geo-Replication: you get the benefit of PaaS, you don't have to manage your own VMs or transactional replication (which gets messy), and if the application is built to deal with "eventual consistency" in the UI, you might be able to get away with slight delays in the secondaries being up to date.

Need Design & Implementation inputs on Cassandra based use case

I am planning to store high-volume order transaction records from a commerce website in a repository (we have to use Cassandra here; that is our DB). Let us call this component commerceOrderRecorderService.
The second part of the problem is that I want to process these orders and push them to other downstream systems. This component can be called batchCommerceOrderProcessor.
Both commerceOrderRecorderService and batchCommerceOrderProcessor will run on a Java platform.
I need suggestions on the design of these components, especially the following:
commerceOrderRecorderService
What is the best way to design the columns, considering performance and scalability? Should I store the entire order (a complex entity) as a single JSON object? There is no search requirement on the order attributes; at the least, searches can wait until the orders have been processed by the batch processor. Consider that a single order can contain many sub-items, each of which can be fulfilled differently at processing time. Designing columns for such a data structure may be overkill.
What should the key be, given that data volumes will be high (say, 10 transactions per second at peak)? Are there any libraries or best practices for creating such transactional data in Cassandra? Can TTL also be used effectively?
batchCommerceOrderProcessor
How should the rows be retrieved for processing?
How can I ensure that a multi-threaded implementation of the batch processor (potentially running on multiple nodes as well) has row-level isolation? That is, no two instances should read and process the same row at the same time, so that there is no duplicate processing.
How should the data be purged after a certain period of time, while staying friendly to Cassandra processes like compaction?
Appreciate design inputs, code samples and pointers to libraries. Thanks.
Depending on the overall requirements of your system, it could be feasible to employ an architecture composed of:
Cassandra to store the orders, analytics and what have you.
Message queue - your commerce order recorder service would simply enqueue each new order in a transactional, persistent queue and return. Scalability and performance should not be an issue here, as you can easily achieve thousands of transactions per second with a single queue server. You may have a look at RabbitMQ as one of the available choices.
Stream processing framework - you could read a stream of messages from the queue in a scalable fashion using a streaming framework such as Twitter Storm. You could then implement three simple pipelined processes in Storm, in Java:
a) A Spout process that dequeues the next order from the queue and passes it to the second process.
b) A second process, called a Bolt, that inserts each order into Cassandra and passes it to the third Bolt.
c) A third Bolt process that pushes the order to other downstream systems.
Such an architecture offers high performance, scalability, and near-real-time, low-latency data processing. It takes into account that Cassandra is very strong at high-speed writes, but not so strong at reading sequential lists of records. We use the Storm+Cassandra combination in our InnoQuant MOCA platform and handle 25,000 tx/second and more, depending on hardware.
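The three-stage flow can be sketched without any of the real components. The following is a single-process Python model of the pipeline shape only, not Storm, RabbitMQ or Cassandra code: a `queue.Queue` stands in for the message queue, a dict stands in for Cassandra, and the function names and the `send` callable are hypothetical.

```python
import queue


def spout(order_queue):
    """Yield orders from the queue until it is empty."""
    while True:
        try:
            yield order_queue.get_nowait()
        except queue.Empty:
            return


def store_bolt(orders, store):
    """Persist each order (a dict stands in for Cassandra), then pass it on."""
    for order in orders:
        store[order["id"]] = order
        yield order


def downstream_bolt(orders, send):
    """Push each stored order to a downstream system via `send`."""
    count = 0
    for order in orders:
        send(order)
        count += 1
    return count


def run_pipeline(order_queue, store, send):
    """Wire the three stages together and return the number of orders sent."""
    return downstream_bolt(store_bolt(spout(order_queue), store), send)
```

Because every order is stored before it is handed to the downstream stage, a failure in `send` leaves the order persisted and available for a retry.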
Finally, you should consider whether such an architecture is overkill for your scenario. Nowadays, you can easily achieve 10 tx/second with nearly any single-box database.
This example may help a little. It loads a lot of transactions using the jmxbulkloader and then batches the results into files of a certain size to be transported elsewhere. It is multi-threaded, but within the same process.
https://github.com/PatrickCallaghan/datastax-bulkloader-writer-example
Hope it helps. BTW it uses the latest cassandra 2.0.5.

Application Level Replication Technologies

I am building out a solution that will be deployed in multiple data centers in multiple regions around the world, with each data center having a replicated copy of data actively updated in each region. I will have a combination of multiple databases and file systems in each data center, the state of which must be kept consistent (within a data center). These multiple repositories will be fronted by a SOA service tier.
I can tolerate some latency in the replication, and need to allow for regions to be off-line, and then catch up later.
Given the multiple back-end repositories of data, I can't easily rely on independent replication solutions for each one to maintain a consistent state. I am thus led to implementing replication at the application layer, by replicating the SOA requests in some manner. I'll need to make sure that replication loops don't occur, and that last-writer conditions are sorted out correctly.
In your experience, what is the best pattern for solving this problem, and are there good products (free or otherwise) that should be investigated?
Lotus Domino is your answer. I've been working with it for ten years and it's exactly what you need. It may not be trendy (a perception that I would challenge), but it's powerful, adaptable and very secure. The latest version, R8, is the best yet.
You should definitely consider IBM Lotus Domino. A Lotus Notes database can replicate between sites on a predefined schedule. Replication in Notes/Domino is a very powerful feature and allows full replication of data between sites. Even if a server is unavailable, the next time it connects it will simply replicate and get back in sync.
As for the SOA service tier, you could then use Domino Designer to write a web service. Since Notes/Domino 7.5.x (I believe), Domino has been able to provide and consume web services.
As others have advised, I would also recommend Lotus Notes/Domino. Version 8.5 is a really powerful application development platform.
You don't give enough specifics to be certain of your needs, but I think you should check out SQL Server merge replication. It allows asynchronous replication of multiple databases with full conflict resolution. You will need to designate a global master that all the other databases replicate to, but all the database instances are fully functional (read/write), so you can schedule replication at whatever intervals suit you. If any region goes offline, it can catch up later with no issues; if the master goes offline, everyone works independently until replication can resume.
I would be interested to know of other solutions this flexible (apart from Lotus Notes/Domino of course which is not very trendy these days).
I think that your answer is going to have to be based on a pub/sub architecture. I am assuming that you have reliable messaging between your data centers, so that you can rely on published updates being received eventually. If all access to the data repositories is via services, you can add an event notification to the orchestration of each of your update services that notifies all interested data centers of the event. Ideally the master database is the only one that sends out these updates. In that case, you can avoid update loops by not routing notifications back to the node that generated them in the first place.
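As a rough illustration of the two concerns raised in the question (replication loops and last-writer conflicts), here is a small in-memory Python sketch. It is a variation on the design above rather than an implementation of it: every data center publishes its own writes directly to its peers instead of going through a single master, updates carry the name of the data center that first accepted them, and receivers never re-publish. The `Update` and `DataCenter` names are hypothetical, and timestamps are assumed to be comparable across sites.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Update:
    key: str
    value: str
    timestamp: float  # assumed comparable across sites
    origin: str       # data center that first accepted the write


class DataCenter:
    def __init__(self, name):
        self.name = name
        self.peers = []
        self.data = {}  # key -> (timestamp, origin, value)

    def write(self, key, value, timestamp):
        """Accept a local write and publish it to every peer."""
        update = Update(key, value, timestamp, self.name)
        self._apply(update)
        for peer in self.peers:
            peer.receive(update)

    def receive(self, update):
        """Apply a replicated update; never re-publish it."""
        if update.origin == self.name:
            return  # our own change came back, drop it
        self._apply(update)

    def _apply(self, update):
        current = self.data.get(update.key)
        incoming = (update.timestamp, update.origin, update.value)
        # Last writer wins; the origin name breaks timestamp ties.
        if current is None or incoming[:2] > current[:2]:
            self.data[update.key] = incoming

    def get(self, key):
        return self.data[key][2]
```

Since only the originating data center publishes an update and receivers apply it without forwarding, an update crosses each link at most once. Catch-up for an offline region would need a durable message log, which this sketch leaves out.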