Data Consolidation for ETL pipeline - sql

I am currently planning to move several data sources to one place for later analysis.
Currently I have a number of data sources (databases), such as:
MSSQL
MySQL
MongoDB
Postgres
Cassandra will be used for analytics in a big data pipeline. What is the best way to migrate data from any of these sources to a Cassandra cluster?

I would highly recommend using Apache NiFi for this use case. Some of the benefits I can outline right away:
Built-in "processors" are available for reading data from all of the listed data sources and for writing to Cassandra.
Very high throughput with low latency.
Rapid data-acquisition pipeline development without writing a lot of code.
The ability to do "Change Data Capture" very easily later in your project, if needed.
A highly concurrent model, without a developer having to worry about the typical complexities of concurrency.
It is inherently asynchronous, which allows for very high throughput and natural buffering even as processing and flow rates fluctuate.
Resource-constrained connections make critical functions such as back-pressure and pressure release very natural and intuitive.
The points at which data enters and exits the system, as well as how it flows through it, are well understood and easily tracked.
And biggest of all: it is open source.
You can refer to the Apache NiFi homepage for more information.
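To make the "without writing a lot of code" point concrete, here is a rough sketch (in Python, with made-up hostnames, credentials, keyspace and table names) of the kind of per-table copy job you would otherwise have to write and operate by hand; NiFi's built-in database and Cassandra processors replace this with configurable flow components.

    # Hand-rolled Postgres -> Cassandra copy of a single table, for comparison only.
    # Hostnames, credentials, keyspace and table names are placeholders.
    import psycopg2
    from cassandra.cluster import Cluster

    PG_DSN = "host=pg-host dbname=sales user=etl password=secret"

    def copy_orders():
        pg = psycopg2.connect(PG_DSN)
        cluster = Cluster(["cassandra-host"])
        session = cluster.connect("analytics")        # target keyspace
        insert = session.prepare(
            "INSERT INTO orders (order_id, customer_id, total) VALUES (?, ?, ?)"
        )
        try:
            with pg.cursor(name="orders_cur") as cur:  # server-side cursor streams rows
                cur.itersize = 500
                cur.execute("SELECT order_id, customer_id, total FROM orders")
                for row in cur:
                    session.execute(insert, row)
        finally:
            pg.close()
            cluster.shutdown()

    if __name__ == "__main__":
        copy_orders()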
Hope that helps!

Related

How to implement a dual-write system for a SQL database and Elasticsearch

I did my own research and found that there are several ways to do this, but the most accurate is Change Data Capture (CDC). However, I don't see its benefits compared to the asynchronous method, for example:
Synchronous double-write: Elasticsearch is updated synchronously whenever the DB is updated. This solution is the simplest, but it faces the largest number of problems, including data conflicts, data overwriting, and data loss. Make your choice carefully.
Asynchronous double-write: when the DB is updated, a message is written to an MQ to notify a consumer, which then queries the DB for the changed data and eventually updates Elasticsearch. This solution is highly coupled with business systems: you need to write programs specific to the requirements of each business, so rapid response is not possible.
Change Data Capture (CDC): change data is captured from the DB, pushed to an intermediate program, and pushed on to Elasticsearch by the logic of that intermediate program. Based on the CDC mechanism, accurate data is returned at very high speed in response to queries. This solution is less coupled to application programs, so it can be abstracted and separated from business systems, making it suitable for large-scale use.
(Source: Alibabacloud.com)
Another article says that the asynchronous approach is also risky: if one data source is down, we cannot easily roll back.
https://thorben-janssen.com/dual-writes/
So my question is: should I use CDC to perform persistence operations across multiple data sources? Why is CDC better than the asynchronous approach, given that it is based on the same principle?
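For concreteness, the CDC option described above typically boils down to a small consumer like the following minimal sketch, assuming Debezium-style change events on a Kafka topic, the kafka-python client and the Elasticsearch 8.x Python client; the topic, index and field names are hypothetical and error handling is omitted.

    import json
    from kafka import KafkaConsumer                  # kafka-python
    from elasticsearch import Elasticsearch, NotFoundError

    consumer = KafkaConsumer(
        "dbserver1.public.products",                 # hypothetical Debezium topic
        bootstrap_servers="localhost:9092",
        value_deserializer=lambda v: json.loads(v) if v else None,
    )
    es = Elasticsearch("http://localhost:9200")

    for message in consumer:
        event = message.value
        if event is None:                            # tombstone record, skip
            continue
        payload = event.get("payload", event)        # with or without the schema envelope
        op, after = payload.get("op"), payload.get("after")
        if op in ("c", "u", "r") and after:          # create / update / initial snapshot
            es.index(index="products", id=after["id"], document=after)
        elif op == "d":                              # delete
            before = payload.get("before") or {}
            if before.get("id") is not None:
                try:
                    es.delete(index="products", id=before["id"])
                except NotFoundError:
                    pass                             # already gone, which is fine

The double-write variants, by contrast, put the equivalent logic inside every application code path that touches the table, which is where the coupling described above comes from.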

How to test multi-region write in Cosmos DB

I am going to test the multi-region write functionality by writing some test code using the Cosmos DB C# v3 SDK.
I plan to have a multi-region-write-enabled Cosmos DB account (SQL Core API) with three regions. I want to write to one specific region and then read from the other regions. While doing this, I want to measure performance as well.
Is there any way of implementing this type of test? Is there a good way of measuring performance, such as built-in performance metrics? I also want to vary the consistency level and observe the latency.
Depending on what type of tests you are looking to do, the benchmarks in this Cosmos DB Global Distribution Demos GitHub Repo may be of some help. There's a bit of a learning curve, as the benchmarks are data-driven from app.config files, but once you get the URIs and keys into the app.config you should be mostly good to go.
One thing worth pointing out is that changing the consistency level when testing multiple writers and readers in different regions, with multi-region writes enabled, is meaningless because you will always have eventual consistency under those circumstances. For more information, see Guarantees associated with consistency levels.
The other thing to call out is that you cannot configure multi-region writes with strong consistency. For more information, see Strong consistency and multiple write regions.
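If you want something smaller than the benchmark repo, the basic measurement can be sketched roughly as below. This uses the Python SDK (azure-cosmos) rather than the C# v3 SDK from the question, purely to show the shape of the test; the endpoint, key, database/container names, partition key path and region names are placeholders, and the preferred_locations keyword assumes a recent v4 client.

    import time
    import uuid
    from azure.cosmos import CosmosClient

    ENDPOINT = "https://<account>.documents.azure.com:443/"
    KEY = "<primary key>"

    def container_in(region):
        # One client per region so that reads/writes are routed to that region.
        client = CosmosClient(ENDPOINT, credential=KEY, preferred_locations=[region])
        return client.get_database_client("testdb").get_container_client("items")

    writer = container_in("West US 2")
    reader = container_in("North Europe")

    # Assumes the container's partition key path is /pk.
    item = {"id": str(uuid.uuid4()), "pk": "latency-test", "payload": "x" * 1024}

    start = time.perf_counter()
    writer.upsert_item(item)
    write_ms = (time.perf_counter() - start) * 1000

    # With multi-region writes and anything weaker than strong consistency, the
    # cross-region read may lag the write or briefly fail until replication
    # catches up -- which is exactly the effect worth measuring.
    start = time.perf_counter()
    reader.read_item(item=item["id"], partition_key=item["pk"])
    read_ms = (time.perf_counter() - start) * 1000

    print(f"write {write_ms:.1f} ms, cross-region read {read_ms:.1f} ms")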

Object storage for a web application

I am currently working on a website where roughly 40 million documents and images will be served to its users. I need suggestions on which method is the most suitable for storing the content, subject to these requirements:
The system should be highly available, scalable and durable.
Files have to be stored permanently, and users should be able to modify them.
Due to client restrictions, third-party object storage providers such as Amazon S3 and CDNs are not suitable.
File sizes can vary from 1 MB to 30 MB (however, about 90% of the files would be less than 2 MB).
Content retrieval latency is not much of a problem, so indexing and caching are not very important.
I did some research and found the following solutions:
Storing content as BLOBs in databases.
Using GridFS to chunk and store content.
Storing content in a file server in directories using a hash and storing the metadata in a database.
Using a distributed file system such as GlusterFS or HDFS and storing the file metadata in a database.
The website is developed using PHP and Couchbase Community Edition is used as the database.
I would really appreciate any input.
Thank you.
I have been working on a similar system for the last two years; the work is still in progress. However, my requirements are slightly different from yours: modifications are not possible (I will try to explain why later), file sizes range from several bytes to several megabytes, and, most importantly, deduplication must be implemented at both the document and block level. If two different users upload the same file to the storage, only one copy of the file should be kept, and if two different files partially overlap, only one copy of the common part should be stored.
But let's focus on your requirements, so deduplication is out of scope. First of all, high availability implies replication. You will have to store your files in several replicas (typically 2 or 3, though there are techniques to reduce the storage overhead) on independent machines in order to stay alive if one of the storage servers in your backend dies. Also, given your estimate of the data volume, it is clear that all your data simply won't fit on a single server, so vertical scaling is not possible and you have to consider partitioning. Finally, you need to take concurrency control into account to avoid race conditions when two different clients try to write or update the same data simultaneously. This topic is close to the concept of transactions (I don't mean ACID literally, but something close). To summarize, these facts mean that you are actually looking for a distributed database designed to store BLOBs.
One of the biggest problems in distributed systems is maintaining the global state of the system. In brief, there are two approaches:
Choose a leader that communicates with the other peers and maintains the global state of the distributed system. This approach provides strong consistency and linearizability guarantees. The main disadvantage is that the leader becomes a single point of failure: if the leader dies, either some observer must assign the leader role to one of the replicas (the common case for master-slave replication in the RDBMS world), or the remaining peers need to elect a new one (algorithms like Paxos and Raft are designed to address this). Either way, almost all incoming traffic goes through the leader, which leads to "hot spots" in the backend: CPU and IO load is unevenly distributed across the system. By the way, Raft-based systems have very low write throughput (check the etcd and consul limitations if you are interested).
Avoid global state altogether. Weaken the guarantees to eventual consistency. Disallow updates to files: if someone wants to edit a file, save the edited version as a new file. Use a system organized as a peer-to-peer network, where no peer keeps full track of the whole system, so there is no single point of failure. This results in high write throughput and good horizontal scalability.
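As a toy illustration of the leaderless approach (placement decided locally by every client, with no peer owning global state), a consistent-hash ring can map each object key to its replica set. The node names and replica count below are made up, and a real system would add virtual nodes, rebalancing and hinted handoff.

    import bisect
    import hashlib

    NODES = ["store-1", "store-2", "store-3", "store-4"]   # hypothetical storage nodes
    REPLICAS = 2                                           # copies kept of every object

    def position(value: str) -> int:
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    ring = sorted((position(node), node) for node in NODES)
    points = [p for p, _ in ring]

    def replicas_for(object_key: str) -> list:
        """Walk clockwise from the key's position and take the next REPLICAS nodes."""
        start = bisect.bisect(points, position(object_key)) % len(ring)
        return [ring[(start + i) % len(ring)][1] for i in range(REPLICAS)]

    print(replicas_for("sha256:9f86d081..."))   # any client computes the same placement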
So now let's discuss the options you've found:
Storing content as BLOBs in databases.
I don't think it is a good option to store files in a traditional RDBMS, because an RDBMS is optimized for structured data and strong consistency, and you need neither of these. You will also have difficulties with backups and scaling. People usually don't use an RDBMS this way.
Using GridFS to chunk and store content.
GridFS is built on top of MongoDB, which is again a document-oriented database designed to store JSON documents, not BLOBs. MongoDB also had clustering problems for many years and only passed Jepsen tests in 2017, which may mean that its clustering is not fully mature yet. Run performance and stress tests if you go this way.
Storing content in a file server in directories using a hash and storing the metadata in a database.
This option means that you need to develop object storage on your own. Consider all the problems I've mentioned above.
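For what this option usually looks like in practice: derive the on-disk path from a content hash, keep the bytes on a local or mounted distributed volume, and record the hash plus the user-facing metadata in the database. Here is a rough sketch with a made-up storage root, leaving out the replication, locking and garbage collection that make this option hard.

    import hashlib
    import os

    STORAGE_ROOT = "/var/objects"                 # hypothetical mount point

    def store(data: bytes) -> str:
        """Write a blob under a path derived from its SHA-256 and return that key."""
        digest = hashlib.sha256(data).hexdigest()
        # Two levels of sharding (ab/cd/abcd...) keep directory listings small.
        directory = os.path.join(STORAGE_ROOT, digest[:2], digest[2:4])
        os.makedirs(directory, exist_ok=True)
        path = os.path.join(directory, digest)
        if not os.path.exists(path):              # identical content is stored only once
            with open(path, "wb") as f:
                f.write(data)
        return digest                             # save digest + filename/owner/etc. in the DB

    def load(digest: str) -> bytes:
        path = os.path.join(STORAGE_ROOT, digest[:2], digest[2:4], digest)
        with open(path, "rb") as f:
            return f.read()

A side effect of content addressing is that "modifying" a file naturally becomes "storing a new version", which is the same restriction described for the leaderless approach above.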
Using a distributed file system such as GlusterFS or HDFS and storing the file metadata in a database.
I have used neither of these solutions, but HDFS looks like overkill because you become dependent on the Hadoop stack. I have no idea about GlusterFS performance. Always consider the design of a distributed file system: if it relies on dedicated "metadata" servers, treat them as a single point of failure.
Finally, my thoughts on the solutions that may fit your needs:
Elliptics. This object store is not well known outside the Russian part of the internet, but it is mature and stable, and its performance is excellent. It was developed at Yandex (the Russian search engine), and a lot of Yandex services (Disk, Mail, Music, picture hosting and so on) are built on top of it. I used it in a previous project; it may take your ops team some time to get into it, but it is worth it if you are OK with the GPL license.
Ceph. This is a real object store. It is also open source, but it seems that only Red Hat people know how to deploy and maintain it, so be prepared for a degree of vendor lock-in. I have also heard that its configuration is quite complicated. I have never used it in production, so I cannot speak to its performance.
Minio. This is an S3-compatible object store that is under active development at the moment. I have never used it in production either, but it seems to be well designed.
You may also check the wiki page with the full list of available solutions.
And one last point: I strongly recommend not using OpenStack Swift (there are a lot of reasons why, but first of all, Python is just not well suited for these purposes).
One probably-relevant question, whose answer I do not readily see in your post, is this:
How often do users actually "modify" the content?
and:
When and if they do, how painful is it if a particular user is served "stale" content?
Personally (and, "categorically speaking"), I prefer to tackle such problems in two stages: (1) identifying the objects to be stored – e.g. using a database as an index; and (2) actually storing them, this being a task that I wish to delegate to "a true file-system, which after all specializes in such things."
A database (it "offhand" seems to me ...) would be a very good way to handle the logical ("as seen by the user") taxonomy of the things which you wish to store, while a distributed filesystem could handle the physical realities of storing the data and actually getting it to where it needs to go, and your application would be in the perfect position to gloss-over all of those messy filesystem details . . .

Google Cloud Platform architecture

A simple question:
Is the data that is processed via Google BigQuery stored on Google Cloud Storage and just segmented for BigQuery purposes, or does BigQuery have its own storage mechanism?
I'm trying to learn the architecture, and I see arrows pointing back and forth between the two, but it doesn't say where BigQuery's storage sits.
Thanks.
From BigQuery under the hood:
Colossus - Distributed Storage
BigQuery relies on Colossus, Google's latest-generation distributed file system. Each Google datacenter has its own Colossus cluster, and each Colossus cluster has enough disks to give every BigQuery user thousands of dedicated disks at a time. Colossus also handles replication, recovery (when disks crash) and distributed management (so there is no single point of failure). Colossus is fast enough to allow BigQuery to provide performance similar to many in-memory databases, while leveraging much cheaper yet highly parallelized, scalable, durable and performant infrastructure.
BigQuery leverages the ColumnIO columnar storage format and compression algorithm to store data in Colossus in the most optimal way for reading large amounts of structured data. Colossus allows BigQuery users to scale to dozens of petabytes of storage seamlessly, without paying the penalty of attaching much more expensive compute resources, as is typical with most traditional databases.
The part about ColumnIO is outdated--BigQuery uses the Capacitor format now--but the rest is still relevant.

Why would you use industry standard ETL?

I've just started work at a new company that has a data warehouse built on a bizarre proprietary ETL system written in PHP.
I'm looking for arguments as to why it's worth the investment to move to a standard system such as SSIS or Informatica. The primary reasons I have at the moment are:
A wider and more diverse community of developers available for contract work, replacements, etc.
A large online knowledge base and support network.
Ongoing updates and support will be better.
What other good high-level arguments are there for bringing in a little standardisation? :)
The only real disadvantage is that a lot of the data sources are web APIs returning individual row-by-row records, which are more easily looped through with PHP than with a standard ETL tool.
Here are some more:
Simplifies the development and deployment process.
Easier to debug and to incorporate changes, which reduces maintenance and enhancement costs.
Industry-standard ETL tools perform better on large volumes of data, as they use techniques such as grid computing, parallel processing and partitioning.
They can support many types of data as a source or target, so there is less impact if source or target systems are migrated to a different data store.
Code is reusable: the same component can be used in multiple processes.
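On the one disadvantage raised in the question (web APIs returning row-by-row records): looping a paginated API into batched loads is a small amount of code in any general-purpose language, and standard ETL tools generally also have REST/HTTP components, so it need not block a migration. A rough sketch in Python, with a hypothetical endpoint, pagination scheme and loader:

    import requests

    API_URL = "https://api.example.com/records"     # hypothetical paginated endpoint

    def fetch_rows():
        page = 1
        while True:
            resp = requests.get(API_URL, params={"page": page, "per_page": 200}, timeout=30)
            resp.raise_for_status()
            rows = resp.json()
            if not rows:                            # empty page means we are done
                return
            yield from rows
            page += 1

    def load_batch(rows):
        # Placeholder for a bulk insert into a warehouse staging table.
        print(f"loaded {len(rows)} rows")

    batch = []
    for row in fetch_rows():
        batch.append(row)
        if len(batch) >= 1000:
            load_batch(batch)
            batch = []
    if batch:
        load_batch(batch)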