We are in the process of buying a SaaS solution for a busy sales operation. We want to ensure that we can access our data and ingest it into our analytics data lake (some of it in near real time). I am looking for advice on what requirements we should have, or prefer, for vendors and their solutions:
APIs - Most vendors mention that they provide APIs for data access, but what features do the APIs need in order to be suitable for ingestion into an analytics data lake? For example, Salesforce has a Bulk API; does this mean that if a vendor only offers "lean" APIs, they won't work for the data lake use case?
Direct SQL access - Should we prefer SaaS solutions that offer single-tenant databases, so that we could obtain direct SQL access?
DB replica - Should we expect the vendor to provide a DB replica (if the solution is single-tenant) that we then use as a data store for reporting? Obviously, that is an extra cost for us.
Direct SQL access via ODBC - I have also read that if the SaaS app is multi-tenant, ODBC/JDBC drivers can be provided to access the data via SQL, with proper authorization to ensure data security. Would this be a valid request/approach?
Staged tables - Should we request that the vendor stage their DB tables (as files) and load them into our (or their) data lake environment? This would then serve as a raw data source for analytics and as a data archive. My concern is incremental updates.
Are there any other options we should consider, look for in vendor solutions, or request?
Thank you!
You need to provide your requirements (data lake architecture, data latency, etc.) to the vendors and get them to provide the solution that will work with their product.
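To make that conversation concrete, it can help to show vendors the extraction pattern the lake actually needs: change filtering, pagination/cursors, and bulk-friendly limits. Below is a minimal Python sketch against an entirely hypothetical vendor API (the base URL, `updated_since` filter, `next_cursor` field, and `records` payload are all placeholder names) of an incremental, paginated pull that lands raw files for the lake.

```python
import json
from datetime import datetime, timezone

import requests

BASE_URL = "https://api.vendor.example/v1"   # hypothetical vendor endpoint


def pull_increment(entity: str, updated_since: str, page_size: int = 1000):
    """Yield only records changed since the last run, following pagination cursors."""
    params = {"updated_since": updated_since, "limit": page_size}
    while True:
        resp = requests.get(f"{BASE_URL}/{entity}", params=params, timeout=60)
        resp.raise_for_status()
        payload = resp.json()
        yield from payload["records"]
        cursor = payload.get("next_cursor")
        if not cursor:                       # no more pages in this increment
            break
        params["cursor"] = cursor


# Land the increment as a raw newline-delimited JSON file that can be copied into the lake
run_ts = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
with open(f"opportunities_{run_ts}.jsonl", "w") as out:
    for record in pull_increment("opportunities", updated_since="2024-01-01T00:00:00Z"):
        out.write(json.dumps(record) + "\n")
```

If a vendor's API cannot support this kind of change filtering and pagination (or a bulk export equivalent), it will tend to struggle for the data lake use case regardless of how the API is branded.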
I apologize in advance, as I am likely going to show my ignorance of this space. I am just starting into Delta Lake, and I have a feeling my initial concepts were incorrect.
Today I have millions of documents in Azure Cosmos DB. I have a service that combines data from various containers and merges it into combined JSON documents, which are then indexed into Elasticsearch.
Our current initiative is to use Synapse to enrich the data before indexing it into Elasticsearch. The initial idea was that we would stream the Cosmos DB updates into ADLS via the Change Feed. We would then do the combining (i.e., replace what the combiner service is doing) and the enrichment in Synapse.
The logic in the combiner service is very complex and difficult to rewrite from scratch (it is currently an Azure Service Fabric stateless .NET application). I had thought that I could just have my combiner write the final copy (i.e., the JSON we currently index as the end product) to ADLS, and then we would only need to apply our enrichments as additive data. I believe this is a misunderstanding of what Delta Lake is. I have been thinking of it as similar to Cosmos DB, where I can push a JSON document via a REST call. I don't think this is a valid scenario, but I can't find any information that states this (perhaps because the assumption is so far off base that it never comes up).
In this scenario, would my only option be to have my service write the consolidated document back to Cosmos DB, and then sync that document into ADLS?
Thanks!
John
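For context on the Delta Lake question: Delta tables are written through Spark (batch or Structured Streaming), not through a per-document REST API the way Cosmos DB is. One common shape is to keep writing JSON (to Cosmos DB, or straight to ADLS as plain files) and then load it into a Delta table with Spark. A minimal sketch, assuming a Synapse PySpark notebook and placeholder ADLS paths:

```python
# PySpark, e.g. in a Synapse notebook where the built-in `spark` session already has Delta support
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()   # Synapse pre-creates this session for you

# Read the combined JSON documents that were landed in ADLS (placeholder container/path names)
combined = spark.read.json("abfss://raw@mylake.dfs.core.windows.net/combined-docs/")

# Write them into a Delta table. This is the only way data enters Delta Lake:
# through Spark writes, not through a per-document REST call.
(combined.write
    .format("delta")
    .mode("append")
    .save("abfss://curated@mylake.dfs.core.windows.net/combined_docs/"))
```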
I am currently building a SQL database in Microsoft Azure for handling pictures, documents, etc. What is the most efficient way of storing this data: uploading the files directly into the DB, or storing the files in something like Azure Blob Storage and referencing them? I have read numerous posts about people uploading files directly into the DB, but I am concerned about the efficiency of that approach.
Thank you in advance for any replies.
You can store the files in something like Azure SQL DB, but I would not recommend it. You should store them in Azure Storage (Blob) and then keep a reference to them in a DB. Azure has multiple relational and NoSQL data stores offered as platform services.
I would do two things. First, use a NoSQL platform data store like Cosmos DB with the Core (SQL) API to store the metadata for the images; you can use the filename as the partition key, which lets you do a point read (very fast and very cheap). Second, put Azure CDN in front of the images so they are served faster. (A rough sketch is below.)
Azure CDN has three options: Akamai, Verizon, and Microsoft. You can test which CDN is fastest from your location here: https://cloudharmony.com/speedtest-for-azure
You can also use the same URL to test which Azure region is closest to you and use that region, or test for your end users and choose the region closest to them.
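A minimal sketch of that two-part idea, assuming the azure-storage-blob and azure-cosmos Python SDKs and placeholder account, container, and database names:

```python
from azure.storage.blob import BlobServiceClient
from azure.cosmos import CosmosClient

# Placeholder connection settings; use Key Vault / managed identity in practice
blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")
cosmos = CosmosClient("<cosmos-account-url>", credential="<cosmos-key>")

# 1. Put the image itself in Blob Storage
blob_client = blob_service.get_blob_client(container="images", blob="photo-001.jpg")
with open("photo-001.jpg", "rb") as data:
    blob_client.upload_blob(data, overwrite=True)

# 2. Store the metadata in Cosmos DB (Core/SQL API), partitioned by filename
container = cosmos.get_database_client("media").get_container_client("imageMetadata")
container.upsert_item({
    "id": "photo-001.jpg",
    "fileName": "photo-001.jpg",      # partition key value
    "blobUrl": blob_client.url,       # serve this through Azure CDN in practice
    "contentType": "image/jpeg",
})

# 3. Point read: id + partition key, the cheapest and fastest Cosmos DB lookup
item = container.read_item(item="photo-001.jpg", partition_key="photo-001.jpg")
print(item["blobUrl"])
```

The point read works because the id and the partition key together identify a single item, so Cosmos DB can fetch it without running a query.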
I would say storing the files in Azure Blobs is the better idea. Imagine you have 100 GB of files stored in the DB:
Queries will slow down if the table is not designed properly.
Backing up and restoring the DB will be very slow.
Azure SQL DB storage is more expensive than Azure Blob storage for the same size.
If your total file size is small enough, it doesn't make much difference.
I have a SQL DB hosted on AWS that contains PHI. I want to access this data to perform analytics; however, I must de-identify the data first to comply with HIPAA.
How should I approach this? I have thought of a few approaches:
Simply de-identify the existing DB in place with SQL commands.
From now on, every time data is added to the DB, add a de-identified version of that data to another DB, and access that DB for analytics.
From now on, every time data is added to the DB, add a de-identified version of that data to another table in the same DB, and query that table for analytics.
Which is the best approach to use to maintain compliance with HIPAA? Or, is there a better way?
Thanks!
Budget allowing, consider doing your analytics on a different system and de-identifying the data during the ETL. Changing the source system to accommodate this requirement will add maintenance complexity and will likely affect other integrations; you might end up with a monolith.
There are various ways to do this. You could set up an AWS DMS task (with ongoing replication) with the DB as your source and S3 as the target (in Parquet format). From there you could use Athena for analytics, as jarmod highlighted; Athena supports Parquet and lets you run SQL queries to analyze your data. There is also Redshift, sending the data to another relational DB, other analytics platforms, etc.
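As a rough illustration of the "de-identify during the ETL" step only (real HIPAA de-identification needs to cover all of Safe Harbor's 18 identifiers, or go through expert determination, not just hash a few columns), here is a minimal Python sketch with placeholder table, column, and connection names:

```python
import hashlib

import pandas as pd
from sqlalchemy import create_engine

SALT = "<load-from-a-secrets-manager>"                      # never hard-code this in practice
DIRECT_IDENTIFIERS = ["name", "email", "phone", "address"]  # example PHI columns


def pseudonymize(value) -> str:
    """One-way salted hash so the same patient always maps to the same token."""
    return hashlib.sha256((SALT + str(value)).encode()).hexdigest()


# Placeholder source connection (e.g. RDS); pull credentials from a secrets manager in practice
engine = create_engine("postgresql+psycopg2://user:password@source-host/clinical")

df = pd.read_sql("SELECT * FROM encounters", engine)

# Keep a stable, de-identified join key, then drop the direct identifiers
df["patient_key"] = df["patient_id"].map(pseudonymize)
df = df.drop(columns=[c for c in DIRECT_IDENTIFIERS + ["patient_id"] if c in df.columns])

# Land the de-identified extract for the analytics side (S3/Athena, Redshift, etc.); needs pyarrow
df.to_parquet("encounters_deidentified.parquet")
```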
We have shifted from IBM DB2 databases to PostgreSQL databases on the AWS cloud. Is anyone aware of, or has anyone worked with, tools on AWS for testing databases?
a) If so, what tools do you use?
b) What do you test when checking the databases in a Business Intelligence (BI) type of environment?
I am interested in more than just load or performance testing. I want to do functional testing, where I validate/verify that the data in the cloud servers and databases is equivalent to the data on the physical servers running DB2.
So, mainly a kind of data reconciliation, but with ETL also involved.
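For what it's worth, that reconciliation can start very simply, e.g. comparing row counts (and later column-level aggregates or checksums) table by table between the DB2 source and the PostgreSQL target. A minimal sketch with placeholder connection strings and table names, assuming pyodbc for DB2 and psycopg2 for PostgreSQL:

```python
import pyodbc      # DB2 source, via an ODBC DSN assumed to be configured
import psycopg2    # PostgreSQL target on AWS (e.g. RDS)

TABLES = ["sales.orders", "sales.customers"]   # example tables to reconcile

db2 = pyodbc.connect("DSN=DB2PROD;UID=user;PWD=secret")              # placeholder credentials
pg = psycopg2.connect(host="mydb.example.rds.amazonaws.com",
                      dbname="warehouse", user="user", password="secret")


def row_count(conn, table: str) -> int:
    """Count rows in a table on either source or target connection."""
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    return cur.fetchone()[0]


for table in TABLES:
    src, tgt = row_count(db2, table), row_count(pg, table)
    status = "OK" if src == tgt else "MISMATCH"
    print(f"{table}: DB2={src} PostgreSQL={tgt} -> {status}")
```

Where the ETL transforms data in flight, you would compare against the expected post-transform values rather than raw counts.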
Our product Ajilius (http://ajilius.com) does 90% of what you're after. We specialise in cloud data warehouse automation. PostgreSQL is our primary DBMS for on-premise and SMP data warehouses; Redshift is one of our cloud platforms (as well as Snowflake and Azure SQL Data Warehouse); and DB2 is a supported data source.
I say "90%" because our data warehouse migration feature reconciles data that is migrated between warehouses, but only when both warehouses were created by Ajilius. I'd like to understand more about your need; if you email me through our web site, we can talk it over in detail.
Two competitors - Matillion and Treasure Data - also work in this space. Matillion is a full ETL tool, Treasure Data is more "EL" without the T. Definitely look at them, they're both good products with different approaches.
I've done a fair bit of reading, and it seems there are a couple of off-the-shelf products that replicate/sync data from an on-premises database to Azure SQL Data Warehouse, but I've found nothing that syncs using an Azure SQL Database as the source. Azure Data Factory holds some promise; however, it looks more suited to one-off loads.
Does anyone know of a way? (An SSIS package is not really an option, as I want the transfer to occur wholly inside the cloud.)
Azure Data Factory can run continuous loads from SQL Database to SQL Data Warehouse. You'll want to look into the frequency and interval parameters for the pipeline.
The documentation is here: https://azure.microsoft.com/en-us/documentation/articles/data-factory-create-datasets/