Treat a Delta Lake table as a transactional store for an external API? - azure-synapse

I apologize in advance, as I am likely going to be showing my ignorance of this space. I am just starting to look into Delta Lake, and I have a feeling my initial concepts were incorrect.
Today I have millions of documents in Azure Cosmos DB. I have a service that combines data from various containers and merges it into combined JSON documents that are then indexed into Elasticsearch.
Our current initiative is to use Synapse to enrich the data before indexing to Elasticsearch. The initial idea was that we would stream the Cosmos DB updates into ADLS via the change feed, and then do the combining (i.e., replace what the combiner service is doing) and the enrichment in Synapse.
The logic in the combiner service is very complex and would be difficult to rewrite from scratch (it is currently an Azure Service Fabric stateless .NET application). I had thought that I could just have my combiner write the final copy (i.e., the JSON we are currently indexing as the end product) to ADLS, and then we would only need to apply our enrichments as additive data. I believe this reflects a misunderstanding of what Delta Lake is. I have been thinking of it as similar to Cosmos DB, where I can push a JSON document via a REST call. I don't think this is a valid scenario, but I can't find any information that states this (perhaps because the assumption is so far off base that it never comes up).
In this scenario, would my only option be to have my service write the consolidated document back to Cosmos DB and then sync that document into ADLS?
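For reference, my mental model of how JSON would actually land in a Delta table from a Synapse Spark pool is something like the sketch below (the paths, container names, and storage account are placeholders I made up; I have not built this, so treat it as an assumption):

# Sketch: appending combined JSON documents to a Delta table from a Synapse
# Spark pool. The abfss:// paths and account names are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the combined JSON documents produced by the combiner service.
combined = spark.read.json("abfss://raw@mylake.dfs.core.windows.net/combined/")

# Delta adds ACID transactions on top of Parquet files in ADLS, but the write
# still goes through Spark (batch or streaming), not a per-document REST call.
(combined.write
    .format("delta")
    .mode("append")
    .save("abfss://curated@mylake.dfs.core.windows.net/delta/combined_docs"))

If that is accurate, then a per-document REST write like the one I do against Cosmos DB today does not map onto Delta Lake directly.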
Thanks!
John

Related

SaaS app data ingestion to DL/DWH - what to include in the NFRs?

We are in the process of buying a SaaS solution for busy sales operations. We want to ensure that we have the ability to access our data and ingest it into our analytics data lake (some of it in near real time). I am looking for advice on what requirements we should have or prefer for vendors and their solutions.
APIs - most vendors mention that they provide APIs for data access; however, what features do the APIs need to have to be suitable for data ingestion into an analytics data lake? For example, Salesforce has a Bulk API; does this mean that if a vendor only offers "lean" APIs, they won't work for the data lake use case?
Direct SQL access - should we prefer SaaS solutions that offer single-tenant databases so that we can obtain direct SQL access?
DB replica - should we expect the vendor to provide a DB replica (if the solution is single-tenant) that we use as a data store for reporting? Obviously, that is an extra cost for us.
Direct SQL access via ODBC - I have also read that if the SaaS app is multi-tenant, ODBC/JDBC drivers could be built to access the DB data via SQL with proper authorization to ensure data security. Would this be a valid request/approach?
Staged tables - should we ask the vendor to stage their DB tables (as files) and load them into our (or their) data lake environment? This would then serve as the raw data source for analytics and as a data archive. My concern is incremental updates.
Are there any other options we should consider, look for in vendor solutions, or request?
Thank you!
You need to provide your requirements (data lake architecture, data latency, etc.) to the vendors and ask them to propose a solution that will work with their product.

Filtered one-way synchronization of Azure SQL database

We have a multi-tenant, single-database application where some customers have expressed a desire to get direct access to their own data.
It has been suggested that I look into Azure SQL Data Sync to achieve a setup where each customer gets their own Azure SQL database, to which we set up a one-way synchronization of their data from the master database.
I managed to find some documentation on this, but once I got around to trying it out in a lab setup, it looks like the ability to filter rows in the sync job has been removed in a later iteration of the Azure Data Sync service.
Am I wrong or is that feature really gone? If so, what would be your suggestions to achieve something similar on Azure?
You cannot filter rows using Azure SQL Data Sync. However, you can build a custom solution based on Sync Framework as explained here.

Create single Azure Analysis Services table from many blobs in Data Lake Store

I'm new to Analysis Services and Data Lake and am working on a POC. I've used Data Factory to pull in some TSV data from blob storage, which is logically organized as small "partition" blobs (thousands of blobs). I have a root folder that can be thought of as containing the whole table, containing subfolders that logically represent partitioning by, say, customer; these contain subfolders that logically represent partitioning the customer's data by, say, date. I want to model this whole folder/blob structure as one table in Analysis Services, but I can't seem to figure out how. I have seen the blog posts and examples that create a single AAS table from a single ADLS file, but information on other data file layouts seems sparse. Is my approach to this wrong, or am I just missing something obvious?
This blog post provides instructions on appending multiple blobs into a single table.
Then the part 3 blog post describes creating some Analysis Services partitions to improve processing performance.
Finally this blog post describes connecting to Azure Data Lake Store (as opposed to Azure Blob Storage in the prior posts).
I would use those approaches to create, say, 20-200 partitions (not thousands) in Azure Analysis Services. Partitions should generally contain at least 8 million rows to get optimal compression and performance, so I assume that will require appending several blobs together to achieve that size.

Dealing with multiple readers/writers in Azure Data Lake

I am new to Azure Data Lake and am currently using Data Factory v2 to move data from my transactional database to Azure Data Lake Storage.
Consider a scenario:
The company has multiple data sources:
Team A is responsible for Source A
Team B is responsible for Source B
Team C is responsible for Source C
Multiple Writers
Each team is responsible for moving its data into the data lake.
Team A moves data under
/TeamA/entity01.csv
/TeamA/entity02.csv
..
Team B moves data under
/TeamB/entity03.csv
..
Multiple Readers
Team Analytics can read the data and perform calculations in a Databricks environment.
Team Power BI can fetch the data, transform it, and copy it into single-tenant folders:
Tenant1/entity01.csv
Tenant2/entity02.csv
Question
How can the readers read without conflicting with the writers, so that while a reader is reading data, the file is NOT being written to by a Team X update Data Factory activity?
What I was thinking / what I have tried:
I was thinking of having a shared source of metadata (maybe in Table Storage, accessible by all the readers), for example:
"teamA/entity1" : [
"TeamA/Entity1/01-02-2018/0000/data.csv",
"TeamA/Entity1/01-01-2018/0000/data.csv",
]
"teamA/entity2" : [
"TeamA/Entity2/01-01-2018/1200/data.csv"
"TeamA/Entity2/01-01-2018/0600/data.csv"
"TeamA/Entity2/01-01-2018/0000/data.csv"
]
"teamB/entity3" : [
"TeamA/Entity3/01-01-2018/0600/data.csv"
"TeamA/Entity3/01-01-2018/0000/data.csv"
]
The writers would have the added responsibility of maintaining a set of versions to avoid deleting/overwriting data.
The readers would have the added responsibility of performing a lookup here and then reading the data.
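Roughly, I imagine the writer and reader sides of that lookup would look something like the sketch below (the table name, connection string, and key layout are placeholders I made up, not something I have running):

# Sketch: writers register immutable, versioned snapshot paths; readers resolve
# the latest registered snapshot before reading. Assumes the azure-data-tables
# package and a pre-created Table Storage table named "entityversions" (hypothetical).
from datetime import datetime, timezone
from azure.data.tables import TableClient

CONN_STR = "<storage-account-connection-string>"
TABLE_NAME = "entityversions"

def register_version(team: str, entity: str, path: str) -> None:
    """Writer side: add a new snapshot path instead of overwriting a file in place."""
    client = TableClient.from_connection_string(CONN_STR, table_name=TABLE_NAME)
    client.upsert_entity({
        # Table Storage keys cannot contain '/', so join with an underscore.
        "PartitionKey": f"{team}_{entity}",
        # Timestamp-based RowKey so versions sort chronologically.
        "RowKey": datetime.now(timezone.utc).strftime("%Y%m%d%H%M%S"),
        "Path": path,
    })

def latest_path(team: str, entity: str) -> str:
    """Reader side: look up the most recently registered snapshot and read only that file."""
    client = TableClient.from_connection_string(CONN_STR, table_name=TABLE_NAME)
    entities = list(client.query_entities(
        "PartitionKey eq @pk", parameters={"pk": f"{team}_{entity}"}
    ))
    if not entities:
        raise LookupError(f"no versions registered for {team}/{entity}")
    return max(entities, key=lambda e: e["RowKey"])["Path"]

# Example:
# register_version("TeamA", "Entity1", "TeamA/Entity1/01-02-2018/0000/data.csv")
# print(latest_path("TeamA", "Entity1"))

The idea is that writers only ever add new snapshot paths, so a reader never touches a file that a Data Factory activity is still writing.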
Data Lake writes to temporary files in the background before subsequently writing to the actual file, which will likely mitigate this problem; however, I'm unsure whether it will 100% avoid clashes.
If you are willing to have the pipelines in one factory, you could use the built-in chaining of activities to let Data Factory manage the dependencies.
We typically write to "serving storage" such as SQL Server rather than letting Power BI have direct access to Data Lake Store, which may help separate things (and also benefits from DirectQuery, etc.).
However, I haven't seen Databricks support for that yet; I'd bet it is coming, similar to how HDInsight can be used.
Notably, as you are finding, Data Lake Store is not an OLTP data source, and this sort of thing isn't what it is meant for; this Stack Overflow post discusses it in more detail: Concurrent read/write to ADLA

Real-time data synchronization from Azure database to Azure SQL Data Warehouse?

I've done a fair bit of reading, and it seems like there are a couple of off-the-shelf products that replicate/sync data from an on-premises database to Azure SQL Data Warehouse, but I've found nothing that syncs using an Azure database as the source. Azure Data Factory holds some promise; however, it looks more suited to one-off loads.
Does anyone know of a way? (An SSIS package is not really an option, as I want the transfer to occur wholly inside the cloud.)
Azure Data Factory can run continuous loads from SQL Database to SQL Data Warehouse. You'll want to look into the frequency and interval parameters on the pipeline's datasets; see the example after the documentation link.
The documentation is here https://azure.microsoft.com/en-us/documentation/articles/data-factory-create-datasets/.
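For example, in the ADF (v1) JSON the scheduling lives in the dataset's availability section; roughly something like the snippet below (the dataset, linked service, and table names are placeholders, so check the linked documentation for the full schema):

{
  "name": "SqlDwOutputDataset",
  "properties": {
    "type": "AzureSqlDWTable",
    "linkedServiceName": "AzureSqlDWLinkedService",
    "typeProperties": {
      "tableName": "dbo.FactSales"
    },
    "availability": {
      "frequency": "Hour",
      "interval": 1
    }
  }
}

With a frequency of "Hour" and an interval of 1, Data Factory keeps producing new slices, which gives you a continuous (if not strictly real-time) load rather than a one-off copy.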