Deal with multiple readers/writers in Azure Data Lake - azure-data-lake

I am new to Azure Data Lake and am currently using Data Factory v2 to move data from my transactional database to Azure Data Lake Storage.
Consider a scenario:
Company has multiple datasources
Team A is responsible for Source A
Team B is responsible for Source B
Team C is responsible for Source C
Multiple Writers
Each Team is responsible for moving the data into the data lake.
Team A moves data under
/TeamA/entity01.csv
/TeamA/entity02.csv
..
Team B moves data under
/TeamB/entity03.csv
..
Multiple Readers
Team Analytics can read the data and perform calculations in a databricks environment
Team Power BI can fetch the data, transform it, and copy it into single-tenant folders
Tenant1/entity01.csv
Tenant2/entity02.csv
Question
How can the readers read without conflicting with the writers, so that while a reader is reading data, the file is NOT being written to by a Team X Data Factory update activity?
What I was thinking / what I have tried:
I was thinking of having a shared source of metadata (maybe as a table in Table Storage accessible by all the readers).
"teamA/entity1" : [
"TeamA/Entity1/01-02-2018/0000/data.csv",
"TeamA/Entity1/01-01-2018/0000/data.csv",
]
"teamA/entity2" : [
"TeamA/Entity2/01-01-2018/1200/data.csv"
"TeamA/Entity2/01-01-2018/0600/data.csv"
"TeamA/Entity2/01-01-2018/0000/data.csv"
]
"teamB/entity3" : [
"TeamA/Entity3/01-01-2018/0600/data.csv"
"TeamA/Entity3/01-01-2018/0000/data.csv"
]
The writers will have the added responsibility of maintaining a set of versions to avoid deleting/overwriting data.
The readers will have the added responsibility of performing a lookup here and then reading the data.
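For illustration only, a reader-side lookup against such a metadata table might look like the sketch below. It assumes the azure-data-tables Python SDK and a hypothetical table named DatasetVersions whose PartitionKey is the dataset name, whose RowKey sorts the newest version first, and which stores the file location in a Path property; none of these names come from a real implementation.

from azure.data.tables import TableClient

# Hypothetical metadata table: PartitionKey = dataset name ("teamA/entity1"),
# RowKey = inverted timestamp so the newest version sorts first,
# Path = full path of that version's file in the lake.
table = TableClient.from_connection_string(
    conn_str="<storage-connection-string>", table_name="DatasetVersions"
)

def latest_path(dataset: str) -> str:
    """Return the most recently committed file path for a dataset."""
    entities = table.query_entities(
        query_filter="PartitionKey eq @ds", parameters={"ds": dataset}
    )
    # Table Storage returns entities ordered by RowKey within a partition,
    # so the first entity is the newest committed version.
    for entity in entities:
        return entity["Path"]
    raise LookupError(f"No committed versions found for {dataset}")

# A reader resolves the path first, then reads that immutable file.
print(latest_path("teamA/entity1"))

Writers would only add a new entity after their file is fully written, so readers never see a path to a file that is still being loaded.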

Data Lake writes to temporary files in the background before subsequently writing to the actual file, which will likely mitigate this problem; however, I'm unsure whether this will 100% avoid clashes.
If you are willing to have the pipelines in one factory you could use the inbuilt chaining of activities to allow data factory to manage the dependencies.
We typically write to "serving storage" such as SQL Server rather than letting Power BI have direct access to Data Lake Store, which may help separate things (and also benefits from DirectQuery etc.).
However, I haven't seen Databricks support for that yet; I'd bet it is coming, similar to how HDInsight can be used.
Notably, as you are finding, Data Lake Store is not an OLTP data source, so this sort of thing isn't what it is meant for; this Stack Overflow post discusses it in more detail: Concurrent read/write to ADLA

Related

Treat Delta Lake table as a transactional store for external API?

I apologize that I am likely going to be showing my ignorance of this space. I am just starting out with Delta Lake, and I have a feeling my initial concepts were incorrect.
Today I have millions of documents in Azure Cosmos DB. I have a service that combines data from various container tables and merges them into combined JSON documents that are then indexed into Elasticsearch.
Our current initiative is to use Synapse to do enriching of the data before indexing to Elasticsearch. The initial idea was that we would stream the Cosmos DB updates into ADLS via the change feed. We would then combine (i.e., replace what the combiner service is doing) and enrich in Synapse.
The logic in the combiner service is very complex and difficult to rewrite from scratch (it is currently an Azure Service Fabric stateless .NET application). I had thought that I could just have my combiner write the final copy (i.e., the JSON we are currently indexing as an end product) to ADLS, and then we would only need to do our enrichments as additive data. I believe this is a misunderstanding of what Delta Lake is. I have been thinking of it as similar to Cosmos DB, where I can push a JSON document via a REST call. I don't think this is a valid scenario, but I can't find any information that states this (perhaps because the assumption is so far off base that it never comes up).
In this scenario, would my only option be to have my service write the consolidated document back to Cosmos Db, and then sync that doc into ADLS?
Thanks!
John
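(As a point of reference, a Delta table is written through Spark rather than through a per-document REST call the way Cosmos DB is. A minimal sketch of landing the combiner's JSON output in ADLS and upserting it into a Delta table from Databricks/Synapse Spark is below; the container paths and the id merge key are made up for illustration.)

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

spark = SparkSession.builder.getOrCreate()

# Combined JSON documents written to ADLS by the existing combiner service
# (hypothetical container and folder names).
docs = spark.read.json("abfss://raw@<account>.dfs.core.windows.net/combined/")

# Upsert into a Delta table, keyed on a hypothetical document id column.
target = DeltaTable.forPath(
    spark, "abfss://curated@<account>.dfs.core.windows.net/combined_delta"
)
(
    target.alias("t")
    .merge(docs.alias("s"), "t.id = s.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)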

Sql Azure - Cross database queries

I have N databases, for example 10 databases.
Every database has the same schema, but different data.
Now I would like to take the data of each database from the table "Table1" and insert it into a common table named Table1Common in a new database "DWHDatabase".
So it's an N-to-1 insert.
How can I do that? I'm trying to solve my issue with elastic queries, but it seems to be a 1-to-1 thing.
Use Azure Data Factory with Linked Services to each database. Use the Copy activity to load the data.
You can also parameterize the solution.
Parameterize linked services
Parameters in Azure Data Factory by Cathrine Wilhelmsen
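Outside of ADF, the same N-to-1 copy can be sketched in plain Python with pyodbc, just to make the shape of the problem concrete; the server name, credentials, and column names below are placeholders, not a production pattern.

import pyodbc

# Hypothetical list of the N source databases; in ADF this would come from
# a pipeline parameter or a Lookup activity.
SOURCE_DBS = ["SourceDb01", "SourceDb02", "SourceDb03"]

BASE = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:myserver.database.windows.net,1433;"
    "Uid=loader;Pwd=<password>;Encrypt=yes;"
)

target = pyodbc.connect(BASE + "Database=DWHDatabase;")

for db in SOURCE_DBS:
    source = pyodbc.connect(BASE + f"Database={db};")
    rows = source.execute("SELECT Col1, Col2 FROM dbo.Table1").fetchall()

    cursor = target.cursor()
    cursor.fast_executemany = True  # batch the inserts
    cursor.executemany(
        # SourceDb records which database each row came from
        "INSERT INTO dbo.Table1Common (SourceDb, Col1, Col2) VALUES (?, ?, ?)",
        [(db, row.Col1, row.Col2) for row in rows],
    )
    target.commit()
    source.close()

target.close()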
Elastic query is best suited for reporting scenarios in which the majority of the processing (filtering, aggregation) can be done on the external source side. It is unsuitable for ETL procedures involving significant amounts of data transfer from a remote database (or databases). Consider Azure Synapse Analytics for large reporting workloads or data warehousing applications with more sophisticated queries.
You may use the Copy activity to copy data across on-premises and cloud-based data storage. After you've copied the data, you may use other activities to alter and analyse it. The Copy activity may also be used to publish transformation and analysis findings for use in business intelligence (BI) and application consumption.
MSFT Copy Activity Overview: Here.

Creating a Datawarehouse

Currently our team is having a major database management/data management issue where hundreds of databases are being built and used for minor/one-off applications where the app should really be pulling from an already existing database.
Since our security is so tight, the owners of these systems of authority will not allow others to pull data from them at a consistent (app-necessary) rate; rather, they allow a single app to do a weekly pull, and that data is then given to the org.
I am being asked to compile all of those publicly available (weekly snapshots) into a single data warehouse for end users to go to. We realistically are talking 30-40 databases each with hundreds of thousands of records.
What is the best way to turn this into a data warehouse? Create a SQL server and treat each one as its own DB on the server? As far as the individual app connections I am less worried, I really want to know what is the best practice to house all of the data for consumption.
What you're describing is more of a simple data lake. If all you're being asked for is a single place for the existing data to live as-is, then sure, directly pulling all 30-40 databases to a new server will get that done. One thing to note is that if they're creating Database Snapshots, those wouldn't be helpful here. With actual database backups, it would be easy to build a process that would copy and restore those to your new server. This is assuming all of the sources are on SQL Server.
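If the sources really do hand over SQL Server backups, the copy-and-restore step is easy to script; here is a rough Python/pyodbc sketch, with an illustrative server name, file paths, and the assumption that the logical file names match the database names.

import pyodbc

# RESTORE cannot run inside a user transaction, so connect with autocommit.
conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};Server=NEWDWHSERVER;"
    "Trusted_Connection=yes;",
    autocommit=True,
)

# Hypothetical weekly backup files already copied onto the new server.
backups = {
    "SourceSystemA": r"D:\Backups\SourceSystemA.bak",
    "SourceSystemB": r"D:\Backups\SourceSystemB.bak",
}

for db, bak in backups.items():
    # Assumes the logical data/log file names match the database name;
    # check the real names first with RESTORE FILELISTONLY.
    conn.execute(
        f"RESTORE DATABASE [{db}] FROM DISK = N'{bak}' WITH REPLACE, "
        f"MOVE N'{db}' TO N'D:\\Data\\{db}.mdf', "
        f"MOVE N'{db}_log' TO N'D:\\Data\\{db}_log.ldf'"
    )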
"Data warehouse" implies a certain level of organization beyond that, to facilitate reporting on an aggregate of the data across the multiple sources. Generally you'd identify any concepts that are shared between the databases and create a unified table for each concept, then create an ETL (extract, transform, load) process to standardize the data from each source and move it into those unified tables. This would be a large lift for one person to build. There's plenty of resources that you could read to get you started--Ralph Kimball's The Data Warehouse Toolkit is a comprehensive guide.
In either case, a tool you might want to look into is SSIS. It's good for copying data across servers and has drivers for multiple different RDBMS platforms. You can schedule SSIS packages from SQL Agent. It has other features that could help for data warehousing as well.

General question about ETL solutions for Azure for a small operation

The way we use data is either retrieving survey data from other organizations, or creating survey instruments ourselves and soliciting organizations under our organization for data.
We have a database where our largest table is perhaps 10 million records. We extract and upload most of our data on an annual basis, with occasionally needing to ETL over large numbers of tables from organizations such as the Census, American Community Survey, etc. Our database is all on Azure and currently the way that I get databases from Census flat files/.csv files is by re-saving them as Excel and using the Excel import wizard.
All of the 'T' in ETL is happening within programmed procedures within my staging database before moving those tables (using Visual Studio) to our reporting database.
Is there a more sophisticated technology I should be using, and if so, what is it? All of my education in this matter comes from perusing Google and watching YouTube, so my grasp on all of the different terminology is lacking and searching on the internet for ETL is making it difficult to get to what I believe should be a simple answer.
For a while I thought we wanted to eventually graduate to using SSIS, but I learned that SSIS is used primarily if you have a database on-prem. I've tried looking at dynamic SQL using BULK INSERT, only to find that BULK INSERT doesn't work with Azure SQL databases. Etc.
Recently I've been learning about Azure Data Factory and the Bulk Copy Program (bcp) using Windows PowerShell.
Does anybody have any suggestions as to what technology I should look at for a small-scale BI reporting solution?
I suggest using Data Factory; it has good performance for large data transfers.
Reference here: Copy performance and scalability achievable using ADF
The Copy activity supports using a table, a query, or a stored procedure to filter data in the Source.
The Sink supports selecting a destination table, a stored procedure, or auto-creating the table (bulk insert) to receive the data.
Data Factory Mapping Data Flow provides more features for data conversion.
Ref: Copy and transform data in Azure SQL Database by using Azure Data Factory.
Hope this helps.
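As a lighter-weight alternative for the one-off Census/.csv loads (instead of re-saving them as Excel), a short Python script can push a flat file straight into an Azure SQL staging table; a minimal sketch assuming pandas, SQLAlchemy, and the ODBC driver, with made-up server and table names:

import urllib.parse

import pandas as pd
from sqlalchemy import create_engine

# Build a SQLAlchemy engine over ODBC (placeholder server and credentials).
params = urllib.parse.quote_plus(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:myserver.database.windows.net,1433;"
    "Database=StagingDb;Uid=loader;Pwd=<password>;Encrypt=yes;"
)
engine = create_engine(f"mssql+pyodbc:///?odbc_connect={params}")

# Read the Census flat file directly; no Excel round-trip needed.
df = pd.read_csv("acs_extract.csv", dtype=str)

# Append into a staging table; the existing 'T' stored procedures in the
# staging database can keep running unchanged afterwards.
df.to_sql("ACS_Staging", engine, schema="stg",
          if_exists="append", index=False, chunksize=10_000)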

Access Azure Data Lake Analytics Tables from SQL Server Polybase

I need to export a multi terabyte dataset processed via Azure Data Lake Analytics(ADLA) onto a SQL Server database.
Based on my research so far, I know that I can write the output of ADLA to a Data Lake Store or WASB using built-in outputters, and then read the output data from SQL Server using Polybase.
However, creating the result of ADLA processing as an ADLA table seems pretty enticing to us. It is a clean solution (no files to manage), multiple readers, built-in partitioning, distribution keys and the potential for allowing other processes to access the tables.
If we use ADLA tables, can I access ADLA tables via SQL Polybase? If not, is there any way to access the files underlying the ADLA tables directly from Polybase?
I know that I can probably do this using ADF, but at this point I want to avoid ADF to the extent possible - to minimize costs, and to keep the process simple.
Unfortunately, Polybase support for ADLA tables is still on the roadmap and not yet available. Please file a feature request through the SQL Data Warehouse User Voice page.
The suggested workaround is to produce the output as CSV in ADLA, then create the partitioned and distributed table in SQL DW and use Polybase to read the data and fill the SQL DW managed table.
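A rough sketch of that workaround from the SQL DW side, run here via Python/pyodbc; it assumes ADLS Gen1, an existing database-scoped credential and ext schema, and made-up object/column names (this is the standard external-table pattern rather than anything specific to this answer):

import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:mydw.database.windows.net,1433;"
    "Database=MyDW;Uid=loader;Pwd=<password>;Encrypt=yes;",
    autocommit=True,
)

# One-time setup: point Polybase at the lake (assumes a database-scoped
# credential named AdlsCredential already exists).
conn.execute("""
CREATE EXTERNAL DATA SOURCE AdlsSource WITH (
    TYPE = HADOOP,
    LOCATION = 'adl://myadlsaccount.azuredatalakestore.net',
    CREDENTIAL = AdlsCredential
)""")

conn.execute("""
CREATE EXTERNAL FILE FORMAT CsvFormat WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"')
)""")

# External table over the CSV files produced by the ADLA job.
conn.execute("""
CREATE EXTERNAL TABLE ext.ProcessedData (
    Id BIGINT,
    Payload NVARCHAR(4000)
) WITH (
    LOCATION = '/output/processeddata/',
    DATA_SOURCE = AdlsSource,
    FILE_FORMAT = CsvFormat
)""")

# CTAS into a distributed managed table inside SQL DW.
conn.execute("""
CREATE TABLE dbo.ProcessedData
WITH (DISTRIBUTION = HASH(Id), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM ext.ProcessedData""")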