data lake job fails when job preparation time exceeds 25 mins - azure-data-lake

I am processing an input file set of approximately 4,000 CSV files in Data Lake. The job fails with a script compile error when the job preparation time exceeds 25 minutes.
We have a requirement to bulk process well beyond 4,000 CSV files. I have heard Microsoft has a solution in preview that can process input file sets as large as 30,000 files.

This is currently an opt-in preview feature. Please use the "contact us" section at the link below to contact the ADLA support team.
Input File Set scales orders of magnitude better (requires opt-in)
https://github.com/Azure/AzureDataLake/blob/master/docs/Release_Notes/2017/2017_03_09/USQL_Release_Notes_2017_03_09.md
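For reference, the kind of file-set EXTRACT that produces such a large input file set looks roughly like this; the path, schema, and virtual column names below are only illustrative, not taken from your job:

```
// U-SQL sketch: a single EXTRACT over thousands of CSV files via a file set
// pattern. {filedate} and {filename} are virtual columns filled in from each
// matched path, so they can be used for filtering and grouping later.
@rows =
    EXTRACT id int,
            value string,
            filedate string,
            filename string
    FROM "/input/{filedate}/{filename}.csv"
    USING Extractors.Csv(skipFirstNRows: 1);

@summary =
    SELECT filedate,
           COUNT(*) AS cnt
    FROM @rows
    GROUP BY filedate;

OUTPUT @summary
TO "/output/summary.csv"
USING Outputters.Csv(outputHeader: true);
```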
As an alternative, you might consider Azure SQL Data Warehouse and Polybase for importing and storing flat files, which would be super fast. ADLA can then connect to Azure SQL Data Warehouse using federated tables. This gives you the ability to "query data where it lives" and leans towards the idea of a logical data lake, which uses the two products, Azure SQL Data Warehouse and Azure Data Lake Analytics (ADLA), together. The trade-off is a more complex architecture / setup, but Polybase is optimised for fast flat-file import. Just an idea.
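A very rough sketch of what the federated side can look like in U-SQL, assuming the Azure SQL Data Warehouse is already set up and a credential has already been registered in the ADLA catalog (every name below is a placeholder, not something from the question):

```
// U-SQL sketch: register an Azure SQL Data Warehouse as a data source,
// then query one of its tables "where it lives".
// The credential (MyDb.MyCredential) must already exist in the ADLA catalog.
USE DATABASE MyDb;

CREATE DATA SOURCE MySqlDw
FROM AZURESQLDW
WITH
(
    PROVIDER_STRING = "Database=MyDw;Trusted_Connection=False;Encrypt=True",
    CREDENTIAL = MyDb.MyCredential,
    REMOTABLE_TYPES = (bool, short, int, long, decimal, float, double, string, DateTime)
);

// Federated query against the remote table.
@sales =
    SELECT *
    FROM EXTERNAL MySqlDw LOCATION "dbo.Sales";

OUTPUT @sales
TO "/output/sales_snapshot.csv"
USING Outputters.Csv();
```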
NB I do not work for Microsoft, I'm just copying and pasting the links : )

Related

Data migration - SQL tables to Parquet format

I'm looking for a high-level plan to perform ETL on calls that have been transcribed. The transcriptions are stored by local software, so they live in on-premises SQL tables. We are looking to move the speech transcription info into the data lake, which is a Parquet-format table.
The data needs to be fed into the data lake on an hourly basis.
My question is: what data processes and what workflow are required to achieve the above reliably? How would the "cron" job work?
Apologies that I could not provide more information.

Storing Data for Google Data Studio

I have an application that makes reports available over HTTP in either CSV or JSON format. I want this data to be accessible to Google Data Studio. I was considering building a connector to access the data, but the number of rows that can be accessed at any given time is quite small and there is a daily data limit. So I want to build a system to download the reports daily and store them to be accessed by Data Studio. I created a script to load the reports into Google Cloud SQL, but this is quite expensive because of the base cost of running a Google Cloud SQL instance. Any ideas how else to deal with a situation like this?
You can use Firebase Realtime Database.
I have used it before to store 1 GB of data and 20k rows.
I have code samples for that.

General question about ETL solutions for Azure for a small operation

The way we use data is either by retrieving survey data from other organizations, or by creating survey instruments ourselves and soliciting data from organizations under our own organization.
We have a database where our largest table is perhaps 10 million records. We extract and upload most of our data on an annual basis, occasionally needing to ETL large numbers of tables from sources such as the Census, the American Community Survey, etc. Our database is entirely on Azure, and currently the way I get Census flat files/.csv files into databases is by re-saving them as Excel and using the Excel import wizard.
All of the 'T' in ETL happens in programmed procedures in my staging database before those tables are moved (using Visual Studio) to our reporting database.
Is there a more sophisticated technology I should be using, and if so, what is it? All of my education in this matter comes from perusing Google and watching YouTube, so my grasp of the different terminology is lacking, and searching the internet for ETL makes it difficult to get to what I believe should be a simple answer.
For a while I thought we would eventually graduate to using SSIS, but I learned that SSIS is used primarily if you have a database on-prem. I've tried looking at dynamic SQL using BULK INSERT, only to find that BULK INSERT doesn't work with Azure DBs. Etc.
Recently I've been learning about Azure Data Factory and something called the Bulk Copy Program (bcp) using Windows PowerShell.
Does anybody have any suggestions as to what technology I should look at for a small-scale BI reporting solution?
I suggest you use Data Factory; it has good performance for large data transfers.
Reference here: Copy performance and scalability achievable using ADF
The Copy activity lets you use a table, a query, or a stored procedure to filter the data at the source.
The sink lets you select a destination table, a stored procedure, or auto-create table (bulk insert) to receive the data.
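For example, instead of copying the whole table, the Copy activity source can be given a query like the one below (the table and column names here are made up, not from your scenario):

```
-- Illustrative source query for a Copy activity: pull only rows added since a
-- watermark, rather than the full table.
SELECT SurveyId,
       OrganizationId,
       ResponseValue,
       LoadDate
FROM   dbo.SurveyResponses
WHERE  LoadDate > '2020-01-01';  -- in practice, parameterize the watermark value
```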
Data Factory Mapping Data Flows provide more features for data transformation.
Reference: Copy and transform data in Azure SQL Database by using Azure Data Factory.
Hope this helps.

Access Azure Data Lake Analytics Tables from SQL Server Polybase

I need to export a multi-terabyte dataset processed via Azure Data Lake Analytics (ADLA) onto a SQL Server database.
Based on my research so far, I know that I can write the result of the ADLA output to a Data Lake Store or WASB using built-in outputters, and then read the output data from SQL Server using Polybase.
However, creating the result of the ADLA processing as an ADLA table seems pretty enticing to us. It is a clean solution: no files to manage, multiple readers, built-in partitioning, distribution keys, and the potential to allow other processes to access the tables.
If we use ADLA tables, can I access ADLA tables via SQL Polybase? If not, is there any way to access the files underlying the ADLA tables directly from Polybase?
I know that I can probably do this using ADF, but at this point I want to avoid ADF to the extent possible - to minimize costs, and to keep the process simple.
Unfortunately, Polybase support for ADLA tables is still on the roadmap and not yet available. Please file a feature request through the SQL Data Warehouse UserVoice page.
The suggested workaround is to output the data as CSV from ADLA, create the partitioned and distributed table in SQL DW, and then use Polybase to read the files and fill the SQL DW managed table.
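A hedged sketch of that workaround on the SQL DW side, assuming the ADLA job has already written its CSV output to a folder in the Data Lake Store and a database scoped credential already exists (all names, paths, and columns below are placeholders):

```
-- 1. External data source pointing at the Data Lake Store account.
CREATE EXTERNAL DATA SOURCE AzureDataLakeStore
WITH (
    TYPE = HADOOP,
    LOCATION = 'adl://myadls.azuredatalakestore.net',
    CREDENTIAL = ADLSCredential
);

-- 2. File format matching the CSV output produced by ADLA.
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"')
);

-- 3. External table over the CSV folder.
CREATE EXTERNAL TABLE dbo.ext_Results (
    ResultId BIGINT,
    Category NVARCHAR(50),
    Amount   DECIMAL(18, 2)
)
WITH (
    LOCATION = '/output/results/',
    DATA_SOURCE = AzureDataLakeStore,
    FILE_FORMAT = CsvFormat
);

-- 4. Fill the distributed, managed SQL DW table via CTAS.
CREATE TABLE dbo.Results
WITH (DISTRIBUTION = HASH(ResultId), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT * FROM dbo.ext_Results;
```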

SQL/NewSQL database suited for read-only OLAP

I am looking for a SQL/NewSQL database engine suited for fast response times on a read-only star schema for ad-hoc OLAP purposes. The complete dataset is about 4 billion rows and MySQL is performing very slowly on that task.
I would be very grateful for some recommendations.
If it is a read-only star schema workload, try MariaDB ColumnStore. It is very fast with data loads and star-schema-style queries.
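As a rough illustration (the schema, file path, and dimension table are invented here), moving the fact table onto ColumnStore is mostly a matter of the storage engine:

```
-- MariaDB ColumnStore sketch: a columnar fact table plus a typical
-- star-schema aggregate query.
CREATE TABLE fact_sales (
    sale_date  DATE,
    store_id   INT,
    product_id INT,
    quantity   INT,
    amount     DECIMAL(12, 2)
) ENGINE = ColumnStore;

-- Bulk load; cpimport is usually faster for very large files.
LOAD DATA INFILE '/data/fact_sales.csv'
INTO TABLE fact_sales
FIELDS TERMINATED BY ',';

-- Ad-hoc OLAP query joining to a dimension table.
SELECT d.region,
       SUM(f.amount) AS total_amount
FROM fact_sales f
JOIN dim_store d ON d.store_id = f.store_id
GROUP BY d.region;
```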
Hive
"The Apache Hive ™ data warehouse software facilitates reading,
writing, and managing large datasets residing in distributed storage
using SQL. Structure can be projected onto data already in storage. A
command line tool and JDBC driver are provided to connect users to
Hive."