I'm looking for a high-level plan to perform ETL on calls that have been transcribed. The transcriptions are stored by a local application, so they live in on-premises SQL tables. We are looking to move the speech transcription data into the data lake, where it is stored as a Parquet-format table.
The data needs to be fed into the data lake on an hourly basis.
My question is: what data processes and workflow are required to achieve the above reliably? How would the cron job work?
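To make the question a bit more concrete, this is roughly the kind of hourly job I picture. It is only a sketch; the server, table, and column names below are placeholders I made up:

```python
# Hypothetical hourly extract job: pull rows transcribed since the last run and
# append them to a Parquet dataset. Server, table, and column names are placeholders.
import datetime as dt
import pathlib

import pandas as pd          # pyarrow also installed, for to_parquet
import pyodbc

WATERMARK_FILE = pathlib.Path("last_run.txt")        # remembers the previous high-water mark
LAKE_DIR = pathlib.Path("/datalake/transcriptions")  # Parquet output location


def read_watermark() -> str:
    if WATERMARK_FILE.exists():
        return WATERMARK_FILE.read_text().strip()
    return "1900-01-01 00:00:00"                     # first run: take everything


def run_once() -> None:
    since = read_watermark()
    now = dt.datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S")

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=onprem-sql;DATABASE=calls;Trusted_Connection=yes"
    )
    query = """
        SELECT call_id, agent_id, transcribed_at, transcript_text
        FROM dbo.call_transcriptions
        WHERE transcribed_at > ? AND transcribed_at <= ?
    """
    df = pd.read_sql(query, conn, params=[since, now])
    conn.close()

    if not df.empty:
        # one file per hourly batch; a real lake would partition by date/hour
        LAKE_DIR.mkdir(parents=True, exist_ok=True)
        stamp = now.replace(":", "").replace(" ", "_")
        df.to_parquet(LAKE_DIR / f"transcriptions_{stamp}.parquet", index=False)

    WATERMARK_FILE.write_text(now)                   # only advance after a successful write


if __name__ == "__main__":
    run_once()   # cron entry, e.g.:  0 * * * * /usr/bin/python3 /etl/hourly_extract.py
```

The idea would be for cron to run the script at the top of every hour, with the watermark file keeping each run incremental.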
Apologies that I could not provide more information.
I have questions about ways to automate a data transformation process.
What I normally do is transform the data using Python or PostgreSQL and then export the processed data as CSV. After that, I connect the CSV file to Tableau.
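For reference, a simplified version of that flow looks like this (the query, column names, and connection details are just placeholders):

```python
# Simplified version of the current flow: run the transformation in PostgreSQL,
# write the result to a CSV, and point Tableau at that CSV.
import csv

import psycopg2

TRANSFORM_SQL = """
    SELECT customer_id, order_month, SUM(amount) AS monthly_total
    FROM orders
    GROUP BY customer_id, order_month
"""

with psycopg2.connect("dbname=analytics user=me host=localhost") as conn:
    with conn.cursor() as cur:
        cur.execute(TRANSFORM_SQL)
        with open("monthly_totals.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow([col.name for col in cur.description])  # header row
            writer.writerows(cur.fetchall())                        # data rows
```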
I have done some research and found that ETL can help. However, I've watched some ETL tools' demo videos, and I'm not sure whether those tools' transform features would meet my needs. For example, I have written 100+ lines of SQL for one of my data transformation tasks; it would be better if I could keep running that query in PostgreSQL rather than rebuilding it in an ETL tool.
The problem is that I don't know the proper way to automate the data transformation process and then push the data to Tableau. The CSV files will be updated on a daily basis, so I'll need to refresh the data.
Data transformation can be done in various ways; the right fit depends on the nature of your data.
If you have a huge volume of data and are comfortable in Python/Java, you can automate your transformation logic using Spark, write the result to a Hive table, and then connect Tableau to read the data from Hive.
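A minimal sketch of that Spark-to-Hive pattern, assuming PySpark with Hive support; the JDBC details, table names, and the SQL itself are placeholders:

```python
# Illustrative only: read the source data with Spark, apply the transformation in
# Spark SQL, and write the result to a Hive table that Tableau connects to.
# Assumes the PostgreSQL JDBC driver is on the Spark classpath and the
# "reporting" Hive database already exists.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("daily-transform")
    .enableHiveSupport()
    .getOrCreate()
)

# Source connection details are placeholders.
raw = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/analytics")
    .option("dbtable", "public.raw_events")
    .option("user", "etl_user")
    .option("password", "***")
    .load()
)
raw.createOrReplaceTempView("raw_events")

# The long transformation goes here, expressed as Spark SQL.
result = spark.sql("""
    SELECT event_date, channel, COUNT(*) AS events
    FROM raw_events
    GROUP BY event_date, channel
""")

# Overwrite the Hive table each run; Tableau reads this table.
result.write.mode("overwrite").saveAsTable("reporting.daily_events")
```

Because the transformation is expressed as Spark SQL, an existing PostgreSQL query can often be carried over largely as-is.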
Most of the next-gen ETL tools like Pentaho and Talend can be used, but that erodes the flexibility and portability that a framework like Spark or Beam can give.
If you want to know how to achieve this using cloud provider services like GCP or AWS, please let me know.
Prep is the Tableau tool for preparing data. It can be used for joining, appending, cleaning, pivoting, filtering and other data cleansing activities.
Tableau Prep is available:
for free if you have a Tableau Creator license
in desktop and Online/Tableau Server versions
Scheduling Prep flows is available in Tableau Online/Server. To schedule flows, you will need to acquire the Tableau Prep Conductor add-on.
The way we get data is either by retrieving survey data from other organizations, or by creating survey instruments ourselves and soliciting the organizations under ours for data.
We have a database where our largest table is perhaps 10 million records. We extract and upload most of our data on an annual basis, with the occasional need to ETL large numbers of tables from sources such as the Census, American Community Survey, etc. Our database is entirely on Azure, and currently the way I load Census flat files/.csv files is by re-saving them as Excel and using the Excel import wizard.
All of the 'T' in ETL happens in procedures programmed within my staging database before those tables are moved (using Visual Studio) to our reporting database.
Is there a more sophisticated technology I should be using, and if so, what is it? All of my education in this matter comes from perusing Google and watching YouTube, so my grasp on all of the different terminology is lacking and searching on the internet for ETL is making it difficult to get to what I believe should be a simple answer.
For a while I thought we would eventually graduate to SSIS, but I learned that SSIS is used primarily when your database is on-prem. I've tried looking at dynamic SQL using BULK INSERT, only to find that BULK INSERT doesn't work with Azure SQL databases. Etc.
Recently I've been learning about Azure Data Factory and something called the Bulk Copy Program (bcp) run from Windows PowerShell.
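To show the kind of scripted load I mean as an alternative to the Excel re-save, here is a hypothetical sketch; it uses pyodbc rather than bcp, and every name, path, and credential in it is made up:

```python
# Hypothetical sketch: load a Census-style CSV straight into an Azure SQL staging
# table with pyodbc, skipping the Excel re-save step. Assumes the CSV has exactly
# the three columns of the staging table (varchar columns keep the load simple).
import csv

import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver.database.windows.net;DATABASE=staging;"
    "UID=etl_user;PWD=***"
)
cur = conn.cursor()
cur.fast_executemany = True   # batch the inserts instead of sending them row by row

with open("census_extract.csv", newline="") as f:
    reader = csv.reader(f)
    next(reader)              # skip the header row
    rows = list(reader)

cur.executemany(
    "INSERT INTO stg.census_extract (geo_id, variable, value) VALUES (?, ?, ?)",
    rows,
)
conn.commit()
conn.close()
```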
Does anybody have any suggestions as to what technology I should look at for a small-scale BI reporting solution?
I suggest using Data Factory; it has good performance for large data transfers.
Reference here: Copy performance and scalability achievable using ADF
The Copy Activity lets you filter data at the source using a table, a query, or a stored procedure.
At the sink, you can select a destination table, a stored procedure, or auto-create table (bulk insert) to receive the data.
Data Factory Mapping Data Flows provide more features for data conversion.
Ref: Copy and transform data in Azure SQL Database by using Azure Data Factory.
Hope this helps.
We have multiple source systems sending data. Ideally we should capture the raw data coming from the sources and keep it in the data lake. Then we have to process the raw data into a structured format. Users can then update this data via a front-end application.
I am thinking of putting an RDBMS on top of the processed data, then pulling the audit trails from the RDBMS back into the data lake and merging the processed data with the audit trails to create the final view for reporting. Alternatively, the RDBMS could also be used for analytics.
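To illustrate what I mean by merging the processed data with the audit trail, here is a toy sketch using pandas; the tables, columns, and values are all made up:

```python
# Toy illustration of the "merge processed data with audit trail" step:
# apply the latest user edit from the RDBMS audit trail on top of each processed record.
import pandas as pd

processed = pd.DataFrame({
    "record_id": [1, 2, 3],
    "amount": [100, 200, 300],
})

audit_trail = pd.DataFrame({
    "record_id": [2, 2, 3],
    "amount": [210, 220, 310],
    "modified_at": pd.to_datetime(["2023-01-01", "2023-01-05", "2023-01-02"]),
})

# Keep only the most recent edit per record.
latest_edits = (
    audit_trail.sort_values("modified_at")
    .drop_duplicates("record_id", keep="last")
    .set_index("record_id")["amount"]
)

# Final reporting view: the edited value where one exists, the original value otherwise.
reporting = processed.set_index("record_id")
reporting["amount"] = latest_edits.combine_first(reporting["amount"])
print(reporting.reset_index())
```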
Or we could land all the data in the RDBMS first, apply the changes there, and then pull the data from the RDBMS into the data lake. But then there isn't much point in bringing in a data lake at all.
Kindly suggest.
Thanks,
ADLA is NOT consumer oriented, meaning you would not connect a front-end system to it.
If the question is "what should we do", I'm not sure anyone can answer that for you, but it sounds like you are on the right track.
What I can do is tell you what we do:
Raw data (CSV or TXT files) comes into Blob Storage.
U-SQL scripts extract that data and store it in Data Lake Analytics tables. [Blobs can be deleted at that point.]
We output processed data as required to "consumable" sources like an RDBMS. There are several ways to do this, but currently we output to pipe-delimited text files in blob storage and use PolyBase to import them into SQL Server (a sketch of that hand-off is below). YMMV.
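A rough sketch of that hand-off, assuming pandas and the azure-storage-blob SDK; the connection string, container, and paths are placeholders:

```python
# Rough illustration of the hand-off step: write a processed result as a
# pipe-delimited file in blob storage so PolyBase can pick it up later.
import io

import pandas as pd
from azure.storage.blob import BlobServiceClient

processed = pd.DataFrame({
    "call_id": [101, 102],
    "duration_sec": [340, 125],
})

buffer = io.StringIO()
# PolyBase external tables typically expect no header row.
processed.to_csv(buffer, sep="|", index=False, header=False)

service = BlobServiceClient.from_connection_string("<storage-connection-string>")
blob = service.get_blob_client(container="polybase-staging", blob="calls/2020/01/processed.psv")
blob.upload_blob(buffer.getvalue(), overwrite=True)
```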
Pulling the data into Data Lake first and RDBMS second makes sense to me.
I need to export a multi-terabyte dataset processed via Azure Data Lake Analytics (ADLA) into a SQL Server database.
Based on my research so far, I know that I can write the ADLA output to a Data Lake Store or WASB using built-in outputters, and then read the output data from SQL Server using PolyBase.
However, storing the result of the ADLA processing as an ADLA table seems pretty enticing to us. It is a clean solution (no files to manage), with multiple readers, built-in partitioning, distribution keys, and the potential for allowing other processes to access the tables.
If we use ADLA tables, can I access them via PolyBase? If not, is there any way to access the files underlying the ADLA tables directly from PolyBase?
I know that I can probably do this using ADF, but at this point I want to avoid ADF to the extent possible - to minimize costs, and to keep the process simple.
Unfortunately, PolyBase support for ADLA tables is still on the roadmap and not yet available. Please file a feature request through the SQL Data Warehouse UserVoice page.
The suggested workaround is to produce the output as CSV in ADLA, create the partitioned and distributed table in SQL DW, and then use PolyBase to read the data and fill the SQL DW-managed table.
I am processing an input file set of approximately 4,000 CSV files in Data Lake. The job fails with a script compile error when the job preparation time exceeds 25 minutes.
We have a requirement to bulk process more than 4,000 CSV files. I have heard Microsoft has a solution in preview that can process input file sets as large as 30,000 files.
This is currently an opt-in preview feature. Please use the "contact us" section at this link to contact the ADLA support team.
Input File Set scales orders of magnitude better (requires opt-in)
https://github.com/Azure/AzureDataLake/blob/master/docs/Release_Notes/2017/2017_03_09/USQL_Release_Notes_2017_03_09.md
As an alternative, you might consider Azure SQL Data Warehouse and PolyBase for importing and storing the flat files, which would be super fast. ADLA can then connect to Azure SQL Data Warehouse using federated tables. This gives you the ability to "query data where it lives" and leans towards the idea of a logical data lake, which uses the two products, Azure SQL Data Warehouse and Azure Data Lake Analytics (ADLA). The trade-off is a more complex architecture/setup, but PolyBase is optimised for fast flat-file import. Just an idea.
NB I do not work for Microsoft, I'm just copying and pasting the links : )