We have multiple source systems sending data. Ideally we should capture the raw data coming from the sources and keep it in the data lake, then process the raw data into a structured format. Users can then update this data via a front-end application.
I am thinking of putting an RDBMS on top of the processed data, then pulling the audit trails from the RDBMS back into the data lake and merging the processed data with the audit trails to create the final view for reporting. The RDBMS could also be used for analytics.
Or we could bring all the data into the RDBMS first, apply the changes there, and pull the data from the RDBMS into the data lake. But then bringing in a data lake doesn't make much sense.
Kindly suggest.
Thanks,
ADLA is NOT consumer oriented, meaning you would not connect a front-end system to it.
If the question is "what should we do", I'm not sure anyone can answer that for you, but it sounds like you are on the right track.
What I can do is tell you what we do:
Raw data (CSV or TXT files) comes in to Blob Storage.
U-SQL scripts extract that data and store it in Data Lake Analytics tables. (The blobs can be deleted at that point.)
We output processed data as required to "consumable" destinations like an RDBMS. There are several ways to do this, but currently we output to pipe-delimited text files in blob storage and use Polybase to import into SQL Server. YMMV.
Pulling the data into Data Lake first and RDBMS second makes sense to me.
I asked one of our company partners to give us read/write ODBC access so that we can pull raw data and create views from their case management system. They mentioned that they can provide us with a data dump of the tables within their website, from which we would be able to pull data.
I looked into what I can do with a data dump of tables and found that it is a detailed record of the tables in the database. It is intended for taking a backup of a database, or of multiple databases on the server, so that their contents can be restored in the event of any data loss.
I am looking into how I can use this to write my own SQL queries, get what I need, and create views. Where can I read more about how else I can use a data dump of a table?
Thanks
I'm looking for a high-level plan to perform ETL on calls that have been transcribed. The transcriptions are stored in local software, so they sit in on-premises SQL tables. We are looking to move the speech transcription info into the data lake as a parquet-format table.
The data needs to be fed into the data lake on an hourly basis.
My question is: what data processes and what workflow are required to achieve the above reliably? How would the "cron" job work?
Apologies that I could not provide more information.
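For concreteness, here is a rough sketch of what I imagine the hourly pull could look like (purely illustrative: the connection string, table, column names, and lake path are placeholders; it assumes the on-premises tables are SQL Server reachable via pyodbc, that the lake is mounted as a filesystem path, and that the script would simply be scheduled from cron):

    # hourly_transcription_pull.py -- hypothetical sketch, not a tested implementation.
    # Pulls rows added since the last run from the on-premises SQL table and
    # appends them to the data lake as a parquet file. Scheduled hourly from cron:
    #   0 * * * * /usr/bin/python3 /jobs/hourly_transcription_pull.py
    from datetime import datetime, timedelta

    import pandas as pd
    import pyodbc

    # Placeholder connection details for the on-premises SQL Server instance.
    CONN_STR = (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=onprem-sql;DATABASE=Transcripts;UID=etl_user;PWD=..."
    )

    def run_hourly_pull(lake_root="/mnt/datalake/transcriptions"):
        # Simple watermark: only take rows transcribed in the previous hour.
        # A production job would persist the last successful watermark instead.
        end = datetime.utcnow().replace(minute=0, second=0, microsecond=0)
        start = end - timedelta(hours=1)

        query = """
            SELECT CallId, AgentId, TranscribedAt, TranscriptText
            FROM dbo.CallTranscriptions
            WHERE TranscribedAt >= ? AND TranscribedAt < ?
        """
        with pyodbc.connect(CONN_STR) as conn:
            df = pd.read_sql(query, conn, params=[start, end])

        if not df.empty:
            # One parquet file per hourly window; a table in the lake can be
            # defined over this folder.
            out_path = f"{lake_root}/load_hour={start:%Y%m%d%H}.parquet"
            df.to_parquet(out_path, index=False)  # requires pyarrow

    if __name__ == "__main__":
        run_hourly_pull()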
I need to export a multi-terabyte dataset processed via Azure Data Lake Analytics (ADLA) onto a SQL Server database.
Based on my research so far, I know that I can write the result of the ADLA output to a Data Lake Store or WASB using built-in outputters, and then read the output data from SQL Server using Polybase.
However, creating the result of ADLA processing as an ADLA table seems pretty enticing to us. It is a clean solution: no files to manage, multiple readers, built-in partitioning and distribution keys, and the potential for other processes to access the tables.
If we use ADLA tables, can I access ADLA tables via SQL Polybase? If not, is there any way to access the files underlying the ADLA tables directly from Polybase?
I know that I can probably do this using ADF, but at this point I want to avoid ADF to the extent possible - to minimize costs, and to keep the process simple.
Unfortunately, Polybase support for ADLA Tables is still on the roadmap and not yet available. Please file a feature request through the SQL Data Warehouse User voice page.
The suggested workaround is to produce the information as CSV in ADLA, then create the partitioned and distributed table in SQL DW and use Polybase to read the data and fill the SQL DW managed table.
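For illustration, here is a rough sketch of the SQL DW side of that workaround, driven from Python with pyodbc. The external data source (AdlsStore), file format (CsvFormat), schema, and columns are assumptions and would need to match the files your ADLA job actually produces:

    # Hypothetical sketch of the CSV + Polybase workaround, assuming the ADLA job
    # has already written delimited files under /output/sales/ in the Data Lake
    # Store, and that an external data source (AdlsStore) and file format
    # (CsvFormat) already exist in the SQL DW database.
    import pyodbc

    SQLDW_CONN = (
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=myserver.database.windows.net;DATABASE=mydw;UID=loader;PWD=..."
    )

    DDL = [
        # External table over the files produced by ADLA.
        """
        CREATE EXTERNAL TABLE ext.Sales (
            SaleId     BIGINT,
            CustomerId INT,
            SaleDate   DATE,
            Amount     DECIMAL(18, 2)
        )
        WITH (
            LOCATION    = '/output/sales/',
            DATA_SOURCE = AdlsStore,
            FILE_FORMAT = CsvFormat
        );
        """,
        # CTAS into a distributed managed table; Polybase reads the external
        # files in parallel while filling it.
        """
        CREATE TABLE dbo.Sales
        WITH (
            DISTRIBUTION = HASH(SaleId),
            CLUSTERED COLUMNSTORE INDEX
        )
        AS SELECT * FROM ext.Sales;
        """,
    ]

    with pyodbc.connect(SQLDW_CONN, autocommit=True) as conn:
        cur = conn.cursor()
        for stmt in DDL:
            cur.execute(stmt)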
I have a web app with a database where consumption data is stored in a SQL database. I want to consolidate the data older than 3 months in the SQL database and save the unconsolidated (detail) data to storage. That data will not be accessed often, because the consolidated info will be available in SQL; it is only there in case something goes wrong. Is it better to use table storage or blob storage? Thanks for your advice.
The data will be accessed separately, based on which building it comes from. For example, for building A someone may come and want to know the detailed consumption for a week or a day half a year ago; then I will go to storage and get the data. The data in SQL is stored every 5 minutes.
You can use either blob storage or table storage for this purpose but I am more inclined towards using table storage to store this data.
The reason is that you would want some kind of querying capability, which is only offered by table storage. With blob storage, you would need to download all the relevant data on the client side, parse it into some kind of collection, and then query that collection. With table storage, you can execute server-side queries.
If you're going with table storage, my recommendation would be to use a date/time value (with date precision) as the PartitionKey. This will make searching for data by date/time much faster.
If you're going with blob storage, my recommendation would be to use a Cool Storage account for saving this data. Since you would rarely need this data, storing it in a Cool Storage account would be cheaper than storing it in a regular storage account.
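To illustrate the table storage option, here is a small sketch using the azure-data-tables Python package; the table name, connection string, and RowKey scheme are just assumptions, the point is the date-valued PartitionKey:

    # Hypothetical sketch: archiving 5-minute consumption readings to Azure Table
    # Storage with the date as PartitionKey, so a whole day can be fetched with a
    # server-side query.
    from azure.data.tables import TableServiceClient

    service = TableServiceClient.from_connection_string("<storage connection string>")
    table = service.create_table_if_not_exists("ConsumptionArchive")

    def archive_reading(building, timestamp, value_kwh):
        table.create_entity({
            "PartitionKey": timestamp.strftime("%Y-%m-%d"),   # date precision
            "RowKey": f"{building}-{timestamp:%H%M%S}",       # unique within the day
            "Building": building,
            "ReadingTime": timestamp.isoformat(),
            "ValueKwh": value_kwh,
        })

    def readings_for_day(building, day):
        # Server-side filter: one partition (the day) plus the building property.
        flt = f"PartitionKey eq '{day:%Y-%m-%d}' and Building eq '{building}'"
        return list(table.query_entities(flt))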
I am using BigQueryIO to publish data into BigQuery from a Google Dataflow job.
AFAIK, BigQuery can be used to query data from Google Cloud Storage, Google Drive and Google Sheets.
But when we store data using BigQueryIO, where will the data be stored? Is it in Google Cloud Storage?
Short answer: BigQueryIO writes to and reads from BigQuery tables.
To go a little deeper:
BigQuery stores data in the Capacitor columnar data format, and offers the standard database concepts of tables, partitions, columns, and rows.
It manages the technical aspects of storing your structured data, including compression, encryption, replication, performance tuning, and scaling.
You can read more about BigQuery's different components in the BigQuery Overview.
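For reference, here is a minimal Beam sketch of writing rows into a BigQuery table, using the Python SDK's WriteToBigQuery (the counterpart of the Java BigQueryIO transform mentioned in the question); the project, dataset, table, and schema are placeholders:

    # Minimal sketch: the rows end up in BigQuery's own managed storage, not in a
    # Cloud Storage bucket you own.
    import apache_beam as beam

    with beam.Pipeline() as p:
        (
            p
            | "CreateRows" >> beam.Create([
                {"name": "alice", "score": 10},
                {"name": "bob", "score": 7},
            ])
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                table="my-project:my_dataset.scores",
                schema="name:STRING,score:INTEGER",
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )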
Cloud Storage is a separate service from BigQuery. Internally, BigQuery manages its own storage.
So, if you save your data to Cloud Storage and then use the bq command-line tool to load a BigQuery table from a file in Cloud Storage, there are now two copies of the data.
Consequences include:
If you delete the Cloud Storage copy, the data will still be in BigQuery.
You pay storage fees for each copy. I think as of April 2017, long-term storage in BigQuery is around $0.01/GB, and Cloud Storage is around $0.01–$0.026/GB depending on the storage class.
If the same data is in both GCS and BQ, you are paying for storage twice. Whether it is worthwhile to have a backup copy of the data is up to you.
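For example, that GCS-to-BigQuery load could look like this with the Python client (bucket, dataset, and table names are placeholders); afterwards the CSV still sits in Cloud Storage and an independent copy lives inside BigQuery:

    # Hypothetical sketch: load a CSV from Cloud Storage into a BigQuery table.
    # After this, deleting gs://my-bucket/data.csv does not affect the table.
    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,  # infer the schema from the file
    )

    load_job = client.load_table_from_uri(
        "gs://my-bucket/data.csv",
        "my-project.my_dataset.my_table",
        job_config=job_config,
    )
    load_job.result()  # wait for the load to finish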
BigQuery is a managed data warehouse; simply put, it's a database.
So your data will be stored in BigQuery, and you can access it using SQL queries.
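For example, a minimal sketch of querying it with standard SQL via the Python client (project, dataset, and table names are placeholders):

    # Hypothetical sketch: query the table that the pipeline wrote to.
    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT name, SUM(score) AS total_score
        FROM `my-project.my_dataset.scores`
        GROUP BY name
    """
    for row in client.query(query).result():
        print(row.name, row.total_score)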