Create single Azure Analysis Services table from many blobs in Data Lake Store - azure-data-lake

I'm new to analysis services and data lake, working on a POC. I've used data factory to pull in some TSV data from blob storage, which is logically organized as small "partition" blobs (thousands of blobs). I have a root folder that can be thought of as containing the whole table, containing subfolders that logically represent partitioning by, say, customer - these contain subfolders that loggically represent partitioning the customer's data by, say, date. I want to model this whole folder/blob structure as one table in Analysis Services, but can't seem to figure out how. I have seen the blog posts and examples that create a single AAS table from a single ADLS file, but information on other data file layouts seems sparse. Is my approach to this wrong, or am I just missing something obvious?

This blog post provides instructions on appending multiple blobs into a single table.
Then the part 3 blog post describes creating some Analysis Services partitions to improve processing performance.
Finally this blog post describes connecting to Azure Data Lake Store (as opposed to Azure Blob Storage in the prior posts).
I would use those approaches to create say 20-200 partitions (not thousands) in Azure Analysis Services. Partitions should generally be at least 8 million rows to get optimal compression and performance. I assume that will require appending several blobs together in order to achieve that size.

Related

Creating a Datawarehouse

Currently our team is having a major database management/data management issue where hundreds of databases are being built and used for minor/one off applications where the app should really be pulling from an already existing database.
Since our security is so tight, the owners of these Systems of authority will not allow others to pull data from them at a consistent (App Necessary) rate, rather they allow a single app to do a weekly pull and that data is then given to the org.
I am being asked to compile all of those publicly available (weekly snapshots) into a single data warehouse for end users to go to. We realistically are talking 30-40 databases each with hundreds of thousands of records.
What is the best way to turn this into a data warehouse? Create a SQL server and treat each one as its own DB on the server? As far as the individual app connections I am less worried, I really want to know what is the best practice to house all of the data for consumption.
What you're describing is more of a simple data lake. If all you're being asked for is a single place for the existing data to live as-is, then sure, directly pulling all 30-40 databases to a new server will get that done. One thing to note is that if they're creating Database Snapshots, those wouldn't be helpful here. With actual database backups, it would be easy to build a process that would copy and restore those to your new server. This is assuming all of the sources are on SQL Server.
"Data warehouse" implies a certain level of organization beyond that, to facilitate reporting on an aggregate of the data across the multiple sources. Generally you'd identify any concepts that are shared between the databases and create a unified table for each concept, then create an ETL (extract, transform, load) process to standardize the data from each source and move it into those unified tables. This would be a large lift for one person to build. There's plenty of resources that you could read to get you started--Ralph Kimball's The Data Warehouse Toolkit is a comprehensive guide.
In either case, a tool you might want to look into is SSIS. It's good for copying data across servers and has drivers for multiple different RDBMS platforms. You can schedule SSIS packages from SQL Agent. It has other features that could help for data warehousing as well.

Load daily MySQL DB snapshots from S3 to snowflake

I have daily MySQL DB snapshots stored on S3. This daily DB snapshot is a backup of 1000 tables in our DB, using mysqldump, size is about 300M daily (stored 1 year of snapshots, which is about 110G).
Now we want to load these snapshots daily to snowflake for reporting purpose. How do we create tables in snowflake? Shall we create 1000 tables? Will snowflake be able to handle this scenario?
All comments are welcome. Thanks!
One comment before I look at possible solutions: your statement "Our purpose is to avoid creating dimension or fact tables (typical data warehouse approach) to save cost at the beginning" is the sort of thinking that can get companies into real trouble. Once you build something and start using it, in 99% of cases you will be stuck with it - so not designing a proper, supportable, reporting solution (whether it is a Kimball model or something else) from the start is always a false economy. If you take a "quick and dirty" approach now you will regret it in a year's time.
With that out of the way, there seem to be 2 issues you need to address:
How to store your data
How to process your data (to produce you metrics and whatever else you want to do with it)
Data Storage
(Probably stating the obvious) Any tables that you create to hold metrics or which will be accessed by BI tools (including direct SQL) I would hold in Snowflake - otherwise you wont get the performance that Snowflake can deliver and there is little point using Snowflake - you might as well be using Athena directly against your S3 buckets.
For your source tables (currently in S3), in an ideal world I would also copy them into Snowflake and treat S3 as your staging area - so once the data has been copied from S3 to Snowflake you can drop the data from S3 (or archive it or do whatever you want to it).
However, if you need the S3 versions of the data for other purposes (and so can't delete it once it has been copied to Snowflake) then rather than keep duplicate copies of the data you could create External Tables in Snowflake that point to your S3 buckets and don't require you to move the data into Snowflake. Query performance against External Tables will be worse than if the tables were within Snowflake, but performance may be good enough for your purposes - especially if they are "just" being used as data sources rather than for analytical queries.
Computation
There are a number of options for the technologies you use to calculate your metrics - which one you choose is probably down to your existing skillset, cost, supportability, etc.
Snowflake functionality - Stored Procedures, External Functions (still in Preview rather than GA, I believe), etc.
External coding tools: anything that can connect to Snowflake and read/write data (e.g. Python, Spark, etc.)
ETL/ELT tool - probably overkill for your specific use case but if you are building a proper reporting platform that requires an ETL tool then obviously you could use this to create your metrics as well as move your data around
Hope this helps?

Building OLTP DB from Datalake?

I'm confused and having trouble finding examples and reference architecture where someone wants to extract data from an existing data lake (S3/Lakeformation in my case) and build a OLTP datastore that serves as an applications backend. Everything I come across is an OLAP data warehousing pattern (i.e. ETL -> S3 -> Redshift -> BI Tools) where data is always coming IN to the datalake and warehouse rather than being pulled OUT. I don't necessarily have a need for 'business analytics' but I do have a need for displaying graphs in web dashboards with large amounts of time series data points underneath for my websites users.
What if I want to automate pulling extracts of a large dataset in the datalake and build a relational database with some useful data extracts from the various datasets that need to be queried by the hand full instead of performing large analytical queries against a DW?
What if I just want an extract of say, stock prices over 10 years, and just get the list of unique ticker symbols for populating a drop down on a web app? I don't want to query an OLAP data warehouse every time to get this, so I want to have my own OLTP store for more performant queries on smaller datasets that will have much higher TPS?
What if I want to build dashboards for my web app's customers that display graphs of large amounts of time series data currently sitting in the datalake/warehouse. Does my web app connect directly to the DW to display this data? Or do I pull that data out of the datalake or warehouse and into my application DB on some schedule?
My views on your 3 questions:
Why not just use the same ETL solution that is being used to load the datalake?
Presumably your DW has a Ticker dimension that has unique records for each Ticker symbol? What's the issue with querying this as it would be very fast to get the unique Ticker symbols from it?
It depends entirely on your environment/infrastructure and what you are doing with the data - so there is no generic answer anyone could provide you with. If your webapp is showing aggregations of a large volume of data then your DW is probably better at doing those aggregations and passing the aggregated data to your webapp; if the webapp is showing unaggregated data (and only a small subset of what is held in your DW, such as just the last week's data) then loading it into your application DB might make more sense
The pros/cons of any solution would also be heavily influenced by your infrastructure e.g. what's the network performance like between your DW and application?

AWS Glue sync data from RDS (need to sync 4 table from all schema) to S3 (apache parque format)

We are using a Postgres RDS instance (db.t3.2xlarge with around 2TB data). We have a multi-tenancy application so for all organizations who sign up in our product, we are creating a separate schema which replicates our data model. Now a couple of our schemas (around 5 to 10 schemas) contain a couple of big tables (around 5 to 7 big tables where each contains 10 to 200 million rows). For UI we need to show some statics as well as graphs and to calculate that statics as well as graph data we need to perform joins on big tables and it slows down our whole database server. Sometimes we need to do this type of query in night time so that users don't face any performance issues. So ss a solution we are planning to create a data lake in S3 so that all analytical load we can shift out of RDBMS and to an OLAP solution.
As a first step we need to transfer our data from RDS to S3 and also keep syncing both data sources. Can you please suggest which tool is a better choice for us considering the below requirements:
We need to update the last 3 days data on an hourly basis. We want to keep updating recent data because over the 3 day time window, it may change. After 3 days we can consider the data “at rest” and it can rest in the data lake without any future modification.
We are using a multi tenancy system currently and we are having ~350 schemas, But it will be increasing as more organizations sign up in our product.
We are planning to do ETL so in transform we are planning to join all tables and create one denormalized table and store the data in apache parque format in S3. So that we can perform analytical queries on that table using Redshift Spectrum, EMR, or some other tool.
I just found out about AWS Data Lake recently, and also based on my research (which will hopefully, assist you in the best solution possible)..
AWS Athena can partition data, and you may want to partition your data based on tenant id (customer id).
AWS Glue has crawlers:
Crawlers can run periodically to detect the availability of new data
as well as changes to existing data, including table definition
changes.

Creating a blog service or a persistent chat with Table Storage

I'm trying azure storage and can't come up with real life scenarios when I would use it. As far as I understand the only index Table Storage has is Partition Key and Row Key. I can't sort or query on other columns without doing a full partition scan, right?
If I would migrate my blog service from a traditional sql server or a richer nosql database like Mongo i would probably be alright, considering users don't blog that much in one year (I would partition all blog posts per user per year for example). Even if someone would hit around a thousand blog posts a year i would be OK to load them all metadata in memory. I could do smarter partitioning if this won't work well.
If I would migrate my persistent chat service to table storage how would I do that? Users post thousands of messages a day and query history pretty often from desktop clients, mobile devices, web site etc. I don't want to lose on this and only return 1 day history with paging (which can be slow as well).
Any ideas or patterns or what am I missing here?
btw I can always use different database, however considering Table Storage is so cheap I don't want to.
PartitionKey and RowKey values are the only two indexed properties. To work around the lack of secondary indexes, you can store multiple copies of each entity with each copy using a different RowKey value. For instance, one entity will have PartitionKey=DepartmentName and RowKey=EmployeeID, while the other entity will have PartitionKey=DepartmentName and RowKey=EmailAddress. That will allow you to look up either by EmployeeID or emailAddress. Azure Storage Table Design Guide ( http://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/) has more detailed example and has all the information that you need to design a scalable and performant Tables.
We will need more information to answer your second question about how you would migrate contents of your chat service to table storage. We need to understand the format and structure of the data that you currently store in your chat service.