Multiple data marts architecture / modeling on Snowflake cloud data warehouse - schema

Context:
Let's suppose we have multiple data marts (e.g., HR, Accounting, Marketing, ...) and all of them use the star schema for dimensional modeling (the Kimball approach).
Question:
Since Snowflake's cloud data warehouse architecture eliminates the need to spin off separate physical data marts/databases in order to maintain performance, what's the best approach to building multiple data marts on Snowflake?
Create a database for each data mart? Or create one database (EDW) with multiple schemas, where each schema corresponds to a data mart?
Thank you!

Ron is correct - the answer depends on a few things:
If there are conformed dimensions, then one database and schema might be the way to go.
If they are completely non-integrated data marts, I would go with separate schemas or even separate databases. Both are logical containers in Snowflake (rather than physical), with full role-based access control available to segregate users.
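To make the two layouts concrete, here is a minimal Snowflake SQL sketch; the mart and role names (EDW, HR, MARKETING, HR_ANALYST) are hypothetical:

    -- Option A: one EDW database with a schema per mart
    CREATE DATABASE EDW;
    CREATE SCHEMA EDW.HR;
    CREATE SCHEMA EDW.MARKETING;

    -- Option B: a database per mart
    CREATE DATABASE HR_MART;
    CREATE DATABASE MARKETING_MART;

    -- Either way, role-based access control segregates users at the container level
    CREATE ROLE HR_ANALYST;
    GRANT USAGE ON DATABASE EDW TO ROLE HR_ANALYST;
    GRANT USAGE ON SCHEMA EDW.HR TO ROLE HR_ANALYST;
    GRANT SELECT ON ALL TABLES IN SCHEMA EDW.HR TO ROLE HR_ANALYST;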
So really - how do you do it today? Does that work for you, or are there things you need or want to do that you cannot do today with your current physical setup? How is security set up with your BI tools? Do they reference a database name or just a schema name? If you can, minimize changes to your data pipeline and reporting so you have fewer things that might need refactoring (at least for your first POC or migration).
One thing to note is that Snowflake makes it easy to do cross-database joins (i.e., database.schema.table) - all you need is SELECT access - so even if you separate the marts by database you can still do cross-mart reporting if needed.
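For example, assuming hypothetical HR_MART and FINANCE_MART databases, a cross-mart query just uses fully qualified names:

    -- Cross-database join via database.schema.table; only SELECT access is needed
    SELECT e.employee_id,
           e.department,
           SUM(p.amount) AS total_payroll
    FROM HR_MART.PUBLIC.DIM_EMPLOYEE e
    JOIN FINANCE_MART.PUBLIC.FACT_PAYROLL p
      ON p.employee_id = e.employee_id
    GROUP BY e.employee_id, e.department;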
Hope that helps.

There is no specific need to separate star schemas at all.
If you're using shared/conformed dimensions across your marts, separation would actually be an anti-pattern.
If your issue is simplifying the segregation of users, a schema per mart works well.
All of the approaches you've suggested (DB per mart, DW with a schema per mart, ...) will work; I'm just not clear on the need.

The goal of having separate data marts is more related to governance - keeping data organized and where it is expected to be found (i.e., sales transactions in the "sales data mart") - and less related to performance.
The advantage of having a single database acting as a data warehouse is that all your data for analytics will be stored in one place, making it more accessible and easier to find. In this case, you can use schemas to implement (logically) separate data marts. You can also use schemas within a database to keep development data separate from production data for each data mart.
Snowflake is different from traditional relational databases; given its technical architecture, it has no issues joining large tables across different databases/schemas, so you can certainly build different data marts in separate databases and join their facts or dimensions with some other Snowflake database/data mart.
In your specific case, if you have a large number of data marts (e.g., 10 or more) and you're not using Snowflake for much more than data warehousing, I think the best path would be to implement each data mart in its own database and use schemas to manage prod/dev data within each database. This will help keep data organized, as opposed to quickly reaching a point where you have hundreds of tables (every data mart, plus its dev/prod versions) in one database, which won't be a great development or maintenance experience.
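As a sketch of that layout (names are made up), each mart gets its own database with a PROD/DEV schema pair:

    -- One database per mart; schemas separate production and development data
    CREATE DATABASE SALES_MART;
    CREATE SCHEMA SALES_MART.PROD;
    CREATE SCHEMA SALES_MART.DEV;

    CREATE DATABASE HR_MART;
    CREATE SCHEMA HR_MART.PROD;
    CREATE SCHEMA HR_MART.DEV;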
But, from a performance perspective, there's really no noticeable difference.

Related

How Does NoSQL Scale Out Exactly?

I use SQL Server 2012. I have a database sharded across physical tiers by User ID. In my app User is an aggregate root (i.e., nothing about Users comes from or goes into my repository without the entire User coming or going). Everything about one particular User exists on one particular machine.
Is my system any less scalable than one that employs NoSQL? Or, can someone explain how NoSQL systems scale out across servers exactly? Wouldn't they have to shard in a similar manner to what I'm doing? We've all read that NoSQL enables scalability but at the moment I don't see how, say, MongoDB would benefit my architecture.
MongoDB allows you to scale in two ways: sharding and replication. I think you can do both in MS SQL Server.
What usually is different is the data model:
In a relational database, you typically have multiple tables that reference each other. Theoretically, you can do something similar with MongoDB by using multiple collections; however, this is not the way it's typically done. Instead, in MongoDB, you tend to store all the data that belongs together in the same collection. So you typically have fewer collections than you would have tables in a database. This will in many cases result in more redundancy (data is copied). You can try to do that in a relational database, but it's not quite so easy (there will be fewer tables, each having more columns).
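To illustrate in SQL terms (a sketch with invented order/item tables), the normalized design splits the data across tables, while the MongoDB-style design collapses it into one wider, redundant structure:

    -- Normalized relational design: two tables linked by a foreign key
    CREATE TABLE orders (
        order_id INT PRIMARY KEY,
        customer VARCHAR(100)
    );
    CREATE TABLE order_items (
        item_id  INT PRIMARY KEY,
        order_id INT REFERENCES orders(order_id),
        product  VARCHAR(100),
        quantity INT
    );

    -- Denormalized design mimicking an embedded MongoDB document:
    -- one row per item, with the order's data repeated (redundant) on every row
    CREATE TABLE orders_denormalized (
        order_id INT,
        customer VARCHAR(100), -- duplicated for each item of the same order
        product  VARCHAR(100),
        quantity INT
    );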
MongoDB collections are more flexible than tables in that you don't need to define the data model up front (the exact list of columns / properties, the data types). This allows you to change the data model without having to alter the tables - the disadvantage is that you need to take this into account in the application (you can't rely on all rows / documents having the same structure). I'm not sure if you can do that in MS SQL Server.
In MongoDB, each document is a JSON object, so it's a tree rather than a flat row. This allows more flexibility in the data model. For example, in an application I'm developing (Apache Jackrabbit Oak / MongoMK), for each property (column) we can store multiple values: one value for each revision. Doing that in a relational database is possible, but quite tricky.
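A sketch of what "one value per revision" might look like relationally (table and column names are invented; this is not the actual Oak/MongoMK schema):

    -- One row per (node, property, revision); the current value of a property
    -- is the one with the highest revision, which makes even simple reads awkward
    CREATE TABLE node_properties (
        node_id  BIGINT       NOT NULL,
        property VARCHAR(100) NOT NULL,
        revision BIGINT       NOT NULL,
        value    VARCHAR(4000),
        PRIMARY KEY (node_id, property, revision)
    );

    -- Reading the latest value of each property requires a correlated subquery
    SELECT p.property, p.value
    FROM node_properties p
    WHERE p.node_id = 42
      AND p.revision = (SELECT MAX(p2.revision)
                        FROM node_properties p2
                        WHERE p2.node_id = p.node_id
                          AND p2.property = p.property);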

SQL / NoSQL mix possible or not?

I have an application on a relational database that needs to change in order to keep more data. My problem is that just 2 of the tables will store more data (up to billions of entries), and one of the tables is "linked" by FK to other tables. I could give up the relational model for these tables.
I'd like to keep the rest of the DB intact and change only these 2 tables. I'm also doing a lot of queries - from simple selects to GROUP BYs and subqueries - on these tables, so more problems there.
My experience with NoSQL is limited, so I'm asking which one (if any) of its siblings suits my needs:
- huge data
- complex queries
- integration with a SQL database. This is not as important as the first two and I could migrate my entire db to an equivalent if it's worth it.
Thanks
Both relational databases and NoSQL approaches can handle data sets with billions of entries. With the information supplied, it is hard to make a meaningful and specific recommendation. It would be helpful to know more about what you are trying to do with the data, what your options are regarding your hardware and network topology, etc.
I assume, since you are currently using a relational database, that you have probably already looked at partitioning or otherwise structuring your larger tables so that your query performance is satisfactory. This activity by itself can be non-trivial, but IMHO a good database design with optimized SQL can take you a very long way before there is a clear need to explore alternatives.
However, if your data usage looks like write-once, read-often, the join dependencies are manageable, and you need to perform some aggregations over the data set, then you might start to look into alternative approaches like Hadoop or MongoDB - though these choices come with trade-offs in terms of performance, capabilities, platform requirements, latency, and so forth. As for your particular question, integrating a NoSQL repository and a SQL database at the query level might not be realizable without some duplication of data between the two. For example, MongoDB does not like joins (http://stackoverflow.com/questions/4067197/mongodb-and-joins), so you must design your persistence model with that in mind, and this may involve duplication of data.
The point I am trying to make is - identifying the "right" approach will depend on your specific goal and constraints.

When to use separate SQL databases on the same server?

I've worked in several SQL environments. In one environment, the different tables holding business data were split across several different SQL databases, all on the same server.
In another environment, almost all the tables are kept on one single SQL database.
I'm creating a new project that is closely related to another project, and I've been wondering if I should put the new tables in the same SQL database or a new SQL database.
This all runs on MS SQL Server.
What factors do I need to consider as I make this decision?
It's tough to tell from your question what your actual requirements are, or what data you would consider storing in different databases. But in addition to Gordon's points, here are a couple more reasons why you might want to use separate databases for data belonging to different customers/users (this answer assumes that one possible separation of data, whether by database or schema, would be by customer):
As I mentioned in a comment, some customers will demand that their data be stored separately, and you may need to agree to that in writing before you see a penny or are able to secure their business. So you may as well be prepared for that inevitability.
Keeping each customer in their own database makes it very easy to move them if they outgrow your current server. At my previous job we designed the system this way, and it saved our bacon later - we were able to move customers completely to a different server with what essentially amounted to a metadata operation. During a maintenance window, we backed up their database, took the original offline, restored the backup to a new server, and updated a config table that told all the apps where to find that database. This is much more flexible than trying to extract all of their data from a database shared by others...
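In T-SQL the move looks roughly like this (database, path, and config-table names are hypothetical):

    -- On the old server: back up the customer's database and take it offline
    BACKUP DATABASE Customer042 TO DISK = 'E:\backups\Customer042.bak';
    ALTER DATABASE Customer042 SET OFFLINE WITH ROLLBACK IMMEDIATE;

    -- On the new server: restore the backup
    RESTORE DATABASE Customer042 FROM DISK = 'E:\backups\Customer042.bak';

    -- Point the apps at the new location via the config table
    UPDATE dbo.CustomerDatabases
    SET server_name = 'NEWSERVER'
    WHERE customer_db = 'Customer042';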
Separate databases also allow you to handle maintenance differently. One customer needs point-in-time restore, and another doesn't? Perfect: you can just use a different recovery model on each database. Much easier than separating by filegroups and trying to implement some filegroup-level backup solution, and much more efficient than keeping one big database in full recovery.
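For example (hypothetical database names):

    -- Customer that needs point-in-time restore: full recovery plus log backups
    ALTER DATABASE CustomerA SET RECOVERY FULL;

    -- Customer that doesn't: simple recovery, so the log truncates on checkpoint
    ALTER DATABASE CustomerB SET RECOVERY SIMPLE;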
This isn't free, of course; it's about trade-offs. Multiple databases scare some people away, but having managed such a system for 13 years I can tell you that managing 100 or 500 databases that are largely identical is not that much more complicated than managing 500 schemas in one massive database (in fact, I would say it is less so in a lot of respects).
A database is the unit of backup and recovery, so that should be the first consideration when designing database structures. If different data sets have different backup and recovery requirements, they are very good candidates for separate databases.
That is only half the problem, though. In most environments, backup/recovery is pretty much the same for all databases. It becomes a question of application design. In other words, the situation becomes quite subjective.
In the environment that I'm working in right now, here are some criteria for splitting data into different databases:
(1) Publishing tables to a wide audience. We "publish" data in tables and put these into a database, separate from other tables used for building them or for special purposes. Admittedly, SQL Server claims that schemas are the unit of security. However, databases seem to do a good job in the real world.
(2) Strict security requirements. Some data is so sensitive that lawyers have to approve who can see it. This goes into its own database, with its own access.
(3) Separation of data tables (which users can see) and tables that describe the production system.
(4) Separation of tables used for general querying by a skilled group of analysts (the published tables) versus tables used for specific reports/applications.
Finally, I would add this. If some of the data is being updated continuously throughout the day and other data is used for reporting, I would tend to put them in different databases. This helps separate them in the case of problems.

Advice for hand-written OLAP-like extractions from a relational database

We've implemented over the course of the years a series of web-based reports summarizing historical business data (product sales, traffic, etc.). The thing relies heavily on complex SQL queries, and the boss expects the results to be real-time, but they take up to a minute to execute. The reports are customizable along several dimensions.
I've done some basic research, and it looks like what we need is some kind of OLAP (?), ETL(?), whatever.
Is that true? Are we supposed to convert to a whole package and trash our beloved developments, or is there a possibility to keep it relational, SQL-based, and get close to a dedicated solution by simply pre-calculating some optimized views with a batch process running at night? Have you got pointers to good documentation on the subject?
Thank you.
You can do ETL (Extract, transform, and load) at night, loading the (probably summarized) data into tables that can usually be queried pretty quickly. Appropriate indexes are still important.
It often makes sense to put those summary tables in a different schema, a different database, or on a different server, but you don't absolutely have to do that.
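A minimal sketch of such a nightly load (table and column names are invented): aggregate the detail rows into a summary table that the reports query directly:

    -- One-time setup: the summary table plus an index to serve the reports
    CREATE TABLE report.daily_product_sales (
        sale_date  DATE,
        product_id INT,
        units      INT,
        revenue    DECIMAL(18,2)
    );
    CREATE INDEX ix_daily_sales_date
        ON report.daily_product_sales (sale_date, product_id);

    -- Nightly batch: rebuild the pre-aggregated summary from the detail table
    TRUNCATE TABLE report.daily_product_sales;
    INSERT INTO report.daily_product_sales (sale_date, product_id, units, revenue)
    SELECT CAST(sold_at AS DATE),
           product_id,
           SUM(quantity),
           SUM(quantity * price)
    FROM sales.order_lines
    GROUP BY CAST(sold_at AS DATE), product_id;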
The structure of the tables is important, and it's not like designing tables for an OLTP system. The IBM Redbooks have a couple of titles that can help you design the tables.
- Data Modeling Techniques for Data Warehousing
- Dimensional Modeling: In a Business Intelligence Environment
Most DBMSs today support SQL analytic functions. See, for example, Analytic Functions by Example for Oracle, or Window Functions for PostgreSQL.
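For instance, a running total per product can be computed with a window function instead of a self-join (the daily_sales table and its columns are hypothetical):

    -- Running total of revenue per product, ordered by date
    SELECT sale_date,
           product_id,
           revenue,
           SUM(revenue) OVER (PARTITION BY product_id
                              ORDER BY sale_date) AS running_revenue
    FROM daily_sales;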
In the long term, it sounds as though a move to a data warehouse would definitely benefit you (as suggested in Catcall's answer). You can use the existing reports as a starting point for your data warehouse's requirements.
In the short term, you could build summarised tables optimised for your existing reporting requirements. This should probably be regarded as a stopgap, unless you are never going to change these reports again.
You might also benefit from looking into partitioning tables in your database by date/time, since you will probably still want to report the current day's data for realtime reporting purposes.
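As a sketch in SQL Server syntax (names and boundary dates invented), monthly range partitioning on the date column keeps the current data in a small, hot partition:

    -- Partition function/scheme splitting rows into monthly ranges
    CREATE PARTITION FUNCTION pf_monthly (DATE)
        AS RANGE RIGHT FOR VALUES ('2012-01-01', '2012-02-01', '2012-03-01');

    CREATE PARTITION SCHEME ps_monthly
        AS PARTITION pf_monthly ALL TO ([PRIMARY]);

    -- The fact table is placed on the partition scheme by its date column
    CREATE TABLE sales.order_lines (
        sold_at    DATE          NOT NULL,
        product_id INT           NOT NULL,
        quantity   INT           NOT NULL,
        price      DECIMAL(18,2) NOT NULL
    ) ON ps_monthly (sold_at);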

Using views in a data warehouse

I recently inherited a warehouse that uses views to summarise data. My question is this:
Are views good practice, or the best approach?
I was intending to use cubes to aggregate multidimensional queries.
Sorry if this is a basic question - I'm not experienced with warehousing and Analysis Services.
Thanks
Analysis Services and views differ fundamentally in that they are consumed by different reporting and analytic tools.
If you have SQL-based reports (e.g., through Reporting Services or Crystal Reports), the views may be useful for these. Views can also be materialised (these are called indexed views on SQL Server); in this case they are persisted to disk, which can reduce the I/O needed to query the view. A query against a non-materialised view will still hit the underlying tables.
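A minimal sketch of an indexed view on SQL Server (object names invented): SCHEMABINDING plus a unique clustered index is what persists the view's result to disk:

    -- Indexed view: SCHEMABINDING and COUNT_BIG(*) are required with GROUP BY;
    -- assumes dbo.order_lines.quantity is NOT NULL
    CREATE VIEW dbo.v_sales_by_product
    WITH SCHEMABINDING
    AS
    SELECT product_id,
           SUM(quantity) AS total_units,
           COUNT_BIG(*)  AS row_count
    FROM dbo.order_lines
    GROUP BY product_id;
    GO

    -- The unique clustered index is what materialises the view on disk
    CREATE UNIQUE CLUSTERED INDEX ix_v_sales_by_product
        ON dbo.v_sales_by_product (product_id);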
Often, views are used for security or simplicity purposes (i.e. to encapsulate business logic or computations in something that is simple to query). For security, they can restrict access to sensitive data by filtering (restricting the rows available) or masking off sensitive fields from the underlying table.
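For example (hypothetical table and role), a view can both filter rows and mask a sensitive column:

    -- Expose only EMEA rows and deliberately omit the salary column
    CREATE VIEW hr.v_employees_emea AS
    SELECT employee_id,
           name,
           department
    FROM hr.employees
    WHERE region = 'EMEA';

    -- Users query the view; they get no access to the underlying table
    GRANT SELECT ON hr.v_employees_emea TO reporting_role;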
Analysis Services uses different query and reporting tools, and does pre-compute and store aggregate data. The interface to the server is different to SQL Server, so reporting or query tools for a cube (e.g. ProClarity) are different to the tools for reporting off a database (although some systems do have the ability to query from either).
Cubes are a much better approach to summarize data and perform multidimensional analysis on it.
The problem with views is twofold: bad performance (all those joins and GROUP BYs), and the user's inability to slice and dice the data.
In my projects I use "dumb" views as another layer between the data warehouse and the cubes (i.e., my dimensions and measure groups are based on views), because it allows me a greater degree of flexibility.
Views are useful for security purposes such as to restrict/control/standardise access to data.
They can also be used to implement custom table partitioning implementations and federated database deployments.
If the function of the views in your database is to facilitate the calculation of metrics or statistics then you will certainly benefit from a more appropriate implementation, such as that available through a data warehouse solution.
I was in the same boat a few years ago. In my case I had access to another SQL Server. On the second server I created a linked server to the warehouse, and then created my views and materialized views there. In a sense I had a data warehouse and a reporting warehouse. For the project this approach worked out best, as we were required to give access to the data to other departments and some vendors. Splitting into two separate instances, one for warehousing and one for reporting, also alleviated some of the risks involved with secure access.