I recently inherited a warehouse which uses views to summarise data, and my question is this:
Are views good practice, or the best approach?
I was intending to use cubes to aggregate multidimensional queries.
Sorry if this is a basic question; I'm not experienced with warehousing and Analysis Services.
Thanks
The fundamental difference between Analysis Services and views is that they are consumed by different reporting and analytic tools.
If you have SQL-based reports (e.g. through Reporting Services or Crystal Reports), views may be useful for these. Views can also be materialised (these are called indexed views on SQL Server). In that case they are persisted to disk, and can reduce the I/O needed to query the view. A query against a non-materialised view will still hit the underlying tables.
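As a minimal sketch of an indexed view on SQL Server (table and column names here are hypothetical, and LineTotal is assumed NOT NULL, which indexed views require for SUM):

-- Materialise a daily sales summary as an indexed view.
-- SCHEMABINDING and COUNT_BIG(*) are mandatory before a clustered index can be added.
CREATE VIEW dbo.vw_DailySales
WITH SCHEMABINDING
AS
SELECT OrderDate,
       SUM(LineTotal) AS TotalSales,
       COUNT_BIG(*)   AS RowCnt
FROM dbo.OrderLines
GROUP BY OrderDate;
GO

-- The unique clustered index is what persists the view's result set to disk.
CREATE UNIQUE CLUSTERED INDEX IX_vw_DailySales ON dbo.vw_DailySales (OrderDate);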
Often, views are used for security or simplicity purposes (i.e. to encapsulate business logic or computations in something that is simple to query). For security, they can restrict access to sensitive data by filtering (restricting the rows available) or masking off sensitive fields from the underlying table.
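As a sketch of the security use (hypothetical table and role names), a view can expose only the safe rows and columns, and permissions can be granted on the view instead of the table:

-- Expose only active employees, and hide the salary column.
CREATE VIEW dbo.vw_EmployeeDirectory
AS
SELECT EmployeeID, FirstName, LastName, Department
FROM dbo.Employees
WHERE IsActive = 1;

-- Users in ReportingRole can query the view without any rights on dbo.Employees.
GRANT SELECT ON dbo.vw_EmployeeDirectory TO ReportingRole;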
Analysis Services uses different query and reporting tools, and does pre-compute and store aggregate data. The interface to the server is different to SQL Server, so reporting or query tools for a cube (e.g. ProClarity) are different to the tools for reporting off a database (although some systems do have the ability to query from either).
Cubes are a much better approach to summarize data and perform multidimensional analysis on it.
The problem with views is twofold: poor performance (all those joins and GROUP BYs executed at query time), and the user's inability to slice and dice the data.
In my projects I use "dumb" views as another layer between the data warehouse and the cubes (i.e. my dimensions and measure groups are based on views), because it allows me a greater degree of flexibility.
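A sketch of such a "dumb" view (names are illustrative): the cube's dimension binds to the view, so columns can be renamed or recast here without touching the cube definition.

-- Thin pass-through layer between the warehouse table and the cube.
CREATE VIEW dbo.vw_DimCustomer
AS
SELECT CustomerKey,
       CustomerName,
       Region
FROM dbo.DimCustomer;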
Views are useful for security purposes such as to restrict/control/standardise access to data.
They can also be used to implement custom table partitioning implementations and federated database deployments.
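To illustrate the partitioning point, here is a sketch of a classic partitioned view in SQL Server syntax (table names made up); with CHECK constraints on the year column, the optimiser can skip the tables a query can't match:

-- Each underlying table holds one year of orders.
CREATE VIEW dbo.vw_AllOrders
AS
SELECT * FROM dbo.Orders2022
UNION ALL
SELECT * FROM dbo.Orders2023;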
If the function of the views in your database is to facilitate the calculation of metrics or statistics, you will certainly benefit from a more appropriate implementation, such as a dedicated data warehouse solution.
I was in the same boat a few years ago. In my case I had access to another SQL Server. On the second server I created a linked server to the warehouse and then created my views and materialized views on the second server. In a sense I had a data warehouse and a reporting warehouse. For the project this approach worked out best, as we were required to give access to the data to other departments and some vendors. Splitting the setup into two separate instances, one for warehousing and one for reporting, also alleviated some of the risks around secure access.
Context:
Let's suppose we have multiple data marts (e.g. HR, Accounting, Marketing, ...), all of them using a star schema as the dimensional model (Kimball approach).
Question:
Since Snowflake's cloud data warehouse architecture eliminates the need to spin off separate physical data marts/databases in order to maintain performance, what's the best approach to building multiple data marts on Snowflake?
Create a database for each data mart? Or create one database (EDW) with multiple schemas, where each schema corresponds to a data mart?
Thank you!
Ron is correct - the answer depends on a few things:
If there are conformed dimensions, then one database and schema might be the way to go
If they are completely non-integrated data marts I would go with separate schemas or even separate databases. They are all logical containers in Snowflake (rather than physical) with full role based access control available to segregate users.
So really - how do you do it today? Does that work for you or are there things you need or want to do that you cannot do today with your current physical setup. How is security set up with your BI tools? Do they reference a database name or just a schema name? If you can, minimize changes to your data pipeline and reporting so you have fewer things that might need refactoring (at least for your first POC or migration).
One thing to note is that with Snowflake you have the ability to easily do cross-database joins (i.e., database.schema.table) - all you need is SELECT access, so even if you separate the marts by database you can still do cross-mart reporting if needed.
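For example (database, schema and table names are made up), a cross-mart query is just a matter of fully qualifying the names:

-- Join a fact in the sales mart to a dimension in the HR mart.
SELECT e.department,
       SUM(f.amount) AS total_sales
FROM SALES_DM.PUBLIC.FACT_SALES f
JOIN HR_DM.PUBLIC.DIM_EMPLOYEE e
  ON f.employee_key = e.employee_key
GROUP BY e.department;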
Hope that helps.
There is no specific need to separate star schemas at all.
If you're using shared / conformed dimensions across your marts, separation would actually be an anti-pattern.
If your issue is simplifying the segregation of users, schema per mart works well.
All of the approaches you've suggested (DB/mart, DW/schema,...) will work, I'm just not clear on the need.
The goal of having separate data marts is more related to governance, to keep data organized and where it is expected to be found (i.e. sales transactions in the "sales data mart"), and less related to performance issues.
The advantage of having a single database acting as a data warehouse is that all your data for analytics will be stored in one place, making it more accessible and easier to find. In this case, you can use schemas to implement (logically) separate data marts. You can also use schemas within a database to keep development data separate from production data, for each data mart.
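A sketch of that layout in Snowflake (names are illustrative):

-- One warehouse database, one schema per data mart.
CREATE DATABASE EDW;
CREATE SCHEMA EDW.SALES;
CREATE SCHEMA EDW.HR;
CREATE SCHEMA EDW.MARKETING;

-- Optional parallel schemas to keep development data away from production.
CREATE SCHEMA EDW.SALES_DEV;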
Snowflake is different from traditional relational databases; given its technical architecture, it has no issues with joining large tables between different databases/schemas so you can certainly build different data marts in separate databases and join their facts or dimensions with some other Snowflake database/data mart.
In your specific case, if you have a large number of data marts (e.g. 10 or more) and you're not using Snowflake for much more than data warehousing, I think the best path would be to implement each data mart in its own database and use schemas to manage prod/dev data within each database. This will help keep data organized, as opposed to quickly reaching a point where you'll have hundreds of tables (every data mart, and its dev/prod versions) in one database, which won't be a great development or maintenance experience.
But, from a performance perspective, there's really no noticeable difference.
I have a SQL Server 2012 database which is the backend to an ASP.NET MVC application, storing customer and order information. This database is accessed under high load and high usage.
I now have a requirement to generate ad hoc reports from the database, accessing the same data as the MVC application works with. I am concerned about the impact this would have on the database server and the database itself, around locking etc. As such, there is a distinction between the data: for the app it's operational, but for the reports it's more data-warehouse oriented.
Therefore I am looking at my options as to the best approach to avoid such problems.
I am considering creating another database on a different server and archiving the data to it using a SQL job at regular intervals during the day. My only concern is that this would require maintenance, and creates a dependency to ensure any necessary changes are made to the target database when the source database changes.
What other options are open to me in such a situation, and what advice could be given? What is the best approach?
You don't have to think up your own solution to keep the databases in sync. SQL Server has built-in ways to achieve this:
Database Mirroring
Replication
Always On Availability Groups
If you're using the Enterprise Edition of SQL Server 2012, I would look into Always On Availability Groups; if not, then (transactional) replication. Both of these solutions can keep a second, read-only, near real-time copy of the database.
As Steve McConnell suggests, you should make no assumptions about performance. You should measure it before making any decisions. It is not wise to make design choices without knowing the actual performance overhead, so I would suggest measuring (or simulating) the overhead before even considering a complex architecture; otherwise you won't know whether it's worth the trouble.
Anyway, I think your approach is right. I would create a Windows service which periodically retrieves the data I need from the source database and stores it in the warehouse (the new database). I don't think you will ever find a tool that keeps consistency between the two schemas, unless you want one schema to be an exact copy of the other.
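As a minimal sketch of the periodic copy such a service or job might run (all names are hypothetical, and the source table is assumed to carry a ModifiedDate column):

-- Upsert rows changed since the last run into the reporting database.
-- In practice the watermark would be persisted between runs; this is a placeholder.
DECLARE @LastRunTime datetime2 = DATEADD(HOUR, -1, SYSUTCDATETIME());

MERGE ReportingDW.dbo.Orders AS target
USING (SELECT OrderID, CustomerID, OrderTotal, ModifiedDate
       FROM SourceDB.dbo.Orders
       WHERE ModifiedDate > @LastRunTime) AS source
ON target.OrderID = source.OrderID
WHEN MATCHED THEN
    UPDATE SET CustomerID   = source.CustomerID,
               OrderTotal   = source.OrderTotal,
               ModifiedDate = source.ModifiedDate
WHEN NOT MATCHED THEN
    INSERT (OrderID, CustomerID, OrderTotal, ModifiedDate)
    VALUES (source.OrderID, source.CustomerID, source.OrderTotal, source.ModifiedDate);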
I don't know your exact needs, and perhaps my suggestion is overkill, but I would encourage you to consider an OLAP approach in the data warehouse your reporting data will come from. I should warn you that these systems are oriented toward really big data and advanced reporting needs, but perhaps you can take some ideas from them. Since you are familiar with the Microsoft ecosystem, I would suggest Business Intelligence Development Studio: there you could build an OLAP cube using your normal database as the data source and integrate advanced reporting.
Hope I helped.
I have been trying to learn HANA these past few days and have been running into some problems. As I understand it, SAP HANA is used for de-normalization of data (as per some tutorials I have seen). So I create the analytic views, and I have my data denormalized once they are built. What next? How do I harness/use these views to create reports for business analysis? I need to generate several reports based on this de-normalized data (which I intend to ultimately use for a website-based product). Do I need to create different analytic views for different reports?
HANA is not for denormalization of your data. You don't have to create aggregate and denormalized tables to speed up your analytics. In a classic analytics scenario you might build these, but they result in duplicate data, double maintenance to keep them up to date, etc.
Instead, you can use your normal normalized database tables as the master/transactional data foundation and then build analytic views on top of them. How many views have to be created for different reports depends on your actual business needs, because views expose data from many angles and can be reused. For more complex reports you can of course create calculation views to get the exact data you need.
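For example, once an analytic view is activated it can be queried with plain SQL from the _SYS_BIC schema, which is also how most reporting tools consume it (the package, view and column names below are made up):

-- Aggregation is computed on the fly over the normalized base tables.
SELECT region,
       SUM(sales_amount) AS total_sales
FROM "_SYS_BIC"."my.package/AN_SALES"
GROUP BY region;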
HTH
We've implemented over the course of the years a series of web-based reports summarizing historical business data (product sales, traffic, etc.). The thing relies heavily on complex SQL queries, and the boss expects the results to be real time, but they take up to a minute to execute. The reports are customizable on several dimensions.
I've done some basic research, and it looks like what we need is some kind of OLAP (?), ETL (?), whatever.
Is that true? Are we supposed to convert to a whole package and trash our beloved developments, or is there a possibility to keep it relational and SQL-based, and get close to a dedicated solution by simply pre-calculating some optimized views with a batch process running at night? Have you got pointers to good documentation on the subject?
Thank you.
You can do ETL (extract, transform, and load) at night, loading the (probably summarized) data into tables that can usually be queried pretty quickly. Appropriate indexes are still important.
It often makes sense to put those summary tables in a different schema, a different database, or on a different server, but you don't absolutely have to do that.
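A minimal sketch of such a nightly batch (names are illustrative; the summary table and its indexes would be created once up front):

-- Rebuild the summary table the reports query directly.
TRUNCATE TABLE reporting.daily_product_sales;

INSERT INTO reporting.daily_product_sales (sale_date, product_id, units_sold, revenue)
SELECT sale_date,
       product_id,
       SUM(quantity),
       SUM(quantity * unit_price)
FROM sales.order_lines
GROUP BY sale_date, product_id;
-- An index on (sale_date, product_id) keeps the report filters fast.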
The structure of the tables is important, and it's not like designing tables for an OLTP system. The IBM Redbooks have a couple of titles that can help you design the tables.
Data Modeling Techniques for Data Warehousing
Dimensional Modeling: In a Business Intelligence Environment
Most DBMSs today support SQL analytic functions. See, for example, Analytic Functions by Example for Oracle, or Window Functions for PostgreSQL.
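For instance, a running total per customer, which would otherwise take a self-join or correlated subquery (table and column names are hypothetical):

-- Window (analytic) function: one pass over the table, no self-join.
SELECT customer_id,
       order_date,
       amount,
       SUM(amount) OVER (PARTITION BY customer_id
                         ORDER BY order_date) AS running_total
FROM orders;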
In the long term, it sounds as though a move to a data warehouse would definitely benefit you (as suggested in Catcall's answer). You can use the existing reports as a starting point for your data warehouse's requirements.
In the short term, you could build summarised tables optimised for your existing reporting requirements. This should probably be regarded as a stopgap, unless you are never going to change these reports again.
You might also benefit from looking into partitioning the tables in your database by date/time, since you will probably still want to report on the current day's data for real-time reporting purposes.
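As a sketch of the partitioning idea in SQL Server syntax (boundaries and names are made up; other databases have their own equivalents):

-- Monthly range partitions: queries filtered on sale_date only scan
-- the partitions they can match.
CREATE PARTITION FUNCTION pf_monthly (date)
AS RANGE RIGHT FOR VALUES ('2012-01-01', '2012-02-01', '2012-03-01');

CREATE PARTITION SCHEME ps_monthly
AS PARTITION pf_monthly ALL TO ([PRIMARY]);

CREATE TABLE dbo.SalesFact (
    sale_date date  NOT NULL,
    amount    money NOT NULL
) ON ps_monthly (sale_date);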
I just learned about SQL views, which seem nice, but if I abstract table joining to a data access class, wouldn't this accomplish the same thing? What are your thoughts on this? I've never used views before so this is all pretty new to me.
Remember that not all applications that hit your database may be using the same data access classes. Nor are those classes used in exports or reporting. Views are a better place to abstract some complex things (such as how to get certain kinds of financial information) if you want consistency.
However, don't go overboard with abstracting things into views either. Views should not call views (at least in SQL Server, but I suspect in other databases as well), because you have to materialize all the underlying views to get the data in the top layer. That means that to get the three records you want, you might end up materializing millions of records first. With large tables this can create a performance problem of nightmare proportions. Furthermore, views that call views that call views become a truly difficult maintenance problem when something needs to be fixed.
The main purpose of a view is to abstract away the complexity of producing a specific result set. In large relational databases, you often need to join many tables together to get useful data. By creating views, any client can access that result set without having to go through your data access layer.
Additionally, many RDBMSs will optimize a view by caching the parsed execution plan. If your query is complicated enough, you may pay a substantial query-planning cost every time you run it ad hoc, whereas with a view the plan can be created and cached when the view is first used.
Views can also be great for maintaining backward compatibility. Say you have a table that needs to change, but it would be difficult to update all the clients at once to use the new table schema. You could create a view with the old table name that provides the backwards compatibility. You can then create a new table with the new schema.
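A sketch of that pattern with hypothetical names: the original Customers table has been split in two, and a view with the old name keeps existing clients working while they migrate.

-- Old clients still SELECT ... FROM Customers and never notice the change.
CREATE VIEW dbo.Customers
AS
SELECT c.CustomerID,
       c.CustomerName,
       a.City,
       a.Country
FROM dbo.CustomerCore c
JOIN dbo.CustomerAddress a
  ON a.CustomerID = c.CustomerID;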
I'd say one of the main purposes of views is to simplify the interface between a complex database (whether star schema or OLTP) and another layer (user / OLAP cube / reporting interface).
Broadly, I'd say that if your database can be reached in multiple ways (MS Excel, a .NET app), then you would want to use SQL views, since they are available to all of those consumers; if you create a data access class in C#, for example, it won't be usable by the MS Excel people.
Views, simply put, reduce the apparent complexity of all the joins in a SQL query.
So instead of writing a 30-table join everywhere it's needed, a view does the 30-table join once, and can then be reused in another view or stored procedure as simply:
SELECT * FROM myView
Rather than:
SELECT...
FROM
...
INNER JOIN
...
INNER JOIN
...
INNER JOIN
...
It basically hides all these details. This article should be a great reference: http://www.craigsmullins.com/cnr_0299b.htm
The point is views are not physical structures, they are simply a relational model or "view" of one or many tables in a database system.
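In concrete terms (a toy example with made-up names), the join logic is written once in the view and every consumer just selects from it:

-- Define the join once...
CREATE VIEW dbo.myView
AS
SELECT o.OrderID, c.CustomerName, p.ProductName, o.Quantity
FROM dbo.Orders o
INNER JOIN dbo.Customers c ON c.CustomerID = o.CustomerID
INNER JOIN dbo.Products  p ON p.ProductID  = o.ProductID;

-- ...and reuse it everywhere.
SELECT * FROM myView;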
Abstracting joins to a data access class might give you the same data, but it might not give you the same performance.
Also, for most businesses the database is a shared resource. It's sensible to assume that there are applications already hitting the database that cannot or will not go through your data access class. It's also sensible to assume that some future applications cannot or will not go through your data access class. As a trivial example, the command-line interface and the graphical interface to any dbms you use won't be using your data access class.
Views are also the way SQL databases implement logical data independence. Think of them as part of the public interface.
Views can be shared by interactive SQL users, report writers, OLAP tools, and applications written in different languages or by multiple programming teams that don't share classes with each other.
As such, it's a good way for database designers to share standardized queries across the whole community of users of the data.