How Does NoSQL Scale Out Exactly?

I use SQL Server 2012. I have a database sharded across physical tiers by User ID. In my app User is an aggregate root (i.e., nothing about Users comes from or goes into my repository without the entire User coming or going). Everything about one particular User exists on one particular machine.
Is my system any less scalable than one that employs NoSQL? Or, can someone explain how NoSQL systems scale out across servers exactly? Wouldn't they have to shard in a similar manner to what I'm doing? We've all read that NoSQL enables scalability but at the moment I don't see how, say, MongoDB would benefit my architecture.

MongoDB allows you to scale in two ways: sharding and replication. I think you can do both in MS SQL Server.
What usually is different is the data model:
In a relational database, you typically have multiple tables that reference each other. Theoretically you could do something similar with MongoDB by using multiple collections, but that is not how it's typically done. Instead, in MongoDB you tend to store all the data that belongs together in the same collection, so you typically end up with fewer collections than you would have tables. This often results in more redundancy (data is copied). You can try to do the same thing in a relational database, but it's not quite so easy (there will be fewer tables, each having more columns).
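For illustration, here is roughly what that trade-off looks like on the relational side (table and column names are made up for the example): a normalized pair of tables versus one denormalized table that repeats the customer data on every row.
    -- Normalized: two tables, no redundancy
    CREATE TABLE customers (
        customer_id INT PRIMARY KEY,
        name        VARCHAR(100),
        city        VARCHAR(100)
    );
    CREATE TABLE orders (
        order_id    INT PRIMARY KEY,
        customer_id INT REFERENCES customers(customer_id),
        total       DECIMAL(10,2)
    );
    -- Denormalized (document-style): one wider table, customer data repeated per order
    CREATE TABLE orders_denormalized (
        order_id      INT PRIMARY KEY,
        customer_name VARCHAR(100),
        customer_city VARCHAR(100),
        total         DECIMAL(10,2)
    );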
MongoDB collections are more flexible than tables in that you don't need to define the data model up front (the exact list of columns / properties, the data types). This allows you to change the data model without having to alter the tables - the disadvantage is that you need to take this into account in the application (you can't rely on all rows / documents having the same structure). I'm not sure if you can do that in MS SQL Server.
In MongoDB, each document is a JSON object, so it's a tree and not a flat table. This allows more flexibility in the data model. For example, in an application I'm developing (Apache Jackrabbit Oak / MongoMK), for each property (column) we can store multiple values: one value for each revision. Doing that in a relational database is possible, but quite tricky.
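As a sketch of why that is tricky relationally (table and column names are made up, not Oak's actual storage format): you need an extra table keyed by revision, and every read has to pick the right revision per property.
    -- One row per (node, property, revision) instead of one column per property
    CREATE TABLE node_properties (
        node_id     BIGINT       NOT NULL,
        prop_name   VARCHAR(100) NOT NULL,
        revision_id BIGINT       NOT NULL,
        prop_value  VARCHAR(4000),
        PRIMARY KEY (node_id, prop_name, revision_id)
    );
    -- Read the latest value of each property at or before a given revision
    SELECT p.prop_name, p.prop_value
    FROM node_properties p
    JOIN (
        SELECT node_id, prop_name, MAX(revision_id) AS revision_id
        FROM node_properties
        WHERE node_id = 42 AND revision_id <= 1000
        GROUP BY node_id, prop_name
    ) latest
      ON  p.node_id     = latest.node_id
      AND p.prop_name   = latest.prop_name
      AND p.revision_id = latest.revision_id;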

Related

How to convert a SQL many-to-many data model to a Firebase data model

I am having trouble converting my data model for an attendance system for a football trainer (I designed it as a normalized relational SQL model) to a Firebase model. Here is a picture of my relational model:
I was thinking about creating 4 collections as well:
Players
Attendance
Match
MatchType (it can be friendly-match, tournament, practice, among others)
I think this depends on how you want to use the data. Looking at your schema, it seems that one collection, "Attendance", may be enough, since it is the one connecting all the tables.
The idea of relational databases is that data should not be redundant, so each piece of information is stored only once and connected by relations using keys such as PlayerID.
In NoSQL databases, by contrast, you don't care as much about data redundancy, so you store the same information (like the player name) in many documents. The idea is to have everything in one document and avoid sophisticated queries to get the information: just fetch the document and you have everything.
So it all depends on how you use the information, which we do not know. You could put everything in one collection and get all the information from a single document.
On the other hand, you could create 4 collections with exactly the same fields as in the SQL database and use them in a relational way, just to have a cheap, fast, serverless database engine.
What's more, you can change your solution at any time, since you don't define any schema up front.
So in Firestore you are free to choose either solution; just think first about how you will use the information.
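Since the original diagram isn't shown here, the following is just a guess at the normalized relational model being converted (table and column names are assumptions), with Attendance as the join table connecting Players and Matches:
    CREATE TABLE MatchTypes (
        match_type_id INT PRIMARY KEY,
        name          VARCHAR(50)   -- friendly-match, tournament, practice, ...
    );
    CREATE TABLE Matches (
        match_id      INT PRIMARY KEY,
        match_type_id INT REFERENCES MatchTypes(match_type_id),
        match_date    DATE
    );
    CREATE TABLE Players (
        player_id INT PRIMARY KEY,
        name      VARCHAR(100)
    );
    CREATE TABLE Attendance (
        player_id INT REFERENCES Players(player_id),
        match_id  INT REFERENCES Matches(match_id),
        attended  BOOLEAN,
        PRIMARY KEY (player_id, match_id)
    );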

Multiple Datamarts Architecture / Modeling on Snowflake cloud data warehouse

Context:
Let's suppose we have multiple data marts (e.g. HR, Accounting, Marketing...) and all of them use a star schema as the dimensional model (Kimball approach).
Question:
Since Snowflake's cloud data warehouse architecture eliminates the need to spin off separate physical data marts/databases in order to maintain performance, what's the best approach to building the multiple data marts on Snowflake?
Create a database for each data mart? Or create one database (EDW) with multiple schemas, with each schema representing a data mart?
Thank you!
Ron is correct - the answer depends on a few things:
If there are conformed dimensions, then one database and schema might be the way to go
If they are completely non-integrated data marts I would go with separate schemas or even separate databases. They are all logical containers in Snowflake (rather than physical) with full role based access control available to segregate users.
So really - how do you do it today? Does that work for you, or are there things you need or want to do that you cannot do with your current physical setup? How is security set up with your BI tools? Do they reference a database name or just a schema name? If you can, minimize changes to your data pipeline and reporting so you have fewer things that might need refactoring (at least for your first POC or migration).
One thing to note is that with Snowflake you have the ability to easily do cross-database joins (i.e., database.schema.table) - all you need is SELECT access, so even if you separate the marts by database you can still do cross-mart reporting if needed.
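For example (database, schema, and table names are made up), a cross-database join only needs the fully qualified names and SELECT access on both sides:
    SELECT f.order_id, d.customer_name
    FROM SALES_DM.PUBLIC.FCT_ORDERS     f
    JOIN FINANCE_DM.PUBLIC.DIM_CUSTOMER d
      ON f.customer_key = d.customer_key;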
Hope that helps.
There is no specific need to separate star schemas at all.
If you're using shared / conformed dimensions across your marts, separation would actually be an anti-pattern.
If your issue is simplifying the segregation of users, schema per mart works well.
All of the approaches you've suggested (DB/mart, DW/schema,...) will work, I'm just not clear on the need.
The goal of having separate data marts is more related to governance, to keep data organized and where it is expected to be found (i.e. sales transactions in the "sales data mart"), and less related to performance issues.
The advantage of having a single database acting as a data warehouse is that all your data for analytics will be stored in one place, making it more accessible and easier to find. In this case, you can use schemas to implement (logically) separate data marts. You can also use schemas within a database to keep development data separate from production data, for each data mart.
Snowflake is different from traditional relational databases; given its technical architecture, it has no issues with joining large tables between different databases/schemas so you can certainly build different data marts in separate databases and join their facts or dimensions with some other Snowflake database/data mart.
In your specific case, if you have a large number of data marts (e.g. 10 or more) and you're not using Snowflake for much more than data warehousing, I think the best path would be to implement each data mart in its own database and use schemas to manage prod/dev data within each database. This will help keep data organized, as opposed to quickly reaching a point where you'll have hundreds of tables (every data mart, plus its dev/prod versions) in one database, which won't be a great development or maintenance experience.
But, from a performance perspective, there's really no noticeable difference.
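If you do go with a database per mart, a minimal sketch of that layout in Snowflake SQL (database, schema, and role names are illustrative, and the roles are assumed to exist already):
    CREATE DATABASE HR_DM;
    CREATE SCHEMA HR_DM.PROD;
    CREATE SCHEMA HR_DM.DEV;
    CREATE DATABASE MARKETING_DM;
    CREATE SCHEMA MARKETING_DM.PROD;
    CREATE SCHEMA MARKETING_DM.DEV;
    -- Role-based access control can then be granted per database or per schema
    GRANT USAGE ON DATABASE HR_DM TO ROLE HR_ANALYST;
    GRANT USAGE ON SCHEMA HR_DM.PROD TO ROLE HR_ANALYST;
    GRANT SELECT ON ALL TABLES IN SCHEMA HR_DM.PROD TO ROLE HR_ANALYST;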

Why does Magento store EAV data across attribute types?

In the Magento system, there are 5 tables that store EAV data, split by attribute type. Is this an effective choice for performance? When I write a SQL query, I still need a UNION clause to get the whole data set. If I instead used one mixed table to store all EAV data using a single data type (varchar, or sql_variant in SQL Server 2008), what performance issues would I run into in the future?
Is this an effective choice for performance?
The Magento developers chose to use an EAV structure because it performs well under high volumes of data. A flat table structure would be suitable for a small setup, but as it scales it becomes less and less efficient.
When I write a SQL query, I still need a UNION clause to get the whole data set.
You should try in every possible case to avoid direct SQL queries on a Magento database. Magento has Setup models that you can use for installing new data: either use the Magento models that already exist and the methods the Setup models provide to create/modify the core config data and variables, or use the underlying Zend Framework ORM if you need to create new tables, etc.
In terms of the EAV part of the database specifically, the way it is set up is complicated to attack from the SQL point of view, which is why the Magento models exist: so it can all be wrapped up in the PHP ORM. Again, avoid SQL queries if you can.
If you have to make direct queries, you wouldn't be writing UNION queries but joins onto those tables, using the eav_attribute table as a pivot to provide both the attribute_id (primary key) and the source table in which the value exists.
By using direct SQL queries you also lose the fallback system that Magento implements where store or website level values can exist, and the Magento models will select them if you ask for them at a store level. If you want to do this manually with SQL then the queries become more complicated as you need to look for those values and if they aren't found, revert to the default (global scope) value.
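If you really do need a direct query, here is a hedged sketch of what that join looks like for one varchar attribute, including the store-level fallback to the default (global) scope. Table and column names are as found in a typical Magento 1 install; verify against your version.
    SELECT
        e.entity_id,
        COALESCE(v_store.value, v_default.value) AS product_name
    FROM catalog_product_entity e
    JOIN eav_attribute a
      ON a.attribute_code  = 'name'
     AND a.entity_type_id  = e.entity_type_id
    LEFT JOIN catalog_product_entity_varchar v_default
      ON v_default.entity_id    = e.entity_id
     AND v_default.attribute_id = a.attribute_id
     AND v_default.store_id     = 0              -- default (global) scope
    LEFT JOIN catalog_product_entity_varchar v_store
      ON v_store.entity_id    = e.entity_id
     AND v_store.attribute_id = a.attribute_id
     AND v_store.store_id     = 1;               -- specific store view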
If I instead used one mixed table to store all EAV data using a single data type (varchar, or sql_variant in SQL Server 2008), what performance issues would I run into in the future?
As mentioned before, it depends on the expected scale of your database. You will notice that there are plenty of flat tables in a standard Magento database, and that EAV structures only apply to the parts that Magento developers have decided may increase drastically in volume (customers, catalog etc).
If you want to implement a custom module and you think that it also has the potential to grow quickly over time, then you can implement your own EAV tables for it. The Magento model scaffolds support this, and there are plenty of resources online about how to set them up.
If your tables are likely to remain (relatively) small, then by all means go for a flat table approach. If it's a custom module and you notice rapid growth, you can always convert it later before it becomes a bottleneck.

How to design a database that needs to change often?

I inherited a project: a program that configures devices via Ethernet. Settings are stored in a database. The set of settings changes constantly as the devices are developed, so schema changes need to be simple (the user must be able to perform this operation).
Currently this simplicity is achieved with an XSD schema (easily readable and editable), and the data is stored as XML. This approach also satisfies the requirement of supporting various database engines (MS SQL and Oracle are currently supported).
I want to move the database structure to a relational model. Are there any solutions that are as easy to change as the one described, while using a relational database?
I want to move the database structure to a relational model.
Why?
Do you want to be able to index/query parts of the configuration, or be able to change just one part of the configuration without touching the rest?
If no, then just treating the XML as opaque BLOB should be sufficient.
If yes, then you'll have to tell us more about the actual structure of the configuration.[1]
[1] BTW, some DBMSes can "see inside" the XML, index the elements and so on, but that would no longer be DBMS-agnostic.
There are several solutions to your design problem.
I suggest the following;
1. Use a different database. Relational databases are not the best choice for this kind of data. There are databases with good support for dynamic data; one example is MongoDB, which uses JSON-style documents.
or
2. Create one (or a small set of) key/value tables. You can support a hierarchical structure by adding a parent column that points to the parent key-value pair.
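A minimal sketch of option 2 (table and column names are illustrative); the parent_id column is what gives you the hierarchy:
    CREATE TABLE device_settings (
        setting_id INT PRIMARY KEY,
        device_id  INT NOT NULL,
        parent_id  INT NULL REFERENCES device_settings(setting_id),
        name       VARCHAR(100)  NOT NULL,
        value      VARCHAR(4000) NULL       -- NULL for pure "container" nodes
    );
    -- e.g. network -> ip_address
    INSERT INTO device_settings VALUES (1, 10, NULL, 'network', NULL);
    INSERT INTO device_settings VALUES (2, 10, 1,    'ip_address', '192.168.0.5');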
I wouldn't recommend changing a relational db schema on the fly as the result of a user operation. It goes against fundamental design rules for relational database design.

When to use separate SQL databases on the same server?

I've worked in several SQL environments. In one environment, the different tables holding business data were split across several different SQL databases, all on the same server.
In another environment, almost all the tables are kept on one single SQL database.
I'm creating a new project that is closely related to another project, and I've been wondering if I should put the new tables in the same SQL database or a new SQL database.
This all runs on MS SQL Server.
What factors do I need to consider as I make this decision?
It's tough to tell from your question what your actual requirements are, or what data you would consider storing in different databases. But in addition to Gordon's points, I can address a couple of additional reasons why you might want to use separate databases for data belonging to different customers/users (this answer assumes that one possible separation of data, whether by database or schema, would be by customer):
As I mentioned in a comment, some customers will demand that their data be stored separately, and you may need to agree to that in writing before you see a penny or are able to secure their business. So you may as well be prepared for that inevitability.
Keeping each customer in their own database makes it very easy to move them if they outgrow your current server. At my previous job we designed the system this way, and it saved our bacon later: we were able to move customers completely to a different server with what essentially amounted to a metadata operation. During a maintenance window, we backed up their database, took the original offline, restored the backup to a new server, and updated a config table that told all the apps where to find that database. This is much more flexible than trying to extract all of their data from a database shared by others.
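A hedged T-SQL sketch of that kind of move (database, path, and config table names are made up; the logical file names in particular will differ in your environment):
    -- On the old server: final backup, then take the source offline
    BACKUP DATABASE CustomerA TO DISK = N'\\backupshare\CustomerA.bak';
    ALTER DATABASE CustomerA SET OFFLINE WITH ROLLBACK IMMEDIATE;
    -- On the new server: restore the backup
    RESTORE DATABASE CustomerA FROM DISK = N'\\backupshare\CustomerA.bak'
    WITH MOVE 'CustomerA'     TO N'D:\Data\CustomerA.mdf',
         MOVE 'CustomerA_log' TO N'E:\Logs\CustomerA_log.ldf';
    -- Point the apps at the new location via the shared config table
    UPDATE dbo.CustomerDatabases
    SET    server_name = 'SQLNODE02'
    WHERE  customer_id = 42;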
Separate databases also allow you to handle maintenance differently. One customer needs point-in-time restore, and another doesn't? Perfect, you can just use a different recovery model on separate databases. Much easier than separating by filegroups and trying to implement some filegroup-level backup solution, and much more efficient than just treating one big database in full recovery.
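For example, in T-SQL (database names are illustrative):
    -- Customer that needs point-in-time restore: full recovery plus log backups
    ALTER DATABASE CustomerA SET RECOVERY FULL;
    BACKUP LOG CustomerA TO DISK = N'\\backupshare\CustomerA_log.trn';
    -- Customer that doesn't: simple recovery, full/differential backups only
    ALTER DATABASE CustomerB SET RECOVERY SIMPLE;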
This isn't free, of course; it's about trade-offs. Multiple databases scare some people away, but having managed such a system for 13 years I can tell you that managing 100 or 500 largely identical databases is not that much more complicated than managing 500 schemas in one massive database (in fact I would say it is less complicated in a lot of respects).
A database is the unit of backup and recovery, so that should be the first consideration when designing database structures. If the data has different back up and recovery requirements, then they are very good candidates for separate databases.
That is only half the problem, though. In most environments, backup/recovery is pretty much the same for all databases. It becomes a question of application design. In other words, the situation becomes quite subjective.
In the environment that I'm working in right now, here are some criteria for splitting data into different databases:
(1) Publishing tables to a wide audience. We "publish" data in tables and put these into a database, separate from other tables used for building them or for special purposes. Admittedly, SQL Server says that schemas are the unit of security. However, databases seem to do a good job in the real world.
(2) Strict security requirements. Some data is so sensitive that lawyers have to approve who can see it. This goes into its own database, with its own access (see the sketch at the end of this answer).
(3) Separation of data tables (which users can see) and tables that describe the production system.
(4) Separation of tables used for general querying by a skilled group of analysts (the published tables) versus tables used for specific reports/applications.
Finally, I would add this. If some of the data is being updated continuously throughout the day and other data is used for reporting, I would tend to put them in different databases. This helps separate them in the case of problems.
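Returning to point (2), a minimal T-SQL sketch of the own-database, own-access pattern (database, login, and user names are illustrative):
    CREATE DATABASE SensitiveHR;
    GO
    CREATE LOGIN legal_analyst WITH PASSWORD = 'Str0ngP@ssw0rd!';
    GO
    USE SensitiveHR;
    GO
    -- Only explicitly approved logins get mapped into this database
    CREATE USER legal_analyst FOR LOGIN legal_analyst;
    ALTER ROLE db_datareader ADD MEMBER legal_analyst;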