And what are the (dis)advantages of each?
1) In SQL, you have to create a table and define the data types before inserting; in MongoDB, you don't have to create a collection: it is created automatically when you first insert data.
2) In SQL, inserted values must match the declared column data types; in MongoDB, a field can hold a value of any type.
3) In SQL, you can't add a column at insert or update time; in MongoDB, you can simply include a new field in the document.
4) In SQL, almost nothing is case sensitive; in MongoDB, names are case sensitive.
For example, in SQL, "use [demo]" and "use [DEMO]" select the same database; in MongoDB, "use demo" and "use Demo" select two different databases.
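A rough sketch of points 1-3 (the table, collection, and field names here are made up for illustration):

-- SQL: the table and its column types must exist before you can insert
CREATE TABLE demo_users (
    id   INT PRIMARY KEY,
    name VARCHAR(50)
);
INSERT INTO demo_users (id, name) VALUES (1, 'Alice');  -- values must match the types

// MongoDB shell: no prior definition needed; the collection appears on first insert,
// and each document can carry its own fields and types
db.demoUsers.insertOne({ _id: 1, name: 'Alice' });
db.demoUsers.insertOne({ _id: 2, name: 42, nickname: 'Bob' });  // new field, new type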
Table (SQL) - RDBMS
Maintains relations between the data.
Fixed or predefined schema.
Data is stored in rows and columns.
Foreign key relations are supported by the DB.
Data will not be stored if it violates a column data type, a foreign key, or a primary key.
Joins can be used effectively to query the data.
Vertically scalable, which is limited by the hardware: you cannot keep adding RAM to a server machine; the machine has its own limit on how much RAM can be added.
Storing and retrieving is comparatively slower when the data is huge.
MongoDB Collection - NoSQL DB
No relation is maintained between the data.
Dynamic schema: data is stored as documents.
A dynamic schema allows you to save documents with any data types and any number of fields.
Horizontally scalable, which can be done simply by adding more servers.
Storing and retrieving is faster.
No explicit foreign key support is available, although you can design the schema with foreign-key-like references (but remember, you must maintain the relationship yourself).
$lookup performs an operation similar to a LEFT OUTER JOIN in SQL, as sketched below.
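A minimal $lookup sketch in the MongoDB shell (the collection and field names are hypothetical):

// Roughly a LEFT OUTER JOIN of users onto orders
db.orders.aggregate([
  {
    $lookup: {
      from: "users",          // collection to join
      localField: "userId",   // field in orders
      foreignField: "_id",    // field in users
      as: "user"              // output array field on each order
    }
  }
]);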
Hope it helps!!
Before you step into examining the differences, you should first establish for yourself what kind of data you need to store, and to what degree your data has structure. Table-based databases are perfect for well-structured information. NoSQL databases (like MongoDB) are best for heterogeneous data (which is why they talk about documents).
So, the answer to your question is another question: what does your data look like?
I know this may not be the answer you are expecting, but it may point you toward the right path of thinking.
Related
Suppose I have a User table, and other tables (e.g. UserSettings, UserStatistics) which have a one-to-one relationship with a user.
Since SQL databases don't store complex structs in table fields (some allow JSON fields with an undefined format), is it OK to just add said tables, allowing me to store individual (complex) data for each user? Will it hurt performance by requiring more joins per query?
And in the distributed-database case, will those (connected) tables end up on different nodes at random, causing redundant cross-node requests and decreasing efficiency?
1:1 joins can definitely add overhead, especially in a distributed database. Using a JSON or other schema-less column is one way to avoid that, but there are others.
The simplest approach is a "wide table": instead of creating a new table UserSettings with columns a,b,c, add columns setting_a, setting_b, setting_c to your User table. You can still treat them as separate objects when using an ORM, it'll just need a little extra code.
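A sketch of that wide-table layout (the setting_a/b/c names come from above; the types and the rest of the table are assumptions):

-- The settings live as extra columns on the User table itself
CREATE TABLE users (
    id        INT PRIMARY KEY,
    name      VARCHAR(100),
    setting_a VARCHAR(100),  -- was UserSettings.a
    setting_b VARCHAR(100),  -- was UserSettings.b
    setting_c VARCHAR(100)   -- was UserSettings.c
);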
Some databases (like CockroachDB which you've tagged in your question) let you subdivide a wide table into "column families". This tends to let you get the best of both worlds: the database knows to store rows for the same user on the same node, but also to let them be updated independently.
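In CockroachDB that might look like this (a sketch; the family groupings are an assumption):

CREATE TABLE users (
    id        INT PRIMARY KEY,
    name      STRING,
    setting_a STRING,
    setting_b STRING,
    FAMILY core (id, name),                 -- "real" data stored together
    FAMILY settings (setting_a, setting_b)  -- settings stored and updated separately
);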
The main downside of using JSON columns is that they're harder to query efficiently. If you want all users with a certain setting, or want to know just one setting for a user, you'll take at least a minor performance hit if the database has to parse a JSON column to figure that out, or if you have to fetch the entire blob and do it in your app. If JSON columns are more convenient for other reasons, though, you can work around this by adding inverted indexes on them, or expression indexes on the specific values you're interested in. Indexes can have a similar cost to 1:1 joins, but you can mitigate that in CockroachDB by using the STORING keyword to tell the DB to write a copy of all the user columns to the index.
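A hedged CockroachDB sketch of those two index styles (the column names and the 'theme' setting are made up):

CREATE TABLE users (
    id       INT PRIMARY KEY,
    name     STRING,
    settings JSONB
);

-- Inverted index: find users by any setting without scanning every blob
CREATE INVERTED INDEX settings_idx ON users (settings);

-- Expression index on one specific setting, with STORING so the query
-- below is answered entirely from the index
CREATE INDEX theme_idx ON users ((settings->>'theme')) STORING (name);

SELECT name FROM users WHERE settings->>'theme' = 'dark';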
This looks like a standard task: syncing records from SQL Server (the primary data source) to NoSQL (Elasticsearch) to support the fast, advanced search functionality that NoSQL databases offer.
There is already a standard solution for this: Logstash for Elasticsearch. But the main challenge here is converting the normalized data stored in SQL into denormalized data in Elasticsearch.
For example, SQL might have the normalized tables below for a Person entity.
Person
PersonAddress
PersonJob
PersonContact
PersonSalary
But when we store this in Elasticsearch, we tend to create a single denormalized entity called Person. Now I see the challenges below:
A SQL query to convert multiple rows from the normalized tables into a single denormalized entity. We can still write a complex query joining all the tables (see the sketch after this list), but I am looking for a standard approach to this problem. Is there any built-in support in SQL?
The update time for the entity. Each row in the SQL tables has its own update time, but any change in any table for this entity should count as an update to the entity. Even the query for finding changes across all the normalized tables is complex.
Any references or ideas to solve the above problems will be appreciated. Thanks in advance!
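There is no single built-in "denormalize" operator, but SQL Server's FOR JSON gets close, and the entity update time can be computed as a max across the tables. A sketch under assumed column names (PersonID, UpdatedAt, etc. are invented):

SELECT
    p.PersonID,
    p.Name,
    (SELECT a.City, a.Zip
       FROM PersonAddress a WHERE a.PersonID = p.PersonID
       FOR JSON PATH) AS Addresses,            -- child rows folded into nested JSON
    (SELECT j.Title, j.Company
       FROM PersonJob j WHERE j.PersonID = p.PersonID
       FOR JSON PATH) AS Jobs,
    (SELECT MAX(u) FROM (VALUES                -- entity update time = max over all tables
        (p.UpdatedAt),
        ((SELECT MAX(a.UpdatedAt) FROM PersonAddress a WHERE a.PersonID = p.PersonID)),
        ((SELECT MAX(j.UpdatedAt) FROM PersonJob j WHERE j.PersonID = p.PersonID))
     ) AS t(u)) AS EntityUpdatedAt
FROM Person p;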
In the relational model, why do we not keep all our data in a single table? Why do we need to create multiple tables?
It depends on what your purpose is. For many analytic purposes, a single table is the simplest method.
However, relational databases are really designed to maintain data integrity. One aspect of data integrity is that any given item of data is stored in only one place. For instance, a customer's name is stored in the customer table, so it does not need to be repeated in the orders table.
This ensures that the customer name is always correct, because it is stored in one place.
In addition, cramming everything into a single table often means duplicating data, which would make the table far larger than needed and hence slow everything down.
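A small sketch of the customer/orders example (the column names are illustrative):

-- Normalized: the customer name lives in exactly one place
CREATE TABLE customers (
    customer_id INT PRIMARY KEY,
    name        VARCHAR(100)
);

CREATE TABLE orders (
    order_id    INT PRIMARY KEY,
    customer_id INT REFERENCES customers (customer_id),  -- name not repeated here
    order_date  DATE
);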
I am not an expert, but I think this would consume a lot of time and resources if there is a lot of data in the table. If we separate the data, we can make operations a lot easier.
Taken from https://www.bbc.co.uk/bitesize/guides/zvq634j/revision/1
A single flat-file table is useful for recording a limited amount of data. But a large flat-file database can be inefficient as it takes up more space and memory than a relational database. It also requires new data to be added every time you enter a new record, whereas a relational database does not. Finally, data redundancy – where data is partially duplicated across records – can occur in flat-file tables, and can more easily be avoided in relational databases.

Therefore, if you have a large set of data about many different entities, it is more efficient to create separate tables and connect them with relationships.
I'm curious whether there are any trade-offs between creating a child table to hold a set of data and just placing all the data in the main table in the first place.
My scenario is that I have data that tracks various metrics, such as LastUpdated and AmtOfXXX. Would it be better to place all this data in a separate table (specifically for metrics) and reference it by foreign key, or to place all these fields directly in the main table and forgo any foreign keys? Are there trade-offs? Performance considerations?
I'm referring to Relational Database Management Systems such as SQL Server and specifically I'll be using Entity Framework Core with MS SQL Server.
Your question appears to be more about the considerations between the two approaches than about which is specifically better; the latter is a matter of opinion. This addresses the former.
The major advantage of having a separate 1-1 table is to isolate the metrics from other information about the entities. There is a name for this type of data model: vertical partitioning (or at least that's what it was called when I first learned about it).
This has certain benefits:
The width of the data rows is smaller. So queries that only need the "real" data (or only the metrics) are faster.
The metrics are isolated. So adding new metrics does not require rewriting the "real" data.
A query such as select * on the "real" data only returns the real data.
Queries that modify only the metrics do not lock the "real" data.
There might also be an edge case if you have lots of columns and they fit into two tables but not into one.
Of course, there is overhead:
You need a JOIN to connect the two tables. (Although with the same primary key, the join will be quite fast).
Queries that modify both the "real" data and the metrics are more complicated, having to lock both tables.
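A sketch of the vertically partitioned layout, reusing the metric names from the question (everything else is assumed):

CREATE TABLE items (
    item_id INT PRIMARY KEY,
    name    VARCHAR(100)
);

CREATE TABLE item_metrics (
    item_id     INT PRIMARY KEY REFERENCES items (item_id),  -- same key: a 1-1 table
    LastUpdated DATETIME,
    AmtOfXXX    INT
);

-- The join is cheap because both sides share the same primary key
SELECT i.name, m.LastUpdated, m.AmtOfXXX
FROM items i
JOIN item_metrics m ON m.item_id = i.item_id;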
I have been programming relational databases for many years, but now I have come across an unusual and tricky problem:
I am building an application whose entities need to be defined very quickly and easily by the user. Instances of these entities can then be created, updated, deleted, etc.
There are two options I can think of.
Option 1 - Dynamically created tables
The first option is to write an engine that dynamically generates the tables and inserts the data into them. However, this would become very tricky, as every query would also need to be dynamic, or at least rely on dynamically created stored procedures, etc.
Option 2 - Entity-Key-Value Pattern
This is the only realistic option I can think of, where I have a five-table structure:
EntityTypes
    EntityTypeID int
    EntityTypeName nvarchar(50)

Entities
    EntityID int
    EntityTypeID int

FieldTypes
    FieldTypeID int
    FieldTypeName nvarchar(50)
    SQLtype int

FieldValues
    EntityID int
    FieldID int
    Value nvarchar(MAX)

Fields
    FieldID int
    FieldName nvarchar(50)
    FieldTypeID int
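For concreteness, the same structure as T-SQL DDL (a sketch; the keys and constraints are one reasonable reading, with Fields created before FieldValues so the foreign keys resolve):

CREATE TABLE EntityTypes (
    EntityTypeID   INT PRIMARY KEY,
    EntityTypeName NVARCHAR(50)
);

CREATE TABLE Entities (
    EntityID     INT PRIMARY KEY,
    EntityTypeID INT REFERENCES EntityTypes (EntityTypeID)
);

CREATE TABLE FieldTypes (
    FieldTypeID   INT PRIMARY KEY,
    FieldTypeName NVARCHAR(50),
    SQLtype       INT
);

CREATE TABLE Fields (
    FieldID     INT PRIMARY KEY,
    FieldName   NVARCHAR(50),
    FieldTypeID INT REFERENCES FieldTypes (FieldTypeID)
);

CREATE TABLE FieldValues (
    EntityID INT REFERENCES Entities (EntityID),
    FieldID  INT REFERENCES Fields (FieldID),
    Value    NVARCHAR(MAX),
    PRIMARY KEY (EntityID, FieldID)   -- one value per field per entity
);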
The "FieldValues" table would work a little like a datawarehouse fact table, and all my inserts/updates would work by filling a "Key/Value" table valued parameter and passing this to a SPROC (to avoid multiple inserts/updates).
All the tables would be heavily indexed, and I would end up doing many self joins to obtain the data.
I have read a lot about how bad Key/Value databases are, but for this problem it still seems to be the best.
Now my questions!
Can anyone suggest another approach or pattern other than these two options?
Would option two be feasible for medium-sized datasets (1 million rows max)?
Are there further optimizations for option 2 I could use?
Any direction and advice much appreciated!
Personally I would just use a "NoSQL" (key/value) database like MongoDB.
But if you need to use a relational database, option 2 is the way to go. A good example of that kind of model is the Alfresco Data Dictionary (Alfresco is an enterprise content management system). Its design is similar to what you describe, although they have multiple columns for field values (one for each simple type available in the database). If you add a good cache system to that (for example Ehcache), it should work fine.
As others have suggested NoSQL, I'm going to say that, in my opinion, schemaless databases really are best suited to use cases with no schema.
From the description, and the schema you came up with, it looks like your case is not in fact "no schema", but rather it seems to be "user-defined schema".
In fact, the schema you came up with looks very similar to the internal meta-schema of a relational database. You're sort of building a relational database on top of a relational database, which in my experience is not a good idea: this "meta-database" will have at least twice the overhead and complexity for any basic operation, tables will get very large (which doesn't scale well), the data will be difficult to query and update, problems will be difficult to debug, and so on.
For use-cases like that, you probably want DDL: Data Definition Language.
You didn't say which SQL database you're using, but most SQL databases (such as MySQL, PostgreSQL and MS-SQL) support some dialect of DDL extensions to SQL syntax, which let you manipulate the actual schema.
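For example (a sketch in PostgreSQL/MySQL syntax; the "Book" entity and its "Title" field are hypothetical), the application reacts to the user's definitions by emitting real DDL:

-- User defines an entity "Book": create a real table for it
CREATE TABLE user_books (
    id INT PRIMARY KEY
);

-- User later adds a "Title" field: alter the schema instead of writing to an EAV table
ALTER TABLE user_books ADD COLUMN title VARCHAR(200);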
I've done this successfully for use-cases like yours in the past. It works well for cases where the schema rarely changes, and the data volumes are relatively low for each user. (For high volumes or frequent schema updates, you might want schemaless or some other type of NoSQL database.)
You might need some tables on the side for additional field information that doesn't fit in the SQL schema. You may want to duplicate some schema information there as well, as it can be difficult or inefficient to read it back from the actual schema.
Ensuring atomic updates to your field-information tables and the schema probably requires transactions, which may not be supported by your database engine for DDL; PostgreSQL, at least, does support transactional schema updates.
You have to be vigilant when it comes to security - you don't want to open yourself up to users creating, storing or deleting things they're not supposed to.
If it suits your use case, consider using not only separate tables but separate databases, which can also be created and destroyed on demand using DDL. This could be applicable if each customer owns data collections that can't, shouldn't, or don't need to be queried across customers. (Arguably, these are rare; typically you want at least analytics across customers, but there are cases where each customer "owns" an isolated, hosted wiki, shop, or CMS/DMS of some sort.)
(I saw in your comment that you already decided on NoSQL, so just posting this option here for completeness.)
It sounds like this might be a solution in search of a problem. Is there any chance your domain can be refactored? If not, there's still hope.
Your scalability for option 2 will depend a lot on the width of the custom objects. How many fields can be created dynamically? 1 million entities with 100 fields each could be a drag... Efficient indexing could make the performance bearable.
For another option: you could have one data table with a few string fields, a few double fields, and a few integer fields, e.g. a table with String1, String2, String3, Int1, Int2, Int3. A second table would have rows that define a user object and map, say, "CustomObjectName" => String1, and so on. A stored procedure reading INFORMATION_SCHEMA plus some dynamic SQL would be able to read the schema table and return a strongly typed recordset, as sketched below.
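A hedged sketch of that layout (all names are invented):

-- Generic data table: a few reusable columns of each base type
CREATE TABLE custom_data (
    entity_id INT PRIMARY KEY,
    String1 NVARCHAR(255), String2 NVARCHAR(255), String3 NVARCHAR(255),
    Int1 INT, Int2 INT, Int3 INT
);

-- Mapping table: which logical field of which user object lives in which slot
CREATE TABLE custom_schema (
    object_name NVARCHAR(100),  -- e.g. 'CustomObjectName'
    field_name  NVARCHAR(100),  -- e.g. 'Title'
    column_slot NVARCHAR(20),   -- e.g. 'String1'
    PRIMARY KEY (object_name, field_name)
);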
Yet another option (for recent versions of SQL Server) would be to store a row with an id, a type name, and an XML field containing an XML document with the object data. In MS SQL Server this can be queried directly, and maybe even validated against a schema.
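A sketch of that option in T-SQL (the table and the XML path are made up):

CREATE TABLE custom_objects (
    id        INT PRIMARY KEY,
    type_name NVARCHAR(100),
    data      XML
);

-- Query into the XML payload directly
SELECT id,
       data.value('(/object/name)[1]', 'NVARCHAR(100)') AS name
FROM custom_objects
WHERE type_name = N'Book';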
Personally, I would take the time to define as many attributes as you can rather than use EAV for everything. Surely you know some of the attributes. Then you only need EAV for the things that are truly client-specific.
But if everything must be EAV, then a NoSQL database is the way to go. Or you can use a relational database for some things and a NoSQL database for the rest.