Emulating join behavior with Rails and Mongoid - ruby-on-rails-3

Just wanted to ask for some advice on building a database with MongoDB. I have been reading a lot that if you have a database with a lot of joins, it's better to go with something like PostgreSQL.
So if I wanted flexibility and needed my data to join multiple times, should I go with PostgreSQL? I know MongoDB has fast reads/writes but needs to query multiple times to emulate joins. So when would this become a performance hit? Does MongoDB limit your ability to create new complex relationships on your data that did not previously exist?
I guess the attractiveness of MongoDB is its JavaScript syntax and similarity to JSON :)

I will start from the end:
I guess the attractiveness of MongoDB is its JavaScript syntax and similarity to JSON :)
Not only that, and the JSON style is not the main advantage. The main advantages of MongoDB are the ability to embed documents, high performance and full scalability, full index support, map/reduce, and so on.
So if I wanted flexibility and needed my data to join multiple times, should I go with PostgreSQL?
It depends on the concrete task. For example, if you are designing a reporting system, I would prefer a relational database. But often, instead of joins and separate collections, you can embed documents; MongoDB is also a good fit for data denormalization (and in many situations you can denormalize in the background to avoid joins).
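For instance, a minimal Mongoid sketch of embedding instead of joining (Post and Comment are illustrative model names, not taken from the question):

# Comments live inside the post document, so reading a post and its comments
# is a single query with no join. Post and Comment are illustrative names.
class Post
  include Mongoid::Document
  field :title, type: String
  embeds_many :comments
end

class Comment
  include Mongoid::Document
  field :body, type: String
  embedded_in :post
end

post = Post.create!(title: "Joins in MongoDB")
post.comments.create!(body: "Embed instead of joining.")
Post.find(post.id).comments.map(&:body)   # one document read returns everything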
I know MongoDB has fast reads/writes but needs to query multiple times to emulate joins. So when would this become a performance hit?
If you use MongoDB as a regular relational database (without embedding and denormalization), you will never achieve the best performance.
Does MongoDB limit your ability to create new complex relationships on your data that did not previously exist?
No, MongoDB does not limit you, because it does not contain any constraints between collections (like the foreign keys in a SQL database), and it lets you embed documents and easily denormalize data to fit your business needs and achieve the best performance.
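To make the "query multiple times" point from the question concrete: a referenced (non-embedded) relation in Mongoid is resolved by a second query from the client rather than by a server-side join. A sketch, with Author and Book as illustrative names:

# There is no foreign-key constraint in MongoDB; the driver simply issues a
# second query against the authors collection. Author and Book are illustrative names.
class Author
  include Mongoid::Document
  field :name, type: String
  has_many :books
end

class Book
  include Mongoid::Document
  field :title, type: String
  belongs_to :author
end

book = Book.includes(:author).first   # eager loading still means an extra query, not a join
book.author.name                      # the "join" happens client-side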

Another alternative would be to denormalize your data.
You store copies of data in multiple tables/collections. In doing so, you avoid the need for JOINs and lookups needed to stitch together related pieces of data.
You avoid joins and you’re storing more data - but your overall application can be faster.
In Mongoid there are two great gems to make this easier: mongoid_alize and mongoid_denormalize.
http://blog.joshdzielak.com/blog/2012/05/03/releasing-mongoid-alize-comprehensive-field-denormalization-for-mongoid/
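The gems above automate the copying, but the underlying idea is just a callback that duplicates a field onto the related document. A hedged sketch of doing it by hand (illustrative models, not the gems' actual API):

# Copy the user's name onto each article so that listing articles never needs a
# second lookup into the users collection. Illustrative models, not the gems' API.
class User
  include Mongoid::Document
  field :name, type: String
  has_many :articles
end

class Article
  include Mongoid::Document
  field :title, type: String
  field :user_name, type: String            # denormalized copy
  belongs_to :user

  before_save { self.user_name = user.name if user }
end

Article.all.map { |a| [a.title, a.user_name] }   # no join, no second query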

You can always use map/reduce:
http://www.mongodb.org/display/DOCS/MapReduce
or the aggregation framework's group operation:
http://www.mongodb.org/display/DOCS/Aggregation#Aggregation-Group
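For grouping-style queries, the aggregation framework can be called through the driver collection that Mongoid exposes. A hedged sketch (the Sale model and its fields are assumptions):

# Group totals with an aggregation pipeline via Model.collection.
# The Sale model and its status/amount fields are assumptions for illustration.
class Sale
  include Mongoid::Document
  field :status, type: String
  field :amount, type: Float
end

pipeline = [
  { "$match" => { "status" => "paid" } },
  { "$group" => { "_id" => "$status", "total" => { "$sum" => "$amount" } } }
]
Sale.collection.aggregate(pipeline).each { |doc| puts doc.inspect }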

Related

Apply SQL-like joins to get data from multiple collections - Firestore

I have a project database set up on Firestore. In some cases I just want to pull data from multiple collections and combine it, as in SQL. We also need pagination with filters applied to the data, so it's like applying a where condition and joining multiple collections to get the relevant data. Can anybody help me get this result?
Firestore read operations get documents from a single collection, or a group of collections of the same name. It has no support for server-side join operations.
The two common workarounds for this are:
Load the additional data from your application code, also referred to as a client-side join (see the sketch after this answer). This affects performance a bit, but not nearly as much as you may expect - so definitely try it and measure performance before ruling it out.
Duplicate the data that you need from the secondary collection. This way you store more data, and your write operations become more complex, but reading the data is fast and simple.
The second solution is also the only way to have conditions on the data from both collections, as there's also no way to query across collections.
While this may all be very unexpected if you come from a background in relational databases, it is actually very common amongst NoSQL solutions and is one of the reasons they can scale so well to massive data sizes.
To learn more, I highly recommend reading NoSQL data modeling and watching Getting to know Cloud Firestore.
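A client-side join of the first kind might look roughly like this with the google-cloud-firestore Ruby gem; the collection and field names are made up, and the method names are my recollection of that gem's API, so check its documentation before relying on them:

require "google/cloud/firestore"

# Read orders, then fetch each matching customer document from a second
# collection. Collection and field names are invented for the sketch.
firestore = Google::Cloud::Firestore.new project_id: "my-project"

firestore.col("orders").get do |order|
  customer = firestore.col("customers").doc(order.data[:customer_id]).get
  puts "#{order.document_id} belongs to #{customer.data[:name]}"
end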

Why does Magento store EAV data across attribute types?

In the Magento system, there are 5 tables that store EAV data, split by attribute type. Is this an effective performance choice? When I write a SQL query, I still need to use a UNION clause to get the whole data set. If I used one mixed table to store EAV data with only one data type (varchar, or sql_variant in SQL Server 2008), what performance issues would I encounter in the future?
Is this an effective performance choice?
The Magento developers chose to use an EAV structure because it performs well under high volumes of data. A flat table structure would be suitable for a small setup, but as it scales it becomes less and less efficient.
When I write a SQL query, I still need to use a UNION clause to get the whole data set.
You should try in every possible case to avoid direct SQL queries on a Magento database. Magento provides Setup models that you can use for installing new data: you either use the existing Magento models and the methods the Setup models expose to create or modify core config data and variables, or you use the underlying Zend Framework ORM if you need to create new tables, etc.
In terms of the EAV part of the database specifically, the way it is set up is complicated if you attack it from the SQL point of view, which is why Magento models exist to wrap it all up in the PHP ORM. Again, avoid SQL queries if you can.
If you have to make direct queries, you wouldn't be writing UNION queries but joins onto those tables, and you'd use the eav_attribute table as a pivot table to provide you with both the attribute_id (primary key) and the source table in which the value exists.
By using direct SQL queries you also lose the fallback system that Magento implements where store or website level values can exist, and the Magento models will select them if you ask for them at a store level. If you want to do this manually with SQL then the queries become more complicated as you need to look for those values and if they aren't found, revert to the default (global scope) value.
If I used one mixed table to store EAV data with only one data type (varchar, or sql_variant in SQL Server 2008), what performance issues would I encounter in the future?
As mentioned before, it depends on the expected scale of your database. You will notice that there are plenty of flat tables in a standard Magento database, and that EAV structures only apply to the parts that Magento developers have decided may increase drastically in volume (customers, catalog etc).
If you want to implement a custom module and you think that it also has the potential to grow quickly over time then you can implement your own EAV tables for it. The Magento model scaffolds support this, and there is plenty of resource online about how to set them up.
If your tables are likely to remain (relatively) small, then by all means go for a flat table approach. If it's a custom module and you notice rapid growth, you can always convert it later before it becomes a bottleneck.

SQL versus noSQL (speed)

When people compare SQL and NoSQL and weigh up the upsides and downsides of each, what I never hear anyone talk about is speed.
Isn't performing SQL queries generally faster than performing noSQL queries?
I mean, for me this would be a really obvious conclusion, because you should always be able to find something faster if you know the structure of your database than if you don't.
But people never seem to mention this, so I want to know if my conclusion is right or wrong.
People who tend to use NoSQL use it specifically because it fits their use cases. Being divorced from normal RDBMS table relationships and constraints, as well as the ACID guarantees on the data, makes it very easy to run a lot faster.
Consider Twitter, which uses NoSQL because a user only does very limited things on the site - really just one: tweet. And concurrency can be considered non-existent, since (1) nobody else can modify your tweet and (2) you won't normally be tweeting simultaneously from multiple devices.
The definition of noSQL systems is a very broad one -- a database that doesn't use SQL / is not a RDBMS.
Therefore, the answer to your question is, in short: "it depends".
Some noSQL systems are basically just persistent key/value stores (like Project Voldemort). If your queries are of the type "look up the value for a given key", such a system will be (or at least should be) faster than an RDBMS, because it only needs to support a much smaller feature set.
Another popular type of noSQL system is the document database (like CouchDB).
These databases have no predefined data structure.
Their speed advantage relies heavily on denormalization and creating a data layout that is tailored to the queries that you will run on it. For example, for a blog, you could save a blog post in a document together with its comments. This reduces the need for joins and lookups, making your queries faster, but it also could reduce your flexibility regarding queries.
As Einstein would say, speed is relative.
If you need to store a simple master/detail structure (like a shopping cart), you would need to issue several INSERT statements in your SQL application, and you get back a data set of rows when you query for the purchase. If you're using NoSQL, and you're using it well, then you would have all the data for a single order in one simple "record" (a document, in the terminology of NoSQL databases like djondb).
So I really think the performance of an application can be measured by the number of things it needs to do to achieve a single requirement. If you need several INSERTs to store an order, but only one simple insert in a database like djondb, then the NoSQL version will be about 10x faster simply because you're making 10 times fewer calls to the database layer - that's it.
To illustrate my point, let me link an example I wrote some time ago about the differences between the NoSQL and SQL data modeling approaches: https://web.archive.org/web/20160510045647/http://djondb.com/blog/nosql-masterdetail-sample/. I know it's a self-reference, but I basically wrote it to address this question, which I find is the most challenging question someone from the RDBMS world can ask, and it's always a good way to explain why NoSQL is so different from the SQL world and why it can often achieve better performance - not because we use "NASA" technology, but because NoSQL lets the developer do less and get more, and less code means greater performance.
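In Mongoid terms (since the original question is about Rails), the whole master/detail order goes into one document, so persisting it is a single insert. A sketch with illustrative Order and LineItem models:

# The order and its line items are saved together in one document write.
# Order and LineItem are illustrative names.
class Order
  include Mongoid::Document
  field :customer, type: String
  embeds_many :line_items
end

class LineItem
  include Mongoid::Document
  field :sku, type: String
  field :qty, type: Integer
  embedded_in :order
end

order = Order.new(customer: "Alice")
order.line_items.build(sku: "A-1", qty: 2)
order.line_items.build(sku: "B-7", qty: 1)
order.save!   # one insert, versus one INSERT per row in the relational version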
The answer is: it depends. Generally speaking, the objective of NoSQL databases (not "queries") is scalability. An RDBMS usually hits some hard limit at some point (I'm talking about millions and millions of rows) where you cannot scale any further by traditional means (replication, clustering, partitioning) and you need something more because your needs keep growing. Or even if you manage to scale, the overall setup is quite complicated. Or you can scale reads, but not writes.
And query speed depends on the particular implementation of your server, the type of query you are running, the columns in the table, and so on - remember that queries are just one part of an RDBMS.
As a data point: a query over data for 1,000 people that takes about 2,000 ms in a relational (SQL) database takes around 2 ms in a graph database like Neo4j, and if you create more nodes - up to 1,000,000 - the speed stays stable at around 2 ms.

SQL NOSQL mix possible or not?

I have an application on a relational database that needs to change in order to keep more data. My problem is that just 2 of the tables will store more data (up to billions of entries), and one of those tables is "linked" by foreign key to other tables. I could give up the relational model for these tables.
I'd like to keep the rest of the db intact and changes only these 2 tables. I'm also doing a lot of queries - from simple selects to group by and subqueries - on these tables, so more problems there.
My experience with NoSQL is limited, so I'm asking which one (if any) of its siblings suits my needs:
- huge data
- complex queries
- integration with a SQL database. This is not as important as the first two and I could migrate my entire db to an equivalent if it's worth it.
Thanks
Both relational databases and NoSQL approaches can handle data having billions of data points. With the supplied information, it is hard to make a meaningful and specific recommendation. It would be helpful to know more about what you are trying to do with the data, what your options are regarding your hardware and network topology, etc.
I assume since you are currently using a relational database, you have probably already looked at partitioning or otherwise structuring your larger tables so that your query performance is satisfactory. This activity by itself can be non-trivial, but IMHO, a good database design with optimized sql can take you a very long way before there is a clear need to explore alternatives.
However, if your data usage looks like write-once, read often, the join dependencies are manageable, and you need to perform some aggregations over the data set, then you might start to look into alternative approaches like Hadoop or MongoDB - however these choices come with trade-offs in terms of their performance, capabilities, platform requirements, latency, and so forth. Your particular question about integration between a NoSQL repository and a SQL database at the query level might not be realizable without some duplication of data between the two. For example, MongoDB does not like joins (http://stackoverflow.com/questions/4067197/mongodb-and-joins), so you must design your persistence model with that in mind, and this may involve duplication of data.
The point I am trying to make is - identifying the "right" approach will depend on your specific goal and constraints.

Should I be concerned that ORMs, by default, return all columns?

In my limited experience working with ORMs (so far LLBLGen Pro and Entity Framework 4), I've noticed that, inherently, queries return data for all columns. I know NHibernate is another popular ORM, and I'm not sure whether this applies to it as well, but I would assume it does.
Of course, I know there are workarounds:
Create a SQL view and create models and mappings on the view
Use a stored procedure and create models and mappings on the result set returned
I know that adhering to certain practices can help mitigate this:
Ensuring your row counts are reasonably limited when selecting data
Ensuring your tables aren't excessively wide (large number of columns and/or large data types)
So here are my questions:
Are the above practices sufficient, or should I still consider finding ways to limit the number of columns returned?
Are there other ways to limit returned columns other than the ones I listed above?
How do you typically approach this in your projects?
Thanks in advance.
UPDATE: This sort of stems from the notion that SELECT * is thought of as a bad practice. See this discussion.
One of the reasons to use an ORM of nearly any kind is to delay a lot of those lower-level concerns and focus on the business logic. As long as you keep your joins reasonable and your table widths sane, ORMs are designed to make it easy to get data in and out, and that requires having the entire row available.
Personally, I consider issues like this premature optimization until encountering a specific case that bogs down because of table width.
First of all: great question, and about time someone asked this! :-)
Yes, the fact an ORM typically returns all columns for a database table is something you need to take into consideration when designing your systems. But as you've mentioned - there are ways around this.
The main thing for me is to be aware that this is what happens - either a SELECT * FROM dbo.YourTable, or (better) a SELECT (list of all columns) FROM dbo.YourTable.
This is not a problem when you really want the whole object and all its properties, and as long as you load a few rows, that's fine, too - the convenience beats the raw performance.
You might need to think about changing your database structures a little bit - things like:
maybe put large columns like BLOBs into separate tables with a 1:1 link to your base table - that way, a select on the parent table doesn't grab all those large blobs of data (see the sketch after this list)
maybe put groups of columns that are optional, that might only show up in certain situations, into separate tables and link them - again, just to keep the base tables lean'n'mean
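To keep the example in the Rails register of the original question, here is a hedged sketch of the first idea - moving a large column behind a 1:1 association so the base table stays narrow (the model and column names are made up):

# Vertical split: the large "content" column lives in its own table with a 1:1
# link back to the base table, so loading documents does not drag it along.
# Model and column names are invented for illustration.
class Document < ActiveRecord::Base
  has_one :document_body, dependent: :destroy
end

class DocumentBody < ActiveRecord::Base
  belongs_to :document     # document_bodies table: document_id plus the wide content column
end

doc = Document.first            # SELECT hits only the narrow base table
doc.document_body.content       # the wide column is fetched only when actually needed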
Also: avoid trying to "arm-wrestle" your ORM into doing bulk operations - that's just not their strong point.
And: keep an eye on performance, and try to pick an ORM that allows you to change certain operations into e.g. stored procedures - Entity Framework 4 allows this. So if the deletes are killing you - maybe you just write a Delete stored proc for that table and handle that operation differently.
The question here covers your options fairly well. Basically you're limited to hand-crafting the HQL/SQL. It's something you only want to do if you run into scalability problems, but if you do, in my experience it can have a very large positive impact. In particular, it saves a lot of disk and network IO, so your scalability can take a big jump. Not something to do right away, though: analyse, then optimise.
Are there other ways to limit returned columns other than the ones I listed above?
NHibernate lets you add projections to your queries so you wouldn't need to use views or procs just to limit your columns.
For me this has only been an issue if a table has LOTS of columns (more than 30) or if a column holds a lot of data, for example over 5,000 characters in a field.
The approach I have used is to map another object to the existing table, but with only the fields I need. So for a search that populates a table with 100 rows I would have a MyObjectLite, but when I click to view the details of a row I would call GetById and return a MyObject that has all the columns.
Another approach is to use custom SQL or stored procs, but I think you should only go down this path if you REALLY need the performance gain and have users complaining. So unless there is a performance problem, do not waste your time trying to fix a problem that does not exist.
You can limit the number of returned columns by using a projection, Transformers.AliasToBean and a DTO. Here is how it looks in the Criteria API:
// assuming an open ISession and a mapped Package entity
var dtos = session.CreateCriteria(typeof(Package))
    .SetProjection(Projections.ProjectionList()
        .Add(Projections.Property("Id"), "Id")
        .Add(Projections.Property("PackageName"), "Caption"))
    .SetResultTransformer(Transformers.AliasToBean(typeof(PackageNameDTO)))
    .List<PackageNameDTO>();
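For comparison, in the Rails/Mongoid stack of the original question the same column-limiting is built into the query API; a quick sketch (Post and its fields are illustrative names):

# Limiting the returned fields without resorting to views or stored procedures.
Post.only(:id, :title)       # Mongoid: ask MongoDB for just these fields
Post.select(:id, :title)     # ActiveRecord: SELECT id, title FROM posts
Post.pluck(:title)           # ActiveRecord: raw values only, no model objects built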
In LLBLGen Pro, you can return Typed Lists which not only allow you to define which fields are returned but also allow you to join data so you can pull a custom list of fields from multiple tables.
Overall, I agree that for most situations, this is premature optimization.
One of the big advantages of using LLBLGen and other ORMs as well (I just feel confident speaking about LLBLGen because I have used it since its inception) is that the performance of the data access has been optimized by folks who understand the issues better than your average bear.
Whenever they figure out a way to further speed up their code, you get those changes "for free" just by re-generating your data layer or by installing a new dll.
Unless you consider yourself an expert at writing data access code, ORMs probably improve most developers' efficacy and accuracy.