I am using MongoDB and Ruby on Rails to build a web service. I have around 10 GB of data. The collections (similar to tables in an RDBMS) are divided by the states of a country, and the fields differ slightly from collection to collection. I have 60 collections. Since I am using a NoSQL database, I won't have problems if I combine 2-3 collections with different fields.
My problem
If I don't combine my collections, I will have 60 models in my Rails application. If I combine them all, I will have one very large collection and performance will suffer. What is the optimum choice, given that my server resources are limited? I will query my database on 3-4 different parameters. For example, I may search for a particular area, for a particular license type a person owns, or sometimes both.
As Sergio said, one large collection with indexes on the 3-4 fields you query on will probably work best.
However, you don't have to have 60 models; just use dynamic fields. This is one of the main benefits of using MongoDB. You can read about it in the Mongoid documentation. Basically, just define the fields that are common to all of the documents in the collection, then set and get the dynamic fields as needed.
The one gotcha here is that the dot attribute accessor (model.attribute) doesn't work until that attribute has been set. So you can't call model.attribute until you have set it via model[:attribute] = "blah" or model.attribute = "blah".
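The question uses Rails and Mongoid, but the idea is easy to sketch with any driver. Here is a minimal sketch in Python with pymongo, where the database, collection, and field names are all invented for illustration:

from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")   # assumed local server
people = client.licensing.people                    # one combined collection

# Compound index on the 3-4 fields the queries actually filter on.
people.create_index([("state", ASCENDING),
                     ("area", ASCENDING),
                     ("license_type", ASCENDING)])

# Documents from different source collections can carry different fields.
people.insert_one({"state": "CA", "area": "north", "license_type": "A",
                   "coastal_permit": True})    # a field only some states have
people.insert_one({"state": "TX", "area": "south", "license_type": "B"})

for doc in people.find({"state": "CA", "area": "north", "license_type": "A"}):
    print(doc)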
Suppose I have a User table, and other tables (e.g. UserSettings, UserStatistics) that have a one-to-one relationship with a user.
Since SQL databases don't store complex structs in table fields (though some allow JSON fields with no fixed format), is it OK to just add such tables to store individual (complex) data for each user? Will it hurt performance by requiring more joins?
And in the case of a distributed database, will those (connected) tables be stored on different nodes, causing redundant cross-node requests and decreasing efficiency?
1:1 joins can definitely add overhead, especially in a distributed database. Using a JSON or other schema-less column is one way to avoid that, but there are others.
The simplest approach is a "wide table": instead of creating a new table UserSettings with columns a,b,c, add columns setting_a, setting_b, setting_c to your User table. You can still treat them as separate objects when using an ORM; it'll just need a little extra code.
Some databases (like CockroachDB which you've tagged in your question) let you subdivide a wide table into "column families". This tends to let you get the best of both worlds: the database knows to store rows for the same user on the same node, but also to let them be updated independently.
The main downside of JSON columns is that they're harder to query efficiently. If you want all users with a certain setting, or just one setting for a user, you'll take at least a minor performance hit: either the database has to parse the JSON column to find the value, or you have to fetch the entire blob and parse it in your app. If JSON is more convenient for other reasons, though, you can work around this by adding inverted indexes on your JSON columns, or expression indexes on the specific values you're interested in. Indexes have a cost similar to 1:1 joins, but in CockroachDB you can mitigate that by using the STORING keyword to tell the database to write a copy of all the user columns into the index.
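Here is a rough sketch of what the wide table, column families, and JSON indexing can look like in CockroachDB. The table and column names are hypothetical, and the connection string assumes a local node; CockroachDB speaks the Postgres wire protocol, so psycopg2 works:

import psycopg2

conn = psycopg2.connect("postgresql://root@localhost:26257/defaultdb")  # assumed node
with conn, conn.cursor() as cur:
    # Wide table with column families: settings are plain columns, but the
    # database stores and updates the two families independently.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS users (
            id        UUID PRIMARY KEY DEFAULT gen_random_uuid(),
            name      STRING,
            setting_a STRING,
            setting_b STRING,
            prefs     JSONB,
            FAMILY core (id, name),
            FAMILY settings (setting_a, setting_b, prefs)
        )""")

    # Inverted index: makes containment lookups on the JSON column cheap.
    cur.execute("CREATE INVERTED INDEX IF NOT EXISTS prefs_idx ON users (prefs)")

    # Expression index on one specific setting, with STORING so the query
    # below can be answered from the index alone.
    cur.execute("""
        CREATE INDEX IF NOT EXISTS theme_idx
        ON users ((prefs->>'theme')) STORING (name)""")

    cur.execute("SELECT name FROM users WHERE prefs->>'theme' = %s", ("dark",))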
I have a Postgres question/challenge.
There is a table 'products' with a lot of rows (millions).
Each product is of a certain Class.
Each product has a number of features of different types:
A feature could be 'color', where the value comes from a picklist of all colors.
A feature could be 'voltage', with a numerical value from 220 (low) to 240 (high).
There can be up to 100 features for each product.
All features of a product are put in a many-side table (with the products table as the one-side).
So this table is even bigger (much bigger).
Standard query (no Feature-filters)
A query comes along for all products of that Class. This can result in a lot of products, so pagination is implemented in the SQL query.
I solved this by querying the products table first, then running a separate query on the feature table to gather all features for the products in the first batch and adding them to the result (in the NodeJS API application).
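Roughly, in Python (the table and column names are assumptions):

import psycopg2

conn = psycopg2.connect("postgresql://localhost/productsdb")  # assumed DSN
with conn, conn.cursor() as cur:
    # Query 1: one page of products for the requested class.
    cur.execute("""
        SELECT id, name FROM products
        WHERE class_id = %s
        ORDER BY id LIMIT %s OFFSET %s""", (42, 20, 0))
    page = {pid: {"name": name, "features": []} for pid, name in cur.fetchall()}

    # Query 2: all features, but only for the products on this page.
    cur.execute("""
        SELECT product_id, feature_name, feature_value
        FROM product_features
        WHERE product_id = ANY(%s)""", (list(page),))
    for pid, fname, fvalue in cur.fetchall():
        page[pid]["features"].append((fname, fvalue))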
Problem with using a Feature-filter
But now a new request comes along: return products of a certain Class that also match a value for a certain feature.
It is not possible to use the same method as before and simply filter out, in the application, all products not matching the value for the specific feature mentioned in the request.
Post-processing the database result and removing products (those not matching the feature value) would mess up the pagination (which comes from the database).
Possible Solutions
The following solutions I have already thought of:
Go the MongoDB way
Just put everything about a product in one record, and use Postgres arrays for the features.
The downside is that arrays can become quite large, and I don't know how Postgres performs on very large records.
(Maybe I should go with MongoDB, fed from Postgres, just to handle these requests.)
Any tips here?
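For what it's worth, a rough sketch of that array variant, assuming a text[] column with features encoded as 'name:value' strings; the GIN index is my addition, since it is what usually makes the containment filter fast:

import psycopg2

conn = psycopg2.connect("postgresql://localhost/productsdb")  # assumed DSN
with conn, conn.cursor() as cur:
    cur.execute("ALTER TABLE products ADD COLUMN IF NOT EXISTS features text[]")
    cur.execute("CREATE INDEX IF NOT EXISTS features_gin ON products USING gin (features)")

    # Filter and paginate in one statement, so the page stays correct.
    cur.execute("""
        SELECT id, name FROM products
        WHERE class_id = %s AND features @> ARRAY['color:red']
        ORDER BY id LIMIT 20""", (42,))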
Forget pagination from the database
Just don't do the pagination in the database and handle it in NodeJS instead. Then I can do the post-processing in JavaScript.
But then I have to fetch everything matching the WHERE clause (with no LIMIT/OFFSET), which makes it quite complex and costs a lot of memory in the NodeJS application.
This is not the best solution.
Use another technique?
I'm not familiar with Data Warehousing techniques, but is there a solution lurking in that area?
Current stack is Python, Postgres, NodeJS for the API. Any other tools which can help me?
I have multiple products, each of which may have different attributes than the other products, for example a laptop vs. a t-shirt.
One of the solutions that may come to mind is to have a text "specs" column in the "products" table and store the product specs in it as text key/value pairs,
for example "label:laptop, RAM:8gb".
What is wrong with this approach? Why can't I find any web article that recommends it? I mean, it is not that hard an idea to come up with.
What I see on the internet are two ways to solve this problem:
1- use the EAV model
2- use JSON
Why not just text key/value pairs, as I mentioned?
In SQL, a string is a primitive type, and it should be used to store only a single value. That is how SQL works best: single values in columns, rows devoted to a single entity or a relationship between two tables.
Here are some reasons why you do not want to do this:
Databases have poor string processing functionality.
Queries against the spec string cannot (in general) be optimized using indexes or partitioning.
The strings have a lot of redundancy, because names are repeated over and over (admittedly, JSON and XML also have this "feature").
You cannot validate the data for each spec using built-in SQL functionality.
You cannot enforce the presence of particular values.
The one time when this is totally acceptable is when you don't care what is in the string -- it is there only to be returned for the application or user.
Why are you reluctant to use the solutions you mention in your question?
Text pairs (and even JSON blobs) are fine for storage and display, so long as you don't need to search on the product specifications. Searching against unstructured data in most SQL databases is slow and unreliable. That's why for variant data the EAV model typically gets used.
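For reference, a minimal sketch of the EAV shape (all names hypothetical), which keeps every spec individually indexable and searchable:

import psycopg2

conn = psycopg2.connect("postgresql://localhost/shop")  # assumed DSN
with conn, conn.cursor() as cur:
    # Classic EAV: one row per (entity, attribute, value) triple.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS product_attributes (
            product_id INT  REFERENCES products(id),
            attribute  TEXT NOT NULL,
            value      TEXT NOT NULL,
            PRIMARY KEY (product_id, attribute)
        )""")

    # Find products by one attribute without parsing any strings.
    cur.execute("""
        SELECT product_id FROM product_attributes
        WHERE attribute = %s AND value = %s""", ("RAM", "8gb"))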
You can learn more about the structure by studying Normal Forms.
SQL assumes attributes are stored individually in columns, and the value in that column is to be treated as a whole value. Support for searching for rows based on some substring of a value in a column is awkward in SQL.
So you can use a string to combine all your product specs, but don't expect SQL expressions to be efficient or elegant if you want to search for one specific product spec by name or value inside that string.
If you store specs in that manner, then just use the column as a "black box." Fetch the whole product spec string from the database, after selecting the product using normal columns. Then you can explode the product spec string in your application code. You may use JSON, XML, or your own custom format. Whatever you can manipulate in application code is valid, and it's up to you.
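A small sketch of that black-box usage in Python, assuming the spec column happens to hold JSON:

import json
import psycopg2

conn = psycopg2.connect("postgresql://localhost/shop")  # assumed DSN
with conn, conn.cursor() as cur:
    # Select the product by its normal, indexed columns...
    cur.execute("SELECT specs FROM products WHERE id = %s", (123,))
    (specs_text,) = cur.fetchone()

# ...and only unpack the opaque blob in application code.
specs = json.loads(specs_text)      # e.g. '{"label": "laptop", "RAM": "8gb"}'
print(specs.get("RAM"))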
You may also like a couple of related answers of mine:
How to design a product table for many kinds of product where each product has many parameters
Is storing a delimited list in a database column really that bad? (many of the disadvantages of using a comma-separated list are just as applicable to JSON or XML, or any other format of semi-structured data in a single column.)
After seeing some of the crazy ways developers use JSON columns in questions on Stack Overflow, I'm coming around to the opinion that JSON, or any other document-in-a-column format, is not a good idea for any relational database. It may be tolerable if you follow the "black box" principle I mention above, but too many developers then extend that and expect to query individual sub-fields within the JSON as if they were normal SQL columns. Don't do it!
I have a few dropdowns in my webpage. These are linked and have a similar class structure with bi-directional linking.
In other words: class Alpha has a list of class Beta which in turn has a list of class Charlie. Each class Beta also has its own list of Alpha (the ones it belongs to) and each class Charlie has its own list of Beta.
I am using NHibernate 3 with Fluent NHibernate and automappings.
Now, if I simply run
session.CreateCriteria<Alpha>().SetMaxResults(1000).List<Alpha>();
I get the N+1 problem when I loop over the collections.
The way I see it, the following SQL statements should be all that is sent to the database:
select top 1000 * from Alpha
select top 1000 * from Beta
select top 1000 * from Charlie
select * from Alpha2Beta
select * from Beta2Charlie
But how do I write the query to make this work?
There's a nice trick Ayende showed in his blog. I haven't tried it personally as I decided to change my BL to avoid this problem, so take this with a grain of salt.
You should be able to load collections separately and let NHibernate connect entities, using NHibernate Futures. Since it's not a light subject it's better that you read his blog post.
If you're using Criteria, you'll need to include dynamic fetching method calls.
As far as I know, there's no way you can help this on a query-by-query level, like you can with join fetching. However, if you change the mappings and set the default fetch mode for the associations to be "subquery", you might be pleasantly surprised:
From the Hibernate Documentation (works equally well with NHibernate):
"With fetch="subselect" on a collection you can tell Hibernate to not only load this collection in the second SELECT (either lazy or non-lazy), but also all other collections for all "owning" entities you loaded in the first SELECT. This is especially useful for fetching multiple collections in parallel."
What this means is that when the first association is required, NHibernate will, instead of loading one association, recall the query you used to get the root entity, then load the association data for all instances of the root entity type that were returned by the query.
That said, if you're loading 1K entities and you expect the associations to have more than a couple of records each, you're probably just going to go from a (SELECT N+1)^2 to a "holy crap I just loaded the entire database into memory". ;-)
(Note that if you do this and have a scenario where you load the Alpha list, but only need the associated Betas for a single Alpha, you're still going to load all of them and there's nothing you can do about that. In practice though, I've found this to be a very rare scenario, so usually subselect fetch suits me very well.)
A "static" query is one that remains the same at all times. For example, the "Tags" button on Stackoverflow, or the "7 days" button on Digg. In short, they always map to a specific database query, so you can create them at design time.
But I am trying to figure out how to do "dynamic" queries where the user basically dictates how the database query will be created at runtime. For example, on Stackoverflow, you can combine tags and filter the posts in ways you choose. That's a dynamic query albeit a very simple one since what you can combine is within the world of tags. A more complicated example is if you could combine tags and users.
First of all, when you have a dynamic query, it sounds like you can no longer use the substitution API to avoid SQL injection, since the query elements depend on what the user decided to include in the query. I can't see how else to build this query other than by appending strings.
Secondly, the query could potentially span multiple tables. For example, if SO allowed users to filter based on both users and tags, which probably live in two different tables, building the query would get a bit more complicated than just appending columns and WHERE clauses.
How do I go about implementing something like this?
The first rule is that users are allowed to specify values in SQL expressions, but not SQL syntax. All query syntax should be literally specified by your code, not user input. The values that the user specifies can be provided to the SQL as query parameters. This is the most effective way to limit the risk of SQL injection.
Many applications need to "build" SQL queries through code, because as you point out, some expressions, table joins, order by criteria, and so on depend on the user's choices. When you build a SQL query piece by piece, it's sometimes difficult to ensure that the result is valid SQL syntax.
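Here is a minimal sketch of that piece-by-piece style in Python, against a hypothetical posts/tags schema: every bit of SQL syntax is a literal in the code, and user input only ever travels as parameters.

# All SQL syntax is ours; user-supplied values stay in params.
def build_post_query(tag=None, author=None):
    sql = "SELECT p.id, p.title FROM posts p"
    clauses, params = [], []
    if tag is not None:
        sql += " JOIN post_tags pt ON pt.post_id = p.id"
        clauses.append("pt.tag = %s")
        params.append(tag)
    if author is not None:
        clauses.append("p.author = %s")
        params.append(author)
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params

# cursor.execute(*build_post_query(tag="sql", author="alice"))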
I worked on a PHP class called Zend_Db_Select that provides an API to help with this. If you like PHP, you could look at that code for ideas. It doesn't handle any query imaginable, but it does a lot.
Some other PHP database frameworks have similar solutions.
Though not a general solution, here are some steps that you can take to mitigate the dynamic yet safe query issue.
Criteria in which a column value belongs to a set of values of arbitrary cardinality do not need to be dynamic. Consider using either the instr function or a special filtering table against which you join (see the sketch after this list). This approach can easily be extended to multiple columns as long as the number of columns is known. Filtering on users and tags could easily be handled with this approach.
When the number of columns in the filtering criteria is arbitrary yet small, consider using different static queries for each possibility.
Only when the number of columns in the filtering criteria is arbitrary and potentially large should you consider using dynamic queries. In which case...
To be safe from SQL injection, either build or obtain a library that defends against that attack. Though more difficult, this is not an impossible task. This is mostly about escaping SQL string delimiters in the values to filter for.
To be safe from expensive queries, consider using views that are specially crafted for this purpose and some up front logic to limit how those views will get invoked. This is the most challenging in terms of developer time and effort.
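As an illustration of the filtering-table idea from the first step (a sketch, with an assumed schema): the user's values are staged in a temporary table, so the query text itself never changes.

import psycopg2

conn = psycopg2.connect("postgresql://localhost/app")  # assumed DSN
tags = ["sql", "python"]  # user input of arbitrary cardinality

with conn, conn.cursor() as cur:
    # Stage the user's values...
    cur.execute("CREATE TEMP TABLE tag_filter (tag TEXT) ON COMMIT DROP")
    cur.executemany("INSERT INTO tag_filter VALUES (%s)", [(t,) for t in tags])

    # ...then the query itself stays completely static.
    cur.execute("""
        SELECT DISTINCT p.id, p.title
        FROM posts p
        JOIN post_tags pt ON pt.post_id = p.id
        JOIN tag_filter f ON f.tag = pt.tag""")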
If you were using Python to access your database, I would suggest the Django model system. There are many similar APIs, both for Python and for other languages (notably in Ruby on Rails). I am saving so much time by avoiding the need to talk to the database directly in SQL.
From the example link:
# Model definition
from django.db import models

class Blog(models.Model):
    name = models.CharField(max_length=100)
    tagline = models.TextField()

    def __unicode__(self):
        return self.name
Model usage (this is effectively an insert statement)
from mysite.blog.models import Blog
b = Blog(name='Beatles Blog', tagline='All the latest Beatles news.')
b.save()
The queries get much more complex: you pass around a query object, and you can add filters / sort elements to it. When you are finally ready to use the query, Django creates a SQL statement that reflects all the ways you adjusted the query object. I think that it is very cute.
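Continuing the Blog model above, this is roughly what passing the query object around looks like; nothing hits the database until the results are iterated:

from mysite.blog.models import Blog

qs = Blog.objects.all()                     # no SQL executed yet
qs = qs.filter(name__icontains="beatles")   # add a filter
qs = qs.order_by("-name")                   # add a sort element

for blog in qs:                             # SQL is generated and run here
    print(blog.name)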
Other advantages of this abstraction
Your models can be created as database tables with foreign keys and constraints by Django
Many databases are supported (PostgreSQL, MySQL, SQLite, etc.)
Django analyses your models and creates an automatic admin site out of them.
Well, the options have to map to something.
A SQL query string built by concatenation isn't a problem if you still use parameters for the option values.
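For example, a sketch with a made-up option catalog: each option the user can pick maps to a fixed SQL fragment, and only the values travel as parameters.

# Each user-selectable option maps to a fixed, whitelisted fragment.
FILTERS = {
    "tag":    "t.name = %s",
    "author": "p.author = %s",
}

def build_where(options):
    # Unknown options raise KeyError instead of ever reaching the SQL string.
    clauses = [FILTERS[key] for key in options]
    params = list(options.values())
    return "WHERE " + " AND ".join(clauses), params

# build_where({"tag": "sql"})  ->  ("WHERE t.name = %s", ["sql"])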