Is MPTT overkill for maintaining a tree-like database even if the depth is around 3-4?

I am planning to store some tree like data in MySQL.
Topics can have sub-topics, and those in turn can have more sub-topics.
Is Modified Preorder Tree Traversal (MPTT) overkill even if the maximum depth is around 3-4?

Either way, you will have to write model methods like get_children(), get_root(), is_root() and others. In some cases django-mptt will reduce the number of database queries. It is not overkill; it will save you a lot of time. The django-mptt code is more reliable than yours will be, so your code will have fewer potential bugs. Just spend a few hours reading the full docs. =)
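For context, here is a minimal sketch of what such a model could look like with django-mptt (the Topic model and its fields are illustrative assumptions, not the question's actual schema):

```python
# A minimal sketch, assuming django-mptt is installed and 'mptt' is in
# INSTALLED_APPS. The model name "Topic" and its fields are illustrative.
from django.db import models
from mptt.models import MPTTModel, TreeForeignKey


class Topic(MPTTModel):
    name = models.CharField(max_length=100)
    parent = TreeForeignKey(
        'self', null=True, blank=True,
        related_name='children', on_delete=models.CASCADE,
    )

    class MPTTMeta:
        order_insertion_by = ['name']


# The tree methods then come for free, e.g.:
#   topic.get_children(), topic.get_root(), topic.is_root_node(),
#   topic.get_descendants(include_self=True)
```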

Related

Ruby eager query causing load issues

To explain the problem I am facing, let me take an example from the Ruby Sequel documentation.
In the case of
Album.eager(:artists, :genre).all
This will be fine if the data set is comparatively small.
If the data is huge, this query will collect artist_ids by the thousands. If the data is in the millions, that would be a million artist_ids being collected. Is there a way to fetch the same output but with a different, optimized query?
Just don't eager load large datasets, maybe? :)
(TL;DR: this is not an answer but rather a wordy way to say "there is no single simple answer" :))
Simple solutions for simple problems. The default (naive) eager loading strategy efficiently solves the N+1 problem for reasonably small relations, but if you try to eager load huge relations you might (and will) shoot yourself in the foot.
For example, if you fetch just 1000 albums with ids, say, 1...1000, then your ORM will fire additional eager loading queries for artists and genres that might look like select * from <whatever> where album_id in (1,2,3,...,1000), with 1000 ids in the in list. And this is already a problem on its own - the performance of where ... in queries can be suboptimal even in modern DBs with query planners as smart as Einstein. At a certain data scale this will become awfully slow even with such a small batch of data. And if you try to eager load just everything (as in your example), it's not feasible for almost any real-world data usage (except the smallest use cases).
So,
In general, it is better to avoid loading "all" and to load data in batches instead (see the sketch after this list);
EXPLAIN is your best friend - always analyze the queries that your ORM fires. Even a production-grade, battle-tested ORM can (and will) produce sub-optimal queries from time to time;
The latter is especially true for large datasets - at a certain scale you will have no choice but to move from the nice ORM API to lower-level, custom-tailored SQL queries (at least for the bottlenecks);
At a certain scale even custom SQL will not help anymore - the problems of that scale need to be addressed on another level (data remodeling, caching, sharding, CQRS, etc.)
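To make the batching point concrete, here is a minimal Python sketch of the idea (the question is about Ruby Sequel; this uses plain sqlite3 and invented table names purely to illustrate keyset-paginated loading with per-batch eager loading, not that library's API):

```python
# A minimal sketch of batched eager loading, using plain sqlite3.
# Table and column names (albums, artists, artist_id) are illustrative.
import sqlite3

BATCH_SIZE = 1000

def load_albums_in_batches(conn: sqlite3.Connection):
    last_id = 0
    while True:
        # Keyset pagination: fetch the next page of albums by primary key.
        albums = conn.execute(
            "SELECT id, artist_id, name FROM albums "
            "WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, BATCH_SIZE),
        ).fetchall()
        if not albums:
            break
        # Eager-load artists only for this batch, not for the whole table,
        # so the IN list never grows past BATCH_SIZE ids.
        artist_ids = sorted({row[1] for row in albums})
        placeholders = ",".join("?" * len(artist_ids))
        artists = conn.execute(
            f"SELECT id, name FROM artists WHERE id IN ({placeholders})",
            artist_ids,
        ).fetchall()
        artists_by_id = {a[0]: a for a in artists}
        yield [(album, artists_by_id.get(album[1])) for album in albums]
        last_id = albums[-1][0]
```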

Database design for: Very hierarchical data; off-server subset caching for processing; small to moderate size; (complete beginner)

I found myself with a project (very relaxed, little to no consequences on failure) that I think requires a database of some sort to solve. The problem is that, while I'm still quite inexperienced in general, I've never touched any database beyond the tutorials I could dig up with Google and setting up your average home cloud. I'm stuck on not knowing what I do not know.
This is roughly the situation:
Several hundred different automated test-systems will frequently write small amounts of data into a database over a slow network. A few users will then infrequently pull large subsets of that data from the database over a slow network. The data will then be processed, which will require a large number of reads; very high performance at this point is desired.
This will be the data (in orders of magnitude):
1000 products containing
10 variants containing
100 batches containing
100 objects containing
10 test-systems containing
100 test-steps containing
10 entries
It is basically a labeled B-tree with the test-steps as leaf nodes (since their format has been standardized).
A batch will always belong to one variant, an object will always belong to the same variant (but possibly multiple batches), and a variant will always belong to one product. There are hundreds of thousands of different test-steps.
Possible queries will try to get (e.g.):
Everything from a batch (optionally filtered by the value of an entry within a range)
Everything from a variant
All test-steps of the type X and Y from a test-system with the name Z
As far as I can tell, rows hundreds of thousands of columns wide (containing everything described above) do not seem like a good idea, and neither does about a trillion rows (and the middle ground between the two still seems quite extreme).
I'd really like to leverage the hierarchical nature of the data, but all I found on, e.g., nested databases is that they're simply not a thing.
It'd be nice if you could help me with:
What to search for
What'd be a good approach to structure and store this data
Some place where I can learn about avoiding the SQL horror stories, of which even I've found plenty
Whether there is a great way / best practice I should know of for transmitting the queried data and caching it locally for processing
Thank you and have a lovely day
Andreas
Search for "database normalization".
A normalized relational database is a fine structure.
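To make the normalization suggestion concrete, here is a rough sketch of one table per level of the hierarchy described in the question (all table, column, and index names are illustrative assumptions; sqlite3 is used only to keep the snippet self-contained):

```python
# A rough sketch of a normalized relational layout for the hierarchy in the
# question. All names are assumptions, not a prescription.
import sqlite3

schema = """
CREATE TABLE product      (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE variant      (id INTEGER PRIMARY KEY, product_id INTEGER REFERENCES product(id), name TEXT);
CREATE TABLE batch        (id INTEGER PRIMARY KEY, variant_id INTEGER REFERENCES variant(id), name TEXT);
CREATE TABLE object       (id INTEGER PRIMARY KEY, variant_id INTEGER REFERENCES variant(id), serial TEXT);
-- An object can appear in multiple batches, so that link gets its own table.
CREATE TABLE object_batch (object_id INTEGER REFERENCES object(id), batch_id INTEGER REFERENCES batch(id), PRIMARY KEY (object_id, batch_id));
CREATE TABLE test_system  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE test_step    (id INTEGER PRIMARY KEY, object_id INTEGER REFERENCES object(id), test_system_id INTEGER REFERENCES test_system(id), step_type TEXT);
CREATE TABLE entry        (id INTEGER PRIMARY KEY, test_step_id INTEGER REFERENCES test_step(id), name TEXT, value REAL);

-- Indexes aimed at the query patterns listed in the question.
CREATE INDEX idx_entry_step   ON entry(test_step_id);
CREATE INDEX idx_step_object  ON test_step(object_id);
CREATE INDEX idx_step_system  ON test_step(test_system_id, step_type);
CREATE INDEX idx_batch_object ON object_batch(batch_id);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(schema)

# Example: all test-steps of type X or Y from the test-system named Z.
rows = conn.execute(
    "SELECT ts.* FROM test_step ts "
    "JOIN test_system sys ON sys.id = ts.test_system_id "
    "WHERE sys.name = ? AND ts.step_type IN (?, ?)",
    ("Z", "X", "Y"),
).fetchall()
```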
If you want to avoid the horrors of SQL, you could also try a No-SQL Document-oriented Database, like MongoDB. I actually prefer this kind of database in a great many scenarios.
The database will cache your query results, and of course, whichever tool you use to query the database will cache the data in the tool's memory (or it will cache at least a subset of the query results if the number of results is very large). You can also write your results to a file. There are many ways to "cache", and they are all useful in different situations.

Efficient Database Structure for Deep Tree Data

For a very big database (more than a billion rows) with a very deep data tree, what is the most efficient structure? Reads are the heaviest load, but there are also regular changes to the tree.
There are several standard algorithms for representing a data tree. I have found this reference, part of the MongoDB manual, to be an excellent summary: http://docs.mongodb.org/manual/tutorial/model-tree-structures/
My system has properties that do not map well to any of these cases. The issue is that the depth of the tree is so great that keeping "ancestors" or a "path" becomes very large. The tree also changes frequently enough that the "Nested Sets" approach is not efficient. I am considering a hybrid of the "Materialized Paths" and "Parent References" approaches, where instead of the path I store a hash that is not guaranteed to be unique, but 90% of the time it is. For the 10% of the time there is a collision, the parent reference resolves it. The idea is that 90% of the time there is a fast query on the path hash. This is kind of like a Bloom filter technique. But this is all background: the question is in the first line of this post.
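A minimal in-memory sketch of that hybrid (the hash width and node layout are illustrative assumptions): look up by a short, non-unique path hash first, and fall back to walking parent references whenever the hash bucket holds more than one candidate.

```python
# A minimal sketch of the path-hash + parent-reference hybrid described above.
# The node layout and hash choice are assumptions for illustration only.
import hashlib

class Node:
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent

def path_of(node):
    # Re-derive the full path by walking parent references up to the root.
    parts = []
    while node is not None:
        parts.append(node.name)
        node = node.parent
    return "/".join(reversed(parts))

def path_hash(path, bits=32):
    # Deliberately small hash: cheap to index, but collisions are possible.
    return int.from_bytes(hashlib.sha1(path.encode()).digest()[: bits // 8], "big")

index = {}  # hash -> list of candidate nodes (collisions share a bucket)

def add(node):
    index.setdefault(path_hash(path_of(node)), []).append(node)

def lookup(path):
    # Fast path: a single candidate most of the time.
    # Slow path: on a collision, resolve by re-deriving each candidate's
    # full path through its parent references.
    candidates = index.get(path_hash(path), [])
    return next((n for n in candidates if path_of(n) == path), None)
```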
What I've done in the past with arbitrarily deep trees is just to store a Parent Key with each node, as well as a sequence number which governs the order of children under a parent. I used RDBMSs and this worked very efficiently. Arranging the tree structure after reading required code to put each node in a Child collection in the node's parent, but this in fact ran pretty fast.
It's a pretty naive approach, in that there's nothing clever about it, but it does work for me.
The tree had about 300 or 400 members total and was I think 7 or 8 levels deep. This part of the system had no performance problems at all: it was very fast. The UI was a different matter, but that's another story.
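A minimal sketch of that parent-key-plus-sequence-number layout and the reassembly step (the table, column names, and dict-based node representation are assumptions):

```python
# A minimal sketch of the parent-key + sequence-number layout described above,
# including the in-memory reassembly into child collections. Names are
# illustrative, and sqlite3 stands in for whatever RDBMS is actually used.
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE node (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES node(id),  -- NULL for the root
    seq       INTEGER NOT NULL,             -- order among siblings
    label     TEXT
);
CREATE INDEX idx_node_parent ON node(parent_id, seq);
""")

def load_tree(conn):
    """Read every node once, then arrange children under their parents."""
    child_ids = defaultdict(list)
    nodes = {}
    root_id = None
    for node_id, parent_id, seq, label in conn.execute(
        "SELECT id, parent_id, seq, label FROM node ORDER BY parent_id, seq"
    ):
        nodes[node_id] = {"id": node_id, "label": label, "children": []}
        if parent_id is None:
            root_id = node_id
        else:
            child_ids[parent_id].append(node_id)
    for parent_id, kids in child_ids.items():
        nodes[parent_id]["children"] = [nodes[k] for k in kids]
    return nodes.get(root_id)
```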

Best (NoSQL?) DB for small docs/records, unchanging data, lots of writes, quick reads?

I found a few questions in the same vein as this, but they did not include much detail on the nature of the data being stored, how it is queried, etc., so I thought this would be worthwhile to post.
My data is very simple, three fields:
- a "datetimestamp" value (date/time)
- two strings, "A" and "B", both < 20 chars
My application is very write-heavy (hundreds per second). All writes are new records; once inserted, the data is never modified.
Regular reads happen every few seconds, and are used to populate some near-real-time dashboards. I query against the date/time value and one of the string values; e.g., get all records where the datetimestamp is within a certain range and field "B" equals a specific search value. These queries typically return a few thousand records each.
Lastly, my database does not need to grow without limit; I would be looking at purging records that are 10+ days old either by manually deleting them or using a cache-expiry technique if the DB supported one.
I initially implemented this in MongoDB, without being aware of the way it handles locking (writes block reads). As I scale, my queries are taking longer and longer (30+ seconds now, even with proper indexing). Now with what I've learned, I believe that the large number of writes are starving out my reads.
I've read the kkovacs.eu post comparing various NoSQL options, and while I learned a lot I don't know if there is a clear winner for my use case. I would greatly appreciate a recommendation from someone familiar with the options.
Thanks in advance!
I have faced a problem like this before in a system recording process-control measurements. This was done with 5 MHz IBM PCs, so it is definitely possible. The use cases were more varied (summarization by minute, hour, eight-hour shift, day, week, month, or year), so the system recorded all the raw data but also aggregated it on the fly for the most common queries (which were five-minute averages). In the case of your dashboard, it seems like five-minute aggregation is also a major goal.
Maybe this could be solved by writing a pair of text files for each input stream: one with all the raw data, another with the multi-minute aggregation. The dashboard would ignore the raw data. A database could of course be used to do the same thing, but simplifying the application could mean no RDB is needed: simpler to engineer and maintain, easier to fit on a microcontroller or embedded system, and a more friendly neighbor on a shared host.
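A minimal sketch of that pair-of-files idea (the CSV layout, five-minute bucket size, and file naming are assumptions): append every raw record as it arrives, and keep a running per-bucket aggregate keyed by field "B" that the dashboard reads instead of the raw stream.

```python
# A minimal sketch of the raw-file + aggregated-file idea described above.
# The CSV layout, five-minute bucketing, and file names are assumptions.
import csv
from collections import Counter
from datetime import datetime

BUCKET_SECONDS = 300  # five-minute aggregation window

def bucket_of(ts: datetime) -> datetime:
    return datetime.fromtimestamp((ts.timestamp() // BUCKET_SECONDS) * BUCKET_SECONDS)

class StreamWriter:
    """Appends every record to a raw file and keeps per-bucket counts keyed
    by field B; the dashboard reads only the aggregate file."""

    def __init__(self, stream_name: str):
        self.raw = open(f"{stream_name}.raw.csv", "a", newline="")
        self.raw_writer = csv.writer(self.raw)
        self.agg_path = f"{stream_name}.agg.csv"
        self.current_bucket = None
        self.counts = Counter()

    def write(self, ts: datetime, a: str, b: str):
        self.raw_writer.writerow([ts.isoformat(), a, b])
        bucket = bucket_of(ts)
        if bucket != self.current_bucket:
            self._flush()  # finished a bucket: append it to the aggregate file
            self.current_bucket = bucket
        self.counts[b] += 1

    def _flush(self):
        if self.current_bucket is None:
            return
        with open(self.agg_path, "a", newline="") as f:
            writer = csv.writer(f)
            for b, n in sorted(self.counts.items()):
                writer.writerow([self.current_bucket.isoformat(), b, n])
        self.counts.clear()

    def close(self):
        self._flush()
        self.raw.close()
```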
Deciding on the right NoSQL product is not an easy task. I would suggest you learn more about NoSQL before making your choice, if you really want to make sure that you don't end up just trusting someone else's suggestions or favorites.
There is a good book which gives a really good background on NoSQL, and anyone who is starting out with NoSQL should read it.
http://www.amazon.com/Professional-NoSQL-Wrox-Programmer/dp/047094224X
I hope reading some of the chapters in the book will really help you. There are comparisons and explanations about what is good for which job, and a lot more.
Good luck.

When to choose Cassandra over a SQL/Semantic Store solution?

I have 30-40 GB of data and 3 developer machines (Core Duo i4, 3 GB). The data is a set of graph-like structures, and I have queries that traverse the graphs. Is there a guideline that could help me decide between Cassandra and a classic solution, e.g., SQL or a semantic store? My current plan is to set up Cassandra and see how it works, but I would like to learn more before starting the installation.
I would not use Cassandra for any kind of graph-level structure. It has been about 6 months since I looked into doing something similar, so maybe Cassandra has moved on since then, but I found it was fundamentally limited by the fact that it only has row-level indexes.
For a graph-based structure (assuming a simplistic one-arc-per-row layout) you really need column indexes as well. If you want to traverse the graph, you want to be able to start from a particular node A and find all the arcs that go out from that node (assuming a directed graph); without a column index you'd have to do a row scan of the entire dataset, as there is no built-in functionality for saying "give me the rows that have A in a particular column".
To achieve this you effectively have to design a data layout for Cassandra that gives you an inverted index. This is somewhat tricky and requires you to know ahead of time the types of queries you want to answer - answering new types of queries at a later date may be very difficult or impossible if you don't design well. These slides demonstrate the idea, and I hope they make it clear that you effectively have to construct your own indexes.
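To make the inverted-index point concrete, here is a minimal sketch of the idea with plain Python dictionaries standing in for column families (this is not Cassandra's API; it only shows the extra index you would have to build and maintain yourself):

```python
# A minimal sketch of the hand-built inverted index described above, with
# plain dicts standing in for column families. This is not Cassandra's API.

# "Row per arc" layout: arc_id -> (source, target)
arcs = {}

# The inverted index you must maintain yourself: source node -> set of
# arc_ids, so outgoing arcs can be found without scanning every row.
outgoing = {}

def add_arc(arc_id, source, target):
    arcs[arc_id] = (source, target)
    outgoing.setdefault(source, set()).add(arc_id)

def arcs_from(node):
    """One traversal step: all arcs leaving `node`, found via the index only."""
    return [arcs[a] for a in sorted(outgoing.get(node, ()))]

add_arc("a1", "A", "B")
add_arc("a2", "A", "C")
add_arc("a3", "B", "C")
print(arcs_from("A"))  # [('A', 'B'), ('A', 'C')]
```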
For graph structures that can be decomposed into triples, consider an RDF store; for more complex structures, consider a full-blown graph database. If you really want to do NoSQL you can probably build something on top of a document database, as they tend to have much better indexing, but again you'll have to think carefully about how you store your data.