Is there any open-source hierarchical database or emulation atop of existing RDBMS?
I am searching a DMBS (or plugin to existing RDBMS) which can store hierarchical data and permits to perform queries on hierarchical data (something like "SELECT LEVEL ... CONNECT BY ...", "SELECT PARENT ..." for example). I know there is some support in Oracle, but is there a more complex solution?
There isn't a standardized plugin for doing this. I've looked more than once. However, there are a number of options. See from my earlier question on the same topic:
What are the options for storing hierarchical data in a relational database?
In short, if you're using a table with ID and ParentID (a.k.a. adjacency list) you use Common Table Expressions with most databases (Oracle's CONNECT BY being one of the most notable exceptions). OTO, something like materialized path or nested sets may be a better fit for your situation - for instance ability to easily find "lineage" where with adjacency list this is an expensive operation.
Usually what ends up happening with a system that needs to work extensively with hierarchical data, for instance a CMS, is that it implements more than one of these solutions. The assumption is reads heavily outweigh writes.
Have you tried the Nest Set model http://mikehillyer.com/articles/managing-hierarchical-data-in-mysql/
Relational data doesn't directly support hierarchies in the way that an inherently hierarchical structure like XML does. You have to use a data model such as nested sets or a straight self-join to model the hierarchy.
Depending on the type of system you have, Common Table Expressions will let you run hierarchical queries on data. CTEs are supported by SQL Server versions since 2005, Recent versions of DB/2 and PostgreSQL - and probably some other systems. CTEs are a bit more fiddly than CONNECT BY, but they do run on a fair variety of platorms.
Related
I have a question regarding the usage of a DocumentDB or SQL-Database.
E.g. I have categories which can have multiple child categories and so on. Every category can have multiple attributes and every attribute can have one or many values. Would it be better to use a schemaless solution like a DocumentDB because I could add new sub categories etc. with no effort or is it better to stick with a schema and use a SQL-Database.
Many thanks in advance.
As #DavidMakogon said, there is not a standard & absolute right answer, it just up to you and up to application scenario. For this current needs to store a tree structure of categories with attributes, it's simple to design database schema & develop application for both without any addition condition like data volume and concurrency, etc, and both are good.
Consideration for others, there are two documents may help analyzing the features which you may need to use in your application or more suitable for your scenario, to make your choice.
MongoDB vs MySQL: Comparison Between RDBMS and Document Oriented Database, it's very similar for comparision between DocumentDB and SQL Database.
10 things never to do with a relational database, I think the advantage of RDBMS is as well known and be suitable for which scenario, but NoSQL's not.
Hope it helps.
I have been checking out GridGain for a while and came across some features regarding GridGain's SQL capabilities, which led me to some questions (that I couldn't find a firm answer in the docs)
From the examples, there is always an explicit data model. I am using Java, so that means there's always a class definition of the model to be queried for. The examples in the API docs: http://atlassian.gridgain.com/wiki/display/GG60/SQL,+Scan,+And+Full+Text+Queries begin by showing how properties much be annotated, which suggests to me an explicit model is always required. Properties of the model can be annotated for SQL querying such as "#GridCacheQuerySqlField". Is an explicit data model always required? Ideally, I would like a way to not have to explicitly state the model, as my use case does change often and has complex relations.
What subset of SQL queries can be performed through GridGain's SQL API? My use cases often require very complex queries. For example, in the docs (same link as above) it states that "Continuous Queries cannot be used with SQL. Only predicate-based queries are supported." where can I find what subset of SQL is supported (and under what conditions, as the example provided does not perform continuous sql queries unless the condition that queries are predicate-based is met)
Thanks in advance for the insight
GridGain has support for non-fixed data model in Enterprise version, namely portable objects. Portable objects allow you to render data model as a map-like nesting structure which allows dynamic structure changes, indexing and portability across different languages (Java, C#, .NET). You can take a look at portable objects in GridGain Enterprise edition examples and read documentation here: http://entdoc.gridgain.org/latest/Portable+Cross+Platform+Objects In open-source version explicit class definition is always required.
The SQL limitations are described in GridCacheQuery javadoc: http://gridgain.com/sdk/6.5.0/javadoc/org/gridgain/grid/cache/query/GridCacheQuery.html
Group by and sort by statements are applied separately on each node, so result set will likely be incorrectly grouped or sorted after results from multiple remote nodes are grouped together.
Aggregation functions like sum, max, avg, etc. are also applied on each node. Therefore you will get several results containing aggregated values, one for each node.
Joins will work correctly only if joined objects are stored in collocated mode or at least one side of the join is stored in REPLICATED cache.
In my project:
Data is not going to be modified (only query).
It is going to be more than 1.000.000 instances of data.
Query performance is critical.
In case of using SQL, it is going to be a single table with 7 columns. (no joints)
There are also different classification approaches used in NoSQL. Which are given below with some examples:
Column: Accumulo, Cassandra, HBase
Document: Clusterpoint, Couchdb, Couchbase, MarkLogic, MongoDB
Key-value: Dynamo, FoundationDB, MemcacheDB, Redis, Riak, FairCom c-treeACE
Graph: Allegro, Neo4J, OrientDB, Virtuoso, Stardog
Source: http://en.wikipedia.org/wiki/NoSQL#cite_note-7
First of all, does the database system really makes an observable amount of performance difference for this case?
If it makes then,can you please explain which one is more suitable for my project SQL or NoSQL, if NoSQL then which classification approach?
Thank you in advance
I am currently enrolled in a project to set up a "standard" Database with a huge amount of data. We start by implementing in SQL to see the performance of the queries. Once this is done we address the problem of performance.
There is multiple reasons for this, but to name a few:
Standard SQL is easily implemented and standard across multiple instances (as of present day)
If you know SQL, make a fast implementation. To save time and get the project going.
There are loads of information available about SQL implementations.
I cannot answer about NoSQL but hopefully someone can fill me in.
The important question you need to ask is what kind of queries you will be performing. For example ClusterPoint offers real-time aggregation, so if you need result grouping and extracting summaries, it gives you great performance.
For a regular key/value they should all perform pretty well, so pick the one you are most comfortable with.
I figure there has to be a specific design reason why you can't write a query like the following one:
select
(select column_name
from information_schema
where column_name not like '%rate%'
and table_name = 'Fixed_Income')
from Fixed_Income
and instead have to resort to dynamic SQL.
Anyone knows what that reason is? I tried Googling it, but all the hits were cries for help in solving the problem -- meaning it's a pretty widespread need and not well understood.
The reason is that the query optimizer needs to know the exact schema objects you are referring to at compile time. It needs them to optimize the query. You wouldn't believe how slow the RDBMS would be without having this information available to the query optimizer.
It's a little like the performance difference of static vs. dynamic typing in practice: There is usually a non-trivial difference (I'm thinking just about mainstream languages here). The compiler can exploit the static information to generate great code.
Even if this feature was present, it would be implemented by first computing the table and column names and then doing a standard "static" query planning.
You ask a very interesting question.
The "relational" in "relational algebra" refers to name-value pairs, not to relationships between tables. In relational algebra, there is no requirement that all records in a set (table) have the same columns.
My best guess is that the limitation is related to the idea of entity-relationship diagrams comes into play. A database is designed around tables, and these tables have relationships to each other. The choice of a relational database for data storage and access was specifically when the data could be stored this way. Knowing the entities and their attributes suggests a static form of the data and hence static references in queries.
In addition, SQL as a language is a declarative language rather than a procedural language. This suggests -- but does not impose -- a compilation step separate from the running of the query. In general, the SQL engine does the following (at a very high level):
Compiles the query, generally into some sort of data flow process.
Optimizes the data flow process. (Typically part of the compilation process.)
Runs the query.
The first two result in what is called "the query plan". You really cannot do optimization, though, unless you know about the objects you are operating on. So, dynamically choosing tables and columns means that optimization would be part of running the query rather than compiling it.
Finally, some databases like SQL Server support dynamic SQL. This allows you to build strings that get compiled and run at the same time. This is very useful for complex decision support queries. It is not recommended when you need fast transaction throughput, because the overhead for compilation is too high relative to the query.
I recently came across http://www.fossil-scm.org/index.html/doc/tip/www/theory1.wiki by D. Richard Hipp, the developer responsible for SQLite.
it go me thinking, is Fossil the only NoSQL database that uses SQL?
Do others uses SQL as a 'High Level Scripting Language'?
From the article, it sounds like Fossil isn't a database any more than git is a database. Yes, it's a thing that contains data, and yes, it's backed by a database, but it seems pretty far from a database itself. So the first part of of your question basically relies on a faulty assumption. There is a database called Friendly which uses MySQL to store schema-less models, but it seems like an awkward bandaid sort of solution at best.
I'm certainly not familiar with all of the NoSQL options out there, but, to my knowledge, none of the well-though-of ones use SQL for anything. MongoDB and CouchDB, the two I'm most familiar with, both use Javascript as part of their query interface, though in very different ways. MongoDB has queries more like what you'd expect from a relational database: you can write an arbitrary query for all documents that match a certain set of attributes. However, unlike a relational database, there's no such thing as a join (you'll only ever get a list of distinct documents back, not compound documents) and you can write arbitrary Javascript code to select documents. CouchDB, on the other hand, does not allow arbitrary queries. Instead, you create views (which are essentially simpler key-value stores) using map/reduce functions written in Javascript and then query those views from a start key to and end key.
In both cases, the type of information being transmitted to the server to perform the query isn't well-suited for the type of problem that SQL is good at solving. The trade-off to SQL being so high-level (to use the logic of the author of the paper) is that it's only suitable for a very narrow set of problems.
The creator of Fossil / SQLite is working and pushing UnQL as the NoSQL standard:
UnQL means Unstructured Query Language.
It's an open query language for JSON, semi-structured and document
databases.
It looks like a stripped down version of SQL.