What is a columnar database? - sql

I have been working with warehousing for a while now.
I am intrigued by Columnar Databases and the speed they offer for data retrieval.
I have a multi-part question:
How do Columnar Databases work?
How do they differ from relational databases?

How do columnar databases work?
The defining concept of a column-store is that the values of a table are stored contiguously by column. Thus the classic supplier table from CJ Date's supplier and parts database:
SNO STATUS CITY SNAME
--- ------ ---- -----
S1 20 London Smith
S2 10 Paris Jones
S3 30 Paris Blake
S4 20 London Clark
S5 30 Athens Adams
would be stored on disk or in memory something like:
S1S2S3S4S5;2010302030;LondonParisParisLondonAthens;SmithJonesBlakeClarkAdams
This is in contrast to a traditional rowstore which would store the data more like this:
S120LondonSmith;S210ParisJones;S330ParisBlake;S420LondonClark;S530AthensAdams
From this simple concept flows all of the fundamental differences in performance, for better or worse, between a column-store and a row-store. For example, a column store will excel at doing aggregations like totals and averages, but inserting a single row can be expensive, while the inverse holds true for row-stores. This should be apparent from the above diagram.
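To make this concrete, here is a small SQL sketch against the supplier table above (assuming it is named s with the columns shown): the aggregate below only needs the STATUS block in a column-store, while the insert must touch every column block.

-- Reads only the STATUS column block in a column-store;
-- a row-store has to scan complete rows to reach the same values.
SELECT AVG(status) FROM s;

-- Touches all four column blocks in a column-store;
-- a row-store appends a single contiguous record.
INSERT INTO s (sno, status, city, sname)
VALUES ('S6', 40, 'Oslo', 'Olsen');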
How do they differ from relational databases?
A relational database is a logical concept. A columnar database, or column-store, is a physical concept. Thus the two terms are not comparable in any meaningful way. Column-oriented DBMSs may be relational or not, just as row-oriented DBMSs may adhere more or less to relational principles.

How do Columnar Databases work?
A columnar database is a concept rather than a particular architecture/implementation. In other words, there isn't one particular description of how these databases work; indeed, several are built upon traditional, row-oriented DBMSs, simply storing the info in tables with one (or rather, often two) columns (and adding the necessary layer to access the columnar data in an easy fashion).
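As a rough illustration of that emulation strategy (the table and column names here are hypothetical), each logical column of the supplier table could be stored as its own narrow table, with full rows reassembled by joining on a row identifier:

-- One narrow table per logical column; rid ties the pieces together.
CREATE TABLE s_status (rid INT PRIMARY KEY, status INT);
CREATE TABLE s_city   (rid INT PRIMARY KEY, city VARCHAR(20));

-- An aggregate touches only the one narrow table it needs ...
SELECT AVG(status) FROM s_status;

-- ... while reassembling a row costs one join per column.
SELECT st.status, c.city
FROM s_status st
JOIN s_city c ON c.rid = st.rid;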
How do they differ from relational databases?
They generally differ from traditional (row-oriented) databases with regard to ...
performance...
storage requirements ...
ease of modification of the schema ...
...in specific use cases of DBMSes.
In particular they offer advantages in the areas mentioned when the typical use is to compute aggregate values over a limited number of columns, as opposed to trying to retrieve all or most columns for a given entity.
Is there a trial version of a columnar database I can install to play around with? (I am on Windows 7)
Yes, there are commercial, free and also open-source implementations of columnar databases. See the list at the end of the Wikipedia article for starters.
Beware that several of these implementations were introduced to address a particular need (say, a very small footprint, highly compressible distribution of data, or sparse matrix emulation, etc.) rather than to provide a general-purpose column-oriented DBMS per se.
Note: The remark about the "single purpose orientation" of several columnar DBMSs is not a critique of these implementations, but rather an additional indication that such an approach for DBMSs strays from the more "natural" (and certainly more broadly used) approach to storing record entities. As a result, this approach is used when the row-oriented approach isn't satisfactory, and it therefore tends to
a) be targeted for a particular purpose
b) receive less resources/interest than work on "General Purpose", "Tried and Tested", tabular approach.
Tentatively, the Entity-Attribute-Value (EAV) data model may be an alternative storage strategy you may want to consider. Although distinct from the "pure" Columnar DB model, EAV shares several of the characteristics of Columnar DBs.

I would say the best way to understand column-oriented databases is to look at HBase (Apache HBase). You can check out the code and explore further to find out about the implementation.

Also, Columnar DBs have a built-in affinity for data compression, and the loading process is unique. Here's an article I wrote in 2008 that explains a bit more.
You may also be interested in a new report from IDC's Carl Olofson on 3rd generation DBMS technology. It discusses columnar, et al. If you're not an IDC client you can get it free on our site. He's doing a webinar on June 16th, too (also on our site).
(BTW, one comment above lists asterdata but I don't think they are columnar.)

To understand what a column-oriented database is, it is best to contrast it with a row-oriented database.
Row-oriented databases (e.g. MS SQL Server and SQLite) are designed to efficiently return data for an entire row. They do this by storing all the column values of a row together. Row-oriented databases are well-suited for OLTP systems (e.g. retail sales and financial transaction systems).
Column-oriented databases are designed to efficiently return data for a limited number of columns. They do this by storing all of the values of a column together. Two widely used column-oriented databases are Apache HBase and Google BigTable (used by Google for its Search, Analytics, Maps and Gmail). They are suitable for big data projects. A column-oriented database will excel at read operations on a limited number of columns; however, write operations will be expensive compared to row-oriented databases.
For more: https://en.wikipedia.org/wiki/Column-oriented_DBMS

Product information. This may help. These were the top featured products in a Google search.
http://www.vertica.com/
http://www.paraccel.com/
http://www.asterdata.com/index.php

kx is another columnar database, used for example in the financial sector. The licence was somewhere around $50K last time I checked, though. No optimisation needed, no indexes needed, because kx has powerful operators (MATLAB equivalents: .*, kron, bsxfun, ...).

Related

Relational table - is JSON recommended?

Trying to implement a relational table that links a user to their favorite books.
So I have a table with book_id and user_id
Sample Table:
user 1 favourite 1
user 1 favourite 2
user 1 favourite 3
Can't I have something like a JSON array?
user 1 [favourite 1, favourite 2, favourite 3] ?
Performance-wise, is it better to do things like in the first example, or the second?
The first solution is a junction/association table and it is the recommended solution for SQL-based relational databases. Basically, you have two entities, books and users. The junction table is a third table that connects them.
SQL provides the functionality for this purpose. Relational databases provide the mechanisms for optimizing performance -- through indexes, column stores, horizontal partitioning, and fancy algorithms -- that make this work effectively, even for very large databases.
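A minimal sketch of such a junction table (the table and column names are illustrative, not taken from the question's schema):

CREATE TABLE users (user_id INT PRIMARY KEY, name VARCHAR(100));
CREATE TABLE books (book_id INT PRIMARY KEY, title VARCHAR(200));

-- The junction table: one row per (user, favourite book) pair.
CREATE TABLE user_favourites (
    user_id INT NOT NULL REFERENCES users (user_id),
    book_id INT NOT NULL REFERENCES books (book_id),
    PRIMARY KEY (user_id, book_id)   -- also prevents duplicate favourites
);

-- All favourite books for user 1.
SELECT b.title
FROM user_favourites f
JOIN books b ON b.book_id = f.book_id
WHERE f.user_id = 1;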
Does this mean that JSON structures are never used? Absolutely not. They have their place -- some databases even provide indexing support for them.
However, from the database perspective, JSON structures add additional overhead for extracting values. They also impede optimization. So, such an array within a row is not the first choice for the data representation.
For straight performance out of a SQL database, the join table is better as per Gordon Linoff's answer.
If you're serialising/deserialising complex objects, however, it is often more performant to store the object as JSON in a field in a table.
I had a project where I had a fully normalised structure to support an advertising schedule. It worked well until one client created a schedule with 40,000 spots in it. The time to save and load the large advertising schedule versus the small schedules was minutes versus seconds.
I changed the structure to store the object as JSON. The time to serialise then save and deserialise then load the large advertising schedule went from minutes to seconds.
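If you do go the JSON route inside the database itself, here is a hedged sketch assuming PostgreSQL and its jsonb type (other engines offer similar but not identical JSON support):

CREATE TABLE user_favourites_json (
    user_id    INT PRIMARY KEY,
    favourites jsonb NOT NULL        -- e.g. '[1, 2, 3]'
);

INSERT INTO user_favourites_json VALUES (1, '[1, 2, 3]');

-- Users who have book 2 among their favourites; a GIN index on
-- favourites (USING GIN) can make this containment test fast.
SELECT user_id
FROM user_favourites_json
WHERE favourites @> '2'::jsonb;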

How to predict resource requirements vs performance of many to many queries

In a traditional RDBMS,
Many-to-Many joins are much more resource-consuming than Many-to-One joins.
I am observing Many-to-Many queries getting slow past about 10 to 15 million rows in the tables, using mainstream computers with 3-4 gigabytes of RAM.
Querying Many-to-One relationships, however, I observed no slow-down even with 50 million rows.
How does one predict memory and CPU requirements against expected performance? (Is any benchmark available?)
Past which thresholds is it worth using other solutions instead? (MPP or NoSQL)
In a traditional RDBMS, why are Many-to-Many Joins so much more resource consuming than Many-to-One joins?
An SQL FOREIGN KEY (FK) constraint holds when the subrow of values for a list of columns also appears elsewhere as a UNIQUE NOT NULL column list (a superkey). So for every row in the referencing table there can be only one matching row in the referenced table, and the result of a JOIN ON equality of an FK & its superkey can output at most one row per row in the FK table. In general, though, since a JOIN returns every possible combination of input rows that satisfies the ON condition, there can be many more rows output.
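A small sketch of the difference, with hypothetical tables:

-- Many-to-one: each order references exactly one customer via an FK,
-- so this join outputs at most one row per order.
SELECT o.order_id, c.name
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id;

-- Many-to-many: neither side is unique on city, so every
-- (supplier, customer) pair sharing a city is output.
SELECT s.sname, c.name
FROM suppliers s
JOIN customers c ON c.city = s.city;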
After all, aren't Many-to-Many relationships just like two Many-to-One relationships?
It's not clear what you mean by "just like" or how you think it suggests or justifies anything. And a join is not a relationship. (A table represents a relationship.)
How does one predict memory and CPU requirements against expected performance? (Is any benchmark available?)
Many SQL DBMSs have a query planner/optimizer EXPLAIN command, and others, for inquiring about what a query will do or did do, and at what cost.
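For example, in PostgreSQL (other DBMSs have similar commands with different syntax and output; tables a and b here are hypothetical):

-- Show the chosen plan and its cost estimates without running the query.
EXPLAIN SELECT a.x, b.y FROM a JOIN b ON b.a_id = a.id;

-- Actually run the query and report real timings and row counts.
EXPLAIN ANALYZE SELECT a.x, b.y FROM a JOIN b ON b.a_id = a.id;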
Read about (logical & physical) relational query optimization/implementation, in general & in any particular DBMS. Wikipedia happens to have a decent article. Many textbooks are online.
Past which thresholds is it worth using other solutions instead? (MPP or NoSQL)
Re NoSQL, search my answers, of which the most recent are:
How to convert an existing relational database to a key-value store?
How does noSQL perform horizontal scaling and how it is more efficient than RDBMS scaling
Reasonable Export of Relational to Non-Relational Data
RDBMSs offer generic straightforward querying with certain computational complexity & optimization opportunities. Relatively speaking, other systems specialize, with certain aspects improved at the expense of others.

PostgreSQL -- put everything into one table?

So I'm trying to learn some basic database design principles and decided to download a copy of the sr27 database provided by the USDA. The database is storing nutritional information on food, and statistical information on how these nutritional values were derived.
When I first started this project, my thoughts were: well, I want to be able to search for food names, and I will probably want to do some basic statistical modeling on the most common nutritional values like calories, proteins, fats, etc. So, the thought was simple, just make 3 tables that look like this:
One table for food names
One table for common nutritional values (1-1 relationship with names)
One table for other nutritional values (1-1 relationship with names)
However, it's not clear that this is even necessary. Do you gain anything from partitioning the columns (or values) based on the idea: I like to do searches on names, so let's keep that as one table for less overhead, and I like to do calculations on common nutritional values, so let's keep that as another table. (Question 1) Or does proper indexing make this moot?
My next question is then: Why in the world did the USDA decide to use 12 tables? Is this considered good database design practice, or would they have been better off merging a lot of these tables? (this excerpt is taken from the PDF provided in the USDA link above, pg 29)
Do you gain anything from partitioning the columns (or values) based on the idea: I like to do searches on names, so let's keep that as one table for less overhead, and I like to do calculations on common nutritional values, so let's keep that as another table. (Question 1) Or does proper indexing make this moot?
If you just have a list of items and you want to summarize on just some of them, then indexing is the way to address performance, not splitting some columns into another table arbitrarily.
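For instance, a minimal sketch (the table and column names are hypothetical, not the actual sr27 schema):

-- An index makes name searches fast without splitting
-- columns off into a separate table.
CREATE INDEX idx_food_name ON food_names (name);

SELECT * FROM food_names WHERE name LIKE 'Butter%';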
Also, do read up on Normalization.
My next question is then: Why in the world did the USDA decide to use 12 tables? Is this considered good database design practice, or would they have been better off merging a lot of these tables? (this excerpt is taken from the PDF provided in the USDA link above, pg 29)
Probably because the types of questions they want to ask are not exactly the same ones you are trying to ask.
They clearly have more info about each food - like groups, nutrients, weights, and they are also apparently tracking where the source data is coming from...
There are important rules for relational database design - the normal forms - that reduce certain anomalies and reduce I/O operations. This design is usual for OLTP databases - and I have had occasion to see terribly slow databases because the developers had zero knowledge of it. Analytical (OLAP) databases are a little bit different - wide tables are used there, and some modern OLAP databases with column stores support them.
PostgreSQL is a classic row-store database, so all-in-one-table is not common and is not a good strategy. You can use a view to create some typical and often-used views of the data, so the complex schema can be invisible (transparent) to you.
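A hedged sketch of that view idea (the table and column names are illustrative, not the actual sr27 schema):

-- One convenient "wide" view over the normalized tables;
-- queries against it read like queries against a single table.
CREATE VIEW food_common_nutrients AS
SELECT f.food_id,
       f.food_name,
       n.calories,
       n.protein,
       n.fat
FROM foods f
JOIN nutrient_values n ON n.food_id = f.food_id;

SELECT * FROM food_common_nutrients WHERE food_name LIKE 'Cheddar%';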

Column oriented database vs row oriented database

I have used row-oriented database design for a long time, and except for data warehouse projects and big data samples, I have not used column-oriented database design for an OLTP app.
My row oriented table looks like
ID, Make, Model, Month, Miles, Cost
1 BMW Z3 12 12000 100
Some people on our team are advocating a column-oriented database design.
They suggest that all the column names should be property names in a Property table.
Then another table Quote will have two columns PropertyName and PropertyValue.
In the .NET code, we read each key, compare it, and convert the value to a strongly typed object. The code is really getting messy.
if (qwi.DomainCode == typeof(CoreBO.Base.iQQConstants.MBPCollateralInfo).Name)
{
    if (qwi.RefCode == iQQConstants.MBPCollateralInfo.ENGINETYPE)
    {
        Aspiration = qwi.Value;
    }
    else if (qwi.RefCode == iQQConstants.MBPCollateralInfo.FUELTYPE)
    {
        FuelType = qwi.Value;
    }
    else if (qwi.RefCode == iQQConstants.MBPCollateralInfo.MAKE)
    {
        Make = qwi.Value;
    }
    else if (qwi.RefCode == iQQConstants.MBPCollateralInfo.MILEAGE)
    {
        int reading = 0;
        bool success = int.TryParse(qwi.Value, out reading);
        if (success)
        {
            OdometerReading = reading;
        }
    }
}
The argument for this column-oriented design is that we won't have to change the table schema or the stored procs (we are still using stored procs instead of Entity Framework).
It seems like we are heading into a real problem. Is column-oriented design well accepted in the industry?
I am having trouble with your terminology. You are describing an EAV structure (standing for Entity-Attribute-Value).
Aside: A "column-oriented" database usually refers to a database that stores each column separately from others (when I learned about databases, this was called "vertical partitioning", but I don't think that caught on). Examples include Paracel and Vertica.
An entity-attribute-value database is storing each attribute for an entity as a separate row.
The first problem that you have with your particular structure is typing. Some of the attributes are strings and some are numbers. This becomes a management nightmare in an EAV world. Either you store everything as strings (losing the ability to type-check values and to guarantee that arithmetic works) or you include multiple columns for different types along with a type column (making queries much more complicated).
Similarly, constraints and foreign key references are much harder to implement. Also, because you are repeating the entity id and attribute id on each row, the data often takes up more space. NULL values are typically quite space efficient.
On the OLTP side, you have another problem. When you want to insert an entity, you typically want to insert a bunch of attributes as well. One insert has now turned into many inserts, and you'll want to start wrapping these in transactions, affecting performance.
Given all these shortcomings, you might think EAV models should never be used. But there is a place for them. They are particularly useful when attributes are changing over time - say, an application where users can put in their own information with tags. In such cases, a hybrid approach is the best solution: use a regular relational table with many columns for the common information, and an EAV table for optional information about each entity.
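A minimal sketch of that hybrid approach (the table and column names are illustrative, loosely following the question's Quote example):

-- Common, well-typed attributes live in an ordinary relational table.
CREATE TABLE quote (
    quote_id INT PRIMARY KEY,
    make     VARCHAR(50),
    model    VARCHAR(50),
    months   INT,
    miles    INT,
    cost     DECIMAL(10, 2)
);

-- Rare or user-defined attributes go in an EAV side table.
CREATE TABLE quote_attribute (
    quote_id   INT NOT NULL REFERENCES quote (quote_id),
    attr_name  VARCHAR(50) NOT NULL,
    attr_value VARCHAR(255),          -- everything stored as a string
    PRIMARY KEY (quote_id, attr_name)
);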
Source: Wikipedia
Column-oriented organizations are more efficient when an aggregate needs to be computed over many rows but only for a notably smaller subset of all columns of data, because reading that smaller subset of data can be faster than reading all data.
Column-oriented organizations are more efficient when new values of a column are supplied for all rows at once, because that column data can be written efficiently and replace old column data without touching any other columns for the rows.
Row-oriented organizations are more efficient when many columns of a single row are required at the same time, and when row-size is relatively small, as the entire row can be retrieved with a single disk seek.
Row-oriented organizations are more efficient when writing a new row if all of the column data is supplied at the same time, as the entire row can be written with a single disk seek.
In practice, row-oriented storage layouts are well-suited for OLTP-like workloads which are more heavily loaded with interactive transactions. Column-oriented storage layouts are well-suited for OLAP-like workloads (e.g., data warehouses) which typically involve a smaller number of highly complex queries over all data (possibly terabytes).
In addition to the problems Gordon Linoff mentions, EAV data models are also fiendishly hard to query - "find all cars where the make is BMW, the months are between 12 and 24, and the cost is < 10000" becomes a huge jumble of nasty SQL, especially if you're doing string comparisons on numbers...
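To see why, here is roughly what that query turns into against a pure EAV table (a sketch with a hypothetical attrs table; note the one self-join per attribute and the string-to-number casts, which fail if any stored value isn't numeric):

SELECT make.entity_id
FROM   attrs make
JOIN   attrs months ON months.entity_id = make.entity_id
JOIN   attrs cost   ON cost.entity_id   = make.entity_id
WHERE  make.attr_name   = 'Make'  AND make.attr_value = 'BMW'
AND    months.attr_name = 'Month' AND CAST(months.attr_value AS INTEGER) BETWEEN 12 AND 24
AND    cost.attr_name   = 'Cost'  AND CAST(cost.attr_value AS DECIMAL(10,2)) < 10000;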
Generally, row-oriented and column-oriented refer to the storage mechanism at the low (disk) level. How good each storage is depends on your requirements: in some scenarios column-oriented storage performs better, and in some scenarios row-oriented does.
In the HBase database, the related concept of a column family, which is a group of columns, is used.
The difference is that a row-oriented store lays out the logical table one row per row block, whereas a column-oriented store lays out one column per column block.
Row-oriented storage performs poorly when we fire analytical queries (like the sum or average of salaries) but works fine when we need to access the individual details of a row or to insert a new record. Column-oriented storage is the reverse: it works well for analytical queries but performs poorly when inserting an individual record or accessing all the details of a row.
You can visit this link, which describes different scenarios with their pros and cons, examples, and a summary of the differences:
http://geekrandomstuff.blogspot.tw/2014/04/row-oriented-database-vs-column.html
From my experience, EAV is great for storing application settings, i.e. relatively static data without any further need for joining and transforming data - nothing more than that.

Is there an efficient hierarchy method for storing a large tree in a SQL table?

I have a table with well over 5 million rows that contains hierarchical data (~20 levels). The table is growing exponentially every year, and the recursive method for CRUD operations on the table is becoming slow. The table receives a high volume of updates, reads and deletes. Does anyone know of any data models that would be suitable to replace the current Adjacency List model, or what steps, if any, would speed up the table?
Have you looked at the hierarchyid data type, which is available in SQL Server 2008 onwards?
http://technet.microsoft.com/en-us/library/bb677290.aspx
There's a good section on its use in this free e-book from MS Press:
http://blogs.msdn.com/b/microsoft_press/archive/2009/11/16/free-e-book-introducing-microsoft-sql-server-2008.aspx
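A small sketch of hierarchyid in use (T-SQL; the table and column names are illustrative):

CREATE TABLE org (
    node  hierarchyid PRIMARY KEY,
    level AS node.GetLevel(),        -- computed depth, handy for indexing
    name  NVARCHAR(100)
);

-- All descendants of a given node, without recursive CTEs.
DECLARE @mgr hierarchyid =
    (SELECT node FROM org WHERE name = N'Smith');

SELECT name
FROM org
WHERE node.IsDescendantOf(@mgr) = 1;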
Five million rows is nothing.
There is a difference between a well-designed Adjacency List model and a badly designed one. If you post your DDL, maybe we could improve it, rather than you throwing out the whole concept because the implementation is poor.
In any case, I would not implement a tree structure or a hierarchy in a Relational database using such a model. I have used the following (ignore the History), hundreds of times, and it is very fast. If you provide the DDL for the table and all indices, I can provide a model specifically for it.
Data Model
▶Tree Structure Data Model◀
Readers who are unfamiliar with the Relational Modelling Standard may find ▶IDEF1X Notation◀ useful.
Maybe a hierarchical or graph database would be a better choice. SQL isn't always the answer - that's why NoSQL is a viable niche.