Why does BigTable have column families?

Why is BigTable structured as a two-level hierarchy of "family:qualifier"? Specifically, why is this enforced rather than just having columns and, say, recommending that users name their qualifiers "vertical:column"?
I am interested in whether enforcing this enables some engineering optimizations, or if it is strictly a design decision.

There are a couple of advantages to family groups:
Queries become simpler: a read can fetch a whole group of related column qualifiers by naming a single column family.
Bigtable has a notion of "locality groups". Locality groups allow a family to be written to a separate file, which helps in situations where some column families are accessed less frequently than others. You can see some information about locality groups in this analysis of HBase vs. Bigtable.
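To make the locality-group idea concrete, here is a minimal Python sketch (all names hypothetical; this is not the real Bigtable implementation) in which each column family is kept in its own store, so scanning one family never touches bytes belonging to another:

    # Hypothetical sketch: each column family acts as its own "locality
    # group", stored separately, so a scan of one family never reads the
    # others' data.
    class TinyTable:
        def __init__(self, families):
            # one store per family: row_key -> {qualifier: value}
            self.stores = {fam: {} for fam in families}

        def put(self, row_key, family, qualifier, value):
            self.stores[family].setdefault(row_key, {})[qualifier] = value

        def scan_family(self, family):
            # touches only this family's store, however wide the rows are
            return sorted(self.stores[family].items())

    t = TinyTable(["profile", "clicks"])
    t.put("user1", "profile", "name", "Ada")
    t.put("user1", "clicks", "2024-05-01", 3)
    print(t.scan_family("profile"))  # [('user1', {'name': 'Ada'})]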

Related

Bigtable design and querying with respect to number of column families

From Cloud Bigtable schema design documentation:
Grouping data into column families allows you to retrieve data from a single family, or multiple families, rather than retrieving all of the data in each row. Group data as closely as you can to get just the information that you need, but no more, in your most frequent API calls.
In my use case, I can group all the data into one single column family (currently the access pattern is to retrieve all fields), or group it into, say, 3 logical column families and specify those column families every time I query. Is there any performance difference between these two designs? Which design is recommended?
In your case, there isn't a performance advantage either way. I would use the 3 logical column families so that you have cleaner code.
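For illustration, here is a hedged sketch using the Python client for Cloud Bigtable (google-cloud-bigtable); the project, instance, table and family names are placeholders. It shows the "retrieve data from a single family" pattern the quoted documentation describes:

    from google.cloud import bigtable
    from google.cloud.bigtable import row_filters

    # Placeholder identifiers; substitute your own project/instance/table.
    client = bigtable.Client(project="my-project")
    instance = client.instance("my-instance")
    table = instance.table("my-table")

    # Read one row, but only the hypothetical 'stats' column family,
    # instead of pulling every family in the row.
    row = table.read_row(
        b"user#1234",
        filter_=row_filters.FamilyNameRegexFilter("stats"),
    )
    if row is not None:
        # row.cells maps family -> qualifier -> list of versioned cells
        print(row.cells["stats"])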

PostgreSQL -- put everything into one table?

So I'm trying to learn some basic database design principles and decided to download a copy of the sr27 database provided by the USDA. The database stores nutritional information on food, and statistical information on how these nutritional values were derived.
When I first started this project, my thoughts were: well, I want to be able to search for food names, and I will probably want to do some basic statistical modeling on the most common nutritional values like calories, proteins, fats, etc. So the thought was simple: just make 3 tables that look like this:
One table for food names
One table for common nutritional values (1-1 relationship with names)
One table for other nutritional values (1-1 relationship with names)
However, it's not clear that this is even necessary. Do you gain anything from partitioning the columns (or values) based on the idea: I like to do searches on names, so let's keep those in one table for less overhead, and I like to do calculations on common nutritional values, so let's keep those in another table? (Question 1) Or does proper indexing make this moot?
My next question is then: Why in the world did the USDA decide to use 12 tables? Is this considered good database design practice, or would they have been better off merging a lot of these tables? (this excerpt is taken from the PDF provided in the USDA link above, pg 29)
Do you gain anything from partitioning the columns (or values) based on the idea: I like to do searches on names, so let's keep those in one table for less overhead, and I like to do calculations on common nutritional values, so let's keep those in another table? (Question 1) Or does proper indexing make this moot?
If you just have a list of items and you want to summarize over only some of them, then indexing is the way to address performance, not arbitrarily splitting some columns into another table.
Also, do read up on Normalization.
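As a quick demonstration (column names made up for this example), here is a sqlite3 sketch showing that an index on the searched column, not a table split, is what serves the name lookups:

    import sqlite3

    con = sqlite3.connect(":memory:")
    # One wide table (hypothetical columns) instead of a split into
    # "names" and "nutrients" tables.
    con.execute("""CREATE TABLE food (
        id INTEGER PRIMARY KEY,
        long_name TEXT,
        calories REAL, protein REAL, fat REAL)""")
    con.execute("CREATE INDEX idx_food_name ON food (long_name)")
    con.execute("INSERT INTO food VALUES (1, 'Butter, salted', 717, 0.85, 81.11)")

    # The name search is served by the index; keeping the nutrient columns
    # in the same row costs this query nothing.
    for line in con.execute(
            "EXPLAIN QUERY PLAN SELECT calories FROM food "
            "WHERE long_name = 'Butter, salted'"):
        print(line)  # ... SEARCH food USING INDEX idx_food_name ...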
My next question is then: Why in the world did the USDA decide to use 12 tables? Is this considered good database design practice, or would they have been better off merging a lot of these tables? (this excerpt is taken from the PDF provided in the USDA link above, pg 29)
Probably because the types of questions they want to ask are not exactly the same ones you are trying to ask.
They clearly have more info about each food - like groups, nutrients, weights - and they are also apparently tracking where the source data comes from...
There are important rules for designing relational databases - normal forms - that eliminate certain anomalies and reduce I/O operations. This kind of design is usual for OLTP databases, and I have had occasion to see terribly slow databases built by developers with zero knowledge of it. Analytical (OLAP) databases are a little different: wide tables are common there, and some modern OLAP databases support them with column stores.
PostgreSQL is a classic row-store database, so putting everything in one table is neither common nor a good strategy. You can use views to capture typical, frequently used projections of the data, so the complex schema stays invisible (transparent) to you.
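For example, here is a small sqlite3 sketch (schema invented for illustration; the same idea works in PostgreSQL with CREATE VIEW) in which a view presents the normalized tables as the single wide shape the application likes to query:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE food (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE nutrient_value (
        food_id INTEGER REFERENCES food(id),
        nutrient TEXT,
        amount REAL);

    -- The view hides the join: applications query one "wide table".
    CREATE VIEW food_summary AS
    SELECT f.name,
           SUM(CASE WHEN nv.nutrient = 'calories' THEN nv.amount END) AS calories,
           SUM(CASE WHEN nv.nutrient = 'protein'  THEN nv.amount END) AS protein
    FROM food f JOIN nutrient_value nv ON nv.food_id = f.id
    GROUP BY f.id;
    """)
    con.execute("INSERT INTO food VALUES (1, 'Butter, salted')")
    con.executemany("INSERT INTO nutrient_value VALUES (1, ?, ?)",
                    [("calories", 717.0), ("protein", 0.85)])
    print(con.execute("SELECT * FROM food_summary").fetchall())
    # [('Butter, salted', 717.0, 0.85)]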

How can Datomic users cope without composite indexes?

In Datomic, how do you efficiently perform queries such as 'find all people living in Washington older than 50' (city and age may vary)? In relational databases and most NoSQL databases you use composite indexes for this purpose; Datomic, as far as I'm aware, does not support anything like this.
I have built several, say, medium-sized web apps, and not a single one would have performed quickly enough without composite indexes. How are Datomic users dealing with this? Or are they just working with datasets small enough not to suffer from it? Am I missing something?
This problem and its solution are not identical in Datomic due to the structure of data (datoms) in Datomic. There are two performance characteristics/strategies that may add some shading to this:
(1) When you fetch data in Datomic, you fetch an entire leaf segment from the index tree (not an individual item) - with segments being composed of potentially many thousands of datoms. This is then cached automatically so that you don't have to reach out over the network to get more datoms.
If you're querying a single person - i.e., a single entity - for their age and where they live, it's very likely that the query's navigation of the EAVT or AEVT indexes has already cached everything you need. You've effectively cached the datom, the path used to navigate to it, and related datoms (by locality in the index).
(2) Partitions provide a manual means to specify locality of reference. The partition is encoded in the high bits of the entity ID, which ensures that related entities sort near each other in the indexes. So, as an alternative implementation of the problem above, if you needed information from both the city and person entities, you could place them in the same partition.
I've written a library to handle this: https://github.com/arohner/datomic-compound-index
Update 2019-06-28: Since 0.9.5927 (Datomic On-Prem) / 480-8770 (Datomic Cloud), Datomic supports Tuples as a new Attribute Type, which allows you to have compound indexes.
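For readers who want the intuition rather than the Datomic specifics, here is a language-agnostic Python sketch (made-up data) of what a composite (city, age) index buys: all matching entities become one contiguous, binary-searchable run.

    import bisect

    # A composite index is just (city, age, entity-id) tuples kept sorted.
    index = sorted([
        ("Seattle",    34, 101),
        ("Washington", 47, 102),
        ("Washington", 52, 103),
        ("Washington", 68, 104),
        ("Spokane",    71, 105),
    ])

    def people_in_city_older_than(city, age):
        # all entries for `city` with a greater age form one contiguous slice
        lo = bisect.bisect_right(index, (city, age, float("inf")))
        hi = bisect.bisect_right(index, (city, float("inf"), float("inf")))
        return [eid for _, _, eid in index[lo:hi]]

    print(people_in_city_older_than("Washington", 50))  # [103, 104]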

What is atomicity in DBMS?

I was reading about the first normal form (1NF) in DBMS. There was a sentence as follows:
"Every column should be atomic."
Can anyone please explain it to me thoroughly with an example?
Re "atomic"
In Codd's original 1969 and 1970 papers he defined relations as having a value for every attribute in a row. The value could be anything, including a relation. This used no notion of "atomic". He explained that "atomic" meant not relation-valued (ie not table-valued):
So far, we have discussed examples of relations which are defined on simple domains--domains whose elements are atomic (nondecomposable) values. Nonatomic values can be discussed within the relational framework. Thus, some domains may have relations as elements.
He used "simple", "atomic" and "nondecomposable" as informal expository notions. He understood that a relation has rows of which each column has an associated name and value; attributes are by definition "single-valued"; the value is of any type. The only structural property that matters relationally is being a relation. It is also just a value, but you can query it relationally. Then he used "nonsimple" etc meaning relation-valued.
By the time of Codd's 1990 book The Relational Model for Database Management: Version 2:
From a database perspective, data can be classified into two types: atomic and compound. Atomic data cannot be decomposed into smaller pieces by the DBMS (excluding certain special functions). Compound data, consisting of structured combinations of atomic data, can be decomposed by the DBMS.
In the relational model there is only one type of compound data: the relation. The values in the domains on which each relation is defined are required to be atomic with respect to the DBMS. A relational database is a collection of relations of assorted degrees. All of the query and manipulative operators are upon relations, and all of them generate relations as results. Why focus on just one type of compound data? The main reason is that any additional types of compound data add complexity without adding power.
"In the relational model there is only one type of compound data: the relation."
Sadly, "atomic = non-relation" is not what you're going to hear. (Unfortunately Codd was not the clearest writer and his expository remarks get confused with his bottom line.) Virtually all presentations of the relational model get no further than what was for Codd merely a stepping stone. They promote an unhelpful confused fuzzy notion canonicalized/canonized as "atomic" determining "normalized". Sometimes they wrongly use it to define realtion. Whereas Codd used everyday "nonatomic" to introduce defining relational "nonatomic" as relation-valued and defined "normalized" as free of relation-valued domains.
(Neither is "not a repeating group" helpful as "atomic", defining it as not something that is not even a relational notion. And sure enough in 1970 Codd says "terms attribute and repeating group in present database terminology are roughly analogous to simple domain and nonsimple domain, respectively".)
Eg: This misinterpretation was promoted for a long time, from early on, by Chris Date, honourable early relational explicator and proselytizer, primarily in his seminal, still-current book An Introduction to Database Systems, which now (2004, 8th edition) thankfully presents the helpful relationally-oriented extended notion of distinguishing relation, row and "scalar" (non-relation, non-row) domains:
This definition merely states that all [relation variables] are in 1NF
Eg: Maier's classic The Theory of Relational Databases (1983):
The definition of atomic is hazy; a value that is atomic in one application could be non-atomic in another. For a general guideline, a value is non-atomic if the application deals with only a part of the value.
Eg: The Atomicity section of the current Wikipedia article on First Normal Form actually quotes from the introductory passages above, and then ignores their precise meaning. (Then it says something unintelligible about when the nonatomic turtles should stop.):
Codd states that the "values in the domains on which each relation is defined are required to be atomic with respect to the DBMS." Codd defines an atomic value as one that "cannot be decomposed into smaller pieces by the DBMS (excluding certain special functions)" meaning a field should not be divided into parts with more than one kind of data in it such that what one part means to the DBMS depends on another part of the same field.
Re "normalized" and "1NF"
When Codd used "normalize" in 1970, he meant eliminate relation-valued ("non-simple") domains from a relational database:
For this reason (and others to be cited below) the possibility of eliminating nonsimple domains appears worth investigating. There is, in fact, a very simple elimination procedure, which we shall call normalization.
Later the notion of "higher NFs" (involving FDs (functional dependencies) & then JDs (join dependencies)) arose and "normalize" took on a different meaning. Since Codd's original normalization paper, normalization theory has always given results relevant to all relations, not just those in Codd's 1NF. So one can "normalize" in the original sense of going from just relations to a "normalized" "1NF" without relation-valued columns. And one can "normalize" in the normalization-theory sense of going from a just-relations "1NF" to higher NFs while ignoring whether domains are relations. And "normalization" is commonly also used for the "hazy" notion of eliminating values with "parts". And "normalization" is also wrongly used for designing a relational version of a non-relational database (whether just relations and/or some other sense of "1NF").
Relational spirit is to eschew multiple columns with the same meaning or domains with interesting parts in favour of another base table. But we must always come to an informal ergonomic decision about when to stop representing parts and just treat a column as "atomic" (non-relation-valued) vs "nonatomic" (relation-valued).
Normalization in database management system
Atomicity and 1NF... this is not about atomic transactions, but about the definition of columns and their content.
"Atomic" means "cannot be divided or split in smaller parts". Applied to 1NF this means that a column should not contain more than one value. It should not compose or combine values that have a meaning of their own.
This typically concerns two very common mistakes made by database designers:
1. multiple values in one column (list columns)
columns that contain a list of values, typically space- or comma-separated, like this blog post table:
id  title      date_posted  content  tags
1   new idea   2014-05-23   ...      tag1,tag2,tag3
2   why this?  2014-05-24   ...      tag2,tag5
3   towel day  2014-05-26   ...      tag42
or this contacts table:
id  room  phones
4   432   111-111-111 222-222-222
5   456   999-999-999
6   512   888-888-8888 333-3333-3333
This type of denormalization is rare, as most database designers see that it cannot be a good thing. But you do find tables like this. They usually come from modifications to an existing database, where it may seem simpler to widen a column and stuff multiple values into it than to add a properly normalized related table (which often breaks existing applications).
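A minimal sqlite3 sketch of the fix (table and column names invented for this example): split the list column into a child table, so membership tests become plain equality instead of string matching.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE post (id INTEGER PRIMARY KEY, title TEXT, tags TEXT);
    CREATE TABLE post_tag (post_id INTEGER REFERENCES post(id), tag TEXT);
    """)
    con.execute("INSERT INTO post VALUES (1, 'new idea', 'tag1,tag2,tag3')")

    # Split the comma-separated column into one row per value.
    for post_id, tags in con.execute("SELECT id, tags FROM post").fetchall():
        con.executemany("INSERT INTO post_tag VALUES (?, ?)",
                        [(post_id, t) for t in tags.split(",")])

    # 'Which posts have tag2?' is now equality, not a LIKE '%tag2%' hack
    # (which would also wrongly match 'tag25').
    print(con.execute(
        "SELECT post_id FROM post_tag WHERE tag = 'tag2'").fetchall())  # [(1,)]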
2. complex multi-part columns
In this case one column contains different bits of information and could maybe be designed as a set of separate columns.
Typical examples are fullname and address columns:
id  fullname        address
1   Mark Tomers     56 Tomato Road
2   Fred Askalong   3277 Hadley Drive
3   May Anne Brice  225 Century Avenue - apartment 43/a
These types of denormalization are very common, as it is quite difficult to draw the line between what is atomic and what is not. Depending on the application, a multi-part column could very well be the best solution in some cases. It is less structured, but simpler.
Structuring an address in many atomic columns may mean having more complex code to handle results for output. Another complexity comes from the structure not being adequate to fit all types of addresses. Using one single VARCHAR column does not pose this problem, but may pose others... typically about searching and sorting.
An extreme case of multi-part columns are dates and times. Most RDBMSs provide date and time data types, along with functions to handle date and time arithmetic and the extraction of the various parts (month, hour, etc...). Few people would consider it convenient to have separate year, month and day columns in a relational database. But I've seen it... and with good reason: the use case was birthdates for a justice department database. They had to handle many immigrants with few or no documents. Sometimes you just knew a person was born in a certain year, but you would not know the day or month of birth. You can't handle that type of information with a single date column.
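A small sqlite3 sketch of that justice-department case (names invented): three nullable columns can say "born in 1972, month and day unknown", which a single DATE column cannot.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("""CREATE TABLE person (
        id INTEGER PRIMARY KEY,
        name TEXT,
        birth_year INTEGER,   -- sometimes all we know
        birth_month INTEGER,  -- NULL when unknown
        birth_day INTEGER     -- NULL when unknown
    )""")
    con.execute("INSERT INTO person VALUES (1, 'A. Migrant', 1972, NULL, NULL)")
    con.execute("INSERT INTO person VALUES (2, 'B. Documented', 1972, 5, 14)")

    # 'Everyone born in 1972' works whether or not month/day are known;
    # a single DATE column would force us to invent a month and a day.
    print(con.execute(
        "SELECT name FROM person WHERE birth_year = 1972").fetchall())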
"Every column should be atomic."
Chris Date says, "Please note very carefully that it is not just simple things like the integer 3 that are legitimate values. On the contrary, values can be arbitrarily complex; for example, a value might be a geometric point, or a polygon, or an X ray, or an XML document, or a fingerprint, or an array, or a stack, or a list, or a relation (and so on)."[1]
He also says, "A relvar is in 1NF if and only if, in every legal value of that relvar, every tuple contains exactly one value for each attribute."[2]
He generally discourages the use of the word atomic, because it has confusing connotations. Single value is probably a better term to use.
For example, a date like '2014-01-01' is a single value. It's not indivisible; on the contrary, it quite clearly is divisible. But the dbms does one of two things with single values that have parts. The dbms either returns those values as a whole, or the dbms provides functions to manipulate the parts. (Clients don't have to write code to manipulate the parts.)[3]
In the case of dates, SQL can
return dates as a whole (SELECT CURRENT_DATE),
return one or more parts of a date (EXTRACT(YEAR FROM CURRENT_DATE)),
add and subtract intervals (CURRENT_DATE + INTERVAL '1' DAY),
subtract one date from another (CURRENT_DATE - DATE '2014-01-01'),
and so on. In this (narrow) respect, SQL is quite relational.
[1] An Introduction to Database Systems, 8th ed., p. 113. Emphasis in the original.
[2] Ibid., p. 358.
[3] In the case of a "user-defined" type, the "user" is presumed to be a database programmer, not a client of the database.
It means a column should not contain multiple values (like comma-separated values). See the link below:
http://www.studytonight.com/dbms/database-normalization.php

What is a columnar database?

I have been working with warehousing for a while now.
I am intrigued by Columnar Databases and the speed that they have to offer for data retrievals.
I have a multi-part question:
How do Columnar Databases work?
How do they differ from relational databases?
How do columnar databases work?
The defining concept of a column-store is that the values of a table are stored contiguously by column. Thus the classic supplier table from CJ Date's supplier and parts database:
SNO  STATUS  CITY    SNAME
---  ------  ------  -----
S1   20      London  Smith
S2   10      Paris   Jones
S3   30      Paris   Blake
S4   20      London  Clark
S5   30      Athens  Adams
would be stored on disk or in memory something like:
S1S2S3S4S5;2010302030;LondonParisParisLondonAthens;SmithJonesBlakeClarkAdams
This is in contrast to a traditional rowstore which would store the data more like this:
S120LondonSmith;S210ParisJones;S330ParisBlake;S420LondonClark;S530AthensAdams
From this simple concept flows all of the fundamental differences in performance, for better or worse, between a column-store and a row-store. For example, a column store will excel at doing aggregations like totals and averages, but inserting a single row can be expensive, while the inverse holds true for row-stores. This should be apparent from the above diagram.
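The trade-off can be sketched in a few lines of Python, using in-memory stand-ins (a list of dicts vs. a dict of lists) for the two layouts above:

    # Row store: the values of one record are contiguous.
    row_store = [
        {"sno": "S1", "status": 20, "city": "London", "sname": "Smith"},
        {"sno": "S2", "status": 10, "city": "Paris",  "sname": "Jones"},
        {"sno": "S3", "status": 30, "city": "Paris",  "sname": "Blake"},
    ]

    # Column store: the values of one column are contiguous.
    column_store = {
        "sno":    ["S1", "S2", "S3"],
        "status": [20, 10, 30],
        "city":   ["London", "Paris", "Paris"],
        "sname":  ["Smith", "Jones", "Blake"],
    }

    # Aggregating one column scans a single contiguous sequence here...
    print(sum(column_store["status"]) / len(column_store["status"]))  # 20.0

    # ...but inserting a row must touch every column, whereas the row
    # store appends a single record.
    row_store.append({"sno": "S4", "status": 20, "city": "London",
                      "sname": "Clark"})
    for col, val in [("sno", "S4"), ("status", 20),
                     ("city", "London"), ("sname", "Clark")]:
        column_store[col].append(val)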
How do they differ from relational databases?
A relational database is a logical concept. A columnar database, or column-store, is a physical concept. Thus the two terms are not comparable in any meaningful way. Column-oriented DBMSs may be relational or not, just as row-oriented DBMSs may adhere more or less to relational principles.
How do Columnar Databases work?
A columnar database is a concept rather than a particular architecture/implementation. In other words, there isn't one particular description of how these databases work; indeed, several are built upon a traditional, row-oriented DBMS, simply storing the information in tables with one (or, rather often, two) columns (and adding the necessary layer to access the columnar data in a convenient fashion).
How do they differ from relational databases?
They generally differ from traditional (row-oriented) databases with regard to ...
performance...
storage requirements ...
ease of modification of the schema ...
...in specific use cases of DBMSes.
In particular, they offer advantages in the areas mentioned when the typical use is to compute aggregate values over a limited number of columns, as opposed to trying to retrieve all or most columns for a given entity.
Is there a trial version of a columnar database I can install to play around? (I am on Windows 7)
Yes, there are commercial, free and also open-source implementations of columnar databases. See the list at the end of the Wikipedia article for starters.
Beware that several of these implementations were introduced to address a particular need (say, a very small footprint, highly compressible distribution of data, or sparse matrix emulation) rather than to provide a general-purpose column-oriented DBMS per se.
Note: The remark about the "single purpose orientation" of several columnar DBMSs is not a critique of these implementations, but rather an additional indication that such an approach to DBMSs strays from the more "natural" (and certainly more broadly used) approach to storing record entities. As a result, this approach is used when the row-oriented approach isn't satisfactory, and it therefore tends to:
a) be targeted at a particular purpose
b) receive fewer resources and less interest than work on the "general purpose", "tried and tested" tabular approach.
Tentatively, the Entity-Attribute-Value (EAV) data model may be an alternative storage strategy you want to consider. Although distinct from the "pure" columnar DB model, EAV shares several of the characteristics of columnar DBs.
I would say the best way to understand column-oriented databases is to look at HBase (Apache HBase). You can check out the code and explore further to find out about the implementation.
Also, columnar DBs have a built-in affinity for data compression, and the loading process is unique. Here's an article I wrote in 2008 that explains a bit more.
You may also be interested in a new report from IDC's Carl Olofson on 3rd generation DBMS technology. It discusses columnar, et al. If you're not an IDC client you can get it free on our site. He's doing a webinar on June 16th, too (also on our site).
(BTW, one comment above lists asterdata but I don't think they are columnar.)
To understand what is column oriented database, it is better to contrast it with row oriented database.
Row-oriented databases (e.g. MS SQL Server and SQLite) are designed to efficiently return data for an entire row. They do this by storing all the column values of a row together. Row-oriented databases are well-suited for OLTP systems (e.g., retail sales and financial transaction systems).
Column-oriented databases are designed to efficiently return data for a limited number of columns. They do this by storing all of the values of a column together. Two widely used column-oriented databases are Apache HBase and Google BigTable (used by Google for its Search, Analytics, Maps and Gmail). They are suitable for big data projects. A column-oriented database will excel at read operations on a limited number of columns; however, write operations will be expensive compared to row-oriented databases.
For more: https://en.wikipedia.org/wiki/Column-oriented_DBMS
Product information. This may help. These were the featured products in a Google search.
http://www.vertica.com/
http://www.paraccel.com/
http://www.asterdata.com/index.php
kx is another columnar database, used in the financial sector, for example. The license was somewhere around $50K last time I checked, though. No optimisation needed, no index needed, because kx has powerful operators (matlab equivalents: .*, kron, bsxfun, ...).