Bigtable design and querying with respect to number of column families

From Cloud Bigtable schema design documentation:
Grouping data into column families allows you to retrieve data from a single family, or multiple families, rather than retrieving all of the data in each row. Group data as closely as you can to get just the information that you need, but no more, in your most frequent API calls.
In my use case, I can group all the data into one single column family (currently the access pattern is to retrieve all fields), or group them into, say, 3 logical column families and specify these column families every time I query. Is there any performance difference between these two designs? Which design is recommended?

In your case, there isn't a performance advantage either way. I would use the 3 logical column families so that you have cleaner code.
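For illustration, here is a minimal sketch of reading just one of those families with the Cloud Bigtable Python client; the project, instance, table, row key, and family names are hypothetical:

```python
from google.cloud import bigtable
from google.cloud.bigtable import row_filters

# Hypothetical project, instance, and table names.
client = bigtable.Client(project="my-project")
table = client.instance("my-instance").table("my-table")

# Fetch only the "profile" family instead of the entire row.
row = table.read_row(
    b"user#123",
    filter_=row_filters.FamilyNameRegexFilter("profile"),
)
if row is not None:
    for qualifier, cells in row.cells["profile"].items():
        print(qualifier, cells[0].value)
```

With a single family you would simply drop the filter; the call shape is the same either way, which is why the choice is mostly about code clarity.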

Related

Multiple column selection in columnar database

I am just coming to understand the difference between row-based and column-based databases. I know their benefits, but I have a few questions.
Let's say I have a table with 3 columns - col1, col2 and col3. I want to fetch all col2, col3 pairs where col3 matches particular value. Let's say column values are stored in disk like below.
Block1 = col1
Block2,Block3 = col2
Block4 = col3
My understanding is that the column value along with row id information will be stored in a block, e.g. (Block4 -> apple:row_2, banana:row_1). Am I correct?
Are values in the block sorted by column value, e.g. (Block4 -> apple:row_2, banana:row_1 instead of Block4 -> banana:row_1, apple:row_2)? If not, how does filtering or searching work without compromising performance?
Assuming values in the block are sorted by column value, how will the corresponding col2 values be filtered based on the row ids fetched from Block4? Does that require a linear search?
The purpose of a columnar database is to improve performance for read queries by limiting the IO only to those columns used in the query. It does this by separating the columns into separate storage spaces.
A naive form of a columnar database would store one or a set of columns with a primary key and then use JOIN to bring together all the columns for a table. Columns that are not referenced would not be included.
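As a toy illustration of that naive form only (no real engine stores data exactly this way), each column can live in its own table keyed by a shared row id and be joined back together on demand; the table and value names below are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Naive "columnar" layout: one table per column, keyed by a shared row id.
conn.executescript("""
    CREATE TABLE col2 (row_id INTEGER PRIMARY KEY, value TEXT);
    CREATE TABLE col3 (row_id INTEGER PRIMARY KEY, value TEXT);
    INSERT INTO col2 VALUES (1, 'x'), (2, 'y');
    INSERT INTO col3 VALUES (1, 'banana'), (2, 'apple');
""")

# Reconstruct (col2, col3) pairs where col3 matches; col1 is never touched.
print(conn.execute("""
    SELECT col2.value, col3.value
    FROM col3 JOIN col2 USING (row_id)
    WHERE col3.value = 'apple'
""").fetchall())  # [('y', 'apple')]
```

Here col1 is never read, which is the IO saving that even the naive form already provides.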
However, databases that provide native support for columnar storage have much more sophisticated functionality than the naive example. Each columnar database stores data in its own way, so the answer depends on the particular database, which you haven't specified.
They might store "blocks" of values for a column and these blocks represent (in some way) a range of rows. So, if you are choosing 1 row from a billion row table, only the blocks with those rows need to be read.
Storing columns separately allows for enhanced functionality at the column level:
Compression. Values of the same data type can be compressed much more easily than rows, which mix values of different types.
Block statistics. Blocks can be summarized statistically -- such as min and max values -- which can facilitate filtering.
Secondary data structures. Indexes for instance can be used within blocks (and these might be akin to "sorting" the values, actually).
The cost of all this is that inserts are no longer simple, so ACID properties are trickier with a column orientation. Because such databases are often used for decision support queries, this may not be an important limitation.
The "rows" are determined -- essentially -- by row ids. However, the row ids may actually consist of multiple parts, such as a block id and a row-within-a-block. This allows the store to use, say, 4 bytes for each component but not be limited to 4 billion rows.
Reconstructing rows between different columns is obviously a critical piece of functionality for any such database. In the "naive" example, this is handled via JOIN algorithms. However, specialized data structures would clearly have more specific approaches. Storing the data essentially in "row order" would be a typical solution.
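To make the block-statistics point above concrete, here is a toy sketch in plain Python (dictionaries standing in for any real storage format); the column and value names are invented:

```python
# Toy single-column store: each block holds (row_id, value) pairs plus
# min/max summary statistics so whole blocks can be skipped while filtering.
col3_blocks = [
    {"min": "apple", "max": "cherry",
     "rows": [(1, "banana"), (2, "apple"), (3, "cherry")]},
    {"min": "fig", "max": "mango",
     "rows": [(4, "fig"), (5, "kiwi"), (6, "mango")]},
]

def row_ids_matching(blocks, wanted):
    ids = []
    for block in blocks:
        if not (block["min"] <= wanted <= block["max"]):
            continue  # the whole block is pruned without reading its rows
        ids.extend(rid for rid, value in block["rows"] if value == wanted)
    return ids

print(row_ids_matching(col3_blocks, "apple"))  # [2] -- only the first block is scanned
```

The matching row ids would then be used to pull the corresponding values from the col2 blocks, which is the row-reconstruction step described above.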

BigQuery Clustered Tables: How to Create Multiple Clusters

My BigQuery table is commonly queried with different combinations of "where" conditions across 1 or more common columns, say columns A, B, and C (in no particular order). Hence, I would like to add individual clusters for columns A, B, and C respectively.
How can I create multiple clusters for BigQuery tables? (Similar to how multiple indexes can be created on a traditional rdbms table)
Multiple clustering columns are allowed, but clustering is hierarchical: you cluster by a specific field, the data is then sub-clustered on the following field, and so on.
At the same time, clustering is only allowed for partitioned tables.
You can find the corresponding documentation here
Upon viewing some comments and pages, it appears that there is no way to have multiple independent clusters (unlike the multiple indexes that can be created on a traditional RDBMS table) on a single BigQuery table.
This is because clustering essentially just sorts the data blocks of the table, as per the docs:
When data is written to a clustered table by a query job or a load job, BigQuery sorts the data using the values in the clustering columns. These values are used to organize the data into multiple blocks in BigQuery storage. When you submit a query that contains a clause that filters data based on the clustering columns, BigQuery uses the sorted blocks to eliminate scans of unnecessary data.
Hence, there appears to be no way to apply a separate sort order for each independent cluster over the same set of data, so what I require appears to be impossible as of now.
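For completeness, here is a sketch of creating a single hierarchically clustered (and partitioned) table via DDL from the Python client; the dataset, table, and column names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials and project

# One clustering spec per table: blocks are sorted by A, then B within A,
# then C within B -- not three independent sort orders.
client.query("""
    CREATE TABLE mydataset.events
    (
      A STRING,
      B STRING,
      C STRING,
      event_ts TIMESTAMP
    )
    PARTITION BY DATE(event_ts)
    CLUSTER BY A, B, C
""").result()
```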

Why does BigTable have column families?

Why is BigTable structured as a two-level hierarchy of "family:qualifier"? Specifically, why is this enforced rather than just having columns and, say, recommending that users name their qualifiers "vertical:column"?
I am interested in whether or not enforcing this enables some engineering optimizations or if this is strictly a design thing.
There are a couple of advantages to family groups:
queries become simpler: you can fetch a whole group of column qualifiers by naming a single column family
Bigtable has a notion of "locality groups". Locality groups allow a family to be written to a separate file, which helps in situations where some column families are accessed less frequently than others. You can see some information about locality groups in this analysis of HBase vs. Bigtable.
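As one concrete example of the family boundary carrying its own configuration, here is a sketch of declaring two families with different garbage-collection policies in the Cloud Bigtable Python client. The project, instance, table, and family names are hypothetical, and locality groups themselves are managed internally rather than exposed through this API:

```python
import datetime

from google.cloud import bigtable
from google.cloud.bigtable import column_family

# Hypothetical project, instance, and table names; requires admin access.
client = bigtable.Client(project="my-project", admin=True)
table = client.instance("my-instance").table("my-table")

# Each family gets its own garbage-collection rule -- per-family settings
# are one visible way the family level carries configuration of its own.
table.create(column_families={
    "profile": column_family.MaxVersionsGCRule(1),
    "activity": column_family.MaxAgeGCRule(datetime.timedelta(days=30)),
})
```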

Cost for scanning nested\repeated data in bigquery

I know that the pricing for scanning data in BigQuery queries is determined by which columns are accessed.
How does this play out with nested and repeated data?
Is all the data for each record scanned? Or does it depend on which nodes are scanned? Are the columns in the nested data each considered a separate column?
Thanks for any clarifications on this.
This is a great question - the short answer is yes, BigQuery only processes the nested nodes you specify. A useful characteristic of BigQuery's columnar data storage is that each nested column can be represented as a separate column. If you don't explicitly SELECT them, the additional child fields in your nested record are not processed.
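One way to check this yourself is with a dry run, which reports the bytes a query would process without actually running it; the dataset, table, and field names below are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials and project

def bytes_for(sql):
    # Dry runs return total_bytes_processed without executing the query.
    cfg = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    return client.query(sql, job_config=cfg).total_bytes_processed

# Selecting one nested leaf should bill fewer bytes than the whole record.
print(bytes_for("SELECT payload.user_id FROM mydataset.events"))
print(bytes_for("SELECT payload FROM mydataset.events"))
```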

Database - (rows or records, columns or fields)?

In database terminology:
What is the difference between a row and a record?
Likewise, aren't columns and fields the same thing?
On Joe Celko's blog, The SQL Apprentice, I noticed that the banner suggests they are different things.
Row and record can arguably be considered as the same thing.
Fields and columns are different: a field is the intersection of a row and a column.
i.e. if your table has 10 rows and 10 columns, it has 100 fields.
When you create a table using DDL statements, you define columns (metadata).
When you add rows using DML statements, you define rows and their fields.
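A minimal sketch of that DDL/DML split, using SQLite purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: defines the columns (metadata only; no rows exist yet).
conn.execute("CREATE TABLE fruit (name TEXT, color TEXT)")

# DML: each INSERT adds a row, supplying one field per column.
conn.execute("INSERT INTO fruit VALUES ('banana', 'yellow')")
conn.execute("INSERT INTO fruit VALUES ('apple', 'red')")

print(conn.execute("SELECT * FROM fruit").fetchall())
# [('banana', 'yellow'), ('apple', 'red')] -> 2 rows x 2 columns = 4 fields
```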
In a broader sense, rows and columns refer to a matrix structure. When a database, not necessarily a relational one, stores matrix-structured data, it can borrow this terminology, though a more specific one may exist.
In relational databases, for example, a table is always a matrix, so each column in a table corresponds to a field in a record, and each row corresponds to a record: different concepts pointing to the same object.
A field can be present even in NoSQL databases, which often have a free schema (no fixed columns) and where each row can have a different number of fields.
Similarly, a record can be a complex value in non-relational databases: it can contain fields with multiple distinct values (not 1NF). A row (a tuple in relational algebra), by contrast, contains a single value for each field.
As stated in a previous answer to this question, row and record can arguably be used interchangeably.
Column and field can also arguably be used interchangeably. See the following article: Column (database)
Here's a quote (as of this writing), from the article mentioned above, which makes that point:
"The term field is often used interchangeably with column, although many consider it more correct to use field (or field value) to refer specifically to the single item that exists at the intersection between one row and one column."
Here's some additional background info which may be helpful:
During my IT career as an analyst and programmer, I've typically used the terms field and record, not column and row, in both programming and relational database contexts. I think that comes from the instruction that I received during my university studies, and the fact that I learned the basic data hierarchy of bit, byte, field, record, file, before learning about relational databases.
In researching this question, I found that it is common practice, and arguably correct, to use row and record interchangeably and to use column and field interchangeably. I was actually quite surprised, though, when my research indicated that row and column are preferred terms over record and field, in database terminology.
The terms Record and Field predate relational databases, dating from a time when computerized file systems ruled persistent storage, mainframes ruled the computing market, and DBAs/data analysts were called DPs (data processing specialists).
A file organized its data in a 2-D matrix form, where a piece of information was called a field (column) and a collection of related fields a record (row). Such a data file is similar to a table (though without standardized relationships governing the contents), so the terms used in the file-processing era were inherited. Technically, however, a row <> record and a column <> field.
--For more information: Database Systems: Design, Implementation & Management - Coronel (Chapter 1, Section 5)
Records and fields make up a database table. Rows and Columns are found in spreadsheets.