Designing a Cloud BigTable: Millions of Rows X Millions of Columns? - bigtable

I'm wondering if the following table design for BigTable is legit. From what I read, having millions of sparse columns should work, but would it work well?
The idea is to keep time-based "samples" in columns (each a few KB). I expect to have millions of rows, where each would have a limited number of entries (~10-50) as values in the table. Each column in the table represents a timespan of (say) 10 seconds, and since there are roughly 2.6 million seconds in a month (~260K columns), a year would take about 3M columns. I intend to use row scans to fetch rows by prefix - usually just a handful of rows per fetch.
So, to sum up:
the table will contain (a million rows X 50 samples per row, each a few KB): ~50M items;
but the table's dimensions are (a million rows X millions of columns): on the order of a trillion cells.
Now, I know that empty cells don't take space and the whole "table" metaphor isn't really apt for Bigtable, but I'm still wondering: does the above represent a valid use-case for Bigtable?

Based on the Google docs, Cloud Bigtable is a sparsely populated table that can scale to billions of rows and thousands of columns. Regarding the limits on rows and columns: Cloud Bigtable rows can be big but are not infinite. A row can contain ~100 column families and millions of columns, but the recommendation is to keep each row under 100 MB and each column value under 10 MB.
Therefore, in Bigtable the limit on the data within a table is based on data size rather than on the number of columns or rows (except for the "column families per table" limit). I believe your use case is valid: you can have millions of rows and columns as long as the values stay within the hard limits. As a best practice, design your schema to keep the size of each row and each value within those recommended limits.

Related

Hive join query optimisation

Table A
---------
col1, col2, Adate, qty
Table B
---------
col2, cost, Bdate
The table sizes are as follows:
A: 1 million rows
B: 700k rows
Consider this query:
SELECT
A.col1,
A.col2,
B.Bdate bdate,
SUM(qty)*COLLECT_LIST(cost)[0] price
FROM A
JOIN B
ON (A.col2 = B.col2 AND A.Adate <= B.Bdate)
GROUP BY
A.col1,
A.col2,
B.bdate;
The above Hive query takes more than 3 hrs on a cluster of 4 slaves (8 GB memory, 100 GB disk) and 1 master (16 GB memory, 100 GB disk).
Can this query be optimized? If yes, where can the optimization be possible?
Use Tez and mapjoin.
set hive.auto.convert.join=true; -- this enables map join
set hive.mapjoin.smalltable.filesize=25000000; -- adjust so the smaller table fits in memory
set hive.execution.engine=tez;
Also this computation is not memory-efficient:
SUM(qty)*COLLECT_LIST(cost)[0] price
COLLECT_LIST gathers all cost values in the group into an array that is non-unique (it contains values from ALL rows in the group) and unordered (yes, unordered, because there is no distribute + sort before collect_list). This array can be quite large (the number of elements equals the number of rows in the group), depending on your data, and then you take element [0], which means you are picking an essentially arbitrary cost from the group. Does it make sense to build a whole array just to get an arbitrary element? If it does not matter which cost is taken, then min(cost), max(cost), or some other scalar aggregate will consume far less memory. You can also use the first_value analytic function (it may require a sub-query, but it will also be memory-efficient).
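For illustration only, here is a sketch of the same query with COLLECT_LIST(cost)[0] replaced by MAX(cost); the table and column names come from the question, and it assumes any cost from the group is acceptable:
SELECT
    A.col1,
    A.col2,
    B.Bdate bdate,
    SUM(A.qty) * MAX(B.cost) price -- scalar aggregate instead of COLLECT_LIST(cost)[0]
FROM A
JOIN B
ON (A.col2 = B.col2 AND A.Adate <= B.Bdate)
GROUP BY
    A.col1,
    A.col2,
    B.Bdate;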
I will try to give you some advice to improve query performance in Hive.
Check the execution engine you are using
set hive.execution.engine;
If your execution engine is mr (plain MapReduce), you may be able to use Apache Spark or Apache Tez instead, both of which are faster than MapReduce.
set hive.execution.engine=tez;
Join queries are computationally expensive and can be slow, especially when you’re joining three or more tables, or if you’re working with very large data.
One strategy that can be used to remedy this problem is to join the data in advance and store the pre-joined result in a separate table, which you can then query.
This is one way of denormalizing a normalized database to make it easier to run analytic queries.
This approach of pre-joining tables has some costs, but it can make analytic queries easier to write and faster to run.
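As a rough sketch only (the name pre_joined_ab is hypothetical), the join above could be materialized once and then queried many times:
CREATE TABLE pre_joined_ab AS
SELECT
    A.col1,
    A.col2,
    A.Adate,
    A.qty,
    B.Bdate,
    B.cost
FROM A
JOIN B
ON (A.col2 = B.col2 AND A.Adate <= B.Bdate);

-- Analytic queries then read the pre-joined table instead of repeating the join
SELECT col1, col2, Bdate, SUM(qty) * MAX(cost) price
FROM pre_joined_ab
GROUP BY col1, col2, Bdate;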
There are some other techniques for improving Hive query performance:
Join table ordering (Largest table last)
As with any type of tuning, it is important to understand the internal working of a system. When Hive executes a join,
it needs to select which table is streamed and which table is cached.
Hive takes the last table in the JOIN statement for streaming, so we need to ensure that this streamed table is the larger of the two.
A: 1 million rows, B: 700k rows
Hence, when these two tables are joined it is important that the larger table comes last in the query.
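Two ways to express that, sketched with the question's tables (the STREAMTABLE hint is the explicit alternative to reordering):
-- Option 1: put the larger table (A) last so Hive streams it
SELECT A.col1, A.col2, A.qty, B.cost
FROM B
JOIN A
ON (A.col2 = B.col2);

-- Option 2: keep the original order and tell Hive which table to stream
SELECT /*+ STREAMTABLE(A) */ A.col1, A.col2, A.qty, B.cost
FROM A
JOIN B
ON (A.col2 = B.col2);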
Bucketing
Bucketing stores data in separate files, not separate subdirectories like partitioning.
It divides the data in an effectively random way, not in a predictable way like partitioning.
When records are inserted into a bucketed table, Hive computes hash codes of the values in the specified bucketing column and uses these hash codes to divide the records into buckets.
For this reason, bucketing is sometimes called hash partitioning.
The goal of bucketing is to distribute records evenly across a predefined number of buckets.
Bucketing can improve the performance of joins if all the joined tables are bucketed on the join key column.
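A minimal sketch, assuming both tables are rebuilt bucketed on the join key col2; the 64-bucket count, the _bucketed names, and the column types are assumptions:
CREATE TABLE A_bucketed (col1 STRING, col2 STRING, Adate DATE, qty DOUBLE)
CLUSTERED BY (col2) INTO 64 BUCKETS;

CREATE TABLE B_bucketed (col2 STRING, cost DOUBLE, Bdate DATE)
CLUSTERED BY (col2) INTO 64 BUCKETS;

set hive.enforce.bucketing=true; -- needed on older Hive versions when populating bucketed tables
INSERT OVERWRITE TABLE A_bucketed SELECT * FROM A;
INSERT OVERWRITE TABLE B_bucketed SELECT * FROM B;

set hive.optimize.bucketmapjoin=true; -- let the join take advantage of the buckets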
For more on bucketing, see the Hive Language Manual page describing bucketed tables (BucketedTables), and bucketing-in-hive.
Partitioning
Partitioning is a way of dividing a table into related parts based on the values of particular columns like date, city, and department.
Each table in Hive can have one or more partition keys to identify a particular partition.
Using partitions, it is easy to run queries on slices of the data.
For more on partitioning, see apache-hive-partitions.
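A sketch for illustration only; partitioning A by Adate and the column types are assumptions, and a very high-cardinality date column is not always a good partition key:
CREATE TABLE A_partitioned (col1 STRING, col2 STRING, qty DOUBLE)
PARTITIONED BY (Adate DATE);

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE A_partitioned PARTITION (Adate)
SELECT col1, col2, qty, Adate FROM A; -- partition column goes last in the SELECT

-- Queries that filter on the partition column read only the matching partitions
SELECT * FROM A_partitioned WHERE Adate >= DATE '2019-01-01';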

Return First Row For Given Value in a Column - BigQuery

I have a very large table that has a column which holds a custom ID of string type for each row. For each ID, there are 50 properties in that table. This ID is guaranteed to be unique in the table.
My main task is to get those 50 properties in the row for a given ID.
When I run a normal query like the one below, it takes 5 sec to scan only 1 million rows.
SELECT * FROM `mytable` WHERE id='123'
As per my understanding, BigQuery does a parallel search for a match after partitioning the rows into different clusters. And I believe for a given ID value it will check all the rows in all the different clusters, so even if a match is found in one cluster, the other clusters will continue looking for other matches.
But as the values in the ID column are unique here, can we somehow "break" the jobs running on the other clusters as soon as a match is found in one cluster, and return the row?
I hope this will speed up the query run time.
Also, in the future, this table will grow to really large so if this can be done it will really be helpful for my purpose.
Any suggestions are welcome.
You can use the recently introduced Clustered Tables.
This will allow you to bring down cost and improve performance.
Please note: currently clustering is supported for partitioned tables only, but support for clustering non-partitioned tables is under development.
If your table is partitioned, you can just cluster it by id and you are done.
If not, you can introduce a 'fake' date field and partition by it, so that clustering becomes available for that table.
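A sketch of that workaround using Standard SQL DDL; the dataset/table names and the fake_date value are hypothetical:
CREATE TABLE `mydataset.mytable_clustered`
PARTITION BY fake_date
CLUSTER BY id
AS
SELECT t.*, DATE '2000-01-01' AS fake_date -- constant 'fake' partition date
FROM `mydataset.mytable` t;

-- Point lookups on id can then prune data by the clustering key
SELECT * FROM `mydataset.mytable_clustered` WHERE id = '123';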
Meanwhile, if you are just interested in one row for a given id, try the query below:
SELECT * FROM mytable WHERE id='123' LIMIT 1

which one is faster: to query with criteria in one shot or to subset a large table into a smaller table and then apply criteria

I have a large table (TB size, ~10 billion rows, ~100 million IDs).
I want to run a query to get counts for some specific IDs (say, 100k IDs). The list of needed IDs is in another table.
I know that I can run a join query to get the results, but it is extremely time-consuming (~5 days of processing).
I am wondering: if I break the script into 2 phases (1: subset the whole table based on just the IDs, 2: apply the selection criteria to the subset table), will it improve the processing time?
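For illustration, a minimal sketch of the two-phase approach described in the question; big_table, ids_table, id_subset, and the criteria filter are all hypothetical names:
-- Phase 1: keep only the rows whose ID appears in the list of needed IDs
CREATE TABLE id_subset AS
SELECT t.*
FROM big_table t
JOIN ids_table i
ON t.id = i.id;

-- Phase 2: apply the selection criteria and count on the much smaller subset
SELECT id, COUNT(*) AS cnt
FROM id_subset
WHERE event_type = 'purchase' -- hypothetical criteria
GROUP BY id;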

The limitation of partition table updates in BigQuery

The Quotas & Limits document says that "partition table updates" have the two limitations below.
Daily limit: 2,000 partition updates per table, per day
Rate limit: 50 partition updates every 10 seconds
My question is whether these limitations apply to a single partitioned table or to all partitioned tables in the dataset.
For example, is it possible to have thousands of day-partitioned tables and perform streaming inserts into each table every day?

2 billion distinct values per column limitation

According to this link, columns in SSAS tabular models are limited to 2 billion DISTINCT values. Does this apply across partitions?
For example, say I have a fact table with 4 billion records and a PK column containing values from 1 to 4,000,000,000. Based on the link above, I'm assuming processing would fail once it hits the limit. So could I partition the table and have the 2 billion distinct limit apply at the partition level?
Also, does this limit apply to DirectQuery partitions?
Yes, once you go beyond 2B distinct values, processing would fail. A way to work around this issue would be to create two separate tables and then use DAX to merge them together.