Bigtable hotspotting - least significant row key change

I have a table where I store product item information. The format of the row key is Business Unit UUID + Product ID + product serial #. Each of the row key components is of fixed byte length.
Writes to the table will occur in bursts (possibly 100Ks of records) with a constant BU UUID, but with the Product ID, the serial #, or both changing more or less at random.
Reads from the table will be one row at a time (no scans) with random key components.
My question is, will the BU UUID being fixed during a write burst result in hotspotting a particular node and/or tablet? My understanding is that I should be OK since my overall row key value is not monotonically increasing, but I want to be sure.

As noted by Solomon, it is possible that you would observe hotspotting even with a changing key. It depends on the total number of nodes you have, your write volume, and the size of the rows.
Bigtable will attempt to dynamically rebalance so that the key space is evenly distributed among its servers, but you might see better results if you apply the salting technique described in the Time series schema design documentation:
https://cloud.google.com/bigtable/docs/schema-design-time-series#ensure_that_your_row_key_avoids_hotspotting
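For illustration, a salted version of the row key from the question might be built like the Python sketch below; the number of salt buckets and the hash used for bucketing are assumptions you would tune, not part of your schema:

import hashlib

NUM_SALT_BUCKETS = 8  # assumption: tune based on node count and write volume

def row_key(bu_uuid: bytes, product_id: bytes, serial: bytes) -> bytes:
    # Original fixed-length layout from the question: BU UUID + Product ID + serial #.
    return bu_uuid + product_id + serial

def salted_row_key(bu_uuid: bytes, product_id: bytes, serial: bytes) -> bytes:
    # Prepend a deterministic salt bucket so a burst with a constant BU UUID
    # is spread across several key ranges (and therefore tablets).
    base = row_key(bu_uuid, product_id, serial)
    bucket = int.from_bytes(hashlib.md5(base).digest()[:2], "big") % NUM_SALT_BUCKETS
    return str(bucket).encode() + b"#" + base

Because the salt is derived from the key itself, your single-row reads can recompute it; the cost is that any future range scan would have to fan out across all buckets.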
In general we would recommend trying this out and experimenting if possible. You can generate load and then use the Cloud Key Visualizer (https://cloud.google.com/bigtable/docs/keyvis-overview) to inspect whether you are encountering hotspots as long as you have enough data available to perform the analysis (https://cloud.google.com/bigtable/docs/keyvis-getting-started#viewing-scan).
You may also find this talk presented at Google Cloud Next 2018 useful:
https://www.youtube.com/watch?v=3QHGhnHx5HQ
It describes an approach for doing iterative schema design with the help of the Cloud Key Visualizer.

Related

Are there any downsides to using nanoid for primary key?

I know that UUIDs and incrementing integers are often used for primary keys.
I'm thinking of using nanoids instead because they are URL-friendly without being guessable / brute-force scrapeable (the way incrementing integers are).
Would there be any reason not to use nanoids as primary keys in a database like Postgres? (For example: Maybe they drastically increase query time since they aren't ... aligned or something?)
https://github.com/ai/nanoid
Most databases use incrementing IDs because it's more efficient to insert a new value at the end of a B-tree-based index.
If you insert a new value into a random place in the middle of a B-tree, it may have to split the B-tree nonterminal node, and that could cause the node at the next higher level to split, and so on up to the top of the B-tree.
This also has a greater risk of causing fragmentation, which means the index takes more space for the same number of values.
Read https://www.percona.com/blog/2015/04/03/illustrating-primary-key-models-in-innodb-and-their-impact-on-disk-usage/ for a great visualization about the tradeoff between using an auto-increment versus UUID in a primary key.
That blog is about MySQL, but the same issue applies to any B-tree based data structure.
I'm not sure if there is a disadvantage to using nanoids, but they are often unnecessary. While UUIDs are long, they can be translated to a shorter format without losing entropy.
See the NPM package (https://www.npmjs.com/package/short-uuid).
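As a rough illustration of "shorter without losing entropy" (this shows the general idea, not the exact algorithm of the package above), the 128-bit integer behind a UUID can simply be re-encoded in a larger alphabet, for example base 62:

import string
import uuid

ALPHABET = string.digits + string.ascii_letters  # 62 symbols

def shorten(u: uuid.UUID) -> str:
    # Re-encode the UUID's 128-bit integer in base 62: ~22 characters instead of 36.
    n = u.int
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out)) or ALPHABET[0]

print(shorten(uuid.uuid4()))  # prints a roughly 22-character string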
UUIDs are standardized by the Open Software Foundation (OSF) and described in RFC 4122. That means there is a far better chance that other tools will give you some perks around them.
Some examples:
MongoDB has a special type to optimize the storage of UUIDs. Not only does a Nano ID string take more space, but even the binary form takes more bits (126 in Nano ID vs. 122 in a UUID).
I once saw a logging tool extracting the timestamp from the UUIDs; I can't remember which one, but the feature is available.
Also, the long, non-reduced form of a UUID is very easy to identify visually. When the end user is a developer, this might help them understand the nature/source of the ID (e.g. that it is clearly not a database auto-increment key).

Bigtable: Avoiding hotspotting when using timestamps on row keys

Cloud Bigtable docs on schema design for time series say:
In the vast majority of cases, time-series queries are accessing a given dataset for a given time period. Therefore, make sure that all of the data for a given time period is stored in contiguous rows, unless doing so would cause hotspotting.
Additionally, here's what they recommend to avoid hotspotting:
If you're storing a cell phone's battery status, and your row key consists of the word "BATTERY" plus a timestamp, the row key will always increase in sequence. Because Cloud Bigtable stores adjacent row keys on the same server node, all writes will focus only on one node until that node is full, at which point writes will move to the next node in the cluster.
Field promotion is suggested:
Move fields from the column data into the row key to make writes non-contiguous.
For example:
BATTERY#20150301124501001 --> BATTERY#Corrie#20150301124501001
Questions:
Field promotion may solve hotspotting. Still, wouldn't that make querying by time range a little bit difficult?
On the other side, is hotspotting avoidable if you want to query a range ONLY by TIMESTAMP? Don't think so, right?
Field promotion may solve hotspotting. Still, wouldn't that make querying by time range a little bit difficult?
That depends what your query looks like. For example, if you want to query Corrie's battery status from T1 to T2, you can construct a row range easily: [BATTERY#Corrie#T1, BATTERY#Corrie#T2]. However, if you want to query the battery status of all the users, then all the rows with prefix BATTERY will be scanned.
So, the most important queries you have should dictate which fields you promote to the row key. Also, fields with high cardinality help more when promoted to row key, as they distribute load to a larger number of tablets.
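As a concrete sketch of that per-user time-range read with the Python Bigtable client (the project, instance, and table names here are made up, and the timestamps are assumed to be fixed-width so they sort lexicographically):

from google.cloud import bigtable

table = bigtable.Client(project="my-project").instance("my-instance").table("battery-status")

# Row range [BATTERY#Corrie#T1, BATTERY#Corrie#T2); the end key is exclusive by default.
start_key = b"BATTERY#Corrie#20150301000000000"
end_key   = b"BATTERY#Corrie#20150302000000000"

for row in table.read_rows(start_key=start_key, end_key=end_key):
    print(row.row_key)

To get the same window for all users you would have to scan the whole BATTERY# prefix (or issue one range per user), which is the trade-off described above.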
On the other side, is hotspotting avoidable if you want to query a range ONLY by TIMESTAMP? Don't think so, right?
I am not entirely sure what you mean by "query a range only the timestamp", can you provide an example?
A lot will depend on what "TIMESTAMP" means. If you always want to query for the last 10 minutes, then all of your queries will go to a single server at any given time and you will experience hotspotting.
Another thing to keep in mind is that if you don't design the row key properly, writes will encounter hotspotting and you will not get good write throughput. It's recommended to design row keys to avoid hotspotting.

Bigtable rowkey design for real-time sensor data?

Your company is streaming real-time sensor data from their factory floor into Bigtable and they have noticed extremely poor performance. How should the row key be redesigned to improve Bigtable performance on queries that populate real-time dashboards?
a) Use a row key of the form <timestamp>
b) Use a row key of the form <sensorid>
c) Use a row key of the form <timestamp>#<sensorid>
d) Use a row key of the form <sensorid>#<timestamp>
Based on the documentation, what would be the ideal row key in this case? I think it should be a row key of sensor ID and timestamp, but I have seen some online articles mentioning just the 'timestamp' for the above homework question. Please help.
I have conflicting theories about this specific use case:
- Since rows are sorted lexicographically, it is not wise to use just the timestamp as the row key. (From the docs: "Using the timestamp by itself as the row key is not recommended, as most writes would be pushed onto a single node.")
- In this use case, since the requirement is a real-time dashboard, it could also mean that data for all sensor IDs is written under a single timestamp, so real-time querying could be done based on just the timestamp.
Please help with the ideal row key for this use case.
The problem is that the question does not specify what query the real-time dashboard runs, and it gives little insight into the observed performance. Please refer to the schema design for time series data documentation, which has some example scenarios. If you have only the timestamp as the key, you may suffer from hotspotting. The ideal key would be <sensorid>#<timestamp> (option D), but it always depends on the use case, which is not very clear in the question.
As per the Bigtable schema design documentation:
"Using the timestamp by itself as the row key is not recommended, as most writes would be pushed onto a single node". So this excludes option A
"For the same reason, avoid placing a timestamp at the start of the row key.". There goes option C
Also, the page says "Your row key for this data could combine an identifier for the machine with a timestamp for the data (for example, machine_4223421#1425330757685).". This leads us to choosing option D as the best one.
In theory, option B would also be valid, but option D seems better.
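A minimal sketch of option D with the Python client is below; the project, instance, table, and column family names are invented, and millisecond epoch timestamps are assumed (13 digits, so they sort lexicographically for the foreseeable future):

import time
from google.cloud import bigtable
from google.cloud.bigtable.row_set import RowSet

table = bigtable.Client(project="my-project").instance("sensors").table("readings")

def write_reading(sensor_id: str, value: float) -> None:
    # Row key = <sensorid>#<timestamp>, e.g. machine_4223421#1425330757685.
    ts_ms = int(time.time() * 1000)
    row = table.direct_row(f"{sensor_id}#{ts_ms}".encode())
    row.set_cell("metrics", "value", str(value).encode())  # assumes a 'metrics' column family exists
    row.commit()

# Dashboard read: all rows for one sensor share the "<sensorid>#" prefix.
prefix = "machine_4223421#"
end_key = prefix[:-1] + chr(ord(prefix[-1]) + 1)
row_set = RowSet()
row_set.add_row_range_from_keys(prefix.encode(), end_key.encode())
for row in table.read_rows(row_set=row_set):
    print(row.row_key)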

AWS DynamoDB v2: Do I need secondary index for alternative queries?

I need to create a table that would contain a slice of data produced by a continuously running process. This process generates messages that contain two mandatory components, among other things: a globally unique message UUID, and a message timestamp.
Those messages would be later retrieved by the UUID.
In addition, on a regular basis I would need to delete all messages from that table that are too old, i.e. whose timestamps are more than X away from the current time.
I've been reading the DynamoDB v2 documentation (e.g. Local Secondary Indexes) trying to figure out how to organize my table and whether or not I need a secondary index to perform searches for messages to delete. There might be a simple answer to my question, but I am somehow confused...
So should I just create a table with the UUID as the hash key and messageTimestamp as the range key (together with a "message" attribute that would contain the actual message), and then not create any secondary indices? In the examples that I've seen, the hash was something that was not unique (e.g. ForumName under the above link). In my case, the hash would be unique. I am not sure whether that makes any difference.
And if I create the table with hash and range as described, and without a secondary index, then how would I query for all messages that are in a certain timerange regardless of their UUIDs?
DynamoDB introduced Global Secondary Indexes, which would solve this problem.
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GSI.html
We've wrestled with this as well. The best solution we've come up with is to create a second table for storing the time series data. To do this:
1) Use the date plus "bucket" id for a hash key
You could just use the date, but then I'm guessing today's date would become a "hot" key - one that is written with excessive frequency. This can create a serious bottleneck, because the throughput available to a particular DynamoDB partition is the total provisioned throughput divided by the number of partitions. That means that if all your writes go to a single key (today's key), you have provisioned 20 writes per second, and the table has 20 partitions, your effective throughput would be 1 write per second. Any requests beyond this would be throttled. Not a good situation.
The bucket can be a random number from 1 to n, where n should be greater than the number of partitions used by the underlying DB. Determining n is a bit tricky of course because Dynamo does not reveal how many partitions it uses. But we are currently working with the upper limit of 200 based on the example found here. The writeup at this link was the basis for our thinking in coming up with this approach.
2) Use the UUID for the range key
3) Query records by issuing queries for each day and bucket.
This may seem tedious, but it is more efficient than a full scan. Another possibility is to use Elastic Map Reduce jobs, but I have not tried that myself yet so cannot say how easy/effective it is to work with.
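Here's a rough boto3 sketch of steps 1-3; the table name, attribute names, and bucket count are all invented for illustration, and query pagination is omitted:

import random
import uuid
from datetime import date, datetime, timezone

import boto3
from boto3.dynamodb.conditions import Key

NUM_BUCKETS = 20  # assumption: should exceed the number of underlying partitions
table = boto3.resource("dynamodb").Table("messages_by_day")  # hypothetical table

def put_message(message: str) -> None:
    day = date.today().isoformat()
    table.put_item(Item={
        "day_bucket": f"{day}#{random.randint(0, NUM_BUCKETS - 1)}",  # hash key: date + bucket
        "message_uuid": str(uuid.uuid4()),                            # range key: UUID
        "ts": datetime.now(timezone.utc).isoformat(),
        "message": message,
    })

def messages_for_day(day: str):
    # Step 3: one query per bucket for the day, instead of a full table scan.
    for bucket in range(NUM_BUCKETS):
        resp = table.query(KeyConditionExpression=Key("day_bucket").eq(f"{day}#{bucket}"))
        yield from resp["Items"]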
We are still figuring this out ourselves, so I'm interested to hear others' comments. I also found this presentation very helpful in thinking through how best to use Dynamo:
Falling In and Out Of Love with Dynamo
-John
In short, you cannot. Every DynamoDB query MUST specify the primary hash key. Optionally, you can also use the range key and/or a local secondary index. With the current DynamoDB functionality you won't be able to use an LSI as an alternative to the primary index. You also cannot issue a query with only the range key (you can test this out easily in the AWS Console).
A (costly) workaround that I can think of is to issue a scan of the table with a filter on the timestamp value in order to find out which items to delete. Note that filtering will not reduce the consumed capacity, as the scan still reads the whole table.
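A sketch of that scan-and-delete workaround with boto3 (the table and attribute names follow the question and are assumptions, as is a numeric timestamp attribute):

import time

import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("messages")  # hypothetical table name
cutoff = int(time.time()) - 7 * 24 * 3600             # e.g. delete everything older than a week

scan_kwargs = {
    "FilterExpression": Attr("messageTimestamp").lt(cutoff),
    "ProjectionExpression": "#id, messageTimestamp",
    "ExpressionAttributeNames": {"#id": "messageUuid"},  # alias in case the name clashes with a reserved word
}
while True:
    page = table.scan(**scan_kwargs)  # still reads (and bills) the whole table, filter or not
    with table.batch_writer() as batch:
        for item in page["Items"]:
            # Deletes need the full primary key: hash (UUID) + range (timestamp).
            batch.delete_item(Key={"messageUuid": item["messageUuid"],
                                   "messageTimestamp": item["messageTimestamp"]})
    if "LastEvaluatedKey" not in page:
        break
    scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]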

Representing Sparse Data in PostgreSQL

What's the best way to represent a sparse data matrix in PostgreSQL? The two obvious methods I see are:
Store the data in a single table with a separate column for every conceivable feature (potentially millions), but with a default value of NULL for unused features. This is conceptually very simple, but I know that in most RDBMS implementations this is typically very inefficient, since NULL values usually take up some space. However, I read an article (can't find its link, unfortunately) that claimed PG doesn't use space for NULL values, making it better suited for storing sparse data.
Create separate "row" and "column" tables, as well as an intermediate table to link them and store the value for the column at that row. I believe this is the more traditional RDMS solution, but there's more complexity and overhead associated with it.
I also found PostgreDynamic, which claims to better support sparse data, but I don't want to switch my entire database server to a PG fork just for this feature.
Are there any other solutions? Which one should I use?
I'm assuming you're thinking of sparse matrices in the mathematical sense:
http://en.wikipedia.org/wiki/Sparse_matrix (The storage techniques described there are for in-memory storage (fast arithmetic operations), not persistent storage (low disk usage).)
Since one usually operates on these matrices on the client side rather than on the server side, a SQL ARRAY[] is the best choice!
The question is how to take advantage of the sparsity of the matrix. Here are the results of some investigations.
Setup:
Postgres 8.4
Matrices w/ 400*400 elements in double precision (8 Bytes) --> 1.28MiB raw size per matrix
33% non-zero elements --> 427kiB effective size per matrix
averaged using ~1000 different random populated matrices
Competing methods:
Rely on the automatic server side compression of columns with SET STORAGE MAIN or EXTENDED.
Only store the non-zero elements plus a bitmap (bit varying(xx)) describing where to locate the non-zero elements in the matrix. (One double precision is 64 times bigger than one bit. In theory (ignoring overheads) this method should be an improvement if <=98% are non-zero ;-).) Server side compression is activated.
Replace the zeros in the matrix with NULL. (The RDBMSs are very effective in storing NULLs.) Server side compression is activated.
(Indexing of non-zero elements using a second index ARRAY[] is not very promising and was therefore not tested.)
Results:
Automatic compression
no extra implementation efforts
no reduced network traffic
minimal compression overhead
persistent storage = 39% of the raw size
Bitmap
acceptable implementation effort
network traffic slightly decreased; dependent on sparsity
persistent storage = 33.9% of the raw size
Replace zeros with NULLs
some implementation effort (API needs to know where and how to set the NULLs in the ARRAY[] while constructing the INSERT query)
no change in network traffic
persistent storage = 35% of the raw size
Conclusion:
Start with the EXTENDED/MAIN storage parameter. If you have some free time, investigate your data and use my test setup with your own sparsity level. But the effect may be lower than you expect.
I suggest always using a matrix serialization (e.g. row-major order) plus two integer columns for the matrix dimensions N x M. Since most APIs use textual SQL, you save a lot of network traffic and client memory compared to nested "ARRAY[ARRAY[..], ARRAY[..], ARRAY[..], ARRAY[..], ..]"!
Tebas
CREATE TABLE _testschema.matrix_dense
(
matdata double precision[]
);
ALTER TABLE _testschema.matrix_dense ALTER COLUMN matdata SET STORAGE EXTERNAL;
CREATE TABLE _testschema.matrix_sparse_autocompressed
(
matdata double precision[]
);
CREATE TABLE _testschema.matrix_sparse_bitmap
(
matdata double precision[],
bitmap bit varying(8000000)
);
Insert the same matrices into all tables (the concrete representation depends on the table).
Do not modify the data on the server side afterwards, because updates leave unused but still-allocated pages behind; otherwise run a VACUUM before measuring.
SELECT
pg_total_relation_size('_testschema.matrix_dense') AS dense,
pg_total_relation_size('_testschema.matrix_sparse_autocompressed') AS autocompressed,
pg_total_relation_size('_testschema.matrix_sparse_bitmap') AS bitmap;
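To make the bitmap variant concrete, here is a small client-side sketch of how the non-zero values and the bit string could be built before the INSERT (this uses psycopg2 and the table defined above; it is my illustration, not the original test code):

import psycopg2

def sparse_payload(matrix):
    # Row-major serialization: keep only non-zero elements plus a bitmap marking their positions.
    flat = [x for row in matrix for x in row]
    bitmap = "".join("1" if x != 0 else "0" for x in flat)
    values = [x for x in flat if x != 0]
    return values, bitmap

values, bitmap = sparse_payload([[0.0, 1.5], [2.5, 0.0]])

conn = psycopg2.connect("dbname=test")  # assumed connection string
with conn, conn.cursor() as cur:
    cur.execute(
        "INSERT INTO _testschema.matrix_sparse_bitmap (matdata, bitmap)"
        " VALUES (%s, %s::bit varying)",
        (values, bitmap),
    )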
A few solutions spring to mind:
1) Separate your features into groups that are usually set together, create a table for each group with a one-to-one foreign key relationship to the main data, only join on tables you need when querying
2) Use the EAV anti-pattern, create a 'feature' table with a foreign key field from your primary table as well as a fieldname and a value column, and store the features as rows in that table instead of as attributes in your primary table
3) Similarly to how PostgreDynamic does it, create a table for each 'column' in your primary table (they use a separate namespace for those tables), and create functions to simplify (as well as efficiently index) accessing and updating the data in those tables
4) create a column in your primary data using XML, or VARCHAR, and store some structured text format within it representing your data, create indexes over the data with functional indexes, write functions to update the data (or use the XML functions if you are using that format)
5) use the contrib/hstore module to create a column of type hstore that can hold key-value pairs, and can be indexed and updated
6) live with lots of empty fields
A NULL value will take up no space when it's NULL. It'll take up one bit in a bitmap in the tuple header, but that will be there regardless.
However, the system can't deal with millions of columns, period. There is a theoretical max of a bit over a thousand, IIRC, but you really don't want to go that far.
If you really need that many, in a single table, you need to go the EAV method, which is basically what you're saying in (2).
If each entry has only a relatively few keys, I suggest you look at the "hstore" contrib modules which lets you store this type of data very efficiently, as a third option. It's been enhanced further in the upcoming 9.0 version, so if you are a bit away from production deployment, you might want to look directly at that one. However, it's well worth it in 8.4 as well. And it does support some pretty efficient index based lookups. Definitely worth looking into.
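As a small sketch of the hstore route (written against a current PostgreSQL with psycopg2; on 8.4 the contrib module is installed from the contrib SQL scripts rather than with CREATE EXTENSION, and the table and column names here are made up):

import psycopg2
from psycopg2.extras import register_hstore

conn = psycopg2.connect("dbname=test")  # assumed connection string
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS hstore")
    cur.execute("CREATE TABLE IF NOT EXISTS items (id serial PRIMARY KEY, features hstore)")
register_hstore(conn)  # lets psycopg2 map Python dicts to/from hstore

with conn, conn.cursor() as cur:
    # Only the keys actually present are stored -- no per-column NULL overhead.
    cur.execute("INSERT INTO items (features) VALUES (%s)", ({"color": "red", "weight": "12"},))
    cur.execute("CREATE INDEX IF NOT EXISTS items_features_idx ON items USING gin (features)")
    cur.execute("SELECT id FROM items WHERE features ? %s", ("color",))  # the ? operator checks key presence
    print(cur.fetchall())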
I know this is an old thread, but MadLib provides a sparse vector type for Postgres, along with several machine learning and statistical methods.