Best practices for storing multiple logical records in one aerospike record? - aerospike

We have a dataset in which we have more logical records than our current cluster capacity allows for. The records are really small, so it should be easy enough to group multiple records together into one aerospike record. I can obviously come up with a custom solution to do this, but I’m wondering if there are any best practices we should be following in doing this. I would think this is a fairly common problem.

It's commonly called the "modeling tiny records" problem. You can store each small record as a key-value pair in a Map-type bin of a larger "bucket" record. The key of this mega-record would be some number of bits of the RIPEMD-160 hash of the map key. Downsides: you can't use XDR in the Enterprise Edition to ship individual records, and you lose the record-level time-to-live option. This technique was discussed at the Aerospike User Summit 2019.
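For illustration, a bucketing write/read path along those lines might look like the sketch below. It uses the Aerospike Python client's map operations; the namespace, set, bin names and the 20-bit bucket width are placeholders, not anything prescribed by Aerospike.

```python
# A sketch only, assuming the aerospike Python client and aerospike_helpers
# are installed; namespace, set, bin names and bucket width are placeholders.
import hashlib

import aerospike
from aerospike_helpers.operations import map_operations

NAMESPACE = "test"          # placeholder namespace
SET_NAME = "tiny_buckets"   # placeholder set
BIN_NAME = "items"          # the Map bin holding many small logical records
BUCKET_BITS = 20            # how many hash bits pick the bucket (mega-record)

def bucket_id(logical_key: str) -> int:
    """Derive the mega-record key from a prefix of the RIPEMD-160 hash.

    RIPEMD-160 availability depends on the local OpenSSL build; any stable
    hash (SHA-1, etc.) would serve the same purpose.
    """
    digest = hashlib.new("ripemd160", logical_key.encode()).digest()
    return int.from_bytes(digest, "big") >> (160 - BUCKET_BITS)

def put_logical_record(client, logical_key: str, value: dict) -> None:
    """Store one small logical record as a map entry inside its bucket record."""
    key = (NAMESPACE, SET_NAME, bucket_id(logical_key))
    client.operate(key, [map_operations.map_put(BIN_NAME, logical_key, value)])

def get_logical_record(client, logical_key: str):
    """Read a single logical record back out of its bucket's Map bin."""
    key = (NAMESPACE, SET_NAME, bucket_id(logical_key))
    ops = [map_operations.map_get_by_key(
        BIN_NAME, logical_key, aerospike.MAP_RETURN_VALUE)]
    _, _, bins = client.operate(key, ops)
    return bins.get(BIN_NAME)

if __name__ == "__main__":
    client = aerospike.client({"hosts": [("127.0.0.1", 3000)]}).connect()
    put_logical_record(client, "user:42", {"score": 7})
    print(get_logical_record(client, "user:42"))
    client.close()
```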

Related

How to handle hash key, partition and index on Azure?

I've studied Azure Synapse and its distribution types.
A hash-distributed table needs a column to distribute the data across the different nodes.
To me that sounds like the same idea as partitioning. I have seen examples that use a hash key, a partition and an index, but their differences, and how to choose between them, are not clear in my mind. How could a hash key, partitioning and an index work together?
Just an analogy which might explain the difference between hash distribution and partitioning.
Suppose there exists one massive book about all history of the world. It has the size of a 42 story building.
Now what if the librarian splits that book into one book per year? That makes it much easier to find all the information you need for some specific years, because you can just leave the other books on the shelves.
A small book is easier to carry too.
That's what table partitioning is about. (Reference: Data Partitioning in Azure)
Keeping chunks of data together, based on a key (or set of columns) that is useful for the majority of the queries and has a reasonably even distribution.
This can reduce IO because only the relevant chunks need to be accessed.
Now what if the chief librarian unbinds that book and sends sets of pages to many different libraries? When we then need certain information, we ask each library to send us copies of the pages we need.
Even better, those librarians could summarize the information on their pages and send only their summaries to one library that collects them for you.
That's what the table distribution is about. (Reference: Table Distribution Guidance in Azure)
To spread out the data over the different nodes.
For more details:
What is a difference between table distribution and table partition in sql?
https://www.linkedin.com/pulse/partitioning-distribution-azure-synapse-analytics-swapnil-mule
And indexing is the physical arrangement of the data within those nodes.
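For a concrete picture of how the three work together, here is a sketch that creates a hash-distributed, date-partitioned table with a clustered columnstore index on a Synapse dedicated SQL pool. It assumes pyodbc and an ODBC driver for SQL Server; the connection string, table and column names are placeholders.

```python
# A sketch only: the connection string, table and column names are placeholders,
# and it assumes pyodbc plus an ODBC driver for SQL Server are installed.
import pyodbc

CONN_STR = (
    "Driver={ODBC Driver 18 for SQL Server};"
    "Server=tcp:<your-workspace>.sql.azuresynapse.net,1433;"
    "Database=<your_pool>;Uid=<user>;Pwd=<password>;Encrypt=yes;"
)

# One CREATE TABLE combines all three concepts:
#   DISTRIBUTION = HASH(customer_id)  -> which node/distribution a row lives on
#   PARTITION (order_date ...)        -> which date chunk it joins within each distribution
#   CLUSTERED COLUMNSTORE INDEX       -> how rows are physically arranged inside those chunks
CREATE_ORDERS = """
CREATE TABLE dbo.FactOrders
(
    order_id     BIGINT NOT NULL,
    customer_id  BIGINT NOT NULL,
    order_date   DATE   NOT NULL,
    amount       DECIMAL(18, 2)
)
WITH
(
    DISTRIBUTION = HASH(customer_id),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION (order_date RANGE RIGHT FOR VALUES
        ('2023-01-01', '2023-02-01', '2023-03-01'))
);
"""

if __name__ == "__main__":
    with pyodbc.connect(CONN_STR, autocommit=True) as conn:
        conn.cursor().execute(CREATE_ORDERS)
```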

What is a difference between table distribution and table partition in sql?

I am still struggling to identify how the concept of table distribution in Azure SQL Data Warehouse differs from the concept of table partitioning in SQL Server.
The definitions of both seem to describe the same result.
Azure DW has up to 60 computing nodes as part of its MPP architecture. When you store a table on Azure DW you are storing it amongst those nodes. Your table's data is distributed across these nodes (using hash distribution or round-robin distribution depending on your needs). You can also choose to have your table (preferably a very small table) replicated across these nodes.
That is distribution. Each node has its own distinct records that only that node worries about when interacting with the data. It's a shared-nothing architecture.
Partitioning is completely divorced from this concept of distribution. When we partition a table we decide which rows belong into which partitions based on some scheme (like partitioning an order table by the order.create_date for instance). A chunk of records for each create_date then gets stored in its own table separate from any other create_date set of records (invisibly behind the scenes).
Partitioning is nice because you may find that you only want to select 10 days worth of orders from your table, so you only need to read against 10 smaller tables, instead of having to scan across years of order data to find the 10 days you are after.
Here's an example from the Microsoft website where horizontal partitioning is done on the name column, with two "shards" based on the names' alphabetical order.
Table distribution is a concept that is only available on MPP-type RDBMSs like Azure DW or Teradata. It's easiest to think of it as a hardware concept that is somewhat divorced from the data. Azure gives you a lot of control here, whereas other MPP databases base distribution on primary keys. Partitioning is available on nearly every RDBMS (MPP or not), and it's easiest to think of it as a storage/software concept that is defined by and dependent on the data in the table.
In the end, they both work to solve the same problem. But... nearly every RDBMS concept (indexing, disk storage, optimization, partitioning, distribution, etc.) is there to solve the same problem, namely: "How do I get the exact data I need out as quickly as possible?" When you combine these concepts together to match your data retrieval needs, you can make your SQL requests crazy fast even against monstrously huge data.
Just for fun, allow me to explain it with an analogy.
Suppose there exists one massive book about all history of the world. It has the size of a 42 story building.
Now what if the librarian splits that book into one book per year? That makes it much easier to find all the information you need for some specific years, because you can just leave the other books on the shelves.
A small book is easier to carry too.
That's what table partitioning is about. (Reference: Data Partitioning in Azure)
Keeping chunks of data together, based on a key (or set of columns) that is useful for the majority of the queries and has a reasonably even distribution.
This can reduce IO because only the relevant chunks need to be accessed.
Now what if the chief librarian unbinds that book and sends sets of pages to many different libraries?
When we then need certain information, we ask each library to send us copies of the pages we need.
Even better, those librarians could summarize the information on their pages and send only their summaries to one library that collects them for you.
That's what the table distribution is about. (Reference: Table Distribution Guidance in Azure)
To spread out the data over the different nodes.
Conceptually they are the same: the basic idea is that the data will be split across multiple stores. However, the implementation is radically different. Under the covers, Azure SQL Data Warehouse manages and maintains the 60 distribution databases that each table you define is created within. You do nothing beyond define the keys; the distribution is taken care of. For partitioning, you have to define and maintain pretty much everything to get it to work. There's even more to it, but you get the core idea.

These are different processes and mechanisms that, at the macro level, arrive at a similar end point. However, the processes they support are very different. Distribution assists in increased performance, while partitioning is primarily a means of improved data management (rolling windows, etc.). They are very different things with different intents, even as they are similar.
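If it helps to see the two mechanisms side by side, the toy sketch below (plain Python, made-up rows, and the 60-way layout mentioned above) shows distribution deciding which store a row lands in by hashing one column, while partitioning groups rows by a date range inside each store.

```python
# Illustrative only: a toy model of the two concepts, not Synapse internals.
from collections import defaultdict
from datetime import date
import zlib

NUM_DISTRIBUTIONS = 60  # fixed distribution count in Azure SQL DW / Synapse dedicated pools

def distribution_for(customer_id: int) -> int:
    """Distribution: hashing one column decides which of the 60 stores gets the row."""
    return zlib.crc32(str(customer_id).encode()) % NUM_DISTRIBUTIONS

def partition_for(create_date: date) -> str:
    """Partitioning: a date range decides which chunk the row joins *inside*
    whichever distribution it already landed in (monthly partitions here)."""
    return create_date.strftime("%Y-%m")

rows = [
    {"order_id": 1, "customer_id": 101, "create_date": date(2023, 1, 5)},
    {"order_id": 2, "customer_id": 202, "create_date": date(2023, 1, 9)},
    {"order_id": 3, "customer_id": 101, "create_date": date(2023, 2, 17)},
]

# layout[distribution][partition] -> rows stored there
layout = defaultdict(lambda: defaultdict(list))
for row in rows:
    d = distribution_for(row["customer_id"])
    p = partition_for(row["create_date"])
    layout[d][p].append(row)

# A query filtered on create_date hits every distribution in parallel, but only
# the matching partition inside each one (partition elimination).
wanted = "2023-01"
hits = [r for dist in layout.values() for r in dist.get(wanted, [])]
print(f"{len(hits)} rows read from partition {wanted}")
```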

Bigtable/BigQuery pricing when inserts depend on lookups

I have a simple proof-of-concept application written in traditional SQL. I need to scale it to much larger size (potentially trillions of rows, multiple terabytes or possibly petabytes in size). I'm trying to come up with the pricing model of how this could be done using Google's Bigtable/BigQuery/Dataflow.
From what I gather from Google's pricing documents, Bigtable is priced in terms of nodes needed to handle the necessary QPS and in terms of storage required, whereas the BigQuery is priced in terms of each query's size.
But what happens when your inserts into the table actually require a lookup of that same table? Does that mean you have to factor an additional cost into each insert? If my total column size is 1 TB and I have to do a SELECT on that column before each additional insert, will I be charged $5 for each insert operation as a consequence? Do I have to adjust my logic to accommodate this pricing structure, like breaking the table into a set of smaller tables, etc.?
Any clarification much appreciated, as well as links to more detailed and granular pricing examples for Bigtable/BigQuery/Dataflow than what's available on Google's website.
I am the product manager for Google Cloud Bigtable.
It's hard to give a detailed answer without a deeper understanding of the use case. For example, when you need to do a lookup before doing an insert, what's the complexity of the query? Is it an arbitrary SQL query, or can you get by with a lookup by primary key? How big is the data set?
If you only need to do lookups by key, then you may be able to use Bigtable (which, like HBase, only has a single key: the row key), and each lookup by row key is fast and does not require scanning the entire column.
If you need complex lookups, you may be able to use:
Google BigQuery, but note that each lookup on a column is a full scan as per this answer, though as suggested in another answer, you can partition data to scan less data, if that's helpful
Google Cloud Datastore, which is a document database (like MongoDB) that allows you to set up indexes on some of the fields, so you can do a search based on those properties
Google Cloud SQL, which is a managed service for MySQL, but while it can scale to TBs it does not scale to PBs, so it depends on how big the dataset you need to query prior to inserting is
Finally, if your use case is going into the PB-range, I strongly encourage you to get in touch with Google Cloud Platform folks and speak with our architects and engineers to identify the right overall solution for your specific use cases, as there may be other optimizations that we can make if we can discuss your project in more detail.
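For the lookup-by-row-key path, a read-before-write against Bigtable might look like the sketch below. It assumes the google-cloud-bigtable Python client; the project, instance, table and column names are placeholders. The point is that read_row fetches a single row by its key rather than scanning anything.

```python
# A sketch assuming the google-cloud-bigtable package; project, instance,
# table and column identifiers are placeholders.
from google.cloud import bigtable

client = bigtable.Client(project="my-project")      # placeholder project
instance = client.instance("my-instance")           # placeholder instance
table = instance.table("events")                    # placeholder table

def lookup_then_insert(row_key: bytes, family: str, qualifier: bytes, value: bytes):
    """Read a single row by its key, then write it if absent, without any scan."""
    existing = table.read_row(row_key)    # point lookup by row key
    if existing is None:                  # row not present yet
        row = table.direct_row(row_key)
        row.set_cell(family, qualifier, value)
        row.commit()
    return existing

lookup_then_insert(b"user#42#2023-01-05", "stats", b"count", b"1")
```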
Regarding BigQuery, you are able to partition your data by day. So if you only need to query the last few days, you will be charged for that data and not for the full table.
On the other hand, you need to rethink your data management. Choosing an append-only, event-based data flow could help you avoid lookups on the same table.
will I be charged $5 for each insert operation as a consequence?
Yes. Any time you scan that column you will be charged for the full column's size, unless your result is cacheable (see query caching), which is most likely not your case.
Do I have to adjust my logic ... ?
Yes.
"breaking the table into a set of smaller tables" (Sharding with Table wildcard functions) or Partitioning is the way to go for you

How would you design the feature in LinkedIn that computes how many hops there are between you and another person?

This was asked on Quora, and I'm really interested in the most efficient answer.
I think one way is an SQL db or a document-based db:
get all your connections, then all your connections' connections, and then theirs, and check if the person you are looking at is somewhere in that list.
Taking into account an average of 500 connections per person, and a db indexed by user_id, this would be at most 3 queries to the db.
I am interested in whether a graph db solution, which I know very little about, could greatly improve this feature.
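For concreteness, the brute-force approach described above might look like the sketch below (plain Python; get_connections() is a hypothetical helper backed by the indexed table, and each hop costs one batched query).

```python
# A sketch of the naive breadth-first approach; get_connections() is a
# hypothetical helper that fetches the connection ids for a whole frontier
# with one indexed query (e.g. WHERE user_id IN (...)).
from typing import Iterable, Optional, Set

def get_connections(user_ids: Iterable[int]) -> Set[int]:
    """Placeholder for: SELECT friend_id FROM connections WHERE user_id IN (...)."""
    raise NotImplementedError

def hops_between(me: int, target: int, max_hops: int = 3) -> Optional[int]:
    """Return the number of hops from `me` to `target`, or None beyond max_hops."""
    if me == target:
        return 0
    frontier, seen = {me}, {me}
    for hop in range(1, max_hops + 1):
        frontier = get_connections(frontier) - seen   # one batched query per hop
        if target in frontier:
            return hop
        if not frontier:
            return None
        seen |= frontier
    return None
```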
In a graph database such a task is a typical path finding task. It is solved using algorithms built into the system. For example in Neo4j: http://neo4j.com/docs/stable/rest-api-graph-algos.html
When you find the shortest path between persons you can easily calculate number of edges (hops in terms of your question) between them.
Graph databases have a great advantage in this kind of task over relational or key-value databases because they can use efficient graph algorithms such as Dijkstra's algorithm.
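As a sketch of that in Neo4j via the official Python driver (the connection details, the :Person label and the :CONNECTED relationship type are assumptions about the data model):

```python
# A sketch assuming the neo4j Python driver; the URI, credentials, the :Person
# label and the :CONNECTED relationship type are placeholders for the real model.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CYPHER = """
MATCH (a:Person {id: $me}), (b:Person {id: $other}),
      p = shortestPath((a)-[:CONNECTED*..6]-(b))
RETURN length(p) AS hops
"""

def hops_between(me, other):
    """Number of hops between two people, or None if not connected within 6 degrees."""
    with driver.session() as session:
        record = session.run(CYPHER, me=me, other=other).single()
        return record["hops"] if record else None

print(hops_between("alice", "bob"))
```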

Sharding when you don't have a good partition function

Edit: I see that the partition functionality of some RDBMS (Postgres: http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html) provides much of what I'm looking for. I'd still be interested in specific algorithms and best practices for managing the partitioning.
Horizontally scaling relational databases is often achieved by sharding data onto n servers based on some function that splits the data into n buckets of roughly equal expected size. This maintains efficient and useful queries as long as all queries contain the shard key and the data is partitioned into mutually irrelevant sets, respectively.
What is the best approach to horizontally scaling a relational database when you don't have any function that fits the above properties?
For example, in a multi-tenant situation, some tenants may produce barely any data and some may produce a full server's worth (or more), and there's no way to know which, and almost all of the queries you want to do are on a tenant's entire dataset.
I couldn't find much literature on this. The best solution I can think of is:
Initially partition based on some naive equal-splitting function into n groups.
When any server gets filled up, increment n (or increase by some other amount/factor), then re-partition the data.
When a tenant takes up more than some percent of the space on a server, move it to its own server, and add a special case to the partitioning function.
This is pretty complicated and would require a lot of complex logic in your application's sharding layer (not to mention copying large sets of data between servers), but it seems like it wouldn't be too hard to semi-automate, and if you were careful you could change the sharding function over time in a way that minimizes the amount of data relocated from one server to another.
Is this completely barking up the wrong tree?
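For reference, the routing layer described above could start out as simply as the following sketch (plain Python; the shard names, the override table and the CRC-based hash are illustrative choices, not recommendations):

```python
# A sketch of the proposed scheme: a naive equal-splitting hash over n shards,
# plus an explicit override map for tenants that outgrow shared servers.
# Shard names, tenant ids and the CRC hash are illustrative only.
import zlib
from typing import Dict, Optional

class TenantShardRouter:
    def __init__(self, num_shards: int, overrides: Optional[Dict[str, str]] = None):
        self.num_shards = num_shards        # step 1: n roughly equal buckets
        self.overrides = overrides or {}    # step 3: special-cased big tenants

    def shard_for(self, tenant_id: str) -> str:
        if tenant_id in self.overrides:     # a dedicated server always wins
            return self.overrides[tenant_id]
        bucket = zlib.crc32(tenant_id.encode()) % self.num_shards
        return f"shard-{bucket}"

    def grown(self, new_num_shards: int) -> "TenantShardRouter":
        """Step 2: increase n; non-overridden tenants must then be copied to
        whatever shard the new function assigns them."""
        return TenantShardRouter(new_num_shards, dict(self.overrides))

router = TenantShardRouter(num_shards=4, overrides={"megacorp": "shard-megacorp"})
print(router.shard_for("small-tenant-17"))   # hash-assigned shard
print(router.shard_for("megacorp"))          # dedicated shard via the override
```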