How handle Hash-key, partition and index on Azure? - azure-synapse

I've studied Azure Synapse and distribution types.
Hash-distributed table needs a column to distribute the data between different nodes,
For me it's the same idea of partition, I saw some examples that uses a hash-key, partition and index. It's not clear in my mind their differences and how to choose one of them. How Hash-key, partition and index could work together?

Just an analogy which might explain the difference between Hash and Partition
Suppose there exists one massive book about all history of the world. It has the size of a 42 story building.
Now what if the librarian splits that book into 1 book per year. That makes it much easier to find all information you need for some specific years. Because you can just keep the other books on the shelves.
A small book is easier to carry too.
That's what table partitioning is about. (Reference: Data Partitioning in Azure)
Keeping chunks of data together, based on a key (or set of columns) that is usefull for the majority of the queries and has a nice average distribution.
This can reduce IO because only the relevant chunks need to be accessed.
Now what if the chief librarian unbinds that book. And sends sets of pages to many different libraries. When we then need certain information, we ask each library to send us copies of the pages we need.
Even better, those librarians could already summarize the information of their pages and then just send only their summaries to one library that collects them for you.
That's what the table distribution is about. (Reference: Table Distribution Guidance in Azure)
To spread out the data over the different nodes.
For more details:
What is a difference between table distribution and table partition in sql?
https://www.linkedin.com/pulse/partitioning-distribution-azure-synapse-analytics-swapnil-mule
And indexing is physical arrangement within those nodes

Related

What is a difference between table distribution and table partition in sql?

I am still struggling with identifying how the concept of table distribution in azure sql data warehouse differs from concept of table partition in Sql server?
Definition of both seems to be achieving same results.
Azure DW has up to 60 computing nodes as part of it's MPP architecture. When you store a table on Azure DW you are storing it amongst those nodes. Your tables data is distributed across these nodes (using Hash distribution or Round Robin distribution depending on your needs). You can also choose to have your table (preferably a very small table) replicated across these nodes.
That is distribution. Each node has its own distinct records that only that node worries about when interacting with the data. It's a shared-nothing architecture.
Partitioning is completely divorced from this concept of distribution. When we partition a table we decide which rows belong into which partitions based on some scheme (like partitioning an order table by the order.create_date for instance). A chunk of records for each create_date then gets stored in its own table separate from any other create_date set of records (invisibly behind the scenes).
Partitioning is nice because you may find that you only want to select 10 days worth of orders from your table, so you only need to read against 10 smaller tables, instead of having to scan across years of order data to find the 10 days you are after.
Here's an example from the Microsoft website where horizontal partitioning is done on the name column with two "shards" based on the names alphabetical order:
Table distribution is a concept that is only available on MPP type RDBMSs like Azure DW or Teradata. It's easiest to think of it as a hardware concept that is somewhat divorced (to a degree) from the data. Azure gives you a lot of control here where other MPP databases base distribution on primary keys. Partitioning is available on nearly every RDBMS (MPP or not) and it's easiest to think of it as a storage/software concept that is defined by and dependent on the data in the table.
In the end, they do both work to solve the same problem. But... nearly every RDBMS concept (indexing, disk storage, optimization, partition, distribution, etc) are there to solve the same problem. Namely: "How do I get the exact data I need out as quickly as possible?" When you combine these concepts together to match your data retrieval needs you make your SQL requests CRAZY fast even against monstrously huge data.
Just for fun, allow me to explain it with an analogy.
Suppose there exists one massive book about all history of the world. It has the size of a 42 story building.
Now what if the librarian splits that book into 1 book per year. That makes it much easier to find all information you need for some specific years. Because you can just keep the other books on the shelves.
A small book is easier to carry too.
That's what table partitioning is about. (Reference: Data Partitioning in Azure)
Keeping chunks of data together, based on a key (or set of columns) that is usefull for the majority of the queries and has a nice average distribution.
This can reduce IO because only the relevant chunks need to be accessed.
Now what if the chief librarian unbinds that book. And sends sets of pages to many different libraries.
When we then need certain information, we ask each library to send us copies of the pages we need.
Even better, those librarians could already summarize the information of their pages and then just send only their summaries to one library that collects them for you.
That's what the table distribution is about. (Reference: Table Distribution Guidance in Azure)
To spread out the data over the different nodes.
Conceptually they are the same. The basic idea is that the data will be split across multiple stores. However, the implementation is radically different. Under the covers, Azure SQL Data Warehouse manages and maintains the 70 databases that each table you define is created within. You do nothing beyond define the keys. The distribution is taken care of. For partitioning, you have to define and maintain pretty much everything to get it to work. There's even more to it, but you get the core idea. These are different processes and mechanisms that are, at the macro level, arriving at a similar end point. However, the processes these things support are very different. The distribution assists in increased performance while partitioning is primarily a means of improved data management (rolling windows, etc.). These are very different things with different intents even as they are similar.

Bigtable/BigQuery pricing when inserts depend on lookups

I have a simple proof-of-concept application written in traditional SQL. I need to scale it to much larger size (potentially trillions of rows, multiple terabytes or possibly petabytes in size). I'm trying to come up with the pricing model of how this could be done using Google's Bigtable/BigQuery/Dataflow.
From what I gather from Google's pricing documents, Bigtable is priced in terms of nodes needed to handle the necessary QPS and in terms of storage required, whereas the BigQuery is priced in terms of each query's size.
But what happens when your inserts into the table actually require the lookup of that same table? Does that mean that you have to consider an additional cost factor into each insert? If my total column size is 1TB and I have to do a SELECT on that column before each additional insert, will I be charged $5 for each insert operation as a consequence? Do I have to adjust my logic to accommodate this pricing structure? Like breaking the table into a set of smaller tables, etc?
Any clarification much appreciated, as well as links to more detailed and granular pricing examples for Bigtable/BigQuery/Dataflow than what's available on Google's website.
I am the product manager for Google Cloud Bigtable.
It's hard to give a detailed answer without a deeper understanding of the use case. For example, when you need to do a lookup before doing an insert, what's the complexity of the query? Is it an arbitrary SQL query, or can you get by with a lookup by primary key? How big is the data set?
If you only need to do lookups by key, then you may be able to use Bigtable (which, like HBase, only has a single key: the row key), and each lookup by row key is fast and does not require scanning the entire column.
If you need complex lookups, you may be able to use:
Google BigQuery, but note that each lookup on a column is a full scan as per this answer, though as suggested in another answer, you can partition data to scan less data, if that's helpful
Google Cloud Datastore, which is a document database (like MongoDB), allows you to set up indexes on some of the fields, so you can do a search based on those properties
Google Cloud SQL, which is a managed service for MySQL, but while it can scale to TB, it does not scale to PB, so it depends how big your dataset is that you need to query prior to inserting
Finally, if your use case is going into the PB-range, I strongly encourage you to get in touch with Google Cloud Platform folks and speak with our architects and engineers to identify the right overall solution for your specific use cases, as there may be other optimizations that we can make if we can discuss your project in more detail.
Regarding BigQuery, you are able to partition your data based on day. So if you need to query only last days the charge will be for that and not for full table.
On the other hand, you need to rethink your data management. Choosing an append-only and event based data flow could help you to avoid lookups on the same table.
will I be charged $5 for each insert operation as a consequence?
Yes, any time you scan that column - you will be charged for full column's size unless your result is cachable (see query caching) which most likely is not your case
Do I have to adjust my logic ... ?
Yes.
"breaking the table into a set of smaller tables" (Sharding with Table wildcard functions) or Partitioning is the way to go for you

Why Spark SQL considers the support of indexes unimportant?

Quoting the Spark DataFrames, Datasets and SQL manual:
A handful of Hive optimizations are not yet included in Spark. Some of
these (such as indexes) are less important due to Spark SQL’s
in-memory computational model. Others are slotted for future releases
of Spark SQL.
Being new to Spark, I'm a bit baffled by this for two reasons:
Spark SQL is designed to process Big Data, and at least in my use
case the data size far exceeds the size of available memory.
Assuming this is not uncommon, what is meant by "Spark SQL’s
in-memory computational model"? Is Spark SQL recommended only for
cases where the data fits in memory?
Even assuming the data fits in memory, a full scan over a very large
dataset can take a long time. I read this argument against
indexing in in-memory database, but I was not convinced. The example
there discusses a scan of a 10,000,000 records table, but that's not
really big data. Scanning a table with billions of records can cause
simple queries of the "SELECT x WHERE y=z" type take forever instead
of returning immediately.
I understand that Indexes have disadvantages like slower INSERT/UPDATE, space requirements, etc. But in my use case, I first process and load a large batch of data into Spark SQL, and then explore this data as a whole, without further modifications. Spark SQL is useful for the initial distributed processing and loading of the data, but the lack of indexing makes interactive exploration slower and more cumbersome than I expected it to be.
I'm wondering then why the Spark SQL team considers indexes unimportant to a degree that it's off their road map. Is there a different usage pattern that can provide the benefits of indexing without resorting to implementing something equivalent independently?
Indexing input data
The fundamental reason why indexing over external data sources is not in the Spark scope is that Spark is not a data management system but a batch data processing engine. Since it doesn't own the data it is using it cannot reliably monitor changes and as a consequence cannot maintain indices.
If data source supports indexing it can be indirectly utilized by Spark through mechanisms like predicate pushdown.
Indexing Distributed Data Structures:
standard indexing techniques require persistent and well defined data distribution but data in Spark is typically ephemeral and its exact distribution is nondeterministic.
high level data layout achieved by proper partitioning combined with columnar storage and compression can provide very efficient distributed access without an overhead of creating, storing and maintaining indices.This is a common pattern used by different in-memory columnar systems.
That being said some forms of indexed structures do exist in Spark ecosystem. Most notably Databricks provides Data Skipping Index on its platform.
Other projects, like Succinct (mostly inactive today) take different approach and use advanced compression techniques with with random access support.
Of course this raises a question - if you require an efficient random access why not use a system which is design as a database from the beginning. There many choices out there, including at least a few maintained by the Apache Foundation. At the same time Spark as a project evolves, and the quote you used might not fully reflect future Spark directions.
In general, the utility of indexes is questionable at best. Instead, data partitioning is more important. They are very different things, and just because your database of choice supports indexes doesn't mean they make sense given what Spark is trying to do. And it has nothing to do with "in memory".
So what is an index, anyway?
Back in the days when permanent storage was crazy expensive (instead of essentially free) relational database systems were all about minimizing usage of permanent storage. The relational model, by necessity, split a record into multiple parts -- normalized the data -- and stored them in different locations. To read a customer record, maybe you read a customer table, a customerType table, take a couple of entries out of an address table, etc. If you had a solution that required you to read the entire table to find what you want, this is very costly, because you have to scan so many tables.
But this is not the only way to do things. If you didn't need to have fixed-width columns, you can store the entire set of data in one place. Instead of doing a full-table scan on a bunch of tables, you only need to do it on a single table. And that's not as bad as you think it is, especially if you can partition your data.
40 years later, the laws of physics have changed. Hard drive random read/write speeds and linear read/write speeds have drastically diverged. You can basically do 350 head movements a second per disk. (A little more or less, but that's a good average number.) On the other hand, a single disk drive can read about 100 MB per second. What does that mean?
Do the math and think about it -- it means if you are reading less than 300KB per disk head move, you are throttling the throughput of your drive.
Seriouusly. Think about that a second.
The goal of an index is to allow you to move your disk head to the precise location on disk you want and just read that record -- say just the address record joined as part of your customer record. And I say, that's useless.
If I were designing an index based on modern physics, it would only need to get me within 100KB or so of the target piece of data (assuming my data had been laid out in large chunks -- but we're talking theory here anyway). Based on the numbers above, any more precision than that is just a waste.
Now go back to your normalized table design. Say a customer record is really split across 6 rows held in 5 tables. 6 total disk head movements (I'll assume the index is cached in memory, so no disk movement). That means I can read 1.8 MB of linear / de-normalized customer records and be just as efficient.
And what about customer history? Suppose I wanted to not just see what the customer looks like today -- imagine I want the complete history, or a subset of the history? Multiply everything above by 10 or 20 and you get the picture.
What would be better than an index would be data partitioning -- making sure all of the customer records end up in one partition. That way with a single disk head move, I can read the entire customer history. One disk head move.
Tell me again why you want indexes.
Indexes vs ___ ?
Don't get me wrong -- there is value in "pre-cooking" your searches. But the laws of physics suggest a better way to do it than traditional indexes. Instead of storing the customer record in exactly one location, and creating a pointer to it -- an index -- why not store the record in multiple locations?
Remember, disk space is essentially free. Instead of trying to minimize the amount of storage we use -- an outdated artifact of the relational model -- just use your disk as your search cache.
If you think someone wants to see customers listed both by geography and by sales rep, then make multiple copies of your customer records stored in a way that optimized those searches. Like I said, use the disk like your in memory cache. Instead of building your in-memory cache by drawing together disparate pieces of persistent data, build your persistent data to mirror your in-memory cache so all you have to do is read it. In fact don't even bother trying to store it in memory -- just read it straight from disk every time you need it.
If you think that sounds crazy, consider this -- if you cache it in memory you're probably going to cache it twice. It's likely your OS / drive controller uses main memory as cache. Don't bother caching the data because someone else is already!
But I digress...
Long story short, Spark absolutely does support the right kind of indexing -- the ability to create complicated derived data from raw data to make future uses more efficient. It just doesn't do it the way you want it to.

1 or many sql tables for persisting "families" of properties about one object?

Our application (using a SQL Server 2008 R2 back-end) stores data about remote hardware devices reporting back to our servers over the Internet. There are a few "families" of information we have about each device, each stored by a different server application into a shared database:
static configuration information stored by users using our web app. e.g. Physical Location, Friendly Name, etc.
logged information about device behavior, e.g. last reporting time, date the device first came online, whether device is healthy, etc.
expensive information re-computed by scheduled jobs, e.g. average signal strength, average length of transmission, historical failure rates, etc.
These properties are all scalar values reflecting the most current data we have about a device. We have a separate way to store historical information.
The largest number of device instances we have to worry about will be around 100,000, so this is not a "big data" problem. In most cases a database will have 10,000 devices or less to worry about.
Writes to the data about an individual device happens infrequently-- typically every few hours. It's theoretically possible for a scheduled task, user-inputted configuration changes, and dynamic data to all make updates for the same device at the same time, but this seems very rare. Reads are more frequent: probably 10x per minute reads against at least one device in a database, and several times per hour for a full scan of some properties of all devices described in a database.
Deletes are relatively rare, in fact many cases we only "soft delete" devices so we can use them for historical reporting. New device inserts are more common, perhaps a few every day.
There are (at least) two obvious ways to store this data in our SQL database:
The current design of our application stores each of these families of information in separate tables, each with a clustered index on a Device ID primary key. One server application writes to one table each.
An alternate implementation that's been proposed is to use one large table, and create covering indexes as needed to accelerate queries for groups of properties (e.g. all static info, all reliability info, etc.) that are frequently queried together.
My question: is there a clearly superior option? If the answer is "it depends" then what are the circumstances which would make "one large table" or "multiple tables" better?
Answers should consider: performance, maintainability of DB itself, maintainability of code that reads/writes rows, and reliability in the face of unexpected behavior. Maintanability and reliability are probably a higher priority for us than performance, if we have to trade off.
Don't know about a clearly superior option, and I don't know about sql-server architecture. But I would go for the first option with separate tables for different families of data. Some advantages could be:
granting access to specific sets of data (may be desirable for future applications)
archiving different famalies of data at different rates
partial functionality of the application in the case of maintenance on a part (some tables available while another is restored)
indexing and partitioning/sharding can be performed on different attributes (static information could be partitioned on device id, logging information on date)
different families can be assigned to different cache areas (so the static data can remain in a more "static" cache, and more rapidly changing logging type data can be in another "rolling" cache area)
smaller rows pack more rows into a block which means fewer block pulls to scan a table for a specific attribute
less chance of row chaining if altering a table to add a row, easier to perform maintenance if you do
easier to understand the data when seprated into logical units (families)
I wouldn't consider table joining as a disadvantage when properly indexed. But more tables will mean more moving parts and the need for greater awareness/documentation on what is going on.
The first option is the recognized "standard" way to store such data in a relational database.
Although a good design would probably result in more tables. Relational databases software such as SQLServer were designed to store and retrieve data in multiple tables quickly and efficiently.
In addition such designs allow for great flexibility, both in terms of changing the database to store extra data, and, in allowing unexpected/unusual queries against the data stored.
The single table option sounds beguilingly simple to practitioners unfamiliar with Relational databases. In practice they perform very badly, are difficult to manage, and lead to a high number of deadlocks and timeouts.
They also lead to development paralysis. You cannot add a requested feature because it cannot be done without a total redesign of the "simple" database schema.

What are the best uses of document stores?

I have been hearing a lot about document oriented data stores like CouchDB. I understand the uses of BigTable like stores such as Cassandra. After reading this question, I was wondering what the conditions would be to merit using a document store?
Column-family stores such as Bigtable and Cassandra have very limited querying capabilities. The application is responsible for maintaining indexes in order to query a more complex data model.
Document databases allow you to query the content, not just the key. It will also manage the indexes for you, reducing the complexity of your application.
Domain-driven design evangelizes the use of aggregates and value objects. As Ayende points out, (complex) aggregates are very natural candidates to be stored as a single document, instead of normalizing them over multiple tables or column families. This will reduce the complexity of your persistence layer. There's also less chance that related data is scattered across multiple nodes, as all the data is contained in a single document.
If your application needs to store polymorphic objects, document databases are also a good candidate. Of course, this could also be stored in Cassandra, but you won't have as much querying capabilities. At least not out of the box.
Think of a document database as a luxurious sports car. It doesn't need a professional driver (read: complex application) to get you from A to B, it has features such as air conditioning and comfortable seats and it will lap the high-scalability track in an acceptable time. However, if you want to set a lap record on the high-scalability track, you will need a professional driver and a highly optimized car (e.g. Cassandra), which lacks features such as air conditioning.
Another feature of CouchDB is that you can create those aggregations, not as documents stored manually, but as views (which are derived from the stored data, and updated automatically.)
This is like power windows, heated seats, or the kicking stereo.