What is a difference between table distribution and table partition in sql? - sql

I am still struggling with identifying how the concept of table distribution in azure sql data warehouse differs from concept of table partition in Sql server?
Definition of both seems to be achieving same results.

Azure DW has up to 60 computing nodes as part of it's MPP architecture. When you store a table on Azure DW you are storing it amongst those nodes. Your tables data is distributed across these nodes (using Hash distribution or Round Robin distribution depending on your needs). You can also choose to have your table (preferably a very small table) replicated across these nodes.
That is distribution. Each node has its own distinct records that only that node worries about when interacting with the data. It's a shared-nothing architecture.
Partitioning is completely divorced from this concept of distribution. When we partition a table we decide which rows belong into which partitions based on some scheme (like partitioning an order table by the order.create_date for instance). A chunk of records for each create_date then gets stored in its own table separate from any other create_date set of records (invisibly behind the scenes).
Partitioning is nice because you may find that you only want to select 10 days worth of orders from your table, so you only need to read against 10 smaller tables, instead of having to scan across years of order data to find the 10 days you are after.
Here's an example from the Microsoft website where horizontal partitioning is done on the name column with two "shards" based on the names alphabetical order:
Table distribution is a concept that is only available on MPP type RDBMSs like Azure DW or Teradata. It's easiest to think of it as a hardware concept that is somewhat divorced (to a degree) from the data. Azure gives you a lot of control here where other MPP databases base distribution on primary keys. Partitioning is available on nearly every RDBMS (MPP or not) and it's easiest to think of it as a storage/software concept that is defined by and dependent on the data in the table.
In the end, they do both work to solve the same problem. But... nearly every RDBMS concept (indexing, disk storage, optimization, partition, distribution, etc) are there to solve the same problem. Namely: "How do I get the exact data I need out as quickly as possible?" When you combine these concepts together to match your data retrieval needs you make your SQL requests CRAZY fast even against monstrously huge data.

Just for fun, allow me to explain it with an analogy.
Suppose there exists one massive book about all history of the world. It has the size of a 42 story building.
Now what if the librarian splits that book into 1 book per year. That makes it much easier to find all information you need for some specific years. Because you can just keep the other books on the shelves.
A small book is easier to carry too.
That's what table partitioning is about. (Reference: Data Partitioning in Azure)
Keeping chunks of data together, based on a key (or set of columns) that is usefull for the majority of the queries and has a nice average distribution.
This can reduce IO because only the relevant chunks need to be accessed.
Now what if the chief librarian unbinds that book. And sends sets of pages to many different libraries.
When we then need certain information, we ask each library to send us copies of the pages we need.
Even better, those librarians could already summarize the information of their pages and then just send only their summaries to one library that collects them for you.
That's what the table distribution is about. (Reference: Table Distribution Guidance in Azure)
To spread out the data over the different nodes.

Conceptually they are the same. The basic idea is that the data will be split across multiple stores. However, the implementation is radically different. Under the covers, Azure SQL Data Warehouse manages and maintains the 70 databases that each table you define is created within. You do nothing beyond define the keys. The distribution is taken care of. For partitioning, you have to define and maintain pretty much everything to get it to work. There's even more to it, but you get the core idea. These are different processes and mechanisms that are, at the macro level, arriving at a similar end point. However, the processes these things support are very different. The distribution assists in increased performance while partitioning is primarily a means of improved data management (rolling windows, etc.). These are very different things with different intents even as they are similar.

Related

How handle Hash-key, partition and index on Azure?

I've studied Azure Synapse and distribution types.
Hash-distributed table needs a column to distribute the data between different nodes,
For me it's the same idea of partition, I saw some examples that uses a hash-key, partition and index. It's not clear in my mind their differences and how to choose one of them. How Hash-key, partition and index could work together?
Just an analogy which might explain the difference between Hash and Partition
Suppose there exists one massive book about all history of the world. It has the size of a 42 story building.
Now what if the librarian splits that book into 1 book per year. That makes it much easier to find all information you need for some specific years. Because you can just keep the other books on the shelves.
A small book is easier to carry too.
That's what table partitioning is about. (Reference: Data Partitioning in Azure)
Keeping chunks of data together, based on a key (or set of columns) that is usefull for the majority of the queries and has a nice average distribution.
This can reduce IO because only the relevant chunks need to be accessed.
Now what if the chief librarian unbinds that book. And sends sets of pages to many different libraries. When we then need certain information, we ask each library to send us copies of the pages we need.
Even better, those librarians could already summarize the information of their pages and then just send only their summaries to one library that collects them for you.
That's what the table distribution is about. (Reference: Table Distribution Guidance in Azure)
To spread out the data over the different nodes.
For more details:
What is a difference between table distribution and table partition in sql?
https://www.linkedin.com/pulse/partitioning-distribution-azure-synapse-analytics-swapnil-mule
And indexing is physical arrangement within those nodes

Sharding when you don't have a good partition function

Edit: I see that the partition functionality of some RDBMS (Postgres: http://www.postgresql.org/docs/9.1/static/ddl-partitioning.html) provides much of what I'm looking for. I'd still be interested in specific algorithms and best practices for managing the partitioning.
Horizontally scaling relational databases is often achieved by sharding data onto n servers based on some function that splits the data into n buckets of roughly equal expected size. This maintains efficient and useful queries as long as all queries contain the shard key and the data is partitioned into mutually irrelevant sets, respectively.
What is the best approach to horizontally scaling a relational database when you don't have any function that fits the above properties?
For example, in a multi-tenant situation, some tenants may produce barely any data and some may produce a full server's worth (or more), and there's no way to know which, and almost all of the queries you want to do are on a tenant's entire dataset.
I couldn't find much literature on this. The best solution I can think of is:
Initially partition based on some naive equal-splitting function into n groups.
When any server gets filled up, increment n (or increase by some other amount/factor), then re-partition the data.
When a tenant takes up more than some percent of the space on a server, move it to its own server, and add a special case to the partitioning function.
This is pretty complicated and would require a lot of complex logic in your application sharding layer (not to mention copying large sets of data between servers), but it seems like it wouldn't be too hard to semi-automate and if you were careful you could change the sharding function over time in a way that minimized the amount of data relocation from one server to another.
Is this completely barking up the wrong tree?

1 or many sql tables for persisting "families" of properties about one object?

Our application (using a SQL Server 2008 R2 back-end) stores data about remote hardware devices reporting back to our servers over the Internet. There are a few "families" of information we have about each device, each stored by a different server application into a shared database:
static configuration information stored by users using our web app. e.g. Physical Location, Friendly Name, etc.
logged information about device behavior, e.g. last reporting time, date the device first came online, whether device is healthy, etc.
expensive information re-computed by scheduled jobs, e.g. average signal strength, average length of transmission, historical failure rates, etc.
These properties are all scalar values reflecting the most current data we have about a device. We have a separate way to store historical information.
The largest number of device instances we have to worry about will be around 100,000, so this is not a "big data" problem. In most cases a database will have 10,000 devices or less to worry about.
Writes to the data about an individual device happens infrequently-- typically every few hours. It's theoretically possible for a scheduled task, user-inputted configuration changes, and dynamic data to all make updates for the same device at the same time, but this seems very rare. Reads are more frequent: probably 10x per minute reads against at least one device in a database, and several times per hour for a full scan of some properties of all devices described in a database.
Deletes are relatively rare, in fact many cases we only "soft delete" devices so we can use them for historical reporting. New device inserts are more common, perhaps a few every day.
There are (at least) two obvious ways to store this data in our SQL database:
The current design of our application stores each of these families of information in separate tables, each with a clustered index on a Device ID primary key. One server application writes to one table each.
An alternate implementation that's been proposed is to use one large table, and create covering indexes as needed to accelerate queries for groups of properties (e.g. all static info, all reliability info, etc.) that are frequently queried together.
My question: is there a clearly superior option? If the answer is "it depends" then what are the circumstances which would make "one large table" or "multiple tables" better?
Answers should consider: performance, maintainability of DB itself, maintainability of code that reads/writes rows, and reliability in the face of unexpected behavior. Maintanability and reliability are probably a higher priority for us than performance, if we have to trade off.
Don't know about a clearly superior option, and I don't know about sql-server architecture. But I would go for the first option with separate tables for different families of data. Some advantages could be:
granting access to specific sets of data (may be desirable for future applications)
archiving different famalies of data at different rates
partial functionality of the application in the case of maintenance on a part (some tables available while another is restored)
indexing and partitioning/sharding can be performed on different attributes (static information could be partitioned on device id, logging information on date)
different families can be assigned to different cache areas (so the static data can remain in a more "static" cache, and more rapidly changing logging type data can be in another "rolling" cache area)
smaller rows pack more rows into a block which means fewer block pulls to scan a table for a specific attribute
less chance of row chaining if altering a table to add a row, easier to perform maintenance if you do
easier to understand the data when seprated into logical units (families)
I wouldn't consider table joining as a disadvantage when properly indexed. But more tables will mean more moving parts and the need for greater awareness/documentation on what is going on.
The first option is the recognized "standard" way to store such data in a relational database.
Although a good design would probably result in more tables. Relational databases software such as SQLServer were designed to store and retrieve data in multiple tables quickly and efficiently.
In addition such designs allow for great flexibility, both in terms of changing the database to store extra data, and, in allowing unexpected/unusual queries against the data stored.
The single table option sounds beguilingly simple to practitioners unfamiliar with Relational databases. In practice they perform very badly, are difficult to manage, and lead to a high number of deadlocks and timeouts.
They also lead to development paralysis. You cannot add a requested feature because it cannot be done without a total redesign of the "simple" database schema.

real-time data warehouse for web access logs

We're thinking about putting up a data warehouse system to load with web access logs that our web servers generate. The idea is to load the data in real-time.
To the user we want to present a line graph of the data and enable the user to drill down using the dimensions.
The question is how to balance and design the system so that ;
(1) the data can be fetched and presented to the user in real-time (<2 seconds),
(2) data can be aggregated on per-hour and per-day basis, and
(2) as large amount of data can still be stored in the warehouse, and
Our current data-rate is roughly ~10 accesses per second which gives us ~800k rows per day. My simple tests with MySQL and a simple star schema shows that my quires starts to take longer than 2 seconds when we have more than 8 million rows.
Is it possible it get real-time query performance from a "simple" data warehouse like this,
and still have it store a lot of data (it would be nice to be able to never throw away any data)
Are there ways to aggregate the data into higher resolution tables?
I got a feeling that this isn't really a new question (i've googled quite a lot though). Could maybe someone give points to data warehouse solutions like this? One that comes to mind is Splunk.
Maybe I'm grasping for too much.
UPDATE
My schema looks like this;
dimensions:
client (ip-address)
server
url
facts;
timestamp (in seconds)
bytes transmitted
Seth's answer above is a very reasonable answer and I feel confident that if you invest in the appropriate knowledge and hardware, it has a high chance of success.
Mozilla does a lot of web service analytics. We keep track of details on an hourly basis and we use a commercial DB product, Vertica. It would work very well for this approach but since it is a proprietary commercial product, it has a different set of associated costs.
Another technology that you might want to investigate would be MongoDB. It is a document store database that has a few features that make it potentially a great fit for this use case.
Namely, the capped collections (do a search for mongodb capped collections for more info)
And the fast increment operation for things like keeping track of page views, hits, etc.
http://blog.mongodb.org/post/171353301/using-mongodb-for-real-time-analytics
Doesn't sound like it would be a problem. MySQL is very fast.
For storing logging data, use MyISAM tables -- they're much faster and well suited for web server logs. (I think InnoDB is the default for new installations these days - foreign keys and all the other features of InnoDB aren't necessary for the log tables). You might also consider using merge tables - you can keep individual tables to a manageable size while still being able to access them all as one big table.
If you're still not able to keep up, then get yourself more memory, faster disks, a RAID, or a faster system, in that order.
Also: Never throwing away data is probably a bad idea. If each line is about 200 bytes long, you're talking about a minimum of 50 GB per year, just for the raw logging data. Multiply by at least two if you have indexes. Multiply again by (at least) two for backups.
You can keep it all if you want, but in my opinion you should consider storing the raw data for a few weeks and the aggregated data for a few years. For anything older, just store the reports. (That is, unless you are required by law to keep around. Even then, it probably won't be for more than 3-4 years).
Also, look into partitioning, especially if your queries mostly access latest data; you could -- for example -- set-up weekly partitions of ~5.5M rows.
If aggregating per-day and per hour, consider having date and time dimensions -- you did not list them so I assume you do not use them. The idea is not to have any functions in a query, like HOUR(myTimestamp) or DATE(myTimestamp). The date dimension should be partitioned the same way as fact tables.
With this in place, the query optimizer can use partition pruning, so the total size of tables does not influence the query response as before.
This has gotten to be a fairly common data warehousing application. I've run one for years that supported 20-100 million rows a day with 0.1 second response time (from database), over a second from web server. This isn't even on a huge server.
Your data volumes aren't too large, so I wouldn't think you'd need very expensive hardware. But I'd still go multi-core, 64-bit with a lot of memory.
But you will want to mostly hit aggregate data rather than detail data - especially for time-series graphing over days, months, etc. Aggregate data can be either periodically created on your database through an asynchronous process, or in cases like this is typically works best if your ETL process that transforms your data creates the aggregate data. Note that the aggregate is typically just a group-by of your fact table.
As others have said - partitioning is a good idea when accessing detail data. But this is less critical for the aggregate data. Also, reliance on pre-created dimensional values is much better than on functions or stored procs. Both of these are typical data warehousing strategies.
Regarding the database - if it were me I'd try Postgresql rather than MySQL. The reason is primarily optimizer maturity: postgresql can better handle the kinds of queries you're likely to run. MySQL is more likely to get confused on five-way joins, go bottom up when you run a subselect, etc. And if this application is worth a lot, then I'd consider a commercial database like db2, oracle, sql server. Then you'd get additional features like query parallelism, automatic query rewrite against aggregate tables, additional optimizer sophistication, etc.

What is the best way to partition large tables in SQL Server?

In a recent project the "lead" developer designed a database schema where "larger" tables would be split across two separate databases with a view on the main database which would union the two separate database-tables together. The main database is what the application was driven off of so these tables looked and felt like ordinary tables (except some quirky things around updating). This seemed like a HUGE performance problem. We do see problems with performance around these tables but nothing to make him change his mind about his design. Just wondering what is the best way to do this, or if it is even worth doing?
I don't think that you are really going to gain anything by partitioning the table across multiple databases in a single server. All you have essentially done there is increased the overhead in working with the "table" in the first place by having several instances (i.e. open in two different DBs) of it under a single SQL Server instance.
How large of a dataset do you have? I have a client with a 6 million row table in SQL Server that contains 2 years worth of sales data. They use it transactionally and for reporting without any noticiable speed problems.
Tuning the indexes and choosing the correct clustered index is crucial to performance of course.
If your dataset is really large and you are looking to partition, you will get more bang for your buck partitioning the table across physical servers.
Partitioning is not something to be undertaken lightly as there can be many subtle performance implications.
My first question is are you referring simply to placing larger table objects in separate filegroups (on separate spindles) or are you referring to data partitioning inside of a table object?
I suspect that the situation described is an attempt to have the physical storage of certain large tables on different spindles from the rest of the tables. In this case, adding the extra overhead of separate databases, losing any ability to enforce referential integrity across databases, and the security implications of enabling cross-database ownership chaining does not provide any benefit over using multiple filegroups within a single database. If, as is quite possible, the separate databases you refer to in your question are not even stored on separate spindles but are all stored on the same spindle then you negate even the slight performance benefit you could have gained by physically separating your disk activity and have received absolutely no benefit.
I would suggest instead of using additional databases to hold large tables you look into the Filegroup topic in SQL Server Books Online or for a quick review see this article:
If you are interested in data partitioning (including partitioning into multiple file groups) then I recommend reading articles by Kimberly Tripp, who gave an excellent presentation at the time SQL Server 2005 came out about the improvements available there. A good place to start is this whitepaper
Which version of SQL Server are you using? SQL Server 2005 has partitioned tables, but in 2000 (or 7.0) you needed to use partition views.
Also, what was the reasoning for putting the table partitions in a separate database?
When I've had to partition tables in the past (pre-2005), it's usually by a date column or something similar, with a view over the various partitions. Books Online has a section that talks about how to do this and all of the rules around it. You need to follow the rules to make it work how it's supposed to work.
The key thing to remember is that your partitioning column must be part of the primary key and you want to try to always use that column in any access against the table so that the optimizer can ignore partitions that shouldn't be affected by the query.
Look up "partitioned table" in MSDN and you should be able to find a more complete tutorial for SQL Server 2005 partitioned tables as well as advice on how to set them up for maximum performance.
Are you asking about best practices in terms of database design, or convincing your lead to change his mind? :)
In terms of design... Back in the goode olde days, vertical partitioning was sometimes needed to work around database engine limitations, where the number of columns in a table was a hard limit, like 255 columns. These days the main benefits are purely for performance: putting rarely used columns, or blobs on a separate disk array. But if you're regularly pulling things from both tables it will likely be a loss. It sounds like your lead is suffering from a case of premature optimisation.
In terms of telling your lead is wrong... that requires diplomacy. If he's aware of mutterings of discontent in terms of performance, a benchmark is probably the best way to show the difference.
Create a new physical table somewhere with 'create table t1 as select * from view1' and then run some lengthy batch with the vertically partitioned table and your new table. If it's as bad as you say, the difference should be evident.
But this too may be premature optimisation. Find out what the end-users think of the performance. If the performance is good enough, for some definition of good, then don't fix what ain't broke.
There is a definite benefit for table partitioning (regardless whether it's on same or different filegroups /disks). If the partition column is correctly selected, you'll realize that your queries will hit only the required partition. So imagine if you have 100 million records (I've partitioned tables much bigger than that - about 20+ Billion rows) and if for the most part, more than 70% of your data access is only a certain category or timeline or type of data then it helps to keep the most accessed data in a separate partition. Plus you can align the partition with separate file groups with various type of disks (SATA, Fiber channel, SSDs) so that the most accessed/busy data are on the fastest storage and the least/rarely accessed are virtually on slower disks.
Although, in SQL Server there's limited partitioning ability, unlike Oracle. You can choose only one column for partitioning (even in SQL 2008). So you've to choose a column wisely where that column also is part of most of your frequent queries. For the most part, people find it easy to choose to partition by a date column. However although it seems logical to partition that way, if your queries do not have that column as part of the condition, you won't be gaining sufficient benefits from partitioning (in other words, your query will hit all the partition regardless).
It's much easier to partition for data warehouse/data mining type databases than OLTP as most DW database queries are limited by time period.
That's why these days due to the volume of data being handled by databases, it's wise to design the application in such a way that ever query is limited by some broader group such as time, geographical location or such so that when such columns are chosen for partitioning you'll gain maximum benefits.
I would disagree with the assumption that nothing can be gained by partitioning.
If the partition data is physically and logically aligned, then the potential IO of queries should be dramatically reduced.
For example, We have a table which has the batch field as an INT representing an INT.
If we partition the data by this field and then re-run a query for a particular batch, we should be able to run set statistics io ON before and after partitioning and see a reduction in IO,
If we have a million rows per partition and each partition is written to a separate device. The query should be able to eliminate the nonessential partitions.
I've not done a lot of partitioning on SQL Server, but I do have experience of partitioning on Sybase ASE, and this is known as partition eliminiation. When I have time I'm going to test out the scenario on a SQL Server 2005 machine.