What is the maximum number of simultaneous connections that DB2 can handle? The connections might come from a single app or from several different apps.
Also, how much data can DB2 hold in its tables, both in a single table and across all tables? What is the approximate maximum number of records and the maximum size of data?
The answer typically depends on your system resources. The DB2 documentation ("Knowledge Center") has a section on "SQL and XML Limits". There you will find the specific system limits, including maximum concurrency and table sizes.
Without any additional software, the maximum concurrency (parallel connections) is 64000.
The maximum table size requires some math. It depends on tablespaces, table organization, page size, number of columns and more. How many petabytes of storage does your system have available...?
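If you want to see what your own instance is currently configured for, here is a minimal sketch, assuming DB2 for LUW and the SYSIBMADM administrative views; the parameter names below are assumptions and may differ slightly by version:

    -- Hedged sketch: connection-related DBM configuration parameters on DB2 for LUW.
    -- The parameter names ('max_connections', 'max_coordagents') are assumptions;
    -- check the configuration documentation for your version.
    SELECT NAME, VALUE
    FROM   SYSIBMADM.DBMCFG
    WHERE  NAME IN ('max_connections', 'max_coordagents');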
Does anyone know what could be going on under the hood to cause the same query to fail at one Azure SQL performance level but work at a higher level?
The query does max out the server to 100% at the Standard level, but I would expect to get an out of memory related exception if this was the issue. But instead I get a
Cannot create a row of size 8075 which is greater than the allowable maximum row size of 8060
I am aware that the query needs to be optimized, but what I am interested in for the purposes of this question is: what about bumping up to Premium 150 DTU would suddenly make the same data not exceed the max row size?
I can make an educated guess as to what your problem is. When you change from one reservation size to another, the resources available to the optimizer change; specifically, it believes it has more memory. Memory is a key component in costing queries in the query optimizer. The plan chosen at the lower reservation size likely has a spool or sort that is trying to create an object in tempdb. The plan at the higher reservation size does not. That is hitting a limitation in the storage engine, since the intermediate table cannot be materialized.
Without looking at the plan, it is not possible to say with certainty whether this is a requirement of the chosen query plan or merely an optimization. However, you can try using the NO_PERFORMANCE_SPOOL hint on the query to see if that makes it work on the smaller reservation size. (Given that it has less memory, I will guess that it is not the issue, however).
https://learn.microsoft.com/en-us/sql/t-sql/queries/hints-transact-sql-query?view=sql-server-2017
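As a sketch only (the table and column names below are placeholders, not from the question, and the hint assumes a reasonably recent compatibility level), the hint is attached via the OPTION clause:

    -- Hypothetical query; dbo.MyWideTable and its columns are invented.
    -- NO_PERFORMANCE_SPOOL asks the optimizer not to add a performance spool,
    -- which may avoid materializing an overly wide intermediate row in tempdb.
    SELECT DISTINCT Col1, Col2, Col3
    FROM   dbo.MyWideTable
    OPTION (NO_PERFORMANCE_SPOOL);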
(Now I am guessing with general advice since I don't know what kind of app you have, but this is based on the normal patterns I see regularly):
If your schema is really wide or poorly defined, please consider revising your table definition to reduce the size of the columns to the right minimum. For data warehousing applications, please consider using dimension tables + surrogate keys. If you are dumping text log files into SQL and then trying to DISTINCT them, note that DISTINCT often implies a sort, which could lead to this kind of issue if the row is too wide (as you are trying to use all columns in the key).
Best of luck on getting your app to work. SQL tends to work very well for relational apps where you think through the details of your schema and indexes a bit. In the future, please post a bit more detail about your query patterns, schema, and query plans so others can help you more precisely.
I am still struggling to identify how the concept of table distribution in Azure SQL Data Warehouse differs from the concept of table partitioning in SQL Server.
The definitions of both seem to achieve the same result.
Azure DW has up to 60 compute nodes as part of its MPP architecture. When you store a table on Azure DW you are storing it amongst those nodes. Your table's data is distributed across these nodes (using hash distribution or round-robin distribution, depending on your needs). You can also choose to have your table (preferably a very small table) replicated across these nodes.
That is distribution. Each node has its own distinct records that only that node worries about when interacting with the data. It's a shared-nothing architecture.
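For illustration (all table and column names below are invented, not from the question), the distribution choice is declared when the table is created:

    -- Hedged Azure SQL Data Warehouse sketch: rows are spread across the
    -- distributions by hashing CustomerKey; ROUND_ROBIN or REPLICATE are the
    -- alternatives mentioned above.
    CREATE TABLE dbo.FactSales
    (
        SaleKey      BIGINT        NOT NULL,
        CustomerKey  INT           NOT NULL,
        SaleDateKey  INT           NOT NULL,
        Amount       DECIMAL(18,2) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH(CustomerKey),
        CLUSTERED COLUMNSTORE INDEX
    );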
Partitioning is completely divorced from this concept of distribution. When we partition a table, we decide which rows belong in which partitions based on some scheme (like partitioning an order table by order.create_date, for instance). A chunk of records for each create_date then gets stored in its own table, separate from any other create_date set of records (invisibly, behind the scenes).
Partitioning is nice because you may find that you only want to select 10 days worth of orders from your table, so you only need to read against 10 smaller tables, instead of having to scan across years of order data to find the 10 days you are after.
Here's an example based on the Microsoft website, where horizontal partitioning is done on the name column with two "shards" based on the names' alphabetical order.
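The original illustration isn't reproduced here, but a minimal plain SQL Server sketch of the same idea (all identifiers invented: names before 'M' land in one partition, the rest in the other) would look like this:

    -- Hedged sketch of horizontal partitioning on a name column.
    CREATE PARTITION FUNCTION pfCustomerName (NVARCHAR(100))
        AS RANGE RIGHT FOR VALUES (N'M');

    CREATE PARTITION SCHEME psCustomerName
        AS PARTITION pfCustomerName ALL TO ([PRIMARY]);

    CREATE TABLE dbo.Customer
    (
        CustomerId INT           NOT NULL,
        Name       NVARCHAR(100) NOT NULL
    ) ON psCustomerName (Name);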
Table distribution is a concept that is only available on MPP-type RDBMSs like Azure DW or Teradata. It's easiest to think of it as a hardware concept that is somewhat divorced from the data. Azure gives you a lot of control here, where other MPP databases base distribution on primary keys. Partitioning is available on nearly every RDBMS (MPP or not) and it's easiest to think of it as a storage/software concept that is defined by and dependent on the data in the table.
In the end, they do both work to solve the same problem. But... nearly every RDBMS concept (indexing, disk storage, optimization, partitioning, distribution, etc.) is there to solve the same problem. Namely: "How do I get the exact data I need out as quickly as possible?" When you combine these concepts together to match your data retrieval needs, you make your SQL requests CRAZY fast even against monstrously huge data.
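To make the "combine these concepts" point concrete, here is a hedged Azure SQL Data Warehouse sketch (all names invented) declaring both a hash distribution and date-based partitions on the same table:

    -- Hypothetical: rows are hash-distributed across the nodes by CustomerKey,
    -- and within each distribution they are partitioned by OrderDateKey, so a
    -- date-bounded query touches only the relevant partitions.
    CREATE TABLE dbo.FactOrders
    (
        OrderKey     BIGINT        NOT NULL,
        CustomerKey  INT           NOT NULL,
        OrderDateKey INT           NOT NULL,
        Amount       DECIMAL(18,2) NOT NULL
    )
    WITH
    (
        DISTRIBUTION = HASH(CustomerKey),
        CLUSTERED COLUMNSTORE INDEX,
        PARTITION ( OrderDateKey RANGE RIGHT FOR VALUES (20240101, 20240201, 20240301) )
    );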
Just for fun, allow me to explain it with an analogy.
Suppose there exists one massive book about all history of the world. It has the size of a 42 story building.
Now what if the librarian splits that book into 1 book per year. That makes it much easier to find all information you need for some specific years. Because you can just keep the other books on the shelves.
A small book is easier to carry too.
That's what table partitioning is about. (Reference: Data Partitioning in Azure)
Keeping chunks of data together, based on a key (or set of columns) that is useful for the majority of the queries and has a nice average distribution.
This can reduce IO because only the relevant chunks need to be accessed.
Now what if the chief librarian unbinds that book and sends sets of pages to many different libraries?
When we then need certain information, we ask each library to send us copies of the pages we need.
Even better, those librarians could already summarize the information of their pages and then just send only their summaries to one library that collects them for you.
That's what the table distribution is about. (Reference: Table Distribution Guidance in Azure)
To spread out the data over the different nodes.
Conceptually they are the same: the basic idea is that the data will be split across multiple stores. However, the implementation is radically different. Under the covers, Azure SQL Data Warehouse manages and maintains the 60 databases (distributions) that each table you define is created within. You do nothing beyond define the keys; the distribution is taken care of. For partitioning, you have to define and maintain pretty much everything to get it to work. There's even more to it, but you get the core idea. These are different processes and mechanisms that are, at the macro level, arriving at a similar end point. However, the processes these things support are very different. Distribution assists in increased performance, while partitioning is primarily a means of improved data management (rolling windows, etc.). These are very different things with different intents, even as they are similar.
SELECT queries on tables with BLOBs are slow, even if I don't include the BLOB column. Can someone explain why, and maybe how to circumvent it? I am using SQL Server 2012, but maybe this is more of a conceptual problem that would be common to other database products as well.
I found this post: SQL Server: select on a table that contains a blob, which shows the same problem, but the accepted answer doesn't explain why this is happening, nor does it provide a good suggestion for how to solve the problem.
If you are asking for a way to solve the performance drag, there are a number of approaches that you can take. Adding indexes to your table should help massively provided you aren't simply selecting the entire recordset. Creating views over the table may also assist. It's also worth checking the levels of index fragmentation on the table as this can cause poor performance and could be addressed with a regular maintenance job. The suggestion of creating a linked table to store the blob data is also a genuinely good one.
However, if your question is asking why it's happening, it is because of the fundamentals of the way MS SQL Server functions. Essentially your database, and all databases on the server, are split into pages: 8 KB chunks of data with a 96-byte header, each page representing what is possible in a single I/O operation. Pages are collected and grouped into extents, 64 KB collections of eight contiguous pages; SQL Server therefore uses sixteen extents per megabyte of data. There are a few differing page types; a data page, for example, won't contain what are termed "Large Objects". These include the data types text, image, varbinary(max), xml data, etc. The LOB page types are also used to store variable-length columns which exceed 8 KB (and don't forget the 96-byte header).
At the end of each page there will be a small amount of free space. Database operations obviously shift these pages around all the time, and free-space allocations can grow massively in a database dealing with large amounts of I/O and random record access/modification. This is why free space in a database can grow massively. There are tools available within the management suite that allow you to reduce or remove free space; basically, this reorganizes the pages and extents.
Now, I may be making a leap here, but I'm guessing that the blobs you have in your table exceed 8 KB. Bear in mind that if they exceed 64 KB they will not only span multiple pages but indeed span multiple extents. The net result of this is that a "normal" table read will cause massive amounts of I/O requests. Even if you're not interested in the BLOB data, the server may have to read through the pages and extents to get at the other table data. This is only compounded as more transactions cause the pages and extents that make up a table to become non-contiguous.
Where "Large Objects" are used, SQL Server writes Row-Overflow values which include a 24bit pointer to where the data is actually stored. If you have several columns on your table which exceed the 8kb page size combined with blobs and impacted by random transactions, you will find that the majority of the work your server is doing is I/O operations to move pages in and out of memory, reading pointers, fetching associated row data, etc, etc... All of which represents serious overhead.
One suggestion, then: keep all the blobs in a separate table with an identity ID, and save only that identity ID in your main table.
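A hedged sketch of that suggestion (all names are placeholders): keep the blob in its own table keyed by an identity, and reference it from the main table only when the blob is actually needed.

    -- Hypothetical split: blobs live in their own table.
    CREATE TABLE dbo.DocumentBlob
    (
        BlobId  INT IDENTITY(1,1) PRIMARY KEY,
        Content VARBINARY(MAX) NOT NULL
    );

    CREATE TABLE dbo.Document
    (
        DocumentId INT           NOT NULL PRIMARY KEY,
        Title      NVARCHAR(200) NOT NULL,
        BlobId     INT           NULL
            REFERENCES dbo.DocumentBlob (BlobId)
    );

    -- A typical listing query never touches the blob pages:
    -- SELECT DocumentId, Title FROM dbo.Document;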
It could be because SQL cannot cache the table pages as easily, and you have to go to the disk more often. I'm no expert as to why, though.
A lot of people frown at BLOBs/images in databases. In SQL Server 2012 there is a sort of compromise where you can configure the database to keep the objects in a file structure, not in the actual DB anymore (the FILESTREAM feature and FileTables); you might want to look into that.
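A minimal sketch of that approach, assuming FILESTREAM is already enabled on the instance and the database has a FILESTREAM filegroup (the table and column names are invented):

    -- Hedged sketch: the varbinary data is stored in the file system via
    -- FILESTREAM, while the row itself stays small; requires a ROWGUIDCOL.
    CREATE TABLE dbo.Photo
    (
        PhotoId   UNIQUEIDENTIFIER ROWGUIDCOL NOT NULL UNIQUE DEFAULT NEWID(),
        FileName  NVARCHAR(260)    NOT NULL,
        PhotoData VARBINARY(MAX)   FILESTREAM NULL
    );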
Our application (using a SQL Server 2008 R2 back-end) stores data about remote hardware devices reporting back to our servers over the Internet. There are a few "families" of information we have about each device, each stored by a different server application into a shared database:
static configuration information stored by users using our web app. e.g. Physical Location, Friendly Name, etc.
logged information about device behavior, e.g. last reporting time, date the device first came online, whether device is healthy, etc.
expensive information re-computed by scheduled jobs, e.g. average signal strength, average length of transmission, historical failure rates, etc.
These properties are all scalar values reflecting the most current data we have about a device. We have a separate way to store historical information.
The largest number of device instances we have to worry about will be around 100,000, so this is not a "big data" problem. In most cases a database will have 10,000 devices or less to worry about.
Writes to the data about an individual device happen infrequently, typically every few hours. It's theoretically possible for a scheduled task, user-entered configuration changes, and dynamic data to all make updates for the same device at the same time, but this seems very rare. Reads are more frequent: probably 10 reads per minute against at least one device in a database, and several times per hour a full scan of some properties of all devices described in a database.
Deletes are relatively rare; in fact, in many cases we only "soft delete" devices so we can use them for historical reporting. New device inserts are more common, perhaps a few every day.
There are (at least) two obvious ways to store this data in our SQL database:
The current design of our application stores each of these families of information in separate tables, each with a clustered index on a Device ID primary key. One server application writes to one table each.
An alternate implementation that's been proposed is to use one large table, and create covering indexes as needed to accelerate queries for groups of properties (e.g. all static info, all reliability info, etc.) that are frequently queried together.
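For concreteness, here is a hedged sketch of both options (the real schema isn't shown in the question, so every name below is invented):

    -- Option 1: one narrow table per family, each clustered on the device ID.
    CREATE TABLE dbo.DeviceConfig
    (
        DeviceId     INT           NOT NULL PRIMARY KEY CLUSTERED,
        FriendlyName NVARCHAR(100) NULL,
        Location     NVARCHAR(200) NULL
    );

    CREATE TABLE dbo.DeviceStatus
    (
        DeviceId      INT       NOT NULL PRIMARY KEY CLUSTERED,
        LastReportUtc DATETIME2 NULL,
        IsHealthy     BIT       NULL
    );

    -- Option 2: one wide table, with a covering index per family of columns.
    CREATE TABLE dbo.Device
    (
        DeviceId      INT           NOT NULL PRIMARY KEY CLUSTERED,
        FriendlyName  NVARCHAR(100) NULL,
        Location      NVARCHAR(200) NULL,
        LastReportUtc DATETIME2     NULL,
        IsHealthy     BIT           NULL,
        AvgSignal     FLOAT         NULL
    );

    -- Covering index so status-only queries never read the other columns.
    CREATE NONCLUSTERED INDEX IX_Device_Status
        ON dbo.Device (DeviceId)
        INCLUDE (LastReportUtc, IsHealthy);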
My question: is there a clearly superior option? If the answer is "it depends" then what are the circumstances which would make "one large table" or "multiple tables" better?
Answers should consider: performance, maintainability of the DB itself, maintainability of code that reads/writes rows, and reliability in the face of unexpected behavior. Maintainability and reliability are probably a higher priority for us than performance, if we have to trade off.
I don't know about a clearly superior option, and I don't know SQL Server's architecture in detail, but I would go for the first option, with separate tables for the different families of data. Some advantages could be:
granting access to specific sets of data (may be desirable for future applications)
archiving different families of data at different rates
partial functionality of the application in the case of maintenance on a part (some tables available while another is restored)
indexing and partitioning/sharding can be performed on different attributes (static information could be partitioned on device id, logging information on date)
different families can be assigned to different cache areas (so the static data can remain in a more "static" cache, and more rapidly changing logging type data can be in another "rolling" cache area)
smaller rows pack more rows into a block which means fewer block pulls to scan a table for a specific attribute
less chance of row chaining when altering a table to add a column, and easier maintenance if you do
easier to understand the data when it is separated into logical units (families)
I wouldn't consider table joining as a disadvantage when properly indexed. But more tables will mean more moving parts and the need for greater awareness/documentation on what is going on.
The first option is the recognized "standard" way to store such data in a relational database, although a good design would probably result in even more tables. Relational database software such as SQL Server was designed to store and retrieve data in multiple tables quickly and efficiently.
In addition such designs allow for great flexibility, both in terms of changing the database to store extra data, and, in allowing unexpected/unusual queries against the data stored.
The single-table option sounds beguilingly simple to practitioners unfamiliar with relational databases. In practice such designs perform very badly, are difficult to manage, and lead to a high number of deadlocks and timeouts.
They also lead to development paralysis: you cannot add a requested feature because it cannot be done without a total redesign of the "simple" database schema.
I am looking for a way to get statistics from both Oracle and DB2 databases for the number of select/update/insert/delete operations performed on every table. To put it another way, I would like to know how many scan operations were performed on a given table versus how many modifying operations were executed.
I found that it is possible to do this in MS SQL Server, as described in http://msdn.microsoft.com/en-us/library/dd894051%28v=sql.100%29.aspx
The reason I need it is that it provides a reasonable statistic for whether it is worthwhile to apply compression to a given table: the better the scan/update ratio, the better a candidate the table is. I think this also holds true for other databases.
So, is it possible to get these statistics in Oracle and/or DB2? Thanks in advance.
In Oracle you can see how many updates/deletes/inserts have been done on a table in sys.dba_tab_modifications. The data is flushed to the table every 4 hours. For the reads you can use dba_hist_seg_stat, part of AWR. Use of this is licensed.
sys.dba_tab_modifications is reset once a table gets new optimizer statistics.
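A hedged example of pulling those counters (column lists can differ slightly between Oracle versions, and the AWR query assumes the licensing caveat mentioned above; the schema name is a placeholder):

    -- Approximate write activity per table since the last statistics refresh.
    SELECT table_owner, table_name, inserts, updates, deletes, timestamp
    FROM   sys.dba_tab_modifications
    WHERE  table_owner = 'MYSCHEMA';

    -- Approximate read activity from AWR (licensed): segment statistics
    -- history joined to the object names.
    SELECT o.owner, o.object_name,
           SUM(s.logical_reads_delta)  AS logical_reads,
           SUM(s.physical_reads_delta) AS physical_reads
    FROM   dba_hist_seg_stat s
    JOIN   dba_hist_seg_stat_obj o
           ON o.obj# = s.obj# AND o.dataobj# = s.dataobj#
    GROUP BY o.owner, o.object_name
    ORDER BY logical_reads DESC;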
My answer applies to the DB2 database engines for Linux, UNIX, and Windows (LUW) platforms, not DB2 for iSeries (AS/400) or DB2 for z/OS, which have significantly different engine internals than the LUW platforms. All of the documentation links I've included reference version 9.7 of DB2 for LUW.
DB2 for LUW provides extensive performance and utilization statistics in every version of the data engine, including the no-cost DB2 Express-C product. The collection of these statistics is governed by a series of database engine settings called system monitor switches. The statistics you seek involve the table monitor switch, and possibly also the statement and UOW (unit of work) monitor switches. When those system monitor switches are enabled, you can retrieve running totals of various performance gauges and counters from snapshot monitors or by selecting from administrative SQL views (in the SYSIBMADM schema) that present the same snapshot monitor output as SQL result sets. The snapshot monitors incur less system overhead than event monitors, which run in the background as a trace and store a stream of detailed information to special tables or files.
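As a hedged illustration against DB2 9.7 for LUW (verify the view and column names in your Knowledge Center version), once the TABLE monitor switch is enabled you can read per-table counters from the SYSIBMADM administrative views:

    -- Rows read versus rows written per table, from the table snapshot view.
    -- Assumes the TABLE monitor switch is on (for example via the DFT_MON_TABLE
    -- configuration parameter), as described above.
    SELECT TABSCHEMA, TABNAME, ROWS_READ, ROWS_WRITTEN
    FROM   SYSIBMADM.SNAPTAB
    ORDER BY ROWS_READ DESC;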
Compression is a licensed feature that alters the internal storage of tables and indexes all the way from the tablespace to the buffer pool (RAM cache) to the transaction log file. In most cases, the additional CPU overhead of compression and decompression is more than offset by the overall reduction in I/O. The deep row compression feature compresses rows in tables by building and using a 12-bit dictionary of multi-byte patterns that can even cross column boundaries. Enabling deep row compression for a table typically reduces its size by 40% or more before DBA intervention. Indexes are compressed through a shorthand algorithm that exploits their sorted nature by omitting common leading bytes between the current and previous index keys.
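A hedged sketch of enabling it for a single table (assumes the compression feature is licensed; the schema and table names are placeholders):

    -- Mark the table for row compression, then reorganize it so the compression
    -- dictionary is (re)built and existing rows are compressed.
    ALTER TABLE MYSCHEMA.BIG_TABLE COMPRESS YES;
    CALL SYSPROC.ADMIN_CMD('REORG TABLE MYSCHEMA.BIG_TABLE RESETDICTIONARY');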