What is the difference/relationship between extent and allocation unit? - sql

Can you explain the difference -or relationship- between 'Extent' and 'Allocation Unit' in SQL?

The allocation unit is basically just a set of pages. It can be small (one page) or large (many pages). It has a metadata entry in sys.allocation_units and is tracked by an IAM chain. The most common use of allocation units is the three well-known AUs of a rowset: IN_ROW_DATA, ROW_OVERFLOW_DATA and LOB_DATA.
An extent is any 8 consecutive pages that start from a page ID that is divisible by 8. SQL Server I/O is performed in an extent-aware fashion: ideally an entire extent is read in at once and an entire extent is written out at once. This is subject to the current state of the buffer pool; for details see How It Works: Bob Dorr's SQL Server I/O Presentation. Extents are usually allocated together, so all pages of an extent belong to the same allocation unit. But since this would lead to overallocation for small tables, there is a special type of extent, the so-called 'mixed' extent, in which each page can belong to a separate allocation unit. For details see Inside The Storage Engine: GAM, SGAM, PFS and other allocation maps.
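To make the "divisible by 8" rule concrete, here is a minimal Python sketch that maps a page ID to its extent. It is not SQL Server code, and the function names are made up for illustration; it relies only on the 8-pages-per-extent rule above and on SQL Server's 8 KB page size.

```python
PAGES_PER_EXTENT = 8      # an extent is 8 consecutive pages
PAGE_SIZE_BYTES = 8192    # SQL Server pages are 8 KB
EXTENT_SIZE_BYTES = PAGES_PER_EXTENT * PAGE_SIZE_BYTES  # 64 KB


def extent_of(page_id: int) -> tuple:
    """Return (first_page_id, last_page_id) of the extent holding page_id."""
    # Round down to the nearest page ID that is divisible by 8.
    first = page_id - (page_id % PAGES_PER_EXTENT)
    return first, first + PAGES_PER_EXTENT - 1


def same_extent(page_a: int, page_b: int) -> bool:
    """True when both pages fall inside the same 8-page extent."""
    return page_a // PAGES_PER_EXTENT == page_b // PAGES_PER_EXTENT
```

For example, page 13 lives in the extent covering pages 8 to 15, while pages 15 and 16 sit in different extents even though they are adjacent.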
So as you can see, the concepts are related but very different. Perhaps you could explain what problem you're trying to solve or why you are interested in these concepts; then we can elaborate.

Each object (be it an index or a heap) has a number of partitions (from 1 up to 15,000). Each partition can have three different allocation units: the HoBT (heap or B-tree, also known as the hobbit), where your actual in-row data is stored; the LOB allocation unit for the LOB types; and the SLOB allocation unit for row-overflow data.
Pages belong to a certain allocation unit. All pages belong to an extent - a group of 8 pages. While the individual pages can belong to different allocation units, they'll always belong to the same object in a uniform extent - while a mixed extent contains pages for different objects and potentially different allocation units.

Related

Does Google BigTable support range scan?

I am a student learning Google BigTable design.
I am confused: each SST is sorted internally, but two SSTs may not be sorted relative to each other. In that case, it seems BigTable doesn't support an efficient range scan on the primary key? For example, "select * where id between 100 and 200": BigTable may need to scan all the SSTs to get the result.
My understanding, then, is that an SST is sorted so that for a single primary key query we can do a binary search within an SST.
Another question I have: is the MemTable sorted? If yes, how? The MemTable needs to be updated frequently. If it uses a data structure like a tree, do we then need to traverse the tree when we write the MemTable into an SST?
It sounds like you've at least been through an overview of the original Bigtable paper, but here's a reference if you haven't read the whole thing; your questions can mostly be answered by a closer read: https://static.googleusercontent.com/media/research.google.com/en//archive/bigtable-osdi06.pdf
Your intuitions about Bigtable are spot on. Both the SStables on disk and the Memtable are sorted based on the primary key, and any read (not just a scan) requires consulting all of them to produce a merged view. However, note that they are all sorted on the same key, so this amounts to a parallel traversal. We seek to the beginning of the range to be read in each sstable and in the memtable, and traverse them in parallel from there.
This process is alluded to in section 5.3: "A valid read operation is executed on a merged view of the sequence of SSTables and the memtable. Since the SSTables and the memtable are lexicographically sorted data structures, the merged view can be formed efficiently."
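The parallel traversal can be sketched in a few lines of Python. This is only an illustration of the merged-view idea, not Bigtable's implementation: the function name, the list-of-tuples representation of each sorted run, and the "newest run wins" rule for duplicate keys are assumptions made for the example, and deletion markers are ignored.

```python
import heapq
from bisect import bisect_left


def range_scan(runs, lo, hi):
    """Yield (key, value) for lo <= key <= hi from a merged view.

    runs: sorted lists of (key, value) pairs, newest first
    (memtable first, then SSTables from newest to oldest).
    """
    def tail(run, age):
        # Binary-search to the start of the range instead of scanning the run.
        start = bisect_left(run, (lo,))
        for i in range(start, len(run)):
            key, value = run[i]
            if key > hi:
                return
            # age breaks ties so the newest version of a key sorts first.
            yield key, age, value

    streams = [tail(run, age) for age, run in enumerate(runs)]
    last_key = object()  # sentinel that equals no real key
    for key, _age, value in heapq.merge(*streams):
        if key != last_key:  # skip older versions of the same key
            yield key, value
            last_key = key


memtable = [(150, "new"), (210, "m")]
sstable_1 = [(90, "a"), (120, "b"), (150, "old")]
sstable_2 = [(100, "c"), (200, "d"), (300, "e")]
result = list(range_scan([memtable, sstable_1, sstable_2], 100, 200))
```

Each run is touched only from the seek position up to the end of the range, so the cost is one seek per run plus the rows actually returned, rather than a full scan of every SSTable.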
Some of these lookups can be mitigated using Bloom Filters as described in section 6 of the paper, but not all.
Of course, if there are too many sstables in a tablet, this will still become inefficient. Section 5.4 of the paper goes into a bit more detail about how this is solved, which is to periodically merge the "stack" of sstables for a tablet into a smaller number of new sstables. If a particular tablet stops receiving writes, this will eventually quiesce its state down to a single file.
Regarding the efficiency of the memtable, the paper does not prescribe a particular data structure. However, suffice it to say that there are many examples of efficient in-memory sorted data structures. In addition, Section 5.4 mentions that there can actually be multiple memtables in a given tablet. By the time we scan a memtable to write it out to disk, we have already installed a new memtable at the top of the tablet "stack" and are serving the latest incoming reads from there.
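As one illustration of a sorted in-memory structure, here is a toy memtable in Python. The class and method names are hypothetical, and a sorted list maintained with bisect is used only because it is short; the list insert is O(n), so a real system would more likely pick something like a skip list or a balanced tree.

```python
from bisect import insort


class Memtable:
    """Toy sorted write buffer; not what Bigtable actually uses."""

    def __init__(self):
        self._keys = []     # always kept in sorted order
        self._values = {}   # key -> latest value

    def put(self, key, value):
        if key not in self._values:
            insort(self._keys, key)  # insert while preserving sort order
        self._values[key] = value    # overwrite keeps only the latest value

    def get(self, key):
        return self._values.get(key)

    def flush(self):
        """Return the contents as a sorted run, ready to be written as an SSTable."""
        return [(k, self._values[k]) for k in self._keys]


m = Memtable()
m.put(5, "a")
m.put(1, "b")
m.put(5, "c")
m.put(3, "d")
run = m.flush()
```

Flushing is a single in-order pass over the structure, which is the same linear cost an in-order tree traversal would have, so sorted order in memory does not make the write-out expensive.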

How to find out how much space a SQL Server table uses?

Is it possible to get the amount of space on disk that a particular table uses? Let's say I have a million users stored in my table and I want to know how much space is required to store all users and/or one of them.
Update:
I'm planning to use Redis to cache some fields from one particular table in memory so the needed data can be retrieved quickly later. So I need to calculate approximately how much space it will take, and thus whether it will fit in memory. It definitely depends on the data types that I use in my table, but if a table consists of several dozen fields it would take too much time to count them one by one.
There is exactly such an answer for MySQL, though it's not suitable for SQL Server: How can you determine how much disk space a particular MySQL table is taking up? You can check it to see what I mean.
If you have SSMS, you can right-click on the table in the Object Explorer, go to Properties, and then look at the Storage page. The field, Data space, is the size of the data in that table, but it probably does not include some of the overhead costs of the table.
This is really an extended comment, because it does not directly answer the question.
For most purposes, you just use the size of the columns, add them together, and multiply by the number of rows. This lowballs the estimate, but it is reasonable. Depending on how you handle the types, it might also be a reasonable estimate of the size of an export of the data.
That said, the storage of tables is a difficult matter. Here are some of the factors you need to take into account:
The size of individual fields. This is made slightly more difficult because some types have varying sizes, so those are entirely data dependent.
The number of pages occupied by a table (or equivalently how full each data page is). Note that this can vary, depending on how full each table is.
The number of pages occupied by "overflow" data types, such as varchar(max).
Whether or not the data pages are compressed or encrypted.
The indexes for the table.
How full each index page is.
And, no doubt, I've left out a bunch of other relevant internal details (here is a place to start on page layouts).
In other words, there isn't a simple answer. Equivalent tables on two different systems could occupy very different amounts of space. This is true of the "same" table on the same system at different times.
The general answer when working with databases is that you need a lot more space than number of rows * row size -- I seem to recall using a factor of 3 at one point in time. In general, storage is pretty cheap, so this is not the limiting factor using a database.
We would need to see your full database schema, with tables and columns and all fields' data types. Without those pieces of information it's just a lucky guess. Here is a helpful cheat sheet of the sizes of each data type: https://www.connectionstrings.com/sql-server-2012-data-types-reference/
Then you just have to do the math and calculate the space needed for X records, where X is your number of rows.
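As a rough sketch of that arithmetic in Python: the column names and per-column byte counts below are made up for illustration (4 bytes for an int, 8 for a datetime, and an assumed average of 20 characters at 2 bytes each for an nvarchar), and the optional factor of 3 is the rule of thumb mentioned in one of the answers above, not a measured value.

```python
def estimate_table_bytes(column_bytes, row_count, overhead_factor=1.0):
    """Sum of per-column sizes times row count, optionally padded for overhead."""
    row_bytes = sum(column_bytes.values())
    return int(row_bytes * row_count * overhead_factor)


# Hypothetical users table; replace with your own columns and sizes.
user_columns = {
    "id": 4,           # int
    "created_at": 8,   # datetime
    "name": 40,        # nvarchar, assuming ~20 characters on average
}

raw_estimate = estimate_table_bytes(user_columns, 1_000_000)
padded_estimate = estimate_table_bytes(user_columns, 1_000_000, overhead_factor=3)
```

For a cache that holds only a few fields, the raw figure is probably the more relevant one; the padded figure is closer to what the table might occupy on disk once pages, indexes and fragmentation are counted.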

Efficiently Computing Significant Terms in SQL

I was introduced to ElasticSearch significant terms aggregation a while ago and was positively surprised how good and relevant this metric turns out to be. For those not familiar with it, it's quite a simple concept - for a given query (foreground set) a given property is scored against the statistical significance of the background set.
For example, if we were querying for the most significant crime types in the British Transport Police:
C = 5,064,554 -- total number of crimes
T = 66,799 -- total number of bicycle thefts
S = 47,347 -- total number of crimes in British Transport Police
I = 3,640 -- total number of bicycle thefts in British Transport Police
Ordinarily, bicycle thefts represent only about 1.3% of crimes (66,799/5,064,554), but for the British Transport Police, who handle crime on railways and stations, about 7.7% of crimes (3,640/47,347) are bike thefts. This is a significant, nearly six-fold increase in frequency.
The significance for "bicycle theft" would be [(I/S) - (T/C)] * [(I/S) / (T/C)] = 0.371...
Where:
C is the number of all documents in the collection
S is the number of documents matching the query
T is the number of documents with the specific term
I is the number of documents that intersect both S and T
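For reference, the formula above is easy to check in Python with the numbers from the example. The function name is just for illustration.

```python
def significance(C, T, S, I):
    """[(I/S) - (T/C)] * [(I/S) / (T/C)] using the definitions above."""
    foreground_rate = I / S   # share of query results containing the term
    background_rate = T / C   # share of all documents containing the term
    return (foreground_rate - background_rate) * (foreground_rate / background_rate)


score = significance(C=5_064_554, T=66_799, S=47_347, I=3_640)
print(round(score, 3))
```

The score is the absolute change in frequency multiplied by the relative change, so a term has to be both noticeably more common and proportionally more common in the foreground set to rank highly.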
For practical reasons (the sheer amount of data I have and huge ElasticSearch memory requirements), I'm looking to implement the significant terms aggregation in SQL or directly in code.
I've been looking at some ways to potentially optimize this kind of query, specifically, decreasing the memory requirements and increasing the query speed, at the expense of some error margin - but so far I haven't cracked it. It seems to me that:
The variables C and S are easily cacheable or queryable.
The variable T could be derived from a Count-Min Sketch instead of querying the database.
The variable I however, seems impossible to derive with the Count-Min Sketch from T.
I was also looking at the MinHash, but from the description it seems that it couldn't be applied here.
Does anyone know about some clever algorithm or data structure that helps tackling this problem?
I doubt a SQL impl will be faster.
The values for C and T are maintained ahead of time by Lucene.
S is a simple count derived from the query results and I is looked up using O(1) data structures. The main cost is the many T lookups for each of the terms observed in the chosen field. Using min_doc_count typically helps drastically reduce the number of these lookups.
"For practical reasons (the sheer amount of data I have and huge ElasticSearch memory requirements)"
Have you looked into using doc values to manage elasticsearch memory better? See https://www.elastic.co/blog/support-in-the-wild-my-biggest-elasticsearch-problem-at-scale
An efficient solution is possible for the case when the foreground set is small enough. Then you can afford processing all documents in the foreground set.
Collect the set {Xk} of all terms occurring in the foreground set for the chosen field, as well as their relative frequencies {fk} in the foreground set (the number of foreground documents containing Xk divided by S).
For each Xk:
Calculate the significance of Xk as (fk - Fk) * (fk / Fk), where Fk = Tk/C is the relative frequency of Xk in the background set.
Select the terms with the highest significance values.
However, due to the simplicity of this approach, I wonder if ElasticSearch already contains that optimization. If it doesn't - then it very soon will!
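A minimal Python sketch of the steps above, assuming the foreground documents are available as sets of terms and that the background counts Tk can be looked up from some precomputed mapping (exact or approximate). The function and parameter names are made up for illustration.

```python
from collections import Counter


def significant_terms(foreground_docs, background_counts, total_docs, top_n=10):
    """Score every term seen in the foreground set and return the top_n.

    foreground_docs: iterable of term collections, one per matching document
    background_counts: mapping term -> T_k (documents containing the term)
    total_docs: C, the number of documents in the whole collection
    """
    docs = [set(doc) for doc in foreground_docs]
    S = len(docs)
    # I_k: number of foreground documents that contain each term.
    intersect_counts = Counter(term for doc in docs for term in doc)

    scores = {}
    for term, I in intersect_counts.items():
        f = I / S                                   # foreground frequency
        F = background_counts[term] / total_docs    # background frequency
        scores[term] = (f - F) * (f / F)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


foreground = [{"bike theft"}, {"bike theft"}, {"burglary"}, {"bike theft", "burglary"}]
background = {"bike theft": 10, "burglary": 50}
ranking = significant_terms(foreground, background, total_docs=100)
```

Only terms that actually occur in the foreground set are ever looked up, which is what keeps the number of T lookups bounded by the size of the query result rather than by the size of the vocabulary.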

Postgres performance improvement and checklist

I'm studying a series of performance issues in my application written in Java, which gets about 100,000 hits per day. Each visit performs on average 5 to 10 reads/writes on the 2 principal database tables (divided equally), each of which holds between 1 and 3 million records (I access the DB via Hibernate).
My two main tables store user information (about 60 columns of type varchar, integer and timestamptz) and another linked to the data to be displayed (with about 30 columns here mainly varchar, integer, timestamptz).
The main problem I have encountered, which may have caused a drop in the performance of my site (load times over 5 seconds, which obviously do not depend only on database performance), is the fill factor, which is currently at the default value of 100 (a value that is appropriate only when the data does not change).
Obviously the fill factor is the same on the indexes (there are 10 btree indexes on each of the 2 tables).
Currently on my main tables I make
40% select operations
30% update operations
20% insert operations
10% delete operations.
My database is also made up of 40 other tables of minor importance (only 3 others have the same cardinality as the user table).
My questions are:
How do you find the right value for the fill factor?
What would be a checklist of things to check to improve the performance of a database of this kind?
The database is on a dedicated server (16 GB RAM, 8 cores) and storage is on an SSD (data is backed up every day and moved to another storage).
You have likely hit the "knee" of your memory usage where the entire index of the heavily used tables no longer fits in shared memory, so disk I/O is slowing it down. Confirm by checking whether disk I/O is higher than normal. If so, try increasing shared memory (shared_buffers), or if that's already maxed, adjust the system shared memory size or add more system memory so you can bump it higher. You'll also probably have to start adjusting temp buffers, work memory and maintenance memory, and WAL parameters like checkpoint_segments (replaced by max_wal_size in PostgreSQL 9.5 and later), etc.
There are some perf tuning hints on PostgreSQL.org, and Google is your friend.
Edit (to address the first comment): The first symptom of not having enough memory is a big drop in performance, everything else being the same. Changing the table fill factor is not going to make a difference if you have hit a knee in memory usage. If anything, it will make load times (which I assume means "db reads") worse, because rows will be spread across more pages on disk with blank space in each page, so more disk I/O is needed for table scans. A fill factor below 100% can help with UPDATE operations, but I've found that adjusting WAL parameters can compensate most of the time when using indexes (unless you've already optimized those). Bottom line: you need to profile all the heavy queries using EXPLAIN to see what will help. At first glance, though, I'm pretty certain this is a memory issue even with the database on an SSD. We're talking about a lot of random reads and random writes, and a lot of SSDs actually get worse than HDDs after a lot of random small writes.

Do relational databases provide a feasible backend for a process historian?

In the process industry, lots of data is read, often at a high frequency, from several different data sources, such as NIR instruments as well as common instruments for pH, temperature, and pressure measurements. This data is often stored in a process historian, usually for a long time.
Due to this, process historians have different requirements than relational databases. Most queries to a process historian require either time stamps or time ranges to operate on, as well as a set of variables of interest.
Frequent and many INSERT, many SELECT, few or no UPDATE, almost no DELETE.
Q1. Are relational databases a good backend for a process historian?
A very naive implementation of a process historian in SQL could be something like this.
+------------------------------------------------+
| Variable |
+------------------------------------------------+
| Id : integer primary key |
| Name : nvarchar(32) |
+------------------------------------------------+
+------------------------------------------------+
| Data |
+------------------------------------------------+
| Id : integer primary key |
| Time : datetime |
| VariableId : integer foreign key (Variable.Id) |
| Value : float |
+------------------------------------------------+
This structure is very simple, but probably slow for normal process historian operations, as it lacks "sufficient" indexes.
But for example, if the Variable table consisted of 1,000 rows (a rather optimistic number), and data for all these 1,000 variables were sampled once per minute (also an optimistic number), then the Data table would grow by 1,440,000 rows per day. Let's continue the example and estimate that each row takes about 16 bytes, which gives roughly 23 megabytes per day, not counting additional space for indexes and other overhead.
23 megabytes as such perhaps isn't that much but keep in mind that numbers of variables and samples in the example were optimistic and that the system will need to be operational 24/7/365.
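The growth estimate is simple enough to spell out in Python, using only the numbers from the example (1,000 variables, one sample per minute, about 16 bytes per row). The yearly figure is just the daily figure multiplied by 365 and, like the daily one, ignores indexes and other overhead.

```python
variables = 1_000
samples_per_day = 24 * 60          # one sample per minute
bytes_per_row = 16                 # rough estimate from the example

rows_per_day = variables * samples_per_day       # 1,440,000 rows
bytes_per_day = rows_per_day * bytes_per_row     # about 23 MB
bytes_per_year = bytes_per_day * 365             # about 8.4 GB

print(f"{rows_per_day:,} rows/day, {bytes_per_day / 1e6:.1f} MB/day, "
      f"{bytes_per_year / 1e9:.1f} GB/year")
```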
Of course, archiving and compression come to mind.
Q2. Is there a better way to accomplish this? Perhaps using some other table structure?
I work with a SQL Server 2008 database that has similar characteristics; heavy on insertion and selection, light on update/delete. About 100,000 "nodes" all sampling at least once per hour. And there's a twist; all of the incoming data for each "node" needs to be correlated against the history and used for validation, forecasting, etc. Oh, there's another twist; the data needs to be represented in 4 different ways, so there are essentially 4 different copies of this data, none of which can be derived from any of the other data with reasonable accuracy and within reasonable time. 23 megabytes would be a cakewalk; we're talking hundreds-of-gigabytes to terabytes here.
You'll learn a lot about scale in the process, about what techniques work and what don't, but modern SQL databases are definitely up to the task. This system that I just described? It's running on a 5-year-old IBM xSeries with 2 GB of RAM and a RAID 5 array, and it performs admirably, nobody has to wait more than a few seconds for even the most complex queries.
You'll need to optimize, of course. You'll need to denormalize frequently, and maintain pre-computed aggregates (or a data warehouse) if that's part of your reporting requirement. You might need to think outside the box a little: for example, we use a number of custom CLR types for raw data storage and CLR aggregates/functions for some of the more unusual transactional reports. SQL Server and other DB engines might not offer everything you need up-front, but you can work around their limitations.
You'll also want to cache - heavily. Maintain hourly, daily, weekly summaries. Invest in a front-end server with plenty of memory and cache as many reports as you can. This is in addition to whatever data warehousing solution you come up with if applicable.
One of the things you'll probably want to get rid of is that "Id" key in your hypothetical Data table. My guess is that Data is a leaf table - it usually is in these scenarios - and this makes it one of the few situations where I'd recommend a natural key over a surrogate. The same variable probably can't generate duplicate rows for the same timestamp, so all you really need is the variable and timestamp as your primary key. As the table gets larger and larger, having a separate index on variable and timestamp (which of course needs to be covering) is going to waste enormous amounts of space - 20, 50, 100 GB, easily. And of course every INSERT now needs to update two or more indexes.
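Here is a small runnable sketch of the natural-key layout, using Python's built-in sqlite3 module purely as a stand-in so the example is self-contained. It is not SQL Server syntax: SQLite's WITHOUT ROWID plays roughly the role that a clustered primary key would play in SQL Server, the TEXT timestamps are an SQLite convenience, and the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Variable (
    Id   INTEGER PRIMARY KEY,
    Name TEXT NOT NULL
);
-- No surrogate Id: (VariableId, Time) is the key and the only index.
CREATE TABLE Data (
    VariableId INTEGER NOT NULL REFERENCES Variable(Id),
    Time       TEXT    NOT NULL,
    Value      REAL    NOT NULL,
    PRIMARY KEY (VariableId, Time)
) WITHOUT ROWID;
""")
conn.execute("INSERT INTO Variable VALUES (1, 'pH')")
conn.executemany(
    "INSERT INTO Data VALUES (?, ?, ?)",
    [
        (1, "2024-01-01T00:00:00", 7.1),
        (1, "2024-01-01T00:01:00", 7.2),
        (1, "2024-01-01T00:05:00", 7.4),
    ],
)

# Typical historian query: one variable, one time range, served by the key itself.
rows = conn.execute(
    "SELECT Time, Value FROM Data "
    "WHERE VariableId = ? AND Time BETWEEN ? AND ? ORDER BY Time",
    (1, "2024-01-01T00:00:00", "2024-01-01T00:02:00"),
).fetchall()

# The same variable cannot report twice for the same timestamp.
try:
    conn.execute("INSERT INTO Data VALUES (1, '2024-01-01T00:00:00', 9.9)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

With this layout every INSERT maintains a single index, and the variable-plus-time-range lookup reads a contiguous slice of it.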
I really believe that an RDBMS (or SQL database, if you prefer) is as capable for this task as any other if you exercise sufficient care and planning in your design. If you just start slinging tables together without any regard for performance or scale, then of course you will get into trouble later, and when the database is several hundred GB it will be difficult to dig yourself out of that hole.
But is it feasible? Absolutely. Monitor the performance constantly and over time you will learn what optimizations you need to make.
It sounds like you're talking about telemetry data (time stamps, data points).
We don't use SQL databases for this (although we do use SQL databases to organize it); instead, we use binary streaming files to capture the actual data. There are a number of binary file formats that are suitable for this, including HDF5 and CDF. The file format we use here is a proprietary compressible format. But then, we deal with hundreds of megabytes of telemetry data in one go.
You might find this article interesting (links directly to Microsoft Word document):
http://www.microsoft.com/caseStudies/ServeFileResource.aspx?4000003362
It is a case study from the McLaren group, describing how SQL Server 2008 is used to capture and process telemetry data from Formula One race cars. Note that they don't actually store the telemetry data in the database; instead, it is stored in the file system, and the FILESTREAM capability of SQL Server 2008 is used to access it.
I believe you're headed down the right path. We have a similar situation where we work. Data comes from various transport / automation systems across various industries such as manufacturing, auto, etc. Mainly we deal with the big 3: Ford, Chrysler, GM. But we've had a lot of data coming in from customers like CAT.
We ended up extracting data into a database and as long as you properly index your table, keep updates to a minimum and schedule maintenance (rebuild indexes, purge old data, update statistics) then I see no reason for this to be a bad solution; in fact I think it is a good solution.
Certainly a relational database is suitable for mining the data after the fact.
Various nuclear and particle physics experiments I have been involved with have explored several points on the spectrum: from not using an RDBMS at all, through storing just the run summaries (or the run summaries and the slowly varying environmental conditions) in the DB, all the way to cramming every bit collected into the DB (though it was staged to disk first).
When and where the data rate allows, more and more groups are moving towards putting as much data as possible into the database.
IBM Informix Dynamic Server (IDS) has a TimeSeries DataBlade and RealTime Loader which might provide relevant functionality.
Your naïve schema records each reading 100% independently, which makes it hard to correlate across readings, both for the same variable at different times and for different variables at (approximately) the same time. That may be necessary, but it makes life harder when dealing with subsequent processing. How much of an issue that is depends on how often you will need to run correlations across all 1000 variables (or even a significant percentage of the 1000 variables, where significant might be as small as 1% and would almost certainly start at 10%).
I would look to combine key variables into groups that can be recorded jointly. For example, if you have a monitor unit that records temperature, pressure and acidity (pH) at one location, and there are perhaps a hundred of these monitors in the plant that is being monitored, I would expect to group the three readings plus the location ID (or monitor ID) and time into a single row:
CREATE TABLE MonitorReading
(
MonitorID INTEGER NOT NULL REFERENCES MonitorUnit,
Time DATETIME NOT NULL,
PhReading FLOAT NOT NULL,
Pressure FLOAT NOT NULL,
Temperature FLOAT NOT NULL,
PRIMARY KEY (MonitorID, Time)
);
This saves having to do self-joins to see what the three readings were at a particular location at a particular time, and uses about 20 bytes instead of 3 * 16 = 48 bytes per row. If you are adamant that you need a unique ID integer for the record, that increases to 24 or 28 bytes (depending on whether you use a 4-byte or 8-byte integer for the ID column).
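The byte counts above work out if every field is assumed to be 4 bytes wide, which is an assumption made for this check rather than something the schema guarantees. Under that assumption, Python's struct module can confirm the arithmetic; many engines use 8 bytes for DATETIME and for double-precision FLOAT, which would raise all of these totals.

```python
import struct

# "<" means packed with no padding; "i" is a 4-byte int, "f" a 4-byte float,
# "q" an 8-byte int. All field widths here are assumptions for illustration.
naive_row = struct.calcsize("<iiif")       # Id, Time, VariableId, Value
grouped_row = struct.calcsize("<iifff")    # MonitorID, Time, pH, Pressure, Temperature
grouped_with_id4 = struct.calcsize("<iiifff")   # plus a 4-byte record ID
grouped_with_id8 = struct.calcsize("<qiifff")   # plus an 8-byte record ID

three_naive_rows = 3 * naive_row
print(naive_row, three_naive_rows, grouped_row, grouped_with_id4, grouped_with_id8)
```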
Yes, a DBMS is appropriate for this, although not the fastest option. You will need to invest in a reasonable system to handle the load though. I will address the rest of my answer to this problem.
It depends on how beefy a system you're willing to throw at the problem. There are two main limiters for how fast you can insert data into a DB: bulk I/O speed and seek time. A well-designed relational DB will perform at least 2 seeks per insertion: one to begin the transaction (in case the transaction cannot be completed), and one when the transaction is committed. Add to this the additional seeks needed to reach your index entries and update them.
If your data are large, then the limiting factor will be how fast you can write data. For a hard drive, this will be about 60-120 MB/s. For a solid state disk, you can expect upwards of 200 MB/s. You will (of course) want extra disks for a RAID array. The pertinent figure is storage bandwidth AKA sequential I/O speed.
If writing a lot of small transactions, the limitation will be how fast your disk can seek to a spot and write a small piece of data, measured in IO per second (IOPS). We can estimate that it will take 4-8 seeks per transaction (a reasonable case with transactions enabled and an index or two, plus some integrity checks). For a hard drive, the seek time will be several milliseconds, depending on disk RPM. This will limit you to several hundred writes per second. For a solid state disk, the seek time is under 1 ms, so you can write several THOUSAND transactions per second.
When updating indices, you will need to do about O(log n) small seeks to find where to update, so the DB will slow down as the record counts grow. Remember that a DB may not write in the most efficient format possible, so data size may be bigger than you expect.
So, in general, YES, you can do this with a DBMS, although you will want to invest in good storage to ensure it can keep up with your insertion rate. If you wish to cut costs, you may want to roll data older than a specific age (say 1 year) into a secondary, compressed archive format.
EDIT:
A DBMS is probably the easiest system to work with for storing recent data, but you should strongly consider the HDF5/CDF format someone else suggested for storing older, archived data. It is a flexible and widely supported format, and provides for compression and VERY efficient storage of large time series and multi-dimensional arrays. I believe it also provides some methods of indexing into the data. You should be able to write a little code to fetch from these archive files if data is too old to be in the DB.
There is probably a data structure that would be more optimal for your given case than a relational database.
Having said that, there are many reasons to go with a relational DB including robust code support, backup & replication technology and a large community of experts.
Your use case is similar to high-volume financial applications and telco applications. Both are frequently inserting data and frequently doing queries that are both time-based and include other select factors.
I worked on a mid-sized billing project that handled cable bills for millions of subscribers. That meant an average of around 5 rows per subscriber times a few million subscribers per month in the financial transaction table alone. That was easily handled by a mid-size Oracle server using (now) 4 year old hardware and software. Large billing platforms can have 10x that many records per unit time.
Properly architected and with the right hardware, this case can be handled well by modern relational DBs.
Years ago, a customer of ours tried to load an RDBMS with real-time data collected from monitoring plant machinery. It didn't work in a simplistic way.
Are relational databases a good backend for a process historian?
Yes, but it needs to store summary data, not details.
You'll need a front end based on in-memory storage and flat files. Periodic summaries and digests can be loaded into an RDBMS for further analysis.
You'll want to look at Data Warehousing techniques for this. Most of what you want to do is to split your data into two essential parts:
Facts. The data that has units. Actual measurements.
Dimensions. The various attributes of the facts -- date, location, device, etc.
This leads you to a more sophisticated data model.
Fact: Key, Measure 1, Measure 2, ..., Measure n, Date, Geography, Device, Product Line, Customer, etc.
Dimension 1 (Date/Time): Year, Quarter, Month, Week, Day, Hour
Dimension 2 (Geography): location hierarchy of some kind
Dimension 3 (Device): attributes of the device
Dimension *n*: attributes of each dimension of the fact
You may want to look at KDB. It is specifically optimized for this kind of usage: many inserts, few or no updates or deletes.
It isn't as easy to use as a traditional RDBMS, though.
The other aspect to consider is what kind of selects you're doing. Relational/SQL databases are great for doing complex joins dependent on multiple indexes, etc. They really can't be beaten for that. But if you're not doing that kind of thing, they're probably not such a great match.
If all you're doing is storing per-time records, I'd be tempted to roll your own file format ... even just output the stuff as CSV (groans from the audience, I know, but it's hard to beat for wide acceptance)
It really depends on your indexing/lookup requirements, and your willingness to write tools to do it.
You may want to take a look at a Stream Data Manager System (SDMS).
While not addressing all your needs (long-time persistence), sliding windows over time and rows and frequently changing data are their points of strength.
Some useful links:
Stanford Stream Data Manager
Stream Mill
Material about Continuous Queries
AFAIK major database makers all should have some kind of prototype version of an SDMS in the works, so I think it's a paradigm worth checking out.
I know you're asking about relational database systems, but those are unicorns. SQL DBMSs are probably a bad match for your needs because no current SQL system (that I know of) provides reasonable facilities to deal with temporal data. Depending on your needs, you might or might not have another option in specialized tools and formats; see e.g. rrdtool.