How does SQL Server Page Allocation work? - sql

My question follows on from a previous one that I asked here (Using SQLBulkCopy - Significantly larger tables in SQL Server 2016 than in SQL Server 2014)
where my use of SqlBulkCopy with SQL Server 2016 was causing large amounts of space to be allocated but not used. This was due to an optimisation in SQL Server 2016 (detailed fully here - https://blogs.msdn.microsoft.com/sql_server_team/sql-server-2016-minimal-logging-and-impact-of-the-batchsize-in-bulk-load-operations) that allocates storage with speed in mind and does not go back to fill in small amounts of already-allocated space.
When running the following SQL on the database
SELECT
t.NAME AS TableName,
s.Name AS SchemaName,
p.rows AS RowCounts,
SUM(a.total_pages) * 8 AS TotalSpaceKB,
SUM(a.used_pages) * 8 AS UsedSpaceKB,
(SUM(a.total_pages) - SUM(a.used_pages)) * 8 AS UnusedSpaceKB
FROM
sys.tables t
INNER JOIN
sys.indexes i ON t.OBJECT_ID = i.object_id
INNER JOIN
sys.partitions p ON i.object_id = p.OBJECT_ID AND i.index_id = p.index_id
INNER JOIN
sys.allocation_units a ON p.partition_id = a.container_id
LEFT OUTER JOIN
sys.schemas s ON t.schema_id = s.schema_id
WHERE
t.NAME not like 'dt%'
AND t.is_ms_shipped = 0
AND i.OBJECT_ID > 255
GROUP BY
t.Name, s.Name, p.Rows
ORDER BY
UnusedSpaceKB desc
you would see output like the below. (See the amount of UnusedSpace).
There are two ways to fix this.
1. Switch on TraceFlag 692 as detailed in the above article. (Note – This does work, but is not allowable in my current situation for non-technical reasons).
2. Load the data a different way – (i.e. do not use SqlBulkCopy)
To this end I wrote a mechanism that took in a database and generated the necessary SQL, which I then invoked using ExecuteNonQuery. This initially looked very promising, but then I found that some of my tables exhibited the old behaviour. (Note – I proved this by rerunning the process with trace flag 692 on, which I can do in test situations in order to investigate but not in live, and the unwanted extra space did not reappear.)
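For reference, switching the trace flag on for those test runs is a one-liner (a sketch; it needs sysadmin rights, and as noted above it is not an option for me in live):
-- Enable trace flag 692 globally (disables the fast-insert behaviour); alternatively use the -T692 startup parameter
DBCC TRACEON (692, -1);
-- Confirm it is active
DBCC TRACESTATUS (692);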
In my current situation I am preparing some SQL and firing it at the database using ExecuteNonQuery. The table that I am inserting into looks like
CREATE TABLE [dbo].[TestTable](
[TagIndex] [smallint] NOT NULL,
[TimeStamp] [datetime] NOT NULL,
[SystemTime] [datetime] NOT NULL,
[ModuleControllerWatchdog] [smallint] NULL
) ON [PRIMARY]
GO
What is strange is that:
If I run my process and create SQL that sends 511 rows to the database at a time, the storage looks fine and I get a result that looks like this.
If I run my process and create SQL that sends 512 rows to the database at a time, the storage expands and I get wasted storage (as below).
(Note - I have left out the SQL that was generated, as I think it is quite mundane and I did not want to further bloat an already long question, but if anybody does want extra information of this nature then feel free to ask.)
After all that, my question: does anybody know what is happening here? Does SQL Server have a tipping point or similar whereby it decides to do a bulk insert, or use a similar mechanism, once the insert has grown to a certain size?
Any help would be greatly appreciated.
As per Dan Guzman's request I reran the process twice - once with 511 rows and once with 512 rows (on the same data, so that I can present the difference here).
511 row chunks
512 row chunks
Having thought about it a little, I then decided to include the results of the following query, which shows the distribution and allocation of pages and highlights the gaps that are present when the space is being wasted.
use [Extract_ForGuzman512Rows]
select * from sys.dm_db_database_page_allocations
(DB_id() , object_id('[dbo].[TestTable]') , NULL , NULL , 'DETAILED')
From the 511 rows table
From the 512 rows table
Following up on the idea from Dan that this may be about how much data fits on one page, I decided to run my process into different databases (one with 511 chunks and one with 512 chunks) and to run only until the first chunk had gone in so that I could check in particular to see how the pages were being allocated right from the start.
With 511-size chunks I could see that 2 data pages were being used (along with a third for the index page), meaning that the chunk went over a single page's worth of data (as below).
With 512-size chunks I could see that again 2 data pages are being used (along with a third for the index page), but these are spaced out, whereas they are tightly packed in the 511 example.
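As a back-of-the-envelope check, each row in TestTable is roughly 20 bytes of fixed data (two smallints and two datetimes) plus per-row overhead, so only a few hundred rows fit on an 8 KB data page, which is why either chunk size spills onto a second page. The measured record size and page counts can be checked with something like the sketch below (run against the relevant test database):
SELECT index_type_desc,
       alloc_unit_type_desc,
       page_count,
       record_count,
       avg_record_size_in_bytes
FROM sys.dm_db_index_physical_stats
     (DB_ID(), OBJECT_ID('[dbo].[TestTable]'), NULL, NULL, 'DETAILED')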

Related

Very Slow Updates on tables

We are reproducing an existing solution in Azure DWH (now Synapse). It loads incremental data and has a flag that is set to 0 to indicate the data needs to be "processed" into the DIMs and FACTs. Once processed, the tables are added to a list that needs to be reset.
A maintenance job in the evening runs through the list and runs
UPDATE xxxx SET FLAG_COLUMN = 1 WHERE FLAG_COLUMN = 0
I'm getting really variable performance on this: 1.8bn row tables update in 5 minutes, while smaller 700m row tables are taking close to an hour. Almost all the tables are columnstores. I have tried simplifying the UPDATE to
UPDATE xxxx SET FLAG_COLUMN = 1
I would expect this to be pretty quick for a columnstore, as it's flushing the entire column, but this seems to make no meaningful difference between columnstores and heaps. There are 1800 tables that need to be reset every day. Running these 40 at a time is still going to take 2-3+ hours for a reset at the best speeds I have achieved. For the queries that are crawling it's unachievable in a day.
All this is running while the environment is quiet, so it's not an issue with other queries interfering. I haven't explored altering the resource class as yet, but the account it's running under is staticrc40 and it runs the ADF-driven loads way, way faster than these updates at this level of parallelism (in terms of queries).
Has anyone got any advice, or ideas of other things I might try? The tables vary in size from low 100k to 18bn rows (thankfully most are in the sub-10m range). We're running the instance at scale DW3000c and it's quick enough on most other stuff we run.
These relatively simple UPDATEs just seem to be terminally sub-optimal. Any advice would be genuinely appreciated.
Many thanks
So this turned out to be pretty straightforward: we weren't reproducing the existing behaviour.
In the migration there are approx. 1800 ODS tables, all of which have the ROW_SENT_TO_EDW flag, but a huge number of them have never had this flag set, so all their rows are 0. The "slow" tables were the ones where all the rows were 0, so the update was against many hundreds of millions or billions of rows. This confirms it's the transaction logging for the UPDATE that's the root cause of the slowness, and the randomness was down to how many rows we were having to update. (I still don't understand why setting the entire column to a single value on a COLUMNSTORE INDEX table isn't super fast.)
We ran some analysis looking at using HEAPs instead of CLUSTERED COLUMNSTORE INDEX (CCI) tables. This was all at DW3000c.
Impact of HEAP versus CCI
CCI table (update all 1.8bn rows)
-- (staticrc40) 1hr 43m 47s
-- (largerc) 1hr 4m 11s
HEAP table (update all 1.8bn rows)
-- (staticrc40) 2hr 13m 5s
-- (largerc) 48m 47s
CTAS the entire table out just switching the column value
CTAS is Microsoft's answer to everything
CCI
-- (staticrc40) 56m
HEAP
-- (staticrc40) 1h 35m
Yes, it's faster than updating the entire table, but still too slow for us.
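For reference, the CTAS-and-swap pattern we timed looks roughly like the sketch below; the column list and distribution choice are placeholders rather than our real definitions:
-- Rebuild the table with the flag already flipped (col1, col2 stand in for the remaining columns)
CREATE TABLE BKP.SRC_xxx_new
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
AS
SELECT col1, col2, CAST(1 AS INT) AS ROW_SENT_TO_EDW
FROM BKP.SRC_xxx;
-- Swap the new table in and drop the old one
RENAME OBJECT BKP.SRC_xxx TO SRC_xxx_old;
RENAME OBJECT BKP.SRC_xxx_new TO SRC_xxx;
DROP TABLE BKP.SRC_xxx_old;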
Impact of resource class
CCI table updating 14m rows of 1.8bn (smallrc typical user)
-- (staticrc40) 1m 30s
-- (smallrc) 1m 40s
CCI table with fresh STATS on the ROW_SENT_TO_EDW column before flipping the 0 to 1
-- (staticrc40) 1m 6s
-- (smallrc) 1m 8s
HEAP table updating 14m rows of 1.8bn (smallrc typical user)
-- (staticrc40) 30s
-- (smallrc) 37s
HEAP table with fresh STATS on the ROW_SENT_TO_EDW column before flipping the 0 to 1
-- (staticrc40) 25s
-- (smallrc) 25s
So it looks like HEAPs perform better than CCI in this particular instance, and resource classes do make a difference, but not as much as you might think (the storage operations and transaction logging are obviously the critical factors).
Sneaky Full table Update
Finally, the full table changes are obviously enormous and sub-optimal, so we decided to look at dropping the column entirely and then adding it back in with a default.
You have to find any stats on the column and drop those before you can drop the column.
Dropping the column takes 15s on CCI and 20s on HEAP
Adding it back in with a default on CCI takes 29s
On HEAP it takes 20s (I was shocked by this as I expected the page writes to take a huge amount of time, but it obviously does something clever under the covers).
You have to drop the default constraint straight away, otherwise you won't be able to drop the column again, and also put the stats back on (this often takes 20s).
But for both types of tables this method is super super fast compared to a full table update
The SQL for this drop and add looks like this:
-- get the name of the stat to drop
SELECT
t.[name] AS [table_name]
, s.[name] AS [table_schema_name]
, c.[name] AS [column_name]
, c.[column_id] AS [column_id]
, t.[object_id] AS [object_id]
,st.[name] As stats_name
, ROW_NUMBER()
OVER(ORDER BY (SELECT NULL)) AS [seq_nmbr]
FROM
sys.tables t
JOIN sys.schemas s ON t.[schema_id] = s.[schema_id]
JOIN sys.columns c ON t.[object_id] = c.[object_id]
INNER JOIN sys.stats_columns l ON l.[object_id] = c.[object_id]
AND l.[column_id] = c.[column_id]
AND l.[stats_column_id] = 1
INNER JOIN sys.stats st
ON t.[object_id] = st.[object_id]
and l.stats_id = st.stats_id
WHERE t.[object_id] = OBJECT_ID('BKP.SRC_xxx')
DROP STATISTICS BKP.SRC_xxx._WA_Sys_0000000E_2CEA8251
-- now alter the actual table
ALTER TABLE BKP.SRC_xxx
DROP COLUMN ROW_SENT_TO_EDW
-- 13s
ALTER TABLE BKP.SRC_xxx
ADD ROW_SENT_TO_EDW INT NOT NULL DEFAULT 1
-- 1s
-- find the constraint
SELECT t.[name] AS Table_Name,
c.[name] AS Column_Name,
dc.[name] as DefaultConstraintName
FROM sys.tables t
INNER JOIN sys.columns c
ON t.[object_id] = c.[object_id]
LEFT OUTER JOIN sys.default_constraints dc
ON t.[object_id] = dc.parent_object_id
AND c.column_id = dc.parent_column_id
WHERE t.[object_id] = OBJECT_ID('BKP.SRC_xxx')
AND c.[name] = 'ROW_SENT_TO_EDW'
-- drop the constraint
ALTER TABLE BKP.SRC_xxx DROP CONSTRAINT [Cnstr_7183c5bec657448da3475af85110123a]
-- 18s
-- create stats
CREATE STATISTICS [thingy] ON BKP.SRC_xxx([ROW_SENT_TO_EDW])
-- 9s
I've used the same pattern of having a BIT column to identify changed/inserted ODS records which should trigger a dimension update.
I didn't specifically test CCI on the source tables; I used heaps as they were all 10M rows or less. However, I did find that it was essential to have a non-clustered index on the BIT column and to use CONVERT(BIT, ...) so that the index could be used. A 0 or 1 in a WHERE clause without a CONVERT/CAST comes across as an INT and can block the ability to use the NCI. This had a huge impact on the performance of both updating the dims based on the flag column and the flag-reset update itself.
UPDATE src_table
SET processed_flag = 1
where processed_flag = CONVERT(BIT, 0)
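The index itself is nothing special; a minimal sketch (table and column names are placeholders matching the snippet above):
-- Non-clustered index on the flag column so the filtered UPDATE can seek rather than scan
CREATE INDEX IX_src_table_processed_flag ON dbo.src_table (processed_flag);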
BTW - totally agree with the sentiment that CTAS is too often the answer to everything when it's not always practical and is often a slower operation simply due to having to copy so much data.

Most efficient way to store queries and counts of large SQL data

I have a SQL Server database with a large amount of data (65 million rows mostly of text, 8Gb total). The data gets changed only once per week. I have an ASP.NET web application that will run several SQL queries on this data that will count the number of rows satisfying various conditions. Since the data gets changed only once per week, what is the most efficient way to store both the SQL queries and their counts for the week? Should I store it in the database or in the application?
If the data is only modified once a week, then as part of, and at the end of, that (ETL?) process, perform your "basic" counts and store the results in a table in the database. Thereafter, rather than running lengthy queries against the big tables, you can just query those small summary tables.
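A minimal sketch of that pattern, assuming hypothetical names (dbo.WeeklyCounts, dbo.BigTable, IsActive) and a refresh run once at the end of the weekly load:
-- One-off: a small table to hold the pre-computed counts
CREATE TABLE dbo.WeeklyCounts
(
    QueryName   varchar(100) NOT NULL PRIMARY KEY,
    RowCnt      bigint       NOT NULL,
    RefreshedAt datetime2    NOT NULL DEFAULT SYSUTCDATETIME()
);
GO
-- Weekly refresh, run after the data load completes
TRUNCATE TABLE dbo.WeeklyCounts;
INSERT INTO dbo.WeeklyCounts (QueryName, RowCnt)
SELECT 'ActiveRows', COUNT_BIG(*) FROM dbo.BigTable WHERE IsActive = 1;
-- The web application then reads the tiny summary table instead of scanning 65 million rows
SELECT RowCnt FROM dbo.WeeklyCounts WHERE QueryName = 'ActiveRows';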
If you do not need 100% up-to-the-minute accurate row counts, you could query SQL Server's internal info:
Select so.name as 'TableName', si.rowcnt as 'RowCount'
from sysobjects so
inner join sysindexes si on so.id = si.id
where so.type = 'u' and indid < 2
Very quick to execute and no extra tables required. Not accurate where many updates are occurring but might be accurate enough in your intended usage. [Thank you to commenters!]
Update: I did a bit of digging and this does produce accurate counts (slower due to the SUM, but still quick):
SELECT OBJECT_SCHEMA_NAME(ps.object_id) AS SchemaName,
OBJECT_NAME(ps.object_id) AS ObjectName,
SUM(ps.row_count) AS row_count
FROM sys.dm_db_partition_stats ps
JOIN sys.indexes i ON i.object_id = ps.object_id
AND i.index_id = ps.index_id
WHERE i.type_desc IN ('CLUSTERED','HEAP')
AND OBJECT_SCHEMA_NAME(ps.object_id) <> 'sys'
GROUP BY ps.object_id
ORDER BY OBJECT_NAME(ps.object_id), OBJECT_SCHEMA_NAME(ps.object_id)
Ref.
Remember that the stored count information was not always 100% accurate in SQL Server 2000. For a new table created on 2005 the counts will be accurate. But for a table that existed in 2000 and now resides on 2005 through a restore or upgrade, you need to run (only once after the move to 2005) either sp_spaceused @updateusage = N'true' or DBCC UPDATEUSAGE with the COUNT_ROWS option.
The queries should be stored as stored procedures or views, depending on complexity.
For your situation I would look into indexed views.
They let you both store a query AND the result set for things like aggregation that otherwise cannot be indexed.
As a bonus, the query optimizer "knows" it has this data as well, so if you check for a count or something else stored in the view index in another query (even one not referencing the view directly) it can still use that stored data.
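A minimal sketch of such an indexed view, assuming a hypothetical dbo.BigTable with a Category column the counts are grouped by:
CREATE VIEW dbo.vwCategoryCounts
WITH SCHEMABINDING
AS
SELECT Category, COUNT_BIG(*) AS RowCnt
FROM dbo.BigTable
GROUP BY Category;
GO
-- Materialise the view; the engine now maintains the counts automatically as data changes
CREATE UNIQUE CLUSTERED INDEX IX_vwCategoryCounts ON dbo.vwCategoryCounts (Category);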

SQL Server 2005: Index bigger than data stored

I created 1 database with 2 file groups: 1 primary and 1 index.
Primary file group includes 1 data file (*.mdf): store all tables
Index file group includes 1 index file (*.ndf): store all indexes
Most of indexes are non-clustered indexes
After a short time using the database, the data file is 2 GB but the index file is 12 GB. I do not know what is causing this in my database.
I have some questions:
How do I reduce the size of the index file?
How do I know what is stored in the index file?
How do I trace all impacts to the index file?
How do I limit size growing of index file?
How do I reduce the size of the index file?
Drop some unneeded indexes or reduce the number of columns in existing ones. Remember that the clustered index column(s) is a "hidden" included column in all non clustered indexes.
If you have an index on a,b,c,d and an index on a,b,c you might consider dropping the second one as the first one covers the second one.
You may also be able to find potentially unused indexes by looking at sys.dm_db_index_usage_stats, as in the sketch below.
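For example, something along these lines (a sketch only; the DMV is reset when the instance restarts, so treat low numbers with care):
SELECT OBJECT_NAME(i.object_id) AS table_name,
       i.name AS index_name,
       ISNULL(s.user_seeks + s.user_scans + s.user_lookups, 0) AS reads,
       ISNULL(s.user_updates, 0) AS writes
FROM sys.indexes i
LEFT JOIN sys.dm_db_index_usage_stats s
       ON s.object_id = i.object_id
      AND s.index_id = i.index_id
      AND s.database_id = DB_ID()
WHERE i.type_desc = 'NONCLUSTERED'
  AND OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
ORDER BY reads ASC, writes DESC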
How do I know what is stored in the index file?
It will store whatever you defined it to store! The following query will help you tell which indexes are using the most space, and whether that is in-row data or LOB data:
SELECT convert(char(8),object_name(i.object_id)) AS table_name, i.name AS index_name,
i.index_id, i.type_desc as index_type,
partition_id, partition_number AS pnum, rows,
allocation_unit_id AS au_id, a.type_desc as page_type_desc, total_pages AS pages
FROM sys.indexes i JOIN sys.partitions p
ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.allocation_units a
ON p.partition_id = a.container_id
order by pages desc
My guess (which I think is where marc_s is also headed) is that you've declared your clustered indexes for at least some of your tables to be on the index file group. The clustered index determines how (and where) the actual data for your table is stored.
Posting some of your code would certainly help others pinpoint the problem though.
I think that Martin Smith answered your other questions pretty well. I'll just add this... If you want to limit index sizes you need to evaluate your indexes. Don't add indexes just because you think that you might need them. Do testing with realistic (or ideally real-world) loads on the database to see which indexes will actually give you needed boosts to performance. Indexes have costs to them. In addition to the space cost which you're seeing, they also add to the overhead of inserts and updates, which have to keep the indexes in sync. Because of these costs, you should always have a good reason to add an index and you should consciously think about the trade-offs.
Consider that it is actually quite common for the total storage required for Indexes to be greater than the storage required for the table data within a given database.
Your particular scenario, however, does appear to be quite excessive. As others have pointed out, if you have assigned the clustered index for a given table to reside in a separate data file (your index data file), then the entire physical table itself will reside in that file too, because in a manner of speaking the clustered index is the table.
Providing details of your Table Schema and Index Structures will enable us to provide you with more specific guidance.
Other posters have mentioned that:
You should review your index definitions for duplicate indexes. Take a look at Identifying Overlapping Indexes by Brent Ozar.
You should look to identify unused indexes. Take a look at the SQL Server Pedia article Finding Unused Indexes.
Other avenues to explore include reviewing the fragmentation of your indexes, as this can increase the storage requirements.
Heavy fragmentation, particularly in the Clustered Index of a table containing LOB data, can result in a significant increase in storage needs. Reorganizing the Clustered Index on tables that contain LOB data will compact the LOB data.
See Reorganizing and Rebuilding Indexes
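The reorganize itself is a one-liner; a sketch against a hypothetical table name:
-- Reorganize all indexes on the table and compact LOB data (an online operation)
ALTER INDEX ALL ON dbo.MyBigTable REORGANIZE WITH (LOB_COMPACTION = ON);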
Martin Smith's answer is almost what I needed...
Here is how you sort by index size in GB (SQL Server uses 8 KB pages, i.e. 128 pages per MB):
SELECT
object_name(p.object_id) AS table_name
, i.name AS index_name
, i.index_id
, i.type_desc AS index_type
-- , partition_id
-- , partition_number AS pnum
-- , allocation_unit_id AS au_id
, rows
, a.type_desc as page_type_desc
, total_pages/(1024 * 128.0) AS sizeGB
FROM
sys.indexes i
JOIN sys.partitions p ON i.object_id = p.object_id AND i.index_id = p.index_id
JOIN sys.allocation_units a ON p.partition_id = a.container_id
JOIN sys.all_objects ao ON (ao.object_id = i.object_id)
WHERE ao.type_desc = 'USER_TABLE'
ORDER BY
-- table_name
sizeGB DESC

select count(*) vs keep a counter

Assuming indexes are put in place, and absolute-count-accuracy is not necessary (it's okay to be off by one or two), is it okay to use:
Option A
select count(*)
from Table
where Property = @Property
vs
Option B
update PropertyCounters
SET PropertyCount = PropertyCount + 1
where Property = @Property
then doing:
select PropertyCount
from PropertyCounters
where Property = @Property
How much performance degradation can I reasonably expect from doing select count(*) as the table grows into thousands/millions of records?
Keeping a separate count column in addition to the real data is a denormalisation. There are reasons why you might need to do it for performance, but you shouldn't go there until you really need to. It makes your code more complicated, with more chance of inconsistencies creeping in.
For the simple case where the query really is just SELECT COUNT(property) FROM table WHERE property=..., there's no reason to denormalise; you can make that fast by adding an index on the property column.
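For example (a sketch, using the table and column names from the question):
-- A narrow index on Property lets the COUNT be answered with an index seek/range scan
CREATE NONCLUSTERED INDEX IX_Table_Property ON dbo.[Table] (Property);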
You didn't specify the platform, but since you use T-SQL syntax for @variables I'll venture a SQL Server-specific answer:
COUNT(*), or strictly speaking COUNT_BIG(*), is an expression that can be used in indexed views; see Designing Indexed Views.
create view vwCounts
with schemabinding
as select Property, count_big(*) as Count
from dbo.Table
group by Property;

create unique clustered index cdxCounts on vwCounts(Property);

select Count
from vwCounts with (noexpand)
where Property = @property;
On Enterprise Edition the optimizer will even use the indexed view for your original query:
select count_big(*)
from Table
where Property = @property;
So in the end you get to have your cake and eat it too: the count is already aggregated and maintained for you, for free, by the engine. The price is that updates have to maintain the indexed view (they will not recompute the aggregate count, though), and the aggregation will create hot spots for contention (locks on separate rows in Table will contend for the same count update on the indexed view).
If you say that you do not need absolute accuracy, then Option B is a strange approach. If Option A becomes too heavy (even after adding indexes), you can cache the output of Option A in memory or in another table (your PropertyCounters), and periodically refresh it.
This isn't something that can be answered in general SQL terms. Quite apart from the normal caveats about indexes and so on affecting queries, it's also an area where there are considerable differences between platforms.
I'd bet on better performance on this from SQL Server than Postgres, to the point where I'd consider the latter approach sooner on Postgres and not on SQL Server. However, with a partial index set just right for matching the criteria, I'd bet on Postgres beating out SQL Server. That's just what I'd bet small winnings on though, either way I'd test if I needed to think about it for real.
If you do go for the latter approach, enforce it with a trigger or similar, so that you can't become inaccurate.
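If you do keep a PropertyCounters table, here is a sketch of the kind of trigger that keeps it honest for multi-row changes (it assumes counter rows already exist for every Property value; the table names follow the question and are otherwise placeholders):
CREATE TRIGGER trg_Table_MaintainPropertyCounters
ON dbo.[Table]
AFTER INSERT, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- add counts for newly inserted rows
    UPDATE pc
    SET pc.PropertyCount = pc.PropertyCount + i.cnt
    FROM dbo.PropertyCounters pc
    JOIN (SELECT Property, COUNT(*) AS cnt FROM inserted GROUP BY Property) i
        ON i.Property = pc.Property;

    -- subtract counts for deleted rows
    UPDATE pc
    SET pc.PropertyCount = pc.PropertyCount - d.cnt
    FROM dbo.PropertyCounters pc
    JOIN (SELECT Property, COUNT(*) AS cnt FROM deleted GROUP BY Property) d
        ON d.Property = pc.Property;
END;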
On SQL Server, if you don't need absolutely accurate counts, you could also inspect the catalog views. This would be much easier to do - you don't have to keep a count yourself - and it's a lot less taxing on the system. After all, if you need to count all the rows in a table, you need to scan that table, one way or another - no way around that.
With this SQL statement here, you'll get all the tables in your database, and their row counts, as kept by SQL Server:
SELECT
t.NAME AS TableName,
SUM(p.rows) AS RowCounts
FROM
sys.tables t
INNER JOIN
sys.indexes i ON t.OBJECT_ID = i.object_id
INNER JOIN
sys.partitions p ON i.object_id = p.OBJECT_ID AND i.index_id = p.index_id
WHERE
t.NAME NOT LIKE 'dt%' AND
i.OBJECT_ID > 255 AND
i.index_id <= 1
GROUP BY
t.NAME, i.object_id, i.index_id, i.name
ORDER BY
OBJECT_NAME(i.object_id)
I couldn't find any documentation on exactly how current those numbers typically are - but from my own experience they're usually spot on (unless you're doing some bulk loading or something - but in that case, you wouldn't want to constantly scan the table to get the exact count either).

SQL Table Rowcount different from Select Count in SQL Server

I am using Microsoft SQL Server.
I have a Table which had been updated by 80 rows.
If I right-click and look at the table properties, the row count says 10000, but a SELECT COUNT(id) FROM TableName indicates 10080.
I checked the statistics and they also have a rowcount of 10080.
Why is there a difference between the row count in Properties and the SELECT COUNT?
Thanks,
S
This information most probably comes from the sysindexes table (see the documentation), and the information in sysindexes isn't guaranteed to be up to date. This is a known fact in SQL Server.
Try running DBCC UPDATEUSAGE and check the values again.
Ref: http://msdn.microsoft.com/en-us/library/ms188414.aspx
DBCC UPDATEUSAGE corrects the rows, used pages, reserved pages, leaf pages and data page counts for each partition in a table or index. If there are no inaccuracies in the system tables, DBCC UPDATEUSAGE returns no data. If inaccuracies are found and corrected and WITH NO_INFOMSGS is not used, DBCC UPDATEUSAGE returns the rows and columns being updated in the system tables.
Example:
DBCC UPDATEUSAGE (0)
Update the statistics. That's the only way the RDBMS knows the current status of your tables and indexes. This also helps the RDBMS choose the correct execution path for optimal performance.
SQL Server 2005
UPDATE STATISTICS dbOwner.yourTableName;
Oracle
UPDATE STATISTICS yourSchema.yourTableName;
The property info is cached in SSMS.
There are a variety of ways to check the size of a table.
http://blogs.msdn.com/b/martijnh/archive/2010/07/15/sql-server-how-to-quickly-retrieve-accurate-row-count-for-table.aspx mentions 4 of various accuracy and speed.
The ever-reliable full table scan is a bit slow...
SELECT COUNT(*) FROM Transactions
and the quick alternative depends on statistics
SELECT CONVERT(bigint, rows)
FROM sysindexes
WHERE id = OBJECT_ID('Transactions')
AND indid < 2
It also mentions that the SSMS GUI uses the query
SELECT CAST(p.rows AS float)
FROM sys.tables AS tbl
INNER JOIN sys.indexes AS idx ON idx.object_id = tbl.object_id and idx.index_id < 2
INNER JOIN sys.partitions AS p ON p.object_id=CAST(tbl.object_id AS int)
AND p.index_id=idx.index_id
WHERE ((tbl.name=N'Transactions'
AND SCHEMA_NAME(tbl.schema_id)='dbo'))
and that a fast, and relatively accurate way to size a table is
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('Transactions')
AND (index_id=0 or index_id=1);
Unfortunately this last query requires extra permissions beyond basic select.