We are reproducing an existing solution in Azure DWH (now Synapse). It loads incremental data and has a flag that is set to 0 to indicate the data needs to be "processed" into the DIMs and FACTs. Once processed, the tables are added to a list of tables whose flags need to be reset.
A maintenance job in the evening runs through the list and runs
UPDATE xxxx SET FLAG_COLUMN = 1 WHERE FLAG_COLUMN = 0
I'm getting really variable performance on this: 1.8bn row tables update in 5 minutes, while smaller 700m row tables take close to an hour. Almost all the tables are COLUMNSTORES. I have tried simplifying the UPDATE to
UPDATE xxxx SET FLAG_COLUMN = 1
I would expect this to be pretty quick for a columnstore as it's flushing the entire column, but this seems to make no meaningful difference between columnstores and heaps. There are 1,800 tables that need to be reset every day. Running them 40 at a time will still take 2-3+ hours for a full reset at the best speeds I have achieved. For the queries that are crawling, it's unachievable in a day.
All this is running while the environment is quiet, so it's not an issue with other queries interfering. I haven't explored altering the resource class as yet, but the account it's running under is StaticRC40 and it seems to run the ADF-driven loads way, way faster than these updates at this level of parallelism (in terms of the number of concurrent queries).
Has anyone got any advice? Ideas of other things I might try? The tables vary in size from low 100k to 18bn rows (thankfully most are in the sub-10m range). We're running the instance at scale DW3000c and it's quick enough on most other stuff we run.
These relatively simple UPDATEs just seem to be terminally sub-optimal. Any advice would be genuinely appreciated.
Many thanks
So this turned out to be pretty straightforward. We weren't reproducing the existing behavior.
In the migration there are approx. 1800 ODS tables, all of which have the ROW_SENT_TO_EDW flag, but a huge number of them have never had this flag set, so all their rows are 0. The "slow" tables were the ones where all the rows were 0, so the update was against many hundreds of millions or even billions of rows. This confirms it's the TRANSACTION logging for the UPDATE that's the root cause of the slowness, and the randomness was down to how many rows we were having to update. (I still don't understand why setting the entire column from 0 to 1 for a COLUMNSTORE INDEX table isn't super fast.)
Running some analysis and looking at using HEAPs instead of CLUSTERED COLUMNSTORE INDEX (CCI) tables. This was all at DW3000c.
Impact of HEAP versus CCI
CCI table (update all 1.8bn rows)
-- (staticrc40) 1hr 43m 47s
-- (largerc) 1hr 4m 11s
HEAP table (update all 1.8bn rows)
-- (staticrc40) 2hr 13m 5s
-- (largerc) 48m 47s
CTAS the entire table out just switching the column value
CTAS is Microsoft's answer to everything
CCI
-- (staticrc40) 56m
HEAP
-- (staticrc40) 1h 35m
Yes it’s faster than updating the entire table but still too slow for us
Impact of resource class
CCI table updating 14m rows of 1.8bn (smallrc typical user)
-- (staticrc40) 1m 30s
-- (smallrc) 1m 40s
CCI table with fresh STATS on the ROW_SENT_TO_EDW column before flipping the 0 to 1
-- (staticrc40) 1m 6s
-- (smallrc) 1m 8s
HEAP table updating 14m rows of 1.8bn (smallrc typical user)
-- (staticrc40) 30s
-- (smallrc) 37s
HEAP table with fresh STATS on the ROW_SENT_TO_EDW column before flipping the 0 to 1
-- (staticrc40) 25s
-- (smallrc) 25s
So it looks like HEAPs perform better than CCI in this particular case, and resource classes do make a difference, but not as much as you might think (the storage operations and transaction logging are obviously the critical factors).
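For anyone repeating the comparison: resource classes in a dedicated SQL pool are assigned as database role membership, so switching a test user between them is just a role change. A minimal sketch, assuming a made-up user name 'loaduser':
-- assumption: 'loaduser' is a hypothetical database user used only for these timing tests
EXEC sp_addrolemember 'staticrc40', 'loaduser';
-- ...run the timed UPDATE as that user, then switch...
EXEC sp_droprolemember 'staticrc40', 'loaduser';
EXEC sp_addrolemember 'largerc', 'loaduser';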
Sneaky Full table Update
Finally, the full-table changes are obviously enormous and sub-optimal, so we decided to look at dropping the column entirely and then adding it back in with a default.
You have to find any stats on the column and drop those before you can drop the column.
Dropping the column takes 15s on CCI and 20s on HEAP
Adding it back in with a default on CCI takes 29s
On HEAP it takes 20s (shocked by this as I expected the page writes to take a huge amount of time but it obviously does something clever under the covers)
You have to drop the default constraint straight away, otherwise you won't be able to drop the column again next time, and you also need to put the stats back on (this often takes 20s).
But for both types of tables this method is super super fast compared to a full table update
SQL for this drop and add looks like this
-- get the name of the stat to drop
SELECT
    t.[name] AS [table_name]
    , s.[name] AS [table_schema_name]
    , c.[name] AS [column_name]
    , c.[column_id] AS [column_id]
    , t.[object_id] AS [object_id]
    , st.[name] AS [stats_name]
    , ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS [seq_nmbr]
FROM sys.tables t
JOIN sys.schemas s ON t.[schema_id] = s.[schema_id]
JOIN sys.columns c ON t.[object_id] = c.[object_id]
INNER JOIN sys.stats_columns l ON l.[object_id] = c.[object_id]
    AND l.[column_id] = c.[column_id]
    AND l.[stats_column_id] = 1
INNER JOIN sys.stats st ON t.[object_id] = st.[object_id]
    AND l.stats_id = st.stats_id
WHERE t.[object_id] = OBJECT_ID('BKP.SRC_xxx')
DROP STATISTICS BKP.SRC_xxx._WA_Sys_0000000E_2CEA8251
-- now alter the actual table
ALTER TABLE BKP.SRC_xxx
DROP COLUMN ROW_SENT_TO_EDW
-- 13s
ALTER TABLE BKP.SRC_xxx
ADD ROW_SENT_TO_EDW INT NOT NULL DEFAULT 1
-- 1s
-- find the constraint
SELECT t.[name] AS Table_Name,
c.[name] AS Column_Name,
dc.[name] as DefaultConstraintName
FROM sys.tables t
INNER JOIN sys.columns c
ON t.[object_id] = c.[object_id]
LEFT OUTER JOIN sys.default_constraints dc
ON t.[object_id] = dc.parent_object_id
AND c.column_id = dc.parent_column_id
WHERE t.[object_id] = OBJECT_ID('BKP.SRC_xxx')
AND c.[name] = 'ROW_SENT_TO_EDW'
-- drop the constraint
ALTER TABLE BKP.SRC_xxx DROP CONSTRAINT [Cnstr_7183c5bec657448da3475af85110123a]
-- 18s
-- create stats
CREATE STATISTICS [thingy] ON BKP.SRC_xxx([ROW_SENT_TO_EDW])
-- 9s
I've used the same pattern of having a BIT column to identify changed/inserted ODS records which should trigger a dimension update.
I didn't specifically test CCI on the source tables; I used heaps as they were all 10M rows or less. However, I did find that it was essential to have a non-clustered index on the BIT column and to use CONVERT(BIT, ...) so that the index could be used. A literal 0 or 1 in a WHERE clause without a CONVERT/CAST comes across as an INT and can block the use of the NCI. This had a huge impact on the performance of both updating the dims based on the flag column and the reset update itself.
UPDATE src_table
SET processed_flag = 1
where processed_flag = CONVERT(BIT, 0)
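For reference, a minimal sketch of the supporting non-clustered index described above (src_table and processed_flag are the same illustrative names used in the UPDATE):
-- non-clustered index on the BIT flag so the CONVERT(BIT, 0) predicate can use an index seek
CREATE NONCLUSTERED INDEX IX_src_table_processed_flag
ON src_table (processed_flag);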
BTW - totally agree with the sentiment that CTAS is too often the answer to everything when it's not always practical and is often a slower operation simply due to having to copy so much data.
Related
I have a Dataframe in Spark which is registered as a table called A and has 1 billion records and 10 columns. First column (ID) is Primary Key.
Also have another Dataframe which is registered as a table called B and has 10,000 records and 10 columns (same columns as table A, first column (ID) is Primary Key).
Records in Table B are 'Update records'. So I need to update all 10,000 records in table A with records in table B.
I tried first with this SQL query:
select * from A where ID not in (select ID from B) and then to UNION that with table B. The approach is OK, but the first query (select * from A where ID not in (select ID from B)) is extremely slow (hours on a moderate cluster).
Then I tried to speed up first query with LEFT JOIN:
select A.* from A left join B on (A.ID = B.ID ) where B.ID is null
That approach seems fine logically, but it takes way too much memory for the Spark containers (the container is killed by YARN for exceeding memory limits: 5.6 GB of 5.5 GB physical memory used; consider boosting spark.yarn.executor.memory).
What would be a better/faster/less memory consumption approach?
I would go with left join too rather than not in.
A couple of pieces of advice to reduce the memory requirement and improve performance:
Check that the large table is uniformly distributed by the join key (ID). If not, some tasks will be heavily burdened and some only lightly busy, which will cause serious slowness. Do a groupBy on ID and count to measure this.
If the join key is naturally skewed, add more columns to the join condition while keeping the result the same. More columns may increase the chance of shuffling the data uniformly. This is a little hard to achieve.
Memory demand depends on the number of parallel tasks running and the volume of data per task in each executor. Reducing either or both will reduce memory pressure; it will obviously run slower, but that is better than crashing. I would reduce the volume of data per task by creating more partitions on the data: say you have 10 partitions for 1B rows, make it 200 to reduce the volume per task. Use repartition on table A. Don't create too many partitions, because that causes its own inefficiency; 10K partitions may be a bad idea.
There are some parameters to be tweaked, which are explained here.
The small table of 10K rows should be broadcast automatically because it's small. If not, you can increase the broadcast threshold and apply a broadcast hint, as in the sketch below.
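A hedged Spark SQL sketch of the left-join approach with an explicit broadcast hint, assuming A and B are registered as temp views with identical schemas; LEFT ANTI JOIN keeps only the rows of A that have no match in B:
-- rows of A that are not being updated; BROADCAST(B) forces the 10K-row table to be broadcast
SELECT /*+ BROADCAST(B) */ A.*
FROM A LEFT ANTI JOIN B ON A.ID = B.ID
UNION ALL
-- append the updated rows from B
SELECT * FROM B
-- if A is badly skewed on ID, a /*+ REPARTITION(200, ID) */ hint (Spark 2.4+) can spread the work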
My question follows on from a previous one that I asked here (Using SQLBulkCopy - Significantly larger tables in SQL Server 2016 than in SQL Server 2014)
where my use of SqlBulkCopy with SQL 2016 was causing large amounts of space to be allocated but not used. This was due to an optimisation in SQL 2016 (detailed fully here - https://blogs.msdn.microsoft.com/sql_server_team/sql-server-2016-minimal-logging-and-impact-of-the-batchsize-in-bulk-load-operations) that optimised the allocation of storage for speed and did not bother to fill in small amounts of already-allocated storage.
When running the following sql on the database
SELECT
t.NAME AS TableName,
s.Name AS SchemaName,
p.rows AS RowCounts,
SUM(a.total_pages) * 8 AS TotalSpaceKB,
SUM(a.used_pages) * 8 AS UsedSpaceKB,
(SUM(a.total_pages) - SUM(a.used_pages)) * 8 AS UnusedSpaceKB
FROM
sys.tables t
INNER JOIN
sys.indexes i ON t.OBJECT_ID = i.object_id
INNER JOIN
sys.partitions p ON i.object_id = p.OBJECT_ID AND i.index_id = p.index_id
INNER JOIN
sys.allocation_units a ON p.partition_id = a.container_id
LEFT OUTER JOIN
sys.schemas s ON t.schema_id = s.schema_id
WHERE
t.NAME not like 'dt%'
AND t.is_ms_shipped = 0
AND i.OBJECT_ID > 255
GROUP BY
t.Name, s.Name, p.Rows
ORDER BY
UnusedSpaceKB desc
you would see output like the below. (See the amount of UnusedSpace).
There are two ways to fix this.
1. Switch on TraceFlag 692 as detailed in the above article; a minimal sketch of this follows the list below. (Note – this does work, but is not allowable in my current situation for non-technical reasons.)
2. Load the data a different way – (i.e. do not use SqlBulkCopy)
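For completeness, a hedged sketch of option 1 in a test environment - trace flag 692 disables the SQL Server 2016 fast-insert behaviour described in the linked article, and -1 applies it to all sessions:
-- test environments only: turn the fast-insert optimisation off globally
DBCC TRACEON (692, -1);
-- ...rerun the load and re-check the space query above...
DBCC TRACEOFF (692, -1);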
To this end (loading the data a different way) I wrote a mechanism that took in a database and created the necessary SQL, which I then invoked using ExecuteNonQuery. This initially looked very promising, but then I found that some of my tables still exhibited the old wasted-space behaviour. (Note – I proved this by rerunning the process with TraceFlag 692 on, which I can do in test situations in order to investigate but not in live situations, and did not suffer a repeat of the unwanted extra space behaviour.)
In my current situation I am preparing some sql and firing it to the database using ExecuteNonQuery. The table that I am inserting to looks like
CREATE TABLE [dbo].[TestTable](
[TagIndex] [smallint] NOT NULL,
[TimeStamp] [datetime] NOT NULL,
[SystemTime] [datetime] NOT NULL,
[ModuleControllerWatchdog] [smallint] NULL
) ON [PRIMARY]
GO
What is strange is that
If I run my process and create SQL that sends 511 rows to the database at a time, the storage looks fine and I get a result that looks like this.
If I run my process and create SQL that sends 512 rows to the database at a time, then the storage expands and I get wasted storage (as below).
(Note - I have left out the SQL that was generated as it is quite mundane, and I did not want to further bloat an already long question, but if anybody wants extra information of this nature then feel free to ask.)
After all that, my question: does anybody know what is happening here? Does SQL Server have a tipping point or similar whereby it will decide to do a bulk insert or use a similar mechanism once the insert has grown to a certain size?
Any help would be gratefully appreciated.
As per Dan Guzman's request I reran the process twice - once with 511-row chunks and once with 512-row chunks (on the same data so that I can present the difference here).
511 row chunks
512 row chunks
and then, having thought about it a little, I decided to include the results of the following query (which shows the distribution and allocation of pages), as it shows the gaps that are present when space is being wasted.
use [Extract_ForGuzman512Rows]
select * from sys.dm_db_database_page_allocations
(DB_id() , object_id('[dbo].[TestTable]') , NULL , NULL , 'DETAILED')
From the 511 rows table
From the 512 rows table
Following up on the idea from Dan that this may be about how much data fits on one page, I decided to run my process into different databases (one with 511 chunks and one with 512 chunks) and to run only until the first chunk had gone in so that I could check in particular to see how the pages were being allocated right from the start.
With 511-row chunks I could see that 2 data pages (along with a third page) were being used, meaning that it went over a page's worth of data (as below).
With 512-row chunks I could see that again 2 data pages are being used (along with a third for the index page), but these are spaced out, whereas they are tightly packed in the 511 example.
How can I speed up this rather simple UPDATE query? It's been running for over 5 hours!
I'm basically replacing SourceID in a table by joining on a new table that houses the Old and New IDs. All these fields are VARCHAR(72) and must stay that way.
Pub_ArticleFaculty table has 8,354,474 rows (8.3 million). ArticleAuthorOldNew has 99,326,472 rows (99.3 million) and only the 2 fields you see below.
There are separate non-clustered indexes on all these fields. Is there a better way to write this query to make it run faster?
UPDATE PF
SET PF.SourceId = AAON.NewSourceId
FROM AA..Pub_ArticleFaculty PF WITH (NOLOCK)
INNER JOIN AA2..ArticleAuthorOldNew AAON WITH (NOLOCK)
ON AAON.OldFullSourceId = PF.SourceId
In my experience, looping your update so that it acts on a small number of rows each iteration is a good way to go. The ideal number of rows to update each iteration is largely dependent on your environment and the tables you're working with. I usually stick to around 1,000 - 10,000 rows per iteration.
Example
SET ROWCOUNT 1000 -- Set the batch size (number of rows to affect each time through the loop).
WHILE (1=1) BEGIN
UPDATE PF
SET NewSourceId = 1
FROM AA..Pub_ArticleFaculty PF WITH (NOLOCK)
INNER JOIN AA2..ArticleAuthorOldNew AAON WITH (NOLOCK)
ON AAON.OldFullSourceId = PF.SourceId
WHERE NewSourceId IS NULL -- Only update rows that haven't yet been updated.
-- When no rows are affected, we're done!
IF @@ROWCOUNT = 0
BREAK
END
SET ROWCOUNT 0 -- Reset the batch size to the default (i.e. all rows).
GO
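SET ROWCOUNT is deprecated for INSERT/UPDATE/DELETE in newer versions of SQL Server, so here is a hedged sketch of the same batching idea using UPDATE TOP (same illustrative tables and the same assumed NewSourceId marker column as above):
WHILE (1=1)
BEGIN
    -- update at most 1000 not-yet-updated rows per iteration
    UPDATE TOP (1000) PF
    SET NewSourceId = 1
    FROM AA..Pub_ArticleFaculty PF
    INNER JOIN AA2..ArticleAuthorOldNew AAON
        ON AAON.OldFullSourceId = PF.SourceId
    WHERE NewSourceId IS NULL;

    -- when no rows are affected, we're done
    IF @@ROWCOUNT = 0
        BREAK;
END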
If you are resetting all or almost all of the values, then the update will be quite expensive. This is due to logging and the overhead for the updates.
One approach you can take instead is insert into a temporary table, then truncate, then re-insert:
select pf.col1, pf.col2, . . . ,
coalesce(aaon.NewSourceId, pf.sourceid) as SourceId
into temp_pf
from AA..Pub_ArticleFaculty PF LEFT JOIN
AA2..ArticleAuthorOldNew AAON
on AAON.OldFullSourceId = PF.SourceId;
truncate table AA..Pub_ArticleFaculty;
insert into AA..Pub_ArticleFaculty
select * from temp_pf;
Note: You should either be sure that the columns in the original table match the temporary table or, better yet, list the columns explicitly in the insert.
I should also note that the major benefit comes when your recovery model is SIMPLE or BULK_LOGGED. The reason is that logging for the TRUNCATE, SELECT INTO, and INSERT ... SELECT is minimal (see here). This saving on logging can be very significant.
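To check whether that minimal-logging benefit applies, the recovery model can be confirmed first; a small sketch (using the AA database name from the example):
-- minimal logging for SELECT INTO / INSERT...SELECT needs the SIMPLE or BULK_LOGGED recovery model
SELECT name, recovery_model_desc
FROM sys.databases
WHERE name = 'AA';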
I would
Disable the index on PF.SourceId
Run the update
Then rebuild the index
I don't get the NOLOCK on the table you are updating
UPDATE PF
SET PF.SourceId = AAON.NewSourceId
FROM AA..Pub_ArticleFaculty PF
INNER JOIN AA2..ArticleAuthorOldNew AAON WITH (NOLOCK)
ON AAON.OldFullSourceId = PF.SourceId
AND PF.SourceId <> AAON.NewSourceId
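A minimal sketch of that disable/update/rebuild sequence (the index name IX_Pub_ArticleFaculty_SourceId is hypothetical):
-- take the non-clustered index on SourceId out of play during the mass update
ALTER INDEX IX_Pub_ArticleFaculty_SourceId ON AA..Pub_ArticleFaculty DISABLE;
-- ...run the UPDATE above...
-- then bring it back
ALTER INDEX IX_Pub_ArticleFaculty_SourceId ON AA..Pub_ArticleFaculty REBUILD;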
I don't have enough points to comment on the question, so I'm adding it as an answer. Can you check the basics:
Are there any triggers on the table? If there are, there will be additional overhead when you are updating rows.
Are there indexes on the joining columns?
In other cases, does the system perform well? Verify that the system has enough power.
But 8 million records shouldn't take much more than a minute to update if processed properly. An execution time of 5 hrs indicates there is a problem somewhere else.
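A couple of quick queries for those basic checks, assuming the dbo schema and running in the AA database:
-- any triggers on the target table?
SELECT name, is_disabled
FROM sys.triggers
WHERE parent_id = OBJECT_ID('dbo.Pub_ArticleFaculty');

-- are the joining columns indexed?
EXEC sp_helpindex 'dbo.Pub_ArticleFaculty';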
Let's say I have an update such as:
UPDATE [db1].[sc1].[tb1]
SET c1 = LEFT(c1, LEN(c1)-1)
WHERE c1 like '%:'
This update is basically going to go through millions of rows and trim the colon if there is one in the c1 column.
How can I track how far along in the table this has progressed?
Thanks
This is SQL Server 2008.
You can use the sysindexes table, which keeps track of how much an index has changed. Because this is done in an atomic update, it won't have a chance to recalc statistics, so rowmodctr will keep growing. This is sometimes not noticeable in small tables, but for millions, it will show.
-- create a test table
create table testtbl (id bigint identity primary key clustered, nv nvarchar(max))
-- fill it up with dummy data; about 40% of rows (a.number in 1,4) will have a trailing ':'
insert testtbl
select
convert(nvarchar(max), right(a.number*b.number+c.number,30)) +
case when a.number %3=1 then ':' else '' end
from master..spt_values a
inner join master..spt_values b on b.type='P'
inner join master..spt_values c on c.type='P'
where a.type='P' and a.number between 1 and 5
-- (20971520 row(s) affected)
update testtbl
set nv = left(nv, len(nv)-1)
where nv like '%:'
Now in another query window, run the below continuously and watch rowmodctr going up and up. rowmodctr vs rows gives you an idea of where you are up to, if you know where rowmodctr needs to end up. In our case, that is the number of rows with a trailing ':', roughly 40% of the 21 million rows (about 8.4 million).
select rows, rowmodctr
from sysindexes with (nolock)
where id = object_id('testtbl')
Please don't run (nolock) counting queries on the table itself while it is being updated.
Not really... you can query with the NOLOCK hint and the same WHERE clause, but this will take resources.
(It isn't an optimal query with a leading wildcard, of course.)
Database queries, particularly Data Manipulation Language (DML) statements, are atomic. That means the INSERT/UPDATE/DELETE either successfully occurs or it doesn't. There's no means of seeing which record is being processed - to the database, they have all been changed once the COMMIT is issued after the UPDATE. Even if you were able to view the records in process, by the time you saw the value the query would have progressed on to other records.
The only means of knowing where you are in the process is to script the query to occur within a loop, so you can use a counter to know how many rows have been processed. It's common to do this anyway so that large data sets are committed periodically, minimizing the risk that a failure requires running the entire query over again.
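A hedged sketch of that loop-with-a-counter idea applied to the colon-trimming UPDATE from the question (the batch size of 50,000 is arbitrary):
DECLARE @total bigint = 0, @batch int;

WHILE (1 = 1)
BEGIN
    -- trim at most 50,000 trailing colons per iteration
    UPDATE TOP (50000) [db1].[sc1].[tb1]
    SET c1 = LEFT(c1, LEN(c1) - 1)
    WHERE c1 LIKE '%:';

    SET @batch = @@ROWCOUNT;
    IF @batch = 0 BREAK;

    SET @total = @total + @batch;
    -- progress marker; RAISERROR ... WITH NOWAIT can be used instead of PRINT for immediate output
    PRINT 'rows updated so far: ' + CAST(@total AS varchar(20));
END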
I am using Microsoft SQL Server.
I have a table that has had 80 rows added to it.
If I right-click and look at the table properties, the row count says 10000, but a SELECT COUNT(id) FROM TableName indicates 10080.
I checked the statistics and they also have a row count of 10080.
Why is there a difference between the row count in Properties and the SELECT COUNT?
Thanks,
S
This information most probably comes from the sysindexes table (see the documentation), and the information in sysindexes isn't guaranteed to be up-to-date. This is a known fact in SQL Server.
Try running DBCC UPDATEUSAGE and check the values again.
Ref: http://msdn.microsoft.com/en-us/library/ms188414.aspx
DBCC UPDATEUSAGE corrects the rows, used pages, reserved pages, leaf pages and data page counts for each partition in a table or index. If there are no inaccuracies in the system tables, DBCC UPDATEUSAGE returns no data. If inaccuracies are found and corrected and WITH NO_INFOMSGS is not used, DBCC UPDATEUSAGE returns the rows and columns being updated in the system tables.
Example:
DBCC UPDATEUSAGE (0)
Update the statistics. That's the only way the RDBMS knows the current status of your tables and indexes. This also helps the RDBMS choose the correct execution path for optimal performance.
SQL Server 2005
UPDATE STATISTICS dbOwner.yourTableName;
Oracle
ANALYZE TABLE yourSchema.yourTableName COMPUTE STATISTICS;
The property info is cached in SSMS.
There are a variety of ways to check the row count of a table.
http://blogs.msdn.com/b/martijnh/archive/2010/07/15/sql-server-how-to-quickly-retrieve-accurate-row-count-for-table.aspx mentions 4 of various accuracy and speed.
The ever-reliable full table scan is a bit slow:
SELECT COUNT(*) FROM Transactions
and the quick alternative depends on statistics
SELECT CONVERT(bigint, rows)
FROM sysindexes
WHERE id = OBJECT_ID('Transactions')
AND indid < 2
It also mentions that the SSMS GUI uses this query:
SELECT CAST(p.rows AS float)
FROM sys.tables AS tbl
INNER JOIN sys.indexes AS idx ON idx.object_id = tbl.object_id and idx.index_id < 2
INNER JOIN sys.partitions AS p ON p.object_id=CAST(tbl.object_id AS int)
AND p.index_id=idx.index_id
WHERE ((tbl.name=N'Transactions'
AND SCHEMA_NAME(tbl.schema_id)='dbo'))
and that a fast and relatively accurate way to size a table is:
SELECT SUM (row_count)
FROM sys.dm_db_partition_stats
WHERE object_id=OBJECT_ID('Transactions')
AND (index_id=0 or index_id=1);
Unfortunately this last query requires extra permissions beyond basic select.
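The extra permission is typically VIEW DATABASE STATE, which covers sys.dm_db_partition_stats; a hypothetical grant (the principal name is made up):
-- allow a reporting principal to query the DMV-based row counts
GRANT VIEW DATABASE STATE TO [reporting_user];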