SQL Azure tempdb size quota on index create - azure-sql-database

I need to create an index on a 800 million row heap table with no partitioning
CREATE NONCLUSTERED INDEX [IX] ON [dbo].[table]
(
columns....
)
INCLUDE(columns) WITH (
STATISTICS_NORECOMPUTE = OFF
, DROP_EXISTING = OFF
, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF
, SORT_IN_TEMPDB =OFF
, ONLINE = OFF) ON [PRIMARY]
GO
This command results in:
The database 'tempdb' has reached its size quota. Partition or delete data, drop indexes, or consult the documentation for possible resolutions.
Before I start trying batch loading into a new table, cutoff indexing, and partitioning, is there any way to force SQL Azure to create an index?

Related

Is this index on a table necessary?

I have a table with 5 indexes in SQL server. I'm aware of the fact that indexes affect inserts so I'd like to keep them to the minimum. I definitely need the first four indexes (as seen in the below sample).
However, I'm note quite sure if the last index (TimeSubmitted) is absolutely necessary - note that there is already CliendId+TimeSubmitted index.
The only reason why it is there is to make purging of expired rows from the table more efficient - or at least this is what my intention is. The purging job will be scheduled to run once a day - most likely at night.
There could be hundreds of thousands of records in the table at any given time.
Stored proc to purge the table:
CREATE PROCEDURE uspPurgeMyTable
(
#ExpiryDate datetime
)
AS
BEGIN
DELETE FROM MyTable
WHERE TimeSubmitted < #ExpiryDate;
END
Table (non relevant columns omitted):
CREATE TABLE MyTable (
[ClientId] [char](36) NOT NULL,
[UserName] [nvarchar](256) NOT NULL,
[TimeSubmitted] [datetime] NOT NULL,
[ProviderId] [uniqueidentifier] NULL,
[RegionId] [int] NULL
) ON [PRIMARY]
CREATE NONCLUSTERED INDEX [IX_MyTable_ClientArea] ON [dbo].[MyTable]
(
[ClientId] ASC,
[RegionId] ASC
)WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]
CREATE NONCLUSTERED INDEX [IX_MyTable_ClientPrinter] ON [dbo].[MyTable]
(
[ClientId] ASC,
[ProviderId] DESC
)WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]
CREATE NONCLUSTERED INDEX [IX_MyTable_ClientTime] ON [dbo].[MyTable]
(
[ClientId] ASC,
[TimeSubmitted] ASC
)WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]
CREATE NONCLUSTERED INDEX [IX_MyTable_ClientUser] ON [dbo].[MyTable]
(
[ClientId] ASC,
[UserName] ASC
)WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX [IX_MyTable_TimeSubmitted] ON [dbo].[MyTable]
(
[TimeSubmitted] ASC
)WITH (STATISTICS_NORECOMPUTE = OFF, DROP_EXISTING = OFF, ONLINE = OFF) ON [PRIMARY]
Yes, the IX_MyTable_TimeSubmitted index will be beneficial for the purge process you described. Since the purge is based on the TimeSubmitted column only and none of the other indexes start with that, the best SQL Server would be able to do is use them in an index scan.
As with any indexing, there are trade-offs between read and update performance. You should measure your read, insert, and purge performance with and without the additional index to see what provides the best response time for each situation. Batch processes such as the purge that run overnight may not be as time sensitive as inserting or reading in your particular scenario.
If you are concerned about performance, you should take a different approach, partitioning. A reasonable place to learn about partitioning is the documentation.
A partitioned table stores each partition in a separate set of files. These are invisible to users of the database, in general. However, if you want to drop old data, you can just drop a partition.
This not only eliminates the need for the index. It also eliminates the need for the delete. And, dropping a partition is much more efficient than deleting, because there is much less logging.

SQL Server : MERGE performance

I have a database table with 5 million rows. The clustered index is auto-increment identity column. There PK is a code generated 256 byte VARCHAR which is a SHA256 hash of a URL, this is a non-clustered index on the table.
The table is as follows:
CREATE TABLE [dbo].[store_image](
[imageSHAID] [nvarchar](256) NOT NULL,
[imageGUID] [uniqueidentifier] NOT NULL,
[imageURL] [nvarchar](2000) NOT NULL,
[showCount] [bigint] NOT NULL,
[imageURLIndex] AS (CONVERT([nvarchar](450),[imageURL],(0))),
[autoIncID] [bigint] IDENTITY(1,1) NOT NULL,
CONSTRAINT [PK_imageSHAID] PRIMARY KEY NONCLUSTERED
(
[imageSHAID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
CREATE CLUSTERED INDEX [autoIncPK] ON [dbo].[store_image]
(
[autoIncID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
imageSHAID is a SHA256 hash of an image URL e.g. "http://blah.com/image1.jpg", it is hashed into a varchar of 256 length.
imageGUID is a code generated guid in which I identify the image (it will be used as an index later, but for now I have omitted this column as an index)
imageURL is the full URL of the image (up to 2000 characters)
showCount is the number of times the image is shown, this is incremented each time this particular image is shown.
imageURLIndex is a computed column limited by 450 characters, this allows me to do text searches on the imageURL should I choose to, it is indexable (again index is omitted for brevity)
autoIncID is the clustered index, should allow faster inserting of data.
Periodically I merge from a temp table into the store_image table. The temp table structure is as follows (very similar to the store_image table):
CREATE TABLE [dbo].[store_image_temp](
[imageSHAID] [nvarchar](256) NULL,
[imageURL] [nvarchar](2000) NULL,
[showCount] [bigint] NULL,
) ON [PRIMARY]
GO
When the merge process is run, I write a DataTable to the temp table using the following code:
using (SqlBulkCopy bulk = new SqlBulkCopy(storeConn, SqlBulkCopyOptions.KeepIdentity | SqlBulkCopyOptions.KeepNulls, null))
{
bulk.DestinationTableName = "[dbo].[store_image_temp]";
bulk.WriteToServer(imageTableUpsetDataTable);
}
I then run the merge command to update the showCount in the store_image table by merging from the temp table based on the imageSHAID. If the image doesn't currently exist in the store_image table, I create it:
merge into store_image as Target using [dbo].[store_image_temp] as Source
on Target.imageSHAID=Source.imageSHAID
when matched then update set
Target.showCount=Target.showCount+Source.showCount
when not matched then insert values (Source.imageSHAID,NEWID(), Source.imageURL, Source.showCount);
I'm typically trying to merge 2k-5k rows from the temp table to the store_image table at any one merge process.
I used to run this DB on a SSD (only SATA 1 connected) and it was very fast (under 200 ms). I ran out of room on the SSD so I swapped the DB to a 1TB 7200 cache spinning disk, since then completion times are over 6-100 seconds (6000 - 100000MS). When the bulk insert is running I can see disk activity of around 1MB-2MB/sec, low CPU usage.
Is this a typical write time for this amount of data? It seems a little slow to me, what is causing the slow performance? Surely with the imageSHAID being indexed we should expect quicker seek times than this?
Any help would be appreciated.
Thanks for your time.
Your UPDATE clause in the MERGE updates showCount. This requires a key lookup on the clustered index.
However, the clustered index is also declared non-unique. This gives information to the optimiser even though the underlying column is unique.
So, I'd make these changes
the clustered primary key to be autoIncID
the current PK on imageSHAID to be a standalone unique index (not constraint) and add an INCLUDE for showCount. Unique constraints can't have INCLUDEs
More observations:
you don't need nvarchar for the hash or URL columns. These are not unicode.
A hash is also fixed length so can be char(64) (for SHA2-512).
The length of a column defines how much memory to assign to the query. See this for more: is there an advantage to varchar(500) over varchar(8000)?

create primary key on existing table with data

As part of a migration project, we have imported data from a JDE iSeries DB2 database. An SSIS package was created to create the destination tables and import data. The import went successfully.
Now comes the problem - The customer wants Primary Keys created in the destination DB (SQL 2008 R2). The problem table in this case, would be one table that has 104 columns and 7.5 million rows of data. The PK required for this table is composite and has 7 columns.
We are considering this :
BEGIN TRANSACTION
GO
ALTER TABLE [dbo].[F0911] ADD CONSTRAINT [F0911_PK] PRIMARY KEY CLUSTERED
(
[GLDCT] ASC,
[GLDOC] ASC,
[GLKCO] ASC,
[GLDGJ] ASC,
[GLJELN] ASC,
[GLLT] ASC,
[GLEXTL] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
COMMIT
or this:
-- Rename existing tables
sp_RENAME '[F0911]' , '[F0911_old]'
GO
-- Create new table
SELECT * INTO F0911 FROM F0911_old WHERE 1=0
GO
--Create PK constraints
ALTER TABLE [dbo].[F0911] ADD CONSTRAINT [F0911_PK] PRIMARY KEY CLUSTERED
(
[GLDCT] ASC,
[GLDOC] ASC,
[GLKCO] ASC,
[GLDGJ] ASC,
[GLJELN] ASC,
[GLLT] ASC,
[GLEXTL] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
--Insert data into new tables
INSERT INTO F0911
SELECT * FROM F0911_old
GO
-- Drop old tables
DROP TABLE F0911_old
GO
Which would be a more efficient approach, performance wise? I have a gut feeling that both are the same and even the first approach does the same thing as the second one does, implicitly. Is this understanding correct?
Please note that all these columns already exist in the table and we cannot modify the table definition.
Thanks,
Raj
They're the same. The effect of creating a clustered index is to arrange the pages which will happen in both cases. For non-clustered indexes it will help to disable the index and then turn it back on and rebuilding it.
I think the first approach is right, but I don't understand the reason of BEGIN Transaction and END transaction. I don't think Transaction keyword is necessary because you are not modifying data of the table. Transaction is used where we have to lock the data and we are modifying real time data so the old data is not used.

SQL Server 2008 index fragmentation problem

Everyday I import 2,000,000 rows from some text files using BULK INSERT into SQL Server 2008 and then I do some post-processing to update the records.
I have some indexes on the table to execute the post-process as fast as possible and in normal situation, the post-processing script takes about 40 seconds to run.
But sometimes (I don't know when) the post-processing does not work. In the situation I've mentioned, it is not done after an hour! After rebuilding indexes, everything is fine and normal.
What should I do to prevent the problem to be happened?
Right now, I have a nightly job to rebuild all indexes. Why the index fragmentation grows up to 90%?
Update:
Here is my table which I import text file into:
CREATE TABLE [dbo].[My_Transactions](
[My_TransactionId] [bigint] NOT NULL,
[FileId] [int] NOT NULL,
[RowNo] [int] NOT NULL,
[TransactionTypeId] [smallint] NOT NULL,
[TransactionDate] [datetime] NOT NULL,
[TransactionNumber] [dbo].[TransactionNumber] NOT NULL,
[CardNumber] [dbo].[CardNumber] NULL,
[AccountNumber] [dbo].[CardNumber] NULL,
[BankCardTypeId] [smallint] NOT NULL,
[AcqBankId] [smallint] NOT NULL,
[DeviceNumber] [dbo].[DeviceNumber] NOT NULL,
[Amount] [dbo].[Amount] NOT NULL,
[DeviceTypeId] [smallint] NOT NULL,
[TransactionFee] [dbo].[Amount] NOT NULL,
[AcqSwitchId] [tinyint] NOT NULL
) ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX [_dta_index_Jam_Transactions_8_1290487676__K1_K4_K12_K6_K11_5] ON [dbo].[Jam_Transactions]
(
[Jam_TransactionId] ASC,
[TransactionTypeId] ASC,
[Amount] ASC,
[TransactionNumber] ASC,
[DeviceNumber] ASC
)
INCLUDE ( [TransactionDate]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX [_dta_index_Jam_Transactions_8_1290487676__K12_K6_K11_K1_5] ON [dbo].[Jam_Transactions]
(
[Amount] ASC,
[TransactionNumber] ASC,
[DeviceNumber] ASC,
[Jam_TransactionId] ASC
)
INCLUDE ( [TransactionDate]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX [IX_Jam_Transactions] ON [dbo].[Jam_Transactions]
(
[Jam_TransactionId] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
Have you tried just refreshing the statistics after such a large insert:
UPDATE STATISTICS my_table
My experience with large bulk inserts is that the statistics get all mangled up and need a refreshed afterwards, it's also much faster than running a REINDEX or index REORDER.
Another option is to look into padding the index, you likely have no padding fill factor on your indexes meaning that if your index is:
A, B, D, E, F
and you insert a value with a CardNumber of C, then your index will look like:
A, B, D, E, F, C
and hence be ~20% fragmented, if you instead specify a fill factor for your index of say 15% we would see it look like roughly:
A, B, D, _, E, F
(Note the internal the empty space is put roughly in the middle point of the fillfactor % not at the end)
So that when you insert the C value it is closer to being correct, but it actually sees that the D is just swapped with the C and usually moves the D at that point.
Beyond that, are you sure that the fragmentation is actually the problem, as part of reindexing the table is read and loaded entirely into memory (provided it fits) and thus any query you run on it will be very fast.
Instead of including this table in the nightly job, why don't you make index maintenance (on this table specifically) part of the nightly import job, between BULK INSERT and whatever 'post-processing' is?
We don't have enough information to know why the index fragmentation grows that quickly. Which index(es)? How many indexes are there? What is the order of the data in the file?
You may also consider using the ORDER option in the BULK INSERT statement to change the way the data is inserted. It may make the load take longer but it should reduce the need to reorganize. Again depending on the order of the source data and the index(es) that become fragmented.
Finally, what is the impact of rebuilding/not rebuilding or reorganizing/not reorganizing the indexes? Have you tried both? Perhaps it makes the post-processing run quicker if you rebuild, but perhaps only a defragment is necessary. And while it may make the post-processing quicker, what about the queries that are run against the table later in the day? Have you done any metrics against those to see if they speed up or slow down depending on what you do at night?
Does your main table keep growing by 2 million rows per day or is there a lot of deleting taking place as well? Could you bulk insert into a temporary import table and do your processing prior to inserting into the main table? You can always use hints to force your queries to use certain indices:
SELECT *
FROM your_table_name WITH (INDEX(your_index_name))
WHERE your_column_name = 5
I would try taking the index offline before a mass row insertion and the bring it back online after the mass row insertion. Much, much more faster as compared to re-indexing, or performing a drop and create index..... difference is that the index is there data is being stored but the index is currently not being used, "Offline" until it is brought back "Online". I have a 1.5 million row insertion process and had a problem with one of my non clustered indexes fragmenting which was causing poor performance. Went form 99% fragmentation to .14% using the MSSQL Offline Online Index option....
Code sample:
ALTER INDEX idx_a ON dbo.tbl_A
REBUILD WITH (ONLINE = OFF);
Toggle between OFF and ON and you are good to go....

Problem in using indexes with SQL Server 2005 and Hibernate

I've a problem with a query generated by Hibernate that do not uses an index. Access to database is made from Java using JTDS and server version is SQL Server 2005, latest service pack.
The field is nullable and is a foreign key that, in some specific scenarios, could be completely null, column is indexed via a not clustered index but the index is never used when the column is entirely null, creating a large number of full table scans and performance issues.
The situation could be verified also using the standard query analyzer with the following SQL code:
Create table and indexes
CREATE TABLE [dbo].[TestNulls](
[PK] [varchar](36) NOT NULL,
[DATA] [varchar](36) NULL,
[DATANULL] [varchar](36) NULL,
CONSTRAINT [PK_TestNulls] PRIMARY KEY NONCLUSTERED
(
[PK] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
CREATE NONCLUSTERED INDEX [IDX_DATA] ON [dbo].[TestNulls]
(
[DATA] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
CREATE NONCLUSTERED INDEX [IDX_DATANULL] ON [dbo].[TestNulls]
(
[DATANULL] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO
Fill it with some random data using the newid function
declare #i as int
set #i = 0
while (#i < 500000)
begin
set nocount on
insert into TestNulls values(NEWID(), NEWID(), null)
insert into TestNulls values(NEWID(), null, null)
insert into TestNulls values(NEWID(), null, null)
set #i = (#i + 1)
set nocount on
end;
This query perform a full table scan
declare #p varchar(36)
set #p = NEWID()
select PK, DATA, DATANULL from TestNulls
where DATANULL = #p
If I complete the query with a "and DATANULL IS NOT NULL" the query now uses the index.
Help needed:
How can I force the JTDS/Hibernate combination to use the index (the sendStringParametersAsUnicode is already set to false by default)?
Is there a way to append "and column is not null" for all hibernate queries that uses a nullable field?
Any explanation about this behavior?
Regards
Massimo
1) I think, we should avoid NULL values. Just use DEFAULT and place some {00000-0000-000...} as NULL value. Your data filling script generates too many nulls values, so selectivity of values of this field is very low. I think SQL Server will choose to scan then use index in this case (SQL Server automaticly chooses to use or does not use index itself). And it makes sence. You should analyse your REAL data. Any way you can force it to just use some index. You can create stored procedure to sql server and then query it from hibernate, for example,
or command hibernate to use custom query to request data (I think, it is possible) and add table hint to your query to force using some index:
INDEX ( index_val [ ,...n ] ):
select PK, DATA, DATANULL from TestNulls WITH INDEX(IDX_DATANULL)
The selectivity is the "number of rows" / "cardinality", so if you have 10K customers, and search for all "female", you have to consider that the search would return 10K/2 = 5K rows, so a very "bad" selectivity.
Luck.
You are using a table with no clustered index ("heap table", as it is called), that is generally not very efficient for SELECTs, because any meaningful query requires either a bookmark lookup or a full table scan.
So, to use the index the server will have to: 1) find the given values in the index and retrieve corresponding Row IDs, 2) retrive the rows by the IDs and return the data.
Given the nature of your data, the optimizer "thinks" full scan is mor efficient.
I'd suggest you to try:
Rebuilding the statistics on the table. Outdated stats could lead optimizer to wrong decisions.
Force using index via a hint. Do not forget to test whether it is really faster on your actual data (sometimes optimizer happens to know better than you).
Create a covering index for this query by adding some data (it will make inserts/updates somewhat slower, so you should consider the overall impact on the system):
CREATE INDEX IDX_DATANULL_FULL ON TestNulls (DATANULL) INCLUDE (PK, DATA)