SQL select from table - only including data in specific filegroup

I followed this article:
http://www.mssqltips.com/sqlservertip/1796/creating-a-table-with-horizontal-partitioning-in-sql-server/
Which in essence does the following:
Creates a database with three filegroups, call them A, B, and C
Creates a partition scheme, mapping to the three filegroups
Creates table - SalesArchival, using the partition scheme
Inserts a few rows into the table, split over the filegroups.
I'd like to perform a query like this (excuse my pseudo-code)
select * from SalesArchival
where data in filegroup('A')
Is there a way of doing this, or if not, how do I go about it?
What I want to accomplish is to have a batch run every day that moves data older than 90 days to a different file group, and perform my front end queries only on the 'current' file group.

To get at a specific filegroup, you'll always want to utilize partition elimination in your predicates to ensure minimal records get read. This is very important if you are to get any benefits from partitioning.
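For example (a sketch only: the question doesn't name SalesArchival's columns, so the partitioning column is assumed here to be called SaleDate):
--Hedged sketch: a range predicate on the partitioning column lets the optimizer
--read only the partitions (and therefore filegroups) that can contain matches.
SELECT *
FROM SalesArchival
WHERE SaleDate >= DATEADD(DAY, -90, GETDATE());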
For archival, I think you're looking for how to split and merge ranges. You should always keep the first and last partitions empty, but this should give you an idea of how to use partitions for archiving. FYI, moving data from one filegroup to another is very resource intensive. Additionally, results will be slightly different if you use a RANGE RIGHT partition function. Since you are doing partitioning, hopefully you've read up on best practices.
DO NOT RUN ON PRODUCTION. THIS IS ONLY AN EXAMPLE TO LEARN FROM.
This example assumes you have 4 filegroups (FG1,FG2,FG3, & [PRIMARY]) defined.
IF EXISTS(SELECT NULL FROM sys.tables WHERE name = 'PartitionTest')
DROP TABLE PartitionTest;
IF EXISTS(SELECT NULL FROM sys.partition_schemes WHERE name = 'PS')
DROP PARTITION SCHEME PS;
IF EXISTS(SELECT NULL FROM sys.partition_functions WHERE name = 'PF')
DROP PARTITION FUNCTION PF;
CREATE PARTITION FUNCTION PF (datetime) AS RANGE LEFT FOR VALUES ('2012-02-05', '2012-05-10','2013-01-01');
CREATE PARTITION SCHEME PS AS PARTITION PF TO (FG1,FG2,FG3,[PRIMARY]);
CREATE TABLE PartitionTest( Id int identity(1,1), DT datetime) ON PS(DT);
INSERT PartitionTest (DT)
SELECT '2012-02-05' --FG1
UNION ALL
SELECT '2012-02-06' --FG2 (This is the one 90 days old to archive into FG1)
UNION ALL
SELECT '2012-02-07' --FG2
UNION ALL
SELECT '2012-05-05' --FG2 (This represents a record entered recently)
Check the filegroup associated with each record:
SELECT O.name TableName, fg.name FileGroup, ps.name PartitionScheme,pf.name PartitionFunction, ISNULL(prv.value,'Undefined') RangeValue,p.rows
FROM sys.objects O
INNER JOIN sys.partitions p on P.object_id = O.object_id
INNER JOIN sys.indexes i on p.object_id = i.object_id and p.index_id = i.index_id
INNER JOIN sys.data_spaces ds on i.data_space_id = ds.data_space_id
INNER JOIN sys.partition_schemes ps on ds.data_space_id = ps.data_space_id
INNER JOIN sys.partition_functions pf on ps.function_id = pf.function_id
LEFT OUTER JOIN sys.partition_range_values prv on prv.function_id = ps.function_id and p.partition_number = prv.boundary_id
INNER JOIN sys.allocation_units au on p.hobt_id = au.container_id
INNER JOIN sys.filegroups fg ON au.data_space_id = fg.data_space_id
WHERE o.name = 'PartitionTest' AND i.type IN (0,1) --Remove nonclustereds. 0 for heap, 1 for BTree
ORDER BY O.name, fg.name, prv.value
This proves that 2012-02-05 is in FG1 while the rest are in FG2.
In order to archive, your first instinct is to move the data. When partitioning, though, you actually have to slide the partition function range value.
Now let's move 2012-02-06 (90 days or older in your case) into FG1:
--Move 2012-02-06 from FG2 to FG1
ALTER PARTITION SCHEME PS NEXT USED FG1;
ALTER PARTITION FUNCTION PF() SPLIT RANGE ('2012-02-06');
Rerun the filegroup query to verify that 2012-02-06 got moved into FG1.
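The merge half of the pattern isn't shown in the post; a sketch of it against the same objects: once the split has placed both of the oldest partitions on FG1, removing their shared boundary collapses them into one partition.
--Sketch: drop the now-redundant boundary; both affected partitions are on FG1,
--so no data moves between filegroups.
ALTER PARTITION FUNCTION PF() MERGE RANGE ('2012-02-05');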

$PARTITION (Transact-SQL) should have what you want.
Run the following to see the size of each of your partitions and its ID:
USE AdventureWorks2012;
GO
SELECT $PARTITION.TransactionRangePF1(TransactionDate) AS Partition,
COUNT(*) AS [COUNT] FROM Production.TransactionHistory
GROUP BY $PARTITION.TransactionRangePF1(TransactionDate)
ORDER BY Partition ;
GO
and the following should give you data from a given partition ID:
SELECT * FROM Production.TransactionHistory
WHERE $PARTITION.TransactionRangePF1(TransactionDate) = 5 ;
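Adapted to the question's table (a sketch only: the partition function and column names below are assumptions, since the post doesn't give them), the same idea reads just the partition that the scheme maps to filegroup A:
SELECT *
FROM SalesArchival
WHERE $PARTITION.SalesArchivalPF(SaleDate) = 1; --partition 1 is mapped to filegroup A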

No. You need to use the exact condition that you use in your partition function, which is probably something like:
where keyCol between 3 and 7

Related

How to get the min value of a column after selecting it. Cannot perform an aggregate function on a column

My error is in this part (select min(sc.name) from so.name). How do I solve it?
In the select I am getting the table and column name; at the same time I want to get the min value of the column from the table. Is that possible?
select so.name table_name , sc.name Column_name,(select min(sc.name) from so.name )
from sysindexes si, syscolumns sc, sysobjects so
where si.indid < 2 -- 0 = if a table. 1 = if a clustered index on an allpages-locked table. >1 = if a nonclustered index or a clustered index on a data-only-locked table.
and so.type = 'U' --U – user table
and sc.status & 128 = 128 --(value 128) – indicates an identity column.
and so.id = sc.id
and so.id = si.id
So the problem is that you are basically trying to do dynamic code, where you try to select a column based on a table name taken from a system table.
The problem is that SQL doesn't know that the 'so.name' you are referencing is a table (furthermore, sysobjects also contains procedures and functions).
Rather than that, you should do an inner join between sys.syscolumns and sys.sysobjects based on their id, and then use dynamic SQL to read the actual minimum value from each table.
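A minimal dynamic SQL sketch for a single table/column pair (the names are placeholders, not from the question):
DECLARE @tbl sysname = N'YourTable';  --placeholder table name
DECLARE @col sysname = N'YourColumn'; --placeholder column name
DECLARE @sql nvarchar(max) =
    N'SELECT MIN(' + QUOTENAME(@col) + N') FROM ' + QUOTENAME(@tbl) + N';';
EXEC sp_executesql @sql;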

Query all databases tables columns and size in a server

I need to query all databases, tables, columns, and number of rows for each table from a server.
The following code almost does what I need, except it is only for a single database. I need this output with the addition of a column for the database name, and for it to run against all databases instead of just a single named one. I also need the number of records for each table.
USE [temp_db];
SELECT
OBJECT_SCHEMA_NAME(T.[object_id],DB_ID()) AS [Schema],
T.[name] AS [table_name], AC.[name] AS [column_name],
TY.[name] AS system_data_type, AC.[max_length],
AC.[precision], AC.[scale], AC.[is_nullable], AC.[is_ansi_padded]
FROM
sys.[tables] AS T
INNER JOIN
sys.[all_columns] AC ON T.[object_id] = AC.[object_id]
INNER JOIN
sys.[types] TY ON AC.[system_type_id] = TY.[system_type_id]
AND AC.[user_type_id] = TY.[user_type_id]
WHERE
T.[is_ms_shipped] = 0
ORDER BY
T.[name], AC.[column_id]
Current output:
Schema|table_name|column_name|system_data_type|max_length|precision|scale|is_nullable|is_ansi_padded
I need the output to be:
db_name|table_name|column_name|system_data_type|num_records
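No answer is included in this excerpt; one hedged sketch is to run the metadata query in every database via the undocumented sp_MSforeachdb (its '?' token is replaced with each database name), adding DB_NAME() and a row count from sys.dm_db_partition_stats:
DECLARE @sql nvarchar(max) = N'
USE [?];
SELECT DB_NAME() AS [db_name],
       t.[name] AS [table_name],
       ac.[name] AS [column_name],
       ty.[name] AS [system_data_type],
       ps.[row_count] AS [num_records]
FROM sys.tables AS t
INNER JOIN sys.all_columns AS ac ON t.[object_id] = ac.[object_id]
INNER JOIN sys.types AS ty ON ac.[user_type_id] = ty.[user_type_id]
INNER JOIN (SELECT [object_id], SUM([row_count]) AS [row_count]
            FROM sys.dm_db_partition_stats
            WHERE [index_id] IN (0, 1)
            GROUP BY [object_id]) AS ps ON ps.[object_id] = t.[object_id]
WHERE t.[is_ms_shipped] = 0;';
EXEC sp_MSforeachdb @sql;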

Azure SQL DB causing connection time out for stored procedures

We have hosted our database in Azure and are running stored procedures on this DB. The stored procedures had been running fine till last week, but suddenly started giving connection timeout errors.
Our database size is 14 GB, the stored procedures in general return 2k to 20k records, and we are using the S3 pricing tier (50 DTU) of Azure DB.
What I found interesting was that the first time the stored procedure is executed, it takes a long time, 2 - 3 minutes, and this is causing the timeout. Later executions are fast (maybe it caches the execution plan).
Also, when I run it on the same DB with the same number of records on a machine with 8 GB RAM and Win10, it runs in 15 seconds.
This is my stored procedure:
CREATE PROCEDURE [dbo].[PRSP]
@CompanyID INT,
@fromDate DATETIME,
@toDate DATETIME,
@ListMailboxId as MailboxIds Readonly,
@ListConversationType as ConversationTypes Readonly
AS
BEGIN
SET NOCOUNT ON;
SELECT
C.ID,
C.MailboxID,
C.Status,
C.CustomerID,
Cust.FName,
Cust.LName,
C.ArrivalDate as ConversationArrivalDate,
C.[ClosureDate],
C.[ConversationType],
M.[From],
M.ArrivalDate as MessageArrivalDate,
M.ID as MessageID
FROM
[Conversation] as C
INNER JOIN
[ConversationHistory] AS CHis ON (CHis.ConversationID = C.ID)
INNER JOIN
[Message] AS M ON (M.ConversationID = C.ID)
INNER JOIN
[Mailbox] AS Mb ON (Mb.ID = C.MailboxID)
INNER JOIN
[Customer] AS Cust ON (Cust.ID = C.CustomerID)
JOIN
@ListConversationType AS convType ON convType.ID = C.[ConversationType]
JOIN
@ListMailboxId AS mailboxIds ON mailboxIds.ID = Mb.ID
WHERE
Mb.CompanyID = @CompanyID
AND ((CHis.CreatedOn > @fromDate
AND CHis.CreatedOn < @toDate
AND CHis.Activity = 1
AND CHis.TagData = '3')
OR (M.ArrivalDate > @fromDate
AND M.ArrivalDate < @toDate))
END
This is the execution plan:
Execution Plan (image in the original post)
Please give your suggestions as to what improvement is needed. Also, do we need to upgrade our pricing tier?
Ideally, for a 14 GB DB, what should the Azure pricing tier be?
That query should take 1 to 3 seconds to complete on your Windows 10 8 GB RAM machine. It takes 15 seconds because SQL Server chose a poor execution plan. In this case, the root cause of the poor plan is bad estimates: several operators in the plan show a big difference between estimated rows and actual rows. For example, SQL Server estimated it would only need to perform one seek into the PK_Customer clustered index, but it performed 16,522 seeks. The same thing occurs with [ConversationHistory].[IX_ConversationID_CreatedOn_Activity_ByWhom] and with [Message].[IX_ConversationID_ID_ArrivalDt_From_RStatus_Type].
Here you have some hints you could follow to improve the performance of the query:
Update statistics
Try OPTION (HASH JOIN) at the end of the query. It might improve performance, or it might slow it down; it can even cause the query to error.
Store the table variable data in temporary tables and use those in the query (SELECT * INTO #temp_table FROM @table_variable), as in the sketch after this list. Table variables don't have statistics, causing bad estimates.
Identify the first operator where the difference between estimated rows and actual rows is big enough. Split the query. Query 1: SELECT * INTO #operator_result FROM (query equivalent to operator). Query 2: write the query using #operator_result. Because #operator_result is a temporary table, SQL Server is forced to reevaluate estimates. In this case, the offending operator is the hash match (inner join).
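A minimal sketch of that temporary-table hint applied to this procedure's table-valued parameters (the temp table names are made up):
SELECT * INTO #MailboxIds        FROM @ListMailboxId;
SELECT * INTO #ConversationTypes FROM @ListConversationType;
--...then join #MailboxIds / #ConversationTypes in the main query instead of
--the TVPs, so the optimizer has real statistics to estimate with.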
There are other things you can do to improve the performance of this query:
Avoid key lookups. There are 16,522 key lookups into the Conversation.PK_dbo.Conversation clustered index. They can be avoided by creating the appropriate covering index. In this case, the covering index is the following:
DROP INDEX [IX_MailboxID] ON [dbo].[Conversation]
GO
CREATE INDEX IX_MailboxID ON [dbo].[Conversation](MailboxID)
INCLUDE (ArrivalDate, Status, ClosureDate, CustomerID, ConversationType)
Split OR predicate into UNION or UNION ALL. For example:
instead of:
SELECT *
FROM table
WHERE <predicate1> OR <predicate2>
use:
SELECT *
FROM table
WHERE <predicate1>
UNION
SELECT *
FROM table
WHERE <predicate2>
Sometimes it improves performance.
Apply each hint individually and measure performance.
EDIT: You can try the following and see if it improves performance:
SELECT
C.ID,
C.MailboxID,
C.Status,
C.CustomerID,
Cust.FName,
Cust.LName,
C.ArrivalDate as ConversationArrivalDate,
C.[ClosureDate],
C.[ConversationType],
M.[From],
M.ArrivalDate as MessageArrivalDate,
M.ID as MessageID
FROM
@ListConversationType AS convType
INNER JOIN (
@ListMailboxId AS mailboxIds
INNER JOIN
[Mailbox] AS Mb ON (Mb.ID = mailboxIds.ID)
INNER JOIN
[Conversation] as C
ON C.MailboxID = Mb.ID
) ON convType.ID = C.[ConversationType]
INNER HASH JOIN
[Customer] AS Cust ON (Cust.ID = C.CustomerID)
INNER HASH JOIN
[ConversationHistory] AS CHis ON (CHis.ConversationID = C.ID)
INNER HASH JOIN
[Message] AS M ON (M.ConversationID = C.ID)
WHERE
Mb.CompanyID = @CompanyID
AND ((CHis.CreatedOn > @fromDate
AND CHis.CreatedOn < @toDate
AND CHis.Activity = 1
AND CHis.TagData = '3')
OR (M.ArrivalDate > @fromDate
AND M.ArrivalDate < @toDate))
And this:
SELECT
C.ID,
C.MailboxID,
C.Status,
C.CustomerID,
Cust.FName,
Cust.LName,
C.ArrivalDate as ConversationArrivalDate,
C.[ClosureDate],
C.[ConversationType],
M.[From],
M.ArrivalDate as MessageArrivalDate,
M.ID as MessageID
FROM
@ListConversationType AS convType
INNER JOIN (
@ListMailboxId AS mailboxIds
INNER JOIN
[Mailbox] AS Mb ON (Mb.ID = mailboxIds.ID)
INNER JOIN
[Conversation] as C
ON C.MailboxID = Mb.ID
) ON convType.ID = C.[ConversationType]
INNER MERGE JOIN
[Customer] AS Cust ON (Cust.ID = C.CustomerID)
INNER MERGE JOIN
[ConversationHistory] AS CHis ON (CHis.ConversationID = C.ID)
INNER MERGE JOIN
[Message] AS M ON (M.ConversationID = C.ID)
WHERE
Mb.CompanyID = @CompanyID
AND ((CHis.CreatedOn > @fromDate
AND CHis.CreatedOn < @toDate
AND CHis.Activity = 1
AND CHis.TagData = '3')
OR (M.ArrivalDate > @fromDate
AND M.ArrivalDate < @toDate))
50 DTU is equivalent to 1/2 logical core.
See more: Using the Azure SQL Database DTU Calculator
I had the same issue this week, and the end users claimed slowness in using the application connected to the VM hosted in Azure. Also, I have almost the same setup (4 CPUs, 14 GB of RAM, and S3 but with 100 DTUs).
In my case, I had a lot of indexes with avg_fragmentation_in_percent greater than 30, and this caused poor performance in executing stored procedures.
Run this in SSMS, and if the indexes of the tables you are running your stored procedure against show up, then you might want to take care of them:
SELECT dbschemas.[name] as 'Schema',
dbtables.[name] as 'Table',
dbindexes.[name] as 'Index',
indexstats.avg_fragmentation_in_percent,
indexstats.page_count
FROM sys.dm_db_index_physical_stats (DB_ID(), NULL, NULL, NULL, NULL) AS indexstats
INNER JOIN sys.tables dbtables on dbtables.[object_id] = indexstats.[object_id]
INNER JOIN sys.schemas dbschemas on dbtables.[schema_id] = dbschemas.[schema_id]
INNER JOIN sys.indexes AS dbindexes ON dbindexes.[object_id] = indexstats.[object_id]
WHERE indexstats.database_id = DB_ID()
AND indexstats.index_id = dbindexes.index_id
AND indexstats.avg_fragmentation_in_percent >30
--AND dbindexes.[name] like '%CLUSTER%'
ORDER BY indexstats.avg_fragmentation_in_percent DESC
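The post stops short of the fix itself; a hedged sketch of acting on one offending index (the names are placeholders):
ALTER INDEX [IX_YourIndex] ON [dbo].[YourTable] REBUILD;
--or, for moderate fragmentation (roughly 5-30 percent), the lighter option:
--ALTER INDEX [IX_YourIndex] ON [dbo].[YourTable] REORGANIZE;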
Edit:
Check also how old the statistics are:
SELECT
sys.objects.name AS table_name,
sys.indexes.name as index_name,
sys.indexes.type_desc as index_type,
stats_date(sys.indexes.object_id,sys.indexes.index_id)
as last_update_stats_date,
DATEDIFF(d,stats_date(sys.indexes.object_id,sys.indexes.index_id),getdate())
as stats_age_in_days
FROM
sys.indexes
INNER JOIN sys.objects on sys.indexes.object_id=sys.objects.object_id
WHERE
sys.objects.type = 'U'
AND
sys.indexes.index_id > 0
--AND sys.indexes.name Like '%CLUSTER%'
ORDER BY
stats_age_in_days DESC;
GO
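If the statistics do turn out to be stale, a sketch of refreshing them (the table name is a placeholder):
UPDATE STATISTICS [dbo].[YourTable] WITH FULLSCAN;
--or refresh every object in the database:
EXEC sp_updatestats;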

Size of a table for records by date

Is it possible to get the size of a table in terms of records before a certain date? Meaning, I want to know the size of the table counting only records that are 2 years old and older.
As you want to get the size based on only some of the rows in the table, you can sum up the length of each column; this will be approximate.
Assuming your date column is createdDate:
select SUM(datalength(col1))+SUM(datalength(col2))+.. from tableName
WHERE datediff(year, createdDate, getdate()) > 2
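One hedged refinement: DATEDIFF on the column defeats any index on createdDate (and counts year-boundary crossings rather than full years), so an equivalent sketch with a sargable predicate is:
select SUM(datalength(col1))+SUM(datalength(col2))+.. from tableName
WHERE createdDate < dateadd(year, -2, getdate())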
You can get the amount of database pages currently allocated for the table and adjust the value to the amount of rows it had 2 years ago:
select sq.SchemaName, sq.TableName, sq.IndexName, sq.IndexId, sq.Rows, sq.TableSize,
sum(sq.total_pages) as [TotalPages],
round(cast(sum(sq.total_pages) as money) / 128, 3) as [ObjectSizeMB]
from (
select object_schema_name(t.object_id) as [SchemaName], t.name as [TableName], i.Name as [IndexName], i.Index_Id as [IndexId], pt.Rows,
au.total_pages,
round(cast(sum(au.total_pages) over(partition by t.object_id) as money) / 128, 3) as [TableSize]
from sys.tables t
inner join sys.partitions pt on t.object_id = pt.object_id
inner join sys.indexes i on pt.object_id = i.object_id and pt.index_id = i.index_id
inner join sys.allocation_units au on au.container_id = case
when au.type in (1, 3) then pt.hobt_id
when au.type = 2 then pt.partition_id
end
where t.type in ('U', 'V')
) sq
where sq.IndexId < 2
group by sq.TableName, sq.IndexName, sq.IndexId, sq.Rows, sq.TableSize, sq.SchemaName
order by sq.tablesize desc, sq.SchemaName, sq.TableName, TotalPages desc;
The ObjectSizeMB column shows space taken by an object (heap, clustered or nonclustered index), while TableSize contains subtotals of these values for the entire table (or indexed view). Depending on your definition of "table size", you can use either of them. And if you want to see nonclustered indices in the list, the outer where sq.IndexId < 2 should be commented out.
Should give you a good starting point.

How to determine if a specific set of tables in a database are empty

I have database A which contains a table (CoreTables) that stores a list of active tables within database B that the organization's users are sending data to.
I would like to be able to have a set-based query that can output a list of only those tables within CoreTables that are populated with data.
Dynamically, I normally would do something like:
For each row in CoreTables
Get the table name
If table is empty
Do nothing
Else
Print table name
Is there a way to do this without a cursor or other dynamic methods? Thanks for any assistance...
Probably the most efficient option is:
SELECT c.name
FROM dbo.CoreTables AS c
WHERE EXISTS
(
SELECT 1
FROM sys.partitions
WHERE index_id IN (0,1)
AND rows > 0
AND [object_id] = OBJECT_ID(c.name)
);
Just note that the counts in sys.sysindexes, sys.partitions, and sys.dm_db_partition_stats are not guaranteed to be completely in sync, due to in-flight transactions.
While you could just run this query in the context of the database, you could do this for a different database as follows (again assuming that CoreTables does not include schema in the name):
SELECT c.name
FROM DatabaseA.dbo.CoreTables AS c
WHERE EXISTS
(
SELECT 1
FROM DatabaseB.sys.partitions AS p
INNER JOIN DatabaseB.sys.tables AS t
ON p.[object_id] = t.object_id
WHERE t.name = c.name
AND p.rows > 0
);
If you need to do this for multiple databases that all contain the same schema (or at least overlapping schema that you're capturing in aggregate in a central CoreTables table), you might want to construct a view, such as:
CREATE VIEW dbo.CoreTableCounts
AS
SELECT db = 'DatabaseB', t.name, rows = MAX(p.rows)
FROM DatabaseB.sys.partitions AS p
INNER JOIN DatabaseB.sys.tables AS t
ON p.[object_id] = t.[object_id]
INNER JOIN DatabaseA.dbo.CoreTables AS ct
ON t.name = ct.name
WHERE p.index_id IN (0,1)
GROUP BY t.name
UNION ALL
SELECT db = 'DatabaseC', t.name, rows = MAX(p.rows)
FROM DatabaseC.sys.partitions AS p
INNER JOIN DatabaseC.sys.tables AS t
ON p.[object_id] = t.[object_id]
INNER JOIN DatabaseA.dbo.CoreTables AS ct
ON t.name = ct.name
WHERE p.index_id IN (0,1)
GROUP BY t.name
-- ...
GO
Now your query isn't going to be quite as efficient, but it doesn't need to hard-code database names as object prefixes; instead it can be:
SELECT name
FROM dbo.CoreTableCounts
WHERE db = 'DatabaseB'
AND rows > 0;
If that is painful to execute you could create a view for each database instead.
In SQL Server, you can do something like:
SELECT o.name, st.row_count
FROM sys.dm_db_partition_stats st join
sys.objects o
on st.object_id = o.object_id
WHERE index_id < 2 and st.row_count > 0
By the way, this specifically does not use OBJECT_ID() or OBJECT_NAME() because these are evaluated in the current database. The above code continues to work for another database, using 3-part naming. This version also takes into account multiple partitions:
SELECT o.name, sum(st.row_count)
FROM <dbname>.sys.dm_db_partition_stats st join
<dbname>.sys.objects o
on st.object_id = o.object_id
WHERE index_id < 2
group by o.name
having sum(st.row_count) > 0
Something like this?
foreach (System.Data.DataTable dt in yourDataSet.Tables)
{
    if (dt.Rows.Count != 0) { PrintYourTableName(dt.TableName); }
}
This is a way you can do it that relies on system tables, so be AWARE it may not always work in future versions of SQL Server. With that strong caveat in mind:
select distinct OBJECT_NAME(si.id) as tabName, si.rowcnt
from sys.sysindexes si
join sys.objects so on so.object_id = si.id
where si.indid = 1 and so.type = 'U'
You would add the tables you are interested in to the where clause, along with rowcnt < 1. For example:
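A hedged sketch of that filter wired up to the question's CoreTables list (assuming, as the other answers do, that CoreTables stores unqualified table names):
select ct.name
from dbo.CoreTables ct
join sys.sysindexes si on si.id = OBJECT_ID(ct.name)
where si.indid in (0, 1) --0 covers heaps, 1 covers clustered indexes
and si.rowcnt < 1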