SQL Server: Grouping hierarchical data

I have a document table - see below for definition
In this table we have a root document, which has an OriginalDocID of null.
Every time a revision is made, a new entry is added with the parent's DocumentID as the OriginalDocID.
What I am looking to do is to group/partition everything around the original document that has an OriginalDocID of null.
Each document can have multiple revisions from one point of origin.
meaning we can have
Doc Id 1 -> 2 -> 3
2 -> 8 -> 9
1 -> 4 -> 7
5 -> 10
So what I would hope to see back is all the rows, with the root document appended.
I hope this makes sense. For the life of me I can't wrap my head around a sufficient query.
CREATE TABLE [dbo].[Document](
[DocumentID] [int] IDENTITY(1,1) NOT NULL,
[DocumentName] [varchar](max) NOT NULL,
[ContentType] [varchar](50) NULL,
[DocumentText] [varchar](max) NULL,
[DateCreated] [datetime] NULL,
[DocumentTypeId] [int] NULL,
[Note] [varchar](8000) NULL,
[RefID] [int] NULL,
[Version] [int] NULL,
[Active] [bit] NULL,
[OriginalDocID] [int] NULL
)

You'll need to use a recursive CTE to do this. That's a query that refers back to itself so it can traverse a hierarchy and gather information as it works its way down (or up) the levels of that hierarchy.
In your case, something like:
WITH docCTE AS
(
/* Recursive seed */
SELECT
    CAST(NULL AS int) AS parentdoc,
    documentID,
    0 AS depth,
    documentID AS originalDocument,
    CAST(documentID AS varchar(100)) AS docpath
FROM
    dbo.document
WHERE originalDocID IS NULL
UNION ALL
/* Recursive term */
SELECT
    docCTE.documentID AS parentdoc,
    document.documentID,
    docCTE.depth + 1 AS depth,
    docCTE.originalDocument,
    CAST(docCTE.docpath + '>' + CAST(document.documentID AS varchar(10)) AS varchar(100)) AS docpath
FROM
    docCTE
    INNER JOIN dbo.document ON docCTE.documentID = document.originalDocID
WHERE
    docCTE.depth <= 15 /* Keep it from cycling in case of a bad hierarchy */
)
SELECT * FROM docCTE;
The recursive CTE is made up of two parts.
The recursive seed, which is what we use to kick off the query. This is all document records where the originalDocID is null.
The recursive term, where we join the table back to the recursive CTE establishing the parent/child relationship.
In your case we capture the documentID in the recursive seed as originalDocument so that we can bring it down through each record found as we traverse the hierarchy of documents.
These can be a little overwhelming when you get started, but after you write one a few times it's second nature (and you'll find them really helpful as you encounter more of this type of data).


Dealing with huge table - 100M+ rows

I have a table with around 100 million rows and it is only getting larger; as the table is queried pretty frequently, I have to come up with some solution to optimise this.
Firstly here is the model:
CREATE TABLE [dbo].[TreningExercises](
[TreningExerciseId] [uniqueidentifier] NOT NULL,
[NumberOfRepsForExercise] [int] NOT NULL,
[CycleNumber] [int] NOT NULL,
[TreningId] [uniqueidentifier] NOT NULL,
[ExerciseId] [int] NOT NULL,
[RoutineExerciseId] [uniqueidentifier] NULL)
Here is Trening table:
CREATE TABLE [dbo].[Trenings](
[TreningId] [uniqueidentifier] NOT NULL,
[DateTimeWhenTreningCreated] [datetime] NOT NULL,
[Score] [int] NOT NULL,
[NumberOfFinishedCycles] [int] NOT NULL,
[PercentageOfCompleteness] [int] NOT NULL,
[IsFake] [bit] NOT NULL,
[IsPrivate] [bit] NOT NULL,
[UserId] [nvarchar](128) NOT NULL,
[AllRoutinesId] [bigint] NOT NULL,
[Name] [nvarchar](max) NULL,
)
Indexes (other than PK which are clustered):
TreningExercises:
TreningId (also FK)
ExerciseId (also FK)
Trenings:
UserId (also FK)
AllRoutinesId (also FK)
Score
DateTimeWhenTreningCreated (ordered by DateTimeWhenTreningCreated DESC)
And here is the example of the most commonly executed query:
DECLARE @userId VARCHAR(40)
,@exerciseId INT;
SELECT TOP (1) R.[TreningExerciseId] AS [TreningExerciseId]
,R.[NumberOfRepsForExercise] AS [NumberOfRepsForExercise]
,R.[TreningId] AS [TreningId]
,R.[ExerciseId] AS [ExerciseId]
,R.[RoutineExerciseId] AS [RoutineExerciseId]
,R.[DateTimeWhenTreningCreated] AS [DateTimeWhenTreningCreated]
FROM (
SELECT TE.[TreningExerciseId] AS [TreningExerciseId]
,TE.[NumberOfRepsForExercise] AS [NumberOfRepsForExercise]
,TE.[TreningId] AS [TreningId]
,TE.[ExerciseId] AS [ExerciseId]
,TE.[RoutineExerciseId] AS [RoutineExerciseId]
,T.[DateTimeWhenTreningCreated] AS [DateTimeWhenTreningCreated]
FROM [dbo].[TreningExercises] AS TE
INNER JOIN [dbo].[Trenings] AS T ON TE.[TreningId] = T.[TreningId]
WHERE (T.[UserId] = @userId)
AND (TE.[ExerciseId] = @exerciseId)
) AS R
ORDER BY R.[DateTimeWhenTreningCreated] DESC
Execution plan: link
Please accept my apologies if it is a bit unreadable or unoptimised; it was generated by an ORM (Entity Framework) and I just edited it a bit.
According to Azure's SQL Analytics tool this query has the most impact on my DB and even though it usually doesn't take too long to execute, from time to time there are spikes in DB I/O due to it.
Also there is a bit of business logic involved in this; to simplify it: 99% of the time I need data which is less than a year old.
What are my best options regarding querying and table size?
My thoughts on querying, either:
Create indexed view OR
Add Date and UserId fields to the TreningExercises table OR
Some option that I haven't thought of :)
Regarding table size, either:
Partition table (probably by date) OR
Move most of the data (or all of it) to some NoSQL key-value store OR
Some option that I haven't thought of :)
What are your thoughts about these problems, how should I approach solving them?
If you add the following columns to the index "ix_TreninID":
NumberOfRepsForExercise
ExerciseId
RoutineExerciseId
That will make the index a "covering index" and eliminate the need for the lookup which is taking 95% of the plan.
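For reference, a sketch of what that covering index could look like (this assumes the existing index named ix_TreninID is keyed on TreningId, as implied above; DROP_EXISTING rebuilds it in place):
-- Sketch only: rebuild the FK index on TreningId with the extra columns
-- as INCLUDEd columns so the query above no longer needs a key lookup.
CREATE NONCLUSTERED INDEX ix_TreninID
    ON dbo.TreningExercises (TreningId)
    INCLUDE (NumberOfRepsForExercise, ExerciseId, RoutineExerciseId)
    WITH (DROP_EXISTING = ON);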
Give it a go, and post back.

.NET framework error when enabling where clause in sql query

I am facing a weird issue wherein, on disabling/enabling a certain condition in the WHERE clause, my SELECT query throws a .NET Framework error.
Here is the CREATE table script.
Table test_classes:
CREATE TABLE [dbo].[test_classes]
(
[CLASSID] [int] NOT NULL,
[PARENTID] [int] NULL,
[CATID] [int] NOT NULL,
[CLASS_NAME] [nvarchar](255) NOT NULL,
[ORIGINAL_NAME] [nvarchar](255) NULL,
[GEOMETRY] [tinyint] NOT NULL,
[READ_ONLY] [bit] NOT NULL,
[DISPLAY_STYLES] [image] NULL,
[FEATURE_COUNT] [int] NOT NULL,
[TEMPOWNER] [int] NULL,
[OPTIONS] [int] NOT NULL,
[POLYGON_TYPE] [int] NULL,
[CLASS_EXTRA] [nvarchar](1024) NULL,
[MAPID] [int] NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
Table test_polygon:
CREATE TABLE [dbo].[test_polygon]
(
[FID] [nvarchar](36) NOT NULL,
[EXTENT_L] [float] NOT NULL,
[EXTENT_T] [float] NOT NULL,
[EXTENT_R] [float] NOT NULL,
[EXTENT_B] [float] NOT NULL,
[COORDINATES] [image] NULL,
[CHAINS] [smallint] NOT NULL,
[CLASSID] [int] NOT NULL,
[SPATIAL_KEY] [bigint] NULL
) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]
Due to word limitation (due to image datatype), here is the INSERT input: GDrive SQL Link
SELECT SQL query:
select
Class_Name, FID,
geometry::STGeomFromWKB(b1+b2,0) as polygon,
Class_ID, Original_Name
from
(Select
cl.Class_Name, p.FID,
substring(CAST(p.Coordinates AS varbinary(max)),1,1) as b1,
substring(CAST(p.Coordinates AS varbinary(max)),3,999999) as b2,
cl.ClassID as Class_ID,
cl.Original_Name
From
test_polygon p
Inner Join
test_classes cl on cl.ClassID = p.ClassID) s_polygon
--where Class_ID = 215 --Filter#1
--where Class_Name = 'L1_County' --Filter#2
To note, Class_ID 215 represents 'L1_County' class_name.
The problem is, if you enable Filter#1, the output is as expected. But when I enable only Filter#2, the query fails with a .NET error.
Expected output :
Class_Name FID polygon Class_ID Original_Name
----------- ---------------- ------------- ----------- ------------------------
L1_County Northamptonshire <long value> 215 B8USR_4DB8184E88092424
Error I get :
Msg 6522, Level 16, State 1, Line 4
A .NET Framework error occurred during execution of user-defined routine or aggregate "geometry":
System.FormatException: 24119: The Polygon input is not valid because the start and end points of the exterior ring are not the same. Each ring of a polygon must have the same start and end points.
System.FormatException:
at Microsoft.SqlServer.Types.GeometryValidator.ValidatePolygonRing(Int32 iRing, Int32 cPoints, Double firstX, Double firstY, Double lastX, Double lastY)
at Microsoft.SqlServer.Types.Validator.Execute(Transition transition)
at Microsoft.SqlServer.Types.ForwardingGeoDataSink.EndFigure()
at Microsoft.SqlServer.Types.WellKnownBinaryReader.ReadLineStringPoints(ByteOrder byteOrder, UInt32 cPoints, Boolean readZ, Boolean readM)
at Microsoft.SqlServer.Types.WellKnownBinaryReader.ReadLinearRing(ByteOrder byteOrder, Boolean readZ, Boolean readM)
at Microsoft.SqlServer.Types.WellKnownBinaryReader.ParseWkbPolygonWithoutHeader(ByteOrder byteOrder, Boolean readZ, Boolean readM)
at Microsoft.SqlServer.Types.WellKnownBinaryReader.ParseWkb(OpenGisType type)
at Microsoft.SqlServer.Types.WellKnownBinaryReader.Read(OpenGisType type, Int32 srid)
at Microsoft.SqlServer.Types.SqlGeometry.GeometryFromBinary(OpenGisType type, SqlBytes binary, Int32 srid)
What I am trying to ask is: why do I get an error when the WHERE clause uses Class_Name, and not when it uses Class_ID?
I am using SQL Server 2012 Enterprise edition. Error replicates in SQL Server 2008 as well.
edit:
Estimated Execution plan for Filter#1 :
Estimated Execution plan for Filter#2 :
I will summarise comments:
You are seeing this issue because your table contains invalid data. The reason you do not see it when searching by test_polygon.Class_ID is that Class_ID is passed as a predicate to the table scan. When test_classes.Class_Name is used as the filter, the search predicate is applied to the test_classes table.
Since the geometry::STGeomFromWKB "Compute Scalar" happens before the "Join", it causes all rows of test_polygon to be evaluated by this function, including rows containing invalid data.
Update: Even though the plans look the same, they are not, as predicate conditions are different for different filters (WHERE conditions) and therefore outputs of table scans operators are different.
There is no standard way to force the order of evaluation in a SQL Server query; by design, you are not supposed to.
There are two options:
Materialise (store in a table) the result of the sub-query. This simply splits the query into two separate queries: one to find the records, and a second to compute data on the found results. The intermediate results are stored in a (temp) table; a sketch follows below.
Use "hacks" that allow you to coerce SQL Server into evaluating the query a certain way.
Below is an example of a "hack":
select
Class_Name, FID,
CASE WHEN Class_Name = Class_Name THEN geometry::STGeomFromWKB(b1+b2,0) ELSE NULL END as polygon,
Class_ID, Original_Name
from
(Select
cl.Class_Name, p.FID,
substring(CAST(p.Coordinates AS varbinary(max)),1,1) as b1,
substring(CAST(p.Coordinates AS varbinary(max)),3,999999) as b2,
cl.ClassID as Class_ID,
cl.Original_Name
From
test_polygon p
Inner Join
test_classes cl on cl.ClassID = p.ClassID) s_polygon
--where Class_ID = 215 --Filter#1
where Class_Name = 'L1_County' --Filter#2
By adding a dummy CASE expression that looks at test_classes.Class_Name we are forcing SQL Server to evaluate it after the JOIN has been resolved.
Useful Article:
http://dataeducation.com/cursors-run-just-fine/

SQL - Joining to Nonexistent Records

After doing a bit of looking, I thought maybe I'd found a solution here: sql join including null and non existing records. Cross Joining my tables seems like a great way to solve my problem, but now I've hit a snag:
Below are the tables I’m using:
CREATE TABLE [dbo].[DCRSales](
[WorkingDate] [smalldatetime] NOT NULL,
[Store] [int] NOT NULL,
[Department] [int] NOT NULL,
[NetSales] [money] NOT NULL,
[DSID] [int] IDENTITY(1,1) NOT NULL)
CREATE TABLE [dbo].[Stores](
[Number] [int] NOT NULL,
[Has_Deli] [bit] NOT NULL,
[Alcohol_Register] [int] NULL,
[Is_Cost_Saver] [bit] NOT NULL,
[Store_Status] [nchar](10) NOT NULL,
[Supervisor_Number] [int] NOT NULL,
[Email_Address] [nchar](20) NOT NULL,
[Sales_Area] [int] NULL,
[PZ_Store_Number] [int] NULL,
[Has_SCO] [bit] NULL,
[SCO_Reg] [nchar](25) NULL,
[Has_Ace] [bit] NULL,
[Ace_Sq_Ft] [int] NULL,
[Open_Date] [datetime] NULL,
[Specialist] [nchar](2) NULL,
[StateID] [int] NOT NULL)
CREATE TABLE [dbo].[DepartmentMap](
[Department_Number] [int] NOT NULL,
[Description] [nvarchar](max) NOT NULL,
[Parent_Department] [int] NOT NULL)
CREATE TABLE [dbo].[ParentDepartments](
[Parent_Department] [int] NOT NULL,
[Description] [varchar](50) NULL
)
DCRSales is a table holding new and archived data. The archived data is not perfect, meaning that there are of course certain missing date gaps and some stores which currently have a department they didn't have or no longer have a department they used to have. My goal is to join this table to our department list, list the child departments and parent departments and SUM up the netsales for a given date range. In cases where a store does not have a department whatsoever in that date range, I still need to display it as 0.00.
A more robust solution would probably be to just store all departments for each store regardless of whether they have that department or not (with sales set to 0.00 of course). However I imagine doing that and/or solving my problem here would require very similar queries anyway.
The query I have tried is as follows:
WITH CTE AS (
SELECT S.Number as Store, DepartmentMap.Department_Number as Department, ParentDepartments.Parent_Department as Parent, ParentDepartments.Description as ParentDescription, DepartmentMap.Description as ChildDescription
FROM Stores as S CROSS JOIN dbo.DepartmentMap INNER JOIN ParentDepartments ON DepartmentMap.Parent_Department = ParentDepartments.Parent_Department
WHERE S.Number IN(<STORES>) AND Department_Number IN(<DEPTS>)
)
SELECT CTE.Store, CTE.Department, SUM(ISNULL(DCRSales.NetSales, 0.00)) as Sales, CTE.Parent, CTE.ParentDescription, CTE.ChildDescription
FROM CTE LEFT JOIN DCRSales ON DCRSales.Department = CTE.Department AND DCRSales.Store = CTE.Store
WHERE DCRSales.WorkingDate BETWEEN '<FIRSTDAY>' AND '<LASTDAY>' OR DCRSales.WorkingDate IS NULL
GROUP BY CTE.Store, CTE.Department, CTE.Parent, CTE.ParentDescription, CTE.ChildDescription
ORDER BY CTE.Store ASC, CTE.Department ASC
In this query I try to CROSS JOIN each department to a store from the Stores table so that I can get a combination of every store and every department. I also include the Parent Departments of each department along with the child department's description and the parent department's description. I filter this first portion based on store and department, but this does not change the general concept.
With this result set, I then try to join this table to all of the sales in DCRSales that are within a certain date range. I also include the date if it’s null because the results that have a NULL sales also have a NULL WorkingDate.
This query seemed to work, until I noticed that not all departments are being used with all stores. The stores in particular that do not get combined with all departments are the ones that have no data for the given date range (meaning they have been closed). If there is no data for the department, it should still be listed with its department number, parent number, department description and parent description (with Sales as 0.00). Any help is greatly appreciated.
Your WHERE clause is filtering out records that do have sales at some point in time, but not in the desired period; those records don't meet either criterion and are therefore excluded.
I might be under-thinking it, but you might just need to move:
DCRSales.WorkingDate BETWEEN '<FIRSTDAY>' AND '<LASTDAY>'
to your LEFT JOIN criteria and get rid of the WHERE clause (a sketch follows below). If that's not right, you could filter sales by date in a second CTE prior to the join.
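For illustration, a sketch of that change, reusing the CTE from the question (placeholder tokens left as-is):
-- Sketch only: the date filter moves from the WHERE clause into the join,
-- so store/department combinations with no sales in range survive as NULLs.
SELECT CTE.Store, CTE.Department, SUM(ISNULL(DCRSales.NetSales, 0.00)) as Sales, CTE.Parent, CTE.ParentDescription, CTE.ChildDescription
FROM CTE
LEFT JOIN DCRSales
    ON DCRSales.Department = CTE.Department
    AND DCRSales.Store = CTE.Store
    AND DCRSales.WorkingDate BETWEEN '<FIRSTDAY>' AND '<LASTDAY>'
GROUP BY CTE.Store, CTE.Department, CTE.Parent, CTE.ParentDescription, CTE.ChildDescription
ORDER BY CTE.Store ASC, CTE.Department ASC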
What you want is an OUTER JOIN.
See this: https://technet.microsoft.com/en-us/library/ms187518(v=sql.105).aspx
I suggest that this process is probably much too complicated to be done using a single query. I think that you need to perform several queries: to extract the transactions-of-interest into a separate table, then to modify the results one-or-more times in that table before using it to produce your final statistics. A stored procedure, which drives several separately stored queries, could be used to drive the process, which, in several "stages," makes several "passes" over the initially-extracted data.
One piece of important information, for instance, would be to know when a particular store had a particular department. (For instance: store, department, starting_date, ending_date.) This would be a refinement of a possibly-existing table (and, maybe, drawn from it ...) which lists what departments a particular store has today.
Let us hope that department-numbers do not change, or that your company hasn't acquired other companies with the resulting need to "re-map" these numbers in some way.
Also: frankly, if you have access to a really-good statistics package, such as SAS® or SPSS® ... can "R" do this sort of thing? ... you might find yourself better-off. (And no, I don't mean "Microsoft Excel ...") ;-)
When I have been faced with requirements like these (and I have been, many times ...), a stats-package was indispensable. I found that I had to "massage" the process and the extracted data a number of times, in a system of successive refinement that gradually led me to a reporting process that I could trust and therefore defend.

SQL query align sequence of events in one row

Hope you are all enjoying the holidays and having fun. Reply to this when you get a chance.
I'm using SQL Server 2005.
The issue is: there is a process in one of our teams which does things in a sequence of events/actions. Mostly it is 4 actions/events for every sequence. These events are fixed. When they perform this, the DWH gets an entry for each event as a separate row.
Example
1) Call Customer
2) Sell Insurance
3) Send PDS
4) Send details to product team
I have set it up so that all actions and their definitions are in the Dimension_Point_Code table. All events that come through from the DWH are treated as facts and stored in Fact_Point.
The Point here refers to the Point in the sequence of the process.
So the table that store this info looks like below
Dimension table
CREATE TABLE [tbl_Dim_Point_Code]
(
[Point_Code_Key] [int] IDENTITY(101,1) NOT NULL,
[Point_Code] [varchar](8) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Point_Code_TouchPoint] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
) ON [PRIMARY]
Fact Table
CREATE TABLE [tbl_FACT_Point]
(
[Point_Key] [bigint] IDENTITY(100,1) NOT NULL,
[Point_Code_Key] [int] NOT NULL,
[Customer_Number] [varchar](19) COLLATE SQL_Latin1_General_CP1_CI_AS NULL,
[Sale_Date] [datetime] NULL,
[Rep_ID] [varchar](50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
) ON [PRIMARY]
Data in Dimension
INSERT INTO [tbl_Dim_Point_Code]
([Point_Code]
,[Point_Code_TouchPoint])
Select 'CALC' as Point_Code,'Point1' as Point_Code_TouchPoint
UNION
Select 'SELL' as Point_Code,'Point2' as Point_Code_TouchPoint
UNION
Select 'SPDS' as Point_Code,'Point3' as Point_Code_TouchPoint
UNION
Select 'TPRD' as Point_Code,'Point4' as Point_Code_TouchPoint
Data in Fact
INSERT INTO [tbl_FACT_Point]
([Point_Code_Key]
,[Customer_Number]
,[Sale_Date]
,[Rep_ID])
Select 101,'C101501','24-Feb-2012','ABC'
UNION
Select 102,'C101501','24-Feb-2012','ABC'
UNION
Select 103,'C101501','26-Feb-2012','DEF'
UNION
Select 104,'C101501','27-Feb-2012','XYZ'
UNION
Select 101,'C101502','2-Feb-2012','GHI'
UNION
Select 102,'C101502','2-Feb-2012','GHI'
UNION
Select 104,'C101502','4-Feb-2012','XYZ'
UNION
Select 101,'C101503','14-Feb-2012','ABC'
UNION
Select 103,'C101503','20-Feb-2012','ABC'
UNION
Select 104,'C101503','22-Feb-2012','BBC'
UNION
Select 101,'C101501','24-Oct-2012','ABC'
UNION
Select 102,'C101501','24-Oct-2012','ABC'
UNION
Select 103,'C101501','26-Oct-2012','DEF'
UNION
Select 104,'C101501','27-Oct-2012','XYZ'
Points to note
1) As you can see, Customer C101501 was called & sold to twice.
2) The processing for all involved happens every day - the team, the DWH and the SQL Server process. Hence most of the time we will not know what is going to happen. We will only know that Event 1 has occurred, then a few days later event 2, and so on.
3) 101 & 104 are mandatory events. 102 and 103 may or may not occur. The team will contact the Product team irrespective of whether a sale is made or not.
Now - what we want is that the entries to be transposed into this table
CREATE TABLE [tbl_Process1_EVENT]
(
[HS_EVENT_KEY] [int] IDENTITY(101,1) NOT NULL,
[Customer_Number] [varchar](19) ,
[Import_Date] [datetime] NULL,
[Point1_Date_Called] [datetime] NULL,
[Point1_PointCode] [varchar](8) ,
[Point1_Rep_Id] [varchar](50) ,
[Point2_Date_Sold] [datetime] NULL,
[Point2_PointCode] [varchar](8) ,
[Point2_Rep_Id] [varchar](50) ,
[Point3_Date_PDSSent] [datetime] NULL,
[Point3_PointCode] [varchar](8) ,
[Point3_Rep_Id] [varchar](50) ,
[Point4_Date_ProdTeamXfer] [datetime] NULL,
[Point4_PointCode] [varchar](8) ,
[Point4_Rep_Id] [varchar](50)
) ON [PRIMARY]
What I'd want to happen is, an output like this.
Customer_Number Import_Date Point1_Date_Called Point1_PointCode Point1_Rep_Id Point2_Date_Sold Point2_PointCode Point2_Rep_Id Point3_Date_PDSSent Point3_PointCode Point3_Rep_Id Point4_Date_ProdTeamXfer Point4_PointCode Point4_Rep_Id
C101501 28/02/2012 24/02/2012 CALC ABC 24/02/2012 SELL ABC 26/02/2012 SPDS DEF 27/02/2012 TPRD XYZ
C101502 3/02/2012 2/02/2012 CALC GHI 2/02/2012 SELL GHI NULL NULL NULL 4/02/2012 TPRD ABC
C101503 23/02/2012 14/02/2012 CALC ABC NULL NULL NULL 20/02/2012 SPDS ABC 22/02/2012 TPRD BBC
C101501 28/10/2012 24/10/2012 CALC ABC 24/10/2012 SELL ABC 26/10/2012 SPDS DEF 27/10/2012 TPRD XYZ
Transposing the data into rows, with a new entry for C101501 if a new sale is made to the same customer.
Import_Date is the date when the row is updated.
As I said, processing happens every day.
So, to elaborate on this table for line 1 of the output:
On 24/02/2012 the CALC event occurs - we know about this on 25/02/2012. Point 1 & Point 2 will be filled with data on 25/02 and Point 3 & Point 4 will be empty. Import_Date will be 25/02/2012.
On 26/02/2012 the PDS is sent, which is Point 3, so the same row will be updated (Point 3 only). Import_Date will be 27/02/2012.
On 27/02/2012 the Point 4 event occurs. Hence on 28/02/2012 the same row will be updated with Point 4 details.
That's all I can think of about the issue.
Any help, any time, will be greatly appreciated.
Enjoy the holidays and have a good time.
PS: I may have done something wrong with the format of the output. Please advise and I can make it better.
I think you need to use the pivot command (aka cross tab). It's not easy to get your head around; a rough sketch follows after the links below.
the following links should / could help:
http://geekswithblogs.net/lorint/archive/2006/08/04/87166.aspx
http://msdn.microsoft.com/en-us/library/ms177410%28v=sql.105%29.aspx
http://www.mssqltips.com/sqlservertip/1019/crosstab-queries-using-pivot-in-sql-server/
http://www.simple-talk.com/blogs/2007/09/14/pivots-with-dynamic-columns-in-sql-server-2005/
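As a starting point against the tables above, here is a minimal sketch using conditional aggregation (the classic cross-tab technique). It is only a sketch: it collapses everything to one row per customer, so repeat sales for the same customer (like C101501 in Feb and Oct) would first need a sale-cycle key to group on.
-- Sketch only: cross-tab via MAX(CASE ...) over the dimension codes.
SELECT
    f.Customer_Number,
    MAX(CASE WHEN d.Point_Code = 'CALC' THEN f.Sale_Date END) AS Point1_Date_Called,
    MAX(CASE WHEN d.Point_Code = 'CALC' THEN d.Point_Code END) AS Point1_PointCode,
    MAX(CASE WHEN d.Point_Code = 'CALC' THEN f.Rep_ID END) AS Point1_Rep_Id,
    MAX(CASE WHEN d.Point_Code = 'SELL' THEN f.Sale_Date END) AS Point2_Date_Sold,
    MAX(CASE WHEN d.Point_Code = 'SELL' THEN d.Point_Code END) AS Point2_PointCode,
    MAX(CASE WHEN d.Point_Code = 'SELL' THEN f.Rep_ID END) AS Point2_Rep_Id,
    MAX(CASE WHEN d.Point_Code = 'SPDS' THEN f.Sale_Date END) AS Point3_Date_PDSSent,
    MAX(CASE WHEN d.Point_Code = 'SPDS' THEN d.Point_Code END) AS Point3_PointCode,
    MAX(CASE WHEN d.Point_Code = 'SPDS' THEN f.Rep_ID END) AS Point3_Rep_Id,
    MAX(CASE WHEN d.Point_Code = 'TPRD' THEN f.Sale_Date END) AS Point4_Date_ProdTeamXfer,
    MAX(CASE WHEN d.Point_Code = 'TPRD' THEN d.Point_Code END) AS Point4_PointCode,
    MAX(CASE WHEN d.Point_Code = 'TPRD' THEN f.Rep_ID END) AS Point4_Rep_Id
FROM tbl_FACT_Point AS f
INNER JOIN tbl_Dim_Point_Code AS d ON d.Point_Code_Key = f.Point_Code_Key
GROUP BY f.Customer_Number;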
Good luck.

Query against 250k rows taking 53 seconds

The box this query is running on is a dedicated server running in a datacenter.
AMD Opteron 1354 Quad-Core 2.20GHz
2GB of RAM
Windows Server 2008 x64 (Yes I know I only have 2GB of RAM, I'm upgrading to 8GB when the project goes live).
So I went through and created 250,000 dummy rows in a table to really stress test some queries that LINQ to SQL generates and make sure they're not too terrible, and I noticed one of them was taking an absurd amount of time.
I had this query down to 17 seconds with indexes, but I removed them for the sake of this post to go from start to finish. The only indexes are the primary keys.
Stories table --
[ID] [int] IDENTITY(1,1) NOT NULL,
[UserID] [int] NOT NULL,
[CategoryID] [int] NOT NULL,
[VoteCount] [int] NOT NULL,
[CommentCount] [int] NOT NULL,
[Title] [nvarchar](96) NOT NULL,
[Description] [nvarchar](1024) NOT NULL,
[CreatedAt] [datetime] NOT NULL,
[UniqueName] [nvarchar](96) NOT NULL,
[Url] [nvarchar](512) NOT NULL,
[LastActivityAt] [datetime] NOT NULL,
Categories table --
[ID] [int] IDENTITY(1,1) NOT NULL,
[ShortName] [nvarchar](8) NOT NULL,
[Name] [nvarchar](64) NOT NULL,
Users table --
[ID] [int] IDENTITY(1,1) NOT NULL,
[Username] [nvarchar](32) NOT NULL,
[Password] [nvarchar](64) NOT NULL,
[Email] [nvarchar](320) NOT NULL,
[CreatedAt] [datetime] NOT NULL,
[LastActivityAt] [datetime] NOT NULL,
Currently in the database there is 1 user, 1 category and 250,000 stories and I tried to run this query.
SELECT TOP(10) *
FROM Stories
INNER JOIN Categories ON Categories.ID = Stories.CategoryID
INNER JOIN Users ON Users.ID = Stories.UserID
ORDER BY Stories.LastActivityAt
The query takes 52 seconds to run, CPU usage hovers at 2-3%, memory usage is 1.1GB with 900MB free, but the disk usage seems out of control. It's at 100MB/sec, with 2/3 of that being writes to tempdb.mdf and the rest reading from tempdb.mdf.
Now for the interesting part...
SELECT TOP(10) *
FROM Stories
INNER JOIN Categories ON Categories.ID = Stories.CategoryID
INNER JOIN Users ON Users.ID = Stories.UserID
SELECT TOP(10) *
FROM Stories
INNER JOIN Users ON Users.ID = Stories.UserID
ORDER BY Stories.LastActivityAt
SELECT TOP(10) *
FROM Stories
INNER JOIN Categories ON Categories.ID = Stories.CategoryID
ORDER BY Stories.LastActivityAt
All 3 of these queries are pretty much instant.
Exec plan for first query.
http://i43.tinypic.com/xp6gi1.png
Exec plans for other 3 queries (in order).
http://i43.tinypic.com/30124bp.png
http://i44.tinypic.com/13yjml1.png
http://i43.tinypic.com/33ue7fb.png
Any help would be much appreciated.
Exec plan after adding indexes (down to 17 seconds again).
http://i39.tinypic.com/2008ytx.png
I've gotten a lot of helpful feedback from everyone, and I thank you. I tried a new angle on this: I query the stories I need, then in separate queries get the Categories and Users, and with 3 queries it only took 250ms... I don't understand the issue, but if it works, and at 250ms no less, I'll stick with that for the time being. Here's the code I used to test this.
DBDataContext db = new DBDataContext();
Console.ReadLine();
Stopwatch sw = Stopwatch.StartNew();
var stories = db.Stories.OrderBy(s => s.LastActivityAt).Take(10).ToList();
var categoryIDs = stories.Select(s => s.CategoryID);
var userIDs = stories.Select(s => s.UserID);
var categories = db.Categories.Where(c => categoryIDs.Contains(c.ID)).ToList();
var users = db.Users.Where(u => userIDs.Contains(u.ID)).ToList();
sw.Stop();
Console.WriteLine(sw.ElapsedMilliseconds);
Try adding an index on Stories.LastActivityAt. I think the clustered index scan in the execution plan may be due to the sorting.
Edit:
Since my query returned in an instant with rows just a few bytes long, but has been running for 5 minutes already and is still going after I added a 2K varchar, I think Mitch has a point. It is the volume of that data that is shuffled around for nothing, but this can be fixed in the query.
Try putting the join, sort and top(10) in a view or in a nested query, and then join back against the story table to get the rest of the data just for the 10 rows that you need.
Like this:
select * from
(
SELECT TOP(10) id, categoryID, userID
FROM Stories
ORDER BY Stories.LastActivityAt
) s
INNER JOIN Stories ON Stories.ID = s.id
INNER JOIN Categories ON Categories.ID = s.CategoryID
INNER JOIN Users ON Users.ID = s.UserID
If you have an index on LastActivityAt, this should run very fast.
So if I read the first part correctly, it responds in 17 seconds with an index, which is still a while to chug out 10 records. I'm thinking the time is in the ORDER BY clause. I would want an index on LastActivityAt, UserID, CategoryID. Just for fun, remove the ORDER BY and see if it returns the 10 records quickly. If it does, then you know the time is not in the joins to the other tables. Also, it would be helpful to replace the * with just the columns needed, as all 3 tables' columns end up in tempdb as you are sorting - as Neil mentioned.
Looking at the execution plans you'll notice the extra sort - I believe that is the ORDER BY, which is going to take some time. I'm assuming you had an index with the 3 columns and it was 17 seconds... so you may want one index for the join criteria (UserID, CategoryID) and another for LastActivityAt - see if that performs better. Also, it would be good to run the query through the Index Tuning Wizard.
My first suggestion is to remove the *, and replace it with the minimum columns you need.
second, is there a trigger involved? Something that would update the LastActivityAt field?
Based on your problem query, try adding a combination index on table Stories (CategoryID, UserID, LastActivityAt).
You are maxing out the Disks in your hardware setup.
Given your comments about your Data/Log/tempDB File placement, I think any amount of tuning is going to be a bandaid.
250,000 Rows is small. Imagine how bad your problems are going to be with 10 million rows.
I suggest you move tempDB onto its own physical drive (preferably a RAID 0).
OK, so my test machine isn't fast. Actually, it's really slow: 1.6 GHz, 1 GB of RAM, no multiple disks, just a single (read: slow) disk for SQL Server, the OS, and extras.
I created your tables with primary and foreign keys defined.
Inserted 2 categories, 500 random users, and 250000 random stories.
Running the first query above takes 16 seconds (no plan cache either).
If I index the LastActivityAt column I get results in under a second (no plan cache here either).
Here's the script I used to do all of this.
--Categories table --
Create table Categories (
[ID] [int] IDENTITY(1,1) primary key NOT NULL,
[ShortName] [nvarchar](8) NOT NULL,
[Name] [nvarchar](64) NOT NULL)
--Users table --
Create table Users(
[ID] [int] IDENTITY(1,1) primary key NOT NULL,
[Username] [nvarchar](32) NOT NULL,
[Password] [nvarchar](64) NOT NULL,
[Email] [nvarchar](320) NOT NULL,
[CreatedAt] [datetime] NOT NULL,
[LastActivityAt] [datetime] NOT NULL
)
go
-- Stories table --
Create table Stories(
[ID] [int] IDENTITY(1,1) primary key NOT NULL,
[UserID] [int] NOT NULL references Users ,
[CategoryID] [int] NOT NULL references Categories,
[VoteCount] [int] NOT NULL,
[CommentCount] [int] NOT NULL,
[Title] [nvarchar](96) NOT NULL,
[Description] [nvarchar](1024) NOT NULL,
[CreatedAt] [datetime] NOT NULL,
[UniqueName] [nvarchar](96) NOT NULL,
[Url] [nvarchar](512) NOT NULL,
[LastActivityAt] [datetime] NOT NULL)
Insert into Categories (ShortName, Name)
Values ('cat1', 'Test Category One')
Insert into Categories (ShortName, Name)
Values ('cat2', 'Test Category Two')
--Dummy Users
Insert into Users
Select top 500
UserName=left(SO.name+SC.name, 32)
, Password=left(reverse(SC.name+SO.name), 64)
, Email=Left(SO.name, 128)+'@'+left(SC.name, 123)+'.com'
, CreatedAt='1899-12-31'
, LastActivityAt=GETDATE()
from sysobjects SO
Inner Join syscolumns SC on SO.id=SC.id
go
--dummy stories!
-- A Count is given every 10000 record inserts (could be faster)
-- RBAR method!
set nocount on
Declare @count as bigint
Set @count = 0
begin transaction
while @count<=250000
begin
Insert into Stories
Select
USERID=floor(((500 + 1) - 1) * RAND() + 1)
, CategoryID=floor(((2 + 1) - 1) * RAND() + 1)
, votecount=floor(((10 + 1) - 1) * RAND() + 1)
, commentcount=floor(((8 + 1) - 1) * RAND() + 1)
, Title=Cast(NEWID() as VARCHAR(36))+Cast(NEWID() as VARCHAR(36))
, Description=Cast(NEWID() as VARCHAR(36))+Cast(NEWID() as VARCHAR(36))+Cast(NEWID() as VARCHAR(36))
, CreatedAt='1899-12-31'
, UniqueName=Cast(NEWID() as VARCHAR(36))+Cast(NEWID() as VARCHAR(36))
, Url=Cast(NEWID() as VARCHAR(36))+Cast(NEWID() as VARCHAR(36))
, LastActivityAt=Dateadd(day, -floor(((600 + 1) - 1) * RAND() + 1), GETDATE())
If @count % 10000=0
Begin
Print @count
Commit
begin transaction
End
Set @count=@count+1
end
set nocount off
go
--returns in 16 seconds
DBCC DROPCLEANBUFFERS
SELECT TOP(10) *
FROM Stories
INNER JOIN Categories ON Categories.ID = Stories.CategoryID
INNER JOIN Users ON Users.ID = Stories.UserID
ORDER BY Stories.LastActivityAt
go
--Now create an index
Create index IX_LastADate on Stories (LastActivityAt asc)
go
--With an index returns in less than a second
DBCC DROPCLEANBUFFERS
SELECT TOP(10) *
FROM Stories
INNER JOIN Categories ON Categories.ID = Stories.CategoryID
INNER JOIN Users ON Users.ID = Stories.UserID
ORDER BY Stories.LastActivityAt
go
The sort is definitely where your slowdown is occurring.
Sorting mainly gets done in the tempdb and a large table will cause LOTS to be added.
Having an index on this column will definitely improve performance on an order by.
Also, defining your primary and foreign keys helps SQL Server immensely.
Your method as listed in your code is elegant, and basically the same response that cdonner wrote, except in C# and not SQL. Tuning the DB will probably give even better results!
--Kris
Have you cleared the SQL Server cache before running each of the queries?
In SQL 2000, it's something like DBCC DROPCLEANBUFFERS. Google the command for more info.
Looking at the query, I would have an index for
Categories.ID
Stories.CategoryID
Users.ID
Stories.UserID
and possibly
Stories.LastActivityAt
But yeah, sounds like the result could be bogus 'cos of caching.
When you have worked with SQL Server for some time, you will discover that even the smallest changes to a query can cause wildly different response times. From what I have read in the initial question, and looking at the query plan, I suspect that the optimizer has decided that the best approach is to form a partial result and then sort that as a separate step. The partial result is a composite of the Users and Stories tables. This is formed in tempdb. So the excessive disk access is due to the forming and then sorting of this temporary table.
I concur that the solution should be to create a compound index on Stories.LastActivityAt, Stories.UserId, Stories.CategoryId. The order is VERY important, the field LastActivityAt must be first.
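For illustration, a sketch of that compound index (the index name here is made up; adjust to your naming convention):
-- Sketch only: compound index with LastActivityAt leading, as suggested above.
CREATE NONCLUSTERED INDEX IX_Stories_LastActivityAt
    ON dbo.Stories (LastActivityAt, UserID, CategoryID);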