I am using MS SQL Server 2005 at work to build a database. I have been told that most tables will hold 1,000,000 to 500,000,000 rows in the near future, once it is built. I have not worked with datasets this large, and most of the time I don't even know what I should be considering when deciding how to set up the schema, queries, and so on.
I need to know the start and end dates for something, and a value that is associated with an ID during that time frame. We can set the table up two different ways:
create table xxx_test2 (id int identity(1,1), groupid int, dt datetime, i int)
create table xxx_test2 (id int identity(1,1), groupid int, start_dt datetime, end_dt datetime, i int)
Which is better? How do I define better? I filled the first table with about 100,000 rows of data and it takes about 10-12 seconds to set up in the format of the second table depending on the query...
select y.groupid,
y.dt as [start],
z.dt as [end],
(case when z.dt is null then 1 else 0 end) as latest,
y.i
from #x as y
outer apply (select top 1 *
from #x as x
where x.groupid = y.groupid and
x.dt > y.dt
order by x.dt asc) as z
or
http://consultingblogs.emc.com/jamiethomson/archive/2005/01/10/t-sql-deriving-start-and-end-date-from-a-single-effective-date.aspx
But with the second table, to insert a new row I have to check whether there is a previous row and, if so, update its end date. So is it a question of retrieval performance vs. insert/update cost? It seems silly to store that end date twice, but maybe not? What things should I be looking at?
This is what I used to generate my fake data, if you want to play with it (if you change the maximum of the random number to something higher, it will generate the fake data a lot faster):
declare @dt datetime
declare @i int
declare @id int
set @id = 1
declare @rowcount int
set @rowcount = 0
declare @numrows int
while (@rowcount<100000)
begin
    set @i = 1
    set @dt = getdate()
    -- random number of rows (1 to 5) for this group
    set @numrows = Cast(((5 + 1) - 1) * Rand() + 1 As tinyint)
    while @i<=@numrows
    begin
        insert into #x values (@id, dateadd(d,@i,@dt), @i)
        set @i = @i + 1
    end
    set @rowcount = @rowcount + @numrows
    set @id = @id + 1
    print @rowcount
end
For your purposes, I think option 2 is the way to go for table design. This gives you flexibility, and will save you tons of work.
Having the effective date and end date will allow you to have a query that will only return currently effective data by having this in your where clause:
where getdate() between effectivedate and enddate
You can also then use it to join with other tables in a time-sensitive way.
Provided you set up the key properly and provide the right indexes, performance (on this table at least) should not be a problem.
For anyone who can use the LEAD analytic function (SQL Server 2012, Oracle, DB2, ...), retrieving data from the first table (the one with a single date column) should be much quicker than without it:
select
groupid,
dt "start",
lead(dt) over (partition by groupid order by dt) "end",
case when lead(dt) over (partition by groupid order by dt) is null
then 1 else 0 end "latest",
i
from x
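For readers without LEAD, it may help to see what the window function computes. The following is a minimal Python sketch, run outside the database, of the same pairing logic: within each groupid, each row's end is the dt of the next row, and the last row of a group is flagged as latest. The function name derive_ranges and the sample rows are made up for illustration.

```python
from datetime import date
from itertools import groupby
from operator import itemgetter


def derive_ranges(rows):
    """rows: iterable of (groupid, dt, i).

    Returns (groupid, start, end, latest, i) tuples, where end is the dt of
    the next row in the same group (None for the last row of a group).
    """
    out = []
    ordered = sorted(rows, key=itemgetter(0, 1))  # by groupid, then dt
    for groupid, grp in groupby(ordered, key=itemgetter(0)):
        grp = list(grp)
        # Pair every row with its successor; the last row is paired with None.
        for cur, nxt in zip(grp, grp[1:] + [None]):
            end = nxt[1] if nxt is not None else None
            out.append((groupid, cur[1], end, 1 if end is None else 0, cur[2]))
    return out


# Hypothetical sample data, deliberately unsorted.
sample = [
    (1, date(2013, 1, 3), 30),
    (1, date(2013, 1, 1), 10),
    (2, date(2013, 1, 2), 5),
    (1, date(2013, 1, 2), 20),
]
ranges = derive_ranges(sample)
```

This is a single sorted pass per group, which is roughly why a window function tends to beat the correlated OUTER APPLY lookup from the question.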
Could someone please advise on how to repeat the query if it returned no results. I am trying to generate a random person out of the DB using RAND, but only if that number was not used previously (that info is stored in the column "allready_drawn").
At this point when the query comes over the number that was drawn before, because of the second condition "is null" it does not display a result.
I would need for query to re-run once again until it comes up with a number.
DECLARE @min INTEGER;
DECLARE @max INTEGER;
set @min = (select top 1 id from [dbo].[persons] where sector = 8 order by id ASC);
set @max = (select top 1 id from [dbo].[persons] where sector = 8 order by id DESC);
select
ordial,
name_surname
from [dbo].[persons]
where id = ROUND(((@max - @min) * RAND() + @min), 0) and allready_drawn is NULL
The results (two possible outcomes):
Any suggestion is appreciated and I would like to thank everyone in advance.
Just try this to remove the "id" filter so you only have to run it once
select TOP 1
ordial,
name_surname
from [dbo].[persons]
where allready_drawn is NULL
ORDER BY NEWID()
@gbn that's a correct solution, but it may be too expensive. For very large tables with dense keys, randomly picking a key value between the min and max and re-picking until you find a match is also fair, and cheaper than sorting the whole table.
Also, there's a bug in the original post: the min and max rows will be selected only half as often as the others, because each maps to a smaller interval. To fix this, generate a random number from @min to @max + 1 and truncate rather than round. That way you map the interval [N,N+1) to N, ensuring a fair chance for each N.
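The rounding bias is easy to check outside the database. The Python sketch below is only an illustration: it replaces RAND() with an evenly spaced grid over [0, 1) so the counts are exact, and the helper names pick_round and pick_truncate are made up. It counts how often each id between 1 and 4 is hit under both formulas.

```python
import math
from collections import Counter


def pick_round(u, lo, hi):
    # Original formula: ROUND((max - min) * RAND() + min, 0).
    # For positive values, rounding half up is floor(x + 0.5).
    return math.floor((hi - lo) * u + lo + 0.5)


def pick_truncate(u, lo, hi):
    # Suggested fix: scale onto [lo, hi + 1) and truncate.
    return math.floor((hi + 1 - lo) * u + lo)


# Evenly spaced stand-ins for RAND(), offset so none sits on a boundary.
STEPS = 1200
grid = [(k + 0.5) / STEPS for k in range(STEPS)]

rounded = Counter(pick_round(u, 1, 4) for u in grid)
truncated = Counter(pick_truncate(u, 1, 4) for u in grid)

print(sorted(rounded.items()))    # endpoints get half the share of the middle ids
print(sorted(truncated.items()))  # every id gets the same share
```

With rounding, ids 1 and 4 each receive 200 of the 1200 samples while ids 2 and 3 receive 400; with truncation every id receives 300.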
For this selection method, here's how to repeat until you find a match.
--drop table persons
go
create table persons(id int, ordial int, name_surname varchar(2000), sector int, allready_drawn bit)
insert into persons(id,ordial,name_surname,sector, allready_drawn)
values (1,1,'foo',8,null),(2,2,'foo2',8,null),(100,100,'foo100',8,null)
go
declare @min int = (select top 1 id from [dbo].[persons] where sector = 8 order by id ASC);
declare @max int = 1+ (select top 1 id from [dbo].[persons] where sector = 8 order by id DESC);
set nocount on
declare @results table(ordial int, name_surname varchar(2000))
declare @i int = 0
declare @selected bit = 0
while @selected = 0
begin
    set @i += 1
    insert into @results(ordial,name_surname)
    select
    ordial,
    name_surname
    from [dbo].[persons]
    where id = ROUND(((@max - @min) * RAND() + @min), 0, 1) and allready_drawn is NULL
    if @@ROWCOUNT > 0
    begin
        select *, @i tries from @results
        set @selected = 1
    end
end
Is there a way to compare 2 rows of a table without self-join?
I have a table of events with columns: ID, date, tool. I query it, and add row numbers to the result set (sorted by date, for each tool separately). Now I want to know if the time difference between rows 1 and 4 is more than a week.
I could achieve this by joining my query to itself (pretty simple), however it will make the query run twice (right?) which is not very efficient (as my query is not simple and already required some joining). Is there a smarter way to achieve this?
I am using SQL server (not sure which version; probably 2008), and querying from an ASP.NET application, so I don't have administrative access to the DB, and some advanced stuff will not work (but I'm willing to try every suggestion and check).
Thanks.
You can play with windowing functions to achieve your goal, but maybe over(partition by ... order by) will suffice (http://www.sqlfiddle.com/#!6/2d448/1):
IF OBJECT_ID('tempdb..#SomeTable', 'U') IS NOT NULL
BEGIN
    DROP TABLE #SomeTable
END
create table #SomeTable (Id int, [date] datetime, tool nvarchar(80), description nvarchar(80))
DECLARE @date AS DATETIME
SET @date = CAST('2013-01-10' as datetime)
declare @i as int = 0
declare @j as int = 0
declare @k as int = 0
WHILE (@i < 100)
begin
    set @i = @i + 1
    set @j = 0
    WHILE (@j < 100)
    begin
        set @j = @j + 1
        set @k = CAST(RAND() * 100 as int)
        insert into #SomeTable (Id, [date], tool, description) values (@i*@j, DATEADD(dd,@k,@date), 'tool-' + STUFF('000', 3-LEN(@i)+1, LEN(@i), @i) , 'description-' + CAST(@i as nvarchar(10)) + '-' + CAST(@j as nvarchar(10)))
    end
end
-- maybe this will be enough
--select *, DATEDIFF(dd, MIN([date]) OVER(PARTITION BY tool), [date]) AS 'days' from #SomeTable order by tool, days
-- SQL Server 2012 only
-- window of the current row plus the 4 preceding rows
--select *, DATEDIFF(dd, MIN([date]) OVER(PARTITION BY tool ORDER BY tool, [date] ROWS BETWEEN 4 PRECEDING AND CURRENT ROW), [date]) AS 'days' from #SomeTable order by tool, days
select *, DATEDIFF(dd, MIN([date]) OVER(ORDER BY tool, [date] ROWS BETWEEN 4 PRECEDING AND CURRENT ROW), [date]) AS 'days' from #SomeTable order by tool, days
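To make the "row 1 vs. row 4" check from the question concrete, here is a rough Python sketch of the same idea done in one pass per tool, without joining the data to itself. It is an illustration only: the function name gap_exceeds, its parameters, and the sample events are hypothetical, and it assumes the events are already available as (tool, date) pairs.

```python
from collections import defaultdict
from datetime import date, timedelta


def gap_exceeds(events, first_row=1, second_row=4, limit=timedelta(days=7)):
    """events: iterable of (tool, date).

    For each tool, sort its dates and report whether the gap between the
    rows numbered first_row and second_row (1-based) exceeds limit.
    Tools with fewer than second_row rows map to None.
    """
    by_tool = defaultdict(list)
    for tool, when in events:
        by_tool[tool].append(when)

    result = {}
    for tool, dates in by_tool.items():
        dates.sort()  # same ordering as ROW_NUMBER() ... ORDER BY date
        if len(dates) < second_row:
            result[tool] = None  # not enough rows to compare
        else:
            result[tool] = dates[second_row - 1] - dates[first_row - 1] > limit
    return result


# Hypothetical sample: tool-a spans 9 days over 4 rows, tool-b only 3 days.
events = [
    ("tool-a", date(2013, 1, 10)), ("tool-a", date(2013, 1, 12)),
    ("tool-a", date(2013, 1, 15)), ("tool-a", date(2013, 1, 19)),
    ("tool-b", date(2013, 1, 10)), ("tool-b", date(2013, 1, 11)),
    ("tool-b", date(2013, 1, 12)), ("tool-b", date(2013, 1, 13)),
    ("tool-c", date(2013, 1, 10)),
]
flags = gap_exceeds(events)
```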
I have the following SQL query:
DECLARE @MyVar datetime = '1/1/2010'
SELECT @MyVar
This naturally returns '1/1/2010'.
What I want to do is have a list of dates, say:
1/1/2010
2/1/2010
3/1/2010
4/1/2010
5/1/2010
Then I want to FOR EACH through the dates and run the SQL query.
Something like (pseudocode):
List = 1/1/2010,2/1/2010,3/1/2010,4/1/2010,5/1/2010
For each x in List
do
DECLARE @MyVar datetime = x
SELECT @MyVar
So this would return:-
1/1/2010
2/1/2010
3/1/2010
4/1/2010
5/1/2010
I want this to return the data as one resultset, not multiple resultsets, so I may need to use some kind of union at the end of the query, so each iteration of the loop unions onto the next.
edit
I have a large query that accepts a 'to date' parameter, I need to run it 24 times, each time with a specific to date which I need to be able to supply (these dates are going to be dynamic) I want to avoid repeating my query 24 times with union alls joining them as if I need to come back and add additional columns it would be very time consuming.
SQL is primarily a set-orientated language - it's generally a bad idea to use a loop in it.
In this case, a similar result could be achieved using a recursive CTE:
with cte as
(select 1 i union all
select i+1 i from cte where i < 5)
select dateadd(d, i-1, '2010-01-01') from cte
Here is an option with a table variable:
DECLARE @MyVar TABLE(Val DATETIME)
DECLARE @I INT, @StartDate DATETIME
SET @I = 1
SET @StartDate = '20100101'
WHILE @I <= 5
BEGIN
    INSERT INTO @MyVar(Val)
    VALUES(@StartDate)
    SET @StartDate = DATEADD(DAY,1,@StartDate)
    SET @I = @I + 1
END
SELECT *
FROM @MyVar
You can do the same with a temp table:
CREATE TABLE #MyVar(Val DATETIME)
DECLARE @I INT, @StartDate DATETIME
SET @I = 1
SET @StartDate = '20100101'
WHILE @I <= 5
BEGIN
    INSERT INTO #MyVar(Val)
    VALUES(@StartDate)
    SET @StartDate = DATEADD(DAY,1,@StartDate)
    SET @I = @I + 1
END
SELECT *
FROM #MyVar
You should tell us what your main goal is; as @JohnFx said, this could probably be done another (more efficient) way.
You could use a table variable, like this:
declare @num int
set @num = 1
declare @results table ( val int )
while (@num < 6)
begin
    insert into @results ( val ) values ( @num )
    set @num = @num + 1
end
select val from @results
This kind of depends on what you want to do with the results. If you're just after the numbers, a set-based option would be a numbers table - which comes in handy for all sorts of things.
For MSSQL 2005+, you can use a recursive CTE to generate a numbers table inline:
;WITH Numbers (N) AS (
SELECT 1 UNION ALL
SELECT 1 + N FROM Numbers WHERE N < 500
)
SELECT N FROM Numbers
OPTION (MAXRECURSION 500)
declare @counter as int
set @counter = 0
declare @date as varchar(50)
set @date = cast(1+@counter as varchar)+'/01/2013'
while(@counter < 12)
begin
    select cast(1+@counter as varchar)+'/01/2013' as date
    set @counter = @counter + 1
end
Of course, this is an old question, but I have a simple solution that needs no looping, CTE, or table variables.
DECLARE @MyVar datetime = '1/1/2010'
SELECT @MyVar
SELECT DATEADD (DD,NUMBER,@MyVar)
FROM master.dbo.spt_values
WHERE TYPE='P' AND NUMBER BETWEEN 0 AND 4
ORDER BY NUMBER
Note: spt_values is an undocumented Microsoft table. It holds numbers for every type. Relying on it is not advisable, since as an undocumented object it can be removed in any new version of SQL Server without notice. But it can serve as a quick workaround in scenarios like the one above.
[CREATE PROCEDURE [rat].[GetYear]
AS
BEGIN
-- Variable for storing the start year
Declare @StartYear as int
-- Variable for the end year
Declare @EndYear as int
-- Setting the start year
select @StartYear = Value from rat.Configuration where Name = 'REPORT_START_YEAR';
-- Setting the end year
select @EndYear = Value from rat.Configuration where Name = 'REPORT_END_YEAR';
-- Building the list of years with a recursive CTE
with [Years] as
(
--Selecting the first year (anchor row)
select @StartYear [Year]
--doing Union
union all
-- recursing until the end year is reached
select Year+1 Year from [Years] where Year < @EndYear
)
--Selecting the Year table
selec]
I have the following problem, that I would like to solve with transact-sql.
I have something like this
Start | End | Item
1 | 5 | A
3 | 8 | B
and I want to create something like
Start | End | Item-Combination
1 | 2 | A
3 | 5 | A-B
6 | 8 | B
For the Item-Combination concatenation I already thought of using the FOR XML statement. But in order to create the different new intervals... I really don't know how to approach it. Any idea?
Thanks.
I had a very similar problem with some computer usage data. I had session data indicating login/logout times. I wanted to find the times (hour of day per day of week) that were the most in demand, that is, the hours where the most users were logged in. I ended up solving the problem client-side using hash tables. For each session, I would increment the bucket for a particular location corresponding to the day of week and hour of day for each day/hour for which the session was active. After examining all sessions the hash table values show the number of logins during each hour for each day of the week.
I think you could do something similar, keeping track of each item seen for each start/end value. You could then reconstruct the table by collapsing adjacent entries that have the same item combination.
And, no, I could not think of a way to solve my problem with SQL either.
This is a fairly typical range-finding problem, with the concatenation thrown in. Not sure if the following fits exactly, but it's a starting point. (Cursors are usually best avoided except in the small set of cases where they are faster than set-based solutions, so before the cursor haters get on me please note I use a cursor here on purpose because this smells to me like a cursor-friendly problem -- I typically avoid them.)
So if I create data like this:
CREATE TABLE [dbo].[sourceValues](
[Start] [int] NOT NULL,
[End] [int] NOT NULL,
[Item] [varchar](100) NOT NULL
) ON [PRIMARY]
GO
ALTER TABLE [dbo].[sourceValues] WITH CHECK ADD CONSTRAINT [End_after_Start] CHECK (([End]>[Start]))
GO
ALTER TABLE [dbo].[sourceValues] CHECK CONSTRAINT [End_after_Start]
GO
declare @i int; set @i = 0;
declare @start int;
declare @end int;
declare @item varchar(100);
while @i < 1000
begin
    set @start = ABS( CHECKSUM( newid () ) % 100 ) + 1 ; -- "random" int
    set @end = @start + ( ABS( CHECKSUM( newid () ) % 10 ) ) + 2; -- bigger random int
    set @item = char( ( ABS( CHECKSUM( newid() ) ) % 5 ) + 65 ); -- random letter A-E
    print @start; print @end; print @item;
    insert into sourceValues( Start, [End], Item) values ( @start , @end, @item );
    set @i += 1;
end
Then I can treat the problem like this: each "Start" AND each "End" value represents a change in the collection of current Items, either adding one or removing one, at a certain time. In the code below I alias that notion as "event," meaning an Add or Remove. Each start or end is like a time, so I use the term "tick." If I make a collection of all the events, ordered by event time (Start AND End), I can iterate through it while keeping a running tally in an in-memory table of all the Items that are in play. Each time the tick value changes, I take a snapshot of that tally:
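The event/tick idea can be tried outside the database first. The Python sketch below is an illustration under stated assumptions, not a translation of the T-SQL in this answer: it treats Start and End as inclusive integers (so a removal is scheduled at End + 1) in order to reproduce the output asked for in the question, and the function name combine_ranges is made up.

```python
from collections import Counter


def combine_ranges(rows):
    """rows: iterable of (start, end, item) with inclusive integer bounds.

    Returns (start, end, "A-B-...") segments during which the set of active
    items does not change. Gaps with no active item are skipped.
    """
    events = []
    for start, end, item in rows:
        events.append((start, 1, item))     # item becomes active at start
        events.append((end + 1, -1, item))  # item is no longer active after end
    events.sort(key=lambda e: e[0])         # order by tick only (stable sort)

    active = Counter()  # running tally of the items currently in play
    result = []
    last_tick = None
    for tick, delta, item in events:
        # Each time the tick changes, snapshot the tally for the span just ended.
        if last_tick is not None and tick != last_tick and active:
            result.append((last_tick, tick - 1, "-".join(sorted(active))))
        active[item] += delta
        if active[item] == 0:
            del active[item]
        last_tick = tick
    return result


segments = combine_ranges([(1, 5, "A"), (3, 8, "B")])
print(segments)  # [(1, 2, 'A'), (3, 5, 'A-B'), (6, 8, 'B')]
```

Adjacent segments that happen to carry the same combination are not merged here; that would be a separate collapsing step.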
declare @tick int;
declare @lastTick int;
declare @event varchar(100);
declare @item varchar(100);
declare @concatList varchar(max);
declare @currentItemsList table ( Item varchar(100) );
create table #result ( Start int, [End] int, Items varchar(max) );
declare eventsCursor CURSOR FAST_FORWARD for
    select tick, [event], item from (
        select start as tick, 'Add' as [event], item from sourceValues as adds
        union all
        select [end] as tick, 'Remove' as [event], item from sourceValues as removes
    ) as [events]
    order by tick
set @lastTick = 1
open eventsCursor
fetch next from eventsCursor into @tick, @event, @item
while @@FETCH_STATUS = 0
BEGIN
    if @tick != @lastTick
    begin
        set @concatList = ''
        select @concatList = @concatList + case when len( @concatList ) > 0 then '-' else '' end + Item
        from @currentItemsList
        insert into #result ( Start, [End], Items ) values ( @lastTick, @tick, @concatList )
    end
    if @event = 'Add' insert into @currentItemsList ( Item ) values ( @item );
    else if @event = 'Remove' delete top ( 1 ) from @currentItemsList where Item = @item;
    set @lastTick = @tick;
    fetch next from eventsCursor into @tick, @event, @item;
END
close eventsCursor
deallocate eventsCursor
select * from #result order by start
drop table #result
Using a cursor for this special case allows just one "pass" through the data, like a running totals problem. Itzik Ben-Gan has some great examples of this in his SQL 2005 books.
Thanks a lot for all the answers. For the moment I have found a way of doing it. Since I'm dealing with a data warehouse and I have a Time dimension, I could do some joins with the Time dimension in the style of "inner join DimTime t on t.date between f.start_date and end_date".
It's not very good from a performance point of view, but it seems to be working for me.
I'll give onupdatecascade's implementation a try, to see which suits me better.
This reproduces and solves the problem exactly as stated:
-- prepare problem, it can have many rows with overlapping ranges
declare @range table
(
    Item char(1) primary key,
    [Start] int,
    [End] int
)
insert @range select 'A', 1, 5
insert @range select 'B', 3, 8
-- unroll the ranges into helper table
declare @usage table
(
    Item char(1),
    Number int
)
declare
    @Start int,
    @End int,
    @Item char(1)
declare table_cur cursor local forward_only read_only for
    select [Start], [End], Item from @range
open table_cur
fetch next from table_cur into @Start, @End, @Item
while @@fetch_status = 0
begin
    with
    Num(Pos) as -- generate numbers used
    (
        select cast(@Start as int)
        union all
        select cast(Pos + 1 as int) from Num where Pos < @End
    )
    insert
        @usage
    select
        @Item,
        Pos
    from
        Num
    option (maxrecursion 0) -- just in case more than 100
    fetch next from table_cur into @Start, @End, @Item
end
close table_cur
deallocate table_cur
-- compile overlaps
;
with
overlaps as
(
    select
        Number,
        (
            select
                Item + '-'
            from
                @usage as i
            where
                o.Number = i.Number
            for xml path('')
        )
        as Items
    from
        @usage as o
    group by
        Number
)
select
    min(Number) as [Start],
    max(Number) as [End],
    left(Items, len(Items) - 1) as Items -- beautify
from
    overlaps
group by
    Items
I use ROW_NUMBER() to page through my website content, and when you hit the last page it times out because SQL Server takes too long to complete the search.
There's already an article about this problem, but there seems to be no perfect solution yet.
http://weblogs.asp.net/eporter/archive/2006/10/17/ROW5F00NUMBER28002900-OVER-Not-Fast-Enough-With-Large-Result-Set.aspx
When I click the last page on StackOverflow it takes less than a second to return the page, which is really fast. I'm wondering whether they have really fast database servers or a solution to the ROW_NUMBER() problem.
Any ideas?
Years back, while working with SQL Server 2000, which did not have this function, we had the same issue.
We found this method, which at first glance looks like it would perform badly, but it blew us out of the water.
Try this out:
DECLARE @Table TABLE(
    ID INT PRIMARY KEY
)
--insert some values, as many as required.
DECLARE @I INT
SET @I = 0
WHILE @I < 100000
BEGIN
    INSERT INTO @Table SELECT @I
    SET @I = @I + 1
END
DECLARE @Start INT,
    @Count INT
SELECT @Start = 10001,
    @Count = 50
SELECT *
FROM (
    SELECT TOP (@Count)
        *
    FROM (
        SELECT TOP (@Start + @Count)
            *
        FROM @Table
        ORDER BY ID ASC
    ) TopAsc
    ORDER BY ID DESC
) TopDesc
ORDER BY ID
The base logic of this method relies on the SET ROWCOUNT expression to both skip the unwanted rows and fetch the desired ones:
DECLARE @Sort /* the type of the sorting column */
SET ROWCOUNT @StartRow
SELECT @Sort = SortColumn FROM Table ORDER BY SortColumn
SET ROWCOUNT @PageSize
SELECT ... FROM Table WHERE SortColumn >= @Sort ORDER BY SortColumn
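The two steps above amount to "find the sort value of the first row on the page, then seek to it and read one page". As a rough illustration only, here is that logic in Python over an already sorted list standing in for the index on SortColumn; the function name seek_page is hypothetical, and the sketch assumes the sort column is unique.

```python
import bisect


def seek_page(sorted_keys, start_row, page_size):
    """Return the page that begins at the 1-based start_row of an ascending list."""
    if start_row > len(sorted_keys):
        return []
    # Step 1 (SET ROWCOUNT @StartRow): reading start_row rows in order leaves
    # the variable holding the sort value of the start_row-th row.
    anchor = sorted_keys[start_row - 1]
    # Step 2 (SET ROWCOUNT @PageSize): seek to the first key >= anchor and
    # read at most page_size rows from there.
    first = bisect.bisect_left(sorted_keys, anchor)
    return sorted_keys[first:first + page_size]


keys = list(range(100000))               # stand-in for an indexed sort column
page = seek_page(keys, 10001, 50)        # rows 10001..10050
last_page = seek_page(keys, 99981, 50)   # short final page of 20 rows
```

If the sort column contains duplicates, the `>=` seek can start earlier than the intended row, so in practice a unique tie-breaker column would probably be needed.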
The issue is well covered in this CodeProject article, including scalability graphs.
TOP is supported on SQL Server 2000, but only with static values, e.g. no "TOP (@Var)", only "TOP 200".