Azure Data Warehouse Generate serial faster query - sql

The environment is Azure DW.
I have a raw table like the one below:
ID  Start  End  Action       Date
1   10     15   Processed    25-10-2019
2   55     105  In-Progress  21-10-2019
.....
I need to expand/transform the Start and End columns so that they become serial numbers:
SN  Action     Date
10  Processed  25-10-2019
11  Processed  25-10-2019
12  Processed  25-10-2019
13  Processed  25-10-2019
14  Processed  25-10-2019
.....
Azure Data Warehouse doesn't support recursive CTEs or cursors, so I have tried a while loop:
create table #temp_output (SerialNumber int not null, startSerialNumber int not null, endSerialNumber int not null);
insert into #temp_output select startSerialNumber, startSerialNumber, endSerialNumber from dbo.raw;

declare @rowcount int, @cnt int, @start int, @end int;
set @cnt = 1;
set @rowcount = (select count(*) from dbo.raw);

while @cnt <= @rowcount
begin
    select top (@cnt) @start = startSerialNumber from dbo.raw;
    select top (@cnt) @end = endSerialNumber from dbo.raw;
    while @start <= @end
    begin
        insert #temp_output
        select max(SerialNumber) + 1,
               startSerialNumber,
               endSerialNumber
        from #temp_output
        group by startSerialNumber, endSerialNumber
        having max(SerialNumber) < endSerialNumber;
        set @start = @start + 1;
    end
    set @cnt = @cnt + 1;
end

select SerialNumber, startSerialNumber, endSerialNumber from #temp_output order by SerialNumber;
However this takes ages (I cancelled the query after 6 hours), as the raw table has 50 million rows.
I need a better way to do this.
Updated information 31-10-2019:
Distribution for the source table is hash. 500 DWU.
60 million rows in the source table.
Average difference between Start and End is 3000.
The Start value can be as high as 2 million.
No index on the main table.
Column count: 15.
Clustered columnstore index on the raw table.

Your sample is incomplete, but you don't need a loop: you can join to a tally table using BETWEEN.
If you have a tally table (which is a table that simply has the numbers from 1 to... 1 million in it):
SELECT T.TallyNumber As SN, E.Action, E.Date
FROM YourTable E
INNER JOIN TallyTable As T
ON T.TallyNumber BETWEEN E.Start AND E.End
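If you don't already have a tally table, here is a minimal sketch of building one in Azure SQL DW. Since recursive CTEs aren't supported, a cross-joined digits CTE inside a CTAS works instead. The names (dbo.TallyTable, TallyNumber) and the 10-million row count are assumptions; size it past your largest [End] value:
CREATE TABLE dbo.TallyTable
WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX)
AS
WITH digits AS (
    SELECT 0 AS d UNION ALL SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 3
    UNION ALL SELECT 4 UNION ALL SELECT 5 UNION ALL SELECT 6
    UNION ALL SELECT 7 UNION ALL SELECT 8 UNION ALL SELECT 9
)
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS TallyNumber
FROM digits a CROSS JOIN digits b CROSS JOIN digits c CROSS JOIN digits d
     CROSS JOIN digits e CROSS JOIN digits f CROSS JOIN digits g; -- 10^7 rows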
Since you are loading this into a new table, you should use CTAS
CREATE TABLE [dbo].[NewTable]
WITH
(
DISTRIBUTION = HASH([Start])
,CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT T.TallyNumber As SN, E.Action, E.Date
FROM YourTable E
INNER JOIN TallyTable As T
ON T.TallyNumber BETWEEN E.[Start] AND E.[End];
Note there is a whole lot of design around DISTRIBUTION. You need to get this right for performance. The above statement is just an example; you should probably use a different hash column.
You need to get the distribution of the two source tables, as well as the distribution of the target table, right for good performance.
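As a quick check after loading, DBCC PDW_SHOWSPACEUSED shows how evenly rows landed across the 60 distributions, which helps spot skew from a poor hash column choice (dbo.NewTable here is the table created above):
DBCC PDW_SHOWSPACEUSED("dbo.NewTable");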

Related

Create equal sized, random buckets, with no repetition to the row

I'm having some difficulty with a scheduling task.
Background: I have 100 members, 10 different sessions, and 10 different activities.
Rules:
Each member must do each activity only once.
Each activity must have the same number of members in each session.
The members must be with (at least mostly) different people in each session.
Each activity must be run in each session with 10 people per activity.
The expected outcome would be something like this:
Person ID  Session ID  Activity ID
1          S1          A
2          S1          B
3          S1          C
1          S2          B
2          S2          C
3          S2          A
In the above example, each activity in each session has only 1 participant; in reality I have to cap each activity in each session at 10 members.
I have tried a few different solutions in Excel / SQL, but have not been able to meet all of the rules. The hardest is keeping each activity/session slot to 10 people.
The closest solution I've had is the following... it's not pretty though:
SET STATISTICS TIME, IO OFF
-- Create list of applicants
IF OBJECT_ID('dbo.Numbers') IS NOT NULL DROP TABLE dbo.Numbers
CREATE TABLE Numbers (ApplicantID INT, SessionID INT, GroupID INT)
DECLARE @i INT,
        @Session INT,
        @Group INT;
SELECT @i = 1;
SET NOCOUNT ON
WHILE @i <= 100
BEGIN
    INSERT INTO Numbers (ApplicantID, SessionID) VALUES (@i, 1);
    SELECT @i = @i + 1;
END;
-- Duplicate ApplicantID list for each different session
SELECT @Session = 1
WHILE @Session <= 10
BEGIN
    IF @Session > 1
    BEGIN
        INSERT INTO Numbers (ApplicantID, SessionID)
        SELECT ApplicantID, @Session FROM Numbers WHERE SessionID = 1
    END
    -- SELECT RANDOM TOP 10 AND SET AS GROUP ID
    SELECT @Group = 1
    WHILE @Group <= 10
    BEGIN
        WITH dups_check AS (SELECT ApplicantID,
                                   GroupID,
                                   COUNT(*) AS vol
                            FROM Numbers
                            GROUP BY ApplicantID,
                                     GroupID),
        cte AS (SELECT TOP 10 *
                FROM Numbers
                WHERE Numbers.GroupID IS NULL
                  AND SessionID = @Session
                  AND NOT EXISTS (SELECT 1
                                  FROM dups_check
                                  WHERE Numbers.ApplicantID = dups_check.ApplicantID
                                    AND dups_check.GroupID = @Group)
                ORDER BY NEWID())
        UPDATE cte SET GroupID = @Group
        SELECT @Group = @Group + 1
    END
    SELECT @Session = @Session + 1
END
SELECT * FROM Numbers
SET NOCOUNT OFF
This code regularly starts to fall over at the higher session numbers, when it tries to assign an activity that the individual has already done.
Thanks!
I tried using your code to generate the ApplicantID and SessionID rows, and modified the last part to generate the GroupID column using ranking functions.
Below is what I have tried:
SET STATISTICS TIME, IO OFF
-- Create list of applicants
IF OBJECT_ID('dbo.Numbers') IS NOT NULL DROP TABLE dbo.Numbers
CREATE TABLE dbo.Numbers (ApplicantID INT, SessionID INT, GroupID INT)
DECLARE @i INT,
        @Session INT,
        @Group INT;
SELECT @i = 1;
SET NOCOUNT ON
WHILE @i <= 100
BEGIN
    INSERT INTO Numbers (ApplicantID, SessionID) VALUES (@i, 1);
    SELECT @i = @i + 1;
END;
-- Duplicate ApplicantID list for each different session
SELECT @Session = 1
WHILE @Session <= 10
BEGIN
    IF @Session > 1
    BEGIN
        INSERT INTO Numbers (ApplicantID, SessionID)
        SELECT ApplicantID, @Session FROM Numbers WHERE SessionID = 1
    END
    SELECT @Session = @Session + 1
END
SET NOCOUNT OFF

drop table if exists #temp;
select ApplicantID, SessionID, row_number() over (partition by ApplicantID order by ApplicantID) as grp_row
into #temp
from Numbers

update a
set a.GroupID = b.grp_row
from Numbers a
join #temp b on a.ApplicantID = b.ApplicantID and a.SessionID = b.SessionID
where a.GroupID is null
Each member must do each activity only once.
There are 100 applicants; as an example, I am showing applicants 1 and 100. Each applicant has each GroupID only once.
Each activity must have the same number of members in each session.
There are 10 GroupIDs and the number of applicants for each GroupID is the same (100).
The members must be with (at least mostly) different people in each session.
There are 100 applicants, but I am taking the top 10 as an example. Each SessionID has different applicants.
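If it helps, a couple of sanity-check queries against the Numbers table populated above can confirm the first two rules; these are plain aggregates, nothing specific to this solution:
-- Rule 1: no applicant should get the same GroupID twice (expect zero rows).
SELECT ApplicantID, GroupID, COUNT(*) AS vol
FROM Numbers
GROUP BY ApplicantID, GroupID
HAVING COUNT(*) > 1;
-- Rule 2: every session/group slot should hold the same number of members.
SELECT SessionID, GroupID, COUNT(*) AS members
FROM Numbers
GROUP BY SessionID, GroupID
ORDER BY SessionID, GroupID;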

Repeat query if no results came up

Could someone please advise on how to repeat a query if it returned no results? I am trying to draw a random person from the DB using RAND, but only if that person was not drawn previously (that info is stored in the column "allready_drawn").
At this point, when the query lands on a number that was drawn before, the second condition ("is null") means it displays no result.
I need the query to re-run until it comes up with a number.
DECLARE @min INTEGER;
DECLARE @max INTEGER;
set @min = (select top 1 id from [dbo].[persons] where sector = 8 order by id ASC);
set @max = (select top 1 id from [dbo].[persons] where sector = 8 order by id DESC);
select
    ordial,
    name_surname
from [dbo].[persons]
where id = ROUND(((@max - @min) * RAND() + @min), 0) and allready_drawn is NULL
The results (two possible outcomes):
Any suggestion is appreciated and I would like to thank everyone in advance.
Just try this: remove the "id" filter so you only have to run it once.
select TOP 1
ordial,
name_surname
from [dbo].[persons]
where allready_drawn is NULL
ORDER BY NEWID()
@gbn, that's a correct solution, but it's possibly too expensive. For very large tables with dense keys, randomly picking a key value between the min and max and re-picking until you find a match is also fair, and cheaper than sorting the whole table.
Also, there's a bug in the original post: the min and max rows will be selected only half as often as the others, because each maps to a smaller interval. To fix it, generate a random number from @min to @max + 1, and truncate rather than round. That way you map the interval [N, N+1) to N, ensuring a fair chance for each N.
For this selection method, here's how to repeat until you find a match.
--drop table persons
go
create table persons(id int, ordial int, name_surname varchar(2000), sector int, allready_drawn bit)
insert into persons(id, ordial, name_surname, sector, allready_drawn)
values (1,1,'foo',8,null),(2,2,'foo2',8,null),(100,100,'foo100',8,null)
go
declare @min int = (select top 1 id from [dbo].[persons] where sector = 8 order by id ASC);
declare @max int = 1 + (select top 1 id from [dbo].[persons] where sector = 8 order by id DESC);
set nocount on
declare @results table(ordial int, name_surname varchar(2000))
declare @i int = 0
declare @selected bit = 0
while @selected = 0
begin
    set @i += 1
    insert into @results(ordial, name_surname)
    select
        ordial,
        name_surname
    from [dbo].[persons]
    where id = ROUND(((@max - @min) * RAND() + @min), 0, 1) and allready_drawn is NULL

    if @@ROWCOUNT > 0
    begin
        select *, @i as tries from @results
        set @selected = 1
    end
end

SQL query with start and end dates - what is the best option?

I am using MS SQL Server 2005 at work to build a database. I have been told that most tables will hold 1,000,000 to 500,000,000 rows of data in the near future after it is built... I have not worked with datasets this large, and most of the time I don't even know what I should be considering when choosing a schema or writing queries.
So... I need to know the start and end dates for something, and a value that is associated with an ID during that time frame. We can set the table up two different ways:
-- Option 1: a single effective date per row
create table xxx_test2 (id int identity(1,1), groupid int, dt datetime, i int)
-- Option 2: explicit start and end dates per row
create table xxx_test2 (id int identity(1,1), groupid int, start_dt datetime, end_dt datetime, i int)
Which is better? How do I define "better"? I filled the first table with about 100,000 rows of data, and it takes about 10-12 seconds to transform it into the format of the second table, depending on the query...
select y.groupid,
y.dt as [start],
z.dt as [end],
(case when z.dt is null then 1 else 0 end) as latest,
y.i
from #x as y
outer apply (select top 1 *
from #x as x
where x.groupid = y.groupid and
x.dt > y.dt
order by x.dt asc) as z
or
http://consultingblogs.emc.com/jamiethomson/archive/2005/01/10/t-sql-deriving-start-and-end-date-from-a-single-effective-date.aspx
Buuuuut... with the second table, to insert a new row I have to go look for a previous row and, if there is one, update its end date. So is it a question of performance when retrieving data vs. inserting/updating? It seems silly to store that end date twice, but maybe... not? What things should I be looking at?
This is what I used to generate my fake data, if you want to play with it for some reason (if you raise the maximum of the random number it will generate the fake stuff a lot faster):
-- #x was not defined in the original snippet; this matches the inserts below
create table #x (groupid int, dt datetime, i int)
declare @dt datetime
declare @i int
declare @id int
set @id = 1
declare @rowcount int
set @rowcount = 0
declare @numrows int
while (@rowcount < 100000)
begin
    set @i = 1
    set @dt = getdate()
    set @numrows = Cast(((5 + 1) - 1) * Rand() + 1 As tinyint)
    while @i <= @numrows
    begin
        insert into #x values (@id, dateadd(d, @i, @dt), @i)
        set @i = @i + 1
    end
    set @rowcount = @rowcount + @numrows
    set @id = @id + 1
    print @rowcount
end
For your purposes, I think option 2 is the way to go for table design. This gives you flexibility, and will save you tons of work.
Having the effective date and end date will allow you to have a query that will only return currently effective data by having this in your where clause:
where GETDATE() between effectivedate and enddate
You can also then use it to join with other tables in a time-sensitive way.
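For example (purely illustrative; dbo.SomeFact and its eventdate column are made-up names), a time-sensitive join against option 2 might look like this, with a NULL end date treated as "still current":
SELECT f.groupid, f.eventdate, x.i
FROM dbo.SomeFact AS f
JOIN dbo.xxx_test2 AS x
  ON x.groupid = f.groupid
 AND f.eventdate >= x.start_dt
 AND (f.eventdate < x.end_dt OR x.end_dt IS NULL);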
Provided you set up the key properly and provide the right indexes, performance (on this table at least) should not be a problem.
For anyone who can use the LEAD analytic function of SQL Server 2012 (or Oracle, DB2, ...), retrieving data from the first table (the one with only a single date column) would be much quicker than without this feature:
select
groupid,
dt "start",
lead(dt) over (partition by groupid order by dt) "end",
case when lead(dt) over (partition by groupid order by dt) is null
then 1 else 0 end "latest",
i
from x

delete old records and keep 10 latest in sql compact

I'm using a SQL Compact database (.sdf) in MS SQL Server 2008.
In the table 'Job', each id has multiple jobs, and a system regularly adds jobs to the table.
I would like to keep the 10 latest records for each id, ordered by their 'datecompleted', and delete the rest of the records.
How can I construct my query? I failed using a #temp table and a cursor.
Well it is fast approaching Christmas, so here is my gift to you, an example script that demonstrates what I believe it is that you are trying to achieve. No I don't have a big white fluffy beard ;-)
CREATE TABLE TestJobSetTable
(
    ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    JobID INT NOT NULL,
    DateCompleted DATETIME NOT NULL
);
--Create some test data
DECLARE @iX INT;
SET @iX = 0
WHILE (@iX < 15)
BEGIN
    INSERT INTO TestJobSetTable(JobID, DateCompleted) VALUES(1, getDate())
    INSERT INTO TestJobSetTable(JobID, DateCompleted) VALUES(34, getDate())
    SET @iX = @iX + 1;
    WAITFOR DELAY '00:00:00.010'
END
--Create some more test data, for when there may be job groups with fewer than 10 records.
SET @iX = 0
WHILE (@iX < 6)
BEGIN
    INSERT INTO TestJobSetTable(JobID, DateCompleted) VALUES(23, getDate())
    SET @iX = @iX + 1;
    WAITFOR DELAY '00:00:00.010'
END
--Review the data set
SELECT * FROM TestJobSetTable;
--Apply the deletion to the remainder of the data set.
WITH TenMostRecentCompletedJobs AS
(
SELECT ID, JobID, DateCompleted
FROM TestJobSetTable A
WHERE ID in
(
SELECT TOP 10 ID
FROM TestJobSetTable
WHERE JobID = A.JobID
ORDER BY DateCompleted DESC
)
)
--SELECT * FROM TenMostRecentCompletedJobs ORDER BY JobID,DateCompleted desc;
DELETE FROM TestJobSetTable
WHERE ID NOT IN(SELECT ID FROM TenMostRecentCompletedJobs)
--Now only data of interest remains
SELECT * FROM TestJobSetTable
DROP TABLE TestJobSetTable;
How about something like:
DELETE FROM Job
WHERE id NOT IN (
    SELECT TOP 10 id
    FROM Job
    ORDER BY datecompleted DESC)
This assumes you're using version 3.5, because a nested SELECT is only available in that version or higher.
I did not read the question correctly: this keeps one overall top 10 rather than 10 per id. I suspect something more along the lines of a CTE will solve the problem, using similar logic. You want to build a query that identifies the records you want to keep, as your starting point.
Using CTE on SQL Server Compact 3.5

Split query result by half in TSQL (obtain 2 resultsets/tables)

I have a query that returns a large number of heavy rows.
When I transform these rows into a list of CustomObject I get a big memory peak, and this transformation is done by a custom .NET framework that I can't modify.
I need to retrieve a smaller number of rows at a time, doing "the transform" in two passes, to avoid the memory peak.
How can I split the result of a query in half? I need to do it in the DB layer. I thought of a "TOP count(*)/2", but how do I get the other half?
Thank you!
If you have an identity field in the table, select the even ids first, then the odd ones:
select * from Table where Id % 2 = 0
select * from Table where Id % 2 = 1
You should have roughly 50% of the rows in each set.
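If there is no identity column, NTILE(2) (available since SQL Server 2005) gives a near-exact half split instead. This is just a sketch, with SomeKey standing in for whatever stable ordering column you have:
SELECT s.*
FROM (
    SELECT t.*, NTILE(2) OVER (ORDER BY SomeKey) AS half
    FROM dbo.TableName AS t
) AS s
WHERE s.half = 1;  -- first half; run again with s.half = 2 for the second pass
Note that each pass rescans and re-sorts the table, so materializing the numbered rows into a temp table (as in the row_number answer below) may be cheaper.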
Here is another way to do it, from http://www.tek-tips.com/viewthread.cfm?qid=1280248&page=5. I think it's more efficient:
Declare @Rows Int
Declare @TopRows Int
Declare @BottomRows Int

Select @Rows = Count(*) From TableName

If @Rows % 2 = 1
Begin
    Set @TopRows = @Rows / 2
    Set @BottomRows = @TopRows + 1
End
Else
Begin
    Set @TopRows = @Rows / 2
    Set @BottomRows = @TopRows
End

Set RowCount @TopRows
Select * From TableName Order By DisplayOrder

Set RowCount @BottomRows
Select * From TableName Order By DisplayOrder DESC
--- old answer below ---
Is this a stored procedure call or dynamic SQL? Can you use temp tables?
If so, something like this would work:
select row_number() OVER (order by yourorderfield) as rowNumber, *
INTO #tmp
FROM dbo.yourtable

declare @rowCount int
SELECT @rowCount = count(1) from #tmp

SELECT * from #tmp where rowNumber <= @rowCount / 2
SELECT * from #tmp where rowNumber > @rowCount / 2

DROP TABLE #tmp
SELECT TOP 50 PERCENT WITH TIES ... ORDER BY SomeThing
then
SELECT TOP 50 PERCENT ... ORDER BY SomeThing DESC
However, unless you snapshot the data first, a row in the middle may slip through or be processed twice.
I don't think you should do this in SQL, since you will always have the possibility of the same record appearing in both halves.
I would do it in an application programming language, not SQL: Java, .NET, C++, etc...