Different select criteria in odd and even events - sql

I have a table that looks like this (10 billion rows):
AID BID CID
1 2 1
1 6 9
0 1 4
1 3 2
1 100 2
0 4 2
0 0 1
The AID could only be 0 or 1. BID and CID could be anything.
Now I want to select events alternately: first one with AID=1, then one with AID=0, then AID=1 again, then AID=0, and so on.
The idea is to select equal numbers of AID=1 and AID=0 events.
How can I achieve that?
The expected result is
AID BID CID
1 2 1
0 1 4
1 6 9
0 4 2
1 3 2
0 0 1

;WITH cte AS (
select *
FROM (VALUES
(1, 2, 1),
(1, 6, 9),
(0, 1, 4),
(1, 3, 2),
(1, 100, 2),
(0, 4, 2),
(0, 0, 1)
) as t(AID, BID, CID)
),
withrow AS (
SELECT ROW_NUMBER() OVER (PARTITION BY AID ORDER BY AID) as RN, *
FROM cte)
SELECT AID,BID,CID
FROM withrow
ORDER BY RN asc , aid desc
Output:
AID BID CID
----------- ----------- -----------
1 100 2
0 4 2
1 3 2
0 1 4
1 6 9
0 0 1
1 2 1
(7 row(s) affected)
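The output above has two gaps against the expected result: ORDER BY AID inside a partition on AID leaves the row order undefined, and the unmatched seventh row is still returned. Here is a minimal sketch of one way to handle both, run on SQLite through Python's sqlite3 module rather than on SQL Server (window functions need SQLite 3.25 or later). The table name events and the id column are assumptions; id stands in for whatever column defines row order in the real table, and the sketch assumes both AID values are present.

```python
import sqlite3

# In-memory copy of the sample data. The "id" column is hypothetical:
# it records insertion order so that ROW_NUMBER() is deterministic.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE events (id INTEGER PRIMARY KEY, AID INT, BID INT, CID INT)"
)
con.executemany(
    "INSERT INTO events (AID, BID, CID) VALUES (?, ?, ?)",
    [(1, 2, 1), (1, 6, 9), (0, 1, 4), (1, 3, 2), (1, 100, 2), (0, 4, 2), (0, 0, 1)],
)

rows = con.execute("""
    WITH numbered AS (
        SELECT AID, BID, CID,
               ROW_NUMBER() OVER (PARTITION BY AID ORDER BY id) AS rn
        FROM events
    ),
    smallest AS (
        -- size of the smaller AID group; rows beyond it have no partner
        SELECT MIN(cnt) AS n
        FROM (SELECT COUNT(*) AS cnt FROM events GROUP BY AID)
    )
    SELECT AID, BID, CID
    FROM numbered CROSS JOIN smallest
    WHERE rn <= n
    ORDER BY rn, AID DESC
""").fetchall()

for row in rows:
    print(row)
```

On 10 billion rows, the per-AID numbering is the expensive part, so an index that covers AID plus the ordering column would probably matter more than the exact shape of the query.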

SQL to ensure unique node names in adjacency list

So I have an adjacency list that forms a hierarchy simulating a versioned file structure. The problem is that the incoming file names are not currently unique and they need to be. To make things slightly more interesting the files may have different versions which should keep the name of the first version (note the versions all have the same NodeID).
Adjacency List
ParentID  NodeID  VersionNum  FileName
--------- ------- ----------- ---------------
-1        1       1           FirstFolder
1         2       1           SecondFolder
1         3       1           ThirdFolder
1         4       1           FirstDocument
1         4       2           FirstDocument
1         5       1           FirstDocument
1         5       2           FirstDocument
2         6       1           FirstDocument
2         6       2           FirstDocument
2         7       1           SecondDocument
3         8       1           SecondDocument
3         9       1           ThirdDocument
3         9       2           ThirdDocument
3         10      1           ThirdDocument
3         11      1           ThirdDocument
Targeted Result
ParentID  NodeID  VersionNum  FileName
--------- ------- ----------- ---------------
-1        1       1           FirstFolder
1         2       1           SecondFolder
1         3       1           ThirdFolder
1         4       1           FirstDocument
1         4       2           FirstDocument
1         5       1           FirstDocument_1
1         5       2           FirstDocument_1
2         6       1           FirstDocument
2         6       2           FirstDocument
2         7       1           SecondDocument
3         8       1           SecondDocument
3         9       1           ThirdDocument
3         9       2           ThirdDocument
3         10      1           ThirdDocument_1
3         11      1           ThirdDocument_2
*I should also note that the folder names are already guaranteed to be unique (they already exist, it is the documents that are incoming) and they only have 1 version.
CREATE TABLE #tmp_tree
(
ParentID INT,
NodeID INT,
VersionNum INT,
FileName VARCHAR(50),
);
INSERT INTO #tmp_tree (ParentID, NodeID, VersionNum, FileName)
VALUES (-1, 1, 1, 'FirstFolder' ),
(1, 2, 1, 'SecondFolder' ),
(1, 3, 1, 'ThirdFolder' ),
(1, 4, 1, 'FirstDocument' ),
(1, 4, 2, 'FirstDocument' ),
(1, 5, 1, 'FirstDocument' ),
(1, 5, 2, 'FirstDocument' ),
(2, 6, 1, 'FirstDocument' ),
(2, 6, 2, 'FirstDocument' ),
(2, 7, 1, 'SecondDocument' ),
(3, 8, 1, 'SecondDocument' ),
(3, 9, 1, 'ThirdDocument' ),
(3, 9, 2, 'ThirdDocument' ),
(3, 10, 1, 'ThirdDocument' ),
(3, 11, 1, 'ThirdDocument' );
I really don't know how to approach this without resorting to a stored procedure. Adjacency lists scream CTEs to me, but that got me nowhere real fast. GROUP BY loses the NodeID, so while I can find the names of the documents that need to be renamed, I don't know how to use that to select the second occurrence of the name (ordered by NodeID).
-- I don't see how this helps... but this finds the names that need to change.
select ParentID, FileName,VersionNum, count(*) from #tmp_tree
GROUP BY ParentID, FileName, VersionNum
HAVING VersionNum = 1 and count(*) > 1
order by FileName
I know how to solve this procedurally but not declaratively.
I don't know if this is closer or farther away from the solution:
select f2.*, Row_Number() over (order by f2.FileName) from
(select top 10 f.*, count(FileName) over (PARTITION by ParentID, FileName) as n from (select * from #tmp_tree where versionNum = 1) as f
order by f.ParentID, f.FileName) as f2
Where n > 1
I would assume the last line (3, 11) in the targeted result is a mistake.
You can find the repeated names with a window function in a subquery and then join it during the update. In short, you can do:
update #tmp_tree
set #tmp_tree.filename = concat(#tmp_tree.filename, '_', x.rn)
from #tmp_tree
join (
select *,
row_number() over(partition by parentid, filename order by nodeid) as rn
from #tmp_tree
where versionnum = 1
) x on x.rn > 1 and x.nodeid = #tmp_tree.nodeid;
Result:
ParentID NodeID VersionNum FileName
--------- ------- ----------- ---------------
-1 1 1 FirstFolder
1 2 1 SecondFolder
1 3 1 ThirdFolder
1 4 1 FirstDocument
1 4 2 FirstDocument
1 5 1 FirstDocument_2
1 5 2 FirstDocument_2
2 6 1 FirstDocument
2 6 2 FirstDocument
2 7 1 SecondDocument
3 8 1 SecondDocument
3 9 1 ThirdDocument
3 9 2 ThirdDocument
3 10 1 ThirdDocument_2
See running example at db<>fiddle.
You don't need to self-join the table. You can update the derived table directly after calculating the row number with DENSE_RANK:
update x
set filename = concat(x.filename, '_', x.rn)
from (
select *,
dense_rank() over(partition by parentid, filename order by nodeid) as rn
from #tmp_tree
) x
where x.rn > 1;
db<>fiddle
DENSE_RANK returns the same number for rows that tie on the ordering clause, so every version of a node (same NodeID) receives the same suffix.
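To see the tie behaviour on the sample data, here is a minimal sketch run on SQLite through Python's sqlite3 module (not SQL Server; window functions need SQLite 3.25 or later). It only computes the new names with a SELECT instead of updating in place, and it uses rank minus one as the suffix, which is an adjustment of mine to match the question's targeted result (_1, _2) rather than the _2, _3 numbering the UPDATE statements above produce. The table name tmp_tree and the NewName alias are just illustrative.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE tmp_tree (ParentID INT, NodeID INT, VersionNum INT, FileName TEXT)"
)
con.executemany(
    "INSERT INTO tmp_tree VALUES (?, ?, ?, ?)",
    [
        (-1, 1, 1, "FirstFolder"), (1, 2, 1, "SecondFolder"), (1, 3, 1, "ThirdFolder"),
        (1, 4, 1, "FirstDocument"), (1, 4, 2, "FirstDocument"),
        (1, 5, 1, "FirstDocument"), (1, 5, 2, "FirstDocument"),
        (2, 6, 1, "FirstDocument"), (2, 6, 2, "FirstDocument"),
        (2, 7, 1, "SecondDocument"), (3, 8, 1, "SecondDocument"),
        (3, 9, 1, "ThirdDocument"), (3, 9, 2, "ThirdDocument"),
        (3, 10, 1, "ThirdDocument"), (3, 11, 1, "ThirdDocument"),
    ],
)

renamed = con.execute("""
    SELECT NodeID, VersionNum,
           -- rank 1 keeps its name; later nodes get _1, _2, ...
           CASE WHEN rnk > 1 THEN FileName || '_' || (rnk - 1)
                ELSE FileName END AS NewName
    FROM (
        SELECT *,
               -- versions share a NodeID, so they tie and share a rank
               DENSE_RANK() OVER (PARTITION BY ParentID, FileName
                                  ORDER BY NodeID) AS rnk
        FROM tmp_tree
    )
    ORDER BY NodeID, VersionNum
""").fetchall()

for row in renamed:
    print(row)
```

With ROW_NUMBER in place of DENSE_RANK, the two versions of node 5 would receive different numbers, which is why the first answer restricts its subquery to VersionNum = 1 and joins back on NodeID.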

Adjusting table based on previous values in BigQuery

I have a table that looks like below:
ID|Date |X| Flag |
1 |1/1/16|2| 0
2 |1/1/16|0| 0
3 |1/1/16|0| 0
1 |2/1/16|0| 0
2 |2/1/16|1| 0
3 |2/1/16|2| 0
1 |3/1/16|2| 0
2 |3/1/16|1| 0
3 |3/1/16|2| 0
I'm trying to make it so that flag is populated if X=2 in the PREVIOUS month. As such, it should look like this:
ID|Date |X| Flag |
1 |1/1/16|2| 0
2 |1/1/16|0| 0
3 |1/1/16|0| 0
1 |2/1/16|2| 1
2 |2/1/16|1| 0
3 |2/1/16|2| 0
1 |3/1/16|2| 1
2 |3/1/16|1| 0
3 |3/1/16|2| 1
I use this in SQL:
`select ID, date, X, flag into Work_Table from t
(
Select ID, date, X, flag,
Lag(X) Over (Partition By ID Order By date Asc) As Prev into Flag_table
From Work_Table
)
Update [dbo].[Flag_table]
Set flag = 1
where prev = '2'
UPDATE t
Set t.flag = [dbo].[Flag_table].flag FROM T
JOIN [dbo].[Flag_table]
ON t.ID= [dbo].[Flag_table].ID where T.date = [dbo].[Flag_table].date`
However, I cannot do this in BigQuery. Any ideas?
Below is for BigQuery Standard SQL
#standardSQL
SELECT id, dt, x,
IF(LAG(x = 2) OVER(PARTITION BY id ORDER BY dt), 1, 0) flag
FROM `project.dataset.work_table`
You can test / play with it using dummy data from your question as
#standardSQL
WITH `project.dataset.work_table` AS (
SELECT 1 id, '1/1/16' dt, 2 x, 0 flag UNION ALL
SELECT 2, '1/1/16', 0, 0 UNION ALL
SELECT 3, '1/1/16', 0, 0 UNION ALL
SELECT 1, '2/1/16', 0, 0 UNION ALL
SELECT 2, '2/1/16', 1, 0 UNION ALL
SELECT 3, '2/1/16', 2, 0 UNION ALL
SELECT 1, '3/1/16', 2, 0 UNION ALL
SELECT 2, '3/1/16', 1, 0 UNION ALL
SELECT 3, '3/1/16', 2, 0
)
SELECT id, dt, x,
IF(LAG(x = 2) OVER(PARTITION BY id ORDER BY dt), 1, 0) flag
FROM `project.dataset.work_table`
ORDER BY dt, id
with result as
Row id dt x flag
1 1 1/1/16 2 0
2 2 1/1/16 0 0
3 3 1/1/16 0 0
4 1 2/1/16 0 1
5 2 2/1/16 1 0
6 3 2/1/16 2 0
7 1 3/1/16 2 0
8 2 3/1/16 1 0
9 3 3/1/16 2 1
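If you want to try the LAG idea without a BigQuery project, here is a rough equivalent on SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or later). Two assumptions of mine: the dates are rewritten as ISO strings on the reading that 1/1/16, 2/1/16, 3/1/16 are January to March 2016, so that ORDER BY dt sorts chronologically, and the table is simply called work_table.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE work_table (id INT, dt TEXT, x INT, flag INT)")
con.executemany(
    "INSERT INTO work_table VALUES (?, ?, ?, 0)",
    [
        (1, "2016-01-01", 2), (2, "2016-01-01", 0), (3, "2016-01-01", 0),
        (1, "2016-02-01", 0), (2, "2016-02-01", 1), (3, "2016-02-01", 2),
        (1, "2016-03-01", 2), (2, "2016-03-01", 1), (3, "2016-03-01", 2),
    ],
)

flagged = con.execute("""
    SELECT id, dt, x,
           -- LAG is NULL on each id's first row, so that row falls to ELSE 0
           CASE WHEN LAG(x) OVER (PARTITION BY id ORDER BY dt) = 2
                THEN 1 ELSE 0 END AS flag
    FROM work_table
    ORDER BY dt, id
""").fetchall()

for row in flagged:
    print(row)
```

The '1/1/16' style strings in the dummy data only sort correctly because the sample stays within months 1 to 3; with real data a DATE column (or a parsed date) is the safer thing to order by.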

SQL Server: Increment row value depending on previous row

I have a table with the columns id and value. I'd like to create a column that groups the id. If a row's current value equals 0 then a new group in ideal_group will be created.
Table:
id | value | ideal_group
1 1 1
2 1 1
3 1 1
4 0 2
5 1 2
6 0 3
7 0 4
I'm thinking the solution should be something like:
SET @n = 1;
SELECT id,
CASE
WHEN value = 0 THEN @n = @n + 1
ELSE @n END AS ideal_group
But I'd prefer not to use a counter variable. Is there another way to go about this?
Try the code below. I assumed that the values in the value column are only 1s and 0s:
select id,
value,
sum(1 - value) over (order by id rows between unbounded preceding and current row) + 1 [ideal_group]
from MY_TABLE
More general solution (without mentioned assumption):
select id,
value,
sum(case value when 0 then 1 else 0 end) over (order by id rows between unbounded preceding and current row) + 1 [ideal_group]
from MY_TABLE
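The running-count idea can be checked quickly on SQLite through Python's sqlite3 module (a stand-in for SQL Server here; window functions need SQLite 3.25 or later). This is only a sketch of the general form above; the lowercase table name my_table is an assumption.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE my_table (id INT, value INT)")
con.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [(1, 1), (2, 1), (3, 1), (4, 0), (5, 1), (6, 0), (7, 0)],
)

groups = con.execute("""
    SELECT id, value,
           -- count the zeros seen so far (current row included), then add 1
           SUM(CASE WHEN value = 0 THEN 1 ELSE 0 END)
               OVER (ORDER BY id
                     ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) + 1
               AS ideal_group
    FROM my_table
    ORDER BY id
""").fetchall()

for row in groups:
    print(row)
```

Each zero bumps the running count on its own row, which is why ids 6 and 7 land in separate groups.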
create table tbl (id int, value int);
insert into tbl values
(1, 1),
(2, 1),
(3, 1),
(4, 0),
(5, 1),
(6, 0),
(7, 0);
GO
7 rows affected
select id,
value,
1 + sum(iif(value = 0, 1, 0)) over
(order by id rows between unbounded preceding and current row) as ideal_group
from tbl
GO
id | value | ideal_group
-: | ----: | ----------:
1 | 1 | 1
2 | 1 | 1
3 | 1 | 1
4 | 0 | 2
5 | 1 | 2
6 | 0 | 3
7 | 0 | 4
dbfiddle here
If you reversed the 1 and 0 and it was only 1 or 0 this would be easier.
declare @T table (id int primary key, val int);
insert into @T values
(1, 1)
, (2, 1)
, (3, 1)
, (4, 0)
, (5, 1)
, (6, 0)
, (7, 0);
select t.id, t.val
, case when t.val = 0 then 1 else 0 end as trig
, sum(case when t.val = 0 then 1 else 0 end) over (order by t.id) + 1 as grp
from @T t
order by t.id;
id val trig grp
----------- ----------- ----------- -----------
1 1 0 1
2 1 0 1
3 1 0 1
4 0 1 2
5 1 0 2
6 0 1 3
7 0 1 4

skip consecutive rows after specific value

Note: I have a working query, but am looking for optimisations to use it on large tables.
Suppose I have a table like this:
id session_id value
1 5 7
2 5 1
3 5 1
4 5 12
5 5 1
6 5 1
7 5 1
8 6 7
9 6 1
10 6 3
11 6 1
12 7 7
13 8 1
14 8 2
15 8 3
I want the id's of all rows with value 1 with one exception:
skip groups with value 1 that directly follow a value 7 within the same session_id.
Basically I would look for groups of value 1 that directly follow a value 7, limited by the session_id, and ignore those groups. I then show all the remaining value 1 rows.
The desired output showing the id's:
5
6
7
11
13
I took some inspiration from this post and ended up with this code:
create table #req_data (
id int primary key identity,
session_id int,
value int
)
insert into #req_data(session_id, value) values (5, 7)
insert into #req_data(session_id, value) values (5, 1) -- preceded by value 7 in same session, should be ignored
insert into #req_data(session_id, value) values (5, 1) -- ignore this one too
insert into #req_data(session_id, value) values (5, 12)
insert into #req_data(session_id, value) values (5, 1) -- preceded by value != 7, show this
insert into #req_data(session_id, value) values (5, 1) -- show this too
insert into #req_data(session_id, value) values (5, 1) -- show this too
insert into #req_data(session_id, value) values (6, 7)
insert into #req_data(session_id, value) values (6, 1) -- preceded by value 7 in same session, should be ignored
insert into #req_data(session_id, value) values (6, 3)
insert into #req_data(session_id, value) values (6, 1) -- preceded by value != 7, show this
insert into #req_data(session_id, value) values (7, 7)
insert into #req_data(session_id, value) values (8, 1) -- new session_id, show this
insert into #req_data(session_id, value) values (8, 2)
insert into #req_data(session_id, value) values (8, 3)
select id
from (
select session_id, id, max(skip) over (partition by grp) as 'skip'
from (
select tWithGroups.*,
( row_number() over (partition by session_id order by id) - row_number() over (partition by value order by id) ) as grp
from (
select session_id, id, value,
case
when lag(value) over (partition by session_id order by id) = 7
then 1
else 0
end as 'skip'
from #req_data
) as tWithGroups
) as tWithSkipField
where tWithSkipField.value = 1
) as tYetAnotherOutput
where skip != 1
order by id
This gives the desired result, but with 4 select blocks I think it's way too inefficient to use on large tables.
Is there a cleaner, faster way to do this?
The following should work well for this.
WITH
cte_ControlValue AS (
SELECT
rd.id, rd.session_id, rd.value,
ControlValue = ISNULL(CAST(SUBSTRING(MAX(bv.BinVal) OVER (PARTITION BY rd.session_id ORDER BY rd.id), 5, 4) AS INT), 999)
FROM
#req_data rd
CROSS APPLY ( VALUES (CAST(rd.id AS BINARY(4)) + CAST(NULLIF(rd.value, 1) AS BINARY(4))) ) bv (BinVal)
)
SELECT
cv.id, cv.session_id, cv.value
FROM
cte_ControlValue cv
WHERE
cv.value = 1
AND cv.ControlValue <> 7;
Results...
id session_id value
----------- ----------- -----------
5 5 1
6 5 1
7 5 1
11 6 1
13 8 1
Edit: How and why it works...
The basic premise is taken from Itzik Ben-Gan's "The Last non NULL Puzzle".
Essentially, we are relying on two different behaviors that most people don't usually think about...
1) NULL + anything = NULL.
2) You can CAST or CONVERT an INT into a fixed length BINARY data type and it will continue to sort as an INT (as opposed to sorting like a text string).
This is easier to see when the intermediate steps are added to the query in the CTE...
SELECT
rd.id, rd.session_id, rd.value,
bv.BinVal,
SmearedBinVal = MAX(bv.BinVal) OVER (PARTITION BY rd.session_id ORDER BY rd.id),
SecondHalfAsINT = CAST(SUBSTRING(MAX(bv.BinVal) OVER (PARTITION BY rd.session_id ORDER BY rd.id), 5, 4) AS INT),
ControlValue = ISNULL(CAST(SUBSTRING(MAX(bv.BinVal) OVER (PARTITION BY rd.session_id ORDER BY rd.id), 5, 4) AS INT), 999)
FROM
#req_data rd
CROSS APPLY ( VALUES (CAST(rd.id AS BINARY(4)) + CAST(NULLIF(rd.value, 1) AS BINARY(4))) ) bv (BinVal)
Results...
id session_id value BinVal SmearedBinVal SecondHalfAsINT ControlValue
----------- ----------- ----------- ------------------ ------------------ --------------- ------------
1 5 7 0x0000000100000007 0x0000000100000007 7 7
2 5 1 NULL 0x0000000100000007 7 7
3 5 1 NULL 0x0000000100000007 7 7
4 5 12 0x000000040000000C 0x000000040000000C 12 12
5 5 1 NULL 0x000000040000000C 12 12
6 5 1 NULL 0x000000040000000C 12 12
7 5 1 NULL 0x000000040000000C 12 12
8 6 7 0x0000000800000007 0x0000000800000007 7 7
9 6 1 NULL 0x0000000800000007 7 7
10 6 3 0x0000000A00000003 0x0000000A00000003 3 3
11 6 1 NULL 0x0000000A00000003 3 3
12 7 7 0x0000000C00000007 0x0000000C00000007 7 7
13 8 1 NULL NULL NULL 999
14 8 2 0x0000000E00000002 0x0000000E00000002 2 2
15 8 3 0x0000000F00000003 0x0000000F00000003 3 3
Looking at the BinVal column, we see an 8-byte hex value for every row where [value] <> 1 and NULLs where [value] = 1. The first 4 bytes are the id (used for ordering) and the second 4 bytes are [value] (used to set the "previous non-1 value", or to set the whole thing to NULL).
The 2nd step is to "smear" the non-NULL values into the NULLs using the window framed MAX function, partitioned by session_id and ordered by id.
The 3rd step is to parse out the last 4 bytes and convert them back to an INT data type (SecondHalfAsINT) and deal with any nulls that result from not having any non-1 preceding value (ControlValue).
Since we can't reference a windowed function in the WHERE clause, we have to throw the query into a CTE (a derived table would work just as well) so that we can use the new ControlValue in the where clause.
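The same three steps can be mimicked outside the database to make the byte layout concrete. This is only an illustration in plain Python, not what SQL Server does internally: struct.pack plays the role of the CAST to BINARY(4), a per-session running max plays the role of the windowed MAX, and None stands in for NULL (and for the 999 placeholder). The function name control_values is made up, and the unsigned format assumes ids and values are non-negative.

```python
import struct

# (id, session_id, value) rows from the question
rows = [
    (1, 5, 7), (2, 5, 1), (3, 5, 1), (4, 5, 12), (5, 5, 1), (6, 5, 1), (7, 5, 1),
    (8, 6, 7), (9, 6, 1), (10, 6, 3), (11, 6, 1),
    (12, 7, 7),
    (13, 8, 1), (14, 8, 2), (15, 8, 3),
]

def control_values(rows):
    """Yield (id, session_id, value, control) where control is the most
    recent non-1 value in the session so far, or None if there is none."""
    smeared = {}  # session_id -> running max of the packed keys
    for id_, session_id, value in sorted(rows):
        if value != 1:
            # Step 1: 4 bytes of id then 4 bytes of value, big-endian,
            # so comparing the bytes compares by id first.
            key = struct.pack(">II", id_, value)
            prev = smeared.get(session_id)
            # Step 2: the running max "smears" the latest key forward.
            smeared[session_id] = key if prev is None else max(prev, key)
        packed = smeared.get(session_id)
        # Step 3: read the value back out of the last 4 bytes.
        control = struct.unpack(">II", packed)[1] if packed is not None else None
        yield id_, session_id, value, control

kept_ids = [i for i, _, v, c in control_values(rows) if v == 1 and c != 7]
print(kept_ids)
```

Because ids only grow within a session, the max of the packed keys is always the key of the most recent non-1 row, which is the whole point of putting the id in the leading bytes.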
SELECT CRow.id
FROM #req_data AS CRow
CROSS APPLY (SELECT MAX(id) AS id FROM #req_data PRev WHERE PRev.Id < CRow.id AND PRev.session_id = CRow.session_id AND PRev.value <> 1 ) MaxPRow
LEFT JOIN #req_data AS PRow ON MaxPRow.id = PRow.id
WHERE CRow.value = 1 AND ISNULL(PRow.value,1) <> 7
You can use the following query:
select id, session_id, value,
coalesce(sum(case when value <> 1 then 1 end)
over (partition by session_id order by id), 0) as grp
from #req_data
to get:
id session_id value grp
----------------------------
1 5 7 1
2 5 1 1
3 5 1 1
4 5 12 2
5 5 1 2
6 5 1 2
7 5 1 2
8 6 7 1
9 6 1 1
10 6 3 2
11 6 1 2
12 7 7 1
13 8 1 0
14 8 2 1
15 8 3 2
So, this query detects islands of consecutive 1 records that belong to the same group, as specified by the first preceding row with value <> 1.
You can use a window function once more to detect all 7 islands. If you wrap this in a second cte, then you can finally get the desired result by filtering out all 7 islands:
;with session_islands as (
select id, session_id, value,
coalesce(sum(case when value <> 1 then 1 end)
over (partition by session_id order by id), 0) as grp
from #req_data
), islands_with_7 as (
select id, grp, value,
count(case when value = 7 then 1 end)
over (partition by session_id, grp) as cnt_7
from session_islands
)
select id
from islands_with_7
where cnt_7 = 0 and value = 1
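For anyone who wants to check the two-CTE version against the sample data without a SQL Server instance, here is a sketch that runs the same logic on SQLite through Python's sqlite3 module (window functions need SQLite 3.25 or later). The table name req_data and the explicit ids are my own setup; the query itself follows the one above.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE req_data (id INTEGER PRIMARY KEY, session_id INT, value INT)")
con.executemany(
    "INSERT INTO req_data (session_id, value) VALUES (?, ?)",
    [(5, 7), (5, 1), (5, 1), (5, 12), (5, 1), (5, 1), (5, 1),
     (6, 7), (6, 1), (6, 3), (6, 1), (7, 7), (8, 1), (8, 2), (8, 3)],
)

ids = [r[0] for r in con.execute("""
    WITH session_islands AS (
        SELECT id, session_id, value,
               -- every non-1 row starts a new island within its session
               COALESCE(SUM(CASE WHEN value <> 1 THEN 1 END)
                        OVER (PARTITION BY session_id ORDER BY id), 0) AS grp
        FROM req_data
    ),
    islands_with_7 AS (
        SELECT id, value,
               -- 1 if the island was opened by a 7, otherwise 0
               COUNT(CASE WHEN value = 7 THEN 1 END)
                   OVER (PARTITION BY session_id, grp) AS cnt_7
        FROM session_islands
    )
    SELECT id FROM islands_with_7
    WHERE cnt_7 = 0 AND value = 1
    ORDER BY id
""")]

print(ids)
```

Whether this beats the original four-level query on a large table would need to be measured; both window passes want the data ordered by session_id and id, so an index on (session_id, id) is the first thing I would try.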

Assign rownumber in SQL grouped on value and n rows per rownumber

I am trying to generate a report with 3 rows per page for each order number using the following SQL.
As you can see from the results the fields Actual & Expected do not match up.
Any help would be appreciated.
set nocount on
CREATE TABLE #Orders (Expected int, OrderNumber INT, OrderDetailsNumber int)
Insert into #orders values (0,1,1)
Insert into #orders values (0,1,2)
Insert into #orders values (0,1,3)
Insert into #orders values (1,1,4)
Insert into #orders values (2,2,5)
Insert into #orders values (2,2,6)
Insert into #orders values (2,2,7)
Insert into #orders values (3,2,8)
Insert into #orders values (3,2,9)
select cast(((row_number() over( order by OrderNumber)) -1) /3 as int) as [Actual]
,*
from #orders
Actual Expected OrderNumber OrderDetailsNumber
----------- ----------- ----------- ------------------
0 0 1 1
0 0 1 2
0 0 1 3
1 1 1 4
1 2 2 5
1 2 2 6
2 2 2 7
2 3 2 8
2 3 2 9
Right, after a couple of edits I have the final answer:
SELECT DENSE_RANK() OVER (Order BY OrderNumber, floor(RowNumber/3)) - 1 AS Actual,
Expected,
OrderNumber,
OrderDetailsNumber
FROM
(
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY OrderNumber
ORDER BY OrderDetailsNumber
) - 1 AS RowNumber
FROM #Orders
) RowNumberTable
Gives the result (with extra rows for testing):
Actual Expected OrderNumber OrderDetailsNumber
-------------------- ----------- ----------- ------------------
0 0 1 1
0 0 1 2
0 0 1 3
1 1 1 4
1 1 1 12
2 2 2 5
2 2 2 6
2 2 2 7
3 3 2 8
3 3 2 9
3 4 2 11
4 3 2 27
5 5 3 10
This only works where OrderDetailsNumber is unique such that the result is deterministic.
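Here is a small check of that query against the nine rows from the question (without the extra test rows), run on SQLite through Python's sqlite3 module as a stand-in for SQL Server (window functions need SQLite 3.25 or later). The table name orders is an assumption, and SQLite's integer division takes the place of floor().

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE orders (Expected INT, OrderNumber INT, OrderDetailsNumber INT)"
)
con.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(0, 1, 1), (0, 1, 2), (0, 1, 3), (1, 1, 4),
     (2, 2, 5), (2, 2, 6), (2, 2, 7), (3, 2, 8), (3, 2, 9)],
)

pages = con.execute("""
    SELECT DENSE_RANK() OVER (ORDER BY OrderNumber, RowNumber / 3) - 1 AS Actual,
           Expected, OrderNumber, OrderDetailsNumber
    FROM (
        -- zero-based position of each detail row inside its order
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY OrderNumber
                                  ORDER BY OrderDetailsNumber) - 1 AS RowNumber
        FROM orders
    )
    ORDER BY OrderDetailsNumber
""").fetchall()

for row in pages:
    print(row)
```

The inner query restarts the count for every order, so a new order always begins a new page even when the previous page holds fewer than three rows; that is the piece the original single ROW_NUMBER over all rows was missing.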
Edit
I've now got the complete code working. However, the dependence on OrderDetailsNumber being in order is very iffy; hopefully you can test and edit as required.
Edit 2
I've put the 'golfed' version in the main answer.
WITH FirstCTE AS
(
SELECT
OrderNumber,
OrderDetailsNumber,
Expected,
ROW_NUMBER() OVER (
PARTITION BY OrderNumber
ORDER BY OrderDetailsNumber
) - 1 AS RowNumber
FROM #Orders
)
, SecondCTE AS
(
SELECT OrderDetailsNumber as odn,
floor(RowNumber/3) as page_for_order_number,
DENSE_RANK() OVER (Order BY OrderNumber, floor(RowNumber/3)) - 1 AS Actual
FROM FirstCTE
)
SELECT c2.page_for_order_number,
c1.RowNumber,
C2.Actual,
c1.Expected,
c1.OrderNumber,
c1.OrderDetailsNumber
FROM FirstCTE AS c1
INNER JOIN SecondCTE AS c2
on c2.odn = c1.OrderDetailsNumber
This strikes me as a bit of a hack, but it works...
Divide the row_number() by 3, and use CEILING to get the smallest integer greater than or equal to the result of that division.
select row_number() over( order by OrderNumber) as [Actual],
cast (row_number() over(order by ordernumber) as decimal(5,1)) / 3,
CEILING(cast (row_number() over(order by ordernumber) as decimal(5,1)) / 3) as PG_NBR,
*
from #orders
EDIT: Dang it, can never get results to line up. The 3rd column in the result set is your "page number".
Which yields:
Actual (No column name) PG_NBR Expected OrderNumber OrderDetailsNumber
1 0.333333 1 0 1 1
2 0.666666 1 0 1 2
3 1.000000 1 0 1 3
4 1.333333 2 1 1 4
5 1.666666 2 2 2 5
6 2.000000 2 2 2 6
7 2.333333 3 2 2 7
8 2.666666 3 3 2 8
9 3.000000 3 3 2 9