GROUP by Largest String for all the substrings - sql

I have a table like this where some rows have the same grp but different names. I want to group them by name such that all the substrings after removing nonalphanumeric characters are aggregated together and grouped by the largest string. The null value is considered the substring of all the strings.
grp
name
value
1
ab&c
10
1
abc d e
56
1
ab
21
1
a
23
1
xy
34
1
[null]
1
2
fgh
87
Desired result
grp
name
value
1
abcde
111
1
xy
34
2
fgh
87
My query-
Select grp,
regexp_replace(name,'[^a-zA-Z0-9]+', '', 'g') name, sum(value) value
from table
group by grp,
regexp_replace(name,'[^a-zA-Z0-9]+', '', 'g');
Result
grp
name
value
1
abc
10
1
abcde
56
1
ab
21
1
a
23
1
xy
34
1
[null]
1
2
fgh
87
What changes should I make in my query?

To solve this problem, I did the following (all of the code below is available on the fiddle here).
CREATE TABLE test
(
grp SMALLINT NOT NULL,
name TEXT NULL,
value SMALLINT NOT NULL
);
and populate it using your data + extra for testing:
INSERT INTO test VALUES
(1, 'ab&c', 10),
(1, 'abc d e', 56),
(1, 'ab', 21),
(1, 'a', 23),
(1, NULL, 1000000),
(1, 'r*&%$s', 100), -- added for testing.
(1, 'rs__t', 101),
(1, 'rs__tu', 101),
(1, 'xy', 1111),
(1, NULL, 1000000),
(2, 'fgh', 87),
(2, 'fgh', 13), -- For Charlieface
(2, NULL, 1000000),
(2, 'x', 50),
(2, 'x', 150),
(2, 'x----y', 100);
Then, you can use this query:
WITH t1 AS
(
SELECT
grp, n_str,
LAG(n_str) OVER (PARTITION BY grp ORDER BY grp, n_str),
CASE
WHEN
LAG(n_str) OVER (PARTITION BY grp ORDER BY grp, n_str) IS NULL
OR
POSITION
(
LAG(n_str) OVER (PARTITION BY grp ORDER BY grp, n_str)
IN
n_str
) = 0
THEN 1
ELSE 0
END AS change,
value
FROM
test t1
CROSS JOIN LATERAL
(
VALUES
(
REGEXP_REPLACE(name,'[^a-zA-Z0-9]+', '', 'g')
)
) AS v(n_str)
WHERE n_str IS NOT NULL
), t2 AS
(
SELECT
ROW_NUMBER() OVER (PARTITION BY grp, s_change ORDER BY grp, n_str DESC) AS rn,
grp, n_str,
SUM(value) OVER (PARTITION BY grp, s_change) AS s_val,
MAX(LENGTH(n_str)) OVER (PARTITION BY grp) AS max_nom
FROM
(
SELECT
grp, n_str, change,
SUM(change) OVER (ORDER BY grp, n_str) AS s_change,
value
FROM
t1
ORDER BY grp, n_str DESC
) AS sub1
), t3 AS
(
SELECT
grp, SUM(value) AS null_sum
FROM
test
WHERE name IS NULL
GROUP BY grp
)
SELECT x.grp, x.n_str, x.s_val + y.null_sum
FROM t2 x
JOIN t3 y
ON x.max_nom = LENGTH(x.n_str) AND x.grp = y.grp
UNION
SELECT grp, n_str, s_val
FROM
t2 WHERE max_nom != LENGTH(n_str) AND rn = 1
ORDER BY grp, n_str;
Result:
grp n_str ?column?
1 abcde 2000110
1 rstu 302
1 xy 1111
2 fgh 1000100
2 xy 300
A few points to note:
Please always provide a fiddle when you ask questions such as this one with tables and data - it provides a single source of truth for the question and eliminates duplication of effort on the part of those trying to help you!
You haven't been very clear about what, exactly, should happen with NULLs - do the values count towards the SUM()? You can vary the CASE statement as required.
What happens when there's a tie in the number of characters in the string? I've included an example in the fiddle, where you get the draws - but you may wish to sort alphabetically (or some other method)?
There appears to be an error in your provided sums for the values (even taking account of counting or not values for NULL for the name field).
Finally, you don't want to GROUP BY the largest string - you want to GROUP BY the grp fields + the SUM() of the values in the the given grp records and then pick out the longest alphanumeric string in that grouping. It would be interesting to know why you want to do this?

Related

SQL Occurrence of Sequence Number

I want to find if any Name has straight 4 or more occurrences of SeqNo in consecutive sequence only.
If there is a break in seqNo but 4 or more rows are consecutive then also i need that Name.
Example:
SeqNo Name
10 | A
15 | A
16 | A
17 | A
18 | A
9 | B
10 | B
13 | B
14 | B
6 | C
7 | C
9 | C
10 | C
OUTPUT:
A
BELOW IS SCRIPT FOR ANYONE HELPING.
create table testseq (Id int, Name char)
INSERT into testseq values
(10, 'A'),
(15, 'A'),
(16, 'A'),
(17, 'A'),
(18, 'A'),
(9, 'B'),
(10, 'B'),
(13, 'B'),
(14, 'B'),
(6, 'C'),
(7, 'C'),
(9, 'C'),
(10, 'C')
SELECT * FROM testseq
You can use some gaps-and-islands techniques for this.
If you want names that have at least 4 consecutive records where seqno is increasing by 1, then you can use the difference between seqno androw_number()` to define the groups, and then aggregate:
select distinct name
from (
select t.*, row_number() over(partition by name order by seqno) rn
from testseq t
) t
group by name, rn - seqno
having count(*) >= 4
Note that for your sample data, this returns no rows. A has 3 consecutive records where seqno is incrementing by 1, B and C have two.
I don't really view this as a "gaps-and-islands" problem. You are just looking for a minimum number of adjacent rows. This is easily handled using lag() or lead():
select t.*
from (select t.*,
lead(seqno, 3) over (partition by name order by seqno) as seqno_name_3
from t
) t
where seqno_name_3 = seqno + 3;
This checks the third sequence number on the same name. The third one after means that four names are the same in a row.
If you just want the name and to handle duplicates:
select distinct name
from (select t.*,
lead(seqno, 3) over (partition by name order by seqno) as seqno_name_3
from t
) t
where seqno_name_3 = seqno + 3;
If the sequence numbers can have gaps (but are otherwise adjacent):
select distinct name
from (select t.*,
lead(seqno, 3) over (partition by name order by seqno) as seqno_name_3,
lead(seqno, 3) over (order by seqno) as seqno_3
from t
) t
where seqno_name_3 = seqno_3;
A solution in plain SQL, no LAG() or LEAD() or ROW_NUMBER():
SELECT t1.Name
FROM testseq t1
WHERE (
SELECT count(t2.Id)
FROM testseq t2
WHERE t2.Name=t1.Name
and t2.Id between t1.Id and t1.Id+3
GROUP BY t2.Name)>=4
GROUP BY t1.Name;

combining strings from muliple rows in sql server

How can I obtain the below output in sql server 2012.
Table
ID | Values|
1 a
1 b
1 c
2 d
2 e
The output should be such that the first row has a fixed number of values(2) seperated by comma and the next row has the remaining values seperated by comma
ID
ID | Values|
1 a,b
1 c
2 d,e
Each id should contain maximum two values in a single row.The remaining values should come in the next row.
Try to use my code:
use db_test;
create table dbo.test567
(
id int,
[values] varchar(max)
);
insert into dbo.test567
values
(1, 'a'),
(1, 'b'),
(1, 'c'),
(2, 'd'),
(2, 'e')
with cte as (
select
id,
[values],
row_number() over(partition by id order by [values] asc) % 2 as rn1,
(row_number() over(partition by id order by [values] asc) - 1) / 2 as rn2
from dbo.test567
), cte2 as (
select
id, max(case when rn1 = 1 then [values] end) as t1, max(case when rn1 = 0 then [values] end) as t2
from cte
group by id, rn2
)
select
id,
case
when t2 is not null then concat(t1, ',', t2)
else t1
end as [values]
from cte2
order by id, [values]

Fluently adding values in t-sql

I have a table like this:
Items Date Price
1 2016-01-01 10
1 2016-01-02 15
1 2016-01-03 null
1 2016-01-04 null
1 2016-01-05 8
1 2016-01-06 null
1 2016-01-07 null
1 2016-01-08 null
2 2016-01-01 14
2 2016-01-02 7
2 2016-01-03 null
2 2016-01-04 null
2 2016-01-05 16
2 2016-01-06 null
2 2016-01-07 null
2 2016-01-08 5
Now I want to update the null values. The difference between the price before and after null values must be evenly added.
Example:
1 2016-01-02 15 to
1 2016-01-05 8
15 to 8 = -7
-7 / 3 = -2,333333
1 2016-01-02 15
1 2016-01-03 12,6666
1 2016-01-04 10,3333
1 2016-01-05 8
Shouldn't be made with cursors. Helptables would be OK.
This is really where you want the ignore nulls option on lag() and lead(). Alas.
An alternative is to use outer apply:
select t.*,
coalesce(t.price,
tprev.price +
datediff(day, tprev.date, t.date) * (tnext.price - tprev.price) / datediff(day, tprev.date, tnext.date)
) as est_price
from t outer apply
(select top 1 t2.*
from t t2
where t2.item = t.item and
t2.date <= t.date and
t2.price is not null
order by t2.date desc
) tprev outer apply
(select top 1 t2.*
from t t2
where t2.item = t.item and
t2.date >= t.date and
t2.price is not null
order by t2.date asc
) tnext ;
The complex arithmetic is just calculating the difference, dividing by the number of days, and then allocating the days to the current day.
WITH T1 AS
(
SELECT *,
ROW_NUMBER() OVER (PARTITION BY Items ORDER BY Date) AS RN,
FORMAT(ROW_NUMBER() OVER (PARTITION BY Items ORDER BY Date),'D10') + FORMAT(Price,'0000000000.000000') AS RnPr
FROM YourTable
), T2 AS
(
SELECT *,
MAX(RnPr) OVER (PARTITION BY Items ORDER BY Date ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS prev,
MIN(RnPr) OVER (PARTITION BY Items ORDER BY Date ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING) AS next
FROM T1
), T3 AS
(
SELECT Items,
Date,
Price,
RnPr,
InterpolatedPrice = IIF(Price IS NOT NULL,prevPrice,prevPrice + (RN - prevRN) * (nextPrice - prevPrice)/NULLIF(nextRN - prevRN,0))
FROM T2
CROSS APPLY (VALUES(CAST(SUBSTRING(prev,11,17) AS decimal(16,6)),
CAST(LEFT(prev, 10) AS INT),
CAST(SUBSTRING(next,11,17) AS decimal(16,6)),
CAST(LEFT(next, 10) AS INT)
)) V(prevPrice,prevRN,nextPrice,nextRN)
)
--UPDATE T3 SET Price = InterpolatedPrice
SELECT *
FROM T3
ORDER BY Items,
Date
Which returns
the row_number and price are bundled together in a single column (RnPr above). The order of RnPr is the same as the order by row_number. MIN and MAX both ignore NULLS. So finding the MAX(RnPr) between UNBOUNDED PRECEDING AND CURRENT ROW will include the value of the previous NOT NULL price if the price in the current row is null. And similarly MIN(RnPr) will find the next with a frame between CURRENT ROW AND UNBOUNDED FOLLOWING.
This can then be cracked apart to get the price and row number as above.
If happy with the results the final SELECT can be removed and the UPDATE uncommented as in this demo.
Just replace "YourTable" (4 of them) with your actual table name.
If you are happy with the results, then comment out the select and un-comment the UPDATE and WHERE.
Select A.Items,A.Date,
--Update YourTable Set
Price = IsNull(A.Price,((DateDiff(DD,B.Date,A.Date)/(DateDiff(DD,B.Date,C.Date)+0.0))*(C.Price - B.Price)) + B.Price)
From YourTable A
Outer Apply (Select Top 1 Date,Price from YourTable Where Items=A.Items and Date<A.Date and Price is not Null and A.Price is null Order by Price Desc) B
Outer Apply (Select Top 1 Date,Price from YourTable Where Items=A.Items and Date>A.Date and Price is not Null and A.Price is null Order by Price) C
--Where Price is NULL
Returns
Now, you'll notice nulls between 01/06 and 01/08 for Items 1. This is because there is no cap to interpolate with.
You can do this by using a series of window functions in common table expressions.
T1 assigns a row number to each row (rn)
T2 finds the first non null Price row number before and after the current row (rnb / rna)
T3 calculates the adjusted Price by looking up the prices before and after
declare #T table (Items int, Date date, Price float)
insert into #T (Items, Date, Price) values
(1, '2016-01-01', 10), (1, '2016-01-02', 15), (1, '2016-01-03', null), (1, '2016-01-04', null), (1, '2016-01-05', 8), (1, '2016-01-06', null), (1, '2016-01-07', null), (1, '2016-01-08', null), (2, '2016-01-01', 14), (2, '2016-01-02', 7), (2, '2016-01-03', null), (2, '2016-01-04', null), (2, '2016-01-05', 16), (2, '2016-01-06', null), (2, '2016-01-07', null), (2, '2016-01-08', 5)
;with T1 as
(
select *,
row_number() over (order by Items, Date) as rn
from #T
),
T2 as
(
select *,
max(case when price is null then null else rn end) over (partition by Items order by Date) as rnb,
min(case when price is null then null else rn end) over (partition by Items order by Date desc) as rna
from T1
)
select Items, Date,
isnull(price,
lag(Price, rn-rnb, Price) over (order by rn) -
(
lag(Price, rn-rnb, Price) over (order by rn) -
lead(Price, rna-rn, Price) over (order by rn)
) / (rna-rnb) * (rn-rnb)
) as Price
from T2
order by Items, Date

SQL Server 2008: find number of contiguous rows with equal values

I have a table with multiple Ids. Each Id has values arranged by a sequential index.
create table myValues
(
id int,
ind int,
val int
)
insert into myValues
values
(21, 5, 300),
(21, 4, 310),
(21, 3, 300),
(21, 2, 300),
(21, 1, 345),
(21, 0, 300),
(22, 5, 300),
(22, 4, 300),
(22, 3, 300),
(22, 2, 300),
(22, 1, 395),
(22, 0, 300)
I am trying to find the number of consecutive values that are the same.
The value field represents some data that should be change on each entry (but need not be unique overall).
The problem is to find out when there are more than two consecutive rows with the same value (given the same id).
Thus I'm looking for an output like this:
id ind val count
21 5 300 1
21 4 310 1
21 3 300 2
21 2 300 2
21 1 345 1
21 0 300 1
22 5 300 4
22 4 300 4
22 3 300 4
22 2 300 4
22 1 395 1
22 0 300 1
I'm aware this is similar to the island and gaps problem discussed here.
However, those solutions all hinge on the ability to use a partition statement with values that are supposed to be consecutively increasing.
A solution that generates the ranges of "islands" as an intermediary would work as well, e.g.
id startind endind
21 3 2
22 5 2
Note that there can be many islands for each id.
I'm sure there is a simple adaptation of the island solution, but for the life of me I can't think of it.
find the continuous group and then do a count() partition by that
select id, ind, val, count(*) over (partition by id, val, grp)
from
(
select *, grp = dense_rank() over (partition by id, val order by ind) - ind
from myValues
) d
order by id, ind desc
The other solution is obviously more elegant. I'll have to study it a little closer myself.
with agg(id, min_ind, max_ind, cnt) as (
select id, min(ind), max(ind), count(*)
from
(
select id, ind, val, sum(brk) over (partition by id order by ind desc) as grp
from
(
select
id, ind, val,
coalesce(sign(lag(ind) over (partition by id, val order by ind desc) - ind - 1), 1) as brk
from myValues
) as d
) as d
group by id, grp
)
select v.id, v.ind, v.val, a.cnt
from myValues v inner join agg a on a.id = v.id and v.ind between min_ind and max_ind
order by v.id, v.ind desc;

select based on specific values

I have this table:
ID NO.
111 6
222 7
333 9
111 8
333 4
222 3
111 7
222 5
333 2
I want to select only 2 ID numbers from table where NO. column equal specific values.
For example i tried this query but i didn't get the expected result:
SELECT top 2 * FROM mytable where NO. in
(select NO. from mytable )
Expected result:
111 6
111 8
222 7
222 3
333 9
333 3
You seem to want to select two rows in the table for each id, based on a condition on the No column. For this, one method uses row_number():
select t.*
from (select t.*, row_number() over (partition by id order by id) as seqnum
from mytable t
where <condition goes here>
) t
where seqnum <= 2;
I'm guessing (333,3) is a mistake and you expect (333,2). If not I have no idea.
SELECT
ua.ID
, ua.[NO.]
FROM (
SELECT
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY t.[NO.] ASC) AS RowNum
, t.ID
, t.[NO.]
FROM dbo.t1 AS t
UNION ALL
SELECT
ROW_NUMBER() OVER (PARTITION BY ID ORDER BY t.[NO.] DESC)
, ID
, t.[NO.]
FROM dbo.t1 AS t
) ua
WHERE ua.RowNum = 1
ORDER BY ID, ua.[NO.] DESC
If you're just trying to get top 2 values for each group, you need something to define the order, ie. a third column. Then you don't need UNION ALL, just use WHERE ua.RowNum < 3.
/*Select 2 random rows per id where the number of rows per id can vary between 1 and infinity
A good article for this:-*/
--https://www.mssqltips.com/sqlservertip/3157/different-ways-to-get-random-data-for-sql-server-data-sampling/
DECLARE #TABLE TABLE(ID INT,NO INT)
INSERT INTO #TABLE
VALUES
(111, 6),
(222, 7),
(333 , 9),
(111 , 8),
(333 , 4),
(222 , 3),
(111 , 7),
(222 , 5),
(333 , 2)
select t.* from
(
Select s.* ,ROW_NUMBER() OVER(PARTITION BY ID ORDER BY randomnumber) ROWNUMBER
from
(
SELECT ID,NO,
(ABS(CHECKSUM(NEWID())) % 100001) + ((ABS(CHECKSUM(NEWID())) % 100001) * 0.00001) [randomnumber]
FROM #TABLE
) s
) t
where t.rownumber < 3