Get Distinct value from a list in SQL Server - sql

I have a DB column that has a comma delimited list:
VALUES   | ID
---------+----
1,11,32  | A
11,12,28 | B
1        | C
32,12,1  | D
When I run my SQL statement, in my WHERE clause I have tried IN, CONTAINS and LIKE with varying degrees of errors and success, but none offer an exact return of what I need.
What I need is a WHERE clause that returns all IDs that have the value '1' (NOT the number) as an item in the list.
Example of problem:
WHERE values like (1)
This will return A,B,C,D because 1 is included in the value (11). I would expect IDs (A,C,D).
WHERE values like (2)
This will return A,B,D because 2 is included in the values (32,28,12). I would expect zero records.
Thanks in advance for your help!

I will begin my answer by quoting the spot-on comment given by @Jarlh above:
Never, ever store data as comma separated items. It will only cause you lots of trouble.
That being said, if you're really stuck with this design, you could use:
SELECT *
FROM yourTable
WHERE ',' + [VALUES] + ',' LIKE '%,1,%';
The trick here is to convert every VALUES string into something that looks like:
,11,12,28,
Then we can search for the target number with comma delimiters on both sides. Because we placed commas at both ends, every number in the CSV list is now guaranteed to have a comma on each side.
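The wrapping trick described above can be tried against the sample rows from the question. This is only a minimal sketch: the table variable @t and the parameter @target are illustrative names, not part of the original answer, and it assumes the lists contain no spaces around the commas.

```sql
-- Sample data copied from the question; @t and @target are illustrative names.
DECLARE @t TABLE ([VALUES] varchar(100), ID char(1));
INSERT INTO @t ([VALUES], ID) VALUES
    ('1,11,32', 'A'), ('11,12,28', 'B'), ('1', 'C'), ('32,12,1', 'D');

DECLARE @target varchar(10) = '1';

-- Wrap both the stored list and the search item in commas, then match.
SELECT ID
FROM @t
WHERE ',' + [VALUES] + ',' LIKE '%,' + @target + ',%'
ORDER BY ID;
-- With @target = '1' this matches A, C and D; with @target = '2' it matches no rows.
```

Because the pattern starts with a wildcard, a query like this generally has to scan every row, which is one more reason to treat it as a stop-gap.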

If you are stuck with such a poor data model, I would suggest:
select t.*
from t
where exists (select 1
              from string_split(t.[values], ',') s  -- VALUES is a reserved word, so bracket it
              where s.value = '1'                   -- compare as strings; string_split returns text
             );

Exactly. I echo what jarlh and Tim say: a table in a relational model is not the right place to store comma-delimited strings.
Here is an approach that can likely use an index, if there is one on column x:
select distinct x
from t
cross apply string_split(t.x, ',')
where value = '1' -- the search value can be parameterized here
+---------+
| x |
+---------+
| 1 |
| 1,11,32 |
| 32,12,1 |
+---------+
working example
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=b9b3084f52b0f42ffd17d90427016999
--SQL Server older versions
with data
as (
    SELECT t.c.value('.', 'VARCHAR(1000)') as val
          ,y
          ,x
    FROM (
        SELECT x1 = CAST('<t>' +
                    REPLACE(x, ',', '</t><t>') + '</t>' AS XML)
              ,y
              ,x
        FROM t
    ) a
    CROSS APPLY x1.nodes('/t') t(c)
)
select x,y
from data
where val = '1' -- keep only the rows whose list contains the item 1
+---------+
| x |
+---------+
| 1 |
| 1,11,32 |
| 32,12,1 |
+---------+
working example
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=011a096bbdd759ea5fe3aa74b08bc895

Related

How to retrieve records using comma separated values with IN clause?

I would like to retrieve certain records from a table, using comma-separated values with an IN clause. The table rows look like this:
Here is my SQL query, but it completes with an empty result set:
DECLARE @input VARCHAR(1000) = '2,3,17,10,16'
SELECT * FROM locations
WHERE
east_zone in (SELECT VALUE FROM string_split(@input,','))
OR
west_zone in (SELECT VALUE FROM string_split(@input,','))
Appreciate your help!
While this can be accomplished, I would ask you to rethink your data model. It's a bad idea to store a comma-separated list of ids/references in your database. I strongly agree with the comments of Tim Biegeleisen.
An alternative would be to store the list of zones/titles in a separate table.
Here is a way to accomplish this:
with data
as (select 'model_check_holding' as col1,'1,2,3,4,5' as str union all
select 'model_cash_holding' as col1,'5,8,9' as str
)
,split_data
as (select *
from data
cross apply string_split(str,',')
)
,user_input
as(select '2,8,1' as input_val)
select *
from split_data
where value in (select x.value
from user_input
cross apply string_split(input_val,',') x
)
+---------------------+-----------+-------+
| col1 | str | value |
+---------------------+-----------+-------+
| model_check_holding | 1,2,3,4,5 | 1 |
| model_check_holding | 1,2,3,4,5 | 2 |
| model_cash_holding | 5,8,9 | 8 |
+---------------------+-----------+-------+
dbfiddle link
https://dbfiddle.uk/?rdbms=sqlserver_2019&fiddle=1cc9b224e443369744df19c1d7a7d789
Tim is 110% correct. Your data model is totally messed up: not only does it store multiple values in a delimited string, it also stores numbers as strings. Wrong, wrong, wrong.
But if you are stuck with someone else's really, really, really bad design choices, you do have an option:
DECLARE @input VARCHAR(1000) = '2,3,17,10,16';
SELECT l.*
FROM locations l
WHERE EXISTS (SELECT 1
              FROM string_split(@input, ',') s1 JOIN
                   string_split(concat(l.east_zone, ',', l.west_zone), ',') s2  -- s2 avoids reusing the outer alias l
                   ON s1.value = s2.value
             );
I do not recommend this approach. I merely suggest it as a stop-gap until you can fix the data model.
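For comparison, a normalized layout along the lines both answers recommend might look like the sketch below. The table location_zones and its columns are hypothetical names chosen for illustration; the real locations table is not shown in the question, so the key column location_id is an assumption.

```sql
-- Hypothetical normalized layout: one row per (location, side, zone).
CREATE TABLE location_zones (
    location_id int     NOT NULL,  -- assumed key of the locations table
    side        char(4) NOT NULL,  -- 'east' or 'west'
    zone_id     int     NOT NULL,  -- stored as a number, not as text
    PRIMARY KEY (location_id, side, zone_id)
);

DECLARE @input varchar(1000) = '2,3,17,10,16';

-- Only the user input still needs splitting; the stored data is already one value per row.
SELECT DISTINCT lz.location_id
FROM location_zones lz
WHERE lz.zone_id IN (SELECT CAST(s.value AS int)
                     FROM string_split(@input, ',') s);
```

With this shape the zone lookup becomes an ordinary comparison on an indexed integer column, and no string parsing of stored data is required.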

Extracting Multiple Numerical Values from Text

048(70F-Y),045(DDI-Y),454(CMDE-Y)
I have the above data in a column field. I need to extract each number before the (, so in the above example I would want to see 048, 045, 454.
Note that the data in the field changes from record to record. The example above has 3 sets of numbers; sometimes there may be just one set, or 6 sets. I just need to capture all sets of numbers that are to the left of the (.
Ideally I would want the results to show in a new column like below. I have tried a few things and gotten nowhere; any help would be greatly appreciated.
I would expect the result to look like the below:
+----------+-----------------------------------+---------------+
| EventId | PAEditTypes | Edits |
+----------+-----------------------------------+---------------+
| 6929107 | 082(SPA-Y),177(QL-Y) | 082, 177 |
| 26534980 | 048(70F-Y),045(DDI-Y),454(CMDE-Y) | 045, 048, 454 |
+----------+-----------------------------------+---------------+
You can get the desired output with the following steps:
Use string_split with cross apply to isolate each item.
Use left to get only the first part of each item, together with CHARINDEX to know where to stop.
Use STRING_AGG to build the final result, adding a WITHIN GROUP clause to enforce ordering (if ordering is not important, just remove the WITHIN GROUP clause).
This is a T-SQL sample that should work:
declare @tmp table ( EventId varchar(50), PAEditTypes varchar(200) )
insert into @tmp values
 ('6929107' ,'082(SPA-Y),177(QL-Y)' )
,('26534980','048(70F-Y),045(DDI-Y),454(CMDE-Y)')
select
    EventId
  , PAEditTypes
  , STRING_AGG(left(value,CHARINDEX('(',value)-1),', ') WITHIN GROUP (ORDER BY value ASC) as Edits
from
    @tmp
cross apply
    string_split(PAEditTypes, ',')
group by
    EventId
  , PAEditTypes
order by
    EventId desc
Output:

Dynamic pivot for thousands of columns

I'm using pgAdmin III / PostgreSQL 9.4 to store and work with my data. Sample of my current data:
x | y
--+--
0 | 1
1 | 1
2 | 1
5 | 2
5 | 2
2 | 2
4 | 3
6 | 3
2 | 3
How I'd like it to be formatted:
1, 2, 3 -- column names are unique y values
0, 5, 4 -- the first respective x values
1, 5, 6 -- the second respective x values
2, 2, 2 -- etc.
It would need to be dynamic because I have millions of rows and thousands of unique values for y.
Is using a dynamic pivot approach correct for this? I have not been able to successfully implement this:
DECLARE @columns VARCHAR(8000)
SELECT @columns = COALESCE(@columns + ',[' + cast(y as varchar) + ']',
'[' + cast(y as varchar)+ ']')
FROM tableName
GROUP BY y
DECLARE @query VARCHAR(8000)
SET @query = '
SELECT x
FROM tableName
PIVOT
(
MAX(x)
FOR [y]
IN (' + @columns + ')
)
AS p'
EXECUTE(@query)
It is stopping on the first line and giving the error:
syntax error at or near "@"
All dynamic pivot examples I've seen use this, so I'm not sure what I've done wrong. Any help is appreciated. Thank you for your time.
Note: It is important for the x values to be stored in the correct order, as sequence matters. I can add another column to indicate sequential order if necessary.
The term "first row" assumes a natural order of rows, which does not exist in database tables. So, yes, you need to add another column to indicate sequential order like you suspected. I am assuming a column tbl_id for the purpose. Using the ctid would be a measure of last resort. See:
Deterministic sort order for window functions
The code you present looks like MS SQL Server code; invalid syntax for Postgres.
For millions of rows and thousands of unique values for Y it wouldn't even make sense to try and return individual columns. Postgres has generous limits, but not nearly generous enough for that. According to the source code or the manual, the absolute maximum number of columns is 1600.
So we don't even get to discuss the restrictive characteristics of SQL, which demands to know columns and data types at execution time, not dynamically adjusted during execution. You would need two separate calls, like we discussed in great detail under this related question.
Dynamic alternative to pivot with CASE and GROUP BY
Another answer by Clodoaldo under the same question returns arrays. That can actually be completely dynamic. And that's what I suggest here, too. The query is actually rather simple:
WITH cte AS (
SELECT *, row_number() OVER (PARTITION BY y ORDER BY tbl_id) AS rn
FROM tbl
ORDER BY y, tbl_id
)
SELECT text 'y' AS col, array_agg (y) AS values
FROM cte
WHERE rn = 1
UNION ALL
( -- parentheses required
SELECT text 'x' || rn, array_agg (x)
FROM cte
GROUP BY rn
ORDER BY rn
);
Result:
col | values
----+--------
y | {1,2,3}
x1 | {0,5,4}
x2 | {1,5,6}
x3 | {2,2,2}
Explanation
The CTE computes a row_number rn for each row (each x) per group of y. We are going to use it twice, hence the CTE.
The 1st SELECT in the outer query generates the array of y values.
The 2nd SELECT in the outer query generates all arrays of x values in order. Arrays can have different length.
Why the parentheses for UNION ALL? See:
Sum results of a few queries and then find top 5 in SQL
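To try the array query above, a small test table can be set up from the sample data in the question. This is a sketch under assumptions: the table name tbl and the ordering column tbl_id are the ones assumed in the answer, and a serial column is used here only because the rows are inserted in the desired sequence within a single statement.

```sql
-- PostgreSQL: sample table with an explicit ordering column (tbl_id is assumed, as in the answer).
CREATE TABLE tbl (
    tbl_id serial PRIMARY KEY,  -- records the insertion sequence
    x      int NOT NULL,
    y      int NOT NULL
);

INSERT INTO tbl (x, y) VALUES
    (0, 1), (1, 1), (2, 1),
    (5, 2), (5, 2), (2, 2),
    (4, 3), (6, 3), (2, 3);

-- The intermediate step of the CTE: rn numbers the x values within each y group.
SELECT tbl_id, x, y,
       row_number() OVER (PARTITION BY y ORDER BY tbl_id) AS rn
FROM   tbl
ORDER  BY y, tbl_id;
```

Rows that share the same rn end up in the same output array, which is how the query lines up the first, second and third x value of every y.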

SQL Server - Find records with identical substrings [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 8 years ago.
I inherited a table that has a column containing hand-entered award numbers. It has been used for many years by many people. The award numbers in general look like this:
R01AR012345-01
R01AR012345-02
R01AR012345-03
Award numbers get assigned each year. Because so many different people have had their hands in this in the past, there isn't a lot of consistency in how these are entered. For instance, an award sequence may appear like this:
R01AR012345-01
1 RO1AR012345-02
12345-03
12345-05A1
1234506
The rule I've been given to find is to return any record in which 5 consecutive integers from that column match with another record.
I know how to match a given string, but am at a loss when the 5 consecutive integers are unknown.
Here's a sample table to make what I'm looking for more clear:
+----------------------+
| table: AWARD |
+-----+----------------+
| ID | AWARD_NO |
+-----+----------------+
| 12 | R01AR015123-01 |
+-----+----------------+
| 13 | R01AR015124-01 |
+-----+----------------+
| 14 | 15123-02A1 |
+-----+----------------+
| 15 | 1 Ro1XY1512303 |
+-----+----------------+
| 16 | R01XX099232-01 |
+-----+----------------+
In the above table, the following IDs would be returned: 12,13,14,15
The five consecutive integers that match are:
12,13: 01512
12,14: 15123
12,15: 15123
In our specific case, ID 13 is a false positive... but we're willing to deal with those on a case-by-case basis.
Here's the desired return set for the above table:
+-----+-----+----------------+----------------+
| ID1 | ID2 | AWARD_NO_1 | AWARD_NO_2 |
+-----+-----+----------------+----------------+
| 12 | 13 | R01AR015123-01 | R01AR015124-01 |
+-----+-----+----------------+----------------+
| 12 | 14 | R01AR015123-01 | 15123-02A1 |
+-----+-----+----------------+----------------+
| 12 | 15 | R01AR015123-01 | 1 Ro1XY1512303 |
+-----+-----+----------------+----------------+
Now... I'm OK with false positives (like 12 matching 13) and duplicates (because if 12 matches 14, then 14 also matches 12). We're looking through something like 18,000 rows. Optimization isn't really necessary in this situation, because it's only needed to be run one time.
This should handle removing duplicates and most false-positives:
DECLARE @SPONSOR TABLE (ID INT NOT NULL PRIMARY KEY, AWARD_NO VARCHAR(50))
INSERT INTO @SPONSOR VALUES (12, 'R01AR015123-01')
INSERT INTO @SPONSOR VALUES (13, 'R01AR015124-01')
INSERT INTO @SPONSOR VALUES (14, '15123-02A1')
INSERT INTO @SPONSOR VALUES (15, '1 Ro1XY1512303')
INSERT INTO @SPONSOR VALUES (16, 'R01XX099232-01')
;WITH nums AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS [Num]
    FROM sys.objects
),
cte AS
(
    SELECT sp.ID,
           sp.AWARD_NO,
           SUBSTRING(sp.AWARD_NO, nums.Num, 5) AS [TestCode],
           SUBSTRING(sp.AWARD_NO, nums.Num + 5, 1) AS [FalsePositiveTest]
    FROM @SPONSOR sp
    CROSS JOIN nums
    WHERE nums.Num < LEN(sp.AWARD_NO)
    AND SUBSTRING(sp.AWARD_NO, nums.Num, 5) LIKE '%[1-9][0-9][0-9][0-9][0-9]%'
    -- AND SUBSTRING(sp.AWARD_NO, nums.Num, 5) LIKE '%[0-9][0-9][0-9][0-9][0-9]%'
)
SELECT sp1.ID AS [ID1],
       sp2.ID AS [ID2],
       sp1.AWARD_NO AS [AWARD_NO1],
       sp2.AWARD_NO AS [AWARD_NO2],
       sp1.TestCode
FROM cte sp1
CROSS JOIN @SPONSOR sp2
WHERE sp2.AWARD_NO LIKE '%' + sp1.TestCode + '%'
AND sp1.ID < sp2.ID
--AND 1 = CASE
-- WHEN (
-- sp1.FalsePositiveTest LIKE '[0-9]'
-- AND sp2.AWARD_NO NOT LIKE
-- '%' + sp1.TestCode + sp1.FalsePositiveTest + '%'
-- ) THEN 0
-- ELSE 1
-- END
Output:
ID1 ID2 AWARD_NO1 AWARD_NO2 TestCode
12 14 R01AR015123-01 15123-02A1 15123
12 15 R01AR015123-01 1 Ro1XY1512303 15123
14 15 15123-02A1 1 Ro1XY1512303 15123
If IDs 14 and 15 should not match, we might be able to correct for that as well.
EDIT:
Based on the comment from @Serpiton, I commented out the creation and usage of the [FalsePositiveTest] field, since changing the initial character range in the LIKE clause on the SUBSTRING to [1-9] accomplishes the same goal slightly more efficiently. However, this change assumes that no valid Award # will start with a 0, and I am not sure that this is a valid assumption. Hence, I left the original code in place but commented out.
You want to use the LIKE command in your where clause and use a pattern to look for the 5 numbers. See this post here:
There are probably better ways of representing this but the below example looks for 5 digits from 0-9 next to each other in the data anywhere in your column value. This could perform quite slowly however...
Select *
from blah
Where column LIKE '%[0-9][0-9][0-9][0-9][0-9]%'
Create a sql server function to extract the 5 numbers and then use the function in your query.
Perhaps something like:
select dbo.GetAwardNumber(AwardNumberField) as AwardNumber -- scalar functions need a schema prefix
from Awards
group by dbo.GetAwardNumber(AwardNumberField)
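The answer above does not show the function itself. One possible shape for it is sketched below; the body is an assumption made for illustration, not the answerer's code. It returns the first run of five consecutive digits, or NULL when there is none.

```sql
-- Hypothetical implementation of the GetAwardNumber function used above.
CREATE FUNCTION dbo.GetAwardNumber (@award varchar(50))
RETURNS varchar(5)
AS
BEGIN
    -- Position of the first run of five consecutive digits (0 if there is none).
    DECLARE @pos int = PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%', @award);

    -- CASE without ELSE yields NULL when no such run exists.
    RETURN CASE WHEN @pos > 0 THEN SUBSTRING(@award, @pos, 5) END;
END;
```

A function like this looks only at the first five-digit window, so R01AR015123-01 yields 01512 while 15123-02A1 yields 15123. Two records that share digits at a different offset will therefore not group together.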
I will not post the code, but an idea on how to do it.
First of all, you need to make a table-valued function that returns all number sequences of at least 5 characters from a string (there are examples on SO).
So for each entry your function will return a list of numbers.
After that the query will simplify like:
;with res as (
select
id, -- hopefully there is an id on your table
pattern -- pattern is from the list of patterns the udtf returns
from myTable
cross apply udtf_custom(myString) -- myString is the string you need to split
)
select
pattern
from res
group by pattern
having count(distinct id)>1
I have to note that this is for example purposes, there should be some coding and testing involved, but this should be the story with it.
Good luck, hope it helps.
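The table-valued function from the answer above is left unwritten there. One way it could look is sketched below, reading "number sequences" as every window of five consecutive digits; that reading, the body, and the 50-character limit are all assumptions. Only the name udtf_custom and the output column pattern come from the answer.

```sql
-- Hypothetical inline table-valued function: every 5-digit window found in @s.
CREATE FUNCTION dbo.udtf_custom (@s varchar(50))
RETURNS TABLE
AS
RETURN
    SELECT DISTINCT SUBSTRING(@s, n.Num, 5) AS pattern
    FROM (
        -- Small numbers list: 1..50, enough for a varchar(50) input.
        SELECT TOP (50) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS Num
        FROM sys.all_objects
    ) AS n
    WHERE n.Num <= LEN(@s) - 4   -- only start positions that leave room for 5 characters
      AND SUBSTRING(@s, n.Num, 5) LIKE '[0-9][0-9][0-9][0-9][0-9]';
```

Returning every window rather than only the first one lets records match even when the shared digits sit at different offsets, at the cost of more false positives.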
Here's what I ended up with:
SELECT a1.ID as AWARD_ID_1, a2.ID as AWARD_ID_2, a1.AWARD_NO as Sponsor_Award_1, a2.AWARD_NO as Sponsor_Award_2
FROM AWARD a1
LEFT OUTER JOIN AWARD a2
ON SUBSTRING(a1.AWARD_NO,PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%',a1.AWARD_NO + '1'),5) = SUBSTRING(a2.AWARD_NO,PATINDEX('%[0-9][0-9][0-9][0-9][0-9]%',a2.AWARD_NO + '1'),5)
WHERE
a1.AWARD_NO <> '' AND a2.AWARD_NO <> ''
AND a1.ID <> a2.ID
AND a1.AWARD_NO LIKE '%[0-9][0-9][0-9][0-9][0-9]%' AND a2.AWARD_NO LIKE '%[0-9][0-9][0-9][0-9][0-9]%'
There's a possibility that the first substring of five characters might not match (when they should generate a match), but it's close enough for us. :-)

sql logical compression of records

I have a table in SQL with more than 1 million records which I want to compress using the following algorithm, and I'm looking for the best way to do that, preferably without using a cursor.
If the table contains all 10 possible last digits (from 0 to 9) for a number (like 252637 in the following example), we find the most used Source (in our example 'A'), then remove all digits where Source = 'A' and insert the collapsed digit in their place (here 252637).
The example below should help in understanding this.
Original table :
Digit(bigint)| Source
|
2526370 | A
2526371 | A
2526372 | A
2526373 | B
2526374 | C
2526375 | A
2526376 | B
2526377 | A
2526378 | B
2526379 | B
Compressed result :
252637 |A
2526373 |B
2526374 |C
2526376 |B
2526378 |B
2526379 |B
This is just another version of Tom Morgan's accepted answer. It uses division instead of substring to trim the least significant digit off the BIGINT digit column:
SELECT
    t.Digit/10 AS Prefix,
    (
        -- For each prefix, get the Source that is most abundant (statistical mode).
        SELECT TOP 1
            i.Source
        FROM
            [table] i   -- placeholder name; TABLE is a reserved word, hence the brackets
        WHERE
            (i.Digit/10) = (t.Digit/10)
        GROUP BY
            i.Source
        ORDER BY
            COUNT(*) DESC
    ) AS Source
FROM
    [table] t
GROUP BY
    t.Digit/10
HAVING
    COUNT(*) = 10
I think it'll be faster, but you should test it and see.
You could identify the rows which are candidates for compression without a cursor (I think) by GROUPing by a substring of the Digit (the length -1) HAVING count = 10. That would identify digits with 10 child rows. You could use this list to insert to a new table, then use it again to delete from the original table. What would be left would be rows that don't have all 10, which you'd also want to insert to the new table (or copy the new data back to the original).
Does that make sense? I can write it out a bit better if it doesn't.
Possible SQL Solution:
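SELECT
    LEFT(t.Digit, LEN(t.Digit) - 1) AS Prefix,  -- LEFT converts the bigint implicitly; SUBSTRING would not
    (SELECT TOP 1 i.Source
     FROM [table] i                             -- placeholder name; TABLE is a reserved word
     WHERE LEFT(i.Digit, LEN(i.Digit) - 1)
         = LEFT(t.Digit, LEN(t.Digit) - 1)
     GROUP BY i.Source
     ORDER BY COUNT(*) DESC
    ) AS Source
FROM [table] t
GROUP BY LEFT(t.Digit, LEN(t.Digit) - 1)
HAVING COUNT(*) = 10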
HAVING COUNT(*) = 10
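The find, delete, then insert sequence described in prose above can be sketched end to end on the sample data from the question. The table variables @src and @full are illustrative names. The sketch assumes a collapsed prefix never collides with an existing Digit value, and it handles a single level of compression only.

```sql
-- Sample data from the question; @src and @full are illustrative names.
DECLARE @src TABLE (Digit bigint PRIMARY KEY, Source char(1));
INSERT INTO @src (Digit, Source) VALUES
    (2526370,'A'),(2526371,'A'),(2526372,'A'),(2526373,'B'),(2526374,'C'),
    (2526375,'A'),(2526376,'B'),(2526377,'A'),(2526378,'B'),(2526379,'B');

-- Step 1: prefixes that have all 10 last digits, with their most used Source.
DECLARE @full TABLE (Prefix bigint PRIMARY KEY, Source char(1));
INSERT INTO @full (Prefix, Source)
SELECT p.Prefix, m.Source
FROM (SELECT Digit / 10 AS Prefix
      FROM @src
      GROUP BY Digit / 10
      HAVING COUNT(*) = 10) AS p
CROSS APPLY (SELECT TOP (1) s.Source      -- ties are broken arbitrarily
             FROM @src AS s
             WHERE s.Digit / 10 = p.Prefix
             GROUP BY s.Source
             ORDER BY COUNT(*) DESC) AS m;

-- Step 2: remove the rows that the collapsed prefix now stands for.
DELETE s
FROM @src AS s
JOIN @full AS f ON s.Digit / 10 = f.Prefix AND s.Source = f.Source;

-- Step 3: insert the collapsed rows.
INSERT INTO @src (Digit, Source)
SELECT Prefix, Source FROM @full;

SELECT Digit, Source FROM @src ORDER BY Digit;
-- Leaves 252637/A plus the B and C rows (2526373, 2526374, 2526376, 2526378, 2526379).
```

On a table with a million rows, the same three statements would run against the real table, ideally inside one transaction so the delete and insert succeed or fail together.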