I have a table (postgres) with a varchar field that has content structured like:
".. John;Smith;uuid=7c32e9e1-e29e-4211-b11e-e20b2cb78da9 .."
The uuid can occur in more than one record. But it must not occur for more than one combination of [givenname];[surname], according to a business rule.
That is, if the John Smith example above is present in the table, then if uuid 7c32e9e1.. occurs in any other record, the field in that record must also contain ".. John;Smith; .."
The problem is, this business rule has been violated due to some bug, and I would like to know how many rows in the table contain a uuid that also occurs in some other row with a different combination of [givenname];[surname].
I'd appreciate if someone could help me out with the SQL to accomplish this.
Use regular expressions to extract the UUID and the name from the string. Then aggregate per UUID and either count distinct names or compare minimum and maximum name:
select
  -- the character class includes '-' so the whole uuid is matched, not just its first block
  substring(col, 'uuid=([[:alnum:]-]+)') as uuid,
  string_agg(distinct substring(col, '([[:alnum:]]+;[[:alnum:]]+);uuid'), ' | ') as names
from mytable
group by substring(col, 'uuid=([[:alnum:]-]+)')
having count(distinct substring(col, '([[:alnum:]]+;[[:alnum:]]+);uuid')) > 1;
Demo: https://dbfiddle.uk/?rdbms=postgres_12&fiddle=907a283a754eb7427d4ffbf50c6f0028
If you only want to count:
select
  count(*) as cnt_uuids,
  sum(num_names) as cnt_names,
  sum(num_rows) as cnt_rows
from
(
  select
    count(*) as num_rows,
    count(distinct substring(col, '([[:alnum:]]+;[[:alnum:]]+);uuid')) as num_names
  from mytable
  group by substring(col, 'uuid=([[:alnum:]-]+)')
  having count(distinct substring(col, '([[:alnum:]]+;[[:alnum:]]+);uuid')) > 1
) flaws;
But as has been mentioned already: This is not how a database should be used.
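If you are able to fix the model rather than just report on it, one option is to move the parsed values into a lookup table whose primary key makes the rule impossible to violate. A minimal sketch, reusing the regular expressions above with a hypothetical table name:

-- Hypothetical lookup table: exactly one name combination per uuid.
-- The primary key enforces the business rule declaratively.
create table person_uuid (
  uuid      uuid primary key,
  givenname text not null,
  surname   text not null
);

-- Populate it from the parsed strings. If the data still violates the
-- rule, this insert fails on the first duplicate uuid, so run the
-- detection query above first and clean up the flagged rows.
insert into person_uuid (uuid, givenname, surname)
select distinct
  substring(col, 'uuid=([[:alnum:]-]+)')::uuid,
  split_part(substring(col, '([[:alnum:]]+;[[:alnum:]]+);uuid'), ';', 1),
  split_part(substring(col, '([[:alnum:]]+;[[:alnum:]]+);uuid'), ';', 2)
from mytable;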
I assume you know all the reasons why this is a bad data format, but you are stuck with it. Here is my approach:
select v.user_id, array_agg(distinct names)
from (select v.id,
max(el) filter (where n = un) as user_id,
array_agg(el order by el) filter (where n in (un - 2, un - 1)) as names
from (select v.id, u.*,
max(u.n) filter (where el like 'uuid=%') over (partition by v.id) as un
from (values (1 , 'junkgoeshere;John;Smith;uuid=7c32e9e1-e29e-4211-b11e-e20b2cb78da9; ..'),
(2 , 'junkgoeshere;John;Smith;uuid=7c32e9e1-e29e-4211-b11e-e20b2cb78da9; ..'),
(3 , 'junkgoeshere;John;Smith;uuid=new_7c32e9e1-e29e-4211-b11e-e20b2cb78da9; ..'),
(4 , 'junkgoeshere;John;Jay;uuid=new_7c32e9e1-e29e-4211-b11e-e20b2cb78da9; ..')
) v(id, str) cross join lateral
unnest(regexp_split_to_array(v.str, ';')) with ordinality u(el, n)
) v
where n between un - 2 and un
group by v.id
) v
group by user_id
having min(names) <> max(names);
Here is a db<>fiddle.
This assumes that the fields are separated by semicolons. Your data format is just awful, not just as a string but because the names are not identified. So, I am assuming they are the two fields before the user_id field.
So, this implements the following logic:
- Breaks up the string by semicolons, tagging each element with an identifying number.
- Finds the number of the user_id element.
- Extracts the previous two fields together with the user_id column.
- Then uses aggregation to find cases where there are multiple matches.
I am trying to unnest the below table.
I am using the below unnest query to flatten it:
SELECT
  id,
  name,
  keyword
FROM `project_id.dataset_id.table_id`,
  unnest(`groups`) as `groups`
WHERE id = 204358
The problem is, this duplicates the rows (except name), as is always the case when flattening a table.
How can I modify the query to put the names in two different columns rather than rows?
Expected output below -
That's because the comma is a cross join - in combination with an unnested array it is a lateral cross join. You repeat the parent row for every row in the array.
One problem with pivoting arrays is that an array can have a variable number of rows, but a table must have a fixed number of columns.
So you need a way to decide which array row becomes which column.
E.g. with
SELECT
id,
name,
groups[ordinal(1)] as firstArrayEntry,
groups[ordinal(2)] as secondArrayEntry,
keyword
FROM `project_id.dataset_id.table_id`
where id = 204358
If your array had a key-value pair you could decide using the key. E.g.
SELECT
id,
name,
(select value from unnest(groups) where key='key1') as key1,
keyword
FROM `project_id.dataset_id.table_id`
where id = 204358
But that doesn't seem to be the case with your table ...
A third option could be PIVOT in combination with your cross-join solution, but that one has restrictions too, and I'm not sure how computation-heavy it is.
Consider below simple solution
select * from (
select id, name, keyword, offset
from `project_id.dataset_id.table_id`,
unnest(`groups`) with offset
) pivot (max(name) name for offset + 1 in (1, 2))
If applied to the sample data in your question, the output is:
Note: when you apply this to your real case, you just need to know how many such name_NNN columns to expect and extend the list accordingly - for example, for offset + 1 in (1, 2, 3, 4, 5) if you expect 5 such columns.
If for whatever reason you want to improve on this, use the below, where everything is built dynamically so you don't need to know in advance how many columns the output will have:
execute immediate (select '''
select * from (
select id, name, keyword, offset
from `project_id.dataset_id.table_id`,
unnest(`groups`) with offset
) pivot (max(name) name for offset + 1 in (''' || string_agg('' || pos, ', ') || '''))
'''
from (select pos from (
select max(array_length(`groups`)) cnt
from `project_id.dataset_id.table_id`
), unnest(generate_array(1, cnt)) pos
))
Your question is a little unclear, because it does not specify what to do with other keywords or other columns. If you specifically want the first two values in the array for keyword "OVG", you can unnest the array and pull out the appropriate names:
SELECT id,
(SELECT g.name
FROM UNNEST(t.groups) g WITH OFFSET n
WHERE key = 'OVG'
ORDER BY n
LIMIT 1
) as name_1,
(SELECT g.name
FROM UNNEST(t.groups) g WITH OFFSET n
WHERE key = 'OVG'
ORDER BY n
LIMIT 1 OFFSET 1
) as name_2,
'OVG' as keyword
FROM `project_id.dataset_id.table_id` t
WHERE id = 204358;
I am trying to return all results that match a specific regex in a Redshift database.
Sample query:
WITH aa AS
(SELECT DISTINCT
id,
record,
regexp_substr(record, '(#{2})([A-Z]{2,3})',1,1),
REGEXP_COUNT(record, '(#{2})([A-Z]{2,3})')
FROM table_a)
SELECT
*
FROM aa
The above query returns the first occurrence and the total count of the occurrences.
Is there a way to return all occurrences? Maybe an iteration using a variable that is limited by the count?
regexp_substr(record, '(#{2})([A-Z]{2,3})',1,n)
There is a solution for Oracle with the CONNECT BY LEVEL, but that option seems to be unavailable for Redshift.
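For reference, the Oracle pattern I mean looks roughly like this (a sketch only, reusing the table and pattern from the query above; it is not valid on Redshift):

-- Oracle only: expand each row into one row per regex occurrence.
-- The PRIOR clauses keep the expansion confined to a single source row.
SELECT id,
       regexp_substr(record, '(#{2})([A-Z]{2,3})', 1, LEVEL) AS occurrence
FROM table_a
CONNECT BY LEVEL <= regexp_count(record, '(#{2})([A-Z]{2,3})')
   AND PRIOR id = id
   AND PRIOR sys_guid() IS NOT NULL;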
If I understand correctly, you can use split_part() instead:
split_part(record, '##', <n>)
Then you can attempt to generate numbers and get the results:
with n as (
select row_number() over () as n
from t
limit 10
)
select t.*, split_part(record, '##', n.n)
from t join
n
on split_part(record, '##', n.n) <> ''
So I have the following table:
Id Name Label
---------------------------------------
1 FirstTicket bike|motorbike
2 SecondTicket bike
3 ThirdTicket e-bike|motorbike
4 FourthTicket car|truck
I want to use the string_split function to identify rows that have both the bike and motorbike labels.
So the desired output in my example will be just the first row:
Id Name Label
--------------------------------------
1 FirstTicket bike|motorbike
Currently, I am using the following query, but it is returning rows 1, 2 and 3. I only want the first. Is it possible?
SELECT Id, Name, Label FROM tickets
WHERE EXISTS (
SELECT * FROM STRING_SPLIT(Label, '|')
WHERE value IN ('bike', 'motorbike')
)
You can use APPLY and do aggregation:
SELECT t.Id, t.Name, t.Label
FROM tickets t CROSS APPLY
     STRING_SPLIT(t.Label, '|') t1
WHERE t1.value IN ('bike', 'motorbike')
GROUP BY t.Id, t.Name, t.Label
HAVING COUNT(DISTINCT t1.value) = 2;
However, this breaks normalization rules; you should have a separate table for ticket labels.
You could just use string functions for this:
select t.*
from mytable t
where
'|' + label + '|' like '%|bike|%'
and '|' + label + '|' like '%|motorbike|%'
I would expect this to be more efficient than other methods that split and aggregate.
Please note, however, that you should really consider fixing your data model. Instead of storing delimited lists, you should have a separate table to represent the relation between tickets and labels, with one row per ticket/label tuple. Storing delimited lists in a database column is a well-known SQL antipattern that should be avoided at all costs (hard to maintain, hard to query, hard to enforce data integrity, inefficient, ...). You can have a look at this famous SO post for more on this topic.
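For illustration, a minimal sketch of such a model (the table and column names are made up) and of the query the original question then reduces to:

-- Hypothetical junction table: one row per ticket/label pair.
CREATE TABLE ticket_labels (
    ticket_id INT NOT NULL,        -- references tickets(Id)
    label     VARCHAR(50) NOT NULL,
    PRIMARY KEY (ticket_id, label)
);

-- "Tickets having both labels" becomes a plain relational-division query.
SELECT ticket_id
FROM ticket_labels
WHERE label IN ('bike', 'motorbike')
GROUP BY ticket_id
HAVING COUNT(*) = 2;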
Yogesh beat me to it; my solution is similar but with a HUGE performance improvement worth pointing out. We'll start with this sample data:
SET NOCOUNT ON;
IF OBJECT_ID('tempdb..#tickets','U') IS NOT NULL DROP TABLE #tickets;
CREATE TABLE #tickets (Id INT, [Name] VARCHAR(50), Label VARCHAR(1000));
INSERT #tickets (Id, [Name], Label)
VALUES
(1,'FirstTicket' , 'bike|motorbike'),
(2,'SecondTicket', 'bike'),
(3,'ThirdTicket' , 'e-bike|motorbike'),
(4,'FourthTicket', 'car|truck'),
(5,'FifthTicket', 'motorbike|bike');
Now the original and much improved version:
-- Original
SELECT t.id, t.[Name], t.Label
FROM #tickets AS t
CROSS APPLY STRING_SPLIT(t.Label, '|') t1
WHERE t1.[value] IN ('bike', 'motorbike')
GROUP BY t.id, t.[Name], t.Label
HAVING COUNT(DISTINCT t1.[value]) = 2;
-- Improved Version Leveraging APPLY to avoid a sort
SELECT t.Id, t.[Name], t.Label
FROM #tickets AS t
CROSS APPLY
(
SELECT 1
FROM STRING_SPLIT(t.Label,'|') AS split
WHERE split.[value] IN ('bike','motorbike')
HAVING COUNT(*) = 2
) AS isMatch(TF);
Now the execution plans:
If you compare the costs, the "sortless" version is 4.36 times faster than the original. In reality it's more because, with the first version, we're not just sorting, we are sorting three columns - an int and two (n)varchars. Because sorting costs are N * LOG(N), the original query slows down faster than linearly as you throw more rows at it.
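To put rough numbers on that: sorting N = 1,000 rows takes on the order of 1,000 * log2(1,000) ~ 10,000 comparisons, while N = 1,000,000 rows takes about 1,000,000 * 20 = 20,000,000 - so a 1,000x increase in rows costs roughly 2,000x the work, and each comparison is more expensive when the sort key is an int plus two (n)varchars.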
I have the below data
NUMBER SEQUENCE_NUMBER
CA00000045 AAD508
CA00000045 AAD508
CA00000046 AAD509
CA00000047 AAD510
CA00000047 AAD510
CA00000047 AAD511
CA00000048 AAD511
and I would like to find out which rows do not match the following rule:
NUMBER will always be the same when the SEQUENCE_NUMBER is the same.
So in the above data 'AAD508' will mean the NUMBER value will be the same on each row where the same value appears in the SEQUENCE_NUMBER.
I want to write a query that will bring me back rows where this rule is
broken. So for example:
CA00000047 AAD511
CA00000048 AAD511
I don't know where to start with this one, so I have no initial SQL, I'm afraid.
Thanks
You want to self-join the data to compare each row to all others sharing the same sequence number, and then filter with a WHERE clause to keep only the rows with non-matching numbers. You did not give a name for the table, so I added it as "table_name" below.
SELECT
a.NUMBER,
a.SEQUENCE_NUMBER
FROM table_name a
INNER JOIN table_name b
ON a.SEQUENCE_NUMBER = b.SEQUENCE_NUMBER
WHERE a.NUMBER <> b.NUMBER
GROUP BY 1,2
Threw in the GROUP BY to act as a DISTINCT.
I would simply use exists:
select t.*
from t
where exists (select 1
from t t2
where t2.sequence_number = t.sequence_number and
t2.number <> t.number
);
If each sequence_number only had up to two rows, you could get each rule-breaker on one row:
select sequence_number, min(number), max(number)
from t
group by sequence_number
having min(number) <> max(number);
Or, you could generalize this to get the list of numbers on a single row:
select sequence_number, listagg(number, ',') within group (order by number) as numbers
from t
group by sequence_number
having min(number) <> max(number);
Yesterday in a job interview session I was asked this question and I had no clue about it. Suppose I have the word "Manhattan"; I want to display only the letters 'M', 'A', 'N', 'H', 'T' in SQL. How to do it?
Any help is appreciated.
Well, here is my solution (sqlfiddle) - it aims to use "relational" SQL operations, which may have been what the interviewer was going for conceptually.
Most of the work is simply turning the string into a set of (pos, letter) records; the final query is then a mere SELECT with grouping and ordering applied.
select letter
from (
-- All of this just to get a set of (pos, letter)
select ns.n as pos, substring(ss.s, ns.n, 1) as letter
from (select 'MANHATTAN' as s) as ss
cross join (
-- Or use another form to create a "numbers table"
select n from (values (1),(2),(3),(4),(5),(6),(7),(8),(9)) as X(n)
) as ns
) as pairs
group by letter -- guarantees distinctness
order by min(pos) -- ensure output is ordered MANHT
The above query works in SQL Server 2008, but the "Numbers Table" may have to be altered for other vendors. Otherwise, nothing vendor-specific is used - no CTEs, no cross application of a function, no procedural language code.
That being said, the above is to show a conceptual approach - SQL is designed for use with sets and relations and multiplicity across records; the above example is, in some sense, merely a perversion of such.
Examining the intermediate relation,
select ns.n as pos, substring(ss.s, ns.n, 1) as letter
from (select 'MANHATTAN' as s) as ss
cross join (
select n from (values (1),(2),(3),(4),(5),(6),(7),(8),(9)) as X(n)
) as ns
uses a cross join to generate the Cartesian product of the string (1 row) with the numbers (9 rows); the substring function is then applied with the string and each number to obtain each character in accordance with its position. The resulting set contains the records-
POS LETTER
1 M
2 A
3 N
..
9 N
Then the outer select groups each record according to the letter, and the resulting records are ordered by the minimum (first) occurrence position of the letter that established the grouping. (Without the ORDER BY, the letters would still be distinct, but the final order would not be guaranteed.)
One way (if using SQL Server) is with a recursive CTE (Common Table Expression).
DECLARE #source nvarchar(100) = 'MANHATTAN'
;
WITH cte AS (
SELECT SUBSTRING(#source, 1, 1) AS c1, 1 as Pos
WHERE LEN(#source) > 0
UNION ALL
SELECT SUBSTRING(#source, Pos + 1, 1) AS c1, Pos + 1 as Pos
FROM cte
WHERE Pos < LEN(#source)
)
SELECT DISTINCT c1 from cte
SqlFiddle for this is here. I had to inline the #source for SqlFiddle, but the code above works fine in Sql Server.
The first SELECT generates the initial row (in this case 'M', 1). The second SELECT is the recursive part that generates the subsequent rows, with the Pos column getting incremented each time until the termination condition WHERE Pos < LEN(@source) is finally met. The final select removes the duplicates. Internally, SELECT DISTINCT sorts the rows in order to facilitate the removal of duplicates, which is why the final output happens to be in alphabetic order. Since you didn't specify order as a requirement, I left it as-is. But you could modify it to use a GROUP BY instead, ordered on MIN(Pos), if you needed the output in the characters' original order.
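That variation would look something like this (a sketch of the suggested modification, replacing the final SELECT DISTINCT):

-- Group to deduplicate, order by first occurrence to keep the 'MANHT' order.
SELECT c1
FROM cte
GROUP BY c1
ORDER BY MIN(Pos)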
This same technique can be used for things like generating all the Bigrams for a string, with just a small change to the general structure above.
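For instance, a bigram variant might look like this (a sketch; it just widens the SUBSTRING to two characters and stops one position earlier):

DECLARE @source nvarchar(100) = 'MANHATTAN'
;
WITH cte AS (
    -- anchor: the first two-character slice
    SELECT SUBSTRING(@source, 1, 2) AS bigram, 1 AS Pos
    WHERE LEN(@source) > 1
    UNION ALL
    -- recursive part: slide the two-character window one position right
    SELECT SUBSTRING(@source, Pos + 1, 2), Pos + 1
    FROM cte
    WHERE Pos < LEN(@source) - 1
)
SELECT DISTINCT bigram FROM cte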
declare @charr varchar(99)
declare @lp int

set @charr = 'Manhattan'
set @lp = 1

DECLARE @T1 TABLE (
    FLD VARCHAR(max)
)

-- walk the string one character at a time
while (@lp <= LEN(@charr))
begin
    -- insert the current character only if it has not been seen before
    if (not exists (select * from @T1 where FLD = SUBSTRING(@charr, @lp, 1)))
    begin
        insert into @T1
        select SUBSTRING(@charr, @lp, 1)
    end
    set @lp = @lp + 1
end

select * from @T1
Check this, it may help you.
Here's an Oracle version of user2864740's answer. The only difference is how you construct the "numbers table" (plus slight differences in aliasing).
select letter
from (
select ns.n as pos, substr(ss.s, ns.n, 1) as letter
from (select 'MANHATTAN' as s from dual) ss
cross join (
SELECT LEVEL as n
FROM DUAL
CONNECT BY LEVEL <= 9
ORDER BY LEVEL) ns
) pairs
group by letter
order by min(pos)