PostgreSQL Substring pattern with spaces - sql

I've been struggling with this query, trying solutions found in this forum, but I'm stuck and need help.
I have a column that stores the names a ship carried throughout its life, and I want to split them into three columns.
There are three main cases:
a) Only one name
select t2.esp1,t2.espectro,t2.espectro1, t2.id from(
select substring(t.espectro, t.posfin)::varchar as esp1, t.espectro,t.espectro1,t.id from(
select "Id" as id, strpos(shipname, ', ') as posinic, strpos(shipname, ' y ') as posfin,shipname as espectro, shipname1 as espectro1 from ships) t)t2 (esp1, espectro, espectro1, id)
where t2.esp1 not like '% y %';
b) two names
select t2.esp1,t2.espectro,t2.espectro1, t2.id from(
select substring(t.espectro,1, t.posfin)::varchar as esp1, t.espectro,t.espectro1,t.id from(
select "Id" as id, strpos(shipname, ', ') as posinic, strpos(shipname, ' y ') as posfin,shipname as espectro, shipname1 as espectro1 from ships) t)t2 (esp1, espectro, espectro1, id)
where t2.esp1 not like '%, %';
and for the second name:
select t2.esp2,t2.espectro,t2.espectro2, t2.id from(
select substring(t.espectro, t.posfin)::varchar as esp2, t.espectro,t.espectro2,t.id from(
select "Id" as id, strpos(shipname, ', ') as posinic, strpos(shipname, ' y ') as posfin,shipname as espectro, shipname2 as espectro2 from ships) t)t2 (esp2, espectro, espectro2, id)
where t2.esp2 like '% y %' and t2.espectro not like '%, %';
and c) three names: I could get the first
select substring(t.espectro,1,t.posicion) from(
select strpos(shipname, ',') as posicion,shipname as espectro from ships) t;
and the third
select t2.esp3,t2.espectro,t2.espectro3, t2.id from(
select substring(t.espectro, t.posfin)::varchar as esp3, t.espectro,t.espectro3,t.id from(
select "Id" as id, strpos(shipname, ', ') as posinic, strpos(shipname, ' y ') as posfin,shipname as espectro, shipname3 as espectro3 from ships) t)t2 (esp3, espectro, espectro3, id)
where t2.esp3 like '% y %' and t2.espectro like '%, %';
but not the second.
Records with three names look like this:
Nuestra Señora del Rosario, Santo Domingo y San José
I have tried this option:
select substring(t.shipsnames from '%#",_y#"%' for '#') as name2 from ships t
with several variations of the #"pattern#" to match the white space and extract the second name.
Then I tried this option:
select t2.name2[6:7] from (select regexp_split_to_array(t.shipnames, E'\\s+') as name2 from ships t) t2
But it doesn't work because the records don't all have the same number of words, so some come out right, like {"Santo","Domingo"}, while others don't, like {"Rosario",","}.
I am not familiar with regex syntax; I found this example in the PostgreSQL documentation. Any hints?

If the names are separated either by a comma with optional surrounding whitespace or by a y with mandatory whitespace on both sides, the following regular expression will split them:
\s*,\s*|\s+y\s+
\s matches a whitespace character, + means one or more, * means zero or more, and | means alternation.
Example SQL utilizing this regular expression:
SELECT Id, ShipNamesArray[1] ShipName1, ShipNamesArray[2] ShipName2, ShipNamesArray[3] ShipName3
FROM (
SELECT Id, regexp_split_to_array(Shipnames, '\s*,\s*|\s+y\s+') ShipNamesArray
FROM (VALUES
(1, 'Nuestra Señora del Rosario, Santo Domingo y San José'),
(2, 'Nuestra Señora del Rosario y Santo Domingo'),
(3, 'Nuestra Señora del Rosario')
) AS ExampleShipNames (Id, ShipNames)
) AS SplitShipNames
The SQL will produce this output:
Id | ShipName1 | ShipName2 | ShipName3
-- | -------------------------- | ------------- | ---------
1 | Nuestra Señora del Rosario | Santo Domingo | San José
2 | Nuestra Señora del Rosario | Santo Domingo |
3 | Nuestra Señora del Rosario | |
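To sanity-check the pattern outside the database, here is a minimal sketch in Python using the same regular expression. The function name split_ship_names and the padding to three columns are assumptions made for illustration; they are not part of the SQL above.

```python
import re

# Same pattern as in the answer: a comma with optional whitespace around it,
# or a "y" with mandatory whitespace on both sides.
SEPARATOR = re.compile(r"\s*,\s*|\s+y\s+")

def split_ship_names(shipnames, width=3):
    """Split a combined name and pad the result to `width` entries with None."""
    parts = SEPARATOR.split(shipnames)
    return parts + [None] * (width - len(parts))

rows = [
    "Nuestra Señora del Rosario, Santo Domingo y San José",
    "Nuestra Señora del Rosario y Santo Domingo",
    "Nuestra Señora del Rosario",
]
for row in rows:
    print(split_ship_names(row))
```

Because the y alternative requires whitespace on both sides, a y inside a word is never treated as a separator.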

Related

compare 2 text columns and show difference in the third cell using sql

I am trying to compare 2 columns and return only the difference between them. For example:
select * from table1
Column_1         column_2
---------------- ------------------------------
Swetha working   Swetha is working in Chennai
Raju 10th        Raju is studying 10th std
ranjith          Ranjith played yesterday
how to play      how to play Cricket
My name is       my name is john
Output:
Words from Column_1 should be removed even when other words come in between them, as in rows 1 and 2.
Column_1         column_2                       column_3
---------------- ------------------------------ ------------------
Swetha working   Swetha is working in Chennai   is in Chennai
Raju 10th        Raju is studying 10th std      is studying std
ranjith          Ranjith played yesterday       played yesterday
how to play      how to play Cricket            Cricket
My name is       my name is john                john
This is much more complicated than your previous question. You can break the first column into words and then substitute them individually in the second column. To do that, though, you need a recursive CTE:
with words as (
select t.*, s.*,
max(s.seqnum) over (partition by t.id) as max_seqnum
from t cross apply
(select s.value as word,
row_number() over (order by (select null)) as seqnum
from string_split(col1, ' ') s
) s
),
cte as (
select id, col1, col2,
replace(' ' + col2 + ' ', ' ' + word + ' ', ' ') as result,
word, seqnum, max_seqnum
from words
where seqnum = 1
union all
select cte.id, cte.col1, cte.col2,
replace(cte.result, ' ' + w.word + ' ', ' '),
w.word, w.seqnum, cte.max_seqnum
from cte join
words w
on w.id = cte.id and w.seqnum = cte.seqnum + 1
)
select id, col1, col2, ltrim(rtrim(result)) as result
from cte
where max_seqnum = seqnum
order by id;
Here is a db<>fiddle.
I added an id so each row is uniquely defined. If your version of SQL Server doesn't have the built-in string_split() function, you can easily find a version that does the same thing.
One trick that this uses is for handling the first and last words in the second column. The code adds spaces at the beginning and end. That way, all words in the string are surrounded by spaces, making it easier to replace only complete words.
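The space-padding trick is easy to try outside the database. The following is only a sketch in Python of the same idea, not a translation of the recursive CTE; the function name remove_words is made up for illustration.

```python
def remove_words(col1, col2):
    """Remove each whole word of col1 from col2 using the padding trick."""
    # Pad both ends so that every word, including the first and last,
    # is surrounded by spaces.
    result = " " + col2 + " "
    for word in col1.split():
        # Replacing " word " with " " removes only complete words.
        result = result.replace(" " + word + " ", " ")
    return result.strip()

print(remove_words("Swetha working", "Swetha is working in Chennai"))
print(remove_words("how to play", "how to play Cricket"))
```

Note that Python's str.replace is case-sensitive, whereas REPLACE in SQL Server follows the collation of its input, so rows such as 'ranjith' / 'Ranjith played yesterday' behave differently here unless both strings are lowercased first.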
SQL 2016 definitely has string split. This approach appends an extra space to either side of the split word from Column 2.
Data
drop table if exists #strings;
go
create table #strings(
Id int,
Column_1 varchar(200),
Column_2 varchar(200));
go
insert #strings(Id, Column_1, Column_2) values
(1, 'Swetha', 'Swetha is working in Chennai'),
(2, 'Raju', 'Raju is studying 10 std'),
(3, 'Swetha working', 'Swetha is working in Chennai'),
(4, 'Raju 10th', 'Raju is studying 10th std');
Query
declare
@add_delim char(1)=' ';
;with
c1_cte(split_str) as (
select ltrim(rtrim(s.[value]))
from
#strings st
cross apply
string_split(st.Column_1, ' ') s),
c2_cte(Id, ndx, split_str) as (
select Id, charindex(@add_delim + s.[value] + @add_delim, @add_delim + st.Column_2 + @add_delim), s.[value]
from
#strings st
cross apply
string_split(st.Column_2, ' ') s
where
st.Column_2 not like '%  %')
select
Id, stuff((select ' ' + c.split_str
from c2_cte c
where c.Id = c2.Id and not exists(select 1
from c1_cte c1
where c.split_str=c1.split_str)
order by c.ndx FOR XML PATH('')), 1, 1, '') [new_str]
from c2_cte c2
group by Id;
Results
Id new_str
1 is in Chennai
2 is studying 10 std
3 is in Chennai
4 is studying std
Here is the solution using STRING_SPLIT and STRING_AGG
DBFIDDLE working link
;WITH split_words
AS (
SELECT *
FROM dbo.Strings
CROSS APPLY (
SELECT VALUE
FROM STRING_SPLIT(column_2, ' ')
WHERE VALUE NOT IN (
SELECT VALUE
FROM STRING_SPLIT(column_1, ' ')
)
) a
)
SELECT *
,(
SELECT sw.VALUE + ' ' [text()]
FROM split_words sw
WHERE sw.Column_1 = s.Column_1
AND sw.Column_2 = s.Column_2
FOR XML PATH('')
,TYPE
).value('.', 'NVARCHAR(MAX)') [difference]
FROM dbo.Strings s
For SQL Server 2017+, where STRING_AGG is supported:
SELECT b.Column_1
,b.Column_2
,STRING_AGG(b.VALUE, ' ')
FROM (
SELECT *
FROM dbo.Strings
CROSS APPLY (
SELECT VALUE
FROM STRING_SPLIT(column_2, ' ')
WHERE VALUE NOT IN (
SELECT VALUE
FROM STRING_SPLIT(column_1, ' ')
)
) a
) b
GROUP BY b.Column_1
,b.Column_2
Results:
WITH
-- your input
input(column_1,column_2,column_3) AS (
SELECT 'Swetha working','Swetha is working in Chennai','is in Chennai'
UNION ALL SELECT 'Raju 10th','Raju is studying 10th std','is studying std'
UNION ALL SELECT 'ranjith','Rantith played yesterday','played yesterday'
UNION ALL SELECT 'how to play','how to play Cricket','Cricket'
UNION ALL SELECT 'My name is','my name is john','john'
)
,
-- need a series of integers
-- you can also try to play with the STRING_SPLIT() function
i(i) AS (
SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
UNION ALL SELECT 5
)
,
-- you can also try to play with the STRING_SPLIT() function
unfound_tokens AS (
SELECT
i
, column_1
, column_2
, TOKEN(column_2,' ',i) AS token
FROM input CROSS JOIN i
WHERE TOKEN(column_2,' ',i) <> ''
AND CHARINDEX(
UPPER(TOKEN(column_2,' ',i))
, UPPER(column_1)
) = 0
)
SELECT
column_1
, column_2
, STRING_AGG(token ,' ') AS column_3
FROM unfound_tokens
GROUP BY
column_1
, column_2
-- out column_1 | column_2 | column_3
-- out ----------------+------------------------------+--------------------------
-- out My name is | my name is john | john
-- out Swetha working | Swetha is working in Chennai | is Chennai
-- out how to play | how to play Cricket | Cricket
-- out Raju 10th | Raju is studying 10th std | is studying std
-- out ranjith | Rantith played yesterday | Rantith played yesterday
I am not sure that queries using STRING_AGG or STRING_SPLIT will preserve the order of the words.
Look at this query, which gives a different ordering:
WITH
SS1 AS
(SELECT Id, SS.value AS COL1
FROM #strings
CROSS APPLY STRING_SPLIT(Column_1, ' ') AS SS
),
SS2 AS
(SELECT Id, SS.value AS COL2
FROM #strings
CROSS APPLY STRING_SPLIT(Column_2, ' ') AS SS
),
DIF AS
(
SELECT Id, COL2 AS COL
FROM SS2
EXCEPT
SELECT Id, COL1
FROM SS1
)
SELECT DIF.Id, Column_1, Column_2, STRING_AGG(COL, ' ')
FROM DIF
JOIN #strings AS S ON S.Id = DIF.Id
GROUP BY DIF.Id, Column_1, Column_2;
You should test with a very large amount of data to see whether the queries given here have side effects such as inconsistent ordering (I am pretty sure that no consistent order is guaranteed, because of parallelism).
So the only way to preserve a consistent ordering is to write a recursive query that adds the index of each word within the sentence.
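The idea of carrying an explicit word index can be illustrated with a short Python sketch. It is not the recursive query itself; the function name ordered_difference and the case-insensitive comparison are assumptions made for this example.

```python
def ordered_difference(col1, col2):
    """Words of col2 that are absent from col1, in their original order."""
    known = {w.lower() for w in col1.split()}
    # Pair every word with its position so the order can be restored
    # explicitly instead of relying on whatever order the rows come back in.
    indexed = [(i, w) for i, w in enumerate(col2.split(), start=1)]
    kept = [(i, w) for i, w in indexed if w.lower() not in known]
    return " ".join(w for _, w in sorted(kept))

print(ordered_difference("Swetha working", "Swetha is working in Chennai"))
print(ordered_difference("My name is", "my name is john"))
```

The final sort on the stored index is what makes the result independent of the order in which the words were processed.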

Count frequencies of words separated with multiple spaces

I would like to count the occurrences of all words in a column. The tricky part is that words in a row can appear in long stretches; meaning there are many spaces in-between.
This is a dummy example:
column_name
aaa bbb ccc ddd
[aaa]
bbb
bbb
So far I managed to use the following code
SELECT column_name,
SUM(LEN(column_name) - LEN(REPLACE(column_name, ' ', ''))+1) as counts
FROM
dbo.my_own
GROUP BY
column_name
The code gives me something like this:
column_name counts
aaa bbb ccc ddd 1
[aaa] 1
bbb 2
However, my desired output is:
column_name counts
aaa 1
[aaa] 1
bbb 3
ccc 1
ddd 1
In SQL Server, you would use string_split():
select s.value as word, count(*)
from dbo.my_own o cross apply
string_split(o.column_name, ' ') s
where s.value <> ''
group by s.value;
String manipulation is highly database-dependent. Most databases have some method for doing this, but they can be quite different.
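As a cross-check of the split, filter and count logic, here is a small Python sketch. It assumes the sample rows contain runs of several spaces, as described in the question; the function name word_counts is made up.

```python
from collections import Counter

def word_counts(rows):
    """Count every word across all rows, ignoring runs of spaces."""
    counts = Counter()
    for row in rows:
        # Splitting on a single space yields "" for each extra space,
        # just as string_split() does, so empty strings are filtered out.
        counts.update(w for w in row.split(" ") if w != "")
    return counts

rows = ["aaa   bbb      ccc ddd", "[aaa]", "bbb", "bbb"]
for word, n in sorted(word_counts(rows).items()):
    print(word, n)
```

The filter on empty strings plays the same role as where s.value <> '' in the query.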
First, take a look at this question to see how to split the words in your column into multiple rows. In that question the words are separated by comma, but, of course, it works the same with spaces.
For your case, assuming a table tablename with an id and your words in columnname, where you have at most 4 words in the column, it would look like this:
SELECT
tablename.id,
SUBSTRING_INDEX(SUBSTRING_INDEX(tablename.columnname, ' ', numbers.n), ' ', -1) columnname
FROM
(SELECT 1 AS n UNION ALL
SELECT 2 UNION ALL
SELECT 3 UNION ALL
SELECT 4) numbers
INNER JOIN tablename
ON LENGTH(tablename.columnname) - LENGTH(REPLACE(tablename.columnname, ' ', '')) >= numbers.n - 1
ORDER BY
id, n
Then, you can simply count the words:
SELECT columnname, count(*) FROM (
SELECT
tablename.id,
SUBSTRING_INDEX(SUBSTRING_INDEX(tablename.columnname, ' ', numbers.n), ' ', -1) columnname
FROM
(SELECT 1 AS n UNION ALL
SELECT 2 UNION ALL
SELECT 3 UNION ALL
SELECT 4) numbers
INNER JOIN tablename
ON LENGTH(tablename.columnname) - LENGTH(REPLACE(tablename.columnname, ' ', '')) >= numbers.n - 1
ORDER BY
id, n
) normalized
GROUP BY columnname
If you have more than 4 words in your column, you need to expand the select from numbers accordingly.
Edit: Oh, I am late, and I assumed MySQL.

How to search whole words in a string that has delimiter ";"?

I have a column that has values like this 'Blood work;MRI;ICC', which can be a string with some words separated by ';'.
Using a LIKE clause, how can I write a query that returns this row when searching for 'Blood work', 'mri' or 'icc', but not for 'blood', 'mr' or 'ic'?
To search for a field in a CSV list, one method is:
where ';' + mycol + ';' like '%;mri;%'
Demo on DB Fiddle:
with
csv as (select 'Blood work;MRI;ICC' v),
match as (select 'mri' m union all select 'Blood work' union all select 'Blood')
select csv.v, match.m,
case when ';' + csv.v + ';' like '%;' + match.m + ';%'
then 'match'
else 'no match'
end matched
from csv
cross join match
v | m | matched
:----------------- | :--------- | :-------
Blood work;MRI;ICC | mri | match
Blood work;MRI;ICC | Blood work | match
Blood work;MRI;ICC | Blood | no match
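The delimiter-wrapping idea can be tried in a few lines of Python. This is only a sketch: the function name csv_contains is made up, and lowercasing both sides stands in for a case-insensitive collation, which is an assumption about the database.

```python
def csv_contains(value, term, sep=";"):
    """True if `term` is a complete item of the `sep`-delimited `value`."""
    # Wrap both strings in delimiters so only whole items can match.
    wrapped = (sep + value + sep).lower()
    return (sep + term + sep).lower() in wrapped

for term in ["mri", "Blood work", "Blood", "ic"]:
    print(term, csv_contains("Blood work;MRI;ICC", term))
```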
I would personally use a string splitter:
SELECT {Columns}
FROM dbo.YourTable YT
CROSS APPLY STRING_SPLIT (YT.YourColumn,';') SS
WHERE SS.[Value] = 'mri';
If you're not using SQL Server 2016+, then you can use a custom splitter, like DelimitedSplit8K_LEAD.

Remove additional comma without knowing the length of the string

My tables
MyTable
+----+-------+---------------+
| Id | Title | DependencyIds |
+----+-------+---------------+
DependencyIds contains values like 14;77;120.
MyDependentTable
+--------------+------+
| DependencyId | Name |
+--------------+------+
Background
I have to select data from MyTable with every dependency from MyDependentTable separated with a comma.
Expected output:
+---------+-------------------------------------+
| Title | Dependencies |
+---------+-------------------------------------+
| Test | ABC, One-two-three, Some Dependency |
+---------+-------------------------------------+
| Example | ABC |
+---------+-------------------------------------+
My query
SELECT t.Title,
(SELECT ISNULL((
SELECT DISTINCT
(
SELECT dt.Name + '',
CASE WHEN DependencyIds LIKE '%;%' THEN ', ' ELSE '' END AS [text()]
FROM MyDependentTable dt
WHERE dt.DependencyId IN (SELECT Value FROM dbo.fSplitIds(t.DependencyIds, ';'))
ORDER BY dt.DependencyId
FOR XML PATH('')
)), '')) Dependencies
FROM dbo.MyTable t
Problem description
The query works, but adds an additional comma when there are multiple dependencies:
+---------+---------------------------------------+
| Title | Dependencies |
+---------+---------------------------------------+
| Test | ABC, One-two-three, Some Dependency, |
+---------+---------------------------------------+
| Example | ABC |
+---------+---------------------------------------+
I can't use SUBSTRING(ISNULL(... because I can't access the length of the string and therefore I'm not able to set the length of the SUBSTRING.
Is there any possibility to get rid of that unnecessary additional comma?
Normally, for group concatenation in SQL Server, people add a leading comma and remove it with the STUFF function, but even that looks ugly.
The OUTER APPLY method is a neater way to do this than a correlated sub-query. With this method we don't have to wrap the inner SELECT query in ISNULL or STUFF:
SELECT DISTINCT t.title,
Isnull(LEFT(dependencies, Len(dependencies) - 1), '')
Dependencies
FROM dbo.mytable t
OUTER apply (SELECT dt.NAME + ','
FROM mydependenttable dt
WHERE dt.dependencyid IN (SELECT value
FROM
dbo.Fsplitids(t.dependencyids,';'))
ORDER BY dt.dependencyid
FOR xml path('')) ou (dependencies)
Here is the method using STUFF.
SELECT t.Title
,STUFF((SELECT ', ' + CAST(dt.Name AS VARCHAR(10)) [text()]
FROM MyDependentTable dt
WHERE dt.DependencyId IN (SELECT Value FROM dbo.fSplitIds(t.DependencyIds, ';'))
ORDER BY dt.DependencyId
FOR XML PATH(''), TYPE).value('.','NVARCHAR(MAX)'),1,2,'') Dependencies
FROM dbo.MyTable t
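The reason the leading-separator approach avoids the length problem is that the separator to remove always sits at a known position (the start), so the total length is never needed. A minimal Python sketch of the same idea, with a made-up function name:

```python
def join_with_leading_separator(names, sep=", "):
    """Prefix every item with the separator, then drop the first separator."""
    built = "".join(sep + name for name in names)
    # Equivalent in spirit to STUFF(built, 1, len(sep), ''): the characters
    # to delete are at the start, so the overall length is irrelevant.
    return built[len(sep):]

print(join_with_leading_separator(["ABC", "One-two-three", "Some Dependency"]))
print(join_with_leading_separator(["ABC"]))
```

An empty input simply yields an empty string, since slicing an empty string is harmless.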

SQL Coldfusion - eliminate the word THE when searching for duplicate names

I have a process that creates a list of possible duplicate companies. The problem is that "The ABC Company, Inc." and "ABC Company, Inc." both in Dallas, TX are probably duplicates but I won't find them with my criteria. I've eliminated the first 4 characters if they are "the " but I also need to check for the right 5 characters if they are " Inc.".
I have a view that creates a column thename. The prefix "the " has been stripped;
SELECT CASE WHEN LEFT(name, 4) = 'The ' THEN RIGHT(name, (len(name) - 4)) ELSE name END AS thename, CASE WHEN CHARINDEX(' ', ltrim(rtrim(Name)))
= 0 THEN ltrim(Name) WHEN CHARINDEX(' ', ltrim(Name)) = 1 THEN ltrim('b') ELSE SUBSTRING(ltrim(Name) + ' x', 1, CHARINDEX(' ', ltrim(Name))) END AS subname,
CHARINDEX(' ', LTRIM(Name)) AS wordcheck, Name, Address_Line_1, City AS Company_City, State AS Company_State, Zip, Area_Code, Phone, Status_Flag, ID,
Not_Dupe_Flag, DUNS, Temp_Check_Dupes_Flag, Parent_Company_Number, Special_Display,
CASE WHEN c.parent_company_number = 0 THEN c.id ELSE c.parent_company_number END AS parent
FROM dbo.Companies AS c
Then I use that view in my query to look for duplicates;
<cfquery name="qResults" datasource="#request.dsnlive#" timeout="200">
SELECT b.ID,
Thename,
substring(TheName,1,(CHARINDEX(' ',TheNAME,1))) as subName,
name,
b.address_line_1,
b.zip,
b.company_state,
b.company_city,
b.area_code,
b.phone,
b.Special_Display,
isnull(not_dupe_flag,'False') as not_dupe_flag,
isnull(Temp_Check_Dupes_Flag,'False') as Temp_Check_Dupes_Flag,
b.id as bID,
b.duns
FROM dbo.vw_Comp_Details_withFirstWord as b
WHERE isnull(b.status_flag,'') != 'D'
and b.ID <> #arguments.CompNum#
and isnull(b.Temp_Check_Dupes_Flag,'False') = 'False'
<cfif arguments.IncludeDunsOnly eq 0>
<cfif arguments.FirstWord>
AND b.subName = '#arguments.CompanySubName#'
<cfelse>
AND (substring(dbo.KeepAlphaNumCharacters(Thename),1,#val(arguments.WordLength)#) = substring('#arguments.CompanyName#',1,#val(arguments.WordLength)#)
or difference(soundex(Thename),soundex('#arguments.CompanyName#')) > 2)
</cfif>
AND (
( company_city = '#arguments.City#'
AND Isnull(company_city, '') > '' )
AND ( b.parent != #val(arguments.Parent)#
AND Isnull(b.parent, '0') > 0 )
)
<cfif arguments.IncludeDuns>
AND (
( REPLACE(LTRIM(REPLACE(b.duns, '0', ' ')), ' ', '0') = '#val(arguments.Duns)#'
AND REPLACE(LTRIM(REPLACE(b.duns, '0', ' ')), ' ', '0') > ' '
AND #val(arguments.Duns)# > 0 )
or REPLACE(LTRIM(REPLACE(b.duns, '0', ' ')), ' ', '0') = ' '
)
</cfif>
<cfelse>
and (REPLACE(LTRIM(REPLACE(b.duns, '0', ' ')), ' ', '0') = '#val(arguments.Duns)#')
</cfif>
</cfquery>
Now I need to add code to strip the suffix " Inc." but I can't seem to come up with the logic to end up with a column that contains the name without the prefix "The " and the suffix " Inc."
I would like to share my own question from a few days ago. It was written for Postgres, but I'm sure you can find an equivalent way to split a string into rows in your RDBMS.
What you do is split the string into tokens and remove the offending ones, such as The or Inc.
SQL Fiddle Demo
| ID | token |
|----|---------|
| 1 | The |
| 1 | ABC |
| 1 | Company |
| 1 | Inc. |
| 2 | ABC |
| 2 | Company |
| 2 | Inc. |
| 3 | ABC |
| 3 | Company |
Then you go the other way and join the remaining tokens back together: Postgres uses string_agg(), SQL Server uses FOR XML PATH, etc.
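The split, filter and rejoin steps can be sketched in Python as follows. The noise-word list and the function name normalize_company are assumptions for illustration; stripping a trailing comma is an extra step added so that names written like "ABC Company, Inc." compare equal.

```python
NOISE_WORDS = {"the", "inc."}  # hypothetical list; extend as needed

def normalize_company(name):
    """Drop noise tokens from a company name and rejoin what is left."""
    tokens = name.split()
    kept = [t for t in tokens if t.lower() not in NOISE_WORDS]
    # Remove a comma left dangling after a dropped suffix such as "Inc.".
    return " ".join(kept).rstrip(",")

print(normalize_company("The ABC Company, Inc."))
print(normalize_company("ABC Company, Inc."))
```

Be aware that this removes the noise words wherever they occur, not only as a prefix or suffix, which may or may not be what you want for duplicate detection.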
There are many possible ways to do this. Consider whether you want a full-text index on the field, which can then search for similar names and eliminate noise words like the. Or you can use an SSIS package to do fuzzy matching (this would also help with abbreviations versus spelling the whole word out). Or you can use Data Quality Services, which is probably your best bet.
https://msdn.microsoft.com/en-us/library/ff877917.aspx