Count frequencies of words separated with multiple spaces - sql

I would like to count the occurrences of all words in a column. The tricky part is that the words in a row can be separated by long stretches of spaces.
This is a dummy example:
column_name
aaa bbb ccc ddd
[aaa]
bbb
bbb
So far I managed to use the following code
SELECT column_name,
SUM(LEN(column_name) - LEN(REPLACE(column_name, ' ', ''))+1) as counts
FROM
dbo.my_own
GROUP BY
column_name
The code gives me something like this:
column_name counts
aaa bbb ccc ddd 1
[aaa] 1
bbb 2
However, my desired output is:
column_name counts
aaa 1
[aaa] 1
bbb 3
ccc 1
ddd 1

In SQL Server, you would use string_split():
select s.value as word, count(*)
from dbo.my_own o cross apply
string_split(o.column_name, ' ') s
where s.value <> ''
group by s.value;
String manipulation is highly database-dependent. Most databases have some method for doing this, but they can be quite different.
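As a rough illustration of what the query above does, here is a minimal Python sketch (not SQL Server code). The function name `word_counts` and the sample rows with repeated spaces are assumptions made for the example. It splits on a single space, drops the empty tokens that runs of spaces produce, and counts what is left.

```python
from collections import Counter

def word_counts(rows):
    """Count words across rows, ignoring empty tokens caused by repeated spaces."""
    counts = Counter()
    for row in rows:
        # Splitting on one space mirrors string_split(col, ' ');
        # consecutive spaces yield empty strings, which are filtered out
        # just like the "where s.value <> ''" condition.
        counts.update(tok for tok in row.split(" ") if tok != "")
    return counts

# Hypothetical sample rows; the first one has several spaces between words.
rows = ["aaa   bbb  ccc ddd", "[aaa]", "bbb", "bbb"]
print(word_counts(rows)["bbb"])  # 3
```

The filter on empty strings is the part that handles the long stretches of spaces; without it the empty token would be counted as a word.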

First, take a look at this question to see how to split the words in your column into multiple rows. In that question the words are separated by comma, but, of course, it works the same with spaces.
For your case, assuming a table tablename with an id and your words in columnname, where you have at most 4 words in the column, it would look like this:
SELECT
tablename.id,
SUBSTRING_INDEX(SUBSTRING_INDEX(tablename.columnname, ' ', numbers.n), ' ', -1) columnname
FROM
(SELECT 1 AS n UNION ALL
SELECT 2 UNION ALL
SELECT 3 UNION ALL
SELECT 4) numbers
INNER JOIN tablename
ON LENGTH(tablename.columnname) - LENGTH(REPLACE(tablename.columnname, ' ', '')) >= numbers.n - 1
ORDER BY
id, n
Then, you can simply count the words:
SELECT columnname, count(*) FROM (
SELECT
tablename.id,
SUBSTRING_INDEX(SUBSTRING_INDEX(tablename.columnname, ' ', numbers.n), ' ', -1) columnname
FROM
(SELECT 1 AS n UNION ALL
SELECT 2 UNION ALL
SELECT 3 UNION ALL
SELECT 4) numbers
INNER JOIN tablename
ON LENGTH(tablename.columnname) - LENGTH(REPLACE(tablename.columnname, ' ', '')) >= numbers.n - 1
ORDER BY
id, n
) normalized
GROUP BY columnname
If you have more than 4 words in your column, you need to expand the select from numbers accordingly.
Edit: Oh, I am late, and I assumed MySQL.
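To see how the nested SUBSTRING_INDEX calls and the join condition interact, here is a small Python sketch that emulates them. The helper names (`substring_index`, `nth_word`, `split_rows`) are made up for this illustration, and the emulation only covers a non-empty delimiter.

```python
def substring_index(s, delim, count):
    """Rough emulation of MySQL SUBSTRING_INDEX for a non-empty delimiter."""
    parts = s.split(delim)
    if count > 0:
        return delim.join(parts[:count])   # everything before the count-th delimiter
    if count < 0:
        return delim.join(parts[count:])   # everything after the count-th delimiter from the right
    return ""

def nth_word(s, n):
    # Keep the text up to the n-th space, then take the last piece of it.
    return substring_index(substring_index(s, " ", n), " ", -1)

def split_rows(rows, max_words=4):
    out = []
    for s in rows:
        spaces = len(s) - len(s.replace(" ", ""))
        for n in range(1, max_words + 1):
            if spaces >= n - 1:            # same test as the join condition
                out.append(nth_word(s, n))
    return out

print(split_rows(["aaa bbb ccc ddd", "bbb"]))
```

In this sketch, two consecutive spaces produce an empty string as one of the "words", so with data like the question's a filter on empty values before the GROUP BY would probably be needed.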

Related

Condition check to get output

I have 2 tables name and match. The name and match table have columns type.
The columns and data in the name table
ID| type |
--| ---- |
1| 1ABC |
2| 2DEF |
3| 3DEF |
4| 4IJK |
The columns and data in match table is
type                    | DATA                |
----------------------- | ------------------- |
NOT %ABC% AND NOT %DEF% | NOT ABC AND NOT DEF |
%DEF%                   | DEF ONLY            |
NOT %DEF% AND NOT %IJK% | NOT DEF AND NOT IJK |
I have tried using case statement. The first 3 characters will be NOT if there is a NOT in the type in match table.
The below query is giving me a missing keyword error. I am not sure what I am missing here
SELECT s.id, s.type, m.data
where case when substr(m.type1,3)='NOT' then s.type not in (REPLACE(REPLACE(m.type,'NOT',''),'AND',','))
ELSE s.type in (m.type) end
from source s, match m;
I need the output to match the type in source column and display the data in match column.
The output should be
ID|type|DATA
1 |1ABC|NOT DEF AND NOT IJK
2 |2DEF|DEF ONLY
3 |3DEF|DEF ONLY
4 |4IJK|NOT ABC AND NOT DEF
The biggest problem with your attempted query seems to be that SQL requires the WHERE clause to come after the FROM clause.
But your query is flawed in other ways as well. Although it can have complicated logic within it, including subqueries, a CASE expression must ultimately return a single value. The conditions inside it are not applied as if they were in the WHERE clause of the main query (which is what you appear to be attempting).
My recommendation would be to not store the match table as you currently are. It seems much preferable to have something that contains each condition you want to evaluate. Assuming that's not possible, I suggest a CTE (or even a view) that breaks it down that way first.
This query (based on Nefreo's answer for breaking strings into multiple rows)...
SELECT
data,
regexp_count(m.type, ' AND ') + 1 num,
CASE WHEN REGEXP_SUBSTR(m.type,'(.*?)( AND |$)',1,levels.column_value) like 'NOT %' THEN 1 ELSE 0 END negate,
replace(replace(REGEXP_SUBSTR(m.type,'(.*?)( AND |$)',1,levels.column_value), 'NOT '), ' AND ') match
FROM match m INNER JOIN
table(cast(multiset(select level from dual connect by level <= regexp_count(m.type, ' AND ') + 1) as sys.OdciNumberList)) levels
ON 1=1
... breaks your match table into something more like:
DATA                | NUM | NEGATE | MATCH |
------------------- | --- | ------ | ----- |
NOT ABC AND NOT DEF | 2   | 1      | %ABC% |
NOT ABC AND NOT DEF | 2   | 1      | %DEF% |
DEF ONLY            | 1   | 0      | %DEF% |
NOT DEF AND NOT IJK | 2   | 1      | %DEF% |
NOT DEF AND NOT IJK | 2   | 1      | %IJK% |
So we now know each specific like condition, whether it should be negated, and the number of conditions that need to be matched for each MATCH row. (For simplicity, I am using match.data as essentially a key for this since it is unique for each row in match and is what we want to return anyway, but if you were actually storing the data this way you'd probably use a sequence of some sort and not repeat the human-readable text.)
That way, your final query can be quite simple:
SELECT name.id, name.type, criteria.data
FROM name INNER JOIN criteria
ON
(criteria.negate = 0 AND name.type LIKE criteria.match)
OR
(criteria.negate = 1 AND name.type NOT LIKE criteria.match)
GROUP BY name.id, name.type, criteria.data
HAVING COUNT(*) = MAX(criteria.num)
ORDER BY name.id
The conditions in the ON do the appropriate LIKE or NOT LIKE (matches one condition from the CRITERIA view/CTE), and the condition in the HAVING makes sure we had the correct number of total matches to return the row (makes sure we matched all the conditions in one row of the MATCH table).
You can see the entire thing...
WITH criteria AS
(
SELECT
data,
regexp_count(m.type, ' AND ') + 1 num,
CASE WHEN REGEXP_SUBSTR(m.type,'(.*?)( AND |$)',1,levels.column_value) like 'NOT %' THEN 1 ELSE 0 END negate,
replace(replace(REGEXP_SUBSTR(m.type,'(.*?)( AND |$)',1,levels.column_value), 'NOT '), ' AND ') match
FROM match m INNER JOIN
table(cast(multiset(select level from dual connect by level <= regexp_count(m.type, ' AND ') + 1) as sys.OdciNumberList)) levels
ON 1=1
)
SELECT name.id, name.type, criteria.data
FROM name INNER JOIN criteria
ON
(criteria.negate = 0 AND name.type LIKE criteria.match)
OR
(criteria.negate = 1 AND name.type NOT LIKE criteria.match)
GROUP BY name.id, name.type, criteria.data
HAVING COUNT(*) = MAX(criteria.num)
ORDER BY name.id
... working in this fiddle.
As a one-off, I don't think this is significantly different than the other answer already provided, but I wanted to do this since I think this is probably more maintainable if the complexity of your conditions changes.
It already handles arbitrary numbers of conditions, mixes of NOT and not-NOT within the same row of MATCH, and allows for the % signs (for the like) to be placed arbitrarily (e.g. startswith%, %endswith, %contains%, start%somewhere%end, exactmatch should all work as expected). If in the future you want to add different types of conditions or handle ORs, I think the general ideas here will apply.
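The matching logic described in this answer (split each rule on AND, note whether each piece is negated, then require every piece to hold) can be sketched in Python as follows. This is only an illustration of the idea, not a translation of the Oracle query: the helper names are invented, and `like` handles `%` only (no `_` wildcard and no escape characters).

```python
import re

def like(value, pattern):
    """Emulate SQL LIKE where % matches any run of characters."""
    regex = ".*".join(re.escape(piece) for piece in pattern.split("%"))
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None

def parse_rule(rule):
    """Split 'NOT %ABC% AND NOT %DEF%' into (negate, pattern) pairs."""
    conditions = []
    for part in rule.split(" AND "):
        negate = part.startswith("NOT ")
        conditions.append((negate, part[4:] if negate else part))
    return conditions

def matches(value, rule):
    # Every condition must hold; this plays the role of
    # HAVING COUNT(*) = MAX(criteria.num) in the query.
    return all(like(value, pat) != negate for negate, pat in parse_rule(rule))

names = [(1, "1ABC"), (2, "2DEF"), (3, "3DEF"), (4, "4IJK")]
rules = [
    ("NOT %ABC% AND NOT %DEF%", "NOT ABC AND NOT DEF"),
    ("%DEF%", "DEF ONLY"),
    ("NOT %DEF% AND NOT %IJK%", "NOT DEF AND NOT IJK"),
]
for id_, type_ in names:
    for rule, data in rules:
        if matches(type_, rule):
            print(id_, type_, data)
```

Keeping the parsing step separate from the matching step is what makes it easy to add other condition types later.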
Not knowing the possible other rules for selecting rows, just with your data from the question, maybe you could use this:
WITH
tbl_name AS
(
Select 1 "ID", '1ABC' "A_TYPE" From Dual Union All
Select 2 "ID", '2DEF' "A_TYPE" From Dual Union All
Select 3 "ID", '3DEF' "A_TYPE" From Dual Union All
Select 4 "ID", '4IJK' "A_TYPE" From Dual
),
tbl_match AS
(
Select 'NOT %ABC% AND NOT %DEF%' "A_TYPE", 'NOT ABC AND NOT DEF' "DATA" From Dual Union All
Select '%DEF%' "A_TYPE", 'DEF ONLY' "DATA" From Dual Union All
Select 'NOT %DEF% AND NOT %IJK%' "A_TYPE", 'NOT DEF AND NOT IJK' "DATA" From Dual
)
Select
n.ID "ID",
n.A_TYPE,
m.DATA
From
tbl_match m
Inner Join
tbl_name n ON (1=1)
Where
(
INSTR(m.A_TYPE, 'NOT %' || SubStr(n.A_TYPE, 2) || '%', 1, 1) = 0
AND
INSTR(m.A_TYPE, 'NOT %' || SubStr(n.A_TYPE, 2) || '%', 1, 2) = 0
AND
Length(m.A_TYPE) > Length(SubStr(n.A_TYPE, 2)) + 2
)
OR
(
Length(m.A_TYPE) = Length(SubStr(n.A_TYPE, 2)) + 2
AND
'%' || SubStr(n.A_TYPE, 2) || '%' = m.A_TYPE
)
Order By n.ID
Result:
ID | A_TYPE | DATA                |
-- | ------ | ------------------- |
1  | 1ABC   | NOT DEF AND NOT IJK |
2  | 2DEF   | DEF ONLY            |
3  | 3DEF   | DEF ONLY            |
4  | 4IJK   | NOT ABC AND NOT DEF |
Any other format of condition should be evaluated separately ...
Regards...
WITH match_cte AS (
SELECT m.data
,m.type
,decode(instr(m.type,'NOT')
,1 -- found at position 1
,0
,1) should_find_str_1
,substr(m.type
,instr(m.type,'%',1,1) + 1
,instr(m.type,'%',1,2) - instr(m.type,'%',1,1) - 1) str_1
,decode(instr(m.type,'NOT',instr(m.type,'%',1,2))
,0 -- no second NOT
,1
,0) should_find_str_2
,substr(m.type
,instr(m.type,'%',1,3) + 1
,instr(m.type,'%',1,4) - instr(m.type,'%',1,3) - 1) str_2
FROM match m
)
SELECT s.id
,s.type
,m.data
FROM source s
CROSS JOIN match_cte m
WHERE m.should_find_str_1 = sign(instr(s.type,m.str_1))
AND (m.str_2 IS NULL
OR m.should_find_str_2 = sign(instr(s.type, m.str_2))
)
ORDER BY s.id, m.data
MATCH_CTE
|DATA|TYPE|SHOULD_FIND_STR_1|STR_1|SHOULD_FIND_STR_2|STR_2|
|-|-|-|-|-|-|
|NOT ABC AND NOT DEF|NOT %ABC% AND NOT %DEF%|0|ABC|0|DEF|
|DEF|%DEF%|1|DEF|1|NULL|
|NOT DEF AND NOT IJK|NOT %DEF% AND NOT %IJK%|0|DEF|0|IJK|

compare 2 text columns and show difference in the third cell using sql

I am trying to compare 2 columns and return only the difference between them. For example:
select * from table1
Column_1          column_2
----------------  ----------------------------
Swetha working    Swetha is working in Chennai
Raju 10th         Raju is studying 10th std
ranjith           Ranjith played yesterday
how to play       how to play Cricket
My name is        my name is john
My name is my name is john
Output:
Words that appear in the middle of column_2 should also be removed, as in rows 1 and 2:
Column_1          column_2                      column_3
----------------  ----------------------------  ----------------
Swetha working    Swetha is working in Chennai  is in Chennai
Raju 10th         Raju is studying 10th std     is studying std
ranjith           Ranjith played yesterday      played yesterday
how to play       how to play Cricket           Cricket
My name is        my name is john               john
This is much more complicated than your previous question. You can break the first column into words and then substitute them individually in the second column. To do that, though, you need a recursive CTE:
with words as (
select t.*, s.*,
max(s.seqnum) over (partition by t.id) as max_seqnum
from t cross apply
(select s.value as word,
row_number() over (order by (select null)) as seqnum
from string_split(col1, ' ') s
) s
),
cte as (
select id, col1, col2,
replace(' ' + col2 + ' ', ' ' + word + ' ', ' ') as result,
word, seqnum, max_seqnum
from words
where seqnum = 1
union all
select cte.id, cte.col1, cte.col2,
replace(cte.result, ' ' + w.word + ' ', ' '),
w.word, w.seqnum, cte.max_seqnum
from cte join
words w
on w.id = cte.id and w.seqnum = cte.seqnum + 1
)
select id, col1, col2, ltrim(rtrim(result)) as result
from cte
where max_seqnum = seqnum
order by id;
Here is a db<>fiddle.
I added an id so each row is uniquely defined. If your version of SQL Server doesn't have the built-in string_split() function, you can easily find a version that does the same thing.
One trick that this uses is for handling the first and last words in the second column. The code adds spaces at the beginning and end. That way, all words in the string are surrounded by spaces, making it easier to replace only complete words.
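The space-padding trick can be shown in isolation with a short Python sketch. The function name `remove_words` is an assumption for the example. Note that Python's `str.replace` is case-sensitive, whereas SQL Server's REPLACE follows the column collation, so results can differ for rows such as ranjith / Ranjith.

```python
def remove_words(col1, col2):
    """Remove each whole word of col1 from col2, one word at a time."""
    # Pad with spaces so the first and last words are also surrounded by spaces.
    result = " " + col2 + " "
    for word in col1.split(" "):
        if word:
            # Only a complete word (space on both sides) is replaced.
            result = result.replace(" " + word + " ", " ")
    return result.strip()

print(remove_words("Swetha working", "Swetha is working in Chennai"))  # is in Chennai
```

The loop over the words of `col1` corresponds to the recursive step of the CTE, which applies one replacement per level.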
SQL Server 2016 definitely has STRING_SPLIT. This approach appends an extra space to either side of the split word from Column 2.
Data
drop table if exists #strings;
go
create table #strings(
Id int,
Column_1 varchar(200),
Column_2 varchar(200));
go
insert #strings(Id, Column_1, Column_2) values
(1, 'Swetha', 'Swetha is working in Chennai'),
(2, 'Raju', 'Raju is studying 10 std'),
(3, 'Swetha working', 'Swetha is working in Chennai'),
(4, 'Raju 10th', 'Raju is studying 10th std');
Query
declare
@add_delim char(1)=' ';
;with
c1_cte(split_str) as (
select ltrim(rtrim(s.[value]))
from
#strings st
cross apply
string_split(st.Column_1, ' ') s),
c2_cte(Id, ndx, split_str) as (
select Id, charindex(@add_delim + s.[value] + @add_delim, @add_delim + st.Column_2 + @add_delim), s.[value]
from
#strings st
cross apply
string_split(st.Column_2, ' ') s
where
st.Column_2 not like '% %')
select
Id, stuff((select ' ' + c.split_str
from c2_cte c
where c.Id = c2.Id and not exists(select 1
from c1_cte c1
where c.split_str=c1.split_str)
order by c.ndx FOR XML PATH('')), 1, 1, '') [new_str]
from c2_cte c2
group by Id;
Results
Id new_str
1 is in Chennai
2 is studying 10 std
3 is in Chennai
4 is studying std
Here is the solution using STRING_SPLIT and STRING_AGG
DBFIDDLE working link
;WITH split_words
AS (
SELECT *
FROM dbo.Strings
CROSS APPLY (
SELECT VALUE
FROM STRING_SPLIT(column_2, ' ')
WHERE VALUE NOT IN (
SELECT VALUE
FROM STRING_SPLIT(column_1, ' ')
)
) a
)
SELECT *
,(
SELECT sw.VALUE + ' ' [text()]
FROM split_words sw
WHERE sw.Column_1 = s.Column_1
AND sw.Column_2 = s.Column_2
FOR XML PATH('')
,TYPE
).value('.', 'NVARCHAR(MAX)') [difference]
FROM dbo.Strings s
For SQL Server 2017+, where STRING_AGG is supported:
SELECT b.Column_1
,b.Column_2
,STRING_AGG(b.VALUE, ' ')
FROM (
SELECT *
FROM dbo.Strings
CROSS APPLY (
SELECT VALUE
FROM STRING_SPLIT(column_2, ' ')
WHERE VALUE NOT IN (
SELECT VALUE
FROM STRING_SPLIT(column_1, ' ')
)
) a
) b
GROUP BY b.Column_1
,b.Column_2
Results:
WITH
-- your input
input(column_1,column_2,column_3) AS (
SELECT 'Swetha working','Swetha is working in Chennai','is in Chennai'
UNION ALL SELECT 'Raju 10th','Raju is studying 10th std','is studying std'
UNION ALL SELECT 'ranjith','Rantith played yesterday','played yesterday'
UNION ALL SELECT 'how to play','how to play Cricket','Cricket'
UNION ALL SELECT 'My name is','my name is john','john'
)
,
-- need a series of integers
-- you can also try to play with the STRING_SPLIT() function
i(i) AS (
SELECT 1
UNION ALL SELECT 2
UNION ALL SELECT 3
UNION ALL SELECT 4
UNION ALL SELECT 5
)
,
-- you can also try to play with the STRING_SPLIT() function
unfound_tokens AS (
SELECT
i
, column_1
, column_2
, TOKEN(column_2,' ',i) AS token
FROM input CROSS JOIN i
WHERE TOKEN(column_2,' ',i) <> ''
AND CHARINDEX(
UPPER(TOKEN(column_2,' ',i))
, UPPER(column_1)
) = 0
)
SELECT
column_1
, column_2
, STRING_AGG(token ,' ') AS column_3
FROM unfound_tokens
GROUP BY
column_1
, column_2
-- out column_1 | column_2 | column_3
-- out ----------------+------------------------------+--------------------------
-- out My name is | my name is john | john
-- out Swetha working | Swetha is working in Chennai | is Chennai
-- out how to play | how to play Cricket | Cricket
-- out Raju 10th | Raju is studying 10th std | is studying std
-- out ranjith | Rantith played yesterday | Rantith played yesterday
I am not sure that the results will preserve the order of the words when STRING_AGG or STRING_SPLIT is used...
Take a look at this query, which gives a different ordering:
WITH
SS1 AS
(SELECT Id, SS.value AS COL1
FROM #strings
CROSS APPLY STRING_SPLIT(Column_1, ' ') AS SS
),
SS2 AS
(SELECT Id, SS.value AS COL2
FROM #strings
CROSS APPLY STRING_SPLIT(Column_2, ' ') AS SS
),
DIF AS
(
SELECT Id, COL2 AS COL
FROM SS2
EXCEPT
SELECT Id, COL1
FROM SS1
)
SELECT DIF.Id, Column_1, Column_2, STRING_AGG(COL, ' ')
FROM DIF
JOIN #strings AS S ON S.Id = DIF.Id
GROUP BY DIF.Id, Column_1, Column_2;
You would have to test with a very large amount of data to see whether the queries given so far have side effects such as inconsistent ordering (I am fairly sure that no consistent order is guaranteed, because of parallelism...).
So the only way to preserve a consistent order is to write a recursive query that adds an index for each word's position in the sentence...
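The idea of carrying each word's position along, so the final string can be rebuilt in sentence order, can be sketched like this in Python. The function name and the case-insensitive comparison are assumptions for the illustration; the point is only that sorting on an explicit index removes any dependence on the order in which rows happen to arrive.

```python
def ordered_difference(col1, col2):
    """Words of col2 not present in col1, rebuilt in their original order."""
    known = {w.lower() for w in col1.split()}
    # Attach an explicit position to every word of the sentence.
    indexed = [(pos, word) for pos, word in enumerate(col2.split())]
    kept = [(pos, word) for pos, word in indexed if word.lower() not in known]
    # Even if 'kept' were shuffled (as rows may be in a parallel plan),
    # sorting by position restores the sentence order.
    kept.reverse()
    return " ".join(word for _, word in sorted(kept))

print(ordered_difference("Raju 10th", "Raju is studying 10th std"))  # is studying std
```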

Remove Alphabet and Insert Leading 0 up to Certain Character Length

I have a table name Table and column name Item. Below is the data.
ABC123
ABC1234
ABC12345
HA11
K112
L1164
I need to remove the letters and left-pad the remaining digits with 0 so that the total length is 9 characters. Below are the expected results.
000000123
000001234
000012345
000000011
000000112
000001164
I know how to do this for one specific prefix (ABC) only; I don't know how to build the CASE statement. Below is what has worked so far.
select REPLICATE('0',9-LEN(A.B)) + A.B
from
(select replace(Item, 'ABC','') as B from Table) as A
I tried to combine CASE with SELECT, but it doesn't seem to work.
Case when Item like '%ABC%' then
select REPLICATE('0',9-LEN(A.B)) + A.B
from
(select replace(Item, 'ABC','') as B from Table) as A
when Item like '%HA%' then
select REPLICATE('0',9-LEN(A.B)) + A.B
from
(select replace(Item, 'HA','') as B from Table) as A
when Item like '%K%' then
select REPLICATE('0',9-LEN(A.B)) + A.B
from
(select replace(Item, 'K','') as B from Table) as A
when Item like '%L%' then
select REPLICATE('0',9-LEN(A.B)) + A.B
from
(select replace(Item, 'L','') as B from Table) as A
else Item
End
Does anyone know how to achieve the result? I'm using SQL Server 2012.
Thank you.
I assumed that the letters appear only at the beginning of your data.
declare @s varchar(20) = 'ABS123'
-- find the index of the first occurrence of a digit, then cut off the letters before it
set @s = right(@s, len(@s) - patindex('%[0-9]%', @s) + 1)
-- here we just produce a string with the number of zeros we need
select left('000000000', 9 - len(@s)) + @s
In terms of applying it to your table:
select left('000000000', 9 - len([Digits])) + [Digits] from (
select right([Item], len([Item]) - patindex('%[0-9]%', [Item]) + 1) as [Digits] from [Table]
) as d -- SQL Server requires an alias for a derived table
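For readers who want to check the logic outside SQL Server, here is a minimal Python sketch of the same two steps: cut everything before the first digit, then left-pad with zeros. The function name `pad_digits` is made up for the example. When a value has no digit at all, the sketch keeps the whole string, which mirrors what the `right(...)` expression does when `patindex` returns 0.

```python
import re

def pad_digits(item, width=9):
    """Drop the leading letters and left-pad the rest with zeros."""
    m = re.search(r"[0-9]", item)
    # Index of the first digit; 0 (keep everything) when there is no digit.
    start = m.start() if m else 0
    digits = item[start:]
    return digits.rjust(width, "0")

for item in ["ABC123", "HA11", "L1164"]:
    print(pad_digits(item))
```

Like the SQL version, this assumes the letters appear only at the start of the value.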

SQL: Pivoting on more than one column

I have a table
Name | Period | Value1 | Value2
-----+--------+--------+-------
A    | 1      | 2      | 3
A    | 2      | 5      | 4
A    | 3      | 6      | 7
B    | 1      | 2      | 3
B    | 2      | 5      | 4
B    | 3      | 6      | 7
I need results like
Name | Value1 | Value2
-----+--------+------
A | 2,5,6 | 3,4,7
B | 2,5,6 | 3,4,7
Number of periods is dynamic but I know how to handle it so, for simplicity, let's say there are 3 periods
The query below gives me results for Value1. How can I get results for both?
I can always do them separately and then do a join but the table is really big and I need "combine" four values, not two. Can I do it in one statement?
SELECT Name,
[1]+','+ [2] + ','+ [3] ValueString
FROM (
select Name, period, cpr from #MyTable
) as s
PIVOT(SUM(Value1)
FOR period IN ([1],[2],[3])
You can use correlated subqueries instead of PIVOT. Combining the values into strings is a bit tricky, requiring XML logic in SQL Server:
select n.name,
stuff((select ',' + cast(value1 as varchar(max))
from t
where n.name = t.name
order by t.period
for xml path ('')
), 1, 1, ''
) as values1,
stuff((select ',' + cast(value2 as varchar(max))
from t
where n.name = t.name
order by t.period
for xml path ('')
), 1, 1, ''
) as values2
from (select distinct name
from t
) n;
Your values look like numbers, hence the explicit cast and the lack of concern for XML special characters.
You may ask why this does the distinct in a subquery rather than in the outer query. If done in the outer query, then the SQL engine will probably do the aggregation for every row before doing the distinct. I'm not sure if the optimizer is good enough to run the subqueries only once per name.
Use GROUP BY with the STUFF function to get the expected result:
SELECT Name,
STUFF((SELECT ',' + CAST(Value1 AS VARCHAR) FROM #MyTable T2 WHERE T1.Name = T2.Name FOR XML PATH('')),1,1,'') Value1,
STUFF((SELECT ',' + CAST(Value2 AS VARCHAR) FROM #MyTable T3 WHERE T1.Name = T3.Name FOR XML PATH('')),1,1,'') Value2
FROM #MyTable T1
GROUP BY Name
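What both answers compute is one comma-separated string per value column for each name. A small Python sketch of that grouping, with the values ordered by period, might look like this (the function name and the tuple layout are assumptions for the example):

```python
from collections import defaultdict

def concat_by_name(rows):
    """rows: iterable of (name, period, value1, value2) tuples."""
    grouped = defaultdict(list)
    for name, period, v1, v2 in rows:
        grouped[name].append((period, v1, v2))
    result = []
    for name in sorted(grouped):
        ordered = sorted(grouped[name])  # order by period, like ORDER BY t.period
        result.append((
            name,
            ",".join(str(v1) for _, v1, _ in ordered),
            ",".join(str(v2) for _, _, v2 in ordered),
        ))
    return result

rows = [
    ("A", 1, 2, 3), ("A", 2, 5, 4), ("A", 3, 6, 7),
    ("B", 1, 2, 3), ("B", 2, 5, 4), ("B", 3, 6, 7),
]
print(concat_by_name(rows))
```

Sorting by period before joining matters: without an explicit order, the order of the values inside each string is not guaranteed.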

sql - getting the count of employee in each grade in a specific format

I am new to SQL. I have a table like this:
Emp_id | Emp_NAME | EMP_GRADE
1      | Test1    | A1
2      | Test2    | A2
3      | Test3    | A3
4      | Test4    | A4
6      | Test5    | A1
7      | Test6    | A2
8      | Test7    | A3
I need to get the count of employees in each grade. The final output should be
"2 - 2 - 2 - 1" in a single column, where the output is the count of employees in each grade, i.e. A1(2) - A2(2) - A3(2) - A4(1). Can anyone give me an SQL query for this? I hope we don't need a cursor for this.
SELECT COUNT(Emp_id) FROM myTableName GROUP BY EMP_GRADE
Use:
DECLARE @Grades varchar(1000)
SELECT @Grades=coalesce(@Grades + ' ','') +Cast(COUNT(EMP_GRADE) as Varchar(2))+' -' From TableName
Group By EMP_GRADE
Select @Grades=SUBSTRING(@Grades,0,LEN(@Grades))
Select @Grades
Update:
SELECT @Grades=coalesce(@Grades + ' ','') +Cast(COUNT(t1.EMP_GRADE) as Varchar(2))+' -' From #tab1 t
Left Join #tab1 t1 On t1.EMP_GRADE= t.EMP_GRADE And t1.Emp_id= t.Emp_id
And t1.EMP_GRADE<>'A3' -- Replace conditions here
Group By t1.EMP_GRADE,t.EMP_GRADE
This should work:
SELECT EMP_GRADE, COUNT(EMP_Id) AS EMPS_COUNT
FROM TableName
GROUP BY EMP_GRADE
Hope that helps. Keep learning SQL.
SELECT STUFF((
SELECT ' - ' + CAST(COUNT(1) AS VARCHAR(max))
FROM myTable
GROUP BY EMP_GRADE
ORDER BY EMP_GRADE
FOR XML PATH('')
), 1, 3, '')
SQL Fiddle example
If you are filtering but still want to return results for every grade, you will need a self-join to get the full list of grades. Here's one way:
;WITH g AS (SELECT DISTINCT EMP_GRADE FROM myTable)
SELECT STUFF((
SELECT ' - ' + CAST(COUNT(t.Emp_id) AS VARCHAR(max))
FROM g
LEFT OUTER JOIN myTable t ON g.EMP_GRADE = t.EMP_GRADE
AND t.Emp_id % 2 = 1 --put your filter conditions here as part of the join
GROUP BY g.EMP_GRADE
ORDER BY g.EMP_GRADE
FOR XML PATH('')
), 1, 3, '')
SQL Fiddle example
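The difference between the two queries above (plain counts per grade versus filtered counts that still list every grade) can be sketched in Python as follows. The function name `grade_counts` and the `keep` parameter are invented for the illustration; the odd-id filter matches the `t.Emp_id % 2 = 1` example.

```python
def grade_counts(rows, keep=lambda emp_id: True):
    """rows: list of (emp_id, grade). Returns counts per grade joined by ' - '."""
    # Full list of grades, taken before filtering (the role of the CTE g).
    grades = sorted({grade for _, grade in rows})
    counts = {g: 0 for g in grades}
    for emp_id, grade in rows:
        if keep(emp_id):  # the filter only affects the counts, not the grade list
            counts[grade] += 1
    return " - ".join(str(counts[g]) for g in grades)

rows = [(1, "A1"), (2, "A2"), (3, "A3"), (4, "A4"), (6, "A1"), (7, "A2"), (8, "A3")]
print(grade_counts(rows))                          # 2 - 2 - 2 - 1
print(grade_counts(rows, lambda i: i % 2 == 1))    # 1 - 1 - 1 - 0
```

Building the grade list before applying the filter is what keeps a grade with zero matching employees in the output, just as the LEFT OUTER JOIN does.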