How to get tally of unique words using only SQL? - sql

How do you get a list of all unique words and their frequencies ("tally") in one text column of one table of an SQL database? An answer for any SQL dialect in which this is possible would be appreciated.
In case it helps, here's a one-liner that does it in Ruby using Sequel:
Hash[DB[:table].all.map{|r| r[:text]}.join("\n").gsub(/[\(\),]+/,'').downcase.strip.split(/\s+/).tally.sort_by{|_k,v| v}.reverse]
To give an example, for table table with 4 rows with each holding one line of Dr. Dre's Still D.R.E.'s refrain in a text field, the output would be:
{"the"=>4,
"still"=>3,
"them"=>3,
"for"=>2,
"d-r-e"=>1,
"it's"=>1,
"streets"=>1,
"love"=>1,
"got"=>1,
"i"=>1,
"and"=>1,
"beat"=>1,
"perfect"=>1,
"to"=>1,
"time"=>1,
"my"=>1,
"taking"=>1,
"girl"=>1,
"low-lows"=>1,
"in"=>1,
"corners"=>1,
"hitting"=>1,
"world"=>1,
"across"=>1,
"all"=>1,
"gangstas"=>1,
"representing"=>1,
"i'm"=>1}
This works, of course, but is naturally not as fast or elegant as doing it in pure SQL would be - and I have no clue whether that's even in the realm of possibility...

I guess it would depend on what the SQL database looks like. You would first have to turn your 4-row "database" into a single column of data, with each row representing one word. To do that you could use something like STRING_SPLIT, with every space acting as a delimiter.
SELECT value FROM STRING_SPLIT('I''m representing for them gangstas all across the world', ' ');
https://www.sqlservertutorial.net/sql-server-string-functions/sql-server-string_split-function/
This would turn it into a table where every word is a row.
Once you've set up your data table, then it's easy.
Your_table:
[Word]
I'm
representing
for
them
...
world
Then you can just write:
SELECT Word, count(*)
FROM your_table
GROUP BY Word;
Your output would be:
Word | Count
I'm 1
representing 1
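For dialects without STRING_SPLIT, the same split-then-group idea can be sketched with a recursive CTE. The example below is only an illustration, assuming SQLite (driven from Python's sqlite3 module); the table name songs, the column line and the sample rows are made up for the demo, and punctuation stripping is left out.

```python
import sqlite3

# Hypothetical demo table: one line of text per row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE songs (line TEXT)")
conn.executemany(
    "INSERT INTO songs (line) VALUES (?)",
    [("the cat and the dog",), ("the dog sat",)],
)

TALLY_SQL = """
WITH RECURSIVE split(word, rest) AS (
    -- Seed: no word consumed yet; the trailing space guarantees a final delimiter.
    SELECT '', lower(line) || ' ' FROM songs
    UNION ALL
    -- Step: peel off everything up to the next space as one word.
    SELECT substr(rest, 1, instr(rest, ' ') - 1),
           substr(rest, instr(rest, ' ') + 1)
    FROM split
    WHERE rest <> ''
)
SELECT word, COUNT(*) AS n
FROM split
WHERE word <> ''          -- drop the seed rows and runs of spaces
GROUP BY word
ORDER BY n DESC, word
"""

tally = conn.execute(TALLY_SQL).fetchall()
for word, n in tally:
    print(word, n)
```

The counting itself is the same GROUP BY as above; only the splitting step differs between dialects.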

I had a play using XML in sql server. Just an idea :)
with cteXML as
(
select *
,cast('<wd>' + replace(songline,' ' ,'</wd><wd>') + '</wd>' as xml) as XMLsongline
from #tSongs
),cteBase as
(
select p.value('.','nvarchar(max)') as singleword
from cteXML as x
cross apply x.XMLsongline.nodes('/wd') t(p)
)
select b.singleword,count(b.singleword)
from cteBase as b
group by b.singleword

Related

Count all elements in an array

I have a table in which I store some data, including a list of numbers, like this:
numbers
(null)
،42593
،42593،42594،36725،42592،36725،42592
،42593،42594،36725،42592
،31046،36725،42592
I would like to count the number of elements in every row in SQL Server:
count
0
1
6
4
3
You could use a replacement trick here:
SELECT numbers,
COALESCE(LEN(numbers) - LEN(REPLACE(numbers, ',', '')), 0) AS num_elements
FROM yourTable;
The above trick works by counting the number of commas (assuming your data really has commas as separators). For example, your last sample data point was:
,31046,36725,42592 => length is 18
310463672542592 => length is 15
Hence the difference in lengths correctly yields the right number of elements.
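The length-difference trick can be checked quickly outside SQL Server. This is a minimal sketch, assuming SQLite through Python's sqlite3 module (LENGTH instead of LEN) and using the Arabic comma from the question's sample data as the separator, passed as a parameter:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE yourtable (numbers TEXT)")
conn.executemany(
    "INSERT INTO yourtable (numbers) VALUES (?)",
    [
        (None,),
        ("،42593",),
        ("،42593،42594،36725،42592،36725،42592",),
        ("،42593،42594،36725،42592",),
        ("،31046،36725،42592",),
    ],
)

SEPARATOR = "،"  # the separator actually present in the sample rows

# Elements per row = number of separators = length difference after removing them.
# A NULL row yields NULL from LENGTH, which COALESCE turns into 0.
counts = [
    row[0]
    for row in conn.execute(
        "SELECT COALESCE(LENGTH(numbers) - LENGTH(REPLACE(numbers, ?, '')), 0) "
        "FROM yourtable ORDER BY rowid",
        (SEPARATOR,),
    )
]
print(counts)
```

This only works because every row starts with a separator; without the leading separator the count would be one too low.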
Another idea is to use STRING_SPLIT:
SELECT y.numbers,
(SELECT COUNT(Value) - 1
FROM string_split(COALESCE(y.numbers,''),',')) AS num_elements
FROM yourtable AS y;
I know this looks a bit unwieldy at first glance because of the strange -1 in the second line and the COALESCE in the third line. So why do I mention this option?
Well, what causes these difficulties in my query is that your rows always start with a comma.
This is quite odd, and it would be much easier without this leading comma in every row.
Let's assume you remove this comma in the future. Then the query becomes really simple and readable:
SELECT y.numbers,
(SELECT COUNT(Value)
FROM string_split(y.numbers,',')) AS num_elements
FROM yourtable AS y;
your data
CREATE TABLE yourtable(
numbers VARCHAR(max)
);
INSERT INTO yourtable
(numbers) VALUES
(null),
('،42593'),
('،42593،42594،36725،42592،36725،42592'),
('،42593،42594،36725،42592'),
('،31046،36725،42592');
you need ISNULL and len
select
ISNULL(len(numbers) - len(replace(numbers,'،','')) ,0) count
from yourtable
the other way is by using IIF and string_split as follows
SELECT IIF(count < 0, 0, count) count
FROM   (SELECT (SELECT Count(*) - 1
                FROM   STRING_SPLIT (Replace(Replace(numbers, 'R', ''), '،',
                                     'R'), 'R'
                       )) AS
               'count'
        FROM   yourtable) A

SQL group by middle part of string

I have a string column whose values usually look approximately like this:
https://mapy.cz/zakladni?x=16.3360208&y=49.6718038&z=8&source=firm&id=13123554
https://mapy.cz/turisticka?x=15.9380354&y=50.1990211&z=11&source=base&id=2197
https://mapy.cz/turisticka?x=12.8611357&y=49.8051338&z=16&source=base&id=1703157
I would like to group data by source which is part of the string - four letters behind "source=" (in the case above: firm) and then simply count them. Is there a way to achieve this directly in SQL code? I am using hadoop.
Data is a set of strings that look like the ones above. My expected result is a summary table with two columns: 1) Each type of source (there are about 20 possible values and their lengths differ, so I cannot use a simple substring). Ideally I am looking for a solution that says: for the grouping, use the four letters that come after "source=". 2) The count of their occurrences across all the strings.
There is just one source type in each string.
You can use regexp_extract():
select substr(regexp_extract(url, 'source[^&]+'), 8)
You can use CHARINDEX in MSSQL to get the position of the substring and extract the value:
;with cte as (
SELECT SUBSTRING('https://mapy.cz/zakladni?x=16.3360208&y=49.6718038&z=8&source=firm&id=13123554',
charindex('&source=','https://mapy.cz/zakladni?x=16.3360208&y=49.6718038&z=8&source=firm&id=13123554')
+8,4) AS ExtractString )
select ExtractString,count(ExtractString) as count from cte group by ExtractString;
HiveQL has an equivalent function, LOCATE, for CHARINDEX.
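To see the extract-then-group logic in isolation, here is a small sketch in Python rather than in Hive or T-SQL. It captures whatever follows "source=" up to the next "&" (so it does not assume a fixed length of four) and counts the values; the helper name source_of is made up for the illustration.

```python
import re
from collections import Counter

urls = [
    "https://mapy.cz/zakladni?x=16.3360208&y=49.6718038&z=8&source=firm&id=13123554",
    "https://mapy.cz/turisticka?x=15.9380354&y=50.1990211&z=11&source=base&id=2197",
    "https://mapy.cz/turisticka?x=12.8611357&y=49.8051338&z=16&source=base&id=1703157",
]

def source_of(url):
    """Return the value of the source parameter, or None if it is absent."""
    match = re.search(r"[?&]source=([^&]+)", url)
    return match.group(1) if match else None

# Equivalent of GROUP BY source with COUNT(*).
source_counts = Counter(source_of(url) for url in urls)
print(source_counts)
```

In SQL, the same capture group would be the grouping expression, with COUNT(*) next to it.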

String split/chunks into chunked columns from one varchar column

Hopefully everyone is having a productive lockdown all over the world. This is my second issue I wanted some assistance with today.
What I have is a chat from a telecom company signing up new customers.
I have successfully collapsed them into 2x rows per unique_id - a unique chat interaction captured between customer and company agent.
I would like to now take each column (text) in each row and separate it out to 5 equal varchar columns.
The objective is to splice/chunk a conversation into 5 different stages within this table.
I do not have access to delimiters as customers and company staff use delimiting characters themselves so it makes this tricky.
Below I have 2 images with what the data looks like now and what I am looking for.
BEFORE
AFTER
I have looked at the following articles to try to crack it, but am stuck:
Split A Single Field Value Into Multiple Fixed-Length Column Values in T-SQL
How to Split String by Character into Separate Columns in SQL Server
How to split a comma-separated value to columns
How to split a single column values to multiple column values?
Split string in SQL Server to a maximum length, returning each as a row
Here is the SQL Fiddle page, but I am running this code in MS SQL Server: http://sqlfiddle.com/#!9/ddd08c
Here is the table creation code:
CREATE TABLE Table1
(`unique_id` double, `user` varchar(8), `text` varchar(144))
;
INSERT INTO Table1
(`unique_id`, `user`, `text`)
VALUES
(50314585222, 'customer', 'This is part 1 of long text. This is part 2 of long text. This is part 3 of long text. This is part 4 of long text. This is part 5 of long text.'),
(50314585222, 'company', 'This is part 1 of long text This is part 2 of long text This is part 3 of long text This is part 4 of long text This is part 5 of long text'),
(50319875222, 'customer', 'This is part 1 This is part 2 This is part 3 This is part 4 This is part 5'),
(50319875222, 'company', 'This is part 1 This is part 2 This is part 3 This is part 4 This is part 5')
;
I have requested an almost similar algorithm in R, in my history. I have been trying to do this in SQL.
I have managed to solve this with the T-SQL statement below:
WITH DataSource AS
(
SELECT *
,'\b.{1,'+CAST(CEILING(LEN([text]) * 1.0 /5) AS VARCHAR(12)) +'}\b' AS [pattern]
FROM TAble1
), PreparedData AS
(
SELECT unique_id
,[user]
,'text' + CAST(RM.matchID + 1 AS VARCHAR(12)) as [column]
,RM.CaptureValue AS [value]
FROM DataSource T
CROSS APPLY [dbo].[fn_Utils_RegexMatches] ([text], [pattern]) RM
)
SELECT *
FROM PreparedData DS
PIVOT
(
max([value]) for [column] IN ([text1], [text2], [text3], [text4], [text5])
) PVT;
In order to use this code, you need to implement SQL CLR function(s) for working with regular expression in the context of T-SQL (you need to invest some time understanding how SQL CLR works) - otherwise, you will not be able to use this solution.
So, having RegexMatches function, the first part is to build a regular expression pattern for splitting the data:
SELECT *
,'\b.{1,'+CAST(CEILING(LEN([text]) * 1.0 /5) AS VARCHAR(12)) +'}\b' AS [pattern]
FROM TAble1;
The pattern is \b.{1,number}\b and will match parts of the string up to number characters long without cutting words (check that the word boundary works for you, because in some cases it won't).
Then, using our regex matches function, we get a result like this (the second common table expression):
And the data above is ready for pivoting which is pretty easy.
So, the notes are:
you need to implement Microsoft String Utility
you need to ensure the regex pattern works for you
you can split up the T-SQL I used, check the other columns of the regex function and even make the pivoting dynamic - the code is an example, and you need to modify and check it before using it in production
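The behaviour of the \b.{1,number}\b pattern is easy to try out before wiring up a CLR function. The sketch below uses Python's re module, whose regex flavour is not identical to .NET's, so treat it as an approximation; the function name chunk_on_word_boundaries is made up for the illustration.

```python
import math
import re

def chunk_on_word_boundaries(text, parts=5):
    """Split text into pieces of at most ceil(len/parts) characters, cutting only at word boundaries."""
    limit = max(1, math.ceil(len(text) / parts))
    pattern = r"\b.{1,%d}\b" % limit
    return re.findall(pattern, text)

text = "This is part 1 This is part 2 This is part 3 This is part 4 This is part 5"
chunks = chunk_on_word_boundaries(text)
for number, chunk in enumerate(chunks, start=1):
    print(number, repr(chunk))
```

For this sample row the pattern yields exactly five pieces. That is not guaranteed in general: when word boundaries fall unevenly, the pattern can produce more than five matches, which is worth checking before pivoting into fixed columns.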

How to substring records with variable length

I have a table which has a column with doc locations, such as AA/BB/CC/EE
I am trying to get only one of these parts, let's say just the CC part (which has variable length). So far I've tried the following:
SELECT RIGHT(doclocation,CHARINDEX('/',REVERSE(doclocation),0)-1)
FROM Table
WHERE doclocation LIKE '%CC %'
But I'm not getting the expected result
Use PARSENAME function like this,
DECLARE @s VARCHAR(100) = 'AA/BB/CC/EE'
SELECT PARSENAME(replace(@s, '/', '.'), 2)
This is painful to do in SQL Server. One method is a series of string operations. I find this simplest using outer apply (unless I need subqueries for a different reason):
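select t.*, t4.cc
from t outer apply
(select stuff(t.doclocation, 1, charindex('/', t.doclocation), '') as rest1) t2 outer apply  -- drop 'AA/'
(select stuff(t2.rest1, 1, charindex('/', t2.rest1), '') as rest2) t3 outer apply  -- drop 'BB/'
(select left(t3.rest2, charindex('/', t3.rest2 + '/') - 1) as cc  -- keep the text up to the next '/'
) t4;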
The PARSENAME function is meant to return the specified part of an object name and should not be used for this purpose, as it only parses strings with at most 4 parts (see the SQL Server PARSENAME documentation at MSDN).
SQL Server 2016 has a new function STRING_SPLIT, but if you don't use SQL Server 2016 you have to fallback on the solutions described here: How do I split a string so I can access item x?
The question is not clear I guess. Can you please specify which value you need? If you need the values after CC, then you can do the CHARINDEX on "CC". Also the query does not seem correct as the string you provided is "AA/BB/CC/EE" which does not have a space between it, but in the query you are searching for space WHERE doclocation LIKE '%CC %'
SELECT SUBSTRING(doclocation,CHARINDEX('CC',doclocation)+2,LEN(doclocation))
FROM Table
WHERE doclocation LIKE '%CC %'

Searching a column containing CSV data in a MySQL table for existence of input values

I have a table say, ITEM, in MySQL that stores data as follows:
ID FEATURES
--------------------
1 AB,CD,EF,XY
2 PQ,AC,A3,B3
3 AB,CDE
4 AB1,BC3
--------------------
As an input, I will get a CSV string, something like "AB,PQ". I want to get the records that contain AB or PQ. I realized that we've to write a MySQL function to achieve this. So, if we have this magical function MATCH_ANY defined in MySQL that does this, I would then simply execute an SQL as follows:
select * from ITEM where MATCH_ANY(FEATURES, "AB,PQ") = 0
The above query would return the records 1, 2 and 3.
But I'm running into all sorts of problems while implementing this function as I realized that MySQL doesn't support arrays and there's no simple way to split strings based on a delimiter.
Remodeling the table is the last option for me as it involves lot of issues.
I might also want to execute queries containing multiple MATCH_ANY functions such as:
select * from ITEM where MATCH_ANY(FEATURES, "AB,PQ") = 0 and MATCH_ANY(FEATURES, "CDE")
In the above case, we would get an intersection of records (1, 2, 3) and (3) which would be just 3.
Any help is deeply appreciated.
Thanks
First of all, the database should of course not contain comma separated values, but you are hopefully aware of this already. If the table was normalised, you could easily get the items using a query like:
select distinct i.Itemid
from Item i
inner join ItemFeature f on f.ItemId = i.ItemId
where f.Feature in ('AB', 'PQ')
You can match the strings in the comma separated values, but it's not very efficient:
select Id
from Item
where
instr(concat(',', Features, ','), ',AB,') <> 0 or
instr(concat(',', Features, ','), ',PQ,') <> 0
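The delimiter-wrapping trick above can be tried out with a small sketch. It assumes SQLite through Python's sqlite3 module (so || replaces concat), and match_any is a hypothetical helper that builds one instr condition per wanted value with bound parameters:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, features TEXT)")
conn.executemany(
    "INSERT INTO item (id, features) VALUES (?, ?)",
    [(1, "AB,CD,EF,XY"), (2, "PQ,AC,A3,B3"), (3, "AB,CDE"), (4, "AB1,BC3")],
)

def match_any(wanted):
    """Return ids of rows whose comma-separated features contain any wanted value."""
    if not wanted:
        return []
    # Wrapping both sides in commas stops 'AB' from matching 'AB1'.
    clause = " OR ".join("instr(',' || features || ',', ?) > 0" for _ in wanted)
    params = ["," + value + "," for value in wanted]
    query = "SELECT id FROM item WHERE " + clause + " ORDER BY id"
    return [row[0] for row in conn.execute(query, params)]

any_ab_pq = match_any(["AB", "PQ"])
both = sorted(set(any_ab_pq) & set(match_any(["CDE"])))
print(any_ab_pq, both)
```

Every row still has to be scanned, so this shares the efficiency caveat mentioned above; the normalised ItemFeature table remains the better long-term fix.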
For all you REGEXP lovers out there, I thought I would add this as a solution:
SELECT * FROM ITEM WHERE FEATURES REGEXP '[[:<:]](AB|PQ)[[:>:]]';
and for case sensitivity:
SELECT * FROM ITEM WHERE FEATURES REGEXP BINARY '[[:<:]](AB|PQ)[[:>:]]';
For the second query:
SELECT * FROM ITEM WHERE FEATURES REGEXP '[[:<:]](AB|PQ)[[:>:]]' AND FEATURES REGEXP '[[:<:]]CDE[[:>:]]';
Cheers!
select *
from ITEM
where CONCAT(',',FEATURES,',') LIKE '%,AB,%'
or CONCAT(',',FEATURES,',') LIKE '%,PQ,%'
or create a custom function to do your MATCH_ANY
Alternatively, consider using RLIKE:
select *
from ITEM
where CONCAT(',', FEATURES, ',') RLIKE ',AB,|,PQ,';
Just a thought:
Does it have to be done in SQL? This is the kind of thing you might normally expect to write in PHP or Python or whatever language you're using to interface with the database.
This approach means you can build your query string using whatever complex logic you need and then just submit a vanilla SQL query, rather than trying to build a procedure in SQL.
Ben