SQL Server - Find frequency of occurrence (by row, not word) of most common words in a column - sql

This question has been asked more than a few times, but I can't find the specific answer I need. I have a query that finds the most commonly appearing words in a column in SQL Server and lists them with the count of their appearances. The problem is that if a word appears multiple times in a row, it counts once for each appearance. I would like to only count each word once per row.
So a row with a value of "To be or not to be" would count 'to' and 'be' once each, not twice each for purposes of overall frequency.
Here is the current query, which also strips out common words such as pronouns and replaces all of the commonly occurring separators with spaces. It's a bit old so I suspect it could be a lot neater.
SELECT sep.Col Phrase, count(*) as Qty
FROM (
Select * FROM (
Select value = Upper(RTrim(LTrim(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Title, ',', ' '), '.', ' '), '!', ' '), '+', ' '), ':', ' '), '-', ' '), ';', ' '), '(', ' '), ')', ' '), '/', ' '), '&', ''), '?', ' '), '  ', ' '), '  ', ' '))))
FROM Table
) easyValues
Where value <> ''
) actualValues
Cross Apply dbo.SeparateValues(value, ' ') sep
WHERE sep.Col not in ('', 'THE', 'A', 'AN', 'WHO', 'BOOK', 'AND', 'FOR', 'ON', 'HAVE', 'YOUR', 'HOW', 'WE', 'IN', 'I', 'IT', 'BY', 'SO', 'THEIR', 'IS', 'OR', 'HE', 'OF', 'WHAT'
, 'HIM', 'HIS', 'SHE', 'HER', 'MY', 'FROM', 'US', 'OUR', 'AT', 'ALL', 'BE', 'OF', 'TO', 'YOU', 'WITH', 'THAT', 'THIS', 'WAS', 'ARE', 'THERE', 'BUT', 'HAS'
, '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', 'WILL', 'MORE', 'DIV', 'THAN', 'EACH', 'GET', 'ANY')
and LEN(sep.Col) > 2
GROUP By sep.Col
HAVING count(*) > 1
Appreciate any thoughts on a better way to do this while fixing the issue of repeat words.

You just need to GROUP BY twice.
First by sep.Col and Table.ID to remove duplicates in a row. Your table has some ID column, right?
Second, just by sep.Col to get the final count.
I have also rewritten your query using CTEs to make it more readable; at least, to me it reads better this way.
WITH
easyValues
AS
(
Select
ID
,value = Upper(RTrim(LTrim(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Replace(Title, ',', ' '), '.', ' '), '!', ' '), '+', ' '), ':', ' '), '-', ' '), ';', ' '), '(', ' '), ')', ' '), '/', ' '), '&', ''), '?', ' '), '  ', ' '), '  ', ' '))))
FROM Table
)
,actualValues
AS
(
SELECT
ID
,Value
FROM easyValues
Where value <> ''
)
,SeparateValues
AS
(
SELECT
ID
,sep.Col
FROM
actualValues
Cross Apply dbo.SeparateValues(value, ' ') AS sep
WHERE
sep.Col not in ('', 'THE', 'A', 'AN', 'WHO', 'BOOK', 'AND', 'FOR', 'ON', 'HAVE', 'YOUR', 'HOW', 'WE', 'IN', 'I', 'IT', 'BY', 'SO', 'THEIR', 'IS', 'OR', 'HE', 'OF', 'WHAT'
, 'HIM', 'HIS', 'SHE', 'HER', 'MY', 'FROM', 'US', 'OUR', 'AT', 'ALL', 'BE', 'OF', 'TO', 'YOU', 'WITH', 'THAT', 'THIS', 'WAS', 'ARE', 'THERE', 'BUT', 'HAS'
, '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', 'WILL', 'MORE', 'DIV', 'THAN', 'EACH', 'GET', 'ANY')
and LEN(sep.Col) > 2
)
,UniqueValues
AS
(
SELECT
ID, Col
FROM
SeparateValues
GROUP BY
ID, Col
)
SELECT
Col AS Phrase
,count(*) as Qty
FROM UniqueValues
GROUP By Col
HAVING count(*) > 1
;
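As a side note, you can fold the UniqueValues step away with COUNT(DISTINCT ...): keep the same CTE chain up to SeparateValues and finish with this instead (a sketch):
SELECT
Col AS Phrase
,count(DISTINCT ID) as Qty
FROM SeparateValues
GROUP By Col
HAVING count(DISTINCT ID) > 1
;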

As far as I can tell, the STRING_SPLIT function along with CROSS APPLY can give you what you want. You can split the string on the space delimiter, select each word distinctly, then count in an outer query. I omitted the part where you exclude specific words, for brevity.
Fiddle:
CREATE TABLE phrases(phrase NVARCHAR(MAX));
INSERT INTO phrases(phrase)VALUES(N'To be or not to be'),(N'this is not a phrase'),(N'And why is this not another one');
SELECT
w.value,
COUNT(*)
FROM
phrases AS p
CROSS APPLY (
SELECT DISTINCT
value
FROM
STRING_SPLIT(p.phrase,N' ')
) AS w
GROUP BY
w.value;
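For reference, against the three sample rows (and assuming the default case-insensitive collation, so 'To' and 'to' dedupe within a row), this should return not = 3, is = 2, this = 2, and 1 for each remaining word.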

To achieve your requirement, you can use a FUNCTION to split a string into a list of words on the ' ' (space) delimiter. With the help of this function, you can then use a CURSOR to get the final count.
First, create the FUNCTION as follows.
Source of code: Stack Overflow
CREATE FUNCTION dbo.splitstring ( @stringToSplit VARCHAR(MAX) )
RETURNS @returnList TABLE ([Word] [nvarchar] (500))
AS
BEGIN
DECLARE @name NVARCHAR(255)
DECLARE @pos INT
WHILE CHARINDEX(' ', @stringToSplit) > 0
BEGIN
SELECT @pos = CHARINDEX(' ', @stringToSplit)
SELECT @name = SUBSTRING(@stringToSplit, 1, @pos-1)
INSERT INTO @returnList
SELECT @name
SELECT @stringToSplit = SUBSTRING(@stringToSplit, @pos+1, LEN(@stringToSplit)-@pos)
END
INSERT INTO @returnList
SELECT @stringToSplit
RETURN
END
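Before wiring it into the cursor, you can sanity-check the function on its own, for example:
SELECT Word FROM dbo.splitstring('To be or not to be')
-- returns one row per space-delimited token: To, be, or, not, to, be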
Then use this CURSOR script to get your final output:
DECLARE @Value VARCHAR(MAX)
DECLARE @WordList TABLE
(
Word VARCHAR(200)
)
DECLARE db_cursor CURSOR
FOR
SELECT Upper(RTrim(LTrim(Replace(Replace(Replace(Replace(Replace
(Replace(Replace(Replace(Replace(Replace(Replace(Replace
(Replace(Replace(title, ',', ' '), '.', ' '), '!', ' '), '+', ' '), ':', ' '), '-', ' '), ';', ' ')
, '(', ' '), ')', ' '), '/', ' '), '&', ''), '?', ' '), '  ', ' '), '  ', ' ')))) [Value]
FROM table
OPEN db_cursor
FETCH NEXT FROM db_cursor INTO @Value
WHILE @@FETCH_STATUS = 0
BEGIN
INSERT INTO @WordList
SELECT DISTINCT Word FROM [dbo].[splitstring](@Value)
WHERE Word NOT IN ('', 'THE', 'A', 'AN', 'WHO', 'BOOK', 'AND', 'FOR', 'ON', 'HAVE', 'YOUR', 'HOW', 'WE', 'IN', 'I', 'IT', 'BY', 'SO', 'THEIR', 'IS', 'OR', 'HE', 'OF', 'WHAT'
, 'HIM', 'HIS', 'SHE', 'HER', 'MY', 'FROM', 'US', 'OUR', 'AT', 'ALL', 'BE', 'OF', 'TO', 'YOU', 'WITH', 'THAT', 'THIS', 'WAS', 'ARE', 'THERE', 'BUT', 'HAS'
, '1', '2', '3', '4', '5', '6', '7', '8', '9', '0', 'WILL', 'MORE', 'DIV', 'THAN', 'EACH', 'GET', 'ANY')
AND LEN(Word) > 2
FETCH NEXT FROM db_cursor INTO @Value
END
CLOSE db_cursor
DEALLOCATE db_cursor
SELECT Word, COUNT(*) AS Counts
FROM @WordList
GROUP BY Word

Related

Extract words from a column and count frequency

Does anyone know if there's an efficient way to extract all the words from a single column and count the frequency of each word in SQL Server? I only have read-only access to my database, so I can't create a user-defined function to do this.
Here's a reproducible example:
CREATE TABLE words
(
id INT PRIMARY KEY,
text_column VARCHAR(1000)
);
INSERT INTO words (id, text_column)
VALUES
(1, 'SQL Server is a popular database management system'),
(2, 'It is widely used for data storage and retrieval'),
(3, 'SQL Server is a powerful tool for data analysis');
I have found this code but it's not working correctly, and I think it's too complicated to understand:
WITH E1(N) AS
(
SELECT 1
FROM (VALUES
(1),(1),(1),(1),(1),(1),(1),(1),(1),(1)
) t(N)
),
E2(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b),
E4(N) AS (SELECT 1 FROM E2 a CROSS JOIN E2 b)
SELECT
LOWER(x.Item) AS [Word],
COUNT(*) AS [Counts]
FROM
(SELECT * FROM words) a
CROSS APPLY
(SELECT
ItemNumber = ROW_NUMBER() OVER(ORDER BY l.N1),
Item = LTRIM(RTRIM(SUBSTRING(a.text_column, l.N1, l.L1)))
FROM
(SELECT
s.N1,
L1 = ISNULL(NULLIF(CHARINDEX(' ',a.text_column,s.N1),0)-s.N1,4000)
FROM
(SELECT 1
UNION ALL
SELECT t.N+1
FROM
(SELECT TOP (ISNULL(DATALENGTH(a.text_column)/2,0))
ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM E4) t(N)
WHERE SUBSTRING(a.text_column ,t.N,1) = ' '
) s(N1)
) l(N1, L1)
) x
WHERE
x.item <> ''
AND x.Item NOT IN ('0o', '0s', '3a', '3b', '3d', '6b', '6o', 'a', 'a1', 'a2', 'a3', 'a4', 'ab', 'able', 'about', 'above', 'abst', 'ac', 'accordance', 'according', 'accordingly', 'across', 'act', 'actually', 'ad', 'added', 'adj', 'ae', 'af', 'affected', 'affecting', 'affects', 'after', 'afterwards', 'ag', 'again', 'against', 'ah', 'ain', 'ain''t', 'aj', 'al', 'all', 'allow', 'allows', 'almost', 'alone', 'along', 'already', 'also', 'although', 'always', 'am', 'among', 'amongst', 'amoungst', 'amount', 'an', 'and', 'announce', 'another', 'any', 'anybody', 'anyhow', 'anymore', 'anyone', 'anything', 'anyway', 'anyways', 'anywhere', 'ao', 'ap', 'apart', 'apparently', 'appear', 'appreciate', 'appropriate', 'approximately', 'ar', 'are', 'aren', 'arent', 'aren''t', 'arise', 'around', 'as', 'a''s', 'aside', 'ask', 'asking', 'associated', 'at', 'au', 'auth', 'av', 'available', 'aw', 'away', 'awfully', 'ax', 'ay', 'az', 'b', 'b1', 'b2', 'b3', 'ba', 'back', 'bc', 'bd', 'be', 'became', 'because', 'become', 'becomes', 'becoming', 'been', 'before', 'beforehand', 'begin', 'beginning', 'beginnings', 'begins', 'behind', 'being', 'believe', 'below', 'beside', 'besides', 'best', 'better', 'between', 'beyond', 'bi', 'bill', 'biol', 'bj', 'bk', 'bl', 'bn', 'both', 'bottom', 'bp', 'br', 'brief', 'briefly', 'bs', 'bt', 'bu', 'but', 'bx', 'by', 'c', 'c1', 'c2', 'c3', 'ca', 'call', 'came', 'can', 'cannot', 'cant', 'can''t', 'cause', 'causes', 'cc', 'cd', 'ce', 'certain', 'certainly', 'cf', 'cg', 'ch', 'changes', 'ci', 'cit', 'cj', 'cl', 'clearly', 'cm', 'c''mon', 'cn', 'co', 'com', 'come', 'comes', 'con', 'concerning', 'consequently', 'consider', 'considering', 'contain', 'containing', 'contains', 'corresponding', 'could', 'couldn', 'couldnt', 'couldn''t', 'course', 'cp', 'cq', 'cr', 'cry', 'cs', 'c''s', 'ct', 'cu', 'currently', 'cv', 'cx', 'cy', 'cz', 'd', 'd2', 'da', 'date', 'dc', 'dd', 'de', 'definitely', 'describe', 'described', 'despite', 'detail', 'df', 'di', 'did', 'didn', 'didn''t', 'different', 'dj', 'dk', 'dl', 'do', 'does', 'doesn', 'doesn''t', 'doing', 'don', 'done', 'don''t', 'down', 'downwards', 'dp', 'dr', 'ds', 'dt', 'du', 'due', 'during', 'dx', 'dy', 'e', 'e2', 'e3', 'ea', 'each', 'ec', 'ed', 'edu', 'ee', 'ef', 'effect', 'eg', 'ei', 'eight', 'eighty', 'either', 'ej', 'el', 'eleven', 'else', 'elsewhere', 'em', 'empty', 'en', 'end', 'ending', 'enough', 'entirely', 'eo', 'ep', 'eq', 'er', 'es', 'especially', 'est', 'et', 'et-al', 'etc', 'eu', 'ev', 'even', 'ever', 'every', 'everybody', 'everyone', 'everything', 'everywhere', 'ex', 'exactly', 'example', 'except', 'ey', 'f', 'f2', 'fa', 'far', 'fc', 'few', 'ff', 'fi', 'fifteen', 'fifth', 'fify', 'fill', 'find', 'fire', 'first', 'five', 'fix', 'fj', 'fl', 'fn', 'fo', 'followed', 'following', 'follows', 'for', 'former', 'formerly', 'forth', 'forty', 'found', 'four', 'fr', 'from', 'front', 'fs', 'ft', 'fu', 'full', 'further', 'furthermore', 'fy', 'g', 'ga', 'gave', 'ge', 'get', 'gets', 'getting', 'gi', 'give', 'given', 'gives', 'giving', 'gj', 'gl', 'go', 'goes', 'going', 'gone', 'got', 'gotten', 'gr', 'greetings', 'gs', 'gy', 'h', 'h2', 'h3', 'had', 'hadn', 'hadn''t', 'happens', 'hardly', 'has', 'hasn', 'hasnt', 'hasn''t', 'have', 'haven', 'haven''t', 'having', 'he', 'hed', 'he''d', 'he''ll', 'hello', 'help', 'hence', 'her', 'here', 'hereafter', 'hereby', 'herein', 'heres', 'here''s', 'hereupon', 'hers', 'herself', 'hes', 'he''s', 'hh', 'hi', 'hid', 'him', 'himself', 'his', 'hither', 'hj', 'ho', 'home', 'hopefully', 'how', 'howbeit', 'however', 'how''s', 
'hr', 'hs', 'http', 'hu', 'hundred', 'hy', 'i', 'i2', 'i3', 'i4', 'i6', 'i7', 'i8', 'ia', 'ib', 'ibid', 'ic', 'id', 'i''d', 'ie', 'if', 'ig', 'ignored', 'ih', 'ii', 'ij', 'il', 'i''ll', 'im', 'i''m', 'immediate', 'immediately', 'importance', 'important', 'in', 'inasmuch', 'inc', 'indeed', 'index', 'indicate', 'indicated', 'indicates', 'information', 'inner', 'insofar', 'instead', 'interest', 'into', 'invention', 'inward', 'io', 'ip', 'iq', 'ir', 'is', 'isn', 'isn''t', 'it', 'itd', 'it''d', 'it''ll', 'its', 'it''s', 'itself', 'iv', 'i''ve', 'ix', 'iy', 'iz', 'j', 'jj', 'jr', 'js', 'jt', 'ju', 'just', 'k', 'ke', 'keep', 'keeps', 'kept', 'kg', 'kj', 'km', 'know', 'known', 'knows', 'ko', 'l', 'l2', 'la', 'largely', 'last', 'lately', 'later', 'latter', 'latterly', 'lb', 'lc', 'le', 'least', 'les', 'less', 'lest', 'let', 'lets', 'let''s', 'lf', 'like', 'liked', 'likely', 'line', 'little', 'lj', 'll', 'll', 'ln', 'lo', 'look', 'looking', 'looks', 'los', 'lr', 'ls', 'lt', 'ltd', 'm', 'm2', 'ma', 'made', 'mainly', 'make', 'makes', 'many', 'may', 'maybe', 'me', 'mean', 'means', 'meantime', 'meanwhile', 'merely', 'mg', 'might', 'mightn', 'mightn''t', 'mill', 'million', 'mine', 'miss', 'ml', 'mn', 'mo', 'more', 'moreover', 'most', 'mostly', 'move', 'mr', 'mrs', 'ms', 'mt', 'mu', 'much', 'mug', 'must', 'mustn', 'mustn''t', 'my', 'myself', 'n', 'n2', 'na', 'name', 'namely', 'nay', 'nc', 'nd', 'ne', 'near', 'nearly', 'necessarily', 'necessary', 'need', 'needn', 'needn''t', 'needs', 'neither', 'never', 'nevertheless', 'new', 'next', 'ng', 'ni', 'nine', 'ninety', 'nj', 'nl', 'nn', 'no', 'nobody', 'non', 'none', 'nonetheless', 'noone', 'nor', 'normally', 'nos', 'not', 'noted', 'nothing', 'novel', 'now', 'nowhere', 'nr', 'ns', 'nt', 'ny', 'o', 'oa', 'ob', 'obtain', 'obtained', 'obviously', 'oc', 'od', 'of', 'off', 'often', 'og', 'oh', 'oi', 'oj', 'ok', 'okay', 'ol', 'old', 'om', 'omitted', 'on', 'once', 'one', 'ones', 'only', 'onto', 'oo', 'op', 'oq', 'or', 'ord', 'os', 'ot', 'other', 'others', 'otherwise', 'ou', 'ought', 'our', 'ours', 'ourselves', 'out', 'outside', 'over', 'overall', 'ow', 'owing', 'own', 'ox', 'oz', 'p', 'p1', 'p2', 'p3', 'page', 'pagecount', 'pages', 'par', 'part', 'particular', 'particularly', 'pas', 'past', 'pc', 'pd', 'pe', 'per', 'perhaps', 'pf', 'ph', 'pi', 'pj', 'pk', 'pl', 'placed', 'please', 'plus', 'pm', 'pn', 'po', 'poorly', 'possible', 'possibly', 'potentially', 'pp', 'pq', 'pr', 'predominantly', 'present', 'presumably', 'previously', 'primarily', 'probably', 'promptly', 'proud', 'provides', 'ps', 'pt', 'pu', 'put', 'py', 'q', 'qj', 'qu', 'que', 'quickly', 'quite', 'qv', 'r', 'r2', 'ra', 'ran', 'rather', 'rc', 'rd', 're', 'readily', 'really', 'reasonably', 'recent', 'recently', 'ref', 'refs', 'regarding', 'regardless', 'regards', 'related', 'relatively', 'research', 'research-articl', 'respectively', 'resulted', 'resulting', 'results', 'rf', 'rh', 'ri', 'right', 'rj', 'rl', 'rm', 'rn', 'ro', 'rq', 'rr', 'rs', 'rt', 'ru', 'run', 'rv', 'ry', 's', 's2', 'sa', 'said', 'same', 'saw', 'say', 'saying', 'says', 'sc', 'sd', 'se', 'sec', 'second', 'secondly', 'section', 'see', 'seeing', 'seem', 'seemed', 'seeming', 'seems', 'seen', 'self', 'selves', 'sensible', 'sent', 'serious', 'seriously', 'seven', 'several', 'sf', 'shall', 'shan', 'shan''t', 'she', 'shed', 'she''d', 'she''ll', 'shes', 'she''s', 'should', 'shouldn', 'shouldn''t', 'should''ve', 'show', 'showed', 'shown', 'showns', 'shows', 'si', 'side', 'significant', 'significantly', 'similar', 'similarly', 'since', 'sincere', 'six', 
'sixty', 'sj', 'sl', 'slightly', 'sm', 'sn', 'so', 'some', 'somebody', 'somehow', 'someone', 'somethan', 'something', 'sometime', 'sometimes', 'somewhat', 'somewhere', 'soon', 'sorry', 'sp', 'specifically', 'specified', 'specify', 'specifying', 'sq', 'sr', 'ss', 'st', 'still', 'stop', 'strongly', 'sub', 'substantially', 'successfully', 'such', 'sufficiently', 'suggest', 'sup', 'sure', 'sy', 'system', 'sz', 't', 't1', 't2', 't3', 'take', 'taken', 'taking', 'tb', 'tc', 'td', 'te', 'tell', 'ten', 'tends', 'tf', 'th', 'than', 'thank', 'thanks', 'thanx', 'that', 'that''ll', 'thats', 'that''s', 'that''ve', 'the', 'their', 'theirs', 'them', 'themselves', 'then', 'thence', 'there', 'thereafter', 'thereby', 'thered', 'therefore', 'therein', 'there''ll', 'thereof', 'therere', 'theres', 'there''s', 'thereto', 'thereupon', 'there''ve', 'these', 'they', 'theyd', 'they''d', 'they''ll', 'theyre', 'they''re', 'they''ve', 'thickv', 'thin', 'think', 'third', 'this', 'thorough', 'thoroughly', 'those', 'thou', 'though', 'thoughh', 'thousand', 'three', 'throug', 'through', 'throughout', 'thru', 'thus', 'ti', 'til', 'tip', 'tj', 'tl', 'tm', 'tn', 'to', 'together', 'too', 'took', 'top', 'toward', 'towards', 'tp', 'tq', 'tr', 'tried', 'tries', 'truly', 'try', 'trying', 'ts', 't''s', 'tt', 'tv', 'twelve', 'twenty', 'twice', 'two', 'tx', 'u', 'u201d', 'ue', 'ui', 'uj', 'uk', 'um', 'un', 'under', 'unfortunately', 'unless', 'unlike', 'unlikely', 'until', 'unto', 'uo', 'up', 'upon', 'ups', 'ur', 'us', 'use', 'used', 'useful', 'usefully', 'usefulness', 'uses', 'using', 'usually', 'ut', 'v', 'va', 'value', 'various', 'vd', 've', 've', 'very', 'via', 'viz', 'vj', 'vo', 'vol', 'vols', 'volumtype', 'vq', 'vs', 'vt', 'vu', 'w', 'wa', 'want', 'wants', 'was', 'wasn', 'wasnt', 'wasn''t', 'way', 'we', 'wed', 'we''d', 'welcome', 'well', 'we''ll', 'well-b', 'went', 'were', 'we''re', 'weren', 'werent', 'weren''t', 'we''ve', 'what', 'whatever', 'what''ll', 'whats', 'what''s', 'when', 'whence', 'whenever', 'when''s', 'where', 'whereafter', 'whereas', 'whereby', 'wherein', 'wheres', 'where''s', 'whereupon', 'wherever', 'whether', 'which', 'while', 'whim', 'whither', 'who', 'whod', 'whoever', 'whole', 'who''ll', 'whom', 'whomever', 'whos', 'who''s', 'whose', 'why', 'why''s', 'wi', 'widely', 'will', 'willing', 'wish', 'with', 'within', 'without', 'wo', 'won', 'wonder', 'wont', 'won''t', 'words', 'world', 'would', 'wouldn', 'wouldnt', 'wouldn''t', 'www', 'x', 'x1', 'x2', 'x3', 'xf', 'xi', 'xj', 'xk', 'xl', 'xn', 'xo', 'xs', 'xt', 'xv', 'xx', 'y', 'y2', 'yes', 'yet', 'yj', 'yl', 'you', 'youd', 'you''d', 'you''ll', 'your', 'youre', 'you''re', 'yours', 'yourself', 'yourselves', 'you''ve', 'yr', 'ys', 'yt', 'z', 'zero', 'zi', 'zz')
GROUP BY x.Item
ORDER BY COUNT(*) DESC
Here's the result of the above code; as you can see, it's not counting correctly:
Word Counts
server 2
sql 2
data 1
database 1
popular 1
powerful 1
Can anyone help with this? It would be really appreciated!
You can make use of String_split here, such as
select value Word, Count(*) Counts
from words
cross apply String_Split(text_column, ' ')
where value not in(exclude list)
group by value
order by counts desc;
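For reference, before the exclude list is applied (and assuming the default case-insensitive collation), the sample rows yield is = 3; sql, server, a, for, and data at 2 each; and 1 for every other word. Note that data now correctly comes out as 2.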
You should use the STRING_SPLIT function, like this:
SELECT id, value as aword
FROM words
CROSS APPLY STRING_SPLIT(text_column, ' ');
This will create a table with all the words by id -- to get the count, do this:
SELECT aword, count(*) as counts
FROM (
SELECT id, value as aword
FROM words
CROSS APPLY STRING_SPLIT(text_column, ' ')
) x
GROUP BY aword
You may need to lower-case the column with LOWER(text_column) if you want case to not matter.
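For instance, a sketch with LOWER applied (table and column names from the question), so that 'SQL' and 'sql' group together even under a case-sensitive collation:
SELECT aword, count(*) as counts
FROM (
SELECT id, LOWER(value) as aword
FROM words
CROSS APPLY STRING_SPLIT(text_column, ' ')
) x
GROUP BY aword
ORDER BY counts DESC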
If you don't have access to the STRING_SPLIT function, you can use a weird XML trick to convert each space into a word-node boundary and then shred it with the nodes() function:
select word, COUNT(*)
from (
select n.value('.', 'nvarchar(50)') AS word
from (
VALUES
(1, 'SQL Server is a popular database management system'),
(2, 'It is widely used for data storage and retrieval'),
(3, 'SQL Server is a powerful tool for data analysis')
) AS t (id, txt)
CROSS APPLY (
SELECT CAST('<x>' + REPLACE(txt, ' ', '</x><x>') + '</x>' AS XML) x
) x
CROSS APPLY x.nodes('x') z(n)
) w
GROUP BY word
Of course, this will fail on "bad" words and invalid XML characters, but it can be worked around. Text processing has never been SQL Server's strong point, though, so it is probably better to use an NLP library for this kind of work.
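As a sketch, one way to harden it is to escape the three XML-special characters before the CAST, by replacing the CROSS APPLY projection with:
SELECT CAST('<x>' + REPLACE(REPLACE(REPLACE(REPLACE(txt, '&', '&amp;'), '<', '&lt;'), '>', '&gt;'), ' ', '</x><x>') + '</x>' AS XML) x
-- the entities are un-escaped again when each node is read back with value()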

How to remove punctuation within a string?

I am doing text cleaning for my pandas dataframe
This is a string from my description column before punctuation is removed:
['dedicated', 'to', 'support', 'the', 'fast-paced', 'technology',
'lifestyle', 'needs', 'of', 'today', '’', 's', 'modern', 'society',
'.', 'gadget', 'mix', 'have', 'the', 'benefit', 'of', '“',
'efficient', 'life', 'â€', 'tied', 'to', 'the', 'products', 'and',
'services', 'they', 'provide', '.']
This is what the string looks like after I applied the code below:
['dedicated', 'to', 'support', 'the', 'fast-paced', 'technology',
'lifestyle', 'needs', 'of', 'today', '’', 's', 'modern', 'society',
'gadget', 'mix', 'have', 'the', 'benefit', 'of', '“', 'efficient',
'life', 'â€', 'tied', 'to', 'the', 'products', 'and', 'services',
'they', 'provide']
This is my code:
#removing punctuation
import string
punc=string.punctuation
updated_mall['Cleansed_description']=updated_mall['Cleansed_description'].apply(lambda x: [word for word in x if word not in punc])
updated_mall.head(105)
This code did remove punctuation, except in words like "fast-paced", "...", and "restaurant/catering".
Other than that, after punctuation removal and lower-casing, a word like "Asia's" became 'asia' and "'s".
I was told that this only checks whether an entire token is punctuation, instead of checking every character of each word for punctuation.
Can you try the below code, using regex:
import re
updated_mall['Cleansed_description']=updated_mall['Cleansed_description'].apply(lambda x: [re.sub(r'[^\w\d\s]', ' ', word.lower()) for word in x])
updated_mall.head(105)

How to convert SQL to dataframe

I want to convert this SQL to a dataframe.
SELECT day,
MAX(id),
MAX(if(device = 'Mobile devices with full browsers', 'mobile', 'pc')),
AVG(replace(replace(search_imprshare, '< 10%', '10'), '%', '') / 100),
REPLACE(SUBSTRING(SUBSTRING_INDEX(add_trackingcode, '_', 1), CHAR_LENGTH(SUBSTRING_INDEX(add_trackingcode, '_', 1 - 1)) + 2), add_trackingcode, '')
FROM MY_TEST_TABLE
GROUP BY day
But I have only gotten as far as the code below.
I don't know how to fill in the '???' parts.
df_data = df_data.groupby(['day']).agg(
{
'id': np.max,
'device ' : ???,
'percent' : ???,
'tracking' : ???
}
)
How should I do it?

How do I update a column (selmanufacturers_id) from the query?

SELECT
'S. J. & G. FAZUL ELLAHIE (PVT) LTD.' AS CompanyName,
'E - 46, S.I.T.E., KARACHI - 75700' AS CompanyAddress,
'admin' login_user, 'crm.sjg.local' selhost,
'admin' sellogin, 'sr_salereport' seltoprint,
'SALES REPORT' selreportTitle, '' selflagType,
'SI' selreportType, '' selfrmDate, '' seltoDate,
'' selfromProductId, '' selac_fromProductId_ac,
'' selbrandId, '' selcustomerId, '' selac_customerId_ac,
'' selgodownId, '' selsmId, '' selsalesmanId, '' selcity,
'' selarea, '' selorders_status, 'S' seldst,
'Invoice' selgroup1, '1' selcurrencyid, 'id' seltype,
'D' seldefUnit, '' selmanufacturers_id, 'Report' selyt0,
CAST(docTypeId as varchar(20)) + CAST(documentId as varchar(20)) groupId1,
[dbo].[ITL_DATE_TO_STRING_FOR_SORTING](refDate) groupTitle1,
'' groupId2, '' selgroup2,'' groupTitle2, *
FROM
[dbo].SR_Sale(null, null, 0, 0, 0, 0, N'', N'', 0, N'SI', 0, 0, 1, N'id', N'D') AS TR
WHERE
docType IN ('SI', 'GYM RECEIPT')
This is a general update statement:
UPDATE dbo.SR_Sale
SET selmanufacturers_id = 'xxx'
WHERE (TODO: add where-clause here to specify which row needs to be updated)
Assuming selmanufacturers_id does not exist in the [dbo].SR_Sale function, unlike products_id (which does), you can select the result into a temp table and update it there:
IF (OBJECT_ID('tempdb..#tmp_SR_SALE') IS NOT NULL )
BEGIN
DROP TABLE #tmp_SR_SALE
END
SELECT 'S. J. & G. FAZUL ELLAHIE (PVT) LTD.' AS CompanyName ,
'E - 46, S.I.T.E., KARACHI - 75700' AS CompanyAddress ,
'admin' login_user ,
'crm.sjg.local' selhost ,
'admin' sellogin ,
'sr_salereport' seltoprint ,
'SALES REPORT' selreportTitle ,
'' selflagType ,
'SI' selreportType ,
'' selfrmDate ,
'' seltoDate ,
'' selfromProductId ,
'' selac_fromProductId_ac ,
'' selbrandId ,
'' selcustomerId ,
'' selac_customerId_ac ,
'' selgodownId ,
'' selsmId ,
'' selsalesmanId ,
'' selcity ,
'' selarea ,
'' selorders_status ,
'S' seldst ,
'Invoice' selgroup1 ,
'1' selcurrencyid ,
'id' seltype ,
'D' seldefUnit ,
'' selmanufacturers_id ,
'Report' selyt0 ,
CAST(docTypeId AS VARCHAR(20)) + CAST(documentId AS VARCHAR(20)) groupId1 ,
[dbo].[ITL_DATE_TO_STRING_FOR_SORTING](refDate) groupTitle1 ,
'' groupId2 ,
'' selgroup2 ,
'' groupTitle2 ,
*
INTO #tmp_SR_SALE FROM [dbo].SR_Sale(NULL, NULL, 0, 0, 0, 0, N'', N'', 0, N'SI', 0, 0, 1,
N'id', N'D') AS TR
WHERE docType IN ( 'SI', 'GYM RECEIPT' )
UPDATE #tmp_SR_SALE
SET selmanufacturers_id = 'value here'
WHERE products_id = 9031
SELECT * FROM #tmp_SR_SALE

How to avoid sub-queries to gain performance

I have a reporting query which has two long sub-queries:
SELECT r1.code_centre, r1.libelle_centre, r1.id_equipe, r1.equipe, r1.id_file_attente,
r1.libelle_file_attente,r1.id_date, r1.tranche, r1.id_granularite_de_periode,r1.granularite,
r1.ContactsTraites, r1.ContactsenParcage, r1.ContactsenComm, r1.DureeTraitementContacts,
r1.DureeComm, r1.DureeParcage, r2.AgentsConnectes, r2.DureeConnexion, r2.DureeTraitementAgents,
r2.DureePostTraitement
FROM
( SELECT cc.id_centre_contact, cc.code_centre, cc.libelle_centre, a.id_equipe, a.equipe,
a.id_file_attente, f.libelle_file_attente, a.id_date, g.tranche, g.id_granularite_de_periode,
g.granularite, sum(Nb_Contacts_Traites) as ContactsTraites,
sum(Nb_Contacts_en_Parcage) as ContactsenParcage,
sum(Nb_Contacts_en_Communication) as ContactsenComm,
sum(Duree_Traitement/1000) as DureeTraitementContacts,
sum(Duree_Communication / 1000 + Duree_Conference / 1000 + Duree_Com_Interagent / 1000) as DureeComm,
sum(Duree_Parcage/1000) as DureeParcage
FROM agr_synthese_activite_media_fa_agent a, centre_contact cc,
direction_contact dc, granularite_de_periode g, media m, file_attente f
WHERE m.id_media = a.id_media
AND cc.id_centre_contact = a.id_centre_contact
AND a.id_direction_contact = dc.id_direction_contact
AND dc.direction_contact ='INCOMING'
AND a.id_file_attente = f.id_file_attente
AND m.media = 'PHONE'
AND ( ( g.valeur_min = date_format(a.id_date,'%d/%m') and g.granularite = 'Jour')
or ( g.granularite = 'Heure' and a.id_th_heure = g.id_granularite_de_periode) )
GROUP by cc.id_centre_contact, a.id_equipe, a.id_file_attente, a.id_date, g.tranche,
g.id_granularite_de_periode) r1,
(
(SELECT cc.id_centre_contact,cc.code_centre, cc.libelle_centre, a.id_equipe, a.equipe,
a.id_date, g.tranche, g.id_granularite_de_periode,g.granularite,
count(distinct a.id_agent) as AgentsConnectes,
sum(Duree_Connexion / 1000) as DureeConnexion,
sum(Duree_en_Traitement / 1000) as DureeTraitementAgents,
sum(Duree_en_PostTraitement / 1000) as DureePostTraitement
FROM activite_agent a, centre_contact cc, granularite_de_periode g
WHERE ( g.valeur_min = date_format(a.id_date,'%d/%m') and g.granularite = 'Jour')
AND cc.id_centre_contact = a.id_centre_contact
GROUP BY cc.id_centre_contact, a.id_equipe, a.id_date, g.tranche, g.id_granularite_de_periode )
UNION
(SELECT cc.id_centre_contact,cc.code_centre, cc.libelle_centre, a.id_equipe, a.equipe,
a.id_date, g.tranche, g.id_granularite_de_periode,g.granularite,
count(distinct a.id_agent) as AgentsConnectes,
sum(Duree_Connexion / 1000) as DureeConnexion,
sum(Duree_en_Traitement / 1000) as DureeTraitementAgents,
sum(Duree_en_PostTraitement / 1000) as DureePostTraitement
FROM activite_agent a, centre_contact cc, granularite_de_periode g
WHERE ( g.granularite = 'Heure'
AND a.id_th_heure = g.id_granularite_de_periode)
AND cc.id_centre_contact = a.id_centre_contact
GROUP BY cc.id_centre_contact,a.id_equipe, a.id_date, g.tranche, g.id_granularite_de_periode)
) r2
WHERE r1.id_centre_contact = r2.id_centre_contact
AND r1.id_equipe = r2.id_equipe AND r1.id_date = r2.id_date
AND r1.tranche = r2.tranche AND r1.id_granularite_de_periode = r2.id_granularite_de_periode
GROUP BY r1.id_centre_contact , r1.id_equipe, r1.id_file_attente,
r1.id_date, r1.tranche, r1.id_granularite_de_periode
ORDER BY r1.code_centre, r1.libelle_centre, r1.equipe,
r1.libelle_file_attente, r1.id_date, r1.id_granularite_de_periode,r1.tranche
The EXPLAIN shows:
| id | select_type | table | type| possible_keys | key | key_len | ref| rows | Extra |
'1', 'PRIMARY', '<derived3>', 'ALL', NULL, NULL, NULL, NULL, '2520', 'Using temporary; Using filesort'
'1', 'PRIMARY', '<derived2>', 'ALL', NULL, NULL, NULL, NULL, '4378', 'Using where; Using join buffer'
'3', 'DERIVED', 'a', 'ALL', 'fk_Activite_Agent_centre_contact', NULL, NULL, NULL, '83433', 'Using temporary; Using filesort'
'3', 'DERIVED', 'g', 'ref', 'Index_granularite,Index_Valeur_min', 'Index_Valeur_min', '23', 'func', '1', 'Using where'
'3', 'DERIVED', 'cc', 'ALL', 'PRIMARY', NULL, NULL, NULL, '6', 'Using where; Using join buffer'
'4', 'UNION', 'g', 'ref', 'PRIMARY,Index_granularite', 'Index_granularite', '23', '', '24', 'Using where; Using temporary; Using filesort'
'4', 'UNION', 'a', 'ref', 'fk_Activite_Agent_centre_contact,fk_activite_agent_TH_heure', 'fk_activite_agent_TH_heure', '5', 'reporting_acd.g.Id_Granularite_de_periode', '2979', 'Using where'
'4', 'UNION', 'cc', 'ALL', 'PRIMARY', NULL, NULL, NULL, '6', 'Using where; Using join buffer'
NULL, 'UNION RESULT', '<union3,4>', 'ALL', NULL, NULL, NULL, NULL, NULL, ''
'2', 'DERIVED', 'g', 'range', 'PRIMARY,Index_granularite,Index_Valeur_min', 'Index_granularite', '23', NULL, '389', 'Using where; Using temporary; Using filesort'
'2', 'DERIVED', 'a', 'ALL', 'fk_agr_synthese_activite_media_fa_agent_centre_contact,fk_agr_synthese_activite_media_fa_agent_direction_contact,fk_agr_synthese_activite_media_fa_agent_file_attente,fk_agr_synthese_activite_media_fa_agent_media,fk_agr_synthese_activite_media_fa_agent_th_heure', NULL, NULL, NULL, '20903', 'Using where; Using join buffer'
'2', 'DERIVED', 'cc', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'reporting_acd.a.Id_Centre_Contact', '1', ''
'2', 'DERIVED', 'f', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'reporting_acd.a.Id_File_Attente', '1', ''
'2', 'DERIVED', 'dc', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'reporting_acd.a.Id_Direction_Contact', '1', 'Using where'
'2', 'DERIVED', 'm', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'reporting_acd.a.Id_Media', '1', 'Using where'
I don't understand the plan very well, but I think the problem is that it does full scans.
I then changed all the sub-queries to views (CREATE VIEW ... AS SELECT sub-query), and the result is the same.
Thanks for any advice.
Subqueries and views will most of the time give you the same speed.
If your subquery is not variable, consider creating a table that has the same structure as your view, and occasionally do:
truncate table my_table;
insert into my_table select * from my_view;
...to cache your subquery data. If the table is properly indexed, the time lost storing the subquery results is marginal, provided the data does not change that frequently, or at least you don't need up-to-date information on a second-to-second basis.
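A minimal sketch of that setup in MySQL, assuming the view is named my_view; the indexed columns here are only illustrative, so pick ones that match your actual joins:
-- one-time setup: clone the view's structure into an empty table, then index it
create table my_table as select * from my_view where 1 = 0;
alter table my_table add index idx_centre_equipe (id_centre_contact, id_equipe);
-- periodic refresh to re-cache the data
truncate table my_table;
insert into my_table select * from my_view;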