Hopefully everyone is having a productive lockdown all over the world. This is my second issue I wanted some assistance with today.
What I have is a chat from a telecom company signing up new customers.
I have successfully collapsed them into 2x rows per unique_id - a unique chat interaction captured between customer and company agent.
I would like to now take each column (text) in each row and separate it out to 5 equal varchar columns. The objective is to splice/chunk a conversation into 5 different stages within this table. I do not have access to delimiters, as customers and company staff use delimiting characters themselves, so it makes this tricky.
Below I have 2 images with what the data looks like now and what I am looking for.
BEFORE
AFTER
I have looked at the following articles to try to crack it, but am stuck:
Split A Single Field Value Into Multiple Fixed-Length Column Values in T-SQL
How to Split String by Character into Separate Columns in SQL Server
How to split a comma-separated value to columns
How to split a single column values to multiple column values?
Split string in SQL Server to a maximum length, returning each as a row
Here is the SQL Fiddle page, but I am running this code in MS SQL Server: http://sqlfiddle.com/#!9/ddd08c
Here is the table creation code:
CREATE TABLE Table1
(`unique_id` double, `user` varchar(8), `text` varchar(144))
;
INSERT INTO Table1
(`unique_id`, `user`, `text`)
VALUES
(50314585222, 'customer', 'This is part 1 of long text. This is part 2 of long text. This is part 3 of long text. This is part 4 of long text. This is part 5 of long text.'),
(50314585222, 'company', 'This is part 1 of long text This is part 2 of long text This is part 3 of long text This is part 4 of long text This is part 5 of long text'),
(50319875222, 'customer', 'This is part 1 This is part 2 This is part 3 This is part 4 This is part 5'),
(50319875222, 'company', 'This is part 1 This is part 2 This is part 3 This is part 4 This is part 5')
;
I have requested an almost similar algorithm in R, in my history. I have been trying to do this in SQL.
I have managed to solve this with the T-SQL statement below:
WITH DataSource AS
(
    SELECT *
          ,'\b.{1,' + CAST(CEILING(LEN([text]) * 1.0 / 5) AS VARCHAR(12)) + '}\b' AS [pattern]
    FROM Table1
), PreparedData AS
(
    SELECT unique_id
          ,[user]
          ,'text' + CAST(RM.matchID + 1 AS VARCHAR(12)) AS [column]
          ,RM.CaptureValue AS [value]
    FROM DataSource T
    CROSS APPLY [dbo].[fn_Utils_RegexMatches] ([text], [pattern]) RM
)
SELECT *
FROM PreparedData DS
PIVOT
(
    MAX([value]) FOR [column] IN ([text1], [text2], [text3], [text4], [text5])
) PVT;
To use this code, you need to implement SQL CLR function(s) for working with regular expressions in the context of T-SQL (you will need to invest some time in understanding how SQL CLR works); otherwise, you will not be able to use this solution.
So, given a RegexMatches function, the first step is to build a regular expression pattern for splitting the data:
SELECT *
      ,'\b.{1,' + CAST(CEILING(LEN([text]) * 1.0 / 5) AS VARCHAR(12)) + '}\b' AS [pattern]
FROM Table1;
The pattern has the form \b.{1,number}\b, where number is the length of the text divided by 5 and rounded up. It matches parts of the string up to number characters long without cutting words (check whether the word boundary works for you, because in some cases it won't).
Then, using our regex matches function, we get a result like this (the second common table expression):
And the data above is ready for pivoting which is pretty easy.
So, the notes are:
you need to implement Microsoft String Utility
you need to ensure the regex pattern works for you
you can split up the T-SQL I used, check the other columns of the regex function and even make the pivoting dynamic - the code is an example, and you need to modify/check it before using it in production
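For anyone who wants to sanity-check the pattern outside the database, here is a minimal Python sketch of the same idea. It assumes Python's `re` module treats `\b` and `.{1,n}` the same way as the .NET regex behind the CLR function for this simple pattern; the function name `split_into_chunks` is made up for illustration and is not part of the T-SQL above.

```python
import math
import re


def split_into_chunks(text, parts=5):
    """Split text into word-aligned pieces of at most ceil(len/parts) characters.

    Hypothetical helper: it mirrors the '\\b.{1,N}\\b' pattern built in the
    T-SQL, but it is only a sketch for checking the pattern's behaviour.
    """
    if not text:
        return []
    # Same limit as CEILING(LEN([text]) * 1.0 / 5) in the query.
    limit = math.ceil(len(text) / parts)
    pattern = r"\b.{1,%d}\b" % limit
    # Each match is one chunk; matches never start or end inside a word.
    return [match.group(0) for match in re.finditer(pattern, text)]


sample = ("This is part 1 This is part 2 This is part 3 "
          "This is part 4 This is part 5")
for number, chunk in enumerate(split_into_chunks(sample), start=1):
    print("text%d: %r" % (number, chunk))
```

Two things are worth checking on real chat data: the pattern can produce more than five matches when words do not line up with the limit (the pivot would ignore the extras), and a trailing `\b` does not match after final punctuation, so a closing full stop can be left out of the last chunk.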
How do you get a list of all unique words and their frequencies ("tally") in one text column of one table of an SQL database? An answer for any SQL dialect in which this is possible would be appreciated.
In case it helps, here's a one-liner that does it in Ruby using Sequel:
Hash[DB[:table].all.map{|r| r[:text]}.join("\n").gsub(/[\(\),]+/,'').downcase.strip.split(/\s+/).tally.sort_by{|_k,v| v}.reverse]
To give an example, for table table with 4 rows with each holding one line of Dr. Dre's Still D.R.E.'s refrain in a text field, the output would be:
{"the"=>4,
"still"=>3,
"them"=>3,
"for"=>2,
"d-r-e"=>1,
"it's"=>1,
"streets"=>1,
"love"=>1,
"got"=>1,
"i"=>1,
"and"=>1,
"beat"=>1,
"perfect"=>1,
"to"=>1,
"time"=>1,
"my"=>1,
"taking"=>1,
"girl"=>1,
"low-lows"=>1,
"in"=>1,
"corners"=>1,
"hitting"=>1,
"world"=>1,
"across"=>1,
"all"=>1,
"gangstas"=>1,
"representing"=>1,
"i'm"=>1}
This works, of course, but is naturally not as fast or elegant as it would be in pure SQL - and I have no clue whether that's even in the realm of possibilities...
I guess it would depend on what the SQL database looks like. You would first have to turn your 4-row "database" into a single column of data, with each row representing one word. To do that you could use something like STRING_SPLIT, with the space as the delimiter (note that the apostrophe inside the string literal has to be doubled).
SELECT value FROM STRING_SPLIT('I''m representing for them gangstas all across the world', ' ');
https://www.sqlservertutorial.net/sql-server-string-functions/sql-server-string_split-function/
This would turn it into a table where every word is a row.
Once you've set up your data table, then it's easy.
Your_table:
[Word]
I'm
representing
for
them
...
world
Then you can just write:
SELECT Word, count(*)
FROM your_table
GROUP BY Word;
Your output would be:
Word | Count
I'm 1
representing 1
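To make the two steps (split every line into words, then group and count) concrete, here is a small Python sketch of the same logic. It is not SQL and the name `tally_words` is made up for illustration; lowercasing mirrors the Ruby one-liner in the question, whereas in SQL the case handling of GROUP BY depends on the column's collation.

```python
from collections import Counter


def tally_words(lines):
    """Count how often each whitespace-separated word occurs across all lines."""
    counts = Counter()
    for line in lines:
        # Step 1: one "row" per word (the STRING_SPLIT step).
        words = line.lower().split()
        # Step 2: group identical words and count them (the GROUP BY step).
        counts.update(words)
    return counts


rows = [
    "I'm representing for them gangstas all across the world",
    "Still taking my time to perfect the beat",
]
for word, count in tally_words(rows).most_common(3):
    print(word, count)
```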
I had a play using XML in sql server. Just an idea :)
with cteXML as
(
select *
,cast('<wd>' + replace(songline,' ' ,'</wd><wd>') + '</wd>' as xml) as XMLsongline
from #tSongs
),cteBase as
(
select p.value('.','nvarchar(max)') as singleword
from cteXML as x
cross apply x.XMLsongline.nodes('/wd') t(p)
)
select b.singleword,count(b.singleword)
from cteBase as b
group by b.singleword
I have customer data with mobile phone numbers where '1' has been entered 10 times or more in a cell to bypass the customer onboarding system validation. For example '1111111111'
I used the condition below in my WHERE clause, but it didn't really help.
AND p.mobile_no LIKE '%[1111111111]%'
It is possible that users entered 1 any number of times in the new customer form to bypass validation. To find cells containing only 0s I used %[^0]% in the WHERE clause, and I was hoping to use something similar for 1s: regardless of how many times 1 has been entered in the field, as long as the field contains nothing but 1s, the query should pick out that row for me.
How can I find these instances in my data using a SQL query?
The goal is to find these anomalies and remove them.
Using: Microsoft SQL Server 2016 (SP2).
I think you are looking for the following, which tests if at least 1 '1' exists, and that no other characters exist.
select Number
from (values ('111'),('121'),('1-2'),('22')) x (Number)
-- Test that at least 1 '1' exists
where Number like '%1%'
-- And that no other allowable characters exist - expand to cover all options
and Number not like '%[0,2-9,-]%'
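The two conditions (at least one '1', and nothing that is not a '1') can be checked against the same sample values with a short Python sketch. This only illustrates the logic and is not something to run inside SQL Server; the function name `is_all_ones` is made up for the example.

```python
def is_all_ones(value):
    """Return True when value is non-empty and consists only of the character '1'."""
    # Non-empty covers "at least one '1' exists"; the set comparison covers
    # "no other characters exist".
    return len(value) > 0 and set(value) == {"1"}


for number in ["111", "121", "1-2", "22", "1111111111"]:
    print(number, is_all_ones(number))
```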
Using a table to define invalid phone numbers:
Declare @invalidPhoneNumbers Table (PhoneNumber char(10));
Insert Into @invalidPhoneNumbers (PhoneNumber)
Values ('0000000000'), ('1111111111'), ('2222222222'), ('3333333333'), ('4444444444')
, ('5555555555'), ('6666666666'), ('7777777777'), ('8888888888'), ('9999999999');
Select ...
From ...
Where ...
And p.mobile_no Not In (Select i.PhoneNumber From @invalidPhoneNumbers i)
Or - using NOT EXISTS which may perform better:
Declare @invalidPhoneNumbers Table (PhoneNumber char(10));
Insert Into @invalidPhoneNumbers (PhoneNumber)
Values ('0000000000'), ('1111111111'), ('2222222222'), ('3333333333'), ('4444444444')
, ('5555555555'), ('6666666666'), ('7777777777'), ('8888888888'), ('9999999999');
Select ...
From ...
Where ...
And Not Exists (Select * From @invalidPhoneNumbers i Where i.PhoneNumber = p.mobile_no)
When declaring the table - make sure the data type defined matches exactly the defined data type of p.mobile_no. This will make sure there are no implicit conversions that can cause issues.
I am trying to write a SQL query that only returns rows where a specific column (let's say 'amount' column) contains numbers comprising of only one digit, e.g. only '1's (1111111...) or only '2's (2222222...), etc.
In addition, 'amount' column contains numbers with decimal points as well and these kind of values should also be returned, e.g. 1111.11, 2222.22, etc
If you want to make the query generic, so that you don't have to specify each possible digit, you could change the WHERE clause to the following:
WHERE LEN(REPLACE(REPLACE(amount,LEFT(amount,1),''),'.','')) = 0
This will always use the first digit as comparison for the rest of the string
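The same "remove the first character everywhere, remove the decimal point, and see whether anything is left" test can be sketched in Python to try it on a few values. This assumes the amount is already available as text, as with the CAST to VARCHAR in the other answer; the function name is made up for illustration.

```python
def is_single_repeated_digit(amount_text):
    """Return True when amount_text uses only its first character plus '.'."""
    if not amount_text:
        # Assumption: an empty value should not count as a match.
        return False
    first = amount_text[0]
    # Mirror of REPLACE(REPLACE(amount, LEFT(amount, 1), ''), '.', '').
    remainder = amount_text.replace(first, "").replace(".", "")
    return len(remainder) == 0


for value in ["1111111", "2222.22", "1211", "1111.21"]:
    print(value, is_single_repeated_digit(value))
```

Like the SQL, this does not check that the first character is actually a digit, so a value such as 'aaaa' would also pass.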
If you are using SQL Server, then you can try this script:
SELECT *
FROM (
SELECT CAST(amount AS VARCHAR(30)) AS amount
FROM TableName
)t
WHERE LEN(REPLACE(REPLACE(amount,'1',''),'.','')) = 0 OR
LEN(REPLACE(REPLACE(amount,'2',''),'.','')) = 0
I tried it like this (replace 1111111 with the column name):
Select replace(Str(1111111, 12, 2),0,left(11111,1))
I have a pretty hard request to do in SQL Server 2008, but I'm not able to do the whole...
I have two kinds of records:
16HENFC******** (8 numbers after more 'FC')
16HEN******* (7 numbers after more 'EN')
I have to select the * (which are in fact numbers), and add a 0 at the beginning of the second form of record to just have 8 long selected values.
Then I have to insert the result in a empty table.
I think I did the first part which is :
SUBSTRING(SELECT mycolumn1 FROM mytable1 WHERE mycolumn1 LIKE '16HENFC%', 5, 8) ;
In summary,
I have those records in my column :
'16HENFC071052'
'16HEN5130026'
I want to select them and transform them to insert those ones in an other column :
'05130026'
'FC071052'
[EDIT]=>
CREATE TABLE nom_de_la_table
(
colonne1 VARCHAR(250),
colonne2 VARCHAR(250)
)
INSERT INTO nom_de_la_table (colonne1)
VALUES
('16HEN5138745'),
('16HENFC071052v2'),
('16HENFC78942878'),
('16HEN4830026'),
('16HEN7815934'),
('16HENFC74859422'),
('16HEN9687326'),
('16HENFC74889639'),
('16HEN9798556');
[etc...]
So two different types of records, and I want to insert the result of what you did first with just two records in an other column but for the 956 records of my table. And this is the result with the two examples :
'05130026'
'FC071052'
Left-Filling a string is a relatively easy request. Here's an example:
select right(replicate('0',8) + right(test,len(test)-len('16HEN')),8)
from (
select '16HENFC071052' as test
union all
select '16HEN5130026' as test
) z
Use replicate to left-fill your string with the amount of digits you wish to end up with. Append your desired string, in this case, slice your prefix off by taking the right X characters where X = len(target) - len(prefix). Finally, take the right characters of the whole string equal to your desired length.
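The three steps (slice off the prefix, prepend zeros, keep the rightmost 8 characters) can be followed in a short Python sketch using the two sample values. The function name and its default arguments are made up for illustration and simply restate the values from the query above.

```python
def strip_prefix_and_pad(value, prefix="16HEN", width=8):
    """Drop the prefix, then left-fill with zeros to exactly `width` characters."""
    # right(test, len(test) - len('16HEN')): everything after the prefix.
    remainder = value[len(prefix):]
    # replicate('0', 8) + remainder, then right(..., 8).
    return ("0" * width + remainder)[-width:]


for record in ["16HENFC071052", "16HEN5130026"]:
    print(record, "->", strip_prefix_and_pad(record))
```

As in the SQL, a remainder longer than 8 characters is truncated from the left, so records such as '16HENFC78942878' would lose the leading 'FC'.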
I currently am working with a large data set that was pre-populated in BigQuery. I have a column of orderID's which have the following set-up: o377412876, o380940924, etc. This is stored in a string. I need to do the following and am running into problems:
1) Strip off the first character using the BigQuery query language
2) Convert the remaining (or treat the remaining values), as an integer.
I will then run a join against the values. Now, I would be much happier doing this operation in Python, R, or another language. That said, the challenge I have been given, based on client needs, is to write all the scripts in BigQuery's query language.
SELECT 10 * INTEGER(REGEXP_REPLACE(x, '^.', ''))
FROM
(SELECT 'o1234' AS x)
12340
You can use the SUBSTR function together with SAFE_CAST, which returns NULL instead of raising an error when a value cannot be converted. INTEGER() is a legacy SQL function and does not work in BigQuery standard SQL.
SELECT SAFE_CAST(SUBSTR(x, 2) AS INT64)
FROM (SELECT 'o1234' AS x)
Output: 1234
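For comparison, here is roughly what the same transformation looks like outside BigQuery, as a Python sketch. It only illustrates the logic of SUBSTR(x, 2) plus SAFE_CAST; the function name is made up, and Python's `int()` accepts some inputs (for example surrounding whitespace) that a BigQuery cast may treat differently.

```python
def order_id_to_int(order_id):
    """Strip the first character and convert the rest to an integer.

    Returns None when the input is None or the rest is not a valid integer,
    similar in spirit to SAFE_CAST returning NULL.
    """
    if order_id is None:
        return None
    try:
        # order_id[1:] plays the role of SUBSTR(x, 2).
        return int(order_id[1:])
    except ValueError:
        return None


for raw in ["o1234", "o377412876", "oabc", None]:
    print(raw, "->", order_id_to_int(raw))
```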