Parse SQL file to separate columns - sql

I have a SQL file which has a lot of insert statements (more than 3000).
E.g.
insert into `pubs_for_client` (`ID`, `num`, `pub_name`, `pub_address`, `publ_tele`, `publ_fax`, `pub_email`, `publ_website`, `pub_vat`, `publ_last_year`, `titles_on_backlist`, `Personnel`) values('7','5','4TH xxxx xxxx','xxxx xxxx, 16 xxxxx xxxxx, xxxxxxx, We','111111111','1111111111','support#example.net','www.example.net','15 675 4238 14',NULL,NULL,'Jane Bloggs(Sales Contact:)jane.bloggs#example.net,Joe Bloggs(Other Contact:)joe.bloggs#example.net');
I have exported this into an Excel document (by running the query in phpMyAdmin and exporting to Excel). There's just one problem: as you can see in this case, two names and email addresses are being inserted into 'Personnel'.
How easy or difficult would it be to separate these out so they display as Name, email, Name2, email2?

What about when there are three e-mails/names?
With the data shown, this should be easy to do:
select replace(substring(substring_index(`Personnel`, ',', 1),length(substring_index(`Personnel`, ',', 1 - 1)) + 1), ',', '') personnel1,
replace(substring(substring_index(`Personnel`, ',', 2),length(substring_index(`Personnel`, ',', 2 - 1)) + 1), ',', '') personnel2
from `pubs_for_client`
The query above splits the Personnel column on the delimiter ','.
You can then split these fields on the delimiters ( and ) to separate each entry into name, position and e-mail.
The SQL will be ugly (because MySQL does not have a split function), but it will get the job done.
The split expression was taken from the comments on the MySQL documentation (search for split).
You can also create a function:
CREATE FUNCTION strSplit(x varchar(255), delim varchar(12), pos int) returns varchar(255)
return replace(substring(substring_index(x, delim, pos), length(substring_index(x, delim, pos - 1)) + 1), delim, '');
After which you can use
select strSplit(`Personnel`, ',', 1), strSplit(`Personnel`, ',', 2)
from `pubs_for_client`
You could also create your own function that will extract directly names and e-mails.
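If doing the second split in SQL gets too ugly, the same idea can be sketched outside the database. The following is only an illustration in Python, not part of the original answer: the helper name split_personnel is hypothetical, and it assumes every entry looks like Name(Role:)email, that entries are separated by commas, and that names never contain commas or parentheses. It handles two, three or more entries the same way.

```python
import re

# Hypothetical pattern: "Name(Role:)email" for a single entry.
ENTRY = re.compile(r"^\s*(?P<name>[^()]+)\((?P<role>[^()]*)\)(?P<email>.+?)\s*$")

def split_personnel(personnel):
    """Return a list of (name, role, email) tuples, one per comma-separated entry."""
    if not personnel:  # NULL or empty column
        return []
    people = []
    for entry in personnel.split(","):
        match = ENTRY.match(entry)
        if match:  # entries that do not fit the assumed shape are skipped
            name = match["name"].strip()
            role = match["role"].rstrip(":").strip()
            people.append((name, role, match["email"]))
    return people

sample = ("Jane Bloggs(Sales Contact:)jane.bloggs#example.net,"
          "Joe Bloggs(Other Contact:)joe.bloggs#example.net")
for name, role, email in split_personnel(sample):
    print(name, "|", role, "|", email)
```

Each tuple can then be written out as its own pair of Name/email columns, however many entries a row has.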

As bogeymin has already said, either get the data into CSV (or convert it from Excel) to manipulate it. If you're on Windows, have a look at using Notepad++ to break apart the last column.
Or... (and I'd probably do this), insert it into the database as it is (even if you insert into a dummy field, not the one you eventually want to use), then use the string manipulation functions in your variant of SQL to generate either update statements or more insert statements (whatever you need). Certainly, MS SQL Server can do this using things like SUBSTRING, PATINDEX, etc.

Related

SQL Substring function usage

This is the column Description:
Entity=10||WorkdayReferenceID=9000100332||HCMCostCenterMgr=nicoleb#broadinstitute.org||FRP=||
I want to retrieve only the email ID. In the above scenario the desired output would be
nicoleb#broadinstitute.org
I tried using this:
select RTRIM(NVL(SUBSTR(TL.DESCRIPTION,(INSTR(TL.DESCRIPTION, '=',1,3)+1),
(LENGTH(TL.DESCRIPTION)-1)),'TL.DESCRIPTION'), '|') AS CCM
But then a new FRP value was added after it, so the result is now wrong.
Assuming (and it is a big assumption) that there is only ever one # in the data, the approach would be find the #, and then you want the text that is after the preceding = and before the next |. I do not know if the email address is always after HCMCostCenterMgr= so I won't assume that (but if that is the case the solution is easier).
It looks like you might be using Oracle (I see a NVL function), but I did this as SQL Server (it is more familiar off the top of my head). Here is a small script that will return just the email address you want - you can easily change these functions to Oracle if you need a version for Oracle.
DECLARE @data VARCHAR(200);
SELECT @data = 'Entity=10||WorkdayReferenceID=9000100332||HCMCostCenterMgr=nicoleb#broadinstitute.org||FRP=||';
SELECT REVERSE(SUBSTRING(REVERSE(SUBSTRING(@data, 1, CHARINDEX('|', @data, CHARINDEX('#', @data)) - 1)), 1, CHARINDEX('=', REVERSE(SUBSTRING(@data, 1, CHARINDEX('|', @data, CHARINDEX('#', @data)) - 1)))-1));
It is ugly, but SQL often gets a bit convoluted. I had to reverse the string to find things before the '#' and then reverse it at the end to get the value in the correct direction.
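For comparison, the same "find the marker, then take the text between the preceding = and the next |" logic can be sketched in Python. This is only an illustration under the same big assumption (exactly one value contains the marker character); the function name find_value_containing is made up for the example.

```python
def find_value_containing(description, marker="#", pair_sep="||"):
    """Return the first value in 'key=value||key=value||' that contains marker, else None."""
    for pair in description.split(pair_sep):
        # partition splits on the first '=' only, so '=' inside a value is preserved
        key, sep, value = pair.partition("=")
        if sep and marker in value:
            return value
    return None

data = ("Entity=10||WorkdayReferenceID=9000100332||"
        "HCMCostCenterMgr=nicoleb#broadinstitute.org||FRP=||")
print(find_value_containing(data))
```

Because the lookup is by content rather than by position, adding or filling in later keys such as FRP does not change the result.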

create new columns from xml value in hive

I have a column desc_txt in my table whose contents are quite similar to XML, as shown below:
desc_txt
-----------
<td><strong>Criticality</strong></td><td>High</td></tr><td><strong>Country</strong></td><td>India</td></tr><tr><td><strong>City</strong></td><td>Indore</td>
Requirement is to have a new table/view created from this table having additional columns like Criticality, Country, City along with the column values like High, India, Indore, respectively.
How can this be achieved in Hive/Impala?
This can be done in two steps. I assumed you have only four columns to pull.
Load the data as is into a table, with everything in one column.
Then use the SQL below to split the data into multiple columns. I assumed 4 columns; you can add more as required.
with t as (
SELECT rtrim(ltrim(
regexp_replace( replace( trim(
regexp_replace(
regexp_replace("<td><strong>Criticality</strong></td><td>High</td></tr><td><strong>Country</strong></td><td>India</td></tr><tr><td><strong>City</strong></td><td>Indore</td>","</?[^>]*>",",")
,',,',',') ), ' ,', ',' ), '(,){2,}', ','),','),',')
str)
select split_part(str, ',', 1) as first_col,
split_part(str, ',', 2) as second_col,
split_part(str, ',', 3) as third_col,
split_part(str, ',', 4) as fourth_col
from t
The query is tricky: first it replaces every tag with a comma, then it collapses runs of commas into a single comma, then it strips the commas from the start and end of the string. split_part then splits the whole string on commas to produce the individual columns.
HTH...
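To see what the tag-stripping steps produce, here is a rough Python sketch of the same idea. It is not Hive/Impala code, and the function name label_value_pairs is hypothetical. It assumes that labels and values strictly alternate once the tags are removed and that neither contains a comma.

```python
import re

def label_value_pairs(desc_txt):
    """Strip the tags and pair each label with the value that follows it."""
    # Replace every tag with a comma, as the regexp_replace step does
    with_commas = re.sub(r"</?[^>]*>", ",", desc_txt)
    # Splitting and dropping empty pieces has the same effect as collapsing repeated commas
    tokens = [t.strip() for t in with_commas.split(",") if t.strip()]
    # Tokens now alternate: label, value, label, value, ...
    return dict(zip(tokens[0::2], tokens[1::2]))

desc_txt = ("<td><strong>Criticality</strong></td><td>High</td></tr>"
            "<td><strong>Country</strong></td><td>India</td></tr>"
            "<tr><td><strong>City</strong></td><td>Indore</td>")
print(label_value_pairs(desc_txt))
```

Note that after the cleanup the odd positions hold the labels (Criticality, Country, City) and the even positions hold the values, so the value columns are at positions 2, 4 and 6 of the split string.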

Isolating an alphanumerical ID that can be anywhere in a text based cell

I'm trying to isolate a certain character string from a text cell.
For example, I would like to extract "AB-T120-15" from the string "His server ID was AB-T120-15 and his problem was that he needed a reboot"
AB-T120-15 is an example, but they would all be codes of a max length of 13 characters starting by something like AB-T, CL-R, etc.
The codes can appear anywhere in a text field of the column.
string_split() cannot be used since the DB we are under is older.
I have tried many combinations of SUBSTRING and LEFT, but I cannot seem to get it to work.
Any thoughts?
String operations are not the strength of SQL Server -- which I assume you are using.
You can do this with rather painful string manipulation:
select left(stuff(str, 1, patindex('%[A-Z][A-Z]-[A-Z]%', str) - 1, ''),
charindex(' ', stuff(str, 1, patindex('%[A-Z][A-Z]-[A-Z]%', str), '') + ' ')
)
from (values ('His server ID was AB-T120-15 and his problem was that he needed a reboot')) v(str);
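If the extraction can happen outside the database, the same pattern (two letters, a hyphen, a letter, then everything up to the next space) is much shorter as a regular expression. A minimal Python sketch, with the hypothetical helper name extract_code; unlike PATINDEX under a case-insensitive collation, this match is case-sensitive.

```python
import re

# Two uppercase letters, a hyphen, an uppercase letter, then all non-space characters
CODE = re.compile(r"[A-Z]{2}-[A-Z]\S*")

def extract_code(text):
    """Return the first code-like token in text, or None if there is none."""
    match = CODE.search(text)
    return match.group(0) if match else None

text = "His server ID was AB-T120-15 and his problem was that he needed a reboot"
print(extract_code(text))
```

Like the SQL version, this takes everything up to the next space, so punctuation directly after a code would be included in the result.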

Using Upper to Capitalize the first letter of City name

I am doing some data clean-up and need to capitalize the first letter of city names. How do I capitalize the second word in a city like Terra Bella?
SELECT UPPER(LEFT([MAIL CITY],1))+
LOWER(SUBSTRING([MAIL CITY],2,LEN([MAILCITY])))
FROM masterfeelisting
My result is 'Terra bella' and I need 'Terra Bella'. Thanks in advance.
Ok, I know I answered this before, but it bugged me that we couldn't write something efficient to handle an unknown number of 'text segments'.
So after re-thinking it and researching, I discovered a way to change the [MAILCITY] field into XML nodes where each 'text segment' is assigned its own node within the xml field. Those nodes can then be processed one by one, concatenated together, and changed back to a SQL varchar. It's convoluted, but it works. :)
Here's the code:
CREATE TABLE
#masterfeelisting (
[MAILCITY] varchar(max) not null
);
INSERT INTO #masterfeelisting VALUES
('terra bellA')
,(' terrA novA ')
,('chicagO ')
,('bostoN')
,('porT dE sanTo')
,(' porT dE sanTo pallo ');
SELECT
RTRIM
(
(SELECT
UPPER([xmlField].[xmlNode].value('.', 'char(1)')) +
LOWER(STUFF([xmlField].[xmlNode].value('.', 'varchar(max)'), 1, 1, '')) + ' '
FROM [xmlNodeRecordSet].[nodeField].nodes('/N') as [xmlField]([xmlNode]) FOR
xml path(''), type
).value('.', 'varchar(max)')
) as [MAILCITY]
FROM
(SELECT
CAST('<N>' + REPLACE([MAILCITY],' ','</N><N>')+'</N>' as xml) as [nodeField]
FROM #masterfeelisting
) as [xmlNodeRecordSet];
Drop table #masterfeelisting;
First I create a table and fill it with dummy values.
Now here is the beauty of the code:
For each record in #masterfeelisting, we are going to create an xml field with a node for each 'text segment'.
ie. '<N></N><N>terrA</N><N>novA</N><N></N>'
(This is built from the varchar ' terrA novA ')
1) The way this is done is by using the REPLACE function.
The string starts with a '<N>' to designate the beginning of the node. Then:
REPLACE([MAILCITY],' ','</N><N>')
This effectively goes through the whole [MAILCITY] string and replaces each
' ' with '</N><N>'
and then the string ends with a '</N>'. Where '</N>' designates the end of each node.
So now we have a beautiful XML string with a couple of empty nodes and the 'text segments' nicely nestled in their own node. All the 'spaces' have been removed.
2) Then we have to CAST the string into xml. And we will name that field [nodeField]. Now we can use xml functions on our newly created record set. (Conveniently named [xmlNodeRecordSet].)
3) Now we can read the [xmlNodeRecordSet] into the main sub-Select by stating:
FROM [xmlNodeRecordSet].[nodeField].nodes('/N')
This tells us we are reading the [nodeField] as nodes with a '/N' delimiter.
This table of node fields is then parsed by stating:
as [xmlField]([xmlNode]) FOR xml path(''), type
This means each [xmlField] will be parsed for each [xmlNode] in the xml string.
4) So in the main sub-select:
Each blank node '<N></N>' is discarded. (Or not processed.)
Each node with a 'text segment' in it will be parsed. ie <N>terrA</N>
UPPER([xmlField].[xmlNode].value('.', 'char(1)')) +
This code will grab each node out of the field, take its contents '.' and only grab the first character 'char(1)'. Then it will upper case that character. The plus sign at the end means it will concatenate this letter with the next bit of code:
LOWER(STUFF([xmlField].[xmlNode].value('.', 'varchar(max)'), 1, 1, ''))
Now here is the beauty... STUFF is a function that will take a string, from a position, for a length, and substitute another string.
STUFF(string, start position, length, replacement string)
So our string is:
[xmlField].[xmlNode].value('.', 'varchar(max)')
Which grabs the whole string inside the current node since it is 'varchar(max)'.
The start position is 1. The length is 1. And the replacement string is ''. This effectively strips off the first character by replacing it with nothing. So the remaining string is all the other characters that we want to have lower case. So that's what we do... we use LOWER to make them all lower case. And this result is concatenated to our first letter that we already upper cased.
But wait... we are not done yet... we still have to append a + ' '. Which adds a blank space after our nicely capitalized 'text segment'. Just in case there is another 'text segment' after this node is done.
This main sub-Select will now parse each node in our [xmlField] and concatenate them all nicely together.
5) But now that we have one big happy concatenation, we still have to change it back from an xml field to a SQL varchar field. So after the main sub-select we need:
.value('.', 'varchar(max)')
This changes our [MAILCITY] back to a SQL varchar.
6) But hold on... we still are not done. Remember we put an extra space at the end of each 'text segment'? Well, the last 'text segment' still has that extra space after it. So we need to right trim that space off by using RTRIM.
7) And don't forget to rename the final field back to [MAILCITY].
8) And that's it. This code will take an unknown number of 'text segments' and format each one of them. All using the fun of XML and its node parsers.
Hope that helps :)
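For anyone who wants to check the expected output of the walkthrough above, the split / capitalize / re-join steps can be sketched in a few lines of Python. This is only an illustration of the logic, not a replacement for the SQL, and the function name capitalize_segments is made up.

```python
def capitalize_segments(city):
    """Upper-case the first letter of every space-separated segment, lower-case the rest."""
    # str.split() with no argument drops the empty segments created by
    # leading, trailing or repeated spaces (the blank <N></N> nodes)
    segments = city.split()
    return " ".join(seg[0].upper() + seg[1:].lower() for seg in segments)

for raw in ["terra bellA", " terrA novA ", "chicagO ", "bostoN",
            "porT dE sanTo", " porT dE sanTo pallo "]:
    print(repr(raw), "->", repr(capitalize_segments(raw)))
```

Joining with a single space between segments plays the role of the trailing ' ' plus the final RTRIM in the SQL.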
Here's one way to handle this using APPLY. Note that this solution supports up to 3 substrings (e.g. "Phoenix", "New York", "New York City") but can easily be updated to handle more.
DECLARE @string varchar(100) = 'nEW yoRk ciTY';
WITH DELIMCOUNT(String, DC) AS
(
SELECT @string, LEN(RTRIM(LTRIM(@string)))-LEN(REPLACE(RTRIM(LTRIM(@string)),' ',''))
),
CIPOS AS
(
SELECT *
FROM DELIMCOUNT
CROSS APPLY (SELECT CHARINDEX(char(32), string, 1)) CI1(CI1)
CROSS APPLY (SELECT CHARINDEX(char(32), string, CI1.CI1+1)) CI2(CI2)
)
SELECT
OldString = @string,
NewString =
CASE DC
WHEN 0 THEN UPPER(SUBSTRING(string,1,1))+LOWER(SUBSTRING(string,2,8000))
WHEN 1 THEN UPPER(SUBSTRING(string,1,1))+LOWER(SUBSTRING(string,2,CI1-1)) +
UPPER(SUBSTRING(string,CI1+1,1))+LOWER(SUBSTRING(string,CI1+2,100))
WHEN 2 THEN UPPER(SUBSTRING(string,1,1))+LOWER(SUBSTRING(string,2,CI1-1)) +
UPPER(SUBSTRING(string,CI1+1,1))+LOWER(SUBSTRING(string,CI1+2,CI2-(CI1+1))) +
UPPER(SUBSTRING(string,CI2+1,1))+LOWER(SUBSTRING(string,CI2+2,100))
END
FROM CIPOS;
Results:
OldString NewString
--------------- --------------
nEW yoRk ciTY New York City
This will only capitalize the first letter of the second word. It is a shorter but less flexible approach. Replace @str with [Mail City].
DECLARE @str AS VARCHAR(50) = 'Los angelas'
SELECT STUFF(@str, CHARINDEX(' ', @str) + 1, 1, UPPER(SUBSTRING(@str, CHARINDEX(' ', @str) + 1, 1)));
This is a way to use embedded SELECTs for three city name parts.
It uses CHARINDEX to find the location of your separator character. (ie a space)
I put an 'if' structure around the Select to test if you have any records with more than 3 parts to the city name. If you ever get the warning message, you could add another sub-Select to handle another city part.
Although... just to be clear... SQL is not the best language to do complicated formatting. It was written as a data retrieval engine with the idea that another program will take that data and massage it into a friendlier look and feel. It may be easier to handle the formatting in another program. But if you insist on using SQL and you need to account for city names with 5 or more parts... you may want to consider using Cursors so you can loop through the variable possibilities. (But Cursors are not a good habit to get into. So don't do that unless you've exhausted all other options.)
Anyway, the following code creates and populates a table so you can test the code and see how it works. Enjoy!
CREATE TABLE
#masterfeelisting (
[MAILCITY] varchar(30) not null
);
Insert into #masterfeelisting select 'terra bella';
Insert into #masterfeelisting select ' terrA novA ';
Insert into #masterfeelisting select 'chicagO ';
Insert into #masterfeelisting select 'bostoN';
Insert into #masterfeelisting select 'porT dE sanTo';
--Insert into #masterfeelisting select ' porT dE sanTo pallo ';
Declare @intSpaceCount as integer;
SELECT @intSpaceCount = max (len(RTRIM(LTRIM([MAILCITY]))) - len(replace([MAILCITY],' ',''))) FROM #masterfeelisting;
if @intSpaceCount > 2
SELECT 'You need to account for more than 3 city name parts ' as Warning, @intSpaceCount as SpacesFound;
else
SELECT
cThird.[MAILCITY1] + cThird.[MAILCITY2] + cThird.[MAILCITY3] as [MAILCITY]
FROM
(SELECT
bSecond.[MAILCITY1] as [MAILCITY1]
,SUBSTRING(bSecond.[MAILCITY2],1,bSecond.[intCol2]) as [MAILCITY2]
,UPPER(SUBSTRING(bSecond.[MAILCITY2],bSecond.[intCol2] + 1, 1)) +
SUBSTRING(bSecond.[MAILCITY2],bSecond.[intCol2] + 2,LEN(bSecond.[MAILCITY2]) - bSecond.[intCol2]) as [MAILCITY3]
FROM
(SELECT
SUBSTRING(aFirst.[MAILCITY],1,aFirst.[intCol1]) as [MAILCITY1]
,UPPER(SUBSTRING(aFirst.[MAILCITY],aFirst.[intCol1] + 1, 1)) +
SUBSTRING(aFirst.[MAILCITY],aFirst.[intCol1] + 2,LEN(aFirst.[MAILCITY]) - aFirst.[intCol1]) as [MAILCITY2]
,CHARINDEX ( ' ', SUBSTRING(aFirst.[MAILCITY],aFirst.[intCol1] + 1, LEN(aFirst.[MAILCITY]) - aFirst.[intCol1]) ) as intCol2
FROM
(SELECT
UPPER (LEFT(RTRIM(LTRIM(mstr.[MAILCITY])),1)) +
LOWER(SUBSTRING(RTRIM(LTRIM(mstr.[MAILCITY])),2,LEN(RTRIM(LTRIM(mstr.[MAILCITY])))-1)) as [MAILCITY]
,CHARINDEX ( ' ', RTRIM(LTRIM(mstr.[MAILCITY]))) as intCol1
FROM
#masterfeelisting as mstr -- Initial Master Table
) as aFirst -- First Select Shell
) as bSecond -- Second Select Shell
) as cThird; -- Third Select Shell
Drop table #masterfeelisting;

Sql Server Split and Concatenation

I have data in the following format in a sql server database table
[CPOID] [ContractPO] [ContractPOTitle]
1 10-SUP-CN-CNP-0001 Drytech
2 10-SUP-CN-CNP-0002 EC&M
I need to write a stored procedure to generate the following result
[CPOID] [ContractPO] [ContractPOTitle] [ConcatField]
1 10-SUP-CN-CNP-0001 Drytech CNP-0001-Drytech
2 10-SUP-CN-CNP-0002 EC&M CNP-0002-EC&M
where [ConcatField] is built by taking the last two hyphen-separated values of the [ContractPO] column and combining them with [ContractPOTitle].
If the ContractPO field is always the same length, you could just do:
SELECT
CPOID,
ContractPO,
ContractPOTitle,
RIGHT(ContractPO, 8) + '-' + ContractPOTitle as [ConcatField]
FROM MyTable
Assuming that the length of the ContractPO field is not fixed AND we have to rely on stripping out the text after the next to last '-', the following SQL will work. It's a bit ugly, but these types of operations are necessary because there doesn't appear to be a LASTINDEX function available out of the box in SQL Server.
SELECT
CPOID,
ContractPO,
ContractPOTitle,
RIGHT(ContractPO, CHARINDEX('-', REVERSE(ContractPO), CHARINDEX('-', REVERSE(ContractPO)) + 1) - 1) + '-' + ContractPOTitle as [ConcatField]
FROM #myTable
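To sanity-check the expected values, the "last two hyphen-separated parts plus the title" rule can be sketched in Python. This only illustrates the logic of the variable-length case; the function name concat_field is hypothetical.

```python
def concat_field(contract_po, title, sep="-"):
    """Join the last two sep-separated parts of contract_po with title."""
    parts = contract_po.split(sep)
    # parts[-2:] keeps the final two pieces, whatever the total length of the string
    return sep.join(parts[-2:] + [title])

rows = [(1, "10-SUP-CN-CNP-0001", "Drytech"),
        (2, "10-SUP-CN-CNP-0002", "EC&M")]
for cpo_id, contract_po, title in rows:
    print(cpo_id, contract_po, title, concat_field(contract_po, title))
```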