SSIS Transform -- Split one column into multiple columns


I'm trying to find out how to split a column in a table into three columns when the result is exported to a CSV file.
For example, I have a field called fullpatientname. It is listed in the following text format:
Smith, John C
The expectation is to have it in three separate columns:
Smith
John
C
I'm quite sure I have to split this in a derived column, but I'm not sure how to proceed with that

You are going to need to use a derived column for this process.
The SUBSTRING and FINDSTRING functions will be key to pull this off.
To get the first segment you would use something like this:
(DT_STR,25,1252) SUBSTRING([fullpatientname], 1, FINDSTRING([fullpatientname],",",1)-1)
The above should display a substring starting with the beginning of the [fullpatientname] to the position prior to the comma (,).
The next segment would be from the position after the comma to the final space separator, and the final would be everything from the position following the final space separator to the end.
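The three segments described above can be checked outside SSIS. The following is a minimal Python sketch of the position arithmetic only, not SSIS expression syntax; the function name `split_full_name` is hypothetical, and it assumes every value follows the exact `Last, First M` layout (a value with no middle initial would come out wrong).

```python
def split_full_name(full_name):
    """Split 'Last, First M' into (last, first, middle) by position."""
    comma = full_name.index(",")        # 0-based position of the comma
    last_space = full_name.rindex(" ")  # 0-based position of the final space
    last = full_name[:comma]
    # Skip the comma and the single space assumed to follow it.
    first = full_name[comma + 2:last_space]
    middle = full_name[last_space + 1:]
    return last, first, middle


print(split_full_name("Smith, John C"))  # ('Smith', 'John', 'C')
```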

It sounds like your business rule is:
- The "last name" is all of the characters up to the first comma
- The "first name" will be all of the characters after the first comma and a space
- The "middle name" will be what (and is it always present)?
  - The last character in the string (you will only ever have an initial letter)
  - All of the characters after the second space
This logic will fail in lots of fun ways, so be prepared for it. Also remember that once you combine information together, you cannot, with 100% accuracy, restore it to its component parts. Capture first, middle, and last/surname and store them separately.
Approach A
A Derived Column component. Actually, a few of them added to your data flow will cover this. The first Derived Column will be tasked with finding the positions of the name breaks. This could all be done in a single component, but debugging becomes a challenge, and because each of the three output columns would have to repeat the same position expressions, it quickly becomes a maintenance nightmare.
The second Derived Column will then use the positions defined in the first to call the LEFT and SUBSTRING functions to access points in the column
Approach B
I never reach for a script component first and the same should hold true for you. However, this is a mighty fine case for a script. The base .NET string library has a Split function that will break a string into pieces based on whatever delimiter you supply. The default is whitespace. The first call to Split will use ',' as the argument. The zeroth ordinal string will be the last name. The first ordinal string will contain the first and middle name pieces. Call the string.Split method again, this time using the default value: the last element is the middle name and the remaining elements are treated as the first name. Or vice versa: the zeroth element is the first name and everything else is the middle name.
I've had to deal with cleaning names before and so I've seen different rules based on how they want to standardize the name.
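As a rough illustration of the two-step split in Approach B, here is a Python stand-in for the .NET logic (Python's `str.split()` drops empty pieces by default, which .NET's `Split` does not, so a real script component would need to trim or remove empty entries). The function name `split_name` and the choice of "last token is the middle name" are assumptions made for the sketch.

```python
def split_name(full_name):
    """Split on the first comma, then split the remainder on whitespace."""
    last, _, rest = full_name.partition(",")
    parts = rest.split()  # default split collapses runs of whitespace
    if not parts:
        return last.strip(), "", ""
    if len(parts) == 1:
        return last.strip(), parts[0], ""
    # Rule assumed here: the final token is the middle name,
    # and everything before it is the first name.
    return last.strip(), " ".join(parts[:-1]), parts[-1]


print(split_name("Smith, John C"))  # ('Smith', 'John', 'C')
print(split_name("Smith, John"))    # ('Smith', 'John', '')
```

Unlike pure position arithmetic, this version still returns something sensible when the middle name is missing, which is one reason a script can be easier to maintain here.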

Try something like this, if your names are always in the same format (LastName-comma-space-FirstName-space-MI):
declare @FullName varchar(25) = 'Smith, John C'
select
substring(@FullName, 1, charindex(',', @FullName)-1 ) as LastName,
substring(@FullName, charindex(',',@FullName) + 2, charindex(' ',@FullName,charindex(',',@FullName)+2) - (charindex(',',@FullName) + 2) ) as FirstName,
substring(@FullName, len(@FullName), 1) as MiddleInitial

I am using SQL SERVER 2016 with SSIS in Visual Studio 2015. If you are using findstring you need to make sure the order is correct. I tried this first -
FINDSTRING(",",[fullpatientname],1), but it wouldn't work. I had to look up the documentation and found the order to be incorrect. FINDSTRING([fullpatientname],",",1) fixed the problem for me. I am not sure if this is due to differences in versions.

Related

Extract all elements from a string separated by underscores using regex

I am trying to capture all elements of the string below, which is separated by underscores (a campaign name for an advertisement), and store each one in a separate column. I then wish to compare them with a master table holding the true values to determine how accurately the data is being recorded.
eg: Input :
Expected output is :
My first element extraction was: REGEXP_EXTRACT(campaign_name, r"[^_+]{3}") as parsed_campaign_agency
I only extracted first 3 letters because according to the naming convention(truth table), the agency name is made of only 3 letters.
Caveat: Some elements can have variable lengths too. eg. The third element "CrossBMC" could be 3 letters in length or more.
I am new to regex and the data lies in a SQL table (in BigQuery), so I thought this could be achieved via SQL's regex_extract, but what I am having trouble with is extracting all elements at once.
Any help is appreciated :)

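If the number of underscores is constant and known, you can use SUBSTRING_INDEX like this (note that SUBSTRING_INDEX is a MySQL function; BigQuery has no function of that name, so there you would split the string with SPLIT and pick elements by offset instead):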
SELECT
SUBSTRING_INDEX(campaign_name,'_',1) first,
SUBSTRING_INDEX(SUBSTRING_INDEX(campaign_name,'_',2),'_',-1) second,
SUBSTRING_INDEX(SUBSTRING_INDEX(campaign_name,'_',3),'_',-1) third
FROM your_table;
You can try an example at SQLize.online.
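The idea of picking the n-th underscore-delimited element can be sketched in Python as follows. This is an illustration of the logic rather than of any SQL dialect; `nth_element` is a hypothetical helper, and the sample value is made up because the question's input was not shown. Unlike the nested SUBSTRING_INDEX calls, this helper returns an empty string when the element does not exist.

```python
def nth_element(campaign_name, n, delimiter="_"):
    """Return the n-th (1-based) delimited element, or '' if there are fewer."""
    parts = campaign_name.split(delimiter)
    return parts[n - 1] if 1 <= n <= len(parts) else ""


sample = "ABC_2020_CrossBMC"  # hypothetical value, not taken from the question
print(nth_element(sample, 1))  # ABC
print(nth_element(sample, 3))  # CrossBMC
```

Splitting on the delimiter handles variable-length elements, which a fixed-width pattern such as `{3}` cannot.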

Replace single quote to regular abbreviation

I have a table with a state column. Inside the state column I have a value like TX` and I want to remove that trailing character so that the state reads TX. How would I do that? Please give examples.

You already have answers for replacing the quote, but I wanted to provide methods for avoiding this problem in the first place.
As noted in @SeanLange's answer, you can define your State field as a CHAR(2), so you know that you'll never have a dummy character following a valid state code. You could also handle this in your client code, sanitizing the input before even sending it to the database.
One could argue that it would even be a good idea to define a lookup table with a foreign key constraint, so users could only input valid values. You could also use this lookup table client-side (e.g. to provide a list of states).
Of course, you also have to consider internationalization: What about when/if you need to store locations outside of the United States, that may have > 2 characters?

You can escape a single quote by doubling it and including it in quotes. So:
select replace(state, '''', '')
Of course, if the problem is just a bad third character, then LEFT(state, 2) might do the trick as well.
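To see the quote-doubling rule in action, here is a small sketch that runs the statement through SQLite from Python. SQLite is only a convenient stand-in: it shares the `replace` function and the doubled-quote escaping, but it has no `LEFT`, so `substr` is used in its place.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# In SQL, '''' is a string holding one single quote; '' is the empty string.
cleaned = conn.execute("SELECT replace('TX''', '''', '')").fetchone()[0]
# substr(value, 1, 2) plays the role of LEFT(value, 2) here.
truncated = conn.execute("SELECT substr('TX''', 1, 2)").fetchone()[0]
print(cleaned, truncated)  # TX TX
conn.close()
```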

As Sean Lange's comment said, you can do this in many different ways. For simplicity, you can just use the LEFT(string, #) function for the whole conversion, as long as your raw values are all in the TX` format (a two-letter abbreviation plus one `, so three characters total for every value in that field).
If that is the case then just do:
SELECT CAST(LEFT(t.state_column, 2) As Varchar(2)) As state,
t.column_2,
t.column_3
/* and so on for all the columns you want */
FROM table t;
--
Further Reference:
- https://w3resource.com/PostgreSQL/left-function.php

Difference between _%_% and __% in sql server

I am learning basics of SQL through W3School and during understanding basics of wildcards I went through the following query:
--Finds any values that start with "a" and are at least 3 characters in length
WHERE CustomerName LIKE 'a_%_%'
According to the example, this query searches the table for rows where the CustomerName column starts with 'a' and is at least 3 characters long.
However, I also tried the following query:
WHERE CustomerName LIKE 'a__%'
The above query also gives me the exact same result.
I want to know whether there is any difference in both queries? Does the second query produce a different output in some specific scenario? If yes what will be that scenario?

Both start with A, and end with %. In the middle part, the first says "one char, then between zero and many chars, then one char", while the second one says "one char, then one char".
Considering that the part that comes after them (the final part) is %, which means "between zero and many chars", I can only see both clauses as identical, as they both essentially just want a string starting with A then at least two following characters. Perhaps if there were at least some limitations on what characters were allowed by the _, then maybe they could have been different.
If I had to choose, I'd go with the second one for being more intuitive. After all, many other masks (e.g. a%%%%%%_%%_%%%%%) will yield the same effect, but why the weird complexity?
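The equivalence argued above can be checked mechanically. The sketch below translates the two LIKE patterns into regular expressions (only `_` and `%` are handled, and matching is case-sensitive, which may differ from your SQL Server collation) and compares them on every string up to length 5 over a two-letter alphabet. The helper names are hypothetical.

```python
import itertools
import re


def like_to_regex(pattern):
    """Translate a LIKE pattern that uses only _ and % into a compiled regex."""
    pieces = []
    for ch in pattern:
        if ch == "_":
            pieces.append(".")    # exactly one character
        elif ch == "%":
            pieces.append(".*")   # zero or more characters
        else:
            pieces.append(re.escape(ch))
    return re.compile("".join(pieces), re.DOTALL)


def like(value, pattern):
    """Return True if the whole value matches the LIKE pattern."""
    return like_to_regex(pattern).fullmatch(value) is not None


# Every string of length 0..5 built from the letters 'a' and 'b'.
candidates = ["".join(t) for n in range(6) for t in itertools.product("ab", repeat=n)]
same = all(like(s, "a_%_%") == like(s, "a__%") for s in candidates)
print(same)  # True
```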

For the LIKE operator, a single underscore "_" means any single character, so if you use one underscore, like
ColumnName LIKE 'a_%'
you are basically saying you need a string whose first letter is 'a', followed by another single character, and then followed by anything or nothing.
ColumnName LIKE 'a__%' OR ColumnName LIKE 'a_%_%'
Both expressions mean the first letter is 'a', followed by two characters, and then followed by anything or nothing. In plain English: any string of 3 or more characters starting with a.

Can we select the datas that have spaces between the lines in DB without the spaces?

I have a textbox used to search my table. The table is named ADDRESSBOOK and holds personnel records such as name, surname, phone numbers, etc. The phone numbers are stored like "0 123 456789". If I write "0 123 456789" in my textbox, this code runs in the background:
SELECT * FROM ADDRESSBOOK WHERE phonenumber LIKE "0 123 456789"
My problem is: how can I select the same row by writing "0123456789" in the textbox? Sorry for my English.

You can use replace():
WHERE REPLACE(phonenumber, ' ', '') LIKE REPLACE('0 123 456789', ' ', '')
If performance is an issue, you can do the following in SQL Server:
alter table t add phonenumber_nospace as (replace(phonenumber, ' ', ''));
create index idx_t_phonenumber_nospace on t(phonenumber_nospace);
Then, remove the spaces in the parameter value before constructing the query, and use:
WHERE phonenumber_nospace = @phonenumber_nospace
This assumes an equality comparison, as in your example.
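A quick way to try the REPLACE-based comparison is an in-memory SQLite database driven from Python. This is only a sketch: SQLite stands in for the real server, the row is made-up sample data, and `find_by_phone` is a hypothetical helper that strips spaces from the search text before binding it as a parameter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addressbook (name TEXT, phonenumber TEXT)")
conn.execute("INSERT INTO addressbook VALUES ('Example Person', '0 123 456789')")


def find_by_phone(search_text):
    """Match on the phone number with spaces stripped from both sides."""
    normalized = search_text.replace(" ", "")
    return conn.execute(
        "SELECT name FROM addressbook WHERE replace(phonenumber, ' ', '') = ?",
        (normalized,),
    ).fetchall()


print(find_by_phone("0123456789"))    # [('Example Person',)]
print(find_by_phone("0 123 456789"))  # [('Example Person',)]
```

Binding the value as a parameter also avoids building the SQL string from raw textbox input.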

If there is a specific format in which the phone number is stored, then you can insert spaces at the specific locations and then pass that value to the database query.
For Example as you have mentioned in the question for number 0 123 456789.
If there is a space after first number and space after fourth number then you could take the text from the textbox and insert space at second position and sixth position(as after adding space at second position + next three positions are number so sixth position) and pass that text to the database query.
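A minimal sketch of that re-formatting step, assuming the stored layout is always one digit, a space, three digits, a space, and the remaining digits. The function name `format_phone` is hypothetical.

```python
def format_phone(text):
    """Rebuild the stored 'D DDD DDDDDD' layout from the textbox value."""
    digits = text.replace(" ", "")  # tolerate input that already has spaces
    return f"{digits[:1]} {digits[1:4]} {digits[4:]}"


print(format_phone("0123456789"))  # 0 123 456789
```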

An important part of Db design is ensuring data consistency. The more consistently it's stored, the easier it is to query. That's why you should make a point of ensuring your columns use the correct data types:
Dates/time columns should use an appropriate date/time type.
Number columns should use a numeric type of the appropriate size. (None of this numeric varchar rubbish.)
String columns should be of the appropriate length (whether char or varchar).
Columns with referential relationships should never store invalid references to the referenced table.
And similarly, you need to determine the exact format you wish to use when storing telephone numbers; and ensure that any time you store a number it's done so consistently.
Some queries will be complex enough as is. As soon as you're unable to rely on a consistent format, your queries to find data need to cater for all the possible variations. They'll be less likely to leverage indexes effectively.
I have seen argument in favour of storing telephone numbers as numeric data. (It is after all a "number".) Though I'm not really convinced because this approach would be unable to represent leading zeroes (which might be desirable).
Conclusion
Whenever you insert/update a telephone number, ensure it's stored in a consistent format. (NOTE: You can be flexible about how the number appears to your users. It's only the stored value that needs to be consistent.)
Whenever you search for a telephone number, convert the search value into the compatible format before searching.
It's up to you exactly where/how you do these conversions. But you might wish to consider CHECK constraints to ensure that if you failed to convert a number appropriately at some point, that it isn't accidentally stored in the incorrect format. E.g.
CONSTRAINT CK_NoSpacesInTelno CHECK (Telephone NOT LIKE '% %')
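To illustrate how such a constraint behaves, the sketch below creates a table with the same CHECK in an in-memory SQLite database (a stand-in for the real server; the `contacts` table and its data are made up) and shows that a value containing a space is refused.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts ("
    " Telephone TEXT,"
    " CONSTRAINT CK_NoSpacesInTelno CHECK (Telephone NOT LIKE '% %'))"
)
conn.execute("INSERT INTO contacts VALUES ('0123456789')")  # accepted

try:
    conn.execute("INSERT INTO contacts VALUES ('0 123 456789')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # the constraint refused the value containing a space

count = conn.execute("SELECT COUNT(*) FROM contacts").fetchone()[0]
print(rejected, count)  # True 1
```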

SQL Server 2005 Update/Delete Substring of a Lengthy Column

I'm not sure if it is possible to do what I'm trying to do, but I thought I would give it a shot anyway. Also, I am fairly new to the SQL Server world and this is my first post, so I apologize if my wording is poor or if I leave information out. Also, I am working with SQL Server 2005.
Let's say I have a table named "table" and a column named "column". The contents of the column are a jumbled mess of characters (ntext data type). These characters were all drawn in from multiple entry fields in a front-end application. One of those entry fields was for sensitive information that we no longer need and would like to get rid of, but I can't just get rid of the whole column because it also contains other valuable information. Most of the solutions I have found so far only deal with columns that have short entries, so they are able to update the whole string, but for mine I think I need to identify the beginning and the end of the specific substring that I need and replace it or delete it somehow. This is the closest I have gotten to at least selecting the data that I need. AAA and /AAA mark the beginning and the end of the substring that I need.
select
substring (column, charindex ('AAA', column), charindex ('/AAA',column))
from table
where column like '%/AAA%'
The problems I am having with this one are that the substring doesn't stop at /AAA, it just keeps going, and some of the results are just blank so it looks something like:
AAA 12345 /AAA abcdefghijklmnop
AAA 12346 /AAA qrstuvwxyzabcdef
AAA 12347 /AAA abcdefghijklmnop
With the characters in bold being the information I need to get rid of. Also even though row 3 is blank, it still does contain the info that I need but I'm guessing that it isn't returning it because it has a different amount of characters before it (for example, rows 1, 2, and 4 might have 50 characters before them but row 3 might have 100 characters before it), at least that's the only reason that I could think of.
So I suppose the first step would probably be to actually select the right substring, then to either delete it or replace it with a different, meaningless substring like "111111" or something.
If there is more information that you need to be provided with or if I was unclear about anything please let me know and thank you for taking the time to read through (and hopefully answer) my question!
Edit: Another one that gets close to the right results goes something like this
select substring(column,charindex('AAA',column),20) from table
where column like '%/AAA%'
I'm not sure if this approach would work better since the substring i'm looking for is always going to have the same amount of characters. The problem with this one though, is that instead of having blank rows, they are replaced with irrelevant substrings from that column, but all of the other rows do return exactly what I want.

First of all, check your usage of SUBSTRING(). The third argument is for length, not end character, so you would need to alter your query to something like:
select substring (column, charindex ('AAA',column)
, charindex ('/AAA',column)-charindex ('AAA',column))
from table where column like '%/AAA%'
Yes your approach of finding it and then either deleting or replacing it is sound.
If some of the results are blank, it's possible that you are finding and replacing the entire string. If the LIKE pattern had not matched, the row would not have been returned at all, which is different from returning a blank value in that column.
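The start-plus-length arithmetic, and the follow-up step of replacing the sensitive part, can be sketched in Python like this. It illustrates the logic only, not T-SQL; `redact_between` and the filler value are hypothetical, and the sample string is made up.

```python
def redact_between(text, start_marker="AAA", end_marker="/AAA", filler="111111"):
    """Replace whatever sits between the markers with filler; keep the rest."""
    start = text.find(start_marker)
    if start == -1:
        return text  # no start marker, nothing to redact
    inner_start = start + len(start_marker)
    end = text.find(end_marker, inner_start)
    if end == -1:
        return text  # no closing marker after the start marker
    # Length of the sensitive span is end - inner_start, not the end position.
    return text[:inner_start] + filler + text[end:]


print(redact_between("xx AAA 12345 /AAA yy"))  # xx AAA111111/AAA yy
```

Text outside the markers is left untouched, so the other valuable information in the column survives.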
