Sorting (ORDER BY clause) in T-SQL / SQL Server while ignoring certain words

I'm wondering whether it is possible to use the ORDER BY clause (or any other clause) to sort while ignoring certain words.
For example, with the article 'the':
Bank of Switzerland
Bank of America
The Bank of England
should be sorted into:
Bank of America
The Bank of England
Bank of Switzerland
and NOT
Bank of America
Bank of Switzerland
The Bank of England

select * from #test
order by
case when test like 'The %' then substring(test, 5, 8000) else test end

If you have a limited number of words that you wish to eliminate, then you might be able to remove them by judicious use of REPLACE, e.g.
ORDER BY REPLACE(REPLACE(' ' + Column + ' ',' the ',' '),' and ',' ')
However, as the number of words grows, you'll need more and more nested REPLACE calls. In addition, this ORDER BY cannot benefit from any index, and it doesn't cope with punctuation marks.
If this sort is frequent and the queries would otherwise be able to benefit from an index, you might consider making the above a computed column, and creating an index over it (You would then order by the computed column).
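As a rough sketch of the computed-column idea, assuming a hypothetical table dbo.Banks with a Name column. It uses the leading-'The ' CASE expression from the earlier answer instead of the nested REPLACE; either should work as long as the expression is deterministic:

```sql
-- Hypothetical table, column, and index names, for illustration only.
CREATE TABLE dbo.Banks (
    Name     varchar(100) NOT NULL,
    -- Sort key: the name without a leading 'The '.
    SortName AS (CASE WHEN Name LIKE 'The %'
                      THEN SUBSTRING(Name, 5, 100)
                      ELSE Name
                 END) PERSISTED
);

-- Index the computed column so ORDER BY SortName can use it.
CREATE INDEX IX_Banks_SortName ON dbo.Banks (SortName);

INSERT INTO dbo.Banks (Name)
VALUES ('Bank of Switzerland'), ('Bank of America'), ('The Bank of England');

-- Order by the computed column, not by the raw expression.
SELECT Name
FROM dbo.Banks
ORDER BY SortName;
```

With the three sample rows this should list Bank of America, The Bank of England, Bank of Switzerland.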

You need to encode a method of turning one string into another and then ordering by that.
For example, if the method is just to strip away a leading occurrence of 'The '...
ORDER BY
CASE WHEN LEFT(yourField, 4) = 'The ' THEN RIGHT(yourField, LEN(yourField)-4) ELSE yourField END
Or, if you want to ignore all occurrences of 'the', wherever they occur, just use REPLACE (note that this also strips 'the' from inside longer words such as 'Theatre')...
ORDER BY
REPLACE(yourField, 'The', '')
You may end up with a fairly complex transposition, in which case you can do things like this...
SELECT
*
FROM
(
SELECT
<complex transposition> AS new_name,
*
FROM
whatever
)
AS data
ORDER BY
new_name

No, not really, because 'the' is arbitrary in this case. The closest you can get is to modify the field value, as below:
SELECT field1
FROM table
ORDER BY REPLACE(field1, 'The ', '')
The problem is that to replace two words, you have to nest REPLACE calls, which becomes a huge issue if you have more than about five words:
SELECT field1
FROM table
ORDER BY REPLACE(REPLACE(field1, 'of ', ''), 'The ', '')
Update: You don't really need to check whether 'the' or 'of' appears at the beginning of the field, because you only want to sort by the important words anyway. For example, 'Bank of America' should appear before 'Bank England' (the 'of' shouldn't cause it to be sorted after).

My solution is a little bit shorter:
DECLARE @Temp TABLE ( Name varchar(100) );
INSERT INTO @Temp (Name)
SELECT 'Bank of Switzerland'
UNION ALL
SELECT 'Bank of America'
UNION ALL
SELECT 'The Bank of England';
SELECT * FROM @Temp
ORDER BY LTRIM(REPLACE(Name, 'The ', ''));

Related

Extract unmatched content or values

I want to extract the un-matched values in data like in (table1)
name id subject
maria 01 Math computer english
faro 02 Computer stat english
hina 03 Chemistry physics bio
The below query
Select *
from table1
where subject like '%english%' or
subject like '%stat%'
returns first two rows that are matched with the criteria.
But I just need to extract the un-matched values from column (subject) like below output
unmatched
math computer
computer
chemistry physics bio
(Because in the first row only math computer values are not matching, in the second row two matches and in third row there are no matches).
Can I get that output?
With REPLACE you eliminate all occurrences of the values 'english' and/or 'stat':
SELECT
trim(
replace(replace(replace(subject, 'english', ''), 'stat', ''), '  ', ' ')
) unmatched
FROM tablename;
The final trim and replace will remove double spaces from the result and spaces from the start and the end.
You have a poor table design. You should be storing lists as separate rows in another table -- a so-called "junction" or "association" table. SQL has a great data type for storing lists. It is called a "table" not a "string".
That said, sometimes we are stuck with other people's really, really bad choices of data model.
If so, you can use replace() and trim() to get the list you want. I would do:
SELECT trim(replace(replace(' ' || subject || ' ', ' english ', ' '),
' stat ', ' ')
) as unmatched
FROM tablename;
This easily generalizes to more values, without worrying about introducing adjacent spaces.
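For comparison, a minimal sketch of the junction-table design mentioned above. The table and column names (student, student_subject) are hypothetical, not taken from the question:

```sql
-- Hypothetical schema: one row per (student, subject) instead of a space-separated list.
CREATE TABLE student (
    id   varchar(10) PRIMARY KEY,
    name varchar(50) NOT NULL
);

CREATE TABLE student_subject (
    student_id varchar(10) NOT NULL REFERENCES student (id),
    subject    varchar(50) NOT NULL,
    PRIMARY KEY (student_id, subject)
);

-- The "unmatched" subjects are then a plain filter, with no string manipulation.
SELECT s.name, ss.subject
FROM student s
JOIN student_subject ss ON ss.student_id = s.id
WHERE ss.subject NOT IN ('english', 'stat');
```

This returns one row per remaining subject; if a single space-separated string is still needed, the rows can be aggregated with whatever string-aggregation function the database provides.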

What's the equivalent of Excel's `left(find(), -1)` in BigQuery?

I have names in my dataset and they include parentheses. But, I am trying to clean up the names to exclude those parentheses.
Example: ABC Company (Somewhere, WY)
What I want to turn it into is: ABC Company
I'm using standard SQL with google big query.
I've done some research and I know big query has left(), but I do not know the equivalent of find(). My plan was to do something that finds the ( and then gives me everything to the left of -1 characters from the (.
"My plan was to do something that finds the ( and then gives me everything to the left of -1 characters from the (."
Good plan! In BigQuery Standard SQL - equivalent of LEFT is SUBSTR(value, position[, length]) and equivalent of FIND is STRPOS(value1, value2)
With this in mind your query can look like (which is exactly as you planned)
#standardSQL
WITH names AS (
SELECT 'ABC Company (Somewhere, WY)' AS name
)
SELECT SUBSTR(name, 1, STRPOS(name, '(') - 1) AS clean_name
FROM names
Usually, string functions are less expensive than regular expression functions, so if your data follows the pattern in your example, you should go with the version above.
But in more generic cases, when the pattern to clean is more dynamic, as in Graham's answer, you should go with the solution in Graham's answer.
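One thing to watch with the SUBSTR/STRPOS version: when a name has no parenthesis, STRPOS returns 0 and the length passed to SUBSTR becomes -1, which, as far as I know, BigQuery rejects with an error. A hedged sketch that guards against that case and also trims the space left before the parenthesis (the second sample row is made up for illustration):

```sql
#standardSQL
WITH names AS (
  SELECT 'ABC Company (Somewhere, WY)' AS name
  UNION ALL
  SELECT 'No Parentheses Inc' AS name  -- hypothetical row without a parenthesis
)
SELECT
  IF(STRPOS(name, '(') > 0,
     -- keep everything before the first '(' and drop the trailing space
     RTRIM(SUBSTR(name, 1, STRPOS(name, '(') - 1)),
     -- no parenthesis: return the name unchanged
     name) AS clean_name
FROM names
```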
Just use REGEXP_REPLACE + TRIM. This will work with all variants (just not nested parentheses):
#standardSQL
WITH
names AS (
SELECT
'ABC Company (Somewhere, WY)' AS name
UNION ALL
SELECT
'(Somewhere, WY) ABC Company' AS name
UNION ALL
SELECT
'ABC (Somewhere, WY) Company' AS name)
SELECT
TRIM(REGEXP_REPLACE(name,r'\(.*?\)',''), ' ') AS cleaned
FROM
names
Use REGEXP_EXTRACT:
SELECT
RTRIM(REGEXP_EXTRACT(names, r'([^(]*)')) AS new_name
FROM yourTable
The regex used here will greedily consume and match everything up until hitting an opening parenthesis. I used RTRIM to remove any unwanted whitespace picked up by the regex.
Note that this approach is robust with respect to the edge case of an address record not having any term with parentheses. In this case, the above query would just return the entire original value.
I can't test this solution at the moment, but you can combine SUBSTR and INSTR. Like this:
SELECT CASE WHEN INSTR(name, '(') > 0 THEN SUBSTR(name, 1, INSTR(name, '(') - 1) ELSE name END AS name FROM yourTable;

Retrieve Second to Last Word in PostgreSQL

I am using PostgreSQL 9.5.1
I have an address field where I am trying to extract the street type (AVE, RD, ST, etc). Some of them are formatted like this: 5th AVE N or PEE DEE RD N
I have seen a few methods in PostgreSQL to count segments from the left based on spaces i.e. split_part(name, ' ', 3), but I can't seem to find any built-in functions or regular expression examples where I can count the characters from the right.
My idea for moving forward is something along these lines:
select case when regexp_replace(name, '^.* ', '') = 'N'
then *grab the second to last group of string values*
end as type;
Leaving aside the issue of robustness of this approach when applied to address data, you can extract the penultimate space-delimited substring in a string like this:
with a as (
select string_to_array('5th AVE N', ' ') as addr
)
select
addr[array_length(addr, 1)-1] as street
from
a;
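Combining this with the CASE idea from the question could look roughly like the following. It is a sketch that assumes the only trailing directionals are N, S, E and W, and the sample addresses are inlined with VALUES (the third one is made up to show the fallback branch):

```sql
with a as (
    select name, string_to_array(name, ' ') as addr
    from (values ('5th AVE N'), ('PEE DEE RD N'), ('MAIN ST')) as t(name)
)
select
    name,
    case
        -- last word is a directional: the street type is the word before it
        when addr[array_length(addr, 1)] in ('N', 'S', 'E', 'W')
            then addr[array_length(addr, 1) - 1]
        -- otherwise assume the last word is the street type
        else addr[array_length(addr, 1)]
    end as street_type
from
    a;
```

For the three sample rows this yields AVE, RD and ST.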

Unexpected execution in an update query in SQL

I am getting an 'Unexpected' result with an update query in SQL Server 2012.
This is what I am trying to do.
From a column (IDENTIFIER) composed of an ID, a comma, and a name (e.g. 258967,Sarah Jones), I have to fill two other columns: ID and SELLER_NAME.
Some values in the original column have a trailing blank and the rest do not:
'258967,Sarah Jones'
'98745,Richard James '
This is the update query that I am executing:
UPDATE SELLER
SET
IDENTIFIER = LTRIM(RTRIM(IDENTIFIER)),
ID = Left(IDENTIFIER , charindex(',', IDENTIFIER )-1),
SELLER_NAME = UPPER(RIGHT((IDENTIFIER ),LEN(IDENTIFIER )-CHARINDEX(',',IDENTIFIER )));
But I get a wrong result:
258967,Sarah Jones 258967 SARAH JONES
98745,Richard James 98745 ICHARD JAMES
The same happens with all the names that have a blank at the end. At this point I wonder: if I have specified, as the first action, that I want to remove all the blanks at the beginning and at the end of the value of IDENTIFIER, why does the system update ID and SELLER_NAME and only then perform this action?
Just to specify: the IDENTIFIER column is part of the SELLER table, which is updated by another person who imports the data from an Excel file. I receive these values and I have to normalize the information. I can only read the SELLER table; please take this into account before answering.
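Because some values have a trailing space, the name gets truncated by one character: LEN ignores trailing blanks, so LEN(IDENTIFIER) is one less than the actual length that RIGHT counts from. Also, every expression in the SET list sees the column values from before the update, so trimming IDENTIFIER in the first assignment does not affect the other two. Applying RTRIM(IDENTIFIER) inside RIGHT is enough: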
SELLER_NAME = UPPER(RIGHT((RTRIM(IDENTIFIER)),LEN(IDENTIFIER )-CHARINDEX(',',IDENTIFIER)));
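To see where the missing character comes from, here is a small standalone check using the sample value from the question (no table needed):

```sql
DECLARE @v varchar(100) = '98745,Richard James ';  -- note the trailing blank

SELECT
    LEN(@v)                                            AS len_chars,    -- 19: LEN ignores the trailing blank
    DATALENGTH(@v)                                     AS stored_bytes, -- 20: the blank is still stored
    RIGHT(@v, LEN(@v) - CHARINDEX(',', @v))            AS broken_name,  -- 'ichard James ' (13 chars from the right)
    RIGHT(RTRIM(@v), LEN(@v) - CHARINDEX(',', @v))     AS fixed_name;   -- 'Richard James'
```

RIGHT counts 13 characters back from the end of the untrimmed 20-character string, so it starts one position too late; trimming first makes the string length agree with LEN.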
The design of your tables violates 1NF and is nothing but painful. Instead of doing all this crazy string manipulation you could leverage PARSENAME here quite easily.
with Something(SomeValue) as
(
select '258967,Sarah Jones' union all
select '98745,Richard James '
)
select *
, ltrim(rtrim(PARSENAME(replace(SomeValue, ',', '.'), 2)))
, ltrim(rtrim(PARSENAME(replace(SomeValue, ',', '.'), 1)))
from Something
Instead of using Right(), use SubString().
Here's an example. I've tried to show each step individually to illustrate
; WITH x (identifier) AS (
SELECT '258967,Sarah Jones'
UNION ALL
SELECT '98745,Richard James '
)
SELECT identifier
, CharIndex(',', identifier) As comma
, SubString(identifier, CharIndex(',', identifier) + 1, 1000) As name_only
, LTrim(RTrim(SubString(identifier, CharIndex(',', identifier) + 1, 1000))) As trimmed_name_only
FROM x
Note that the 1000 used should be the maximum length of the column definition or higher e.g. if your IDENTIFIER column is a varchar(2000) then use 2,000 instead.
Try trimming IDENTIFIER first, like this:
SELLER_NAME = UPPER(RIGHT(RTRIM(IDENTIFIER), LEN(IDENTIFIER) - CHARINDEX(',', IDENTIFIER)));

Get index of two consecutive upper case characters

I am trying to separate a city/state/zip field into the city, state, and zip. Normally I would do this with charindex of ',' to get the city and state, and isnumeric and right() for the zip.
This will work fine for the zip, but most of the rows in the data I am working with now have no commas: City ST Zip. Is there a way to identify the index of two upper case characters?
If not, does anybody have a better idea than just a case statement checking for each state individually?
EDIT: I found the PATINDEX/COLLATE option to work fairly intermittently. See my answer below.
PATINDEX should work for you:
PATINDEX('% [A-Z][A-Z] %', A COLLATE Latin1_general_cs_as)
So your full extract would be something like:
WITH CTE AS
( SELECT i = PATINDEX('% [A-Z][A-Z] %', A COLLATE Latin1_general_cs_as) + 1,
A
FROM (VALUES
('City ST Zip'),
('Another City ST Zip'),
('City, with comma ST Zip')
) t (A)
)
SELECT City = LEFT(A, i - 2),
State = SUBSTRING(A, i, 2),
Zip = SUBSTRING(A, i + 3, LEN(A))
FROM CTE;
Example on SQL Fiddle
The reason why PATINDEX appears to work intermittently is that you cannot use a character range (i.e. A-Z) to accomplish a case-sensitive search, even if using a case-sensitive collation. The issue is that character ranges work like sorting, and case-sensitive sorting groups the upper-case letters with their lower-case equivalents, just like it would be ordered in a dictionary. Range sorting is really: a,A,b,B,c,C,d,D,etc. Or, depending on the collation, it might be: A,a,B,b,C,c,D,d,etc (there are 31 Collations that sort upper-case first). When doing this in a case-sensitive collation, that merely groups all A entries together, separate from the a entries, whereas in a case-insensitive sort they would be intermixed.
But if you specify each of the letters individually (hence not using a range), then it will work as expected:
PATINDEX(N'%[ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]%',
[CityStZip] COLLATE Latin1_General_100_CS_AS)
The reason that PATINDEX and LIKE (both of which allow for a single character class of [A-Z]) work this way is that the [start-end] syntax is not a Regular Expression. Many people claim that PATINDEX and LIKE support "limited" RegEx due to supporting this syntax, but that is not true. It is merely a very similar (and a confusingly similar) syntax to RegEx where [A-Z] would normally not include any lower-case matches.
Of course, if you are guaranteed to only be searching on the US-English letters of A-Z, then a binary collation (i.e. one ending in _BIN2; don't use ones ending in _BIN as they have been deprecated since SQL Server 2005 was introduced, I believe) should work.
PATINDEX(N'%[A-Z][A-Z]%', [CityStZip] COLLATE Latin1_General_100_BIN2)
For more details about case-sensitive matching, especially in regards to including Unicode / NVARCHAR data, please see my related answer on DBA.StackExchange:
How to find values with multiple consecutive upper case characters
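A quick way to compare the three variants side by side is a query like the one below. The sample string and the column aliases are made up for illustration:

```sql
DECLARE @s varchar(50) = 'Springfield IL 62704';

SELECT
    -- Range under a case-sensitive collation: per the explanation above, the range
    -- also covers lower-case letters, so this is expected to match earlier than
    -- the state abbreviation (for example on 'Sp').
    PATINDEX('%[A-Z][A-Z]%', @s COLLATE Latin1_General_100_CS_AS) AS range_cs,

    -- Every upper-case letter listed explicitly: finds 'IL' at position 13.
    PATINDEX('%[ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]%',
             @s COLLATE Latin1_General_100_CS_AS) AS explicit_cs,

    -- Range under a binary collation: compares code points, also finds 'IL' at 13.
    PATINDEX('%[A-Z][A-Z]%', @s COLLATE Latin1_General_100_BIN2) AS range_bin2;
```

If range_cs comes back smaller than the other two columns, the range is matching mixed-case pairs, which would explain the intermittent results.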
If you have zip code and state at the end of the string, then this might work:
select right(address, 5) as zip,
left(right(address, 8), 2) as state,
left(address, len(address) - 9) as city
You can start by removing the commas and double spaces from the address.
If you have a table of states (which you should) with a column of abbreviations, you can do things like this:
SELECT a.* FROM Addresses a
INNER JOIN States s ON
a.CityStateZip Like '% ' + s.UpperCaseAbbreviation + ' %' --space on either side of abbreviation
You can make it work for both commas and spaces:
SELECT a.* FROM Addresses a
INNER JOIN States s ON
Replace(a.CityStateZip, ',' , ' ') Like '% ' + s.UpperCaseAbbreviation + ' %'
I found the PATINDEX/COLLATE option to work fairly intermittently. Here is what I ended up doing:
--get rid of the sparsely used commas
--get rid of the duplicate spaces
update MyTable set
CityStZip=
replace(
replace(
replace(CityStZip,'  ',' '),
'  ',' '),
',','')
select
--check if state and zip are there and then grab the city
case when isNumeric(right(CityStZip,1))=1
then left(CityStZip,len(CityStZip)-charindex(' ',reverse(CityStZip),
charindex(' ',reverse(CityStZip))+1)+1)
--no zip. check for state
when left(right(CityStZip,3),1) = ' '
then left(CityStZip,len(CityStZip)-charIndex(' ',reverse(CityStZip)))
else CityStZip
end as City,
--check if zip is there and then grab the state
case when isNumeric(right(CityStZip,1))=1
then substring(CityStZip,
len(CityStZip)-charindex(' ',reverse(CityStZip),
charindex(' ',reverse(CityStZip))+1)+2,
2)
--no zip. check if 3rd to last char is a space and grab the last two chars
when left(right(CityStZip,3),1) = ' '
then right(CityStZip,2)
end as [State],
--grab everything after the last space if the last character is numeric
case when isNumeric(right(CityStZip,1))=1
then substring(CityStZip,
len(CityStZip)-charindex(' ',reverse(CityStZip))+1,
charindex(' ',reverse(CityStZip)))
end as Zip
from MyTable