Custom ORDER BY to ignore 'the' - sql

I'm trying to sort a list of titles, but currently there's a giant block of titles which start with 'The '. I'd like the 'The ' to be ignored, and the sort to work off the second word. Is that possible in SQL, or do I have to do custom work on the front end?
For example, current sorting:
Airplane
Children of Men
Full Metal Jacket
Pulp Fiction
The Fountain
The Great Escape
The Queen
Zardoz
Would be better sorted:
Airplane
Children of Men
The Fountain
Full Metal Jacket
The Great Escape
Pulp Fiction
The Queen
Zardoz
Almost as if the records were stored as 'Fountain, The', and the like. But I don't want to store them that way if I can avoid it, which is of course the crux of the problem.

Best is to have a computed column to do this, so that you can index the computed column and order by that. Otherwise, the sort will be a lot of work.
So then you can have your computed column as:
CASE WHEN title LIKE 'The %' THEN stuff(title,1,4,'') + ', The' ELSE title END
Edit: If STUFF isn't available in MySQL, then use RIGHT or SUBSTRING to remove the leading 4 characters. But still try to use a computed column if possible, so that indexing can be better. The same logic should be applicable to rip out "A " and "An ".
Rob

Something like:
ORDER BY IF(LEFT(title,2) = "A ",
            SUBSTRING(title FROM 3),
            IF(LEFT(title,3) = "An ",
               SUBSTRING(title FROM 4),
               IF(LEFT(title,4) = "The ",
                  SUBSTRING(title FROM 5),
                  title)))
But given the overhead of doing this more than a few times, you're really better off storing the title sort value in another column...

I think you could do something like
ORDER BY REPLACE(TITLE, 'The ', '')
Note that this removes every occurrence of 'The ', not just a leading one, so a title with 'The ' in the middle gets a slightly altered sort key. I don't think this would affect the ordering very much.
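To see the caveat concretely, here is a minimal sketch using Python's built-in sqlite3 module (chosen only so the example is runnable; the title 'Into The Wild' is an illustrative sample, not from the question). It passes two titles through the same REPLACE expression:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def replace_sort_key(title):
    """Return the sort key produced by REPLACE(title, 'The ', '')."""
    row = conn.execute("SELECT REPLACE(?, 'The ', '')", (title,)).fetchone()
    return row[0]

leading = replace_sort_key("The Fountain")    # leading article removed, as intended
interior = replace_sort_key("Into The Wild")  # interior 'The ' is removed as well

print(leading)   # Fountain
print(interior)  # Into Wild
```

The second key still sorts under 'I', so the damage is usually small, but it is not the same thing as stripping only a leading article.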

The best way to handle this would be to have a column that contains the value you want to use specifically for ordering output. Then you'd just have to use:
SELECT t.title
FROM MOVIES t
ORDER BY t.order_title
There are going to be various rules about what should and should not be used to order titles.
Based on your example, an alternative would be to use something like:
SELECT t.title
FROM MOVIES t
ORDER BY CASE WHEN t.title LIKE 'The %' THEN SUBSTR(t.title, 5) ELSE t.title END
The CASE expression can be extended to contain the various other rules.

You can certainly arrange to dynamically strip off 'The', though you'll soon find that you have to deal with 'A' and 'An' (except for the special case of titles like "A is for Alibi"). When "foreign" films enter the mix, you'll need to cope with "El" and "La" (except for that pesky edge case, "LA Story"). Then mix in some German films, and you'll need to cope with 'Der' and 'Die' (except for that pesky set of 'Die Hard' edge cases). See the pattern? You're headed down a path that keeps getting longer and more pitted with special cases.
The way forward that avoids an ever-growing set of special cases is to store the title twice: once as you want it displayed, and once as you want it sorted.
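A minimal sketch of the two-column approach, using Python's built-in sqlite3 module. The table layout, the `sort_title` column name, and the tiny `ARTICLES` rule set are assumptions made for illustration. The point is that the rule only suggests a default, and a hand-edited value in the stored column always wins:

```python
import sqlite3

# Hypothetical, deliberately small rule set; real catalogues need hand-edited exceptions.
ARTICLES = ("The ", "An ", "A ")

def default_sort_title(title):
    """Suggest a sort title by dropping one leading English article."""
    for article in ARTICLES:
        if title.startswith(article):
            return title[len(article):]
    return title

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT NOT NULL, sort_title TEXT NOT NULL)")
conn.execute("CREATE INDEX idx_movies_sort_title ON movies (sort_title)")

titles = ["Zardoz", "The Queen", "Airplane", "The Fountain", "A is for Alibi"]
conn.executemany(
    "INSERT INTO movies (title, sort_title) VALUES (?, ?)",
    [(t, default_sort_title(t)) for t in titles],
)

# An editor overrides the edge case by hand; no new rule is needed.
conn.execute(
    "UPDATE movies SET sort_title = ? WHERE title = ?",
    ("A is for Alibi", "A is for Alibi"),
)

ordered = [row[0] for row in conn.execute("SELECT title FROM movies ORDER BY sort_title")]
print(ordered)
```

Because the sort value is plain stored data, the ORDER BY can use the index and the query never needs to know about articles at all.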

For SQLite
ORDER BY CASE WHEN LOWER(SUBSTR(title,1,4)) = 'the ' THEN SUBSTR(title,5) ELSE title END ASC
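As a quick check, the following sketch runs that ORDER BY against the titles from the question, using Python's built-in sqlite3 module and an in-memory table (the table name `movies` is an assumption):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (title TEXT)")

titles = [
    "Airplane", "Children of Men", "Full Metal Jacket", "Pulp Fiction",
    "The Fountain", "The Great Escape", "The Queen", "Zardoz",
]
conn.executemany("INSERT INTO movies (title) VALUES (?)", [(t,) for t in titles])

# Sort on the title with a leading 'The ' (any letter case) stripped.
query = """
    SELECT title
    FROM movies
    ORDER BY CASE
                 WHEN LOWER(SUBSTR(title, 1, 4)) = 'the '
                 THEN SUBSTR(title, 5)
                 ELSE title
             END ASC
"""
ordered = [row[0] for row in conn.execute(query)]
for title in ordered:
    print(title)
```

This produces the order asked for in the question: The Fountain lands between Children of Men and Full Metal Jacket, and The Great Escape between Full Metal Jacket and Pulp Fiction.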

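Ways that will only remove the first The (as spreadsheet formulas):
=SUBSTITUTE(A1,"The ","",1) removes the first occurrence of "The " wherever it appears in the cell. More reliably, to strip it only when it leads the title:
=IF(LEFT(A1,4)="The ",RIGHT(A1,LEN(A1)-4),A1)
The second one is basically saying: if the leftmost four characters equal "The ", then count the characters in the cell and show only the right-hand characters, excluding "The "; otherwise show the cell unchanged.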

Related

How to use Regex to lowercase catalogue values without any logic codes

In a loan domain we pass some catalogue values, e.g. whether a customer is the primary or the secondary customer. I need to check these values irrespective of case (uppercase, lowercase, camel case). The software I am using accepts only regex expressions, not Java or JS code (it uses a different scripting language). I am trying to do the conversion with regex alone, but I still get an error.
If catalogue_value ~"(/A-Z/)" then
Catalogue_value ~"/l"
Endif
I am still learning regex and trying to figure out the correct expressions to use.
Please tell me the correct regex format for converting to lowercase or uppercase.
If I understood your problem, you want to search without worrying about case: for example, the data is Paul, and you want to find this record when searching for PAUL, paul, PaUl, etc.?
One common technique is to convert both sides to upper or lower case, without regex. For example, in JavaScript:
"Paul".toLowerCase() === "paUL".toLowerCase()
In SQL:
select case when LOWER('Paul') = LOWER('paUL') then 1 else 0 end

How do I find the amount of time a certain word appears in a title in SQL?

You have a database with a lot of movies and their specific titles, the question is as follows.
How many movies are there that have the word ‘love’ anywhere in the title? (Hint: The L in the word love can be upper or lower case and can be included in words such as ‘lovers’.)
This is my code thus far but I am not sure how to include the search for 'L' and 'Lovers'.
SELECT title
FROM Movies
WHERE title LIKE '%love%'
AND title LIKE 'love%'
OR title LIKE '%love'
Can anyone assist?
Many databases support case-insensitive strings by default, so this would find all of them:
WHERE title LIKE '%love%'
Some don't. A convenient approach is to convert the title to lower case first:
WHERE LOWER(title) LIKE '%love%'
%love% will also match foxglove, rollover, sloven, pullover etc. You should also review your AND/OR use in the WHERE clause to get the expected results. Since AND binds more tightly than OR, your condition is evaluated as ('%love%' AND 'love%') OR '%love'. The first pair reduces to just 'love%', because any title matching 'love%' also matches '%love%' (% matches nothing as well as anything). So the query only finds titles that start or end with 'love' and misses titles with 'love' in the middle.
You may get better results matching '% love%' OR 'love%', which gives titles where a word starting with love follows a space, plus titles where such a word comes first. Use LOWER or UPPER as suggested by Gordon to make the search case insensitive:
WHERE UPPER(title) LIKE '% LOVE%' OR UPPER(title) LIKE 'LOVE%'
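The difference between the two patterns can be checked with a small sketch using Python's built-in sqlite3 module. The four sample titles are made up for illustration; 'Foxglove Summer' stands in for the false positives mentioned above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Movies (title TEXT)")

# Illustrative sample data, not from the original question.
sample = ["Love Actually", "Crazy, Stupid, Love", "The Lovers", "Foxglove Summer"]
conn.executemany("INSERT INTO Movies (title) VALUES (?)", [(t,) for t in sample])

# Plain substring match: also counts 'Foxglove'.
substring_count = conn.execute(
    "SELECT COUNT(*) FROM Movies WHERE LOWER(title) LIKE '%love%'"
).fetchone()[0]

# Word-start match: 'love' must begin the title or follow a space.
word_start_count = conn.execute(
    "SELECT COUNT(*) FROM Movies "
    "WHERE LOWER(title) LIKE '% love%' OR LOWER(title) LIKE 'love%'"
).fetchone()[0]

print(substring_count, word_start_count)  # 4 3
```

The word-start version still counts 'The Lovers', which is what the hint in the question asks for. It would miss a title where 'love' follows punctuation without a space, such as an opening quotation mark.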

Mixing Like and Not Like in SQL

I am trying to search a free-text column that contains crime reports. I want to identify shots from a gun, but not blood shot eyes. What I want is to exclude the term “shot” when it is part of “blood shot”, but still select the row if “shot” is used elsewhere in the report. I believe the code below will exclude the row if “blood shot” is found, even if “shot” is mentioned multiple times.
(Narrative LIKE '%[^a-z]Shot[^a-z]%' and Narrative Not Like '%[^a-z]Blood?Shot[^a-z]%')
Is there a way to exclude the term “shot” from the search when it is near the term “blood”, without excluding the cell when “shot” shows up elsewhere in the same report?
This is really not something you should be doing in base SQL -- databases are not very good at such string manipulation. You probably want to look into the full text index capabilities of your database.
But I think the simplest method is:
where replace(lower(narrative), 'blood shot', '') like '%shot%'
That is, remove the "blood shot" from the string and then check.
You may still want to have delimiters around "shot". Perhaps:
where concat(' ', replace(lower(narrative), 'blood shot', ''), ' ') like '%[^a-z]shot[^a-z]%'
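A small sketch of the simpler REPLACE-based check, run with Python's built-in sqlite3 module. The `reports` table and the three narratives are invented for illustration. The bracketed `[^a-z]` variant is not shown because that LIKE syntax is specific to SQL Server and SQLite does not support it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reports (id INTEGER PRIMARY KEY, narrative TEXT)")

# Illustrative narratives, not real data.
narratives = [
    (1, "Victim was shot twice in the leg."),
    (2, "Suspect had blood shot eyes and slurred speech."),
    (3, "Suspect had blood shot eyes and had shot at the officer."),
]
conn.executemany("INSERT INTO reports (id, narrative) VALUES (?, ?)", narratives)

# Remove every 'blood shot' first, then look for any remaining 'shot'.
query = """
    SELECT id
    FROM reports
    WHERE REPLACE(LOWER(narrative), 'blood shot', '') LIKE '%shot%'
    ORDER BY id
"""
matching_ids = [row[0] for row in conn.execute(query)]
print(matching_ids)  # [1, 3]
```

Row 3 is kept even though it mentions blood shot eyes, because a second "shot" survives the replacement. Without word delimiters this check would also match words such as "shotgun" or "upshot".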

Search postgresql database for strings containing specific words

I'm looking to query a postgresql database full of strings, specifically for strings with the word 'LOVE' in - this means only this specific version of the word and nothing where love is the stem or has that sequence of characters inside another word. I've so far been using the SELECT * FROM songs WHERE title LIKE '%LOVE%';, which mostly returns the desired results.
However, it also returns results like CRIMSON AND CLOVER, LOVESTONED/I THINK SHE KNOWS (INTERLUDE), LOVER YOU SHOULD'VE COME OVER and TO BE LOVED, which I want to exclude as they do not contain specifically the word 'LOVE'.
I know you can use SELECT * FROM songs WHERE title = 'LOVE';, but this will obviously miss any string that isn't exactly 'LOVE'. Is there an operation in postgresql that can return the results I need?
You can use a regular expression that looks for love either with a space before or after, or if the word is at the start or end of the string:
with songs (title) as (
values
('Crimson And Clover'),
('Love hurts'),
('Only love can tear us apart'),
('To be loved'),
('Tainted love')
)
select *
from songs
where title ~* '\mlove\M';
The ~* is the regex operator and uses case insensitive comparison. The \m and \M restrict the match to the beginning and end of a word.
returns:
title
---------------------------
Love hurts
Only love can tear us apart
Tainted love
Online example: http://rextester.com/EUTHKM33922
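For comparison outside the database, the same whole-word test can be sketched in Python. This is an analogue rather than the PostgreSQL syntax: Python's `re` module has no `\m` and `\M`, so the sketch uses `\b` (a boundary at either edge of a word) together with `re.IGNORECASE` in place of `~*`:

```python
import re

# \b matches at either edge of a word; IGNORECASE mirrors the ~* operator.
WORD_LOVE = re.compile(r"\blove\b", re.IGNORECASE)

titles = [
    "Crimson And Clover",
    "Love hurts",
    "Only love can tear us apart",
    "To be loved",
    "Tainted love",
]

matches = [title for title in titles if WORD_LOVE.search(title)]
for title in matches:
    print(title)
```

It selects the same three titles as the query above.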

What's the best way to parse an Address field using t-sql or SSIS?

I have a data set that I import into a SQL table every night. One field is 'Address_3' and contains the City, State, Zip and Country fields. However, this data isn't standardized. How can I best parse the data that currently goes into one field into individual fields? Here are some examples of the data I might receive:
'INDIANAPOLIS, IN 46268 US'
'INDIANAPOLIS, IN 46268-1234 US'
'INDIANAPOLIS, IN 46268-1234'
'INDIANAPOLIS, IN 46268'
Thanks in advance!
David
I've done something similar (not in T-SQL) and I find it works best to start at the end of the string and work backwards.
1. Grab the rightmost element up to the first space or comma.
   - Is it a known country code? It's a country.
   - If not, is it all numeric (including a hyphen)? It's a zip code.
   - Else discard it.
2. Grab the second rightmost element up to the next space or comma.
   - Is it a two alpha-character field? It's the state.
3. Grab everything else preceding the last comma and call it the city.
You'll need to make some adjustments based on what your input data looks like but the basic idea is to start from the right, grab the elements you can easily classify and call everything else the city.
You can implement something like this by using the REVERSE function to make searching easier (in which case you'll be parsing the string from left to right instead of right to left like I said above), the PATINDEX or CHARINDEX functions to find spaces and commas, and the SUBSTRING function to pull the address apart based on the positions found by PATINDEX and CHARINDEX. You could use the ASCII function to determine if a character is numeric or not.
You tagged your question with the SSIS tag as well - it might be easier to implement the parsing in some VB script in SSIS rather than try to do it with T-SQL.
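The right-to-left idea can be sketched in a few lines of Python (not T-SQL or SSIS script; the language is used here only to show the logic). The function name `parse_address_3`, the one-entry `KNOWN_COUNTRIES` set, and the US-style zip pattern are assumptions for illustration. As a simplification, an unrecognised trailing element is left in the city rather than discarded:

```python
import re

# Hypothetical lookup; a real import would use a full country-code list.
KNOWN_COUNTRIES = {"US"}
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")
STATE_PATTERN = re.compile(r"[A-Za-z]{2}")

def parse_address_3(value):
    """Split 'CITY, ST ZIP COUNTRY' into fields, working from the right."""
    tokens = [t for t in re.split(r"[,\s]+", value.strip()) if t]
    country = zip_code = state = None

    # Rightmost element: a country code, if we recognise it.
    if tokens and tokens[-1].upper() in KNOWN_COUNTRIES:
        country = tokens.pop().upper()
    # Next element: digits with an optional hyphenated extension form a zip code.
    if tokens and ZIP_PATTERN.fullmatch(tokens[-1]):
        zip_code = tokens.pop()
    # Next element: exactly two letters is taken to be the state.
    if tokens and STATE_PATTERN.fullmatch(tokens[-1]):
        state = tokens.pop().upper()
    # Whatever is left is the city (may be several words).
    city = " ".join(tokens) or None
    return {"city": city, "state": state, "zip": zip_code, "country": country}

parsed = parse_address_3("INDIANAPOLIS, IN 46268-1234 US")
print(parsed)
```

Each step only consumes an element when it can classify it, so the four sample formats from the question all parse, and a multi-word city stays intact because it is simply whatever remains.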
By far the best way is to not reinvent the wheel and get an address parsing and standardization engine. Ideally, you would use a CASS certified engine which is what is approved by the Postal Service. However, there are free address parsers on the net these days and any of those would be more accurate and less frustrating than trying to parse the address yourself.
That said, I will say that address parsers and the Post Office work from bottom up (So, country, then zip code, then city, then state then address line 2 etc.).
In SSIS you can have 4 derived columns (city,state,zip,country).
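SUBSTRING(column, 1, FINDSTRING(column, ",", 1) - 1) --city
SUBSTRING(column, FINDSTRING(column, " ", 1) + 1, FINDSTRING(column, " ", 2) - FINDSTRING(column, " ", 1) - 1) --state
SUBSTRING(column, FINDSTRING(column, " ", 2) + 1, FINDSTRING(column, " ", 3) - FINDSTRING(column, " ", 2) - 1) --zip
You can see the pattern above and continue accordingly. Note that FINDSTRING takes the string to search first, then the search text, then the occurrence, and that the third argument of SUBSTRING is a length, not an end position. These expressions assume a single-word city and a trailing country; when the expected space is missing, FINDSTRING returns 0 and the length has to be guarded with a conditional. This might get a bit complicated, so you can use a Script Component to better pull out the pieces of text.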
Something like this should help. Note that the Zip column takes everything after the last space, so for rows that end in a country code it returns the country instead:
select
    substring(CityStateZip, 1,
        case when charindex(',', reverse(CityStateZip)) = 0 then len(CityStateZip)
             else len(CityStateZip) - charindex(',', reverse(CityStateZip)) end) as City,
    LEFT(LTRIM(
        SUBSTRING(CityStateZip,
            case when charindex(',', reverse(CityStateZip)) = 0 then len(CityStateZip)
                 else len(CityStateZip) - charindex(',', reverse(CityStateZip)) + 2 end,
            LEN(CityStateZip)))
        , 2) as State,
    SUBSTRING(CityStateZip,
        case when charindex(' ', reverse(CityStateZip)) = 0 then len(CityStateZip)
             else len(CityStateZip) - charindex(' ', reverse(CityStateZip)) + 2 end,
        LEN(CityStateZip)) as Zip
from YourAddressTable