Extract unmatched content or values - sql

I want to extract the unmatched values from data like this (table1):
name   id  subject
maria  01  Math computer english
faro   02  Computer stat english
hina   03  Chemistry physics bio
The query below
Select *
from table1
where subject like '%english%' or
subject like '%stat%'
returns the first two rows, which match the criteria.
But I just need to extract the unmatched values from the subject column, like the output below:
unmatched
math computer
computer
chemistry physics bio
(In the first row only "math computer" does not match, in the second row there are two matches, and in the third row there are no matches.)
Can I get that output?

With REPLACE you eliminate all occurrences of the values 'english' and/or 'stat':
SELECT
  trim(
    replace(replace(replace(subject, 'english', ''), 'stat', ''), '  ', ' ')
  ) unmatched
FROM tablename;
The final trim and replace will remove double spaces from the result and spaces from the start and the end.

You have a poor table design. You should be storing lists as separate rows in another table -- a so-called "junction" or "association" table. SQL has a great data type for storing lists. It is called a "table" not a "string".
That said, sometimes we are stuck with other people's really, really bad choices of data model.
If so, you can use replace() and trim() to get the list you want. I would do:
SELECT trim(replace(replace(' ' || subject || ' ', ' english ', ' '),
                    ' stat ', ' ')
           ) AS unmatched
FROM tablename;
This easily generalizes to more values, without worrying about introducing adjacent spaces.
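A minimal sketch of the padded-replace idea, run against an in-memory SQLite database from Python so it can be tried without a server. The table and column names come from the question; the `lower()` call is an added assumption, included because the sample data is mixed case while the expected output is lowercase.

```python
import sqlite3

# Sample rows from the question.
ROWS = [
    ("maria", "01", "Math computer english"),
    ("faro", "02", "Computer stat english"),
    ("hina", "03", "Chemistry physics bio"),
]

# Pad the subject with a space on each side so every word is delimited by
# spaces, then replace each unwanted ' word ' with a single space.
QUERY = """
SELECT trim(
         replace(
           replace(' ' || lower(subject) || ' ', ' english ', ' '),
           ' stat ', ' ')
       ) AS unmatched
FROM table1
ORDER BY id
"""

def unmatched(rows):
    """Load the rows into a throwaway table and return the unmatched text."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE table1 (name TEXT, id TEXT, subject TEXT)")
    conn.executemany("INSERT INTO table1 VALUES (?, ?, ?)", rows)
    result = [row[0] for row in conn.execute(QUERY)]
    conn.close()
    return result

print(unmatched(ROWS))
```

Because each removed word is replaced by exactly one space, no run of double spaces is created, so no clean-up pass is needed afterwards.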

Related

Concatenate rows in function PostgreSQL

Assume there's a table projects containing project name, location, team id, start and end years. How can I concatenate rows so that the same names would combine the other information into one string?
name     location    team_id  start  end
Library  Atlanta     2389     2015   2017
Library  Georgetown  9920     2003   2007
Museum   Auckland    3092     2005   2007
Expected output would look like this:
name     Records
Library  Atlanta, 2389, 2015-2017
         Georgetown, 9920, 2003-2007
Museum   Auckland, 3092, 2005-2007
Each line should contain end-of-line / new line character.
I have a function for this, but I don't think it would work with just using CONCAT. What are other ways this can be done? What I tried:
CREATE OR REPLACE TYPE projects (name TEXT, records TEXT);
CREATE OR REPLACE FUNCTION records (INT)
RETURNS SETOF projects AS
$$
RETURN QUERY
SELECT p.name
CONCAT(p.location, ', ', p.team_id, ', ', p.start, '-', p.end, CHAR(10))
FROM projects($1) p;
$$
LANGUAGE PLpgSQL;
I tried using CHAR(10) for the new line, but it's giving a syntax error (not sure why?).
The sample above concatenates the strings but, as expected, leaves out the duplicated names.
You do not need PL/pgSQL for that.
First eliminate duplicate names using DISTINCT and then in a subquery you can concat the columns into a single string. After that use array_agg to create an array out of it. It will then "merge" multiple arrays, in case the subquery returns more than one row. Finally, get rid of the commas and curly braces using array_to_string. Instead of using the char value of a newline, you can simply use E'\n' (E stands for escape):
WITH j (name,location,team_id,start,end_) AS (
VALUES ('Library','Atlanta',2389,2015,2017),
('Library','Georgetown',9920,2003,2007),
('Museum','Auckland',3092,2005,2007)
)
SELECT
DISTINCT q1.name,
array_to_string(
(SELECT array_agg(concat(location,', ',team_id,', ',start,'-', end_, E'\n'))
FROM j WHERE name = q1.name),'') AS records
FROM j q1;
  name   |           records
---------+-----------------------------
 Library | Atlanta, 2389, 2015-2017
         | Georgetown, 9920, 2003-2007
         |
 Museum  | Auckland, 3092, 2005-2007
Note: try not to use SQL keywords (e.g. end, name, start) to name your columns. Although PostgreSQL allows them (end only when double-quoted), it is considered bad practice.
A slightly simpler query:
select
name,
string_agg( concat(location, ', ', team_id, ', ', start, '-', "end"), E'\n') AS records
FROM t
group by name;
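The same grouping can be tried locally with a small sketch that uses Python's built-in SQLite module. This is an analogue, not PostgreSQL: SQLite's `group_concat` stands in for `string_agg`, `char(10)` stands in for `E'\n'`, and the columns are renamed `start_year` and `end_year` (hypothetical names) to avoid quoting keywords.

```python
import sqlite3

# Sample rows from the question.
PROJECTS = [
    ("Library", "Atlanta", 2389, 2015, 2017),
    ("Library", "Georgetown", 9920, 2003, 2007),
    ("Museum", "Auckland", 3092, 2005, 2007),
]

def records_by_name(rows):
    """Return {name: newline-separated records} using one GROUP BY query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE t (name TEXT, location TEXT, team_id INTEGER,"
        " start_year INTEGER, end_year INTEGER)"
    )
    conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?, ?)", rows)
    sql = """
        SELECT name,
               group_concat(
                   location || ', ' || team_id || ', '
                            || start_year || '-' || end_year,
                   char(10)   -- newline as the separator between records
               ) AS records
        FROM t
        GROUP BY name
        ORDER BY name
    """
    result = dict(conn.execute(sql).fetchall())
    conn.close()
    return result

for name, records in records_by_name(PROJECTS).items():
    print(name)
    print(records)
```

Using the newline as the separator, rather than appending it to every record, avoids the trailing blank line that shows up in the array_agg output.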

create new columns from xml value in hive

I have a column desc_txt in my table whose contents resemble XML, as shown below:
desc_txt
-----------
<td><strong>Criticality</strong></td><td>High</td></tr><td><strong>Country</strong></td><td>India</td></tr><tr><td><strong>City</strong></td><td>Indore</td>
The requirement is to create a new table/view from this table with additional columns such as Criticality, Country and City, holding the values High, India and Indore respectively.
How can this be achieved in Hive/Impala?
This can be done in two steps. I assumed you have only four columns to pull.
1. Load the data as is into a table, with everything in a single column.
2. Use the SQL below to split the data into multiple columns. I assumed 4 columns; you can increase the number as required.
with t as (
SELECT rtrim(ltrim(
regexp_replace( replace( trim(
regexp_replace(
regexp_replace("<td><strong>Criticality</strong></td><td>High</td></tr><td><strong>Country</strong></td><td>India</td></tr><tr><td><strong>City</strong></td><td>Indore</td>","</?[^>]*>",",")
,',,',',') ), ' ,', ',' ), '(,){2,}', ','),','),',')
str)
select split_part(str, ',', 1) as first_col,
split_part(str, ',', 2) as second_col,
split_part(str, ',', 3) as third_col,
split_part(str, ',', 4) as fourth_col
from t
The query is tricky: first it replaces every tag with a comma, then it collapses runs of commas into a single comma, then it removes the commas at the start and end of the string. split_part then splits the whole string on commas to create the individual columns.
HTH...
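To see what each step of that query does to the string, here is a rough Python sketch of the same pipeline (replace tags with commas, collapse comma runs, trim, split). It is only an illustration of the logic, not Hive or Impala code, and the function name `parse_desc` is made up for this example. It assumes labels and values alternate and that no value contains a comma.

```python
import re

SAMPLE = (
    "<td><strong>Criticality</strong></td><td>High</td></tr>"
    "<td><strong>Country</strong></td><td>India</td></tr>"
    "<tr><td><strong>City</strong></td><td>Indore</td>"
)

def parse_desc(desc_txt):
    """Turn the tag soup into a {label: value} dictionary."""
    # Step 1: replace every opening or closing tag with a comma.
    text = re.sub(r"</?[^>]*>", ",", desc_txt)
    # Step 2: collapse runs of two or more commas into one.
    text = re.sub(r",{2,}", ",", text)
    # Step 3: drop the leading and trailing commas, then split.
    parts = text.strip(",").split(",")
    # Labels sit at even positions, their values at the following odd ones.
    return dict(zip(parts[0::2], parts[1::2]))

print(parse_desc(SAMPLE))
```

Pairing the pieces into label and value keeps each value attached to its own column name, instead of relying on a fixed position in the split result.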

Retrieve Second to Last Word in PostgreSQL

I am using PostgreSQL 9.5.1
I have an address field where I am trying to extract the street type (AVE, RD, ST, etc). Some of them are formatted like this: 5th AVE N or PEE DEE RD N
I have seen a few methods in PostgreSQL to count segments from the left based on spaces i.e. split_part(name, ' ', 3), but I can't seem to find any built-in functions or regular expression examples where I can count the characters from the right.
My idea for moving forward is something along these lines:
select case when regexp_replace(name, '^.* ', '') = 'N'
then *grab the second to last group of string values*
end as type;
Leaving aside the issue of robustness of this approach when applied to address data, you can extract the penultimate space-delimited substring in a string like this:
with a as (
select string_to_array('5th AVE N', ' ') as addr
)
select
addr[array_length(addr, 1)-1] as street
from
a;
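For comparison, the same "split on spaces, then index from the end" idea looks like this in Python. It is a sketch only; the helper name is hypothetical, and returning None for strings with fewer than two words is chosen to mirror the NULL that the array subscript yields in that case.

```python
def second_to_last_word(text):
    """Return the penultimate space-delimited word, or None if there is none."""
    words = text.split()
    # Negative indexing counts from the right, like array_length(addr, 1) - 1.
    return words[-2] if len(words) >= 2 else None

print(second_to_last_word("5th AVE N"))
print(second_to_last_word("PEE DEE RD N"))
```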

Get index of two consecutive upper case characters

I am trying to separate a city/state/zip field into the city, state, and zip. Normally I would do this with charindex of ',' to get the city and state, and isnumeric and right() for the zip.
This will work fine for the zip, but most of the rows in the data I am working with now have no commas: City ST Zip. Is there a way to identify the index of two upper case characters?
If not, does anybody have a better idea than just a case statement checking for each state individually?
EDIT: I found the PATINDEX/COLLATE option to work fairly intermittently. See my answer below.
PATINDEX should work for you:
PATINDEX('% [A-Z][A-Z] %', A COLLATE Latin1_general_cs_as)
So your full extract would be something like:
WITH CTE AS
( SELECT i = PATINDEX('% [A-Z][A-Z] %', A COLLATE Latin1_general_cs_as) + 1,
A
FROM (VALUES
('City ST Zip'),
('Another City ST Zip'),
('City, with comma ST Zip')
) t (A)
)
SELECT City = LEFT(A, i - 2),
State = SUBSTRING(A, i, 2),
Zip = SUBSTRING(A, i + 3, LEN(A))
FROM CTE;
The reason why PATINDEX appears to work intermittently is that you cannot use a character range (i.e. A-Z) to accomplish a case-sensitive search, even if using a case-sensitive collation. The issue is that character ranges work like sorting, and case-sensitive sorting groups the upper-case letters with their lower-case equivalents, just like it would be ordered in a dictionary. Range sorting is really: a,A,b,B,c,C,d,D,etc. Or, depending on the collation, it might be: A,a,B,b,C,c,D,d,etc (there are 31 Collations that sort upper-case first). When doing this in a case-sensitive collation, that merely groups all A entries together, separate from the a entries, whereas in a case-insensitive sort they would be intermixed.
But if you specify each of the letters individually (hence not using a range), then it will work as expected:
PATINDEX(N'%[ABCDEFGHIJKLMNOPQRSTUVWXYZ][ABCDEFGHIJKLMNOPQRSTUVWXYZ]%',
[CityStZip] COLLATE Latin1_General_100_CS_AS)
The reason that PATINDEX and LIKE (both of which allow for a single character class of [A-Z]) work this way is that the [start-end] syntax is not a Regular Expression. Many people claim that PATINDEX and LIKE support "limited" RegEx due to supporting this syntax, but that is not true. It is merely a very similar (and a confusingly similar) syntax to RegEx where [A-Z] would normally not include any lower-case matches.
Of course, if you are guaranteed to only be searching on the US-English letters of A-Z, then a binary collation (i.e. one ending in _BIN2; don't use ones ending in _BIN as they have been deprecated since SQL Server 2005 was introduced, I believe) should work.
PATINDEX(N'%[A-Z][A-Z]%', [CityStZip] COLLATE Latin1_General_100_BIN2)
For more details about case-sensitive matching, especially in regards to including Unicode / NVARCHAR data, please see my related answer on DBA.StackExchange:
How to find values with multiple consecutive upper case characters
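For contrast, in a real regular-expression engine the range [A-Z] is case-sensitive by default, so the split is straightforward if the data can be processed outside SQL Server. The following Python sketch assumes every row ends in a two-letter upper-case state followed by one more token (the zip); the function name is hypothetical and rows without a zip are not handled.

```python
import re

# city (lazy), optional comma, two upper-case letters, final token.
PATTERN = re.compile(r"^(.*?),?\s+([A-Z]{2})\s+(\S+)$")

def split_city_state_zip(value):
    """Return (city, state, zip) or None when the pattern does not match."""
    match = PATTERN.match(value.strip())
    if match is None:
        return None
    return match.group(1), match.group(2), match.group(3)

print(split_city_state_zip("Another City ST 12345"))
print(split_city_state_zip("Springfield, IL 62704"))
```

The lazy `(.*?)` lets the city contain spaces or commas while still stopping at the first place where a state and a final token can follow.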
If you have zip code and state at the end of the string, then this might work:
select right(address, 5) as zip,
left(right(address, 8), 2) as state,
left(address, len(address) - 9) as city
You can start by removing the commas and double spaces from the address.
If you have a table of states (which you should) with a column of the abbreviations, you can do things like this:
SELECT a.* FROM Addresses a
INNER JOIN States s ON
a.CityStateZip Like '% ' + s.UpperCaseAbbreviation + ' %' --space on either side of abbreviation
You can make it work for both commas and spaces:
SELECT a.* FROM Addresses a
INNER JOIN States s ON
Replace(a.CityStateZip, ',' , ' ') Like '% ' + s.UpperCaseAbbreviation + ' %'
I found the PATINDEX/COLLATE option to work fairly intermittently. Here is what I ended up doing:
--get rid of the sparsely used commas
--get rid of the duplicate spaces
update MyTable set
    CityStZip=
        replace(
            replace(
                replace(CityStZip,'  ',' '),
            '  ',' '),
        ',','')
select
--check if state and zip are there and then grab the city
case when isNumeric(right(CityStZip,1))=1
then left(CityStZip,len(CityStZip)-charindex(' ',reverse(CityStZip),
charindex(' ',reverse(CityStZip))+1)+1)
--no zip. check for state
when left(right(CityStZip,3),1) = ' '
then left(CityStZip,len(CityStZip)-charIndex(' ',reverse(CityStZip)))
else CityStZip
end as City,
--check if zip is there and then grab the state
case when isNumeric(right(CityStZip,1))=1
then substring(CityStZip,
len(CityStZip)-charindex(' ',reverse(CityStZip),
charindex(' ',reverse(CityStZip))+1)+2,
2)
--no zip. check if 3rd to last char is a space and grab the last two chars
when left(right(CityStZip,3),1) = ' '
then right(CityStZip,2)
end as [State],
--grab everything after the last space if the last character is numeric
case when isNumeric(right(CityStZip,1))=1
then substring(CityStZip,
len(CityStZip)-charindex(' ',reverse(CityStZip))+1,
charindex(' ',reverse(CityStZip)))
end as Zip
from MyTable

Sorting (or usage of ORDER BY clause) in T-SQL / SQL SERVER without considering some words

I'm wondering whether it is possible to use the ORDER BY clause (or any other clause) to sort without considering some words.
For example, the article 'the':
Bank of Switzerland
Bank of America
The Bank of England
should be sorted into:
Bank of America
The Bank of England
Bank of Switzerland
and NOT
Bank of America
Bank of Switzerland
The Bank of England
select * from #test
order by
case when test like 'The %' then substring(test, 5, 8000) else test end
If you have a limited number of words that you wish to eliminate, then you might be able to remove them by judicious use of REPLACE, e.g.
ORDER BY REPLACE(REPLACE(' ' + Column + ' ',' the ',' '),' and ',' ')
However, as the number of words add up, you'll have more and more nested REPLACE calls. In addition, this ORDER BY will be unable to benefit from any indexes, and doesn't cope with punctuation marks.
If this sort is frequent and the queries would otherwise be able to benefit from an index, you might consider making the above a computed column, and creating an index over it (You would then order by the computed column).
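A small sketch of the nested-REPLACE sort key, generated from a word list and run against an in-memory SQLite database from Python. SQLite and the `banks` table are stand-ins chosen so the example runs anywhere; the `lower()` call is an added assumption to make the match case-insensitive, and punctuation is not handled, as noted above.

```python
import sqlite3

def sort_ignoring(names, ignored_words):
    """Return names ordered by a key with the ignored words stripped out."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE banks (name TEXT)")
    conn.executemany("INSERT INTO banks VALUES (?)", [(n,) for n in names])
    # Start from the space-padded, lower-cased name and wrap it in one
    # replace() call per ignored word; the words are bound as parameters.
    key = "' ' || lower(name) || ' '"
    params = []
    for word in ignored_words:
        key = f"replace({key}, ?, ' ')"
        params.append(f" {word.lower()} ")
    rows = conn.execute(f"SELECT name FROM banks ORDER BY trim({key})", params)
    result = [row[0] for row in rows]
    conn.close()
    return result

print(sort_ignoring(
    ["Bank of Switzerland", "Bank of America", "The Bank of England"],
    ["the"],
))
```

Building the expression in a loop keeps the query readable as the word list grows, although the database still evaluates one replace per word for every row.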
You need to encode a method of turning one string into another and then ordering by that.
For example, if the method is just to strip away starting occurrences of 'The '...
ORDER BY
CASE WHEN LEFT(yourField, 4) = 'The ' THEN RIGHT(yourField, LEN(yourField)-4) ELSE yourField END
Or, if you want to ignore all occurrences of 'the', wherever it occurs, just use REPLACE...
ORDER BY
REPLACE(yourField, 'The', '')
You may end up with a fairly complex transposition, in which case you can do things like this...
SELECT
*
FROM
(
SELECT
<complex transposition> AS new_name,
*
FROM
whatever
)
AS data
ORDER BY
new_name
No, not really, because 'the' is an arbitrary word in this case. The closest you can do is modify the field value, such as below:
SELECT field1
FROM table
ORDER BY REPLACE(field1, 'The ', '')
The problem is that to replace two words, you have to nest REPLACE calls, which becomes a huge issue if you have more than about five words:
SELECT field1
FROM table
ORDER BY REPLACE(REPLACE(field1, 'of ', ''), 'The ', '')
Update: You don't really need to check whether 'the' or 'of' appears at the beginning of the field, because you only want to sort by the important words anyway. For example, Bank of America should appear before Bank England (the 'of' should not push it later in the sort).
My solution is a little bit shorter:
DECLARE @Temp TABLE ( Name varchar(100) );
INSERT INTO @Temp (Name)
SELECT 'Bank of Switzerland'
UNION ALL
SELECT 'Bank of America'
UNION ALL
SELECT 'The Bank of England';
SELECT * FROM @Temp
ORDER BY LTRIM(REPLACE(Name, 'The ', ''))