Extracting file extensions excluding query parameters in SQL

Is there a way to obtain only the file extension, excluding query parameters, using split_part and reverse in an SQL query?
For example:
www.example.com?hffhqowhf
or
test.jpg?34rfqeyfhhf
Returns:
com
jpg
The solution should not be tied to com or jpg; it should work for any extension.
Many Thanks

There are a number of ways of achieving this (for example using a combination of INSTR and SUBSTR functions) but the cleanest way is probably to use a regular expression, something like this:
(Assumption: the string always has a query parameter starting with a '?', and that is the only occurrence of this character in the string.
Caveat: I don't currently have access to Impala, so you may need to adjust the regular expression to get it to work precisely as you require.)
Reverse the string (REVERSE function) - so that the substring you want is between the '?' and the next '.'. If you don't reverse the string it is harder to identify which '.' in the string you are dealing with
Extract the substring between '?' and '.' but excluding these 2 characters. Because '?' is a regex metacharacter, match it literally by putting it in a character class, e.g.
select regexp_extract(reverse('www.example.com?hffhqowhf'),'[?]([^.]+)',1);
Reverse the output again to get the required result
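The three steps can be checked outside the database. Below is a minimal Python sketch of the same reverse, extract, reverse logic. The function name extension_before_query is hypothetical, Python's re module stands in for Impala's regexp_extract, and the sketch keeps the answer's assumption that the string contains exactly one '?'.

```python
import re


def extension_before_query(value: str) -> str:
    """Return the text between the last '.' and the '?' (hypothetical helper)."""
    # Step 1: reverse, so the extension sits right after the '?'.
    reversed_value = value[::-1]
    # Step 2: capture everything after the literal '?' up to the next '.'.
    match = re.search(r"\?([^.]+)", reversed_value)
    if match is None:
        return ""
    # Step 3: reverse the captured text back into reading order.
    return match.group(1)[::-1]


print(extension_before_query("www.example.com?hffhqowhf"))  # com
print(extension_before_query("test.jpg?34rfqeyfhhf"))       # jpg
```

In SQL the same shape would be reverse(regexp_extract(reverse(col), ...)), with the escaping adjusted to whatever the engine's regex flavor expects.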

Related

Substring from reverse in PostgreSql

How do we take a substring starting from the end of a string in Postgres? In Oracle we can provide the number of occurrences of the pattern and extract the expected records. Postgres does not have such an option.
I tried using the substring(), left() and right() functions, but it is still not working. Any suggestion would be helpful.
Value in the column,
col1
100~500~~~~Bangalore~~~~KA~null~Train
Expected result,
Train
To get the characters after the last ~ you can use substring() with a regex:
substring(col1 from '~([^~]+)$')
Or get the position of the last ~ by using the reverse function:
right(col1, strpos(reverse(col1), '~') - 1)
A more general approach is to convert the string to an array, then pick the last array element:
(string_to_array(col1, '~'))[cardinality(string_to_array(col1, '~'))]
The best solution to this sort of problem is to not store multiple values delimited by some character in a single column. If you really need to de-normalize, using arrays or JSON would at least be a bit more flexible (and robust).
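To compare the three expressions from this answer side by side, here is a rough Python sketch that mirrors each one. The function names are hypothetical, Python string operations stand in for the Postgres functions, and the reverse-based version assumes the value contains at least one '~'.

```python
import re


def last_field_regex(value: str):
    # Mirrors substring(col1 from '~([^~]+)$'): capture the trailing run
    # of non-'~' characters that follows the last '~'.
    match = re.search(r"~([^~]+)$", value)
    return match.group(1) if match else None


def last_field_reverse(value: str) -> str:
    # Mirrors right(col1, strpos(reverse(col1), '~') - 1).
    # strpos() is 1-based, so add 1 to Python's 0-based find().
    position = value[::-1].find("~") + 1
    count = position - 1
    # Keep the last `count` characters, like right(col1, count).
    return value[len(value) - count:]


def last_field_split(value: str) -> str:
    # Mirrors picking the last element of string_to_array(col1, '~').
    return value.split("~")[-1]


sample = "100~500~~~~Bangalore~~~~KA~null~Train"
print(last_field_regex(sample))    # Train
print(last_field_reverse(sample))  # Train
print(last_field_split(sample))    # Train
```

The variants differ when the value ends with '~': in this sketch the regex version finds no match, while the other two return an empty string.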

Determine if substring corresponds to specific code (character types) in SQL

I have a collection of strings and want to filter out those where the last four characters are: (alpha)(alpha)(number)(number).
I know I can take a substring for each of these characters and check them separately, but what is the method to determine the types of the characters in the sequence?
This is for SQL in Hive.
You can use regular expressions. Something like:
where col regexp '[a-zA-Z]{2}[0-9]{2}$'
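A quick way to sanity-check the pattern before running it in Hive is to try it against a few strings. This Python sketch uses the same regular expression; the function name and the sample strings are made up for illustration, and it assumes the pattern may match anywhere in the string, which is why the trailing $ is what pins it to the last four characters.

```python
import re

# Two ASCII letters followed by two digits, anchored at the end of the string.
PATTERN = re.compile(r"[a-zA-Z]{2}[0-9]{2}$")


def ends_with_two_letters_two_digits(value: str) -> bool:
    # search() looks for a match anywhere; '$' restricts it to the tail.
    return PATTERN.search(value) is not None


for candidate in ["flightAB12", "ab12", "AB123", "A1B2"]:
    print(candidate, ends_with_two_letters_two_digits(candidate))
# flightAB12 True
# ab12 True
# AB123 False
# A1B2 False
```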

Redshift SQL - Extract numbers from string

In Amazon Redshift tables, I have a string column from which I need to extract only the numbers. For this I currently use
translate(stringfield, '0123456789'||stringfield, '0123456789')
I tried the REPLACE function, but it is not going to be elegant.
Any thoughts on converting the string into ASCII first and then doing some operation to extract only the numbers? Or any other alternatives?
It is hard here, as Redshift does not support functions and is missing a lot of traditional functions.
Edit:
I am trying out the below, but it only returns 051-a92, whereas I need 05192 as output. I am thinking of substring etc., but I only have regexp_substr available right now. How do I get rid of any characters in between?
select REGEXP_SUBSTR('somestring-051-a92', '[0-9]+..[0-9]+', 1)
This might be late, but I was solving the same problem and finally came up with this:
select REGEXP_replace('somestring-051-a92', '[a-z/-]', '')
Alternatively, you can now create a Python UDF.
Typically your inputs will conform to some sort of pattern that can be used to do the parsing using SUBSTRING() with CHARINDEX() { aka STRPOS(), POSITION() }.
E.g. find the first hyphen and the second hyphen and take the data between them.
If not (and assuming your character range is limited to ASCII) then your best bet would be to nest 26+ REPLACE() functions to remove all of the standard alpha characters (and any punctuation as well).
If you have multibyte characters in your data though then this is a non-starter.
A better method is to remove all the non-numeric characters:
select REGEXP_replace('somestring-051-a92', '[^0-9]', '')
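The difference between the two regexp_replace patterns shows up as soon as the input contains characters outside a-z, '/' and '-'. The Python sketch below mirrors both patterns with re.sub. The function names and the second input string are hypothetical, and Python's regex engine is only a stand-in for Redshift's.

```python
import re


def digits_only(value: str) -> str:
    # Mirrors regexp_replace(value, '[^0-9]', ''): drop every non-digit.
    return re.sub(r"[^0-9]", "", value)


def strip_lowercase_and_dashes(value: str) -> str:
    # Mirrors regexp_replace(value, '[a-z/-]', ''): drop only lowercase
    # letters, '/' and '-', so anything else survives.
    return re.sub(r"[a-z/-]", "", value)


print(digits_only("somestring-051-a92"))                 # 05192
print(strip_lowercase_and_dashes("somestring-051-a92"))  # 05192
print(digits_only("Order-051-A92"))                      # 05192
print(strip_lowercase_and_dashes("Order-051-A92"))       # O051A92
```

Both patterns agree on the example from the question, but only the negated class keeps working when uppercase letters or other symbols appear.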
You can specify "any non digit" that includes non-printable, symbols, alpha, etc.
e.g., regexp_replace('brws--A*1','[\D]')
returns
"1"

Return sql rows where field contains ONLY non-alphanumeric characters

I need to find out how many rows in my SQL Server table have a particular field that contains ONLY non-alphanumeric characters.
I'm thinking it's a regular expression that I need, along the lines of [^a-zA-Z0-9], but I'm not sure of the exact syntax I need to return the rows if there are no valid alphanumeric chars in there.
SQL Server doesn't have regular expressions. It uses the LIKE pattern matching syntax which isn't the same.
As it happens, you are close. You just need leading and trailing wildcards, and to move the NOT out of the pattern:
WHERE whatever NOT LIKE '%[a-z0-9]%'
If you have short strings you should be able to create a few LIKE patterns ('[^a-zA-Z0-9]', '[^a-zA-Z0-9][^a-zA-Z0-9]', ...) to match strings of different length. Otherwise you should use a CLR user-defined function and a proper regular expression; see the article "Regular Expressions Make Pattern Matching And Data Extraction Easier".
This will not work correctly, e.g. abcÑxyz will pass through this as it has a, b, c... You need to work with COLLATE or check each byte.
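The logic behind the NOT LIKE '%[a-z0-9]%' answer is "keep the row if no alphanumeric character appears anywhere in it". Here is a small Python sketch of that rule. The function name and sample rows are invented, the character class is ASCII-only, and collation (which decides how SQL Server treats characters such as Ñ) is not modeled.

```python
import re


def has_no_ascii_alphanumerics(value: str) -> bool:
    # True when not a single ASCII letter or digit occurs in the value,
    # which is the condition NOT LIKE '%[a-z0-9]%' expresses.
    return re.search(r"[a-zA-Z0-9]", value) is None


rows = ["!!!", "abc", "--a--", "#$%", ""]
matching = [row for row in rows if has_no_ascii_alphanumerics(row)]
print(matching)       # ['!!!', '#$%', '']
print(len(matching))  # 3
```

Note that the empty string also satisfies the rule in this sketch, so add a separate check if empty values should not be counted.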

SQLite not using index when using concatenation

I am using the following SQL statement for SQLite:
select * from words where "word" like ? || '%' || ? ;
in order to bind parameters to the first and last letters. I have tested this both with and without an index on the column word, and the results are the same. However, when running the queries as
select * from words where "word" like 'a%a';
etc. (that is, hardcoding each value instead of using ||), the query is about 10x faster when indexed.
Can someone show me how to use both the index and the parameters?
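I found an answer thanks to the sqlite mailing list. The query optimizer overview (http://sqlite.org/optoverview.html), section 4, says: "The right-hand side of the LIKE or GLOB must be either a string literal or a parameter bound to a string literal that does not begin with a wildcard character." An expression built with || is neither a string literal nor a single bound parameter, so the LIKE optimization does not apply to it.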
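Going by the quoted rule, one way to keep both the parameters and the index is to build the pattern in application code and bind it as a single parameter. The sketch below does this with Python's sqlite3 module. The table contents, the index name and the find_words helper are hypothetical, and the NOCASE index reflects my reading of the other conditions in that section for the default case-insensitive LIKE. Whether the index is actually used on a given build is best confirmed with EXPLAIN QUERY PLAN.

```python
import sqlite3


def find_words(conn, first: str, last: str):
    # Concatenate in the application so the right-hand side of LIKE is
    # one bound parameter instead of a '||' expression.
    pattern = f"{first}%{last}"
    cursor = conn.execute(
        'SELECT "word" FROM words WHERE "word" LIKE ? ORDER BY "word"',
        (pattern,),
    )
    return [row[0] for row in cursor]


conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE words ("word" TEXT)')
# Assumption: an index with NOCASE collation, for the default LIKE.
conn.execute('CREATE INDEX words_word_idx ON words ("word" COLLATE NOCASE)')
conn.executemany(
    "INSERT INTO words VALUES (?)",
    [("alpha",), ("area",), ("beta",), ("aroma",), ("a",)],
)

print(find_words(conn, "a", "a"))  # ['alpha', 'area', 'aroma']
```

If the first or last value can itself contain % or _, it would need escaping before being placed in the pattern; the sketch does not handle that.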