SQL LIKE condition fails to run

I've been tasked to develop a query that behaves essentially like the following one:
SELECT * FROM tblTestData WHERE *.TestConditions LIKE '*textToSearch*'
The textToSearch is a string which contains information about the condition in which a given device is tested (Voltage, Current, Frequency, etc) in the following format as an example:
[V:127][PF:1][F:50][I:65]
The objective is to recover a list of any and all tests performed at a voltage of 127 Volts, so the SQL developed would look like the following:
SELECT * FROM tblTestData WHERE *.TestConditions LIKE '*V:127*'
This works as intended, but there is a problem: because data was entered improperly, there are cases in which the textToSearch string looks like the following examples:
[V.127][PF:1][F:50][I:65]
[V.230][PF:1][F:50][I:65]
As you can see, my previous SQL transaction does not work as it does not meet the conditions.
If I try to do the following transaction with the objective of ignoring improper data format:
SELECT * FROM tblTestData WHERE *.TestConditions LIKE '*V*127*'
The transaction is not successful and returns an error.
What am I doing wrong for this transaction not to work? Am I approaching this problem the wrong way?
I also see a potential problem with this transaction. Suppose there were a group of test conditions like the following:
[V.127][PF:1][F:50][I:127]
[V.230][PF:1][F:50][I:127]
Would it return the values of both points given that both meet the condition of the transaction stated above?
In conclusion, my questions are:
What is wrong with the LIKE '*V*127*' condition for it not to work?
What are the implications of working with this condition? Can it return more information than desired if I am not careful?
I hope it is clear what I am asking for. If it isn't, please point out what is not clear and I will try to clarify it.

One choice is to look for any character between the "V" and the "127":
WHERE TestConditions LIKE '%V_127%'
Note that % is the wildcard for a string of any length and _ is the wildcard for a single character.
You can also use regular expressions:
WHERE regexp_like(TestConditions, 'V[.:]127')
Note that regular expressions match anywhere in the string, so wildcards at the beginning and end are not needed.
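The difference between the two wildcards, and the over-matching risk raised in the question, can be checked with a small sketch. It uses Python's built-in sqlite3 module as a stand-in for the real database (an assumption made purely so the example runs anywhere; the sample rows are hypothetical). The % and _ wildcards behave the same way for this purpose.

```python
import sqlite3

# In-memory SQLite table standing in for tblTestData (sample rows are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tblTestData (TestConditions TEXT)")
rows = [
    "[V:127][PF:1][F:50][I:65]",
    "[V.127][PF:1][F:50][I:65]",
    "[V.230][PF:1][F:50][I:127]",
]
conn.executemany("INSERT INTO tblTestData VALUES (?)", [(r,) for r in rows])


def matching(pattern):
    """Return the TestConditions values that match a LIKE pattern."""
    cur = conn.execute(
        "SELECT TestConditions FROM tblTestData "
        "WHERE TestConditions LIKE ? ORDER BY rowid",
        (pattern,),
    )
    return [r[0] for r in cur]


# % between V and 127 spans any number of characters, so the 230 V row
# is returned too, because its current field ends in 127.
loose = matching("%V%127%")

# _ allows exactly one character between V and 127.
strict = matching("%V_127%")
```

With these rows, the loose pattern returns all three values, while the underscore pattern returns only the two 127 V rows.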

You could check for both cases (although this will decrease performance)
SELECT *
FROM tblTestData
WHERE (TestConditions LIKE '%V:127%' OR TestConditions LIKE '%V.127%')
It is better to clean the data in your database if only old records have this problem.

Oracle recommends regular expressions for this kind of condition. You could build a regular expression for your case:
WITH your_table AS (
SELECT '[V.127][PF:1][F:50][I:65]' text_to_search FROM dual
UNION
SELECT '[V.230][PF:1][F:50][I:65]' text_to_search FROM dual
UNION
SELECT '[V:127][PF:1][F:50][I:65]' text_to_search FROM dual
)
SELECT *
FROM your_table
WHERE REGEXP_LIKE(text_to_search,'\[V(\.|:)127\]','i')
Or you could use the good old LIKE operator. In this case, you need to know that:
% matches zero or more characters
_ matches only one character
So you should use an underscore to match the : or the .
WITH your_table AS (
SELECT '[V.127][PF:1][F:50][I:65]' text_to_search FROM dual
UNION
SELECT '[V.230][PF:1][F:50][I:65]' text_to_search FROM dual
UNION
SELECT '[V:127][PF:1][F:50][I:65]' text_to_search FROM dual
)
SELECT *
FROM your_table
WHERE text_to_search LIKE '%V_127%';

Related

Using period "." in Standard SQL in BigQuery

BigQuery Standard SQL does not seem to allow a period "." in the select statement. Even a simple query (see below) seems to fail. This is a big problem for datasets with field names that contain ".". Is there an easy way to avoid this issue?
select id, time_ts as time.ts
from `bigquery-public-data.hacker_news.comments`
LIMIT 10
Returns error...
Error: Syntax error: Unexpected "." at [1:27]
This also fails...
select * except(detected_circle.center_x )
from [bigquery-public-data:eclipse_megamovie.photos_v_0_2]
LIMIT 10
It depends on what you are trying to accomplish. One interpretation is that you want to return a STRUCT named time with a single field named ts inside of it. If that's the case, you can use the STRUCT operator to build the result:
SELECT
id,
STRUCT(time_ts AS ts) AS time
FROM `bigquery-public-data.hacker_news.comments`
LIMIT 10;
In the BigQuery UI, it will display the result as id and time.ts, where the latter indicates that ts is inside a STRUCT named time.
BigQuery disallows columns in the result whose names include periods, so you'll get an error if you run the following query:
SELECT
id,
time_ts AS `time.ts`
FROM `bigquery-public-data.hacker_news.comments`
LIMIT 10;
Invalid field name "time.ts". Fields must contain only letters, numbers, and underscores, start with a letter or underscore, and be at most 128 characters long.
Elliot's answer is great and addresses the first part of your question, so let me address the second part (as it is quite different).
First, I wanted to mention that select modifiers like SELECT * EXCEPT are supported in BigQuery Standard SQL, so instead of
SELECT * EXCEPT(detected_circle.center_x )
FROM [bigquery-public-data:eclipse_megamovie.photos_v_0_2]
LIMIT 10
you should rather try
#standardSQL
SELECT * EXCEPT(detected_circle.center_x )
FROM `bigquery-public-data.eclipse_megamovie.photos_v_0_2`
LIMIT 10
and of course now we are back to the issue of using a period in Standard SQL.
So the above code can only be interpreted as an attempt to eliminate the center_x field from the detected_circle STRUCT (a nullable record). Technically speaking, this makes sense and can be done using the code below:
SELECT *
REPLACE(STRUCT(detected_circle.radius, detected_circle.center_y ) AS detected_circle)
FROM `bigquery-public-data.eclipse_megamovie.photos_v_0_2`
LIMIT 10
... still not clear to me how to use your recommendation to remove the entire detected_circle.*
SELECT * EXCEPT(detected_circle)
FROM `bigquery-public-data.eclipse_megamovie.photos_v_0_2`
LIMIT 10

Oracle SQL: Filtering rows with non-numeric characters

My question is very similar to this one: removing all the rows from a table with columns A and B, where some records include non-numeric characters (looking like '1234#5' or '1bbbb'). However, the solutions I read around don't seem to work for me. For example,
SELECT count(*) FROM tbl
--962060;
SELECT count(*)
FROM tbl
WHERE (REGEXP_like(A,'[^0-9]') OR REGEXP_like(B,'[^0-9]') ) ;
--17
SELECT count(*)
FROM tbl
WHERE (REGEXP_like(A,'[0-9]') and REGEXP_like(B,'[0-9]') )
;
--962060
From the 3rd query, I'd expect to see (962060-17)=962043. Why is it still 962060? An alternative query like this also gives the same answer:
SELECT count(*)
FROM tbl
WHERE (REGEXP_like(A,'[[:digit:]]')and REGEXP_like(B,'[[:digit:]]') )
;
--962060
Of course, I could bypass the problem by doing query1 minus query2, but I'd like to learn how to do that using regular expressions.
If you use a regexp, you should take into account that any part of the string may match it. For your example, you should specify that the whole string must contain only digits: ^ is the beginning of the string and $ is the end. You may also use \d, which matches a digit.
SELECT count(*)
FROM tbl
WHERE (REGEXP_like(A,'^[0-9]+$') and REGEXP_like(B,'^[0-9]+$') )
or
SELECT count(*)
FROM tbl
WHERE (REGEXP_like(A,'^\d+$') and REGEXP_like(B,'^\d+$') )
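The effect of the anchors can be seen in a short sketch using Python's re module, whose search behaves like an unanchored REGEXP_LIKE for these simple patterns. The sample values are hypothetical.

```python
import re

values = ["12345", "1234#5", "1bbbb", "987"]

# Unanchored: matches if the value contains a digit anywhere,
# which is why the original query still counted every row.
contains_digit = [v for v in values if re.search(r"[0-9]", v)]

# Anchored with ^ and $: matches only if every character is a digit.
all_digits = [v for v in values if re.search(r"^[0-9]+$", v)]
```

Here the unanchored pattern keeps all four values, while the anchored one keeps only "12345" and "987".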
I know you specifically asked for a regex solution, but translate can solve this kind of problem as well (and usually faster, because regexes use more processing power):
select count(1)
from tbl
where translate(a, 'x0123456789', 'x') is null
and translate(b, 'x0123456789', 'x') is null;
What this does: translate the characters 0123456789 to null, and if the result is null, then the input must have been all digits. The 'x' is just there because the third argument to translate can not be null.
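The same idea, removing every digit and checking whether anything is left, can be sketched in Python with str.translate. This is only an analogue of the Oracle call. The helper name is hypothetical, and the choice to treat an empty string as non-numeric is an assumption of this sketch.

```python
# Translation table that deletes the ten digit characters.
STRIP_DIGITS = str.maketrans("", "", "0123456789")


def is_all_digits(value):
    """True if value is non-empty and nothing remains once digits are removed."""
    remainder = value.translate(STRIP_DIGITS)
    return value != "" and remainder == ""


checks = {v: is_all_digits(v) for v in ["12345", "1234#5", "1bbbb", ""]}
```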
Thought I should add this here, might be helpful to other readers.

SQL pattern matching

I have a question related to SQL.
I want to match two fields for similarities and return a percentage on how similar it is.
For example if I have a field called doc, which contains the following
This is my first assignment in SQL
and in another field I have something like
My first assignment in SQL
I want to know how I can check the similarities between the two and return by how much percent.
I did some research and wanted a second opinion, plus I never asked for source code. I've looked at Soundex(), Difference(), and fuzzy string matching using the Levenshtein distance algorithm.
You didn't say what version of Oracle you are using. This example is based on 11g version.
You can use edit_distance function of utl_match package to determine how many characters you need to change in order to turn one string to another. greatest function returns the greatest value in the list of passed in parameters. Here is an example:
-- sample of data
with t1(col1, col2) as(
select 'This is my first assignment in SQL', 'My first assignment in SQL ' from dual
)
-- the query
select trunc(((greatest(length(col1), length(col2)) -
(utl_match.edit_distance(col2, col1))) * 100) /
greatest(length(col1), length(col2)), 2) as "%"
from t1
result:
%
----------
70.58
Addendum
As @jonearles correctly pointed out, it is much simpler to use the edit_distance_similarity function of the utl_match package.
with t1(col1, col2) as(
select 'This is my first assignment in SQL', 'My first assignment in SQL ' from dual
)
select utl_match.edit_distance_similarity(col1, col2) as "%"
from t1
;
Result:
%
----------
71
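For readers who want to see what the percentage represents, here is a minimal sketch of the same calculation using a plain Levenshtein distance. It is not Oracle's implementation, only the textbook dynamic-programming algorithm, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum single-character inserts, deletes, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity_percent(a, b):
    """Share of the longer string that does not need editing, as a percentage."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return (longest - edit_distance(a, b)) * 100 / longest


col1 = "This is my first assignment in SQL"
col2 = "My first assignment in SQL "  # trailing space kept from the sample data
distance = edit_distance(col2, col1)
percent = similarity_percent(col1, col2)
```

For the sample strings the distance is 10 against a maximum length of 34, which gives roughly 70.59 percent, in line with the two results shown above.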

How can I SELECT DISTINCT on the last, non-numerical part of a mixed alphanumeric field?

I have a data set that looks something like this:
A6177PE
A85506
A51SAIO
A7918F
A810004
A11483ON
A5579B
A89903
A104F
A9982
A8574
A8700F
And I need to find all the ENDings where they are non-numeric. In this example, that means PE, AIO, F, ON, B and F.
In pseudocode, I'm imagining I need something like
SELECT DISTINCT X FROM
(SELECT SUBSTR(COL,[SOME_CLEVER_LOGIC]) AS X FROM TABLE);
Any ideas? Can I solve this without learning regexp?
EDIT: To clarify, my data set is a lot larger than this example. Also, I'm only interested in the part of the string AFTER the numeric part. If the string is "A6177PE" I want "PE".
Disclaimer: I don't know Oracle SQL. But, I think something like this should work:
SELECT DISTINCT X FROM
(SELECT SUBSTR(COL,REGEXP_INSTR(COL, '[[:alpha:]]+$')) AS X FROM TABLE);
REGEXP_INSTR(COL, '[[:alpha:]]+$') should return the position of the first of the alphabetic characters at the end of the field.
For readability, I'd recommend using the REGEXP_SUBSTR function (provided there are no performance issues, of course, as this is definitely slower than the accepted solution).
It is similar to REGEXP_INSTR, but instead of returning the position of the substring, it returns the substring itself:
SELECT DISTINCT REGEXP_SUBSTR(MY_COLUMN, '[a-zA-Z]+$') FROM MY_TABLE;
([[:alpha:]] is also supported, as @Audun wrote.)
Also useful: Oracle Regexp Support (beginning page)
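The pattern itself can be tried outside the database. The sketch below applies the same trailing-letters expression with Python's re module to a few of the sample codes from the question. The helper name is hypothetical, and codes without a letter suffix are simply skipped.

```python
import re

codes = ["A6177PE", "A85506", "A7918F", "A11483ON", "A5579B", "A104F", "A9982"]


def alpha_suffix(code):
    """Return the run of letters at the end of code, or None if it ends in a digit."""
    match = re.search(r"[A-Za-z]+$", code)
    return match.group(0) if match else None


# Distinct endings, ignoring codes that have no letter suffix.
endings = sorted({s for s in map(alpha_suffix, codes) if s is not None})
```

For this subset the distinct endings are B, F, ON and PE.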
For example
SELECT SUBSTR(col,INSTR(TRANSLATE(col,'A0123456789','A..........'),'.',-1)+1)
FROM table;

Sorting '£' (pound symbol) in sql

I am trying to sort £ along with other special characters, but it's not sorting properly.
I want that string to be sorted along with other strings starting with special characters. For example I have four strings:
&!##
££$$
abcd
&#$%.
Now it's sorting in the order: &!##, &#$%, abcd, ££$$.
I want it in the order: &!##, &#$%, ££$$, abcd.
I have used the function order by replace(column,'£','*') so that it sorts along with strings starting with *. Although this seems to work while querying the DB, when used in code and deployed the £ gets replaced by �, i.e. (replace(column,'�','*') in the query, and doesn't sort as expected.
How to resolve this issue? Is there any other solution to sort the pound symbol/£? Any help would be greatly appreciated.
You seem to have two problems: performing the actual sort, and (possibly) how the £ symbol appears in the results in your code. Without knowing anything about your code or client or environment it's rather hard to guess what you might need to change, but I'd start by looking at your NLS_LANG and other NLS settings at the client end. @amccausl's link might be useful, but it depends what you're doing. I suspect you'll find different values in nls_session_parameters when queried from SQL*Plus and from your code, which may give you some pointers.
The sorting itself is slightly clearer now. Have a look at the docs for Linguistic Sorting and String Searching and NLSSORT.
You can do something like this (with a CTE to generate your data):
with tmp_tab as (
select '&!##' as value from dual
union all select '££$$' from dual
union all select 'abcd' from dual
union all select '&#$%' from dual
)
select * from tmp_tab
order by nlssort(value, 'NLS_SORT = WEST_EUROPEAN')
VALUE
------
&!##
&#$%
££$$
abcd
4 rows selected.
You can get the sort values supported by your configuration with select value from v$nls_valid_values where parameter = 'SORT', but WEST_EUROPEAN seems to do what you want, for this sample data anyway.
You can see the default sorting in your current session with select value from nls_session_parameters where parameter = 'NLS_SORT'. (You can change that with an ALTER SESSION, but it's only letting me do that with some values, so that may not be helpful here).
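A plausible explanation for the original order is a binary sort, which compares character codes rather than applying linguistic rules. The sketch below uses Python's default string ordering (by Unicode code point) as an analogue. It is an assumption that the session in question was using a binary sort, and the database compares encoded values in its own character set, but the effect on this sample is the same.

```python
# The four sample strings; \u00a3 is the pound sign.
values = ["&!##", "\u00a3\u00a3$$", "abcd", "&#$%"]

# Python compares strings code point by code point, like a binary sort.
binary_order = sorted(values)

# The pound sign (163) sorts after every ASCII letter such as 'a' (97).
pound_code = ord("\u00a3")
letter_code = ord("a")
```

Sorted this way, the pound strings land after abcd, which matches the order reported in the question.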
You need to make sure your application code is all proper UTF-8 (see http://htmlpurifier.org/docs/enduser-utf8.html for more details)
Seems like your issue is with db characterset, or difference in charactersets between the app and db. For Oracle side, you can check by doing:
select value from sys.nls_database_parameters where parameter='NLS_CHARACTERSET';
If this comes up ascii (like US7ASCII), then you may have issues storing the data properly. Even if this is the charset, you should be able to insert and retrieve sorted (binary sort) by using nvarchar2 and unistr (assuming they conform to your NLS_NCHAR_CHARACTERSET, see above query but change parameter), like:
create table test1(val nvarchar2(100));
insert into test1(val) values (unistr('\00a3')); -- pound currency
insert into test1(val) values (unistr('\00a5')); -- yen currency
insert into test1(val) values ('$'); -- dollar currency
commit;
select * from test1
order by val asc;
-- will give symbols in order: dollar('\0024'), pound ('\00a3'), yen ('\00a5')
I will say that I would not resort to using the national character set. I would probably change the db character set to fit the needs of my data, as supporting two different character sets isn't ideal, but it's available anyway.
If you have no issues storing/retrieving on the data side, then your app/client characterset is probably different than your db.
Use nchar(168). It will work.
select nchar(168)