Regular expression metacharacter in SQL yields different results in Oracle vs. Postgres - sql

I'm trying to convert some queries from an Oracle environment to Postgres. This is a simplified version of one of the queries:
SELECT * FROM TABLE
WHERE REGEXP_LIKE(TO_CHAR(LINK_ID),'\D')
I believe the equivalent postgreSQL should be this:
SELECT * FROM TABLE
WHERE CAST(LINK_ID AS TEXT) ~ '\D'
But when I run these queries in their respective environments on the exact same dataset, the first query outputs no records (which is correct) and the second query outputs all records in the table. I didn't write the original code, but as I understand it, it's looking for any values in the numeric field LINK_ID that are non-digit characters. Is the \D metacharacter supposed to behave differently in Oracle vs. postgres? I'm not seeing anything in documentation to say they should.

The documentation for Oracle's TO_CHAR(number) states
If you omit fmt, then n is converted to a VARCHAR2 value exactly long enough to hold its significant digits.
https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions181.htm
This means that the only non-numeric character which might be produced is a negative sign or a decimal point. If the number is positive and has no fractional part, it will not match the regular expression \D.
On the other hand, on PostgreSQL CAST(numeric(38,8)as TEXT) returns a value with the number of decimal places specified by the type specification, in this case 8.
E.g.:
cast( cast(12341234 as numeric(38,8)) as TEXT)
Generates 12341234.00000000 The result of such a cast will always contain a decimal point and therefore will always match the regular expression \D.
You may find that replacing it with this solves your problem:
(LINK_ID % 1) <> 0.0
Alternatively, If you need to use the regex (e.g. to simplify migration work), consider changing it to '\.0*[1-9]' i.e. to find a decimal point with any nonzero digit after it.

Related

How to extract digits from field using regex

I am using Firebird 2.5 and I have a field (called identifier) with mixed letters, numbers and special characters. I would like to use regex to extract only the numbers in a new column. I have tried something like below, but it is not working.
Any idea how I can achieve this using regex without using stored procedures or execute block
SELECT ORDER_ID,
ORDER_DATE,
SUBSTRING(IDENTIFIER FROM 1 TO 10) SIMILAR TO '^[0-9]{10}$' --- DESIRED EXTRACTION COLUMN
FROM ORDERS
Example of data
IDENTIFIER DESIRED OUTPUT
ANDRE 02869567995 02869567995
02869567995 MARIA 02869567995
028.695.67.995 02869567995
028695679-95 02869567995
You cannot do this in Firebird 2.5, at least not without help from a UDF, or a (selectable) stored procedure. I'm not aware of third-party UDFs providing regular expressions, so you might have to write this yourself.
In Firebird 3.0, you could also use a UDR or stored function to achieve this. Unfortunately, using the regular expression functionality available in Firebird alone will not be enough to solve this.
NOTE: The rest of the answer is based on the assumption to extract digits if the first 10 characters of string are digits. With the updated question, this assumption is no longer valid.
That said, if your need is exactly as shown in your question, that is only extract the first 10 characters from a string if they are all digits, then you could use:
case
when IDENTIFIER similar to '[[:DIGIT:]]{10}%'
then substring(IDENTIFIER from 1 for 10)
end
(as an aside, the positional SUBSTRING syntax is from <start> for <length>, not from <start> to <end>)
In Firebird 3.0 and higher, you can use SUBSTRING(... SIMILAR ...) with a SQL regular expression pattern. Assuming you want to extract 10 digits from the start of a string, you can do:
substring(IDENTIFIER similar '#"[[:DIGIT:]]{10}#"%' escape '#')
The #" delimits the pattern to extract (where # is a custom escape character as specified in the ESCAPE clause). The remainder of the pattern must match the rest of the string, hence the use of % here (in other cases, you may need to specify a pattern before the first #" as well.
See this dbfiddle for an example.
It is not possible in any version of Firebird.

BigQuery - ROUND, TRUNC function Issue

I have a issue with round, trunc function from BigQuery standard query .
I expected at 3953.7, 3053.67, 3053.667. f1_, f2_ result is different. It is a bug??
I expected at 3.195, 3.195, 3.1955, 3.1965, 3.1945.
f1_, f3_ result is different. Is it my fault?
The ROUND() is used to round a numeric field to the nearest number of decimals specified.
There is a limitation of floating point values.
They can only represent binary values, but cannot precisely represent decimal digits after the decimal point (see here).
In case of SELECT ROUND(3053.665,2) you receive: 3053.66, you can overcome it by using: ROUND(value + 0.005, 2), which allows you to receive 3053.67.
Anyway, if you want to take care about precise decimal results, you should use the NUMERIC type. The following query gives results that you expect:
SELECT ROUND(3953.65,1), ROUND(numeric '3053.665',2), ROUND(numeric '3053.6665',3)
TRUNC(), the following query gives results that you expect:
SELECT TRUNC(3.1955,3), TRUNC(numeric'3.195',3), TRUNC(3.1955,4), TRUNC(numeric '3.1965',4), TRUNC(3.1945,4)
BigQuery parses fractional numbers as floating point by default for better performance, while other databases parses fractional numbers as NUMERIC by default. This means the other databases would interpret TRUNC(3.03,2) in the same way BigQuery interprets TRUNC(numeric '3.03',2).
I hope it will helps you.
This is due to the fact that, in BigQuery, digits are stored as floating point values by default.
You can fin more information about how these work in Wikipedia, but the main idea is that some numbers are not stored as they are but as the closest approximation its representation allows. If instead of 3.03 it is internally represented as 3.0299999999..., when you trunc it the result will be 3.02.
Same thing happens with round function, if 3053.665 is internally stored as 3053.6649999999..., the result of rounding it will be 3053.66.
If you specify it to be stored as NUMERIC, it then works as "expected":
select trunc(numeric '3.195', 3)
gives as result 3.195
You can find more information about Numeric Types in the official BigQuery Documentation.

Difference between _%_% and __% in sql server

I am learning basics of SQL through W3School and during understanding basics of wildcards I went through the following query:
--Finds any values that start with "a" and are at least 3 characters in length
WHERE CustomerName LIKE 'a_%_%'
as per the example following query will search the table where CustomerName column start with 'a' and have at least 3 characters in length.
However, I try the following query also:
WHERE CustomerName LIKE 'a__%'
The above query also gives me the exact same result.
I want to know whether there is any difference in both queries? Does the second query produce a different output in some specific scenario? If yes what will be that scenario?
Both start with A, and end with %. In the middle part, the first says "one char, then between zero and many chars, then one char", while the second one says "one char, then one char".
Considering that the part that comes after them (the final part) is %, which means "between zero and many chars", I can only see both clauses as identical, as they both essentially just want a string starting with A then at least two following characters. Perhaps if there were at least some limitations on what characters were allowed by the _, then maybe they could have been different.
If I had to choose, I'd go with the second one for being more intuitive. After all, many other masks (e.g. a%%%%%%_%%_%%%%%) will yield the same effect, but why the weird complexity?
For Like operator a single underscore "_" means, any single character, so if you put One underscore like
ColumnName LIKE 'a_%'
you basically saying you need a string where first letter is 'a' then followed by another single character and then followed by anything or nothing.
ColumnName LIKE 'a__%' OR ColumnName LIKE 'a_%_%'
Both expressions mean first letter 'a' then followed by two characters and then followed by anything or nothing. Or in simple English any string with 3 or more character starting with a.

Table or column name cannot start with numeric?

I tried to create table named 15909434_user with syntax like below:
CREATE TABLE 15909434_user ( ... )
It would produced error of course. Then, after I tried to have a bit research with google, I found a good article here that describe:
When you create an object in PostgreSQL, you give that object a name. Every table has a name, every column has a name, and so on. PostgreSQL uses a single data type to define all object names: the name type.
A value of type name is a string of 63 or fewer characters. A name must start with a letter or an underscore; the rest of the string can contain letters, digits, and underscores.
...
If you find that you need to create an object that does not meet these rules, you can enclose the name in double quotes. Wrapping a name in quotes creates a quoted identifier. For example, you could create a table whose name is "3.14159"—the double quotes are required, but are not actually a part of the name (that is, they are not stored and do not count against the 63-character limit). ...
Okay, now I know how to solve this by use this syntax (putting double quote on table name):
CREATE TABLE "15909434_user" ( ... )
You can create table or column name such as "15909434_user" and also user_15909434, but cannot create table or column name begin with numeric without use of double quotes.
So then, I am curious about the reason behind that (except it is a convention). Why this convention applied? Is it to avoid something like syntax limitation or other reason?
Thanks in advance for your attention!
It comes from the original sql standards, which through several layers of indirection eventually get to an identifier start block, which is one of several things, but primarily it is "a simple latin letter". There are other things too that can be used, but if you want to see all the details, go to http://en.wikipedia.org/wiki/SQL-92 and follow the links to the actual standard ( page 85 )
Having non numeric identifier introducers makes writing a parser to decode sql for execution easier and quicker, but a quoted form is fine too.
Edit: Why is it easier for the parser?
The problem for a parser is more in the SELECT-list clause than the FROM clause. The select-list is the list of expressions that are selected from the tables, and this is very flexible, allowing simple column names and numeric expressions. Consider the following:
SELECT 2e2 + 3.4 FROM ...
If table names, and column names could start with numerics, is 2e2 a column name or a valid number (e format is typically permitted in numeric literals) and is 3.4 the table "3" and column "4" or is it the numeric value 3.4 ?
Having the rule that identifiers start with simple latin letters (and some other specific things) means that a parser that sees 2e2 can quickly discern this will be a numeric expression, same deal with 3.4
While it would be possible to devise a scheme to allow numeric leading characters, this might lead to even more obscure rules (opinion), so this rule is a nice solution. If you allowed digits first, then it would always need quoting, which is arguably not as 'clean'.
Disclaimer, I've simplified the above slightly, ignoring corelation names to keep it short. I'm not totally familiar with postgres, but have double checked the above answer against Oracle RDB documentation and sql spec
I'd imagine it's to do with the grammar.
SELECT 24*DAY_NUMBER as X from MY_TABLE
is fine, but ambiguous if 24 was allowed as a column name.
Adding quotes means you're explicitly referring to an identifier not a constant. So in order to use it, you'd always have to escape it anyway.

Regular expression filter

I have this regular expression in my sql query
DECLARE #RETURN_VALUE VARCHAR(MAX)
IF #value LIKE '%[0-9]%[^A-Z]%[0-9]%'
BEGIN
SET #RETURN_VALUE = NULL
END
I am not sure, but whenever I have this in my row 12 TEST then it gives me the value of 12, but if I have three digit number then it filters out the three digit numbers.How can I modify the regular expression to return me the three digits numbers too.
any help will be appreciated.
SQL doesn't have regular expressions: it has SQL wildcard expressions. They are much simpler than regular expressions and long predate regular expressions. For instance, there is no way to specify alternation (a|b) or repetition ( a*, a+, a?, a{m,n} ) such as you might find in a regular expression.
The 'like expression' that you have
LIKE '%[0-9]%[^A-Z]%[0-9]%'
will match any string containing the following pattern anywhere in the string
zero or more of any character, followed by...
a single decimal digit, followed by...
zero or more of any character, followed by...
a single character other than A–Z (whether it's case sensitive or not depends on the collating sequence in use), followed by...
zero or of any character, followed by...
a single decimal digit, followed by...
zero or more of any character
One should note that the % is likely to match perhaps more than you might like.
Have you tried ([0-9]*). I believe that this will capture every digit for you. However, I am not as strong at regex. When I ran this through rubular, it worked, though :) BTW, rubular is a great way to test out regular expressions
You can easily create a SQL CLR function and use this in your queries. Visual Studio has a project template for this and makes deploying the functions a snap.
Here is more information from Microsoft about how to create the function and how to use it (for boolean matches and for data extraction).
First of all, note that this is not really a "regular expression", it's a SQL-specific form of wildcard matching. You are very limited in what you can accomplish with SQL wildcards. As one example, you cannot "optionally" match a specific character or character set.
Your expression, as you've written it, will match any value that contains two digits with at least one non-letter character in between them, meaning it will match:
111
1^1
1?7
1AAAAAAAAAAA?AAAAAAAAA1
-----------------------5-----------------3-------
And infinitely more items of a similar structure.
Oddly, one string that would not match this pattern is "12 TEST" because there is no character between the 1 and 2. The pattern also won't "give you" the value of 12 back because it's not a parsing expression, just a matching expression: it returns 1 (true) or 0 (false).
There is clearly something else going on in your application, possibly even an actual regular expression, but it has nothing to do with the SQL you've included here.