Humanized or natural number sorting of mixed word-and-number strings - sql

Following up on this question by Sivaram Chintalapudi, I'm interested in whether it's practical in PostgreSQL to do natural - or "humanized" - sorting " of strings that contain a mixture of multi-digit numbers and words/letters. There is no fixed pattern of words and numbers in the strings, and there may be more than one multi-digit number in a string.
The only place I've seen this done routinely is in the Mac OS's Finder, which sorts filenames containing mixed numbers and words naturally, placing "20" after "3", not before it.
The collation order desired would be produced by an algorithm that split each string into blocks at letter-number boundaries, then ordered each part, treating letter-blocks with normal collation and number-blocks as integers for collation purposes. So:
'AAA2fred' would become ('AAA',2,'fred') and 'AAA10bob' would become ('AAA',10,'bob'). These can then be sorted as desired:
regress=# WITH dat AS ( VALUES ('AAA',2,'fred'), ('AAA',10,'bob') )
regress-# SELECT dat FROM dat ORDER BY dat;
dat
--------------
(AAA,2,fred)
(AAA,10,bob)
(2 rows)
as compared to the usual string collation ordering:
regress=# WITH dat AS ( VALUES ('AAA2fred'), ('AAA10bob') )
regress-# SELECT dat FROM dat ORDER BY dat;
dat
------------
(AAA10bob)
(AAA2fred)
(2 rows)
However, the record comparison approach doesn't generalize because Pg won't compare ROW(..) constructs or records of unequal numbers of entries.
Given the sample data in this SQLFiddle the default en_AU.UTF-8 collation produces the ordering:
1A, 10A, 2A, AAA10B, AAA11B, AAA1BB, AAA20B, AAA21B, X10C10, X10C2, X1C1, X1C10, X1C3, X1C30, X1C4, X2C1
but I want:
1A, 2A, 10A, AAA1BB, AAA10B, AAA11B, AAA20B, AAA21B, X1C1, X1C3, X1C4, X1C10, X1C30, X2C1, X10C10, X10C2
I'm working with PostgreSQL 9.1 at the moment, but 9.2-only suggestions would be fine. I'm interested in advice on how to achieve an efficient string-splitting method, and how to then compare the resulting split data in the alternating string-then-number collation described. Or, of course, on entirely different and better approaches that don't require splitting strings.
PostgreSQL doesn't seem to support comparator functions, otherwise this could be done fairly easily with a recursive comparator and something like ORDER USING comparator_fn and a comparator(text,text) function. Alas, that syntax is imaginary.
Update: Blog post on the topic.

Building on your test data, but this works with arbitrary data. This works with any number of elements in the string.
Register a composite type made up of one text and one integer value once per database. I call it ai:
CREATE TYPE ai AS (a text, i int);
The trick is to form an array of ai from each value in the column.
regexp_matches() with the pattern (\D*)(\d*) and the g option returns one row for every combination of letters and numbers. Plus one irrelevant dangling row with two empty strings '{"",""}' Filtering or suppressing it would just add cost. Aggregate this into an array, after replacing empty strings ('') with 0 in the integer component (as '' cannot be cast to integer).
NULL values sort first - or you have to special case them - or use the whole shebang in a STRICT function like #Craig proposes.
Postgres 9.4 or later
SELECT data
FROM alnum
ORDER BY ARRAY(SELECT ROW(x[1], CASE x[2] WHEN '' THEN '0' ELSE x[2] END)::ai
FROM regexp_matches(data, '(\D*)(\d*)', 'g') x)
, data;
db<>fiddle here
Postgres 9.1 (original answer)
Tested with PostgreSQL 9.1.5, where regexp_replace() had a slightly different behavior.
SELECT data
FROM (
SELECT ctid, data, regexp_matches(data, '(\D*)(\d*)', 'g') AS x
FROM alnum
) x
GROUP BY ctid, data -- ctid as stand-in for a missing pk
ORDER BY regexp_replace (left(data, 1), '[0-9]', '0')
, array_agg(ROW(x[1], CASE x[2] WHEN '' THEN '0' ELSE x[2] END)::ai)
, data -- for special case of trailing 0
Add regexp_replace (left(data, 1), '[1-9]', '0') as first ORDER BY item to take care of leading digits and empty strings.
If special characters like {}()"', can occur, you'd have to escape those accordingly.
#Craig's suggestion to use a ROW expression takes care of that.
BTW, this won't execute in sqlfiddle, but it does in my db cluster. JDBC is not up to it. sqlfiddle complains:
Method org.postgresql.jdbc3.Jdbc3Array.getArrayImpl(long,int,Map) is
not yet implemented.
This has since been fixed: http://sqlfiddle.com/#!17/fad6e/1

I faced this same problem, and I wanted to wrap the solution in a function so I could re-use it easily. I created the following function to achieve a 'human style' sort order in Postgres.
CREATE OR REPLACE FUNCTION human_sort(text)
RETURNS text[] AS
$BODY$
/* Split the input text into contiguous chunks where no numbers appear,
and contiguous chunks of only numbers. For the numbers, add leading
zeros to 20 digits, so we can use one text array, but sort the
numbers as if they were big integers.
For example, human_sort('Run 12 Miles') gives
{'Run ', '00000000000000000012', ' Miles'}
*/
select array_agg(
case
when a.match_array[1]::text is not null
then a.match_array[1]::text
else lpad(a.match_array[2]::text, 20::int, '0'::text)::text
end::text)
from (
select regexp_matches(
case when $1 = '' then null else $1 end, E'(\\D+)|(\\d+)', 'g'
) AS match_array
) AS a
$BODY$
LANGUAGE sql IMMUTABLE;
tested to work on Postgres 8.3.18 and 9.3.5
No recursion, should be faster than recursive solutions
Can be used in just the order by clause, don't have to deal with primary key or ctid
Works for any select (don't even need a PK or ctid)
Simpler than some other solutions, should be easier to extend and maintain
Suitable for use in a functional index to improve performance
Works on Postgres v8.3 or higher
Allows an unlimited number of text/number alternations in the input
Uses just one regex, should be faster than versions with multiple regexes
Numbers longer than 20 digits are ordered by their first 20 digits
Here's an example usage:
select * from (values
('Books 1', 9),
('Book 20 Chapter 1', 8),
('Book 3 Suffix 1', 7),
('Book 3 Chapter 20', 6),
('Book 3 Chapter 2', 5),
('Book 3 Chapter 1', 4),
('Book 1 Chapter 20', 3),
('Book 1 Chapter 3', 2),
('Book 1 Chapter 1', 1),
('', 0),
(null::text, 0)
) as a(name, sort)
order by human_sort(a.name)
-----------------------------
|name | sort |
-----------------------------
| | 0 |
| | 0 |
|Book 1 Chapter 1 | 1 |
|Book 1 Chapter 3 | 2 |
|Book 1 Chapter 20 | 3 |
|Book 3 Chapter 1 | 4 |
|Book 3 Chapter 2 | 5 |
|Book 3 Chapter 20 | 6 |
|Book 3 Suffix 1 | 7 |
|Book 20 Chapter 1 | 8 |
|Books 1 | 9 |
-----------------------------

Adding this answer late because it looked like everyone else was unwrapping into arrays or some such. Seemed excessive.
CREATE FUNCTION rr(text,int) RETURNS text AS $$
SELECT regexp_replace(
regexp_replace($1, '[0-9]+', repeat('0',$2) || '\&', 'g'),
'[0-9]*([0-9]{' || $2 || '})',
'\1',
'g'
)
$$ LANGUAGE sql;
SELECT t,rr(t,9) FROM mixed ORDER BY t;
t | rr
--------------+-----------------------------
AAA02free | AAA000000002free
AAA10bob | AAA000000010bob
AAA2bbb03boo | AAA000000002bbb000000003boo
AAA2bbb3baa | AAA000000002bbb000000003baa
AAA2fred | AAA000000002fred
(5 rows)
(reverse-i-search)`OD': SELECT crypt('richpass','$2$08$aJ9ko0uKa^C1krIbdValZ.dUH8D0R0dj8mqte0Xw2FjImP5B86ugC');
richardh=>
richardh=> SELECT t,rr(t,9) FROM mixed ORDER BY rr(t,9);
t | rr
--------------+-----------------------------
AAA2bbb3baa | AAA000000002bbb000000003baa
AAA2bbb03boo | AAA000000002bbb000000003boo
AAA2fred | AAA000000002fred
AAA02free | AAA000000002free
AAA10bob | AAA000000010bob
(5 rows)
I'm not claiming two regexps are the most efficient way to do this, but rr() is immutable (for fixed length) so you can index it. Oh - this is 9.1
Of course, with plperl you could just evaluate the replacement to pad/trim it in one go. But then with perl you've always got just-one-more-option (TM) than any other approach :-)

The following function splits a string into an array of (word,number) pairs of arbitrary length. If the string begins with a number then the first entry will have a NULL word.
CREATE TYPE alnumpair AS (wordpart text,numpart integer);
CREATE OR REPLACE FUNCTION regexp_split_numstring_depth_pairs(instr text)
RETURNS alnumpair[] AS $$
WITH x(match) AS (SELECT regexp_matches($1, '(\D*)(\d+)(.*)'))
SELECT
ARRAY[(CASE WHEN match[1] = '' THEN '0' ELSE match[1] END, match[2])::alnumpair] || (CASE
WHEN match[3] = '' THEN
ARRAY[]::alnumpair[]
ELSE
regexp_split_numstring_depth_pairs(match[3])
END)
FROM x;$$ LANGUAGE 'sql' IMMUTABLE;
allowing PostgreSQL's composite type sorting to come into play:
SELECT data FROM alnum ORDER BY regexp_split_numstring_depth_pairs(data);
and producing the expected result, as per this SQLFiddle. I've adopted Erwin's substitution of 0 for the empty string in all strings beginning with a number so that numbers sort first; it's cleaner than using ORDER BY left(data,1), regexp_split_numstring_depth_pairs(data).
While the function is probably horrifically slow it can at least be used in an expression index.
That was fun!

create table dat(val text)
insert into dat ( VALUES ('BBB0adam'), ('AAA10fred'), ('AAA2fred'), ('AAA2bob') );
select
array_agg( case when z.x[1] ~ E'\\d' then lpad(z.x[1],10,'0') else z.x[1] end ) alnum_key
from (
SELECT ctid, regexp_matches(dat.val, E'(\\D+|\\d+)','g') as x
from dat
) z
group by z.ctid
order by alnum_key;
alnum_key
-----------------------
{AAA,0000000002,bob}
{AAA,0000000002,fred}
{AAA,0000000010,fred}
{BBB,0000000000,adam}
Worked on this for almost an hour and posted without looking -- I see Erwin arrived at a similar place. Ran into the same "could not find array type for data type text[]" trouble as #Clodoaldo. Had a lot of trouble getting the cleanup exercise to not agg all the rows until I thought of grouping by the ctid (which feels like cheating really -- and doesn't work on a psuedo table as in the OP example WITH dat AS ( VALUES ('AAA2fred'), ('AAA10bob') )
...). It would be nicer if array_agg could accept a set-producing subselect.

I'm not a RegEx guru, but I can work it to some extent. Enough to produce this answer.
It will handle up to 2 numeric values within the content. I don't think OSX goes further than that, if it even handles 2.
WITH parted AS (
select data,
substring(data from '([A-Za-z]+).*') part1,
substring('a'||data from '[A-Za-z]+([0-9]+).*') part2,
substring('a'||data from '[A-Za-z]+[0-9]+([A-Za-z]+).*') part3,
substring('a'||data from '[A-Za-z]+[0-9]+[A-Za-z]+([0-9]+).*') part4
from alnum
)
select data
from parted
order by part1,
cast(part2 as int),
part3,
cast(part4 as int),
data;
SQLFiddle

The following solution is a combination of various ideas presented in other answers, as well as some ideas from the classic solution:
create function natsort(s text) returns text immutable language sql as $$
select string_agg(r[1] || E'\x01' || lpad(r[2], 20, '0'), '')
from regexp_matches(s, '(\D*)(\d*)', 'g') r;
$$;
The design goals of this function were simplicity and pure string operations (no custom types and no arrays), so it can easily be used as a drop-in solution, and is trivial to be indexed over.
Note: If you expect numbers with more than 20 digits, you'll have to replace the hard-coded maximum length 20 in the function with a suitable larger length. Note that this will directly affect the length of the resulting strings, so don't make that value larger than needed.

Related

Can (should) I include elements of TO_CHAR, CONCAT and FM in the same select statement?

I'm following up on a previous post that successfully generated an Oracle SELECT statement. From within that prior script I now need to
concatenate two different fields (numeric values for 3-digit area codes and 7-digit phone numbers), then
format the resulting column as XXX-XXX-XXXX
but my attempts at using TO_CHAR, CONCAT (or || I have tried doing the concatenation both ways), and FM in the same line result in invalid number or invalid operator errors (depending on how I've rearranged the elements in the line) painfully reminding me that my barely-elementary scripting shows a significant lack understanding of proper use and syntax.
The combination of TO_CHAR and CONCAT (||) successfully produces a 9-digit string, but I'm trying to attain as result formatted as XXX-XXX-XXXX from the following (I've edited out the lines from the original script for data elements not relevant to this particular question; nothing in the original query is nested, it just selects several fields and has a series of left joins linking on a common UID field in different tables)
select distinct
cn.dflt_id StudentIdNumber,
to_char (p.area_code || p.phone_no) Phone,
from
co_name cn
left join co_v_name_phone1 p on cn.name_id = p.name_id
order by cn.dflt_id
Would anyone offer helpful advice on attaining the desired XXX-XXX-XXXX formatting in the resulting Phone column? My attempts with variants of 'fm999g999g9999' have thus far not been successful.
Thanks,
Scott
Here are a few options that crossed my mind; have a look, pick the one you find the most appropriate. If you still have problems, post your own test case.
RES2 is a simple concatenation of substrings that have a - in between
RES3 uses format mask with an adjusted NLS_NUMERIC_CHARACTERS for thousands
RES4 concatenates area code (which is OK by itself) with regular expression that splits a string into two parts; the first has {3} characters, and the second one has {4} of them
By the way, are area codes really numbers? No leading zeros?
SQL> with test (area_code, phone_number) as
2 (select 123, 9884556 from dual union
3 select 324, 1254789 from dual
4 )
5 select
6 to_char(area_code) || to_char(phone_number) l_concat,
7 --
8 substr(to_char(area_code) || to_char(phone_number), 1, 3) ||'-'||
9 substr(to_char(area_code) || to_char(phone_number), 4, 3) ||'-'||
10 substr(to_char(area_code) || to_char(phone_number), 7)
11 res2,
12 --
13 to_char(to_char(area_code) || to_char(phone_number),
14 '000g000g0000', 'nls_numeric_characters=.-') res3,
15 --
16 to_char(area_code) ||'-'||
17 regexp_replace(to_char(phone_number), '(\d{3})(\d{4})', '\1-\2') res4
18 from test;
L_CONCAT RES2 RES3 RES4
------------- ------------- ------------- -------------
1239884556 123-988-4556 123-988-4556 123-988-4556
3241254789 324-125-4789 324-125-4789 324-125-4789
SQL>

Deleting records with number repeating more than 5

I have data in a table of length 9 where data is like
999999969
000000089
666666689
I want to delete only those data in which any number from 1-9 is repeating more than 5 times.
OK, so the logic here can be summed up as:
Find the longest series of the same consecutive digit in any given number; and
Return true if that longest value is > 5 digits
Right?
So, lets split it into series of consecutive digits:
regress=> SELECT regexp_matches('666666689', '(0+|1+|2+|3+|4+|5+|6+|7+|8+|9+)', 'g');
regexp_matches
----------------
{6666666}
{8}
{9}
(3 rows)
then filter for the longest:
regress=>
SELECT x[1]
FROM regexp_matches('6666666898', '(0+|1+|2+|3+|4+|5+|6+|7+|8+|9+)', 'g') x
ORDER BY length(x[1]) DESC
LIMIT 1;
x
---------
6666666
(1 row)
... but really, we don't actually care about that, just if any entry is longer than 5 digits, so:
SELECT x[1]
FROM regexp_matches('6666666898', '(0+|1+|2+|3+|4+|5+|6+|7+|8+|9+)', 'g') x
WHERE length(x[1]) > 5;
can be used as an EXISTS test, e.g.
WITH blah(n) AS (VALUES('999999969'),('000000089'),('666666689'),('15552555'))
SELECT n
FROM blah
WHERE EXISTS (
SELECT x[1]
FROM regexp_matches(n, '(0+|1+|2+|3+|4+|5+|6+|7+|8+|9+)', 'g') x
WHERE length(x[1]) > 5
)
which is actually pretty efficient and return the correct result (always nice). But it can be simplified a little more with:
WITH blah(n) AS (VALUES('999999969'),('000000089'),('666666689'),('15552555'))
SELECT n
FROM blah
WHERE EXISTS (
SELECT x[1]
FROM regexp_matches(n, '(0{6}|1{6}|2{6}|3{6}|4{6}|5{6}|6{6}|7{6}|8{6}|9{6})', 'g') x;
)
You can use the same WHERE clause in a DELETE.
This can be much simpler with a regular expression using a back reference.
DELETE FROM tbl
WHERE col ~ '([1-9])\1{5}';
That's all.
Explain
([1-9]) ... a character class with digits from 1 to 9, parenthesized for the following back reference.
\1 ... back reference to first (and only in this case) parenthesized subexpression.
{5} .. exactly (another) 5 times, making it "more than 5".
Per documentation:
A back reference (\n) matches the same string matched by the previous
parenthesized subexpression specified by the number n [...] For example, ([bc])\1 matches bb or cc but not bc or cb.
SQL Fiddle demo.
Horrible and terrible in terms of performance, but it should work:
DELETE FROM YOURTABLE
WHERE YOURDATA LIKE '%111111%'
OR YOURDATA LIKE '%222222%'
OR YOURDATA LIKE '%333333%'
OR YOURDATA LIKE '%444444%'
OR YOURDATA LIKE '%555555%'
OR YOURDATA LIKE '%666666%'
OR YOURDATA LIKE '%777777%'
OR YOURDATA LIKE '%888888%'
OR YOURDATA LIKE '%999999%'

Sort alphanumeric column

I have a column in database:
Serial Number
-------------
S1
S10
...
S2
S11
..
S13
I want to sort and return the result as follows for serial number <= 10 :
S1
S2
S10
One way I tried was:
select Serial_number form table where Serial_Number IN ('S1', 'S2',... 'S10');
This solves the purpose but looking for a better way
Here is an easy way for this format:
order by length(Serial_Number),
Serial_Number
This works because the prefix ('S') is the same length on all the values.
For Postgres you can use something like this:
select serial_number
from the_table
order by regexp_replace(serial_number, '[^0-9]', '', 'g')::integer;
The regexp_replace will remove all non-numeric characters and the result is treated as a number which is suited for a "proper" sorting.
Edit 1:
You can use the new "number" to limit the result of the query:
select serial_number
from (
select serial_number,
regexp_replace(serial_number, '[^0-9]', '', 'g')::integer as snum
from the_table
) t
where snum <= 10
order by snum;
Edit 2
If you receive the error ERROR: invalid input syntax for integer: "" then apparently you have values in the serial_number column which do no follow the format you posted in your question. It means that regexp_replace() remove all characters from the string, so a string like S would cause that.
To prevent that, you need to either exclude those rows from the result using:
where length(regexp_replace(serial_number, '[^0-9]', '', 'g')) > 0
in the inner select. Or, if you need those rows for some reason, deal with that in the select list:
select serial_number
from (
select serial_number,
case
when length(regexp_replace(serial_number, '[^0-9]', '', 'g')) > 0 then regexp_replace(serial_number, '[^0-9]', '', 'g')::integer as snum
else null -- or 0 whatever you need
end as snum
from the_table
) t
where snum <= 10
order by snum;
This is a really nice example on why you should never mix two different things in a single column. If all your serial numbers have a prefix S you shouldn't store it and put the real number in a real integer (or bigint) column.
Using something like NOT_SET to indicate a missing value is also a bad choice. The NULL value was precisely invented for that reason: to indicate the absence of data.
Since only the first character spoils your numeric fun, just trim it with right() and sort by the numeric value:
SELECT *
FROM tbl
WHERE right(serial_number, -1)::int < 11
ORDER BY right(serial_number, -1)::int;
Requires Postgres 9.1 or later. In older versions substitute with substring (x, 10000).

SQL change date formats inside a string

I would like to convert a string containing dates in SQL select from Oracle 11g database.
Original string (CLOB) example:
"1.12.2011 - event 1
2.2.2012 - event 2
13.3.2012 - event 44"
Desired output:
"20111201 - event 1
20120202 - event 2
20120313 - event 44"
Is there a better (faster) way than using 4 separate replacements?
regexp_replace(regexp_replace(regexp_replace(regexp_replace(my_string,
'(\d\d)\.(\d\d)\.(20\d\d)', '\3\2\1'),
'(\d\d)\.(\d)\.(20\d\d)', '\30\2\1'),
'(\d)\.(\d\d)\.(20\d\d)', '\3\20\1'),
'(\d)\.(\d)\.(20\d\d)', '\30\20\1')
Especially if you're using clobs you have to be careful unless you're certain of the data in there.
However, if your clob only looks like that then you need threeregexp_replace in order for this to work; it'll also be much more dynamic. Just explicitly specify digits using [[:digit:]] then specify a minimum and maximum number of times these digits could be there using {1,2}.
Then the following would work:
select regexp_replace(
regexp_replace(
regexp_replace( my_string
, '([[:digit:]]{1,2})\.([[:digit:]]{1,2})\.(20[[:digit:]]{2})'
, '\3-\2-\1')
, '-([[:digit:]]{1}(-|$))'
, '0\1' )
, ('-')
, '')
from dual
This means:
match ( group 1 ) 1 or 2 digits
match a full stop.
match ( group 2 ) 1 or 2 digits
match a full stop
match ( group 3 ) 20 + 2 digits.
Then take out only groups 1, 2 and 3, i.e. ignoring the full stops and return then in the order 3, 2, 1 padded with a hyphen
Then replace any [digit] that is followed by either a hyphen or the end of the string, i.e. the number of digits is only 1 with -0[digit].
Lastly replace all the hyphens.
Separately from that I agree with tbone. It would make a lot more sense to store this data in a separate table (event_id number, event_date date). Any string transformations are easy with no chance of getting it wrong, unlike in this situation, and the data is easy to query and compare.
there are no better options (both correct and readable) with better performance - or if there are, no one cares..
i prefer a 2-level regexp_replace for date part:
select regexp_replace(
regexp_replace( my_string,
'([[:digit:]]{1,2})\.([[:digit:]]{1,2})\.(20[[:digit:]]{2})',
'\3-0\2-0\1' ),
'(20[[:digit:]]{2})-0?([[:digit:]]{2})-0?([[:digit:]]{2})',
'\3\2\1' )
from dual;
Demo
Maybe try doing:
select to_char(to_date('13.3.2011', 'DD.MM.YYYY'),'YYYYMMDD') from dual;

Finding rows that don't contain numeric data in Oracle

I am trying to locate some problematic records in a very large Oracle table. The column should contain all numeric data even though it is a varchar2 column. I need to find the records which don't contain numeric data (The to_number(col_name) function throws an error when I try to call it on this column).
I was thinking you could use a regexp_like condition and use the regular expression to find any non-numerics. I hope this might help?!
SELECT * FROM table_with_column_to_search WHERE REGEXP_LIKE(varchar_col_with_non_numerics, '[^0-9]+');
To get an indicator:
DECODE( TRANSLATE(your_number,' 0123456789',' ')
e.g.
SQL> select DECODE( TRANSLATE('12345zzz_not_numberee',' 0123456789',' '), NULL, 'number','contains char')
2 from dual
3 /
"contains char"
and
SQL> select DECODE( TRANSLATE('12345',' 0123456789',' '), NULL, 'number','contains char')
2 from dual
3 /
"number"
and
SQL> select DECODE( TRANSLATE('123405',' 0123456789',' '), NULL, 'number','contains char')
2 from dual
3 /
"number"
Oracle 11g has regular expressions so you could use this to get the actual number:
SQL> SELECT colA
2 FROM t1
3 WHERE REGEXP_LIKE(colA, '[[:digit:]]');
COL1
----------
47845
48543
12
...
If there is a non-numeric value like '23g' it will just be ignored.
In contrast to SGB's answer, I prefer doing the regexp defining the actual format of my data and negating that. This allows me to define values like $DDD,DDD,DDD.DD
In the OPs simple scenario, it would look like
SELECT *
FROM table_with_column_to_search
WHERE NOT REGEXP_LIKE(varchar_col_with_non_numerics, '^[0-9]+$');
which finds all non-positive integers. If you wau accept negatiuve integers also, it's an easy change, just add an optional leading minus.
SELECT *
FROM table_with_column_to_search
WHERE NOT REGEXP_LIKE(varchar_col_with_non_numerics, '^-?[0-9]+$');
accepting floating points...
SELECT *
FROM table_with_column_to_search
WHERE NOT REGEXP_LIKE(varchar_col_with_non_numerics, '^-?[0-9]+(\.[0-9]+)?$');
Same goes further with any format. Basically, you will generally already have the formats to validate input data, so when you will desire to find data that does not match that format ... it's simpler to negate that format than come up with another one; which in case of SGB's approach would be a bit tricky to do if you want more than just positive integers.
Use this
SELECT *
FROM TableToSearch
WHERE NOT REGEXP_LIKE(ColumnToSearch, '^-?[0-9]+(\.[0-9]+)?$');
After doing some testing, i came up with this solution, let me know in case it helps.
Add this below 2 conditions in your query and it will find the records which don't contain numeric data
and REGEXP_LIKE(<column_name>, '\D') -- this selects non numeric data
and not REGEXP_LIKE(column_name,'^[-]{1}\d{1}') -- this filters out negative(-) values
Starting with Oracle 12.2 the function to_number has an option ON CONVERSION ERROR clause, that can catch the exception and provide default value.
This can be used for the test of number values. Simple set NULL when the conversion fails and filer all not NULL values.
Example
with num as (
select '123' vc_col from dual union all
select '1,23' from dual union all
select 'RV12P2000' from dual union all
select null from dual)
select
vc_col
from num
where /* filter numbers */
vc_col is not null and
to_number(vc_col DEFAULT NULL ON CONVERSION ERROR) is not null
;
VC_COL
---------
123
1,23
From http://www.dba-oracle.com/t_isnumeric.htm
LENGTH(TRIM(TRANSLATE(, ' +-.0123456789', ' '))) is null
If there is anything left in the string after the TRIM it must be non-numeric characters.
I've found this useful:
select translate('your string','_0123456789','_') from dual
If the result is NULL, it's numeric (ignoring floating point numbers.)
However, I'm a bit baffled why the underscore is needed. Without it the following also returns null:
select translate('s123','0123456789', '') from dual
There is also one of my favorite tricks - not perfect if the string contains stuff like "*" or "#":
SELECT 'is a number' FROM dual WHERE UPPER('123') = LOWER('123')
After doing some testing, building upon the suggestions in the previous answers, there seem to be two usable solutions.
Method 1 is fastest, but less powerful in terms of matching more complex patterns.
Method 2 is more flexible, but slower.
Method 1 - fastest
I've tested this method on a table with 1 million rows.
It seems to be 3.8 times faster than the regex solutions.
The 0-replacement solves the issue that 0 is mapped to a space, and does not seem to slow down the query.
SELECT *
FROM <table>
WHERE TRANSLATE(replace(<char_column>,'0',''),'0123456789',' ') IS NOT NULL;
Method 2 - slower, but more flexible
I've compared the speed of putting the negation inside or outside the regex statement. Both are equally slower than the translate-solution. As a result, #ciuly's approach seems most sensible when using regex.
SELECT *
FROM <table>
WHERE NOT REGEXP_LIKE(<char_column>, '^[0-9]+$');
You can use this one check:
create or replace function to_n(c varchar2) return number is
begin return to_number(c);
exception when others then return -123456;
end;
select id, n from t where to_n(n) = -123456;
I tray order by with problematic column and i find rows with column.
SELECT
D.UNIT_CODE,
D.CUATM,
D.CAPITOL,
D.RIND,
D.COL1 AS COL1
FROM
VW_DATA_ALL_GC D
WHERE
(D.PERIOADA IN (:pPERIOADA)) AND
(D.FORM = 62)
AND D.COL1 IS NOT NULL
-- AND REGEXP_LIKE (D.COL1, '\[\[:alpha:\]\]')
-- AND REGEXP_LIKE(D.COL1, '\[\[:digit:\]\]')
--AND REGEXP_LIKE(TO_CHAR(D.COL1), '\[^0-9\]+')
GROUP BY
D.UNIT_CODE,
D.CUATM,
D.CAPITOL,
D.RIND ,
D.COL1
ORDER BY
D.COL1