BigQuery: Convert accented characters to their plain ASCII equivalents - google-bigquery
I have the following string:
brasília
And I need to convert to:
brasilia
Without the ´ accent!
How can I do this in BigQuery?
Thank you!
Try the below as a quick and simple option for you:
#standardSQL
WITH lookups AS (
SELECT
'ç,æ,œ,á,é,í,ó,ú,à,è,ì,ò,ù,ä,ë,ï,ö,ü,ÿ,â,ê,î,ô,û,å,ø,Ø,Å,Á,À,Â,Ä,È,É,Ê,Ë,Í,Î,Ï,Ì,Ò,Ó,Ô,Ö,Ú,Ù,Û,Ü,Ÿ,Ç,Æ,Œ,ñ' AS accents,
'c,ae,oe,a,e,i,o,u,a,e,i,o,u,a,e,i,o,u,y,a,e,i,o,u,a,o,O,A,A,A,A,A,E,E,E,E,I,I,I,I,O,O,O,O,U,U,U,U,Y,C,AE,OE,n' AS latins
),
pairs AS (
SELECT accent, latin FROM lookups,
UNNEST(SPLIT(accents)) AS accent WITH OFFSET AS p1,
UNNEST(SPLIT(latins)) AS latin WITH OFFSET AS p2
WHERE p1 = p2
),
yourTableWithWords AS (
SELECT word FROM UNNEST(
SPLIT('brasília,ångström,aperçu,barège, beau idéal, belle époque, béguin, bête noire, bêtise, Bichon Frisé, blasé, blessèd, bobèche, boîte, bombé, Bön, Boötes, boutonnière, bric-à-brac, Brontë Beyoncé,El Niño')
) AS word
)
SELECT
word AS word_with_accent,
(SELECT STRING_AGG(IFNULL(latin, char), '')
FROM UNNEST(SPLIT(word, '')) char
LEFT JOIN pairs
ON char = accent) AS word_without_accent
FROM yourTableWithWords
Output is
word_with_accent word_without_accent
blessèd blessed
El Niño El Nino
belle époque belle epoque
boîte boite
Boötes Bootes
blasé blase
ångström angstrom
bobèche bobeche
barège barege
bric-à-brac bric-a-brac
bête noire bete noire
Bichon Frisé Bichon Frise
Brontë Beyoncé Bronte Beyonce
bêtise betise
beau idéal beau ideal
bombé bombe
brasília brasilia
boutonnière boutonniere
aperçu apercu
béguin beguin
Bön Bon
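For readers who want to sanity-check the pair mapping outside BigQuery, here is a minimal Python sketch of the same idea (not part of the original answer): split both CSV strings, zip them into a lookup table, and map each character, falling back to the original character just as IFNULL(latin, char) does above.

```python
# Mirrors the lookups/pairs CTEs: two parallel CSV strings zipped into a dict.
accents = 'ç,æ,œ,á,é,í,ó,ú,à,è,ì,ò,ù,ä,ë,ï,ö,ü,ÿ,â,ê,î,ô,û,å,ø,Ø,Å,Á,À,Â,Ä,È,É,Ê,Ë,Í,Î,Ï,Ì,Ò,Ó,Ô,Ö,Ú,Ù,Û,Ü,Ÿ,Ç,Æ,Œ,ñ'
latins  = 'c,ae,oe,a,e,i,o,u,a,e,i,o,u,a,e,i,o,u,y,a,e,i,o,u,a,o,O,A,A,A,A,A,E,E,E,E,I,I,I,I,O,O,O,O,U,U,U,U,Y,C,AE,OE,n'
pairs = dict(zip(accents.split(','), latins.split(',')))

def accent2latin(word: str) -> str:
    # Like IFNULL(latin, char): keep the original character when no pair matches.
    return ''.join(pairs.get(ch, ch) for ch in word)

print(accent2latin('brasília'))  # brasilia
```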
UPDATE
Below is how to pack this logic into a SQL UDF, so that accent2latin(word) can be called to do the "magic":
#standardSQL
CREATE TEMP FUNCTION accent2latin(word STRING) AS
((
WITH lookups AS (
SELECT
'ç,æ,œ,á,é,í,ó,ú,à,è,ì,ò,ù,ä,ë,ï,ö,ü,ÿ,â,ê,î,ô,û,å,ø,Ø,Å,Á,À,Â,Ä,È,É,Ê,Ë,Í,Î,Ï,Ì,Ò,Ó,Ô,Ö,Ú,Ù,Û,Ü,Ÿ,Ç,Æ,Œ,ñ' AS accents,
'c,ae,oe,a,e,i,o,u,a,e,i,o,u,a,e,i,o,u,y,a,e,i,o,u,a,o,O,A,A,A,A,A,E,E,E,E,I,I,I,I,O,O,O,O,U,U,U,U,Y,C,AE,OE,n' AS latins
),
pairs AS (
SELECT accent, latin FROM lookups,
UNNEST(SPLIT(accents)) AS accent WITH OFFSET AS p1,
UNNEST(SPLIT(latins)) AS latin WITH OFFSET AS p2
WHERE p1 = p2
)
SELECT STRING_AGG(IFNULL(latin, char), '')
FROM UNNEST(SPLIT(word, '')) char
LEFT JOIN pairs
ON char = accent
));
WITH yourTableWithWords AS (
SELECT word FROM UNNEST(
SPLIT('brasília,ångström,aperçu,barège, beau idéal, belle époque, béguin, bête noire, bêtise, Bichon Frisé, blasé, blessèd, bobèche, boîte, bombé, Bön, Boötes, boutonnière, bric-à-brac, Brontë Beyoncé,El Niño')
) AS word
)
SELECT
word AS word_with_accent,
accent2latin(word) AS word_without_accent
FROM yourTableWithWords
It's worth mentioning that what you're asking for is a simplified case of Unicode text normalization. Many languages have a function for this in their standard libraries (e.g., Java). One good approach would be to insert your text into BigQuery already normalized. If that won't work -- for example, because you need to retain the original text and you're concerned about hitting BigQuery's row size limit -- then you'll need to do normalization on the fly in your queries.
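For the insert-it-already-normalized route, Python's standard library offers the same facility as Java's Normalizer; here is a minimal sketch (the helper name strip_accents is mine, not from the answer):

```python
import unicodedata

def strip_accents(text: str) -> str:
    # NFD splits each accented letter into a base letter plus combining
    # marks; we then keep only the non-mark characters.
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(c for c in decomposed if not unicodedata.combining(c))

print(strip_accents('brasília'))  # brasilia
```

Note that ligatures such as æ, œ, and ø have no canonical decomposition, so they pass through unchanged; the lookup-pair approach above handles those cases explicitly.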
Some databases have implementations of Unicode normalization of various completeness (e.g., PostgreSQL's unaccent method, PrestoDB's normalize method) for use in queries. Unfortunately, BigQuery is not one of them. There is no text normalization function in BigQuery as of this writing. The implementations on this answer are kind of a "roll your own unaccent." When BigQuery releases an official function, everyone should use that instead!
Assuming you need to do the normalization in your query (and Google still hasn't come out with a function for this yet), these are some reasonable options.
Approach 1: Use NORMALIZE
Google has now come out with a NORMALIZE function. (Thanks to @WillianFuks in the comments for flagging!) This is now the obvious choice for text normalization. For example:
SELECT REGEXP_REPLACE(NORMALIZE(text, NFD), r"\pM", '') FROM yourtable;
There is a brief explanation of how this works and why the call to REGEXP_REPLACE is needed in the comments.
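To make the mechanics concrete, here is a small Python illustration (using the standard unicodedata module; not part of the original answer) of what NFD does and why the combining marks then need to be stripped:

```python
import unicodedata

word = 'brasília'
decomposed = unicodedata.normalize('NFD', word)

# NFD made the string one code point longer: 'í' became 'i' + U+0301.
print(len(word), len(decomposed))  # 8 9

# These leftover marks are what the \pM regex class matches and removes.
marks = [c for c in decomposed if unicodedata.combining(c)]
print([unicodedata.name(c) for c in marks])  # ['COMBINING ACUTE ACCENT']
```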
I have left the additional approaches for reference.
Approach 2: Use REGEXP_REPLACE and REPLACE on Content
I implemented the lowercase-only case of text normalization in legacy SQL using REGEXP_REPLACE. (The analog in Standard SQL is fairly self-evident.) I ran some tests on a text field with average length around 1K in a large table of 28M rows using the query below:
SELECT id, text FROM
(SELECT
id,
CASE
WHEN REGEXP_CONTAINS(LOWER(text), r"[àáâäåæçèéêëìíîïòóôöøùúûüÿœ]") THEN
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REGEXP_REPLACE(
REPLACE(REPLACE(REPLACE(REPLACE(LOWER(text), 'œ', 'oe'), 'ÿ', 'y'), 'ç', 'c'), 'æ', 'ae'),
r"[ùúûü]", 'u'),
r"[òóôöø]", 'o'),
r"[ìíîï]", 'i'),
r"[èéêë]", 'e'),
r"[àáâäå]", 'a')
ELSE
LOWER(text)
END AS text
FROM
yourtable ORDER BY id LIMIT 10);
versus:
WITH lookups AS (
SELECT
'ç,æ,œ,á,é,í,ó,ú,à,è,ì,ò,ù,ä,ë,ï,ö,ü,ÿ,â,ê,î,ô,û,å,ø,ñ' AS accents,
'c,ae,oe,a,e,i,o,u,a,e,i,o,u,a,e,i,o,u,y,a,e,i,o,u,a,o,n' AS latins
),
pairs AS (
SELECT accent, latin FROM lookups,
UNNEST(SPLIT(accents)) AS accent WITH OFFSET AS p1,
UNNEST(SPLIT(latins)) AS latin WITH OFFSET AS p2
WHERE p1 = p2
)
SELECT foo FROM (
SELECT
id,
(SELECT STRING_AGG(IFNULL(latin, char), '') AS foo FROM UNNEST(SPLIT(LOWER(text), '')) char LEFT JOIN pairs ON char=accent) AS foo
FROM
yourtable ORDER BY id LIMIT 10);
On average, the REGEXP_REPLACE implementation ran in about 2.9s; the array-based implementation ran in about 12.5s.
Approach 3: Use REGEXP_REPLACE on Search Pattern
What brought me to this question was a search use case. For this use case, I can either normalize my corpus text so that it looks more like my query, or I can "denormalize" my query so that it looks more like my text. The above describes an implementation of the first approach. This describes an implementation of the second.
When searching for a single word, one can use the REGEXP_MATCH function and merely update the query using the following patterns:
a -> [aàáâäãåā]
e -> [eèéêëēėę]
i -> [iîïíīįì]
o -> [oôöòóøōõ]
u -> [uûüùúū]
y -> [yÿ]
s -> [sßśš]
l -> [lł]
z -> [zžźż]
c -> [cçćč]
n -> [nñń]
æ -> (?:æ|ae)
œ -> (?:œ|oe)
So the query "hello" would look like this, as a regexp:
r"h[eèéêëēėę][lł][lł][oôöòóøōõ]"
Transforming the word into this regular expression should be fairly straightforward in any language. This isn't a solution to the posted question -- "How do I remove accents in BigQuery?" -- but is rather a solution to a related use case, which might have brought people (like me!) to this page.
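As a hypothetical sketch of that transformation in Python, using the character classes from the table above (the helper name denormalize is mine):

```python
import re

# Character classes taken from the mapping table above.
CLASSES = {
    'a': '[aàáâäãåā]', 'e': '[eèéêëēėę]', 'i': '[iîïíīįì]',
    'o': '[oôöòóøōõ]', 'u': '[uûüùúū]',  'y': '[yÿ]',
    's': '[sßśš]',     'l': '[lł]',      'z': '[zžźż]',
    'c': '[cçćč]',     'n': '[nñń]',
}

def denormalize(query: str) -> str:
    # Expand each plain letter into its accent-tolerant character class.
    return ''.join(CLASSES.get(ch, ch) for ch in query.lower())

pattern = denormalize('hello')
print(pattern)
print(bool(re.fullmatch(pattern, 'héllo')))  # True
```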
I like this answer's explanation. You can use:
REGEXP_REPLACE(NORMALIZE(text, NFD), r'\pM', '')
As a simple example:
WITH data AS(
SELECT 'brasília / paçoca' AS text
)
SELECT
REGEXP_REPLACE(NORMALIZE(text, NFD), r'\pM', '') RemovedDiacritics
FROM data
brasilia / pacoca
UPDATE
With the new string function TRANSLATE, it's much simpler:
WITH data AS(
SELECT 'brasília / paçoca' AS text
)
SELECT
translate(text, "ŠŽšžŸÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖÙÚÛÜÝàáâãäåçèéêëìíîïðñòóôõöùúûüýÿ", "SZszYAAAAAACEEEEIIIIDNOOOOOUUUUYaaaaaaceeeeiiiidnooooouuuuyy") as RemovedDiacritics
FROM data
brasilia / pacoca
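For what it's worth, Python's built-in str.translate does the same one-character-to-one-character mapping as SQL's TRANSLATE, so the mapping above can be prototyped locally like this (illustrative only):

```python
# str.maketrans builds the same positional character map as SQL TRANSLATE.
table = str.maketrans(
    'ŠŽšžŸÀÁÂÃÄÅÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖÙÚÛÜÝàáâãäåçèéêëìíîïðñòóôõöùúûüýÿ',
    'SZszYAAAAAACEEEEIIIIDNOOOOOUUUUYaaaaaaceeeeiiiidnooooouuuuyy',
)
print('brasília / paçoca'.translate(table))  # brasilia / pacoca
```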
You can call REPLACE() or REGEXP_REPLACE(). You can find some regular expressions at Remove accents/diacritics in a string in JavaScript.
Alternatively, you can use a JavaScript UDF, but I expect it to be slower.
Just call:
select bigfunctions.eu.remove_accents('Voilà !') as cleaned_string
(BigFunctions are open-source BigQuery functions callable by anyone from their BigQuery Project)
https://unytics.io/bigfunctions/reference/#remove_accents
Related
How to get the nth match from regexp_matches() as plain text
I have this code:
with demo as (
  select 'WWW.HELLO.COM' web
  union all
  select 'hi.co.uk' web
)
select regexp_matches(replace(lower(web), 'www.', ''), '([^\.]*)') from demo
And the table I get is:
regexp_matches
{hello}
{hi}
What I would like to do is:
with demo as (
  select 'WWW.HELLO.COM' web
  union all
  select 'hi.co.uk' web
)
select regexp_matches(replace(lower(web), 'www.', ''), '([^\.]*)')[1] from demo
Or even the BigQuery version:
with demo as (
  select 'WWW.HELLO.COM' web
  union all
  select 'hi.co.uk' web
)
select regexp_matches(replace(lower(web), 'www.', ''), '([^\.]*)')[offset(1)] from demo
But neither works. Is this possible? If it isn't clear, the result I would like is:
match
hello
hi
Use split_part() instead. Simpler, faster. To get the first word, before the first separator .:
WITH demo(web) AS (
   VALUES ('WWW.HELLO.COM')
        , ('hi.co.uk')
   )
SELECT split_part(replace(lower(web), 'www.', ''), '.', 1)
FROM demo;
db<>fiddle here
See: Split comma separated column data into additional columns
regexp_matches() returns setof text[], i.e. 0-n rows of text arrays. (Because each regular expression can result in a set of multiple matching strings.) In Postgres 10 or later, there is also the simpler variant regexp_match() that only returns the first match, i.e. text[]. Either way, the surrounding curly braces in your result are the text representation of the array literal. You can take the first row and unnest the first element of the array, but since you neither want the set nor the array to begin with, use split_part() instead. Simpler, faster, and less versatile. But good enough for the purpose. And it returns exactly what you want to begin with: text.
I'm a little confused. Doesn't this do what you want?
with demo as (
  select 'WWW.HELLO.COM' web
  union all
  select 'hi.co.uk' web
)
select (regexp_matches(replace(lower(web), 'www.', ''), '([^\.]*)'))[1]
from demo
This is basically your query with extra parentheses so it does not generate a syntax error. Here is a db<>fiddle illustrating that it returns what you want.
How to check string have custom template in SQL Server
I have a column like this:
Codes
--------------------------------------------------
3/1151---------366-500-2570533-1
9/6809---------------------368-510-1872009-1
1-260752-305-154----------------154-200-260752-1--------154-800-13557-1
2397/35425---------------------------------377-500-3224575-1
17059----------------377-500-3263429-1
126/42906---------------------377-500-3264375-1
2269/2340-------------------------377-500-3065828-1
2267/767---------377-500-1452908-4
2395/118593---------377-500-3284699-1
2395/136547---------377-500-3303413-1
92/10260---------------------------377-500-1636038-1
2345-2064---------377-500-3318493-1
365-2290--------377-500-3278261-12
365-7212--------377-500-2587120-1
How can I extract codes with this format: 3digit-3digit-5to7digit-1to2digit (xxx-xxx-xxxxxx-xx)?
The result must be:
Codes
--------------------------------------------------
366-500-2570533-1
368-510-1872009-1
154-200-260752-1, 154-800-13557-1 -- has 2 code templates
377-500-3224575-1
377-500-3263429-1
377-500-3264375-1
377-500-3065828-1
377-500-1452908-4
377-500-3284699-1
377-500-3303413-1
377-500-1636038-1
377-500-3318493-1
377-500-3278261-12
377-500-2587120-1
------------------------------------
This problem has completely worn me out. Thanks for reading about my problem.
This is really ugly, really really ugly. I don't for one second suggest doing this in your RDBMS, and really I suggest you fix your data. You should not be storing "delimited" (I use that word loosely to describe your data) data in your tables; you should be storing it in separate columns and rows. In this case, the first "code" should be in one column, with a one-to-many relationship with another table holding the codes you're trying to extract. As you haven't tagged or mentioned your version of SQL Server, I've used the latest SQL Server syntax. STRING_SPLIT is available in SQL Server 2016+ and STRING_AGG in 2017+. If you aren't using those versions you will need to replace those functions with a suitable alternative (I suggest delimitedsplit8k(_lead) and FOR XML PATH respectively). Anyway, what this does. Firstly we need to fix that data into something more usable, so I change the double hyphens (--) to a pipe (|), as that doesn't seem to appear in your data. Then I use that pipe to split your data into parts (individual codes). Because your delimiter is inconsistent (it isn't a consistent width), this leaves some codes with a leading hyphen, so I then have to get rid of that. Then I use my answer from your other question to split the code further into its components, and reverse the WHERE; previously the answer was looking for "bad" rows, whereas now we want "good" rows.
Then after all of that, it's as "simple" as using STRING_AGG to delimit the "good" rows:
SELECT STRING_AGG(ca.Code,',') AS Codes
FROM (VALUES('3/1151---------366-500-2570533-1'),
            ('9/6809---------------------368-510-1872009-1'),
            ('1-260752-305-154----------------154-200-260752-1--------154-800-13557-1'),
            ('2397/35425---------------------------------377-500-3224575-1'),
            ('17059----------------377-500-3263429-1'),
            ('126/42906---------------------377-500-3264375-1'),
            ('2269/2340-------------------------377-500-3065828-1'),
            ('2267/767---------377-500-1452908-4'),
            ('2395/118593---------377-500-3284699-1'),
            ('2395/136547---------377-500-3303413-1'),
            ('92/10260---------------------------377-500-1636038-1'),
            ('2345-2064---------377-500-3318493-1'),
            ('365-2290--------377-500-3278261-12'),
            ('365-7212--------377-500-2587120-1')) V(Codes)
CROSS APPLY (VALUES(REPLACE(V.Codes,'--','|'))) D(DelimitedCodes)
CROSS APPLY STRING_SPLIT(D.DelimitedCodes,'|') SS
CROSS APPLY (VALUES(CASE LEFT(SS.[value],1) WHEN '-' THEN STUFF(SS.[value],1,1,'') ELSE SS.[value] END)) ca(Code)
CROSS APPLY (VALUES(PARSENAME(REPLACE(ca.Code,'-','.'),4),
                    PARSENAME(REPLACE(ca.Code,'-','.'),3),
                    PARSENAME(REPLACE(ca.Code,'-','.'),2),
                    PARSENAME(REPLACE(ca.Code,'-','.'),1))) PN(P1, P2, P3, P4)
WHERE LEN(PN.P1) = 3
  AND LEN(PN.P2) = 3
  AND LEN(PN.P3) BETWEEN 5 AND 7
  AND LEN(PN.P4) BETWEEN 1 AND 2
  AND ca.Code NOT LIKE '%[^0-9\-]%' ESCAPE '\'
GROUP BY V.Codes;
db<>fiddle
You have several problems here:
- Splitting your longer strings into the codes you want.
- Dealing with the fact that your separator for the longer strings is the same as your separator for the shorter ones.
- Finding the patterns that you want.
The last is perhaps the simplest, because you can use brute force to solve that. Here is a solution that extracts the values you want:
with t as (
      select v.*
      from (values ('3/1151---------366-500-2570533-1'),
                   ('9/6809---------------------368-510-1872009-1'),
                   ('1-260752-305-154----------------154-200-260752-1--------154-800-13557-1'),
                   ('2397/35425---------------------------------377-500-3224575-1')
           ) v(str)
     )
select t.*, ss.value
from t cross apply
     (values (replace(replace(replace(replace(replace(t.str, '--', '><'), '<>', ''), '><', '|'), '|-', '|'), '-|', '|'))
     ) v(str_sep) cross apply
     string_split(v.str_sep, '|') ss
where ss.value like '%-%-%-%' and
      ss.value not like '%-%-%-%-%' and
      (ss.value like '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9]-[0-9]' or
       ss.value like '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]' or
       ss.value like '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9]' or
       ss.value like '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]' or
       ss.value like '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9][0-9]-[0-9]' or
       ss.value like '[0-9][0-9][0-9]-[0-9][0-9][0-9]-[0-9][0-9][0-9][0-9][0-9][0-9][0-9]-[0-9][0-9]'
      );
Here is a db<>fiddle. I would strongly encourage you to find some way of doing this string parsing anywhere other than SQL. The key to this working is getting the long string of hyphens down to a single delimiter. SQL Server does not offer regular expressions for the hyphens (as some other databases do and as is available in other programming languages). In Python, for instance, this would be much simpler.
The strange values statement with a zillion replaces is handling the repeated delimiters, replacing them with a single pipe delimiter. Note: This uses string_split() as a convenience. It was introduced in SQL Server 2016. For earlier versions, there are plenty of examples of string splitting functions on the web.
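Since the answer notes that this parsing is much simpler outside SQL, here is what it could look like in Python, where a single regular expression expresses the 3-3-5to7-1to2 pattern directly (a sketch, not part of the original answer):

```python
import re

row = '1-260752-305-154----------------154-200-260752-1--------154-800-13557-1'

# 3 digits - 3 digits - 5 to 7 digits - 1 to 2 digits, bounded on both sides.
codes = re.findall(r'\b\d{3}-\d{3}-\d{5,7}-\d{1,2}\b', row)
print(codes)  # ['154-200-260752-1', '154-800-13557-1']
```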
SQL - Split string with multiple delimiters into multiple rows and columns
I am trying to split a string in SQL with the following format: 'John, Mark, Peter|23, 32, 45'. The idea is to have all the names in the first column and the ages in the second column. The query should be "dynamic"; the string can have several records depending on user entries. Does anyone know how to do this, and if possible without SQL functions? I have tried the cross apply approach but I wasn't able to make it work. Any ideas?
This solution uses Jeff Moden's DelimitedSplit8k. Why? Because his solution provides the ordinal position of the item - something that many other functions, including Microsoft's own STRING_SPLIT, do not provide. It's going to be vitally important for getting this to work correctly. Once you have that, the solution becomes fairly simple:
DECLARE @NameAges varchar(8000) = 'John, Mark, Peter|23, 32, 45';
WITH Splits AS (
    SELECT S1.ItemNumber AS PipeNumber,
           S2.ItemNumber AS CommaNumber,
           S2.Item
    FROM dbo.DelimitedSplit8K (REPLACE(@NameAges,' ',''), '|') S1 --As you have spaces between the delimiters I've removed these. Be CAREFUL with that
         CROSS APPLY DelimitedSplit8K (S1.Item, ',') S2)
SELECT S1.Item AS [Name],
       S2.Item AS Age
FROM Splits S1
     JOIN Splits S2 ON S1.CommaNumber = S2.CommaNumber
                   AND S2.PipeNumber = 2
WHERE S1.PipeNumber = 1;
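For comparison, the same split-then-pair-by-ordinal-position idea is a few lines in Python, where the ordinal position comes for free (illustrative only, not part of the answer):

```python
s = 'John, Mark, Peter|23, 32, 45'

# Split on '|' first, strip the spaces, then split each half on ','.
names, ages = (part.replace(' ', '').split(',') for part in s.split('|'))
rows = list(zip(names, ages))
print(rows)  # [('John', '23'), ('Mark', '32'), ('Peter', '45')]
```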
Text to List in SQL
Is there any way to convert a comma-separated text value to a list so that I can use it with IN in SQL? I used PostgreSQL for this one. Ex.:
select location from tbl where location in (replace(replace(replace('[Location].[SG],[Location].[PH]', ',[Location].[', ','''), '[Location].[', ''''), ']',''''))
This query:
select (replace(replace(replace('[Location].[SG],[Location].[PH]', ',[Location].[', ','''), '[Location].[', ''''), ']',''''))
produces 'SG','PH'. I wanted to produce this query:
select location from tbl where location in ('SG','PH')
Nothing returned when I executed the first query. The table has been filled with location values 'SG' and 'PH'. Can anyone help me on how to make this work without using PL/pgSQL?
So you're faced with a friendly and easy-to-use tool that won't let you get any work done; I feel your pain. A slight modification of what you have combined with string_to_array should be able to get the job done. First we'll replace your nested replace calls with slightly nicer replace calls:
=> select replace(replace(replace('[Location].[SG],[Location].[PH]', '[Location].', ''), '[', ''), ']', '');
 replace
---------
 SG,PH
So we strip out the [Location]. noise and then strip out the leftover brackets to get a comma delimited list of the two-character location codes you're after. There are other ways to get the SG,PH using PostgreSQL's other string and regex functions but replace(replace(replace(... will do fine for strings with your specific structure. Then we can split that CSV into an array using string_to_array:
=> select string_to_array(replace(replace(replace('[Location].[SG],[Location].[PH]', '[Location].', ''), '[', ''), ']', ''), ',');
 string_to_array
-----------------
 {SG,PH}
to give us an array of location codes. Now that we have an array, we can use = ANY instead of IN to look inside an array:
=> select 'SG' = any (string_to_array(replace(replace(replace('[Location].[SG],[Location].[PH]', '[Location].', ''), '[', ''), ']', ''), ','));
 ?column?
----------
 t
That t is a boolean TRUE, BTW; if you said 'XX' = any (...) you'd get an f (i.e. FALSE) instead. Putting all that together gives you a final query structured like this:
select location
from tbl
where location = any (string_to_array(...))
You can fill in the ... with the nested replace nastiness on your own.
Assuming we are dealing with a comma-separated list of elements in the form [Location].[XX], I would expect this construct to perform best:
SELECT location
FROM   tbl
JOIN  (
   SELECT substring(unnest(string_to_array('[Location].[SG],[Location].[PH]'::text, ',')), 13, 2) AS location
   ) t USING (location);
Step-by-step:
- Transform the comma-separated list into an array and split it to a table with unnest(string_to_array()). You could do the same with regexp_split_to_table(). Slightly shorter but more expensive.
- Extract the XX part with substring(). Very simple and fast.
- JOIN to tbl instead of the IN expression. That's faster - and equivalent while there are no duplicates on either side. I assign the same column alias location to enable an equijoin with USING.
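Outside SQL, the same extraction is a one-liner with a regular expression; a Python sketch for comparison (not part of the answer):

```python
import re

raw = '[Location].[SG],[Location].[PH]'

# Capture whatever sits inside the second bracket pair of each element.
codes = re.findall(r'\[Location\]\.\[(\w+)\]', raw)
print(codes)  # ['SG', 'PH']
```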
Directly using location in ('something') works. I have created a fiddle that uses the IN clause on a VARCHAR column: http://sqlfiddle.com/#!12/cdf915/1
How to extract numerical data from SQL result
Suppose there is a table "A" with 2 columns - ID (INT), DATA (VARCHAR(100)). Executing SELECT DATA FROM A results in a table that looks like:
DATA
---------------------
Nowshak 7,485 m
Maja e Korabit (Golem Korab) 2,764 m
Tahat 3,003 m
Morro de Moco 2,620 m
Cerro Aconcagua 6,960 m (located in the northwestern corner of the province of Mendoza)
Mount Kosciuszko 2,229 m
Grossglockner 3,798 m
// the DATA continues...
---------------------
How can I extract only the numerical data using some kind of string processing function in the SELECT SQL query so that the result from a modified SELECT would look like this:
DATA (in INTEGER - not varchar)
---------------------
7485
2764
3003
2620
6960
2229
3798
// the DATA in INTEGER continues...
---------------------
By the way, it would be best if this could be done in a single SQL statement. (I am using IBM DB2 version 9.5) Thanks :)
I know this thread is old, and the OP doesn't need the answer, but I had to figure this out with a few hints from this and other threads. They all seem to be missing the exact answer. The easy way to do this is to TRANSLATE all unneeded characters to a single character, then REPLACE that single character with an empty string.
DATA = 'Nowshak 7,485 m'
-- removes all characters, leaving only numbers
REPLACE(TRANSLATE(TRIM(DATA),
        '_____________________________________________________________________________________________',
        ' abcdefghijklmnopqrstuvwzyaABCDEFGHIJKLMNOPQRSTUVWXYZ`~!@#$%^&*()-_=+\|[]{};:",.<>/?'),
        '_', '')
=> '7485'
To break down the TRANSLATE command:
TRANSLATE(<field or string>, <to characters>, <from characters>)
e.g.
DATA = 'Sample by John'
TRANSLATE(DATA, 'XYZ', 'abc')
=> a becomes X, b becomes Y, c becomes Z
=> 'SXmple Yy John'
** Note: I can't speak to performance or version compatibility. I'm on a 9.x version of DB2, and new to the technology. Hope this helps someone.
In Oracle:
SELECT TO_NUMBER(REGEXP_REPLACE(data, '[^0-9]', '')) FROM a
In PostgreSQL:
SELECT CAST(REGEXP_REPLACE(data, '[^0-9]', '', 'g') AS INTEGER) FROM a
In MS SQL Server and DB2, you'll need to create UDFs for regular expressions and query like this. See links for more details.
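The same regex approach can be sketched in Python (the helper name digits_only is mine, for illustration):

```python
import re

def digits_only(data: str) -> int:
    # Strip every non-digit character, then cast the remainder to int.
    return int(re.sub(r'[^0-9]', '', data))

print(digits_only('Nowshak 7,485 m'))  # 7485
```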
Doing a quick search online for DB2, the best inbuilt function I can find is TRANSLATE. It lets you specify a list of characters you want to change to other characters. It's not ideal, but you can specify every character that you want to strip out - that is, every non-numeric character available... (Yes, that's a long list, a very long list, which is why I say it's not ideal.)
TRANSLATE('data', 'abc...XYZ,./\<>?|[and so on]', ' ')
Alternatively, you need to create a user-defined function to search for the number. There are a few alternatives for that: check each character one by one and keep it only if it's numeric, or, if you know what precedes the number and what follows it, search for those and keep what is in between...
To elaborate on Dems's suggestion, the approach I've used is a scalar user-defined function (UDF) that accepts an alphanumeric string and recursively iterates through the string (one byte per iteration), suppressing the non-numeric characters from the output. The recursive expression will generate a row per iteration, but only the final row is kept and returned to the calling application.