I would like to replace all pipes and line breaks with a space in a free text field in my database.
My current approach looks like the following:
SELECT
ID,
REGEXP_REPLACE(REGEXP_REPLACE(FREETEXT,'|',' '),'\n',' ')
FROM TABLE
My idea is to replace the pipes | with a space first, and then run the result through a second replacement for the line breaks. The problem is that pipes still remain in the output, which breaks the CSV because my delimiter is |.
Hope anyone can help me out here.
PS: I am not able to change the delimiter to something else.
The pipe symbol is a special character in a Regular Expression, splitting it into multiple alternatives, thus you must escape it.
If you want to replace all pipe and line break characters you don't have to nest:
RegExp_Replace(FREETEXT,'[\|\n\r]',' ')
\| pipe 0x7C
\n line feed 0x0A
\r carriage return 0x0D
But as those are single characters you can simply use
OTranslate(FREETEXT, '7C0A0D'xc, '   ') -- three spaces: one replacement per source character, otherwise the unmatched ones are removed
Only if you want to replace consecutive occurrences of those characters with a single space do you need a RegEx:
RegExp_Replace(FREETEXT,'[\|\n\r]+',' ')
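The difference between the unescaped pipe, the character class, and the `+` variant can be tried outside the database. This is only a rough sketch using Python's `re` module, not Teradata's regex engine, so details may differ; the sample string and variable names are made up for illustration.

```python
import re

# Hypothetical sample value containing pipes and a Windows-style line break.
text = "first|second\r\nthird||fourth"

# Unescaped pipe: the pattern means "empty OR empty", so no literal pipe is matched.
unescaped = re.sub("|", " ", text)

# Character class: each pipe, line feed, or carriage return becomes one space.
one_for_one = re.sub(r"[|\n\r]", " ", text)

# With "+", a run of such characters collapses into a single space.
collapsed = re.sub(r"[|\n\r]+", " ", text)

# Plain single-character mapping without a regex, similar in spirit to OTranslate.
# The target string has one space per source character.
mapped = text.translate(str.maketrans("|\n\r", "   "))

print("|" in unescaped)   # the pipes are still there
print(repr(one_for_one))
print(repr(collapsed))
print(mapped == one_for_one)
```

The one-for-one variants keep the original length of the text, while the `+` variant shortens it wherever several of these characters are adjacent.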
Related
I'm trying to replace newline etc kind of values using regexp_replace. But when I open the result in query result window, I can still see the new lines in the text. Even when I copy the result, I can see new line characters. See output for example, I just copied from the result.
Below is my query
select regexp_replace('abc123
/n
CHAR(10)
头疼,'||CHR(10)||'allo','[^[:alpha:][:digit:][ \t]]','') from dual;
/ I just kept for testing characters.
Output:
abc123
/n
CHAR(10)
头疼,
allo
How to remove the new lines from the text?
Expected output:
abc123 /nCHAR(10)头疼,allo
There are two mistakes in your code. One of them causes the issue you noticed.
First, Oracle regular expressions follow the POSIX standard, in which a bracket expression has no escape sequences. You probably meant \t as the escape sequence for tab, but inside the bracket expression it is just a backslash and the letter t. (Note also that Oracle regular expressions have no escape sequences like \t and \n anyway. If you must preserve tabs, it can be done, but not like that.)
Second, regardless of this, you include two character classes, [:alpha:] and [:digit:], and also [ \t] in the (negated) bracket expression. The last one is not a character class, so the [ as well as the space, the backslash and the letter t are interpreted as literal characters - they stand in for themselves. The closing bracket, on the other hand, has special meaning. The first of your two closing brackets is interpreted as the end of the bracket expression; and the second closing bracket is interpreted as being an additional, literal character that must be matched! Since there is no such literal closing bracket anywhere in the string, nothing is replaced.
To fix both mistakes, replace [ \t] with the [:blank:] character class, which consists exactly of space and tab. (And, note that [:alpha:][:digit:] can be written more compactly as [:alnum:].)
From within an Oracle 11g database, using SQL, I need to remove the following sequence of special characters from a string, i.e.
~!##$%^&*()_+=\{}[]:”;’<,>./?
If any of these characters exist within a string, except for these two characters, which I DO NOT want removed, i.e.: "|" and "-" then I would like them completely removed.
For example:
From: 'ABC(D E+FGH?/IJK LMN~OP' To: 'ABCD EFGHIJK LMNOP' after removal of special characters.
I have tried this small test which works for this sample, i.e:
select regexp_replace('abc+de)fg','\+|\)') from dual
but is there a better means of using my sequence of special characters above without doing this string pattern of '\+|\)' for every special character using Oracle SQL?
You can replace anything other than letters and space with empty string
[^a-zA-Z ]
here is online demo
As per below comments
I still need to keep the following two special characters within my string, i.e. "|" and "-".
Just exclude more
[^a-zA-Z |-]
Note: the hyphen - must be placed first or last in the character class, or escaped as \-, because otherwise it defines a range.
For more info read about Character Classes or Character Sets
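The whitelist idea (remove everything that is not an allowed character) can be checked against the sample from the question. This is a sketch in Python's `re` module rather than Oracle's engine; `keep_allowed` is a hypothetical helper name, and the allowed set (ASCII letters, space, | and -) is an assumption taken from the thread.

```python
import re

# Anything that is NOT a letter, space, pipe, or hyphen.
# The hyphen sits last in the class so it is taken literally, not as a range.
NOT_ALLOWED = re.compile(r"[^a-zA-Z |-]")

def keep_allowed(value: str) -> str:
    """Remove every character outside the allowed set."""
    return NOT_ALLOWED.sub("", value)

print(keep_allowed("ABC(D E+FGH?/IJK LMN~OP"))
print(keep_allowed("a|b-c+d"))
```

Note that a whitelist like this also drops digits and non-ASCII letters; extend the class if those must survive.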
Consider using this regex replacement instead:
REGEXP_REPLACE('abc+de)fg', '[~!##$%^&*()_+=\\{}[\]:”;’<,>.\/?]', '')
The replacement will match any character from your list.
Here is a regex demo!
The regex to match your sequence of special characters is:
[]~!##$%^&*()_+=\{}[:”;’<,>./?]+
I think you still haven't escaped all of the regex-special characters.
To get there, work iteratively:
build a test string, then build up your regex character by character and check that it removes what you expect.
If the character you added last does not work, you have to escape it.
That should do the trick.
SELECT TRANSLATE('~!##$%sdv^&*()_+=\dsv{}[]:”;’<,>dsvsdd./?', '~!##$%^&*()_+=\{}[]:”;’<,>./?',' ')
FROM dual;
result:
TRANSLATE
-------------
sdvdsvdsvsdd
SQL> select translate('abc+de#fg-hq!m', 'a+-#!', 'a') from dual;
TRANSLATE(
----------
abcdefghqm
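The same delete-by-translation idea can be sketched in Python with `str.translate`, which needs no escaping at all because no regex is involved. The `UNWANTED` list here is an illustrative subset chosen for the example, not the full list from the question, and `strip_unwanted` is a hypothetical name.

```python
# Characters to delete; an illustrative subset, not the full list from the question.
# The pipe and the hyphen are deliberately left out so they are kept.
UNWANTED = "+#!()~?/"

# A translation table with an empty mapping and a deletion set.
DELETE_TABLE = str.maketrans("", "", UNWANTED)

def strip_unwanted(value: str) -> str:
    """Delete every character listed in UNWANTED, leaving the rest untouched."""
    return value.translate(DELETE_TABLE)

print(strip_unwanted("ABC(D E+FGH?/IJK LMN~OP"))
print(strip_unwanted("a|b-c+d"))
```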
I have a column eventDate which contains trailing spaces. I am trying to remove them with the PostgreSQL function TRIM(). More specifically, I am running:
SELECT TRIM(both ' ' from eventDate)
FROM EventDates;
However, the trailing spaces don't go away. Furthermore, when I try and trim another character from the date (such as a number), it doesn't trim either. If I'm reading the manual correctly this should work. Any thoughts?
There are many different invisible characters. Many of them have the property WSpace=Y ("whitespace") in Unicode. But some special characters are not considered "whitespace" and still have no visible representation. The excellent Wikipedia articles about space (punctuation) and whitespace characters should give you an idea.
<rant>Unicode sucks in this regard: introducing lots of exotic characters that mainly serve to confuse people.</rant>
The standard SQL trim() function by default only trims the basic Latin space character (Unicode: U+0020 / ASCII 32). Same with the rtrim() and ltrim() variants. Your call also only targets that particular character.
Use regular expressions with regexp_replace() instead.
Trailing
To remove all trailing white space (but not white space inside the string):
SELECT regexp_replace(eventdate, '\s+$', '') FROM eventdates;
The regular expression explained:
\s ... regular expression class shorthand for [[:space:]]
- which is the set of white-space characters - see limitations below
+ ... 1 or more consecutive matches
$ ... end of string
Demo:
SELECT regexp_replace('inner white ', '\s+$', '') || '|'
Returns:
inner white|
Yes, that's a single backslash (\). Details in this related answer:
SQL select where column begins with \
Leading
To remove all leading white space (but not white space inside the string):
regexp_replace(eventdate, '^\s+', '')
^ ... start of string
Both
To remove both, you can chain above function calls:
regexp_replace(regexp_replace(eventdate, '^\s+', ''), '\s+$', '')
Or you can combine both in a single call with two branches.
Add 'g' as 4th parameter to replace all matches, not just the first:
regexp_replace(eventdate, '^\s+|\s+$', '', 'g')
But that should typically be faster with substring():
substring(eventdate, '\S(?:.*\S)*')
\S ... everything but white space
(?:re) ... non-capturing set of parentheses
.* ... any string of 0-n characters
Or one of these:
substring(eventdate, '^\s*(.*\S)')
substring(eventdate, '(\S.*\S)') -- only works for 2+ printing characters
(re) ... Capturing set of parentheses
Effectively takes the first non-whitespace character and everything up to the last non-whitespace character if available.
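For comparison, the two strategies (replace leading and trailing runs, or extract the span between the first and last non-whitespace character) can be tried in Python. This is a sketch with Python's `re` module, not PostgreSQL's regex engine; the function names are hypothetical, and `re.DOTALL` is an assumption made so that `.` also crosses line breaks inside the value.

```python
import re

def trim_with_sub(value: str) -> str:
    """Remove leading and trailing whitespace; re.sub replaces every match."""
    return re.sub(r"^\s+|\s+$", "", value)

def trim_with_search(value: str) -> str:
    """Extract from the first to the last non-whitespace character."""
    match = re.search(r"\S(?:.*\S)*", value, flags=re.DOTALL)
    # A value that is empty or all whitespace has no match at all.
    return match.group(0) if match else ""

sample = "  inner white \t\n"
print(repr(trim_with_sub(sample)))
print(repr(trim_with_search(sample)))
```

The extraction variant has to handle the no-match case explicitly, which is the counterpart of `substring()` returning NULL when the pattern does not match.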
Whitespace?
There are a few more related characters which are not classified as "whitespace" in Unicode - so not contained in the character class [[:space:]].
These print as invisible glyphs in pgAdmin for me: "Mongolian vowel separator", "zero width space", "zero width non-joiner", "zero width joiner":
SELECT E'\u180e', E'\u200B', E'\u200C', E'\u200D';
'' | '' | '' | ''
Two more, printing as visible glyphs in pgAdmin, but invisible in my browser: "word joiner", "zero width non-breaking space":
SELECT E'\u2060', E'\uFEFF';
'' | ''
Ultimately, whether characters are rendered invisible or not also depends on the font used for display.
To remove all of these as well, replace '\s' with '[\s\u180e\u200B\u200C\u200D\u2060\uFEFF]' or '[\s]' (note trailing invisible characters!).
Example, instead of:
regexp_replace(eventdate, '\s+$', '')
use:
regexp_replace(eventdate, '[\s\u180e\u200B\u200C\u200D\u2060\uFEFF]+$', '')
or:
regexp_replace(eventdate, '[\s]+$', '') -- note invisible characters
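A quick way to see why the extended class matters is to test a value that ends in a zero-width character. The sketch below uses Python's `re` module as a stand-in for PostgreSQL; the sample value and the names `INVISIBLE`, `TRAILING`, and `strip_trailing` are made up for illustration.

```python
import re

# The extra characters named in the text, listed explicitly so the result
# does not depend on how \s happens to classify them.
INVISIBLE = "\u180e\u200b\u200c\u200d\u2060\ufeff"

# Whitespace or any of the extra characters, one or more times, at the end.
TRAILING = re.compile("[\\s" + INVISIBLE + "]+$")

def strip_trailing(value: str) -> str:
    """Remove trailing whitespace and trailing zero-width characters."""
    return TRAILING.sub("", value)

# Hypothetical value: a date followed by a space and a zero-width space.
value = "2014-01-01 \u200b"

print(repr(re.sub(r"\s+$", "", value)))  # plain \s leaves the value unchanged
print(repr(strip_trailing(value)))
```

Because the zero-width space sits after the ordinary space, the plain `\s+$` pattern finds nothing to remove at the end of the string.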
Limitations
There is also the Posix character class [[:graph:]] supposed to represent "visible characters". Example:
substring(eventdate, '([[:graph:]].*[[:graph:]])')
It works reliably for ASCII characters in every setup (where it boils down to [\x21-\x7E]), but beyond that you currently (incl. pg 10) depend on information provided by the underlying OS (to define ctype) and possibly locale settings.
Strictly speaking, that's the case for every reference to a character class, but there seems to be more disagreement with the less commonly used ones like graph. But you may have to add more characters to the character class [[:space:]] (shorthand \s) to catch all whitespace characters. Like: \u2007, \u202f and \u00a0 seem to also be missing for #XiCoN JFS.
The manual:
Within a bracket expression, the name of a character class enclosed in
[: and :] stands for the list of all characters belonging to that
class. Standard character class names are: alnum, alpha, blank, cntrl,
digit, graph, lower, print, punct, space, upper, xdigit.
These stand for the character classes defined in ctype.
A locale can provide others.
Bold emphasis mine.
Also note this limitation that was fixed with Postgres 10:
Fix regular expressions' character class handling for large character
codes, particularly Unicode characters above U+7FF (Tom Lane)
Previously, such characters were never recognized as belonging to
locale-dependent character classes such as [[:alpha:]].
It should work the way you're handling it, but it's hard to say without knowing the specific string.
If you're only trimming trailing spaces, you might want to use the more concise form:
SELECT RTRIM(eventDate)
FROM EventDates;
This is a little test to show you that it works.
Tell us if it works out!
If your whitespace is more than just the plain space character, then you will need to use regexp_replace:
SELECT '(' || REGEXP_REPLACE(eventDate, E'[[:space:]]', '', 'g') || ')'
FROM EventDates;
In the above example I am bounding the return value in ( and ) just so you can easily see that the regex replace is working in a psql prompt. So you'll want to remove those in your code.
SELECT replace((' devo system ') ,' ','');
It gives: devosystem
A tested one that works like a charm:
UPDATE company SET name = TRIM (BOTH FROM name) where id > 0
I have a textfile of rows of values, each field delimited by a single space. The end of each row is signalled by the Windows-style {carriage return, newline}.
I would like to replace each space with a comma using Notepad++, but I am unfamiliar with whitespace regex in Notepad++.
Any help would be appreciated.
\s works for me at least as a whitespace token, just as it does in normal regular expressions. A single regular space character in the replace dialog works just as well.
I need to find and clean up line breaks, carriage returns, tabs and "SUB"-characters in a set of 400k+ string records, but this DB2 environment is taking a toll on me.
Thought I could do some search and replacing with the REPLACE() and CHR() functions, but it seems CHR() is not available on this system (Error: CHR in *LIBL type *N not found). Working with \t, \r, \n etc doesn't seem to be working either. The chars can be in the middle of strings or at the end of them.
DBMS = DB2
System = iSeries
Language = SQL
Encoding = Not sure, possibly EBCDIC
Any hints on what I can do with this?
I used this SQL to find x'25' and x'0D':
SELECT
<field>
, LOCATE(x'0D', <field>) AS "0D"
, LOCATE(x'25', <field>) AS "25"
, length(trim(<field>)) AS "Length"
FROM <file>
WHERE LOCATE(x'25', <field>) > 0
OR LOCATE(x'0D', <field>) > 0
And I used this SQL to replace them:
UPDATE <file>
SET <field> = REPLACE(REPLACE(<field>, x'0D', ' '), x'25', ' ')
WHERE LOCATE(x'25', <field>) > 0
OR LOCATE(x'0D', <field>) > 0
If you want to clear up specific characters like carriage return (EBCDIC x'0d') and line feed (EBCDIC x'25') you should find the translated character in EBCDIC then use the TRANSLATE() function to replace them with space.
If you just want to remove undisplayable characters then look for anything under x'40'.
Here is a sample script that replaces X'41' with X'40', something that was creating issues at our shop:
UPDATE [yourfile] SET [yourfield] = TRANSLATE([yourfield], X'40',
X'41') WHERE [yourfield] like '%' concat X'41' concat '%'
If you need to replace more than one character, extend the "to" and "from" hexadecimal strings to the values you need in the TRANSLATE function.
Try TRANSLATE or REPLACE.
The brute force method involves using POSITION to find the errant character, then SUBSTR before and after it. CONCAT the two substrings (less the undesirable character) to re-form the column.
The character encoding is almost certainly one of the EBCDIC character sets. Depending on how the table got loaded in the first place, the CR may be x'0d' and the LF x'15' or x'25'. An easy way to find out is to get to a green screen and do a DSPPFM against the table. Press F10 then F11 to view the table in raw, hexadecimal (over/under) format.
For details on the available functions see the DB2 for i5/OS SQL Reference.
Perhaps the TRANSLATE() function will serve your needs.
TRANSLATE( data, tochars, fromchars )
...where fromchars is the set of characters you don't want, and tochars is the corresponding characters you want them replaced with. You may have to write this out in hex format, as x'nnnnnn...' and you will need to know what character set you are working with.
Using the DSPFFD command on your table should show the CCSID of your fields.
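If it helps to double-check the hex values before writing them into TRANSLATE(), the EBCDIC code points can be looked up offline. The sketch below assumes the field uses CCSID 37, which Python exposes as the `cp037` codec; other CCSIDs map some control characters differently, so treat the output as a starting point and confirm against what DSPFFD reports.

```python
# Assumption: the field uses CCSID 37 (Python codec "cp037").
CODEC = "cp037"

# Print the EBCDIC byte for the characters mentioned in the question.
for name, char in [("CR", "\r"), ("LF", "\n"), ("TAB", "\t"), ("SUB", "\x1a"), ("space", " ")]:
    print(f"{name:5} -> x'{char.encode(CODEC).hex().upper()}'")

# Byte-level equivalent of replacing CR and LF with spaces:
# the "from" bytes and the "to" bytes line up position by position.
raw = "line one\r\nline two".encode(CODEC)
table = bytes.maketrans(b"\x0d\x25", b"\x40\x40")
cleaned = raw.translate(table).decode(CODEC)

print(repr(cleaned))
```

The positional pairing of the two byte strings mirrors how the "to" and "from" arguments of TRANSLATE() correspond to each other.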
We struggled a lot to remove the newline and carriage return characters from a flat file.
In the end, the SQL below solved the issue.
REPLACE(REPLACE(COLUMN_NAME, CHR(13), ''), CHR(10), '')
Try it out
CR = CHR(13)
LF = CHR(10)