Is it ok to insert ascii decimal characters in postgres select query? - sql

I need the output of a SELECT statement in this format:
You are assigned to following projects:
A
B
C
There are some spaces, special characters & new lines. To produce them I used chr(10), chr(32) and chr(8226).
It's working fine, but the query doesn't look good and I am not sure whether this is a good approach.
The query looks like this:
SELECT
'You are assigned to following projects:' || chr(10) || chr(32) || chr(8226) || chr(32) ||
string_agg(e.projects, chr(10) || chr(32) || chr(8226) || chr(32))
Also, will this work on every OS and in every environment?

You have a few options:
Insert characters literally. Usually the best option. Want a "•"? Use a string like 'this is a •'. Nothing more is required if your client_encoding is correct and the encoding you're using includes the character you want (like •). This is SQL-standard. Newlines may be included as literals:
SELECT '
' AS "this_is_a_newline";
This approach may not work for some non-printable characters, depending on the database implementation. For PostgreSQL it's fine for everything except \x00, the zero byte, which PostgreSQL doesn't support in text / varchar etc at all, only in bytea.
Make sure your text editor's / SQL editor's text encoding matches what your connection tells PostgreSQL the client_encoding is; otherwise you'll get mangled strings or weird errors. Users of unix-like terminals also need to make sure the terminal encoding matches client_encoding to avoid weird output. These days Windows is the only platform where this is generally an issue.
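For instance, the zero-byte restriction mentioned above is easy to verify (a quick sketch; the exact error wording varies by version):
SELECT chr(0);          -- rejected: the zero byte isn't allowed in text values
SELECT bytea '\x00';    -- fine: bytea can hold zero bytes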
Insert characters by hex or unicode literal in an E'' escape-string, e.g. E'this is a \u2022' . Note that \u escapes are hexadecimal - 0x2022 is decimal 8226. The E'' syntax is a PostgreSQL extension.
For characters that have shorthand escapes defined, use the shorthand escapes in an escape-string, e.g. E'\n'. This is a PostgreSQL extension.
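For example, both escape-string forms can produce the bullet and the newline from your query (a quick sketch):
SELECT E'\u2022 bullet via a unicode escape' AS a,
       E'line one\nline two' AS b;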
Use chr(8226), as you described, but note that chr interprets the code according to your server_encoding (the database's text encoding), so I don't encourage it. For code points that don't fit in the server encoding you'll just get an error like ERROR: requested character too large for encoding: 8226:
regress=> CREATE DATABASE latin ENCODING 'latin-1' LC_CTYPE 'C' LC_COLLATE 'C' TEMPLATE template0;
CREATE DATABASE
regress=> \c latin
You are now connected to database "latin" as user "craig".
latin=> SHOW server_encoding;
server_encoding
-----------------
LATIN1
(1 row)
latin=> SHOW client_encoding;
client_encoding
-----------------
UTF8
(1 row)
latin=> select chr(8226);
ERROR: requested character too large for encoding: 8226
but for chars whose ordinal is in the 1-byte range, you can get an unexpected character instead. Take ü, which in both utf-8 and latin-1 (iso-8859-1) is 0xfc (decimal 252), but in iso-8859-5 is ќ. So:
regress=> SHOW server_encoding;
server_encoding
-----------------
UTF8
regress=> SELECT chr(252);
chr
-----
ü
regress=> CREATE DATABASE iso5 ENCODING 'iso-8859-5' LC_CTYPE 'C' LC_COLLATE 'C' TEMPLATE template0;
regress=> \c iso5
iso5=> SELECT chr(252);
chr
-----
ќ
So my advice: Always use literals where possible. Where you must use escapes, use E'' strings with unicode escapes to prevent ambiguity about the meaning of a codepoint based on the current server encoding. Avoid \x escapes and chr.
For the specific example you wrote, you should use:
SELECT 'You are assigned to following projects:
• A
• B
• C';
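If you'd rather keep the string_agg form from your question, the same separator can be written once as an escape string with a literal bullet (a sketch; the FROM clause and table name are assumed, since they're not shown in the question):
SELECT 'You are assigned to following projects:' || E'\n • ' ||
       string_agg(e.projects, E'\n • ')
FROM employee_projects e;  -- hypothetical table name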
Note for readers on very old PostgreSQL versions: Extremely old PostgreSQL releases didn't support E'' strings and treated all strings as if they were escape strings, so '\n' meant "newline", whereas modern PostgreSQL follows the SQL standard in which '\n' is just the string "\n". Only marginally prehistoric versions still did this, but they raised a warning about it and let you request the standard behaviour by setting standard_conforming_strings = on. This has been the default for quite a while.

Instead of
chr(10) || chr(32) || chr(8226) || chr(32)
just use an escape string:
E'\n • '
No reason to use chr() in this case.

Related

Remove accents from string in Oracle

When trying to remove all accents from a string in Oracle using the techniques described in this Stack Overflow answer (how replace accented letter in a varchar2 column in oracle), I'm getting mixed results.
select CONVERT('JUAN ROMÄN', 'US7ASCII') from dual;
This returns the original string but replaces some characters, for example ñ, with a question mark (probably because of the chosen character set; tests with different charsets led to different results).
Using the following technique:
select utl_raw.cast_to_varchar2(nlssort(NAME_USER, 'nls_sort=binary_ai')) from YOUR_TABLE;
Returns the complete string but also places a NUL value at the end of the string.
Is there a character set that I can use with Spanish accents to get a correct result (the original string with the accents removed), or is there a way to avoid the NUL value in the utl_raw.cast_to_varchar2 technique?
Based on the comments, replacing chr(0) seems to remove the NUL value. For example:
select
upper(utl_raw.cast_to_varchar2((nlssort('this is áà ñew test','nls_sort=binary_ai')))) as test,
replace(upper(utl_raw.cast_to_varchar2((nlssort('this is áà ñew test','nls_sort=binary_ai')))),chr(0),'') as test2
from dual;
If possible, however, I would like a more straightforward/simpler solution.
You can use TRANSLATE(your_string, from_chars, to_chars): https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions196.htm
Just put all the accented characters in the from_chars string and their corresponding replacement characters in to_chars.
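For example, for the common Spanish accented letters (a sketch; extend both lists to cover whatever characters actually occur in your data):
SELECT TRANSLATE('JUAN ROMÁN, señor',
                 'ÁÉÍÓÚÜÑáéíóúüñ',
                 'AEIOUUNaeiouun') AS unaccented
FROM dual;
-- JUAN ROMAN, senor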

What regular expression characters have to be escaped in SQL?

To prevent SQL injection attacks, the book "Building Scalable Web Sites" has a function to replace regular expression characters with an escaped version:
function db_escape_str_rlike($string) {
    return preg_replace("/([().\[\]*^\$])/", '\\\$1', $string);
}
Does this function escape ( ) . [ ] * ^ $? Why are only those characters escaped in SQL?
I found an excerpt from the book you mention, and it turns out the function is not for escaping to protect against SQL injection vulnerabilities. I assumed it was, and temporarily answered your question with that in mind. I think other commenters are making the same assumption.
The function is actually about escaping characters that you want to use in regular expressions. There are several characters that have special meaning in regular expressions, so if you want to search for those literal characters, you need to escape them (precede with a backslash).
This has little to do with SQL. You would need to escape the same characters if you wanted to search for them literally using grep, sed, perl, vim, or any other program that uses regular expression searches.
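For instance, in MySQL (where RLIKE lives, as the function name suggests), an unescaped dot matches any character, so a literal search needs the escaped form (a sketch; the doubled backslash is for the string literal, not the regex):
SELECT '1x99' RLIKE '1.99'   AS dot_matches_any;  -- 1: the '.' matches the 'x'
SELECT '1x99' RLIKE '1\\.99' AS dot_is_literal;   -- 0: '\.' only matches a real dot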
Unfortunately, escape characters in SQL databases are handled in vendor-specific ways. Each vendor uses their own conventions (notably Oracle's MySQL, which uses \ escape sequences).
The official SQL way to escape a ', which is the string delimiter used for values, is to double it, as in ''.
That should be the only way to ensure transparency in SQL statements, and the only way to introduce a literal ' into a string. As soon as a vendor accepts \' as a synonym for an escaped quote, you are forced to support all the extra escape sequences when delimiting strings. Suppose you have:
'Mac O''Connor' (should go into "Mac O'Connor" string)
and assume doubling is the only way to escape a '. Then, whenever you see a ', you check the next character for a '' sequence, and either:
you get '', which you convert into a single '; or
you get something else, so you terminate the string literal and process that character as the first of the next token.
But if you also accept \ as an escape, then you have to check for \' and for \\', and \\\' (this last one should be converted to \' on input), etc. You can run into trouble if you don't detect special cases such as:
\'' (should the '' be processed as SQL mandates, or is the first \' escaping the first ' and the second ' the string-ending quote?)
\\'' (should the \\ be converted into a single \ and then the ' be the string terminator, or do we switch to the SQL way of encoding and treat '' as a single quote?)
etc.
You have to check your database documentation to see whether \ as an escape character affects only the encoding of special characters (like control characters), whether it also affects the interpretation of the quote character, or whether it doesn't apply at all and you have to escape ' the other way.
That is the reason vendors include functions to escape/unescape character literals into values to be embedded in a SQL statement. The attackers' idea is to include (if you don't escape properly) escape sequences in the data they post to you, to see whether that lets them modify the text of the SQL command, add a semicolon ; and write a complete SQL statement that gives them free access to your database.
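MySQL's QUOTE() function is one example of such a vendor helper (a sketch; parameterized queries remain the safer option):
SELECT QUOTE('Mac O''Connor');
-- 'Mac O\'Connor'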

Removing replacement character � from column

Based on my research so far this character indicates bad encoding between the database and front end. Unfortunately, I don't have any control over either of those. I'm using Teradata Studio.
How can I filter this character out? I'm trying to perform a REGEXP_SUBSTR function on a column that occasionally contains �, which throws the error "The string contains an untranslatable character".
Here is my SQL. AIRCFT_POSITN_ID is the column that contains the replacement character.
SELECT DISTINCT AIRCFT_POSITN_ID,
REGEXP_SUBSTR(AIRCFT_POSITN_ID, '[0-9]+') AS AUTOROW
FROM PROD_MAE_MNTNC_VW.FMR_DISCRPNCY_DFRL
WHERE DFRL_CREATE_TMS > CURRENT_DATE -25
Your diagnosis is correct, so first of all you might want to check the Session Character Set (it is part of the connection definition).
If it is ASCII change it to UTF8 and you will be able to see the original characters instead of the substitute character.
And in case the character is indeed part of the data and not just an indication of encoding translation issues:
The substitute character AKA SUB (DEC: 26 HEX: 1A) is quite unique in Teradata.
You cannot use it directly:
select '�';
-- [6706] The string contains an untranslatable character.
select '1A'XC;
-- [6706] The string contains an untranslatable character.
If you are using version 14.0 or above you can generate it with the CHR function:
select chr(26);
If you're below version 14.0 you can generate it like this:
select translate (_unicode '05D0'XC using unicode_to_latin with error);
Once you have generated the character, you can use it with OREPLACE or OTRANSLATE:
create multiset table t (i int,txt varchar(100) character set latin) unique primary index (i);
insert into t (i,txt) values (1,translate ('Hello שלום world עולם' using unicode_to_latin with error));
select * from t;
-- Hello ���� world ����
select otranslate (txt,chr(26),'') from t;
-- Hello world
select otranslate (txt,translate (_unicode '05D0'XC using unicode_to_latin with error),'') from t;
-- Hello world
BTW, there are two versions of OTRANSLATE and OREPLACE:
The functions under SYSLIB work with LATIN.
The functions under TD_SYSFNLIB work with UNICODE.
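So if name resolution picks the wrong one for your data, you can qualify the call explicitly (a sketch against the table t created above, assuming the usual database-qualified call syntax):
select td_sysfnlib.otranslate(txt, chr(26), '') from t;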
In addition to Dudu's excellent answer above, I wanted to add the following now that I've encountered the issue again and had more time to experiment. The following SELECT command produced an untranslatable character:
SELECT IDENTIFY FROM PROD_MAE_MNTNC_VW.SCHD_MNTNC;
IDENTIFY
24FEB1747659193DC330A163DCL�ORD
Trying to perform a REGEXP_REPLACE or OREPLACE directly on this character produces an error:
Failed [6706 : HY000] The string contains an untranslatable character.
I changed the CHARSET property in my Teradata connection from UTF8 to ASCII, and I could then see the offending character, which looks like a tab:
IDENTIFY
Running the TRANSLATE_CHK command with this specific conversion succeeds and identifies the position of the offending character (note that this does not work using the UTF8 charset):
TRANSLATE_CHK(IDENTIFY USING KANJI1_SBC_TO_UNICODE) AS BADCHAR
BADCHAR
28
Now this character can be dealt with using some CASE statements to remove the bad character and retain the remainder of the string:
CASE WHEN TRANSLATE_CHK(IDENTIFY USING KANJI1_SBC_TO_UNICODE) = 0 THEN IDENTIFY
ELSE SUBSTR(IDENTIFY, 1, TRANSLATE_CHK(IDENTIFY USING KANJI1_SBC_TO_UNICODE)-1)
END AS IDENTIFY
Hope this helps someone out.

Converting symbols from outside of alphabet caused by copying text in different encoding

In my database I should only have data written using the Polish alphabet, but sometimes there are symbols outside the Polish alphabet (words copied from a source with a different encoding) that correspond to Polish letters in another encoding. Is it possible to somehow convert symbols outside the Polish alphabet to the corresponding letters?
The only solution I have come up with is to manually find and replace those characters, but maybe you have a better solution to my problem.
The question concerns the Oracle SQL language.
I don't have the database in front of me, but if I remember correctly the example could look like this (two rows from my db):
ŚWIAT
ÚWIAT
and what I need is to convert Ú that doesn't belong to Polish alphabet to Ś.
You can try this. Experiment with it first to see if it works.
If I want to change every occurrence of the letter z to a j in a string, I would use the translate function: translate(text_string, 'z', 'j'). I don't have to use the letters z and j; instead, I can write translate(text_string, chr(122), chr(106)) - to find out the character code, I use select ascii('z') from dual;. For example:
SQL> select translate('banzo', chr(122), chr(106)) from dual;
TRANS
-----
banjo
This changes every occurrence of z to j in text_string.
Now, you will have to find the code for the characters you want to change (both the "from" and the "to" characters) in your character set - it should be your session character set, not the database character set. (At least I think this is correct; experiment with it or read the documentation for CHR and perhaps for TRANSLATE - CHR returns the character code in the DATABASE character set unless you indicate otherwise, while I believe TRANSLATE uses the session character set.)
The function ascii may or may not work for non-ASCII characters, but if you google the name of your character set, you should find a character set table that will show you the codes for all the letters available in that character set.
Then, if this works, you can do the translation in one shot - translate(text_string, 'abcd', 'qrst') will change every 'a' to a 'q', every 'b' to an 'r' etc. And with chr(...), instead of 'abcd' you can write chr(97) || chr(98) || chr(99) || chr(100).
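For the Ú/Ś example from the question, that would be something like (a sketch; run it in a session whose character set contains both letters):
SELECT TRANSLATE('ÚWIAT', 'Ú', 'Ś') AS fixed FROM dual;
-- ŚWIAT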

Remove Special Characters from an Oracle String

From within an Oracle 11g database, using SQL, I need to remove the following sequence of special characters from a string, i.e.
~!##$%^&*()_+=\{}[]:”;’<,>./?
If any of these characters exist within a string, I would like them completely removed, except for these two characters, which I DO NOT want removed: "|" and "-".
For example:
From: 'ABC(D E+FGH?/IJK LMN~OP' To: 'ABCD EFGHIJK LMNOP' after removal of special characters.
I have tried this small test which works for this sample, i.e:
select regexp_replace('abc+de)fg','\+|\)') from dual
but is there a better way, using Oracle SQL, to handle my sequence of special characters above without writing a pattern like '\+|\)' for every special character?
You can replace anything other than letters and spaces with an empty string:
[^a-zA-Z ]
As per the comments:
I still need to keep the following two special characters within my string, i.e. "|" and "-".
Just exclude more
[^a-zA-Z|-]
Note: hyphen - should be in the starting or ending or escaped like \- because it has special meaning in the Character class to define a range.
For more info read about Character Classes or Character Sets
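Against the example string from the question, the negated class looks like this (a sketch; the trailing hyphen is literal because it is the last character in the class):
SELECT REGEXP_REPLACE('ABC(D E+FGH?/IJK LMN~OP', '[^a-zA-Z |-]', '') AS cleaned
FROM dual;
-- ABCD EFGHIJK LMNOP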
Consider using this regex replacement instead:
REGEXP_REPLACE('abc+de)fg', '[~!##$%^&*()_+=\\{}[\]:”;’<,>.\/?]', '')
The replacement will match any character from your list.
The regex to match your sequence of special characters is:
[]~!##$%^&*()_+=\{}[:”;’<,>./?]+
I think you still haven't escaped all of the regex-special characters.
To achieve that, go iteratively:
Build a test string and build up your regex string character by character to see if it removes what you expect to be removed.
If the latest character does not work, you have to escape it.
That should do the trick.
SELECT TRANSLATE('~!##$%sdv^&*()_+=\dsv{}[]:”;’<,>dsvsdd./?', '~!##$%^&*()_+=\{}[]:”;’<,>./?',' ')
FROM dual;
result:
TRANSLATE
-------------
sdvdsvdsvsdd
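Because TRANSLATE returns NULL when its third argument is an empty string (Oracle treats '' as NULL), the usual trick for pure removal is to map one dummy character to itself and list the unwanted characters after it (a sketch using the question's example string; the dummy X is arbitrary):
SELECT TRANSLATE('ABC(D E+FGH?/IJK LMN~OP',
                 'X~!##$%^&*()_+=\{}[]:”;’<,>./?',
                 'X') AS cleaned
FROM dual;
-- ABCD EFGHIJK LMNOP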
SQL> select translate('abc+de#fg-hq!m', 'a+-#!', 'a') from dual;
TRANSLATE(
----------
abcdefghqm