psycopg2 equivalent of MySQLdb.escape_string?

I'm passing some values into a postgres character field using psycopg2 in Python. Some of the string values contain periods, slashes, quotes etc.
With MySQL I'd just escape the string with
MySQLdb.escape_string(my_string)
Is there an equivalent for psycopg2?

Escaping is automatic; you just have to call:
cursor.execute("query with params %s %s", ("param1", "pa'ram2"))
(notice that the Python % operator is not used) and the values will be correctly escaped.
You can manually escape a variable using extensions.adapt(var), but this is error-prone and doesn't take the connection encoding into account: it is not meant to be used in regular client code.
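To illustrate the same pattern with something runnable without a Postgres server, here is a sketch using the stdlib sqlite3 module, which is also a DB-API 2.0 driver (sqlite3 uses ? placeholders where psycopg2 uses %s); the driver handles the embedded quote itself:

```python
import sqlite3  # stdlib DB-API driver, used here only for illustration

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE t (name TEXT)")
# The driver escapes the apostrophe in pa'ram2; no manual escaping needed.
cur.execute("INSERT INTO t (name) VALUES (?)", ("pa'ram2",))
cur.execute("SELECT name FROM t")
assert cur.fetchone()[0] == "pa'ram2"
```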

Like piro said, escaping is automatic. But there is also a method to get back the full SQL as escaped by psycopg2: cursor.mogrify(sql, [params])

In the unlikely event that query parameters aren't sufficient and you need to escape strings yourself, you can use Postgres escaped string constants along with Python's repr (this is Python 2 code; Python's rules for escaping non-ASCII and Unicode characters match Postgres's):
def postgres_escape_string(s):
    if not isinstance(s, basestring):
        raise TypeError("%r must be a str or unicode" % (s,))
    escaped = repr(s)
    if isinstance(s, unicode):
        assert escaped[:1] == 'u'
        escaped = escaped[1:]
    if escaped[:1] == '"':
        escaped = escaped.replace("'", "\\'")
    elif escaped[:1] != "'":
        raise AssertionError("unexpected repr: %s" % escaped)
    return "E'%s'" % (escaped[1:-1],)

Psycopg2 doesn't have such a method. It has an extension for adapting Python values to ISQLQuote objects, and these objects have a getquoted() method to return PostgreSQL-compatible values.
See this blog for an example of how to use it:
Quoting bound values in SQL statements using psycopg2
Update 2019-03-03: changed the link to archive.org, because after nine years, the original is no longer available.

psycopg2 added a method for this in version 2.7:
http://initd.org/psycopg/docs/extensions.html#psycopg2.extensions.quote_ident
import psycopg2
from psycopg2.extensions import quote_ident
with psycopg2.connect(<db config>) as conn:
    with conn.cursor() as curs:
        ident = quote_ident('foo', curs)
If you get an error like
TypeError: argument 2 must be a connection or a cursor
try either:
ident = quote_ident('foo', curs.cursor)
# or
ident = quote_ident('foo', curs.__wrapper__)

Related

SparkSQL Regular Expression: Cannot remove backslash from text

I have data embedded in my text fields that I need to suppress. The data is in the format \nnn, where nnn is a 3-digit number. I tried the following:
spark.sql("select regexp_replace('ABC\123XYZ\456','[\d][\d][\d]','') as new_value").show()
I expected the result to be "ABC\XYZ\", but what I got instead was:
+---------+
|new_value|
+---------+
| ABCSXYZĮ|
+---------+
I'm not sure what the other characters are after the C and after Z.
However I need to remove the backslash as well. To get rid of the backslash, I then tried this:
spark.sql("select regexp_replace('ABC\123XYZ\456','[\\][\d][\d][\d]','') as new_value").show()
However, it simply crashed. No matter how I tried to escape the backslash, it failed. I tested the regular expression on regex101.com and it was fine.
I appreciate any suggestions!
EDIT: I tried Daniel's suggestion but received the following errors:
>>> spark.sql(r"select regexp_replace(r'ABC\123XYZ\456',r'[\\][\d][\d][\d]','') as new_value").show()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/session.py", line 649, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 134, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.ParseException:
Literals of type 'R' are currently not supported.(line 1, pos 22)
== SQL ==
select regexp_replace(r'ABC\123XYZ\456',r'[\\][\d][\d][\d]','') as new_value
----------------------^^^
Thoughts?
If you check the string outside of the spark.sql call, you will notice the string has special characters due to how Python handles the backslash in a literal string:
>>> "select regexp_replace('ABC\123XYZ\456','[\\][\d][\d][\d]','') as new_value"
"select regexp_replace('ABCSXYZĮ','[\\][\\d][\\d][\\d]','') as new_value"
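The mangled characters can be reproduced with plain Python, no Spark involved: in a normal (non-raw) string literal, \123 and \456 are octal escape sequences:

```python
# In a normal Python string literal, \123 and \456 are octal escapes:
# 0o123 = 83 -> 'S' and 0o456 = 302 -> 'Į' (U+012E), which is exactly
# the garbling seen in the original output.
s = "ABC\123XYZ\456"
assert s == "ABCSXYZ\u012e"  # i.e. 'ABCSXYZĮ'
assert len(s) == 8           # the backslashes are gone entirely
```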
If we add the r prefix to the string, it will be considered a "raw" string (see here), which treats backslashes as literal characters:
>>> r"select regexp_replace('ABC\123XYZ\456','[\\][\d][\d][\d]','') as new_value"
"select regexp_replace('ABC\\123XYZ\\456','[\\\\][\\d][\\d][\\d]','') as new_value"
That looks correct, but if we supply just this raw Python string to spark.sql, it raises SparkException: The Spark SQL phase optimization failed with an internal error. In addition to making the Python string a raw string, we also need to make the strings inside the Spark SQL query raw, which is again done by adding the r prefix (see here) to the strings used within the query, so that backslashes are escaped correctly by the Spark SQL parser.
It's a bit odd looking, but the end result is:
spark.sql(r"select regexp_replace(r'ABC\123XYZ\456',r'[\\][\d][\d][\d]','') as new_value").show()
And the output is, as expected:
+---------+
|new_value|
+---------+
| ABCXYZ|
+---------+
It's also worth mentioning if you call regexp_replace as a function directly (from pyspark.sql.functions import regexp_replace), rather than within a Spark SQL query, the pattern in the regex seems to be implicitly treated as a raw string. It does not require an r prefix - the backslash in the first character class just needs to be escaped:
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace
spark = SparkSession\
.builder\
.appName("test")\
.getOrCreate()
df = spark.createDataFrame([(r'ABC\123XYZ\456',)], ['temp'])
df.select(regexp_replace('temp', '[\\\][\d][\d][\d]', '').alias('new_value')).show()
Output:
+---------+
|new_value|
+---------+
| ABCXYZ|
+---------+
Update
It looks like when running on EMR, raw strings aren't supported in Spark SQL, which manifests in parsing errors. In that case, making the Python string a raw string and adding escape characters to the SQL string should also work:
from pyspark.sql import SparkSession
spark = SparkSession\
.builder\
.appName("test")\
.getOrCreate()
spark.sql(r"select regexp_replace('ABC\\123XYZ\\456','[\\\\][\\\d][\\\d][\\\d]','') as new_value").show()
Output:
+---------+
|new_value|
+---------+
| ABCXYZ|
+---------+
Note that the string the regex is being applied to only requires a single escape (\) for each \, whereas the pattern requires two escape characters (\\) for each \.
Daniel,
Thank you SO MUCH for your help! With the examples you gave, I was able to tweak my SQL statement so it would work. Besides putting the r before the select statement string, I had to change the regular expression to this:
'[\\\\][\\\d][\\\d][\\\d]'
Thank you much!
Aaron

Printing Unnecessary escape character [duplicate]

I have tried many ways to get a single backslash from executed code (I don't mean input from HTML).
I can get special characters such as tab and newline and escape them to \\t or \\n or \\(some other character), but I cannot get a single backslash when a non-special character is next to it.
I don't want something like:
str = "\apple"; // I want this, to return:
console.log(str); // \apple
and if I try to get the character at index 0, I get a instead of \.
(See ES2015 update at the end of the answer.)
You've tagged your question both string and regex.
In JavaScript, the backslash has special meaning both in string literals and in regular expressions. If you want an actual backslash in the string or regex, you have to write two: \\.
The following string starts with one backslash, the first one you see in the literal is an escape character starting an escape sequence. The \\ escape sequence tells the parser to put a single backslash in the string:
var str = "\\I have one backslash";
The following regular expression will match a single backslash (not two); again, the first one you see in the literal is an escape character starting an escape sequence. The \\ escape sequence tells the parser to put a single backslash character in the regular expression pattern:
var rex = /\\/;
If you're using a string to create a regular expression (rather than using a regular expression literal as I did above), note that you're dealing with two levels: The string level, and the regular expression level. So to create a regular expression using a string that matches a single backslash, you end up using four:
// Matches *one* backslash
var rex = new RegExp("\\\\");
That's because first, you're writing a string literal, but you want to actually put backslashes in the resulting string, so you do that with \\ for each one backslash you want. But your regex also requires two \\ for every one real backslash you want, and so it needs to see two backslashes in the string. Hence, a total of four. This is one of the reasons I avoid using new RegExp(string) whenever I can; I get confused easily. :-)
ES2015 and ES2018 update
Fast-forward to 2015: as Dolphin_Wood points out, the new ES2015 standard gives us template literals, tag functions, and the String.raw function:
// Yes, this unlikely-looking syntax is actually valid ES2015
let str = String.raw`\apple`;
str ends up having the characters \, a, p, p, l, and e in it. Just be careful there are no ${ in your template literal, since ${ starts a substitution in a template literal. E.g.:
let foo = "bar";
let str = String.raw`\apple${foo}`;
...ends up being \applebar.
Try String.raw method:
str = String.raw`\apple` // "\apple"
Reference here: String.raw()
\ is an escape character; when followed by a non-special character, it doesn't become a literal \. Instead, you have to double it: \\.
console.log("\apple"); //-> "apple"
console.log("\\apple"); //-> "\apple"
There is no way to get the original, raw string definition or create a literal string without escape characters.
Please try the one below; it works for me and I get the output with the backslash:
String sss="dfsdf\\dfds";
System.out.println(sss);

Escape String interpolation in a string literal

In a normal String I can escape the ${variable} with a backslash:
"You can use \${variable} syntax in Kotlin."
Is it possible to do the same in a String literal? The backslash is no longer an escape character:
// Undesired: Produces "This \something will be substituted.
"""This \${variable} will be substituted."""
So far, the only solutions I see are String concatenation, which is terribly ugly, and nesting the interpolation, which starts to get a bit ridiculous:
// Desired: Produces "This ${variable} will not be substituted."
"""This ${"\${variable}"} will not be substituted."""
From kotlinlang.org:
If you need to represent a literal $ character in a raw string (which doesn't
support backslash escaping), you can use the following syntax:
val price = """
${'$'}9.99
"""
So, in your case:
"""This ${'$'}{variable} will not be substituted."""
As per String templates docs you can represent the $ directly in a raw string:
Templates are supported both inside raw strings and inside escaped strings. If you need to represent a literal $ character in a raw string (which doesn't support backslash escaping), you can use the following syntax:
val text = """This ${'$'}{variable} will be substituted."""
println(text) // This ${variable} will be substituted.

Hive convert a string to an array of characters

How can I convert a string to an array of characters, for example
"abcd" -> ["a","b","c","d"]
I know the split method:
SELECT split("abcd","");
#["a","b","c","d",""]
Is the trailing empty string a bug, or are there any other ideas?
This is not actually a bug. Hive's split function simply calls the underlying Java String#split(String regexp, int limit) method with the limit parameter set to -1, which causes trailing empty strings to be included in the result.
I'm not going to dig into implementation details on why it's happening since there is already a brilliant answer that describes the issue. Note that str.split("", -1) will return different results depending on the version of Java you use.
A few alternatives:
Use "(?!\A|\z)" as a separator regexp, e.g. split("abcd", "(?!\\A|\\z)"). This will make the regexp matcher skip zero-width matches at the start and at the end positions of the string.
Create a custom UDF that uses either String#toCharArray(), or accepts limit as an argument of the UDF so you can use it as: SPLIT("", 0)
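The lookahead-separator trick can be sanity-checked with Python's stdlib re module, which has the same zero-width semantics (Python spells the end anchor \Z where Java uses \z) — a stand-in for the Hive behavior, not Hive itself:

```python
import re

# Splitting on the empty pattern keeps zero-width matches at both ends
# (Python 3.7+), analogous to Hive's trailing empty element.
assert re.split(r"", "abcd") == ["", "a", "b", "c", "d", ""]

# A negative lookahead excluding the start and end positions removes
# the unwanted empty elements.
assert re.split(r"(?!\A|\Z)", "abcd") == ["a", "b", "c", "d"]
```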
I don't know if it is a bug or if that's just how it works. As an alternative, you could use explode and collect_list, excluding blanks with a WHERE clause:
SELECT collect_list(l)
FROM ( SELECT EXPLODE(split('abcd','') ) as l ) t
WHERE t.l <> '';

how to escape a single quote in a pig script

Can Pig scripts use double quotes? If not, how do I escape a single quote? I'm trying to parse a date-time and I'm getting errors:
Unexpected character '"'
And here is the script
logOutput = FOREACH parsedLog GENERATE uid, ToDate(timestamp,"YYYY-MM-DD'T'hh:mm ss:'00'") as theTime:datetime
You can escape a single quote using \\ (double backslash).
%declare CURRENT_TIME_ISO_FORMAT ToString($CURRENT_TIME,'yyyy-MM-dd\\'T\\'HH:mm:ss.SSSZ')
Just be aware that when you use this escaping, you should not reuse the created string in another place in the script; instead, do everything in a single call.
For example, let's say you want to send the String to the ISOToDay function, this script will fail:
%declare CURRENT_TIME_ISO_FORMAT ToString($CURRENT_TIME,'yyyy-MM-dd\\'T\\'HH:mm:ss.SSSZ')
%declare TODAY_BEGINNING_OF_DAY_ISO_FORMAT ISOToDay($CURRENT_TIME_ISO_FORMAT)
Instead, you should do:
%declare TODAY_BEGINNING_OF_DAY_ISO_FORMAT ISOToDay(ToString($CURRENT_TIME,'yyyy-MM-dd\\'T\\'HH:mm:ss.SSSZ'))
Try escaping them with \ and using single quotes.
logOutput = FOREACH parsedLog GENERATE uid, ToDate(timestamp,'YYYY-MM-DD\'T\'hh:mm ss:00') as theTime:datetime
Not sure what you mean with '00'.