SparkSQL Regular Expression: Cannot remove backslash from text - apache-spark-sql

I have data embedded in my text fields that I need to suppress. The data is in the format \nnn, where nnn is a 3-digit number. I tried the following:
spark.sql("select regexp_replace('ABC\123XYZ\456','[\d][\d][\d]','') as new_value").show()
I expected the result to be "ABC\XYZ\", but what I got instead was:
+---------+
|new_value|
+---------+
| ABCSXYZĮ|
+---------+
I'm not sure what the other characters are after the C and after the Z. I also need to remove the backslash. To get rid of it, I then tried this:
spark.sql("select regexp_replace('ABC\123XYZ\456','[\\][\d][\d][\d]','') as new_value").show()
However, it simply crashed on me. No matter how I tried to escape the backslash, it failed. I tested the regular expression on regex101.com and it was fine.
I appreciate any suggestions!
EDIT: I tried Daniel's suggestion but received the following errors:
>>> spark.sql(r"select regexp_replace(r'ABC\123XYZ\456',r'[\\][\d][\d][\d]','') as new_value").show()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 649, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 134, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.ParseException:
Literals of type 'R' are currently not supported.(line 1, pos 22)
== SQL ==
select regexp_replace(r'ABC\123XYZ\456',r'[\\][\d][\d][\d]','') as new_value
----------------------^^^
Thoughts?

If you check the string outside of the spark.sql call, you will notice the string has special characters due to how Python handles the backslash in a literal string:
>>> "select regexp_replace('ABC\123XYZ\456','[\\][\d][\d][\d]','') as new_value"
"select regexp_replace('ABCSXYZĮ','[\\][\\d][\\d][\\d]','') as new_value"
If we add the r prefix to the string, it will be considered a "raw" string (see here), which treats backslashes as literal characters:
>>> r"select regexp_replace('ABC\123XYZ\456','[\\][\d][\d][\d]','') as new_value"
"select regexp_replace('ABC\\123XYZ\\456','[\\\\][\\d][\\d][\\d]','') as new_value"
That looks correct, but if we supply just the raw Python string to spark.sql, the exception SparkException: The Spark SQL phase optimization failed with an internal error. is raised. In addition to making the Python string raw, we also need to make the string literals inside the Spark SQL query raw, which can again be accomplished with the r prefix (see here), so that the Spark SQL parser handles the backslashes correctly.
It's a bit odd looking, but the end result is:
spark.sql(r"select regexp_replace(r'ABC\123XYZ\456',r'[\\][\d][\d][\d]','') as new_value").show()
And the output is, as expected:
+---------+
|new_value|
+---------+
|   ABCXYZ|
+---------+
It's also worth mentioning that if you call regexp_replace as a function directly (from pyspark.sql.functions import regexp_replace), rather than within a Spark SQL query, the pattern seems to be implicitly treated as a raw string. It does not require an r prefix; the backslash in the first character class just needs to be escaped:
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

spark = SparkSession\
    .builder\
    .appName("test")\
    .getOrCreate()

df = spark.createDataFrame([(r'ABC\123XYZ\456',)], ['temp'])
# Python keeps the unrecognized escape sequences intact, so the pattern
# string that reaches the regex engine is [\\][\d][\d][\d].
df.select(regexp_replace('temp', '[\\\][\d][\d][\d]', '').alias('new_value')).show()
Output:
+---------+
|new_value|
+---------+
|   ABCXYZ|
+---------+
Update
It looks like raw strings aren't supported in Spark SQL when running on EMR, which manifests as the parsing error above. In that case, making the Python string a raw string and adding escape characters to the SQL string should also work:
from pyspark.sql import SparkSession

spark = SparkSession\
    .builder\
    .appName("test")\
    .getOrCreate()

spark.sql(r"select regexp_replace('ABC\\123XYZ\\456','[\\\\][\\\d][\\\d][\\\d]','') as new_value").show()
Output:
+---------+
|new_value|
+---------+
|   ABCXYZ|
+---------+
Note that the string the regex is being applied to only requires a single escape (\) for each \, whereas the pattern requires two escape characters (\\) for each \.
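One way to sanity-check the escaping is to build the query as a plain Python string and print it before calling spark.sql, so you can see exactly what the Spark SQL parser receives. A small sketch, reusing the query from the update above (it assumes the spark session already exists):
# The r prefix keeps the backslashes intact, so printing the string shows
# exactly the text handed to the Spark SQL parser.
query = r"select regexp_replace('ABC\\123XYZ\\456','[\\\\][\\\d][\\\d][\\\d]','') as new_value"
print(query)
# The SQL parser unescapes each \\ inside its string literals once more, so
# the data becomes ABC\123XYZ\456 and the pattern handed to the regex engine
# is equivalent to [\\][\d][\d][\d]: a literal backslash followed by three digits.
spark.sql(query).show()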

Daniel,
Thank you SO MUCH for your help! With the examples you gave, I was able to tweak my SQL statement so it would work. Besides putting the r before the select statement string, I had to change the regular expression to this:
'[\\\\][\\\d][\\\d][\\\d]'
Thank you much!
Aaron

Related

Is it possible to read a csv with `\r\n` line terminators in pandas?

I'm using pandas==1.1.5 to read a CSV file. I'm running the following code:
import pandas as pd
import csv

csv_kwargs = dict(
    delimiter="\t",
    lineterminator="\r\n",
    quoting=csv.QUOTE_MINIMAL,
    escapechar="!",
)
pd.read_csv("...", **csv_kwargs)
It raises the following error: ValueError: Only length-1 line terminators supported.
Pandas documentation confirms that line terminators should be length-1 (i.e., a single character).
Is there any way to read this CSV with pandas, or should I read it some other way?
Note that the docs suggest the length-1 restriction applies to the C parser; maybe I can plug in some other parser?
EDIT: Not specifying the line terminator raises a parse error in the middle of the file, specifically ParserError: Error tokenizing data., where it expects the correct number of fields but gets too many.
EDIT2: I'm confident the kwargs above were used to create the CSV file I'm trying to read.
The problem might be in the escapechar, since ! is a common text character.
Python's csv module defines a very strict use of escapechar:
A one-character string used by the writer to escape the delimiter if quoting is set to QUOTE_NONE and the quotechar if doublequote is False.
but it's possible that pandas interprets it differently:
One-character string used to escape other characters.
It's possible that you have a row that contains something like:
...\t"some important text!"\t...
which would escape the quote character and continue parsing text into that column.
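A small sketch of that failure mode, using made-up file contents with the same delimiter, quoting, and escapechar settings as above (lineterminator is left out because of the length-1 restriction discussed in the question):
import csv
import io

import pandas as pd

# A made-up row where the free text ends in "!" right before the closing quote.
data = 'a\tdesc\tb\r\n"1"\t"some important text!"\t"2"\r\n'

# With escapechar="!", the parser treats !" as an escaped quote, so the desc
# field does not close where intended. Depending on the pandas version this
# surfaces as mangled columns or a ParserError.
try:
    print(pd.read_csv(io.StringIO(data), delimiter="\t",
                      quoting=csv.QUOTE_MINIMAL, escapechar="!"))
except pd.errors.ParserError as exc:
    print("ParserError:", exc)

# Without the escapechar, the same data parses cleanly.
print(pd.read_csv(io.StringIO(data), delimiter="\t",
                  quoting=csv.QUOTE_MINIMAL))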

'unicodeescape' codec can't decode bytes in position 0-1: malformed \N character escape

I am trying to push data from Databricks into SQL. However, I get an error.
I noticed that one of the columns in the file I am processing has \N as a value.
I have tried to filter out the records by using the following:
df = df.filter(df.COLUMN != "\N")
However, when the above runs, I get the error message identified in the title. Is there some way to filter out values that have an escape character in them?
I would really appreciate any help. Thank you.
As the error suggests, you need to escape the backslash (\):
df = df.filter(df.COLUMN != "\\N")
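To see why the original filter fails, note that in a regular Python string literal \N begins a named Unicode escape (for example \N{DEGREE SIGN}), so a bare "\N" is rejected before the filter even runs. Escaping the backslash or using a raw string avoids it. A small sketch, assuming a DataFrame df with the column from the question:
# "\N"            # SyntaxError: malformed \N character escape
literal = "\\N"   # two characters: a backslash followed by N
raw = r"\N"       # the same two characters, written as a raw string
assert literal == raw

# Either spelling works in the filter.
df_clean = df.filter(df.COLUMN != r"\N")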

Is there a way I can deal with people using double quotes in csv files in pandas/python?

I'm dealing with files we get sent by a client - so we can only get changes to the files we get sent with a lot of effort. Sometimes, in a free text field, we get a mention of length using the double quote characters to mean inches. For example, a file might look like this.
"count","desc","start_date","end_date"
"3","it is tall","3/18/2019","4/20/2020"
"10","height: 108" is nice,","04/11/2016","09/22/2015"
"8","it is short","7/20/2019","8/22/2020"
We are using python/pandas. When I load it using:
import pandas as pd
df = pd.read_csv("sample.csv", dtype=str)
I get:
There are two issues I am hoping to solve:
More important issue: I'd like the second value of start_date to be 04/11/2016 (without the comma at the start and the double quote at the end).
Less important issue: I'd like the second value of desc to be height: 108" is nice, (with the inches indicator).
I know that the right thing to do is to get the file escape the quote using \" but, like I said, that will be a hard change to get.
You can exploit the pattern that the values are separated by "," and remove first and last ". This solution will break if the free text field contains ",".
import pandas as pd
import io
with open('sample.csv') as f:
    t = f.read()
print(t)
Out:
"count","desc","start_date","end_date"
"3","it is tall","3/18/2019","4/20/2020"
"10","height: 108" is nice,","04/11/2016","09/22/2015"
"8","it is short","7/20/2019","8/22/2020"
Remove first and last " in every row and read_csv with delimiter ","
t = '\n'.join([i.strip('"') for i in t.split('\n')])
pd.read_csv(io.StringIO(t), sep='","', engine='python')
Out:
   count                   desc  start_date    end_date
0      3             it is tall   3/18/2019   4/20/2020
1     10  height: 108" is nice,  04/11/2016  09/22/2015
2      8            it is short   7/20/2019   8/22/2020

Error adding a string with a colon ":" to the "tsvector" data type

In my PostgreSQL 11 database, there is a "name" column with the "tsvector" data type for implementing full-text search.
But when I try to add an entry containing a colon ":" to this column, an error occurs:
Exception in thread Thread-10:
Traceback (most recent call last):
  File "C:\Program Files\Python37\lib\threading.py", line 917, in _bootstrap_inner
    self.run()
  File "C:\Program Files\Python37\lib\threading.py", line 865, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\vs\Desktop\Арсений execute\allsave.py", line 209, in group_parsing
    VALUES (%s,%s,%s,%s)''', a[i])
psycopg2.ProgrammingError: ERROR: syntax error in tsvector: "Reggae.FM:"
LINE 3: VALUES (181649,'Reggae.FM:'
When I added this data to the "text" field type, there were no problems. But apparently "tsvector" does not accept strings containing a colon ":" and, probably, some other characters.
The question is, how do I implement full-text search if the "tsvector" cannot store such characters?
P.S. Using "text" or "char" is not a solution; searching for such data types is very slow.
I get the strings by parsing groups on vk.com (a Russian social network), that is, the names of all existing groups. I need to keep these names in full form so that users can find them on my site. But any solution will help me.
Use to_tsvector to normalize the string and return a tsvector:
INSERT INTO ...
VALUES (%s,to_tsvector(%s),%s,%s)''', a[i])
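In psycopg2 that might look roughly like the following sketch; the table name, the other columns, and the connection string are hypothetical, since they are not shown in the question:
import psycopg2

conn = psycopg2.connect("dbname=test")  # hypothetical connection string
cur = conn.cursor()

# to_tsvector() runs on the server and normalizes the raw name, so values
# like 'Reggae.FM:' no longer trip the tsvector literal syntax.
cur.execute(
    """INSERT INTO groups (group_id, name, screen_name, members)
       VALUES (%s, to_tsvector(%s), %s, %s)""",
    (181649, 'Reggae.FM:', 'reggaefm', 0),
)
conn.commit()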
Note that casting as tsvector would not work here:
unutbu=# select 'Reggae.FM:'::tsvector;
ERROR: syntax error in tsvector: "Reggae.FM:"
LINE 1: select 'Reggae.FM:'::tsvector
^
This is what to_tsvector returns:
unutbu=# select to_tsvector('Reggae.FM:');
+---------------+
|  to_tsvector  |
+---------------+
| 'reggae.fm':1 |
+---------------+
(1 row)

psycopg2 equivalent of mysqldb.escape_string?

I'm passing some values into a postgres character field using psycopg2 in Python. Some of the string values contain periods, slashes, quotes etc.
With MySQL I'd just escape the string with
MySQLdb.escape_string(my_string)
Is there an equivalent for psycopg2?
Escaping is automatic, you just have to call:
cursor.execute("query with params %s %s", ("param1", "pa'ram2"))
(notice that the python % operator is not used) and the values will be correctly escaped.
You can escape manually a variable using extensions.adapt(var), but this would be error prone and not keep into account the connection encoding: it is not supposed to be used in regular client code.
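For completeness, a small sketch of manual adaptation with extensions.adapt (again, not something to rely on in regular client code):
from psycopg2.extensions import adapt

# adapt() wraps the value in an adapter object; getquoted() returns the
# escaped SQL literal (bytes in Python 3). Note the doubled single quote.
print(adapt("pa'ram2").getquoted())  # e.g. b"'pa''ram2'"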
Like piro said, escaping is automatic. But there's a method to also return the full sql escaped by psycopg2 using cursor.mogrify(sql, [params])
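For example, a sketch assuming an open connection (the connection parameters are hypothetical):
import psycopg2

conn = psycopg2.connect("dbname=test user=postgres")  # hypothetical DSN
cur = conn.cursor()

# mogrify returns the query with the parameters already escaped and bound,
# without executing it (as bytes in Python 3).
print(cur.mogrify("select %s, %s", ("param1", "pa'ram2")))
# e.g. b"select 'param1', 'pa''ram2'"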
In the unlikely event that query parameters aren't sufficient and you need to escape strings yourself, you can use Postgres escaped string constants along with Python's repr (because Python's rules for escaping non-ascii and unicode characters are the same as Postgres's):
def postgres_escape_string(s):
    if not isinstance(s, basestring):
        raise TypeError("%r must be a str or unicode" % (s,))
    escaped = repr(s)
    if isinstance(s, unicode):
        assert escaped[:1] == 'u'
        escaped = escaped[1:]
    if escaped[:1] == '"':
        escaped = escaped.replace("'", "\\'")
    elif escaped[:1] != "'":
        raise AssertionError("unexpected repr: %s" % escaped)
    return "E'%s'" % (escaped[1:-1],)
Psycopg2 doesn't have such a method. It has an extension for adapting Python values to ISQLQuote objects, and these objects have a getquoted() method to return PostgreSQL-compatible values.
See this blog for an example of how to use it:
Quoting bound values in SQL statements using psycopg2
Update 2019-03-03: changed the link to archive.org, because after nine years, the original is no longer available.
psycopg2 added a method in version 2.7 it seems:
http://initd.org/psycopg/docs/extensions.html#psycopg2.extensions.quote_ident
import psycopg2
from psycopg2.extensions import quote_ident

with psycopg2.connect(<db config>) as conn:
    with conn.cursor() as curs:
        ident = quote_ident('foo', curs)
If you get an error like TypeError: argument 2 must be a connection or a cursor, try either:
ident = quote_ident('foo', curs.cursor)
# or
ident = quote_ident('foo', curs.__wrapper__)