Error adding a string with a colon ":" to the "tsvector" data type

In my PostgreSQL 11 database, there is a "name" column with the "tsvector" data type for implementing full-text search.
But when I try to add an entry containing a colon ":" to this column, an error occurs:
Exception in thread Thread-10:
Traceback (most recent call last):
File "C:\Program Files\Python37\lib\threading.py", line 917, in_bootstrap_inner
self.run()
File "C:\Program Files\Python37\lib\threading.py", line 865, in run
self._target(*self._args, **self._kwargs)
File "C:\Users\vs\Desktop\Арсений execute\allsave.py", line 209, in group_parsing
VALUES (%s,%s,%s,%s)''', a[i])
psycopg2.ProgrammingError: ERROR: syntax error in tsvector: "Reggae.FM:"
LINE 3: VALUES (181649,'Reggae.FM:'
When I added this data to a "text" column, there were no problems. But apparently "tsvector" does not accept strings containing a colon ":" and, probably, some other characters.
The question is, how do I implement full-text search if the "tsvector" cannot store such characters?
P.S. Using "text" or "char" is not a solution; searching on those data types is very slow.
I get the strings by parsing groups on vk.com (a Russian social network); that is, the names of all existing groups. I need to keep these names in full form so that users can find them on my site. But any solution will help me.

Use to_tsvector to normalize the string and return a tsvector:
INSERT INTO ...
VALUES (%s,to_tsvector(%s),%s,%s)''', a[i])
Note that casting to tsvector would not work here:
unutbu=# select 'Reggae.FM:'::tsvector;
ERROR: syntax error in tsvector: "Reggae.FM:"
LINE 1: select 'Reggae.FM:'::tsvector
^
This is what to_tsvector returns:
unutbu=# select to_tsvector('Reggae.FM:');
+---------------+
| to_tsvector |
+---------------+
| 'reggae.fm':1 |
+---------------+
(1 row)
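Putting this together, here is a minimal psycopg2 sketch (the vk_groups table, its columns, and the connection settings are all hypothetical): keep the raw name in a text column for display, store the normalized tsvector alongside it, and add a GIN index so the search is fast:
import psycopg2

conn = psycopg2.connect(dbname="mydb")  # assumed connection settings
cur = conn.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS vk_groups (
                   id integer PRIMARY KEY,
                   name text,           -- original string, shown to users
                   name_tsv tsvector    -- normalized form, used for search
               )''')
# A GIN index on the tsvector column is what makes full-text search fast.
cur.execute('CREATE INDEX IF NOT EXISTS vk_groups_tsv_idx ON vk_groups USING GIN (name_tsv)')
# to_tsvector normalizes the raw string, so the trailing ":" is no problem.
cur.execute('''INSERT INTO vk_groups (id, name, name_tsv)
               VALUES (%s, %s, to_tsvector(%s))''',
            (181649, 'Reggae.FM:', 'Reggae.FM:'))
conn.commit()
# Search against the normalized column:
cur.execute('SELECT id, name FROM vk_groups WHERE name_tsv @@ plainto_tsquery(%s)',
            ('Reggae.FM',))
print(cur.fetchall())  # should print [(181649, 'Reggae.FM:')]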

Related

SparkSQL Regular Expression: Cannot remove backslash from text

I have data embedded in my text fields that I need to suppress. The data is in the format \nnn, where nnn is a 3-digit number. I tried the following:
spark.sql("select regexp_replace('ABC\123XYZ\456','[\d][\d][\d]','') as new_value").show()
I expected the result to be "ABC\XYZ\", but what I got instead was:
+---------+
|new_value|
+---------+
| ABCSXYZĮ|
+---------+
I'm not sure what the other characters are after the C and after Z.
However, I need to remove the backslash as well. To get rid of the backslash, I then tried this:
spark.sql("select regexp_replace('ABC\123XYZ\456','[\\][\d][\d][\d]','') as new_value").show()
However, it simply crashed on me. No matter how I tried to escape the backslash, it failed. I tested the regular expression on regex101.com and it was fine.
I appreciate any suggestions!
EDIT: I tried Daniel's suggestion but received the following errors:
>>> spark.sql(r"select regexp_replace(r'ABC\123XYZ\456',r'[\\][\d][\d][\d]','') as new_value").show()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/session.py", line 649, in sql
return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/usr/lib/spark/python/pyspark/sql/utils.py", line 134, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.ParseException:
Literals of type 'R' are currently not supported.(line 1, pos 22)
== SQL ==
select regexp_replace(r'ABC\123XYZ\456',r'[\\][\d][\d][\d]','') as new_value
----------------------^^^
Thoughts?
If you check the string outside of the spark.sql call, you will notice the string has special characters due to how Python handles the backslash in a literal string:
>>> "select regexp_replace('ABC\123XYZ\456','[\\][\d][\d][\d]','') as new_value"
"select regexp_replace('ABCSXYZĮ','[\\][\\d][\\d][\\d]','') as new_value"
If we add the r prefix to the string, it will be considered a "raw" string (see here), which treats backslashes as literal characters:
>>> r"select regexp_replace('ABC\123XYZ\456','[\\][\d][\d][\d]','') as new_value"
"select regexp_replace('ABC\\123XYZ\\456','[\\\\][\\d][\\d][\\d]','') as new_value"
That looks correct, but if we just supply the raw Python string to spark.sql, the exception SparkException: The Spark SQL phase optimization failed with an internal error. is raised. In addition to making the Python string raw, we also need to make the string literals inside the Spark SQL query raw, which can again be accomplished with the r prefix (see here) applied to the strings used within the query. This ensures the backslashes are handled correctly by the Spark SQL parser.
It's a bit odd looking, but the end result is:
spark.sql(r"select regexp_replace(r'ABC\123XYZ\456',r'[\\][\d][\d][\d]','') as new_value").show()
And the output is, as expected:
+---------+
|new_value|
+---------+
| ABCXYZ|
+---------+
It's also worth mentioning that if you call regexp_replace as a function directly (from pyspark.sql.functions import regexp_replace), rather than within a Spark SQL query, the pattern in the regex seems to be implicitly treated as a raw string. It does not require an r prefix; the backslash in the first character class just needs to be escaped:
from pyspark.sql import SparkSession
from pyspark.sql.functions import regexp_replace

spark = SparkSession \
    .builder \
    .appName("test") \
    .getOrCreate()

df = spark.createDataFrame([(r'ABC\123XYZ\456',)], ['temp'])
df.select(regexp_replace('temp', '[\\\\][\\d][\\d][\\d]', '').alias('new_value')).show()
Output:
+---------+
|new_value|
+---------+
| ABCXYZ|
+---------+
Update
It looks like when running on EMR, raw strings aren't supported in Spark SQL, which manifests as parsing errors. In that case, making the Python string a raw string and adding escape characters to the SQL string should also work:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("test") \
    .getOrCreate()

spark.sql(r"select regexp_replace('ABC\\123XYZ\\456','[\\\\][\\\d][\\\d][\\\d]','') as new_value").show()
Output:
+---------+
|new_value|
+---------+
| ABCXYZ|
+---------+
Note that the string the regex is being applied to only requires a single escape character (\) for each \, whereas the pattern requires two escape characters (\\) for each \, because the pattern's backslashes must survive both the SQL parser and the regex engine.
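As a quick cross-check outside Spark, the regex that ultimately reaches the engine behaves the same way in Python's own re module:
>>> import re
>>> re.sub(r'[\\][\d][\d][\d]', '', r'ABC\123XYZ\456')
'ABCXYZ'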
Daniel,
Thank you SO MUCH for your help! With the examples you gave, I was able to tweak my SQL statement so it would work. Besides putting the r before the select statement string, I had to change the regular expression to this:
'[\\\\][\\\d][\\\d][\\\d]'
Thank you much!
Aaron

pandas read_csv with multiple separators does not work

I need to be able to parse two different kinds of CSV with read_csv: the first has ;-separated values and the second has ,-separated values. I need to handle both at the same time.
That is, the CSV can have this format:
some;csv;values;here
or this:
some,csv,values,here
or even mixed:
some;csv,values;here
I tried many things like the following regex but nothing worked:
data = pd.read_csv(csv_file, sep=r'[,;]', engine='python')
Am I doing something wrong with the regex?
Instead of reading from a file, I ran your code sample reading from a string:
import io
import pandas as pd

txt = '''C1;C2,C3;C4
some;csv,values;here
some1;csv1,values1;here1'''
data = pd.read_csv(io.StringIO(txt), sep='[,;]', engine='python')
and got a proper result:
      C1    C2       C3     C4
0   some   csv   values   here
1  some1  csv1  values1  here1
Note that the sep parameter can even be an ordinary (not raw) string, because it does not contain any backslashes. So your idea to specify multiple separators as a regex pattern is OK.
The reason your code failed is probably an "inconsistent" division of lines into fields. Maybe you should ensure that each line contains the same number of commas and semicolons (at least not too many). Look thoroughly at your stack trace: it should include some information about which line of the source file caused the problem. Then look at the indicated line and correct it.
Edit
To see what happens in a "failure case", I changed the source string to:
txt = '''C1;C2,C3;C4
some;csv,values;here
some1;csv1,values1;here1
some2;csv2,values2;here2,xxxx'''
i.e. I added one line with 5 fields (one too many).
Then execution of the above code results in an error message:
ParserError: Expected 4 fields in line 4, saw 5. ...
Note the words in line 4, which precisely indicate the offending input line (line numbers start from 1).
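If the malformed lines cannot be fixed at the source, one possible workaround is to skip them (this assumes pandas 1.3 or newer, where read_csv accepts the on_bad_lines parameter):
import io
import pandas as pd

txt = '''C1;C2,C3;C4
some;csv,values;here
some2;csv2,values2;here2,xxxx'''
# Rows with the wrong number of fields are dropped instead of raising ParserError.
data = pd.read_csv(io.StringIO(txt), sep='[,;]', engine='python',
                   on_bad_lines='skip')
print(data)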

Hive -character '<EOF>' not supported here

Select * from mytable where field=
'ce7bd3d4-dbdd-407e-a3c3-ce093a65abc9;cdb597073;7cf6cda5fc'
Getting Below Error while running above query in Hive
FAILED: ParseException line 1:92 character '<EOF>' not supported here
<EOF> here means End Of File. When you get an "unexpected End Of File" error it means the parser reached the end of the query unexpectedly. This typically happens when the parser is expecting to find a closing character, such as when you have started a string with ' or " but have not closed the string (with the closing ' or ").
When you come across these types of errors it is good to check that your query can be parsed correctly. In addition, the error gives you the location where the parser failed: line 1:92 in this case. You can usually look at this location (character 92 of the query) and work backwards to find the problem character.
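As a crude illustration of that kind of check, you can at least verify the quotes in the query text are balanced before submitting it (a generic Python sketch, not Hive-specific; the unterminated literal here is deliberate):
query = "Select * from mytable where field='ce7bd3d4-dbdd-407e-a3c3-ce093a65abc9;cdb597073;7cf6cda5fc"
if query.count("'") % 2 != 0:
    # An odd number of quotes means some string literal was never closed.
    print("Unbalanced single quotes - string literal not closed?")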
Try adding the database name to the "from" statement as below.
Select * from my_db_name.mytable where field = 'ce7bd3d4-dbdd-407e-a3c3-ce093a65abc9;cdb597073;7cf6cda5fc';
Hive uses the default database when no database was previously specified.

Line contains invalid enclosed character data or delimiter at position

I was trying to load data from a csv file into Oracle SQL Developer. When inserting the data, I encountered an error which says:
Line contains invalid enclosed character data or delimiter at position
I am not sure how to tackle this problem!
For Example:
INSERT INTO PROJECT_LIST (Project_Number, Name, Manager, Projects_M,
  Project_Type, In_progress, at_deck, Start_Date, release_date, For_work, nbr,
  List, Expenses)
VALUES ('5770', '"Program Cardinal (Agile)', '', '', '', '', '',
  to_date('', 'YYYY-MM-DD'), '', '', '', '', '');
The errors shown were:
--Insert failed for row 4
--Line contains invalid enclosed character data or delimiter at position 79.
--Row 4
I've had success converting the csv file to Excel via "save as", changing the format to .xlsx, and then loading the .xlsx version in SQL Developer. I think the conversion forces out some of the bad formatting. It worked at least on my last two files.
I fixed it by using the concatenate function in my CSV file first and then uploading it, which worked.
My guess is that it doesn't like to_date('', 'YYYY-MM-DD'). It's missing a date to format. Is that an actual input of your data?
But it could also possibly be the double quote in "Program Cardinal (Agile), though I don't see why that would get picked up as an invalid character.
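If you want to locate the offending rows before importing, a rough pre-scan along these lines can help (the file name is hypothetical; it flags fields that still contain a double quote after parsing, which often points to malformed quoting in the source line):
import csv

with open('project_list.csv', newline='') as f:
    for lineno, row in enumerate(csv.reader(f), start=1):
        for field in row:
            if '"' in field:
                # A quote surviving csv parsing often indicates unbalanced
                # or stray quoting on this line.
                print(lineno, repr(field))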

My regular expression with cyrillic symbols doesn't work

Good day, everyone. I have some SQL code (the SQL itself is irrelevant here) in which I'd like to find a number + "," + some string in Russian (my test string is "в"). Here's an example of a string in which I hope to find this:
insert into lemmas (id, word, lemma) values ("37","возбраняется","возбраняться");
Here's my code in python:
file_SQL = open('sql_code.txt', 'r', encoding = 'UTF-8')
SQLtext = file_SQL.read()
regux = '([0-9]+)?","' + wordform.lower() #wordform is "в"
find_it = re.search(regux, SQLtext)
found_it = find_it.group(1)
file_SQL.close()
return found_it
In the end, I want to get the particular number. The error I get with this code:
Traceback (most recent call last):
File "C:\Users\Неро\my_study\homework_4_2016\holy_guacamole_SQL.py", line 109, in <module>
main()
File "C:\Users\Неро\my_study\homework_4_2016\holy_guacamole_SQL.py", line 106, in main
imma_write_myself_a_SQL_file(val4, val3)
File "C:\Users\Неро\my_study\homework_4_2016\holy_guacamole_SQL.py", line 85, in imma_write_myself_a_SQL_file
f_id = find_f_id(wrdform)
File "C:\Users\Неро\my_study\homework_4_2016\holy_guacamole_SQL.py", line 95, in find_f_id
found_it = find_it.group(1)
AttributeError: 'NoneType' object has no attribute 'group'
Obviously, this means that re.search() found nothing.
I've also tried to just search with this regular expression in notepad++, but it didn't work:
A picture of me trying to find this number before a word starting with "в".
(Sorry for the Russian Notepad++, hope nobody minds it.) As you can see in the picture, a word starting with "в" exists in the file.
Also I've tried several other regular expressions such as ([0-9]+)?\",\", ([0-9]{1,3})",".
And I've tried to search with re.findall(), but I basically got an empty list.
Not sure this will help, but it's at least good to share.
You can try encoding your string as Unicode escapes. For instance, в is \x{0432}.
You can see the full match of возбраняется using [\x{0400}-\x{0450}]+ here: https://regex101.com/r/GRQBLK/1.
Here is a tool to convert to unicode: https://www.branah.com/unicode-converter. Then wrap it with \x{...}.
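Note that \x{0432} is PCRE-style syntax, as used by regex101; Python's re module expects \uXXXX escapes or just the literal characters. For what it's worth, the original pattern does find the number in the sample line when run directly, which suggests the problem lies in the file contents or encoding rather than the regex itself (a minimal sketch):
import re

line = 'insert into lemmas (id, word, lemma) values ("37","возбраняется","возбраняться");'
wordform = 'в'

m = re.search('([0-9]+)?","' + re.escape(wordform.lower()), line)
print(m.group(1) if m else None)             # -> 37
# Matching runs of Cyrillic letters uses \u escapes, not \x{...}:
print(re.findall('[\u0400-\u0450]+', line))  # -> ['возбраняется', 'возбраняться']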