How to create UUID in OpenRefine based on the MD5 hash of the values - Jython

I am trying to create a UUID based on the MD5 hash of a cell value in OpenRefine (using Jython), but I am having trouble passing the value to the function.
I am able to create UUID using the expression:
import uuid;
return str(uuid.uuid4());
but I want to use the md5 hash of the cell's value, so I tried to follow the formula
uuid.uuid3(namespace, name)
However, I am unable to pass the value to the function. The attempt:
import uuid;
return str(uuid.uuid3(uuid.NAMESPACE_DNS, value));
receives the following error:
Error: Traceback (most recent call last):
  File "", line 3, in temp_448166737
  File "/Applications/OpenRefine 3.2b.app/Contents/Resources/webapp/extensions/jython/module/MOD-INF/lib/jython-standalone-2.7.1.jar/Lib/uuid.py", line 528, in uuid3
UnicodeDecodeError: 'ascii' codec can't decode byte 0xa7 in position 1: ordinal not in range(128)
Without using the cell's value, the expression works quite well. The example
import uuid;
return str(uuid.uuid3(uuid.NAMESPACE_DNS, 'example'));
uses the string "example" and computes the UUID c5e5f349-28ef-3f5a-98d6-0b32ee4d1743 for every cell, which is not the desired result.
Any ideas how to pass to Jython the value of the cell present in OpenRefine within an expression?

You just have to encode your unicode strings in value with .encode('utf-8'), as explained here:
import uuid
return str(uuid.uuid3(uuid.NAMESPACE_DNS, value.encode('utf-8')))
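Note that uuid3 is the MD5-based variant (uuid5 is the SHA-1 equivalent), so identical cell values always map to the same UUID. A minimal sketch, for Python 2/Jython, wrapping this in a helper (cell_uuid is a hypothetical name, not an OpenRefine function):
import uuid

def cell_uuid(value, namespace=uuid.NAMESPACE_DNS):
    # uuid3 derives the UUID from the MD5 hash of namespace + name,
    # so the same cell value always yields the same UUID
    return str(uuid.uuid3(namespace, value.encode('utf-8')))

# Matches the value computed for 'example' in the question
assert cell_uuid(u'example') == 'c5e5f349-28ef-3f5a-98d6-0b32ee4d1743'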

Related

Is it possible to read a csv with `\r\n` line terminators in pandas?

I'm using pandas==1.1.5 to read a CSV file. I'm running the following code:
import pandas as pd
import csv

csv_kwargs = dict(
    delimiter="\t",
    lineterminator="\r\n",
    quoting=csv.QUOTE_MINIMAL,
    escapechar="!",
)
pd.read_csv("...", **csv_kwargs)
It raises the following error: ValueError: Only length-1 line terminators supported.
Pandas documentation confirms that line terminators should be length-1 (I suppose single character).
Is there any way to read this CSV with Pandas or should I read it some other way?
Note that the docs suggest length-1 for the C parser; maybe I can plug in some other parser?
EDIT: Not specifying the line terminator raises a parse error in the middle of the file, specifically ParserError: Error tokenizing data. (it expects the correct number of fields but gets too many).
EDIT2: I'm confident the kwargs above were used to create the csv file I'm trying to read.
The problem might be in the escapechar, since ! is a common text character.
Python's csv module defines a very strict use of escapechar:
A one-character string used by the writer to escape the delimiter if quoting is set to QUOTE_NONE and the quotechar if doublequote is False.
but it's possible that pandas interprets it differently:
One-character string used to escape other characters.
It's possible that you have a row that contains something like:
...\t"some important text!"\t...
which would escape the quote character and continue parsing text into that column.
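A possible workaround (a sketch, untested against your exact file; data.tsv is a placeholder): the standard-library csv reader recognizes \r\n natively when the file is opened with newline='', so the length-1 restriction does not apply, and the parsed rows can then be handed to pandas:
import csv
import pandas as pd

# Parse with the stdlib reader, which handles \r\n line endings itself,
# then build the DataFrame from the resulting rows.
with open("data.tsv", newline="", encoding="utf-8") as f:
    reader = csv.reader(f, delimiter="\t",
                        quoting=csv.QUOTE_MINIMAL, escapechar="!")
    rows = list(reader)

df = pd.DataFrame(rows[1:], columns=rows[0])  # assumes a header row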

Create a column of empty lists for appending strings

I need to initialize a column in a dataframe with an empty list. I followed this closely related question. That method uses np.empty. The difference is that my lists will use the append method to add strings.
df1['Qual']=np.empty((len(df1),0)).tolist()
I get the error:
File "/usr/local/lib/python3.6/dist-packages/pandas/core/series.py", line 2582, in append
to_concat, ignore_index=ignore_index, verify_integrity=verify_integrity
File "/usr/local/lib/python3.6/dist-packages/pandas/core/reshape/concat.py", line 281, in concat
sort=sort,
File "/usr/local/lib/python3.6/dist-packages/pandas/core/reshape/concat.py", line 357, in __init__
raise TypeError(msg)
TypeError: cannot concatenate object of type '<class 'str'>'; only Series and DataFrame objs are valid
as soon as I try to do my first append with a string (this is inside a double loop):
df1['Qual'].loc[(df1['Ship '+shipid1[index1]]=='Yes') & (df1['Pref '+str(i1+1)]==preflevel1)].append([shipcodes1[index1]])
shipcodes1 is a list of strings.
How do I create such a column?
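For what it's worth, the .loc[...] expression returns a pandas Series, so the .append being called is Series.append (concatenation of Series/DataFrames), not list.append, which is why passing a string triggers that TypeError. A minimal sketch of the pattern usually intended here, reusing the names from the question (mask is a hypothetical intermediate):
# One independent empty list per row (avoids rows sharing a single list)
df1['Qual'] = [[] for _ in range(len(df1))]

# Append a string only to the lists of rows matching the condition
mask = ((df1['Ship ' + shipid1[index1]] == 'Yes')
        & (df1['Pref ' + str(i1 + 1)] == preflevel1))
for idx in df1.index[mask]:
    df1.at[idx, 'Qual'].append(shipcodes1[index1])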

'UnicodeEncodeError' when using 'set_dataframe'

When using set_dataframe to update my Google Sheets via pygsheets and pandas, I get the error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 13: ordinal not in range(128)
This is due to non-ASCII characters (accented letters) in some of the text, e.g.: "señor"
This happens on executing:
wks.set_dataframe(df, start='A10')
Pandas' to_csv accepts an encoding parameter, e.g. encoding="utf-8"; may I suggest set_dataframe do the same?
wks.set_dataframe(df, start='A10', encoding="utf-8")
I see there's a ticket opened 10 days ago here, but is there a workaround?
Solution:
I ran into the same issue, and I think that, more than a bug in the pygsheets module, it is a limitation, as you clearly point out.
What I did to solve this issue was:
def encodeDataFrame(df, encoding='UTF-8'):
    if df is not None and not df.empty:
        for column, series in df.items():
            if type(series.at[0]) == unicode:
                try:
                    encodedSeries = series.str.encode(encoding)
                    df[column] = encodedSeries
                except Exception as e:
                    print 'Could not encode column %s' % column
                    raise
And you can call the function this way:
encodeDataFrame(df)
wks.set_dataframe(df, start='A10')
This might no longer be a good solution because of a change that was made in pygsheets to avoid this issue. See EDIT section below
Explanation:
You solve the issue by encoding the unicode values yourself, before sending them to the set_dataframe function.
This problem comes up whenever you try to use the Worksheet.set_dataframe function with a dataframe that contains unicode characters that cannot be encoded in ascii (like accented letters, and many others).
The exception is thrown because the set_dataframe function attempts to cast the unicode values into str values using the default encoding. For Python 2 the default encoding is ascii, so the cast fails as soon as a character outside the ascii range is found.
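A two-line illustration of that failure mode in Python 2 (u'\xf1' is the ñ from the error message above):
str(u'se\xf1or')
# UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1'
# in position 2: ordinal not in range(128)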
Some people have suggested reloading the sys module to circumvent this problem, but it is explained here why you should not do that.
The other solution I can think of is to use the pygsheets module in Python 3, where this should no longer be a problem, because the default encoding for Python 3 is UTF-8 (see docs).
Bonus:
Ask yourself:
1) Is Unicode an encoding?
2) What is an encoding?
If you hesitated over either of those questions, you should read this article, which gave me the knowledge needed to think of this solution. The time invested was completely worth it.
For more information, you can try this article which links to the previous one at the end.
EDIT:
A change was made 2 days ago (07/26/19) to pygsheets that is supposed to fix this. It looks like the intention is to avoid encoding into the str type, but I would think this change might instead try decoding str values into the unicode type from the default ascii encoding, which could also lead to trouble. When/if this change is released, it will probably be better not to encode anything and to pass values as unicode to the set_dataframe function.
EDIT 2:
This change has now been released in version 2.0.2. If my predictions were true, using the encodeDataFrame function I suggested will result in a UnicodeDecodeError, because the Worksheet.set_dataframe function will attempt to decode the str values with the default ascii encoding. So the best way to use the function is not to have any str values in your dataframe; if you have them, decode them to unicode before calling the set_dataframe function. You could have a mirror version of the function I suggested, which would look something like this:
def decodeDataFrame(df, encoding='UTF-8'):
    if df is not None and not df.empty:
        for column, series in df.items():
            if type(series.at[0]) == str:
                try:
                    decodedSeries = series.str.decode(encoding)
                    df[column] = decodedSeries
                except Exception as e:
                    print 'Could not decode column %s' % column
                    raise
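And, mirroring the earlier usage, you would decode right before pushing the dataframe:
decodeDataFrame(df)
wks.set_dataframe(df, start='A10')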

Inserting into Postgres Multidimensional Text Array from NodeJS Knex

I am attempting to insert a single row into a Postgres table from a NodeJS application using the Knex-Seed-File module for Knex.
Upon each attempt, I receive an error for only one column/field, which is a multidimensional text array (photo_urls text[][] NULL). The error states there is a malformed array literal.
Having gone through the official Postgres documentation, I've tried using double quotes:
(8.14.2. Array Value Input)
"To write an array value as a literal constant, enclose the element values within curly braces and separate them by commas...You can put double quotes around any element value, and must do so if it contains commas or curly braces."
I've also tried using ARRAY constructor syntax.
Here are my various attempts at constructing the input, along with what was reported as the actual SQL generated and the returned error:
Attempt 1:
array[['ext_profile:','ext_random:','int_random:'],['https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Profile','https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+1,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+2','https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+1,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=In+Random+2,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+3']]
Result 1:
'array[[\'ext_profile:\',\'ext_random:\',\'int_random:\'],[\'https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Profile\',\'https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+1,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+2\',\'https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+1,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=In+Random+2,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+3\']]'
Error 1:
- malformed array literal: "array[['ext_profile:','ext_random:','int_random:'],['https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Profile','https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+1,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+2','https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+1,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=In+Random+2,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+3']]
Attempt 2:
$${"ext_profile:", "ext_random:", "int_random:"},{"https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Profile", "https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+1, https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+2", "https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+1, https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=In+Random+2, https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+3"}$$
Result 2:
'"$${""ext_profile:"", ""ext_random:"", ""int_random:""},{""https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Profile"", ""https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+1, https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+2"", ""https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+1, https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=In+Random+2, https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+3""}$$"'
Error 2:
- malformed array literal: ""$${""ext_profile:"", ""ext_random:"", ""int_random:""},{""https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Profile"", ""https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+1, https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+2"", ""https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+1, https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=In+Random+2, https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+3""}$$"
Attempt 3:
($${"ext_profile:", "ext_random:", "int_random:"},{"https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Profile", "https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+1, https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+2", "https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+1, https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=In+Random+2, https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+3"}$$)
Result 3:
'"($${""ext_profile:"", ""ext_random:"", ""int_random:""},{""https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Profile"", ""https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+1, https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+2"", ""https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+1, https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=In+Random+2, https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+3""}$$)"'
Error 3:
- malformed array literal: ""($${""ext_profile:"", ""ext_random:"", ""int_random:""},{""https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Profile"", ""https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+1, https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+2"", ""https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+1, https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=In+Random+2, https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+3""}$$)"
Attempt 4:
array[['ext_profile:','ext_random:','int_random:'],["https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Profile","https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+1,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+2","https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+1,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=In+Random+2,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+3"]]
Result 4:
'"array[[\'ext_profile:\',\'ext_random:\',\'int_random:\'],[""https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Profile"",""https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+1,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+2"",""https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+1,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=In+Random+2,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+3""]]"'
Error 4:
- malformed array literal: ""array[['ext_profile:','ext_random:','int_random:'],[""https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Profile"",""https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+1,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+2"",""https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+1,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=In+Random+2,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+3""]]"
Attempt 5 (Post knex-seed-file upgrade):
[["ext_profile:","ext_random:","int_random:"],["https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Profile","https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+1,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+2","https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+1,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=In+Random+2,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+3"]]
Result 5:
'"[[""ext_profile:"",""ext_random:"",""int_random:""],[""https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Profile"",""https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+1,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+2"",""https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+1,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=In+Random+2,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+3""]]"'
Error 5:
- malformed array literal: ""[[""ext_profile:"",""ext_random:"",""int_random:""],[""https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Profile"",""https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+1,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+2"",""https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+1,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=In+Random+2,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+3""]]"
There appear to be many bugs/issues reported as related to knex postgres integration:
#658, #828, #869, #1602,... which seem to have been closed and/or merged into #1661.
From what I can tell, it appears the issue was closed as resolved.
Can anyone help identify what I'm doing wrong or what I can do to resolve the issue?
The module has now been upgraded (0.3.1) and should handle arrays properly. To enter an array value after updating the package, use the following pattern:
[["ext_profile:","ext_random:","int_random:"],["https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Profile","https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+1,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Ext+Random+2","https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+1,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=In+Random+2,https://dummyimage.com/300x250/0cb3f5/fcfcfc.png&text=Int+Random+3"]]
Please open an issue at https://github.com/tohalla/knex-seed-file if you encounter further problems.
#touko correctly identified that the issue is a result of the default behavior for csv files.
When saving a csv file, double quotes are added to anything with embedded commas or double-quoted characters.
It is explained in these posts and articles:
superuser post 1
superuser post 2
csvreader.com
https://www.rfc-editor.org/rfc/rfc4180
Regarding the knex-seed-file module, there is an issue open on GitHub. Currently, I'm using the workaround of opening the csv file in a text editor and manually removing the undesired double quotes (a scripted sketch of the same replacements follows the example below).
Example (note: I am using a pipe-delimited csv file):
find "[[ and replace with [[
find ]]" and replace with ]]
find "" and replace with "

psycopg2: export csv to database, dealing with e+ expression

I have a csv file containing numbers like "1.456e+07", and I am using the function copy_expert to copy the file into the database, but I am getting the error:
psycopg2.DataError: invalid input syntax for integer: "1.5637e+07"
I notice that I can insert "100" as an integer, but when I try "1.5637e+07" with quotes, it doesn't work.
I am using pandas DataFrame's to_csv to generate the csv files. I am not sure how to get rid of the quotes only for numbers like "1.5637e+07" (I have string columns too), or whether there is another solution.
I found the solution.
Normally, pandas doesn't put quotes around numbers. However, I had set the float_format parameter, which causes this. I reset
quoting=csv.QUOTE_MINIMAL
in the function call and the quotes went away.
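A minimal sketch of the fixed round trip (the DSN, table name, and columns are placeholders; it assumes a COPY ... FROM STDIN statement matching your table):
import csv
import io
import pandas as pd
import psycopg2

df = pd.DataFrame({'n': [15637000.0], 's': ['text']})

buf = io.StringIO()
# With QUOTE_MINIMAL, formatted floats stay unquoted;
# '%.0f' also avoids the e+ scientific notation entirely
df.to_csv(buf, index=False, float_format='%.0f',
          quoting=csv.QUOTE_MINIMAL)
buf.seek(0)

conn = psycopg2.connect('dbname=mydb')  # placeholder DSN
cur = conn.cursor()
cur.copy_expert('COPY mytable (n, s) FROM STDIN WITH CSV HEADER', buf)
conn.commit()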