OpenRefine: remove duplicates from list with Jython

I have a column with values that are duplicated e.g.
VMS5796,VMS5650,VMS5650,CSL,VMA5216,CSL,VMA5113
I'm applying a transform using Jython that removes the duplicates ("On error" is set to keep original); here's the code:
return list(set(value.split(",")))
Which works in the preview, but isn't getting applied to the column. What am I doing wrong?

The map function is very powerful and underused in Python / Jython. It may not be obvious what this code does internally, but it is extremely fast at processing millions of values from a list or array in your column's cells: each value is 'mapped' to a string type, and the results are then joined with a separator character such as a comma ', '.
deduped_list = list(set(value.split(",")))
return ', '.join(map(str, deduped_list))
There are probably other, slightly faster variations than this, but this should get you going in the right direction.
Interestingly, you can also get the 'printable representation' with repr(object), which is acceptable to an evaluator like OpenRefine's and can be useful for seeing the representation of your values as well; I just found out about it while researching this answer in more depth for you.
deduped_list = list(set(value.split(",")))
return ', '.join(map(repr, deduped_list))

The preview implicitly formats things for display. Your expression returns a list (which can't be stored in a cell), so if you'd like it in string form, join it with ',' on the way out.
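Putting that together, the full Jython transform might look like this sketch (assuming the cell holds a comma-separated string as in the question; note that set() does not preserve the original order of the values):
# OpenRefine Jython transform (sketch): dedupe the comma-separated cell value
deduped = list(set(value.split(",")))
# join back into a single string so the result can be stored in the cell
return ",".join(deduped)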

Related

Array contains string from column in Athena

I'm trying to check whether at least one element of the my_array array contains a string from the my_column column in Athena. I want the end result to be a boolean.
I've tried :
contains(my_array, my_column)
It seems to be only partially working, and I can't understand why: sometimes I get false even though the element of the array and the string match, but I never get true if there's no match.
So I've also tried (as seen here: Presto array contains an element that likes some pattern)
cardinality(filter(my_array, x->x like my_column))>0
But same issue, ending with the exact same result as above.
I thought I had an issue with some values from my_column, but after checking all my problematic values it appears the strings from my_column are clean.
For instance I am able to find a match for the first value of my_array, but no match for the second, although both values are in my_column.
Could someone please help?
Thank you so much!

Go application making SQL Query using GROUP_CONCAT on FLOATS returns []uint8 instead of actual []float64

I have a problem using GROUP_CONCAT in a query made by my Go application.
Any idea why a GROUP_CONCAT of FLOATs would look like a []uint8 on the Go side?
I can't seem to properly convert them either.
It's definitely floats; I can see them in the raw query results. But when I do the same query in Go and try to .Scan the result, Go complains that it's a []uint8, not a []float64 (which it actually is). Attempts to convert to floats give me the wrong values (and way too many of them).
For example, at the database, I query and get 2 floats for the column in question, looks like this:
"5650.50, 5455.00"
On the Go side, however, Go sees a []uint8 instead of a []float64. Why does this happen? How does one work around this to get the actual results?
My problem is that I have to use this SQL with GROUP_CONCAT; due to the nature of the database I am working with, this is the best way to get the information. More importantly, the query itself works great and returns the data the function needs, but now I can't read it out because of type issues. I'm no stranger to those, but Go isn't cooperating with me today.
I'd be more than pleased to learn WHY go is doing it this way, and delighted to learn of a way to deal with it.
Example:
SELECT ID, getDistance(33.1543,-110.4353, Loc.Lat, Loc.Lng) as distance,
GROUP_CONCAT(values) FROM stuff INNER JOIN device on device.ID = stuff.ID WHERE (someConditionsETC) GROUP BY ID ORDER BY ID
The actual result, when interfacing with the actual database (not within my application), is
"5650.00, 5850.50"
It's clearly 2 floats.
The same query produces a slice of uint8 when run from Go and .Scan-ning the result in. If I range through and print those values, I get way more than 2, and they are uint8 values (bytes) that look like this:
53,55,56,48,46,48,48
Not sure how Go expects me to handle this.
The solution... stupid simple and not terribly obvious:
crazyBytes := []uint8("5760.00,5750.50") // the raw bytes the driver returns
aString := string(crazyBytes)            // interpret the bytes as a string
strSlice := strings.Split(aString, ",")  // string representation of our array (of floats)
var floatz []float64
for _, x := range strSlice {
    fmt.Printf("At last, Float: %s \r\n", x)
    f, err := strconv.ParseFloat(x, 64) // parse each piece into a float64
    if err != nil {
        fmt.Printf("Error: %s", err)
    }
    floatz = append(floatz, f)
    fmt.Printf("as float: %s \r\n", strconv.FormatFloat(f, 'f', -1, 64))
}
Yea sure, it's obvious NOW.
GROUP_CONCAT returns a string. So in Go you get a byte array of characters, not a float. The result you posted, 53,55,56,48,46,48,48, translates into the string "5780.00", which does look like one of your values. So you need to either fix your SQL to return floats or use the strings and strconv packages in Go to parse and convert your string into floats. I think the former approach is better, but it is up to you.
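As a quick sanity check of that byte-to-character mapping (shown in Python here purely for illustration, since those uint8 values are just ASCII codes):
# decoding the bytes from the question as ASCII gives back the text form
print(bytes([53, 55, 56, 48, 46, 48, 48]).decode('ascii'))  # prints 5780.00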

How to append to a list in Automation Anywhere 10.5?

The list starts empty. Then I want to append a value to it for each iteration in a loop if a certain condition is met. I don't see an append option in Variable Operation.
You can use string split for this, assuming you know of a delimiter that won't ever be in your list of values. I've used a semi-colon, and $local_joinedList$ starts off empty.
If (certain condition is met)
Variable Operation: $local_joinedList$;$local_newValue$ To $local_joinedList$
End If
String Operation: Split "$local_joinedList$" with delimiter ";" and assign output to $my-list-variable$
This overwrites $my-list-variable$.
If you need to append to an existing list, you can do it the same way by using String Join first, appending your values to the string, and then splitting it again afterward.
String Operation: Join elements of "$my-list-variable$" by delimiter ";" and assign output to $local_joinedList$
Lists are buggy in Automation Anywhere and have been for several versions. I suggest not using them and using XML instead.
It is a much more versatile approach and allows you to do much more than with lists. You can search, filter, insert, delete, etc.
For the example you mention, you would use the "Insert Node" command.
Throwing in my 2 cents as well: my-list-variable appears to be the only list whose size you can change. From my experience with 10.7, though, it only grows.
So if you made a list with 60 values and you want to reuse my-list-variable for 55, you'll need to clear out the remaining 5 values and add an if condition when looping over the list to ensure the values are not whatever you set those 5 values to be.
I used lime's answer as a reference (thanks lime!) to populate a list variable from some data in an Excel spreadsheet.
Here's my automation for it:

pyspark.sql DataFrame: understanding functions

I am taking a MOOC.
It has one assignment where a column needs to be converted to lower case. sentence=lower(column) does the trick, but initially I thought the syntax should be sentence=column.lower(). I looked at the documentation and I couldn't figure out the problem with my syntax. Would it be possible to explain how I could have figured out that my syntax was wrong by searching the online documentation and function definitions?
I am especially confused because this link shows that string.lower() does the trick for regular Python string objects.
from pyspark.sql.functions import regexp_replace, trim, col, lower

def removePunctuation(column):
    """Removes punctuation, changes to lower case, and strips leading and trailing spaces.

    Note:
        Only spaces, letters, and numbers should be retained. Other characters should be
        eliminated (e.g. it's becomes its). Leading and trailing spaces should be removed
        after punctuation is removed.

    Args:
        column (Column): A Column containing a sentence.

    Returns:
        Column: A Column named 'sentence' with clean-up operations applied.
    """
    sentence = lower(column)
    return sentence

sentenceDF = sqlContext.createDataFrame([('Hi, you!',),
                                         (' No under_score!',),
                                         (' * Remove punctuation then spaces * ',)], ['sentence'])
sentenceDF.show(truncate=False)
(sentenceDF
 .select(removePunctuation(col('sentence')))
 .show(truncate=False))
You are correct. When you are working with a string, if you want to convert it to lowercase, you should use str.lower().
And if you check the String page in the Python Documentation, you will see it has a lower method that should work as you expect:
a_string = "StringToConvert"
a_string.lower() # "stringtoconvert"
However, in the Spark example you provided, in your function removePunctuation you are NOT working with a single string; you are working with a Column. And a Column is a different object than a string, which is why you should use a method that works with a Column.
Specifically, you are working with this pyspark.sql method. The next time you are in doubt about which method you need to use, double-check the datatype of your objects. Also, if you check the list of imports, you will see it is calling the lower method from pyspark.sql.functions.
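As a quick illustration of that datatype check (a minimal sketch; the exact error text can vary between PySpark versions):
from pyspark.sql.functions import col, lower

c = col('sentence')
print(type(c))              # <class 'pyspark.sql.column.Column'> -- not a str
print('Hi, you!'.lower())   # 'hi, you!' -- plain Python strings do have .lower()

# A Column has no lower() method, so c.lower() does not do what you expect
# (attribute access on a Column is typically treated as a field reference,
# and calling the result raises a TypeError). Use the imported function instead:
lowered = lower(c)          # builds a new Column expression for Spark to evaluate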
This is how I managed to do it:
lowered = lower(column)
np_lowered = regexp_replace(lowered, '[^\w\s]', '')
trimmed_np_lowered = trim(np_lowered)
return trimmed_np_lowered
return trim(lower(regexp_replace(column, "\p{Punct}", ""))).alias('sentence')

Regex match SQL values string with multiple rows and same number of columns

I tried to match the sql values string (0),(5),(12),... or (0,11),(122,33),(4,51),... or (0,121,12),(31,4,5),(26,227,38),... and so on with the regular expression
\(\s*\d+\s*(\s*,\s*\d+\s*)*\)(\s*,\s*\(\s*\d+\s*(\s*,\s*\d+\s*)*\))*
and it works. But...
How can I ensure that the regex does not match a values string like (0,12),(1,2,3),(56,7) with different number of columns?
Thanks in advance...
As I mentioned in a comment on the question, the best way to check whether the input string is valid (i.e. contains the same count of numbers between brackets in every group) is to use a client-side program, not plain SQL.
Implementation:
List<string> s = new List<string>(){
    "(0),(5),(12)", "(0,11),(122,33),(4,51)",
    "(0,121,12),(31,4,5),(26,227,38)", "(0,12),(1,2,3),(56,7)"};
var qry = s.Select(a => new
    {
        orig = a,
        newst = a.Split(new string[]{"),(", "(", ")"},
                        StringSplitOptions.RemoveEmptyEntries)
    })
    .Select(a => new
    {
        orig = a.orig,
        isValid = (a.newst
            .Sum(b => b.Split(new char[]{','},
                StringSplitOptions.RemoveEmptyEntries).Count()) %
            a.newst.Count()) == 0
    });
Result:
orig isValid
(0),(5),(12) True
(0,11),(122,33),(4,51) True
(0,121,12),(31,4,5),(26,227,38) True
(0,12),(1,2,3),(56,7) False
Note: the second Select statement takes the total number of comma-separated values across all bracket groups modulo the number of groups returned by Split. If the result isn't equal to zero, it means the input string is invalid.
I strongly believe there's a simpler way to achieve that, but at this moment I don't know how ;)
:(
Unless you add some more constraints, I don't think you can solve this problem with regular expressions alone.
Regular expressions can't solve every string problem, just as they can't be used to check whether a string of opening and closing brackets (like "((())()(()(())))") is balanced. That's a more complicated issue.
That's what I learnt in class :P If someone knows a way then that'd be sweet!
I'm sorry, I spent a bit of time looking into how we could turn this string into an array and do more work on it with SQL, but the built-in functionality is lacking and the solution would end up being very hacky.
I'd recommend trying to handle this situation differently, as large-scale string computation isn't the best way to go if your database is going to gradually fill up.
A combination of client-side and server-side validation can be used to help prevent bad data (like the values strings with mismatched column counts) from getting into the database.
If you need to keep those numbers, you could rework your schema to include some metadata which you can use in your queries, like how many numbers there are and whether it all matches nicely. This information can be computed inexpensively from your server and provided to the database, as in the sketch below.
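For illustration only, here is a minimal Python sketch of that kind of pre-insert metadata check, computed client-side before the values string reaches the database (the helper name values_metadata and the dictionary layout are assumptions, not part of the question):
import re

def values_metadata(values_string):
    # pull out each "(...)" group and count the comma-separated numbers in it
    groups = re.findall(r'\(([^)]*)\)', values_string)
    counts = [len(g.split(',')) for g in groups]
    return {
        'rows': len(groups),
        'columns': counts[0] if counts else 0,
        'is_valid': len(set(counts)) <= 1,  # every group has the same column count
    }

print(values_metadata('(0,11),(122,33),(4,51)'))  # {'rows': 3, 'columns': 2, 'is_valid': True}
print(values_metadata('(0,12),(1,2,3),(56,7)'))   # {'rows': 3, 'columns': 2, 'is_valid': False}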
Good luck!