Array contains string from column in Athena - sql

I'm trying to check whether at least one element of the array my_array contains a string from the column my_column in Athena. I want the end result to be a boolean.
I've tried:
contains(my_array, my_column)
It seems to work only partially, and I can't understand why: sometimes I get false even though an element of the array matches the string, but I never get true when there is no match.
So I've also tried (as seen here: Presto array contains an element that likes some pattern)
cardinality(filter(my_array, x->x like my_column))>0
But I get the same issue, with exactly the same result as above.
I thought I had an issue with some values from my_column, but after checking all my problematic values it appears the strings from my_column are clean.
For instance I am able to find a match for the first value of my_array, but no match for the second, although both values are in my_column.
Could someone please help ?
Thank you so much !!

Related

Numpy - how do I erase elements of an array if it is found in an other array

TLDR: I have 2 arrays indices = numpy.arange(9) and another that contains some of the numbers in indices (maybe none at all, maybe it'll contain [2,4,7]). The output I'd like for this example is [0,1,3,5,6,8]. What method can be used to achieve this?
Edit: I found a method which works somewhat: casting both arrays to a set then taking the difference of the two does give the correct result, but as a set, even if I pass this result to a numpy.array(). I'll update this if I find a solution for that.
Edit2: Casting the result of the subtraction to a list, then passing that to numpy.array(), resolved my issue.
I guess I posted this question a little prematurely, given that I found the solution for it myself, but maybe this'll be useful to somebody in future!
You can make use of boolean masking:
indices[~numpy.isin(indices,[2,4,7])]
Explanation:
We use numpy.isin() to find which values of indices are present in the other array, then apply ~ to invert the result, and finally pass this boolean mask to indices to select the remaining elements.
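A minimal sketch of the masking approach above, using the example values from the question. The names to_remove, kept and kept_alt are only illustrative; numpy.setdiff1d is shown as an alternative to the set-difference workaround mentioned in the edits, and it returns sorted unique values.

```python
import numpy as np

indices = np.arange(9)            # [0 1 2 3 4 5 6 7 8]
to_remove = np.array([2, 4, 7])   # illustrative name, values from the question

# Boolean mask: True where an element of indices also appears in to_remove.
mask = np.isin(indices, to_remove)

# Invert the mask with ~ to keep only the elements that are NOT in to_remove.
kept = indices[~mask]

# Alternative: set difference that stays a numpy array (sorted, unique values).
kept_alt = np.setdiff1d(indices, to_remove)

print(kept)
print(kept_alt)
```

Both variants return a numpy array directly, so no round trip through set and list is needed.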

What is the meaning of having a variable="+" ? SAS (sql)

I'm new to SAS and I'm trying to understand a code:
if MAP_ID="+" then output WORK.0201_template;
else
do;
SHEET_ID=MAP_ID;
output WORK.0201_template_f;
end;
What does MAP_ID="+" mean? Does it search the table for the values where MAP_ID=+, or does it have another meaning?
Thanks
The MAP_ID="+" is a boolean expression that compares the value of the variable MAP_ID to the character string literal "+". It will be true when they are the same and false otherwise.
I suspect that the main purpose of this code is to split the data into two different output datasets based on the value of MAP_ID.
It also changes the value of SHEET_ID. That type of code looks like something designed to carry forward the value of MAP_ID in a retained field SHEET_ID. If I am right, then the meaning of the value + is "keep the same sheet_id". But we would need to see more of the code and the data to really tell.
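To make the split concrete, here is a rough Python sketch of what the IF/ELSE block does to each row. It is not SAS; rows are modelled as dictionaries, the function name split_rows and the sample data are hypothetical, and the retained carry-forward behaviour is not modelled because it depends on code that is not shown.

```python
def split_rows(rows):
    """Route each row to one of two outputs based on MAP_ID, like the SAS block."""
    template = []      # stands in for WORK.0201_template
    template_f = []    # stands in for WORK.0201_template_f
    for row in rows:
        if row["MAP_ID"] == "+":
            # MAP_ID equals the literal "+": output the row unchanged.
            template.append(dict(row))
        else:
            # Otherwise copy MAP_ID into SHEET_ID before outputting the row.
            out = dict(row)
            out["SHEET_ID"] = out["MAP_ID"]
            template_f.append(out)
    return template, template_f


# Hypothetical sample data, only to show the routing.
sample = [
    {"MAP_ID": "+", "SHEET_ID": "A"},
    {"MAP_ID": "B", "SHEET_ID": "A"},
]
plus_rows, other_rows = split_rows(sample)
print(plus_rows)
print(other_rows)
```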

Regex match SQL values string with multiple rows and same number of columns

I tried to match the sql values string (0),(5),(12),... or (0,11),(122,33),(4,51),... or (0,121,12),(31,4,5),(26,227,38),... and so on with the regular expression
\(\s*\d+\s*(\s*,\s*\d+\s*)*\)(\s*,\s*\(\s*\d+\s*(\s*,\s*\d+\s*)*\))*
and it works. But...
How can I ensure that the regex does not match a values string like (0,12),(1,2,3),(56,7) with different number of columns?
Thanks in advance...
As I mentioned in a comment on the question, the best way to check that the input string is valid (that is, that it contains the same count of numbers between each pair of brackets) is to use a client-side program rather than pure SQL.
Implementation:
List<string> s = new List<string>(){
    "(0),(5),(12)", "(0,11),(122,33),(4,51)",
    "(0,121,12),(31,4,5),(26,227,38)","(0,12),(1,2,3),(56,7)"};
var qry = s.Select(a=>new
    {
        orig = a,
        newst = a.Split(new string[]{"),(", "(", ")"},
            StringSplitOptions.RemoveEmptyEntries)
    })
    .Select(a=>new
    {
        orig = a.orig,
        isValid = (a.newst
            .Sum(b=>b.Split(new char[]{','},
                StringSplitOptions.RemoveEmptyEntries).Count()) %
            a.newst.Count()) ==0
    });
Result:
orig                              isValid
(0),(5),(12)                      True
(0,11),(122,33),(4,51)            True
(0,121,12),(31,4,5),(26,227,38)   True
(0,12),(1,2,3),(56,7)             False
Note: The second Select takes the total count of numbers across all groups modulo the number of groups in the string array returned by Split. If the result isn't zero, the input string is invalid. This is only a necessary condition: a string such as (0,12),(1,2,3),(5) has 6 numbers in 3 groups and would still pass.
I strongly believe there's a simpler way to achieve this, but at the moment I don't know how ;)
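A stricter variant of the same client-side idea can be sketched in Python: instead of a modulo over the total, count the numbers in each bracketed group and require every group to have the same count. The function name same_column_count and the two patterns are illustrative choices, not taken from the question; the shape pattern is a slightly simplified form of the regex the asker already uses.

```python
import re

# Overall shape: one or more "(n, n, ...)" groups separated by commas.
SHAPE = re.compile(
    r"\(\s*\d+(?:\s*,\s*\d+)*\s*\)"
    r"(?:\s*,\s*\(\s*\d+(?:\s*,\s*\d+)*\s*\))*"
)
# Captures the text between each pair of brackets.
GROUP = re.compile(r"\(([^()]*)\)")


def same_column_count(values: str) -> bool:
    """True if the string is well formed and every group has the same column count."""
    if not SHAPE.fullmatch(values):
        return False
    # Number of columns in each group, collected into a set of distinct counts.
    counts = {len(group.split(",")) for group in GROUP.findall(values)}
    return len(counts) == 1


for text in ["(0),(5),(12)", "(0,11),(122,33),(4,51)",
             "(0,12),(1,2,3),(56,7)", "(0,12),(1,2,3),(5)"]:
    print(text, same_column_count(text))
```

The regex only validates the syntax; the equal-count rule is checked in ordinary code, which keeps both parts easy to read.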
:(
Unless you add some more constraints, I don't think you can solve this problem only with regular expressions.
Regular expressions can't solve every string problem, just as they can't be used to check whether the opening and closing of brackets (like "((())()(()(())))") is valid. That's a more complicated issue.
That's what I learnt in class :P If someone knows a way then that'd be sweet!
I'm sorry, I spent a bit of time looking into how we could turn this string into an array and do more work on it with SQL, but built-in functionality is lacking and the solution would end up being very hacky.
I'd recommend trying to handle this situation differently as large scale string computation isn't the best way to go if your database is to gradually fill up.
A combination of client and serverside validation can be used to help prevent bad data (like the ones with more numbers) from getting into the database.
If you need to keep those numbers then you could rework your schema to include some metadata which you can use in your queries, like how many numbers there are and whether it all matches nicely. This information can be computed inexpensively from your server and provided to the database.
Good luck!

Convert an alphanumeric string to integer format

I need to store an alphanumeric string in an integer column on one of my models.
I have tried:
#result.each do |i|
hex_id = []
i["id"].split(//).each{|c| hex_id.push(c.hex)}
hex_id = hex_id.join
...
Model.create(:origin_id => hex_id)
...
end
When I run this in the console using puts hex_id in place of the create line, it returns the correct values, however the above code results in the origin_id being set to "2147483647" for every instance. An example string input is "t6gnk3pp86gg4sboh5oin5vr40" so that doesn't make any sense to me.
Can anyone tell me what is going wrong here or suggest a better way to store a string like the aforementioned example as a unique integer?
Thanks.
Answering by request from the OP
The hex_id.join operation does concatenate the per-character values into one string of digits, which is what puts shows in the console. The value 2147483647 is the maximum positive value of a signed 32-bit integer, so what seems to happen is that the number is too large for the column and ends up stored as that maximum. I was unable to find any documented effect of Array#join that would produce this value by itself.
In other words, the desired result 060003008600401100500050040 is too large to be recorded as an integer. A better approach would be to keep it as a string, or to use a different algorithm for producing a number from the original string. Perhaps aggregating the hex values with an arithmetic operation will do better than join?
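As a rough check of the size argument, the per-character conversion can be reproduced in Python. This is a sketch under one assumption: the helper ruby_like_hex (a hypothetical name) mimics the behaviour where a character that is not a hexadecimal digit converts to 0, which is what the digits in the result above suggest.

```python
import string

INT32_MAX = 2**31 - 1  # 2147483647, the value seen in origin_id


def ruby_like_hex(ch: str) -> int:
    """Assumed per-character conversion: hex digits keep their value, others become 0."""
    return int(ch, 16) if ch in string.hexdigits else 0


source = "t6gnk3pp86gg4sboh5oin5vr40"  # example input from the question

# Concatenate the decimal form of each converted character, like join does.
joined = "".join(str(ruby_like_hex(c)) for c in source)

print(joined)
print(int(joined) > INT32_MAX)
```

Because the concatenated number is far beyond the 32-bit range, any scheme that simply joins digits will not fit; the column would need to be a string, or the mapping would need to produce a much smaller number.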

OpenRefine remove duplicates from list with jython

I have a column with values that are duplicated e.g.
VMS5796,VMS5650,VMS5650,CSL,VMA5216,CSL,VMA5113
I'm applying a transform using jython that removes the duplicates (On error is set to keep original), here's the code:
return list(set(value.split(",")))
Which works in the preview, but isn't getting applied to the column. What am I doing wrong?
The map function is a powerful and underused function in Python / Jython. It may be unclear what this code does internally, but it is fast even on very large lists: each value from the list in your cell is mapped to a string with str, and the results are then joined with a separator such as a comma ', '
deduped_list = list(set(value.split(",")))
return ', '.join(map(str, deduped_list))
There are probably other, even slightly faster variations than this, but this should get you going in the right direction.
Interestingly, you can also use repr(object) to get the 'printable representation' of each value, which is acceptable to an evaluator like OpenRefine's and can be useful for inspecting your values. I only found out about this while researching this answer in more depth for you.
deduped_list = list(set(value.split(",")))
return ', '.join(map(repr, deduped_list))
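Preview implicitly formats things for display. Your expression returns an array (which can't be stored in a cell), so if you'd like to get it in string form, join the elements with a separator; in Jython that means calling ','.join(...) on the list rather than tacking .join(',') on the end.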
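One thing the set-based versions do not guarantee is the order of the remaining values. A minimal sketch of an order-preserving variant is shown below as a plain Python function; the name dedupe and the sep parameter are illustrative. Inside an OpenRefine Jython expression, value is supplied by the tool and the body would end with a return of the joined string.

```python
def dedupe(value, sep=","):
    """Remove duplicate items from a delimited string, keeping first occurrences in order."""
    seen = set()
    unique = []
    for item in value.split(sep):
        if item not in seen:
            seen.add(item)
            unique.append(item)
    # Return a string, since a list cannot be stored in a cell.
    return sep.join(unique)


cell = "VMS5796,VMS5650,VMS5650,CSL,VMA5216,CSL,VMA5113"  # example from the question
print(dedupe(cell))
```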