Check subset using either string or array in Impala - sql

I have a table like this
col
-----
A,B
The col could be string with comma or array. I have flexibility on the storage.
How to check of col is a subset of either another string or array variable? For example:
B,A --> TRUE (order doesn't matter)
A,D,B --> TRUE (other item in between)
A,D,C --> FALSE (missing B)
I have flexibility on the type. The variable is something I cannot store in a table.
Please let me know if you have any suggestion for Impala only (no Hive).
Thanks

A not pretty method, but perhaps a starting point...
Assuming a table with a unique identifier column id and an array<string> column col, and a string variable with ',' as a separator (and no occurrences of escaped '\,')...
SELECT
yourTable.id
FROM
yourTable,
yourTable.col
GROUP BY
yourTable.id
HAVING
COUNT(DISTINCT CASE WHEN find_in_set(col.item, ${VAR:yourString}) > 0 THEN col.item END)
=
LENGTH(regexp_replace(${VAR:yourString},'[^,]',''))+1
Basically...
Expand the arrays in your table, to one row per array item.
Check if each item exists in your string.
Aggregate back up to count how many of the items were found in the string.
Check that the number of items found is the same as the number of items in the string
The COUNT(DISTINCT <CASE>) copes with arrays like {'a', 'a', 'b', 'b'}.
Without expanding the string to an array or table (which I don't know how to do) you're dependent on the items in the string being unique. (Because I'm just counting commas in the string to find out how many items there are...)

Related

SparkSQL: How to query a column with datatype: List of Maps

I have a dataframe with column of array (or list) with each element being a map of String, complex data type (meaning --String, nested map, list etc; in a way you may assume column data type is similar to List[Map[String,AnyRef]])
now i want to query on this table like..
select * from the tableX where column.<any of the array element>['someArbitaryKey'] in ('a','b','c')
I am not sure how to represent <any of the array element> in the spark SQL. Need help.
The idea is to transform the list of maps into a list of booleans, where each boolean indicates if the respective map contains the wanted key (k2 in the code below). After that all we have to check if the boolean array contains at least one true element.
select * from tableX where array_contains(transform(col1, map->map_contains_key(map,'k2')), true)
I have assumed that the name of the column holding the list of maps is col1.
The second parameter of the transform function could be replaced by any expression that returns a boolean value. In this example map_contains_key is used, but any check resulting in a boolean value would work.
A bit unrelated: I believe that the data type of the map cannot be Map[String,AnyRef] as there is no encoder for AnyRef available.

Get max on comma separated values in column

How to get max on comma separated values in Original_Ids column and get max value in one column and remaining ids in different column.
|Original_Ids | Max_Id| Remaining_Ids |
|123,534,243,345| 534 | 123,234,345 |
Upadte -
If I already have Max_id and just need below equation?
Remaining_Ids = Original_Ids - Max_id
Thanks
Thanks to the excellent possibilities of array manipulation in Postgres, this could be done relatively easy by converting the string to an array and from there to a set.
Then regular queries on that set are possible. With max() the maximum can be selected and with EXCEPT ALL the maximum can be removed from the set.
A set can then be converted to an array and with array_to_string() and the array can be converted to a delimited string again.
SELECT ids original_ids,
(SELECT max(un.id::integer)
FROM unnest(string_to_array(ids,
',')) un(id)) max_id,
array_to_string(ARRAY((SELECT un.id::integer
FROM unnest(string_to_array(ids,
',')) un(id)
EXCEPT ALL
SELECT max(un.id::integer)
FROM unnest(string_to_array(ids,
',')) un(id))),
',') remaining_ids
FROM elbat;
Another option would have been regexp_split_to_table() which directly produces a set (or regexp_split_to_array() but than we'd had the possible regular expression overhead and still had to convert the array to a set).
But nevertheless you just should (almost) never use delimited lists (nor arrays). Use a table, that's (almost) always the best option.
SQL Fiddle
You can use a window function (https://www.postgresql.org/docs/current/static/tutorial-window.html) to get the max element per unnested array. After that you can reaggregate the elements and remove the calculated max value from the array.
Result:
a max_elem remaining
123,534,243,345 534 123,243,345
3,23,1 23 3,17
42 42
56,123,234,345,345 345 56,123,234
This query needs only one split/unnest as well as only one max calculation.
SELECT
a,
max_elem,
array_remove(array_agg(elements), max_elem) as remaining -- C
FROM (
SELECT
*,
MAX(elements) OVER (PARTITION BY a) as max_elem -- B
FROM (
SELECT
a,
unnest((string_to_array(a, ','))::int[]) as elements -- A
FROM arrays
)s
)s
GROUP BY a, max_elem
A: string_to_array converts the string list into an array. Because the arrays are treated as string arrays you need the cast them into integer arrays by adding ::int[]. The unnest() expands all array elements into own rows.
B: window function MAX gives the maximum value of the single arrays as max_elem
C: array_agg reaggregates the elements through the GROUP BY id. After that array_remove removes the max_elem value from the array.
If you do not like to store them as pure arrays but as string list again you could add array_to_string. But I wouldn't recommend this because your data are integer arrays and not strings. For every further calculation you would need this string cast. A even better way (as already stated by #stickybit) is not to store the elements as arrays but as unnested data. As you can see in nearly every operation should would do the unnest before.
Note:
It would be better to use an ID to adress the columns/arrays instead of the origin string as in SQL Fiddle with IDs
If you install the extension intarray this is quite easy.
First you need to create the extension (you have to be superuser to do that):
create extension intarray;
Then you can do the following:
select original_ids,
original_ids[1] as max_id,
sort(original_ids - original_ids[1]) as remaining_ids
from (
select sort_desc(string_to_array(original_ids,',')::int[]) as original_ids
from bad_design
) t
But you shouldn't be storing comma separated values to begin with

How to use regexp_matches() in an UPDATE statement?

I am trying to clean up a table that has a very messy varchar column, with entries of the sorts:
<u><font color="#0000FF">VA Lidar</font></u> OR <u><font color="#0000FF">InPort Metadata</font></u>
I would like to update the column by keeping only the html links, and separating them with a coma if there are more than one. Ideally I would do something like this:
UPDATE mytable
SET column = array_to_string(regexp_matches(column,'(?<=href=").+?(?=\")','g') , ',');
But unfortunately this returns an error in Postgres 10:
ERROR: set-returning functions are not allowed in UPDATE
I assume regexp_matches() is the said set-returning function. Any ideas on how I can achieve this?
Notes
1.
You don't need to base the correlated subquery on a separate instance of the base table (like other answers suggested). That would be doing more work for nothing.
2.
For simple cases an ARRAY constructor is cheaper than array_agg(). See:
Why is array_agg() slower than the non-aggregate ARRAY() constructor?
3.
I use a regular expression without lookahead and lookbehind constraints and parentheses instead: href="([^"]+)
See query 1.
This works because parenthesized subexpressions are captured by regexp_matches() (and several other Postgres regexp functions). So we can replace the more sophisticated constraints with plain parentheses. The manual on regexp_match():
If a match is found, and the pattern contains no parenthesized
subexpressions, then the result is a single-element text array
containing the substring matching the whole pattern. If a match is
found, and the *pattern* contains parenthesized subexpressions, then the
result is a text array whose n'th element is the substring matching
the n'th parenthesized subexpression of the pattern
And for regexp_matches():
This function returns no rows if there is no match, one row if there
is a match and the g flag is not given, or N rows if there are N
matches and the g flag is given. Each returned row is a text array
containing the whole matched substring or the substrings matching
parenthesized subexpressions of the pattern, just as described above
for regexp_match.
4.
regexp_matches() returns a set of arrays (setof text[]) for a reason: not only can a regular expression match several times in a single string (hence the set), it can also produce multiple strings for each single match with multiple capturing parentheses (hence the array). Does not occur with this regexp, every array in the result holds a single element. But future readers shall not be lead into a trap:
When feeding the resulting 1-D arrays to array_agg() (or an ARRAY constructor) that produces a 2-D array - which is only even possible since Postgres 9.5 added a variant of array_agg() accepting array input. See:
Is there something like a zip() function in PostgreSQL that combines two arrays?
However, quoting the manual:
inputs must all have same dimensionality, and cannot be empty or NULL
I think this can never fail as the same regexp always produces the same number of array elements. Ours always produces one element. But that may be different with other regexp. If so, there are various options:
Only take the first element with (regexp_matches(...))[1]. See query 2.
Unnest arrays and use string_agg() on base elements. See query 3.
Each approach works here, too.
Query 1
UPDATE tbl t
SET col = (
SELECT array_to_string(ARRAY(SELECT regexp_matches(col, 'href="([^"]+)', 'g')), ',')
);
Columns with no match are set to '' (empty string).
Query 2
UPDATE tbl
SET col = (
SELECT string_agg(t.arr[1], ',')
FROM regexp_matches(col, 'href="([^"]+)', 'g') t(arr)
);
Columns with no match are set to NULL.
Query 3
UPDATE tbl
SET col = (
SELECT string_agg(elem, ',')
FROM regexp_matches(col, 'href="([^"]+)', 'g') t(arr)
, unnest(t.arr) elem
);
Columns with no match are set to NULL.
db<>fiddle here (with extended test case)
You could use a correlated subquery to deal with the offending set-returning function (which is regexp_matches). Something like this:
update mytable
set column = (
select array_to_string(array_agg(x), ',')
from (
select regexp_matches(t2.c, '(?<=href=").+?(?=\")', 'g')
from t t2
where t2.id = t.id
) dt(x)
)
You're still stuck with the "CSV in a column" nastiness but that's a separate issue and presumably not a problem for you.
Building on the approach of mu is too short with slightly different regex and a COALESCE function to retain values that do not contain href-links:
UPDATE a
SET bad_data = COALESCE(
(SELECT Array_to_string(Array_agg(x), ',')
FROM (SELECT Regexp_matches(a.bad_data,
'(?<=href=")[^"]+', 'g'
) AS x
FROM a a2
WHERE a2.id = a.id) AS sub), bad_data
);
SQL Fiddle

from string to map object in Hive

My input is a string that can contain any characters from A to Z (no duplicates, so maximum 26 characters it may have).
For example:-
set Input='ATK';
The characters within the string can appear in any order.
Now I want to create a map object out of this which will have fixed keys from A to Z. The value for a key is 1 if its corresponding character appears in the input string. So in case of this example (ATK) the map object should look like:-
So what is the best way to do this?
So the code should look like:-
set Input='ATK';
select <some logic>;
It should return a map object (Map<string,int>) with 26 key value pairs within it. What is the best way to do it, without creating any user defined functions in Hive. I know there is a function str_to_map that easily comes to mind.But it only works if key value pairs exist in source string and also it will only consider the key value pairs specified in the input.
Maybe not efficient but works:
select str_to_map(concat_ws('&',collect_list(concat_ws(":",a.dict,case when
b.character is null then '0' else '1' end))),'&',':')
from
(
select explode(split("A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z",',')) as dict
) a
left join
(
select explode(split(${hiveconf:Input},'')) as character
) b
on a.dict = b.character
The result:
{"A":"1","B":"0","C":"0","D":"0","E":"0","F":"0","G":"0","H":"0","I":"0","J":"0","K":"1","L":"0","M":"0","N":"0","O":"0","P":"0","Q":"0","R":"0","S":"0","T":"1","U":"0","V":"0","W":"0","X":"0","Y":"0","Z":"0"}

Problem with MySQL Select query with "IN" condition

I found a weird problem with MySQL select statement having "IN" in where clause:
I am trying this query:
SELECT ads.*
FROM advertisement_urls ads
WHERE ad_pool_id = 5
AND status = 1
AND ads.id = 23
AND 3 NOT IN (hide_from_publishers)
ORDER BY rank desc
In above SQL hide_from_publishers is a column of advertisement_urls table, with values as comma separated integers, e.g. 4,2 or 2,7,3 etc.
As a result, if hide_from_publishers contains same above two values, it should return only record for "4,2" but it returns both records
Now, if I change the value of hide_for_columns for second set to 3,2,7 and run the query again, it will return single record which is correct output.
Instead of hide_from_publishers if I use direct values there, i.e. (2,7,3) it does recognize and returns single record.
Any thoughts about this strange problem or am I doing something wrong?
There is a difference between the tuple (1, 2, 3) and the string "1, 2, 3". The former is three values, the latter is a single string value that just happens to look like three values to human eyes. As far as the DBMS is concerned, it's still a single value.
If you want more than one value associated with a record, you shouldn't be storing it as a comma-separated value within a single field, you should store it in another table and join it. That way the data remains structured and you can use it as part of a query.
You need to treat the comma-delimited hide_from_publishers column as a string. You can use the LOCATE function to determine if your value exists in the string.
Note that I've added leading and trailing commas to both strings so that a search for "3" doesn't accidentally match "13".
select ads.*
from advertisement_urls ads
where ad_pool_id = 5
and status = 1
and ads.id = 23
and locate(',3,', ','+hide_from_publishers+',') = 0
order by rank desc
You need to split the string of values into separate values. See this SO question...
Can Mysql Split a column?
As well as the supplied example...
http://blog.fedecarg.com/2009/02/22/mysql-split-string-function/
Here is another SO question:
MySQL query finding values in a comma separated string
And the suggested solution:
http://dev.mysql.com/doc/refman/5.0/en/string-functions.html#function_find-in-set