SparkSQL: How to query a column with datatype: List of Maps - dataframe

I have a dataframe with a column of type array (or list), where each element is a map from String to a complex data type (String, nested map, list, etc.; in effect you may assume the column's data type is similar to List[Map[String,AnyRef]]).
Now I want to query this table like:
select * from tableX where column.<any of the array elements>['someArbitraryKey'] in ('a','b','c')
I am not sure how to represent <any of the array elements> in Spark SQL. Need help.

The idea is to transform the list of maps into a list of booleans, where each boolean indicates whether the respective map contains the wanted key (k2 in the code below). After that, all we have to do is check whether the boolean array contains at least one true element.
select * from tableX where array_contains(transform(col1, map->map_contains_key(map,'k2')), true)
I have assumed that the name of the column holding the list of maps is col1.
The second parameter of the transform function could be replaced by any expression that returns a boolean value. In this example map_contains_key is used, but any check resulting in a boolean value would work.
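A hedged sketch of an alternative (assuming Spark 2.4+ for higher-order functions, and keeping the hypothetical tableX/col1 names): the exists function applies a boolean lambda to each map, and element_at returns NULL for a missing key, so rows without the key simply don't match:
select *
from tableX
where exists(col1, m -> element_at(m, 'someArbitraryKey') in ('a','b','c'))
This expresses the original "any of the array elements" condition directly, including the in ('a','b','c') value check.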
A bit unrelated: I believe that the data type of the map cannot be Map[String,AnyRef] as there is no encoder for AnyRef available.

Related

Search JSON column for JSON that contains a specific value

I have a PostgreSQL database with a table called choices. In the choices table I have a column called json that contains JSON entries, for example: [1,2,3]
I need a query that returns all entries that contain a specific value.
For example I have the following entries:
[1,2,3] [6,7,1] [4,5,2]
I want to get all entries that contain the value 1 so it would return:
[1,2,3]
[6,7,1]
Thanks,
demo: db<>fiddle
The json_array_elements_text function expands the json array into one row per element (as text). With that you can filter it by any value you like.
SELECT
json_data
FROM choices, json_array_elements_text(json_data) elem
WHERE value = '1'
Documentation: JSON functions
Please notice that "json" is the name of the json type in PostgreSQL. You should rename your column to avoid conflicts. (I called mine json_data.)
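A minimal end-to-end sketch (a hypothetical setup mirroring the question's data, with the column renamed to json_data):
CREATE TABLE choices (json_data json);
INSERT INTO choices VALUES ('[1,2,3]'), ('[6,7,1]'), ('[4,5,2]');

SELECT json_data
FROM choices, json_array_elements_text(json_data) elem
WHERE value = '1';
-- returns [1,2,3] and [6,7,1]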

Check subset using either string or array in Impala

I have a table like this
col
-----
A,B
The col could be string with comma or array. I have flexibility on the storage.
How can I check if col is a subset of either another string or array variable? For example:
B,A --> TRUE (order doesn't matter)
A,D,B --> TRUE (other item in between)
A,D,C --> FALSE (missing B)
I have flexibility on the type. The variable is something I cannot store in a table.
Please let me know if you have any suggestion for Impala only (no Hive).
Thanks
A not pretty method, but perhaps a starting point...
Assuming a table with a unique identifier column id and an array<string> column col, and a string variable with ',' as a separator (and no occurrences of escaped '\,')...
SELECT
yourTable.id
FROM
yourTable,
yourTable.col
GROUP BY
yourTable.id
HAVING
COUNT(DISTINCT CASE WHEN find_in_set(col.item, ${VAR:yourString}) > 0 THEN col.item END)
=
LENGTH(regexp_replace(${VAR:yourString},'[^,]',''))+1
Basically:
1. Expand the arrays in your table to one row per array item.
2. Check if each item exists in your string.
3. Aggregate back up to count how many of the items were found in the string.
4. Check that the number of items found is the same as the number of items in the string.
The COUNT(DISTINCT <CASE>) copes with arrays like {'a', 'a', 'b', 'b'}.
Without expanding the string to an array or table (which I don't know how to do), you're dependent on the items in the string being unique, because I'm just counting commas in the string to find out how many items there are.
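A quick trace of the logic for a hypothetical row with id = 1, col = ['A','B'] and yourString = 'B,A':
-- after expanding the array: (id=1, item='A') and (id=1, item='B')
-- find_in_set('A','B,A') = 2 and find_in_set('B','B,A') = 1, so both items fall into the CASE branch
-- COUNT(DISTINCT matched items) = 2
-- LENGTH(regexp_replace('B,A','[^,]','')) + 1 = 1 + 1 = 2
-- 2 = 2, so id 1 is returned (TRUE, matching the B,A example above)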

sql IN operator with text array

How can I find string values in a text array using an SQL query?
Suppose I have:
id location
1 {Moscow,New york}
2 {Mumbai}
3 {California,Texas}
I want to find the id whose location is Moscow. I used:
select id from table where location in ('Moscow'); but I get the error:
ERROR: malformed array literal: "Moscow"
LINE 1: select id from table where location in ('Moscow');
DETAIL: Array value must start with "{" or dimension information.
I am using Postgres.
For the array data type, you can use ANY.
select id from table where 'Moscow' = Any(location)
As the documentation describes for the array data type:
8.14.5. Searching in Arrays
To search for a value in an array, each value must be checked. This
can be done manually, if you know the size of the array.
or use ANY:
9.21.3. ANY/SOME (array)
expression operator ANY (array expression)
expression operator SOME (array expression)
The right-hand side is a parenthesized expression, which must yield an array value. The left-hand expression is evaluated and compared to each element of the array using the given operator, which must yield a Boolean result. The result of ANY is "true" if any true result is obtained. The result is "false" if no true result is found (including the case where the array has zero elements).
For searching in an array data type, you can use ANY():
SELECT id FROM table WHERE 'Moscow' = ANY(location);
Live Demo
http://sqlfiddle.com/#!17/a6c3a/2
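A minimal sketch of the setup and query (a hypothetical table mirroring the question's data):
CREATE TABLE demo (id int, location text[]);
INSERT INTO demo VALUES
    (1, '{Moscow,"New york"}'),
    (2, '{Mumbai}'),
    (3, '{California,Texas}');

SELECT id FROM demo WHERE 'Moscow' = ANY(location);
-- returns 1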
select id from demo where location::text like '%Moscow%'
(The array column must be cast to text for LIKE to apply; note this would also hit partial matches such as a hypothetical 'Moscowville', so ANY() is the safer choice.)

from string to map object in Hive

My input is a string that can contain any characters from A to Z (no duplicates, so it may have at most 26 characters).
For example:-
set Input='ATK';
The characters within the string can appear in any order.
Now I want to create a map object out of this which will have fixed keys from A to Z. The value for a key is 1 if its corresponding character appears in the input string. So in the case of this example (ATK) the map object should look like:-
{"A":1,"B":0,...,"K":1,...,"T":1,...,"Z":0}
So what is the best way to do this?
So the code should look like:-
set Input='ATK';
select <some logic>;
It should return a map object (Map<string,int>) with 26 key-value pairs in it. What is the best way to do it without creating any user-defined functions in Hive? I know there is a function str_to_map that easily comes to mind, but it only works if key-value pairs exist in the source string, and it will only consider the key-value pairs present in the input.
Maybe not efficient but works:
select str_to_map(
    concat_ws('&', collect_list(
        concat_ws(":", a.dict, case when b.character is null then '0' else '1' end))),
    '&', ':')
from
(
    select explode(split("A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,W,X,Y,Z", ',')) as dict
) a
left join
(
    select explode(split(${hiveconf:Input}, '')) as character
) b
on a.dict = b.character
The result:
{"A":"1","B":"0","C":"0","D":"0","E":"0","F":"0","G":"0","H":"0","I":"0","J":"0","K":"1","L":"0","M":"0","N":"0","O":"0","P":"0","Q":"0","R":"0","S":"0","T":"1","U":"0","V":"0","W":"0","X":"0","Y":"0","Z":"0"}

Hive unordered pairs function

Is it possible to generate unordered pairs in Hive (similar to Pig Unordered Pairs function?) Does this function exist anywhere?
Ideally I would like to be able to pass in a table such as :
select * from mytable
array_1
["A","B","C"]
and get back
select unorderedPairs(array_1) from mytable
["A",B"]
["B","C"]
["C","A"]
There is no built-in function. In Hive this would be called a User Defined Table Generating Function (UDTF). Here are the built-in UDTFs:
inline(ARRAY<STRUCT>): Explodes an array of structs into a table. (As of Hive 0.10.)
explode(ARRAY<T> a): For each element in a, generates a row containing that element.
json_tuple(jsonStr, k1, k2, ...): Takes a set of names (keys) and a JSON string, and returns a tuple of values. This is a more efficient version of the get_json_object UDF because it can get multiple keys with just one call.
parse_url_tuple(url, p1, p2, ...): This is similar to the parse_url() UDF but can extract multiple parts at once out of a URL. Valid part names are: HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, USERINFO, QUERY:<KEY>.
posexplode(ARRAY): Behaves like explode for arrays, but includes the position of items in the original array by returning a tuple of (pos, value). (As of Hive 0.13.0.)
stack(INT n, v_1, v_2, ..., v_k): Breaks up v_1, ..., v_k into n rows. Each row will have k/n columns. n must be constant.