pyspark hive sql convert array(map(varchar, varchar)) to string by rows - sql

I would like to transform a column of
array(map(varchar, varchar))
to string as rows of a table on presto db by pyspark hive sql programmatically from jupyter notebook python3.
example
user_id sport_ids
'aca' [ {'sport_id': '5818'}, {'sport_id': '6712'}, {'sport_id': '1065'} ]
expected results
user_id. sport_ids
'aca'. '5815'
'aca'. '5712'
'aca'. '1065'
I have tried
sql_q= """
select distinct, user_id, transform(sport_ids, x -> element_at(x, 'sport_id')
from tab """
spark.sql(sql_q)
but got error:
'->' cannot be resolved
I have also tried
sql_q= """
select distinct, user_id, sport_ids
from tab"""
spark.sql(sql_q)
but got error:
org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column request_features[0] is map<string,string>;;
Did I miss something ?
I have tried this, but helpful
hive convert array<map<string, string>> to string
Extract map(varchar, array(varchar)) - Hive SQL
thanks

Lets try use higher order functions to find map values and explode into individual rows
df.withColumn('sport_ids', explode(expr("transform(sport_ids, x->map_values(x)[0])"))).show()
+-------+---------+
|user_id|sport_ids|
+-------+---------+
| aca| 5818|
| aca| 6712|
| aca| 1065|
+-------+---------+

You can process json data (json_parse, cast to array of json and json_extract_scalar - for more json functions - see here) and flatten (unnest) on presto side:
-- sample data
WITH dataset(user_id, sport_ids) AS (
VALUES
('aca', '[ {"sport_id": "5818"}, {"sport_id": "6712"}, {"sport_id": "1065"} ]')
)
-- query
select user_id,
json_extract_scalar(record, '$.sport_id') sport_id
from dataset,
unnest(cast(json_parse(sport_ids) as array(json))) as t(record)
Output:
user_id
sport_id
aca
5818
aca
6712
aca
1065

Related

How do you write a presto query to split a string into its own column

Trying to splint a string into multiple columns in qubole using presto query.
{"field0":[{"startdate":"2022-07-13","lastnightdate":"2022-07-16","adultguests":5,"childguests":0,"pets":null}]}
Would like startdate,lastnightdate,adultguests,childguests and pets into its own column.
I tried to unnest string but that didn't work.
The data looks a lot like json, so you can process it using json functions first (parse, extract, cast to array(map(varchar, json)) or array(map(varchar, varcchar))) and then flatten with unnest:
-- sample data
WITH dataset(json_payload) AS (
VALUES
('{"field0":[{"startdate":"2022-07-13","lastnightdate":"2022-07-16","adultguests":5,"childguests":0,"pets":null}]}')
)
-- query
select m['startdate'] startdate,
m['lastnightdate'] lastnightdate,
m['adultguests'] adultguests,
m['childguests'] childguests,
m['pets'] pets
from dataset,
unnest(cast(json_extract(json_parse(json_payload), '$.field0') as array(map(varchar, json)))) t(m)
Output:
startdate
lastnightdate
adultguests
childguests
pets
2022-07-13
2022-07-16
5
0
null

Extract complex json with random key field

I am trying to extract the following JSON into its own rows like the table below in Presto query. The issue here is the name of the key/av engine name is different for each row, and I am stuck on how I can extract and iterate on the keys without knowing the value of the key.
The json is a value of a table row
{
"Bkav":
{
"detected": false,
"result": null,
},
"Lionic":
{
"detected": true,
"result": Trojan.Generic.3611249',
},
...
AV Engine Name
Detected Virus
Result
Bkav
false
null
Lionic
true
Trojan.Generic.3611249
I have tried to use json_extract following the documentation here https://teradata.github.io/presto/docs/141t/functions/json.html but there is no mention of extraction if we don't know the key :( I am trying to find a solution that works in both presto & hive query, is there a common query that is applicable to both?
You can cast your json to map(varchar, json) and process it with unnest to flatten:
-- sample data
WITH dataset (json_str) AS (
VALUES (
'{"Bkav":{"detected": false,"result": null},"Lionic":{"detected": true,"result": "Trojan.Generic.3611249"}}'
)
)
--query
select k "AV Engine Name", json_extract_scalar(v, '$.detected') "Detected Virus", json_extract_scalar(v, '$.result') "Result"
from (
select cast(json_parse(json_str) as map(varchar, json)) as m
from dataset
)
cross join unnest (map_keys(m), map_values(m)) t(k, v)
Output:
AV Engine Name
Detected Virus
Result
Bkav
false
Lionic
true
Trojan.Generic.3611249
The presto query suggested by #Guru works, but for hive, there is no easy way.
I had to extract the json
Parse it with replace to remove some character and bracket
Then convert it back to a map, and repeat for one more time to get the nested value out
SELECT
av_engine,
str_to_map(regexp_replace(engine_result, '\\}', ''),',', ':') AS output_map
FROM (
SELECT
str_to_map(regexp_replace(regexp_replace(get_json_object(raw_response, '$.scans'), '\"', ''), '\\{',''),'\\},', ':') AS key_val_map
FROM restricted_antispam.abuse_malware_scanning
) AS S
LATERAL VIEW EXPLODE(key_val_map) temp AS av_engine, engine_result

Extract information from a json string in BigQuery

I am storing a table in Bigquery with the results of a classification algorithm. The table schema is INT, STRING and looks something like this :
ID
Output
1001
{'Apple Cider': 0.7, 'Coffee' : 0.2, 'Juice' : 0.1}
1002
{'Black Coffee':0.9, 'Tea':0.1}
The problem is how to fetch the first (or second or any order) element of each string together with its score. It doesn't seem likely that JSON_EXTRACT can work and most likely it can be done with Javascript. Was wondering what an elegant solution would look like here.
Consider below
select ID,
trim(split(kv, ':')[offset(0)], " '") element,
cast(split(kv, ':')[offset(1)] as float64) score,
element_position
from `project.dataset.table` t,
unnest(regexp_extract_all(trim(Output, '{}'), r"'[^':']+'\s?:\s?[^,]+")) kv with offset as element_position
If applied to sample data in your question - output is
Note: you can use less verbose unnest statement if you wish
unnest(split(trim(Output, '{}'))) kv with offset as element_position

Pass date values from dataframe to query in Spark /Scala

I have a dataframe having a date column and has values as below.
df.show()
+----+----------+
|name| dob|
+----+----------+
| Jon|2001-04-15|
| Ben|2002-03-01|
+----+----------+
Now i need to query a table in hive which has "dob" from above dataframe (both 2001-04-15, 2002-03-01).So I need to pass the values under dob column as a parameter to my hive query.
I tried to collect the values to a variable like below which give me array of string.
val dobRead = df.select("updt_d").distinct().as[String].collect()
dobRead: Array[String] = Array(2001-04-15, 2002-03-01)
However when i try to pass to the query i see its not substituting properly and get error.
val tableRead = hive.executeQuery(s"select emp_name,emp_no,martial_status from <<table_name>> where dateOfBirth in ($dobRead)")
org.apache.hadoop.hive.ql.metadata.HiveException: Failed to compile query: org.apache.hadoop.hive.ql.parse.ParseException: line 1:480 cannot recognize input near '(' '[' 'Ljava' in expression specification
Can you please help me how to pass date values to a query in spark.
You can collect the dates as follows (Row.getAs):
val rows: Array[Row] = df.select("updt_d").distinct().collect()
val dates: Array[String] = rows.map(_.getAs[String](0))
And then build the query:
val hql: String = s"select ... where dateOfBirth in (${
dates.map(d => s"'${d}'").mkString(", ")
})"
Option 2
If the number of dates in first DataFrame is too big, you should use join operations instead of collecting them into the driver.
First, load every table as DataFrames (I'll call them dfEmp and dfDates). Then you can join on date fields to filter, either using a standard inner join plus filtering out null fields or using directly a left_semi join:
val dfEmp = hiveContext.table("EmpTable")
val dfEmpFiltered = dfEmp.join(dfDates,
col("dateOfBirth") === col("updt_d"), "left_semi")

Cannot have map type columns in DataFrame which calls set operations

: org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column map_col is map
I have a hive table with a column of type - MAP<Float, Float>. I get the above error when I try to do an insertion on this table in a spark context. Insertion works fine without the 'distinct'.
create table test_insert2(`test_col` string, `map_col` MAP<INT,INT>)
location 's3://mybucket/test_insert2';
insert into test_insert2
select distinct 'a' as test_col, map(0,0) as map_col
Try to convert dataframe to .rdd then apply .distinct function.
Example:
spark.sql("select 'a'test_col,map(0,0)map_col
union all
select 'a'test_col,map(0,0)map_col").rdd.distinct.collect
Result:
Array[org.apache.spark.sql.Row] = Array([a,Map(0 -> 0)])