Cannot have map type columns in DataFrame which calls set operations - hive

: org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column map_col is map
I have a Hive table with a column of type MAP<Float, Float>. I get the above error when I try to insert into this table from Spark. The insert works fine without the 'distinct'.
create table test_insert2(`test_col` string, `map_col` MAP<INT,INT>)
location 's3://mybucket/test_insert2';
insert into test_insert2
select distinct 'a' as test_col, map(0,0) as map_col

Convert the DataFrame to an RDD with .rdd, then call .distinct on the RDD instead of using the SQL distinct.
Example:
spark.sql("""select 'a' test_col, map(0,0) map_col
union all
select 'a' test_col, map(0,0) map_col""").rdd.distinct.collect
Result:
Array[org.apache.spark.sql.Row] = Array([a,Map(0 -> 0)])

Related

SparkSQL regexp_extract function java error

I am trying to extract the IDs starting with srsa from the table structure below:
id reason_text_field
34394 {"initial_customer":"sda_WWyfr4AXY1fIAS", "customer_result":"srsa_CAkAaAvNKL2OSD"}
in order to get the following output:
id srsa_id
34394 srsa_CAkAaAvNKL2OSD
but when I use the following SparkSQL function
REGEXP_EXTRACT(reason_text_field, 'srsa[^"]*') as srsa_id
I get this error:
java.lang.IndexOutOfBoundsException: No group
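REGEXP_EXTRACT returns capture group 1 by default, and your pattern has no capture group. Either add a group to the pattern and select it, or pass 0 to return the whole match: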
SELECT id,
REGEXP_EXTRACT(reason_text_field, '\"(srsa[^"]*)\"', 1) as srsa_id
-- or REGEXP_EXTRACT(reason_text_field, 'srsa[^"]*', 0) as srsa_id
FROM tb
Note, however, that you can also convert the text column reason_text_field into a map or struct using from_json, then extract the field customer_result:
SELECT id,
from_json(reason_text_field, 'map<string,string>')['customer_result'] as srsa_id
FROM tb
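The group-index behaviour is easy to check outside Spark. The following is a minimal sketch using Python's re and json modules as a stand-in, not Spark's REGEXP_EXTRACT or from_json themselves; the helper extract_srsa and the sample string are illustrative assumptions, and the sample is written as valid JSON.

```python
import json
import re

# Sample value modelled on the question, written as valid JSON (assumption).
reason_text = '{"initial_customer":"sda_WWyfr4AXY1fIAS", "customer_result":"srsa_CAkAaAvNKL2OSD"}'


def extract_srsa(text, pattern, group):
    """Return the requested group of the first match, or None if nothing matches."""
    match = re.search(pattern, text)
    return match.group(group) if match else None


# Group 0 is the whole match, so the pattern needs no parentheses.
whole_match = extract_srsa(reason_text, r'srsa[^"]*', 0)

# Group 1 requires a parenthesised capture group in the pattern.
captured = extract_srsa(reason_text, r'"(srsa[^"]*)"', 1)

# Asking for group 1 when the pattern defines no group raises an error,
# which is the same kind of failure as the "No group" exception above.
try:
    extract_srsa(reason_text, r'srsa[^"]*', 1)
    missing_group_error = None
except IndexError as exc:
    missing_group_error = str(exc)

# Parsing the JSON and reading the key avoids the regex entirely.
from_parsed_json = json.loads(reason_text)["customer_result"]
```

All three working variants yield the same ID; the parsed-JSON route is usually the least fragile if the column reliably holds valid JSON.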

extract values inside an array column in amazon athena

I have a table in AWS Athena where the column 'metadata_stopinfo' has the structure that you can see in the image.
I am trying to extract values that are inside that array. However, when I try
SELECT
"json_extract_scalar"(metadata_stopinfo, '$.city')
FROM "table"
I get the following error:
SYNTAX_ERROR: line 2:5: Unexpected parameters (array(row("address" row("addressline" varchar,"city" varchar,"countrycode" varchar,"countrycodeoriginal" varchar,"state" varchar,"zipcode" varchar),"carrierreference" varchar,"contacts" array(row("contacttype" varchar,"email" varchar,"fax" varchar,"mobilephone" varchar,"name" varchar,"officephone" varchar,"userid" varchar)),"containerinfo" array(row("containerid" varchar,"containeridtype" varchar,"equipmentcode" varchar,"equipmenttype" varchar)),"conveyancelinenumber" varchar,"conveyancetype" varchar,"conveyancetypeoriginal" varchar,"dateinfo" row("arrivalestimateddate" varchar,"arrivalestimateddateend" varchar,"arrivalestimatedendoffset" varchar,"arrivalestimatedoffset" varchar,"arrivalrequesteddate" varchar,"deliveryestimateddate" varchar,"deliveryestimateddateend" varchar,"deliveryestimatedendoffset" varchar,"deliveryestimatedoffset" varchar,"deliveryrequesteddate" varchar,"deliveryrequesteddateend" varchar,"deliveryrequestedendoffset" varchar,"deliveryrequestedoffset" varchar,"departureestimateddate" varchar,"departureestimateddateend" varchar,"departureestimatedendoffset" varchar,"departureestimatedoffset" varchar,"departurerequesteddate" varchar,"pickuprequesteddate" varchar,"pickuprequesteddateend" varchar,"pickuprequestedendoffset" varchar,"pickuprequestedoffset" varchar,"pickupestimateddate" varchar,"pickupestimateddateend" varchar,"pickupestimatedendoffset" varchar,"pickupestimatedoffset" varchar),"deliverynotenumber" varchar,"instructions" array(row("customerspecificsubtype" varchar,"header" boolean,"instructionsubtype" varchar,"instructiontype" varchar,"text" varchar)),"locationid" varchar,"partnercarrieraddress" row("addressline" varchar,"city" varchar,"countrycode" varchar,"countrycodeoriginal" varchar,"state" varchar,"zipcode" varchar),"partnercarriercontacts" array(row("contacttype" varchar,"email" varchar,"fax" varchar,"name" varchar,"officephone" varchar)),"partnercarrierid" 
varchar,"partnercarriername" varchar,"partnerid" varchar,"partnername" varchar,"partnertimezone" varchar,"partnertype" varchar,"productquantity" row("number" double,"originalunitofmeasure" varchar,"quantitytype" varchar,"unitofmeasure" varchar),"sequencenumber" bigint,"shipmentidentifier" varchar,"stoptype" varchar,"transportinfo" row("description" varchar,"transportcode" varchar,"transportoriginalcode" varchar),"vesselinfo" row("lloydsnumber" varchar,"shipsradiocallnumber" varchar,"vesselname" varchar,"vesselnumber" varchar,"voyagetripnumber" varchar))), varchar(6)) for function json_extract_scalar. Expected: json_extract_scalar(varchar(x), JsonPath) , json_extract_scalar(json, JsonPath)
My question is, how can I extract values inside the column?
json_extract_scalar works only with JSON or varchar input (and even if your data were in JSON format, json_extract_scalar(metadata_stopinfo, '$.city') still would not work, because the top-level value is an array). Your column contains an array of rows, so you need to work with it accordingly. For example, you can use an index to access an element of the array (in Presto, array indexes start from 1):
SELECT
metadata_stopinfo[1] r
FROM "table"
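And then access the fields:
The fields may be of any SQL type, and are accessed with the field reference operator (.)
According to the type shown in the error message, city is nested inside the address row, so the full path is address.city:
SELECT
metadata_stopinfo[1].address.city city
FROM "table"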
You can also flatten the array with unnest (city is nested inside the address row of each element):
SELECT r.address.city
FROM "table",
unnest(metadata_stopinfo) as t(r)

Databricks notebook - return all values from a table

For test purposes, I have an empty DB into which I populate a tiny amount of data, extracted and transformed from a json file.
I would like to create a notebook using scala, which gets all values from all columns from a given table, and exit the notebook returning this result as a string.
I've tried variations of the following:
val result = spark.sql("select * from table.DB").as[String];
dbutils.notebook.exit(result)
However the first command fails with error:
AnalysisException: Try to map struct<Version:bigint,metadataInformation:struct<metadataID:string... etc ...> to Tuple1, but failed as the number of fields does not line up.;
However, something like the following works, to retrieve value of a specific field, from a column:
val result = spark.sql("select column.jsonfield from table.DB").as[String].first();
dbutils.notebook.exit(result)
How can I return the content of all columns?
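Collect the rows to the driver, flatten each row into its values, and join everything into a single string:
val result = spark.sql("SELECT x FROM y").collect().toList.flatMap(x => x.toSeq).mkString(",")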
dbutils.notebook.exit(result)
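The flattening step can be sketched in Python as well, for notebooks that use PySpark instead of Scala. The helper name rows_to_string is hypothetical, and plain tuples stand in for collected rows; since PySpark Row objects are tuples, passing df.collect() should work the same way, although nested structs would be rendered with their default string form.

```python
def rows_to_string(rows, sep=","):
    """Flatten an iterable of row-like sequences into one delimited string."""
    return sep.join(str(value) for row in rows for value in row)


# Plain tuples stand in for collected rows here (assumption for illustration).
sample_rows = [(1, "a", None), (2, "b", 3.5)]
flattened = rows_to_string(sample_rows)
```

Note that every row is pulled to the driver, so this only suits small tables such as the test data described in the question.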

Pass date values from dataframe to query in Spark /Scala

I have a dataframe with a date column whose values are shown below.
df.show()
+----+----------+
|name| dob|
+----+----------+
| Jon|2001-04-15|
| Ben|2002-03-01|
+----+----------+
Now I need to query a Hive table for the rows whose date of birth matches the "dob" values from the above dataframe (both 2001-04-15 and 2002-03-01). So I need to pass the values in the dob column as a parameter to my Hive query.
I tried collecting the values into a variable as shown below, which gives me an array of strings.
val dobRead = df.select("updt_d").distinct().as[String].collect()
dobRead: Array[String] = Array(2001-04-15, 2002-03-01)
However, when I pass it to the query, it is not substituted properly and I get an error.
val tableRead = hive.executeQuery(s"select emp_name,emp_no,martial_status from <<table_name>> where dateOfBirth in ($dobRead)")
org.apache.hadoop.hive.ql.metadata.HiveException: Failed to compile query: org.apache.hadoop.hive.ql.parse.ParseException: line 1:480 cannot recognize input near '(' '[' 'Ljava' in expression specification
Can you please help me pass date values to a query in Spark?
You can collect the dates as follows (Row.getAs):
import org.apache.spark.sql.Row
val rows: Array[Row] = df.select("updt_d").distinct().collect()
val dates: Array[String] = rows.map(_.getAs[String](0))
And then build the query:
val hql: String = s"select ... where dateOfBirth in (${
dates.map(d => s"'${d}'").mkString(", ")
})"
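The query-building step above boils down to quoting each date and joining the results with commas; interpolating the array itself only inserts its default string form, which is where the '[Ljava' in the error comes from. The following is a minimal Python sketch of the same idea, with an added validation step that is not part of the original answer. The helper build_date_in_list and the table name emp_table are hypothetical.

```python
from datetime import date


def build_date_in_list(values):
    """Render ISO dates as a quoted, comma-separated list for a SQL IN (...)."""
    if not values:
        raise ValueError("need at least one date")
    # fromisoformat raises ValueError for anything that is not a valid ISO
    # date, and isoformat() renders the result back as YYYY-MM-DD, so no
    # unexpected characters end up in the query text.
    checked = [date.fromisoformat(v).isoformat() for v in values]
    return ", ".join(f"'{d}'" for d in checked)


dates = ["2001-04-15", "2002-03-01"]
# emp_table is a placeholder table name (assumption).
query = (
    "select emp_name, emp_no from emp_table "
    f"where dateOfBirth in ({build_date_in_list(dates)})"
)
```

Validating before interpolating matters because the values are pasted into the query text instead of being bound as parameters.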
Option 2
If the number of dates in the first DataFrame is too large, you should use a join instead of collecting them into the driver.
First, load every table as DataFrames (I'll call them dfEmp and dfDates). Then you can join on date fields to filter, either using a standard inner join plus filtering out null fields or using directly a left_semi join:
val dfEmp = hiveContext.table("EmpTable")
val dfEmpFiltered = dfEmp.join(dfDates,
col("dateOfBirth") === col("updt_d"), "left_semi")

Writing dataframe via sql query (pyodbc): pyodbc.Error: ('HY004', '[HY004])

I'd like to write a dataframe into two pre-defined columns in an SQL table. The schema in SQL is:
abc(varchar(255))
def(varchar(255))
With a dataframe like so:
df = pd.DataFrame(
[
[False, False],
[True, True],
],
columns=["ABC", "DEF"],
)
And the sql query is like so:
with conn.cursor() as cursor:
string = "INSERT INTO {0}.{1}(abc, def) VALUES (?,?)".format(db, table)
cursor.execute(string, (df["ABC"]), (df["DEF"]))
cursor.commit()
So that the query (string) looks like so:
'INSERT INTO my_table(abc, def) VALUES (?,?)'
This creates the following error message:
pyodbc.Error: ('HY004', '[HY004] [Cloudera][ODBC] (11320) SQL type not supported. (11320) (SQLBindParameter)')
So I try using a direct query (not via Python) in the Impala editor, on the following:
'INSERT INTO my_table(abc, def) VALUES ('Hey','Hi');'
And produces this error message:
AnalysisException: Possible loss of precision for target table 'my_table'. Expression ''hey'' (type: STRING) would need to be cast to VARCHAR(255) for column 'abc'
How come I cannot even insert into my table simple strings, like "Hi"? Is my schema set up correctly or perhaps something else?
The STRING type in Impala has a size limit of 2GB.
A VARCHAR's length is whatever you define it to be, but not more than 64KB.
Thus there is a potential for data loss if you implicitly convert one into the other.
By default, literals are treated as type STRING. So, in order to insert a literal into a VARCHAR field, you need to CAST it explicitly.
INSERT INTO my_table(abc, def) VALUES (CAST('Hey' AS VARCHAR(255)),CAST('Hi' AS VARCHAR(255)));
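On the Python side, the same cast can be written into the parameterized statement. The following is a minimal sketch, assuming the driver accepts parameter markers inside CAST, which has not been verified against the Cloudera ODBC driver. The helper build_varchar_insert is hypothetical; it only builds the statement and the parameter rows and does not touch the database.

```python
def build_varchar_insert(table, columns, rows, length=255):
    """Return (sql, params) for an INSERT that casts every parameter to VARCHAR."""
    placeholders = ", ".join(f"CAST(? AS VARCHAR({length}))" for _ in columns)
    sql = f"INSERT INTO {table}({', '.join(columns)}) VALUES ({placeholders})"
    # One tuple of plain strings per row; booleans become 'True' / 'False'.
    params = [tuple(str(value) for value in row) for row in rows]
    return sql, params


# Row values mirror the dataframe in the question.
sql, params = build_varchar_insert(
    "my_table", ["abc", "def"], [[False, False], [True, True]]
)
# With an open pyodbc cursor, this could then be run as:
#   cursor.executemany(sql, params)
```

With the dataframe from the question, the rows could come from df[["ABC", "DEF"]].values.tolist(), so that each row is passed as one tuple of scalars instead of a whole column per parameter.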