I have a DataFrame orders:
+-----------------+-----------+--------------+
| Id| Order | Gender|
+-----------------+-----------+--------------+
| 1622|[101330001]| Male|
| 1622| [147678]| Male|
| 3837| [1710544]| Male|
+-----------------+-----------+--------------+
which I want to groupBy on Id and Gender and then aggregate orders.
I am using the org.apache.spark.sql.functions package and the code looks like this:
DataFrame group = orders.withColumn("orders", col("Order"))
    .groupBy(col("Id"), col("Gender"))
    .agg(collect_list("orders"));
However, since the column Order is of type array, I get this exception, because collect_list expects a primitive type:
User class threw exception: org.apache.spark.sql.AnalysisException: No handler for Hive udf class org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList because: Only primitive type arguments are accepted but array<string> was passed as parameter 1
I have looked in the package and there are sort functions for arrays but no aggregate functions. Any idea how to do it? Thanks.
In this case you can define your own function and register it as a UDF:
val userDefinedFunction = ???
val udfFunctionName = udf[U,T](userDefinedFunction)
Then pass that column into the function so that it gets converted into a primitive type, and use the result in the withColumn call.
Something like this:
// Note: Spark passes array columns to Scala UDFs as Seq, and the Order column is array<string>
val dataF: Seq[String] => String = _.head
val dataUDF = udf[String, Seq[String]](dataF)

val group = orders.withColumn("orders", dataUDF(col("Order")))
  .groupBy(col("Id"), col("Gender"))
  .agg(collect_list("orders"))
I hope it works!
I have a table:
+-------------------+-------------+--------------+-------+-------------+
| session_id| insert_dttm| key| value| process_name|
+-------------------+-------------+--------------+-------+-------------+
|local-1641922005078|1641922023703|test_file1.csv|Success|ProcessResult|
|local-1641922005078|1641922023704|test_file1.csv|Success|ProcessResult|
|local-1641922005078|1641922023705|test_file2.csv|Success|ProcessResult|
|local-1641922005080|1641922023706|test_file2.csv|Success|ProcessResult|
|local-1641922005080|1641922023707|test_file3.csv|Success|ProcessResult|
|local-1641922005080|1641922023708|test_file3.csv|Success|ProcessResult|
+-------------------+-------------+--------------+-------+-------------+
I want to get the last session id from this table:
local-1641922005080: String
Can I do this using a window function?
I have a solution:
val lastSessionId = ds.select(max(struct(col("insert_dttm"), col("session_id")))("session_id"))
.first.getString(0)
However, I would also like to implement this with a window function.
Actually, you don't need a window function here, since you can just sort the data in descending order and return the first record using limit(1).
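A minimal sketch of that approach (assuming the same df used in the window example below):
import org.apache.spark.sql.functions.col

// Sort by insert_dttm descending and keep only the newest row.
val lastSessionId = df
  .orderBy(col("insert_dttm").desc)
  .limit(1)
  .select("session_id")
  .first()
  .getString(0)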
But for the sake of practice, you can use a window function like this:
import org.apache.spark.sql.functions.{col, row_number}
import org.apache.spark.sql.expressions.Window
val windowSpec = Window.orderBy(col("insert_dttm").desc)
val lastSessionId = df
  .withColumn("row_number", row_number.over(windowSpec))
  .filter("row_number = 1")
  .first
  .getString(0)
I have a value in a JSON column that is sometimes all null in an Azure Databricks table. The full process to get to JSON_TABLE is: read parquet, infer schema of JSON column, convert the column from JSON string to deeply nested structure, explode any arrays within. I am working in SQL with python-defined UDFs (json_exists() checks the schema to see if the key is possible to use, json_get() gets a key from the column or returns a default) and want to do the following:
SELECT
ID, EXPLODE(json_get(JSON_COL, 'ARRAY', NULL)) AS SINGLE_ARRAY_VALUE
FROM
JSON_TABLE
WHERE
JSON_COL IS NOT NULL AND
json_exists(JSON_COL, 'ARRAY')==1
When the data has at least one instance of JSON_COL containing ARRAY, the schema is such that this has no problems. If, however, the data has all null values in JSON_COL.ARRAY, an error occurs because the column has been inferred as a string type (error received: input to function explode should be array or map type, not string). Unfortunately, while the json_exists() function returns the expected values, the error still occurs even when the returned dataset would be empty.
Can I get around this error via casting or replacement of nulls? If not, what is an alternative that still allows inferring the schema of the JSON?
Note: This is a simplified example. I am writing code to generate SQL code for hundreds of similar data structures, so while I am open to workarounds, a direct solution would be ideal. Please ask if anything is unclear.
Example table that causes error:
| ID | JSON_COL |
| 1 | {"_corrupt_record": null, "otherInfo": [{"test": 1, "from": 3}]} |
| 2 | {"_corrupt_record": null, "otherInfo": [{"test": 5, "from": 2}]} |
Example table that does not cause error:
| ID | JSON_COL |
| 1 | {"_corrupt_record": null, "array": [{"test": 1, "from": 3}]} |
| 2 | {"_corrupt_record": null, "otherInfo": [{"test": 5, "from": 2}]} |
This question seems like it might hold the answer, but I was not able to get anything working from it.
You can filter the table in a subquery before calling json_get and explode, so that explode only runs on rows where the array key actually exists:
SELECT
ID, EXPLODE(json_get(JSON_COL, 'ARRAY', NULL)) AS SINGLE_ARRAY_VALUE
FROM (
SELECT *
FROM JSON_TABLE
WHERE
JSON_COL IS NOT NULL AND
json_exists(JSON_COL, 'ARRAY')==1
)
I have a dataframe with a date column that has values as below.
df.show()
+----+----------+
|name| dob|
+----+----------+
| Jon|2001-04-15|
| Ben|2002-03-01|
+----+----------+
Now I need to query a table in Hive using the "dob" values from the above dataframe (both 2001-04-15 and 2002-03-01), so I need to pass the values in the dob column as a parameter to my Hive query.
I tried to collect the values into a variable like below, which gives me an array of strings.
val dobRead = df.select("updt_d").distinct().as[String].collect()
dobRead: Array[String] = Array(2001-04-15, 2002-03-01)
However, when I try to pass it to the query, I see it is not substituted properly and I get an error.
val tableRead = hive.executeQuery(s"select emp_name,emp_no,martial_status from <<table_name>> where dateOfBirth in ($dobRead)")
org.apache.hadoop.hive.ql.metadata.HiveException: Failed to compile query: org.apache.hadoop.hive.ql.parse.ParseException: line 1:480 cannot recognize input near '(' '[' 'Ljava' in expression specification
Can you please help me with how to pass date values to a query in Spark?
You can collect the dates as follows (Row.getAs):
import org.apache.spark.sql.Row

val rows: Array[Row] = df.select("updt_d").distinct().collect()
val dates: Array[String] = rows.map(_.getAs[String](0))
And then build the query:
val hql: String = s"select ... where dateOfBirth in (${
dates.map(d => s"'${d}'").mkString(", ")
})"
Option 2
If the number of dates in the first DataFrame is too big, you should use a join instead of collecting them into the driver.
First, load each table as a DataFrame (I'll call them dfEmp and dfDates). Then you can join on the date fields to filter, either using a standard inner join plus de-duplication (sketched after the example below) or directly using a left_semi join:
val dfEmp = hiveContext.table("EmpTable")
val dfEmpFiltered = dfEmp.join(dfDates,
col("dateOfBirth") === col("updt_d"), "left_semi")
: org.apache.spark.sql.AnalysisException: Cannot have map type columns in DataFrame which calls set operations(intersect, except, etc.), but the type of column map_col is map
I have a Hive table with a column of type MAP<Float, Float>. I get the above error when I try to do an insertion on this table in a Spark context. The insertion works fine without the 'distinct'.
create table test_insert2(`test_col` string, `map_col` MAP<INT,INT>)
location 's3://mybucket/test_insert2';
insert into test_insert2
select distinct 'a' as test_col, map(0,0) as map_col
Try converting the dataframe to an RDD, then apply the .distinct function.
Example:
spark.sql("""select 'a' test_col, map(0,0) map_col
             union all
             select 'a' test_col, map(0,0) map_col""").rdd.distinct.collect
Result:
Array[org.apache.spark.sql.Row] = Array([a,Map(0 -> 0)])
The following query only returns vlabel.
Should it return elabels as well?
match
return distinct labels;
According to the docs in the "List functions" section, the labels() function will only return vlabels.
labels()
Returns the vlabel of the vertex passed as an argument. Be careful when passing arguments to the labels function: when you find a vertex that matches the pattern in a MATCH clause, assign it to a variable and pass that variable as the argument; the vertex itself cannot be passed directly, it must always be passed via a variable.
If you wanted the edges/relationships, the docs state to use the relationships() function:
relationships()
Returns the edges present in the path passed as an argument. Be careful when passing arguments to the relationships function: when you find a path that matches the pattern in a MATCH clause, assign it to a variable and pass that variable as the argument; the path itself cannot be passed directly, it must always be passed via a variable. When used with the count function, the number of edges in the path can be found.
Therefore, to list both vlabels and elabels, you would want something like the following query (note that we assign the resulting path to p, which is then passed to the relationships function):
MATCH p=(n)-[r]->(m)
RETURN DISTINCT labels(n), relationships(p), labels(m);
-- Example results
labels | relationships | labels
----------+-------------------------------------------+----------
["part"] | [used_by[19.3][18.4,18.5]{"quantity": 1}] | ["part"]
["part"] | [used_by[19.4][18.5,18.6]{"quantity": 2}] | ["part"]
["part"] | [used_by[19.5][18.4,18.7]{"quantity": 1}] | ["part"]
(3 rows)