How to explode a row into multiple rows without using data frames? - apache-spark-sql

I have created DataFrames to explode a row into multiple rows based on a delimiter, using the explode function. I would like to know if I can bypass the DataFrame API here and use only Spark SQL to perform this operation.
For example, Teradata has the STRTOK function for this purpose.

Quick answer: there is no built-in SQL function that breaks a row into multiple rows based on a string value and delimiters as conveniently as flatMap() or explode() in the Dataset API can.
Quite simply, this is because with a DataFrame you can manipulate Rows programmatically at a much higher level and finer granularity than with Spark SQL.
Note: Dataset.explode() is deprecated as of Spark 2.0:
explode() Deprecated: (Since version 2.0.0) use flatMap() or select() with
functions.explode() instead
Here are examples of both methods recommended in the quote above.
Examples
// Loading testing data
val mockedData = sc.parallelize(Seq("hello, world", "foo, bar")).toDF
+------------+
| value|
+------------+
|hello, world|
| foo, bar|
+------------+
Option 1 - flatMap()
Breaking a row into multiple rows using flatMap():
scala> mockedData.flatMap( r => r.getString(0).split(",")).show
+------+
| value|
+------+
| hello|
| world|
| foo|
| bar|
+------+
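Note that outside of the spark-shell (where the SparkSession implicits are pre-imported), the flatMap() version needs an implicit Encoder for the resulting String values. A minimal sketch, assuming a SparkSession named spark (the val name exploded is just illustrative):
import spark.implicits._

// produces a Dataset[String] with one row per token
val exploded = mockedData.flatMap(r => r.getString(0).split(","))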
Option 2 - functions.explode()
Replacing the value column with a new set of rows generated by functions.explode(). (It is the Dataset.explode() method that is deprecated in favor of flatMap() and functions.explode(), as quoted above.)
scala> mockedData.withColumn("value", explode(split($"value", "[,]"))).show
+------+
| value|
+------+
| hello|
| world|
| foo|
| bar|
+------+
Switching to the Spark SQL API:
If you want to start querying the data through SQL, you can create a temporary view from the resulting Dataset:
scala> val resultedDf = mockedData.flatMap( r => r.getString(0).split(","))
resultedDf: org.apache.spark.sql.Dataset[String] = [value: string]
scala> resultedDf.createOrReplaceTempView("temp")
scala> spark.sql("select * from temp").show
+------+
| value|
+------+
| hello|
| world|
| foo|
| bar|
+------+
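Since the original question asks about doing this with SQL only: split and explode are also available as Spark SQL built-in functions, so the same transformation can be written as a single SQL statement over a view of the original, untransformed data. A minimal sketch, assuming mockedData is registered under the (illustrative) view name mocked:
// register the original data; "mocked" is just an illustrative view name
mockedData.createOrReplaceTempView("mocked")
spark.sql("select explode(split(value, '[,]')) as value from mocked").show
This should return the same four rows (hello, world, foo, bar) as the flatMap() and explode() examples above.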
I hope this answers your question.

Related

Is there a method to extract a JSON array having 2 list objects in SQL?

I have JSON data in a SQL database, in a column containing values like the ones below:
{"Part_Quantities":[1,3],"Part_Numbers":["300900010022","300900010021"],"wip_id":10668,"wi_id":100094,"doco":"76006394","equipment_id":404}
{"Part_Quantities":[2],"Part_Numbers":["300900030028"],"wip_id":10668,"wi_id":100094,"doco":"76006394","equipment_id":404}
I need to get the output as:
| DOCO     | Part_Number  | Part_Quantities |
| 76006394 | 300900010022 | 1               |
| 76006394 | 300900010021 | 3               |
| 76006394 | 300900030028 | 2               |
I have used CROSS APPLY and OUTER APPLY, but the result is getting duplicated.

How to make Spark SQL case insensitive with field values

How can I write Spark SQL commands to return fields with case-insensitive results?
Example:
Sample_DF below
+--------+
| name |
+--------+
| Johnny|
| Robert|
| ROBERT|
| robert|
+--------+
It seems that by default Spark SQL is case sensitive in the field values you query for:
spark.sql("select name from Sample_DF where name like '%Robert%'").show
+------+
|name |
+------+
|Robert|
+------+
What can I do to make the above query case insensitive so that it returns the result below, assuming there is a large list of Roberts with different lower/uppercase variations?
+--------+
| name |
+--------+
| Robert|
| ROBERT|
| robert|
+--------+
As I understand it, Spark SQL does not support MSSQL COLLATE.
You can instead compare everything in lowercase:
spark.sql("select name from Sample_DF where lower(name) like '%' || lower('Robert') || '%'").show
There is also a built-in function for this; see: How to change case of whole column to lowercase?
If you want to take a look at all the names in the name column, you could use the lower function, which converts all characters to lowercase.
import pyspark.sql.functions as F
Sample_DF.select(F.lower('name')).show()

How to update a dataframe in PySpark with random values from another dataframe?

I have two dataframes in PySpark as below:
Dataframe A: total 1000 records
+-----+
|Name |
+-----+
| a|
| b|
| c|
+-----+
Dataframe B: Total 3 records
+-----+
|Zip |
+-----+
|06905|
|06901|
|06902|
+-----+
I need to add a new column named Zip in Dataframe A and populate the values with a randomly selected value from Dataframe B. So the Dataframe A will look something like this:
+-----+-----+
|Name |Zip |
+-----+-----+
| a|06901|
| b|06905|
| c|06902|
| d|06902|
+-----+-----+
I am running this on Azure Databricks and apparently quinn isn't available as a module there, so unfortunately I can't use quinn.
If b is small (3 rows), you can just collect it into a Python list and add it as an array column to a. Then you can get a random element using shuffle.
import pyspark.sql.functions as F

df = a.withColumn(
    'Zip',
    F.shuffle(
        F.array(*[F.lit(r[0]) for r in b.collect()])
    )[0]
)
df.show()
+----+-----+
|Name| Zip|
+----+-----+
| a|06901|
| b|06905|
| c|06902|
| d|06901|
+----+-----+
You can aggregate the dataframe with zips and collect the values into one array column, then do a cross join and select a random element from the array of zips, for example by using shuffle on the array before picking the first element:
from pyspark.sql import functions as F

df_result = df_a.crossJoin(
    df_b.agg(F.collect_list("Zip").alias("Zip"))
).withColumn(
    "Zip",
    F.expr("shuffle(Zip)[0]")
)
#+----+-----+
#|Name| Zip|
#+----+-----+
#| a|06901|
#| b|06902|
#| c|06901|
#| d|06901|
#+----+-----+

Sum of all elements in an array column

I am new to Spark and have a use case to find the sum of all the values in a column, where each row of the column is an array of integers.
df.show(2,false)
+------------------+
|value |
+------------------+
|[3,4,5] |
+------------------+
|[1,2] |
+------------------+
The value to find: 3 + 4 + 5 + 1 + 2 = 15
Can someone please help/guide me on how to achieve this?
Edit: I have to run this code on Spark 2.3
One option is to sum up the array on each row and then compute the overall sum. This can be done with the Spark SQL function aggregate, which is available from Spark 2.4.0 onwards (so it does not cover the Spark 2.3 constraint mentioned in the edit).
val tmp = df.withColumn("summed_val",expr("aggregate(val,0,(acc, x) -> acc + x)"))
tmp.show()
+---+---------+----------+
| id| val|summed_val|
+---+---------+----------+
| 1|[3, 4, 5]| 12|
| 2| [1, 2]| 3|
+---+---------+----------+
//one row dataframe with the overall sum. collecting to a scalar value is possible too.
tmp.agg(sum("summed_val").alias("total")).show()
+-----+
|total|
+-----+
| 15|
+-----+
Another option is to use explode, which also works on Spark 2.3. But beware: this approach will generate a large amount of data to be aggregated over.
import org.apache.spark.sql.functions.{explode, sum}

val tmp = df.withColumn("elem", explode($"val"))
tmp.agg(sum($"elem").alias("total")).show()
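The explode-and-sum approach can also be written purely in Spark SQL; a minimal sketch, assuming the answer's test DataFrame (with its array column val) is registered under the illustrative view name arrays:
// "arrays" is just an illustrative view name for the answer's test DataFrame
df.createOrReplaceTempView("arrays")
spark.sql("select sum(elem) as total from (select explode(val) as elem from arrays) t").show()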

How to access data in a Spark Dataset Column

I have a Dataframe like this:
+------+---+
| Name|Age|
+------+---+
|A-2 | 26|
|B-1 | 30|
|C-3 | 20|
+------+---+
scala> p.select("Name", "Age")
res2: org.apache.spark.sql.DataFrame = [Name: string, Age: string]
We can see clearly here that the data in the columns is of type String.
I want to transform the Name column with a split("-")-like method to get only the first part of it (i.e. A, B, C).
But the Column type in Spark doesn't have such a method, so I'm wondering how to get the 'string' inside the Column so I can perform the split operation.
Does anyone know what I should do?
Use the functions.split method:
import org.apache.spark.sql.functions.{split, col}

df.select(split(col("Name"), "-").getItem(0))
The split function is available for Spark dataframes. See the example below.
//Creating Test Data
val df = Seq(("A-2", 26)
, ("B-1", 30)
, ("C-3", 20)
).toDF("name", "age")
df.withColumn("new_name", split(col("name"),"-")(0)).show(false)
+----+---+--------+
|name|age|new_name|
+----+---+--------+
|A-2 |26 |A |
|B-1 |30 |B |
|C-3 |20 |C |
+----+---+--------+
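If you prefer to stay closer to SQL syntax, the same transformation can be written with selectExpr, since split is also a Spark SQL built-in; a minimal sketch using the same test data as above:
// split(name, '-')[0] takes the first token of the split array
df.selectExpr("name", "age", "split(name, '-')[0] as new_name").show(false)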