Sum of all elements in an array column - dataframe

I am new to Spark and have a use case where I need to find the sum of all the values in a column. Each value in the column is an array of integers.
df.show(2,false)
+-------+
|value  |
+-------+
|[3,4,5]|
|[1,2]  |
+-------+
The value I want to find is 3 + 4 + 5 + 1 + 2 = 15.
Can someone please help/guide me on how to achieve this?
Edit: I have to run this code on Spark 2.3.

One option is to sum up the array on each row and then compute the overall sum. The per-row sum can be done with the Spark SQL function aggregate, which is available from Spark 2.4.0 (so it won't satisfy the Spark 2.3 constraint from the edit; the explode option below will).
val tmp = df.withColumn("summed_val",expr("aggregate(val,0,(acc, x) -> acc + x)"))
tmp.show()
+---+---------+----------+
| id|      val|summed_val|
+---+---------+----------+
|  1|[3, 4, 5]|        12|
|  2|   [1, 2]|         3|
+---+---------+----------+
// One-row dataframe with the overall sum; collecting it to a scalar value is possible too.
tmp.agg(sum("summed_val").alias("total")).show()
+-----+
|total|
+-----+
|   15|
+-----+
Another option is to use explode, which is also available on Spark 2.3. But beware: this approach will generate a large amount of intermediate data to aggregate over.
import org.apache.spark.sql.functions.{explode, sum}
val tmp = df.withColumn("elem", explode($"val"))
tmp.agg(sum($"elem").alias("total")).show()

Related

Consolidate each row of dataframe returning a dataframe into output dataframe

I am looking for help in a scenario where I have a Scala dataframe PARENT. I need to:
1. loop through each record in the PARENT dataframe,
2. query the records from a database, filtered by the ID value of the parent record (the output of this step is a dataframe), and
3. append a few attributes from the parent to the queried dataframe.
Ex:
ParentDF
id  parentname
1   X
2   Y
Queried Dataframe for id 1
id  queryid  name
1   23       lobo
1   45       sobo
1   56       aobo
Queried Dataframe for id 2
id  queryid  name
2   53       lama
2   67       dama
2   56       pama
Final output required:
id  parentname  queryid  name
1   X           23       lobo
1   X           45       sobo
1   X           56       aobo
2   Y           53       lama
2   Y           67       dama
2   Y           56       pama
Update 1:
I tried using foreachPartition with a foreach inside to loop through each record, and got the error below.
error: Unable to find encoder for type org.apache.spark.sql.DataFrame. An implicit Encoder[org.apache.spark.sql.DataFrame] is needed to store org.apache.spark.sql.DataFrame instances in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
falttenedData.map(row=>{
I need to do this in a scalable way, please. Any help is really appreciated.
The solution is pretty straightforward: you just need to join your parentDF with the other one.
parentDF.join(
  otherDF,
  Seq("id"),
  "left"
)
Since you care about scalability: if your "otherDF" is quite small (for example, fewer than 10K rows with 2-3 columns), you should consider using a broadcast join: parentDF.join(broadcast(otherDF), Seq("id"), "left").
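For reference, a small sketch of that broadcast variant; broadcast comes from org.apache.spark.sql.functions, and otherDF here stands for the queried dataframe keyed by id, as above:
// Broadcasting ships the small dataframe to every executor,
// so the large parentDF does not need to be shuffled for the join.
import org.apache.spark.sql.functions.broadcast

val joined = parentDF.join(broadcast(otherDF), Seq("id"), "left")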
You can use the .join method on a dataframe for this one.
Some example code would be something like this:
val df = Seq((1, "X"), (2, "Y")).toDF("id", "parentname")
df.show
+---+----------+
| id|parentname|
+---+----------+
|  1|         X|
|  2|         Y|
+---+----------+
val df2 = Seq((1, 23, "lobo"), (1, 45, "sobo"), (1, 56, "aobo"), (2, 53, "lama"), (2, 67, "dama"), (2, 56, "pama")).toDF("id", "queryid", "name")
df2.show
+---+-------+----+
| id|queryid|name|
+---+-------+----+
|  1|     23|lobo|
|  1|     45|sobo|
|  1|     56|aobo|
|  2|     53|lama|
|  2|     67|dama|
|  2|     56|pama|
+---+-------+----+
val output = df.join(df2, Seq("id"))
output.show
+---+----------+-------+----+
| id|parentname|queryid|name|
+---+----------+-------+----+
|  1|         X|     23|lobo|
|  1|         X|     45|sobo|
|  1|         X|     56|aobo|
|  2|         Y|     53|lama|
|  2|         Y|     67|dama|
|  2|         Y|     56|pama|
+---+----------+-------+----+
Hope this helps! :)

How to identify numeric values in a dataframe column with 10+ digits using PySpark

I am trying to identify numeric values in a column. I tried the approach below to achieve this.
But '7877177450' is flagged as non-numeric. In my scenario the IDs can also be numbers with 10+ digits.
How can I make that work?
from pyspark.sql import functions as F
from pyspark.sql.functions import col, lit

values = [('695435',), ('7877177450',), ('PA-098',), ('asv',), ('23456123',)]
df = sqlContext.createDataFrame(values, ['ID'])
df.show()
df = df.withColumn("Status", F.when((col("ID").cast("int").isNotNull()), lit("numeric")).otherwise(lit("non-numeric")))
df.show()
+----------+
|        ID|
+----------+
|    695435|
|7877177450|
|    PA-098|
|       asv|
|  23456123|
+----------+
+----------+-----------+
|        ID|     Status|
+----------+-----------+
|    695435|    numeric|
|7877177450|non-numeric|
|    PA-098|non-numeric|
|       asv|non-numeric|
|  23456123|    numeric|
+----------+-----------+
You can cast to long instead:
df2 = df.withColumn("Status", F.when((F.col("ID").cast("long").isNotNull()), F.lit("numeric")).otherwise(F.lit("non-numeric")))
int has a maximum value of 2147483647, so it cannot handle values greater than this, and you'll get null.
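To make the overflow visible, here is a quick sketch, reusing the df built above, that compares the two casts side by side:
# int overflows above 2147483647 and becomes null; long easily holds 10+ digit IDs.
df.select(
    "ID",
    F.col("ID").cast("int").alias("as_int"),
    F.col("ID").cast("long").alias("as_long")
).show()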
Or you can use a regular expression:
df2 = df.withColumn("Status",F.when(F.col('ID').rlike('^(\\d)+$'), F.lit("numeric")).otherwise(F.lit("non-numeric")))

How to update a dataframe in PySpark with random values from another dataframe?

I have two dataframes in PySpark as below:
Dataframe A: total 1000 records
+-----+
| Name|
+-----+
|    a|
|    b|
|    c|
+-----+
Dataframe B: Total 3 records
+-----+
|  Zip|
+-----+
|06905|
|06901|
|06902|
+-----+
I need to add a new column named Zip to Dataframe A and populate it with values randomly selected from Dataframe B, so that Dataframe A will look something like this:
+-----+-----+
| Name|  Zip|
+-----+-----+
|    a|06901|
|    b|06905|
|    c|06902|
|    d|06902|
+-----+-----+
I am running this on Azure Databricks, and apparently quinn isn't available as a module there, so unfortunately I can't use quinn.
If b is small (3 rows), you can just collect it into a Python list and add it as an array column to a. Then you can get a random element using shuffle.
import pyspark.sql.functions as F
df = a.withColumn(
    'Zip',
    F.shuffle(
        F.array(*[F.lit(r[0]) for r in b.collect()])
    )[0]
)
df.show()
+----+-----+
|Name|  Zip|
+----+-----+
|   a|06901|
|   b|06905|
|   c|06902|
|   d|06901|
+----+-----+
You can agg the dataframe with the zips to collect all the values into one array column, then do a cross join and select a random element from the array of zips, for example by using shuffle on the array before picking the first element (the cross join is cheap here because the aggregated dataframe has exactly one row):
from pyspark.sql import functions as F
df_result = df_a.crossJoin(
    df_b.agg(F.collect_list("Zip").alias("Zip"))
).withColumn(
    "Zip",
    F.expr("shuffle(Zip)[0]")
)
#+----+-----+
#|Name|  Zip|
#+----+-----+
#|   a|06901|
#|   b|06902|
#|   c|06901|
#|   d|06901|
#+----+-----+

Dealing with negatives in calculations (Databricks-Spark SQL)

When multiplying two columns together in a Spark SQL table that contains some negative values, the result is "NaN" for rows that have a negative value in one of the columns.
Are there any techniques to help make the calculations work?
SELECT temperature * days FROM weather_data
If you get NaN from a multiplication, maybe one or more of the columns contain NaN values. You can use nanvl to substitute a default value (e.g. 0) when the column is NaN. Use it together with coalesce to handle nulls too:
SELECT coalesce(nanvl(temperature, 0), 0) * days FROM weather_data
Example:
weather_data table:
+-----------+----+
|temperature|days|
+-----------+----+
|        NaN|   1|
|     -12.34|   2|
|       null|   3|
|       15.5|   4|
+-----------+----+
spark.sql("SELECT coalesce(nanvl(temperature, 0), 0) * days AS mul FROM weather_data").show()
+------+
|   mul|
+------+
|   0.0|
|-24.68|
|   0.0|
|  62.0|
+------+
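If the data lives in a DataFrame rather than only in the SQL table, the same guard can be written with the DataFrame API; a minimal sketch, assuming a weatherData DataFrame with the columns shown above:
// weatherData is a hypothetical DataFrame with the temperature and days columns above.
// nanvl replaces NaN with 0.0, and coalesce then covers nulls as well.
import org.apache.spark.sql.functions.{coalesce, col, lit, nanvl}

weatherData
  .select((coalesce(nanvl(col("temperature"), lit(0.0)), lit(0.0)) * col("days")).alias("mul"))
  .show()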

How to explode a row into multiple rows without using data frames?

I have created dataframes for exploding a row into multiple rows based on a delimiter, using the explode function. I would like to know whether I can bypass the use of dataframes here and use only Spark SQL to perform this operation.
For example, Teradata has the STRTOK function to perform this kind of action.
Quick answer: there is no built-in SQL function that lets you efficiently break a row into multiple rows based on a string value and delimiters, compared to what flatMap() or explode() in the Dataset API can achieve.
That is simply because the Dataframe API lets you manipulate Rows programmatically at a much higher level and finer granularity than Spark SQL.
Note: Dataset.explode() is deprecated starting from Spark 2.0:
explode() Deprecated: (Since version 2.0.0) use flatMap() or select() with
functions.explode() instead
Here are examples of both methods recommended in the quote above.
Examples
// Loading testing data
val mockedData = sc.parallelize(Seq("hello, world", "foo, bar")).toDF
+------------+
|       value|
+------------+
|hello, world|
|    foo, bar|
+------------+
Option 1 - flatMap()
Breaking rows into multiples using flatMap()
scala> mockedData.flatMap( r => r.getString(0).split(",")).show
+------+
| value|
+------+
| hello|
| world|
|   foo|
|   bar|
+------+
Option 2 - functions.explode()
Replacing the value column with a new set of rows generated by functions.explode() (the recommended replacement for the deprecated Dataset.explode()):
scala> mockedData.withColumn("value", explode(split($"value", "[,]"))).show
+------+
| value|
+------+
| hello|
| world|
|   foo|
|   bar|
+------+
Switching to the Spark SQL API:
If you want to use sqlContext and start querying the data through SQL, you can now create a temporary view from the resulting Dataset:
scala> val resultedDf = mockedData.flatMap( r => r.getString(0).split(","))
resultedDf: org.apache.spark.sql.Dataset[String] = [value: string]
scala> resultedDf.createOrReplaceTempView("temp")
scala> spark.sql("select * from temp").show
+------+
| value|
+------+
| hello|
| world|
|   foo|
|   bar|
+------+
I hope this answers your question.