Dealing with negatives in calculations (Databricks-Spark SQL) - apache-spark-sql

When multiplying two columns together in a Spark SQL table that contains some negative values, the result is "NaN" for rows that have a negative value in one of the columns.
Are there any techniques to get the calculation to work?
SELECT temperature * days FROM weather_data

If you get NaN from a multiplication, maybe one or more of the columns contains NaN values. You can use nanvl to substitute a default value (e.g. 0) when the column is NaN. Use it together with coalesce to handle nulls too:
SELECT coalesce(nanvl(temperature, 0), 0) * days FROM weather_data
Example:
weather_data table:
+-----------+----+
|temperature|days|
+-----------+----+
| NaN| 1|
| -12.34| 2|
| null| 3|
| 15.5| 4|
+-----------+----+
spark.sql("SELECT coalesce(nanvl(temperature, 0), 0) * days AS mul FROM weather_data").show()
+------+
| mul|
+------+
| 0.0|
|-24.68|
| 0.0|
| 62.0|
+------+
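For completeness, here is a minimal sketch of the same fix with the DataFrame API, assuming the table has been loaded into a dataframe named df:
from pyspark.sql import functions as F

# nanvl replaces NaN with 0, coalesce then replaces null with 0, before the multiplication
df.select(
    (F.coalesce(F.nanvl(F.col("temperature"), F.lit(0)), F.lit(0)) * F.col("days")).alias("mul")
).show()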

Related

How can I replace the values in one pyspark dataframe column with the values from another column in a sub-section of the dataframe?

I have to perform a group-by and pivot operation on a dataframe's "activity" column, and populate the new columns resulting from the pivot with the sum of the "quantity" column. One of the activity columns, however, has to be populated with the sum of the "cost" column.
Data frame before group-by and pivot:
+----+-----------+-----------+-----------+-----------+
| id | quantity | cost | activity | category |
+----+-----------+-----------+-----------+-----------+
| 1 | 2 | 2 | skiing | outdoor |
| 2 | 0 | 2 | swimming | outdoor |
+----+-----------+-----------+-----------+-----------+
pivot code:
pivotDF = df.groupBy("category").pivot("activity").sum("quantity")
result:
+----+-----------+-----------+-----------+
| id | category | skiing | swimming |
+----+-----------+-----------+-----------+
| 1 | outdoor | 2 | 5 |
| 2 | outdoor | 4 | 7 |
+----+-----------+-----------+-----------+
The problem is that for one of these activities, I need the activity column to be populated with sum("cost") instead of sum("quantity"). I can't seem to find a way to specify this during the pivot operation itself, so I thought maybe I can just exchange the values in the quantity column for the ones in the cost column wherever the activity column value corresponds to the relevant activity. However, I can't find an example of how to do this in a pyspark data frame.
Any help would be much appreciated.
You can provide more than 1 aggregation after the pivot.
Let's say the input dataframe looks like the following
# +---+---+----+--------+-------+
# | id|qty|cost| act| cat|
# +---+---+----+--------+-------+
# | 1| 2| 2| skiing|outdoor|
# | 2| 0| 2|swimming|outdoor|
# | 3| 1| 2| skiing|outdoor|
# | 4| 2| 4|swimming|outdoor|
# +---+---+----+--------+-------+
Do a pivot and use agg() to provide more than 1 aggregation.
from pyspark.sql import functions as func

data_sdf. \
    groupBy('id', 'cat'). \
    pivot('act'). \
    agg(func.sum('cost').alias('cost'),
        func.sum('qty').alias('qty')
        ). \
    show()
# +---+-------+-----------+----------+-------------+------------+
# | id| cat|skiing_cost|skiing_qty|swimming_cost|swimming_qty|
# +---+-------+-----------+----------+-------------+------------+
# | 2|outdoor| null| null| 2| 0|
# | 1|outdoor| 2| 2| null| null|
# | 3|outdoor| 2| 1| null| null|
# | 4|outdoor| null| null| 4| 2|
# +---+-------+-----------+----------+-------------+------------+
Notice the field names: PySpark automatically assigns the suffixes based on the aliases provided in the aggregations. Use a drop or select to retain the columns you need and rename them as you like.
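For example, a minimal sketch of that select/rename step, assuming the pivoted result above is kept in a variable named pivoted_sdf instead of being shown directly, and assuming you want skiing's quantity but swimming's cost:
pivoted_sdf. \
    select('id', 'cat',
           func.col('skiing_qty').alias('skiing'),
           func.col('swimming_cost').alias('swimming')). \
    show()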
Removing id from the groupBy condenses the result to one row per category, which is much closer to what you want.
data_sdf. \
    groupBy('cat'). \
    pivot('act'). \
    agg(func.sum('cost').alias('cost'),
        func.sum('qty').alias('qty')
        ). \
    show()
# +-------+-----------+----------+-------------+------------+
# | cat|skiing_cost|skiing_qty|swimming_cost|swimming_qty|
# +-------+-----------+----------+-------------+------------+
# |outdoor| 4| 3| 6| 2|
# +-------+-----------+----------+-------------+------------+
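Alternatively, to address the original ask directly (cost for one activity, quantity for the rest), you can build a conditional value column before the pivot. A rough sketch, assuming swimming is the activity that should use cost:
# use cost for the special activity, quantity for everything else, then pivot on that one column
data_sdf. \
    withColumn('val',
               func.when(func.col('act') == 'swimming', func.col('cost'))
                   .otherwise(func.col('qty'))). \
    groupBy('cat'). \
    pivot('act'). \
    agg(func.sum('val')). \
    show()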

Sum of all elements in an array column

I am new to Spark and have a use case where I need to find the sum of all the values in a column. Each value in the column is an array of integers.
df.show(2,false)
+------------------+
|value             |
+------------------+
|[3, 4, 5]         |
|[1, 2]            |
+------------------+
The value to find is 3 + 4 + 5 + 1 + 2 = 15.
Can someone please help/guide me on how to achieve this?
Edit: I have to run this code in spark 2.3
One option is to sum up the array on each row and then compute the overall sum. This can be done with the Spark SQL function aggregate, available from Spark version 2.4.0.
import org.apache.spark.sql.functions.{expr, sum}

val tmp = df.withColumn("summed_val", expr("aggregate(val, 0, (acc, x) -> acc + x)"))
tmp.show()
+---+---------+----------+
| id| val|summed_val|
+---+---------+----------+
| 1|[3, 4, 5]| 12|
| 2| [1, 2]| 3|
+---+---------+----------+
// one-row dataframe with the overall sum; collecting to a scalar value is possible too
tmp.agg(sum("summed_val").alias("total")).show()
+-----+
|total|
+-----+
| 15|
+-----+
Another option is to use explode, which also works on Spark 2.3. But beware: this approach will generate a large amount of intermediate data to aggregate over.
import org.apache.spark.sql.functions.{explode, sum}
import spark.implicits._ // for the $"col" column syntax
val tmp = df.withColumn("elem",explode($"val"))
tmp.agg(sum($"elem").alias("total")).show()
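Since the question mentions Spark 2.3, where the aggregate SQL function is not yet available, here is a rough PySpark sketch of the same explode route, assuming the array column is named value as in the question:
from pyspark.sql import functions as F

# one row per array element, then a global sum over all elements
df.select(F.explode("value").alias("elem")) \
  .agg(F.sum("elem").alias("total")) \
  .show()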

How to update a row based on another row with the same id

With a Spark dataframe, I want to update a row's value based on other rows with the same id.
For example,
I have records below,
id,value
1,10
1,null
1,null
2,20
2,null
2,null
I want to get the result as below
id,value
1,10
1,10
1,10
2,20
2,20
2,20
To summarize, the value column is null in some rows, and I want to update them if there is another row with the same id that has a valid value.
In SQL, I can simply write an UPDATE statement with an inner join, but I couldn't find an equivalent way in Spark SQL.
update combineCols a
inner join combineCols b
on a.id = b.id
set a.value = b.value
(this is how I do it in sql)
Let's use the SQL method to solve this issue:
myValues = [(1,10), (1,None), (1,None), (2,20), (2,None), (2,None)]
df = sqlContext.createDataFrame(myValues, ['id', 'value'])
df.registerTempTable('table_view')
df1 = sqlContext.sql(
    'select id, sum(value) over (partition by id) as value from table_view'
)
df1.show()
+---+-----+
| id|value|
+---+-----+
| 1| 10|
| 1| 10|
| 1| 10|
| 2| 20|
| 2| 20|
| 2| 20|
+---+-----+
Caveat: This code assumes that there is only one non-null value for any particular id. Because we aggregate over the window, we have to use an aggregation function, and I have used sum. If there are two non-null values for an id, they will be summed up. If an id could have multiple non-null values, it's better to use min/max, so that we get one of the values rather than their sum.
df1 = sqlContext.sql(
    'select id, max(value) over (partition by id) as value from table_view'
)
You can use a window to do this (in PySpark):
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# create dataframe
df = sc.parallelize([
    [1, 10],
    [1, None],
    [1, None],
    [2, 20],
    [2, None],
    [2, None],
]).toDF(('id', 'value'))
window = Window.partitionBy('id').orderBy(F.desc('value'))
df \
    .withColumn('value', F.first('value').over(window)) \
    .show()
Results:
+---+-----+
| id|value|
+---+-----+
| 1| 10|
| 1| 10|
| 1| 10|
| 2| 20|
| 2| 20|
| 2| 20|
+---+-----+
You can use the same functions in Scala.
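A slightly more explicit variant (a sketch, not part of the original answer) is to have first() skip nulls directly instead of relying on the descending sort pushing nulls last; it still assumes at most one non-null value per id:
# ignorenulls=True makes first() return the first non-null value in the window
w = Window.partitionBy('id')
df.withColumn('value', F.first('value', ignorenulls=True).over(w)).show()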

How to merge rows in Hive?

I have a production table in Hive which gets incremental data (changed records/new records) from an external source on a daily basis. The values for a row may be spread across different dates; for example, this is how the records in the table look on the first day:
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| a1| b1|
| 2| a2| |
| 3| | b3|
+---+----+----+
On the second day, we get the following:
+---+----+----+
| id|col1|col2|
+---+----+----+
| 4| a4| |
| 2| | b2 |
| 3| a3| |
+---+----+----+
which has new records as well as changed records.
The result I want to achieve is a merge of rows based on the primary key (id in this case), producing this output:
+---+----+----+
| id|col1|col2|
+---+----+----+
| 1| a1| b1|
| 2| a2| b2 |
| 3| a3| b3|
| 4| a4| b4|
+---+----+----+
The number of columns is pretty large, typically in the range of 100-150. The aim is to provide the latest full view of all the data received so far. How can I do this within Hive itself?
(PS: it doesn't have to be sorted)
This can be achieved using COALESCE and a FULL OUTER JOIN.
SELECT COALESCE(a.id, b.id)     AS id,
       COALESCE(a.col1, b.col1) AS col1,
       COALESCE(a.col2, b.col2) AS col2
FROM tbl1 a
FULL OUTER JOIN table2 b
  ON a.id = b.id
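Since the real table has 100-150 columns, writing the COALESCE list by hand gets tedious. Here is a hedged PySpark sketch that builds the same full-outer-join/COALESCE logic programmatically; tbl1_df and tbl2_df are assumed dataframe handles for the two tables, and the precedence mirrors the SQL above (the a side wins when both sides are non-null):
from pyspark.sql import functions as F

# every column except the join key gets a COALESCE(a.col, b.col)
value_cols = [c for c in tbl1_df.columns if c != "id"]

merged = (
    tbl1_df.alias("a")
    .join(tbl2_df.alias("b"), on="id", how="full_outer")
    .select("id",
            *[F.coalesce(F.col("a." + c), F.col("b." + c)).alias(c) for c in value_cols])
)
merged.show()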

How to filter a dates column with one condition from another column in PySpark?

Assume I have the following data frame named table_df in PySpark:
sid | date | label
------------------
1033| 20170521 | 0
1033| 20170520 | 0
1033| 20170519 | 1
1033| 20170516 | 0
1033| 20170515 | 0
1033| 20170511 | 1
1033| 20170511 | 0
1033| 20170509 | 0
.....................
The data frame table_df contains different IDs in different rows, the above is simply one typical case of ID.
For each ID and for each date with label 1, I would like to find the date with label 0 that is the closest and before.
For the above table, with ID 1033, date=20170519, label 1, the date of label 0 that is closest and before is 20170516.
And with ID 1033, date=20170511, label 1, the date of label 0 that is closest and before is 20170509 .
So, finally using groupBy and some complicated operations, I will obtain the following table:
sid | filtered_date |
-------------------------
1033| 20170516 |
1033| 20170509 |
-------------
Any help is highly appreciated. I tried but could not find any smart ways.
Thanks
We can use a window partitioned by sid and ordered by date, and find the difference with the next row:
df.show()
+----+--------+-----+
| sid| date|label|
+----+--------+-----+
|1033|20170521| 0|
|1033|20170520| 0|
|1033|20170519| 1|
|1033|20170516| 0|
|1033|20170515| 0|
|1033|20170511| 1|
|1033|20170511| 0|
|1033|20170509| 0|
+----+--------+-----+
from pyspark.sql import Window
from pyspark.sql import functions as F
w = Window.partitionBy('sid').orderBy('date')
df.withColumn('diff',F.lead('label').over(w) - df['label']).where(F.col('diff') == 1).drop('diff').show()
+----+--------+-----+
| sid| date|label|
+----+--------+-----+
|1033|20170509| 0|
|1033|20170516| 0|
+----+--------+-----+
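To get exactly the sid / filtered_date shape from the question, you can finish with a select and a rename, e.g. (a small sketch built on top of the answer above):
result = df.withColumn('diff', F.lead('label').over(w) - df['label']) \
           .where(F.col('diff') == 1) \
           .select('sid', F.col('date').alias('filtered_date'))
result.show()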