SQL - How can I sum elements of an array?

I am using SQL with pyspark and hive, and I'm new to all of it.
I have a hive table with a column of type string, like this:
id | values
1 | '2;4;4'
2 | '5;1'
3 | '8;0;4'
I want to create a query to obtain this:
id | values | sum
1 | '2.2;4;4' | 10.2
2 | '5;1.2' | 6.2
3 | '8;0;4' | 12
By using split(values, ';') I can get arrays like ['2.2','4','4'], but I still need to convert them into decimal numbers and sum them.
Is there a not too complicated way to do this?
Thank you so so much in advance! And happy coding to you all :)

From Spark 2.4 onward, we don't have to explode arrays; we can work on them directly using higher-order functions.
Example:
from pyspark.sql.functions import col, split
df=spark.createDataFrame([("1","2;4;4"),("2","5;1"),("3","8;0;4")],["id","values"])
# split the string and cast the result to an array<int> column
df1=df.withColumn("arr",split(col("values"),";").cast("array<int>"))
df1.createOrReplaceTempView("tmp")
spark.sql("select *,aggregate(arr,0,(x,y) -> x + y) as sum from tmp").drop("arr").show()
#+---+------+---+
#| id|values|sum|
#+---+------+---+
#| 1| 2;4;4| 10|
#| 2| 5;1| 6|
#| 3| 8;0;4| 12|
#+---+------+---+
#in dataframe API
df1.selectExpr("*","aggregate(arr,0,(x,y) -> x + y) as sum").drop("arr").show()
#+---+------+---+
#| id|values|sum|
#+---+------+---+
#| 1| 2;4;4| 10|
#| 2| 5;1| 6|
#| 3| 8;0;4| 12|
#+---+------+---+
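Note that the question asks for decimal sums such as 10.2, while the example above casts to array<int>. As a rough, Spark-free sketch of what the aggregate fold computes, the same left fold can be written in plain Python with Decimal values. The name sum_values is hypothetical, and returning None for non-numeric parts is an assumption of this sketch, not Spark behaviour.

```python
from decimal import Decimal, InvalidOperation
from functools import reduce


def sum_values(values, sep=";"):
    """Split a delimited string and fold its numbers into one Decimal.

    Returns None when any part is not numeric (an assumption of this sketch).
    """
    try:
        parts = [Decimal(part) for part in values.split(sep)]
    except InvalidOperation:
        return None
    # Same shape as aggregate(arr, 0, (x, y) -> x + y): a start value, then a merge step.
    return reduce(lambda acc, x: acc + x, parts, Decimal(0))


for row_id, values in [(1, "2.2;4;4"), (2, "5;1.2"), (3, "8;0;4")]:
    print(row_id, values, sum_values(values))
```

In Spark SQL the analogous change would likely be to cast to array<double> (or a decimal type) and to start the fold from a value of that same type, since the start value and the result of the merge step need matching types.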

PySpark solution
from pyspark.sql.functions import udf,col,split
from pyspark.sql.types import FloatType
#UDF to sum the split values returning none when non numeric values exist in the string
#Change the implementation of the function as needed
def values_sum(split_list):
    total = 0
    for num in split_list:
        try:
            total += float(num)
        except ValueError:
            return None
    return total
values_summed = udf(values_sum,FloatType())
res = df.withColumn('summed',values_summed(split(col('values'),';')))
res.show()
The solution could've been a one-liner if it were known the array values are of a given data type. However, it is better to go with a safer implementation that covers all cases.
Hive solution
Use explode with split and group by to sum the values.
select id,sum(cast(split_value as float)) as summed
from tbl
lateral view explode(split(values,';')) t as split_value
group by id
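To make the two steps of that query concrete, here is a small plain-Python model of "explode, then group by and sum". It is only an illustration of the logic, not Hive code, and explode_and_sum is a hypothetical name. One difference to keep in mind: Hive's cast returns NULL for non-numeric strings and sum skips NULLs, whereas float() in this sketch raises an error.

```python
from collections import defaultdict


def explode_and_sum(rows, sep=";"):
    """rows: iterable of (id, 'a;b;c') pairs. Returns {id: sum as float}."""
    # "lateral view explode": one (id, split_value) pair per array element.
    exploded = [(row_id, part) for row_id, values in rows for part in values.split(sep)]
    # "group by id" with sum(cast(split_value as float)).
    totals = defaultdict(float)
    for row_id, part in exploded:
        totals[row_id] += float(part)
    return dict(totals)


print(explode_and_sum([(1, "2;4;4"), (2, "5;1"), (3, "8;0;4")]))
```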

Write a stored function that does the job:
CREATE FUNCTION SPLIT_AND_SUM ( s VARCHAR(1024) ) RETURNS INT
BEGIN
...
END

Related

Querying struct within array - Databricks SQL

I am using Databricks SQL to query a dataset that has a column formatted as an array, and each item in the array is a struct with 3 named fields.
I have the following table:
id | array
1 | [{"firstName":"John","lastName":"Smith","age":"10"},{"firstName":"Jane","lastName":"Smith","age":"12"}]
2 | [{"firstName":"Bob","lastName":"Miller","age":"13"},{"firstName":"Betty","lastName":"Miller","age":"11"}]
In a different SQL editor, I was able to achieve this by doing the following:
SELECT
id,
struct.firstName
FROM
table
CROSS JOIN UNNEST(array) as t(struct)
With a resulting table of:
id | firstName
1 | John
1 | Jane
2 | Bob
2 | Betty
Unfortunately, this syntax does not work in the Databricks SQL editor, and I get the following error.
[UNRESOLVED_COLUMN] A column or function parameter with name `array` cannot be resolved.
I feel like there is an easy way to query this, but my search on Stack Overflow and Google has come up empty so far.
1. SQL API
The first solution uses the SQL API. The first code snippet prepares the test case, so you can ignore it if you already have it in place.
from pyspark.sql.types import ArrayType, IntegerType, StringType, StructField, StructType
schema = StructType([
StructField('id', IntegerType(), True),
StructField("people", ArrayType(StructType([
StructField('firstName', StringType(), True),
StructField('lastName', StringType(), True),
StructField('age', StringType(), True)
])), True)
])
sql_df = spark.createDataFrame([
(1, [{"firstName":"John","lastName":"Smith","age":"10"},{"firstName":"Jane","lastName":"Smith","age":"12"}]),
(2, [{"firstName":"Bob","lastName":"Miller","age":"13"},{"firstName":"Betty","lastName":"Miller","age":"11"}])
], schema)
sql_df.createOrReplaceTempView("sql_df")
What you need is the LATERAL VIEW clause (docs), which lets you explode the nested structures, like this:
SELECT id, exploded.firstName
FROM sql_df
LATERAL VIEW EXPLODE(sql_df.people) sql_df AS exploded;
+---+---------+
| id|firstName|
+---+---------+
| 1| John|
| 1| Jane|
| 2| Bob|
| 2| Betty|
+---+---------+
2. DataFrame API
The alternative approach is to use the explode function (docs), which gives the same result:
from pyspark.sql.functions import explode, col
sql_df.select("id", explode(col("people.firstName"))).show()
+---+-----+
| id| col|
+---+-----+
| 1| John|
| 1| Jane|
| 2| Bob|
| 2|Betty|
+---+-----+
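For readers who want to see what the explode step does to the data, the following plain-Python sketch flattens the same nested rows into one output row per struct. It is a model of the semantics only (no Spark involved), and explode_field is a hypothetical helper name.

```python
def explode_field(rows, array_key, field):
    """Return one (id, field value) pair per struct in each row's array."""
    return [(row["id"], item[field]) for row in rows for item in row[array_key]]


rows = [
    {"id": 1, "people": [
        {"firstName": "John", "lastName": "Smith", "age": "10"},
        {"firstName": "Jane", "lastName": "Smith", "age": "12"},
    ]},
    {"id": 2, "people": [
        {"firstName": "Bob", "lastName": "Miller", "age": "13"},
        {"firstName": "Betty", "lastName": "Miller", "age": "11"},
    ]},
]

for row_id, first_name in explode_field(rows, "people", "firstName"):
    print(row_id, first_name)
```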

Dynamic/Variable Offset in SparkSQL Lead/Lag function

Can we somehow use an offset that depends on a column value in the lead/lag functions in Spark SQL?
Example : Here is what works fine.
val sampleData = Seq( ("bob","Developer",125000),
("mark","Developer",108000),
("carl","Tester",70000),
("peter","Developer",185000),
("jon","Tester",65000),
("roman","Tester",82000),
("simon","Developer",98000),
("eric","Developer",144000),
("carlos","Tester",75000),
("henry","Developer",110000)).toDF("Name","Role","Salary")
val window = Window.orderBy("Role")
//Derive lag column for salary
val laggingCol = lag(col("Salary"), 1).over(window)
//Use derived column LastSalary to find difference between current and previous row
val salaryDifference = col("Salary") - col("LastSalary")
//Calculate trend based on the difference
//IF ELSE / CASE can be written using when.otherwise in spark
val trend = when(col("SalaryDiff").isNull || col("SalaryDiff").===(0), "SAME")
.when(col("SalaryDiff").>(0), "UP")
.otherwise("DOWN")
sampleData.withColumn("LastSalary", laggingCol)
.withColumn("SalaryDiff",salaryDifference)
.withColumn("Trend", trend).show()
Now, my use case is such that the offset we have to pass depends on a particular column of type Integer. This is roughly what I wanted to work:
val sampleData = Seq( ("bob","Developer",125000,2),
("mark","Developer",108000,3),
("carl","Tester",70000,3),
("peter","Developer",185000,2),
("jon","Tester",65000,1),
("roman","Tester",82000,1),
("simon","Developer",98000,2),
("eric","Developer",144000,3),
("carlos","Tester",75000,2),
("henry","Developer",110000,2)).toDF("Name","Role","Salary","ColumnForOffset")
val window = Window.orderBy("Role")
//Derive lag column for salary
val laggingCol = lag(col("Salary"), col("ColumnForOffset")).over(window)
//Use derived column LastSalary to find difference between current and previous row
val salaryDifference = col("Salary") - col("LastSalary")
//Calculate trend based on the difference
//IF ELSE / CASE can be written using when.otherwise in spark
val trend = when(col("SalaryDiff").isNull || col("SalaryDiff").===(0), "SAME")
.when(col("SalaryDiff").>(0), "UP")
.otherwise("DOWN")
sampleData.withColumn("LastSalary", laggingCol)
.withColumn("SalaryDiff",salaryDifference)
.withColumn("Trend", trend).show()
This fails, as expected, because the offset argument only accepts an Integer value, not a Column.
Let us discuss if we can somehow implement a logic for this.
You can add a row number column, and do a self join based on the row number and offset, e.g.:
val df = sampleData.withColumn("rn", row_number().over(window))
val df2 = df.alias("t1").join(
df.alias("t2"),
expr("t1.rn = t2.rn + t1.ColumnForOffset"),
"left"
).selectExpr("t1.*", "t2.Salary as LastSalary")
df2.show
+------+---------+------+---------------+---+----------+
| Name| Role|Salary|ColumnForOffset| rn|LastSalary|
+------+---------+------+---------------+---+----------+
| bob|Developer|125000| 2| 1| null|
| mark|Developer|108000| 3| 2| null|
| peter|Developer|185000| 2| 3| 125000|
| simon|Developer| 98000| 2| 4| 108000|
| eric|Developer|144000| 3| 5| 108000|
| henry|Developer|110000| 2| 6| 98000|
| carl| Tester| 70000| 3| 7| 98000|
| jon| Tester| 65000| 1| 8| 70000|
| roman| Tester| 82000| 1| 9| 65000|
|carlos| Tester| 75000| 2| 10| 65000|
+------+---------+------+---------------+---+----------+
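The row-number lookup behind that self join can be modelled in a few lines of plain Python, which may help when checking the expected LastSalary values by hand. This is a sketch under assumptions: dynamic_lag is a hypothetical name, and the rows are taken in the order shown in the output above (ordering by Role alone does not fix the order of rows within a role).

```python
def dynamic_lag(rows, value_key, offset_key, out_key):
    """Return copies of rows with out_key set to the value `offset` rows earlier.

    rows must already be in window order; positions outside the list give None.
    """
    result = []
    for position, row in enumerate(rows):
        source = position - row[offset_key]  # mirrors t2.rn = t1.rn - offset
        in_range = 0 <= source < len(rows)
        result.append({**row, out_key: rows[source][value_key] if in_range else None})
    return result


data = [
    ("bob", 125000, 2), ("mark", 108000, 3), ("peter", 185000, 2),
    ("simon", 98000, 2), ("eric", 144000, 3), ("henry", 110000, 2),
    ("carl", 70000, 3), ("jon", 65000, 1), ("roman", 82000, 1),
    ("carlos", 75000, 2),
]
rows = [{"Name": n, "Salary": s, "ColumnForOffset": o} for n, s, o in data]

for row in dynamic_lag(rows, "Salary", "ColumnForOffset", "LastSalary"):
    print(row["Name"], row["Salary"], row["ColumnForOffset"], row["LastSalary"])
```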

PySpark - how to update Dataframe by using join?

I have a dataframe a:
id,value
1,11
2,22
3,33
And another dataframe b:
id,value
1,123
3,345
I want to update dataframe a with all matching values from b (based on column 'id').
Final dataframe 'c' would be:
id,value
1,123
2,22
3,345
How can I achieve that using dataframe joins (or another approach)?
Tried:
a.join(b, a.id == b.id, "inner").drop(a.value)
Gives (not desired output):
+---+---+-----+
| id| id|value|
+---+---+-----+
| 1| 1| 123|
| 3| 3| 345|
+---+---+-----+
Thanks.
I don't think there is an update functionality. But this should work (df1 and df2 here correspond to a and b in the question):
import pyspark.sql.functions as F
df1.join(df2, df1.id == df2.id, "left_outer") \
    .select(df1.id, F.when(df2.value.isNull(), df1.value).otherwise(df2.value).alias("value"))
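The when/otherwise expression above amounts to "take b's value if there is one, otherwise keep a's", which is the same idea as a coalesce over the two value columns. A minimal plain-Python model of that left join, using the sample data from the question, could look like this; update_from is a hypothetical name and no Spark is involved.

```python
def update_from(a_rows, b_rows):
    """Left-join a to b on id and prefer b's value when a non-null match exists."""
    b_by_id = {row_id: value for row_id, value in b_rows}
    return [
        (row_id, b_by_id[row_id] if b_by_id.get(row_id) is not None else value)
        for row_id, value in a_rows
    ]


a = [(1, 11), (2, 22), (3, 33)]
b = [(1, 123), (3, 345)]
print(update_from(a, b))
```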

Spark: how to perform loop function to dataframes

I have two dataframes as below. I'm trying to look up rows in the second dataframe using the foreign key and then generate a new dataframe. I was thinking of running spark.sql("""select history.value as previous_year_1 from df1, history where df1.key=history.key and history.date=add_months($currentdate,-1*12)"""), but then I would need to do that multiple times, say for 10 previous years, and join the results back together. How can I create a function for this? Many thanks. I'm quite new here.
dataframe one:
+---+---+-----------+
|key|val| date |
+---+---+-----------+
| 1|100| 2018-04-16|
| 2|200| 2018-04-16|
+---+---+-----------+
dataframe two : historical data
+---+---+-----------+
|key|val| date |
+---+---+-----------+
| 1|10 | 2017-04-16|
| 1|20 | 2016-04-16|
+---+---+-----------+
The result I want to generate is
+---+----------+-----------------+-----------------+
|key|date | previous_year_1 | previous_year_2 |
+---+----------+-----------------+-----------------+
| 1|2018-04-16| 10 | 20 |
| 2|null | null | null |
+---+----------+-----------------+-----------------+
To solve this, the following approach can be applied:
1) Join the two dataframes by key.
2) Filter out all the rows where the previous date is not a whole number of years before the reference date.
3) Calculate the years difference for the row and put the value in a dedicated column.
4) Pivot the DataFrame around the column calculated in the previous step and aggregate on the value of the respective year.
private def generateWhereForPreviousYears(nbYears: Int): Column =
(-1 to -nbYears by -1) // loop on each backwards year value
.map(yearsBack =>
/*
* Each year back count number is transformed in an expression
* to be included into the WHERE clause.
* This is equivalent to "history.date=add_months($currentdate,-1*12)"
* in your comment in the question.
*/
add_months($"df1.date", 12 * yearsBack) === $"df2.date"
)
/*
The previous .map call produces a sequence of Column expressions,
we need to concatenate them with "or" in order to obtain
a single Spark Column reference. .reduce() function is most
appropriate here.
*/
.reduce(_ or _) or $"df2.date".isNull // the last "or" is added to include empty lines in the result.
val nbYearsBack = 3
val result = sourceDf1.as("df1")
.join(sourceDf2.as("df2"), $"df1.key" === $"df2.key", "left")
.where(generateWhereForPreviousYears(nbYearsBack))
.withColumn("diff_years", concat(lit("previous_year_"), year($"df1.date") - year($"df2.date")))
.groupBy($"df1.key", $"df1.date")
.pivot("diff_years")
.agg(first($"df2.value"))
.drop("null") // drop the unwanted extra column with null values
The output is:
+---+----------+---------------+---------------+
|key|date |previous_year_1|previous_year_2|
+---+----------+---------------+---------------+
|1 |2018-04-16|10 |20 |
|2 |2018-04-16|null |null |
+---+----------+---------------+---------------+
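The four steps (join, filter, compute the year difference, pivot) can also be traced in plain Python on the sample data. This is only an illustrative model, with previous_years as a hypothetical name. It differs from the Spark version in two ways that are assumptions of the sketch: it always creates one column per requested year, and it treats "exactly n years before" as "same month and day, n calendar years earlier".

```python
from datetime import date


def previous_years(current_rows, history_rows, nb_years):
    """current_rows / history_rows: lists of (key, val, date) tuples."""
    result = {}
    for key, _val, ref_date in current_rows:
        # Start with empty columns so keys without history still appear (left join).
        columns = {f"previous_year_{n}": None for n in range(1, nb_years + 1)}
        for h_key, h_val, h_date in history_rows:
            years_back = ref_date.year - h_date.year
            same_day = (h_date.month, h_date.day) == (ref_date.month, ref_date.day)
            if h_key == key and same_day and 1 <= years_back <= nb_years:
                name = f"previous_year_{years_back}"
                if columns[name] is None:  # keep the first match, like first()
                    columns[name] = h_val
        result[key] = {"date": ref_date, **columns}
    return result


df1 = [(1, 100, date(2018, 4, 16)), (2, 200, date(2018, 4, 16))]
history = [(1, 10, date(2017, 4, 16)), (1, 20, date(2016, 4, 16))]
for key, row in previous_years(df1, history, 2).items():
    print(key, row)
```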
Let me "read between the lines" and give you a "similar" solution to what you are asking:
val df1Pivot = df1.groupBy("key").pivot("date").agg(max("val"))
val df2Pivot = df2.groupBy("key").pivot("date").agg(max("val"))
val result = df1Pivot.join(df2Pivot, Seq("key"), "left")
result.show
+---+----------+----------+----------+
|key|2018-04-16|2016-04-16|2017-04-16|
+---+----------+----------+----------+
| 1| 100| 20| 10|
| 2| 200| null| null|
+---+----------+----------+----------+
Feel free to manipulate the data a bit if you really need to change the column names.
Or even better:
df1.union(df2).groupBy("key").pivot("date").agg(max("val")).show
+---+----------+----------+----------+
|key|2016-04-16|2017-04-16|2018-04-16|
+---+----------+----------+----------+
| 1| 20| 10| 100|
| 2| null| null| 200|
+---+----------+----------+----------+

How to aggregate data into ranges (bucketize)?

I have a table like
+---------------+------+
|id | value|
+---------------+------+
| 1|118.0|
| 2|109.0|
| 3|113.0|
| 4| 82.0|
| 5| 60.0|
| 6|111.0|
| 7|107.0|
| 8| 84.0|
| 9| 91.0|
| 10|118.0|
+---------------+------+
and would like to aggregate or bin the values into the ranges 0, 10, 20, 30, 40, ..., 80, 90, 100, 110, 120. How can I do this in SQL or, more specifically, Spark SQL?
Currently I have a lateral view join with the range but this seems rather clumsy / inefficient.
The QuantileDiscretizer is not really what I want; I need something like a CUT with this fixed range.
edit
https://github.com/collectivemedia/spark-ext/blob/master/sparkext-mllib/src/main/scala/org/apache/spark/ml/feature/Binning.scala would perform dynamic bins, but I would rather need this specified range.
In the general case, static binning can be performed using org.apache.spark.ml.feature.Bucketizer:
val df = Seq(
(1, 118.0), (2, 109.0), (3, 113.0), (4, 82.0), (5, 60.0),
(6, 111.0), (7, 107.0), (8, 84.0), (9, 91.0), (10, 118.0)
).toDF("id", "value")
val splits = (0 to 12).map(_ * 10.0).toArray
import org.apache.spark.ml.feature.Bucketizer
val bucketizer = new Bucketizer()
.setInputCol("value")
.setOutputCol("bucket")
.setSplits(splits)
val bucketed = bucketizer.transform(df)
val solution = bucketed.groupBy($"bucket").agg(count($"id") as "count")
Result:
scala> solution.show
+------+-----+
|bucket|count|
+------+-----+
| 8.0| 2|
| 11.0| 4|
| 10.0| 2|
| 6.0| 1|
| 9.0| 1|
+------+-----+
The bucketizer throws errors when values lie outside the defined bins. It is possible to define split points as Double.NegativeInfinity or Double.PositiveInfinity to capture outliers.
Bucketizer is designed to work efficiently with arbitrary splits by performing binary search of the right bucket. In the case of regular bins like yours, one can simply do something like:
val binned = df.withColumn("bucket", (($"value" - bin_min) / bin_width) cast "int")
where bin_min and bin_width are the left edge of the lowest bin and the bin width, respectively.
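Both ideas, a binary search over arbitrary split points and the arithmetic shortcut for equal-width bins, can be sketched in plain Python to check the bucket indices by hand. The function names are hypothetical, and the boundary handling (half-open buckets, last bucket closed on the right, error outside the splits) is written to follow the behaviour described above rather than copied from Spark's source.

```python
from bisect import bisect_right
from collections import Counter


def bucket_index(value, splits):
    """Index of the half-open bucket [splits[i], splits[i+1]) containing value.

    The last bucket also includes its upper bound; values outside the
    splits raise ValueError.
    """
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside [{splits[0]}, {splits[-1]}]")
    if value == splits[-1]:
        return len(splits) - 2
    return bisect_right(splits, value) - 1  # binary search for the bucket


def regular_bucket(value, bin_min, bin_width):
    """Shortcut for equal-width bins: floor of (value - bin_min) / bin_width."""
    return int((value - bin_min) // bin_width)


values = [118.0, 109.0, 113.0, 82.0, 60.0, 111.0, 107.0, 84.0, 91.0, 118.0]
splits = [i * 10.0 for i in range(13)]  # 0.0, 10.0, ..., 120.0
counts = Counter(bucket_index(v, splits) for v in values)
print(sorted(counts.items()))
```

For values at or above bin_min the floor division here and a truncating cast to int give the same index; they would differ for values below bin_min.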
Try "GROUP BY" with this
SELECT id, (value DIV 10)*10 FROM table_name ;
The following would be using the Dataset API for Scala:
df.select(('value divide 10).cast("int")*10)