PySpark DataFrame conditional groupBy

from pyspark.sql import Row, functions as F
row = Row("UK_1","UK_2","Date","Cat")
agg = ''
agg = 'Cat'
tdf = (sc.parallelize
([
row(1,1,'12/10/2016',"A"),
row(1,2,None,'A'),
row(2,1,'14/10/2016','B'),
row(3,3,'!~2016/2/276','B'),
row(None,1,'26/09/2016','A'),
row(1,1,'12/10/2016',"A"),
row(1,2,None,'A'),
row(2,1,'14/10/2016','B'),
row(None,None,'!~2016/2/276','B'),
row(None,1,'26/09/2016','A')
]).toDF())
tdf.groupBy( iff(len(agg.strip()) > 0 , F.col(agg), )).agg(F.count('*').alias('row_count')).show()
Is there a way to use a column or no column based on some condition in the dataframe groupBy?

You can pass an empty list to groupBy when the condition is not met; the aggregation then runs over the whole DataFrame with no grouping column:
tdf.groupBy(agg if len(agg) > 0 else []).agg(...)
agg = ''
tdf.groupBy(agg if len(agg) > 0 else []).agg(F.count('*').alias('row_count')).show()
+---------+
|row_count|
+---------+
| 10|
+---------+
agg = 'Cat'
tdf.groupBy(agg if len(agg) > 0 else []).agg(F.count('*').alias('row_count')).show()
+---+---------+
|Cat|row_count|
+---+---------+
| B| 4|
| A| 6|
+---+---------+
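The question's own attempt strips whitespace before testing the name, whereas `len(agg) > 0` treats a blank string such as `'  '` as a column name. A minimal sketch of a helper that covers that case follows. `grouping_columns` is a hypothetical name, not part of PySpark, and the Spark call appears only as a comment because it assumes the `tdf` DataFrame from the question.

```python
def grouping_columns(agg):
    """Return the list of groupBy columns for an optional column name.

    An empty, whitespace-only, or None name means "no grouping column".
    (Hypothetical helper for illustration; not a PySpark API.)
    """
    name = (agg or "").strip()
    return [name] if name else []


print(grouping_columns("Cat"))   # ['Cat']
print(grouping_columns("   "))   # []

# Assumed usage with the DataFrame from the question:
# tdf.groupBy(*grouping_columns(agg)).agg(F.count("*").alias("row_count")).show()
```

Unpacking the list with `*` means an empty list becomes a plain `groupBy()` call, which aggregates over all rows.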

Related

Spark Scala Dataframe case when like function

I am using Spark with Scala and the DataFrame API, and I am trying to convert the SQL logic below:
CASE
WHEN col_1 like '%XYZ' OR col_1 like '%ZYX' THEN
CASE WHEN col_2 like '%TTT' THEN 'ABC' ELSE 'BBA' END
WHEN col_1 not like '%XYZ' OR col_1 not like '%ZYX' THEN
CASE WHEN col_2 like '%YYY' THEN 'BBC' END
END as new_col
How do I construct a CASE WHEN with multiple like and not like conditions using the Spark Scala DataFrame API?
Use the expr function and pass the whole CASE expression to it, as below.
import org.apache.spark.sql.functions._
val df = Seq(
  ("A", "01/01/2022", 1), ("AXYZ", "02/01/2022", 1), ("AZYX", "03/01/2022", 1),
  ("AXYZ", "04/01/2022", 0), ("AZYX", "05/01/2022", 0), ("AB", "06/01/2022", 1),
  ("A", "07/01/2022", 0)
).toDF("Category", "date", "Indictor")
df.select(col("*"),expr("""CASE WHEN Category like '%XYZ' OR Category like '%ZYX' THEN
CASE WHEN Indictor = 1 THEN 'ABC' ELSE 'BBA' END
WHEN Category not like '%XYZ' OR Category not like '%ZYX' then
CASE WHEN Indictor = 1 THEN 'BBC' ELSE 'BBD' END
END""").alias("new_col")).show()
+--------+----------+--------+-------+
|Category| date|Indictor|new_col|
+--------+----------+--------+-------+
| A|01/01/2022| 1| BBC|
| AXYZ|02/01/2022| 1| ABC|
| AZYX|03/01/2022| 1| ABC|
| AXYZ|04/01/2022| 0| BBA|
| AZYX|05/01/2022| 0| BBA|
| AB|06/01/2022| 1| BBC|
| A|07/01/2022| 0| BBD|
+--------+----------+--------+-------+
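The CASE expression in the answer can be sanity-checked outside Spark. The sketch below uses Python's built-in sqlite3 purely as a stand-in for Spark SQL, which is an assumption made for illustration; the table name `t` is hypothetical. It also shows that, for non-null Category values, the second WHEN behaves like an ELSE, because no string can end in both XYZ and ZYX. For a NULL Category the two forms differ: the original yields NULL, the ELSE form does not.

```python
import sqlite3

# sqlite3 is only a stand-in for Spark SQL here (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (Category TEXT, date TEXT, Indictor INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("A", "01/01/2022", 1), ("AXYZ", "02/01/2022", 1), ("AZYX", "03/01/2022", 1),
        ("AXYZ", "04/01/2022", 0), ("AZYX", "05/01/2022", 0), ("AB", "06/01/2022", 1),
        ("A", "07/01/2022", 0),
    ],
)

# The CASE expression exactly as written in the answer.
original = """
SELECT CASE WHEN Category LIKE '%XYZ' OR Category LIKE '%ZYX' THEN
                 CASE WHEN Indictor = 1 THEN 'ABC' ELSE 'BBA' END
            WHEN Category NOT LIKE '%XYZ' OR Category NOT LIKE '%ZYX' THEN
                 CASE WHEN Indictor = 1 THEN 'BBC' ELSE 'BBD' END
       END AS new_col
FROM t ORDER BY rowid
"""

# Same logic with the second WHEN written as ELSE (differs only for NULL Category).
with_else = """
SELECT CASE WHEN Category LIKE '%XYZ' OR Category LIKE '%ZYX' THEN
                 CASE WHEN Indictor = 1 THEN 'ABC' ELSE 'BBA' END
            ELSE
                 CASE WHEN Indictor = 1 THEN 'BBC' ELSE 'BBD' END
       END AS new_col
FROM t ORDER BY rowid
"""

original_result = [row[0] for row in conn.execute(original)]
else_result = [row[0] for row in conn.execute(with_else)]
print(original_result)  # ['BBC', 'ABC', 'ABC', 'BBA', 'BBA', 'BBC', 'BBD']
```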

How to validate a particular column in a DataFrame without affecting the other columns using Spark SQL?

set.createOrReplaceTempView("input1");
String look = "select case when length(date)>0 then 'Y' else 'N' end as date from input1";
Dataset<Row> Dataset_op = spark.sql(look);
Dataset_op.show();
In the code above, the DataFrame 'set' has 10 columns, and I have validated one of them, 'date'. The query returns the date column alone.
My question is: how do I return all the columns together with the validated date column in a single DataFrame?
Is there any way to get all the columns without manually listing them in the select statement? Please share your suggestions.
Data
df= spark.createDataFrame([
(1,'2022-03-01'),
(2,'2022-04-17'),
(3,None)
],('id','date'))
df.show()
+---+----------+
| id| date|
+---+----------+
| 1|2022-03-01|
| 2|2022-04-17|
| 3| null|
+---+----------+
You have two options.
Option 1: select the valid rows without projecting a new Y/N column.
df.createOrReplaceTempView("input1")
String_look = "select id, date from input1 where length(date)>0"
spark.sql(String_look).show()
+---+----------+
| id| date|
+---+----------+
| 1|2022-03-01|
| 2|2022-04-17|
+---+----------+
Option 2: project Y and N into a new column. Remember that the where clause is applied before column projection, so you cannot use the newly created column in the where clause.
String_look = "select id, date, case when length(date)>0 then 'Y' else 'N' end as status from input1 where length(date)>0"
spark.sql(String_look).show()
+---+----------+------+
| id| date|status|
+---+----------+------+
| 1|2022-03-01| Y|
| 2|2022-04-17| Y|
+---+----------+------+
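The question also asks how to keep every column without listing each one. One way is to combine `*` with the CASE expression in the select list. The sketch below runs that query with Python's built-in sqlite3 as a stand-in for Spark SQL (an assumption made so the example runs without a cluster); the same query text is expected to work with `spark.sql` on the `input1` view. The where clause is left out so that the 'N' row stays visible.

```python
import sqlite3

# sqlite3 stands in for Spark SQL here (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE input1 (id INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO input1 VALUES (?, ?)",
    [(1, "2022-03-01"), (2, "2022-04-17"), (3, None)],
)

# `*` keeps all existing columns; the CASE adds the validation flag.
look = """
SELECT *,
       CASE WHEN length(date) > 0 THEN 'Y' ELSE 'N' END AS status
FROM input1
ORDER BY id
"""
rows = conn.execute(look).fetchall()
print(rows)  # [(1, '2022-03-01', 'Y'), (2, '2022-04-17', 'Y'), (3, None, 'N')]
```

The null date falls through to the ELSE branch because `length(NULL) > 0` is not true.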

GROUP BY with overlapping rows in PySpark SQL

The following table was created using Parquet / PySpark, and the objective is to aggregate rows where 1 < count < 5 and rows where 2 < count < 6. Note the row where count is 4.1 falls in both ranges.
+-----+-----+
|count|value|
+-----+-----+
| 1.1| 1|
| 1.2| 2|
| 4.1| 3|
| 5.5| 4|
| 5.6| 5|
| 5.7| 6|
+-----+-----+
Here is code to create and then read the above table as a PySpark DataFrame.
import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa
from pyspark import SparkContext, SQLContext
# create Parquet DataFrame
pdf = pd.DataFrame({
'count': [1.1, 1.2, 4.1, 5.5, 5.6, 5.7],
'value': [1, 2, 3, 4, 5, 6]})
table = pa.Table.from_pandas(pdf)
pq.write_to_dataset(table, r'c:/data/data.parquet')
# read Parquet DataFrame and create view
sc = SparkContext()
sql = SQLContext(sc)
df = sql.read.parquet(r'c:/data/data.parquet')
df.createTempView('data')
The operation can use two separate queries.
q1 = sql.sql("""
SELECT AVG(value) AS va
FROM data
WHERE count > 1
AND count < 5
""")
+---+
| va|
+---+
|2.0|
+---+
and, similarly
q2 = sql.sql("""
SELECT AVG(value) as va
FROM data
WHERE count > 2
AND count < 6
""")
+---+
| va|
+---+
|4.5|
+---+
However I want to do this in one efficient query.
Here is an approach that does not work because the row where count is 4.1 is included in only one group.
qc = sql.sql("""
SELECT AVG(value) AS va,
(CASE WHEN count > 1 AND count < 5 THEN 1
WHEN count > 2 AND count < 6 THEN 2
ELSE 0 END) AS id
FROM data
GROUP BY id
""")
The above query produces
+---+---+
| va| id|
+---+---+
|2.0| 1|
|5.0| 2|
+---+---+
To be clear the desired result is something more like
+---+---+
| va| id|
+---+---+
|2.0| 1|
|4.5| 2|
+---+---+
The simplest method is probably union all:
SELECT 1 AS id, AVG(value) AS va
FROM data
WHERE count > 1 AND count < 5
UNION ALL
SELECT 2, AVG(value) as va
FROM data
WHERE count > 2 AND count < 6;
You can also phrase this as:
select r.id, avg(d.value)
from data d join
(select 1 as lo, 5 as hi, 1 as id union all
select 2 as lo, 6 as hi, 2 as id
) r
on d.count > r.lo and d.count < r.hi
group by r.id;
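The range-table join above can be tried without a Spark session. The sketch below uses Python's built-in sqlite3 as a stand-in for Spark SQL, which is an assumption for illustration, and quotes the `count` column to avoid any clash with the aggregate function of the same name. It also shows conditional aggregation, an alternative not mentioned in the answer, which reads the table once and returns the two averages as columns rather than rows.

```python
import sqlite3

# sqlite3 stands in for Spark SQL here (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE data ("count" REAL, value INTEGER)')
conn.executemany(
    "INSERT INTO data VALUES (?, ?)",
    [(1.1, 1), (1.2, 2), (4.1, 3), (5.5, 4), (5.6, 5), (5.7, 6)],
)

# Join against an inline table of (lo, hi, id) ranges; a row that falls
# in two ranges is matched twice, once per range.
join_query = """
SELECT r.id, AVG(d.value) AS va
FROM data AS d
JOIN (SELECT 1 AS lo, 5 AS hi, 1 AS id
      UNION ALL
      SELECT 2, 6, 2) AS r
  ON d."count" > r.lo AND d."count" < r.hi
GROUP BY r.id
ORDER BY r.id
"""
join_rows = conn.execute(join_query).fetchall()
print(join_rows)  # [(1, 2.0), (2, 4.5)]

# Conditional aggregation: AVG ignores the NULLs produced when a row
# is outside the range, so each column averages only its own range.
pivot_query = """
SELECT AVG(CASE WHEN "count" > 1 AND "count" < 5 THEN value END) AS va_1,
       AVG(CASE WHEN "count" > 2 AND "count" < 6 THEN value END) AS va_2
FROM data
"""
pivot_rows = conn.execute(pivot_query).fetchall()
print(pivot_rows)  # [(2.0, 4.5)]
```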

SQL - How can I sum elements of an array?

I am using SQL with PySpark and Hive, and I'm new to all of it.
I have a Hive table with a column of type string, like this:
id | values
1 | '2;4;4'
2 | '5;1'
3 | '8;0;4'
I want to create a query to obtain this:
id | values | sum
1 | '2.2;4;4' | 10.2
2 | '5;1.2' | 6.2
3 | '8;0;4' | 12
By using split(values, ';') I can get arrays like ['2.2','4','4'], but I still need to convert them into decimal numbers and sum them.
Is there a not too complicated way to do this?
Thank you so so much in advance! And happy coding to you all :)
From Spark 2.4 onward
We don't have to explode arrays; we can work on them directly using higher-order functions.
Example:
from pyspark.sql.functions import *
df=spark.createDataFrame([("1","2;4;4"),("2","5;1"),("3","8;0;4")],["id","values"])
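# split the string and cast the result to an array<int> column
# (array<int> does not preserve fractional values such as '2.2'; for those, cast to array<double> and start the aggregate from cast(0 as double))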
df1=df.withColumn("arr",split(col("values"),";").cast("array<int>"))
df1.createOrReplaceTempView("tmp")
spark.sql("select *,aggregate(arr,0,(x,y) -> x + y) as sum from tmp").drop("arr").show()
#+---+------+---+
#| id|values|sum|
#+---+------+---+
#| 1| 2;4;4| 10|
#| 2| 5;1| 6|
#| 3| 8;0;4| 12|
#+---+------+---+
# the same with the DataFrame API
df1.selectExpr("*","aggregate(arr,0,(x,y) -> x + y) as sum").drop("arr").show()
#+---+------+---+
#| id|values|sum|
#+---+------+---+
#| 1| 2;4;4| 10|
#| 2| 5;1| 6|
#| 3| 8;0;4| 12|
#+---+------+---+
PySpark solution
from pyspark.sql.functions import udf,col,split
from pyspark.sql.types import FloatType
# UDF that sums the split values and returns None when the string contains a non-numeric value
# Change the implementation of the function as needed
def values_sum(split_list):
    total = 0
    for num in split_list:
        try:
            total += float(num)
        except ValueError:
            return None
    return total
values_summed = udf(values_sum,FloatType())
res = df.withColumn('summed',values_summed(split(col('values'),';')))
res.show()
The solution could've been a one-liner if it were known the array values are of a given data type. However, it is better to go with a safer implementation that covers all cases.
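The body of the UDF is plain Python, so its behaviour can be checked without a Spark session. A minimal sketch follows, using the strings from the question plus one hypothetical malformed value (`'8;x;4'`, not in the original data) to show the None case.

```python
def values_sum(split_list):
    """Sum the items of a list of numeric strings; return None if any item is not numeric."""
    total = 0
    for num in split_list:
        try:
            total += float(num)
        except ValueError:
            return None
    return total


# Mimic split(col('values'), ';') with str.split for a local check.
samples = ["2;4;4", "5;1", "8;0;4", "8;x;4"]  # the last value is a made-up bad input
results = {s: values_sum(s.split(";")) for s in samples}
print(results)  # {'2;4;4': 10.0, '5;1': 6.0, '8;0;4': 12.0, '8;x;4': None}
```

Returning None on the first bad item means one malformed entry nulls the whole sum for that row, which is the behaviour the answer describes.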
Hive solution
Use explode with split and group by to sum the values.
select id,sum(cast(split_value as float)) as summed
from tbl
lateral view explode(split(values,';')) t as split_value
group by id
Write a stored function that does the job:
CREATE FUNCTION SPLIT_AND_SUM ( s VARCHAR(1024) ) RETURNS INT
BEGIN
...
END

PySpark - how to update Dataframe by using join?

I have a dataframe a:
id,value
1,11
2,22
3,33
And another dataframe b:
id,value
1,123
3,345
I want to update dataframe a with all matching values from b (based on column 'id').
Final dataframe 'c' would be:
id,value
1,123
2,22
3,345
How can I achieve that using DataFrame joins (or another approach)?
Tried:
a.join(b, a.id == b.id, "inner").drop(a.value)
Gives (not desired output):
+---+---+-----+
| id| id|value|
+---+---+-----+
| 1| 1| 123|
| 3| 3| 345|
+---+---+-----+
Thanks.
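I don't think there is any built-in update functionality, but a left outer join should work (here df1 is dataframe a and df2 is dataframe b):
import pyspark.sql.functions as F
# keep every row of df1; take df2's value where a match exists, otherwise keep df1's value
df1.join(df2, df1.id == df2.id, "left_outer") \
.select(df1.id, F.when(df2.value.isNull(), df1.value).otherwise(df2.value).alias("value"))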
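The left-join-then-pick-a-value idea can be checked on the question's data without a Spark session. The sketch below uses Python's built-in sqlite3 as a stand-in (an assumption for illustration) and COALESCE in place of the when/otherwise pair; in PySpark the counterpart would be `F.coalesce(df2.value, df1.value)`. The two forms agree as long as matching rows in b never carry a null value.

```python
import sqlite3

# sqlite3 stands in for Spark here (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER, value INTEGER)")
conn.execute("CREATE TABLE b (id INTEGER, value INTEGER)")
conn.executemany("INSERT INTO a VALUES (?, ?)", [(1, 11), (2, 22), (3, 33)])
conn.executemany("INSERT INTO b VALUES (?, ?)", [(1, 123), (3, 345)])

# Keep every row of a; prefer b's value when the id matches.
updated = conn.execute(
    """
    SELECT a.id, COALESCE(b.value, a.value) AS value
    FROM a
    LEFT JOIN b ON a.id = b.id
    ORDER BY a.id
    """
).fetchall()
print(updated)  # [(1, 123), (2, 22), (3, 345)]
```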