GROUP BY with overlapping rows in PySpark SQL

The following table was created using Parquet / PySpark, and the objective is to aggregate the rows where 1 < count < 5 and the rows where 2 < count < 6. Note that the row where count is 4.1 falls in both ranges.
+-----+-----+
|count|value|
+-----+-----+
| 1.1| 1|
| 1.2| 2|
| 4.1| 3|
| 5.5| 4|
| 5.6| 5|
| 5.7| 6|
+-----+-----+
Here is code to create and then read the above table as a PySpark DataFrame.
import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa
from pyspark import SparkContext
from pyspark.sql import SQLContext

# write the pandas DataFrame out as a Parquet dataset
pdf = pd.DataFrame({
    'count': [1.1, 1.2, 4.1, 5.5, 5.6, 5.7],
    'value': [1, 2, 3, 4, 5, 6]})
table = pa.Table.from_pandas(pdf)
pq.write_to_dataset(table, r'c:/data/data.parquet')

# read the Parquet dataset as a PySpark DataFrame and create a view
sc = SparkContext()
sql = SQLContext(sc)
df = sql.read.parquet(r'c:/data/data.parquet')
df.createTempView('data')
The operation can use two separate queries.
q1 = sql.sql("""
SELECT AVG(value) AS va
FROM data
WHERE count > 1
AND count < 5
""")
+---+
| va|
+---+
|2.0|
+---+
and, similarly
q2 = sql.sql("""
SELECT AVG(value) as va
FROM data
WHERE count > 2
AND count < 6
""")
+---+
| va|
+---+
|4.5|
+---+
However, I want to do this in one efficient query.
Here is an approach that does not work because the row where count is 4.1 is included in only one group.
qc = sql.sql("""
SELECT AVG(value) AS va,
(CASE WHEN count > 1 AND count < 5 THEN 1
WHEN count > 2 AND count < 6 THEN 2
ELSE 0 END) AS id
FROM data
GROUP BY id
""")
The above query produces
+---+---+
| va| id|
+---+---+
|2.0| 1|
|5.0| 2|
+---+---+
To be clear the desired result is something more like
+---+---+
| va| id|
+---+---+
|2.0| 1|
|4.5| 2|
+---+---+

The simplest method is probably union all:
SELECT 1, AVG(value) AS va
FROM data
WHERE count > 1 AND count < 5
UNION ALL
SELECT 2, AVG(value) as va
FROM data
WHERE count > 2 AND count < 6;
You can also phrase this as:
select r.id, avg(d.value)
from data d join
(select 1 as lo, 5 as hi, 1 as id union all
select 2 as lo, 6 as hi, 2 as id
) r
on d.count > r.lo and d.count < r.hi
group by r.id;
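For completeness, the same ranges-join idea can be expressed with the DataFrame API; the ranges DataFrame below is a hypothetical helper, not part of the original question, so treat this as a sketch built on the df and sql objects defined above.
from pyspark.sql import functions as F

# one row per (lo, hi, id) bucket; a row of `data` can match more than one bucket
ranges = sql.createDataFrame([(1.0, 5.0, 1), (2.0, 6.0, 2)], ['lo', 'hi', 'id'])

# the non-equi join duplicates rows that fall into both ranges,
# so the row with count 4.1 contributes to both averages
result = (df.join(ranges, (df['count'] > ranges['lo']) & (df['count'] < ranges['hi']))
            .groupBy('id')
            .agg(F.avg('value').alias('va')))
result.show()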

Related

How to validate particular column in a Dataframe without troubling other columns using spark-sql?

set.createOrReplaceTempView("input1");
String look = "select case when length(date)>0 then 'Y' else 'N' end as date from input1";
Dataset<Row> Dataset_op = spark.sql(look);
Dataset_op.show();
In the above code the dataframe 'set' has 10 columns, and I've done the validation for one of them, i.e. 'date'. It returns the date column alone.
My question is: how do I return all the columns, together with the validated date column, in a single dataframe?
Is there any way to get all the columns in the dataframe without manually selecting each of them in the select statement? Please share your suggestions. TIA
Data
df = spark.createDataFrame([
    (1, '2022-03-01'),
    (2, '2022-04-17'),
    (3, None)
], ('id', 'date'))
df.show()
+---+----------+
| id| date|
+---+----------+
| 1|2022-03-01|
| 2|2022-04-17|
| 3| null|
+---+----------+
You have two options.
Option 1: select without projecting a new Y/N column
df.createOrReplaceTempView("input1")
String_look = "select id, date from input1 where length(date)>0"
spark.sql(String_look).show()
+---+----------+
| id| date|
+---+----------+
| 1|2022-03-01|
| 2|2022-04-17|
+---+----------+
Option 2: project Y and N into a new column. Remember that the where clause is applied before the column projection, so you can't use the newly created column in the where clause:
String_look = "select id, date, case when length(date)>0 then 'Y' else 'N' end as status from input1 where length(date)>0"
+---+----------+------+
| id| date|status|
+---+----------+------+
| 1|2022-03-01| Y|
| 2|2022-04-17| Y|
+---+----------+------+
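If you prefer to stay in the DataFrame API, here is a minimal sketch (assuming the same df defined above) that keeps every existing column and appends the validation flag with withColumn:
from pyspark.sql import functions as F

# keep all existing columns and append the validation flag as a new column
validated = df.withColumn(
    "status",
    F.when(F.length("date") > 0, "Y").otherwise("N"))
validated.show()
The SQL equivalent is simply select *, case when length(date)>0 then 'Y' else 'N' end as status from input1, which avoids listing the other columns by hand.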

PySpark - how to update Dataframe by using join?

I have a dataframe a:
id,value
1,11
2,22
3,33
And another dataframe b:
id,value
1,123
3,345
I want to update dataframe a with all matching values from b (based on column 'id').
Final dataframe 'c' would be:
id,value
1,123
2,22
3,345
How can I achieve that using dataframe joins (or another approach)?
Tried:
a.join(b, a.id == b.id, "inner").drop(a.value)
Gives (not desired output):
+---+---+-----+
| id| id|value|
+---+---+-----+
| 1| 1| 123|
| 3| 3| 345|
+---+---+-----+
Thanks.
I don't think there is an update functionality. But this should work:
import pyspark.sql.functions as F

# left join keeps every row of df1; take df2.value where a match exists,
# otherwise fall back to df1.value
df1.join(df2, df1.id == df2.id, "left_outer") \
   .select(df1.id,
           F.when(df2.value.isNull(), df1.value).otherwise(df2.value).alias("value"))
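The same fallback logic can also be written with coalesce, which some find easier to read; this is a sketch using the df1/df2 names from the answer above:
import pyspark.sql.functions as F

# coalesce returns the first non-null argument: df2.value when the ids match, else df1.value
c = (df1.join(df2, df1.id == df2.id, "left_outer")
        .select(df1.id, F.coalesce(df2.value, df1.value).alias("value")))
c.show()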

Spark: how to perform a loop function over dataframes

I have two dataframes as below. I'm trying to look up the second df using the foreign key and then generate a new dataframe. I was thinking of doing spark.sql("""select history.value as previous_year_1 from df1, history where df1.key=history.key and history.date=add_months($currentdate,-1*12)"""), but then I would need to do it multiple times, say for 10 previous years, and join the results back together. How can I create a function for this? Many thanks, I'm quite new here.
dataframe one:
+---+---+-----------+
|key|val| date |
+---+---+-----------+
| 1|100| 2018-04-16|
| 2|200| 2018-04-16|
+---+---+-----------+
dataframe two : historical data
+---+---+-----------+
|key|val| date |
+---+---+-----------+
| 1|10 | 2017-04-16|
| 1|20 | 2016-04-16|
+---+---+-----------+
The result I want to generate is
+---+----------+-----------------+-----------------+
|key|date | previous_year_1 | previous_year_2 |
+---+----------+-----------------+-----------------+
| 1|2018-04-16| 10 | 20 |
| 2|null | null | null |
+---+----------+-----------------+-----------------+
To solve this, the following approach can be applied:
1) Join the two dataframes by key.
2) Filter out all the rows where previous dates are not exactly years before reference dates.
3) Calculate the years difference for the row and put the value in a dedicated column.
4) Pivot the DataFrame around the column calculated in the previous step and aggregate on the value of the respective year.
private def generateWhereForPreviousYears(nbYears: Int): Column =
  (-1 to -nbYears by -1) // loop over each backwards year offset
    .map(yearsBack =>
      /*
       * Each year-back offset is turned into an expression to be included
       * in the WHERE clause. This is equivalent to
       * "history.date=add_months($currentdate,-1*12)" in the question.
       */
      add_months($"df1.date", 12 * yearsBack) === $"df2.date"
    )
    /*
     * The .map call above produces a sequence of Column expressions;
     * we concatenate them with "or" to obtain a single Spark Column.
     * reduce() is the most appropriate way to do that.
     */
    .reduce(_ or _) or $"df2.date".isNull // the trailing "or" keeps unmatched (null) rows in the result

val nbYearsBack = 3

val result = sourceDf1.as("df1")
  .join(sourceDf2.as("df2"), $"df1.key" === $"df2.key", "left")
  .where(generateWhereForPreviousYears(nbYearsBack))
  .withColumn("diff_years", concat(lit("previous_year_"), year($"df1.date") - year($"df2.date")))
  .groupBy($"df1.key", $"df1.date")
  .pivot("diff_years")
  .agg(first($"df2.val")) // "val" matches the column name in the sample data
  .drop("null") // drop the unwanted extra column produced by null year differences
The output is:
+---+----------+---------------+---------------+
|key|date |previous_year_1|previous_year_2|
+---+----------+---------------+---------------+
|1 |2018-04-16|10 |20 |
|2 |2018-04-16|null |null |
+---+----------+---------------+---------------+
Let me "read through the lines" and give you a "similar" solution to what you are asking:
val df1Pivot = df1.groupBy("key").pivot("date").agg(max("val"))
val df2Pivot = df2.groupBy("key").pivot("date").agg(max("val"))
val result = df1Pivot.join(df2Pivot, Seq("key"), "left")
result.show
+---+----------+----------+----------+
|key|2018-04-16|2016-04-16|2017-04-16|
+---+----------+----------+----------+
| 1| 100| 20| 10|
| 2| 200| null| null|
+---+----------+----------+----------+
Feel free to manipulate the data a bit if you really need to change the column names.
Or even better:
df1.union(df2).groupBy("key").pivot("date").agg(max("val")).show
+---+----------+----------+----------+
|key|2016-04-16|2017-04-16|2018-04-16|
+---+----------+----------+----------+
| 1| 20| 10| 100|
| 2| null| null| 200|
+---+----------+----------+----------+
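For reference, here is a minimal PySpark sketch of the same union-and-pivot idea, rebuilding df1 and df2 from the sample tables in the question; the column names follow the tables shown above:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df1 = spark.createDataFrame(
    [(1, 100, '2018-04-16'), (2, 200, '2018-04-16')], ['key', 'val', 'date'])
df2 = spark.createDataFrame(
    [(1, 10, '2017-04-16'), (1, 20, '2016-04-16')], ['key', 'val', 'date'])

# stack both frames, then spread the dates into columns with one value per (key, date)
df1.union(df2).groupBy('key').pivot('date').agg(F.max('val')).show()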

Pyspark DataFrame Conditional groupBy

from pyspark.sql import Row, functions as F
row = Row("UK_1","UK_2","Date","Cat")
agg = ''
agg = 'Cat'
tdf = (sc.parallelize([
    row(1, 1, '12/10/2016', "A"),
    row(1, 2, None, 'A'),
    row(2, 1, '14/10/2016', 'B'),
    row(3, 3, '!~2016/2/276', 'B'),
    row(None, 1, '26/09/2016', 'A'),
    row(1, 1, '12/10/2016', "A"),
    row(1, 2, None, 'A'),
    row(2, 1, '14/10/2016', 'B'),
    row(None, None, '!~2016/2/276', 'B'),
    row(None, 1, '26/09/2016', 'A')
]).toDF())
tdf.groupBy( iff(len(agg.strip()) > 0 , F.col(agg), )).agg(F.count('*').alias('row_count')).show()
Is there a way to use a column or no column based on some condition in the dataframe groupBy?
You can provide an empty list to groupBy if the condition you are looking for is not met, which will groupBy no column:
tdf.groupBy(agg if len(agg) > 0 else []).agg(...)
agg = ''
tdf.groupBy(agg if len(agg) > 0 else []).agg(F.count('*').alias('row_count')).show()
+---------+
|row_count|
+---------+
| 10|
+---------+
agg = 'Cat'
tdf.groupBy(agg if len(agg) > 0 else []).agg(F.count('*').alias('row_count')).show()
+---+---------+
|Cat|row_count|
+---+---------+
| B| 4|
| A| 6|
+---+---------+
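If agg may eventually name several columns, the same trick works with a plain Python list and argument unpacking; agg_cols below is a hypothetical variable, while tdf and F come from the question above:
agg_cols = ['Cat']   # use [] when no grouping column is wanted
tdf.groupBy(*agg_cols).agg(F.count('*').alias('row_count')).show()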

SQL: Only count the values specified in each column

In SQL I have a column called "answer", and the value can either be 1 or 2. I need to generate an SQL query which counts the number of 1's and 2's for each month. I have the following query, but it does not work:
SELECT MONTH(`date`), YEAR(`date`), COUNT(`answer`=1) as yes,
       COUNT(`answer`=2) as nope, COUNT(*) as total
FROM results
GROUP BY YEAR(`date`), MONTH(`date`)
I would group by the year, the month and, in addition, the answer itself. This results in two rows per month: one counting the appearances of answer 1, and another for answer 2 (and it generalizes to additional answer values).
SELECT MONTH(`date`), YEAR(`date`), answer, COUNT(*)
FROM results
GROUP BY YEAR(`date`), MONTH(`date`), answer
Try the SUM-CASE trick:
SELECT
MONTH(`date`),
YEAR(`date`),
SUM(case when `answer` = 1 then 1 else 0 end) as yes,
SUM(case when `answer` = 2 then 1 else 0 end) as nope,
COUNT(*) as total
FROM results
GROUP BY YEAR(`date`), MONTH(`date`)
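As a side note, the same conditional-aggregation pattern carries over to PySpark; this is a sketch assuming a results DataFrame with date and answer columns, not code from the question:
from pyspark.sql import functions as F

# count answer == 1 and answer == 2 separately within each year/month group
(results
    .groupBy(F.year('date').alias('year'), F.month('date').alias('month'))
    .agg(F.sum(F.when(F.col('answer') == 1, 1).otherwise(0)).alias('yes'),
         F.sum(F.when(F.col('answer') == 2, 1).otherwise(0)).alias('nope'),
         F.count(F.lit(1)).alias('total'))
    .show())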
SELECT year,
       month,
       answer,
       COUNT(answer) AS quantity
FROM results
GROUP BY year, month, answer
year|month|answer|quantity
2001| 1| 1| 2
2001| 1| 2| 1
2004| 1| 1| 2
2004| 1| 2| 2
SELECT * FROM results;
year|month|answer
2001| 1| 1
2001| 1| 1
2001| 1| 2
2004| 1| 1
2004| 1| 1
2004| 1| 2
2004| 1| 2