SparkSQL: conditional sum on range of dates - sql

I have a dataframe like this:
| id | prodId | date       | value |
| 1  | a      | 2015-01-01 | 100   |
| 2  | a      | 2015-01-02 | 150   |
| 3  | a      | 2015-01-03 | 120   |
| 4  | b      | 2015-01-01 | 100   |
and I would love to do a groupBy on prodId and aggregate 'value', summing it over ranges of dates. In other words, I need to build a table with the following columns:
prodId
val_1: sum value if date is between date1 and date2
val_2: sum value if date is between date2 and date3
val_3: same as before
etc.
| prodId | val_1            | val_2            |
|        | (01-01 to 01-02) | (01-03 to 01-04) |
| a      | 250              | 120              |
| b      | 100              | 0                |
Is there any predefined aggregate function in Spark that allows doing conditional sums? Do you recommend developing an aggregate UDF (if so, any suggestions)?
Thanks a lot!

First, let's recreate the example dataset:
import org.apache.spark.sql.functions.to_date

val df = sc.parallelize(Seq(
  (1, "a", "2015-01-01", 100), (2, "a", "2015-01-02", 150),
  (3, "a", "2015-01-03", 120), (4, "b", "2015-01-01", 100)
)).toDF("id", "prodId", "date", "value").withColumn("date", to_date($"date"))

val dates = List(("2015-01-01", "2015-01-02"), ("2015-01-03", "2015-01-04"))
All you have to do is something like this:
import org.apache.spark.sql.functions.{when, lit, sum}

val exprs = dates.map {
  case (x, y) => {
    // Create a label for the column name
    val alias = s"${x}_${y}".replace("-", "_")
    // Convert strings to dates
    val xd = to_date(lit(x))
    val yd = to_date(lit(y))
    // Generate an expression equivalent to
    //   SUM(
    //     CASE
    //       WHEN date BETWEEN ... AND ... THEN value
    //       ELSE 0
    //     END
    //   ) AS ...
    // for each pair of dates.
    sum(when($"date".between(xd, yd), $"value").otherwise(0)).alias(alias)
  }
}
df.groupBy($"prodId").agg(exprs.head, exprs.tail: _*).show
// +------+---------------------+---------------------+
// |prodId|2015_01_01_2015_01_02|2015_01_03_2015_01_04|
// +------+---------------------+---------------------+
// | a| 250| 120|
// | b| 100| 0|
// +------+---------------------+---------------------+
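For reference, a rough PySpark sketch of the same sum(when(...)) pattern. This is an illustrative translation, not part of the original answer, and it assumes the same column names and date ranges as above:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "a", "2015-01-01", 100), (2, "a", "2015-01-02", 150),
     (3, "a", "2015-01-03", 120), (4, "b", "2015-01-01", 100)],
    ["id", "prodId", "date", "value"]
).withColumn("date", F.to_date("date"))

dates = [("2015-01-01", "2015-01-02"), ("2015-01-03", "2015-01-04")]

# One SUM(CASE WHEN date BETWEEN x AND y THEN value ELSE 0 END) per date range
exprs = [
    F.sum(F.when(F.col("date").between(F.to_date(F.lit(x)), F.to_date(F.lit(y))),
                 F.col("value")).otherwise(0)).alias(f"{x}_{y}".replace("-", "_"))
    for x, y in dates
]

df.groupBy("prodId").agg(*exprs).show()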

Related

Pyspark get rows with max value for a column over a window

I have a dataframe as follows:
| created       | id | date       | value |
| 1650983874871 | x  | 2020-05-08 | 5     |
| 1650367659030 | x  | 2020-05-08 | 3     |
| 1639429213087 | x  | 2020-05-08 | 2     |
| 1650983874871 | x  | 2020-06-08 | 5     |
| 1650367659030 | x  | 2020-06-08 | 3     |
| 1639429213087 | x  | 2020-06-08 | 2     |
I want to get the max of created for every date.
The table should look like:
| created       | id | date       | value |
| 1650983874871 | x  | 2020-05-08 | 5     |
| 1650983874871 | x  | 2020-06-08 | 5     |
I tried:
df2 = (
    df
    .groupby(['id', 'date'])
    .agg(
        F.max(F.col('created')).alias('created_max')
    )
)
df3 = df.join(df2, on=['id', 'date'], how='left')
But this is not working as expected.
Can anyone help me?
You need to make two changes:
1. The join condition needs to include created as well. Here I have changed the alias to alias("created") to make the join easier. This ensures a unique join condition (as long as there are no duplicate created values).
2. The join type must be inner.
df2 = (
    df
    .groupby(['id', 'date'])
    .agg(
        F.max(F.col('created')).alias('created')
    )
)
df3 = df.join(df2, on=['id', 'date', 'created'], how='inner')
df3.show()
+---+----------+-------------+-----+
| id| date| created|value|
+---+----------+-------------+-----+
| x|2020-05-08|1650983874871| 5|
| x|2020-06-08|1650983874871| 5|
+---+----------+-------------+-----+
Instead of using the group by and join, you can also use a Window in pyspark.sql:
from pyspark.sql import functions as func
from pyspark.sql.window import Window

df = df\
    .withColumn('max_created', func.max('created').over(Window.partitionBy('date', 'id')))\
    .filter(func.col('created') == func.col('max_created'))\
    .drop('max_created')
Steps:
1. Get the max value based on the Window.
2. Filter the rows by matching the timestamp against that max.

Distinct Sum and Group by

I have a dataset (example attached below) and I want to create 2 tables out of it:
+------+------------+-------+-------+-------+--------+
| corp | product | data | Group | sales | market |
+------+------------+-------+-------+-------+--------+
| A | Eli | 43831 | A | 100 | I |
| A | Eli | 43831 | B | 100 | I |
| B | Sut | 43831 | A | 80 | I |
| A | Api | 43831 | C | 50 | C or D |
| A | Api | 43831 | D | 50 | C or D |
| B | Konkurent2 | 43831 | C | 40 | C or D |
+------+------------+-------+-------+-------+--------+
1st: sum(sales) by market, excluding duplicated rows. I want to end up with sales for each market in a specific date range (the data column), but with duplicates excluded; I have them because one product can be in more than one group.
So the first table, for example, for MRCC I, would look like:
+--------+-------+-------+
| market | sales | data |
+--------+-------+-------+
| I | 180 | 43831 |
+--------+-------+-------+
Then I would like the second table to look like the one above, but with an additional 'dictionary' column containing the unique product names within each market and date, so for MRCC I it would look like:
+--------+-------+-------+----------------+
| market | sales | data | unique product |
+--------+-------+-------+----------------+
| I | 180 | 43831 | eli |
| I | 180 | 43831 | Sut |
+--------+-------+-------+----------------+
The thing is, I'm not that experienced in SQL and I'm fairly new to data processing. The system I am working in allows me to do some of the data processing either with "visual" recipes or with SQL code, which I'm not that familiar with. Even more confusing, I can choose between 3 SQL engines: Impala, Hive, and Spark SQL. For example, to create the market column I used Impala, and the script looks like this (I'm not sure whether this is "pure" Impala syntax):
SELECT * from
(
  -- mrc I --
  SELECT *,
    case when (`product`="Eli") or (`product`="Sut")
    then "MRCC I"
    end as market
  FROM x.`y`
) a
where market is not null
Could you give me some tips on how to structure the code, and whether this is even possible?
Thanks,
eM
import spark.implicits._
import org.apache.spark.sql.functions._

case class Sale(
  corp: String,
  product: String,
  data: Long,
  group: String,
  sales: Long,
  market: String
)

val df = Seq(
  Sale("A", "Eli", 43831, "A", 100, "I"),
  Sale("A", "Eli", 43831, "B", 100, "I"),
  Sale("A", "Sut", 43831, "A", 80, "I"),
  Sale("A", "Api", 43831, "C", 50, "C or D"),
  Sale("A", "Api", 43831, "D", 50, "C or D"),
  Sale("B", "Konkurent2", 43831, "C", 40, "C or D")
).toDF()

val t2 = df.dropDuplicates(Seq("corp", "product", "data", "market"))
  .groupBy("market", "product", "data").sum("sales")
  .select(
    'market,
    col("sum(sales)").alias("sales"),
    'data,
    'product.alias("unique product")
  )

t2.show(false)
// +------+-----+-----+--------------+
// |market|sales|data |unique product|
// +------+-----+-----+--------------+
// |I |80 |43831|Sut |
// |I |100 |43831|Eli |
// |C or D|40 |43831|Konkurent2 |
// |C or D|50 |43831|Api |
// +------+-----+-----+--------------+
val t1 = t2.drop("unique product")
  .groupBy("market", "data").sum("sales")
  .select(
    'market,
    col("sum(sales)").alias("sales"),
    'data)

t1.show(false)
// +------+-----+-----+
// |market|sales|data |
// +------+-----+-----+
// |I |180 |43831|
// |C or D|90 |43831|
// +------+-----+-----+
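Since Spark SQL is one of the engines available to the asker, here is a rough PySpark sketch of the same dedupe-then-aggregate idea. It is illustrative only (not part of the original answer) and uses the column names from the example:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("A", "Eli", 43831, "A", 100, "I"),
     ("A", "Eli", 43831, "B", 100, "I"),
     ("B", "Sut", 43831, "A", 80, "I"),
     ("A", "Api", 43831, "C", 50, "C or D"),
     ("A", "Api", 43831, "D", 50, "C or D"),
     ("B", "Konkurent2", 43831, "C", 40, "C or D")],
    ["corp", "product", "data", "group", "sales", "market"]
)

# Drop the duplicates caused by one product belonging to several groups,
# then aggregate per market/product (the "dictionary" table) ...
t2 = (df.dropDuplicates(["corp", "product", "data", "market"])
        .groupBy("market", "data", "product")
        .agg(F.sum("sales").alias("sales"))
        .withColumnRenamed("product", "unique product"))

# ... and per market only (the totals table).
t1 = t2.groupBy("market", "data").agg(F.sum("sales").alias("sales"))

t1.show()
t2.show()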

Apache Spark/PySpark - How to Calculate Column Values Incrementally?

The data frame below is already populated with some data.
(A)
--------------------------------
| id | date | some_count |
--------------------------------
| 3 | 2020-03-31 | 5 |
| 2 | 2020-03-24 | 6 |
| 1 | 2020-03-17 | 3 |
--------------------------------
I want to create another data frame based on the above, but with an additional column that contains the change of the some_count field from the previous week, for every week. (The change for the first record is taken as 0, because it has no previous week's record to compare with.)
(B)
-----------------------------------------------
| id | date | some_count | count_change |
-----------------------------------------------
| 3 | 2020-03-31 | 5 | -1 |
| 2 | 2020-03-24 | 6 | 3 |
| 1 | 2020-03-17 | 3 | 0 |
-----------------------------------------------
How to do this calculation with Apache Spark (SQL/PySpark)?
Because order is important to calculate 'count_change', I think we can load the data into driver memory, calculate, and re-create another dataframe.
The sample code is implemented in Java, but I believe the exact same approach works in Python as well (a rough PySpark sketch follows the output below).
// Assumes the usual imports (java.util.*, org.apache.spark.sql.*, static DataTypes.* and functions.*)
// and an existing SparkSession `ss`.
@Test
public void test() {
    StructType schema = createStructType(Arrays.asList(
        createStructField("id", IntegerType, true),
        createStructField("date", StringType, true),
        createStructField("some_count", IntegerType, true)));
    // assume the source data is already sorted by date desc.
    Dataset<Row> data = ss.createDataFrame(Arrays.asList(
        RowFactory.create(3, "2020-03-31", 5),
        RowFactory.create(2, "2020-03-24", 6),
        RowFactory.create(1, "2020-03-17", 3)), schema);
    // add the column and set 0 as the default value.
    Dataset<Row> dataWithColumn = data.withColumn("count_change", lit(0));
    // collect into driver memory to calculate 'count_change' based on order.
    List<Row> dataWithColumnList = dataWithColumn.collectAsList();
    List<Row> newList = new ArrayList<>();
    // add the first (oldest) row, which has count_change 0.
    newList.add(dataWithColumnList.get(dataWithColumnList.size() - 1));
    for (int i = dataWithColumnList.size() - 2; i >= 0; i--) {
        Row currWeek = dataWithColumnList.get(i);
        Row prevWeek = dataWithColumnList.get(i + 1);
        int currCount = currWeek.getInt(currWeek.fieldIndex("some_count"));
        int prevCount = prevWeek.getInt(prevWeek.fieldIndex("some_count"));
        int countChange = currCount - prevCount;
        newList.add(RowFactory.create(currWeek.get(0), currWeek.get(1), currWeek.get(2), countChange));
    }
    Dataset<Row> result = ss.createDataFrame(newList, dataWithColumn.schema()).sort(col("date").desc());
    result.show();
}
This is the result of show():
+---+----------+----------+------------+
| id| date|some_count|count_change|
+---+----------+----------+------------+
| 3|2020-03-31| 5| -1|
| 2|2020-03-24| 6| 3|
| 1|2020-03-17| 3| 0|
+---+----------+----------+------------+
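For reference, a rough PySpark sketch of the same driver-side approach (collect the rows, compute the deltas in plain Python, rebuild the DataFrame). This is illustrative and not part of the original answer:

from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

data = spark.createDataFrame(
    [(3, "2020-03-31", 5), (2, "2020-03-24", 6), (1, "2020-03-17", 3)],
    ["id", "date", "some_count"]
)

# Collect to the driver, oldest week first, and compute the week-over-week change.
rows = sorted(data.collect(), key=lambda r: r["date"])
new_rows = []
prev_count = None
for r in rows:
    change = 0 if prev_count is None else r["some_count"] - prev_count
    new_rows.append((r["id"], r["date"], r["some_count"], change))
    prev_count = r["some_count"]

result = spark.createDataFrame(
    new_rows, ["id", "date", "some_count", "count_change"]
).sort(F.col("date").desc())
result.show()

Note that collecting to the driver only works while the data is small; on larger data a window function with lag would be the usual distributed alternative, which is a different approach from the one shown in the answer above.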

Using PySpark window functions with conditions to add rows

I need to be able to add new rows to a PySpark df with values based upon the contents of other rows that share a common id. There will eventually be millions of ids with lots of rows for each id. I have tried the method below, which works but seems overly complicated.
I start with a df in the format below (but in reality have more columns):
+-------+----------+-------+
| id | variable | value |
+-------+----------+-------+
| 1 | varA | 30 |
| 1 | varB | 1 |
| 1 | varC | -9 |
+-------+----------+-------+
Currently I am pivoting this df to get it in the following format:
+-----+------+------+------+
| id | varA | varB | varC |
+-----+------+------+------+
| 1 | 30 | 1 | -9 |
+-----+------+------+------+
On this df I can then use the standard withColumn and when functionality to add new columns based on the values in other columns. For example:
df = df.withColumn("varD", when((col("varA") > 16) & (col("varC") != -9)), 2).otherwise(1)
Which leads to:
+-----+------+------+------+------+
| id | varA | varB | varC | varD |
+-----+------+------+------+------+
| 1 | 30 | 1 | -9 | 1 |
+-----+------+------+------+------+
I can then pivot this df back to the original format leading to this:
+-------+----------+-------+
| id | variable | value |
+-------+----------+-------+
| 1 | varA | 30 |
| 1 | varB | 1 |
| 1 | varC | -9 |
| 1 | varD | 1 |
+-------+----------+-------+
This works but seems like it could, with millions of rows, lead to expensive and unnecessary operations. It feels like it should be doable without the need to pivot and unpivot the data. Do I need to do this?
I have read about Window functions and it sounds as if they may be another way to achieve the same result, but to be honest I am struggling to get started with them. I can see how they can be used to generate a value, say a sum, for each id, or to find a maximum value, but I have not found a way to even get started on applying complex conditions that lead to a new row.
Any help to get started with this problem would be gratefully received.
You can use a pandas_udf for adding/deleting rows/columns on grouped data, and implement your processing logic in the pandas UDF.
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

row_schema = StructType(
    [StructField("id", IntegerType(), True),
     StructField("variable", StringType(), True),
     StructField("value", IntegerType(), True)]
)

@F.pandas_udf(row_schema, F.PandasUDFType.GROUPED_MAP)
def addRow(pdf):
    val = 1
    if (len(pdf.loc[(pdf['variable'] == 'varA') & (pdf['value'] > 16)]) > 0) & \
       (len(pdf.loc[(pdf['variable'] == 'varC') & (pdf['value'] != -9)]) > 0):
        val = 2
    # append a varD row to the group's pandas DataFrame
    # (note: the id is hard-coded to 1 to match the example)
    return pdf.append(pd.Series([1, 'varD', val], index=['id', 'variable', 'value']),
                      ignore_index=True)

df = spark.createDataFrame([[1, 'varA', 30],
                            [1, 'varB', 1],
                            [1, 'varC', -9]
                            ], schema=['id', 'variable', 'value'])

df.groupBy("id").apply(addRow).show()
which results in:
+---+--------+-----+
| id|variable|value|
+---+--------+-----+
| 1| varA| 30|
| 1| varB| 1|
| 1| varC| -9|
| 1| varD| 1|
+---+--------+-----+

I want to find the max value comparing 100 columns in a data frame

I have a dataframe
syr | P1 | P2
-----------------
1 | 200 | 300
2 | 500 | 700
3 | 900 | 400
I want to create another DataFrame which has the max value between the second and third columns (P1 and P2). The expected output looks like:
syr | P1 | P2 | max
-------------------------
1 | 200 | 300 | 300
2 | 500 | 700 | 700
3 | 900 | 400 | 900
You could define a new UDF that takes the max value between two columns, like:
import org.apache.spark.sql.functions.udf

def maxDef(p1: Int, p2: Int): Int = if (p1 > p2) p1 else p2
val max = udf[Int, Int, Int](maxDef)
And then apply the UDF in a withColumn() to define a new Column, like:
val df1 = df.withColumn("max", max(df.col("P1"), df.col("P2")))
+---+---+---+---+
|syr| P1| P2|max|
+---+---+---+---+
| 1|200|300|300|
| 2|500|700|700|
| 3|900|400|900|
+---+---+---+---+
EDIT: Iterate through columns
First initialize the max column (note that df must be declared as a var for the reassignments below, and lit comes from org.apache.spark.sql.functions):
df = df.withColumn("max", lit(0))
then, for each column you want (filtering on the column names), compare it with the running max column:
// "max" here is the UDF defined above
df.columns.filter(_.startsWith("P")).foreach(c => {
  df = df.withColumn("max", max(df.col("max"), df.col(c)))
})
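As a side note (not part of the original answer): for many columns, Spark's built-in greatest function can avoid both the UDF and the loop. A minimal PySpark sketch, assuming all the P-columns are numeric:

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, 200, 300), (2, 500, 700), (3, 900, 400)],
    ["syr", "P1", "P2"]
)

# greatest accepts any number of columns, so it scales to 100 P-columns.
p_cols = [c for c in df.columns if c.startswith("P")]
df.withColumn("max", F.greatest(*p_cols)).show()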