Apache Spark/PySpark - How to Calculate Column Values Incrementally? - dataframe

The data frame below is already populated with some data.
(A)
--------------------------------
| id |    date    | some_count |
--------------------------------
|  3 | 2020-03-31 |          5 |
|  2 | 2020-03-24 |          6 |
|  1 | 2020-03-17 |          3 |
--------------------------------
I want to create another data frame based on the above, but with an additional column that contains the change in some_count from the previous week, for every week. (The change for the first record is taken as 0, because it has no previous-week record to compare with.)
(B)
-----------------------------------------------
| id |    date    | some_count | count_change |
-----------------------------------------------
|  3 | 2020-03-31 |          5 |           -1 |
|  2 | 2020-03-24 |          6 |            3 |
|  1 | 2020-03-17 |          3 |            0 |
-----------------------------------------------
How can this calculation be done with Apache Spark (SQL/PySpark)?

Because order is important when calculating 'count_change', I think we can load the data into driver memory, do the calculation, and re-create another dataframe.
The sample code is written in Java, but I believe there is an equivalent way to do it in Python as well.
@Test
public void test() {
    StructType schema = createStructType(Arrays.asList(
            createStructField("id", IntegerType, true),
            createStructField("date", StringType, true),
            createStructField("some_count", IntegerType, true)));
    // assume source data is already sorted by date, descending.
    Dataset<Row> data = ss.createDataFrame(Arrays.asList(
            RowFactory.create(3, "2020-03-31", 5),
            RowFactory.create(2, "2020-03-24", 6),
            RowFactory.create(1, "2020-03-17", 3)), schema);
    // add the column and set 0 as the default value.
    Dataset<Row> dataWithColumn = data.withColumn("count_change", lit(0));
    // collect into driver memory to calculate 'count_change' based on order.
    List<Row> dataWithColumnList = dataWithColumn.collectAsList();
    List<Row> newList = new ArrayList<>();
    // add the first (oldest) row, which keeps count_change = 0.
    newList.add(dataWithColumnList.get(dataWithColumnList.size() - 1));
    for (int i = dataWithColumnList.size() - 2; i >= 0; i--) {
        Row currWeek = dataWithColumnList.get(i);
        Row prevWeek = dataWithColumnList.get(i + 1);
        int currCount = currWeek.getInt(currWeek.fieldIndex("some_count"));
        int prevCount = prevWeek.getInt(prevWeek.fieldIndex("some_count"));
        int countChange = currCount - prevCount;
        newList.add(RowFactory.create(currWeek.get(0), currWeek.get(1), currWeek.get(2), countChange));
    }
    Dataset<Row> result = ss.createDataFrame(newList, dataWithColumn.schema()).sort(col("date").desc());
    result.show();
}
This is the result of show():
+---+----------+----------+------------+
| id| date|some_count|count_change|
+---+----------+----------+------------+
| 3|2020-03-31| 5| -1|
| 2|2020-03-24| 6| 3|
| 1|2020-03-17| 3| 0|
+---+----------+----------+------------+
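The collect-to-driver approach above works for a small weekly table, but it does not scale. As a minimal PySpark sketch (not part of the answer above, and assuming the same three-column data), a window with lag computes the same count_change without pulling rows into the driver:
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(3, "2020-03-31", 5), (2, "2020-03-24", 6), (1, "2020-03-17", 3)],
    ["id", "date", "some_count"],
)

# Ordering by the ISO date string is chronological here. With no partitionBy,
# Spark warns that all rows move to a single partition, which is fine for a
# small table but worth keeping in mind for very large data.
w = Window.orderBy("date")

result = df.withColumn(
    "count_change",
    # lag() is null on the first row, so the difference falls back to 0.
    F.coalesce(F.col("some_count") - F.lag("some_count").over(w), F.lit(0)),
).orderBy(F.col("date").desc())

result.show()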

Related

Pyspark get rows with max value for a column over a window

I have a dataframe as follows:
| created | id | date |value|
| 1650983874871 | x | 2020-05-08 | 5 |
| 1650367659030 | x | 2020-05-08 | 3 |
| 1639429213087 | x | 2020-05-08 | 2 |
| 1650983874871 | x | 2020-06-08 | 5 |
| 1650367659030 | x | 2020-06-08 | 3 |
| 1639429213087 | x | 2020-06-08 | 2 |
I want to get the max of created for every date.
The result should look like:
| created | id | date |value|
| 1650983874871 | x | 2020-05-08 | 5 |
| 1650983874871 | x | 2020-06-08 | 5 |
I tried:
df2 = (
    df
    .groupby(['id', 'date'])
    .agg(
        F.max(F.col('created')).alias('created_max')
    )
)
df3 = df.join(df2, on=['id', 'date'], how='left')
But this is not working as expected.
Can anyone help me?
You need to make two changes.
The join condition needs to include created as well. Here I have changed alias to alias("created") to make the join easier. This will ensure a unique join condition (if there are no duplicate created values).
The join type must be inner.
df2 = (
    df
    .groupby(['id', 'date'])
    .agg(
        F.max(F.col('created')).alias('created')
    )
)
df3 = df.join(df2, on=['id', 'date', 'created'], how='inner')
df3.show()
+---+----------+-------------+-----+
| id| date| created|value|
+---+----------+-------------+-----+
| x|2020-05-08|1650983874871| 5|
| x|2020-06-08|1650983874871| 5|
+---+----------+-------------+-----+
Instead of using the group by and join, you can also use a Window in pyspark.sql:
from pyspark.sql import functions as func
from pyspark.sql.window import Window

df = df\
    .withColumn('max_created', func.max('created').over(Window.partitionBy('date', 'id')))\
    .filter(func.col('created') == func.col('max_created'))\
    .drop('max_created')
Steps:
Get the max value over the Window.
Filter the rows by matching the created timestamp against the max.
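One caveat: if several rows share the same maximum created for a given id and date, both the join approach and the filter approach keep all of the tied rows. A row_number-based variant (a sketch against the same df, not part of either answer) would keep exactly one row per group:
from pyspark.sql import functions as func
from pyspark.sql.window import Window

# Rank rows within each (date, id) group by created, newest first,
# and keep only the top-ranked row per group.
w = Window.partitionBy('date', 'id').orderBy(func.col('created').desc())

df_latest = (
    df.withColumn('rn', func.row_number().over(w))
      .filter(func.col('rn') == 1)
      .drop('rn')
)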

Using PySpark window functions with conditions to add rows

I need to be able to add new rows to a PySpark df with values based upon the contents of other rows with a common id. There will eventually be millions of ids with lots of rows for each id. I have tried the method below, which works but seems overly complicated.
I start with a df in the format below (but in reality have more columns):
+-------+----------+-------+
| id | variable | value |
+-------+----------+-------+
| 1 | varA | 30 |
| 1 | varB | 1 |
| 1 | varC | -9 |
+-------+----------+-------+
Currently I am pivoting this df to get it in the following format:
+-----+------+------+------+
| id | varA | varB | varC |
+-----+------+------+------+
| 1 | 30 | 1 | -9 |
+-----+------+------+------+
On this df I can then use the standard withColumn and when functionality to add new columns based on the values in other columns. For example:
df = df.withColumn("varD", when((col("varA") > 16) & (col("varC") != -9)), 2).otherwise(1)
Which leads to:
+-----+------+------+------+------+
| id | varA | varB | varC | varD |
+-----+------+------+------+------+
| 1 | 30 | 1 | -9 | 1 |
+-----+------+------+------+------+
I can then unpivot this df back to the original format, leading to this:
+-------+----------+-------+
| id | variable | value |
+-------+----------+-------+
| 1 | varA | 30 |
| 1 | varB | 1 |
| 1 | varC | -9 |
| 1 | varD | 1 |
+-------+----------+-------+
This works but seems like it could, with millions of rows, lead to expensive and unnecessary operations. It feels like it should be doable without the need to pivot and unpivot the data. Do I need to do this?
I have read about Window functions and it sounds as if they may be another way to achieve the same result, but to be honest I am struggling to get started with them. I can see how they can be used to generate a value, say a sum, for each id, or to find a maximum value, but I have not found a way to even get started on applying complex conditions that lead to a new row.
Any help to get started with this problem would be gratefully received.
You can use a pandas_udf for adding/deleting rows/columns on grouped data, and implement your processing logic in the pandas UDF.
import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

row_schema = StructType(
    [StructField("id", IntegerType(), True),
     StructField("variable", StringType(), True),
     StructField("value", IntegerType(), True)]
)

@F.pandas_udf(row_schema, F.PandasUDFType.GROUPED_MAP)
def addRow(pdf):
    # varD is 2 when this id has varA > 16 and varC != -9, otherwise 1.
    val = 1
    if (len(pdf.loc[(pdf['variable'] == 'varA') & (pdf['value'] > 16)]) > 0) & \
       (len(pdf.loc[(pdf['variable'] == 'varC') & (pdf['value'] != -9)]) > 0):
        val = 2
    return pdf.append(pd.Series([1, 'varD', val], index=['id', 'variable', 'value']), ignore_index=True)

df = spark.createDataFrame([[1, 'varA', 30],
                            [1, 'varB', 1],
                            [1, 'varC', -9]], schema=['id', 'variable', 'value'])

df.groupBy("id").apply(addRow).show()
which results in:
+---+--------+-----+
| id|variable|value|
+---+--------+-----+
| 1| varA| 30|
| 1| varB| 1|
| 1| varC| -9|
| 1| varD| 1|
+---+--------+-----+
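For what it's worth, here is a sketch of an alternative that avoids both the pivot and the pandas UDF: derive the varD value per id with conditional aggregation, then union the new rows back onto the long-format df. This is only an illustration of the idea under the question's assumptions, not the answer above:
import pyspark.sql.functions as F

# Pull varA and varC out of the long format for each id (null if missing).
vard_rows = (
    df.groupBy("id")
      .agg(
          F.max(F.when(F.col("variable") == "varA", F.col("value"))).alias("varA"),
          F.max(F.when(F.col("variable") == "varC", F.col("value"))).alias("varC"),
      )
      .select(
          "id",
          F.lit("varD").alias("variable"),
          F.when((F.col("varA") > 16) & (F.col("varC") != -9), 2)
           .otherwise(1)
           .alias("value"),
      )
)

# Append the derived varD rows to the original id/variable/value frame.
result = df.unionByName(vard_rows)
result.show()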

How to get distinct value, count of a column in dataframe and store in another dataframe as (k,v) pair using Spark2 and Scala

I want to get the distinct values and their respective counts of every column of a dataframe and store them as (k,v) in another dataframe.
Note: My columns are not static, they keep changing. So, I cannot hardcode the column names; instead I should loop through them.
For Example, below is my dataframe
+----------------+-----------+------------+
|name |country |DOB |
+----------------+-----------+------------+
| Blaze | IND| 19950312|
| Scarlet | USA| 19950313|
| Jonas | CAD| 19950312|
| Blaze | USA| 19950312|
| Jonas | CAD| 19950312|
| mark | USA| 19950313|
| mark | CAD| 19950313|
| Smith | USA| 19950313|
| mark | UK | 19950313|
| scarlet | CAD| 19950313|
My final result should be a new dataframe of (k,v) pairs, where k is the distinct value and v is its count.
+----------------+-----------+------------+
|name |country |DOB |
+----------------+-----------+------------+
| (Blaze,2) | (IND,1) |(19950312,3)|
| (Scarlet,2) | (USA,4) |(19950313,6)|
| (Jonas,3) | (CAD,4) | |
| (mark,3) | (UK,1) | |
| (smith,1) | | |
Can anyone please help me with this? I'm using Spark 2.4.0 and Scala 2.11.12.
Note: My columns are dynamic, so I can't hardcode the columns and do a groupBy on them.
I don't have an exact solution to your query, but I can provide some help to get you started on your issue.
Create dataframe
scala> val df = Seq(("Blaze ","IND","19950312"),
| ("Scarlet","USA","19950313"),
| ("Jonas ","CAD","19950312"),
| ("Blaze ","USA","19950312"),
| ("Jonas ","CAD","19950312"),
| ("mark ","USA","19950313"),
| ("mark ","CAD","19950313"),
| ("Smith ","USA","19950313"),
| ("mark ","UK ","19950313"),
| ("scarlet","CAD","19950313")).toDF("name", "country","dob")
Next, calculate the count of distinct elements for each column
scala> val distCount = df.columns.map(c => df.groupBy(c).count)
Create a range to iterate over distCount
scala> val range = Range(0,distCount.size)
range: scala.collection.immutable.Range = Range(0, 1, 2)
Aggregate your data
scala> val aggVal = range.toList.map(i => distCount(i).collect().mkString).toSeq
aggVal: scala.collection.immutable.Seq[String] = List([Jonas ,2][Smith ,1][Scarlet,1][scarlet,1][mark ,3][Blaze ,2], [CAD,4][USA,4][IND,1][UK ,1], [19950313,6][19950312,4])
Create data frame:
scala> Seq((aggVal(0),aggVal(1),aggVal(2))).toDF("name", "country","dob").show()
+--------------------+--------------------+--------------------+
| name| country| dob|
+--------------------+--------------------+--------------------+
|[Jonas ,2][Smith...|[CAD,4][USA,4][IN...|[19950313,6][1995...|
+--------------------+--------------------+--------------------+
I hope this helps you in some way.

How to select Multiple Rows based on one Column

So I have looked around the internet, and couldn't find anything that could be related to my issue.
This is part of my DB:
ID | English | Pun | SID | Writer |
=======================================================
1 | stuff | stuff | 1 | Full |
2 | stuff | stuff | 1 | Rec. |
3 | stuff | stuff | 2 | Full |
4 | stuff | stuff | 2 | Rec. |
Now how would I get all rows with SID equal to 1?
Like this:
ID | English | Pun | SID | Writer |
=======================================================
1 | stuff | stuff | 1 | Full |
2 | stuff | stuff | 1 | Rec. |
Or, when I want to get all rows with SID equal to 2:
ID | English | Pun | SID | Writer |
=======================================================
3 | stuff | stuff | 2 | Full |
4 | stuff | stuff | 2 | Rec. |
This is my current SQL Query using SQLite:
SELECT * FROM table_name WHERE SID = 1
But I only get the first row. How would I be able to get all of the rows?
Here is my PHP Code:
class GurDB extends SQLite3
{
    function __construct()
    {
        $this->open('gurbani.db3');
    }
}

$db = new GurDB();
$mode = $_GET["mode"];

if ($mode == "2") {
    $shabadnum = $_GET["shabadNo"];
    $result = $db->query("SELECT * FROM table_name WHERE SID = $shabadnum");
    $array = $result->fetchArray(SQLITE3_ASSOC);
    print_r($array);
}
fetchArray() only gives you one row at a time; you want to loop over the result, something like this:
$rows = [];
while ($row = $result->fetchArray(SQLITE3_ASSOC)) {
    $rows[] = $row;
}

SparkSQL: conditional sum on range of dates

I have a dataframe like this:
| id | prodId | date | value |
| 1 | a | 2015-01-01 | 100 |
| 2 | a | 2015-01-02 | 150 |
| 3 | a | 2015-01-03 | 120 |
| 4 | b | 2015-01-01 | 100 |
and I would love to do a groupBy on prodId and aggregate 'value', summing it over ranges of dates. In other words, I need to build a table with the following columns:
prodId
val_1: sum value if date is between date1 and date2
val_2: sum value if date is between date2 and date3
val_3: same as before
etc.
| prodId | val_1 | val_2 |
| | (01-01 to 01-02) | (01-03 to 01-04) |
| a | 250 | 120 |
| b | 100 | 0 |
Is there any predefined aggregate function in Spark that allows doing conditional sums? Do you recommend developing an aggregation UDF (if so, any suggestions)?
Thanks a lot!
First, let's recreate the example dataset:
import org.apache.spark.sql.functions.to_date
val df = sc.parallelize(Seq(
(1, "a", "2015-01-01", 100), (2, "a", "2015-01-02", 150),
(3, "a", "2015-01-03", 120), (4, "b", "2015-01-01", 100)
)).toDF("id", "prodId", "date", "value").withColumn("date", to_date($"date"))
val dates = List(("2015-01-01", "2015-01-02"), ("2015-01-03", "2015-01-04"))
All you have to do is something like this:
import org.apache.spark.sql.functions.{when, lit, sum}
val exprs = dates.map {
  case (x, y) => {
    // Create a label for the column name
    val alias = s"${x}_${y}".replace("-", "_")
    // Convert strings to dates
    val xd = to_date(lit(x))
    val yd = to_date(lit(y))
    // Generate an expression equivalent to
    //   SUM(
    //     CASE
    //       WHEN date BETWEEN ... AND ... THEN value
    //       ELSE 0
    //     END
    //   ) AS ...
    // for each pair of dates.
    sum(when($"date".between(xd, yd), $"value").otherwise(0)).alias(alias)
  }
}
df.groupBy($"prodId").agg(exprs.head, exprs.tail: _*).show
// +------+---------------------+---------------------+
// |prodId|2015_01_01_2015_01_02|2015_01_03_2015_01_04|
// +------+---------------------+---------------------+
// | a| 250| 120|
// | b| 100| 0|
// +------+---------------------+---------------------+
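Since the rest of this page leans on PySpark, here is a rough PySpark sketch of the same conditional-sum pattern, assuming a df with the same id, prodId, date, and value columns (an adaptation of the idea above, not the original answer):
from pyspark.sql import functions as F

dates = [("2015-01-01", "2015-01-02"), ("2015-01-03", "2015-01-04")]

# One SUM(CASE WHEN date BETWEEN x AND y THEN value ELSE 0 END) column per range.
exprs = [
    F.sum(
        F.when(
            F.col("date").between(F.to_date(F.lit(x)), F.to_date(F.lit(y))),
            F.col("value"),
        ).otherwise(0)
    ).alias(f"{x}_{y}".replace("-", "_"))
    for x, y in dates
]

df.groupBy("prodId").agg(*exprs).show()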