Totalize count column with grand total - apache-spark-sql

I want to add a row with grand total of previously grouped rows.
I have this code:
df_join = (
    df.join(df1, df.serialnumber == df1.entityid)
    .distinct()
    .groupBy("SW_version").count()
)
df_join.show(truncate=False)
I need to add a grand total row that sums all values in the count column.
For now the result of the code is:
+-----------+-----+
|SW_version |count|
+-----------+-----+
|SG4J000078C|   63|
|SG4J000092C|  670|
|SG4J000094C|43227|
+-----------+-----+

You can use rollup instead of groupBy in this case. rollup produces one additional row with a null group key that holds the aggregation over all rows.
For a df like this:
+-------+
|version|
+-------+
|      A|
|      A|
|      B|
|      B|
|      B|
|      C|
+-------+
df.rollup("version").count().sort("version", ascending=False).show() will return:
+-------+-----+
|version|count|
+-------+-----+
|      C|    1|
|      B|    3|
|      A|    2|
|   null|    6| <-- this is the grand total
+-------+-----+
You can read more about rollup in this post: What is the difference between cube, rollup and groupBy operators?
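Applied to the join from the question (a sketch, reusing the column names from the snippet above), this would look roughly like:
df_join = (
    df.join(df1, df.serialnumber == df1.entityid)
    .distinct()
    .rollup("SW_version")
    .count()
)
# The row where SW_version is null holds the grand total of the count column.
df_join.sort("SW_version", ascending=False).show(truncate=False)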

Related

Pyspark crossJoin with specific condition

The crossJoin of two dataframes of 5 rows each gives a dataframe of 25 rows (5*5).
What I want is a crossJoin that is not "full".
For example:
df1:          df2:
+-----+       +-----+
|index|       |value|
+-----+       +-----+
|    0|       |    A|
|    1|       |    B|
|    2|       |    C|
|    3|       |    D|
|    4|       |    E|
+-----+       +-----+
The result must be a dataframe with fewer than 25 rows, where each index row is paired with a randomly chosen number of value rows.
It would look something like this:
+-----+-----+
|index|value|
+-----+-----+
|    0|    D|
|    0|    A|
|    1|    A|
|    1|    D|
|    1|    B|
|    1|    C|
|    2|    A|
|    2|    E|
|    3|    D|
|    4|    A|
|    4|    B|
|    4|    E|
+-----+-----+
Thank you
You can use sample(withReplacement, fraction, seed=None) to keep only a random subset of the rows produced by the cross join.
Example:
spark.sql("set spark.sql.crossJoin.enabled=true")
df1.join(df2).sample(False, 0.6).show()
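A minimal, self-contained sketch of the same idea (the dataframe contents mirror the question's example; the fraction and seed are arbitrary):
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df1 = spark.createDataFrame([(i,) for i in range(5)], ["index"])
df2 = spark.createDataFrame([(v,) for v in "ABCDE"], ["value"])

# crossJoin builds all 25 pairs; sample keeps roughly 60% of them at random,
# so each index ends up paired with a varying number of values.
df1.crossJoin(df2).sample(withReplacement=False, fraction=0.6, seed=42).show()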

pyspark - strange behavior of count function inside agg

I am using Spark 2.4.0 and am observing strange behavior while using the count function to aggregate.
from pyspark.sql import functions as F
tst=sqlContext.createDataFrame([(1,2),(1,5),(2,None),(2,3),(3,None),(3,None)],schema=['col1','col2'])
tst.show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|   1|   5|
|   2|null|
|   2|   3|
|   3|null|
|   3|null|
+----+----+
tst.groupby('col1').agg(F.count('col2')).show()
+----+-----------+
|col1|count(col2)|
+----+-----------+
|   1|          2|
|   3|          0|
|   2|          1|
+----+-----------+
Here you can see that the null values are not counted. I searched the docs, but nowhere is it mentioned that the count function does not count null values.
More surprising to me is this:
tst.groupby('col1').agg(F.count(F.col('col2').isNull())).show()
+----+---------------------+
|col1|count((col2 IS NULL))|
+----+---------------------+
|   1|                    2|
|   3|                    2|
|   2|                    2|
+----+---------------------+
Here I am totally confused. When I use isNull(), shouldn't it count only the null values? Why is it counting all the values?
Is there anything I am missing?
In both cases the results that you see are the expected ones.
Concerning the first example: checking the Scala source of count, there is a subtle difference between count(*) and count('col2'):
FUNC(*) - Returns the total number of retrieved rows, including rows containing null.
FUNC(expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are all non-null.
This explains why the null entries are not counted.
If you change the code to
tst.groupby('col1').agg(F.count('*')).show()
you get
+----+--------+
|col1|count(1)|
+----+--------+
|   1|       2|
|   3|       2|
|   2|       2|
+----+--------+
About the second part: the expression F.col('col2').isNull() returns a boolean value, and a boolean is never null. Since count only skips null expressions, every row is counted, and therefore you see a 2 for each group.
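If the goal was to count only the null values per group, one option (a sketch reusing the tst dataframe from the question) is to count an expression that is itself null for non-null rows:
from pyspark.sql import functions as F

# when() without otherwise() yields null when the condition is false, and
# count() skips nulls, so only the rows where col2 is null are counted.
tst.groupby('col1').agg(
    F.count(F.when(F.col('col2').isNull(), 1)).alias('null_count')
).show()
# For the data above this gives: col1=1 -> 0, col1=2 -> 1, col1=3 -> 2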

How to filter rows in SQL statement for aggregate function by window function?

I have a table and provide tools for the user to generate new columns based on existing ones.
Table:
+---+
|  a|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
+---+
New column name: b
The new column rule should be something like: max(a) over(WHERE a < 3)
How do I write this correctly?
The result must match this SQL statement: SELECT *, (SELECT max(a) FROM table WHERE a < 3) as b FROM table, which returns:
+---+---+
|  a|  b|
+---+---+
|  0|  2|
|  1|  2|
|  2|  2|
|  3|  2|
|  4|  2|
|  5|  2|
+---+---+
But I can't write a WHERE clause inside over(), and I can't let the user know the table name. How do I solve this problem?
Just use a window function with case:
select a, max(case when a < 3 then a end) over () as b
from t;
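For completeness, a rough PySpark DataFrame equivalent of that statement (assuming the data is available as a DataFrame called df with a column a):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# An empty window spec corresponds to over () in the SQL above.
df.withColumn(
    "b",
    F.max(F.when(F.col("a") < 3, F.col("a"))).over(Window.partitionBy())
).show()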

Spark: sum preceding rows

I'm using Spark to create a DataFrame. I have a column like this one:
+---+
|cid|
+---+
|  0|
|  0|
|  0|
|  1|
|  0|
|  1|
|  0|
+---+
I would like to use it to create a new column where each row has the sum of all the preceding rows plus its own value, so it would end up looking like:
+---+
|sid|
+---+
|  0|
|  0|
|  0|
|  1|
|  1|
|  2|
|  2|
+---+
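A common way to get this is a running sum over a window; a minimal sketch, assuming an explicit ordering column (called id below, since Spark DataFrames have no inherent row order):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Running total: everything from the first row up to and including the current row.
w = Window.orderBy("id").rowsBetween(Window.unboundedPreceding, Window.currentRow)
df.withColumn("sid", F.sum("cid").over(w)).show()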

How to count sequential status values based on the preceding rows in SQL

I am using MSSQL for my application and I need to count consecutive rows that share the same status, restarting the count whenever the status changes.
The table is like this:
+------+------+
|Number|Status|
+------+------+
|     1|     N|
|     2|     G|
|     3|     G|
|     4|     N|
|     5|     N|
|     6|     G|
|     7|     G|
|     8|     N|
+------+------+
The result set I expect is as follows:
+------+------+-----+
|Number|Status|Count|
+------+------+-----+
|     1|     N|    1|
|     2|     G|    1|
|     3|     G|    2|
|     4|     N|    1|
|     5|     N|    2|
|     6|     G|    1|
|     7|     G|    2|
|     8|     G|    3|
+------+------+-----+
I can't use a cursor because of query performance; that is the worst-case option.
You need to identify groups of consecutive "N" and "G" values. I like to approach this with a difference of row numbers: the overall row number minus the per-status row number is constant within each consecutive run, so it can serve as a group key. Then you can use row_number() to enumerate the rows within each group:
select t.number, t.status,
       row_number() over (partition by status, grp order by number) as seqnum
from (select t.*,
             (row_number() over (order by number) -
              row_number() over (partition by status order by number)
             ) as grp
      from table t
     ) t;