I am using Spark 2.4.0.
I am observing strange behavior while using the count function to aggregate:
from pyspark.sql import functions as F
tst=sqlContext.createDataFrame([(1,2),(1,5),(2,None),(2,3),(3,None),(3,None)],schema=['col1','col2'])
tst.show()
+----+----+
|col1|col2|
+----+----+
|   1|   2|
|   1|   5|
|   2|null|
|   2|   3|
|   3|null|
|   3|null|
+----+----+
tst.groupby('col1').agg(F.count('col2')).show()
+----+-----------+
|col1|count(col2)|
+----+-----------+
|   1|          2|
|   3|          0|
|   2|          1|
+----+-----------+
Here you can see that the null values are not counted. I searched the docs, but nowhere is it mentioned that the count function does not count null values.
Even more surprising to me is this:
tst.groupby('col1').agg(F.count(F.col('col2').isNull())).show()
+----+---------------------+
|col1|count((col2 IS NULL))|
+----+---------------------+
|   1|                    2|
|   3|                    2|
|   2|                    2|
+----+---------------------+
Here I am totally confused. When I use isNull(), shouldn't it count only null values? Why is it counting all the values?
Is there anything I am missing?
In both cases the results that you see are the expected ones.
Concerning the first example: checking the Scala source of count, there is a subtle difference between count(*) and count('col2'):
FUNC(*) - Returns the total number of retrieved rows, including rows containing null.
FUNC(expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are all non-null.
This explains why the null entries are not counted.
If you change the code to
tst.groupby('col1').agg(F.count('*')).show()
you get
+----+--------+
|col1|count(1)|
+----+--------+
|   1|       2|
|   3|       2|
|   2|       2|
+----+--------+
About the second part: the expression F.col('col2').isNull() returns a boolean value, which is never null. No matter whether that boolean is true or false, the row is counted, and therefore you see a 2 for every group.
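If the goal is to count only the null values per group, one way (a small sketch, not from the original post) is to wrap the check in when(), because when() without otherwise() yields null for non-matching rows and count() skips nulls:
tst.groupby('col1').agg(
    # when() returns null where col2 is not null, and count() ignores those nulls
    F.count(F.when(F.col('col2').isNull(), 1)).alias('null_count')
).show()
The alias null_count is just an illustrative name.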
I'm writing a function that outputs a dataframe with the difference between two dataframes. Simplified, it looks like this:
differences = df1.join(df2, df1['id'] == df2['id'], how='full') \
    .select(F.coalesce(df1['id'], df2['id']).alias('id'), df1['name'], df2['name']) \
    .where(df1['name'] != df2['name'])
With the following 2 datasets, I expect the 3rd to be the output:
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Bob|
|  3|Carol|
|  4|  Dan|
|  5|  Eve|
+---+-----+
+---+-----+
| id| name|
+---+-----+
|  1|Alice|
|  2|  Ben|
|  4|  Dan|
|  5|  Eve|
|  6| Finn|
+---+-----+
+---+-------+-------+
| id|   name|   name|
+---+-------+-------+
|  2|    Bob|    Ben|
|  3|  Carol|   null|
|  6|   null|   Finn|
+---+-------+-------+
But when I run it in Databricks, the rows containing nulls are omitted from the result dataframe:
+---+-------+-------+
| id|   name|   name|
+---+-------+-------+
|  2|    Bob|    Ben|
+---+-------+-------+
Are they not considered != by the where clause? Is there hidden logic when these frames are created?
From Why [table].[column] != null is not working?:
"NULL in a database is not a value. It means something like "unknown" or "data missing".
You cannot tell if something where you don't have any information about is equal to something else where you also don't have any information about (=, != operators). But you can say whether there is any information available (IS NULL, IS NOT NULL)."
So you will have to add more conditions:
differences = df1.join(df2, df1['id'] == df2['id'], how='full') \
    .select(F.coalesce(df1['id'], df2['id']).alias('id'), df1['name'], df2['name']) \
    .where((df1['name'] != df2['name']) | (df1['name'].isNull() | df2['name'].isNull()))
+---+-----+----+
| id| name|name|
+---+-----+----+
|  2|  Bob| Ben|
|  3|Carol|null|
|  6| null|Finn|
+---+-----+----+
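Alternatively, Spark's null-safe equality operator Column.eqNullSafe (available since Spark 2.3) treats null == null as true, so the extra isNull checks can be folded into a single condition. A sketch under the same column names:
differences = df1.join(df2, df1['id'] == df2['id'], how='full') \
    .select(F.coalesce(df1['id'], df2['id']).alias('id'), df1['name'], df2['name']) \
    .where(~df1['name'].eqNullSafe(df2['name']))
Negating eqNullSafe keeps the rows where the names genuinely differ, including the rows where only one side is null.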
I want to add a row with the grand total of the previously grouped rows.
I have this code:
df_join = (
    df.join(df1, df.serialnumber == df1.entityid)
    .distinct()
    .groupBy("SW_version")
    .count()
)
df_join.show(truncate=False)
I need to add a grand total row, summing all the values in the count column.
For now the result of the code is:
+-----------+-----+
|SW_version |count|
+-----------+-----+
|SG4J000078C|63   |
|SG4J000092C|670  |
|SG4J000094C|43227|
+-----------+-----+
You can use rollup instead of groupBy in this case. Rollup will produce one additional row with a null group, containing the aggregation over all rows.
For df like this:
+-------+
|version|
+-------+
|      A|
|      A|
|      B|
|      B|
|      B|
|      C|
+-------+
df.rollup("version").count().sort("version", ascending=False).show() will return:
+-------+-----+
|version|count|
+-------+-----+
|      C|    1|
|      B|    3|
|      A|    2|
|   null|    6|  <-- this is the grand total
+-------+-----+
You can read more about rollup in this post: What is the difference between cube, rollup and groupBy operators?
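Applied to the SW_version example above, a minimal sketch (reusing the dataframe and column names from that question; the TOTAL label is only an illustrative choice) could look like this:
from pyspark.sql import functions as F

totals = (
    df.join(df1, df.serialnumber == df1.entityid)
    .distinct()
    .rollup("SW_version")
    .count()
    # the rollup's null group holds the grand total; relabel it for readability
    .withColumn("SW_version", F.coalesce(F.col("SW_version"), F.lit("TOTAL")))
)
totals.show(truncate=False)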
The crossJoin of two dataframes of 5 rows each gives a dataframe of 25 rows (5*5).
What I want is to do a crossJoin that is not "full".
For example:
df1:         df2:
+-----+      +-----+
|index|      |value|
+-----+      +-----+
|    0|      |    A|
|    1|      |    B|
|    2|      |    C|
|    3|      |    D|
|    4|      |    E|
+-----+      +-----+
The result must be a dataframe with fewer than 25 rows, where for each value of index the number of value rows it is joined with is chosen randomly.
It would look something like this:
+-----+-----+
|index|value|
+-----+-----+
|    0|    D|
|    0|    A|
|    1|    A|
|    1|    D|
|    1|    B|
|    1|    C|
|    2|    A|
|    2|    E|
|    3|    D|
|    4|    A|
|    4|    B|
|    4|    E|
+-----+-----+
Thank you
You can try sample(withReplacement, fraction, seed=None) to get fewer rows after the cross join.
Example:
spark.sql("set spark.sql.crossJoin.enabled=true")
df.join(df1).sample(False, 0.6).show()
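For the example frames above, a self-contained sketch (assuming an existing SparkSession named spark; df1 and df2 are the names used in the question) could be:
df1 = spark.createDataFrame([(i,) for i in range(5)], ['index'])
df2 = spark.createDataFrame([(c,) for c in 'ABCDE'], ['value'])

# each of the 25 combinations is kept independently with probability 0.6,
# so every index ends up paired with a random subset of the values
df1.crossJoin(df2).sample(False, 0.6).show()
Using crossJoin explicitly also avoids having to enable spark.sql.crossJoin.enabled.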
I have a table and provide tools for the user to generate new columns based on existing ones.
Table:
+---+
|  a|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
+---+
New column name: b
New column rule must be like: max(a) over(WHERE a < 3)
How do I write this correctly?
The result must be like the SQL statement SELECT *, (SELECT max(a) FROM table WHERE a < 3) AS b FROM table, which returns:
+---+---+
|  a|  b|
+---+---+
|  0|  2|
|  1|  2|
|  2|  2|
|  3|  2|
|  4|  2|
|  5|  2|
+---+---+
But I can't write a WHERE statement inside over(), and I can't let the user know the name of the table. How do I solve this problem?
Just use a window function with case:
select a, max(case when a < 3 then a end) over () as b
from t;
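The same idea can be written in PySpark with a conditional aggregate over an empty window; a minimal sketch, assuming the dataframe is called df and the column is a, as above:
from pyspark.sql import functions as F, Window

# an empty window spec corresponds to OVER (): the aggregate spans all rows
w = Window.partitionBy()

df.withColumn('b', F.max(F.when(F.col('a') < 3, F.col('a'))).over(w)).show()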
I'm using Spark to create a DataFrame. I have a column like this one:
+---+
|cid|
+---+
|  0|
|  0|
|  0|
|  1|
|  0|
|  1|
|  0|
+---+
I would like to use it to create a new column where each row has the sum of all the preceding rows plus its own value, so it'd end up looking like:
+---+
|sid|
+---+
|  0|
|  0|
|  0|
|  1|
|  1|
|  2|
|  2|
+---+
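A running total like this needs a well-defined row order; a minimal sketch, assuming the dataframe is called df and an extra column (here called ord, purely hypothetical) fixes the ordering of the rows:
from pyspark.sql import functions as F, Window

# sum of all rows from the start of the frame up to and including the current row
w = Window.orderBy('ord').rowsBetween(Window.unboundedPreceding, Window.currentRow)

df.withColumn('sid', F.sum('cid').over(w)).show()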