I want to add a row with the grand total of the previously grouped rows.
I have this code:
df_join = (
    df.join(df1, df.serialnumber == df1.entityid)
    .distinct()
    .groupBy("SW_version")
    .count()
)
df_join.show(truncate=False)
I need to add a grand total row that sums all the values in the count column.
Currently the result of the code is:
+-----------+-----+
|SW_version |count|
+-----------+-----+
|SG4J000078C|63 |
|SG4J000092C|670 |
|SG4J000094C|43227|
+-----------+-----+
You can use rollup instead of groupBy in this case. rollup produces one additional row where the group key is null and the aggregation is computed over all rows.
For df like this:
+-------+
|version|
+-------+
| A|
| A|
| B|
| B|
| B|
| C|
+-------+
df.rollup("version").count().sort("version", ascending=False).show() will return:
+-------+-----+
|version|count|
+-------+-----+
| C| 1|
| B| 3|
| A| 2|
| null| 6| <-- this is the grand total
+-------+-----+
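Applied to the join from your question, a rough sketch (assuming the same names df, df1, serialnumber, entityid and SW_version as above):

df_join = (
    df.join(df1, df.serialnumber == df1.entityid)
    .distinct()
    .rollup("SW_version")
    .count()
)
# Sorting descending puts the null row (the grand total) at the bottom.
df_join.sort("SW_version", ascending=False).show(truncate=False)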
You can read more about rollup in this post: What is the difference between cube, rollup and groupBy operators?
The crossJoin of two dataframes with 5 rows each gives a dataframe of 25 rows (5*5).
What I want is to do a crossJoin, but one that is not "full".
For example:
df1: df2:
+-----+ +-----+
|index| |value|
+-----+ +-----+
| 0| | A|
| 1| | B|
| 2| | C|
| 3| | D|
| 4| | E|
+-----+ +-----+
The result must be a dataframe with fewer than 25 rows, where for each row of index the number of rows from value it is joined with is chosen randomly.
It would be something like this:
+-----+-----+
|index|value|
+-----+-----+
| 0| D|
| 0| A|
| 1| A|
| 1| D|
| 1| B|
| 1| C|
| 2| A|
| 2| E|
| 3| D|
| 4| A|
| 4| B|
| 4| E|
+-----+-----+
Thank you
You can try sample(withReplacement, fraction, seed=None) to keep fewer rows after the cross join.
Example:
spark.sql("set spark.sql.crossJoin.enabled=true")
df.join(df1).sample(False,0.6).show()
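Applied to the dataframes from the question, a sketch assuming df1 holds the index column and df2 the value column:

spark.sql("set spark.sql.crossJoin.enabled=true")
# fraction controls the expected share of the 25 index/value pairs that are
# kept, so each index ends up paired with a random subset of values; the
# exact count per index varies from run to run.
result = df1.join(df2).sample(withReplacement=False, fraction=0.6)
result.orderBy("index", "value").show()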
I am using Spark 2.4.0.
I am observing strange behavior while using the count function to aggregate.
from pyspark.sql import functions as F
tst=sqlContext.createDataFrame([(1,2),(1,5),(2,None),(2,3),(3,None),(3,None)],schema=['col1','col2'])
tst.show()
+----+----+
|col1|col2|
+----+----+
| 1| 2|
| 1| 5|
| 2|null|
| 2| 3|
| 3|null|
| 3|null|
+----+----+
tst.groupby('col1').agg(F.count('col2')).show()
+----+-----------+
|col1|count(col2)|
+----+-----------+
| 1| 2|
| 3| 0|
| 2| 1|
+----+-----------+
Here you can see that the null values are not counted. I searched the docs, but nowhere is it mentioned that the count function does not count null values.
More surprising to me is this:
tst.groupby('col1').agg(F.count(F.col('col2').isNull())).show()
+----+---------------------+
|col1|count((col2 IS NULL))|
+----+---------------------+
| 1| 2|
| 3| 2|
| 2| 2|
+----+---------------------+
Here I am totally confused. When I use isNull(), shouldn't it count only null values? Why is it counting all the values?
Is there anything I am missing?
In both cases the results that you see are the expected ones.
Concerning the first example: checking the Scala source of count, there is a subtle difference between count(*) and count('col2'):
FUNC(*) - Returns the total number of retrieved rows, including rows containing null.
FUNC(expr[, expr...]) - Returns the number of rows for which the supplied expression(s) are all non-null.
This explains why the null entries are not counted.
If you change the code to
tst.groupby('col1').agg(F.count('*')).show()
you get
+----+--------+
|col1|count(1)|
+----+--------+
| 1| 2|
| 3| 2|
| 2| 2|
+----+--------+
About the second part: the expression F.col('col2').isNull() returns a boolean value for every row. That boolean is itself never null, so no matter what its value is, the row is counted, and therefore you see 2.
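If the intent was to count only the null values per group, one sketch is to wrap the condition in when(), so that non-null rows yield NULL and are skipped by count():

from pyspark.sql import functions as F

# Rows where col2 is not null fall through when() to NULL and are therefore
# ignored by count(); only the null rows are counted.
tst.groupby('col1').agg(
    F.count(F.when(F.col('col2').isNull(), True)).alias('null_count')
).show()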
I have a table and provide tools for the user to generate new columns based on existing ones.
Table:
+---+
| a|
+---+
| 0|
| 1|
| 2|
| 3|
| 4|
| 5|
+---+
New column name: b
The new column rule should be something like: max(a) over(WHERE a < 3)
How do I write this correctly?
The result must match this SQL statement: SELECT *, (SELECT max(a) FROM table WHERE a < 3) AS b FROM table. It returns:
+---+---+
| a| b|
+---+---+
| 0| 2|
| 1| 2|
| 2| 2|
| 3| 2|
| 4| 2|
| 5| 2|
+---+---+
But I can't write a WHERE clause inside over(), and I can't let the user know the name of the table. How do I solve this problem?
Just use a window function with case:
select a, max(case when a < 3 then a end) over () as b
from t;
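If this runs on Spark, the same idea can be written with the DataFrame API; a sketch, assuming the frame is called df and the column is a:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# An empty window spec spans the whole dataframe, mirroring "over ()" in SQL;
# when() keeps only values with a < 3 inside the max.
w = Window.partitionBy()
df = df.withColumn("b", F.max(F.when(F.col("a") < 3, F.col("a"))).over(w))
df.show()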
I'm using spark to create a DataFrame. I have a column like this one:
+---+
|cid|
+---+
| 0|
| 0|
| 0|
| 1|
| 0|
| 1|
| 0|
+---+
I would like to use it to create a new column where each row has the sum of all the preceding rows plus its own value, so it would end up looking like:
+---+
|sid|
+---+
| 0|
| 0|
| 0|
| 1|
| 1|
| 2|
| 2|
+---+
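A common sketch for this kind of running sum is a window from the first row up to the current one; it assumes an explicit ordering column (row_id below is hypothetical), since Spark DataFrames have no inherent row order:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Sum of cid from the first row up to and including the current row,
# ordered by the assumed row_id column.
w = (Window.orderBy("row_id")
     .rowsBetween(Window.unboundedPreceding, Window.currentRow))
df = df.withColumn("sid", F.sum("cid").over(w))
df.select("cid", "sid").show()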
I am using MSSQL for my application and I need to count consecutive rows that have the same status as the preceding line.
Table is like this,
+------+------+
|Number|Status|
+------+------+
|     1|     N|
|     2|     G|
|     3|     G|
|     4|     N|
|     5|     N|
|     6|     G|
|     7|     G|
|     8|     G|
+------+------+
The result set I expect is as follows:
+------+------+-----+
|Number|Status|Count|
+------+------+-----+
|     1|     N|    1|
|     2|     G|    1|
|     3|     G|    2|
|     4|     N|    1|
|     5|     N|    2|
|     6|     G|    1|
|     7|     G|    2|
|     8|     G|    3|
+------+------+-----+
I can't use a cursor because of query performance; it would be the worst option.
You need to identify groups of consecutive "N" and "G" values. I like to approach this with a difference of row numbers. Then you can use row_number() to enumerate the rows:
select t.number, t.status,
       row_number() over (partition by status, grp order by number) as seqnum
from (select t.*,
             (row_number() over (order by number) -
              row_number() over (partition by status order by number)
             ) as grp
      from table t
     ) t;
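The same difference-of-row-numbers idea also carries over to Spark, in case the data ever lives in a DataFrame there; a sketch, assuming columns Number and Status:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# The difference between a global row number and a per-status row number is
# constant within each run of equal Status values, which identifies the groups.
w_all = Window.orderBy("Number")
w_status = Window.partitionBy("Status").orderBy("Number")

grouped = df.withColumn(
    "grp", F.row_number().over(w_all) - F.row_number().over(w_status)
)
result = grouped.withColumn(
    "Count",
    F.row_number().over(Window.partitionBy("Status", "grp").orderBy("Number")),
).drop("grp")
result.orderBy("Number").show()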