What is the difference between cube, rollup and groupBy operators? - sql

I can't find any detailed documentation regarding the differences.
I do notice a difference, because when I interchange cube and groupBy calls I get different results. The cube result contains a lot of null values in the columns that I previously grouped by.

These are not intended to work in the same way. groupBy is simply an equivalent of the GROUP BY clause in standard SQL. In other words
table.groupBy($"foo", $"bar")
is equivalent to:
SELECT foo, bar, [agg-expressions] FROM table GROUP BY foo, bar
cube is equivalent to the CUBE extension to GROUP BY. It takes a list of columns and applies aggregate expressions to all possible combinations of the grouping columns. Let's say you have data like this:
val df = Seq(("foo", 1L), ("foo", 2L), ("bar", 2L), ("bar", 2L)).toDF("x", "y")
df.show
// +---+---+
// | x| y|
// +---+---+
// |foo| 1|
// |foo| 2|
// |bar| 2|
// |bar| 2|
// +---+---+
and you compute cube(x, y) with count as an aggregation:
df.cube($"x", $"y").count.show
// +----+----+-----+
// | x| y|count|
// +----+----+-----+
// |null| 1| 1| <- count of records where y = 1
// |null| 2| 3| <- count of records where y = 2
// | foo|null| 2| <- count of records where x = foo
// | bar| 2| 2| <- count of records where x = bar AND y = 2
// | foo| 1| 1| <- count of records where x = foo AND y = 1
// | foo| 2| 1| <- count of records where x = foo AND y = 2
// |null|null| 4| <- total count of records
// | bar|null| 2| <- count of records where x = bar
// +----+----+-----+
A function similar to cube is rollup, which computes hierarchical subtotals from left to right:
df.rollup($"x", $"y").count.show
// +----+----+-----+
// | x| y|count|
// +----+----+-----+
// | foo|null| 2| <- count where x is fixed to foo
// | bar| 2| 2| <- count where x is fixed to bar and y is fixed to 2
// | foo| 1| 1| ...
// | foo| 2| 1| ...
// |null|null| 4| <- count where no column is fixed
// | bar|null| 2| <- count where x is fixed to bar
// +----+----+-----+
Just for comparison, let's see the result of plain groupBy:
df.groupBy($"x", $"y").count.show
// +---+---+-----+
// | x| y|count|
// +---+---+-----+
// |foo| 1| 1| <- this is identical to x = foo AND y = 1 in CUBE or ROLLUP
// |foo| 2| 1| <- this is identical to x = foo AND y = 2 in CUBE or ROLLUP
// |bar| 2| 2| <- this is identical to x = bar AND y = 2 in CUBE or ROLLUP
// +---+---+-----+
To summarize:
When using plain GROUP BY every row is included only once in its corresponding summary.
With GROUP BY CUBE(..) every row is included in the summary of each combination of levels it represents, wildcards included. Logically, the result shown above is equivalent to something like this (assuming we could use NULL placeholders):
SELECT NULL, NULL, COUNT(*) FROM table
UNION ALL
SELECT x, NULL, COUNT(*) FROM table GROUP BY x
UNION ALL
SELECT NULL, y, COUNT(*) FROM table GROUP BY y
UNION ALL
SELECT x, y, COUNT(*) FROM table GROUP BY x, y
GROUP BY ROLLUP(...) is similar to CUBE but works hierarchically, filling columns from left to right:
SELECT NULL, NULL, COUNT(*) FROM table
UNION ALL
SELECT x, NULL, COUNT(*) FROM table GROUP BY x
UNION ALL
SELECT x, y, COUNT(*) FROM table GROUP BY x, y
ROLLUP and CUBE come from data warehousing extensions, so if you want to get a better understanding of how they work you can also check the documentation of your favorite RDBMS. For example, PostgreSQL introduced both in 9.5, and they are relatively well documented.

There's one more member in the "family" which can explain it all: GROUPING SETS. We don't have it in the PySpark/Scala DataFrame API, but it exists in the SQL API.
GROUPING SETS is used to build whatever combination of groupings is required. The others (cube, rollup, groupBy) return predefined combinations:
cube("id", "x", "y") will return (), (id), (x), (y), (id, x), (id, y), (x, y), (id, x, y).
(All the possible combinations.)
rollup("id", "x", "y") will only return (), (id), (id, x), (id, x, y).
(Combinations which include the beginning of the provided sequence.)
groupBy("id", "x", "y") will only return (id, x, y) combination.
Examples
Input df:
df = spark.createDataFrame(
    [("a", "foo", 1),
     ("a", "foo", 2),
     ("a", "bar", 2),
     ("a", "bar", 2)],
    ["id", "x", "y"])
df.createOrReplaceTempView("df")
cube
df.cube("id", "x", "y").count()
is the same as...
spark.sql("""
    SELECT id, x, y, count(1) count
    FROM df
    GROUP BY
    GROUPING SETS (
        (),
        (id),
        (x),
        (y),
        (id, x),
        (id, y),
        (x, y),
        (id, x, y)
    )
""")
+----+----+----+-----+
| id| x| y|count|
+----+----+----+-----+
|null|null| 2| 3|
|null|null|null| 4|
| a|null| 2| 3|
| a| foo|null| 2|
| a| foo| 1| 1|
| a|null| 1| 1|
|null| foo|null| 2|
| a|null|null| 4|
|null|null| 1| 1|
|null| foo| 2| 1|
|null| foo| 1| 1|
| a| foo| 2| 1|
|null| bar|null| 2|
|null| bar| 2| 2|
| a| bar|null| 2|
| a| bar| 2| 2|
+----+----+----+-----+
rollup
df.rollup("id", "x", "y").count()
is the same as... GROUPING SETS ((), (id), (id, x), (id, x, y))
spark.sql("""
    SELECT id, x, y, count(1) count
    FROM df
    GROUP BY
    GROUPING SETS (
        (),
        (id),
        --(x),       <- (not used)
        --(y),       <- (not used)
        (id, x),
        --(id, y),   <- (not used)
        --(x, y),    <- (not used)
        (id, x, y)
    )
""")
+----+----+----+-----+
| id| x| y|count|
+----+----+----+-----+
|null|null|null| 4|
| a| foo|null| 2|
| a| foo| 1| 1|
| a|null|null| 4|
| a| foo| 2| 1|
| a| bar|null| 2|
| a| bar| 2| 2|
+----+----+----+-----+
groupBy
df.groupBy("id", "x", "y").count()
is the same as... GROUPING SETS ((id, x, y))
spark.sql("""
    SELECT id, x, y, count(1) count
    FROM df
    GROUP BY
    GROUPING SETS (
        --(),        <- (not used)
        --(id),      <- (not used)
        --(x),       <- (not used)
        --(y),       <- (not used)
        --(id, x),   <- (not used)
        --(id, y),   <- (not used)
        --(x, y),    <- (not used)
        (id, x, y)
    )
""")
+---+---+---+-----+
| id| x| y|count|
+---+---+---+-----+
| a|foo| 2| 1|
| a|foo| 1| 1|
| a|bar| 2| 2|
+---+---+---+-----+
Note: all of the above return only combinations that exist in the data. In the example dataframe there is no row with "id":"a", "x":"bar", "y":1, so even cube does not return it. In order to get all the possible combinations (existing or not) we have to do something like the following (crossJoin):
df_cartesian = spark.range(1).toDF('_tmp')
for c in (cols := ["id", "x", "y"]):
    df_cartesian = df_cartesian.crossJoin(df.select(c).distinct())
df_final = (df_cartesian.drop("_tmp")
            .join(df.cube(*cols).count(), cols, 'full')
           )
df_final.show()
# +----+----+----+-----+
# | id| x| y|count|
# +----+----+----+-----+
# |null|null|null| 4|
# |null|null| 1| 1|
# |null|null| 2| 3|
# |null| bar|null| 2|
# |null| bar| 2| 2|
# |null| foo|null| 2|
# |null| foo| 1| 1|
# |null| foo| 2| 1|
# | a|null|null| 4|
# | a|null| 1| 1|
# | a|null| 2| 3|
# | a| bar|null| 2|
# | a| bar| 1| null|
# | a| bar| 2| 2|
# | a| foo|null| 2|
# | a| foo| 1| 1|
# | a| foo| 2| 1|
# +----+----+----+-----+

1. If you do not want nulls, first remove them from the original dataframe, for example:
val dfWithoutNull = df.na.drop("all", Seq("colName1", "colName2"))
The expression above drops the rows of the original dataframe where all of the listed columns are null.
2. groupBy you already know, I guess.
3. rollup and cube are GROUPING SETS operators. Rollup is a multidimensional aggregation that treats the elements hierarchically, while cube does the same thing across all dimensions rather than hierarchically.
You can try grouping_id to understand the level of grouping.
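For instance, a small sketch (assuming a SparkSession named spark, with the x/y sample data from the first answer recreated here in PySpark) that uses F.grouping_id() to tag each cube row with its grouping level, so the placeholder nulls can be told apart from real values:
from pyspark.sql import functions as F

df = spark.createDataFrame(
    [("foo", 1), ("foo", 2), ("bar", 2), ("bar", 2)], ["x", "y"])

(df.cube("x", "y")
   .agg(F.count(F.lit(1)).alias("count"), F.grouping_id().alias("gid"))
   .orderBy("gid")
   .show())
# gid is a bit mask over the grouping columns:
# 0 = grouped by both x and y, 1 = y is a subtotal placeholder,
# 2 = x is a subtotal placeholder, 3 = the grand total row.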

Related

pyspark get latest non-null element of every column in one row

Let me explain my question using an example:
I have a dataframe:
pd_1 = pd.DataFrame({'day': [1, 2, 3, 2, 1, 3],
                     'code': [10, 10, 20, 20, 30, 30],
                     'A': [44, 55, 66, 77, 88, 99],
                     'B': ['a', None, 'c', None, 'd', None],
                     'C': [None, None, '12', None, None, None]
                     })
df_1 = spark.createDataFrame(pd_1)
df_1.show()
Output:
+---+----+---+----+----+
|day|code| A| B| C|
+---+----+---+----+----+
| 1| 10| 44| a|null|
| 2| 10| 55|null|null|
| 3| 20| 66| c| 12|
| 2| 20| 77|null|null|
| 1| 30| 88| d|null|
| 3| 30| 99|null|null|
+---+----+---+----+----+
What I want to achieve is a new dataframe, each row corresponds to a code, and for each column I want to have the most recent non-null value (with highest day).
In pandas, I can simply do
pd_2 = pd_1.sort_values('day', ascending=True).groupby('code').last()
pd_2.reset_index()
to get
code day A B C
0 10 2 55 a None
1 20 3 66 c 12
2 30 3 99 d None
My question is, how can I do it in pyspark (preferably version < 3)?
What I have tried so far is:
from pyspark.sql import Window
import pyspark.sql.functions as F
w = Window.partitionBy('code').orderBy(F.desc('day')).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
## Update: after applying @Steven's idea to remove the for loop:
df_1 = df_1.select([F.collect_list(x).over(w).getItem(0).alias(x) for x in df_1.columns])
## for x in df_1.columns:
##     df_1 = df_1.withColumn(x, F.collect_list(x).over(w).getItem(0))
df_1 = df_1.distinct()
df_1.show()
Output
+---+----+---+---+----+
|day|code| A| B| C|
+---+----+---+---+----+
| 2| 10| 55| a|null|
| 3| 30| 99| d|null|
| 3| 20| 66| c| 12|
+---+----+---+---+----+
Which I'm not very happy with, especially due to the for loop.
I think your current solution is quite nice. If you want another solution, you can try using the first/last window functions:
from pyspark.sql import functions as F, Window
w = Window.partitionBy("code").orderBy(F.col("day").desc())
df2 = (
    df.select(
        "day",
        "code",
        F.row_number().over(w).alias("rwnb"),
        *(
            F.first(F.col(col), ignorenulls=True)
            .over(w.rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing))
            .alias(col)
            for col in ("A", "B", "C")
        ),
    )
    .where("rwnb = 1")
    .drop("rwnb")
)
and the result:
df2.show()
+---+----+---+---+----+
|day|code| A| B| C|
+---+----+---+---+----+
| 2| 10| 55| a|null|
| 3| 30| 99| d|null|
| 3| 20| 66| c| 12|
+---+----+---+---+----+
Here's another way of doing it, using array functions and struct ordering instead of a Window:
from pyspark.sql import functions as F
other_cols = ["day", "A", "B", "C"]
df_1 = df_1.groupBy("code").agg(
    F.collect_list(F.struct(*other_cols)).alias("values")
).selectExpr(
    "code",
    *[f"array_max(filter(values, x -> x.{c} is not null))['{c}'] as {c}" for c in other_cols]
)
df_1.show()
#+----+---+---+---+----+
#|code|day| A| B| C|
#+----+---+---+---+----+
#| 10| 2| 55| a|null|
#| 30| 3| 99| d|null|
#| 20| 3| 66| c| 12|
#+----+---+---+---+----+

how to create & sort by an ordered categorical variable in pyspark

I'm migrating some code from pandas to pyspark. My source dataframe looks like this:
a b c
0 1 insert 1
1 2 update 1
2 3 seed 1
3 4 insert 2
4 5 update 2
5 6 delete 2
6 7 snapshot 1
and the operation (in python / pandas) that I'm applying is:
df.b = pd.Categorical(df.b, ordered=True, categories=['insert', 'seed', 'update', 'snapshot', 'delete'])
df.sort_values(['c', 'b'])
resulting in the output dataframe:
a b c
0 1 insert 1
2 3 seed 1
1 2 update 1
6 7 snapshot 1
3 4 insert 2
4 5 update 2
5 6 delete 2
I'm unsure how best to set up ordered categoricals using pyspark, and my initial approach creates a new column using case-when and attempts to use that subsequently:
df = df.withColumn(
    "_precedence",
    when(col("b") == "insert", 1)
    .when(col("b") == "seed", 2)
    .when(col("b") == "update", 3)
    .when(col("b") == "snapshot", 4)
    .when(col("b") == "delete", 5)
)
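For completeness, a minimal sketch of how that _precedence column could then be used (order by it and drop it afterwards), assuming the when/col imports used above:
# sort by c, then by the categorical order encoded in _precedence,
# and drop the helper column once the ordering is done
df_sorted = df.orderBy("c", "_precedence").drop("_precedence")
df_sorted.show()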
You can use a map:
from pyspark.sql.functions import create_map, lit, col
categories=['insert', 'seed', 'update', 'snapshot', 'delete']
# per @HaleemurAli, adjusted the list comprehension below to create the map
map1 = create_map([val for (i, c) in enumerate(categories) for val in (c, lit(i))])
#Column<b'map(insert, 0, seed, 1, update, 2, snapshot, 3, delete, 4)'>
df.orderBy('c', map1[col('b')]).show()
+---+---+--------+---+
| id| a| b| c|
+---+---+--------+---+
| 0| 1| insert| 1|
| 2| 3| seed| 1|
| 1| 2| update| 1|
| 6| 7|snapshot| 1|
| 3| 4| insert| 2|
| 4| 5| update| 2|
| 5| 6| delete| 2|
+---+---+--------+---+
to reverse the order on column-b: df.orderBy('c', map1[col('b')].desc()).show()
You could also do this using coalesce with your when statements.
from pyspark.sql import functions as F
categories=['insert', 'seed', 'update', 'snapshot', 'delete']
cols = [F.when(F.col("b") == x, F.lit(y)) for x, y in zip(categories, range(1, len(categories) + 1))]
df.orderBy("c",F.coalesce(*cols)).show()
#+---+--------+---+
#| a| b| c|
#+---+--------+---+
#| 1| insert| 1|
#| 3| seed| 1|
#| 2| update| 1|
#| 7|snapshot| 1|
#| 4| insert| 2|
#| 5| update| 2|
#| 6| delete| 2|
#+---+--------+---+
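Yet another option (a sketch, assuming Spark 2.4+ for the array_position SQL function and the same df/column names as above) is to look up b's position in the ordered category list directly:
from pyspark.sql import functions as F

# array_position returns the 1-based index of b within the ordered category list
# (and 0 if the value is not found), which serves directly as a sort key
order_col = F.expr(
    "array_position(array('insert', 'seed', 'update', 'snapshot', 'delete'), b)"
)
df.orderBy("c", order_col).show()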

Count types for every time difference from the time of one specific type within a time range with a granularity of one second in pyspark

I have the following time-series data in a DataFrame in pyspark:
(id, timestamp, type)
the id column can be any integer value and many rows with the same id can exist in the table
the timestamp column is a timestamp represented by an integer (for simplification)
the type column is a string variable where each distinct string in the column represents one category. One special category out of all is 'A'
My question is the following:
Is there any way to compute (with SQL or pyspark DataFrame operations):
the counts of every type
for all the time differences from the timestamp corresponding to all the rows of type='A' within a time range (e.g. [-5,+5]), with granularity of 1 second
For example, for the following DataFrame:
ts_df = sc.parallelize([
    (1,'A',100),(2,'A',1000),(3,'A',10000),
    (1,'b',99),(1,'b',99),(1,'b',99),
    (2,'b',999),(2,'b',999),(2,'c',999),(2,'c',999),(1,'d',999),
    (3,'c',9999),(3,'c',9999),(3,'d',9999),
    (1,'b',98),(1,'b',98),
    (2,'b',998),(2,'c',998),
    (3,'c',9998)
]).toDF(["id","type","ts"])
ts_df.show()
+---+----+-----+
| id|type| ts|
+---+----+-----+
| 1| A| 100|
| 2| A| 1000|
| 3| A|10000|
| 1| b| 99|
| 1| b| 99|
| 1| b| 99|
| 2| b| 999|
| 2| b| 999|
| 2| c| 999|
| 2| c| 999|
| 1| d| 999|
| 3| c| 9999|
| 3| c| 9999|
| 3| d| 9999|
| 1| b| 98|
| 1| b| 98|
| 2| b| 998|
| 2| c| 998|
| 3| c| 9998|
+---+----+-----+
for a time difference of -1 second the result should be:
# result for time difference = -1 sec
# b: 5
# c: 4
# d: 2
while for a time difference of -2 seconds the result should be:
# result for time difference = -2 sec
# b: 3
# c: 2
# d: 0
and so on and so forth for any time difference within the time range, with a granularity of 1 second.
I tried many different ways by using mostly groupBy but nothing seems to work.
I am mostly having difficulties on how to express the time difference from each row of type=A even if I have to do it for one specific time difference.
Any suggestions would be greatly appreciated!
EDIT:
If I only have to do it for one specific time difference time_difference, then I could do it in the following way:
time_difference = -1
df_type_A = ts_df.where(F.col("type") == 'A').selectExpr("ts as fts")
res = df_type_A.join(ts_df, on=df_type_A.fts + time_difference == ts_df.ts)\
    .drop("ts", "fts").groupBy(F.col("type")).count()
The returned res DataFrame gives me exactly what I want for one specific time difference. I could then create a loop and solve the problem by repeating the same query over and over again.
However, is there any more efficient way than that?
EDIT2 (solution)
So that's how I did it at the end:
df1 = sc.parallelize([
    (1,'b',99),(1,'b',99),(1,'b',99),
    (2,'b',999),(2,'b',999),(2,'c',999),(2,'c',999),(2,'d',999),
    (3,'c',9999),(3,'c',9999),(3,'d',9999),
    (1,'b',98),(1,'b',98),
    (2,'b',998),(2,'c',998),
    (3,'c',9998)
]).toDF(["id","type","ts"])
df1.show()

df2 = sc.parallelize([
    (1,'A',100),(2,'A',1000),(3,'A',10000),
]).toDF(["id","type","ts"]).selectExpr("id as fid","ts as fts","type as ftype")
df2.show()

df3 = df2.join(df1, on=df1.id==df2.fid).withColumn("td", F.col("ts")-F.col("fts"))
df3.show()

df4 = df3.groupBy([F.col("type"),F.col("td")]).count()
df4.show()
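To restrict the result to a given window, say [-5, +5] seconds, and to see the counts per type and time difference side by side, the aggregated frame can be filtered and pivoted; a sketch assuming the df4 from above:
# keep only time differences inside the [-5, +5] second window
df5 = df4.where(F.col("td").between(-5, 5))

# optional: one row per time difference, one column per type
# (types that never occur at a given td show up as null rather than 0)
df5.groupBy("td").pivot("type").sum("count").orderBy("td").show()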
Will update performance details as soon as I'll have any.
Thanks!
Another way to solve this problem would be to:
1. Divide the existing dataframe into two dataframes: one with A and one without A.
2. Add a new column to the without-A dataframe, which is the sum of "ts" and time_difference.
3. Join both dataframes, group by, and count.
Here is the code:
from pyspark.sql.functions import lit

time_difference = 1

ts_df_A = (
    ts_df
    .filter(ts_df["type"] == "A")
    .drop("id")
    .drop("type")
)

ts_df_td = (
    ts_df
    .withColumn("ts_plus_td", lit(ts_df['ts'] + time_difference))
    .filter(ts_df["type"] != "A")
    .drop("ts")
)

joined_df = ts_df_A.join(ts_df_td, ts_df_A["ts"] == ts_df_td["ts_plus_td"])
agg_df = joined_df.groupBy("type").count()
>>> agg_df.show()
+----+-----+
|type|count|
+----+-----+
| d| 2|
| c| 4|
| b| 5|
+----+-----+
>>>
Let me know if this is what you are looking for.
Thanks,
Hussain Bohra

How to get value_counts for a spark row?

I have a spark dataframe with 3 columns storing 3 different predictions. I want to know the count of each output value so as to pick the value that was obtained the maximum number of times as the final output.
I can do this in pandas easily by calling my lambda function for each row to get value_counts as shown below. I have converted my spark df to a pandas df here, but I need to be able to perform a similar operation on the spark df directly.
r=[Row(run_1=1, run_2=2, run_3=1, name='test run', id=1)]
df1=spark.createDataFrame(r)
df1.show()
df2=df1.toPandas()
r=df2.iloc[0]
val_counts=r[['run_1','run_2','run_3']].value_counts()
print(val_counts)
top_val=val_counts.index[0]
top_val_cnt=val_counts.values[0]
print('Majority output = %s, occured %s out of 3 times'%(top_val,top_val_cnt))
The output tells me that the value 1 occurred the most often, twice in this case:
+---+--------+-----+-----+-----+
| id| name|run_1|run_2|run_3|
+---+--------+-----+-----+-----+
| 1|test run| 1| 2| 1|
+---+--------+-----+-----+-----+
1 2
2 1
Name: 0, dtype: int64
Majority output = 1, occured 2 out of 3 times
I am trying to write a udf function which can take each of the df1 rows and get the top_val and top_val_cnt. Is there a way to achieve this using spark df?
The Python code should be similar; maybe this Scala example will help you:
val df1 = Seq((1, 1, 1, 2), (1, 2, 3, 3), (2, 2, 2, 2)).toDF()
df1.show()
df1.select(array('*)).map(s => {
    val list = s.getList(0)
    (list.toString(), list.toArray.groupBy(i => i).mapValues(_.size).toList.toString())
}).show(false)
output:
+---+---+---+---+
| _1| _2| _3| _4|
+---+---+---+---+
| 1| 1| 1| 2|
| 1| 2| 3| 3|
| 2| 2| 2| 2|
+---+---+---+---+
+------------+-------------------------+
|_1 |_2 |
+------------+-------------------------+
|[1, 1, 1, 2]|List((2,1), (1,3)) |
|[1, 2, 3, 3]|List((2,1), (1,1), (3,2))|
|[2, 2, 2, 2]|List((2,4)) |
+------------+-------------------------+
Let's have a test dataframe similar to yours.
list = [(1,'test run',1,2,1),(2,'test run',3,2,3),(3,'test run',4,4,4)]
df = spark.createDataFrame(list, ['id', 'name','run_1','run_2','run_3'])

newdf = df.rdd.map(lambda x: (x[0], x[1], x[2:])) \
    .map(lambda x: (x[0], x[1], x[2][0], x[2][1], x[2][2], [max(set(x[2]), key=x[2].count)])) \
    .toDF(['id', 'test', 'run_1', 'run_2', 'run_3', 'most_frequent'])
>>> newdf.show()
+---+--------+-----+-----+-----+-------------+
| id| test|run_1|run_2|run_3|most_frequent|
+---+--------+-----+-----+-----+-------------+
| 1|test run| 1| 2| 1| [1]|
| 2|test run| 3| 2| 3| [3]|
| 3|test run| 4| 4| 4| [4]|
+---+--------+-----+-----+-----+-------------+
Or, if you need to handle the case where every item in the list is different, i.e. return a null:
list = [(1,'test run',1,2,1),(2,'test run',3,2,3),(3,'test run',4,4,4),(4,'test run',1,2,3)]
df=spark.createDataFrame(list, ['id', 'name','run_1','run_2','run_3'])
from pyspark.sql.functions import udf

@udf
def most_frequent(*mylist):
    counter = 1
    num = mylist[0]
    for i in mylist:
        curr_frequency = mylist.count(i)
        if curr_frequency > counter:
            counter = curr_frequency
            num = i
            return num
    else:
        return None
The counter is initialized to 1, and a value is returned only if its count is greater than 1.
df.withColumn('most_frequent', most_frequent('run_1', 'run_2', 'run_3')).show()
+---+--------+-----+-----+-----+-------------+
| id| name|run_1|run_2|run_3|most_frequent|
+---+--------+-----+-----+-----+-------------+
| 1|test run| 1| 2| 1| 1|
| 2|test run| 3| 2| 3| 3|
| 3|test run| 4| 4| 4| 4|
| 4|test run| 1| 2| 3| null|
+---+--------+-----+-----+-----+-------------+
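One caveat worth noting: the bare @udf decorator defaults to a StringType return type, so the most_frequent column above actually comes back as a string. To keep it numeric, the return type can be declared explicitly; a small sketch with a slightly condensed body:
from pyspark.sql import functions as F
from pyspark.sql.types import LongType

@F.udf(returnType=LongType())    # explicit numeric return type instead of the default string
def most_frequent(*mylist):
    for i in mylist:
        if mylist.count(i) > 1:  # a value that occurs more than once wins
            return i
    return None                  # all values are distinct

df.withColumn('most_frequent', most_frequent('run_1', 'run_2', 'run_3')).show()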

Spark SQL: Is there a way to distinguish columns with same name?

I have a csv with a header that contains columns with the same name.
I want to process it with Spark using only SQL and be able to refer to these columns unambiguously.
Ex.:
id name age height name
1 Alex 23 1.70
2 Joseph 24 1.89
I want to get only first name column using only Spark SQL
As mentioned in the comments, I think that the least error-prone method would be to have the schema of the input data changed.
Yet, in case you are looking for a quick workaround, you can simply index the duplicated names of the columns.
For instance, let's create a dataframe with three id columns.
val df = spark.range(3)
  .select('id * 2 as "id", 'id * 3 as "x", 'id, 'id * 4 as "y", 'id)
df.show
+---+---+---+---+---+
| id| x| id| y| id|
+---+---+---+---+---+
| 0| 0| 0| 0| 0|
| 2| 3| 1| 4| 1|
| 4| 6| 2| 8| 2|
+---+---+---+---+---+
Then I can use toDF to set new column names. Let's assume that I know that only id is duplicated. If we don't know that, adding the extra logic to figure out which columns are duplicated would not be very difficult.
var i = -1
val names = df.columns.map( n =>
  if (n == "id") {
    i += 1
    s"id_$i"
  } else n )
val new_df = df.toDF(names : _*)
new_df.show
+----+---+----+---+----+
|id_0| x|id_1| y|id_2|
+----+---+----+---+----+
| 0| 0| 0| 0| 0|
| 2| 3| 1| 4| 1|
| 4| 6| 2| 8| 2|
+----+---+----+---+----+
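For the "only SQL" part of the question, here is a rough PySpark sketch of the same renaming idea (the dataframe below is just a stand-in with duplicated id columns, analogous to the Scala example above): after renaming, the first of the duplicated columns can be referenced unambiguously from plain SQL.
from pyspark.sql import functions as F

# a stand-in dataframe with three columns named "id"
df = spark.range(3).select(
    (F.col("id") * 2).alias("id"), (F.col("id") * 3).alias("x"),
    F.col("id"), (F.col("id") * 4).alias("y"), F.col("id"))

i = -1
names = []
for n in df.columns:
    if n == "id":
        i += 1
        names.append(f"id_{i}")     # id, id, id -> id_0, id_1, id_2
    else:
        names.append(n)

new_df = df.toDF(*names)
new_df.createOrReplaceTempView("t")
spark.sql("SELECT id_0 FROM t").show()   # the first duplicated column, now unambiguous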